unofficial mirror of notmuch@notmuchmail.org
 help / color / mirror / code / Atom feed
From: Austin Clements <amdragon@MIT.EDU>
To: David Bremner <david@tethera.net>
Cc: notmuch@notmuchmail.org
Subject: Re: [PATCH 2/5] util: Function to parse boolean term queries
Date: Tue, 25 Dec 2012 20:23:00 -0500	[thread overview]
Message-ID: <20121226012300.GW6187@mit.edu> (raw)
In-Reply-To: <87obhidxkt.fsf@zancas.localnet>

Quoth David Bremner on Dec 25 at 10:18 am:
> Austin Clements <amdragon@MIT.EDU> writes:
> 
> > +    if (consume_double_quote (&pos)) {
> > +	char *out = talloc_strdup (ctx, pos);
> > +	pos = *term_out = out;
> > +	while (1) {
> 
> Overall the control flow here is a bit tricky to follow. I'm not sure if
> a real loop condition would help or make it worse.
> 
> > +	    if (! *pos) {
> > +		/* Premature end of string */
> > +		goto FAIL;
> > +	    } else if (*pos == '"') {
> > +		if (*++pos != '"')
> > +		    break;
> > +	    } else if (consume_double_quote (&pos)) {
> > +		break;
> > +	    }
> 
> I'm confused by the asymmetry here. Quoted strings can start with
> unicode quotes, but internally can only have ascii '"'? Is this
> documented somewhere?

Only in the source, to my knowledge.  Here's how Xapian parses these
things (where 'it' is a UTF8 string iterator):

if (it != end && is_double_quote(*it)) {
    // Quoted boolean term (can contain any character).
    ++it;
    while (it != end) {
	if (*it == '"') {
	    // Interpret "" as an escaped ".
	    if (++it == end || *it != '"')
		break;
	} else if (is_double_quote(*it)) {
	    ++it;
	    break;
	}
	Unicode::append_utf8(name, *it++);
    }
} else {
    // Can't boolean filter prefix a subexpression, so
    // just use anything following the prefix until the
    // next space or ')' as part of the boolean filter
    // term.
    while (it != end && *it > ' ' && *it != ')')
	Unicode::append_utf8(name, *it++);
}

> > +    } else {
> > +	while (*pos > ' ' && *pos != ')')
> > +	    ++pos;
> > +	if (*pos)
> > +	    goto FAIL;
> > +    }
> 
> So if there is no quote, we skip the part after the ':'? I guess I
> probably missed something because that doesn't sound like the intended
> behaviour.

This isn't skipping it; it's checking its well-formedness.  In this
case, *term_out already points to a correct string that can be used
literally; we just have to check that there's no trailing garbage
after the boolean query.

This is certainly worth commenting.

For the record, I also tried passing the query straight to the
library, without parsing it in the CLI (and simply checking that the
query returned exactly one result), and it was noticeably slower (the
restore performance test took between 3.82 and 5.25 seconds for the
code in this series and ~7.2 seconds using a general query.)

  parent reply	other threads:[~2012-12-26  1:23 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-12-25  5:57 [PATCH 0/5] Use Xapian query syntax for batch-tag dump/restore Austin Clements
2012-12-25  5:57 ` [PATCH 1/5] util: Factor out boolean term quoting routine Austin Clements
2012-12-25 12:25   ` David Bremner
2012-12-25  5:57 ` [PATCH 2/5] util: Function to parse boolean term queries Austin Clements
2012-12-25 14:18   ` David Bremner
2012-12-25 14:34     ` David Bremner
2012-12-26  1:23     ` Austin Clements [this message]
2012-12-25  5:57 ` [PATCH 3/5] dump: Disallow \n in message IDs Austin Clements
2012-12-25 14:21   ` David Bremner
2012-12-25  5:57 ` [PATCH 4/5] dump/restore: Use Xapian queries for batch-tag format Austin Clements
2012-12-25  5:57 ` [PATCH 5/5] man: Update notmuch-dump(1) for new " Austin Clements
2012-12-25 14:47   ` David Bremner
2012-12-25 15:18     ` David Bremner
2012-12-25 18:05   ` David Bremner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://notmuchmail.org/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20121226012300.GW6187@mit.edu \
    --to=amdragon@mit.edu \
    --cc=david@tethera.net \
    --cc=notmuch@notmuchmail.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://yhetil.org/notmuch.git/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).