unofficial mirror of notmuch@notmuchmail.org
 help / color / mirror / code / Atom feed
From: Austin Clements <amdragon@MIT.EDU>
To: notmuch@notmuchmail.org
Cc: amdragon@mit.edu
Subject: [RFC PATCH 0/2] Custom query parser
Date: Thu, 23 Dec 2010 01:00:22 -0500	[thread overview]
Message-ID: <1293084024-7893-1-git-send-email-amdragon@mit.edu> (raw)

Following is a patch series that implements a custom query parser.
This code is by no means ready for serious review or inclusion in the
tree, but I wanted to spark some discussion and get feedback while
it's still molten.

The query parser is basically compatible with Xapian's, but is
designed to be flexible enough to support (at least) date searches,
saved queries, folder searches (without a schema change), "from:me",
and implicit filters (like defaulting to "-tag:spam").  Such features
are implemented as transformations over an intermediate
representation, keeping the parser itself as simple as possible.  The
code already uses transformers for the usual prefixes and the
type:mail filter, and I have most of the implementation of folder
searches in another branch.

If you'd like to play with it, it's on the qparser branch at
http://awakening.csail.mit.edu/git/notmuch.git.  It's not complete,
but it's close (notably, stemming is missing).

The grammar is approximately compatible with Xapian's query parser.
To my knowledge, it differs in the following ways:

1) The way this parser separates terms is much simpler and, I believe,
   more predictable.  In Xapian, many things can separate terms, and
   exactly what is context-dependent.  For example, "x/y" is parsed as
   two terms in a phrase, "x#y" is parsed as two separate terms "x"
   and "y", and "x# y" is parsed as two separate terms "x#" and "y".
   This parser instead splits the query into "term phrases", which are
   separated only by whitespace, quote, left paren, or right paren.
   These term phrases are then broken into terms using the same
   Xapian::TermGenerator that tokenizes documents.  If this results in
   multiple terms, they're tied into a phrase in the final query.
   Thus, "x/y" and "x# y" are handled as in Xapian, but "x#y" is
   parsed as a phrase.

2) This parser handles errors differently.  Because queries are
   roughly natural language, I feel they should never fail to parse
   (has Google ever told you that your query contains a syntax
   error?), so this parser tries to do reasonable things in all cases.
   Xapian's parser has a strange approach.  For syntax errors
   involving boolean operators (for example "x AND"), it returns a
   parse error.  For other syntax errors (for example, "x) OR y"),
   Xapian will *retry* the parse from scratch with no parser flags set
   (no operators, parens, or quotes).

3) The handling of NEAR and ADJ is quite different and possibly wrong.
   I didn't realize how subtle these operators were until I'd already
   implemented something completely different from Xapian's logic.

The code is much simpler than Xapian's query parser because it elides
several features that notmuch doesn't use (such as multi-term
synonyms), it reuses Xapian's TermGenerator to parse terms (instead of
copy-pasting and tweaking the code), and because I placed the boundary
between the lexer and the parser differently.  This parser is under
1000 lines, while Xapain's is 2000 lines plus a 5000 line parser
generator.

             reply	other threads:[~2010-12-23  6:00 UTC|newest]

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-12-23  6:00 Austin Clements [this message]
2010-12-23  6:00 ` [PATCH 1/2] Implement a custom query parser with a mostly Xapian-compatible grammar Austin Clements
2010-12-23  6:00 ` [PATCH 2/2] Replace Xapian query parser with custom query parser Austin Clements

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://notmuchmail.org/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1293084024-7893-1-git-send-email-amdragon@mit.edu \
    --to=amdragon@mit.edu \
    --cc=notmuch@notmuchmail.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://yhetil.org/notmuch.git/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).