From: Jani Nikula
To: David Bremner, David Bremner, notmuch@notmuchmail.org
Subject: Re: [Patch v2] lib: regexp matching in 'subject' and 'from'
In-Reply-To: <20161114214651.19770-1-david@tethera.net>
References: <1467034387-16885-1-git-send-email-david@tethera.net> <20161114214651.19770-1-david@tethera.net>
Date: Wed, 18 Jan 2017 22:05:18 +0200
Message-ID: <877f5scewx.fsf@nikula.org>

On Mon, 14 Nov 2016, David Bremner wrote:
> the idea is that you can run
>
> % notmuch search re:subject:
> % notmuch search re:from:'
>
> or
>
> % notmuch search subject:"your usual phrase search"
> % notmuch search from:"usual phrase search"
>
> This should also work with bindings, since it extends the query parser.
>
> This is trivial to extend for other value slots, but currently the only
> value slots are date, message_id, from, subject, and last_mod. Date is
> already searchable, and message_id is not obviously useful to regex
> match.
>
> This was originally written by Austin Clements, and ported to Xapian
> field processors (from Austin's custom query parser) by yours truly.

I can't say I've done a detailed review of all the Xapian bits and
pieces here, but I didn't spot anything obviously wrong either.

I suppose I'd prefer the documentation to be more explicit about
"re:subject:" and "re:from:" instead of having the generic "re::" that
I think is bound to confuse people.

The _ suffixes instead of prefixes in variables seemed a bit odd, but I
have no strong opinions on it.

I played around with this a bit, and it seemed to work. Unsurprisingly,
getting the quoting right was the hardest part. Even though I know how
the stuff works under the hood, it took me a while to realize that you
have to use 're:"subject:"' to make it work. (I kept trying
're:subject:""'.) I don't know if there's anything we could really do
about this.

BR,
Jani.

> ---
>
> rebase of id:1467034387-16885-1-git-send-email-david@tethera.net against master
>
>  doc/man7/notmuch-search-terms.rst |  17 +++++-
>  lib/Makefile.local                |   1 +
>  lib/database-private.h            |   1 +
>  lib/database.cc                   |   5 ++
>  lib/regexp-fields.cc              | 125 ++++++++++++++++++++++++++++++++++++++
>  lib/regexp-fields.h               |  77 +++++++++++++++++++++++
>  test/T630-regexp-query.sh         |  91 +++++++++++++++++++++++++++
>  7 files changed, 316 insertions(+), 1 deletion(-)
>  create mode 100644 lib/regexp-fields.cc
>  create mode 100644 lib/regexp-fields.h
>  create mode 100755 test/T630-regexp-query.sh
>
> diff --git a/doc/man7/notmuch-search-terms.rst b/doc/man7/notmuch-search-terms.rst
> index de93d73..4c7afc2 100644
> --- a/doc/man7/notmuch-search-terms.rst
> +++ b/doc/man7/notmuch-search-terms.rst
> @@ -60,6 +60,8 @@ indicate user-supplied values):
>
>  - property:=
>
> +- re:{subject,from}:
> +
>  The **from:** prefix is used to match the name or address of the sender
>  of an email message.
>
> @@ -146,6 +148,12 @@ The **property:** prefix searches for messages with a particular
>  (and extensions) to add metadata to messages. A given key can be
>  present on a given message with several different values.
>
> +The **re::** prefix can be used to restrict the results to
> +those whose matches the given regular expression (see
> +**regex(7)**). Regular expression searches are only available if
> +notmuch is built with **Xapian Field Processors** (see below), and
> +currently only for the Subject and From fields.
> +
>  Operators
>  ---------
>
> @@ -220,13 +228,19 @@ Boolean and Probabilistic Prefixes
>  ----------------------------------
>
>  Xapian (and hence notmuch) prefixes are either **boolean**, supporting
> -exact matches like "tag:inbox" or **probabilistic**, supporting a more flexible **term** based searching. The prefixes currently supported by notmuch are as follows.
> +exact matches like "tag:inbox" or **probabilistic**, supporting a more
> +flexible **term** based searching. Certain **special** prefixes are
> +processed by notmuch in a way not stricly fitting either of Xapian's
> +built in styles. The prefixes currently supported by notmuch are as
> +follows.
>
>
>  Boolean
>     **tag:**, **id:**, **thread:**, **folder:**, **path:**, **property:**
>  Probabilistic
>     **from:**, **to:**, **subject:**, **attachment:**, **mimetype:**
> +Special
> +   **query:**, **re:**
>
>  Terms and phrases
>  -----------------
> @@ -396,6 +410,7 @@ Currently the following features require field processor support:
>
>  - non-range date queries, e.g. "date:today"
>  - named queries e.g. "query:my_special_query"
> +- regular expression searches, e.g. "re:subject:^\\[SPAM\\]"
>
>  SEE ALSO
>  ========
> diff --git a/lib/Makefile.local b/lib/Makefile.local
> index 3d1030a..ccd32ab 100644
> --- a/lib/Makefile.local
> +++ b/lib/Makefile.local
> @@ -53,6 +53,7 @@ libnotmuch_cxx_srcs = \
>      $(dir)/query.cc \
>      $(dir)/query-fp.cc \
>      $(dir)/config.cc \
> +    $(dir)/regexp-fields.cc \
>      $(dir)/thread.cc
>
>  libnotmuch_modules := $(libnotmuch_c_srcs:.c=.o) $(libnotmuch_cxx_srcs:.cc=.o)
> diff --git a/lib/database-private.h b/lib/database-private.h
> index ca71a92..900a989 100644
> --- a/lib/database-private.h
> +++ b/lib/database-private.h
> @@ -186,6 +186,7 @@ struct _notmuch_database {
>  #if HAVE_XAPIAN_FIELD_PROCESSOR
>      Xapian::FieldProcessor *date_field_processor;
>      Xapian::FieldProcessor *query_field_processor;
> +    Xapian::FieldProcessor *re_field_processor;
>  #endif
>      Xapian::ValueRangeProcessor *last_mod_range_processor;
>  };
> diff --git a/lib/database.cc b/lib/database.cc
> index 2d19f20..851a62d 100644
> --- a/lib/database.cc
> +++ b/lib/database.cc
> @@ -21,6 +21,7 @@
>  #include "database-private.h"
>  #include "parse-time-vrp.h"
>  #include "query-fp.h"
> +#include "regexp-fields.h"
>  #include "string-util.h"
>
>  #include
> @@ -1042,6 +1043,8 @@ notmuch_database_open_verbose (const char *path,
>      notmuch->query_parser->add_boolean_prefix("date", notmuch->date_field_processor);
>      notmuch->query_field_processor = new QueryFieldProcessor (*notmuch->query_parser, notmuch);
>      notmuch->query_parser->add_boolean_prefix("query", notmuch->query_field_processor);
> +    notmuch->re_field_processor = new RegexpFieldProcessor (*notmuch->query_parser, notmuch);
> +    notmuch->query_parser->add_boolean_prefix("re", notmuch->re_field_processor);
>  #endif
>      notmuch->last_mod_range_processor = new Xapian::NumberValueRangeProcessor (NOTMUCH_VALUE_LAST_MOD, "lastmod:");
>
> @@ -1138,6 +1141,8 @@ notmuch_database_close (notmuch_database_t *notmuch)
>      notmuch->date_field_processor = NULL;
>      delete notmuch->query_field_processor;
>      notmuch->query_field_processor = NULL;
> +    delete notmuch->re_field_processor;
> +    notmuch->re_field_processor = NULL;
>  #endif
>
>      return status;
> diff --git a/lib/regexp-fields.cc b/lib/regexp-fields.cc
> new file mode 100644
> index 0000000..4d3d972
> --- /dev/null
> +++ b/lib/regexp-fields.cc
> @@ -0,0 +1,125 @@
> +/* regexp-fields.cc - "re:" field processor glue
> + *
> + * This file is part of notmuch.
> + *
> + * Copyright © 2015 Austin Clements
> + * Copyright © 2016 David Bremner
> + *
> + * This program is free software: you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation, either version 3 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program. If not, see https://www.gnu.org/licenses/ .
> + *
> + * Author: Austin Clements
> + *         David Bremner
> + */
> +
> +#include "regexp-fields.h"
> +#include "notmuch-private.h"
> +
> +#if HAVE_XAPIAN_FIELD_PROCESSOR
> +RegexpPostingSource::RegexpPostingSource (Xapian::valueno slot, const std::string &regexp)
> +    : slot_ (slot)
> +{
> +    int err = regcomp (&regexp_, regexp.c_str (), REG_EXTENDED | REG_NOSUB);
> +
> +    if (err != 0) {
> +        size_t len = regerror (err, &regexp_, NULL, 0);
> +        char *buffer = new char[len];
> +        std::string msg;
> +        (void) regerror (err, &regexp_, buffer, len);
> +        msg.assign (buffer, len);
> +        delete buffer;
> +
> +        throw Xapian::QueryParserError (msg);
> +    }
> +}
> +
> +RegexpPostingSource::~RegexpPostingSource ()
> +{
> +    regfree (&regexp_);
> +}
> +
> +void
> +RegexpPostingSource::init (const Xapian::Database &db)
> +{
> +    db_ = db;
> +    it_ = db_.valuestream_begin (slot_);
> +    end_ = db.valuestream_end (slot_);
> +    started_ = false;
> +}
> +
> +Xapian::doccount
> +RegexpPostingSource::get_termfreq_min () const
> +{
> +    return 0;
> +}
> +
> +Xapian::doccount
> +RegexpPostingSource::get_termfreq_est () const
> +{
> +    return get_termfreq_max () / 2;
> +}
> +
> +Xapian::doccount
> +RegexpPostingSource::get_termfreq_max () const
> +{
> +    return db_.get_value_freq (slot_);
> +}
> +
> +Xapian::docid
> +RegexpPostingSource::get_docid () const
> +{
> +    return it_.get_docid ();
> +}
> +
> +bool
> +RegexpPostingSource::at_end () const
> +{
> +    return it_ == end_;
> +}
> +
> +void
> +RegexpPostingSource::next (unused (double min_wt))
> +{
> +    if (started_ && ! at_end ())
> +        ++it_;
> +    started_ = true;
> +
> +    for (; ! at_end (); ++it_) {
> +        std::string value = *it_;
> +        if (regexec (&regexp_, value.c_str (), 0, NULL, 0) == 0)
> +            break;
> +    }
> +}
> +
> +static Xapian::valueno
> +_find_slot (std::string prefix)
> +{
> +    if (prefix == "from")
> +        return NOTMUCH_VALUE_FROM;
> +    else if (prefix == "subject")
> +        return NOTMUCH_VALUE_SUBJECT;
> +    else
> +        throw Xapian::QueryParserError ("unsupported regexp field '" + prefix + "'");
> +}
> +
> +Xapian::Query
> +RegexpFieldProcessor::operator() (const std::string & str)
> +{
> +    size_t pos = str.find_first_of (':');
> +    std::string prefix = str.substr (0, pos);
> +    std::string regexp = str.substr (pos + 1);
> +
> +    postings = new RegexpPostingSource (_find_slot (prefix), regexp);
> +    return Xapian::Query (postings);
> +}
> +#endif
> diff --git a/lib/regexp-fields.h b/lib/regexp-fields.h
> new file mode 100644
> index 0000000..2c9c2d7
> --- /dev/null
> +++ b/lib/regexp-fields.h
> @@ -0,0 +1,77 @@
> +/* regex-fields.h - xapian glue for semi-bruteforce regexp search
> + *
> + * This file is part of notmuch.
> + *
> + * Copyright © 2015 Austin Clements
> + * Copyright © 2016 David Bremner
> + *
> + * This program is free software: you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation, either version 3 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program. If not, see https://www.gnu.org/licenses/ .
> + *
> + * Author: Austin Clements
> + *         David Bremner
> + */
> +
> +#ifndef NOTMUCH_REGEXP_FIELDS_H
> +#define NOTMUCH_REGEXP_FIELDS_H
> +#if HAVE_XAPIAN_FIELD_PROCESSOR
> +#include
> +#include
> +#include
> +#include "notmuch-private.h"
> +
> +/* A posting source that returns documents where a value matches a
> + * regexp.
> + */
> +class RegexpPostingSource : public Xapian::PostingSource
> +{
> + protected:
> +    const Xapian::valueno slot_;
> +    regex_t regexp_;
> +    Xapian::Database db_;
> +    bool started_;
> +    Xapian::ValueIterator it_, end_;
> +
> +/* No copying */
> +    RegexpPostingSource (const RegexpPostingSource &);
> +    RegexpPostingSource &operator= (const RegexpPostingSource &);
> +
> + public:
> +    RegexpPostingSource (Xapian::valueno slot, const std::string &regexp);
> +    ~RegexpPostingSource ();
> +    void init (const Xapian::Database &db);
> +    Xapian::doccount get_termfreq_min () const;
> +    Xapian::doccount get_termfreq_est () const;
> +    Xapian::doccount get_termfreq_max () const;
> +    Xapian::docid get_docid () const;
> +    bool at_end () const;
> +    void next (unused (double min_wt));
> +};
> +
> +
> +class RegexpFieldProcessor : public Xapian::FieldProcessor {
> + protected:
> +    Xapian::QueryParser &parser;
> +    notmuch_database_t *notmuch;
> +    RegexpPostingSource *postings = NULL;
> +
> + public:
> +    RegexpFieldProcessor (Xapian::QueryParser &parser_, notmuch_database_t *notmuch_)
> +        : parser(parser_), notmuch(notmuch_) { };
> +
> +    ~RegexpFieldProcessor () { delete postings; };
> +
> +    Xapian::Query operator()(const std::string & str);
> +};
> +#endif
> +#endif /* NOTMUCH_REGEXP_FIELDS_H */
> diff --git a/test/T630-regexp-query.sh b/test/T630-regexp-query.sh
> new file mode 100755
> index 0000000..3bbe47c
> --- /dev/null
> +++ b/test/T630-regexp-query.sh
> @@ -0,0 +1,91 @@
> +#!/usr/bin/env bash
> +test_description='regular expression searches'
> +. ./test-lib.sh || exit 1
> +
> +add_email_corpus
> +
> +
> +if [ $NOTMUCH_HAVE_XAPIAN_FIELD_PROCESSOR -eq 1 ]; then
> +
> +    notmuch search --output=messages from:cworth > cworth.msg-ids
> +
> +    test_begin_subtest "regexp from search, case sensitive"
> +    notmuch search --output=messages re:from:carl > OUTPUT
> +    test_expect_equal_file /dev/null OUTPUT
> +
> +    test_begin_subtest "empty regexp or query"
> +    notmuch search --output=messages re:from:carl or from:cworth > OUTPUT
> +    test_expect_equal_file cworth.msg-ids OUTPUT
> +
> +    test_begin_subtest "non-empty regexp and query"
> +    notmuch search re:from:cworth and subject:patch > OUTPUT
> +    cat <<EOF > EXPECTED
> +thread:0000000000000008 2009-11-18 [1/2] Carl Worth| Alex Botero-Lowry; [notmuch] [PATCH] Error out if no query is supplied to search instead of going into an infinite loop (attachment inbox unread)
> +thread:0000000000000007 2009-11-18 [1/2] Carl Worth| Ingmar Vanhassel; [notmuch] [PATCH] Typsos (inbox unread)
> +thread:0000000000000018 2009-11-18 [1/2] Carl Worth| Jan Janak; [notmuch] [PATCH] Older versions of install do not support -C. (inbox unread)
> +thread:0000000000000017 2009-11-18 [1/2] Carl Worth| Keith Packard; [notmuch] [PATCH] Make notmuch-show 'X' (and 'x') commands remove inbox (and unread) tags (inbox unread)
> +thread:0000000000000014 2009-11-18 [2/5] Carl Worth| Mikhail Gusarov, Keith Packard; [notmuch] [PATCH 1/2] Close message file after parsing message headers (inbox unread)
> +thread:0000000000000001 2009-11-18 [1/1] Stewart Smith; [notmuch] [PATCH] Fix linking with gcc to use g++ to link in C++ libs. (inbox unread)
> +EOF
> +    test_expect_equal_file EXPECTED OUTPUT
> +
> +    test_begin_subtest "regexp from search, duplicate term search"
> +    notmuch search --output=messages re:from:cworth > OUTPUT
> +    test_expect_equal_file cworth.msg-ids OUTPUT
> +
> +    test_begin_subtest "long enough regexp matches only desired senders"
> +    notmuch search --output=messages 're:"from:C.* Wo"' > OUTPUT
> +    test_expect_equal_file cworth.msg-ids OUTPUT
> +
> +    test_begin_subtest "shorter regexp matches one more sender"
> +    notmuch search --output=messages 're:"from:C.* W"' > OUTPUT
> +    (echo id:1258544095-16616-1-git-send-email-chris@chris-wilson.co.uk ; cat cworth.msg-ids) > EXPECTED
> +    test_expect_equal_file EXPECTED OUTPUT
> +
> +    test_begin_subtest "regexp subject search, non-ASCII"
> +    notmuch search --output=messages re:subject:accentué > OUTPUT
> +    echo id:877h1wv7mg.fsf@inf-8657.int-evry.fr > EXPECTED
> +    test_expect_equal_file EXPECTED OUTPUT
> +
> +    test_begin_subtest "regexp subject search, punctuation"
> +    notmuch search re:subject:\'X\' > OUTPUT
> +    cat <<EOF > EXPECTED
> +thread:0000000000000017 2009-11-18 [2/2] Keith Packard, Carl Worth; [notmuch] [PATCH] Make notmuch-show 'X' (and 'x') commands remove inbox (and unread) tags (inbox unread)
> +EOF
> +    test_expect_equal_file EXPECTED OUTPUT
> +
> +    test_begin_subtest "regexp subject search, no punctuation"
> +    notmuch search re:subject:X > OUTPUT
> +    cat <<EOF > EXPECTED
> +thread:0000000000000017 2009-11-18 [2/2] Keith Packard, Carl Worth; [notmuch] [PATCH] Make notmuch-show 'X' (and 'x') commands remove inbox (and unread) tags (inbox unread)
> +thread:000000000000000f 2009-11-18 [4/4] Jjgod Jiang, Alexander Botero-Lowry; [notmuch] Mac OS X/Darwin compatibility issues (inbox unread)
> +EOF
> +    test_expect_equal_file EXPECTED OUTPUT
> +
> +    test_begin_subtest "combine regexp from and subject"
> +    notmuch search re:subject:-C and re:from:.an.k > OUTPUT
> +    cat <<EOF > EXPECTED
> +thread:0000000000000018 2009-11-17 [1/2] Jan Janak| Carl Worth; [notmuch] [PATCH] Older versions of install do not support -C. (inbox unread)
> +EOF
> +    test_expect_equal_file EXPECTED OUTPUT
> +
> +    test_begin_subtest "bad subprefix"
> +    notmuch search 're:unsupported:.*' 1>OUTPUT 2>&1
> +    cat <<EOF > EXPECTED
> +notmuch search: A Xapian exception occurred
> +A Xapian exception occurred performing query: unsupported regexp field 'unsupported'
> +Query string was: re:unsupported:.*
> +EOF
> +    test_expect_equal_file EXPECTED OUTPUT
> +
> +    test_begin_subtest "regexp error reporting"
> +    notmuch search 're:from:unbalanced[' 1>OUTPUT 2>&1
> +    cat <<EOF > EXPECTED
> +notmuch search: A Xapian exception occurred
> +A Xapian exception occurred performing query: Invalid regular expression
> +Query string was: re:from:unbalanced[
> +EOF
> +    test_expect_equal_file EXPECTED OUTPUT
> +fi
> +
> +test_done
> -- 
> 2.10.2
>
> _______________________________________________
> notmuch mailing list
> notmuch@notmuchmail.org
> https://notmuchmail.org/mailman/listinfo/notmuch
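
A side note on the error path in the RegexpPostingSource constructor
quoted above: the size returned by regerror() already counts the
terminating NUL, so assigning all of it puts a NUL inside the message,
and memory obtained with new char[len] has to be released with delete[]
rather than delete. Below is a minimal sketch of one way to build the
message instead. It uses only POSIX <regex.h> and the standard library;
the helper names regerror_to_string and try_compile are hypothetical
and not part of the patch.

```cpp
#include <regex.h>

#include <string>
#include <vector>

// Hypothetical helper (not in the patch): turn a regcomp() error code
// into a std::string without manual new[]/delete[].
static std::string
regerror_to_string (int err, const regex_t *preg)
{
    // With a zero-sized buffer, regerror() returns the size needed,
    // including the terminating NUL byte.
    size_t len = regerror (err, preg, NULL, 0);
    std::vector<char> buffer (len);

    (void) regerror (err, preg, buffer.data (), buffer.size ());

    // Drop the terminating NUL so it does not end up inside the string.
    return std::string (buffer.data (), len > 0 ? len - 1 : 0);
}

// Hypothetical driver: compile `pattern` with the same flags as the
// patch. Returns an empty string on success, or the regerror() text on
// failure.
static std::string
try_compile (const std::string &pattern)
{
    regex_t re;
    int err = regcomp (&re, pattern.c_str (), REG_EXTENDED | REG_NOSUB);

    if (err != 0)
        return regerror_to_string (err, &re);

    // Only a successfully compiled regex_t is released here.
    regfree (&re);
    return std::string ();
}
```

With something along these lines, the error branch of the constructor
could shrink to a single throw of Xapian::QueryParserError with the
returned string. The exact message text depends on the C library, so
try_compile ("unbalanced[") is only guaranteed to return something
non-empty.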