unofficial mirror of notmuch@notmuchmail.org
* [PATCH] lib: regexp matching in 'subject' and 'from'
@ 2016-06-27 13:33 David Bremner
  2016-11-14 21:46 ` [Patch v2] " David Bremner
  0 siblings, 1 reply; 18+ messages in thread
From: David Bremner @ 2016-06-27 13:33 UTC (permalink / raw)
  To: notmuch

The idea is that you can run

% notmuch search re:subject:<your-favourite-regexp>
% notmuch search re:from:<your-favourite-regexp>

or

% notmuch search subject:"your usual phrase search"
% notmuch search from:"usual phrase search"

This should also work with bindings, since it extends the query parser.
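
A minimal stand-alone sketch of the matching described above, for
readers who want to try the semantics without building notmuch. It is
not code from the patch: split_field, match_values and example are
hypothetical names, the sender strings are made-up sample data, and a
plain vector stands in for the Xapian value stream. It assumes only
POSIX <regex.h>.

```cpp
// Stand-alone sketch, not part of the patch; no Xapian required.
#include <regex.h>

#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Split "subject:^foo" into ("subject", "^foo") at the first colon.
// Unlike the patch, this sketch rejects input without a colon up front.
static std::pair<std::string, std::string>
split_field (const std::string &str)
{
    std::size_t pos = str.find (':');
    if (pos == std::string::npos)
        throw std::invalid_argument ("expected <field>:<regexp>");
    return std::make_pair (str.substr (0, pos), str.substr (pos + 1));
}

// Return the indices of the values matched by a POSIX extended regexp.
// Matching is case sensitive and unanchored (regexec defaults).
static std::vector<std::size_t>
match_values (const std::vector<std::string> &values, const std::string &regexp)
{
    regex_t re;
    if (regcomp (&re, regexp.c_str (), REG_EXTENDED | REG_NOSUB) != 0)
        throw std::invalid_argument ("invalid regular expression");

    std::vector<std::size_t> hits;
    for (std::size_t i = 0; i < values.size (); i++) {
        if (regexec (&re, values[i].c_str (), 0, NULL, 0) == 0)
            hits.push_back (i);
    }
    regfree (&re);
    return hits;
}

// Usage example with two made-up sender values.
void
example ()
{
    std::vector<std::string> from;
    from.push_back ("Carl Worth <cworth@cworth.org>");
    from.push_back ("Chris Wilson <chris@chris-wilson.co.uk>");

    std::pair<std::string, std::string> q = split_field ("from:C.* Wo");
    assert (q.first == "from");
    assert (match_values (from, q.second).size () == 1);
    // Case sensitive: lower-case "carl" matches neither value.
    assert (match_values (from, "carl").empty ());
}
```

The sketch scans every value linearly, which is the same
"semi-bruteforce" trade-off the patch makes: no index is consulted for
the regexp itself.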

This is trivial to extend for other value slots, but currently the only
value slots are date, message_id, from, subject, and last_mod. Date is
already searchable, and message_id is not obviously useful to regex
match.

This was originally written by Austin Clements, and ported to Xapian
field processors (from Austin's custom query parser) by yours truly.
---

This is the zero-th non-WIP version. Since the last version [1], I
have added some better error reporting for regexp syntax errors, tests
for two kinds of query syntax error, and some documentation for the
query syntax.
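
For reference, the usual two-call regerror(3) pattern behind this kind
of error reporting can be sketched as follows. This is an illustration
rather than the code in the patch: regexp_error is a hypothetical
helper name, and the exact message text is whatever the platform's C
library produces.

```cpp
// Stand-alone sketch, not part of the patch.
#include <regex.h>

#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Try to compile a POSIX extended regexp. Returns an empty string on
// success, or the C library's error message on failure.
static std::string
regexp_error (const std::string &pattern)
{
    regex_t re;
    int err = regcomp (&re, pattern.c_str (), REG_EXTENDED | REG_NOSUB);
    if (err == 0) {
        regfree (&re);
        return std::string ();
    }

    // First call: ask how large the buffer must be. The size returned
    // includes the terminating NUL byte, so it is never zero.
    std::size_t len = regerror (err, &re, NULL, 0);
    assert (len > 0);

    // Second call: fetch the message into a buffer of that size.
    std::vector<char> buffer (len);
    (void) regerror (err, &re, buffer.data (), buffer.size ());

    // Leave the trailing NUL out of the std::string.
    return std::string (buffer.data (), len - 1);
}
```

Using std::vector for the buffer avoids having to pair new[] with
delete[] by hand, and a caller could pass the returned text straight to
an exception such as Xapian::QueryParserError.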

 doc/man7/notmuch-search-terms.rst |  17 +++++-
 lib/Makefile.local                |   1 +
 lib/database-private.h            |   1 +
 lib/database.cc                   |   5 ++
 lib/regexp-fields.cc              | 125 ++++++++++++++++++++++++++++++++++++++
 lib/regexp-fields.h               |  77 +++++++++++++++++++++++
 test/T630-regexp-query.sh         |  91 +++++++++++++++++++++++++++
 7 files changed, 316 insertions(+), 1 deletion(-)
 create mode 100644 lib/regexp-fields.cc
 create mode 100644 lib/regexp-fields.h
 create mode 100755 test/T630-regexp-query.sh

diff --git a/doc/man7/notmuch-search-terms.rst b/doc/man7/notmuch-search-terms.rst
index 075f88c..6155406 100644
--- a/doc/man7/notmuch-search-terms.rst
+++ b/doc/man7/notmuch-search-terms.rst
@@ -58,6 +58,8 @@ indicate user-supplied values):
 
 -  query:<name>
 
+- re:{subject,from}:<regex>
+
 The **from:** prefix is used to match the name or address of the sender
 of an email message.
 
@@ -139,6 +141,12 @@ queries added with **notmuch-config(1)**. Named queries are only
 available if notmuch is built with **Xapian Field Processors** (see
 below).
 
+The **re:<field>:** prefix can be used to restrict the results to
+those whose <field> matches the given regular expression (see
+**regex(7)**). Regular expression searches are only available if
+notmuch is built with **Xapian Field Processors** (see below), and
+currently only for the Subject and From fields.
+
 Operators
 ---------
 
@@ -213,13 +221,19 @@ Boolean and Probabilistic Prefixes
 ----------------------------------
 
 Xapian (and hence notmuch) prefixes are either **boolean**, supporting
-exact matches like "tag:inbox"  or **probabilistic**, supporting a more flexible **term** based searching. The prefixes currently supported by notmuch are as follows.
+exact matches like "tag:inbox" or **probabilistic**, supporting a more
+flexible **term** based searching. Certain **special** prefixes are
+processed by notmuch in a way not strictly fitting either of Xapian's
+built-in styles. The prefixes currently supported by notmuch are as
+follows.
 
 
 Boolean
    **tag:**, **id:**, **thread:**, **folder:**, **path:**
 Probabilistic
    **from:**, **to:**, **subject:**, **attachment:**, **mimetype:**
+Special
+   **query:**, **re:<field>**
 
 Terms and phrases
 -----------------
@@ -389,6 +403,7 @@ Currently the following features require field processor support:
 
 - non-range date queries, e.g. "date:today"
 - named queries e.g. "query:my_special_query"
+- regular expression searches, e.g. "re:subject:^\\[SPAM\\]"
 
 SEE ALSO
 ========
diff --git a/lib/Makefile.local b/lib/Makefile.local
index beb9635..68771e6 100644
--- a/lib/Makefile.local
+++ b/lib/Makefile.local
@@ -51,6 +51,7 @@ libnotmuch_cxx_srcs =		\
 	$(dir)/query.cc		\
 	$(dir)/query-fp.cc      \
 	$(dir)/config.cc	\
+	$(dir)/regexp-fields.cc     \
 	$(dir)/thread.cc
 
 libnotmuch_modules := $(libnotmuch_c_srcs:.c=.o) $(libnotmuch_cxx_srcs:.cc=.o)
diff --git a/lib/database-private.h b/lib/database-private.h
index ca71a92..900a989 100644
--- a/lib/database-private.h
+++ b/lib/database-private.h
@@ -186,6 +186,7 @@ struct _notmuch_database {
 #if HAVE_XAPIAN_FIELD_PROCESSOR
     Xapian::FieldProcessor *date_field_processor;
     Xapian::FieldProcessor *query_field_processor;
+    Xapian::FieldProcessor *re_field_processor;
 #endif
     Xapian::ValueRangeProcessor *last_mod_range_processor;
 };
diff --git a/lib/database.cc b/lib/database.cc
index afafe88..b52b62d 100644
--- a/lib/database.cc
+++ b/lib/database.cc
@@ -21,6 +21,7 @@
 #include "database-private.h"
 #include "parse-time-vrp.h"
 #include "query-fp.h"
+#include "regexp-fields.h"
 #include "string-util.h"
 
 #include <iostream>
@@ -1016,6 +1017,8 @@ notmuch_database_open_verbose (const char *path,
 	notmuch->query_parser->add_boolean_prefix("date", notmuch->date_field_processor);
 	notmuch->query_field_processor = new QueryFieldProcessor (*notmuch->query_parser, notmuch);
 	notmuch->query_parser->add_boolean_prefix("query", notmuch->query_field_processor);
+	notmuch->re_field_processor = new RegexpFieldProcessor (*notmuch->query_parser, notmuch);
+	notmuch->query_parser->add_boolean_prefix("re", notmuch->re_field_processor);
 #endif
 	notmuch->last_mod_range_processor = new Xapian::NumberValueRangeProcessor (NOTMUCH_VALUE_LAST_MOD, "lastmod:");
 
@@ -1112,6 +1115,8 @@ notmuch_database_close (notmuch_database_t *notmuch)
     notmuch->date_field_processor = NULL;
     delete notmuch->query_field_processor;
     notmuch->query_field_processor = NULL;
+    delete notmuch->re_field_processor;
+    notmuch->re_field_processor = NULL;
 #endif
 
     return status;
diff --git a/lib/regexp-fields.cc b/lib/regexp-fields.cc
new file mode 100644
index 0000000..4d3d972
--- /dev/null
+++ b/lib/regexp-fields.cc
@@ -0,0 +1,125 @@
+/* regexp-fields.cc - "re:" field processor glue
+ *
+ * This file is part of notmuch.
+ *
+ * Copyright © 2015 Austin Clements
+ * Copyright © 2016 David Bremner
+ *
+ * This program is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see https://www.gnu.org/licenses/ .
+ *
+ * Author: Austin Clements <aclements@csail.mit.edu>
+ *                David Bremner <david@tethera.net>
+ */
+
+#include "regexp-fields.h"
+#include "notmuch-private.h"
+
+#if HAVE_XAPIAN_FIELD_PROCESSOR
+RegexpPostingSource::RegexpPostingSource (Xapian::valueno slot, const std::string &regexp)
+    : slot_ (slot)
+{
+    int err = regcomp (&regexp_, regexp.c_str (), REG_EXTENDED | REG_NOSUB);
+
+    if (err != 0) {
+	size_t len = regerror (err, &regexp_, NULL, 0);
+	char *buffer = new char[len];
+	std::string msg;
+	(void) regerror (err, &regexp_, buffer, len);
+	msg.assign (buffer, len - 1); /* len counts the trailing NUL */
+	delete[] buffer;
+
+	throw Xapian::QueryParserError (msg);
+    }
+}
+
+RegexpPostingSource::~RegexpPostingSource ()
+{
+    regfree (&regexp_);
+}
+
+void
+RegexpPostingSource::init (const Xapian::Database &db)
+{
+    db_ = db;
+    it_ = db_.valuestream_begin (slot_);
+    end_ = db.valuestream_end (slot_);
+    started_ = false;
+}
+
+Xapian::doccount
+RegexpPostingSource::get_termfreq_min () const
+{
+    return 0;
+}
+
+Xapian::doccount
+RegexpPostingSource::get_termfreq_est () const
+{
+    return get_termfreq_max () / 2;
+}
+
+Xapian::doccount
+RegexpPostingSource::get_termfreq_max () const
+{
+    return db_.get_value_freq (slot_);
+}
+
+Xapian::docid
+RegexpPostingSource::get_docid () const
+{
+    return it_.get_docid ();
+}
+
+bool
+RegexpPostingSource::at_end () const
+{
+    return it_ == end_;
+}
+
+void
+RegexpPostingSource::next (unused (double min_wt))
+{
+    if (started_ && ! at_end ())
+	++it_;
+    started_ = true;
+
+    for (; ! at_end (); ++it_) {
+	std::string value = *it_;
+	if (regexec (&regexp_, value.c_str (), 0, NULL, 0) == 0)
+	    break;
+    }
+}
+
+static Xapian::valueno
+_find_slot (std::string prefix)
+{
+    if (prefix == "from")
+	return NOTMUCH_VALUE_FROM;
+    else if (prefix == "subject")
+	return NOTMUCH_VALUE_SUBJECT;
+    else
+	throw Xapian::QueryParserError ("unsupported regexp field '" + prefix + "'");
+}
+
+Xapian::Query
+RegexpFieldProcessor::operator() (const std::string & str)
+{
+    size_t pos = str.find_first_of (':');
+    std::string prefix = str.substr (0, pos);
+    std::string regexp = str.substr (pos + 1);
+
+    postings = new RegexpPostingSource (_find_slot (prefix), regexp);
+    return Xapian::Query (postings);
+}
+#endif
diff --git a/lib/regexp-fields.h b/lib/regexp-fields.h
new file mode 100644
index 0000000..2c9c2d7
--- /dev/null
+++ b/lib/regexp-fields.h
@@ -0,0 +1,77 @@
+/* regexp-fields.h - xapian glue for semi-bruteforce regexp search
+ *
+ * This file is part of notmuch.
+ *
+ * Copyright © 2015 Austin Clements
+ * Copyright © 2016 David Bremner
+ *
+ * This program is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see https://www.gnu.org/licenses/ .
+ *
+ * Author: Austin Clements <aclements@csail.mit.edu>
+ *                David Bremner <david@tethera.net>
+ */
+
+#ifndef NOTMUCH_REGEXP_FIELDS_H
+#define NOTMUCH_REGEXP_FIELDS_H
+#if HAVE_XAPIAN_FIELD_PROCESSOR
+#include <sys/types.h>
+#include <regex.h>
+#include <xapian.h>
+#include "notmuch-private.h"
+
+/* A posting source that returns documents where a value matches a
+ * regexp.
+ */
+class RegexpPostingSource : public Xapian::PostingSource
+{
+ protected:
+    const Xapian::valueno slot_;
+    regex_t regexp_;
+    Xapian::Database db_;
+    bool started_;
+    Xapian::ValueIterator it_, end_;
+
+/* No copying */
+    RegexpPostingSource (const RegexpPostingSource &);
+    RegexpPostingSource &operator= (const RegexpPostingSource &);
+
+ public:
+    RegexpPostingSource (Xapian::valueno slot, const std::string &regexp);
+    ~RegexpPostingSource ();
+    void init (const Xapian::Database &db);
+    Xapian::doccount get_termfreq_min () const;
+    Xapian::doccount get_termfreq_est () const;
+    Xapian::doccount get_termfreq_max () const;
+    Xapian::docid get_docid () const;
+    bool at_end () const;
+    void next (unused (double min_wt));
+};
+
+
+class RegexpFieldProcessor : public Xapian::FieldProcessor {
+ protected:
+    Xapian::QueryParser &parser;
+    notmuch_database_t *notmuch;
+    RegexpPostingSource *postings = NULL;
+
+ public:
+    RegexpFieldProcessor (Xapian::QueryParser &parser_, notmuch_database_t *notmuch_)
+	: parser(parser_), notmuch(notmuch_) { };
+
+    ~RegexpFieldProcessor () { delete postings; };
+
+    Xapian::Query operator()(const std::string & str);
+};
+#endif
+#endif /* NOTMUCH_REGEXP_FIELDS_H */
diff --git a/test/T630-regexp-query.sh b/test/T630-regexp-query.sh
new file mode 100755
index 0000000..3bbe47c
--- /dev/null
+++ b/test/T630-regexp-query.sh
@@ -0,0 +1,91 @@
+#!/usr/bin/env bash
+test_description='regular expression searches'
+. ./test-lib.sh || exit 1
+
+add_email_corpus
+
+
+if [ $NOTMUCH_HAVE_XAPIAN_FIELD_PROCESSOR -eq 1 ]; then
+
+    notmuch search --output=messages from:cworth > cworth.msg-ids
+
+    test_begin_subtest "regexp from search, case sensitive"
+    notmuch search --output=messages re:from:carl > OUTPUT
+    test_expect_equal_file /dev/null OUTPUT
+
+    test_begin_subtest "empty regexp or query"
+    notmuch search --output=messages re:from:carl or from:cworth > OUTPUT
+    test_expect_equal_file cworth.msg-ids OUTPUT
+
+    test_begin_subtest "non-empty regexp and query"
+    notmuch search  re:from:cworth and subject:patch > OUTPUT
+    cat <<EOF > EXPECTED
+thread:0000000000000008   2009-11-18 [1/2] Carl Worth| Alex Botero-Lowry; [notmuch] [PATCH] Error out if no query is supplied to search instead of going into an infinite loop (attachment inbox unread)
+thread:0000000000000007   2009-11-18 [1/2] Carl Worth| Ingmar Vanhassel; [notmuch] [PATCH] Typsos (inbox unread)
+thread:0000000000000018   2009-11-18 [1/2] Carl Worth| Jan Janak; [notmuch] [PATCH] Older versions of install do not support -C. (inbox unread)
+thread:0000000000000017   2009-11-18 [1/2] Carl Worth| Keith Packard; [notmuch] [PATCH] Make notmuch-show 'X' (and 'x') commands remove inbox (and unread) tags (inbox unread)
+thread:0000000000000014   2009-11-18 [2/5] Carl Worth| Mikhail Gusarov, Keith Packard; [notmuch] [PATCH 1/2] Close message file after parsing message headers (inbox unread)
+thread:0000000000000001   2009-11-18 [1/1] Stewart Smith; [notmuch] [PATCH] Fix linking with gcc to use g++ to link in C++ libs. (inbox unread)
+EOF
+    test_expect_equal_file EXPECTED OUTPUT
+
+    test_begin_subtest "regexp from search, duplicate term search"
+    notmuch search --output=messages re:from:cworth > OUTPUT
+    test_expect_equal_file cworth.msg-ids OUTPUT
+
+    test_begin_subtest "long enough regexp matches only desired senders"
+    notmuch search --output=messages 're:"from:C.* Wo"' > OUTPUT
+    test_expect_equal_file cworth.msg-ids OUTPUT
+
+    test_begin_subtest "shorter regexp matches one more sender"
+    notmuch search --output=messages 're:"from:C.* W"' > OUTPUT
+    (echo id:1258544095-16616-1-git-send-email-chris@chris-wilson.co.uk ; cat cworth.msg-ids) > EXPECTED
+    test_expect_equal_file EXPECTED OUTPUT
+
+    test_begin_subtest "regexp subject search, non-ASCII"
+    notmuch search --output=messages re:subject:accentué > OUTPUT
+    echo id:877h1wv7mg.fsf@inf-8657.int-evry.fr > EXPECTED
+    test_expect_equal_file EXPECTED OUTPUT
+
+    test_begin_subtest "regexp subject search, punctuation"
+    notmuch search   re:subject:\'X\' > OUTPUT
+    cat <<EOF > EXPECTED
+thread:0000000000000017   2009-11-18 [2/2] Keith Packard, Carl Worth; [notmuch] [PATCH] Make notmuch-show 'X' (and 'x') commands remove inbox (and unread) tags (inbox unread)
+EOF
+    test_expect_equal_file EXPECTED OUTPUT
+
+    test_begin_subtest "regexp subject search, no punctuation"
+    notmuch search  re:subject:X > OUTPUT
+    cat <<EOF > EXPECTED
+thread:0000000000000017   2009-11-18 [2/2] Keith Packard, Carl Worth; [notmuch] [PATCH] Make notmuch-show 'X' (and 'x') commands remove inbox (and unread) tags (inbox unread)
+thread:000000000000000f   2009-11-18 [4/4] Jjgod Jiang, Alexander Botero-Lowry; [notmuch] Mac OS X/Darwin compatibility issues (inbox unread)
+EOF
+    test_expect_equal_file EXPECTED OUTPUT
+
+    test_begin_subtest "combine regexp from and subject"
+    notmuch search  re:subject:-C and re:from:.an.k > OUTPUT
+    cat <<EOF > EXPECTED
+thread:0000000000000018   2009-11-17 [1/2] Jan Janak| Carl Worth; [notmuch] [PATCH] Older versions of install do not support -C. (inbox unread)
+EOF
+    test_expect_equal_file EXPECTED OUTPUT
+
+    test_begin_subtest "bad subprefix"
+    notmuch search 're:unsupported:.*' 1>OUTPUT 2>&1
+    cat <<EOF > EXPECTED
+notmuch search: A Xapian exception occurred
+A Xapian exception occurred performing query: unsupported regexp field 'unsupported'
+Query string was: re:unsupported:.*
+EOF
+    test_expect_equal_file EXPECTED OUTPUT
+
+    test_begin_subtest "regexp error reporting"
+    notmuch search 're:from:unbalanced[' 1>OUTPUT 2>&1
+    cat <<EOF > EXPECTED
+notmuch search: A Xapian exception occurred
+A Xapian exception occurred performing query: Invalid regular expression
+Query string was: re:from:unbalanced[
+EOF
+    test_expect_equal_file EXPECTED OUTPUT
+fi
+
+test_done
-- 
2.8.1


* [Patch v2] lib: regexp matching in 'subject' and 'from'
  2016-06-27 13:33 [PATCH] lib: regexp matching in 'subject' and 'from' David Bremner
@ 2016-11-14 21:46 ` David Bremner
  2017-01-18 20:05   ` Jani Nikula
  2017-01-19 14:27   ` [Patch v2] " David Bremner
  0 siblings, 2 replies; 18+ messages in thread
From: David Bremner @ 2016-11-14 21:46 UTC (permalink / raw)
  To: David Bremner, notmuch

The idea is that you can run

% notmuch search re:subject:<your-favourite-regexp>
% notmuch search re:from:<your-favourite-regexp>

or

% notmuch search subject:"your usual phrase search"
% notmuch search from:"usual phrase search"

This should also work with bindings, since it extends the query parser.

This is trivial to extend for other value slots, but currently the only
value slots are date, message_id, from, subject, and last_mod. Date is
already searchable, and message_id is not obviously useful to regex
match.

This was originally written by Austin Clements, and ported to Xapian
field processors (from Austin's custom query parser) by yours truly.
---

rebase of id:1467034387-16885-1-git-send-email-david@tethera.net against master

 doc/man7/notmuch-search-terms.rst |  17 +++++-
 lib/Makefile.local                |   1 +
 lib/database-private.h            |   1 +
 lib/database.cc                   |   5 ++
 lib/regexp-fields.cc              | 125 ++++++++++++++++++++++++++++++++++++++
 lib/regexp-fields.h               |  77 +++++++++++++++++++++++
 test/T630-regexp-query.sh         |  91 +++++++++++++++++++++++++++
 7 files changed, 316 insertions(+), 1 deletion(-)
 create mode 100644 lib/regexp-fields.cc
 create mode 100644 lib/regexp-fields.h
 create mode 100755 test/T630-regexp-query.sh

diff --git a/doc/man7/notmuch-search-terms.rst b/doc/man7/notmuch-search-terms.rst
index de93d73..4c7afc2 100644
--- a/doc/man7/notmuch-search-terms.rst
+++ b/doc/man7/notmuch-search-terms.rst
@@ -60,6 +60,8 @@ indicate user-supplied values):
 
 -  property:<key>=<value>
 
+- re:{subject,from}:<regex>
+
 The **from:** prefix is used to match the name or address of the sender
 of an email message.
 
@@ -146,6 +148,12 @@ The **property:** prefix searches for messages with a particular
 (and extensions) to add metadata to messages. A given key can be
 present on a given message with several different values.
 
+The **re:<field>:** prefix can be used to restrict the results to
+those whose <field> matches the given regular expression (see
+**regex(7)**). Regular expression searches are only available if
+notmuch is built with **Xapian Field Processors** (see below), and
+currently only for the Subject and From fields.
+
 Operators
 ---------
 
@@ -220,13 +228,19 @@ Boolean and Probabilistic Prefixes
 ----------------------------------
 
 Xapian (and hence notmuch) prefixes are either **boolean**, supporting
-exact matches like "tag:inbox"  or **probabilistic**, supporting a more flexible **term** based searching. The prefixes currently supported by notmuch are as follows.
+exact matches like "tag:inbox" or **probabilistic**, supporting a more
+flexible **term** based searching. Certain **special** prefixes are
+processed by notmuch in a way not strictly fitting either of Xapian's
+built-in styles. The prefixes currently supported by notmuch are as
+follows.
 
 
 Boolean
    **tag:**, **id:**, **thread:**, **folder:**, **path:**, **property:**
 Probabilistic
    **from:**, **to:**, **subject:**, **attachment:**, **mimetype:**
+Special
+   **query:**, **re:<field>**
 
 Terms and phrases
 -----------------
@@ -396,6 +410,7 @@ Currently the following features require field processor support:
 
 - non-range date queries, e.g. "date:today"
 - named queries e.g. "query:my_special_query"
+- regular expression searches, e.g. "re:subject:^\\[SPAM\\]"
 
 SEE ALSO
 ========
diff --git a/lib/Makefile.local b/lib/Makefile.local
index 3d1030a..ccd32ab 100644
--- a/lib/Makefile.local
+++ b/lib/Makefile.local
@@ -53,6 +53,7 @@ libnotmuch_cxx_srcs =		\
 	$(dir)/query.cc		\
 	$(dir)/query-fp.cc      \
 	$(dir)/config.cc	\
+	$(dir)/regexp-fields.cc     \
 	$(dir)/thread.cc
 
 libnotmuch_modules := $(libnotmuch_c_srcs:.c=.o) $(libnotmuch_cxx_srcs:.cc=.o)
diff --git a/lib/database-private.h b/lib/database-private.h
index ca71a92..900a989 100644
--- a/lib/database-private.h
+++ b/lib/database-private.h
@@ -186,6 +186,7 @@ struct _notmuch_database {
 #if HAVE_XAPIAN_FIELD_PROCESSOR
     Xapian::FieldProcessor *date_field_processor;
     Xapian::FieldProcessor *query_field_processor;
+    Xapian::FieldProcessor *re_field_processor;
 #endif
     Xapian::ValueRangeProcessor *last_mod_range_processor;
 };
diff --git a/lib/database.cc b/lib/database.cc
index 2d19f20..851a62d 100644
--- a/lib/database.cc
+++ b/lib/database.cc
@@ -21,6 +21,7 @@
 #include "database-private.h"
 #include "parse-time-vrp.h"
 #include "query-fp.h"
+#include "regexp-fields.h"
 #include "string-util.h"
 
 #include <iostream>
@@ -1042,6 +1043,8 @@ notmuch_database_open_verbose (const char *path,
 	notmuch->query_parser->add_boolean_prefix("date", notmuch->date_field_processor);
 	notmuch->query_field_processor = new QueryFieldProcessor (*notmuch->query_parser, notmuch);
 	notmuch->query_parser->add_boolean_prefix("query", notmuch->query_field_processor);
+	notmuch->re_field_processor = new RegexpFieldProcessor (*notmuch->query_parser, notmuch);
+	notmuch->query_parser->add_boolean_prefix("re", notmuch->re_field_processor);
 #endif
 	notmuch->last_mod_range_processor = new Xapian::NumberValueRangeProcessor (NOTMUCH_VALUE_LAST_MOD, "lastmod:");
 
@@ -1138,6 +1141,8 @@ notmuch_database_close (notmuch_database_t *notmuch)
     notmuch->date_field_processor = NULL;
     delete notmuch->query_field_processor;
     notmuch->query_field_processor = NULL;
+    delete notmuch->re_field_processor;
+    notmuch->re_field_processor = NULL;
 #endif
 
     return status;
diff --git a/lib/regexp-fields.cc b/lib/regexp-fields.cc
new file mode 100644
index 0000000..4d3d972
--- /dev/null
+++ b/lib/regexp-fields.cc
@@ -0,0 +1,125 @@
+/* regexp-fields.cc - "re:" field processor glue
+ *
+ * This file is part of notmuch.
+ *
+ * Copyright © 2015 Austin Clements
+ * Copyright © 2016 David Bremner
+ *
+ * This program is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see https://www.gnu.org/licenses/ .
+ *
+ * Author: Austin Clements <aclements@csail.mit.edu>
+ *                David Bremner <david@tethera.net>
+ */
+
+#include "regexp-fields.h"
+#include "notmuch-private.h"
+
+#if HAVE_XAPIAN_FIELD_PROCESSOR
+RegexpPostingSource::RegexpPostingSource (Xapian::valueno slot, const std::string &regexp)
+    : slot_ (slot)
+{
+    int err = regcomp (&regexp_, regexp.c_str (), REG_EXTENDED | REG_NOSUB);
+
+    if (err != 0) {
+	size_t len = regerror (err, &regexp_, NULL, 0);
+	char *buffer = new char[len];
+	std::string msg;
+	(void) regerror (err, &regexp_, buffer, len);
+	msg.assign (buffer, len - 1); /* len counts the trailing NUL */
+	delete[] buffer;
+
+	throw Xapian::QueryParserError (msg);
+    }
+}
+
+RegexpPostingSource::~RegexpPostingSource ()
+{
+    regfree (&regexp_);
+}
+
+void
+RegexpPostingSource::init (const Xapian::Database &db)
+{
+    db_ = db;
+    it_ = db_.valuestream_begin (slot_);
+    end_ = db.valuestream_end (slot_);
+    started_ = false;
+}
+
+Xapian::doccount
+RegexpPostingSource::get_termfreq_min () const
+{
+    return 0;
+}
+
+Xapian::doccount
+RegexpPostingSource::get_termfreq_est () const
+{
+    return get_termfreq_max () / 2;
+}
+
+Xapian::doccount
+RegexpPostingSource::get_termfreq_max () const
+{
+    return db_.get_value_freq (slot_);
+}
+
+Xapian::docid
+RegexpPostingSource::get_docid () const
+{
+    return it_.get_docid ();
+}
+
+bool
+RegexpPostingSource::at_end () const
+{
+    return it_ == end_;
+}
+
+void
+RegexpPostingSource::next (unused (double min_wt))
+{
+    if (started_ && ! at_end ())
+	++it_;
+    started_ = true;
+
+    for (; ! at_end (); ++it_) {
+	std::string value = *it_;
+	if (regexec (&regexp_, value.c_str (), 0, NULL, 0) == 0)
+	    break;
+    }
+}
+
+static Xapian::valueno
+_find_slot (std::string prefix)
+{
+    if (prefix == "from")
+	return NOTMUCH_VALUE_FROM;
+    else if (prefix == "subject")
+	return NOTMUCH_VALUE_SUBJECT;
+    else
+	throw Xapian::QueryParserError ("unsupported regexp field '" + prefix + "'");
+}
+
+Xapian::Query
+RegexpFieldProcessor::operator() (const std::string & str)
+{
+    size_t pos = str.find_first_of (':');
+    std::string prefix = str.substr (0, pos);
+    std::string regexp = str.substr (pos + 1);
+
+    postings = new RegexpPostingSource (_find_slot (prefix), regexp);
+    return Xapian::Query (postings);
+}
+#endif
diff --git a/lib/regexp-fields.h b/lib/regexp-fields.h
new file mode 100644
index 0000000..2c9c2d7
--- /dev/null
+++ b/lib/regexp-fields.h
@@ -0,0 +1,77 @@
+/* regexp-fields.h - xapian glue for semi-bruteforce regexp search
+ *
+ * This file is part of notmuch.
+ *
+ * Copyright © 2015 Austin Clements
+ * Copyright © 2016 David Bremner
+ *
+ * This program is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see https://www.gnu.org/licenses/ .
+ *
+ * Author: Austin Clements <aclements@csail.mit.edu>
+ *                David Bremner <david@tethera.net>
+ */
+
+#ifndef NOTMUCH_REGEXP_FIELDS_H
+#define NOTMUCH_REGEXP_FIELDS_H
+#if HAVE_XAPIAN_FIELD_PROCESSOR
+#include <sys/types.h>
+#include <regex.h>
+#include <xapian.h>
+#include "notmuch-private.h"
+
+/* A posting source that returns documents where a value matches a
+ * regexp.
+ */
+class RegexpPostingSource : public Xapian::PostingSource
+{
+ protected:
+    const Xapian::valueno slot_;
+    regex_t regexp_;
+    Xapian::Database db_;
+    bool started_;
+    Xapian::ValueIterator it_, end_;
+
+/* No copying */
+    RegexpPostingSource (const RegexpPostingSource &);
+    RegexpPostingSource &operator= (const RegexpPostingSource &);
+
+ public:
+    RegexpPostingSource (Xapian::valueno slot, const std::string &regexp);
+    ~RegexpPostingSource ();
+    void init (const Xapian::Database &db);
+    Xapian::doccount get_termfreq_min () const;
+    Xapian::doccount get_termfreq_est () const;
+    Xapian::doccount get_termfreq_max () const;
+    Xapian::docid get_docid () const;
+    bool at_end () const;
+    void next (unused (double min_wt));
+};
+
+
+class RegexpFieldProcessor : public Xapian::FieldProcessor {
+ protected:
+    Xapian::QueryParser &parser;
+    notmuch_database_t *notmuch;
+    RegexpPostingSource *postings = NULL;
+
+ public:
+    RegexpFieldProcessor (Xapian::QueryParser &parser_, notmuch_database_t *notmuch_)
+	: parser(parser_), notmuch(notmuch_) { };
+
+    ~RegexpFieldProcessor () { delete postings; };
+
+    Xapian::Query operator()(const std::string & str);
+};
+#endif
+#endif /* NOTMUCH_REGEXP_FIELDS_H */
diff --git a/test/T630-regexp-query.sh b/test/T630-regexp-query.sh
new file mode 100755
index 0000000..3bbe47c
--- /dev/null
+++ b/test/T630-regexp-query.sh
@@ -0,0 +1,91 @@
+#!/usr/bin/env bash
+test_description='regular expression searches'
+. ./test-lib.sh || exit 1
+
+add_email_corpus
+
+
+if [ $NOTMUCH_HAVE_XAPIAN_FIELD_PROCESSOR -eq 1 ]; then
+
+    notmuch search --output=messages from:cworth > cworth.msg-ids
+
+    test_begin_subtest "regexp from search, case sensitive"
+    notmuch search --output=messages re:from:carl > OUTPUT
+    test_expect_equal_file /dev/null OUTPUT
+
+    test_begin_subtest "empty regexp or query"
+    notmuch search --output=messages re:from:carl or from:cworth > OUTPUT
+    test_expect_equal_file cworth.msg-ids OUTPUT
+
+    test_begin_subtest "non-empty regexp and query"
+    notmuch search  re:from:cworth and subject:patch > OUTPUT
+    cat <<EOF > EXPECTED
+thread:0000000000000008   2009-11-18 [1/2] Carl Worth| Alex Botero-Lowry; [notmuch] [PATCH] Error out if no query is supplied to search instead of going into an infinite loop (attachment inbox unread)
+thread:0000000000000007   2009-11-18 [1/2] Carl Worth| Ingmar Vanhassel; [notmuch] [PATCH] Typsos (inbox unread)
+thread:0000000000000018   2009-11-18 [1/2] Carl Worth| Jan Janak; [notmuch] [PATCH] Older versions of install do not support -C. (inbox unread)
+thread:0000000000000017   2009-11-18 [1/2] Carl Worth| Keith Packard; [notmuch] [PATCH] Make notmuch-show 'X' (and 'x') commands remove inbox (and unread) tags (inbox unread)
+thread:0000000000000014   2009-11-18 [2/5] Carl Worth| Mikhail Gusarov, Keith Packard; [notmuch] [PATCH 1/2] Close message file after parsing message headers (inbox unread)
+thread:0000000000000001   2009-11-18 [1/1] Stewart Smith; [notmuch] [PATCH] Fix linking with gcc to use g++ to link in C++ libs. (inbox unread)
+EOF
+    test_expect_equal_file EXPECTED OUTPUT
+
+    test_begin_subtest "regexp from search, duplicate term search"
+    notmuch search --output=messages re:from:cworth > OUTPUT
+    test_expect_equal_file cworth.msg-ids OUTPUT
+
+    test_begin_subtest "long enough regexp matches only desired senders"
+    notmuch search --output=messages 're:"from:C.* Wo"' > OUTPUT
+    test_expect_equal_file cworth.msg-ids OUTPUT
+
+    test_begin_subtest "shorter regexp matches one more sender"
+    notmuch search --output=messages 're:"from:C.* W"' > OUTPUT
+    (echo id:1258544095-16616-1-git-send-email-chris@chris-wilson.co.uk ; cat cworth.msg-ids) > EXPECTED
+    test_expect_equal_file EXPECTED OUTPUT
+
+    test_begin_subtest "regexp subject search, non-ASCII"
+    notmuch search --output=messages re:subject:accentué > OUTPUT
+    echo id:877h1wv7mg.fsf@inf-8657.int-evry.fr > EXPECTED
+    test_expect_equal_file EXPECTED OUTPUT
+
+    test_begin_subtest "regexp subject search, punctuation"
+    notmuch search   re:subject:\'X\' > OUTPUT
+    cat <<EOF > EXPECTED
+thread:0000000000000017   2009-11-18 [2/2] Keith Packard, Carl Worth; [notmuch] [PATCH] Make notmuch-show 'X' (and 'x') commands remove inbox (and unread) tags (inbox unread)
+EOF
+    test_expect_equal_file EXPECTED OUTPUT
+
+    test_begin_subtest "regexp subject search, no punctuation"
+    notmuch search  re:subject:X > OUTPUT
+    cat <<EOF > EXPECTED
+thread:0000000000000017   2009-11-18 [2/2] Keith Packard, Carl Worth; [notmuch] [PATCH] Make notmuch-show 'X' (and 'x') commands remove inbox (and unread) tags (inbox unread)
+thread:000000000000000f   2009-11-18 [4/4] Jjgod Jiang, Alexander Botero-Lowry; [notmuch] Mac OS X/Darwin compatibility issues (inbox unread)
+EOF
+    test_expect_equal_file EXPECTED OUTPUT
+
+    test_begin_subtest "combine regexp from and subject"
+    notmuch search  re:subject:-C and re:from:.an.k > OUTPUT
+    cat <<EOF > EXPECTED
+thread:0000000000000018   2009-11-17 [1/2] Jan Janak| Carl Worth; [notmuch] [PATCH] Older versions of install do not support -C. (inbox unread)
+EOF
+    test_expect_equal_file EXPECTED OUTPUT
+
+    test_begin_subtest "bad subprefix"
+    notmuch search 're:unsupported:.*' 1>OUTPUT 2>&1
+    cat <<EOF > EXPECTED
+notmuch search: A Xapian exception occurred
+A Xapian exception occurred performing query: unsupported regexp field 'unsupported'
+Query string was: re:unsupported:.*
+EOF
+    test_expect_equal_file EXPECTED OUTPUT
+
+    test_begin_subtest "regexp error reporting"
+    notmuch search 're:from:unbalanced[' 1>OUTPUT 2>&1
+    cat <<EOF > EXPECTED
+notmuch search: A Xapian exception occurred
+A Xapian exception occurred performing query: Invalid regular expression
+Query string was: re:from:unbalanced[
+EOF
+    test_expect_equal_file EXPECTED OUTPUT
+fi
+
+test_done
-- 
2.10.2


* Re: [Patch v2] lib: regexp matching in 'subject' and 'from'
  2016-11-14 21:46 ` [Patch v2] " David Bremner
@ 2017-01-18 20:05   ` Jani Nikula
  2017-01-18 21:01     ` David Bremner
  2017-01-19 14:27   ` [Patch v2] " David Bremner
  1 sibling, 1 reply; 18+ messages in thread
From: Jani Nikula @ 2017-01-18 20:05 UTC (permalink / raw)
  To: David Bremner, David Bremner, notmuch

On Mon, 14 Nov 2016, David Bremner <david@tethera.net> wrote:
> the idea is that you can run
>
> % notmuch search re:subject:<your-favourite-regexp>
> % notmuch search re:from:<your-favourite-regexp>
>
> or
>
> % notmuch search subject:"your usual phrase search"
> % notmuch search from:"usual phrase search"
>
> This should also work with bindings, since it extends the query parser.
>
> This is trivial to extend for other value slots, but currently the only
> value slots are date, message_id, from, subject, and last_mod. Date is
> already searchable, and message_id is not obviously useful to regex
> match.
>
> This was originally written by Austin Clements, and ported to Xapian
> field processors (from Austin's custom query parser) by yours truly.

I can't say I did a detailed review of all the Xapian bits and pieces
here, but I didn't spot anything obviously wrong either.

I suppose I'd prefer the documentation to be more explicit about
"re:subject:" and "re:from:" instead of having the generic "re:<field>:"
that I think is bound to confuse people.

The _ suffixes instead of prefixes in variables seemed a bit odd, but no
strong opinions on it.

I played around with this a bit, and it seemed to work. Unsurprisingly,
getting the quoting right was the hardest part. Even though I know how
the stuff works under the hood, it took me a while to realize that you
have to use 're:"subject:<regex with spaces>"' to make it work. (I kept
trying 're:subject:"<regex with spaces>"'.) I don't know if there's
anything we could really do about this.
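
The quoting behaviour follows from how the processor in this patch
treats its argument: it receives one string and splits it at the first
colon. Below is a minimal standalone sketch of that split. The name
split_re_argument is hypothetical and used only for illustration, and it
is an assumption here that the query parser hands the whole quoted
phrase to the processor.

```cpp
#include <string>
#include <utility>

// Hypothetical helper (not part of the patch): mirrors the split in
// RegexpFieldProcessor::operator(), where the text before the first ':'
// names the field and everything after it is the regexp.
static std::pair<std::string, std::string>
split_re_argument (const std::string &str)
{
    std::string::size_type pos = str.find_first_of (':');
    return std::make_pair (str.substr (0, pos), str.substr (pos + 1));
}
```

Given "subject:foo bar" (assumed to be what arrives for
're:"subject:foo bar"'), this yields the field "subject" and the regexp
"foo bar", spaces included.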

BR,
Jani.



> ---
>
> rebase of id:1467034387-16885-1-git-send-email-david@tethera.net against master
>
>  doc/man7/notmuch-search-terms.rst |  17 +++++-
>  lib/Makefile.local                |   1 +
>  lib/database-private.h            |   1 +
>  lib/database.cc                   |   5 ++
>  lib/regexp-fields.cc              | 125 ++++++++++++++++++++++++++++++++++++++
>  lib/regexp-fields.h               |  77 +++++++++++++++++++++++
>  test/T630-regexp-query.sh         |  91 +++++++++++++++++++++++++++
>  7 files changed, 316 insertions(+), 1 deletion(-)
>  create mode 100644 lib/regexp-fields.cc
>  create mode 100644 lib/regexp-fields.h
>  create mode 100755 test/T630-regexp-query.sh
>
> diff --git a/doc/man7/notmuch-search-terms.rst b/doc/man7/notmuch-search-terms.rst
> index de93d73..4c7afc2 100644
> --- a/doc/man7/notmuch-search-terms.rst
> +++ b/doc/man7/notmuch-search-terms.rst
> @@ -60,6 +60,8 @@ indicate user-supplied values):
>  
>  -  property:<key>=<value>
>  
> +- re:{subject,from}:<regex>
> +
>  The **from:** prefix is used to match the name or address of the sender
>  of an email message.
>  
> @@ -146,6 +148,12 @@ The **property:** prefix searches for messages with a particular
>  (and extensions) to add metadata to messages. A given key can be
>  present on a given message with several different values.
>  
> +The **re:<field>:** prefix can be used to restrict the results to
> +those whose <field> matches the given regular expression (see
> +**regex(7)**). Regular expression searches are only available if
> +notmuch is built with **Xapian Field Processors** (see below), and
> +currently only for the Subject and From fields.
> +
>  Operators
>  ---------
>  
> @@ -220,13 +228,19 @@ Boolean and Probabilistic Prefixes
>  ----------------------------------
>  
>  Xapian (and hence notmuch) prefixes are either **boolean**, supporting
> -exact matches like "tag:inbox"  or **probabilistic**, supporting a more flexible **term** based searching. The prefixes currently supported by notmuch are as follows.
> +exact matches like "tag:inbox" or **probabilistic**, supporting a more
> +flexible **term** based searching. Certain **special** prefixes are
> +processed by notmuch in a way not strictly fitting either of Xapian's
> +built-in styles. The prefixes currently supported by notmuch are as
> +follows.
>  
>  
>  Boolean
>     **tag:**, **id:**, **thread:**, **folder:**, **path:**, **property:**
>  Probabilistic
>     **from:**, **to:**, **subject:**, **attachment:**, **mimetype:**
> +Special
> +   **query:**, **re:<field>**
>  
>  Terms and phrases
>  -----------------
> @@ -396,6 +410,7 @@ Currently the following features require field processor support:
>  
>  - non-range date queries, e.g. "date:today"
>  - named queries e.g. "query:my_special_query"
> +- regular expression searches, e.g. "re:subject:^\\[SPAM\\]"
>  
>  SEE ALSO
>  ========
> diff --git a/lib/Makefile.local b/lib/Makefile.local
> index 3d1030a..ccd32ab 100644
> --- a/lib/Makefile.local
> +++ b/lib/Makefile.local
> @@ -53,6 +53,7 @@ libnotmuch_cxx_srcs =		\
>  	$(dir)/query.cc		\
>  	$(dir)/query-fp.cc      \
>  	$(dir)/config.cc	\
> +	$(dir)/regexp-fields.cc     \
>  	$(dir)/thread.cc
>  
>  libnotmuch_modules := $(libnotmuch_c_srcs:.c=.o) $(libnotmuch_cxx_srcs:.cc=.o)
> diff --git a/lib/database-private.h b/lib/database-private.h
> index ca71a92..900a989 100644
> --- a/lib/database-private.h
> +++ b/lib/database-private.h
> @@ -186,6 +186,7 @@ struct _notmuch_database {
>  #if HAVE_XAPIAN_FIELD_PROCESSOR
>      Xapian::FieldProcessor *date_field_processor;
>      Xapian::FieldProcessor *query_field_processor;
> +    Xapian::FieldProcessor *re_field_processor;
>  #endif
>      Xapian::ValueRangeProcessor *last_mod_range_processor;
>  };
> diff --git a/lib/database.cc b/lib/database.cc
> index 2d19f20..851a62d 100644
> --- a/lib/database.cc
> +++ b/lib/database.cc
> @@ -21,6 +21,7 @@
>  #include "database-private.h"
>  #include "parse-time-vrp.h"
>  #include "query-fp.h"
> +#include "regexp-fields.h"
>  #include "string-util.h"
>  
>  #include <iostream>
> @@ -1042,6 +1043,8 @@ notmuch_database_open_verbose (const char *path,
>  	notmuch->query_parser->add_boolean_prefix("date", notmuch->date_field_processor);
>  	notmuch->query_field_processor = new QueryFieldProcessor (*notmuch->query_parser, notmuch);
>  	notmuch->query_parser->add_boolean_prefix("query", notmuch->query_field_processor);
> +	notmuch->re_field_processor = new RegexpFieldProcessor (*notmuch->query_parser, notmuch);
> +	notmuch->query_parser->add_boolean_prefix("re", notmuch->re_field_processor);
>  #endif
>  	notmuch->last_mod_range_processor = new Xapian::NumberValueRangeProcessor (NOTMUCH_VALUE_LAST_MOD, "lastmod:");
>  
> @@ -1138,6 +1141,8 @@ notmuch_database_close (notmuch_database_t *notmuch)
>      notmuch->date_field_processor = NULL;
>      delete notmuch->query_field_processor;
>      notmuch->query_field_processor = NULL;
> +    delete notmuch->re_field_processor;
> +    notmuch->re_field_processor = NULL;
>  #endif
>  
>      return status;
> diff --git a/lib/regexp-fields.cc b/lib/regexp-fields.cc
> new file mode 100644
> index 0000000..4d3d972
> --- /dev/null
> +++ b/lib/regexp-fields.cc
> @@ -0,0 +1,125 @@
> +/* regexp-fields.cc - "re:" field processor glue
> + *
> + * This file is part of notmuch.
> + *
> + * Copyright © 2015 Austin Clements
> + * Copyright © 2016 David Bremner
> + *
> + * This program is free software: you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation, either version 3 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program.  If not, see https://www.gnu.org/licenses/ .
> + *
> + * Author: Austin Clements <aclements@csail.mit.edu>
> + *                David Bremner <david@tethera.net>
> + */
> +
> +#include "regexp-fields.h"
> +#include "notmuch-private.h"
> +
> +#if HAVE_XAPIAN_FIELD_PROCESSOR
> +RegexpPostingSource::RegexpPostingSource (Xapian::valueno slot, const std::string &regexp)
> +    : slot_ (slot)
> +{
> +    int err = regcomp (&regexp_, regexp.c_str (), REG_EXTENDED | REG_NOSUB);
> +
> +    if (err != 0) {
> +	size_t len = regerror (err, &regexp_, NULL, 0);
> +	char *buffer = new char[len];
> +	std::string msg;
> +	(void) regerror (err, &regexp_, buffer, len);
> +	msg.assign (buffer, len);
> +	delete[] buffer;
> +
> +	throw Xapian::QueryParserError (msg);
> +    }
> +}
> +
> +RegexpPostingSource::~RegexpPostingSource ()
> +{
> +    regfree (&regexp_);
> +}
> +
> +void
> +RegexpPostingSource::init (const Xapian::Database &db)
> +{
> +    db_ = db;
> +    it_ = db_.valuestream_begin (slot_);
> +    end_ = db.valuestream_end (slot_);
> +    started_ = false;
> +}
> +
> +Xapian::doccount
> +RegexpPostingSource::get_termfreq_min () const
> +{
> +    return 0;
> +}
> +
> +Xapian::doccount
> +RegexpPostingSource::get_termfreq_est () const
> +{
> +    return get_termfreq_max () / 2;
> +}
> +
> +Xapian::doccount
> +RegexpPostingSource::get_termfreq_max () const
> +{
> +    return db_.get_value_freq (slot_);
> +}
> +
> +Xapian::docid
> +RegexpPostingSource::get_docid () const
> +{
> +    return it_.get_docid ();
> +}
> +
> +bool
> +RegexpPostingSource::at_end () const
> +{
> +    return it_ == end_;
> +}
> +
> +void
> +RegexpPostingSource::next (unused (double min_wt))
> +{
> +    if (started_ && ! at_end ())
> +	++it_;
> +    started_ = true;
> +
> +    for (; ! at_end (); ++it_) {
> +	std::string value = *it_;
> +	if (regexec (&regexp_, value.c_str (), 0, NULL, 0) == 0)
> +	    break;
> +    }
> +}
> +
> +static Xapian::valueno
> +_find_slot (std::string prefix)
> +{
> +    if (prefix == "from")
> +	return NOTMUCH_VALUE_FROM;
> +    else if (prefix == "subject")
> +	return NOTMUCH_VALUE_SUBJECT;
> +    else
> +	throw Xapian::QueryParserError ("unsupported regexp field '" + prefix + "'");
> +}
> +
> +Xapian::Query
> +RegexpFieldProcessor::operator() (const std::string & str)
> +{
> +    size_t pos = str.find_first_of (':');
> +    std::string prefix = str.substr (0, pos);
> +    std::string regexp = str.substr (pos + 1);
> +
> +    postings = new RegexpPostingSource (_find_slot (prefix), regexp);
> +    return Xapian::Query (postings);
> +}
> +#endif
> diff --git a/lib/regexp-fields.h b/lib/regexp-fields.h
> new file mode 100644
> index 0000000..2c9c2d7
> --- /dev/null
> +++ b/lib/regexp-fields.h
> @@ -0,0 +1,77 @@
> +/* regexp-fields.h - xapian glue for semi-bruteforce regexp search
> + *
> + * This file is part of notmuch.
> + *
> + * Copyright © 2015 Austin Clements
> + * Copyright © 2016 David Bremner
> + *
> + * This program is free software: you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation, either version 3 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program.  If not, see https://www.gnu.org/licenses/ .
> + *
> + * Author: Austin Clements <aclements@csail.mit.edu>
> + *                David Bremner <david@tethera.net>
> + */
> +
> +#ifndef NOTMUCH_REGEXP_FIELDS_H
> +#define NOTMUCH_REGEXP_FIELDS_H
> +#if HAVE_XAPIAN_FIELD_PROCESSOR
> +#include <sys/types.h>
> +#include <regex.h>
> +#include <xapian.h>
> +#include "notmuch-private.h"
> +
> +/* A posting source that returns documents where a value matches a
> + * regexp.
> + */
> +class RegexpPostingSource : public Xapian::PostingSource
> +{
> + protected:
> +    const Xapian::valueno slot_;
> +    regex_t regexp_;
> +    Xapian::Database db_;
> +    bool started_;
> +    Xapian::ValueIterator it_, end_;
> +
> +/* No copying */
> +    RegexpPostingSource (const RegexpPostingSource &);
> +    RegexpPostingSource &operator= (const RegexpPostingSource &);
> +
> + public:
> +    RegexpPostingSource (Xapian::valueno slot, const std::string &regexp);
> +    ~RegexpPostingSource ();
> +    void init (const Xapian::Database &db);
> +    Xapian::doccount get_termfreq_min () const;
> +    Xapian::doccount get_termfreq_est () const;
> +    Xapian::doccount get_termfreq_max () const;
> +    Xapian::docid get_docid () const;
> +    bool at_end () const;
> +    void next (unused (double min_wt));
> +};
> +
> +
> +class RegexpFieldProcessor : public Xapian::FieldProcessor {
> + protected:
> +    Xapian::QueryParser &parser;
> +    notmuch_database_t *notmuch;
> +    RegexpPostingSource *postings = NULL;
> +
> + public:
> +    RegexpFieldProcessor (Xapian::QueryParser &parser_, notmuch_database_t *notmuch_)
> +	: parser(parser_), notmuch(notmuch_) { };
> +
> +    ~RegexpFieldProcessor () { delete postings; };
> +
> +    Xapian::Query operator()(const std::string & str);
> +};
> +#endif
> +#endif /* NOTMUCH_REGEXP_FIELDS_H */
> diff --git a/test/T630-regexp-query.sh b/test/T630-regexp-query.sh
> new file mode 100755
> index 0000000..3bbe47c
> --- /dev/null
> +++ b/test/T630-regexp-query.sh
> @@ -0,0 +1,91 @@
> +#!/usr/bin/env bash
> +test_description='regular expression searches'
> +. ./test-lib.sh || exit 1
> +
> +add_email_corpus
> +
> +
> +if [ $NOTMUCH_HAVE_XAPIAN_FIELD_PROCESSOR -eq 1 ]; then
> +
> +    notmuch search --output=messages from:cworth > cworth.msg-ids
> +
> +    test_begin_subtest "regexp from search, case sensitive"
> +    notmuch search --output=messages re:from:carl > OUTPUT
> +    test_expect_equal_file /dev/null OUTPUT
> +
> +    test_begin_subtest "empty regexp or query"
> +    notmuch search --output=messages re:from:carl or from:cworth > OUTPUT
> +    test_expect_equal_file cworth.msg-ids OUTPUT
> +
> +    test_begin_subtest "non-empty regexp and query"
> +    notmuch search  re:from:cworth and subject:patch > OUTPUT
> +    cat <<EOF > EXPECTED
> +thread:0000000000000008   2009-11-18 [1/2] Carl Worth| Alex Botero-Lowry; [notmuch] [PATCH] Error out if no query is supplied to search instead of going into an infinite loop (attachment inbox unread)
> +thread:0000000000000007   2009-11-18 [1/2] Carl Worth| Ingmar Vanhassel; [notmuch] [PATCH] Typsos (inbox unread)
> +thread:0000000000000018   2009-11-18 [1/2] Carl Worth| Jan Janak; [notmuch] [PATCH] Older versions of install do not support -C. (inbox unread)
> +thread:0000000000000017   2009-11-18 [1/2] Carl Worth| Keith Packard; [notmuch] [PATCH] Make notmuch-show 'X' (and 'x') commands remove inbox (and unread) tags (inbox unread)
> +thread:0000000000000014   2009-11-18 [2/5] Carl Worth| Mikhail Gusarov, Keith Packard; [notmuch] [PATCH 1/2] Close message file after parsing message headers (inbox unread)
> +thread:0000000000000001   2009-11-18 [1/1] Stewart Smith; [notmuch] [PATCH] Fix linking with gcc to use g++ to link in C++ libs. (inbox unread)
> +EOF
> +    test_expect_equal_file EXPECTED OUTPUT
> +
> +    test_begin_subtest "regexp from search, duplicate term search"
> +    notmuch search --output=messages re:from:cworth > OUTPUT
> +    test_expect_equal_file cworth.msg-ids OUTPUT
> +
> +    test_begin_subtest "long enough regexp matches only desired senders"
> +    notmuch search --output=messages 're:"from:C.* Wo"' > OUTPUT
> +    test_expect_equal_file cworth.msg-ids OUTPUT
> +
> +    test_begin_subtest "shorter regexp matches one more sender"
> +    notmuch search --output=messages 're:"from:C.* W"' > OUTPUT
> +    (echo id:1258544095-16616-1-git-send-email-chris@chris-wilson.co.uk ; cat cworth.msg-ids) > EXPECTED
> +    test_expect_equal_file EXPECTED OUTPUT
> +
> +    test_begin_subtest "regexp subject search, non-ASCII"
> +    notmuch search --output=messages re:subject:accentué > OUTPUT
> +    echo id:877h1wv7mg.fsf@inf-8657.int-evry.fr > EXPECTED
> +    test_expect_equal_file EXPECTED OUTPUT
> +
> +    test_begin_subtest "regexp subject search, punctuation"
> +    notmuch search   re:subject:\'X\' > OUTPUT
> +    cat <<EOF > EXPECTED
> +thread:0000000000000017   2009-11-18 [2/2] Keith Packard, Carl Worth; [notmuch] [PATCH] Make notmuch-show 'X' (and 'x') commands remove inbox (and unread) tags (inbox unread)
> +EOF
> +    test_expect_equal_file EXPECTED OUTPUT
> +
> +    test_begin_subtest "regexp subject search, no punctuation"
> +    notmuch search  re:subject:X > OUTPUT
> +    cat <<EOF > EXPECTED
> +thread:0000000000000017   2009-11-18 [2/2] Keith Packard, Carl Worth; [notmuch] [PATCH] Make notmuch-show 'X' (and 'x') commands remove inbox (and unread) tags (inbox unread)
> +thread:000000000000000f   2009-11-18 [4/4] Jjgod Jiang, Alexander Botero-Lowry; [notmuch] Mac OS X/Darwin compatibility issues (inbox unread)
> +EOF
> +    test_expect_equal_file EXPECTED OUTPUT
> +
> +    test_begin_subtest "combine regexp from and subject"
> +    notmuch search  re:subject:-C and re:from:.an.k > OUTPUT
> +    cat <<EOF > EXPECTED
> +thread:0000000000000018   2009-11-17 [1/2] Jan Janak| Carl Worth; [notmuch] [PATCH] Older versions of install do not support -C. (inbox unread)
> +EOF
> +    test_expect_equal_file EXPECTED OUTPUT
> +
> +    test_begin_subtest "bad subprefix"
> +    notmuch search 're:unsupported:.*' 1>OUTPUT 2>&1
> +    cat <<EOF > EXPECTED
> +notmuch search: A Xapian exception occurred
> +A Xapian exception occurred performing query: unsupported regexp field 'unsupported'
> +Query string was: re:unsupported:.*
> +EOF
> +    test_expect_equal_file EXPECTED OUTPUT
> +
> +    test_begin_subtest "regexp error reporting"
> +    notmuch search 're:from:unbalanced[' 1>OUTPUT 2>&1
> +    cat <<EOF > EXPECTED
> +notmuch search: A Xapian exception occurred
> +A Xapian exception occurred performing query: Invalid regular expression
> +Query string was: re:from:unbalanced[
> +EOF
> +    test_expect_equal_file EXPECTED OUTPUT
> +fi
> +
> +test_done
> -- 
> 2.10.2


* Re: [Patch v2] lib: regexp matching in 'subject' and 'from'
  2017-01-18 20:05   ` Jani Nikula
@ 2017-01-18 21:01     ` David Bremner
  2017-01-19 12:16       ` [Patch v3] " David Bremner
  0 siblings, 1 reply; 18+ messages in thread
From: David Bremner @ 2017-01-18 21:01 UTC (permalink / raw)
  To: Jani Nikula, notmuch

Jani Nikula <jani@nikula.org> writes:


> I played around with this a bit, and it seemed to work. Unsurprisingly,
> getting the quoting right was the hardest part. Even though I know how
> the stuff works under the hood, it took me a while to realize that you
> have to use 're:"subject:<regex with spaces>"' to make it work. (I kept
> trying 're:subject:"<regex with spaces>"'.) I don't know if there's
> anything we could really do about this.
>

I _think_ we could add distinct prefixes at the Xapian level for each
regex-prefix. That opens the can of worms of naming them, e.g.
re-subject: and re-from:. I'm not sure about the added complexity, but
I think it's a matter of adding an extra argument to the field
processor constructor.
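
A rough standalone sketch of that idea, without Xapian: the field name
is passed to the constructor and resolved to a value slot once, so the
string the processor later receives can be taken as the regexp as-is.
The enum and the class name PerFieldRegexpProcessor are hypothetical
stand-ins, not names from the patch.

```cpp
#include <stdexcept>
#include <string>

// Stand-ins for notmuch's NOTMUCH_VALUE_FROM / NOTMUCH_VALUE_SUBJECT;
// the enum itself is illustrative only.
enum value_slot { SLOT_FROM, SLOT_SUBJECT };

// Resolve a field name to its value slot, rejecting anything else.
static value_slot
find_slot (const std::string &field)
{
    if (field == "from")
        return SLOT_FROM;
    if (field == "subject")
        return SLOT_SUBJECT;
    throw std::invalid_argument ("unsupported regexp field '" + field + "'");
}

// Hypothetical processor: the field is fixed at construction time, so
// no prefix has to be parsed out of the query argument later.
class PerFieldRegexpProcessor
{
    value_slot slot_;

  public:
    explicit PerFieldRegexpProcessor (const std::string &field)
        : slot_ (find_slot (field)) { }

    value_slot slot () const { return slot_; }
};
```

One instance would then be registered per prefix, at the cost of one
more member per supported field.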

d


* [Patch v3] lib: regexp matching in 'subject' and 'from'
  2017-01-18 21:01     ` David Bremner
@ 2017-01-19 12:16       ` David Bremner
  2017-01-21  3:27         ` [WIP] " David Bremner
  0 siblings, 1 reply; 18+ messages in thread
From: David Bremner @ 2017-01-19 12:16 UTC (permalink / raw)
  To: David Bremner, Jani Nikula, notmuch

the idea is that you can run

% notmuch search re_subject:<your-favourite-regexp>
% notmuch search re_from:<your-favourite-regexp>

or

% notmuch search subject:"your usual phrase search"
% notmuch search from:"usual phrase search"

This should also work with bindings, since it extends the query parser.

This is trivial to extend for other value slots, but currently the only
value slots are date, message_id, from, subject, and last_mod. Date is
already searchable, and message_id is not obviously useful to regex
match.

This was originally written by Austin Clements, and ported to Xapian
field processors (from Austin's custom query parser) by yours truly.
---

This version changes re:from: to re_from: (likewise re:subject: to
re_subject:) and makes the quoting more natural.
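
With a per-field prefix the whole argument is the regular expression, so
matching comes down to compiling it and running it against the stored
header value. A minimal sketch of that step using POSIX regex(3) with
the same flags as the patch (REG_EXTENDED | REG_NOSUB) follows. The
helper name regexp_matches and the use of std::invalid_argument are
assumptions for illustration; the patch itself throws
Xapian::QueryParserError.

```cpp
#include <regex.h>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: returns true if `pattern` (POSIX extended
// syntax) matches anywhere in `value`. Throws std::invalid_argument
// carrying the regerror(3) message if the pattern does not compile.
static bool
regexp_matches (const std::string &pattern, const std::string &value)
{
    regex_t re;
    int err = regcomp (&re, pattern.c_str (), REG_EXTENDED | REG_NOSUB);

    if (err != 0) {
        // The first call reports the buffer size needed (including the
        // trailing NUL); the second call fills the buffer.
        size_t len = regerror (err, &re, nullptr, 0);
        std::vector<char> buffer (len);
        (void) regerror (err, &re, buffer.data (), len);
        throw std::invalid_argument (buffer.data ());
    }

    bool matched = regexec (&re, value.c_str (), 0, nullptr, 0) == 0;
    regfree (&re);
    return matched;
}
```

Matching is case sensitive and unanchored, which is what the "case
sensitive" and ".an.k" subtests in the patch rely on.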

 doc/man7/notmuch-search-terms.rst |  17 +++++-
 lib/Makefile.local                |   1 +
 lib/database-private.h            |   2 +
 lib/database.cc                   |   9 ++++
 lib/regexp-fields.cc              | 110 ++++++++++++++++++++++++++++++++++++++
 lib/regexp-fields.h               |  90 +++++++++++++++++++++++++++++++
 test/T630-regexp-query.sh         |  82 ++++++++++++++++++++++++++++
 7 files changed, 310 insertions(+), 1 deletion(-)
 create mode 100644 lib/regexp-fields.cc
 create mode 100644 lib/regexp-fields.h
 create mode 100755 test/T630-regexp-query.sh

diff --git a/doc/man7/notmuch-search-terms.rst b/doc/man7/notmuch-search-terms.rst
index de93d733..8800039d 100644
--- a/doc/man7/notmuch-search-terms.rst
+++ b/doc/man7/notmuch-search-terms.rst
@@ -60,6 +60,8 @@ indicate user-supplied values):
 
 -  property:<key>=<value>
 
+- re_{subject,from}:<regex>
+
 The **from:** prefix is used to match the name or address of the sender
 of an email message.
 
@@ -146,6 +148,12 @@ The **property:** prefix searches for messages with a particular
 (and extensions) to add metadata to messages. A given key can be
 present on a given message with several different values.
 
+The **re_from:** and **re_subject:** prefixes can be used to restrict the
+results to those whose from/subject value matches the given regular
+expression (see **regex(7)**). Regular expression searches are only
+available if notmuch is built with **Xapian Field Processors** (see
+below).
+
 Operators
 ---------
 
@@ -220,13 +228,19 @@ Boolean and Probabilistic Prefixes
 ----------------------------------
 
 Xapian (and hence notmuch) prefixes are either **boolean**, supporting
-exact matches like "tag:inbox"  or **probabilistic**, supporting a more flexible **term** based searching. The prefixes currently supported by notmuch are as follows.
+exact matches like "tag:inbox" or **probabilistic**, supporting a more
+flexible **term** based searching. Certain **special** prefixes are
+processed by notmuch in a way not strictly fitting either of Xapian's
+built-in styles. The prefixes currently supported by notmuch are as
+follows.
 
 
 Boolean
    **tag:**, **id:**, **thread:**, **folder:**, **path:**, **property:**
 Probabilistic
    **from:**, **to:**, **subject:**, **attachment:**, **mimetype:**
+Special
+   **query:**, **re_from:**, **re_subject:**
 
 Terms and phrases
 -----------------
@@ -396,6 +410,7 @@ Currently the following features require field processor support:
 
 - non-range date queries, e.g. "date:today"
 - named queries e.g. "query:my_special_query"
+- regular expression searches, e.g. "re_subject:^\\[SPAM\\]"
 
 SEE ALSO
 ========
diff --git a/lib/Makefile.local b/lib/Makefile.local
index b77e5780..ff812b5f 100644
--- a/lib/Makefile.local
+++ b/lib/Makefile.local
@@ -52,6 +52,7 @@ libnotmuch_cxx_srcs =		\
 	$(dir)/query.cc		\
 	$(dir)/query-fp.cc      \
 	$(dir)/config.cc	\
+	$(dir)/regexp-fields.cc     \
 	$(dir)/thread.cc
 
 libnotmuch_modules := $(libnotmuch_c_srcs:.c=.o) $(libnotmuch_cxx_srcs:.cc=.o)
diff --git a/lib/database-private.h b/lib/database-private.h
index ccc1e9a1..92f4b72f 100644
--- a/lib/database-private.h
+++ b/lib/database-private.h
@@ -190,6 +190,8 @@ struct _notmuch_database {
 #if HAVE_XAPIAN_FIELD_PROCESSOR
     Xapian::FieldProcessor *date_field_processor;
     Xapian::FieldProcessor *query_field_processor;
+    Xapian::FieldProcessor *re_from_field_processor;
+    Xapian::FieldProcessor *re_subject_field_processor;
 #endif
     Xapian::ValueRangeProcessor *last_mod_range_processor;
 };
diff --git a/lib/database.cc b/lib/database.cc
index 2d19f20c..2b2f8f5e 100644
--- a/lib/database.cc
+++ b/lib/database.cc
@@ -21,6 +21,7 @@
 #include "database-private.h"
 #include "parse-time-vrp.h"
 #include "query-fp.h"
+#include "regexp-fields.h"
 #include "string-util.h"
 
 #include <iostream>
@@ -1042,6 +1043,10 @@ notmuch_database_open_verbose (const char *path,
 	notmuch->query_parser->add_boolean_prefix("date", notmuch->date_field_processor);
 	notmuch->query_field_processor = new QueryFieldProcessor (*notmuch->query_parser, notmuch);
 	notmuch->query_parser->add_boolean_prefix("query", notmuch->query_field_processor);
+	notmuch->re_from_field_processor = new RegexpFieldProcessor ("from", *notmuch->query_parser, notmuch);
+	notmuch->re_subject_field_processor = new RegexpFieldProcessor ("subject", *notmuch->query_parser, notmuch);
+	notmuch->query_parser->add_boolean_prefix("re_from", notmuch->re_from_field_processor);
+	notmuch->query_parser->add_boolean_prefix("re_subject", notmuch->re_subject_field_processor);
 #endif
 	notmuch->last_mod_range_processor = new Xapian::NumberValueRangeProcessor (NOTMUCH_VALUE_LAST_MOD, "lastmod:");
 
@@ -1138,6 +1143,10 @@ notmuch_database_close (notmuch_database_t *notmuch)
     notmuch->date_field_processor = NULL;
     delete notmuch->query_field_processor;
     notmuch->query_field_processor = NULL;
+    delete notmuch->re_from_field_processor;
+    notmuch->re_from_field_processor = NULL;
+    delete notmuch->re_subject_field_processor;
+    notmuch->re_subject_field_processor = NULL;
 #endif
 
     return status;
diff --git a/lib/regexp-fields.cc b/lib/regexp-fields.cc
new file mode 100644
index 00000000..211ec02d
--- /dev/null
+++ b/lib/regexp-fields.cc
@@ -0,0 +1,110 @@
+/* regexp-fields.cc - "re:" field processor glue
+ *
+ * This file is part of notmuch.
+ *
+ * Copyright © 2015 Austin Clements
+ * Copyright © 2016 David Bremner
+ *
+ * This program is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see https://www.gnu.org/licenses/ .
+ *
+ * Author: Austin Clements <aclements@csail.mit.edu>
+ *                David Bremner <david@tethera.net>
+ */
+
+#include "regexp-fields.h"
+#include "notmuch-private.h"
+
+#if HAVE_XAPIAN_FIELD_PROCESSOR
+RegexpPostingSource::RegexpPostingSource (Xapian::valueno slot, const std::string &regexp)
+    : slot_ (slot)
+{
+    int err = regcomp (&regexp_, regexp.c_str (), REG_EXTENDED | REG_NOSUB);
+
+    if (err != 0) {
+	size_t len = regerror (err, &regexp_, NULL, 0);
+	char *buffer = new char[len];
+	std::string msg;
+	(void) regerror (err, &regexp_, buffer, len);
+	msg.assign (buffer, len);
+	delete[] buffer;
+
+	throw Xapian::QueryParserError (msg);
+    }
+}
+
+RegexpPostingSource::~RegexpPostingSource ()
+{
+    regfree (&regexp_);
+}
+
+void
+RegexpPostingSource::init (const Xapian::Database &db)
+{
+    db_ = db;
+    it_ = db_.valuestream_begin (slot_);
+    end_ = db.valuestream_end (slot_);
+    started_ = false;
+}
+
+Xapian::doccount
+RegexpPostingSource::get_termfreq_min () const
+{
+    return 0;
+}
+
+Xapian::doccount
+RegexpPostingSource::get_termfreq_est () const
+{
+    return get_termfreq_max () / 2;
+}
+
+Xapian::doccount
+RegexpPostingSource::get_termfreq_max () const
+{
+    return db_.get_value_freq (slot_);
+}
+
+Xapian::docid
+RegexpPostingSource::get_docid () const
+{
+    return it_.get_docid ();
+}
+
+bool
+RegexpPostingSource::at_end () const
+{
+    return it_ == end_;
+}
+
+void
+RegexpPostingSource::next (unused (double min_wt))
+{
+    if (started_ && ! at_end ())
+	++it_;
+    started_ = true;
+
+    for (; ! at_end (); ++it_) {
+	std::string value = *it_;
+	if (regexec (&regexp_, value.c_str (), 0, NULL, 0) == 0)
+	    break;
+    }
+}
+
+Xapian::Query
+RegexpFieldProcessor::operator() (const std::string & str)
+{
+    postings = new RegexpPostingSource (slot, str);
+    return Xapian::Query (postings);
+}
+#endif
diff --git a/lib/regexp-fields.h b/lib/regexp-fields.h
new file mode 100644
index 00000000..c2c44167
--- /dev/null
+++ b/lib/regexp-fields.h
@@ -0,0 +1,90 @@
+/* regexp-fields.h - xapian glue for semi-bruteforce regexp search
+ *
+ * This file is part of notmuch.
+ *
+ * Copyright © 2015 Austin Clements
+ * Copyright © 2016 David Bremner
+ *
+ * This program is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see https://www.gnu.org/licenses/ .
+ *
+ * Author: Austin Clements <aclements@csail.mit.edu>
+ *                David Bremner <david@tethera.net>
+ */
+
+#ifndef NOTMUCH_REGEXP_FIELDS_H
+#define NOTMUCH_REGEXP_FIELDS_H
+#if HAVE_XAPIAN_FIELD_PROCESSOR
+#include <sys/types.h>
+#include <regex.h>
+#include <xapian.h>
+#include "notmuch-private.h"
+
+/* A posting source that returns documents where a value matches a
+ * regexp.
+ */
+class RegexpPostingSource : public Xapian::PostingSource
+{
+ protected:
+    const Xapian::valueno slot_;
+    regex_t regexp_;
+    Xapian::Database db_;
+    bool started_;
+    Xapian::ValueIterator it_, end_;
+
+/* No copying */
+    RegexpPostingSource (const RegexpPostingSource &);
+    RegexpPostingSource &operator= (const RegexpPostingSource &);
+
+ public:
+    RegexpPostingSource (Xapian::valueno slot, const std::string &regexp);
+    ~RegexpPostingSource ();
+    void init (const Xapian::Database &db);
+    Xapian::doccount get_termfreq_min () const;
+    Xapian::doccount get_termfreq_est () const;
+    Xapian::doccount get_termfreq_max () const;
+    Xapian::docid get_docid () const;
+    bool at_end () const;
+    void next (unused (double min_wt));
+};
+
+
+class RegexpFieldProcessor : public Xapian::FieldProcessor {
+ protected:
+    Xapian::valueno slot;
+    Xapian::QueryParser &parser;
+    notmuch_database_t *notmuch;
+    RegexpPostingSource *postings = NULL;
+
+
+    static inline Xapian::valueno _find_slot (std::string prefix)
+    {
+	if (prefix == "from")
+	    return NOTMUCH_VALUE_FROM;
+	else if (prefix == "subject")
+	    return NOTMUCH_VALUE_SUBJECT;
+	else
+	    throw Xapian::QueryParserError ("unsupported regexp field '" + prefix + "'");
+    }
+
+
+ public:
+    RegexpFieldProcessor (std::string prefix, Xapian::QueryParser &parser_, notmuch_database_t *notmuch_)
+	: slot(_find_slot (prefix)), parser(parser_), notmuch(notmuch_) { };
+
+    ~RegexpFieldProcessor () { delete postings; };
+
+    Xapian::Query operator()(const std::string & str);
+};
+#endif
+#endif /* NOTMUCH_REGEXP_FIELDS_H */
diff --git a/test/T630-regexp-query.sh b/test/T630-regexp-query.sh
new file mode 100755
index 00000000..1b25634d
--- /dev/null
+++ b/test/T630-regexp-query.sh
@@ -0,0 +1,82 @@
+#!/usr/bin/env bash
+test_description='regular expression searches'
+. ./test-lib.sh || exit 1
+
+add_email_corpus
+
+
+if [ $NOTMUCH_HAVE_XAPIAN_FIELD_PROCESSOR -eq 1 ]; then
+
+    notmuch search --output=messages from:cworth > cworth.msg-ids
+
+    test_begin_subtest "regexp from search, case sensitive"
+    notmuch search --output=messages re_from:carl > OUTPUT
+    test_expect_equal_file /dev/null OUTPUT
+
+    test_begin_subtest "empty regexp or query"
+    notmuch search --output=messages re_from:carl or from:cworth > OUTPUT
+    test_expect_equal_file cworth.msg-ids OUTPUT
+
+    test_begin_subtest "non-empty regexp and query"
+    notmuch search  re_from:cworth and subject:patch > OUTPUT
+    cat <<EOF > EXPECTED
+thread:0000000000000008   2009-11-18 [1/2] Carl Worth| Alex Botero-Lowry; [notmuch] [PATCH] Error out if no query is supplied to search instead of going into an infinite loop (attachment inbox unread)
+thread:0000000000000007   2009-11-18 [1/2] Carl Worth| Ingmar Vanhassel; [notmuch] [PATCH] Typsos (inbox unread)
+thread:0000000000000018   2009-11-18 [1/2] Carl Worth| Jan Janak; [notmuch] [PATCH] Older versions of install do not support -C. (inbox unread)
+thread:0000000000000017   2009-11-18 [1/2] Carl Worth| Keith Packard; [notmuch] [PATCH] Make notmuch-show 'X' (and 'x') commands remove inbox (and unread) tags (inbox unread)
+thread:0000000000000014   2009-11-18 [2/5] Carl Worth| Mikhail Gusarov, Keith Packard; [notmuch] [PATCH 1/2] Close message file after parsing message headers (inbox unread)
+thread:0000000000000001   2009-11-18 [1/1] Stewart Smith; [notmuch] [PATCH] Fix linking with gcc to use g++ to link in C++ libs. (inbox unread)
+EOF
+    test_expect_equal_file EXPECTED OUTPUT
+
+    test_begin_subtest "regexp from search, duplicate term search"
+    notmuch search --output=messages re_from:cworth > OUTPUT
+    test_expect_equal_file cworth.msg-ids OUTPUT
+
+    test_begin_subtest "long enough regexp matches only desired senders"
+    notmuch search --output=messages 're_from:"C.* Wo"' > OUTPUT
+    test_expect_equal_file cworth.msg-ids OUTPUT
+
+    test_begin_subtest "shorter regexp matches one more sender"
+    notmuch search --output=messages 're_from:"C.* W"' > OUTPUT
+    (echo id:1258544095-16616-1-git-send-email-chris@chris-wilson.co.uk ; cat cworth.msg-ids) > EXPECTED
+    test_expect_equal_file EXPECTED OUTPUT
+
+    test_begin_subtest "regexp subject search, non-ASCII"
+    notmuch search --output=messages re_subject:accentué > OUTPUT
+    echo id:877h1wv7mg.fsf@inf-8657.int-evry.fr > EXPECTED
+    test_expect_equal_file EXPECTED OUTPUT
+
+    test_begin_subtest "regexp subject search, punctuation"
+    notmuch search   re_subject:\'X\' > OUTPUT
+    cat <<EOF > EXPECTED
+thread:0000000000000017   2009-11-18 [2/2] Keith Packard, Carl Worth; [notmuch] [PATCH] Make notmuch-show 'X' (and 'x') commands remove inbox (and unread) tags (inbox unread)
+EOF
+    test_expect_equal_file EXPECTED OUTPUT
+
+    test_begin_subtest "regexp subject search, no punctuation"
+    notmuch search  re_subject:X > OUTPUT
+    cat <<EOF > EXPECTED
+thread:0000000000000017   2009-11-18 [2/2] Keith Packard, Carl Worth; [notmuch] [PATCH] Make notmuch-show 'X' (and 'x') commands remove inbox (and unread) tags (inbox unread)
+thread:000000000000000f   2009-11-18 [4/4] Jjgod Jiang, Alexander Botero-Lowry; [notmuch] Mac OS X/Darwin compatibility issues (inbox unread)
+EOF
+    test_expect_equal_file EXPECTED OUTPUT
+
+    test_begin_subtest "combine regexp from and subject"
+    notmuch search  re_subject:-C and re_from:.an.k > OUTPUT
+    cat <<EOF > EXPECTED
+thread:0000000000000018   2009-11-17 [1/2] Jan Janak| Carl Worth; [notmuch] [PATCH] Older versions of install do not support -C. (inbox unread)
+EOF
+    test_expect_equal_file EXPECTED OUTPUT
+
+    test_begin_subtest "regexp error reporting"
+    notmuch search 're_from:unbalanced[' 1>OUTPUT 2>&1
+    cat <<EOF > EXPECTED
+notmuch search: A Xapian exception occurred
+A Xapian exception occurred performing query: Invalid regular expression
+Query string was: re_from:unbalanced[
+EOF
+    test_expect_equal_file EXPECTED OUTPUT
+fi
+
+test_done
-- 
2.11.0


* Re: [Patch v2] lib: regexp matching in 'subject' and 'from'
  2016-11-14 21:46 ` [Patch v2] " David Bremner
  2017-01-18 20:05   ` Jani Nikula
@ 2017-01-19 14:27   ` David Bremner
  1 sibling, 0 replies; 18+ messages in thread
From: David Bremner @ 2017-01-19 14:27 UTC (permalink / raw)
  To: notmuch

David Bremner <david@tethera.net> writes:

> the idea is that you can run
>
> % notmuch search re:subject:<your-favourite-regexp>
> % notmuch search re:from:<your-favourite-regexp>'
>
> or
>
> % notmuch search subject:"your usual phrase search"
> % notmuch search from:"usual phrase search"

I'm not sure how useful it is, but here's an interdiff.

diff --git a/lib/database-private.h b/lib/database-private.h
index e7cbed8f..92f4b72f 100644
--- a/lib/database-private.h
+++ b/lib/database-private.h
@@ -190,7 +190,8 @@ struct _notmuch_database {
 #if HAVE_XAPIAN_FIELD_PROCESSOR
     Xapian::FieldProcessor *date_field_processor;
     Xapian::FieldProcessor *query_field_processor;
-    Xapian::FieldProcessor *re_field_processor;
+    Xapian::FieldProcessor *re_from_field_processor;
+    Xapian::FieldProcessor *re_subject_field_processor;
 #endif
     Xapian::ValueRangeProcessor *last_mod_range_processor;
 };
diff --git a/lib/database.cc b/lib/database.cc
index 851a62d1..2b2f8f5e 100644
--- a/lib/database.cc
+++ b/lib/database.cc
@@ -1043,8 +1043,10 @@ notmuch_database_open_verbose (const char *path,
 	notmuch->query_parser->add_boolean_prefix("date", notmuch->date_field_processor);
 	notmuch->query_field_processor = new QueryFieldProcessor (*notmuch->query_parser, notmuch);
 	notmuch->query_parser->add_boolean_prefix("query", notmuch->query_field_processor);
-	notmuch->re_field_processor = new RegexpFieldProcessor (*notmuch->query_parser, notmuch);
-	notmuch->query_parser->add_boolean_prefix("re", notmuch->re_field_processor);
+	notmuch->re_from_field_processor = new RegexpFieldProcessor ("from", *notmuch->query_parser, notmuch);
+	notmuch->re_subject_field_processor = new RegexpFieldProcessor ("subject", *notmuch->query_parser, notmuch);
+	notmuch->query_parser->add_boolean_prefix("re_from", notmuch->re_from_field_processor);
+	notmuch->query_parser->add_boolean_prefix("re_subject", notmuch->re_subject_field_processor);
 #endif
 	notmuch->last_mod_range_processor = new Xapian::NumberValueRangeProcessor (NOTMUCH_VALUE_LAST_MOD, "lastmod:");
 
@@ -1141,8 +1143,10 @@ notmuch_database_close (notmuch_database_t *notmuch)
     notmuch->date_field_processor = NULL;
     delete notmuch->query_field_processor;
     notmuch->query_field_processor = NULL;
-    delete notmuch->re_field_processor;
-    notmuch->re_field_processor = NULL;
+    delete notmuch->re_from_field_processor;
+    notmuch->re_from_field_processor = NULL;
+    delete notmuch->re_subject_field_processor;
+    notmuch->re_subject_field_processor = NULL;
 #endif
 
     return status;
diff --git a/lib/regexp-fields.cc b/lib/regexp-fields.cc
index 4d3d9721..211ec02d 100644
--- a/lib/regexp-fields.cc
+++ b/lib/regexp-fields.cc
@@ -101,25 +101,10 @@ RegexpPostingSource::next (unused (double min_wt))
     }
 }
 
-static Xapian::valueno
-_find_slot (std::string prefix)
-{
-    if (prefix == "from")
-	return NOTMUCH_VALUE_FROM;
-    else if (prefix == "subject")
-	return NOTMUCH_VALUE_SUBJECT;
-    else
-	throw Xapian::QueryParserError ("unsupported regexp field '" + prefix + "'");
-}
-
 Xapian::Query
 RegexpFieldProcessor::operator() (const std::string & str)
 {
-    size_t pos = str.find_first_of (':');
-    std::string prefix = str.substr (0, pos);
-    std::string regexp = str.substr (pos + 1);
-
-    postings = new RegexpPostingSource (_find_slot (prefix), regexp);
+    postings = new RegexpPostingSource (slot, str);
     return Xapian::Query (postings);
 }
 #endif
diff --git a/lib/regexp-fields.h b/lib/regexp-fields.h
index 2c9c2d7e..c2c44167 100644
--- a/lib/regexp-fields.h
+++ b/lib/regexp-fields.h
@@ -61,13 +61,26 @@ class RegexpPostingSource : public Xapian::PostingSource
 
 class RegexpFieldProcessor : public Xapian::FieldProcessor {
  protected:
+    Xapian::valueno slot;
     Xapian::QueryParser &parser;
     notmuch_database_t *notmuch;
     RegexpPostingSource *postings = NULL;
 
+
+    static inline Xapian::valueno _find_slot (std::string prefix)
+    {
+	if (prefix == "from")
+	    return NOTMUCH_VALUE_FROM;
+	else if (prefix == "subject")
+	    return NOTMUCH_VALUE_SUBJECT;
+	else
+	    throw Xapian::QueryParserError ("unsupported regexp field '" + prefix + "'");
+    }
+
+
  public:
-    RegexpFieldProcessor (Xapian::QueryParser &parser_, notmuch_database_t *notmuch_)
-	: parser(parser_), notmuch(notmuch_) { };
+    RegexpFieldProcessor (std::string prefix, Xapian::QueryParser &parser_, notmuch_database_t *notmuch_)
+	: slot(_find_slot (prefix)), parser(parser_), notmuch(notmuch_) { };
 
     ~RegexpFieldProcessor () { delete postings; };
 
diff --git a/test/T630-regexp-query.sh b/test/T630-regexp-query.sh
index 3bbe47cf..1b25634d 100755
--- a/test/T630-regexp-query.sh
+++ b/test/T630-regexp-query.sh
@@ -10,15 +10,15 @@ if [ $NOTMUCH_HAVE_XAPIAN_FIELD_PROCESSOR -eq 1 ]; then
     notmuch search --output=messages from:cworth > cworth.msg-ids
 
     test_begin_subtest "regexp from search, case sensitive"
-    notmuch search --output=messages re:from:carl > OUTPUT
+    notmuch search --output=messages re_from:carl > OUTPUT
     test_expect_equal_file /dev/null OUTPUT
 
     test_begin_subtest "empty regexp or query"
-    notmuch search --output=messages re:from:carl or from:cworth > OUTPUT
+    notmuch search --output=messages re_from:carl or from:cworth > OUTPUT
     test_expect_equal_file cworth.msg-ids OUTPUT
 
     test_begin_subtest "non-empty regexp and query"
-    notmuch search  re:from:cworth and subject:patch > OUTPUT
+    notmuch search  re_from:cworth and subject:patch > OUTPUT
     cat <<EOF > EXPECTED
 thread:0000000000000008   2009-11-18 [1/2] Carl Worth| Alex Botero-Lowry; [notmuch] [PATCH] Error out if no query is supplied to search instead of going into an infinite loop (attachment inbox unread)
 thread:0000000000000007   2009-11-18 [1/2] Carl Worth| Ingmar Vanhassel; [notmuch] [PATCH] Typsos (inbox unread)
@@ -30,32 +30,32 @@ EOF
     test_expect_equal_file EXPECTED OUTPUT
 
     test_begin_subtest "regexp from search, duplicate term search"
-    notmuch search --output=messages re:from:cworth > OUTPUT
+    notmuch search --output=messages re_from:cworth > OUTPUT
     test_expect_equal_file cworth.msg-ids OUTPUT
 
     test_begin_subtest "long enough regexp matches only desired senders"
-    notmuch search --output=messages 're:"from:C.* Wo"' > OUTPUT
+    notmuch search --output=messages 're_from:"C.* Wo"' > OUTPUT
     test_expect_equal_file cworth.msg-ids OUTPUT
 
     test_begin_subtest "shorter regexp matches one more sender"
-    notmuch search --output=messages 're:"from:C.* W"' > OUTPUT
+    notmuch search --output=messages 're_from:"C.* W"' > OUTPUT
     (echo id:1258544095-16616-1-git-send-email-chris@chris-wilson.co.uk ; cat cworth.msg-ids) > EXPECTED
     test_expect_equal_file EXPECTED OUTPUT
 
     test_begin_subtest "regexp subject search, non-ASCII"
-    notmuch search --output=messages re:subject:accentué > OUTPUT
+    notmuch search --output=messages re_subject:accentué > OUTPUT
     echo id:877h1wv7mg.fsf@inf-8657.int-evry.fr > EXPECTED
     test_expect_equal_file EXPECTED OUTPUT
 
     test_begin_subtest "regexp subject search, punctuation"
-    notmuch search   re:subject:\'X\' > OUTPUT
+    notmuch search   re_subject:\'X\' > OUTPUT
     cat <<EOF > EXPECTED
 thread:0000000000000017   2009-11-18 [2/2] Keith Packard, Carl Worth; [notmuch] [PATCH] Make notmuch-show 'X' (and 'x') commands remove inbox (and unread) tags (inbox unread)
 EOF
     test_expect_equal_file EXPECTED OUTPUT
 
     test_begin_subtest "regexp subject search, no punctuation"
-    notmuch search  re:subject:X > OUTPUT
+    notmuch search  re_subject:X > OUTPUT
     cat <<EOF > EXPECTED
 thread:0000000000000017   2009-11-18 [2/2] Keith Packard, Carl Worth; [notmuch] [PATCH] Make notmuch-show 'X' (and 'x') commands remove inbox (and unread) tags (inbox unread)
 thread:000000000000000f   2009-11-18 [4/4] Jjgod Jiang, Alexander Botero-Lowry; [notmuch] Mac OS X/Darwin compatibility issues (inbox unread)
@@ -63,27 +63,18 @@ EOF
     test_expect_equal_file EXPECTED OUTPUT
 
     test_begin_subtest "combine regexp from and subject"
-    notmuch search  re:subject:-C and re:from:.an.k > OUTPUT
+    notmuch search  re_subject:-C and re_from:.an.k > OUTPUT
     cat <<EOF > EXPECTED
 thread:0000000000000018   2009-11-17 [1/2] Jan Janak| Carl Worth; [notmuch] [PATCH] Older versions of install do not support -C. (inbox unread)
 EOF
     test_expect_equal_file EXPECTED OUTPUT
 
-    test_begin_subtest "bad subprefix"
-    notmuch search 're:unsupported:.*' 1>OUTPUT 2>&1
-    cat <<EOF > EXPECTED
-notmuch search: A Xapian exception occurred
-A Xapian exception occurred performing query: unsupported regexp field 'unsupported'
-Query string was: re:unsupported:.*
-EOF
-    test_expect_equal_file EXPECTED OUTPUT
-
     test_begin_subtest "regexp error reporting"
-    notmuch search 're:from:unbalanced[' 1>OUTPUT 2>&1
+    notmuch search 're_from:unbalanced[' 1>OUTPUT 2>&1
     cat <<EOF > EXPECTED
 notmuch search: A Xapian exception occurred
 A Xapian exception occurred performing query: Invalid regular expression
-Query string was: re:from:unbalanced[
+Query string was: re_from:unbalanced[
 EOF
     test_expect_equal_file EXPECTED OUTPUT
 fi


* [WIP] lib: regexp matching in 'subject' and 'from'
  2017-01-19 12:16       ` [Patch v3] " David Bremner
@ 2017-01-21  3:27         ` David Bremner
  2017-01-21 13:59           ` [Patch v4] " David Bremner
  0 siblings, 1 reply; 18+ messages in thread
From: David Bremner @ 2017-01-21  3:27 UTC (permalink / raw)
  To: David Bremner, Jani Nikula, notmuch

the idea is that you can run

% notmuch search subject:<your-favourite-regexp>
% notmuch search from:<your-favourite-regexp>

or

% notmuch search subject:"your usual phrase search"
% notmuch search from:"usual phrase search"

The heuristic to decide how to interpret the query is based on a
regex, roughly [a-z -]+

This should also work with bindings, since it extends the query parser.

This is trivial to extend for other value slots, but currently the only
value slots are date, message_id, from, subject, and last_mod. Date is
already searchable, and message_id is not obviously useful to regex
match.

This was originally written by Austin Clements, and ported to Xapian
field processors (from Austin's custom query parser) by yours truly.
---

It turns out to be not as hard as I thought to have the same field
interpreted as a regex search and as a regular xapian phrase search.
I haven't fixed the tests and docs yet because I'm not sure about the
best UI to trigger the regex search. Currently it just guesses based
on the string, but this has some surprising effects for
notmuch-address (hence the test breakage). Maybe from:/regex/ although
the quoting means this would look like from:"/regex/"
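
For illustration only (not part of the patch): a minimal sketch of the
guessing heuristic described above, using the same POSIX pattern that
the patch stores in phrase_regex_str. The helper name
looks_like_phrase is hypothetical. A string made up only of lowercase
letters, digits, blanks and '-' is handed to the normal phrase parser;
anything else is treated as a regexp.

```cpp
#include <regex.h>
#include <stdexcept>
#include <string>

// Hypothetical helper (the name is an assumption, not from the patch).
// Returns true when the field value would be parsed as an ordinary
// Xapian phrase search, false when it would be treated as a regexp.
static bool
looks_like_phrase (const std::string &str)
{
    // Same pattern as phrase_regex_str in lib/regexp-fields.h.
    static const char *pattern = "^[[:lower:][:digit:][:blank:]-]+$";
    regex_t re;

    if (regcomp (&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
	throw std::runtime_error ("failed to compile heuristic pattern");

    // regexec returns 0 on a match.
    bool is_phrase = (regexec (&re, str.c_str (), 0, NULL, 0) == 0);

    regfree (&re);
    return is_phrase;
}
```

Under this sketch "cworth" and "usual phrase search" count as phrases,
while "C.* Wo", "unbalanced[" and a full address such as
"cworth@cworth.org" do not, because of the upper-case letters, '[',
'@' and '.'. That last case is one plausible source of the surprising
behaviour with address searches, although that is an assumption rather
than something established here.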

 doc/man7/notmuch-search-terms.rst |  17 ++++-
 lib/Makefile.local                |   1 +
 lib/database-private.h            |   2 +
 lib/database.cc                   |  29 +++++++-
 lib/regexp-fields.cc              | 142 ++++++++++++++++++++++++++++++++++++++
 lib/regexp-fields.h               |  81 ++++++++++++++++++++++
 test/T630-regexp-query.sh         |  82 ++++++++++++++++++++++
 7 files changed, 350 insertions(+), 4 deletions(-)
 create mode 100644 lib/regexp-fields.cc
 create mode 100644 lib/regexp-fields.h
 create mode 100755 test/T630-regexp-query.sh

diff --git a/doc/man7/notmuch-search-terms.rst b/doc/man7/notmuch-search-terms.rst
index de93d733..8800039d 100644
--- a/doc/man7/notmuch-search-terms.rst
+++ b/doc/man7/notmuch-search-terms.rst
@@ -60,6 +60,8 @@ indicate user-supplied values):
 
 -  property:<key>=<value>
 
+- re_{subject,from}:<regex>
+
 The **from:** prefix is used to match the name or address of the sender
 of an email message.
 
@@ -146,6 +148,12 @@ The **property:** prefix searches for messages with a particular
 (and extensions) to add metadata to messages. A given key can be
 present on a given message with several different values.
 
+The **re_from:** and **re_subject:** prefixes can be used to restrict the
+results to those whose from/subject value matches the given regular
+expression (see **regex(7)**). Regular expression searches are only
+available if notmuch is built with **Xapian Field Processors** (see
+below).
+
 Operators
 ---------
 
@@ -220,13 +228,19 @@ Boolean and Probabilistic Prefixes
 ----------------------------------
 
 Xapian (and hence notmuch) prefixes are either **boolean**, supporting
-exact matches like "tag:inbox"  or **probabilistic**, supporting a more flexible **term** based searching. The prefixes currently supported by notmuch are as follows.
+exact matches like "tag:inbox" or **probabilistic**, supporting a more
+flexible **term** based searching. Certain **special** prefixes are
+processed by notmuch in a way not strictly fitting either of Xapian's
+built-in styles. The prefixes currently supported by notmuch are as
+follows.
 
 
 Boolean
    **tag:**, **id:**, **thread:**, **folder:**, **path:**, **property:**
 Probabilistic
    **from:**, **to:**, **subject:**, **attachment:**, **mimetype:**
+Special
+   **query:**, **re:<field>**
 
 Terms and phrases
 -----------------
@@ -396,6 +410,7 @@ Currently the following features require field processor support:
 
 - non-range date queries, e.g. "date:today"
 - named queries e.g. "query:my_special_query"
+- regular expression searches, e.g. "re:subject:^\\[SPAM\\]"
 
 SEE ALSO
 ========
diff --git a/lib/Makefile.local b/lib/Makefile.local
index b77e5780..ff812b5f 100644
--- a/lib/Makefile.local
+++ b/lib/Makefile.local
@@ -52,6 +52,7 @@ libnotmuch_cxx_srcs =		\
 	$(dir)/query.cc		\
 	$(dir)/query-fp.cc      \
 	$(dir)/config.cc	\
+	$(dir)/regexp-fields.cc     \
 	$(dir)/thread.cc
 
 libnotmuch_modules := $(libnotmuch_c_srcs:.c=.o) $(libnotmuch_cxx_srcs:.cc=.o)
diff --git a/lib/database-private.h b/lib/database-private.h
index ccc1e9a1..9f5659a9 100644
--- a/lib/database-private.h
+++ b/lib/database-private.h
@@ -190,6 +190,8 @@ struct _notmuch_database {
 #if HAVE_XAPIAN_FIELD_PROCESSOR
     Xapian::FieldProcessor *date_field_processor;
     Xapian::FieldProcessor *query_field_processor;
+    Xapian::FieldProcessor *from_field_processor;
+    Xapian::FieldProcessor *subject_field_processor;
 #endif
     Xapian::ValueRangeProcessor *last_mod_range_processor;
 };
diff --git a/lib/database.cc b/lib/database.cc
index 2d19f20c..8a9ad251 100644
--- a/lib/database.cc
+++ b/lib/database.cc
@@ -21,6 +21,7 @@
 #include "database-private.h"
 #include "parse-time-vrp.h"
 #include "query-fp.h"
+#include "regexp-fields.h"
 #include "string-util.h"
 
 #include <iostream>
@@ -272,12 +273,16 @@ static prefix_t BOOLEAN_PREFIX_EXTERNAL[] = {
     { "folder",			"XFOLDER:" },
 };
 
-static prefix_t PROBABILISTIC_PREFIX[]= {
+static prefix_t REGEX_PREFIX[]= {
     { "from",			"XFROM" },
+    { "subject",		"XSUBJECT"},
+};
+
+static prefix_t PROBABILISTIC_PREFIX[]= {
+
     { "to",			"XTO" },
     { "attachment",		"XATTACHMENT" },
     { "mimetype",		"XMIMETYPE"},
-    { "subject",		"XSUBJECT"},
 };
 
 const char *
@@ -295,6 +300,11 @@ _find_prefix (const char *name)
 	    return BOOLEAN_PREFIX_EXTERNAL[i].prefix;
     }
 
+    for (i = 0; i < ARRAY_SIZE (REGEX_PREFIX); i++) {
+	if (strcmp (name, REGEX_PREFIX[i].name) == 0)
+	    return REGEX_PREFIX[i].prefix;
+    }
+
     for (i = 0; i < ARRAY_SIZE (PROBABILISTIC_PREFIX); i++) {
 	if (strcmp (name, PROBABILISTIC_PREFIX[i].name) == 0)
 	    return PROBABILISTIC_PREFIX[i].prefix;
@@ -1042,6 +1052,10 @@ notmuch_database_open_verbose (const char *path,
 	notmuch->query_parser->add_boolean_prefix("date", notmuch->date_field_processor);
 	notmuch->query_field_processor = new QueryFieldProcessor (*notmuch->query_parser, notmuch);
 	notmuch->query_parser->add_boolean_prefix("query", notmuch->query_field_processor);
+	notmuch->from_field_processor = new RegexpFieldProcessor ("from", *notmuch->query_parser, notmuch);
+	notmuch->subject_field_processor = new RegexpFieldProcessor ("subject", *notmuch->query_parser, notmuch);
+	notmuch->query_parser->add_boolean_prefix("from", notmuch->from_field_processor);
+	notmuch->query_parser->add_boolean_prefix("subject", notmuch->subject_field_processor);
 #endif
 	notmuch->last_mod_range_processor = new Xapian::NumberValueRangeProcessor (NOTMUCH_VALUE_LAST_MOD, "lastmod:");
 
@@ -1058,7 +1072,12 @@ notmuch_database_open_verbose (const char *path,
 	    notmuch->query_parser->add_boolean_prefix (prefix->name,
 						       prefix->prefix);
 	}
-
+#if !HAVE_XAPIAN_FIELD_PROCESSOR
+	for (i = 0; i < ARRAY_SIZE (REGEX_PREFIX); i++) {
+	    prefix_t *prefix = &REGEX_PREFIX[i];
+	    notmuch->query_parser->add_prefix (prefix->name, prefix->prefix);
+	}
+#endif
 	for (i = 0; i < ARRAY_SIZE (PROBABILISTIC_PREFIX); i++) {
 	    prefix_t *prefix = &PROBABILISTIC_PREFIX[i];
 	    notmuch->query_parser->add_prefix (prefix->name, prefix->prefix);
@@ -1138,6 +1157,10 @@ notmuch_database_close (notmuch_database_t *notmuch)
     notmuch->date_field_processor = NULL;
     delete notmuch->query_field_processor;
     notmuch->query_field_processor = NULL;
+    delete notmuch->from_field_processor;
+    notmuch->from_field_processor = NULL;
+    delete notmuch->subject_field_processor;
+    notmuch->subject_field_processor = NULL;
 #endif
 
     return status;
diff --git a/lib/regexp-fields.cc b/lib/regexp-fields.cc
new file mode 100644
index 00000000..b67daf06
--- /dev/null
+++ b/lib/regexp-fields.cc
@@ -0,0 +1,142 @@
+/* regexp-fields.cc - field processor glue for regex supporting fields
+ *
+ * This file is part of notmuch.
+ *
+ * Copyright © 2015 Austin Clements
+ * Copyright © 2016 David Bremner
+ *
+ * This program is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see https://www.gnu.org/licenses/ .
+ *
+ * Author: Austin Clements <aclements@csail.mit.edu>
+ *                David Bremner <david@tethera.net>
+ */
+
+#include "regexp-fields.h"
+#include "notmuch-private.h"
+#include "database-private.h"
+
+#if HAVE_XAPIAN_FIELD_PROCESSOR
+static void
+compile_regex (regex_t &regexp, const char *str)
+{
+    int err = regcomp (&regexp, str, REG_EXTENDED | REG_NOSUB);
+
+    if (err != 0) {
+	size_t len = regerror (err, &regexp, NULL, 0);
+	char *buffer = new char[len];
+	std::string msg;
+	(void) regerror (err, &regexp, buffer, len);
+	msg.assign (buffer, len - 1);
+	delete[] buffer;
+
+	throw Xapian::QueryParserError (msg);
+
+    }
+}
+
+RegexpPostingSource::RegexpPostingSource (Xapian::valueno slot, const std::string &regexp)
+    : slot_ (slot)
+{
+
+    compile_regex (regexp_, regexp.c_str ());
+}
+
+RegexpPostingSource::~RegexpPostingSource ()
+{
+    regfree (&regexp_);
+}
+
+void
+RegexpPostingSource::init (const Xapian::Database &db)
+{
+    db_ = db;
+    it_ = db_.valuestream_begin (slot_);
+    end_ = db.valuestream_end (slot_);
+    started_ = false;
+}
+
+Xapian::doccount
+RegexpPostingSource::get_termfreq_min () const
+{
+    return 0;
+}
+
+Xapian::doccount
+RegexpPostingSource::get_termfreq_est () const
+{
+    return get_termfreq_max () / 2;
+}
+
+Xapian::doccount
+RegexpPostingSource::get_termfreq_max () const
+{
+    return db_.get_value_freq (slot_);
+}
+
+Xapian::docid
+RegexpPostingSource::get_docid () const
+{
+    return it_.get_docid ();
+}
+
+bool
+RegexpPostingSource::at_end () const
+{
+    return it_ == end_;
+}
+
+void
+RegexpPostingSource::next (unused (double min_wt))
+{
+    if (started_ && ! at_end ())
+	++it_;
+    started_ = true;
+
+    for (; ! at_end (); ++it_) {
+	std::string value = *it_;
+	if (regexec (&regexp_, value.c_str (), 0, NULL, 0) == 0)
+	    break;
+    }
+}
+
+static inline Xapian::valueno _find_slot (std::string prefix)
+{
+    if (prefix == "from")
+	return NOTMUCH_VALUE_FROM;
+    else if (prefix == "subject")
+	return NOTMUCH_VALUE_SUBJECT;
+    else
+	throw Xapian::QueryParserError ("unsupported regexp field '" + prefix + "'");
+}
+
+RegexpFieldProcessor::RegexpFieldProcessor (std::string prefix, Xapian::QueryParser &parser_, notmuch_database_t *notmuch_)
+	: slot(_find_slot (prefix)), term_prefix(_find_prefix (prefix.c_str ())), parser(parser_), notmuch(notmuch_)
+{
+    compile_regex (phrase_regex, phrase_regex_str);
+};
+
+Xapian::Query
+RegexpFieldProcessor::operator() (const std::string & str)
+{
+    if (regexec (&phrase_regex, str.c_str (), 0, NULL, 0) == 0){
+	return parser.parse_query (str, NOTMUCH_QUERY_PARSER_FLAGS, term_prefix);
+    } else {
+	if (postings)
+	    delete postings;
+
+	postings = new RegexpPostingSource (slot, str);
+	return Xapian::Query (postings);
+    }
+}
+#endif
diff --git a/lib/regexp-fields.h b/lib/regexp-fields.h
new file mode 100644
index 00000000..d58ee7c3
--- /dev/null
+++ b/lib/regexp-fields.h
@@ -0,0 +1,81 @@
+/* regexp-fields.h - xapian glue for semi-bruteforce regexp search
+ *
+ * This file is part of notmuch.
+ *
+ * Copyright © 2015 Austin Clements
+ * Copyright © 2016 David Bremner
+ *
+ * This program is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see https://www.gnu.org/licenses/ .
+ *
+ * Author: Austin Clements <aclements@csail.mit.edu>
+ *                David Bremner <david@tethera.net>
+ */
+
+#ifndef NOTMUCH_REGEXP_FIELDS_H
+#define NOTMUCH_REGEXP_FIELDS_H
+#if HAVE_XAPIAN_FIELD_PROCESSOR
+#include <sys/types.h>
+#include <regex.h>
+#include <xapian.h>
+#include "notmuch-private.h"
+
+/* A posting source that returns documents where a value matches a
+ * regexp.
+ */
+class RegexpPostingSource : public Xapian::PostingSource
+{
+ protected:
+    const Xapian::valueno slot_;
+    regex_t regexp_;
+    Xapian::Database db_;
+    bool started_;
+    Xapian::ValueIterator it_, end_;
+
+/* No copying */
+    RegexpPostingSource (const RegexpPostingSource &);
+    RegexpPostingSource &operator= (const RegexpPostingSource &);
+
+ public:
+    RegexpPostingSource (Xapian::valueno slot, const std::string &regexp);
+    ~RegexpPostingSource ();
+    void init (const Xapian::Database &db);
+    Xapian::doccount get_termfreq_min () const;
+    Xapian::doccount get_termfreq_est () const;
+    Xapian::doccount get_termfreq_max () const;
+    Xapian::docid get_docid () const;
+    bool at_end () const;
+    void next (unused (double min_wt));
+};
+
+
+class RegexpFieldProcessor : public Xapian::FieldProcessor {
+ protected:
+    Xapian::valueno slot;
+    std::string term_prefix;
+    Xapian::QueryParser &parser;
+    notmuch_database_t *notmuch;
+    RegexpPostingSource *postings = NULL;
+
+    const char *phrase_regex_str="^[[:lower:][:digit:][:blank:]-]+$";
+    regex_t phrase_regex;
+
+ public:
+    RegexpFieldProcessor (std::string prefix, Xapian::QueryParser &parser_, notmuch_database_t *notmuch_);
+
+    ~RegexpFieldProcessor () { delete postings; regfree (&phrase_regex); };
+
+    Xapian::Query operator()(const std::string & str);
+};
+#endif
+#endif /* NOTMUCH_REGEXP_FIELDS_H */
diff --git a/test/T630-regexp-query.sh b/test/T630-regexp-query.sh
new file mode 100755
index 00000000..eba4670f
--- /dev/null
+++ b/test/T630-regexp-query.sh
@@ -0,0 +1,82 @@
+#!/usr/bin/env bash
+test_description='regular expression searches'
+. ./test-lib.sh || exit 1
+
+add_email_corpus
+
+
+if [ $NOTMUCH_HAVE_XAPIAN_FIELD_PROCESSOR -eq 1 ]; then
+
+    notmuch search --output=messages from:cworth > cworth.msg-ids
+
+    test_begin_subtest "regexp from search, case sensitive"
+    notmuch search --output=messages from::^carl > OUTPUT
+    test_expect_equal_file /dev/null OUTPUT
+
+    test_begin_subtest "empty regexp or query"
+    notmuch search --output=messages from::carl or from:cworth > OUTPUT
+    test_expect_equal_file cworth.msg-ids OUTPUT
+
+    test_begin_subtest "non-empty regexp and query"
+    notmuch search  from:cworth@cworth.org and subject:patch > OUTPUT
+    cat <<EOF > EXPECTED
+thread:0000000000000008   2009-11-18 [1/2] Carl Worth| Alex Botero-Lowry; [notmuch] [PATCH] Error out if no query is supplied to search instead of going into an infinite loop (attachment inbox unread)
+thread:0000000000000007   2009-11-18 [1/2] Carl Worth| Ingmar Vanhassel; [notmuch] [PATCH] Typsos (inbox unread)
+thread:0000000000000018   2009-11-18 [1/2] Carl Worth| Jan Janak; [notmuch] [PATCH] Older versions of install do not support -C. (inbox unread)
+thread:0000000000000017   2009-11-18 [1/2] Carl Worth| Keith Packard; [notmuch] [PATCH] Make notmuch-show 'X' (and 'x') commands remove inbox (and unread) tags (inbox unread)
+thread:0000000000000014   2009-11-18 [2/5] Carl Worth| Mikhail Gusarov, Keith Packard; [notmuch] [PATCH 1/2] Close message file after parsing message headers (inbox unread)
+thread:0000000000000001   2009-11-18 [1/1] Stewart Smith; [notmuch] [PATCH] Fix linking with gcc to use g++ to link in C++ libs. (inbox unread)
+EOF
+    test_expect_equal_file EXPECTED OUTPUT
+
+    test_begin_subtest "regexp from search, duplicate term search"
+    notmuch search --output=messages from:cworth > OUTPUT
+    test_expect_equal_file cworth.msg-ids OUTPUT
+
+    test_begin_subtest "long enough regexp matches only desired senders"
+    notmuch search --output=messages 'from:"C.* Wo"' > OUTPUT
+    test_expect_equal_file cworth.msg-ids OUTPUT
+
+    test_begin_subtest "shorter regexp matches one more sender"
+    notmuch search --output=messages 'from:"C.* W"' > OUTPUT
+    (echo id:1258544095-16616-1-git-send-email-chris@chris-wilson.co.uk ; cat cworth.msg-ids) > EXPECTED
+    test_expect_equal_file EXPECTED OUTPUT
+
+    test_begin_subtest "regexp subject search, non-ASCII"
+    notmuch search --output=messages subject:accentué > OUTPUT
+    echo id:877h1wv7mg.fsf@inf-8657.int-evry.fr > EXPECTED
+    test_expect_equal_file EXPECTED OUTPUT
+
+    test_begin_subtest "regexp subject search, punctuation"
+    notmuch search   subject:\'X\' > OUTPUT
+    cat <<EOF > EXPECTED
+thread:0000000000000017   2009-11-18 [2/2] Keith Packard, Carl Worth; [notmuch] [PATCH] Make notmuch-show 'X' (and 'x') commands remove inbox (and unread) tags (inbox unread)
+EOF
+    test_expect_equal_file EXPECTED OUTPUT
+
+    test_begin_subtest "regexp subject search, no punctuation"
+    notmuch search  subject:X > OUTPUT
+    cat <<EOF > EXPECTED
+thread:0000000000000017   2009-11-18 [2/2] Keith Packard, Carl Worth; [notmuch] [PATCH] Make notmuch-show 'X' (and 'x') commands remove inbox (and unread) tags (inbox unread)
+thread:000000000000000f   2009-11-18 [4/4] Jjgod Jiang, Alexander Botero-Lowry; [notmuch] Mac OS X/Darwin compatibility issues (inbox unread)
+EOF
+    test_expect_equal_file EXPECTED OUTPUT
+
+    test_begin_subtest "combine regexp from and subject"
+    notmuch search  subject:-C and from:.an.k > OUTPUT
+    cat <<EOF > EXPECTED
+thread:0000000000000018   2009-11-17 [1/2] Jan Janak| Carl Worth; [notmuch] [PATCH] Older versions of install do not support -C. (inbox unread)
+EOF
+    test_expect_equal_file EXPECTED OUTPUT
+
+    test_begin_subtest "regexp error reporting"
+    notmuch search 'from:unbalanced[' 1>OUTPUT 2>&1
+    cat <<EOF > EXPECTED
+notmuch search: A Xapian exception occurred
+A Xapian exception occurred performing query: Invalid regular expression
+Query string was: from:unbalanced[
+EOF
+    test_expect_equal_file EXPECTED OUTPUT
+fi
+
+test_done
-- 
2.11.0


* [Patch v4] lib: regexp matching in 'subject' and 'from'
  2017-01-21  3:27         ` [WIP] " David Bremner
@ 2017-01-21 13:59           ` David Bremner
  2017-01-25 19:40             ` Tomi Ollila
  0 siblings, 1 reply; 18+ messages in thread
From: David Bremner @ 2017-01-21 13:59 UTC (permalink / raw)
  To: David Bremner, Jani Nikula, notmuch

the idea is that you can run

% notmuch search subject:/<your-favourite-regexp>/
% notmuch search from:/<your-favourite-regexp>/

or

% notmuch search subject:"your usual phrase search"
% notmuch search from:"usual phrase search"

This should also work with bindings, since it extends the query parser.

This is trivial to extend for other value slots, but currently the only
value slots are date, message_id, from, subject, and last_mod. Date is
already searchable, and message_id is not obviously useful to regex
match.

This was originally written by Austin Clements, and ported to Xapian
field processors (from Austin's custom query parser) by yours truly.
---

This version implements the use of // to delimit regular expressions.
I have not tested the code paths with old (pre field processor) xapian.
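
For experimenting with the delimiter handling outside of Xapian, the
following is a rough standalone sketch, not the code in this patch. It
assumes the same POSIX regcomp flags as compile_regex; the helper names
(strip_regex_delimiters, compile_or_explain, regex_field_matches) and
the use of std::runtime_error in place of Xapian::QueryParserError are
made up for illustration.

```cpp
// Sketch only: mimics the /regexp/ delimiter check and the
// regcomp/regerror handling without Xapian. All function names here
// are hypothetical and do not exist in notmuch.
#include <sys/types.h>
#include <regex.h>

#include <stdexcept>
#include <string>
#include <vector>

// Returns true and stores the bare pattern if str looks like /pattern/.
static bool
strip_regex_delimiters (const std::string &str, std::string &pattern)
{
    if (str.size () < 2 || str.front () != '/' || str.back () != '/')
        return false;
    pattern = str.substr (1, str.size () - 2);
    return true;
}

// Compiles pattern as a POSIX extended regexp. On failure returns false
// and stores the regerror() text in error.
static bool
compile_or_explain (regex_t &regexp, const std::string &pattern,
                    std::string &error)
{
    int err = regcomp (&regexp, pattern.c_str (), REG_EXTENDED | REG_NOSUB);
    if (err == 0)
        return true;

    // regerror() reports the size it needs, including the trailing NUL.
    size_t len = regerror (err, &regexp, NULL, 0);
    std::vector<char> buffer (len);
    (void) regerror (err, &regexp, buffer.data (), len);
    error.assign (buffer.data ());
    return false;
}

// Returns true if value matches a /regexp/ query. A query without the
// delimiters is not treated as a regexp here and never matches.
static bool
regex_field_matches (const std::string &query, const std::string &value)
{
    std::string pattern, error;
    regex_t regexp;

    if (! strip_regex_delimiters (query, pattern))
        return false;
    if (! compile_or_explain (regexp, pattern, error))
        throw std::runtime_error (error);

    bool matched = regexec (&regexp, value.c_str (), 0, NULL, 0) == 0;
    regfree (&regexp);
    return matched;
}
```

With these helpers, regex_field_matches ("/OS X/Darwin/", subject) needs
no escaping of the inner '/', because only the first and last characters
are treated as delimiters. A value without the surrounding slashes is
not a regexp as far as this sketch is concerned; the patch falls back to
phrase search in that case.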

 doc/man7/notmuch-search-terms.rst |  27 +++++++-
 lib/Makefile.local                |   1 +
 lib/database-private.h            |   2 +
 lib/database.cc                   |  29 +++++++-
 lib/regexp-fields.cc              | 142 ++++++++++++++++++++++++++++++++++++++
 lib/regexp-fields.h               |  77 +++++++++++++++++++++
 test/T630-regexp-query.sh         |  82 ++++++++++++++++++++++
 7 files changed, 354 insertions(+), 6 deletions(-)
 create mode 100644 lib/regexp-fields.cc
 create mode 100644 lib/regexp-fields.h
 create mode 100755 test/T630-regexp-query.sh

diff --git a/doc/man7/notmuch-search-terms.rst b/doc/man7/notmuch-search-terms.rst
index de93d733..d8527e18 100644
--- a/doc/man7/notmuch-search-terms.rst
+++ b/doc/man7/notmuch-search-terms.rst
@@ -34,10 +34,14 @@ indicate user-supplied values):
 
 -  from:<name-or-address>
 
+-  from:/<regex>/
+
 -  to:<name-or-address>
 
 -  subject:<word-or-quoted-phrase>
 
+-  subject:/<regex>/
+
 -  attachment:<word>
 
 -  mimetype:<word>
@@ -71,6 +75,17 @@ subject of an email. Searching for a phrase in the subject is supported
 by including quotation marks around the phrase, immediately following
 **subject:**.
 
+The **from:** and **subject:** prefixes can also be used to restrict the
+results to those whose from/subject value matches a regular
+expression (see **regex(7)**) delimited with //.
+
+::
+
+   notmuch search 'from:/bob@.*[.]example[.]com/'
+
+Regular expression searches are only available if notmuch is built
+with **Xapian Field Processors** (see below).
+
 The **attachment:** prefix can be used to search for specific filenames
 (or extensions) of attachments to email messages.
 
@@ -220,13 +235,18 @@ Boolean and Probabilistic Prefixes
 ----------------------------------
 
 Xapian (and hence notmuch) prefixes are either **boolean**, supporting
-exact matches like "tag:inbox"  or **probabilistic**, supporting a more flexible **term** based searching. The prefixes currently supported by notmuch are as follows.
-
+exact matches like "tag:inbox" or **probabilistic**, supporting a more
+flexible **term** based searching. Certain **special** prefixes are
+processed by notmuch in a way not strictly fitting either of Xapian's
+built-in styles. The prefixes currently supported by notmuch are as
+follows.
 
 Boolean
    **tag:**, **id:**, **thread:**, **folder:**, **path:**, **property:**
 Probabilistic
-   **from:**, **to:**, **subject:**, **attachment:**, **mimetype:**
+   **to:**, **attachment:**, **mimetype:**
+Special
+   **from:**, **query:**, **subject:**
 
 Terms and phrases
 -----------------
@@ -396,6 +416,7 @@ Currently the following features require field processor support:
 
 - non-range date queries, e.g. "date:today"
 - named queries e.g. "query:my_special_query"
+- regular expression searches, e.g. "subject:/^\\[SPAM\\]/"
 
 SEE ALSO
 ========
diff --git a/lib/Makefile.local b/lib/Makefile.local
index b77e5780..ff812b5f 100644
--- a/lib/Makefile.local
+++ b/lib/Makefile.local
@@ -52,6 +52,7 @@ libnotmuch_cxx_srcs =		\
 	$(dir)/query.cc		\
 	$(dir)/query-fp.cc      \
 	$(dir)/config.cc	\
+	$(dir)/regexp-fields.cc     \
 	$(dir)/thread.cc
 
 libnotmuch_modules := $(libnotmuch_c_srcs:.c=.o) $(libnotmuch_cxx_srcs:.cc=.o)
diff --git a/lib/database-private.h b/lib/database-private.h
index ccc1e9a1..9f5659a9 100644
--- a/lib/database-private.h
+++ b/lib/database-private.h
@@ -190,6 +190,8 @@ struct _notmuch_database {
 #if HAVE_XAPIAN_FIELD_PROCESSOR
     Xapian::FieldProcessor *date_field_processor;
     Xapian::FieldProcessor *query_field_processor;
+    Xapian::FieldProcessor *from_field_processor;
+    Xapian::FieldProcessor *subject_field_processor;
 #endif
     Xapian::ValueRangeProcessor *last_mod_range_processor;
 };
diff --git a/lib/database.cc b/lib/database.cc
index 2d19f20c..8a9ad251 100644
--- a/lib/database.cc
+++ b/lib/database.cc
@@ -21,6 +21,7 @@
 #include "database-private.h"
 #include "parse-time-vrp.h"
 #include "query-fp.h"
+#include "regexp-fields.h"
 #include "string-util.h"
 
 #include <iostream>
@@ -272,12 +273,16 @@ static prefix_t BOOLEAN_PREFIX_EXTERNAL[] = {
     { "folder",			"XFOLDER:" },
 };
 
-static prefix_t PROBABILISTIC_PREFIX[]= {
+static prefix_t REGEX_PREFIX[]= {
     { "from",			"XFROM" },
+    { "subject",		"XSUBJECT"},
+};
+
+static prefix_t PROBABILISTIC_PREFIX[]= {
+
     { "to",			"XTO" },
     { "attachment",		"XATTACHMENT" },
     { "mimetype",		"XMIMETYPE"},
-    { "subject",		"XSUBJECT"},
 };
 
 const char *
@@ -295,6 +300,11 @@ _find_prefix (const char *name)
 	    return BOOLEAN_PREFIX_EXTERNAL[i].prefix;
     }
 
+    for (i = 0; i < ARRAY_SIZE (REGEX_PREFIX); i++) {
+	if (strcmp (name, REGEX_PREFIX[i].name) == 0)
+	    return REGEX_PREFIX[i].prefix;
+    }
+
     for (i = 0; i < ARRAY_SIZE (PROBABILISTIC_PREFIX); i++) {
 	if (strcmp (name, PROBABILISTIC_PREFIX[i].name) == 0)
 	    return PROBABILISTIC_PREFIX[i].prefix;
@@ -1042,6 +1052,10 @@ notmuch_database_open_verbose (const char *path,
 	notmuch->query_parser->add_boolean_prefix("date", notmuch->date_field_processor);
 	notmuch->query_field_processor = new QueryFieldProcessor (*notmuch->query_parser, notmuch);
 	notmuch->query_parser->add_boolean_prefix("query", notmuch->query_field_processor);
+	notmuch->from_field_processor = new RegexpFieldProcessor ("from", *notmuch->query_parser, notmuch);
+	notmuch->subject_field_processor = new RegexpFieldProcessor ("subject", *notmuch->query_parser, notmuch);
+	notmuch->query_parser->add_boolean_prefix("from", notmuch->from_field_processor);
+	notmuch->query_parser->add_boolean_prefix("subject", notmuch->subject_field_processor);
 #endif
 	notmuch->last_mod_range_processor = new Xapian::NumberValueRangeProcessor (NOTMUCH_VALUE_LAST_MOD, "lastmod:");
 
@@ -1058,7 +1072,12 @@ notmuch_database_open_verbose (const char *path,
 	    notmuch->query_parser->add_boolean_prefix (prefix->name,
 						       prefix->prefix);
 	}
-
+#if !HAVE_XAPIAN_FIELD_PROCESSOR
+	for (i = 0; i < ARRAY_SIZE (REGEX_PREFIX); i++) {
+	    prefix_t *prefix = &REGEX_PREFIX[i];
+	    notmuch->query_parser->add_prefix (prefix->name, prefix->prefix);
+	}
+#endif
 	for (i = 0; i < ARRAY_SIZE (PROBABILISTIC_PREFIX); i++) {
 	    prefix_t *prefix = &PROBABILISTIC_PREFIX[i];
 	    notmuch->query_parser->add_prefix (prefix->name, prefix->prefix);
@@ -1138,6 +1157,10 @@ notmuch_database_close (notmuch_database_t *notmuch)
     notmuch->date_field_processor = NULL;
     delete notmuch->query_field_processor;
     notmuch->query_field_processor = NULL;
+    delete notmuch->from_field_processor;
+    notmuch->from_field_processor = NULL;
+    delete notmuch->subject_field_processor;
+    notmuch->subject_field_processor = NULL;
 #endif
 
     return status;
diff --git a/lib/regexp-fields.cc b/lib/regexp-fields.cc
new file mode 100644
index 00000000..8cb1cada
--- /dev/null
+++ b/lib/regexp-fields.cc
@@ -0,0 +1,142 @@
+/* regexp-fields.cc - field processor glue for regex supporting fields
+ *
+ * This file is part of notmuch.
+ *
+ * Copyright © 2015 Austin Clements
+ * Copyright © 2016 David Bremner
+ *
+ * This program is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see https://www.gnu.org/licenses/ .
+ *
+ * Author: Austin Clements <aclements@csail.mit.edu>
+ *                David Bremner <david@tethera.net>
+ */
+
+#include "regexp-fields.h"
+#include "notmuch-private.h"
+#include "database-private.h"
+#include <stdio.h>
+
+#if HAVE_XAPIAN_FIELD_PROCESSOR
+static void
+compile_regex (regex_t &regexp, const char *str)
+{
+    int err = regcomp (&regexp, str, REG_EXTENDED | REG_NOSUB);
+
+    if (err != 0) {
+	size_t len = regerror (err, &regexp, NULL, 0);
+	char *buffer = new char[len];
+	std::string msg;
+	(void) regerror (err, &regexp, buffer, len);
+	msg.assign (buffer, len - 1); /* len counts the trailing NUL */
+	delete[] buffer;
+
+	throw Xapian::QueryParserError (msg);
+
+    }
+}
+
+RegexpPostingSource::RegexpPostingSource (Xapian::valueno slot, const std::string &regexp)
+    : slot_ (slot)
+{
+
+    compile_regex (regexp_, regexp.c_str ());
+}
+
+RegexpPostingSource::~RegexpPostingSource ()
+{
+    regfree (&regexp_);
+}
+
+void
+RegexpPostingSource::init (const Xapian::Database &db)
+{
+    db_ = db;
+    it_ = db_.valuestream_begin (slot_);
+    end_ = db.valuestream_end (slot_);
+    started_ = false;
+}
+
+Xapian::doccount
+RegexpPostingSource::get_termfreq_min () const
+{
+    return 0;
+}
+
+Xapian::doccount
+RegexpPostingSource::get_termfreq_est () const
+{
+    return get_termfreq_max () / 2;
+}
+
+Xapian::doccount
+RegexpPostingSource::get_termfreq_max () const
+{
+    return db_.get_value_freq (slot_);
+}
+
+Xapian::docid
+RegexpPostingSource::get_docid () const
+{
+    return it_.get_docid ();
+}
+
+bool
+RegexpPostingSource::at_end () const
+{
+    return it_ == end_;
+}
+
+void
+RegexpPostingSource::next (unused (double min_wt))
+{
+    if (started_ && ! at_end ())
+	++it_;
+    started_ = true;
+
+    for (; ! at_end (); ++it_) {
+	std::string value = *it_;
+	if (regexec (&regexp_, value.c_str (), 0, NULL, 0) == 0)
+	    break;
+    }
+}
+
+static inline Xapian::valueno _find_slot (std::string prefix)
+{
+    if (prefix == "from")
+	return NOTMUCH_VALUE_FROM;
+    else if (prefix == "subject")
+	return NOTMUCH_VALUE_SUBJECT;
+    else
+	throw Xapian::QueryParserError ("unsupported regexp field '" + prefix + "'");
+}
+
+RegexpFieldProcessor::RegexpFieldProcessor (std::string prefix, Xapian::QueryParser &parser_, notmuch_database_t *notmuch_)
+	: slot(_find_slot (prefix)), term_prefix(_find_prefix (prefix.c_str ())), parser(parser_), notmuch(notmuch_)
+{
+};
+
+Xapian::Query
+RegexpFieldProcessor::operator() (const std::string & str)
+{
+    if (str.size () >= 2 && str.at (0) == '/' && str.at (str.size () - 1) == '/') {
+	RegexpPostingSource *postings = new RegexpPostingSource (slot, str.substr(1,str.size () - 2));
+	return Xapian::Query (postings->release ());
+    } else {
+	/* TODO replace this with a nicer API level triggering of
+	 * phrase parsing, when possible */
+	std::string quoted='"' + str + '"';
+	return parser.parse_query (quoted, NOTMUCH_QUERY_PARSER_FLAGS, term_prefix);
+    }
+}
+#endif
diff --git a/lib/regexp-fields.h b/lib/regexp-fields.h
new file mode 100644
index 00000000..bac11999
--- /dev/null
+++ b/lib/regexp-fields.h
@@ -0,0 +1,77 @@
+/* regexp-fields.h - xapian glue for semi-bruteforce regexp search
+ *
+ * This file is part of notmuch.
+ *
+ * Copyright © 2015 Austin Clements
+ * Copyright © 2016 David Bremner
+ *
+ * This program is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see https://www.gnu.org/licenses/ .
+ *
+ * Author: Austin Clements <aclements@csail.mit.edu>
+ *                David Bremner <david@tethera.net>
+ */
+
+#ifndef NOTMUCH_REGEXP_FIELDS_H
+#define NOTMUCH_REGEXP_FIELDS_H
+#if HAVE_XAPIAN_FIELD_PROCESSOR
+#include <sys/types.h>
+#include <regex.h>
+#include "database-private.h"
+#include "notmuch-private.h"
+
+/* A posting source that returns documents where a value matches a
+ * regexp.
+ */
+class RegexpPostingSource : public Xapian::PostingSource
+{
+ protected:
+    const Xapian::valueno slot_;
+    regex_t regexp_;
+    Xapian::Database db_;
+    bool started_;
+    Xapian::ValueIterator it_, end_;
+
+/* No copying */
+    RegexpPostingSource (const RegexpPostingSource &);
+    RegexpPostingSource &operator= (const RegexpPostingSource &);
+
+ public:
+    RegexpPostingSource (Xapian::valueno slot, const std::string &regexp);
+    ~RegexpPostingSource ();
+    void init (const Xapian::Database &db);
+    Xapian::doccount get_termfreq_min () const;
+    Xapian::doccount get_termfreq_est () const;
+    Xapian::doccount get_termfreq_max () const;
+    Xapian::docid get_docid () const;
+    bool at_end () const;
+    void next (unused (double min_wt));
+};
+
+
+class RegexpFieldProcessor : public Xapian::FieldProcessor {
+ protected:
+    Xapian::valueno slot;
+    std::string term_prefix;
+    Xapian::QueryParser &parser;
+    notmuch_database_t *notmuch;
+
+ public:
+    RegexpFieldProcessor (std::string prefix, Xapian::QueryParser &parser_, notmuch_database_t *notmuch_);
+
+    ~RegexpFieldProcessor () { };
+
+    Xapian::Query operator()(const std::string & str);
+};
+#endif
+#endif /* NOTMUCH_REGEXP_FIELDS_H */
diff --git a/test/T630-regexp-query.sh b/test/T630-regexp-query.sh
new file mode 100755
index 00000000..722af715
--- /dev/null
+++ b/test/T630-regexp-query.sh
@@ -0,0 +1,82 @@
+#!/usr/bin/env bash
+test_description='regular expression searches'
+. ./test-lib.sh || exit 1
+
+add_email_corpus
+
+
+if [ $NOTMUCH_HAVE_XAPIAN_FIELD_PROCESSOR -eq 1 ]; then
+
+    notmuch search --output=messages from:cworth > cworth.msg-ids
+
+    test_begin_subtest "regexp from search, case sensitive"
+    notmuch search --output=messages from:/carl/ > OUTPUT
+    test_expect_equal_file /dev/null OUTPUT
+
+    test_begin_subtest "empty regexp or query"
+    notmuch search --output=messages from:/carl/ or from:/cworth/ > OUTPUT
+    test_expect_equal_file cworth.msg-ids OUTPUT
+
+    test_begin_subtest "non-empty regexp and query"
+    notmuch search  from:/cworth@cworth.org/ and subject:patch > OUTPUT
+    cat <<EOF > EXPECTED
+thread:0000000000000008   2009-11-18 [1/2] Carl Worth| Alex Botero-Lowry; [notmuch] [PATCH] Error out if no query is supplied to search instead of going into an infinite loop (attachment inbox unread)
+thread:0000000000000007   2009-11-18 [1/2] Carl Worth| Ingmar Vanhassel; [notmuch] [PATCH] Typsos (inbox unread)
+thread:0000000000000018   2009-11-18 [1/2] Carl Worth| Jan Janak; [notmuch] [PATCH] Older versions of install do not support -C. (inbox unread)
+thread:0000000000000017   2009-11-18 [1/2] Carl Worth| Keith Packard; [notmuch] [PATCH] Make notmuch-show 'X' (and 'x') commands remove inbox (and unread) tags (inbox unread)
+thread:0000000000000014   2009-11-18 [2/5] Carl Worth| Mikhail Gusarov, Keith Packard; [notmuch] [PATCH 1/2] Close message file after parsing message headers (inbox unread)
+thread:0000000000000001   2009-11-18 [1/1] Stewart Smith; [notmuch] [PATCH] Fix linking with gcc to use g++ to link in C++ libs. (inbox unread)
+EOF
+    test_expect_equal_file EXPECTED OUTPUT
+
+    test_begin_subtest "regexp from search, duplicate term search"
+    notmuch search --output=messages from:/cworth/ > OUTPUT
+    test_expect_equal_file cworth.msg-ids OUTPUT
+
+    test_begin_subtest "long enough regexp matches only desired senders"
+    notmuch search --output=messages 'from:"/C.* Wo/"' > OUTPUT
+    test_expect_equal_file cworth.msg-ids OUTPUT
+
+    test_begin_subtest "shorter regexp matches one more sender"
+    notmuch search --output=messages 'from:"/C.* W/"' > OUTPUT
+    (echo id:1258544095-16616-1-git-send-email-chris@chris-wilson.co.uk ; cat cworth.msg-ids) > EXPECTED
+    test_expect_equal_file EXPECTED OUTPUT
+
+    test_begin_subtest "regexp subject search, non-ASCII"
+    notmuch search --output=messages subject:/accentué/ > OUTPUT
+    echo id:877h1wv7mg.fsf@inf-8657.int-evry.fr > EXPECTED
+    test_expect_equal_file EXPECTED OUTPUT
+
+    test_begin_subtest "regexp subject search, punctuation"
+    notmuch search   subject:/\'X\'/ > OUTPUT
+    cat <<EOF > EXPECTED
+thread:0000000000000017   2009-11-18 [2/2] Keith Packard, Carl Worth; [notmuch] [PATCH] Make notmuch-show 'X' (and 'x') commands remove inbox (and unread) tags (inbox unread)
+EOF
+    test_expect_equal_file EXPECTED OUTPUT
+
+    test_begin_subtest "regexp subject search, no punctuation"
+    notmuch search  subject:/X/ > OUTPUT
+    cat <<EOF > EXPECTED
+thread:0000000000000017   2009-11-18 [2/2] Keith Packard, Carl Worth; [notmuch] [PATCH] Make notmuch-show 'X' (and 'x') commands remove inbox (and unread) tags (inbox unread)
+thread:000000000000000f   2009-11-18 [4/4] Jjgod Jiang, Alexander Botero-Lowry; [notmuch] Mac OS X/Darwin compatibility issues (inbox unread)
+EOF
+    test_expect_equal_file EXPECTED OUTPUT
+
+    test_begin_subtest "combine regexp from and subject"
+    notmuch search  subject:/-C/ and from:/.an.k/ > OUTPUT
+    cat <<EOF > EXPECTED
+thread:0000000000000018   2009-11-17 [1/2] Jan Janak| Carl Worth; [notmuch] [PATCH] Older versions of install do not support -C. (inbox unread)
+EOF
+    test_expect_equal_file EXPECTED OUTPUT
+
+    test_begin_subtest "regexp error reporting"
+    notmuch search 'from:/unbalanced[/' 1>OUTPUT 2>&1
+    cat <<EOF > EXPECTED
+notmuch search: A Xapian exception occurred
+A Xapian exception occurred performing query: Invalid regular expression
+Query string was: from:/unbalanced[/
+EOF
+    test_expect_equal_file EXPECTED OUTPUT
+fi
+
+test_done
-- 
2.11.0


* Re: [Patch v4] lib: regexp matching in 'subject' and 'from'
  2017-01-21 13:59           ` [Patch v4] " David Bremner
@ 2017-01-25 19:40             ` Tomi Ollila
  2017-01-26  2:21               ` David Bremner
  2017-01-29 11:06               ` Jani Nikula
  0 siblings, 2 replies; 18+ messages in thread
From: Tomi Ollila @ 2017-01-25 19:40 UTC (permalink / raw)
  To: David Bremner, notmuch

On Sat, Jan 21 2017, David Bremner <david@tethera.net> wrote:

> the idea is that you can run
>
> % notmuch search subject:/<your-favourite-regexp>/
> % notmuch search from:/<your-favourite-regexp>/

I like this interface.

>
> or
>
> % notmuch search subject:"your usual phrase search"
> % notmuch search from:"usual phrase search"
>
> This should also work with bindings, since it extends the query parser.
>
> This is trivial to extend for other value slots, but currently the only
> value slots are date, message_id, from, subject, and last_mod. Date is
> already searchable, and message_id is not obviously useful to regex
> match.

Why would message_id not be useful to regex match? I can come up with quite
a few use cases... but if there are technical difficulties, then those
should be mentioned instead.

Maybe this commit message should mention that xapian with field processors
(1.4.x) is required for this feature -- and the manual page could emphasize
it a bit better?

Probably '//' is used to escape '/' -- should such a character ever be
needed in a regex search.

>
> This was originally written by Austin Clements, and ported to Xapian
> field processors (from Austin's custom query parser) by yours truly.
> ---
>
> This version implements the use of // to delimit regular expressions.
> I have not tested the code paths with old (pre field processor) xapian.

Fedora 25 has 1.2.24 -- T630 tests are skipped. It looks like these changes
did not increase the failure count there.

Some (mostly whitespace nitpicking) comments below:


>
>  doc/man7/notmuch-search-terms.rst |  27 +++++++-
>  lib/Makefile.local                |   1 +
>  lib/database-private.h            |   2 +
>  lib/database.cc                   |  29 +++++++-
>  lib/regexp-fields.cc              | 142 ++++++++++++++++++++++++++++++++++++++
>  lib/regexp-fields.h               |  77 +++++++++++++++++++++
>  test/T630-regexp-query.sh         |  82 ++++++++++++++++++++++
>  7 files changed, 354 insertions(+), 6 deletions(-)
>  create mode 100644 lib/regexp-fields.cc
>  create mode 100644 lib/regexp-fields.h
>  create mode 100755 test/T630-regexp-query.sh
>
> diff --git a/doc/man7/notmuch-search-terms.rst b/doc/man7/notmuch-search-terms.rst
> index de93d733..d8527e18 100644
> --- a/doc/man7/notmuch-search-terms.rst
> +++ b/doc/man7/notmuch-search-terms.rst
> @@ -34,10 +34,14 @@ indicate user-supplied values):
>  
>  -  from:<name-or-address>
>  
> +-  from:/<regex>/
> +
>  -  to:<name-or-address>
>  
>  -  subject:<word-or-quoted-phrase>
>  
> +-  subject:/<regex>/
> +
>  -  attachment:<word>
>  
>  -  mimetype:<word>
> @@ -71,6 +75,17 @@ subject of an email. Searching for a phrase in the subject is supported
>  by including quotation marks around the phrase, immediately following
>  **subject:**.
>  
> +The **from:** and **subject:** prefixes can also be used to restrict the
> +results to those whose from/subject value matches a regular
> +expression (see **regex(7)**) delimited with //.
> +
> +::
> +
> +   notmuch search 'from:/bob@.*[.]example[.]com/'
> +
> +Regular expression searches are only available if notmuch is built
> +with **Xapian Field Processors** (see below).

And the poor user stopped reading far before this line, desperately trying
the regex searches... >;/ so IMO this requirement should be mentioned earlier.

> +
>  The **attachment:** prefix can be used to search for specific filenames
>  (or extensions) of attachments to email messages.
>  
> @@ -220,13 +235,18 @@ Boolean and Probabilistic Prefixes
>  ----------------------------------
>  
>  Xapian (and hence notmuch) prefixes are either **boolean**, supporting
> -exact matches like "tag:inbox"  or **probabilistic**, supporting a more flexible **term** based searching. The prefixes currently supported by notmuch are as follows.
> -
> +exact matches like "tag:inbox" or **probabilistic**, supporting a more
> +flexible **term** based searching. Certain **special** prefixes are
> +processed by notmuch in a way not strictly fitting either of Xapian's
> +built-in styles. The prefixes currently supported by notmuch are as
> +follows.
>  
>  Boolean
>     **tag:**, **id:**, **thread:**, **folder:**, **path:**, **property:**
>  Probabilistic
> -   **from:**, **to:**, **subject:**, **attachment:**, **mimetype:**
> +   **to:**, **attachment:**, **mimetype:**
> +Special
> +   **from:**, **query:**, **subject:**
>  
>  Terms and phrases
>  -----------------
> @@ -396,6 +416,7 @@ Currently the following features require field processor support:
>  
>  - non-range date queries, e.g. "date:today"
>  - named queries e.g. "query:my_special_query"
> +- regular expression searches, e.g. "subject:/^\\[SPAM\\]/"
>  
>  SEE ALSO
>  ========
> diff --git a/lib/Makefile.local b/lib/Makefile.local
> index b77e5780..ff812b5f 100644
> --- a/lib/Makefile.local
> +++ b/lib/Makefile.local
> @@ -52,6 +52,7 @@ libnotmuch_cxx_srcs =		\
>  	$(dir)/query.cc		\
>  	$(dir)/query-fp.cc      \
>  	$(dir)/config.cc	\
> +	$(dir)/regexp-fields.cc     \

Spaces instead of a TAB above -- TAB is used more often (and the \:s are usually aligned)

>  	$(dir)/thread.cc
>  
>  libnotmuch_modules := $(libnotmuch_c_srcs:.c=.o) $(libnotmuch_cxx_srcs:.cc=.o)
> diff --git a/lib/database-private.h b/lib/database-private.h
> index ccc1e9a1..9f5659a9 100644
> --- a/lib/database-private.h
> +++ b/lib/database-private.h
> @@ -190,6 +190,8 @@ struct _notmuch_database {
>  #if HAVE_XAPIAN_FIELD_PROCESSOR
>      Xapian::FieldProcessor *date_field_processor;
>      Xapian::FieldProcessor *query_field_processor;
> +    Xapian::FieldProcessor *from_field_processor;
> +    Xapian::FieldProcessor *subject_field_processor;
>  #endif
>      Xapian::ValueRangeProcessor *last_mod_range_processor;
>  };
> diff --git a/lib/database.cc b/lib/database.cc
> index 2d19f20c..8a9ad251 100644
> --- a/lib/database.cc
> +++ b/lib/database.cc
> @@ -21,6 +21,7 @@
>  #include "database-private.h"
>  #include "parse-time-vrp.h"
>  #include "query-fp.h"
> +#include "regexp-fields.h"
>  #include "string-util.h"
>  
>  #include <iostream>
> @@ -272,12 +273,16 @@ static prefix_t BOOLEAN_PREFIX_EXTERNAL[] = {
>      { "folder",			"XFOLDER:" },
>  };
>  
> -static prefix_t PROBABILISTIC_PREFIX[]= {
> +static prefix_t REGEX_PREFIX[]= {
>      { "from",			"XFROM" },
> +    { "subject",		"XSUBJECT"},
> +};
> +
> +static prefix_t PROBABILISTIC_PREFIX[]= {
> +

empty line ^

>      { "to",			"XTO" },
>      { "attachment",		"XATTACHMENT" },
>      { "mimetype",		"XMIMETYPE"},
> -    { "subject",		"XSUBJECT"},
>  };
>  
>  const char *
> @@ -295,6 +300,11 @@ _find_prefix (const char *name)
>  	    return BOOLEAN_PREFIX_EXTERNAL[i].prefix;
>      }
>  
> +    for (i = 0; i < ARRAY_SIZE (REGEX_PREFIX); i++) {
> +	if (strcmp (name, REGEX_PREFIX[i].name) == 0)
> +	    return REGEX_PREFIX[i].prefix;
> +    }
> +
>      for (i = 0; i < ARRAY_SIZE (PROBABILISTIC_PREFIX); i++) {
>  	if (strcmp (name, PROBABILISTIC_PREFIX[i].name) == 0)
>  	    return PROBABILISTIC_PREFIX[i].prefix;
> @@ -1042,6 +1052,10 @@ notmuch_database_open_verbose (const char *path,
>  	notmuch->query_parser->add_boolean_prefix("date", notmuch->date_field_processor);
>  	notmuch->query_field_processor = new QueryFieldProcessor (*notmuch->query_parser, notmuch);
>  	notmuch->query_parser->add_boolean_prefix("query", notmuch->query_field_processor);
> +	notmuch->from_field_processor = new RegexpFieldProcessor ("from", *notmuch->query_parser, notmuch);
> +	notmuch->subject_field_processor = new RegexpFieldProcessor ("subject", *notmuch->query_parser, notmuch);
> +	notmuch->query_parser->add_boolean_prefix("from", notmuch->from_field_processor);
> +	notmuch->query_parser->add_boolean_prefix("subject", notmuch->subject_field_processor);
>  #endif
>  	notmuch->last_mod_range_processor = new Xapian::NumberValueRangeProcessor (NOTMUCH_VALUE_LAST_MOD, "lastmod:");
>  
> @@ -1058,7 +1072,12 @@ notmuch_database_open_verbose (const char *path,
>  	    notmuch->query_parser->add_boolean_prefix (prefix->name,
>  						       prefix->prefix);
>  	}
> -
> +#if !HAVE_XAPIAN_FIELD_PROCESSOR
> +	for (i = 0; i < ARRAY_SIZE (REGEX_PREFIX); i++) {
> +	    prefix_t *prefix = &REGEX_PREFIX[i];
> +	    notmuch->query_parser->add_prefix (prefix->name, prefix->prefix);
> +	}
> +#endif
>  	for (i = 0; i < ARRAY_SIZE (PROBABILISTIC_PREFIX); i++) {
>  	    prefix_t *prefix = &PROBABILISTIC_PREFIX[i];
>  	    notmuch->query_parser->add_prefix (prefix->name, prefix->prefix);
> @@ -1138,6 +1157,10 @@ notmuch_database_close (notmuch_database_t *notmuch)
>      notmuch->date_field_processor = NULL;
>      delete notmuch->query_field_processor;
>      notmuch->query_field_processor = NULL;
> +    delete notmuch->from_field_processor;
> +    notmuch->from_field_processor = NULL;
> +    delete notmuch->subject_field_processor;
> +    notmuch->subject_field_processor = NULL;
>  #endif
>  
>      return status;
> diff --git a/lib/regexp-fields.cc b/lib/regexp-fields.cc
> new file mode 100644
> index 00000000..8cb1cada
> --- /dev/null
> +++ b/lib/regexp-fields.cc
> @@ -0,0 +1,142 @@
> +/* regexp-fields.cc - field processor glue for regex supporting fields
> + *
> + * This file is part of notmuch.
> + *
> + * Copyright © 2015 Austin Clements
> + * Copyright © 2016 David Bremner
> + *
> + * This program is free software: you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation, either version 3 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program.  If not, see https://www.gnu.org/licenses/ .
> + *
> + * Author: Austin Clements <aclements@csail.mit.edu>
> + *                David Bremner <david@tethera.net>
> + */
> +
> +#include "regexp-fields.h"
> +#include "notmuch-private.h"
> +#include "database-private.h"
> +#include <stdio.h>
> +
> +#if HAVE_XAPIAN_FIELD_PROCESSOR
> +static void
> +compile_regex (regex_t &regexp, const char *str)
> +{
> +    int err = regcomp (&regexp, str, REG_EXTENDED | REG_NOSUB);
> +
> +    if (err != 0) {
> +	size_t len = regerror (err, &regexp, NULL, 0);
> +	char *buffer = new char[len];
> +	std::string msg;
> +	(void) regerror (err, &regexp, buffer, len);
> +	msg.assign (buffer, len);
> +	delete buffer;
> +
> +	throw Xapian::QueryParserError (msg);
> +

empty line ^

> +    }
> +}
> +
> +RegexpPostingSource::RegexpPostingSource (Xapian::valueno slot, const std::string &regexp)
> +    : slot_ (slot)
> +{
> +

ditto

> +    compile_regex (regexp_, regexp.c_str ());
> +}
> +
> +RegexpPostingSource::~RegexpPostingSource ()
> +{
> +    regfree (&regexp_);
> +}
> +
> +void
> +RegexpPostingSource::init (const Xapian::Database &db)
> +{
> +    db_ = db;
> +    it_ = db_.valuestream_begin (slot_);
> +    end_ = db.valuestream_end (slot_);
> +    started_ = false;
> +}
> +
> +Xapian::doccount
> +RegexpPostingSource::get_termfreq_min () const
> +{
> +    return 0;
> +}
> +
> +Xapian::doccount
> +RegexpPostingSource::get_termfreq_est () const
> +{
> +    return get_termfreq_max () / 2;
> +}
> +
> +Xapian::doccount
> +RegexpPostingSource::get_termfreq_max () const
> +{
> +    return db_.get_value_freq (slot_);
> +}
> +
> +Xapian::docid
> +RegexpPostingSource::get_docid () const
> +{
> +    return it_.get_docid ();
> +}
> +
> +bool
> +RegexpPostingSource::at_end () const
> +{
> +    return it_ == end_;
> +}
> +
> +void
> +RegexpPostingSource::next (unused (double min_wt))
> +{
> +    if (started_ && ! at_end ())
> +	++it_;
> +    started_ = true;
> +
> +    for (; ! at_end (); ++it_) {
> +	std::string value = *it_;
> +	if (regexec (&regexp_, value.c_str (), 0, NULL, 0) == 0)
> +	    break;
> +    }
> +}
> +
> +static inline Xapian::valueno _find_slot (std::string prefix)
> +{
> +    if (prefix == "from")
> +	return NOTMUCH_VALUE_FROM;
> +    else if (prefix == "subject")
> +	return NOTMUCH_VALUE_SUBJECT;
> +    else
> +	throw Xapian::QueryParserError ("unsupported regexp field '" + prefix + "'");
> +}
> +
> +RegexpFieldProcessor::RegexpFieldProcessor (std::string prefix, Xapian::QueryParser &parser_, notmuch_database_t *notmuch_)
> +	: slot(_find_slot (prefix)), term_prefix(_find_prefix (prefix.c_str ())), parser(parser_), notmuch(notmuch_)
> +{
> +};
> +
> +Xapian::Query
> +RegexpFieldProcessor::operator() (const std::string & str)
> +{
> +    if (str.at (0) == '/' && str.at (str.size () - 1)){
> +	RegexpPostingSource *postings = new RegexpPostingSource (slot, str.substr(1,str.size () - 2));
> +	return Xapian::Query (postings->release ());
> +    } else {
> +	/* TODO replace this with a nicer API level triggering of
> +	 * phrase parsing, when possible */
> +	std::string quoted='"' + str + '"';
> +	return parser.parse_query (quoted, NOTMUCH_QUERY_PARSER_FLAGS, term_prefix);
> +    }
> +}
> +#endif
> diff --git a/lib/regexp-fields.h b/lib/regexp-fields.h
> new file mode 100644
> index 00000000..bac11999
> --- /dev/null
> +++ b/lib/regexp-fields.h
> @@ -0,0 +1,77 @@
> +/* regex-fields.h - xapian glue for semi-bruteforce regexp search
> + *
> + * This file is part of notmuch.
> + *
> + * Copyright © 2015 Austin Clements
> + * Copyright © 2016 David Bremner
> + *
> + * This program is free software: you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation, either version 3 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program.  If not, see https://www.gnu.org/licenses/ .
> + *
> + * Author: Austin Clements <aclements@csail.mit.edu>
> + *                David Bremner <david@tethera.net>
> + */
> +
> +#ifndef NOTMUCH_REGEXP_FIELDS_H
> +#define NOTMUCH_REGEXP_FIELDS_H
> +#if HAVE_XAPIAN_FIELD_PROCESSOR
> +#include <sys/types.h>
> +#include <regex.h>
> +#include "database-private.h"
> +#include "notmuch-private.h"
> +
> +/* A posting source that returns documents where a value matches a
> + * regexp.
> + */
> +class RegexpPostingSource : public Xapian::PostingSource
> +{
> + protected:
> +    const Xapian::valueno slot_;
> +    regex_t regexp_;
> +    Xapian::Database db_;
> +    bool started_;
> +    Xapian::ValueIterator it_, end_;
> +
> +/* No copying */
> +    RegexpPostingSource (const RegexpPostingSource &);
> +    RegexpPostingSource &operator= (const RegexpPostingSource &);
> +
> + public:
> +    RegexpPostingSource (Xapian::valueno slot, const std::string &regexp);
> +    ~RegexpPostingSource ();
> +    void init (const Xapian::Database &db);
> +    Xapian::doccount get_termfreq_min () const;
> +    Xapian::doccount get_termfreq_est () const;
> +    Xapian::doccount get_termfreq_max () const;
> +    Xapian::docid get_docid () const;
> +    bool at_end () const;
> +    void next (unused (double min_wt));
> +};
> +
> +
> +class RegexpFieldProcessor : public Xapian::FieldProcessor {
> + protected:
> +    Xapian::valueno slot;
> +    std::string term_prefix;
> +    Xapian::QueryParser &parser;
> +    notmuch_database_t *notmuch;
> +
> + public:
> +    RegexpFieldProcessor (std::string prefix, Xapian::QueryParser &parser_, notmuch_database_t *notmuch_);
> +
> +    ~RegexpFieldProcessor () { };
> +
> +    Xapian::Query operator()(const std::string & str);
> +};
> +#endif
> +#endif /* NOTMUCH_REGEXP_FIELDS_H */
> diff --git a/test/T630-regexp-query.sh b/test/T630-regexp-query.sh
> new file mode 100755
> index 00000000..722af715
> --- /dev/null
> +++ b/test/T630-regexp-query.sh
> @@ -0,0 +1,82 @@
> +#!/usr/bin/env bash
> +test_description='regular expression searches'
> +. ./test-lib.sh || exit 1
> +
> +add_email_corpus
> +
> +
> +if [ $NOTMUCH_HAVE_XAPIAN_FIELD_PROCESSOR -eq 1 ]; then
> +
> +    notmuch search --output=messages from:cworth > cworth.msg-ids
> +
> +    test_begin_subtest "regexp from search, case sensitive"
> +    notmuch search --output=messages from:/carl/ > OUTPUT
> +    test_expect_equal_file /dev/null OUTPUT
> +
> +    test_begin_subtest "empty regexp or query"
> +    notmuch search --output=messages from:/carl/ or from:/cworth/ > OUTPUT
> +    test_expect_equal_file cworth.msg-ids OUTPUT
> +
> +    test_begin_subtest "non-empty regexp and query"
> +    notmuch search  from:/cworth@cworth.org/ and subject:patch > OUTPUT
> +    cat <<EOF > EXPECTED
> +thread:0000000000000008   2009-11-18 [1/2] Carl Worth| Alex Botero-Lowry; [notmuch] [PATCH] Error out if no query is supplied to search instead of going into an infinite loop (attachment inbox unread)
> +thread:0000000000000007   2009-11-18 [1/2] Carl Worth| Ingmar Vanhassel; [notmuch] [PATCH] Typsos (inbox unread)
> +thread:0000000000000018   2009-11-18 [1/2] Carl Worth| Jan Janak; [notmuch] [PATCH] Older versions of install do not support -C. (inbox unread)
> +thread:0000000000000017   2009-11-18 [1/2] Carl Worth| Keith Packard; [notmuch] [PATCH] Make notmuch-show 'X' (and 'x') commands remove inbox (and unread) tags (inbox unread)
> +thread:0000000000000014   2009-11-18 [2/5] Carl Worth| Mikhail Gusarov, Keith Packard; [notmuch] [PATCH 1/2] Close message file after parsing message headers (inbox unread)
> +thread:0000000000000001   2009-11-18 [1/1] Stewart Smith; [notmuch] [PATCH] Fix linking with gcc to use g++ to link in C++ libs. (inbox unread)
> +EOF
> +    test_expect_equal_file EXPECTED OUTPUT
> +
> +    test_begin_subtest "regexp from search, duplicate term search"
> +    notmuch search --output=messages from:/cworth/ > OUTPUT
> +    test_expect_equal_file cworth.msg-ids OUTPUT
> +
> +    test_begin_subtest "long enough regexp matches only desired senders"
> +    notmuch search --output=messages 'from:"/C.* Wo/"' > OUTPUT
> +    test_expect_equal_file cworth.msg-ids OUTPUT
> +
> +    test_begin_subtest "shorter regexp matches one more sender"
> +    notmuch search --output=messages 'from:"/C.* W/"' > OUTPUT
> +    (echo id:1258544095-16616-1-git-send-email-chris@chris-wilson.co.uk ; cat cworth.msg-ids) > EXPECTED

The above doesn't need to be executed in a subshell:

  { echo id:1258544095-16616-1-git-send-email-chris@chris-wilson.co.uk; cat cworth.msg-ids; } > EXPECTED

does it in the same shell.


> +    test_expect_equal_file EXPECTED OUTPUT
> +
> +    test_begin_subtest "regexp subject search, non-ASCII"
> +    notmuch search --output=messages subject:/accentué/ > OUTPUT
> +    echo id:877h1wv7mg.fsf@inf-8657.int-evry.fr > EXPECTED
> +    test_expect_equal_file EXPECTED OUTPUT
> +
> +    test_begin_subtest "regexp subject search, punctuation"
> +    notmuch search   subject:/\'X\'/ > OUTPUT
> +    cat <<EOF > EXPECTED
> +thread:0000000000000017   2009-11-18 [2/2] Keith Packard, Carl Worth; [notmuch] [PATCH] Make notmuch-show 'X' (and 'x') commands remove inbox (and unread) tags (inbox unread)
> +EOF
> +    test_expect_equal_file EXPECTED OUTPUT
> +
> +    test_begin_subtest "regexp subject search, no punctuation"
> +    notmuch search  subject:/X/ > OUTPUT
> +    cat <<EOF > EXPECTED
> +thread:0000000000000017   2009-11-18 [2/2] Keith Packard, Carl Worth; [notmuch] [PATCH] Make notmuch-show 'X' (and 'x') commands remove inbox (and unread) tags (inbox unread)
> +thread:000000000000000f   2009-11-18 [4/4] Jjgod Jiang, Alexander Botero-Lowry; [notmuch] Mac OS X/Darwin compatibility issues (inbox unread)
> +EOF
> +    test_expect_equal_file EXPECTED OUTPUT
> +
> +    test_begin_subtest "combine regexp from and subject"
> +    notmuch search  subject:/-C/ and from:/.an.k/ > OUTPUT
> +    cat <<EOF > EXPECTED
> +thread:0000000000000018   2009-11-17 [1/2] Jan Janak| Carl Worth; [notmuch] [PATCH] Older versions of install do not support -C. (inbox unread)
> +EOF
> +    test_expect_equal_file EXPECTED OUTPUT
> +
> +    test_begin_subtest "regexp error reporting"
> +    notmuch search 'from:/unbalanced[/' 1>OUTPUT 2>&1
> +    cat <<EOF > EXPECTED
> +notmuch search: A Xapian exception occurred
> +A Xapian exception occurred performing query: Invalid regular expression
> +Query string was: from:/unbalanced[/
> +EOF
> +    test_expect_equal_file EXPECTED OUTPUT
> +fi
> +
> +test_done
> -- 
> 2.11.0


* Re: [Patch v4] lib: regexp matching in 'subject' and 'from'
  2017-01-25 19:40             ` Tomi Ollila
@ 2017-01-26  2:21               ` David Bremner
  2017-01-29 11:23                 ` Jani Nikula
  2017-01-29 11:06               ` Jani Nikula
  1 sibling, 1 reply; 18+ messages in thread
From: David Bremner @ 2017-01-26  2:21 UTC (permalink / raw)
  To: Tomi Ollila, notmuch

Tomi Ollila <tomi.ollila@iki.fi> writes:

>
> Why would message_id not be useful to regex match? I can come up with quite
> a few use cases... but if there are technical difficulties... then that
> should be mentioned instead.

I'll have a look. Since the first version of this patch (when that
message was written), people have actually asked for some kind of
wildcard matching of message-ids.

>
> maybe this commit message should mention that xapian with field processors
> (1.4.x) is required for this feature -- and emphasize it a bit better in
> the manual page?
>
> Probably '//' is used to escape '/' -- should such a character ever be needed
> in a regex search.
>

Currently no escaping is needed because the code only looks at the first
and last characters of the string (the usual xapian/shell rules mean that
quoting with "" might be needed).

The following seem to work as hoped

# match a / with a space before it

% notmuch search 'subject:"/ //"'

# just a slash

% notmuch search subject:///

# anchored slash

% notmuch search subject:/^//
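
To illustrate why these work without escaping, here is a minimal standalone
sketch (not the notmuch code; inner_regex and matches are hypothetical
helpers made up for this example). It assumes only the outermost pair of
slashes is stripped and that the remainder is handed to POSIX regcomp with
the same flags the patch uses.

```cpp
#include <regex.h>
#include <stdexcept>
#include <string>

// Hypothetical helper: drop the first and last character of a value
// such as "/^//" and return what lies between them ("^/").
static std::string
inner_regex (const std::string &value)
{
    if (value.size () < 2)
        throw std::invalid_argument ("value too short to hold delimiters");
    return value.substr (1, value.size () - 2);
}

// Hypothetical helper: true if the POSIX extended regexp matches text.
static bool
matches (const std::string &pattern, const std::string &text)
{
    regex_t re;

    if (regcomp (&re, pattern.c_str (), REG_EXTENDED | REG_NOSUB) != 0)
        throw std::invalid_argument ("invalid regular expression");
    bool found = regexec (&re, text.c_str (), 0, NULL, 0) == 0;
    regfree (&re);
    return found;
}
```

With these helpers, "///" yields the regexp "/" and "/^//" yields "^/", so
any slash in the middle of the value is just an ordinary regexp character.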

The trailing slash is actually decorative; we could drop it. Actually
*blush* I just noticed the current code is missing the comparison of the
last character against '/' in this line

         if (str.at (0) == '/' && str.at (str.size () - 1)){

_if_ that line is fixed, then it will have the slightly odd behaviour of

subject:/blah

doing a non-regex search

We could also throw an error for that case, maybe that's the best option.
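
A rough sketch of the "throw an error" option, as a standalone function
rather than a change to the patch. The names FieldKind and classify_field
are hypothetical, and std::invalid_argument stands in for whatever
exception the real field processor would raise (the patch throws
Xapian::QueryParserError elsewhere).

```cpp
#include <stdexcept>
#include <string>

enum class FieldKind { Regexp, Phrase };

// Hypothetical classifier for the text following "subject:" or "from:".
static FieldKind
classify_field (const std::string &str)
{
    // An empty value, or one not starting with '/', is a phrase search.
    if (str.empty () || str[0] != '/')
        return FieldKind::Phrase;

    // Opening slash with a matching closing slash: treat as a regexp.
    if (str.size () >= 2 && str[str.size () - 1] == '/')
        return FieldKind::Regexp;

    // Opening slash only, e.g. "/blah": report it rather than
    // silently falling back to a phrase search.
    throw std::invalid_argument ("unmatched '/' in regexp: " + str);
}
```

Checking for an empty string first also avoids calling at (0) on an empty
value, which would throw std::out_of_range.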


* Re: [Patch v4] lib: regexp matching in 'subject' and 'from'
  2017-01-25 19:40             ` Tomi Ollila
  2017-01-26  2:21               ` David Bremner
@ 2017-01-29 11:06               ` Jani Nikula
  1 sibling, 0 replies; 18+ messages in thread
From: Jani Nikula @ 2017-01-29 11:06 UTC (permalink / raw)
  To: Tomi Ollila, David Bremner, notmuch

On Wed, 25 Jan 2017, Tomi Ollila <tomi.ollila@iki.fi> wrote:
> On Sat, Jan 21 2017, David Bremner <david@tethera.net> wrote:
>
>> the idea is that you can run
>>
>> % notmuch search subject:/<your-favourite-regexp>/
>> % notmuch search from:/<your-favourite-regexp>/
>
> I like this interface.

FWIW I think this is superior to the earlier alternatives too.

I think people would like to use regexps (or globbing) for path: and
folder: queries. Is there a risk of ambiguity between normal path: and
folder: searches and regexp searches due to "/"? I suppose the normal
queries never begin with "/" for them (due to being relative to database
path, not absolute) but is that confusing?

BR,
Jani.


>
>>
>> or
>>
>> % notmuch search subject:"your usual phrase search"
>> % notmuch search from:"usual phrase search"
>>
>> This should also work with bindings, since it extends the query parser.
>>
>> This is trivial to extend for other value slots, but currently the only
>> value slots are date, message_id, from, subject, and last_mod. Date is
>> already searchable, and message_id is not obviously useful to regex
>> match.
>
> Why would message_id not be useful to regex match? I can come up with quite
> a few use cases... but if there are technical difficulties... then that
> should be mentioned instead.
>
> maybe this commit message should mention that xapian with field processors
> (1.4.x) is required for this feature -- and emphasize it a bit better in
> the manual page?
>
> Probably '//' is used to escape '/' -- should such a character ever be needed
> in a regex search.
>
>>
>> This was originally written by Austin Clements, and ported to Xapian
>> field processors (from Austin's custom query parser) by yours truly.
>> ---
>>
>> This version implements the use of // to delimit regular expressions.
>> I have not tested the code paths with old (pre field processor) xapian.
>
> Fedora 25 has 1.2.24 -- T630 tests are skipped. It looks like these changes
> did not increase the failure count there.
>
> Some (mostly whitespace nitpicking) comments below:
>
>
>>
>>  doc/man7/notmuch-search-terms.rst |  27 +++++++-
>>  lib/Makefile.local                |   1 +
>>  lib/database-private.h            |   2 +
>>  lib/database.cc                   |  29 +++++++-
>>  lib/regexp-fields.cc              | 142 ++++++++++++++++++++++++++++++++++++++
>>  lib/regexp-fields.h               |  77 +++++++++++++++++++++
>>  test/T630-regexp-query.sh         |  82 ++++++++++++++++++++++
>>  7 files changed, 354 insertions(+), 6 deletions(-)
>>  create mode 100644 lib/regexp-fields.cc
>>  create mode 100644 lib/regexp-fields.h
>>  create mode 100755 test/T630-regexp-query.sh
>>
>> diff --git a/doc/man7/notmuch-search-terms.rst b/doc/man7/notmuch-search-terms.rst
>> index de93d733..d8527e18 100644
>> --- a/doc/man7/notmuch-search-terms.rst
>> +++ b/doc/man7/notmuch-search-terms.rst
>> @@ -34,10 +34,14 @@ indicate user-supplied values):
>>  
>>  -  from:<name-or-address>
>>  
>> +-  from:/<regex>/
>> +
>>  -  to:<name-or-address>
>>  
>>  -  subject:<word-or-quoted-phrase>
>>  
>> +-  subject:/<regex>/
>> +
>>  -  attachment:<word>
>>  
>>  -  mimetype:<word>
>> @@ -71,6 +75,17 @@ subject of an email. Searching for a phrase in the subject is supported
>>  by including quotation marks around the phrase, immediately following
>>  **subject:**.
>>  
>> +The **from:** and **subject** prefix can be also used to restrict the
>> +results to those whose from/subject value matches a regular
>> +expression (see **regex(7)**) delimited with //.
>> +
>> +::
>> +
>> +   notmuch search 'from:/bob@.*[.]example[.]com/'
>> +
>> +Regular expression searches are only available if notmuch is built
>> +with **Xapian Field Processors** (see below).
>
> And the poor user stopped reading far before this line, desperately trying
> the regex searches... >;/ so IMO this requirement should be notified earlier.
>
>> +
>>  The **attachment:** prefix can be used to search for specific filenames
>>  (or extensions) of attachments to email messages.
>>  
>> @@ -220,13 +235,18 @@ Boolean and Probabilistic Prefixes
>>  ----------------------------------
>>  
>>  Xapian (and hence notmuch) prefixes are either **boolean**, supporting
>> -exact matches like "tag:inbox"  or **probabilistic**, supporting a more flexible **term** based searching. The prefixes currently supported by notmuch are as follows.
>> -
>> +exact matches like "tag:inbox" or **probabilistic**, supporting a more
>> +flexible **term** based searching. Certain **special** prefixes are
>> +processed by notmuch in a way not strictly fitting either of Xapian's
>> +built in styles. The prefixes currently supported by notmuch are as
>> +follows.
>>  
>>  Boolean
>>     **tag:**, **id:**, **thread:**, **folder:**, **path:**, **property:**
>>  Probabilistic
>> -   **from:**, **to:**, **subject:**, **attachment:**, **mimetype:**
>> +  **to:**, **attachment:**, **mimetype:**
>> +Special
>> +   **from:**, **query:**, **subject:**
>>  
>>  Terms and phrases
>>  -----------------
>> @@ -396,6 +416,7 @@ Currently the following features require field processor support:
>>  
>>  - non-range date queries, e.g. "date:today"
>>  - named queries e.g. "query:my_special_query"
>> +- regular expression searches, e.g. "subject:/^\\[SPAM\\]/"
>>  
>>  SEE ALSO
>>  ========
>> diff --git a/lib/Makefile.local b/lib/Makefile.local
>> index b77e5780..ff812b5f 100644
>> --- a/lib/Makefile.local
>> +++ b/lib/Makefile.local
>> @@ -52,6 +52,7 @@ libnotmuch_cxx_srcs =		\
>>  	$(dir)/query.cc		\
>>  	$(dir)/query-fp.cc      \
>>  	$(dir)/config.cc	\
>> +	$(dir)/regexp-fields.cc     \
>
> Space instead of TAB above -- tab is used more often (and \:s usually aligned)
>
>>  	$(dir)/thread.cc
>>  
>>  libnotmuch_modules := $(libnotmuch_c_srcs:.c=.o) $(libnotmuch_cxx_srcs:.cc=.o)
>> diff --git a/lib/database-private.h b/lib/database-private.h
>> index ccc1e9a1..9f5659a9 100644
>> --- a/lib/database-private.h
>> +++ b/lib/database-private.h
>> @@ -190,6 +190,8 @@ struct _notmuch_database {
>>  #if HAVE_XAPIAN_FIELD_PROCESSOR
>>      Xapian::FieldProcessor *date_field_processor;
>>      Xapian::FieldProcessor *query_field_processor;
>> +    Xapian::FieldProcessor *from_field_processor;
>> +    Xapian::FieldProcessor *subject_field_processor;
>>  #endif
>>      Xapian::ValueRangeProcessor *last_mod_range_processor;
>>  };
>> diff --git a/lib/database.cc b/lib/database.cc
>> index 2d19f20c..8a9ad251 100644
>> --- a/lib/database.cc
>> +++ b/lib/database.cc
>> @@ -21,6 +21,7 @@
>>  #include "database-private.h"
>>  #include "parse-time-vrp.h"
>>  #include "query-fp.h"
>> +#include "regexp-fields.h"
>>  #include "string-util.h"
>>  
>>  #include <iostream>
>> @@ -272,12 +273,16 @@ static prefix_t BOOLEAN_PREFIX_EXTERNAL[] = {
>>      { "folder",			"XFOLDER:" },
>>  };
>>  
>> -static prefix_t PROBABILISTIC_PREFIX[]= {
>> +static prefix_t REGEX_PREFIX[]= {
>>      { "from",			"XFROM" },
>> +    { "subject",		"XSUBJECT"},
>> +};
>> +
>> +static prefix_t PROBABILISTIC_PREFIX[]= {
>> +
>
> empty line ^
>
>>      { "to",			"XTO" },
>>      { "attachment",		"XATTACHMENT" },
>>      { "mimetype",		"XMIMETYPE"},
>> -    { "subject",		"XSUBJECT"},
>>  };
>>  
>>  const char *
>> @@ -295,6 +300,11 @@ _find_prefix (const char *name)
>>  	    return BOOLEAN_PREFIX_EXTERNAL[i].prefix;
>>      }
>>  
>> +    for (i = 0; i < ARRAY_SIZE (REGEX_PREFIX); i++) {
>> +	if (strcmp (name, REGEX_PREFIX[i].name) == 0)
>> +	    return REGEX_PREFIX[i].prefix;
>> +    }
>> +
>>      for (i = 0; i < ARRAY_SIZE (PROBABILISTIC_PREFIX); i++) {
>>  	if (strcmp (name, PROBABILISTIC_PREFIX[i].name) == 0)
>>  	    return PROBABILISTIC_PREFIX[i].prefix;
>> @@ -1042,6 +1052,10 @@ notmuch_database_open_verbose (const char *path,
>>  	notmuch->query_parser->add_boolean_prefix("date", notmuch->date_field_processor);
>>  	notmuch->query_field_processor = new QueryFieldProcessor (*notmuch->query_parser, notmuch);
>>  	notmuch->query_parser->add_boolean_prefix("query", notmuch->query_field_processor);
>> +	notmuch->from_field_processor = new RegexpFieldProcessor ("from", *notmuch->query_parser, notmuch);
>> +	notmuch->subject_field_processor = new RegexpFieldProcessor ("subject", *notmuch->query_parser, notmuch);
>> +	notmuch->query_parser->add_boolean_prefix("from", notmuch->from_field_processor);
>> +	notmuch->query_parser->add_boolean_prefix("subject", notmuch->subject_field_processor);
>>  #endif
>>  	notmuch->last_mod_range_processor = new Xapian::NumberValueRangeProcessor (NOTMUCH_VALUE_LAST_MOD, "lastmod:");
>>  
>> @@ -1058,7 +1072,12 @@ notmuch_database_open_verbose (const char *path,
>>  	    notmuch->query_parser->add_boolean_prefix (prefix->name,
>>  						       prefix->prefix);
>>  	}
>> -
>> +#if !HAVE_XAPIAN_FIELD_PROCESSOR
>> +	for (i = 0; i < ARRAY_SIZE (REGEX_PREFIX); i++) {
>> +	    prefix_t *prefix = &REGEX_PREFIX[i];
>> +	    notmuch->query_parser->add_prefix (prefix->name, prefix->prefix);
>> +	}
>> +#endif
>>  	for (i = 0; i < ARRAY_SIZE (PROBABILISTIC_PREFIX); i++) {
>>  	    prefix_t *prefix = &PROBABILISTIC_PREFIX[i];
>>  	    notmuch->query_parser->add_prefix (prefix->name, prefix->prefix);
>> @@ -1138,6 +1157,10 @@ notmuch_database_close (notmuch_database_t *notmuch)
>>      notmuch->date_field_processor = NULL;
>>      delete notmuch->query_field_processor;
>>      notmuch->query_field_processor = NULL;
>> +    delete notmuch->from_field_processor;
>> +    notmuch->from_field_processor = NULL;
>> +    delete notmuch->subject_field_processor;
>> +    notmuch->subject_field_processor = NULL;
>>  #endif
>>  
>>      return status;
>> diff --git a/lib/regexp-fields.cc b/lib/regexp-fields.cc
>> new file mode 100644
>> index 00000000..8cb1cada
>> --- /dev/null
>> +++ b/lib/regexp-fields.cc
>> @@ -0,0 +1,142 @@
>> +/* regexp-fields.cc - field processor glue for regex supporting fields
>> + *
>> + * This file is part of notmuch.
>> + *
>> + * Copyright © 2015 Austin Clements
>> + * Copyright © 2016 David Bremner
>> + *
>> + * This program is free software: you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License as published by
>> + * the Free Software Foundation, either version 3 of the License, or
>> + * (at your option) any later version.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
>> + *
>> + * You should have received a copy of the GNU General Public License
>> + * along with this program.  If not, see https://www.gnu.org/licenses/ .
>> + *
>> + * Author: Austin Clements <aclements@csail.mit.edu>
>> + *                David Bremner <david@tethera.net>
>> + */
>> +
>> +#include "regexp-fields.h"
>> +#include "notmuch-private.h"
>> +#include "database-private.h"
>> +#include <stdio.h>
>> +
>> +#if HAVE_XAPIAN_FIELD_PROCESSOR
>> +static void
>> +compile_regex (regex_t &regexp, const char *str)
>> +{
>> +    int err = regcomp (&regexp, str, REG_EXTENDED | REG_NOSUB);
>> +
>> +    if (err != 0) {
>> +	size_t len = regerror (err, &regexp, NULL, 0);
>> +	char *buffer = new char[len];
>> +	std::string msg;
>> +	(void) regerror (err, &regexp, buffer, len);
>> +	msg.assign (buffer, len);
>> +	delete buffer;
>> +
>> +	throw Xapian::QueryParserError (msg);
>> +
>
> empty line ^
>
>> +    }
>> +}
>> +
>> +RegexpPostingSource::RegexpPostingSource (Xapian::valueno slot, const std::string &regexp)
>> +    : slot_ (slot)
>> +{
>> +
>
> ditto
>
>> +    compile_regex (regexp_, regexp.c_str ());
>> +}
>> +
>> +RegexpPostingSource::~RegexpPostingSource ()
>> +{
>> +    regfree (&regexp_);
>> +}
>> +
>> +void
>> +RegexpPostingSource::init (const Xapian::Database &db)
>> +{
>> +    db_ = db;
>> +    it_ = db_.valuestream_begin (slot_);
>> +    end_ = db.valuestream_end (slot_);
>> +    started_ = false;
>> +}
>> +
>> +Xapian::doccount
>> +RegexpPostingSource::get_termfreq_min () const
>> +{
>> +    return 0;
>> +}
>> +
>> +Xapian::doccount
>> +RegexpPostingSource::get_termfreq_est () const
>> +{
>> +    return get_termfreq_max () / 2;
>> +}
>> +
>> +Xapian::doccount
>> +RegexpPostingSource::get_termfreq_max () const
>> +{
>> +    return db_.get_value_freq (slot_);
>> +}
>> +
>> +Xapian::docid
>> +RegexpPostingSource::get_docid () const
>> +{
>> +    return it_.get_docid ();
>> +}
>> +
>> +bool
>> +RegexpPostingSource::at_end () const
>> +{
>> +    return it_ == end_;
>> +}
>> +
>> +void
>> +RegexpPostingSource::next (unused (double min_wt))
>> +{
>> +    if (started_ && ! at_end ())
>> +	++it_;
>> +    started_ = true;
>> +
>> +    for (; ! at_end (); ++it_) {
>> +	std::string value = *it_;
>> +	if (regexec (&regexp_, value.c_str (), 0, NULL, 0) == 0)
>> +	    break;
>> +    }
>> +}
>> +
>> +static inline Xapian::valueno _find_slot (std::string prefix)
>> +{
>> +    if (prefix == "from")
>> +	return NOTMUCH_VALUE_FROM;
>> +    else if (prefix == "subject")
>> +	return NOTMUCH_VALUE_SUBJECT;
>> +    else
>> +	throw Xapian::QueryParserError ("unsupported regexp field '" + prefix + "'");
>> +}
>> +
>> +RegexpFieldProcessor::RegexpFieldProcessor (std::string prefix, Xapian::QueryParser &parser_, notmuch_database_t *notmuch_)
>> +	: slot(_find_slot (prefix)), term_prefix(_find_prefix (prefix.c_str ())), parser(parser_), notmuch(notmuch_)
>> +{
>> +};
>> +
>> +Xapian::Query
>> +RegexpFieldProcessor::operator() (const std::string & str)
>> +{
>> +    if (str.at (0) == '/' && str.at (str.size () - 1)){
>> +	RegexpPostingSource *postings = new RegexpPostingSource (slot, str.substr(1,str.size () - 2));
>> +	return Xapian::Query (postings->release ());
>> +    } else {
>> +	/* TODO replace this with a nicer API level triggering of
>> +	 * phrase parsing, when possible */
>> +	std::string quoted='"' + str + '"';
>> +	return parser.parse_query (quoted, NOTMUCH_QUERY_PARSER_FLAGS, term_prefix);
>> +    }
>> +}
>> +#endif
>> diff --git a/lib/regexp-fields.h b/lib/regexp-fields.h
>> new file mode 100644
>> index 00000000..bac11999
>> --- /dev/null
>> +++ b/lib/regexp-fields.h
>> @@ -0,0 +1,77 @@
>> +/* regex-fields.h - xapian glue for semi-bruteforce regexp search
>> + *
>> + * This file is part of notmuch.
>> + *
>> + * Copyright © 2015 Austin Clements
>> + * Copyright © 2016 David Bremner
>> + *
>> + * This program is free software: you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License as published by
>> + * the Free Software Foundation, either version 3 of the License, or
>> + * (at your option) any later version.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
>> + *
>> + * You should have received a copy of the GNU General Public License
>> + * along with this program.  If not, see https://www.gnu.org/licenses/ .
>> + *
>> + * Author: Austin Clements <aclements@csail.mit.edu>
>> + *                David Bremner <david@tethera.net>
>> + */
>> +
>> +#ifndef NOTMUCH_REGEXP_FIELDS_H
>> +#define NOTMUCH_REGEXP_FIELDS_H
>> +#if HAVE_XAPIAN_FIELD_PROCESSOR
>> +#include <sys/types.h>
>> +#include <regex.h>
>> +#include "database-private.h"
>> +#include "notmuch-private.h"
>> +
>> +/* A posting source that returns documents where a value matches a
>> + * regexp.
>> + */
>> +class RegexpPostingSource : public Xapian::PostingSource
>> +{
>> + protected:
>> +    const Xapian::valueno slot_;
>> +    regex_t regexp_;
>> +    Xapian::Database db_;
>> +    bool started_;
>> +    Xapian::ValueIterator it_, end_;
>> +
>> +/* No copying */
>> +    RegexpPostingSource (const RegexpPostingSource &);
>> +    RegexpPostingSource &operator= (const RegexpPostingSource &);
>> +
>> + public:
>> +    RegexpPostingSource (Xapian::valueno slot, const std::string &regexp);
>> +    ~RegexpPostingSource ();
>> +    void init (const Xapian::Database &db);
>> +    Xapian::doccount get_termfreq_min () const;
>> +    Xapian::doccount get_termfreq_est () const;
>> +    Xapian::doccount get_termfreq_max () const;
>> +    Xapian::docid get_docid () const;
>> +    bool at_end () const;
>> +    void next (unused (double min_wt));
>> +};
>> +
>> +
>> +class RegexpFieldProcessor : public Xapian::FieldProcessor {
>> + protected:
>> +    Xapian::valueno slot;
>> +    std::string term_prefix;
>> +    Xapian::QueryParser &parser;
>> +    notmuch_database_t *notmuch;
>> +
>> + public:
>> +    RegexpFieldProcessor (std::string prefix, Xapian::QueryParser &parser_, notmuch_database_t *notmuch_);
>> +
>> +    ~RegexpFieldProcessor () { };
>> +
>> +    Xapian::Query operator()(const std::string & str);
>> +};
>> +#endif
>> +#endif /* NOTMUCH_REGEXP_FIELDS_H */
>> diff --git a/test/T630-regexp-query.sh b/test/T630-regexp-query.sh
>> new file mode 100755
>> index 00000000..722af715
>> --- /dev/null
>> +++ b/test/T630-regexp-query.sh
>> @@ -0,0 +1,82 @@
>> +#!/usr/bin/env bash
>> +test_description='regular expression searches'
>> +. ./test-lib.sh || exit 1
>> +
>> +add_email_corpus
>> +
>> +
>> +if [ $NOTMUCH_HAVE_XAPIAN_FIELD_PROCESSOR -eq 1 ]; then
>> +
>> +    notmuch search --output=messages from:cworth > cworth.msg-ids
>> +
>> +    test_begin_subtest "regexp from search, case sensitive"
>> +    notmuch search --output=messages from:/carl/ > OUTPUT
>> +    test_expect_equal_file /dev/null OUTPUT
>> +
>> +    test_begin_subtest "empty regexp or query"
>> +    notmuch search --output=messages from:/carl/ or from:/cworth/ > OUTPUT
>> +    test_expect_equal_file cworth.msg-ids OUTPUT
>> +
>> +    test_begin_subtest "non-empty regexp and query"
>> +    notmuch search  from:/cworth@cworth.org/ and subject:patch > OUTPUT
>> +    cat <<EOF > EXPECTED
>> +thread:0000000000000008   2009-11-18 [1/2] Carl Worth| Alex Botero-Lowry; [notmuch] [PATCH] Error out if no query is supplied to search instead of going into an infinite loop (attachment inbox unread)
>> +thread:0000000000000007   2009-11-18 [1/2] Carl Worth| Ingmar Vanhassel; [notmuch] [PATCH] Typsos (inbox unread)
>> +thread:0000000000000018   2009-11-18 [1/2] Carl Worth| Jan Janak; [notmuch] [PATCH] Older versions of install do not support -C. (inbox unread)
>> +thread:0000000000000017   2009-11-18 [1/2] Carl Worth| Keith Packard; [notmuch] [PATCH] Make notmuch-show 'X' (and 'x') commands remove inbox (and unread) tags (inbox unread)
>> +thread:0000000000000014   2009-11-18 [2/5] Carl Worth| Mikhail Gusarov, Keith Packard; [notmuch] [PATCH 1/2] Close message file after parsing message headers (inbox unread)
>> +thread:0000000000000001   2009-11-18 [1/1] Stewart Smith; [notmuch] [PATCH] Fix linking with gcc to use g++ to link in C++ libs. (inbox unread)
>> +EOF
>> +    test_expect_equal_file EXPECTED OUTPUT
>> +
>> +    test_begin_subtest "regexp from search, duplicate term search"
>> +    notmuch search --output=messages from:/cworth/ > OUTPUT
>> +    test_expect_equal_file cworth.msg-ids OUTPUT
>> +
>> +    test_begin_subtest "long enough regexp matches only desired senders"
>> +    notmuch search --output=messages 'from:"/C.* Wo/"' > OUTPUT
>> +    test_expect_equal_file cworth.msg-ids OUTPUT
>> +
>> +    test_begin_subtest "shorter regexp matches one more sender"
>> +    notmuch search --output=messages 'from:"/C.* W/"' > OUTPUT
>> +    (echo id:1258544095-16616-1-git-send-email-chris@chris-wilson.co.uk ; cat cworth.msg-ids) > EXPECTED
>
> The above doesn't need to be executed in subshell: 
>
>   { echo id:1258544095-16616-1-git-send-email-chris@chris-wilson.co.uk; cat cworth.msg-ids; } > EXPECTED
>
> does it in the same shell
>
>
>> +    test_expect_equal_file EXPECTED OUTPUT
>> +
>> +    test_begin_subtest "regexp subject search, non-ASCII"
>> +    notmuch search --output=messages subject:/accentué/ > OUTPUT
>> +    echo id:877h1wv7mg.fsf@inf-8657.int-evry.fr > EXPECTED
>> +    test_expect_equal_file EXPECTED OUTPUT
>> +
>> +    test_begin_subtest "regexp subject search, punctuation"
>> +    notmuch search   subject:/\'X\'/ > OUTPUT
>> +    cat <<EOF > EXPECTED
>> +thread:0000000000000017   2009-11-18 [2/2] Keith Packard, Carl Worth; [notmuch] [PATCH] Make notmuch-show 'X' (and 'x') commands remove inbox (and unread) tags (inbox unread)
>> +EOF
>> +    test_expect_equal_file EXPECTED OUTPUT
>> +
>> +    test_begin_subtest "regexp subject search, no punctuation"
>> +    notmuch search  subject:/X/ > OUTPUT
>> +    cat <<EOF > EXPECTED
>> +thread:0000000000000017   2009-11-18 [2/2] Keith Packard, Carl Worth; [notmuch] [PATCH] Make notmuch-show 'X' (and 'x') commands remove inbox (and unread) tags (inbox unread)
>> +thread:000000000000000f   2009-11-18 [4/4] Jjgod Jiang, Alexander Botero-Lowry; [notmuch] Mac OS X/Darwin compatibility issues (inbox unread)
>> +EOF
>> +    test_expect_equal_file EXPECTED OUTPUT
>> +
>> +    test_begin_subtest "combine regexp from and subject"
>> +    notmuch search  subject:/-C/ and from:/.an.k/ > OUTPUT
>> +    cat <<EOF > EXPECTED
>> +thread:0000000000000018   2009-11-17 [1/2] Jan Janak| Carl Worth; [notmuch] [PATCH] Older versions of install do not support -C. (inbox unread)
>> +EOF
>> +    test_expect_equal_file EXPECTED OUTPUT
>> +
>> +    test_begin_subtest "regexp error reporting"
>> +    notmuch search 'from:/unbalanced[/' 1>OUTPUT 2>&1
>> +    cat <<EOF > EXPECTED
>> +notmuch search: A Xapian exception occurred
>> +A Xapian exception occurred performing query: Invalid regular expression
>> +Query string was: from:/unbalanced[/
>> +EOF
>> +    test_expect_equal_file EXPECTED OUTPUT
>> +fi
>> +
>> +test_done
>> -- 
>> 2.11.0

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Patch v4] lib: regexp matching in 'subject' and 'from'
  2017-01-26  2:21               ` David Bremner
@ 2017-01-29 11:23                 ` Jani Nikula
  2017-02-05 20:16                   ` Tomi Ollila
  2017-02-09  3:11                   ` David Bremner
  0 siblings, 2 replies; 18+ messages in thread
From: Jani Nikula @ 2017-01-29 11:23 UTC (permalink / raw)
  To: David Bremner, Tomi Ollila, notmuch

On Wed, 25 Jan 2017, David Bremner <david@tethera.net> wrote:
> Tomi Ollila <tomi.ollila@iki.fi> writes:
>
>>
>> Why would message_id not be useful to regex match? I can come up with quite
>> a few use cases... but if there are technical difficulties... then that
>> should be mentioned instead.
>
> I'll have a look. Since the first version of this patch (when that
> message was written), people have actually asked for some kind of
> wildcard matching of message-ids.

Theoretically "/" is an acceptable character in message-ids [1]. Rare,
unlikely, but acceptable. Searching for message-id's beginning with "/"
would have to use regexps, which would break in all sorts of ways
throughout the stack. I don't think there are handy alternatives to
"/<regex>/", given the characters that are acceptable in message-ids,
but this is something to think about.

For example, could the regexp matcher for message-ids first check if the
"regexp" is a strict match with "/" and all, and accept those? This
might be a reasonable workaround if it can be made to work.

[1] https://tools.ietf.org/html/rfc2822#section-3.2.4

>> maybe this commit message should inform that xapian with field processors
>> (1.4.x) is required for this feature -- and emphasize it a bit better in
>> manual page ?
>>
>> Probably '//' is used to escape '/' -- should such a character ever needed
>> in regex search.
>>
>
> Currently no escaping is needed because it only looks at the first and
> last characters of the string (the usual xapian/shell rules mean that "" might
> be needed).
>
> The following seem to work as hoped
>
> # match a / with a space before it
>
> % notmuch search 'subject:"/ //"'
>
> # just a slash
>
> % notmuch search subject:///
>
> # anchored slash
>
> % notmuch search subject:/^//
>
> The trailing slash is actually decorative, we could drop it. Actually
> *blush* I just noticed the current code is missing something from this line
>
>          if (str.at (0) == '/' && str.at (str.size () - 1)){
>
> _if_ that line is fixed, then it will have the slightly odd behaviour of
>
> subject:/blah
>
> doing a non-regex search
>
> We could also throw an error for that case, maybe that's the best option.

I'd go with an error. It's easy to loosen the rules later on if we
decide that's a good idea. Much harder to accept loose rules now, let
users get used to it, and try to tighten the rules if we realize we'd
need that for some reason.

BR,
Jani.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Patch v4] lib: regexp matching in 'subject' and 'from'
  2017-01-29 11:23                 ` Jani Nikula
@ 2017-02-05 20:16                   ` Tomi Ollila
  2017-02-05 23:28                     ` David Bremner
  2017-02-09  3:11                   ` David Bremner
  1 sibling, 1 reply; 18+ messages in thread
From: Tomi Ollila @ 2017-02-05 20:16 UTC (permalink / raw)
  To: David Bremner, notmuch

On Sun, Jan 29 2017, Jani Nikula <jani@nikula.org> wrote:

> On Wed, 25 Jan 2017, David Bremner <david@tethera.net> wrote:
>> Tomi Ollila <tomi.ollila@iki.fi> writes:
>>
>>>
>>> Why would message_id not be useful to regex match? I can come up with quite
>>> a few use cases... but if there are technical difficulties... then that
>>> should be mentioned instead.
>>
>> I'll have a look. Since the first version of this patch (when that
>> message was written), people have actually asked for some kind of
>> wildcard matching of message-ids.
>
> Theoretically "/" is an acceptable character in message-ids [1]. Rare,
> unlikely, but acceptable. Searching for message-id's beginning with "/"
> would have to use regexps, which would break in all sorts of ways
> throughout the stack. I don't think there are handy alternatives to
> "/<regex>/", given the characters that are acceptable in message-ids,
> but this is something to think about.
>
> For example, could the regexp matcher for message-ids first check if the
> "regexp" is a strict match with "/" and all, and accept those? This
> might be a reasonable workaround if it can be made to work.
>
> [1] https://tools.ietf.org/html/rfc2822#section-3.2.4
>
>>> maybe this commit message should inform that xapian with field processors
>>> (1.4.x) is required for this feature -- and emphasize it a bit better in
>>> manual page ?
>>>
>>> Probably '//' is used to escape '/' -- should such a character ever needed
>>> in regex search.
>>>
>>
>> Currently no escaping is needed because it only looks at the first and
>> last characters of the string (the usual xapian/shell rules mean that "" might
>> be needed).
>>
>> The following seem to work as hoped
>>
>> # match a / with a space before it
>>
>> % notmuch search 'subject:"/ //"'
>>
>> # just a slash
>>
>> % notmuch search subject:///
>>
>> # anchored slash
>>
>> % notmuch search subject:/^//
>>
>> The trailing slash is actually decorative, we could drop it. Actually
>> *blush* I just noticed the current code is missing something from this line
>>
>>          if (str.at (0) == '/' && str.at (str.size () - 1)){
>>
>> _if_ that line is fixed, then it will have the slightly odd behaviour of
>>
>> subject:/blah
>>
>> doing a non-regex search
>>
>> We could also throw an error for that case, maybe that's the best option.
>
> I'd go with an error. It's easy to loosen the rules later on if we
> decide that's a good idea. Much harder to accept loose rules now, let
> users get used to it, and try to tighten the rules if we realize we'd
> need that for some reason.

I agree. Should we allow a trailing slash ('/') when the first character
is not also '/' (e.g. subject:blah/)?

Tomi

>
> BR,
> Jani.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Patch v4] lib: regexp matching in 'subject' and 'from'
  2017-02-05 20:16                   ` Tomi Ollila
@ 2017-02-05 23:28                     ` David Bremner
  0 siblings, 0 replies; 18+ messages in thread
From: David Bremner @ 2017-02-05 23:28 UTC (permalink / raw)
  To: Tomi Ollila, notmuch

Tomi Ollila <tomi.ollila@iki.fi> writes:

> On Sun, Jan 29 2017, Jani Nikula <jani@nikula.org> wrote:
>
>>
>> I'd go with an error. It's easy to loosen the rules later on if we
>> decide that's a good idea. Much harder to accept loose rules now, let
>> users get used to it, and try to tighten the rules if we realize we'd
>> need that for some reason.
>
> I agree -- should we allow trailing slash ('/') without first char also
> being '/' (e.g. subject:blah/)
>

I'd say that should also be an error. It doesn't add anything useful to
subject search. Even for path search (which is non-trivial to add regex
search for, I think), the trailing / doesn't add anything, does it?
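
To make the rules discussed so far concrete, here is a minimal sketch in
shell. It only illustrates the proposed behaviour and is not the C++ field
processor from the patch; the function name classify_term is hypothetical,
and treating an empty regex ("//") as an error is an assumption rather than
something decided in this thread.

```shell
#!/bin/sh
# Hypothetical helper: classify the value part of a subject:/from: term.
#   regex - starts and ends with '/', with a non-empty body in between
#   error - only one of the two delimiters is present (or the body is empty)
#   plain - no delimiters, ordinary (non-regex) search
classify_term() {
    case $1 in
        /?*/)  echo regex ;;
        /*|*/) echo error ;;
        *)     echo plain ;;
    esac
}

classify_term '/blah/'   # regex
classify_term '///'      # regex (the body is a single slash)
classify_term '/blah'    # error
classify_term 'blah/'    # error
classify_term 'blah'     # plain
```

Starting with the strict version means both one-sided forms are rejected,
which leaves room to loosen the rules later.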

d

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Patch v4] lib: regexp matching in 'subject' and 'from'
  2017-01-29 11:23                 ` Jani Nikula
  2017-02-05 20:16                   ` Tomi Ollila
@ 2017-02-09  3:11                   ` David Bremner
  2017-02-09 16:15                     ` Tomi Ollila
  2017-02-10  8:29                     ` Mark Walters
  1 sibling, 2 replies; 18+ messages in thread
From: David Bremner @ 2017-02-09  3:11 UTC (permalink / raw)
  To: Jani Nikula, Tomi Ollila, notmuch

Jani Nikula <jani@nikula.org> writes:

>
> Theoretically "/" is an acceptable character in message-ids [1]. Rare,
> unlikely, but acceptable. Searching for message-id's beginning with "/"
> would have to use regexps, which would break in all sorts of ways
> throughout the stack. I don't think there are handy alternatives to
> "/<regex>/", given the characters that are acceptable in message-ids,
> but this is something to think about.

Would telling the user to \-escape (or double) the initial / be good
enough there? This would disable regex processing.  I guess this goes
back to someone's earlier suggestion.  A third option would be to use
single quotes there ("id:'/foo'"), but that isn't really consistent with
either Xapian or usual regex conventions.

So I guess my favourite idea ATM is to use id:\/some/crazy/message-id.
FWIW, I don't have any such message ids.
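
A rough sketch of how the backslash idea could be interpreted, written as
a shell function purely for illustration. The name parse_id_value is
hypothetical, and rejecting a value with only a leading slash is an
assumption carried over from the from:/subject: discussion, not a settled
rule for id:.

```shell
#!/bin/sh
# Hypothetical helper: interpret the value of an id: term under the
# backslash-escape proposal. Prints "literal <id>", "regex <body>" or "error".
parse_id_value() {
    case $1 in
        \\/*)                       # escaped leading slash: no regex processing
            printf 'literal %s\n' "${1#\\}" ;;
        /?*/)                       # /body/ : regex search
            _body=${1#/}
            printf 'regex %s\n' "${_body%/}" ;;
        /*)                         # unescaped leading slash only (assumed error)
            printf 'error\n' ;;
        *)
            printf 'literal %s\n' "$1" ;;
    esac
}

parse_id_value '\/some/crazy/message-id'   # literal /some/crazy/message-id
parse_id_value '/foo.*/'                   # regex foo.*
```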

> For example, could the regexp matcher for message-ids first check if the
> "regexp" is a strict match with "/" and all, and accept those? This
> might be a reasonable workaround if it can be made to work.

We're building a query, so I think the equivalent is to make an OR, with
the exact match and the regex posting source. That could be done,
although I'm a bit uneasy about how this makes the syntax for id:
different, so id:/foo would be legit, but from:/foo would be an error.
Maybe the dwim-factor is worth it.
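
For illustration only, the "exact match OR regex" semantics could be
modelled like this in shell. This is not how the Xapian query would be
built; id_term_matches is a hypothetical name, and using grep -E assumes
unanchored POSIX extended regular expressions.

```shell
#!/bin/sh
# Hypothetical model: would message-id $1 be matched by the term id:$2
# if the query were "exact match OR regex match"?
id_term_matches() {
    _id=$1
    _term=$2
    # Exact match first, slashes and all.
    [ "$_id" = "$_term" ] && return 0
    case $_term in
        /?*/)
            _re=${_term#/}
            _re=${_re%/}
            printf '%s\n' "$_id" | grep -Eq -- "$_re" ;;
        *)
            return 1 ;;
    esac
}

id_term_matches '/foo/' '/foo/' && echo "exact match"
id_term_matches 'xfooy@example.org' '/foo/' && echo "regex match"
```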

d

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Patch v4] lib: regexp matching in 'subject' and 'from'
  2017-02-09  3:11                   ` David Bremner
@ 2017-02-09 16:15                     ` Tomi Ollila
  2017-02-10  8:29                     ` Mark Walters
  1 sibling, 0 replies; 18+ messages in thread
From: Tomi Ollila @ 2017-02-09 16:15 UTC (permalink / raw)
  To: David Bremner, notmuch

On Thu, Feb 09 2017, David Bremner <david@tethera.net> wrote:

> Jani Nikula <jani@nikula.org> writes:
>
>>
>> Theoretically "/" is an acceptable character in message-ids [1]. Rare,
>> unlikely, but acceptable. Searching for message-id's beginning with "/"
>> would have to use regexps, which would break in all sorts of ways
>> throughout the stack. I don't think there are handy alternatives to
>> "/<regex>/", given the characters that are acceptable in message-ids,
>> but this is something to think about.
>
> Would telling the user to \ escape ( or double /) the initial / be good

A while ago I thought about this double // but dismissed it quickly
(re-searching just for single quote can be useful...). In the rare
cases where anyone needs to disable regex processing, IMO this \ is the
best idea I've (not) come up with.

Some command-line testing with (and without) quoting:

$ printf %s\\n  id:\/some/crazy/message-id
id:/some/crazy/message-id

$ printf %s\\n  "id:\/some/crazy/message-id"
id:\/some/crazy/message-id

$ printf %s\\n  'id:\/some/crazy/message-id'
id:\/some/crazy/message-id


$ printf %s\\n  id:\\/some/crazy/message-id
id:\/some/crazy/message-id

$ printf %s\\n  "id:\\/some/crazy/message-id"
id:\/some/crazy/message-id

$ printf %s\\n  'id:\\/some/crazy/message-id'
id:\\/some/crazy/message-id

so:
$ printf %s\\n  'id:"\/some/crazy/message-id with spaces"'
id:"\/some/crazy/message-id with spaces"


> enough there? This would disable regex processing.  I guess this goes
> back to someone's earlier suggestion.  A third option would be to use
> single quotes there ("id:'/foo'"), but that isn't really consistent with
> either Xapian 
> or usual regex conventions.

$ printf %s\\n 'id:"'\''/foo with spaces ;D'\''"'
id:"'/foo with spaces ;D'"

or, perhaps this is clearer >;)

$ printf %s\\n 'id:"'"'"'/foo with spaces ;D'"'"'"'
id:"'/foo with spaces ;D'"

>
> So I guess my favourite idea ATM is to use id:\/some/crazy/message-id
> FWIW, I don't have any such message ids.
>
>> For example, could the regexp matcher for message-ids first check if the
>> "regexp" is a strict match with "/" and all, and accept those? This
>> might be a reasonable workaround if it can be made to work.
>
> We're building a query, so I think the equivalent is to make an OR, with
> the exact match and the regex posting source. That could be done,
> although I'm a bit uneasy about how this makes the syntax for id:
> different, so id:/foo would be legit, but from:/foo would be an error.
> Maybe the dwim-factor is worth it.
>
> d

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Patch v4] lib: regexp matching in 'subject' and 'from'
  2017-02-09  3:11                   ` David Bremner
  2017-02-09 16:15                     ` Tomi Ollila
@ 2017-02-10  8:29                     ` Mark Walters
  2017-02-11 23:25                       ` David Bremner
  1 sibling, 1 reply; 18+ messages in thread
From: Mark Walters @ 2017-02-10  8:29 UTC (permalink / raw)
  To: David Bremner, Jani Nikula, Tomi Ollila, notmuch

On Thu, 09 Feb 2017, David Bremner <david@tethera.net> wrote:
> Jani Nikula <jani@nikula.org> writes:
>
>>
>> Theoretically "/" is an acceptable character in message-ids [1]. Rare,
>> unlikely, but acceptable. Searching for message-id's beginning with "/"
>> would have to use regexps, which would break in all sorts of ways
>> throughout the stack. I don't think there are handy alternatives to
>> "/<regex>/", given the characters that are acceptable in message-ids,
>> but this is something to think about.
>
> Would telling the user to \ escape ( or double /) the initial / be good
> enough there? This would disable regex processing.  I guess this goes
> back to someone's earlier suggestion.  A third option would be to use
> single quotes there ("id:'/foo'"), but that isn't really consistent with either Xapian
> or usual regex conventions.
>
> So I guess my favourite idea ATM is to use id:\/some/crazy/message-id
> FWIW, I don't have any such message ids.
>
>> For example, could the regexp matcher for message-ids first check if the
>> "regexp" is a strict match with "/" and all, and accept those? This
>> might be a reasonable workaround if it can be made to work.
>
> We're building a query, so I think the equivalent is to make an OR, with
> the exact match and the regex posting source. That could be done,
> although I'm a bit uneasy about how this makes the syntax for id:
> different, so id:/foo would be legit, but from:/foo would be an error.
> Maybe the dwim-factor is worth it.

Hi

Broadly I like the backslash escaping option. Two thoughts: can any
fields (from/subject/message-id) start with a "\" anyway? I think not
but thought it worth checking.

Secondly, message-id is often round-tripped, that is output from notmuch
and then fed back to notmuch. Do we want to escape the output as above
before printing in any cases? My view is that if we output the
message-id prefixed with "id:" then we should escape it (which applies
with --output=messages --format=text), but if we don't print the "id:"
part then we shouldn't (eg with --format=json). A similar thing would
apply to emacs: if it is a normal stash then escape the id, but if it is
a "bare stash" then do not.
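
A small sketch of the escaping on output, assuming the backslash proposal
is adopted. The function name print_id_term is hypothetical and this is
not existing notmuch code; output without the "id:" prefix would stay
unescaped.

```shell
#!/bin/sh
# Hypothetical helper: print a message-id as an "id:" search term that can
# be fed back to notmuch unchanged, escaping a leading slash with a backslash.
print_id_term() {
    case $1 in
        /*) printf 'id:\\%s\n' "$1" ;;
        *)  printf 'id:%s\n' "$1" ;;
    esac
}

print_id_term '/some/crazy/message-id'   # id:\/some/crazy/message-id
print_id_term 'abc@example.org'          # id:abc@example.org
```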

Actually, one more thing: it would be a shame to block or significantly
delay the series for such a corner case.

Best wishes

Mark

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Patch v4] lib: regexp matching in 'subject' and 'from'
  2017-02-10  8:29                     ` Mark Walters
@ 2017-02-11 23:25                       ` David Bremner
  0 siblings, 0 replies; 18+ messages in thread
From: David Bremner @ 2017-02-11 23:25 UTC (permalink / raw)
  To: Mark Walters

Mark Walters <markwalters1009@gmail.com> writes:

>
> Hi
>
> Broadly I like the backslash escaping option. Two thoughts: can any
> fields (from/subject/message-id) start with a "\" anyway? I think not
> but thought it worth checking.

From and subject are probabilistic Xapian fields, so punctuation is
essentially ignored by the query parser. That being said, nothing
prevents subjects from starting with /.  According to my reading of
rfc5322, conforming message ids cannot contain any of '()<>[]:;@\,."'

> Secondly, message-id is often round-tripped, that is output from notmuch
> and then fed back to notmuch. Do we want to escape the output as above
> before printing in any cases? My view is that if we output the
> message-id prefixed with "id:" then we should escape it (which applies
> with --output=messages --format=text), but if we don't print the "id:"
> part then we shouldn't (eg with --format=json). A similar thing would
> apply to emacs: if it is a normal stash then escape the id, but if it is
> a "bare stash" then do not.

Yes that sounds about right. Do we actually output from and subject with
prefixes attached to them? I have the feeling not.

>
> Actually, one more thing: it would be a shame to block or significantly
> delay the series for such a corner case.
>

If it's _only_ the output of notmuch search --output=messages, then I
guess it's doable.

d

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2017-02-12  1:02 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-27 13:33 [PATCH] lib: regexp matching in 'subject' and 'from' David Bremner
2016-11-14 21:46 ` [Patch v2] " David Bremner
2017-01-18 20:05   ` Jani Nikula
2017-01-18 21:01     ` David Bremner
2017-01-19 12:16       ` [Patch v3] " David Bremner
2017-01-21  3:27         ` [WIP] " David Bremner
2017-01-21 13:59           ` [Patch v4] " David Bremner
2017-01-25 19:40             ` Tomi Ollila
2017-01-26  2:21               ` David Bremner
2017-01-29 11:23                 ` Jani Nikula
2017-02-05 20:16                   ` Tomi Ollila
2017-02-05 23:28                     ` David Bremner
2017-02-09  3:11                   ` David Bremner
2017-02-09 16:15                     ` Tomi Ollila
2017-02-10  8:29                     ` Mark Walters
2017-02-11 23:25                       ` David Bremner
2017-01-29 11:06               ` Jani Nikula
2017-01-19 14:27   ` [Patch v2] " David Bremner
