unofficial mirror of notmuch@notmuchmail.org
* searching: '*analysis' vs 'reanalysis'
@ 2016-06-06  6:58 Gaute Hope
  2016-06-06 12:42 ` David Bremner
  0 siblings, 1 reply; 21+ messages in thread
From: Gaute Hope @ 2016-06-06  6:58 UTC (permalink / raw)
  To: notmuch

Hi,

I have an email with the word 'reanalysis' in the subject line and the
email body. However, when I try to search for '*analysis' or 'analysis'
I do not get any matches, should not '*analysis' at least match?

Regards, Gaute




* Re: searching: '*analysis' vs 'reanalysis'
  2016-06-06  6:58 searching: '*analysis' vs 'reanalysis' Gaute Hope
@ 2016-06-06 12:42 ` David Bremner
  2016-06-06 12:53   ` Gaute Hope
  0 siblings, 1 reply; 21+ messages in thread
From: David Bremner @ 2016-06-06 12:42 UTC (permalink / raw)
  To: Gaute Hope, notmuch

Gaute Hope <eg@gaute.vetsj.com> writes:

> Hi,
>
> I have an email with the word 'reanalysis' in the subject line and the
> email body. However, when I try to search for '*analysis' or 'analysis'
> I do not get any matches, should not '*analysis' at least match?
>

We talked about this on IRC (the short answer is no), but is there some
improvement you could suggest to the "Wildcards" section in
notmuch-search-terms(7) ?

d


* Re: searching: '*analysis' vs 'reanalysis'
  2016-06-06 12:42 ` David Bremner
@ 2016-06-06 12:53   ` Gaute Hope
  2016-06-06 15:52     ` Sebastian Fischmeister
  0 siblings, 1 reply; 21+ messages in thread
From: Gaute Hope @ 2016-06-06 12:53 UTC (permalink / raw)
  To: David Bremner, notmuch

David Bremner writes on June 6, 2016 14:42:
> Gaute Hope <eg@gaute.vetsj.com> writes:
>
>> Hi,
>>
>> I have an email with the word 'reanalysis' in the subject line and the
>> email body. However, when I try to search for '*analysis' or 'analysis'
>> I do not get any matches, should not '*analysis' at least match?
>>
>
> We talked about this on IRC (the short answer is no), but is there some
> improvement you could suggest to the "Wildcards" section in
> notmuch-search-terms(7) ?

Yes, thanks, not very important, but maybe add the sentence:

> It is not possible to use wildcards at the beginning of a term.

after the current explanation to emphasize this limitation (possibly
blaming Xapian to avoid futile requests).

I think it is something many would expect (and want). The current
description feels more like an example, and it is easy to make the
assumption that it works for prefixing the terms as well - although,
technically, nothing is promised in the original docs.
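
To illustrate why a trailing wildcard is cheap while a leading one is
not, here is a minimal sketch. It assumes the index can be modelled as
a sorted list of terms, which is a simplification of what Xapian
actually stores; expand_prefix and scan_suffix are hypothetical names,
not notmuch or Xapian functions.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: collect all terms starting with `prefix` from a
// sorted term list. Only the contiguous matching range is visited.
std::vector<std::string>
expand_prefix (const std::vector<std::string> &sorted_terms, const std::string &prefix)
{
    std::vector<std::string> out;
    auto it = std::lower_bound (sorted_terms.begin (), sorted_terms.end (), prefix);
    for (; it != sorted_terms.end () && it->compare (0, prefix.size (), prefix) == 0; ++it)
        out.push_back (*it);
    return out;
}

// Hypothetical helper: a leading wildcard has no such range in a list
// sorted from the front, so every term has to be inspected.
std::vector<std::string>
scan_suffix (const std::vector<std::string> &terms, const std::string &suffix)
{
    std::vector<std::string> out;
    for (const std::string &t : terms)
        if (t.size () >= suffix.size () &&
            t.compare (t.size () - suffix.size (), suffix.size (), suffix) == 0)
            out.push_back (t);
    return out;
}
```

With the terms "analysis", "analyst", "banana" and "reanalysis", the
prefix "analy" is found by jumping straight to the first candidate,
whereas the suffix "analysis" needs a pass over the whole list.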

-gaute




* Re: searching: '*analysis' vs 'reanalysis'
  2016-06-06 12:53   ` Gaute Hope
@ 2016-06-06 15:52     ` Sebastian Fischmeister
  2016-06-06 17:29       ` David Bremner
  0 siblings, 1 reply; 21+ messages in thread
From: Sebastian Fischmeister @ 2016-06-06 15:52 UTC (permalink / raw)
  To: Gaute Hope, David Bremner, notmuch

>> It is not possible to use wildcards at the beginning of a term.
>
> after the current explanation to emphasize this limitation (possibly
> blaming Xapian to avoid futile requests).
>
> I think it is something many would expect (and want). The current
> description feels more like an example, and it is easy to make the
> assumption that it works for prefixing the terms as well - although,
> technically, nothing is promised in the original docs.

I ran into this problem before as well. Storage is cheap. Notmuch could
index all emails with reversed text to get around some of this
problem. It doesn't solve the problem of *analysis*, but it's still an
improvement.
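
A rough sketch of the reversed-text idea, under the assumption that
terms can be kept in a sorted container: storing each term reversed
turns a suffix search into a prefix search. ReversedIndex and its
methods are hypothetical and only illustrate the principle; nothing
like this exists in notmuch.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical reversed-term index: each term is stored reversed, so a
// suffix search becomes a prefix search over the reversed strings.
class ReversedIndex
{
    std::set<std::string> reversed_;

public:
    void add (const std::string &term)
    {
        reversed_.insert (std::string (term.rbegin (), term.rend ()));
    }

    // Return all terms ending in `suffix`, in original (unreversed) form.
    std::vector<std::string> ending_with (const std::string &suffix) const
    {
        const std::string key (suffix.rbegin (), suffix.rend ());
        std::vector<std::string> out;
        for (auto it = reversed_.lower_bound (key);
             it != reversed_.end () && it->compare (0, key.size (), key) == 0; ++it)
            out.push_back (std::string (it->rbegin (), it->rend ()));
        return out;
    }
};
```

A search for *analysis would look up the reversed key "sisylana". A
pattern with wildcards on both ends has no fixed start in either
direction, so this does not help there.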

  Sebastian


* Re: searching: '*analysis' vs 'reanalysis'
  2016-06-06 15:52     ` Sebastian Fischmeister
@ 2016-06-06 17:29       ` David Bremner
  2016-06-06 19:20         ` Austin Clements
  0 siblings, 1 reply; 21+ messages in thread
From: David Bremner @ 2016-06-06 17:29 UTC (permalink / raw)
  To: sfischme, Gaute Hope, notmuch; +Cc: Austin Clements

Sebastian Fischmeister <sfischme@uwaterloo.ca> writes:

>
> I ran into this problem before as well. Storage is cheap. Notmuch could
> index all emails with reversed text to get around some of this
> problem. It doesn't solve the problem of *analysis*, but it's still an
> improvement.

It would probably be more useful to have brute force regexp searches on
headers.  Austin did some experiments that sounded promising, where you
basically postprocess the result of a xapian query with a regexp. OTOH,
I don't know what kept him from proposing this for mainline. If it was
just parser issues, those are probably more or less solved now, at least
for people using xapian 1.3+
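
As a minimal sketch of what "postprocess with a regexp" means here,
assuming the index query has already produced a list of candidate
documents together with their stored subject: Candidate and
filter_by_regexp are hypothetical names, and a plain vector stands in
for the real Xapian result set.

```cpp
#include <regex.h>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a document returned by an index query:
// a document id plus the literal subject stored with it.
typedef std::pair<unsigned, std::string> Candidate;

// Keep only the candidates whose stored subject matches `pattern`
// (POSIX extended syntax). Throws if the pattern does not compile.
std::vector<unsigned>
filter_by_regexp (const std::vector<Candidate> &candidates, const std::string &pattern)
{
    regex_t re;
    int r = regcomp (&re, pattern.c_str (), REG_EXTENDED | REG_NOSUB);
    if (r != 0) {
        // regerror turns the regcomp error code into a readable message.
        char msg[256];
        regerror (r, &re, msg, sizeof (msg));
        throw std::runtime_error (msg);
    }

    std::vector<unsigned> out;
    for (const Candidate &c : candidates)
        if (regexec (&re, c.second.c_str (), 0, nullptr, 0) == 0)
            out.push_back (c.first);
    regfree (&re);
    return out;
}
```

Every candidate is tested in turn, so the cost grows with the number of
documents the regexp has to look at rather than with the number of
matches.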

d


* Re: searching: '*analysis' vs 'reanalysis'
  2016-06-06 17:29       ` David Bremner
@ 2016-06-06 19:20         ` Austin Clements
  2016-06-06 20:08           ` Gaute Hope
  2016-06-07  2:05           ` [PATCH] WIP: regexp matching in subjects David Bremner
  0 siblings, 2 replies; 21+ messages in thread
From: Austin Clements @ 2016-06-06 19:20 UTC (permalink / raw)
  To: David Bremner; +Cc: sfischme, Gaute Hope, notmuch


On Mon, Jun 6, 2016 at 1:29 PM, David Bremner <david@tethera.net> wrote:

> Sebastian Fischmeister <sfischme@uwaterloo.ca> writes:
>
> >
> > I ran into this problem before as well. Storage is cheap. Notmuch could
> > index all emails with reversed text to get around some of this
> > problem. It doesn't solve the problem of *analysis*, but it's still an
> > improvement.
>
> It would probably be more useful to have brute force regexp searches on
> headers.  Austin did some experiments that sounded promising, where you
> basically postprocess the result of a xapian query with a regexp. OTOH,
> I don't know what kept him from proposing this for mainline. If it was
> just parser issues, those are probably more or less solved now, at least
> for people using xapian 1.3+
>

The experiment was specifically for regexp matching subject, but it should
work for any header we store a literal copy of in the database. The code is
here, though in its current form it builds on my custom query parser:
https://github.com/aclements/notmuch/commit/ce41b29aba4d9b84e2f1eb6ed8df67065196c960.
Based on my understanding of Xapian 1.3+ field processors, these days it
should be quite easy to hook the PostingSource in that commit into the
Xapian QueryProcessor.



* Re: searching: '*analysis' vs 'reanalysis'
  2016-06-06 19:20         ` Austin Clements
@ 2016-06-06 20:08           ` Gaute Hope
  2016-06-06 20:22             ` Austin Clements
  2016-06-07  2:05           ` [PATCH] WIP: regexp matching in subjects David Bremner
  1 sibling, 1 reply; 21+ messages in thread
From: Gaute Hope @ 2016-06-06 20:08 UTC (permalink / raw)
  To: Austin Clements, David Bremner; +Cc: sfischme, notmuch

Austin Clements writes on June 6, 2016 21:20:
>
> The experiment was specifically for regexp matching subject, but it should
> work for any header we store a literal copy of in the database.

Does it work for terms in the body of the message?



* Re: searching: '*analysis' vs 'reanalysis'
  2016-06-06 20:08           ` Gaute Hope
@ 2016-06-06 20:22             ` Austin Clements
  0 siblings, 0 replies; 21+ messages in thread
From: Austin Clements @ 2016-06-06 20:22 UTC (permalink / raw)
  To: Gaute Hope; +Cc: David Bremner, sfischme, notmuch

Quoth Gaute Hope on Jun 06 at  8:08 pm:
> Austin Clements writes on June 6, 2016 21:20:
> >
> >The experiment was specifically for regexp matching subject, but it should
> >work for any header we store a literal copy of in the database.
> 
> Does it work for terms in the body of the message?

No. It's not impossible that it could be made to work, but it might be
slow and unintuitive. It would have to iterate over all of the terms
in the database and see which ones match the regexp. These are
available, but I don't know how much time it takes to iterate over all
of them. It might be okay. It might not.

It could also expand to a very large query if the regexp matches many
terms, akin to how searching for "a*" can be quite expensive.

And it might not match what you expect. It could only match individual
terms, so a regexp containing any punctuation (including but not
limited to a space) simply wouldn't match anything.
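
The last point can be sketched as follows. This is only an
illustration under simplifying assumptions: terms_of is a hypothetical
tokenizer that splits on anything non-alphanumeric and lowercases,
which is only roughly how body text ends up as terms, and
count_matching_terms is likewise a made-up helper.

```cpp
#include <cctype>
#include <cstddef>
#include <regex.h>
#include <set>
#include <string>

// Hypothetical tokenizer: split text into lowercase alphanumeric terms,
// roughly how a term index sees a message body.
std::set<std::string>
terms_of (const std::string &text)
{
    std::set<std::string> terms;
    std::string cur;
    for (unsigned char ch : text) {
        if (std::isalnum (ch)) {
            cur += static_cast<char> (std::tolower (ch));
        } else if (! cur.empty ()) {
            terms.insert (cur);
            cur.clear ();
        }
    }
    if (! cur.empty ())
        terms.insert (cur);
    return terms;
}

// Count the terms matched by `pattern`. Each term is tested on its own,
// so a pattern can never span two terms. An invalid pattern counts as
// matching nothing.
std::size_t
count_matching_terms (const std::set<std::string> &terms, const std::string &pattern)
{
    regex_t re;
    if (regcomp (&re, pattern.c_str (), REG_EXTENDED | REG_NOSUB) != 0)
        return 0;
    std::size_t n = 0;
    for (const std::string &t : terms)
        if (regexec (&re, t.c_str (), 0, nullptr, 0) == 0)
            ++n;
    regfree (&re);
    return n;
}
```

For the text "The Graduate Committee met." the pattern "^grad" matches
one term, but "graduate committee" matches none, because no single
term contains a space.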


* [PATCH] WIP: regexp matching in subjects
  2016-06-06 19:20         ` Austin Clements
  2016-06-06 20:08           ` Gaute Hope
@ 2016-06-07  2:05           ` David Bremner
  2016-06-07 10:16             ` David Bremner
  2016-06-10  2:28             ` [PATCH] WIP: regexp matching in 'subject' and 'from' David Bremner
  1 sibling, 2 replies; 21+ messages in thread
From: David Bremner @ 2016-06-07  2:05 UTC (permalink / raw)
  To: Austin Clements, David Bremner; +Cc: sfischme, Gaute Hope, notmuch

the idea is that you can run

% notmuch search 'subject:rx:<your-favourite-regexp>'

or

% notmuch search subject:"your usual phrase search"

This should also work with bindings.
---

Here is Austin's "hack", crammed into the field processor framework.
I seem to have broken one of the existing subject search tests with my
recursive query parsing. I didn't have time to figure out why, yet.

 lib/Makefile.local     |  2 ++
 lib/database-private.h |  1 +
 lib/database.cc        |  5 +++
 lib/regexp-ps.cc       | 92 ++++++++++++++++++++++++++++++++++++++++++++++++++
 lib/regexp-ps.h        | 37 ++++++++++++++++++++
 lib/subject-fp.cc      | 41 ++++++++++++++++++++++
 lib/subject-fp.h       | 43 +++++++++++++++++++++++
 7 files changed, 221 insertions(+)
 create mode 100644 lib/regexp-ps.cc
 create mode 100644 lib/regexp-ps.h
 create mode 100644 lib/subject-fp.cc
 create mode 100644 lib/subject-fp.h

diff --git a/lib/Makefile.local b/lib/Makefile.local
index beb9635..0e7311f 100644
--- a/lib/Makefile.local
+++ b/lib/Makefile.local
@@ -51,6 +51,8 @@ libnotmuch_cxx_srcs =		\
 	$(dir)/query.cc		\
 	$(dir)/query-fp.cc      \
 	$(dir)/config.cc	\
+	$(dir)/regexp-ps.cc     \
+	$(dir)/subject-fp.cc    \
 	$(dir)/thread.cc
 
 libnotmuch_modules := $(libnotmuch_c_srcs:.c=.o) $(libnotmuch_cxx_srcs:.cc=.o)
diff --git a/lib/database-private.h b/lib/database-private.h
index ca71a92..5de0b81 100644
--- a/lib/database-private.h
+++ b/lib/database-private.h
@@ -186,6 +186,7 @@ struct _notmuch_database {
 #if HAVE_XAPIAN_FIELD_PROCESSOR
     Xapian::FieldProcessor *date_field_processor;
     Xapian::FieldProcessor *query_field_processor;
+    Xapian::FieldProcessor *subject_field_processor;
 #endif
     Xapian::ValueRangeProcessor *last_mod_range_processor;
 };
diff --git a/lib/database.cc b/lib/database.cc
index 86bf261..adfbb81 100644
--- a/lib/database.cc
+++ b/lib/database.cc
@@ -21,6 +21,7 @@
 #include "database-private.h"
 #include "parse-time-vrp.h"
 #include "query-fp.h"
+#include "subject-fp.h"
 #include "string-util.h"
 
 #include <iostream>
@@ -1008,6 +1009,8 @@ notmuch_database_open_verbose (const char *path,
 	notmuch->query_parser->add_boolean_prefix("date", notmuch->date_field_processor);
 	notmuch->query_field_processor = new QueryFieldProcessor (*notmuch->query_parser, notmuch);
 	notmuch->query_parser->add_boolean_prefix("query", notmuch->query_field_processor);
+	notmuch->subject_field_processor = new SubjectFieldProcessor (*notmuch->query_parser, notmuch);
+	notmuch->query_parser->add_boolean_prefix("subject", notmuch->subject_field_processor);
 #endif
 	notmuch->last_mod_range_processor = new Xapian::NumberValueRangeProcessor (NOTMUCH_VALUE_LAST_MOD, "lastmod:");
 
@@ -1027,6 +1030,8 @@ notmuch_database_open_verbose (const char *path,
 
 	for (i = 0; i < ARRAY_SIZE (PROBABILISTIC_PREFIX); i++) {
 	    prefix_t *prefix = &PROBABILISTIC_PREFIX[i];
+	    if (strcmp (prefix->name, "subject") == 0)
+		continue;
 	    notmuch->query_parser->add_prefix (prefix->name, prefix->prefix);
 	}
     } catch (const Xapian::Error &error) {
diff --git a/lib/regexp-ps.cc b/lib/regexp-ps.cc
new file mode 100644
index 0000000..540c7d6
--- /dev/null
+++ b/lib/regexp-ps.cc
@@ -0,0 +1,92 @@
+/* regexp-ps.cc - posting source for regexp matching on value slots
+ *
+ * This file is part of notmuch.
+ *
+ * Copyright © 2016 David Bremner
+ *
+ * This program is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see https://www.gnu.org/licenses/ .
+ *
+ * Author: Austin Clements <aclements@csail.mit.edu>
+ *                David Bremner <david@tethera.net>
+ */
+
+#include "regexp-ps.h"
+
+RegexpPostingSource::RegexpPostingSource (Xapian::valueno slot, const std::string &regexp)
+    : slot_ (slot)
+{
+    int r = regcomp (&regexp_, regexp.c_str (), REG_EXTENDED | REG_NOSUB);
+
+    if (r != 0)
+	/* XXX Report a query syntax error using regerror */
+	throw "regcomp failed";
+}
+
+RegexpPostingSource::~RegexpPostingSource ()
+{
+    regfree (&regexp_);
+}
+
+void
+RegexpPostingSource::init (const Xapian::Database &db)
+{
+    db_ = db;
+    it_ = db_.valuestream_begin (slot_);
+    end_ = db.valuestream_end (slot_);
+    started_ = false;
+}
+
+Xapian::doccount
+RegexpPostingSource::get_termfreq_min () const
+{
+    return 0;
+}
+
+Xapian::doccount
+RegexpPostingSource::get_termfreq_est () const
+{
+    return get_termfreq_max () / 2;
+}
+
+Xapian::doccount
+RegexpPostingSource::get_termfreq_max () const
+{
+    return db_.get_value_freq (slot_);
+}
+
+Xapian::docid
+RegexpPostingSource::get_docid () const
+{
+    return it_.get_docid ();
+}
+
+bool
+RegexpPostingSource::at_end () const
+{
+    return it_ == end_;
+}
+
+void
+RegexpPostingSource::next (unused (double min_wt))
+{
+    if (started_ && ! at_end ())
+	++it_;
+    started_ = true;
+
+    for (; ! at_end (); ++it_) {
+	std::string value = *it_;
+	if (regexec (&regexp_, value.c_str (), 0, NULL, 0) == 0)
+	    break;
+    }
+}
diff --git a/lib/regexp-ps.h b/lib/regexp-ps.h
new file mode 100644
index 0000000..a4553a7
--- /dev/null
+++ b/lib/regexp-ps.h
@@ -0,0 +1,37 @@
+#ifndef NOTMUCH_REGEX_PS_H
+#define NOTMUCH_REGEX_PS_H
+
+#include <sys/types.h>
+#include <regex.h>
+#include <xapian.h>
+#include "notmuch-private.h"
+
+/* A posting source that returns documents where a value matches a
+ * regexp.
+ */
+class RegexpPostingSource : public Xapian::PostingSource
+{
+ protected:
+    const Xapian::valueno slot_;
+    regex_t regexp_;
+    Xapian::Database db_;
+    bool started_;
+    Xapian::ValueIterator it_, end_;
+
+    /* No copying */
+    RegexpPostingSource (const RegexpPostingSource &);
+    RegexpPostingSource &operator= (const RegexpPostingSource &);
+
+ public:
+    RegexpPostingSource (Xapian::valueno slot, const std::string &regexp);
+    ~RegexpPostingSource ();
+    void init (const Xapian::Database &db);
+    Xapian::doccount get_termfreq_min () const;
+    Xapian::doccount get_termfreq_est () const;
+    Xapian::doccount get_termfreq_max () const;
+    Xapian::docid get_docid () const;
+    bool at_end () const;
+    void next (unused (double min_wt));
+};
+
+#endif
diff --git a/lib/subject-fp.cc b/lib/subject-fp.cc
new file mode 100644
index 0000000..1627721
--- /dev/null
+++ b/lib/subject-fp.cc
@@ -0,0 +1,41 @@
+/* subject-fp.cc - "subject:" field processor glue
+ *
+ * This file is part of notmuch.
+ *
+ * Copyright © 2016 David Bremner
+ *
+ * This program is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see https://www.gnu.org/licenses/ .
+ *
+ * Author: David Bremner <david@tethera.net>
+ */
+
+#include "database-private.h"
+#include "subject-fp.h"
+#include <iostream>
+
+#if HAVE_XAPIAN_FIELD_PROCESSOR
+
+Xapian::Query
+SubjectFieldProcessor::operator() (const std::string & str)
+{
+    std::string prefix = "rx:";
+
+    if (str.compare(0,prefix.size(),prefix)==0) {
+	postings = new RegexpPostingSource(NOTMUCH_VALUE_SUBJECT, str.substr(prefix.size()));
+	return Xapian::Query(postings);
+    } else {
+	return parser.parse_query (str, NOTMUCH_QUERY_PARSER_FLAGS, _find_prefix ("subject"));
+    }
+}
+#endif
diff --git a/lib/subject-fp.h b/lib/subject-fp.h
new file mode 100644
index 0000000..ca622ba
--- /dev/null
+++ b/lib/subject-fp.h
@@ -0,0 +1,43 @@
+/* subject-fp.h - subject field processor glue
+ *
+ * This file is part of notmuch.
+ *
+ * Copyright © 2016 David Bremner
+ *
+ * This program is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see https://www.gnu.org/licenses/ .
+ *
+ * Author: David Bremner <david@tethera.net>
+ */
+
+#ifndef NOTMUCH_SUBJECT_FP_H
+#define NOTMUCH_SUBJECT_FP_H
+
+#include <xapian.h>
+#include "notmuch.h"
+#include "regexp-ps.h"
+
+#if HAVE_XAPIAN_FIELD_PROCESSOR
+class SubjectFieldProcessor : public Xapian::FieldProcessor {
+ protected:
+    Xapian::QueryParser &parser;
+    notmuch_database_t *notmuch;
+    RegexpPostingSource *postings = NULL;
+ public:
+    SubjectFieldProcessor (Xapian::QueryParser &parser_, notmuch_database_t *notmuch_)
+	: parser(parser_), notmuch(notmuch_) { };
+
+    Xapian::Query operator()(const std::string & str);
+};
+#endif
+#endif /* NOTMUCH_SUBJECT_FP_H */
-- 
2.8.1


* Re: [PATCH] WIP: regexp matching in subjects
  2016-06-07  2:05           ` [PATCH] WIP: regexp matching in subjects David Bremner
@ 2016-06-07 10:16             ` David Bremner
  2016-06-10  2:28             ` [PATCH] WIP: regexp matching in 'subject' and 'from' David Bremner
  1 sibling, 0 replies; 21+ messages in thread
From: David Bremner @ 2016-06-07 10:16 UTC (permalink / raw)
  To: Austin Clements; +Cc: sfischme, Gaute Hope, notmuch

David Bremner <david@tethera.net> writes:

> the idea is that you can run
>
> % notmuch search 'subject:rx:<your-favourite-regexp>'
>
> or
>
> % notmuch search subject:"your usual phrase search"
>
> This should also work with bindings.
> ---
>
> Here is Austin's "hack", crammed into the field processor framework.
> I seem to have broken one of the existing subject search tests with my
> recursive query parsing. I didn't have time to figure out why, yet.

A few hours sleep and I think I understand the issue.

with

        subject:"your usual phrase search"

I believe the phrase-ness (word ordering and proximity) gets lost when
using a field processor. So, since I don't know if/when this issue will
be fixed in Xapian, we should probably use a separate prefix for regexp
search. This leads to two potential syntaxes

        subject_re:"^i am first"

        regexp:subject:"^i am first"

If we did stick with the current syntax, one could use

   subject:your-usual-phrase-search

to preserve phraseness. Note that for regexps with spaces in them, the
required quoting is a bit counterintuitive

 ./notmuch  count 'subject:"rx:Graduate Committee"'

That's a consequence of my faking "sub-prefixes". Which seemed clever at
the time, but maybe not.

d


* [PATCH] WIP: regexp matching in 'subject' and 'from'
  2016-06-07  2:05           ` [PATCH] WIP: regexp matching in subjects David Bremner
  2016-06-07 10:16             ` David Bremner
@ 2016-06-10  2:28             ` David Bremner
  2016-06-10  2:42               ` David Bremner
                                 ` (2 more replies)
  1 sibling, 3 replies; 21+ messages in thread
From: David Bremner @ 2016-06-10  2:28 UTC (permalink / raw)
  To: David Bremner, Austin Clements; +Cc: sfischme, Gaute Hope, notmuch

the idea is that you can run

% notmuch search subject_re:<your-favourite-regexp>
% notmuch search from_re:<your-favourite-regexp>

or

% notmuch search subject:"your usual phrase search"
% notmuch search from:"usual phrase search"

This should also work with bindings, since it extends the query parser.

This is trivial to extend for other value slots, but currently the only
value slots are date, message_id, from, subject, and last_mod. Date is
already searchable, and message_id is not obviously useful to regex
match.
---

This is more or less complete codewise; it fixes the known problems
with the last version. Names of prefixes are debatable, and of course
it needs doc and tests.  I don't see any reason not to do this at the moment,
since it's basically free; no new terms are added to the database.

 lib/Makefile.local     |   1 +
 lib/database-private.h |   2 +
 lib/database.cc        |  12 +++++-
 lib/regexp-fields.cc   | 102 +++++++++++++++++++++++++++++++++++++++++++++++++
 lib/regexp-fields.h    |  78 +++++++++++++++++++++++++++++++++++++
 5 files changed, 194 insertions(+), 1 deletion(-)
 create mode 100644 lib/regexp-fields.cc
 create mode 100644 lib/regexp-fields.h

diff --git a/lib/Makefile.local b/lib/Makefile.local
index beb9635..68771e6 100644
--- a/lib/Makefile.local
+++ b/lib/Makefile.local
@@ -51,6 +51,7 @@ libnotmuch_cxx_srcs =		\
 	$(dir)/query.cc		\
 	$(dir)/query-fp.cc      \
 	$(dir)/config.cc	\
+	$(dir)/regexp-fields.cc     \
 	$(dir)/thread.cc
 
 libnotmuch_modules := $(libnotmuch_c_srcs:.c=.o) $(libnotmuch_cxx_srcs:.cc=.o)
diff --git a/lib/database-private.h b/lib/database-private.h
index ca71a92..090fcdf 100644
--- a/lib/database-private.h
+++ b/lib/database-private.h
@@ -186,6 +186,8 @@ struct _notmuch_database {
 #if HAVE_XAPIAN_FIELD_PROCESSOR
     Xapian::FieldProcessor *date_field_processor;
     Xapian::FieldProcessor *query_field_processor;
+    Xapian::FieldProcessor *subject_re_field_processor;
+    Xapian::FieldProcessor *from_re_field_processor;
 #endif
     Xapian::ValueRangeProcessor *last_mod_range_processor;
 };
diff --git a/lib/database.cc b/lib/database.cc
index 86bf261..4049406 100644
--- a/lib/database.cc
+++ b/lib/database.cc
@@ -21,6 +21,7 @@
 #include "database-private.h"
 #include "parse-time-vrp.h"
 #include "query-fp.h"
+#include "regexp-fields.h"
 #include "string-util.h"
 
 #include <iostream>
@@ -1008,6 +1009,10 @@ notmuch_database_open_verbose (const char *path,
 	notmuch->query_parser->add_boolean_prefix("date", notmuch->date_field_processor);
 	notmuch->query_field_processor = new QueryFieldProcessor (*notmuch->query_parser, notmuch);
 	notmuch->query_parser->add_boolean_prefix("query", notmuch->query_field_processor);
+	notmuch->subject_re_field_processor = new RegexpFieldProcessor (NOTMUCH_VALUE_SUBJECT, *notmuch->query_parser, notmuch);
+	notmuch->query_parser->add_boolean_prefix("subject_re", notmuch->subject_re_field_processor);
+	notmuch->from_re_field_processor = new RegexpFieldProcessor (NOTMUCH_VALUE_FROM, *notmuch->query_parser, notmuch);
+	notmuch->query_parser->add_boolean_prefix("from_re", notmuch->from_re_field_processor);
 #endif
 	notmuch->last_mod_range_processor = new Xapian::NumberValueRangeProcessor (NOTMUCH_VALUE_LAST_MOD, "lastmod:");
 
@@ -1098,7 +1103,12 @@ notmuch_database_close (notmuch_database_t *notmuch)
     notmuch->date_range_processor = NULL;
     delete notmuch->last_mod_range_processor;
     notmuch->last_mod_range_processor = NULL;
-
+#ifdef HAVE_XAPIAN_FIELD_PROCESSOR
+    delete notmuch->from_re_field_processor;
+    notmuch->from_re_field_processor = NULL;
+    delete notmuch->subject_re_field_processor;
+    notmuch->subject_re_field_processor = NULL;
+#endif
     return status;
 }
 
diff --git a/lib/regexp-fields.cc b/lib/regexp-fields.cc
new file mode 100644
index 0000000..4bbebda
--- /dev/null
+++ b/lib/regexp-fields.cc
@@ -0,0 +1,102 @@
+/* regexp-fields.cc - field processor glue for regexp searching
+ *
+ * This file is part of notmuch.
+ *
+ * Copyright © 2015 Austin Clements
+ * Copyright © 2016 David Bremner
+ *
+ * This program is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see https://www.gnu.org/licenses/ .
+ *
+ * Author: Austin Clements <aclements@csail.mit.edu>
+ *                David Bremner <david@tethera.net>
+ */
+
+#include "regexp-fields.h"
+
+#ifdef HAVE_XAPIAN_FIELD_PROCESSOR
+RegexpPostingSource::RegexpPostingSource (Xapian::valueno slot, const std::string &regexp)
+    : slot_ (slot)
+{
+    int r = regcomp (&regexp_, regexp.c_str (), REG_EXTENDED | REG_NOSUB);
+
+    if (r != 0)
+	/* XXX Report a query syntax error using regerror */
+	throw "regcomp failed";
+}
+
+RegexpPostingSource::~RegexpPostingSource ()
+{
+    regfree (&regexp_);
+}
+
+void
+RegexpPostingSource::init (const Xapian::Database &db)
+{
+    db_ = db;
+    it_ = db_.valuestream_begin (slot_);
+    end_ = db.valuestream_end (slot_);
+    started_ = false;
+}
+
+Xapian::doccount
+RegexpPostingSource::get_termfreq_min () const
+{
+    return 0;
+}
+
+Xapian::doccount
+RegexpPostingSource::get_termfreq_est () const
+{
+    return get_termfreq_max () / 2;
+}
+
+Xapian::doccount
+RegexpPostingSource::get_termfreq_max () const
+{
+    return db_.get_value_freq (slot_);
+}
+
+Xapian::docid
+RegexpPostingSource::get_docid () const
+{
+    return it_.get_docid ();
+}
+
+bool
+RegexpPostingSource::at_end () const
+{
+    return it_ == end_;
+}
+
+void
+RegexpPostingSource::next (unused (double min_wt))
+{
+    if (started_ && ! at_end ())
+	++it_;
+    started_ = true;
+
+    for (; ! at_end (); ++it_) {
+	std::string value = *it_;
+	if (regexec (&regexp_, value.c_str (), 0, NULL, 0) == 0)
+	    break;
+    }
+}
+
+Xapian::Query
+RegexpFieldProcessor::operator() (const std::string & str)
+{
+    postings = new RegexpPostingSource (slot, str);
+    return Xapian::Query (postings);
+}
+#endif
diff --git a/lib/regexp-fields.h b/lib/regexp-fields.h
new file mode 100644
index 0000000..a184cab
--- /dev/null
+++ b/lib/regexp-fields.h
@@ -0,0 +1,78 @@
+/* regexp-fields.h - xapian glue for semi-bruteforce regexp search
+ *
+ * This file is part of notmuch.
+ *
+ * Copyright © 2015 Austin Clements
+ * Copyright © 2016 David Bremner
+ *
+ * This program is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see https://www.gnu.org/licenses/ .
+ *
+ * Author: Austin Clements <aclements@csail.mit.edu>
+ *                David Bremner <david@tethera.net>
+ */
+
+#ifndef NOTMUCH_REGEXP_FIELDS_H
+#define NOTMUCH_REGEXP_FIELDS_H
+#ifdef HAVE_XAPIAN_FIELD_PROCESSOR
+#include <sys/types.h>
+#include <regex.h>
+#include <xapian.h>
+#include "notmuch-private.h"
+
+/* A posting source that returns documents where a value matches a
+ * regexp.
+ */
+class RegexpPostingSource : public Xapian::PostingSource
+{
+ protected:
+    const Xapian::valueno slot_;
+    regex_t regexp_;
+    Xapian::Database db_;
+    bool started_;
+    Xapian::ValueIterator it_, end_;
+
+/* No copying */
+    RegexpPostingSource (const RegexpPostingSource &);
+    RegexpPostingSource &operator= (const RegexpPostingSource &);
+
+ public:
+    RegexpPostingSource (Xapian::valueno slot, const std::string &regexp);
+    ~RegexpPostingSource ();
+    void init (const Xapian::Database &db);
+    Xapian::doccount get_termfreq_min () const;
+    Xapian::doccount get_termfreq_est () const;
+    Xapian::doccount get_termfreq_max () const;
+    Xapian::docid get_docid () const;
+    bool at_end () const;
+    void next (unused (double min_wt));
+};
+
+
+class RegexpFieldProcessor : public Xapian::FieldProcessor {
+ protected:
+    Xapian::valueno slot;
+    Xapian::QueryParser &parser;
+    notmuch_database_t *notmuch;
+    RegexpPostingSource *postings = NULL;
+
+ public:
+    RegexpFieldProcessor (Xapian::valueno slot_, Xapian::QueryParser &parser_, notmuch_database_t *notmuch_)
+	: slot(slot_), parser(parser_), notmuch(notmuch_) { };
+
+    ~RegexpFieldProcessor () { delete postings; };
+
+    Xapian::Query operator()(const std::string & str);
+};
+#endif
+#endif /* NOTMUCH_REGEXP_FIELDS_H */
-- 
2.8.1


* Re: [PATCH] WIP: regexp matching in 'subject' and 'from'
  2016-06-10  2:28             ` [PATCH] WIP: regexp matching in 'subject' and 'from' David Bremner
@ 2016-06-10  2:42               ` David Bremner
  2016-06-10 11:11                 ` Tomi Ollila
  2016-06-10  8:38               ` Gaute Hope
  2016-06-11  1:49               ` David Bremner
  2 siblings, 1 reply; 21+ messages in thread
From: David Bremner @ 2016-06-10  2:42 UTC (permalink / raw)
  To: notmuch

David Bremner <david@tethera.net> writes:

> +#ifdef HAVE_XAPIAN_FIELD_PROCESSOR
> +    delete notmuch->from_re_field_processor;
> +    notmuch->from_re_field_processor = NULL;
> +    delete notmuch->subject_re_field_processor;
> +    notmuch->subject_re_field_processor = NULL;
> +#endif

and of course everywhere it says #ifdef HAVE_XAPIAN_FIELD_PROCESSOR, it
should say #if.
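
The difference matters when the macro is always defined and its value
is what carries the information. A small sketch, assuming a configure
step that always defines the macro as either 0 or 1; FEATURE_ON and
FEATURE_OFF are hypothetical stand-ins, not notmuch macros.

```cpp
// Hypothetical feature macros, defined the way a configure script that
// always emits 0 or 1 might define them.
#define FEATURE_ON 1
#define FEATURE_OFF 0

// #ifdef only asks "is the macro defined?", so it is true even for 0.
#ifdef FEATURE_OFF
constexpr bool ifdef_sees_off = true;
#else
constexpr bool ifdef_sees_off = false;
#endif

// #if evaluates the value, so a macro defined to 0 disables the code.
#if FEATURE_OFF
constexpr bool if_sees_off = true;
#else
constexpr bool if_sees_off = false;
#endif

// A macro defined to 1 enables the code under #if.
#if FEATURE_ON
constexpr bool if_sees_on = true;
#else
constexpr bool if_sees_on = false;
#endif
```

With #ifdef, the guarded code would be compiled even when the feature
is reported as unavailable.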

d


* Re: [PATCH] WIP: regexp matching in 'subject' and 'from'
  2016-06-10  2:28             ` [PATCH] WIP: regexp matching in 'subject' and 'from' David Bremner
  2016-06-10  2:42               ` David Bremner
@ 2016-06-10  8:38               ` Gaute Hope
  2016-06-10 11:09                 ` David Bremner
  2016-06-11  1:49               ` David Bremner
  2 siblings, 1 reply; 21+ messages in thread
From: Gaute Hope @ 2016-06-10  8:38 UTC (permalink / raw)
  To: David Bremner, Austin Clements; +Cc: sfischme, notmuch

David Bremner writes on June 10, 2016 4:28:
> the idea is that you can run
> 
> % notmuch search subject_re:<your-favourite-regexp>
> % notmuch search from_re:<your-favourite-regexp>
> 
> or
> 
> % notmuch search subject:"your usual phrase search"
> % notmuch search from:"usual phrase search"
> 
> This should also work with bindings, since it extends the query parser.
> 
> This is trivial to extend for other value slots, but currently the only
> value slots are date, message_id, from, subject, and last_mod. Date is
> already searchable, and message_id is not obviously useful to regex
> match.
> ---
> 
> This is more or less complete codewise, it fixes the known problems
> with the last version. Names of prefixes are debatable, and of course
> it needs doc and tests.  I don't see any reason not to do this at the moment,
> since it's basically free; no new terms are added to the database.

Cool!

Would it break a lot of things if you just replace the original prefix?

Could it be made to work on the message body?

Regards, Gaute


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] WIP: regexp matching in 'subject' and 'from'
  2016-06-10  8:38               ` Gaute Hope
@ 2016-06-10 11:09                 ` David Bremner
  2016-06-11 16:32                   ` Gaute Hope
  0 siblings, 1 reply; 21+ messages in thread
From: David Bremner @ 2016-06-10 11:09 UTC (permalink / raw)
  To: Gaute Hope, Austin Clements; +Cc: sfischme, notmuch

Gaute Hope <eg@gaute.vetsj.com> writes:

>
> Cool!
>
> Would it break a lot of things if you just replace the original prefix?

It would change the matching behaviour. I guess there are people that
like the current "sloppy" matching of from: and subject:.  In my
not-very-scientific tests, it is a factor of 5 to 10 times slower to do
regexp search, which makes sense because it is effectively post
processing the results from Xapian. At least on my system it seems fast
enough to be usable interactively, but that is a pretty shocking
performance regression. And I know there are people with more mail on
slower systems.

> Could it be made to work on the message body?

See Austin's previous reply for the details, but basically no; these
"values" are stored as whole strings, while the body is indexed by
terms (roughly, words). In principle we could add a value slot for the
body, but I think that would at least double the size of the database
(maybe more).
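
A rough illustration of the difference, as a toy sketch only: this is not
notmuch or Xapian code. The names toy_terms, any_term_has_prefix and
value_matches_regexp are made up for the example, and the tokenizer is far
simpler than what Xapian really does. The point is that a term lookup can
only anchor at the start of a word, while a regexp run over the whole stored
string can match anywhere in it.

```cpp
#include <regex.h>

#include <cctype>
#include <string>
#include <vector>

// Toy tokenizer (an assumption for illustration): split text into lowercase
// alphanumeric words, roughly what "indexed by terms" means.
std::vector<std::string> toy_terms (const std::string &text)
{
    std::vector<std::string> terms;
    std::string current;

    for (unsigned char c : text) {
        if (std::isalnum (c)) {
            current += static_cast<char> (std::tolower (c));
        } else if (! current.empty ()) {
            terms.push_back (current);
            current.clear ();
        }
    }
    if (! current.empty ())
        terms.push_back (current);
    return terms;
}

// Trailing-wildcard style lookup ("prefix*"): true if any term starts with
// the given prefix. There is no way to ask for "*suffix" here.
bool any_term_has_prefix (const std::vector<std::string> &terms,
                          const std::string &prefix)
{
    for (const std::string &term : terms)
        if (term.compare (0, prefix.size (), prefix) == 0)
            return true;
    return false;
}

// Whole-string match with POSIX extended regexps, using the same
// regcomp/regexec flags as the patch. Returns false if the pattern
// does not compile.
bool value_matches_regexp (const std::string &value, const std::string &pattern)
{
    regex_t re;

    if (regcomp (&re, pattern.c_str (), REG_EXTENDED | REG_NOSUB) != 0)
        return false;
    const bool matched = regexec (&re, value.c_str (), 0, nullptr, 0) == 0;
    regfree (&re);
    return matched;
}
```

With the subject "Re: reanalysis of the data", a prefix lookup for
"analysis" finds nothing because no term starts with it, while the regexp
"analysis" matches the stored string.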

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] WIP: regexp matching in 'subject' and 'from'
  2016-06-10  2:42               ` David Bremner
@ 2016-06-10 11:11                 ` Tomi Ollila
  2016-06-10 11:50                   ` David Bremner
  0 siblings, 1 reply; 21+ messages in thread
From: Tomi Ollila @ 2016-06-10 11:11 UTC (permalink / raw)
  To: David Bremner, notmuch

On Fri, Jun 10 2016, David Bremner <david@tethera.net> wrote:

> David Bremner <david@tethera.net> writes:
>
>> +#ifdef HAVE_XAPIAN_FIELD_PROCESSOR
>> +    delete notmuch->from_re_field_processor;
>> +    notmuch->from_re_field_processor = NULL;
>> +    delete notmuch->subject_re_field_processor;
>> +    notmuch->subject_re_field_processor = NULL;
>> +#endif
>
> and of course everywhere it says #ifdef HAVE_XAPIAN_FIELD_PROCESSOR, it
> should say #if.

... is there a static code analyzer which notices such mistakes... ?

I did mark that version trivial, which goes to show that there is no
such thing as a trivial change ;/


> d

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] WIP: regexp matching in 'subject' and 'from'
  2016-06-10 11:11                 ` Tomi Ollila
@ 2016-06-10 11:50                   ` David Bremner
  0 siblings, 0 replies; 21+ messages in thread
From: David Bremner @ 2016-06-10 11:50 UTC (permalink / raw)
  To: Tomi Ollila, notmuch

Tomi Ollila <tomi.ollila@iki.fi> writes:

> On Fri, Jun 10 2016, David Bremner <david@tethera.net> wrote:
>
>> David Bremner <david@tethera.net> writes:

>> and of course everywhere it says #ifdef HAVE_XAPIAN_FIELD_PROCESSOR, it
>> should say #if.
>
> ... is there a static code analyzer which notices such mistakes... ?

It seems tough for a static analyzer, because the bug is really in
misunderstanding the build system (which always defines that symbol)
rather than the code itself.
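
To make the difference concrete, here is a minimal sketch.
HAVE_EXAMPLE_FEATURE is a hypothetical stand-in, and defining it to 0 when
the feature is missing is an assumption about how such a build system
behaves, not something copied from notmuch's configure script.

```cpp
// Hypothetical feature macro that the build system always defines,
// using 0 to mean "feature not available".
#define HAVE_EXAMPLE_FEATURE 0

// #ifdef only asks whether the macro is defined at all, so the "enabled"
// branch is compiled even though the value is 0.
inline bool enabled_according_to_ifdef ()
{
#ifdef HAVE_EXAMPLE_FEATURE
    return true;
#else
    return false;
#endif
}

// #if evaluates the macro's value, so a macro defined as 0 selects the
// "disabled" branch.
inline bool enabled_according_to_if ()
{
#if HAVE_EXAMPLE_FEATURE
    return true;
#else
    return false;
#endif
}
```

Both versions compile cleanly, which is why nothing flags the mistake; only
the #if version respects the value the build system chose.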

d

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH] WIP: regexp matching in 'subject' and 'from'
  2016-06-10  2:28             ` [PATCH] WIP: regexp matching in 'subject' and 'from' David Bremner
  2016-06-10  2:42               ` David Bremner
  2016-06-10  8:38               ` Gaute Hope
@ 2016-06-11  1:49               ` David Bremner
  2 siblings, 0 replies; 21+ messages in thread
From: David Bremner @ 2016-06-11  1:49 UTC (permalink / raw)
  To: David Bremner; +Cc: notmuch, jani

the idea is that you can run

% notmuch search re:subject:<your-favourite-regexp>
% notmuch search re:from:<your-favourite-regexp>

or

% notmuch search subject:"your usual phrase search"
% notmuch search from:"usual phrase search"

This should also work with bindings, since it extends the query parser.

This is trivial to extend for other value slots, but currently the only
value slots are date, message_id, from, subject, and last_mod. Date is
already searchable, and message_id is not obviously useful to regex
match.
---

After some discussion on IRC, here is a version that uses a single re:
prefix internally, and some examples/tests of how that syntax would
work in practice.
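
For readers skimming the patch, the re: prefix receives everything after
"re:" as a single string and splits it at the first colon into a field name
and a regexp. A minimal standalone sketch of that split follows. The helper
name split_field_and_regexp is hypothetical, and rejecting input with no
colon is an assumption about the desired behaviour; the patch itself does
not check for that case.

```cpp
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical helper: split "field:regexp" at the first colon.
// Later colons stay part of the regexp.
std::pair<std::string, std::string>
split_field_and_regexp (const std::string &str)
{
    const std::string::size_type pos = str.find (':');

    // Assumption: a missing colon is treated as a syntax error.
    if (pos == std::string::npos)
        throw std::invalid_argument ("expected 'field:regexp', got '" + str + "'");

    return std::make_pair (str.substr (0, pos), str.substr (pos + 1));
}
```

So "from:C.* Wo" yields the field "from" and the regexp "C.* Wo", which is
why the tests quote the whole argument as re:"from:C.* Wo".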

 lib/Makefile.local        |   1 +
 lib/database-private.h    |   1 +
 lib/database.cc           |   5 ++
 lib/regexp-fields.cc      | 117 ++++++++++++++++++++++++++++++++++++++++++++++
 lib/regexp-fields.h       |  77 ++++++++++++++++++++++++++++++
 test/T630-regexp-query.sh |  77 ++++++++++++++++++++++++++++++
 6 files changed, 278 insertions(+)
 create mode 100644 lib/regexp-fields.cc
 create mode 100644 lib/regexp-fields.h
 create mode 100755 test/T630-regexp-query.sh

diff --git a/lib/Makefile.local b/lib/Makefile.local
index beb9635..68771e6 100644
--- a/lib/Makefile.local
+++ b/lib/Makefile.local
@@ -51,6 +51,7 @@ libnotmuch_cxx_srcs =		\
 	$(dir)/query.cc		\
 	$(dir)/query-fp.cc      \
 	$(dir)/config.cc	\
+	$(dir)/regexp-fields.cc     \
 	$(dir)/thread.cc
 
 libnotmuch_modules := $(libnotmuch_c_srcs:.c=.o) $(libnotmuch_cxx_srcs:.cc=.o)
diff --git a/lib/database-private.h b/lib/database-private.h
index ca71a92..900a989 100644
--- a/lib/database-private.h
+++ b/lib/database-private.h
@@ -186,6 +186,7 @@ struct _notmuch_database {
 #if HAVE_XAPIAN_FIELD_PROCESSOR
     Xapian::FieldProcessor *date_field_processor;
     Xapian::FieldProcessor *query_field_processor;
+    Xapian::FieldProcessor *re_field_processor;
 #endif
     Xapian::ValueRangeProcessor *last_mod_range_processor;
 };
diff --git a/lib/database.cc b/lib/database.cc
index afafe88..b52b62d 100644
--- a/lib/database.cc
+++ b/lib/database.cc
@@ -21,6 +21,7 @@
 #include "database-private.h"
 #include "parse-time-vrp.h"
 #include "query-fp.h"
+#include "regexp-fields.h"
 #include "string-util.h"
 
 #include <iostream>
@@ -1016,6 +1017,8 @@ notmuch_database_open_verbose (const char *path,
 	notmuch->query_parser->add_boolean_prefix("date", notmuch->date_field_processor);
 	notmuch->query_field_processor = new QueryFieldProcessor (*notmuch->query_parser, notmuch);
 	notmuch->query_parser->add_boolean_prefix("query", notmuch->query_field_processor);
+	notmuch->re_field_processor = new RegexpFieldProcessor (*notmuch->query_parser, notmuch);
+	notmuch->query_parser->add_boolean_prefix("re", notmuch->re_field_processor);
 #endif
 	notmuch->last_mod_range_processor = new Xapian::NumberValueRangeProcessor (NOTMUCH_VALUE_LAST_MOD, "lastmod:");
 
@@ -1112,6 +1115,8 @@ notmuch_database_close (notmuch_database_t *notmuch)
     notmuch->date_field_processor = NULL;
     delete notmuch->query_field_processor;
     notmuch->query_field_processor = NULL;
+    delete notmuch->re_field_processor;
+    notmuch->re_field_processor = NULL;
 #endif
 
     return status;
diff --git a/lib/regexp-fields.cc b/lib/regexp-fields.cc
new file mode 100644
index 0000000..d9d1625
--- /dev/null
+++ b/lib/regexp-fields.cc
@@ -0,0 +1,117 @@
+/* regexp-fields.cc - field processor glue for regexp searching
+ *
+ * This file is part of notmuch.
+ *
+ * Copyright © 2015 Austin Clements
+ * Copyright © 2016 David Bremner
+ *
+ * This program is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see https://www.gnu.org/licenses/ .
+ *
+ * Author: Austin Clements <aclements@csail.mit.edu>
+ *                David Bremner <david@tethera.net>
+ */
+
+#include "regexp-fields.h"
+#include "notmuch-private.h"
+
+#if HAVE_XAPIAN_FIELD_PROCESSOR
+RegexpPostingSource::RegexpPostingSource (Xapian::valueno slot, const std::string &regexp)
+    : slot_ (slot)
+{
+    int r = regcomp (&regexp_, regexp.c_str (), REG_EXTENDED | REG_NOSUB);
+
+    if (r != 0)
+	/* XXX Report a query syntax error using regerror */
+	throw "regcomp failed";
+}
+
+RegexpPostingSource::~RegexpPostingSource ()
+{
+    regfree (&regexp_);
+}
+
+void
+RegexpPostingSource::init (const Xapian::Database &db)
+{
+    db_ = db;
+    it_ = db_.valuestream_begin (slot_);
+    end_ = db.valuestream_end (slot_);
+    started_ = false;
+}
+
+Xapian::doccount
+RegexpPostingSource::get_termfreq_min () const
+{
+    return 0;
+}
+
+Xapian::doccount
+RegexpPostingSource::get_termfreq_est () const
+{
+    return get_termfreq_max () / 2;
+}
+
+Xapian::doccount
+RegexpPostingSource::get_termfreq_max () const
+{
+    return db_.get_value_freq (slot_);
+}
+
+Xapian::docid
+RegexpPostingSource::get_docid () const
+{
+    return it_.get_docid ();
+}
+
+bool
+RegexpPostingSource::at_end () const
+{
+    return it_ == end_;
+}
+
+void
+RegexpPostingSource::next (unused (double min_wt))
+{
+    if (started_ && ! at_end ())
+	++it_;
+    started_ = true;
+
+    for (; ! at_end (); ++it_) {
+	std::string value = *it_;
+	if (regexec (&regexp_, value.c_str (), 0, NULL, 0) == 0)
+	    break;
+    }
+}
+
+static Xapian::valueno
+_find_slot(std::string prefix){
+    if (prefix == "from")
+	return NOTMUCH_VALUE_FROM;
+    else if (prefix == "subject")
+	return NOTMUCH_VALUE_SUBJECT;
+    else
+	throw Xapian::QueryParserError ("unsupported regexp field '" + prefix + "'");
+}
+
+Xapian::Query
+RegexpFieldProcessor::operator() (const std::string & str)
+{
+    size_t pos = str.find_first_of(':');
+    std::string prefix = str.substr(0,pos);
+    std::string regexp = str.substr(pos+1);
+
+    postings = new RegexpPostingSource (_find_slot (prefix), regexp);
+    return Xapian::Query (postings);
+}
+#endif
diff --git a/lib/regexp-fields.h b/lib/regexp-fields.h
new file mode 100644
index 0000000..2c9c2d7
--- /dev/null
+++ b/lib/regexp-fields.h
@@ -0,0 +1,77 @@
+/* regexp-fields.h - xapian glue for semi-bruteforce regexp search
+ *
+ * This file is part of notmuch.
+ *
+ * Copyright © 2015 Austin Clements
+ * Copyright © 2016 David Bremner
+ *
+ * This program is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see https://www.gnu.org/licenses/ .
+ *
+ * Author: Austin Clements <aclements@csail.mit.edu>
+ *                David Bremner <david@tethera.net>
+ */
+
+#ifndef NOTMUCH_REGEXP_FIELDS_H
+#define NOTMUCH_REGEXP_FIELDS_H
+#if HAVE_XAPIAN_FIELD_PROCESSOR
+#include <sys/types.h>
+#include <regex.h>
+#include <xapian.h>
+#include "notmuch-private.h"
+
+/* A posting source that returns documents where a value matches a
+ * regexp.
+ */
+class RegexpPostingSource : public Xapian::PostingSource
+{
+ protected:
+    const Xapian::valueno slot_;
+    regex_t regexp_;
+    Xapian::Database db_;
+    bool started_;
+    Xapian::ValueIterator it_, end_;
+
+/* No copying */
+    RegexpPostingSource (const RegexpPostingSource &);
+    RegexpPostingSource &operator= (const RegexpPostingSource &);
+
+ public:
+    RegexpPostingSource (Xapian::valueno slot, const std::string &regexp);
+    ~RegexpPostingSource ();
+    void init (const Xapian::Database &db);
+    Xapian::doccount get_termfreq_min () const;
+    Xapian::doccount get_termfreq_est () const;
+    Xapian::doccount get_termfreq_max () const;
+    Xapian::docid get_docid () const;
+    bool at_end () const;
+    void next (unused (double min_wt));
+};
+
+
+class RegexpFieldProcessor : public Xapian::FieldProcessor {
+ protected:
+    Xapian::QueryParser &parser;
+    notmuch_database_t *notmuch;
+    RegexpPostingSource *postings = NULL;
+
+ public:
+    RegexpFieldProcessor (Xapian::QueryParser &parser_, notmuch_database_t *notmuch_)
+	: parser(parser_), notmuch(notmuch_) { };
+
+    ~RegexpFieldProcessor () { delete postings; };
+
+    Xapian::Query operator()(const std::string & str);
+};
+#endif
+#endif /* NOTMUCH_REGEXP_FIELDS_H */
diff --git a/test/T630-regexp-query.sh b/test/T630-regexp-query.sh
new file mode 100755
index 0000000..09caed6
--- /dev/null
+++ b/test/T630-regexp-query.sh
@@ -0,0 +1,77 @@
+#!/usr/bin/env bash
+test_description='regular expression searches'
+. ./test-lib.sh || exit 1
+
+QUERYSTR="date:2009-11-18..2009-11-18 and tag:unread"
+QUERYSTR2="query:test and subject:Maildir"
+
+add_email_corpus
+
+
+if [ $NOTMUCH_HAVE_XAPIAN_FIELD_PROCESSOR -eq 1 ]; then
+
+    notmuch search --output=messages from:cworth > cworth.msg-ids
+
+    test_begin_subtest "regexp from search, case sensitive"
+    notmuch search --output=messages re:from:carl > OUTPUT
+    test_expect_equal_file /dev/null OUTPUT
+
+    test_begin_subtest "empty regexp or query"
+    notmuch search --output=messages re:from:carl or from:cworth > OUTPUT
+    test_expect_equal_file cworth.msg-ids OUTPUT
+
+    test_begin_subtest "non-empty regexp and query"
+    notmuch search  re:from:cworth and subject:patch > OUTPUT
+    cat <<EOF > EXPECTED
+thread:0000000000000008   2009-11-18 [1/2] Carl Worth| Alex Botero-Lowry; [notmuch] [PATCH] Error out if no query is supplied to search instead of going into an infinite loop (attachment inbox unread)
+thread:0000000000000007   2009-11-18 [1/2] Carl Worth| Ingmar Vanhassel; [notmuch] [PATCH] Typsos (inbox unread)
+thread:0000000000000018   2009-11-18 [1/2] Carl Worth| Jan Janak; [notmuch] [PATCH] Older versions of install do not support -C. (inbox unread)
+thread:0000000000000017   2009-11-18 [1/2] Carl Worth| Keith Packard; [notmuch] [PATCH] Make notmuch-show 'X' (and 'x') commands remove inbox (and unread) tags (inbox unread)
+thread:0000000000000014   2009-11-18 [2/5] Carl Worth| Mikhail Gusarov, Keith Packard; [notmuch] [PATCH 1/2] Close message file after parsing message headers (inbox unread)
+thread:0000000000000001   2009-11-18 [1/1] Stewart Smith; [notmuch] [PATCH] Fix linking with gcc to use g++ to link in C++ libs. (inbox unread)
+EOF
+    test_expect_equal_file EXPECTED OUTPUT
+
+    test_begin_subtest "regexp from search, duplicate term search"
+    notmuch search --output=messages re:from:cworth > OUTPUT
+    test_expect_equal_file cworth.msg-ids OUTPUT
+
+    test_begin_subtest "long enough regexp matches only desired senders"
+    notmuch search --output=messages 're:"from:C.* Wo"' > OUTPUT
+    test_expect_equal_file cworth.msg-ids OUTPUT
+
+    test_begin_subtest "shorter regexp matches one more sender"
+    notmuch search --output=messages 're:"from:C.* W"' > OUTPUT
+    (echo id:1258544095-16616-1-git-send-email-chris@chris-wilson.co.uk ; cat cworth.msg-ids) > EXPECTED
+    test_expect_equal_file EXPECTED OUTPUT
+
+    test_begin_subtest "regexp subject search, non-ASCII"
+    notmuch search --output=messages re:subject:accentué > OUTPUT
+    echo id:877h1wv7mg.fsf@inf-8657.int-evry.fr > EXPECTED
+    test_expect_equal_file EXPECTED OUTPUT
+
+    test_begin_subtest "regexp subject search, punctuation"
+    notmuch search   re:subject:\'X\' > OUTPUT
+    cat <<EOF > EXPECTED
+thread:0000000000000017   2009-11-18 [2/2] Keith Packard, Carl Worth; [notmuch] [PATCH] Make notmuch-show 'X' (and 'x') commands remove inbox (and unread) tags (inbox unread)
+EOF
+    test_expect_equal_file EXPECTED OUTPUT
+
+    test_begin_subtest "regexp subject search, no punctuation"
+    notmuch search  re:subject:X > OUTPUT
+    cat <<EOF > EXPECTED
+thread:0000000000000017   2009-11-18 [2/2] Keith Packard, Carl Worth; [notmuch] [PATCH] Make notmuch-show 'X' (and 'x') commands remove inbox (and unread) tags (inbox unread)
+thread:000000000000000f   2009-11-18 [4/4] Jjgod Jiang, Alexander Botero-Lowry; [notmuch] Mac OS X/Darwin compatibility issues (inbox unread)
+EOF
+    test_expect_equal_file EXPECTED OUTPUT
+
+    test_begin_subtest "combine regexp from and subject"
+    notmuch search  re:subject:-C and re:from:.an.k > OUTPUT
+    cat <<EOF > EXPECTED
+thread:0000000000000018   2009-11-17 [1/2] Jan Janak| Carl Worth; [notmuch] [PATCH] Older versions of install do not support -C. (inbox unread)
+EOF
+    test_expect_equal_file EXPECTED OUTPUT
+
+fi
+
+test_done
-- 
2.8.1

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH] WIP: regexp matching in 'subject' and 'from'
  2016-06-10 11:09                 ` David Bremner
@ 2016-06-11 16:32                   ` Gaute Hope
  2016-06-11 16:49                     ` David Bremner
  2016-06-11 17:09                     ` Tomi Ollila
  0 siblings, 2 replies; 21+ messages in thread
From: Gaute Hope @ 2016-06-11 16:32 UTC (permalink / raw)
  To: David Bremner, Austin Clements; +Cc: sfischme, notmuch

David Bremner writes on June 10, 2016 13:09:
> Gaute Hope <eg@gaute.vetsj.com> writes:
> 
>>
>> Cool!
>>
>> Would it break a lot of things if you just replace the original prefix?
> 
> It would change the matching behaviour. I guess there are people that
> like the current "sloppy" matching of from: and subject:.  In my
> not-very-scientific tests, it is a factor of 5 to 10 times slower to do
> regexp search, which makes sense because it is effectively post
> processing the results from Xapian. At least on my system it seems fast
> enough to be usable interactively, but that is a pretty shocking
> performance regression. And I know there are people with more mail on
> slower systems.

Maybe we could check whether the search string contains a regexp and decide
on that basis whether to pre-process it? I think that would make the
interface more user-friendly. You'd just always use search, whether or not
you decide that you need to put in some regexp.

> 
>> Could it be made to work on the message body?
> 
> See Austin's previous reply for the details, but basically no; these
> "values" index in terms of whole strings, while the body is indexed by
> terms (roughly, words). In principle we could add a value slot for the
> body, but I think that would at least double the size of the database
> (maybe more).
> 

I would rather have double the db and be able to wildcard the beginning of
terms. If it is not too much maintenance overhead, could it be made
optional?


Regards, Gaute


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] WIP: regexp matching in 'subject' and 'from'
  2016-06-11 16:32                   ` Gaute Hope
@ 2016-06-11 16:49                     ` David Bremner
  2016-06-11 17:09                     ` Tomi Ollila
  1 sibling, 0 replies; 21+ messages in thread
From: David Bremner @ 2016-06-11 16:49 UTC (permalink / raw)
  To: Gaute Hope, Austin Clements; +Cc: notmuch

Gaute Hope <eg@gaute.vetsj.com> writes:

>
> Maybe we could check whether the search string contains a regexp and decide
> on that basis whether to pre-process it? I think that would make the
> interface more user-friendly. You'd just always use search, whether or not
> you decide that you need to put in some regexp.
>

There are some technical limitations of the xapian query parser (and
field processors in particular) that mean we'll probably have to explicitly
ask for regex expansion.

> I would rather have double the db and be able to wildcard the beginning of
> terms. If it is not too much maintenance overhead, could it be made
> optional?

perhaps. If that's really your primary goal, regexp search is
overkill. Maybe you should discuss with xapian upstream the possibility
of having xapian support more general wildcards. I have a vague memory
of olly once saying it was not out of the question. But I might be wrong
about that, it's just a half-remembered IRC discussion.

d

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] WIP: regexp matching in 'subject' and 'from'
  2016-06-11 16:32                   ` Gaute Hope
  2016-06-11 16:49                     ` David Bremner
@ 2016-06-11 17:09                     ` Tomi Ollila
  2016-06-11 17:34                       ` Gaute Hope
  1 sibling, 1 reply; 21+ messages in thread
From: Tomi Ollila @ 2016-06-11 17:09 UTC (permalink / raw)
  To: Gaute Hope, David Bremner, Austin Clements; +Cc: notmuch

On Sat, Jun 11 2016, Gaute Hope <eg@gaute.vetsj.com> wrote:

> David Bremner writes on June 10, 2016 13:09:
>> Gaute Hope <eg@gaute.vetsj.com> writes:
>> 
>>>
>>> Cool!
>>>
>>> Would it break a lot of things if you just replace the original prefix?
>> 
>> It would change the matching behaviour. I guess there are people that
>> like the current "sloppy" matching of from: and subject:.  In my
>> not-very-scientific tests, it is a factor of 5 to 10 times slower to do
>> regexp search, which makes sense because it is effectively post
>> processing the results from Xapian. At least on my system it seems fast
>> enough to be usable interactively, but that is a pretty shocking
>> performance regression. And I know there are people with more mail on
>> slower systems.
>
> Maybe we could check whether the search string contains a regexp and decide
> on that basis whether to pre-process it? I think that would make the
> interface more user-friendly. You'd just always use search, whether or not
> you decide that you need to put in some regexp.

You probably wanted to suggest that the command-line handling in notmuch
goes through the search terms and potentially modifies them before handing
them to xapian... I think this is deliberately avoided (*) -- it would get
out of hand so easily (even if we could decide on a syntax)...

(*) there is some optimization done before feeding the query to xapian --
but that does not affect the interface (i.e. it could be dropped and none of
the users' expectations would be broken...)

What one can do is write one's own wrapper around notmuch. I have one
that was written long before notmuch got date: searches; it mangles
e.g. 5h.. to 1234567890.. (**) and logs search and show queries.

(**) I should change that to use date:... instead (i.e. date: queries w/o
the date: prefix).

I "suggested" subject:/one's own subject re search w// slashes/,
which one could pretty easily add to the wrapper...

Tomi

>
>> 
>>> Could it be made to work on the message body?
>> 
>> See Austin's previous reply for the details, but basically no; these
>> "values" index in terms of whole strings, while the body is indexed by
>> terms (roughly, words). In principle we could add a value slot for the
>> body, but I think that would at least double the size of the database
>> (maybe more).
>> 
>
> I would rather have double the db and be able to wildcard the beginning of
> terms. If it is not too much maintenance overhead, could it be made
> optional?
>
>
> Regards, Gaute

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] WIP: regexp matching in 'subject' and 'from'
  2016-06-11 17:09                     ` Tomi Ollila
@ 2016-06-11 17:34                       ` Gaute Hope
  0 siblings, 0 replies; 21+ messages in thread
From: Gaute Hope @ 2016-06-11 17:34 UTC (permalink / raw)
  To: Tomi Ollila, David Bremner, Austin Clements; +Cc: notmuch

Tomi Ollila writes on June 11, 2016 19:09:
> On Sat, Jun 11 2016, Gaute Hope <eg@gaute.vetsj.com> wrote:
>> Maybe we could check whether the search string contains a regexp and decide
>> on that basis whether to pre-process it? I think that would make the
>> interface more user-friendly. You'd just always use search, whether or not
>> you decide that you need to put in some regexp.
> 
> You probably wanted to suggest that the command-line handling in notmuch
> goes through the search terms and potentially modifies them before handing
> them to xapian... I think this is deliberately avoided (*) -- it would get
> out of hand so easily (even if we could decide on a syntax)...
> 
> (*) there is some optimization done before feeding the query to xapian --
> but that does not affect the interface (i.e. it could be dropped and none of
> the users' expectations would be broken...)
> 
> What one can do is write one's own wrapper around notmuch. I have one
> that was written long before notmuch got date: searches; it mangles
> e.g. 5h.. to 1234567890.. (**) and logs search and show queries.
> 
> (**) I should change that to use date:... instead (i.e. date: queries w/o
> the date: prefix).
> 
> I "suggested" subject:/one's own subject re search w// slashes/,
> which one could pretty easily add to the wrapper...
> 

Yes, that is pretty much what I meant. So that the user only needs
to know about 'search:', if it is 'search:foo' regular queryparser is
used, if it is 'search:/^foo/' it is preprocessed using the regexp
parser. Then the performance will remain the same for normal queries,
but seamlessly switch to the heavier regexp'er if necessary.

It could be done with a wrapper, but I am mainly using notmuch through
the API and astroid - where it could also be implemented of course.
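
As a rough sketch of what such a client-side rewrite could look like
(purely hypothetical: the name rewrite_slash_term and the /.../ convention
are assumptions taken from this discussion, not an existing notmuch or
astroid feature), a wrapper could translate one search term at a time and
leave everything else alone:

```cpp
#include <string>

// Hypothetical helper: turn "from:/regexp/" or "subject:/regexp/" into the
// "re:field:regexp" form proposed in the patch. Any other term is returned
// unchanged, so ordinary queries keep their usual speed and behaviour.
std::string rewrite_slash_term (const std::string &term)
{
    const std::string::size_type colon = term.find (':');

    if (colon == std::string::npos)
        return term;

    const std::string field = term.substr (0, colon);
    const std::string rest = term.substr (colon + 1);

    // Only the two fields supported by the patch are rewritten.
    if (field != "from" && field != "subject")
        return term;

    // Require "/.../" with at least one character between the slashes.
    if (rest.size () < 3 || rest.front () != '/' || rest.back () != '/')
        return term;

    return "re:" + field + ":" + rest.substr (1, rest.size () - 2);
}
```

This handles a single term only; splitting a full query into terms and
quoting regexps that contain spaces are left out of the sketch.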

-gaute

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2016-06-11 17:34 UTC | newest]

Thread overview: 21+ messages
2016-06-06  6:58 searching: '*analysis' vs 'reanalysis' Gaute Hope
2016-06-06 12:42 ` David Bremner
2016-06-06 12:53   ` Gaute Hope
2016-06-06 15:52     ` Sebastian Fischmeister
2016-06-06 17:29       ` David Bremner
2016-06-06 19:20         ` Austin Clements
2016-06-06 20:08           ` Gaute Hope
2016-06-06 20:22             ` Austin Clements
2016-06-07  2:05           ` [PATCH] WIP: regexp matching in subjects David Bremner
2016-06-07 10:16             ` David Bremner
2016-06-10  2:28             ` [PATCH] WIP: regexp matching in 'subject' and 'from' David Bremner
2016-06-10  2:42               ` David Bremner
2016-06-10 11:11                 ` Tomi Ollila
2016-06-10 11:50                   ` David Bremner
2016-06-10  8:38               ` Gaute Hope
2016-06-10 11:09                 ` David Bremner
2016-06-11 16:32                   ` Gaute Hope
2016-06-11 16:49                     ` David Bremner
2016-06-11 17:09                     ` Tomi Ollila
2016-06-11 17:34                       ` Gaute Hope
2016-06-11  1:49               ` David Bremner
