* searching: '*analysis' vs 'reanalysis' @ 2016-06-06 6:58 Gaute Hope 2016-06-06 12:42 ` David Bremner 0 siblings, 1 reply; 21+ messages in thread From: Gaute Hope @ 2016-06-06 6:58 UTC (permalink / raw) To: notmuch Hi, I have an email with the word 'reanalysis' in the subject line and the email body. However, when I try to search for '*analysis' or 'analysis' I do not get any matches, should not '*analysis' at least match? Regards, Gaute ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: searching: '*analysis' vs 'reanalysis' 2016-06-06 6:58 searching: '*analysis' vs 'reanalysis' Gaute Hope @ 2016-06-06 12:42 ` David Bremner 2016-06-06 12:53 ` Gaute Hope 0 siblings, 1 reply; 21+ messages in thread From: David Bremner @ 2016-06-06 12:42 UTC (permalink / raw) To: Gaute Hope, notmuch Gaute Hope <eg@gaute.vetsj.com> writes: > Hi, > > I have an email with the word 'reanalysis' in the subject line and the > email body. However, when I try to search for '*analysis' or 'analysis' > I do not get any matches, should not '*analysis' at least match? > We talked about this on IRC (the short answer is no), but is there some improvement you could suggest to the "Wildcards" section in notmuch-search-terms(7) ? d ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: searching: '*analysis' vs 'reanalysis' 2016-06-06 12:42 ` David Bremner @ 2016-06-06 12:53 ` Gaute Hope 2016-06-06 15:52 ` Sebastian Fischmeister 0 siblings, 1 reply; 21+ messages in thread From: Gaute Hope @ 2016-06-06 12:53 UTC (permalink / raw) To: David Bremner, notmuch David Bremner writes on June 6, 2016 14:42: > Gaute Hope <eg@gaute.vetsj.com> writes: > >> Hi, >> >> I have an email with the word 'reanalysis' in the subject line and the >> email body. However, when I try to search for '*analysis' or 'analysis' >> I do not get any matches, should not '*analysis' at least match? >> > > We talked about this on IRC (the short answer is no), but is there some > improvement you could suggest to the "Wildcards" section in > notmuch-search-terms(7) ? Yes, thanks, not very important, but maybe add the sentence: > It is not possible to use wildcards at the beginning of a term. after the current explanation to emphasize this limitation (possibly blaming Xapian to avoid futile requests). I think it is something many would expect (and want). The current description feels more like an example, and it is easy to make the assumption that it works for prefixing the terms as well - although, technically, nothing is promised in the original docs. -gaute ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: searching: '*analysis' vs 'reanalysis' 2016-06-06 12:53 ` Gaute Hope @ 2016-06-06 15:52 ` Sebastian Fischmeister 2016-06-06 17:29 ` David Bremner 0 siblings, 1 reply; 21+ messages in thread From: Sebastian Fischmeister @ 2016-06-06 15:52 UTC (permalink / raw) To: Gaute Hope, David Bremner, notmuch >> It is not possible to use wildcards at the beginning of a term. > > after the current explanation to emphasize this limitation (possibly > blaming Xapian to avoid futile requests). > > I think it is something many would expect (and want). The current > description feels more like an example, and it is easy to make the > assumption that it works for prefixing the terms as well - although, > technically, nothing is promised in the original docs. I ran into this problem before as well. Storage is cheap. Notmuch could index all emails with reversed text to get around some of this problem. It doesn't solve the problem of *analysis*, but it's still an improvement. Sebastian ^ permalink raw reply [flat|nested] 21+ messages in thread
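A minimal sketch of the reversed-text idea above, using only the C++ standard library rather than Xapian. Every term is stored twice, once as-is and once reversed, so a leading wildcard such as '*analysis' becomes an ordinary prefix scan over the reversed copy. ToyTermIndex, with_prefix and with_suffix are hypothetical names for illustration and are not notmuch or Xapian API; reversal here is byte-wise, so the sketch is only correct for ASCII terms.

```cpp
#include <algorithm>
#include <set>
#include <string>
#include <vector>

// Hypothetical toy index (illustration only, not notmuch/Xapian code).
class ToyTermIndex {
public:
    // Store the term and its byte-wise reversal.
    void add(const std::string &term) {
        forward_.insert(term);
        reversed_.insert(std::string(term.rbegin(), term.rend()));
    }

    // "anal*": all terms that start with `prefix`.
    std::vector<std::string> with_prefix(const std::string &prefix) const {
        return scan(forward_, prefix, false);
    }

    // "*analysis": all terms that end with `suffix`, answered as a
    // prefix scan over the reversed terms.
    std::vector<std::string> with_suffix(const std::string &suffix) const {
        const std::string key(suffix.rbegin(), suffix.rend());
        return scan(reversed_, key, true);
    }

private:
    // Walk the sorted set from the first entry >= prefix while entries
    // still begin with prefix; optionally un-reverse each hit.
    static std::vector<std::string> scan(const std::set<std::string> &terms,
                                         const std::string &prefix,
                                         bool unreverse) {
        std::vector<std::string> out;
        for (auto it = terms.lower_bound(prefix);
             it != terms.end() && it->compare(0, prefix.size(), prefix) == 0;
             ++it) {
            out.push_back(unreverse ? std::string(it->rbegin(), it->rend())
                                    : *it);
        }
        std::sort(out.begin(), out.end());
        return out;
    }

    std::set<std::string> forward_;
    std::set<std::string> reversed_;
};
```

The cost is roughly a second copy of the term list, and, as noted above, an infix query such as '*analysis*' is still not covered: neither the forward nor the reversed ordering keeps those matches contiguous.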
* Re: searching: '*analysis' vs 'reanalysis' 2016-06-06 15:52 ` Sebastian Fischmeister @ 2016-06-06 17:29 ` David Bremner 2016-06-06 19:20 ` Austin Clements 0 siblings, 1 reply; 21+ messages in thread From: David Bremner @ 2016-06-06 17:29 UTC (permalink / raw) To: sfischme, Gaute Hope, notmuch; +Cc: Austin Clements Sebastian Fischmeister <sfischme@uwaterloo.ca> writes: > > I ran into this problem before as well. Storage is cheap. Notmuch could > index all emails with reversed text to get around some of this > problem. It doesn't solve the problem of *analysis*, but it's still an > improvement. It would probably be more useful to have brute force regexp searches on headers. Austin did some experiments that sounded promising, where you basically postprocess the result of a xapian query with a regexp. OTOH, I don't know what kept him from proposing this for mainline. If it was just parser issues, those are probably more or less solved now, at least for people using xapian 1.3+ d ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: searching: '*analysis' vs 'reanalysis' 2016-06-06 17:29 ` David Bremner @ 2016-06-06 19:20 ` Austin Clements 2016-06-06 20:08 ` Gaute Hope 2016-06-07 2:05 ` [PATCH] WIP: regexp matching in subjects David Bremner 0 siblings, 2 replies; 21+ messages in thread From: Austin Clements @ 2016-06-06 19:20 UTC (permalink / raw) To: David Bremner; +Cc: sfischme, Gaute Hope, notmuch [-- Attachment #1: Type: text/plain, Size: 1252 bytes --] On Mon, Jun 6, 2016 at 1:29 PM, David Bremner <david@tethera.net> wrote: > Sebastian Fischmeister <sfischme@uwaterloo.ca> writes: > > > > > I ran into this problem before as well. Storage is cheap. Notmuch could > > index all emails with reversed text to get around some of this > > problem. It doesn't solve the problem of *analysis*, but it's still an > > improvement. > > It would probably be more useful to have brute force regexp searches on > headers. Austin did some experiments that sounded promising, where you > basically postprocess the result of a xapian query with a regexp. OTOH, > I don't know what kept him from proposing this for mainline. If it was > just parser issues, those are probably more or less solved now, at least > for people using xapian 1.3+ > The experiment was specifically for regexp matching subject, but it should work for any header we store a literal copy of in the database. The code is here, though in its current form it builds on my custom query parser: https://github.com/aclements/notmuch/commit/ce41b29aba4d9b84e2f1eb6ed8df67065196c960. Based on my understanding of Xapian 1.3+ field processors, these days it should be quite easy to hook the PostingSource in that commit into the Xapian QueryProcessor. [-- Attachment #2: Type: text/html, Size: 1840 bytes --] ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: searching: '*analysis' vs 'reanalysis' 2016-06-06 19:20 ` Austin Clements @ 2016-06-06 20:08 ` Gaute Hope 2016-06-06 20:22 ` Austin Clements 2016-06-07 2:05 ` [PATCH] WIP: regexp matching in subjects David Bremner 1 sibling, 1 reply; 21+ messages in thread From: Gaute Hope @ 2016-06-06 20:08 UTC (permalink / raw) To: Austin Clements, David Bremner; +Cc: sfischme, notmuch Austin Clements writes on June 6, 2016 21:20: > > The experiment was specifically for regexp matching subject, but it should > work for any header we store a literal copy of in the database. Does it work for terms in the body of the message? ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: searching: '*analysis' vs 'reanalysis' 2016-06-06 20:08 ` Gaute Hope @ 2016-06-06 20:22 ` Austin Clements 0 siblings, 0 replies; 21+ messages in thread From: Austin Clements @ 2016-06-06 20:22 UTC (permalink / raw) To: Gaute Hope; +Cc: David Bremner, sfischme, notmuch Quoth Gaute Hope on Jun 06 at 8:08 pm: > Austin Clements writes on June 6, 2016 21:20: > > > >The experiment was specifically for regexp matching subject, but it should > >work for any header we store a literal copy of in the database. > > Does it work for terms in the body of the message? No. It's not impossible that it could be made to work, but it might be slow and unintuitive. It would have to iterate over all of the terms in the database and see which ones match the regexp. These are available, but I don't know how much time it takes to iterate over all of them. It might be okay. It might not. It could also expand to a very large query if the regexp matches many terms, akin to how searching for "a*" can be quite expensive. And it might not match what you expect. It could only match individual terms, so a regexp containing any punctuation (including but not limited to a space) simply wouldn't match anything. ^ permalink raw reply [flat|nested] 21+ messages in thread
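The term-by-term behaviour described above can be sketched without Xapian. The example below assumes the body terms are already available as a plain list of single lowercase words (an assumption for illustration; a real implementation would walk the database's term list), and matching_terms is a hypothetical helper, not notmuch code. It uses POSIX extended regexps via regcomp/regexec.

```cpp
#include <regex.h>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (illustration only): return every term that
// matches the POSIX extended regexp `pattern`, in input order.
std::vector<std::string>
matching_terms(const std::vector<std::string> &terms,
               const std::string &pattern)
{
    regex_t re;
    // REG_NOSUB: we only need a yes/no answer, not submatch offsets.
    if (regcomp(&re, pattern.c_str(), REG_EXTENDED | REG_NOSUB) != 0)
        throw std::invalid_argument("invalid regexp: " + pattern);

    // Free the compiled regexp on every exit path.
    struct Guard {
        regex_t *r;
        ~Guard() { regfree(r); }
    } guard{&re};

    std::vector<std::string> out;
    for (const std::string &term : terms) {
        // Each term is tested on its own; the regexp never sees two
        // adjacent words together.
        if (regexec(&re, term.c_str(), 0, nullptr, 0) == 0)
            out.push_back(term);
    }
    return out;
}
```

With the terms "graduate", "committee", "reanalysis" and "analysis", the pattern "analysis$" selects two terms, while "graduate committee" selects none, because no single term contains a space. Every selected term would then become one branch of the expanded query, which is where the cost concern for broad patterns comes from.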
* [PATCH] WIP: regexp matching in subjects 2016-06-06 19:20 ` Austin Clements 2016-06-06 20:08 ` Gaute Hope @ 2016-06-07 2:05 ` David Bremner 2016-06-07 10:16 ` David Bremner 2016-06-10 2:28 ` [PATCH] WIP: regexp matching in 'subject' and 'from' David Bremner 1 sibling, 2 replies; 21+ messages in thread From: David Bremner @ 2016-06-07 2:05 UTC (permalink / raw) To: Austin Clements, David Bremner; +Cc: sfischme, Gaute Hope, notmuch the idea is that you can run % notmuch search 'subject:rx:<your-favourite-regexp>' or % notmuch search subject:"your usual phrase search" This should also work with bindings. --- Here is Austin's "hack", crammed into the field processor framework. I seem to have broken one of the existing subject search tests with my recursive query parsing. I didn't have time to figure out why, yet. lib/Makefile.local | 2 ++ lib/database-private.h | 1 + lib/database.cc | 5 +++ lib/regexp-ps.cc | 92 ++++++++++++++++++++++++++++++++++++++++++++++++++ lib/regexp-ps.h | 37 ++++++++++++++++++++ lib/subject-fp.cc | 41 ++++++++++++++++++++++ lib/subject-fp.h | 43 +++++++++++++++++++++++ 7 files changed, 221 insertions(+) create mode 100644 lib/regexp-ps.cc create mode 100644 lib/regexp-ps.h create mode 100644 lib/subject-fp.cc create mode 100644 lib/subject-fp.h diff --git a/lib/Makefile.local b/lib/Makefile.local index beb9635..0e7311f 100644 --- a/lib/Makefile.local +++ b/lib/Makefile.local @@ -51,6 +51,8 @@ libnotmuch_cxx_srcs = \ $(dir)/query.cc \ $(dir)/query-fp.cc \ $(dir)/config.cc \ + $(dir)/regexp-ps.cc \ + $(dir)/subject-fp.cc \ $(dir)/thread.cc libnotmuch_modules := $(libnotmuch_c_srcs:.c=.o) $(libnotmuch_cxx_srcs:.cc=.o) diff --git a/lib/database-private.h b/lib/database-private.h index ca71a92..5de0b81 100644 --- a/lib/database-private.h +++ b/lib/database-private.h @@ -186,6 +186,7 @@ struct _notmuch_database { #if HAVE_XAPIAN_FIELD_PROCESSOR Xapian::FieldProcessor *date_field_processor; Xapian::FieldProcessor *query_field_processor; + 
Xapian::FieldProcessor *subject_field_processor; #endif Xapian::ValueRangeProcessor *last_mod_range_processor; }; diff --git a/lib/database.cc b/lib/database.cc index 86bf261..adfbb81 100644 --- a/lib/database.cc +++ b/lib/database.cc @@ -21,6 +21,7 @@ #include "database-private.h" #include "parse-time-vrp.h" #include "query-fp.h" +#include "subject-fp.h" #include "string-util.h" #include <iostream> @@ -1008,6 +1009,8 @@ notmuch_database_open_verbose (const char *path, notmuch->query_parser->add_boolean_prefix("date", notmuch->date_field_processor); notmuch->query_field_processor = new QueryFieldProcessor (*notmuch->query_parser, notmuch); notmuch->query_parser->add_boolean_prefix("query", notmuch->query_field_processor); + notmuch->subject_field_processor = new SubjectFieldProcessor (*notmuch->query_parser, notmuch); + notmuch->query_parser->add_boolean_prefix("subject", notmuch->subject_field_processor); #endif notmuch->last_mod_range_processor = new Xapian::NumberValueRangeProcessor (NOTMUCH_VALUE_LAST_MOD, "lastmod:"); @@ -1027,6 +1030,8 @@ notmuch_database_open_verbose (const char *path, for (i = 0; i < ARRAY_SIZE (PROBABILISTIC_PREFIX); i++) { prefix_t *prefix = &PROBABILISTIC_PREFIX[i]; + if (strcmp (prefix->name, "subject") == 0) + continue; notmuch->query_parser->add_prefix (prefix->name, prefix->prefix); } } catch (const Xapian::Error &error) { diff --git a/lib/regexp-ps.cc b/lib/regexp-ps.cc new file mode 100644 index 0000000..540c7d6 --- /dev/null +++ b/lib/regexp-ps.cc @@ -0,0 +1,92 @@ +/* query-fp.cc - "query:" field processor glue + * + * This file is part of notmuch. + * + * Copyright © 2016 David Bremner + * + * This program is free software: you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation, either version 3 of the License, or + * (at your option) any later version. 
+ * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program. If not, see https://www.gnu.org/licenses/ . + * + * Author: Austin Clements <aclements@csail.mit.edu> + * David Bremner <david@tethera.net> + */ + +#include "regexp-ps.h" + +RegexpPostingSource::RegexpPostingSource (Xapian::valueno slot, const std::string &regexp) + : slot_ (slot) +{ + int r = regcomp (&regexp_, regexp.c_str (), REG_EXTENDED | REG_NOSUB); + + if (r != 0) + /* XXX Report a query syntax error using regerror */ + throw "regcomp failed"; +} + +RegexpPostingSource::~RegexpPostingSource () +{ + regfree (&regexp_); +} + +void +RegexpPostingSource::init (const Xapian::Database &db) +{ + db_ = db; + it_ = db_.valuestream_begin (slot_); + end_ = db.valuestream_end (slot_); + started_ = false; +} + +Xapian::doccount +RegexpPostingSource::get_termfreq_min () const +{ + return 0; +} + +Xapian::doccount +RegexpPostingSource::get_termfreq_est () const +{ + return get_termfreq_max () / 2; +} + +Xapian::doccount +RegexpPostingSource::get_termfreq_max () const +{ + return db_.get_value_freq (slot_); +} + +Xapian::docid +RegexpPostingSource::get_docid () const +{ + return it_.get_docid (); +} + +bool +RegexpPostingSource::at_end () const +{ + return it_ == end_; +} + +void +RegexpPostingSource::next (unused (double min_wt)) +{ + if (started_ && ! at_end ()) + ++it_; + started_ = true; + + for (; !
at_end (); ++it_) { + std::string value = *it_; + if (regexec (&regexp_, value.c_str (), 0, NULL, 0) == 0) + break; + } +} diff --git a/lib/regexp-ps.h b/lib/regexp-ps.h new file mode 100644 index 0000000..a4553a7 --- /dev/null +++ b/lib/regexp-ps.h @@ -0,0 +1,37 @@ +#ifndef NOTMUCH_REGEX_PS_H +#define NOTMUCH_REGEX_PS_H + +#include <sys/types.h> +#include <regex.h> +#include <xapian.h> +#include "notmuch-private.h" + +/* A posting source that returns documents where a value matches a + * regexp. + */ +class RegexpPostingSource : public Xapian::PostingSource +{ +protected: +const Xapian::valueno slot_; +regex_t regexp_; +Xapian::Database db_; +bool started_; +Xapian::ValueIterator it_, end_; + +/* No copying */ +RegexpPostingSource (const RegexpPostingSource &); +RegexpPostingSource &operator= (const RegexpPostingSource &); + +public: + RegexpPostingSource (Xapian::valueno slot, const std::string &regexp); +~RegexpPostingSource (); +void init (const Xapian::Database &db); +Xapian::doccount get_termfreq_min () const; +Xapian::doccount get_termfreq_est () const; +Xapian::doccount get_termfreq_max () const; +Xapian::docid get_docid () const; +bool at_end () const; +void next (unused (double min_wt)); +}; + +#endif diff --git a/lib/subject-fp.cc b/lib/subject-fp.cc new file mode 100644 index 0000000..1627721 --- /dev/null +++ b/lib/subject-fp.cc @@ -0,0 +1,41 @@ +/* subject-fp.cc - "subject:" field processor glue + * + * This file is part of notmuch. + * + * Copyright © 2016 David Bremner + * + * This program is free software: you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation, either version 3 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program. If not, see https://www.gnu.org/licenses/ . + * + * Author: David Bremner <david@tethera.net> + */ + +#include "database-private.h" +#include "subject-fp.h" +#include <iostream> + +#if HAVE_XAPIAN_FIELD_PROCESSOR + +Xapian::Query +SubjectFieldProcessor::operator() (const std::string & str) +{ + std::string prefix = "rx:"; + + if (str.compare(0,prefix.size(),prefix)==0) { + postings = new RegexpPostingSource(NOTMUCH_VALUE_SUBJECT, str.substr(prefix.size())); + return Xapian::Query(postings); + } else { + return parser.parse_query (str, NOTMUCH_QUERY_PARSER_FLAGS, _find_prefix ("subject")); + } +} +#endif diff --git a/lib/subject-fp.h b/lib/subject-fp.h new file mode 100644 index 0000000..ca622ba --- /dev/null +++ b/lib/subject-fp.h @@ -0,0 +1,43 @@ +/* subject-fp.h - subject field processor glue + * + * This file is part of notmuch. + * + * Copyright © 2016 David Bremner + * + * This program is free software: you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation, either version 3 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program. If not, see https://www.gnu.org/licenses/ . 
+ * + * Author: David Bremner <david@tethera.net> + */ + +#ifndef NOTMUCH_SUBJECT_FP_H +#define NOTMUCH_SUBJECT_FP_H + +#include <xapian.h> +#include "notmuch.h" +#include "regexp-ps.h" + +#if HAVE_XAPIAN_FIELD_PROCESSOR +class SubjectFieldProcessor : public Xapian::FieldProcessor { + protected: + Xapian::QueryParser &parser; + notmuch_database_t *notmuch; + RegexpPostingSource *postings = NULL; + public: + SubjectFieldProcessor (Xapian::QueryParser &parser_, notmuch_database_t *notmuch_) + : parser(parser_), notmuch(notmuch_) { }; + + Xapian::Query operator()(const std::string & str); +}; +#endif +#endif /* NOTMUCH_SUBJECT_FP_H */ -- 2.8.1 ^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH] WIP: regexp matching in subjects 2016-06-07 2:05 ` [PATCH] WIP: regexp matching in subjects David Bremner @ 2016-06-07 10:16 ` David Bremner 2016-06-10 2:28 ` [PATCH] WIP: regexp matching in 'subject' and 'from' David Bremner 1 sibling, 0 replies; 21+ messages in thread From: David Bremner @ 2016-06-07 10:16 UTC (permalink / raw) To: Austin Clements; +Cc: sfischme, Gaute Hope, notmuch David Bremner <david@tethera.net> writes: > the idea is that you can run > > % notmuch search 'subject:rx:<your-favourite-regexp>' > > or > > % notmuch search subject:"your usual phrase search" > > This should also work with bindings. > --- > > Here is Austin's "hack", crammed into the field processor framework. > I seem to have broken one of the existing subject search tests with my > recursive query parsing. I didn't have time to figure out why, yet. A few hours' sleep later, I think I understand the issue. With subject:"your usual phrase search" I believe the phrase-ness (word ordering and proximity) gets lost when using a field processor. So, since I don't know if/when this issue will be fixed in Xapian, we should probably use a separate prefix for regexp search. This leads to two potential syntaxes: subject_re:"^i am first" regexp:subject:"^i am first" If we did stick with the current syntax, one could use subject:your-usual-phrase-search to preserve phrase-ness. Note that for regexps with spaces in them, the required quoting is a bit counterintuitive: ./notmuch count 'subject:"rx:Graduate Committee"' That's a consequence of my faking "sub-prefixes", which seemed clever at the time, but maybe not. d ^ permalink raw reply [flat|nested] 21+ messages in thread
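A minimal sketch of the "sub-prefix" dispatch in the first patch, stripped of Xapian so it can be compiled on its own. It mirrors the str.compare test in SubjectFieldProcessor::operator(); SubjectQueryKind, SubjectQuery and classify_subject_query are hypothetical names introduced only for this illustration. It assumes, as the quoting example above suggests, that the field processor receives the whole quoted string after "subject:", which is why "rx:" has to sit inside the quotes.

```cpp
#include <string>

// Hypothetical types for illustration; the patch returns a
// Xapian::Query instead of a tagged struct.
enum class SubjectQueryKind { Regexp, Ordinary };

struct SubjectQuery {
    SubjectQueryKind kind;
    std::string text;  // regexp body, or the untouched query text
};

// Decide whether the text after "subject:" asks for a regexp match.
SubjectQuery classify_subject_query(const std::string &str)
{
    static const std::string prefix = "rx:";

    // Same test as the patch: does str begin with "rx:"?
    if (str.compare(0, prefix.size(), prefix) == 0)
        return {SubjectQueryKind::Regexp, str.substr(prefix.size())};

    // Anything else goes down the ordinary (term/phrase) path.
    return {SubjectQueryKind::Ordinary, str};
}
```

For the input "rx:Graduate Committee" this yields a Regexp query with text "Graduate Committee"; for "your usual phrase search" it yields an Ordinary query. A dedicated prefix such as subject_re: removes the need for this dispatch, since each prefix then has exactly one meaning.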
* [PATCH] WIP: regexp matching in 'subject' and 'from' 2016-06-07 2:05 ` [PATCH] WIP: regexp matching in subjects David Bremner 2016-06-07 10:16 ` David Bremner @ 2016-06-10 2:28 ` David Bremner 2016-06-10 2:42 ` David Bremner ` (2 more replies) 1 sibling, 3 replies; 21+ messages in thread From: David Bremner @ 2016-06-10 2:28 UTC (permalink / raw) To: David Bremner, Austin Clements; +Cc: sfischme, Gaute Hope, notmuch the idea is that you can run % notmuch search subject_re:<your-favourite-regexp> % notmuch search from_re:<your-favourite-regexp> or % notmuch search subject:"your usual phrase search" % notmuch search from:"usual phrase search" This should also work with bindings, since it extends the query parser. This is trivial to extend for other value slots, but currently the only value slots are date, message_id, from, subject, and last_mod. Date is already searchable, and message_id is not obviously useful to regex match. --- This is more or less complete codewise; it fixes the known problems with the last version. Names of prefixes are debatable, and of course it needs doc and tests. I don't see any reason not to do this at the moment, since it's basically free; no new terms are added to the database.
lib/Makefile.local | 1 + lib/database-private.h | 2 + lib/database.cc | 12 +++++- lib/regexp-fields.cc | 102 +++++++++++++++++++++++++++++++++++++++++++++++++ lib/regexp-fields.h | 78 +++++++++++++++++++++++++++++++++++++ 5 files changed, 194 insertions(+), 1 deletion(-) create mode 100644 lib/regexp-fields.cc create mode 100644 lib/regexp-fields.h diff --git a/lib/Makefile.local b/lib/Makefile.local index beb9635..68771e6 100644 --- a/lib/Makefile.local +++ b/lib/Makefile.local @@ -51,6 +51,7 @@ libnotmuch_cxx_srcs = \ $(dir)/query.cc \ $(dir)/query-fp.cc \ $(dir)/config.cc \ + $(dir)/regexp-fields.cc \ $(dir)/thread.cc libnotmuch_modules := $(libnotmuch_c_srcs:.c=.o) $(libnotmuch_cxx_srcs:.cc=.o) diff --git a/lib/database-private.h b/lib/database-private.h index ca71a92..090fcdf 100644 --- a/lib/database-private.h +++ b/lib/database-private.h @@ -186,6 +186,8 @@ struct _notmuch_database { #if HAVE_XAPIAN_FIELD_PROCESSOR Xapian::FieldProcessor *date_field_processor; Xapian::FieldProcessor *query_field_processor; + Xapian::FieldProcessor *subject_re_field_processor; + Xapian::FieldProcessor *from_re_field_processor; #endif Xapian::ValueRangeProcessor *last_mod_range_processor; }; diff --git a/lib/database.cc b/lib/database.cc index 86bf261..4049406 100644 --- a/lib/database.cc +++ b/lib/database.cc @@ -21,6 +21,7 @@ #include "database-private.h" #include "parse-time-vrp.h" #include "query-fp.h" +#include "regexp-fields.h" #include "string-util.h" #include <iostream> @@ -1008,6 +1009,10 @@ notmuch_database_open_verbose (const char *path, notmuch->query_parser->add_boolean_prefix("date", notmuch->date_field_processor); notmuch->query_field_processor = new QueryFieldProcessor (*notmuch->query_parser, notmuch); notmuch->query_parser->add_boolean_prefix("query", notmuch->query_field_processor); + notmuch->subject_re_field_processor = new RegexpFieldProcessor (NOTMUCH_VALUE_SUBJECT, *notmuch->query_parser, notmuch); + 
notmuch->query_parser->add_boolean_prefix("subject_re", notmuch->subject_re_field_processor); + notmuch->from_re_field_processor = new RegexpFieldProcessor (NOTMUCH_VALUE_FROM, *notmuch->query_parser, notmuch); + notmuch->query_parser->add_boolean_prefix("from_re", notmuch->from_re_field_processor); #endif notmuch->last_mod_range_processor = new Xapian::NumberValueRangeProcessor (NOTMUCH_VALUE_LAST_MOD, "lastmod:"); @@ -1098,7 +1103,12 @@ notmuch_database_close (notmuch_database_t *notmuch) notmuch->date_range_processor = NULL; delete notmuch->last_mod_range_processor; notmuch->last_mod_range_processor = NULL; - +#ifdef HAVE_XAPIAN_FIELD_PROCESSOR + delete notmuch->from_re_field_processor; + notmuch->from_re_field_processor = NULL; + delete notmuch->subject_re_field_processor; + notmuch->subject_re_field_processor = NULL; +#endif return status; } diff --git a/lib/regexp-fields.cc b/lib/regexp-fields.cc new file mode 100644 index 0000000..4bbebda --- /dev/null +++ b/lib/regexp-fields.cc @@ -0,0 +1,102 @@ +/* query-fp.cc - "query:" field processor glue + * + * This file is part of notmuch. + * + * Copyright © 2015 Austin Clements + * Copyright © 2016 David Bremner + * + * This program is free software: you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation, either version 3 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program. If not, see https://www.gnu.org/licenses/ . 
+ * + * Author: Austin Clements <aclements@csail.mit.edu> + * David Bremner <david@tethera.net> + */ + +#include "regexp-fields.h" + +#ifdef HAVE_XAPIAN_FIELD_PROCESSOR +RegexpPostingSource::RegexpPostingSource (Xapian::valueno slot, const std::string &regexp) + : slot_ (slot) +{ + int r = regcomp (&regexp_, regexp.c_str (), REG_EXTENDED | REG_NOSUB); + + if (r != 0) + /* XXX Report a query syntax error using regerror */ + throw "regcomp failed"; +} + +RegexpPostingSource::~RegexpPostingSource () +{ + regfree (&regexp_); +} + +void +RegexpPostingSource::init (const Xapian::Database &db) +{ + db_ = db; + it_ = db_.valuestream_begin (slot_); + end_ = db.valuestream_end (slot_); + started_ = false; +} + +Xapian::doccount +RegexpPostingSource::get_termfreq_min () const +{ + return 0; +} + +Xapian::doccount +RegexpPostingSource::get_termfreq_est () const +{ + return get_termfreq_max () / 2; +} + +Xapian::doccount +RegexpPostingSource::get_termfreq_max () const +{ + return db_.get_value_freq (slot_); +} + +Xapian::docid +RegexpPostingSource::get_docid () const +{ + return it_.get_docid (); +} + +bool +RegexpPostingSource::at_end () const +{ + return it_ == end_; +} + +void +RegexpPostingSource::next (unused (double min_wt)) +{ + if (started_ && ! at_end ()) + ++it_; + started_ = true; + + for (; ! at_end (); ++it_) { + std::string value = *it_; + if (regexec (&regexp_, value.c_str (), 0, NULL, 0) == 0) + break; + } +} + +Xapian::Query +RegexpFieldProcessor::operator() (const std::string & str) +{ + postings = new RegexpPostingSource (slot, str); + return Xapian::Query (postings); +} +#endif diff --git a/lib/regexp-fields.h b/lib/regexp-fields.h new file mode 100644 index 0000000..a184cab --- /dev/null +++ b/lib/regexp-fields.h @@ -0,0 +1,78 @@ +/* regex-fields.h - xapian glue for semi-bruteforce regexp search + * + * This file is part of notmuch.
+ * + * Copyright © 2015 Austin Clements + * Copyright © 2016 David Bremner + * + * This program is free software: you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation, either version 3 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program. If not, see https://www.gnu.org/licenses/ . + * + * Author: Austin Clements <aclements@csail.mit.edu> + * David Bremner <david@tethera.net> + */ + +#ifndef NOTMUCH_REGEXP_FIELDS_H +#define NOTMUCH_REGEXP_FIELDS_H +#ifdef HAVE_XAPIAN_FIELD_PROCESSOR +#include <sys/types.h> +#include <regex.h> +#include <xapian.h> +#include "notmuch-private.h" + +/* A posting source that returns documents where a value matches a + * regexp. 
+ */ +class RegexpPostingSource : public Xapian::PostingSource +{ + protected: + const Xapian::valueno slot_; + regex_t regexp_; + Xapian::Database db_; + bool started_; + Xapian::ValueIterator it_, end_; + +/* No copying */ + RegexpPostingSource (const RegexpPostingSource &); + RegexpPostingSource &operator= (const RegexpPostingSource &); + + public: + RegexpPostingSource (Xapian::valueno slot, const std::string &regexp); + ~RegexpPostingSource (); + void init (const Xapian::Database &db); + Xapian::doccount get_termfreq_min () const; + Xapian::doccount get_termfreq_est () const; + Xapian::doccount get_termfreq_max () const; + Xapian::docid get_docid () const; + bool at_end () const; + void next (unused (double min_wt)); +}; + + +class RegexpFieldProcessor : public Xapian::FieldProcessor { + protected: + Xapian::valueno slot; + Xapian::QueryParser &parser; + notmuch_database_t *notmuch; + RegexpPostingSource *postings = NULL; + + public: + RegexpFieldProcessor (Xapian::valueno slot_, Xapian::QueryParser &parser_, notmuch_database_t *notmuch_) + : slot(slot_), parser(parser_), notmuch(notmuch_) { }; + + ~RegexpFieldProcessor () { delete postings; }; + + Xapian::Query operator()(const std::string & str); +}; +#endif +#endif /* NOTMUCH_REGEXP_FIELDS_H */ -- 2.8.1 ^ permalink raw reply related [flat|nested] 21+ messages in thread
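The started_/at_end/next protocol of RegexpPostingSource in the patch above can be exercised without Xapian. The sketch below is a toy model, not the patch itself: a std::vector of (docid, value) pairs stands in for the Xapian value stream, and std::regex in POSIX extended mode stands in for regcomp/regexec. ToyRegexpSource, ValueStream and matching_docids are hypothetical names for illustration.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for a value stream: (docid, stored header value).
using ValueStream = std::vector<std::pair<unsigned, std::string>>;

class ToyRegexpSource {
public:
    ToyRegexpSource(const ValueStream &values, const std::string &pattern)
        : values_(values),
          re_(pattern, std::regex::extended | std::regex::nosubs),
          pos_(0), started_(false) {}

    bool at_end() const { return pos_ == values_.size(); }

    // Only meaningful when !at_end().
    unsigned get_docid() const { return values_[pos_].first; }

    // Same shape as the patch: the first call does not advance, later
    // calls step past the current hit, then skip non-matching values.
    void next() {
        if (started_ && !at_end())
            ++pos_;
        started_ = true;

        for (; !at_end(); ++pos_) {
            if (std::regex_search(values_[pos_].second, re_))
                break;
        }
    }

private:
    ValueStream values_;   // private copy keeps the sketch lifetime-safe
    std::regex re_;
    std::size_t pos_;
    bool started_;
};

// Drain the source the way a matcher would: next(), then read docids
// until at_end().
std::vector<unsigned> matching_docids(ToyRegexpSource &src)
{
    std::vector<unsigned> ids;
    for (src.next(); !src.at_end(); src.next())
        ids.push_back(src.get_docid());
    return ids;
}
```

Every stored value is visited once per query, which is why this approach adds nothing to the database but costs time proportional to the number of documents that have the value. As in the patch, matching is case-sensitive because no case-folding flag is passed.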
* Re: [PATCH] WIP: regexp matching in 'subject' and 'from' 2016-06-10 2:28 ` [PATCH] WIP: regexp matching in 'subject' and 'from' David Bremner @ 2016-06-10 2:42 ` David Bremner 2016-06-10 11:11 ` Tomi Ollila 2016-06-10 8:38 ` Gaute Hope 2016-06-11 1:49 ` David Bremner 2 siblings, 1 reply; 21+ messages in thread From: David Bremner @ 2016-06-10 2:42 UTC (permalink / raw) To: notmuch David Bremner <david@tethera.net> writes: > +#ifdef HAVE_XAPIAN_FIELD_PROCESSOR > + delete notmuch->from_re_field_processor; > + notmuch->from_re_field_processor = NULL; > + delete notmuch->subject_re_field_processor; > + notmuch->subject_re_field_processor = NULL; > +#endif And of course, everywhere it says #ifdef HAVE_XAPIAN_FIELD_PROCESSOR, it should say #if. d ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] WIP: regexp matching in 'subject' and 'from' 2016-06-10 2:42 ` David Bremner @ 2016-06-10 11:11 ` Tomi Ollila 2016-06-10 11:50 ` David Bremner 0 siblings, 1 reply; 21+ messages in thread From: Tomi Ollila @ 2016-06-10 11:11 UTC (permalink / raw) To: David Bremner, notmuch On Fri, Jun 10 2016, David Bremner <david@tethera.net> wrote: > David Bremner <david@tethera.net> writes: > >> +#ifdef HAVE_XAPIAN_FIELD_PROCESSOR >> + delete notmuch->from_re_field_processor; >> + notmuch->from_re_field_processor = NULL; >> + delete notmuch->subject_re_field_processor; >> + notmuch->subject_re_field_processor = NULL; >> +#endif > > and of course everywhere it says #ifdef HAVE_XAPIAN_FIELD_PROCESSOR, it > should say #if. ... is there a static code analyzer which notices such mistakes...? I did mark that version trivial, which completes the thought that there is no such thing as a trivial change ;/ > d ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] WIP: regexp matching in 'subject' and 'from' 2016-06-10 11:11 ` Tomi Ollila @ 2016-06-10 11:50 ` David Bremner 0 siblings, 0 replies; 21+ messages in thread From: David Bremner @ 2016-06-10 11:50 UTC (permalink / raw) To: Tomi Ollila, notmuch Tomi Ollila <tomi.ollila@iki.fi> writes: > On Fri, Jun 10 2016, David Bremner <david@tethera.net> wrote: > >> David Bremner <david@tethera.net> writes: >> and of course everywhere it says #ifdef HAVE_XAPIAN_FIELD_PROCESSOR, it >> should say #if. > > ... is there a static code analyzer which notices such mistakes...? It seems tough for a static analyzer, because the bug is really in misunderstanding the build system (which always defines that symbol) rather than the code itself. d ^ permalink raw reply [flat|nested] 21+ messages in thread
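The #ifdef versus #if distinction discussed above can be shown in a few lines. The sketch assumes a build system that always defines the feature macro and sets it to 0 when the feature is missing; HAVE_TOY_FEATURE is a made-up macro standing in for HAVE_XAPIAN_FIELD_PROCESSOR, and the 0/1 convention is an assumption for illustration.

```cpp
// Assumption: the build system always defines the macro, with value 0
// when the feature is unavailable and 1 when it is available.
#define HAVE_TOY_FEATURE 0

#ifdef HAVE_TOY_FEATURE
// Taken: #ifdef only asks whether the macro is defined at all.
constexpr bool seen_by_ifdef = true;
#else
constexpr bool seen_by_ifdef = false;
#endif

#if HAVE_TOY_FEATURE
constexpr bool seen_by_if = true;
#else
// Taken: #if evaluates the macro's value, which is 0 here.
constexpr bool seen_by_if = false;
#endif

// The two guards disagree exactly when the macro is defined as 0.
static_assert(seen_by_ifdef && !seen_by_if,
              "#ifdef enables the code, #if disables it");
```

Under that convention, code guarded by #ifdef is compiled even when the feature is absent, so the build breaks only on systems that lack the feature. That fits the observation above that the mistake lies in the interaction with the build system rather than in the C++ itself.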
* Re: [PATCH] WIP: regexp matching in 'subject' and 'from'
  2016-06-10  2:28 ` [PATCH] WIP: regexp matching in 'subject' and 'from' David Bremner
  2016-06-10  2:42   ` David Bremner
@ 2016-06-10  8:38   ` Gaute Hope
  2016-06-10 11:09     ` David Bremner
  2016-06-11  1:49   ` David Bremner
  2 siblings, 1 reply; 21+ messages in thread
From: Gaute Hope @ 2016-06-10  8:38 UTC
To: David Bremner, Austin Clements; +Cc: sfischme, notmuch

David Bremner writes on June 10, 2016 4:28:
> the idea is that you can run
>
> % notmuch search subject_re:<your-favourite-regexp>
> % notmuch search from_re:<your-favourite-regexp>
>
> or
>
> % notmuch search subject:"your usual phrase search"
> % notmuch search from:"usual phrase search"
>
> This should also work with bindings, since it extends the query parser.
>
> This is trivial to extend for other value slots, but currently the only
> value slots are date, message_id, from, subject, and last_mod. Date is
> already searchable, and message_id is not obviously useful to regex
> match.
> ---
>
> This is more or less complete codewise, it fixes the known problems
> with the last version. Names of prefixes are debatable, and of course
> it needs doc and tests. I don't see any reason not to do this at the moment,
> since it's basically free; no new terms are added to the database.

Cool!

Would it break a lot of things if you just replace the original prefix?

Could it be made to work on the message body?

Regards, Gaute
* Re: [PATCH] WIP: regexp matching in 'subject' and 'from'
  2016-06-10  8:38 ` Gaute Hope
@ 2016-06-10 11:09   ` David Bremner
  2016-06-11 16:32     ` Gaute Hope
  0 siblings, 1 reply; 21+ messages in thread
From: David Bremner @ 2016-06-10 11:09 UTC
To: Gaute Hope, Austin Clements; +Cc: sfischme, notmuch

Gaute Hope <eg@gaute.vetsj.com> writes:

>
> Cool!
>
> Would it break a lot of things if you just replace the original prefix?

It would change the matching behaviour. I guess there are people that
like the current "sloppy" matching of from: and subject:. In my
not-very-scientific tests, it is a factor of 5 to 10 times slower to do
regexp search, which makes sense because it is effectively post
processing the results from Xapian. At least on my system it seems fast
enough to be usable interactively, but that is a pretty shocking
performance regression. And I know there are people with more mail on
slower systems.

> Could it be made to work on the message body?

See Austin's previous reply for the details, but basically no; these
"values" index in terms of whole strings, while the body is indexed by
terms (roughly, words). In principle we could add a value slot for the
body, but I think that would at least double the size of the database
(maybe more).
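To make the "whole strings, post-processed" point above concrete, here is a rough sketch of a brute-force scan over stored string values using the same POSIX regcomp/regexec calls and flags that the patch uses. The function name scan_values and the in-memory vector are assumptions made for illustration only; the patch itself walks a Xapian value stream rather than a std::vector.

```cpp
#include <cassert>
#include <cstddef>
#include <regex.h>
#include <string>
#include <vector>

// Illustrative helper (not part of notmuch): return the indices of all
// values that match an extended POSIX regexp. Every value is examined,
// which is why this kind of search costs more than an index lookup.
std::vector<std::size_t> scan_values(const std::vector<std::string> &values,
                                     const std::string &pattern)
{
    std::vector<std::size_t> hits;
    regex_t re;

    // Same flags as in the patch: extended syntax, no sub-match reporting.
    if (regcomp(&re, pattern.c_str(), REG_EXTENDED | REG_NOSUB) != 0)
        return hits; // invalid pattern: this sketch simply reports no matches

    for (std::size_t i = 0; i < values.size(); ++i) {
        if (regexec(&re, values[i].c_str(), 0, NULL, 0) == 0)
            hits.push_back(i);
    }

    regfree(&re);
    return hits;
}

// Usage example with two made-up subject lines.
void demo_scan()
{
    const std::vector<std::string> subjects = {
        "reanalysis of the June data",
        "meeting notes",
    };

    // An unanchored regexp matches anywhere inside the stored string.
    const std::vector<std::size_t> a = scan_values(subjects, "analysis");
    assert(a.size() == 1 && a[0] == 0);

    const std::vector<std::size_t> b = scan_values(subjects, "^meeting");
    assert(b.size() == 1 && b[0] == 1);
}
```

Because the regexp is applied to each complete value, a pattern such as "analysis" can match inside "reanalysis", at the cost of touching every stored string.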
* Re: [PATCH] WIP: regexp matching in 'subject' and 'from'
  2016-06-10 11:09 ` David Bremner
@ 2016-06-11 16:32   ` Gaute Hope
  2016-06-11 16:49     ` David Bremner
  2016-06-11 17:09     ` Tomi Ollila
  0 siblings, 2 replies; 21+ messages in thread
From: Gaute Hope @ 2016-06-11 16:32 UTC
To: David Bremner, Austin Clements; +Cc: sfischme, notmuch

David Bremner writes on June 10, 2016 13:09:
> Gaute Hope <eg@gaute.vetsj.com> writes:
>
>>
>> Cool!
>>
>> Would it break a lot of things if you just replace the original prefix?
>
> It would change the matching behaviour. I guess there are people that
> like the current "sloppy" matching of from: and subject:. In my
> not-very-scientific tests, it is a factor of 5 to 10 times slower to do
> regexp search, which makes sense because it is effectively post
> processing the results from Xapian. At least on my system it seems fast
> enough to be usable interactively, but that is a pretty shocking
> performance regression. And I know there are people with more mail on
> slower systems.

Maybe we could check if the search string contains a regexp and decide
whether to pre-process it based on that? I think that would make the
interface more user-friendly. You'd just always use search, whether or
not you decide that you need to put in some regexp.

>
>> Could it be made to work on the message body?
>
> See Austin's previous reply for the details, but basically no; these
> "values" index in terms of whole strings, while the body is indexed by
> terms (roughly, words). In principle we could add a value slot for the
> body, but I think that would at least double the size of the database
> (maybe more).
>

I would rather have double the db and be able to wildcard the beginning
of terms. If it is not too much maintenance overhead, it might be made
optional?

Regards, Gaute
* Re: [PATCH] WIP: regexp matching in 'subject' and 'from'
  2016-06-11 16:32 ` Gaute Hope
@ 2016-06-11 16:49   ` David Bremner
  2016-06-11 17:09   ` Tomi Ollila
  1 sibling, 0 replies; 21+ messages in thread
From: David Bremner @ 2016-06-11 16:49 UTC
To: Gaute Hope, Austin Clements; +Cc: notmuch

Gaute Hope <eg@gaute.vetsj.com> writes:

>
> Maybe we could check if the search string contains a regexp and decide
> whether to pre-process it based on that? I think that would make the
> interface more user-friendly. You'd just always use search, whether or
> not you decide that you need to put in some regexp.
>

There are some technical limitations of the xapian query parser (and
field processors in particular) that mean we'll probably have to
explicitly ask for regex expansion.

> I would rather have double the db and be able to wildcard the beginning
> of terms. If it is not too much maintenance overhead, it might be made
> optional?

perhaps. If that's really your primary goal, regexp search is
overkill. Maybe you should discuss with xapian upstream the possibility
of having xapian support more general wildcards. I have a vague memory
of olly once saying it was not out of the question. But I might be wrong
about that, it's just a half-remembered IRC discussion.

d
* Re: [PATCH] WIP: regexp matching in 'subject' and 'from'
  2016-06-11 16:32 ` Gaute Hope
  2016-06-11 16:49   ` David Bremner
@ 2016-06-11 17:09   ` Tomi Ollila
  2016-06-11 17:34     ` Gaute Hope
  1 sibling, 1 reply; 21+ messages in thread
From: Tomi Ollila @ 2016-06-11 17:09 UTC
To: Gaute Hope, David Bremner, Austin Clements; +Cc: notmuch

On Sat, Jun 11 2016, Gaute Hope <eg@gaute.vetsj.com> wrote:

> David Bremner writes on June 10, 2016 13:09:
>> Gaute Hope <eg@gaute.vetsj.com> writes:
>>
>>>
>>> Cool!
>>>
>>> Would it break a lot of things if you just replace the original prefix?
>>
>> It would change the matching behaviour. I guess there are people that
>> like the current "sloppy" matching of from: and subject:. In my
>> not-very-scientific tests, it is a factor of 5 to 10 times slower to do
>> regexp search, which makes sense because it is effectively post
>> processing the results from Xapian. At least on my system it seems fast
>> enough to be usable interactively, but that is a pretty shocking
>> performance regression. And I know there are people with more mail on
>> slower systems.
>
> Maybe we could check if the search string contains a regexp and decide
> whether to pre-process it based on that? I think that would make the
> interface more user-friendly. You'd just always use search, whether or
> not you decide that you need to put in some regexp.

You probably wanted to suggest that the command line handling in notmuch
goes through the search terms and potentially modifies them before giving
them to xapian to chew on... I think this is deliberately avoided (*) --
this would get out of hand so easily (if we could decide on a syntax)...

(*) there is some optimization done before feeding the query to xapian --
but that does not affect the interface (i.e. it could be dropped and none
of the users' expectations would be broken...)

What one can do is write one's own wrapper around notmuch. I have one
that was written long before notmuch got date: searches (it mangles
e.g. 5h.. to 1234567890.. (**)) and logs search and show queries.

(**) should change that to use date:... instead (i.e. date: queries w/o
the date: prefix).

I "suggested" subject:/one's own subject re search w// slashes/
which one could pretty easily write into the wrapper...

Tomi

>
>>
>>> Could it be made to work on the message body?
>>
>> See Austin's previous reply for the details, but basically no; these
>> "values" index in terms of whole strings, while the body is indexed by
>> terms (roughly, words). In principle we could add a value slot for the
>> body, but I think that would at least double the size of the database
>> (maybe more).
>>
>
> I would rather have double the db and be able to wildcard the beginning
> of terms. If it is not too much maintenance overhead, it might be made
> optional?
>
>
> Regards, Gaute
* Re: [PATCH] WIP: regexp matching in 'subject' and 'from'
  2016-06-11 17:09 ` Tomi Ollila
@ 2016-06-11 17:34   ` Gaute Hope
  0 siblings, 0 replies; 21+ messages in thread
From: Gaute Hope @ 2016-06-11 17:34 UTC
To: Tomi Ollila, David Bremner, Austin Clements; +Cc: notmuch

Tomi Ollila writes on June 11, 2016 19:09:
> On Sat, Jun 11 2016, Gaute Hope <eg@gaute.vetsj.com> wrote:
>> Maybe we could check if the search string contains a regexp and decide
>> whether to pre-process it based on that? I think that would make the
>> interface more user-friendly. You'd just always use search, whether or
>> not you decide that you need to put in some regexp.
>
> You probably wanted to suggest that the command line handling in notmuch
> goes through the search terms and potentially modifies them before giving
> them to xapian to chew on... I think this is deliberately avoided (*) --
> this would get out of hand so easily (if we could decide on a syntax)...
>
> (*) there is some optimization done before feeding the query to xapian --
> but that does not affect the interface (i.e. it could be dropped and none
> of the users' expectations would be broken...)
>
> What one can do is write one's own wrapper around notmuch. I have one
> that was written long before notmuch got date: searches (it mangles
> e.g. 5h.. to 1234567890.. (**)) and logs search and show queries.
>
> (**) should change that to use date:... instead (i.e. date: queries w/o
> the date: prefix).
>
> I "suggested" subject:/one's own subject re search w// slashes/
> which one could pretty easily write into the wrapper...
>

Yes, that is pretty much what I meant. So that the user only needs to
know about 'search:': if it is 'search:foo', the regular query parser is
used; if it is 'search:/^foo/', it is preprocessed using the regexp
parser. Then the performance will remain the same for normal queries,
but seamlessly switch to the heavier regexp'er if necessary.

It could be done with a wrapper, but I am mainly using notmuch through
the API and astroid - where it could also be implemented of course.

-gaute
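The wrapper idea discussed above (plain terms go to the normal query parser, slash-delimited terms go to the regexp machinery) could look roughly like the sketch below. Everything here is hypothetical: rewrite_term is an invented helper, the slash syntax is only the one floated in this thread, and the re:field:regexp target form is the one proposed in the revised patch elsewhere in this thread, not a finished notmuch interface.

```cpp
#include <cassert>
#include <string>

// Hypothetical wrapper helper: turn "subject:/regexp/" or "from:/regexp/"
// into "re:subject:regexp" / "re:from:regexp"; leave every other term
// untouched so ordinary queries keep their normal behaviour and speed.
std::string rewrite_term(const std::string &term)
{
    const std::string::size_type colon = term.find(':');
    if (colon == std::string::npos)
        return term; // no field prefix at all

    const std::string field = term.substr(0, colon);
    const std::string value = term.substr(colon + 1);

    // Only the two fields the patch supports are rewritten.
    const bool known_field = (field == "from" || field == "subject");
    const bool slashed = value.size() >= 2 &&
                         value.front() == '/' && value.back() == '/';
    if (!known_field || !slashed)
        return term;

    // Strip the surrounding slashes and add the explicit regexp prefix.
    return "re:" + field + ":" + value.substr(1, value.size() - 2);
}

// Usage example.
void demo_rewrite()
{
    assert(rewrite_term("subject:/^foo/") == "re:subject:^foo");
    assert(rewrite_term("subject:foo") == "subject:foo");
    assert(rewrite_term("tag:/inbox/") == "tag:/inbox/");
}
```

A real wrapper would also have to deal with quoting and with regexps that contain spaces, which this sketch ignores.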
* [PATCH] WIP: regexp matching in 'subject' and 'from'
  2016-06-10  2:28 ` [PATCH] WIP: regexp matching in 'subject' and 'from' David Bremner
  2016-06-10  2:42   ` David Bremner
  2016-06-10  8:38   ` Gaute Hope
@ 2016-06-11  1:49   ` David Bremner
  2 siblings, 0 replies; 21+ messages in thread
From: David Bremner @ 2016-06-11  1:49 UTC
To: David Bremner; +Cc: notmuch, jani

the idea is that you can run

% notmuch search re:subject:<your-favourite-regexp>
% notmuch search re:from:<your-favourite-regexp>

or

% notmuch search subject:"your usual phrase search"
% notmuch search from:"usual phrase search"

This should also work with bindings, since it extends the query parser.

This is trivial to extend for other value slots, but currently the only
value slots are date, message_id, from, subject, and last_mod. Date is
already searchable, and message_id is not obviously useful to regex
match.
---

After some discussion on IRC, here is a version that uses a single re:
prefix internally, and some examples/tests of how that syntax would
work in practice.
 lib/Makefile.local        |   1 +
 lib/database-private.h    |   1 +
 lib/database.cc           |   5 ++
 lib/regexp-fields.cc      | 117 ++++++++++++++++++++++++++++++++++++++++++++++
 lib/regexp-fields.h       |  77 ++++++++++++++++++++++++++++++
 test/T630-regexp-query.sh |  77 ++++++++++++++++++++++++++++++
 6 files changed, 278 insertions(+)
 create mode 100644 lib/regexp-fields.cc
 create mode 100644 lib/regexp-fields.h
 create mode 100755 test/T630-regexp-query.sh

diff --git a/lib/Makefile.local b/lib/Makefile.local
index beb9635..68771e6 100644
--- a/lib/Makefile.local
+++ b/lib/Makefile.local
@@ -51,6 +51,7 @@ libnotmuch_cxx_srcs = \
         $(dir)/query.cc \
         $(dir)/query-fp.cc \
         $(dir)/config.cc \
+        $(dir)/regexp-fields.cc \
         $(dir)/thread.cc

 libnotmuch_modules := $(libnotmuch_c_srcs:.c=.o) $(libnotmuch_cxx_srcs:.cc=.o)
diff --git a/lib/database-private.h b/lib/database-private.h
index ca71a92..900a989 100644
--- a/lib/database-private.h
+++ b/lib/database-private.h
@@ -186,6 +186,7 @@ struct _notmuch_database {
 #if HAVE_XAPIAN_FIELD_PROCESSOR
     Xapian::FieldProcessor *date_field_processor;
     Xapian::FieldProcessor *query_field_processor;
+    Xapian::FieldProcessor *re_field_processor;
 #endif
     Xapian::ValueRangeProcessor *last_mod_range_processor;
 };
diff --git a/lib/database.cc b/lib/database.cc
index afafe88..b52b62d 100644
--- a/lib/database.cc
+++ b/lib/database.cc
@@ -21,6 +21,7 @@
 #include "database-private.h"
 #include "parse-time-vrp.h"
 #include "query-fp.h"
+#include "regexp-fields.h"
 #include "string-util.h"

 #include <iostream>
@@ -1016,6 +1017,8 @@ notmuch_database_open_verbose (const char *path,
         notmuch->query_parser->add_boolean_prefix("date", notmuch->date_field_processor);
         notmuch->query_field_processor = new QueryFieldProcessor (*notmuch->query_parser, notmuch);
         notmuch->query_parser->add_boolean_prefix("query", notmuch->query_field_processor);
+        notmuch->re_field_processor = new RegexpFieldProcessor (*notmuch->query_parser, notmuch);
+        notmuch->query_parser->add_boolean_prefix("re", notmuch->re_field_processor);
 #endif
         notmuch->last_mod_range_processor = new Xapian::NumberValueRangeProcessor (NOTMUCH_VALUE_LAST_MOD, "lastmod:");
@@ -1112,6 +1115,8 @@ notmuch_database_close (notmuch_database_t *notmuch)
     notmuch->date_field_processor = NULL;
     delete notmuch->query_field_processor;
     notmuch->query_field_processor = NULL;
+    delete notmuch->re_field_processor;
+    notmuch->re_field_processor = NULL;
 #endif

     return status;
diff --git a/lib/regexp-fields.cc b/lib/regexp-fields.cc
new file mode 100644
index 0000000..d9d1625
--- /dev/null
+++ b/lib/regexp-fields.cc
@@ -0,0 +1,117 @@
+/* query-fp.cc - "query:" field processor glue
+ *
+ * This file is part of notmuch.
+ *
+ * Copyright © 2015 Austin Clements
+ * Copyright © 2016 David Bremner
+ *
+ * This program is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see https://www.gnu.org/licenses/ .
+ *
+ * Author: Austin Clements <aclements@csail.mit.edu>
+ *         David Bremner <david@tethera.net>
+ */
+
+#include "regexp-fields.h"
+#include "notmuch-private.h"
+
+#if HAVE_XAPIAN_FIELD_PROCESSOR
+RegexpPostingSource::RegexpPostingSource (Xapian::valueno slot, const std::string &regexp)
+    : slot_ (slot)
+{
+    int r = regcomp (&regexp_, regexp.c_str (), REG_EXTENDED | REG_NOSUB);
+
+    if (r != 0)
+        /* XXX Report a query syntax error using regerror */
+        throw "regcomp failed";
+}
+
+RegexpPostingSource::~RegexpPostingSource ()
+{
+    regfree (&regexp_);
+}
+
+void
+RegexpPostingSource::init (const Xapian::Database &db)
+{
+    db_ = db;
+    it_ = db_.valuestream_begin (slot_);
+    end_ = db.valuestream_end (slot_);
+    started_ = false;
+}
+
+Xapian::doccount
+RegexpPostingSource::get_termfreq_min () const
+{
+    return 0;
+}
+
+Xapian::doccount
+RegexpPostingSource::get_termfreq_est () const
+{
+    return get_termfreq_max () / 2;
+}
+
+Xapian::doccount
+RegexpPostingSource::get_termfreq_max () const
+{
+    return db_.get_value_freq (slot_);
+}
+
+Xapian::docid
+RegexpPostingSource::get_docid () const
+{
+    return it_.get_docid ();
+}
+
+bool
+RegexpPostingSource::at_end () const
+{
+    return it_ == end_;
+}
+
+void
+RegexpPostingSource::next (unused (double min_wt))
+{
+    if (started_ && ! at_end ())
+        ++it_;
+    started_ = true;
+
+    for (; ! at_end (); ++it_) {
+        std::string value = *it_;
+        if (regexec (&regexp_, value.c_str (), 0, NULL, 0) == 0)
+            break;
+    }
+}
+
+static Xapian::valueno
+_find_slot(std::string prefix){
+    if (prefix == "from")
+        return NOTMUCH_VALUE_FROM;
+    else if (prefix == "subject")
+        return NOTMUCH_VALUE_SUBJECT;
+    else
+        throw Xapian::QueryParserError ("unsupported regexp field '" + prefix + "'");
+}
+
+Xapian::Query
+RegexpFieldProcessor::operator() (const std::string & str)
+{
+    size_t pos = str.find_first_of(':');
+    std::string prefix = str.substr(0,pos);
+    std::string regexp = str.substr(pos+1);
+
+    postings = new RegexpPostingSource (_find_slot (prefix), regexp);
+    return Xapian::Query (postings);
+}
+#endif
diff --git a/lib/regexp-fields.h b/lib/regexp-fields.h
new file mode 100644
index 0000000..2c9c2d7
--- /dev/null
+++ b/lib/regexp-fields.h
@@ -0,0 +1,77 @@
+/* regex-fields.h - xapian glue for semi-bruteforce regexp search
+ *
+ * This file is part of notmuch.
+ *
+ * Copyright © 2015 Austin Clements
+ * Copyright © 2016 David Bremner
+ *
+ * This program is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see https://www.gnu.org/licenses/ .
+ *
+ * Author: Austin Clements <aclements@csail.mit.edu>
+ *         David Bremner <david@tethera.net>
+ */
+
+#ifndef NOTMUCH_REGEXP_FIELDS_H
+#define NOTMUCH_REGEXP_FIELDS_H
+#if HAVE_XAPIAN_FIELD_PROCESSOR
+#include <sys/types.h>
+#include <regex.h>
+#include <xapian.h>
+#include "notmuch-private.h"
+
+/* A posting source that returns documents where a value matches a
+ * regexp.
+ */
+class RegexpPostingSource : public Xapian::PostingSource
+{
+ protected:
+    const Xapian::valueno slot_;
+    regex_t regexp_;
+    Xapian::Database db_;
+    bool started_;
+    Xapian::ValueIterator it_, end_;
+
+/* No copying */
+    RegexpPostingSource (const RegexpPostingSource &);
+    RegexpPostingSource &operator= (const RegexpPostingSource &);
+
+ public:
+    RegexpPostingSource (Xapian::valueno slot, const std::string &regexp);
+    ~RegexpPostingSource ();
+    void init (const Xapian::Database &db);
+    Xapian::doccount get_termfreq_min () const;
+    Xapian::doccount get_termfreq_est () const;
+    Xapian::doccount get_termfreq_max () const;
+    Xapian::docid get_docid () const;
+    bool at_end () const;
+    void next (unused (double min_wt));
+};
+
+
+class RegexpFieldProcessor : public Xapian::FieldProcessor {
+ protected:
+    Xapian::QueryParser &parser;
+    notmuch_database_t *notmuch;
+    RegexpPostingSource *postings = NULL;
+
+ public:
+    RegexpFieldProcessor (Xapian::QueryParser &parser_, notmuch_database_t *notmuch_)
+        : parser(parser_), notmuch(notmuch_) { };
+
+    ~RegexpFieldProcessor () { delete postings; };
+
+    Xapian::Query operator()(const std::string & str);
+};
+#endif
+#endif /* NOTMUCH_REGEXP_FIELDS_H */
diff --git a/test/T630-regexp-query.sh b/test/T630-regexp-query.sh
new file mode 100755
index 0000000..09caed6
--- /dev/null
+++ b/test/T630-regexp-query.sh
@@ -0,0 +1,77 @@
+#!/usr/bin/env bash
+test_description='named queries'
+. ./test-lib.sh || exit 1
+
+QUERYSTR="date:2009-11-18..2009-11-18 and tag:unread"
+QUERYSTR2="query:test and subject:Maildir"
+
+add_email_corpus
+
+
+if [ $NOTMUCH_HAVE_XAPIAN_FIELD_PROCESSOR -eq 1 ]; then
+
+    notmuch search --output=messages from:cworth > cworth.msg-ids
+
+    test_begin_subtest "regexp from search, case sensitive"
+    notmuch search --output=messages re:from:carl > OUTPUT
+    test_expect_equal_file /dev/null OUTPUT
+
+    test_begin_subtest "empty regexp or query"
+    notmuch search --output=messages re:from:carl or from:cworth > OUTPUT
+    test_expect_equal_file cworth.msg-ids OUTPUT
+
+    test_begin_subtest "non-empty regexp and query"
+    notmuch search re:from:cworth and subject:patch > OUTPUT
+    cat <<EOF > EXPECTED
+thread:0000000000000008 2009-11-18 [1/2] Carl Worth| Alex Botero-Lowry; [notmuch] [PATCH] Error out if no query is supplied to search instead of going into an infinite loop (attachment inbox unread)
+thread:0000000000000007 2009-11-18 [1/2] Carl Worth| Ingmar Vanhassel; [notmuch] [PATCH] Typsos (inbox unread)
+thread:0000000000000018 2009-11-18 [1/2] Carl Worth| Jan Janak; [notmuch] [PATCH] Older versions of install do not support -C. (inbox unread)
+thread:0000000000000017 2009-11-18 [1/2] Carl Worth| Keith Packard; [notmuch] [PATCH] Make notmuch-show 'X' (and 'x') commands remove inbox (and unread) tags (inbox unread)
+thread:0000000000000014 2009-11-18 [2/5] Carl Worth| Mikhail Gusarov, Keith Packard; [notmuch] [PATCH 1/2] Close message file after parsing message headers (inbox unread)
+thread:0000000000000001 2009-11-18 [1/1] Stewart Smith; [notmuch] [PATCH] Fix linking with gcc to use g++ to link in C++ libs. (inbox unread)
+EOF
+    test_expect_equal_file EXPECTED OUTPUT
+
+    test_begin_subtest "regexp from search, duplicate term search"
+    notmuch search --output=messages re:from:cworth > OUTPUT
+    test_expect_equal_file cworth.msg-ids OUTPUT
+
+    test_begin_subtest "long enough regexp matches only desired senders"
+    notmuch search --output=messages 're:"from:C.* Wo"' > OUTPUT
+    test_expect_equal_file cworth.msg-ids OUTPUT
+
+    test_begin_subtest "shorter regexp matches one more sender"
+    notmuch search --output=messages 're:"from:C.* W"' > OUTPUT
+    (echo id:1258544095-16616-1-git-send-email-chris@chris-wilson.co.uk ; cat cworth.msg-ids) > EXPECTED
+    test_expect_equal_file EXPECTED OUTPUT
+
+    test_begin_subtest "regexp subject search, non-ASCII"
+    notmuch search --output=messages re:subject:accentué > OUTPUT
+    echo id:877h1wv7mg.fsf@inf-8657.int-evry.fr > EXPECTED
+    test_expect_equal_file EXPECTED OUTPUT
+
+    test_begin_subtest "regexp subject search, punctuation"
+    notmuch search re:subject:\'X\' > OUTPUT
+    cat <<EOF > EXPECTED
+thread:0000000000000017 2009-11-18 [2/2] Keith Packard, Carl Worth; [notmuch] [PATCH] Make notmuch-show 'X' (and 'x') commands remove inbox (and unread) tags (inbox unread)
+EOF
+    test_expect_equal_file EXPECTED OUTPUT
+
+    test_begin_subtest "regexp subject search, no punctuation"
+    notmuch search re:subject:X > OUTPUT
+    cat <<EOF > EXPECTED
+thread:0000000000000017 2009-11-18 [2/2] Keith Packard, Carl Worth; [notmuch] [PATCH] Make notmuch-show 'X' (and 'x') commands remove inbox (and unread) tags (inbox unread)
+thread:000000000000000f 2009-11-18 [4/4] Jjgod Jiang, Alexander Botero-Lowry; [notmuch] Mac OS X/Darwin compatibility issues (inbox unread)
+EOF
+    test_expect_equal_file EXPECTED OUTPUT
+
+    test_begin_subtest "combine regexp from and subject"
+    notmuch search re:subject:-C and re:from:.an.k > OUTPUT
+    cat <<EOF > EXPECTED
+thread:0000000000000018 2009-11-17 [1/2] Jan Janak| Carl Worth; [notmuch] [PATCH] Older versions of install do not support -C. (inbox unread)
+EOF
+    test_expect_equal_file EXPECTED OUTPUT
+
+fi
+
+test_done
--
2.8.1
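The field processor in the patch receives the text after "re:" (for example "from:C.* Wo") and splits it at the first colon into a field name and a regexp. The sketch below isolates that step. The name split_field and the explicit error for a missing colon are assumptions added for illustration; as far as I can tell, the patch as posted does not check for a missing colon, so both halves end up as the whole input and the request is rejected later as an unsupported field.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <utility>

// Illustrative helper (not in the patch): split "field:regexp" at the
// FIRST colon, so that colons inside the regexp itself are preserved.
std::pair<std::string, std::string> split_field(const std::string &str)
{
    const std::string::size_type pos = str.find(':');
    if (pos == std::string::npos)
        throw std::invalid_argument("expected field:regexp, got '" + str + "'");

    return { str.substr(0, pos), str.substr(pos + 1) };
}

// Usage example.
void demo_split()
{
    const std::pair<std::string, std::string> a = split_field("from:C.* Wo");
    assert(a.first == "from");
    assert(a.second == "C.* Wo");

    // Only the first colon separates field and regexp.
    const std::pair<std::string, std::string> b = split_field("subject:Re: patch");
    assert(b.first == "subject");
    assert(b.second == "Re: patch");
}
```

Splitting at the first colon rather than the last keeps regexps such as "Re: patch" intact.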
end of thread, other threads:[~2016-06-11 17:34 UTC | newest]

Thread overview: 21+ messages
2016-06-06  6:58 searching: '*analysis' vs 'reanalysis' Gaute Hope
2016-06-06 12:42 ` David Bremner
2016-06-06 12:53 ` Gaute Hope
2016-06-06 15:52 ` Sebastian Fischmeister
2016-06-06 17:29 ` David Bremner
2016-06-06 19:20 ` Austin Clements
2016-06-06 20:08 ` Gaute Hope
2016-06-06 20:22 ` Austin Clements
2016-06-07  2:05 ` [PATCH] WIP: regexp matching in subjects David Bremner
2016-06-07 10:16 ` David Bremner
2016-06-10  2:28 ` [PATCH] WIP: regexp matching in 'subject' and 'from' David Bremner
2016-06-10  2:42 ` David Bremner
2016-06-10 11:11 ` Tomi Ollila
2016-06-10 11:50 ` David Bremner
2016-06-10  8:38 ` Gaute Hope
2016-06-10 11:09 ` David Bremner
2016-06-11 16:32 ` Gaute Hope
2016-06-11 16:49 ` David Bremner
2016-06-11 17:09 ` Tomi Ollila
2016-06-11 17:34 ` Gaute Hope
2016-06-11  1:49 ` David Bremner