From: David Bremner <david@tethera.net>
To: notmuch@notmuchmail.org
Cc: David Bremner <david@tethera.net>
Subject: [PATCH 13/36] lib/parse-sexp: support phrase queries.
Date: Tue, 24 Aug 2021 08:17:22 -0700
Message-ID: <20210824151745.2941868-14-david@tethera.net>
In-Reply-To: <20210824151745.2941868-1-david@tethera.net>

Anything that is quoted or not purely word characters is considered a
phrase.  Phrases are not stemmed, because the stems do not have
positional information in the database. It is less efficient to scan
the term twice, but it avoids a second pass to add prefixes, so maybe
it balances out. In any case, it seems unlikely query parsing is very
often a bottleneck.
---
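
Reader's note (not part of the commit): the rule in the message above
can be sketched without Xapian roughly as follows. This is only an
illustration under stated assumptions: `is_wordchar_ascii`,
`needs_phrase` and `split_phrase` are hypothetical names, and the
ASCII-only word test stands in for the Unicode-aware
Xapian::Unicode::is_wordchar used by the patch. The sketch returns the
list of prefixed terms instead of building an OP_PHRASE query.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// ASCII-only approximation of a "word character" (assumption: the real
// code is Unicode-aware; here every non-ASCII byte acts as a separator).
bool is_wordchar_ascii (unsigned char c)
{
    return std::isalnum (c) || c == '_';
}

// A term is treated as a phrase if it was quoted, or if it contains
// anything other than word characters. Empty terms are not modeled.
bool needs_phrase (bool quoted, const std::string &term)
{
    return quoted ||
	   ! std::all_of (term.begin (), term.end (),
			  [] (unsigned char c) { return is_wordchar_ascii (c); });
}

// Split a phrase into lower-cased words, skipping runs of non-word
// characters, and prepend term_prefix to each word. No stemming is done.
std::vector<std::string>
split_phrase (const std::string &term_prefix, const std::string &phrase)
{
    std::vector<std::string> terms;
    std::size_t i = 0;

    while (i < phrase.size ()) {
	// Skip separators (punctuation, whitespace).
	while (i < phrase.size () && ! is_wordchar_ascii (phrase[i]))
	    i++;

	std::size_t start = i;

	// Consume one word.
	while (i < phrase.size () && is_wordchar_ascii (phrase[i]))
	    i++;

	if (i > start) {
	    std::string word = phrase.substr (start, i - start);
	    for (char &c : word)
		c = static_cast<char> (std::tolower (static_cast<unsigned char> (c)));
	    terms.push_back (term_prefix + word);
	}
    }
    return terms;
}
```

With these helpers, an unquoted `X/Darwin` counts as a phrase and splits
into two terms, while an unquoted `issues` does not and would go down
the stemming path instead.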
 doc/man7/notmuch-sexp-queries.rst | 32 ++++++++++++++++++----
 lib/parse-sexp.cc                 | 45 +++++++++++++++++++++++++------
 test/T081-sexpr-search.sh         | 21 +++++++++++++--
 3 files changed, 83 insertions(+), 15 deletions(-)

diff --git a/doc/man7/notmuch-sexp-queries.rst b/doc/man7/notmuch-sexp-queries.rst
index 08e97cc3..b763876d 100644
--- a/doc/man7/notmuch-sexp-queries.rst
+++ b/doc/man7/notmuch-sexp-queries.rst
@@ -40,10 +40,12 @@ subqueries.
     Match all messages.
 
 *term*
-    Match all messages containing *term*, possibly after
-    stemming or phase splitting. For discussion of stemming in
-    notmuch see :any:`notmuch-search-terms(7)`. Stemming only applies
-    to unquoted terms (basic values) in s-expression queries.
+
+    Match all messages containing *term*, possibly after stemming or
+    phrase splitting. For discussion of stemming in notmuch see
+    :any:`notmuch-search-terms(7)`. Stemming only applies to unquoted
+    terms (basic values) in s-expression queries.  For information on
+    phrase splitting see :any:`fields`.
 
 ``(`` *field* |q1| |q2| ... |qn| ``)``
     Restrict the queries |q1| to |qn| to *field*, and combine with *and*
@@ -63,7 +65,7 @@ subqueries.
 FIELDS
 ``````
 
-*Fields* (also called *prefixes* in notmuch documentation)
+*Fields* [#aka-pref]_
 correspond to attributes of mail messages. Some are inherent (and
 immutable) like ``subject``, while others ``tag`` and ``property`` are
 settable by the user.  Each concrete field in
@@ -72,6 +74,13 @@ is discussed further under "Search prefixes" in
 :any:`notmuch-search-terms(7)`. The row *user* refers to user defined
 fields, described in :any:`notmuch-config(1)`.
 
+Most fields are either *phrase fields* [#aka-prob]_ (which match
+sequences of words), or *term fields* [#aka-bool]_ (which match exact
+strings). *Phrase splitting* breaks the term (basic value or quoted
+string) into words, ignoring punctuation. Phrase splitting is applied to
+terms in phrase (probabilistic) fields. Both phrase splitting and
+stemming apply only in phrase fields.
+
 .. _field-table:
 
 .. table:: Fields with supported modifiers
@@ -138,10 +147,23 @@ EXAMPLES
 ``(not Bob Marley)``
     Match messages containing neither "Bob" nor "Marley", nor their stems,
 
+``"quick fox"`` ``quick-fox`` ``quick@fox``
+    Match the *phrase* "quick" followed by "fox" in phrase fields (or
+    outside a field). Match the literal string in a term field.
+
 ``(subject quick "brown fox")``
     Match messages whose subject contains "quick" (anywhere, stemmed) and
     the phrase "brown fox".
 
+NOTES
+=====
+
+.. [#aka-pref] a.k.a. prefixes
+
+.. [#aka-prob] a.k.a. probabilistic prefixes
+
+.. [#aka-bool] a.k.a. boolean prefixes
+
 .. |q1| replace:: :math:`q_1`
 .. |q2| replace:: :math:`q_2`
 .. |qn| replace:: :math:`q_n`
diff --git a/lib/parse-sexp.cc b/lib/parse-sexp.cc
index 25556058..0917f505 100644
--- a/lib/parse-sexp.cc
+++ b/lib/parse-sexp.cc
@@ -2,7 +2,7 @@
 
 #if HAVE_SFSEXP
 #include "sexp.h"
-
+#include "unicode-util.h"
 
 /* _sexp is used for file scope symbols to avoid clashing with
  * definitions from sexp.h */
@@ -67,6 +67,36 @@ _sexp_combine_query (notmuch_database_t *notmuch,
 				sx->next, output);
 }
 
+static notmuch_status_t
+_sexp_parse_phrase (std::string term_prefix, const char *phrase, Xapian::Query &output)
+{
+    Xapian::Utf8Iterator p (phrase);
+    Xapian::Utf8Iterator end;
+    std::vector<std::string> terms;
+
+    while (p != end) {
+	Xapian::Utf8Iterator start;
+	while (p != end && ! Xapian::Unicode::is_wordchar (*p))
+	    p++;
+
+	if (p == end)
+	    break;
+
+	start = p;
+
+	while (p != end && Xapian::Unicode::is_wordchar (*p))
+	    p++;
+
+	if (p != start) {
+	    std::string word (start, p);
+	    word = Xapian::Unicode::tolower (word);
+	    terms.push_back (term_prefix + word);
+	}
+    }
+    output = Xapian::Query (Xapian::Query::OP_PHRASE, terms.begin (), terms.end ());
+    return NOTMUCH_STATUS_SUCCESS;
+}
+
 /* Here we expect the s-expression to be a proper list, with first
  * element defining and operation, or as a special case the empty
  * list */
@@ -80,13 +110,12 @@ _sexp_to_xapian_query (notmuch_database_t *notmuch, const _sexp_prefix_t *parent
 	std::string term = Xapian::Unicode::tolower (sx->val);
 	Xapian::Stem stem = *(notmuch->stemmer);
 	std::string term_prefix = parent ? _find_prefix (parent->name) : "";
-	if (sx->aty == SEXP_BASIC)
-	    term = "Z" + term_prefix + stem (term);
-	else
-	    term = term_prefix + term;
-
-	output = Xapian::Query (term);
-	return NOTMUCH_STATUS_SUCCESS;
+	if (sx->aty == SEXP_BASIC && unicode_word_utf8 (sx->val)) {
+	    output = Xapian::Query ("Z" + term_prefix + stem (term));
+	    return NOTMUCH_STATUS_SUCCESS;
+	} else {
+	    return _sexp_parse_phrase (term_prefix, sx->val, output);
+	}
     }
 
     /* Empty list */
diff --git a/test/T081-sexpr-search.sh b/test/T081-sexpr-search.sh
index 90cef50c..4a051a50 100755
--- a/test/T081-sexpr-search.sh
+++ b/test/T081-sexpr-search.sh
@@ -102,15 +102,32 @@ EOF
 test_expect_equal_file EXPECTED OUTPUT
 
 test_begin_subtest "Search by 'subject' (utf-8, phrase-token):"
-test_subtest_known_broken
 output=$(notmuch search --query=sexp '(subject utf8-sübjéct)' | notmuch_search_sanitize)
 test_expect_equal "$output" "thread:XXX   2000-01-01 [1/1] Notmuch Test Suite; utf8-sübjéct (inbox unread)"
 
 test_begin_subtest "Search by 'subject' (utf-8, quoted string):"
-test_subtest_known_broken
 output=$(notmuch search --query=sexp '(subject "utf8 sübjéct")' | notmuch_search_sanitize)
 test_expect_equal "$output" "thread:XXX   2000-01-01 [1/1] Notmuch Test Suite; utf8-sübjéct (inbox unread)"
 
+test_begin_subtest "Search by 'subject' (combine phrase, term):"
+output=$(notmuch search --query=sexp '(subject Mac "compatibility issues")' | notmuch_search_sanitize)
+test_expect_equal "$output" "thread:XXX   2009-11-18 [4/4] Jjgod Jiang, Alexander Botero-Lowry; [notmuch] Mac OS X/Darwin compatibility issues (inbox unread)"
+
+test_begin_subtest "Search by 'subject' (combine phrase, term 2):"
+notmuch search --query=sexp '(subject (or utf8 "compatibility issues"))' | notmuch_search_sanitize > OUTPUT
+cat <<EOF > EXPECTED
+thread:XXX   2009-11-18 [4/4] Jjgod Jiang, Alexander Botero-Lowry; [notmuch] Mac OS X/Darwin compatibility issues (inbox unread)
+thread:XXX   2000-01-01 [1/1] Notmuch Test Suite; utf8-sübjéct (inbox unread)
+EOF
+test_expect_equal_file EXPECTED OUTPUT
+
+test_begin_subtest "Search by 'subject' (combine phrase, term 3):"
+notmuch search --query=sexp '(subject issues X/Darwin)' | notmuch_search_sanitize > OUTPUT
+cat <<EOF > EXPECTED
+thread:XXX   2009-11-18 [4/4] Jjgod Jiang, Alexander Botero-Lowry; [notmuch] Mac OS X/Darwin compatibility issues (inbox unread)
+EOF
+test_expect_equal_file EXPECTED OUTPUT
+
 test_begin_subtest "Unbalanced parens"
 # A code 1 indicates the error was handled (a crash will return e.g. 139).
 test_expect_code 1 "notmuch search --query=sexp '('"
-- 
2.32.0

Thread overview: 38+ messages
2021-08-24 15:17 v5 sexp query parser David Bremner
2021-08-24 15:17 ` [PATCH 01/36] CLI: make variable n_requested_db_uuid file scope David Bremner
2021-08-24 15:17 ` [PATCH 02/36] configure: optional library sfsexp David Bremner
2021-08-24 15:17 ` [PATCH 03/36] lib: split notmuch_query_create David Bremner
2021-08-24 15:17 ` [PATCH 04/36] lib: define notmuch_query_create_with_syntax David Bremner
2021-08-24 15:17 ` [PATCH 05/36] CLI/search+address: support sexpr queries David Bremner
2021-08-24 15:17 ` [PATCH 06/36] lib: add new status code for query syntax errors David Bremner
2021-08-24 15:17 ` [PATCH 07/36] lib/parse-sexp: parse single terms and the empty list David Bremner
2021-08-24 15:17 ` [PATCH 08/36] lib: leave stemmer object accessible David Bremner
2021-08-24 15:17 ` [PATCH 09/36] lib/parse-sexp: stem unquoted atoms David Bremner
2021-08-24 15:17 ` [PATCH 10/36] lib/parse-sexp: support and, not, and or David Bremner
2021-08-24 15:17 ` [PATCH 11/36] lib/parse-sexp: support subject field David Bremner
2021-08-24 15:17 ` [PATCH 12/36] util/unicode: allow calling from C++ David Bremner
2021-08-24 15:17 ` David Bremner [this message]
2021-08-24 15:17 ` [PATCH 14/36] lib/parse-sexp: add term prefix backed fields David Bremner
2021-08-24 15:17 ` [PATCH 15/36] lib/parse-sexp: 'starts-with' wildcard searches David Bremner
2021-08-24 15:17 ` [PATCH 16/36] lib/parse-sexp: add '*' as syntactic sugar for '(starts-with "")' David Bremner
2021-08-24 15:17 ` [PATCH 17/36] lib/parse-sexp: handle unprefixed terms David Bremner
2021-08-24 15:17 ` [PATCH 18/36] lib/query: generalize exclude handling to s-expression queries David Bremner
2021-08-24 15:17 ` [PATCH 19/36] lib: factor out query construction from regexp David Bremner
2021-08-24 15:17 ` [PATCH 20/36] lib/parse-sexp: support regular expressions David Bremner
2021-08-24 15:17 ` [PATCH 21/36] lib: generate actual Xapian query for "*" and "" David Bremner
2021-08-24 15:17 ` [PATCH 22/36] lib/query: factor out _notmuch_query_string_to_xapian_query David Bremner
2021-08-24 15:17 ` [PATCH 23/36] lib/thread-fp: factor out query expansion, rewrite in Xapian David Bremner
2021-08-24 15:17 ` [PATCH 24/36] lib/parse-sexp: expand queries David Bremner
2021-08-24 15:17 ` [PATCH 25/36] lib/parse-sexp: support infix subqueries David Bremner
2021-08-24 15:17 ` [PATCH 26/36] lib/parse-sexp: parse user headers David Bremner
2021-08-24 15:17 ` [PATCH 27/36] lib: factor out expansion of saved queries David Bremner
2021-08-24 15:17 ` [PATCH 28/36] lib/parse-sexp: handle " David Bremner
2021-08-24 15:17 ` [PATCH 29/36] CLI/config support saving s-expression queries David Bremner
2021-08-24 15:17 ` [PATCH 30/36] lib/parse-sexp: support saved " David Bremner
2021-08-24 15:17 ` [PATCH 31/36] lib/parse-sexp: thread environment argument through parser David Bremner
2021-08-24 15:17 ` [PATCH 32/36] lib/parse-sexp: apply macros David Bremner
2021-08-24 15:17 ` [PATCH 33/36] CLI: move query syntax to shared option David Bremner
2021-08-24 15:17 ` [PATCH 34/36] CLI/{count, dump, reindex, reply, show}: enable sexp queries David Bremner
2021-08-24 15:17 ` [PATCH 35/36] CLI/tag: " David Bremner
2021-08-24 15:17 ` [PATCH 36/36] doc/sexp-queries: update synopsis and description David Bremner
2021-09-05 19:31 ` v5 sexp query parser David Bremner
