unofficial mirror of notmuch@notmuchmail.org
 help / color / mirror / code / Atom feed
* [PATCH 0/5] Fix adjacent and overlapping termpos
@ 2014-06-16  2:40 Austin Clements
  2014-06-16  2:40 ` [PATCH 1/5] test: Fix from/to search test queries Austin Clements
                   ` (6 more replies)
  0 siblings, 7 replies; 9+ messages in thread
From: Austin Clements @ 2014-06-16  2:40 UTC (permalink / raw)
  To: notmuch

This series fixes several related bugs that caused indexing to
generate adjacent and overlapping term positions, which caused
messages to incorrectly match phrase queries.  This obsoletes (and
fixes) the known-broken tests from
id:1400359552-10928-1-git-send-email-amdragon@mit.edu.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH 1/5] test: Fix from/to search test queries
  2014-06-16  2:40 [PATCH 0/5] Fix adjacent and overlapping termpos Austin Clements
@ 2014-06-16  2:40 ` Austin Clements
  2014-06-16  2:40 ` [PATCH 2/5] test: Add search tests for combined name/address queries Austin Clements
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 9+ messages in thread
From: Austin Clements @ 2014-06-16  2:40 UTC (permalink / raw)
  To: notmuch

Two of the search tests for "from" and "to" queries were clearly
trying to search for prefixed phrases, but forgot to shell quote the
phrases.  Fix this by quoting them correctly.
---
 test/T080-search.sh | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/test/T080-search.sh b/test/T080-search.sh
index a7a0b18..4f0e16c 100755
--- a/test/T080-search.sh
+++ b/test/T080-search.sh
@@ -59,7 +59,7 @@ test_expect_equal "$output" "thread:XXX   2000-01-01 [1/1] searchbyfrom@example.
 
 test_begin_subtest "Search by from: (name)"
 add_message '[subject]="search by from (name)"' '[date]="Sat, 01 Jan 2000 12:00:00 -0000"' '[from]="Search By From Name <test@example.com>"'
-output=$(notmuch search from:"Search By From Name" | notmuch_search_sanitize)
+output=$(notmuch search 'from:"Search By From Name"' | notmuch_search_sanitize)
 test_expect_equal "$output" "thread:XXX   2000-01-01 [1/1] Search By From Name; search by from (name) (inbox unread)"
 
 test_begin_subtest "Search by to: (address)"
@@ -69,7 +69,7 @@ test_expect_equal "$output" "thread:XXX   2000-01-01 [1/1] Notmuch Test Suite; s
 
 test_begin_subtest "Search by to: (name)"
 add_message '[subject]="search by to (name)"' '[date]="Sat, 01 Jan 2000 12:00:00 -0000"' '[to]="Search By To Name <test@example.com>"'
-output=$(notmuch search to:"Search By To Name" | notmuch_search_sanitize)
+output=$(notmuch search 'to:"Search By To Name"' | notmuch_search_sanitize)
 test_expect_equal "$output" "thread:XXX   2000-01-01 [1/1] Notmuch Test Suite; search by to (name) (inbox unread)"
 
 test_begin_subtest "Search by subject: (phrase)"
-- 
2.0.0.rc2

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 2/5] test: Add search tests for combined name/address queries
  2014-06-16  2:40 [PATCH 0/5] Fix adjacent and overlapping termpos Austin Clements
  2014-06-16  2:40 ` [PATCH 1/5] test: Fix from/to search test queries Austin Clements
@ 2014-06-16  2:40 ` Austin Clements
  2014-06-16  2:40 ` [PATCH 3/5] lib: Index name and address of from/to headers as a phrase Austin Clements
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 9+ messages in thread
From: Austin Clements @ 2014-06-16  2:40 UTC (permalink / raw)
  To: notmuch

Two of these are currently known-broken.  We index the name and
address parts in two separate calls to _notmuch_message_gen_terms.
Currently this has the effect of placing the term positions of the
prefixed terms from the second call right after those of the first
call, but screws up the term positions of the non-prefixed terms.
---
 test/T080-search.sh | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/test/T080-search.sh b/test/T080-search.sh
index 4f0e16c..8ed5701 100755
--- a/test/T080-search.sh
+++ b/test/T080-search.sh
@@ -62,6 +62,15 @@ add_message '[subject]="search by from (name)"' '[date]="Sat, 01 Jan 2000 12:00:
 output=$(notmuch search 'from:"Search By From Name"' | notmuch_search_sanitize)
 test_expect_equal "$output" "thread:XXX   2000-01-01 [1/1] Search By From Name; search by from (name) (inbox unread)"
 
+test_begin_subtest "Search by from: (name and address)"
+output=$(notmuch search 'from:"Search By From Name <test@example.com>"' | notmuch_search_sanitize)
+test_expect_equal "$output" "thread:XXX   2000-01-01 [1/1] Search By From Name; search by from (name) (inbox unread)"
+
+test_begin_subtest "Search by from: without prefix (name and address)"
+test_subtest_known_broken
+output=$(notmuch search '"Search By From Name <test@example.com>"' | notmuch_search_sanitize)
+test_expect_equal "$output" "thread:XXX   2000-01-01 [1/1] Search By From Name; search by from (name) (inbox unread)"
+
 test_begin_subtest "Search by to: (address)"
 add_message '[subject]="search by to (address)"' '[date]="Sat, 01 Jan 2000 12:00:00 -0000"' [to]=searchbyto@example.com
 output=$(notmuch search to:searchbyto@example.com | notmuch_search_sanitize)
@@ -72,6 +81,15 @@ add_message '[subject]="search by to (name)"' '[date]="Sat, 01 Jan 2000 12:00:00
 output=$(notmuch search 'to:"Search By To Name"' | notmuch_search_sanitize)
 test_expect_equal "$output" "thread:XXX   2000-01-01 [1/1] Notmuch Test Suite; search by to (name) (inbox unread)"
 
+test_begin_subtest "Search by to: (name and adress)"
+output=$(notmuch search 'to:"Search By To Name <test@example.com>"' | notmuch_search_sanitize)
+test_expect_equal "$output" "thread:XXX   2000-01-01 [1/1] Notmuch Test Suite; search by to (name) (inbox unread)"
+
+test_begin_subtest "Search by to: without prefix (name and adress)"
+test_subtest_known_broken
+output=$(notmuch search '"Search By To Name <test@example.com>"' | notmuch_search_sanitize)
+test_expect_equal "$output" "thread:XXX   2000-01-01 [1/1] Notmuch Test Suite; search by to (name) (inbox unread)"
+
 test_begin_subtest "Search by subject: (phrase)"
 add_message '[subject]="subject search test (phrase)"' '[date]="Sat, 01 Jan 2000 12:00:00 -0000"'
 add_message '[subject]="this phrase should not match the subject search test"' '[date]="Sat, 01 Jan 2000 12:00:00 -0000"'
-- 
2.0.0.rc2

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 3/5] lib: Index name and address of from/to headers as a phrase
  2014-06-16  2:40 [PATCH 0/5] Fix adjacent and overlapping termpos Austin Clements
  2014-06-16  2:40 ` [PATCH 1/5] test: Fix from/to search test queries Austin Clements
  2014-06-16  2:40 ` [PATCH 2/5] test: Add search tests for combined name/address queries Austin Clements
@ 2014-06-16  2:40 ` Austin Clements
  2014-06-16  2:40 ` [PATCH 4/5] test: Known-broken test for overlapping/adjacent termpos Austin Clements
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 9+ messages in thread
From: Austin Clements @ 2014-06-16  2:40 UTC (permalink / raw)
  To: notmuch

Previously, we indexed the name and address parts of from/to headers
with two calls to _notmuch_message_gen_terms.  In general, this
indicates that these parts are separate phrases.  However, because of
an implementation quirk, the two calls to _notmuch_message_gen_terms
generated adjacent term positions for the prefixed terms, which
happens to be the right thing to do in this case, but the wrong thing
to do for all other calls.  Furthermore, _notmuch_message_gen_terms
produced potentially overlapping term positions for the un-prefixed
copies of the terms, which is simply wrong.

This change indexes both the name and address in a single call to
_notmuch_message_gen_terms, indicating that they should be part of a
single phrase.  This masks the problem with the un-prefixed terms
(fixing the two known-broken tests) and puts us in a position to fix
the unintentionally phrases generated by other calls to
_notmuch_message_gen_terms.
---
 lib/index.cc        | 24 ++++++++++--------------
 test/T080-search.sh |  2 --
 2 files changed, 10 insertions(+), 16 deletions(-)

diff --git a/lib/index.cc b/lib/index.cc
index e1e2a38..1a2e63d 100644
--- a/lib/index.cc
+++ b/lib/index.cc
@@ -231,26 +231,22 @@ _index_address_mailbox (notmuch_message_t *message,
 			InternetAddress *address)
 {
     InternetAddressMailbox *mailbox = INTERNET_ADDRESS_MAILBOX (address);
-    const char *name, *addr;
+    const char *name, *addr, *combined;
     void *local = talloc_new (message);
 
     name = internet_address_get_name (address);
     addr = internet_address_mailbox_get_addr (mailbox);
 
-    /* In the absence of a name, we'll strip the part before the @
-     * from the address. */
-    if (! name) {
-	const char *at;
+    /* Combine the name and address and index them as a phrase. */
+    if (name && addr)
+	combined = talloc_asprintf (local, "%s %s", name, addr);
+    else if (name)
+	combined = name;
+    else
+	combined = addr;
 
-	at = strchr (addr, '@');
-	if (at)
-	    name = talloc_strndup (local, addr, at - addr);
-    }
-
-    if (name)
-	_notmuch_message_gen_terms (message, prefix_name, name);
-    if (addr)
-	_notmuch_message_gen_terms (message, prefix_name, addr);
+    if (combined)
+	_notmuch_message_gen_terms (message, prefix_name, combined);
 
     talloc_free (local);
 }
diff --git a/test/T080-search.sh b/test/T080-search.sh
index 8ed5701..b63bf02 100755
--- a/test/T080-search.sh
+++ b/test/T080-search.sh
@@ -67,7 +67,6 @@ output=$(notmuch search 'from:"Search By From Name <test@example.com>"' | notmuc
 test_expect_equal "$output" "thread:XXX   2000-01-01 [1/1] Search By From Name; search by from (name) (inbox unread)"
 
 test_begin_subtest "Search by from: without prefix (name and address)"
-test_subtest_known_broken
 output=$(notmuch search '"Search By From Name <test@example.com>"' | notmuch_search_sanitize)
 test_expect_equal "$output" "thread:XXX   2000-01-01 [1/1] Search By From Name; search by from (name) (inbox unread)"
 
@@ -86,7 +85,6 @@ output=$(notmuch search 'to:"Search By To Name <test@example.com>"' | notmuch_se
 test_expect_equal "$output" "thread:XXX   2000-01-01 [1/1] Notmuch Test Suite; search by to (name) (inbox unread)"
 
 test_begin_subtest "Search by to: without prefix (name and adress)"
-test_subtest_known_broken
 output=$(notmuch search '"Search By To Name <test@example.com>"' | notmuch_search_sanitize)
 test_expect_equal "$output" "thread:XXX   2000-01-01 [1/1] Notmuch Test Suite; search by to (name) (inbox unread)"
 
-- 
2.0.0.rc2

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 4/5] test: Known-broken test for overlapping/adjacent termpos
  2014-06-16  2:40 [PATCH 0/5] Fix adjacent and overlapping termpos Austin Clements
                   ` (2 preceding siblings ...)
  2014-06-16  2:40 ` [PATCH 3/5] lib: Index name and address of from/to headers as a phrase Austin Clements
@ 2014-06-16  2:40 ` Austin Clements
  2014-06-16  2:40 ` [PATCH 5/5] lib: Separate all phrases indexed by _notmuch_message_gen_terms Austin Clements
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 9+ messages in thread
From: Austin Clements @ 2014-06-16  2:40 UTC (permalink / raw)
  To: notmuch

This adds two known-broken tests and one working test related to the
term positions assigned to terms from different headers or MIME parts.
The first test fails because we don't create a termpos gap between
different headers.  The second test fails because we don't adjust
termpos at all when indexing multiple parts.
---
 test/T080-search.sh | 40 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)

diff --git a/test/T080-search.sh b/test/T080-search.sh
index b63bf02..3f6b565 100755
--- a/test/T080-search.sh
+++ b/test/T080-search.sh
@@ -145,4 +145,44 @@ add_message '[subject]="utf8-message-body-subject"' '[date]="Sat, 01 Jan 2000 12
 output=$(notmuch search "bödý" | notmuch_search_sanitize)
 test_expect_equal "$output" "thread:XXX   2000-01-01 [1/1] Notmuch Test Suite; utf8-message-body-subject (inbox unread)"
 
+
+cat <<EOF > ${MAIL_DIR}/termpos
+From: Source <source@example.com>
+To: Dest <dest@example.com>
+Subject: part overlap test
+Date: Sat, 01 January 2000 00:00:00 +0000
+Message-ID: <termpos>
+MIME-Version: 1.0
+Content-Type: multipart/mixed; boundary="==-=="
+
+--==-==
+Content-Type: text/plain
+
+a b c
+
+--==-==
+Content-Type: text/plain
+
+x y z
+
+--==-==--
+EOF
+notmuch new > /dev/null
+
+test_begin_subtest "headers do not have adjacent term positions"
+test_subtest_known_broken
+# Regression test for a bug where term positions for non-prefixed
+# terms weren't updated
+output=$(notmuch search id:termpos and '"com dest"')
+test_expect_equal "$output" ""
+
+test_begin_subtest "parts have non-overlapping term positions"
+test_subtest_known_broken
+output=$(notmuch search id:termpos and '"a y c"')
+test_expect_equal "$output" ""
+
+test_begin_subtest "parts do not have adjacent term positions"
+output=$(notmuch search id:termpos and '"c x"')
+test_expect_equal "$output" ""
+
 test_done
-- 
2.0.0.rc2

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 5/5] lib: Separate all phrases indexed by _notmuch_message_gen_terms
  2014-06-16  2:40 [PATCH 0/5] Fix adjacent and overlapping termpos Austin Clements
                   ` (3 preceding siblings ...)
  2014-06-16  2:40 ` [PATCH 4/5] test: Known-broken test for overlapping/adjacent termpos Austin Clements
@ 2014-06-16  2:40 ` Austin Clements
  2014-06-18 21:11   ` David Bremner
  2014-06-21 19:20 ` [PATCH 0/5] Fix adjacent and overlapping termpos Tomi Ollila
  2014-06-22 10:02 ` David Bremner
  6 siblings, 1 reply; 9+ messages in thread
From: Austin Clements @ 2014-06-16  2:40 UTC (permalink / raw)
  To: notmuch

This adds a 100 termpos gap between all phrases indexed by
_notmuch_message_gen_terms.  This fixes a bug where terms from the end
of one header and the beginning of another header could match together
in a single phrase and a separate bug where term positions of
un-prefixed terms overlapped.

This fix only affects newly indexed messages.  Messages that are
already indexed won't benefit from this fix without re-indexing, but
the fix won't make things any worse for existing messages.
---
 lib/message.cc      | 9 +++++++--
 test/T080-search.sh | 2 --
 2 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/lib/message.cc b/lib/message.cc
index 9243b76..d0b7351 100644
--- a/lib/message.cc
+++ b/lib/message.cc
@@ -1023,16 +1023,21 @@ _notmuch_message_gen_terms (notmuch_message_t *message,
 	return NOTMUCH_PRIVATE_STATUS_NULL_POINTER;
 
     term_gen->set_document (message->doc);
-    term_gen->set_termpos (message->termpos);
 
     if (prefix_name) {
 	const char *prefix = _find_prefix (prefix_name);
 
+	term_gen->set_termpos (message->termpos);
 	term_gen->index_text (text, 1, prefix);
-	message->termpos = term_gen->get_termpos ();
+	/* Create a gap between this an the next terms so they don't
+	 * appear to be a phrase. */
+	message->termpos = term_gen->get_termpos () + 100;
     }
 
+    term_gen->set_termpos (message->termpos);
     term_gen->index_text (text);
+    /* Create a term gap, as above. */
+    message->termpos = term_gen->get_termpos () + 100;
 
     return NOTMUCH_PRIVATE_STATUS_SUCCESS;
 }
diff --git a/test/T080-search.sh b/test/T080-search.sh
index 3f6b565..05027fb 100755
--- a/test/T080-search.sh
+++ b/test/T080-search.sh
@@ -170,14 +170,12 @@ EOF
 notmuch new > /dev/null
 
 test_begin_subtest "headers do not have adjacent term positions"
-test_subtest_known_broken
 # Regression test for a bug where term positions for non-prefixed
 # terms weren't updated
 output=$(notmuch search id:termpos and '"com dest"')
 test_expect_equal "$output" ""
 
 test_begin_subtest "parts have non-overlapping term positions"
-test_subtest_known_broken
 output=$(notmuch search id:termpos and '"a y c"')
 test_expect_equal "$output" ""
 
-- 
2.0.0.rc2

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH 5/5] lib: Separate all phrases indexed by _notmuch_message_gen_terms
  2014-06-16  2:40 ` [PATCH 5/5] lib: Separate all phrases indexed by _notmuch_message_gen_terms Austin Clements
@ 2014-06-18 21:11   ` David Bremner
  0 siblings, 0 replies; 9+ messages in thread
From: David Bremner @ 2014-06-18 21:11 UTC (permalink / raw)
  To: Austin Clements, notmuch

Austin Clements <amdragon@MIT.EDU> writes:

> This adds a 100 termpos gap between all phrases indexed by
> _notmuch_message_gen_terms.  This fixes a bug where terms from the end
> of one header and the beginning of another header could match together
> in a single phrase and a separate bug where term positions of
> un-prefixed terms overlapped.
>
> This fix only affects newly indexed messages.  Messages that are
> already indexed won't benefit from this fix without re-indexing, but
> the fix won't make things any worse for existing messages.

The series looks OK to me. It took me a little while to understand the
problem with multiple parts was that the term positions currently start
from zero for each part. If you happen to be re-rolling the series for
some other reason, maybe you could be more explicit about that. I
wouldn't bother just for that though.

d

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 0/5] Fix adjacent and overlapping termpos
  2014-06-16  2:40 [PATCH 0/5] Fix adjacent and overlapping termpos Austin Clements
                   ` (4 preceding siblings ...)
  2014-06-16  2:40 ` [PATCH 5/5] lib: Separate all phrases indexed by _notmuch_message_gen_terms Austin Clements
@ 2014-06-21 19:20 ` Tomi Ollila
  2014-06-22 10:02 ` David Bremner
  6 siblings, 0 replies; 9+ messages in thread
From: Tomi Ollila @ 2014-06-21 19:20 UTC (permalink / raw)
  To: Austin Clements, notmuch

On Mon, Jun 16 2014, Austin Clements <amdragon@MIT.EDU> wrote:

> This series fixes several related bugs that caused indexing to
> generate adjacent and overlapping term positions, which caused
> messages to incorrectly match phrase queries.  This obsoletes (and
> fixes) the known-broken tests from
> id:1400359552-10928-1-git-send-email-amdragon@mit.edu.

The code looks good to me and tests pass. +1


Tomi

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 0/5] Fix adjacent and overlapping termpos
  2014-06-16  2:40 [PATCH 0/5] Fix adjacent and overlapping termpos Austin Clements
                   ` (5 preceding siblings ...)
  2014-06-21 19:20 ` [PATCH 0/5] Fix adjacent and overlapping termpos Tomi Ollila
@ 2014-06-22 10:02 ` David Bremner
  6 siblings, 0 replies; 9+ messages in thread
From: David Bremner @ 2014-06-22 10:02 UTC (permalink / raw)
  To: Austin Clements, notmuch

Austin Clements <amdragon@MIT.EDU> writes:

> This series fixes several related bugs that caused indexing to
> generate adjacent and overlapping term positions, which caused
> messages to incorrectly match phrase queries.  This obsoletes (and
> fixes) the known-broken tests from
> id:1400359552-10928-1-git-send-email-amdragon@mit.edu.
>

pushed to release and master. Please have a quick look at 

       id:1403381426-19813-1-git-send-email-david@tethera.net

to check the corresponding NEWS

d

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2014-06-22 10:03 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-06-16  2:40 [PATCH 0/5] Fix adjacent and overlapping termpos Austin Clements
2014-06-16  2:40 ` [PATCH 1/5] test: Fix from/to search test queries Austin Clements
2014-06-16  2:40 ` [PATCH 2/5] test: Add search tests for combined name/address queries Austin Clements
2014-06-16  2:40 ` [PATCH 3/5] lib: Index name and address of from/to headers as a phrase Austin Clements
2014-06-16  2:40 ` [PATCH 4/5] test: Known-broken test for overlapping/adjacent termpos Austin Clements
2014-06-16  2:40 ` [PATCH 5/5] lib: Separate all phrases indexed by _notmuch_message_gen_terms Austin Clements
2014-06-18 21:11   ` David Bremner
2014-06-21 19:20 ` [PATCH 0/5] Fix adjacent and overlapping termpos Tomi Ollila
2014-06-22 10:02 ` David Bremner

Code repositories for project(s) associated with this public inbox

	https://yhetil.org/notmuch.git/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).