unofficial mirror of notmuch@notmuchmail.org
 help / color / mirror / code / Atom feed
* excessive thread fusing
@ 2014-04-19 12:33 David Bremner
  2014-04-19 17:52 ` Eric
  2014-04-20  7:14 ` [RFC PATCH] " Mark Walters
  0 siblings, 2 replies; 11+ messages in thread
From: David Bremner @ 2014-04-19 12:33 UTC (permalink / raw)
  To: notmuch

[-- Attachment #1: Type: text/plain, Size: 599 bytes --]


Gregor Zattler mentioned some problems with threading at 

       id:20120126004024.GA13704@shi.workgroup

After some off list discussions, I believe I have a smaller test case.

The attached maildir contains 24 messages from the org mode list.

According to notmuch, these form one thread, but I can't figure out
exactly why. It seems like the chronologically first two messages should
be a seperate thread. There are several of the infamous malformed ME-E
In-reply-to headers, but each of these messages also has a References
header; this seems to indicate a case missed by commit cf8aaafbad68.


[-- Attachment #2: test-maildir.tar.gz --]
[-- Type: application/octet-stream, Size: 23367 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: excessive thread fusing
  2014-04-19 12:33 excessive thread fusing David Bremner
@ 2014-04-19 17:52 ` Eric
  2014-04-19 21:04   ` Andrei POPESCU
  2014-04-20  7:14 ` [RFC PATCH] " Mark Walters
  1 sibling, 1 reply; 11+ messages in thread
From: Eric @ 2014-04-19 17:52 UTC (permalink / raw)
  To: David Bremner, notmuch

On Sat, 19 Apr 2014 21:33:36 +0900, David Bremner <david@tethera.net> wrote:
> 
> Gregor Zattler mentioned some problems with threading at 
> 
>        id:20120126004024.GA13704@shi.workgroup
> 
> After some off list discussions, I believe I have a smaller test case.
> 
> The attached maildir contains 24 messages from the org mode list.
> 
> According to notmuch, these form one thread, but I can't figure out
> exactly why. It seems like the chronologically first two messages should
> be a seperate thread. There are several of the infamous malformed ME-E
> In-reply-to headers, but each of these messages also has a References
> header; this seems to indicate a case missed by commit cf8aaafbad68.
> 

This may not actually be any help, but both hypermail and mhonarc agree
that two messages form a separate thread from the rest. I believe that
the latter, at least, is the JWZ algorithm.

Eric
-- 
ms fnd in a lbry

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: excessive thread fusing
  2014-04-19 17:52 ` Eric
@ 2014-04-19 21:04   ` Andrei POPESCU
  2014-04-20 16:48     ` Austin Clements
  0 siblings, 1 reply; 11+ messages in thread
From: Andrei POPESCU @ 2014-04-19 21:04 UTC (permalink / raw)
  To: notmuch

[-- Attachment #1: Type: text/plain, Size: 419 bytes --]

On Sb, 19 apr 14, 18:52:02, Eric wrote:
> 
> This may not actually be any help, but both hypermail and mhonarc agree
> that two messages form a separate thread from the rest. I believe that
> the latter, at least, is the JWZ algorithm.

mutt concurs.

Kind regards,
Andrei
-- 
If you can't explain it simply, you don't understand it well enough.
(Albert Einstein)
http://nuvreauspam.ro/gpg-transition.txt

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [RFC PATCH] Re: excessive thread fusing
  2014-04-19 12:33 excessive thread fusing David Bremner
  2014-04-19 17:52 ` Eric
@ 2014-04-20  7:14 ` Mark Walters
       [not found]   ` <87oazwjq1e.fsf@yoom.home.cworth.org>
  1 sibling, 1 reply; 11+ messages in thread
From: Mark Walters @ 2014-04-20  7:14 UTC (permalink / raw)
  To: David Bremner, notmuch


On Sat, 19 Apr 2014, David Bremner <david@tethera.net> wrote:
> Gregor Zattler mentioned some problems with threading at 
>
>        id:20120126004024.GA13704@shi.workgroup
>
> After some off list discussions, I believe I have a smaller test case.
>
> The attached maildir contains 24 messages from the org mode list.
>
> According to notmuch, these form one thread, but I can't figure out
> exactly why. It seems like the chronologically first two messages should
> be a seperate thread. There are several of the infamous malformed ME-E
> In-reply-to headers, but each of these messages also has a References
> header; this seems to indicate a case missed by commit cf8aaafbad68.

Hi 

I have done dome debugging of this. There is a patch below which fixes
this test case but who knows what it breaks! Please DO NOT apply unless
someone who knows this code says it's OK.

First, the bug is quite sensitive. The attached 24 messages are numbered
and i will use the last two digits to refer to them (ie the 2 digits are
the ?? in 1397885606.0002??.mbox:2,). The number range from 17-52; 17
and 18 should be one thread and the rest a different thread.

1) If you add all messages you get one thread. 
2) If you add all apart from 52 you get 2 threads. However, then adding
52 still gives two threads.
3) If you add 18 and then 52 you get 1 thread.
4) If you add 17 and 18 then 52 you get 2 threads.

I think notmuch will use inode sort and since the tar file contains
these three files in the order 18 52 17 we get cases 1 and 2 above.

I put some debug stuff in _notmuch_database_link_message_to_parents and
I think that the problem comes from the call to parse_references on line
1767 which adds the malformed in-reply-to header to the hash table, so
this malformed line gets added as a potential parent. 

As a clear example that I don't understand this code I don't know why
this no longer causes a problem if message 17 gets added too.

Best wishes

Mark

---
 lib/database.cc |   21 ++++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/lib/database.cc b/lib/database.cc
index 1efb14d..373a255 100644
--- a/lib/database.cc
+++ b/lib/database.cc
@@ -1763,20 +1763,23 @@ _notmuch_database_link_message_to_parents (notmuch_database_t *notmuch,
 					    this_message_id,
 					    parents, refs);
 
-    in_reply_to = notmuch_message_file_get_header (message_file, "in-reply-to");
-    in_reply_to_message_id = parse_references (message,
-					       this_message_id,
-					       parents, in_reply_to);
-
     /* For the parent of this message, use the last message ID of the
      * References header, if available.  If not, fall back to the
-     * first message ID in the In-Reply-To header. */
+     * first message ID in the In-Reply-To header. We only parse the
+     * In-Reply-To header if we need to as otherwise we might
+     * contanimate the hash table if it is malformed. */
     if (last_ref_message_id) {
         _notmuch_message_add_term (message, "replyto",
                                    last_ref_message_id);
-    } else if (in_reply_to_message_id) {
-	_notmuch_message_add_term (message, "replyto",
-			     in_reply_to_message_id);
+    } else {
+	in_reply_to = notmuch_message_file_get_header (message_file, "in-reply-to");
+	in_reply_to_message_id = parse_references (message,
+						   this_message_id,
+						   parents, in_reply_to);
+	if (in_reply_to_message_id) {
+	    _notmuch_message_add_term (message, "replyto",
+				       in_reply_to_message_id);
+	}
     }
 
     keys = g_hash_table_get_keys (parents);
-- 
1.7.10.4

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH] Re: excessive thread fusing
       [not found]   ` <87oazwjq1e.fsf@yoom.home.cworth.org>
@ 2014-04-20 12:03     ` Mark Walters
  2014-04-21  7:20       ` Mark Walters
  2014-04-20 12:59     ` David Bremner
  1 sibling, 1 reply; 11+ messages in thread
From: Mark Walters @ 2014-04-20 12:03 UTC (permalink / raw)
  To: Carl Worth, David Bremner, notmuch


On Sun, 20 Apr 2014, Carl Worth <cworth@cworth.org> wrote:
> Mark Walters <markwalters1009@gmail.com> writes:
>> I have done dome debugging of this.
>
> Thanks for looking closely, Mark!
>
>> There is a patch below which fixes this test case but who knows what
>> it breaks! Please DO NOT apply unless someone who knows this code says
>> it's OK.
>
> I wrote much of the original code being patched here, so hopefully I
> understand it and can say something useful.
>
> I agree that the patch should not be applied. I don't like to see one
> piece of code not trusting another in the same code base. If the
> parse_references() function doesn't deal well with a malformed header,
> then we should fix it, not step around it.

>
> Meanwhile, not treating all potential referenced message IDs
> consistently could definitely make the notmuch algorithm more fragile
> and sensitive to the order of message indexing, etc. So let's not do
> that.

I agree. This bug first came up in id:874nvcekjk.fsf@qmul.ac.uk; I think
that got mostly fixed by cf8aaafbad68
(id:1361836225-17279-1-git-send-email-aaronecay@gmail.com and related
thread) so we may want to check whether that change is still wanted if
we fix the actual bug.

> Instead, let's track down and fix the actual bug.
>
> Thanks for the idea of using two-digit names for these messages. That
> makes it much easier to inspect the relevant headers.
>
> Below, I've grepped out the actual References and In-Reply-To headers
> From the messages, and then simply substituted minimal, and
> easy-to-understand values for the message IDs.
>
> With these minimally modified headers, it's easy to manually inspect the
> relationships and see that messages 17 and 18 belong in one thread, and
> messages 32-52 belong in a separate thread.
>
> It's also quite easy to see the potential source of the bug. The
> In-Reply-To headers for messages 18, 32, and 52 all share a common
> string (an email address) formatted to look like a message-id,
> "<e.fraga@ucl.ac.uk>". If notmuch looks at those headers, and treats
> that string as a message-id, then all of theses messages will be
> connected into a single thread.
>
> And since that's the reported behavior, it seems likely that
> "<e.fraga@ucl.ac.uk>" is the cause of this bug.
>
>> I put some debug stuff in _notmuch_database_link_message_to_parents and
>> I think that the problem comes from the call to parse_references on line
>> 1767 which adds the malformed in-reply-to header to the hash table, so
>> this malformed line gets added as a potential parent. 
>
> Am I correct that your debugging showed that "<e.fraga@ucl.ac.uk>" is
> being added to the hash table?

Yes that is correct.

> My inspection of _parse_references() and parse_message_id() suggests
> that that's exactly what notmuch is doing, (treating both of the
> angle-bracketed portions ("<e.fraga@ucl.ac.uk>" as well as the actual
> message-ID, "<ID17>" or "<ID31>" or "<ID39>") as message IDs.
>
> So it seems like we need a new _parse_in_reply_to() function to use in
> place of _parse_references() and the new function will need a better
> heuristic for dealing with the unpredictability of In-Reply-To.
>
> The only real reason that we are trying to grab multiple message ID
> values from an In-Reply-To header is that RFC 2822 explicitly allows
> that, (to support a message simultaneously replying to multiple
> messages). I don't believe that that's common, but we might as well
> support it. At the same time, RFC 2822 also explicitly specifies that
> the In-Reply-To header will consist of nothing but message IDs.
>
> So perhaps the heuristic here could be to notice any characters outside
> of angle brackets, (like "Message" in the headers below), and in that
> case go to a strictly "not RFC 2822" mode and look for exactly one
> message ID. At that point, JWZ would recommend "the first <>-bracketed
> text found therein", but that would give precisely the wrong answer in
> this particular case. Here the correct Message ID appears in the last
> <>-bracketed text. I have not surveyed a large email corpus to determine
> how often "last <>-bracketed text" would fail as a heuristic.
>
> Another idea would be to trigger specifically on common forms. Judging
> From the samples in this particular thread, it seems like a workable
> heuristic would be:
>
> 	If the In-Reply-To header begins with '<':
>
> 		Parse that initial portion as a message ID
>
> 	Else if it ends with '>':
>
> 		Parse that final portion as a message ID
>
> 	Else
>
> 		Ignore this garbage-valued header.
>
> That's probably the best and most reliably thing to do here.
>
> Does anyone have any better ideas?

Is there a case coming before all the above: if the In-Reply-To header
is correctly formed then parse as we do currently? (You sort of suggest
so above but I just wanted to check)

>> As a clear example that I don't understand this code I don't know why
>> this no longer causes a problem if message 17 gets added too.
>
> I wanted to test my own knowledge of the code to see if I could explain
> this. But I didn't precisely follow your explanation of the behavior you
> saw. In cases (1) and (2) of your description, what order are you using
> to "add all messages" or "add all apart from 52"?

I just untarred the tar file David posted. Then the messages get added
in the following order:

45 39 47 33 31 18 42 51 41 46 37 44 35 36 34 49 40 48 38 52 17 50 32 43

which is the same as the order in the tar file. (I think this is notmuch
using inode based sort as it has not seen the directory before)

In Case 2 I started with a fresh untar; then moved message 52 out of the
Maildir; ran notmuch new,  then moved message 52 back
into the the Maildir tree and ran notmuch new again.

> Then, for cases (3) and (4), what is done before adding the messages
> mentioned in these cases? Add all other messages? Again, in what order?

In case 3 I started with a fresh untar. Moved all the message except 18
elsewhere. ran notmuch new. moved message 52 back and ran notmuch new.

In have checked case 4 carefully adding messages 1 at a time and running
notmuch new between each addition.

If I add 18 17 52 I get 2 threads.
If I add 17 18 52 I get 1 thread

> I haven't tracked through all the logic of the existing algorithm for
> this case. But I don't like hearing that notmuch constructs different
> threads for the same messages presented in different orders. This sounds
> like a bug separate from what we've discussed above. 

I agree but I don't know the logic well enough to be sure.

Best wishes

Mark

>
> -Carl
>
> 18:References: <ID17>
> 32:References: <ID31>
> 33:References: <ID31> <ID32>
> 34:References: <ID31> <ID32> <ID33>
> 35:References: <ID31> <ID32> <ID33>
> 36:References: <ID31> <ID32> <ID33> <ID35>
> 37:References: <ID31> <ID32> <ID33> <ID35> <ID36>
> 38:References: <ID31> <ID32> <ID33> <ID35> <ID36> <ID37>
> 39:References: <ID31> <ID32>
> 40:References: <ID31> <ID32> <ID39>
> 41:References: <ID31> <ID32> <ID39> <ID40>
> 42:References: <ID31> <ID32> <ID39> <ID40> <ID41>
> 43:References: <ID31> <ID32> <ID39> <ID40> <ID41> <ID42>
> 44:References: <ID31> <ID32> <ID39> <ID40> <ID41> <ID42>
> 45:References: <ID31> <ID32> <ID39> <ID40>
> 46:References: <ID31> <ID32> <ID39> <ID40> <ID45>
> 47:References: <ID31> <ID32> <ID39> <ID40> <ID45> <ID46>
> 48:References: <ID31> <ID32> <ID39> <ID40> <ID45> <ID46> <ID47>
> 49:References: <ID31> <ID32> <ID39> <ID40> <ID45> <ID46> <ID47> <ID48>
> 50:References: <ID31> <ID32> <ID39> <ID40> <ID45> <ID46> <ID47> <ID48> <ID49>
> 51:References: <ID31> <ID32> <ID39> <ID40> <ID45> <ID46> <ID47> <ID48> <ID49> <ID50>
> 52:References: <ID31> <ID32> <ID39>
>
> 18:In-reply-to: Message from Eric S Fraga <e.fraga@ucl.ac.uk> of "Tue, 01 Mar 2011 15:25:38 GMT." <ID17>
> 32:In-Reply-To: Message from Eric S Fraga <e.fraga@ucl.ac.uk> of "Thu, 10 Mar 2011 21:00:16 GMT." <ID31>
> 33:In-Reply-To: <ID32> (Nick Dokos's message of "Thu, 10 Mar 2011 18:06:33 -0500")
> 34:In-Reply-To: <ID33>
> 35:In-Reply-To: <ID33>
> 36:In-Reply-To: <ID35> (Carsten Dominik's message of "Sun, 13 Mar 2011 08:39:13 +0100")
> 37:In-Reply-To: <ID36>
> 38:In-Reply-To: <ID37> (Carsten Dominik's message of "Mon, 14 Mar 2011 08:40:33 +0100")
> 39:In-Reply-To: <ID32> (Nick Dokos's message of "Thu, 10 Mar 2011 18:06:33 -0500")
> 40:In-Reply-To: <ID39>
> 41:In-Reply-To: <ID40> (Carsten Dominik's message of "Fri, 11 Mar 2011 12:36:13 +0100")
> 42:In-Reply-To: <ID41>
> 43:In-Reply-To: <ID42>
> 44:In-Reply-To: <ID42>
> 45:In-reply-to: Message from Carsten Dominik <carsten.dominik@gmail.com> of "Fri, 11 Mar 2011 12:36:13 +0100." <ID40>
> 46:In-Reply-To: <ID45>
> 47:In-reply-to: Message from Carsten Dominik <carsten.dominik@gmail.com> of "Mon, 14 Mar 2011 11:21:36 BST." <ID46>
> 48:In-Reply-To: <ID47>
> 49:In-reply-to: Message from Carsten Dominik <carsten.dominik@gmail.com> of "Mon, 14 Mar 2011 18:02:54 BST." <ID48>
> 51:In-Reply-To: <ID50>
> 52:In-reply-to: Message from Eric S Fraga <e.fraga@ucl.ac.uk> of "Fri, 11 Mar 2011 08:47:58 GMT." <ID39>
>
> -- 
> carl.d.worth@intel.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH] Re: excessive thread fusing
       [not found]   ` <87oazwjq1e.fsf@yoom.home.cworth.org>
  2014-04-20 12:03     ` Mark Walters
@ 2014-04-20 12:59     ` David Bremner
  1 sibling, 0 replies; 11+ messages in thread
From: David Bremner @ 2014-04-20 12:59 UTC (permalink / raw)
  To: Carl Worth, Mark Walters, notmuch

Carl Worth <cworth@cworth.org> writes:
>
> Another idea would be to trigger specifically on common forms. Judging
> From the samples in this particular thread, it seems like a workable
> heuristic would be:
>
> 	If the In-Reply-To header begins with '<':
>
> 		Parse that initial portion as a message ID
>
> 	Else if it ends with '>':
>
> 		Parse that final portion as a message ID
>
> 	Else
>
> 		Ignore this garbage-valued header.
>

using the hacky script below, I scanned my own mail collection of about
300k messages. I can make the following observations

- I have some RFC compliant in-reply-to's with multiple ids
- I have have a non-trivial number of Message from $NAME <address> of $date <id>
- I didn't see any cases where using the last angle bracketed thing
  would fail.
- I did see some some cases where the header starts with '<' but the
  matching '>' was missing
- I also noticed some rfc2047 encoding of in-reply-to headers.


######################################################################
# hacky script follows
dir=$1
echo Scanning $dir

tempdir=$(mktemp -d)
echo Writing to ${tempdir}

find $dir -exec sh -c "formail -c -xIn-reply-to < {}" \; \
  > ${tempdir}/ids

sed  -e 's/\t/ /' -e 's/   */ /g' -e 's/<[^ ]*>/<id>/g' -e 's/(.*)/(comment)/' < ${tempdir}/ids | sort | uniq | tee ${tempdir}/report

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: excessive thread fusing
  2014-04-19 21:04   ` Andrei POPESCU
@ 2014-04-20 16:48     ` Austin Clements
  2014-04-20 17:46       ` Austin Clements
  0 siblings, 1 reply; 11+ messages in thread
From: Austin Clements @ 2014-04-20 16:48 UTC (permalink / raw)
  To: notmuch

Quoth Andrei POPESCU on Apr 20 at 12:04 am:
> On Sb, 19 apr 14, 18:52:02, Eric wrote:
> > 
> > This may not actually be any help, but both hypermail and mhonarc agree
> > that two messages form a separate thread from the rest. I believe that
> > the latter, at least, is the JWZ algorithm.
> 
> mutt concurs.

Can anyone explain why JWZ *doesn't* have the same problem?  I don't
see how this heuristic doesn't doom it to the same fate:

  The References field is populated from the ``References'' and/or
  ``In-Reply-To'' headers. If both headers exist, take the first thing
  in the In-Reply-To header that looks like a Message-ID, and append
  it to the References header.

Given this, even considering only messages 18 and 52 (which "should"
be in different threads), JWZ should find the common "parent"
e.fraga@ucl.ac.uk and link them in to the same thread:

Add 18 (step 1)
- The combined "references" list is <ID17> <e.fraga@ucl.ac.uk>
- Creates and links containers 17 <- e.fraga@ucl.ac.uk <- 18 where the
  first two are empty

Add 52 (step 1)
- The combined "references" list is <ID31> <ID32> <ID39>
  <e.fraga@ucl.ac.uk>
- Creates and links containers 31 <- 32 <- 39
- Also considers container e.fraga@ucl.ac.uk, but this is already
  linked, so it doesn't change it
- Creates container 52 and links e.fraga@ucl.ac.uk <- 52 (step 1C)

18 and 52 will later get promoted over their empty parent (step 4),
but will remain in the same thread.

What am I missing?  Or are these other MUAs not using pure JWZ?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: excessive thread fusing
  2014-04-20 16:48     ` Austin Clements
@ 2014-04-20 17:46       ` Austin Clements
  0 siblings, 0 replies; 11+ messages in thread
From: Austin Clements @ 2014-04-20 17:46 UTC (permalink / raw)
  To: notmuch

Quoth myself on Apr 20 at 12:48 pm:
> Quoth Andrei POPESCU on Apr 20 at 12:04 am:
> > On Sb, 19 apr 14, 18:52:02, Eric wrote:
> > > 
> > > This may not actually be any help, but both hypermail and mhonarc agree
> > > that two messages form a separate thread from the rest. I believe that
> > > the latter, at least, is the JWZ algorithm.
> > 
> > mutt concurs.
> 
> Can anyone explain why JWZ *doesn't* have the same problem?  I don't
> see how this heuristic doesn't doom it to the same fate:
> 
>   The References field is populated from the ``References'' and/or
>   ``In-Reply-To'' headers. If both headers exist, take the first thing
>   in the In-Reply-To header that looks like a Message-ID, and append
>   it to the References header.
> 
> Given this, even considering only messages 18 and 52 (which "should"
> be in different threads), JWZ should find the common "parent"
> e.fraga@ucl.ac.uk and link them in to the same thread:
> 
> Add 18 (step 1)
> - The combined "references" list is <ID17> <e.fraga@ucl.ac.uk>
> - Creates and links containers 17 <- e.fraga@ucl.ac.uk <- 18 where the
>   first two are empty
> 
> Add 52 (step 1)
> - The combined "references" list is <ID31> <ID32> <ID39>
>   <e.fraga@ucl.ac.uk>
> - Creates and links containers 31 <- 32 <- 39
> - Also considers container e.fraga@ucl.ac.uk, but this is already
>   linked, so it doesn't change it
> - Creates container 52 and links e.fraga@ucl.ac.uk <- 52 (step 1C)
> 
> 18 and 52 will later get promoted over their empty parent (step 4),
> but will remain in the same thread.
> 
> What am I missing?  Or are these other MUAs not using pure JWZ?

I dug in to mutt's mutt_sort_threads a bit.  It's not using JWZ,
though it's something similar.  The most salient thing may be how it
handles in-reply-to and references:

1. If a message has both in-reply-to and references, the parent chain
   is the *last* in-reply-to ID and then the references from right to
   left (skipping the last reference ID if it's the same as the last
   in-reply-to ID).  (See also mutt_parse_references.)
2. If a message has only in-reply-to, the parent chain is *all* of the
   IDs in in-reply-to *from right to left* (e.g., the right-most one
   is the immediate parent).
3. If a message has only references, the parent chain is that, from
   right to left.

Like JWZ, mutt creates and links together "empty containers" as it
scans the parent chain towards the root, though unlike JWZ it stops
when it finds a non-empty container or a container that already has a
parent.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH] Re: excessive thread fusing
  2014-04-20 12:03     ` Mark Walters
@ 2014-04-21  7:20       ` Mark Walters
  2014-04-21 16:20         ` Austin Clements
  2022-01-01  0:26         ` David Bremner
  0 siblings, 2 replies; 11+ messages in thread
From: Mark Walters @ 2014-04-21  7:20 UTC (permalink / raw)
  To: Carl Worth, David Bremner, notmuch


>> I haven't tracked through all the logic of the existing algorithm for
>> this case. But I don't like hearing that notmuch constructs different
>> threads for the same messages presented in different orders. This sounds
>> like a bug separate from what we've discussed above. 

I think I have now found this bug and it is separate from the malformed
In-Reply-To problems.

The problem is that when we merge threads we update all the thread-ids
of documents in the loser thread. But we don't (if I understand the code
correctly) update dangling "metadata" references to threads which don't
(yet) have any documents.

To make this explicit consider the 2 messages 17,18 in the set. 

Message 17 has id <87wrkidfrh.fsf@pinto.chemeng.ucl.ac.uk> and has no
references/in-reply-to so has no parents.

Message 18 has a reference to <87wrkidfrh.fsf@pinto.chemeng.ucl.ac.uk>
and an in-reply-to to <e.fraga@ucl.ac.uk> and
<87wrkidfrh.fsf@pinto.chemeng.ucl.ac.uk>

If you add 17 first then it gets thread-id 1 and then when you add 18 message 18 gets
thread-id 2 as does the dangling link to the "unseen" message
e.fraga@ucl.ac.uk, and then message 17 is moved to thread-id 2.

However, if you add 18 first then it gets thread-id 1 and the dangling
link gets id 1. When 17 is added it gets thread-id 2, message 18 gets
thread-id updated to 2 *but* the dangling link to e.fraga@ucl.ac.uk does
not get updated so stays thread-id 1.

In particular when 52 comes along with its link to e.fraga@ucl.ac.uk
then:

        In the first case it gets put in thread-id 3 and the other two
        messages get moved into thread 3.

        In the second case, message 52 gets put in thread 3 and thread 1
        (the dangling link) gets moved into thread 3 leaving thread 2
        (containing messages 17 and 18) untouched.

Best wishes

Mark

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH] Re: excessive thread fusing
  2014-04-21  7:20       ` Mark Walters
@ 2014-04-21 16:20         ` Austin Clements
  2022-01-01  0:26         ` David Bremner
  1 sibling, 0 replies; 11+ messages in thread
From: Austin Clements @ 2014-04-21 16:20 UTC (permalink / raw)
  To: Mark Walters; +Cc: notmuch

Quoth Mark Walters on Apr 21 at  8:20 am:
> 
> >> I haven't tracked through all the logic of the existing algorithm for
> >> this case. But I don't like hearing that notmuch constructs different
> >> threads for the same messages presented in different orders. This sounds
> >> like a bug separate from what we've discussed above. 
> 
> I think I have now found this bug and it is separate from the malformed
> In-Reply-To problems.
> 
> The problem is that when we merge threads we update all the thread-ids
> of documents in the loser thread. But we don't (if I understand the code
> correctly) update dangling "metadata" references to threads which don't
> (yet) have any documents.

This exactly the problem I wrote
id:1395608456-9673-1-git-send-email-amdragon@mit.edu to test, but I
had convinced myself everything was okay because we link a message to
both its parents and all of its children.  But that's only true if you
eventually receive the linking message (which in the test I made, you
do).  In this case, you never receive the linking message, so even
though notmuch has enough information to bring the two threads
together, it doesn't.

Maybe I should create a second variant of that test where all of the
messages reference their entire heritage (rather than just their
immediate parent) and test that they're *always* in one thread
regardless of receipt order (rather than only checking once they've
all been received)?  I think that would weed out this case.

> To make this explicit consider the 2 messages 17,18 in the set. 
> 
> Message 17 has id <87wrkidfrh.fsf@pinto.chemeng.ucl.ac.uk> and has no
> references/in-reply-to so has no parents.
> 
> Message 18 has a reference to <87wrkidfrh.fsf@pinto.chemeng.ucl.ac.uk>
> and an in-reply-to to <e.fraga@ucl.ac.uk> and
> <87wrkidfrh.fsf@pinto.chemeng.ucl.ac.uk>
> 
> If you add 17 first then it gets thread-id 1 and then when you add 18 message 18 gets
> thread-id 2 as does the dangling link to the "unseen" message
> e.fraga@ucl.ac.uk, and then message 17 is moved to thread-id 2.
> 
> However, if you add 18 first then it gets thread-id 1 and the dangling
> link gets id 1. When 17 is added it gets thread-id 2, message 18 gets
> thread-id updated to 2 *but* the dangling link to e.fraga@ucl.ac.uk does
> not get updated so stays thread-id 1.
> 
> In particular when 52 comes along with its link to e.fraga@ucl.ac.uk
> then:
> 
>         In the first case it gets put in thread-id 3 and the other two
>         messages get moved into thread 3.
> 
>         In the second case, message 52 gets put in thread 3 and thread 1
>         (the dangling link) gets moved into thread 3 leaving thread 2
>         (containing messages 17 and 18) untouched.

So, there's an obvious, messy way to fix this: update the metadata
references when we do thread renumbering.  This is messy because that
data *isn't indexed*.  The only way to find the records we need to
update is to scan them all.  This isn't completely terrible because
it's a sequential scan and we could cache it in memory, but it
certainly isn't going to help notmuch new's performance.  (My database
has 6,749 of these, which takes ~1 second to scan on a cold cache,
though that's with an SSD [1]).


But let me propose an idea I've been kicking around for a while: ghost
message documents.  Rather than using user metadata for tracking these
missing messages, use regular documents with the exact same terms we
use now for message IDs and thread IDs, but with a Tghost term instead
of a Tmail term to distinguish their type.  This solves the problem
using infrastructure we already have in place, simplifies the message
linking code, and may even make it faster.  It's a schema update, but
a simple and fast one.  I think the hardest part is that things like
notmuch_database_find_message would need to distinguish ghosts and
regular messages (which may require pre-fetching the Tghost or Tmail
posting list to do efficiently).

This also sets us up to do some cool things in the future, though
they're more invasive.  If we have message-like documents for these
ghosts, we can store other message-like metadata as well.  If we store
tags on them, then we can keep tags around for deleted messages and
*reapply them* if the message comes back.  This would finally fix the
races we have now where, if a message is renamed or moved during a
notmuch new, we may think it's deleted only to reindex it with default
tags on the next run.  We could also pre-tag messages that haven't
been indexed yet, say from procmail or when sending a message.  This
could simplify or even obviate notmuch insert.  If we add message
ctimes as proposed by Dave Mazières, this would give us a place to
store and query ctimes of deleted messages (otherwise it's unclear how
you find out about deletions without a full database scan).  In
effect, the database becomes truly monotonic.

[1] Curious?
    yes n | xapian-inspect postlist.DB | \
    awk '!/Key/ {next} /Key: \\x00\\xc0thread_id_/ {N++} /Key: \\x00\\xd0/ {exit} END {print N}'

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH] Re: excessive thread fusing
  2014-04-21  7:20       ` Mark Walters
  2014-04-21 16:20         ` Austin Clements
@ 2022-01-01  0:26         ` David Bremner
  1 sibling, 0 replies; 11+ messages in thread
From: David Bremner @ 2022-01-01  0:26 UTC (permalink / raw)
  To: Mark Walters, Carl Worth, notmuch

Mark Walters <markwalters1009@gmail.com> writes:

>>> I haven't tracked through all the logic of the existing algorithm for
>>> this case. But I don't like hearing that notmuch constructs different
>>> threads for the same messages presented in different orders. This sounds
>>> like a bug separate from what we've discussed above. 
>
> I think I have now found this bug and it is separate from the malformed
> In-Reply-To problems.
>
> The problem is that when we merge threads we update all the thread-ids
> of documents in the loser thread. But we don't (if I understand the code
> correctly) update dangling "metadata" references to threads which don't
> (yet) have any documents.

This bug seems fixed by the introduction of ghost messages discussed
later in the thread, so marking it fixed. The question of malformed
In-Reply-To headers (and whether there is a bug handling those) is
discussed in the thread id:20211226161716.2079457-1-david@tethera.net.

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2022-01-01  0:26 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-04-19 12:33 excessive thread fusing David Bremner
2014-04-19 17:52 ` Eric
2014-04-19 21:04   ` Andrei POPESCU
2014-04-20 16:48     ` Austin Clements
2014-04-20 17:46       ` Austin Clements
2014-04-20  7:14 ` [RFC PATCH] " Mark Walters
     [not found]   ` <87oazwjq1e.fsf@yoom.home.cworth.org>
2014-04-20 12:03     ` Mark Walters
2014-04-21  7:20       ` Mark Walters
2014-04-21 16:20         ` Austin Clements
2022-01-01  0:26         ` David Bremner
2014-04-20 12:59     ` David Bremner

Code repositories for project(s) associated with this inbox:

	notmuch.git.git (no URL configured)

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).