unofficial mirror of notmuch@notmuchmail.org
 help / color / mirror / code / Atom feed
From: Austin Clements <amdragon@MIT.EDU>
To: Mark Walters <markwalters1009@gmail.com>
Cc: notmuch <notmuch@notmuchmail.org>
Subject: Re: [RFC PATCH] Re: excessive thread fusing
Date: Mon, 21 Apr 2014 12:20:58 -0400	[thread overview]
Message-ID: <20140421162058.GE25817@mit.edu> (raw)
In-Reply-To: <8738h7kv2q.fsf@qmul.ac.uk>

Quoth Mark Walters on Apr 21 at  8:20 am:
> 
> >> I haven't tracked through all the logic of the existing algorithm for
> >> this case. But I don't like hearing that notmuch constructs different
> >> threads for the same messages presented in different orders. This sounds
> >> like a bug separate from what we've discussed above. 
> 
> I think I have now found this bug and it is separate from the malformed
> In-Reply-To problems.
> 
> The problem is that when we merge threads we update all the thread-ids
> of documents in the loser thread. But we don't (if I understand the code
> correctly) update dangling "metadata" references to threads which don't
> (yet) have any documents.

This exactly the problem I wrote
id:1395608456-9673-1-git-send-email-amdragon@mit.edu to test, but I
had convinced myself everything was okay because we link a message to
both its parents and all of its children.  But that's only true if you
eventually receive the linking message (which in the test I made, you
do).  In this case, you never receive the linking message, so even
though notmuch has enough information to bring the two threads
together, it doesn't.

Maybe I should create a second variant of that test where all of the
messages reference their entire heritage (rather than just their
immediate parent) and test that they're *always* in one thread
regardless of receipt order (rather than only checking once they've
all been received)?  I think that would weed out this case.

> To make this explicit consider the 2 messages 17,18 in the set. 
> 
> Message 17 has id <87wrkidfrh.fsf@pinto.chemeng.ucl.ac.uk> and has no
> references/in-reply-to so has no parents.
> 
> Message 18 has a reference to <87wrkidfrh.fsf@pinto.chemeng.ucl.ac.uk>
> and an in-reply-to to <e.fraga@ucl.ac.uk> and
> <87wrkidfrh.fsf@pinto.chemeng.ucl.ac.uk>
> 
> If you add 17 first then it gets thread-id 1 and then when you add 18 message 18 gets
> thread-id 2 as does the dangling link to the "unseen" message
> e.fraga@ucl.ac.uk, and then message 17 is moved to thread-id 2.
> 
> However, if you add 18 first then it gets thread-id 1 and the dangling
> link gets id 1. When 17 is added it gets thread-id 2, message 18 gets
> thread-id updated to 2 *but* the dangling link to e.fraga@ucl.ac.uk does
> not get updated so stays thread-id 1.
> 
> In particular when 52 comes along with its link to e.fraga@ucl.ac.uk
> then:
> 
>         In the first case it gets put in thread-id 3 and the other two
>         messages get moved into thread 3.
> 
>         In the second case, message 52 gets put in thread 3 and thread 1
>         (the dangling link) gets moved into thread 3 leaving thread 2
>         (containing messages 17 and 18) untouched.

So, there's an obvious, messy way to fix this: update the metadata
references when we do thread renumbering.  This is messy because that
data *isn't indexed*.  The only way to find the records we need to
update is to scan them all.  This isn't completely terrible because
it's a sequential scan and we could cache it in memory, but it
certainly isn't going to help notmuch new's performance.  (My database
has 6,749 of these, which takes ~1 second to scan on a cold cache,
though that's with an SSD [1]).


But let me propose an idea I've been kicking around for a while: ghost
message documents.  Rather than using user metadata for tracking these
missing messages, use regular documents with the exact same terms we
use now for message IDs and thread IDs, but with a Tghost term instead
of a Tmail term to distinguish their type.  This solves the problem
using infrastructure we already have in place, simplifies the message
linking code, and may even make it faster.  It's a schema update, but
a simple and fast one.  I think the hardest part is that things like
notmuch_database_find_message would need to distinguish ghosts and
regular messages (which may require pre-fetching the Tghost or Tmail
posting list to do efficiently).

This also sets us up to do some cool things in the future, though
they're more invasive.  If we have message-like documents for these
ghosts, we can store other message-like metadata as well.  If we store
tags on them, then we can keep tags around for deleted messages and
*reapply them* if the message comes back.  This would finally fix the
races we have now where, if a message is renamed or moved during a
notmuch new, we may think it's deleted only to reindex it with default
tags on the next run.  We could also pre-tag messages that haven't
been indexed yet, say from procmail or when sending a message.  This
could simplify or even obviate notmuch insert.  If we add message
ctimes as proposed by Dave Mazières, this would give us a place to
store and query ctimes of deleted messages (otherwise it's unclear how
you find out about deletions without a full database scan).  In
effect, the database becomes truly monotonic.

[1] Curious?
    yes n | xapian-inspect postlist.DB | \
    awk '!/Key/ {next} /Key: \\x00\\xc0thread_id_/ {N++} /Key: \\x00\\xd0/ {exit} END {print N}'

  reply	other threads:[~2014-04-21 16:21 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-04-19 12:33 excessive thread fusing David Bremner
2014-04-19 17:52 ` Eric
2014-04-19 21:04   ` Andrei POPESCU
2014-04-20 16:48     ` Austin Clements
2014-04-20 17:46       ` Austin Clements
2014-04-20  7:14 ` [RFC PATCH] " Mark Walters
     [not found]   ` <87oazwjq1e.fsf@yoom.home.cworth.org>
2014-04-20 12:03     ` Mark Walters
2014-04-21  7:20       ` Mark Walters
2014-04-21 16:20         ` Austin Clements [this message]
2022-01-01  0:26         ` David Bremner
2014-04-20 12:59     ` David Bremner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://notmuchmail.org/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20140421162058.GE25817@mit.edu \
    --to=amdragon@mit.edu \
    --cc=markwalters1009@gmail.com \
    --cc=notmuch@notmuchmail.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://yhetil.org/notmuch.git/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).