From: Eirik Byrkjeflot Anonsen <eirik@eirikba.org>
To: notmuch@notmuchmail.org
Subject: Re: Automatic suppression of non-duplicate messages
Date: Sun, 04 Nov 2012 11:06:18 +0100
Message-ID: <87wqy1u1gl.fsf@star.eba>
In-Reply-To: <87390qxvb4.fsf@maritornes.cs.unb.ca> (David Bremner's message of "Sat, 03 Nov 2012 16:53:19 -0400")

David Bremner <david@tethera.net> writes:

> Eirik Byrkjeflot Anonsen <eirik@eirikba.org> writes:
>
>> That's not what I see.  If I search for a term that only appears in
>> one of the "copies", none of the copies are included in the search
>> result.
>
> The offending code is at line 1813 of lib/database.cc; the message is
> only indexed if the message-id is new.
>
> It might be sensible to move _notmuch_message_index_file into the other
> branch of the if, but even if that works fine, something more
> sophisticated is needed for the call to
> __notmuch_message_set_header_values; the invariant that each message has
> a single subject seems reasonable.

Hmm, depends.  Assuming indexing is intended to be used for searching,
one might want to search for something that occurs in one subject but
not the other.  In practice I doubt it matters.


> Offhand I'm not sure of a good method of automatically deciding what is
> the same message (with e.g. headers and footer text added by a mailing
> list).

I don't think the real problem here is the duplicate detection algorithm
itself.  It is rather that notmuch forces a particular duplicate
detection algorithm on its users.  Duplicate detection should really be
delegated to a different application, thus allowing people to experiment
with whatever algorithm works best for them.  (Just like notmuch
delegates the choice of initial tags on messages to an external
application.)

But first notmuch must be modified so it can sensibly treat multiple
instances having the same message-id as separate messages.  That seems
to me to be the hard part.  (And some way for external applications to
join and split copies, of course.)




However, if you want an algorithm that is likely to get rid of most
duplicates while keeping most non-duplicates separate, here's a quick
suggestion:


Just to clarify: The goal is to suppress most copies of the same message
while never suppressing an instance of a different message.  It
isn't important if a few duplicate messages make it through, but it is
imperative that no "real" message is dropped.

To check whether two instances are duplicates, I suspect something like
this algorithm would be "good enough":

- Message-Id must be the same.  This isn't actually necessary, but it
  makes sense to require it anyway.

- From and Date must be the same.  These form important context that may
  change the meaning of the message (e.g. "me too" depends heavily on
  From, and "let's meet tomorrow" depends heavily on Date).  (Are there
  more context-supplying headers we should worry about?)

- If Subject and body are also the same, the instances are duplicates.

- Otherwise, if neither of the messages comes from a mailing list,
  they're probably not duplicates.

- Otherwise, grab a few other (recent) mails from the same mailing list.
  If all the bodies end with the same text, ignore that text when
  comparing the bodies.

- For the Subject, again use a few other (recent) mails from the same
  mailing list for comparison.  But this time only look for one of the
  well-known common patterns.  If all the mails match the same
  pattern, ignore that pattern when comparing the Subject.

- For both of the above, it would be good to pick messages from
  different threads, to avoid accidental similarities.  I suspect this
  is more important for subjects than bodies, though.

- Also, leading and trailing whitespace should probably be dropped.

- (Some other transformations may make sense, such as reflowing text or
  converting between character sets.  In practice I doubt that will make
  much of a difference.)

- If the "canonicalized" body and Subject are the same, the messages are
  duplicates.  At least there's now pretty much no chance that there is
  anything interesting that will be missed by dropping one of the
  messages.


(I'm assuming that identifying mailing lists is usually
straightforward, e.g. using the List-Id header.)
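
The checks listed above could be sketched roughly as follows in Python,
using only the standard email module.  This is an illustration under
several assumptions, not anything notmuch does: the function names are
made up for the example, messages are assumed to be single-part plain
text, the only "well-known" Subject pattern recognised is a bracketed
"[tag]", and the caller is assumed to supply a few recent mails from
the same list (ideally from different threads) as reference_mails.

```python
import re
from email import message_from_string

# Assumption: the only "well-known" Subject pattern handled is "[tag]".
TAG_RE = re.compile(r"\[[^\]]+\]")


def common_footer(bodies):
    """Return the trailing lines shared by every body (empty list if none)."""
    split = [b.strip().splitlines() for b in bodies]
    footer = []
    while split and all(len(lines) > len(footer) for lines in split):
        candidates = {lines[-1 - len(footer)] for lines in split}
        if len(candidates) != 1:
            break
        footer.insert(0, candidates.pop())
    return footer


def common_subject_tag(subjects):
    """Return the "[tag]" found in every subject, or None if they differ."""
    tags = set()
    for subject in subjects:
        match = TAG_RE.search(subject)
        tags.add(match.group(0) if match else None)
    return tags.pop() if len(tags) == 1 else None


def canonical_body(body, footer):
    """Drop surrounding whitespace and the list footer, if present."""
    lines = body.strip().splitlines()
    if footer and lines[-len(footer):] == footer:
        lines = lines[:-len(footer)]
    return "\n".join(lines).strip()


def canonical_subject(subject, tag):
    """Drop the list tag (first occurrence) and normalise whitespace."""
    if tag:
        subject = subject.replace(tag, "", 1)
    return " ".join(subject.split())


def is_duplicate(a, b, reference_mails=()):
    """Decide whether two parsed messages are copies of the same mail."""
    # Message-Id, From and Date must all be identical.
    for header in ("Message-ID", "From", "Date"):
        if a.get(header) != b.get(header):
            return False
    subj_a, subj_b = a.get("Subject", ""), b.get("Subject", "")
    body_a, body_b = a.get_payload(), b.get_payload()
    if subj_a == subj_b and body_a == body_b:
        return True
    # Without a mailing list involved, differing copies stay separate.
    if a.get("List-Id") is None and b.get("List-Id") is None:
        return False
    footer = common_footer([m.get_payload() for m in reference_mails])
    tag = common_subject_tag([m.get("Subject", "") for m in reference_mails])
    return (canonical_subject(subj_a, tag) == canonical_subject(subj_b, tag)
            and canonical_body(body_a, footer) == canonical_body(body_b, footer))
```

With no reference mails, nothing is stripped, so the function falls
back to exact comparison and errs on the side of keeping both copies.
Reflowing and character set conversion are left out entirely.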

eirik

