unofficial mirror of notmuch@notmuchmail.org
* Automatic suppression of non-duplicate messages
@ 2012-11-03 10:17 Eirik Byrkjeflot Anonsen
  2012-11-03 20:53 ` David Bremner
  0 siblings, 1 reply; 6+ messages in thread
From: Eirik Byrkjeflot Anonsen @ 2012-11-03 10:17 UTC (permalink / raw)
  To: notmuch

As has been mentioned a few times before, notmuch chooses to silently
drop any message that has the same message-id as an already-seen
message.

In another thread, Austin Clements said:

> notmuch tracks all copies of a message, but its output generally shows
> messages, rather than files, so you see a message only once regardless
> of how many copies there are in the file system.

That's not what I see.  If I search for a term that only appears in one
of the "copies", none of the copies are included in the search result.

This is with:
$ dpkg -l | grep notmuch | <remove uninteresting stuff>
ii  libnotmuch3                                    0.13.2-1
ii  notmuch                                        0.13.2-1
ii  notmuch-emacs                                  0.13.2-1


However, I still find it a much bigger problem that notmuch will
silently and automatically discard new mails without letting me know that
they exist, let alone allowing me to read them.  (I.e. when receiving a new
mail that happens to have the same message-id as a previously received
mail.)

Personally, I find almost no cost in seeing all near-duplicates of the
same message (a cost that could be further mitigated by clever
presentation).  Conversely, I find a huge cost in never seeing some
messages at all.

This is currently the main issue that keeps me from switching to
notmuch.  (Importing my mail into notmuch dropped roughly 6800 mails.
I'm sure most of these mails are near-duplicates, but I've found 34
mails so far that are emphatically not.)

I think it is a nice and useful design to use the message-id as the
unique id to refer to a specific message.  Most of the time, this will
work just fine.  However, it is less useful to blindly trust that this
will always work correctly.

eirik


* Re: Automatic suppression of non-duplicate messages
  2012-11-03 10:17 Automatic suppression of non-duplicate messages Eirik Byrkjeflot Anonsen
@ 2012-11-03 20:53 ` David Bremner
  2012-11-04 10:06   ` Eirik Byrkjeflot Anonsen
  2012-11-04 22:34   ` Jani Nikula
  0 siblings, 2 replies; 6+ messages in thread
From: David Bremner @ 2012-11-03 20:53 UTC (permalink / raw)
  To: Eirik Byrkjeflot Anonsen, notmuch

Eirik Byrkjeflot Anonsen <eirik@eirikba.org> writes:

> That's not what I see.  If I search for a term that only appears in
> one of the "copies", none of the copies are included in the search
> result.

The offending code is at line 1813 of lib/database.cc; the message is
only indexed if the message-id is new.

It might be sensible to move _notmuch_message_index_file into the other
branch of the if, but even if that works fine, something more
sophisticated is needed for the call to
_notmuch_message_set_header_values; the invariant that each message has
a single subject seems reasonable.

Offhand I'm not sure of a good method of automatically deciding what is
the same message (with e.g. headers and footer text added by a mailing
list).
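
To make the behaviour described above concrete, here is a toy model in
Python.  It is not notmuch's actual code: the class, its methods and the
index_duplicates switch are all made up for illustration.  It only
mirrors the idea that text is indexed when a message-id is first seen,
while later files with the same id are merely recorded.

```python
from collections import defaultdict


class ToyIndex:
    """Toy model: one document per message-id, many files per document."""

    def __init__(self, index_duplicates=False):
        self.files = defaultdict(list)  # message-id -> list of filenames
        self.terms = defaultdict(set)   # term -> set of message-ids
        # Hypothetical switch standing in for "index in the other branch too".
        self.index_duplicates = index_duplicates

    def add_message(self, filename, message_id, body):
        """Record a file; return True if its message-id was new."""
        is_new = message_id not in self.files
        self.files[message_id].append(filename)
        # Only a previously unseen message-id gets its text indexed,
        # unless the hypothetical switch is turned on.
        if is_new or self.index_duplicates:
            for term in body.lower().split():
                self.terms[term].add(message_id)
        return is_new

    def search(self, term):
        """Return the message-ids whose indexed text contains the term."""
        return sorted(self.terms.get(term.lower(), set()))


index = ToyIndex()
index.add_message("cur/a", "<1@example.org>", "hello world")
index.add_message("cur/b", "<1@example.org>", "hello world footer")
print(index.search("footer"))  # the second copy's text was never indexed
```

With index_duplicates=True the same search finds the message, which is
the effect of indexing in both branches; the question of which Subject
to store is left untouched by this sketch.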


* Re: Automatic suppression of non-duplicate messages
  2012-11-03 20:53 ` David Bremner
@ 2012-11-04 10:06   ` Eirik Byrkjeflot Anonsen
  2012-11-04 22:34   ` Jani Nikula
  1 sibling, 0 replies; 6+ messages in thread
From: Eirik Byrkjeflot Anonsen @ 2012-11-04 10:06 UTC (permalink / raw)
  To: notmuch

David Bremner <david@tethera.net> writes:

> Eirik Byrkjeflot Anonsen <eirik@eirikba.org> writes:
>
>> That's not what I see.  If I search for a term that only appears in
>> one of the "copies", none of the copies are included in the search
>> result.
>
> The offending code is at line 1813 of lib/database.cc; the message is
> only indexed if the message-id is new.
>
> It might be sensible to move _notmuch_message_index_file into the other
> branch of the if, but even if that works fine, something more
> sophisticated is needed for the call to
> _notmuch_message_set_header_values; the invariant that each message has
> a single subject seems reasonable.

Hmm, depends.  Assuming indexing is intended to be used for searching,
one might want to search for something that occurs in one subject but
not the other.  In practice I doubt it matters.


> Offhand I'm not sure of a good method of automatically deciding what is
> the same message (with e.g. headers and footer text added by a mailing
> list).

I don't think the real problem here is the duplicate detection algorithm
itself.  It is rather that notmuch forces a particular duplicate
detection algorithm on its users.  Duplicate detection should really be
delegated to a different application, thus allowing people to experiment
with whatever algorithm works best for them.  (Just like notmuch
delegates the choice of initial tags on messages to an external
application.)

But first notmuch must be modified so it can sensibly treat multiple
instances having the same message-id as separate messages.  That seems
to me to be the hard part.  (And some way for external applications to
join and split copies, of course.)




However, if you want an algorithm that is likely to get rid of most
duplicates while keeping most non-duplicates separate, here's a quick
suggestion:


Just to clarify: The goal is to suppress most copies of the same message
while not suppressing a single instance of a different message.  It
isn't important if a few duplicate messages make it through, but it is
imperative that no "real" message is dropped.

To check whether two instances are duplicates, I suspect something like
this algorithm would be "good enough":

- Message-Id must be the same.  This isn't actually necessary, but it
  makes sense to require it anyway.

- From and Date must be the same.  These form important context that may
  change the meaning of the message (e.g. "me too" depends heavily on
  From, and "let's meet tomorrow" depends heavily on Date).  (Are there
  more context-supplying headers we should worry about?)

- If Subject and body are also the same, the instances are duplicates.

- Otherwise, if neither of the messages comes from a mailing list,
  they're probably not duplicates.

- Otherwise, grab a few other (recent) mails from the same mailing list.
  If all the bodies end with the same text, ignore that text when
  comparing the bodies.

- For the Subject, again use a few other (recent) mails from the same
  mailing list for comparison.  But this time only look for one of the
  well-known common patterns.  If all the mails match the same
  pattern, ignore that pattern when comparing the Subject.

- For both of the above, it would be good to pick messages from
  different threads, to avoid accidental similarities.  I suspect this
  is more important for subjects than bodies, though.

- Also, leading and trailing whitespace should probably be dropped.

- (Some other transformations may make sense, such as reflowing text or
  converting between character sets.  In practice I doubt that will make
  much of a difference.)

- If the "canonicalized" body and Subject are the same, the messages are
  duplicates.  At least there's now pretty much no chance that there is
  anything interesting that will be missed by dropping one of the
  messages.


(I'm assuming that identifying mailing lists is usually
straightforward, e.g. using the List-Id header.)
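
The steps above could be sketched roughly as follows in Python, using
only the standard email module.  This is a minimal sketch under several
assumptions, not a tested design: all function names are made up, the
only "well-known pattern" recognised in the Subject is a bracketed tag
such as "[listname]", bodies are compared without transfer decoding or
reflowing, and the caller is expected to supply a few other recent
messages from the same list (ideally from different threads).

```python
import email
import os
import re

TAG = re.compile(r"\[[^\]]+\]")  # assumed pattern: a bracketed list tag


def body_text(msg):
    """Plain-text body; no transfer decoding, which is enough for a sketch."""
    if msg.is_multipart():
        return "".join(p.get_payload() for p in msg.walk()
                       if p.get_content_type() == "text/plain")
    return msg.get_payload()


def common_footer(samples):
    """Text that every sample body ends with (empty with fewer than 2 samples)."""
    if len(samples) < 2:
        return ""
    reversed_bodies = [body_text(m).strip()[::-1] for m in samples]
    return os.path.commonprefix(reversed_bodies)[::-1]


def common_tag(samples):
    """The bracketed Subject tag shared by all samples, or ""."""
    tags = set()
    for m in samples:
        found = TAG.search(m.get("Subject", ""))
        tags.add(found.group(0) if found else "")
    return tags.pop() if len(samples) >= 2 and len(tags) == 1 else ""


def canon_body(text, footer):
    """Drop surrounding whitespace and the list footer, if present."""
    text = text.strip()
    if footer and text.endswith(footer):
        text = text[:len(text) - len(footer)]
    return text.strip()


def canon_subject(subject, tag):
    """Drop the list tag, if any, and normalise whitespace."""
    if tag:
        subject = subject.replace(tag, "", 1)
    return " ".join(subject.split())


def is_duplicate(a, b, list_samples=()):
    """Return True only if a and b look like copies of the same message."""
    # Message-Id, From and Date must all be present and identical.
    for header in ("Message-Id", "From", "Date"):
        if a.get(header) is None or a.get(header) != b.get(header):
            return False
    subj_a, subj_b = a.get("Subject", ""), b.get("Subject", "")
    body_a, body_b = body_text(a), body_text(b)
    if subj_a == subj_b and body_a == body_b:
        return True
    # Without a mailing list involved, any difference counts as real.
    if not (a.get("List-Id") or b.get("List-Id")):
        return False
    footer, tag = common_footer(list_samples), common_tag(list_samples)
    return (canon_subject(subj_a, tag) == canon_subject(subj_b, tag)
            and canon_body(body_a, footer) == canon_body(body_b, footer))
```

The function errs on the side of returning False: when no samples are
given, nothing is stripped, so a list copy with an added footer is kept
as a separate message rather than dropped.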

eirik


* Re: Automatic suppression of non-duplicate messages
  2012-11-03 20:53 ` David Bremner
  2012-11-04 10:06   ` Eirik Byrkjeflot Anonsen
@ 2012-11-04 22:34   ` Jani Nikula
  2012-11-05  4:28     ` Austin Clements
  2012-11-05 15:22     ` Eirik Byrkjeflot Anonsen
  1 sibling, 2 replies; 6+ messages in thread
From: Jani Nikula @ 2012-11-04 22:34 UTC (permalink / raw)
  To: David Bremner, Eirik Byrkjeflot Anonsen, notmuch

On Sat, 03 Nov 2012, David Bremner <david@tethera.net> wrote:
> Eirik Byrkjeflot Anonsen <eirik@eirikba.org> writes:
>
>> That's not what I see.  If I search for a term that only appears in
>> one of the "copies", none of the copies are included in the search
>> result.
>
> The offending code is at line 1813 of lib/database.cc; the message is
> only indexed if the message-id is new.
>
> It might be sensible to move _notmuch_message_index_file into the other
> branch of the if, but even if that works fine, something more
> sophisticated is needed for the call to
> _notmuch_message_set_header_values; the invariant that each message has
> a single subject seems reasonable.
>
> Offhand I'm not sure of a good method of automatically deciding what is
> the same message (with e.g. headers and footer text added by a mailing
> list).

Assuming there was a good method, what would you do with two different
messages that have the same message id? That is the unique id we use to
identify messages (which should be fine per RFC 5322 and its
predecessors; we're talking about messages from broken systems here).

It might be helpful to have a configuration option similar to new.tags
that would define the tags to be assigned to messages with duplicate
message ids. (This could be done in the
NOTMUCH_STATUS_DUPLICATE_MESSAGE_ID case near line 516 of
notmuch-new.c). This could be used to assign a "dupe" tag, for example,
so the user could do whatever they want in the post-new hook or the user
interface. A sufficiently clever post-new hook could compare the files
of a message, and drop the tag or add another, as the case may
be. Surely not a perfect solution, but keeps the implementation simple.
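
As a rough illustration of such a post-new hook, here is a sketch in
Python.  It assumes the proposed option exists and has already applied a
"dupe" tag; that option is hypothetical, as are the "needs-review" tag
and all function names below.  The only notmuch commands used are
"notmuch search --output=messages", "notmuch search --output=files" and
"notmuch tag".  The comparison itself is deliberately crude: From, Date,
Subject and a hash of the raw body.

```python
import email
import hashlib
import subprocess
from pathlib import Path


def fingerprint(path):
    """Return (From, Date, Subject) plus a hash of the raw body of one file."""
    raw = Path(path).read_bytes().replace(b"\r\n", b"\n")
    msg = email.message_from_bytes(raw)
    _, _, body = raw.partition(b"\n\n")  # everything after the header block
    headers = tuple(str(msg.get(h, "")).strip()
                    for h in ("From", "Date", "Subject"))
    return headers, hashlib.sha256(body.strip()).hexdigest()


def same_message(paths):
    """True if all files share the same headers of interest and body."""
    return len({fingerprint(p) for p in paths}) <= 1


def notmuch_lines(*args):
    """Run a notmuch command and return its output lines."""
    result = subprocess.run(["notmuch", *args], check=True,
                            capture_output=True, text=True)
    return result.stdout.splitlines()


def retag_duplicates():
    """Drop 'dupe' from harmless copies, flag the rest for a human."""
    for message in notmuch_lines("search", "--output=messages", "tag:dupe"):
        paths = notmuch_lines("search", "--output=files", message)
        if same_message(paths):
            changes = ["-dupe"]
        else:
            changes = ["-dupe", "+needs-review"]  # hypothetical tag name
        subprocess.run(["notmuch", "tag", *changes, "--", message], check=True)
```

The hook script would call retag_duplicates() once after "notmuch new"
has finished.  Files that differ only by mailing list footers would end
up tagged for review here; a smarter comparison could be dropped in
without changing the rest.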


BR,
Jani.


* Re: Automatic suppression of non-duplicate messages
  2012-11-04 22:34   ` Jani Nikula
@ 2012-11-05  4:28     ` Austin Clements
  2012-11-05 15:22     ` Eirik Byrkjeflot Anonsen
  1 sibling, 0 replies; 6+ messages in thread
From: Austin Clements @ 2012-11-05  4:28 UTC (permalink / raw)
  To: Jani Nikula; +Cc: notmuch

Quoth Jani Nikula on Nov 05 at 12:34 am:
> On Sat, 03 Nov 2012, David Bremner <david@tethera.net> wrote:
> > Eirik Byrkjeflot Anonsen <eirik@eirikba.org> writes:
> >
> >> That's not what I see.  If I search for a term that only appears in
> >> one of the "copies", none of the copies are included in the search
> >> result.
> >
> > The offending code is at line 1813 of lib/database.cc; the message is
> > only indexed if the message-id is new.
> >
> > It might be sensible to move _notmuch_message_index_file into the other
> > branch of the if, but even if that works fine, something more
> > sophisticated is needed for the call to
> > _notmuch_message_set_header_values; the invariant that each message has
> > a single subject seems reasonable.
> >
> > Offhand I'm not sure of a good method of automatically deciding what is
> > the same message (with e.g. headers and footer text added by a mailing
> > list).
> 
> Assuming there was a good method, what would you do with two different
> messages that have the same message id? That is the unique id we use to
> identify messages (which should be fine per RFC 5322 and its
> predecessors; we're talking about messages from broken systems here).
> 
> It might be helpful to have a configuration option similar to new.tags
> that would define the tags to be assigned to messages with duplicate
> message ids. (This could be done in the
> NOTMUCH_STATUS_DUPLICATE_MESSAGE_ID case near line 516 of
> notmuch-new.c). This could be used to assign a "dupe" tag, for example,
> so the user could do whatever they want in the post-new hook or the user
> interface. A sufficiently clever post-new hook could compare the files
> of a message, and drop the tag or add another, as the case may
> be. Surely not a perfect solution, but keeps the implementation simple.

This would also trigger on message flag changes and folder moves
performed outside of notmuch, since notmuch sees those as a duplicate
message ID followed by a deletion.  The only way to do something for
every received message even if it has the same message ID as an
existing message is to do it in whatever delivers mail.  Currently, we
don't have a good story for integrating on-delivery operations with
notmuch.
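
A toy simulation may make the first point easier to see.  The Python
below is not how notmuch scans a maildir; the function name and the
event labels are invented.  It only models a scanner that compares the
files known from the last run with the files currently on disk.

```python
def scan(known, on_disk):
    """Compare two {path: message-id} snapshots and list what a scanner sees.

    'known' is the state after the previous scan, 'on_disk' the current one.
    New paths are reported first, vanished paths afterwards.
    """
    events = []
    seen_ids = set(known.values())
    for path, message_id in sorted(on_disk.items()):
        if path not in known:
            kind = "duplicate-id" if message_id in seen_ids else "new"
            events.append((kind, path))
            seen_ids.add(message_id)
    for path in sorted(known):
        if path not in on_disk:
            events.append(("deleted", path))
    return events


# Marking a message as read in another client renames the maildir file.
before = {"cur/123:2,": "<1@example.org>"}
after = {"cur/123:2,S": "<1@example.org>"}
print(scan(before, after))
# [('duplicate-id', 'cur/123:2,S'), ('deleted', 'cur/123:2,')]
```

A tag applied on every "duplicate-id" event would therefore also land on
a message whose only change was a flag update or a folder move.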


* Re: Automatic suppression of non-duplicate messages
  2012-11-04 22:34   ` Jani Nikula
  2012-11-05  4:28     ` Austin Clements
@ 2012-11-05 15:22     ` Eirik Byrkjeflot Anonsen
  1 sibling, 0 replies; 6+ messages in thread
From: Eirik Byrkjeflot Anonsen @ 2012-11-05 15:22 UTC (permalink / raw)
  To: notmuch

Jani Nikula <jani@nikula.org> writes:

> On Sat, 03 Nov 2012, David Bremner <david@tethera.net> wrote:
>> Offhand I'm not sure of a good method of automatically deciding what is
>> the same message (with e.g. headers and footer text added by a mailing
>> list).
>
> Assuming there was a good method, what would you do with two different
> messages that have the same message id? That is the unique id we use to
> identify messages (which should be fine per RFC 5322 and its
> predecessors; we're talking about messages from broken systems here).

We're also talking about data from "untrusted" sources.  Assuming that
such data is always non-broken seems overly optimistic.  (See
e.g. http://cr.yp.to/immhf/thread.html, the section "Security and
reliability issues" for one view on the matter.)

In fact, I'd say that it should be a design goal for any mail client to
deal with as much invalid input as possible.  Show big, fat warning
messages if you want to, but don't just drop the message and pretend it
does not exist.

eirik

