* Automatic suppression of non-duplicate messages
From: Eirik Byrkjeflot Anonsen @ 2012-11-03 10:17 UTC (permalink / raw)
To: notmuch
As has been mentioned a few times before, notmuch chooses to silently
drop any message that has the same message-id as an already-seen
message.
In another thread, Austin Clements said:
> notmuch tracks all copies of a message, but its output generally shows
> messages, rather than files, so you see a message only once regardless
> of how many copies there are in the file system.
That's not what I see. If I search for a term that only appears in one
of the "copies", none of the copies are included in the search result.
This is with:
$ dpkg -l | grep notmuch | <remove uninteresting stuff>
ii libnotmuch3 0.13.2-1
ii notmuch 0.13.2-1
ii notmuch-emacs 0.13.2-1
However, I still find it a much bigger problem that notmuch will
silently and automatically discard new mails without letting me know
that they exist, let alone allowing me to read them. (I.e. when I
receive a new mail that happens to have the same message-id as a
previously received mail.)
Personally, I find almost no cost in seeing all near-duplicates of the
same message (a cost that could be further mitigated by clever
presentation). Conversely, I find a huge cost in never seeing some
messages at all.
This is currently the main issue that keeps me from switching to
notmuch. (Importing my mail into notmuch dropped roughly 6800 mails.
I'm sure most of these mails are near-duplicates, but I've found 34
mails so far that are emphatically not.)
I think it is a nice and useful design to use the message-id as the
unique id to refer to a specific message. Most of the time, this will
work just fine. However, it is less useful to blindly trust that this
will always work correctly.
eirik
* Re: Automatic suppression of non-duplicate messages
From: David Bremner @ 2012-11-03 20:53 UTC (permalink / raw)
To: Eirik Byrkjeflot Anonsen, notmuch
Eirik Byrkjeflot Anonsen <eirik@eirikba.org> writes:
> That's not what I see. If I search for a term that only appears in
> one of the "copies", none of the copies are included in the search
> result.
The offending code is at line 1813 of lib/database.cc; the message is
only indexed if the message-id is new.
It might be sensible to move _notmuch_message_index_file into the other
branch of the if, but even if that works fine, something more
sophisticated is needed for the call to
__notmuch_message_set_header_values; the invariant that each message has
a single subject seems reasonable.
Offhand I'm not sure of a good method of automatically deciding what is
the same message (with e.g. headers and footer text added by a mailing
list).
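To make the effect of that branch concrete, here is a toy model in
Python. It is not notmuch code; the class ToyIndex and the flag
index_duplicates are made up for illustration. It only assumes what is
described above: every file is recorded, but text is indexed only when
the message-id is new.

```python
# Toy model (not notmuch code): only the first file seen for a
# message-id has its text indexed; later files are merely recorded.
from collections import defaultdict


class ToyIndex:
    def __init__(self, index_duplicates=False):
        self.index_duplicates = index_duplicates
        self.files = defaultdict(list)   # message-id -> filenames
        self.terms = defaultdict(set)    # term -> message-ids

    def add(self, message_id, filename, text):
        is_new = message_id not in self.files
        self.files[message_id].append(filename)
        # The branch under discussion: text is indexed only for a new id,
        # unless the hypothetical index_duplicates flag is set.
        if is_new or self.index_duplicates:
            for term in text.lower().split():
                self.terms[term].add(message_id)

    def search(self, term):
        return sorted(self.terms.get(term.lower(), set()))


current = ToyIndex()
current.add("id1@example", "inbox/1", "hello world")
current.add("id1@example", "list/7", "hello world unsubscribe")
print(current.search("unsubscribe"))   # []

proposed = ToyIndex(index_duplicates=True)
proposed.add("id1@example", "inbox/1", "hello world")
proposed.add("id1@example", "list/7", "hello world unsubscribe")
print(proposed.search("unsubscribe"))  # ['id1@example']
```

With the flag off, a term that occurs only in the second copy matches
nothing, which is the behaviour reported at the top of the thread.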
* Re: Automatic suppression of non-duplicate messages
From: Eirik Byrkjeflot Anonsen @ 2012-11-04 10:06 UTC (permalink / raw)
To: notmuch
David Bremner <david@tethera.net> writes:
> Eirik Byrkjeflot Anonsen <eirik@eirikba.org> writes:
>
>> That's not what I see. If I search for a term that only appears in
>> one of the "copies", none of the copies are included in the search
>> result.
>
> The offending code is at line 1813 of lib/database.cc; the message is
> only indexed if the message-id is new.
>
> It might be sensible to move _notmuch_message_index_file into the other
> branch of the if, but even if that works fine, something more
> sophisticated is needed for the call to
> __notmuch_message_set_header_values; the invariant that each message has
> a single subject seems reasonable.
Hmm, depends. Assuming indexing is intended to be used for searching,
one might want to search for something that occurs in one subject but
not the other. In practice I doubt it matters.
> Offhand I'm not sure of a good method of automatically deciding what is
> the same message (with e.g. headers and footer text added by a mailing
> list).
I don't think the real problem here is the duplicate detection algorithm
itself. It is rather that notmuch forces a particular duplicate
detection algorithm on its users. Duplicate detection should really be
delegated to a different application, thus allowing people to experiment
with whatever algorithm works best for them. (Just like notmuch
delegates the choice of initial tags on messages to an external
application.)
But first notmuch must be modified so it can sensibly treat multiple
instances having the same message-id as separate messages. That seems
to me to be the hard part. (And some way for external applications to
join and split copies, of course.)
However, if you want an algorithm that is likely to get rid of most
duplicates while keeping most non-duplicates separate, here's a quick
suggestion:
Just to clarify: The goal is to suppress most copies of the same message
while not suppressing a single instance of a different message. It
isn't important if a few duplicate messages make it through, but it is
imperative that no "real" message is dropped.
To check whether two instances are duplicates, I suspect something like
this algorithm would be "good enough":
- Message-Id must be the same. This isn't actually necessary, but it
makes sense to require it anyway.
- From and Date must be the same. These form important context that may
change the meaning of the message (e.g. "me too" depends heavily on
From, and "let's meet tomorrow" depends heavily on Date). (Are there
more context-supplying headers we should worry about?)
- If Subject and body are also the same, the instances are duplicates.
- Otherwise, if neither of the messages comes from a mailing list,
they're probably not duplicates.
- Otherwise, grab a few other (recent) mails from the same mailing list.
If all the bodies end with the same text, ignore that text when
comparing the bodies.
- For the Subject, again use a few other (recent) mails from the same
mailing list for comparison. But this time only look for one of the
well-known common patterns. If all the mails match the same
pattern, ignore that pattern when comparing the Subject.
- For both of the above, it would be good to pick messages from
different threads, to avoid accidental similarities. I suspect this
is more important for subjects than bodies, though.
- Also, leading and trailing whitespace should probably be dropped.
- (Some other transformations may make sense, such as reflowing text or
converting between character sets. In practice I doubt that will make
much of a difference.)
- If the "canonicalized" body and Subject are the same, the messages are
duplicates. At least there's now pretty much no chance that there is
anything interesting that will be missed by dropping one of the
messages.
(I'm assuming that identifying mailing lists is usually
straightforward, e.g. using the List-Id header).
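The steps above could be sketched roughly as follows in Python. This is
only a sketch under several assumptions: messages are simple
(non-multipart) text mails, the only subject pattern recognised is a
bracketed tag such as "[list]", and the helper names (common_suffix,
shared_tag, is_duplicate) are made up here. Reflowing and character set
conversion are left out.

```python
import os
import re
from email import message_from_string

TAG = re.compile(r"\[[^\]]+\]")  # e.g. "[notmuch]"; one well-known pattern


def common_suffix(texts):
    """Longest string that every text ends with (trailing whitespace ignored)."""
    reversed_texts = [t.rstrip()[::-1] for t in texts]
    return os.path.commonprefix(reversed_texts)[::-1]


def shared_tag(subjects):
    """Return the bracketed tag if every subject carries the same one."""
    found = [TAG.search(s) for s in subjects]
    if not found or not all(found):
        return None
    tags = {m.group(0) for m in found}
    return tags.pop() if len(tags) == 1 else None


def is_duplicate(a, b, recent=()):
    """a, b: simple (non-multipart) messages; recent: other mails from
    the same list, ideally taken from different threads."""
    for header in ("Message-Id", "From", "Date"):
        if (a[header] or "").strip() != (b[header] or "").strip():
            return False
    if a["Subject"] == b["Subject"] and a.get_payload() == b.get_payload():
        return True
    if not (a["List-Id"] or b["List-Id"]):
        return False  # neither copy went through a mailing list
    # Need at least two samples, or the "footer" would be a whole body.
    footer = ""
    if len(recent) > 1:
        footer = common_suffix([m.get_payload() for m in recent])
    tag = shared_tag([m["Subject"] or "" for m in recent])

    def canonical(msg):
        subject = msg["Subject"] or ""
        if tag:
            subject = subject.replace(tag, "", 1)
        body = msg.get_payload().strip()
        if footer and body.endswith(footer):
            body = body[: -len(footer)]
        return " ".join(subject.split()), body.strip()

    return canonical(a) == canonical(b)


HEAD = ("Message-Id: <1@example>\nFrom: a@example\n"
        "Date: Sat, 3 Nov 2012 10:17:00 +0000\n")
direct = message_from_string(HEAD + "Subject: Hello\n\nBody text\n")
listed = message_from_string(
    HEAD + "List-Id: <l.example>\nSubject: [l] Hello\n\n"
    "Body text\n-- \nlist footer\n")
recent = [
    message_from_string("Subject: [l] Foo\n\nabc\n-- \nlist footer\n"),
    message_from_string("Subject: Re: [l] Bar\n\nxyz\n-- \nlist footer\n"),
]
print(is_duplicate(direct, listed, recent))  # True
```

Note that the sketch errs on the side of keeping messages: without
sample mails from the list, nothing is stripped and the two copies above
are reported as different.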
eirik
* Re: Automatic suppression of non-duplicate messages
From: Jani Nikula @ 2012-11-04 22:34 UTC (permalink / raw)
To: David Bremner, Eirik Byrkjeflot Anonsen, notmuch
On Sat, 03 Nov 2012, David Bremner <david@tethera.net> wrote:
> Eirik Byrkjeflot Anonsen <eirik@eirikba.org> writes:
>
>> That's not what I see. If I search for a term that only appears in
>> one of the "copies", none of the copies are included in the search
>> result.
>
> The offending code is at line 1813 of lib/database.cc; the message is
> only indexed if the message-id is new.
>
> It might be sensible to move _notmuch_message_index_file into the other
> branch of the if, but even if that works fine, something more
> sophisticated is needed for the call to
> __notmuch_message_set_header_values; the invariant that each message has
> a single subject seems reasonable.
>
> Offhand I'm not sure of a good method of automatically deciding what is
> the same message (with e.g. headers and footer text added by a mailing
> list).
Assuming there was a good method, what would you do with two different
messages that have the same message id? That is the unique id we use to
identify messages (which should be fine per RFC 5322 and its
predecessors; we're talking about messages from broken systems here).
It might be helpful to have a configuration option similar to new.tags
that would define the tags to be assigned to messages with duplicate
message ids. (This could be done in the
NOTMUCH_STATUS_DUPLICATE_MESSAGE_ID case near line 516 of
notmuch-new.c). This could be used to assign a "dupe" tag, for example,
so the user could do whatever they want in the post-new hook or the user
interface. A sufficiently clever post-new hook could compare the files
of a message, and drop the tag or add another, as the case may
be. Surely not a perfect solution, but keeps the implementation simple.
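A post-new hook along those lines might look roughly like the Python
sketch below. It assumes the proposed option exists and has already put
a "dupe" tag on the affected messages; that option, the "dupe" and
"conflict" tag names, and the helper functions are all hypothetical. The
only notmuch commands used are "notmuch search --output=messages",
"notmuch search --output=files" and "notmuch tag".

```python
import subprocess
from email import message_from_binary_file


def notmuch(*args):
    """Run the notmuch CLI and return its output as a list of lines."""
    result = subprocess.run(["notmuch", *args], check=True,
                            capture_output=True, text=True)
    return result.stdout.splitlines()


def fingerprint(path):
    """Subject plus decoded body parts; transport headers are ignored."""
    with open(path, "rb") as f:
        msg = message_from_binary_file(f)
    body = b"".join(part.get_payload(decode=True) or b""
                    for part in msg.walk() if not part.is_multipart())
    return str(msg["Subject"]), body.strip()


def copies_differ(paths):
    """True if the files of one message do not all look the same."""
    return len({fingerprint(p) for p in paths}) > 1


def main():
    # "dupe" is the hypothetical tag from the proposed option.
    for message in notmuch("search", "--output=messages", "tag:dupe"):
        paths = notmuch("search", "--output=files", message)
        if copies_differ(paths):
            changes = ["-dupe", "+conflict"]  # needs a human look
        else:
            changes = ["-dupe"]               # harmless copy
        notmuch("tag", *changes, "--", message)
```

The hook script itself would just call main(). The comparison is
deliberately naive (exact match of subject and body), so list footers
would still show up as conflicts.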
BR,
Jani.
* Re: Automatic suppression of non-duplicate messages
From: Austin Clements @ 2012-11-05 4:28 UTC (permalink / raw)
To: Jani Nikula; +Cc: notmuch
Quoth Jani Nikula on Nov 05 at 12:34 am:
> On Sat, 03 Nov 2012, David Bremner <david@tethera.net> wrote:
> > Eirik Byrkjeflot Anonsen <eirik@eirikba.org> writes:
> >
> >> That's not what I see. If I search for a term that only appears in
> >> one of the "copies", none of the copies are included in the search
> >> result.
> >
> > The offending code is at line 1813 of lib/database.cc; the message is
> > only indexed if the message-id is new.
> >
> > It might be sensible to move _notmuch_message_index_file into the other
> > branch of the if, but even if that works fine, something more
> > sophisticated is needed for the call to
> > __notmuch_message_set_header_values; the invariant that each message has
> > a single subject seems reasonable.
> >
> > Offhand I'm not sure of a good method of automatically deciding what is
> > the same message (with e.g. headers and footer text added by a mailing
> > list).
>
> Assuming there was a good method, what would you do with two different
> messages that have the same message id? That is the unique id we use to
> identify messages (which should be fine per RFC 5322 and its
> predecessors; we're talking about messages from broken systems here).
>
> It might be helpful to have a configuration option similar to new.tags
> that would define the tags to be assigned to messages with duplicate
> message ids. (This could be done in the
> NOTMUCH_STATUS_DUPLICATE_MESSAGE_ID case near line 516 of
> notmuch-new.c). This could be used to assign a "dupe" tag, for example,
> so the user could do whatever they want in the post-new hook or the user
> interface. A sufficiently clever post-new hook could compare the files
> of a message, and drop the tag or add another, as the case may
> be. Surely not a perfect solution, but keeps the implementation simple.
This would also trigger on message flag changes and folder moves
performed outside of notmuch, since notmuch sees those as a duplicate
message ID followed by a deletion. The only way to do something for
every received message even if it has the same message ID as an
existing message is to do it in whatever delivers mail. Currently, we
don't have a good story for integrating on-delivery operations with
notmuch.
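As a rough illustration of why a flag change looks like a duplicate,
here is a toy Python model of a scan over a maildir. It is not how
notmuch is implemented; the function scan_events and the event names are
made up. In a maildir, marking a message as seen renames the file, so a
scan finds a "new" file carrying a known message ID and notices that the
old file is gone.

```python
def scan_events(known, on_disk):
    """known and on_disk map filename -> message ID.

    Returns the events a scan would report: additions first, then
    deletions, mirroring "duplicate message ID followed by a deletion".
    """
    events = []
    known_ids = set(known.values())
    for path in sorted(on_disk.keys() - known.keys()):
        kind = "duplicate-id" if on_disk[path] in known_ids else "new-message"
        events.append((kind, path))
    for path in sorted(known.keys() - on_disk.keys()):
        events.append(("deleted", path))
    return events


before = {"cur/1351937820.host:2,": "a@example"}
after = {"cur/1351937820.host:2,S": "a@example"}  # marked seen elsewhere
print(scan_events(before, after))
# [('duplicate-id', 'cur/1351937820.host:2,S'), ('deleted', 'cur/1351937820.host:2,')]
```

From the scan alone, this is indistinguishable from a second mail
arriving with the same message ID while the first one is removed.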
* Re: Automatic suppression of non-duplicate messages
From: Eirik Byrkjeflot Anonsen @ 2012-11-05 15:22 UTC (permalink / raw)
To: notmuch
Jani Nikula <jani@nikula.org> writes:
> On Sat, 03 Nov 2012, David Bremner <david@tethera.net> wrote:
>> Offhand I'm not sure of a good method of automatically deciding what is
>> the same message (with e.g. headers and footer text added by a mailing
>> list).
>
> Assuming there was a good method, what would you do with two different
> messages that have the same message id? That is the unique id we use to
> identify messages (which should be fine per RFC 5322 and its
> predecessors; we're talking about messages from broken systems here).
We're also talking about data from "untrusted" sources. Assuming that
such data is always non-broken seems overly optimistic. (See
e.g. http://cr.yp.to/immhf/thread.html, the section "Security and
reliability issues" for one view on the matter.)
In fact, I'd say that it should be a design goal for any mail client to
deal with as much invalid input as possible. Show big, fat warning
messages if you want to, but don't just drop the message and pretend it
does not exist.
eirik