unofficial mirror of notmuch@notmuchmail.org
From: Daniel Kahn Gillmor <dkg@fifthhorseman.net>
To: David Bremner <david@tethera.net>, notmuch@notmuchmail.org
Subject: Re: [RFC patch 2/2] lib: index message files with duplicate message-ids
Date: Fri, 17 Mar 2017 12:44:02 -0400
Message-ID: <874lyronu5.fsf@alice.fifthhorseman.net>
In-Reply-To: <8760j8n3ld.fsf@tethera.net>

On Thu 2017-03-16 20:34:22 -0400, David Bremner wrote:
> Daniel Kahn Gillmor <dkg@fifthhorseman.net> writes:
>>  0) what happens when one of the files gets deleted from the message
>>     store? do the terms it contributes get removed from the index?
>
> That's a good question, and an issue I hadn't thought about.
> Currently there's no way to do this short of deleting all the terms for
> all the files (excepting tags and properties, presumably) and
> reindexing. This will require some more thought, I think.

i didn't mean to raise the concern to drag this work down, i just want
to make sure the problem is on the table.  dropping all terms on
deletion and re-indexing remaining files with the same message ID isn't
terribly efficient, but i don't think it's going to be terribly costly
either.  we're not talking about hundreds of files per message-id in
most normal cases; usually only two (sent-to-self,
recvd-from-mailing-list), and maybe a half-dozen at most (messages sent
to multiple mailboxes that all forward to me).

of course, if multiple files are deleted concurrently, and notmuch
notices that one of them is missing, then re-indexing the other will
depend on whether it was also deleted in that same batch.
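
as a rough illustration of the "drop all terms, then re-index whatever is
left" approach, here is a minimal sketch.  it is a toy model, not notmuch's
library API: ToyIndex, add_file and remove_files are hypothetical names, and
keeping file bodies in memory stands in for re-reading the surviving files
from disk.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyIndex:
    """Hypothetical stand-in for the index (not the notmuch API)."""

    # message-id -> {path: body}; the body stands in for the file on disk
    files: dict[str, dict[str, str]] = field(default_factory=dict)
    # message-id -> union of the terms contributed by all of its files
    terms: dict[str, set[str]] = field(default_factory=dict)

    def add_file(self, mid: str, path: str, body: str) -> None:
        """Index one file and merge its terms into the message's term set."""
        self.files.setdefault(mid, {})[path] = body
        self.terms.setdefault(mid, set()).update(body.lower().split())

    def remove_files(self, mid: str, gone: set[str]) -> None:
        """Handle a batch of deleted paths for a single message-id."""
        remaining = {
            path: body
            for path, body in self.files.get(mid, {}).items()
            if path not in gone
        }
        # Drop everything recorded for this message-id ...
        self.files.pop(mid, None)
        self.terms.pop(mid, None)
        # ... then rebuild from the files that survived the whole batch.
        for path, body in remaining.items():
            self.add_file(mid, path, body)


idx = ToyIndex()
idx.add_file("X@example.org", "sent/1", "hello world")
idx.add_file("X@example.org", "list/1", "hello list footer")
idx.remove_files("X@example.org", {"list/1"})
```

the batch is passed in as a set so that a file deleted in the same pass is
never used as a source for re-indexing.  with the usual two to six files per
message-id, the rebuild only touches a handful of files.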

>>  1) when a message is displayed to the user as a result of a match, it
>>     gets pulled from one of the files, not both.  if it's pulled from
>>     the file that didn't have the term the user searched for, that's
>>     likely to be confusing.  do you have a way to avoid that confusion?
>
> I was looking for an incremental improvement, so I imagined something
> like various output flagging "yes, there are duplicate files for this
> message", and letting users dig those out using something like the
> --duplicate= option.

This kind of output flagging would be worthwhile in its own right, and
maybe is an even less controversial place to start for the incremental
improvement.

>> It also occurs to me that one of the things i'd love to have is
>> well-indexed notes about any given e-mail.  So if this was adopted, i
>> could presumably just write a file that has the same Message-Id as the
>> message, put my notes in it, and index it.  that's a little weird,
>> though.  would there be a better way to do such a thing?
>
> One option would be to use a note=foo message property. That's not
> immediately searchable, though we could kludge together
> something like the subject regexp search, which would be slower.

right, i think i'd want the notes to be searchable, if possible.

Now i'm thinking about attack scenarios for this multi-indexed scheme,
though.  If i know that you've already gotten an e-mail with message-id
X, then i can go ahead and remotely, silently add search terms to that
message by sending you new messages that have the same message-id.  That
seems troubling :/ The status quo at least requires the attacker to win
a race to get their message indexed first, obscuring the real message.
in the proposed new scenario, the attacker doesn't need to win any race.
they can't prevent the true message from being indexed, but they can
associate it with whatever toxicity (e.g. "viagra", or "From:
killfiled-user") they want which might be useful in suppressing the
message in a post-processing run.
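
to make the difference concrete, here is a small sketch comparing the two
policies on a toy index.  it assumes a simplified model where a message is
just a (message-id, body) pair and terms are whitespace-separated words; the
function names and the example message-id are hypothetical, and nothing here
is notmuch's real indexing code.

```python
def tokenize(body):
    """Split a body into lowercase terms (toy tokenizer)."""
    return set(body.lower().split())


def index_first_wins(deliveries):
    """Status quo in this model: only the first file seen for an id is indexed."""
    index = {}
    for mid, body in deliveries:
        if mid not in index:
            index[mid] = tokenize(body)
    return index


def index_all_files(deliveries):
    """Proposed scheme in this model: every file adds terms to the same id."""
    index = {}
    for mid, body in deliveries:
        index.setdefault(mid, set()).update(tokenize(body))
    return index


MID = "X@example.org"  # hypothetical message-id known to the attacker
real = (MID, "quarterly budget review")
forged = (MID, "viagra")

# Forged copy arrives after the real one: no race to win.
late_first_wins = index_first_wins([real, forged])
late_all_files = index_all_files([real, forged])

# Forged copy arrives first: the attacker won the race.
early_first_wins = index_first_wins([forged, real])
```

in this model, late_first_wins[MID] contains only the real message's terms,
while late_all_files[MID] also contains "viagra".  early_first_wins[MID]
contains only "viagra", which is the status-quo case where the real message
is obscured.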

ugh, mail,

      --dkg

Thread overview: 9+ messages
2017-03-16  1:57 a first step for the duplicate message-id dilemma David Bremner
2017-03-16  1:57 ` [RFC patch 1/2] test: add known broken test for duplicate message id David Bremner
2017-03-16  1:57 ` [RFC patch 2/2] lib: index message files with duplicate message-ids David Bremner
2017-03-16 18:22   ` Daniel Kahn Gillmor
2017-03-17  0:34     ` David Bremner
2017-03-17 16:44       ` Daniel Kahn Gillmor [this message]
2017-03-18 21:31         ` David Bremner
2017-03-22 17:29       ` Jani Nikula
2017-03-17  5:47     ` Mark Walters
