* a first step for the duplicate message-id dilemma
From: David Bremner @ 2017-03-16 1:57 UTC
To: notmuch

These are mainly RFC because I'm not 100% sure about the performance
impact. It seems OK for me: about 3% slower indexing my 500K messages
with about 35k duplicates. I didn't see a noticeable increase in
database size (in both cases it's 5.8G / 3.5G before/after notmuch
compact).

There are also tons of UI issues: for example, in the test case here,

    notmuch search subject:'"message 2"'

will happily print

    thread:0000000000000001 2001-01-05 [1/1] Notmuch Test Suite; message 1 (inbox unread)

I claim it's still an improvement over the current code, where that
second message is not findable by any terms unique to it.

* [RFC patch 1/2] test: add known broken test for duplicate message id
From: David Bremner @ 2017-03-16 1:57 UTC
To: notmuch

There are many other problems that could be tested, but this one we
have some hope of fixing because it doesn't require UI changes, just
indexing changes.
---
 test/T670-duplicate-mid.sh | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)
 create mode 100755 test/T670-duplicate-mid.sh

diff --git a/test/T670-duplicate-mid.sh b/test/T670-duplicate-mid.sh
new file mode 100755
index 00000000..d28afc91
--- /dev/null
+++ b/test/T670-duplicate-mid.sh
@@ -0,0 +1,17 @@
+#!/usr/bin/env bash
+test_description="duplicate message ids"
+. ./test-lib.sh || exit 1
+
+add_message [id]=id:duplicate '[subject]="message 1"'
+add_message [id]=id:duplicate '[subject]="message 2"'
+
+test_begin_subtest 'Search for second subject'
+test_subtest_known_broken
+cat <<EOF >EXPECTED
+MAIL_DIR/msg-001
+MAIL_DIR/msg-002
+EOF
+notmuch search --output=files subject:'"message 2"' | notmuch_dir_sanitize > OUTPUT
+test_expect_equal_file EXPECTED OUTPUT
+
+test_done
--
2.11.0

* [RFC patch 2/2] lib: index message files with duplicate message-ids
From: David Bremner @ 2017-03-16 1:57 UTC
To: notmuch

The corresponding xapian document just gets more terms added to it,
but this doesn't seem to break anything.
---
 lib/database.cc            | 3 +++
 test/T670-duplicate-mid.sh | 1 -
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/lib/database.cc b/lib/database.cc
index a679cbab..e83017ed 100644
--- a/lib/database.cc
+++ b/lib/database.cc
@@ -2582,6 +2582,9 @@ notmuch_database_add_message (notmuch_database_t *notmuch,
             if (ret)
                 goto DONE;
         } else {
+            ret = _notmuch_message_index_file (message, message_file);
+            if (ret)
+                goto DONE;
             ret = NOTMUCH_STATUS_DUPLICATE_MESSAGE_ID;
         }
 
diff --git a/test/T670-duplicate-mid.sh b/test/T670-duplicate-mid.sh
index d28afc91..41c53bc8 100755
--- a/test/T670-duplicate-mid.sh
+++ b/test/T670-duplicate-mid.sh
@@ -6,7 +6,6 @@ add_message [id]=id:duplicate '[subject]="message 1"'
 add_message [id]=id:duplicate '[subject]="message 2"'
 
 test_begin_subtest 'Search for second subject'
-test_subtest_known_broken
 cat <<EOF >EXPECTED
 MAIL_DIR/msg-001
 MAIL_DIR/msg-002
--
2.11.0

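As an illustration only, the effect of this patch can be modelled with a toy in-memory index. The sketch below is not notmuch or Xapian code: the `ToyIndex` class, its methods, and the whitespace tokenizer are all assumptions made for the example. It keeps one term set per Message-ID and shows that a search for the second subject finds nothing before the change, and returns both files afterwards, which is what the test in patch 1/2 expects.

```python
from collections import defaultdict


class ToyIndex:
    """Hypothetical stand-in for the database: one term set per Message-ID."""

    def __init__(self, index_duplicates):
        self.index_duplicates = index_duplicates
        self.terms = defaultdict(set)   # message-id -> indexed terms
        self.files = defaultdict(list)  # message-id -> filenames

    def add_message(self, mid, filename, subject):
        is_duplicate = mid in self.files
        self.files[mid].append(filename)
        if not is_duplicate or self.index_duplicates:
            # With index_duplicates set, the duplicate branch also indexes
            # the new file, so its terms land on the existing document.
            self.terms[mid].update(subject.lower().split())
        return "DUPLICATE_MESSAGE_ID" if is_duplicate else "SUCCESS"

    def search_files(self, *wanted):
        """Return every file of each message whose term set has all terms."""
        hits = []
        for mid, indexed in self.terms.items():
            if set(wanted) <= indexed:
                hits.extend(self.files[mid])
        return sorted(hits)


def build(index_duplicates):
    idx = ToyIndex(index_duplicates)
    idx.add_message("duplicate", "MAIL_DIR/msg-001", "message 1")
    idx.add_message("duplicate", "MAIL_DIR/msg-002", "message 2")
    return idx


before = build(index_duplicates=False).search_files("message", "2")
after = build(index_duplicates=True).search_files("message", "2")
print(before)  # []
print(after)   # ['MAIL_DIR/msg-001', 'MAIL_DIR/msg-002']
```

Note that the toy model, like the patch, cannot say which of the two files contributed the matching term; it can only return all files of the message.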
* Re: [RFC patch 2/2] lib: index message files with duplicate message-ids
From: Daniel Kahn Gillmor @ 2017-03-16 18:22 UTC
To: David Bremner, notmuch

On Wed 2017-03-15 21:57:28 -0400, David Bremner wrote:
> The corresponding xapian document just gets more terms added to it,
> but this doesn't seem to break anything.

this is an interesting suggestion. thanks for proposing it!

A couple questions:

 0) what happens when one of the files gets deleted from the message
    store? do the terms it contributes get removed from the index?

 1) when a message is displayed to the user as a result of a match, it
    gets pulled from one of the files, not both. if it's pulled from
    the file that didn't have the term the user searched for, that's
    likely to be confusing. do you have a way to avoid that confusion?

It also occurs to me that one of the things i'd love to have is
well-indexed notes about any given e-mail. So if this was adopted, i
could presumably just write a file that has the same Message-Id as the
message, put my notes in it, and index it. that's a little weird,
though. would there be a better way to do such a thing?

    --dkg

* Re: [RFC patch 2/2] lib: index message files with duplicate message-ids
From: David Bremner @ 2017-03-17 0:34 UTC
To: Daniel Kahn Gillmor, notmuch

Daniel Kahn Gillmor <dkg@fifthhorseman.net> writes:

> On Wed 2017-03-15 21:57:28 -0400, David Bremner wrote:
>> The corresponding xapian document just gets more terms added to it,
>> but this doesn't seem to break anything.
>
> this is an interesting suggestion. thanks for proposing it!
>
> A couple questions:
>
>  0) what happens when one of the files gets deleted from the message
>     store? do the terms it contributes get removed from the index?
>

That's a good question, and an issue I hadn't thought about.
Currently there's no way to do this short of deleting all the terms
for all the files (excepting tags and properties, presumably) and
reindexing. This will require some more thought, I think.

>  1) when a message is displayed to the user as a result of a match, it
>     gets pulled from one of the files, not both. if it's pulled from
>     the file that didn't have the term the user searched for, that's
>     likely to be confusing. do you have a way to avoid that confusion?

I was looking for an incremental improvement, so I imagined something
like various output flagging "yes, there are duplicate files for this
message", and letting users dig those out using something like the
--duplicate= option.

> It also occurs to me that one of the things i'd love to have is
> well-indexed notes about any given e-mail. So if this was adopted, i
> could presumably just write a file that has the same Message-Id as the
> message, put my notes in it, and index it. that's a little weird,
> though. would there be a better way to do such a thing?
>
>     --dkg

One option would be to use a note=foo message property. That's not
immediately searchable though, although we could kludge together
something like the subject regexp search, which would be slower.

d

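To make the trade-off in the last paragraph concrete, here is a rough sketch contrasting an indexed term lookup with a regexp scan over stored note values. Everything in it is hypothetical: the `properties` and `inverted_index` dictionaries and both helper functions are invented for the example and do not correspond to the notmuch API.

```python
import re

# Hypothetical data: per-message properties, plus an inverted index over
# terms that were indexed from the message files themselves.
properties = {
    "id-1": {"note": "follow up with the xapian folks"},
    "id-2": {"note": "nothing to do"},
    "id-3": {},
}
inverted_index = {"xapian": {"id-1"}, "patch": {"id-1", "id-2"}}


def search_term(term):
    """Indexed lookup: one dictionary access, whatever the mailbox size."""
    return sorted(inverted_index.get(term, set()))


def search_notes(pattern):
    """Regexp fallback: has to visit the note value of every message."""
    rx = re.compile(pattern)
    return sorted(
        mid for mid, props in properties.items()
        if rx.search(props.get("note", ""))
    )


indexed_hits = search_term("xapian")
note_hits = search_notes(r"xapian")
print(indexed_hits, note_hits)
```

Both calls find the same message here, but the second does work proportional to the number of messages, which is the sense in which a regexp-style search over notes "would be slower".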
* Re: [RFC patch 2/2] lib: index message files with duplicate message-ids
From: Daniel Kahn Gillmor @ 2017-03-17 16:44 UTC
To: David Bremner, notmuch

On Thu 2017-03-16 20:34:22 -0400, David Bremner wrote:
> Daniel Kahn Gillmor <dkg@fifthhorseman.net> writes:
>>  0) what happens when one of the files gets deleted from the message
>>     store? do the terms it contributes get removed from the index?
>
> That's a good question, and an issue I hadn't thought about.
> Currently there's no way to do this short of deleting all the terms
> for all the files (excepting tags and properties, presumably) and
> reindexing. This will require some more thought, I think.

i didn't mean to raise the concern to drag this work down, i just want
to make sure the problem is on the table. dropping all terms on
deletion and re-indexing remaining files with the same message ID isn't
terribly efficient, but i don't think it's going to be terribly costly
either. we're not talking about hundreds of files per message-id in
most normal cases; usually only two (sent-to-self,
recvd-from-mailing-list), and maybe a half-dozen at most (messages sent
to multiple mailboxes that all forward to me).

of course, if multiple files are deleted concurrently, and notmuch
notices that one of them is missing, then re-indexing the other will
depend on whether it was also deleted in that same batch.

>>  1) when a message is displayed to the user as a result of a match, it
>>     gets pulled from one of the files, not both. if it's pulled from
>>     the file that didn't have the term the user searched for, that's
>>     likely to be confusing. do you have a way to avoid that confusion?
>
> I was looking for an incremental improvement, so I imagined something
> like various output flagging "yes, there are duplicate files for this
> message", and letting users dig those out using something like the
> --duplicate= option.

This kind of output flagging would be worthwhile in its own right, and
maybe is an even less controversial place to start for the incremental
improvement.

>> It also occurs to me that one of the things i'd love to have is
>> well-indexed notes about any given e-mail. So if this was adopted, i
>> could presumably just write a file that has the same Message-Id as the
>> message, put my notes in it, and index it. that's a little weird,
>> though. would there be a better way to do such a thing?
>
> One option would be to use a note=foo message property. That's not
> immediately searchable though, although we could kludge together
> something like the subject regexp search which would be slower.

right, i think i'd want the notes to be searchable, if possible.

Now i'm thinking about attack scenarios for this multi-indexed scheme,
though. If i know that you've already gotten an e-mail with message-id
X, then i can go ahead and remotely, silently add search terms to that
message by sending you new messages that have the same message-id. That
seems troubling :/

The status quo at least requires the attacker to win a race to get
their message indexed first, obscuring the real message. in the
proposed new scenario, the attacker doesn't need to win any race. they
can't prevent the true message from being indexed, but they can
associate it with whatever toxicity (e.g. "viagra", or "From:
killfiled-user") they want, which might be useful in suppressing the
message in a post-processing run.

ugh, mail,

    --dkg

* Re: [RFC patch 2/2] lib: index message files with duplicate message-ids
From: David Bremner @ 2017-03-18 21:31 UTC
To: Daniel Kahn Gillmor, notmuch

Daniel Kahn Gillmor <dkg@fifthhorseman.net> writes:

> On Thu 2017-03-16 20:34:22 -0400, David Bremner wrote:
>> Daniel Kahn Gillmor <dkg@fifthhorseman.net> writes:
>>>  0) what happens when one of the files gets deleted from the message
>>>     store? do the terms it contributes get removed from the index?
>>
>> That's a good question, and an issue I hadn't thought about.
>> Currently there's no way to do this short of deleting all the terms
>> for all the files (excepting tags and properties, presumably) and
>> reindexing. This will require some more thought, I think.
>
> i didn't mean to raise the concern to drag this work down, i just want
> to make sure the problem is on the table. dropping all terms on
> deletion and re-indexing remaining files with the same message ID isn't
> terribly efficient, but i don't think it's going to be terribly costly
> either. we're not talking about hundreds of files per message-id in
> most normal cases; usually only two (sent-to-self,
> recvd-from-mailing-list), and maybe a half-dozen at most (messages sent
> to multiple mailboxes that all forward to me).

I can think of 3 general approaches at the moment. They each have (at
least) one gotcha; more precisely, they each require some added
complexity somewhere else in the codebase.

One is this one: just add all the terms to one xapian document. The
gotcha is needing some reindexing facility (we want this for other
reasons, so that might not be so bad).

The second approach that occurs to me is to still add the terms to one
xapian document, but to prefix them with a number identifying the file
copy (1, 2, etc.). The complexity here is in the generation of queries:
each term needs to be ORed with its prefixed variants, e.g.
SUBJECT:foo or 1#SUBJECT:foo or 2#SUBJECT:foo. I'm not really sure
offhand how to do that without field processors. I'm also not sure
about the performance impact.

The third approach is to create extra xapian documents per file, which
have a different document type (from the notmuch point of view). Here
the complexity will be dealing with the returned documents from a
xapian query. We can probably use a wildcard search on the type (mail,
mail1, mail2, etc.) to make the queries reasonably easy. My gut
feeling is that this is the "right" approach, although it will be a
bit more complicated to get started. It will also require changing our
idea of threads in the "structured output", so that a thread looks
something like

    (thread
      (message (instance/file) (instance/file))
      (message (instance/file)))

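The second and third approaches can be sketched side by side with plain Python data structures. This is a toy model under stated assumptions, not Xapian code: the function names, the dictionary layout, and the choice to leave the first copy unprefixed (matching the `SUBJECT:foo or 1#SUBJECT:foo` example above) are all made up for illustration.

```python
from collections import defaultdict

# Two files sharing one Message-ID, as in the test case from patch 1/2.
FILES = [
    ("duplicate", "MAIL_DIR/msg-001", "message 1"),
    ("duplicate", "MAIL_DIR/msg-002", "message 2"),
]


# Approach 2: one document per message, terms prefixed with a copy number.
def index_prefixed(files):
    docs = defaultdict(set)    # message-id -> prefixed terms
    copies = defaultdict(int)  # message-id -> copies seen so far
    for mid, _path, subject in files:
        n = copies[mid]
        copies[mid] += 1
        prefix = "" if n == 0 else f"{n}#"  # first copy stays unprefixed
        docs[mid].update(f"{prefix}SUBJECT:{w}" for w in subject.split())
    return docs, max(copies.values())


def expand(term, max_copies):
    """Expand one query term into the OR-list of its per-copy variants."""
    return [term] + [f"{n}#{term}" for n in range(1, max_copies)]


def search_prefixed(docs, term, max_copies):
    variants = set(expand(term, max_copies))
    return sorted(mid for mid, terms in docs.items() if terms & variants)


# Approach 3: one document per file; hits are grouped back into messages.
def index_per_file(files):
    return [
        {"mid": mid, "path": path,
         "terms": {f"SUBJECT:{w}" for w in subject.split()}}
        for mid, path, subject in files
    ]


def search_per_file(docs, term):
    grouped = defaultdict(list)  # message-id -> paths of matching copies
    for doc in docs:
        if term in doc["terms"]:
            grouped[doc["mid"]].append(doc["path"])
    return dict(grouped)


docs2, max_copies = index_prefixed(FILES)
docs3 = index_per_file(FILES)
print(expand("SUBJECT:2", max_copies))
print(search_prefixed(docs2, "SUBJECT:2", max_copies))
print(search_per_file(docs3, "SUBJECT:2"))
```

In this model the per-file variant reports which copy matched (only `MAIL_DIR/msg-002` for the second subject), whereas the prefixed variant only identifies the message, at the cost of a query that grows with the largest number of copies.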
* Re: [RFC patch 2/2] lib: index message files with duplicate message-ids
From: Jani Nikula @ 2017-03-22 17:29 UTC
To: David Bremner, Daniel Kahn Gillmor, notmuch

On Thu, 16 Mar 2017, David Bremner <david@tethera.net> wrote:
> Daniel Kahn Gillmor <dkg@fifthhorseman.net> writes:
>
>> On Wed 2017-03-15 21:57:28 -0400, David Bremner wrote:
>>> The corresponding xapian document just gets more terms added to it,
>>> but this doesn't seem to break anything.
>>
>> this is an interesting suggestion. thanks for proposing it!
>>
>> A couple questions:
>>
>>  0) what happens when one of the files gets deleted from the message
>>     store? do the terms it contributes get removed from the index?
>>
>
> That's a good question, and an issue I hadn't thought about.
> Currently there's no way to do this short of deleting all the terms
> for all the files (excepting tags and properties, presumably) and
> reindexing. This will require some more thought, I think.

We already see some of this issue. First file gets indexed, second file
gets added, first file gets removed.

There's also the related problem of reindexing potentially changing the
file being indexed and returned. The first time around the indexing
order is likely the order the message files were received in; on
reindexing it's the order the message files are encountered in the file
system. I presume the patch at hand keeps the search terms that find the
messages the same regardless of the indexing order.

BR,
Jani.

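Both points raised here, removal of one file and the effect of indexing order, can be illustrated with a small model in which a message's terms are simply the union of the terms of its files. The `FILE_TERMS` table and the two helpers are hypothetical and only stand in for whatever a real reindexing facility would do.

```python
# Hypothetical on-disk files and the terms each one would contribute.
FILE_TERMS = {
    "msg-001": {"message", "1"},
    "msg-002": {"message", "2"},
}


def index(paths):
    """Union the terms of all files of one message, in the given order.

    Returns the term set and the first file seen, standing in for the
    file a client would be handed when the message is displayed.
    """
    terms = set()
    for path in paths:
        terms |= FILE_TERMS[path]
    return terms, (paths[0] if paths else None)


def remove_file(paths, removed):
    """Drop all terms and re-index from whatever files remain."""
    remaining = [p for p in paths if p != removed]
    terms, first = index(remaining)
    return remaining, terms, first


# Order independence: the term set is the same, the "first" file is not.
terms_ab, first_ab = index(["msg-001", "msg-002"])
terms_ba, first_ba = index(["msg-002", "msg-001"])

# First file indexed, second added, first removed.
remaining, terms_after, first_after = remove_file(["msg-001", "msg-002"], "msg-001")
print(terms_ab == terms_ba, first_ab, first_ba)
print(remaining, sorted(terms_after), first_after)
```

Under this model the searchable terms do not depend on indexing order, because set union is commutative, but the file returned for display does; and after a removal, a full re-index is what gets rid of the terms that only the deleted file contributed.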
* Re: [RFC patch 2/2] lib: index message files with duplicate message-ids
From: Mark Walters @ 2017-03-17 5:47 UTC
To: Daniel Kahn Gillmor, David Bremner, notmuch

Hi

Just a comment on your last point:

> It also occurs to me that one of the things i'd love to have is
> well-indexed notes about any given e-mail. So if this was adopted, i
> could presumably just write a file that has the same Message-Id as the
> message, put my notes in it, and index it. that's a little weird,
> though. would there be a better way to do such a thing?

A different way, which might get pretty close to what you want, would
be to start a reply and then postpone it. Ideally we would wrap this in
a "note" function that would delete the to/cc/bcc headers, to make sure
it doesn't accidentally get sent, and add a +note tag when saving.

Best wishes

Mark