From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <david@tethera.net>
Received: from localhost (localhost [127.0.0.1])
 by arlo.cworth.org (Postfix) with ESMTP id A9E226DE169D
 for <notmuch@notmuchmail.org>; Sat, 18 Mar 2017 14:31:51 -0700 (PDT)
X-Virus-Scanned: Debian amavisd-new at cworth.org
X-Spam-Flag: NO
X-Spam-Score: -0.005
X-Spam-Level: 
X-Spam-Status: No, score=-0.005 tagged_above=-999 required=5 tests=[AWL=0.006, 
 SPF_PASS=-0.001, T_RP_MATCHES_RCVD=-0.01] autolearn=disabled
Received: from arlo.cworth.org ([127.0.0.1])
 by localhost (arlo.cworth.org [127.0.0.1]) (amavisd-new, port 10024)
 with ESMTP id B0Pqmib4OpEi for <notmuch@notmuchmail.org>;
 Sat, 18 Mar 2017 14:31:49 -0700 (PDT)
Received: from fethera.tethera.net (fethera.tethera.net [198.245.60.197])
 by arlo.cworth.org (Postfix) with ESMTPS id 8D9286DE14FC
 for <notmuch@notmuchmail.org>; Sat, 18 Mar 2017 14:31:48 -0700 (PDT)
Received: from remotemail by fethera.tethera.net with local (Exim 4.84_2)
 (envelope-from <david@tethera.net>)
 id 1cpLwF-0001dy-AR; Sat, 18 Mar 2017 17:31:03 -0400
Received: (nullmailer pid 25070 invoked by uid 1000);
 Sat, 18 Mar 2017 21:31:44 -0000
From: David Bremner <david@tethera.net>
To: Daniel Kahn Gillmor <dkg@fifthhorseman.net>, notmuch@notmuchmail.org
Subject: Re: [RFC patch 2/2] lib: index message files with duplicate
 message-ids
In-Reply-To: <874lyronu5.fsf@alice.fifthhorseman.net>
References: <20170316015728.29325-1-david@tethera.net>
 <20170316015728.29325-3-david@tethera.net>
 <87r31xnkts.fsf@alice.fifthhorseman.net> <8760j8n3ld.fsf@tethera.net>
 <874lyronu5.fsf@alice.fifthhorseman.net>
Date: Sat, 18 Mar 2017 18:31:44 -0300
Message-ID: <87efxul1a7.fsf@tethera.net>
MIME-Version: 1.0
Content-Type: text/plain
X-BeenThere: notmuch@notmuchmail.org
X-Mailman-Version: 2.1.22
Precedence: list
List-Id: "Use and development of the notmuch mail system."
 <notmuch.notmuchmail.org>
List-Unsubscribe: <https://notmuchmail.org/mailman/options/notmuch>,
 <mailto:notmuch-request@notmuchmail.org?subject=unsubscribe>
List-Archive: <http://notmuchmail.org/pipermail/notmuch/>
List-Post: <mailto:notmuch@notmuchmail.org>
List-Help: <mailto:notmuch-request@notmuchmail.org?subject=help>
List-Subscribe: <https://notmuchmail.org/mailman/listinfo/notmuch>,
 <mailto:notmuch-request@notmuchmail.org?subject=subscribe>
X-List-Received-Date: Sat, 18 Mar 2017 21:31:51 -0000

Daniel Kahn Gillmor <dkg@fifthhorseman.net> writes:

> On Thu 2017-03-16 20:34:22 -0400, David Bremner wrote:
>> Daniel Kahn Gillmor <dkg@fifthhorseman.net> writes:
>>>  0) what happens when one of the files gets deleted from the message
>>>     store? do the terms it contributes get removed from the index?
>>
>> That's a good question, and an issue I hadn't thought about.
>> Currently there's no way to do this short of deleting all the terms for
>> all the files (excepting tags and properties, presumably) and
>> reindexing. This will require some more thought, I think.
>
> i didn't mean to raise the concern to drag this work down, i just want
> to make sure the problem is on the table.  dropping all terms on
> deletion and re-indexing remaining files with the same message ID isn't
> terribly efficient, but i don't think it's going to be terribly costly
> either.  we're not talking about hundreds of files per message-id in
> most normal cases; usually only two (sent-to-self,
> recvd-from-mailing-list), and maybe a half-dozen at most (messages sent
> to multiple mailboxes that all forward to me).

I can think of 3 general approaches at the moment. They each have (at
least) one gotcha; more precisely they each require some added
complexity somewhere else in the codebase.

The first is the one in this patch: just add all the terms to one
xapian document. The gotcha is needing some reindexing facility (we want
this for other reasons, so that might not be so bad).
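
As a toy illustration of that gotcha (plain Python sets standing in for
a xapian document; this is not the Xapian API, and every name below is
made up for the example): once the terms of several files are merged,
nothing records which file contributed which term, so deleting a file
means rebuilding the term set from the copies that remain.

```python
# Toy model only: a "document" here is a set of terms, not a Xapian::Document.
# File paths and term strings are hypothetical.

def index_terms(files):
    """Merge the terms of every file copy into a single term set."""
    merged = set()
    for terms in files.values():
        merged |= set(terms)
    return merged


def delete_file(files, path):
    """Drop one copy, then rebuild the merged terms from the remaining copies."""
    remaining = {p: t for p, t in files.items() if p != path}
    return remaining, index_terms(remaining)


files = {
    "sent/1": ["subject:foo", "body:hello"],
    "list/1": ["subject:foo", "body:hello", "body:footer"],
}
doc = index_terms(files)

# The merged set cannot tell us that only "body:footer" belongs solely to
# list/1, so the only safe removal is a full rebuild from the other copies.
files, doc = delete_file(files, "list/1")
```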

The second approach that occurs to me is to still add the terms to one
xapian document, but to prefix them with a number identifying the file
copy (1, 2, etc.). The complexity here is in the generation of queries:
each term needs to be expanded into an OR over the file copies, e.g.
SUBJECT:foo or 1#SUBJECT:foo or 2#SUBJECT:foo. I'm not really sure
offhand how to do that without field processors. I'm also not sure
about the performance impact.
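
A string-level sketch of that expansion, just to make the shape of the
generated query concrete. It does not use Xapian at all; the function
names and the max_copies parameter are assumptions for illustration,
and a real implementation would need some bound on the number of copies
per message.

```python
# Sketch only: builds the expanded query as a string, not as a Xapian query.
# expand_term, expand_query and max_copies are hypothetical names.

def expand_term(field, value, max_copies):
    """Return the alternatives to be OR-ed for one field:value term."""
    plain = f"{field}:{value}"
    numbered = [f"{n}#{field}:{value}" for n in range(1, max_copies + 1)]
    return [plain] + numbered


def expand_query(field, value, max_copies):
    """Join the alternatives with 'or', as in the example above."""
    return " or ".join(expand_term(field, value, max_copies))


print(expand_query("SUBJECT", "foo", 2))
# prints: SUBJECT:foo or 1#SUBJECT:foo or 2#SUBJECT:foo
```

The number of alternatives grows linearly with max_copies for every
term in the query, which is one place the performance question would
show up.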

The third approach is to create extra xapian documents per file, which have
a different document type (from the notmuch point of view). Here the
complexity will be dealing with the returned documents from a xapian
query. We can probably use a wildcard search on the type (mail, mail1,
mail2, etc...) to make the queries reasonably easy. My gut feeling is
that this is the "right" approach, although it will be a bit more
complicated to get started.  It will also require changing our idea of
threads in the "structured output" where a thread looks something like

(thread
       (message
          (instance/file)
          (instance/file))
       (message
          (instance/file))