From mboxrd@z Thu Jan 1 00:00:00 1970
From: David Bremner
To: Daniel Kahn Gillmor, notmuch@notmuchmail.org
Subject: Re: [RFC patch 2/2] lib: index message files with duplicate message-ids
In-Reply-To: <874lyronu5.fsf@alice.fifthhorseman.net>
References: <20170316015728.29325-1-david@tethera.net> <20170316015728.29325-3-david@tethera.net> <87r31xnkts.fsf@alice.fifthhorseman.net> <8760j8n3ld.fsf@tethera.net> <874lyronu5.fsf@alice.fifthhorseman.net>
Date: Sat, 18 Mar 2017 18:31:44 -0300
Message-ID: <87efxul1a7.fsf@tethera.net>
List-Id: "Use and development of the notmuch mail system."

Daniel Kahn Gillmor writes:

> On Thu 2017-03-16 20:34:22 -0400, David Bremner wrote:
>> Daniel Kahn Gillmor writes:
>>> 0) what happens when one of the files gets deleted from the message
>>> store? do the terms it contributes get removed from the index?
>>
>> That's a good question, and an issue I hadn't thought about.
>> Currently there's no way to do this short of deleting all the terms for
>> all the files (excepting tags and properties, presumably) and
>> reindexing. This will require some more thought, I think.
>
> i didn't mean to raise the concern to drag this work down, i just want
> to make sure the problem is on the table. dropping all terms on
> deletion and re-indexing remaining files with the same message ID isn't
> terribly efficient, but i don't think it's going to be terribly costly
> either. we're not talking about hundreds of files per message-id in
> most normal cases; usually only two (sent-to-self,
> recvd-from-mailing-list), and maybe a half-dozen at most (messages sent
> to multiple mailboxes that all forward to me).

I can think of 3 general approaches at the moment. They each have (at
least) one gotcha; more precisely, they each require some added
complexity somewhere else in the codebase.

The first is the one in this patch: just add all the terms to one
Xapian document. The gotcha is needing some reindexing facility (we
want this for other reasons, so that might not be so bad).

The second approach that occurs to me is to still add the terms to one
Xapian document, but to prefix them with a number identifying the file
copy (1, 2, etc.). The complexity here is in the generation of queries:
each one needs to be expanded into an OR over the copies, e.g.
SUBJECT:foo or 1#SUBJECT:foo or 2#SUBJECT:foo. I'm not really sure
offhand how to do that without field processors. I'm also not sure
about the performance impact.

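To make the query expansion in the second approach concrete, here is a rough sketch in Python. It only does string manipulation and does not touch the Xapian API. The function name and the max_copies parameter are made up for illustration; the N#PREFIX spelling is the one used above. Note the assumption that an upper bound on the copy number is known when the query is built.

```python
def expand_field_query(field, value, max_copies):
    """Expand FIELD:value into an OR over the unnumbered term and the
    numbered per-copy terms 1#FIELD:value .. max_copies#FIELD:value.

    Hypothetical helper: max_copies is assumed to be known up front.
    """
    base = f"{field}:{value}"
    variants = [base] + [f"{n}#{base}" for n in range(1, max_copies + 1)]
    return " or ".join(variants)


print(expand_field_query("SUBJECT", "foo", 2))
# prints: SUBJECT:foo or 1#SUBJECT:foo or 2#SUBJECT:foo
```

Every prefixed term in a user query would need this treatment, which is where the worry about query size and performance comes from.
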
The third approach is to create extra Xapian documents per file, which
have a different document type (from the notmuch point of view). Here
the complexity will be dealing with the returned documents from a
Xapian query. We can probably use a wildcard search on the type (mail,
mail1, mail2, etc...) to make the queries reasonably easy. My gut
feeling is that this is the "right" approach, although it will be a bit
more complicated to get started. It will also require changing our idea
of threads in the "structured output", where a thread looks something
like

  (thread
    (message (instance/file) (instance/file))
    (message (instance/file)))
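
As a rough illustration of the extra work the third approach pushes onto the result handling, here is a Python sketch that folds per-file documents back into the thread/message/instance shape. Everything in it is an assumption made for the example: the dict fields, the function names, and the use of a "mail" prefix test as a stand-in for the wildcard match on the type. It is not how notmuch stores or returns documents.

```python
def group_instances(docs):
    """Group per-file documents by thread, then by message-id.

    Each doc is a hypothetical dict with keys: type, thread,
    message_id, filename. Insertion order is preserved.
    """
    threads = {}
    for doc in docs:
        # Stand-in for a wildcard match on type: mail, mail1, mail2, ...
        if not doc["type"].startswith("mail"):
            continue
        messages = threads.setdefault(doc["thread"], {})
        messages.setdefault(doc["message_id"], []).append(doc["filename"])
    return threads


def to_sexp(threads):
    """Render each thread as (thread (message (instance F) ...) ...)."""
    rendered = []
    for messages in threads.values():
        msgs = " ".join(
            "(message " + " ".join(f"(instance {f})" for f in files) + ")"
            for files in messages.values()
        )
        rendered.append(f"(thread {msgs})")
    return rendered


docs = [
    {"type": "mail", "thread": "t1", "message_id": "a", "filename": "sent/1"},
    {"type": "mail1", "thread": "t1", "message_id": "a", "filename": "list/7"},
    {"type": "mail", "thread": "t1", "message_id": "b", "filename": "inbox/3"},
    {"type": "other", "thread": "t1", "message_id": "c", "filename": "x/9"},
]
print(to_sexp(group_instances(docs)))
```

Here the two copies of message "a" end up as two instances under one message, and the document with a non-mail type is skipped.
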