From: David Bremner
To: Olaf TNSB, Steven Allen
Cc: notmuch@notmuchmail.org
Subject: Re: Add (extracted) attachment text to the search index?
References: <87inntut68.fsf@tethera.net> <877f48lw4s.fsf@bistromath>
Date: Wed, 01 Mar 2017 20:41:51 -0400
Message-ID: <87varscxww.fsf@tethera.net>

Olaf TNSB writes:

> On Thu, Mar 2, 2017 at 4:55 AM, Steven Allen wrote:
>>
>> David Bremner writes:
>> > This would require some modifications of notmuch. Either modifying
>> > lib/index.cc to add the terms at indexing (notmuch new/insert) time, or
>> > providing some way of adding the terms later. The former actually sounds
>> > simpler to me.
>>
>> To do this correctly, you'd want to be able to run an external text
>> extraction tool (for PDFs, word documents, etc.) so I think the latter
>> would be better in the long run (it would allow the user to index
>> attachments in the hooks).
>
> (As a non-dev...) I agree. The ability to add (and delete!) content
> post-insert sounds more desirable. I don't want to have to re-index all my
> email as the next version of -to-text gets
> released. I'd like to be able to (search-for-attachment)-(delete)-(re-add).

There have been some patches (related to encrypted email) that reindex
individual messages. So that would be a clean fix for that. I haven't
really thought about reindexing parts of messages, which is what seems
to be the proposal here.

d
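
For illustration only, here is a minimal sketch of the extraction step
discussed in this thread, assuming Python and its standard email package.
The names EXTRACTORS and attachment_texts are hypothetical, and nothing
here calls notmuch: how the extracted text would be attached to (or later
removed from) a message's index entry is exactly the open question above.

```python
import email
from email import policy

# Hypothetical extractor table: content type -> function from bytes to text.
# A real setup would presumably run an external tool for PDFs, Word
# documents, etc.; only plain-text attachments are handled in this sketch.
EXTRACTORS = {
    "text/plain": lambda data: data.decode("utf-8", errors="replace"),
}


def attachment_texts(raw_message: bytes):
    """Yield (filename, extracted_text) for each attachment we can handle."""
    msg = email.message_from_bytes(raw_message, policy=policy.default)
    for part in msg.iter_attachments():
        extractor = EXTRACTORS.get(part.get_content_type())
        if extractor is None:
            continue  # no extractor registered for this content type
        payload = part.get_payload(decode=True) or b""
        yield part.get_filename(), extractor(payload)
```

Keeping the extractors in a table keyed by content type means a newer
extraction tool can be swapped in without touching the walking logic,
which fits the (search-for-attachment)-(delete)-(re-add) workflow
requested above.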