From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from localhost (localhost [127.0.0.1])
	by arlo.cworth.org (Postfix) with ESMTP id 489576DE184F
	for ; Sun, 18 Mar 2018 11:22:51 -0700 (PDT)
X-Virus-Scanned: Debian amavisd-new at cworth.org
X-Spam-Flag: NO
X-Spam-Score: 0
X-Spam-Level:
X-Spam-Status: No, score=0 tagged_above=-999 required=5
	tests=[AWL=0.011, SPF_PASS=-0.001, T_RP_MATCHES_RCVD=-0.01]
	autolearn=disabled
Received: from arlo.cworth.org ([127.0.0.1])
	by localhost (arlo.cworth.org [127.0.0.1]) (amavisd-new, port 10024)
	with ESMTP id kUg-VWr2jU10
	for ; Sun, 18 Mar 2018 11:22:50 -0700 (PDT)
Received: from fethera.tethera.net (fethera.tethera.net [198.245.60.197])
	by arlo.cworth.org (Postfix) with ESMTPS id 52D5E6DE184E
	for ; Sun, 18 Mar 2018 11:22:50 -0700 (PDT)
Received: from remotemail by fethera.tethera.net with local (Exim 4.89)
	(envelope-from ) id 1excxE-0000MI-W6; Sun, 18 Mar 2018 14:22:49 -0400
Received: (nullmailer pid 21731 invoked by uid 1000);
	Sun, 18 Mar 2018 18:22:47 -0000
From: David Bremner
To: Daniel Kahn Gillmor , notmuch@notmuchmail.org
Subject: Re: multilingual notmuch (and Content-Language)
In-Reply-To: <87d101v33s.fsf@fifthhorseman.net>
References: <87d101v33s.fsf@fifthhorseman.net>
Date: Sun, 18 Mar 2018 15:22:47 -0300
Message-ID: <87tvtdkzuw.fsf@tethera.net>
MIME-Version: 1.0
Content-Type: text/plain
X-BeenThere: notmuch@notmuchmail.org
X-Mailman-Version: 2.1.26
Precedence: list
List-Id: "Use and development of the notmuch mail system."
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Sun, 18 Mar 2018 18:22:51 -0000

Daniel Kahn Gillmor writes:

> AIUI, xapian is pretty much committed to being a single-language
> indexer. But i just wanted to point out that it's possible that we
> could be smarter about this in notmuch, and wanted to make a space for
> possible design discussion.
>

More precisely, it uses a single _stemmer_ when generating terms and
when parsing queries.
Nothing says that this stemmer has to correspond to a single human
language. The stemmer is also configured at runtime, so it could in
principle be configurable per database.

I mention the possibility of a custom stemmer because that also seems
like a natural place to put things like Unicode normalization and
accent removal.

d
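
To make the last point concrete, here is a minimal sketch of what
"normalize, strip accents, then stem" could look like. It is plain
Python using only the standard library, not Xapian's API, and the
names fold_accents, toy_stem and normalize_then_stem are made up for
illustration. The toy stemmer is a deliberately crude stand-in for a
real one; the only point is that the same function would have to run
both when generating terms and when parsing queries.

```python
import unicodedata


def fold_accents(word: str) -> str:
    """Decompose to NFKD, drop combining marks, then recompose to NFC."""
    decomposed = unicodedata.normalize("NFKD", word)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def toy_stem(word: str) -> str:
    """Hypothetical stand-in for a real stemmer: strip one trailing 's'."""
    if len(word) > 3 and word.endswith("s"):
        return word[:-1]
    return word


def normalize_then_stem(word: str) -> str:
    """Case-fold, remove accents, then stem (assumed order, for illustration)."""
    return toy_stem(fold_accents(word.casefold()))


# The same function is applied on both sides, so an accented plural in a
# message and an unaccented singular in a query map to the same term.
indexed_term = normalize_then_stem("Cafés")
query_term = normalize_then_stem("cafe")
```

With this sketch, indexed_term and query_term compare equal, which is
the property a shared index-time and query-time stemmer is meant to
give.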