From mboxrd@z Thu Jan 1 00:00:00 1970
From: Daniel Kahn Gillmor
To: Jani Nikula, notmuch@notmuchmail.org, Olly Betts
Subject: Re: multilingual notmuch (and Content-Language)
In-Reply-To: <87bmflrxgs.fsf@nikula.org>
References: <87d101v33s.fsf@fifthhorseman.net> <87bmflrxgs.fsf@nikula.org>
Date: Mon, 19 Mar 2018 07:38:07 +0000
Message-ID: <87vadstt0g.fsf@fifthhorseman.net>
MIME-Version: 1.0
Content-Type: text/plain

On Sun 2018-03-18 21:32:35 +0200, Jani Nikula wrote:
> On Sun, 18 Mar 2018, Daniel Kahn Gillmor wrote:
>> * if we know our index expects english, and we have a message part that
>>   *is not* english (e.g. Content-Language: es), we could avoid indexing
>>   that part.
>
> Why would we do that? Search mostly works just fine for non-English
> languages, it's just that the *stemming* is not right.
>
>> what do you think? what ideas are missing from the brainstorm above? I'd
>> love to hear from people with multilingual mailboxes about how we might
>> be able to make notmuch work better for them.
>
> With my limited understanding of this, stemming happens both at indexing
> and searching. Basically at indexing, the term generator indexes both
> the full and the stemmed version of words. I'm wondering if we could
> look at Content-Language (and missing that, heuristics), and (if the
> user so desires) use multiple term generators with different stemmers on
> a per document basis. Or, use non-stemming indexing for unidentified or
> unsupported languages. How far would that take us? Then, perhaps, we
> could also perform language specific queries?
>
> I don't know how feasible that is, or if it would require Xapian
> changes.

thanks, this is exactly the kind of promising idea i was hoping my dumb
questions and half-baked suggestions would provoke :)

Maybe Olly or someone else with deeper knowledge of xapian can weigh in
about the feasibility of this proposal?

        --dkg
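
To make the per-document idea quoted above a bit more concrete, here is a minimal sketch of just the selection step: read Content-Language from a message and pick a stemmer name, falling back to no stemming for unidentified or unsupported languages. This is not notmuch or Xapian code. The function name `pick_stemmer` and the mapping table are hypothetical, and the stemmer names are only assumed to match what a Xapian stemmer would accept ("none" standing for no stemming). The heuristics for messages without the header are left out entirely.

```python
from email import message_from_string

# Hypothetical table: primary language subtag -> stemmer language name.
# The names are assumed to be ones a Xapian stemmer accepts; the table
# itself is an illustration, not something notmuch ships.
STEMMER_FOR_LANGUAGE = {
    "en": "english",
    "es": "spanish",
    "de": "german",
    "fr": "french",
    "fi": "finnish",
}


def pick_stemmer(raw_message: str, default: str = "none") -> str:
    """Return a stemmer name for a message based on Content-Language.

    Falls back to `default` (no stemming) when the header is missing
    or names a language that is not in the table.
    """
    msg = message_from_string(raw_message)
    header = msg.get("Content-Language")
    if not header:
        return default
    # The header may list several tags ("es, en"); use the first one.
    first_tag = header.split(",")[0].strip().lower()
    # Reduce a tag such as "es-mx" to its primary subtag "es".
    primary = first_tag.split("-")[0]
    return STEMMER_FOR_LANGUAGE.get(primary, default)
```

An indexer could call something like this once per message (or per MIME part) and hand the result to whichever term generator it sets up for that document. Whether Xapian can then search sensibly across documents stemmed in different languages is exactly the open feasibility question above.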