From: David Bremner
To: Bruno Deremble, notmuch@notmuchmail.org
Cc: Stefano Zacchiroli
Subject: Re: accented characters
Date: Mon, 13 Nov 2017 09:22:36 -0400
Message-ID: <87efp2b9er.fsf@tethera.net>
In-Reply-To: <87h8tz8b2v.fsf@ens.fr>
References: <87h8tz8b2v.fsf@ens.fr>

Bruno Deremble writes:

> A way to handle this could be to only index non accented words which
> requires to add a filter before the indexing process.
> I looked at the code
> and it seems that this should be handled by gmime?
> there are also libraries that are supposed to do that such as 'unac'.
>
> Is it something that you have been exploring already?

We have discussed it a bit (with another francophone, in copy ;) ), but
I think no one got very far.

I guess the ideal case would be to support both accented and
accent-free search. That would require adding more terms to the index
(both the accented and the unaccented version of each word). It's not
clear to me yet what kind of performance impact that would have.

Xapian already has something called "stemmers" (in
xapian-core/languages in the source tree), which do, among other
things, strip accents. Those are generally targeted at a single
language, which I suspect is not very useful for notmuch (even I, as a
mostly-unilingual person, have a fair amount of English, French, and
German in my mailstore). Nonetheless a custom stemmer might be the right
way to go, since that step is happening anyway. Or perhaps people would
be happy enough with being able to set the stemmer (currently it is
hardcoded to English). That would be a relatively easy change to
notmuch, but I don't know how many people would find it a good tradeoff
to lose English stemming (i.e. searches for 'stem' and 'stemming' being
equivalent) in exchange for de-accenting.

I'm not sure if the query language would need to support the
distinction between accented and unaccented searches. I imagine that
people naturally type the non-accented versions in a search, but I do
wonder about cases like (German) München. Should that be stemmed to
Munchen or Muenchen?

The other thing I don't know is how many people would be happy with just
stripping all accents. That could be done in a gmime filter, as you
suggest. That would be more likely to require changes to the query
language. Offhand I don't know how to transparently de-accent all query
words.
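To make the "index both versions" idea and the München question
concrete, here is a minimal sketch in Python rather than in notmuch's
own C/C++. It is only an illustration under stated assumptions: the
function names and the small German mapping table are hypothetical, and
none of this is notmuch, Xapian, or gmime code. It folds accents by
Unicode decomposition and returns every distinct variant of a word that
could be added to the index.

```python
import unicodedata

# Hypothetical mapping for the "Muenchen" convention; an assumption for
# this sketch, not something taken from notmuch or Xapian.
GERMAN_DIGRAPHS = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}


def strip_accents(word):
    """Decompose to NFD and drop combining marks (ü -> u, é -> e)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def german_fold(word):
    """Alternative folding that spells out umlauts (ü -> ue)."""
    composed = unicodedata.normalize("NFC", word)
    return "".join(GERMAN_DIGRAPHS.get(ch, ch) for ch in composed)


def index_terms(word):
    """Return the lowercased word followed by its distinct folded variants."""
    base = unicodedata.normalize("NFC", word.lower())
    terms = [base]
    for variant in (strip_accents(base), german_fold(base)):
        if variant not in terms:
            terms.append(variant)
    return terms


print(index_terms("München"))   # ['münchen', 'munchen', 'muenchen']
print(index_terms("été"))       # ['été', 'ete']
print(index_terms("stemming"))  # ['stemming']
```

Unaccented words produce a single term, so the extra index size in this
scheme would depend on how much accented text a mailstore actually
contains. Emitting both 'munchen' and 'muenchen' sidesteps the choice
between them at the cost of one more term.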
d