From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from localhost (localhost [127.0.0.1])
 by arlo.cworth.org (Postfix) with ESMTP id 95BFE6DE1003
 for ; Mon, 13 Nov 2017 06:41:13 -0800 (PST)
X-Virus-Scanned: Debian amavisd-new at cworth.org
X-Spam-Flag: NO
X-Spam-Score: -1.753
X-Spam-Level:
X-Spam-Status: No, score=-1.753 tagged_above=-999 required=5
 tests=[AWL=-1.744, HEADER_FROM_DIFFERENT_DOMAINS=0.001,
 T_RP_MATCHES_RCVD=-0.01] autolearn=disabled
Received: from arlo.cworth.org ([127.0.0.1])
 by localhost (arlo.cworth.org [127.0.0.1]) (amavisd-new, port 10024)
 with ESMTP id nPiLp2At1JzV for ; Mon, 13 Nov 2017 06:41:12 -0800 (PST)
X-Greylist: delayed 352 seconds by postgrey-1.36 at arlo;
 Mon, 13 Nov 2017 06:41:11 PST
Received: from upsilon.cc (upsilon.cc [178.32.142.91])
 by arlo.cworth.org (Postfix) with ESMTP id D59336DE0FE6
 for ; Mon, 13 Nov 2017 06:41:11 -0800 (PST)
Received: from scaramouche.takhisis.invalid (unknown [78.194.69.54])
 by upsilon.cc (Postfix) with ESMTPSA id B2CD6103A0;
 Mon, 13 Nov 2017 15:35:15 +0100 (CET)
Received: by scaramouche.takhisis.invalid (Postfix, from userid 1000)
 id 89E5C1CC017D; Mon, 13 Nov 2017 15:35:15 +0100 (CET)
Date: Mon, 13 Nov 2017 15:35:15 +0100
From: Stefano Zacchiroli
To: David Bremner
Cc: Bruno Deremble , notmuch@notmuchmail.org
Subject: Re: accented characters
Message-ID: <20171113143515.5hbnsma72r24qutf@upsilon.cc>
References: <87h8tz8b2v.fsf@ens.fr> <87efp2b9er.fsf@tethera.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <87efp2b9er.fsf@tethera.net>
User-Agent: NeoMutt/20170609 (1.8.3)
X-BeenThere: notmuch@notmuchmail.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: "Use and development of the notmuch mail system."
X-List-Received-Date: Mon, 13 Nov 2017 14:41:13 -0000

On Mon, Nov 13, 2017 at 09:22:36AM -0400, David Bremner wrote:
> The other thing I don't know is how many people would be happy with just
> stripping all accents. That could be done in a gmime filter, as you
> suggest. That would be more likely to require changes to the query
> language. Off hand I don't know how to transparently de-accent all query
> words.

My gut feeling is that removing accents by default from both the terms in
the index and user queries would go a long way toward addressing this
problem, especially if it is controlled by a boolean option in the
notmuch config (which defaults to stripping accents).

As a random example/data point, Chromium does that: when you search for
an unaccented string in a web page, it finds every variant of that string
with accents. It is, by far, the best UX I have had w.r.t. accents on
GNU/Linux.

Unicode has a notion of canonical decomposition that rewrites accented
characters as a sequence of unaccented base characters + combining marks:
https://en.wikipedia.org/wiki/Unicode_equivalence . A number of libraries
use it to normalize away accents in Unicode strings. I'm aware of a few
in Python, for instance, but not in C++ (which I believe is what you'd be
interested in).

HTH,
-- 
Stefano Zacchiroli . zack@upsilon.cc . upsilon.cc/zack . . o . . . o . o
Computer Science Professor . CTO Software Heritage . . . . . o . . . o o
Former Debian Project Leader & OSI Board Director  . . . o o o . . . o .
« the first rule of tautology club is the first rule of tautology club »
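
For illustration only, here is a minimal Python sketch of the
decomposition-based accent stripping mentioned in the message. It uses
only the standard unicodedata module. The names strip_accents and
matches are hypothetical helpers made up for this sketch; they are not
part of notmuch, Xapian, or gmime. The assumption is that the same
folding would be applied both to indexed terms and to query words.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Return text with combining marks removed (hypothetical helper)."""
    # NFD splits each precomposed character into a base character
    # followed by its combining marks, e.g. "é" -> "e" + U+0301.
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() is nonzero for combining marks; drop those.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose whatever is left so the result is in NFC.
    return unicodedata.normalize("NFC", stripped)


def matches(query: str, term: str) -> bool:
    """Compare a query word and an indexed term after folding both."""
    return strip_accents(query).casefold() == strip_accents(term).casefold()
```

Characters that have no canonical decomposition, such as "ø" or "ł",
pass through unchanged, so a real implementation would probably need an
additional folding table for them.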