From: Marcin Borkowski
Newsgroups: gmane.emacs.devel
Subject: Re: Character folding in the pretest
Date: Mon, 08 Feb 2016 20:18:48 +0100
Message-ID: <87twljqacn.fsf@mbork.pl>
References: <87mvrg2zid.fsf@wanadoo.es> <20160204.180523.769253593641901728.wl@gnu.org> <20160205.070103.162978216111829522.wl@gnu.org> <877fifs3fy.fsf@mbork.pl> <8360xzqejk.fsf@gnu.org>
In-reply-to: <8360xzqejk.fsf@gnu.org>
To: Eli Zaretskii
Cc: ofv@wanadoo.es, lokedhs@gmail.com, emacs-devel@gnu.org

On 2016-02-08, at 18:48, Eli Zaretskii wrote:

>> From: Marcin Borkowski
>> Date: Mon, 08 Feb 2016 15:05:05 +0100
>> Cc: ofv@wanadoo.es, lokedhs@gmail.com, emacs-devel@gnu.org
>>
>> Just as another datapoint in discussion: for me, searching for "l" and
>> finding "ł" seems a bit weird.  (The opposite even more so.)
>
> Which is why neither one happens under character folding.
>
>> BTW, strangely enough, here isearching for "l" does /not/ find "ł", but
>> isearching for "a" (with character folding on) finds "ą".  Whatever one
>> thinks about char folding, this is clearly a bug.
>
> It's not a bug, it's the feature working as designed: we only fold
> characters that have suitable decompositions in the Unicode Character
> Database.  So:
>
>   (get-char-code-property ?ą 'decomposition) => (97 808)
>
> but
>
>   (get-char-code-property ?ł 'decomposition) => (322)
>
> IOW, ą is canonically equivalent to the 2-character sequence a ̨ (which
> is why searching for a finds that character), while ł has no canonical
> decomposition (nor any other decomposition).
>
> This means that the Unicode guys decided that ł should not be
> equivalent to any other sequence of characters, and therefore Emacs
> doesn't find it unless you search for it literally.
>
> If you want to know why ł doesn't have any decompositions, I suggest
> to ask on the Unicode mailing list, I'm sure they had good reasons,
> most probably reasons that came from people who are experts in the
> Polish language and its intricacies.  We just trust the results.

Thanks for the explanation, Eli!

However, given the number of bugs/quirks in Unicode, I'd personally
prefer not to trust them too much.  (Though I understand that the Emacs
devs /have/ to trust someone, and choosing the Unicode people is
probably not a bad idea generally.)  Funnily, one of the more annoying
bugs in Unicode is connected with quotes, AFAIR.  (Why not beat a dead
horse? ;-))

And folding "ą" to "a" while not "ł" to "l" is something which most
Poles (I guess) would treat as a serious, WTF-level bug.  And good luck
to all non-Polish people with isearching for the name of Jan
Łukasiewicz (just to choose a Lisp-related name;-)).

Yet another datapoint suggesting that the issue is really complicated,
and that Drew is right: if this is not configurable by users, it might
end up more annoying than helping.  (Not to say it won't - I trust
Artur here.)

Best,

-- 
Marcin Borkowski
http://octd.wmi.amu.edu.pl/en/Marcin_Borkowski
Faculty of Mathematics and Computer Science
Adam Mickiewicz University
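
The decomposition data quoted in this message can also be checked outside Emacs. Below is a minimal sketch, assuming Python's standard `unicodedata` module as a stand-in for `get-char-code-property`. The helper name `base_if_foldable` is hypothetical, and taking the first code point of the decomposition is only a rough approximation of the rule Eli describes, not Emacs's actual character-folding code.

```python
import unicodedata


def base_if_foldable(ch):
    """Return the base character that `ch` decomposes to, or None.

    Hypothetical helper: approximates "fold only characters that have a
    decomposition in the Unicode Character Database".
    """
    decomp = unicodedata.decomposition(ch)
    if not decomp:
        # No decomposition recorded in the UCD (the case of "ł").
        return None
    # Fields are hex code points; compatibility mappings start with a
    # tag such as "<compat>", which is skipped here.
    codepoints = [int(p, 16) for p in decomp.split() if not p.startswith("<")]
    return chr(codepoints[0])


for ch in "ąłŁ":
    print(ch, repr(unicodedata.decomposition(ch)), base_if_foldable(ch))
```

Python reports an empty decomposition for "ł", where Emacs returns `(322)`, i.e. the character itself. Both mean that the Unicode Character Database records no decomposition, which is why "ą" can be matched by "a" under this rule while "ł" and "Ł" are not matched by "l" or "L".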