From: Elias Mårtenson
Newsgroups: gmane.emacs.devel
Subject: Re: On language-dependent defaults for character-folding
Date: Sat, 20 Feb 2016 13:22:57 +0800
To: Eli Zaretskii
Cc: Lars Ingebrigtsen, emacs-devel
In-Reply-To: <83egc8qzjh.fsf@gnu.org>

On 20 February 2016 at 03:18, Eli Zaretskii wrote:

> > Date: Fri, 19 Feb 2016 21:37:26 +0800
> > From: Elias Mårtenson
> > Cc: Lars Ingebrigtsen, emacs-devel
> >
> > > For example, if the buffer includes ñ (2 characters), should "C-s n"
> > > find the n in it?
> >
> > That depends on the locale of the user.
>
> There are use cases that are independent of the locale.  For example,
> imagine that you need to find all the literal n characters in a buffer
> because you are investigating a bug in the program that produced that
> buffer.  As an Emacs user, I need to do such jobs almost every day.  I
> don't want the results affected by the locale.

Of course I'm not saying that you should not be able to do this. All I'm
advocating here is sensible defaults.

> > However, from the point of a user, there should not be a visible
> > difference between the precomposed and the composed variants are the
> > exact same character.
>
> What if the user wants to find all those places where what looks like
> ñ is actually ñ?  Wouldn't that be a valid use case?

It would, but certainly a very rare one. For all intents and purposes
the two forms are (should be) equivalent.

> The reference you are looking for is the Unicode Standard itself.  It
> says to use the normalization forms, see for example section 5.16
> there.

I have read that section before, and I have now read it again. The
section certainly talks about searches that ignore diacritics, but it
does not describe a method for doing so. There is also a reference to
TR29, but that refers to grapheme clusters, which would be a very
strange way to do character folding (Koreans would be very confused).

> Every character-folding search implementation decomposes characters
> before matching them.  So does Emacs.  We didn't invent this, and we
> certainly didn't use the decompositions where they weren't supposed to
> be used.  It's not a trick, it's what everyone else does to do the
> job.  See the ICU library, for example.

Every example you have given so far concerns decomposition equivalence,
i.e. the fact that the two variants of ñ are the same. Section 5.16
discusses the _concept_ of allowing n and ñ to match, but the mechanism
for doing so is locale-dependent. This is what Unicode says, and that is
what I say. My position is simply that the default (if absolutely
nothing else overrides it) should be chosen to take the locale of the
user into account.

> > The decompositions are used in the normalisation forms to ensure that
> > the two variants are treated equally
> > (such as the two alternative representations of ñ that we have been
> > discussing).
>
> Yes, and any character-folding search uses normalization forms as
> well.

Yes, but that's not what normalisation forms were designed to do.

Again (I really apologise for repeating myself, I'm starting to sound
like a troll and that is truly not my intention), the purpose of
normalisation forms is to ensure that the two variants of ñ compare the
same. They are not designed to provide a mechanism to allow n to compare
equal to ñ.

> > Yes. I am fully aware of this. But so be it. Having applications work
> > differently depending on the locale of the
> > environment the application was started in is nothing new.
>
> It's not new.  It's old.  We should move on to more general
> environments that support multiple languages.  Emacs is such an
> environment.  The old l10n paradigms are fundamentally incompatible
> with that.

Sure, but doesn't it make sense to fall back to the user's default if
the buffer does not have an overriding locale?

> > > Being a multi-lingual environment, Emacs has no real notion of the
> > > locale.
> >
> > Perhaps it should?
>
> That'd be a step backward, IMO.

As opposed to having no concept of locale at all? I just have to
disagree with you on that.

> Strange, I always thought the data was there.  Perhaps you should ask
> a question on the Unicode mailing list, then.

That's a good idea actually. Thank you for the suggestion. I'm reading
that mailing list, and I will post a question there.

Regards,
Elias
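
To illustrate the distinction the message draws between canonical equivalence and diacritic folding, here is a minimal Python sketch. It is not Emacs code and does not claim to show how Emacs or ICU implement character folding. The function names (`canonically_equal`, `fold_diacritics`, `fold_for_language`) and the `KEEP_DISTINCT` table are hypothetical, and the table is a hand-written illustration rather than data taken from Unicode or CLDR.

```python
import unicodedata

PRECOMPOSED = "\u00f1"  # ñ as a single code point
DECOMPOSED = "n\u0303"  # n followed by COMBINING TILDE


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after canonical decomposition (NFD).

    This is the job normalisation forms are designed for: the two
    representations of the same character compare equal.
    """
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def fold_diacritics(text: str) -> str:
    """Decompose, then drop combining marks, so that ñ folds to n.

    This goes beyond normalisation: it deliberately discards information.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical, illustrative table: letters that speakers of a language
# treat as distinct and that a locale-aware default might refuse to fold.
KEEP_DISTINCT = {
    "es": {"\u00f1"},                      # ñ
    "sv": {"\u00e5", "\u00e4", "\u00f6"},  # å, ä, ö
}


def fold_for_language(text: str, lang: str) -> str:
    """Fold diacritics except on letters the given language keeps distinct."""
    keep = KEEP_DISTINCT.get(lang, set())
    composed = unicodedata.normalize("NFC", text)
    return "".join(
        ch if ch.lower() in keep else fold_diacritics(ch) for ch in composed
    )


print(canonically_equal(PRECOMPOSED, DECOMPOSED))  # True
print(canonically_equal("n", PRECOMPOSED))         # False
print(fold_for_language("ma\u00f1ana", "en"))      # manana
```

With this sketch, both representations of ñ are always treated as the same character, while whether a search for n also matches ñ depends on the language passed to `fold_for_language`, which is the kind of locale-dependent default argued for in the message.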