From: Elias Mårtenson
Newsgroups: gmane.emacs.devel
Subject: Re: On language-dependent defaults for character-folding
Date: Fri, 19 Feb 2016 21:37:26 +0800
To: Eli Zaretskii
Cc: Lars Ingebrigtsen, emacs-devel
In-Reply-To: <837fi0sz29.fsf@gnu.org>

On 19 February 2016 at 19:46, Eli Zaretskii <eliz@gnu.org> wrote:

> > Of course you have to use the decomposition algorithms to ensure that the precomposed and decomposed
> > variations of the same character compare equal.

> Then you agree that _some_ form of character-folding should be turned
> on by default?

Yes.

> > This is, however, different from using the decomposition to decompose a character and then using the
> > base character as the thing to match against. The latter is what Emacs is doing today, as far as I understand.

> Please describe in more detail why do you think what Emacs does today
> is not what you think it should do.  It's possible we have a
> miscommunication here.

The main issue to me is that it matches things that should not be matched. A secondary (minor) issue is that some things that should be matched are not (see my example with U+2C65).

> For example, if the buffer includes ñ (2 characters), should "C-s n"
> find the n in it?

That depends on the locale of the user. However, from the point of view of a user, there should be no visible difference between the precomposed and the decomposed variants: they are the exact same character. This is in line with Unicode recommendations (https://en.wikipedia.org/wiki/Unicode_equivalence).

Note: I know that it's possible that I am wrong about this and that Unicode actually _has_ said that the equivalence tables can be used for this purpose (i.e. decompose and only use the primary character). If that is the case, I'd be interested to see a reference to that, but I will still be of the same opinion that doing so will result in broken behaviour for a certain class of user.

Thus, if I am Spanish, I will _not_ want any of those to match "n". If I'm Swedish I will likely want both of them to match "n".
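As a rough illustration of that locale dependence, here is a minimal Python sketch (Python rather than Emacs Lisp, purely for brevity). The `DISTINCT_LETTERS` table and the `fold_char` and `matches` helpers are hypothetical names invented for this example, and the per-language sets are assumptions, not data taken from Unicode or from Emacs.

```python
import unicodedata

# Hypothetical per-language sets of letters that speakers treat as
# distinct letters and that therefore must NOT be folded.
# These sets are illustrative assumptions only.
DISTINCT_LETTERS = {
    "es": {"\u00f1"},                      # n with tilde
    "sv": {"\u00e5", "\u00e4", "\u00f6"},  # a with ring, a/o with diaeresis
}

def fold_char(ch, lang):
    """Fold one character (precomposed or decomposed) for searching in lang."""
    # Compose first, so both representations are looked up the same way.
    composed = unicodedata.normalize("NFC", ch)
    if composed.lower() in DISTINCT_LETTERS.get(lang, set()):
        return composed
    # Otherwise decompose and keep only the non-combining characters.
    decomposed = unicodedata.normalize("NFD", composed)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def matches(needle, candidate, lang):
    """True if needle should find candidate under the rules for lang."""
    return fold_char(needle, lang) == fold_char(candidate, lang)
```

Under these assumptions, `matches("n", "\u00f1", "sv")` is true while `matches("n", "\u00f1", "es")` is false, and the precomposed and decomposed forms of the letter still match each other in both languages.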

> That equivalence is encoded in the decomposition data that is part of
> UnicodeData.txt which Emacs uses for character-folding.

The equivalence tables explain that the precomposed character U+00F1 is equivalent to the specific sequence U+006E U+0303. That is all they say. They do not say that ñ is a variation of n. It's an instruction for how to construct a given character.

The decompositions are used in the normalisation forms to ensure that the two variants are treated equally (such as the two alternative representations of ñ that we have been discussing).
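A short Python sketch of that distinction, using the standard `unicodedata` module (the helper name `canonically_equal` is made up for this example). It shows normalisation making the two representations of the letter compare equal, without implying anything about its relation to plain n.

```python
import unicodedata

PRECOMPOSED = "\u00f1"   # U+00F1 LATIN SMALL LETTER N WITH TILDE
DECOMPOSED = "n\u0303"   # U+006E followed by U+0303 COMBINING TILDE

def canonically_equal(a, b):
    """Compare two strings after bringing both to the same normal form (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# The decomposition field recorded for U+00F1.
print(unicodedata.decomposition(PRECOMPOSED))      # 006E 0303

print(PRECOMPOSED == DECOMPOSED)                   # False: different code points
print(canonically_equal(PRECOMPOSED, DECOMPOSED))  # True: same character
print(canonically_equal(PRECOMPOSED, "n"))         # False: not the letter n
```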

> > If you look at the latin collation chart for example
> > (http://unicode.org/charts/collation/chart_Latin.html) you will see that the characters are grouped. These are
> > the equivalences I'm referring to.

> Yes.  And if you look at the entries of the equivalent characters in
> UnicodeData.txt, you will see there they have decompositions, which is
> what Emacs uses for searching when character-folding is in effect.

Yes, and this is where the crux of our disagreement lies, I think. I previously referred to using the decompositions as a guide to character equivalence as a "trick". I stand by this, since this is not the purpose of the decompositions. The best thing that Unicode provides for that purpose (to my knowledge) is the collation charts that I mentioned previously (http://unicode.org/charts/collation/).
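To make the objection concrete, here is a hedged Python sketch of the decomposition "trick" as it is described in this thread: decompose, then drop the combining marks. The function name `base_fold` is hypothetical, and this is not a claim about how Emacs actually implements character folding.

```python
import unicodedata

def base_fold(text):
    """Decompose to NFD, then drop combining marks, keeping only base characters."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(base_fold("\u00f1"))  # n  (n with tilde collapses to plain n)
print(base_fold("\u00e5"))  # a  (a with ring collapses to plain a)

# U+2C65 LATIN SMALL LETTER A WITH STROKE has no decomposition,
# so this kind of folding leaves it untouched and it never matches "a".
print(unicodedata.decomposition("\u2c65") == "")  # True
print(base_fold("\u2c65") == "a")                 # False
```

Both complaints show up here: letters that some users consider distinct are folded together, while a letter without a decomposition is not folded at all.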

> > Now, I note that on these charts, U+0061 LATIN SMALL LETTER A and U+2C65 LATIN SMALL LETTER A
> > WITH STROKE compare as different characters, and the latter does not have a decomposition. Should this
> > also be addressed?

> Maybe so, but given the controversy even about what we do now, which
> is a subset, I'd doubt extending what we do now is a wise move.

I was just asking to understand your position better.

> >  As for the locale-specific parts: using that will only DTRT if we
> >  assume that the majority of searches are done in buffers holding text
> >  in locale's language. Is that a good assumption?
> >
> > My opinion is that the default search behaviour should depend primarily on the locale of the entire Emacs
> > session. I.e. the locale of the user starting the application. I'm not disagreeing that allowing a buffer-local locale
> > to override this behaviour is a good idea, but as a Swedish speaker I really see å, ä and a as completely
> > separate things, even if the language of the buffer that I am editing happens to be English. The equivalence of
> > these characters is the odd behaviour here, and the one that should be enabled explicitly.
> >
> > Also, if I happen to be editing a Spanish document (I don't speak Spanish) I would find equivalence of ñ and n
> > to be incredibly useful, even though Óscar would grind his teeth at it. :-)

> So you are in fact making two contradicting statements here.

Interesting. I have re-read what I wrote and I really don't see myself making two contradicting statements. Perhaps you think that I am both against folding and not, at the same time. If that's the case, let me try to rephrase:

I like the idea of character folding. But if it's incorrectly implemented (by my standards, of course), I would rather not have it at all, since it will be highly annoying.

> Indeed,
> the locale in which Emacs started says almost nothing about the
> documents being edited, nor even about the user's preferences: it is
> easy to imagine a user whose "native" locale is X starting Emacs in
> another locale.

Yes. I am fully aware of this. But so be it. Having applications work differently depending on the locale of the environment the application was started in is nothing new.

> >  We are talking
> >  about a multilingual Emacs, in an age of global communications, where
> >  you can have conversations with someone on the other side of the
> >  world, or read text that combines several languages in the same
> >  buffer. Do we really want to go back to the l10n days, when there was
> >  ever only one locale that was interesting -- the current one? I
> >  wonder.
> >
> > Actually, I think so. This is because the search equivalence is inherently a local thing.

> Being a multi-lingual environment, Emacs has no real notion of the
> locale.

Perhaps it should?

> >  It is, Unicode provides it. We just didn't import it yet.
> >
> > It does? I was looking for such tables, but didn't find them. Do you have a link?

> Look for DUCET and its tailoring data.  These should be a good
> starting point:
>
>   http://www.unicode.org/Public/UCA/latest/
>   http://cldr.unicode.org/

Those are the decomposition charts, and don't actually say anything about equivalence outside of providing a canonical form for precomposed characters, as was discussed above.

Regards,
Elias
