From: Elias Mårtenson
Newsgroups: gmane.emacs.devel
Subject: Re: Character folding in the pretest
Date: Fri, 5 Feb 2016 14:36:13 +0800
To: Werner LEMBERG
Cc: Óscar Fuentes, emacs-devel
In-Reply-To: <20160205.070103.162978216111829522.wl@gnu.org>
References: <87mvrg2zid.fsf@wanadoo.es> <20160204.180523.769253593641901728.wl@gnu.org> <20160205.070103.162978216111829522.wl@gnu.org>

On 5 February 2016 at 14:01, Werner LEMBERG <wl@gnu.org> wrote:

> >> This naturally leads to a possible user option: Having `optical'
> >> matches or not, where `optical' means `base character plus
> >> diacritic and/or slight modifications', e.g., o → ø → ö etc., etc.
> >
> > How do you even define "optical similarities"?
>
> Basically the same as Eli has described: Base character plus
> diacritics, probably plus some basic shapes with `diacritics' that
> Unicode doesn't represent as composable: o → ø, l → ł, d → đ, etc.

Composability is somewhat arbitrary. Character composition has very little to do with "visual similarities". Just have a look at character compositions in Devanagari, for example.

> > Should l and I compare the same under this definition?  They
> > certainly look similar.
>
> No, since the similarity is a font issue only.  For this reason I
> *never* use Arial-like fonts.

And that argument works equally well for a and å. They really have _nothing_ in common. The fact that Unicode defines a decomposition of å into a plus a combining ring is completely irrelevant to a Swedish speaker.
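
As an aside for readers who want to check which of the characters mentioned in this thread Unicode actually decomposes, here is a minimal sketch. It is written in Python purely for illustration and is not Emacs code; `canonical_parts` is a hypothetical helper name, and only the standard `unicodedata` module is assumed.

```python
import unicodedata

def canonical_parts(ch):
    """Return the characters of ch's canonical decomposition, or [] if it has none."""
    decomp = unicodedata.decomposition(ch)
    # Entries starting with "<" (e.g. "<compat> ...") are compatibility
    # mappings, not canonical decompositions, so they are ignored here.
    if not decomp or decomp.startswith("<"):
        return []
    return [chr(int(code_point, 16)) for code_point in decomp.split()]

# Report, for each character from the discussion, what it decomposes into.
for ch in "åäöøłđ":
    parts = canonical_parts(ch)
    if parts:
        names = " + ".join(unicodedata.name(p) for p in parts)
        print(f"{ch}: {names}")
    else:
        print(f"{ch}: no canonical decomposition")
```

The Unicode data gives å, ä and ö a decomposition into a base letter plus a combining mark, while ø, ł and đ have none. That is a property of the character database, not a statement about how any particular language regards those letters.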

Also note that to a Swedish speaker (well, at least up until recently), W and V were variations of the same character. Yet I'm not advocating that Emacs should consider them similar unless the locale says they should be.

In fact, the Unicode TR on collation that Eli linked to mentions that as a specific example.

> > What about p and q?  They look like mirror images of each other.
> > What about z and s?  They even sound similar.
>
> Nonsense.  I've clearly mentioned `base character plus diacritic'.
> Why do you intentionally skip that?  Doing so reminds me of
> Schopenhauer's first stratagem in `The Art of Being Right'...

I did not intentionally skip that. I would appreciate it if you didn't assume that I was out to simply prove you wrong, or that I am here to troll.

I was using that as an example to highlight that, to some people (like myself), ä simply is not a character with a diacritic. It is in German, but not in Swedish.

I think this is hard to explain because in many European languages (such as English, German and French) you have characters which are variations or alternatives. For example, in French you have the letter Œ, which is a variation of "OE". Likewise in German, ß is a variation of SS and Ü is a variation of UE. As far as I know, I could write "Müller" as "Mueller".

However, this is not true for Swedish. I'll say it again (and I apologise for repeating myself; this kind of repetition makes me sound like the troll that you accused me of being), but in Swedish the difference between Å and A is just as great as the difference in English between the letters E and O. Writing my last name as "Martenson" looks just as bizarre as me writing your name as "Merner". And yes, I picked M because it kinda looks like an upside-down W, and I'm doing that not because I'm really suggesting that that equivalence should be implemented, but because I want to illustrate just how silly it looks.

> > To a Swedish speaker there are zero similarities between a, ä and å.
>
> I'm a native German speaker, and there is *zero* similarity in the
> sound between `a' and `ä', say.

I know. I speak a little German. In fact, Ä is pronounced exactly the same in German and Swedish. That said, as far as I can recall from my German lessons 25 years ago, German does treat Ä as a variation of A. At least they are sorted together in the dictionary.

In Swedish the distinction is much greater. This discussion would have been much easier if the letters looked completely different. :-)

> But it is quite common in English
> texts, say, to omit the diaeresis dots, thus having a searching mode
> that finds both `Hänsel und Gretel' and `Hansel and Gretel' at the
> same time would be very valuable.

I never said it's not valuable. I never even suggested that this kind of comparison should not be possible.

In fact, I'm not even suggesting that this kind of comparison should not be the default, especially given that locale-dependent comparators are not very well supported in Emacs at the moment.

What I did want to do was try to explain that even though there is a visual similarity between A, Ä and Å, to a Swedish speaker those similarities are no greater than those between q and k. And the three letters are definitely much more different from each other than W and V are (W was, up until recently, sorted under V in dictionaries and seen as simply a visual variation).

> What you describe naturally leads to another user option: Don't handle
> characters as `equal' (with a proper definition of `equal') that
> aren't `equal' in the user's locale.

This is exactly my point. And you have managed to compress hundreds of my words into a single, distinct sentence. Thank you.
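
The user option described in the quoted sentence could be sketched roughly as follows. This is only an illustration in Python, not Emacs code and not a proposal for the actual implementation: the `DISTINCT_LETTERS` table, its locale keys, and the `fold` and `matches` helpers are all hypothetical names. The Swedish entry encodes the point made in this message; folding the German umlauts is an assumption made for the sake of the example.

```python
import unicodedata

# Hypothetical per-locale table of letters that must NOT be folded to
# their base character. Locales not listed here fold everything.
DISTINCT_LETTERS = {
    "sv": set("åäöÅÄÖ"),  # Swedish: å, ä and ö are letters in their own right
    "de": set(),          # assumption for this sketch: umlauts fold to the base
}

def fold(text, locale):
    """Strip combining marks, except on letters the locale keeps distinct."""
    keep = DISTINCT_LETTERS.get(locale, set())
    out = []
    # Normalise to NFC first so that each accented letter is one character.
    for ch in unicodedata.normalize("NFC", text):
        if ch in keep:
            out.append(ch)
        else:
            decomposed = unicodedata.normalize("NFD", ch)
            out.append("".join(c for c in decomposed
                               if not unicodedata.combining(c)))
    return "".join(out)

def matches(needle, haystack, locale):
    """Case-insensitive substring search under the locale's folding rules."""
    return fold(needle, locale).casefold() in fold(haystack, locale).casefold()
```

With this table, `matches("Hansel", "Hänsel und Gretel", "de")` succeeds, while `matches("Martenson", "Mårtenson", "sv")` does not, which is the behaviour argued for above. Characters without a Unicode decomposition, such as ø, are left untouched by this sketch in every locale.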