From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: =?UTF-8?Q?Elias_M=C3=A5rtenson?= <lokedhs@gmail.com>
Newsgroups: gmane.emacs.devel
Subject: Re: Character folding in the pretest
Date: Fri, 5 Feb 2016 14:36:13 +0800
Message-ID: <CADtN0WLnTYHioJ1p16JP-pt=rNqbyBmGfxh3SQFwfswEZnCz0A@mail.gmail.com>
References: <87mvrg2zid.fsf@wanadoo.es>
	<20160204.180523.769253593641901728.wl@gnu.org>
	<CADtN0WKsMH4dNQ3Xjw2Bk2tK29ctXbQfEdJ96TjtXanNz+JRRg@mail.gmail.com>
	<20160205.070103.162978216111829522.wl@gnu.org>
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: multipart/alternative; boundary=001a11439d404c6251052b001137
X-Trace: ger.gmane.org 1454654187 10567 80.91.229.3 (5 Feb 2016 06:36:27 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Fri, 5 Feb 2016 06:36:27 +0000 (UTC)
Cc: =?UTF-8?Q?=C3=93scar_Fuentes?= <ofv@wanadoo.es>,
	emacs-devel <emacs-devel@gnu.org>
To: Werner LEMBERG <wl@gnu.org>
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Fri Feb 05 07:36:26 2016
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Envelope-to: ged-emacs-devel@m.gmane.org
Original-Received: from lists.gnu.org ([208.118.235.17])
	by plane.gmane.org with esmtp (Exim 4.69)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1aRa0G-00057t-6K
	for ged-emacs-devel@m.gmane.org; Fri, 05 Feb 2016 07:36:24 +0100
Original-Received: from localhost ([::1]:46276 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1aRa0F-0002rR-AE
	for ged-emacs-devel@m.gmane.org; Fri, 05 Feb 2016 01:36:23 -0500
Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:40309)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <lokedhs@gmail.com>) id 1aRa0A-0002rD-CU
	for emacs-devel@gnu.org; Fri, 05 Feb 2016 01:36:19 -0500
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <lokedhs@gmail.com>) id 1aRa08-0003c7-MX
	for emacs-devel@gnu.org; Fri, 05 Feb 2016 01:36:18 -0500
Original-Received: from mail-vk0-x235.google.com ([2607:f8b0:400c:c05::235]:34696)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <lokedhs@gmail.com>)
	id 1aRa06-0003bl-70; Fri, 05 Feb 2016 01:36:14 -0500
Original-Received: by mail-vk0-x235.google.com with SMTP id e185so51007095vkb.1;
	Thu, 04 Feb 2016 22:36:14 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
	h=mime-version:in-reply-to:references:date:message-id:subject:from:to
	:cc:content-type;
	bh=TVJAv4gFHpXBYHvrm2uIU2lTHB6B0up6Dqr19xw0xdM=;
	b=McHL4rLEFLJU0wE12KvA3GbP4HS3WGWCir4QH72yASB1h5NvehX6GW/pXXt9aZuIey
	YXzo6GBv5FWCFknCzjTYW2uT1QjT1f0UU2DKjRUEX7LRED8ajRuf9+cCbnEPzJ53J+9x
	k3w2BxwFuj5jE696udftKyWXQKFTdPtkVfCPizWLEOYNI2gMcoOHC+JtkQ6ZbHAtL4wf
	ZcK9A4RwC34X1malcdq90Y4/owhpfGwKi/rBLNDTaFDGYY4DFUOEnxetN1gjblG4+W6V
	GjG14jtu7EvEa8sTa7vumtNwojNN1JLoGYzxwfvNYuHBT2lJtNLe11VjVStjyUOsPGQQ
	qj1w==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
	d=1e100.net; s=20130820;
	h=x-gm-message-state:mime-version:in-reply-to:references:date
	:message-id:subject:from:to:cc:content-type;
	bh=TVJAv4gFHpXBYHvrm2uIU2lTHB6B0up6Dqr19xw0xdM=;
	b=StfzMuPkBlW6peGnpgzK0MDFaRDfrhOBsORoNNNb4A4snBeDCZT8dt7x/E9/IXt/l4
	3KvkrvoBmUYBbR6eXVyGn9rxnuMYEyDTmmsKzOiYxFq85csVx4MuZFDVCle/J6p8wMjm
	Gng2Kz4eFMK2uJBThc0lJFw8AXd8C0QbMPM/iVFCffR/D7D+4T5nZBJ+LKjKFh0EdyqB
	LP87zmD1GgmMFmWuWhV2Swo43amde0wTPXDHnEjzqzx5xttm/WqMIKl4m0UlBkU5gzPA
	9QJB20K8gDopNi0MhYXKe3gWxPSEZv411PNVSfqjR+t4I+g8YlXBDzF09dBEcbYCdP9B
	MTwQ==
X-Gm-Message-State: AG10YORmbYA9sxZcJH4ZZuR2HUj1Y9tf+iHU05+ajNQXbTbjQNbuKOjKdGlarHRtqQp8r6H+tJZygv5o3uErGA==
X-Received: by 10.31.192.147 with SMTP id q141mr8307111vkf.96.1454654173763;
	Thu, 04 Feb 2016 22:36:13 -0800 (PST)
Original-Received: by 10.103.80.2 with HTTP; Thu, 4 Feb 2016 22:36:13 -0800 (PST)
In-Reply-To: <20160205.070103.162978216111829522.wl@gnu.org>
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic]
X-Received-From: 2607:f8b0:400c:c05::235
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/emacs-devel>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.emacs.devel:199356
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/199356>

--001a11439d404c6251052b001137
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

On 5 February 2016 at 14:01, Werner LEMBERG <wl@gnu.org> wrote:

>
> >> This naturally leads to a possible user option: Having `optical'
> >> matches or not, where `optical' means `base character plus
> >> diacritic and/or slight modifications', e.g., o =E2=86=92 =C3=B8 =E2=
=86=92 =C3=B6 etc., etc.
> >
> > How do you even define "optical similarities"?
>
> Basically the same as Eli has described: Base character plus
> diacritics, probably plus some basic shapes with `diacritics' that
> Unicode doesn't represent as composable: o =E2=86=92 =C3=B8, l =E2=86=92 =
=C5=82, d =E2=86=92 =C4=91, etc.
>

Composability is somewhat arbitrary. The character composition has very
little to do with "visual similarities". Just have a look at character
compositions in Devanagari for example.


> > Should l and I compare the same under this definition?  They
> > certainly looks similar.
>
> No, since the similarity is a font issue only.  For this reason I
> *never* use Arial-like fonts.
>

And that argument works equally well for a and =C3=A5. They really have
_nothing_ in common. The fact that there exists a Unicode decomposition for
them is completely irrelevant to a Swedish speaker.
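
To make the decomposition point concrete, here is a minimal Python sketch
(my own illustration using the standard unicodedata module, not anything
in Emacs; the helper name is made up). It shows that =C3=A5 has a canonical
decomposition while =C3=B8 does not, even though both look like a base
letter plus a mark:

```python
import unicodedata

def decomposition(ch):
    """Return the names of the code points in the NFD form of ch."""
    return [unicodedata.name(c)
            for c in unicodedata.normalize("NFD", ch)]

# U+00E5 is a with ring above; U+00F8 is o with stroke.
print(decomposition("\u00e5"))
print(decomposition("\u00f8"))
```

The first call lists a base letter and a combining ring; the second
returns the single precomposed letter, because Unicode defines no
decomposition for it.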

Also note that to a Swedish speaker (well, at least up until recently), W
and V were variations of the same character. Yet I'm not advocating that
Emacs should consider them similar unless the locale says they should be.

In fact, the links to the Unicode TR on collations that Eli posted mention
that as a specific example.


> > What about p and q?  They look like mirror images of each other.
> > What about z and s?  They even sound similar.
>
> Nonsense.  I've clearly mentioned `base character plus diacritic'.
> Why do you intentionally skip that?  Doing so reminds me of
> Schopenhauer's first stratagem in `The Art of Being Right'...
>

I did not intentionally skip that. I would appreciate it if you didn't
assume that I was out to simply prove you wrong, or that I am here to troll=
.

I was using that as an example in trying to highlight that to some people
(like myself) =C3=A4 just simply is not a character with a diacritic. It is=
 in
German, but not in Swedish.

I think this is hard to explain because in many European languages (such as
English, German and French) you have characters which are variations or
alternatives. For example, in French you have the letter =C5=92, which is a
variation of "OE". Likewise in German, =C3=9F is a variation of SS and =C3=
=9C is a
variation of UE. As far as I know, I could write "M=C3=BCller" as "Mueller"=
.

However, this is not true for Swedish. I'll say it again (and I apologise
for repeating myself; this kind of repetition makes me sound like the troll
that you accused me of being) but in Swedish the difference between =C3=85
and A is just as great as the difference in English between the letters E
and O.
Writing my last name as "Martenson" looks just as bizarre as me writing
your last name as "Merner". And yes, I picked M because it kinda looks like
an upside-down W and I'm doing that not because I'm really suggesting that
that equivalence should be implemented, but because I want to illustrate
just how silly it looks.


> > To a Swedish speaker there are zero similarities between a, =C3=A4 and =
=C3=A5.
>
> I'm a native German speaker, and there is *zero* similarity in the
> sound between `a' and `=C3=A4', say.


I know. I speak a little German. In fact, =C3=84 is pronounced exactly the
same in German and Swedish. That said, as far as I can recall from my
German lessons 25 years ago, German grammar does see =C3=84 as a variation
of A. At least they are sorted together in the dictionary.

The Swedish distinction is much greater. This discussion would have been
much easier if the letters looked completely different. :-)


> But it is quite common in English
> texts, say, to omit the diaeresis dots, thus having a searching mode
> that finds both `H=C3=A4nsel und Gretel' and `Hansel and Gretel' at the
> same time would be very valuable.
>

I never said it's not valuable. I never even suggested that this kind of
comparison should not be possible.

In fact, I'm not even suggesting that this kind of comparison should not
be the default, especially given the fact that locale-dependent
comparators are not very well supported in Emacs at the moment.

What I did want to do was try to explain that even though there is a
visual similarity between A, =C3=84 and =C3=85, to a Swedish speaker those
similarities are no greater than those between q and k. And the three are
definitely much more different than W and V (which were, up until
recently, sorted under V in dictionaries and seen as simply a visual
variation).

>
> What you describe naturally leads to another user option: Don't handle
> characters as `equal' (with a proper definition of `equal') that
> aren't `equal' in the user's locale.


This is exactly my point. And you have managed to compress hundreds of my
words into a single, distinct sentence. Thank you.
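
For what it's worth, such an option could be sketched roughly like this.
It is written in Python rather than Emacs Lisp, and everything in it is an
assumption for illustration: the function names, the locale argument and
the tiny hand-written Swedish table are hypothetical, and a real
implementation would need proper per-locale data.

```python
import unicodedata

def distinct_letters(locale):
    """Hypothetical table: letters the locale treats as letters in
    their own right (only Swedish is filled in here)."""
    if locale.startswith("sv"):
        # a-ring, a-diaeresis, o-diaeresis, lower and upper case
        return "\u00e5\u00e4\u00f6\u00c5\u00c4\u00d6"
    return ""

def fold_char(ch, locale):
    """Strip combining marks unless the locale keeps ch distinct."""
    if ch in distinct_letters(locale):
        return ch
    return "".join(
        c for c in unicodedata.normalize("NFD", ch)
        if not unicodedata.combining(c)
    )

def fold(text, locale):
    """Fold text for searching; compose first so the table matches."""
    return "".join(
        fold_char(ch, locale)
        for ch in unicodedata.normalize("NFC", text)
    )

print(fold("H\u00e4nsel", "en"))         # Hansel
print(ascii(fold("H\u00e4nsel", "sv")))  # 'H\xe4nsel'
```

Two strings would then match when their folded forms are equal, so an
English-locale search finds both spellings of Hansel while a Swedish one
keeps them apart.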

--001a11439d404c6251052b001137--