From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: =?UTF-8?Q?Elias_M=C3=A5rtenson?= <lokedhs@gmail.com>
Newsgroups: gmane.emacs.devel
Subject: Re: On language-dependent defaults for character-folding
Date: Sat, 20 Feb 2016 17:18:27 +0800
Message-ID: <CADtN0W++yvgrPEGzCM9nfTqpXD2LdDR==S8Q5vRBkS4Rqcx28w@mail.gmail.com>
References: <CAAdUY-KRpbjDJ6h=QOsWBpOJyJ-GP1ia70YyjwYsNe5i1S=mXg@mail.gmail.com>
	<83twle71xy.fsf@gnu.org> <87io1us0te.fsf@wanadoo.es>
	<83pow26svf.fsf@gnu.org> <87a8n5srbp.fsf@wanadoo.es>
	<83d1s17npz.fsf@gnu.org> <87oablfpn3.fsf@mail.linkov.net>
	<834mdd6llx.fsf@gnu.org>
	<7fbb8bc7-9a97-4bad-a103-a6690a35241d@default>
	<834mdc5w6o.fsf@gnu.org> <m2ziuxltit.fsf@newartisans.com>
	<838u2hu6aq.fsf@gnu.org> <871t899tde.fsf@gnus.org>
	<83y4ahru04.fsf@gnu.org>
	<CADtN0W+B=JZ_LKis9opETfr5q8K=rC+Xt6jGijMC3GwiGbF2RA@mail.gmail.com>
	<83fuwproyf.fsf@gnu.org>
	<CADtN0W+2CjROLMnuC8N3X3TrwvsZOmidviFjM_-AF0DKN-Wvsg@mail.gmail.com>
	<837fi0sz29.fsf@gnu.org>
	<CADtN0W+93LH5d3=joVj2xe40rramMOcURKw7QKdv_OefYCm8Ug@mail.gmail.com>
	<83egc8qzjh.fsf@gnu.org>
	<CADtN0WL-rX5xzw75P=qLEYFYzLWkuCuntE+gf2BAhn981_jWBg@mail.gmail.com>
	<87egc7evu3.fsf@gnus.org>
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: multipart/alternative; boundary=001a1143f9481a98e3052c3015e9
X-Trace: ger.gmane.org 1455959924 11209 80.91.229.3 (20 Feb 2016 09:18:44 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Sat, 20 Feb 2016 09:18:44 +0000 (UTC)
Cc: Eli Zaretskii <eliz@gnu.org>, emacs-devel <emacs-devel@gnu.org>
To: Lars Magne Ingebrigtsen <larsi@gnus.org>
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Sat Feb 20 10:18:43 2016
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Envelope-to: ged-emacs-devel@m.gmane.org
Original-Received: from lists.gnu.org ([208.118.235.17])
	by plane.gmane.org with esmtp (Exim 4.69)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1aX3gW-0005Ei-Tm
	for ged-emacs-devel@m.gmane.org; Sat, 20 Feb 2016 10:18:41 +0100
Original-Received: from localhost ([::1]:59431 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1aX3gW-0001N0-7f
	for ged-emacs-devel@m.gmane.org; Sat, 20 Feb 2016 04:18:40 -0500
Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:45689)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <lokedhs@gmail.com>) id 1aX3gO-0001CN-Do
	for emacs-devel@gnu.org; Sat, 20 Feb 2016 04:18:36 -0500
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <lokedhs@gmail.com>) id 1aX3gK-00074j-Bd
	for emacs-devel@gnu.org; Sat, 20 Feb 2016 04:18:32 -0500
Original-Received: from mail-vk0-x22c.google.com ([2607:f8b0:400c:c05::22c]:34111)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <lokedhs@gmail.com>)
	id 1aX3gK-00074d-68; Sat, 20 Feb 2016 04:18:28 -0500
Original-Received: by mail-vk0-x22c.google.com with SMTP id e185so94282972vkb.1;
	Sat, 20 Feb 2016 01:18:28 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
	h=mime-version:in-reply-to:references:date:message-id:subject:from:to
	:cc:content-type;
	bh=ljxM/EVa4HFgIMHOBTIJlUXU+71Y1XZVA7Rjkbw43iY=;
	b=sTThw8EuKBXXMibQCiaIZAo1JGsuU2Dv+Y9QFpVmcvXlcpFxSf6Pk3ErevhrMjKiXU
	3MYimzSFZLyrumK8JuRX0m7AOv6scR6l7SBeBoDuGRavNYiu6iWUyA2DxikwxFd/bOhw
	rbZBQVSGMl3ARX5SfFd0HeekHCzM+L8T4vp97ZhF6hKxR1nhbCxHNjG3iKdtV2WVvqg9
	aLSZG9kqnDDd/VuKr1NUYGgunobImihlL1wsp/WY6up01xWhCsUO41NKLXyi1eoMVCoJ
	pD1f2C1zuOVAxz45G+AGuhq7CnncjQ2n8IpdNk1weQWzwY5H9LC3rK3lt3iwZZVip64V
	VoDw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
	d=1e100.net; s=20130820;
	h=x-gm-message-state:mime-version:in-reply-to:references:date
	:message-id:subject:from:to:cc:content-type;
	bh=ljxM/EVa4HFgIMHOBTIJlUXU+71Y1XZVA7Rjkbw43iY=;
	b=H8cRRt6QaS48JBTcdSjyTQElY3ZLMu7cBrpEW47TYjddpppxcBzqpWeDI3NCQwezdk
	7ChIBVdxEPFtVQuXwyOPMI1mH+uFRbX/naAF0KUmkbpttWL3ms/2pckguT9//nFsHMLf
	9QvjkkqYgJFw7OQk/Qs3n5HczR7pBOe2btmDa/Rp5Y147Fa8NHsTZdxoWWnfnmVEOpdO
	ZdHcAczbqxgUuCcMcQmNyfQNsU1udkHh0dxlswEgQ8wolMB6YjcxxFZNTTBbCEOcA3sb
	oubGheSXcUrjUp+J6UxRzEU8YHZqknJ7rBX2I5Ec4vidttV3JGF33Mn1dhta+P6vaihQ
	yoDA==
X-Gm-Message-State: AG10YOSsAQ50MkHK6PaB8LFg7cP9zkLPxjhCsAUFOX5go6U+N3KCCSqsMBMX09+Sj6CV+x4MNmx2PF8ahSkPnw==
X-Received: by 10.31.56.151 with SMTP id f145mr15177350vka.107.1455959907668; 
	Sat, 20 Feb 2016 01:18:27 -0800 (PST)
Original-Received: by 10.176.3.146 with HTTP; Sat, 20 Feb 2016 01:18:27 -0800 (PST)
In-Reply-To: <87egc7evu3.fsf@gnus.org>
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic]
X-Received-From: 2607:f8b0:400c:c05::22c
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/emacs-devel>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.emacs.devel:200289
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/200289>

--001a1143f9481a98e3052c3015e9
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

I think your message illustrates an opinion that is not only mine: I am
not against the idea of character folding. If I were, I'd simply ignore
this discussion and turn the feature off. What I want, and by the looks of
things other people want too, is to actually have this feature. I just
don't want it to be broken, and today it is broken because it's been
implemented based on incorrect assumptions.

On 20 Feb 2016 14:32, "Lars Ingebrigtsen" <larsi@gnus.org> wrote:

> It seems to me that we're considering using the Unicode decomposition
> rules for "variant detection" because it's what we have.  But this
> doesn't allow people to say `C-s l' to find =C5=82 or `C-s o' to find =C3=
=B8, and
> this would obviously be something that many people would find helpful.

The Unicode collation charts <http://unicode.org/charts/collation/> do
place =C3=B8 in the "o" category. Eli said in an earlier message that the
collation charts were consulted, but when I test, that doesn't seem to be
the case.

The Unicode collation charts are the best generic solution that Unicode
gives us.

The proposal you put forward below seems very much like what I proposed
earlier: have the locale-dependent rules determine any exceptions, and
then fall back to a generic method.

The question is what that generic method should be. The current trick of
decomposing and using the first character of the decomposition is not good
and breaks down very quickly. Clearly the collation charts should be
consulted instead, but even that is not enough. I could spend quite some
time discussing all the issues that I can think of (to get an idea, look
up how Korean and Devanagari work, as well as the concept of "grapheme
clusters").
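
To make the limitation concrete, here is a minimal sketch in Python (not
Emacs's actual implementation; the helper name first_of_decomposition is
made up for illustration). It mimics the "first character of the
decomposition" trick with the standard unicodedata module and shows which
characters it can and cannot fold:

```python
import unicodedata

def first_of_decomposition(ch):
    """Return the first character of the canonical decomposition of ch.

    Hypothetical helper: characters without a decomposition are
    returned unchanged, so they never fold to a base letter.
    """
    return unicodedata.normalize("NFD", ch)[0]

# U+00E9 (e with acute) decomposes to "e" + U+0301, so it folds to "e".
print(first_of_decomposition("\u00e9").isascii())   # True
# U+00F8 (o with stroke) and U+0142 (l with stroke) have no
# decomposition, so they come back unchanged and never match "o" or "l".
print(first_of_decomposition("\u00f8").isascii())   # False
print(first_of_decomposition("\u0142").isascii())   # False
# The Hangul syllable U+D55C decomposes into three jamo, so only its
# leading consonant U+1112 survives, which is not a useful "base".
print(ascii(first_of_decomposition("\ud55c")))      # '\u1112'
```

Anything better than this has to come from data outside the decomposition
tables, such as the collation charts plus per-locale exceptions.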

> So the Unicode decomposition rules only get us halfway there.  On the
> other hand, they go to far for other users, who absolutely do not want
> `C-s o' to find =C3=B8, but would be really glad if `C-s hermes' would fi=
nd
> "Herm=C3=A9s" (or is it "Herm=C3=A8s"?  I can't even type
> So: How many characters are we really talking about?  Unicode is big and
> scary, but this only applies to alphabetical scripts, right?  That is,
> all the Latin-like scripts, and...  possibly Greek/Hebrew/Cyrillic?  I
> don't know?

Cyrillic has the same issues. Also, most of the accented characters in
Cyrillic are historical and not used today. Therefore having this feature
in Cyrillic would most definitely be useful.

> But if we only consider the Latin scripts for a moment, there aren't
> more than a few hundred Unicode points that we care about.  Basically
> all the old iso-8859-foos from around Europe.  And what we want is a way
> for people with normal keyboards (they have a-z in Latin alphabet
> countries) to search for variants.

It's more than that, because we're talking not just about single
characters but also about combinations. Of course, for European languages
this can be handled by comparing only the base character, but in other
languages this is a much more complex issue.

That said, I agree with you on your proposed approach.

> That bit is more than an evening, but is something that people would
> enjoy submitting exceptions to, I think.

You can count me in. :-)

> And then we just look up the locale, create the mapping when we type
> `C-s', and there we are.  An awesome, very useful feature that would
> annoy nobody, and that should be on by default.

That would be amazing.

Regards,
Elias

--001a1143f9481a98e3052c3015e9--