From: Elias Mårtenson
Newsgroups: gmane.emacs.devel
Subject: Re: On language-dependent defaults for character-folding
Date: Sat, 20 Feb 2016 17:18:27 +0800
To: Lars Magne Ingebrigtsen
Cc: Eli Zaretskii, emacs-devel

I think your message illustrates an opinion that I am not alone in
holding: I am not against the idea of character folding. If I were, I'd
just ignore this discussion and turn the feature off. What I want, and
by the looks of things what other people want too, is to actually have
this feature. I just don't want it to be broken, and today it is broken
because it's been implemented based on incorrect assumptions.

On 20 Feb 2016 14:32, "Lars Ingebrigtsen" wrote:

> It seems to me that we're considering using the Unicode decomposition
> rules for "variant detection" because it's what we have.  But this
> doesn't allow people to say `C-s l' to find ł or `C-s o' to find ø, and
> this would obviously be something that many people would find helpful.

The Unicode collation charts do place ø in the "o" category. Eli said in
an earlier message that the collation charts were consulted, but when I
test it, that doesn't seem to be the case.

The Unicode character collation charts are the best generic solution
that Unicode gives us.

The proposal you put forward below seems very much like what I proposed
earlier: have the locale-dependent rules determine any exceptions, and
then fall back to a generic method.

The question is what that generic method should be. The current trick of
decomposing and using the first character of the decomposition is not
good and breaks down very quickly. Clearly the collation charts should
be consulted instead, but this is not enough.
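To make the decomposition problem concrete, here is a minimal Python
sketch. It is not the Emacs implementation: the names base_char and
fold, and the LOCALE_EXCEPTIONS table, are made up for illustration, and
the table entries are assumptions rather than data taken from the
Unicode collation charts. It only shows the shape of "locale exceptions
first, generic fallback second", with the decomposition trick standing
in as the fallback.

```python
import unicodedata

# Hypothetical per-locale exception table (illustrative assumption only).
LOCALE_EXCEPTIONS = {
    # Fold o-with-stroke and l-with-stroke, which have no decomposition.
    "en": {"\u00f8": "o", "\u0142": "l"},
    # Swedish treats a-ring, a-diaeresis and o-diaeresis as separate
    # letters, so they map to themselves and are never folded.
    "sv": {"\u00e5": "\u00e5", "\u00e4": "\u00e4", "\u00f6": "\u00f6"},
}


def base_char(ch):
    """Generic fallback: first code point of the canonical decomposition."""
    return unicodedata.normalize("NFD", ch)[0]


def fold(text, locale):
    """Fold text for searching: locale exceptions win, else base_char."""
    exceptions = LOCALE_EXCEPTIONS.get(locale, {})
    return "".join(exceptions.get(ch, base_char(ch)) for ch in text.lower())


if __name__ == "__main__":
    print(base_char("\u00e9"))        # e-acute decomposes, so this gives "e"
    print(base_char("\u00f8"))        # o-with-stroke is returned unchanged
    print(fold("Herm\u00e8s", "en"))  # hermes
    print(fold("\u00e5r", "sv"))      # a-ring is kept under "sv"
```

With decomposition alone, é folds to e but ø and ł stay as they are,
which is the gap described in the quoted text above. The exception table
covers both directions: it adds folds the generic method misses, and it
blocks folds that a given locale does not want.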
I could spend quite some time discussing all the issues that I can think
of (to get an idea, look up how Korean and Devanagari work, as well as
the concept of "grapheme clusters").

> So the Unicode decomposition rules only get us halfway there.  On the
> other hand, they go too far for other users, who absolutely do not want
> `C-s o' to find ø, but would be really glad if `C-s hermes' would find
> "Hermés" (or is it "Hermès"?  I can't even type
> So: How many characters are we really talking about?  Unicode is big and
> scary, but this only applies to alphabetical scripts, right?  That is,
> all the Latin-like scripts, and...  possibly Greek/Hebrew/Cyrillic?  I
> don't know?

Cyrillic has the same issues. Also, most of the accented characters in
Cyrillic are historical and not used today. Therefore having this
feature in Cyrillic would most definitely be useful.

> But if we only consider the Latin scripts for a moment, there aren't
> more than a few hundred Unicode points that we care about.  Basically
> all the old iso-8859-foos from around Europe.  And what we want is a way
> for people with normal keyboards (they have a-z in Latin alphabet
> countries) to search for variants.

It's more than that, because we're talking not just about single
characters but also about combinations. Of course, for European
languages this can be handled by comparing only the base character, but
in other languages this is a much more complex issue.

That said, I agree with you on your proposed approach.

> That bit is more than an evening, but is something that people would
> enjoy submitting exceptions to, I think.

You can count me in. :-)

> And then we just look up the locale, create the mapping when we type
> `C-s', and there we are.  An awesome, very useful feature that would
> annoy nobody, and that should be on by default.

That would be amazing.

Regards,
Elias
