From: Elias Mårtenson
Newsgroups: gmane.emacs.devel
Subject: Re: On language-dependent defaults for character-folding
Date: Fri, 19 Feb 2016 18:51:47 +0800
To: Eli Zaretskii
Cc: Lars Ingebrigtsen, emacs-devel
In-Reply-To: <83fuwproyf.fsf@gnu.org>

On 19 February 2016 at 18:09, Eli Zaretskii wrote:

> > The Unicode character decomposition was never meant to be used to
> > provide a feature such as character folding in Emacs.
>
> That's not true.  Canonical equivalence, which is encoded in canonical
> decompositions, is a must for searching.  Otherwise, what looks the
> same on display will not be found, and will look like a bug.  See the
> example I gave with ñ and ñ (the latter one is 2 characters).

Of course you have to use the decomposition algorithms to ensure that
the precomposed and decomposed variations of the same character compare
equal.

This is, however, different from using the decomposition to decompose a
character and then using the base character as the thing to match
against. The latter is what Emacs is doing today, as far as I
understand.

> 2 and 3 are the same as we do already, AFAICT.  (Collation charts
> describe ordering, which is irrelevant for searching; other than that,
> you will see that Emacs already implements the data shown in
> http://unicode.org/charts/collation/.)

The collation charts also describe equivalence. If you look at the Latin
collation chart, for example
(http://unicode.org/charts/collation/chart_Latin.html), you will see
that the characters are grouped. These are the equivalences I'm
referring to.
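
To make the distinction between canonical equivalence and base-character
matching concrete, here is a minimal sketch in Python (not Emacs Lisp,
and not a description of how Emacs implements character folding). The
helper names `canonically_equal` and `fold_to_base` are made up for this
illustration; only the standard `unicodedata` module is used.

```python
import unicodedata


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings under canonical equivalence (via NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def fold_to_base(s: str) -> str:
    """Decompose, then drop combining marks so only base characters remain.

    This only approximates "match against the base character"; it is an
    assumption made for illustration, not Emacs's actual algorithm.
    """
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


precomposed = "\u00f1"   # ñ as a single code point
decomposed = "n\u0303"   # n followed by COMBINING TILDE

# Canonical equivalence: the two spellings of ñ match each other...
print(canonically_equal(precomposed, decomposed))  # True
# ...but ñ does not match a plain n.
print(canonically_equal(precomposed, "n"))         # False
# Base-character folding goes further and makes ñ match n.
print(fold_to_base(precomposed) == "n")            # True
```

The first comparison is the part nobody disputes; the last one is the
behaviour whose default is under discussion.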

Now, I note that on these charts, U+0061 LATIN SMALL LETTER A and U+2C65
LATIN SMALL LETTER A WITH STROKE compare as different characters, and
the latter does not have a decomposition. Should this also be addressed?

> As for the locale-specific parts: using that will only DTRT if we
> assume that the majority of searches are done in buffers holding text
> in locale's language.  Is that a good assumption?

My opinion is that the default search behaviour should depend primarily
on the locale of the entire Emacs session, i.e. the locale of the user
starting the application. I'm not disagreeing that allowing a
buffer-local locale to override this behaviour is a good idea, but as a
Swedish speaker I really see å, ä and a as completely separate things,
even if the language of the buffer that I am editing happens to be
English. The equivalence of these characters is the odd behaviour here,
and the one that should be enabled explicitly.

Also, if I happen to be editing a Spanish document (I don't speak
Spanish) I would find equivalence of ñ and n to be incredibly useful,
even though Óscar would grind his teeth at it. :-)

> We are talking
> about a multilingual Emacs, in an age of global communications, where
> you can have conversations with someone on the other side of the
> world, or read text that combines several languages in the same
> buffer.  Do we really want to go back to the l10n days, when there was
> ever only one locale that was interesting -- the current one?  I
> wonder.

Actually, I think so. This is because search equivalence is inherently a
local thing. The behaviour of search is more tied to a user's preference
than to the locale of the given buffer, in most cases.

At least that's my opinion. The bike shed can have many colours.

> It is, Unicode provides it.  We just didn't import it yet.

It does? I was looking for such tables, but didn't find any. Do you have
a link?

> It's more complex than that, but patches are welcome, of course.

Having spent the better part of the day trying to solve a C++ design
problem that I had originally hand-waved as being trivial, I know what
you mean…

> Note that the prerequisite for anything more complicated and elaborate
> than what we have now is to re-implement character-folding on the C
> level, inside search.c functions.  The current implementation is at
> its limits already.  I tried to convince the interested people to do
> this in C to begin with, but couldn't, and the feature was important
> enough to have even in its current implementation.

I'm not going to offer to do this until I'm sure that I can have the
copyright assignment done. But I am interested in it.

Regards,
Elias