From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Eli Zaretskii <eliz@gnu.org>
Newsgroups: gmane.emacs.devel
Subject: Re: On language-dependent defaults for character-folding
Date: Fri, 19 Feb 2016 21:18:42 +0200
Message-ID: <83egc8qzjh.fsf@gnu.org>
References: <CAAdUY-KRpbjDJ6h=QOsWBpOJyJ-GP1ia70YyjwYsNe5i1S=mXg@mail.gmail.com>
	<CAAdUY-+CNrh=DKYcd89DAXNcd7d3_2VNpWHgLHF9az118u9CCQ@mail.gmail.com>
	<87io1xwq1e.fsf@wanadoo.es>
	<CAAdUY-JLb0kABXfT6XUvZ6MLFUmjtT89mXSa_mY5mv5vOm1BpA@mail.gmail.com>
	<87vb5wvzfz.fsf@mail.linkov.net> <87io1wt4cc.fsf@wanadoo.es>
	<8737syoima.fsf@mail.linkov.net> <871t8iu277.fsf@wanadoo.es>
	<83d1s28kvh.fsf@gnu.org> <87r3gis7sm.fsf@wanadoo.es>
	<83twle71xy.fsf@gnu.org> <87io1us0te.fsf@wanadoo.es>
	<83pow26svf.fsf@gnu.org> <87a8n5srbp.fsf@wanadoo.es>
	<83d1s17npz.fsf@gnu.org> <87oablfpn3.fsf@mail.linkov.net>
	<834mdd6llx.fsf@gnu.org>
	<7fbb8bc7-9a97-4bad-a103-a6690a35241d@default>
	<834mdc5w6o.fsf@gnu.org> <m2ziuxltit.fsf@newartisans.com>
	<838u2hu6aq.fsf@gnu.org> <871t899tde.fsf@gnus.org>
	<83y4ahru04.fsf@gnu.org>
	<CADtN0W+B=JZ_LKis9opETfr5q8K=rC+Xt6jGijMC3GwiGbF2RA@mail.gmail.com>
	<83fuwproyf.fsf@gnu.org>
	<CADtN0W+2CjROLMnuC8N3X3TrwvsZOmidviFjM_-AF0DKN-Wvsg@mail.gmail.com>
	<837fi0sz29.fsf@gnu.org>
	<CADtN0W+93LH5d3=joVj2xe40rramMOcURKw7QKdv_OefYCm8Ug@mail.gmail.com>
Reply-To: Eli Zaretskii <eliz@gnu.org>
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
X-Trace: ger.gmane.org 1455909573 31381 80.91.229.3 (19 Feb 2016 19:19:33 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Fri, 19 Feb 2016 19:19:33 +0000 (UTC)
Cc: larsi@gnus.org, emacs-devel@gnu.org
To: Elias =?utf-8?Q?M=C3=A5rtenson?= <lokedhs@gmail.com>
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Fri Feb 19 20:19:21 2016
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Envelope-to: ged-emacs-devel@m.gmane.org
Original-Received: from lists.gnu.org ([208.118.235.17])
	by plane.gmane.org with esmtp (Exim 4.69)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1aWqaG-0007T3-Ca
	for ged-emacs-devel@m.gmane.org; Fri, 19 Feb 2016 20:19:20 +0100
Original-Received: from localhost ([::1]:54764 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1aWqaF-0005x5-D6
	for ged-emacs-devel@m.gmane.org; Fri, 19 Feb 2016 14:19:19 -0500
Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:44586)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <eliz@gnu.org>) id 1aWqZy-0005wG-PV
	for emacs-devel@gnu.org; Fri, 19 Feb 2016 14:19:06 -0500
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <eliz@gnu.org>) id 1aWqZu-00028A-L7
	for emacs-devel@gnu.org; Fri, 19 Feb 2016 14:19:02 -0500
Original-Received: from fencepost.gnu.org ([2001:4830:134:3::e]:38470)
	by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from <eliz@gnu.org>)
	id 1aWqZu-000283-Ht; Fri, 19 Feb 2016 14:18:58 -0500
Original-Received: from 84.94.185.246.cable.012.net.il ([84.94.185.246]:4137
	helo=home-c4e4a596f7)
	by fencepost.gnu.org with esmtpsa (TLS1.2:RSA_AES_128_CBC_SHA1:128)
	(Exim 4.82) (envelope-from <eliz@gnu.org>)
	id 1aWqZt-0001cn-Rg; Fri, 19 Feb 2016 14:18:58 -0500
In-reply-to: <CADtN0W+93LH5d3=joVj2xe40rramMOcURKw7QKdv_OefYCm8Ug@mail.gmail.com>
	(message from Elias =?utf-8?Q?M=C3=A5rtenson?= on Fri,
	19 Feb 2016 21:37:26 +0800)
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic]
X-Received-From: 2001:4830:134:3::e
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/emacs-devel>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.emacs.devel:200239
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/200239>

> Date: Fri, 19 Feb 2016 21:37:26 +0800
> From: Elias Mårtenson <lokedhs@gmail.com>
> Cc: Lars Ingebrigtsen <larsi@gnus.org>, emacs-devel <emacs-devel@gnu.org>
> 
>  For example, if the buffer includes ñ (2 characters), should "C-s n"
>  find the n in it?
> 
> That depends on the locale of the user.

There are use cases that are independent of the locale.  For example,
imagine that you need to find all the literal n characters in a buffer
because you are investigating a bug in the program that produced that
buffer.  As an Emacs user, I need to do such jobs almost every day.  I
don't want the results affected by the locale.
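
A minimal sketch of such a locale-independent literal search, in
Python rather than Emacs Lisp; the function name literal_hits and the
sample string are made up for illustration.  It compares code points
only, so the precomposed U+00F1 is never reported, while the n of the
two-character sequence U+006E U+0303 is reported and flagged:

```python
import unicodedata

def literal_hits(text, ch="n"):
    """Return (offset, followed_by_combining_mark) for every literal ch.

    Pure code-point comparison: no locale and no normalization involved.
    """
    hits = []
    for i, c in enumerate(text):
        if c == ch:
            nxt = text[i + 1] if i + 1 < len(text) else ""
            # combining() returns a non-zero class for combining marks.
            hits.append((i, bool(nxt) and unicodedata.combining(nxt) != 0))
    return hits

# Hypothetical buffer contents: precomposed, decomposed, and a bare n.
sample = "a\u00f1o an\u0303o no"
print(literal_hits(sample))
```

The second element of each pair tells a plain n apart from an n that
is the base of a combining sequence.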

> However, from the point of view of a user, there should not be a visible
> difference between the precomposed and the composed variants; they are the
> exact same character.

What if the user wants to find all those places where what looks like
ñ is actually ñ?  Wouldn't that be a valid use case?

> Note: I know that it's possible that I am wrong about this and that Unicode actually _has_ said that the
> equivalence tables can be used for this purpose (I.e. decompose and only use the primary character). If that is
> the case, I'd be interested to see a reference to that, but I will still be of the same opinion that doing so will
> result in broken behaviour for a certain class of user.

The reference you are looking for is the Unicode Standard itself.  It
says to use the normalization forms, see for example section 5.16
there.

> The equivalence tables explain that the precomposed character U+00F1 is equivalent to the specific
> sequence U+006E U+0303. That is all it says. It does not say that ñ is a variation of n. It's an instruction how
> to construct a given character.

Every character-folding search implementation decomposes characters
before matching them.  So does Emacs.  We didn't invent this, and we
certainly didn't use the decompositions where they weren't supposed to
be used.  It's not a trick, it's what everyone else does to do the
job.  See the ICU library, for example.
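
As a rough illustration of the decompose-then-match idea (a simplified
sketch in Python, not the code Emacs or ICU actually use; the names
fold and folded_contains are invented here), one can normalize to NFD
and drop the combining marks before comparing:

```python
import unicodedata

def fold(text):
    """Decompose to NFD, then drop combining marks (a crude folding)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def folded_contains(haystack, needle):
    """True if needle occurs in haystack once both have been folded."""
    return fold(needle) in fold(haystack)

# Both spellings of the same word fold to the same base letters.
print(fold("ma\u00f1ana") == fold("man\u0303ana"))
print(folded_contains("ma\u00f1ana", "n"))
```

This sketch only answers whether a match exists; a real implementation
also has to map match positions back to the original text.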

> The decompositions are used in the normalisation forms to ensure that the two variants are treated equally
> (such as the two alternative representations of ñ that we have been discussing).

Yes, and any character-folding search uses normalization forms as
well.

>  Indeed,
>  the locale in which Emacs started says almost nothing about the
>  documents being edited, nor even about the user's preferences: it is
>  easy to imagine a user whose "native" locale is X starting Emacs in
>  another locale.
> 
> Yes. I am fully aware of this. But so be it. Having applications work differently depending on the locale of the
> environment the application was started in is nothing new.

It's not new.  It's old.  We should move on to more general
environments that support multiple languages.  Emacs is such an
environment.  The old l10n paradigms are fundamentally incompatible
with that.

>  Being a multi-lingual environment, Emacs has no real notion of the
>  locale.
> 
> Perhaps it should?

That'd be a step backward, IMO.

>  > It is, Unicode provides it. We just didn't import it yet.
>  >
>  > It does? I was looking for such tables, but didn't find them. Do you have a link?
> 
>  Look for DUCET and its tailoring data. These should be a good
>  starting point:
> 
>  http://www.unicode.org/Public/UCA/latest/
>  http://cldr.unicode.org/
> 
> Those are the decomposition charts, and don't actually say anything about equivalence outside of providing a
> canonical form for precomposed characters, as was discussed above.

Strange, I always thought the data was there.  Perhaps you should ask
a question on the Unicode mailing list, then.