From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Eli Zaretskii <eliz@gnu.org>
Newsgroups: gmane.emacs.devel
Subject: Re: On language-dependent defaults for character-folding
Date: Fri, 19 Feb 2016 12:09:44 +0200
Message-ID: <83fuwproyf.fsf@gnu.org>
References: <CAAdUY-KRpbjDJ6h=QOsWBpOJyJ-GP1ia70YyjwYsNe5i1S=mXg@mail.gmail.com>
	<CAAdUY-+CNrh=DKYcd89DAXNcd7d3_2VNpWHgLHF9az118u9CCQ@mail.gmail.com>
	<87io1xwq1e.fsf@wanadoo.es>
	<CAAdUY-JLb0kABXfT6XUvZ6MLFUmjtT89mXSa_mY5mv5vOm1BpA@mail.gmail.com>
	<87vb5wvzfz.fsf@mail.linkov.net> <87io1wt4cc.fsf@wanadoo.es>
	<8737syoima.fsf@mail.linkov.net> <871t8iu277.fsf@wanadoo.es>
	<83d1s28kvh.fsf@gnu.org> <87r3gis7sm.fsf@wanadoo.es>
	<83twle71xy.fsf@gnu.org> <87io1us0te.fsf@wanadoo.es>
	<83pow26svf.fsf@gnu.org> <87a8n5srbp.fsf@wanadoo.es>
	<83d1s17npz.fsf@gnu.org> <87oablfpn3.fsf@mail.linkov.net>
	<834mdd6llx.fsf@gnu.org>
	<7fbb8bc7-9a97-4bad-a103-a6690a35241d@default>
	<834mdc5w6o.fsf@gnu.org> <m2ziuxltit.fsf@newartisans.com>
	<838u2hu6aq.fsf@gnu.org> <871t899tde.fsf@gnus.org>
	<83y4ahru04.fsf@gnu.org>
	<CADtN0W+B=JZ_LKis9opETfr5q8K=rC+Xt6jGijMC3GwiGbF2RA@mail.gmail.com>
Reply-To: Eli Zaretskii <eliz@gnu.org>
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
X-Trace: ger.gmane.org 1455876628 10639 80.91.229.3 (19 Feb 2016 10:10:28 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Fri, 19 Feb 2016 10:10:28 +0000 (UTC)
Cc: larsi@gnus.org, emacs-devel@gnu.org
To: Elias =?utf-8?Q?M=C3=A5rtenson?= <lokedhs@gmail.com>
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Fri Feb 19 11:10:22 2016
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Envelope-to: ged-emacs-devel@m.gmane.org
Original-Received: from lists.gnu.org ([208.118.235.17])
	by plane.gmane.org with esmtp (Exim 4.69)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1aWi0y-0000jH-VM
	for ged-emacs-devel@m.gmane.org; Fri, 19 Feb 2016 11:10:21 +0100
Original-Received: from localhost ([::1]:50659 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1aWi0y-0006TF-Dh
	for ged-emacs-devel@m.gmane.org; Fri, 19 Feb 2016 05:10:20 -0500
Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:60240)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <eliz@gnu.org>) id 1aWi0a-0006Bo-Oy
	for emacs-devel@gnu.org; Fri, 19 Feb 2016 05:09:58 -0500
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <eliz@gnu.org>) id 1aWi0V-0001EZ-QT
	for emacs-devel@gnu.org; Fri, 19 Feb 2016 05:09:56 -0500
Original-Received: from fencepost.gnu.org ([2001:4830:134:3::e]:44319)
	by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from <eliz@gnu.org>)
	id 1aWi0V-0001EN-KW; Fri, 19 Feb 2016 05:09:51 -0500
Original-Received: from 84.94.185.246.cable.012.net.il ([84.94.185.246]:2911
	helo=home-c4e4a596f7)
	by fencepost.gnu.org with esmtpsa (TLS1.2:RSA_AES_128_CBC_SHA1:128)
	(Exim 4.82) (envelope-from <eliz@gnu.org>)
	id 1aWi0U-0004Du-Sz; Fri, 19 Feb 2016 05:09:51 -0500
In-reply-to: <CADtN0W+B=JZ_LKis9opETfr5q8K=rC+Xt6jGijMC3GwiGbF2RA@mail.gmail.com>
	(message from Elias =?utf-8?Q?M=C3=A5rtenson?= on Fri,
	19 Feb 2016 17:22:18 +0800)
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic]
X-Received-From: 2001:4830:134:3::e
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/emacs-devel>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.emacs.devel:200197
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/200197>

> Date: Fri, 19 Feb 2016 17:22:18 +0800
> From: Elias Mårtenson <lokedhs@gmail.com>
> Cc: Lars Ingebrigtsen <larsi@gnus.org>, emacs-devel <emacs-devel@gnu.org>
> 
> The Unicode character decomposition was never meant to be used to provide a feature such as character
> folding in Emacs.

That's not true.  Canonical equivalence, which is encoded in canonical
decompositions, is a must for searching.  Otherwise, what looks the
same on display will not be found, and will look like a bug.  See the
example I gave with ñ and ñ (the latter one is 2 characters).

So using decomposition is not a trick, it simply uses the same data
that determines equivalence of character sequences.
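
To make the ñ example concrete, here is a minimal Python sketch (not
Emacs code) that uses only the standard unicodedata module.  The
helper names canonically_equal and find_canonical are made up for
illustration.  The substring test on NFD-normalized strings is a
simplification of real canonical matching.

```python
import unicodedata

precomposed = "\u00f1"   # LATIN SMALL LETTER N WITH TILDE, one code point
decomposed = "n\u0303"   # "n" followed by COMBINING TILDE, two code points

def canonically_equal(a, b):
    """Compare two strings after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

def find_canonical(haystack, needle):
    """Simplified search: normalize both sides to NFD, then look for a substring."""
    return unicodedata.normalize("NFD", needle) in unicodedata.normalize("NFD", haystack)

# A raw comparison treats the two spellings as different strings,
# although they look identical on display.
print(precomposed == decomposed)
print(canonically_equal(precomposed, decomposed))
print(find_canonical("El ni\u00f1o", "nin\u0303o"))
```

Without the normalization step, a search for the two-character form
misses the one-character form, which is the apparent bug described
above.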

> My suggestion would be to apply several levels of comparisons:
> 
> 1. Check if the characters have locale-specific folding rules (for Swedish, this would be no more than 3-5
> characters or so). If not:
> 2. Check the equivalence according to the Unicode collation charts: http://unicode.org/charts/collation/
> 3. (maybe) Use the decomposition trick

2 and 3 are the same as we do already, AFAICT.  (Collation charts
describe ordering, which is irrelevant for searching; other than that,
you will see that Emacs already implements the data shown in
http://unicode.org/charts/collation/.)

As for the locale-specific parts: using them will only DTRT if we
assume that the majority of searches are done in buffers holding text
in the locale's language.  Is that a good assumption?  We are talking
about a multilingual Emacs, in an age of global communications, where
you can have conversations with someone on the other side of the
world, or read text that combines several languages in the same
buffer.  Do we really want to go back to the l10n days, when there was
only ever one locale that was interesting -- the current one?  I
wonder.

> As for the per-locale exception tables mentioned in point 1, I don't know if such information is easily available.

It is: Unicode provides it.  We just haven't imported it yet.

> It may be possible to extract it from the localedata files from Glibc. But even if it isn't, creating one for a
> language should be trivial since we only need a list of character groups that should _not_ be folded, which for
> most languages should be a very small list (in fact, for most(?) it's probably empty).

It's more complex than that, but patches are welcome, of course.
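
For reference, the simple version of that proposal (a per-locale list
of characters exempt from folding, with decomposition-based folding
for everything else) could be sketched in Python as below.  This is
an illustration under stated assumptions, not how Emacs does it.  The
NO_FOLD table, its Swedish entry, and the function names are all
hypothetical, and the sketch ignores the complications alluded to
above.

```python
import unicodedata

# Hypothetical per-locale exception table: characters that must NOT be folded.
# The Swedish entry is illustrative only, not taken from Unicode or Glibc data.
NO_FOLD = {
    "sv": {"\u00e5", "\u00e4", "\u00f6", "\u00c5", "\u00c4", "\u00d6"},
}

def fold_char(ch, locale=None):
    """Fold one character to its base unless the locale exempts it."""
    if ch in NO_FOLD.get(locale, ()):
        return ch
    decomposed = unicodedata.normalize("NFD", ch)
    # Drop combining marks, keeping the base character(s).
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return base or ch

def fold(text, locale=None):
    """Compose first, so decomposed input also hits the exception table."""
    text = unicodedata.normalize("NFC", text)
    return "".join(fold_char(c, locale) for c in text)

def folded_search(haystack, needle, locale=None):
    """Substring search after folding both sides the same way."""
    return fold(needle, locale) in fold(haystack, locale)

word = "r\u00e4ksm\u00f6rg\u00e5s"
print(folded_search(word, "raksmorgas"))         # no locale: everything folds
print(folded_search(word, "raksmorgas", "sv"))   # Swedish: exempt letters stay distinct
```

Note that the locale argument here describes the user, not the text
being searched, so the sketch does nothing to answer the multilingual
buffer question raised earlier.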

Note that the prerequisite for anything more complicated and elaborate
than what we have now is to re-implement character-folding on the C
level, inside search.c functions.  The current implementation is at
its limits already.  I tried to convince the interested people to do
this in C to begin with, but couldn't, and the feature was important
enough to have even in its current implementation.