From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Eli Zaretskii <eliz@gnu.org>
Newsgroups: gmane.emacs.devel
Subject: Re: On language-dependent defaults for character-folding
Date: Sat, 13 Feb 2016 10:49:30 +0200
Message-ID: <834mdd6llx.fsf@gnu.org>
References: <CAAdUY-KRpbjDJ6h=QOsWBpOJyJ-GP1ia70YyjwYsNe5i1S=mXg@mail.gmail.com>
	<87mvr9wxqz.fsf@wanadoo.es>
	<CAAdUY-+CNrh=DKYcd89DAXNcd7d3_2VNpWHgLHF9az118u9CCQ@mail.gmail.com>
	<87io1xwq1e.fsf@wanadoo.es>
	<CAAdUY-JLb0kABXfT6XUvZ6MLFUmjtT89mXSa_mY5mv5vOm1BpA@mail.gmail.com>
	<87vb5wvzfz.fsf@mail.linkov.net> <87io1wt4cc.fsf@wanadoo.es>
	<8737syoima.fsf@mail.linkov.net> <871t8iu277.fsf@wanadoo.es>
	<83d1s28kvh.fsf@gnu.org> <87r3gis7sm.fsf@wanadoo.es>
	<83twle71xy.fsf@gnu.org> <87io1us0te.fsf@wanadoo.es>
	<83pow26svf.fsf@gnu.org> <87a8n5srbp.fsf@wanadoo.es>
	<83d1s17npz.fsf@gnu.org> <87oablfpn3.fsf@mail.linkov.net>
Reply-To: Eli Zaretskii <eliz@gnu.org>
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
X-Trace: ger.gmane.org 1455353390 25923 80.91.229.3 (13 Feb 2016 08:49:50 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Sat, 13 Feb 2016 08:49:50 +0000 (UTC)
Cc: ofv@wanadoo.es, emacs-devel@gnu.org
To: Juri Linkov <juri@linkov.net>
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Sat Feb 13 09:49:42 2016
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Envelope-to: ged-emacs-devel@m.gmane.org
Original-Received: from lists.gnu.org ([208.118.235.17])
	by plane.gmane.org with esmtp (Exim 4.69)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1aUVtc-0005E2-TD
	for ged-emacs-devel@m.gmane.org; Sat, 13 Feb 2016 09:49:41 +0100
Original-Received: from localhost ([::1]:40434 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1aUVtb-0005YZ-Uc
	for ged-emacs-devel@m.gmane.org; Sat, 13 Feb 2016 03:49:39 -0500
Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:59416)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <eliz@gnu.org>) id 1aUVtX-0005YE-Qt
	for emacs-devel@gnu.org; Sat, 13 Feb 2016 03:49:37 -0500
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <eliz@gnu.org>) id 1aUVtU-0006AC-Kz
	for emacs-devel@gnu.org; Sat, 13 Feb 2016 03:49:35 -0500
Original-Received: from fencepost.gnu.org ([2001:4830:134:3::e]:52048)
	by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from <eliz@gnu.org>)
	id 1aUVtU-0006A8-HQ; Sat, 13 Feb 2016 03:49:32 -0500
Original-Received: from 84.94.185.246.cable.012.net.il ([84.94.185.246]:3597
	helo=home-c4e4a596f7)
	by fencepost.gnu.org with esmtpsa (TLS1.2:RSA_AES_128_CBC_SHA1:128)
	(Exim 4.82) (envelope-from <eliz@gnu.org>)
	id 1aUVtR-0003n9-Ow; Sat, 13 Feb 2016 03:49:30 -0500
In-reply-to: <87oablfpn3.fsf@mail.linkov.net> (message from Juri Linkov on
	Sat, 13 Feb 2016 01:57:33 +0200)
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic]
X-Received-From: 2001:4830:134:3::e
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/emacs-devel>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.emacs.devel:199859
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/199859>

> From: Juri Linkov <juri@linkov.net>
> Cc: Óscar Fuentes <ofv@wanadoo.es>,  emacs-devel@gnu.org
> Date: Sat, 13 Feb 2016 01:57:33 +0200
> 
> Can't we somehow use the same char-folding as is implemented in
> ICU String Search Service (this is also used for search in Chromium):
> http://userguide.icu-project.org/collation/icu-string-search-service
> that supports matching of accented letters, conjoined letters,
> and ignorable punctuation.
> 
> As is described in http://userguide.icu-project.org/collation/concepts
> there are several levels of character matching:
> 
> 1. Primary Level: differences between base characters
> 
> 2. Secondary Level: Accents in the characters
> 
> 3. Tertiary Level: Upper and lower case differences in characters
> 
> 4. Quaternary Level: Punctuation is ignored (where e.g. snake-cased
>    “black_bird” matches camel-cased “blackBird”)
> 
> 5. Identical Level
> 
> Maybe our customization could provide options to choose
> between all these levels?

That's the final goal, yes.  The current implementation is just the
initial step, and it basically does just item #1.  (The list above is
about collation, not about searching, so the wording does not really
fit the searching use case.  Also, it just reiterates what Unicode
TR#10, http://unicode.org/reports/tr10/, specifies.)
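
As a rough illustration of how the first three levels differ when
used for matching, here is a minimal Python sketch.  It is a
simplification built on the standard unicodedata module, not the
TR#10 algorithm and not what Emacs or ICU do; the names strip_marks,
match_key and matches are hypothetical, and "level" here only
approximates the comparison strengths listed above.

```python
import unicodedata

def strip_marks(s):
    """Decompose S canonically and drop all combining marks."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def match_key(s, level):
    """Return a comparison key for S at an approximate strength LEVEL.

    1: base characters only (accents and case ignored)
    2: accents significant, case ignored
    3: accents and case both significant
    """
    if level == 1:
        return strip_marks(s).casefold()
    if level == 2:
        return unicodedata.normalize("NFD", s.casefold())
    if level == 3:
        return unicodedata.normalize("NFD", s)
    raise ValueError("level must be 1, 2, or 3")

def matches(a, b, level):
    """True if A and B are considered equal at strength LEVEL."""
    return match_key(a, level) == match_key(b, level)
```

With these definitions, "Résumé" and "resume" compare equal only at
level 1, while "Résumé" and "résumé" compare equal at levels 1 and 2
but not at level 3.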

The implementation should really be on the C level, like the
case-folding support.  The current implementation isn't, and therefore
has several disadvantages, some of which were already pointed out
(e.g., the regexp it uses gets exposed in some situations and
surprises users).  For these and other reasons, I think
we should replace the current implementation with one that's in
search_buffer, driven by tables generated from the Unicode database.
I also think we will be unable to move to the higher levels mentioned
above without first moving the implementation into search_buffer.

Volunteers are welcome to work on that.  Doing this will eventually
require using the data in DUCET (Default Unicode Collation Element
Table) and CLDR (Common Locale Data Repository), I think, to support
both language-independent and language-dependent folding.  But
this is only needed for the next levels; the current level, which
basically only looks at the base character, doesn't need fancy
databases apart from what we already have.
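
To make "only looks at the base character" concrete, here is a small
Python sketch of a search that folds both the text and the search
string to base characters using nothing but the decomposition data in
the Unicode database.  It is an illustration under stated
assumptions, not the Emacs implementation: fold_with_index and
folded_find are hypothetical names, compatibility decomposition
(NFKD) is an assumed choice, and case is deliberately left alone.

```python
import unicodedata

def fold_with_index(text):
    """Fold TEXT to base characters.

    Returns the folded string and a list that maps each folded
    character back to the index of the original character it came from.
    """
    folded, index = [], []
    for i, ch in enumerate(text):
        # Decompose the character and keep only non-combining pieces.
        for d in unicodedata.normalize("NFKD", ch):
            if not unicodedata.combining(d):
                folded.append(d)
                index.append(i)
    return "".join(folded), index

def folded_find(haystack, needle):
    """Return the original index in HAYSTACK where NEEDLE first
    matches after folding, or -1 if there is no match."""
    hay, index = fold_with_index(haystack)
    pat, _ = fold_with_index(needle)
    if not pat:
        return -1
    pos = hay.find(pat)
    return index[pos] if pos >= 0 else -1
```

For example, folded_find("Un café noir", "cafe") finds the match
at the "c" of "café", whether the text stores "é" precomposed or as
"e" followed by a combining acute accent.  Language-dependent rules
are exactly what such a table-free approach cannot express.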

At the time, no one stepped forward to do this on the C level, and the
current implementation was considered good enough as a first
step.