From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Eli Zaretskii <eliz@gnu.org>
Newsgroups: gmane.emacs.devel
Subject: Re: On language-dependent defaults for character-folding
Date: Tue, 23 Feb 2016 19:11:52 +0200
Message-ID: <83wppvic6f.fsf@gnu.org>
References: <CAAdUY-KRpbjDJ6h=QOsWBpOJyJ-GP1ia70YyjwYsNe5i1S=mXg@mail.gmail.com>
	<834mdc5w6o.fsf@gnu.org> <m2ziuxltit.fsf@newartisans.com>
	<838u2hu6aq.fsf@gnu.org> <871t899tde.fsf@gnus.org>
	<83y4ahru04.fsf@gnu.org>
	<CADtN0W+B=JZ_LKis9opETfr5q8K=rC+Xt6jGijMC3GwiGbF2RA@mail.gmail.com>
	<83fuwproyf.fsf@gnu.org>
	<CADtN0W+2CjROLMnuC8N3X3TrwvsZOmidviFjM_-AF0DKN-Wvsg@mail.gmail.com>
	<837fi0sz29.fsf@gnu.org>
	<CADtN0W+93LH5d3=joVj2xe40rramMOcURKw7QKdv_OefYCm8Ug@mail.gmail.com>
	<83egc8qzjh.fsf@gnu.org>
	<CADtN0WL-rX5xzw75P=qLEYFYzLWkuCuntE+gf2BAhn981_jWBg@mail.gmail.com>
	<87egc7evu3.fsf@gnus.org> <83io1jpt4u.fsf@gnu.org>
	<87povqhj25.fsf@gnus.org>
	<CADtN0W+qyRZFwDR+MtLxBdayLbzajwbS1_ykufSg1OQLU8yY8w@mail.gmail.com>
	<83povqm3dw.fsf@gnu.org>
	<CADtN0WKC0HTy=WfJs-Frt_S149Ku4jNK-_jLneuo8_0ELgUjVQ@mail.gmail.com>
	<E1aXulA-0008LS-9A@fencepost.gnu.org> <831t84lgsa.fsf@gnu.org>
	<87io1gz3i8.fsf@mail.linkov.net>
Reply-To: Eli Zaretskii <eliz@gnu.org>
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
X-Trace: ger.gmane.org 1456247558 7514 80.91.229.3 (23 Feb 2016 17:12:38 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Tue, 23 Feb 2016 17:12:38 +0000 (UTC)
Cc: larsi@gnus.org, lokedhs@gmail.com, rms@gnu.org, emacs-devel@gnu.org
To: Juri Linkov <juri@linkov.net>
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Tue Feb 23 18:12:33 2016
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Envelope-to: ged-emacs-devel@m.gmane.org
Original-Received: from lists.gnu.org ([208.118.235.17])
	by plane.gmane.org with esmtp (Exim 4.69)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1aYGVg-0006Kv-PH
	for ged-emacs-devel@m.gmane.org; Tue, 23 Feb 2016 18:12:28 +0100
Original-Received: from localhost ([::1]:58670 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1aYGVg-0001pN-7Q
	for ged-emacs-devel@m.gmane.org; Tue, 23 Feb 2016 12:12:28 -0500
Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:57221)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <eliz@gnu.org>) id 1aYGVN-0001Sh-Qb
	for emacs-devel@gnu.org; Tue, 23 Feb 2016 12:12:13 -0500
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <eliz@gnu.org>) id 1aYGVK-00077T-6s
	for emacs-devel@gnu.org; Tue, 23 Feb 2016 12:12:09 -0500
Original-Received: from fencepost.gnu.org ([2001:4830:134:3::e]:44594)
	by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from <eliz@gnu.org>)
	id 1aYGVK-00077P-3T; Tue, 23 Feb 2016 12:12:06 -0500
Original-Received: from 84.94.185.246.cable.012.net.il ([84.94.185.246]:2589
	helo=home-c4e4a596f7)
	by fencepost.gnu.org with esmtpsa (TLS1.2:RSA_AES_128_CBC_SHA1:128)
	(Exim 4.82) (envelope-from <eliz@gnu.org>)
	id 1aYGVC-0004MZ-Cb; Tue, 23 Feb 2016 12:11:58 -0500
In-reply-to: <87io1gz3i8.fsf@mail.linkov.net> (message from Juri Linkov on
	Tue, 23 Feb 2016 02:14:55 +0200)
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic]
X-Received-From: 2001:4830:134:3::e
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/emacs-devel>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.emacs.devel:200548
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/200548>

> From: Juri Linkov <juri@linkov.net>
> Cc: rms@gnu.org,  larsi@gnus.org,  lokedhs@gmail.com,  emacs-devel@gnu.org
> Date: Tue, 23 Feb 2016 02:14:55 +0200
> 
> > But the most basic issue is that any significant development in these
> > directions requires re-implementing the feature on the C level, using
> > char-tables for folding, like we do with case-mapping.  So until
> > someone steps forward for the job, all we can do is small corrections
> > to the existing implementation.
> 
> Do I understand correctly that essentially what is necessary to do on the
> C level is to extend char-tables with character insertions and deletions,
> so in addition to canonical equivalence mappings (like those used for the
> existing case-mappings) char-tables should also support matching of
> multi-character additions (like combining accents in the search
> string) and deletions (like combining accents from the search string
> missing in the search text)?

I'm not sure I understand why you think char-tables need to be
extended in support of folding search.  AFAIU, we need a way to
normalize each character, both in the search string and in the
buffer/string we search.  This normalization involves decomposition
followed by reordering the combining diacritics into a canonical
order.  Then we just match one against the other, almost as usual
("almost" because we need to backtrack in the buffer/string upon
mismatch).  (Of course, decomposition of buffer/string text needs to
be done on the fly, but this is an implementation detail unrelated to
this discussion.)
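
To make the normalize-then-match idea concrete, here is a rough sketch
in Python rather than C or Emacs Lisp.  It leans on the standard
unicodedata module, whose NFD form is exactly "decomposition followed
by canonical reordering".  The function names are made up for
illustration, and the sketch normalizes whole strings up front instead
of decomposing on the fly, so it says nothing about how Emacs would
actually implement this.

```python
import unicodedata

def canonical(text):
    """Canonical decomposition plus canonical reordering (Unicode NFD)."""
    return unicodedata.normalize("NFD", text)

def fold(text):
    """Decompose, then drop combining marks (nonzero combining class)."""
    return "".join(c for c in canonical(text)
                   if unicodedata.combining(c) == 0)

def folded_search(needle, haystack):
    """Index of the first match in the *folded* haystack, or -1.

    Assumption: both sides are folded eagerly; a real implementation
    would decompose the buffer text lazily and map positions back.
    """
    return fold(haystack).find(fold(needle))

# Precomposed and decomposed spellings normalize to the same sequence.
same_letter = canonical("\u00e9") == canonical("e\u0301")

# Combining marks typed in a different order end up in canonical order.
same_order = canonical("q\u0307\u0323") == canonical("q\u0323\u0307")
```

Note that the index returned is a position in the folded text, not in
the original, and that the sketch does no backtracking or boundary
checks of the kind mentioned above.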

So we need a char-table that maps each character into its
decomposition sequence, which AFAIR is something the current
char-tables can support already.  Am I missing something?
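
As an illustration of the kind of table I mean, the following Python
sketch builds a plain dictionary from each character to its full
canonical decomposition.  The dictionary merely stands in for a
char-table, and the name and the range of code points scanned are
arbitrary choices for the example.

```python
import unicodedata

def build_decomposition_table(limit=0x2000):
    """Map each character below LIMIT to its canonical decomposition.

    Characters that do not decompose are left out of the table, so a
    lookup miss means "the character maps to itself".
    """
    table = {}
    for code in range(limit):
        ch = chr(code)
        decomposed = unicodedata.normalize("NFD", ch)
        if decomposed != ch:
            table[ch] = tuple(decomposed)
    return table

table = build_decomposition_table()

# U+00E9 maps to the base letter plus a combining acute accent.
e_acute = table["\u00e9"]

# U+1EA5 decomposes recursively into a base letter and two marks.
a_circumflex_acute = table["\u1ea5"]
```

Characters such as U+00F8 (o with stroke) have no canonical
decomposition and therefore do not appear in such a table at all.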

If you are interested in the details, I suggest reading
http://unicode.org/reports/tr10/ and in particular
http://unicode.org/reports/tr10/#Searching, which deals specifically
with searching.  http://www.unicode.org/notes/tn5/ is also useful
reading.

> > For example, the default state of character-folding might depend on
> > the locale's language -- we could turn it off by default for languages
> > whose users expressed dissatisfaction with the feature.  We could also
> > augment the regular expressions created for folding the search string
> > by filtering out variants that users of a particular language don't
> > want.  If people think these ideas will make more users happy, we can
> > work on that.
> 
> It seems two user variables are necessary for customization:
> 
> 1. inclusive folding groups that will include by default such pairs
>    as o - ø, l - ł added to the Unicode decomposition-based rules,
>    and allow the users to add more rules;
> 
> 2. exclusive folding groups to exclude locale/language-dependent rules from
>    the default mappings above, e.g. removing n - ñ for the "es" locale.

I think we should add those in item 1 unconditionally (i.e. include
them in the default mappings), and then exclude some of them under the
rules you describe in item 2.  Then the problem becomes easier, as we
only need to filter out some mappings, as determined by a single user
variable (whose default can come from the user locale).
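
Here is a small Python sketch of that arrangement: the extra pairs are
always part of the default mappings, and a single set of excluded
characters filters some of them out.  Every name and table below is
hypothetical, including the choice to key the exclusions on the
language part of a locale name; the only real API used is the standard
unicodedata module.

```python
import unicodedata

# Pairs with no Unicode decomposition, included by default (item 1).
EXTRA_FOLDS = {"\u00f8": "o", "\u00d8": "O", "\u0142": "l", "\u0141": "L"}

# Hypothetical per-language exclusions (item 2).
LOCALE_EXCLUSIONS = {"es": frozenset({"\u00f1", "\u00d1"})}

def exclusions_for_locale(locale_name):
    """Default value of the single user variable, from a locale name."""
    language = locale_name.split("_")[0].split(".")[0].lower()
    return LOCALE_EXCLUSIONS.get(language, frozenset())

def base_char(ch):
    """Fold one character using the default mappings."""
    if ch in EXTRA_FOLDS:
        return EXTRA_FOLDS[ch]
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed
                   if unicodedata.combining(c) == 0)

def fold(text, excluded=frozenset()):
    """Fold TEXT, leaving characters in EXCLUDED untouched."""
    # Compose first so that excluded characters are single code points.
    composed = unicodedata.normalize("NFC", text)
    return "".join(ch if ch in excluded else base_char(ch)
                   for ch in composed)

default_result = fold("se\u00f1or s\u00f8k \u0141\u00f3d\u017a")
spanish_result = fold("se\u00f1or", exclusions_for_locale("es_ES.UTF-8"))
```

With no exclusions everything folds to its base letter; with the "es"
exclusions the n with tilde is left alone while other mappings still
apply.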

The additional mappings can be picked up from the file decomps.txt in
the UCA database.