From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Juri Linkov <juri@linkov.net>
Newsgroups: gmane.emacs.devel
Subject: Re: On language-dependent defaults for character-folding
Date: Wed, 24 Feb 2016 02:16:23 +0200
Organization: LINKOV.NET
Message-ID: <8737sjufmw.fsf@mail.linkov.net>
References: <CAAdUY-KRpbjDJ6h=QOsWBpOJyJ-GP1ia70YyjwYsNe5i1S=mXg@mail.gmail.com>
	<838u2hu6aq.fsf@gnu.org> <871t899tde.fsf@gnus.org>
	<83y4ahru04.fsf@gnu.org>
	<CADtN0W+B=JZ_LKis9opETfr5q8K=rC+Xt6jGijMC3GwiGbF2RA@mail.gmail.com>
	<83fuwproyf.fsf@gnu.org>
	<CADtN0W+2CjROLMnuC8N3X3TrwvsZOmidviFjM_-AF0DKN-Wvsg@mail.gmail.com>
	<837fi0sz29.fsf@gnu.org>
	<CADtN0W+93LH5d3=joVj2xe40rramMOcURKw7QKdv_OefYCm8Ug@mail.gmail.com>
	<83egc8qzjh.fsf@gnu.org>
	<CADtN0WL-rX5xzw75P=qLEYFYzLWkuCuntE+gf2BAhn981_jWBg@mail.gmail.com>
	<87egc7evu3.fsf@gnus.org> <83io1jpt4u.fsf@gnu.org>
	<87povqhj25.fsf@gnus.org>
	<CADtN0W+qyRZFwDR+MtLxBdayLbzajwbS1_ykufSg1OQLU8yY8w@mail.gmail.com>
	<83povqm3dw.fsf@gnu.org>
	<CADtN0WKC0HTy=WfJs-Frt_S149Ku4jNK-_jLneuo8_0ELgUjVQ@mail.gmail.com>
	<E1aXulA-0008LS-9A@fencepost.gnu.org> <831t84lgsa.fsf@gnu.org>
	<87io1gz3i8.fsf@mail.linkov.net> <83wppvic6f.fsf@gnu.org>
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
X-Trace: ger.gmane.org 1456273726 16705 80.91.229.3 (24 Feb 2016 00:28:46 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Wed, 24 Feb 2016 00:28:46 +0000 (UTC)
Cc: larsi@gnus.org, lokedhs@gmail.com, rms@gnu.org, emacs-devel@gnu.org
To: Eli Zaretskii <eliz@gnu.org>
In-Reply-To: <83wppvic6f.fsf@gnu.org> (Eli Zaretskii's message of "Tue, 23 Feb
	2016 19:11:52 +0200")
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/25.0.91 (x86_64-pc-linux-gnu)
Xref: news.gmane.org gmane.emacs.devel:200574
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/200574>

>> > But the most basic issue is that any significant development in these
>> > directions require to re-implement the feature on the C level, and use
>> > char-tables for folding, like we do with case-mapping.  So until
>> > someone steps forward for the job, all we can do is small corrections
>> > to the existing implementation.
>>
>> Do I understand correctly that essentially what is necessary to do on the
>> C level is to extend char-tables with character insertions and deletions,
>> so in addition to canonical equivalence mappings (like are used for the
>> existing case-mappings) char-tables should also support matching of
>> multi-character additions (like combining accents in the search
>> string) and deletions (like combining accents from the search string
>> missing in the search text)?
>
> I'm not sure I understand why you think char-tables need to be
> extended in support of folding search.  AFAIU, we need a way to
> normalize each character, both in the search string and in the
> buffer/string we search.  This normalization involves decomposition
> followed by reordering the combining diacritics into a canonical
> order.  Then we just match one against the other, almost as usual
> ("almost" because we need to backtrack in the buffer/string upon
> mismatch).  (Of course, decomposition of buffer/string text needs to
> be done on the fly, but this is an implementation detail unrelated to
> this discussion.)
>
> So we need a char-table that maps each character into its
> decomposition sequence, which AFAIR is something the current
> char-tables can support already.  Am I missing something?

Searching for a base character and matching a sequence of characters
(e.g. a base character and combining accents) might already be possible
with the current char-tables indexed by a base character.  But I see
no way to specify in a char-table a mapping that says e.g. that
a character should be skipped in the search buffer.  Maybe this need
could be avoided in an asymmetric search with combining characters
in the search buffer, but it is still required for ignorable characters.
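
As a rough illustration of the "skip" case (in Python rather than
Emacs Lisp or C; the function names and the choice of category Cf as
"ignorable" are assumptions made for this sketch, not anything that
exists in Emacs), folding has to be able to map a character to an
empty sequence:

```python
import unicodedata

def fold(text):
    """Fold TEXT: decompose canonically, then drop accents and ignorables."""
    folded = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) != 0:
            continue  # combining mark: maps to nothing
        if unicodedata.category(ch) == "Cf":
            continue  # assumption: treat format characters as ignorable
        folded.append(ch)
    return "".join(folded)

def folded_search(needle, haystack):
    """Return True if NEEDLE occurs in HAYSTACK after folding both."""
    return fold(needle) in fold(haystack)
```

The two `continue` branches are the "skip this character" mappings
meant above.  Note that this sketch folds both sides symmetrically.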

> If you are interested in the details, I suggest reading
> http://unicode.org/reports/tr10/ and in particular
> http://unicode.org/reports/tr10/#Searching, which deals specifically
> with searching.  http://www.unicode.org/notes/tn5/ is also a useful
> reading.

Thanks, looks like a complete specification with comprehensive answers
to most questions.

>> > For example, the default state of character-folding might depend on
>> > the locale's language -- we could turn it off by default for languages
>> > whose users expressed dissatisfaction with the feature.  We could also
>> > augment the regular expressions created for folding the search string
>> > by filtering out variants that users of a particular language don't
>> > want.  If people think these ideas will make more users happy, we can
>> > work on that.
>>
>> It seems two user variables are necessary for customization:
>>
>> 1. inclusive folding groups that will include by default such pairs
>>    as o - ø, l - ł added to the Unicode decomposition-based rules,
>>    and allow the users to add more rules;
>>
>> 2. exclusive folding groups to exclude locale/language-dependent rules from
>>    the default mappings above, e.g. removing n - ñ for the "es" locale.
>
> I think we should add those in item 1 unconditionally (i.e. include
> them in the default mappings), and then exclude some of them under the
> rules you describe in item 2.  Then the problem becomes easier, as we
> only need to filter out some mappings, as determined by a single user
> variable (whose default can come from the user locale).

Better to have 4 variables (2 internal + 2 user customizable variables):

1.1. (internal) default mappings with additional data from decomps.txt

1.2. user mappings to add to the default list

2.1. (internal) locale-dependent mappings to remove from the default list

2.2. user mappings to remove from the default list
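
A sketch of how the four lists might combine (in Python for brevity;
all names, the sample data and the order of application are
assumptions for illustration only, not a proposed implementation):

```python
# 1.1 (internal) default mappings; a tiny sample, not real decomps.txt data
DEFAULT_MAPPINGS = {"o": {"ø"}, "l": {"ł"}, "n": {"ñ"}}

# 2.1 (internal) locale-dependent mappings to remove from the defaults
LOCALE_EXCLUSIONS = {"es": {"n": {"ñ"}}}

def effective_mappings(locale, user_add=None, user_remove=None):
    """Combine the defaults with user additions (1.2), then apply
    locale exclusions (2.1) and user exclusions (2.2)."""
    result = {base: set(variants) for base, variants in DEFAULT_MAPPINGS.items()}
    for base, variants in (user_add or {}).items():
        result.setdefault(base, set()).update(variants)
    for removals in (LOCALE_EXCLUSIONS.get(locale, {}), user_remove or {}):
        for base, variants in removals.items():
            if base in result:
                result[base] -= variants
    return result
```

With this sample data, `effective_mappings("es")` no longer folds
n to ñ, while other locales keep the mapping unless the user removes it.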

> The additional mappings can be picked up from the file decomps.txt in
> the UCA database.

It would be good to find all differences between UnicodeData.txt and
decomps.txt.  Is this the latest version?
http://unicode.org/Public/UCA/6.3.0/decomps.txt