From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Eli Zaretskii <eliz@gnu.org>
Newsgroups: gmane.emacs.devel
Subject: Re: On language-dependent defaults for character-folding
Date: Wed, 24 Feb 2016 20:39:50 +0200
Message-ID: <83fuwigdft.fsf@gnu.org>
References: <CAAdUY-KRpbjDJ6h=QOsWBpOJyJ-GP1ia70YyjwYsNe5i1S=mXg@mail.gmail.com>
	<838u2hu6aq.fsf@gnu.org> <871t899tde.fsf@gnus.org>
	<83y4ahru04.fsf@gnu.org>
	<CADtN0W+B=JZ_LKis9opETfr5q8K=rC+Xt6jGijMC3GwiGbF2RA@mail.gmail.com>
	<83fuwproyf.fsf@gnu.org>
	<CADtN0W+2CjROLMnuC8N3X3TrwvsZOmidviFjM_-AF0DKN-Wvsg@mail.gmail.com>
	<837fi0sz29.fsf@gnu.org>
	<CADtN0W+93LH5d3=joVj2xe40rramMOcURKw7QKdv_OefYCm8Ug@mail.gmail.com>
	<83egc8qzjh.fsf@gnu.org>
	<CADtN0WL-rX5xzw75P=qLEYFYzLWkuCuntE+gf2BAhn981_jWBg@mail.gmail.com>
	<87egc7evu3.fsf@gnus.org> <83io1jpt4u.fsf@gnu.org>
	<87povqhj25.fsf@gnus.org>
	<CADtN0W+qyRZFwDR+MtLxBdayLbzajwbS1_ykufSg1OQLU8yY8w@mail.gmail.com>
	<83povqm3dw.fsf@gnu.org>
	<CADtN0WKC0HTy=WfJs-Frt_S149Ku4jNK-_jLneuo8_0ELgUjVQ@mail.gmail.com>
	<E1aXulA-0008LS-9A@fencepost.gnu.org> <831t84lgsa.fsf@gnu.org>
	<87io1gz3i8.fsf@mail.linkov.net> <83wppvic6f.fsf@gnu.org>
	<8737sjufmw.fsf@mail.linkov.net>
Reply-To: Eli Zaretskii <eliz@gnu.org>
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
X-Trace: ger.gmane.org 1456339239 10337 80.91.229.3 (24 Feb 2016 18:40:39 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Wed, 24 Feb 2016 18:40:39 +0000 (UTC)
Cc: larsi@gnus.org, lokedhs@gmail.com, rms@gnu.org, emacs-devel@gnu.org
To: Juri Linkov <juri@linkov.net>
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Wed Feb 24 19:40:32 2016
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Envelope-to: ged-emacs-devel@m.gmane.org
Original-Received: from lists.gnu.org ([208.118.235.17])
	by plane.gmane.org with esmtp (Exim 4.69)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1aYeMQ-0003j8-Mv
	for ged-emacs-devel@m.gmane.org; Wed, 24 Feb 2016 19:40:30 +0100
Original-Received: from localhost ([::1]:37667 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1aYeMM-00028Z-Hs
	for ged-emacs-devel@m.gmane.org; Wed, 24 Feb 2016 13:40:26 -0500
Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:58138)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <eliz@gnu.org>) id 1aYeM1-00022b-Av
	for emacs-devel@gnu.org; Wed, 24 Feb 2016 13:40:08 -0500
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <eliz@gnu.org>) id 1aYeLx-00054U-NL
	for emacs-devel@gnu.org; Wed, 24 Feb 2016 13:40:05 -0500
Original-Received: from fencepost.gnu.org ([2001:4830:134:3::e]:48510)
	by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from <eliz@gnu.org>)
	id 1aYeLx-00054Q-Ku; Wed, 24 Feb 2016 13:40:01 -0500
Original-Received: from 84.94.185.246.cable.012.net.il ([84.94.185.246]:3829
	helo=home-c4e4a596f7)
	by fencepost.gnu.org with esmtpsa (TLS1.2:RSA_AES_128_CBC_SHA1:128)
	(Exim 4.82) (envelope-from <eliz@gnu.org>)
	id 1aYeLp-0007Ig-3L; Wed, 24 Feb 2016 13:39:53 -0500
In-reply-to: <8737sjufmw.fsf@mail.linkov.net> (message from Juri Linkov on
	Wed, 24 Feb 2016 02:16:23 +0200)
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic]
X-Received-From: 2001:4830:134:3::e
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/emacs-devel>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.emacs.devel:200632
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/200632>

> From: Juri Linkov <juri@linkov.net>
> Cc: rms@gnu.org,  larsi@gnus.org,  lokedhs@gmail.com,  emacs-devel@gnu.org
> Date: Wed, 24 Feb 2016 02:16:23 +0200
> 
> > So we need a char-table that maps each character into its
> > decomposition sequence, which AFAIR is something the current
> > char-tables can support already.  Am I missing something?
> 
> Searching for a base character and matching a sequence of characters
> (e.g. a base character and combining accents) might be already possible
> by the current char-tables indexed by a base character.  But I see
> no way to specify such a mapping in a char-table that e.g.
> a character should be skipped in the search buffer.  Maybe this need
> could be avoided in an asymmetric search with combining characters
> in the search buffer, but still is required for ignorable characters.

Whether ignorables can be supported by the current char-tables depends
on the data we store in that table.  Each entry could be a vector of
objects, each of which provides both a codepoint and its weight; then
it's easy to implement skipping characters by throwing away those
whose weight is above the threshold specified by the caller.
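
To make the idea concrete, here is a rough sketch in Python rather
than Lisp.  It is an illustration only, not a proposed implementation:
the three weight levels, the names table_entry and fold, and the use
of Unicode categories to assign weights are all assumptions made for
this example, and the weights are not UCA collation weights.

```python
import unicodedata

# Hypothetical weights, invented for this sketch:
# 0 for base characters, 1 for combining marks, 2 for ignorables.
BASE, COMBINING, IGNORABLE = 0, 1, 2


def table_entry(ch):
    """Return the (codepoint, weight) pairs a table could store for ch."""
    entry = []
    # Canonical decomposition stands in for the stored decomposition data.
    for c in unicodedata.normalize("NFD", ch):
        if unicodedata.category(c) == "Cf":
            weight = IGNORABLE   # format characters, e.g. ZERO WIDTH JOINER
        elif unicodedata.combining(c):
            weight = COMBINING   # non-zero canonical combining class
        else:
            weight = BASE
        entry.append((ord(c), weight))
    return entry


def fold(text, threshold):
    """Expand text through the table, dropping weights above threshold."""
    return "".join(
        chr(codepoint)
        for ch in text
        for codepoint, weight in table_entry(ch)
        if weight <= threshold
    )
```

With a threshold of BASE, both combining accents and ignorables are
thrown away; with COMBINING, only the ignorables are skipped.  The
caller picks the threshold, and the table itself stays the same.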

> >> It seems two user variables are necessary for customization:
> >>
> >> 1. inclusive folding groups that will include by default such pairs
> >>    as o - ø, l - ł added to the Unicode decomposition-based rules,
> >>    and allow the users to add more rules;
> >>
> >> 2. exclusive folding groups to exclude locale/language-dependent rules from
> >>    the default mappings above, e.g. removing n - ñ for the "es" locale.
> >
> > I think we should add those in item 1 unconditionally (i.e. include
> > them in the default mappings), and then exclude some of them under the
> > rules you describe in item 2.  Then the problem becomes easier, as we
> > only need to filter out some mappings, as determined by a single user
> > variable (whose default can come from the user locale).
> 
> Better to have 4 variables (2 internal + 2 user customizable variables):

Can you explain why it's better to have 4 variables rather than just
one?
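
For comparison, the one-variable scheme could look roughly like the
Python sketch below.  Everything in it is hypothetical: the names
DEFAULT_FOLDS, LOCALE_EXCLUSIONS, default_exclusions and
effective_folds, and the tiny data set, which only uses the pairs
mentioned in this thread.

```python
# Hypothetical default mappings: base character -> characters it also matches.
DEFAULT_FOLDS = {"o": {"ø"}, "l": {"ł"}, "n": {"ñ"}}

# Hypothetical per-language defaults for the single user variable.
LOCALE_EXCLUSIONS = {"es": {("n", "ñ")}}


def default_exclusions(locale_name):
    """Derive the default value of the user variable from a locale name."""
    # "es_ES.UTF-8" -> "es"
    language = locale_name.split("_")[0].split(".")[0].lower()
    return set(LOCALE_EXCLUSIONS.get(language, set()))


def effective_folds(excluded):
    """Filter the default mappings by the (base, variant) pairs in excluded."""
    result = {}
    for base, variants in DEFAULT_FOLDS.items():
        kept = {v for v in variants if (base, v) not in excluded}
        if kept:
            result[base] = kept
    return result
```

Here the only thing a user would customize is the set of excluded
pairs; the default mappings themselves are never edited.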

> It would be good to find all differences between UnicodeData.txt and
> decomps.txt.  Is this the latest version?
> http://unicode.org/Public/UCA/6.3.0/decomps.txt

No, the latest is always here:

  http://unicode.org/Public/UCA/latest/decomps.txt

(The latest release of Unicode is v8.0.)