From: Juri Linkov
Newsgroups: gmane.emacs.devel
Subject: Re: On language-dependent defaults for character-folding
Date: Wed, 24 Feb 2016 02:16:23 +0200
Organization: LINKOV.NET
Message-ID: <8737sjufmw.fsf@mail.linkov.net>
To: Eli Zaretskii
Cc: larsi@gnus.org, lokedhs@gmail.com, rms@gnu.org, emacs-devel@gnu.org
In-Reply-To: <83wppvic6f.fsf@gnu.org> (Eli Zaretskii's message of "Tue, 23 Feb 2016 19:11:52 +0200")

>> > But the most basic issue is that any significant development in these
>> > directions require to re-implement the feature on the C level, and use
>> > char-tables for folding, like we do with case-mapping.  So until
>> > someone steps forward for the job, all we can do is small corrections
>> > to the existing implementation.
>>
>> Do I understand correctly that essentially what is necessary to do on the
>> C level is to extend char-tables with character insertions and deletions,
>> so in addition to canonical equivalence mappings (like are used for the
>> existing case-mappings) char-tables should also support matching of
>> multi-character additions (like combining accents in the search
>> string) and deletions (like combining accents from the search string
>> missing in the search text)?
>
> I'm not sure I understand why you think char-tables need to be
> extended in support of folding search.  AFAIU, we need a way to
> normalize each character, both in the search string and in the
> buffer/string we search.  This normalization involves decomposition
> followed by reordering the combining diacritics into a canonical
> order.  Then we just match one against the other, almost as usual
> ("almost" because we need to backtrack in the buffer/string upon
> mismatch).  (Of course, decomposition of buffer/string text needs to
> be done on the fly, but this is an implementation detail unrelated to
> this discussion.)
>
> So we need a char-table that maps each character into its
> decomposition sequence, which AFAIR is something the current
> char-tables can support already.  Am I missing something?

Searching for a base character and matching a sequence of characters
(e.g. a base character and combining accents) might already be possible
with the current char-tables indexed by a base character.  But I see no
way to specify in a char-table that e.g. a character should be skipped
in the search buffer.  Maybe this need could be avoided in an asymmetric
search with combining characters in the search buffer, but it is still
required for ignorable characters.

> If you are interested in the details, I suggest reading
> http://unicode.org/reports/tr10/ and in particular
> http://unicode.org/reports/tr10/#Searching, which deals specifically
> with searching.
> http://www.unicode.org/notes/tn5/ is also a useful
> reading.

Thanks, it looks like a complete specification with comprehensive
answers to most questions.

>> > For example, the default state of character-folding might depend on
>> > the locale's language -- we could turn it off by default for languages
>> > whose users expressed dissatisfaction with the feature.  We could also
>> > augment the regular expressions created for folding the search string
>> > by filtering out variants that users of a particular language don't
>> > want.  If people think these ideas will make more users happy, we can
>> > work on that.
>>
>> It seems two user variables are necessary for customization:
>>
>> 1. inclusive folding groups that will include by default such pairs
>>    as o - ø, l - ł added to the Unicode decomposition-based rules,
>>    and allow the users to add more rules;
>>
>> 2. exclusive folding groups to exclude locale/language-dependent rules from
>>    the default mappings above, e.g. removing n - ñ for the "es" locale.
>
> I think we should add those in item 1 unconditionally (i.e. include
> them in the default mappings), and then exclude some of them under the
> rules you describe in item 2.  Then the problem becomes easier, as we
> only need to filter out some mappings, as determined by a single user
> variable (whose default can come from the user locale).

Better to have 4 variables (2 internal + 2 user-customizable):

1.1. (internal) default mappings with additional data from decomps.txt
1.2. user mappings to add to the default list

2.1. (internal) locale-dependent mappings to remove from the default list
2.2. user mappings to remove from the default list

> The additional mappings can be picked up from the file decomps.txt in
> the UCA database.

It would be good to find all differences between UnicodeData.txt and
decomps.txt.  Is this the latest version?
http://unicode.org/Public/UCA/6.3.0/decomps.txt
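
To make the four-variable idea a bit more concrete, here is a rough
sketch of how the lists could combine.  It is written in Python only for
brevity and is not a proposal for the actual Emacs implementation.  All
names (DEFAULT_MAPPINGS, LOCALE_EXCLUSIONS, effective_mappings) are
hypothetical, the sample data is limited to the pairs mentioned above,
and the order of application (internal removals first, then user
additions, then user removals) is an assumption chosen so that user
settings override the internal ones.

```python
# Hypothetical names throughout; none of these are existing Emacs variables.

# 1.1 (internal) default mappings: base character -> extra characters it matches.
DEFAULT_MAPPINGS = {
    "o": {"ø"},
    "l": {"ł"},
    "n": {"ñ"},
}

# 2.1 (internal) locale-dependent mappings to remove from the default list.
LOCALE_EXCLUSIONS = {
    "es": {"n": {"ñ"}},
}


def _remove(result, removals):
    """Remove every (base, chars) pair in `removals` from `result` in place."""
    for base, chars in removals.items():
        if base in result:
            result[base] -= chars


def effective_mappings(locale, user_additions=None, user_exclusions=None):
    """Return the folding mappings in effect for `locale`.

    Assumed order: defaults (1.1), minus locale exclusions (2.1),
    plus user additions (1.2), minus user exclusions (2.2).
    """
    # Copy the defaults so the internal table is never mutated.
    result = {base: set(chars) for base, chars in DEFAULT_MAPPINGS.items()}
    _remove(result, LOCALE_EXCLUSIONS.get(locale, {}))
    for base, chars in (user_additions or {}).items():
        result.setdefault(base, set()).update(chars)
    _remove(result, user_exclusions or {})
    # Drop base characters that no longer fold to anything.
    return {base: chars for base, chars in result.items() if chars}


if __name__ == "__main__":
    print(effective_mappings("es"))
    print(effective_mappings("es", user_additions={"n": {"ñ"}}))
```

With this ordering, a user in the "es" locale who still wants n to
match ñ can add it back through the user additions list, which would not
be possible if the locale exclusions were applied last.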