From mboxrd@z Thu Jan 1 00:00:00 1970
From: Eli Zaretskii
Newsgroups: gmane.emacs.devel
Subject: Re: On language-dependent defaults for character-folding
Date: Tue, 23 Feb 2016 19:11:52 +0200
Message-ID: <83wppvic6f.fsf@gnu.org>
References: <834mdc5w6o.fsf@gnu.org> <838u2hu6aq.fsf@gnu.org>
 <871t899tde.fsf@gnus.org> <83y4ahru04.fsf@gnu.org>
 <83fuwproyf.fsf@gnu.org> <837fi0sz29.fsf@gnu.org>
 <83egc8qzjh.fsf@gnu.org> <87egc7evu3.fsf@gnus.org>
 <83io1jpt4u.fsf@gnu.org> <87povqhj25.fsf@gnus.org>
 <83povqm3dw.fsf@gnu.org> <831t84lgsa.fsf@gnu.org>
 <87io1gz3i8.fsf@mail.linkov.net>
Reply-To: Eli Zaretskii
Cc: larsi@gnus.org, lokedhs@gmail.com, rms@gnu.org, emacs-devel@gnu.org
To: Juri Linkov
In-reply-to: <87io1gz3i8.fsf@mail.linkov.net> (message from Juri Linkov on
 Tue, 23 Feb 2016 02:14:55 +0200)

> From: Juri Linkov
> Cc: rms@gnu.org, larsi@gnus.org, lokedhs@gmail.com, emacs-devel@gnu.org
> Date: Tue, 23 Feb 2016 02:14:55 +0200
>
> > But the most basic issue is that any significant development in these
> > directions require to re-implement the feature on the C level, and use
> > char-tables for folding, like we do with case-mapping. So until
> > someone steps forward for the job, all we can do is small corrections
> > to the existing implementation.
>
> Do I understand correctly that essentially what is necessary to do on the
> C level is to extend char-tables with character insertions and deletions,
> so in addition to canonical equivalence mappings (like are used for the
> existing case-mappings) char-tables should also support matching of
> multi-character additions (like combining accents in the search
> string) and deletions (like combining accents from the search string
> missing in the search text)?

I'm not sure I understand why you think char-tables need to be
extended in support of folding search. AFAIU, we need a way to
normalize each character, both in the search string and in the
buffer/string we search. This normalization involves decomposition
followed by reordering the combining diacritics into a canonical
order. Then we just match one against the other, almost as usual
("almost" because we need to backtrack in the buffer/string upon
mismatch). (Of course, decomposition of buffer/string text needs to
be done on the fly, but this is an implementation detail unrelated to
this discussion.)

So we need a char-table that maps each character into its
decomposition sequence, which AFAIR is something the current
char-tables can support already. Am I missing something?

If you are interested in the details, I suggest reading
http://unicode.org/reports/tr10/ and in particular
http://unicode.org/reports/tr10/#Searching, which deals specifically
with searching. http://www.unicode.org/notes/tn5/ is also useful
reading.

> > For example, the default state of character-folding might depend on
> > the locale's language -- we could turn it off by default for languages
> > whose users expressed dissatisfaction with the feature. We could also
> > augment the regular expressions created for folding the search string
> > by filtering out variants that users of a particular language don't
> > want. If people think these ideas will make more users happy, we can
> > work on that.
>
> It seems two user variables are necessary for customization:
>
> 1. inclusive folding groups that will include by default such pairs
>    as o - ø, l - ł added to the Unicode decomposition-based rules,
>    and allow the users to add more rules;
>
> 2. exclusive folding groups to exclude locale/language-dependent rules from
>    the default mappings above, e.g. removing n - ñ for the "es" locale.

I think we should add those in item 1 unconditionally (i.e. include
them in the default mappings), and then exclude some of them under the
rules you describe in item 2. Then the problem becomes easier, as we
only need to filter out some mappings, as determined by a single user
variable (whose default can come from the user locale).

The additional mappings can be picked up from the file decomps.txt in
the UCA database.
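
To make the scheme above concrete, here is a rough sketch in Python
(not Emacs Lisp or C, and not how Emacs implements anything). It
assumes that Unicode NFD can stand in for "decomposition followed by
canonical reordering", and it simply drops combining marks instead of
matching with backtracking. The names EXTRA_FOLDS, EXCLUDED_BY_LANG,
canonical_form, fold_key and folded_contains are all made up for
illustration, and the two tables hold only the pairs mentioned in this
thread rather than anything read from decomps.txt.

```python
import unicodedata

# Hypothetical "item 1" additions: pairs with no canonical decomposition
# in Unicode, always part of the default mappings in this sketch.
EXTRA_FOLDS = {"ø": "o", "ł": "l"}

# Hypothetical "item 2" exclusions: characters that must NOT be folded
# for a given language (the thread's example is n - ñ for "es").
EXCLUDED_BY_LANG = {"es": {"\u00f1"}}


def canonical_form(text):
    """Decompose TEXT and put combining marks into canonical order (NFD)."""
    return unicodedata.normalize("NFD", text)


def fold_key(text, lang=None):
    """Return TEXT with default foldings applied, minus LANG's exclusions."""
    excluded = EXCLUDED_BY_LANG.get(lang, set())
    out = []
    # Compose first so an excluded character is recognized even when the
    # input spells it as base letter + combining mark.
    for ch in unicodedata.normalize("NFC", text):
        if ch in excluded:
            out.append(ch)  # filtered out of the mappings: keep as-is
            continue
        ch = EXTRA_FOLDS.get(ch, ch)
        # Keep only the non-combining characters of the decomposition.
        out.extend(c for c in canonical_form(ch)
                   if not unicodedata.combining(c))
    return "".join(out)


def folded_contains(needle, haystack, lang=None):
    """True if NEEDLE occurs in HAYSTACK once both have been folded."""
    return fold_key(needle, lang) in fold_key(haystack, lang)
```

With these definitions, folded_contains("ano", "el a\u00f1o") is true,
while the same call with lang="es" is false, which is the effect of
filtering out a single mapping based on one language setting. A real
implementation would have to decompose buffer text on the fly rather
than build folded copies of whole strings.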