From mboxrd@z Thu Jan 1 00:00:00 1970
From: Eli Zaretskii
Newsgroups: gmane.emacs.devel
Subject: Re: On language-dependent defaults for character-folding
Date: Thu, 25 Feb 2016 18:24:08 +0200
Message-ID: <83mvqog3mf.fsf@gnu.org>
References: <83y4ahru04.fsf@gnu.org> <83fuwproyf.fsf@gnu.org>
 <837fi0sz29.fsf@gnu.org> <83egc8qzjh.fsf@gnu.org> <87egc7evu3.fsf@gnus.org>
 <83io1jpt4u.fsf@gnu.org> <87povqhj25.fsf@gnus.org> <83povqm3dw.fsf@gnu.org>
 <831t84lgsa.fsf@gnu.org> <87io1gz3i8.fsf@mail.linkov.net>
 <83wppvic6f.fsf@gnu.org> <8737sjufmw.fsf@mail.linkov.net>
 <83fuwigdft.fsf@gnu.org> <87h9gxfx9k.fsf@mail.linkov.net>
In-reply-to: <87h9gxfx9k.fsf@mail.linkov.net> (message from Juri Linkov on
 Thu, 25 Feb 2016 02:29:11 +0200)
To: Juri Linkov
Cc: larsi@gnus.org, lokedhs@gmail.com, rms@gnu.org, emacs-devel@gnu.org

> From: Juri Linkov
> Cc: larsi@gnus.org, lokedhs@gmail.com, rms@gnu.org, emacs-devel@gnu.org
> Date: Thu, 25 Feb 2016 02:29:11 +0200
>
> >> >> It seems two user variables are necessary for customization:
> >> >>
> >> >> 1. inclusive folding groups that will include by default such pairs
> >> >>    as o - ø, l - ł added to the Unicode decomposition-based rules,
> >> >>    and allow the users to add more rules;
> >> >>
> >> >> 2. exclusive folding groups to exclude locale/language-dependent rules from
> >> >>    the default mappings above, e.g. removing n - ñ for the "es" locale.
> >> >
> >> > I think we should add those in item 1 unconditionally (i.e. include
> >> > them in the default mappings), and then exclude some of them under the
> >> > rules you describe in item 2.  Then the problem becomes easier, as we
> >> > only need to filter out some mappings, as determined by a single user
> >> > variable (whose default can come from the user locale).
> >>
> >> Better to have 4 variables (2 internal + 2 user customizable variables):
> >
> > Can you explain why it's better to have 4 variables rather than just
> > one?
>
> If you mean that one customizable variable should contain all mappings from
> UnicodeData.txt and decomps.txt presented to the user for customization,
> such a list will be too huge to customize: there are 5721 decompositions
> in UnicodeData.txt, and 6674 decompositions in decomps.txt.

No, of course not.  That would be extremely inconvenient.

What I envisioned is a single variable that holds a list of folding
sub-features.  Examples include ignoring diacritics, matching
ligatures and their decompositions, "controversial" foldings that
users of specific languages might not want, etc.  The default value
will hold all of the sub-features; users that don't want some of them
will be able to remove them from the list, which will affect the
mapping at search time.  We could also have a setting that means "DTRT
for my locale", which will remove the sub-features inappropriate for
the locale's language.  Stuff like that.

> So we could have at least one default internal variable containing all
> decompositions from UnicodeData.txt plus decompositions from decomps.txt
> minus locale-dependent mappings.

Internally, we need a translation table for mapping equivalent
characters.  This table should be recomputed (or selected among
several precomputed ones) according to the list of sub-features that
the user requested.

> > http://unicode.org/Public/UCA/latest/decomps.txt
> >
> > (The last release of Unicode is v8.0.)
>
> Thanks, comparing UnicodeData.txt with the latest decomps.txt shows
> 1600 differences (such as ł decomposed to l and ̵ and ø to o and ̸)
> we need to add manually (a whole set of differences is attached below):

I think we need to create another uni-*.el file which defines a
decomposition char-table populated from decomps.txt.
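
To make the single-variable idea concrete, here is a rough sketch of how
a list of sub-features could drive the recomputation of a translation
table.  It is written in Python rather than Emacs Lisp purely for
illustration, and it is not a proposed implementation.  The sub-feature
names, EXTRA_PAIRS, LOCALE_EXCLUSIONS, SAMPLE_CODEPOINTS, fold_char and
build_table are all hypothetical; the extra pairs are just the two
examples cited above (a real table would be populated from decomps.txt),
and the table covers only a small sample of code points.

```python
import unicodedata

# Hypothetical sub-feature names; nothing in the discussion fixes these.
ALL_FEATURES = ("diacritics", "ligatures", "extra-pairs")

# Pairs with no canonical decomposition in UnicodeData.txt (the two
# examples cited in the thread).  Assumption: the real data would come
# from decomps.txt.
EXTRA_PAIRS = {"ø": "o", "ł": "l"}

# Illustrative only: letters that a language treats as distinct.
LOCALE_EXCLUSIONS = {"es": {"ñ", "Ñ"}}

# Small sample: Latin-1 Supplement through Latin Extended-B, plus the
# Latin ligatures U+FB00..U+FB06.  A real table would cover far more.
SAMPLE_CODEPOINTS = list(range(0x00A0, 0x0250)) + list(range(0xFB00, 0xFB07))


def fold_char(ch, features):
    """Return the folded form of CH under the enabled sub-features."""
    if "extra-pairs" in features and ch in EXTRA_PAIRS:
        return EXTRA_PAIRS[ch]
    if ("ligatures" in features
            and unicodedata.decomposition(ch).startswith("<compat>")):
        # Compatibility decomposition, e.g. U+FB01 -> "fi".
        ch = unicodedata.normalize("NFKD", ch)
    if "diacritics" in features:
        # Canonical decomposition, then drop the combining marks.
        decomposed = unicodedata.normalize("NFD", ch)
        ch = "".join(c for c in decomposed if not unicodedata.combining(c))
    return ch


def build_table(features=ALL_FEATURES, language=None,
                codepoints=SAMPLE_CODEPOINTS):
    """Recompute the translation table for the requested sub-features.

    Characters listed for LANGUAGE in LOCALE_EXCLUSIONS are left unfolded.
    """
    excluded = LOCALE_EXCLUSIONS.get(language, set())
    table = {}
    for cp in codepoints:
        ch = chr(cp)
        if ch in excluded:
            continue
        folded = fold_char(ch, features)
        if folded != ch:
            table[cp] = folded
    return table
```

The resulting dictionary can be passed to str.translate, so removing
"ligatures" from the list or passing language="es" simply yields a
different table; the search code itself does not need to know which
sub-features are active.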