From: Juri Linkov
Newsgroups: gmane.emacs.devel
Subject: Re: On language-dependent defaults for character-folding
Date: Mon, 29 Feb 2016 02:22:02 +0200
Organization: LINKOV.NET
Message-ID: <87oab0fk8d.fsf@mail.linkov.net>
In-Reply-To: <83mvqog3mf.fsf@gnu.org> (Eli Zaretskii's message of "Thu, 25 Feb 2016 18:24:08 +0200")
To: Eli Zaretskii
Cc: larsi@gnus.org, lokedhs@gmail.com, rms@gnu.org, emacs-devel@gnu.org

> What I envisioned is a single variable that holds a list of folding
> sub-features.  Examples include ignoring diacritics, matching
> ligatures and their decompositions, "controversial" foldings that
> users of specific languages might not want, etc.  The default value
> will hold all of the sub-features; users that don't want some of them
> will be able to remove them from the list, which will affect the
> mapping at search time.  We could also have a setting that means "DTRT
> for my locale", which will remove the sub-features inappropriate for
> the locale's language.  Stuff like that.

Like (defcustom char-fold-defaults '(ignore-diacritics match-ligatures ...?
Not sure if such terms are self-descriptive.  At least plain pairs like
'((o ø) (l ł) ...) should be enough to customize at the base character
level, and later we might consider grouping such pairs into more
high-level features like ‘spanish-diacritics’, ‘swedish-diacritics’, etc.

>> So we could have at least one default internal variable containing all
>> decompositions from UnicodeData.txt plus decompositions from decomps.txt
>> minus locale-dependent mappings.
>
> Internally, we need a translation table for mapping equivalent
> characters.  This table should be recomputed (or selected among
> several precomputed ones) according to the list of sub-features that
> the user requested.

Or maybe customizing a variable like (defcustom char-fold-language ...)
(with the default depending on the user locale) could reevaluate
the table on saving the modified value.

>> > http://unicode.org/Public/UCA/latest/decomps.txt
>> >
>> > (The last release of Unicode is v8.0.)
>>
>> Thanks, comparing UnicodeData.txt with the latest decomps.txt shows
>> 1600 differences (such as ł decomposed to l and ̵ and ø to o and ̸)
>> we need to add manually (a whole set of differences is attached below):
>
> I think we need to create another uni-*.el file which defines a
> decomposition char-table populated from decomps.txt.

The name of the currently used Unicode character property is “decomposition”.
What would be a good name for the property from decomps.txt?  “decomposition2”?
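
To make the pair-based idea concrete, here is a minimal sketch of a user
option whose setter rebuilds a lookup table whenever the value is saved.
All `my-char-fold-' names are hypothetical and not part of Emacs; the
table is a plain alist rather than a real translation table, and the two
default pairs are only the examples mentioned above.

```elisp
;;; -*- lexical-binding: t; coding: utf-8 -*-
;; Sketch only: every `my-char-fold-' name below is invented for
;; illustration and does not exist in Emacs.

(defvar my-char-fold-table nil
  "Alist of (BASE . EXTRAS) rebuilt from `my-char-fold-pairs'.")

(defun my-char-fold-rebuild (pairs)
  "Return an alist mapping each base character in PAIRS to its extras.
PAIRS is a list of (BASE EXTRA...) lists of characters.  Entries
that share the same BASE are merged, keeping their order."
  (let (table)
    (dolist (pair pairs)
      (let ((entry (assq (car pair) table)))
        (if entry
            ;; BASE already seen: append the new extras to it.
            (setcdr entry (append (cdr entry) (cdr pair)))
          ;; First time we see BASE: start a fresh entry.
          (push (cons (car pair) (copy-sequence (cdr pair))) table))))
    (nreverse table)))

(defcustom my-char-fold-pairs '((?o ?ø) (?l ?ł))
  "Base characters and the extra characters each should also match."
  :type '(repeat (repeat character))
  :group 'matching
  ;; Recompute the table on every change made through Customize.
  :set (lambda (symbol value)
         (set-default symbol value)
         (setq my-char-fold-table (my-char-fold-rebuild value))))

;; The existing `decomposition' property carries no base letter for ł:
;; per the Elisp manual, a character without decomposition data yields
;; a one-element list holding the character itself.
(get-char-code-property ?ł 'decomposition)
;; For é the same property gives the base letter and the combining accent.
(get-char-code-property ?é 'decomposition)
```

Grouping pairs into higher-level features such as ‘swedish-diacritics’
would then only mean choosing which lists of pairs are handed to the
rebuild function.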