From: Eli Zaretskii
Newsgroups: gmane.emacs.devel
Subject: Re: On language-dependent defaults for character-folding
Date: Fri, 19 Feb 2016 12:09:44 +0200
Message-ID: <83fuwproyf.fsf@gnu.org>
To: Elias Mårtenson
Cc: larsi@gnus.org, emacs-devel@gnu.org
In-reply-to: (message from Elias Mårtenson on Fri, 19 Feb 2016 17:22:18 +0800)

> Date: Fri, 19 Feb 2016 17:22:18 +0800
> From: Elias Mårtenson
> Cc: Lars Ingebrigtsen, emacs-devel
>
> The Unicode character decomposition was never meant to be used to
> provide a feature such as character folding in Emacs.

That's not true. Canonical equivalence, which is encoded in canonical
decompositions, is a must for searching. Otherwise, what looks the
same on display will not be found, and will look like a bug. See the
example I gave with ñ and ñ (the latter one is 2 characters). So
using decomposition is not a trick, it simply uses the same data that
determines equivalence of character sequences.

> My suggestion would be to apply several levels of comparisons:
>
> 1. Check if the characters have locale-specific folding rules (for
> Swedish, this would be no more than 3-5 characters or so).
> If not:
> 2. Check the equivalence according to the Unicode collation charts:
> http://unicode.org/charts/collation/
> 3. (maybe) Use the decomposition trick

2 and 3 are the same as what we do already, AFAICT. (Collation charts
describe ordering, which is irrelevant for searching; other than that,
you will see that Emacs already implements the data shown in
http://unicode.org/charts/collation/.)

As for the locale-specific parts: using that will only DTRT if we
assume that the majority of searches are done in buffers holding text
in the locale's language. Is that a good assumption? We are talking
about a multilingual Emacs, in an age of global communications, where
you can have conversations with someone on the other side of the
world, or read text that combines several languages in the same
buffer. Do we really want to go back to the l10n days, when there was
only ever one locale that was interesting -- the current one? I
wonder.

> As for the per-locale exception tables mentioned in point 1, I don't
> know if such information is easily available.

It is; Unicode provides it. We just didn't import it yet.

> It may be possible to extract it from the localedata files from
> Glibc. But even if it isn't, creating one for a language should be
> trivial since we only need a list of character groups that should
> _not_ be folded, which for most languages should be a very small
> list (in fact, for most(?) it's probably empty).

It's more complex than that, but patches are welcome, of course.

Note that the prerequisite for anything more complicated and elaborate
than what we have now is to re-implement character-folding on the C
level, inside search.c functions. The current implementation is at
its limits already. I tried to convince the interested people to do
this in C to begin with, but couldn't, and the feature was important
enough to have even in its current implementation.
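
To make the canonical-equivalence point and the proposed per-locale
exception list concrete, here is a minimal sketch in Python. It is not
how Emacs implements character folding. The helper names
(canonical_key, fold) and the SWEDISH_KEEP set are hypothetical and
exist only for illustration; only the standard unicodedata module is
real.

```python
import unicodedata


def canonical_key(text):
    """Return the canonical decomposition (NFD) of text."""
    return unicodedata.normalize("NFD", text)


def fold(text, keep=frozenset()):
    """Drop combining marks from every character not listed in keep.

    keep is a hypothetical per-locale exception set: characters that
    should NOT be folded to their base letter.
    """
    out = []
    # Compose first, so a decomposed sequence is matched against keep
    # in the same form as a precomposed one.
    for ch in unicodedata.normalize("NFC", text):
        if ch in keep:
            out.append(ch)
        else:
            out.extend(c for c in unicodedata.normalize("NFD", ch)
                       if not unicodedata.combining(c))
    return "".join(out)


precomposed = "\u00f1"   # LATIN SMALL LETTER N WITH TILDE, 1 character
decomposed = "n\u0303"   # "n" + COMBINING TILDE, 2 characters

# The raw strings differ, but their canonical decompositions agree.
same_raw = precomposed == decomposed
same_canonical = canonical_key(precomposed) == canonical_key(decomposed)

# Assumption for illustration: letters a Swedish locale would not fold.
SWEDISH_KEEP = frozenset("\u00e5\u00e4\u00f6\u00c5\u00c4\u00d6")

folded_default = fold("sm\u00f6rg\u00e5s")                # no exceptions
folded_swedish = fold("sm\u00f6rg\u00e5s", SWEDISH_KEEP)  # with exceptions
```

With no exception set, every accented letter collapses to its base
letter; with the set, only the listed letters survive, while a
character such as ñ is still folded.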