From mboxrd@z Thu Jan 1 00:00:00 1970
From: Eli Zaretskii
Newsgroups: gmane.emacs.devel
Subject: Re: On language-dependent defaults for character-folding
Date: Sat, 20 Feb 2016 12:34:41 +0200
Message-ID: <83io1jpt4u.fsf@gnu.org>
References: <83twle71xy.fsf@gnu.org> <87io1us0te.fsf@wanadoo.es>
 <83pow26svf.fsf@gnu.org> <87a8n5srbp.fsf@wanadoo.es>
 <83d1s17npz.fsf@gnu.org> <87oablfpn3.fsf@mail.linkov.net>
 <834mdd6llx.fsf@gnu.org> <7fbb8bc7-9a97-4bad-a103-a6690a35241d@default>
 <834mdc5w6o.fsf@gnu.org> <838u2hu6aq.fsf@gnu.org>
 <871t899tde.fsf@gnus.org> <83y4ahru04.fsf@gnu.org>
 <83fuwproyf.fsf@gnu.org> <837fi0sz29.fsf@gnu.org>
 <83egc8qzjh.fsf@gnu.org> <87egc7evu3.fsf@gnus.org>
Reply-To: Eli Zaretskii
To: Lars Ingebrigtsen
Cc: lokedhs@gmail.com, emacs-devel@gnu.org
In-reply-to: <87egc7evu3.fsf@gnus.org> (message from Lars Ingebrigtsen
 on Sat, 20 Feb 2016 17:31:48 +1100)
List-Id: "Emacs development discussions."
Xref: news.gmane.org gmane.emacs.devel:200295

> From: Lars Ingebrigtsen
> Cc: Eli Zaretskii, emacs-devel
> Date: Sat, 20 Feb 2016 17:31:48 +1100
>
> It seems to me that we're considering using the Unicode decomposition
> rules for "variant detection" because it's what we have.

No, we use decompositions because that's how equivalent strings are to
be compared and mapped/folded.

> But this doesn't allow people to say `C-s l' to find ł or `C-s o' to
> find ø, and this would obviously be something that many people would
> find helpful.
>
> So the Unicode decomposition rules only get us halfway there.

Yes, the current implementation is just a first step.

> On the other hand, they go to far for other users, who absolutely do
> not want `C-s o' to find ø, but would be really glad if `C-s hermes'
> would find "Hermés" (or is it "Hermès"?  I can't even type that in
> on this keyboard).

Which is why this is toggle-able.

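To make the point about decompositions concrete, here is a small sketch. It is written in Python with the standard `unicodedata` module purely for illustration; it is not Emacs code, and the helper name `canonical_decomposition` is made up for this example. It shows that the Unicode database splits é, è and ö into a base letter plus a combining mark, while ł and ø have no canonical decomposition at all, which is why decomposition-based folding alone does not find them.

```python
import unicodedata


def canonical_decomposition(ch):
    """Return the NFD (canonical) decomposition of CH as hex code points.

    Hypothetical helper for illustration only.
    """
    return [f"{ord(c):04X}" for c in unicodedata.normalize("NFD", ch)]


# Base letter followed by a combining mark:
print(canonical_decomposition("é"))  # e + COMBINING ACUTE ACCENT
print(canonical_decomposition("ö"))  # o + COMBINING DIAERESIS

# No canonical decomposition: the character maps to itself.
print(canonical_decomposition("ł"))
print(canonical_decomposition("ø"))
```

So `C-s e' can be made to find é from the decomposition data alone, while finding ø from `o' would need some additional data.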
> (defvar *character-variants*
>   '((?a ?á ?å ?ä ...)
>     (?o ?ø ?ö ?ó ...)
>     ...))
>
> Everything that somebody says "that's kinda an a, right?" goes on there.

The above won't support finding decomposed sequences as in á (there
are 2 characters here, they are just displayed as one).  I hope it's
agreed that it is imperative for us to support finding such decomposed
sequences (and we already do, under the current character-folding
default).  There are also more complicated cases like ǖ and ǖ (3
characters), where there are several diacritics which can be in either
order, and we still have to match them, because they look identical on
display.  We currently don't support that, but we should do that in
the future, and the decomposition data supports that.

It is, of course, possible to support this without normalization, by
having all those combinations in the database you proposed.  But why
should we bother creating and maintaining such a database (and
updating it whenever a new Unicode version is released), when one is
already available in data that we already read into Emacs?  So we
currently implement this by using the decomposition information in the
Unicode database.

Also, what would be the algorithm for searching using the data you
propose?  If you want to use regexps, then the data should already be
in the form of regexps, I think.  And I expect the regexp to look very
similar to what we currently construct in character-fold.el.

So what are we really arguing about here?  Is it about a feature that
will allow exempting specific decompositions from the search?  If so,
I don't think it would be hard to do that with the current
implementation, using just the locale-exception data (which should be
much smaller).  If that will make everyone happier, we can do this
now, if we are sure we won't have another round of prolonged dispute
about that.

> And then we just look up the locale, create the mapping when we type
> `C-s', and there we are.
> An awesome, very useful feature that would
> annoy nobody, and that should be on by default.

But it doesn't pass the simplest test above, so it really isn't good
enough.

Btw, this was already discussed in the past, before Artur sat down to
implement this stuff.  You may wish to re-read those discussions to
see the broader picture.
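
The regexp-based approach and the exemption idea discussed above can be sketched as follows. This is an illustration in Python, not the code in character-fold.el: the names `build_variants`, `fold_regexp` and `fold_search`, the limit of code points scanned, and the restriction to the U+0300..U+036F combining block are all assumptions made for this example. The variants table is derived from the Unicode decomposition data rather than maintained by hand, each query character becomes a character class followed by any number of combining marks, and a small `exempt` set plays the role of locale-exception data.

```python
import re
import unicodedata
from collections import defaultdict

# Assumption: only the basic Combining Diacritical Marks block is handled.
COMBINING = "[\u0300-\u036f]*"


def build_variants(limit=0x250):
    """Map each base character to the precomposed characters that
    canonically decompose into it plus combining marks."""
    variants = defaultdict(set)
    for cp in range(limit):
        ch = chr(cp)
        nfd = unicodedata.normalize("NFD", ch)
        if len(nfd) > 1 and all(unicodedata.combining(c) for c in nfd[1:]):
            variants[nfd[0]].add(ch)
    return variants


VARIANTS = build_variants()


def fold_regexp(query):
    """Build a regexp matching QUERY with or without diacritics,
    in precomposed or decomposed form."""
    parts = []
    for ch in query:
        chars = "".join(sorted(VARIANTS.get(ch, set()) | {ch}))
        parts.append("[" + re.escape(chars) + "]" + COMBINING)
    return re.compile("".join(parts))


def fold_search(query, text, exempt=frozenset()):
    """Return the matches of QUERY in TEXT, dropping any match whose
    composed (NFC) form contains a character listed in EXEMPT."""
    hits = []
    for m in fold_regexp(query).finditer(text):
        composed = unicodedata.normalize("NFC", m.group())
        if not any(c in exempt for c in composed):
            hits.append(m.group())
    return hits


print(fold_search("Hermes", "Hermès and Herme\u0300s"))
print(fold_search("o", "o ö ó", exempt={"ö"}))
```

With this sketch, `u` matches ǖ as well as u followed by the two combining marks in either order, while `o` still does not match ø, since ø has no decomposition. Filtering after the match is the simplest way to show the exemption; a real implementation would more likely leave the exempted decompositions out of the regexp itself.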