From mboxrd@z Thu Jan 1 00:00:00 1970
From: Eli Zaretskii
Newsgroups: gmane.emacs.devel
Subject: Re: On language-dependent defaults for character-folding
Date: Sat, 20 Feb 2016 11:21:17 +0200
Message-ID: <83povrpwj6.fsf@gnu.org>
References: <87io1xwq1e.fsf@wanadoo.es> <87vb5wvzfz.fsf@mail.linkov.net>
 <87io1wt4cc.fsf@wanadoo.es> <8737syoima.fsf@mail.linkov.net>
 <871t8iu277.fsf@wanadoo.es> <83d1s28kvh.fsf@gnu.org>
 <87r3gis7sm.fsf@wanadoo.es> <83twle71xy.fsf@gnu.org>
 <87io1us0te.fsf@wanadoo.es> <83pow26svf.fsf@gnu.org>
 <87a8n5srbp.fsf@wanadoo.es> <83d1s17npz.fsf@gnu.org>
 <87oablfpn3.fsf@mail.linkov.net> <834mdd6llx.fsf@gnu.org>
 <7fbb8bc7-9a97-4bad-a103-a6690a35241d@default> <834mdc5w6o.fsf@gnu.org>
 <838u2hu6aq.fsf@gnu.org> <871t899tde.fsf@gnus.org>
 <83y4ahru04.fsf@gnu.org> <83fuwproyf.fsf@gnu.org>
 <837fi0sz29.fsf@gnu.org> <83egc8qzjh.fsf@gnu.org>
Reply-To: Eli Zaretskii
Cc: larsi@gnus.org, emacs-devel@gnu.org
To: Elias Mårtenson
In-reply-to: (message from Elias Mårtenson on Sat, 20 Feb 2016 13:22:57 +0800)

> Date: Sat, 20 Feb 2016 13:22:57 +0800
> From: Elias Mårtenson
> Cc: Lars Ingebrigtsen, emacs-devel
>
>     The reference you are looking for is the Unicode Standard itself. It
>     says to use the normalization forms, see for example section 5.16
>     there.
>
> I have read that section before, and I have now read it again. The
> section certainly talks about searching that ignores diacritics, but
> does not discuss a method to do so. There is also a reference to TR29,
> but it refers to grapheme clusters, which would be a very strange way
> to do character folding (Koreans would be very confused).
>
>     Every character-folding search implementation decomposes characters
>     before matching them. So does Emacs.
>     We didn't invent this, and we
>     certainly didn't use the decompositions where they weren't supposed to
>     be used. It's not a trick, it's what everyone else does to do the
>     job. See the ICU library, for example.
>
> Every example you have given so far discusses the decomposition
> equivalence, i.e. the fact that the two variants of ñ are the same.
> Section 5.16 discusses the _concept_ of allowing n and ñ to match
> similarly, but the mechanism to do so is locale-dependent. This is
> what Unicode says, and that is what I say. My position is simply that
> the default (if absolutely nothing else overrides it) should be chosen
> to take the locale of the user into account.
>
>     > The decompositions are used in the normalisation forms to ensure
>     > that the two variants are treated equally (such as the two
>     > alternative representations of ñ that we have been discussing).
>
>     Yes, and any character-folding search uses normalization forms as
>     well.
>
> Yes, but that's not what normalisation forms were designed to do.

Your interpretation is wrong, because every implementation of
character-folding in search uses normalization forms. So if you want
to maintain that whoever does that is abusing normalization forms, you
are not just up against Emacs, you are up against the ICU library and
others. You are also up against http://www.unicode.org/notes/tn5/.

It is possible that you only see the "equivalence" parts of all these
sources. But in that case, you are actually claiming that folding
characters should never be done at all! "Folding" means mapping
_distinct_ character sequences to the same basic sequence. You start
from a normalization form, then compare the results disregarding
certain secondary, tertiary, etc. differences. The Emacs
implementation simply expresses this algorithm by using suitable
regular expressions, and it's currently only capable of either
ignoring all the non-base weights or none at all, but the principle is
preserved to the letter.
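
To make the distinction concrete, here is a minimal Python sketch of
that principle. It is an illustration only, not the Emacs code (which
builds regular expressions instead); the function names are made up
for this example, and it assumes that "ignoring all the non-base
weights" can be approximated by dropping every combining mark after
canonical decomposition.

```python
import unicodedata


def canonical_equal(a: str, b: str) -> bool:
    """Equivalence only: two strings match if their NFD forms are identical."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def fold(text: str) -> str:
    """Folding: decompose to NFD, then drop combining marks (hypothetical helper).

    unicodedata.combining() returns the canonical combining class,
    which is non-zero for marks such as U+0303 COMBINING TILDE.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def folded_equal(a: str, b: str) -> bool:
    """Distinct sequences match if they share the same base characters."""
    return fold(a) == fold(b)


precomposed = "\u00f1"  # ñ as a single code point
decomposed = "n\u0303"  # n followed by COMBINING TILDE

print(canonical_equal(precomposed, decomposed))  # True: same character, two encodings
print(canonical_equal("n", precomposed))         # False: equivalence alone keeps them apart
print(folded_equal("n", precomposed))            # True: folding starts from NFD and drops the mark
```

Both comparisons start from the same normalization form; folding is
the additional step of disregarding what remains after the base
character.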

> Again (I really apologise for repeating myself, I'm starting to sound
> like a troll and that is truly not my intention), the purpose of
> normalisation forms is to ensure that the two variants of ñ compare
> the same. They are not designed to provide a mechanism to allow n to
> compare equal to ñ.

Under character-folding that ignores diacritics, ñ should indeed
compare equal to n.

>     > Yes. I am fully aware of this. But so be it. Having applications
>     > work differently depending on the locale of the environment the
>     > application was started in is nothing new.
>
>     It's not new. It's old. We should move on to more general
>     environments that support multiple languages. Emacs is such an
>     environment. The old l10n paradigms are fundamentally incompatible
>     with that.
>
> Sure, but doesn't it make sense to fall back to the user's default if
> the buffer does not have an overriding locale?

I don't know what you mean by "buffer has an overriding locale".
Emacs buffers don't have a locale, and they cannot have one in
principle because we support multiple languages. E.g., what could the
locale of the HELLO buffer created by "C-h h" be?

>     > Being a multi-lingual environment, Emacs has no real notion of the
>     > locale.
>     >
>     > Perhaps it should?
>
>     That'd be a step backward, IMO.
>
> As opposed to having no concept of locale at all?

Yes. A multilingual environment cannot have a locale in principle.
It will cease being multilingual if it does.

>     Strange, I always thought the data was there. Perhaps you should ask
>     a question on the Unicode mailing list, then.
>
> That's a good idea actually.

That's a relief. I was beginning to suspect I don't have any good
ideas at all.