On 19 February 2016 at 19:46, Eli Zaretskii wrote:

>> Of course you have to use the decomposition algorithms to ensure that
>> the precomposed and decomposed variations of the same character
>> compare equal.
>
> Then you agree that _some_ form of character-folding should be turned
> on by default?

Yes.

>> This is, however, different from using the decomposition to decompose
>> a character and then using the base character as the thing to match
>> against. The latter is what Emacs is doing today, as far as I
>> understand.
>
> Please describe in more detail why you think what Emacs does today is
> not what you think it should do. It's possible we have a
> miscommunication here.

The main issue to me is that it matches things that should not be
matched. A secondary (minor) issue is that some things that should be
matched are not (see my example with U+2C65).

> For example, if the buffer includes ñ (2 characters), should "C-s n"
> find the n in it?

That depends on the locale of the user. However, from the point of view
of a user, there should be no visible difference between the precomposed
and the decomposed variants; they are the exact same character. This is
in line with the Unicode recommendations
(https://en.wikipedia.org/wiki/Unicode_equivalence).

Note: I know it's possible that I am wrong about this, and that Unicode
actually _has_ said that the equivalence tables can be used for this
purpose (i.e. decompose and match against only the base character). If
that is the case, I'd be interested to see a reference to it, but I
would still be of the same opinion, since doing so results in broken
behaviour for a certain class of user.

Thus, if I am Spanish, I will _not_ want either of those to match "n".
If I am Swedish, I will likely want both of them to match "n".

> That equivalence is encoded in the decomposition data that is part of
> UnicodeData.txt, which Emacs uses for character-folding.
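The distinction in question, canonical equivalence between the two spellings of ñ as opposed to folding everything down to the base letter, can be made concrete. Here is a minimal sketch using Python's unicodedata module, purely for illustration (Emacs derives its folding data from UnicodeData.txt, not from this code); it also shows the U+2C65 gap mentioned above:

```python
import unicodedata

precomposed = "\u00F1"   # ñ as a single code point
decomposed = "n\u0303"   # n followed by U+0303 COMBINING TILDE

# Canonical equivalence: normalization maps both spellings onto the
# same form, so they can be treated as the exact same character.
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed

# Base-character folding is a different operation: decompose, then
# throw away the combining marks, so that ñ also matches a plain n.
def fold(s):
    return "".join(ch for ch in unicodedata.normalize("NFD", s)
                   if not unicodedata.combining(ch))

assert fold(precomposed) == "n"

# The U+2C65 gap: LATIN SMALL LETTER A WITH STROKE has no canonical
# decomposition, so decomposition-based folding cannot relate it to
# "a", even though the collation charts group the two together.
assert unicodedata.decomposition("\u00E1") == "0061 0301"  # á decomposes
assert unicodedata.decomposition("\u2C65") == ""           # ⱥ does not
```

It is the second operation, base-character folding, and not canonical equivalence, that makes ñ match a plain n; whether that should be the default is exactly the disagreement here.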
The equivalence tables explain that the precomposed character U+00F1 is
equivalent to the specific sequence U+006E U+0303. That is all they say.
They do not say that ñ is a variation of n; they are an instruction for
how to construct a given character. The decompositions are used in the
normalisation forms to ensure that the two variants are treated equally
(such as the two alternative representations of ñ that we have been
discussing).

>> If you look at the Latin collation chart, for example
>> (http://unicode.org/charts/collation/chart_Latin.html), you will see
>> that the characters are grouped. These are the equivalences I'm
>> referring to.
>
> Yes. And if you look at the entries of the equivalent characters in
> UnicodeData.txt, you will see that they have decompositions, which is
> what Emacs uses for searching when character-folding is in effect.

Yes, and this is where the crux of our disagreement lies, I think. I
previously referred to using the decompositions as a guide to character
equivalence as a "trick". I stand by this, since this is not the purpose
of the decompositions. The best thing that Unicode provides for this
purpose (to my knowledge) is the set of collation charts that I
mentioned previously (http://unicode.org/charts/collation/).

>> Now, I note that on these charts, U+0061 LATIN SMALL LETTER A and
>> U+2C65 LATIN SMALL LETTER A WITH STROKE compare as different
>> characters, and the latter does not have a decomposition. Should this
>> also be addressed?
>
> Maybe so, but given the controversy even about what we do now, which
> is a subset, I doubt extending what we do now is a wise move.

I was just asking to understand your position better.

>>> As for the locale-specific parts: using that will only DTRT if we
>>> assume that the majority of searches are done in buffers holding
>>> text in the locale's language. Is that a good assumption?
>> My opinion is that the default search behaviour should depend
>> primarily on the locale of the entire Emacs session, i.e. the locale
>> of the user starting the application. I am not disagreeing that
>> allowing a buffer-local locale to override this behaviour is a good
>> idea, but as a Swedish speaker I really see å, ä and a as completely
>> separate things, even if the language of the buffer that I am editing
>> happens to be English. The equivalence of these characters is the odd
>> behaviour here, and the one that should be enabled explicitly.
>>
>> Also, if I happen to be editing a Spanish document (I don't speak
>> Spanish) I would find equivalence of ñ and n to be incredibly useful,
>> even though Óscar would grind his teeth at it. :-)
>
> So you are in fact making two contradicting statements here.

Interesting. I have re-read what I wrote, and I really don't see myself
making two contradicting statements. Perhaps you think that I am both
for and against folding at the same time. If that's the case, let me try
to rephrase: I like the idea of character folding, but if it is
implemented incorrectly (by my standards, of course) I would rather not
have it at all, since it would be highly annoying.

> Indeed, the locale in which Emacs started says almost nothing about
> the documents being edited, nor even about the user's preferences: it
> is easy to imagine a user whose "native" locale is X starting Emacs in
> another locale.

Yes, I am fully aware of this. But so be it. Having applications work
differently depending on the locale of the environment they were started
in is nothing new.

>>> We are talking about a multilingual Emacs, in an age of global
>>> communications, where you can have conversations with someone on the
>>> other side of the world, or read text that combines several
>>> languages in the same buffer.
>>> Do we really want to go back to the l10n days, when there was ever
>>> only one locale that was interesting -- the current one? I wonder.
>>
>> Actually, I think so. This is because the search equivalence is
>> inherently a local thing.
>
> Being a multi-lingual environment, Emacs has no real notion of the
> locale.

Perhaps it should?

>>> It is, Unicode provides it. We just didn't import it yet.
>>
>> It does? I was looking for such tables, but didn't find them. Do you
>> have a link?
>
> Look for DUCET and its tailoring data. These should be a good
> starting point:
>
> http://www.unicode.org/Public/UCA/latest/
> http://cldr.unicode.org/

Those are the decomposition charts, and don't actually say anything
about equivalence beyond providing a canonical form for precomposed
characters, as was discussed above.

Regards,
Elias
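To make the locale discussion above concrete: UCA-style matching compares collation weights rather than code points, and a locale tailoring changes those weights. The sketch below uses tiny, made-up weight tables (hypothetical values, not real DUCET or CLDR data) just to show how a Swedish-style tailoring can keep å distinct from a while a default table folds accented variants together:

```python
import unicodedata

# Toy primary-weight tables (hypothetical values, not actual DUCET or
# CLDR data). In the default table, accented Latin letters share the
# primary weight of their base letter; the Swedish-style tailoring
# gives å, ä and ö their own primary weights, as separate letters.
DEFAULT = {"a": 1, "á": 1, "å": 1, "ä": 1, "n": 2, "ñ": 2, "o": 3, "ö": 3}
SWEDISH = {"a": 1, "á": 1, "n": 2, "ñ": 2, "o": 3, "å": 4, "ä": 5, "ö": 6}

def primary_equal(s, t, weights):
    """Match two strings at primary strength (base-letter level only)."""
    # Normalize first, so precomposed and decomposed spellings agree.
    s = unicodedata.normalize("NFC", s)
    t = unicodedata.normalize("NFC", t)
    return [weights.get(c) for c in s] == [weights.get(c) for c in t]

# With the default table a search for "a" matches å; with the Swedish
# tailoring it does not, while ñ still matches n in both, which is the
# behaviour argued for above.
assert primary_equal("å", "a", DEFAULT)
assert not primary_equal("å", "a", SWEDISH)
assert primary_equal("ñ", "n", DEFAULT) and primary_equal("ñ", "n", SWEDISH)
```

A real implementation would of course take its weights from the DUCET file and the CLDR tailorings linked above rather than from hand-written tables; the point of the sketch is only that the equivalence classes are locale-dependent data, not a single global table.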