On 19 February 2016 at 19:46, Eli Zaretskii wrote:

>> Of course you have to use the decomposition algorithms to ensure that
>> the precomposed and decomposed variations of the same character
>> compare equal.
>
> Then you agree that _some_ form of character-folding should be turned
> on by default?

Yes.

>> This is, however, different from using the decomposition to decompose
>> a character and then using the base character as the thing to match
>> against. The latter is what Emacs is doing today, as far as I
>> understand.
>
> Please describe in more detail why you think what Emacs does today is
> not what you think it should do. It's possible we have a
> miscommunication here.

The main issue to me is that it matches things that should not be
matched. A secondary (minor) issue is that some things that should be
matched are not (see my example with U+2C65).

> For example, if the buffer includes ñ (2 characters), should "C-s n"
> find the n in it?

That depends on the locale of the user. However, from the point of view
of a user, there should be no visible difference between the precomposed
and the decomposed variants; they are the exact same character. This is
in line with the Unicode recommendations
(https://en.wikipedia.org/wiki/Unicode_equivalence).

Note: I know it's possible that I am wrong about this, and that Unicode
actually _has_ said that the equivalence tables can be used for this
purpose (i.e. decompose and match against only the base character). If
that is the case, I'd be interested to see a reference to it, but I
would still be of the same opinion, since doing so results in broken
behaviour for a certain class of user.

Thus, if I am Spanish, I will _not_ want either of those to match "n".
If I am Swedish, I will likely want both of them to match "n".

> That equivalence is encoded in the decomposition data that is part of
> UnicodeData.txt, which Emacs uses for character-folding.
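The distinction in question, canonical equivalence between the two spellings of ñ as opposed to folding everything down to the base letter, can be made concrete. Here is a minimal sketch using Python's unicodedata module, purely for illustration (Emacs derives its folding data from UnicodeData.txt, not from this code); it also shows the U+2C65 gap mentioned above:

```python
import unicodedata

precomposed = "\u00F1"   # ñ as a single code point
decomposed = "n\u0303"   # n followed by U+0303 COMBINING TILDE

# Canonical equivalence: normalization maps both spellings onto the
# same form, so they can be treated as the exact same character.
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed

# Base-character folding is a different operation: decompose, then
# throw away the combining marks, so that ñ also matches a plain n.
def fold(s):
    return "".join(ch for ch in unicodedata.normalize("NFD", s)
                   if not unicodedata.combining(ch))

assert fold(precomposed) == "n"

# The U+2C65 gap: LATIN SMALL LETTER A WITH STROKE has no canonical
# decomposition, so decomposition-based folding cannot relate it to
# "a", even though the collation charts group the two together.
assert unicodedata.decomposition("\u00E1") == "0061 0301"  # á decomposes
assert unicodedata.decomposition("\u2C65") == ""           # ⱥ does not
```

It is the second operation, base-character folding, and not canonical equivalence, that makes ñ match a plain n; whether that should be the default is exactly the disagreement here.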
The equivalence tables explain that the precomposed character U+00F1 is
equivalent to the specific sequence U+006E U+0303. That is all they say.
They do not say that ñ is a variation of n; they are an instruction for
how to construct a given character. The decompositions are used in the
normalisation forms to ensure that the two variants are treated equally
(such as the two alternative representations of ñ that we have been
discussing).

>> If you look at the Latin collation chart, for example
>> (http://unicode.org/charts/collation/chart_Latin.html), you will see
>> that the characters are grouped. These are the equivalences I'm
>> referring to.
>
> Yes. And if you look at the entries of the equivalent characters in
> UnicodeData.txt, you will see that they have decompositions, which is
> what Emacs uses for searching when character-folding is in effect.

Yes, and this is where the crux of our disagreement lies, I think. I
previously referred to using the decompositions as a guide to character
equivalence as a "trick". I stand by this, since this is not the purpose
of the decompositions. The best thing that Unicode provides for this
purpose (to my knowledge) is the set of collation charts that I
mentioned previously (http://unicode.org/charts/collation/).

>> Now, I note that on these charts, U+0061 LATIN SMALL LETTER A and
>> U+2C65 LATIN SMALL LETTER A WITH STROKE compare as different
>> characters, and the latter does not have a decomposition. Should this
>> also be addressed?
>
> Maybe so, but given the controversy even about what we do now, which
> is a subset, I doubt extending what we do now is a wise move.

I was just asking to understand your position better.

>>> As for the locale-specific parts: using that will only DTRT if we
>>> assume that the majority of searches are done in buffers holding
>>> text in the locale's language. Is that a good assumption?
>> My opinion is that the default search behaviour should depend
>> primarily on the locale of the entire Emacs session, i.e. the locale
>> of the user starting the application. I am not disagreeing that
>> allowing a buffer-local locale to override this behaviour is a good
>> idea, but as a Swedish speaker I really see å, ä and a as completely
>> separate things, even if the language of the buffer that I am editing
>> happens to be English. The equivalence of these characters is the odd
>> behaviour here, and the one that should be enabled explicitly.
>>
>> Also, if I happen to be editing a Spanish document (I don't speak
>> Spanish) I would find equivalence of ñ and n to be incredibly useful,
>> even though Óscar would grind his teeth at it. :-)
>
> So you are in fact making two contradicting statements here.

Interesting. I have re-read what I wrote, and I really don't see myself
making two contradicting statements. Perhaps you think that I am both
for and against folding at the same time. If that's the case, let me try
to rephrase: I like the idea of character folding, but if it is
implemented incorrectly (by my standards, of course) I would rather not
have it at all, since it would be highly annoying.

> Indeed, the locale in which Emacs started says almost nothing about
> the documents being edited, nor even about the user's preferences: it
> is easy to imagine a user whose "native" locale is X starting Emacs in
> another locale.

Yes, I am fully aware of this. But so be it. Having applications work
differently depending on the locale of the environment they were started
in is nothing new.

>>> We are talking about a multilingual Emacs, in an age of global
>>> communications, where you can have conversations with someone on the
>>> other side of the world, or read text that combines several
>>> languages in the same buffer.
>>> Do we really want to go back to the l10n days, when there was ever
>>> only one locale that was interesting -- the current one? I wonder.
>>
>> Actually, I think so. This is because the search equivalence is
>> inherently a local thing.
>
> Being a multi-lingual environment, Emacs has no real notion of the
> locale.

Perhaps it should?

>>> It is, Unicode provides it. We just didn't import it yet.
>>
>> It does? I was looking for such tables, but didn't find them. Do you
>> have a link?
>
> Look for DUCET and its tailoring data. These should be a good
> starting point:
>
> http://www.unicode.org/Public/UCA/latest/
> http://cldr.unicode.org/

Those are the decomposition charts, and don't actually say anything
about equivalence beyond providing a canonical form for precomposed
characters, as was discussed above.

Regards,
Elias
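To make the locale discussion above concrete: UCA-style matching compares collation weights rather than code points, and a locale tailoring changes those weights. The sketch below uses tiny, made-up weight tables (hypothetical values, not real DUCET or CLDR data) just to show how a Swedish-style tailoring can keep å distinct from a while a default table folds accented variants together:

```python
import unicodedata

# Toy primary-weight tables (hypothetical values, not actual DUCET or
# CLDR data). In the default table, accented Latin letters share the
# primary weight of their base letter; the Swedish-style tailoring
# gives å, ä and ö their own primary weights, as separate letters.
DEFAULT = {"a": 1, "á": 1, "å": 1, "ä": 1, "n": 2, "ñ": 2, "o": 3, "ö": 3}
SWEDISH = {"a": 1, "á": 1, "n": 2, "ñ": 2, "o": 3, "å": 4, "ä": 5, "ö": 6}

def primary_equal(s, t, weights):
    """Match two strings at primary strength (base-letter level only)."""
    # Normalize first, so precomposed and decomposed spellings agree.
    s = unicodedata.normalize("NFC", s)
    t = unicodedata.normalize("NFC", t)
    return [weights.get(c) for c in s] == [weights.get(c) for c in t]

# With the default table a search for "a" matches å; with the Swedish
# tailoring it does not, while ñ still matches n in both, which is the
# behaviour argued for above.
assert primary_equal("å", "a", DEFAULT)
assert not primary_equal("å", "a", SWEDISH)
assert primary_equal("ñ", "n", DEFAULT) and primary_equal("ñ", "n", SWEDISH)
```

A real implementation would of course take its weights from the DUCET file and the CLDR tailorings linked above rather than from hand-written tables; the point of the sketch is only that the equivalence classes are locale-dependent data, not a single global table.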