On 19 February 2016 at 16:20, Eli Zaretskii <eliz@gnu.org> wrote:

> > From: Lars Ingebrigtsen <larsi@gnus.org>
> > Date: Fri, 19 Feb 2016 16:11:41 +1100
> >
> > Here's my vote: I think character folding is a good idea, and that it
> > should be turned on by default if it respects the locale.  If not, it
> > should be off by default.
>
> Thanks.  But what does "respect the locale" mean, in practical terms?
> A large portion of the characters that have some decomposition, and
> thus will be folded when searching, belong to scripts that are not
> related to any language or other locale-specific attribute.  What do
> you think should be done with them in the context of this feature?
>

The Unicode character decomposition was never meant to be used to provide a
feature such as character folding in Emacs. But Unicode doesn't really
provide a good alternative. The standard itself states that this belongs to
the realm of localisation (IIRC, it even goes as far as mentioning Swedish
as a counterexample).

I readily agree that using the decomposition is a clever way to get the
functionality quite a long way, but in the cases where it breaks down, it
does so quite spectacularly, and that's what I (and others) have been
opposing.
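
To make the failure mode concrete, here is a minimal sketch in Python (not
Emacs Lisp, and not how Emacs implements it) of folding by canonical
decomposition. The function name fold_by_decomposition is made up for this
illustration; only the standard unicodedata module is used.

```python
import unicodedata

def fold_by_decomposition(text):
    """Decompose to NFD, then drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Folding the accent away is usually what a searcher wants here.
print(fold_by_decomposition("caf\u00e9"))

# In Swedish, a-ring is a letter of its own, but it decomposes to
# "a" + COMBINING RING ABOVE, so two different words become identical.
print(fold_by_decomposition("f\u00e5r") == fold_by_decomposition("far"))
```

The decomposition data has no way of knowing that the second case involves
two distinct letters of an alphabet rather than a letter with an accent.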

My suggestion would be to apply several levels of comparisons:

  1. Check if the characters have locale-specific folding rules (for
Swedish, this would be no more than 3-5 characters or so). If not:
  2. Check the equivalence according to the Unicode collation charts:
http://unicode.org/charts/collation/
  3. (maybe) Use the decomposition trick

As for the per-locale exception tables mentioned in point 1, I don't know
if such information is easily available. It may be possible to extract it
from the glibc localedata files. But even if it isn't, creating one
for a language should be trivial since we only need a list of character
groups that should _not_ be folded, which for most languages should be a
very small list (in fact, for most(?) it's probably empty).
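
As a rough sketch of how the levels above could fit together, again in
Python rather than Emacs Lisp. Everything named here is hypothetical: the
NO_FOLD table, its "sv" entry, and the chars_match function are invented
for illustration. Level 2 is only a pluggable hook, since Python's standard
library has no Unicode collation implementation.

```python
import unicodedata

# Hypothetical per-locale table of characters that must never be folded.
# The Swedish entry is an assumption for illustration, not real locale data.
NO_FOLD = {
    "sv": set("\u00e5\u00e4\u00f6\u00c5\u00c4\u00d6"),  # a-ring, a/o-diaeresis
}

def base_char(ch):
    """Level 3 helper: decompose and drop combining marks."""
    decomposed = unicodedata.normalize("NFD", ch)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped or ch

def chars_match(a, b, locale, collation_equal=None):
    """Decide whether two characters should match when searching."""
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    if a == b:
        return True
    # Level 1: a locale-specific rule, if present, wins outright.
    protected = NO_FOLD.get(locale, set())
    if a in protected or b in protected:
        return False
    # Level 2: optional hook standing in for a collation-chart lookup.
    # It returns True/False, or None when it has no opinion.
    if collation_equal is not None:
        verdict = collation_equal(a, b)
        if verdict is not None:
            return verdict
    # Level 3: fall back to the decomposition trick.
    return base_char(a) == base_char(b)

print(chars_match("\u00e5", "a", "sv"))  # protected letter in Swedish
print(chars_match("\u00e5", "a", "en"))  # no exception table for "en"
print(chars_match("\u00e9", "e", "sv"))  # not in the Swedish table
```

A locale with no entry in the table simply falls through to the later
levels, which matches the expectation that most languages need an empty
list.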

Regards,
Elias