On 19 February 2016 at 16:20, Eli Zaretskii wrote:

> > From: Lars Ingebrigtsen
> > Date: Fri, 19 Feb 2016 16:11:41 +1100
> >
> > Here's my vote: I think character folding is a good idea, and that it
> > should be turned on by default if it respects the locale. If not, it
> > should be off by default.
>
> Thanks. But what does "respect the locale" mean, in practical terms?
> A large portion of the characters that have some decomposition, and
> thus will be folded when searching, belong to scripts that are not
> related to any language or other locale-specific attribute. What do
> you think should be done with them in the context of this feature?
>

The Unicode character decomposition was never meant to be used to
provide a feature such as character folding in Emacs. But Unicode
doesn't really provide a good alternative. The standard itself states
that this belongs to the realm of localisation (IIRC, it even goes as
far as mentioning Swedish as a counterexample).

I readily agree that using the decomposition is a clever way to get the
functionality quite a long way, but in the cases where it breaks down,
it does so quite spectacularly, and that's what I (and others) have
been opposing.

My suggestion would be to apply several levels of comparison:

  1. Check if the characters have locale-specific folding rules (for
     Swedish, this would be no more than 3-5 characters or so). If not:

  2. Check the equivalence according to the Unicode collation charts:
     http://unicode.org/charts/collation/

  3. (maybe) Use the decomposition trick

As for the per-locale exception tables mentioned in point 1, I don't
know if such information is easily available. It may be possible to
extract it from the localedata files from Glibc. But even if it isn't,
creating one for a language should be trivial, since we only need a
list of character groups that should _not_ be folded, which for most
languages should be a very small list (in fact, for most(?) it's
probably empty).

Regards,
Elias
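
The layered comparison suggested above can be sketched roughly as
follows. This is an illustration in Python, not Emacs code, and every
name in it (LOCALE_NO_FOLD, base_char, fold_char, chars_match) is
hypothetical. The Swedish table is an assumed example of a per-locale
"do not fold" list, not data taken from Glibc or from Unicode. Level 2
(the collation charts) is left out because it needs collation data that
this sketch does not assume, and level 3 is approximated here with
canonical (NFD) decomposition.

```python
import unicodedata

# Hypothetical per-locale exception table: characters that must NOT be
# folded to their base letter. The Swedish entries (a-ring, a-diaeresis,
# o-diaeresis, lower and upper case) are an illustrative assumption.
LOCALE_NO_FOLD = {
    "sv": frozenset("\u00e5\u00e4\u00f6\u00c5\u00c4\u00d6"),
}


def base_char(ch):
    """Return the first code point of the canonical decomposition of ch."""
    return unicodedata.normalize("NFD", ch)[0]


def fold_char(ch, locale):
    """Map a single character to the key used when comparing for search."""
    # Level 1: a locale-specific exception wins over everything else.
    if ch in LOCALE_NO_FOLD.get(locale, frozenset()):
        return ch
    # Level 2 (collation-chart equivalence) is not modelled in this sketch.
    # Level 3: fall back to the decomposition trick.
    return base_char(ch)


def chars_match(a, b, locale):
    """True if characters a and b should be treated as equal in a search."""
    return fold_char(a, locale) == fold_char(b, locale)
```

With this table, "a" matches "ä" under an English locale but not under
a Swedish one, while "e" still matches "é" in both, since "é" is not in
the Swedish exception list.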