From: Eli Zaretskii <eliz@gnu.org>
To: "Roland Winkler" <winkler@gnu.org>
Cc: emacs-devel@gnu.org
Subject: Re: strip accents and sorting [was: BibTeX issues]
Date: Thu, 29 Aug 2019 10:10:37 +0300
Message-ID: <83lfvcbg5u.fsf@gnu.org>
In-Reply-To: <17902.3833.825923.23911@gargle.gargle.HOWL> (winkler@gnu.org)

> Date: Wed, 28 Aug 2019 22:26:38 -0500
> From: "Roland Winkler" <winkler@gnu.org>
> Cc: emacs-devel@gnu.org
>
> On Wed Aug 28 2019 Eli Zaretskii wrote:
> > > From: Roland Winkler <winkler@gnu.org>
> > > If there was a generic function strip-accents, then BibTeX mode could
> > > certainly use it within its bibtex-generate-autokey machinery.
> >
> > I don't think we have such a function, but it shouldn't be hard to
> > write one, using the facilities in ucs-normalize.el.
>
> Interesting! What are the intended use cases for ucs-normalize.el
> and the algorithms that it implements?

To implement the normalization forms described in UAX #15, "Unicode
Normalization Forms" (http://www.unicode.org/reports/tr15/). We
already use some of that in implementing the utf-8-hfs file-name
encoding (used by macOS).

> I had never much thought about this. But there is obviously a
> problem when one tries to sort a database where the keys may contain
> more fancy utf characters. (This problem must be well-known in the
> utf world). Naively one might hope that the following lines are
> properly sorted according to string-lessp

As Martin points out, you should use string-collate-lessp instead for
these use cases.

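
A rough sketch of the difference, in case it helps. The helper name
my-sort-collated, the word list, and the "de_DE.UTF-8" locale in the
usage line are made up for illustration. Whether that locale is
available depends on the system, and where collation is not supported
string-collate-lessp behaves like string-lessp, so no particular
collated order is promised here:

```elisp
(defun my-sort-collated (strings &optional locale)
  "Return a sorted copy of STRINGS using `string-collate-lessp'.
LOCALE, if non-nil, is passed through as the collation locale.
Hypothetical helper, for illustration only."
  (sort (copy-sequence strings)
        (lambda (a b) (string-collate-lessp a b locale))))

;; Codepoint order: "Ä" (U+00C4) sorts after every ASCII letter.
(sort (list "Zebra" "Äpfel" "Apfel") #'string-lessp)

;; Locale-aware order; the locale name is an assumption.
(my-sort-collated (list "Zebra" "Äpfel" "Apfel") "de_DE.UTF-8")
```
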
> Of course, this is due to the fact that a German umlaut can be
> represented with its own character or with a combining diaeresis.
> These two ways of presenting an umlaut look the same, but they are
> not the same for string-lessp.

The Unicode Standard mandates that they be handled identically,
including in searching and sorting. We don't yet implement that fully,
but see char-fold.el for a partial (and not very efficient)
implementation for searching.

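
To make the two representations concrete, here is a small sketch using
ucs-normalize.el. The helper my-canonically-equal-p is hypothetical,
not an existing Emacs function; it simply compares the NFD forms of
its arguments:

```elisp
(require 'ucs-normalize)

(defun my-canonically-equal-p (s1 s2)
  "Return non-nil if S1 and S2 are equal after NFD normalization.
Hypothetical helper, for illustration only."
  (string= (ucs-normalize-NFD-string s1)
           (ucs-normalize-NFD-string s2)))

;; Precomposed U+00E4 versus "a" followed by U+0308 COMBINING DIAERESIS.
;; A plain comparison sees two different strings:
(string= "\u00e4" "a\u0308")
;; The normalized comparison treats them as equal:
(my-canonically-equal-p "\u00e4" "a\u0308")
```
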
> Now, one solution would be to simply strip off the combining
> characters by decomposing the characters. Or is there a possibility
> to teach a sorting algorithm that the first letter of ä-combine is
> "the same" as the first letter of ä-umlaut and all this should
> appear near a-plain instead of past o-plain?

Both should be possible. To strip the combining accents entirely, you
can decompose the text with ucs-normalize (e.g. to NFD), and then
filter out all characters whose canonical combining class is non-zero.
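
Something along these lines might do as a starting point. This is an
untested-in-the-wild sketch, not a proposed patch: the name
my-strip-accents is hypothetical, and NFD is my assumption for the
decomposition step (NFKD would additionally fold compatibility
characters):

```elisp
(require 'ucs-normalize)
(require 'seq)

(defun my-strip-accents (string)
  "Return a copy of STRING with combining marks removed.
Decompose STRING to NFD, then drop every character whose canonical
combining class is non-zero.  Hypothetical helper, for illustration."
  (concat
   (seq-remove
    (lambda (ch)
      ;; Treat a missing property value as class 0 (a base character).
      (/= 0 (or (get-char-code-property ch 'canonical-combining-class) 0)))
    (ucs-normalize-NFD-string string))))

(my-strip-accents "M\u00fcller")   ; => "Muller"
```

Note that this only removes marks that arise from canonical
decomposition. Letters such as U+00F8 (o with stroke) or U+00DF (sharp
s) have no such decomposition and pass through unchanged.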