From: "Roland Winkler" <winkler@gnu.org>
To: Eli Zaretskii <eliz@gnu.org>
Cc: emacs-devel@gnu.org
Subject: strip accents and sorting [was: BibTeX issues]
Date: Wed, 28 Aug 2019 22:26:38 -0500
Message-ID: <17902.3833.825923.23911@gargle.gargle.HOWL>
In-Reply-To: <83tva1b02r.fsf@gnu.org>

On Wed Aug 28 2019 Eli Zaretskii wrote:
> > From: Roland Winkler <winkler@gnu.org>
> > If there was a generic function strip-accents, then BibTeX mode could
> > certainly use it within its bibtex-generate-autokey machinery.
> 
> I don't think we have such a function, but it shouldn't be hard to
> write one, using the facilities in ucs-normalize.el.

Interesting! What are the intended use cases for ucs-normalize.el
and the algorithms that it implements?

I had never thought much about this.  But there is obviously a
problem when one tries to sort a database whose keys may contain
non-ASCII Unicode characters.  (This problem must be well known in
the Unicode world.)  Naively, one might hope that the following lines
are properly sorted according to string-lessp

  ä-combine
  ä-umlaut
  ö-combine
  ö-umlaut

But (string-lessp "ä-umlaut" "ö-combine") returns nil, so sort-lines gives

  ä-combine
  ö-combine
  ä-umlaut
  ö-umlaut

Of course, this is because a German umlaut can be represented
either by its own precomposed character or by a base letter followed
by a combining diaeresis.  These two representations look the same,
but they are not the same for string-lessp.
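
To make the difference concrete, here is a minimal sketch that builds
both representations from explicit code points, so that nothing depends
on how a mail client or editor happened to encode the umlauts.  The
variable names are hypothetical; ucs-normalize-NFC-string comes from
ucs-normalize.el.

```elisp
(require 'ucs-normalize)

;; Precomposed form: the single character U+00E4.
(defvar my-a-umlaut (string #xE4))
;; Decomposed form: "a" followed by U+0308 COMBINING DIAERESIS.
(defvar my-a-combine (string ?a #x308))

;; The two strings look alike but consist of different code points.
(string= my-a-umlaut my-a-combine)       ; nil
;; string-lessp compares code points: ?a (97) precedes U+00E4 (228) ...
(string-lessp my-a-combine my-a-umlaut)  ; t
;; ... and U+00E4 (228) lies past ?o (111).
(string-lessp my-a-umlaut "o")           ; nil

;; Normalizing both strings to the same form (here NFC) makes them equal.
(string= (ucs-normalize-NFC-string my-a-umlaut)
         (ucs-normalize-NFC-string my-a-combine))  ; t
```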

This can be particularly annoying when a database (be it BibTeX,
BBDB, or whatever) is populated by copying records from different
sources that may represent such characters in different ways.

Now, one solution would be to decompose the characters and then
simply strip off the combining characters.  Or is there a way to
teach a sorting algorithm that the first letter of ä-combine is
"the same" as the first letter of ä-umlaut, and that both should
appear near a-plain instead of past o-plain?
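
The first idea could be sketched roughly as follows.  This is only an
illustration, not an existing Emacs function: my-strip-accents and
my-accent-insensitive-lessp are hypothetical names, and the approach
assumes that "accent" means a character of Unicode general category Mn
(nonspacing mark) left over after NFD decomposition.

```elisp
(require 'ucs-normalize)
(require 'seq)

(defun my-strip-accents (string)
  "Return STRING decomposed to NFD with nonspacing marks removed.
Hypothetical helper, not part of Emacs."
  (concat
   (seq-remove
    (lambda (char)
      ;; General category Mn covers combining accents such as U+0308.
      (eq (get-char-code-property char 'general-category) 'Mn))
    (ucs-normalize-NFD-string string))))

(defun my-accent-insensitive-lessp (a b)
  "Compare A and B with `string-lessp' after stripping accents.
Hypothetical predicate for illustration."
  (string-lessp (my-strip-accents a) (my-strip-accents b)))

;; Both representations reduce to the same plain string.
(my-strip-accents (string #xE4))     ; "a"
(my-strip-accents (string ?a #x308)) ; "a"

;; With this predicate the a-entries sort before "o-plain",
;; whichever representation they use.
(sort (list "o-plain"
            (concat (string #xE4) "-umlaut")
            (concat (string ?a #x308) "-combine"))
      #'my-accent-insensitive-lessp)
```

Letters that have no canonical decomposition (for example ß or ø) pass
through unchanged, and stripping throws away distinctions that some
languages treat as significant.  For locale-aware ordering, Emacs also
provides string-collate-lessp, whose result depends on the collation
rules of the system locale.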

Roland



