From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!.POSTED.blaine.gmane.org!not-for-mail
From: Eli Zaretskii <eliz@gnu.org>
Newsgroups: gmane.emacs.devel
Subject: Re: strip accents and sorting [was: BibTeX issues]
Date: Thu, 29 Aug 2019 10:10:37 +0300
Message-ID: <83lfvcbg5u.fsf@gnu.org>
References: <87mufv2e9s.fsf@uni-bielefeld.de> <87ftllji9u.fsf@gnu.org>
 <83tva1b02r.fsf@gnu.org> <17902.3833.825923.23911@gargle.gargle.HOWL>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Cc: emacs-devel@gnu.org
To: "Roland Winkler" <winkler@gnu.org>
In-reply-to: <17902.3833.825923.23911@gargle.gargle.HOWL> (winkler@gnu.org)
Xref: news.gmane.org gmane.emacs.devel:239664
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/239664>

> Date: Wed, 28 Aug 2019 22:26:38 -0500
> From: "Roland Winkler" <winkler@gnu.org>
> Cc: emacs-devel@gnu.org
> 
> On Wed Aug 28 2019 Eli Zaretskii wrote:
> > > From: Roland Winkler <winkler@gnu.org>
> > > If there was a generic function strip-accents, then BibTeX mode could
> > > certainly use it within its bibtex-generate-autokey machinery.
> > 
> > I don't think we have such a function, but it shouldn't be hard to
> > write one, using the facilities in ucs-normalize.el.
> 
> Interesting! What are the intended use cases for ucs-normalize.el
> and the algorithms that it implements?

To implement the functionality described in UAX#15, Unicode
Normalization Forms (http://www.unicode.org/reports/tr15/).  We
already use some of it to implement the utf-8-hfs file-name
encoding (used by macOS).

> I had never much thought about this.  But there is obviously a
> problem when one tries to sort a database where the keys may contain
> more fancy utf characters. (This problem must be well-known in the
> utf world).  Naively one might hope that the following lines are
> properly sorted according to string-lessp

As Martin points out, you should use string-collate-lessp instead for
these use cases.

> Of course, this is due to the fact that a German umlaut can be
> represented with its own character or with a combining diaeresis.
> These two ways of presenting an umlaut look the same, but they are
> not the same for string-lessp.

The Unicode Standard mandates that they be handled identically,
including in searching and sorting.  We don't yet implement that 100%,
but see char-fold.el for a partial (and not very efficient)
implementation during search.
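
As a language-neutral illustration of the equivalence in question
(a minimal sketch in Python rather than Emacs Lisp; the helper name
canonically_equal is made up for this example), the two spellings of
an umlaut differ code point by code point, but compare equal once
both are brought to the same normalization form:

```python
import unicodedata

precomposed = "\u00e4"   # LATIN SMALL LETTER A WITH DIAERESIS, one code point
decomposed = "a\u0308"   # "a" followed by COMBINING DIAERESIS, two code points


def canonically_equal(s, t):
    """Hypothetical helper: compare two strings after normalizing both to NFD."""
    return unicodedata.normalize("NFD", s) == unicodedata.normalize("NFD", t)


# A plain code-point comparison sees two different strings ...
plain_equal = (precomposed == decomposed)
# ... while the normalized comparison treats them as the same text.
normalized_equal = canonically_equal(precomposed, decomposed)
```

This only shows equality under normalization; it says nothing about
locale-aware ordering, which is a separate matter.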

> Now, one solution would be to simply strip off the combining
> characters by decomposing the characters.  Or is there a possibility
> to teach a sorting algorithm that the first letter of ä-combine is
> "the same" as the first letter of ä-umlaut and all this should
> appear near a-plain instead of past o-plain?

Both should be possible.  To entirely strip the combining accents, you
can first decompose the string with ucs-normalize (for example with
ucs-normalize-NFD-string), and then filter out all characters whose
canonical combining class is non-zero.
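
The recipe above (decompose, then drop every character with a
non-zero canonical combining class) can be sketched as follows.  This
is an illustration in Python, not an Emacs Lisp implementation, and
the name strip_accents is just the hypothetical name used earlier in
this thread:

```python
import unicodedata


def strip_accents(text):
    """Sketch: decompose to NFD, then drop all combining characters."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns the canonical combining class;
    # it is 0 for base characters and non-zero for combining marks.
    return "".join(ch for ch in decomposed if unicodedata.combining(ch) == 0)


# Both spellings of a-umlaut reduce to a plain "a".
from_precomposed = strip_accents("\u00e4")
from_decomposed = strip_accents("a\u0308")

# Used as a sort key, the accented forms land next to "a", before "o".
names = ["o", "\u00e4", "a"]
sorted_names = sorted(names, key=strip_accents)
```

Note that this only affects characters that have a canonical
decomposition; letters such as ß or ø have none and pass through
unchanged.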