From: Eli Zaretskii
Newsgroups: gmane.emacs.devel
Subject: Re: strip accents and sorting [was: BibTeX issues]
Date: Thu, 29 Aug 2019 10:10:37 +0300
Message-ID: <83lfvcbg5u.fsf@gnu.org>
References: <87mufv2e9s.fsf@uni-bielefeld.de> <87ftllji9u.fsf@gnu.org> <83tva1b02r.fsf@gnu.org> <17902.3833.825923.23911@gargle.gargle.HOWL>
In-reply-to: <17902.3833.825923.23911@gargle.gargle.HOWL> (winkler@gnu.org)
To: "Roland Winkler"
Cc: emacs-devel@gnu.org

> Date: Wed, 28 Aug 2019 22:26:38 -0500
> From: "Roland Winkler"
> Cc: emacs-devel@gnu.org
>
> On Wed Aug 28 2019 Eli Zaretskii wrote:
> > > From: Roland Winkler
> > > If there was a generic function strip-accents, then BibTeX mode could
> > > certainly use it within its bibtex-generate-autokey machinery.
> >
> > I don't think we have such a function, but it shouldn't be hard to
> > write one, using the facilities in ucs-normalize.el.
>
> Interesting!  What are the intended use cases for ucs-normalize.el
> and the algorithms that it implements?

To implement the functionality described in UAX#15 Unicode
Normalization Forms (http://www.unicode.org/reports/tr15/).  We
already use some of that in implementing the utf8-hfs file-name
encoding (used by macOS).

> I had never much thought about this.  But there is obviously a
> problem when one tries to sort a database where the keys may contain
> more fancy utf characters.  (This problem must be well-known in the
> utf world).  Naively one might hope that the following lines are
> properly sorted according to string-lessp

As Martin points out, you should use string-collate-lessp instead for
these use cases.

> Of course, this is due to the fact that a German umlaut can be
> represented with its own character or with a combining diaeresis.
> These two ways of presenting an umlaut look the same, but they are
> not the same for string-lessp.

The Unicode Standard mandates that they be handled identically,
including in searching and sorting.
We don't yet implement that 100%, but see char-fold.el for a partial
(and not very efficient) implementation during search.

> Now, one solution would be to simply strip off the combining
> characters by decomposing the characters.  Or is there a possibility
> to teach a sorting algorithm that the first letter of ä-combine is
> "the same" as the first letter of ä-umlaut and all this should
> appear near a-plain instead of past o-plain?

Both should be possible.  To entirely strip the combining accents, you
can use ucs-normalize, and then filter out all characters whose
canonical combining class is non-zero.
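
A minimal sketch of that last approach, under stated assumptions: the
name my-strip-accents is hypothetical (no such function is claimed to
exist in Emacs), and NFD is assumed to be the appropriate normalization
form for the decomposition step.  It relies only on
ucs-normalize-NFD-string from ucs-normalize.el and on
get-char-code-property.

```elisp
(require 'ucs-normalize)   ; provides `ucs-normalize-NFD-string'

;; Hypothetical helper, for illustration only.
(defun my-strip-accents (string)
  "Return a copy of STRING with combining marks removed.
STRING is first decomposed to NFD, then every character whose
canonical combining class is non-zero is dropped."
  (let (kept)
    (dolist (ch (string-to-list (ucs-normalize-NFD-string string)))
      (let ((ccc (get-char-code-property ch 'canonical-combining-class)))
        ;; Base characters have combining class 0; combining accents
        ;; such as U+0308 COMBINING DIAERESIS have a non-zero class.
        (unless (and (integerp ccc) (> ccc 0))
          (push ch kept))))
    (concat (nreverse kept))))

;; Both spellings of the umlaut reduce to the same plain string:
(my-strip-accents "M\u00fcller")    ; precomposed ü     ⇒ "Muller"
(my-strip-accents "Mu\u0308ller")   ; u + U+0308        ⇒ "Muller"
```

Characters that have no canonical decomposition, such as ø, pass
through unchanged, so this is not a general transliteration.  For
ordering strings, string-collate-lessp remains the better tool;
stripping accents mainly suits things like generated keys.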