From: "Roland Winkler"
Newsgroups: gmane.emacs.devel
Subject: strip accents and sorting [was: BibTeX issues]
Date: Wed, 28 Aug 2019 22:26:38 -0500
Message-ID: <17902.3833.825923.23911@gargle.gargle.HOWL>
References: <87mufv2e9s.fsf@uni-bielefeld.de> <87ftllji9u.fsf@gnu.org> <83tva1b02r.fsf@gnu.org>
In-Reply-To: <83tva1b02r.fsf@gnu.org>
To: Eli Zaretskii
Cc: emacs-devel@gnu.org

On Wed Aug 28 2019 Eli Zaretskii wrote:
> > From: Roland Winkler
> > If there was a generic function strip-accents, then BibTeX mode could
> > certainly use it within its bibtex-generate-autokey machinery.
>
> I don't think we have such a function, but it shouldn't be hard to
> write one, using the facilities in ucs-normalize.el.

Interesting! What are the intended use cases for ucs-normalize.el
and the algorithms that it implements?

I had never thought much about this. But there is obviously a
problem when one tries to sort a database whose keys may contain
more fancy utf characters. (This problem must be well known in the
utf world.) Naively, one might hope that the following lines are
properly sorted according to string-lessp:

  ä-combine
  ä-umlaut
  ö-combine
  ö-umlaut

But

  (string-lessp "ä-umlaut" "ö-combine")

gives nil, so that sort-lines gives

  ä-combine
  ö-combine
  ä-umlaut
  ö-umlaut

Of course, this is due to the fact that a German umlaut can be
represented either by its own (precomposed) character or by a base
letter followed by a combining diaeresis. These two representations
look the same, but they are not the same for string-lessp. This can
be particularly annoying when a database (be it BibTeX, BBDB, or
whatever) is populated by copying records from different sources
that may represent such fancy utf characters in different ways.

Now, one solution would be to simply strip off the combining
characters after decomposing the characters.
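
A minimal sketch of that stripping idea, assuming the NFD function from ucs-normalize.el is used for the decomposition. The names my-strip-accents and my-accent-insensitive-lessp are hypothetical helpers made up for illustration; they are not existing Emacs functions. The strings use \u escapes so that the two representations stay distinguishable in the source.

```elisp
(require 'ucs-normalize)
(require 'seq)

(defun my-strip-accents (string)
  "Return STRING with combining marks removed.
Hypothetical helper: decompose STRING to NFD, then drop every
character whose Unicode general category is Mn (nonspacing mark)."
  (concat
   (seq-remove (lambda (char)
                 (eq (get-char-code-property char 'general-category) 'Mn))
               (ucs-normalize-NFD-string string))))

(defun my-accent-insensitive-lessp (a b)
  "Compare A and B with `string-lessp' after stripping accents."
  (string-lessp (my-strip-accents a) (my-strip-accents b)))

;; \u00e4 and \u00f6 are the precomposed umlauts; \u0308 is the
;; combining diaeresis that follows a plain base letter.
(sort (list "\u00e4-umlaut" "o\u0308-combine"
            "a\u0308-combine" "\u00f6-umlaut")
      #'my-accent-insensitive-lessp)
```

With this predicate both a-forms should sort ahead of both o-forms, which is the order hoped for above. The price is that the comparison is lossy: an umlaut and its plain base letter compare as equal, so it suits a sort key better than an identity test.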
Or is there a possibility to teach a sorting algorithm that the
first letter of ä-combine is "the same" as the first letter of
ä-umlaut and all this should appear near a-plain instead of past
o-plain?

Roland