unofficial mirror of bug-gnu-emacs@gnu.org 
From: Michael Albinus <michael.albinus@gmx.de>
To: Eli Zaretskii <eliz@gnu.org>
Cc: michael_heerdegen@web.de, 18051@debbugs.gnu.org
Subject: bug#18051: 24.3.92; ls-lisp: Sorting; make ls-lisp-string-lessp a normal function?
Date: Sat, 16 Aug 2014 23:52:16 +0200
Message-ID: <877g28w19r.fsf@gmx.de>
In-Reply-To: <87egxggigj.fsf@gmx.de> (Michael Albinus's message of "Sun, 20 Jul 2014 17:26:04 +0200")

Michael Albinus <michael.albinus@gmx.de> writes:

>>> On systems without glib, we might emulate it partially. Packages
>>> like ls-lisp could use it then for sorting.
>>
>> I think we need our own implementation in any case.  If nothing else,
>> that would solve the issue of encoding strings into UTF-8 before
>> calling external C functions.
>
> Yep. But given the complexity of UCA, we will start slowly with a subset
> of the algorithm only. This and performance considerations will still
> demand for a native C library, if available.

Out of curiosity, I've taken g_utf8_collate from glib for a test. It
doesn't work badly.

I have added two functions `gstring-lessp' and `gstring-equalp', which
are meant to be the collation counterparts of `string-lessp' and
`string-equal'. Here are some tests, taken from UTS#10, chapter 1.1
"Multi-Level Comparison":

--8<---------------cut here---------------start------------->8---
(sort '("role" "roles" "rule") 'string-lessp)
=> ("role" "roles" "rule")

(sort '("role" "roles" "rule") 'gstring-lessp)
=> ("role" "roles" "rule")
--8<---------------cut here---------------end--------------->8---

No surprise that they return the same result; this is a level 1
comparison, where only the base characters are compared.

--8<---------------cut here---------------start------------->8---
(sort '("role" "rôle" "roles") 'string-lessp)
=> ("role" "roles" "rôle")

(sort '("role" "rôle" "roles") 'gstring-lessp)
=> ("role" "rôle" "roles")
--8<---------------cut here---------------end--------------->8---

Accent differences are typically ignored in collation if the base
letters differ. And so on; I have applied further tests from there ...
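
To make the two levels concrete, here is a toy sketch in Lisp. It is
neither the UCA nor what g_utf8_collate does internally; the accent
table and all `my-toy-' names are made up for this illustration and
cover only a handful of letters.

```elisp
;; Toy two-level collation, NOT the real UCA.  The table below is a
;; hand-written assumption covering only a few accented letters.
(defvar my-toy-base-letters
  '((?ô . ?o) (?é . ?e) (?è . ?e) (?å . ?a))
  "Hypothetical mapping from accented letters to their base letters.")

(defun my-toy-primary-key (string)
  "Return STRING with known accents stripped and case folded (level 1 key)."
  (downcase
   (apply #'string
          (mapcar (lambda (ch)
                    (or (cdr (assq ch my-toy-base-letters)) ch))
                  string))))

(defun my-toy-collate-lessp (s1 s2)
  "Compare S1 and S2 by base letters first, by accents only on a tie."
  (let ((k1 (my-toy-primary-key s1))
        (k2 (my-toy-primary-key s2)))
    (if (string-equal k1 k2)
        (string-lessp s1 s2)     ; level 2 tie-break: plain code points
      (string-lessp k1 k2))))    ; level 1: base letters decide

(sort (list "role" "rôle" "roles") #'my-toy-collate-lessp)
;; => ("role" "rôle" "roles")
```

"role" and "rôle" share the same level 1 key, so the accent only breaks
the tie between them, and both sort before "roles".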

The collation rules can even be influenced by the locale environment.
The following example is taken from ISO 14651:2011, appendix D.3. If
LC_COLLATE is set to C.utf8, `string-lessp' and `gstring-lessp' behave
the same:

--8<---------------cut here---------------start------------->8---
(sort '("Alzheimer" "czar" "cæsium" "cølibat" "Aachen" "Aalborg" "Århus") 'string-lessp)
=> ("Aachen" "Aalborg" "Alzheimer" "czar" "cæsium" "cølibat" "Århus")

(sort '("Alzheimer" "czar" "cæsium" "cølibat" "Aachen" "Aalborg" "Århus") 'gstring-lessp)
=> ("Aachen" "Aalborg" "Alzheimer" "czar" "cæsium" "cølibat" "Århus")
--8<---------------cut here---------------end--------------->8---

When I set LC_COLLATE to en_US.utf8, accent differences are ignored,
again:

--8<---------------cut here---------------start------------->8---
(sort '("Alzheimer" "czar" "cæsium" "cølibat" "Aachen" "Aalborg" "Århus") 'gstring-lessp)
=> ("Aachen" "Aalborg" "Alzheimer" "Århus" "cæsium" "cølibat" "czar")
--8<---------------cut here---------------end--------------->8---

But with LC_COLLATE set to da_DK.utf8, the order differs, because "cz"
sorts before "cæ", and "aa" is equivalent to "å", which sorts after "z".

--8<---------------cut here---------------start------------->8---
(sort '("Alzheimer" "czar" "cæsium" "cølibat" "Aachen" "Aalborg" "Århus") 'gstring-lessp)
=> ("Alzheimer" "czar" "cæsium" "cølibat" "Aachen" "Aalborg" "Århus")
--8<---------------cut here---------------end--------------->8---

Well, for practical use cases it seems worthwhile to include
g_utf8_collate in Emacs. Of course, it could be used only when glib is
linked, so we might still need our own Lisp implementation. I don't
know how well g_utf8_collate works for non-Latin characters, though.
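
A caller such as ls-lisp could guard against the missing primitive
along these lines. This is only a sketch: `my-collate-lessp' is a
hypothetical name, and `gstring-lessp' is the experimental function
from above, which exists only in my test build.

```elisp
(defun my-collate-lessp (s1 s2)
  "Return non-nil if S1 sorts before S2, using collation when available.
Falls back to `string-lessp' when `gstring-lessp' is not defined,
for example when Emacs is built without glib."
  (if (fboundp 'gstring-lessp)
      ;; `funcall' on the symbol avoids a byte-compiler warning about
      ;; a function that may not exist in this build.
      (funcall 'gstring-lessp s1 s2)
    (string-lessp s1 s2)))

(sort (list "rule" "roles" "role") #'my-collate-lessp)
;; => ("role" "roles" "rule")
```

A Lisp implementation, once there is one, would replace the
`string-lessp' branch.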

And the test files CollationTest_NON_IGNORABLE.txt and
CollationTest_SHIFTED.txt from UTS#10 do not pass completely. I don't
know whether this is due to a limitation of g_utf8_collate, or because
I have taken the latest Unicode 7.0.0 test files, which might include
tests for data that hasn't reached GNU/Linux distributions yet. (Or
whether my implementation is still erroneous.)
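
For reference, a check over such a file can be sketched as follows.
This is not my actual test code; the `my-collation-test-' names are
hypothetical. It assumes the usual layout of those files: each data
line starts with hexadecimal code points separated by spaces,
optionally followed by ";" and a comment, and no line may sort before
the line preceding it.

```elisp
(defun my-collation-test-line-to-string (line)
  "Convert a test LINE such as \"0041 030A; # comment\" to a string.
Return the empty string for blank lines and pure comment lines."
  (let ((hex (or (car (split-string line "[;#]")) "")))
    (apply #'string
           (mapcar (lambda (h) (string-to-number h 16))
                   (split-string hex "[ \t]+" t)))))

(defun my-collation-test-failures (lines lessp)
  "Return the adjacent pairs from LINES which LESSP puts in the wrong order.
Each element is a cons (PREVIOUS . CURRENT) where CURRENT sorts
before PREVIOUS according to LESSP."
  (let ((previous nil)
        (failures nil))
    (dolist (line lines)
      (let ((current (my-collation-test-line-to-string line)))
        (unless (string-equal current "")   ; skip comments and blanks
          (when (and previous (funcall lessp current previous))
            (push (cons previous current) failures))
          (setq previous current))))
    (nreverse failures)))

;; Deliberately unsorted toy input, checked with `string-lessp':
(my-collation-test-failures '("0061" "0062" "# comment" "0061")
                            #'string-lessp)
;; => (("b" . "a"))
```

Passing `gstring-lessp' as LESSP and the lines of the test file as
LINES gives the list of pairs to investigate.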

Best regards, Michael.




