From: Michael Albinus
Newsgroups: gmane.emacs.bugs
Subject: bug#18051: 24.3.92; ls-lisp: Sorting; make ls-lisp-string-lessp a normal function?
Date: Sat, 16 Aug 2014 23:52:16 +0200
Message-ID: <877g28w19r.fsf@gmx.de>
In-Reply-To: <87egxggigj.fsf@gmx.de> (Michael Albinus's message of "Sun, 20 Jul 2014 17:26:04 +0200")
To: Eli Zaretskii
Cc: michael_heerdegen@web.de, 18051@debbugs.gnu.org

Michael Albinus writes:

>>> On systems without glib, we might emulate it partially. Packages
>>> like ls-lisp could use it then for sorting.
>>
>> I think we need our own implementation in any case. If nothing else,
>> that would solve the issue of encoding strings into UTF-8 before
>> calling external C functions.
>
> Yep. But given the complexity of UCA, we will start slowly with a subset
> of the algorithm only. This and performance considerations will still
> demand for a native C library, if available.

Just out of curiosity, I've taken g_utf8_collate from glib for a test.
It doesn't work badly.

I have added two functions `gstring-lessp' and `gstring-equalp', which
are meant to be the collation counterparts of `string-lessp' and
`string-equal'. Here are some tests, taken from UTS#10, chapter 1.1
"Multi-Level Comparison":

--8<---------------cut here---------------start------------->8---
(sort '("role" "roles" "rule") 'string-lessp)
=> ("role" "roles" "rule")

(sort '("role" "roles" "rule") 'gstring-lessp)
=> ("role" "roles" "rule")
--8<---------------cut here---------------end--------------->8---

No surprise that they return the same result: this is a level 1
comparison, where just base characters are compared.
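For anyone who wants to try locale-aware sorting outside Emacs, here is a minimal sketch in plain C. It assumes the standard strcoll as a stand-in for g_utf8_collate, so it is not the code used for the tests in this message, and the names collate_cmp and sort_in_locale are made up for illustration.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator: order two C strings by the current LC_COLLATE rules.
   strcoll is used here as an assumed stand-in for g_utf8_collate. */
int collate_cmp(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;
    assert(*sa != NULL && *sb != NULL);
    return strcoll(*sa, *sb);
}

/* Hypothetical helper: sort `items` in place under the collation rules
   of `locale`.  Returns 0 on success, or -1 if the locale cannot be
   selected (for example because it is not installed). */
int sort_in_locale(const char *locale, const char **items, size_t n)
{
    if (setlocale(LC_COLLATE, locale) == NULL)
        return -1;
    qsort(items, n, sizeof items[0], collate_cmp);
    return 0;
}
```

Calling sort_in_locale("C", words, 3) on the array {"rule", "roles", "role"} should leave it as role, roles, rule, because in the "C" locale strcoll compares like strcmp. Locale names such as en_US.utf8 or da_DK.utf8 only work where those locales are installed; otherwise the helper returns -1.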

--8<---------------cut here---------------start------------->8---
(sort '("role" "rôle" "roles") 'string-lessp)
=> ("role" "roles" "rôle")

(sort '("role" "rôle" "roles") 'gstring-lessp)
=> ("role" "rôle" "roles")
--8<---------------cut here---------------end--------------->8---

Accent differences are typically ignored in collation if the base
letters differ. And so on, with further tests applied from there ...

The collation rules can even be influenced by the locale environment.
The following example is taken from ISO 14651:2011, appendix D.3. If
LC_COLLATE is set to C.utf8, `string-lessp' and `gstring-lessp' behave
the same:

--8<---------------cut here---------------start------------->8---
(sort '("Alzheimer" "czar" "cæsium" "cølibat" "Aachen" "Aalborg" "Århus")
      'string-lessp)
=> ("Aachen" "Aalborg" "Alzheimer" "czar" "cæsium" "cølibat" "Århus")

(sort '("Alzheimer" "czar" "cæsium" "cølibat" "Aachen" "Aalborg" "Århus")
      'gstring-lessp)
=> ("Aachen" "Aalborg" "Alzheimer" "czar" "cæsium" "cølibat" "Århus")
--8<---------------cut here---------------end--------------->8---

When I set LC_COLLATE to en_US.utf8, accent differences are ignored
again:

--8<---------------cut here---------------start------------->8---
(sort '("Alzheimer" "czar" "cæsium" "cølibat" "Aachen" "Aalborg" "Århus")
      'gstring-lessp)
=> ("Aachen" "Aalborg" "Alzheimer" "Århus" "cæsium" "cølibat" "czar")
--8<---------------cut here---------------end--------------->8---

But with LC_COLLATE set to da_DK.utf8, the order differs, because "cz"
is less than "cæ", and "aa" is equivalent to "å" but greater than "z":

--8<---------------cut here---------------start------------->8---
(sort '("Alzheimer" "czar" "cæsium" "cølibat" "Aachen" "Aalborg" "Århus")
      'gstring-lessp)
=> ("Alzheimer" "czar" "cæsium" "cølibat" "Aachen" "Aalborg" "Århus")
--8<---------------cut here---------------end--------------->8---

Well, for practical use cases it seems worthwhile to include
g_utf8_collate in Emacs. Of course, it could be used only when glib is
linked, so we might still need our own Lisp implementation.

I don't know how well g_utf8_collate works for non-Latin characters,
though. And the test files CollationTest_NON_IGNORABLE.txt and
CollationTest_SHIFTED.txt from UTS#10 do not pass completely. I have no
idea whether this is due to a limitation of g_utf8_collate, or because
I have taken the latest Unicode 7.0.0 test files, which might include
tests that haven't reached GNU/Linux distributions yet. (Or whether my
implementation is still erroneous.)

Best regards, Michael.