From: Lars Ingebrigtsen
Newsgroups: gmane.emacs.bugs
Subject: bug#58168: string-lessp glitches and inconsistencies
Date: Fri, 30 Sep 2022 15:52:12 +0200
Message-ID: <877d1l55rn.fsf@gnus.org>
References: <7824372D-8002-4639-8AEE-E80A6D5FEFC6@gmail.com>
To: Mattias Engdegård
Cc: 58168@debbugs.gnu.org

Mattias Engdegård writes:

> We really want string< to be consistent with string= and itself since
> this is fundamental for string ordering in searching and sorting
> applications.
> This means that for any pair of strings A and B, we should either have
> A<B, B<A or A=B.
>
> Unfortunately:
>
>   (let* ((a "ü")
>          (b "\xfc"))
>     (list (string= a b)
>           (string< a b)
>           (string< b a)))
>   => (nil nil nil)
>
> because string< considers the unibyte raw byte 0xFC and the multibyte
> char U+00FC to be the same, but string= thinks they are different.

You also have

(string 4194176)
=> "\200"

"\x80"
=> "\200"

which are kinda equal in some ways, and not in other ways.

> It suggests the following alternative collation orders:
>
> A. ASCII < ub raw 80..FF < mb U+0080..10FFFF < mb raw 80..FF
>
> which puts all non-ASCII multibyte chars after unibyte.
>
> B. ASCII < ub raw 80..FF < mb raw 80..FF < mb U+0080..10FFFF
>
> which inserts multibyte raw bytes after the unibyte ones, permitting
> any ub-ub and mb-mb comparisons to be made using memcmp, and a slow
> decoding loop only required for unibyte against non-ASCII multibyte
> strings.
>
> C. ASCII < mb U+0080..10FFFF < mb raw 80..FF < ub raw 80..FF
>
> which instead moves unibyte raw bytes to after the multibyte raw
> range. This has the same memcmp benefit as alternative B, but may be
> slightly faster for ub-mb comparisons since only unibyte 80..FF need
> to be remapped.

I think A makes the most intuitive sense, somehow. But perhaps my
intuition is off.
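
To make ordering A concrete, here is a rough Lisp model of it. It is
only a sketch, not the proposed change to string< itself: the
`sketch-' functions are hypothetical names that do not exist in Emacs,
and it assumes the code is read as UTF-8 so that "ü" is a multibyte
string. Each character gets an integer key, and non-ASCII characters
of multibyte strings are shifted up by #x80 so they sort after the
unibyte raw bytes.

```elisp
;; Sketch of ordering A:
;;   ASCII < ub raw 80..FF < mb U+0080..10FFFF < mb raw 80..FF
;; The `sketch-' names are hypothetical and not part of Emacs.

(defun sketch-order-a-keys (s)
  "Return a list with one integer sort key per character of S.
Characters of a unibyte string (ASCII and raw bytes 80..FF) keep their
values 0..FF.  Non-ASCII characters of a multibyte string are shifted
up by #x80 so they sort after every unibyte raw byte; multibyte raw
bytes (#x3FFF80..#x3FFFFF) already follow all other characters."
  (let ((multibyte (multibyte-string-p s)))
    (mapcar (lambda (c)
              (if (and multibyte (>= c #x80))
                  (+ c #x80)
                c))
            s)))

(defun sketch-order-a< (a b)
  "Return non-nil if string A sorts strictly before B under ordering A."
  (let ((ka (sketch-order-a-keys a))
        (kb (sketch-order-a-keys b)))
    ;; Skip the common prefix of the two key lists.
    (while (and ka kb (= (car ka) (car kb)))
      (setq ka (cdr ka)
            kb (cdr kb)))
    (cond ((null kb) nil)   ; B exhausted: A is equal to or longer than B
          ((null ka) t)     ; A is a proper prefix of B
          (t (< (car ka) (car kb))))))

;; The pair from the report: multibyte U+00FC against unibyte raw byte FC.
(let ((a "ü")
      (b "\xfc"))
  (list (string= a b)
        (sketch-order-a< a b)
        (sketch-order-a< b a)))
;; => (nil nil t)
```

With this model the two strings from the report are still not string=,
but exactly one of the two orderings holds, which is the consistency
being asked for. Orderings B and C would differ only in which
constant offsets are applied to which class of character.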