From: Eli Zaretskii
To: Mattias Engdegård
Cc: 58168@debbugs.gnu.org
Newsgroups: gmane.emacs.bugs
Subject: bug#58168: string-lessp glitches and inconsistencies
Date: Thu, 29 Sep 2022 20:11:57 +0300
Message-ID: <83czbef6le.fsf@gnu.org>
In-Reply-To: <7824372D-8002-4639-8AEE-E80A6D5FEFC6@gmail.com>

> From: Mattias Engdegård
> Date: Thu, 29 Sep 2022 18:24:04 +0200
>
> We really want string< to be consistent with string= and itself since
> this is fundamental for string ordering in searching and sorting
> applications.
> This means that for any pair of strings A and B, we should either
> have A<B, B<A or A=B.
>
> Unfortunately:
>
>   (let* ((a "ü")
>          (b "\xfc"))
>     (list (string= a b)
>           (string< a b)
>           (string< b a)))
>   => (nil nil nil)
>
> because string< considers the unibyte raw byte 0xFC and the multibyte
> char U+00FC to be the same, but string= thinks they are different.

Why do we care?  Unibyte strings should never be compared with
multibyte, unless they are both pure-ASCII.

> So, what can be done? The current string< implementation uses the
> character order
>
>   ASCII < ub raw 80..FF = mb U+0080..U+00FF < U+0100..10FFFF < mb raw 80..FF
>
> in conflict with string= which unifies unibyte and multibyte ASCII
> but not raw bytes and Latin-1.

It would be unimaginable to unify raw bytes with Latin-1.  Raw bytes
are not Latin-1 characters, they can stand for any characters, or for
no characters at all.

> It suggests the following alternative collation orders:
>
> A. ASCII < ub raw 80..FF < mb U+0080..10FFFF < mb raw 80..FF
>
>    which puts all non-ASCII multibyte chars after unibyte.
>
> B. ASCII < ub raw 80..FF < mb raw 80..FF < mb U+0080..10FFFF
>
>    which inserts multibyte raw bytes after the unibyte ones,
>    permitting any ub-ub and mb-mb comparisons to be made using memcmp,
>    and a slow decoding loop only required for unibyte against
>    non-ASCII multibyte strings.
>
> C. ASCII < mb U+0080..10FFFF < mb raw 80..FF < ub raw 80..FF

Neither, IMNSHO.  Unibyte characters don't belong to this order.  They
should be converted to multibyte representation to be sensibly
comparable.

> Otherwise, I'll go with B or C, depending on what the resulting code
> looks like.

Please don't.  Let's first decide that we want to change this, and
what are the reasons for that.  Theoretical "impurity" doesn't count,
IMO.
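
To illustrate what converting to multibyte representation before comparing could look like, here is a minimal sketch.  It assumes the unibyte string is known to hold Latin-1 encoded text; that is an assumption made only for this example, since raw bytes in general carry no such guarantee and the right coding system depends on where the bytes came from.

```elisp
;; Sketch only.  Assumption: the unibyte string holds Latin-1 encoded
;; text; for raw bytes of unknown origin there is no "right" decoding.
(let* ((a "\u00fc")                ; multibyte string holding U+00FC
       (b "\xfc")                  ; unibyte string holding raw byte 0xFC
       (b-decoded (decode-coding-string b 'iso-latin-1)))
  (list (multibyte-string-p a)     ; A is multibyte
        (multibyte-string-p b)     ; B is unibyte
        (string= a b-decoded)      ; equal once B has been decoded
        (string< a b-decoded)
        (string< b-decoded a)))
;; => (t nil t nil nil)
```

Once both strings are multibyte and the bytes have been given a meaning, string= and string< agree with each other: the strings are equal and neither sorts before the other.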