From: Eli Zaretskii
To: Mattias Engdegård
Cc: 58168@debbugs.gnu.org
Newsgroups: gmane.emacs.bugs
Subject: bug#58168: string-lessp glitches and inconsistencies
Date: Sat, 01 Oct 2022 08:22:03 +0300
Message-ID: <83h70oce4k.fsf@gnu.org>
In-Reply-To: <6CB805F6-89EE-4D7C-A398-F29698733A42@gmail.com> (message from Mattias Engdegård on Fri, 30 Sep 2022 22:04:47 +0200)
References: <7824372D-8002-4639-8AEE-E80A6D5FEFC6@gmail.com> <83czbef6le.fsf@gnu.org> <6CB805F6-89EE-4D7C-A398-F29698733A42@gmail.com>

> From: Mattias Engdegård
> Date: Fri, 30 Sep 2022 22:04:47 +0200
> Cc: 58168@debbugs.gnu.org
>
> On 29 Sep 2022 at 19:11, Eli Zaretskii wrote:
>
> > Unibyte strings should never be compared with
> > multibyte, unless they are both pure-ASCII.
>
> It's perfectly fine to compare "Madrid" (unibyte) with "Málaga"
> (non-ASCII multibyte).

Not relevant: I meant unibyte non-ASCII strings. The ASCII case is
easy and unproblematic, and is really just a straw man here.

> If you mean that all strings (literals in particular) should be
> multibyte by default then I agree and at some point we should take
> that step, but it would be quite a breaking change. Perhaps less in
> practice than we fear, though...

That's not what I meant. I think unibyte strings are with us for the
foreseeable future.

> > Unibyte characters don't belong to this order. They
> > should be converted to multibyte representation to be sensibly
> > comparable.
>
> Oh I agree to some extent but we can't really raise an error if
> someone tries so we might as well return something reasonable and
> coherent.

It depends on the use case, but in general I see no problem with
signaling errors when we cannot produce reasonably correct results.
For example, string-to-unibyte does signal an error in some cases.

> Besides, there are more good reasons for ordering strings (both
> multibyte and unibyte) than might be apparent at first.

Examples, please.

> Working from the assumption that we can't change string= to equate
> raw bytes in unibyte and multibyte strings, we need to invent an
> order between normally incommensurate values

I don't agree with the conclusion. It is not the only possible
conclusion. Signaling an error is another one, and I'm sure we could
think of more.

> It's also a matter of performance -- string< has been improved
> recently but currently we compare text in Latin and Swahili much
> faster than French and Arabic; it would be nice to close that gap.
> UTF-8 is designed so that comparing strings by scalar values can be
> done byte-wise, but the way we encode raw bytes makes them sort right
> between ASCII and Latin-1. Given that the specific order doesn't
> matter much, we could just run with that.

I see no reason to make comparison of unibyte and multibyte strings
perform better.
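
For reference, the UTF-8 property mentioned in the quoted text (ordering by scalar value coincides with byte-wise ordering of the encoded strings) can be checked with a short Python sketch. This is only an illustration, not Emacs code. The helper names `utf8_key`, `codepoint_key` and `raw_byte_internal` are hypothetical, and the two-byte internal form used for raw bytes (lead byte 0xC0 or 0xC1) is an assumption about the internal representation that this thread does not spell out.

```python
# Sketch: UTF-8 byte order agrees with code point (scalar value) order.


def utf8_key(s):
    """Sort key: the UTF-8 encoding of s, compared byte by byte."""
    return s.encode("utf-8")


def codepoint_key(s):
    """Sort key: the sequence of Unicode scalar values in s."""
    return tuple(ord(c) for c in s)


def raw_byte_internal(b):
    """Hypothetical helper: internal two-byte form of a raw byte.

    ASSUMPTION (not stated in this thread): a raw byte in 0x80..0xFF is
    stored as a two-byte sequence whose lead byte is 0xC0 or 0xC1.
    """
    if not 0x80 <= b <= 0xFF:
        raise ValueError("raw bytes are in the range 0x80..0xFF")
    return bytes([0xC0 | ((b >> 6) & 0x01), 0x80 | (b & 0x3F)])


words = ["Madrid", "Málaga", "Zürich", "zebra", "日本", "\U0001F600", "é", "~"]

# Sorting by encoded bytes and sorting by scalar values give the same order.
by_bytes = sorted(words, key=utf8_key)
by_codepoints = sorted(words, key=codepoint_key)
same_order = by_bytes == by_codepoints

# Under the assumption above, every raw byte compares greater than all of
# ASCII (highest byte 0x7F) and less than U+0080, the first non-ASCII
# character, whose UTF-8 encoding starts with lead byte 0xC2.
first_non_ascii = "\u0080".encode("utf-8")
raw_between = all(
    b"\x7f" < raw_byte_internal(b) < first_non_ascii
    for b in range(0x80, 0x100)
)
```

Both `same_order` and `raw_between` are meant to hold here; the second one only says something about Emacs to the extent that the assumed encoding matches the real one.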