From mboxrd@z Thu Jan 1 00:00:00 1970
From: Eli Zaretskii
Newsgroups: gmane.emacs.bugs
Subject: bug#58168: string-lessp glitches and inconsistencies
Date: Sun, 02 Oct 2022 08:36:46 +0300
Message-ID: <83tu4mais1.fsf@gnu.org>
References: <7824372D-8002-4639-8AEE-E80A6D5FEFC6@gmail.com> <83czbef6le.fsf@gnu.org> <6CB805F6-89EE-4D7C-A398-F29698733A42@gmail.com> <83h70oce4k.fsf@gnu.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Cc: 58168@debbugs.gnu.org
To: Mattias Engdegård
X-GNU-PR-Message: followup 58168
X-GNU-PR-Package: emacs
In-Reply-To:
 (message from Mattias Engdegård on Sat, 1 Oct 2022 21:57:45 +0200)
List-Id: "Bug reports for GNU Emacs, the Swiss army knife of text editors"
Xref: news.gmane.io gmane.emacs.bugs:244174

> From: Mattias Engdegård
> Date: Sat, 1 Oct 2022 21:57:45 +0200
> Cc: 58168@debbugs.gnu.org
>
> On 1 Oct 2022, at 07:22, Eli Zaretskii wrote:
>
> > It depends on the use case, but in general I see no problem with
> > signaling errors when we cannot produce reasonably correct results.
> > For example, string-to-unibyte does signal an error in some cases.
>
> That's fine because that function is documented to do so and always has, but making previously possible comparisons raise errors shouldn't be done lightly.

I didn't say "lightly", nor do I think so.  We need to discuss
specific use cases.

An alternative is to always convert unibyte non-ASCII strings to their
multibyte representation before comparing.

> Comparison between objects is not only useful when someone cares about their order, as in presenting a sorted list to the user. Often what is important is an ability to impose an order, preferably total, for use in building and searching data structures. I came across this bug when implementing a string set.

Always converting to multibyte handles this case, doesn't it?

> >> It's also a matter of performance -- string< has been improved recently but currently we compare text in Latin and Swahili much faster than French and Arabic; it would be nice to close that gap. UTF-8 is designed so that comparing strings by scalar values can be done byte-wise, but the way we encode raw bytes makes them sort right between ASCII and Latin-1.
> >> Given that the specific order doesn't matter much, we could just run with that.
> >
> > I see no reason to make comparison of unibyte and multibyte strings
> > perform better.
>
> Actually I was talking about multibyte-multibyte comparisons.

Then why did you mention raw bytes?  Their multibyte representation
presents no performance problems, AFAIU.

> You were probably thinking about comparisons between unibyte strings that contain raw bytes and multibyte strings, and those are indeed not very performance-sensitive. However there is no way to detect whether a unibyte string contains non-ASCII chars without looking at every byte, and comparing unibyte ASCII with multibyte is definitely of interest. Strings are still unibyte by default.

You can compare under the assumption that a unibyte string is
pure-ASCII until you bump into the first non-ASCII byte.  If that
happens, abandon the comparison, convert the unibyte string to its
multibyte representation, and compare again.
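
The optimistic comparison described above can be sketched roughly as
follows.  This is a Python model, not the Emacs C code, and it rests on
stated assumptions: a unibyte string is modeled as `bytes`, a multibyte
string as a list of character codes, and the helper names
`to_multibyte` and `less_unibyte_multibyte` are hypothetical.  The one
Emacs fact relied on is that a raw byte b in 0x80..0xFF is represented
in a multibyte string as the character 0x3FFF00 + b.

```python
from typing import List

# Emacs represents raw byte b (0x80..0xFF) as the character 0x3FFF00 + b.
RAW_BYTE_OFFSET = 0x3FFF00


def to_multibyte(unibyte: bytes) -> List[int]:
    """Hypothetical model of string-to-multibyte: ASCII bytes are kept,
    every other byte becomes its raw-byte character."""
    return [b if b < 0x80 else RAW_BYTE_OFFSET + b for b in unibyte]


def less_unibyte_multibyte(uni: bytes, multi: List[int]) -> bool:
    """Return True if the unibyte string sorts before the multibyte one,
    ordering by character code."""
    # Fast path: assume `uni` is pure ASCII and compare element by element.
    for b, c in zip(uni, multi):
        if b >= 0x80:
            # The assumption failed: abandon this pass, convert the
            # unibyte string, and compare again from the start.
            return to_multibyte(uni) < multi
        if b != c:
            return b < c
    # One string is a prefix of the other; the shorter one sorts first.
    return len(uni) < len(multi)


# Pure-ASCII unibyte input never leaves the fast path.
print(less_unibyte_multibyte(b"abc", [0x61, 0x62, 0x64]))  # True
# A raw byte triggers the convert-and-retry path.
print(less_unibyte_multibyte(b"a\xff", [0x61, 0xE9]))      # False
```

In this model the retry costs at most one extra pass over the unibyte
string, and only when a non-ASCII byte is reached before the first
difference, so the common unibyte-ASCII case pays nothing for it.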