From mboxrd@z Thu Jan 1 00:00:00 1970
From: Eli Zaretskii
Newsgroups: gmane.emacs.bugs
Subject: bug#58168: string-lessp glitches and inconsistencies
Date: Thu, 06 Oct 2022 14:06:26 +0300
Message-ID: <83wn9dp5xp.fsf@gnu.org>
References: <7824372D-8002-4639-8AEE-E80A6D5FEFC6@gmail.com> <83czbef6le.fsf@gnu.org> <6CB805F6-89EE-4D7C-A398-F29698733A42@gmail.com> <83h70oce4k.fsf@gnu.org> <83tu4mais1.fsf@gnu.org> <83wn9gw2sp.fsf@gnu.org>
In-Reply-To: (message from Mattias =?UTF-8?Q?Engdeg=C3=A5rd?= on Thu, 6 Oct 2022 11:05:04 +0200)
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
To: Mattias =?UTF-8?Q?Engdeg=C3=A5rd?=
Cc: 58168@debbugs.gnu.org
List-Id: "Bug reports for GNU Emacs, the Swiss army knife of text editors"
Xref: news.gmane.io gmane.emacs.bugs:244638

> From: Mattias Engdegård
> Date: Thu, 6 Oct 2022 11:05:04 +0200
> Cc: 58168@debbugs.gnu.org
>
> On 4 Oct 2022 at 07:55, Eli Zaretskii wrote:
>
> > If the fact that string= says strings are not equal, but string-lessp
> > says they are equal, is what bothers you, we could document that
> > results of comparing unibyte and multibyte strings are unspecified, or
> > document explicitly that string= and string-lessp behave differently
> > in this case.
>
> (It's not just string= but `equal` since they use the same comparison.)
> But it's just a part of a set of related problems:
>
> * string< / string= inconsistency
> * undesirable string< ordering (unibyte strings are treated as Latin-1)
> * bad string< performance

That doesn't seem different (and the ordering part is not necessary,
IMO).

> Ideally we should be able to do something about all three at the same time since they are interrelated. At the very least it's worth a try.

It depends on the costs and the risks. All the rest being equal, yes,
solving those would be desirable. But it isn't equal, and the costs
and the risks of your proposals outweigh the advantages in my book,
sorry.

> Just documenting the annoying parts won't make them go away -- they still have to be coded around by the user, and it doesn't solve any performance problems either.

That's not a catastrophe, because we are already there (sans the
documentation), and because these cases are rare in real life.

> > I see no reason to worry about 100% consistency here: the order
> > is _really_ undefined in these cases, and trying to make it defined
> > will not produce any tangible gains,
>
> Yes it would: better performance and wider applicability.

These are not tangible enough IMO.

> Even when the order isn't defined the user expects there to be some order between distinct strings.

No, if the order is undefined, the caller cannot expect any order.
Cf. NaN comparisons with numerical values.

> > Once again, slowing down string-lessp when raw-bytes are involved
> > shouldn't be a problem. So, if memchr finds a C0 or C1 in a string,
> > fall back to a slower comparison. memchr is fast enough to not slow
> > down the "usual" case. Would that be a good solution?
>
> There is no reason a comparison should need to look beyond the first mismatch; anything else is just algorithmically slow. Long strings are likely to differ early on. Any hack that has to special-case raw bytes will add costs.

You lost me here. Why are you suddenly talking about mismatches? And
if only mismatches matter here, why is it a problem to use memchr in
the first place?

> > Alternatively, we could introduce a new primitive which could assume
> > multibyte or plain-ASCII unibyte strings without checking, and then
> > code which is sure raw-bytes cannot happen, and needs to compare long
> > strings, could use that for speed.
>
> That or variants thereof are indeed alternatives but users could be forgiven for wondering why we don't make what we have fast instead.

Because the fast versions can break when the assumptions are false.
We already have similar stuff in the encoding/decoding area: there are
fast optimized functions that require the caller to make sure some
assumptions hold.

> > E.g., are you saying that unibyte strings that are
> > pure-ASCII also cause performance problems?
>
> They do because we have no efficient way of ascertaining that they are pure-ASCII.

If we declare that comparing with unibyte non-ASCII strings produces
unspecified results, we don't have to worry about that: it becomes the
worry of the caller.

> The long-term solution is to make multibyte strings the default in more cases but I'm not proposing such a change right now.

I don't think we will ever get there, FWIW. Raw bytes in strings are
a fact of life, whether we like it or not.

> I'll see where further performance tweaking of the existing code can take us with reasonable effort, but there are hard limits to what can be done.
> And thank you for your comments!

Thanks.
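
P.S. For concreteness, the memchr idea discussed above could look
roughly like the sketch below. It is only an illustration, not code
from the Emacs sources: the names has_raw_byte_lead, byte_key and
sketch_compare are made up, and it assumes that (a) both arguments are
already in the internal multibyte representation, (b) raw bytes are
encoded there as two-byte sequences whose lead byte is C0 or C1, and
(c) the wanted order is by character code, with raw bytes sorting
after everything else.

```c
#include <stddef.h>
#include <string.h>

/* Nonzero if S (N bytes) contains a C0 or C1 byte.  Assumption: in
   the internal multibyte encoding only raw bytes use these values,
   and only as lead bytes.  */
int
has_raw_byte_lead (const unsigned char *s, size_t n)
{
  return memchr (s, 0xC0, n) != NULL || memchr (s, 0xC1, n) != NULL;
}

/* Sort key for one byte on the slow path: C0/C1 lead bytes are moved
   above every other byte value, so raw bytes sort last.  */
int
byte_key (unsigned char c)
{
  return (c == 0xC0 || c == 0xC1) ? 0x100 + c : c;
}

/* Hypothetical comparison: return <0, 0 or >0 as A sorts before,
   equal to, or after B.  */
int
sketch_compare (const unsigned char *a, size_t alen,
                const unsigned char *b, size_t blen)
{
  size_t n = alen < blen ? alen : blen;

  if (!has_raw_byte_lead (a, alen) && !has_raw_byte_lead (b, blen))
    {
      /* Usual case: no raw bytes, plain byte order is good enough.  */
      int r = memcmp (a, b, n);
      if (r != 0)
        return r;
    }
  else
    {
      /* Slower fallback: stop at the first mismatch and compare the
         adjusted keys of the two differing bytes.  */
      for (size_t i = 0; i < n; i++)
        if (a[i] != b[i])
          return byte_key (a[i]) - byte_key (b[i]);
    }

  /* Common prefix is identical: the shorter string sorts first.  */
  return (alen > blen) - (alen < blen);
}
```

Note that the two memchr passes scan each string in full before the
comparison starts, which is exactly the cost being argued about above
for long strings that differ early on.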