From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.io!.POSTED.blaine.gmane.org!not-for-mail
From: Eli Zaretskii <eliz@gnu.org>
Newsgroups: gmane.emacs.bugs
Subject: bug#58168: string-lessp glitches and inconsistencies
Date: Thu, 06 Oct 2022 14:06:26 +0300
Message-ID: <83wn9dp5xp.fsf@gnu.org>
References: <7824372D-8002-4639-8AEE-E80A6D5FEFC6@gmail.com>
 <83czbef6le.fsf@gnu.org> <6CB805F6-89EE-4D7C-A398-F29698733A42@gmail.com>
 <83h70oce4k.fsf@gnu.org> <B56DE6FE-732D-432D-B2C2-1B54FC8472B1@gmail.com>
 <83tu4mais1.fsf@gnu.org> <BC625893-642A-4B8B-9309-1DCC5E4594B3@gmail.com>
 <83wn9gw2sp.fsf@gnu.org> <E10B6A7A-5517-46AD-B6D5-88A8845736BA@gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Cc: 58168@debbugs.gnu.org
To: Mattias =?UTF-8?Q?Engdeg=C3=A5rd?= <mattias.engdegard@gmail.com>
In-Reply-To: <E10B6A7A-5517-46AD-B6D5-88A8845736BA@gmail.com> (message from
 Mattias =?UTF-8?Q?Engdeg=C3=A5rd?= on Thu, 6 Oct 2022 11:05:04 +0200)
Xref: news.gmane.io gmane.emacs.bugs:244638
Archived-At: <http://permalink.gmane.org/gmane.emacs.bugs/244638>

> From: Mattias Engdegård <mattias.engdegard@gmail.com>
> Date: Thu, 6 Oct 2022 11:05:04 +0200
> Cc: 58168@debbugs.gnu.org
> 
> On 4 Oct 2022, at 07:55, Eli Zaretskii <eliz@gnu.org> wrote:
> 
> > If the fact that string= says strings are not equal, but string-lessp
> > says they are equal, is what bothers you, we could document that
> > results of comparing unibyte and multibyte strings are unspecified, or
> > document explicitly that string= and string-lessp behave differently
> > in this case.
> 
> (It's not just string= but `equal` since they use the same comparison.)
> But it's just a part of a set of related problems:
> 
> * string< / string= inconsistency
> * undesirable string< ordering (unibyte strings are treated as Latin-1)
> * bad string< performance 

That doesn't seem different (and the ordering part is not necessary,
IMO).

> Ideally we should be able to do something about all three at the same time since they are interrelated. At the very least it's worth a try.

It depends on the costs and the risks.  All else being equal, yes,
solving those would be desirable.  But all else isn't equal, and the
costs and the risks of your proposals outweigh the advantages in my
book, sorry.

> Just documenting the annoying parts won't make them go away -- they still have to be coded around by the user, and it doesn't solve any performance problems either.

That's not a catastrophe, because we are already there (sans the
documentation), and because these cases are rare in real life.

> > I see no reason to worry about 100% consistency here: the order
> > is _really_ undefined in these cases, and trying to make it defined
> > will not produce any tangible gains,
> 
> Yes it would: better performance and wider applicability.

These are not tangible enough IMO.

> Even when the order isn't defined the user expects there to be some order between distinct strings.

No, if the order is undefined, the caller cannot expect any order.
Cf. NaN comparisons with numerical values.
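
For illustration, a minimal C sketch of the NaN analogy (the helper
name is_unordered is made up for this example): when one operand is
NaN, none of <, > or == holds, so a caller cannot derive any order from
the results.

```c
#include <math.h>
#include <stdbool.h>

/* Hypothetical helper: true when none of <, > or == holds between
   X and Y, which for doubles happens exactly when at least one of
   them is NaN.  */
static bool
is_unordered (double x, double y)
{
  return !(x < y) && !(x > y) && !(x == y);
}
```

With this, is_unordered (NAN, 1.0) is true while is_unordered (1.0, 2.0)
is false.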

> > Once again, slowing down string-lessp when raw-bytes are involved
> > shouldn't be a problem.  So, if memchr finds a C0 or C1 in a string,
> > fall back to a slower comparison.  memchr is fast enough to not slow
> > down the "usual" case.  Would that be a good solution?
> 
> There is no reason a comparison should need to look beyond the first mismatch; anything else is just algorithmically slow. Long strings are likely to differ early on. Any hack that has to special-case raw bytes will add costs.

You lost me here.  Why are you suddenly talking about mismatches?
And if only mismatches matter here, why is it a problem to use memchr
in the first place?
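
To make the memchr suggestion concrete, here is a rough sketch.  It is
not actual Emacs code: all the function names are hypothetical, and it
assumes that raw bytes in the internal multibyte representation are
stored as two-byte sequences whose lead byte is 0xC0 or 0xC1, so that
finding neither byte means a plain bytewise comparison gives the
character order.  The slower comparison is left to a caller-supplied
function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical: true if the buffer contains a lead byte (0xC0 or
   0xC1) that, by assumption, marks a raw byte.  */
static bool
has_raw_byte_lead (const unsigned char *s, size_t len)
{
  assert (s != NULL || len == 0);
  return len > 0 && (memchr (s, 0xC0, len) != NULL
                     || memchr (s, 0xC1, len) != NULL);
}

/* Bytewise comparison returning -1, 0 or 1.  Meant only for buffers
   without raw bytes.  */
static int
compare_fast (const unsigned char *a, size_t alen,
              const unsigned char *b, size_t blen)
{
  size_t n = alen < blen ? alen : blen;
  int c = n > 0 ? memcmp (a, b, n) : 0;
  if (c != 0)
    return c < 0 ? -1 : 1;
  /* Common prefix is equal: the shorter buffer sorts first.  */
  return (alen > blen) - (alen < blen);
}

typedef int (*compare_fn) (const unsigned char *, size_t,
                           const unsigned char *, size_t);

/* Use SLOW only when either buffer may contain raw bytes.  */
static int
compare_with_fallback (const unsigned char *a, size_t alen,
                       const unsigned char *b, size_t blen,
                       compare_fn slow)
{
  if (has_raw_byte_lead (a, alen) || has_raw_byte_lead (b, blen))
    return slow (a, alen, b, blen);
  return compare_fast (a, alen, b, blen);
}
```

Note that the two memchr calls scan each buffer in full before the
comparison starts, whereas memcmp stops at the first differing byte.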

> > Alternatively, we could introduce a new primitive which could assume
> > multibyte or plain-ASCII unibyte strings without checking, and then
> > code which is sure raw-bytes cannot happen, and needs to compare long
> > strings, could use that for speed.
> 
> That or variants thereof are indeed alternatives, but users would be forgiven for wondering why we don't make what we have fast instead.

Because the fast versions can break when the assumptions are false.
We already have similar stuff in the encoding/decoding area: there are
fast optimized functions that require the caller to make sure some
assumptions hold.

> > E.g., are you saying that unibyte strings that are
> > pure-ASCII also cause performance problems?
> 
> They do because we have no efficient way of ascertaining that they are pure-ASCII.

If we declare that comparing with unibyte non-ASCII strings produces
unspecified results, we don't have to worry about that: it becomes the
worry of the caller.

> The long-term solution is to make multibyte strings the default in more cases but I'm not proposing such a change right now.

I don't think we will ever get there, FWIW.  Raw bytes in strings are
a fact of life, whether we like it or not.

> I'll see where further performance tweaking of the existing code can take us with reasonable effort, but there are hard limits to what can be done.
> And thank you for your comments!

Thanks.