bug#58168: string-lessp glitches and inconsistencies

unofficial mirror of bug-gnu-emacs@gnu.org 

From: "Mattias Engdegård" <mattias.engdegard@gmail.com>
To: Eli Zaretskii <eliz@gnu.org>
Cc: 58168@debbugs.gnu.org
Subject: bug#58168: string-lessp glitches and inconsistencies
Date: Thu, 6 Oct 2022 11:05:04 +0200
Message-ID: <E10B6A7A-5517-46AD-B6D5-88A8845736BA@gmail.com>
In-Reply-To: <83wn9gw2sp.fsf@gnu.org>

On 4 Oct 2022, at 07:55, Eli Zaretskii <eliz@gnu.org> wrote:

> If the fact that string= says strings are not equal, but string-lessp
> says they are equal, is what bothers you, we could document that
> results of comparing unibyte and multibyte strings are unspecified, or
> document explicitly that string= and string-lessp behave differently
> in this case.

(It's not just string= but `equal` since they use the same comparison.)
But it's just a part of a set of related problems:

* string< / string= inconsistency
* undesirable string< ordering (unibyte strings are treated as Latin-1)
* bad string< performance 

Ideally we should be able to do something about all three at the same time, since they are interrelated. At the very least it's worth a try.

Just documenting the annoying parts won't make them go away -- they still have to be coded around by the user, and it doesn't solve any performance problems either.

> I see no reason to worry about 100% consistency here: the order
> is _really_ undefined in these cases, and trying to make it defined
> will not produce any tangible gains,

Yes it would: better performance and wider applicability. Even when the order isn't defined, the user expects there to be some order between distinct strings.

> Once again, slowing down string-lessp when raw-bytes are involved
> shouldn't be a problem.  So, if memchr finds a C0 or C1 in a string,
> fall back to a slower comparison.  memchr is fast enough to not slow
> down the "usual" case.  Would that be a good solution?

There is no reason a comparison should need to look beyond the first mismatch; anything else is just algorithmically slow. Long strings are likely to differ early on. Any hack that has to special-case raw bytes will add costs.
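
To illustrate the cost argument, here is a minimal C sketch. It is not Emacs's actual code, and both function names are hypothetical. It contrasts a comparison that stops at the first differing byte with a memchr pre-scan for the bytes 0xC0 and 0xC1, which has to read the entire string whenever neither byte is present:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: index of the first differing byte, or n if the
   first n bytes are equal.  The work done is proportional to the length
   of the common prefix, not to n.  */
static size_t
first_mismatch (const unsigned char *a, const unsigned char *b, size_t n)
{
  size_t i = 0;
  while (i < n && a[i] == b[i])
    i++;
  return i;
}

/* Hypothetical pre-scan: does the buffer contain a 0xC0 or 0xC1 byte?
   When neither is present (the usual case), both memchr calls read all
   n bytes before the comparison proper can even begin.  */
static int
has_c0_or_c1 (const unsigned char *s, size_t n)
{
  return memchr (s, 0xC0, n) != NULL || memchr (s, 0xC1, n) != NULL;
}
```

For two long strings that differ in their first byte, first_mismatch returns after one iteration, whereas has_c0_or_c1 still traverses the whole buffer.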

The best we can hope for is hand-written vectorised code that does everything in one pass, but it's still slower than just a memcmp.
Even then, our chosen semantics make that more difficult (and slower) than it needs to be: for example, we cannot assume that any byte with the high bit set indicates a mismatch when comparing unibyte strings with multibyte, since we equate unibyte chars with Latin-1. It's a decision that we will keep paying for.
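
A small C sketch of why a high-bit byte cannot be taken as a mismatch under these semantics. The assumptions are that each unibyte byte is read as the Latin-1 character with the same code, and that the multibyte side is well-formed and uses a UTF-8-style encoding. The function names are hypothetical, and the decoder handles only 1- and 2-byte sequences, so this is an illustration rather than the real implementation:

```c
#include <stddef.h>

/* Decode one character from a UTF-8-style multibyte buffer.  Handles
   only 1- and 2-byte sequences (code points below 0x800) and assumes
   well-formed input.  Stores the sequence length in *len.  */
static unsigned
decode_char (const unsigned char *p, size_t *len)
{
  if (p[0] < 0x80)
    {
      *len = 1;
      return p[0];
    }
  *len = 2;
  return ((p[0] & 0x1Fu) << 6) | (p[1] & 0x3Fu);
}

/* Hypothetical comparison of a unibyte string (each byte taken as the
   Latin-1 character with that code) against a multibyte string.
   Returns a negative, zero or positive value, like memcmp.  */
static int
compare_unibyte_multibyte (const unsigned char *uni, size_t uni_len,
                           const unsigned char *multi, size_t multi_bytes)
{
  size_t i = 0, j = 0;
  while (i < uni_len && j < multi_bytes)
    {
      size_t len;
      unsigned c = decode_char (multi + j, &len);
      if (uni[i] != c)
        return uni[i] < c ? -1 : 1;
      i++;
      j += len;
    }
  if (i < uni_len)
    return 1;   /* unibyte string has characters left over */
  if (j < multi_bytes)
    return -1;  /* multibyte string has characters left over */
  return 0;
}
```

Here the unibyte bytes "caf\xE9" and the multibyte bytes "caf\xC3\xA9" differ at the byte level, yet they compare as equal characters, so every high-bit byte has to be decoded before a verdict is possible.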

> Alternatively, we could introduce a new primitive which could assume
> multibyte or plain-ASCII unibyte strings without checking, and then
> code which is sure raw-bytes cannot happen, and needs to compare long
> strings, could use that for speed.

That, or variants thereof, are indeed alternatives, but users could be forgiven for wondering why we don't make what we have fast instead.

> E.g., are you saying that unibyte strings that are
> pure-ASCII also cause performance problems?

They do because we have no efficient way of ascertaining that they are pure-ASCII. The long-term solution is to make multibyte strings the default in more cases but I'm not proposing such a change right now.
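
As a rough illustration of that cost, here is a C sketch of the kind of check that is needed when nothing about the string's contents is known in advance. The function name is hypothetical and this is not Emacs's code; the point is only that the whole buffer has to be read:

```c
#include <stddef.h>

/* Hypothetical check: return 1 if all n bytes are ASCII (high bit
   clear), 0 otherwise.  OR-ing the bytes together avoids a branch per
   byte, but the loop still reads the whole buffer, so the cost is
   linear in n.  */
static int
is_pure_ascii (const unsigned char *s, size_t n)
{
  unsigned char acc = 0;
  for (size_t i = 0; i < n; i++)
    acc |= s[i];
  return (acc & 0x80) == 0;
}
```

Running such a scan before every comparison adds a full pass over the string even when the comparison itself would stop at the first byte.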

I'll see where further performance tweaking of the existing code can take us with reasonable effort, but there are hard limits to what can be done.
And thank you for your comments!
