From: "Mattias Engdegård" <mattias.engdegard@gmail.com>
To: Eli Zaretskii <eliz@gnu.org>
Cc: 58168@debbugs.gnu.org
Subject: bug#58168: string-lessp glitches and inconsistencies
Date: Thu, 6 Oct 2022 11:05:04 +0200
Message-ID: <E10B6A7A-5517-46AD-B6D5-88A8845736BA@gmail.com>
In-Reply-To: <83wn9gw2sp.fsf@gnu.org>
On 4 Oct 2022, at 07:55, Eli Zaretskii <eliz@gnu.org> wrote:
> If the fact that string= says strings are not equal, but string-lessp
> says they are equal, is what bothers you, we could document that
> results of comparing unibyte and multibyte strings are unspecified, or
> document explicitly that string= and string-lessp behave differently
> in this case.
(It's not just string= but `equal` since they use the same comparison.)
But it's just a part of a set of related problems:
* string< / string= inconsistency
* undesirable string< ordering (unibyte strings are treated as Latin-1)
* bad string< performance
Ideally we should be able to do something about all three at the same time since they are interrelated. At the very least it's worth a try.
Just documenting the annoying parts won't make them go away -- they still have to be coded around by the user, and it doesn't solve any performance problems either.
> I see no reason to worry about 100% consistency here: the order
> is _really_ undefined in these cases, and trying to make it defined
> will not produce any tangible gains,
Yes it would: better performance and wider applicability. Even when the order isn't defined the user expects there to be some order between distinct strings.
> Once again, slowing down string-lessp when raw-bytes are involved
> shouldn't be a problem. So, if memchr finds a C0 or C1 in a string,
> fall back to a slower comparison. memchr is fast enough to not slow
> down the "usual" case. Would that be a good solution?
There is no reason a comparison should need to look beyond the first mismatch; anything else is just algorithmically slow. Long strings are likely to differ early on. Any hack that has to special-case raw bytes will add costs.
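To make the cost argument concrete, here is a minimal sketch of a comparison that never looks past the first mismatch. It is not Emacs code: `bytes_lessp` is a hypothetical helper that orders two buffers purely bytewise, and it deliberately ignores the raw-byte and unibyte/multibyte questions discussed in this thread.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of Emacs: return true if buffer A
   sorts strictly before buffer B in plain bytewise order.  */
bool
bytes_lessp (const unsigned char *a, size_t alen,
             const unsigned char *b, size_t blen)
{
  /* Compare only the common prefix; memcmp decides the result
     at the first differing byte.  */
  size_t n = alen < blen ? alen : blen;
  int c = memcmp (a, b, n);
  if (c != 0)
    return c < 0;
  /* Common prefix is identical: the shorter string sorts first.  */
  return alen < blen;
}
```

Any scheme that first scans both strings for special bytes does extra work on top of this, even when the strings differ at their first byte.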
The best we can hope for is hand-written vectorised code that does everything in one pass, but even that would still be slower than a plain memcmp.
Even then our chosen semantics make that more difficult (and slower) than it needs to be: for example, we cannot assume that any byte with the high bit set indicates a mismatch when comparing unibyte strings with multibyte, since we equate unibyte chars with Latin-1. It's a decision that we will keep paying for.
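As an illustration of why a high-bit byte cannot be treated as an automatic mismatch, the sketch below checks whether one unibyte byte, read as a Latin-1 character, equals the character at a position in a UTF-8 encoded buffer. The function name `latin1_byte_matches_utf8` and its signature are assumptions made for this example; it is not how Emacs implements the comparison.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of Emacs: does byte B, interpreted as
   the Latin-1 character with the same code point, equal the character
   encoded at P?  AVAIL is the number of bytes readable at P.  */
bool
latin1_byte_matches_utf8 (unsigned char b, const unsigned char *p,
                          size_t avail)
{
  if (avail == 0)
    return false;
  if (b < 0x80)
    /* ASCII is a single identical byte in both representations.  */
    return p[0] == b;
  /* U+0080..U+00FF take two bytes in UTF-8: 110000xx 10xxxxxx.  */
  return avail >= 2
         && p[0] == (0xC0 | (b >> 6))
         && p[1] == (0x80 | (b & 0x3F));
}
```

For example, the unibyte byte 0xE9 matches the two-byte sequence C3 A9 (the character é), so the two strings agree at that position even though their bytes differ.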
> Alternatively, we could introduce a new primitive which could assume
> multibyte or plain-ASCII unibyte strings without checking, and then
> code which is sure raw-bytes cannot happen, and needs to compare long
> strings, could use that for speed.
That, or variants thereof, are indeed alternatives, but users could be forgiven for wondering why we don't make what we have fast instead.
> E.g., are you saying that unibyte strings that are
> pure-ASCII also cause performance problems?
They do, because we have no efficient way of ascertaining that they are pure-ASCII. The long-term solution is to make multibyte strings the default in more cases, but I'm not proposing such a change right now.
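A minimal sketch of what such a check costs when nothing is cached, assuming the only way to learn whether a string is pure-ASCII is to look at its bytes. `all_ascii_p` is a hypothetical name used for illustration, not an Emacs function.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of Emacs: return true if no byte in
   S[0..LEN) has its high bit set.  */
bool
all_ascii_p (const unsigned char *s, size_t len)
{
  for (size_t i = 0; i < len; i++)
    if (s[i] & 0x80)
      return false;   /* Found a non-ASCII byte; stop early.  */
  return true;        /* Had to inspect every byte to be sure.  */
}
```

For a string that really is pure-ASCII, the loop must visit every byte before it can answer, so the check is linear in the string length no matter how early the two strings being compared actually differ.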
I'll see how far further performance tweaking of the existing code can take us with reasonable effort, but there are hard limits to what can be done.
And thank you for your comments!