bug#58168: string-lessp glitches and inconsistencies

unofficial mirror of bug-gnu-emacs@gnu.org 
 help / color / mirror / code / Atom feed

From: Eli Zaretskii <eliz@gnu.org>
To: "Mattias Engdegård" <mattias.engdegard@gmail.com>
Cc: 58168@debbugs.gnu.org
Subject: bug#58168: string-lessp glitches and inconsistencies
Date: Tue, 04 Oct 2022 08:55:34 +0300	[thread overview]
Message-ID: <83wn9gw2sp.fsf@gnu.org> (raw)
In-Reply-To: <BC625893-642A-4B8B-9309-1DCC5E4594B3@gmail.com> (message from Mattias Engdegård on Mon, 3 Oct 2022 21:48:14 +0200)

> From: Mattias Engdegård <mattias.engdegard@gmail.com>
> Date: Mon, 3 Oct 2022 21:48:14 +0200
> Cc: 58168@debbugs.gnu.org
> 
> 2 okt. 2022 kl. 07.36 skrev Eli Zaretskii <eliz@gnu.org>:
> 
> >> Comparison between objects is not only useful when someone cares about their order, as in presenting a sorted list to the user. Often what is important is an ability to impose an order, preferably total, for use in building and searching data structures. I came across this bug when implementing a string set.
> > 
> > Always converting to multibyte handles this case, doesn't it?
> 
> I don't think it does -- string= treats raw bytes in unibyte and multibyte strings as distinct; converting to multibyte does not preserve (in)equality.

If the fact that string= says strings are not equal, but string-lessp
says they are equal, is what bothers you, we could document that
results of comparing unibyte and multibyte strings are unspecified, or
document explicitly that string= and string-lessp behave differently
in this case.

IOW, I see no reason to worry about 100% consistency here: the order
is _really_ undefined in these cases, and trying to make it defined
will not produce any tangible gains, IMNSHO.  It could very well
produce bugs and regressions, OTOH.  So it sounds like a net loss to
me, in practical terms.

> >> Actually I was talking about multibyte-multibyte comparisons.
> > 
> > Then why did you mention raw bytes? their multibyte representation
> > presents no performance problems
> 
> In a way they do -- the way raw bytes are represented (they start with C0 or C1) causes memcmp to sort them between U+007F and U+0080. If we accept that then comparisons are fast since memcmp will compare many character per data-dependent branch. The current code requires several data-dependent branches for each character.

Once again, slowing down string-lessp when raw-bytes are involved
shouldn't be a problem.  So, if memchr finds a C0 or C1 in a string,
fall back to a slower comparison.  memchr is fast enough to not slow
down the "usual" case.  Would that be a good solution?

Alternatively, we could introduce a new primitive which could assume
multibyte or plain-ASCII unibyte strings without checking, and then
code which is sure raw-bytes cannot happen, and needs to compare long
strings, could use that for speed.

> While we could probably bring down the comparison cost slightly by clever hand-coding, it's unlikely to be even nearly as fast as a memcmp and much messier. Since users are unlikely to care much about the ordering between raw bytes and something else (as long as there is an order), it would be a cheap way to improve performance while at the same time fixing the string< / string= mismatch.

The assumption that "users are unlikely to care" is a pure conjecture,
and we have no way of validating it.  So I don't want us to act on
such an assumption.

What about one of the alternatives above instead?

> > You can compare under the assumption that a unibyte string is
> > pure-ASCII until you bump into the first non-ASCII one.  If that
> > happens, abandon the comparison, convert the unibyte string to its
> > multibyte representation, and compare again.
> 
> I don't quite see how that would improve performance but may be missing something.

Then maybe I didn't understand the performance problems that you had
in mind.  Suppose you describe them in more detail, preferably with
examples?  E.g., are you saying that unibyte strings that are
pure-ASCII also cause performance problems?

next prev parent reply	other threads:[~2022-10-04  5:55 UTC|newest]

Thread overview: 40+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-09-29 16:24 bug#58168: string-lessp glitches and inconsistencies Mattias Engdegård
2022-09-29 17:00 ` Mattias Engdegård
2022-09-29 17:11 ` Eli Zaretskii
2022-09-30 20:04   ` Mattias Engdegård
2022-10-01  5:22     ` Eli Zaretskii
2022-10-01 19:57       ` Mattias Engdegård
2022-10-02  5:36         ` Eli Zaretskii
2022-10-03 19:48           ` Mattias Engdegård
2022-10-04  5:55             ` Eli Zaretskii [this message]
2022-10-04 17:40               ` Richard Stallman
2022-10-04 18:07                 ` Eli Zaretskii
2022-10-06  9:05               ` Mattias Engdegård
2022-10-06 11:06                 ` Eli Zaretskii
2022-10-07 14:23                   ` Mattias Engdegård
2022-10-08  7:35                     ` Eli Zaretskii
2022-10-14 14:39                       ` Mattias Engdegård
2022-10-14 15:31                         ` Eli Zaretskii
2022-10-17 12:44                           ` Mattias Engdegård
2022-09-30 13:52 ` Lars Ingebrigtsen
2022-09-30 20:12   ` Mattias Engdegård
2022-10-01  5:34     ` Eli Zaretskii
2022-10-01 11:51       ` Mattias Engdegård
2022-10-01 10:02     ` Lars Ingebrigtsen
2022-10-01 10:12       ` Eli Zaretskii
2022-10-01 13:37       ` Mattias Engdegård
2022-10-01 13:43         ` Lars Ingebrigtsen
2022-10-03 19:48           ` Mattias Engdegård
2022-10-04 10:44             ` Lars Ingebrigtsen
2022-10-04 11:37             ` Eli Zaretskii
2022-10-04 14:44               ` Mattias Engdegård
2022-10-04 16:24                 ` Eli Zaretskii
2022-10-06  9:05                   ` Mattias Engdegård
2022-10-06 11:13                     ` Eli Zaretskii
2022-10-06 12:43                       ` Mattias Engdegård
2022-10-06 14:34                         ` Eli Zaretskii
2022-10-07 14:45                           ` Mattias Engdegård
2022-10-07 15:33                             ` Eli Zaretskii
2022-10-08 17:13                               ` Mattias Engdegård
2022-10-01 13:51         ` Eli Zaretskii
2022-10-01  5:30   ` Eli Zaretskii

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.gnu.org/software/emacs/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=83wn9gw2sp.fsf@gnu.org \
    --to=eliz@gnu.org \
    --cc=58168@debbugs.gnu.org \
    --cc=mattias.engdegard@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).