From: Eli Zaretskii <eliz@gnu.org>
To: "Mattias Engdegård" <mattias.engdegard@gmail.com>
Cc: 58168@debbugs.gnu.org
Subject: bug#58168: string-lessp glitches and inconsistencies
Date: Thu, 06 Oct 2022 14:06:26 +0300
Message-ID: <83wn9dp5xp.fsf@gnu.org>
In-Reply-To: <E10B6A7A-5517-46AD-B6D5-88A8845736BA@gmail.com> (message from Mattias Engdegård on Thu, 6 Oct 2022 11:05:04 +0200)

> From: Mattias Engdegård <mattias.engdegard@gmail.com>
> Date: Thu, 6 Oct 2022 11:05:04 +0200
> Cc: 58168@debbugs.gnu.org
> 
> On 4 Oct 2022, at 07:55, Eli Zaretskii <eliz@gnu.org> wrote:
> 
> > If the fact that string= says strings are not equal, but string-lessp
> > says they are equal, is what bothers you, we could document that
> > results of comparing unibyte and multibyte strings are unspecified, or
> > document explicitly that string= and string-lessp behave differently
> > in this case.
> 
> (It's not just string= but `equal` since they use the same comparison.)
> But it's just a part of a set of related problems:
> 
> * string< / string= inconsistency
> * undesirable string< ordering (unibyte strings are treated as Latin-1)
> * bad string< performance 

That doesn't seem different (and the ordering part is not necessary,
IMO).

> Ideally we should be able to do something about all three at the same time since they are interrelated. At the very least it's worth a try.

It depends on the costs and the risks.  All the rest being equal, yes,
solving those would be desirable.  But it isn't equal, and the costs
and the risks of your proposals outweigh the advantages in my book,
sorry.

> Just documenting the annoying parts won't make them go away -- they still have to be coded around by the user, and it doesn't solve any performance problems either.

That's not a catastrophe, because we are already there (sans the
documentation), and because these cases are rare in real life.

> > I see no reason to worry about 100% consistency here: the order
> > is _really_ undefined in these cases, and trying to make it defined
> > will not produce any tangible gains,
> 
> Yes it would: better performance and wider applicability.

These are not tangible enough IMO.

> Even when the order isn't defined the user expects there to be some order between distinct strings.

No, if the order is undefined, the caller cannot expect any order.
Cf. NaN comparisons with numerical values.
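
To make the NaN analogy concrete, here is a minimal sketch in Python (chosen only for illustration; the helper name `nan_comparisons` is hypothetical). Under IEEE 754 semantics, every ordered comparison and the equality test against NaN are false, so two values can be neither less, greater, nor equal.

```python
import math


def nan_comparisons(x: float) -> dict:
    """Return the result of each comparison between NaN and x."""
    nan = math.nan
    return {
        "less": nan < x,       # False for every x
        "greater": nan > x,    # False for every x
        "equal": nan == x,     # False for every x, including NaN itself
    }


print(nan_comparisons(1.0))
```

A caller that sorts data containing NaN therefore cannot rely on any particular position for it, which is the sense of "undefined order" used here.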

> > Once again, slowing down string-lessp when raw-bytes are involved
> > shouldn't be a problem.  So, if memchr finds a C0 or C1 in a string,
> > fall back to a slower comparison.  memchr is fast enough to not slow
> > down the "usual" case.  Would that be a good solution?
> 
> There is no reason a comparison should need to look beyond the first mismatch; anything else is just algorithmically slow. Long strings are likely to differ early on. Any hack that has to special-case raw bytes will add costs.

You lost me here.  Why are you suddenly talking about mismatches?
And if only mismatches matter here, why is it a problem to use memchr
in the first place?
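
The pre-scan idea discussed above can be sketched as follows. This is an illustration in Python, not Emacs's C code, and it rests on stated assumptions: that a raw byte 0x80..0xFF is stored in a multibyte string as a two-byte sequence led by 0xC0 or 0xC1, and that its character code is the byte value plus 0x3FFF00. The names `has_raw_byte`, `decode` and `string_less` are hypothetical, and `decode` handles only well-formed sequences of up to 4 bytes.

```python
RAW_BYTE_LEADS = (0xC0, 0xC1)  # assumed lead bytes of an encoded raw byte


def has_raw_byte(buf: bytes) -> bool:
    """Linear scan for a raw-byte lead byte, standing in for memchr."""
    return any(bytes([lead]) in buf for lead in RAW_BYTE_LEADS)


def decode(buf: bytes) -> list:
    """Decode well-formed sequences (up to 4 bytes) into character codes."""
    codes = []
    i = 0
    while i < len(buf):
        b0 = buf[i]
        if b0 < 0x80:                      # ASCII
            codes.append(b0)
            i += 1
        elif b0 in RAW_BYTE_LEADS:         # encoded raw byte 0x80..0xFF
            codes.append(0x3FFF80 + ((b0 & 1) << 6) + (buf[i + 1] & 0x3F))
            i += 2
        else:                              # ordinary UTF-8 multi-byte sequence
            length = 2 if b0 < 0xE0 else 3 if b0 < 0xF0 else 4
            code = b0 & (0xFF >> (length + 1))
            for b in buf[i + 1:i + length]:
                code = (code << 6) | (b & 0x3F)
            codes.append(code)
            i += length
    return codes


def string_less(a: bytes, b: bytes) -> bool:
    """Compare by character code, using byte order when that is safe."""
    if not has_raw_byte(a) and not has_raw_byte(b):
        return a < b                       # fast path: byte order == code order
    return decode(a) < decode(b)           # slow path: raw bytes present
```

With these assumptions, `string_less("é".encode(), b"\xc1\xbf")` takes the slow path and orders the raw byte after the ordinary character, whereas a plain byte comparison would order it before. The pre-scan reads both strings in full even when they differ at the first byte, which is the cost objected to in the quoted text.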

> > Alternatively, we could introduce a new primitive which could assume
> > multibyte or plain-ASCII unibyte strings without checking, and then
> > code which is sure raw-bytes cannot happen, and needs to compare long
> > strings, could use that for speed.
> 
> That or variants thereof are indeed alternatives but users would be forgiven for wondering why we don't make what we have fast instead.

Because the fast versions can break when the assumptions are false.
We already have similar stuff in the encoding/decoding area: there are
fast optimized functions that require the caller to make sure some
assumptions hold.
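
A small sketch of how such an unchecked primitive can go wrong, again in Python and purely illustrative. Both function names are hypothetical. `reference_less` models the behaviour described earlier in this thread (unibyte strings treated as Latin-1) with multibyte text assumed to be plain UTF-8; `unchecked_less` is the fast variant that trusts the caller.

```python
def unchecked_less(a: bytes, b: bytes) -> bool:
    """Hypothetical fast primitive: plain byte comparison, no validation."""
    return a < b


def reference_less(a: bytes, a_multibyte: bool,
                   b: bytes, b_multibyte: bool) -> bool:
    """Slow reference: decode each string, then compare by code point."""
    def to_text(buf: bytes, multibyte: bool) -> str:
        # Unibyte is read as Latin-1, multibyte as UTF-8 (an assumption).
        return buf.decode("utf-8" if multibyte else "latin-1")
    return to_text(a, a_multibyte) < to_text(b, b_multibyte)


# Assumption holds (pure ASCII): both functions agree.
print(unchecked_less(b"abc", b"abd"),
      reference_less(b"abc", False, b"abd", True))

# Assumption violated (unibyte non-ASCII against multibyte): they disagree.
unibyte = b"\xff"                       # Latin-1 U+00FF
multibyte = "\u0100".encode("utf-8")    # U+0100, stored as C4 80
print(unchecked_less(unibyte, multibyte),
      reference_less(unibyte, False, multibyte, True))
```

The second pair shows the trade-off: the unchecked version stays fast, but its answer is only meaningful when the caller has guaranteed that no non-ASCII unibyte data is involved.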

> > E.g., are you saying that unibyte strings that are
> > pure-ASCII also cause performance problems?
> 
> They do because we have no efficient way of ascertaining that they are pure-ASCII.

If we declare that comparing with unibyte non-ASCII strings produces
unspecified results, we don't have to worry about that: it becomes the
worry of the caller.

> The long-term solution is to make multibyte strings the default in more cases but I'm not proposing such a change right now.

I don't think we will ever get there, FWIW.  Raw bytes in strings are
a fact of life, whether we like it or not.

> I'll see where further performance tweaking of the existing code can take us with reasonable effort, but there are hard limits to what can be done.
> And thank you for your comments!

Thanks.




