bug#58168: string-lessp glitches and inconsistencies

all messages for Emacs-related lists mirrored at yhetil.org
 help / color / mirror / code / Atom feed

* bug#58168: string-lessp glitches and inconsistencies
@ 2022-09-29 16:24 Mattias Engdegård
  2022-09-29 17:00 ` Mattias Engdegård
                   ` (2 more replies)
  0 siblings, 3 replies; 40+ messages in thread
From: Mattias Engdegård @ 2022-09-29 16:24 UTC (permalink / raw)
  To: 58168

We really want string< to be consistent with string= and itself since this is fundamental for string ordering in searching and sorting applications.
This means that for any pair of strings A and B, we should either have A<B, B<A or A=B.

Unfortunately:

  (let* ((a "ü")
         (b "\xfc"))
    (list (string= a b)
          (string< a b)
          (string< b a)))
=> (nil nil nil)

because string< considers the unibyte raw byte 0xFC and the multibyte char U+00FC to be the same, but string= thinks they are different.
We also distinguish raw bytes by multibyte-ness:

  (let* ((u "\x80")
         (m (string-to-multibyte u)))
    (list (string= u m)
          (string< u m)
          (string< m u)))
=> (nil t nil)

but this is a minor annoyance that we can live with: we strongly want string= to remain consistent with `equal` for strings.
So, what can be done? The current string< implementation uses the character order

 ASCII < ub raw 80..FF = mb U+0080..U+00FF < U+0100..10FFFF < mb raw 80..FF

in conflict with string= which unifies unibyte and multibyte ASCII but not raw bytes and Latin-1.
It suggests the following alternative collation orders:

A. ASCII < ub raw 80..FF < mb U+0080..10FFFF < mb raw 80..FF

which puts all non-ASCII multibyte chars after unibyte.

B. ASCII < ub raw 80..FF < mb raw 80..FF < mb U+0080..10FFFF

which inserts multibyte raw bytes after the unibyte ones, permitting any ub-ub and mb-mb comparisons to be made using memcmp, and a slow decoding loop only required for unibyte against non-ASCII multibyte strings.

C. ASCII < mb U+0080..10FFFF < mb raw 80..FF < ub raw 80..FF

which instead moves unibyte raw bytes to after the multibyte raw range. This has the same memcmp benefit as alternative B, but may be slightly faster for ub-mb comparisons since only unibyte 80..FF need to be remapped.

Any particular preference? Otherwise, I'll go with B or C, depending on what the resulting code looks like.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* bug#58168: string-lessp glitches and inconsistencies
  2022-09-29 16:24 bug#58168: string-lessp glitches and inconsistencies Mattias Engdegård
@ 2022-09-29 17:00 ` Mattias Engdegård
  2022-09-29 17:11 ` Eli Zaretskii
  2022-09-30 13:52 ` Lars Ingebrigtsen
  2 siblings, 0 replies; 40+ messages in thread
From: Mattias Engdegård @ 2022-09-29 17:00 UTC (permalink / raw)
  To: 58168

29 sep. 2022 kl. 18.24 skrev Mattias Engdegård <mattias.engdegard@gmail.com>:

> C. ASCII < mb U+0080..10FFFF < mb raw 80..FF < ub raw 80..FF
> 
> which instead moves unibyte raw bytes to after the multibyte raw range. This has the same memcmp benefit as alternative B

Actually it doesn't -- editing mistake, sorry. Using memcmp for arbitrary multibyte strings requires collating raw bytes between ASCII and other Unicode.






^ permalink raw reply	[flat|nested] 40+ messages in thread

* bug#58168: string-lessp glitches and inconsistencies
  2022-09-29 16:24 bug#58168: string-lessp glitches and inconsistencies Mattias Engdegård
  2022-09-29 17:00 ` Mattias Engdegård
@ 2022-09-29 17:11 ` Eli Zaretskii
  2022-09-30 20:04   ` Mattias Engdegård
  2022-09-30 13:52 ` Lars Ingebrigtsen
  2 siblings, 1 reply; 40+ messages in thread
From: Eli Zaretskii @ 2022-09-29 17:11 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 58168

> From: Mattias Engdegård <mattias.engdegard@gmail.com>
> Date: Thu, 29 Sep 2022 18:24:04 +0200
> 
> We really want string< to be consistent with string= and itself since this is fundamental for string ordering in searching and sorting applications.
> This means that for any pair of strings A and B, we should either have A<B, B<A or A=B.
> 
> Unfortunately:
> 
>   (let* ((a "ü")
>          (b "\xfc"))
>     (list (string= a b)
>           (string< a b)
>           (string< b a)))
> => (nil nil nil)
> 
> because string< considers the unibyte raw byte 0xFC and the multibyte char U+00FC to be the same, but string= thinks they are different.

Why do we care?  Unibyte strings should never be compared with
multibyte, unless they are both pure-ASCII.

> So, what can be done? The current string< implementation uses the character order
> 
>  ASCII < ub raw 80..FF = mb U+0080..U+00FF < U+0100..10FFFF < mb raw 80..FF
> 
> in conflict with string= which unifies unibyte and multibyte ASCII but not raw bytes and Latin-1.

It would be unimaginable to unify raw bytes with Latin-1.  Raw bytes
are not Latin-1 characters, they can stand for any characters, or for
no characters at all.

> It suggests the following alternative collation orders:
> 
> A. ASCII < ub raw 80..FF < mb U+0080..10FFFF < mb raw 80..FF
> 
> which puts all non-ASCII multibyte chars after unibyte.
> 
> B. ASCII < ub raw 80..FF < mb raw 80..FF < mb U+0080..10FFFF
> 
> which inserts multibyte raw bytes after the unibyte ones, permitting any ub-ub and mb-mb comparisons to be made using memcmp, and a slow decoding loop only required for unibyte against non-ASCII multibyte strings.
> 
> C. ASCII < mb U+0080..10FFFF < mb raw 80..FF < ub raw 80..FF

Neither, IMNSHO.  Unibyte characters don't belong to this order.  They
should be converted to multibyte representation to be sensibly
comparable.

> Otherwise, I'll go with B or C, depending on what the resulting code looks like.

Please don't.  Let's first decide that we want to change this, and
what are the reasons for that.  Theoretical "impurity" doesn't count,
IMO.





^ permalink raw reply	[flat|nested] 40+ messages in thread

* bug#58168: string-lessp glitches and inconsistencies
  2022-09-29 16:24 bug#58168: string-lessp glitches and inconsistencies Mattias Engdegård
  2022-09-29 17:00 ` Mattias Engdegård
  2022-09-29 17:11 ` Eli Zaretskii
@ 2022-09-30 13:52 ` Lars Ingebrigtsen
  2022-09-30 20:12   ` Mattias Engdegård
  2022-10-01  5:30   ` Eli Zaretskii
  2 siblings, 2 replies; 40+ messages in thread
From: Lars Ingebrigtsen @ 2022-09-30 13:52 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 58168

Mattias Engdegård <mattias.engdegard@gmail.com> writes:

> We really want string< to be consistent with string= and itself since this is fundamental for string ordering in searching and sorting applications.
> This means that for any pair of strings A and B, we should either have A<B, B<A or A=B.
>
> Unfortunately:
>
>   (let* ((a "ü")
>          (b "\xfc"))
>     (list (string= a b)
>           (string< a b)
>           (string< b a)))
> => (nil nil nil)
>
> because string< considers the unibyte raw byte 0xFC and the multibyte char U+00FC to be the same, but string= thinks they are different.

You also have

(string 4194176)
=> "\200"
"\x80"
=> "\200"

which are kinda equal in some ways, and not in other ways.

> It suggests the following alternative collation orders:
>
> A. ASCII < ub raw 80..FF < mb U+0080..10FFFF < mb raw 80..FF
>
> which puts all non-ASCII multibyte chars after unibyte.
>
> B. ASCII < ub raw 80..FF < mb raw 80..FF < mb U+0080..10FFFF
>
> which inserts multibyte raw bytes after the unibyte ones, permitting any ub-ub and mb-mb comparisons to be made using memcmp, and a slow decoding loop only required for unibyte against non-ASCII multibyte strings.
>
> C. ASCII < mb U+0080..10FFFF < mb raw 80..FF < ub raw 80..FF
>
> which instead moves unibyte raw bytes to after the multibyte raw range. This has the same memcmp benefit as alternative B, but may be slightly faster for ub-mb comparisons since only unibyte 80..FF need to be remapped.

I think A makes the most intuitive sense, somehow.  But perhaps my
intuition is off.





^ permalink raw reply	[flat|nested] 40+ messages in thread

* bug#58168: string-lessp glitches and inconsistencies
  2022-09-29 17:11 ` Eli Zaretskii
@ 2022-09-30 20:04   ` Mattias Engdegård
  2022-10-01  5:22     ` Eli Zaretskii
  0 siblings, 1 reply; 40+ messages in thread
From: Mattias Engdegård @ 2022-09-30 20:04 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 58168

29 sep. 2022 kl. 19.11 skrev Eli Zaretskii <eliz@gnu.org>:

> Unibyte strings should never be compared with
> multibyte, unless they are both pure-ASCII.

It's perfectly fine to compare "Madrid" (unibyte) with "Málaga" (non-ASCII multibyte).
If you mean that all strings (literals in particular) should be multibyte by default then I agree and at some point we should take that step, but it would be quite a breaking change. Perhaps less in practice than we fear, though...

>> So, what can be done? The current string< implementation uses the character order
>> 
>> ASCII < ub raw 80..FF = mb U+0080..U+00FF < U+0100..10FFFF < mb raw 80..FF
>> 
>> in conflict with string= which unifies unibyte and multibyte ASCII but not raw bytes and Latin-1.
> 
> It would be unimaginable to unify raw bytes with Latin-1.  Raw bytes
> are not Latin-1 characters, they can stand for any characters, or for
> no characters at all.

Completely agreed! Let's try to fix that, then.

> Unibyte characters don't belong to this order.  They
> should be converted to multibyte representation to be sensibly
> comparable.

Oh I agree to some extent but we can't really raise an error if someone tries so we might as well return something reasonable and coherent. Besides, there are more good reasons for ordering strings (both multibyte and unibyte) than might be apparent at first.

Working from the assumption that we can't change string= to equate raw bytes in unibyte and multibyte strings, we need to invent an order between normally incommensurate values which sounds odd but is actually fine; this is occasionally done and can be quite useful.

It's also a matter of performance -- string< has been improved recently but currently we compare text in Latin and Swahili much faster than French and Arabic; it would be nice to close that gap. UTF-8 is designed so that comparing strings by scalar values can be done byte-wise, but the way we encode raw bytes make them sort right between ASCII and Latin-1. Given that the specific order doesn't matter much, we could just run with that.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* bug#58168: string-lessp glitches and inconsistencies
  2022-09-30 13:52 ` Lars Ingebrigtsen
@ 2022-09-30 20:12   ` Mattias Engdegård
  2022-10-01  5:34     ` Eli Zaretskii
  2022-10-01 10:02     ` Lars Ingebrigtsen
  2022-10-01  5:30   ` Eli Zaretskii
  1 sibling, 2 replies; 40+ messages in thread
From: Mattias Engdegård @ 2022-09-30 20:12 UTC (permalink / raw)
  To: Lars Ingebrigtsen; +Cc: 58168

30 sep. 2022 kl. 15.52 skrev Lars Ingebrigtsen <larsi@gnus.org>:

> (string 4194176)
> => "\200"
> "\x80"
> => "\200"
> 
> which are kinda equal in some ways, and not in other ways.

And (string 128)
=> "\200"

but painted in a different colour so that the alert reader suspects that something odd is going on...
(What about printing it as "\u0080" instead? Only affects the C1 controls...)

> I think A makes the most intuitive sense, somehow.  But perhaps my
> intuition is off.

Can't say my intuition is any more reliable. On the other hand, intuition says that counter-intuitive solutions are very attractive, but I think it's just trying to confuse us.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* bug#58168: string-lessp glitches and inconsistencies
  2022-09-30 20:04   ` Mattias Engdegård
@ 2022-10-01  5:22     ` Eli Zaretskii
  2022-10-01 19:57       ` Mattias Engdegård
  0 siblings, 1 reply; 40+ messages in thread
From: Eli Zaretskii @ 2022-10-01  5:22 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 58168

> From: Mattias Engdegård <mattias.engdegard@gmail.com>
> Date: Fri, 30 Sep 2022 22:04:47 +0200
> Cc: 58168@debbugs.gnu.org
> 
> 29 sep. 2022 kl. 19.11 skrev Eli Zaretskii <eliz@gnu.org>:
> 
> > Unibyte strings should never be compared with
> > multibyte, unless they are both pure-ASCII.
> 
> It's perfectly fine to compare "Madrid" (unibyte) with "Málaga" (non-ASCII multibyte).

Not relevant: I meant unibyte non-ASCII strings.  The ASCII case is
easy and un-problematic, and is really just a straw-man here.

> If you mean that all strings (literals in particular) should be multibyte by default then I agree and at some point we should take that step, but it would be quite a breaking change. Perhaps less in practice than we fear, though...

That's not what I meant.  I think unibyte strings are with us for the
observable future.

> > Unibyte characters don't belong to this order.  They
> > should be converted to multibyte representation to be sensibly
> > comparable.
> 
> Oh I agree to some extent but we can't really raise an error if someone tries so we might as well return something reasonable and coherent.

It depends on the use case, but in general I see no problem with
signaling errors when we cannot produce reasonably correct results.
For example, string-to-unibyte does signal an error in some cases.

> Besides, there are more good reasons for ordering strings (both multibyte and unibyte) than might be apparent at first.

Examples, please.

> Working from the assumption that we can't change string= to equate raw bytes in unibyte and multibyte strings, we need to invent an order between normally incommensurate values

I don't agree with the conclusion.  It is not the only possible
conclusion.  Signaling an error is another one, and I'm sure we could
think of more.

> It's also a matter of performance -- string< has been improved recently but currently we compare text in Latin and Swahili much faster than French and Arabic; it would be nice to close that gap. UTF-8 is designed so that comparing strings by scalar values can be done byte-wise, but the way we encode raw bytes make them sort right between ASCII and Latin-1. Given that the specific order doesn't matter much, we could just run with that.

I see no reason to make comparison of unibyte and multibyte strings
perform better.





^ permalink raw reply	[flat|nested] 40+ messages in thread

* bug#58168: string-lessp glitches and inconsistencies
  2022-09-30 13:52 ` Lars Ingebrigtsen
  2022-09-30 20:12   ` Mattias Engdegård
@ 2022-10-01  5:30   ` Eli Zaretskii
  1 sibling, 0 replies; 40+ messages in thread
From: Eli Zaretskii @ 2022-10-01  5:30 UTC (permalink / raw)
  To: Lars Ingebrigtsen; +Cc: 58168, mattias.engdegard

> Cc: 58168@debbugs.gnu.org
> From: Lars Ingebrigtsen <larsi@gnus.org>
> Date: Fri, 30 Sep 2022 15:52:12 +0200
> 
> You also have
> 
> (string 4194176)
> => "\200"
> "\x80"
> => "\200"
> 
> which are kinda equal in some ways, and not in other ways.

So are "A" and "𝙰", or "א" and "ℵ": kinda equal in some ways, but not
in others.  That doesn't mean they should compare equal.





^ permalink raw reply	[flat|nested] 40+ messages in thread

* bug#58168: string-lessp glitches and inconsistencies
  2022-09-30 20:12   ` Mattias Engdegård
@ 2022-10-01  5:34     ` Eli Zaretskii
  2022-10-01 11:51       ` Mattias Engdegård
  2022-10-01 10:02     ` Lars Ingebrigtsen
  1 sibling, 1 reply; 40+ messages in thread
From: Eli Zaretskii @ 2022-10-01  5:34 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 58168, larsi

> Cc: 58168@debbugs.gnu.org
> From: Mattias Engdegård <mattias.engdegard@gmail.com>
> Date: Fri, 30 Sep 2022 22:12:24 +0200
> 
> 30 sep. 2022 kl. 15.52 skrev Lars Ingebrigtsen <larsi@gnus.org>:
> 
> > (string 4194176)
> > => "\200"
> > "\x80"
> > => "\200"
> > 
> > which are kinda equal in some ways, and not in other ways.
> 
> And (string 128)
> => "\200"
> 
> but painted in a different colour

It's the same color as the result of '(string 4194176)'.

>so that the alert reader suspects that something odd is going on...

Which it is.

> (What about printing it as "\u0080" instead?

NO!!  \u0080 is something entirely different.





^ permalink raw reply	[flat|nested] 40+ messages in thread

* bug#58168: string-lessp glitches and inconsistencies
  2022-09-30 20:12   ` Mattias Engdegård
  2022-10-01  5:34     ` Eli Zaretskii
@ 2022-10-01 10:02     ` Lars Ingebrigtsen
  2022-10-01 10:12       ` Eli Zaretskii
  2022-10-01 13:37       ` Mattias Engdegård
  1 sibling, 2 replies; 40+ messages in thread
From: Lars Ingebrigtsen @ 2022-10-01 10:02 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 58168

[-- Attachment #1: Type: text/plain, Size: 324 bytes --]

Mattias Engdegård <mattias.engdegard@gmail.com> writes:

>> (string 4194176)
>> => "\200"
>> "\x80"
>> => "\200"
>> 
>> which are kinda equal in some ways, and not in other ways.
>
> And (string 128)
> => "\200"

Funnily enough, the latter displays in a different way for me, which may
or may not be a bug:


[-- Attachment #2: Type: image/png, Size: 17392 bytes --]

[-- Attachment #3: Type: text/plain, Size: 392 bytes --]


This is with `display-raw-bytes-as-hex' t.

> (What about printing it as "\u0080" instead? Only affects the C1 controls...)

\u0080 is a character, so that's something different than \200.

> Can't say my intuition is any more reliable. On the other hand,
> intuition says that counter-intuitive solutions are very attractive,
> but I think it's just trying to confuse us.

😀

^ permalink raw reply	[flat|nested] 40+ messages in thread

* bug#58168: string-lessp glitches and inconsistencies
  2022-10-01 10:02     ` Lars Ingebrigtsen
@ 2022-10-01 10:12       ` Eli Zaretskii
  2022-10-01 13:37       ` Mattias Engdegård
  1 sibling, 0 replies; 40+ messages in thread
From: Eli Zaretskii @ 2022-10-01 10:12 UTC (permalink / raw)
  To: Lars Ingebrigtsen; +Cc: 58168, mattias.engdegard

> Cc: 58168@debbugs.gnu.org
> From: Lars Ingebrigtsen <larsi@gnus.org>
> Date: Sat, 01 Oct 2022 12:02:12 +0200
> 
> Mattias Engdegård <mattias.engdegard@gmail.com> writes:
> 
> >> (string 4194176)
> >> => "\200"
> >> "\x80"
> >> => "\200"
> >> 
> >> which are kinda equal in some ways, and not in other ways.
> >
> > And (string 128)
> > => "\200"
> 
> Funnily enough, the latter displays in a different way for me, which may
> or may not be a bug:

The former is a string of 4 characters, the latter is a string of just
one.





^ permalink raw reply	[flat|nested] 40+ messages in thread

* bug#58168: string-lessp glitches and inconsistencies
  2022-10-01  5:34     ` Eli Zaretskii
@ 2022-10-01 11:51       ` Mattias Engdegård
  0 siblings, 0 replies; 40+ messages in thread
From: Mattias Engdegård @ 2022-10-01 11:51 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 58168, larsi

1 okt. 2022 kl. 07.34 skrev Eli Zaretskii <eliz@gnu.org>:

>> (What about printing it as "\u0080" instead?
> 
> NO!!  \u0080 is something entirely different.

Actually not -- (string 128) returns the multibyte string consisting of the single char U+0080, which is exactly what you get by typing "\u0080".
It confuses me, too. That's a C1 control char and Emacs doesn't escape it when printing, but it's displayed as `\200`.

(I don't think changing that display to \u0080 would break any compatibility; we should consider doing that for all C1 controls, U+0080..U+009F.)

Even more confusing is that "\x0080" means the same as "\u0080" but not the same as "\x80", which is a unibyte string of a single raw byte.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* bug#58168: string-lessp glitches and inconsistencies
  2022-10-01 10:02     ` Lars Ingebrigtsen
  2022-10-01 10:12       ` Eli Zaretskii
@ 2022-10-01 13:37       ` Mattias Engdegård
  2022-10-01 13:43         ` Lars Ingebrigtsen
  2022-10-01 13:51         ` Eli Zaretskii
  1 sibling, 2 replies; 40+ messages in thread
From: Mattias Engdegård @ 2022-10-01 13:37 UTC (permalink / raw)
  To: Lars Ingebrigtsen; +Cc: 58168, Eli Zaretskii

[-- Attachment #1: Type: text/plain, Size: 1149 bytes --]

1 okt. 2022 kl. 12.02 skrev Lars Ingebrigtsen <larsi@gnus.org>:

> Funnily enough, the latter displays in a different way for me, which may
> or may not be a bug:
> 
> This is with `display-raw-bytes-as-hex' t.

You are right, that is completely broken -- display-raw-bytes-as-hex shouldn't affect the display of C1 controls.
Whether (string 128) displays "\200" or "\x80", however tarted up in a fancy face, it's still a lie. Only something like "\u0080" would actually be correct.

It seems to be a relic from the pre-Unicode days of Emacs: the code responsible muddles the display of raw bytes and unicode controls.
The attached patch untangles the two somewhat and lets display-raw-bytes-as-hex do what its name and documentation suggest, while using a non-confusing display for C1 controls.

The command

  (insert "C1: " (string 128) " raw: " (unibyte-string 128) ".\n")

currently displays

   C1: \200 raw: \200.
or
   C1: \x80 raw: \x80.

depending on display-raw-bytes-as-hex. With the patch, we get

€ €  €C1: \u0080 raw: \200.
or
   C1: \u0080 raw: \x80.

which should satisfy everyone. What about it?


[-- Attachment #2: unicode-escape-display.diff --]
[-- Type: application/octet-stream, Size: 955 bytes --]

diff --git a/src/xdisp.c b/src/xdisp.c
index 55e74a3603..fa4fc2319e 100644
--- a/src/xdisp.c
+++ b/src/xdisp.c
@@ -8179,12 +8179,20 @@ get_next_display_element (struct it *it)
 		char str[10];
 		int len, i;
 
+		const char *format_string;
 		if (CHAR_BYTE8_P (c))
-		  /* Display \200 or \x80 instead of \17777600.  */
-		  c = CHAR_TO_BYTE8 (c);
-		const char *format_string = display_raw_bytes_as_hex
-					    ? "x%02x"
-					    : "%03o";
+		  {
+		    /* A raw byte: display using an octal or hex escape which
+		       would produce this byte in a Lisp string literal.  */
+		    c = CHAR_TO_BYTE8 (c);
+		    format_string = display_raw_bytes_as_hex ? "x%02x" : "%03o";
+		  }
+		else
+		  {
+		    /* A Unicode character not displayed in any other way:
+		       use a Unicode escape.  */
+		    format_string = c <= 0xffff ? "u%04X" : "U%08X";
+		  }
 		len = sprintf (str, format_string, c + 0u);
 
 		XSETINT (it->ctl_chars[0], escape_glyph);

[-- Attachment #3: Type: text/plain, Size: 203 bytes --]



I see that the redisplay-testsuite.el needs amending too; it actually looks buggy in this respect. If the above approach is deemed acceptable, I'll submit a patch that includes that file as well.


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* bug#58168: string-lessp glitches and inconsistencies
  2022-10-01 13:37       ` Mattias Engdegård
@ 2022-10-01 13:43         ` Lars Ingebrigtsen
  2022-10-03 19:48           ` Mattias Engdegård
  2022-10-01 13:51         ` Eli Zaretskii
  1 sibling, 1 reply; 40+ messages in thread
From: Lars Ingebrigtsen @ 2022-10-01 13:43 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 58168, Eli Zaretskii

Mattias Engdegård <mattias.engdegard@gmail.com> writes:

> depending on display-raw-bytes-as-hex. With the patch, we get
>
> € €  €C1: \u0080 raw: \200.
> or
>    C1: \u0080 raw: \x80.
>
> which should satisfy everyone. What about it?

There was a very long thread about changing this output in a bug report
somewhere, and we decided not to, because all the alternatives were
worse than what we have.

*handwaves at bug tracker*





^ permalink raw reply	[flat|nested] 40+ messages in thread

* bug#58168: string-lessp glitches and inconsistencies
  2022-10-01 13:37       ` Mattias Engdegård
  2022-10-01 13:43         ` Lars Ingebrigtsen
@ 2022-10-01 13:51         ` Eli Zaretskii
  1 sibling, 0 replies; 40+ messages in thread
From: Eli Zaretskii @ 2022-10-01 13:51 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 58168, larsi

> From: Mattias Engdegård <mattias.engdegard@gmail.com>
> Date: Sat, 1 Oct 2022 15:37:25 +0200
> Cc: 58168@debbugs.gnu.org,
>  Eli Zaretskii <eliz@gnu.org>
> 
> > Funnily enough, the latter displays in a different way for me, which may
> > or may not be a bug:
> > 
> > This is with `display-raw-bytes-as-hex' t.
> 
> You are right, that is completely broken -- display-raw-bytes-as-hex shouldn't affect the display of C1 controls.

I think the variable is a misnomer of sorts: the request back when it
was introduced to display hex where we usually display octal \nnn
escapes.  And the latter happens not only for raw bytes.

> It seems to be a relic from the pre-Unicode days of Emacs: the code responsible muddles the display of raw bytes and unicode controls.

No, I think we decided to keep the display of C1 characters as octal
escapes.





^ permalink raw reply	[flat|nested] 40+ messages in thread

* bug#58168: string-lessp glitches and inconsistencies
  2022-10-01  5:22     ` Eli Zaretskii
@ 2022-10-01 19:57       ` Mattias Engdegård
  2022-10-02  5:36         ` Eli Zaretskii
  0 siblings, 1 reply; 40+ messages in thread
From: Mattias Engdegård @ 2022-10-01 19:57 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 58168

1 okt. 2022 kl. 07.22 skrev Eli Zaretskii <eliz@gnu.org>:

> It depends on the use case, but in general I see no problem with
> signaling errors when we cannot produce reasonably correct results.
> For example, string-to-unibyte does signal an error in some cases.

That's fine because that function is documented to do so and always has, but making previously possible comparisons raise errors shouldn't be done lightly.

Comparison between objects is not only useful when someone cares about their order, as in presenting a sorted list to the user. Often what is important is an ability to impose an order, preferably total, for use in building and searching data structures. I came across this bug when implementing a string set.

>> It's also a matter of performance -- string< has been improved recently but currently we compare text in Latin and Swahili much faster than French and Arabic; it would be nice to close that gap. UTF-8 is designed so that comparing strings by scalar values can be done byte-wise, but the way we encode raw bytes make them sort right between ASCII and Latin-1. Given that the specific order doesn't matter much, we could just run with that.
> 
> I see no reason to make comparison of unibyte and multibyte strings
> perform better.

Actually I was talking about multibyte-multibyte comparisons.

You were probably thinking about comparisons between unibyte strings that contain raw bytes and multibyte strings, and those are indeed not very performance-sensitive. However there is no way to detect whether a unibyte string contains non-ASCII chars without looking at every byte, and comparing unibyte ASCII with multibyte is definitely of interest. Strings are still unibyte by default.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* bug#58168: string-lessp glitches and inconsistencies
  2022-10-01 19:57       ` Mattias Engdegård
@ 2022-10-02  5:36         ` Eli Zaretskii
  2022-10-03 19:48           ` Mattias Engdegård
  0 siblings, 1 reply; 40+ messages in thread
From: Eli Zaretskii @ 2022-10-02  5:36 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 58168

> From: Mattias Engdegård <mattias.engdegard@gmail.com>
> Date: Sat, 1 Oct 2022 21:57:45 +0200
> Cc: 58168@debbugs.gnu.org
> 
> 1 okt. 2022 kl. 07.22 skrev Eli Zaretskii <eliz@gnu.org>:
> 
> > It depends on the use case, but in general I see no problem with
> > signaling errors when we cannot produce reasonably correct results.
> > For example, string-to-unibyte does signal an error in some cases.
> 
> That's fine because that function is documented to do so and always has, but making previously possible comparisons raise errors shouldn't be done lightly.

I didn't say "lightly", nor do I think so.  We need to discuss
specific use cases.

An alternative is to always convert unibyte non-ASCII strings to their
multibyte representation before comparing.

> Comparison between objects is not only useful when someone cares about their order, as in presenting a sorted list to the user. Often what is important is an ability to impose an order, preferably total, for use in building and searching data structures. I came across this bug when implementing a string set.

Always converting to multibyte handles this case, doesn't it?

> >> It's also a matter of performance -- string< has been improved recently but currently we compare text in Latin and Swahili much faster than French and Arabic; it would be nice to close that gap. UTF-8 is designed so that comparing strings by scalar values can be done byte-wise, but the way we encode raw bytes make them sort right between ASCII and Latin-1. Given that the specific order doesn't matter much, we could just run with that.
> > 
> > I see no reason to make comparison of unibyte and multibyte strings
> > perform better.
> 
> Actually I was talking about multibyte-multibyte comparisons.

Then why did you mention raw bytes? their multibyte representation
presents no performance problems, AFAIU.

> You were probably thinking about comparisons between unibyte strings that contain raw bytes and multibyte strings, and those are indeed not very performance-sensitive. However there is no way to detect whether a unibyte string contains non-ASCII chars without looking at every byte, and comparing unibyte ASCII with multibyte is definitely of interest. Strings are still unibyte by default.

You can compare under the assumption that a unibyte string is
pure-ASCII until you bump into the first non-ASCII one.  If that
happens, abandon the comparison, convert the unibyte string to its
multibyte representation, and compare again.





^ permalink raw reply	[flat|nested] 40+ messages in thread

* bug#58168: string-lessp glitches and inconsistencies
  2022-10-01 13:43         ` Lars Ingebrigtsen
@ 2022-10-03 19:48           ` Mattias Engdegård
  2022-10-04 10:44             ` Lars Ingebrigtsen
  2022-10-04 11:37             ` Eli Zaretskii
  0 siblings, 2 replies; 40+ messages in thread
From: Mattias Engdegård @ 2022-10-03 19:48 UTC (permalink / raw)
  To: Lars Ingebrigtsen; +Cc: 58168, Eli Zaretskii

[-- Attachment #1: Type: text/plain, Size: 1414 bytes --]

1 okt. 2022 kl. 15.43 skrev Lars Ingebrigtsen <larsi@gnus.org>:

> There was a very long thread about changing this output in a bug report
> somewhere, and we decided not to, because all the alternatives were
> worse than what we have.

I have no wish to reopen old wounds (and it sounds as if the debate isn't one that you are keen to revisit), but I'm a bit surprised -- it all arises because of old code that still treats raw bytes as Latin-1, and I know that neither of you are very fond of that either.

1 okt. 2022 kl. 15.51 skrev Eli Zaretskii <eliz@gnu.org>:

> I think the variable is a misnomer of sorts: the request back when it
> was introduced to display hex where we usually display octal \nnn
> escapes.  And the latter happens not only for raw bytes.

Fair enough. Maybe the documentation should reflect that, but I'm still holding out for a change to the C1 presentation in the long term, so...

I'm not going to pursue this little digression any further except that while looking at it I found a few inaccuracies and a likely bug in redisplay-testsuite.el. I'm attaching a patch which un-muddles the test and adds a display of unprintable Unicode chars such as C1 controls, in addition to raw bytes. I'd like to adorn the commit with the correct bug number so if you remember that of the original discussion that would be useful (I never found it very easy to search debbugs).


[-- Attachment #2: redisplay-testsuite.diff --]
[-- Type: application/octet-stream, Size: 2762 bytes --]

diff --git a/test/manual/redisplay-testsuite.el b/test/manual/redisplay-testsuite.el
index 01b0a895a4..5495146b87 100644
--- a/test/manual/redisplay-testsuite.el
+++ b/test/manual/redisplay-testsuite.el
@@ -305,7 +305,7 @@ test-redisplay-5-toggle
   (let ((label (if display-raw-bytes-as-hex "\\x80" "\\200")))
     (overlay-put test-redisplay-5a-expected-overlay 'display
                  (propertize label 'face 'escape-glyph)))
-  (let ((label (if display-raw-bytes-as-hex "\\x3fffc" "\\777774")))
+  (let ((label (if display-raw-bytes-as-hex "\\xfc" "\\374")))
     (overlay-put test-redisplay-5b-expected-overlay 'display
                  (propertize label 'face 'escape-glyph))))
 
@@ -320,18 +320,36 @@ test-redisplay-5
         (test-insert-overlay " " 'display "\200"))
   (insert "\n\n")
   (insert "  Expected: ")
-  ;; This tests a large codepoint, to make sure the internal buffer we
-  ;; use to produce the representation is large enough.
-  (aset printable-chars #x3fffc nil)
   (setq test-redisplay-5b-expected-overlay
         (test-insert-overlay " " 'display
-                             (propertize "\\777774" 'face 'escape-glyph)))
+                             (propertize "\\374" 'face 'escape-glyph)))
   (insert "\n    Result: ")
   (setq test-redisplay-5b-result-overlay
-        (test-insert-overlay " " 'display (char-to-string #x3fffc)))
+        (test-insert-overlay " " 'display (char-to-string #x3ffffc)))
+  (insert "\n\n")
+  (insert-button "Toggle between octal and hex display for raw bytes"
+                 'action 'test-redisplay-5-toggle)
+  (insert "\n\n"))
+
+(defun test-redisplay-6 ()
+  (insert "Test 6: Display of unprintable Unicode chars:\n\n")
+  (insert "  Expected: ")
+  (test-insert-overlay " " 'display
+                       (propertize "\\200" 'face 'escape-glyph))
+  (insert "  (representing U+0100)")
+  (insert "\n    Result: ")
+  (test-insert-overlay " " 'display "\u0080")
   (insert "\n\n")
-  (insert-button "Toggle between octal and hex display"
-                 'action 'test-redisplay-5-toggle))
+  ;; This tests a large codepoint, to make sure the internal buffer we
+  ;; use to produce the representation is large enough.
+  (insert "  Expected: ")
+  (aset printable-chars #x10abcd nil)
+  (test-insert-overlay " " 'display
+                       (propertize "\\4125715" 'face 'escape-glyph))
+  (insert "  (representing U+0010ABCD)")
+  (insert "\n    Result: ")
+  (test-insert-overlay " " 'display "\U0010ABCD")
+  (insert "\n\n"))
 
 (defun test-redisplay ()
   (interactive)
@@ -349,6 +367,7 @@ test-redisplay
     (test-redisplay-3)
     (test-redisplay-4)
     (test-redisplay-5)
+    (test-redisplay-6)
     (goto-char (point-min))))
 
 ;;; redisplay-testsuite.el ends here

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* bug#58168: string-lessp glitches and inconsistencies
  2022-10-02  5:36         ` Eli Zaretskii
@ 2022-10-03 19:48           ` Mattias Engdegård
  2022-10-04  5:55             ` Eli Zaretskii
  0 siblings, 1 reply; 40+ messages in thread
From: Mattias Engdegård @ 2022-10-03 19:48 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 58168

2 okt. 2022 kl. 07.36 skrev Eli Zaretskii <eliz@gnu.org>:

>> Comparison between objects is not only useful when someone cares about their order, as in presenting a sorted list to the user. Often what is important is an ability to impose an order, preferably total, for use in building and searching data structures. I came across this bug when implementing a string set.
> 
> Always converting to multibyte handles this case, doesn't it?

I don't think it does -- string= treats raw bytes in unibyte and multibyte strings as distinct; converting to multibyte does not preserve (in)equality.

>> Actually I was talking about multibyte-multibyte comparisons.
> 
> Then why did you mention raw bytes? their multibyte representation
> presents no performance problems

In a way they do -- the way raw bytes are represented (they start with C0 or C1) causes memcmp to sort them between U+007F and U+0080. If we accept that then comparisons are fast since memcmp will compare many character per data-dependent branch. The current code requires several data-dependent branches for each character.

While we could probably bring down the comparison cost slightly by clever hand-coding, it's unlikely to be even nearly as fast as a memcmp and much messier. Since users are unlikely to care much about the ordering between raw bytes and something else (as long as there is an order), it would be a cheap way to improve performance while at the same time fixing the string< / string= mismatch.

> You can compare under the assumption that a unibyte string is
> pure-ASCII until you bump into the first non-ASCII one.  If that
> happens, abandon the comparison, convert the unibyte string to its
> multibyte representation, and compare again.

I don't quite see how that would improve performance but may be missing something.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* bug#58168: string-lessp glitches and inconsistencies
  2022-10-03 19:48           ` Mattias Engdegård
@ 2022-10-04  5:55             ` Eli Zaretskii
  2022-10-04 17:40               ` Richard Stallman
  2022-10-06  9:05               ` Mattias Engdegård
  0 siblings, 2 replies; 40+ messages in thread
From: Eli Zaretskii @ 2022-10-04  5:55 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 58168

> From: Mattias Engdegård <mattias.engdegard@gmail.com>
> Date: Mon, 3 Oct 2022 21:48:14 +0200
> Cc: 58168@debbugs.gnu.org
> 
> 2 okt. 2022 kl. 07.36 skrev Eli Zaretskii <eliz@gnu.org>:
> 
> >> Comparison between objects is not only useful when someone cares about their order, as in presenting a sorted list to the user. Often what is important is an ability to impose an order, preferably total, for use in building and searching data structures. I came across this bug when implementing a string set.
> > 
> > Always converting to multibyte handles this case, doesn't it?
> 
> I don't think it does -- string= treats raw bytes in unibyte and multibyte strings as distinct; converting to multibyte does not preserve (in)equality.

If the fact that string= says strings are not equal, but string-lessp
says they are equal, is what bothers you, we could document that
results of comparing unibyte and multibyte strings are unspecified, or
document explicitly that string= and string-lessp behave differently
in this case.

IOW, I see no reason to worry about 100% consistency here: the order
is _really_ undefined in these cases, and trying to make it defined
will not produce any tangible gains, IMNSHO.  It could very well
produce bugs and regressions, OTOH.  So it sounds like a net loss to
me, in practical terms.

> >> Actually I was talking about multibyte-multibyte comparisons.
> > 
> > Then why did you mention raw bytes? their multibyte representation
> > presents no performance problems
> 
> In a way they do -- the way raw bytes are represented (they start with C0 or C1) causes memcmp to sort them between U+007F and U+0080. If we accept that then comparisons are fast since memcmp will compare many character per data-dependent branch. The current code requires several data-dependent branches for each character.

Once again, slowing down string-lessp when raw-bytes are involved
shouldn't be a problem.  So, if memchr finds a C0 or C1 in a string,
fall back to a slower comparison.  memchr is fast enough to not slow
down the "usual" case.  Would that be a good solution?

Alternatively, we could introduce a new primitive which could assume
multibyte or plain-ASCII unibyte strings without checking, and then
code which is sure raw-bytes cannot happen, and needs to compare long
strings, could use that for speed.

> While we could probably bring down the comparison cost slightly by clever hand-coding, it's unlikely to be even nearly as fast as a memcmp and much messier. Since users are unlikely to care much about the ordering between raw bytes and something else (as long as there is an order), it would be a cheap way to improve performance while at the same time fixing the string< / string= mismatch.

The assumption that "users are unlikely to care" is a pure conjecture,
and we have no way of validating it.  So I don't want us to act on
such an assumption.

What about one of the alternatives above instead?

> > You can compare under the assumption that a unibyte string is
> > pure-ASCII until you bump into the first non-ASCII one.  If that
> > happens, abandon the comparison, convert the unibyte string to its
> > multibyte representation, and compare again.
> 
> I don't quite see how that would improve performance but may be missing something.

Then maybe I didn't understand the performance problems that you had
in mind.  Suppose you describe them in more detail, preferably with
examples?  E.g., are you saying that unibyte strings that are
pure-ASCII also cause performance problems?

^ permalink raw reply	[flat|nested] 40+ messages in thread

* bug#58168: string-lessp glitches and inconsistencies
  2022-10-03 19:48           ` Mattias Engdegård
@ 2022-10-04 10:44             ` Lars Ingebrigtsen
  2022-10-04 11:37             ` Eli Zaretskii
  1 sibling, 0 replies; 40+ messages in thread
From: Lars Ingebrigtsen @ 2022-10-04 10:44 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 58168, Eli Zaretskii

Mattias Engdegård <mattias.engdegard@gmail.com> writes:

> I'm attaching a patch which un-muddles the
> test and adds a display of unprintable Unicode chars such as C1
> controls, in addition to raw bytes. I'd like to adorn the commit with
> the correct bug number so if you remember that of the original
> discussion that would be useful (I never found it very easy to search
> debbugs).

Me neither.  I think the context was something about...  uhm...  cutting
and pasting text displayed as \ef from a terminal, and that being
ambiguous?





^ permalink raw reply	[flat|nested] 40+ messages in thread

* bug#58168: string-lessp glitches and inconsistencies
  2022-10-03 19:48           ` Mattias Engdegård
  2022-10-04 10:44             ` Lars Ingebrigtsen
@ 2022-10-04 11:37             ` Eli Zaretskii
  2022-10-04 14:44               ` Mattias Engdegård
  1 sibling, 1 reply; 40+ messages in thread
From: Eli Zaretskii @ 2022-10-04 11:37 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 58168, larsi

> From: Mattias Engdegård <mattias.engdegard@gmail.com>
> Date: Mon, 3 Oct 2022 21:48:10 +0200
> Cc: 58168@debbugs.gnu.org,
>  Eli Zaretskii <eliz@gnu.org>
> 
> 1 okt. 2022 kl. 15.51 skrev Eli Zaretskii <eliz@gnu.org>:
> 
> > I think the variable is a misnomer of sorts: the request back when it
> > was introduced to display hex where we usually display octal \nnn
> > escapes.  And the latter happens not only for raw bytes.
> 
> Fair enough. Maybe the documentation should reflect that, but I'm still holding out for a change to the C1 presentation in the long term, so...
> 
> I'm not going to pursue this little digression any further except that while looking at it I found a few inaccuracies and a likely bug in redisplay-testsuite.el. I'm attaching a patch which un-muddles the test and adds a display of unprintable Unicode chars such as C1 controls, in addition to raw bytes. I'd like to adorn the commit with the correct bug number so if you remember that of the original discussion that would be useful (I never found it very easy to search debbugs).

First I needed to fix fallout from making STRING_CHAR intolerant of
unibyte text, because redisplay-testsuite caused assertion violations
in string_char_and_length.  (Doesn't it abort for you? or do you not
build Emacs with --enable-checking?)  This was a regression in Emacs
28, sigh.

Looking at your patch, I don't think I understand this part:

> --- a/test/manual/redisplay-testsuite.el
> +++ b/test/manual/redisplay-testsuite.el
> @@ -305,7 +305,7 @@ test-redisplay-5-toggle
>    (let ((label (if display-raw-bytes-as-hex "\\x80" "\\200")))
>      (overlay-put test-redisplay-5a-expected-overlay 'display
>                   (propertize label 'face 'escape-glyph)))
> -  (let ((label (if display-raw-bytes-as-hex "\\x3fffc" "\\777774")))
> +  (let ((label (if display-raw-bytes-as-hex "\\xfc" "\\374")))
>      (overlay-put test-redisplay-5b-expected-overlay 'display
>                   (propertize label 'face 'escape-glyph))))
>  
> @@ -320,18 +320,36 @@ test-redisplay-5
>          (test-insert-overlay " " 'display "\200"))
>    (insert "\n\n")
>    (insert "  Expected: ")
> -  ;; This tests a large codepoint, to make sure the internal buffer we
> -  ;; use to produce the representation is large enough.
> -  (aset printable-chars #x3fffc nil)
>    (setq test-redisplay-5b-expected-overlay
>          (test-insert-overlay " " 'display
> -                             (propertize "\\777774" 'face 'escape-glyph)))
> +                             (propertize "\\374" 'face 'escape-glyph)))

I could understand why you'd want to _add_ the larger values, but why
replace?

As for the bug report which led to display-raw-bytes-as-hex (if that
what you meant) and its discussion, it's bug#27122.  (If you know what
code is in question, it is much easier to find the bug via "git
annotate", assuming the bug number was cited in the commit logs.)





^ permalink raw reply	[flat|nested] 40+ messages in thread

* bug#58168: string-lessp glitches and inconsistencies
  2022-10-04 11:37             ` Eli Zaretskii
@ 2022-10-04 14:44               ` Mattias Engdegård
  2022-10-04 16:24                 ` Eli Zaretskii
  0 siblings, 1 reply; 40+ messages in thread
From: Mattias Engdegård @ 2022-10-04 14:44 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 58168, larsi

4 okt. 2022 kl. 13.37 skrev Eli Zaretskii <eliz@gnu.org>:

> First I needed to fix fallout from making STRING_CHAR intolerant of
> unibyte text, because redisplay-testsuite caused assertion violations
> in string_char_and_length.

Good catch! Just to satisfy my curiosity:

>             error ("Invalid format operation %%%c",
> -                  STRING_CHAR ((unsigned char *) format - 1));
> +                  multibyte_format
> +                  ? STRING_CHAR ((unsigned char *) format - 1)
> +                  : *((unsigned char *) format - 1));

This treats unibyte format strings as if they were Latin-1 for the purpose of the error message. Not very important, of course, but maybe there should be a UNIBYTE_TO_CHAR in the alternative branch?

>  (Doesn't it abort for you? or do you not
> build Emacs with --enable-checking?)

Oh I certainly do that occasionally, but it's mostly when I've changed something at the C level or have reason to believe that something is broken there.

> I could understand why you'd want to _add_ the larger values, but why
> replace?

Because it seemed pretty clear that the old code intended to use #x3ffffc for testing display of raw bytes but a typo turned it into #x3fffc instead which isn't a raw byte but a multibyte character. That it's an easy mistake to make (done so several times myself).

Thus the change fixes that: it now correctly tests #x3ffffc (multibyte raw byte FC) as well as a couple of undisplayable multibyte chars (one C1 control and one astral plane unicode value made undisplayable). Now everything should be described correctly, which wasn't the case before.

> As for the bug report which led to display-raw-bytes-as-hex (if that
> what you meant) and its discussion, it's bug#27122.

Thank you, but I actually meant the one where it was agreed that it was a good idea to display raw bytes and Latin-1 U+0080..009F in the same way. There isn't much of a code trail because it probably never was a conscious decision -- it just ended up being that way -- but apparently the status quo was defended/rationalised at some point.

I've now pushed the patch; the code can be improved further if necessary.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* bug#58168: string-lessp glitches and inconsistencies
  2022-10-04 14:44               ` Mattias Engdegård
@ 2022-10-04 16:24                 ` Eli Zaretskii
  2022-10-06  9:05                   ` Mattias Engdegård
  0 siblings, 1 reply; 40+ messages in thread
From: Eli Zaretskii @ 2022-10-04 16:24 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 58168, larsi

> From: Mattias Engdegård <mattias.engdegard@gmail.com>
> Date: Tue, 4 Oct 2022 16:44:17 +0200
> Cc: larsi@gnus.org,
>  58168@debbugs.gnu.org
> 
> 4 okt. 2022 kl. 13.37 skrev Eli Zaretskii <eliz@gnu.org>:
> 
> > First I needed to fix fallout from making STRING_CHAR intolerant of
> > unibyte text, because redisplay-testsuite caused assertion violations
> > in string_char_and_length.
> 
> Good catch! Just to satisfy my curiosity:
> 
> >             error ("Invalid format operation %%%c",
> > -                  STRING_CHAR ((unsigned char *) format - 1));
> > +                  multibyte_format
> > +                  ? STRING_CHAR ((unsigned char *) format - 1)
> > +                  : *((unsigned char *) format - 1));
> 
> This treats unibyte format strings as if they were Latin-1 for the purpose of the error message.

No, it doesn't.  It shows the problematic characters as raw bytes, as
in "%\200" (where \200 is a single character).  If you see something
different, please show the recipe.

> Not very important, of course, but maybe there should be a UNIBYTE_TO_CHAR in the alternative branch?

No, that would show the multibyte codepoint, and will confuse users,
because the result would look very different from the problematic
format spec in this case.

> >  (Doesn't it abort for you? or do you not
> > build Emacs with --enable-checking?)
> 
> Oh I certainly do that occasionally, but it's mostly when I've changed something at the C level or have reason to believe that something is broken there.

Please _always_ test changes related to encoding/decoding and
character representation conversions in a --enable-checking build.  We
should have discovered these bugs in time for Emacs 28.2 to be devoid
of them.

> > I could understand why you'd want to _add_ the larger values, but why
> > replace?
> 
> Because it seemed pretty clear that the old code intended to use #x3ffffc for testing display of raw bytes but a typo turned it into #x3fffc instead which isn't a raw byte but a multibyte character. That it's an easy mistake to make (done so several times myself).

Who said anything about #x3fffc?  The original code had #xfc, the
unibyte code for #x3ffffc.  I don't see why we shouldn't test both.
In the other problematic hunk you replaced \777774 with \374 -- why?

> I've now pushed the patch; the code can be improved further if necessary.

I've reverted it.  Please stop this madness of rushing into installing
changes that are still under controversy.





^ permalink raw reply	[flat|nested] 40+ messages in thread

* bug#58168: string-lessp glitches and inconsistencies
  2022-10-04  5:55             ` Eli Zaretskii
@ 2022-10-04 17:40               ` Richard Stallman
  2022-10-04 18:07                 ` Eli Zaretskii
  2022-10-06  9:05               ` Mattias Engdegård
  1 sibling, 1 reply; 40+ messages in thread
From: Richard Stallman @ 2022-10-04 17:40 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 58168, mattias.engdegard

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

  > If the fact that string= says strings are not equal, but string-lessp
  > says they are equal, is what bothers you

That result seems paradoxical to me.

I think the way to make sense of it is this: what string-lessp is
really saying is not that the strings are "equal", but rather that
they are lexicographically equivalent.

Perhaps documenting the difference between these two relationships
could make the current behavior comprehensible rather than anomalous.

-- 
Dr Richard Stallman (https://stallman.org)
Chief GNUisance of the GNU Project (https://gnu.org)
Founder, Free Software Foundation (https://fsf.org)
Internet Hall-of-Famer (https://internethalloffame.org)

^ permalink raw reply	[flat|nested] 40+ messages in thread

* bug#58168: string-lessp glitches and inconsistencies
  2022-10-04 17:40               ` Richard Stallman
@ 2022-10-04 18:07                 ` Eli Zaretskii
  0 siblings, 0 replies; 40+ messages in thread
From: Eli Zaretskii @ 2022-10-04 18:07 UTC (permalink / raw)
  To: rms; +Cc: 58168, mattias.engdegard

> From: Richard Stallman <rms@gnu.org>
> Cc: mattias.engdegard@gmail.com, 58168@debbugs.gnu.org
> Date: Tue, 04 Oct 2022 13:40:18 -0400
> 
>   > If the fact that string= says strings are not equal, but string-lessp
>   > says they are equal, is what bothers you
> 
> That result seems paradoxical to me.

Not if you accept that comparing unibyte non-ASCII text with multibyte
text yields inherently unspecified results.

> Perhaps documenting the difference between these two relationships
> could make the current behavior comprehensible rather than anomalous.

Yes, that's one of the alternatives that I think should be on the
table.





^ permalink raw reply	[flat|nested] 40+ messages in thread

* bug#58168: string-lessp glitches and inconsistencies
  2022-10-04  5:55             ` Eli Zaretskii
  2022-10-04 17:40               ` Richard Stallman
@ 2022-10-06  9:05               ` Mattias Engdegård
  2022-10-06 11:06                 ` Eli Zaretskii
  1 sibling, 1 reply; 40+ messages in thread
From: Mattias Engdegård @ 2022-10-06  9:05 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 58168

4 okt. 2022 kl. 07.55 skrev Eli Zaretskii <eliz@gnu.org>:

> If the fact that string= says strings are not equal, but string-lessp
> says they are equal, is what bothers you, we could document that
> results of comparing unibyte and multibyte strings are unspecified, or
> document explicitly that string= and string-lessp behave differently
> in this case.

(It's not just string= but `equal` since they use the same comparison.)
But it's just a part of a set of related problems:

* string< / string= inconsistency
* undesirable string< ordering (unibyte strings are treated as Latin-1)
* bad string< performance 

Ideally we should be able to do something about all three at the same time since they are interrelated. At the very least it's worth a try.

Just documenting the annoying parts won't make them go away -- they still have to be coded around by the user, and it doesn't solve any performance problems either.

> I see no reason to worry about 100% consistency here: the order
> is _really_ undefined in these cases, and trying to make it defined
> will not produce any tangible gains,

Yes it would: better performance and wider applicability. Even when the order isn't defined the user expects there to be some order between distinct strings.

> Once again, slowing down string-lessp when raw-bytes are involved
> shouldn't be a problem.  So, if memchr finds a C0 or C1 in a string,
> fall back to a slower comparison.  memchr is fast enough to not slow
> down the "usual" case.  Would that be a good solution?

There is no reason a comparison should need to look beyond the first mismatch; anything else is just algorithmically slow. Long strings are likely to differ early on. Any hack that has to special-case raw bytes will add costs.

The best we can hope for is hand-written vectorised code that does everything in one pass but it's still slower than just a memcmp.
Even then our chosen semantics make that more difficult (and slower) than it needs to be: for example, we cannot assume that any byte with the high bit set indicates a mismatch when comparing unibyte strings with multibyte, since we equate unibyte chars with Latin-1. It's a decision that we will keep paying for.

> Alternatively, we could introduce a new primitive which could assume
> multibyte or plain-ASCII unibyte strings without checking, and then
> code which is sure raw-bytes cannot happen, and needs to compare long
> strings, could use that for speed.

That or variants thereof are indeed alternatives but users would be forgiven to wonder why we don't make what we have fast instead?

> E.g., are you saying that unibyte strings that are
> pure-ASCII also cause performance problems?

They do because we have no efficient way of ascertaining that they are pure-ASCII. The long-term solution is to make multibyte strings the default in more cases but I'm not proposing such a change right now.

I'll see to where further performance tweaking of the existing code can take us with a reasonable efforts, but there are hard limits to what can be done.
And thank you for your comments!

^ permalink raw reply	[flat|nested] 40+ messages in thread

* bug#58168: string-lessp glitches and inconsistencies
  2022-10-04 16:24                 ` Eli Zaretskii
@ 2022-10-06  9:05                   ` Mattias Engdegård
  2022-10-06 11:13                     ` Eli Zaretskii
  0 siblings, 1 reply; 40+ messages in thread
From: Mattias Engdegård @ 2022-10-06  9:05 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 58168, larsi

4 okt. 2022 kl. 18.24 skrev Eli Zaretskii <eliz@gnu.org>:

>> This treats unibyte format strings as if they were Latin-1 for the purpose of the error message.
> 
> No, it doesn't.  It shows the problematic characters as raw bytes, as
> in "%\200" (where \200 is a single character).  If you see something
> different, please show the recipe.

   (format-message "%\345" 0)
=> (error "Invalid format operation %å")

where the format string is a unibyte string of two bytes, % and 0xFC, yet the error treats it as the Latin-1 character å.

In fact,

   (format-message "%å" 0)

yields the same error string.

>> Not very important, of course, but maybe there should be a UNIBYTE_TO_CHAR in the alternative branch?
> 
> No, that would show the multibyte codepoint, and will confuse users,
> because the result would look very different from the problematic
> format spec in this case.

Yes, that's probably right. I suppose the right solution is something like:

	      unsigned char *p = (unsigned char *) format - 1;
	      if (multibyte_format)
		error ("Invalid format operation %%%c", STRING_CHAR (p));
	      else
		error (*p <= 127 ? "Invalid format operation %%%c"
			         : "Invalid format operation char 0x%02x",
		       *p);

but perhaps it's a rare error not worth the trouble. (If we don't bother changing it, a little comment saying that we are aware of the glitch may be a good idea.)

> Who said anything about #x3fffc?  The original code had #xfc, the
> unibyte code for #x3ffffc.

There seems to be a misunderstanding. The original (and current) code attempts to display char #x3fffc, which is not a raw byte. It's just a typo for #x3ffffc -- not a big deal.

Of course I could have retained the 3fffc under a different label, but everyone else reading the test would just assume it was a typo of 3ffffc since 3fffc itself is not very interesting. I replaced it with 10abcd, a wide Unicode value deliberately chosen to be arbitrary-looking. We could use another value if you prefer.

>  I don't see why we shouldn't test both.
> In the other problematic hunk you replaced \777774 with \374 -- why?

3fffc in octal is 777774; when changed to 3ffffc it becomes a raw byte, fc, displayed as \374.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* bug#58168: string-lessp glitches and inconsistencies
  2022-10-06  9:05               ` Mattias Engdegård
@ 2022-10-06 11:06                 ` Eli Zaretskii
  2022-10-07 14:23                   ` Mattias Engdegård
  0 siblings, 1 reply; 40+ messages in thread
From: Eli Zaretskii @ 2022-10-06 11:06 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 58168

> From: Mattias Engdegård <mattias.engdegard@gmail.com>
> Date: Thu, 6 Oct 2022 11:05:04 +0200
> Cc: 58168@debbugs.gnu.org
> 
> 4 okt. 2022 kl. 07.55 skrev Eli Zaretskii <eliz@gnu.org>:
> 
> > If the fact that string= says strings are not equal, but string-lessp
> > says they are equal, is what bothers you, we could document that
> > results of comparing unibyte and multibyte strings are unspecified, or
> > document explicitly that string= and string-lessp behave differently
> > in this case.
> 
> (It's not just string= but `equal` since they use the same comparison.)
> But it's just a part of a set of related problems:
> 
> * string< / string= inconsistency
> * undesirable string< ordering (unibyte strings are treated as Latin-1)
> * bad string< performance 

That doesn't seem different (and the ordering part is not necessary,
IMO).

> Ideally we should be able to do something about all three at the same time since they are interrelated. At the very least it's worth a try.

It depends on the costs and the risks.  All the rest being equal, yes,
solving those would be desirable.  But it isn't equal, and the costs
and the risks of your proposals outweigh the advantages in my book,
sorry.

> Just documenting the annoying parts won't make them go away -- they still have to be coded around by the user, and it doesn't solve any performance problems either.

That's not a catastrophe, because we are already there (sans the
documentation), and because these cases are rare in real life.

> > I see no reason to worry about 100% consistency here: the order
> > is _really_ undefined in these cases, and trying to make it defined
> > will not produce any tangible gains,
> 
> Yes it would: better performance and wider applicability.

These are not tangible enough IMO.

> Even when the order isn't defined the user expects there to be some order between distinct strings.

No, if the order is undefined, the caller cannot expect any order.
Cf. NaN comparisons with numerical values.

> > Once again, slowing down string-lessp when raw-bytes are involved
> > shouldn't be a problem.  So, if memchr finds a C0 or C1 in a string,
> > fall back to a slower comparison.  memchr is fast enough to not slow
> > down the "usual" case.  Would that be a good solution?
> 
> There is no reason a comparison should need to look beyond the first mismatch; anything else is just algorithmically slow. Long strings are likely to differ early on. Any hack that has to special-case raw bytes will add costs.

You missed me here.  Why are you suddenly talking about mismatches?
And if only mismatches matter here, why is it a problem to use memchr
in the first place?

> > Alternatively, we could introduce a new primitive which could assume
> > multibyte or plain-ASCII unibyte strings without checking, and then
> > code which is sure raw-bytes cannot happen, and needs to compare long
> > strings, could use that for speed.
> 
> That or variants thereof are indeed alternatives but users would be forgiven to wonder why we don't make what we have fast instead?

Because the fast versions can break when the assumptions are false.
We already have similar stuff in encoding/decoding area: there are
fast optimized functions that require the caller to make sure some
assumptions hold.

> > E.g., are you saying that unibyte strings that are
> > pure-ASCII also cause performance problems?
> 
> They do because we have no efficient way of ascertaining that they are pure-ASCII.

If we declare that comparing with unibyte non-ASCII produces
unspecified results, we don't have to worry about that: it becomes the
worry of the caller.

> The long-term solution is to make multibyte strings the default in more cases but I'm not proposing such a change right now.

I don't think we will ever get there, FWIW.  Raw bytes in strings are
a fact of life, whether we like it or not.

> I'll see to where further performance tweaking of the existing code can take us with a reasonable efforts, but there are hard limits to what can be done.
> And thank you for your comments!

Thanks.





^ permalink raw reply	[flat|nested] 40+ messages in thread

* bug#58168: string-lessp glitches and inconsistencies
  2022-10-06  9:05                   ` Mattias Engdegård
@ 2022-10-06 11:13                     ` Eli Zaretskii
  2022-10-06 12:43                       ` Mattias Engdegård
  0 siblings, 1 reply; 40+ messages in thread
From: Eli Zaretskii @ 2022-10-06 11:13 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 58168, larsi

> From: Mattias Engdegård <mattias.engdegard@gmail.com>
> Date: Thu, 6 Oct 2022 11:05:51 +0200
> Cc: larsi@gnus.org,
>  58168@debbugs.gnu.org
> 
> 4 okt. 2022 kl. 18.24 skrev Eli Zaretskii <eliz@gnu.org>:
> 
> >> This treats unibyte format strings as if they were Latin-1 for the purpose of the error message.
> > 
> > No, it doesn't.  It shows the problematic characters as raw bytes, as
> > in "%\200" (where \200 is a single character).  If you see something
> > different, please show the recipe.
> 
>    (format-message "%\345" 0)
> => (error "Invalid format operation %å")

And you want to show %\345 instead?  Are you sure this is not the
consequence of inserting the error message into a multibyte buffer?

> > Who said anything about #x3fffc?  The original code had #xfc, the
> > unibyte code for #x3ffffc.
> 
> There seems to be a misunderstanding. The original (and current) code attempts to display char #x3fffc, which is not a raw byte. It's just a typo for #x3ffffc -- not a big deal.

But your change replaced it with \xfc, which is what I questioned.
Why not test both #x3ffffc and #xfc?  And the same question about
\777777 vs \374.






^ permalink raw reply	[flat|nested] 40+ messages in thread

* bug#58168: string-lessp glitches and inconsistencies
  2022-10-06 11:13                     ` Eli Zaretskii
@ 2022-10-06 12:43                       ` Mattias Engdegård
  2022-10-06 14:34                         ` Eli Zaretskii
  0 siblings, 1 reply; 40+ messages in thread
From: Mattias Engdegård @ 2022-10-06 12:43 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 58168, larsi

6 okt. 2022 kl. 13.13 skrev Eli Zaretskii <eliz@gnu.org>:

>>   (format-message "%\345" 0)
>> => (error "Invalid format operation %å")
> 
> And you want to show %\345 instead?

Maybe, or (as the patch suggested) using a different wording for raw bytes. In any case %å is clearly a lie since that character wasn't in the format string. What would you rather see in such a case?

>  Are you sure this is not the
> consequence of inserting the error message into a multibyte buffer?

Quite sure. The error message is always produced as multibyte and the %c processing done at doprnt.c:471:

	    case 'c':
	      {
		int chr = va_arg (ap, int);
		tem = CHAR_STRING (chr, (unsigned char *) charbuf);

where CHAR_STRING renders chr (the %c argument passed to `error`) as a multibyte char to charbuf here.

>>> Who said anything about #x3fffc?  The original code had #xfc, the
>>> unibyte code for #x3ffffc.
>> 
>> There seems to be a misunderstanding. The original (and current) code attempts to display char #x3fffc, which is not a raw byte. It's just a typo for #x3ffffc -- not a big deal.
> 
> But your change replaced it with \xfc, which is what I questioned.

Oh, I see -- you are looking at the hunk that changed the labels, not the character tested. When 3fffc was changed into 3ffffc, the "expected" string needed to change accordingly; for the latter, it's \xfc or \374 depending on mode.

> Why not test both #x3ffffc and #xfc?  And the same question about
> \777777 vs \374.

Testing #x3ffffc inserts the raw byte #xfc so that takes care of that -- the test already exercised inserting the unibyte raw byte #x80 and the patch didn't change that.

I don't think these two cases actually exercise different paths in redisplay since the buffer is multibyte:

  (insert "\xfc")

and

  (insert (char-to-string #x3ffffc))

should have identical effects on the buffer and hence the display, but it doesn't hurt to have one of each.

\777774 is just octal for #x3fffc which was changed into the (intended) #x3ffffc, and \374 is octal for #xfc which is covered as above.
Thus, the only case actually removed was #x3fffc since it was a typo, and #x10abcd was put in its place.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* bug#58168: string-lessp glitches and inconsistencies
  2022-10-06 12:43                       ` Mattias Engdegård
@ 2022-10-06 14:34                         ` Eli Zaretskii
  2022-10-07 14:45                           ` Mattias Engdegård
  0 siblings, 1 reply; 40+ messages in thread
From: Eli Zaretskii @ 2022-10-06 14:34 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 58168, larsi

> From: Mattias Engdegård <mattias.engdegard@gmail.com>
> Date: Thu, 6 Oct 2022 14:43:09 +0200
> Cc: larsi@gnus.org,
>  58168@debbugs.gnu.org
> 
> 6 okt. 2022 kl. 13.13 skrev Eli Zaretskii <eliz@gnu.org>:
> 
> >>   (format-message "%\345" 0)
> >> => (error "Invalid format operation %å")
> > 
> > And you want to show %\345 instead?
> 
> Maybe, or (as the patch suggested) using a different wording for raw bytes. In any case %å is clearly a lie since that character wasn't in the format string. What would you rather see in such a case?

I don't think it matters much, because whatever we produce we cannot
be sure it will look identical to the original format string.

> Oh, I see -- you are looking at the hunk that changed the labels, not the character tested. When 3fffc was changed into 3ffffc, the "expected" string needed to change accordingly; for the latter, it's \xfc or \374 depending on mode.
> 
> > Why not test both #x3ffffc and #xfc?  And the same question about
> > \777777 vs \374.
> 
> Testing #x3ffffc inserts the raw byte #xfc so that takes care of that -- the test already exercised inserting the unibyte raw byte #x80 and the patch didn't change that.

We are talking about a test suite.  A test suite doesn't have to
assume that the internals work as they should, it should just test
that.  So testing both sounds to me better than testing just one
assuming that this one covers both.

> I don't think these two cases actually exercise different paths in redisplay since the buffer is multibyte:

Again, this kind of considerations don't have a place in a test
suite.  What if tomorrow redisplay code will change to have two
different paths?

> \777774 is just octal for #x3fffc which was changed into the (intended) #x3ffffc, and \374 is octal for #xfc which is covered as above.

Normally, yes.  But this is a test suite...





^ permalink raw reply	[flat|nested] 40+ messages in thread

* bug#58168: string-lessp glitches and inconsistencies
  2022-10-06 11:06                 ` Eli Zaretskii
@ 2022-10-07 14:23                   ` Mattias Engdegård
  2022-10-08  7:35                     ` Eli Zaretskii
  0 siblings, 1 reply; 40+ messages in thread
From: Mattias Engdegård @ 2022-10-07 14:23 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 58168

6 okt. 2022 kl. 13.06 skrev Eli Zaretskii <eliz@gnu.org>:

> Cf. NaN comparisons with numerical values.

Emacs strings are completely different from floats and NaNs in just about every respect; no meaningful parallels can be drawn. (And do believe me when I say that we should be thankful for that.)

> You missed me here.  Why are you suddenly talking about mismatches?
> And if only mismatches matter here, why is it a problem to use memchr
> in the first place?

Any lexicographic comparison is a matter of finding the first point of difference, then interpreting the difference at that point. `memchr` does not help with that, nor does `memcmp` unless we are doing a bytewise string comparison.

I've pushed a further optimisation of the comparison between arbitrary multibyte strings; it's now sufficiently fast that I no longer worry much about it (on my machine asymptotically 10× faster than before, yet 30 % slower than memcmp).

Similar improvements could be made to the comparison between unibyte and non-ASCII multibyte strings. These are less common and not quite as slow; I haven't made up my mind about whether it's worth the trouble.

Again it's complicated by the equivalence between unibyte and Latin-1: it means that two equal strings can have different byte patterns, which in turn makes it much more expensive to find the longest common prefix. (Thus this point is of practical interest in several respects.)

In any case, the situation is now better than it was before the bug was opened: string< is faster and the remaining problems have at least been chartered, whether or not an agreement to remedy them can be reached. Let's be happy about this!

^ permalink raw reply	[flat|nested] 40+ messages in thread

* bug#58168: string-lessp glitches and inconsistencies
  2022-10-06 14:34                         ` Eli Zaretskii
@ 2022-10-07 14:45                           ` Mattias Engdegård
  2022-10-07 15:33                             ` Eli Zaretskii
  0 siblings, 1 reply; 40+ messages in thread
From: Mattias Engdegård @ 2022-10-07 14:45 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 58168, larsi

[-- Attachment #1: Type: text/plain, Size: 233 bytes --]

6 okt. 2022 kl. 16.34 skrev Eli Zaretskii <eliz@gnu.org>:

> I don't think it matters much, because whatever we produce we cannot
> be sure it will look identical to the original format string.

I agree. What about this patch then?


[-- Attachment #2: format-error.diff --]
[-- Type: application/octet-stream, Size: 955 bytes --]

diff --git a/src/editfns.c b/src/editfns.c
index c1414071c7..5a99a40656 100644
--- a/src/editfns.c
+++ b/src/editfns.c
@@ -3551,10 +3551,15 @@ styled_format (ptrdiff_t nargs, Lisp_Object *args, bool message)
 		      || float_conversion || conversion == 'i'
 		      || conversion == 'o' || conversion == 'x'
 		      || conversion == 'X'))
-	    error ("Invalid format operation %%%c",
-		   multibyte_format
-		   ? STRING_CHAR ((unsigned char *) format - 1)
-		   : *((unsigned char *) format - 1));
+	    {
+	      unsigned char *p = (unsigned char *) format - 1;
+	      if (multibyte_format)
+		error ("Invalid format operation %%%c", STRING_CHAR (p));
+	      else
+		error ((*p <= 127 ? "Invalid format operation %%%c"
+			: "Invalid format operation char #x%02x"),
+		       *p);
+	    }
 	  else if (! (FIXNUMP (arg) || ((BIGNUMP (arg) || FLOATP (arg))
 					&& conversion != 'c')))
 	    error ("Format specifier doesn't match argument type");

[-- Attachment #3: Type: text/plain, Size: 774 bytes --]



Not very important, but at least it should be a (minor) improvement.

> A test suite doesn't have to
> assume that the internals work as they should, it should just test
> that.  So testing both sounds to me better than testing just one
> assuming that this one covers both.

I agree, we should definitely test raw bytes inserted from both unibyte and multibyte strings. The current code doesn't do that, hence my patch.

With respect to other values there should be a reasonable qualitative difference between test cases: testing both #x80 and #xfc isn't more useful than testing the display of both the letters 'A' and 'B'. But if you think that the current code is deficient in coverage, please do propose extensions and I'll adapt the patch accordingly.


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* bug#58168: string-lessp glitches and inconsistencies
  2022-10-07 14:45                           ` Mattias Engdegård
@ 2022-10-07 15:33                             ` Eli Zaretskii
  2022-10-08 17:13                               ` Mattias Engdegård
  0 siblings, 1 reply; 40+ messages in thread
From: Eli Zaretskii @ 2022-10-07 15:33 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 58168, larsi

> From: Mattias Engdegård <mattias.engdegard@gmail.com>
> Date: Fri, 7 Oct 2022 16:45:57 +0200
> Cc: larsi@gnus.org,
>  58168@debbugs.gnu.org
> 
> > I don't think it matters much, because whatever we produce we cannot
> > be sure it will look identical to the original format string.
> 
> I agree. What about this patch then?
> 
> diff --git a/src/editfns.c b/src/editfns.c
> index c1414071c7..5a99a40656 100644
> --- a/src/editfns.c
> +++ b/src/editfns.c
> @@ -3551,10 +3551,15 @@ styled_format (ptrdiff_t nargs, Lisp_Object *args, bool message)
>  		      || float_conversion || conversion == 'i'
>  		      || conversion == 'o' || conversion == 'x'
>  		      || conversion == 'X'))
> -	    error ("Invalid format operation %%%c",
> -		   multibyte_format
> -		   ? STRING_CHAR ((unsigned char *) format - 1)
> -		   : *((unsigned char *) format - 1));
> +	    {
> +	      unsigned char *p = (unsigned char *) format - 1;
> +	      if (multibyte_format)
> +		error ("Invalid format operation %%%c", STRING_CHAR (p));
> +	      else
> +		error ((*p <= 127 ? "Invalid format operation %%%c"
> +			: "Invalid format operation char #x%02x"),
> +		       *p);
> +	    }
>  	  else if (! (FIXNUMP (arg) || ((BIGNUMP (arg) || FLOATP (arg))
>  					&& conversion != 'c')))
>  	    error ("Format specifier doesn't match argument type");

I'd prefer to show it in octal, not in hex, since that's the default
in Emacs.

> > A test suite doesn't have to
> > assume that the internals work as they should, it should just test
> > that.  So testing both sounds to me better than testing just one
> > assuming that this one covers both.
> 
> I agree, we should definitely test raw bytes inserted from both unibyte and multibyte strings. The current code doesn't do that, hence my patch.
> 
> With respect to other values there should be a reasonable qualitative difference between test cases: testing both #x80 and #xfc isn't more useful than testing the display of both the letters 'A' and 'B'. But if you think that the current code is deficient in coverage, please do propose extensions and I'll adapt the patch accordingly.

I think the more values we test the better.  The codepoints between
#x80 and #xA0 in particular are treated differently than those above
#xA0, so at least one from each range should be beneficial.





^ permalink raw reply	[flat|nested] 40+ messages in thread

* bug#58168: string-lessp glitches and inconsistencies
  2022-10-07 14:23                   ` Mattias Engdegård
@ 2022-10-08  7:35                     ` Eli Zaretskii
  2022-10-14 14:39                       ` Mattias Engdegård
  0 siblings, 1 reply; 40+ messages in thread
From: Eli Zaretskii @ 2022-10-08  7:35 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 58168

> From: Mattias Engdegård <mattias.engdegard@gmail.com>
> Date: Fri, 7 Oct 2022 16:23:26 +0200
> Cc: 58168@debbugs.gnu.org
> 
> 6 okt. 2022 kl. 13.06 skrev Eli Zaretskii <eliz@gnu.org>:
> 
> > Cf. NaN comparisons with numerical values.
> 
> Emacs strings are completely different from floats and NaNs in just about every respect; no meaningful parallels can be drawn. (And do believe me when I say that we should be thankful for that.)

I'm totally aware that NaNs and unibyte strings are completely
different beasts, believe me.  I was just pointing out another
widespread case where comparison results are surprising and order is
not defined.  My point is that it isn't an unimaginable situation.

> > You missed me here.  Why are you suddenly talking about mismatches?
> > And if only mismatches matter here, why is it a problem to use memchr
> > in the first place?
> 
> Any lexicographic comparison is a matter of finding the first point of difference, then interpreting the difference at that point. `memchr` does not help with that, nor does `memcmp` unless we are doing a bytewise string comparison.

Wed are miscommunicating, because you remove too much of previous
context.  I suggested to use memchr to find whether a string has any
C0 or C1 bytes, _before_ doing the actual comparison, to find out
whether a multibyte string includes any raw bytes, which would then
require slower comparisons.  If there are no C0/C1 bytes, you could
use memcmp, which is always faster than hand-made word-wise comparison
we have there now.

I also suggested to try memmem as yet another possibility -- not sure
up front whether it can be faster in cases that matter.

> Similar improvements could be made to the comparison between unibyte and non-ASCII multibyte strings. These are less common and not quite as slow; I haven't made up my mind about whether it's worth the trouble.

I don't think it's worth the trouble.

> In any case, the situation is now better than it was before the bug was opened: string< is faster and the remaining problems have at least been chartered, whether or not an agreement to remedy them can be reached. Let's be happy about this!

This is me being happy.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* bug#58168: string-lessp glitches and inconsistencies
  2022-10-07 15:33                             ` Eli Zaretskii
@ 2022-10-08 17:13                               ` Mattias Engdegård
  0 siblings, 0 replies; 40+ messages in thread
From: Mattias Engdegård @ 2022-10-08 17:13 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 58168, Lars Ingebrigtsen

[-- Attachment #1: Type: text/plain, Size: 549 bytes --]

7 okt. 2022 kl. 17.33 skrev Eli Zaretskii <eliz@gnu.org>:

> I'd prefer to show it in octal, not in hex

Then octal it is, and the patch has now been pushed with that modification.

> I think the more values we test the better.  The codepoints between
> #x80 and #xA0 in particular are treated differently than those above
> #xA0, so at least one from each range should be beneficial.

The patch already does ensure that we test one from each (#x80 and #xFC); I've now added a third to alleviate your concerns.
Updated patch attached.


[-- Attachment #2: 0001-Improve-manual-display-tests-of-undisplayable-chars-.patch --]
[-- Type: application/octet-stream, Size: 4714 bytes --]

From 56a4d76c7f0dfa6a552c9a49f345c09eabc999f5 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Fri, 7 Oct 2022 16:41:38 +0200
Subject: [PATCH] Improve manual display tests of undisplayable chars
 (bug#58168)

Test display of multibyte raw bytes, as well as undisplayable
multibyte chars (C1 controls and other values).

The test still assumes that raw bytes should be displayed identically
to undisplayable characters (such as C1 controls) because that is how
the display code currently works.

* test/manual/redisplay-testsuite.el
(test-redisplay-5c-expected-overlay)
(test-redisplay-5c-result-overlay)
(test-redisplay-5-toggle, test-redisplay-5): Fix typo (#x3fffc) of raw
byte value.  Add new test value (#xd7).
(test-redisplay-6): New.
---
 test/manual/redisplay-testsuite.el | 48 +++++++++++++++++++++++++-----
 1 file changed, 40 insertions(+), 8 deletions(-)

diff --git a/test/manual/redisplay-testsuite.el b/test/manual/redisplay-testsuite.el
index 01b0a895a4..7449463b44 100644
--- a/test/manual/redisplay-testsuite.el
+++ b/test/manual/redisplay-testsuite.el
@@ -298,6 +298,8 @@ test-redisplay-5a-expected-overlay
 (defvar test-redisplay-5a-result-overlay nil)
 (defvar test-redisplay-5b-expected-overlay nil)
 (defvar test-redisplay-5b-result-overlay nil)
+(defvar test-redisplay-5c-expected-overlay nil)
+(defvar test-redisplay-5c-result-overlay nil)
 
 (defun test-redisplay-5-toggle (_event)
   (interactive "e")
@@ -305,8 +307,11 @@ test-redisplay-5-toggle
   (let ((label (if display-raw-bytes-as-hex "\\x80" "\\200")))
     (overlay-put test-redisplay-5a-expected-overlay 'display
                  (propertize label 'face 'escape-glyph)))
-  (let ((label (if display-raw-bytes-as-hex "\\x3fffc" "\\777774")))
+  (let ((label (if display-raw-bytes-as-hex "\\xfc" "\\374")))
     (overlay-put test-redisplay-5b-expected-overlay 'display
+                 (propertize label 'face 'escape-glyph)))
+  (let ((label (if display-raw-bytes-as-hex "\\xd7" "\\327")))
+    (overlay-put test-redisplay-5c-expected-overlay 'display
                  (propertize label 'face 'escape-glyph))))
 
 (defun test-redisplay-5 ()
@@ -320,18 +325,44 @@ test-redisplay-5
         (test-insert-overlay " " 'display "\200"))
   (insert "\n\n")
   (insert "  Expected: ")
-  ;; This tests a large codepoint, to make sure the internal buffer we
-  ;; use to produce the representation is large enough.
-  (aset printable-chars #x3fffc nil)
   (setq test-redisplay-5b-expected-overlay
         (test-insert-overlay " " 'display
-                             (propertize "\\777774" 'face 'escape-glyph)))
+                             (propertize "\\374" 'face 'escape-glyph)))
   (insert "\n    Result: ")
   (setq test-redisplay-5b-result-overlay
-        (test-insert-overlay " " 'display (char-to-string #x3fffc)))
+        (test-insert-overlay " " 'display (char-to-string #x3ffffc)))
   (insert "\n\n")
-  (insert-button "Toggle between octal and hex display"
-                 'action 'test-redisplay-5-toggle))
+  (insert "  Expected: ")
+  (setq test-redisplay-5c-expected-overlay
+        (test-insert-overlay " " 'display
+                             (propertize "\\327" 'face 'escape-glyph)))
+  (insert "\n    Result: ")
+  (setq test-redisplay-5c-result-overlay
+        (test-insert-overlay " " 'display "\xd7"))
+  (insert "\n\n")
+  (insert-button "Toggle between octal and hex display for raw bytes"
+                 'action 'test-redisplay-5-toggle)
+  (insert "\n\n"))
+
+(defun test-redisplay-6 ()
+  (insert "Test 6: Display of unprintable Unicode chars:\n\n")
+  (insert "  Expected: ")
+  (test-insert-overlay " " 'display
+                       (propertize "\\200" 'face 'escape-glyph))
+  (insert "  (representing U+0100)")
+  (insert "\n    Result: ")
+  (test-insert-overlay " " 'display "\u0080")
+  (insert "\n\n")
+  ;; This tests a large codepoint, to make sure the internal buffer we
+  ;; use to produce the representation is large enough.
+  (insert "  Expected: ")
+  (aset printable-chars #x10abcd nil)
+  (test-insert-overlay " " 'display
+                       (propertize "\\4125715" 'face 'escape-glyph))
+  (insert "  (representing U+0010ABCD)")
+  (insert "\n    Result: ")
+  (test-insert-overlay " " 'display "\U0010ABCD")
+  (insert "\n\n"))
 
 (defun test-redisplay ()
   (interactive)
@@ -349,6 +380,7 @@ test-redisplay
     (test-redisplay-3)
     (test-redisplay-4)
     (test-redisplay-5)
+    (test-redisplay-6)
     (goto-char (point-min))))
 
 ;;; redisplay-testsuite.el ends here
-- 
2.32.0 (Apple Git-132)


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* bug#58168: string-lessp glitches and inconsistencies
  2022-10-08  7:35                     ` Eli Zaretskii
@ 2022-10-14 14:39                       ` Mattias Engdegård
  2022-10-14 15:31                         ` Eli Zaretskii
  0 siblings, 1 reply; 40+ messages in thread
From: Mattias Engdegård @ 2022-10-14 14:39 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 58168

As performance is more acceptable now I'm not going to take any further action with respect of string<, but let me just answer your questions:

8 okt. 2022 kl. 09.35 skrev Eli Zaretskii <eliz@gnu.org>:

> I suggested to use memchr to find whether a string has any
> C0 or C1 bytes, _before_ doing the actual comparison, to find out
> whether a multibyte string includes any raw bytes, which would then
> require slower comparisons.

That isn't practical; we would traverse each argument in full, twice, even if there is a difference early on. While memchr is fast for what it does, it will still need to look at every bit of its input.

> I also suggested to try memmem as yet another possibility

There is no reason to believe that this would help in any way.

> This is me being happy.

Thank you, let's both rejoice then! Again, the improvements and analyses have been worthwhile.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* bug#58168: string-lessp glitches and inconsistencies
  2022-10-14 14:39                       ` Mattias Engdegård
@ 2022-10-14 15:31                         ` Eli Zaretskii
  2022-10-17 12:44                           ` Mattias Engdegård
  0 siblings, 1 reply; 40+ messages in thread
From: Eli Zaretskii @ 2022-10-14 15:31 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 58168

> From: Mattias Engdegård <mattias.engdegard@gmail.com>
> Date: Fri, 14 Oct 2022 16:39:55 +0200
> Cc: 58168@debbugs.gnu.org
> 
> As performance is more acceptable now I'm not going to take any further action with respect of string<, but let me just answer your questions:
> 
> 8 okt. 2022 kl. 09.35 skrev Eli Zaretskii <eliz@gnu.org>:
> 
> > I suggested to use memchr to find whether a string has any
> > C0 or C1 bytes, _before_ doing the actual comparison, to find out
> > whether a multibyte string includes any raw bytes, which would then
> > require slower comparisons.
> 
> That isn't practical; we would traverse each argument in full, twice, even if there is a difference early on. While memchr is fast for what it does, it will still need to look at every bit of its input.

I fail to see how the number of times we'd traverse the strings is of
concern, as long as it's fast enough.  And memchr _is_ very fast.





^ permalink raw reply	[flat|nested] 40+ messages in thread

* bug#58168: string-lessp glitches and inconsistencies
  2022-10-14 15:31                         ` Eli Zaretskii
@ 2022-10-17 12:44                           ` Mattias Engdegård
  0 siblings, 0 replies; 40+ messages in thread
From: Mattias Engdegård @ 2022-10-17 12:44 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 58168

14 okt. 2022 kl. 17.31 skrev Eli Zaretskii <eliz@gnu.org>:

> I fail to see how the number of times we'd traverse the strings is of
> concern, as long as it's fast enough.

Unfortunately it isn't. Always reading the whole string wouldn't give us decent string comparison performance. (Doing it twice doesn't help.)

If the current code turns out to not to be fast enough then other well-known techniques can be applied, but I'm done for now.

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2022-10-17 12:44 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2022-09-29 16:24 bug#58168: string-lessp glitches and inconsistencies Mattias Engdegård
2022-09-29 17:00 ` Mattias Engdegård
2022-09-29 17:11 ` Eli Zaretskii
2022-09-30 20:04   ` Mattias Engdegård
2022-10-01  5:22     ` Eli Zaretskii
2022-10-01 19:57       ` Mattias Engdegård
2022-10-02  5:36         ` Eli Zaretskii
2022-10-03 19:48           ` Mattias Engdegård
2022-10-04  5:55             ` Eli Zaretskii
2022-10-04 17:40               ` Richard Stallman
2022-10-04 18:07                 ` Eli Zaretskii
2022-10-06  9:05               ` Mattias Engdegård
2022-10-06 11:06                 ` Eli Zaretskii
2022-10-07 14:23                   ` Mattias Engdegård
2022-10-08  7:35                     ` Eli Zaretskii
2022-10-14 14:39                       ` Mattias Engdegård
2022-10-14 15:31                         ` Eli Zaretskii
2022-10-17 12:44                           ` Mattias Engdegård
2022-09-30 13:52 ` Lars Ingebrigtsen
2022-09-30 20:12   ` Mattias Engdegård
2022-10-01  5:34     ` Eli Zaretskii
2022-10-01 11:51       ` Mattias Engdegård
2022-10-01 10:02     ` Lars Ingebrigtsen
2022-10-01 10:12       ` Eli Zaretskii
2022-10-01 13:37       ` Mattias Engdegård
2022-10-01 13:43         ` Lars Ingebrigtsen
2022-10-03 19:48           ` Mattias Engdegård
2022-10-04 10:44             ` Lars Ingebrigtsen
2022-10-04 11:37             ` Eli Zaretskii
2022-10-04 14:44               ` Mattias Engdegård
2022-10-04 16:24                 ` Eli Zaretskii
2022-10-06  9:05                   ` Mattias Engdegård
2022-10-06 11:13                     ` Eli Zaretskii
2022-10-06 12:43                       ` Mattias Engdegård
2022-10-06 14:34                         ` Eli Zaretskii
2022-10-07 14:45                           ` Mattias Engdegård
2022-10-07 15:33                             ` Eli Zaretskii
2022-10-08 17:13                               ` Mattias Engdegård
2022-10-01 13:51         ` Eli Zaretskii
2022-10-01  5:30   ` Eli Zaretskii

Code repositories for project(s) associated with this external index

	https://git.savannah.gnu.org/cgit/emacs.git
	https://git.savannah.gnu.org/cgit/emacs/org-mode.git

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.