* bug#58168: string-lessp glitches and inconsistencies @ 2022-09-29 16:24 Mattias Engdegård 2022-09-29 17:00 ` Mattias Engdegård ` (2 more replies) 0 siblings, 3 replies; 40+ messages in thread From: Mattias Engdegård @ 2022-09-29 16:24 UTC (permalink / raw) To: 58168 We really want string< to be consistent with string= and itself since this is fundamental for string ordering in searching and sorting applications. This means that for any pair of strings A and B, we should either have A<B, B<A or A=B. Unfortunately: (let* ((a "ü") (b "\xfc")) (list (string= a b) (string< a b) (string< b a))) => (nil nil nil) because string< considers the unibyte raw byte 0xFC and the multibyte char U+00FC to be the same, but string= thinks they are different. We also distinguish raw bytes by multibyte-ness: (let* ((u "\x80") (m (string-to-multibyte u))) (list (string= u m) (string< u m) (string< m u))) => (nil t nil) but this is a minor annoyance that we can live with: we strongly want string= to remain consistent with `equal` for strings. So, what can be done? The current string< implementation uses the character order ASCII < ub raw 80..FF = mb U+0080..U+00FF < U+0100..10FFFF < mb raw 80..FF in conflict with string= which unifies unibyte and multibyte ASCII but not raw bytes and Latin-1. It suggests the following alternative collation orders: A. ASCII < ub raw 80..FF < mb U+0080..10FFFF < mb raw 80..FF which puts all non-ASCII multibyte chars after unibyte. B. ASCII < ub raw 80..FF < mb raw 80..FF < mb U+0080..10FFFF which inserts multibyte raw bytes after the unibyte ones, permitting any ub-ub and mb-mb comparisons to be made using memcmp, and a slow decoding loop only required for unibyte against non-ASCII multibyte strings. C. ASCII < mb U+0080..10FFFF < mb raw 80..FF < ub raw 80..FF which instead moves unibyte raw bytes to after the multibyte raw range. This has the same memcmp benefit as alternative B, but may be slightly faster for ub-mb comparisons since only unibyte 80..FF need to be remapped. Any particular preference? Otherwise, I'll go with B or C, depending on what the resulting code looks like. ^ permalink raw reply [flat|nested] 40+ messages in thread
* bug#58168: string-lessp glitches and inconsistencies 2022-09-29 16:24 bug#58168: string-lessp glitches and inconsistencies Mattias Engdegård @ 2022-09-29 17:00 ` Mattias Engdegård 2022-09-29 17:11 ` Eli Zaretskii 2022-09-30 13:52 ` Lars Ingebrigtsen 2 siblings, 0 replies; 40+ messages in thread From: Mattias Engdegård @ 2022-09-29 17:00 UTC (permalink / raw) To: 58168 29 sep. 2022 kl. 18.24 skrev Mattias Engdegård <mattias.engdegard@gmail.com>: > C. ASCII < mb U+0080..10FFFF < mb raw 80..FF < ub raw 80..FF > > which instead moves unibyte raw bytes to after the multibyte raw range. This has the same memcmp benefit as alternative B Actually it doesn't -- editing mistake, sorry. Using memcmp for arbitrary multibyte strings requires collating raw bytes between ASCII and other Unicode. ^ permalink raw reply [flat|nested] 40+ messages in thread
* bug#58168: string-lessp glitches and inconsistencies 2022-09-29 16:24 bug#58168: string-lessp glitches and inconsistencies Mattias Engdegård 2022-09-29 17:00 ` Mattias Engdegård @ 2022-09-29 17:11 ` Eli Zaretskii 2022-09-30 20:04 ` Mattias Engdegård 2022-09-30 13:52 ` Lars Ingebrigtsen 2 siblings, 1 reply; 40+ messages in thread From: Eli Zaretskii @ 2022-09-29 17:11 UTC (permalink / raw) To: Mattias Engdegård; +Cc: 58168 > From: Mattias Engdegård <mattias.engdegard@gmail.com> > Date: Thu, 29 Sep 2022 18:24:04 +0200 > > We really want string< to be consistent with string= and itself since this is fundamental for string ordering in searching and sorting applications. > This means that for any pair of strings A and B, we should either have A<B, B<A or A=B. > > Unfortunately: > > (let* ((a "ü") > (b "\xfc")) > (list (string= a b) > (string< a b) > (string< b a))) > => (nil nil nil) > > because string< considers the unibyte raw byte 0xFC and the multibyte char U+00FC to be the same, but string= thinks they are different. Why do we care? Unibyte strings should never be compared with multibyte, unless they are both pure-ASCII. > So, what can be done? The current string< implementation uses the character order > > ASCII < ub raw 80..FF = mb U+0080..U+00FF < U+0100..10FFFF < mb raw 80..FF > > in conflict with string= which unifies unibyte and multibyte ASCII but not raw bytes and Latin-1. It would be unimaginable to unify raw bytes with Latin-1. Raw bytes are not Latin-1 characters, they can stand for any characters, or for no characters at all. > It suggests the following alternative collation orders: > > A. ASCII < ub raw 80..FF < mb U+0080..10FFFF < mb raw 80..FF > > which puts all non-ASCII multibyte chars after unibyte. > > B. ASCII < ub raw 80..FF < mb raw 80..FF < mb U+0080..10FFFF > > which inserts multibyte raw bytes after the unibyte ones, permitting any ub-ub and mb-mb comparisons to be made using memcmp, and a slow decoding loop only required for unibyte against non-ASCII multibyte strings. > > C. ASCII < mb U+0080..10FFFF < mb raw 80..FF < ub raw 80..FF Neither, IMNSHO. Unibyte characters don't belong to this order. They should be converted to multibyte representation to be sensibly comparable. > Otherwise, I'll go with B or C, depending on what the resulting code looks like. Please don't. Let's first decide that we want to change this, and what are the reasons for that. Theoretical "impurity" doesn't count, IMO. ^ permalink raw reply [flat|nested] 40+ messages in thread
* bug#58168: string-lessp glitches and inconsistencies 2022-09-29 17:11 ` Eli Zaretskii @ 2022-09-30 20:04 ` Mattias Engdegård 2022-10-01 5:22 ` Eli Zaretskii 0 siblings, 1 reply; 40+ messages in thread From: Mattias Engdegård @ 2022-09-30 20:04 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 58168 29 sep. 2022 kl. 19.11 skrev Eli Zaretskii <eliz@gnu.org>: > Unibyte strings should never be compared with > multibyte, unless they are both pure-ASCII. It's perfectly fine to compare "Madrid" (unibyte) with "Málaga" (non-ASCII multibyte). If you mean that all strings (literals in particular) should be multibyte by default then I agree and at some point we should take that step, but it would be quite a breaking change. Perhaps less in practice than we fear, though... >> So, what can be done? The current string< implementation uses the character order >> >> ASCII < ub raw 80..FF = mb U+0080..U+00FF < U+0100..10FFFF < mb raw 80..FF >> >> in conflict with string= which unifies unibyte and multibyte ASCII but not raw bytes and Latin-1. > > It would be unimaginable to unify raw bytes with Latin-1. Raw bytes > are not Latin-1 characters, they can stand for any characters, or for > no characters at all. Completely agreed! Let's try to fix that, then. > Unibyte characters don't belong to this order. They > should be converted to multibyte representation to be sensibly > comparable. Oh I agree to some extent but we can't really raise an error if someone tries so we might as well return something reasonable and coherent. Besides, there are more good reasons for ordering strings (both multibyte and unibyte) than might be apparent at first. Working from the assumption that we can't change string= to equate raw bytes in unibyte and multibyte strings, we need to invent an order between normally incommensurate values which sounds odd but is actually fine; this is occasionally done and can be quite useful. It's also a matter of performance -- string< has been improved recently but currently we compare text in Latin and Swahili much faster than French and Arabic; it would be nice to close that gap. UTF-8 is designed so that comparing strings by scalar values can be done byte-wise, but the way we encode raw bytes make them sort right between ASCII and Latin-1. Given that the specific order doesn't matter much, we could just run with that. ^ permalink raw reply [flat|nested] 40+ messages in thread
* bug#58168: string-lessp glitches and inconsistencies 2022-09-30 20:04 ` Mattias Engdegård @ 2022-10-01 5:22 ` Eli Zaretskii 2022-10-01 19:57 ` Mattias Engdegård 0 siblings, 1 reply; 40+ messages in thread From: Eli Zaretskii @ 2022-10-01 5:22 UTC (permalink / raw) To: Mattias Engdegård; +Cc: 58168 > From: Mattias Engdegård <mattias.engdegard@gmail.com> > Date: Fri, 30 Sep 2022 22:04:47 +0200 > Cc: 58168@debbugs.gnu.org > > 29 sep. 2022 kl. 19.11 skrev Eli Zaretskii <eliz@gnu.org>: > > > Unibyte strings should never be compared with > > multibyte, unless they are both pure-ASCII. > > It's perfectly fine to compare "Madrid" (unibyte) with "Málaga" (non-ASCII multibyte). Not relevant: I meant unibyte non-ASCII strings. The ASCII case is easy and un-problematic, and is really just a straw-man here. > If you mean that all strings (literals in particular) should be multibyte by default then I agree and at some point we should take that step, but it would be quite a breaking change. Perhaps less in practice than we fear, though... That's not what I meant. I think unibyte strings are with us for the observable future. > > Unibyte characters don't belong to this order. They > > should be converted to multibyte representation to be sensibly > > comparable. > > Oh I agree to some extent but we can't really raise an error if someone tries so we might as well return something reasonable and coherent. It depends on the use case, but in general I see no problem with signaling errors when we cannot produce reasonably correct results. For example, string-to-unibyte does signal an error in some cases. > Besides, there are more good reasons for ordering strings (both multibyte and unibyte) than might be apparent at first. Examples, please. > Working from the assumption that we can't change string= to equate raw bytes in unibyte and multibyte strings, we need to invent an order between normally incommensurate values I don't agree with the conclusion. It is not the only possible conclusion. Signaling an error is another one, and I'm sure we could think of more. > It's also a matter of performance -- string< has been improved recently but currently we compare text in Latin and Swahili much faster than French and Arabic; it would be nice to close that gap. UTF-8 is designed so that comparing strings by scalar values can be done byte-wise, but the way we encode raw bytes make them sort right between ASCII and Latin-1. Given that the specific order doesn't matter much, we could just run with that. I see no reason to make comparison of unibyte and multibyte strings perform better. ^ permalink raw reply [flat|nested] 40+ messages in thread
* bug#58168: string-lessp glitches and inconsistencies 2022-10-01 5:22 ` Eli Zaretskii @ 2022-10-01 19:57 ` Mattias Engdegård 2022-10-02 5:36 ` Eli Zaretskii 0 siblings, 1 reply; 40+ messages in thread From: Mattias Engdegård @ 2022-10-01 19:57 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 58168 1 okt. 2022 kl. 07.22 skrev Eli Zaretskii <eliz@gnu.org>: > It depends on the use case, but in general I see no problem with > signaling errors when we cannot produce reasonably correct results. > For example, string-to-unibyte does signal an error in some cases. That's fine because that function is documented to do so and always has, but making previously possible comparisons raise errors shouldn't be done lightly. Comparison between objects is not only useful when someone cares about their order, as in presenting a sorted list to the user. Often what is important is an ability to impose an order, preferably total, for use in building and searching data structures. I came across this bug when implementing a string set. >> It's also a matter of performance -- string< has been improved recently but currently we compare text in Latin and Swahili much faster than French and Arabic; it would be nice to close that gap. UTF-8 is designed so that comparing strings by scalar values can be done byte-wise, but the way we encode raw bytes make them sort right between ASCII and Latin-1. Given that the specific order doesn't matter much, we could just run with that. > > I see no reason to make comparison of unibyte and multibyte strings > perform better. Actually I was talking about multibyte-multibyte comparisons. You were probably thinking about comparisons between unibyte strings that contain raw bytes and multibyte strings, and those are indeed not very performance-sensitive. However there is no way to detect whether a unibyte string contains non-ASCII chars without looking at every byte, and comparing unibyte ASCII with multibyte is definitely of interest. Strings are still unibyte by default. ^ permalink raw reply [flat|nested] 40+ messages in thread
* bug#58168: string-lessp glitches and inconsistencies 2022-10-01 19:57 ` Mattias Engdegård @ 2022-10-02 5:36 ` Eli Zaretskii 2022-10-03 19:48 ` Mattias Engdegård 0 siblings, 1 reply; 40+ messages in thread From: Eli Zaretskii @ 2022-10-02 5:36 UTC (permalink / raw) To: Mattias Engdegård; +Cc: 58168 > From: Mattias Engdegård <mattias.engdegard@gmail.com> > Date: Sat, 1 Oct 2022 21:57:45 +0200 > Cc: 58168@debbugs.gnu.org > > 1 okt. 2022 kl. 07.22 skrev Eli Zaretskii <eliz@gnu.org>: > > > It depends on the use case, but in general I see no problem with > > signaling errors when we cannot produce reasonably correct results. > > For example, string-to-unibyte does signal an error in some cases. > > That's fine because that function is documented to do so and always has, but making previously possible comparisons raise errors shouldn't be done lightly. I didn't say "lightly", nor do I think so. We need to discuss specific use cases. An alternative is to always convert unibyte non-ASCII strings to their multibyte representation before comparing. > Comparison between objects is not only useful when someone cares about their order, as in presenting a sorted list to the user. Often what is important is an ability to impose an order, preferably total, for use in building and searching data structures. I came across this bug when implementing a string set. Always converting to multibyte handles this case, doesn't it? > >> It's also a matter of performance -- string< has been improved recently but currently we compare text in Latin and Swahili much faster than French and Arabic; it would be nice to close that gap. UTF-8 is designed so that comparing strings by scalar values can be done byte-wise, but the way we encode raw bytes make them sort right between ASCII and Latin-1. Given that the specific order doesn't matter much, we could just run with that. > > > > I see no reason to make comparison of unibyte and multibyte strings > > perform better. > > Actually I was talking about multibyte-multibyte comparisons. Then why did you mention raw bytes? their multibyte representation presents no performance problems, AFAIU. > You were probably thinking about comparisons between unibyte strings that contain raw bytes and multibyte strings, and those are indeed not very performance-sensitive. However there is no way to detect whether a unibyte string contains non-ASCII chars without looking at every byte, and comparing unibyte ASCII with multibyte is definitely of interest. Strings are still unibyte by default. You can compare under the assumption that a unibyte string is pure-ASCII until you bump into the first non-ASCII one. If that happens, abandon the comparison, convert the unibyte string to its multibyte representation, and compare again. ^ permalink raw reply [flat|nested] 40+ messages in thread
* bug#58168: string-lessp glitches and inconsistencies 2022-10-02 5:36 ` Eli Zaretskii @ 2022-10-03 19:48 ` Mattias Engdegård 2022-10-04 5:55 ` Eli Zaretskii 0 siblings, 1 reply; 40+ messages in thread From: Mattias Engdegård @ 2022-10-03 19:48 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 58168 2 okt. 2022 kl. 07.36 skrev Eli Zaretskii <eliz@gnu.org>: >> Comparison between objects is not only useful when someone cares about their order, as in presenting a sorted list to the user. Often what is important is an ability to impose an order, preferably total, for use in building and searching data structures. I came across this bug when implementing a string set. > > Always converting to multibyte handles this case, doesn't it? I don't think it does -- string= treats raw bytes in unibyte and multibyte strings as distinct; converting to multibyte does not preserve (in)equality. >> Actually I was talking about multibyte-multibyte comparisons. > > Then why did you mention raw bytes? their multibyte representation > presents no performance problems In a way they do -- the way raw bytes are represented (they start with C0 or C1) causes memcmp to sort them between U+007F and U+0080. If we accept that then comparisons are fast since memcmp will compare many character per data-dependent branch. The current code requires several data-dependent branches for each character. While we could probably bring down the comparison cost slightly by clever hand-coding, it's unlikely to be even nearly as fast as a memcmp and much messier. Since users are unlikely to care much about the ordering between raw bytes and something else (as long as there is an order), it would be a cheap way to improve performance while at the same time fixing the string< / string= mismatch. > You can compare under the assumption that a unibyte string is > pure-ASCII until you bump into the first non-ASCII one. If that > happens, abandon the comparison, convert the unibyte string to its > multibyte representation, and compare again. I don't quite see how that would improve performance but may be missing something. ^ permalink raw reply [flat|nested] 40+ messages in thread
* bug#58168: string-lessp glitches and inconsistencies 2022-10-03 19:48 ` Mattias Engdegård @ 2022-10-04 5:55 ` Eli Zaretskii 2022-10-04 17:40 ` Richard Stallman 2022-10-06 9:05 ` Mattias Engdegård 0 siblings, 2 replies; 40+ messages in thread From: Eli Zaretskii @ 2022-10-04 5:55 UTC (permalink / raw) To: Mattias Engdegård; +Cc: 58168 > From: Mattias Engdegård <mattias.engdegard@gmail.com> > Date: Mon, 3 Oct 2022 21:48:14 +0200 > Cc: 58168@debbugs.gnu.org > > 2 okt. 2022 kl. 07.36 skrev Eli Zaretskii <eliz@gnu.org>: > > >> Comparison between objects is not only useful when someone cares about their order, as in presenting a sorted list to the user. Often what is important is an ability to impose an order, preferably total, for use in building and searching data structures. I came across this bug when implementing a string set. > > > > Always converting to multibyte handles this case, doesn't it? > > I don't think it does -- string= treats raw bytes in unibyte and multibyte strings as distinct; converting to multibyte does not preserve (in)equality. If the fact that string= says strings are not equal, but string-lessp says they are equal, is what bothers you, we could document that results of comparing unibyte and multibyte strings are unspecified, or document explicitly that string= and string-lessp behave differently in this case. IOW, I see no reason to worry about 100% consistency here: the order is _really_ undefined in these cases, and trying to make it defined will not produce any tangible gains, IMNSHO. It could very well produce bugs and regressions, OTOH. So it sounds like a net loss to me, in practical terms. > >> Actually I was talking about multibyte-multibyte comparisons. > > > > Then why did you mention raw bytes? their multibyte representation > > presents no performance problems > > In a way they do -- the way raw bytes are represented (they start with C0 or C1) causes memcmp to sort them between U+007F and U+0080. If we accept that then comparisons are fast since memcmp will compare many character per data-dependent branch. The current code requires several data-dependent branches for each character. Once again, slowing down string-lessp when raw-bytes are involved shouldn't be a problem. So, if memchr finds a C0 or C1 in a string, fall back to a slower comparison. memchr is fast enough to not slow down the "usual" case. Would that be a good solution? Alternatively, we could introduce a new primitive which could assume multibyte or plain-ASCII unibyte strings without checking, and then code which is sure raw-bytes cannot happen, and needs to compare long strings, could use that for speed. > While we could probably bring down the comparison cost slightly by clever hand-coding, it's unlikely to be even nearly as fast as a memcmp and much messier. Since users are unlikely to care much about the ordering between raw bytes and something else (as long as there is an order), it would be a cheap way to improve performance while at the same time fixing the string< / string= mismatch. The assumption that "users are unlikely to care" is a pure conjecture, and we have no way of validating it. So I don't want us to act on such an assumption. What about one of the alternatives above instead? > > You can compare under the assumption that a unibyte string is > > pure-ASCII until you bump into the first non-ASCII one. If that > > happens, abandon the comparison, convert the unibyte string to its > > multibyte representation, and compare again. > > I don't quite see how that would improve performance but may be missing something. Then maybe I didn't understand the performance problems that you had in mind. Suppose you describe them in more detail, preferably with examples? E.g., are you saying that unibyte strings that are pure-ASCII also cause performance problems? ^ permalink raw reply [flat|nested] 40+ messages in thread
* bug#58168: string-lessp glitches and inconsistencies 2022-10-04 5:55 ` Eli Zaretskii @ 2022-10-04 17:40 ` Richard Stallman 2022-10-04 18:07 ` Eli Zaretskii 2022-10-06 9:05 ` Mattias Engdegård 1 sibling, 1 reply; 40+ messages in thread From: Richard Stallman @ 2022-10-04 17:40 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 58168, mattias.engdegard [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] > If the fact that string= says strings are not equal, but string-lessp > says they are equal, is what bothers you That result seems paradoxical to me. I think the way to make sense of it is this: what string-lessp is really saying is not that the strings are "equal", but rather that they are lexicographically equivalent. Perhaps documenting the difference between these two relationships could make the current behavior comprehensible rather than anomalous. -- Dr Richard Stallman (https://stallman.org) Chief GNUisance of the GNU Project (https://gnu.org) Founder, Free Software Foundation (https://fsf.org) Internet Hall-of-Famer (https://internethalloffame.org) ^ permalink raw reply [flat|nested] 40+ messages in thread
* bug#58168: string-lessp glitches and inconsistencies 2022-10-04 17:40 ` Richard Stallman @ 2022-10-04 18:07 ` Eli Zaretskii 0 siblings, 0 replies; 40+ messages in thread From: Eli Zaretskii @ 2022-10-04 18:07 UTC (permalink / raw) To: rms; +Cc: 58168, mattias.engdegard > From: Richard Stallman <rms@gnu.org> > Cc: mattias.engdegard@gmail.com, 58168@debbugs.gnu.org > Date: Tue, 04 Oct 2022 13:40:18 -0400 > > > If the fact that string= says strings are not equal, but string-lessp > > says they are equal, is what bothers you > > That result seems paradoxical to me. Not if you accept that comparing unibyte non-ASCII text with multibyte text yields inherently unspecified results. > Perhaps documenting the difference between these two relationships > could make the current behavior comprehensible rather than anomalous. Yes, that's one of the alternatives that I think should be on the table. ^ permalink raw reply [flat|nested] 40+ messages in thread
* bug#58168: string-lessp glitches and inconsistencies 2022-10-04 5:55 ` Eli Zaretskii 2022-10-04 17:40 ` Richard Stallman @ 2022-10-06 9:05 ` Mattias Engdegård 2022-10-06 11:06 ` Eli Zaretskii 1 sibling, 1 reply; 40+ messages in thread From: Mattias Engdegård @ 2022-10-06 9:05 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 58168 4 okt. 2022 kl. 07.55 skrev Eli Zaretskii <eliz@gnu.org>: > If the fact that string= says strings are not equal, but string-lessp > says they are equal, is what bothers you, we could document that > results of comparing unibyte and multibyte strings are unspecified, or > document explicitly that string= and string-lessp behave differently > in this case. (It's not just string= but `equal` since they use the same comparison.) But it's just a part of a set of related problems: * string< / string= inconsistency * undesirable string< ordering (unibyte strings are treated as Latin-1) * bad string< performance Ideally we should be able to do something about all three at the same time since they are interrelated. At the very least it's worth a try. Just documenting the annoying parts won't make them go away -- they still have to be coded around by the user, and it doesn't solve any performance problems either. > I see no reason to worry about 100% consistency here: the order > is _really_ undefined in these cases, and trying to make it defined > will not produce any tangible gains, Yes it would: better performance and wider applicability. Even when the order isn't defined the user expects there to be some order between distinct strings. > Once again, slowing down string-lessp when raw-bytes are involved > shouldn't be a problem. So, if memchr finds a C0 or C1 in a string, > fall back to a slower comparison. memchr is fast enough to not slow > down the "usual" case. Would that be a good solution? There is no reason a comparison should need to look beyond the first mismatch; anything else is just algorithmically slow. Long strings are likely to differ early on. Any hack that has to special-case raw bytes will add costs. The best we can hope for is hand-written vectorised code that does everything in one pass but it's still slower than just a memcmp. Even then our chosen semantics make that more difficult (and slower) than it needs to be: for example, we cannot assume that any byte with the high bit set indicates a mismatch when comparing unibyte strings with multibyte, since we equate unibyte chars with Latin-1. It's a decision that we will keep paying for. > Alternatively, we could introduce a new primitive which could assume > multibyte or plain-ASCII unibyte strings without checking, and then > code which is sure raw-bytes cannot happen, and needs to compare long > strings, could use that for speed. That or variants thereof are indeed alternatives but users would be forgiven to wonder why we don't make what we have fast instead? > E.g., are you saying that unibyte strings that are > pure-ASCII also cause performance problems? They do because we have no efficient way of ascertaining that they are pure-ASCII. The long-term solution is to make multibyte strings the default in more cases but I'm not proposing such a change right now. I'll see to where further performance tweaking of the existing code can take us with a reasonable efforts, but there are hard limits to what can be done. And thank you for your comments! ^ permalink raw reply [flat|nested] 40+ messages in thread
* bug#58168: string-lessp glitches and inconsistencies 2022-10-06 9:05 ` Mattias Engdegård @ 2022-10-06 11:06 ` Eli Zaretskii 2022-10-07 14:23 ` Mattias Engdegård 0 siblings, 1 reply; 40+ messages in thread From: Eli Zaretskii @ 2022-10-06 11:06 UTC (permalink / raw) To: Mattias Engdegård; +Cc: 58168 > From: Mattias Engdegård <mattias.engdegard@gmail.com> > Date: Thu, 6 Oct 2022 11:05:04 +0200 > Cc: 58168@debbugs.gnu.org > > 4 okt. 2022 kl. 07.55 skrev Eli Zaretskii <eliz@gnu.org>: > > > If the fact that string= says strings are not equal, but string-lessp > > says they are equal, is what bothers you, we could document that > > results of comparing unibyte and multibyte strings are unspecified, or > > document explicitly that string= and string-lessp behave differently > > in this case. > > (It's not just string= but `equal` since they use the same comparison.) > But it's just a part of a set of related problems: > > * string< / string= inconsistency > * undesirable string< ordering (unibyte strings are treated as Latin-1) > * bad string< performance That doesn't seem different (and the ordering part is not necessary, IMO). > Ideally we should be able to do something about all three at the same time since they are interrelated. At the very least it's worth a try. It depends on the costs and the risks. All the rest being equal, yes, solving those would be desirable. But it isn't equal, and the costs and the risks of your proposals outweigh the advantages in my book, sorry. > Just documenting the annoying parts won't make them go away -- they still have to be coded around by the user, and it doesn't solve any performance problems either. That's not a catastrophe, because we are already there (sans the documentation), and because these cases are rare in real life. > > I see no reason to worry about 100% consistency here: the order > > is _really_ undefined in these cases, and trying to make it defined > > will not produce any tangible gains, > > Yes it would: better performance and wider applicability. These are not tangible enough IMO. > Even when the order isn't defined the user expects there to be some order between distinct strings. No, if the order is undefined, the caller cannot expect any order. Cf. NaN comparisons with numerical values. > > Once again, slowing down string-lessp when raw-bytes are involved > > shouldn't be a problem. So, if memchr finds a C0 or C1 in a string, > > fall back to a slower comparison. memchr is fast enough to not slow > > down the "usual" case. Would that be a good solution? > > There is no reason a comparison should need to look beyond the first mismatch; anything else is just algorithmically slow. Long strings are likely to differ early on. Any hack that has to special-case raw bytes will add costs. You missed me here. Why are you suddenly talking about mismatches? And if only mismatches matter here, why is it a problem to use memchr in the first place? > > Alternatively, we could introduce a new primitive which could assume > > multibyte or plain-ASCII unibyte strings without checking, and then > > code which is sure raw-bytes cannot happen, and needs to compare long > > strings, could use that for speed. > > That or variants thereof are indeed alternatives but users would be forgiven to wonder why we don't make what we have fast instead? Because the fast versions can break when the assumptions are false. We already have similar stuff in encoding/decoding area: there are fast optimized functions that require the caller to make sure some assumptions hold. > > E.g., are you saying that unibyte strings that are > > pure-ASCII also cause performance problems? > > They do because we have no efficient way of ascertaining that they are pure-ASCII. If we declare that comparing with unibyte non-ASCII produces unspecified results, we don't have to worry about that: it becomes the worry of the caller. > The long-term solution is to make multibyte strings the default in more cases but I'm not proposing such a change right now. I don't think we will ever get there, FWIW. Raw bytes in strings are a fact of life, whether we like it or not. > I'll see to where further performance tweaking of the existing code can take us with a reasonable efforts, but there are hard limits to what can be done. > And thank you for your comments! Thanks. ^ permalink raw reply [flat|nested] 40+ messages in thread
* bug#58168: string-lessp glitches and inconsistencies 2022-10-06 11:06 ` Eli Zaretskii @ 2022-10-07 14:23 ` Mattias Engdegård 2022-10-08 7:35 ` Eli Zaretskii 0 siblings, 1 reply; 40+ messages in thread From: Mattias Engdegård @ 2022-10-07 14:23 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 58168 6 okt. 2022 kl. 13.06 skrev Eli Zaretskii <eliz@gnu.org>: > Cf. NaN comparisons with numerical values. Emacs strings are completely different from floats and NaNs in just about every respect; no meaningful parallels can be drawn. (And do believe me when I say that we should be thankful for that.) > You missed me here. Why are you suddenly talking about mismatches? > And if only mismatches matter here, why is it a problem to use memchr > in the first place? Any lexicographic comparison is a matter of finding the first point of difference, then interpreting the difference at that point. `memchr` does not help with that, nor does `memcmp` unless we are doing a bytewise string comparison. I've pushed a further optimisation of the comparison between arbitrary multibyte strings; it's now sufficiently fast that I no longer worry much about it (on my machine asymptotically 10× faster than before, yet 30 % slower than memcmp). Similar improvements could be made to the comparison between unibyte and non-ASCII multibyte strings. These are less common and not quite as slow; I haven't made up my mind about whether it's worth the trouble. Again it's complicated by the equivalence between unibyte and Latin-1: it means that two equal strings can have different byte patterns, which in turn makes it much more expensive to find the longest common prefix. (Thus this point is of practical interest in several respects.) In any case, the situation is now better than it was before the bug was opened: string< is faster and the remaining problems have at least been chartered, whether or not an agreement to remedy them can be reached. Let's be happy about this! ^ permalink raw reply [flat|nested] 40+ messages in thread
* bug#58168: string-lessp glitches and inconsistencies 2022-10-07 14:23 ` Mattias Engdegård @ 2022-10-08 7:35 ` Eli Zaretskii 2022-10-14 14:39 ` Mattias Engdegård 0 siblings, 1 reply; 40+ messages in thread From: Eli Zaretskii @ 2022-10-08 7:35 UTC (permalink / raw) To: Mattias Engdegård; +Cc: 58168 > From: Mattias Engdegård <mattias.engdegard@gmail.com> > Date: Fri, 7 Oct 2022 16:23:26 +0200 > Cc: 58168@debbugs.gnu.org > > 6 okt. 2022 kl. 13.06 skrev Eli Zaretskii <eliz@gnu.org>: > > > Cf. NaN comparisons with numerical values. > > Emacs strings are completely different from floats and NaNs in just about every respect; no meaningful parallels can be drawn. (And do believe me when I say that we should be thankful for that.) I'm totally aware that NaNs and unibyte strings are completely different beasts, believe me. I was just pointing out another widespread case where comparison results are surprising and order is not defined. My point is that it isn't an unimaginable situation. > > You missed me here. Why are you suddenly talking about mismatches? > > And if only mismatches matter here, why is it a problem to use memchr > > in the first place? > > Any lexicographic comparison is a matter of finding the first point of difference, then interpreting the difference at that point. `memchr` does not help with that, nor does `memcmp` unless we are doing a bytewise string comparison. Wed are miscommunicating, because you remove too much of previous context. I suggested to use memchr to find whether a string has any C0 or C1 bytes, _before_ doing the actual comparison, to find out whether a multibyte string includes any raw bytes, which would then require slower comparisons. If there are no C0/C1 bytes, you could use memcmp, which is always faster than hand-made word-wise comparison we have there now. I also suggested to try memmem as yet another possibility -- not sure up front whether it can be faster in cases that matter. > Similar improvements could be made to the comparison between unibyte and non-ASCII multibyte strings. These are less common and not quite as slow; I haven't made up my mind about whether it's worth the trouble. I don't think it's worth the trouble. > In any case, the situation is now better than it was before the bug was opened: string< is faster and the remaining problems have at least been chartered, whether or not an agreement to remedy them can be reached. Let's be happy about this! This is me being happy. ^ permalink raw reply [flat|nested] 40+ messages in thread
* bug#58168: string-lessp glitches and inconsistencies 2022-10-08 7:35 ` Eli Zaretskii @ 2022-10-14 14:39 ` Mattias Engdegård 2022-10-14 15:31 ` Eli Zaretskii 0 siblings, 1 reply; 40+ messages in thread From: Mattias Engdegård @ 2022-10-14 14:39 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 58168 As performance is more acceptable now I'm not going to take any further action with respect of string<, but let me just answer your questions: 8 okt. 2022 kl. 09.35 skrev Eli Zaretskii <eliz@gnu.org>: > I suggested to use memchr to find whether a string has any > C0 or C1 bytes, _before_ doing the actual comparison, to find out > whether a multibyte string includes any raw bytes, which would then > require slower comparisons. That isn't practical; we would traverse each argument in full, twice, even if there is a difference early on. While memchr is fast for what it does, it will still need to look at every bit of its input. > I also suggested to try memmem as yet another possibility There is no reason to believe that this would help in any way. > This is me being happy. Thank you, let's both rejoice then! Again, the improvements and analyses have been worthwhile. ^ permalink raw reply [flat|nested] 40+ messages in thread
* bug#58168: string-lessp glitches and inconsistencies 2022-10-14 14:39 ` Mattias Engdegård @ 2022-10-14 15:31 ` Eli Zaretskii 2022-10-17 12:44 ` Mattias Engdegård 0 siblings, 1 reply; 40+ messages in thread From: Eli Zaretskii @ 2022-10-14 15:31 UTC (permalink / raw) To: Mattias Engdegård; +Cc: 58168 > From: Mattias Engdegård <mattias.engdegard@gmail.com> > Date: Fri, 14 Oct 2022 16:39:55 +0200 > Cc: 58168@debbugs.gnu.org > > As performance is more acceptable now I'm not going to take any further action with respect of string<, but let me just answer your questions: > > 8 okt. 2022 kl. 09.35 skrev Eli Zaretskii <eliz@gnu.org>: > > > I suggested to use memchr to find whether a string has any > > C0 or C1 bytes, _before_ doing the actual comparison, to find out > > whether a multibyte string includes any raw bytes, which would then > > require slower comparisons. > > That isn't practical; we would traverse each argument in full, twice, even if there is a difference early on. While memchr is fast for what it does, it will still need to look at every bit of its input. I fail to see how the number of times we'd traverse the strings is of concern, as long as it's fast enough. And memchr _is_ very fast. ^ permalink raw reply [flat|nested] 40+ messages in thread
* bug#58168: string-lessp glitches and inconsistencies 2022-10-14 15:31 ` Eli Zaretskii @ 2022-10-17 12:44 ` Mattias Engdegård 0 siblings, 0 replies; 40+ messages in thread From: Mattias Engdegård @ 2022-10-17 12:44 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 58168 14 okt. 2022 kl. 17.31 skrev Eli Zaretskii <eliz@gnu.org>: > I fail to see how the number of times we'd traverse the strings is of > concern, as long as it's fast enough. Unfortunately it isn't. Always reading the whole string wouldn't give us decent string comparison performance. (Doing it twice doesn't help.) If the current code turns out to not to be fast enough then other well-known techniques can be applied, but I'm done for now. ^ permalink raw reply [flat|nested] 40+ messages in thread
* bug#58168: string-lessp glitches and inconsistencies 2022-09-29 16:24 bug#58168: string-lessp glitches and inconsistencies Mattias Engdegård 2022-09-29 17:00 ` Mattias Engdegård 2022-09-29 17:11 ` Eli Zaretskii @ 2022-09-30 13:52 ` Lars Ingebrigtsen 2022-09-30 20:12 ` Mattias Engdegård 2022-10-01 5:30 ` Eli Zaretskii 2 siblings, 2 replies; 40+ messages in thread From: Lars Ingebrigtsen @ 2022-09-30 13:52 UTC (permalink / raw) To: Mattias Engdegård; +Cc: 58168 Mattias Engdegård <mattias.engdegard@gmail.com> writes: > We really want string< to be consistent with string= and itself since this is fundamental for string ordering in searching and sorting applications. > This means that for any pair of strings A and B, we should either have A<B, B<A or A=B. > > Unfortunately: > > (let* ((a "ü") > (b "\xfc")) > (list (string= a b) > (string< a b) > (string< b a))) > => (nil nil nil) > > because string< considers the unibyte raw byte 0xFC and the multibyte char U+00FC to be the same, but string= thinks they are different. You also have (string 4194176) => "\200" "\x80" => "\200" which are kinda equal in some ways, and not in other ways. > It suggests the following alternative collation orders: > > A. ASCII < ub raw 80..FF < mb U+0080..10FFFF < mb raw 80..FF > > which puts all non-ASCII multibyte chars after unibyte. > > B. ASCII < ub raw 80..FF < mb raw 80..FF < mb U+0080..10FFFF > > which inserts multibyte raw bytes after the unibyte ones, permitting any ub-ub and mb-mb comparisons to be made using memcmp, and a slow decoding loop only required for unibyte against non-ASCII multibyte strings. > > C. ASCII < mb U+0080..10FFFF < mb raw 80..FF < ub raw 80..FF > > which instead moves unibyte raw bytes to after the multibyte raw range. This has the same memcmp benefit as alternative B, but may be slightly faster for ub-mb comparisons since only unibyte 80..FF need to be remapped. I think A makes the most intuitive sense, somehow. But perhaps my intuition is off. ^ permalink raw reply [flat|nested] 40+ messages in thread
* bug#58168: string-lessp glitches and inconsistencies 2022-09-30 13:52 ` Lars Ingebrigtsen @ 2022-09-30 20:12 ` Mattias Engdegård 2022-10-01 5:34 ` Eli Zaretskii 2022-10-01 10:02 ` Lars Ingebrigtsen 2022-10-01 5:30 ` Eli Zaretskii 1 sibling, 2 replies; 40+ messages in thread From: Mattias Engdegård @ 2022-09-30 20:12 UTC (permalink / raw) To: Lars Ingebrigtsen; +Cc: 58168 30 sep. 2022 kl. 15.52 skrev Lars Ingebrigtsen <larsi@gnus.org>: > (string 4194176) > => "\200" > "\x80" > => "\200" > > which are kinda equal in some ways, and not in other ways. And (string 128) => "\200" but painted in a different colour so that the alert reader suspects that something odd is going on... (What about printing it as "\u0080" instead? Only affects the C1 controls...) > I think A makes the most intuitive sense, somehow. But perhaps my > intuition is off. Can't say my intuition is any more reliable. On the other hand, intuition says that counter-intuitive solutions are very attractive, but I think it's just trying to confuse us. ^ permalink raw reply [flat|nested] 40+ messages in thread
* bug#58168: string-lessp glitches and inconsistencies 2022-09-30 20:12 ` Mattias Engdegård @ 2022-10-01 5:34 ` Eli Zaretskii 2022-10-01 11:51 ` Mattias Engdegård 2022-10-01 10:02 ` Lars Ingebrigtsen 1 sibling, 1 reply; 40+ messages in thread From: Eli Zaretskii @ 2022-10-01 5:34 UTC (permalink / raw) To: Mattias Engdegård; +Cc: 58168, larsi > Cc: 58168@debbugs.gnu.org > From: Mattias Engdegård <mattias.engdegard@gmail.com> > Date: Fri, 30 Sep 2022 22:12:24 +0200 > > 30 sep. 2022 kl. 15.52 skrev Lars Ingebrigtsen <larsi@gnus.org>: > > > (string 4194176) > > => "\200" > > "\x80" > > => "\200" > > > > which are kinda equal in some ways, and not in other ways. > > And (string 128) > => "\200" > > but painted in a different colour It's the same color as the result of '(string 4194176)'. >so that the alert reader suspects that something odd is going on... Which it is. > (What about printing it as "\u0080" instead? NO!! \u0080 is something entirely different. ^ permalink raw reply [flat|nested] 40+ messages in thread
* bug#58168: string-lessp glitches and inconsistencies 2022-10-01 5:34 ` Eli Zaretskii @ 2022-10-01 11:51 ` Mattias Engdegård 0 siblings, 0 replies; 40+ messages in thread From: Mattias Engdegård @ 2022-10-01 11:51 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 58168, larsi 1 okt. 2022 kl. 07.34 skrev Eli Zaretskii <eliz@gnu.org>: >> (What about printing it as "\u0080" instead? > > NO!! \u0080 is something entirely different. Actually not -- (string 128) returns the multibyte string consisting of the single char U+0080, which is exactly what you get by typing "\u0080". It confuses me, too. That's a C1 control char and Emacs doesn't escape it when printing, but it's displayed as `\200`. (I don't think changing that display to \u0080 would break any compatibility; we should consider doing that for all C1 controls, U+0080..U+009F.) Even more confusing is that "\x0080" means the same as "\u0080" but not the same as "\x80", which is a unibyte string of a single raw byte. ^ permalink raw reply [flat|nested] 40+ messages in thread
* bug#58168: string-lessp glitches and inconsistencies 2022-09-30 20:12 ` Mattias Engdegård 2022-10-01 5:34 ` Eli Zaretskii @ 2022-10-01 10:02 ` Lars Ingebrigtsen 2022-10-01 10:12 ` Eli Zaretskii 2022-10-01 13:37 ` Mattias Engdegård 1 sibling, 2 replies; 40+ messages in thread From: Lars Ingebrigtsen @ 2022-10-01 10:02 UTC (permalink / raw) To: Mattias Engdegård; +Cc: 58168 [-- Attachment #1: Type: text/plain, Size: 324 bytes --] Mattias Engdegård <mattias.engdegard@gmail.com> writes: >> (string 4194176) >> => "\200" >> "\x80" >> => "\200" >> >> which are kinda equal in some ways, and not in other ways. > > And (string 128) > => "\200" Funnily enough, the latter displays in a different way for me, which may or may not be a bug: [-- Attachment #2: Type: image/png, Size: 17392 bytes --] [-- Attachment #3: Type: text/plain, Size: 392 bytes --] This is with `display-raw-bytes-as-hex' t. > (What about printing it as "\u0080" instead? Only affects the C1 controls...) \u0080 is a character, so that's something different than \200. > Can't say my intuition is any more reliable. On the other hand, > intuition says that counter-intuitive solutions are very attractive, > but I think it's just trying to confuse us. 😀 ^ permalink raw reply [flat|nested] 40+ messages in thread
* bug#58168: string-lessp glitches and inconsistencies 2022-10-01 10:02 ` Lars Ingebrigtsen @ 2022-10-01 10:12 ` Eli Zaretskii 2022-10-01 13:37 ` Mattias Engdegård 1 sibling, 0 replies; 40+ messages in thread From: Eli Zaretskii @ 2022-10-01 10:12 UTC (permalink / raw) To: Lars Ingebrigtsen; +Cc: 58168, mattias.engdegard > Cc: 58168@debbugs.gnu.org > From: Lars Ingebrigtsen <larsi@gnus.org> > Date: Sat, 01 Oct 2022 12:02:12 +0200 > > Mattias Engdegård <mattias.engdegard@gmail.com> writes: > > >> (string 4194176) > >> => "\200" > >> "\x80" > >> => "\200" > >> > >> which are kinda equal in some ways, and not in other ways. > > > > And (string 128) > > => "\200" > > Funnily enough, the latter displays in a different way for me, which may > or may not be a bug: The former is a string of 4 characters, the latter is a string of just one. ^ permalink raw reply [flat|nested] 40+ messages in thread
* bug#58168: string-lessp glitches and inconsistencies 2022-10-01 10:02 ` Lars Ingebrigtsen 2022-10-01 10:12 ` Eli Zaretskii @ 2022-10-01 13:37 ` Mattias Engdegård 2022-10-01 13:43 ` Lars Ingebrigtsen 2022-10-01 13:51 ` Eli Zaretskii 1 sibling, 2 replies; 40+ messages in thread From: Mattias Engdegård @ 2022-10-01 13:37 UTC (permalink / raw) To: Lars Ingebrigtsen; +Cc: 58168, Eli Zaretskii [-- Attachment #1: Type: text/plain, Size: 1149 bytes --] 1 okt. 2022 kl. 12.02 skrev Lars Ingebrigtsen <larsi@gnus.org>: > Funnily enough, the latter displays in a different way for me, which may > or may not be a bug: > > This is with `display-raw-bytes-as-hex' t. You are right, that is completely broken -- display-raw-bytes-as-hex shouldn't affect the display of C1 controls. Whether (string 128) displays "\200" or "\x80", however tarted up in a fancy face, it's still a lie. Only something like "\u0080" would actually be correct. It seems to be a relic from the pre-Unicode days of Emacs: the code responsible muddles the display of raw bytes and unicode controls. The attached patch untangles the two somewhat and lets display-raw-bytes-as-hex do what its name and documentation suggest, while using a non-confusing display for C1 controls. The command (insert "C1: " (string 128) " raw: " (unibyte-string 128) ".\n") currently displays C1: \200 raw: \200. or C1: \x80 raw: \x80. depending on display-raw-bytes-as-hex. With the patch, we get C1: \u0080 raw: \200. or C1: \u0080 raw: \x80. which should satisfy everyone. What about it? [-- Attachment #2: unicode-escape-display.diff --] [-- Type: application/octet-stream, Size: 955 bytes --] diff --git a/src/xdisp.c b/src/xdisp.c index 55e74a3603..fa4fc2319e 100644 --- a/src/xdisp.c +++ b/src/xdisp.c @@ -8179,12 +8179,20 @@ get_next_display_element (struct it *it) char str[10]; int len, i; + const char *format_string; if (CHAR_BYTE8_P (c)) - /* Display \200 or \x80 instead of \17777600. */ - c = CHAR_TO_BYTE8 (c); - const char *format_string = display_raw_bytes_as_hex - ? "x%02x" - : "%03o"; + { + /* A raw byte: display using an octal or hex escape which + would produce this byte in a Lisp string literal. */ + c = CHAR_TO_BYTE8 (c); + format_string = display_raw_bytes_as_hex ? "x%02x" : "%03o"; + } + else + { + /* A Unicode character not displayed in any other way: + use a Unicode escape. */ + format_string = c <= 0xffff ? "u%04X" : "U%08X"; + } len = sprintf (str, format_string, c + 0u); XSETINT (it->ctl_chars[0], escape_glyph); [-- Attachment #3: Type: text/plain, Size: 203 bytes --] I see that the redisplay-testsuite.el needs amending too; it actually looks buggy in this respect. If the above approach is deemed acceptable, I'll submit a patch that includes that file as well. ^ permalink raw reply related [flat|nested] 40+ messages in thread
* bug#58168: string-lessp glitches and inconsistencies 2022-10-01 13:37 ` Mattias Engdegård @ 2022-10-01 13:43 ` Lars Ingebrigtsen 2022-10-03 19:48 ` Mattias Engdegård 2022-10-01 13:51 ` Eli Zaretskii 1 sibling, 1 reply; 40+ messages in thread From: Lars Ingebrigtsen @ 2022-10-01 13:43 UTC (permalink / raw) To: Mattias Engdegård; +Cc: 58168, Eli Zaretskii Mattias Engdegård <mattias.engdegard@gmail.com> writes: > depending on display-raw-bytes-as-hex. With the patch, we get > > C1: \u0080 raw: \200. > or > C1: \u0080 raw: \x80. > > which should satisfy everyone. What about it? There was a very long thread about changing this output in a bug report somewhere, and we decided not to, because all the alternatives were worse than what we have. *handwaves at bug tracker* ^ permalink raw reply [flat|nested] 40+ messages in thread
* bug#58168: string-lessp glitches and inconsistencies 2022-10-01 13:43 ` Lars Ingebrigtsen @ 2022-10-03 19:48 ` Mattias Engdegård 2022-10-04 10:44 ` Lars Ingebrigtsen 2022-10-04 11:37 ` Eli Zaretskii 0 siblings, 2 replies; 40+ messages in thread From: Mattias Engdegård @ 2022-10-03 19:48 UTC (permalink / raw) To: Lars Ingebrigtsen; +Cc: 58168, Eli Zaretskii [-- Attachment #1: Type: text/plain, Size: 1414 bytes --] 1 okt. 2022 kl. 15.43 skrev Lars Ingebrigtsen <larsi@gnus.org>: > There was a very long thread about changing this output in a bug report > somewhere, and we decided not to, because all the alternatives were > worse than what we have. I have no wish to reopen old wounds (and it sounds as if the debate isn't one that you are keen to revisit), but I'm a bit surprised -- it all arises because of old code that still treats raw bytes as Latin-1, and I know that neither of you are very fond of that either. 1 okt. 2022 kl. 15.51 skrev Eli Zaretskii <eliz@gnu.org>: > I think the variable is a misnomer of sorts: the request back when it > was introduced to display hex where we usually display octal \nnn > escapes. And the latter happens not only for raw bytes. Fair enough. Maybe the documentation should reflect that, but I'm still holding out for a change to the C1 presentation in the long term, so... I'm not going to pursue this little digression any further except that while looking at it I found a few inaccuracies and a likely bug in redisplay-testsuite.el. I'm attaching a patch which un-muddles the test and adds a display of unprintable Unicode chars such as C1 controls, in addition to raw bytes. I'd like to adorn the commit with the correct bug number so if you remember that of the original discussion that would be useful (I never found it very easy to search debbugs). [-- Attachment #2: redisplay-testsuite.diff --] [-- Type: application/octet-stream, Size: 2762 bytes --] diff --git a/test/manual/redisplay-testsuite.el b/test/manual/redisplay-testsuite.el index 01b0a895a4..5495146b87 100644 --- a/test/manual/redisplay-testsuite.el +++ b/test/manual/redisplay-testsuite.el @@ -305,7 +305,7 @@ test-redisplay-5-toggle (let ((label (if display-raw-bytes-as-hex "\\x80" "\\200"))) (overlay-put test-redisplay-5a-expected-overlay 'display (propertize label 'face 'escape-glyph))) - (let ((label (if display-raw-bytes-as-hex "\\x3fffc" "\\777774"))) + (let ((label (if display-raw-bytes-as-hex "\\xfc" "\\374"))) (overlay-put test-redisplay-5b-expected-overlay 'display (propertize label 'face 'escape-glyph)))) @@ -320,18 +320,36 @@ test-redisplay-5 (test-insert-overlay " " 'display "\200")) (insert "\n\n") (insert " Expected: ") - ;; This tests a large codepoint, to make sure the internal buffer we - ;; use to produce the representation is large enough. - (aset printable-chars #x3fffc nil) (setq test-redisplay-5b-expected-overlay (test-insert-overlay " " 'display - (propertize "\\777774" 'face 'escape-glyph))) + (propertize "\\374" 'face 'escape-glyph))) (insert "\n Result: ") (setq test-redisplay-5b-result-overlay - (test-insert-overlay " " 'display (char-to-string #x3fffc))) + (test-insert-overlay " " 'display (char-to-string #x3ffffc))) + (insert "\n\n") + (insert-button "Toggle between octal and hex display for raw bytes" + 'action 'test-redisplay-5-toggle) + (insert "\n\n")) + +(defun test-redisplay-6 () + (insert "Test 6: Display of unprintable Unicode chars:\n\n") + (insert " Expected: ") + (test-insert-overlay " " 'display + (propertize "\\200" 'face 'escape-glyph)) + (insert " (representing U+0100)") + (insert "\n Result: ") + (test-insert-overlay " " 'display "\u0080") (insert "\n\n") - (insert-button "Toggle between octal and hex display" - 'action 'test-redisplay-5-toggle)) + ;; This tests a large codepoint, to make sure the internal buffer we + ;; use to produce the representation is large enough. + (insert " Expected: ") + (aset printable-chars #x10abcd nil) + (test-insert-overlay " " 'display + (propertize "\\4125715" 'face 'escape-glyph)) + (insert " (representing U+0010ABCD)") + (insert "\n Result: ") + (test-insert-overlay " " 'display "\U0010ABCD") + (insert "\n\n")) (defun test-redisplay () (interactive) @@ -349,6 +367,7 @@ test-redisplay (test-redisplay-3) (test-redisplay-4) (test-redisplay-5) + (test-redisplay-6) (goto-char (point-min)))) ;;; redisplay-testsuite.el ends here ^ permalink raw reply related [flat|nested] 40+ messages in thread
* bug#58168: string-lessp glitches and inconsistencies 2022-10-03 19:48 ` Mattias Engdegård @ 2022-10-04 10:44 ` Lars Ingebrigtsen 2022-10-04 11:37 ` Eli Zaretskii 1 sibling, 0 replies; 40+ messages in thread From: Lars Ingebrigtsen @ 2022-10-04 10:44 UTC (permalink / raw) To: Mattias Engdegård; +Cc: 58168, Eli Zaretskii Mattias Engdegård <mattias.engdegard@gmail.com> writes: > I'm attaching a patch which un-muddles the > test and adds a display of unprintable Unicode chars such as C1 > controls, in addition to raw bytes. I'd like to adorn the commit with > the correct bug number so if you remember that of the original > discussion that would be useful (I never found it very easy to search > debbugs). Me neither. I think the context was something about... uhm... cutting and pasting text displayed as \ef from a terminal, and that being ambiguous? ^ permalink raw reply [flat|nested] 40+ messages in thread
* bug#58168: string-lessp glitches and inconsistencies 2022-10-03 19:48 ` Mattias Engdegård 2022-10-04 10:44 ` Lars Ingebrigtsen @ 2022-10-04 11:37 ` Eli Zaretskii 2022-10-04 14:44 ` Mattias Engdegård 1 sibling, 1 reply; 40+ messages in thread From: Eli Zaretskii @ 2022-10-04 11:37 UTC (permalink / raw) To: Mattias Engdegård; +Cc: 58168, larsi > From: Mattias Engdegård <mattias.engdegard@gmail.com> > Date: Mon, 3 Oct 2022 21:48:10 +0200 > Cc: 58168@debbugs.gnu.org, > Eli Zaretskii <eliz@gnu.org> > > 1 okt. 2022 kl. 15.51 skrev Eli Zaretskii <eliz@gnu.org>: > > > I think the variable is a misnomer of sorts: the request back when it > > was introduced to display hex where we usually display octal \nnn > > escapes. And the latter happens not only for raw bytes. > > Fair enough. Maybe the documentation should reflect that, but I'm still holding out for a change to the C1 presentation in the long term, so... > > I'm not going to pursue this little digression any further except that while looking at it I found a few inaccuracies and a likely bug in redisplay-testsuite.el. I'm attaching a patch which un-muddles the test and adds a display of unprintable Unicode chars such as C1 controls, in addition to raw bytes. I'd like to adorn the commit with the correct bug number so if you remember that of the original discussion that would be useful (I never found it very easy to search debbugs). First I needed to fix fallout from making STRING_CHAR intolerant of unibyte text, because redisplay-testsuite caused assertion violations in string_char_and_length. (Doesn't it abort for you? or do you not build Emacs with --enable-checking?) This was a regression in Emacs 28, sigh. Looking at your patch, I don't think I understand this part: > --- a/test/manual/redisplay-testsuite.el > +++ b/test/manual/redisplay-testsuite.el > @@ -305,7 +305,7 @@ test-redisplay-5-toggle > (let ((label (if display-raw-bytes-as-hex "\\x80" "\\200"))) > (overlay-put test-redisplay-5a-expected-overlay 'display > (propertize label 'face 'escape-glyph))) > - (let ((label (if display-raw-bytes-as-hex "\\x3fffc" "\\777774"))) > + (let ((label (if display-raw-bytes-as-hex "\\xfc" "\\374"))) > (overlay-put test-redisplay-5b-expected-overlay 'display > (propertize label 'face 'escape-glyph)))) > > @@ -320,18 +320,36 @@ test-redisplay-5 > (test-insert-overlay " " 'display "\200")) > (insert "\n\n") > (insert " Expected: ") > - ;; This tests a large codepoint, to make sure the internal buffer we > - ;; use to produce the representation is large enough. > - (aset printable-chars #x3fffc nil) > (setq test-redisplay-5b-expected-overlay > (test-insert-overlay " " 'display > - (propertize "\\777774" 'face 'escape-glyph))) > + (propertize "\\374" 'face 'escape-glyph))) I could understand why you'd want to _add_ the larger values, but why replace? As for the bug report which led to display-raw-bytes-as-hex (if that what you meant) and its discussion, it's bug#27122. (If you know what code is in question, it is much easier to find the bug via "git annotate", assuming the bug number was cited in the commit logs.) ^ permalink raw reply [flat|nested] 40+ messages in thread
* bug#58168: string-lessp glitches and inconsistencies 2022-10-04 11:37 ` Eli Zaretskii @ 2022-10-04 14:44 ` Mattias Engdegård 2022-10-04 16:24 ` Eli Zaretskii 0 siblings, 1 reply; 40+ messages in thread From: Mattias Engdegård @ 2022-10-04 14:44 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 58168, larsi 4 okt. 2022 kl. 13.37 skrev Eli Zaretskii <eliz@gnu.org>: > First I needed to fix fallout from making STRING_CHAR intolerant of > unibyte text, because redisplay-testsuite caused assertion violations > in string_char_and_length. Good catch! Just to satisfy my curiosity: > error ("Invalid format operation %%%c", > - STRING_CHAR ((unsigned char *) format - 1)); > + multibyte_format > + ? STRING_CHAR ((unsigned char *) format - 1) > + : *((unsigned char *) format - 1)); This treats unibyte format strings as if they were Latin-1 for the purpose of the error message. Not very important, of course, but maybe there should be a UNIBYTE_TO_CHAR in the alternative branch? > (Doesn't it abort for you? or do you not > build Emacs with --enable-checking?) Oh I certainly do that occasionally, but it's mostly when I've changed something at the C level or have reason to believe that something is broken there. > I could understand why you'd want to _add_ the larger values, but why > replace? Because it seemed pretty clear that the old code intended to use #x3ffffc for testing display of raw bytes but a typo turned it into #x3fffc instead which isn't a raw byte but a multibyte character. That it's an easy mistake to make (done so several times myself). Thus the change fixes that: it now correctly tests #x3ffffc (multibyte raw byte FC) as well as a couple of undisplayable multibyte chars (one C1 control and one astral plane unicode value made undisplayable). Now everything should be described correctly, which wasn't the case before. > As for the bug report which led to display-raw-bytes-as-hex (if that > what you meant) and its discussion, it's bug#27122. Thank you, but I actually meant the one where it was agreed that it was a good idea to display raw bytes and Latin-1 U+0080..009F in the same way. There isn't much of a code trail because it probably never was a conscious decision -- it just ended up being that way -- but apparently the status quo was defended/rationalised at some point. I've now pushed the patch; the code can be improved further if necessary. ^ permalink raw reply [flat|nested] 40+ messages in thread
* bug#58168: string-lessp glitches and inconsistencies 2022-10-04 14:44 ` Mattias Engdegård @ 2022-10-04 16:24 ` Eli Zaretskii 2022-10-06 9:05 ` Mattias Engdegård 0 siblings, 1 reply; 40+ messages in thread From: Eli Zaretskii @ 2022-10-04 16:24 UTC (permalink / raw) To: Mattias Engdegård; +Cc: 58168, larsi > From: Mattias Engdegård <mattias.engdegard@gmail.com> > Date: Tue, 4 Oct 2022 16:44:17 +0200 > Cc: larsi@gnus.org, > 58168@debbugs.gnu.org > > 4 okt. 2022 kl. 13.37 skrev Eli Zaretskii <eliz@gnu.org>: > > > First I needed to fix fallout from making STRING_CHAR intolerant of > > unibyte text, because redisplay-testsuite caused assertion violations > > in string_char_and_length. > > Good catch! Just to satisfy my curiosity: > > > error ("Invalid format operation %%%c", > > - STRING_CHAR ((unsigned char *) format - 1)); > > + multibyte_format > > + ? STRING_CHAR ((unsigned char *) format - 1) > > + : *((unsigned char *) format - 1)); > > This treats unibyte format strings as if they were Latin-1 for the purpose of the error message. No, it doesn't. It shows the problematic characters as raw bytes, as in "%\200" (where \200 is a single character). If you see something different, please show the recipe. > Not very important, of course, but maybe there should be a UNIBYTE_TO_CHAR in the alternative branch? No, that would show the multibyte codepoint, and will confuse users, because the result would look very different from the problematic format spec in this case. > > (Doesn't it abort for you? or do you not > > build Emacs with --enable-checking?) > > Oh I certainly do that occasionally, but it's mostly when I've changed something at the C level or have reason to believe that something is broken there. Please _always_ test changes related to encoding/decoding and character representation conversions in a --enable-checking build. We should have discovered these bugs in time for Emacs 28.2 to be devoid of them. > > I could understand why you'd want to _add_ the larger values, but why > > replace? > > Because it seemed pretty clear that the old code intended to use #x3ffffc for testing display of raw bytes but a typo turned it into #x3fffc instead which isn't a raw byte but a multibyte character. That it's an easy mistake to make (done so several times myself). Who said anything about #x3fffc? The original code had #xfc, the unibyte code for #x3ffffc. I don't see why we shouldn't test both. In the other problematic hunk you replaced \777774 with \374 -- why? > I've now pushed the patch; the code can be improved further if necessary. I've reverted it. Please stop this madness of rushing into installing changes that are still under controversy. ^ permalink raw reply [flat|nested] 40+ messages in thread
* bug#58168: string-lessp glitches and inconsistencies 2022-10-04 16:24 ` Eli Zaretskii @ 2022-10-06 9:05 ` Mattias Engdegård 2022-10-06 11:13 ` Eli Zaretskii 0 siblings, 1 reply; 40+ messages in thread From: Mattias Engdegård @ 2022-10-06 9:05 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 58168, larsi 4 okt. 2022 kl. 18.24 skrev Eli Zaretskii <eliz@gnu.org>: >> This treats unibyte format strings as if they were Latin-1 for the purpose of the error message. > > No, it doesn't. It shows the problematic characters as raw bytes, as > in "%\200" (where \200 is a single character). If you see something > different, please show the recipe. (format-message "%\345" 0) => (error "Invalid format operation %å") where the format string is a unibyte string of two bytes, % and 0xFC, yet the error treats it as the Latin-1 character å. In fact, (format-message "%å" 0) yields the same error string. >> Not very important, of course, but maybe there should be a UNIBYTE_TO_CHAR in the alternative branch? > > No, that would show the multibyte codepoint, and will confuse users, > because the result would look very different from the problematic > format spec in this case. Yes, that's probably right. I suppose the right solution is something like: unsigned char *p = (unsigned char *) format - 1; if (multibyte_format) error ("Invalid format operation %%%c", STRING_CHAR (p)); else error (*p <= 127 ? "Invalid format operation %%%c" : "Invalid format operation char 0x%02x", *p); but perhaps it's a rare error not worth the trouble. (If we don't bother changing it, a little comment saying that we are aware of the glitch may be a good idea.) > Who said anything about #x3fffc? The original code had #xfc, the > unibyte code for #x3ffffc. There seems to be a misunderstanding. The original (and current) code attempts to display char #x3fffc, which is not a raw byte. It's just a typo for #x3ffffc -- not a big deal. Of course I could have retained the 3fffc under a different label, but everyone else reading the test would just assume it was a typo of 3ffffc since 3fffc itself is not very interesting. I replaced it with 10abcd, a wide Unicode value deliberately chosen to be arbitrary-looking. We could use another value if you prefer. > I don't see why we shouldn't test both. > In the other problematic hunk you replaced \777774 with \374 -- why? 3fffc in octal is 777774; when changed to 3ffffc it becomes a raw byte, fc, displayed as \374. ^ permalink raw reply [flat|nested] 40+ messages in thread
* bug#58168: string-lessp glitches and inconsistencies 2022-10-06 9:05 ` Mattias Engdegård @ 2022-10-06 11:13 ` Eli Zaretskii 2022-10-06 12:43 ` Mattias Engdegård 0 siblings, 1 reply; 40+ messages in thread From: Eli Zaretskii @ 2022-10-06 11:13 UTC (permalink / raw) To: Mattias Engdegård; +Cc: 58168, larsi > From: Mattias Engdegård <mattias.engdegard@gmail.com> > Date: Thu, 6 Oct 2022 11:05:51 +0200 > Cc: larsi@gnus.org, > 58168@debbugs.gnu.org > > 4 okt. 2022 kl. 18.24 skrev Eli Zaretskii <eliz@gnu.org>: > > >> This treats unibyte format strings as if they were Latin-1 for the purpose of the error message. > > > > No, it doesn't. It shows the problematic characters as raw bytes, as > > in "%\200" (where \200 is a single character). If you see something > > different, please show the recipe. > > (format-message "%\345" 0) > => (error "Invalid format operation %å") And you want to show %\345 instead? Are you sure this is not the consequence of inserting the error message into a multibyte buffer? > > Who said anything about #x3fffc? The original code had #xfc, the > > unibyte code for #x3ffffc. > > There seems to be a misunderstanding. The original (and current) code attempts to display char #x3fffc, which is not a raw byte. It's just a typo for #x3ffffc -- not a big deal. But your change replaced it with \xfc, which is what I questioned. Why not test both #x3ffffc and #xfc? And the same question about \777777 vs \374. ^ permalink raw reply [flat|nested] 40+ messages in thread
* bug#58168: string-lessp glitches and inconsistencies 2022-10-06 11:13 ` Eli Zaretskii @ 2022-10-06 12:43 ` Mattias Engdegård 2022-10-06 14:34 ` Eli Zaretskii 0 siblings, 1 reply; 40+ messages in thread From: Mattias Engdegård @ 2022-10-06 12:43 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 58168, larsi 6 okt. 2022 kl. 13.13 skrev Eli Zaretskii <eliz@gnu.org>: >> (format-message "%\345" 0) >> => (error "Invalid format operation %å") > > And you want to show %\345 instead? Maybe, or (as the patch suggested) using a different wording for raw bytes. In any case %å is clearly a lie since that character wasn't in the format string. What would you rather see in such a case? > Are you sure this is not the > consequence of inserting the error message into a multibyte buffer? Quite sure. The error message is always produced as multibyte and the %c processing done at doprnt.c:471: case 'c': { int chr = va_arg (ap, int); tem = CHAR_STRING (chr, (unsigned char *) charbuf); where CHAR_STRING renders chr (the %c argument passed to `error`) as a multibyte char to charbuf here. >>> Who said anything about #x3fffc? The original code had #xfc, the >>> unibyte code for #x3ffffc. >> >> There seems to be a misunderstanding. The original (and current) code attempts to display char #x3fffc, which is not a raw byte. It's just a typo for #x3ffffc -- not a big deal. > > But your change replaced it with \xfc, which is what I questioned. Oh, I see -- you are looking at the hunk that changed the labels, not the character tested. When 3fffc was changed into 3ffffc, the "expected" string needed to change accordingly; for the latter, it's \xfc or \374 depending on mode. > Why not test both #x3ffffc and #xfc? And the same question about > \777777 vs \374. Testing #x3ffffc inserts the raw byte #xfc so that takes care of that -- the test already exercised inserting the unibyte raw byte #x80 and the patch didn't change that. I don't think these two cases actually exercise different paths in redisplay since the buffer is multibyte: (insert "\xfc") and (insert (char-to-string #x3ffffc)) should have identical effects on the buffer and hence the display, but it doesn't hurt to have one of each. \777774 is just octal for #x3fffc which was changed into the (intended) #x3ffffc, and \374 is octal for #xfc which is covered as above. Thus, the only case actually removed was #x3fffc since it was a typo, and #x10abcd was put in its place. ^ permalink raw reply [flat|nested] 40+ messages in thread
* bug#58168: string-lessp glitches and inconsistencies 2022-10-06 12:43 ` Mattias Engdegård @ 2022-10-06 14:34 ` Eli Zaretskii 2022-10-07 14:45 ` Mattias Engdegård 0 siblings, 1 reply; 40+ messages in thread From: Eli Zaretskii @ 2022-10-06 14:34 UTC (permalink / raw) To: Mattias Engdegård; +Cc: 58168, larsi > From: Mattias Engdegård <mattias.engdegard@gmail.com> > Date: Thu, 6 Oct 2022 14:43:09 +0200 > Cc: larsi@gnus.org, > 58168@debbugs.gnu.org > > 6 okt. 2022 kl. 13.13 skrev Eli Zaretskii <eliz@gnu.org>: > > >> (format-message "%\345" 0) > >> => (error "Invalid format operation %å") > > > > And you want to show %\345 instead? > > Maybe, or (as the patch suggested) using a different wording for raw bytes. In any case %å is clearly a lie since that character wasn't in the format string. What would you rather see in such a case? I don't think it matters much, because whatever we produce we cannot be sure it will look identical to the original format string. > Oh, I see -- you are looking at the hunk that changed the labels, not the character tested. When 3fffc was changed into 3ffffc, the "expected" string needed to change accordingly; for the latter, it's \xfc or \374 depending on mode. > > > Why not test both #x3ffffc and #xfc? And the same question about > > \777777 vs \374. > > Testing #x3ffffc inserts the raw byte #xfc so that takes care of that -- the test already exercised inserting the unibyte raw byte #x80 and the patch didn't change that. We are talking about a test suite. A test suite doesn't have to assume that the internals work as they should, it should just test that. So testing both sounds to me better than testing just one assuming that this one covers both. > I don't think these two cases actually exercise different paths in redisplay since the buffer is multibyte: Again, this kind of considerations don't have a place in a test suite. What if tomorrow redisplay code will change to have two different paths? > \777774 is just octal for #x3fffc which was changed into the (intended) #x3ffffc, and \374 is octal for #xfc which is covered as above. Normally, yes. But this is a test suite... ^ permalink raw reply [flat|nested] 40+ messages in thread
* bug#58168: string-lessp glitches and inconsistencies 2022-10-06 14:34 ` Eli Zaretskii @ 2022-10-07 14:45 ` Mattias Engdegård 2022-10-07 15:33 ` Eli Zaretskii 0 siblings, 1 reply; 40+ messages in thread From: Mattias Engdegård @ 2022-10-07 14:45 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 58168, larsi [-- Attachment #1: Type: text/plain, Size: 233 bytes --] 6 okt. 2022 kl. 16.34 skrev Eli Zaretskii <eliz@gnu.org>: > I don't think it matters much, because whatever we produce we cannot > be sure it will look identical to the original format string. I agree. What about this patch then? [-- Attachment #2: format-error.diff --] [-- Type: application/octet-stream, Size: 955 bytes --] diff --git a/src/editfns.c b/src/editfns.c index c1414071c7..5a99a40656 100644 --- a/src/editfns.c +++ b/src/editfns.c @@ -3551,10 +3551,15 @@ styled_format (ptrdiff_t nargs, Lisp_Object *args, bool message) || float_conversion || conversion == 'i' || conversion == 'o' || conversion == 'x' || conversion == 'X')) - error ("Invalid format operation %%%c", - multibyte_format - ? STRING_CHAR ((unsigned char *) format - 1) - : *((unsigned char *) format - 1)); + { + unsigned char *p = (unsigned char *) format - 1; + if (multibyte_format) + error ("Invalid format operation %%%c", STRING_CHAR (p)); + else + error ((*p <= 127 ? "Invalid format operation %%%c" + : "Invalid format operation char #x%02x"), + *p); + } else if (! (FIXNUMP (arg) || ((BIGNUMP (arg) || FLOATP (arg)) && conversion != 'c'))) error ("Format specifier doesn't match argument type"); [-- Attachment #3: Type: text/plain, Size: 774 bytes --] Not very important, but at least it should be a (minor) improvement. > A test suite doesn't have to > assume that the internals work as they should, it should just test > that. So testing both sounds to me better than testing just one > assuming that this one covers both. I agree, we should definitely test raw bytes inserted from both unibyte and multibyte strings. The current code doesn't do that, hence my patch. With respect to other values there should be a reasonable qualitative difference between test cases: testing both #x80 and #xfc isn't more useful than testing the display of both the letters 'A' and 'B'. But if you think that the current code is deficient in coverage, please do propose extensions and I'll adapt the patch accordingly. ^ permalink raw reply related [flat|nested] 40+ messages in thread
* bug#58168: string-lessp glitches and inconsistencies 2022-10-07 14:45 ` Mattias Engdegård @ 2022-10-07 15:33 ` Eli Zaretskii 2022-10-08 17:13 ` Mattias Engdegård 0 siblings, 1 reply; 40+ messages in thread From: Eli Zaretskii @ 2022-10-07 15:33 UTC (permalink / raw) To: Mattias Engdegård; +Cc: 58168, larsi > From: Mattias Engdegård <mattias.engdegard@gmail.com> > Date: Fri, 7 Oct 2022 16:45:57 +0200 > Cc: larsi@gnus.org, > 58168@debbugs.gnu.org > > > I don't think it matters much, because whatever we produce we cannot > > be sure it will look identical to the original format string. > > I agree. What about this patch then? > > diff --git a/src/editfns.c b/src/editfns.c > index c1414071c7..5a99a40656 100644 > --- a/src/editfns.c > +++ b/src/editfns.c > @@ -3551,10 +3551,15 @@ styled_format (ptrdiff_t nargs, Lisp_Object *args, bool message) > || float_conversion || conversion == 'i' > || conversion == 'o' || conversion == 'x' > || conversion == 'X')) > - error ("Invalid format operation %%%c", > - multibyte_format > - ? STRING_CHAR ((unsigned char *) format - 1) > - : *((unsigned char *) format - 1)); > + { > + unsigned char *p = (unsigned char *) format - 1; > + if (multibyte_format) > + error ("Invalid format operation %%%c", STRING_CHAR (p)); > + else > + error ((*p <= 127 ? "Invalid format operation %%%c" > + : "Invalid format operation char #x%02x"), > + *p); > + } > else if (! (FIXNUMP (arg) || ((BIGNUMP (arg) || FLOATP (arg)) > && conversion != 'c'))) > error ("Format specifier doesn't match argument type"); I'd prefer to show it in octal, not in hex, since that's the default in Emacs. > > A test suite doesn't have to > > assume that the internals work as they should, it should just test > > that. So testing both sounds to me better than testing just one > > assuming that this one covers both. > > I agree, we should definitely test raw bytes inserted from both unibyte and multibyte strings. The current code doesn't do that, hence my patch. > > With respect to other values there should be a reasonable qualitative difference between test cases: testing both #x80 and #xfc isn't more useful than testing the display of both the letters 'A' and 'B'. But if you think that the current code is deficient in coverage, please do propose extensions and I'll adapt the patch accordingly. I think the more values we test the better. The codepoints between #x80 and #xA0 in particular are treated differently than those above #xA0, so at least one from each range should be beneficial. ^ permalink raw reply [flat|nested] 40+ messages in thread
* bug#58168: string-lessp glitches and inconsistencies 2022-10-07 15:33 ` Eli Zaretskii @ 2022-10-08 17:13 ` Mattias Engdegård 0 siblings, 0 replies; 40+ messages in thread From: Mattias Engdegård @ 2022-10-08 17:13 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 58168, Lars Ingebrigtsen [-- Attachment #1: Type: text/plain, Size: 549 bytes --] 7 okt. 2022 kl. 17.33 skrev Eli Zaretskii <eliz@gnu.org>: > I'd prefer to show it in octal, not in hex Then octal it is, and the patch has now been pushed with that modification. > I think the more values we test the better. The codepoints between > #x80 and #xA0 in particular are treated differently than those above > #xA0, so at least one from each range should be beneficial. The patch already does ensure that we test one from each (#x80 and #xFC); I've now added a third to alleviate your concerns. Updated patch attached. [-- Attachment #2: 0001-Improve-manual-display-tests-of-undisplayable-chars-.patch --] [-- Type: application/octet-stream, Size: 4714 bytes --] From 56a4d76c7f0dfa6a552c9a49f345c09eabc999f5 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org> Date: Fri, 7 Oct 2022 16:41:38 +0200 Subject: [PATCH] Improve manual display tests of undisplayable chars (bug#58168) Test display of multibyte raw bytes, as well as undisplayable multibyte chars (C1 controls and other values). The test still assumes that raw bytes should be displayed identically to undisplayable characters (such as C1 controls) because that is how the display code currently works. * test/manual/redisplay-testsuite.el (test-redisplay-5c-expected-overlay) (test-redisplay-5c-result-overlay) (test-redisplay-5-toggle, test-redisplay-5): Fix typo (#x3fffc) of raw byte value. Add new test value (#xd7). (test-redisplay-6): New. --- test/manual/redisplay-testsuite.el | 48 +++++++++++++++++++++++++----- 1 file changed, 40 insertions(+), 8 deletions(-) diff --git a/test/manual/redisplay-testsuite.el b/test/manual/redisplay-testsuite.el index 01b0a895a4..7449463b44 100644 --- a/test/manual/redisplay-testsuite.el +++ b/test/manual/redisplay-testsuite.el @@ -298,6 +298,8 @@ test-redisplay-5a-expected-overlay (defvar test-redisplay-5a-result-overlay nil) (defvar test-redisplay-5b-expected-overlay nil) (defvar test-redisplay-5b-result-overlay nil) +(defvar test-redisplay-5c-expected-overlay nil) +(defvar test-redisplay-5c-result-overlay nil) (defun test-redisplay-5-toggle (_event) (interactive "e") @@ -305,8 +307,11 @@ test-redisplay-5-toggle (let ((label (if display-raw-bytes-as-hex "\\x80" "\\200"))) (overlay-put test-redisplay-5a-expected-overlay 'display (propertize label 'face 'escape-glyph))) - (let ((label (if display-raw-bytes-as-hex "\\x3fffc" "\\777774"))) + (let ((label (if display-raw-bytes-as-hex "\\xfc" "\\374"))) (overlay-put test-redisplay-5b-expected-overlay 'display + (propertize label 'face 'escape-glyph))) + (let ((label (if display-raw-bytes-as-hex "\\xd7" "\\327"))) + (overlay-put test-redisplay-5c-expected-overlay 'display (propertize label 'face 'escape-glyph)))) (defun test-redisplay-5 () @@ -320,18 +325,44 @@ test-redisplay-5 (test-insert-overlay " " 'display "\200")) (insert "\n\n") (insert " Expected: ") - ;; This tests a large codepoint, to make sure the internal buffer we - ;; use to produce the representation is large enough. - (aset printable-chars #x3fffc nil) (setq test-redisplay-5b-expected-overlay (test-insert-overlay " " 'display - (propertize "\\777774" 'face 'escape-glyph))) + (propertize "\\374" 'face 'escape-glyph))) (insert "\n Result: ") (setq test-redisplay-5b-result-overlay - (test-insert-overlay " " 'display (char-to-string #x3fffc))) + (test-insert-overlay " " 'display (char-to-string #x3ffffc))) (insert "\n\n") - (insert-button "Toggle between octal and hex display" - 'action 'test-redisplay-5-toggle)) + (insert " Expected: ") + (setq test-redisplay-5c-expected-overlay + (test-insert-overlay " " 'display + (propertize "\\327" 'face 'escape-glyph))) + (insert "\n Result: ") + (setq test-redisplay-5c-result-overlay + (test-insert-overlay " " 'display "\xd7")) + (insert "\n\n") + (insert-button "Toggle between octal and hex display for raw bytes" + 'action 'test-redisplay-5-toggle) + (insert "\n\n")) + +(defun test-redisplay-6 () + (insert "Test 6: Display of unprintable Unicode chars:\n\n") + (insert " Expected: ") + (test-insert-overlay " " 'display + (propertize "\\200" 'face 'escape-glyph)) + (insert " (representing U+0100)") + (insert "\n Result: ") + (test-insert-overlay " " 'display "\u0080") + (insert "\n\n") + ;; This tests a large codepoint, to make sure the internal buffer we + ;; use to produce the representation is large enough. + (insert " Expected: ") + (aset printable-chars #x10abcd nil) + (test-insert-overlay " " 'display + (propertize "\\4125715" 'face 'escape-glyph)) + (insert " (representing U+0010ABCD)") + (insert "\n Result: ") + (test-insert-overlay " " 'display "\U0010ABCD") + (insert "\n\n")) (defun test-redisplay () (interactive) @@ -349,6 +380,7 @@ test-redisplay (test-redisplay-3) (test-redisplay-4) (test-redisplay-5) + (test-redisplay-6) (goto-char (point-min)))) ;;; redisplay-testsuite.el ends here -- 2.32.0 (Apple Git-132) ^ permalink raw reply related [flat|nested] 40+ messages in thread
* bug#58168: string-lessp glitches and inconsistencies 2022-10-01 13:37 ` Mattias Engdegård 2022-10-01 13:43 ` Lars Ingebrigtsen @ 2022-10-01 13:51 ` Eli Zaretskii 1 sibling, 0 replies; 40+ messages in thread From: Eli Zaretskii @ 2022-10-01 13:51 UTC (permalink / raw) To: Mattias Engdegård; +Cc: 58168, larsi > From: Mattias Engdegård <mattias.engdegard@gmail.com> > Date: Sat, 1 Oct 2022 15:37:25 +0200 > Cc: 58168@debbugs.gnu.org, > Eli Zaretskii <eliz@gnu.org> > > > Funnily enough, the latter displays in a different way for me, which may > > or may not be a bug: > > > > This is with `display-raw-bytes-as-hex' t. > > You are right, that is completely broken -- display-raw-bytes-as-hex shouldn't affect the display of C1 controls. I think the variable is a misnomer of sorts: the request back when it was introduced to display hex where we usually display octal \nnn escapes. And the latter happens not only for raw bytes. > It seems to be a relic from the pre-Unicode days of Emacs: the code responsible muddles the display of raw bytes and unicode controls. No, I think we decided to keep the display of C1 characters as octal escapes. ^ permalink raw reply [flat|nested] 40+ messages in thread
* bug#58168: string-lessp glitches and inconsistencies 2022-09-30 13:52 ` Lars Ingebrigtsen 2022-09-30 20:12 ` Mattias Engdegård @ 2022-10-01 5:30 ` Eli Zaretskii 1 sibling, 0 replies; 40+ messages in thread From: Eli Zaretskii @ 2022-10-01 5:30 UTC (permalink / raw) To: Lars Ingebrigtsen; +Cc: 58168, mattias.engdegard > Cc: 58168@debbugs.gnu.org > From: Lars Ingebrigtsen <larsi@gnus.org> > Date: Fri, 30 Sep 2022 15:52:12 +0200 > > You also have > > (string 4194176) > => "\200" > "\x80" > => "\200" > > which are kinda equal in some ways, and not in other ways. So are "A" and "𝙰", or "א" and "ℵ": kinda equal in some ways, but not in others. That doesn't mean they should compare equal. ^ permalink raw reply [flat|nested] 40+ messages in thread
end of thread, other threads:[~2022-10-17 12:44 UTC | newest] Thread overview: 40+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2022-09-29 16:24 bug#58168: string-lessp glitches and inconsistencies Mattias Engdegård 2022-09-29 17:00 ` Mattias Engdegård 2022-09-29 17:11 ` Eli Zaretskii 2022-09-30 20:04 ` Mattias Engdegård 2022-10-01 5:22 ` Eli Zaretskii 2022-10-01 19:57 ` Mattias Engdegård 2022-10-02 5:36 ` Eli Zaretskii 2022-10-03 19:48 ` Mattias Engdegård 2022-10-04 5:55 ` Eli Zaretskii 2022-10-04 17:40 ` Richard Stallman 2022-10-04 18:07 ` Eli Zaretskii 2022-10-06 9:05 ` Mattias Engdegård 2022-10-06 11:06 ` Eli Zaretskii 2022-10-07 14:23 ` Mattias Engdegård 2022-10-08 7:35 ` Eli Zaretskii 2022-10-14 14:39 ` Mattias Engdegård 2022-10-14 15:31 ` Eli Zaretskii 2022-10-17 12:44 ` Mattias Engdegård 2022-09-30 13:52 ` Lars Ingebrigtsen 2022-09-30 20:12 ` Mattias Engdegård 2022-10-01 5:34 ` Eli Zaretskii 2022-10-01 11:51 ` Mattias Engdegård 2022-10-01 10:02 ` Lars Ingebrigtsen 2022-10-01 10:12 ` Eli Zaretskii 2022-10-01 13:37 ` Mattias Engdegård 2022-10-01 13:43 ` Lars Ingebrigtsen 2022-10-03 19:48 ` Mattias Engdegård 2022-10-04 10:44 ` Lars Ingebrigtsen 2022-10-04 11:37 ` Eli Zaretskii 2022-10-04 14:44 ` Mattias Engdegård 2022-10-04 16:24 ` Eli Zaretskii 2022-10-06 9:05 ` Mattias Engdegård 2022-10-06 11:13 ` Eli Zaretskii 2022-10-06 12:43 ` Mattias Engdegård 2022-10-06 14:34 ` Eli Zaretskii 2022-10-07 14:45 ` Mattias Engdegård 2022-10-07 15:33 ` Eli Zaretskii 2022-10-08 17:13 ` Mattias Engdegård 2022-10-01 13:51 ` Eli Zaretskii 2022-10-01 5:30 ` Eli Zaretskii
Code repositories for project(s) associated with this external index https://git.savannah.gnu.org/cgit/emacs.git https://git.savannah.gnu.org/cgit/emacs/org-mode.git This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.