unofficial mirror of bug-gnu-emacs@gnu.org 
* bug#27270: display-raw-bytes-as-hex generates ambiguous output for Emacs strings
@ 2017-06-07  3:57 Paul Eggert
  2017-06-07  5:17 ` Eli Zaretskii
  0 siblings, 1 reply; 39+ messages in thread
From: Paul Eggert @ 2017-06-07  3:57 UTC (permalink / raw)
  To: 27270; +Cc: Vasilij Schneidermann

With the default octal display format one can copy text out of a terminal window 
and into an Emacs string, reliably. With the new hex display this doesn't work 
any more, unfortunately. For example, if I run this shell script:

printf 'x\2205y\n' >foo.txt
LC_ALL=C emacs -nw --color=no --eval '(progn (setq display-raw-bytes-as-hex t) 
(find-file-literally "foo.txt"))'

then on the terminal display I see:

x\x905y

If I cut and paste this (using my windowing system) into an Emacs string, like this:

"x\x905y"

and then evaluate the string, the result is the string "xअy", that is, a 
3-character string with the characters "x", "अ", and "y", where the middle 
character is U+0905 DEVANAGARI LETTER A. This is an incorrect representation, as 
the buffer actually contains the four characters "x", "\x90", "5", and "y". The 
problem is that the string has glued together the representation of the 
character "\x90" to the representation of the character "5", resulting in the 
representation of the character "\x905" which is not accurate.

Please change the behavior of display-raw-bytes-as-hex so that it is not 
ambiguous in this way.

A simple solution would be to display this instead:

x\x90\x35y

though that is awkward because it means the ASCII 0-9, a-f, A-F would be 
displayed as hexadecimal escapes when they follow another hexadecimal escape. 
Perhaps we can think of a better approach. One possibility would be to define 
and use a new string escape \Xxx that contains at most two hex digits.

By the way, I expected display-raw-bytes-as-hex to affect how Emacs displays 
Emacs strings, too. Shouldn't it?





^ permalink raw reply	[flat|nested] 39+ messages in thread

* bug#27270: display-raw-bytes-as-hex generates ambiguous output for Emacs strings
  2017-06-07  3:57 bug#27270: display-raw-bytes-as-hex generates ambiguous output for Emacs strings Paul Eggert
@ 2017-06-07  5:17 ` Eli Zaretskii
  2017-06-08  0:49   ` Paul Eggert
  0 siblings, 1 reply; 39+ messages in thread
From: Eli Zaretskii @ 2017-06-07  5:17 UTC (permalink / raw)
  To: Paul Eggert; +Cc: 27270, v.schneidermann

> From: Paul Eggert <eggert@cs.ucla.edu>
> Date: Tue, 6 Jun 2017 20:57:51 -0700
> Cc: Vasilij Schneidermann <v.schneidermann@gmail.com>
> 
> then on the terminal display I see:
> 
> x\x905y
> 
> If I cut and paste this (using my windowing system) into an Emacs string, like this:
> 
> "x\x905y"
> 
> and then evaluate the string, the result is the string "xअy"

display-raw-bytes-as-hex is a display-only feature, as its name says;
it isn't supposed to affect evaluation or the Lisp reader.  So I'm
unsure why you expected it to affect evaluation.  It's the same if you
define a display table to display one character as another, and then
expect Emacs to perform the opposite transformation when it reads
characters or strings.

> A simple solution would be to display this instead:
> 
> x\x90\x35y

That would mean display-raw-bytes-as-hex is "viral", in that it
affects not just the raw byte, but also the next character.  That
sounds sub-optimal, as it makes reading the result harder.

> though that is awkward because it means the ASCII 0-9, a-f, A-F would be 
> displayed as hexadecimal escapes when they follow another hexadecimal escape. 

Exactly.

> By the way, I expected display-raw-bytes-as-hex to affect how Emacs displays 
> Emacs strings, too. Shouldn't it?

What do you mean by "Emacs strings"?  Buffer text is a string, isn't
it?






* bug#27270: display-raw-bytes-as-hex generates ambiguous output for Emacs strings
  2017-06-07  5:17 ` Eli Zaretskii
@ 2017-06-08  0:49   ` Paul Eggert
  2017-06-08  1:07     ` npostavs
  0 siblings, 1 reply; 39+ messages in thread
From: Paul Eggert @ 2017-06-08  0:49 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 27270, v.schneidermann

On 06/06/2017 10:17 PM, Eli Zaretskii wrote:

> What do you mean by "Emacs strings"?

I meant that if I prefer hex to octal for buffer escapes, then when I 
type this into *scratch*:

   (format "J%cK" ?\u0080) C-j

I almost surely would prefer to see the result displayed as hexadecimal 
than as "J\200K" (the current behavior). People who prefer hex in one 
place are quite likely to prefer it in the other.

Here's another suggestion for the buffer problem: separate problematic 
character pairs by "\ " in the buffer display. That way, my test case 
would be displayed this way in a buffer:

   x\x90\ 5y

and this will work as expected when cut and pasted into a string, due to 
the backslash-space syntax already supported for strings. This buffer 
syntax would be less confusing than the "x\x905y" syntax that is 
currently used. Under this approach character pair XY is considered to 
be problematic if X is displayed with a hexadecimal escape and Y is a 
hexadecimal digit.
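
The separator rule above can be sketched as a small display model. This
is only an illustration in Python, not Emacs's display code: the names
render and is_raw are hypothetical, and treating code points 0x80
through 0xFF as "raw bytes" is a simplifying assumption.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")

def render(codes, is_raw):
    r"""Render code points, writing those flagged by is_raw as \xNN.

    A backslash-space separator is emitted between a hex escape and a
    following literal hex digit, so the escape cannot absorb that digit.
    """
    parts = []
    prev_was_escape = False
    for code in codes:
        if is_raw(code):
            parts.append("\\x%x" % code)
            prev_was_escape = True
        else:
            ch = chr(code)
            if prev_was_escape and ch in HEX_DIGITS:
                parts.append("\\ ")  # separator: X is an escape, Y is a hex digit
            parts.append(ch)
            prev_was_escape = False
    return "".join(parts)

# The buffer from the original report: "x", byte 0x90, "5", "y".
# Assumption: 0x80..0xFF stand in for raw bytes in this model.
print(render([0x78, 0x90, 0x35, 0x79], lambda c: 0x80 <= c <= 0xFF))
# prints: x\x90\ 5y
```

A character that is not a hex digit, such as "y", needs no separator, so
most text would be displayed exactly as it is today.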






* bug#27270: display-raw-bytes-as-hex generates ambiguous output for Emacs strings
  2017-06-08  0:49   ` Paul Eggert
@ 2017-06-08  1:07     ` npostavs
  2017-06-08 15:20       ` Eli Zaretskii
  2017-06-08 15:56       ` Paul Eggert
  0 siblings, 2 replies; 39+ messages in thread
From: npostavs @ 2017-06-08  1:07 UTC (permalink / raw)
  To: Paul Eggert; +Cc: v.schneidermann, 27270

Paul Eggert <eggert@cs.ucla.edu> writes:

> On 06/06/2017 10:17 PM, Eli Zaretskii wrote:
>
>> What do you mean by "Emacs strings"?
>
> I meant that if I prefer hex to octal for buffer escapes, then when I
> type this into *scratch*:
>
>   (format "J%cK" ?\u0080) C-j
>
> I almost surely would prefer to see the result displayed as
> hexadecimal than as "J\200K" (the current behavior).

display-raw-bytes-as-hex does affect the result display for me (of
course, since the result goes into the buffer), doesn't it for you?






* bug#27270: display-raw-bytes-as-hex generates ambiguous output for Emacs strings
  2017-06-08  1:07     ` npostavs
@ 2017-06-08 15:20       ` Eli Zaretskii
  2017-06-08 15:56       ` Paul Eggert
  1 sibling, 0 replies; 39+ messages in thread
From: Eli Zaretskii @ 2017-06-08 15:20 UTC (permalink / raw)
  To: npostavs; +Cc: 27270, v.schneidermann, eggert

> From: npostavs@users.sourceforge.net
> Cc: Eli Zaretskii <eliz@gnu.org>,  27270@debbugs.gnu.org,  v.schneidermann@gmail.com
> Date: Wed, 07 Jun 2017 21:07:25 -0400
> 
> > I meant that if I prefer hex to octal for buffer escapes, then when I
> > type this into *scratch*:
> >
> >   (format "J%cK" ?\u0080) C-j
> >
> > I almost surely would prefer to see the result displayed as
> > hexadecimal than as "J\200K" (the current behavior).
> 
> display-raw-bytes-as-hex does affect the result display for me (of
> course, since the result goes into the buffer), doesn't it for you?

Likewise here.






* bug#27270: display-raw-bytes-as-hex generates ambiguous output for Emacs strings
  2017-06-08  1:07     ` npostavs
  2017-06-08 15:20       ` Eli Zaretskii
@ 2017-06-08 15:56       ` Paul Eggert
  2017-06-08 16:11         ` Eli Zaretskii
  2017-06-10 22:52         ` npostavs
  1 sibling, 2 replies; 39+ messages in thread
From: Paul Eggert @ 2017-06-08 15:56 UTC (permalink / raw)
  To: npostavs; +Cc: v.schneidermann, 27270

On 06/07/2017 06:07 PM, npostavs@users.sourceforge.net wrote:
> display-raw-bytes-as-hex does affect the result display for me (of
> course, since the result goes into the buffer), doesn't it for you?


Sorry, it didn't when I tried it earlier, but apparently I messed up. 
Yes, it does affect the display.

But this means the problem is even worse than I thought. If I evaluate 
this in *scratch* in a terminal session running emacs -nw:

(setq display-raw-bytes-as-hex t) C-j
(format "%c%c" ?\u0090 ?5) C-j

Emacs displays this:

"\x905"

which is the wrong string visually. And if I cut this string out of the 
terminal window and paste it into another terminal window running Emacs, 
I'll get "अ" (a string containing the single character U+0905 DEVANAGARI 
LETTER A), which is indeed the wrong string. The string should be 
displayed unambiguously, either like this:

"\x90\ 5"

or via some other means.

The bottom line is that the visual display of buffers and strings should 
continue to be unambiguous even when display-raw-bytes-as-hex is t.






* bug#27270: display-raw-bytes-as-hex generates ambiguous output for Emacs strings
  2017-06-08 15:56       ` Paul Eggert
@ 2017-06-08 16:11         ` Eli Zaretskii
  2017-06-08 16:24           ` Paul Eggert
  2017-06-10 22:52         ` npostavs
  1 sibling, 1 reply; 39+ messages in thread
From: Eli Zaretskii @ 2017-06-08 16:11 UTC (permalink / raw)
  To: Paul Eggert; +Cc: 27270, v.schneidermann, npostavs

> Cc: Eli Zaretskii <eliz@gnu.org>, 27270@debbugs.gnu.org,
>  v.schneidermann@gmail.com
> From: Paul Eggert <eggert@cs.ucla.edu>
> Date: Thu, 8 Jun 2017 08:56:31 -0700
> 
> (setq display-raw-bytes-as-hex t) C-j
> (format "%c%c" ?\u0090 ?5) C-j
> 
> Emacs displays this:
> 
> "\x905"
> 
> which is the wrong string visually.

How is that different from "\2205" you get under the default settings?

> The string should be 
> displayed unambiguously, either like this:
> 
> "\x90\ 5"
> 
> or via some other means.

We do use "some other means": the raw byte has a different face.  But
if you evaluate the above in *scratch*, you won't see that because of
font-lock.  Turn off font-lock-mode, and you will clearly see where
the raw byte ends and "normal" text begins.






* bug#27270: display-raw-bytes-as-hex generates ambiguous output for Emacs strings
  2017-06-08 16:11         ` Eli Zaretskii
@ 2017-06-08 16:24           ` Paul Eggert
  2017-06-08 18:59             ` Eli Zaretskii
  0 siblings, 1 reply; 39+ messages in thread
From: Paul Eggert @ 2017-06-08 16:24 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 27270, v.schneidermann, npostavs

On 06/08/2017 09:11 AM, Eli Zaretskii wrote:
>> (setq display-raw-bytes-as-hex t) C-j
>> (format "%c%c" ?\u0090 ?5) C-j
>>
>> Emacs displays this:
>>
>> "\x905"
>>
>> which is the wrong string visually.
> How is that different from "\2205" you get under the default settings?

When I cut and paste "\2205" into another Emacs, it evaluates to the 
same two-character string that I started off with because octal escapes 
are limited to 3 octal digits. When I cut and paste "\x905" I get a 
one-character string because there is no limit to the length of 
hexadecimal escapes. This is a problem, because cut-and-paste should 
continue to copy text accurately even when I'm using terminal windows.
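
The difference between the two escapes can be illustrated with a toy
model of the reading rules. This is a sketch in Python, not Emacs's
actual Lisp reader: the name read_escapes is hypothetical, and only
\ooo (at most three octal digits) and \x (any number of hex digits) are
modelled.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")
OCT_DIGITS = set("01234567")

def read_escapes(text):
    r"""Decode a tiny subset of string escapes into a list of code points.

    \ooo stops after three octal digits; \x consumes every hex digit
    that follows. Anything else is taken literally.
    """
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            nxt = text[i + 1]
            if nxt == "x":
                j = i + 2
                while j < len(text) and text[j] in HEX_DIGITS:
                    j += 1  # greedy: no upper limit on the digit count
                if j > i + 2:
                    out.append(int(text[i + 2:j], 16))
                    i = j
                    continue
            elif nxt in OCT_DIGITS:
                j = i + 1
                while j < len(text) and j < i + 4 and text[j] in OCT_DIGITS:
                    j += 1  # bounded: at most three octal digits
                out.append(int(text[i + 1:j], 8))
                i = j
                continue
        out.append(ord(ch))
        i += 1
    return out

print(read_escapes(r"\2205"))  # [144, 53]: byte 0x90, then "5"
print(read_escapes(r"\x905"))  # [2309]: the single code point 0x905
```

Under this model the octal form survives a round trip through copy and
paste, while the hex form silently merges with the digit that follows.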

>> The string should be
>> displayed unambiguously, either like this:
>>
>> "\x90\ 5"
>>
>> or via some other means.
> We do use "some other means": the raw byte has a different face.

That doesn't help when --color=no is specified, or in terminal sessions 
that do not support colors. And the colors, even when present, do not 
survive cutting and pasting, which copies the text without colors. So 
this is a real problem.

> But if you evaluate the above in *scratch*, you won't see that because of
> font-lock.  Turn off font-lock-mode, and you will clearly see where
> the raw byte ends and "normal" text begins.

Turning off font-lock-mode doesn't help when colors are disabled. I 
often run with colors disabled, since my terminal color scheme disagrees 
with Emacs's and I prefer monochrome anyway. So this ambiguity will be a 
real pain for me.







* bug#27270: display-raw-bytes-as-hex generates ambiguous output for Emacs strings
  2017-06-08 16:24           ` Paul Eggert
@ 2017-06-08 18:59             ` Eli Zaretskii
  2017-06-08 19:43               ` Paul Eggert
  0 siblings, 1 reply; 39+ messages in thread
From: Eli Zaretskii @ 2017-06-08 18:59 UTC (permalink / raw)
  To: Paul Eggert; +Cc: 27270, v.schneidermann, npostavs

> Cc: npostavs@users.sourceforge.net, 27270@debbugs.gnu.org,
>  v.schneidermann@gmail.com
> From: Paul Eggert <eggert@cs.ucla.edu>
> Date: Thu, 8 Jun 2017 09:24:56 -0700
> 
> >> "\x905"
> >>
> >> which is the wrong string visually.
> > How is that different from "\2205" you get under the default settings?
> 
> When I cut and paste "\2205" into another Emacs, it evaluates to the 
> same two-character string that I started off with because octal escapes 
> are limited to 3 octal digits.

That's a different issue.  You said "\x905" was wrong visually, so I
asked how is that different, visually, from "\2205".

> When I cut and paste "\x905" I get a 
> one-character string because there is no limit to the length of 
> hexadecimal escapes. This is a problem, because cut-and-paste should 
> continue to copy text accurately even when I'm using terminal windows.

Same thing happens when you copy/paste from an Emacs window which uses
a display table: the pasted string will be different from the original
one.  I believe I already pointed that out in this discussion.

> >> "\x90\ 5"
> >>
> >> or via some other means.
> > We do use "some other means": the raw byte has a different face.
> 
> That doesn't help when --color=no is specified, or in terminal sessions 
> that do not support colors.

In those cases, the octal notation has the same visual problems.

> I prefer monochrome anyway. So this ambiguity will be a real pain
> for me.

I still don't understand how this is different from the octal
notation, but if it is, you can always stay with the default octal
display.  That's what I do.






* bug#27270: display-raw-bytes-as-hex generates ambiguous output for Emacs strings
  2017-06-08 18:59             ` Eli Zaretskii
@ 2017-06-08 19:43               ` Paul Eggert
  2017-06-08 19:56                 ` Eli Zaretskii
  0 siblings, 1 reply; 39+ messages in thread
From: Paul Eggert @ 2017-06-08 19:43 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 27270, v.schneidermann, npostavs

On 06/08/2017 11:59 AM, Eli Zaretskii wrote:
> That's a different issue. You said "\x905" was wrong visually, so I
> asked how is that different, visually, from "\2205".

It's wrong visually, because I know the syntax for strings in Emacs 
Lisp, and I know that "\x905" is supposed to be a 1-character string 
whereas "\2205" is a two-character string.

> Same thing happens when you copy/paste from an Emacs window which uses
> a display table

The difference is that I don't use display tables and don't want to use 
them. In contrast, I would like to use hexadecimal display, if it worked 
as well as octal does (which it does not).







* bug#27270: display-raw-bytes-as-hex generates ambiguous output for Emacs strings
  2017-06-08 19:43               ` Paul Eggert
@ 2017-06-08 19:56                 ` Eli Zaretskii
  2017-06-08 20:35                   ` Paul Eggert
  0 siblings, 1 reply; 39+ messages in thread
From: Eli Zaretskii @ 2017-06-08 19:56 UTC (permalink / raw)
  To: Paul Eggert; +Cc: 27270, v.schneidermann, npostavs

> Cc: npostavs@users.sourceforge.net, 27270@debbugs.gnu.org,
>  v.schneidermann@gmail.com
> From: Paul Eggert <eggert@cs.ucla.edu>
> Date: Thu, 8 Jun 2017 12:43:38 -0700
> 
> On 06/08/2017 11:59 AM, Eli Zaretskii wrote:
> > That's a different issue. You said "\x905" was wrong visually, so I
> > asked how is that different, visually, from "\2205".
> 
> It's wrong visually, because I know the syntax for strings in Emacs 
> Lisp, and I know that "\x905" is supposed to be a 1-character string 
> whereas "\2205" is a two-character string.

How do you know "\2205" is a two character string?

What about this:

  (aset printable-chars #x3fffc nil) C-j
  (format "%c%c" #x3fffc ?5) C-j

Where does the octal codepoint end now?

> > Same thing happens when you copy/paste from an Emacs window which uses
> > a display table
> 
> The difference is that I don't use display tables and don't want to use 
> them. In contrast, I would like to use hexadecimal display, if it worked 
> as well as octal does (which it does not).

Then we need to code a separate feature in the Lisp reader, I think.






* bug#27270: display-raw-bytes-as-hex generates ambiguous output for Emacs strings
  2017-06-08 19:56                 ` Eli Zaretskii
@ 2017-06-08 20:35                   ` Paul Eggert
  2017-06-09  6:00                     ` Eli Zaretskii
  0 siblings, 1 reply; 39+ messages in thread
From: Paul Eggert @ 2017-06-08 20:35 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 27270, v.schneidermann, npostavs

On 06/08/2017 12:56 PM, Eli Zaretskii wrote:
> How do you know "\2205" is a two character string

Because I use Emacs out of the box, with the default printable-chars.

>
>> The difference is that I don't use display tables and don't want to use
>> them. In contrast, I would like to use hexadecimal display, if it worked
>> as well as octal does (which it does not).
> Then we need to code a separate feature in the Lisp reader, I think.

What do you think of using capital X for hexadecimal escapes with at 
most two digits? That way, "\X905" would be a two-character string, 
which is what is wanted here. Or we could use small h for hexadecimal, 
and "\h905".

If we were feeling ambitious and concise, we could use no character at 
all and upper-case hex digits for bytes in the range 0x80 through 0xFF; 
this would be unambiguous in strings (the example would be "\905"). This 
may be a bridge too far, though.







* bug#27270: display-raw-bytes-as-hex generates ambiguous output for Emacs strings
  2017-06-08 20:35                   ` Paul Eggert
@ 2017-06-09  6:00                     ` Eli Zaretskii
  2017-06-09 23:44                       ` Paul Eggert
  0 siblings, 1 reply; 39+ messages in thread
From: Eli Zaretskii @ 2017-06-09  6:00 UTC (permalink / raw)
  To: Paul Eggert; +Cc: 27270, v.schneidermann, npostavs

> Cc: npostavs@users.sourceforge.net, 27270@debbugs.gnu.org,
>  v.schneidermann@gmail.com
> From: Paul Eggert <eggert@cs.ucla.edu>
> Date: Thu, 8 Jun 2017 13:35:45 -0700
> 
> On 06/08/2017 12:56 PM, Eli Zaretskii wrote:
> > How do you know "\2205" is a two character string
> 
> Because I use Emacs out of the box, with the default printable-chars.

That's just sheer luck, then, not a general solution that works for
everybody.  And it's not unimaginable that we will mark more
codepoints printable at some point, given some development in the
Unicode standard or in Emacs.

> >> The difference is that I don't use display tables and don't want to use
> >> them. In contrast, I would like to use hexadecimal display, if it worked
> >> as well as octal does (which it does not).
> > Then we need to code a separate feature in the Lisp reader, I think.
> 
> What do you think of using capital X for hexadecimal escapes with at 
> most two digits? That way, "\X905" would be a two-character string, 
> which is what is wanted here. Or we could use small h for hexadecimal, 
> and "\h905".

I'm okay with that, but I'm not sure I understand how this fixes your
problem.  Can you explain?

> If we were feeling ambitious and concise, we could use no character at 
> all and upper-case hex digits for bytes in the range 0x80 through 0xFF; 
> this would be unambiguous in strings (the example would be "\905"). This 
> may be a bridge too far, though.

Too far, I agree.






* bug#27270: display-raw-bytes-as-hex generates ambiguous output for Emacs strings
  2017-06-09  6:00                     ` Eli Zaretskii
@ 2017-06-09 23:44                       ` Paul Eggert
  2017-06-10  7:24                         ` Eli Zaretskii
  2022-04-23 14:00                         ` Lars Ingebrigtsen
  0 siblings, 2 replies; 39+ messages in thread
From: Paul Eggert @ 2017-06-09 23:44 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 27270, v.schneidermann, npostavs

Eli Zaretskii wrote:
>> What do you think of using capital X for hexadecimal escapes with at
>> most two digits? That way, "\X905" would be a two-character string,
>> which is what is wanted here. Or we could use small h for hexadecimal,
>> and "\h905".
> I'm okay with that, but I'm not sure I understand how this fixes your
> problem.  Can you explain?
> 

The idea is to add a new \X escape for character constants and strings. This 
escape would allow at most two hexadecimal digits, rather than the unlimited 
number of digits that \x does. For example, the Lisp string "\XABC" would be 
equivalent to the Lisp string "\xAB\ C", that is, it would be a two-character 
string containing the character U+00AB LEFT POINTING GUILLEMET followed by the 
character U+0043 LATIN CAPITAL LETTER C.

Also, display-raw-bytes-as-hex would cause raw bytes to be displayed with this 
new X escape, rather than with the x escape.

This would fix my problem, since I would continue to be able to copy text 
displayed in a terminal window, and paste it into an Emacs string, and get the 
text unaltered even if display-raw-bytes-as-hex is t.
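
As a rough illustration of the proposed rule, here is a toy reader in
Python. It is not a patch to the Lisp reader: the name read_capital_x
is hypothetical, and \X itself is only a proposal in this thread. The
only point modelled is that the escape stops after two hex digits.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")

def read_capital_x(text):
    r"""Decode the proposed \X escape, which takes at most two hex digits.

    Everything else is taken literally; no other escapes are modelled.
    """
    out = []
    i = 0
    while i < len(text):
        if text.startswith("\\X", i):
            j = i + 2
            while j < len(text) and j < i + 4 and text[j] in HEX_DIGITS:
                j += 1  # bounded: stop after two hex digits
            if j > i + 2:
                out.append(int(text[i + 2:j], 16))
                i = j
                continue
        out.append(ord(text[i]))
        i += 1
    return out

print(read_capital_x(r"\XABC"))  # [171, 67]: U+00AB, then "C"
print(read_capital_x(r"\X905"))  # [144, 53]: byte 0x90, then "5"
```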






* bug#27270: display-raw-bytes-as-hex generates ambiguous output for Emacs strings
  2017-06-09 23:44                       ` Paul Eggert
@ 2017-06-10  7:24                         ` Eli Zaretskii
  2017-06-11  0:04                           ` Paul Eggert
  2022-04-23 14:00                         ` Lars Ingebrigtsen
  1 sibling, 1 reply; 39+ messages in thread
From: Eli Zaretskii @ 2017-06-10  7:24 UTC (permalink / raw)
  To: Paul Eggert; +Cc: 27270, v.schneidermann, npostavs

> Cc: npostavs@users.sourceforge.net, 27270@debbugs.gnu.org,
>  v.schneidermann@gmail.com
> From: Paul Eggert <eggert@cs.ucla.edu>
> Date: Fri, 9 Jun 2017 16:44:46 -0700
> 
> The idea is to add a new \X escape for character constants and strings. This 
> escape would allow at most two hexadecimal digits, rather than the unlimited 
> number of digits that \x does. For example, the Lisp string "\XABC" would be 
> equivalent to the Lisp string "\xAB\ C", that is, it would be a two-character 
> string containing the character U+00AB LEFT POINTING GUILLEMET followed by the 
> character U+0043 LATIN CAPITAL LETTER C.

So your proposal would mean a change to the Lisp reader to support
such escapes, right?  If so, isn't such a change
backward-incompatible?

> Also, display-raw-bytes-as-hex would cause raw bytes to be displayed with this 
> new X escape, rather than with the x escape.

It could only do that for codepoints below 256 decimal, so that
limitation should be taken into account when deciding on the proposal.






* bug#27270: display-raw-bytes-as-hex generates ambiguous output for Emacs strings
  2017-06-08 15:56       ` Paul Eggert
  2017-06-08 16:11         ` Eli Zaretskii
@ 2017-06-10 22:52         ` npostavs
  2017-06-11  0:10           ` Paul Eggert
  1 sibling, 1 reply; 39+ messages in thread
From: npostavs @ 2017-06-10 22:52 UTC (permalink / raw)
  To: Paul Eggert; +Cc: v.schneidermann, 27270

severity 27270 wishlist
quit

Paul Eggert <eggert@cs.ucla.edu> writes:

> But this means the problem is even worse than I thought. If I evaluate
> this in *scratch* in a terminal session running emacs -nw:
>
> (setq display-raw-bytes-as-hex t) C-j
> (format "%c%c" ?\u0090 ?5) C-j

I wonder what you do about low bytes, as in (format "^G%c" ?\a).  Do
those just not come up very much?  It's too bad there's no copying
counterpart to bracketed paste mode...






* bug#27270: display-raw-bytes-as-hex generates ambiguous output for Emacs strings
  2017-06-10  7:24                         ` Eli Zaretskii
@ 2017-06-11  0:04                           ` Paul Eggert
  2017-06-11 14:48                             ` Eli Zaretskii
  0 siblings, 1 reply; 39+ messages in thread
From: Paul Eggert @ 2017-06-11  0:04 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 27270, v.schneidermann, npostavs

On 06/10/2017 12:24 AM, Eli Zaretskii wrote:
> So your proposal would mean a change to the Lisp reader to support
> such escapes, right?  If so, isn't such a change
> backward-incompatible?

Yes, but only in the sense that undocumented escapes evaluate to 
themselves, e.g., "\F" is currently the same as "F" in Emacs Lisp 
because there is no escape sequence \F currently defined for character 
constants. But there's nothing new here, e.g., when we added "\N{...}" 
last year we changed the interpretation of the formerly-undocumented \N 
escape.

>> Also, display-raw-bytes-as-hex would cause raw bytes to be displayed with this
>> new X escape, rather than with the x escape.
> It could only do that for codepoints below 256 decimal, so that
> limitation should be taken into account when deciding on the proposal.

Ouch, I hadn't thought of that.

Wait -- doesn't that mean that "display-raw-bytes-as-hex" is a 
misleading name, because it affects the display not only of raw bytes, 
but of other undisplayable characters? Shouldn't we change its name to 
something more generic and more accurate, like "display-characters-as-hex"?

Anyway, to address the point you raised: how about a different idea? We 
extend the existing \x syntax in strings so that \x{dddd} has the same 
meaning as "\xdddd", except that the "}" terminates the escape. This 
syntax is used by Perl and so is in the same family as \N{...}. We also 
change display-raw-bytes-as-hex to use this new syntax when a character 
is immediately followed by a hexadecimal digit. That way, most 
characters are displayed as before, but my problematic example is 
displayed as "x\x{90}5y", which is a good visual cue of the unusual 
situation.
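
A sketch of that display rule, written in Python purely as a model: the
names render_braced and is_raw are hypothetical, \x{...} is only the
syntax proposed here, and treating 0x80 through 0xFF as raw bytes is a
simplifying assumption.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")

def render_braced(codes, is_raw):
    r"""Render flagged code points as \xNN, or as \x{NN} when the next
    character is a literal hex digit that the escape would otherwise absorb.
    """
    parts = []
    for idx, code in enumerate(codes):
        if not is_raw(code):
            parts.append(chr(code))
            continue
        nxt = codes[idx + 1] if idx + 1 < len(codes) else None
        followed_by_hex_digit = (
            nxt is not None and not is_raw(nxt) and chr(nxt) in HEX_DIGITS
        )
        if followed_by_hex_digit:
            parts.append("\\x{%x}" % code)  # braces terminate the escape
        else:
            parts.append("\\x%x" % code)    # unchanged from today's display
    return "".join(parts)

raw = lambda c: 0x80 <= c <= 0xFF  # assumption: stand-in for raw bytes
print(render_braced([0x78, 0x90, 0x35, 0x79], raw))  # x\x{90}5y
print(render_braced([0x78, 0x90, 0x79], raw))        # x\x90y
```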







* bug#27270: display-raw-bytes-as-hex generates ambiguous output for Emacs strings
  2017-06-10 22:52         ` npostavs
@ 2017-06-11  0:10           ` Paul Eggert
  0 siblings, 0 replies; 39+ messages in thread
From: Paul Eggert @ 2017-06-11  0:10 UTC (permalink / raw)
  To: npostavs; +Cc: v.schneidermann, 27270

On 06/10/2017 03:52 PM, npostavs@users.sourceforge.net wrote:
> I wonder what you do about low bytes, as in (format "^G%c" ?\a).  Do
> those just not come up very much?

They didn't in my examples. :-)  But yes, they do happen, it's just that 
when they mess things up it tends to be more obvious. It might be nice, 
I suppose, if there were an option to make them not happen.







* bug#27270: display-raw-bytes-as-hex generates ambiguous output for Emacs strings
  2017-06-11  0:04                           ` Paul Eggert
@ 2017-06-11 14:48                             ` Eli Zaretskii
  2017-06-11 17:26                               ` Paul Eggert
  0 siblings, 1 reply; 39+ messages in thread
From: Eli Zaretskii @ 2017-06-11 14:48 UTC (permalink / raw)
  To: Paul Eggert; +Cc: 27270, v.schneidermann, npostavs

> Cc: npostavs@users.sourceforge.net, 27270@debbugs.gnu.org,
>  v.schneidermann@gmail.com
> From: Paul Eggert <eggert@cs.ucla.edu>
> Date: Sat, 10 Jun 2017 17:04:40 -0700
> 
> On 06/10/2017 12:24 AM, Eli Zaretskii wrote:
> > So your proposal would mean a change to the Lisp reader to support
> > such escapes, right?  If so, isn't such a change
> > backward-incompatible?
> 
> Yes, but only in the sense that undocumented escapes evaluate to 
> themselves, e.g., "\F" is currently the same as "F" in Emacs Lisp 
> because there is no escape sequence \F currently defined for character 
> constants. But there's nothing new here, e.g., when we added "\N{...}" 
> last year we changed the interpretation of the formerly-undocumented \N 
> escape.

Then maybe the new hex display should use the \N{U+nnn} format?

> >> Also, display-raw-bytes-as-hex would cause raw bytes to be displayed with this
> >> new X escape, rather than with the x escape.
> > It could only do that for codepoints below 256 decimal, so that
> > limitation should be taken into account when deciding on the proposal.
> 
> Ouch, I hadn't thought of that.
> 
> Wait -- doesn't that mean that "display-raw-bytes-as-hex" is a 
> misleading name, because it affects the display not only of raw bytes, 
> but of other undisplayable characters?

That's true, but since the chances of a _user_ changing the
printable-chars char-table are pretty slim, I didn't think it was
justified to obfuscate the name.

> Shouldn't we change its name to 
> something more generic and more accurate, like "display-characters-as-hex"?

Codepoints whose printable-chars entry is nil cannot in good faith be
called "characters", IMO.  "Codepoints", maybe?  But again, that makes
the discoverability harder, so I'm not sure it's worth the hassle.

> Anyway, to address the point you raised: how about a different idea? We 
> extend the existing \x syntax in strings so that \x{dddd} has the same 
> meaning as "\xdddd", except that the "}" terminates the escape. This 
> syntax is used by Perl and so is in the same family as \N{...}. We also 
> change display-raw-bytes-as-hex to use this new syntax when a character 
> is immediately followed by a hexadecimal digit. That way, most 
> characters are displayed as before, but my problematic example is 
> displayed as "x\x{90}5y", which is a good visual cue of the unusual 
> situation.

See above: why not \N{U+...}?  The only downside is that it's much
longer than \xNN.  Could be another option, perhaps.






* bug#27270: display-raw-bytes-as-hex generates ambiguous output for Emacs strings
  2017-06-11 14:48                             ` Eli Zaretskii
@ 2017-06-11 17:26                               ` Paul Eggert
  2017-09-02 13:25                                 ` Eli Zaretskii
  0 siblings, 1 reply; 39+ messages in thread
From: Paul Eggert @ 2017-06-11 17:26 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 27270, v.schneidermann, npostavs

Eli Zaretskii wrote:
> Then maybe the new hex display should use the \N{U+nnn} format?

If we're going to do that, we might as well use \unnnn, which is shorter. A 
downside of either syntax, though, is the implication that the raw byte is 
intended to be Unicode, which it typically is not. That is partly why I was 
thinking \x{nn} would be better: it'd be clearer to users.






* bug#27270: display-raw-bytes-as-hex generates ambiguous output for Emacs strings
  2017-06-11 17:26                               ` Paul Eggert
@ 2017-09-02 13:25                                 ` Eli Zaretskii
  0 siblings, 0 replies; 39+ messages in thread
From: Eli Zaretskii @ 2017-09-02 13:25 UTC (permalink / raw)
  To: Paul Eggert; +Cc: v.schneidermann, 27270, npostavs

unblock 24655 by 27270
thanks

> From: Paul Eggert <eggert@cs.ucla.edu>
> Date: Sun, 11 Jun 2017 10:26:28 -0700
> Cc: v.schneidermann@gmail.com, 27270@debbugs.gnu.org,
>  npostavs@users.sourceforge.net
> 
> Eli Zaretskii wrote:
> > Then maybe the new hex display should use the \N{U+nnn} format?
> 
> If we're going to do that, we might as well use \unnnn, which is shorter. A 
> downside of either syntax, though, is the implication that the raw byte is 
> intended to be Unicode, which it typically is not. That is partly why I was 
> thinking \x{nn} would be better: it'd be clearer to users.

In any case, since this is a "wishlist" bug report, I don't think it
should block the release of Emacs 26.1 (or any other version).

Thanks.






* bug#27270: display-raw-bytes-as-hex generates ambiguous output for Emacs strings
  2017-06-09 23:44                       ` Paul Eggert
  2017-06-10  7:24                         ` Eli Zaretskii
@ 2022-04-23 14:00                         ` Lars Ingebrigtsen
  2022-04-24  7:10                           ` Paul Eggert
  1 sibling, 1 reply; 39+ messages in thread
From: Lars Ingebrigtsen @ 2022-04-23 14:00 UTC (permalink / raw)
  To: Paul Eggert; +Cc: v.schneidermann, 27270, npostavs

[-- Attachment #1: Type: text/plain, Size: 613 bytes --]

Paul Eggert <eggert@cs.ucla.edu> writes:

> The idea is to add a new \X escape for character constants and
> strings. This escape would allow at most two hexadecimal digits,
> rather than the unlimited number of digits that \x does. For example,
> the Lisp string "\XABC" would be equivalent to the Lisp string
> "\xAB\ C", that is, it would be a two-character string containing the
> character U+00AB LEFT POINTING GUILLEMET followed by the character
> U+0043 LATIN CAPITAL LETTER C.

This was four years ago, but I don't think any steps were taken in this
direction, beyond marking the raw bytes more clearly:


[-- Attachment #2: Type: image/png, Size: 24406 bytes --]

[-- Attachment #3: Type: text/plain, Size: 752 bytes --]


Even in *scratch*, where font-locking overrode those, I think?

The issue still remains -- if you do this in emacs -nw:

(format "%c5" 128)
"€5"

And cut and paste that to a different Emacs, you get the string

"\x805"
=> "ࠅ"

But...  we've had this format for half a decade now, and this doesn't
really seem to be a problem in practice, so while the format is somewhat
ambiguous, I tend to think that introducing a new syntax just to fix it
isn't worth it.  Especially a syntax like \x{80}, which was one of the
suggestions -- the idea, after all, is to make display prettier and more
readable.

Any further opinions?

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no


* bug#27270: display-raw-bytes-as-hex generates ambiguous output for Emacs strings
  2022-04-23 14:00                         ` Lars Ingebrigtsen
@ 2022-04-24  7:10                           ` Paul Eggert
  2022-04-24  9:56                             ` Vasilij Schneidermann
  2022-04-24 11:24                             ` Lars Ingebrigtsen
  0 siblings, 2 replies; 39+ messages in thread
From: Paul Eggert @ 2022-04-24  7:10 UTC (permalink / raw)
  To: Lars Ingebrigtsen; +Cc: v.schneidermann, 27270, npostavs

On 4/23/22 07:00, Lars Ingebrigtsen wrote:
> we've had this format for half a decade now, and this doesn't
> really seem to be a problem in practice

Not surprising, since most people don't set display-raw-bytes-as-hex. 
But that doesn't mean it's not a problem. Quoting bugs can be issues 
even if they're unlikely to occur at random. (Think SQL injection. :-)


> I tend to think that introducing a new syntax just to fix it
> isn't worth it.

That's fine, so let's fix the problem as originally suggested. That is, 
display the string returned by (format "%c%c" #x9e #x66) as "\x9e\x66" 
(equivalent to (concat "\x9e" "\x66") which is correct) instead of as 
"\x9ef" (equivalent to "\N{BENGALI DIGIT NINE}" which is wrong).

This fixes the problem and doesn't introduce new syntax.
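
For readers who want to check the two readings, here is a minimal Emacs
Lisp sketch (added for illustration, not taken from the original
message). It assumes only the stock Lisp reader; the values in the
comments are the ones implied by the discussion in this thread.

```elisp
;; A \x escape in a string absorbs every hex digit that follows it.
(length "\x9ef")                            ; 1: the single character U+09EF
(string= "\x9ef" "\N{BENGALI DIGIT NINE}")  ; t

;; Escaping the second character as well keeps the two apart.
(length "\x9e\x66")                         ; 2
(string-to-list "\x9e\x66")                 ; (158 102)
```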






* bug#27270: display-raw-bytes-as-hex generates ambiguous output for Emacs strings
  2022-04-24  7:10                           ` Paul Eggert
@ 2022-04-24  9:56                             ` Vasilij Schneidermann
  2022-04-24 10:26                               ` Andreas Schwab
  2022-04-24 22:46                               ` Paul Eggert
  2022-04-24 11:24                             ` Lars Ingebrigtsen
  1 sibling, 2 replies; 39+ messages in thread
From: Vasilij Schneidermann @ 2022-04-24  9:56 UTC (permalink / raw)
  To: Paul Eggert; +Cc: Lars Ingebrigtsen, 27270, npostavs

> > I tend to think that introducing a new syntax just to fix it
> > isn't worth it.
>
> That's fine, so let's fix the problem as originally suggested. That is,
> display the string returned by (format "%c%c" #x9e #x66) as "\x9e\x66"
> (equivalent to (concat "\x9e" "\x66") which is correct) instead of as
> "\x9ef" (equivalent to "\N{BENGALI DIGIT NINE}" which is wrong).
>
> This fixes the problem and doesn't introduce new syntax.

Wait, hold up. Under which conditions exactly does the bug happen? If I
use GUI Emacs, thanks to font-lock it's pretty obvious that the output
is three bytes, the first one displayed using the hex escape syntax and
the remaining two bytes using hex letters.  If I copy-paste those into
another GUI Emacs, it's still the same three bytes. I don't know about
terminal Emacs, but trying to work around terminals being bad doesn't
seem worth the extra effort.

Besides, suppose it is worth it, what exactly should the logic be here?
Detect if there's a preceding hex escaped byte and if yes, display
adjacent bytes that are formatted using hex characters using escaping,
too? That seems too involved for something run in redisplay.

The other proposed alternative of tightening up read syntax seems
incompatible, but saner to me overall. Emacs Lisp is the odd one out
here anyway. Only C and C++ consider such sequences as potentially
having a greater length than 2 and they error out with a compilation
error for me.

    len("\x1234") # Python, Go: 3

    "\x1234".length # Ruby, JavaScript: 3

    length("\x1234") # Perl: 3

    (string-length "\x1234") ; Guile, Racket, CHICKEN: 3

    ;; Common Lisp absent because it lacks a lot of string escapes and
    ;; using FORMAT neatly sidesteps these issues

    ;; Clojure only has octal/unicode string escapes
    (count (seq "\u12345678")) ; Clojure: 5

    (length "\x1234") ; Emacs Lisp: 1

    strlen("\x1234") /* C: compilation error */

    std::string("\x1234").length() // C++: compilation error

    "\x1234".len() // Rust: 3

Before deciding on such a change, there should be efforts to figure out
whether anything could actually break due to this. That is, code with
long hex escapes in strings, be it manually authored (unlikely, people
either use raw bytes in strings or unicode escapes) or automatically
generated (cannot comment on that, maybe the byte-code compiler emits
such code?). If not, then it would be an obvious candidate for the next
major release of Emacs.







* bug#27270: display-raw-bytes-as-hex generates ambiguous output for Emacs strings
  2022-04-24  9:56                             ` Vasilij Schneidermann
@ 2022-04-24 10:26                               ` Andreas Schwab
  2022-04-24 10:51                                 ` Vasilij Schneidermann
  2022-04-24 22:46                               ` Paul Eggert
  1 sibling, 1 reply; 39+ messages in thread
From: Andreas Schwab @ 2022-04-24 10:26 UTC (permalink / raw)
  To: Vasilij Schneidermann; +Cc: Lars Ingebrigtsen, Paul Eggert, 27270, npostavs

On Apr 24 2022, Vasilij Schneidermann wrote:

>     strlen("\x1234") /* C: compilation error */

You need to use a wide string:

      wcslen(L"\x1234")

>     std::string("\x1234").length() // C++: compilation error

Likewise:

      std::wstring(L"\x1234").length()

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."






* bug#27270: display-raw-bytes-as-hex generates ambiguous output for Emacs strings
  2022-04-24 10:26                               ` Andreas Schwab
@ 2022-04-24 10:51                                 ` Vasilij Schneidermann
  2022-04-24 11:01                                   ` Andreas Schwab
  0 siblings, 1 reply; 39+ messages in thread
From: Vasilij Schneidermann @ 2022-04-24 10:51 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: Lars Ingebrigtsen, Paul Eggert, 27270, npostavs

> You need to use a wide string:
>
>       wcslen(L"\x1234")
>
> >     std::string("\x1234").length() // C++: compilation error
>
> Likewise:
>
>       std::wstring(L"\x1234").length()

Thank you for pointing this out. This gives us three camps:

- Languages where "\x1234" is always one character (Emacs Lisp)
- Languages where "\x1234" is an error, but may become one character
  when opting into this with wide literals (C, C++)
- Languages where "\x1234" is always multiple characters (everything
  else under the sun)

I propose Emacs Lisp to move into camp 3 (not really a point in moving
to camp 2 as it requires new syntax for a hardly used feature). As
evidenced by the bug report, this is a footgun waiting to happen. We
already do have syntax in case one truly wants to specify a value
greater than #xFF using Unicode names/values. This would require an
amendment in `(info "(elisp) General Escape Syntax")`, point 3. Like
with oldstyle backquotes, a warning could be emitted if greater hex
values are used in a string.

I've checked Emacs sources for usage of such hex escapes and only
found org-entities.el to represent non-breaking space (nbsp) this way,
so breakage should be limited.

If there is interest, I could extend the survey to include whether
character syntax is/should be affected the same way and/or include
more languages.






* bug#27270: display-raw-bytes-as-hex generates ambiguous output for Emacs strings
  2022-04-24 10:51                                 ` Vasilij Schneidermann
@ 2022-04-24 11:01                                   ` Andreas Schwab
  2022-04-24 11:29                                     ` Lars Ingebrigtsen
  0 siblings, 1 reply; 39+ messages in thread
From: Andreas Schwab @ 2022-04-24 11:01 UTC (permalink / raw)
  To: Vasilij Schneidermann; +Cc: Lars Ingebrigtsen, Paul Eggert, 27270, npostavs

On Apr 24 2022, Vasilij Schneidermann wrote:

> I propose Emacs Lisp to move into camp 3

This will break every use of \x in Emacs.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."






* bug#27270: display-raw-bytes-as-hex generates ambiguous output for Emacs strings
  2022-04-24  7:10                           ` Paul Eggert
  2022-04-24  9:56                             ` Vasilij Schneidermann
@ 2022-04-24 11:24                             ` Lars Ingebrigtsen
  2022-04-24 22:35                               ` Paul Eggert
  1 sibling, 1 reply; 39+ messages in thread
From: Lars Ingebrigtsen @ 2022-04-24 11:24 UTC (permalink / raw)
  To: Paul Eggert; +Cc: v.schneidermann, 27270, npostavs

Paul Eggert <eggert@cs.ucla.edu> writes:

> Not surprising, since most people don't set
> display-raw-bytes-as-hex. But that doesn't mean it's not a
> problem. Quoting bugs can be issues even if they're unlikely to occur
> at random. (Think SQL injection. :-)

I don't think we're talking quite the same magnitude -- this is a
problem if you're cutting strings from a -nw Emacs and pasting into a
different Emacs and then using the Lisp reader to read it back.  And
then there's a raw byte in the string.

The likelihood of anybody actually encountering this issue is ... small.

>> I tend to think that introducing a new syntax just to fix it
>> isn't worth it.
>
> That's fine, so let's fix the problem as originally suggested. That
> is, display the string returned by (format "%c%c" #x9e #x66) as
> "\x9e\x66" (equivalent to (concat "\x9e" "\x66") which is correct)
> instead of as "\x9ef" (equivalent to "\N{BENGALI DIGIT NINE}" which is
> wrong).
>
> This fixes the problem and doesn't introduce new syntax.

You want to quote all %c as if they were raw bytes?  Or only following a
raw byte?  And what about

(format "%cf" #x9e)

which was the originally reported issue?

In any case, this would definitely be a regression, because it creates
very confusing displayed strings.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no






* bug#27270: display-raw-bytes-as-hex generates ambiguous output for Emacs strings
  2022-04-24 11:01                                   ` Andreas Schwab
@ 2022-04-24 11:29                                     ` Lars Ingebrigtsen
  0 siblings, 0 replies; 39+ messages in thread
From: Lars Ingebrigtsen @ 2022-04-24 11:29 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: Vasilij Schneidermann, Paul Eggert, 27270, npostavs

Andreas Schwab <schwab@linux-m68k.org> writes:

>> I propose Emacs Lisp to move into camp 3
>
> This will break every use of \x in Emacs.

As Vasilij says, it won't break much of the in-tree code (which usually
looks like "\x3c\x7e\xff\xff\xff\xff\x7e\x3c"), but nevertheless, it'll
break stuff in subtle ways, so I don't think it's an option.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no






* bug#27270: display-raw-bytes-as-hex generates ambiguous output for Emacs strings
  2022-04-24 11:24                             ` Lars Ingebrigtsen
@ 2022-04-24 22:35                               ` Paul Eggert
  2022-04-25  7:40                                 ` Lars Ingebrigtsen
  0 siblings, 1 reply; 39+ messages in thread
From: Paul Eggert @ 2022-04-24 22:35 UTC (permalink / raw)
  To: Lars Ingebrigtsen; +Cc: v.schneidermann, 27270, npostavs

On 4/24/22 04:24, Lars Ingebrigtsen wrote:

> The likelihood of anybody actually encountering this issue is ... small.

Sure, if strings are random. But strings from opponents aren't random.

I'll readily grant that it's a much smaller exposure than SQL injection. 
Still, like SQL injection it's an exposure and should be fixed.


> You want to quote all %c as if they were raw bytes?  Or only following a
> raw byte?

Closer to the latter, but even less than the latter. I am being 
conservative and am proposing that Emacs do what it does now unless the 
resulting output would be misinterpreted on input. So I wouldn't change 
how all characters are quoted; only how characters are quoted when the 
result would be interpreted incorrectly.


> what about (format "%cf" #x9e)

Since that returns a multibyte string, I suggest "\u009ef" which is 
multibyte. For its unibyte counterpart (encode-coding-string (format 
"%cf" #x9e) 'iso-latin-1) I suggest the syntax "\x9e\ f" which is 
unibyte. (These are not the only possibilities; for example, the former 
could be "\u009e\ f" if you think that's clearer.)

This string syntax is already supported by Emacs, so this wouldn't 
change the Lisp reader.


> it creates
> very confusing displayed strings.

These examples are not *that* confusing. And although they may not be 
beautiful, correct strings are less confusing than incorrect strings.






* bug#27270: display-raw-bytes-as-hex generates ambiguous output for Emacs strings
  2022-04-24  9:56                             ` Vasilij Schneidermann
  2022-04-24 10:26                               ` Andreas Schwab
@ 2022-04-24 22:46                               ` Paul Eggert
  1 sibling, 0 replies; 39+ messages in thread
From: Paul Eggert @ 2022-04-24 22:46 UTC (permalink / raw)
  To: Vasilij Schneidermann; +Cc: Lars Ingebrigtsen, 27270, npostavs

On 4/24/22 02:56, Vasilij Schneidermann wrote:

> Under which conditions exactly does the bug happen?

I run into it with emacs -nw or equivalent, which I often use when I 
have a high-latency network connection so GUI Emacs is too slow. A few 
people even run Emacs from text consoles, with no graphics or windowing 
system at all, though I'm usually not that hard-core.


> trying to work around terminals being bad doesn't
> seem worth the extra effort.

Please bear with us poor users who don't always use GUIs...


> what exactly should the logic be here?
> Detect if there's a preceding hex escaped byte and if yes, display
> adjacent bytes that are formatted using hex characters using escaping,
> too?

Simpler than that. When hex-escaping a character, Emacs would look at 
the next character and if it's hexadecimal would print "\ " (or some 
similar escaping approach). This is a simple test and won't hurt 
printing performance much in the usual case.
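
The one-character lookahead described above can be sketched in Emacs
Lisp as follows. This is only an illustration added for readers: the
name `sketch-hex-display' is hypothetical, the function works on a list
of byte values rather than on redisplay's data structures, and it is not
how Emacs's display code is written.

```elisp
(defun sketch-hex-display (bytes)
  "Render BYTES, a list of byte values, as hex-escaped string contents.
Hypothetical helper, not Emacs's display code.  Bytes above 127 are
shown as \\xNN, and \"\\ \" is added when the next byte is an ASCII hex
digit, so that the Lisp reader cannot absorb that digit."
  (let ((parts nil))
    (while bytes
      (let ((b (car bytes))
            (next (cadr bytes)))
        (if (< b 128)
            ;; ASCII is shown as-is; quoting of double quotes and
            ;; backslashes is omitted in this sketch.
            (push (string b) parts)
          (push (format "\\x%02x" b) parts)
          ;; Look ahead one byte: separate only when it is a hex digit.
          (when (and next
                     (< next 128)
                     (string-match-p "[[:xdigit:]]" (string next)))
            (push "\\ " parts))))
      (setq bytes (cdr bytes)))
    (apply #'concat (nreverse parts))))

(sketch-hex-display '(?x #x90 ?5 ?y))   ; text: x\x90\ 5y
(sketch-hex-display '(?x #x90 ?z))      ; text: x\x90z
```

The separator is emitted only in the ambiguous case, so ordinary output
such as x\x90z is unchanged.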






* bug#27270: display-raw-bytes-as-hex generates ambiguous output for Emacs strings
  2022-04-24 22:35                               ` Paul Eggert
@ 2022-04-25  7:40                                 ` Lars Ingebrigtsen
  2022-04-25 16:49                                   ` Paul Eggert
  0 siblings, 1 reply; 39+ messages in thread
From: Lars Ingebrigtsen @ 2022-04-25  7:40 UTC (permalink / raw)
  To: Paul Eggert; +Cc: v.schneidermann, 27270, npostavs

Paul Eggert <eggert@cs.ucla.edu> writes:

>> The likelihood of anybody actually encountering this issue is ... small.
>
> Sure, if strings are random. But strings from opponents aren't random.
>
> I'll readily grant that it's a much smaller exposure than SQL
> injection. Still, like SQL injection it's an exposure and should be
> fixed.

The opponent would have to get somebody to start an Emacs with -nw, then
cut and paste a string with the mouse, then get the user to use the Lisp
reader to read that string in again, and then end up with a string that
will somehow be a security issue.

Comparing this to SQL injection is far fetched, to put it mildly.

We have a similar issue with the octal printer -- if you print something
out with it, and you end up with something displayed as foo\205bar, you
cut and paste that from -nw, and you save it into a file, you end up
with a file containing 10 characters instead of 7, and then you have
your exploit.

I.e., the Lisp reader and strings aren't the only things confusable here.

>> what about (format "%cf" #x9e)
>
> Since that returns a multibyte string, I suggest "\u009ef" which is
> multibyte. For its unibyte counterpart (encode-coding-string (format
> "%cf" #x9e) 'iso-latin-1) I suggest the syntax "\x9e\ f" which is
> unibyte. (These are not the only possibilities; for example, the
> former could be "\u009e\ f" if you think that's clearer.)

display-raw-bytes-as-hex is a display setting.  You want to change it so
that the data output will be different, which will break all kinds of
things, even if (when you use the Lisp reader) it'll end up being the
same.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no






* bug#27270: display-raw-bytes-as-hex generates ambiguous output for Emacs strings
  2022-04-25  7:40                                 ` Lars Ingebrigtsen
@ 2022-04-25 16:49                                   ` Paul Eggert
  2022-04-26 10:06                                     ` Lars Ingebrigtsen
  0 siblings, 1 reply; 39+ messages in thread
From: Paul Eggert @ 2022-04-25 16:49 UTC (permalink / raw)
  To: Lars Ingebrigtsen; +Cc: v.schneidermann, 27270, npostavs

On 4/25/22 00:40, Lars Ingebrigtsen wrote:

> Comparing this to SQL injection is far fetched

Call me paranoid if you like. (Can you tell I used to work for a 
computer security company? :-) And to be honest my main motivation is 
irritation that cut-and-paste doesn't work, not security.


> We have a similar issue with the octal printer -- if you print something
> out with it, and you end up with something displayed as foo\205bar, you
> cut and paste that from -nw, and you save it into a file,

Nobody expects things to work if you output with one quoting scheme and 
input with a different one. But cutting and pasting from Emacs's 
read-eval-print-loop is expected to work.


> display-raw-bytes-as-hex is a display setting.  You want to change it so
> that the data output will be different

No, I would like to change only the display. (I had suggested otherwise 
in comment #5 of this bug report, but was mistaken and took that 
suggestion back in later comments.)






* bug#27270: display-raw-bytes-as-hex generates ambiguous output for Emacs strings
  2022-04-25 16:49                                   ` Paul Eggert
@ 2022-04-26 10:06                                     ` Lars Ingebrigtsen
  2022-04-26 16:48                                       ` Paul Eggert
  0 siblings, 1 reply; 39+ messages in thread
From: Lars Ingebrigtsen @ 2022-04-26 10:06 UTC (permalink / raw)
  To: Paul Eggert; +Cc: v.schneidermann, 27270, npostavs

Paul Eggert <eggert@cs.ucla.edu> writes:

>> display-raw-bytes-as-hex is a display setting.  You want to change it so
>> that the data output will be different
>
> No, I would like to change only the display. (I had suggested
> otherwise in comment #5 of this bug report, but was mistaken and took
> that suggestion back in later comments.)

Your last suggestion was to output

(format "%cf" 129)
=> "\x81\x66"

I think?  Which is changing the data output.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no






* bug#27270: display-raw-bytes-as-hex generates ambiguous output for Emacs strings
  2022-04-26 10:06                                     ` Lars Ingebrigtsen
@ 2022-04-26 16:48                                       ` Paul Eggert
  2022-04-27 12:13                                         ` Lars Ingebrigtsen
  0 siblings, 1 reply; 39+ messages in thread
From: Paul Eggert @ 2022-04-26 16:48 UTC (permalink / raw)
  To: Lars Ingebrigtsen; +Cc: v.schneidermann, 27270, npostavs

On 4/26/22 03:06, Lars Ingebrigtsen wrote:
> Your last suggestion was to output
> 
> (format "%cf" 129)
> => "\x81\x66"
> 
> I think?  Which is changing the data output.

Oh, right. Scratch that. Let's just use "\uXXXX" if multibyte, "\OOO" 
(octal) if unibyte. (This is only when the character precedes a hex 
digit.) That's simpler anyway.
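
As a sketch of why these two forms would be unambiguous (added for
illustration, assuming the stock Lisp reader): \u reads exactly four hex
digits and an octal escape reads at most three digits, so a following
hex digit is never absorbed into the escape.

```elisp
;; Multibyte case: \u reads exactly four hex digits.
(length "\u009ef")              ; 2: U+009E, then ?f
(multibyte-string-p "\u009ef")  ; t

;; Unibyte case: an octal escape reads at most three digits.
(string-to-list "\236f")        ; (158 102)
(multibyte-string-p "\236f")    ; nil
```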






* bug#27270: display-raw-bytes-as-hex generates ambiguous output for Emacs strings
  2022-04-26 16:48                                       ` Paul Eggert
@ 2022-04-27 12:13                                         ` Lars Ingebrigtsen
  2022-04-27 17:21                                           ` Paul Eggert
  0 siblings, 1 reply; 39+ messages in thread
From: Lars Ingebrigtsen @ 2022-04-27 12:13 UTC (permalink / raw)
  To: Paul Eggert; +Cc: v.schneidermann, 27270, npostavs

Paul Eggert <eggert@cs.ucla.edu> writes:

>> Your last suggestion was to output
>> (format "%cf" 129)
>> => "\x81\x66"
>> I think?  Which is changing the data output.
>
> Oh, right. Scratch that. Let's just use "\uXXXX" if multibyte, "\OOO"
> (octal) if unibyte. (This is only when the character precedes a hex
> digit.) That's simpler anyway.

That will also change the output, which display-raw-bytes-as-hex is not
supposed to do.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no






* bug#27270: display-raw-bytes-as-hex generates ambiguous output for Emacs strings
  2022-04-27 12:13                                         ` Lars Ingebrigtsen
@ 2022-04-27 17:21                                           ` Paul Eggert
  2022-04-27 17:22                                             ` Lars Ingebrigtsen
  0 siblings, 1 reply; 39+ messages in thread
From: Paul Eggert @ 2022-04-27 17:21 UTC (permalink / raw)
  To: Lars Ingebrigtsen; +Cc: v.schneidermann, 27270, npostavs

On 4/27/22 05:13, Lars Ingebrigtsen wrote:
>> Oh, right. Scratch that. Let's just use "\uXXXX" if multibyte, "\OOO"
>> (octal) if unibyte. (This is only when the character precedes a hex
>> digit.) That's simpler anyway.
> That will also change the output, which display-raw-bytes-as-hex is not
> supposed to do.

Could you explain what you mean by "change the output"? (Sorry, I'm not 
seeing it.)






* bug#27270: display-raw-bytes-as-hex generates ambiguous output for Emacs strings
  2022-04-27 17:21                                           ` Paul Eggert
@ 2022-04-27 17:22                                             ` Lars Ingebrigtsen
  2022-04-28 17:58                                               ` Paul Eggert
  0 siblings, 1 reply; 39+ messages in thread
From: Lars Ingebrigtsen @ 2022-04-27 17:22 UTC (permalink / raw)
  To: Paul Eggert; +Cc: v.schneidermann, 27270, npostavs

Paul Eggert <eggert@cs.ucla.edu> writes:

> Could you explain what you mean by "change the output"? (Sorry, I'm
> not seeing it.)

I said earlier:

> display-raw-bytes-as-hex is a display setting.  You want to change it so
> that the data output will be different, which will break all kinds of
> things, even if (when you use the Lisp reader) it'll end up being the
> same.


-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no






* bug#27270: display-raw-bytes-as-hex generates ambiguous output for Emacs strings
  2022-04-27 17:22                                             ` Lars Ingebrigtsen
@ 2022-04-28 17:58                                               ` Paul Eggert
  0 siblings, 0 replies; 39+ messages in thread
From: Paul Eggert @ 2022-04-28 17:58 UTC (permalink / raw)
  To: Lars Ingebrigtsen; +Cc: v.schneidermann, 27270-done, npostavs

[-- Attachment #1: Type: text/plain, Size: 716 bytes --]

On 4/27/22 10:22, Lars Ingebrigtsen wrote:
> Paul Eggert <eggert@cs.ucla.edu> writes:
> 
>> Could you explain what you mean by "change the output"? (Sorry, I'm
>> not seeing it.)
> 
> I said earlier:
> 
>> display-raw-bytes-as-hex is a display setting.  You want to change it so
>> that the data output will be different

Still not quite following, as I had been thinking more recently of 
changing only how display-raw-bytes-as-hex displays.

That being said, I looked into the code and found that what I was asking 
for would be quite a pain to implement - more trouble than it's worth, 
anyway - so I withdraw the suggestion and am closing the bug report.

I installed the attached, which documents the situation.

[-- Attachment #2: 0001-Document-807-etc.-in-raw-byte-display.patch --]
[-- Type: text/x-patch, Size: 1474 bytes --]

From d501db962eae2b831a2497adc85a94e98064e969 Mon Sep 17 00:00:00 2001
From: Paul Eggert <eggert@cs.ucla.edu>
Date: Thu, 28 Apr 2022 10:51:01 -0700
Subject: [PATCH] Document \807 etc. in raw byte display

* doc/emacs/display.texi (Display Custom): Mention potential
confusion in raw byte display.
---
 doc/emacs/display.texi | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/doc/emacs/display.texi b/doc/emacs/display.texi
index 2ac0dca622..7a6c7f391b 100644
--- a/doc/emacs/display.texi
+++ b/doc/emacs/display.texi
@@ -2097,3 +2097,14 @@ Display Custom
 byte with a decimal value of 128 is displayed as @code{\200}.  To
 change display to the hexadecimal format of @code{\x80}, set the
 variable @code{display-raw-bytes-as-hex} to @code{t}.
+Care may be needed when interpreting a raw byte when copying
+text from a terminal containing an Emacs session, or when a terminal's
+@code{escape-glyph} face looks like the default face.  For example, by
+default Emacs displays the four characters @samp{\}, @samp{2},
+@samp{0}, @samp{0} with the same characters it displays a byte with
+decimal value 128.  The problem can be worse with hex displays, where
+the raw byte 128 followed by the character @samp{7} is displayed as
+@code{\x807}, which Emacs Lisp reads as the single character U+0807
+SAMARITAN LETTER IT; this confusion does not occur with the
+corresponding octal display @code{\2007} because octal escapes contain
+at most three digits.
-- 
2.35.1


