* bug#53236: 26.1; encode-coding-string does not encode the string as expected
From: Markus Triska @ 2022-01-13 19:45 UTC
To: 53236
Dear all,
please consider the UTF-8 encoding of the Unicode codepoint 0x80, which
is formed by two bytes. In hexadecimal notation, they are: 0xC2 0x80.
We can use decode-coding-string to verify that this byte sequence is
decoded to 0x80 when specifying utf-8, which works exactly as expected:
(decode-coding-string "\xC2\x80" 'utf-8)
This yields "\200", which is the same as "\x80", as verified via:
(string= "\200" "\x80") --> t
Correspondingly, I expect (encode-coding-string "\200" 'utf-8) to yield
a string equivalent to "\xC2\x80", but that seems not to be the case. I get:
(encode-coding-string "\200" 'utf-8) --> "\200"
And therefore, unexpectedly:
(string= (encode-coding-string "\200" 'utf-8) "\xC2\x80") --> nil
It appears that encode-coding-string does not encode the string in UTF-8
as expected. Is there any way to obtain the desired encoding with
encode-coding-string, i.e., the UTF-8-encoded string "\xC2\x80"?
Thank you and all the best!
Markus
In GNU Emacs 26.1 (build 3, x86_64-pc-linux-gnu, X toolkit, Xaw scroll bars)
of 2019-04-09 built on mt-laptop
Windowing system distributor 'The X.Org Foundation', version 11.0.12004000
System Description: Ubuntu 19.04
Configured features:
XPM JPEG GIF PNG SOUND GSETTINGS NOTIFY GNUTLS LIBXML2 FREETYPE XFT ZLIB
TOOLKIT_SCROLL_BARS LUCID X11 THREADS
Important settings:
value of $LC_MONETARY: en_GB.UTF-8
value of $LC_NUMERIC: en_GB.UTF-8
value of $LC_TIME: en_GB.UTF-8
value of $LANG: en_US.UTF-8
value of $XMODIFIERS: @im=ibus
locale-coding-system: utf-8-unix
From: Philipp Stephani @ 2022-01-13 20:23 UTC
To: Markus Triska; +Cc: 53236
On Thu, 13 Jan 2022 at 21:14, Markus Triska <triska@metalevel.at> wrote:
>
> Dear all,
>
> please consider the UTF-8 encoding of the Unicode codepoint 0x80, which
> is formed by two bytes. In hexadecimal notation, they are: 0xC2 0x80.
>
> We can use decode-coding-string to verify that this byte sequence is
> decoded to 0x80 when specifying utf-8, which works exactly as expected:
>
> (decode-coding-string "\xC2\x80" 'utf-8)
>
> This yields "\200", which is the same as "\x80", as verified via:
>
> (string= "\200" "\x80") --> t
There are two possible interpretations of "\200":
1. The unibyte string containing the byte #x80
2. The multibyte string containing the Unicode character U+0080
The string literal "\200" gives you the former, while
(decode-coding-string "\xC2\x80" 'utf-8) gives you the latter. In
fact,
(string= (decode-coding-string "\xC2\x80" 'utf-8) "\200") ⇒ nil
but
(string= (decode-coding-string "\xC2\x80" 'utf-8) "\u0080") ⇒ t
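The difference between the two interpretations can be inspected directly. The following is a minimal sketch added for illustration, not part of the original report; the helper name `my/describe-string` is hypothetical, and it relies only on `multibyte-string-p` and `append`.

```elisp
;; Hypothetical helper (not from the thread): report how Emacs represents
;; a string and which codes it contains.
(defun my/describe-string (s)
  "Return a list (REPRESENTATION . CODES) describing string S."
  (cons (if (multibyte-string-p s) 'multibyte 'unibyte)
        ;; `append' with a nil tail converts a string to a list of its
        ;; elements: bytes for unibyte strings, characters for multibyte.
        (append s nil)))

(my/describe-string "\200")
;; ⇒ (unibyte 128)
(my/describe-string (decode-coding-string "\xC2\x80" 'utf-8))
;; ⇒ (multibyte 128)
```

Both strings hold the number 128, but in the unibyte string it is a raw byte, while in the multibyte string it is the character U+0080.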
>
> Correspondingly, I expect (encode-coding-string "\200" 'utf-8) to yield
> a string equivalent to "\xC2\x80", but that seems not to be the case. I get:
>
> (encode-coding-string "\200" 'utf-8) --> "\200"
Here "\200" gives you the unibyte string that contains the byte #x80.
That can't be encoded as UTF-8 (since UTF-8 encodes Unicode scalar
values, not raw bytes), so it's left alone.
However,
(encode-coding-string "\u0080" 'utf-8) ⇒ "\302\200"
There's some background in the section "Text Representations" (in the
chapter "Non-ASCII Characters") of the ELisp manual.
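For the original question, one way to obtain the UTF-8 bytes is to make sure the input is a multibyte string before encoding. The sketch below is an illustration under assumptions rather than a recommendation made in this thread: the second form assumes that the byte #x80 in the unibyte string was meant as Latin-1 text, so that decoding with `latin-1` maps it to U+0080.

```elisp
;; Build the character U+0080 explicitly; `string' returns a multibyte
;; string when given a non-ASCII character.
(encode-coding-string (string #x80) 'utf-8)
;; ⇒ "\302\200"

;; Assumption: the unibyte input holds Latin-1 bytes.  Decoding it first
;; turns the byte #x80 into the character U+0080, which can then be
;; encoded as UTF-8.
(encode-coding-string (decode-coding-string "\200" 'latin-1) 'utf-8)
;; ⇒ "\302\200"
```

If the unibyte string holds bytes in some other encoding, the decoding step would need the matching coding system instead of `latin-1`.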
HTH
From: Eli Zaretskii @ 2022-01-14 6:55 UTC
To: Markus Triska; +Cc: 53236
> From: Markus Triska <triska@metalevel.at>
> Date: Thu, 13 Jan 2022 20:45:57 +0100
>
> Correspondingly, I expect (encode-coding-string "\200" 'utf-8) to yield
> a string equivalent to "\xC2\x80", but that seems not to be the case. I get:
>
> (encode-coding-string "\200" 'utf-8) --> "\200"
>
> And therefore, unexpectedly:
>
> (string= (encode-coding-string "\200" 'utf-8) "\xC2\x80") --> nil
"\200" is a unibyte string, and encoding unibyte strings returns those
strings without changing them.
This is not a bug, just a dark corner of encoding/decoding stuff.
From: Andreas Schwab @ 2022-01-14 10:00 UTC
To: Eli Zaretskii; +Cc: 53236, Markus Triska
On Jan 14 2022, Eli Zaretskii wrote:
>> From: Markus Triska <triska@metalevel.at>
>> Date: Thu, 13 Jan 2022 20:45:57 +0100
>>
>> Correspondingly, I expect (encode-coding-string "\200" 'utf-8) to yield
>> a string equivalent to "\xC2\x80", but that seems not to be the case. I get:
>>
>> (encode-coding-string "\200" 'utf-8) --> "\200"
>>
>> And therefore, unexpectedly:
>>
>> (string= (encode-coding-string "\200" 'utf-8) "\xC2\x80") --> nil
>
> "\200" is a unibyte string, and encoding unibyte strings returns those
> strings without changing them.
>
> This is not a bug, just a dark corner of encoding/decoding stuff.
Or a dark corner of the string syntax.
ELISP> (multibyte-string-p "\200")
nil
ELISP> (multibyte-string-p "\x80")
nil
ELISP> (multibyte-string-p "\x0080")
t
ELISP> (encode-coding-string "\x0080" 'utf-8)
"\302\200"
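As a compact summary of the session above, the following sketch (added for illustration, not part of the original message) encodes four spellings of the same code and lists the resulting bytes. The helper name `my/utf-8-bytes` is hypothetical; `\u0080` is included as the Unicode-escape spelling used earlier in the thread.

```elisp
;; Hypothetical helper (not from the thread): encode S as UTF-8 and
;; return the resulting bytes as a list of integers.
(defun my/utf-8-bytes (s)
  "Return the bytes of string S after encoding it as UTF-8."
  (append (encode-coding-string s 'utf-8) nil))

;; The first two literals are unibyte and pass through unchanged; the
;; last two are multibyte and are encoded as the two-byte sequence.
(mapcar #'my/utf-8-bytes '("\200" "\x80" "\x0080" "\u0080"))
;; ⇒ ((128) (128) (194 128) (194 128))
```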
--
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510 2552 DF73 E780 A9DA AEC1
"And now for something completely different."