bug#74922: 29.4; copy_string_contents doesn't always produce a valid utf-8


* bug#74922: 29.4; copy_string_contents doesn't always produce a valid utf-8
@ 2024-12-17  6:08 Evgeny Kurnevsky
  2024-12-17 13:18 ` Eli Zaretskii
  0 siblings, 1 reply; 7+ messages in thread
From: Evgeny Kurnevsky @ 2024-12-17  6:08 UTC (permalink / raw)
  To: 74922


According to the docs and the comment inside module_copy_string_contents,
it should always produce a valid UTF-8 string that can be used in dynamic
modules, but that is not always the case. I encountered an Emacs crash
when using emacs-module-rs, because it always expects valid UTF-8 in
strings. To reproduce, you can call:

  (some-function-from-dynamic-library
   (encode-coding-string (f-read-text "wg-private-pc.age") 'utf-8 t))

The file is
https://github.com/kurnevsky/nixfiles/raw/0b3de016dac551398627a55788b80d4809afcbf9/secrets/wg-private-pc.age

See https://github.com/ubolonton/emacs-module-rs/issues/58 for additional
details.


* bug#74922: 29.4; copy_string_contents doesn't always produce a valid utf-8
  2024-12-17  6:08 bug#74922: 29.4; copy_string_contents doesn't always produce a valid utf-8 Evgeny Kurnevsky
@ 2024-12-17 13:18 ` Eli Zaretskii
       [not found]   ` <CAOEHfojGKXoUKbf1-5N=973OURs==BQTXejLFd8cLhsR1DWh+g@mail.gmail.com>
  0 siblings, 1 reply; 7+ messages in thread
From: Eli Zaretskii @ 2024-12-17 13:18 UTC (permalink / raw)
  To: Evgeny Kurnevsky; +Cc: 74922

> From: Evgeny Kurnevsky <kurnevsky@gmail.com>
> Date: Tue, 17 Dec 2024 06:08:30 +0000
> 
> According to the docs and comment inside module_copy_string_contents it should always produce a valid
> utf-8 string that can be used in dynamic modules, but it seems it's not always the case. I encountered an
> emacs crash when using emacs-module-rs because it always expects a valid utf-8 for strings. To reproduce
> you can call:
> 
> (some-function-from-dynamic-library (encode-coding-string (f-read-text "wg-private-pc.age") 'utf-8 t))
> 
> The file is
> https://github.com/kurnevsky/nixfiles/raw/0b3de016dac551398627a55788b80d4809afcbf9/secrets/wg-private-pc.age

This string includes raw bytes; it isn't a text string, as far as I
could see.  It definitely isn't UTF-8 encoded text.  What did you
expect to happen with it when you copy such a string from Emacs?

> See https://github.com/ubolonton/emacs-module-rs/issues/58 for additional details.

Can't say there are too many details there...






* bug#74922: Fwd: bug#74922: 29.4; copy_string_contents doesn't always produce a valid utf-8
       [not found]   ` <CAOEHfojGKXoUKbf1-5N=973OURs==BQTXejLFd8cLhsR1DWh+g@mail.gmail.com>
@ 2024-12-17 13:31     ` Evgeny Kurnevsky
  2024-12-17 14:24       ` Eli Zaretskii
  0 siblings, 1 reply; 7+ messages in thread
From: Evgeny Kurnevsky @ 2024-12-17 13:31 UTC (permalink / raw)
  To: 74922


Yes, that's a binary file, not a UTF-8 string. From the comment in the
module_copy_string_contents implementation I guessed that in such cases
Emacs should signal an error, but instead it just passes this invalid
string to the dynamic library, which caused this bug in emacs-module-rs; see
https://ubolonton.github.io/emacs-module-rs/latest/type-conversions.html#strings

If that's expected, then maybe it should be stated explicitly in the docs
of copy_string_contents here:
https://www.gnu.org/software/emacs/manual/html_node/elisp/Module-Values.html

They just say that it stores the UTF-8 encoded text, which gives the
impression that the result is always a valid UTF-8 string.


* bug#74922: Fwd: bug#74922: 29.4; copy_string_contents doesn't always produce a valid utf-8
  2024-12-17 13:31     ` bug#74922: Fwd: " Evgeny Kurnevsky
@ 2024-12-17 14:24       ` Eli Zaretskii
  2024-12-17 14:46         ` Evgeny Kurnevsky
  0 siblings, 1 reply; 7+ messages in thread
From: Eli Zaretskii @ 2024-12-17 14:24 UTC (permalink / raw)
  To: Evgeny Kurnevsky; +Cc: 74922

> From: Evgeny Kurnevsky <kurnevsky@gmail.com>
> Date: Tue, 17 Dec 2024 13:31:57 +0000
> 
> Yes, that's a binary file that is not an utf-8 string. From the comment in module_copy_string_contents
> implementation I guessed that in such cases emacs should signal an error, but instead it just passes this
> invalid string to the dynamic library which caused this bug in emacs-module-rs (see
> https://ubolonton.github.io/emacs-module-rs/latest/type-conversions.html#strings ). So if it's expected then
> maybe it should be explicitly said in the docs of copy_string_contents here
> https://www.gnu.org/software/emacs/manual/html_node/elisp/Module-Values.html ? It just says that it stores
> the utf-8 encoded text which makes an impression that it's an always valid utf-8 string.

I could look into the internals, but I actually wonder why the module
doesn't check the text before relying on such subtle behaviors.  We
didn't document the fact that it signals an error for a reason.

So: why cannot the module code or the application which uses it test
up front that the string it copies is human-readable text, not some
binary junk?






* bug#74922: Fwd: bug#74922: 29.4; copy_string_contents doesn't always produce a valid utf-8
  2024-12-17 14:24       ` Eli Zaretskii
@ 2024-12-17 14:46         ` Evgeny Kurnevsky
  2024-12-17 15:10           ` Eli Zaretskii
  0 siblings, 1 reply; 7+ messages in thread
From: Evgeny Kurnevsky @ 2024-12-17 14:46 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 74922


It can definitely do that, but I guess emacs-module-rs doesn't do it by
default because of the performance implications: checking every string
might be quite costly in some cases, and it wasn't really clear whether
Emacs can pass an invalid string. So currently this case causes undefined
behavior there, which results in an Emacs crash.
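
To make the trade-off concrete, here is a minimal Rust sketch of two safe
ways a module could convert the copied bytes. It uses only the standard
library; checked_str, lossy_str, and buf are hypothetical names chosen
for illustration, not the actual emacs-module-rs code. Both functions
scan the buffer once, which is the cost mentioned above;
std::str::from_utf8_unchecked skips that scan, but calling it on invalid
input is undefined behavior.

```rust
use std::borrow::Cow;
use std::str::Utf8Error;

/// Checked conversion: scans the buffer once and rejects invalid UTF-8.
/// `buf` stands in for the bytes a module received from
/// copy_string_contents, without the trailing NUL (an assumption of
/// this sketch).
fn checked_str(buf: &[u8]) -> Result<&str, Utf8Error> {
    std::str::from_utf8(buf)
}

/// Lossy conversion: never fails; each invalid sequence is replaced
/// with U+FFFD (the replacement character).
fn lossy_str(buf: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(buf)
}
```

With either function, a buffer such as [0x61, 0xff, 0x62] yields an error
or a replacement character rather than an invalid string.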

-- 
Best regards, Evgeny Kurnevsky.


* bug#74922: Fwd: bug#74922: 29.4; copy_string_contents doesn't always produce a valid utf-8
  2024-12-17 14:46         ` Evgeny Kurnevsky
@ 2024-12-17 15:10           ` Eli Zaretskii
  2024-12-21 12:09             ` Eli Zaretskii
  0 siblings, 1 reply; 7+ messages in thread
From: Eli Zaretskii @ 2024-12-17 15:10 UTC (permalink / raw)
  To: Evgeny Kurnevsky; +Cc: 74922

> From: Evgeny Kurnevsky <kurnevsky@gmail.com>
> Date: Tue, 17 Dec 2024 14:46:28 +0000
> Cc: 74922@debbugs.gnu.org
> 
> It can definitely do it, but I guess in emacs-module-rs it's not done by default because of performance
> implications - it might be quite costly to check every string in some cases, and it wasn't really clear if emacs
> can pass an invalid string. So currently this case causes undefined behavior there which results in emacs
> crash.

What do Rust programs do when they are told to read random files?
This is the same situation, basically.

And what would the module do if copy_string_contents *did* signal an
error?
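
For reference, a minimal sketch of how the Rust standard library treats
the first case. read_text and read_bytes are hypothetical wrapper names;
the calls underneath are std::fs::read_to_string, which reports
non-UTF-8 contents as an error of kind InvalidData, and std::fs::read,
which returns the raw bytes and accepts any content.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Read a file as text. Contents that are not valid UTF-8 surface as an
/// `io::Error` of kind `InvalidData` instead of an invalid `String`.
fn read_text(path: &Path) -> io::Result<String> {
    fs::read_to_string(path)
}

/// Read a file as raw bytes. Any content is accepted; interpreting the
/// bytes is left to the caller.
fn read_bytes(path: &Path) -> io::Result<Vec<u8>> {
    fs::read(path)
}
```

So a program told to read a random file either gets an error it has to
handle or asks for bytes explicitly.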






* bug#74922: Fwd: bug#74922: 29.4; copy_string_contents doesn't always produce a valid utf-8
  2024-12-17 15:10           ` Eli Zaretskii
@ 2024-12-21 12:09             ` Eli Zaretskii
  0 siblings, 0 replies; 7+ messages in thread
From: Eli Zaretskii @ 2024-12-21 12:09 UTC (permalink / raw)
  To: kurnevsky; +Cc: 74922

> Cc: 74922@debbugs.gnu.org
> Date: Tue, 17 Dec 2024 17:10:36 +0200
> From: Eli Zaretskii <eliz@gnu.org>
> 
> > From: Evgeny Kurnevsky <kurnevsky@gmail.com>
> > Date: Tue, 17 Dec 2024 14:46:28 +0000
> > Cc: 74922@debbugs.gnu.org
> > 
> > It can definitely do it, but I guess in emacs-module-rs it's not done by default because of performance
> > implications - it might be quite costly to check every string in some cases, and it wasn't really clear if emacs
> > can pass an invalid string. So currently this case causes undefined behavior there which results in emacs
> > crash.
> 
> What do Rust programs do when they are told to read random files?
> This is the same situation, basically.
> 
> And what would the module do if copy_string_contents *did* signal an
> error?

I think I know what happened: you called copy_string_contents with a
unibyte string.  In that case, copy_string_contents will return you
the original string without doing anything.  The code in
copy_string_contents that signals an error relies on the fact that
encoding the input string yields nil if the input includes non-Unicode
characters. But that cannot be established with unibyte strings,
because a unibyte string doesn't hold characters, it holds raw bytes.

What you should do is make sure the string passed to
copy_string_contents is a multibyte string.  If I do that, i.e.

  (switch-to-buffer "foo")
  (set-buffer-multibyte t)
  (insert-file-contents "/path/to/wg-private-pc.age")
  (setq str1 (buffer-string))

and then call copy_string_contents with the resulting string str1, I
get the result you expected.

You need to realize that copy_string_contents is a variant of the
text-encoding routines: it encodes the input multibyte string in
UTF-8.  When given a unibyte string, the encoding routines in Emacs
always return it unchanged, because a unibyte string is already
encoded, or at least is supposed to be.

And before you ask: no, copy_string_contents cannot by itself signal
an error if passed a unibyte string, because a unibyte string can
legitimately be a valid UTF-8 string. So in this case,
copy_string_contents relies on the caller to make sure the input is
valid UTF-8.
