* bug#74922: 29.4; copy_string_contents doesn't always produce a valid utf-8 @ 2024-12-17 6:08 Evgeny Kurnevsky 2024-12-17 13:18 ` Eli Zaretskii 0 siblings, 1 reply; 8+ messages in thread From: Evgeny Kurnevsky @ 2024-12-17 6:08 UTC (permalink / raw) To: 74922 [-- Attachment #1: Type: text/plain, Size: 637 bytes --] According to the docs and comment inside module_copy_string_contents it should always produce a valid utf-8 string that can be used in dynamic modules, but it seems it's not always the case. I encountered an emacs crash when using emacs-module-rs because it always expects a valid utf-8 for strings. To reproduce you can call: (some-function-from-dynamic-library (encode-coding-string (f-read-text "wg-private-pc.age") 'utf-8 t)) The file is https://github.com/kurnevsky/nixfiles/raw/0b3de016dac551398627a55788b80d4809afcbf9/secrets/wg-private-pc.age See https://github.com/ubolonton/emacs-module-rs/issues/58 for additional details. [-- Attachment #2: Type: text/html, Size: 956 bytes --] ^ permalink raw reply [flat|nested] 8+ messages in thread
* bug#74922: 29.4; copy_string_contents doesn't always produce a valid utf-8 2024-12-17 6:08 bug#74922: 29.4; copy_string_contents doesn't always produce a valid utf-8 Evgeny Kurnevsky @ 2024-12-17 13:18 ` Eli Zaretskii [not found] ` <CAOEHfojGKXoUKbf1-5N=973OURs==BQTXejLFd8cLhsR1DWh+g@mail.gmail.com> 0 siblings, 1 reply; 8+ messages in thread From: Eli Zaretskii @ 2024-12-17 13:18 UTC (permalink / raw) To: Evgeny Kurnevsky; +Cc: 74922 > From: Evgeny Kurnevsky <kurnevsky@gmail.com> > Date: Tue, 17 Dec 2024 06:08:30 +0000 > > According to the docs and comment inside module_copy_string_contents it should always produce a valid > utf-8 string that can be used in dynamic modules, but it seems it's not always the case. I encountered an > emacs crash when using emacs-module-rs because it always expects a valid utf-8 for strings. To reproduce > you can call: > > (some-function-from-dynamic-library (encode-coding-string (f-read-text "wg-private-pc.age") 'utf-8 t)) > > The file is > https://github.com/kurnevsky/nixfiles/raw/0b3de016dac551398627a55788b80d4809afcbf9/secrets/wg-private-pc.age This string includes raw bytes, it isn't a text string, as far as I could see. It definitely isn't UTF-8 encoded text. What did you expect to happen with it when you copy such a string from Emacs? > See https://github.com/ubolonton/emacs-module-rs/issues/58 for additional details. Can't say there are too many details there... ^ permalink raw reply [flat|nested] 8+ messages in thread
* bug#74922: Fwd: bug#74922: 29.4; copy_string_contents doesn't always produce a valid utf-8 [not found] ` <CAOEHfojGKXoUKbf1-5N=973OURs==BQTXejLFd8cLhsR1DWh+g@mail.gmail.com> @ 2024-12-17 13:31 ` Evgeny Kurnevsky 2024-12-17 14:24 ` Eli Zaretskii 0 siblings, 1 reply; 8+ messages in thread From: Evgeny Kurnevsky @ 2024-12-17 13:31 UTC (permalink / raw) To: 74922 [-- Attachment #1: Type: text/plain, Size: 1828 bytes --] Yes, that's a binary file that is not an utf-8 string. From the comment in module_copy_string_contents implementation I guessed that in such cases emacs should signal an error, but instead it just passes this invalid string to the dynamic library which caused this bug in emacs-module-rs (see https://ubolonton.github.io/emacs-module-rs/latest/type-conversions.html#strings ). So if it's expected then maybe it should be explicitly said in the docs of copy_string_contents here https://www.gnu.org/software/emacs/manual/html_node/elisp/Module-Values.html ? It just says that it stores the utf-8 encoded text which makes an impression that it's an always valid utf-8 string. On Tue, Dec 17, 2024 at 1:18 PM Eli Zaretskii <eliz@gnu.org> wrote: > > From: Evgeny Kurnevsky <kurnevsky@gmail.com> > > Date: Tue, 17 Dec 2024 06:08:30 +0000 > > > > According to the docs and comment inside module_copy_string_contents it > should always produce a valid > > utf-8 string that can be used in dynamic modules, but it seems it's not > always the case. I encountered an > > emacs crash when using emacs-module-rs because it always expects a valid > utf-8 for strings. To reproduce > > you can call: > > > > (some-function-from-dynamic-library (encode-coding-string (f-read-text > "wg-private-pc.age") 'utf-8 t)) > > > > The file is > > > https://github.com/kurnevsky/nixfiles/raw/0b3de016dac551398627a55788b80d4809afcbf9/secrets/wg-private-pc.age > > This string includes raw bytes, it isn't a text string, as far as I > could see. It definitely isn't UTF-8 encoded text. 
What did you > expect to happen with it when you copy such a string from Emacs? > > > See https://github.com/ubolonton/emacs-module-rs/issues/58 for > additional details. > > Can't say there are too many details there... > [-- Attachment #2: Type: text/html, Size: 2855 bytes --] ^ permalink raw reply [flat|nested] 8+ messages in thread
* bug#74922: Fwd: bug#74922: 29.4; copy_string_contents doesn't always produce a valid utf-8 2024-12-17 13:31 ` bug#74922: Fwd: " Evgeny Kurnevsky @ 2024-12-17 14:24 ` Eli Zaretskii 2024-12-17 14:46 ` Evgeny Kurnevsky 0 siblings, 1 reply; 8+ messages in thread From: Eli Zaretskii @ 2024-12-17 14:24 UTC (permalink / raw) To: Evgeny Kurnevsky; +Cc: 74922 > From: Evgeny Kurnevsky <kurnevsky@gmail.com> > Date: Tue, 17 Dec 2024 13:31:57 +0000 > > Yes, that's a binary file that is not an utf-8 string. From the comment in module_copy_string_contents > implementation I guessed that in such cases emacs should signal an error, but instead it just passes this > invalid string to the dynamic library which caused this bug in emacs-module-rs (see > https://ubolonton.github.io/emacs-module-rs/latest/type-conversions.html#strings ). So if it's expected then > maybe it should be explicitly said in the docs of copy_string_contents here > https://www.gnu.org/software/emacs/manual/html_node/elisp/Module-Values.html ? It just says that it stores > the utf-8 encoded text which makes an impression that it's an always valid utf-8 string. I could look into the internals, but I actually wonder why the module doesn't check the text before relying on such subtle behaviors. We didn't document the fact that it signals an error for a reason. So: why cannot the module code or the application which uses it test up front that the string it copies is human-readable text, not some binary junk? ^ permalink raw reply [flat|nested] 8+ messages in thread
* bug#74922: Fwd: bug#74922: 29.4; copy_string_contents doesn't always produce a valid utf-8 2024-12-17 14:24 ` Eli Zaretskii @ 2024-12-17 14:46 ` Evgeny Kurnevsky 2024-12-17 15:10 ` Eli Zaretskii 0 siblings, 1 reply; 8+ messages in thread From: Evgeny Kurnevsky @ 2024-12-17 14:46 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 74922 [-- Attachment #1: Type: text/plain, Size: 1692 bytes --] It can definitely do it, but I guess in emacs-module-rs it's not done by default because of performance implications - it might be quite costly to check every string in some cases, and it wasn't really clear if emacs can pass an invalid string. So currently this case causes undefined behavior there which results in an emacs crash. On Tue, Dec 17, 2024 at 2:24 PM Eli Zaretskii <eliz@gnu.org> wrote: > > From: Evgeny Kurnevsky <kurnevsky@gmail.com> > > Date: Tue, 17 Dec 2024 13:31:57 +0000 > > > > Yes, that's a binary file that is not an utf-8 string. From the comment > in module_copy_string_contents > > implementation I guessed that in such cases emacs should signal an > error, but instead it just passes this > > invalid string to the dynamic library which caused this bug in > emacs-module-rs (see > > > https://ubolonton.github.io/emacs-module-rs/latest/type-conversions.html#strings > ). So if it's expected then > > maybe it should be explicitly said in the docs of copy_string_contents > here > > > https://www.gnu.org/software/emacs/manual/html_node/elisp/Module-Values.html > ? It just says that it stores > > the utf-8 encoded text which makes an impression that it's an always > valid utf-8 string. > > I could look into the internals, but I actually wonder why the module > doesn't check the text before relying on such subtle behaviors. We > didn't document the fact that it signals an error for a reason. > > So: why cannot the module code or the application which uses it test > up front that the string it copies is human-readable text, not some > binary junk? 
> -- Best regards, Evgeny Kurnevsky. [-- Attachment #2: Type: text/html, Size: 2485 bytes --] ^ permalink raw reply [flat|nested] 8+ messages in thread
* bug#74922: Fwd: bug#74922: 29.4; copy_string_contents doesn't always produce a valid utf-8 2024-12-17 14:46 ` Evgeny Kurnevsky @ 2024-12-17 15:10 ` Eli Zaretskii 2024-12-21 12:09 ` Eli Zaretskii 0 siblings, 1 reply; 8+ messages in thread From: Eli Zaretskii @ 2024-12-17 15:10 UTC (permalink / raw) To: Evgeny Kurnevsky; +Cc: 74922 > From: Evgeny Kurnevsky <kurnevsky@gmail.com> > Date: Tue, 17 Dec 2024 14:46:28 +0000 > Cc: 74922@debbugs.gnu.org > > It can definitely do it, but I guess in emacs-module-rs it's not done by default because of performance > implications - it might be quite costly to check every string in some cases, and it wasn't really clear if emacs > can pass an invalid string. So currently this case causes undefined behavior there which results in emacs > crash. What do Rust programs do when they are told to read random files? This is the same situation, basically. And what would the module do if copy_string_contents *did* signal an error? ^ permalink raw reply [flat|nested] 8+ messages in thread
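One possible answer to the question above, as a minimal Rust sketch. It is not taken from emacs-module-rs, and the function names `checked_text` and `lossy_text` are hypothetical. A Rust program handed arbitrary bytes would normally use a checked or lossy conversion from the standard library, so invalid input surfaces as an error value or a replacement character rather than as undefined behavior:

```rust
use std::borrow::Cow;
use std::str::Utf8Error;

/// Checked conversion: returns an error instead of assuming the bytes are text.
/// (Hypothetical helper name, for illustration only.)
fn checked_text(bytes: &[u8]) -> Result<&str, Utf8Error> {
    std::str::from_utf8(bytes)
}

/// Lossy conversion: every invalid sequence is replaced with U+FFFD.
/// (Hypothetical helper name, for illustration only.)
fn lossy_text(bytes: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(bytes)
}
```

Both conversions scan the whole input once, which is the cost mentioned earlier in the thread. The unchecked variant (`std::str::from_utf8_unchecked`) skips that scan, but it is `unsafe` precisely because the caller must guarantee validity.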
* bug#74922: Fwd: bug#74922: 29.4; copy_string_contents doesn't always produce a valid utf-8 2024-12-17 15:10 ` Eli Zaretskii @ 2024-12-21 12:09 ` Eli Zaretskii 2025-01-04 11:39 ` Eli Zaretskii 0 siblings, 1 reply; 8+ messages in thread From: Eli Zaretskii @ 2024-12-21 12:09 UTC (permalink / raw) To: kurnevsky; +Cc: 74922

> Cc: 74922@debbugs.gnu.org
> Date: Tue, 17 Dec 2024 17:10:36 +0200
> From: Eli Zaretskii <eliz@gnu.org>
>
> > From: Evgeny Kurnevsky <kurnevsky@gmail.com>
> > Date: Tue, 17 Dec 2024 14:46:28 +0000
> > Cc: 74922@debbugs.gnu.org
> >
> > It can definitely do it, but I guess in emacs-module-rs it's not done by default because of performance
> > implications - it might be quite costly to check every string in some cases, and it wasn't really clear if emacs
> > can pass an invalid string. So currently this case causes undefined behavior there which results in emacs
> > crash.
>
> What do Rust programs do when they are told to read random files?
> This is the same situation, basically.
>
> And what would the module do if copy_string_contents *did* signal an
> error?

I think I know what happened: you called copy_string_contents with a
unibyte string. In that case, copy_string_contents will return you
the original string without doing anything. The code in
copy_string_contents that signals an error relies on the fact that
encoding the input string yields nil if the input includes non-Unicode
characters. But that cannot be established with unibyte strings,
because a unibyte string doesn't hold characters, it holds raw bytes.

What you should do is make sure the string passed to
copy_string_contents is a multibyte string. If I do that, i.e.

  (switch-to-buffer "foo")
  (set-buffer-multibyte t)
  (insert-file-contents "/path/to/wg-private-pc.age")
  (setq str1 (buffer-string))

and then call copy_string_contents with the resulting string str1, I
get the result you expected.

You need to realize that copy_string_contents is a variant of text-encoding routines: it encodes the input multibyte string in UTF-8. The encoding routines in Emacs always return unibyte strings without doing anything, because a unibyte string is already encoded, or at least is supposed to be encoded. And before you ask: no, copy_string_contents cannot by itself signal an error if passed a unibyte string, because a unibyte string can legitimately be a valid UTF-8 string. So in this case, copy_string_contents relies on the caller to make sure the input is valid UTF-8. ^ permalink raw reply [flat|nested] 8+ messages in thread
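The caller-side responsibility described above can also be sketched from the module's point of view. This is only an illustration under stated assumptions: `buf` stands for bytes a module has already obtained from copy_string_contents (including the terminating null byte the manual describes), the helper name `text_from_copied` is hypothetical, and this is not how emacs-module-rs is actually implemented.

```rust
use std::string::FromUtf8Error;

/// Hypothetical helper: validate bytes copied out of Emacs before
/// treating them as text. A unibyte Lisp string may hold raw bytes,
/// so the module cannot assume the buffer is valid UTF-8.
fn text_from_copied(mut buf: Vec<u8>) -> Result<String, FromUtf8Error> {
    // Drop the terminating NUL byte if it is present.
    if buf.last() == Some(&0) {
        buf.pop();
    }
    // Checked conversion: invalid input becomes an error value that
    // still owns the original bytes.
    String::from_utf8(buf)
}
```

On failure, `FromUtf8Error::into_bytes` hands the raw bytes back, so a module could either signal an error to Lisp or fall back to treating the data as binary.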
* bug#74922: Fwd: bug#74922: 29.4; copy_string_contents doesn't always produce a valid utf-8 2024-12-21 12:09 ` Eli Zaretskii @ 2025-01-04 11:39 ` Eli Zaretskii 0 siblings, 0 replies; 8+ messages in thread From: Eli Zaretskii @ 2025-01-04 11:39 UTC (permalink / raw) To: kurnevsky; +Cc: 74922-done > Cc: 74922@debbugs.gnu.org > Date: Sat, 21 Dec 2024 14:09:24 +0200 > From: Eli Zaretskii <eliz@gnu.org> > > > Cc: 74922@debbugs.gnu.org > > Date: Tue, 17 Dec 2024 17:10:36 +0200 > > From: Eli Zaretskii <eliz@gnu.org> > > > > > From: Evgeny Kurnevsky <kurnevsky@gmail.com> > > > Date: Tue, 17 Dec 2024 14:46:28 +0000 > > > Cc: 74922@debbugs.gnu.org > > > > > > It can definitely do it, but I guess in emacs-module-rs it's not done by default because of performance > > > implications - it might be quite costly to check every string in some cases, and it wasn't really clear if emacs > > > can pass an invalid string. So currently this case causes undefined behavior there which results in emacs > > > crash. > > > > What do Rust programs do when they are told to read random files? > > This is the same situation, basically. > > > > And what would the module do if copy_string_contents *did* signal an > > error? > > I think I know what happened: you called copy_string_contents with a > unibyte string. In that case, copy_string_contents will return you > the original string without doing anything. The code in > copy_string_contents that signals an error relies on the fact that > encoding the input string yields nil if the input includes non-Unicode > characters. But that cannot be established with unibyte strings, > because a unibyte string doesn't hold characters, it holds raw bytes. > > What you should do is make sure the string passed to > copy_string_contents is a multibyte string. If I do that, i.e. 
> > (switch-to-buffer "foo") > (set-buffer-multibyte t) > (insert-file-contents "/path/to/wg-private-pc.age") > (setq str1 (buffer-string)) > > and then call copy_string_contents with the resulting string str1, I > get the result you expected. > > You need to realize that copy_string_contents is a variant of > text-encoding routines: it encodes the input multibyte string in > UTF-8. The encoding routines in Emacs always return unibyte strings > without doing anything, because a unibyte string is already encoded, > or at least is supposed to be encoded. > > And before you ask: no, copy_string_contents cannot by itself signal > an error if passed a unibyte string, because a unibyte string can > legitimately be a valid UTF-8 string. So in this case, > copy_string_contents relies on the caller to make sure the input is > valid UTF-8. I believe the above explains the problem and the solution, so I'm now closing this bug. ^ permalink raw reply [flat|nested] 8+ messages in thread