ludo@gnu.org (Ludovic Courtès) writes:

> Mark H Weaver skribis:
>
>> Maxim Cournoyer writes:
>>
>>> ludo@gnu.org (Ludovic Courtès) writes:
>>>
>>>> Maxim Cournoyer skribis:
>>>>
>>>>> I've encountered the following crash when trying to use substitute on a
>>>>> file which contains NUL characters:
>>>>
>>>> Yes, that’s because Guile’s ‘regexp-exec’ simply wraps libc’s ‘regexec’,
>>>> which does not handle NULs.
>>>>
>>>> We should consider switching to the pure-Scheme SRFI-115:
>>>>
>>>> https://srfi.schemers.org/srfi-115/srfi-115.html
>>>
>>> This looks good, and I started looking into porting `substitute' to it,
>>> but quickly noticed it doesn't seem to be implemented in Guile yet?
>
> ISTR that the reference implementation works fine on Guile.
>
>> Indeed. SRFI-115 for Guile is on my TODO list, although it might be
>> better to wait until after we switch to using UTF-8 encoding internally
>> for strings, since that will drastically affect the implementation of
>> any efficient regexp matcher on Scheme strings.
>
> Indeed, though I suppose it doesn’t matter much for the cases where
> ‘substitute*’ is used?
>
>> Anyway, 'substitute*' is to be used only on text files, and NUL bytes
>> are not a valid textual character. So, I think that this case is
>> outside of what 'substitute*' is meant to do, and therefore not a bug in
>> 'substitute*', although of course a more graceful error would surely be
>> preferable.
>
> Yes, that’s also a good point.
>
> So yeah, I think it may be good “eventually” to switch to SRFI-115, but
> that’s not urgent.

Sorry for taking some time to answer; I was puzzled by the fact that my
repro didn't work when run from the REPL. It seems the problem only
occurs when run inside Guix's build environment, maybe as a side effect
that depends on the locale used?
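One way to poke at the encoding question outside of the build
environment is a small experiment along these lines. This is only a
sketch, assuming Guile 2.2 or later: it reads the same bytes (the first
few bytes of the string printed in the backtrace further down, plus a
newline) through a port set first to UTF-8 and then to ISO-8859-1. The
conversion strategy is forced to 'error on the assumption that the
default strategy may differ between the REPL and the build environment.
The names `sample', `read-first-line' and `outcome' are made up for
illustration.

```scheme
(use-modules (rnrs bytevectors)
             (ice-9 binary-ports)
             (ice-9 rdelim)
             (ice-9 regex))

;; First bytes of the string shown in the backtrace, plus a newline.
(define sample
  (u8-list->bytevector
   '(#x22 #x49 #x49 #x2a #x00 #x28 #x03 #x00 #x00 #xff #xff #x0a)))

;; Read the first line of SAMPLE through a port decoding with ENCODING.
;; The 'error strategy is set explicitly (an assumption about what the
;; build environment uses) so that invalid input raises an exception.
(define (read-first-line encoding)
  (let ((port (open-bytevector-input-port sample)))
    (set-port-encoding! port encoding)
    (set-port-conversion-strategy! port 'error)
    (read-line port)))

;; Run THUNK; return 'ok, or the key of the exception it raised.
(define (outcome thunk)
  (catch #t
    (lambda () (thunk) 'ok)
    (lambda (key . args) key)))

;; Every byte is a valid ISO-8859-1 character, so this read succeeds
;; and the zero bytes come back as #\nul characters.
(define latin1-line (read-first-line "ISO-8859-1"))

(format #t "UTF-8 read:      ~a~%"
        (outcome (lambda () (read-first-line "UTF-8"))))
(format #t "ISO-8859-1 read: ~a chars, NUL present: ~a~%"
        (string-length latin1-line)
        (->bool (string-index latin1-line #\nul)))
(format #t "regexp-exec:     ~a~%"
        (outcome (lambda ()
                   (regexp-exec (make-regexp "II") latin1-line))))
```

If the UTF-8 read reports `decoding-error' and the `regexp-exec' line
reports anything other than `ok', that would match the two failures
shown in the backtraces in this message.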
In the `patch-el-files' phase of the emacs-build-system, we find the
following snippet:

  (with-directory-excursion el-dir
    ;; Some old '.el' files (e.g., tex-buf.el in AUCTeX) are still encoded
    ;; with the "ISO-8859-1" locale.
    (unless (false-if-exception (substitute-cmd))
      (with-fluids ((%default-port-encoding "ISO-8859-1"))
        (substitute-cmd))))

If an exception is raised while processing the file, the substitution is
retried with the file opened using the "ISO-8859-1" encoding. Now, this
resolves to a call to `open-file', whose documentation says:

  ‘b’
       Use binary mode, ensuring that each byte in the file will be read
       as one Scheme character.

       To provide this property, the file will be opened with the 8-bit
       character encoding "ISO-8859-1", ignoring the default port
       encoding. *Note Ports::, for more information on port encodings.

So, by opening a file whose encoding is unknown as an ISO-8859-1 file, we
are doing the same as if we had passed the binary option. ISO-8859-1
maps every byte value to a character, so decoding never fails and a zero
byte simply becomes #\nul. Could this explain why we end up with NUL
characters where we were expecting text?

To validate this hypothesis, I've added the following test message to
the patch-el-files phase:

  (unless (false-if-exception (substitute-cmd))
    (format (current-error-port) ">>> IS THIS IT? <<<")
    (with-fluids ((%default-port-encoding "ISO-8859-1"))
      (substitute-cmd))))

I then re-ran the emacs-realgud build (minus the patch working around
this issue), and this is what I got:

--8<---------------cut here---------------start------------->8---
starting phase `patch-el-files'
>>> IS THIS IT? <<< …)
In /gnu/store/mz8vs1cxv1z7yrc1awzgby61qnxd481p-module-import/guix/build/gnu-build-system.scm:
   684:27 12 (_ _)
In /gnu/store/mz8vs1cxv1z7yrc1awzgby61qnxd481p-module-import/guix/build/emacs-build-system.scm:
   117:10 11 (patch-el-files #:outputs _)
In srfi/srfi-1.scm:
    640:9 10 (for-each # _)
In ice-9/boot-9.scm:
    849:4  9 (with-throw-handler _ _ _)
In ice-9/ports.scm:
   444:17  8 (call-with-input-file _ _ #:binary _ #:encoding _ # _)
In /gnu/store/mz8vs1cxv1z7yrc1awzgby61qnxd481p-module-import/guix/build/utils.scm:
   609:26  7 (_ _)
   635:26  6 (_ # #)
In srfi/srfi-1.scm:
   466:18  5 (fold # …)
In /gnu/store/mz8vs1cxv1z7yrc1awzgby61qnxd481p-module-import/guix/build/utils.scm:
   638:37  4 (_ _ "\"II*\x00(\x03\x00\x00ÿÿÿÿÿÿÿÿþÿ@@@@ÿÿÿÿ\x04\x04\…")
In ice-9/regex.scm:
   189:12  3 (list-matches _ _ _)
   177:19  2 (fold-matches _ "\"II*\x00(\x03\x00\x00ÿÿÿÿÿÿÿÿþÿ@@@@ÿ…" …)
In unknown file:
           1 (regexp-exec # "\"II*\x00(\x03\x00\x00ÿ…" …)
In ice-9/boot-9.scm:
   760:25  0 (dispatch-exception _ _ _)

ice-9/boot-9.scm:760:25: In procedure dispatch-exception:
ice-9/boot-9.scm:760:25: string contains #\nul character:
"\"II*\x00(\x03\x00\x00ÿÿÿÿÿÿÿÿþÿ@@@@ÿÿÿÿ\x04\x04\x04\x04ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x04\x04\x04\x04ÿÿÿÿBBBBÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿþÿ@@@@ÿÿÿÿ\x04\x04\x04\x04ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x04\x04\x04\x04ÿÿÿÿBBBBÿÿÿÿÿÿÿÿÿÿÿÿ\x04\x04\x04\x04ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x04\x04\x04\x04ÿÿÿÿBBBBÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x10\x10\x10\x10ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x10\x10\x10\x10ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x10\x10\x10\x10ÿÿÿÿ\x04\x04\x04\x04ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x04\x04\x04\x04ÿÿþÿ>>>>ÿÿþÿ<<<<ÿÿÿÿ\x04\x04\x04\x04ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x04\x04\x04\x04ÿÿþÿ>>>>ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿþÿ<<<<ÿÿÿÿ\x04\x04\x04\x04ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x04\x04\x04\x04ÿÿþÿ>>>>ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿþÿ<<<<ÿÿÿÿ\x0f\x0f\x0f\x0fÿÿÿÿ\x0f\x0f\x0f\x0fÿÿÿÿ\x0f\x0f\x0f\x0fÿÿþÿ>>>>ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ\x14\x00\x00\x01\x03\x00\x01\x00\x00\x00\n" builder for `/gnu/store/ar2j6kxz99s3s5wjs2z7ykiw75m9vv72-emacs-realgud-1.4.4.drv' failed with exit code 1 @ build-failed /gnu/store/ar2j6kxz99s3s5wjs2z7ykiw75m9vv72-emacs-realgud-1.4.4.drv - 1 builder for 
`/gnu/store/ar2j6kxz99s3s5wjs2z7ykiw75m9vv72-emacs-realgud-1.4.4.drv' failed with exit code 1
guix build: error: build failed: build of `/gnu/store/ar2j6kxz99s3s5wjs2z7ykiw75m9vv72-emacs-realgud-1.4.4.drv' failed
--8<---------------cut here---------------end--------------->8---

So it is indeed triggered by switching to the "ISO-8859-1" encoding
(although I still cannot reproduce this from the REPL?).

If I remove the exception guard like this:

--8<---------------cut here---------------start------------->8---
 (with-directory-excursion el-dir
   ;; Some old '.el' files (e.g., tex-buf.el in AUCTeX) are still encoded
   ;; with the "ISO-8859-1" locale.
-  (unless (false-if-exception (substitute-cmd))
-    (with-fluids ((%default-port-encoding "ISO-8859-1"))
-      (substitute-cmd))))
+  (substitute-cmd))
 #t))
--8<---------------cut here---------------end--------------->8---

the exception thrown on the first substitute* call is this:

--8<---------------cut here---------------start------------->8---
starting phase `patch-el-files'
Backtrace:
          12 (primitive-load "/gnu/store/dvyyqxfr08fsr18k2f43gakh23d…")
In ice-9/eval.scm:
   191:35 11 (_ _)
In srfi/srfi-1.scm:
   863:16 10 (every1 # …)
In /gnu/store/xn6p33hhfyz6l5j9jd9qpnblp9ajnb9k-module-import/guix/build/gnu-build-system.scm:
   684:27  9 (_ _)
In /gnu/store/xn6p33hhfyz6l5j9jd9qpnblp9ajnb9k-module-import/guix/build/emacs-build-system.scm:
   104:27  8 (patch-el-files #:outputs _)
In srfi/srfi-1.scm:
    640:9  7 (for-each # _)
In ice-9/boot-9.scm:
    849:4  6 (with-throw-handler _ _ _)
In ice-9/ports.scm:
   444:17  5 (call-with-input-file _ _ #:binary _ #:encoding _ # _)
In /gnu/store/xn6p33hhfyz6l5j9jd9qpnblp9ajnb9k-module-import/guix/build/utils.scm:
   609:26  4 (_ _)
   645:22  3 (_ # #)
In ice-9/rdelim.scm:
   195:24  2 (read-line _ _)
In unknown file:
           1 (%read-line #)
In ice-9/boot-9.scm:
   760:25  0 (dispatch-exception _ _ _)

ice-9/boot-9.scm:760:25: In procedure dispatch-exception:
ice-9/boot-9.scm:760:25: Throw to key `decoding-error' with args `("peek-char" "input decoding error" 84 #)'.
--8<---------------cut here---------------end--------------->8---

Should we keep my workaround for now? There seem to be valid cases for
opening the file as "ISO-8859-1", but doing so can let binary characters
such as NUL into the data (hence the regexp crashes). When we finally
move to SRFI-115, we should remove this workaround.

WDYT?

Here's an updated patch with Ludovic's suggestion: