From mboxrd@z Thu Jan 1 00:00:00 1970 From: Maxim Cournoyer Subject: bug#30116: [PATCH] `substitute' crashes when file contains NUL characters (core-updates) Date: Mon, 22 Jan 2018 23:27:04 -0500 Message-ID: <87k1w9ryhz.fsf@gmail.com> References: <87r2qrc3mq.fsf@gmail.com> <87o9lu6o9m.fsf@gnu.org> <87607vu9dp.fsf@gmail.com> <877esb84ae.fsf@netris.org> <87o9lmp3c2.fsf@gnu.org> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="=-=-=" Return-path: Received: from eggs.gnu.org ([2001:4830:134:3::10]:48658) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1edqBr-0007Ud-WB for bug-guix@gnu.org; Mon, 22 Jan 2018 23:28:10 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1edqBo-0001q0-P7 for bug-guix@gnu.org; Mon, 22 Jan 2018 23:28:08 -0500 Received: from debbugs.gnu.org ([208.118.235.43]:59223) by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16) (Exim 4.71) (envelope-from ) id 1edqBo-0001pe-Ie for bug-guix@gnu.org; Mon, 22 Jan 2018 23:28:04 -0500 Received: from Debian-debbugs by debbugs.gnu.org with local (Exim 4.84_2) (envelope-from ) id 1edqBm-0005zE-DQ for bug-guix@gnu.org; Mon, 22 Jan 2018 23:28:04 -0500 Sender: "Debbugs-submit" Resent-Message-ID: In-Reply-To: <87o9lmp3c2.fsf@gnu.org> ("Ludovic \=\?utf-8\?Q\?Court\=C3\=A8s\=22'\?\= \=\?utf-8\?Q\?s\?\= message of "Mon, 22 Jan 2018 11:58:37 +0100") List-Id: Bug reports for GNU Guix List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: bug-guix-bounces+gcggb-bug-guix=m.gmane.org@gnu.org Sender: "bug-Guix" To: Ludovic =?UTF-8?Q?Court=C3=A8s?= Cc: 30116@debbugs.gnu.org --=-=-= Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable ludo@gnu.org (Ludovic Court=C3=A8s) writes: > Mark H Weaver skribis: > >> Maxim Cournoyer writes: >> >>> ludo@gnu.org (Ludovic Court=C3=A8s) writes: >>> >>>> Maxim Cournoyer skribis: >>>> >>>>> I've encountered the following crash when trying to use substitute on= a >>>>> file which contains NUL characters: >>>> >>>> Yes, that=E2=80=99s because Guile=E2=80=99s =E2=80=98regexp-exec=E2=80= =99 simply wraps libc=E2=80=99s =E2=80=98regexec=E2=80=99, >>>> which does not handle NULs. >>>> >>>> We should consider switching to the pure-Scheme SRFI-115: >>>> >>>> https://srfi.schemers.org/srfi-115/srfi-115.html >>> >>> This looks good, and I started looking into porting `substitute' to it, >>> but quickly noticed it doesn't seem to be implemented in Guile yet? > > ISTR that the reference implementation works fine on Guile. > >> Indeed. SRFI-115 for Guile is on my TODO list, although it might be >> better to wait until after we switch to using UTF-8 encoding internally >> for strings, since that will drastically affect the implementation of >> any efficient regexp matcher on Scheme strings. > > Indeed, though I suppose it doesn=E2=80=99t matter much for the cases whe= re > =E2=80=98substitute*=E2=80=99 is used? > >> Anyway, 'substitute*' is to be used only on text files, and NUL bytes >> are not a valid textual character. So, I think that this case is >> outside of what 'substitute*' is meant to do, and therefore not a bug in >> 'substitute*', although of course a more graceful error would surely be >> preferable. > > Yes, that=E2=80=99s also a good point. > > So yeah, I think it may be good =E2=80=9Ceventually=E2=80=9D to switch to= SRFI-115, but > that=E2=80=99s not urgent. Sorry for taking some time to answer; I was puzzled by the fact that my repro didn't work when ran from the REPL. It seems the problem only occurs when run inside Guix's build environment, maybe a side effect which depends on the locale used? In the `patch-el-files' phase of the emacs-build-system, we find the following snippet: (with-directory-excursion el-dir ;; Some old '.el' files (e.g., tex-buf.el in AUCTeX) are still encoded ;; with the "ISO-8859-1" locale. (unless (false-if-exception (substitute-cmd)) (with-fluids ((%default-port-encoding "ISO-8859-1")) (substitute-cmd)))) In case an exception is returned while processing the file, it is retried being opened with the "ISO-8859-1" encoding. Or, this resolves to a call to `open-file', which documentation says: =E2=80=98b=E2=80=99 Use binary mode, ensuring that each byte in the file will be read as one Scheme character. To provide this property, the file will be opened with the 8-bit character encoding "ISO-8859-1", ignoring the default port encoding. *Note Ports::, for more information on port encodings. So, by opening an file whose encoding is unknown as a ISO-8859-1 file, we are doing the same as if we had passed the 'binary option. Could this explain why we end up with NUL characters where we were expecting text? To validate this hypothesis, I've added the following test message to the patch-el-files phase: (unless (false-if-exception (substitute-cmd)) (format (current-error-port) ">>> IS THIS IT? <<<") (with-fluids ((%default-port-encoding "ISO-8859-1")) (substitute-cmd)))) And re-ran the emacs-realgud build (minus the patch working around this issue), and this is what I got: --8<---------------cut here---------------start------------->8--- starting phase `patch-el-files' >>> IS THIS IT? << =E2=80=A6) In /gnu/store/mz8vs1cxv1z7yrc1awzgby61qnxd481p-module-import/guix/build/gnu= -build-system.scm: 684:27 12 (_ _) In /gnu/store/mz8vs1cxv1z7yrc1awzgby61qnxd481p-module-import/guix/build/ema= cs-build-system.scm: 117:10 11 (patch-el-files #:outputs _) In srfi/srfi-1.scm: 640:9 10 (for-each # _) In ice-9/boot-9.scm: 849:4 9 (with-throw-handler _ _ _) In ice-9/ports.scm: 444:17 8 (call-with-input-file _ _ #:binary _ #:encoding _ # _) In /gnu/store/mz8vs1cxv1z7yrc1awzgby61qnxd481p-module-import/guix/build/uti= ls.scm: 609:26 7 (_ _) 635:26 6 (_ # #) In srfi/srfi-1.scm: 466:18 5 (fold # =E2=80=A6) In /gnu/store/mz8vs1cxv1z7yrc1awzgby61qnxd481p-module-import/guix/build/uti= ls.scm: 638:37 4 (_ _ "\"II*\x00(\x03\x00\x00=C3=BF=C3=BF=C3=BF=C3=BF=C3=BF=C3= =BF=C3=BF=C3=BF=C3=BE=C3=BF@@@@=C3=BF=C3=BF=C3=BF=C3=BF\x04\x04\=E2=80=A6") In ice-9/regex.scm: 189:12 3 (list-matches _ _ _) 177:19 2 (fold-matches _ "\"II*\x00(\x03\x00\x00=C3=BF=C3=BF=C3=BF=C3= =BF=C3=BF=C3=BF=C3=BF=C3=BF=C3=BE=C3=BF@@@@=C3=BF=E2=80=A6" =E2=80=A6) In unknown file: 1 (regexp-exec # "\"II*\x00(\x03\x00\x00=C3=BF=E2= =80=A6" =E2=80=A6) In ice-9/boot-9.scm: 760:25 0 (dispatch-exception _ _ _) ice-9/boot-9.scm:760:25: In procedure dispatch-exception: ice-9/boot-9.scm:760:25: string contains #\nul character: "\"II*\x00(\x03\x= 00\x00=C3=BF=C3=BF=C3=BF=C3=BF=C3=BF=C3=BF=C3=BF=C3=BF=C3=BE=C3=BF@@@@=C3= =BF=C3=BF=C3=BF=C3=BF\x04\x04\x04\x04=C3=BF=C3=BF=C3=BF=C3=BF\x01\x01\x01\x= 01=C3=BF=C3=BF=C3=BF=C3=BF\x01\x01\x01\x01=C3=BF=C3=BF=C3=BF=C3=BF\x01\x01\= x01\x01=C3=BF=C3=BF=C3=BF=C3=BF\x04\x04\x04\x04=C3=BF=C3=BF=C3=BF=C3=BFBBBB= =C3=BF=C3=BF=C3=BF=C3=BF=C3=BF=C3=BF=C3=BF=C3=BF=C3=BF=C3=BF=C3=BF=C3=BF=C3= =BF=C3=BF=C3=BF=C3=BF=C3=BF=C3=BF=C3=BE=C3=BF@@@@=C3=BF=C3=BF=C3=BF=C3=BF\x= 04\x04\x04\x04=C3=BF=C3=BF=C3=BF=C3=BF\x01\x01\x01\x01=C3=BF=C3=BF=C3=BF=C3= =BF\x01\x01\x01\x01=C3=BF=C3=BF=C3=BF=C3=BF\x01\x01\x01\x01=C3=BF=C3=BF=C3= =BF=C3=BF\x01\x01\x01\x01=C3=BF=C3=BF=C3=BF=C3=BF\x01\x01\x01\x01=C3=BF=C3= =BF=C3=BF=C3=BF\x04\x04\x04\x04=C3=BF=C3=BF=C3=BF=C3=BFBBBB=C3=BF=C3=BF=C3= =BF=C3=BF=C3=BF=C3=BF=C3=BF=C3=BF=C3=BF=C3=BF=C3=BF=C3=BF\x04\x04\x04\x04= =C3=BF=C3=BF=C3=BF=C3=BF\x01\x01\x01\x01=C3=BF=C3=BF=C3=BF=C3=BF\x01\x01\x0= 1\x01=C3=BF=C3=BF=C3=BF=C3=BF\x01\x01\x01\x01=C3=BF=C3=BF=C3=BF=C3=BF\x01\x= 01\x01\x01=C3=BF=C3=BF=C3=BF=C3=BF\x01\x01\x01\x01=C3=BF=C3=BF=C3=BF=C3=BF\= x01\x01\x01\x01=C3=BF=C3=BF=C3=BF=C3=BF\x01\x01\x01\x01=C3=BF=C3=BF=C3=BF= =C3=BF\x04\x04\x04\x04=C3=BF=C3=BF=C3=BF=C3=BFBBBB=C3=BF=C3=BF=C3=BF=C3=BF\= x01\x01\x01\x01=C3=BF=C3=BF=C3=BF=C3=BF\x01\x01\x01\x01=C3=BF=C3=BF=C3=BF= =C3=BF\x01\x01\x01\x01=C3=BF=C3=BF=C3=BF=C3=BF\x01\x01\x01\x01=C3=BF=C3=BF= =C3=BF=C3=BF\x01\x01\x01\x01=C3=BF=C3=BF=C3=BF=C3=BF\x01\x01\x01\x01=C3=BF= =C3=BF=C3=BF=C3=BF\x01\x01\x01\x01=C3=BF=C3=BF=C3=BF=C3=BF\x01\x01\x01\x01= =C3=BF=C3=BF=C3=BF=C3=BF\x01\x01\x01\x01=C3=BF=C3=BF=C3=BF=C3=BF\x10\x10\x1= 0\x10=C3=BF=C3=BF=C3=BF=C3=BF\x01\x01\x01\x01=C3=BF=C3=BF=C3=BF=C3=BF\x01\x= 01\x01\x01=C3=BF=C3=BF=C3=BF=C3=BF\x01\x01\x01\x01=C3=BF=C3=BF=C3=BF=C3=BF\= x01\x01\x01\x01=C3=BF=C3=BF=C3=BF=C3=BF\x01\x01\x01\x01=C3=BF=C3=BF=C3=BF= =C3=BF\x01\x01\x01\x01=C3=BF=C3=BF=C3=BF=C3=BF\x01\x01\x01\x01=C3=BF=C3=BF= =C3=BF=C3=BF\x01\x01\x01\x01=C3=BF=C3=BF=C3=BF=C3=BF\x01\x01\x01\x01=C3=BF= =C3=BF=C3=BF=C3=BF\x10\x10\x10\x10=C3=BF=C3=BF=C3=BF=C3=BF\x01\x01\x01\x01= =C3=BF=C3=BF=C3=BF=C3=BF\x01\x01\x01\x01=C3=BF=C3=BF=C3=BF=C3=BF\x01\x01\x0= 1\x01=C3=BF=C3=BF=C3=BF=C3=BF\x01\x01\x01\x01=C3=BF=C3=BF=C3=BF=C3=BF\x01\x= 01\x01\x01=C3=BF=C3=BF=C3=BF=C3=BF\x01\x01\x01\x01=C3=BF=C3=BF=C3=BF=C3=BF\= x01\x01\x01\x01=C3=BF=C3=BF=C3=BF=C3=BF\x01\x01\x01\x01=C3=BF=C3=BF=C3=BF= =C3=BF\x01\x01\x01\x01=C3=BF=C3=BF=C3=BF=C3=BF\x10\x10\x10\x10=C3=BF=C3=BF= =C3=BF=C3=BF\x04\x04\x04\x04=C3=BF=C3=BF=C3=BF=C3=BF\x01\x01\x01\x01=C3=BF= =C3=BF=C3=BF=C3=BF\x01\x01\x01\x01=C3=BF=C3=BF=C3=BF=C3=BF\x01\x01\x01\x01= =C3=BF=C3=BF=C3=BF=C3=BF\x01\x01\x01\x01=C3=BF=C3=BF=C3=BF=C3=BF\x01\x01\x0= 1\x01=C3=BF=C3=BF=C3=BF=C3=BF\x01\x01\x01\x01=C3=BF=C3=BF=C3=BF=C3=BF\x01\x= 01\x01\x01=C3=BF=C3=BF=C3=BF=C3=BF\x04\x04\x04\x04=C3=BF=C3=BF=C3=BE=C3=BF>= >>>=C3=BF=C3=BF=C3=BE=C3=BF<<<<=C3=BF=C3=BF=C3=BF=C3=BF\x04\x04\x04\x04=C3= =BF=C3=BF=C3=BF=C3=BF\x01\x01\x01\x01=C3=BF=C3=BF=C3=BF=C3=BF\x01\x01\x01\x= 01=C3=BF=C3=BF=C3=BF=C3=BF\x01\x01\x01\x01=C3=BF=C3=BF=C3=BF=C3=BF\x01\x01\= x01\x01=C3=BF=C3=BF=C3=BF=C3=BF\x01\x01\x01\x01=C3=BF=C3=BF=C3=BF=C3=BF\x04= \x04\x04\x04=C3=BF=C3=BF=C3=BE=C3=BF>>>>=C3=BF=C3=BF=C3=BF=C3=BF=C3=BF=C3= =BF=C3=BF=C3=BF=C3=BF=C3=BF=C3=BF=C3=BF=C3=BF=C3=BF=C3=BF=C3=BF=C3=BF=C3=BF= =C3=BE=C3=BF<<<<=C3=BF=C3=BF=C3=BF=C3=BF\x04\x04\x04\x04=C3=BF=C3=BF=C3=BF= =C3=BF\x01\x01\x01\x01=C3=BF=C3=BF=C3=BF=C3=BF\x01\x01\x01\x01=C3=BF=C3=BF= =C3=BF=C3=BF\x01\x01\x01\x01=C3=BF=C3=BF=C3=BF=C3=BF\x04\x04\x04\x04=C3=BF= =C3=BF=C3=BE=C3=BF>>>>=C3=BF=C3=BF=C3=BF=C3=BF=C3=BF=C3=BF=C3=BF=C3=BF=C3= =BF=C3=BF=C3=BF=C3=BF=C3=BF=C3=BF=C3=BF=C3=BF=C3=BF=C3=BF=C3=BF=C3=BF=C3=BF= =C3=BF=C3=BF=C3=BF=C3=BF=C3=BF=C3=BF=C3=BF=C3=BF=C3=BF=C3=BF=C3=BF=C3=BF=C3= =BF=C3=BE=C3=BF<<<<=C3=BF=C3=BF=C3=BF=C3=BF\x0f\x0f\x0f\x0f=C3=BF=C3=BF=C3= =BF=C3=BF\x0f\x0f\x0f\x0f=C3=BF=C3=BF=C3=BF=C3=BF\x0f\x0f\x0f\x0f=C3=BF=C3= =BF=C3=BE=C3=BF>>>>=C3=BF=C3=BF=C3=BF=C3=BF=C3=BF=C3=BF=C3=BF=C3=BF=C3=BF= =C3=BF=C3=BF=C3=BF=C3=BF=C3=BF=C3=BF=C3=BF=C3=BF=C3=BF=C3=BF=C3=BF=C3=BF=C3= =BF=C3=BF=C3=BF=C3=BF=C3=BF\x14\x00\x00\x01\x03\x00\x01\x00\x00\x00\n" builder for `/gnu/store/ar2j6kxz99s3s5wjs2z7ykiw75m9vv72-emacs-realgud-1.4.= 4.drv' failed with exit code 1 @ build-failed /gnu/store/ar2j6kxz99s3s5wjs2z7ykiw75m9vv72-emacs-realgud-1.= 4.4.drv - 1 builder for `/gnu/store/ar2j6kxz99s3s5wjs2z7ykiw75m9vv72-emacs-= realgud-1.4.4.drv' failed with exit code 1 guix build: error: build failed: build of `/gnu/store/ar2j6kxz99s3s5wjs2z7ykiw75m9vv72-emacs-realgud-1.4.4.drv' failed --8<---------------cut here---------------end--------------->8--- So it is indeed triggered by switching to the "ISO-8859-1" encoding (although I still cannot reproduce this from the REPL?). If I remove the exception guard like this: --8<---------------cut here---------------start------------->8--- (with-directory-excursion el-dir ;; Some old '.el' files (e.g., tex-buf.el in AUCTeX) are still encod= ed ;; with the "ISO-8859-1" locale. - (unless (false-if-exception (substitute-cmd)) - (with-fluids ((%default-port-encoding "ISO-8859-1")) - (substitute-cmd)))) + (substitute-cmd)) #t)) --8<---------------cut here---------------end--------------->8--- the exception thrown on the first substitute* call is this: --8<---------------cut here---------------start------------->8--- starting phase `patch-el-files' Backtrace: 12 (primitive-load "/gnu/store/dvyyqxfr08fsr18k2f43gakh23d=E2=80= =A6") In ice-9/eval.scm: 191:35 11 (_ _) In srfi/srfi-1.scm: 863:16 10 (every1 # =E2=80=A6) In /gnu/store/xn6p33hhfyz6l5j9jd9qpnblp9ajnb9k-module-import/guix/build/gnu= -build-system.scm: 684:27 9 (_ _) In /gnu/store/xn6p33hhfyz6l5j9jd9qpnblp9ajnb9k-module-import/guix/build/ema= cs-build-system.scm: 104:27 8 (patch-el-files #:outputs _) In srfi/srfi-1.scm: 640:9 7 (for-each # _) In ice-9/boot-9.scm: 849:4 6 (with-throw-handler _ _ _) In ice-9/ports.scm: 444:17 5 (call-with-input-file _ _ #:binary _ #:encoding _ # _) In /gnu/store/xn6p33hhfyz6l5j9jd9qpnblp9ajnb9k-module-import/guix/build/uti= ls.scm: 609:26 4 (_ _) 645:22 3 (_ # #) In ice-9/rdelim.scm: 195:24 2 (read-line _ _) In unknown file: 1 (%read-line #) In ice-9/boot-9.scm: 760:25 0 (dispatch-exception _ _ _) ice-9/boot-9.scm:760:25: In procedure dispatch-exception: ice-9/boot-9.scm:760:25: Throw to key `decoding-error' with args `("peek-char" "input decoding error" 84 #)'. --8<---------------cut here---------------end--------------->8--- Should we keep my workaround for now? It seems there are valid cases to have the file opened as "ISO-8859-1", but this can mean introducing binary symbols such as NUL in the data (thus regexp crashes). When we finally move to srfi-115, we should remove this workaround. WDYT? Here's an updated patch with Ludovic's suggestion: --=-=-= Content-Type: text/x-patch; charset=utf-8 Content-Disposition: attachment; filename=0001-utils-Prevent-substitute-from-crashing-on-files-cont.patch Content-Transfer-Encoding: quoted-printable >From 573ecb3570355c47aa18c091c0a193e7d90a6949 Mon Sep 17 00:00:00 2001 From: Maxim Cournoyer Date: Sun, 14 Jan 2018 20:31:33 -0500 Subject: [PATCH] utils: Prevent substitute from crashing on files containing NUL chars. Fixes issue #30116. * guix/build/utils.scm (substitute): Add condition to skip lines containing the NUL character. --- guix/build/utils.scm | 46 ++++++++++++++++++++++++++++------------------ 1 file changed, 28 insertions(+), 18 deletions(-) diff --git a/guix/build/utils.scm b/guix/build/utils.scm index 7391307c8..2a37dba06 100644 --- a/guix/build/utils.scm +++ b/guix/build/utils.scm @@ -3,6 +3,7 @@ ;;; Copyright =C2=A9 2013 Andreas Enge ;;; Copyright =C2=A9 2013 Nikita Karetnikov ;;; Copyright =C2=A9 2015 Mark H Weaver +;;; Copyright =C2=A9 2018 Maxim Cournoyer ;;; ;;; This file is part of GNU Guix. ;;; @@ -621,28 +622,37 @@ PROC as (PROC LINE MATCHES); PROC must return the lin= e that will be written as a substitution of the original line. Be careful about using '$' to match = the end of a line; by itself it won't match the terminating newline of a line." (let ((rx+proc (map (match-lambda - (((? regexp? pattern) . proc) - (cons pattern proc)) - ((pattern . proc) - (cons (make-regexp pattern regexp/extended) - proc))) + (((? regexp? pattern) . proc) + (cons pattern proc)) + ((pattern . proc) + (cons (make-regexp pattern regexp/extended) + proc))) pattern+procs))) (with-atomic-file-replacement file (lambda (in out) (let loop ((line (read-line in 'concat))) - (if (eof-object? line) - #t - (let ((line (fold (lambda (r+p line) - (match r+p - ((regexp . proc) - (match (list-matches regexp line) - ((and m+ (_ _ ...)) - (proc line m+)) - (_ line))))) - line - rx+proc))) - (display line out) - (loop (read-line in 'concat))))))))) + (cond + ((eof-object? line) + #t) + ((string-index line #\nul) + ;; The regexp functions of the GNU C library (which Guile uses) + ;; cannot deal with NUL characters, so skip to the next line. + ;; TODO: Port to srfi-115, once we have it implemented in Guil= e. + (format (current-error-port) + "skipping line with NUL characters: ~s\n" line) + (loop (read-line in 'concat))) + (else + (let ((line (fold (lambda (r+p line) + (match r+p + ((regexp . proc) + (match (list-matches regexp line) + ((and m+ (_ _ ...)) + (proc line m+)) + (_ line))))) + line + rx+proc))) + (display line out) + (loop (read-line in 'concat)))))))))) =20 =20 (define-syntax let-matches --=20 2.16.0 --=-=-= Content-Type: text/plain Maxim --=-=-=--