From mboxrd@z Thu Jan 1 00:00:00 1970 From: Maxim Cournoyer Subject: bug#30116: [PATCH] `substitute' crashes when file contains NUL characters (core-updates)) Date: Sat, 16 Jun 2018 12:47:01 -0400 Message-ID: <87tvq2smpm.fsf@gmail.com> References: <87r2qrc3mq.fsf@gmail.com> <87k1wjc35d.fsf_-_@gmail.com> <87r2l9q294.fsf@netris.org> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Return-path: Received: from eggs.gnu.org ([2001:4830:134:3::10]:50506) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1fUEMv-0008VB-VG for bug-guix@gnu.org; Sat, 16 Jun 2018 12:48:07 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1fUEMs-0005IN-Nh for bug-guix@gnu.org; Sat, 16 Jun 2018 12:48:06 -0400 Received: from debbugs.gnu.org ([208.118.235.43]:44045) by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16) (Exim 4.71) (envelope-from ) id 1fUEMs-0005IJ-J2 for bug-guix@gnu.org; Sat, 16 Jun 2018 12:48:02 -0400 Received: from Debian-debbugs by debbugs.gnu.org with local (Exim 4.84_2) (envelope-from ) id 1fUEMs-0006jU-CZ for bug-guix@gnu.org; Sat, 16 Jun 2018 12:48:02 -0400 Sender: "Debbugs-submit" Resent-Message-ID: In-Reply-To: <87r2l9q294.fsf@netris.org> (Mark H. Weaver's message of "Thu, 14 Jun 2018 03:02:47 -0400") List-Id: Bug reports for GNU Guix List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: bug-guix-bounces+gcggb-bug-guix=m.gmane.org@gnu.org Sender: "bug-Guix" To: Mark H Weaver Cc: 30116@debbugs.gnu.org Hi Mark, Mark H Weaver writes: > Hi Maxim, > > Thanks for working on this. I found a problem with this patch, > and I also have a suggestion. Please see below. > > Maxim Cournoyer writes: > >> From 9891e428eae0ed24e0d61862b3f5e298606b79eb Mon Sep 17 00:00:00 2001 >> From: Maxim Cournoyer >> Date: Sun, 14 Jan 2018 20:31:33 -0500 >> Subject: [PATCH] utils: Prevent substitute from crashing on files contai= ning >> NUL chars. >> >> Fixes issue #30116. >> >> * guix/build/utils.scm (substitute): Add condition to skip lines contain= ing >> the NUL character. >> --- >> guix/build/utils.scm | 44 ++++++++++++++++++++++++++------------------ >> 1 file changed, 26 insertions(+), 18 deletions(-) >> >> diff --git a/guix/build/utils.scm b/guix/build/utils.scm >> index 7391307c8..975f4e70a 100644 >> --- a/guix/build/utils.scm >> +++ b/guix/build/utils.scm >> @@ -3,6 +3,7 @@ >> ;;; Copyright =C2=A9 2013 Andreas Enge >> ;;; Copyright =C2=A9 2013 Nikita Karetnikov >> ;;; Copyright =C2=A9 2015 Mark H Weaver >> +;;; Copyright =C2=A9 2018 Maxim Cournoyer >> ;;; >> ;;; This file is part of GNU Guix. >> ;;; >> @@ -621,28 +622,35 @@ PROC as (PROC LINE MATCHES); PROC must return the = line that will be written as >> a substitution of the original line. Be careful about using '$' to mat= ch the >> end of a line; by itself it won't match the terminating newline of a li= ne." >> (let ((rx+proc (map (match-lambda >> - (((? regexp? pattern) . proc) >> - (cons pattern proc)) >> - ((pattern . proc) >> - (cons (make-regexp pattern regexp/extended) >> - proc))) >> + (((? regexp? pattern) . proc) >> + (cons pattern proc)) >> + ((pattern . proc) >> + (cons (make-regexp pattern regexp/extended) >> + proc))) >> pattern+procs))) >> (with-atomic-file-replacement file >> (lambda (in out) >> (let loop ((line (read-line in 'concat))) >> - (if (eof-object? line) >> - #t >> - (let ((line (fold (lambda (r+p line) >> - (match r+p >> - ((regexp . proc) >> - (match (list-matches regexp line) >> - ((and m+ (_ _ ...)) >> - (proc line m+)) >> - (_ line))))) >> - line >> - rx+proc))) >> - (display line out) >> - (loop (read-line in 'concat))))))))) >> + (cond >> + ((eof-object? line) >> + #t) >> + ((string-contains line (make-string 1 #\nul)) >> + ;; The regexp functions of the GNU C library (which Guile u= ses) >> + ;; cannot deal with NUL characters, so skip to the next lin= e. >> + (format #t "skipping line with NUL characters: ~s\n" line) >> + (loop (read-line in 'concat))) > > This code will unconditionally *delete* all lines that contain NULs. > > This will happen because the lines with NULs are not being written to > the output file, which will replace the original file when this loop > reaches EOF. So, any lines that are not copied to the output will be > lost. > > To preserve the lines with NULs, you should call (display line out) > before calling 'loop'. Good observation! I agree that we should keep limit the effect of ignoring NULs only to the substitution. > Also, please use (string-index line #\nul) to check for NULs instead of > 'string-contains'. It should be more efficient. OK! > >> + (else >> + (let ((line (fold (lambda (r+p line) >> + (match r+p >> + ((regexp . proc) >> + (match (list-matches regexp line) >> + ((and m+ (_ _ ...)) >> + (proc line m+)) >> + (_ line))))) >> + line >> + rx+proc))) >> + (display line out) >> + (loop (read-line in 'concat)))))))))) > > With the changes suggested above, I would have no objection to pushing > this to core-updates. However, it occurs to me that we could handle the > NUL case in a better way: > > Since the C regex functions that we use cannot handle NUL bytes, we > could use a different code point to represent NUL during those > operations. We could choose a code point from one of the Unicode > Private Use Areas that > does not occur in the string. > > Let NUL* be the code point which will represent NUL bytes. First > replace all NULs with NUL*s, then perform the substitutions, and finally > replace all ALT*s with NULs before writing to the output. Do I understand this transformation as NULs -> NUL*s and back from NUL*s -> NULs correctly? I'm not sure how NUL*s became ALT*s in your explanation. > As an important optimization, we should avoid performing these extra > operations unless (string-index line #\nul) finds a NUL. OK. > We could then perform these extra substitutions with simple, inefficient > code. Maybe something like this (untested): > > (with-atomic-file-replacement file > (lambda (in out) > (let loop ((line (read-line in 'concat))) > (if (eof-object? line) > #t > (let* ((nul* (or (and (string-index line #\nul) > (unused-private-use-code-point line)) > #\nul)) > (line* (replace-char #\nul nul* line)) > (line1* (fold (lambda (r+p line) > (match r+p > ((regexp . proc) > (match (list-matches regexp line) > ((and m+ (_ _ ...)) > (proc line m+)) > (_ line))))) > line* > rx+proc)) > (line1 (replace-char nul* #\nul line1*))) > (display line1 out) > (loop (read-line in 'concat))))))))) > > > Where the following additional private procedures would be added to > (guix build utils) above the definition for 'substitute': > > (define (unused-private-use-code-point s) > "Find a code point within a Unicode Private Use Area that is not > present in S, and return the corresponding character object. If one > cannot be found, return false." > (define (scan lo hi) > (and (<=3D lo hi) > (let ((c (integer->char lo))) > (if (string-index s c) > (scan (+ lo 1) hi) > c)))) > (or (scan #xE000 #xF8FF) > (scan #xF0000 #xFFFFD) > (scan #x100000 #x10FFFD))) > > (define (replace-char c1 c2 s) > "Return a string which is equal to S except with all instances of C1 > replaced by C2. If C1 and C2 are equal, return S." > (if (char=3D? c1 c2) > s > (string-map (lambda (c) > (if (char=3D? c c1) > c2 > c)) > s))) > > What do you think? It raises the complexity level a bit for something which doesn't seem to be a very common scenario, but otherwise seems a very elegant workaround. It seems to me that your implementation is already pretty complete. I'll try write a test for validating it and report back. Thank you for sharing your ideas! Maxim