From mboxrd@z Thu Jan  1 00:00:00 1970
From: Maxim Cournoyer <maxim.cournoyer@gmail.com>
Subject: bug#30116: [PATCH] `substitute' crashes when file contains NUL
	characters (core-updates))
Date: Sat, 16 Jun 2018 12:47:01 -0400
Message-ID: <87tvq2smpm.fsf@gmail.com>
References: <87r2qrc3mq.fsf@gmail.com>
	<handler.30116.B.15159796942311.ack@debbugs.gnu.org>
	<87k1wjc35d.fsf_-_@gmail.com> <87r2l9q294.fsf@netris.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
Return-path: <bug-guix-bounces+gcggb-bug-guix=m.gmane.org@gnu.org>
Received: from eggs.gnu.org ([2001:4830:134:3::10]:50506)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <Debian-debbugs@debbugs.gnu.org>) id 1fUEMv-0008VB-VG
	for bug-guix@gnu.org; Sat, 16 Jun 2018 12:48:07 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <Debian-debbugs@debbugs.gnu.org>) id 1fUEMs-0005IN-Nh
	for bug-guix@gnu.org; Sat, 16 Jun 2018 12:48:06 -0400
Received: from debbugs.gnu.org ([208.118.235.43]:44045)
	by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16)
	(Exim 4.71) (envelope-from <Debian-debbugs@debbugs.gnu.org>)
	id 1fUEMs-0005IJ-J2
	for bug-guix@gnu.org; Sat, 16 Jun 2018 12:48:02 -0400
Received: from Debian-debbugs by debbugs.gnu.org with local (Exim 4.84_2)
	(envelope-from <Debian-debbugs@debbugs.gnu.org>) id 1fUEMs-0006jU-CZ
	for bug-guix@gnu.org; Sat, 16 Jun 2018 12:48:02 -0400
Sender: "Debbugs-submit" <debbugs-submit-bounces@debbugs.gnu.org>
Resent-Message-ID: <handler.30116.B30116.152916763325818@debbugs.gnu.org>
In-Reply-To: <87r2l9q294.fsf@netris.org> (Mark H. Weaver's message of "Thu, 14
	Jun 2018 03:02:47 -0400")
List-Id: Bug reports for GNU Guix <bug-guix.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/bug-guix>,
	<mailto:bug-guix-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/bug-guix/>
List-Post: <mailto:bug-guix@gnu.org>
List-Help: <mailto:bug-guix-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/bug-guix>,
	<mailto:bug-guix-request@gnu.org?subject=subscribe>
Errors-To: bug-guix-bounces+gcggb-bug-guix=m.gmane.org@gnu.org
Sender: "bug-Guix" <bug-guix-bounces+gcggb-bug-guix=m.gmane.org@gnu.org>
To: Mark H Weaver <mhw@netris.org>
Cc: 30116@debbugs.gnu.org

Hi Mark,

Mark H Weaver <mhw@netris.org> writes:

> Hi Maxim,
>
> Thanks for working on this.  I found a problem with this patch,
> and I also have a suggestion.  Please see below.
>
> Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:
>
>> From 9891e428eae0ed24e0d61862b3f5e298606b79eb Mon Sep 17 00:00:00 2001
>> From: Maxim Cournoyer <maxim.cournoyer@gmail.com>
>> Date: Sun, 14 Jan 2018 20:31:33 -0500
>> Subject: [PATCH] utils: Prevent substitute from crashing on files contai=
ning
>>  NUL chars.
>>
>> Fixes issue #30116.
>>
>> * guix/build/utils.scm (substitute): Add condition to skip lines contain=
ing
>> the NUL character.
>> ---
>>  guix/build/utils.scm | 44 ++++++++++++++++++++++++++------------------
>>  1 file changed, 26 insertions(+), 18 deletions(-)
>>
>> diff --git a/guix/build/utils.scm b/guix/build/utils.scm
>> index 7391307c8..975f4e70a 100644
>> --- a/guix/build/utils.scm
>> +++ b/guix/build/utils.scm
>> @@ -3,6 +3,7 @@
>>  ;;; Copyright =C2=A9 2013 Andreas Enge <andreas@enge.fr>
>>  ;;; Copyright =C2=A9 2013 Nikita Karetnikov <nikita@karetnikov.org>
>>  ;;; Copyright =C2=A9 2015 Mark H Weaver <mhw@netris.org>
>> +;;; Copyright =C2=A9 2018 Maxim Cournoyer <maxim.cournoyer@gmail.com>
>>  ;;;
>>  ;;; This file is part of GNU Guix.
>>  ;;;
>> @@ -621,28 +622,35 @@ PROC as (PROC LINE MATCHES); PROC must return the =
line that will be written as
>>  a substitution of the original line.  Be careful about using '$' to mat=
ch the
>>  end of a line; by itself it won't match the terminating newline of a li=
ne."
>>    (let ((rx+proc  (map (match-lambda
>> -                        (((? regexp? pattern) . proc)
>> -                         (cons pattern proc))
>> -                        ((pattern . proc)
>> -                         (cons (make-regexp pattern regexp/extended)
>> -                               proc)))
>> +                         (((? regexp? pattern) . proc)
>> +                          (cons pattern proc))
>> +                         ((pattern . proc)
>> +                          (cons (make-regexp pattern regexp/extended)
>> +                                proc)))
>>                         pattern+procs)))
>>      (with-atomic-file-replacement file
>>        (lambda (in out)
>>          (let loop ((line (read-line in 'concat)))
>> -          (if (eof-object? line)
>> -              #t
>> -              (let ((line (fold (lambda (r+p line)
>> -                                  (match r+p
>> -                                    ((regexp . proc)
>> -                                     (match (list-matches regexp line)
>> -                                       ((and m+ (_ _ ...))
>> -                                        (proc line m+))
>> -                                       (_ line)))))
>> -                                line
>> -                                rx+proc)))
>> -                (display line out)
>> -                (loop (read-line in 'concat)))))))))
>> +          (cond
>> +           ((eof-object? line)
>> +            #t)
>> +           ((string-contains line (make-string 1 #\nul))
>> +            ;; The regexp functions of the GNU C library (which Guile u=
ses)
>> +            ;; cannot deal with NUL characters, so skip to the next lin=
e.
>> +            (format #t "skipping line with NUL characters: ~s\n" line)
>> +            (loop (read-line in 'concat)))
>
> This code will unconditionally *delete* all lines that contain NULs.
>
> This will happen because the lines with NULs are not being written to
> the output file, which will replace the original file when this loop
> reaches EOF.  So, any lines that are not copied to the output will be
> lost.
>
> To preserve the lines with NULs, you should call (display line out)
> before calling 'loop'.

Good observation! I agree that we should keep limit the effect of
ignoring NULs only to the substitution.

> Also, please use (string-index line #\nul) to check for NULs instead of
> 'string-contains'.  It should be more efficient.

OK!

>
>> +           (else
>> +            (let ((line (fold (lambda (r+p line)
>> +                                (match r+p
>> +                                  ((regexp . proc)
>> +                                   (match (list-matches regexp line)
>> +                                     ((and m+ (_ _ ...))
>> +                                      (proc line m+))
>> +                                     (_ line)))))
>> +                              line
>> +                              rx+proc)))
>> +              (display line out)
>> +              (loop (read-line in 'concat))))))))))
>
> With the changes suggested above, I would have no objection to pushing
> this to core-updates.  However, it occurs to me that we could handle the
> NUL case in a better way:
>
> Since the C regex functions that we use cannot handle NUL bytes, we
> could use a different code point to represent NUL during those
> operations.  We could choose a code point from one of the Unicode
> Private Use Areas <https://en.wikipedia.org/wiki/Private_Use_Areas> that
> does not occur in the string.
>
> Let NUL* be the code point which will represent NUL bytes.  First
> replace all NULs with NUL*s, then perform the substitutions, and finally
> replace all ALT*s with NULs before writing to the output.

Do I understand this transformation as NULs -> NUL*s and back from NUL*s
-> NULs correctly? I'm not sure how NUL*s became ALT*s in your explanation.

> As an important optimization, we should avoid performing these extra
> operations unless (string-index line #\nul) finds a NUL.

OK.

> We could then perform these extra substitutions with simple, inefficient
> code.  Maybe something like this (untested):
>
>     (with-atomic-file-replacement file
>       (lambda (in out)
>         (let loop ((line (read-line in 'concat)))
>           (if (eof-object? line)
>               #t
>               (let* ((nul* (or (and (string-index line #\nul)
>                                     (unused-private-use-code-point line))
>                                #\nul))
>                      (line* (replace-char #\nul nul* line))
>                      (line1* (fold (lambda (r+p line)
>                                      (match r+p
>                                        ((regexp . proc)
>                                         (match (list-matches regexp line)
>                                           ((and m+ (_ _ ...))
>                                            (proc line m+))
>                                           (_ line)))))
>                                    line*
>                                    rx+proc))
>                      (line1 (replace-char nul* #\nul line1*)))
>                 (display line1 out)
>                 (loop (read-line in 'concat)))))))))
>
>
> Where the following additional private procedures would be added to
> (guix build utils) above the definition for 'substitute':
>
> (define (unused-private-use-code-point s)
>   "Find a code point within a Unicode Private Use Area that is not
> present in S, and return the corresponding character object.  If one
> cannot be found, return false."
>   (define (scan lo hi)
>     (and (<=3D lo hi)
>          (let ((c (integer->char lo)))
>            (if (string-index s c)
>                (scan (+ lo 1) hi)
>                c))))
>   (or (scan   #xE000   #xF8FF)
>       (scan  #xF0000  #xFFFFD)
>       (scan #x100000 #x10FFFD)))
>
> (define (replace-char c1 c2 s)
>   "Return a string which is equal to S except with all instances of C1
> replaced by C2.  If C1 and C2 are equal, return S."
>   (if (char=3D? c1 c2)
>       s
>       (string-map (lambda (c)
>                     (if (char=3D? c c1)
>                         c2
>                         c))
>                   s)))
>
> What do you think?

It raises the complexity level a bit for something which doesn't seem to
be a very common scenario, but otherwise seems a very elegant
workaround. It seems to me that your implementation is already pretty
complete. I'll try write a test for validating it and report back.

Thank you for sharing your ideas!

Maxim