unofficial mirror of bug-guix@gnu.org 
 help / color / mirror / code / Atom feed
From: Maxim Cournoyer <maxim.cournoyer@gmail.com>
To: Mark H Weaver <mhw@netris.org>
Cc: 30116@debbugs.gnu.org
Subject: bug#30116: [PATCH] `substitute' crashes when file contains NUL characters (core-updates))
Date: Sat, 16 Jun 2018 12:47:01 -0400	[thread overview]
Message-ID: <87tvq2smpm.fsf@gmail.com> (raw)
In-Reply-To: <87r2l9q294.fsf@netris.org> (Mark H. Weaver's message of "Thu, 14 Jun 2018 03:02:47 -0400")

Hi Mark,

Mark H Weaver <mhw@netris.org> writes:

> Hi Maxim,
>
> Thanks for working on this.  I found a problem with this patch,
> and I also have a suggestion.  Please see below.
>
> Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:
>
>> From 9891e428eae0ed24e0d61862b3f5e298606b79eb Mon Sep 17 00:00:00 2001
>> From: Maxim Cournoyer <maxim.cournoyer@gmail.com>
>> Date: Sun, 14 Jan 2018 20:31:33 -0500
>> Subject: [PATCH] utils: Prevent substitute from crashing on files containing
>>  NUL chars.
>>
>> Fixes issue #30116.
>>
>> * guix/build/utils.scm (substitute): Add condition to skip lines containing
>> the NUL character.
>> ---
>>  guix/build/utils.scm | 44 ++++++++++++++++++++++++++------------------
>>  1 file changed, 26 insertions(+), 18 deletions(-)
>>
>> diff --git a/guix/build/utils.scm b/guix/build/utils.scm
>> index 7391307c8..975f4e70a 100644
>> --- a/guix/build/utils.scm
>> +++ b/guix/build/utils.scm
>> @@ -3,6 +3,7 @@
>>  ;;; Copyright © 2013 Andreas Enge <andreas@enge.fr>
>>  ;;; Copyright © 2013 Nikita Karetnikov <nikita@karetnikov.org>
>>  ;;; Copyright © 2015 Mark H Weaver <mhw@netris.org>
>> +;;; Copyright © 2018 Maxim Cournoyer <maxim.cournoyer@gmail.com>
>>  ;;;
>>  ;;; This file is part of GNU Guix.
>>  ;;;
>> @@ -621,28 +622,35 @@ PROC as (PROC LINE MATCHES); PROC must return the line that will be written as
>>  a substitution of the original line.  Be careful about using '$' to match the
>>  end of a line; by itself it won't match the terminating newline of a line."
>>    (let ((rx+proc  (map (match-lambda
>> -                        (((? regexp? pattern) . proc)
>> -                         (cons pattern proc))
>> -                        ((pattern . proc)
>> -                         (cons (make-regexp pattern regexp/extended)
>> -                               proc)))
>> +                         (((? regexp? pattern) . proc)
>> +                          (cons pattern proc))
>> +                         ((pattern . proc)
>> +                          (cons (make-regexp pattern regexp/extended)
>> +                                proc)))
>>                         pattern+procs)))
>>      (with-atomic-file-replacement file
>>        (lambda (in out)
>>          (let loop ((line (read-line in 'concat)))
>> -          (if (eof-object? line)
>> -              #t
>> -              (let ((line (fold (lambda (r+p line)
>> -                                  (match r+p
>> -                                    ((regexp . proc)
>> -                                     (match (list-matches regexp line)
>> -                                       ((and m+ (_ _ ...))
>> -                                        (proc line m+))
>> -                                       (_ line)))))
>> -                                line
>> -                                rx+proc)))
>> -                (display line out)
>> -                (loop (read-line in 'concat)))))))))
>> +          (cond
>> +           ((eof-object? line)
>> +            #t)
>> +           ((string-contains line (make-string 1 #\nul))
>> +            ;; The regexp functions of the GNU C library (which Guile uses)
>> +            ;; cannot deal with NUL characters, so skip to the next line.
>> +            (format #t "skipping line with NUL characters: ~s\n" line)
>> +            (loop (read-line in 'concat)))
>
> This code will unconditionally *delete* all lines that contain NULs.
>
> This will happen because the lines with NULs are not being written to
> the output file, which will replace the original file when this loop
> reaches EOF.  So, any lines that are not copied to the output will be
> lost.
>
> To preserve the lines with NULs, you should call (display line out)
> before calling 'loop'.

Good observation! I agree that we should keep limit the effect of
ignoring NULs only to the substitution.

> Also, please use (string-index line #\nul) to check for NULs instead of
> 'string-contains'.  It should be more efficient.

OK!

>
>> +           (else
>> +            (let ((line (fold (lambda (r+p line)
>> +                                (match r+p
>> +                                  ((regexp . proc)
>> +                                   (match (list-matches regexp line)
>> +                                     ((and m+ (_ _ ...))
>> +                                      (proc line m+))
>> +                                     (_ line)))))
>> +                              line
>> +                              rx+proc)))
>> +              (display line out)
>> +              (loop (read-line in 'concat))))))))))
>
> With the changes suggested above, I would have no objection to pushing
> this to core-updates.  However, it occurs to me that we could handle the
> NUL case in a better way:
>
> Since the C regex functions that we use cannot handle NUL bytes, we
> could use a different code point to represent NUL during those
> operations.  We could choose a code point from one of the Unicode
> Private Use Areas <https://en.wikipedia.org/wiki/Private_Use_Areas> that
> does not occur in the string.
>
> Let NUL* be the code point which will represent NUL bytes.  First
> replace all NULs with NUL*s, then perform the substitutions, and finally
> replace all ALT*s with NULs before writing to the output.

Do I understand this transformation as NULs -> NUL*s and back from NUL*s
-> NULs correctly? I'm not sure how NUL*s became ALT*s in your explanation.

> As an important optimization, we should avoid performing these extra
> operations unless (string-index line #\nul) finds a NUL.

OK.

> We could then perform these extra substitutions with simple, inefficient
> code.  Maybe something like this (untested):
>
>     (with-atomic-file-replacement file
>       (lambda (in out)
>         (let loop ((line (read-line in 'concat)))
>           (if (eof-object? line)
>               #t
>               (let* ((nul* (or (and (string-index line #\nul)
>                                     (unused-private-use-code-point line))
>                                #\nul))
>                      (line* (replace-char #\nul nul* line))
>                      (line1* (fold (lambda (r+p line)
>                                      (match r+p
>                                        ((regexp . proc)
>                                         (match (list-matches regexp line)
>                                           ((and m+ (_ _ ...))
>                                            (proc line m+))
>                                           (_ line)))))
>                                    line*
>                                    rx+proc))
>                      (line1 (replace-char nul* #\nul line1*)))
>                 (display line1 out)
>                 (loop (read-line in 'concat)))))))))
>
>
> Where the following additional private procedures would be added to
> (guix build utils) above the definition for 'substitute':
>
> (define (unused-private-use-code-point s)
>   "Find a code point within a Unicode Private Use Area that is not
> present in S, and return the corresponding character object.  If one
> cannot be found, return false."
>   (define (scan lo hi)
>     (and (<= lo hi)
>          (let ((c (integer->char lo)))
>            (if (string-index s c)
>                (scan (+ lo 1) hi)
>                c))))
>   (or (scan   #xE000   #xF8FF)
>       (scan  #xF0000  #xFFFFD)
>       (scan #x100000 #x10FFFD)))
>
> (define (replace-char c1 c2 s)
>   "Return a string which is equal to S except with all instances of C1
> replaced by C2.  If C1 and C2 are equal, return S."
>   (if (char=? c1 c2)
>       s
>       (string-map (lambda (c)
>                     (if (char=? c c1)
>                         c2
>                         c))
>                   s)))
>
> What do you think?

It raises the complexity level a bit for something which doesn't seem to
be a very common scenario, but otherwise seems a very elegant
workaround. It seems to me that your implementation is already pretty
complete. I'll try write a test for validating it and report back.

Thank you for sharing your ideas!

Maxim

  parent reply	other threads:[~2018-06-16 16:48 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-01-15  1:27 bug#30116: [PATCH] `substitute' crashes when file contains NUL characters (core-updates) Maxim Cournoyer
     [not found] ` <handler.30116.B.15159796942311.ack@debbugs.gnu.org>
2018-01-15  1:38   ` bug#30116: [PATCH] `substitute' crashes when file contains NUL characters (core-updates)) Maxim Cournoyer
2018-01-17 14:37     ` Ludovic Courtès
2018-06-14  1:40       ` Maxim Cournoyer
2018-06-14  8:01         ` Ludovic Courtès
2018-06-14  7:02     ` Mark H Weaver
2018-06-14  8:02       ` Ludovic Courtès
2018-06-16 16:47       ` Maxim Cournoyer [this message]
2018-06-17  4:36         ` Mark H Weaver
2021-01-08 19:14           ` bug#30116: [PATCH] `substitute' crashes when file contains NUL characters (core-updates) Maxim Cournoyer
2021-01-08 21:42             ` Mark H Weaver
2018-01-16 11:23 ` Ludovic Courtès
2018-01-21  4:24   ` Maxim Cournoyer
2018-01-21 18:17     ` Mark H Weaver
2018-01-22 10:58       ` Ludovic Courtès
2018-01-23  4:27         ` Maxim Cournoyer
2018-01-23 14:11           ` Ludovic Courtès
2018-01-25  5:11             ` Maxim Cournoyer
2018-01-25 11:11               ` Ludovic Courtès

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://guix.gnu.org/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87tvq2smpm.fsf@gmail.com \
    --to=maxim.cournoyer@gmail.com \
    --cc=30116@debbugs.gnu.org \
    --cc=mhw@netris.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/guix.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).