From: Maxim Cournoyer <maxim.cournoyer@gmail.com>
To: Mark H Weaver <mhw@netris.org>
Cc: 30116@debbugs.gnu.org
Subject: bug#30116: [PATCH] `substitute' crashes when file contains NUL characters (core-updates))
Date: Sat, 16 Jun 2018 12:47:01 -0400 [thread overview]
Message-ID: <87tvq2smpm.fsf@gmail.com> (raw)
In-Reply-To: <87r2l9q294.fsf@netris.org> (Mark H. Weaver's message of "Thu, 14 Jun 2018 03:02:47 -0400")
Hi Mark,
Mark H Weaver <mhw@netris.org> writes:
> Hi Maxim,
>
> Thanks for working on this. I found a problem with this patch,
> and I also have a suggestion. Please see below.
>
> Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:
>
>> From 9891e428eae0ed24e0d61862b3f5e298606b79eb Mon Sep 17 00:00:00 2001
>> From: Maxim Cournoyer <maxim.cournoyer@gmail.com>
>> Date: Sun, 14 Jan 2018 20:31:33 -0500
>> Subject: [PATCH] utils: Prevent substitute from crashing on files containing
>> NUL chars.
>>
>> Fixes issue #30116.
>>
>> * guix/build/utils.scm (substitute): Add condition to skip lines containing
>> the NUL character.
>> ---
>> guix/build/utils.scm | 44 ++++++++++++++++++++++++++------------------
>> 1 file changed, 26 insertions(+), 18 deletions(-)
>>
>> diff --git a/guix/build/utils.scm b/guix/build/utils.scm
>> index 7391307c8..975f4e70a 100644
>> --- a/guix/build/utils.scm
>> +++ b/guix/build/utils.scm
>> @@ -3,6 +3,7 @@
>> ;;; Copyright © 2013 Andreas Enge <andreas@enge.fr>
>> ;;; Copyright © 2013 Nikita Karetnikov <nikita@karetnikov.org>
>> ;;; Copyright © 2015 Mark H Weaver <mhw@netris.org>
>> +;;; Copyright © 2018 Maxim Cournoyer <maxim.cournoyer@gmail.com>
>> ;;;
>> ;;; This file is part of GNU Guix.
>> ;;;
>> @@ -621,28 +622,35 @@ PROC as (PROC LINE MATCHES); PROC must return the line that will be written as
>> a substitution of the original line. Be careful about using '$' to match the
>> end of a line; by itself it won't match the terminating newline of a line."
>> (let ((rx+proc (map (match-lambda
>> - (((? regexp? pattern) . proc)
>> - (cons pattern proc))
>> - ((pattern . proc)
>> - (cons (make-regexp pattern regexp/extended)
>> - proc)))
>> + (((? regexp? pattern) . proc)
>> + (cons pattern proc))
>> + ((pattern . proc)
>> + (cons (make-regexp pattern regexp/extended)
>> + proc)))
>> pattern+procs)))
>> (with-atomic-file-replacement file
>> (lambda (in out)
>> (let loop ((line (read-line in 'concat)))
>> - (if (eof-object? line)
>> - #t
>> - (let ((line (fold (lambda (r+p line)
>> - (match r+p
>> - ((regexp . proc)
>> - (match (list-matches regexp line)
>> - ((and m+ (_ _ ...))
>> - (proc line m+))
>> - (_ line)))))
>> - line
>> - rx+proc)))
>> - (display line out)
>> - (loop (read-line in 'concat)))))))))
>> + (cond
>> + ((eof-object? line)
>> + #t)
>> + ((string-contains line (make-string 1 #\nul))
>> + ;; The regexp functions of the GNU C library (which Guile uses)
>> + ;; cannot deal with NUL characters, so skip to the next line.
>> + (format #t "skipping line with NUL characters: ~s\n" line)
>> + (loop (read-line in 'concat)))
>
> This code will unconditionally *delete* all lines that contain NULs.
>
> This will happen because the lines with NULs are not being written to
> the output file, which will replace the original file when this loop
> reaches EOF. So, any lines that are not copied to the output will be
> lost.
>
> To preserve the lines with NULs, you should call (display line out)
> before calling 'loop'.
Good observation! I agree that we should keep limit the effect of
ignoring NULs only to the substitution.
> Also, please use (string-index line #\nul) to check for NULs instead of
> 'string-contains'. It should be more efficient.
OK!
>
>> + (else
>> + (let ((line (fold (lambda (r+p line)
>> + (match r+p
>> + ((regexp . proc)
>> + (match (list-matches regexp line)
>> + ((and m+ (_ _ ...))
>> + (proc line m+))
>> + (_ line)))))
>> + line
>> + rx+proc)))
>> + (display line out)
>> + (loop (read-line in 'concat))))))))))
>
> With the changes suggested above, I would have no objection to pushing
> this to core-updates. However, it occurs to me that we could handle the
> NUL case in a better way:
>
> Since the C regex functions that we use cannot handle NUL bytes, we
> could use a different code point to represent NUL during those
> operations. We could choose a code point from one of the Unicode
> Private Use Areas <https://en.wikipedia.org/wiki/Private_Use_Areas> that
> does not occur in the string.
>
> Let NUL* be the code point which will represent NUL bytes. First
> replace all NULs with NUL*s, then perform the substitutions, and finally
> replace all ALT*s with NULs before writing to the output.
Do I understand this transformation as NULs -> NUL*s and back from NUL*s
-> NULs correctly? I'm not sure how NUL*s became ALT*s in your explanation.
> As an important optimization, we should avoid performing these extra
> operations unless (string-index line #\nul) finds a NUL.
OK.
> We could then perform these extra substitutions with simple, inefficient
> code. Maybe something like this (untested):
>
> (with-atomic-file-replacement file
> (lambda (in out)
> (let loop ((line (read-line in 'concat)))
> (if (eof-object? line)
> #t
> (let* ((nul* (or (and (string-index line #\nul)
> (unused-private-use-code-point line))
> #\nul))
> (line* (replace-char #\nul nul* line))
> (line1* (fold (lambda (r+p line)
> (match r+p
> ((regexp . proc)
> (match (list-matches regexp line)
> ((and m+ (_ _ ...))
> (proc line m+))
> (_ line)))))
> line*
> rx+proc))
> (line1 (replace-char nul* #\nul line1*)))
> (display line1 out)
> (loop (read-line in 'concat)))))))))
>
>
> Where the following additional private procedures would be added to
> (guix build utils) above the definition for 'substitute':
>
> (define (unused-private-use-code-point s)
> "Find a code point within a Unicode Private Use Area that is not
> present in S, and return the corresponding character object. If one
> cannot be found, return false."
> (define (scan lo hi)
> (and (<= lo hi)
> (let ((c (integer->char lo)))
> (if (string-index s c)
> (scan (+ lo 1) hi)
> c))))
> (or (scan #xE000 #xF8FF)
> (scan #xF0000 #xFFFFD)
> (scan #x100000 #x10FFFD)))
>
> (define (replace-char c1 c2 s)
> "Return a string which is equal to S except with all instances of C1
> replaced by C2. If C1 and C2 are equal, return S."
> (if (char=? c1 c2)
> s
> (string-map (lambda (c)
> (if (char=? c c1)
> c2
> c))
> s)))
>
> What do you think?
It raises the complexity level a bit for something which doesn't seem to
be a very common scenario, but otherwise seems a very elegant
workaround. It seems to me that your implementation is already pretty
complete. I'll try write a test for validating it and report back.
Thank you for sharing your ideas!
Maxim
next prev parent reply other threads:[~2018-06-16 16:48 UTC|newest]
Thread overview: 19+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-01-15 1:27 bug#30116: [PATCH] `substitute' crashes when file contains NUL characters (core-updates) Maxim Cournoyer
[not found] ` <handler.30116.B.15159796942311.ack@debbugs.gnu.org>
2018-01-15 1:38 ` bug#30116: [PATCH] `substitute' crashes when file contains NUL characters (core-updates)) Maxim Cournoyer
2018-01-17 14:37 ` Ludovic Courtès
2018-06-14 1:40 ` Maxim Cournoyer
2018-06-14 8:01 ` Ludovic Courtès
2018-06-14 7:02 ` Mark H Weaver
2018-06-14 8:02 ` Ludovic Courtès
2018-06-16 16:47 ` Maxim Cournoyer [this message]
2018-06-17 4:36 ` Mark H Weaver
2021-01-08 19:14 ` bug#30116: [PATCH] `substitute' crashes when file contains NUL characters (core-updates) Maxim Cournoyer
2021-01-08 21:42 ` Mark H Weaver
2018-01-16 11:23 ` Ludovic Courtès
2018-01-21 4:24 ` Maxim Cournoyer
2018-01-21 18:17 ` Mark H Weaver
2018-01-22 10:58 ` Ludovic Courtès
2018-01-23 4:27 ` Maxim Cournoyer
2018-01-23 14:11 ` Ludovic Courtès
2018-01-25 5:11 ` Maxim Cournoyer
2018-01-25 11:11 ` Ludovic Courtès
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87tvq2smpm.fsf@gmail.com \
--to=maxim.cournoyer@gmail.com \
--cc=30116@debbugs.gnu.org \
--cc=mhw@netris.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this external index
https://git.savannah.gnu.org/cgit/guix.git
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.