From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mp0 ([2001:41d0:2:4a6f::]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) by ms0.migadu.com with LMTPS id 2BPILWZEbGDNswAAgWs5BA (envelope-from ) for ; Tue, 06 Apr 2021 13:22:14 +0200 Received: from aspmx1.migadu.com ([2001:41d0:2:4a6f::]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)) by mp0 with LMTPS id kG+1J2ZEbGANFwAA1q6Kng (envelope-from ) for ; Tue, 06 Apr 2021 11:22:14 +0000 Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by aspmx1.migadu.com (Postfix) with ESMTPS id 851F71A5E2 for ; Tue, 6 Apr 2021 13:22:13 +0200 (CEST) Received: from localhost ([::1]:55192 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1lTjmd-0005X3-EW for larch@yhetil.org; Tue, 06 Apr 2021 07:22:11 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:50566) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1lTjmU-0005Wl-Ui for bug-guix@gnu.org; Tue, 06 Apr 2021 07:22:02 -0400 Received: from debbugs.gnu.org ([209.51.188.43]:56072) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.90_1) (envelope-from ) id 1lTjmU-0001f9-B5 for bug-guix@gnu.org; Tue, 06 Apr 2021 07:22:02 -0400 Received: from Debian-debbugs by debbugs.gnu.org with local (Exim 4.84_2) (envelope-from ) id 1lTjmU-0008JX-5l for bug-guix@gnu.org; Tue, 06 Apr 2021 07:22:02 -0400 X-Loop: help-debbugs@gnu.org Subject: bug#33848: Store references in SBCL-compiled code are "invisible" Resent-From: Mark H Weaver Original-Sender: "Debbugs-submit" Resent-CC: bug-guix@gnu.org Resent-Date: Tue, 06 Apr 2021 11:22:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 33848 X-GNU-PR-Package: guix X-GNU-PR-Keywords: To: Pierre Neidhardt , Guillaume Le Vaillant Received: via spool by 33848-submit@debbugs.gnu.org id=B33848.161770810831936 (code B ref 33848); Tue, 06 Apr 2021 11:22:02 +0000 Received: (at 33848) by debbugs.gnu.org; 6 Apr 2021 11:21:48 +0000 Received: from localhost ([127.0.0.1]:39385 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1lTjmF-0008J2-68 for submit@debbugs.gnu.org; Tue, 06 Apr 2021 07:21:48 -0400 Received: from world.peace.net ([64.112.178.59]:45226) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1lTjmD-0008Ip-Fd for 33848@debbugs.gnu.org; Tue, 06 Apr 2021 07:21:46 -0400 Received: from mhw by world.peace.net with esmtpsa (TLS1.3:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.92) (envelope-from ) id 1lTjm5-0006b5-Ur; Tue, 06 Apr 2021 07:21:38 -0400 From: Mark H Weaver In-Reply-To: <87r1js9udv.fsf@netris.org> References: <87r2e8jpfx.fsf@gnu.org> <87tvizvzgk.fsf@netris.org> <87o9979gfn.fsf@gnu.org> <87tvizgghs.fsf@ambrevar.xyz> <87k1juaomo.fsf@gnu.org> <87muoqhk62.fsf@ambrevar.xyz> <87zhsq8wkj.fsf@gnu.org> <87d0pmhbgn.fsf@ambrevar.xyz> <87r2e28tkv.fsf@gnu.org> <874laygkiy.fsf@ambrevar.xyz> <87lfa5eymf.fsf@ambrevar.xyz> <87tuoscsk9.fsf@gnu.org> <87im57b8u7.fsf@ambrevar.xyz> <87czvebky2.fsf@netris.org> <87eefu30a4.fsf@gnu.org> <87im56l6es.fsf@yamatai> <87wntm8j18.fsf@ambrevar.xyz> <87a6qil4b1.fsf@yamatai> <87a6qiz5b3.fsf@ambrevar.xyz> <871rbtc3j5.fsf@netris.org> <87r1js9udv.fsf@netris.org> Date: Tue, 06 Apr 2021 07:19:51 -0400 Message-ID: <87h7kj7j7x.fsf@netris.org> MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="=-=-=" X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list X-BeenThere: bug-guix@gnu.org List-Id: Bug reports for GNU Guix List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: 33848@debbugs.gnu.org Errors-To: bug-guix-bounces+larch=yhetil.org@gnu.org Sender: "bug-Guix" X-Migadu-Flow: FLOW_IN ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=yhetil.org; s=key1; t=1617708134; h=from:from:sender:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type:resent-cc:resent-from:resent-sender: resent-message-id:in-reply-to:in-reply-to:references:references: list-id:list-help:list-unsubscribe:list-subscribe:list-post; bh=2VlhjTkhB9jN1OVxZVy7+9QIXhUwWw511huIdkkIFno=; b=NNpMtx6xE3ylrQ7SZtPU6bIfqqzT8z3OSKv39cGn6YTuhb8Gnh0Plv3GVLQ48uVaRAnmLy 6+pzWG0o1lnS+5aPpMOzAy7tn+AoIrUuFjvzKpTyB0tB4g/80p9CAEajFbw+EVE8JBA8vm LuMCCkod8u3Qv+1n93MUvAdWWhR5V6V+JTdR3sBl+2skthI5yLyl9/gXzSA8maEcLkrHJX OzO+Pqga2gSyTEMXoO3JXHsdP3xZuUAqcPP5yxi8OgUcnifWNZb1Nxz2GeWIJhx0+xAGsS ORs3aPxDSBrC2gV/ZQpOa6aEJDE2pQ3f0ZJ+z//lAUxfrWSOvULDVtY76xcTFQ== ARC-Seal: i=1; s=key1; d=yhetil.org; t=1617708134; a=rsa-sha256; cv=none; b=ASkGcy7UdLUx0yGq0nL9qiKbpLiNyEh70mKD01yQBvC5mhedZPTPmPMBeUrB+bk6GPO+rz SpAz27kdU3BMaa6DL7dxNuTANvksJff9pmrZmiswo3TBK83o9zUXuhRcs32o2ee0j91Kbo 6grd5S7zXvevK6rgn+4qf1yBiZb23K6zILhLABGysVR22XJiD7RGZ+wk/zMXBQxXdtqoOp XxHZVbq6vRSjaxR0iUy/OrK2xb+jdq5XnnYscJ8wyBAV/PLMYcofIX2UIk+SA1syvi+NqM Dx0iFwHaQGkBUlXCPe28fUNFNlBNeC/fIpA0eIfkL0yvlY5SwU05CRUMcXoKAg== ARC-Authentication-Results: i=1; aspmx1.migadu.com; dkim=none; dmarc=none; spf=pass (aspmx1.migadu.com: domain of bug-guix-bounces@gnu.org designates 209.51.188.17 as permitted sender) smtp.mailfrom=bug-guix-bounces@gnu.org X-Migadu-Spam-Score: -0.94 Authentication-Results: aspmx1.migadu.com; dkim=none; dmarc=none; spf=pass (aspmx1.migadu.com: domain of bug-guix-bounces@gnu.org designates 209.51.188.17 as permitted sender) smtp.mailfrom=bug-guix-bounces@gnu.org X-Migadu-Queue-Id: 851F71A5E2 X-Spam-Score: -0.94 X-Migadu-Scanner: scn0.migadu.com X-TUID: YEvNk2ub10gL --=-=-= Content-Type: text/plain Here's a revised draft of the patch, which updates the comments and refactors the code a bit to (hopefully) make it a bit more readable. Mark --=-=-= Content-Type: text/x-patch; charset=utf-8 Content-Disposition: inline; filename=0001-DRAFT-grafts-Support-rewriting-UTF-16-and-UTF-32-sto.patch Content-Transfer-Encoding: quoted-printable Content-Description: [PATCH] DRAFT: grafts: Support rewriting UTF-16 and UTF-32 store references >From 6eec36e66d20d82fe02c6de793422875477b90d8 Mon Sep 17 00:00:00 2001 From: Mark H Weaver Date: Fri, 2 Apr 2021 18:36:23 -0400 Subject: [PATCH] DRAFT: grafts: Support rewriting UTF-16 and UTF-32 store references. * guix/build/graft.scm (replace-store-references): Add support for finding and rewriting UTF-16 and UTF-32 store references. * tests/grafts.scm: Add tests. --- guix/build/graft.scm | 247 +++++++++++++++++++++++++++---------------- tests/grafts.scm | 68 ++++++++++++ 2 files changed, 224 insertions(+), 91 deletions(-) diff --git a/guix/build/graft.scm b/guix/build/graft.scm index c119ee71d1..23fca8f29c 100644 --- a/guix/build/graft.scm +++ b/guix/build/graft.scm @@ -1,6 +1,6 @@ ;;; GNU Guix --- Functional package management for GNU ;;; Copyright =C2=A9 2014, 2015, 2016, 2018 Ludovic Court=C3=A8s -;;; Copyright =C2=A9 2016 Mark H Weaver +;;; Copyright =C2=A9 2016, 2021 Mark H Weaver ;;; ;;; This file is part of GNU Guix. ;;; @@ -55,6 +55,40 @@ (string->char-set "0123456789abcdfghijklmnpqrsvwxyz") <>)) =20 +(define (nix-base32-char-or-nul? byte) + (or (nix-base32-char? byte) + (char=3D? byte #\nul))) + +(define (possible-utf16-hash? buffer i w) + (and (<=3D (* 2 hash-length) (- i w)) + (let loop ((j (+ 1 (- i (* 2 hash-length))))) + (or (>=3D j i) + (and (zero? (bytevector-u8-ref buffer j)) + (loop (+ j 2))))))) + +(define (possible-utf32-hash? buffer i w) + (and (<=3D (* 4 hash-length) (- i w)) + (let loop ((j (+ 1 (- i (* 4 hash-length))))) + (or (>=3D j i) + (and (zero? (bytevector-u8-ref buffer j)) + (zero? (bytevector-u8-ref buffer (+ j 1))) + (zero? (bytevector-u8-ref buffer (+ j 2))) + (loop (+ j 4))))))) + +(define (insert-nuls char-size bv) + (if (or (not bv) (=3D char-size 1)) + bv + (let* ((len (bytevector-length bv)) + (bv* (make-bytevector (+ 1 (* char-size + (- len 1))) + 0))) + (let loop ((i 0)) + (when (< i len) + (bytevector-u8-set! bv* (* i char-size) + (bytevector-u8-ref bv i)) + (loop (+ i 1)))) + bv*))) + (define* (replace-store-references input output replacement-table #:optional (store (%store-directory))) "Read data from INPUT, replacing store references according to @@ -76,9 +110,9 @@ bytevectors to the same value." (list->vector (map pred (iota 256))) <>)) =20 - (define nix-base32-byte? + (define nix-base32-byte-or-nul? (optimize-u8-predicate - (compose nix-base32-char? + (compose nix-base32-char-or-nul? integer->char))) =20 (define (dash? byte) (=3D byte 45)) @@ -86,100 +120,131 @@ bytevectors to the same value." (define request-size (expt 2 20)) ; 1 MiB =20 ;; We scan the file for the following 33-byte pattern: 32 bytes of - ;; nix-base32 characters followed by a dash. To accommodate large files, - ;; we do not read the entire file, but instead work on buffers of up to - ;; 'request-size' bytes. To ensure that every 33-byte sequence appears - ;; entirely within exactly one buffer, adjacent buffers must overlap, - ;; i.e. they must share 32 byte positions. We accomplish this by - ;; "ungetting" the last 32 bytes of each buffer before reading the next - ;; buffer, unless we know that we've reached the end-of-file. + ;; nix-base32 characters followed by a dash. When we find such a pattern + ;; whose hash is in REPLACEMENT-TABLE, we perform the required rewrite a= nd + ;; continue scanning. + ;; + ;; To support UTF-16 and UTF-32 store references, the 33 bytes comprising + ;; this hash+dash pattern may optionally be interspersed by extra NUL by= tes. + ;; This simple approach works because the characters we are looking for = are + ;; restricted to ASCII. UTF-16 hashes are interspersed with single NUL + ;; bytes ("\0"), and UTF-32 hashes are interspersed with triplets of NULs + ;; ("\0\0\0"). Note that we require NULs to be present only *between* t= he + ;; other bytes, and not at either end, in order to be insensitive to byte + ;; order. + ;; + ;; To accommodate large files, we do not read the entire file at once, b= ut + ;; instead work on buffers of up to 'request-size' bytes. To ensure that + ;; every hash+dash pattern appears in its entirety in at least one buffe= r, + ;; adjacent buffers must overlap by one byte less than the maximum size = of a + ;; hash+dash pattern. We accomplish this by "ungetting" a suffix of each + ;; buffer before reading the next buffer, unless we know that we've reac= hed + ;; the end-of-file. (let ((buffer (make-bytevector request-size))) - (let loop () - ;; Note: We avoid 'get-bytevector-n' to work around - ;; . + (define-syntax-rule (byte-at i) + (bytevector-u8-ref buffer i)) + (let outer-loop () (match (get-bytevector-n! input buffer 0 request-size) ((? eof-object?) 'done) (end - ;; We scan the buffer for dashes that might be preceded by a - ;; nix-base32 hash. The key optimization here is that whenever we - ;; find a NON-nix-base32 character at position 'i', we know that = it - ;; cannot be part of a hash, so the earliest position where the n= ext - ;; hash could start is i+1 with the following dash at position i+= 33. - ;; - ;; Since nix-base32 characters comprise only 1/8 of the 256 possi= ble - ;; byte values, and exclude some of the most common letters in - ;; English text (e t o u), in practice we can advance by 33 posit= ions - ;; most of the time. - (let scan-from ((i hash-length) (written 0)) - ;; 'i' is the first position where we look for a dash. 'writte= n' - ;; is the number of bytes in the buffer that have already been - ;; written. + (define (scan-from i w) + ;; Scan the buffer for dashes that might be preceded by nix has= hes, + ;; where 'i' is the minimum position where such a dash might be + ;; found, and 'w' is the number of bytes in the buffer that have + ;; been written so far. + ;; + ;; The key optimization here is that whenever we find a byte at + ;; position 'i' that cannot occur within a nix hash (because it= 's + ;; neither a nix-base32 character nor NUL), we can infer that t= he + ;; earliest position where the next hash could start is at i+1, + ;; and therefore the earliest position for the following dash is + ;; (+ i 1 hash-length), which is i+33. + ;; + ;; Since nix-base32-or-nul characters comprise only about 1/8 of + ;; the 256 possible byte values, and exclude some of the most + ;; common letters in English text (e t o u), we can advance 33 + ;; positions most of the time. (if (< i end) - (let ((byte (bytevector-u8-ref buffer i))) - (cond ((and (dash? byte) - ;; We've found a dash. Note that we do not k= now - ;; whether the preceeding 32 bytes are nix-ba= se32 - ;; characters, but we do not need to know. If - ;; they are not, the following lookup will fa= il. - (lookup-replacement - (string-tabulate (lambda (j) - (integer->char - (bytevector-u8-ref buffer - (+ j (- i hash-length))= ))) - hash-length))) - =3D> (lambda (replacement) - ;; We've found a hash that needs to be replac= ed. - ;; First, write out all bytes preceding the h= ash - ;; that have not yet been written. - (put-bytevector output buffer written - (- i hash-length written)) - ;; Now write the replacement string. - (put-bytevector output replacement) - ;; Since the byte at position 'i' is a dash, - ;; which is not a nix-base32 char, the earlie= st - ;; position where the next hash might start is - ;; i+1, and the earliest position where the - ;; following dash might start is (+ i 1 - ;; hash-length). Also, increase the write - ;; position to account for REPLACEMENT. - (let ((len (bytevector-length replacement))) - (scan-from (+ i 1 len) - (+ i (- len hash-length)))))) - ;; If the byte at position 'i' is a nix-base32 char, - ;; then the dash we're looking for might be as earl= y as - ;; the following byte, so we can only advance by 1. - ((nix-base32-byte? byte) - (scan-from (+ i 1) written)) - ;; If the byte at position 'i' is NOT a nix-base32 - ;; char, then the earliest position where the next = hash - ;; might start is i+1, with the following dash at - ;; position (+ i 1 hash-length). + (let ((byte (byte-at i))) + (cond ((dash? byte) + (found-dash i w)) + ((nix-base32-byte-or-nul? byte) + (scan-from (+ i 1) w)) (else - (scan-from (+ i 1 hash-length) written)))) - - ;; We have finished scanning the buffer. Now we determine = how - ;; many bytes have not yet been written, and how many bytes= to - ;; "unget". If 'end' is less than 'request-size' then we r= ead - ;; less than we asked for, which indicates that we are at E= OF, - ;; so we needn't unget anything. Otherwise, we unget up to - ;; 'hash-length' bytes (32 bytes). However, we must be car= eful - ;; not to unget bytes that have already been written, becau= se - ;; that would cause them to be written again from the next - ;; buffer. In practice, this case occurs when a replacemen= t is - ;; made near or beyond the end of the buffer. When REPLACE= MENT - ;; went beyond END, we consume the extra bytes from INPUT. - (begin - (if (> written end) - (get-bytevector-n! input buffer 0 (- written end)) - (let* ((unwritten (- end written)) - (unget-size (if (=3D end request-size) - (min hash-length unwritten) - 0)) - (write-size (- unwritten unget-size))) - (put-bytevector output buffer written write-size) - (unget-bytevector input buffer (+ written write-siz= e) - unget-size))) - (loop))))))))) + (not-part-of-hash i w)))) + (finish-buffer i w))) + + (define (not-part-of-hash i w) + ;; Position 'i' is known to not be within a nix hash. Therefor= e, + ;; the earliest position where the next hash might start is i+1, + ;; with the following dash at position (+ i 1 hash-length). + (scan-from (+ i 1 hash-length) w)) + + (define (found-dash i w) + (cond ((not (zero? (byte-at (- i 1)))) + (found-possible-hash 1 i w)) + ((not (zero? (byte-at (- i 2)))) + (if (possible-utf16-hash? buffer i w) + (found-possible-hash 2 i w) + (not-part-of-hash i w))) + (else + (if (possible-utf32-hash? buffer i w) + (found-possible-hash 4 i w) + (not-part-of-hash i w))))) + + (define (found-possible-hash char-size i w) + (let* ((hash (string-tabulate + (lambda (j) + (integer->char + (byte-at (- i (* char-size + (- hash-length j)))))) + hash-length)) + (replacement* (lookup-replacement hash)) + (replacement (and replacement* + (insert-nuls char-size replacement*)))) + (cond + ((not replacement) + (not-part-of-hash i w)) + (else + ;; We've found a hash that needs to be replaced. + ;; First, write out all bytes preceding the hash + ;; that have not yet been written. + (put-bytevector output buffer w + (- i (* char-size hash-length) w)) + ;; Now write the replacement string. + (put-bytevector output replacement) + ;; Now compute the new value of 'w' and + ;; the new value of 'i', and iterate. + (let ((w (+ (- i (* char-size hash-length)) + (bytevector-length replacement)))) + (scan-from (+ w hash-length) w)))))) + + (define (finish-buffer i w) + ;; We have finished scanning the buffer. Now we determine how + ;; many bytes have not yet been written, and how many bytes to + ;; "unget". If 'end' is less than 'request-size' then we read + ;; less than we asked for, which indicates that we are at EOF, + ;; so we needn't unget anything. Otherwise, we unget up to + ;; (* 4 hash-length) bytes. However, we must be careful not to + ;; unget bytes that have already been written, because that + ;; would cause them to be written again from the next buffer. + ;; In practice, this case occurs when a replacement is made + ;; near or beyond the end of the buffer. When REPLACEMENT went + ;; beyond END, we consume the extra bytes from INPUT. + (if (> w end) + (get-bytevector-n! input buffer 0 (- w end)) + (let* ((unwritten (- end w)) + (unget-size (if (=3D end request-size) + (min (* 4 hash-length) + unwritten) + 0)) + (write-size (- unwritten unget-size))) + (put-bytevector output buffer w write-size) + (unget-bytevector input buffer (+ w write-size) + unget-size))) + (outer-loop)) + + (scan-from hash-length 0)))))) =20 (define (rename-matching-files directory mapping) "Apply MAPPING to the names of all the files in DIRECTORY, where MAPPING= is diff --git a/tests/grafts.scm b/tests/grafts.scm index a12c6a5911..0e1c7355b1 100644 --- a/tests/grafts.scm +++ b/tests/grafts.scm @@ -1,5 +1,6 @@ ;;; GNU Guix --- Functional package management for GNU ;;; Copyright =C2=A9 2014, 2015, 2016, 2017, 2018, 2019 Ludovic Court=C3= =A8s +;;; Copyright =C2=A9 2021 Mark H Weaver ;;; ;;; This file is part of GNU Guix. ;;; @@ -468,4 +469,71 @@ replacement "/gnu/store"))))) =20 +(define (nul-expand str char-size) + (string-join (map string (string->list str)) + (make-string (- char-size 1) #\nul))) + +(for-each + (lambda (char-size1) + (for-each + (lambda (char-size2) + (for-each + (lambda (gap) + (for-each + (lambda (offset) + (test-equal (format #f "replace-store-references, char-sizes ~a= ~a, gap ~s, offset ~a" + char-size1 char-size2 gap offset) + (string-append (make-string offset #\=3D) + (nul-expand (string-append "/gnu/store/" + (make-string 32 #\6) + "-BlahBlaH") + char-size1) + gap + (nul-expand (string-append "/gnu/store/" + (make-string 32 #\8) + "-SoMeTHiNG") + char-size2) + (list->string (map integer->char (iota 77 33))= )) + + ;; Create input data where the right-hand-size of the dash ("= -something" + ;; here) goes beyond the end of the internal buffer of + ;; 'replace-store-references'. + (let* ((content (string-append (make-string offset #\=3D) + (nul-expand (string-append= "/gnu/store/" + = (make-string 32 #\5) + = "-blahblah") + char-size1) + gap + (nul-expand (string-append= "/gnu/store/" + = (make-string 32 #\7) + = "-something") + char-size2) + (list->string + (map integer->char (iota = 77 33))))) + (replacement (alist->vhash + `((,(make-string 32 #\5) + . ,(string->utf8 (string-append + (make-string 32 #\6) + "-BlahBlaH"))) + (,(make-string 32 #\7) + . ,(string->utf8 (string-append + (make-string 32 #\8) + "-SoMeTHiNG"))))))) + (call-with-output-string + (lambda (output) + ((@@ (guix build graft) replace-store-references) + (open-input-string content) output + replacement + "/gnu/store")))))) + ;; offsets to test + (map (lambda (i) (- buffer-size (* 40 char-size1) i)) + (iota 30)))) + ;; gaps + '("" "-" " " "a"))) + ;; char-size2 values to test + '(1 2))) + ;; char-size1 values to test + '(1 2 4)) + + (test-end) --=20 2.31.1 --=-=-=--