From mboxrd@z Thu Jan  1 00:00:00 1970
From: Mark H Weaver <mhw@netris.org>
Subject: bug#35350: Some compile output still leaks through with --verbosity=1
Date: Fri, 26 Apr 2019 15:09:24 -0400
Message-ID: <87k1fgh9c0.fsf@netris.org>
References: <87mukkfd2j.fsf@netris.org> <87r29v2jz2.fsf@gnu.org>
	<87ftq9silk.fsf@netris.org> <87imv5jai5.fsf@gnu.org>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="=-=-="
Return-path: <bug-guix-bounces+gcggb-bug-guix=m.gmane.org@gnu.org>
Received: from eggs.gnu.org ([209.51.188.92]:39428)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <Debian-debbugs@debbugs.gnu.org>) id 1hK6GT-00080l-Fe
	for bug-guix@gnu.org; Fri, 26 Apr 2019 15:12:07 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <Debian-debbugs@debbugs.gnu.org>) id 1hK6GQ-0000sz-Ri
	for bug-guix@gnu.org; Fri, 26 Apr 2019 15:12:05 -0400
Received: from debbugs.gnu.org ([209.51.188.43]:48275)
	by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16)
	(Exim 4.71) (envelope-from <Debian-debbugs@debbugs.gnu.org>)
	id 1hK6GQ-0000rf-GZ
	for bug-guix@gnu.org; Fri, 26 Apr 2019 15:12:02 -0400
Received: from Debian-debbugs by debbugs.gnu.org with local (Exim 4.84_2)
	(envelope-from <Debian-debbugs@debbugs.gnu.org>) id 1hK6GQ-0006YX-7d
	for bug-guix@gnu.org; Fri, 26 Apr 2019 15:12:02 -0400
Sender: "Debbugs-submit" <debbugs-submit-bounces@debbugs.gnu.org>
Resent-Message-ID: <handler.35350.B35350.155630587425146@debbugs.gnu.org>
In-Reply-To: <87imv5jai5.fsf@gnu.org> ("Ludovic
	\=\?utf-8\?Q\?Court\=C3\=A8s\=22'\?\=
	\=\?utf-8\?Q\?s\?\= message of "Tue, 23 Apr 2019 12:12:34 +0200")
List-Id: Bug reports for GNU Guix <bug-guix.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/bug-guix>,
	<mailto:bug-guix-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/bug-guix/>
List-Post: <mailto:bug-guix@gnu.org>
List-Help: <mailto:bug-guix-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/bug-guix>,
	<mailto:bug-guix-request@gnu.org?subject=subscribe>
Errors-To: bug-guix-bounces+gcggb-bug-guix=m.gmane.org@gnu.org
Sender: "bug-Guix" <bug-guix-bounces+gcggb-bug-guix=m.gmane.org@gnu.org>
To: Ludovic =?UTF-8?Q?Court=C3=A8s?= <ludo@gnu.org>
Cc: 35350@debbugs.gnu.org

--=-=-=
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable

Hi Ludovic,

Thanks for investigating this.

Ludovic Court=C3=A8s <ludo@gnu.org> writes:

> The third read(2) call here ends on a partial UTF-8 sequence for LEFT
> SINGLE QUOTATION MARK (we get the first two bytes of a three byte
> sequence.)
>
> What happens is that =E2=80=98process-stderr=E2=80=99 in (guix store) get=
s that byte
> string from the daemon, passes it through =E2=80=98read-maybe-utf8-string=
=E2=80=99,
> which replaces the last two bytes with REPLACEMENT CHARACTER, which is
> itself a 3-byte sequence.

It seems to me that what's needed here is to save the UTF-8 decoder
state between calls to 'process-stderr'.  Coincidentally, I also needed
something like this a week ago, when I tried implementing R6RS custom
textual input/output ports on top of R6RS custom binary input/output
ports.

To meet these needs, I've implemented a fairly efficient, purely
functional UTF-8 decoder in Scheme that accepts a decoder state and an
arbitrary range from a bytevector, and returns a new decoder state.
There's a macro that allows arbitrary actions to be performed when a
code point (or maximal subpart in the case of errors) is found.

This macro is then used to implement a decoder (utf8->string!) that
writes into an arbitrary range of an existing string.  Of course, it's
not purely functional, but it avoids heap allocation when compiled with
Guile.  On my Thinkpad X200, it can process around 10 megabytes per
second.

The state is represented as an exact integer between 0 and #xF48FBF
inclusive, which are simply the bytes that have been seen so far in the
current code sequence, in big-endian order, or 0 for the start state.
For example, #xF48FBF represents the state where the bytes (F4 8F BF)
have been read.  The state is always either 0 or a proper prefix of a
valid UTF-8 byte sequence.

I also plan to implement an optimized C version of 'utf8->string!' and
add it to Guile, in order to implement fast custom textual ports.  The
precise name and API is not yet finalized.  At present, 'utf8->string!'
always replaces maximal subparts with the substitution character in case
of errors, but I intend to eventually support other error modes as well.

What would you think about using this code to replace the uses of
'read-maybe-utf8-string', and storing the UTF-8 decoder state in the
<store-connection> object?  Would we need to store multiple states in
case of (max-jobs > 1)?

      Regards
        Mark


--=-=-=
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline; filename=utf8-decoder.scm
Content-Transfer-Encoding: quoted-printable
Content-Description: UTF-8 decoder implementation

;;; Copyright =C2=A9 2019 Mark H Weaver <mhw@netris.org>
;;;
;;; This program is free software: you can redistribute it and/or modify
;;; it under the terms of the GNU General Public License as published by
;;; the Free Software Foundation, either version 3 of the License, or
;;; (at your option) any later version.
;;;
;;; This program is distributed in the hope that it will be useful,
;;; but WITHOUT ANY WARRANTY; without even the implied warranty of
;;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
;;; GNU General Public License for more details.
;;;
;;; You should have received a copy of the GNU General Public License
;;; along with this program.  If not, see <http://www.gnu.org/licenses/>.

(use-modules (rnrs bytevectors)
             ;; the following modules are only needed for the test.
             ;;(srfi srfi-1)
             ;;(ice-9 iconv)
             )

;; Well-formed UTF-8 sequences
;; =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D
;; 00..7F
;; C2..DF   80..BF
;; E0      *A0..BF   80..BF
;; E1..EC   80..BF   80..BF
;; ED       80..9F*  80..BF
;; EE..EF   80..BF   80..BF
;; F0      *90..BF   80..BF   80..BF
;; F1..F3   80..BF   80..BF   80..BF
;; F4       80..8F*  80..BF   80..BF

;; UTF-8 Decoder states
;; =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
;;  0                 start state
;;  C2     .. DF      got 1/2 bytes
;;  E0     .. EF      got 1/3 bytes
;;  F0     .. F4      got 1/4 bytes
;;  E0A0   .. ED9F    got 2/3 bytes (range 1)
;;  EE80   .. EFBF    got 2/3 bytes (range 2)
;;  F090   .. F48F    got 2/4 bytes
;;  F09080 .. F48FBF  got 3/4 bytes

(define-syntax-rule (utf8-decode ((j init-expr) ...)
                                 (i continue)
                                 (output (code-point) e1 e1* ...)
                                 (error (maximal-subpart) e2 e2* ...)
                                 state-expr bv-expr start-expr end-expr)
  ;; The start state is 0.

  ;; When 'error' is called with arguments (state i j ...), 'state'
  ;; contains the bytes of the "maximal subpart of an ill-formed
  ;; subsequence" as defined in The Unicode Standard section 3.9,
  ;; i.e. the bytes which are being represented by this error report
  ;; and which are not being converted.  'i' is the bytevector index
  ;; immediately following this maximal subpart, i.e. the index where
  ;; decoding should resume.  'j ...' are the user-provided seeds.

  ;; the decoder returns the values: (new-state bv-pos j ...)
  (let ((bv   bv-expr)
        (end  end-expr))
    (define (output code-point i j ...)
      e1 e1* ...)

    (define (error maximal-subpart i j ...)
      e2 e2* ...)
=20=20=20=20
    (define (continue i j ...)
      (if (< i end)
          (let ((byte (bytevector-u8-ref bv i)))
            (cond ((<=3D byte #x7F)       (output byte (+ i 1) j ...))
                  ((<=3D #xC2 byte #xF4)  (got-1 byte (+ i 1) j ...))
                  (else                 (error byte (+ i 1) j ...))))
          (values 0 i j ...)))

    (define (got-1 state i j ...)
      (if (< i end)
          (let ((byte (bytevector-u8-ref bv i)))
            (cond ((not (<=3D #x80 byte #xBF))
                   (error state i j ...))
                  ((<=3D state #xDF)
                   (output (logior (ash (logand state #x1F) 6)
                                   (logand byte #x3F))
                           (+ i 1) j ...))
                  (else
                   (let ((state^ (logior (ash state 8) byte)))
                     (cond ((or (<=3D #xE0A0 state^ #xED9F)
                                (<=3D #xEE80 state^ #xEFBF))
                            (got-2/3 state^ (+ i 1) j ...))
                           ((<=3D #xF090 state^ #xF48F)
                            (got-2/4 state^ (+ i 1) j ...))
                           (else
                            (error state i j ...)))))))
          (values state i j ...)))

    (define (got-2/3 state i j ...)
      (if (< i end)
          (let ((byte (bytevector-u8-ref bv i)))
            (if (<=3D #x80 byte #xBF)
                (output (logior (ash (logand state #xF00) 4)
                                (ash (logand state #x3F) 6)
                                (logand byte #x3F))
                        (+ i 1) j ...)
                (error state i j ...)))
          (values state i j ...)))

    (define (got-2/4 state i j ...)
      (if (< i end)
          (let ((byte (bytevector-u8-ref bv i)))
            (if (<=3D #x80 byte #xBF)
                (got-3/4 (logior (ash state 8) byte) (+ i 1) j ...)
                (error state i j ...)))
          (values state i j ...)))

    (define (got-3/4 state i j ...)
      (if (< i end)
          (let ((byte (bytevector-u8-ref bv i)))
            (if (<=3D #x80 byte #xBF)
                (output (logior (ash (logand state #x70000) 2)
                                (ash (logand state #x3F00) 4)
                                (ash (logand state #x3F) 6)
                                (logand byte #x3F))
                        (+ i 1) j ...)
                (error state i j ...)))
          (values state i j ...)))

    (define (enter state i j ...)
      (cond ((zero? state)      (continue i j ...))
            ((<=3D state #xF4)    (got-1 state i j ...))
            ((<=3D state #xEFBF)  (got-2/3 state i j ...))
            ((<=3D state #xF48F)  (got-2/4 state i j ...))
            (else               (got-3/4 state i j ...))))

    (enter state-expr start-expr init-expr ...)))

;; Returns three values: (new-state source-pos target-pos)
(define (utf8->string! state source source-start source-end
                       target target-start target-end)
  (utf8-decode ((j target-start))
               (i continue)
               (output (code-point)
                       (string-set! target j (integer->char code-point))
                       (if (< (+ j 1) target-end)
                           (continue i (+ j 1))
                           (values 0 i (+ j 1))))
               (error (maximal-subpart)
                      (output #xFFFD i j)) ;TODO: support other error handl=
ers
               state source source-start source-end))

;; Another experimental primitive, slower than the ones above.
(define* (utf8-fold* out err seed state bv
                     #:optional (start 0) (end (bytevector-length bv)))
  (utf8-decode ((j seed))
               (i continue)
               (output (code-point)
                       (out code-point i j continue))
               (error (maximal-subpart)
                      (err maximal-subpart i j continue))
               state bv start end))

;; Another experimental primitive, slower than the ones above.
(define* (utf8-fold out err seed state bv
                    #:optional (start 0) (end (bytevector-length bv)))
  (utf8-fold* (lambda (code-point i j continue)
                (continue i (out code-point i j)))
              (lambda (maximal-subpart i j continue)
                (call-with-values (lambda () (err maximal-subpart i j))
                  (lambda (continue? j^)
                    (if continue?
                        (continue i j^)
                        (values 0 i j^)))))
              seed state bv start end))

;; A not-so-quick test of all valid characters.
;; TODO: Tests of strictness and error handling.
#;
(let ()
  (define ss (string-tabulate (lambda (i)
                                (if (< i #xD800)
                                    (integer->char i)
                                    (integer->char (+ i #x800))))
                              (- #x110000 #x800)))
  (define bv (string->utf8 ss))
  (define bv-len (bytevector-length bv))
  (define slen (* 2 (string-length ss)))
  (define s (make-string slen))
  (every (lambda (incr)
           (string-fill! s #\a)
           (call-with-values
               (lambda ()
                 (let loop ((state 0) (i 0) (j 0))
                   (if (< i bv-len)
                       (call-with-values
                           (lambda ()
                             (utf8->string! state bv i (min bv-len
                                                            (+ i incr))
                                            s j slen))
                         loop)
                       (values state i j))))
             (lambda (state i j)
               (and (=3D i bv-len)
                    (=3D j (string-length ss))
                    (string=3D? ss (substring s 0 j))))))
         (iota 5 1)))

--=-=-=--