all messages for Guix-related lists mirrored at yhetil.org
 help / color / mirror / code / Atom feed
From: Mark H Weaver <mhw@netris.org>
To: "Ludovic Courtès" <ludo@gnu.org>
Cc: 35350@debbugs.gnu.org
Subject: bug#35350: Some compile output still leaks through with --verbosity=1
Date: Fri, 26 Apr 2019 20:45:24 -0400	[thread overview]
Message-ID: <87bm0sgts0.fsf@netris.org> (raw)
In-Reply-To: <87k1fgh9c0.fsf@netris.org> (Mark H. Weaver's message of "Fri, 26 Apr 2019 15:09:24 -0400")

[-- Attachment #1: Type: text/plain, Size: 160 bytes --]

Here's an improved version of the code with doc strings.  It also
properly handles the case of (target-source >= target-end) in
'utf8->string!'.

       Mark



[-- Attachment #2: UTF-8 decoder, v2 --]
[-- Type: text/plain, Size: 13259 bytes --]

;;; Copyright © 2019 Mark H Weaver <mhw@netris.org>
;;;
;;; This program is free software: you can redistribute it and/or modify
;;; it under the terms of the GNU General Public License as published by
;;; the Free Software Foundation, either version 3 of the License, or
;;; (at your option) any later version.
;;;
;;; This program is distributed in the hope that it will be useful,
;;; but WITHOUT ANY WARRANTY; without even the implied warranty of
;;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
;;; GNU General Public License for more details.
;;;
;;; You should have received a copy of the GNU General Public License
;;; along with this program.  If not, see <http://www.gnu.org/licenses/>.

(use-modules (rnrs bytevectors)
             ;; the following modules are only needed for the test.
             ;;(srfi srfi-1)
             ;;(ice-9 iconv)
             )

;; Well-formed UTF-8 sequences
;; ===========================
;; 00..7F
;; C2..DF   80..BF
;; E0      *A0..BF   80..BF
;; E1..EC   80..BF   80..BF
;; ED       80..9F*  80..BF
;; EE..EF   80..BF   80..BF
;; F0      *90..BF   80..BF   80..BF
;; F1..F3   80..BF   80..BF   80..BF
;; F4       80..8F*  80..BF   80..BF

;; UTF-8 Decoder states
;; ====================
;;  0                 start state
;;  C2     .. DF      got 1/2 bytes
;;  E0     .. EF      got 1/3 bytes
;;  F0     .. F4      got 1/4 bytes
;;  E0A0   .. ED9F    got 2/3 bytes (range 1)
;;  EE80   .. EFBF    got 2/3 bytes (range 2)
;;  F090   .. F48F    got 2/4 bytes
;;  F09080 .. F48FBF  got 3/4 bytes

(define-syntax-rule (utf8-decode ((j init-expr) ...)
                                 (i continue)
                                 (output (code-point) e1 e1* ...)
                                 (error (maximal-subpart) e2 e2* ...)
                                 state-expr bv-expr start-expr end-expr)
  "Decode part of a UTF-8 byte stream in the bytevector BV-EXPR with
  indices in the interval from START-EXPR (inclusive) to END-EXPR
  (exclusive).  STATE-EXPR is the initial decoder state, which must
  be an incomplete prefix of a valid UTF-8 byte sequence.  The
  start state is 0.
  
  When a valid UTF-8 byte sequence is found, the output expressions
  (E1 E1* ...) are evaluated, with the following bindings
  available in the lexical environment:
  
    CODE-POINT  the decoded code point, as an exact integer.
    I        bytevector index immediately after the decoded sequence.
    J ...    the user-provided seeds.
    CONTINUE procedure with arguments (i j ...) to continue decoding.
    OUTPUT   procedure with arguments (code-point i j ...), whose
             body consists of (E1 E1* ...), provided by the user.
    ERROR    procedure with arguments (maximal-subpart i j ...), whose
             body consists of (E2 E2* ...), provided by the user.
  
  If you wish for decoding to continue, (E1 E1* ...) should end by
  tail-calling (CONTINUE I J^ ...), where (J^ ...) are the new
  seeds.  Alternatively, if you wish to terminate decoding early,
  simply return one or more values, which will be returned to the
  caller of 'utf8-decode'.  Normally, (values 0 I J^ ...) should be
  returned.
  
  In case of a decoding error, the expressions (E2 E2* ...) will be
  called with the same bindings listed above, except CODE-POINT is
  omitted, and MAXIMAL-SUBPART is bound to the 'maximal subpart of
  an ill-formed subsequence' as defined in section 3.9 of The
  Unicode Standard 12.0, i.e. the longest code unit subsequence
  starting at an inconvertible offset that is either (a) the
  initial subsequence of a well-formed code unit sequence, or (b) a
  subsequence of length one.  MAXIMAL-SUBPART is represented as an
  exact integer containing the bytes in big-endian order,
  e.g. #xF48FBF represents the bytes (F4 8F BF).
  
  The bindings OUTPUT and ERROR are provided for convenience, in
  case the error expressions (E2 E2* ...) wish to call the
  user-provided output procedure (e.g. to output a substitution
  character), or the output expressions (E1 E1* ...) wish to call
  the user-provided error procedure.
  
  If the provided bytes in BV-EXPR end with a non-empty but
  incomplete prefix of a well-formed UTF-8 byte sequence, then the
  following values are returned: (NEW-STATE BV-POS J ...).
  
  When decoding finishes, every byte in the input (including in the
  initial STATE-EXPR) will have been reported in exactly one of the
  following ways:
  
  (1) as part of a well-formed UTF-8 byte sequence, via the output
      expressions (E1 E1* ...), or
  
  (2) as part of a 'maximal subpart of an ill-formed subsequence',
      via the error expressions (E2 E2* ...), or
  
  (3) as part of the new state.
  
  (4) in the unexamined indices of BV-EXPR starting with BV-POS."
  (let ((bv   bv-expr)
        (end  end-expr))
    (define (output code-point i j ...)
      e1 e1* ...)

    (define (error maximal-subpart i j ...)
      e2 e2* ...)
    
    (define (continue i j ...)
      (if (< i end)
          (let ((byte (bytevector-u8-ref bv i)))
            (cond ((<= byte #x7F)       (output byte (+ i 1) j ...))
                  ((<= #xC2 byte #xF4)  (got-1 byte (+ i 1) j ...))
                  (else                 (error byte (+ i 1) j ...))))
          (values 0 i j ...)))

    (define (got-1 state i j ...)
      (if (< i end)
          (let ((byte (bytevector-u8-ref bv i)))
            (cond ((not (<= #x80 byte #xBF))
                   (error state i j ...))
                  ((<= state #xDF)
                   (output (logior (ash (logand state #x1F) 6)
                                   (logand byte #x3F))
                           (+ i 1) j ...))
                  (else
                   (let ((state^ (logior (ash state 8) byte)))
                     (cond ((or (<= #xE0A0 state^ #xED9F)
                                (<= #xEE80 state^ #xEFBF))
                            (got-2/3 state^ (+ i 1) j ...))
                           ((<= #xF090 state^ #xF48F)
                            (got-2/4 state^ (+ i 1) j ...))
                           (else
                            (error state i j ...)))))))
          (values state i j ...)))

    (define (got-2/3 state i j ...)
      (if (< i end)
          (let ((byte (bytevector-u8-ref bv i)))
            (if (<= #x80 byte #xBF)
                (output (logior (ash (logand state #xF00) 4)
                                (ash (logand state #x3F) 6)
                                (logand byte #x3F))
                        (+ i 1) j ...)
                (error state i j ...)))
          (values state i j ...)))

    (define (got-2/4 state i j ...)
      (if (< i end)
          (let ((byte (bytevector-u8-ref bv i)))
            (if (<= #x80 byte #xBF)
                (got-3/4 (logior (ash state 8) byte) (+ i 1) j ...)
                (error state i j ...)))
          (values state i j ...)))

    (define (got-3/4 state i j ...)
      (if (< i end)
          (let ((byte (bytevector-u8-ref bv i)))
            (if (<= #x80 byte #xBF)
                (output (logior (ash (logand state #x70000) 2)
                                (ash (logand state #x3F00) 4)
                                (ash (logand state #x3F) 6)
                                (logand byte #x3F))
                        (+ i 1) j ...)
                (error state i j ...)))
          (values state i j ...)))

    (define (enter state i j ...)
      (cond ((zero? state)      (continue i j ...))
            ((<= state #xF4)    (got-1 state i j ...))
            ((<= state #xEFBF)  (got-2/3 state i j ...))
            ((<= state #xF48F)  (got-2/4 state i j ...))
            (else               (got-3/4 state i j ...))))

    (enter state-expr start-expr init-expr ...)))

(define (utf8->string! state source source-start source-end
                       target target-start target-end)
  "Decode part of a UTF-8 byte stream from the bytevector SOURCE
starting at index SOURCE-START and up to (but not including) index
SOURCE-END, and writing into the string TARGET starting at index
TARGET-START and up to (but not including) index TARGET-END.  Returns
three values: NEW-STATE, SOURCE-POS, and TARGET-POS.  STATE is either
0 (the start state) or the value of NEW-STATE returned by the previous
call.  In case of errors, each 'maximal subpart of an ill-formed
subsequence', as defined in section 3.9 of The Unicode Standard 12.0,
is replaced with a Unicode replacement character (U+FFFD)."
  (if (< target-start target-end)
      (utf8-decode ((j target-start))
                   (i continue)
                   (output (code-point)
                           (string-set! target j (integer->char code-point))
                           (if (< (+ j 1) target-end)
                               (continue i (+ j 1))
                               (values 0 i (+ j 1))))
                   (error (maximal-subpart)
                          (output #xFFFD i j)) ;TODO: support other error handlers
                   state source source-start source-end)
      (values state source-start target-start)))

;; Another experimental primitive, slower than the ones above.
(define* (utf8-fold* out err seed state bv
                     #:optional (start 0) (end (bytevector-length bv)))
  "Decode part of a UTF-8 byte stream from the bytevector SOURCE
starting at index SOURCE-START and up to (but not including) index
SOURCE-END.  Returns three values: NEW-STATE, SOURCE-POS, and
FINAL-SEED.  STATE is either 0 (the start state) or the value of
NEW-STATE returned by the previous call.  For each valid code point,
call (OUT CODE-POINT SOURCE-POS SEED K), which should either call (K
SOURCE-POS NEW-SEED) to continue decoding, or return three values (0
SOURCE-POS NEW-SEED) which will terminate decoding and immediately
exit.  In case of errors, call (ERR MAXIMAL-SUBPART SOURCE-POS SEED K)
where MAXIMAL-SUBPART is a 'maximal subpart of an ill-formed
subsequence', as defined in section 3.9 of The Unicode Standard 12.0.
Similarly, ERR should either call (K SOURCE-POS NEW-SEED) to continue
decoding, or return to exit immediately."
  (utf8-decode ((j seed))
               (i continue)
               (output (code-point)
                       (out code-point i j continue))
               (error (maximal-subpart)
                      (err maximal-subpart i j continue))
               state bv start end))

;; Another experimental primitive, slower than the ones above.
(define* (utf8-fold out err seed state bv
                    #:optional (start 0) (end (bytevector-length bv)))
  "Decode part of a UTF-8 byte stream from the bytevector SOURCE
starting at index SOURCE-START and up to (but not including) index
SOURCE-END.  Returns three values: NEW-STATE, SOURCE-POS, and
FINAL-SEED.  STATE is either 0 (the start state) or the value of
NEW-STATE returned by the previous call.  For each valid code point,
call (OUT CODE-POINT SOURCE-POS SEED), which should return a new SEED.
In case of errors, call (ERR MAXIMAL-SUBPART SOURCE-POS SEED) where
MAXIMAL-SUBPART is a 'maximal subpart of an ill-formed subsequence',
as defined in section 3.9 of The Unicode Standard 12.0.  ERR should
return two values: a boolean specifying whether to continue decoding,
and a new seed."
  (utf8-fold* (lambda (code-point i j continue)
                (continue i (out code-point i j)))
              (lambda (maximal-subpart i j continue)
                (call-with-values (lambda () (err maximal-subpart i j))
                  (lambda (continue? j^)
                    (if continue?
                        (continue i j^)
                        (values 0 i j^)))))
              seed state bv start end))

;; A not-so-quick test of all valid characters.
;; TODO: Tests of strictness and error handling.
#;
(let ()
  (define ss (string-tabulate (lambda (i)
                                (if (< i #xD800)
                                    (integer->char i)
                                    (integer->char (+ i #x800))))
                              (- #x110000 #x800)))
  (define bv (string->utf8 ss))
  (define bv-len (bytevector-length bv))
  (define slen (* 2 (string-length ss)))
  (define s (make-string slen))
  (every (lambda (incr)
           (string-fill! s #\a)
           (call-with-values
               (lambda ()
                 (let loop ((state 0) (i 0) (j 0))
                   (if (< i bv-len)
                       (call-with-values
                           (lambda ()
                             (utf8->string! state bv i (min bv-len
                                                            (+ i incr))
                                            s j slen))
                         loop)
                       (values state i j))))
             (lambda (state i j)
               (and (zero? state)
                    (= i bv-len)
                    (= j (string-length ss))
                    (string=? ss (substring s 0 j))))))
         (iota 5 1)))

  reply	other threads:[~2019-04-27  0:52 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-04-20 23:53 bug#35350: Some compile output still leaks through with --verbosity=1 Mark H Weaver
2019-04-21 20:15 ` Ludovic Courtès
2019-04-22 23:52   ` Mark H Weaver
2019-04-23  8:45     ` Mark H Weaver
2019-04-23 10:12     ` Ludovic Courtès
2019-04-26 19:09       ` Mark H Weaver
2019-04-27  0:45         ` Mark H Weaver [this message]
2019-04-27  7:56           ` Mark H Weaver
2019-04-27 16:36         ` Ludovic Courtès
2019-04-30 20:26           ` Mark H Weaver
2019-05-04  9:33             ` Ludovic Courtès
2019-05-04 18:53               ` Mark H Weaver
2021-09-20  5:44                 ` Sarah Morgensen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87bm0sgts0.fsf@netris.org \
    --to=mhw@netris.org \
    --cc=35350@debbugs.gnu.org \
    --cc=ludo@gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this external index

	https://git.savannah.gnu.org/cgit/guix.git

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.