From: Mark H Weaver <mhw@netris.org>
To: "Ludovic Courtès" <ludo@gnu.org>
Cc: 35350@debbugs.gnu.org
Subject: bug#35350: Some compile output still leaks through with --verbosity=1
Date: Fri, 26 Apr 2019 20:45:24 -0400 [thread overview]
Message-ID: <87bm0sgts0.fsf@netris.org> (raw)
In-Reply-To: <87k1fgh9c0.fsf@netris.org> (Mark H. Weaver's message of "Fri, 26 Apr 2019 15:09:24 -0400")
[-- Attachment #1: Type: text/plain, Size: 160 bytes --]
Here's an improved version of the code with doc strings. It also
properly handles the case of (target-source >= target-end) in
'utf8->string!'.
Mark
[-- Attachment #2: UTF-8 decoder, v2 --]
[-- Type: text/plain, Size: 13259 bytes --]
;;; Copyright © 2019 Mark H Weaver <mhw@netris.org>
;;;
;;; This program is free software: you can redistribute it and/or modify
;;; it under the terms of the GNU General Public License as published by
;;; the Free Software Foundation, either version 3 of the License, or
;;; (at your option) any later version.
;;;
;;; This program is distributed in the hope that it will be useful,
;;; but WITHOUT ANY WARRANTY; without even the implied warranty of
;;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
;;; GNU General Public License for more details.
;;;
;;; You should have received a copy of the GNU General Public License
;;; along with this program. If not, see <http://www.gnu.org/licenses/>.
(use-modules (rnrs bytevectors)
;; the following modules are only needed for the test.
;;(srfi srfi-1)
;;(ice-9 iconv)
)
;; Well-formed UTF-8 sequences
;; ===========================
;; 00..7F
;; C2..DF 80..BF
;; E0 *A0..BF 80..BF
;; E1..EC 80..BF 80..BF
;; ED 80..9F* 80..BF
;; EE..EF 80..BF 80..BF
;; F0 *90..BF 80..BF 80..BF
;; F1..F3 80..BF 80..BF 80..BF
;; F4 80..8F* 80..BF 80..BF
;; UTF-8 Decoder states
;; ====================
;; 0 start state
;; C2 .. DF got 1/2 bytes
;; E0 .. EF got 1/3 bytes
;; F0 .. F4 got 1/4 bytes
;; E0A0 .. ED9F got 2/3 bytes (range 1)
;; EE80 .. EFBF got 2/3 bytes (range 2)
;; F090 .. F48F got 2/4 bytes
;; F09080 .. F48FBF got 3/4 bytes
(define-syntax-rule (utf8-decode ((j init-expr) ...)
(i continue)
(output (code-point) e1 e1* ...)
(error (maximal-subpart) e2 e2* ...)
state-expr bv-expr start-expr end-expr)
"Decode part of a UTF-8 byte stream in the bytevector BV-EXPR with
indices in the interval from START-EXPR (inclusive) to END-EXPR
(exclusive). STATE-EXPR is the initial decoder state, which must
be an incomplete prefix of a valid UTF-8 byte sequence. The
start state is 0.
When a valid UTF-8 byte sequence is found, the output expressions
(E1 E1* ...) are evaluated, with the following bindings
available in the lexical environment:
CODE-POINT the decoded code point, as an exact integer.
I bytevector index immediately after the decoded sequence.
J ... the user-provided seeds.
CONTINUE procedure with arguments (i j ...) to continue decoding.
OUTPUT procedure with arguments (code-point i j ...), whose
body consists of (E1 E1* ...), provided by the user.
ERROR procedure with arguments (maximal-subpart i j ...), whose
body consists of (E2 E2* ...), provided by the user.
If you wish for decoding to continue, (E1 E1* ...) should end by
tail-calling (CONTINUE I J^ ...), where (J^ ...) are the new
seeds. Alternatively, if you wish to terminate decoding early,
simply return one or more values, which will be returned to the
caller of 'utf8-decode'. Normally, (values 0 I J^ ...) should be
returned.
In case of a decoding error, the expressions (E2 E2* ...) will be
called with the same bindings listed above, except CODE-POINT is
omitted, and MAXIMAL-SUBPART is bound to the 'maximal subpart of
an ill-formed subsequence' as defined in section 3.9 of The
Unicode Standard 12.0, i.e. the longest code unit subsequence
starting at an inconvertible offset that is either (a) the
initial subsequence of a well-formed code unit sequence, or (b) a
subsequence of length one. MAXIMAL-SUBPART is represented as an
exact integer containing the bytes in big-endian order,
e.g. #xF48FBF represents the bytes (F4 8F BF).
The bindings OUTPUT and ERROR are provided for convenience, in
case the error expressions (E2 E2* ...) wish to call the
user-provided output procedure (e.g. to output a substitution
character), or the output expressions (E1 E1* ...) wish to call
the user-provided error procedure.
If the provided bytes in BV-EXPR end with a non-empty but
incomplete prefix of a well-formed UTF-8 byte sequence, then the
following values are returned: (NEW-STATE BV-POS J ...).
When decoding finishes, every byte in the input (including in the
initial STATE-EXPR) will have been reported in exactly one of the
following ways:
(1) as part of a well-formed UTF-8 byte sequence, via the output
expressions (E1 E1* ...), or
(2) as part of a 'maximal subpart of an ill-formed subsequence',
via the error expressions (E2 E2* ...), or
(3) as part of the new state.
(4) in the unexamined indices of BV-EXPR starting with BV-POS."
(let ((bv bv-expr)
(end end-expr))
(define (output code-point i j ...)
e1 e1* ...)
(define (error maximal-subpart i j ...)
e2 e2* ...)
(define (continue i j ...)
(if (< i end)
(let ((byte (bytevector-u8-ref bv i)))
(cond ((<= byte #x7F) (output byte (+ i 1) j ...))
((<= #xC2 byte #xF4) (got-1 byte (+ i 1) j ...))
(else (error byte (+ i 1) j ...))))
(values 0 i j ...)))
(define (got-1 state i j ...)
(if (< i end)
(let ((byte (bytevector-u8-ref bv i)))
(cond ((not (<= #x80 byte #xBF))
(error state i j ...))
((<= state #xDF)
(output (logior (ash (logand state #x1F) 6)
(logand byte #x3F))
(+ i 1) j ...))
(else
(let ((state^ (logior (ash state 8) byte)))
(cond ((or (<= #xE0A0 state^ #xED9F)
(<= #xEE80 state^ #xEFBF))
(got-2/3 state^ (+ i 1) j ...))
((<= #xF090 state^ #xF48F)
(got-2/4 state^ (+ i 1) j ...))
(else
(error state i j ...)))))))
(values state i j ...)))
(define (got-2/3 state i j ...)
(if (< i end)
(let ((byte (bytevector-u8-ref bv i)))
(if (<= #x80 byte #xBF)
(output (logior (ash (logand state #xF00) 4)
(ash (logand state #x3F) 6)
(logand byte #x3F))
(+ i 1) j ...)
(error state i j ...)))
(values state i j ...)))
(define (got-2/4 state i j ...)
(if (< i end)
(let ((byte (bytevector-u8-ref bv i)))
(if (<= #x80 byte #xBF)
(got-3/4 (logior (ash state 8) byte) (+ i 1) j ...)
(error state i j ...)))
(values state i j ...)))
(define (got-3/4 state i j ...)
(if (< i end)
(let ((byte (bytevector-u8-ref bv i)))
(if (<= #x80 byte #xBF)
(output (logior (ash (logand state #x70000) 2)
(ash (logand state #x3F00) 4)
(ash (logand state #x3F) 6)
(logand byte #x3F))
(+ i 1) j ...)
(error state i j ...)))
(values state i j ...)))
(define (enter state i j ...)
(cond ((zero? state) (continue i j ...))
((<= state #xF4) (got-1 state i j ...))
((<= state #xEFBF) (got-2/3 state i j ...))
((<= state #xF48F) (got-2/4 state i j ...))
(else (got-3/4 state i j ...))))
(enter state-expr start-expr init-expr ...)))
(define (utf8->string! state source source-start source-end
target target-start target-end)
"Decode part of a UTF-8 byte stream from the bytevector SOURCE
starting at index SOURCE-START and up to (but not including) index
SOURCE-END, and writing into the string TARGET starting at index
TARGET-START and up to (but not including) index TARGET-END. Returns
three values: NEW-STATE, SOURCE-POS, and TARGET-POS. STATE is either
0 (the start state) or the value of NEW-STATE returned by the previous
call. In case of errors, each 'maximal subpart of an ill-formed
subsequence', as defined in section 3.9 of The Unicode Standard 12.0,
is replaced with a Unicode replacement character (U+FFFD)."
(if (< target-start target-end)
(utf8-decode ((j target-start))
(i continue)
(output (code-point)
(string-set! target j (integer->char code-point))
(if (< (+ j 1) target-end)
(continue i (+ j 1))
(values 0 i (+ j 1))))
(error (maximal-subpart)
(output #xFFFD i j)) ;TODO: support other error handlers
state source source-start source-end)
(values state source-start target-start)))
;; Another experimental primitive, slower than the ones above.
(define* (utf8-fold* out err seed state bv
#:optional (start 0) (end (bytevector-length bv)))
"Decode part of a UTF-8 byte stream from the bytevector SOURCE
starting at index SOURCE-START and up to (but not including) index
SOURCE-END. Returns three values: NEW-STATE, SOURCE-POS, and
FINAL-SEED. STATE is either 0 (the start state) or the value of
NEW-STATE returned by the previous call. For each valid code point,
call (OUT CODE-POINT SOURCE-POS SEED K), which should either call (K
SOURCE-POS NEW-SEED) to continue decoding, or return three values (0
SOURCE-POS NEW-SEED) which will terminate decoding and immediately
exit. In case of errors, call (ERR MAXIMAL-SUBPART SOURCE-POS SEED K)
where MAXIMAL-SUBPART is a 'maximal subpart of an ill-formed
subsequence', as defined in section 3.9 of The Unicode Standard 12.0.
Similarly, ERR should either call (K SOURCE-POS NEW-SEED) to continue
decoding, or return to exit immediately."
(utf8-decode ((j seed))
(i continue)
(output (code-point)
(out code-point i j continue))
(error (maximal-subpart)
(err maximal-subpart i j continue))
state bv start end))
;; Another experimental primitive, slower than the ones above.
(define* (utf8-fold out err seed state bv
#:optional (start 0) (end (bytevector-length bv)))
"Decode part of a UTF-8 byte stream from the bytevector SOURCE
starting at index SOURCE-START and up to (but not including) index
SOURCE-END. Returns three values: NEW-STATE, SOURCE-POS, and
FINAL-SEED. STATE is either 0 (the start state) or the value of
NEW-STATE returned by the previous call. For each valid code point,
call (OUT CODE-POINT SOURCE-POS SEED), which should return a new SEED.
In case of errors, call (ERR MAXIMAL-SUBPART SOURCE-POS SEED) where
MAXIMAL-SUBPART is a 'maximal subpart of an ill-formed subsequence',
as defined in section 3.9 of The Unicode Standard 12.0. ERR should
return two values: a boolean specifying whether to continue decoding,
and a new seed."
(utf8-fold* (lambda (code-point i j continue)
(continue i (out code-point i j)))
(lambda (maximal-subpart i j continue)
(call-with-values (lambda () (err maximal-subpart i j))
(lambda (continue? j^)
(if continue?
(continue i j^)
(values 0 i j^)))))
seed state bv start end))
;; A not-so-quick test of all valid characters.
;; TODO: Tests of strictness and error handling.
#;
(let ()
(define ss (string-tabulate (lambda (i)
(if (< i #xD800)
(integer->char i)
(integer->char (+ i #x800))))
(- #x110000 #x800)))
(define bv (string->utf8 ss))
(define bv-len (bytevector-length bv))
(define slen (* 2 (string-length ss)))
(define s (make-string slen))
(every (lambda (incr)
(string-fill! s #\a)
(call-with-values
(lambda ()
(let loop ((state 0) (i 0) (j 0))
(if (< i bv-len)
(call-with-values
(lambda ()
(utf8->string! state bv i (min bv-len
(+ i incr))
s j slen))
loop)
(values state i j))))
(lambda (state i j)
(and (zero? state)
(= i bv-len)
(= j (string-length ss))
(string=? ss (substring s 0 j))))))
(iota 5 1)))
next prev parent reply other threads:[~2019-04-27 0:52 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-04-20 23:53 bug#35350: Some compile output still leaks through with --verbosity=1 Mark H Weaver
2019-04-21 20:15 ` Ludovic Courtès
2019-04-22 23:52 ` Mark H Weaver
2019-04-23 8:45 ` Mark H Weaver
2019-04-23 10:12 ` Ludovic Courtès
2019-04-26 19:09 ` Mark H Weaver
2019-04-27 0:45 ` Mark H Weaver [this message]
2019-04-27 7:56 ` Mark H Weaver
2019-04-27 16:36 ` Ludovic Courtès
2019-04-30 20:26 ` Mark H Weaver
2019-05-04 9:33 ` Ludovic Courtès
2019-05-04 18:53 ` Mark H Weaver
2021-09-20 5:44 ` Sarah Morgensen
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87bm0sgts0.fsf@netris.org \
--to=mhw@netris.org \
--cc=35350@debbugs.gnu.org \
--cc=ludo@gnu.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this external index
https://git.savannah.gnu.org/cgit/guix.git
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.