From: Mark H Weaver <mhw@netris.org>
To: Christopher Allan Webber <cwebber@dustycloud.org>
Cc: Guile Devel <guile-devel@gnu.org>
Subject: Re: Including sjson (formerly (ice-9 json)) and fash.scm in guile proper?
Date: Tue, 20 Jun 2017 18:47:52 -0400 [thread overview]
Message-ID: <87mv92mg2v.fsf@netris.org> (raw)
In-Reply-To: <87zied7v71.fsf@dustycloud.org> (Christopher Allan Webber's message of "Mon, 15 May 2017 14:53:54 -0500")
[-- Attachment #1: Type: text/plain, Size: 1216 bytes --]
Hi Chris,
I'm terribly sorry for the long delay on this. For better or worse,
I've become extremely concerned about computer security, and so I feel a
heavy responsibility to be extra careful about code that is expected to
parse hostile data.
I was also looking for a cleaner way to express this parser, and to add
better error reporting, while allowing flexibility for users to
customize the Scheme representation. Toward these ends, I ended up
re-implementing the parser from scratch.
I've attached my current draft of the new parser. By default, JSON
objects are represented as (@ . <alist>) where <alist> has dotted pairs,
but it's easy to ask for your preferred two-element lists using
'make-json-parser'.
The json writer is mostly okay, but it also needs to be generalized to
support the customizable representation (and maybe I went too far here
with the parser, dunno). Also, there are a few cases where it will
generate invalid JSON, notably: if ASCII control characters are present
in strings, hex escapes must be printed instead of the raw character,
and real numbers cannot simply be printed using 'display' because of
infinities and NaNs.
Are you okay with this general direction?
Mark
[-- Attachment #2: First draft of proposed JSON parser --]
[-- Type: text/plain, Size: 17724 bytes --]
(define-module (ice-9 json)
#:use-module (ice-9 match)
#:use-module (srfi srfi-11) ; let-values
#:export (read-json make-json-reader))
;; XXX Consider using conditions and exceptions from SRFI-34 + SRFI-35
;; or R6RS (they should be merged!)
(define (json-error port message . args)
(throw 'json-error port (peek-char port) message args))
(define-syntax match-next
(syntax-rules (else)
"Like 'match' from (ice-9 match), but with a few differences. The
first operand is PORT, which must be a variable bound to an open input
port. The value to be matched is the result of (peek-char PORT). If
a pattern from one of the clauses matches the peeked character (or
eof-object), then (read-char PORT) is implicitly performed before
evaluating the associated body. If none of the patterns match, then
leave the port position unchanged, and evaluate the body of the final
'else' clause if present, or else raise a JSON error."
((match-next port
(c body0 body ...)
...
(else ebody0 ebody ...))
(match (peek-char port)
(c (read-char port) body0 body ...)
...
(_ ebody0 ebody ...)))
((match-next port
(c body0 body ...)
...)
(match-next port
(c body0 body ...)
...
(else (json-error port "expected" '(c ...)))))))
(define-syntax match-next*
(syntax-rules (else)
"Like 'match-next', but with an implicit final clause that consumes
JSON whitespace characters and tries again. Assuming that none of the
explicit patterns match JSON whitespace, this is equivalent to
consuming JSON whitespace before 'match-next', but it is more
efficient when JSON whitespace is not present."
((match-next* port
(c body0 body ...)
...
(else ebody0 ebody ...))
(let loop ()
(match-next port
(c body0 body ...) ...
((or #\space #\tab #\newline #\return)
(loop))
(else ebody0 ebody ...))))
((match-next* port
(c body0 body ...)
...)
(match-next* port
(c body0 body ...)
...
(else (json-error port "expected" '(c ...)))))))
(define-syntax read-literal
(syntax-rules ()
"Read the given characters C C* ... from PORT and return VALUE.
Raise a JSON error if the expected characters are not found."
((read-literal port value c c* ...)
(match-next port
(c (read-literal port value c* ...))
(else (json-error port "expected" c
"while reading" 'value))))
((read-literal port value)
value)))
(define* (make-json-reader
#:key
(true #t)
(false #f)
(null 'null)
(make-string (lambda (s) s))
(make-number (lambda (x) x))
(obj-knil '())
(obj-kons (lambda (k v seed) (cons (cons k v) seed)))
(obj-finalize (lambda (seed) (cons '@ (reverse! seed))))
(array-knil '())
(array-kons cons)
(array-finalize reverse!))
;; ws = *( %x20 / ; Space
;; %x09 / ; Horizontal tab
;; %x0A / ; Line feed or New line
;; %x0D ) ; Carriage return
(define (consume-whitespace port)
"Consume zero or more JSON whitespace characters (#\\space #\\tab
#\\newline or #\\return) from PORT."
(match-next* port
(else #t)))
;; DIGIT
(define (try-read-digit port)
"Peeks at the next character to be read from PORT. If it's a
decimal digit, then read it and return its value. Otherwise, leave
the port position unchanged and return false."
(match-next port
(#\0 0)
(#\1 1)
(#\2 2)
(#\3 3)
(#\4 4)
(#\5 5)
(#\6 6)
(#\7 7)
(#\8 8)
(#\9 9)
(else #f)))
;; *DIGIT
(define (read-digits n width port)
"Read the remaining zero or more decimal digits of a non-negative
integer from PORT. Assuming that N and WIDTH are the value and width
of the digits previously read, two values are returned: the value and
width of the non-negative integer."
(match (try-read-digit port)
(#f (values n width))
(digit (read-digits (+ digit (* n 10))
(+ width 1)
port))))
;; 1*DIGIT
(define (read-integer port)
"Read one or more decimal digits from PORT, and return two values:
the non-negative integer represented by those digits, and the
width (number of digits read)."
(match (try-read-digit port)
(#f (json-error port "expected digit"))
(n (read-digits n 1 port))))
;; int = zero / ( digit1-9 *DIGIT )
;; where:
;; digit1-9 = %x31-39 ; 1-9
(define (read-integer-part port)
"Read the non-negative integer part of a JSON number, and return two
values: its value and width (number of digits read). This procedure
differs from 'read-integer' in only one respect: if the initial digit
is 0, it returns immediately without accepting more digits."
(match (try-read-digit port)
(#f (json-error port "expected digit"))
(0 (values 0 1))
(n (read-digits n 1 port))))
;; [ frac ]
;; where:
;; frac = decimal-point 1*DIGIT
(define (read-frac-part port)
"Read the optional fractional part of a JSON number, and return two
values: the value represented by the digits as a non-negative integer,
and the width (number of digits read). For example, if \".25\" is
read, then return 25 and 2. If a decimal point is not found, leave
the port position unchanged and return 0 and 0."
(match-next port
(#\. (read-integer port))
(else (values 0 0))))
;; [ exp ]
;; where:
;; e = %x65 / %x45 ; e E
;; exp = e [ minus / plus ] 1*DIGIT
(define (read-exp-part port)
"Read the optional exponent part of a JSON number, and return the
exponent as an exact integer. If neither e nor E are found, leave the
port position unchanged and return 0."
(match-next port
((or #\e #\E) (match-next port
(#\- (- (read-integer port)))
;; Use unary + here to discard the second value
;; returned by 'read-integer'.
(#\+ (+ (read-integer port)))
(else (+ (read-integer port)))))
(else 0)))
;; number = [ minus ] int [ frac ] [ exp ]
(define (read-non-negative-number port)
"Read the portion of a JSON number following the optional initial
minus sign from PORT, and return its value."
(let*-values (((int int-width) (read-integer-part port))
((frac frac-width) (read-frac-part port))
((exp) (read-exp-part port)))
;; We compute the value in a way that avoids allocating
;; intermediate exact rationals where possible. To achieve this,
;; we effectively pretend that the decimal point were moved to the
;; right FRAC-WIDTH places (i.e. to the right of the final digit),
;; and we adjust the exponent to compensate for this. SIGNIFICAND
;; is the integer value of the digits after moving the decimal
;; point, and EFFECTIVE-EXP is the adjusted exponent.
(let* ((significand (+ frac (* int (expt 10 frac-width))))
(effective-exp (- exp frac-width))
;; XXX if the exponent is very large (regardless of its
;; sign), this could result in a huge amount of memory
;; being allocated for VALUE and for the result of EXPT.
;; We should consider handling these cases more gracefully.
;; However, it should be noted that incorrect rounding will
;; occur if we perform the following computation using
;; inexact arithmetic.
;;
;; We handle the positive and negative EFFECTIVE-EXP cases
;; separately to avoid allocating an intermediate exact
;; rational for the result of EXPT.
(value (if (negative? effective-exp)
(/ significand (expt 10 (- effective-exp)))
(* significand (expt 10 effective-exp)))))
;; If the value is an integer, return its exact value, otherwise
;; convert it to inexact. In the future, we might consider
;; returning inexacts in more cases, e.g. for huge integers, in
;; order to limit the memory usage.
(if (integer? value)
value
(exact->inexact value)))))
(define (read-number port)
"Read a JSON number from PORT and return the result of applying
MAKE-NUMBER to the number."
(make-number (match-next* port
(#\- (- (read-non-negative-number port)))
(else (read-non-negative-number port)))))
(define (high-surrogate? char)
"Return true if CHAR is a UTF-16 high surrogate, otherwise return
false."
(and (char? char)
(<= #xD800 (char->integer char) #xDBFF)))
(define (low-surrogate? char)
"Return true if CHAR is a UTF-16 low surrogate, otherwise return
false."
(and (char? char)
(<= #xDC00 (char->integer char) #xDFFF)))
(define (reduce-surrogate-pair high low)
"Return the character represented by the two UTF-16 surrogate code
points (characters) HIGH and LOW."
(let ((high (- (char->integer high) #xD800))
(low (- (char->integer low) #xDC00)))
(integer->char (+ #x10000 low (* 1024 high)))))
;; HEXDIG
(define (read-hex-digit port)
"Read a hexadecimal digit from PORT and return its value."
(match-next port
(#\0 #x0)
(#\1 #x1)
(#\2 #x2)
(#\3 #x3)
(#\4 #x4)
(#\5 #x5)
(#\6 #x6)
(#\7 #x7)
(#\8 #x8)
(#\9 #x9)
((or #\A #\a) #xA)
((or #\B #\b) #xB)
((or #\C #\c) #xC)
((or #\D #\d) #xD)
((or #\E #\e) #xE)
((or #\F #\f) #xF)))
;; nHEXDIG
(define (read-hex n port)
"Read exactly N hexadecimal digits from PORT and return the
represented value."
(if (zero? n)
0
(let* ((initial-digits (read-hex (- n 1) port))
(final-digit (read-hex-digit port)))
(+ final-digit (* 16 initial-digits)))))
;; unescaped = %x20-21 / %x23-5B / %x5D-10FFFF
(define char-set:unescaped
(char-set-union (ucs-range->char-set #x20 #x22)
(ucs-range->char-set #x23 #x5C)
(ucs-range->char-set #x5D #x110000)))
(define (unescaped? c)
"Return true if C is a character that is allowed to appear unescaped
within a JSON string, otherwise return false."
(and (char? c) (char-set-contains? char-set:unescaped c)))
;; [ char ]
;; where:
;; escape = %x5C ; \
;; char = unescaped /
;; escape (
;; %x22 / ; " quotation mark U+0022
;; %x5C / ; \ reverse solidus U+005C
;; %x2F / ; / solidus U+002F
;; %x62 / ; b backspace U+0008
;; %x66 / ; f form feed U+000C
;; %x6E / ; n line feed U+000A
;; %x72 / ; r carriage return U+000D
;; %x74 / ; t tab U+0009
;; %x75 4HEXDIG ) ; uXXXX U+XXXX
(define (try-read-string-char port)
"Try to read one character of a JSON string from PORT, either an
unescaped character or a backslash-introduced string escape, and
return the represented character. If the next character from PORT
does not introduce a valid JSON string char (e.g. if the closing quote
is found), then leave the port position unchanged and return false.
Note that if a 12-character hex escape is found, only the first 6
characters (the high surrogate) is read by this procedure. Surrogate
pairs are handled by 'read-string-chars'."
(match-next port
((? unescaped? c) c)
(#\\ (match-next port
(#\" #\")
(#\\ #\\)
(#\/ #\/)
(#\b #\backspace)
(#\f #\page)
(#\n #\newline)
(#\r #\return)
(#\t #\tab)
(#\u (integer->char (read-hex 4 port)))))
(else #f)))
(define (read-string-chars port)
"Read the contents of a JSON string from PORT, not including the
quotes, and return the represented string."
;; Use an output string port to accumulate the represented string as
;; we read it. This entails far less heap allocation than
;; accumulating a list of characters.
(let ((out (open-output-string)))
(let loop ((char-or-false (try-read-string-char port)))
(match char-or-false
(#f (get-output-string out))
((? high-surrogate? high)
;; A high-surrogate was found. Try to combine it with the
;; following low-surrogate.
(match (try-read-string-char port)
((? low-surrogate? low)
(write-char (reduce-surrogate-pair high low) out)
(loop (try-read-string-char port)))
(char-or-false
;; The high-surrogate was not followed by a low-surrogate.
;; Replace the unpaired high-surrogate with the Unicode
;; replacement character and continue.
(write-char #\xFFFD out)
(loop char-or-false))))
;; Write the represented character to OUT, but convert
;; unpaired low-surrogates to the Unicode replacement
;; character.
(char (write-char (if (low-surrogate? char) #\xFFFD char)
out)
(loop (try-read-string-char port)))))))
(define (read-string port)
"Read a JSON string from PORT and return the result of applying
MAKE-STRING to the string."
(match-next* port
(#\" (let ((s (make-string (read-string-chars port))))
(match-next port
(#\" s))))))
(define (read-true port)
"Read the characters \"true\" from PORT and return TRUE."
(read-literal port true #\t #\r #\u #\e))
(define (read-false port)
"Read the characters \"false\" from PORT and return FALSE."
(read-literal port false #\f #\a #\l #\s #\e))
(define (read-null port)
"Read the characters \"null\" from PORT and return NULL."
(read-literal port null #\n #\u #\l #\l))
(define (read-array-elements seed port)
"Read the remaining elements of a JSON array from PORT, where SEED
is the result of the last call to ARRAY-KONS, or ARRAY-KNIL if no
elements have previously been read. Returns the final seed."
(let* ((element (read-value port))
(seed (array-kons element seed)))
(match-next* port
(#\, (read-array-elements seed port))
(else seed))))
(define (read-array port)
"Read a JSON array from PORT and return the result of calling
ARRAY-FINALIZE on the final seed produced using ARRAY-KNIL and
ARRAY-KONS."
(match-next* port
(#\[ (match-next* port
(#\] array-knil)
(else (let ((seed (read-array-elements array-knil port)))
(match-next* port
(#\] (array-finalize seed)))))))))
(define (read-member port)
"Read a JSON object member (a colon-separated string-value pair) from PORT,
and return two values: the string and the value."
(let ((key (read-string port)))
(match-next* port
(#\: (let ((value (read-value port)))
(values key value))))))
(define (read-members seed port)
"Read the remaining members of a JSON object from PORT, where SEED
is the result of the last call to OBJ-KONS, or OBJ-KNIL if no members
have previously been read. Return the final seed."
(let-values (((k v) (read-member port)))
(let ((seed (obj-kons k v seed)))
(match-next* port
(#\, (read-members seed port))
(else seed)))))
(define (read-object port)
"Read a JSON object from PORT and return the result of calling
OBJ-FINALIZE on the final seed produced using OBJ-KNIL and OBJ-KONS."
(obj-finalize (match-next* port
(#\{ (match-next* port
(#\} obj-knil)
(else (let ((seed (read-members obj-knil port)))
(match-next* port
(#\} seed)))))))))
(define (read-value port)
"Read a JSON value from PORT and return it."
;; We can't use 'match-next*' here, because we don't want to consume
;; any non-JSON-whitespace characters before dispatching to the
;; appropriate reader.
(consume-whitespace port)
(match (peek-char port)
(#\" (read-string port))
(#\{ (read-object port))
(#\[ (read-array port))
(#\t (read-true port))
(#\f (read-false port))
(#\n (read-null port))
((or #\- #\0 #\1 #\2 #\3 #\4 #\5 #\6 #\7 #\8 #\9)
(read-number port))
(_ (json-error port "expected a JSON value"))))
(define (read-json-text port)
"Read a JSON value, optional JSON whitespace, and EOF from PORT.
Return the represented value. If non-JSON-whitespace characters are
found after the JSON value, raise a JSON error."
(let ((value (read-value port)))
(match-next* port
((? eof-object?) value)
(else (json-error port "expected EOF")))))
read-json-text)
(define read-json (make-json-reader))
;;; Local Variables:
;;; mode: scheme
;;; eval: (put 'match-next 'scheme-indent-function 1)
;;; eval: (put 'match-next* 'scheme-indent-function 1)
;;; End:
next prev parent reply other threads:[~2017-06-20 22:47 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-05-08 23:17 Including sjson (formerly (ice-9 json)) and fash.scm in guile proper? Christopher Allan Webber
2017-05-12 19:15 ` Mark H Weaver
2017-05-14 1:30 ` Christopher Allan Webber
2017-05-15 18:35 ` Mark H Weaver
2017-05-15 19:53 ` Christopher Allan Webber
2017-06-20 22:47 ` Mark H Weaver [this message]
2017-06-20 23:18 ` Mark H Weaver
2017-06-20 23:50 ` Mark H Weaver
2017-06-22 16:26 ` Christopher Allan Webber
2017-06-03 13:35 ` Ludovic Courtès
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: https://www.gnu.org/software/guile/
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87mv92mg2v.fsf@netris.org \
--to=mhw@netris.org \
--cc=cwebber@dustycloud.org \
--cc=guile-devel@gnu.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).