* bug#24784: 26.0.50; JSON strings with utf-16 escape codes
@ 2016-10-24 18:06 Helmut Eller
2016-10-24 19:57 ` Philipp Stephani
0 siblings, 1 reply; 8+ messages in thread
From: Helmut Eller @ 2016-10-24 18:06 UTC (permalink / raw)
To: 24784
json-read-from-string doesn't parse strings correctly if the the \u
syntax is used to write UTF-16 surrogates:
(equal (json-read-from-string "\"\\uD834\\uDD1E\"") "\"\U0001D11E\"")
=> nil
The correct result t. To quote RFC 7159[*]:
To escape an extended character that is not in the Basic Multilingual
Plane, the character is represented as a 12-character sequence,
encoding the UTF-16 surrogate pair. So, for example, a string
containing only the G clef character (U+1D11E) may be represented as
"\uD834\uDD1E".
[*] https://tools.ietf.org/html/rfc7159#section-7
In GNU Emacs 26.0.50.2 (x86_64-unknown-linux-gnu, GTK+ Version 3.14.5)
of 2016-10-24 built on caladan
Repository revision: 26ccd19269c040ad5960a7567aa5fc88f142c709
Windowing system distributor 'The X.Org Foundation', version 11.0.11604000
System Description: Debian GNU/Linux 8.5 (jessie)
Configured using:
'configure --with-xpm=no --with-jpeg=no --with-gif=no --with-tiff=no'
Configured features:
PNG SOUND DBUS GSETTINGS NOTIFY GNUTLS LIBXML2 FREETYPE XFT ZLIB
TOOLKIT_SCROLL_BARS GTK3 X11
Important settings:
value of $LANG: C.UTF-8
locale-coding-system: utf-8-unix
^ permalink raw reply [flat|nested] 8+ messages in thread
* bug#24784: 26.0.50; JSON strings with utf-16 escape codes
2016-10-24 18:06 bug#24784: 26.0.50; JSON strings with utf-16 escape codes Helmut Eller
@ 2016-10-24 19:57 ` Philipp Stephani
2016-10-24 23:19 ` Dmitry Gutov
0 siblings, 1 reply; 8+ messages in thread
From: Philipp Stephani @ 2016-10-24 19:57 UTC (permalink / raw)
To: Helmut Eller, 24784
[-- Attachment #1.1: Type: text/plain, Size: 746 bytes --]
Helmut Eller <eller.helmut@gmail.com> schrieb am Mo., 24. Okt. 2016 um
20:58 Uhr:
>
> json-read-from-string doesn't parse strings correctly if the the \u
> syntax is used to write UTF-16 surrogates:
>
> (equal (json-read-from-string "\"\\uD834\\uDD1E\"") "\"\U0001D11E\"")
> => nil
>
> The correct result t. To quote RFC 7159[*]:
>
> To escape an extended character that is not in the Basic Multilingual
> Plane, the character is represented as a 12-character sequence,
> encoding the UTF-16 surrogate pair. So, for example, a string
> containing only the G clef character (U+1D11E) may be represented as
> "\uD834\uDD1E".
>
> [*] https://tools.ietf.org/html/rfc7159#section-7
>
> Thanks for reporting, I've attached a patch.
[-- Attachment #1.2: Type: text/html, Size: 1556 bytes --]
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-Fix-encoding-of-JSON-surrogate-pairs.txt --]
[-- Type: text/plain; charset=US-ASCII; name="0001-Fix-encoding-of-JSON-surrogate-pairs.txt", Size: 3040 bytes --]
From 6c630bd5b001243d6b7115380088909a7a180ddb Mon Sep 17 00:00:00 2001
From: Philipp Stephani <phst@google.com>
Date: Mon, 24 Oct 2016 21:54:51 +0200
Subject: [PATCH] Fix encoding of JSON surrogate pairs
JSON requires that such pairs be treated as UTF-16 surrogate pairs, not
individual code points; cf. Bug #24784.
* lisp/json.el (json-read-escaped-char): Fix decoding of surrogate
pairs.
(json--decode-utf-16-surrogates): New defsubst.
* test/lisp/json-tests.el (test-json-read-string): Add test for
surrogate pairs.
(test-json-encode-string): Add test for non-BMP character encoding.
---
lisp/json.el | 13 +++++++++++++
test/lisp/json-tests.el | 7 +++++--
2 files changed, 18 insertions(+), 2 deletions(-)
diff --git a/lisp/json.el b/lisp/json.el
index fdac8d9..5bfdfd4 100644
--- a/lisp/json.el
+++ b/lisp/json.el
@@ -363,6 +363,10 @@ json-special-chars
;; String parsing
+(defsubst json--decode-utf-16-surrogates (high low)
+ "Return the code point represented by the UTF-16 surrogates HIGH and LOW."
+ (+ (lsh (- high #xD800) 10) (- low #xDC00) #x10000))
+
(defun json-read-escaped-char ()
"Read the JSON string escaped character at point."
;; Skip over the '\'
@@ -372,6 +376,15 @@ json-read-escaped-char
(cond
(special (cdr special))
((not (eq char ?u)) char)
+ ;; Special-case UTF-16 surrogate pairs,
+ ;; cf. https://tools.ietf.org/html/rfc7159#section-7
+ ((looking-at
+ (rx (group (any "Dd") (any "89ABab") (= 2 (any "0-9A-Fa-f")))
+ "\\u" (group (any "Dd") (any "C-Fc-f") (= 2 (any "0-9A-Fa-f")))))
+ (json-advance 10)
+ (json--decode-utf-16-surrogates
+ (string-to-number (match-string 1) 16)
+ (string-to-number (match-string 2) 16)))
((looking-at "[0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f]")
(let ((hex (match-string 0)))
(json-advance 4)
diff --git a/test/lisp/json-tests.el b/test/lisp/json-tests.el
index 78cebb4..8958000 100644
--- a/test/lisp/json-tests.el
+++ b/test/lisp/json-tests.el
@@ -167,14 +167,17 @@ json-tests--with-temp-buffer
(should (equal (json-read-string) "abcαβγ")))
(json-tests--with-temp-buffer "\"\\nasd\\u0444\\u044b\\u0432fgh\\t\""
(should (equal (json-read-string) "\nasdфывfgh\t")))
+ ;; Bug#24784
+ (json-tests--with-temp-buffer "\"\\uD834\\uDD1E\""
+ (should (equal (json-read-string) "\U0001D11E")))
(json-tests--with-temp-buffer "foo"
(should-error (json-read-string) :type 'json-string-format)))
(ert-deftest test-json-encode-string ()
(should (equal (json-encode-string "foo") "\"foo\""))
(should (equal (json-encode-string "a\n\fb") "\"a\\n\\fb\""))
- (should (equal (json-encode-string "\nasdфыв\u001f\u007ffgh\t")
- "\"\\nasdфыв\\u001f\u007ffgh\\t\"")))
+ (should (equal (json-encode-string "\nasdфыв𝄞\u001f\u007ffgh\t")
+ "\"\\nasdфыв𝄞\\u001f\u007ffgh\\t\"")))
(ert-deftest test-json-encode-key ()
(should (equal (json-encode-key "foo") "\"foo\""))
--
2.10.1
^ permalink raw reply related [flat|nested] 8+ messages in thread
* bug#24784: 26.0.50; JSON strings with utf-16 escape codes
2016-10-24 19:57 ` Philipp Stephani
@ 2016-10-24 23:19 ` Dmitry Gutov
2016-10-26 16:39 ` Helmut Eller
2016-12-31 16:53 ` Philipp Stephani
0 siblings, 2 replies; 8+ messages in thread
From: Dmitry Gutov @ 2016-10-24 23:19 UTC (permalink / raw)
To: Philipp Stephani, Helmut Eller, 24784
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=windows-1252; format=flowed, Size: 1270 bytes --]
Philipp,
Thanks. Some comments:
On 24.10.2016 22:57, Philipp Stephani wrote:
> +(defsubst json--decode-utf-16-surrogates (high low)
IIRC, there might be no actual benefit from making it a defsubst. If
someone could benchmark it, I'd like to see the result.
> + ;; Special-case UTF-16 surrogate pairs,
> + ;; cf. https://tools.ietf.org/html/rfc7159#section-7
> + ((looking-at
> + (rx (group (any "Dd") (any "89ABab") (= 2 (any "0-9A-Fa-f")))
> + "\\u" (group (any "Dd") (any "C-Fc-f") (= 2 (any "0-9A-Fa-f")))))
> + (json-advance 10)
> + (json--decode-utf-16-surrogates
> + (string-to-number (match-string 1) 16)
> + (string-to-number (match-string 2) 16)))
Shouldn't this go below the UTF-8 case, as the less-frequent one?
> (ert-deftest test-json-encode-string ()
> (should (equal (json-encode-string "foo") "\"foo\""))
> (should (equal (json-encode-string "a\n\fb") "\"a\\n\\fb\""))
> - (should (equal (json-encode-string "\nasdфыв\u001f\u007ffgh\t")
> - "\"\\nasdфыв\\u001f\u007ffgh\\t\"")))
> + (should (equal (json-encode-string "\nasdфыв𝄞\u001f\u007ffgh\t")
> + "\"\\nasdфыв𝄞\\u001f\u007ffgh\\t\"")))
Why are we testing string encoding here?
^ permalink raw reply [flat|nested] 8+ messages in thread
* bug#24784: 26.0.50; JSON strings with utf-16 escape codes
2016-10-24 23:19 ` Dmitry Gutov
@ 2016-10-26 16:39 ` Helmut Eller
2016-10-26 17:37 ` Drew Adams
2016-12-31 16:53 ` Philipp Stephani
1 sibling, 1 reply; 8+ messages in thread
From: Helmut Eller @ 2016-10-26 16:39 UTC (permalink / raw)
To: Dmitry Gutov; +Cc: Philipp Stephani, 24784
On Tue, Oct 25 2016, Dmitry Gutov wrote:
> On 24.10.2016 22:57, Philipp Stephani wrote:
>
>> +(defsubst json--decode-utf-16-surrogates (high low)
>
> IIRC, there might be no actual benefit from making it a defsubst. If
> someone could benchmark it, I'd like to see the result.
I guess it doesn't hurt but I also doubt that it makes a measurable
difference as utf-16 surrogates are rarely needed.
>
>> + ;; Special-case UTF-16 surrogate pairs,
>> + ;; cf. https://tools.ietf.org/html/rfc7159#section-7
>> + ((looking-at
>> + (rx (group (any "Dd") (any "89ABab") (= 2 (any "0-9A-Fa-f")))
>> + "\\u" (group (any "Dd") (any "C-Fc-f") (= 2 (any "0-9A-Fa-f")))))
>> + (json-advance 10)
>> + (json--decode-utf-16-surrogates
>> + (string-to-number (match-string 1) 16)
>> + (string-to-number (match-string 2) 16)))
>
> Shouldn't this go below the UTF-8 case, as the less-frequent one?
There's also an opportunity to detect unpaired surrogates, e.g.:
(defun json-read-escaped-char ()
"Read the JSON string escaped character at point."
;; Skip over the '\'
(json-advance)
(let* ((char (json-pop))
(special (assq char json-special-chars)))
(cond
(special (cdr special))
((not (eq char ?u)) char)
((looking-at "[0-9A-Fa-f]\\{4\\}")
(let* ((code (string-to-number (match-string 0) 16)))
(json-advance 4)
(cond ((<= #xD800 code #xDBFF) ; UTF-16 high surrogate
(cond ((looking-at "\\\\u\\([Dd][C-Fc-f][0-9A-Fa-f]\\{2\\}\\)")
(let ((low (string-to-number (match-string 1) 16)))
(json-advance 6)
(json--decode-utf-16-surrogates code low)))
(t
;; Expected low surrogate missing
(signal 'json-string-escape (list (point))))))
((<= #xDC00 code #xDFFF)
;; Unexpected low surrogate
(signal 'json-string-escape (list (point))))
(t
code))))
(t
(signal 'json-string-escape (list (point)))))))
^ permalink raw reply [flat|nested] 8+ messages in thread
* bug#24784: 26.0.50; JSON strings with utf-16 escape codes
2016-10-26 16:39 ` Helmut Eller
@ 2016-10-26 17:37 ` Drew Adams
0 siblings, 0 replies; 8+ messages in thread
From: Drew Adams @ 2016-10-26 17:37 UTC (permalink / raw)
To: Helmut Eller, Dmitry Gutov; +Cc: Philipp Stephani, 24784
> >> +(defsubst json--decode-utf-16-surrogates (high low)
> >
> > IIRC, there might be no actual benefit from making it a defsubst.
> > If someone could benchmark it, I'd like to see the result.
>
> I guess it doesn't hurt but I also doubt that it makes a measurable
> difference as utf-16 surrogates are rarely needed.
IMO, we should never, ever use defsubst nowadays. Unless you can
come up with a REALLY good rationale.
Every time defsubst is used it throws a monkey wrench in the ability
of users to extend and modify Emacs behavior. And nothing is really
gained.
Just one opinion.
^ permalink raw reply [flat|nested] 8+ messages in thread
* bug#24784: 26.0.50; JSON strings with utf-16 escape codes
2016-10-24 23:19 ` Dmitry Gutov
2016-10-26 16:39 ` Helmut Eller
@ 2016-12-31 16:53 ` Philipp Stephani
2017-01-01 0:45 ` Dmitry Gutov
1 sibling, 1 reply; 8+ messages in thread
From: Philipp Stephani @ 2016-12-31 16:53 UTC (permalink / raw)
To: Dmitry Gutov, Helmut Eller, 24784
[-- Attachment #1: Type: text/plain, Size: 1806 bytes --]
Dmitry Gutov <dgutov@yandex.ru> schrieb am Di., 25. Okt. 2016 um 01:19 Uhr:
> Philipp,
>
> Thanks. Some comments:
>
> On 24.10.2016 22:57, Philipp Stephani wrote:
>
> > +(defsubst json--decode-utf-16-surrogates (high low)
>
> IIRC, there might be no actual benefit from making it a defsubst. If
> someone could benchmark it, I'd like to see the result.
>
Agreed; converted to defun. I've only used defsubst because some other
helper functions also used defsubst.
>
> > + ;; Special-case UTF-16 surrogate pairs,
> > + ;; cf. https://tools.ietf.org/html/rfc7159#section-7
> > + ((looking-at
> > + (rx (group (any "Dd") (any "89ABab") (= 2 (any "0-9A-Fa-f")))
> > + "\\u" (group (any "Dd") (any "C-Fc-f") (= 2 (any
> "0-9A-Fa-f")))))
> > + (json-advance 10)
> > + (json--decode-utf-16-surrogates
> > + (string-to-number (match-string 1) 16)
> > + (string-to-number (match-string 2) 16)))
>
> Shouldn't this go below the UTF-8 case, as the less-frequent one?
>
No, the below case is more general and therefore has to come last.
>
> > (ert-deftest test-json-encode-string ()
> > (should (equal (json-encode-string "foo") "\"foo\""))
> > (should (equal (json-encode-string "a\n\fb") "\"a\\n\\fb\""))
> > - (should (equal (json-encode-string "\nasdфыв\u001f\u007ffgh\t")
> > - "\"\\nasdфыв\\u001f\u007ffgh\\t\"")))
> > + (should (equal (json-encode-string "\nasdфыв𠄞\u001f\u007ffgh\t")
> > + "\"\\nasdфыв𠄞\\u001f\u007ffgh\\t\"")))
>
> Why are we testing string encoding here?
>
It's not 100% related to the patch, but I think it can be included for
symmetry reasons (testing encoding as well as decoding).
[-- Attachment #2: Type: text/html, Size: 3423 bytes --]
^ permalink raw reply [flat|nested] 8+ messages in thread
* bug#24784: 26.0.50; JSON strings with utf-16 escape codes
2016-12-31 16:53 ` Philipp Stephani
@ 2017-01-01 0:45 ` Dmitry Gutov
2017-01-01 12:26 ` Philipp Stephani
0 siblings, 1 reply; 8+ messages in thread
From: Dmitry Gutov @ 2017-01-01 0:45 UTC (permalink / raw)
To: Philipp Stephani, Helmut Eller, 24784
On 31.12.2016 19:53, Philipp Stephani wrote:
> Agreed; converted to defun. I've only used defsubst because some other
> helper functions also used defsubst.
Thanks. Those others can probably be changed as well.
> No, the below case is more general and therefore has to come last.
Makes sense.
> It's not 100% related to the patch, but I think it can be included for
> symmetry reasons (testing encoding as well as decoding).
Of course. These are testing utf-8 encoding, though, right? It would be
better if you split them to a separate commit, I think.
^ permalink raw reply [flat|nested] 8+ messages in thread
* bug#24784: 26.0.50; JSON strings with utf-16 escape codes
2017-01-01 0:45 ` Dmitry Gutov
@ 2017-01-01 12:26 ` Philipp Stephani
0 siblings, 0 replies; 8+ messages in thread
From: Philipp Stephani @ 2017-01-01 12:26 UTC (permalink / raw)
To: Dmitry Gutov, Helmut Eller, 24784-done
[-- Attachment #1: Type: text/plain, Size: 771 bytes --]
Dmitry Gutov <dgutov@yandex.ru> schrieb am So., 1. Jan. 2017 um 01:45 Uhr:
> On 31.12.2016 19:53, Philipp Stephani wrote:
>
> > Agreed; converted to defun. I've only used defsubst because some other
> > helper functions also used defsubst.
>
> Thanks. Those others can probably be changed as well.
>
Yes, but rather not in this commit.
>
> > No, the below case is more general and therefore has to come last.
>
> Makes sense.
>
> > It's not 100% related to the patch, but I think it can be included for
> > symmetry reasons (testing encoding as well as decoding).
>
> Of course. These are testing utf-8 encoding, though, right? It would be
> better if you split them to a separate commit, I think.
>
OK, I've removed it from this patch and pushed it as 93be35e038.
[-- Attachment #2: Type: text/html, Size: 1677 bytes --]
^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2017-01-01 12:26 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-10-24 18:06 bug#24784: 26.0.50; JSON strings with utf-16 escape codes Helmut Eller
2016-10-24 19:57 ` Philipp Stephani
2016-10-24 23:19 ` Dmitry Gutov
2016-10-26 16:39 ` Helmut Eller
2016-10-26 17:37 ` Drew Adams
2016-12-31 16:53 ` Philipp Stephani
2017-01-01 0:45 ` Dmitry Gutov
2017-01-01 12:26 ` Philipp Stephani
Code repositories for project(s) associated with this public inbox
https://git.savannah.gnu.org/cgit/emacs.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).