bug#24784: 26.0.50; JSON strings with utf-16 escape codes

unofficial mirror of bug-gnu-emacs@gnu.org 
 help / color / mirror / code / Atom feed

* bug#24784: 26.0.50; JSON strings with utf-16 escape codes
@ 2016-10-24 18:06 Helmut Eller
  2016-10-24 19:57 ` Philipp Stephani
  0 siblings, 1 reply; 8+ messages in thread
From: Helmut Eller @ 2016-10-24 18:06 UTC (permalink / raw)
  To: 24784


json-read-from-string doesn't parse strings correctly if the the \u
syntax is used to write UTF-16 surrogates:

 (equal (json-read-from-string "\"\\uD834\\uDD1E\"") "\"\U0001D11E\"")
 => nil

The correct result t.  To quote RFC 7159[*]:

   To escape an extended character that is not in the Basic Multilingual
   Plane, the character is represented as a 12-character sequence,
   encoding the UTF-16 surrogate pair.  So, for example, a string
   containing only the G clef character (U+1D11E) may be represented as
   "\uD834\uDD1E".

[*] https://tools.ietf.org/html/rfc7159#section-7


In GNU Emacs 26.0.50.2 (x86_64-unknown-linux-gnu, GTK+ Version 3.14.5)
 of 2016-10-24 built on caladan
Repository revision: 26ccd19269c040ad5960a7567aa5fc88f142c709
Windowing system distributor 'The X.Org Foundation', version 11.0.11604000
System Description:	Debian GNU/Linux 8.5 (jessie)

Configured using:
 'configure --with-xpm=no --with-jpeg=no --with-gif=no --with-tiff=no'

Configured features:
PNG SOUND DBUS GSETTINGS NOTIFY GNUTLS LIBXML2 FREETYPE XFT ZLIB
TOOLKIT_SCROLL_BARS GTK3 X11

Important settings:
  value of $LANG: C.UTF-8
  locale-coding-system: utf-8-unix





^ permalink raw reply	[flat|nested] 8+ messages in thread

* bug#24784: 26.0.50; JSON strings with utf-16 escape codes
  2016-10-24 18:06 bug#24784: 26.0.50; JSON strings with utf-16 escape codes Helmut Eller
@ 2016-10-24 19:57 ` Philipp Stephani
  2016-10-24 23:19   ` Dmitry Gutov
  0 siblings, 1 reply; 8+ messages in thread
From: Philipp Stephani @ 2016-10-24 19:57 UTC (permalink / raw)
  To: Helmut Eller, 24784


[-- Attachment #1.1: Type: text/plain, Size: 746 bytes --]

Helmut Eller <eller.helmut@gmail.com> schrieb am Mo., 24. Okt. 2016 um
20:58 Uhr:

>
> json-read-from-string doesn't parse strings correctly if the the \u
> syntax is used to write UTF-16 surrogates:
>
>  (equal (json-read-from-string "\"\\uD834\\uDD1E\"") "\"\U0001D11E\"")
>  => nil
>
> The correct result t.  To quote RFC 7159[*]:
>
>    To escape an extended character that is not in the Basic Multilingual
>    Plane, the character is represented as a 12-character sequence,
>    encoding the UTF-16 surrogate pair.  So, for example, a string
>    containing only the G clef character (U+1D11E) may be represented as
>    "\uD834\uDD1E".
>
> [*] https://tools.ietf.org/html/rfc7159#section-7
>
> Thanks for reporting, I've attached a patch.

[-- Attachment #1.2: Type: text/html, Size: 1556 bytes --]

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-Fix-encoding-of-JSON-surrogate-pairs.txt --]
[-- Type: text/plain; charset=US-ASCII;  name="0001-Fix-encoding-of-JSON-surrogate-pairs.txt", Size: 3040 bytes --]

From 6c630bd5b001243d6b7115380088909a7a180ddb Mon Sep 17 00:00:00 2001
From: Philipp Stephani <phst@google.com>
Date: Mon, 24 Oct 2016 21:54:51 +0200
Subject: [PATCH] Fix encoding of JSON surrogate pairs

JSON requires that such pairs be treated as UTF-16 surrogate pairs, not
individual code points; cf. Bug #24784.

* lisp/json.el (json-read-escaped-char): Fix decoding of surrogate
pairs.
(json--decode-utf-16-surrogates): New defsubst.

* test/lisp/json-tests.el (test-json-read-string): Add test for
surrogate pairs.
(test-json-encode-string): Add test for non-BMP character encoding.
---
 lisp/json.el            | 13 +++++++++++++
 test/lisp/json-tests.el |  7 +++++--
 2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/lisp/json.el b/lisp/json.el
index fdac8d9..5bfdfd4 100644
--- a/lisp/json.el
+++ b/lisp/json.el
@@ -363,6 +363,10 @@ json-special-chars
 
 ;; String parsing
 
+(defsubst json--decode-utf-16-surrogates (high low)
+  "Return the code point represented by the UTF-16 surrogates HIGH and LOW."
+  (+ (lsh (- high #xD800) 10) (- low #xDC00) #x10000))
+
 (defun json-read-escaped-char ()
   "Read the JSON string escaped character at point."
   ;; Skip over the '\'
@@ -372,6 +376,15 @@ json-read-escaped-char
     (cond
      (special (cdr special))
      ((not (eq char ?u)) char)
+     ;; Special-case UTF-16 surrogate pairs,
+     ;; cf. https://tools.ietf.org/html/rfc7159#section-7
+     ((looking-at
+       (rx (group (any "Dd") (any "89ABab") (= 2 (any "0-9A-Fa-f")))
+           "\\u" (group (any "Dd") (any "C-Fc-f") (= 2 (any "0-9A-Fa-f")))))
+      (json-advance 10)
+      (json--decode-utf-16-surrogates
+       (string-to-number (match-string 1) 16)
+       (string-to-number (match-string 2) 16)))
      ((looking-at "[0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f]")
       (let ((hex (match-string 0)))
         (json-advance 4)
diff --git a/test/lisp/json-tests.el b/test/lisp/json-tests.el
index 78cebb4..8958000 100644
--- a/test/lisp/json-tests.el
+++ b/test/lisp/json-tests.el
@@ -167,14 +167,17 @@ json-tests--with-temp-buffer
     (should (equal (json-read-string) "abcαβγ")))
   (json-tests--with-temp-buffer "\"\\nasd\\u0444\\u044b\\u0432fgh\\t\""
     (should (equal (json-read-string) "\nasdфывfgh\t")))
+  ;; Bug#24784
+  (json-tests--with-temp-buffer "\"\\uD834\\uDD1E\""
+    (should (equal (json-read-string) "\U0001D11E")))
   (json-tests--with-temp-buffer "foo"
     (should-error (json-read-string) :type 'json-string-format)))
 
 (ert-deftest test-json-encode-string ()
   (should (equal (json-encode-string "foo") "\"foo\""))
   (should (equal (json-encode-string "a\n\fb") "\"a\\n\\fb\""))
-  (should (equal (json-encode-string "\nasdфыв\u001f\u007ffgh\t")
-                 "\"\\nasdфыв\\u001f\u007ffgh\\t\"")))
+  (should (equal (json-encode-string "\nasdфыв𝄞\u001f\u007ffgh\t")
+                 "\"\\nasdфыв𝄞\\u001f\u007ffgh\\t\"")))
 
 (ert-deftest test-json-encode-key ()
   (should (equal (json-encode-key "foo") "\"foo\""))
-- 
2.10.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* bug#24784: 26.0.50; JSON strings with utf-16 escape codes
  2016-10-24 19:57 ` Philipp Stephani
@ 2016-10-24 23:19   ` Dmitry Gutov
  2016-10-26 16:39     ` Helmut Eller
  2016-12-31 16:53     ` Philipp Stephani
  0 siblings, 2 replies; 8+ messages in thread
From: Dmitry Gutov @ 2016-10-24 23:19 UTC (permalink / raw)
  To: Philipp Stephani, Helmut Eller, 24784

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=windows-1252; format=flowed, Size: 1270 bytes --]

Philipp,

Thanks. Some comments:

On 24.10.2016 22:57, Philipp Stephani wrote:

> +(defsubst json--decode-utf-16-surrogates (high low)

IIRC, there might be no actual benefit from making it a defsubst. If 
someone could benchmark it, I'd like to see the result.

> +     ;; Special-case UTF-16 surrogate pairs,
> +     ;; cf. https://tools.ietf.org/html/rfc7159#section-7
> +     ((looking-at
> +       (rx (group (any "Dd") (any "89ABab") (= 2 (any "0-9A-Fa-f")))
> +           "\\u" (group (any "Dd") (any "C-Fc-f") (= 2 (any "0-9A-Fa-f")))))
> +      (json-advance 10)
> +      (json--decode-utf-16-surrogates
> +       (string-to-number (match-string 1) 16)
> +       (string-to-number (match-string 2) 16)))

Shouldn't this go below the UTF-8 case, as the less-frequent one?

>  (ert-deftest test-json-encode-string ()
>    (should (equal (json-encode-string "foo") "\"foo\""))
>    (should (equal (json-encode-string "a\n\fb") "\"a\\n\\fb\""))
> -  (should (equal (json-encode-string "\nasdфыв\u001f\u007ffgh\t")
> -                 "\"\\nasdфыв\\u001f\u007ffgh\\t\"")))
> +  (should (equal (json-encode-string "\nasdфыв𝄞\u001f\u007ffgh\t")
> +                 "\"\\nasdфыв𝄞\\u001f\u007ffgh\\t\"")))

Why are we testing string encoding here?





^ permalink raw reply	[flat|nested] 8+ messages in thread

* bug#24784: 26.0.50; JSON strings with utf-16 escape codes
  2016-10-24 23:19   ` Dmitry Gutov
@ 2016-10-26 16:39     ` Helmut Eller
  2016-10-26 17:37       ` Drew Adams
  2016-12-31 16:53     ` Philipp Stephani
  1 sibling, 1 reply; 8+ messages in thread
From: Helmut Eller @ 2016-10-26 16:39 UTC (permalink / raw)
  To: Dmitry Gutov; +Cc: Philipp Stephani, 24784

On Tue, Oct 25 2016, Dmitry Gutov wrote:

> On 24.10.2016 22:57, Philipp Stephani wrote:
>
>> +(defsubst json--decode-utf-16-surrogates (high low)
>
> IIRC, there might be no actual benefit from making it a defsubst. If
> someone could benchmark it, I'd like to see the result.

I guess it doesn't hurt but I also doubt that it makes a measurable
difference as utf-16 surrogates are rarely needed.

>
>> +     ;; Special-case UTF-16 surrogate pairs,
>> +     ;; cf. https://tools.ietf.org/html/rfc7159#section-7
>> +     ((looking-at
>> +       (rx (group (any "Dd") (any "89ABab") (= 2 (any "0-9A-Fa-f")))
>> +           "\\u" (group (any "Dd") (any "C-Fc-f") (= 2 (any "0-9A-Fa-f")))))
>> +      (json-advance 10)
>> +      (json--decode-utf-16-surrogates
>> +       (string-to-number (match-string 1) 16)
>> +       (string-to-number (match-string 2) 16)))
>
> Shouldn't this go below the UTF-8 case, as the less-frequent one?

There's also an opportunity to detect unpaired surrogates, e.g.:

(defun json-read-escaped-char ()
  "Read the JSON string escaped character at point."
  ;; Skip over the '\'
  (json-advance)
  (let* ((char (json-pop))
         (special (assq char json-special-chars)))
    (cond
     (special (cdr special))
     ((not (eq char ?u)) char)
     ((looking-at "[0-9A-Fa-f]\\{4\\}")
      (let* ((code (string-to-number (match-string 0) 16)))
        (json-advance 4)
        (cond ((<= #xD800 code #xDBFF) ; UTF-16 high surrogate
               (cond ((looking-at "\\\\u\\([Dd][C-Fc-f][0-9A-Fa-f]\\{2\\}\\)")
                      (let ((low (string-to-number (match-string 1) 16)))
                        (json-advance 6)
                        (json--decode-utf-16-surrogates code low)))
                     (t
                      ;; Expected low surrogate missing
                      (signal 'json-string-escape (list (point))))))
              ((<= #xDC00 code #xDFFF)
               ;; Unexpected low surrogate
               (signal 'json-string-escape (list (point))))
              (t
               code))))
     (t
      (signal 'json-string-escape (list (point)))))))





^ permalink raw reply	[flat|nested] 8+ messages in thread

* bug#24784: 26.0.50; JSON strings with utf-16 escape codes
  2016-10-26 16:39     ` Helmut Eller
@ 2016-10-26 17:37       ` Drew Adams
  0 siblings, 0 replies; 8+ messages in thread
From: Drew Adams @ 2016-10-26 17:37 UTC (permalink / raw)
  To: Helmut Eller, Dmitry Gutov; +Cc: Philipp Stephani, 24784

> >> +(defsubst json--decode-utf-16-surrogates (high low)
> >
> > IIRC, there might be no actual benefit from making it a defsubst.
> > If someone could benchmark it, I'd like to see the result.
> 
> I guess it doesn't hurt but I also doubt that it makes a measurable
> difference as utf-16 surrogates are rarely needed.

IMO, we should never, ever use defsubst nowadays.  Unless you can
come up with a REALLY good rationale.

Every time defsubst is used it throws a monkey wrench in the ability
of users to extend and modify Emacs behavior.  And nothing is really
gained.

Just one opinion.





^ permalink raw reply	[flat|nested] 8+ messages in thread

* bug#24784: 26.0.50; JSON strings with utf-16 escape codes
  2016-10-24 23:19   ` Dmitry Gutov
  2016-10-26 16:39     ` Helmut Eller
@ 2016-12-31 16:53     ` Philipp Stephani
  2017-01-01  0:45       ` Dmitry Gutov
  1 sibling, 1 reply; 8+ messages in thread
From: Philipp Stephani @ 2016-12-31 16:53 UTC (permalink / raw)
  To: Dmitry Gutov, Helmut Eller, 24784

[-- Attachment #1: Type: text/plain, Size: 1806 bytes --]

Dmitry Gutov <dgutov@yandex.ru> schrieb am Di., 25. Okt. 2016 um 01:19 Uhr:

> Philipp,
>
> Thanks. Some comments:
>
> On 24.10.2016 22:57, Philipp Stephani wrote:
>
> > +(defsubst json--decode-utf-16-surrogates (high low)
>
> IIRC, there might be no actual benefit from making it a defsubst. If
> someone could benchmark it, I'd like to see the result.
>

Agreed; converted to defun. I've only used defsubst because some other
helper functions also used defsubst.


>
> > +     ;; Special-case UTF-16 surrogate pairs,
> > +     ;; cf. https://tools.ietf.org/html/rfc7159#section-7
> > +     ((looking-at
> > +       (rx (group (any "Dd") (any "89ABab") (= 2 (any "0-9A-Fa-f")))
> > +           "\\u" (group (any "Dd") (any "C-Fc-f") (= 2 (any
> "0-9A-Fa-f")))))
> > +      (json-advance 10)
> > +      (json--decode-utf-16-surrogates
> > +       (string-to-number (match-string 1) 16)
> > +       (string-to-number (match-string 2) 16)))
>
> Shouldn't this go below the UTF-8 case, as the less-frequent one?
>

No, the below case is more general and therefore has to come last.


>
> >  (ert-deftest test-json-encode-string ()
> >    (should (equal (json-encode-string "foo") "\"foo\""))
> >    (should (equal (json-encode-string "a\n\fb") "\"a\\n\\fb\""))
> > -  (should (equal (json-encode-string "\nasdÑ„Ñ‹Ð²\u001f\u007ffgh\t")
> > -                 "\"\\nasdÑ„Ñ‹Ð²\\u001f\u007ffgh\\t\"")))
> > +  (should (equal (json-encode-string "\nasdÑ„Ñ‹Ð²ð „ž\u001f\u007ffgh\t")
> > +                 "\"\\nasdÑ„Ñ‹Ð²ð „ž\\u001f\u007ffgh\\t\"")))
>
> Why are we testing string encoding here?
>

It's not 100% related to the patch, but I think it can be included for
symmetry reasons (testing encoding as well as decoding).

[-- Attachment #2: Type: text/html, Size: 3423 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* bug#24784: 26.0.50; JSON strings with utf-16 escape codes
  2016-12-31 16:53     ` Philipp Stephani
@ 2017-01-01  0:45       ` Dmitry Gutov
  2017-01-01 12:26         ` Philipp Stephani
  0 siblings, 1 reply; 8+ messages in thread
From: Dmitry Gutov @ 2017-01-01  0:45 UTC (permalink / raw)
  To: Philipp Stephani, Helmut Eller, 24784

On 31.12.2016 19:53, Philipp Stephani wrote:

> Agreed; converted to defun. I've only used defsubst because some other
> helper functions also used defsubst.

Thanks. Those others can probably be changed as well.

> No, the below case is more general and therefore has to come last.

Makes sense.

> It's not 100% related to the patch, but I think it can be included for
> symmetry reasons (testing encoding as well as decoding).

Of course. These are testing utf-8 encoding, though, right? It would be 
better if you split them to a separate commit, I think.





^ permalink raw reply	[flat|nested] 8+ messages in thread

* bug#24784: 26.0.50; JSON strings with utf-16 escape codes
  2017-01-01  0:45       ` Dmitry Gutov
@ 2017-01-01 12:26         ` Philipp Stephani
  0 siblings, 0 replies; 8+ messages in thread
From: Philipp Stephani @ 2017-01-01 12:26 UTC (permalink / raw)
  To: Dmitry Gutov, Helmut Eller, 24784-done

[-- Attachment #1: Type: text/plain, Size: 771 bytes --]

Dmitry Gutov <dgutov@yandex.ru> schrieb am So., 1. Jan. 2017 um 01:45 Uhr:

> On 31.12.2016 19:53, Philipp Stephani wrote:
>
> > Agreed; converted to defun. I've only used defsubst because some other
> > helper functions also used defsubst.
>
> Thanks. Those others can probably be changed as well.
>

Yes, but rather not in this commit.


>
> > No, the below case is more general and therefore has to come last.
>
> Makes sense.
>

> > It's not 100% related to the patch, but I think it can be included for
> > symmetry reasons (testing encoding as well as decoding).
>
> Of course. These are testing utf-8 encoding, though, right? It would be
> better if you split them to a separate commit, I think.
>

OK, I've removed it from this patch and pushed it as 93be35e038.

[-- Attachment #2: Type: text/html, Size: 1677 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2017-01-01 12:26 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-10-24 18:06 bug#24784: 26.0.50; JSON strings with utf-16 escape codes Helmut Eller
2016-10-24 19:57 ` Philipp Stephani
2016-10-24 23:19   ` Dmitry Gutov
2016-10-26 16:39     ` Helmut Eller
2016-10-26 17:37       ` Drew Adams
2016-12-31 16:53     ` Philipp Stephani
2017-01-01  0:45       ` Dmitry Gutov
2017-01-01 12:26         ` Philipp Stephani

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).