* bug#50391: 28.0.50; json-read non-ascii data results in malformed string
@ 2021-09-05 4:19 Zhiwei Chen
2021-09-05 7:31 ` Philipp
2021-09-05 8:08 ` Lars Ingebrigtsen
0 siblings, 2 replies; 3+ messages in thread
From: Zhiwei Chen @ 2021-09-05 4:19 UTC (permalink / raw)
To: 50391
When fetching JSON from Youdao (a dictionary service in China):
#+begin_src elisp
(url-retrieve
 "https://dict.youdao.com/suggest?q=accumulate&le=eng&num=80&doctype=json"
 (lambda (_status)
   (goto-char (1+ url-http-end-of-headers))
   (write-region (point) (point-max) "/tmp/acc1.json")))
#+end_src
Then C-x C-f "/tmp/acc1.json": the file is correctly encoded.
But if I `json-read' and then `json-insert', the file is malformed, even
though uchardet reports the file's encoding as utf-8.
#+begin_src elisp
(url-retrieve
 "https://dict.youdao.com/suggest?q=accumulate&le=eng&num=80&doctype=json"
 (lambda (_status)
   (goto-char (1+ url-http-end-of-headers))
   (let ((j (json-read)))
     (with-temp-buffer
       (json-insert j)
       (write-region (point-min) (point-max) "/tmp/acc2.json")))))
#+end_src
#+begin_src shell
diff -u <(hexdump -C /tmp/acc1.json | head -n10) <(hexdump -C /tmp/acc2.json | head -n10) | diff-so-fancy
#+end_src
Screenshot: https://pb.nichi.co/jazz-estate-brave
The diff shows that the first word "累积" is encoded incorrectly in
"/tmp/acc2.json". (It uses `c3 a7 c2 b4 c2 af'.)
Actually,
#+begin_src shell
echo -n "累积" | hexdump -C
#+end_src
should be `e7 b4 af e7 a7 af' in utf-8, where "累" is represented by
`e7 b4 af' and "积" by `e7 a7 af'.
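Those six corrupted bytes look like classic double encoding: if the UTF-8 bytes of "累积" were mistakenly interpreted as Latin-1 text and then re-encoded as UTF-8, each byte would expand into a two-byte sequence. A quick shell check (assuming a UTF-8 locale and iconv available) reproduces exactly the bytes seen in "/tmp/acc2.json":

#+begin_src shell
# Take the UTF-8 bytes of "累积" (e7 b4 af e7 a7 af), pretend they are
# Latin-1, and re-encode as UTF-8: e7 -> c3 a7, b4 -> c2 b4, af -> c2 af, ...
echo -n "累积" | iconv -f latin1 -t utf-8 | hexdump -C
# First six bytes: c3 a7 c2 b4 c2 af -- matching the corrupted output.
#+end_src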
The environment variable LANG is `en_US.UTF-8'; everything was tested in `emacs -Q'.
--
Zhiwei Chen
* bug#50391: 28.0.50; json-read non-ascii data results in malformed string
2021-09-05 4:19 bug#50391: 28.0.50; json-read non-ascii data results in malformed string Zhiwei Chen
@ 2021-09-05 7:31 ` Philipp
2021-09-05 8:08 ` Lars Ingebrigtsen
1 sibling, 0 replies; 3+ messages in thread
From: Philipp @ 2021-09-05 7:31 UTC (permalink / raw)
To: Zhiwei Chen; +Cc: 50391
> Am 05.09.2021 um 06:19 schrieb Zhiwei Chen <condy0919@gmail.com>:
>
>
> When fetching JSON from Youdao (a dictionary service in China):
>
> #+begin_src elisp
> (url-retrieve
>  "https://dict.youdao.com/suggest?q=accumulate&le=eng&num=80&doctype=json"
>  (lambda (_status)
>    (goto-char (1+ url-http-end-of-headers))
>    (write-region (point) (point-max) "/tmp/acc1.json")))
> #+end_src
>
> Then C-x C-f "/tmp/acc1.json": the file is correctly encoded.
>
> But if I `json-read' and then `json-insert', the file is malformed, even
> though uchardet reports the file's encoding as utf-8.
>
> #+begin_src elisp
> (url-retrieve
>  "https://dict.youdao.com/suggest?q=accumulate&le=eng&num=80&doctype=json"
>  (lambda (_status)
>    (goto-char (1+ url-http-end-of-headers))
>    (let ((j (json-read)))
>      (with-temp-buffer
>        (json-insert j)
>        (write-region (point-min) (point-max) "/tmp/acc2.json")))))
> #+end_src
Does it work if you use the C JSON function (json-parse-buffer) for parsing? At least for me the two files are then identical.
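For reference, a minimal variant of the original recipe along the lines Philipp suggests might look like this (an untested sketch; `json-parse-buffer' is the native C parser available since Emacs 27, and it returns hash tables by default, which `json-insert' also accepts):

#+begin_src elisp
;; Sketch: same callback as the bug report, but parsing with the
;; native `json-parse-buffer' instead of json.el's `json-read'.
(url-retrieve
 "https://dict.youdao.com/suggest?q=accumulate&le=eng&num=80&doctype=json"
 (lambda (_status)
   (goto-char (1+ url-http-end-of-headers))
   (let ((j (json-parse-buffer)))
     (with-temp-buffer
       (json-insert j)
       (write-region (point-min) (point-max) "/tmp/acc2.json")))))
#+end_src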
* bug#50391: 28.0.50; json-read non-ascii data results in malformed string
2021-09-05 4:19 bug#50391: 28.0.50; json-read non-ascii data results in malformed string Zhiwei Chen
2021-09-05 7:31 ` Philipp
@ 2021-09-05 8:08 ` Lars Ingebrigtsen
1 sibling, 0 replies; 3+ messages in thread
From: Lars Ingebrigtsen @ 2021-09-05 8:08 UTC (permalink / raw)
To: Zhiwei Chen; +Cc: 50391
[-- Attachment #1: Type: text/plain, Size: 1258 bytes --]
Zhiwei Chen <condy0919@gmail.com> writes:
> When fetching JSON from Youdao (a dictionary service in China):
>
> #+begin_src elisp
> (url-retrieve
>  "https://dict.youdao.com/suggest?q=accumulate&le=eng&num=80&doctype=json"
>  (lambda (_status)
>    (goto-char (1+ url-http-end-of-headers))
>    (write-region (point) (point-max) "/tmp/acc1.json")))
> #+end_src
>
> Then C-x C-f "/tmp/acc1.json": the file is correctly encoded.
>
> But if I `json-read' and then `json-insert', the file is malformed, even
> though uchardet reports the file's encoding as utf-8.
When you do the `write-region', Emacs writes the octets you received
from the web server to a file. When Emacs loads that file in again, it
guesses that it's utf-8 and decodes it that way, so that's why that
works correctly.
> #+begin_src elisp
> (url-retrieve
>  "https://dict.youdao.com/suggest?q=accumulate&le=eng&num=80&doctype=json"
>  (lambda (_status)
>    (goto-char (1+ url-http-end-of-headers))
>    (let ((j (json-read)))
>      (with-temp-buffer
>        (json-insert j)
>        (write-region (point-min) (point-max) "/tmp/acc2.json")))))
> #+end_src
But here you're asking Emacs to use `json-read' on a buffer that hasn't
been decoded. The http buffer at this point looks like this:
[-- Attachment #2: Type: image/png, Size: 64328 bytes --]
[-- Attachment #3: Type: text/plain, Size: 654 bytes --]
You have to say (decode-coding-region (point) (point-max) 'utf-8) first
for that to work. I.e.,
(url-retrieve
 "https://dict.youdao.com/suggest?q=accumulate&le=eng&num=80&doctype=json"
 (lambda (_status)
   (goto-char (1+ url-http-end-of-headers))
   (let ((buf (current-buffer))
         (end (1+ url-http-end-of-headers)))
     (with-temp-buffer
       (insert-buffer-substring buf end)
       (goto-char (point-min))
       (let ((j (json-read)))
         (erase-buffer)
         (json-insert j)
         (write-region (point-min) (point-max) "/tmp/acc2.json"))))))
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no