* bug#50391: 28.0.50; json-read non-ascii data results in malformed string
@ 2021-09-05 4:19 Zhiwei Chen
2021-09-05 7:31 ` Philipp
2021-09-05 8:08 ` Lars Ingebrigtsen
0 siblings, 2 replies; 3+ messages in thread
From: Zhiwei Chen @ 2021-09-05 4:19 UTC (permalink / raw)
To: 50391
When fetching JSON from Youdao (a dictionary service in China):
#+begin_src elisp
(url-retrieve
 "https://dict.youdao.com/suggest?q=accumulate&le=eng&num=80&doctype=json"
 (lambda (_status)
   (goto-char (1+ url-http-end-of-headers))
   (write-region (point) (point-max) "/tmp/acc1.json")))
#+end_src
Then C-x C-f "/tmp/acc1.json": the file is correctly encoded.
But if I `json-read' and then `json-insert', the file is malformed, even
though uchardet reports the file's encoding as utf-8.
#+begin_src elisp
(url-retrieve
 "https://dict.youdao.com/suggest?q=accumulate&le=eng&num=80&doctype=json"
 (lambda (_status)
   (goto-char (1+ url-http-end-of-headers))
   (let ((j (json-read)))
     (with-temp-buffer
       (json-insert j)
       (write-region (point-min) (point-max) "/tmp/acc2.json")))))
#+end_src
#+begin_src shell
diff -u <(hexdump -C /tmp/acc1.json | head -n10) <(hexdump -C /tmp/acc2.json | head -n10) | diff-so-fancy
#+end_src
Screenshot: https://pb.nichi.co/jazz-estate-brave
The diff shows that the first word "累积" is encoded incorrectly in
"/tmp/acc2.json". (It uses `c3 a7 c2 b4 c2 af'.)
Actually,
#+begin_src shell
echo -n "累积" | hexdump -C
#+end_src
should be `e7 b4 af e7 a7 af' in utf-8, where "累" is represented by
`e7 b4 af' and "积" by `e7 a7 af'.
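Those six corrupted bytes look like classic double encoding: if the UTF-8 bytes of "累积" were mistakenly interpreted as Latin-1 text and then re-encoded as UTF-8, each byte would expand into a two-byte sequence. A quick shell check (assuming a UTF-8 locale and iconv available) reproduces exactly the bytes seen in "/tmp/acc2.json":

#+begin_src shell
# Take the UTF-8 bytes of "累积" (e7 b4 af e7 a7 af), pretend they are
# Latin-1, and re-encode as UTF-8: e7 -> c3 a7, b4 -> c2 b4, af -> c2 af, ...
echo -n "累积" | iconv -f latin1 -t utf-8 | hexdump -C
# First six bytes: c3 a7 c2 b4 c2 af -- matching the corrupted output.
#+end_src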
The environment variable LANG is `en_US.UTF-8'; everything was tested in `emacs -Q'.
--
Zhiwei Chen
* bug#50391: 28.0.50; json-read non-ascii data results in malformed string
2021-09-05 4:19 bug#50391: 28.0.50; json-read non-ascii data results in malformed string Zhiwei Chen
@ 2021-09-05 7:31 ` Philipp
2021-09-05 8:08 ` Lars Ingebrigtsen
1 sibling, 0 replies; 3+ messages in thread
From: Philipp @ 2021-09-05 7:31 UTC (permalink / raw)
To: Zhiwei Chen; +Cc: 50391
> Am 05.09.2021 um 06:19 schrieb Zhiwei Chen <condy0919@gmail.com>:
>
>
> When fetching JSON from Youdao (a dictionary service in China):
>
> #+begin_src elisp
> (url-retrieve
>  "https://dict.youdao.com/suggest?q=accumulate&le=eng&num=80&doctype=json"
>  (lambda (_status)
>    (goto-char (1+ url-http-end-of-headers))
>    (write-region (point) (point-max) "/tmp/acc1.json")))
> #+end_src
>
> Then C-x C-f "/tmp/acc1.json": the file is correctly encoded.
>
> But if I `json-read' and then `json-insert', the file is malformed, even
> though uchardet reports the file's encoding as utf-8.
>
> #+begin_src elisp
> (url-retrieve
>  "https://dict.youdao.com/suggest?q=accumulate&le=eng&num=80&doctype=json"
>  (lambda (_status)
>    (goto-char (1+ url-http-end-of-headers))
>    (let ((j (json-read)))
>      (with-temp-buffer
>        (json-insert j)
>        (write-region (point-min) (point-max) "/tmp/acc2.json")))))
> #+end_src
Does it work if you use the C JSON function (json-parse-buffer) for parsing? At least for me the two files are then identical.
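For reference, a minimal variant of the original recipe along the lines Philipp suggests might look like this (an untested sketch; `json-parse-buffer' is the native C parser available since Emacs 27, and it returns hash tables by default, which `json-insert' also accepts):

#+begin_src elisp
;; Sketch: same callback as the bug report, but parsing with the
;; native `json-parse-buffer' instead of json.el's `json-read'.
(url-retrieve
 "https://dict.youdao.com/suggest?q=accumulate&le=eng&num=80&doctype=json"
 (lambda (_status)
   (goto-char (1+ url-http-end-of-headers))
   (let ((j (json-parse-buffer)))
     (with-temp-buffer
       (json-insert j)
       (write-region (point-min) (point-max) "/tmp/acc2.json")))))
#+end_src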
* bug#50391: 28.0.50; json-read non-ascii data results in malformed string
2021-09-05 4:19 bug#50391: 28.0.50; json-read non-ascii data results in malformed string Zhiwei Chen
2021-09-05 7:31 ` Philipp
@ 2021-09-05 8:08 ` Lars Ingebrigtsen
1 sibling, 0 replies; 3+ messages in thread
From: Lars Ingebrigtsen @ 2021-09-05 8:08 UTC (permalink / raw)
To: Zhiwei Chen; +Cc: 50391
[-- Attachment #1: Type: text/plain, Size: 1258 bytes --]
Zhiwei Chen <condy0919@gmail.com> writes:
> When fetching JSON from Youdao (a dictionary service in China):
>
> #+begin_src elisp
> (url-retrieve
>  "https://dict.youdao.com/suggest?q=accumulate&le=eng&num=80&doctype=json"
>  (lambda (_status)
>    (goto-char (1+ url-http-end-of-headers))
>    (write-region (point) (point-max) "/tmp/acc1.json")))
> #+end_src
>
> Then C-x C-f "/tmp/acc1.json": the file is correctly encoded.
>
> But if I `json-read' and then `json-insert', the file is malformed, even
> though uchardet reports the file's encoding as utf-8.
When you do the `write-region', Emacs writes the octets you received
from the web server to a file. When Emacs loads that file in again, it
guesses that it's utf-8 and decodes it that way, so that's why that
works correctly.
> #+begin_src elisp
> (url-retrieve
>  "https://dict.youdao.com/suggest?q=accumulate&le=eng&num=80&doctype=json"
>  (lambda (_status)
>    (goto-char (1+ url-http-end-of-headers))
>    (let ((j (json-read)))
>      (with-temp-buffer
>        (json-insert j)
>        (write-region (point-min) (point-max) "/tmp/acc2.json")))))
> #+end_src
But here you're asking Emacs to use `json-read' on a buffer that hasn't
been decoded. The http buffer at this point looks like this:
[-- Attachment #2: Type: image/png, Size: 64328 bytes --]
[-- Attachment #3: Type: text/plain, Size: 654 bytes --]
You have to say (decode-coding-region (point) (point-max) 'utf-8) first
for that to work. I.e.,
(url-retrieve
 "https://dict.youdao.com/suggest?q=accumulate&le=eng&num=80&doctype=json"
 (lambda (_status)
   (goto-char (1+ url-http-end-of-headers))
   (let ((buf (current-buffer))
         (end (1+ url-http-end-of-headers)))
     (with-temp-buffer
       (insert-buffer-substring buf end)
       (goto-char (point-min))
       (let ((j (json-read)))
         (erase-buffer)
         (json-insert j)
         (write-region (point-min) (point-max) "/tmp/acc2.json"))))))
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no