From: Mark H Weaver <mhw@netris.org>
To: Arun Isaac <arunisaac@systemreboot.net>
Cc: 30076@debbugs.gnu.org
Subject: bug#30076: [PATCH] web: Recognize JSON content type as text.
Date: Tue, 30 Jan 2018 22:31:04 -0500
Message-ID: <87y3kevh53.fsf@netris.org>
In-Reply-To: <20180111053117.4597-1-arunisaac@systemreboot.net> (Arun Isaac's message of "Thu, 11 Jan 2018 11:01:17 +0530")
Hi Arun,
Arun Isaac <arunisaac@systemreboot.net> writes:
> * module/web/response.scm (text-content-type?): Recognize JSON content
> type as text.
While this looks reasonable at first glance, I think it will cause
JSON texts with non-ASCII characters to be mishandled in many cases.
Within Guile, 'text-content-type?' is currently used in two places:
* 'decode-response-body' in (web client), and
* 'response-body-port' in (web response).
In both places, if 'text-content-type?' returns true, the encoding of
the response is assumed to be "ISO-8859-1" if not otherwise specified by
an explicit 'charset' parameter. This is what RFC 2616 specifies for
text/plain, although RFC 6657 would change the default to US-ASCII, as
it was in RFC 2046, and maybe we should look into that.
However, things are quite different for the application/json MIME type,
as specified in RFCs 4627 and 7159. Those RFCs specify that JSON text
"SHALL" (i.e. MUST) be encoded in Unicode (UTF-8, UTF-16 or UTF-32),
that the default encoding is UTF-8, and furthermore that no charset
parameter is defined for application/json.
So, we can expect at least some conforming implementations to omit the
'charset' parameter, and yet in that case we must assume that the
encoding is Unicode, and most definitely not ISO-8859-1.
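To make the failure mode concrete, here is a minimal sketch in Python
(not Guile, and not the code under review) of what happens when a UTF-8
JSON body is decoded with the ISO-8859-1 default. The sample body is
made up for illustration.

```python
import json

# Hypothetical response body: a JSON text with one non-ASCII character
# (U+00E9), encoded as UTF-8 and sent without a 'charset' parameter.
body = '{"name": "Jos\u00e9"}'.encode("utf-8")

# Decoding with the ISO-8859-1 default never raises, because every byte
# maps to some character, so the corruption passes silently.
wrong = json.loads(body.decode("iso-8859-1"))

# Decoding as UTF-8 recovers the original string.
right = json.loads(body.decode("utf-8"))
```

Here the two UTF-8 octets of U+00E9 are each read as a separate
ISO-8859-1 character, so 'wrong' holds a two-character sequence where
'right' holds the single intended character.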
RFC 4627 makes the additional interesting observation (in section 3,
"Encoding") that since the first two characters of JSON text will always
be ASCII, and since UTF-8/UTF-16/UTF-32 are the only valid encodings for
JSON text, we can reliably determine the encoding by looking at the
pattern of nul bytes in the first four octets:
   00 00 00 xx  UTF-32BE
   00 xx 00 xx  UTF-16BE
   xx 00 00 00  UTF-32LE
   xx 00 xx 00  UTF-16LE
   xx xx xx xx  UTF-8
Given that any of the encodings above is possible, and that there is
no 'charset' parameter defined for "application/json", it seems to me
that we have no choice but to be prepared to auto-detect the encoding,
as described in RFC 4627 section 3, whenever the 'charset' parameter
is missing.
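The detection described above could be sketched roughly as follows.
This is Python rather than Guile, purely as an illustration of the
nul-byte patterns from RFC 4627 section 3. The function names
'detect_json_encoding' and 'decode_json_body' are hypothetical and do
not correspond to anything in Guile's (web client) or (web response)
modules.

```python
from typing import Optional


def detect_json_encoding(octets: bytes) -> str:
    """Guess the encoding of a JSON text from the nul-byte pattern of
    its first four octets, per RFC 4627 section 3."""
    head = octets[:4]
    if len(head) < 4:
        # Fewer than four octets cannot match any UTF-16/UTF-32 pattern,
        # so fall back to the default encoding, UTF-8.
        return "utf-8"
    nul = [b == 0 for b in head]
    if nul == [True, True, True, False]:
        return "utf-32-be"
    if nul == [True, False, True, False]:
        return "utf-16-be"
    if nul == [False, True, True, True]:
        return "utf-32-le"
    if nul == [False, True, False, True]:
        return "utf-16-le"
    return "utf-8"


def decode_json_body(octets: bytes, charset: Optional[str] = None) -> str:
    """Decode a JSON body, honouring an explicit charset if one was
    given and auto-detecting otherwise.  Byte order marks are not
    handled in this sketch."""
    encoding = charset if charset is not None else detect_json_encoding(octets)
    return octets.decode(encoding)
```

For example, decode_json_body('{"a": 1}'.encode("utf-16-le")) returns
the original text even though no charset was supplied, because the
first four octets match the "xx 00 xx 00" pattern.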
What do you think?
Mark