unofficial mirror of bug-guile@gnu.org 
From: Mark H Weaver <mhw@netris.org>
To: Arun Isaac <arunisaac@systemreboot.net>
Cc: 30076@debbugs.gnu.org
Subject: bug#30076: [PATCH] web: Recognize JSON content type as text.
Date: Tue, 30 Jan 2018 22:31:04 -0500
Message-ID: <87y3kevh53.fsf@netris.org>
In-Reply-To: <20180111053117.4597-1-arunisaac@systemreboot.net> (Arun Isaac's message of "Thu, 11 Jan 2018 11:01:17 +0530")

Hi Arun,

Arun Isaac <arunisaac@systemreboot.net> writes:
> * module/web/response.scm (text-content-type?): Recognize JSON content
>   type as text.

While this seems reasonable at first glance, I think it will cause
JSON texts containing non-ASCII characters to be mishandled in many
cases.

Within Guile, 'text-content-type?' is currently used in two places:

* 'decode-response-body' in (web client), and
* 'response-body-port' in (web response).

In both places, if 'text-content-type?' returns true, the encoding of
the response is assumed to be "ISO-8859-1" if not otherwise specified by
an explicit 'charset' parameter.  This is what RFC 2616 specifies for
text/plain, although RFC 6657 would change the default to US-ASCII, as
it was in RFC 2046, and maybe we should look into that.

However, things are quite different for the application/json MIME type,
as specified in RFCs 4627 and 7159.  Those RFCs specify that JSON text
"SHALL" (i.e. MUST) be encoded in Unicode (UTF-8, UTF-16 or UTF-32),
that the default encoding is UTF-8, and furthermore that no charset
parameter is defined for application/json.

So, we can expect at least some conforming implementations to omit the
'charset' parameter, and yet in that case we must assume that the
encoding is Unicode, and most definitely not ISO-8859-1.
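
To make the difference concrete, here is a rough sketch in Python (not
Guile's actual code) of how the default could be chosen per media type.
The names 'parse_content_type', 'assumed_encoding' and 'DETECT_UNICODE'
are hypothetical, made up for this illustration, and the header parsing
is deliberately simplistic:

```python
# Hypothetical sentinel meaning "Unicode; inspect the body to find out which".
DETECT_UNICODE = object()


def parse_content_type(header):
    """Split a Content-Type value into (media_type, params).

    Simplified: does not handle quoted strings containing ';' or '='.
    """
    media_type, _, rest = header.partition(";")
    params = {}
    for part in rest.split(";"):
        name, sep, value = part.partition("=")
        if sep:
            params[name.strip().lower()] = value.strip().strip('"')
    return media_type.strip().lower(), params


def assumed_encoding(header):
    """Return the encoding to assume for a response body, or None if binary."""
    media_type, params = parse_content_type(header)
    if "charset" in params:
        # An explicit charset parameter always wins.
        return params["charset"]
    if media_type == "application/json":
        # No charset parameter is defined; the body is UTF-8/16/32.
        return DETECT_UNICODE
    if media_type.startswith("text/"):
        # The RFC 2616 default for text types.
        return "ISO-8859-1"
    return None
```

With this sketch, "text/plain" yields "ISO-8859-1", while a bare
"application/json" yields the sentinel rather than a Latin-1 default.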

RFC 4627 makes the additional interesting observation (in section 3,
"Encoding") that since the first two characters of JSON text will always
be ASCII, and since UTF-8/UTF-16/UTF-32 are the only valid encodings for
JSON text, we can reliably determine the encoding by looking at the
pattern of NUL bytes in the first four octets:

           00 00 00 xx  UTF-32BE
           00 xx 00 xx  UTF-16BE
           xx 00 00 00  UTF-32LE
           xx 00 xx 00  UTF-16LE
           xx xx xx xx  UTF-8

Given that any of the encodings above is possible, and that no
'charset' parameter is defined for "application/json", it seems to me
that we have no choice but to auto-detect the encoding, as described in
RFC 4627 section 3, whenever the 'charset' parameter is missing.
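
As an illustration only, the RFC 4627 section 3 pattern could be checked
roughly as follows. This is a Python sketch, not a proposed patch; the
function name 'detect_json_encoding' is hypothetical. It assumes the
body has no byte order mark, and it treats bodies shorter than four
octets as UTF-8, since the shortest RFC 4627 JSON text ("{}" or "[]")
already takes four octets in UTF-16:

```python
def detect_json_encoding(body):
    """Guess the Unicode encoding of a JSON body from its first four octets."""
    head = body[:4]
    if len(head) < 4:
        # Too short for UTF-16/UTF-32 (assumption stated above).
        return "utf-8"
    # True where the octet is NUL, False otherwise.
    nul = tuple(octet == 0 for octet in head)
    if nul == (True, True, True, False):    # 00 00 00 xx
        return "utf-32-be"
    if nul == (True, False, True, False):   # 00 xx 00 xx
        return "utf-16-be"
    if nul == (False, True, True, True):    # xx 00 00 00
        return "utf-32-le"
    if nul == (False, True, False, True):   # xx 00 xx 00
        return "utf-16-le"
    return "utf-8"                          # xx xx xx xx


# Usage: decode a body whose Content-Type carried no charset parameter.
body = '{"name": "\u00e9"}'.encode("utf-16-le")
text = body.decode(detect_json_encoding(body))
```

The detected name can be passed straight to the decoder, so the
non-ASCII character in the usage example survives the round trip.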

What do you think?

      Mark




