From mboxrd@z Thu Jan 1 00:00:00 1970
From: Philipp Stephani
Newsgroups: gmane.emacs.devel
Subject: Re: String encoding in json.c
Date: Tue, 26 Dec 2017 21:42:54 +0000
To: Eli Zaretskii
Cc: emacs-devel@gnu.org
References: <83tvwhjyi5.fsf@gnu.org> <83mv29jv99.fsf@gnu.org> <83incxjojg.fsf@gnu.org>
In-Reply-To: <83incxjojg.fsf@gnu.org>

Eli Zaretskii <eliz@gnu.org> wrote on Sat, 23 Dec 2017 at 19:19:

> > From: Philipp Stephani <p.stephani2@gmail.com>
> > Date: Sat, 23 Dec 2017 17:27:22 +0000
> > Cc: emacs-devel@gnu.org
> >
> > - We encode Lisp strings when passing them to Jansson.  Jansson only
> >   accepts UTF-8 strings and fails (with proper error reporting, not
> >   crashing) when encountering non-UTF-8 strings.  I think encoding
> >   can only make a difference here for strings that contain sequences
> >   of bytes that are themselves valid UTF-8 code unit sequences, such
> >   as "Ä\xC3\x84".  This string is encoded as "\xC3\x84\xC3\x84"
> >   using utf-8-unix.  (Note how this is a case where encoding and
> >   decoding are not inverses of each other.)  Without encoding, the
> >   string contents will be \xC3\x84 plus two invalid 5-byte
> >   sequences.  I think it's not obvious at all which interpretation
> >   is correct; after all, "Ä\xC3\x84" is not equal to "ÄÄ", but the
> >   two strings now result in the same JSON representation.  This
> >   could be at least surprising, and I'd argue that the other
> >   behavior (raising an error) would be more correct and more
> >   obvious.
>
> I think we need to take a step back and decide what we would want to
> do with strings which include raw bytes.  If we pass such strings to
> Jansson, it will just error out, right?

Yes.

> If so, then we could do one of the two:
>
>   . Check up front whether a Lisp string includes raw bytes, and if
>     it does, signal an error before even trying to encode it.  I
>     think find_charsets_in_text could be instrumental here;
>     alternatively, we could scan the string using BYTES_BY_CHAR_HEAD,
>     looking for either sequences longer than 4 bytes or 2-byte
>     sequences whose leading bytes are C0 or C1 (these are the raw
>     bytes).
>
>   . Or we could encode the string, pass it to Jansson, and let it
>     error out; then we could produce our own diagnostics.

That's what we are currently doing.

> Which one of these do you prefer?

The third option: don't encode, i.e. pass SDATA directly.  We know
that valid Unicode sequences are represented as valid UTF-8 strings,
that invalid Unicode sequences are represented as invalid UTF-8
strings, and that Jansson behaves correctly in all of these cases.
Given otherwise equal behavior, I generally prefer the least complex
option, and "doing nothing" is simpler than "doing something".

> Currently, you opted for the 2nd one.  It is not clear to me that the
> option you've chosen is better, since (a) it relies on Jansson,

That's fine, because we only rely on documented and tested behavior.
Doing so is generally OK; if we couldn't rely on documented behavior,
we couldn't use external libraries (including glibc) at all.

> and (b) it encodes strings which don't need to be encoded.

True; that's why I argue we should remove the encoding step.
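
Concretely, the no-encoding variant is tiny.  A rough sketch (not the
actual json.c code; the helper name and the error handling are made
up, and json_stringn requires Jansson >= 2.7):

  #include <jansson.h>

  /* Sketch only: convert a multibyte Lisp string to a JSON string
     without an explicit encoding step.  Jansson validates UTF-8
     itself, so a string containing raw bytes results in an error
     instead of silently producing junk.  */
  static json_t *
  lisp_to_json_string (Lisp_Object string)
  {
    /* Using SBYTES instead of strlen keeps embedded null bytes
       intact; json_stringn rejects invalid UTF-8 and returns NULL.  */
    json_t *json = json_stringn (SSDATA (string), SBYTES (string));
    if (json == NULL)
      error ("JSON string is not valid UTF-8");
    return json;
  }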
> OTOH, the check I propose in (a) means a penalty for every caller.
> But such penalties never deterred you elsewhere in your code, so I
> wonder why this case is suddenly so different?

I generally prefer interface clarity and defensive programming: I
don't want to introduce undefined behavior on unexpected user input,
and I prefer signaling errors over silently doing something subtly
wrong.  But here the Jansson library already performs all the checks
we need, so we don't have to duplicate them ourselves.

> It is true that if we believe Jansson's detection of invalid UTF-8,
> and we assume that raw bytes in their current representation will
> forever be the only extension of UTF-8 in Emacs, we could pass the
> internal representation to Jansson.  Personally, I'm not sure we
> should make such assumptions, but that's me.

I think it's fine to make such assumptions:

- Jansson documents how it handles invalid UTF-8.
- Jansson includes multiple test cases that check its behavior on
  encountering invalid UTF-8.
- Emacs itself now also includes multiple test cases for such inputs.
- Jansson gets high scores in the nativejson-benchmark conformance
  tests (the remaining failures are corner cases involving real
  numbers, which are arguably not true errors and don't affect string
  handling).
- We don't need to assume that Emacs's internal encoding stays
  UTF-8-compatible forever; we only need to be able to rely on it
  now.  Given the importance and widespread use of UTF-8, it's
  unlikely that our internal encoding will have to change within the
  next couple of years, and should the need to change it ever arise,
  the existing regression tests would alert us immediately to what
  needs changing.

Emacs is a relatively monolithic codebase, where it's common for some
compilation units to rely on implementation details of other
compilation units.  That's not great, but it's also not a strong
reason to artificially keep ourselves from using global knowledge
about fundamental data types such as strings.  We expose SDATA and
SBYTES in lisp.h, so why can't we say what the bytes at SDATA
actually contain?

> > - We decode UTF-8 strings after receiving them from Jansson.
> >   Jansson guarantees to only ever emit well-formed UTF-8.  Given
> >   that for well-formed UTF-8 strings the UTF-8 representation and
> >   the Emacs representation are one and the same, we don't need
> >   decoding.
>
> Once again: do we really want to rely on external libraries to always
> DTRT and be bug-free?

Yes; otherwise we couldn't use external libraries at all.

> We don't normally rely on external sources like that.

We do so all the time.  For example, we rely on malloc(123) actually
returning either NULL or a pointer to a memory block of at least 123
bytes.

> The cost of decoding is not too high;

It's not extremely high, but it is significant.  Users of JSON
serialization such as Language Server Protocol clients or YCM
regularly encode and decode large JSON objects on every keystroke, so
the JSON functions need to be fast.  If we can speed them up by
*removing* code (and thus complexity), we should do it.

> the price users will pay for Jansson's bugs will be much higher.

We shouldn't add workarounds for bugs merely because they could
happen in the future.  True, bugs are possible in any library, but we
might just as well hit a bug in malloc, which would be far more
disastrous, and we don't proactively work around theoretical malloc
bugs either.  If and when we encounter a serialization bug in Jansson
that produces invalid UTF-8, I'm more than happy to add a workaround,
but not for nonexistent bugs.
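
Just to make the cost comparison concrete: the up-front scan from
option (a) is easy enough to write, roughly as in the sketch below,
but it duplicates a check that Jansson already performs.  (The
function name is made up; BYTES_BY_CHAR_HEAD is the macro from
character.h.)

  /* Sketch of option (a): return true if STRING contains bytes that
     cannot appear in valid UTF-8.  In the internal representation,
     raw bytes are stored as 2-byte sequences whose leading byte is
     0xC0 or 0xC1, and characters beyond the Unicode range as
     sequences longer than 4 bytes.  */
  static bool
  string_contains_raw_bytes (Lisp_Object string)
  {
    const unsigned char *p = SDATA (string);
    const unsigned char *end = p + SBYTES (string);
    while (p < end)
      {
        int len = BYTES_BY_CHAR_HEAD (*p);
        if (len > 4 || (len == 2 && (*p == 0xC0 || *p == 0xC1)))
          return true;
        p += len;
      }
    return false;
  }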
> >   And second, encoding keeps the encoding intact precisely because
> >   it is not a no-op: raw bytes are held in buffer and string text
> >   as special multibyte sequences, not as single bytes, so just
> >   copying them to output instead of encoding will produce non-UTF-8
> >   multibyte sequences.
> >
> > That's the correct behavior, I think.  JSON values must be valid
> > Unicode strings, and raw bytes are not.
>
> Neither are the internal representations of raw bytes, so what's your
> point here?

The point is that encoding a multibyte string containing a sequence
of two raw bytes can produce a valid UTF-8 string, while using the
bytes directly cannot.

> >   /* We need to send a valid UTF-8 string.  We could encode `object'
> >      but by not encoding it, we guarantee it's valid utf-8, even if
> >      it contains eight-bit-bytes.  Of course, you can still send
> >      manually-crafted junk by passing a unibyte string.  */
> >
> > If gnutls.c and dbusbind.c don't encode and decode text that comes
> > from and goes to outside, then they are buggy.
> >
> > Not necessarily.  As mentioned, the internal encoding of multibyte
> > strings is even mentioned in the Lisp reference; and the above
> > comment indicates that it's OK to use that information at least
> > within the Emacs codebase.
>
> I think that comment is based on a mistake, or maybe I don't really
> understand it.  The internal representation is not in general valid
> UTF-8, that's for sure.

Agreed; the comment should at least be reworded, e.g.: "If OBJECT is
a well-formed Unicode scalar value sequence, its unencoded byte range
is a valid UTF-8 string, so we don't need to encode it.  If OBJECT is
not well-formed, or is unibyte, the function will return EINVAL
instead of exhibiting undefined behavior."

> And the fact that the internal representation is documented doesn't
> mean we can draw conclusions like that.

Why not?  Surely we can make use of documented information.

> For starters, the documentation doesn't tell the whole story: the
> 2-byte representation of raw bytes is not described there.

What's the 2-byte representation?

> > Some parts are definitely encoded, but for example, there is
> > c_hostname in Fgnutls_boot, which doesn't encode the user-supplied
> > string.
>
> That's a bug.

Maybe, maybe not.  gnutls_server_name_set explicitly documents that
the hostname is interpreted as UTF-8 (presumably even on Windows), so
if we can rely on the UTF-8-ness of strings, not encoding it is OK.

> > Well, I disagree with that conclusion.  Just look at all the calls
> > to decode_coding_*, encode_coding_*, DECODE_SYSTEM, ENCODE_SYSTEM,
> > etc., and you will see where we do that.
> >
> > We obviously do *some* encoding/decoding.  But when interacting
> > with third-party libraries, we seem to leave it out pretty
> > frequently if those libraries use UTF-8 as well.
>
> Most if not all of those places are just bugs.  People who work
> mostly on GNU/Linux tend to forget that not everything is UTF-8.

Definitely true for files and processes, but if an API (such as
GnuTLS or Jansson) explicitly documents that it expects UTF-8, then
we should be able to rely on that.
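
Coming back once more to the decoding direction discussed above:
since Jansson documents that the strings it produces are always valid
UTF-8, and valid UTF-8 coincides with the internal multibyte
representation, the receiving side can construct the Lisp string
directly.  A sketch (made-up helper name; json_string_length requires
Jansson >= 2.7):

  /* Sketch only: build a multibyte Lisp string straight from a
     Jansson string, with no decoding step.  */
  static Lisp_Object
  json_to_lisp_string (json_t *json)
  {
    const char *data = json_string_value (json);
    ptrdiff_t nbytes = json_string_length (json);
    /* For valid UTF-8, counting the characters is a plain scan over
       the bytes.  */
    ptrdiff_t nchars
      = multibyte_chars_in_text ((const unsigned char *) data, nbytes);
    return make_multibyte_string (data, nchars, nbytes);
  }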