From: Philipp Stephani
> From: Philipp Stephani <p.stephani2@gmail.com>
> Date: Sat, 23 Dec 2017 17:27:22 +0000
> Cc: emacs-devel@gnu.org
>
> - We encode Lisp strings when passing them to Jansson. Jansson only
> accepts UTF-8 strings and fails (with proper error reporting, not
> crashing) when encountering non-UTF-8 strings. I think encoding can
> only make a difference here for strings that contain sequences of
> bytes that are themselves valid UTF-8 code unit sequences, such as
> "Ä\xC3\x84". This string is encoded as "\xC3\x84\xC3\x84" using
> utf-8-unix. (Note how this is a case where encoding and decoding are
> not inverses of each other.) Without encoding, the string contents
> will be \xC3\x84 plus two invalid 5-byte sequences. I think it's not
> obvious at all which interpretation is correct; after all,
> "Ä\xC3\x84" is not equal to "ÄÄ", but the two strings now result in
> the same JSON representation. This could be at least surprising, and
> I'd argue that the other behavior (raising an error) would be more
> correct and more obvious.
I think we need to take a step back and decide what we would want to
do with strings which include raw bytes.  If we pass such strings to
Jansson, it will just error out, right?  If so, then we could do one
of two things:
  . Check up front whether a Lisp string includes raw bytes, and if
    it does, signal an error before even trying to encode it.  I think
    find_charsets_in_text could be instrumental here; alternatively,
    we could scan the string using BYTES_BY_CHAR_HEAD, looking for
    either sequences longer than 4 bytes or 2-byte sequences whose
    leading bytes are C0 or C1 (these are the raw bytes).

  . Or we could encode the string, pass it to Jansson, and let it
    error out; then we could produce our own diagnostics.

That's what we are currently doing.
Which one of these do you prefer?  Currently, you opted for the 2nd
one.  It is not clear to me that the option you've chosen is better,
since (a) it relies on Jansson,
It is true that if we believe Jansson's detection of invalid UTF-8,
and we assume that raw bytes in their current representation will
forever be the only extensions of UTF-8 in Emacs, we could pass the
internal representation to Jansson.  Personally, I'm not sure we
should make such assumptions, but that's me.
> - We decode UTF-8 strings after receiving them from Jansson. Jansson
> guarantees to only ever emit well-formed UTF-8. Given that for
> well-formed UTF-8 strings, the UTF-8 representation and the Emacs
> representation are one and the same, we don't need decoding.
Once again: do we really want to rely on external libraries to always
DTRT and be bug-free?  We don't normally rely on external sources like
that.  The cost of decoding is not too high; the price users will pay
for Jansson's bugs will be much higher.
>  And second, encoding keeps the
>  encoding intact precisely because it is not a no-op: raw bytes are
>  held in buffer and string text as special multibyte sequences, not
>  as single bytes, so just copying them to output instead of encoding
>  will produce non-UTF-8 multibyte sequences.
>
> That's the correct behavior, I think. JSON values must be valid
> Unicode strings, and raw bytes are not.
Neither are the internal representations of raw bytes, so what's your point here?
>  >   /* We need to send a valid UTF-8 string.  We could encode `object'
>  >      but by not encoding it, we guarantee it's valid utf-8, even if
>  >      it contains eight-bit-bytes.  Of course, you can still send
>  >      manually-crafted junk by passing a unibyte string.  */
>
>  If gnutls.c and dbusbind.c don't encode and decode text that comes
>  from and goes to outside, then they are buggy.
>
> Not necessarily. As mentioned, the internal encoding of multibyte
> strings is even mentioned in the Lisp reference; and the above
> comment indicates that it's OK to use that information at least
> within the Emacs codebase.
I think that comment is based on a mistake, or maybe I don't really
understand it.  The internal representation is not in general valid
UTF-8, that's for sure.  And the fact that the internal representation
is documented doesn't mean we can draw conclusions like that.  For
starters, the documentation doesn't tell the whole story: the 2-byte
representation of raw bytes is not described there.
> Some parts are definitely encoded, but for example, there is
> c_hostname in Fgnutls_boot, which doesn't encode the user-supplied
> string.
That's a bug.
>  Well, I disagree with that conclusion.  Just look at all the calls
>  to decode_coding_*, encode_coding_*, DECODE_SYSTEM, ENCODE_SYSTEM,
>  etc., and you will see where we do that.
>
> We obviously do *some* encoding/decoding. But when interacting with
> third-party libraries, we seem to leave it out pretty frequently, if
> those libraries use UTF-8 as well.
Most if not all of those places are just bugs.  People who work mostly
on GNU/Linux tend to forget that not everything is UTF-8.