String encoding in json.c

unofficial mirror of emacs-devel@gnu.org 
 help / color / mirror / code / Atom feed

* String encoding in json.c
@ 2017-12-23 14:26 Philipp Stephani
  2017-12-23 14:43 ` Eli Zaretskii
  0 siblings, 1 reply; 14+ messages in thread
From: Philipp Stephani @ 2017-12-23 14:26 UTC (permalink / raw)
  To: Emacs developers

[-- Attachment #1: Type: text/plain, Size: 867 bytes --]

Hi,

I've benchmarked serialization and parsing of JSON with and without
explicit encoding. I've found that leaving out the coding makes both
operations significantly faster – from a speedup of a factor of 1.11 ± 0.06
for parsing canada.json to 1.57 ± 0.08 for serializing twitter.json. Other
speedups are in between, but the speedup is always significant (to at least
one standard deviation). All unit tests pass when leaving out the coding
steps – which isn't surprising given that currently the coding operations
are expensive no-ops. Therefore I'd suggest to document the internal string
encoding in lisp.h or character.h and remove the explicit coding in json.c
and emacs-module.c. It's very unlikely that the internal string encoding
will change frequently, and if so, the unit tests should catch potential
issues caused by that.

Philipp

[-- Attachment #2: Type: text/html, Size: 941 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: String encoding in json.c
  2017-12-23 14:26 String encoding in json.c Philipp Stephani
@ 2017-12-23 14:43 ` Eli Zaretskii
  2017-12-23 15:31   ` Philipp Stephani
  2017-12-24 20:48   ` Dmitry Gutov
  0 siblings, 2 replies; 14+ messages in thread
From: Eli Zaretskii @ 2017-12-23 14:43 UTC (permalink / raw)
  To: Philipp Stephani; +Cc: emacs-devel

> From: Philipp Stephani <p.stephani2@gmail.com>
> Date: Sat, 23 Dec 2017 14:26:09 +0000
> 
> I've benchmarked serialization and parsing of JSON with and without explicit encoding. I've found that leaving
> out the coding makes both operations significantly faster – from a speedup of a factor of 1.11 ± 0.06 for
> parsing canada.json to 1.57 ± 0.08 for serializing twitter.json. Other speedups are in between, but the
> speedup is always significant (to at least one standard deviation). All unit tests pass when leaving out the
> coding steps – which isn't surprising given that currently the coding operations are expensive no-ops.

The coding operations are "expensive no-ops" except when they aren't,
and that is exactly when we need their 'expensive" parts.

> Therefore I'd suggest to document the internal string encoding in lisp.h or character.h and remove the explicit
> coding in json.c and emacs-module.c. It's very unlikely that the internal string encoding will change frequently,
> and if so, the unit tests should catch potential issues caused by that.

As I've already said, I don't think this particular case should be an
exception wrt to how Emacs behaves with external strings everywhere
else.  We suffer similar slow-downs in those other places as well, and
IMO this is a small penalty to pay for making sure our objects are
valid and won't crash Emacs.

The main purpose of Emacs is to be an efficient editor, so if we care
about the slow-down of code conversions, we should first and foremost
speed up reading and writing files.  JSON conversion, with all due
respect to them, are not the main business for us, and I'm not even
sure JSON objects will frequently be as large as files our users visit
all the time.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: String encoding in json.c
  2017-12-23 14:43 ` Eli Zaretskii
@ 2017-12-23 15:31   ` Philipp Stephani
  2017-12-23 15:53     ` Eli Zaretskii
  2017-12-24 20:48   ` Dmitry Gutov
  1 sibling, 1 reply; 14+ messages in thread
From: Philipp Stephani @ 2017-12-23 15:31 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

[-- Attachment #1: Type: text/plain, Size: 2784 bytes --]

Eli Zaretskii <eliz@gnu.org> schrieb am Sa., 23. Dez. 2017 um 15:43 Uhr:

> > From: Philipp Stephani <p.stephani2@gmail.com>
> > Date: Sat, 23 Dec 2017 14:26:09 +0000
> >
> > I've benchmarked serialization and parsing of JSON with and without
> explicit encoding. I've found that leaving
> > out the coding makes both operations significantly faster – from a
> speedup of a factor of 1.11 ± 0.06 for
> > parsing canada.json to 1.57 ± 0.08 for serializing twitter.json. Other
> speedups are in between, but the
> > speedup is always significant (to at least one standard deviation). All
> unit tests pass when leaving out the
> > coding steps – which isn't surprising given that currently the coding
> operations are expensive no-ops.
>
> The coding operations are "expensive no-ops" except when they aren't,
> and that is exactly when we need their 'expensive" parts.
>

In which case are they not no-ops? I've spot-checked some of the
implementation details of coding.c, and I haven't found obvious cases where
they are not no-ops. Emacs appears to use the obvious extension of UTF-8
for integers that are not Unicode scalar values, and that's even documented
in character.h and the Elisp reference manual. Using utf-8-unix as encoding
seems to keep the encoding intact.


>
> > Therefore I'd suggest to document the internal string encoding in lisp.h
> or character.h and remove the explicit
> > coding in json.c and emacs-module.c. It's very unlikely that the
> internal string encoding will change frequently,
> > and if so, the unit tests should catch potential issues caused by that.
>
> As I've already said, I don't think this particular case should be an
> exception wrt to how Emacs behaves with external strings everywhere
> else.  We suffer similar slow-downs in those other places as well, and
> IMO this is a small penalty to pay for making sure our objects are
> valid and won't crash Emacs.
>

I've spot-checked some other code where we interface with external
libraries, namely dbusbind.c and gnutls.c. In no cases I've found explicit
coding operations (except for filenames, where the situation is different);
these files always use SDATA directly. dbusbind.c even has the comment

  /* We need to send a valid UTF-8 string.  We could encode `object'
     but by not encoding it, we guarantee it's valid utf-8, even if
     it contains eight-bit-bytes.  Of course, you can still send
     manually-crafted junk by passing a unibyte string.  */

So not only do we not encode strings explicitly, we even *prefer* not
encoding them, and we do rely on the internal string encoding being an
extension of UTF-8. It's the *current* json.c (and emacs-module.c) that's
inconsistent with the rest of the codebase.

[-- Attachment #2: Type: text/html, Size: 3622 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: String encoding in json.c
  2017-12-23 15:31   ` Philipp Stephani
@ 2017-12-23 15:53     ` Eli Zaretskii
  2017-12-23 17:27       ` Philipp Stephani
  0 siblings, 1 reply; 14+ messages in thread
From: Eli Zaretskii @ 2017-12-23 15:53 UTC (permalink / raw)
  To: Philipp Stephani; +Cc: emacs-devel

> From: Philipp Stephani <p.stephani2@gmail.com>
> Date: Sat, 23 Dec 2017 15:31:06 +0000
> Cc: emacs-devel@gnu.org
> 
>  The coding operations are "expensive no-ops" except when they aren't,
>  and that is exactly when we need their 'expensive" parts.
> 
> In which case are they not no-ops?

When the input is not a valid UTF-8 sequence.  When that happens, we
produce a special representation of such raw bytes instead of
signaling EILSEQ and refusing to decode the input.  Encoding (if and
when it is done) then performs the opposite conversion, producing the
same single raw byte in the output stream.  This allows Emacs to
manipulate text that included invalid sequences without crashing,
because all the low-level primitives that walk buffer text and strings
by characters assume the internal representation of each character is
valid.

> Using utf-8-unix as encoding seems to keep the encoding intact.

First, you forget about decoding.  And second, encoding keeps the
encoding intact precisely because it is not a no-op: raw bytes are
held in buffer and string text as special multibyte sequences, not as
single bytes, so just copying them to output instead of encoding will
produce non-UTF-8 multibyte sequences.

> I've spot-checked some other code where we interface with external libraries, namely dbusbind.c and
> gnutls.c. In no cases I've found explicit coding operations (except for filenames, where the situation is
> different); these files always use SDATA directly. dbusbind.c even has the comment
> 
>   /* We need to send a valid UTF-8 string.  We could encode `object'
>      but by not encoding it, we guarantee it's valid utf-8, even if
>      it contains eight-bit-bytes.  Of course, you can still send
>      manually-crafted junk by passing a unibyte string.  */

If gnutls.c and dbusbind.c don't encode and decode text that comes
from and goes to outside, then they are buggy.  (At least for
gnutls.c, I think you are mistaken, because the encoding/decoding is
in process.c, see, e.g., read_process_output.)

> It's the *current* json.c (and emacs-module.c) that's inconsistent
> with the rest of the codebase.

Well, I disagree with that conclusion.  Just look at all the calls to
decode_coding_*, encode_coding_*, DECODE_SYSTEM, ENCODE_SYSTEM, etc.,
and you will see where we do that.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: String encoding in json.c
  2017-12-23 15:53     ` Eli Zaretskii
@ 2017-12-23 17:27       ` Philipp Stephani
  2017-12-23 18:18         ` Eli Zaretskii
  0 siblings, 1 reply; 14+ messages in thread
From: Philipp Stephani @ 2017-12-23 17:27 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

[-- Attachment #1: Type: text/plain, Size: 4727 bytes --]

Eli Zaretskii <eliz@gnu.org> schrieb am Sa., 23. Dez. 2017 um 16:53 Uhr:

> > From: Philipp Stephani <p.stephani2@gmail.com>
> > Date: Sat, 23 Dec 2017 15:31:06 +0000
> > Cc: emacs-devel@gnu.org
> >
> >  The coding operations are "expensive no-ops" except when they aren't,
> >  and that is exactly when we need their 'expensive" parts.
> >
> > In which case are they not no-ops?
>
> When the input is not a valid UTF-8 sequence.  When that happens, we
> produce a special representation of such raw bytes instead of
> signaling EILSEQ and refusing to decode the input.  Encoding (if and
> when it is done) then performs the opposite conversion, producing the
> same single raw byte in the output stream.  This allows Emacs to
> manipulate text that included invalid sequences without crashing,
> because all the low-level primitives that walk buffer text and strings
> by characters assume the internal representation of each character is
> valid.
>

OK, thanks for the refresher. I was aware of the single byte
representation, but forgot how exactly it's handled during coding.


>
> > Using utf-8-unix as encoding seems to keep the encoding intact.
>
> First, you forget about decoding.


OK, let's treat encoding and decoding separately.

- We encode Lisp strings when passing them to Jansson. Jansson only accepts
UTF-8 strings and fails (with proper error reporting, not crashing) when
encountering non-UTF-8 strings. I think encoding can only make a difference
here for strings that contain sequences of bytes that are themselves valid
UTF-8 code unit sequences, such as "Ä\xC3\x84". This string is encoded as
"\xC3\x84\xC3\x84" using utf-8-unix. (Note how this is a case where
encoding and decoding are not inverses of each other.) Without encoding,
the string contents will be \xC3\x84 plus two invalid 5-byte sequences. I
think it's not obvious at all which interpretation is correct; after all,
"Ä\xC3\x84" is not equal to "ÄÄ", but the two strings now result in the
same JSON representation. This could be at least surprising, and I'd argue
that the other behavior (raising an error) would be more correct and more
obvious.

- We decode UTF-8 strings after receiving them from Jansson. Jansson
guarantees to only ever emit well-formed UTF-8. Given that for well-formed
UTF-8 strings, the UTF-8 representation and the Emacs representation are
one and the same, we don't need decoding.



>   And second, encoding keeps the
> encoding intact precisely because it is not a no-op: raw bytes are
> held in buffer and string text as special multibyte sequences, not as
> single bytes, so just copying them to output instead of encoding will
> produce non-UTF-8 multibyte sequences.
>

That's the correct behavior, I think. JSON values must be valid Unicode
strings, and raw bytes are not.


>
> > I've spot-checked some other code where we interface with external
> libraries, namely dbusbind.c and
> > gnutls.c. In no cases I've found explicit coding operations (except for
> filenames, where the situation is
> > different); these files always use SDATA directly. dbusbind.c even has
> the comment
> >
> >   /* We need to send a valid UTF-8 string.  We could encode `object'
> >      but by not encoding it, we guarantee it's valid utf-8, even if
> >      it contains eight-bit-bytes.  Of course, you can still send
> >      manually-crafted junk by passing a unibyte string.  */
>
> If gnutls.c and dbusbind.c don't encode and decode text that comes
> from and goes to outside, then they are buggy.


Not necessarily. As mentioned, the internal encoding of multibyte strings
is even mentioned in the Lisp reference; and the above comment indicates
that it's OK to use that information at least within the Emacs codebase.
BTW, that comment was added by Stefan in
commit e454a4a330cc6524cf0d2604b4fafc32d5bda795, where he removed an
explicit encoding step.


>   (At least for
> gnutls.c, I think you are mistaken, because the encoding/decoding is
> in process.c, see, e.g., read_process_output.)
>

Some parts are definitely encoded, but for example, there is c_hostname in
Fgnutls_boot, which doesn't encode the user-supplied string.


>
> > It's the *current* json.c (and emacs-module.c) that's inconsistent
> > with the rest of the codebase.
>
> Well, I disagree with that conclusion.  Just look at all the calls to
> decode_coding_*, encode_coding_*, DECODE_SYSTEM, ENCODE_SYSTEM, etc.,
> and you will see where we do that.
>

We obviously do *some* encoding/decoding. But when interacting with
third-party libraries, we seem to leave it out pretty frequently, if those
libraries use UTF-8 as well.

[-- Attachment #2: Type: text/html, Size: 6152 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: String encoding in json.c
  2017-12-23 17:27       ` Philipp Stephani
@ 2017-12-23 18:18         ` Eli Zaretskii
  2017-12-26 21:42           ` Philipp Stephani
  0 siblings, 1 reply; 14+ messages in thread
From: Eli Zaretskii @ 2017-12-23 18:18 UTC (permalink / raw)
  To: Philipp Stephani; +Cc: emacs-devel

> From: Philipp Stephani <p.stephani2@gmail.com>
> Date: Sat, 23 Dec 2017 17:27:22 +0000
> Cc: emacs-devel@gnu.org
> 
> - We encode Lisp strings when passing them to Jansson. Jansson only accepts UTF-8 strings and fails (with
> proper error reporting, not crashing) when encountering non-UTF-8 strings. I think encoding can only make a
> difference here for strings that contain sequences of bytes that are themselves valid UTF-8 code unit
> sequences, such as "Ä\xC3\x84". This string is encoded as "\xC3\x84\xC3\x84" using utf-8-unix. (Note how
> this is a case where encoding and decoding are not inverses of each other.) Without encoding, the string
> contents will be \xC3\x84 plus two invalid 5-byte sequences. I think it's not obvious at all which interpretation is
> correct; after all, "Ä\xC3\x84" is not equal to "ÄÄ", but the two strings now result in the same JSON
> representation. This could be at least surprising, and I'd argue that the other behavior (raising an error) would
> be more correct and more obvious.

I think we need to take a step back and decide what would we want to
do with strings which include raw bytes.  If we pass such strings to
Jansson, it will just error out, right?  If so, then we could do one
of the two:

  . Check up front whether a Lisp string includes raw bytes, and if
    it does, signal an error before even trying to encode it.  I think
    find_charsets_in_text could be instrumental here; alternatively,
    we could scan the string using BYTES_BY_CHAR_HEAD, looking for
    either sequences longer than 4 bytes or 2-byte sequences whose
    leading bytes are C0 or C1 (these are the raw bytes).

  . Or we could encode the string, pass it to Jansson, and let it
    error out; then we could produce our own diagnostics.

Which one of these do you prefer?  Currently, you opted for the 2nd
one.  It is not clear to me that the option you've chosen is better,
since (a) it relies on Jansson, and (b) it encodes strings which don't
need to be encoded.  OTOH, the check I propose in (a) means penalty
for every caller.  But then such penalties never averted you elsewhere
in your code, so I wonder why this case is suddenly so different?

It is true that if we believe Jansson's detection of invalid UTF-8,
and we assume that raw bytes in their current representation will
forever the only extensions of UTF-8 in Emacs, we could pass the
internal representation to Jansson.  Personally, I'm not sure we
should make such assumptions, but that's me.

> - We decode UTF-8 strings after receiving them from Jansson. Jansson guarantees to only ever emit
> well-formed UTF-8. Given that for well-formed UTF-8 strings, the UTF-8 representation and the Emacs
> representation are one and the same, we don't need decoding.

Once again: do we really want to rely on external libraries to always
DTRT and be bug-free?  We don't normally rely on external sources like
that.  The cost of decoding is not too high; the price users will pay
for Jansson's bugs will be much higher.

>    And second, encoding keeps the
>  encoding intact precisely because it is not a no-op: raw bytes are
>  held in buffer and string text as special multibyte sequences, not as
>  single bytes, so just copying them to output instead of encoding will
>  produce non-UTF-8 multibyte sequences.
> 
> That's the correct behavior, I think. JSON values must be valid Unicode strings, and raw bytes are not.

Neither are the internal representations of raw bytes, so what's your
point here?

>  >   /* We need to send a valid UTF-8 string.  We could encode `object'
>  >      but by not encoding it, we guarantee it's valid utf-8, even if
>  >      it contains eight-bit-bytes.  Of course, you can still send
>  >      manually-crafted junk by passing a unibyte string.  */
> 
>  If gnutls.c and dbusbind.c don't encode and decode text that comes
>  from and goes to outside, then they are buggy.
> 
> Not necessarily. As mentioned, the internal encoding of multibyte strings is even mentioned in the Lisp
> reference; and the above comment indicates that it's OK to use that information at least within the Emacs
> codebase.

I think that comment is based on a mistake, or maybe I don't really
understand it.  Internal representation is not in general valid UTF-8,
that's for sure.

And the fact that the internal representation is documented doesn't
mean we can draw the conclusions like that.  For starters, the
documentation doesn't tell all the story: the 2-byte representation of
raw bytes is not described there.

> Some parts are definitely encoded, but for example, there is c_hostname in Fgnutls_boot, which doesn't
> encode the user-supplied string.

That's a bug.

>  Well, I disagree with that conclusion.  Just look at all the calls to
>  decode_coding_*, encode_coding_*, DECODE_SYSTEM, ENCODE_SYSTEM, etc.,
>  and you will see where we do that.
> 
> We obviously do *some* encoding/decoding. But when interacting with third-party libraries, we seem to leave
> it out pretty frequently, if those libraries use UTF-8 as well.

Most if not all of those places are just bugs.  People who work mostly
on GNU/Linux tend to forget that not everything is UTF-8.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: String encoding in json.c
  2017-12-23 14:43 ` Eli Zaretskii
  2017-12-23 15:31   ` Philipp Stephani
@ 2017-12-24 20:48   ` Dmitry Gutov
  2017-12-25 16:21     ` Eli Zaretskii
  1 sibling, 1 reply; 14+ messages in thread
From: Dmitry Gutov @ 2017-12-24 20:48 UTC (permalink / raw)
  To: Eli Zaretskii, Philipp Stephani; +Cc: emacs-devel

Hi Eli,

On 12/23/17 4:43 PM, Eli Zaretskii wrote:

> The main purpose of Emacs is to be an efficient editor, so if we care
> about the slow-down of code conversions, we should first and foremost
> speed up reading and writing files.  JSON conversion, with all due
> respect to them, are not the main business for us, and I'm not even
> sure JSON objects will frequently be as large as files our users visit
> all the time.

FWIW, a significant use case for json.el is encoding a structure 
containing the whole contents of the current buffer as one of the values 
(several protocols for external code completion tools use this).



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: String encoding in json.c
  2017-12-24 20:48   ` Dmitry Gutov
@ 2017-12-25 16:21     ` Eli Zaretskii
  2017-12-25 20:51       ` Dmitry Gutov
  0 siblings, 1 reply; 14+ messages in thread
From: Eli Zaretskii @ 2017-12-25 16:21 UTC (permalink / raw)
  To: Dmitry Gutov; +Cc: p.stephani2, emacs-devel

> From: Dmitry Gutov <dgutov@yandex.ru>
> Date: Sun, 24 Dec 2017 22:48:17 +0200
> Cc: emacs-devel@gnu.org
> 
> > The main purpose of Emacs is to be an efficient editor, so if we care
> > about the slow-down of code conversions, we should first and foremost
> > speed up reading and writing files.  JSON conversion, with all due
> > respect to them, are not the main business for us, and I'm not even
> > sure JSON objects will frequently be as large as files our users visit
> > all the time.
> 
> FWIW, a significant use case for json.el is encoding a structure 
> containing the whole contents of the current buffer as one of the values 
> (several protocols for external code completion tools use this).

OK, but I presume the size of such buffers is rarely more than 1
MByte, say?  Files we visit are quite frequently much larger.  Which
is why I say that if the speed of en/decoding is crucial, we should
first and foremost work on speeding up file I/O.



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: String encoding in json.c
  2017-12-25 16:21     ` Eli Zaretskii
@ 2017-12-25 20:51       ` Dmitry Gutov
  2017-12-26  4:35         ` Eli Zaretskii
  0 siblings, 1 reply; 14+ messages in thread
From: Dmitry Gutov @ 2017-12-25 20:51 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: p.stephani2, emacs-devel

On 12/25/17 6:21 PM, Eli Zaretskii wrote:

> OK, but I presume the size of such buffers is rarely more than 1
> MByte, say?

Maybe not. This bug report: 
https://github.com/abingham/emacs-ycmd/issues/163 mentioned large 
translation units, but I'm not sure which exact sizes are in play. The 
reproduction scenario used a 300 KB example, encoding which was quite 
slow at the time.

> Files we visit are quite frequently much larger.  Which
> is why I say that if the speed of en/decoding is crucial, we should
> first and foremost work on speeding up file I/O.

If you mean I/O in general, maybe. I'm not sure which encoding speeds 
json.c allows us now.

But the JSON structure I was talking about is not written to a file. 
It's sent over the network (usually to a local TCP socket).

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: String encoding in json.c
  2017-12-25 20:51       ` Dmitry Gutov
@ 2017-12-26  4:35         ` Eli Zaretskii
  2017-12-26 21:50           ` Philipp Stephani
  0 siblings, 1 reply; 14+ messages in thread
From: Eli Zaretskii @ 2017-12-26  4:35 UTC (permalink / raw)
  To: Dmitry Gutov; +Cc: p.stephani2, emacs-devel

> Cc: p.stephani2@gmail.com, emacs-devel@gnu.org
> From: Dmitry Gutov <dgutov@yandex.ru>
> Date: Mon, 25 Dec 2017 22:51:22 +0200
> 
> On 12/25/17 6:21 PM, Eli Zaretskii wrote:
> 
> > OK, but I presume the size of such buffers is rarely more than 1
> > MByte, say?
> 
> Maybe not. This bug report: 
> https://github.com/abingham/emacs-ycmd/issues/163 mentioned large 
> translation units, but I'm not sure which exact sizes are in play. The 
> reproduction scenario used a 300 KB example, encoding which was quite 
> slow at the time.

I'd expect the Jansson configuration to do that at least 10 to 20
times faster.

> > Files we visit are quite frequently much larger.  Which
> > is why I say that if the speed of en/decoding is crucial, we should
> > first and foremost work on speeding up file I/O.
> 
> If you mean I/O in general, maybe.

No, I meant file I/O specifically.  Our implementation of encoding and
decoding is separate for each one of: (a) file I/O, (b) process and
network I/O, (c) string encoding/decoding, (d) file-name encoding and
decoding (maybe I forget some).

> I'm not sure which encoding speeds json.c allows us now.

Well, Philipp measured it, so maybe he could share the benchmarks and
the results.

> But the JSON structure I was talking about is not written to a file. 
> It's sent over the network (usually to a local TCP socket).

In that case, encoding it is not a wasted effort, since it would have
been otherwise encoded when we send it.



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: String encoding in json.c
  2017-12-23 18:18         ` Eli Zaretskii
@ 2017-12-26 21:42           ` Philipp Stephani
  2017-12-27 16:08             ` Eli Zaretskii
  0 siblings, 1 reply; 14+ messages in thread
From: Philipp Stephani @ 2017-12-26 21:42 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

[-- Attachment #1: Type: text/plain, Size: 10019 bytes --]

Eli Zaretskii <eliz@gnu.org> schrieb am Sa., 23. Dez. 2017 um 19:19 Uhr:

> > From: Philipp Stephani <p.stephani2@gmail.com>
> > Date: Sat, 23 Dec 2017 17:27:22 +0000
> > Cc: emacs-devel@gnu.org
> >
> > - We encode Lisp strings when passing them to Jansson. Jansson only
> accepts UTF-8 strings and fails (with
> > proper error reporting, not crashing) when encountering non-UTF-8
> strings. I think encoding can only make a
> > difference here for strings that contain sequences of bytes that are
> themselves valid UTF-8 code unit
> > sequences, such as "Ä\xC3\x84". This string is encoded as
> "\xC3\x84\xC3\x84" using utf-8-unix. (Note how
> > this is a case where encoding and decoding are not inverses of each
> other.) Without encoding, the string
> > contents will be \xC3\x84 plus two invalid 5-byte sequences. I think
> it's not obvious at all which interpretation is
> > correct; after all, "Ä\xC3\x84" is not equal to "ÄÄ", but the two
> strings now result in the same JSON
> > representation. This could be at least surprising, and I'd argue that
> the other behavior (raising an error) would
> > be more correct and more obvious.
>
> I think we need to take a step back and decide what would we want to
> do with strings which include raw bytes.  If we pass such strings to
> Jansson, it will just error out, right?


Yes


>   If so, then we could do one
> of the two:
>
>   . Check up front whether a Lisp string includes raw bytes, and if
>     it does, signal an error before even trying to encode it.  I think
>     find_charsets_in_text could be instrumental here; alternatively,
>     we could scan the string using BYTES_BY_CHAR_HEAD, looking for
>     either sequences longer than 4 bytes or 2-byte sequences whose
>     leading bytes are C0 or C1 (these are the raw bytes).
>
>   . Or we could encode the string, pass it to Jansson, and let it
>     error out; then we could produce our own diagnostics.
>

That's what we are currently doing.


>
> Which one of these do you prefer?


The third option: don't encode (pass SDATA directly) because we know that
valid Unicode sequences are represented as valid UTF-8 strings, and invalid
Unicode sequences as invalid UTF-8 strings, and that Jansson behaves
correctly in all cases.
Given otherwise equal behavior, I generally prefer the least complex
option, and "doing nothing" is simpler than "doing something".


> Currently, you opted for the 2nd
> one.  It is not clear to me that the option you've chosen is better,
> since (a) it relies on Jansson,


That's fine, because we only rely on documented and tested behavior. Doing
so is generally OK; if we couldn't rely on documented behavior, we couldn't
use external libraries (including glibc) at all.


> and (b) it encodes strings which don't
> need to be encoded.


True, that's why I argue we should remove the encoding step.


> OTOH, the check I propose in (a) means penalty
> for every caller.  But then such penalties never averted you elsewhere
> in your code, so I wonder why this case is suddenly so different?
>

I generally prefer interface clarity and defensive programming, i.e. I
don't want to introduce undefined behavior on unexpected user input, and I
prefer signaling errors over silently doing something subtly wrong. But
here the Jansson library already performs all the checks we need, so we
don't need to add equivalent duplicate checks.


>
> It is true that if we believe Jansson's detection of invalid UTF-8,
> and we assume that raw bytes in their current representation will
> forever the only extensions of UTF-8 in Emacs, we could pass the
> internal representation to Jansson.  Personally, I'm not sure we
> should make such assumptions, but that's me.
>

I think it's fine to make such assumptions.
- Jansson documents how it handles invalid UTF-8.
- Jansson includes multiple test cases that check for the behavior on
encountering invalid UTF-8.
- Emacs itself now also includes multiple test cases for such inputs.
- Jansson gets high scores in the nativejson-benchmark conformance tests
(the remaining failures are corner cases involving real numbers, which are
arguably not true errors and don't affect string handling).
- We don't need to assume that Emacs's internal encoding stays
UTF-8-compatible forever, but we can still rely on it. Given the importance
and widespread use of UTF-8, it's unlikely that our internal encoding will
have to change to something else within the next couple of years. Even if
the need to change the encoding should arise, the existing regression tests
should alert us immediately about what needs to change.
Emacs is a relatively monolithic codebase, where it's common for some
compilation units to rely on implementation details of other compilation
units. That's not super great, but also not a strong reason to artificially
restrict ourselves from using global knowledge about fundamental data types
such as strings. We expose SDATA and SBYTES in lisp.h, so why can't we say
what the bytes at SDATA actually contain?


>
> > - We decode UTF-8 strings after receiving them from Jansson. Jansson
> guarantees to only ever emit
> > well-formed UTF-8. Given that for well-formed UTF-8 strings, the UTF-8
> representation and the Emacs
> > representation are one and the same, we don't need decoding.
>
> Once again: do we really want to rely on external libraries to always
> DTRT and be bug-free?


Yes, we need to do that, otherwise we couldn't use external libraries at
all.


>   We don't normally rely on external sources like
> that.


We do so all the time. For example, we rely on malloc(123) actually
returning either NULL or a memory block of at least 123 bytes.


> The cost of decoding is not too high;


It's not extremely high, but significant. Users of JSON serialization such
as the Language Server Protocol or YCM regularly encode and decode large
JSON objects on every keystroke, so we need JSON functions to be fast. If
we can speed them up by *removing* code (and thus complexity), then we
should do it.


> the price users will pay
> for Jansson's bugs will be much higher.
>

We shouldn't add workarounds for bugs just because they could potentially
happen in the future. True, bugs are possible in any library, but we might
as well hit a bug in malloc, which would be far more disastrous, and we
don't proactively attempt to work around theoretical malloc bugs. If and
when we encounter a serialization bug in Jansson that would produce invalid
UTF-8, I'm more than happy to add workarounds, but not for non-existing
bugs.


>
> >    And second, encoding keeps the
> >  encoding intact precisely because it is not a no-op: raw bytes are
> >  held in buffer and string text as special multibyte sequences, not as
> >  single bytes, so just copying them to output instead of encoding will
> >  produce non-UTF-8 multibyte sequences.
> >
> > That's the correct behavior, I think. JSON values must be valid Unicode
> strings, and raw bytes are not.
>
> Neither are the internal representations of raw bytes, so what's your
> point here?
>

The point is that encoding a multibyte string containing a sequence of two
raw bytes can produce a valid UTF-8 string, while using the bytes directly
cannot.


>
> >  >   /* We need to send a valid UTF-8 string.  We could encode `object'
> >  >      but by not encoding it, we guarantee it's valid utf-8, even if
> >  >      it contains eight-bit-bytes.  Of course, you can still send
> >  >      manually-crafted junk by passing a unibyte string.  */
> >
> >  If gnutls.c and dbusbind.c don't encode and decode text that comes
> >  from and goes to outside, then they are buggy.
> >
> > Not necessarily. As mentioned, the internal encoding of multibyte
> strings is even mentioned in the Lisp
> > reference; and the above comment indicates that it's OK to use that
> information at least within the Emacs
> > codebase.
>
> I think that comment is based on a mistake, or maybe I don't really
> understand it.  Internal representation is not in general valid UTF-8,
> that's for sure.
>

Agreed, the comment should at least be reworded, e.g. "If OBJECT is a
well-formed Unicode scalar value sequence, the unencoded bytes range is a
valid UTF-8 string, so we don't need to encode it. If OBJECT is not
well-formed or unibyte, the function will return EINVAL instead of
exhibiting undefined behavior."


>
> And the fact that the internal representation is documented doesn't
> mean we can draw the conclusions like that.


Why? Clearly we can make use of documented information?


> For starters, the
> documentation doesn't tell all the story: the 2-byte representation of
> raw bytes is not described there.
>

What's the 2-byte representation?


>
> > Some parts are definitely encoded, but for example, there is c_hostname
> in Fgnutls_boot, which doesn't
> > encode the user-supplied string.
>
> That's a bug.
>

Maybe, maybe not. gnutls_server_name_set explicitly documents that the
hostname is interpreted as UTF-8 (presumably even on Windows), so if we can
rely on the UTF-8-ness of strings not encoding it is OK.


>
> >  Well, I disagree with that conclusion.  Just look at all the calls to
> >  decode_coding_*, encode_coding_*, DECODE_SYSTEM, ENCODE_SYSTEM, etc.,
> >  and you will see where we do that.
> >
> > We obviously do *some* encoding/decoding. But when interacting with
> third-party libraries, we seem to leave
> > it out pretty frequently, if those libraries use UTF-8 as well.
>
> Most if not all of those places are just bugs.  People who work mostly
> on GNU/Linux tend to forget that not everything is UTF-8.
>

Definitely true for files and processes, but if an API (such as GnuTLS or
Jansson) explicitly documents that it expects UTF-8, then we should be able
to rely on that.

[-- Attachment #2: Type: text/html, Size: 13301 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: String encoding in json.c
  2017-12-26  4:35         ` Eli Zaretskii
@ 2017-12-26 21:50           ` Philipp Stephani
  2017-12-27  2:00             ` Dmitry Gutov
  0 siblings, 1 reply; 14+ messages in thread
From: Philipp Stephani @ 2017-12-26 21:50 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel, Dmitry Gutov

[-- Attachment #1: Type: text/plain, Size: 1002 bytes --]

Eli Zaretskii <eliz@gnu.org> schrieb am Di., 26. Dez. 2017 um 05:35 Uhr:

>
> > I'm not sure which encoding speeds json.c allows us now.
>
> Well, Philipp measured it, so maybe he could share the benchmarks and
> the results.
>

json.el itself got some speedups from the time of that bug report, as
Dmitry mentioned in the bug.
My benchmark is a relatively dumb microbenchmark that calls the JSON
functions 20 times on the data files in nativejson-benchmark. Then I ran
that code through the `bench' program, which does some basic statistical
analysis. Serialization is in all cases more than 5 times faster, with a
high significance.

>
> > But the JSON structure I was talking about is not written to a file.
> > It's sent over the network (usually to a local TCP socket).
>
> In that case, encoding it is not a wasted effort, since it would have
> been otherwise encoded when we send it.
>

If that turns out to be a problem, we might consider returning a unibyte
string to avoid further encoding.

[-- Attachment #2: Type: text/html, Size: 1481 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: String encoding in json.c
  2017-12-26 21:50           ` Philipp Stephani
@ 2017-12-27  2:00             ` Dmitry Gutov
  0 siblings, 0 replies; 14+ messages in thread
From: Dmitry Gutov @ 2017-12-27  2:00 UTC (permalink / raw)
  To: Philipp Stephani, Eli Zaretskii; +Cc: emacs-devel

On 12/26/17 11:50 PM, Philipp Stephani wrote:
> 
> Serialization is in all cases more than 5 times faster, with a 
> high significance.

That's a pretty good result. Does the extra encoding step under 
discussion change this figure much?



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: String encoding in json.c
  2017-12-26 21:42           ` Philipp Stephani
@ 2017-12-27 16:08             ` Eli Zaretskii
  0 siblings, 0 replies; 14+ messages in thread
From: Eli Zaretskii @ 2017-12-27 16:08 UTC (permalink / raw)
  To: Philipp Stephani; +Cc: emacs-devel

> From: Philipp Stephani <p.stephani2@gmail.com>
> Date: Tue, 26 Dec 2017 21:42:54 +0000
> Cc: emacs-devel@gnu.org
> 
>  And the fact that the internal representation is documented doesn't
>  mean we can draw the conclusions like that. 
> 
> Why? Clearly we can make use of documented information?

Because we are not Microsoft: we do document some internal details
when we think this will help Lisp programmers and people who work on
the core.  We even have a chapter named "Internals" in the manual, and
that definitely doesn't mean those descriptions are a contract that
cannot be changed at will.

IOW, the ELisp manual is no more "official" and "reliable" than
comments in the code.

>  For starters, the
>  documentation doesn't tell all the story: the 2-byte representation of
>  raw bytes is not described there.
> 
> What's the 2-byte representation?

See CHAR_BYTE8_HEAD_P and its users.

(I didn't respond to your other points because I don't see this
discussion going anywhere where I could be of use.)

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2017-12-27 16:08 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2017-12-23 14:26 String encoding in json.c Philipp Stephani
2017-12-23 14:43 ` Eli Zaretskii
2017-12-23 15:31   ` Philipp Stephani
2017-12-23 15:53     ` Eli Zaretskii
2017-12-23 17:27       ` Philipp Stephani
2017-12-23 18:18         ` Eli Zaretskii
2017-12-26 21:42           ` Philipp Stephani
2017-12-27 16:08             ` Eli Zaretskii
2017-12-24 20:48   ` Dmitry Gutov
2017-12-25 16:21     ` Eli Zaretskii
2017-12-25 20:51       ` Dmitry Gutov
2017-12-26  4:35         ` Eli Zaretskii
2017-12-26 21:50           ` Philipp Stephani
2017-12-27  2:00             ` Dmitry Gutov

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).