From: Philipp Stephani <p.stephani2@gmail.com>
To: Eli Zaretskii <eliz@gnu.org>
Cc: emacs-devel@gnu.org
Subject: Re: String encoding in json.c
Date: Sat, 23 Dec 2017 17:27:22 +0000
Message-ID: <CAArVCkTgrMe0LqFXcsmTccvUWKcTK9L4mLAXtrxOQj7rwmMv1A@mail.gmail.com>
In-Reply-To: <83mv29jv99.fsf@gnu.org>

Eli Zaretskii <eliz@gnu.org> wrote on Sat, 23 Dec 2017 at 16:53:

> > From: Philipp Stephani <p.stephani2@gmail.com>
> > Date: Sat, 23 Dec 2017 15:31:06 +0000
> > Cc: emacs-devel@gnu.org
> >
> >  The coding operations are "expensive no-ops" except when they aren't,
> >  and that is exactly when we need their "expensive" parts.
> >
> > In which case are they not no-ops?
>
> When the input is not a valid UTF-8 sequence.  When that happens, we
> produce a special representation of such raw bytes instead of
> signaling EILSEQ and refusing to decode the input.  Encoding (if and
> when it is done) then performs the opposite conversion, producing the
> same single raw byte in the output stream.  This allows Emacs to
> manipulate text that included invalid sequences without crashing,
> because all the low-level primitives that walk buffer text and strings
> by characters assume the internal representation of each character is
> valid.
>

OK, thanks for the refresher. I was aware of the raw-byte
representation, but forgot exactly how it's handled during coding.


>
> > Using utf-8-unix as encoding seems to keep the encoding intact.
>
> First, you forget about decoding.


OK, let's treat encoding and decoding separately.

- We encode Lisp strings when passing them to Jansson. Jansson only accepts
UTF-8 strings and fails (with proper error reporting, not crashing) when
encountering non-UTF-8 strings. I think encoding can only make a difference
here for strings that contain sequences of raw bytes that are themselves
valid UTF-8 code unit sequences, such as "Ä\xC3\x84". This string is encoded
as "\xC3\x84\xC3\x84" using utf-8-unix. (Note that this is a case where
encoding and decoding are not inverses of each other.) Without encoding,
the string contents will be \xC3\x84 plus two invalid 2-byte sequences, one
per raw byte. I think it's not at all obvious which interpretation is
correct; after all, "Ä\xC3\x84" is not equal to "ÄÄ", but the two strings now
result in the same JSON representation. This is surprising at the very
least, and I'd argue that the other behavior (raising an error) would be more
correct and more obvious.

- We decode UTF-8 strings after receiving them from Jansson. Jansson
guarantees that it only ever emits well-formed UTF-8. Given that for
well-formed UTF-8 strings the UTF-8 representation and the Emacs
representation are one and the same, we don't need decoding.
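
To make the two cases concrete, here is a rough Python model of the behavior
described above. It is only a sketch, not Emacs or Jansson code: the helper
names (emacs_internal, encode_utf8_unix, is_strict_utf8) are made up for
illustration, and it assumes the internal form of a raw byte is the 2-byte
sequence with lead byte C0 or C1 described in Emacs's character.h.

```python
def emacs_internal(items):
    """Model the internal multibyte bytes of a Lisp string.

    Each item is either a str (one Unicode character) or an int in
    0x80..0xFF (a raw byte, i.e. an eight-bit character).
    """
    out = bytearray()
    for item in items:
        if isinstance(item, int):
            # Assumed layout for raw bytes: lead byte C0 or C1 plus one
            # continuation byte. Strict UTF-8 never uses these lead bytes.
            out += bytes([0xC0 | ((item >> 6) & 1), 0x80 | (item & 0x3F)])
        else:
            out += item.encode("utf-8")
    return bytes(out)


def encode_utf8_unix(items):
    """Model encoding with utf-8-unix: each raw byte is emitted as itself."""
    out = bytearray()
    for item in items:
        out += bytes([item]) if isinstance(item, int) else item.encode("utf-8")
    return bytes(out)


def is_strict_utf8(data):
    """Stand-in for the validity check a strict UTF-8 consumer performs."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


mixed = ["Ä", 0xC3, 0x84]  # the Lisp string "Ä\xC3\x84"
plain = ["Ä", "Ä"]         # the Lisp string "ÄÄ"

# With encoding, two different Lisp strings collapse to the same bytes.
print(encode_utf8_unix(mixed) == encode_utf8_unix(plain))
# Without encoding, the raw bytes make the data invalid, so it gets rejected.
print(is_strict_utf8(emacs_internal(mixed)))
# For strings without raw bytes, skipping the encoding changes nothing.
print(emacs_internal(plain) == encode_utf8_unix(plain))
```

In this model, passing the internal bytes through unencoded is what lets a
strict consumer tell the two strings apart and report an error for the one
containing raw bytes.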



>   And second, encoding keeps the
> encoding intact precisely because it is not a no-op: raw bytes are
> held in buffer and string text as special multibyte sequences, not as
> single bytes, so just copying them to output instead of encoding will
> produce non-UTF-8 multibyte sequences.
>

That's the correct behavior, I think. JSON values must be valid Unicode
strings, and raw bytes are not.


>
> > I've spot-checked some other code where we interface with external
> > libraries, namely dbusbind.c and gnutls.c. In no cases I've found
> > explicit coding operations (except for filenames, where the situation
> > is different); these files always use SDATA directly. dbusbind.c even
> > has the comment
> >
> >   /* We need to send a valid UTF-8 string.  We could encode `object'
> >      but by not encoding it, we guarantee it's valid utf-8, even if
> >      it contains eight-bit-bytes.  Of course, you can still send
> >      manually-crafted junk by passing a unibyte string.  */
>
> If gnutls.c and dbusbind.c don't encode and decode text that comes
> from and goes to outside, then they are buggy.


Not necessarily. As noted, the internal encoding of multibyte strings
is even documented in the Lisp reference, and the above comment indicates
that it's OK to rely on that information at least within the Emacs codebase.
BTW, that comment was added by Stefan in
commit e454a4a330cc6524cf0d2604b4fafc32d5bda795, where he removed an
explicit encoding step.


>   (At least for
> gnutls.c, I think you are mistaken, because the encoding/decoding is
> in process.c, see, e.g., read_process_output.)
>

Some parts are definitely encoded, but for example, there is c_hostname in
Fgnutls_boot, which doesn't encode the user-supplied string.


>
> > It's the *current* json.c (and emacs-module.c) that's inconsistent
> > with the rest of the codebase.
>
> Well, I disagree with that conclusion.  Just look at all the calls to
> decode_coding_*, encode_coding_*, DECODE_SYSTEM, ENCODE_SYSTEM, etc.,
> and you will see where we do that.
>

We obviously do *some* encoding/decoding. But when interacting with
third-party libraries, we seem to leave it out pretty frequently, if those
libraries use UTF-8 as well.

Thread overview: 14+ messages
2017-12-23 14:26 String encoding in json.c Philipp Stephani
2017-12-23 14:43 ` Eli Zaretskii
2017-12-23 15:31   ` Philipp Stephani
2017-12-23 15:53     ` Eli Zaretskii
2017-12-23 17:27       ` Philipp Stephani [this message]
2017-12-23 18:18         ` Eli Zaretskii
2017-12-26 21:42           ` Philipp Stephani
2017-12-27 16:08             ` Eli Zaretskii
2017-12-24 20:48   ` Dmitry Gutov
2017-12-25 16:21     ` Eli Zaretskii
2017-12-25 20:51       ` Dmitry Gutov
2017-12-26  4:35         ` Eli Zaretskii
2017-12-26 21:50           ` Philipp Stephani
2017-12-27  2:00             ` Dmitry Gutov
