unofficial mirror of emacs-devel@gnu.org 
From: Eli Zaretskii <eliz@gnu.org>
To: Philipp Stephani <p.stephani2@gmail.com>
Cc: emacs-devel@gnu.org
Subject: Re: String encoding in json.c
Date: Sat, 23 Dec 2017 17:53:38 +0200
Message-ID: <83mv29jv99.fsf@gnu.org>
In-Reply-To: <CAArVCkQCbuE4o_oYyXRc-vjS0ppHLW5cL_wRLMzhT+iFqYUZRA@mail.gmail.com> (message from Philipp Stephani on Sat, 23 Dec 2017 15:31:06 +0000)

> From: Philipp Stephani <p.stephani2@gmail.com>
> Date: Sat, 23 Dec 2017 15:31:06 +0000
> Cc: emacs-devel@gnu.org
> 
>  The coding operations are "expensive no-ops" except when they aren't,
>  and that is exactly when we need their "expensive" parts.
> 
> In which case are they not no-ops?

When the input is not a valid UTF-8 sequence.  When that happens, we
produce a special representation of such raw bytes instead of
signaling EILSEQ and refusing to decode the input.  Encoding (if and
when it is done) then performs the opposite conversion, producing the
same single raw byte in the output stream.  This allows Emacs to
manipulate text that includes invalid sequences without crashing,
because all the low-level primitives that walk buffer text and strings
by characters assume the internal representation of each character is
valid.
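
To make the decoding step concrete, here is a minimal sketch of a toy
decoder, not the code Emacs actually uses.  It assumes the raw-byte
representation is a two-byte sequence led by 0xC0 or 0xC1 (which, as far
as I recall, is what character.h does); the names utf8_seq_len and
toy_decode are hypothetical, and the validity check is stricter and
simpler than the real decoder.

```c
#include <assert.h>
#include <stddef.h>

/* Return the length of the valid UTF-8 sequence at S (N bytes
   available), or 0 if S does not start a valid sequence.  */
static size_t
utf8_seq_len (const unsigned char *s, size_t n)
{
  size_t len;
  unsigned long cp, min;

  if (n == 0)
    return 0;
  if (s[0] < 0x80)
    return 1;
  else if (s[0] >= 0xC2 && s[0] <= 0xDF)
    { len = 2; cp = s[0] & 0x1F; min = 0x80; }
  else if (s[0] >= 0xE0 && s[0] <= 0xEF)
    { len = 3; cp = s[0] & 0x0F; min = 0x800; }
  else if (s[0] >= 0xF0 && s[0] <= 0xF4)
    { len = 4; cp = s[0] & 0x07; min = 0x10000; }
  else
    return 0;			/* stray trail byte, 0xC0, 0xC1, 0xF5.. */

  if (n < len)
    return 0;			/* truncated sequence */
  for (size_t i = 1; i < len; i++)
    {
      if ((s[i] & 0xC0) != 0x80)
	return 0;		/* not a continuation byte */
      cp = (cp << 6) | (s[i] & 0x3F);
    }
  /* Reject overlong forms, surrogates, and out-of-range values.  */
  if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
    return 0;
  return len;
}

/* Toy decoder: copy valid sequences unchanged and turn every other
   byte B (always >= 0x80) into the assumed two-byte raw-byte form.
   OUT must have room for 2 * N bytes.  Returns the bytes written.  */
static size_t
toy_decode (const unsigned char *in, size_t n, unsigned char *out)
{
  size_t i = 0, o = 0;

  assert (in && out);
  while (i < n)
    {
      size_t len = utf8_seq_len (in + i, n - i);
      if (len > 0)
	for (size_t k = 0; k < len; k++)
	  out[o++] = in[i++];
      else
	{
	  unsigned char b = in[i++];
	  out[o++] = 0xC0 | ((b >> 6) & 1);	/* lead byte: 0xC0 or 0xC1 */
	  out[o++] = 0x80 | (b & 0x3F);	/* low six bits of B */
	}
    }
  return o;
}
```

The point of the sketch is that the input is never rejected: an invalid
byte becomes a well-formed two-byte unit, so code that walks the result
character by character always finds a complete sequence.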

> Using utf-8-unix as encoding seems to keep the encoding intact.

First, you forget about decoding.  And second, encoding keeps the
encoding intact precisely because it is not a no-op: raw bytes are
held in buffer and string text as special multibyte sequences, not as
single bytes, so just copying them to output instead of encoding will
produce non-UTF-8 multibyte sequences.
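
The encoding direction can be sketched the same way.  This is a toy
under the same assumption as before (raw bytes held internally as a
two-byte sequence led by 0xC0 or 0xC1), with a hypothetical name
toy_encode; it is not the encoder Emacs uses.

```c
#include <assert.h>
#include <stddef.h>

/* Toy encoder: copy ordinary bytes unchanged and collapse each assumed
   raw-byte pair (0xC0 or 0xC1 followed by a continuation byte) back
   into the single byte it stands for.  OUT must have room for N bytes.
   Returns the number of bytes written.  */
static size_t
toy_encode (const unsigned char *in, size_t n, unsigned char *out)
{
  size_t i = 0, o = 0;

  assert (in && out);
  while (i < n)
    {
      if ((in[i] == 0xC0 || in[i] == 0xC1)
	  && i + 1 < n
	  && (in[i + 1] & 0xC0) == 0x80)
	{
	  /* Bit 6 comes from the lead byte, bits 0..5 from the trail.  */
	  out[o++] = (unsigned char) (0x80 | ((in[i] & 1) << 6)
				      | (in[i + 1] & 0x3F));
	  i += 2;
	}
      else
	out[o++] = in[i++];
    }
  return o;
}
```

For internal text 'a' 0xC1 0xBF 0xC3 0xA9, this yields 'a' 0xFF 0xC3
0xA9, i.e. the original raw byte.  A plain copy would instead emit 0xC1
0xBF, and 0xC1 can never appear in valid UTF-8, so the receiver gets
neither the original byte nor valid UTF-8.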

> I've spot-checked some other code where we interface with external libraries, namely dbusbind.c and
> gnutls.c. In no case have I found explicit coding operations (except for filenames, where the situation is
> different); these files always use SDATA directly. dbusbind.c even has the comment
> 
>   /* We need to send a valid UTF-8 string.  We could encode `object'
>      but by not encoding it, we guarantee it's valid utf-8, even if
>      it contains eight-bit-bytes.  Of course, you can still send
>      manually-crafted junk by passing a unibyte string.  */

If gnutls.c and dbusbind.c don't encode and decode text that comes
from and goes to outside, then they are buggy.  (At least for
gnutls.c, I think you are mistaken, because the encoding/decoding is
in process.c, see, e.g., read_process_output.)

> It's the *current* json.c (and emacs-module.c) that's inconsistent
> with the rest of the codebase.

Well, I disagree with that conclusion.  Just look at all the calls to
decode_coding_*, encode_coding_*, DECODE_SYSTEM, ENCODE_SYSTEM, etc.,
and you will see where we do that.

