> From: Philipp Stephani <p.stephani2@gmail.com>
> Date: Sat, 23 Dec 2017 15:31:06 +0000
> Cc: emacs-devel@gnu.org
>
>  The coding operations are "expensive no-ops" except when they aren't,
>  and that is exactly when we need their "expensive" parts.
>
> In which case are they not no-ops?
When the input is not a valid UTF-8 sequence.  When that happens, we
produce a special representation of such raw bytes instead of
signaling EILSEQ and refusing to decode the input.  Encoding (if and
when it is done) then performs the opposite conversion, producing the
same single raw byte in the output stream.  This allows Emacs to
manipulate text that included invalid sequences without crashing,
because all the low-level primitives that walk buffer text and strings
by characters assume the internal representation of each character is
valid.
> Using utf-8-unix as encoding seems to keep the encoding intact.
First, you forget about decoding.  And second, encoding keeps the
encoding intact precisely because it is not a no-op: raw bytes are
held in buffer and string text as special multibyte sequences, not as
single bytes, so just copying them to output instead of encoding will
produce non-UTF-8 multibyte sequences.
> I've spot-checked some other code where we interface with external libraries, namely dbusbind.c and
> gnutls.c. In no cases I've found explicit coding operations (except for filenames, where the situation is
> different); these files always use SDATA directly. dbusbind.c even has the comment
>
>   /* We need to send a valid UTF-8 string.  We could encode `object'
>      but by not encoding it, we guarantee it's valid utf-8, even if
>      it contains eight-bit-bytes.  Of course, you can still send
>      manually-crafted junk by passing a unibyte string.  */
If gnutls.c and dbusbind.c don't encode and decode text that comes
from and goes to outside, then they are buggy.  (At least for
gnutls.c, I think you are mistaken, because the encoding/decoding is
in process.c, see, e.g., read_process_output.)
> It's the *current* json.c (and emacs-module.c) that's inconsistent
> with the rest of the codebase.
Well, I disagree with that conclusion.  Just look at all the calls to
decode_coding_*, encode_coding_*, DECODE_SYSTEM, ENCODE_SYSTEM, etc.,
and you will see where we do that.