From: Eli Zaretskii
Newsgroups: gmane.emacs.devel
Subject: Re: String encoding in json.c
Date: Sat, 23 Dec 2017 17:53:38 +0200
Message-ID: <83mv29jv99.fsf@gnu.org>
References: <83tvwhjyi5.fsf@gnu.org>
Reply-To: Eli Zaretskii
To: Philipp Stephani
Cc: emacs-devel@gnu.org
In-reply-to: (message from Philipp Stephani on Sat, 23 Dec 2017 15:31:06 +0000)

> From: Philipp Stephani
> Date: Sat, 23 Dec 2017 15:31:06 +0000
> Cc: emacs-devel@gnu.org
>
> The coding operations are "expensive no-ops" except when they aren't,
> and that is exactly when we need their "expensive" parts.
>
> In which case are they not no-ops?

When the input is not a valid UTF-8 sequence.  When that happens, we
produce a special representation of such raw bytes instead of
signaling EILSEQ and refusing to decode the input.  Encoding (if and
when it is done) then performs the opposite conversion, producing the
same single raw byte in the output stream.  This allows Emacs to
manipulate text that included invalid sequences without crashing,
because all the low-level primitives that walk buffer text and strings
by characters assume the internal representation of each character is
valid.

> Using utf-8-unix as encoding seems to keep the encoding intact.

First, you forget about decoding.  And second, encoding keeps the
encoding intact precisely because it is not a no-op: raw bytes are
held in buffer and string text as special multibyte sequences, not as
single bytes, so just copying them to output instead of encoding will
produce non-UTF-8 multibyte sequences.

> I've spot-checked some other code where we interface with external
> libraries, namely dbusbind.c and gnutls.c. In no cases I've found
> explicit coding operations (except for filenames, where the situation
> is different); these files always use SDATA directly. dbusbind.c even
> has the comment
>
> /* We need to send a valid UTF-8 string.  We could encode `object'
>    but by not encoding it, we guarantee it's valid utf-8, even if
>    it contains eight-bit-bytes.  Of course, you can still send
>    manually-crafted junk by passing a unibyte string.  */

If gnutls.c and dbusbind.c don't encode and decode text that comes
from and goes to outside, then they are buggy.  (At least for
gnutls.c, I think you are mistaken, because the encoding/decoding is
in process.c, see, e.g., read_process_output.)

> It's the *current* json.c (and emacs-module.c) that's inconsistent
> with the rest of the codebase.

Well, I disagree with that conclusion.  Just look at all the calls to
decode_coding_*, encode_coding_*, DECODE_SYSTEM, ENCODE_SYSTEM, etc.,
and you will see where we do that.
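
To make the round-trip argument above concrete, here is a small
sketch.  It is an analogy only, not Emacs code: it uses Python's
"surrogateescape" error handler, which follows a similar idea (an
invalid byte is decoded into a special placeholder and encoded back
into the same single byte).  The sample bytes and variable names are
made up for illustration, and Emacs's internal representation of raw
bytes is different from Python's.

```python
# Analogy only: Python's "surrogateescape" handler, not Emacs's coding systems.
raw = b"caf\xc3\xa9 \xff end"   # valid UTF-8 plus one stray byte 0xFF

# "Decoding": the invalid byte becomes a special placeholder character
# instead of raising an error and refusing the input.
text = raw.decode("utf-8", errors="surrogateescape")
placeholder = text[5]           # the character standing in for byte 0xFF

# "Encoding" performs the opposite conversion and restores the raw byte.
round_tripped = text.encode("utf-8", errors="surrogateescape")

# Skipping the inverse conversion does not work: the placeholder is not
# valid UTF-8 on its own, so a strict encoder rejects it.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print(repr(placeholder), round_tripped == raw, strict_ok)
```

The point of the sketch is only that the encoding step does real work
whenever the text contains such placeholders, which is the case where
treating it as a no-op would emit something other than the original
bytes.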