From: Eli Zaretskii
Newsgroups: gmane.emacs.devel
Subject: Re: String encoding in json.c
Date: Sat, 23 Dec 2017 20:18:43 +0200
Message-ID: <83incxjojg.fsf@gnu.org>
References: <83tvwhjyi5.fsf@gnu.org> <83mv29jv99.fsf@gnu.org>
Reply-To: Eli Zaretskii
To: Philipp Stephani
Cc: emacs-devel@gnu.org
In-reply-to: (message from Philipp Stephani on Sat, 23 Dec 2017 17:27:22 +0000)

> From: Philipp Stephani
> Date: Sat, 23 Dec 2017 17:27:22 +0000
> Cc: emacs-devel@gnu.org
>
> - We encode Lisp strings when passing them to Jansson. Jansson only accepts UTF-8 strings and fails (with
> proper error reporting, not crashing) when encountering non-UTF-8 strings. I think encoding can only make a
> difference here for strings that contain sequences of bytes that are themselves valid UTF-8 code unit
> sequences, such as "Ä\xC3\x84". This string is encoded as "\xC3\x84\xC3\x84" using utf-8-unix. (Note how
> this is a case where encoding and decoding are not inverses of each other.) Without encoding, the string
> contents will be \xC3\x84 plus two invalid 5-byte sequences. I think it's not obvious at all which interpretation is
> correct; after all, "Ä\xC3\x84" is not equal to "ÄÄ", but the two strings now result in the same JSON
> representation. This could be at least surprising, and I'd argue that the other behavior (raising an error) would
> be more correct and more obvious.

I think we need to take a step back and decide what we want to do
with strings which include raw bytes. If we pass such strings to
Jansson, it will just error out, right? If so, then we could do one
of the two:

  . Check up front whether a Lisp string includes raw bytes, and if
    it does, signal an error before even trying to encode it.
    I think find_charsets_in_text could be instrumental here;
    alternatively, we could scan the string using BYTES_BY_CHAR_HEAD,
    looking for either sequences longer than 4 bytes or 2-byte
    sequences whose leading bytes are C0 or C1 (these are the raw
    bytes).

  . Or we could encode the string, pass it to Jansson, and let it
    error out; then we could produce our own diagnostics.

Which one of these do you prefer? Currently, you opted for the 2nd
one. It is not clear to me that the option you've chosen is better,
since (a) it relies on Jansson, and (b) it encodes strings which
don't need to be encoded. OTOH, the up-front check I propose in the
first alternative means a penalty for every caller. But then such
penalties never deterred you elsewhere in your code, so I wonder why
this case is suddenly so different?

It is true that if we believe Jansson's detection of invalid UTF-8,
and we assume that raw bytes in their current representation will
forever be the only extensions of UTF-8 in Emacs, we could pass the
internal representation to Jansson. Personally, I'm not sure we
should make such assumptions, but that's me.

> - We decode UTF-8 strings after receiving them from Jansson. Jansson guarantees to only ever emit
> well-formed UTF-8. Given that for well-formed UTF-8 strings, the UTF-8 representation and the Emacs
> representation are one and the same, we don't need decoding.

Once again: do we really want to rely on external libraries to always
DTRT and be bug-free? We don't normally rely on external sources
like that. The cost of decoding is not too high; the price users
will pay for Jansson's bugs will be much higher.

> And second, encoding keeps the
> encoding intact precisely because it is not a no-op: raw bytes are
> held in buffer and string text as special multibyte sequences, not as
> single bytes, so just copying them to output instead of encoding will
> produce non-UTF-8 multibyte sequences.
>
> That's the correct behavior, I think. JSON values must be valid Unicode strings, and raw bytes are not.

Neither are the internal representations of raw bytes, so what's your
point here?

> > /* We need to send a valid UTF-8 string. We could encode `object'
> > but by not encoding it, we guarantee it's valid utf-8, even if
> > it contains eight-bit-bytes. Of course, you can still send
> > manually-crafted junk by passing a unibyte string. */
>
> If gnutls.c and dbusbind.c don't encode and decode text that comes
> from and goes to outside, then they are buggy.
>
> Not necessarily. As mentioned, the internal encoding of multibyte strings is even mentioned in the Lisp
> reference; and the above comment indicates that it's OK to use that information at least within the Emacs
> codebase.

I think that comment is based on a mistake, or maybe I don't really
understand it. The internal representation is not, in general, valid
UTF-8, that's for sure.

And the fact that the internal representation is documented doesn't
mean we can draw conclusions like that. For starters, the
documentation doesn't tell the whole story: the 2-byte representation
of raw bytes is not described there.

> Some parts are definitely encoded, but for example, there is c_hostname in Fgnutls_boot, which doesn't
> encode the user-supplied string.

That's a bug.

> Well, I disagree with that conclusion. Just look at all the calls to
> decode_coding_*, encode_coding_*, DECODE_SYSTEM, ENCODE_SYSTEM, etc.,
> and you will see where we do that.
>
> We obviously do *some* encoding/decoding. But when interacting with third-party libraries, we seem to leave
> it out pretty frequently, if those libraries use UTF-8 as well.

Most if not all of those places are just bugs. People who work
mostly on GNU/Linux tend to forget that not everything is UTF-8.
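
For concreteness, the up-front scan mentioned earlier in this message
(sequences longer than 4 bytes, or 2-byte sequences led by C0 or C1)
might look roughly like the sketch below. It is only an illustration,
not code from json.c: both function names are hypothetical, seq_length
merely mirrors what BYTES_BY_CHAR_HEAD is understood to compute, and
the input is assumed to be well-formed internal multibyte text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for BYTES_BY_CHAR_HEAD: length in bytes of a
   multibyte sequence, judged from its lead byte alone.  */
static int
seq_length (unsigned char lead)
{
  if (!(lead & 0x80))
    return 1;
  if (!(lead & 0x20))
    return 2;
  if (!(lead & 0x10))
    return 3;
  if (!(lead & 0x08))
    return 4;
  return 5;
}

/* Hypothetical helper: return true if S (NBYTES bytes, assumed to be
   well-formed internal multibyte text) contains a sequence longer
   than 4 bytes, or a 2-byte sequence whose lead byte is C0 or C1.
   Only these two criteria are checked, nothing else.  */
static bool
has_raw_bytes_or_long_sequences (const unsigned char *s, size_t nbytes)
{
  assert (s != NULL || nbytes == 0);
  size_t i = 0;
  while (i < nbytes)
    {
      unsigned char lead = s[i];
      int len = seq_length (lead);
      if (len > 4 || lead == 0xC0 || lead == 0xC1)
        return true;
      /* Skip the whole sequence; trailing bytes are never inspected.  */
      i += len;
    }
  return false;
}
```

With such a check, "Ä" stored as C3 84 would pass, while a raw byte
\x84, which as far as I understand the representation is stored as
C0 84, would be flagged before any encoding is attempted. The cost is
one linear pass over the string for every caller.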