From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!.POSTED!not-for-mail From: Philipp Stephani Newsgroups: gmane.emacs.devel Subject: Re: [PATCH] Improve error reporting when serializing non-Unicode strings to JSON Date: Sat, 23 Dec 2017 15:19:17 +0000 Message-ID: References: <20171222210031.30811-1-phst@google.com> <83efnllufm.fsf@gnu.org> <83wp1dk18g.fsf@gnu.org> <83shc1jy3j.fsf@gnu.org> NNTP-Posting-Host: blaine.gmane.org Mime-Version: 1.0 Content-Type: multipart/alternative; boundary="001a11c012768206e205610375fc" X-Trace: blaine.gmane.org 1514042299 9349 195.159.176.226 (23 Dec 2017 15:18:19 GMT) X-Complaints-To: usenet@blaine.gmane.org NNTP-Posting-Date: Sat, 23 Dec 2017 15:18:19 +0000 (UTC) Cc: phst@google.com, emacs-devel@gnu.org To: Eli Zaretskii Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Sat Dec 23 16:18:14 2017 Return-path: Envelope-to: ged-emacs-devel@m.gmane.org Original-Received: from lists.gnu.org ([208.118.235.17]) by blaine.gmane.org with esmtp (Exim 4.84_2) (envelope-from ) id 1eSlZ0-00023y-3c for ged-emacs-devel@m.gmane.org; Sat, 23 Dec 2017 16:18:14 +0100 Original-Received: from localhost ([::1]:43986 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1eSlay-0001ZE-GB for ged-emacs-devel@m.gmane.org; Sat, 23 Dec 2017 10:20:16 -0500 Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:51598) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1eSlaG-0001YN-Gf for emacs-devel@gnu.org; Sat, 23 Dec 2017 10:19:33 -0500 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1eSlaF-0007dI-8G for emacs-devel@gnu.org; Sat, 23 Dec 2017 10:19:32 -0500 Original-Received: from mail-qt0-x230.google.com ([2607:f8b0:400d:c0d::230]:46537) by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16) (Exim 4.71) (envelope-from ) id 1eSlaC-0007Zp-Td; Sat, 23 Dec 2017 10:19:29 -0500 Original-Received: by mail-qt0-x230.google.com with SMTP id r39so39326489qtr.13; Sat, 23 Dec 2017 07:19:28 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=Eak/tV1XJNqU9CAN1Q9KjQdAbIRUxnHKFkwfOKpe2NU=; b=BfW9c6K/dWSfZmAO4r6/n7LGHGXh2EKNYZ5GOzqqqjIkmgDlxLKnZ2A4A1+uLvG29h CrI31G54/+LkpuM3A5eCMu+Fz5Q2Ey3R6PLDYAlNYbHdF2B+LdmqanUPeM/FPzdWE2Df WGOh/RhfPG2OnlQkeEEsWdWSZUOusFe8JextgE8P/CACog/eBjKkK5L3WZGVcMrjwEBv vZv/gNUnJxQ2IvEQ7eb9zAV2d9KrWvyw3n8x2mibyiy3QHfPEz4/n53OgRh2uVXNKMwR 3FO6NVK+HTY8QKfypQKfFDfT8QCYcIPYIoQMJw25BPF4alvmOaswHlKOfU2norjhM399 cZXQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=Eak/tV1XJNqU9CAN1Q9KjQdAbIRUxnHKFkwfOKpe2NU=; b=UPIX6Z3dn94m2GWhtHgs6Rs8bxRI3kRLFqAUOyLAAqLUAvffdtEfEDnFlk74qpr0O+ uiZk6LuFDKxtiTYggp4xep2m6IqLHMmXQW6JNx1JxSJOg2/VaLAFGCUauFlZesIb4M4m F6L/QGzGWd7oR9BrhgCcB5cPDP1FQLXjg7tjmUJORKR9apKD/3pkjcPOxe+52FrzeiQY xygOn9NzbRg58WUS274P7+5FMpdgq5vd1hpLXnuE7TEWhaVVrbUNbMtLJKBursbkn4Ra s5PkdalNXGkNPoSnuLpJng2szqzXOrY9U7YOSX/5abxN8nkAHjT9m2JeDshhQt+2siVv CDJw== X-Gm-Message-State: AKGB3mIIAl8JmJ78kmIe7CnZFfgzCxzJdACze+jZPjaqyP0tNuqBeSSk oAtONpy2VWGvJXVg5p+IsgdRUAWNfxu2IC2pYEYXhw== X-Google-Smtp-Source: ACJfBovDPqnqtHuRsteRVYW5hOOuNA69uhL2qu1gvem4oJYUOc4f/o8FejKxnc1q0TgYJWLdWNC7nmhJMI9jcSjScfc= X-Received: by 10.200.46.149 with SMTP id h21mr23661860qta.73.1514042367803; Sat, 23 Dec 2017 07:19:27 -0800 (PST) In-Reply-To: <83shc1jy3j.fsf@gnu.org> X-detected-operating-system: by eggs.gnu.org: Genre and OS details not recognized. X-Received-From: 2607:f8b0:400d:c0d::230 X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.1.21 Precedence: list List-Id: "Emacs development discussions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Original-Sender: "Emacs-devel" Xref: news.gmane.org gmane.emacs.devel:221382 Archived-At: --001a11c012768206e205610375fc Content-Type: text/plain; charset="UTF-8" Eli Zaretskii schrieb am Sa., 23. Dez. 2017 um 15:52 Uhr: > > From: Philipp Stephani > > Date: Sat, 23 Dec 2017 14:29:56 +0000 > > Cc: emacs-devel@gnu.org, phst@google.com > > > > OK, but why do we need external functions for doing that? What is > > missing in our own code to detect such a situation? > > > > Not much I think, it's just easiest to use Gnulib functions because they > are well-documented, have a clean > > interface, and are probably bug-free. > > coding.c has check_utf_8, which is quite similar, but has an > incompatible interface (it takes struct > > coding_system objects) and also checks for embedded newlines, which > isn't necessary here. > > So let's use check_utf_8, as its downsides don't sound serious to me, > Well it needs to be rewritten significantly to take a char*, length argument instead of the coding_system struct. > and OTOH using unistring functions will bloat Emacs u8-check.c is just 77 LoC (including all boilerplate, comments, and empty lines), so I don't think it blows up Emacs in any significant way. > for the benefit of > a single use case, not to mention create two different methods for > doing the same job, which IMO is even more confusing to any newcomer > to the Emacs internals. > Agreed it's somewhat confusing, but I think not too much. The two functions have quite different use cases: check_utf_8 is a specialized function that requires a coding system with significant set-up and is only used once (in decode_coding_gap), while u8_check is a general-purpose function. Having not much experience with coding.c, I find the functions in that file much more confusing and harder to understand than the ones from libunistring. The libunistring functions tend to have a single, clear purpose, while the coding.c functions often do many different things at once. > > Btw, doesn't find_charsets_in_text do the same job cleaner and > quicker? AFAIU, all you need is make sure there are no characters > from the 2 eight-bit-* charsets in the text, or did I miss something? > What I need to check is one of the following: - Is the initial string either a well-formed UTF-8 unibyte string, or a multibyte string that represents a Unicode scalar value sequence? - Is the encoded string a well-formed UTF-8 unibyte string? Given my understanding of the implementation of coding.c, these two criteria should be equivalent. (Unfortunately that doesn't seem to be documented.) So I choose to implement the second check, which is easier and allows delaying the check until we know we have to signal an error. --001a11c012768206e205610375fc Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable


Eli Za= retskii <eliz@gnu.org> schrieb am= Sa., 23. Dez. 2017 um 15:52=C2=A0Uhr:
> From: Philipp Stephani <p.stephani2@gmail.com>
> Date: Sat, 23 Dec 2017 14:29:56 +0000
> Cc: emacs-dev= el@gnu.org, phst@g= oogle.com
>
>=C2=A0 OK, but why do we need external functions for doing that?=C2=A0 = What is
>=C2=A0 missing in our own code to detect such a situation?
>
> Not much I think, it's just easiest to use Gnulib functions becaus= e they are well-documented, have a clean
> interface, and are probably bug-free.
> coding.c has check_utf_8, which is quite similar, but has an incompati= ble interface (it takes struct
> coding_system objects) and also checks for embedded newlines, which is= n't necessary here.

So let's use check_utf_8, as its downsides don't sound serious to m= e,

Well it needs to be rewritten signif= icantly to take a char*, length argument instead of the coding_system struc= t.
=C2=A0
and OTOH using unistring functions will bloat Emacs

u8-check.c is just 77 LoC (including all boilerplate, comments, an= d empty lines), so I don't think it blows up Emacs in any significant w= ay.
=C2=A0
for the benefit o= f
a single use case, not to mention create two different methods for
doing the same job, which IMO is even more confusing to any newcomer
to the Emacs internals.

Agreed it's= somewhat confusing, but I think not too much. The two functions have quite= different use cases: check_utf_8 is a specialized function that requires a= coding system with significant set-up and is only used once (in decode_cod= ing_gap), while u8_check is a general-purpose function.
Having no= t much experience with coding.c, I find the functions in that file much mor= e confusing and harder to understand than the ones from libunistring. The l= ibunistring functions tend to have a single, clear purpose, while the codin= g.c functions often do many different things at once.
=C2=A0

Btw, doesn't find_charsets_in_text do the same job cleaner and
quicker?=C2=A0 AFAIU, all you need is make sure there are no characters
from the 2 eight-bit-* charsets in the text, or did I miss something?

What I need to check is one of the following= :
- Is the initial string either a well-formed UTF-8 unibyte stri= ng, or a multibyte string that represents a Unicode scalar value sequence?<= /div>
- Is the encoded string a well-formed UTF-8 unibyte string?
=
Given my understanding of the implementation of coding.c, these two cr= iteria should be equivalent. (Unfortunately that doesn't seem to be doc= umented.) So I choose to implement the second check, which is easier and al= lows delaying the check until we know we have to signal an error.
--001a11c012768206e205610375fc--