From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!.POSTED!not-for-mail
From: Philipp Stephani <p.stephani2@gmail.com>
Newsgroups: gmane.emacs.devel
Subject: Re: [PATCH] Improve error reporting when serializing non-Unicode
	strings to JSON
Date: Sat, 23 Dec 2017 15:19:17 +0000
Message-ID: <CAArVCkT1tNki8D2V-gdwm8ixjRxYN6QNOPo36bv5-aFOAm+OKg@mail.gmail.com>
References: <20171222210031.30811-1-phst@google.com> <83efnllufm.fsf@gnu.org>
	<CAArVCkTCmhSmT_Yip1r99NvZwiB3hB0rMzkpvu3ame_tq4UMtA@mail.gmail.com>
	<83wp1dk18g.fsf@gnu.org>
	<CAArVCkTdrVaECPGnRWFQkoEEZQPN8w3yq16csCaeoj+289GD9g@mail.gmail.com>
	<83shc1jy3j.fsf@gnu.org>
NNTP-Posting-Host: blaine.gmane.org
Mime-Version: 1.0
Content-Type: multipart/alternative; boundary="001a11c012768206e205610375fc"
X-Trace: blaine.gmane.org 1514042299 9349 195.159.176.226 (23 Dec 2017 15:18:19 GMT)
X-Complaints-To: usenet@blaine.gmane.org
NNTP-Posting-Date: Sat, 23 Dec 2017 15:18:19 +0000 (UTC)
Cc: phst@google.com, emacs-devel@gnu.org
To: Eli Zaretskii <eliz@gnu.org>
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Sat Dec 23 16:18:14 2017
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Envelope-to: ged-emacs-devel@m.gmane.org
Original-Received: from lists.gnu.org ([208.118.235.17])
	by blaine.gmane.org with esmtp (Exim 4.84_2)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1eSlZ0-00023y-3c
	for ged-emacs-devel@m.gmane.org; Sat, 23 Dec 2017 16:18:14 +0100
Original-Received: from localhost ([::1]:43986 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1eSlay-0001ZE-GB
	for ged-emacs-devel@m.gmane.org; Sat, 23 Dec 2017 10:20:16 -0500
Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:51598)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <p.stephani2@gmail.com>) id 1eSlaG-0001YN-Gf
	for emacs-devel@gnu.org; Sat, 23 Dec 2017 10:19:33 -0500
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <p.stephani2@gmail.com>) id 1eSlaF-0007dI-8G
	for emacs-devel@gnu.org; Sat, 23 Dec 2017 10:19:32 -0500
Original-Received: from mail-qt0-x230.google.com ([2607:f8b0:400d:c0d::230]:46537)
	by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16)
	(Exim 4.71) (envelope-from <p.stephani2@gmail.com>)
	id 1eSlaC-0007Zp-Td; Sat, 23 Dec 2017 10:19:29 -0500
Original-Received: by mail-qt0-x230.google.com with SMTP id r39so39326489qtr.13;
	Sat, 23 Dec 2017 07:19:28 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025;
	h=mime-version:references:in-reply-to:from:date:message-id:subject:to
	:cc; bh=Eak/tV1XJNqU9CAN1Q9KjQdAbIRUxnHKFkwfOKpe2NU=;
	b=BfW9c6K/dWSfZmAO4r6/n7LGHGXh2EKNYZ5GOzqqqjIkmgDlxLKnZ2A4A1+uLvG29h
	CrI31G54/+LkpuM3A5eCMu+Fz5Q2Ey3R6PLDYAlNYbHdF2B+LdmqanUPeM/FPzdWE2Df
	WGOh/RhfPG2OnlQkeEEsWdWSZUOusFe8JextgE8P/CACog/eBjKkK5L3WZGVcMrjwEBv
	vZv/gNUnJxQ2IvEQ7eb9zAV2d9KrWvyw3n8x2mibyiy3QHfPEz4/n53OgRh2uVXNKMwR
	3FO6NVK+HTY8QKfypQKfFDfT8QCYcIPYIoQMJw25BPF4alvmOaswHlKOfU2norjhM399
	cZXQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
	d=1e100.net; s=20161025;
	h=x-gm-message-state:mime-version:references:in-reply-to:from:date
	:message-id:subject:to:cc;
	bh=Eak/tV1XJNqU9CAN1Q9KjQdAbIRUxnHKFkwfOKpe2NU=;
	b=UPIX6Z3dn94m2GWhtHgs6Rs8bxRI3kRLFqAUOyLAAqLUAvffdtEfEDnFlk74qpr0O+
	uiZk6LuFDKxtiTYggp4xep2m6IqLHMmXQW6JNx1JxSJOg2/VaLAFGCUauFlZesIb4M4m
	F6L/QGzGWd7oR9BrhgCcB5cPDP1FQLXjg7tjmUJORKR9apKD/3pkjcPOxe+52FrzeiQY
	xygOn9NzbRg58WUS274P7+5FMpdgq5vd1hpLXnuE7TEWhaVVrbUNbMtLJKBursbkn4Ra
	s5PkdalNXGkNPoSnuLpJng2szqzXOrY9U7YOSX/5abxN8nkAHjT9m2JeDshhQt+2siVv
	CDJw==
X-Gm-Message-State: AKGB3mIIAl8JmJ78kmIe7CnZFfgzCxzJdACze+jZPjaqyP0tNuqBeSSk
	oAtONpy2VWGvJXVg5p+IsgdRUAWNfxu2IC2pYEYXhw==
X-Google-Smtp-Source: ACJfBovDPqnqtHuRsteRVYW5hOOuNA69uhL2qu1gvem4oJYUOc4f/o8FejKxnc1q0TgYJWLdWNC7nmhJMI9jcSjScfc=
X-Received: by 10.200.46.149 with SMTP id h21mr23661860qta.73.1514042367803;
	Sat, 23 Dec 2017 07:19:27 -0800 (PST)
In-Reply-To: <83shc1jy3j.fsf@gnu.org>
X-detected-operating-system: by eggs.gnu.org: Genre and OS details not
	recognized.
X-Received-From: 2607:f8b0:400d:c0d::230
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.21
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/emacs-devel/>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Original-Sender: "Emacs-devel" <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Xref: news.gmane.org gmane.emacs.devel:221382
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/221382>

--001a11c012768206e205610375fc
Content-Type: text/plain; charset="UTF-8"

Eli Zaretskii <eliz@gnu.org> schrieb am Sa., 23. Dez. 2017 um 15:52 Uhr:

> > From: Philipp Stephani <p.stephani2@gmail.com>
> > Date: Sat, 23 Dec 2017 14:29:56 +0000
> > Cc: emacs-devel@gnu.org, phst@google.com
> >
> >  OK, but why do we need external functions for doing that?  What is
> >  missing in our own code to detect such a situation?
> >
> > Not much I think, it's just easiest to use Gnulib functions because they
> are well-documented, have a clean
> > interface, and are probably bug-free.
> > coding.c has check_utf_8, which is quite similar, but has an
> incompatible interface (it takes struct
> > coding_system objects) and also checks for embedded newlines, which
> isn't necessary here.
>
> So let's use check_utf_8, as its downsides don't sound serious to me,
>

Well it needs to be rewritten significantly to take a char*, length
argument instead of the coding_system struct.


> and OTOH using unistring functions will bloat Emacs


u8-check.c is just 77 LoC (including all boilerplate, comments, and empty
lines), so I don't think it blows up Emacs in any significant way.


> for the benefit of
> a single use case, not to mention create two different methods for
> doing the same job, which IMO is even more confusing to any newcomer
> to the Emacs internals.
>

Agreed it's somewhat confusing, but I think not too much. The two functions
have quite different use cases: check_utf_8 is a specialized function that
requires a coding system with significant set-up and is only used once (in
decode_coding_gap), while u8_check is a general-purpose function.
Having not much experience with coding.c, I find the functions in that file
much more confusing and harder to understand than the ones from
libunistring. The libunistring functions tend to have a single, clear
purpose, while the coding.c functions often do many different things at
once.


>
> Btw, doesn't find_charsets_in_text do the same job cleaner and
> quicker?  AFAIU, all you need is make sure there are no characters
> from the 2 eight-bit-* charsets in the text, or did I miss something?
>

What I need to check is one of the following:
- Is the initial string either a well-formed UTF-8 unibyte string, or a
multibyte string that represents a Unicode scalar value sequence?
- Is the encoded string a well-formed UTF-8 unibyte string?
Given my understanding of the implementation of coding.c, these two
criteria should be equivalent. (Unfortunately that doesn't seem to be
documented.) So I choose to implement the second check, which is easier and
allows delaying the check until we know we have to signal an error.

--001a11c012768206e205610375fc
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr"><br><br><div class=3D"gmail_quote"><div dir=3D"ltr">Eli Za=
retskii &lt;<a href=3D"mailto:eliz@gnu.org">eliz@gnu.org</a>&gt; schrieb am=
 Sa., 23. Dez. 2017 um 15:52=C2=A0Uhr:<br></div><blockquote class=3D"gmail_=
quote" style=3D"margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1=
ex">&gt; From: Philipp Stephani &lt;<a href=3D"mailto:p.stephani2@gmail.com=
" target=3D"_blank">p.stephani2@gmail.com</a>&gt;<br>
&gt; Date: Sat, 23 Dec 2017 14:29:56 +0000<br>
&gt; Cc: <a href=3D"mailto:emacs-devel@gnu.org" target=3D"_blank">emacs-dev=
el@gnu.org</a>, <a href=3D"mailto:phst@google.com" target=3D"_blank">phst@g=
oogle.com</a><br>
&gt;<br>
&gt;=C2=A0 OK, but why do we need external functions for doing that?=C2=A0 =
What is<br>
&gt;=C2=A0 missing in our own code to detect such a situation?<br>
&gt;<br>
&gt; Not much I think, it&#39;s just easiest to use Gnulib functions becaus=
e they are well-documented, have a clean<br>
&gt; interface, and are probably bug-free.<br>
&gt; coding.c has check_utf_8, which is quite similar, but has an incompati=
ble interface (it takes struct<br>
&gt; coding_system objects) and also checks for embedded newlines, which is=
n&#39;t necessary here.<br>
<br>
So let&#39;s use check_utf_8, as its downsides don&#39;t sound serious to m=
e,<br></blockquote><div><br></div><div>Well it needs to be rewritten signif=
icantly to take a char*, length argument instead of the coding_system struc=
t.</div><div>=C2=A0</div><blockquote class=3D"gmail_quote" style=3D"margin:=
0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
and OTOH using unistring functions will bloat Emacs</blockquote><div><br></=
div><div>u8-check.c is just 77 LoC (including all boilerplate, comments, an=
d empty lines), so I don&#39;t think it blows up Emacs in any significant w=
ay.</div><div>=C2=A0</div><blockquote class=3D"gmail_quote" style=3D"margin=
:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> for the benefit o=
f<br>
a single use case, not to mention create two different methods for<br>
doing the same job, which IMO is even more confusing to any newcomer<br>
to the Emacs internals.<br></blockquote><div><br></div><div>Agreed it&#39;s=
 somewhat confusing, but I think not too much. The two functions have quite=
 different use cases: check_utf_8 is a specialized function that requires a=
 coding system with significant set-up and is only used once (in decode_cod=
ing_gap), while u8_check is a general-purpose function.</div><div>Having no=
t much experience with coding.c, I find the functions in that file much mor=
e confusing and harder to understand than the ones from libunistring. The l=
ibunistring functions tend to have a single, clear purpose, while the codin=
g.c functions often do many different things at once.</div><div>=C2=A0</div=
><blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1=
px #ccc solid;padding-left:1ex">
<br>
Btw, doesn&#39;t find_charsets_in_text do the same job cleaner and<br>
quicker?=C2=A0 AFAIU, all you need is make sure there are no characters<br>
from the 2 eight-bit-* charsets in the text, or did I miss something?<br></=
blockquote><div><br></div><div>What I need to check is one of the following=
:</div><div>- Is the initial string either a well-formed UTF-8 unibyte stri=
ng, or a multibyte string that represents a Unicode scalar value sequence?<=
/div><div>- Is the encoded string a well-formed UTF-8 unibyte string?</div>=
<div>Given my understanding of the implementation of coding.c, these two cr=
iteria should be equivalent. (Unfortunately that doesn&#39;t seem to be doc=
umented.) So I choose to implement the second check, which is easier and al=
lows delaying the check until we know we have to signal an error.</div></di=
v></div>

--001a11c012768206e205610375fc--