From: ludo@gnu.org (Ludovic Courtès)
Newsgroups: gmane.lisp.guile.devel
Subject: Re: [PATCHES] Discard BOMs at stream start for UTF-{8, 16, 32} encodings
Date: Thu, 31 Jan 2013 22:42:39 +0100
Message-ID: <87wqutjagg.fsf@gnu.org>
In-Reply-To: <87d2wmhsn4.fsf_-_@tines.lan> (Mark H. Weaver's message of "Wed, 30 Jan 2013 23:40:31 -0500")
To: Mark H Weaver
Cc: Andy Wingo, guile-devel@gnu.org

Hi!

Mark H Weaver wrote:

> I researched this some more, and discovered that removal of byte-order
> marks (BOMs) is the responsibility of iconv, which discards BOMs from
> the beginning of streams when using the UTF-16 or UTF-32 encodings, but
> *not* for UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE or any other encoding.
> It uses the BOM to determine the endianness of the stream, but other
> than that does *not* use it to guess the encoding, so there's no
> guesswork involved.  (Side note: iconv also inserts a BOM automatically
> when writing a stream using UTF-16 or UTF-32).

Are you talking about GNU iconv or iconv as specified by POSIX?
I can’t see any occurrence of “BOM” at .

> So thanks to iconv, we get UTF-{16,32} BOM removal for free.
> Unfortunately we have a nasty bug in 'get_iconv_codepoint' that leads to
> a buffer overrun and assertion failure when 'iconv' discards a BOM.

Good catch!

> The first patch below fixes this problem.  I ended up almost completely
> rewriting that function, partly because it was largely structured around
> a mistaken assumption that iconv will never consume input without
> producing output, and partly because it was quite inefficient (several
> unnecessary conditional branches in the loop) and IMO was rather
> difficult to read.

Great.  (I think ‘iconv’ semantics lead to tricky code, no matter what.)

>  get_iconv_codepoint (SCM port, scm_t_wchar *codepoint,
>                       char buf[SCM_MBCHAR_BUF_SIZE], size_t *len)
>  {

[...]

> -  for (output_size = 0, output = (char *) utf8_buf,
> -       bytes_consumed = 0, err = 0;
> -       err == 0 && output_size == 0
> -       && (bytes_consumed == 0 || byte_read != EOF);
> -       bytes_consumed++)
> +  for (;;)

Clarity is in the eye of the beholder, but to me this is a step
backwards.

[...]

> +  /* NOTE: The following test assumes that the only special values
> +     (other than SCM_ICONV_UNINITIALIZED) are for UTF-8.  */
> +  if (SCM_ICONV_SPECIAL_P (pt->input_cd))

Probably an indication that a more descriptive name is needed, as Andy
noted.

> @@ -2247,16 +2279,15 @@ scm_i_set_port_encoding_x (SCM port, const char *encoding)
>        new_output_cd = iconv_open (encoding, "UTF-8");
>        if (new_output_cd == (iconv_t) -1)

Should be SCM_ICONV_UNINITIALIZED?

Thanks again for the research and fixes!

Ludo’.
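
For reference, the case discussed above (iconv consuming input without producing output) can be sketched as a byte-at-a-time read loop. This is only an illustration, not the code from the patch: `struct byte_source`, `MAX_CHAR_BYTES` and `read_one_char_utf8` are hypothetical names, the input is an in-memory buffer rather than a Guile port, and the BOM handling described in the comments is assumed to match what Mark reports for the "UTF-16" and "UTF-32" encodings.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory byte source standing in for a port.  */
struct byte_source
{
  const unsigned char *data;
  size_t size;
  size_t pos;
};

/* Assumed upper bound on the bytes of one character, in or out.  */
#define MAX_CHAR_BYTES 8

/* Feed bytes from SRC to CD one at a time until iconv produces output.
   Store the UTF-8 bytes of one character in OUT (MAX_CHAR_BYTES bytes)
   and return their count, 0 at end of input, or -1 on error.  */
static int
read_one_char_utf8 (iconv_t cd, struct byte_source *src, char *out)
{
  char pending[MAX_CHAR_BYTES];
  size_t pending_len = 0;

  assert (src->pos <= src->size);

  for (;;)
    {
      if (pending_len == sizeof pending)
        return -1;                          /* sequence too long */
      if (src->pos == src->size)
        return pending_len == 0 ? 0 : -1;   /* EOF; error if mid-character */

      pending[pending_len++] = (char) src->data[src->pos++];

      char *in = pending;
      size_t in_left = pending_len;
      char *outp = out;
      size_t out_left = MAX_CHAR_BYTES;

      size_t rc = iconv (cd, &in, &in_left, &outp, &out_left);
      if (rc == (size_t) -1 && errno != EINVAL)
        return -1;                          /* EILSEQ or E2BIG */

      /* Keep only the bytes that iconv did not consume.  */
      memmove (pending, in, in_left);
      pending_len = in_left;

      if (outp != out)
        return (int) (outp - out);          /* one character decoded */

      /* No output yet: either the sequence is still incomplete (EINVAL),
         or iconv consumed input silently, for instance a discarded BOM.
         In both cases the right move is to read another byte.  */
    }
}
```

The point of the sketch is that "no output" is not treated as an error and the pending buffer is recomputed from what iconv reports as unconsumed, so a silently discarded BOM neither overruns the buffer nor trips an assertion.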