From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Mark H Weaver <mhw@netris.org>
Newsgroups: gmane.lisp.guile.devel
Subject: Re: Reducing iconv-induced memory usage
Date: Tue, 26 Apr 2011 23:47:41 -0400
Message-ID: <877hagxsma.fsf@netris.org>
References: <87r58oyb07.fsf@gnu.org> <87ei4oy6ta.fsf@gnu.org>
NNTP-Posting-Host: lo.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
X-Trace: dough.gmane.org 1303876080 4935 80.91.229.12 (27 Apr 2011 03:48:00 GMT)
X-Complaints-To: usenet@dough.gmane.org
NNTP-Posting-Date: Wed, 27 Apr 2011 03:48:00 +0000 (UTC)
Cc: guile-devel@gnu.org
To: ludo@gnu.org (Ludovic =?utf-8?Q?Court=C3=A8s?=)
Original-X-From: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org Wed Apr 27 05:47:55 2011
Return-path: <guile-devel-bounces+guile-devel=m.gmane.org@gnu.org>
Envelope-to: guile-devel@m.gmane.org
Original-Received: from lists.gnu.org ([140.186.70.17])
	by lo.gmane.org with esmtp (Exim 4.69)
	(envelope-from <guile-devel-bounces+guile-devel=m.gmane.org@gnu.org>)
	id 1QEvjH-00015z-Ht
	for guile-devel@m.gmane.org; Wed, 27 Apr 2011 05:47:55 +0200
Original-Received: from localhost ([::1]:56304 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <guile-devel-bounces+guile-devel=m.gmane.org@gnu.org>)
	id 1QEvjH-0004Sh-2g
	for guile-devel@m.gmane.org; Tue, 26 Apr 2011 23:47:55 -0400
Original-Received: from eggs.gnu.org ([140.186.70.92]:41631)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <mhw@netris.org>) id 1QEvjE-0004Sa-3I
	for guile-devel@gnu.org; Tue, 26 Apr 2011 23:47:53 -0400
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <mhw@netris.org>) id 1QEvjD-00056L-4k
	for guile-devel@gnu.org; Tue, 26 Apr 2011 23:47:52 -0400
Original-Received: from world.peace.net ([96.39.62.75]:42817)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <mhw@netris.org>)
	id 1QEvjB-000567-JW; Tue, 26 Apr 2011 23:47:49 -0400
Original-Received: from ip68-9-118-38.ri.ri.cox.net ([68.9.118.38] helo=freedomincluded)
	by world.peace.net with esmtpa (Exim 4.69)
	(envelope-from <mhw@netris.org>)
	id 1QEvj5-0003Dj-82; Tue, 26 Apr 2011 23:47:43 -0400
Original-Received: from mhw by freedomincluded with local (Exim 4.69)
	(envelope-from <mhw@netris.org>)
	id 1QEvj3-0002BF-SZ; Tue, 26 Apr 2011 23:47:41 -0400
In-Reply-To: <87ei4oy6ta.fsf@gnu.org> ("Ludovic =?utf-8?Q?Court=C3=A8s=22'?=
	=?utf-8?Q?s?= message of "Wed, 27 Apr 2011 00:41:05 +0200")
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.3 (gnu/linux)
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6 (newer, 3)
X-Received-From: 96.39.62.75
X-BeenThere: guile-devel@gnu.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: "Developers list for Guile,
	the GNU extensibility library" <guile-devel.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/guile-devel>,
	<mailto:guile-devel-request@gnu.org?subject=unsubscribe>
List-Archive: </archive/html/guile-devel>
List-Post: <mailto:guile-devel@gnu.org>
List-Help: <mailto:guile-devel-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/guile-devel>,
	<mailto:guile-devel-request@gnu.org?subject=subscribe>
Errors-To: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org
Original-Sender: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.lisp.guile.devel:12358
Archived-At: <http://permalink.gmane.org/gmane.lisp.guile.devel/12358>

Hi Ludovic!

ludo@gnu.org (Ludovic Court=C3=A8s) writes:
> So, here=E2=80=99s the patch.
>
> It also makes UTF-8 input ~30% faster according to ports.bm (which
> doesn=E2=80=99t benchmark output):

Thanks for working on this.  I haven't yet had time to fully review this
patch, but here I will document the problems I see so far.

First of all, while looking at this patch, I've discovered another
problem in ports.c: scm_char_ready_p does not consider the possibility
of multibyte characters, and returns #t whenever there is at least one
byte ready.

> -/* Read a codepoint from PORT and return it in *CODEPOINT.  Fill BUF
> -   with the byte representation of the codepoint in PORT's encoding, and
> -   set *LEN to the length in bytes of that representation.  Return 0 on
> -   success and an errno value on error.  */
> +/* Read a UTF-8 sequence from PORT.  On success, return 0 and set
> +   *CODEPOINT to the codepoint that was read, fill BUF with its UTF-8
> +   representation, and set *LEN to the length in bytes.  Return
> +   `EILSEQ' on error.  */
>  static int
> -get_codepoint (SCM port, scm_t_wchar *codepoint,
> -	       char buf[SCM_MBCHAR_BUF_SIZE], size_t *len)
> +get_utf8_codepoint (SCM port, scm_t_wchar *codepoint,
> +		    scm_t_uint8 buf[SCM_MBCHAR_BUF_SIZE], size_t *len)
> +{
> +  int byte;
> +
> +  *len =3D 0;
> +
> +  byte =3D scm_get_byte_or_eof (port);
> +  if (byte =3D=3D EOF)
> +    {
> +      *codepoint =3D EOF;
> +      return 0;
> +    }
> +
> +  buf[0] =3D (scm_t_uint8) byte;
> +  *len =3D 1;
> +
> +  if (buf[0] <=3D 0x7f)
> +    *codepoint =3D buf[0];
> +  else if ((buf[0] & 0xe0) =3D=3D 0xc0)
> +    {
> +      byte =3D scm_get_byte_or_eof (port);
> +      if (byte =3D=3D EOF || ((byte & 0xc0) !=3D 0x80))
> +	goto invalid_seq;
> +
> +      buf[1] =3D (scm_t_uint8) byte;
> +      *len =3D 2;
> +
> +      *codepoint =3D ((scm_t_wchar) buf[0] & 0x1f) << 6UL
> +	| (buf[1] & 0x3f);
> +    }

The code here would be sufficient for UTF-8 that is known valid, but
when reading from a port we must check for ill-formed UTF-8.

Unicode requires that we reject as ill-formed any UTF-8 byte sequence in
non-shortest form.  For example, we must reject the byte sequence
0xC1 0x80 which a permissive reader would read as 0x40, since obviously
that code point can be encoded as a single byte in UTF-8.

We must also reject any UTF-8 byte sequence that corresponds to a
surrogate code point (U+D800..U+DFFF), or to a code point greater than
U+10FFFF.

Table 3.7 of the Unicode 6.0.0 standard, reproduced below, concisely
shows all well-formed UTF-8 byte sequences.  The asterisks highlight
continuation bytes that are constrained to a smaller range than the
usual 80..BF.

   code points       byte[0]  byte[1]   byte[2]  byte[3]
---------------------------------------------------------
U+000000..U+00007F | 00..7F |         |        |        |
U+000080..U+0007FF | C2..DF | 80..BF  |        |        |
U+000800..U+000FFF |   E0   | A0..BF* | 80..BF |        |
U+001000..U+00CFFF | E1..EC | 80..BF  | 80..BF |        |
U+00D000..U+00D7FF |   ED   | 80..9F* | 80..BF |        |
U+00E000..U+00FFFF | EE..EF | 80..BF  | 80..BF |        |
U+010000..U+03FFFF |   F0   | 90..BF* | 80..BF | 80..BF |
U+040000..U+0FFFFF | F1..F3 | 80..BF  | 80..BF | 80..BF |
U+100000..U+10FFFF |   F4   | 80..8F* | 80..BF | 80..BF |
---------------------------------------------------------

So, for the code above corresponding to 2-byte sequences, it would
suffice to verify that buf[0] >=3D 0xC2.  The 3- and 4-byte cases are
somewhat more constrained.

> +  else if ((buf[0] & 0xf0) =3D=3D 0xe0)
> +    {
> +      byte =3D scm_get_byte_or_eof (port);
> +      if (byte =3D=3D EOF || ((byte & 0xc0) !=3D 0x80))
> +	goto invalid_seq;
> +
> +      buf[1] =3D (scm_t_uint8) byte;
> +      *len =3D 2;
> +
> +      byte =3D scm_get_byte_or_eof (port);
> +      if (byte =3D=3D EOF || ((byte & 0xc0) !=3D 0x80))
> +	goto invalid_seq;
> +
> +      buf[2] =3D (scm_t_uint8) byte;
> +      *len =3D 3;
> +
> +      *codepoint =3D ((scm_t_wchar) buf[0] & 0x0f) << 12UL
> +	| ((scm_t_wchar) buf[1] & 0x3f) << 6UL
> +	| (buf[2] & 0x3f);
> +    }
> +  else
> +    {

That ^^^ should not simply be an "else".  It must check that the first
byte is valid.

> +      byte =3D scm_get_byte_or_eof (port);
> +      if (byte =3D=3D EOF || ((byte & 0xc0) !=3D 0x80))
> +	goto invalid_seq;
> +
> +      buf[1] =3D (scm_t_uint8) byte;
> +      *len =3D 2;
> +
> +      byte =3D scm_get_byte_or_eof (port);
> +      if (byte =3D=3D EOF || ((byte & 0xc0) !=3D 0x80))
> +	goto invalid_seq;
> +
> +      buf[2] =3D (scm_t_uint8) byte;
> +      *len =3D 3;
> +
> +      byte =3D scm_get_byte_or_eof (port);
> +      if (byte =3D=3D EOF || ((byte & 0xc0) !=3D 0x80))
> +	goto invalid_seq;
> +
> +      buf[3] =3D (scm_t_uint8) byte;
> +      *len =3D 4;
> +
> +      *codepoint =3D ((scm_t_wchar) buf[0] & 0x07) << 18UL
> +	| ((scm_t_wchar) buf[1] & 0x3f) << 12UL
> +	| ((scm_t_wchar) buf[2] & 0x3f) << 6UL
> +	| (buf[3] & 0x3f);
> +    }
> +
> +  return 0;
> +
> + invalid_seq:
> +  /* Return the faulty byte.  */
> +  scm_unget_byte (byte, port);

This ungets only the last byte, but there may be up to 4 bytes to unget.

> +
> +  return EILSEQ;
> +}
> +
> +/* Likewise, read a byte sequence from PORT, passing it through its
> +   input conversion descriptor.  */
> +static int
> +get_iconv_codepoint (SCM port, scm_t_wchar *codepoint,
> +		     char buf[SCM_MBCHAR_BUF_SIZE], size_t *len)
>  {
> +  scm_t_port *pt;
>    int err, byte_read;
>    size_t bytes_consumed, output_size;
>    char *output;
>    scm_t_uint8 utf8_buf[SCM_MBCHAR_BUF_SIZE];
> -  scm_t_port *pt =3D SCM_PTAB_ENTRY (port);
>=20=20
> -  if (SCM_UNLIKELY (pt->input_cd =3D=3D (iconv_t) -1))
> -    /* Initialize the conversion descriptors.  */
> -    scm_i_set_port_encoding_x (port, pt->encoding);
> +  pt =3D SCM_PTAB_ENTRY (port);
>=20=20
>    for (output_size =3D 0, output =3D (char *) utf8_buf,
>  	 bytes_consumed =3D 0, err =3D 0;
> @@ -1174,10 +1265,44 @@ get_codepoint (SCM port, scm_t_wchar *codepoint,
>  	output_size =3D sizeof (utf8_buf) - output_left;
>      }
>=20=20
> -  if (SCM_UNLIKELY (err !=3D 0))
> +
> +  if (SCM_LIKELY (err =3D=3D 0))
> +    {
> +      /* Convert the UTF8_BUF sequence to a Unicode code point.  */
> +      *codepoint =3D utf8_to_codepoint (utf8_buf, output_size);
> +      *len =3D bytes_consumed;
> +    }
> +
> +  return err;
> +}
> +
> +/* Read a codepoint from PORT and return it in *CODEPOINT.  Fill BUF
> +   with the byte representation of the codepoint in PORT's encoding, and
> +   set *LEN to the length in bytes of that representation.  Return 0 on
> +   success and an errno value on error.  */
> +static int
> +get_codepoint (SCM port, scm_t_wchar *codepoint,
> +	       char buf[SCM_MBCHAR_BUF_SIZE], size_t *len)
> +{
> +  int err;
> +  scm_t_port *pt =3D SCM_PTAB_ENTRY (port);
> +
> +  if (pt->input_cd =3D=3D (iconv_t) -1)
> +    /* Initialize the conversion descriptors, if needed.  */
> +    scm_i_set_port_encoding_x (port, pt->encoding);
> +
> +  if (pt->input_cd =3D=3D (iconv_t) -1)
> +    err =3D get_utf8_codepoint (port, codepoint, (scm_t_uint8 *) buf, le=
n);
> +  else
> +    err =3D get_iconv_codepoint (port, codepoint, buf, len);

>From the code above, it appears that for UTF-8 ports,
scm_i_set_port_encoding_x will necessarily be called once per character
read.  This seems rather inefficient.  Also, if we wish to support
Latin-1 without iconv as well, the simple method above will not work.

I would recommend adding an enum field to the port which for now only
has two encoding schemes: ICONV or UTF8.  Later, we could add LATIN1 and
maybe ASCII as well.  Given that this check must be done once per
character, it seems better to do a switch on an enum than to strcmp with
pt->encoding (as is done in scm_i_set_port_encoding_x).

Thanks again for working on this :)

     Mark