unofficial mirror of guile-devel@gnu.org 
 help / color / mirror / Atom feed
From: Mark H Weaver <mhw@netris.org>
To: ludo@gnu.org (Ludovic Courtès)
Cc: guile-devel@gnu.org
Subject: Re: Reducing iconv-induced memory usage
Date: Tue, 26 Apr 2011 23:47:41 -0400	[thread overview]
Message-ID: <877hagxsma.fsf@netris.org> (raw)
In-Reply-To: <87ei4oy6ta.fsf@gnu.org> ("Ludovic Courtès"'s message of "Wed, 27 Apr 2011 00:41:05 +0200")

Hi Ludovic!

ludo@gnu.org (Ludovic Courtès) writes:
> So, here’s the patch.
>
> It also makes UTF-8 input ~30% faster according to ports.bm (which
> doesn’t benchmark output):

Thanks for working on this.  I haven't yet had time to fully review this
patch, but here I will document the problems I see so far.

First of all, while looking at this patch, I've discovered another
problem in ports.c: scm_char_ready_p does not consider the possibility
of multibyte characters, and returns #t whenever there is at least one
byte ready.

> -/* Read a codepoint from PORT and return it in *CODEPOINT.  Fill BUF
> -   with the byte representation of the codepoint in PORT's encoding, and
> -   set *LEN to the length in bytes of that representation.  Return 0 on
> -   success and an errno value on error.  */
> +/* Read a UTF-8 sequence from PORT.  On success, return 0 and set
> +   *CODEPOINT to the codepoint that was read, fill BUF with its UTF-8
> +   representation, and set *LEN to the length in bytes.  Return
> +   `EILSEQ' on error.  */
>  static int
> -get_codepoint (SCM port, scm_t_wchar *codepoint,
> -	       char buf[SCM_MBCHAR_BUF_SIZE], size_t *len)
> +get_utf8_codepoint (SCM port, scm_t_wchar *codepoint,
> +		    scm_t_uint8 buf[SCM_MBCHAR_BUF_SIZE], size_t *len)
> +{
> +  int byte;
> +
> +  *len = 0;
> +
> +  byte = scm_get_byte_or_eof (port);
> +  if (byte == EOF)
> +    {
> +      *codepoint = EOF;
> +      return 0;
> +    }
> +
> +  buf[0] = (scm_t_uint8) byte;
> +  *len = 1;
> +
> +  if (buf[0] <= 0x7f)
> +    *codepoint = buf[0];
> +  else if ((buf[0] & 0xe0) == 0xc0)
> +    {
> +      byte = scm_get_byte_or_eof (port);
> +      if (byte == EOF || ((byte & 0xc0) != 0x80))
> +	goto invalid_seq;
> +
> +      buf[1] = (scm_t_uint8) byte;
> +      *len = 2;
> +
> +      *codepoint = ((scm_t_wchar) buf[0] & 0x1f) << 6UL
> +	| (buf[1] & 0x3f);
> +    }

The code here would be sufficient for UTF-8 that is known valid, but
when reading from a port we must check for ill-formed UTF-8.

Unicode requires that we reject as ill-formed any UTF-8 byte sequence in
non-shortest form.  For example, we must reject the byte sequence
0xC1 0x80 which a permissive reader would read as 0x40, since obviously
that code point can be encoded as a single byte in UTF-8.

We must also reject any UTF-8 byte sequence that corresponds to a
surrogate code point (U+D800..U+DFFF), or to a code point greater than
U+10FFFF.

Table 3.7 of the Unicode 6.0.0 standard, reproduced below, concisely
shows all well-formed UTF-8 byte sequences.  The asterisks highlight
continuation bytes that are constrained to a smaller range than the
usual 80..BF.

   code points       byte[0]  byte[1]   byte[2]  byte[3]
---------------------------------------------------------
U+000000..U+00007F | 00..7F |         |        |        |
U+000080..U+0007FF | C2..DF | 80..BF  |        |        |
U+000800..U+000FFF |   E0   | A0..BF* | 80..BF |        |
U+001000..U+00CFFF | E1..EC | 80..BF  | 80..BF |        |
U+00D000..U+00D7FF |   ED   | 80..9F* | 80..BF |        |
U+00E000..U+00FFFF | EE..EF | 80..BF  | 80..BF |        |
U+010000..U+03FFFF |   F0   | 90..BF* | 80..BF | 80..BF |
U+040000..U+0FFFFF | F1..F3 | 80..BF  | 80..BF | 80..BF |
U+100000..U+10FFFF |   F4   | 80..8F* | 80..BF | 80..BF |
---------------------------------------------------------

So, for the code above corresponding to 2-byte sequences, it would
suffice to verify that buf[0] >= 0xC2.  The 3- and 4-byte cases are
somewhat more constrained.

> +  else if ((buf[0] & 0xf0) == 0xe0)
> +    {
> +      byte = scm_get_byte_or_eof (port);
> +      if (byte == EOF || ((byte & 0xc0) != 0x80))
> +	goto invalid_seq;
> +
> +      buf[1] = (scm_t_uint8) byte;
> +      *len = 2;
> +
> +      byte = scm_get_byte_or_eof (port);
> +      if (byte == EOF || ((byte & 0xc0) != 0x80))
> +	goto invalid_seq;
> +
> +      buf[2] = (scm_t_uint8) byte;
> +      *len = 3;
> +
> +      *codepoint = ((scm_t_wchar) buf[0] & 0x0f) << 12UL
> +	| ((scm_t_wchar) buf[1] & 0x3f) << 6UL
> +	| (buf[2] & 0x3f);
> +    }
> +  else
> +    {

That ^^^ should not simply be an "else".  It must check that the first
byte is valid.

> +      byte = scm_get_byte_or_eof (port);
> +      if (byte == EOF || ((byte & 0xc0) != 0x80))
> +	goto invalid_seq;
> +
> +      buf[1] = (scm_t_uint8) byte;
> +      *len = 2;
> +
> +      byte = scm_get_byte_or_eof (port);
> +      if (byte == EOF || ((byte & 0xc0) != 0x80))
> +	goto invalid_seq;
> +
> +      buf[2] = (scm_t_uint8) byte;
> +      *len = 3;
> +
> +      byte = scm_get_byte_or_eof (port);
> +      if (byte == EOF || ((byte & 0xc0) != 0x80))
> +	goto invalid_seq;
> +
> +      buf[3] = (scm_t_uint8) byte;
> +      *len = 4;
> +
> +      *codepoint = ((scm_t_wchar) buf[0] & 0x07) << 18UL
> +	| ((scm_t_wchar) buf[1] & 0x3f) << 12UL
> +	| ((scm_t_wchar) buf[2] & 0x3f) << 6UL
> +	| (buf[3] & 0x3f);
> +    }
> +
> +  return 0;
> +
> + invalid_seq:
> +  /* Return the faulty byte.  */
> +  scm_unget_byte (byte, port);

This ungets only the last byte, but there may be up to 4 bytes to unget.

> +
> +  return EILSEQ;
> +}
> +
> +/* Likewise, read a byte sequence from PORT, passing it through its
> +   input conversion descriptor.  */
> +static int
> +get_iconv_codepoint (SCM port, scm_t_wchar *codepoint,
> +		     char buf[SCM_MBCHAR_BUF_SIZE], size_t *len)
>  {
> +  scm_t_port *pt;
>    int err, byte_read;
>    size_t bytes_consumed, output_size;
>    char *output;
>    scm_t_uint8 utf8_buf[SCM_MBCHAR_BUF_SIZE];
> -  scm_t_port *pt = SCM_PTAB_ENTRY (port);
>  
> -  if (SCM_UNLIKELY (pt->input_cd == (iconv_t) -1))
> -    /* Initialize the conversion descriptors.  */
> -    scm_i_set_port_encoding_x (port, pt->encoding);
> +  pt = SCM_PTAB_ENTRY (port);
>  
>    for (output_size = 0, output = (char *) utf8_buf,
>  	 bytes_consumed = 0, err = 0;
> @@ -1174,10 +1265,44 @@ get_codepoint (SCM port, scm_t_wchar *codepoint,
>  	output_size = sizeof (utf8_buf) - output_left;
>      }
>  
> -  if (SCM_UNLIKELY (err != 0))
> +
> +  if (SCM_LIKELY (err == 0))
> +    {
> +      /* Convert the UTF8_BUF sequence to a Unicode code point.  */
> +      *codepoint = utf8_to_codepoint (utf8_buf, output_size);
> +      *len = bytes_consumed;
> +    }
> +
> +  return err;
> +}
> +
> +/* Read a codepoint from PORT and return it in *CODEPOINT.  Fill BUF
> +   with the byte representation of the codepoint in PORT's encoding, and
> +   set *LEN to the length in bytes of that representation.  Return 0 on
> +   success and an errno value on error.  */
> +static int
> +get_codepoint (SCM port, scm_t_wchar *codepoint,
> +	       char buf[SCM_MBCHAR_BUF_SIZE], size_t *len)
> +{
> +  int err;
> +  scm_t_port *pt = SCM_PTAB_ENTRY (port);
> +
> +  if (pt->input_cd == (iconv_t) -1)
> +    /* Initialize the conversion descriptors, if needed.  */
> +    scm_i_set_port_encoding_x (port, pt->encoding);
> +
> +  if (pt->input_cd == (iconv_t) -1)
> +    err = get_utf8_codepoint (port, codepoint, (scm_t_uint8 *) buf, len);
> +  else
> +    err = get_iconv_codepoint (port, codepoint, buf, len);

From the code above, it appears that for UTF-8 ports,
scm_i_set_port_encoding_x will necessarily be called once per character
read.  This seems rather inefficient.  Also, if we wish to support
Latin-1 without iconv as well, the simple method above will not work.

I would recommend adding an enum field to the port which for now only
has two encoding schemes: ICONV or UTF8.  Later, we could add LATIN1 and
maybe ASCII as well.  Given that this check must be done once per
character, it seems better to do a switch on an enum than to strcmp with
pt->encoding (as is done in scm_i_set_port_encoding_x).

Thanks again for working on this :)

     Mark



  reply	other threads:[~2011-04-27  3:47 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-04-26 21:10 Reducing iconv-induced memory usage Ludovic Courtès
2011-04-26 22:41 ` Ludovic Courtès
2011-04-27  3:47   ` Mark H Weaver [this message]
2011-04-27 14:36     ` Ludovic Courtès
2011-05-05 16:19     ` Ludovic Courtès
2011-05-06 16:19       ` Ludovic Courtès
2011-05-07 20:51         ` Ludovic Courtès

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.gnu.org/software/guile/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=877hagxsma.fsf@netris.org \
    --to=mhw@netris.org \
    --cc=guile-devel@gnu.org \
    --cc=ludo@gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).