From: Mark H Weaver <mhw@netris.org>
To: ludo@gnu.org (Ludovic Courtès)
Cc: guile-devel@gnu.org
Subject: Re: Reducing iconv-induced memory usage
Date: Tue, 26 Apr 2011 23:47:41 -0400 [thread overview]
Message-ID: <877hagxsma.fsf@netris.org> (raw)
In-Reply-To: <87ei4oy6ta.fsf@gnu.org> ("Ludovic Courtès"'s message of "Wed, 27 Apr 2011 00:41:05 +0200")
Hi Ludovic!
ludo@gnu.org (Ludovic Courtès) writes:
> So, here’s the patch.
>
> It also makes UTF-8 input ~30% faster according to ports.bm (which
> doesn’t benchmark output):
Thanks for working on this. I haven't yet had time to fully review this
patch, but here I will document the problems I see so far.
First of all, while looking at this patch, I've discovered another
problem in ports.c: scm_char_ready_p does not consider the possibility
of multibyte characters, and returns #t whenever there is at least one
byte ready.
> -/* Read a codepoint from PORT and return it in *CODEPOINT. Fill BUF
> - with the byte representation of the codepoint in PORT's encoding, and
> - set *LEN to the length in bytes of that representation. Return 0 on
> - success and an errno value on error. */
> +/* Read a UTF-8 sequence from PORT. On success, return 0 and set
> + *CODEPOINT to the codepoint that was read, fill BUF with its UTF-8
> + representation, and set *LEN to the length in bytes. Return
> + `EILSEQ' on error. */
> static int
> -get_codepoint (SCM port, scm_t_wchar *codepoint,
> - char buf[SCM_MBCHAR_BUF_SIZE], size_t *len)
> +get_utf8_codepoint (SCM port, scm_t_wchar *codepoint,
> + scm_t_uint8 buf[SCM_MBCHAR_BUF_SIZE], size_t *len)
> +{
> + int byte;
> +
> + *len = 0;
> +
> + byte = scm_get_byte_or_eof (port);
> + if (byte == EOF)
> + {
> + *codepoint = EOF;
> + return 0;
> + }
> +
> + buf[0] = (scm_t_uint8) byte;
> + *len = 1;
> +
> + if (buf[0] <= 0x7f)
> + *codepoint = buf[0];
> + else if ((buf[0] & 0xe0) == 0xc0)
> + {
> + byte = scm_get_byte_or_eof (port);
> + if (byte == EOF || ((byte & 0xc0) != 0x80))
> + goto invalid_seq;
> +
> + buf[1] = (scm_t_uint8) byte;
> + *len = 2;
> +
> + *codepoint = ((scm_t_wchar) buf[0] & 0x1f) << 6UL
> + | (buf[1] & 0x3f);
> + }
The code here would be sufficient for UTF-8 that is known valid, but
when reading from a port we must check for ill-formed UTF-8.
Unicode requires that we reject as ill-formed any UTF-8 byte sequence in
non-shortest form. For example, we must reject the byte sequence
0xC1 0x80 which a permissive reader would read as 0x40, since obviously
that code point can be encoded as a single byte in UTF-8.
We must also reject any UTF-8 byte sequence that corresponds to a
surrogate code point (U+D800..U+DFFF), or to a code point greater than
U+10FFFF.
Table 3.7 of the Unicode 6.0.0 standard, reproduced below, concisely
shows all well-formed UTF-8 byte sequences. The asterisks highlight
continuation bytes that are constrained to a smaller range than the
usual 80..BF.
code points byte[0] byte[1] byte[2] byte[3]
---------------------------------------------------------
U+000000..U+00007F | 00..7F | | | |
U+000080..U+0007FF | C2..DF | 80..BF | | |
U+000800..U+000FFF | E0 | A0..BF* | 80..BF | |
U+001000..U+00CFFF | E1..EC | 80..BF | 80..BF | |
U+00D000..U+00D7FF | ED | 80..9F* | 80..BF | |
U+00E000..U+00FFFF | EE..EF | 80..BF | 80..BF | |
U+010000..U+03FFFF | F0 | 90..BF* | 80..BF | 80..BF |
U+040000..U+0FFFFF | F1..F3 | 80..BF | 80..BF | 80..BF |
U+100000..U+10FFFF | F4 | 80..8F* | 80..BF | 80..BF |
---------------------------------------------------------
So, for the code above corresponding to 2-byte sequences, it would
suffice to verify that buf[0] >= 0xC2. The 3- and 4-byte cases are
somewhat more constrained.
> + else if ((buf[0] & 0xf0) == 0xe0)
> + {
> + byte = scm_get_byte_or_eof (port);
> + if (byte == EOF || ((byte & 0xc0) != 0x80))
> + goto invalid_seq;
> +
> + buf[1] = (scm_t_uint8) byte;
> + *len = 2;
> +
> + byte = scm_get_byte_or_eof (port);
> + if (byte == EOF || ((byte & 0xc0) != 0x80))
> + goto invalid_seq;
> +
> + buf[2] = (scm_t_uint8) byte;
> + *len = 3;
> +
> + *codepoint = ((scm_t_wchar) buf[0] & 0x0f) << 12UL
> + | ((scm_t_wchar) buf[1] & 0x3f) << 6UL
> + | (buf[2] & 0x3f);
> + }
> + else
> + {
That ^^^ should not simply be an "else". It must check that the first
byte is valid.
> + byte = scm_get_byte_or_eof (port);
> + if (byte == EOF || ((byte & 0xc0) != 0x80))
> + goto invalid_seq;
> +
> + buf[1] = (scm_t_uint8) byte;
> + *len = 2;
> +
> + byte = scm_get_byte_or_eof (port);
> + if (byte == EOF || ((byte & 0xc0) != 0x80))
> + goto invalid_seq;
> +
> + buf[2] = (scm_t_uint8) byte;
> + *len = 3;
> +
> + byte = scm_get_byte_or_eof (port);
> + if (byte == EOF || ((byte & 0xc0) != 0x80))
> + goto invalid_seq;
> +
> + buf[3] = (scm_t_uint8) byte;
> + *len = 4;
> +
> + *codepoint = ((scm_t_wchar) buf[0] & 0x07) << 18UL
> + | ((scm_t_wchar) buf[1] & 0x3f) << 12UL
> + | ((scm_t_wchar) buf[2] & 0x3f) << 6UL
> + | (buf[3] & 0x3f);
> + }
> +
> + return 0;
> +
> + invalid_seq:
> + /* Return the faulty byte. */
> + scm_unget_byte (byte, port);
This ungets only the last byte, but there may be up to 4 bytes to unget.
> +
> + return EILSEQ;
> +}
> +
> +/* Likewise, read a byte sequence from PORT, passing it through its
> + input conversion descriptor. */
> +static int
> +get_iconv_codepoint (SCM port, scm_t_wchar *codepoint,
> + char buf[SCM_MBCHAR_BUF_SIZE], size_t *len)
> {
> + scm_t_port *pt;
> int err, byte_read;
> size_t bytes_consumed, output_size;
> char *output;
> scm_t_uint8 utf8_buf[SCM_MBCHAR_BUF_SIZE];
> - scm_t_port *pt = SCM_PTAB_ENTRY (port);
>
> - if (SCM_UNLIKELY (pt->input_cd == (iconv_t) -1))
> - /* Initialize the conversion descriptors. */
> - scm_i_set_port_encoding_x (port, pt->encoding);
> + pt = SCM_PTAB_ENTRY (port);
>
> for (output_size = 0, output = (char *) utf8_buf,
> bytes_consumed = 0, err = 0;
> @@ -1174,10 +1265,44 @@ get_codepoint (SCM port, scm_t_wchar *codepoint,
> output_size = sizeof (utf8_buf) - output_left;
> }
>
> - if (SCM_UNLIKELY (err != 0))
> +
> + if (SCM_LIKELY (err == 0))
> + {
> + /* Convert the UTF8_BUF sequence to a Unicode code point. */
> + *codepoint = utf8_to_codepoint (utf8_buf, output_size);
> + *len = bytes_consumed;
> + }
> +
> + return err;
> +}
> +
> +/* Read a codepoint from PORT and return it in *CODEPOINT. Fill BUF
> + with the byte representation of the codepoint in PORT's encoding, and
> + set *LEN to the length in bytes of that representation. Return 0 on
> + success and an errno value on error. */
> +static int
> +get_codepoint (SCM port, scm_t_wchar *codepoint,
> + char buf[SCM_MBCHAR_BUF_SIZE], size_t *len)
> +{
> + int err;
> + scm_t_port *pt = SCM_PTAB_ENTRY (port);
> +
> + if (pt->input_cd == (iconv_t) -1)
> + /* Initialize the conversion descriptors, if needed. */
> + scm_i_set_port_encoding_x (port, pt->encoding);
> +
> + if (pt->input_cd == (iconv_t) -1)
> + err = get_utf8_codepoint (port, codepoint, (scm_t_uint8 *) buf, len);
> + else
> + err = get_iconv_codepoint (port, codepoint, buf, len);
From the code above, it appears that for UTF-8 ports,
scm_i_set_port_encoding_x will necessarily be called once per character
read. This seems rather inefficient. Also, if we wish to support
Latin-1 without iconv as well, the simple method above will not work.
I would recommend adding an enum field to the port which for now only
has two encoding schemes: ICONV or UTF8. Later, we could add LATIN1 and
maybe ASCII as well. Given that this check must be done once per
character, it seems better to do a switch on an enum than to strcmp with
pt->encoding (as is done in scm_i_set_port_encoding_x).
Thanks again for working on this :)
Mark
next prev parent reply other threads:[~2011-04-27 3:47 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2011-04-26 21:10 Reducing iconv-induced memory usage Ludovic Courtès
2011-04-26 22:41 ` Ludovic Courtès
2011-04-27 3:47 ` Mark H Weaver [this message]
2011-04-27 14:36 ` Ludovic Courtès
2011-05-05 16:19 ` Ludovic Courtès
2011-05-06 16:19 ` Ludovic Courtès
2011-05-07 20:51 ` Ludovic Courtès
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: https://www.gnu.org/software/guile/
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=877hagxsma.fsf@netris.org \
--to=mhw@netris.org \
--cc=guile-devel@gnu.org \
--cc=ludo@gnu.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).