From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!not-for-mail From: Mark H Weaver Newsgroups: gmane.lisp.guile.devel Subject: Re: Reducing iconv-induced memory usage Date: Tue, 26 Apr 2011 23:47:41 -0400 Message-ID: <877hagxsma.fsf@netris.org> References: <87r58oyb07.fsf@gnu.org> <87ei4oy6ta.fsf@gnu.org> NNTP-Posting-Host: lo.gmane.org Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Trace: dough.gmane.org 1303876080 4935 80.91.229.12 (27 Apr 2011 03:48:00 GMT) X-Complaints-To: usenet@dough.gmane.org NNTP-Posting-Date: Wed, 27 Apr 2011 03:48:00 +0000 (UTC) Cc: guile-devel@gnu.org To: ludo@gnu.org (Ludovic =?utf-8?Q?Court=C3=A8s?=) Original-X-From: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org Wed Apr 27 05:47:55 2011 Return-path: Envelope-to: guile-devel@m.gmane.org Original-Received: from lists.gnu.org ([140.186.70.17]) by lo.gmane.org with esmtp (Exim 4.69) (envelope-from ) id 1QEvjH-00015z-Ht for guile-devel@m.gmane.org; Wed, 27 Apr 2011 05:47:55 +0200 Original-Received: from localhost ([::1]:56304 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1QEvjH-0004Sh-2g for guile-devel@m.gmane.org; Tue, 26 Apr 2011 23:47:55 -0400 Original-Received: from eggs.gnu.org ([140.186.70.92]:41631) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1QEvjE-0004Sa-3I for guile-devel@gnu.org; Tue, 26 Apr 2011 23:47:53 -0400 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1QEvjD-00056L-4k for guile-devel@gnu.org; Tue, 26 Apr 2011 23:47:52 -0400 Original-Received: from world.peace.net ([96.39.62.75]:42817) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1QEvjB-000567-JW; Tue, 26 Apr 2011 23:47:49 -0400 Original-Received: from ip68-9-118-38.ri.ri.cox.net ([68.9.118.38] helo=freedomincluded) by world.peace.net with esmtpa (Exim 4.69) (envelope-from ) id 1QEvj5-0003Dj-82; Tue, 26 Apr 2011 23:47:43 -0400 Original-Received: from mhw by freedomincluded with local (Exim 4.69) (envelope-from ) id 1QEvj3-0002BF-SZ; Tue, 26 Apr 2011 23:47:41 -0400 In-Reply-To: <87ei4oy6ta.fsf@gnu.org> ("Ludovic =?utf-8?Q?Court=C3=A8s=22'?= =?utf-8?Q?s?= message of "Wed, 27 Apr 2011 00:41:05 +0200") User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.3 (gnu/linux) X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6 (newer, 3) X-Received-From: 96.39.62.75 X-BeenThere: guile-devel@gnu.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: "Developers list for Guile, the GNU extensibility library" List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org Original-Sender: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org Xref: news.gmane.org gmane.lisp.guile.devel:12358 Archived-At: Hi Ludovic! ludo@gnu.org (Ludovic Court=C3=A8s) writes: > So, here=E2=80=99s the patch. > > It also makes UTF-8 input ~30% faster according to ports.bm (which > doesn=E2=80=99t benchmark output): Thanks for working on this. I haven't yet had time to fully review this patch, but here I will document the problems I see so far. First of all, while looking at this patch, I've discovered another problem in ports.c: scm_char_ready_p does not consider the possibility of multibyte characters, and returns #t whenever there is at least one byte ready. > -/* Read a codepoint from PORT and return it in *CODEPOINT. Fill BUF > - with the byte representation of the codepoint in PORT's encoding, and > - set *LEN to the length in bytes of that representation. Return 0 on > - success and an errno value on error. */ > +/* Read a UTF-8 sequence from PORT. On success, return 0 and set > + *CODEPOINT to the codepoint that was read, fill BUF with its UTF-8 > + representation, and set *LEN to the length in bytes. Return > + `EILSEQ' on error. */ > static int > -get_codepoint (SCM port, scm_t_wchar *codepoint, > - char buf[SCM_MBCHAR_BUF_SIZE], size_t *len) > +get_utf8_codepoint (SCM port, scm_t_wchar *codepoint, > + scm_t_uint8 buf[SCM_MBCHAR_BUF_SIZE], size_t *len) > +{ > + int byte; > + > + *len =3D 0; > + > + byte =3D scm_get_byte_or_eof (port); > + if (byte =3D=3D EOF) > + { > + *codepoint =3D EOF; > + return 0; > + } > + > + buf[0] =3D (scm_t_uint8) byte; > + *len =3D 1; > + > + if (buf[0] <=3D 0x7f) > + *codepoint =3D buf[0]; > + else if ((buf[0] & 0xe0) =3D=3D 0xc0) > + { > + byte =3D scm_get_byte_or_eof (port); > + if (byte =3D=3D EOF || ((byte & 0xc0) !=3D 0x80)) > + goto invalid_seq; > + > + buf[1] =3D (scm_t_uint8) byte; > + *len =3D 2; > + > + *codepoint =3D ((scm_t_wchar) buf[0] & 0x1f) << 6UL > + | (buf[1] & 0x3f); > + } The code here would be sufficient for UTF-8 that is known valid, but when reading from a port we must check for ill-formed UTF-8. Unicode requires that we reject as ill-formed any UTF-8 byte sequence in non-shortest form. For example, we must reject the byte sequence 0xC1 0x80 which a permissive reader would read as 0x40, since obviously that code point can be encoded as a single byte in UTF-8. We must also reject any UTF-8 byte sequence that corresponds to a surrogate code point (U+D800..U+DFFF), or to a code point greater than U+10FFFF. Table 3.7 of the Unicode 6.0.0 standard, reproduced below, concisely shows all well-formed UTF-8 byte sequences. The asterisks highlight continuation bytes that are constrained to a smaller range than the usual 80..BF. code points byte[0] byte[1] byte[2] byte[3] --------------------------------------------------------- U+000000..U+00007F | 00..7F | | | | U+000080..U+0007FF | C2..DF | 80..BF | | | U+000800..U+000FFF | E0 | A0..BF* | 80..BF | | U+001000..U+00CFFF | E1..EC | 80..BF | 80..BF | | U+00D000..U+00D7FF | ED | 80..9F* | 80..BF | | U+00E000..U+00FFFF | EE..EF | 80..BF | 80..BF | | U+010000..U+03FFFF | F0 | 90..BF* | 80..BF | 80..BF | U+040000..U+0FFFFF | F1..F3 | 80..BF | 80..BF | 80..BF | U+100000..U+10FFFF | F4 | 80..8F* | 80..BF | 80..BF | --------------------------------------------------------- So, for the code above corresponding to 2-byte sequences, it would suffice to verify that buf[0] >=3D 0xC2. The 3- and 4-byte cases are somewhat more constrained. > + else if ((buf[0] & 0xf0) =3D=3D 0xe0) > + { > + byte =3D scm_get_byte_or_eof (port); > + if (byte =3D=3D EOF || ((byte & 0xc0) !=3D 0x80)) > + goto invalid_seq; > + > + buf[1] =3D (scm_t_uint8) byte; > + *len =3D 2; > + > + byte =3D scm_get_byte_or_eof (port); > + if (byte =3D=3D EOF || ((byte & 0xc0) !=3D 0x80)) > + goto invalid_seq; > + > + buf[2] =3D (scm_t_uint8) byte; > + *len =3D 3; > + > + *codepoint =3D ((scm_t_wchar) buf[0] & 0x0f) << 12UL > + | ((scm_t_wchar) buf[1] & 0x3f) << 6UL > + | (buf[2] & 0x3f); > + } > + else > + { That ^^^ should not simply be an "else". It must check that the first byte is valid. > + byte =3D scm_get_byte_or_eof (port); > + if (byte =3D=3D EOF || ((byte & 0xc0) !=3D 0x80)) > + goto invalid_seq; > + > + buf[1] =3D (scm_t_uint8) byte; > + *len =3D 2; > + > + byte =3D scm_get_byte_or_eof (port); > + if (byte =3D=3D EOF || ((byte & 0xc0) !=3D 0x80)) > + goto invalid_seq; > + > + buf[2] =3D (scm_t_uint8) byte; > + *len =3D 3; > + > + byte =3D scm_get_byte_or_eof (port); > + if (byte =3D=3D EOF || ((byte & 0xc0) !=3D 0x80)) > + goto invalid_seq; > + > + buf[3] =3D (scm_t_uint8) byte; > + *len =3D 4; > + > + *codepoint =3D ((scm_t_wchar) buf[0] & 0x07) << 18UL > + | ((scm_t_wchar) buf[1] & 0x3f) << 12UL > + | ((scm_t_wchar) buf[2] & 0x3f) << 6UL > + | (buf[3] & 0x3f); > + } > + > + return 0; > + > + invalid_seq: > + /* Return the faulty byte. */ > + scm_unget_byte (byte, port); This ungets only the last byte, but there may be up to 4 bytes to unget. > + > + return EILSEQ; > +} > + > +/* Likewise, read a byte sequence from PORT, passing it through its > + input conversion descriptor. */ > +static int > +get_iconv_codepoint (SCM port, scm_t_wchar *codepoint, > + char buf[SCM_MBCHAR_BUF_SIZE], size_t *len) > { > + scm_t_port *pt; > int err, byte_read; > size_t bytes_consumed, output_size; > char *output; > scm_t_uint8 utf8_buf[SCM_MBCHAR_BUF_SIZE]; > - scm_t_port *pt =3D SCM_PTAB_ENTRY (port); >=20=20 > - if (SCM_UNLIKELY (pt->input_cd =3D=3D (iconv_t) -1)) > - /* Initialize the conversion descriptors. */ > - scm_i_set_port_encoding_x (port, pt->encoding); > + pt =3D SCM_PTAB_ENTRY (port); >=20=20 > for (output_size =3D 0, output =3D (char *) utf8_buf, > bytes_consumed =3D 0, err =3D 0; > @@ -1174,10 +1265,44 @@ get_codepoint (SCM port, scm_t_wchar *codepoint, > output_size =3D sizeof (utf8_buf) - output_left; > } >=20=20 > - if (SCM_UNLIKELY (err !=3D 0)) > + > + if (SCM_LIKELY (err =3D=3D 0)) > + { > + /* Convert the UTF8_BUF sequence to a Unicode code point. */ > + *codepoint =3D utf8_to_codepoint (utf8_buf, output_size); > + *len =3D bytes_consumed; > + } > + > + return err; > +} > + > +/* Read a codepoint from PORT and return it in *CODEPOINT. Fill BUF > + with the byte representation of the codepoint in PORT's encoding, and > + set *LEN to the length in bytes of that representation. Return 0 on > + success and an errno value on error. */ > +static int > +get_codepoint (SCM port, scm_t_wchar *codepoint, > + char buf[SCM_MBCHAR_BUF_SIZE], size_t *len) > +{ > + int err; > + scm_t_port *pt =3D SCM_PTAB_ENTRY (port); > + > + if (pt->input_cd =3D=3D (iconv_t) -1) > + /* Initialize the conversion descriptors, if needed. */ > + scm_i_set_port_encoding_x (port, pt->encoding); > + > + if (pt->input_cd =3D=3D (iconv_t) -1) > + err =3D get_utf8_codepoint (port, codepoint, (scm_t_uint8 *) buf, le= n); > + else > + err =3D get_iconv_codepoint (port, codepoint, buf, len); >From the code above, it appears that for UTF-8 ports, scm_i_set_port_encoding_x will necessarily be called once per character read. This seems rather inefficient. Also, if we wish to support Latin-1 without iconv as well, the simple method above will not work. I would recommend adding an enum field to the port which for now only has two encoding schemes: ICONV or UTF8. Later, we could add LATIN1 and maybe ASCII as well. Given that this check must be done once per character, it seems better to do a switch on an enum than to strcmp with pt->encoding (as is done in scm_i_set_port_encoding_x). Thanks again for working on this :) Mark