* Reducing iconv-induced memory usage
From: Ludovic Courtès @ 2011-04-26 21:10 UTC
To: guile-devel

Hello!

As Andy noted in the past, iconv conversion descriptors associated with
ports take up a lot of malloc’d memory, which only gets freed when
finalizers are run.

On GNU/Linux, a UTF-8 → UTF-8 conversion descriptor, which does nothing,
mallocs 180 KiB (!), according to the attached program. So the problem
is acute.

So I think we should special-case UTF-8 I/O to not use iconv at all.

For output, it’s easy since we already do the conversion to UTF-8 in
‘display_string’. For input, it’s a bit more work because input byte
streams have to be checked for invalid sequences.

I’m working on a patch, but I’d like to get initial feedback, including
on whether it should wait until after 2.0.1.

Thanks,
Ludo’.

[-- Attachment #2: the program --]
[-- Type: text/x-csrc --]

#include <iconv.h>
#include <malloc.h>
#include <stdio.h>

/* Total number of bytes requested from malloc so far. */
static size_t total;

/* The hook that was installed before ours. */
static void *(*prev_hook) (size_t, const void *);

/* malloc hook: log and tally each request. The hook is uninstalled
   while the real malloc runs, to avoid infinite recursion. */
static void *
m (size_t s, const void *c)
{
  __malloc_hook = prev_hook;
  printf ("alloc %zu\n", s);
  void *r = malloc (s);
  total += s;
  __malloc_hook = &m;
  return r;
}

static void
my_init_hook (void)
{
  prev_hook = __malloc_hook;
  __malloc_hook = &m;
}

void (*__malloc_initialize_hook) (void) = my_init_hook;

int
main (int argc, char *argv[])
{
  total = 0;
  iconv_open ("UTF-8", "UTF-8");
  printf ("allocated %zu B\n", total);
  return 0;
}
* Re: Reducing iconv-induced memory usage
From: Ludovic Courtès @ 2011-04-26 22:41 UTC
To: guile-devel

Hi!

So, here’s the patch.

It also makes UTF-8 input ~30% faster according to ports.bm (which
doesn’t benchmark output):

* before:

("ports.bm: peek-char: latin-1 port" 700000 user 0.36)
("ports.bm: peek-char: utf-8 port, ascii character" 700000 user 0.35)
("ports.bm: peek-char: utf-8 port, Korean character" 700000 user 0.61)
("ports.bm: read-char: latin-1 port" 10000000 user 3.32)
("ports.bm: read-char: utf-8 port, ascii character" 10000000 user 3.33)
("ports.bm: read-char: utf-8 port, Korean character" 10000000 user 3.39)
("ports.bm: char-ready?: latin-1 port" 10000000 user 2.95)
("ports.bm: char-ready?: utf-8 port, ascii character" 10000000 user 2.96)
("ports.bm: char-ready?: utf-8 port, Korean character" 10000000 user 3.01)
("ports.bm: rdelim: read-line" 1000 user 3.1)

* after:

("ports.bm: peek-char: latin-1 port" 700000 user 0.31)
("ports.bm: peek-char: utf-8 port, ascii character" 700000 user 0.24)
("ports.bm: peek-char: utf-8 port, Korean character" 700000 user 0.3)
("ports.bm: read-char: latin-1 port" 10000000 user 2.73)
("ports.bm: read-char: utf-8 port, ascii character" 10000000 user 3.38)
("ports.bm: read-char: utf-8 port, Korean character" 10000000 user 3.37)
("ports.bm: char-ready?: latin-1 port" 10000000 user 2.42)
("ports.bm: char-ready?: utf-8 port, ascii character" 10000000 user 2.41)
("ports.bm: char-ready?: utf-8 port, Korean character" 10000000 user 2.43)
("ports.bm: rdelim: read-line" 1000 user 1.91)

Comments? OK to apply?

Thanks,
Ludo’.
[-- Attachment #2: the patch --] [-- Type: text/x-patch, Size: 12384 bytes --] diff --git a/libguile/ports.c b/libguile/ports.c index 6e0ae6c..d728356 100644 --- a/libguile/ports.c +++ b/libguile/ports.c @@ -1057,6 +1057,7 @@ update_port_lf (scm_t_wchar c, SCM port) switch (c) { case '\a': + case EOF: break; case '\b': SCM_DECCOL (port); @@ -1115,23 +1116,113 @@ utf8_to_codepoint (const scm_t_uint8 *utf8_buf, size_t size) return codepoint; } -/* Read a codepoint from PORT and return it in *CODEPOINT. Fill BUF - with the byte representation of the codepoint in PORT's encoding, and - set *LEN to the length in bytes of that representation. Return 0 on - success and an errno value on error. */ +/* Read a UTF-8 sequence from PORT. On success, return 0 and set + *CODEPOINT to the codepoint that was read, fill BUF with its UTF-8 + representation, and set *LEN to the length in bytes. Return + `EILSEQ' on error. */ static int -get_codepoint (SCM port, scm_t_wchar *codepoint, - char buf[SCM_MBCHAR_BUF_SIZE], size_t *len) +get_utf8_codepoint (SCM port, scm_t_wchar *codepoint, + scm_t_uint8 buf[SCM_MBCHAR_BUF_SIZE], size_t *len) +{ + int byte; + + *len = 0; + + byte = scm_get_byte_or_eof (port); + if (byte == EOF) + { + *codepoint = EOF; + return 0; + } + + buf[0] = (scm_t_uint8) byte; + *len = 1; + + if (buf[0] <= 0x7f) + *codepoint = buf[0]; + else if ((buf[0] & 0xe0) == 0xc0) + { + byte = scm_get_byte_or_eof (port); + if (byte == EOF || ((byte & 0xc0) != 0x80)) + goto invalid_seq; + + buf[1] = (scm_t_uint8) byte; + *len = 2; + + *codepoint = ((scm_t_wchar) buf[0] & 0x1f) << 6UL + | (buf[1] & 0x3f); + } + else if ((buf[0] & 0xf0) == 0xe0) + { + byte = scm_get_byte_or_eof (port); + if (byte == EOF || ((byte & 0xc0) != 0x80)) + goto invalid_seq; + + buf[1] = (scm_t_uint8) byte; + *len = 2; + + byte = scm_get_byte_or_eof (port); + if (byte == EOF || ((byte & 0xc0) != 0x80)) + goto invalid_seq; + + buf[2] = (scm_t_uint8) byte; + *len = 3; + + *codepoint = ((scm_t_wchar) buf[0] & 
0x0f) << 12UL + | ((scm_t_wchar) buf[1] & 0x3f) << 6UL + | (buf[2] & 0x3f); + } + else + { + byte = scm_get_byte_or_eof (port); + if (byte == EOF || ((byte & 0xc0) != 0x80)) + goto invalid_seq; + + buf[1] = (scm_t_uint8) byte; + *len = 2; + + byte = scm_get_byte_or_eof (port); + if (byte == EOF || ((byte & 0xc0) != 0x80)) + goto invalid_seq; + + buf[2] = (scm_t_uint8) byte; + *len = 3; + + byte = scm_get_byte_or_eof (port); + if (byte == EOF || ((byte & 0xc0) != 0x80)) + goto invalid_seq; + + buf[3] = (scm_t_uint8) byte; + *len = 4; + + *codepoint = ((scm_t_wchar) buf[0] & 0x07) << 18UL + | ((scm_t_wchar) buf[1] & 0x3f) << 12UL + | ((scm_t_wchar) buf[2] & 0x3f) << 6UL + | (buf[3] & 0x3f); + } + + return 0; + + invalid_seq: + /* Return the faulty byte. */ + scm_unget_byte (byte, port); + + return EILSEQ; +} + +/* Likewise, read a byte sequence from PORT, passing it through its + input conversion descriptor. */ +static int +get_iconv_codepoint (SCM port, scm_t_wchar *codepoint, + char buf[SCM_MBCHAR_BUF_SIZE], size_t *len) { + scm_t_port *pt; int err, byte_read; size_t bytes_consumed, output_size; char *output; scm_t_uint8 utf8_buf[SCM_MBCHAR_BUF_SIZE]; - scm_t_port *pt = SCM_PTAB_ENTRY (port); - if (SCM_UNLIKELY (pt->input_cd == (iconv_t) -1)) - /* Initialize the conversion descriptors. */ - scm_i_set_port_encoding_x (port, pt->encoding); + pt = SCM_PTAB_ENTRY (port); for (output_size = 0, output = (char *) utf8_buf, bytes_consumed = 0, err = 0; @@ -1174,10 +1265,44 @@ get_codepoint (SCM port, scm_t_wchar *codepoint, output_size = sizeof (utf8_buf) - output_left; } - if (SCM_UNLIKELY (err != 0)) + + if (SCM_LIKELY (err == 0)) + { + /* Convert the UTF8_BUF sequence to a Unicode code point. */ + *codepoint = utf8_to_codepoint (utf8_buf, output_size); + *len = bytes_consumed; + } + + return err; +} + +/* Read a codepoint from PORT and return it in *CODEPOINT. 
Fill BUF + with the byte representation of the codepoint in PORT's encoding, and + set *LEN to the length in bytes of that representation. Return 0 on + success and an errno value on error. */ +static int +get_codepoint (SCM port, scm_t_wchar *codepoint, + char buf[SCM_MBCHAR_BUF_SIZE], size_t *len) +{ + int err; + scm_t_port *pt = SCM_PTAB_ENTRY (port); + + if (pt->input_cd == (iconv_t) -1) + /* Initialize the conversion descriptors, if needed. */ + scm_i_set_port_encoding_x (port, pt->encoding); + + if (pt->input_cd == (iconv_t) -1) + err = get_utf8_codepoint (port, codepoint, (scm_t_uint8 *) buf, len); + else + err = get_iconv_codepoint (port, codepoint, buf, len); + + if (SCM_LIKELY (err == 0)) + update_port_lf (*codepoint, port); + else { - /* Reset the `iconv' state. */ - iconv (pt->input_cd, NULL, NULL, NULL, NULL); + if (pt->input_cd != (iconv_t) -1) + /* Reset the `iconv' state. */ + iconv (pt->input_cd, NULL, NULL, NULL, NULL); if (pt->ilseq_handler == SCM_ICONVEH_QUESTION_MARK) { @@ -1189,14 +1314,6 @@ get_codepoint (SCM port, scm_t_wchar *codepoint, SCM_ICONVEH_ESCAPE_SEQUENCE (the latter doesn't make sense for input encoding errors.) */ } - else - /* Convert the UTF8_BUF sequence to a Unicode code point. */ - *codepoint = utf8_to_codepoint (utf8_buf, output_size); - - if (SCM_LIKELY (err == 0)) - update_port_lf (*codepoint, port); - - *len = bytes_consumed; return err; } @@ -2027,28 +2144,35 @@ scm_i_set_port_encoding_x (SCM port, const char *encoding) if (encoding == NULL) encoding = "ISO-8859-1"; - pt->encoding = scm_gc_strdup (encoding, "port"); + if (pt->encoding != encoding) + pt->encoding = scm_gc_strdup (encoding, "port"); - if (SCM_CELL_WORD_0 (port) & SCM_RDNG) + /* If ENCODING is UTF-8, then no conversion descriptor is opened + because we do I/O ourselves. This saves 100+ KiB for each + descriptor. */ + if (strcmp (encoding, "UTF-8")) { - /* Open an input iconv conversion descriptor, from ENCODING - to UTF-8. 
We choose UTF-8, not UTF-32, because iconv - implementations can typically convert from anything to - UTF-8, but not to UTF-32 (see - <http://lists.gnu.org/archive/html/bug-libunistring/2010-09/msg00007.html>). */ - new_input_cd = iconv_open ("UTF-8", encoding); - if (new_input_cd == (iconv_t) -1) - goto invalid_encoding; - } + if (SCM_CELL_WORD_0 (port) & SCM_RDNG) + { + /* Open an input iconv conversion descriptor, from ENCODING + to UTF-8. We choose UTF-8, not UTF-32, because iconv + implementations can typically convert from anything to + UTF-8, but not to UTF-32 (see + <http://lists.gnu.org/archive/html/bug-libunistring/2010-09/msg00007.html>). */ + new_input_cd = iconv_open ("UTF-8", encoding); + if (new_input_cd == (iconv_t) -1) + goto invalid_encoding; + } - if (SCM_CELL_WORD_0 (port) & SCM_WRTNG) - { - new_output_cd = iconv_open (encoding, "UTF-8"); - if (new_output_cd == (iconv_t) -1) + if (SCM_CELL_WORD_0 (port) & SCM_WRTNG) { - if (new_input_cd != (iconv_t) -1) - iconv_close (new_input_cd); - goto invalid_encoding; + new_output_cd = iconv_open (encoding, "UTF-8"); + if (new_output_cd == (iconv_t) -1) + { + if (new_input_cd != (iconv_t) -1) + iconv_close (new_input_cd); + goto invalid_encoding; + } } } diff --git a/libguile/print.c b/libguile/print.c index 1399566..d18c054 100644 --- a/libguile/print.c +++ b/libguile/print.c @@ -821,33 +821,58 @@ codepoint_to_utf8 (scm_t_wchar ch, scm_t_uint8 utf8[4]) return len; } -/* Display the LEN codepoints in STR to PORT according to STRATEGY; - return the number of codepoints successfully displayed. If NARROW_P, - then STR is interpreted as a sequence of `char', denoting a Latin-1 - string; otherwise it's interpreted as a sequence of - `scm_t_wchar'. */ -static size_t -display_string (const void *str, int narrow_p, - size_t len, SCM port, - scm_t_string_failed_conversion_handler strategy) - -{ #define STR_REF(s, x) \ (narrow_p \ ? 
(scm_t_wchar) ((unsigned char *) (s))[x] \ : ((scm_t_wchar *) (s))[x]) +/* Write STR to PORT as UTF-8. STR is a LEN-codepoint string; it is + narrow if NARROW_P is true, wide otherwise. Return LEN. */ +static size_t +display_string_as_utf8 (const void *str, int narrow_p, size_t len, + SCM port) +{ + size_t printed = 0; + + while (len > printed) + { + size_t utf8_len, i; + char *input, utf8_buf[256]; + + /* Convert STR to UTF-8. */ + for (i = printed, utf8_len = 0, input = utf8_buf; + i < len && utf8_len + 4 < sizeof (utf8_buf); + i++) + { + utf8_len += codepoint_to_utf8 (STR_REF (str, i), + (scm_t_uint8 *) input); + input = utf8_buf + utf8_len; + } + + /* INPUT was successfully converted, entirely; print the + result. */ + scm_lfwrite (utf8_buf, utf8_len, port); + printed += i - printed; + } + + assert (printed == len); + + return len; +} + +/* Convert STR through PORT's output conversion descriptor and write the + output to PORT. Return the number of codepoints written. */ +static size_t +display_string_using_iconv (const void *str, int narrow_p, size_t len, + SCM port, + scm_t_string_failed_conversion_handler strategy) +{ size_t printed; scm_t_port *pt; pt = SCM_PTAB_ENTRY (port); - if (SCM_UNLIKELY (pt->output_cd == (iconv_t) -1)) - /* Initialize the conversion descriptors. */ - scm_i_set_port_encoding_x (port, pt->encoding); - printed = 0; - while (len > printed) { size_t done, utf8_len, input_left, output_left, i; @@ -880,7 +905,7 @@ display_string (const void *str, int narrow_p, if (SCM_UNLIKELY (done == (size_t) -1)) { - int errno_save = errno; + int errno_save = errno; /* Reset the `iconv' state. */ iconv (pt->output_cd, NULL, NULL, NULL, NULL); @@ -928,7 +953,34 @@ display_string (const void *str, int narrow_p, } return printed; +} + #undef STR_REF + +/* Display the LEN codepoints in STR to PORT according to STRATEGY; + return the number of codepoints successfully displayed. 
If NARROW_P, + then STR is interpreted as a sequence of `char', denoting a Latin-1 + string; otherwise it's interpreted as a sequence of + `scm_t_wchar'. */ +static size_t +display_string (const void *str, int narrow_p, + size_t len, SCM port, + scm_t_string_failed_conversion_handler strategy) + +{ + scm_t_port *pt; + + pt = SCM_PTAB_ENTRY (port); + + if (pt->output_cd == (iconv_t) -1) + /* Initialize the conversion descriptors, if needed. */ + scm_i_set_port_encoding_x (port, pt->encoding); + + if (pt->output_cd == (iconv_t) -1) + return display_string_as_utf8 (str, narrow_p, len, port); + else + return display_string_using_iconv (str, narrow_p, len, + port, strategy); } /* Attempt to display CH to PORT according to STRATEGY. Return non-zero diff --git a/test-suite/tests/ports.test b/test-suite/tests/ports.test index 9d3000c..d5b1b60 100644 --- a/test-suite/tests/ports.test +++ b/test-suite/tests/ports.test @@ -391,7 +391,8 @@ (with-fluids ((%default-port-encoding e)) (call-with-output-string (lambda (p) - (display (port-encoding p) p))))) + (and (string=? e (port-encoding p)) + (display (port-encoding p) p)))))) encodings) encodings))) @@ -462,6 +463,15 @@ (= (port-line p) 0) (= (port-column p) 0)))) + (pass-if "peek-char [utf-16]" + (let ((p (with-fluids ((%default-port-encoding "UTF-16BE")) + (open-input-string "안녕하세요")))) + (and (char=? (peek-char p) #\안) + (char=? (peek-char p) #\안) + (char=? (peek-char p) #\안) + (= (port-line p) 0) + (= (port-column p) 0)))) + (pass-if "read-char, wrong encoding, error" (let ((p (open-bytevector-input-port #vu8(255 65 66 67)))) (catch 'decoding-error
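
The output side of the patch converts code points to UTF-8 in fixed-size chunks and hands each chunk to the port. A minimal standalone sketch of that idea follows. It is not the patch's code: `encode_utf8`, `struct membuf`, `membuf_write`, and `write_as_utf8` are hypothetical names, the in-memory buffer merely stands in for a port, and the chunk is kept tiny so that long inputs need several flushes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Encode code point CP (assumed to be a valid Unicode scalar value)
   into OUT and return the number of bytes written (1..4). */
static size_t
encode_utf8 (uint32_t cp, uint8_t out[4])
{
  if (cp <= 0x7f)
    {
      out[0] = (uint8_t) cp;
      return 1;
    }
  if (cp <= 0x7ff)
    {
      out[0] = (uint8_t) (0xc0 | (cp >> 6));
      out[1] = (uint8_t) (0x80 | (cp & 0x3f));
      return 2;
    }
  if (cp <= 0xffff)
    {
      out[0] = (uint8_t) (0xe0 | (cp >> 12));
      out[1] = (uint8_t) (0x80 | ((cp >> 6) & 0x3f));
      out[2] = (uint8_t) (0x80 | (cp & 0x3f));
      return 3;
    }
  out[0] = (uint8_t) (0xf0 | (cp >> 18));
  out[1] = (uint8_t) (0x80 | ((cp >> 12) & 0x3f));
  out[2] = (uint8_t) (0x80 | ((cp >> 6) & 0x3f));
  out[3] = (uint8_t) (0x80 | (cp & 0x3f));
  return 4;
}

/* Hypothetical in-memory sink standing in for the port. */
struct membuf
{
  uint8_t data[64];
  size_t used;
};

static void
membuf_write (struct membuf *mb, const uint8_t *bytes, size_t n)
{
  assert (mb->used + n <= sizeof mb->data);
  memcpy (mb->data + mb->used, bytes, n);
  mb->used += n;
}

/* Write the LEN code points of STR to MB as UTF-8, one chunk at a
   time. Return LEN. */
static size_t
write_as_utf8 (const uint32_t *str, size_t len, struct membuf *mb)
{
  size_t printed = 0;

  while (printed < len)
    {
      uint8_t chunk[8];   /* tiny on purpose, to force several flushes */
      size_t used = 0;

      /* Fill the chunk while a worst-case 4-byte sequence still fits. */
      while (printed < len && used + 4 <= sizeof chunk)
        used += encode_utf8 (str[printed++], chunk + used);

      membuf_write (mb, chunk, used);
    }

  return len;
}
```

Because the chunk always has room for at least one 4-byte sequence, every pass through the outer loop consumes at least one code point, so the loop terminates without any iconv state to reset.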
* Re: Reducing iconv-induced memory usage
From: Mark H Weaver @ 2011-04-27 3:47 UTC
To: Ludovic Courtès; +Cc: guile-devel

Hi Ludovic!

ludo@gnu.org (Ludovic Courtès) writes:
> So, here’s the patch.
>
> It also makes UTF-8 input ~30% faster according to ports.bm (which
> doesn’t benchmark output):

Thanks for working on this. I haven't yet had time to fully review this
patch, but here I will document the problems I see so far.

First of all, while looking at this patch, I've discovered another
problem in ports.c: scm_char_ready_p does not consider the possibility
of multibyte characters, and returns #t whenever there is at least one
byte ready.

> -/* Read a codepoint from PORT and return it in *CODEPOINT. Fill BUF
> -   with the byte representation of the codepoint in PORT's encoding, and
> -   set *LEN to the length in bytes of that representation. Return 0 on
> -   success and an errno value on error. */
> +/* Read a UTF-8 sequence from PORT. On success, return 0 and set
> +   *CODEPOINT to the codepoint that was read, fill BUF with its UTF-8
> +   representation, and set *LEN to the length in bytes. Return
> +   `EILSEQ' on error. */
>  static int
> -get_codepoint (SCM port, scm_t_wchar *codepoint,
> -               char buf[SCM_MBCHAR_BUF_SIZE], size_t *len)
> +get_utf8_codepoint (SCM port, scm_t_wchar *codepoint,
> +                    scm_t_uint8 buf[SCM_MBCHAR_BUF_SIZE], size_t *len)
> +{
> +  int byte;
> +
> +  *len = 0;
> +
> +  byte = scm_get_byte_or_eof (port);
> +  if (byte == EOF)
> +    {
> +      *codepoint = EOF;
> +      return 0;
> +    }
> +
> +  buf[0] = (scm_t_uint8) byte;
> +  *len = 1;
> +
> +  if (buf[0] <= 0x7f)
> +    *codepoint = buf[0];
> +  else if ((buf[0] & 0xe0) == 0xc0)
> +    {
> +      byte = scm_get_byte_or_eof (port);
> +      if (byte == EOF || ((byte & 0xc0) != 0x80))
> +        goto invalid_seq;
> +
> +      buf[1] = (scm_t_uint8) byte;
> +      *len = 2;
> +
> +      *codepoint = ((scm_t_wchar) buf[0] & 0x1f) << 6UL
> +        | (buf[1] & 0x3f);
> +    }

The code here would be sufficient for UTF-8 that is known valid, but
when reading from a port we must check for ill-formed UTF-8.

Unicode requires that we reject as ill-formed any UTF-8 byte sequence in
non-shortest form. For example, we must reject the byte sequence
0xC1 0x80 which a permissive reader would read as 0x40, since obviously
that code point can be encoded as a single byte in UTF-8.

We must also reject any UTF-8 byte sequence that corresponds to a
surrogate code point (U+D800..U+DFFF), or to a code point greater than
U+10FFFF.

Table 3.7 of the Unicode 6.0.0 standard, reproduced below, concisely
shows all well-formed UTF-8 byte sequences. The asterisks highlight
continuation bytes that are constrained to a smaller range than the
usual 80..BF.

    code points          byte[0]   byte[1]   byte[2]   byte[3]
    -----------------------------------------------------------
    U+000000..U+00007F | 00..7F  |         |         |         |
    U+000080..U+0007FF | C2..DF  | 80..BF  |         |         |
    U+000800..U+000FFF | E0      | A0..BF* | 80..BF  |         |
    U+001000..U+00CFFF | E1..EC  | 80..BF  | 80..BF  |         |
    U+00D000..U+00D7FF | ED      | 80..9F* | 80..BF  |         |
    U+00E000..U+00FFFF | EE..EF  | 80..BF  | 80..BF  |         |
    U+010000..U+03FFFF | F0      | 90..BF* | 80..BF  | 80..BF  |
    U+040000..U+0FFFFF | F1..F3  | 80..BF  | 80..BF  | 80..BF  |
    U+100000..U+10FFFF | F4      | 80..8F* | 80..BF  | 80..BF  |
    -----------------------------------------------------------

So, for the code above corresponding to 2-byte sequences, it would
suffice to verify that buf[0] >= 0xC2. The 3- and 4-byte cases are
somewhat more constrained.

> +  else if ((buf[0] & 0xf0) == 0xe0)
> +    {
> +      byte = scm_get_byte_or_eof (port);
> +      if (byte == EOF || ((byte & 0xc0) != 0x80))
> +        goto invalid_seq;
> +
> +      buf[1] = (scm_t_uint8) byte;
> +      *len = 2;
> +
> +      byte = scm_get_byte_or_eof (port);
> +      if (byte == EOF || ((byte & 0xc0) != 0x80))
> +        goto invalid_seq;
> +
> +      buf[2] = (scm_t_uint8) byte;
> +      *len = 3;
> +
> +      *codepoint = ((scm_t_wchar) buf[0] & 0x0f) << 12UL
> +        | ((scm_t_wchar) buf[1] & 0x3f) << 6UL
> +        | (buf[2] & 0x3f);
> +    }
> +  else
> +    {

That ^^^ should not simply be an "else". It must check that the first
byte is valid.

> +      byte = scm_get_byte_or_eof (port);
> +      if (byte == EOF || ((byte & 0xc0) != 0x80))
> +        goto invalid_seq;
> +
> +      buf[1] = (scm_t_uint8) byte;
> +      *len = 2;
> +
> +      byte = scm_get_byte_or_eof (port);
> +      if (byte == EOF || ((byte & 0xc0) != 0x80))
> +        goto invalid_seq;
> +
> +      buf[2] = (scm_t_uint8) byte;
> +      *len = 3;
> +
> +      byte = scm_get_byte_or_eof (port);
> +      if (byte == EOF || ((byte & 0xc0) != 0x80))
> +        goto invalid_seq;
> +
> +      buf[3] = (scm_t_uint8) byte;
> +      *len = 4;
> +
> +      *codepoint = ((scm_t_wchar) buf[0] & 0x07) << 18UL
> +        | ((scm_t_wchar) buf[1] & 0x3f) << 12UL
> +        | ((scm_t_wchar) buf[2] & 0x3f) << 6UL
> +        | (buf[3] & 0x3f);
> +    }
> +
> +  return 0;
> +
> + invalid_seq:
> +  /* Return the faulty byte. */
> +  scm_unget_byte (byte, port);

This ungets only the last byte, but there may be up to 4 bytes to unget.

> +
> +  return EILSEQ;
> +}
> +
> +/* Likewise, read a byte sequence from PORT, passing it through its
> +   input conversion descriptor. */
> +static int
> +get_iconv_codepoint (SCM port, scm_t_wchar *codepoint,
> +                     char buf[SCM_MBCHAR_BUF_SIZE], size_t *len)
>  {
> +  scm_t_port *pt;
>    int err, byte_read;
>    size_t bytes_consumed, output_size;
>    char *output;
>    scm_t_uint8 utf8_buf[SCM_MBCHAR_BUF_SIZE];
> -  scm_t_port *pt = SCM_PTAB_ENTRY (port);
>
> -  if (SCM_UNLIKELY (pt->input_cd == (iconv_t) -1))
> -    /* Initialize the conversion descriptors. */
> -    scm_i_set_port_encoding_x (port, pt->encoding);
> +  pt = SCM_PTAB_ENTRY (port);
>
>    for (output_size = 0, output = (char *) utf8_buf,
>         bytes_consumed = 0, err = 0;
> @@ -1174,10 +1265,44 @@ get_codepoint (SCM port, scm_t_wchar *codepoint,
>        output_size = sizeof (utf8_buf) - output_left;
>      }
>
> -  if (SCM_UNLIKELY (err != 0))
> +
> +  if (SCM_LIKELY (err == 0))
> +    {
> +      /* Convert the UTF8_BUF sequence to a Unicode code point. */
> +      *codepoint = utf8_to_codepoint (utf8_buf, output_size);
> +      *len = bytes_consumed;
> +    }
> +
> +  return err;
> +}
> +
> +/* Read a codepoint from PORT and return it in *CODEPOINT. Fill BUF
> +   with the byte representation of the codepoint in PORT's encoding, and
> +   set *LEN to the length in bytes of that representation. Return 0 on
> +   success and an errno value on error. */
> +static int
> +get_codepoint (SCM port, scm_t_wchar *codepoint,
> +               char buf[SCM_MBCHAR_BUF_SIZE], size_t *len)
> +{
> +  int err;
> +  scm_t_port *pt = SCM_PTAB_ENTRY (port);
> +
> +  if (pt->input_cd == (iconv_t) -1)
> +    /* Initialize the conversion descriptors, if needed. */
> +    scm_i_set_port_encoding_x (port, pt->encoding);
> +
> +  if (pt->input_cd == (iconv_t) -1)
> +    err = get_utf8_codepoint (port, codepoint, (scm_t_uint8 *) buf, len);
> +  else
> +    err = get_iconv_codepoint (port, codepoint, buf, len);

From the code above, it appears that for UTF-8 ports,
scm_i_set_port_encoding_x will necessarily be called once per character
read. This seems rather inefficient.

Also, if we wish to support Latin-1 without iconv as well, the simple
method above will not work.

I would recommend adding an enum field to the port which for now only
has two encoding schemes: ICONV or UTF8. Later, we could add LATIN1 and
maybe ASCII as well. Given that this check must be done once per
character, it seems better to do a switch on an enum than to strcmp with
pt->encoding (as is done in scm_i_set_port_encoding_x).

Thanks again for working on this :)

Mark
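
The constraints of Table 3.7 can be sketched as a small standalone decoder. This is only an illustration of the table, not the code that went into Guile: `utf8_decode_strict` is a hypothetical helper that works on an in-memory buffer rather than on a port, and it signals failure by returning 0 instead of `EILSEQ`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of Guile: decode one well-formed UTF-8
   sequence from BUF (SIZE bytes available), following Table 3.7.
   Return the sequence length (1..4) and store the code point in *CP,
   or return 0 if the sequence is ill-formed or truncated. */
static size_t
utf8_decode_strict (const uint8_t *buf, size_t size, uint32_t *cp)
{
  size_t len, i;
  uint8_t lo = 0x80, hi = 0xbf;   /* allowed range for byte[1] */
  uint32_t c;

  if (size == 0)
    return 0;

  if (buf[0] <= 0x7f)
    {
      *cp = buf[0];
      return 1;
    }
  else if (buf[0] >= 0xc2 && buf[0] <= 0xdf)
    {
      len = 2;
      c = buf[0] & 0x1f;
    }
  else if (buf[0] >= 0xe0 && buf[0] <= 0xef)
    {
      len = 3;
      c = buf[0] & 0x0f;
      if (buf[0] == 0xe0)
        lo = 0xa0;              /* reject non-shortest forms */
      else if (buf[0] == 0xed)
        hi = 0x9f;              /* reject surrogates U+D800..U+DFFF */
    }
  else if (buf[0] >= 0xf0 && buf[0] <= 0xf4)
    {
      len = 4;
      c = buf[0] & 0x07;
      if (buf[0] == 0xf0)
        lo = 0x90;              /* reject non-shortest forms */
      else if (buf[0] == 0xf4)
        hi = 0x8f;              /* reject code points above U+10FFFF */
    }
  else
    return 0;                   /* 80..C1 and F5..FF never start a sequence */

  if (size < len)
    return 0;                   /* truncated sequence */

  /* byte[1] carries the per-lead-byte constraint from the table. */
  if (buf[1] < lo || buf[1] > hi)
    return 0;
  c = (c << 6) | (buf[1] & 0x3f);

  /* Remaining bytes are plain continuation bytes, 80..BF. */
  for (i = 2; i < len; i++)
    {
      if ((buf[i] & 0xc0) != 0x80)
        return 0;
      c = (c << 6) | (buf[i] & 0x3f);
    }

  *cp = c;
  return len;
}
```

Only the second byte ever needs a narrowed range, which is why a single `lo`/`hi` pair chosen from the lead byte is enough to cover every starred cell of the table.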
* Re: Reducing iconv-induced memory usage
From: Ludovic Courtès @ 2011-04-27 14:36 UTC
To: Mark H Weaver; +Cc: guile-devel

Hi Mark,

Mark H Weaver <mhw@netris.org> writes:

> ludo@gnu.org (Ludovic Courtès) writes:
>> So, here’s the patch.
>>
>> It also makes UTF-8 input ~30% faster according to ports.bm (which
>> doesn’t benchmark output):
>
> Thanks for working on this. I haven't yet had time to fully review this
> patch, but here I will document the problems I see so far.

Thanks for the review!

> First of all, while looking at this patch, I've discovered another
> problem in ports.c: scm_char_ready_p does not consider the possibility
> of multibyte characters, and returns #t whenever there is at least one
> byte ready.

Indeed; let’s discuss it separately.

> Unicode requires that we reject as ill-formed any UTF-8 byte sequence in
> non-shortest form. For example, we must reject the byte sequence
> 0xC1 0x80 which a permissive reader would read as 0x40, since obviously
> that code point can be encoded as a single byte in UTF-8.
>
> We must also reject any UTF-8 byte sequence that corresponds to a
> surrogate code point (U+D800..U+DFFF), or to a code point greater than
> U+10FFFF.
>
> Table 3.7 of the Unicode 6.0.0 standard, reproduced below, concisely
> shows all well-formed UTF-8 byte sequences. The asterisks highlight
> continuation bytes that are constrained to a smaller range than the
> usual 80..BF.
>
>     code points          byte[0]   byte[1]   byte[2]   byte[3]
>     -----------------------------------------------------------
>     U+000000..U+00007F | 00..7F  |         |         |         |
>     U+000080..U+0007FF | C2..DF  | 80..BF  |         |         |
>     U+000800..U+000FFF | E0      | A0..BF* | 80..BF  |         |
>     U+001000..U+00CFFF | E1..EC  | 80..BF  | 80..BF  |         |
>     U+00D000..U+00D7FF | ED      | 80..9F* | 80..BF  |         |
>     U+00E000..U+00FFFF | EE..EF  | 80..BF  | 80..BF  |         |
>     U+010000..U+03FFFF | F0      | 90..BF* | 80..BF  | 80..BF  |
>     U+040000..U+0FFFFF | F1..F3  | 80..BF  | 80..BF  | 80..BF  |
>     U+100000..U+10FFFF | F4      | 80..8F* | 80..BF  | 80..BF  |
>     -----------------------------------------------------------
>
> So, for the code above corresponding to 2-byte sequences, it would
> suffice to verify that buf[0] >= 0xC2. The 3- and 4-byte cases are
> somewhat more constrained.

Indeed, thanks for educating me. ;-)

Could you add UTF-8 tests for such cases using the just-committed
‘test-decoding-error’ in ports.test?

>> +  else if ((buf[0] & 0xf0) == 0xe0)
>> +    {
>> +      byte = scm_get_byte_or_eof (port);
>> +      if (byte == EOF || ((byte & 0xc0) != 0x80))
>> +        goto invalid_seq;
>> +
>> +      buf[1] = (scm_t_uint8) byte;
>> +      *len = 2;
>> +
>> +      byte = scm_get_byte_or_eof (port);
>> +      if (byte == EOF || ((byte & 0xc0) != 0x80))
>> +        goto invalid_seq;
>> +
>> +      buf[2] = (scm_t_uint8) byte;
>> +      *len = 3;
>> +
>> +      *codepoint = ((scm_t_wchar) buf[0] & 0x0f) << 12UL
>> +        | ((scm_t_wchar) buf[1] & 0x3f) << 6UL
>> +        | (buf[2] & 0x3f);
>> +    }
>> +  else
>> +    {
>
> That ^^^ should not simply be an "else". It must check that the first
> byte is valid.

Right.

>> + invalid_seq:
>> +  /* Return the faulty byte. */
>> +  scm_unget_byte (byte, port);
>
> This ungets only the last byte, but there may be up to 4 bytes to unget.

No, that’s done in ‘peek-char’.

>> +/* Read a codepoint from PORT and return it in *CODEPOINT. Fill BUF
>> +   with the byte representation of the codepoint in PORT's encoding, and
>> +   set *LEN to the length in bytes of that representation. Return 0 on
>> +   success and an errno value on error. */
>> +static int
>> +get_codepoint (SCM port, scm_t_wchar *codepoint,
>> +               char buf[SCM_MBCHAR_BUF_SIZE], size_t *len)
>> +{
>> +  int err;
>> +  scm_t_port *pt = SCM_PTAB_ENTRY (port);
>> +
>> +  if (pt->input_cd == (iconv_t) -1)
>> +    /* Initialize the conversion descriptors, if needed. */
>> +    scm_i_set_port_encoding_x (port, pt->encoding);
>> +
>> +  if (pt->input_cd == (iconv_t) -1)
>> +    err = get_utf8_codepoint (port, codepoint, (scm_t_uint8 *) buf, len);
>> +  else
>> +    err = get_iconv_codepoint (port, codepoint, buf, len);
>
> From the code above, it appears that for UTF-8 ports,
> scm_i_set_port_encoding_x will necessarily be called once per character
> read. This seems rather inefficient.

Correct. Alas, I don’t know how to avoid this inefficiency in 2.0,
because adding a flag to ‘scm_t_port’ would break the ABI. Ideas?

Besides, however inefficient it may seem, it’s still more efficient than
what we currently have, as I explained.

> Also, if we wish to support Latin-1 without iconv as well, the simple
> method above will not work.

Why would we want such a thing? :-)

The starting point for this patch was the observation that our Unicode
I/O converts to/from UTF-8, and then from UTF-8 to our internal
representation, and that it’s wasteful to use iconv to convert from
UTF-8 to UTF-8 when reading from/writing to a UTF-8 port.

> I would recommend adding an enum field to the port which for now only
> has two encoding schemes: ICONV or UTF8. Later, we could add LATIN1 and
> maybe ASCII as well. Given that this check must be done once per
> character, it seems better to do a switch on an enum than to strcmp with
> pt->encoding (as is done in scm_i_set_port_encoding_x).

Agreed; maybe something for ‘master’ once this version is in 2.0?

Thanks,
Ludo’.
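
The enum idea discussed above can be sketched in isolation as follows. This is a toy model under stated assumptions, not Guile's port structure: `struct toy_port`, `enum toy_encoding_mode`, `toy_set_encoding`, and `toy_decoder_name` are all hypothetical names. The point is only that the `strcmp` runs once, when the encoding is set, while the per-character path switches on a cached enum.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names throughout; none of this is Guile's actual API. */
enum toy_encoding_mode
{
  TOY_ENCODING_ICONV,   /* go through an iconv conversion descriptor */
  TOY_ENCODING_UTF8     /* decode/encode UTF-8 directly, no iconv */
};

struct toy_port
{
  const char *encoding;
  enum toy_encoding_mode mode;
};

/* Set PT's encoding. The string comparison happens here, once,
   instead of on every character read. A NULL ENCODING falls back to
   ISO-8859-1, as in the patch. */
static void
toy_set_encoding (struct toy_port *pt, const char *encoding)
{
  if (encoding == NULL)
    encoding = "ISO-8859-1";

  pt->encoding = encoding;
  pt->mode = (strcmp (encoding, "UTF-8") == 0
              ? TOY_ENCODING_UTF8
              : TOY_ENCODING_ICONV);
}

/* Per-character dispatch: a switch on the cached mode. A real reader
   would call the UTF-8 or iconv decoder here; this toy version just
   names the path that would be taken. */
static const char *
toy_decoder_name (const struct toy_port *pt)
{
  switch (pt->mode)
    {
    case TOY_ENCODING_UTF8:
      return "utf8";
    case TOY_ENCODING_ICONV:
      return "iconv";
    }
  return "iconv";
}
```

Adding LATIN1 or ASCII later would mean one more enumerator and one more `case`, which is the extensibility argument made in the review; the cost is a new field in the port structure, which is the ABI concern raised for 2.0.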
* Re: Reducing iconv-induced memory usage
From: Ludovic Courtès @ 2011-05-05 16:19 UTC
To: guile-devel

Hello!

Here’s an updated patch that strictly checks for ill-formed UTF-8
sequences, as Mark pointed out. It passes all the tests I recently
added to ports.test.

I’d like to commit it soon, when Mark approves. :-)

Thanks,
Ludo’.

[-- Attachment #2: the patch --]
[-- Type: text/x-patch, Size: 12009 bytes --]

diff --git a/libguile/ports.c b/libguile/ports.c index b5ad95e..2482a24 100644 --- a/libguile/ports.c +++ b/libguile/ports.c @@ -1057,6 +1057,7 @@ update_port_lf (scm_t_wchar c, SCM port) switch (c) { case '\a': + case EOF: break; case '\b': SCM_DECCOL (port); @@ -1115,23 +1116,162 @@ utf8_to_codepoint (const scm_t_uint8 *utf8_buf, size_t size) return codepoint; } -/* Read a codepoint from PORT and return it in *CODEPOINT. Fill BUF - with the byte representation of the codepoint in PORT's encoding, and - set *LEN to the length in bytes of that representation. Return 0 on - success and an errno value on error. */ +/* Read a UTF-8 sequence from PORT. On success, return 0 and set + *CODEPOINT to the codepoint that was read, fill BUF with its UTF-8 + representation, and set *LEN to the length in bytes. Return + `EILSEQ' on error. 
*/
 static int
-get_codepoint (SCM port, scm_t_wchar *codepoint,
-               char buf[SCM_MBCHAR_BUF_SIZE], size_t *len)
+get_utf8_codepoint (SCM port, scm_t_wchar *codepoint,
+                    scm_t_uint8 buf[SCM_MBCHAR_BUF_SIZE], size_t *len)
 {
+#define ASSERT_NOT_EOF(b)                       \
+  if (SCM_UNLIKELY ((b) == EOF))                \
+    goto invalid_seq
+
+  int byte;
+
+  *len = 0;
+
+  byte = scm_get_byte_or_eof (port);
+  if (byte == EOF)
+    {
+      *codepoint = EOF;
+      return 0;
+    }
+
+  buf[0] = (scm_t_uint8) byte;
+  *len = 1;
+
+  if (buf[0] <= 0x7f)
+    /* 1-byte form. */
+    *codepoint = buf[0];
+  else if (buf[0] >= 0xc2 && buf[0] <= 0xdf)
+    {
+      /* 2-byte form. */
+      byte = scm_get_byte_or_eof (port);
+      ASSERT_NOT_EOF (byte);
+
+      buf[1] = (scm_t_uint8) byte;
+      *len = 2;
+
+      if (SCM_UNLIKELY ((byte & 0xc0) != 0x80))
+        goto invalid_seq;
+
+      *codepoint = ((scm_t_wchar) buf[0] & 0x1f) << 6UL
+        | (buf[1] & 0x3f);
+    }
+  else if ((buf[0] & 0xf0) == 0xe0)
+    {
+      /* 3-byte form. */
+      byte = scm_get_byte_or_eof (port);
+      if (SCM_UNLIKELY (byte == EOF))
+        goto invalid_seq;
+
+      buf[1] = (scm_t_uint8) byte;
+      *len = 2;
+
+      if (SCM_UNLIKELY ((byte & 0xc0) != 0x80
+                        || (buf[0] == 0xe0 && byte < 0xa0)
+                        || (buf[0] == 0xed && byte > 0x9f)))
+        {
+          /* Swallow the 3rd byte. */
+          byte = scm_get_byte_or_eof (port);
+          ASSERT_NOT_EOF (byte);
+          *len = 3, buf[2] = byte;
+          goto invalid_seq;
+        }
+
+
+      byte = scm_get_byte_or_eof (port);
+      ASSERT_NOT_EOF (byte);
+
+      buf[2] = (scm_t_uint8) byte;
+      *len = 3;
+
+      if (SCM_UNLIKELY ((byte & 0xc0) != 0x80))
+        goto invalid_seq;
+
+      *codepoint = ((scm_t_wchar) buf[0] & 0x0f) << 12UL
+        | ((scm_t_wchar) buf[1] & 0x3f) << 6UL
+        | (buf[2] & 0x3f);
+    }
+  else if (buf[0] >= 0xf0 && buf[0] <= 0xf4)
+    {
+      /* 4-byte form. */
+      byte = scm_get_byte_or_eof (port);
+      ASSERT_NOT_EOF (byte);
+
+      buf[1] = (scm_t_uint8) byte;
+      *len = 2;
+
+      if (SCM_UNLIKELY (((byte & 0xc0) != 0x80)
+                        || (buf[0] == 0xf0 && byte < 0x90)
+                        || (buf[0] == 0xf4 && byte > 0x8f)))
+        {
+          /* Swallow the 3rd and 4th bytes. */
+          byte = scm_get_byte_or_eof (port);
+          ASSERT_NOT_EOF (byte);
+          *len = 3, buf[2] = byte;
+
+          byte = scm_get_byte_or_eof (port);
+          ASSERT_NOT_EOF (byte);
+          *len = 4, buf[3] = byte;
+          goto invalid_seq;
+        }
+
+      byte = scm_get_byte_or_eof (port);
+      ASSERT_NOT_EOF (byte);
+
+      buf[2] = (scm_t_uint8) byte;
+      *len = 3;
+
+      if (SCM_UNLIKELY ((byte & 0xc0) != 0x80))
+        {
+          /* Swallow the 4th byte. */
+          byte = scm_get_byte_or_eof (port);
+          ASSERT_NOT_EOF (byte);
+          *len = 4, buf[3] = byte;
+          goto invalid_seq;
+        }
+
+      byte = scm_get_byte_or_eof (port);
+      ASSERT_NOT_EOF (byte);
+
+      buf[3] = (scm_t_uint8) byte;
+      *len = 4;
+
+      if (SCM_UNLIKELY ((byte & 0xc0) != 0x80))
+        goto invalid_seq;
+
+      *codepoint = ((scm_t_wchar) buf[0] & 0x07) << 18UL
+        | ((scm_t_wchar) buf[1] & 0x3f) << 12UL
+        | ((scm_t_wchar) buf[2] & 0x3f) << 6UL
+        | (buf[3] & 0x3f);
+    }
+  else
+    goto invalid_seq;
+
+  return 0;
+
+ invalid_seq:
+  return EILSEQ;
+
+#undef ASSERT_NOT_EOF
+}
+
+/* Likewise, read a byte sequence from PORT, passing it through its
+   input conversion descriptor. */
+static int
+get_iconv_codepoint (SCM port, scm_t_wchar *codepoint,
+                     char buf[SCM_MBCHAR_BUF_SIZE], size_t *len)
+{
+  scm_t_port *pt;
   int err, byte_read;
   size_t bytes_consumed, output_size;
   char *output;
   scm_t_uint8 utf8_buf[SCM_MBCHAR_BUF_SIZE];
-  scm_t_port *pt = SCM_PTAB_ENTRY (port);

-  if (SCM_UNLIKELY (pt->input_cd == (iconv_t) -1))
-    /* Initialize the conversion descriptors. */
-    scm_i_set_port_encoding_x (port, pt->encoding);
+  pt = SCM_PTAB_ENTRY (port);

   for (output_size = 0, output = (char *) utf8_buf,
          bytes_consumed = 0, err = 0;
@@ -1177,31 +1317,45 @@ get_codepoint (SCM port, scm_t_wchar *codepoint,
   if (SCM_UNLIKELY (output_size == 0))
     /* An unterminated sequence. */
     err = EILSEQ;
-
-  if (SCM_UNLIKELY (err != 0))
+  else if (SCM_LIKELY (err == 0))
     {
-      /* Reset the `iconv' state. */
-      iconv (pt->input_cd, NULL, NULL, NULL, NULL);
+      /* Convert the UTF8_BUF sequence to a Unicode code point. */
+      *codepoint = utf8_to_codepoint (utf8_buf, output_size);
+      *len = bytes_consumed;
+    }

-      if (pt->ilseq_handler == SCM_ICONVEH_QUESTION_MARK)
-        {
-          *codepoint = '?';
-          err = 0;
-        }
+  return err;
+}

-      /* Fail when the strategy is SCM_ICONVEH_ERROR or
-         SCM_ICONVEH_ESCAPE_SEQUENCE (the latter doesn't make sense for
-         input encoding errors.) */
-    }
+/* Read a codepoint from PORT and return it in *CODEPOINT. Fill BUF
+   with the byte representation of the codepoint in PORT's encoding, and
+   set *LEN to the length in bytes of that representation. Return 0 on
+   success and an errno value on error. */
+static int
+get_codepoint (SCM port, scm_t_wchar *codepoint,
+               char buf[SCM_MBCHAR_BUF_SIZE], size_t *len)
+{
+  int err;
+  scm_t_port *pt = SCM_PTAB_ENTRY (port);
+
+  if (pt->input_cd == (iconv_t) -1)
+    /* Initialize the conversion descriptors, if needed. */
+    scm_i_set_port_encoding_x (port, pt->encoding);
+
+  if (pt->input_cd == (iconv_t) -1)
+    err = get_utf8_codepoint (port, codepoint, (scm_t_uint8 *) buf, len);
   else
+    err = get_iconv_codepoint (port, codepoint, buf, len);
+
+  if (SCM_LIKELY (err == 0))
+    update_port_lf (*codepoint, port);
+  else if (pt->ilseq_handler == SCM_ICONVEH_QUESTION_MARK)
     {
-      /* Convert the UTF8_BUF sequence to a Unicode code point. */
-      *codepoint = utf8_to_codepoint (utf8_buf, output_size);
+      *codepoint = '?';
+      err = 0;
       update_port_lf (*codepoint, port);
     }

-  *len = bytes_consumed;
-
   return err;
 }

@@ -2031,28 +2185,35 @@ scm_i_set_port_encoding_x (SCM port, const char *encoding)
   if (encoding == NULL)
     encoding = "ISO-8859-1";

-  pt->encoding = scm_gc_strdup (encoding, "port");
+  if (pt->encoding != encoding)
+    pt->encoding = scm_gc_strdup (encoding, "port");

-  if (SCM_CELL_WORD_0 (port) & SCM_RDNG)
+  /* If ENCODING is UTF-8, then no conversion descriptor is opened
+     because we do I/O ourselves. This saves 100+ KiB for each
+     descriptor. */
+  if (strcmp (encoding, "UTF-8"))
     {
-      /* Open an input iconv conversion descriptor, from ENCODING
-         to UTF-8. We choose UTF-8, not UTF-32, because iconv
-         implementations can typically convert from anything to
-         UTF-8, but not to UTF-32 (see
-         <http://lists.gnu.org/archive/html/bug-libunistring/2010-09/msg00007.html>). */
-      new_input_cd = iconv_open ("UTF-8", encoding);
-      if (new_input_cd == (iconv_t) -1)
-        goto invalid_encoding;
-    }
+      if (SCM_CELL_WORD_0 (port) & SCM_RDNG)
+        {
+          /* Open an input iconv conversion descriptor, from ENCODING
+             to UTF-8. We choose UTF-8, not UTF-32, because iconv
+             implementations can typically convert from anything to
+             UTF-8, but not to UTF-32 (see
+             <http://lists.gnu.org/archive/html/bug-libunistring/2010-09/msg00007.html>). */
+          new_input_cd = iconv_open ("UTF-8", encoding);
+          if (new_input_cd == (iconv_t) -1)
+            goto invalid_encoding;
+        }

-  if (SCM_CELL_WORD_0 (port) & SCM_WRTNG)
-    {
-      new_output_cd = iconv_open (encoding, "UTF-8");
-      if (new_output_cd == (iconv_t) -1)
+      if (SCM_CELL_WORD_0 (port) & SCM_WRTNG)
         {
-          if (new_input_cd != (iconv_t) -1)
-            iconv_close (new_input_cd);
-          goto invalid_encoding;
+          new_output_cd = iconv_open (encoding, "UTF-8");
+          if (new_output_cd == (iconv_t) -1)
+            {
+              if (new_input_cd != (iconv_t) -1)
+                iconv_close (new_input_cd);
+              goto invalid_encoding;
+            }
         }
     }

diff --git a/libguile/print.c b/libguile/print.c
index 1399566..d5c015b 100644
--- a/libguile/print.c
+++ b/libguile/print.c
@@ -821,31 +821,57 @@ codepoint_to_utf8 (scm_t_wchar ch, scm_t_uint8 utf8[4])
   return len;
 }

-/* Display the LEN codepoints in STR to PORT according to STRATEGY;
-   return the number of codepoints successfully displayed. If NARROW_P,
-   then STR is interpreted as a sequence of `char', denoting a Latin-1
-   string; otherwise it's interpreted as a sequence of
-   `scm_t_wchar'. */
-static size_t
-display_string (const void *str, int narrow_p,
-                size_t len, SCM port,
-                scm_t_string_failed_conversion_handler strategy)
-
-{
 #define STR_REF(s, x)                              \
   (narrow_p                                        \
    ? (scm_t_wchar) ((unsigned char *) (s))[x]      \
    : ((scm_t_wchar *) (s))[x])

+/* Write STR to PORT as UTF-8. STR is a LEN-codepoint string; it is
+   narrow if NARROW_P is true, wide otherwise. Return LEN. */
+static size_t
+display_string_as_utf8 (const void *str, int narrow_p, size_t len,
+                        SCM port)
+{
+  size_t printed = 0;
+
+  while (len > printed)
+    {
+      size_t utf8_len, i;
+      char *input, utf8_buf[256];
+
+      /* Convert STR to UTF-8. */
+      for (i = printed, utf8_len = 0, input = utf8_buf;
+           i < len && utf8_len + 4 < sizeof (utf8_buf);
+           i++)
+        {
+          utf8_len += codepoint_to_utf8 (STR_REF (str, i),
+                                         (scm_t_uint8 *) input);
+          input = utf8_buf + utf8_len;
+        }
+
+      /* INPUT was successfully converted, entirely; print the
+         result. */
+      scm_lfwrite (utf8_buf, utf8_len, port);
+      printed += i - printed;
+    }
+
+  assert (printed == len);
+
+  return len;
+}
+
+/* Convert STR through PORT's output conversion descriptor and write the
+   output to PORT. Return the number of codepoints written. */
+static size_t
+display_string_using_iconv (const void *str, int narrow_p, size_t len,
+                            SCM port,
+                            scm_t_string_failed_conversion_handler strategy)
+{
   size_t printed;
   scm_t_port *pt;

   pt = SCM_PTAB_ENTRY (port);

-  if (SCM_UNLIKELY (pt->output_cd == (iconv_t) -1))
-    /* Initialize the conversion descriptors. */
-    scm_i_set_port_encoding_x (port, pt->encoding);
-
   printed = 0;

   while (len > printed)
@@ -928,7 +954,34 @@ display_string (const void *str, int narrow_p,
     }

   return printed;
+}
+
 #undef STR_REF
+
+/* Display the LEN codepoints in STR to PORT according to STRATEGY;
+   return the number of codepoints successfully displayed. If NARROW_P,
+   then STR is interpreted as a sequence of `char', denoting a Latin-1
+   string; otherwise it's interpreted as a sequence of
+   `scm_t_wchar'. */
+static size_t
+display_string (const void *str, int narrow_p,
+                size_t len, SCM port,
+                scm_t_string_failed_conversion_handler strategy)
+
+{
+  scm_t_port *pt;
+
+  pt = SCM_PTAB_ENTRY (port);
+
+  if (pt->output_cd == (iconv_t) -1)
+    /* Initialize the conversion descriptors, if needed. */
+    scm_i_set_port_encoding_x (port, pt->encoding);
+
+  if (pt->output_cd == (iconv_t) -1)
+    return display_string_as_utf8 (str, narrow_p, len, port);
+  else
+    return display_string_using_iconv (str, narrow_p, len,
+                                       port, strategy);
 }

 /* Attempt to display CH to PORT according to STRATEGY. Return non-zero
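For readers who want to try the chunked UTF-8 output path of the patch above outside of Guile, here is a minimal standalone sketch. It is not Guile code: `encode_utf8` and `write_as_utf8` are hypothetical names, a stdio `FILE *` stands in for the port, and `encode_utf8` is only assumed to behave like `codepoint_to_utf8` for valid code points. The loop condition uses `<=` where the patch uses `<`, which only changes how full the buffer may get.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for `codepoint_to_utf8': encode CH into UTF8 and
   return the number of bytes written (1 to 4).  Assumes CH is a valid
   Unicode scalar value; no validation is done here.  */
static size_t
encode_utf8 (uint32_t ch, uint8_t utf8[4])
{
  if (ch <= 0x7f)
    {
      utf8[0] = (uint8_t) ch;
      return 1;
    }
  if (ch <= 0x7ff)
    {
      utf8[0] = (uint8_t) (0xc0 | (ch >> 6));
      utf8[1] = (uint8_t) (0x80 | (ch & 0x3f));
      return 2;
    }
  if (ch <= 0xffff)
    {
      utf8[0] = (uint8_t) (0xe0 | (ch >> 12));
      utf8[1] = (uint8_t) (0x80 | ((ch >> 6) & 0x3f));
      utf8[2] = (uint8_t) (0x80 | (ch & 0x3f));
      return 3;
    }
  utf8[0] = (uint8_t) (0xf0 | (ch >> 18));
  utf8[1] = (uint8_t) (0x80 | ((ch >> 12) & 0x3f));
  utf8[2] = (uint8_t) (0x80 | ((ch >> 6) & 0x3f));
  utf8[3] = (uint8_t) (0x80 | (ch & 0x3f));
  return 4;
}

/* Write the LEN code points in STR to OUT as UTF-8, going through a
   fixed-size stack buffer so that no heap allocation and no iconv
   descriptor is needed.  Return the number of bytes written.  */
static size_t
write_as_utf8 (const uint32_t *str, size_t len, FILE *out)
{
  size_t done = 0, total = 0;

  assert (out != NULL);

  while (done < len)
    {
      uint8_t buf[256];
      size_t used = 0;

      /* Fill the buffer for as long as a worst-case 4-byte sequence
         still fits.  */
      while (done < len && used + 4 <= sizeof buf)
        used += encode_utf8 (str[done++], buf + used);

      total += fwrite (buf, 1, used, out);
    }

  return total;
}
```

Calling `write_as_utf8` on the three code points U+0041, U+00E9 and U+D55C writes 1 + 2 + 3 = 6 bytes; strings longer than the buffer are simply emitted in several chunks.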
* Re: Reducing iconv-induced memory usage
  2011-05-05 16:19 ` Ludovic Courtès
@ 2011-05-06 16:19 ` Ludovic Courtès
  2011-05-07 20:51 ` Ludovic Courtès
  0 siblings, 1 reply; 7+ messages in thread
From: Ludovic Courtès @ 2011-05-06 16:19 UTC
  To: guile-devel

Hello,

ludo@gnu.org (Ludovic Courtès) writes:

> Here’s an updated patch that strictly checks for ill-formed UTF-8
> sequences, as Mark pointed out. It passes all the tests I recently
> added to ports.test.

I committed it, though Mark rightfully noted on IRC a non-conformance
issue. I’ve added a FIXME and started looking into it.

Thanks, Mark, for the detailed review!

Thanks,
Ludo’.
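To make "strictly checks for ill-formed UTF-8 sequences" concrete, the following is a small standalone sketch of the same byte-range checks, written against an in-memory buffer rather than a Guile port. It is an illustration only: `decode_utf8` is a hypothetical helper, not a Guile function, and it reports failure by returning 0 instead of consuming bytes and returning `EILSEQ`, so it says nothing about how many bytes the committed code swallows on error.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of Guile: decode one code point from the
   first AVAIL bytes of S.  On success store it in *CP and return the
   number of bytes used (1 to 4).  Return 0, leaving *CP untouched, if S
   starts with an ill-formed or truncated sequence.  */
static size_t
decode_utf8 (const uint8_t *s, size_t avail, uint32_t *cp)
{
  size_t need, i;
  uint32_t value;
  uint8_t lo = 0x80, hi = 0xbf;   /* allowed range for the 2nd byte */

  assert (s != NULL && cp != NULL);

  if (avail == 0)
    return 0;

  if (s[0] <= 0x7f)
    {
      /* 1-byte form.  */
      *cp = s[0];
      return 1;
    }
  else if (s[0] >= 0xc2 && s[0] <= 0xdf)
    {
      /* 2-byte form; 0xc0 and 0xc1 would be overlong.  */
      need = 2;
      value = s[0] & 0x1f;
    }
  else if (s[0] >= 0xe0 && s[0] <= 0xef)
    {
      /* 3-byte form.  */
      need = 3;
      value = s[0] & 0x0f;
      if (s[0] == 0xe0)
        lo = 0xa0;                /* reject overlong encodings */
      if (s[0] == 0xed)
        hi = 0x9f;                /* reject surrogates U+D800..U+DFFF */
    }
  else if (s[0] >= 0xf0 && s[0] <= 0xf4)
    {
      /* 4-byte form.  */
      need = 4;
      value = s[0] & 0x07;
      if (s[0] == 0xf0)
        lo = 0x90;                /* reject overlong encodings */
      if (s[0] == 0xf4)
        hi = 0x8f;                /* reject values above U+10FFFF */
    }
  else
    return 0;                     /* stray continuation or invalid lead */

  if (avail < need)
    return 0;

  for (i = 1; i < need; i++)
    {
      if (s[i] < lo || s[i] > hi)
        return 0;
      value = (value << 6) | (s[i] & 0x3f);
      lo = 0x80, hi = 0xbf;       /* later bytes use the plain range */
    }

  *cp = value;
  return need;
}
```

With this sketch, the bytes `ED 95 9C` decode to U+D55C, while the overlong pair `C0 80`, the surrogate encoding `ED A0 80` and the out-of-range `F4 90 80 80` are all rejected.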
* Re: Reducing iconv-induced memory usage
  2011-05-06 16:19 ` Ludovic Courtès
@ 2011-05-07 20:51 ` Ludovic Courtès
  0 siblings, 0 replies; 7+ messages in thread
From: Ludovic Courtès @ 2011-05-07 20:51 UTC
  To: guile-devel

Hello!

ludo@gnu.org (Ludovic Courtès) writes:

> Hello,
>
> ludo@gnu.org (Ludovic Courtès) writes:
>
>> Here’s an updated patch that strictly checks for ill-formed UTF-8
>> sequences, as Mark pointed out. It passes all the tests I recently
>> added to ports.test.
>
> I committed it, though Mark rightfully noted on IRC a non-conformance
> issue. I’ve added a FIXME and started looking into it.

Commit 7be1705dbda377780335ecbcbfce04de523f2671 fixes it, AFAICS.

Mark, please let me know if you spot other errors!

Thanks,
Ludo’.
Thread overview: 7+ messages
2011-04-26 21:10 Reducing iconv-induced memory usage Ludovic Courtès
2011-04-26 22:41 ` Ludovic Courtès
2011-04-27  3:47 ` Mark H Weaver
2011-04-27 14:36 ` Ludovic Courtès
2011-05-05 16:19 ` Ludovic Courtès
2011-05-06 16:19 ` Ludovic Courtès
2011-05-07 20:51 ` Ludovic Courtès