* Reducing iconv-induced memory usage
@ 2011-04-26 21:10 Ludovic Courtès
2011-04-26 22:41 ` Ludovic Courtès
0 siblings, 1 reply; 7+ messages in thread
From: Ludovic Courtès @ 2011-04-26 21:10 UTC (permalink / raw)
To: guile-devel
[-- Attachment #1: Type: text/plain, Size: 725 bytes --]
Hello!
As Andy noted in the past, iconv conversion descriptors associated with
ports take up a lot of malloc’d memory, which only gets freed when
finalizers are run. On GNU/Linux, a UTF-8 → UTF-8 conversion descriptor,
which does nothing, mallocs 180 KiB (!), according to the attached
program. So the problem is acute.
So I think we should special-case UTF-8 I/O to not use iconv at all.
For output, it’s easy since we already do the conversion to UTF-8 in
‘display_string’. For input, it’s a bit more work because input byte
streams have to be checked for invalid sequences.
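To illustrate why the output side is the easy one, here is a minimal sketch of a code point to UTF-8 encoder. It is not the `codepoint_to_utf8' function from print.c; the name `encode_utf8' and the choice to return 0 for surrogates and out-of-range values are assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not Guile's API): store the UTF-8 form of CH in
   UTF8 and return its length in bytes.  Return 0 if CH is a surrogate
   or lies above U+10FFFF, since neither can be encoded.  */
static size_t
encode_utf8 (uint32_t ch, uint8_t utf8[4])
{
  if (ch <= 0x7f)
    {
      utf8[0] = (uint8_t) ch;
      return 1;
    }
  if (ch <= 0x7ff)
    {
      utf8[0] = (uint8_t) (0xc0 | (ch >> 6));
      utf8[1] = (uint8_t) (0x80 | (ch & 0x3f));
      return 2;
    }
  if (ch >= 0xd800 && ch <= 0xdfff)
    return 0;
  if (ch <= 0xffff)
    {
      utf8[0] = (uint8_t) (0xe0 | (ch >> 12));
      utf8[1] = (uint8_t) (0x80 | ((ch >> 6) & 0x3f));
      utf8[2] = (uint8_t) (0x80 | (ch & 0x3f));
      return 3;
    }
  if (ch <= 0x10ffff)
    {
      utf8[0] = (uint8_t) (0xf0 | (ch >> 18));
      utf8[1] = (uint8_t) (0x80 | ((ch >> 12) & 0x3f));
      utf8[2] = (uint8_t) (0x80 | ((ch >> 6) & 0x3f));
      utf8[3] = (uint8_t) (0x80 | (ch & 0x3f));
      return 4;
    }
  return 0;
}
```

Encoding never has to look at neighbouring bytes, which is why no validation pass is needed on output; decoding untrusted input is where the extra checks come in.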
I’m working on a patch, but I’d like to get initial feedback, as well as
opinions on whether it should wait until after 2.0.1.
Thanks,
Ludo’.
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: the program --]
[-- Type: text/x-csrc, Size: 572 bytes --]
#include <iconv.h>
#include <malloc.h>
#include <stdio.h>
static size_t total;
static void * (*prev_hook) (size_t, const void *);
static void *
m (size_t s, const void *c)
{
__malloc_hook = prev_hook;
printf ("alloc %zu\n", s);
void *r = malloc (s);
total += s;
__malloc_hook = &m;
return r;
}
static void
my_init_hook (void)
{
prev_hook = __malloc_hook;
__malloc_hook = &m;
}
void (*__malloc_initialize_hook) (void) = my_init_hook;
int
main (int argc, char *argv[])
{
total = 0;
iconv_open ("UTF-8", "UTF-8");
printf ("allocated %zu B\n", total);
return 0;
}
* Re: Reducing iconv-induced memory usage
2011-04-26 21:10 Reducing iconv-induced memory usage Ludovic Courtès
@ 2011-04-26 22:41 ` Ludovic Courtès
2011-04-27 3:47 ` Mark H Weaver
0 siblings, 1 reply; 7+ messages in thread
From: Ludovic Courtès @ 2011-04-26 22:41 UTC (permalink / raw)
To: guile-devel
[-- Attachment #1: Type: text/plain, Size: 1581 bytes --]
Hi!
So, here’s the patch.
It also makes UTF-8 input ~30% faster according to ports.bm (which
doesn’t benchmark output):
* before:
("ports.bm: peek-char: latin-1 port" 700000 user 0.36)
("ports.bm: peek-char: utf-8 port, ascii character" 700000 user 0.35)
("ports.bm: peek-char: utf-8 port, Korean character" 700000 user 0.61)
("ports.bm: read-char: latin-1 port" 10000000 user 3.32)
("ports.bm: read-char: utf-8 port, ascii character" 10000000 user 3.33)
("ports.bm: read-char: utf-8 port, Korean character" 10000000 user 3.39)
("ports.bm: char-ready?: latin-1 port" 10000000 user 2.95)
("ports.bm: char-ready?: utf-8 port, ascii character" 10000000 user 2.96)
("ports.bm: char-ready?: utf-8 port, Korean character" 10000000 user 3.01)
("ports.bm: rdelim: read-line" 1000 user 3.1)
* after:
("ports.bm: peek-char: latin-1 port" 700000 user 0.31)
("ports.bm: peek-char: utf-8 port, ascii character" 700000 user 0.24)
("ports.bm: peek-char: utf-8 port, Korean character" 700000 user 0.3)
("ports.bm: read-char: latin-1 port" 10000000 user 2.73)
("ports.bm: read-char: utf-8 port, ascii character" 10000000 user 3.38)
("ports.bm: read-char: utf-8 port, Korean character" 10000000 user 3.37)
("ports.bm: char-ready?: latin-1 port" 10000000 user 2.42)
("ports.bm: char-ready?: utf-8 port, ascii character" 10000000 user 2.41)
("ports.bm: char-ready?: utf-8 port, Korean character" 10000000 user 2.43)
("ports.bm: rdelim: read-line" 1000 user 1.91)
Comments? OK to apply?
Thanks,
Ludo’.
[-- Attachment #2: the patch --]
[-- Type: text/x-patch, Size: 12384 bytes --]
diff --git a/libguile/ports.c b/libguile/ports.c
index 6e0ae6c..d728356 100644
--- a/libguile/ports.c
+++ b/libguile/ports.c
@@ -1057,6 +1057,7 @@ update_port_lf (scm_t_wchar c, SCM port)
switch (c)
{
case '\a':
+ case EOF:
break;
case '\b':
SCM_DECCOL (port);
@@ -1115,23 +1116,113 @@ utf8_to_codepoint (const scm_t_uint8 *utf8_buf, size_t size)
return codepoint;
}
-/* Read a codepoint from PORT and return it in *CODEPOINT. Fill BUF
- with the byte representation of the codepoint in PORT's encoding, and
- set *LEN to the length in bytes of that representation. Return 0 on
- success and an errno value on error. */
+/* Read a UTF-8 sequence from PORT. On success, return 0 and set
+ *CODEPOINT to the codepoint that was read, fill BUF with its UTF-8
+ representation, and set *LEN to the length in bytes. Return
+ `EILSEQ' on error. */
static int
-get_codepoint (SCM port, scm_t_wchar *codepoint,
- char buf[SCM_MBCHAR_BUF_SIZE], size_t *len)
+get_utf8_codepoint (SCM port, scm_t_wchar *codepoint,
+ scm_t_uint8 buf[SCM_MBCHAR_BUF_SIZE], size_t *len)
+{
+ int byte;
+
+ *len = 0;
+
+ byte = scm_get_byte_or_eof (port);
+ if (byte == EOF)
+ {
+ *codepoint = EOF;
+ return 0;
+ }
+
+ buf[0] = (scm_t_uint8) byte;
+ *len = 1;
+
+ if (buf[0] <= 0x7f)
+ *codepoint = buf[0];
+ else if ((buf[0] & 0xe0) == 0xc0)
+ {
+ byte = scm_get_byte_or_eof (port);
+ if (byte == EOF || ((byte & 0xc0) != 0x80))
+ goto invalid_seq;
+
+ buf[1] = (scm_t_uint8) byte;
+ *len = 2;
+
+ *codepoint = ((scm_t_wchar) buf[0] & 0x1f) << 6UL
+ | (buf[1] & 0x3f);
+ }
+ else if ((buf[0] & 0xf0) == 0xe0)
+ {
+ byte = scm_get_byte_or_eof (port);
+ if (byte == EOF || ((byte & 0xc0) != 0x80))
+ goto invalid_seq;
+
+ buf[1] = (scm_t_uint8) byte;
+ *len = 2;
+
+ byte = scm_get_byte_or_eof (port);
+ if (byte == EOF || ((byte & 0xc0) != 0x80))
+ goto invalid_seq;
+
+ buf[2] = (scm_t_uint8) byte;
+ *len = 3;
+
+ *codepoint = ((scm_t_wchar) buf[0] & 0x0f) << 12UL
+ | ((scm_t_wchar) buf[1] & 0x3f) << 6UL
+ | (buf[2] & 0x3f);
+ }
+ else
+ {
+ byte = scm_get_byte_or_eof (port);
+ if (byte == EOF || ((byte & 0xc0) != 0x80))
+ goto invalid_seq;
+
+ buf[1] = (scm_t_uint8) byte;
+ *len = 2;
+
+ byte = scm_get_byte_or_eof (port);
+ if (byte == EOF || ((byte & 0xc0) != 0x80))
+ goto invalid_seq;
+
+ buf[2] = (scm_t_uint8) byte;
+ *len = 3;
+
+ byte = scm_get_byte_or_eof (port);
+ if (byte == EOF || ((byte & 0xc0) != 0x80))
+ goto invalid_seq;
+
+ buf[3] = (scm_t_uint8) byte;
+ *len = 4;
+
+ *codepoint = ((scm_t_wchar) buf[0] & 0x07) << 18UL
+ | ((scm_t_wchar) buf[1] & 0x3f) << 12UL
+ | ((scm_t_wchar) buf[2] & 0x3f) << 6UL
+ | (buf[3] & 0x3f);
+ }
+
+ return 0;
+
+ invalid_seq:
+ /* Return the faulty byte. */
+ scm_unget_byte (byte, port);
+
+ return EILSEQ;
+}
+
+/* Likewise, read a byte sequence from PORT, passing it through its
+ input conversion descriptor. */
+static int
+get_iconv_codepoint (SCM port, scm_t_wchar *codepoint,
+ char buf[SCM_MBCHAR_BUF_SIZE], size_t *len)
{
+ scm_t_port *pt;
int err, byte_read;
size_t bytes_consumed, output_size;
char *output;
scm_t_uint8 utf8_buf[SCM_MBCHAR_BUF_SIZE];
- scm_t_port *pt = SCM_PTAB_ENTRY (port);
- if (SCM_UNLIKELY (pt->input_cd == (iconv_t) -1))
- /* Initialize the conversion descriptors. */
- scm_i_set_port_encoding_x (port, pt->encoding);
+ pt = SCM_PTAB_ENTRY (port);
for (output_size = 0, output = (char *) utf8_buf,
bytes_consumed = 0, err = 0;
@@ -1174,10 +1265,44 @@ get_codepoint (SCM port, scm_t_wchar *codepoint,
output_size = sizeof (utf8_buf) - output_left;
}
- if (SCM_UNLIKELY (err != 0))
+
+ if (SCM_LIKELY (err == 0))
+ {
+ /* Convert the UTF8_BUF sequence to a Unicode code point. */
+ *codepoint = utf8_to_codepoint (utf8_buf, output_size);
+ *len = bytes_consumed;
+ }
+
+ return err;
+}
+
+/* Read a codepoint from PORT and return it in *CODEPOINT. Fill BUF
+ with the byte representation of the codepoint in PORT's encoding, and
+ set *LEN to the length in bytes of that representation. Return 0 on
+ success and an errno value on error. */
+static int
+get_codepoint (SCM port, scm_t_wchar *codepoint,
+ char buf[SCM_MBCHAR_BUF_SIZE], size_t *len)
+{
+ int err;
+ scm_t_port *pt = SCM_PTAB_ENTRY (port);
+
+ if (pt->input_cd == (iconv_t) -1)
+ /* Initialize the conversion descriptors, if needed. */
+ scm_i_set_port_encoding_x (port, pt->encoding);
+
+ if (pt->input_cd == (iconv_t) -1)
+ err = get_utf8_codepoint (port, codepoint, (scm_t_uint8 *) buf, len);
+ else
+ err = get_iconv_codepoint (port, codepoint, buf, len);
+
+ if (SCM_LIKELY (err == 0))
+ update_port_lf (*codepoint, port);
+ else
{
- /* Reset the `iconv' state. */
- iconv (pt->input_cd, NULL, NULL, NULL, NULL);
+ if (pt->input_cd != (iconv_t) -1)
+ /* Reset the `iconv' state. */
+ iconv (pt->input_cd, NULL, NULL, NULL, NULL);
if (pt->ilseq_handler == SCM_ICONVEH_QUESTION_MARK)
{
@@ -1189,14 +1314,6 @@ get_codepoint (SCM port, scm_t_wchar *codepoint,
SCM_ICONVEH_ESCAPE_SEQUENCE (the latter doesn't make sense for
input encoding errors.) */
}
- else
- /* Convert the UTF8_BUF sequence to a Unicode code point. */
- *codepoint = utf8_to_codepoint (utf8_buf, output_size);
-
- if (SCM_LIKELY (err == 0))
- update_port_lf (*codepoint, port);
-
- *len = bytes_consumed;
return err;
}
@@ -2027,28 +2144,35 @@ scm_i_set_port_encoding_x (SCM port, const char *encoding)
if (encoding == NULL)
encoding = "ISO-8859-1";
- pt->encoding = scm_gc_strdup (encoding, "port");
+ if (pt->encoding != encoding)
+ pt->encoding = scm_gc_strdup (encoding, "port");
- if (SCM_CELL_WORD_0 (port) & SCM_RDNG)
+ /* If ENCODING is UTF-8, then no conversion descriptor is opened
+ because we do I/O ourselves. This saves 100+ KiB for each
+ descriptor. */
+ if (strcmp (encoding, "UTF-8"))
{
- /* Open an input iconv conversion descriptor, from ENCODING
- to UTF-8. We choose UTF-8, not UTF-32, because iconv
- implementations can typically convert from anything to
- UTF-8, but not to UTF-32 (see
- <http://lists.gnu.org/archive/html/bug-libunistring/2010-09/msg00007.html>). */
- new_input_cd = iconv_open ("UTF-8", encoding);
- if (new_input_cd == (iconv_t) -1)
- goto invalid_encoding;
- }
+ if (SCM_CELL_WORD_0 (port) & SCM_RDNG)
+ {
+ /* Open an input iconv conversion descriptor, from ENCODING
+ to UTF-8. We choose UTF-8, not UTF-32, because iconv
+ implementations can typically convert from anything to
+ UTF-8, but not to UTF-32 (see
+ <http://lists.gnu.org/archive/html/bug-libunistring/2010-09/msg00007.html>). */
+ new_input_cd = iconv_open ("UTF-8", encoding);
+ if (new_input_cd == (iconv_t) -1)
+ goto invalid_encoding;
+ }
- if (SCM_CELL_WORD_0 (port) & SCM_WRTNG)
- {
- new_output_cd = iconv_open (encoding, "UTF-8");
- if (new_output_cd == (iconv_t) -1)
+ if (SCM_CELL_WORD_0 (port) & SCM_WRTNG)
{
- if (new_input_cd != (iconv_t) -1)
- iconv_close (new_input_cd);
- goto invalid_encoding;
+ new_output_cd = iconv_open (encoding, "UTF-8");
+ if (new_output_cd == (iconv_t) -1)
+ {
+ if (new_input_cd != (iconv_t) -1)
+ iconv_close (new_input_cd);
+ goto invalid_encoding;
+ }
}
}
diff --git a/libguile/print.c b/libguile/print.c
index 1399566..d18c054 100644
--- a/libguile/print.c
+++ b/libguile/print.c
@@ -821,33 +821,58 @@ codepoint_to_utf8 (scm_t_wchar ch, scm_t_uint8 utf8[4])
return len;
}
-/* Display the LEN codepoints in STR to PORT according to STRATEGY;
- return the number of codepoints successfully displayed. If NARROW_P,
- then STR is interpreted as a sequence of `char', denoting a Latin-1
- string; otherwise it's interpreted as a sequence of
- `scm_t_wchar'. */
-static size_t
-display_string (const void *str, int narrow_p,
- size_t len, SCM port,
- scm_t_string_failed_conversion_handler strategy)
-
-{
#define STR_REF(s, x) \
(narrow_p \
? (scm_t_wchar) ((unsigned char *) (s))[x] \
: ((scm_t_wchar *) (s))[x])
+/* Write STR to PORT as UTF-8. STR is a LEN-codepoint string; it is
+ narrow if NARROW_P is true, wide otherwise. Return LEN. */
+static size_t
+display_string_as_utf8 (const void *str, int narrow_p, size_t len,
+ SCM port)
+{
+ size_t printed = 0;
+
+ while (len > printed)
+ {
+ size_t utf8_len, i;
+ char *input, utf8_buf[256];
+
+ /* Convert STR to UTF-8. */
+ for (i = printed, utf8_len = 0, input = utf8_buf;
+ i < len && utf8_len + 4 < sizeof (utf8_buf);
+ i++)
+ {
+ utf8_len += codepoint_to_utf8 (STR_REF (str, i),
+ (scm_t_uint8 *) input);
+ input = utf8_buf + utf8_len;
+ }
+
+ /* INPUT was successfully converted, entirely; print the
+ result. */
+ scm_lfwrite (utf8_buf, utf8_len, port);
+ printed += i - printed;
+ }
+
+ assert (printed == len);
+
+ return len;
+}
+
+/* Convert STR through PORT's output conversion descriptor and write the
+ output to PORT. Return the number of codepoints written. */
+static size_t
+display_string_using_iconv (const void *str, int narrow_p, size_t len,
+ SCM port,
+ scm_t_string_failed_conversion_handler strategy)
+{
size_t printed;
scm_t_port *pt;
pt = SCM_PTAB_ENTRY (port);
- if (SCM_UNLIKELY (pt->output_cd == (iconv_t) -1))
- /* Initialize the conversion descriptors. */
- scm_i_set_port_encoding_x (port, pt->encoding);
-
printed = 0;
-
while (len > printed)
{
size_t done, utf8_len, input_left, output_left, i;
@@ -880,7 +905,7 @@ display_string (const void *str, int narrow_p,
if (SCM_UNLIKELY (done == (size_t) -1))
{
- int errno_save = errno;
+ int errno_save = errno;
/* Reset the `iconv' state. */
iconv (pt->output_cd, NULL, NULL, NULL, NULL);
@@ -928,7 +953,34 @@ display_string (const void *str, int narrow_p,
}
return printed;
+}
+
#undef STR_REF
+
+/* Display the LEN codepoints in STR to PORT according to STRATEGY;
+ return the number of codepoints successfully displayed. If NARROW_P,
+ then STR is interpreted as a sequence of `char', denoting a Latin-1
+ string; otherwise it's interpreted as a sequence of
+ `scm_t_wchar'. */
+static size_t
+display_string (const void *str, int narrow_p,
+ size_t len, SCM port,
+ scm_t_string_failed_conversion_handler strategy)
+
+{
+ scm_t_port *pt;
+
+ pt = SCM_PTAB_ENTRY (port);
+
+ if (pt->output_cd == (iconv_t) -1)
+ /* Initialize the conversion descriptors, if needed. */
+ scm_i_set_port_encoding_x (port, pt->encoding);
+
+ if (pt->output_cd == (iconv_t) -1)
+ return display_string_as_utf8 (str, narrow_p, len, port);
+ else
+ return display_string_using_iconv (str, narrow_p, len,
+ port, strategy);
}
/* Attempt to display CH to PORT according to STRATEGY. Return non-zero
diff --git a/test-suite/tests/ports.test b/test-suite/tests/ports.test
index 9d3000c..d5b1b60 100644
--- a/test-suite/tests/ports.test
+++ b/test-suite/tests/ports.test
@@ -391,7 +391,8 @@
(with-fluids ((%default-port-encoding e))
(call-with-output-string
(lambda (p)
- (display (port-encoding p) p)))))
+ (and (string=? e (port-encoding p))
+ (display (port-encoding p) p))))))
encodings)
encodings)))
@@ -462,6 +463,15 @@
(= (port-line p) 0)
(= (port-column p) 0))))
+ (pass-if "peek-char [utf-16]"
+ (let ((p (with-fluids ((%default-port-encoding "UTF-16BE"))
+ (open-input-string "안녕하세요"))))
+ (and (char=? (peek-char p) #\안)
+ (char=? (peek-char p) #\안)
+ (char=? (peek-char p) #\안)
+ (= (port-line p) 0)
+ (= (port-column p) 0))))
+
(pass-if "read-char, wrong encoding, error"
(let ((p (open-bytevector-input-port #vu8(255 65 66 67))))
(catch 'decoding-error
* Re: Reducing iconv-induced memory usage
2011-04-26 22:41 ` Ludovic Courtès
@ 2011-04-27 3:47 ` Mark H Weaver
2011-04-27 14:36 ` Ludovic Courtès
2011-05-05 16:19 ` Ludovic Courtès
0 siblings, 2 replies; 7+ messages in thread
From: Mark H Weaver @ 2011-04-27 3:47 UTC (permalink / raw)
To: Ludovic Courtès; +Cc: guile-devel
Hi Ludovic!
ludo@gnu.org (Ludovic Courtès) writes:
> So, here’s the patch.
>
> It also makes UTF-8 input ~30% faster according to ports.bm (which
> doesn’t benchmark output):
Thanks for working on this. I haven't yet had time to fully review this
patch, but here I will document the problems I see so far.
First of all, while looking at this patch, I've discovered another
problem in ports.c: scm_char_ready_p does not consider the possibility
of multibyte characters, and returns #t whenever there is at least one
byte ready.
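As a rough sketch of what a multibyte-aware readiness check could look like (this is not the code of scm_char_ready_p; both function names below are hypothetical), one can derive the expected sequence length from the lead byte and compare it with the number of bytes already buffered:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of bytes a UTF-8 sequence should have,
   judging from its lead byte alone.  Bytes that cannot start a
   sequence count as 1, so that the decoder gets to report them.  */
static size_t
utf8_expected_length (uint8_t lead)
{
  if (lead >= 0xc2 && lead <= 0xdf)
    return 2;
  if (lead >= 0xe0 && lead <= 0xef)
    return 3;
  if (lead >= 0xf0 && lead <= 0xf4)
    return 4;
  return 1;
}

/* Return non-zero if the AVAILABLE bytes buffered at BUF are enough to
   read one character without blocking.  */
static int
utf8_char_ready (const uint8_t *buf, size_t available)
{
  return available > 0 && available >= utf8_expected_length (buf[0]);
}
```

This check is conservative: an ill-formed sequence may be detectable before all the expected bytes have arrived, in which case the sketch reports "not ready" although a decoding error could already be raised.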
> -/* Read a codepoint from PORT and return it in *CODEPOINT. Fill BUF
> - with the byte representation of the codepoint in PORT's encoding, and
> - set *LEN to the length in bytes of that representation. Return 0 on
> - success and an errno value on error. */
> +/* Read a UTF-8 sequence from PORT. On success, return 0 and set
> + *CODEPOINT to the codepoint that was read, fill BUF with its UTF-8
> + representation, and set *LEN to the length in bytes. Return
> + `EILSEQ' on error. */
> static int
> -get_codepoint (SCM port, scm_t_wchar *codepoint,
> - char buf[SCM_MBCHAR_BUF_SIZE], size_t *len)
> +get_utf8_codepoint (SCM port, scm_t_wchar *codepoint,
> + scm_t_uint8 buf[SCM_MBCHAR_BUF_SIZE], size_t *len)
> +{
> + int byte;
> +
> + *len = 0;
> +
> + byte = scm_get_byte_or_eof (port);
> + if (byte == EOF)
> + {
> + *codepoint = EOF;
> + return 0;
> + }
> +
> + buf[0] = (scm_t_uint8) byte;
> + *len = 1;
> +
> + if (buf[0] <= 0x7f)
> + *codepoint = buf[0];
> + else if ((buf[0] & 0xe0) == 0xc0)
> + {
> + byte = scm_get_byte_or_eof (port);
> + if (byte == EOF || ((byte & 0xc0) != 0x80))
> + goto invalid_seq;
> +
> + buf[1] = (scm_t_uint8) byte;
> + *len = 2;
> +
> + *codepoint = ((scm_t_wchar) buf[0] & 0x1f) << 6UL
> + | (buf[1] & 0x3f);
> + }
The code here would be sufficient for UTF-8 that is known valid, but
when reading from a port we must check for ill-formed UTF-8.
Unicode requires that we reject as ill-formed any UTF-8 byte sequence in
non-shortest form. For example, we must reject the byte sequence
0xC1 0x80 which a permissive reader would read as 0x40, since obviously
that code point can be encoded as a single byte in UTF-8.
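A small sketch of that example, using hypothetical helper names that are not part of the patch: the first function applies only the bit-pattern checks and accepts 0xC1 0x80, while the second adds the lead-byte check that rejects it.

```c
#include <stdint.h>

/* Hypothetical helper: decode a 2-byte UTF-8 sequence using only the
   bit-pattern checks, i.e. without rejecting non-shortest forms.
   Return 0 on success, -1 otherwise.  */
static int
permissive_decode2 (const uint8_t buf[2], uint32_t *codepoint)
{
  if ((buf[0] & 0xe0) != 0xc0 || (buf[1] & 0xc0) != 0x80)
    return -1;

  *codepoint = ((uint32_t) (buf[0] & 0x1f) << 6) | (buf[1] & 0x3f);
  return 0;
}

/* Same, but also reject the lead bytes 0xC0 and 0xC1, which can only
   start non-shortest forms of U+0000..U+007F.  */
static int
strict_decode2 (const uint8_t buf[2], uint32_t *codepoint)
{
  if (buf[0] < 0xc2)
    return -1;

  return permissive_decode2 (buf, codepoint);
}
```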
We must also reject any UTF-8 byte sequence that corresponds to a
surrogate code point (U+D800..U+DFFF), or to a code point greater than
U+10FFFF.
Table 3-7 of the Unicode 6.0.0 standard, reproduced below, concisely
shows all well-formed UTF-8 byte sequences. The asterisks highlight
continuation bytes that are constrained to a smaller range than the
usual 80..BF.
code points        | byte[0] | byte[1] | byte[2] | byte[3]
-------------------+---------+---------+---------+--------
U+000000..U+00007F | 00..7F  |         |         |
U+000080..U+0007FF | C2..DF  | 80..BF  |         |
U+000800..U+000FFF | E0      | A0..BF* | 80..BF  |
U+001000..U+00CFFF | E1..EC  | 80..BF  | 80..BF  |
U+00D000..U+00D7FF | ED      | 80..9F* | 80..BF  |
U+00E000..U+00FFFF | EE..EF  | 80..BF  | 80..BF  |
U+010000..U+03FFFF | F0      | 90..BF* | 80..BF  | 80..BF
U+040000..U+0FFFFF | F1..F3  | 80..BF  | 80..BF  | 80..BF
U+100000..U+10FFFF | F4      | 80..8F* | 80..BF  | 80..BF
-------------------+---------+---------+---------+--------
So, for the code above corresponding to 2-byte sequences, it would
suffice to verify that buf[0] >= 0xC2. The 3- and 4-byte cases are
somewhat more constrained.
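To make the table concrete, here is a minimal sketch of a decoder that accepts exactly the well-formed sequences listed there. It works on an in-memory buffer rather than on a port, and the name `utf8_decode_strict' is an assumption for this example, not something from the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative only (hypothetical name, not Guile's API): decode the
   UTF-8 sequence at the start of S, which holds N bytes.  Return its
   length in bytes and store the code point in *CP, or return 0 if the
   bytes are ill-formed or truncated.  */
static size_t
utf8_decode_strict (const uint8_t *s, size_t n, uint32_t *cp)
{
  size_t len, i;
  uint8_t lo = 0x80, hi = 0xbf;   /* allowed range for byte[1] */
  uint32_t c;

  if (n == 0)
    return 0;

  if (s[0] <= 0x7f)
    {
      *cp = s[0];
      return 1;
    }
  else if (s[0] >= 0xc2 && s[0] <= 0xdf)
    len = 2, c = s[0] & 0x1f;
  else if (s[0] >= 0xe0 && s[0] <= 0xef)
    {
      len = 3, c = s[0] & 0x0f;
      if (s[0] == 0xe0)
        lo = 0xa0;              /* no non-shortest forms */
      else if (s[0] == 0xed)
        hi = 0x9f;              /* no surrogates */
    }
  else if (s[0] >= 0xf0 && s[0] <= 0xf4)
    {
      len = 4, c = s[0] & 0x07;
      if (s[0] == 0xf0)
        lo = 0x90;              /* no non-shortest forms */
      else if (s[0] == 0xf4)
        hi = 0x8f;              /* nothing above U+10FFFF */
    }
  else
    return 0;                   /* 80..C1 and F5..FF never start a sequence */

  /* byte[1] is the only continuation byte with a lead-dependent range.  */
  if (n < len || s[1] < lo || s[1] > hi)
    return 0;
  c = (c << 6) | (s[1] & 0x3f);

  for (i = 2; i < len; i++)
    {
      if ((s[i] & 0xc0) != 0x80)
        return 0;
      c = (c << 6) | (s[i] & 0x3f);
    }

  *cp = c;
  return len;
}
```

Only the second byte needs a lead-dependent range; every later continuation byte is checked against the usual 80..BF.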
> + else if ((buf[0] & 0xf0) == 0xe0)
> + {
> + byte = scm_get_byte_or_eof (port);
> + if (byte == EOF || ((byte & 0xc0) != 0x80))
> + goto invalid_seq;
> +
> + buf[1] = (scm_t_uint8) byte;
> + *len = 2;
> +
> + byte = scm_get_byte_or_eof (port);
> + if (byte == EOF || ((byte & 0xc0) != 0x80))
> + goto invalid_seq;
> +
> + buf[2] = (scm_t_uint8) byte;
> + *len = 3;
> +
> + *codepoint = ((scm_t_wchar) buf[0] & 0x0f) << 12UL
> + | ((scm_t_wchar) buf[1] & 0x3f) << 6UL
> + | (buf[2] & 0x3f);
> + }
> + else
> + {
That ^^^ should not simply be an "else". It must check that the first
byte is valid.
> + byte = scm_get_byte_or_eof (port);
> + if (byte == EOF || ((byte & 0xc0) != 0x80))
> + goto invalid_seq;
> +
> + buf[1] = (scm_t_uint8) byte;
> + *len = 2;
> +
> + byte = scm_get_byte_or_eof (port);
> + if (byte == EOF || ((byte & 0xc0) != 0x80))
> + goto invalid_seq;
> +
> + buf[2] = (scm_t_uint8) byte;
> + *len = 3;
> +
> + byte = scm_get_byte_or_eof (port);
> + if (byte == EOF || ((byte & 0xc0) != 0x80))
> + goto invalid_seq;
> +
> + buf[3] = (scm_t_uint8) byte;
> + *len = 4;
> +
> + *codepoint = ((scm_t_wchar) buf[0] & 0x07) << 18UL
> + | ((scm_t_wchar) buf[1] & 0x3f) << 12UL
> + | ((scm_t_wchar) buf[2] & 0x3f) << 6UL
> + | (buf[3] & 0x3f);
> + }
> +
> + return 0;
> +
> + invalid_seq:
> + /* Return the faulty byte. */
> + scm_unget_byte (byte, port);
This ungets only the last byte, but there may be up to 4 bytes to unget.
> +
> + return EILSEQ;
> +}
> +
> +/* Likewise, read a byte sequence from PORT, passing it through its
> + input conversion descriptor. */
> +static int
> +get_iconv_codepoint (SCM port, scm_t_wchar *codepoint,
> + char buf[SCM_MBCHAR_BUF_SIZE], size_t *len)
> {
> + scm_t_port *pt;
> int err, byte_read;
> size_t bytes_consumed, output_size;
> char *output;
> scm_t_uint8 utf8_buf[SCM_MBCHAR_BUF_SIZE];
> - scm_t_port *pt = SCM_PTAB_ENTRY (port);
>
> - if (SCM_UNLIKELY (pt->input_cd == (iconv_t) -1))
> - /* Initialize the conversion descriptors. */
> - scm_i_set_port_encoding_x (port, pt->encoding);
> + pt = SCM_PTAB_ENTRY (port);
>
> for (output_size = 0, output = (char *) utf8_buf,
> bytes_consumed = 0, err = 0;
> @@ -1174,10 +1265,44 @@ get_codepoint (SCM port, scm_t_wchar *codepoint,
> output_size = sizeof (utf8_buf) - output_left;
> }
>
> - if (SCM_UNLIKELY (err != 0))
> +
> + if (SCM_LIKELY (err == 0))
> + {
> + /* Convert the UTF8_BUF sequence to a Unicode code point. */
> + *codepoint = utf8_to_codepoint (utf8_buf, output_size);
> + *len = bytes_consumed;
> + }
> +
> + return err;
> +}
> +
> +/* Read a codepoint from PORT and return it in *CODEPOINT. Fill BUF
> + with the byte representation of the codepoint in PORT's encoding, and
> + set *LEN to the length in bytes of that representation. Return 0 on
> + success and an errno value on error. */
> +static int
> +get_codepoint (SCM port, scm_t_wchar *codepoint,
> + char buf[SCM_MBCHAR_BUF_SIZE], size_t *len)
> +{
> + int err;
> + scm_t_port *pt = SCM_PTAB_ENTRY (port);
> +
> + if (pt->input_cd == (iconv_t) -1)
> + /* Initialize the conversion descriptors, if needed. */
> + scm_i_set_port_encoding_x (port, pt->encoding);
> +
> + if (pt->input_cd == (iconv_t) -1)
> + err = get_utf8_codepoint (port, codepoint, (scm_t_uint8 *) buf, len);
> + else
> + err = get_iconv_codepoint (port, codepoint, buf, len);
From the code above, it appears that for UTF-8 ports,
scm_i_set_port_encoding_x will necessarily be called once per character
read. This seems rather inefficient. Also, if we wish to support
Latin-1 without iconv as well, the simple method above will not work.
I would recommend adding an enum field to the port which for now only
has two encoding schemes: ICONV or UTF8. Later, we could add LATIN1 and
maybe ASCII as well. Given that this check must be done once per
character, it seems better to do a switch on an enum than to strcmp with
pt->encoding (as is done in scm_i_set_port_encoding_x).
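A rough sketch of that idea follows. All names here are hypothetical, since no such field exists in the port structure discussed in this thread; the reader is represented by its name as a string only to keep the example self-contained.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical names: the actual field and constants would be chosen
   when the port structure can be changed.  */
enum port_encoding_mode
{
  PORT_ENCODING_MODE_UTF8,
  PORT_ENCODING_MODE_ICONV
};

/* Classify ENCODING once, at the time it is set on the port, so that
   the string comparison is not repeated for every character.  */
static enum port_encoding_mode
classify_encoding (const char *encoding)
{
  return strcmp (encoding, "UTF-8") == 0
    ? PORT_ENCODING_MODE_UTF8
    : PORT_ENCODING_MODE_ICONV;
}

/* Per-character dispatch then reduces to a switch on the cached mode.
   The reader is named by a string here purely for illustration.  */
static const char *
reader_for_mode (enum port_encoding_mode mode)
{
  switch (mode)
    {
    case PORT_ENCODING_MODE_UTF8:
      return "get_utf8_codepoint";
    case PORT_ENCODING_MODE_ICONV:
      return "get_iconv_codepoint";
    }
  return NULL;
}
```

Adding a LATIN1 or ASCII mode later would then only mean one more enumerator and one more case.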
Thanks again for working on this :)
Mark
* Re: Reducing iconv-induced memory usage
2011-04-27 3:47 ` Mark H Weaver
@ 2011-04-27 14:36 ` Ludovic Courtès
2011-05-05 16:19 ` Ludovic Courtès
1 sibling, 0 replies; 7+ messages in thread
From: Ludovic Courtès @ 2011-04-27 14:36 UTC (permalink / raw)
To: Mark H Weaver; +Cc: guile-devel
Hi Mark,
Mark H Weaver <mhw@netris.org> writes:
> ludo@gnu.org (Ludovic Courtès) writes:
>> So, here’s the patch.
>>
>> It also makes UTF-8 input ~30% faster according to ports.bm (which
>> doesn’t benchmark output):
>
> Thanks for working on this. I haven't yet had time to fully review this
> patch, but here I will document the problems I see so far.
Thanks for the review!
> First of all, while looking at this patch, I've discovered another
> problem in ports.c: scm_char_ready_p does not consider the possibility
> of multibyte characters, and returns #t whenever there is at least one
> byte ready.
Indeed; let’s discuss it separately.
> Unicode requires that we reject as ill-formed any UTF-8 byte sequence in
> non-shortest form. For example, we must reject the byte sequence
> 0xC1 0x80 which a permissive reader would read as 0x40, since obviously
> that code point can be encoded as a single byte in UTF-8.
>
> We must also reject any UTF-8 byte sequence that corresponds to a
> surrogate code point (U+D800..U+DFFF), or to a code point greater than
> U+10FFFF.
>
> Table 3.7 of the Unicode 6.0.0 standard, reproduced below, concisely
> shows all well-formed UTF-8 byte sequences. The asterisks highlight
> continuation bytes that are constrained to a smaller range than the
> usual 80..BF.
>
> code points byte[0] byte[1] byte[2] byte[3]
> ---------------------------------------------------------
> U+000000..U+00007F | 00..7F | | | |
> U+000080..U+0007FF | C2..DF | 80..BF | | |
> U+000800..U+000FFF | E0 | A0..BF* | 80..BF | |
> U+001000..U+00CFFF | E1..EC | 80..BF | 80..BF | |
> U+00D000..U+00D7FF | ED | 80..9F* | 80..BF | |
> U+00E000..U+00FFFF | EE..EF | 80..BF | 80..BF | |
> U+010000..U+03FFFF | F0 | 90..BF* | 80..BF | 80..BF |
> U+040000..U+0FFFFF | F1..F3 | 80..BF | 80..BF | 80..BF |
> U+100000..U+10FFFF | F4 | 80..8F* | 80..BF | 80..BF |
> ---------------------------------------------------------
>
> So, for the code above corresponding to 2-byte sequences, it would
> suffice to verify that buf[0] >= 0xC2. The 3- and 4-byte cases are
> somewhat more constrained.
Indeed, thanks for educating me. ;-)
Could you add UTF-8 tests for such cases using the just-committed
‘test-decoding-error’ in ports.test?
>> + else if ((buf[0] & 0xf0) == 0xe0)
>> + {
>> + byte = scm_get_byte_or_eof (port);
>> + if (byte == EOF || ((byte & 0xc0) != 0x80))
>> + goto invalid_seq;
>> +
>> + buf[1] = (scm_t_uint8) byte;
>> + *len = 2;
>> +
>> + byte = scm_get_byte_or_eof (port);
>> + if (byte == EOF || ((byte & 0xc0) != 0x80))
>> + goto invalid_seq;
>> +
>> + buf[2] = (scm_t_uint8) byte;
>> + *len = 3;
>> +
>> + *codepoint = ((scm_t_wchar) buf[0] & 0x0f) << 12UL
>> + | ((scm_t_wchar) buf[1] & 0x3f) << 6UL
>> + | (buf[2] & 0x3f);
>> + }
>> + else
>> + {
>
> That ^^^ should not simply be an "else". It must check that the first
> byte is valid.
Right.
>> + invalid_seq:
>> + /* Return the faulty byte. */
>> + scm_unget_byte (byte, port);
>
> This ungets only the last byte, but there may be up to 4 bytes to unget.
No, that’s done in ‘peek-char’.
>> +/* Read a codepoint from PORT and return it in *CODEPOINT. Fill BUF
>> + with the byte representation of the codepoint in PORT's encoding, and
>> + set *LEN to the length in bytes of that representation. Return 0 on
>> + success and an errno value on error. */
>> +static int
>> +get_codepoint (SCM port, scm_t_wchar *codepoint,
>> + char buf[SCM_MBCHAR_BUF_SIZE], size_t *len)
>> +{
>> + int err;
>> + scm_t_port *pt = SCM_PTAB_ENTRY (port);
>> +
>> + if (pt->input_cd == (iconv_t) -1)
>> + /* Initialize the conversion descriptors, if needed. */
>> + scm_i_set_port_encoding_x (port, pt->encoding);
>> +
>> + if (pt->input_cd == (iconv_t) -1)
>> + err = get_utf8_codepoint (port, codepoint, (scm_t_uint8 *) buf, len);
>> + else
>> + err = get_iconv_codepoint (port, codepoint, buf, len);
>
> From the code above, it appears that for UTF-8 ports,
> scm_i_set_port_encoding_x will necessarily be called once per character
> read. This seems rather inefficient.
Correct. Alas, I don’t know how to avoid this inefficiency in 2.0: we
can’t just add a flag to ‘scm_t_port’, since that would break the ABI.
Ideas?
Besides, however inefficient it may seem, it’s still more efficient than
what we currently have, as I explained.
> Also, if we wish to support Latin-1 without iconv as well, the simple
> method above will not work.
Why would we want such a thing? :-)
The starting point for this patch was the observation that our Unicode
I/O converts to/from UTF-8, and then from UTF-8 to our internal
representation, and that it’s wasteful to use iconv to convert from
UTF-8 to UTF-8 when reading from/writing to a UTF-8 port.
> I would recommend adding an enum field to the port which for now only
> has two encoding schemes: ICONV or UTF8. Later, we could add LATIN1 and
> maybe ASCII as well. Given that this check must be done once per
> character, it seems better to do a switch on an enum than to strcmp with
> pt->encoding (as is done in scm_i_set_port_encoding_x).
Agreed; maybe something for ‘master’ once this version is in 2.0?
Thanks,
Ludo’.
* Re: Reducing iconv-induced memory usage
2011-04-27 3:47 ` Mark H Weaver
2011-04-27 14:36 ` Ludovic Courtès
@ 2011-05-05 16:19 ` Ludovic Courtès
2011-05-06 16:19 ` Ludovic Courtès
1 sibling, 1 reply; 7+ messages in thread
From: Ludovic Courtès @ 2011-05-05 16:19 UTC (permalink / raw)
To: guile-devel
[-- Attachment #1: Type: text/plain, Size: 240 bytes --]
Hello!
Here’s an updated patch that strictly checks for ill-formed UTF-8
sequences, as Mark pointed out. It passes all the tests I recently
added to ports.test.
I’d like to commit it soon, when Mark approves. :-)
Thanks,
Ludo’.
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: the patch --]
[-- Type: text/x-patch, Size: 12009 bytes --]
diff --git a/libguile/ports.c b/libguile/ports.c
index b5ad95e..2482a24 100644
--- a/libguile/ports.c
+++ b/libguile/ports.c
@@ -1057,6 +1057,7 @@ update_port_lf (scm_t_wchar c, SCM port)
switch (c)
{
case '\a':
+ case EOF:
break;
case '\b':
SCM_DECCOL (port);
@@ -1115,23 +1116,162 @@ utf8_to_codepoint (const scm_t_uint8 *utf8_buf, size_t size)
return codepoint;
}
-/* Read a codepoint from PORT and return it in *CODEPOINT. Fill BUF
- with the byte representation of the codepoint in PORT's encoding, and
- set *LEN to the length in bytes of that representation. Return 0 on
- success and an errno value on error. */
+/* Read a UTF-8 sequence from PORT. On success, return 0 and set
+ *CODEPOINT to the codepoint that was read, fill BUF with its UTF-8
+ representation, and set *LEN to the length in bytes. Return
+ `EILSEQ' on error. */
static int
-get_codepoint (SCM port, scm_t_wchar *codepoint,
- char buf[SCM_MBCHAR_BUF_SIZE], size_t *len)
+get_utf8_codepoint (SCM port, scm_t_wchar *codepoint,
+ scm_t_uint8 buf[SCM_MBCHAR_BUF_SIZE], size_t *len)
{
+#define ASSERT_NOT_EOF(b) \
+ if (SCM_UNLIKELY ((b) == EOF)) \
+ goto invalid_seq
+
+ int byte;
+
+ *len = 0;
+
+ byte = scm_get_byte_or_eof (port);
+ if (byte == EOF)
+ {
+ *codepoint = EOF;
+ return 0;
+ }
+
+ buf[0] = (scm_t_uint8) byte;
+ *len = 1;
+
+ if (buf[0] <= 0x7f)
+ /* 1-byte form. */
+ *codepoint = buf[0];
+ else if (buf[0] >= 0xc2 && buf[0] <= 0xdf)
+ {
+ /* 2-byte form. */
+ byte = scm_get_byte_or_eof (port);
+ ASSERT_NOT_EOF (byte);
+
+ buf[1] = (scm_t_uint8) byte;
+ *len = 2;
+
+ if (SCM_UNLIKELY ((byte & 0xc0) != 0x80))
+ goto invalid_seq;
+
+ *codepoint = ((scm_t_wchar) buf[0] & 0x1f) << 6UL
+ | (buf[1] & 0x3f);
+ }
+ else if ((buf[0] & 0xf0) == 0xe0)
+ {
+ /* 3-byte form. */
+ byte = scm_get_byte_or_eof (port);
+ if (SCM_UNLIKELY (byte == EOF))
+ goto invalid_seq;
+
+ buf[1] = (scm_t_uint8) byte;
+ *len = 2;
+
+ if (SCM_UNLIKELY ((byte & 0xc0) != 0x80
+ || (buf[0] == 0xe0 && byte < 0xa0)
+ || (buf[0] == 0xed && byte > 0x9f)))
+ {
+ /* Swallow the 3rd byte. */
+ byte = scm_get_byte_or_eof (port);
+ ASSERT_NOT_EOF (byte);
+ *len = 3, buf[2] = byte;
+ goto invalid_seq;
+ }
+
+
+ byte = scm_get_byte_or_eof (port);
+ ASSERT_NOT_EOF (byte);
+
+ buf[2] = (scm_t_uint8) byte;
+ *len = 3;
+
+ if (SCM_UNLIKELY ((byte & 0xc0) != 0x80))
+ goto invalid_seq;
+
+ *codepoint = ((scm_t_wchar) buf[0] & 0x0f) << 12UL
+ | ((scm_t_wchar) buf[1] & 0x3f) << 6UL
+ | (buf[2] & 0x3f);
+ }
+ else if (buf[0] >= 0xf0 && buf[0] <= 0xf4)
+ {
+ /* 4-byte form. */
+ byte = scm_get_byte_or_eof (port);
+ ASSERT_NOT_EOF (byte);
+
+ buf[1] = (scm_t_uint8) byte;
+ *len = 2;
+
+ if (SCM_UNLIKELY (((byte & 0xc0) != 0x80)
+ || (buf[0] == 0xf0 && byte < 0x90)
+ || (buf[0] == 0xf4 && byte > 0x8f)))
+ {
+ /* Swallow the 3rd and 4th bytes. */
+ byte = scm_get_byte_or_eof (port);
+ ASSERT_NOT_EOF (byte);
+ *len = 3, buf[2] = byte;
+
+ byte = scm_get_byte_or_eof (port);
+ ASSERT_NOT_EOF (byte);
+ *len = 4, buf[3] = byte;
+ goto invalid_seq;
+ }
+
+ byte = scm_get_byte_or_eof (port);
+ ASSERT_NOT_EOF (byte);
+
+ buf[2] = (scm_t_uint8) byte;
+ *len = 3;
+
+ if (SCM_UNLIKELY ((byte & 0xc0) != 0x80))
+ {
+ /* Swallow the 4th byte. */
+ byte = scm_get_byte_or_eof (port);
+ ASSERT_NOT_EOF (byte);
+ *len = 4, buf[3] = byte;
+ goto invalid_seq;
+ }
+
+ byte = scm_get_byte_or_eof (port);
+ ASSERT_NOT_EOF (byte);
+
+ buf[3] = (scm_t_uint8) byte;
+ *len = 4;
+
+ if (SCM_UNLIKELY ((byte & 0xc0) != 0x80))
+ goto invalid_seq;
+
+ *codepoint = ((scm_t_wchar) buf[0] & 0x07) << 18UL
+ | ((scm_t_wchar) buf[1] & 0x3f) << 12UL
+ | ((scm_t_wchar) buf[2] & 0x3f) << 6UL
+ | (buf[3] & 0x3f);
+ }
+ else
+ goto invalid_seq;
+
+ return 0;
+
+ invalid_seq:
+ return EILSEQ;
+
+#undef ASSERT_NOT_EOF
+}
+
+/* Likewise, read a byte sequence from PORT, passing it through its
+ input conversion descriptor. */
+static int
+get_iconv_codepoint (SCM port, scm_t_wchar *codepoint,
+ char buf[SCM_MBCHAR_BUF_SIZE], size_t *len)
+{
+ scm_t_port *pt;
int err, byte_read;
size_t bytes_consumed, output_size;
char *output;
scm_t_uint8 utf8_buf[SCM_MBCHAR_BUF_SIZE];
- scm_t_port *pt = SCM_PTAB_ENTRY (port);
- if (SCM_UNLIKELY (pt->input_cd == (iconv_t) -1))
- /* Initialize the conversion descriptors. */
- scm_i_set_port_encoding_x (port, pt->encoding);
+ pt = SCM_PTAB_ENTRY (port);
for (output_size = 0, output = (char *) utf8_buf,
bytes_consumed = 0, err = 0;
@@ -1177,31 +1317,45 @@ get_codepoint (SCM port, scm_t_wchar *codepoint,
if (SCM_UNLIKELY (output_size == 0))
/* An unterminated sequence. */
err = EILSEQ;
-
- if (SCM_UNLIKELY (err != 0))
+ else if (SCM_LIKELY (err == 0))
{
- /* Reset the `iconv' state. */
- iconv (pt->input_cd, NULL, NULL, NULL, NULL);
+ /* Convert the UTF8_BUF sequence to a Unicode code point. */
+ *codepoint = utf8_to_codepoint (utf8_buf, output_size);
+ *len = bytes_consumed;
+ }
- if (pt->ilseq_handler == SCM_ICONVEH_QUESTION_MARK)
- {
- *codepoint = '?';
- err = 0;
- }
+ return err;
+}
- /* Fail when the strategy is SCM_ICONVEH_ERROR or
- SCM_ICONVEH_ESCAPE_SEQUENCE (the latter doesn't make sense for
- input encoding errors.) */
- }
+/* Read a codepoint from PORT and return it in *CODEPOINT. Fill BUF
+ with the byte representation of the codepoint in PORT's encoding, and
+ set *LEN to the length in bytes of that representation. Return 0 on
+ success and an errno value on error. */
+static int
+get_codepoint (SCM port, scm_t_wchar *codepoint,
+ char buf[SCM_MBCHAR_BUF_SIZE], size_t *len)
+{
+ int err;
+ scm_t_port *pt = SCM_PTAB_ENTRY (port);
+
+ if (pt->input_cd == (iconv_t) -1)
+ /* Initialize the conversion descriptors, if needed. */
+ scm_i_set_port_encoding_x (port, pt->encoding);
+
+ if (pt->input_cd == (iconv_t) -1)
+ err = get_utf8_codepoint (port, codepoint, (scm_t_uint8 *) buf, len);
else
+ err = get_iconv_codepoint (port, codepoint, buf, len);
+
+ if (SCM_LIKELY (err == 0))
+ update_port_lf (*codepoint, port);
+ else if (pt->ilseq_handler == SCM_ICONVEH_QUESTION_MARK)
{
- /* Convert the UTF8_BUF sequence to a Unicode code point. */
- *codepoint = utf8_to_codepoint (utf8_buf, output_size);
+ *codepoint = '?';
+ err = 0;
update_port_lf (*codepoint, port);
}
- *len = bytes_consumed;
-
return err;
}
@@ -2031,28 +2185,35 @@ scm_i_set_port_encoding_x (SCM port, const char *encoding)
if (encoding == NULL)
encoding = "ISO-8859-1";
- pt->encoding = scm_gc_strdup (encoding, "port");
+ if (pt->encoding != encoding)
+ pt->encoding = scm_gc_strdup (encoding, "port");
- if (SCM_CELL_WORD_0 (port) & SCM_RDNG)
+ /* If ENCODING is UTF-8, then no conversion descriptor is opened
+ because we do I/O ourselves. This saves 100+ KiB for each
+ descriptor. */
+ if (strcmp (encoding, "UTF-8"))
{
- /* Open an input iconv conversion descriptor, from ENCODING
- to UTF-8. We choose UTF-8, not UTF-32, because iconv
- implementations can typically convert from anything to
- UTF-8, but not to UTF-32 (see
- <http://lists.gnu.org/archive/html/bug-libunistring/2010-09/msg00007.html>). */
- new_input_cd = iconv_open ("UTF-8", encoding);
- if (new_input_cd == (iconv_t) -1)
- goto invalid_encoding;
- }
+ if (SCM_CELL_WORD_0 (port) & SCM_RDNG)
+ {
+ /* Open an input iconv conversion descriptor, from ENCODING
+ to UTF-8. We choose UTF-8, not UTF-32, because iconv
+ implementations can typically convert from anything to
+ UTF-8, but not to UTF-32 (see
+ <http://lists.gnu.org/archive/html/bug-libunistring/2010-09/msg00007.html>). */
+ new_input_cd = iconv_open ("UTF-8", encoding);
+ if (new_input_cd == (iconv_t) -1)
+ goto invalid_encoding;
+ }
- if (SCM_CELL_WORD_0 (port) & SCM_WRTNG)
- {
- new_output_cd = iconv_open (encoding, "UTF-8");
- if (new_output_cd == (iconv_t) -1)
+ if (SCM_CELL_WORD_0 (port) & SCM_WRTNG)
{
- if (new_input_cd != (iconv_t) -1)
- iconv_close (new_input_cd);
- goto invalid_encoding;
+ new_output_cd = iconv_open (encoding, "UTF-8");
+ if (new_output_cd == (iconv_t) -1)
+ {
+ if (new_input_cd != (iconv_t) -1)
+ iconv_close (new_input_cd);
+ goto invalid_encoding;
+ }
}
}
diff --git a/libguile/print.c b/libguile/print.c
index 1399566..d5c015b 100644
--- a/libguile/print.c
+++ b/libguile/print.c
@@ -821,31 +821,57 @@ codepoint_to_utf8 (scm_t_wchar ch, scm_t_uint8 utf8[4])
return len;
}
-/* Display the LEN codepoints in STR to PORT according to STRATEGY;
- return the number of codepoints successfully displayed. If NARROW_P,
- then STR is interpreted as a sequence of `char', denoting a Latin-1
- string; otherwise it's interpreted as a sequence of
- `scm_t_wchar'. */
-static size_t
-display_string (const void *str, int narrow_p,
- size_t len, SCM port,
- scm_t_string_failed_conversion_handler strategy)
-
-{
#define STR_REF(s, x) \
(narrow_p \
? (scm_t_wchar) ((unsigned char *) (s))[x] \
: ((scm_t_wchar *) (s))[x])
+/* Write STR to PORT as UTF-8. STR is a LEN-codepoint string; it is
+ narrow if NARROW_P is true, wide otherwise. Return LEN. */
+static size_t
+display_string_as_utf8 (const void *str, int narrow_p, size_t len,
+ SCM port)
+{
+ size_t printed = 0;
+
+ while (len > printed)
+ {
+ size_t utf8_len, i;
+ char *input, utf8_buf[256];
+
+ /* Convert STR to UTF-8. */
+ for (i = printed, utf8_len = 0, input = utf8_buf;
+ i < len && utf8_len + 4 < sizeof (utf8_buf);
+ i++)
+ {
+ utf8_len += codepoint_to_utf8 (STR_REF (str, i),
+ (scm_t_uint8 *) input);
+ input = utf8_buf + utf8_len;
+ }
+
+ /* INPUT was successfully converted, entirely; print the
+ result. */
+ scm_lfwrite (utf8_buf, utf8_len, port);
+ printed += i - printed;
+ }
+
+ assert (printed == len);
+
+ return len;
+}
+
+/* Convert STR through PORT's output conversion descriptor and write the
+ output to PORT. Return the number of codepoints written. */
+static size_t
+display_string_using_iconv (const void *str, int narrow_p, size_t len,
+ SCM port,
+ scm_t_string_failed_conversion_handler strategy)
+{
size_t printed;
scm_t_port *pt;
pt = SCM_PTAB_ENTRY (port);
- if (SCM_UNLIKELY (pt->output_cd == (iconv_t) -1))
- /* Initialize the conversion descriptors. */
- scm_i_set_port_encoding_x (port, pt->encoding);
-
printed = 0;
while (len > printed)
@@ -928,7 +954,34 @@ display_string (const void *str, int narrow_p,
}
return printed;
+}
+
#undef STR_REF
+
+/* Display the LEN codepoints in STR to PORT according to STRATEGY;
+ return the number of codepoints successfully displayed. If NARROW_P,
+ then STR is interpreted as a sequence of `char', denoting a Latin-1
+ string; otherwise it's interpreted as a sequence of
+ `scm_t_wchar'. */
+static size_t
+display_string (const void *str, int narrow_p,
+ size_t len, SCM port,
+ scm_t_string_failed_conversion_handler strategy)
+
+{
+ scm_t_port *pt;
+
+ pt = SCM_PTAB_ENTRY (port);
+
+ if (pt->output_cd == (iconv_t) -1)
+ /* Initialize the conversion descriptors, if needed. */
+ scm_i_set_port_encoding_x (port, pt->encoding);
+
+ if (pt->output_cd == (iconv_t) -1)
+ return display_string_as_utf8 (str, narrow_p, len, port);
+ else
+ return display_string_using_iconv (str, narrow_p, len,
+ port, strategy);
}
/* Attempt to display CH to PORT according to STRATEGY. Return non-zero
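
The output half of the patch (display_string_as_utf8) converts code points into a small stack buffer and flushes it chunk by chunk, with no conversion descriptor involved. The following is a minimal standalone sketch of that idea, not the Guile code itself: encode_utf8, write_as_utf8, sink_fn, struct byte_buffer and append_bytes are hypothetical names, the sink callback stands in for scm_lfwrite, and the input is assumed to be an array of valid Unicode scalar values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Encode CH as UTF-8 into OUT and return the number of bytes written.
   Assumption: CH is a valid scalar value (not a surrogate, <= 0x10FFFF).  */
static size_t
encode_utf8 (uint32_t ch, unsigned char out[4])
{
  if (ch <= 0x7f)
    {
      out[0] = (unsigned char) ch;
      return 1;
    }
  if (ch <= 0x7ff)
    {
      out[0] = (unsigned char) (0xc0 | (ch >> 6));
      out[1] = (unsigned char) (0x80 | (ch & 0x3f));
      return 2;
    }
  if (ch <= 0xffff)
    {
      out[0] = (unsigned char) (0xe0 | (ch >> 12));
      out[1] = (unsigned char) (0x80 | ((ch >> 6) & 0x3f));
      out[2] = (unsigned char) (0x80 | (ch & 0x3f));
      return 3;
    }
  out[0] = (unsigned char) (0xf0 | (ch >> 18));
  out[1] = (unsigned char) (0x80 | ((ch >> 12) & 0x3f));
  out[2] = (unsigned char) (0x80 | ((ch >> 6) & 0x3f));
  out[3] = (unsigned char) (0x80 | (ch & 0x3f));
  return 4;
}

/* Hypothetical byte sink; it plays the role of scm_lfwrite here.  */
typedef void (*sink_fn) (const unsigned char *bytes, size_t len, void *data);

/* A fixed-size collecting sink, only for demonstration.  */
struct byte_buffer
{
  unsigned char bytes[1024];
  size_t len;
};

static void
append_bytes (const unsigned char *bytes, size_t len, void *data)
{
  struct byte_buffer *buf = data;

  assert (buf->len + len <= sizeof buf->bytes);
  memcpy (buf->bytes + buf->len, bytes, len);
  buf->len += len;
}

/* Write the LEN code points of STR to SINK as UTF-8, one chunk at a
   time.  Return LEN.  */
static size_t
write_as_utf8 (const uint32_t *str, size_t len, sink_fn sink, void *data)
{
  size_t printed = 0;

  while (printed < len)
    {
      unsigned char chunk[256];
      size_t used = 0, i;

      /* Stop while a worst-case 4-byte sequence still fits.  */
      for (i = printed; i < len && used + 4 <= sizeof chunk; i++)
        used += encode_utf8 (str[i], chunk + used);

      assert (used > 0);        /* each pass makes progress */
      sink (chunk, used, data);
      printed = i;
    }

  return len;
}
```

Because every chunk starts empty and 4 bytes always fit, each pass encodes at least one code point, so the loop terminates; a caller would pass its own sink in place of append_bytes.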
* Re: Reducing iconv-induced memory usage
2011-05-05 16:19 ` Ludovic Courtès
@ 2011-05-06 16:19 ` Ludovic Courtès
2011-05-07 20:51 ` Ludovic Courtès
0 siblings, 1 reply; 7+ messages in thread
From: Ludovic Courtès @ 2011-05-06 16:19 UTC (permalink / raw)
To: guile-devel
Hello,
ludo@gnu.org (Ludovic Courtès) writes:
> Here’s an updated patch that strictly checks for ill-formed UTF-8
> sequences, as Mark pointed out. It passes all the tests I recently
> added to ports.test.
I committed it, though Mark rightfully noted on IRC a non-conformance
issue. I’ve added a FIXME and started looking into it.
Thanks, Mark, for the detailed review!
Thanks,
Ludo’.
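
For readers following the thread, the strict check discussed above can be sketched outside of the port machinery. This is an illustration under stated assumptions, not the committed code: decode_utf8_strict is a hypothetical helper that reads from a memory buffer rather than a port, and it only reports success or failure; it does not model how many bytes the port code consumes after an error, which is a separate question. The second-byte ranges are the same ones tested in get_utf8_codepoint in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from the N bytes at S.  On success, store
   the code point in *CP and return the sequence length (1 to 4).
   Return 0 if the sequence is ill-formed or truncated.  */
static size_t
decode_utf8_strict (const unsigned char *s, size_t n, uint32_t *cp)
{
  /* Allowed range for the second byte; narrowed for some lead bytes.  */
  unsigned char lo = 0x80, hi = 0xbf;
  size_t len, i;
  uint32_t c;

  assert (s != NULL && cp != NULL);

  if (n == 0)
    return 0;

  if (s[0] <= 0x7f)
    {
      *cp = s[0];
      return 1;
    }
  else if (s[0] >= 0xc2 && s[0] <= 0xdf)
    {
      len = 2;
      c = s[0] & 0x1f;
    }
  else if (s[0] >= 0xe0 && s[0] <= 0xef)
    {
      len = 3;
      c = s[0] & 0x0f;
      if (s[0] == 0xe0)
        lo = 0xa0;              /* rejects overlong 3-byte forms */
      if (s[0] == 0xed)
        hi = 0x9f;              /* rejects surrogates U+D800..U+DFFF */
    }
  else if (s[0] >= 0xf0 && s[0] <= 0xf4)
    {
      len = 4;
      c = s[0] & 0x07;
      if (s[0] == 0xf0)
        lo = 0x90;              /* rejects overlong 4-byte forms */
      if (s[0] == 0xf4)
        hi = 0x8f;              /* rejects values above U+10FFFF */
    }
  else
    /* 0x80..0xc1 and 0xf5..0xff never start a sequence.  */
    return 0;

  if (n < len)
    return 0;

  if (s[1] < lo || s[1] > hi)
    return 0;
  c = (c << 6) | (s[1] & 0x3f);

  /* Remaining bytes must be plain continuation bytes.  */
  for (i = 2; i < len; i++)
    {
      if ((s[i] & 0xc0) != 0x80)
        return 0;
      c = (c << 6) | (s[i] & 0x3f);
    }

  *cp = c;
  return len;
}
```

With this helper, the bytes E2 82 AC decode to U+20AC, while C0 80 (overlong), ED A0 80 (a surrogate) and F4 90 80 80 (beyond U+10FFFF) are all rejected.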
* Re: Reducing iconv-induced memory usage
2011-05-06 16:19 ` Ludovic Courtès
@ 2011-05-07 20:51 ` Ludovic Courtès
0 siblings, 0 replies; 7+ messages in thread
From: Ludovic Courtès @ 2011-05-07 20:51 UTC (permalink / raw)
To: guile-devel
Hello!
ludo@gnu.org (Ludovic Courtès) writes:
> Hello,
>
> ludo@gnu.org (Ludovic Courtès) writes:
>
>> Here’s an updated patch that strictly checks for ill-formed UTF-8
>> sequences, as Mark pointed out. It passes all the tests I recently
>> added to ports.test.
>
> I committed it, though Mark rightfully noted on IRC a non-conformance
> issue. I’ve added a FIXME and started looking into it.
Commit 7be1705dbda377780335ecbcbfce04de523f2671 fixes it, AFAICS.
Mark, please let me know if you spot other errors!
Thanks,
Ludo’.
Thread overview: 7+ messages
2011-04-26 21:10 Reducing iconv-induced memory usage Ludovic Courtès
2011-04-26 22:41 ` Ludovic Courtès
2011-04-27 3:47 ` Mark H Weaver
2011-04-27 14:36 ` Ludovic Courtès
2011-05-05 16:19 ` Ludovic Courtès
2011-05-06 16:19 ` Ludovic Courtès
2011-05-07 20:51 ` Ludovic Courtès