* Unicode I/O
From: Ludovic Courtès @ 2010-09-18 21:50 UTC
To: guile-devel
Hello,
Guile currently uses libunistring’s ‘u32_conv_from_encoding’ when
reading text from an input port whose encoding isn’t Latin-1 (and
similarly when writing to output ports).
An issue with that is that escaping of non-representable characters is
handled by libunistring, with a syntax different from the one we’d like
(Guile or R6RS string escapes). So
‘scm_i_unistring_escapes_to_{guile,r6rs}_escapes’ attempt, in a kludgy
way, to substitute the right escapes.
The problems with this approach are discussed in the thread at:
http://lists.gnu.org/archive/html/bug-libunistring/2010-09/msg00004.html
The conclusion is that we’d better use raw ‘iconv’ calls in such
cases...
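To make the escaping problem concrete, here is a minimal sketch in Python (not Guile's C code, and not libunistring's API) of the behaviour we'd like: when the target encoding cannot represent a character, the encoder itself emits an R6RS-style ‘\x<hex>;’ escape, rather than having another layer rewrite a foreign escape syntax afterwards. The handler name ‘r6rs-escape’ and the helper ‘encode_with_escapes’ are made up for this illustration.

```python
import codecs


def r6rs_escape(exc):
    """Error handler: replace unencodable characters with \\x<hex>; escapes."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    # exc.object[exc.start:exc.end] is the run of characters that the
    # target encoding could not represent.
    bad = exc.object[exc.start:exc.end]
    replacement = "".join(f"\\x{ord(ch):x};" for ch in bad)
    # Resume encoding right after the offending run.
    return replacement, exc.end


# Hypothetical handler name, registered once for the whole process.
codecs.register_error("r6rs-escape", r6rs_escape)


def encode_with_escapes(text, encoding):
    """Encode TEXT, escaping whatever ENCODING cannot represent."""
    return text.encode(encoding, errors="r6rs-escape")
```

With this, encoding "λ" (U+03BB) to Latin-1 yields the six bytes ‘\x3bb;’, while "é" passes through as the single byte 0xE9 because Latin-1 can represent it.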
Ludo’.
* Re: Unicode I/O
From: Andy Wingo @ 2010-09-19 10:27 UTC
To: Ludovic Courtès; +Cc: guile-devel
On Sat 18 Sep 2010 23:50, ludo@gnu.org (Ludovic Courtès) writes:
> http://lists.gnu.org/archive/html/bug-libunistring/2010-09/msg00004.html
>
> The conclusion is that we’d better use raw ‘iconv’ calls in such
> cases...
Boo. I guess this is a 2.0 blocker. Bruno's strategy appears (as usual)
to be the right one...
Andy
--
http://wingolog.org/
* Re: Unicode I/O
From: Ludovic Courtès @ 2010-09-19 20:41 UTC
To: guile-devel
Hi,
Andy Wingo <wingo@pobox.com> writes:
> On Sat 18 Sep 2010 23:50, ludo@gnu.org (Ludovic Courtès) writes:
>
>> http://lists.gnu.org/archive/html/bug-libunistring/2010-09/msg00004.html
>>
>> The conclusion is that we’d better use raw ‘iconv’ calls in such
>> cases...
>
> Boo. I guess this is a 2.0 blocker. Bruno's strategy appears (as usual)
> to be the right one...
Yes to both...
Ludo’.
* Re: Unicode I/O
From: Ludovic Courtès @ 2011-01-03 22:58 UTC
To: guile-devel
Hello Guilers, and Happy New Year! :-)
My resolution for the beginning of this year is to address this:
> Guile currently uses libunistring’s ‘u32_conv_from_encoding’ when
> reading text from an input port whose encoding isn’t Latin-1 (and
> similarly when writing to output ports).
>
> An issue with that is that escaping of non-representable characters is
> handled by libunistring, with a syntax different from the one we’d like
> (Guile or R6RS string escapes). So
> ‘scm_i_unistring_escapes_to_{guile,r6rs}_escapes’ attempt, in a kludgy
> way, to substitute the right escapes.
>
> The problems with this approach are discussed in the thread at:
>
> http://lists.gnu.org/archive/html/bug-libunistring/2010-09/msg00004.html
>
> The conclusion is that we’d better use raw ‘iconv’ calls in such
> cases...
I’ve just pushed a ‘wip-iconv’ branch, which currently changes ports to
use ‘iconv’ for input. Remaining tasks include doing it for output, and
finding a solution for ‘scm_{to,from}_stringn’ so that they behave in the
same way with respect to escapes and error handling.
Comments, feedback, suggestions, and patches are all welcome! :-)
Ludo’.
* Re: Unicode I/O
From: Ludovic Courtès @ 2011-01-22 23:42 UTC
To: guile-devel
Hello!
ludo@gnu.org (Ludovic Courtès) writes:
> I’ve just pushed a ‘wip-iconv’ branch, which currently changes ports to
> use ‘iconv’ for input. Remaining tasks include doing it for output, and
> finding a solution for ‘scm_{to,from}_stringn’ so that they behave in the
> same way with respect to escapes and error handling.
I just merged ‘wip-iconv’ into ‘master’. It uses ‘iconv’ for
display/write and peek-char/read-char, but not yet for
‘scm_{to,from}_stringn’ and ‘read-line’. Caveat: only tested on
GNU/Linux.
Also, we should take advantage of this to improve error reporting, e.g.,
to include the location of a conversion failure.
Overall, it improves performance, except on Latin-1 ports since I chose
not to special-case them (i.e., I/O on Latin-1 ports goes through
iconv). The trick is that iconv conversion descriptors are opened once
and for all, and no heap allocation happens (‘u32_conv_from_encoding’
and friends typically malloc).
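The "open once, reuse for every character" idea can be sketched as follows. This is a loose Python analogy, not the code in ‘wip-iconv’: the class ‘DecodingReader’ and its method names are hypothetical, and Python's incremental decoder stands in for an iconv conversion descriptor that is kept open for the port's lifetime.

```python
import codecs
import io


class DecodingReader:
    """Character reader that keeps one stateful decoder per port."""

    def __init__(self, raw, encoding):
        self._raw = raw
        # Created once, like a conversion descriptor opened when the
        # port's encoding is set; reused for every character read.
        self._decoder = codecs.getincrementaldecoder(encoding)()
        self._pending = ""  # decoded characters not yet consumed

    def _fill(self):
        # Feed one byte at a time until a complete character comes out;
        # the decoder remembers partial multibyte sequences between calls.
        while not self._pending:
            chunk = self._raw.read(1)
            if not chunk:
                # End of input: flush; raises on a truncated sequence.
                self._pending = self._decoder.decode(b"", final=True)
                return
            self._pending = self._decoder.decode(chunk)

    def peek_char(self):
        """Return the next character without consuming it, or None at EOF."""
        self._fill()
        return self._pending[0] if self._pending else None

    def read_char(self):
        """Return and consume the next character, or None at EOF."""
        self._fill()
        if not self._pending:
            return None
        ch, self._pending = self._pending[0], self._pending[1:]
        return ch


reader = DecodingReader(io.BytesIO("한a".encode("utf-8")), "utf-8")
print(reader.peek_char(), reader.read_char(), reader.read_char(), reader.read_char())
```

The last line prints "한 한 a None": the peek does not consume the three-byte Korean character, and the decoder object is never recreated between calls.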
Benchmark results:
--8<---------------cut here---------------start------------->8---
;; with iconv:
("ports.bm: peek-char: latin-1 port" 700000 total 0.38)
("ports.bm: peek-char: utf-8 port, ascii character" 700000 total 0.38)
("ports.bm: peek-char: utf-8 port, Korean character" 700000 total 0.68)
("ports.bm: read-char: latin-1 port" 10000000 total 3.34)
("ports.bm: read-char: utf-8 port, ascii character" 10000000 total 3.33)
("ports.bm: read-char: utf-8 port, Korean character" 10000000 total 3.31)
("ports.bm: char-ready?: latin-1 port" 10000000 total 3.02 user 3.01)
("ports.bm: char-ready?: utf-8 port, ascii character" 10000000 total 3.0)
("ports.bm: char-ready?: utf-8 port, Korean character" 10000000 total 3.01)
;; with libunistring:
("ports.bm: peek-char: latin-1 port" 700000 total 0.25)
("ports.bm: peek-char: utf-8 port, ascii character" 700000 total 2.65)
("ports.bm: peek-char: utf-8 port, Korean character" 700000 total 7.58)
("ports.bm: read-char: latin-1 port" 10000000 total 3.38)
("ports.bm: read-char: utf-8 port, ascii character" 10000000 total 3.31)
("ports.bm: read-char: utf-8 port, Korean character" 10000000 total 3.29)
("ports.bm: char-ready?: latin-1 port" 10000000 total 3.08 user 3.08)
("ports.bm: char-ready?: utf-8 port, ascii character" 10000000 total 3.08)
("ports.bm: char-ready?: utf-8 port, Korean character" 10000000 total 3.05)
--8<---------------cut here---------------end--------------->8---
So ‘peek-char’ is much faster on UTF-8 ports (though slower on the
Latin-1 port, as noted above), whereas ‘read-char’ and ‘char-ready?’
give about the same results (to my surprise, I must say).
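For reference, the ‘peek-char’ ratios can be worked out from the totals quoted above. The short Python sketch below only does that arithmetic; the dictionary name and the ‘speedup’ helper are made up for the illustration, and the timings are copied from the ‘ports.bm’ output.

```python
# case -> (seconds with iconv, seconds with libunistring),
# copied from the peek-char lines of the ports.bm results above.
PEEK_CHAR = {
    "latin-1 port": (0.38, 0.25),
    "utf-8 port, ascii character": (0.38, 2.65),
    "utf-8 port, Korean character": (0.68, 7.58),
}


def speedup(iconv_seconds, libunistring_seconds):
    """How many times faster the iconv run is (values below 1 mean slower)."""
    return libunistring_seconds / iconv_seconds


for name, (new, old) in PEEK_CHAR.items():
    print(f"peek-char, {name}: {speedup(new, old):.1f}x")
```

That gives roughly 0.7x for the Latin-1 port (i.e., slower, since Latin-1 is no longer special-cased), 7x for ASCII characters on a UTF-8 port, and 11x for Korean characters on a UTF-8 port.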
The ‘peek-char’ improvement is beneficial to SSAX. When loading a 4 MiB
XML file in UTF-8, it’s ~4 times faster than the old method:
--8<---------------cut here---------------start------------->8---
$ time guile -c '(use-modules (sxml simple)) (setlocale LC_ALL "") (xml->sxml (open-input-file "chbouib.xml"))'
real 0m20.509s
user 0m20.437s
sys 0m0.064s
$ time ./meta/guile -c '(use-modules (sxml simple)) (setlocale LC_ALL "") (xml->sxml (open-input-file "chbouib.xml"))'
real 0m5.676s
user 0m5.599s
sys 0m0.076s
--8<---------------cut here---------------end--------------->8---
For ‘write.bm’:
--8<---------------cut here---------------start------------->8---
;; with iconv:
("write.bm: write: string with escapes" 50 total 0.71)
("write.bm: write: string without escapes" 50 total 0.65)
("write.bm: display: string with escapes" 1000 total 3.39)
("write.bm: display: string without escapes" 1000 total 0.97)
;; with libunistring:
("write.bm: write: string with escapes" 50 total 7.06)
("write.bm: write: string without escapes" 50 total 7.51)
("write.bm: display: string with escapes" 1000 total 1.96)
("write.bm: display: string without escapes" 1000 total 1.46)
--8<---------------cut here---------------end--------------->8---
In the nominal case (strings without escapes), ‘display’ takes ~30% less
time here, and ‘sxml->xml’ takes ~60% less time on this 4 MiB XML file:
--8<---------------cut here---------------start------------->8---
$ ./meta/guile -c '(use-modules (sxml simple) (ice-9 time)) (setlocale LC_ALL "") (define s (xml->sxml (open-input-file "chbouib.xml"))) (time (with-output-to-file "/tmp/foo.xml" (lambda () (sxml->xml s))))'
clock utime stime cutime cstime gctime
2.48 2.44 0.02 0.00 0.00 0.00
$ guile -c '(use-modules (sxml simple) (ice-9 time)) (setlocale LC_ALL "") (define s (xml->sxml (open-input-file "chbouib.xml"))) (time (with-output-to-file "/tmp/foo.xml" (lambda () (sxml->xml s))))'
clock utime stime cutime cstime gctime
6.43 6.39 0.04 0.00 0.00 0.00
--8<---------------cut here---------------end--------------->8---
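The percentages above are reductions in elapsed time, which is not the same thing as a throughput ratio. A small Python sketch of the arithmetic, using only the figures quoted in this message (the helper ‘time_saved’ is a hypothetical name):

```python
def time_saved(new_seconds, old_seconds):
    """Fraction of the old elapsed time that the new run avoids."""
    return 1.0 - new_seconds / old_seconds


# write.bm, display of a string without escapes: iconv vs. libunistring.
display = time_saved(0.97, 1.46)

# sxml->xml on the 4 MiB file: ./meta/guile (iconv) vs. installed guile.
sxml_to_xml = time_saved(2.48, 6.43)

# xml->sxml on the same file, expressed as a ratio of wall-clock times.
xml_to_sxml_ratio = 20.509 / 5.676

print(f"display: {display:.0%} less time")
print(f"sxml->xml: {sxml_to_xml:.0%} less time")
print(f"xml->sxml: {xml_to_sxml_ratio:.1f}x faster")
```

This prints 34%, 61%, and 3.6x respectively, which matches the rounded "~30%", "60%", and "~4 times" figures; a 61% time reduction corresponds to about 2.6 times the throughput.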
Thanks,
Ludo’.