* Re: Unicode I/O
From: Ludovic Courtès @ 2011-01-22 23:42 UTC
To: guile-devel
Hello!
ludo@gnu.org (Ludovic Courtès) writes:
> I’ve just pushed a ‘wip-iconv’ branch, which currently changes ports to
> use ‘iconv’ for input. Remaining tasks include doing it for output, and
> finding a solution for ‘scm_{to,from}_stringn’ so that it behaves in the
> same way wrt. escapes and error handling.
I just merged ‘wip-iconv’ into ‘master’. It uses ‘iconv’ for
display/write and peek-char/read-char, but not yet for
‘scm_{to,from}_stringn’ and ‘read-line’. Caveat: it has only been tested
on GNU/Linux.
Also, we should take advantage of this to improve error reporting, e.g.,
to include the location of a conversion failure.
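To make "the location of a conversion failure" concrete, here is a small
sketch in Python rather than in Guile's C code. It relies on Python's
‘UnicodeDecodeError’, which records the byte offset of the offending
sequence; the helper name ‘describe_decoding_error’ and the message format
are made up for this example.

```python
def describe_decoding_error(data: bytes, encoding: str) -> str:
    """Decode DATA; on failure, report where the conversion stopped."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as err:
        # err.start is the index of the first byte that could not be decoded.
        bad = data[err.start:err.end]
        return f"{encoding}: cannot decode {bad!r} at byte offset {err.start}"
    return "ok"

# 0xE9 is 'é' in Latin-1 but an incomplete sequence in UTF-8.
message = describe_decoding_error(b"caf\xe9 au lait", "utf-8")
print(message)
```

A port-level report would presumably add the port's line and column, which
this sketch does not track.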
Overall, it improves performance, except on Latin-1 ports, since I chose
not to special-case them (i.e., I/O on Latin-1 ports goes through
iconv). The trick is that iconv conversion descriptors are opened once
and for all, and no heap allocation happens (‘u32_conv_from_encoding’ and
friends typically call malloc).
Benchmark results:
--8<---------------cut here---------------start------------->8---
;; with iconv:
("ports.bm: peek-char: latin-1 port" 700000 total 0.38)
("ports.bm: peek-char: utf-8 port, ascii character" 700000 total 0.38)
("ports.bm: peek-char: utf-8 port, Korean character" 700000 total 0.68)
("ports.bm: read-char: latin-1 port" 10000000 total 3.34)
("ports.bm: read-char: utf-8 port, ascii character" 10000000 total 3.33)
("ports.bm: read-char: utf-8 port, Korean character" 10000000 total 3.31)
("ports.bm: char-ready?: latin-1 port" 10000000 total 3.02 user 3.01)
("ports.bm: char-ready?: utf-8 port, ascii character" 10000000 total 3.0)
("ports.bm: char-ready?: utf-8 port, Korean character" 10000000 total 3.01)
;; with libunistring:
("ports.bm: peek-char: latin-1 port" 700000 total 0.25)
("ports.bm: peek-char: utf-8 port, ascii character" 700000 total 2.65)
("ports.bm: peek-char: utf-8 port, Korean character" 700000 total 7.58)
("ports.bm: read-char: latin-1 port" 10000000 total 3.38)
("ports.bm: read-char: utf-8 port, ascii character" 10000000 total 3.31)
("ports.bm: read-char: utf-8 port, Korean character" 10000000 total 3.29)
("ports.bm: char-ready?: latin-1 port" 10000000 total 3.08 user 3.08)
("ports.bm: char-ready?: utf-8 port, ascii character" 10000000 total 3.08)
("ports.bm: char-ready?: utf-8 port, Korean character" 10000000 total 3.05)
--8<---------------cut here---------------end--------------->8---
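So ‘peek-char’ is much faster on UTF-8 ports (about 7 times for ASCII
characters and 11 times for Korean characters), though somewhat slower on
Latin-1 ports (0.38 vs. 0.25), whereas ‘read-char’ gives the same results
(to my surprise, I must say).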
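How much faster ‘peek-char’ gets can be read off the ‘total’ figures in
the listing above. A quick check, written in Python for convenience; the
dictionary keys are just labels chosen for this sketch:

```python
# 'total' figures for peek-char, copied from the ports.bm results above.
with_iconv = {"latin-1": 0.38, "utf-8 ascii": 0.38, "utf-8 Korean": 0.68}
with_libunistring = {"latin-1": 0.25, "utf-8 ascii": 2.65, "utf-8 Korean": 7.58}

# A ratio above 1 means the iconv version is faster, below 1 slower.
speedup = {name: with_libunistring[name] / with_iconv[name]
           for name in with_iconv}

for name, ratio in speedup.items():
    print(f"peek-char, {name}: {ratio:.2f}x")
```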
The ‘peek-char’ improvement is beneficial to SSAX. When loading a 4 MiB
XML file in UTF-8, it’s about 3.6 times faster than the old method
(5.7 s vs. 20.5 s of real time):
--8<---------------cut here---------------start------------->8---
$ time guile -c '(use-modules (sxml simple)) (setlocale LC_ALL "") (xml->sxml (open-input-file "chbouib.xml"))'
real 0m20.509s
user 0m20.437s
sys 0m0.064s
$ time ./meta/guile -c '(use-modules (sxml simple)) (setlocale LC_ALL "") (xml->sxml (open-input-file "chbouib.xml"))'
real 0m5.676s
user 0m5.599s
sys 0m0.076s
--8<---------------cut here---------------end--------------->8---
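The speedup factor follows from the two ‘real’ figures above. A small
Python sketch; ‘parse_real’ is a helper written for this example and only
handles the minutes-and-seconds format shown, and the assignment of the
slower run to the old method is inferred from the surrounding text:

```python
import re

def parse_real(text: str) -> float:
    """Convert a `time` figure such as '0m20.509s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError(f"unrecognized time format: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)

old = parse_real("0m20.509s")  # `guile`, presumably the old method
new = parse_real("0m5.676s")   # `./meta/guile`, the freshly built tree
print(f"{old / new:.2f}x faster")
```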
For ‘write.bm’:
--8<---------------cut here---------------start------------->8---
;; with iconv:
("write.bm: write: string with escapes" 50 total 0.71)
("write.bm: write: string without escapes" 50 total 0.65)
("write.bm: display: string with escapes" 1000 total 3.39)
("write.bm: display: string without escapes" 1000 total 0.97)
;; with libunistring:
("write.bm: write: string with escapes" 50 total 7.06)
("write.bm: write: string without escapes" 50 total 7.51)
("write.bm: display: string with escapes" 1000 total 1.96)
("write.bm: display: string without escapes" 1000 total 1.46)
--8<---------------cut here---------------end--------------->8---
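In the nominal case (strings without escapes), ‘display’ is ~30% faster
here, although it is slower on strings with escapes (3.39 vs. 1.96), and
‘write’ is about 10 times faster. ‘sxml->xml’ takes ~60% less time on this
4 MiB XML file (2.48 vs. 6.43):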
--8<---------------cut here---------------start------------->8---
$ ./meta/guile -c '(use-modules (sxml simple) (ice-9 time)) (setlocale LC_ALL "") (define s (xml->sxml (open-input-file "chbouib.xml"))) (time (with-output-to-file "/tmp/foo.xml" (lambda () (sxml->xml s))))'
clock utime stime cutime cstime gctime
 2.48  2.44  0.02   0.00   0.00   0.00
$ guile -c '(use-modules (sxml simple) (ice-9 time)) (setlocale LC_ALL "") (define s (xml->sxml (open-input-file "chbouib.xml"))) (time (with-output-to-file "/tmp/foo.xml" (lambda () (sxml->xml s))))'
clock utime stime cutime cstime gctime
 6.43  6.39  0.04   0.00   0.00   0.00
--8<---------------cut here---------------end--------------->8---
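Since "N% faster" can be read in more than one way, it may help to express
the figures above both as a reduction in running time and as a speedup
factor. A Python sketch, with the numbers copied from the listings above
(‘compare’ is a helper named for this example):

```python
def compare(old: float, new: float) -> tuple:
    """Return (percent reduction in running time, speedup factor)."""
    return (1 - new / old) * 100, old / new

# display, string without escapes (write.bm 'total'): libunistring vs. iconv
display = compare(1.46, 0.97)
# sxml->xml on the 4 MiB file ('clock' column): guile vs. ./meta/guile
sxml = compare(6.43, 2.48)

for label, (percent, factor) in (("display", display), ("sxml->xml", sxml)):
    print(f"{label}: {percent:.0f}% less time, {factor:.2f}x")
```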
Thanks,
Ludo’.