Unicode I/O


* Unicode I/O
@ 2010-09-18 21:50 Ludovic Courtès
  2010-09-19 10:27 ` Andy Wingo
  2011-01-03 22:58 ` Ludovic Courtès
  0 siblings, 2 replies; 5+ messages in thread
From: Ludovic Courtès @ 2010-09-18 21:50 UTC
  To: guile-devel

Hello,

Guile currently uses libunistring’s ‘u32_conv_from_encoding’ when
reading text from an input port whose encoding isn’t Latin-1 (similarly
when writing to output ports.)

One issue is that libunistring itself escapes non-representable
characters, using a syntax different from the one we want (Guile or
R6RS string escapes.)  So
‘scm_i_unistring_escapes_to_{guile,r6rs}_escapes’ attempt, in a kludgy
way, to substitute the right escapes after the fact.

The problems with this approach are discussed in the thread at:

  http://lists.gnu.org/archive/html/bug-libunistring/2010-09/msg00004.html

The conclusion is that we’d better use raw ‘iconv’ calls in such
cases...
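
To illustrate what calling ‘iconv’ directly buys us, here is a minimal
sketch; it is not code from Guile.  The names ‘encode_with_escapes’ and
‘utf8_decode’ are hypothetical.  It assumes well-formed UTF-8 input and
an ‘iconv’ that fails with EILSEQ on characters the target encoding
cannot represent (glibc does; other implementations may substitute a
character instead).  The caller then emits an R6RS-style ‘\x<hex>;’
escape itself, so no post-processing of library-generated escapes is
needed.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Decode the UTF-8 sequence at S (at most AVAIL bytes) into *CP.
   Return its length in bytes, or 0 if it is truncated.  Assumes
   well-formed UTF-8; this is not a validating decoder.  */
static size_t
utf8_decode (const unsigned char *s, size_t avail, uint32_t *cp)
{
  size_t len = s[0] < 0x80 ? 1 : (s[0] & 0xe0) == 0xc0 ? 2
    : (s[0] & 0xf0) == 0xe0 ? 3 : 4;

  if (len > avail)
    return 0;
  if (len == 1)
    {
      *cp = s[0];
      return 1;
    }

  *cp = s[0] & (0xff >> (len + 1));     /* payload bits of the lead byte */
  for (size_t i = 1; i < len; i++)
    *cp = (*cp << 6) | (s[i] & 0x3f);   /* 6 bits per continuation byte */
  return len;
}

/* Convert the UTF-8 string IN to TO_CODE, writing a NUL-terminated
   result to OUT (OUT_SIZE bytes).  Characters that TO_CODE cannot
   represent become R6RS-style escapes such as "\x3bb;".  Return 0 on
   success, -1 on error.  Only meant for stateless target encodings.  */
int
encode_with_escapes (const char *to_code, const char *in,
                     char *out, size_t out_size)
{
  assert (out_size > 0);

  iconv_t cd = iconv_open (to_code, "UTF-8");
  if (cd == (iconv_t) -1)
    return -1;

  char *inp = (char *) in;
  size_t in_left = strlen (in);
  char *outp = out;
  size_t out_left = out_size - 1;       /* keep room for the final NUL */
  int status = 0;

  while (in_left > 0)
    {
      if (iconv (cd, &inp, &in_left, &outp, &out_left) != (size_t) -1)
        break;                          /* everything was converted */
      if (errno != EILSEQ)
        {
          status = -1;                  /* E2BIG, EINVAL, ... */
          break;
        }

      /* INP now points at the character that could not be encoded.  */
      uint32_t cp = 0;
      size_t len = utf8_decode ((const unsigned char *) inp, in_left, &cp);
      int n = snprintf (outp, out_left + 1, "\\x%x;", (unsigned) cp);
      if (len == 0 || n < 0 || (size_t) n > out_left)
        {
          status = -1;
          break;
        }
      outp += n;
      out_left -= n;
      inp += len;
      in_left -= len;
    }

  *outp = '\0';
  iconv_close (cd);
  return status;
}
```

For instance, encoding "aλb" to ISO-8859-1 with this function would
yield the seven bytes ‘a\x3bb;b’, while "é" maps to the single byte
0xE9.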

Ludo’.


* Re: Unicode I/O
  2010-09-18 21:50 Unicode I/O Ludovic Courtès
@ 2010-09-19 10:27 ` Andy Wingo
  2010-09-19 20:41   ` Ludovic Courtès
  2011-01-03 22:58 ` Ludovic Courtès
  1 sibling, 1 reply; 5+ messages in thread
From: Andy Wingo @ 2010-09-19 10:27 UTC
  To: Ludovic Courtès; +Cc: guile-devel

On Sat 18 Sep 2010 23:50, ludo@gnu.org (Ludovic Courtès) writes:

>   http://lists.gnu.org/archive/html/bug-libunistring/2010-09/msg00004.html
>
> The conclusion is that we’d better use raw ‘iconv’ calls in such
> cases...

Boo. I guess this is a 2.0 blocker. Bruno's strategy appears (as usual)
to be the right one...

Andy
-- 
http://wingolog.org/




* Re: Unicode I/O
  2010-09-19 10:27 ` Andy Wingo
@ 2010-09-19 20:41   ` Ludovic Courtès
  0 siblings, 0 replies; 5+ messages in thread
From: Ludovic Courtès @ 2010-09-19 20:41 UTC
  To: guile-devel

Hi,

Andy Wingo <wingo@pobox.com> writes:

> On Sat 18 Sep 2010 23:50, ludo@gnu.org (Ludovic Courtès) writes:
>
>>   http://lists.gnu.org/archive/html/bug-libunistring/2010-09/msg00004.html
>>
>> The conclusion is that we’d better use raw ‘iconv’ calls in such
>> cases...
>
> Boo. I guess this is a 2.0 blocker. Bruno's strategy appears (as usual)
> to be the right one...

Yes to both...

Ludo’.





* Re: Unicode I/O
  2010-09-18 21:50 Unicode I/O Ludovic Courtès
  2010-09-19 10:27 ` Andy Wingo
@ 2011-01-03 22:58 ` Ludovic Courtès
  2011-01-22 23:42   ` Ludovic Courtès
  1 sibling, 1 reply; 5+ messages in thread
From: Ludovic Courtès @ 2011-01-03 22:58 UTC
  To: guile-devel

Hello Guilers, and Happy New Year!  :-)

My resolution for the beginning of this year is to address this:

> Guile currently uses libunistring’s ‘u32_conv_from_encoding’ when
> reading text from an input port whose encoding isn’t Latin-1 (similarly
> when writing to output ports.)
>
> An issue with that is that escaping non-representable characters is
> handled by libunistring, with a syntax different from the one we’d like
> (Guile or R6RS string escapes.)  So
> ‘scm_i_unistring_escapes_to_{guile,r6rs}_escapes’ kludgely attempt to
> substitute the right escapes.
>
> The problems with this approach are discussed in the thread at:
>
>   http://lists.gnu.org/archive/html/bug-libunistring/2010-09/msg00004.html
>
> The conclusion is that we’d better use raw ‘iconv’ calls in such
> cases...

I’ve just pushed a ‘wip-iconv’ branch, which currently changes ports to
use ‘iconv’ for input.  Remaining tasks include doing the same for
output, and finding a solution for ‘scm_{to,from}_stringn’ so that they
behave the same way with respect to escapes and error handling.

Comments, feedback, suggestions, and patches are all welcome!  :-)

Ludo’.





* Re: Unicode I/O
  2011-01-03 22:58 ` Ludovic Courtès
@ 2011-01-22 23:42   ` Ludovic Courtès
  0 siblings, 0 replies; 5+ messages in thread
From: Ludovic Courtès @ 2011-01-22 23:42 UTC
  To: guile-devel

Hello!

ludo@gnu.org (Ludovic Courtès) writes:

> I’ve just pushed a ‘wip-iconv’ branch, which currently changes ports to
> use ‘iconv’ for input.  Remaining tasks include doing it for output, and
> finding a solution for ‘scm_{to,from}_stringn’ so that it behaves in the
> same way wrt. to escapes and error handling.

I just merged ‘wip-iconv’ into ‘master’.  It uses ‘iconv’ for
display/write and peek-char/read-char, but not yet for
‘scm_{to,from}_stringn’ and ‘read-line’.  Caveat: it has only been
tested on GNU/Linux.

Also, we should take advantage of this to improve error reporting, e.g.,
to include the location of a conversion failure.
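
As a rough idea of what reporting the location could look like (a
sketch only, not what is in ‘master’): when ‘iconv’ fails with EILSEQ or
EINVAL, its input pointer is left at the offending sequence, so the byte
offset comes for free.  The function name ‘first_decoding_error’ and its
return convention are made up for the example, and "UTF-32LE" as a
target name assumes a glibc-like ‘iconv’.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Return the byte offset in BUF (LEN bytes) of the first sequence that
   cannot be decoded as FROM_CODE, -1 if the whole buffer decodes, or
   -2 if no conversion descriptor could be opened.  */
long
first_decoding_error (const char *from_code, const char *buf, size_t len)
{
  assert (buf != NULL);

  iconv_t cd = iconv_open ("UTF-32LE", from_code);
  if (cd == (iconv_t) -1)
    return -2;

  char *in = (char *) buf;
  size_t in_left = len;
  long offset = -1;

  while (in_left > 0)
    {
      /* The decoded text is discarded; only the input position matters.  */
      char scratch[256];
      char *out = scratch;
      size_t out_left = sizeof scratch;

      if (iconv (cd, &in, &in_left, &out, &out_left) == (size_t) -1
          && errno != E2BIG)
        {
          /* EILSEQ (invalid) or EINVAL (truncated): IN points at the
             start of the sequence that could not be decoded.  */
          offset = in - buf;
          break;
        }
      /* On E2BIG the scratch buffer was full; just loop with a fresh one.  */
    }

  iconv_close (cd);
  return offset;
}
```

An error message could then include that offset, or a line and column
derived from it.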

Overall, it improves performance, except on Latin-1 ports, because I
chose not to special-case them (i.e., I/O on Latin-1 ports now goes
through iconv.)  The trick is that iconv conversion descriptors are
opened once and for all, and no heap allocation happens on each
conversion (‘u32_conv_from_encoding’ and friends typically malloc.)
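
The general shape of that trick can be sketched as follows.  This is an
illustration under assumptions, not the code that was merged: ‘struct
text_port’ and the ‘text_port_*’ functions are hypothetical, the
encoding is assumed to be stateless (UTF-8, Latin-1, ...), and the
byte-at-a-time retry relies on ‘iconv’ reporting EINVAL for a truncated
sequence without consuming it.  The descriptor is opened once per port,
and each character is decoded into a buffer on the stack.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <stdint.h>

/* Longest encoded character we are willing to feed to iconv.  */
#define MAX_ENCODED_BYTES 8

/* Hypothetical, minimal input port; not Guile's port structure.  */
struct text_port
{
  iconv_t decoder;    /* opened once, reused for every character */
  const char *pos;    /* next unread byte */
  size_t left;        /* number of unread bytes */
};

/* Set up PORT to read LEN bytes of DATA encoded in ENCODING.
   Return 0 on success, -1 if the encoding is not supported.  */
int
text_port_open (struct text_port *port, const char *encoding,
                const char *data, size_t len)
{
  port->decoder = iconv_open ("UTF-32LE", encoding);
  port->pos = data;
  port->left = len;
  return port->decoder == (iconv_t) -1 ? -1 : 0;
}

void
text_port_close (struct text_port *port)
{
  iconv_close (port->decoder);
}

/* Decode the next character into *CP.  Return 1 on success, 0 at end
   of input, -1 on a decoding error.  No heap allocation takes place.  */
int
text_port_read_char (struct text_port *port, uint32_t *cp)
{
  assert (port->decoder != (iconv_t) -1);

  if (port->left == 0)
    return 0;

  /* Offer 1, 2, 3, ... bytes until they form a complete character.  */
  for (size_t n = 1; n <= port->left && n <= MAX_ENCODED_BYTES; n++)
    {
      char *in = (char *) port->pos;
      size_t in_left = n;
      unsigned char utf32[4];
      char *out = (char *) utf32;
      size_t out_left = sizeof utf32;

      if (iconv (port->decoder, &in, &in_left, &out, &out_left)
          != (size_t) -1)
        {
          if (out_left != 0)
            return -1;        /* no character produced: stateful encoding? */
          *cp = utf32[0] | (utf32[1] << 8) | (utf32[2] << 16)
            | ((uint32_t) utf32[3] << 24);
          port->pos += n;
          port->left -= n;
          return 1;
        }
      if (errno != EINVAL)
        return -1;            /* EILSEQ: invalid input */
      /* EINVAL: incomplete sequence, retry with one more byte.  */
    }

  return -1;                  /* truncated or overlong input */
}
```

Compared with converting through a freshly allocated buffer on every
call, the per-character cost here is a few ‘iconv’ calls on an existing
descriptor.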

Benchmark results:

--8<---------------cut here---------------start------------->8---
;; with iconv:

("ports.bm: peek-char: latin-1 port" 700000 total 0.38)
("ports.bm: peek-char: utf-8 port, ascii character" 700000 total 0.38)
("ports.bm: peek-char: utf-8 port, Korean character" 700000 total 0.68)
("ports.bm: read-char: latin-1 port" 10000000 total 3.34)
("ports.bm: read-char: utf-8 port, ascii character" 10000000 total 3.33)
("ports.bm: read-char: utf-8 port, Korean character" 10000000 total 3.31)
("ports.bm: char-ready?: latin-1 port" 10000000 total 3.02 user 3.01)
("ports.bm: char-ready?: utf-8 port, ascii character" 10000000 total 3.0)
("ports.bm: char-ready?: utf-8 port, Korean character" 10000000 total 3.01)

;; with libunistring:

("ports.bm: peek-char: latin-1 port" 700000 total 0.25)
("ports.bm: peek-char: utf-8 port, ascii character" 700000 total 2.65)
("ports.bm: peek-char: utf-8 port, Korean character" 700000 total 7.58)
("ports.bm: read-char: latin-1 port" 10000000 total 3.38)
("ports.bm: read-char: utf-8 port, ascii character" 10000000 total 3.31)
("ports.bm: read-char: utf-8 port, Korean character" 10000000 total 3.29)
("ports.bm: char-ready?: latin-1 port" 10000000 total 3.08 user 3.08)
("ports.bm: char-ready?: utf-8 port, ascii character" 10000000 total 3.08)
("ports.bm: char-ready?: utf-8 port, Korean character" 10000000 total 3.05)
--8<---------------cut here---------------end--------------->8---

So ‘peek-char’ is faster, whereas ‘read-char’ gives the same results (to
my surprise, I must say.)

The ‘peek-char’ improvement is beneficial to SSAX.  When loading a 4 MiB
XML file in UTF-8, the iconv-based code is about 3.6 times faster than
the old method (20.5 s vs. 5.7 s):

--8<---------------cut here---------------start------------->8---
$ time guile -c '(use-modules (sxml simple)) (setlocale LC_ALL "") (xml->sxml (open-input-file "chbouib.xml"))'

real    0m20.509s
user    0m20.437s
sys     0m0.064s

$ time ./meta/guile -c '(use-modules (sxml simple)) (setlocale LC_ALL "") (xml->sxml (open-input-file "chbouib.xml"))'

real    0m5.676s
user    0m5.599s
sys     0m0.076s
--8<---------------cut here---------------end--------------->8---

For ‘write.bm’:

--8<---------------cut here---------------start------------->8---
;; with iconv:

("write.bm: write: string with escapes" 50 total 0.71)
("write.bm: write: string without escapes" 50 total 0.65)
("write.bm: display: string with escapes" 1000 total 3.39)
("write.bm: display: string without escapes" 1000 total 0.97)

;; with libunistring:

("write.bm: write: string with escapes" 50 total 7.06)
("write.bm: write: string without escapes" 50 total 7.51)
("write.bm: display: string with escapes" 1000 total 1.96)
("write.bm: display: string without escapes" 1000 total 1.46)
--8<---------------cut here---------------end--------------->8---

In the nominal case (no escapes), ‘display’ takes ~30% less time here,
and ‘sxml->xml’ takes ~60% less time on this 4 MiB XML file:

--8<---------------cut here---------------start------------->8---
$ ./meta/guile -c '(use-modules (sxml simple) (ice-9 time)) (setlocale LC_ALL "") (define s (xml->sxml (open-input-file "chbouib.xml"))) (time (with-output-to-file "/tmp/foo.xml" (lambda () (sxml->xml s))))'
clock utime stime cutime cstime gctime
 2.48  2.44  0.02   0.00   0.00   0.00

$ guile -c '(use-modules (sxml simple) (ice-9 time)) (setlocale LC_ALL "") (define s (xml->sxml (open-input-file "chbouib.xml"))) (time (with-output-to-file "/tmp/foo.xml" (lambda () (sxml->xml s))))'
clock utime stime cutime cstime gctime
 6.43  6.39  0.04   0.00   0.00   0.00
--8<---------------cut here---------------end--------------->8---
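
For reference, the ratios quoted in this message can be recomputed from
the timings above with the two helpers below.  They are only a
convenience sketch with made-up names, and they read "N% faster" as "N%
less elapsed time", which is an interpretation on my part.

```c
#include <assert.h>

/* How many times faster the new run is: old time over new time.  */
double
speedup (double old_seconds, double new_seconds)
{
  assert (new_seconds > 0.0);
  return old_seconds / new_seconds;
}

/* Percentage of the old elapsed time that the new run saves.  */
double
percent_time_saved (double old_seconds, double new_seconds)
{
  assert (old_seconds > 0.0);
  return 100.0 * (old_seconds - new_seconds) / old_seconds;
}
```

With the numbers above, ‘speedup (20.509, 5.676)’ is about 3.6 for
‘xml->sxml’, ‘percent_time_saved (1.46, 0.97)’ is about 34 for ‘display’
without escapes, and ‘percent_time_saved (6.43, 2.48)’ is about 61 for
‘sxml->xml’.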

Thanks,
Ludo’.



