unofficial mirror of guile-devel@gnu.org 
* UTF-16 and (ice-9 rdelim)
From: Neil Jerram @ 2010-01-17 22:49 UTC
  To: Guile Development


I have a program that processes a UTF-16 input file, using
`with-input-from-file', `set-port-encoding!' and `read-line' in a pattern
like this:

(use-modules (ice-9 rdelim))

(with-input-from-file "rdelim-utf16.txt"
  (lambda ()
    (set-port-encoding! (current-input-port) "UTF16LE")
    ;; let* rather than let, so that the two reads happen in order
    (let* ((first-line (read-line))
           (second-line (read-line)))
      ...)))

A sample UTF-16 input file is attached.

This hits a couple of problems.

1. It seems that most (all?) UTF-16 files begin with a byte order mark
(BOM), \ufeff, which readers are conventionally supposed to discard -
but Guile doesn't.  So first-line becomes "\ufeffhello".

2. The internals of (read-line) just search for a '\n' char to determine
the end of the first line, which means they're assuming that

- '\n' never occurs as part of some other multibyte sequence

- when '\n' occurs as part of the newline sequence, it occupies a single
  byte.

This causes the second line to be read wrong, because a newline in
UTF-16LE is actually 2 bytes - \n followed by \0 - and the first
(read-line) leaves the \0 byte unconsumed.
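
To make both problems concrete, here is a small sketch that builds the
bytes of the first line by hand, assuming little-endian UTF-16 with a
leading BOM.  It only uses `string->utf16' from (rnrs bytevectors); the
name `sample-bytes' is made up for this illustration.

```scheme
(use-modules (rnrs bytevectors))

;; Hypothetical sample: BOM + "hello" + newline, encoded as UTF-16LE.
(define sample-bytes
  (string->utf16 (string-append (string (integer->char #xfeff))
                                "hello\n")
                 (endianness little)))

;; Every character here takes two bytes: the BOM is 255 254 and the
;; newline is 10 0, so a byte-wise search for 10 stops one byte early.
(bytevector->u8-list sample-bytes)
;; => (255 254 104 0 101 0 108 0 108 0 111 0 10 0)
```

The trailing 0 after the 10 is the byte that a single-byte newline
search would leave behind for the next read.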

I think the fixes for these are roughly as follows.

For 1:

- Add a flag to the representation of a file port to say whether we're
  still at the start of the file.  This flag starts off true, and
  becomes false once we've read enough bytes to get past a possible BOM.

- Define a static map from encodings to possible BOMs.

- When reading bytes, and the flag is true, and the port has an
  encoding, and that encoding has a possible BOM, check for and consume
  the BOM.

Or is it too magic for the port to do this automatically?
Alternatively, we could provide something like
`read-line-discarding-bom', and it would be up to the application to
know when to use this instead of `read-line'.
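
As a rough idea of what the application-level alternative could look
like, here is a minimal sketch written on top of the existing
`read-line'.  The helper `strip-bom' is hypothetical, and this is not a
proposal for the actual implementation; it assumes the port already
decodes characters correctly and that the caller only uses it for the
first line of a file.

```scheme
(use-modules (ice-9 rdelim))

;; Hypothetical helper: drop one leading U+FEFF from a string, if
;; present.  Non-strings (such as the EOF object) pass through as-is.
(define (strip-bom line)
  (if (and (string? line)
           (not (string-null? line))
           (char=? (string-ref line 0) (integer->char #xfeff)))
      (substring line 1)
      line))

;; Sketch of `read-line-discarding-bom': takes the same arguments as
;; `read-line' and strips a BOM from whatever line comes back.
(define (read-line-discarding-bom . args)
  (strip-bom (apply read-line args)))
```

An application would call this once for the first line and then go back
to plain `read-line' for the rest of the file.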

For 2:

- In scm_do_read_line(), keep the current (fast) code for the case where
  the port has no encoding.

- When the port has an encoding, use a modified implementation that
  copies raw bytes into an intermediate buffer, calls
  u32_conv_from_encoding to convert those to u32*, and uses u32_strchr
  to look for a newline.

Does that sound about right?  Are there any possible optimizations?
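
The idea behind the slow path (decode first, then search for the newline
among characters rather than bytes) can be modelled in Scheme as below.
This is only an illustration of the principle, not the C code for
scm_do_read_line(); the name `first-line-of-utf16le' is made up, and the
input is assumed to be a complete UTF-16LE buffer without a BOM.

```scheme
(use-modules (rnrs bytevectors))

;; Hypothetical model: decode the raw bytes to characters, then look
;; for the newline character instead of the newline byte.
(define (first-line-of-utf16le bytes)
  (let* ((text (utf16->string bytes (endianness little)))
         (nl   (string-index text #\newline)))
    (if nl
        (substring text 0 nl)
        text)))

;; U+010A encodes as the bytes 10 1 in UTF-16LE, so a byte-wise search
;; would see a "newline" at offset 0.  Searching after decoding does not.
(define tricky
  (string->utf16 (string (integer->char #x010a) #\newline)
                 (endianness little)))
```

Here (bytevector->u8-list tricky) is (10 1 10 0), and
(first-line-of-utf16le tricky) returns the one-character string
containing U+010A.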

For the static map, is there a canonical set of possible encoding
strings, or a way to get a single canonical ID for all the strings that
are allowed to mean the same encoding?  For UTF-16, for example, it
seems to me that many of the following encoding strings will work:

utf-16
utf-16-le
utf16le
utf16-le
utf-16le
utf16
+ the same with different case

and we don't want a map entry for each one.

I suppose one pseudo-canonical method would be to upcase and remove all
punctuation.  Then we're only left with "UTF16" and "UTF16LE", which
makes sense.
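
A minimal sketch of that pseudo-canonical method, together with a toy
version of the static map, might look like this.  The names
`canonical-encoding-name' and `bom-table' are hypothetical, and the
table has a single illustrative entry (the UTF-16LE BOM bytes FF FE)
rather than a complete list.

```scheme
;; Hypothetical helper: upcase the name and drop every character that
;; is not a letter or a digit.
(define (canonical-encoding-name name)
  (string-upcase
   (list->string
    (filter (lambda (c) (char-set-contains? char-set:letter+digit c))
            (string->list name)))))

;; Toy static map from canonical encoding name to BOM bytes.
(define bom-table
  '(("UTF16LE" . (#xff #xfe))))

(map canonical-encoding-name
     '("utf-16" "utf-16-le" "utf16le" "utf16-le" "utf-16le" "UTF16"))
;; => ("UTF16" "UTF16LE" "UTF16LE" "UTF16LE" "UTF16LE" "UTF16")
```

With this, (assoc-ref bom-table (canonical-encoding-name "utf-16-le"))
finds the entry regardless of how the encoding name was spelled.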

Regards,
        Neil



[-- Attachment #2: rdelim-utf16.txt --]
[-- Type: text/plain, Size: 18 bytes --]

hello
hello again


Thread overview: 4+ messages
2010-01-17 22:49 UTF-16 and (ice-9 rdelim) Neil Jerram
2010-01-18  0:11 ` Mike Gran
2010-01-18 20:13   ` Neil Jerram
2010-01-18 21:29     ` Mike Gran
