UTF-16 and (ice-9 rdelim)

unofficial mirror of guile-devel@gnu.org 

* UTF-16 and (ice-9 rdelim)
@ 2010-01-17 22:49 Neil Jerram
  2010-01-18  0:11 ` Mike Gran
  0 siblings, 1 reply; 4+ messages in thread
From: Neil Jerram @ 2010-01-17 22:49 UTC
  To: Guile Development

I have a program that processes a UTF-16 input file, using
`with-input-from-file', `set-port-encoding!' and `read-line' in a pattern
like this:

(use-modules (ice-9 rdelim))

(with-input-from-file "rdelim-utf16.txt"
  (lambda ()
    (set-port-encoding! (current-input-port) "UTF16LE")
    ;; let* rather than let: the two reads must happen in order.
    (let* ((first-line (read-line))
           (second-line (read-line)))
      ...)))

A sample UTF-16 input file is attached.

This hits a couple of problems.

1. It seems that most (all?) UTF-16 files begin with a byte order mark
(BOM), \ufeff, which readers are conventionally supposed to discard -
but Guile doesn't.  So first-line becomes "\ufeffhello".

2. The internals of (read-line) just search for a '\n' char to determine
the end of the first line, which means they're assuming that

- '\n' never occurs as part of some other multibyte sequence

- when '\n' occurs as part of the newline sequence, it occupies a single
  byte.

This causes the second line to be read wrong, because a newline in
little-endian UTF-16 is actually 2 bytes - \n \0 - and the first
(read-line) leaves the \0 byte unconsumed.
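
The byte-level problem can be seen without involving ports at all.  A
minimal sketch, assuming the (rnrs bytevectors) module is available; the
helper name `utf16le-bytes' and the sample strings are illustrative only:

```scheme
(use-modules (rnrs bytevectors))

;; Encode STR as little-endian UTF-16 and return its bytes as a list.
(define (utf16le-bytes str)
  (bytevector->u8-list (string->utf16 str (endianness little))))

;; "a" followed by newline.  The newline is the byte pair 10 0, so a
;; reader that stops right after byte 10 leaves the 0 unconsumed.
(display (utf16le-bytes "a\n"))                          ; (97 0 10 0)
(newline)

;; U+010A is not a newline, but its encoding contains byte 10.
(display (utf16le-bytes (string (integer->char #x010A)))) ; (10 1)
(newline)
```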

I think the fixes for these are roughly as follows.

For 1:

- Add a flag to the representation of a file port to say whether we're
  still at the start of the file.  This flag starts off true, and
  becomes false once we've read enough bytes to get past a possible BOM.

- Define a static map from encodings to possible BOMs.

- When reading bytes, and the flag is true, and the port has an
  encoding, and that encoding has a possible BOM, check for and consume
  the BOM.

Or is it too magic for the port to do this automatically?
Alternatively, we could provide something like
`read-line-discarding-bom', and it would be up to the application to
know when to use this instead of `read-line'.
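
The application-level alternative might look roughly like this.  It is
only a sketch: it assumes that `peek-char' and `read-char' already decode
the port's encoding correctly, and it leaves it to the caller to know
that the port is still at the start of the file.

```scheme
(use-modules (ice-9 rdelim))

;; U+FEFF, the byte order mark, as a decoded character.
(define bom-char (integer->char #xFEFF))

;; Like `read-line', but first drops a BOM character if one is next on
;; PORT.  Meant only for the first read from a port.
(define* (read-line-discarding-bom #:optional (port (current-input-port)))
  (let ((c (peek-char port)))
    ;; peek-char returns the EOF object on an empty port, hence char?.
    (when (and (char? c) (char=? c bom-char))
      (read-char port)))
  (read-line port))
```

In the pattern above, only the first `(read-line)' would become
`(read-line-discarding-bom)'; later reads stay as they are.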

For 2:

- In scm_do_read_line(), keep the current (fast) code for the case where
  the port has no encoding.

- When the port has an encoding, use a modified implementation that
  copies raw bytes into an intermediate buffer, calls
  u32_conv_from_encoding to convert those to u32*, and uses u32_strchr
  to look for a newline.

Does that sound about right?  Are there any possible optimizations?

For the static map, is there a canonical set of possible encoding
strings, or a way to get a single canonical ID for all the strings that
are allowed to mean the same encoding?  For UTF-16, for example, it
seems to me that many of the following encoding strings will work

utf-16
utf-16-le
utf16le
utf16-le
utf-16le
utf16
+ the same with different case

and we don't want a map entry for each one.

I suppose one pseudo-canonical method would be to upcase and remove all
punctuation.  Then we're only left with "UTF16" and "UTF16LE", which
makes sense.
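
A minimal sketch of that pseudo-canonical method.  The procedure name
`canonical-encoding-name' is made up for illustration, and "punctuation"
is taken to mean anything that is neither a letter nor a digit:

```scheme
;; Upcase NAME and keep only its letters and digits, so that spellings
;; such as "utf-16-le" and "Utf16LE" map to the same key.
(define (canonical-encoding-name name)
  (list->string
   (filter (lambda (c) (or (char-alphabetic? c) (char-numeric? c)))
           (string->list (string-upcase name)))))

(display (canonical-encoding-name "utf-16-le"))  ; UTF16LE
(newline)
(display (canonical-encoding-name "Utf-16"))     ; UTF16
(newline)
```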

Regards,
        Neil

[-- Attachment #2: rdelim-utf16.txt --]
[-- Type: text/plain, Size: 18 bytes --]

hello
hello again


* Re: UTF-16 and (ice-9 rdelim)
  2010-01-17 22:49 UTF-16 and (ice-9 rdelim) Neil Jerram
@ 2010-01-18  0:11 ` Mike Gran
  2010-01-18 20:13   ` Neil Jerram
  0 siblings, 1 reply; 4+ messages in thread
From: Mike Gran @ 2010-01-18  0:11 UTC
  To: Neil Jerram, Guile Development

> From: Neil Jerram <neil@ossau.uklinux.net>

> 
> 1. It seems that most (all?) UTF-16 files begin with a byte order marker
> (BOM), \ufeff, which readers are conventionally supposed to discard -
> but Guile doesn't.  So first-line becomes "\ufeffhello"

> 2. The internals of (read-line) just search for a '\n' char to determine
> the end of the first line, which means they're assuming that
> 
> - '\n' never occurs as part of some other multibyte sequence
> 
> - when '\n' occurs as part of the newline sequence, it occupies a single
>   byte.
> 
> This causes the second line to be read wrong, because newline in UTF-16
> is actually 2 bytes - \n \0 - and the first (read-line) leaves the \0
> byte unconsumed.
> 
> I think the fixes for these are roughly as follows.
> 
> For 1:
> 
> - Add a flag to the representation of a file port to say whether we're
>   still at the start of the file.  This flag starts off true, and
>   becomes false once we've read enough bytes to get past a possible BOM.
> 
> - Define a static map from encodings to possible BOMs.
> 
> - When reading bytes, and the flag is true, and the port has an
>   encoding, and that encoding has a possible BOM, check for and consume
>   the BOM.

This should work.  BOMs are either 2, 3, or 4 bytes for UTF-16, UTF-8,
and UTF-32 respectively.  And if the port encoding is expected to be set
correctly in the first place, a BOM should always be the first code point
returned by read-char.

> For 2:
> 
> - In scm_do_read_line(), keep the current (fast) code for the case where
>   the port has no encoding.
> 
> - When the port has an encoding, use a modified implementation that
>   copies raw bytes into an intermediate buffer, calls
>   u32_conv_from_encoding to convert those to u32*, and uses u32_strchr
>   to look for a newline.
> 
> Does that sound about right?  Are there any possible optimizations?

If you already have to go to the trouble of converting to u32, it might
be simplest to reimplement the non-Latin-1 case in Scheme,
since read-char and unread-char should work even for UTF-16.
That might do bad things to speed, though.
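
A rough prototype of the Scheme approach, assuming `read-char' decodes
the port's encoding correctly.  The name `read-line/chars' is
hypothetical, and no claim is made about its speed:

```scheme
;; Read characters from PORT up to, but not including, the next newline.
;; Returns the line as a string, or the EOF object if nothing is left.
;; It works on decoded characters, so the number of bytes a newline
;; occupies in the port's encoding does not matter.
(define* (read-line/chars #:optional (port (current-input-port)))
  (let loop ((chars '()))
    (let ((c (read-char port)))
      (cond ((eof-object? c)
             (if (null? chars)
                 c
                 (list->string (reverse chars))))
            ((char=? c #\newline)
             (list->string (reverse chars)))
            (else
             (loop (cons c chars)))))))
```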

> 
> For the static map, is there a canonical set of possible encoding
> strings, or a way to get a single canonical ID for all the strings that
> are allowed to mean the same encoding?  For UTF-16, for example, it
> seems to me that many of the following encoding strings will work
> 
> utf-16
> utf-16-le
> utf16le
> utf16-le
> utf-16le
> utf16
> + the same with different case
> 
> and we don't want a map entry for each one.
> 
> I suppose one pseudo-canonical method would be to upcase and remove all
> punctuation.  Then we're only left with "UTF16" and "UTF16LE", which
> makes sense.

There are a couple of issues here.  If you want a port to automatically
identify a Unicode encoding by checking its first four bytes for a BOM, 
then you would need some sort of association table.  It wouldn't be that
hard to do.

But, if you just want to get rid of a BOM, you can cut it down to 
a rule.  If the first code point that a port reads is U+FEFF and if the
encoding has the string "utf" in it, ignore it.  If the first code point
is U+FFFE and the encoding has "utf" in it, flag an error.
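
As a sketch, the rule could be written as a small decision procedure.
The name `bom-action' and the symbols it returns are invented for
illustration; it takes the first code point as an integer:

```scheme
;; Decide what to do with the first code point read from a port.
;; ENCODING is the port's encoding name, CODE-POINT an integer.
;; Returns 'skip (drop the BOM), 'error (byte-swapped BOM), or 'keep.
(define (bom-action encoding code-point)
  (cond ((not (string-contains (string-downcase encoding) "utf")) 'keep)
        ((= code-point #xFEFF) 'skip)
        ((= code-point #xFFFE) 'error)
        (else 'keep)))

(display (bom-action "UTF-16LE" #xFEFF))    ; skip
(newline)
(display (bom-action "ISO-8859-1" #xFEFF))  ; keep
(newline)
```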

> 
> Regards,
>         Neil

-Mike 






* Re: UTF-16 and (ice-9 rdelim)
  2010-01-18  0:11 ` Mike Gran
@ 2010-01-18 20:13   ` Neil Jerram
  2010-01-18 21:29     ` Mike Gran
  0 siblings, 1 reply; 4+ messages in thread
From: Neil Jerram @ 2010-01-18 20:13 UTC
  To: Mike Gran; +Cc: Guile Development

Hi Mike,

Many thanks for your quick response.  I'll hopefully work on these fixes
shortly.

A few comments...

Mike Gran <spk121@yahoo.com> writes:

> This should work.  BOMs are either 2, 3, or 4 bytes for UTF-16, UTF-8,
> and UTF-32 respectively.  And if the port encoding is expected to be
> set correctly in the first place, a BOM should always be the first
> code point returned by read-char.

Thanks.  For the moment, I am assuming that the encoding will have
previously been declared correctly, by `set-port-encoding!' or by a
`coding:' comment.

> If you already have to go to the trouble of converting to u32, it might
> be simplest to reimplement the non-Latin-1 case in Scheme,
> since read-char and unread-char should work even for UTF-16.
> That might do bad things to speed, though.

I'll have a look; it's nice to prototype that way, at least.

> There are a couple of issues here.  If you want a port to automatically
> identify a Unicode encoding by checking its first four bytes for a BOM, 
> then you would need some sort of association table.  It wouldn't be that
> hard to do.

I'm not thinking of that yet.  (For the future, clearly it must be
possible, as Emacs is doing it all the time.)

> But, if you just want to get rid of a BOM, you can cut it down to 
> a rule.  If the first code point that a port reads is U+FEFF and if the
> encoding has the string "utf" in it, ignore it.  If the first code point
> is U+FFFE and the encoding has "utf" in it, flag an error.

Agreed.

Out of interest, does that mean that iconv will auto-detect the
endianness if the encoding does not explicitly say "le" or "be"?

Regards,
        Neil


* Re: UTF-16 and (ice-9 rdelim)
  2010-01-18 20:13   ` Neil Jerram
@ 2010-01-18 21:29     ` Mike Gran
  0 siblings, 0 replies; 4+ messages in thread
From: Mike Gran @ 2010-01-18 21:29 UTC
  To: Neil Jerram; +Cc: Guile Development

> From: Neil Jerram
> Hi Mike,

> > But, if you just want to get rid of a BOM, you can cut it down to 
> > a rule.  If the first code point that a port reads is U+FEFF and if the
> > encoding has the string "utf" in it, ignore it.  If the first code point
> > is U+FFFE and the encoding has "utf" in it, flag an error.
> 
> Agreed.
> 
> Out of interest, does that mean that iconv will auto-detect the
> endianness if the encoding does not explicitly say "le" or "be"?

The Unicode FAQ from unicode.org says that "the unmarked form (UTF-16, UTF-32)
uses big-endian byte serialization by default, but may include a byte order
mark at the beginning to indicate the actual byte serialization used."  So,
I guess the strictly correct thing to do for UTF-16 would be to

* check for a BOM.
* if it exists
  * if it reads as U+FFFE (the BOM with its bytes swapped), modify the
    port encoding to UTF-16-LE
  * if it reads as U+FEFF, leave the port encoding as UTF-16
  * discard the BOM
* else, leave the port encoding as UTF-16

and similarly for UTF-32.
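
A sketch of the UTF-16 case, working on the first raw bytes rather than
on a port.  The procedure name `utf16-byte-order' is hypothetical, and
returning "UTF-16BE" explicitly for the big-endian default is a choice
made here for clarity rather than something the FAQ prescribes:

```scheme
(use-modules (rnrs bytevectors))

;; Inspect the first two bytes of BV, taken from a stream declared as
;; plain "UTF-16".  Returns two values: the encoding to use from here
;; on, and the number of BOM bytes to discard.
(define (utf16-byte-order bv)
  (if (< (bytevector-length bv) 2)
      (values "UTF-16BE" 0)
      (let ((b0 (bytevector-u8-ref bv 0))
            (b1 (bytevector-u8-ref bv 1)))
        (cond ((and (= b0 #xFE) (= b1 #xFF)) (values "UTF-16BE" 2))
              ((and (= b0 #xFF) (= b1 #xFE)) (values "UTF-16LE" 2))
              ;; No BOM: fall back to the big-endian default.
              (else (values "UTF-16BE" 0))))))
```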

Thanks,
- Mike




