From: Neil Jerram <neil@ossau.uklinux.net>
To: Guile Development <guile-devel@gnu.org>
Subject: UTF-16 and (ice-9 rdelim)
Date: Sun, 17 Jan 2010 22:49:17 +0000
Message-ID: <871vho9wk2.fsf@ossau.uklinux.net>
I have a program that processes a UTF-16 input file, using
`with-input-from-file', `set-port-encoding!' and `read-line' in a pattern
like this:
  (use-modules (ice-9 rdelim))

  (with-input-from-file "rdelim-utf16.txt"
    (lambda ()
      (set-port-encoding! (current-input-port) "UTF16LE")
      ;; let* rather than let, so that the two read-line calls are
      ;; guaranteed to run in this order.
      (let* ((first-line (read-line))
             (second-line (read-line)))
        ...)))
A sample UTF-16 input file is attached.
This hits a couple of problems.

1. It seems that most (all?) UTF-16 files begin with a byte order mark
   (BOM), \ufeff, which readers are conventionally supposed to discard,
   but Guile doesn't.  So first-line becomes "\ufeffhello".

2. The internals of (read-line) just search for a '\n' byte to determine
   the end of the first line, which means they're assuming that

   - '\n' never occurs as part of some other multibyte sequence

   - when '\n' occurs as part of the newline sequence, it occupies a
     single byte.

   This causes the second line to be read wrong, because newline in
   UTF-16LE is actually 2 bytes, \n \0, and the first (read-line) leaves
   the \0 byte unconsumed.
I think the fixes for these are roughly as follows.
For 1:

- Add a flag to the representation of a file port to say whether we're
  still at the start of the file.  This flag starts off true, and
  becomes false once we've read enough bytes to get past a possible BOM.

- Define a static map from encodings to possible BOMs.

- When reading bytes, and the flag is true, and the port has an
  encoding, and that encoding has a possible BOM, check for and consume
  the BOM.

Or is it too magic for the port to do this automatically?
Alternatively, we could provide something like
`read-line-discarding-bom', and it would be up to the application to
know when to use this instead of `read-line'.
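
To make the flag-plus-map idea concrete, here is a minimal standalone
sketch in C.  It is not libguile code: `struct sketch_port',
`bom_table', `bom_length' and `consume_bom' are all hypothetical names,
the encoding names are assumed to be already canonicalized, and the
caller is assumed to pass a buffer that holds at least the first four
bytes of the file (or the whole file, if shorter).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical map entry: an encoding name and the BOM bytes it may use. */
struct bom_entry {
    const char *encoding;
    unsigned char bom[4];
    size_t bom_len;
};

static const struct bom_entry bom_table[] = {
    { "UTF8",    { 0xEF, 0xBB, 0xBF },       3 },
    { "UTF16LE", { 0xFF, 0xFE },             2 },
    { "UTF16BE", { 0xFE, 0xFF },             2 },
    { "UTF32LE", { 0xFF, 0xFE, 0x00, 0x00 }, 4 },
    { "UTF32BE", { 0x00, 0x00, 0xFE, 0xFF }, 4 },
};

/* Hypothetical stand-in for the relevant part of a file port. */
struct sketch_port {
    const char *encoding;   /* NULL means "no encoding set" */
    int at_start;           /* true until the first read has been checked */
};

/* Return the length of the BOM at the start of buf for the given
   encoding, or 0 if the encoding has no entry or no BOM is present. */
static size_t bom_length(const char *encoding,
                         const unsigned char *buf, size_t len)
{
    size_t i;
    for (i = 0; i < sizeof bom_table / sizeof bom_table[0]; i++) {
        const struct bom_entry *e = &bom_table[i];
        if (strcmp(e->encoding, encoding) == 0) {
            if (len >= e->bom_len && memcmp(buf, e->bom, e->bom_len) == 0)
                return e->bom_len;
            return 0;
        }
    }
    return 0;
}

/* Return how many leading bytes of buf the reader should skip.  Only
   the first call on a port can return non-zero; the flag is cleared
   whether or not a BOM was found. */
static size_t consume_bom(struct sketch_port *port,
                          const unsigned char *buf, size_t len)
{
    size_t skip = 0;
    if (port->at_start) {
        if (port->encoding != NULL)
            skip = bom_length(port->encoding, buf, len);
        port->at_start = 0;
    }
    return skip;
}
```

A real port would have to keep the flag set until enough bytes had
arrived to decide, rather than clearing it on the first read as this
sketch does.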
For 2:

- In scm_do_read_line(), keep the current (fast) code for the case where
  the port has no encoding.

- When the port has an encoding, use a modified implementation that
  copies raw bytes into an intermediate buffer, calls
  u32_conv_from_encoding to convert those to u32*, and uses u32_strchr
  to look for a newline.

Does that sound about right?  Are there any possible optimizations?
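
To illustrate why the byte-wise search goes wrong, and what an
encoding-aware search has to do instead, here is a small standalone
sketch in C.  It deliberately does not use u32_conv_from_encoding or
u32_strchr; to stay self-contained it hard-codes UTF-16LE and compares
16-bit code units directly, which is a simpler technique than the
conversion proposed above.  Both function names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Byte-wise search, as a single-byte reader would do it: return the
   offset just past the first 0x0A byte, or 0 if there is none. */
static size_t bytewise_line_end(const unsigned char *buf, size_t len)
{
    const unsigned char *nl = memchr(buf, '\n', len);
    return nl != NULL ? (size_t) (nl - buf) + 1 : 0;
}

/* Code-unit search for UTF-16LE: return the offset just past the first
   U+000A code unit, or 0 if no complete newline is in the buffer.
   Surrogate code units lie in 0xD800..0xDFFF, so they can never compare
   equal to 0x000A and need no special handling here. */
static size_t utf16le_line_end(const unsigned char *buf, size_t len)
{
    size_t i;
    for (i = 0; i + 1 < len; i += 2) {
        unsigned int unit = buf[i] | ((unsigned int) buf[i + 1] << 8);
        if (unit == 0x000A)
            return i + 2;
    }
    return 0;
}
```

For the bytes 68 00 69 00 0A 00 ("hi" plus newline), the byte-wise
search stops after 5 bytes and strands the trailing 00, whereas the
code-unit search consumes all 6.  For 0A 01 0A 00 (U+010A followed by a
newline), the byte-wise search matches the low byte of U+010A after 1
byte, while the code-unit search correctly stops after 4.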
For the static map, is there a canonical set of possible encoding
strings, or a way to get a single canonical ID for all the strings that
are allowed to mean the same encoding?  For UTF-16, for example, it
seems to me that many of the following encoding strings will work

  utf-16
  utf-16-le
  utf16le
  utf16-le
  utf-16le
  utf16
  + the same with different case

and we don't want a map entry for each one.

I suppose one pseudo-canonical method would be to upcase and remove all
punctuation.  Then we're only left with "UTF16" and "UTF16LE", which
makes sense.
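
The upcase-and-strip-punctuation idea could look roughly like this in
C.  `canonical_encoding_name' is a hypothetical helper, not an existing
Guile function, and it treats anything that is not an ASCII letter or
digit as punctuation to be dropped.

```c
#include <ctype.h>
#include <stddef.h>

/* Write an upper-cased copy of name, with every non-alphanumeric
   character removed, into out.  The result is truncated if it does not
   fit in out_size bytes, and is always NUL-terminated when
   out_size > 0. */
static void canonical_encoding_name(const char *name,
                                    char *out, size_t out_size)
{
    size_t n = 0;
    if (out_size == 0)
        return;
    for (; *name != '\0' && n + 1 < out_size; name++) {
        unsigned char c = (unsigned char) *name;
        if (isalnum(c))
            out[n++] = (char) toupper(c);
    }
    out[n] = '\0';
}
```

With this, "utf-16-le", "utf16le", "utf16-le" and "UTF-16LE" all map to
"UTF16LE", and "utf-16" and "utf16" both map to "UTF16".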
Regards,
Neil
[-- Attachment #2: rdelim-utf16.txt --]
[-- Type: text/plain, Size: 18 bytes --]
hello
hello again