From: Mark H Weaver <mhw@netris.org>
To: Andy Wingo <wingo@pobox.com>
Cc: "Ludovic Courtès" <ludo@gnu.org>, guile-devel@gnu.org
Subject: Re: byte-order marks
Date: Tue, 29 Jan 2013 14:09:16 -0500
Message-ID: <87mwvrkdr7.fsf@tines.lan>
In-Reply-To: <87txpzkjaf.fsf@tines.lan> (Mark H. Weaver's message of "Tue, 29 Jan 2013 12:09:44 -0500")

I wrote:

> Having slept on this, I think I agree that 'open-input-file' should
> auto-consume BOMs.

On the other hand, there's a nasty complication.  Of course
(open-input-file FILENAME) is just (open-file FILENAME "r"), so the
auto-consuming logic should be in 'open-file'.

So what should (open-file FILENAME "r+") do?  The problem is that we
don't know if the user will read or write first.  If they write first,
then they may reasonably assume that what they write will be put at the
very beginning of the file, no?
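
To make the "r+" complication concrete, here is a minimal sketch in
Python (not Guile, and not a proposed implementation).  The helper
'open_consuming_bom' is hypothetical: it stands for an 'open-file' that
eagerly skips a UTF-8 BOM at open time, and an in-memory buffer stands
in for a file opened for both reading and writing.

```python
import io

UTF8_BOM = b"\xef\xbb\xbf"

def open_consuming_bom(stream):
    """Hypothetical eager policy: skip a UTF-8 BOM as soon as the stream is opened."""
    if stream.read(len(UTF8_BOM)) != UTF8_BOM:
        stream.seek(0)  # no BOM present: rewind so no data is lost
    return stream

# In-memory stand-in for a file opened in "r+" mode that starts with a BOM.
f = open_consuming_bom(io.BytesIO(UTF8_BOM + b"hello"))

position_before_write = f.tell()  # already past the BOM
f.write(b"J")                     # the user's first operation is a write
contents = f.getvalue()

print(position_before_write, contents)
```

Under this eager policy the first write lands after the BOM rather than
at the very beginning of the file, which is exactly the surprise
described above.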

Also, Unicode 6.2 section 2.6 table 2-4 says that BOMs are only allowed
for the encoding schemes UTF-8, UTF-16, and UTF-32.  They are *not*
allowed for UTF-16BE, UTF-16LE, UTF-32BE, or UTF-32LE.
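
The distinction can be observed in any decoder that follows these
rules.  As an illustration only, the sketch below uses Python's
standard codecs (an assumption of convenience; it says nothing about
how Guile's ports behave) to decode the same bytes under "utf-16" and
under "utf-16-be".

```python
# U+FEFF followed by "a", serialized big-endian as 16-bit code units.
data = b"\xfe\xff\x00a"

# "utf-16": byte order is not specified by the label, so the leading
# U+FEFF is treated as a BOM and is not part of the decoded text.
as_utf16 = data.decode("utf-16")

# "utf-16-be": byte order is explicit, so the leading U+FEFF is kept
# as an ordinary character (zero width no-break space).
as_utf16be = data.decode("utf-16-be")

print(repr(as_utf16), repr(as_utf16be))
```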

Unicode 6.2 section 16.8 goes into more detail:

  For compatibility with versions of the Unicode Standard prior to
  Version 3.2, the code point U+FEFF has the word-joining semantics of
  zero width no-break space when it is not used as a BOM. [...]

  Where the byte order is explicitly specified, such as in UTF-16BE or
  UTF-16LE, then all U+FEFF characters -- even at the very beginning of
  the text -- are to be interpreted as zero width no-break spaces.
  Similarly, where Unicode text has known byte order, initial U+FEFF
  characters are not required, but for backward compatibility are to be
  interpreted as zero width no-break spaces. [...]

  Systems that use the byte order mark must recognize when an initial
  U+FEFF signals the byte order.  In those cases, it is not part of the
  textual content and should be removed before processing, because
  otherwise it may be mistaken for a legitimate zero width no-break
  space.  To represent an initial U+FEFF zero width no-break space in a
  UTF-16 file, use U+FEFF twice in a row.  The first one is a byte order
  mark; the second one is the initial zero width no-break space. [...]
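
The "U+FEFF twice in a row" rule from the last quoted paragraph can be
checked the same way.  This is again a small Python sketch using its
standard "utf-16" codec, offered as an illustration rather than as a
statement about any Guile API.

```python
# Two U+FEFF code units followed by "a", all big-endian.
data = b"\xfe\xff\xfe\xff\x00a"

# Only the first U+FEFF is consumed as the byte order mark; the second
# one survives as an initial zero width no-break space.
text = data.decode("utf-16")

# Encoding text that starts with U+FEFF writes a BOM in front of it,
# so the text round-trips unchanged.
encoded = "\ufeffa".encode("utf-16")
round_trip = encoded.decode("utf-16")

print(repr(text), len(encoded), repr(round_trip))
```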

This will require some more research and thought.

    Mark