unofficial mirror of guile-devel@gnu.org 
From: Mark H Weaver <mhw@netris.org>
To: Andy Wingo <wingo@pobox.com>
Cc: "Ludovic Courtès" <ludo@gnu.org>, guile-devel@gnu.org
Subject: Re: byte-order marks
Date: Tue, 29 Jan 2013 14:09:16 -0500
Message-ID: <87mwvrkdr7.fsf@tines.lan>
In-Reply-To: <87txpzkjaf.fsf@tines.lan> (Mark H. Weaver's message of "Tue, 29 Jan 2013 12:09:44 -0500")

I wrote:
> Having slept on this, I think I agree that 'open-input-file' should
> auto-consume BOMs.

On the other hand, there's a nasty complication.  Of course
(open-input-file FILENAME) is just (open-file FILENAME "r"), so the
auto-consuming logic should be in 'open-file'.

So what should (open-file FILENAME "r+") do?  The problem is that we
don't know if the user will read or write first.  If they write first,
then they may reasonably assume that what they write will be put at the
very beginning of the file, no?
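
To make the "r+" problem concrete, here is a minimal sketch in Python
rather than Guile.  It is only an illustration: 'open_read_write' and
its 'skip_bom' flag are hypothetical names, not part of any Guile or
Python API, and the file is handled as raw bytes with a UTF-8 BOM.

```python
import os
import tempfile

UTF8_BOM = b"\xef\xbb\xbf"


def open_read_write(path, skip_bom):
    """Hypothetical helper: open PATH for reading and writing (binary).

    If skip_bom is true and the file starts with a UTF-8 BOM, the file
    position is left just past the BOM; otherwise it is left at offset 0.
    """
    f = open(path, "r+b")
    if skip_bom and f.read(len(UTF8_BOM)) == UTF8_BOM:
        return f
    f.seek(0)
    return f


def demo(skip_bom):
    """Write b"HEL" as the first operation and return the resulting bytes."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "sample.txt")
        with open(path, "wb") as f:
            f.write(UTF8_BOM + b"hello")
        with open_read_write(path, skip_bom) as f:
            f.write(b"HEL")  # the user writes before reading anything
        with open(path, "rb") as f:
            return f.read()


print(demo(skip_bom=False))  # the write lands at offset 0 and clobbers the BOM
print(demo(skip_bom=True))   # the write lands after the BOM, not at offset 0
```

Neither outcome is obviously right: without auto-consuming, the first
write destroys the BOM; with it, the first write does not go where the
user expects.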

Also, Unicode 6.2 section 2.6 table 2-4 says that BOMs are only allowed
for the encoding schemes UTF-8, UTF-16, and UTF-32.  They are *not*
allowed for UTF-16BE, UTF-16LE, UTF-32BE, or UTF-32LE.
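
As a rough sketch of that rule (again in Python, not Guile, and with
the hypothetical names 'BOMS' and 'split_bom'), a BOM is looked for
only when the encoding scheme is one of the three that permit it:

```python
# BOM byte sequences for the three encoding schemes that permit a BOM.
# UTF-16BE, UTF-16LE, UTF-32BE and UTF-32LE are deliberately absent.
BOMS = {
    "UTF-8": (b"\xef\xbb\xbf",),
    "UTF-16": (b"\xfe\xff", b"\xff\xfe"),
    "UTF-32": (b"\x00\x00\xfe\xff", b"\xff\xfe\x00\x00"),
}


def split_bom(data, scheme):
    """Return (bom, rest) for DATA under the named encoding SCHEME.

    bom is None when the scheme does not permit a BOM or none is present;
    in that case rest is DATA unchanged, so a leading U+FEFF stays part
    of the text.
    """
    for bom in BOMS.get(scheme, ()):
        if data.startswith(bom):
            return bom, data[len(bom):]
    return None, data


data = b"\xfe\xff\x00A"  # U+FEFF followed by "A", big-endian 16-bit units
print(split_bom(data, "UTF-16"))    # BOM recognized and split off
print(split_bom(data, "UTF-16BE"))  # no BOM allowed: bytes left untouched
```

For UTF-16 and UTF-32 the returned BOM still matters, since it is what
tells the decoder which byte order to use for the rest.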

Unicode 6.2 section 16.8 goes into more detail:

   For compatibility with versions of the Unicode Standard prior to
   Version 3.2, the code point U+FEFF has the word-joining semantics of
   zero width no-break space when it is not used as a BOM.  [...]

   Where the byte order is explicitly specified, such as in UTF-16BE or
   UTF-16LE, then all U+FEFF characters -- even at the very beginning of
   the text -- are to be interpreted as zero width no-break spaces.
   Similarly, where Unicode text has known byte order, initial U+FEFF
   characters are not required, but for backward compatibility are to be
   interpreted as zero width no-break spaces.  [...]

   Systems that use the byte order mark must recognize when an initial
   U+FEFF signals the byte order. In those cases, it is not part of the
   textual content and should be removed before processing, because
   otherwise it may be mistaken for a legitimate zero width no-break
   space.  To represent an initial U+FEFF zero width no-break space in a
   UTF-16 file, use U+FEFF twice in a row. The first one is a byte order
   mark; the second one is the initial zero width no-break space.  [...]
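
Python's built-in codecs happen to follow these rules for UTF-16, so
they make a convenient illustration (this says nothing about what
Guile does or should do; 'ZWNBSP' and the sample strings are just
names chosen for the example):

```python
ZWNBSP = "\ufeff"  # U+FEFF, zero width no-break space when not a BOM

# "UTF-16" scheme: one initial U+FEFF is a BOM and is removed.
plain = b"\xff\xfe" + "hi".encode("utf-16-le")
assert plain.decode("utf-16") == "hi"

# U+FEFF twice in a row: the first is the BOM, the second is text.
doubled = b"\xff\xfe" + (ZWNBSP + "hi").encode("utf-16-le")
assert doubled.decode("utf-16") == ZWNBSP + "hi"

# Explicit byte order (UTF-16BE): even an initial U+FEFF is text.
explicit = b"\xfe\xff" + "hi".encode("utf-16-be")
assert explicit.decode("utf-16-be") == ZWNBSP + "hi"
```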

This will require some more research and thought.

    Mark



