From: "Stephen J. Turnbull" <turnbull@sk.tsukuba.ac.jp>
To: tomas@tuxteam.de
Cc: emacs-pretest-bug@gnu.org,
	Patrick Drechsler <patrick@pdrechsler.de>,
	emacs-devel@gnu.org
Subject: Re: 23.0.60; [nxml] BOM and utf-8
Date: Fri, 23 May 2008 02:34:46 +0900
Message-ID: <87mymixmx5.fsf@uwakimon.sk.tsukuba.ac.jp>
In-Reply-To: <20080522041745.GA29437@tomas>

tomas@tuxteam.de writes:

 > > > ,----[ http://www.w3.org/TR/2006/REC-xml-20060816/#charencoding ]
 > > > | Entities encoded in UTF-16 MUST and entities encoded in UTF-8 MAY
 > > > | begin with the Byte Order Mark [...]
 > > > |        [...]  XML processors MUST be able to use this character to
 > > > | differentiate between UTF-8 and UTF-16 encoded documents.
 > > > `----
 > 
 > ...and how are the XML processors supposed to achieve that? Is there a
 > second variant of BOM, indicating UTF-8?

Well, note that the BOM is three octets in UTF-8 but only two in
UTF-16.  Dead giveaway, there.
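
As a rough illustration of that "dead giveaway", here is a minimal
Python sketch that inspects only the leading octets of a byte string.
The function name sniff_bom is hypothetical, and the sketch assumes the
input is either UTF-8 or UTF-16; it does not try to handle UTF-32,
whose little-endian BOM also starts with FF FE.

```python
import codecs


def sniff_bom(data: bytes):
    """Return (encoding, bom_length) for a leading BOM, or (None, 0).

    Hypothetical helper: only UTF-8 and UTF-16 are considered.
    """
    if data.startswith(codecs.BOM_UTF8):      # EF BB BF, three octets
        return "utf-8", 3
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF, two octets
        return "utf-16-be", 2
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE, two octets
        return "utf-16-le", 2
    return None, 0                            # no BOM found
```

A caller would skip bom_length octets and decode the remainder with the
returned encoding, falling back to some other rule (for XML, the
encoding declaration) when no BOM is found.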

 > > > ,----[ http://www.w3.org/TR/2006/REC-xml-20060816/#sec-guessing-with-ext-info ]
 > > > | If an XML entity is in a file, the Byte-Order Mark and encoding
 > > > | declaration are used (if present) to determine the character encoding.
 > > > `----
 > 
 > ...or is rather the absence of a BOM the indicator for UTF-8?

Absence of a BOM is *an* indicator for UTF-8, as is presence of the
BOM encoded in UTF-8 as the first 3 octets of a file.

 > Am I completely whacko, or are they?

Neither.  You live in a relatively sane world, they live in a world
which contains the sovereign nations of Japan and Microsoft.

 > But that would be "in the middle of a file", not at the beginning, as
 > our case is.
 > 
 > I'd appreciate any insights.

If there is a higher level protocol which tells you what to do about
the BOM, obey it.  This is the case for XML files.

Otherwise, if U+FEFF occurs at the beginning of a file which otherwise
seems to be valid Unicode, ignore it and start processing with the next
character.  "Valid" here means: in the two-octet and four-octet
encodings, the file contains no instances of 0xFFFF and only one
endianness of 0xFEFF; in UTF-8, it doesn't violate the constraints of
UTF-8 and doesn't contain any sequences that decode to U+FFFF or U+FFFE.
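
For the UTF-8 case, the rule above might be sketched in Python as
follows.  The function name strip_leading_bom_utf8 is hypothetical.
Returning the text unchanged when U+FFFF or U+FFFE is present is an
assumption of this sketch; the rule itself only says what to do when
the file looks valid.

```python
def strip_leading_bom_utf8(data: bytes) -> str:
    """Decode UTF-8 and drop a leading U+FEFF if the text looks valid.

    Hypothetical helper.  Malformed UTF-8 raises UnicodeDecodeError.
    """
    # The plain "utf-8" codec is strict and keeps a leading U+FEFF
    # (unlike "utf-8-sig", which would strip it silently).
    text = data.decode("utf-8")
    # Assumption: if a noncharacter is present, leave the text alone.
    if "\uffff" in text or "\ufffe" in text:
        return text
    if text.startswith("\ufeff"):
        return text[1:]  # ignore the BOM, start with the next character
    return text
```

A U+FEFF anywhere other than the first position is left in place, so
only the signature at the start of the file is affected.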

The next question is, where is this "XML processing" done?  As David
Kastrup points out, Emacs must be able to pass the BOM through to the
buffer, and he may be correct that that is the best default behavior.
I don't see any way for Emacs to determine, within the coding system,
when pass-through is appropriate and when it is not; the coding system
is normally invoked at a level where Emacs cannot know that the text
will be processed by nxml.

Therefore I think that for the purposes of XML conformance, nxml, not
Emacs, must be considered the XML processor, and nxml's failure to
recognize the BOM and ignore it (for the purpose of checking validity)
is a bug.

However, I'd be careful.  Maybe somebody should ask James Clark why he
did things this way.  He may have had an excellent reason for
insisting that Emacs strip BOMs before passing the file on to nxml.

Or maybe he just never saw UTF-8 signatures, or maybe he never edits
binary files using text encodings, and didn't consider use cases other
than editing text files important enough to provide for the BOM in
nxml.
