From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: "Stephen J. Turnbull" <turnbull@sk.tsukuba.ac.jp>
Newsgroups: gmane.emacs.devel,gmane.emacs.pretest.bugs
Subject: Re: 23.0.60; [nxml] BOM and utf-8
Date: Fri, 23 May 2008 02:34:46 +0900
Message-ID: <87mymixmx5.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <87od75kt78.fsf@pdrechsler.de> <87d4nk8y3q.fsf@everybody.org>
	<87r6bvs3jj.fsf@pdrechsler.de> <87mymjs2qw.fsf@pdrechsler.de>
	<20080522041745.GA29437@tomas>
NNTP-Posting-Host: lo.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Trace: ger.gmane.org 1211477053 363 80.91.229.12 (22 May 2008 17:24:13 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Thu, 22 May 2008 17:24:13 +0000 (UTC)
Cc: emacs-pretest-bug@gnu.org, Patrick Drechsler <patrick@pdrechsler.de>,
	emacs-devel@gnu.org
To: tomas@tuxteam.de
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Thu May 22 19:24:50 2008
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Envelope-to: ged-emacs-devel@m.gmane.org
Original-Received: from lists.gnu.org ([199.232.76.165])
	by lo.gmane.org with esmtp (Exim 4.50)
	id 1JzEWV-0000YU-S0
	for ged-emacs-devel@m.gmane.org; Thu, 22 May 2008 19:24:16 +0200
Original-Received: from localhost ([127.0.0.1]:49794 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43)
	id 1JzEVl-0003vm-BH
	for ged-emacs-devel@m.gmane.org; Thu, 22 May 2008 13:23:29 -0400
Original-Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43)
	id 1JzEVh-0003vR-2C
	for emacs-devel@gnu.org; Thu, 22 May 2008 13:23:25 -0400
Original-Received: from exim by lists.gnu.org with spam-scanned (Exim 4.43)
	id 1JzEVe-0003uv-ES
	for emacs-devel@gnu.org; Thu, 22 May 2008 13:23:24 -0400
Original-Received: from [199.232.76.173] (port=45182 helo=monty-python.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1JzEVe-0003us-9A
	for emacs-devel@gnu.org; Thu, 22 May 2008 13:23:22 -0400
Original-Received: from mtps02.sk.tsukuba.ac.jp ([130.158.97.224]:41603)
	by monty-python.gnu.org with esmtp (Exim 4.60)
	(envelope-from <stephen@xemacs.org>)
	id 1JzEVW-0006dV-N0; Thu, 22 May 2008 13:23:15 -0400
Original-Received: from uwakimon.sk.tsukuba.ac.jp (uwakimon.sk.tsukuba.ac.jp
	[130.158.99.156])
	by mtps02.sk.tsukuba.ac.jp (Postfix) with ESMTP id 2FBB27FFC;
	Fri, 23 May 2008 02:23:00 +0900 (JST)
Original-Received: by uwakimon.sk.tsukuba.ac.jp (Postfix, from userid 1000)
	id 44AD31A25C3; Fri, 23 May 2008 02:34:46 +0900 (JST)
In-Reply-To: <20080522041745.GA29437@tomas>
X-Mailer: VM ?bug? under XEmacs 21.5.21 (x86_64-unknown-linux)
X-detected-kernel: by monty-python.gnu.org: Linux 2.6, seldom 2.4 (older, 4)
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <http://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/pipermail/emacs-devel>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <http://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.emacs.devel:97535 gmane.emacs.pretest.bugs:22420
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/97535>

tomas@tuxteam.de writes:

 > > > ,----[ http://www.w3.org/TR/2006/REC-xml-20060816/#charencoding ]
 > > > | Entities encoded in UTF-16 MUST and entities encoded in UTF-8 MAY
 > > > | begin with the Byte Order Mark [...]
 > > > |        [...]  XML processors MUST be able to use this character to
 > > > | differentiate between UTF-8 and UTF-16 encoded documents.
 > > > `----
 > 
 > ...and how are the XML processors supposed to achieve that? Is there a
 > second variant of BOM, indicating UTF-8?

Well, note that the BOM is three octets in UTF-8 but only two in
UTF-16.  Dead giveaway, there.

 > > > ,----[ http://www.w3.org/TR/2006/REC-xml-20060816/#sec-guessing-with-ext-info ]
 > > > | If an XML entity is in a file, the Byte-Order Mark and encoding
 > > > | declaration are used (if present) to determine the character encoding.
 > > > `----
 > 
 > ...or is rather the absence of a BOM the indicator for UTF-8?

Absence of a BOM is *an* indicator for UTF-8, as is presence of the
BOM encoded in UTF-8 as the first 3 octets of a file.
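
To make the octet counts concrete, here is a minimal sketch in Python
(purely illustrative; it is not what Emacs or nxml does, and the name
`sniff_bom` is made up for this example).  It looks only at the leading
octets, and treats a missing BOM as a hint for UTF-8, not as proof.
UTF-32 signatures and the XML encoding declaration are deliberately
left out.

```python
import codecs

def sniff_bom(data: bytes):
    """Return (encoding_guess, bom_length) from the leading octets only.

    Hypothetical helper: it ignores UTF-32 and any encoding declaration.
    """
    if data.startswith(codecs.BOM_UTF8):      # EF BB BF: three octets
        return "utf-8", 3
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF: two octets
        return "utf-16-be", 2
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE: two octets
        return "utf-16-le", 2
    # No BOM at all: an indicator for UTF-8, nothing stronger.
    return "utf-8", 0
```

A caller would skip `bom_length` octets and decode the rest with the
guessed encoding, unless a higher-level protocol says otherwise.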

 > Am I completely whacko, or are they?

Neither.  You live in a relatively sane world, they live in a world
which contains the sovereign nations of Japan and Microsoft.

 > But that would be "in the middle of a file", not at the beginning, as
 > our case is.
 > 
 > I'd appreciate any insights.

If there is a higher level protocol which tells you what to do about
the BOM, obey it.  This is the case for XML files.

Otherwise, if U+FEFF occurs at the beginning of a file which otherwise
seems to be valid Unicode, ignore it and start processing with the
next character.  "Valid" here means: in the two-octet and four-octet
encodings, the file contains no instances of 0xFFFF and only one
endianness of 0xFEFF; in UTF-8, it doesn't violate the constraints of
UTF-8 and doesn't contain any sequences that decode to U+FFFF or
U+FFFE.
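
A rough sketch of that rule in Python, under stated assumptions: the
function names are hypothetical, the UTF-16 check covers only the
two-octet case, and the caller is assumed to know already which family
of encoding it is dealing with.  Python's strict UTF-8 decoder rejects
malformed sequences but accepts U+FFFE and U+FFFF, so those are
checked by hand.

```python
import struct

def skip_leading_bom_utf8(data: bytes) -> str:
    """Decode UTF-8 and drop a leading U+FEFF if the text looks sane."""
    text = data.decode("utf-8")  # strict: raises on malformed UTF-8
    if "\ufffe" in text or "\uffff" in text:
        raise ValueError("contains U+FFFE or U+FFFF")
    return text[1:] if text.startswith("\ufeff") else text

def looks_like_utf16(data: bytes) -> bool:
    """Two-octet check: no 0xFFFF, and only one endianness of 0xFEFF."""
    if len(data) % 2:
        return False
    # Read every code unit as big-endian; 0xFEFF in the other byte
    # order then shows up as 0xFFFE, and 0xFFFF looks the same in both.
    units = struct.unpack(">%dH" % (len(data) // 2), data)
    if 0xFFFF in units:
        return False
    return not (0xFEFF in units and 0xFFFE in units)
```

Note that only the first U+FEFF is dropped; any later occurrence is
left in the text for the application to deal with.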

The next question is, where is this "XML processing" done?  As David
Kastrup points out, Emacs must be able to pass the BOM through to the
buffer, and he may be correct that that is the best default behavior.
I don't see any way for the coding system to determine when pass
through is appropriate and when it is not; the coding system is
normally invoked at a level where Emacs cannot know that the text
will be processed by nxml.

Therefore I think that for the purposes of XML conformance, nxml, not
Emacs, must be considered the XML processor, and nxml's failure to
recognize the BOM and ignore it (for the purpose of checking validity)
is a bug.

However, I'd be careful.  Maybe somebody should ask James Clark why he
did things this way.  He may have had an excellent reason for
insisting that Emacs strip BOMs before passing the file on to nxml.

Or maybe he just never saw UTF-8 signatures, or maybe he never edits
binary files using text encodings, and didn't consider use cases other
than editing text files important enough to provide for the BOM in
nxml.