From: "Stephen J. Turnbull"
Newsgroups: gmane.emacs.devel,gmane.emacs.pretest.bugs
Subject: Re: 23.0.60; [nxml] BOM and utf-8
Date: Fri, 23 May 2008 02:34:46 +0900
Message-ID: <87mymixmx5.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <87od75kt78.fsf@pdrechsler.de> <87d4nk8y3q.fsf@everybody.org>
 <87r6bvs3jj.fsf@pdrechsler.de> <87mymjs2qw.fsf@pdrechsler.de>
 <20080522041745.GA29437@tomas>
In-Reply-To: <20080522041745.GA29437@tomas>
To: tomas@tuxteam.de
Cc: emacs-pretest-bug@gnu.org, Patrick Drechsler, emacs-devel@gnu.org

tomas@tuxteam.de writes:

> > > ,----[ http://www.w3.org/TR/2006/REC-xml-20060816/#charencoding ]
> > > | Entities encoded in UTF-16 MUST and entities encoded in UTF-8 MAY
> > > | begin with the Byte Order Mark [...]
> > > | [...] XML processors MUST be able to use this character to
> > > | differentiate between UTF-8 and UTF-16 encoded documents.
> > > `----
>
> ...and how are the XML processors supposed to achieve that? Is there a
> second variant of BOM, indicating UTF-8?

Well, note that the BOM is three octets in UTF-8 but only two in
UTF-16. Dead giveaway, there.

> > > ,----[ http://www.w3.org/TR/2006/REC-xml-20060816/#sec-guessing-with-ext-info ]
> > > | If an XML entity is in a file, the Byte-Order Mark and encoding
> > > | declaration are used (if present) to determine the character encoding.
> > > `----
>
> ...or is rather the absence of a BOM the indicator for UTF-8?

Absence of a BOM is *an* indicator for UTF-8, as is presence of the
BOM encoded in UTF-8 as the first 3 octets of a file.
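
To make the "dead giveaway" concrete, here is a minimal sketch in
Python of telling the two apart from the leading octets alone. The
function name sniff_bom is made up for illustration, and the sketch
deliberately ignores UTF-32 (whose little-endian BOM also starts with
FF FE) and the XML encoding declaration, both of which a real XML
processor would have to consider.

```python
import codecs
from typing import Optional


def sniff_bom(data: bytes) -> Optional[str]:
    """Guess the encoding from a leading BOM, or return None if there is none.

    Hypothetical helper: only UTF-8 and UTF-16 are handled, UTF-32 is ignored.
    """
    # U+FEFF encoded in UTF-8 is the three octets EF BB BF.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    # U+FEFF encoded in UTF-16 is two octets; their order gives the endianness.
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "utf-16-le"
    # No BOM at all: for XML this is itself a hint that the entity is UTF-8.
    return None
```

A result of None does not prove anything by itself; it only means the
decision has to fall back on the encoding declaration or on external
information.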

> Am I completely whacko, or are they?

Neither. You live in a relatively sane world, they live in a world
which contains the sovereign nations of Japan and Microsoft.

> But that would be "in the middle of a file", not at the beginning, as
> our case is.
>
> I'd appreciate any insights.

If there is a higher level protocol which tells you what to do about
the BOM, obey it. This is the case for XML files. Otherwise, if
U+FEFF occurs at the beginning of a file which otherwise seems to be
valid Unicode (in the two-octet and four-octet versions, that means
containing no instances of 0xFFFF and only one endianness of 0xFEFF;
in UTF-8, it means the data doesn't violate the constraints of UTF-8
and doesn't contain any sequences that decode to U+FFFF or U+FFFE),
ignore it and start processing with the next character.

The next question is, where is this "XML processing" done? As David
Kastrup points out, Emacs must be able to pass the BOM through to the
buffer, and he may be correct that that is the best default behavior.
I don't see any way for Emacs to determine in the coding system when
pass-through is appropriate and when it is not; the coding system is
normally invoked at a level where Emacs cannot know that the text will
be processed by nxml. Therefore I think that for the purposes of XML
conformance, nxml, not Emacs, must be considered the XML processor,
and nxml's failure to recognize the BOM and ignore it (for the purpose
of checking validity) is a bug.

However, I'd be careful. Maybe somebody should ask James Clark why he
did things this way. He may have had an excellent reason for
insisting that Emacs strip BOMs before passing the file on to nxml.
Or maybe he just never saw UTF-8 signatures, or maybe he never edits
binary files using text encodings, and didn't consider use cases other
than editing text files important enough to provide for the BOM in
nxml.
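
For the UTF-8 case, the "ignore it and start processing with the next
character" rule above might be sketched like this in Python. This is
only an illustration of the heuristic, not what nxml or Emacs actually
does; the name text_for_validation is invented, and treating a file
that contains U+FFFF or U+FFFE as "leave everything alone" is my
assumption about the sensible fallback.

```python
# Code points that should not appear in a file that "seems to be valid Unicode".
NONCHARACTERS = ("\uffff", "\ufffe")


def text_for_validation(data: bytes) -> str:
    """Decode UTF-8 data and drop a leading U+FEFF if the rest looks sane.

    Hypothetical helper. Raises UnicodeDecodeError if the octets violate
    the constraints of UTF-8 (the decoder is strict by default).
    """
    text = data.decode("utf-8")
    # Assumption: if the text contains U+FFFF or U+FFFE it does not look
    # like valid Unicode, so hand it over untouched rather than guess.
    if any(ch in text for ch in NONCHARACTERS):
        return text
    # Only a U+FEFF in the very first position counts as a signature;
    # one in the middle of the file is left where it is.
    if text.startswith("\ufeff"):
        return text[1:]
    return text
```

Note that the buffer contents are not changed here; only the text
handed to the validity check is, which is the distinction between
Emacs passing the BOM through and nxml ignoring it.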