From mboxrd@z Thu Jan 1 00:00:00 1970
From: tomas@tuxteam.de
Newsgroups: gmane.emacs.devel,gmane.emacs.pretest.bugs
Subject: Re: 23.0.60; [nxml] BOM and utf-8
Date: Thu, 22 May 2008 06:17:45 +0200
Message-ID: <20080522041745.GA29437@tomas>
References: <87od75kt78.fsf@pdrechsler.de> <87d4nk8y3q.fsf@everybody.org> <87r6bvs3jj.fsf@pdrechsler.de> <87mymjs2qw.fsf@pdrechsler.de>
In-Reply-To: <87mymjs2qw.fsf@pdrechsler.de>
To: Patrick Drechsler
Cc: emacs-pretest-bug@gnu.org, emacs-devel@gnu.org
Content-Type: text/plain; charset=utf-8; x-action=pgp-signed
Content-Disposition: inline
User-Agent: Mutt/1.5.15+20070412 (2007-04-11)
List-Id: "Emacs development discussions."
Xref: news.gmane.org gmane.emacs.devel:97498 gmane.emacs.pretest.bugs:22407

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Thu, May 22, 2008 at 12:37:11AM +0200, Patrick Drechsler wrote:
> Patrick Drechsler writes:

This would be rather a question to w3.org, but...

> > ,----[ http://www.w3.org/TR/2006/REC-xml-20060816/#charencoding ]
> > | Entities encoded in UTF-16 MUST and entities encoded in UTF-8 MAY
> > | begin with the Byte Order Mark [...]
> > | [...] XML processors MUST be able to use this character to
> > | differentiate between UTF-8 and UTF-16 encoded documents.
> > `----

...and how are the XML processors supposed to achieve that? Is there a
second variant of BOM, indicating UTF-8?

> > and
> >
> > ,----[ http://www.w3.org/TR/2006/REC-xml-20060816/#sec-guessing-with-ext-info ]
> > | If an XML entity is in a file, the Byte-Order Mark and encoding
> > | declaration are used (if present) to determine the character encoding.
> > `----

...or is rather the absence of a BOM the indicator for UTF-8? Am I
completely whacko, or are they?

Sorry. I am confused.

Ah, and BTW: interpreting the BOM as whitespace is not that far off --
as stated in :

| Q: What should I do with U+FEFF in the middle of a file?
|
| A: In the absence of a protocol supporting its use as a BOM and when not
| at the beginning of a text stream, U+FEFF should normally not occur. For
| backwards compatibility it should be treated as ZERO WIDTH NON-BREAKING
| SPACE (ZWNBSP), and is then part of the content of the file or string. [...]

But that would be "in the middle of a file", not at the beginning, as
our case is.

I'd appreciate any insights.

Thanks
- --
tomás
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFINPPpBcgs9XrR2kYRAutgAJ9BXb32mnDV53T3RTOBu4LGmOfHIgCfUxNG
EJYtPO908ac75bw1vERvRyQ=
=IQaH
-----END PGP SIGNATURE-----
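
On the question of how a processor can tell the two apart: there is only
one BOM character, U+FEFF, but its byte form depends on the encoding, so
the leading bytes of the file differ. A minimal Python sketch, assuming
only UTF-8 and UTF-16 matter (UTF-32 is ignored) and that a file without
a BOM falls back to a caller-supplied default. The names sniff_bom and
decode_without_bom are made up for illustration and are not taken from
nxml or any XML library:

```python
import codecs

# Byte sequences that the single character U+FEFF becomes in each encoding.
# UTF-32 is deliberately left out to keep this sketch short.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
)


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) when no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_without_bom(data: bytes, default: str = "utf-8") -> str:
    """Decode data, dropping a leading BOM instead of keeping it as content.

    The default encoding for BOM-less input is an assumption of this sketch;
    a real XML processor would also look at the encoding declaration.
    """
    encoding, skip = sniff_bom(data)
    return data[skip:].decode(encoding or default)
```

So there is no second variant of the BOM: the same code point simply
serializes to three bytes in UTF-8 and to two bytes, in either order, in
UTF-16, and the first few bytes are enough to distinguish them.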