From mboxrd@z Thu Jan 1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: "Stephen J. Turnbull"
Newsgroups: gmane.emacs.devel,gmane.emacs.pretest.bugs
Subject: Re: 23.0.60; [nxml] BOM and utf-8
Date: Sun, 18 May 2008 13:13:58 +0900
Message-ID: <87k5hsfdvd.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <87od75kt78.fsf@pdrechsler.de> <87mymofip6.fsf@uwakimon.sk.tsukuba.ac.jp> <878wy8ny36.fsf@catnip.gol.com>
NNTP-Posting-Host: lo.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Trace: ger.gmane.org 1211083374 13183 80.91.229.12 (18 May 2008 04:02:54 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Sun, 18 May 2008 04:02:54 +0000 (UTC)
Cc: emacs-pretest-bug@gnu.org, Patrick Drechsler
To: Miles Bader
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Sun May 18 06:03:30 2008
Return-path:
Envelope-to: ged-emacs-devel@m.gmane.org
Original-Received: from lists.gnu.org ([199.232.76.165]) by lo.gmane.org with esmtp (Exim 4.50) id 1Jxa7M-0001qp-Hs for ged-emacs-devel@m.gmane.org; Sun, 18 May 2008 06:03:28 +0200
Original-Received: from localhost ([127.0.0.1]:47162 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.43) id 1Jxa6c-0000nI-FI for ged-emacs-devel@m.gmane.org; Sun, 18 May 2008 00:02:42 -0400
Original-Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43) id 1Jxa6W-0000kI-G0 for emacs-devel@gnu.org; Sun, 18 May 2008 00:02:36 -0400
Original-Received: from exim by lists.gnu.org with spam-scanned (Exim 4.43) id 1Jxa6U-0000iP-QB for emacs-devel@gnu.org; Sun, 18 May 2008 00:02:36 -0400
Original-Received: from [199.232.76.173] (port=39963 helo=monty-python.gnu.org) by lists.gnu.org with esmtp (Exim 4.43) id 1Jxa6U-0000iE-Mk for emacs-devel@gnu.org; Sun, 18 May 2008 00:02:34 -0400
Original-Received: from fencepost.gnu.org ([140.186.70.10]:54696) by monty-python.gnu.org with esmtp (Exim 4.60) (envelope-from ) id 1Jxa6U-0007tC-Bg for emacs-devel@gnu.org; Sun, 18 May 2008 00:02:34 -0400
Original-Received: from mx10.gnu.org ([199.232.76.166]:36236) by fencepost.gnu.org with esmtp (Exim 4.67) (envelope-from ) id 1Jxa5L-0000mh-Q4 for emacs-pretest-bug@gnu.org; Sun, 18 May 2008 00:01:23 -0400
Original-Received: from Debian-exim by monty-python.gnu.org with spam-scanned (Exim 4.60) (envelope-from ) id 1Jxa6O-0007rk-Cc for emacs-pretest-bug@gnu.org; Sun, 18 May 2008 00:02:33 -0400
Original-Received: from mtps01.sk.tsukuba.ac.jp ([130.158.97.223]:56576) by monty-python.gnu.org with esmtp (Exim 4.60) (envelope-from ) id 1Jxa6N-0007r6-Lw; Sun, 18 May 2008 00:02:28 -0400
Original-Received: from uwakimon.sk.tsukuba.ac.jp (uwakimon.sk.tsukuba.ac.jp [130.158.99.156]) by mtps01.sk.tsukuba.ac.jp (Postfix) with ESMTP id 5BC791535AC; Sun, 18 May 2008 13:02:25 +0900 (JST)
Original-Received: by uwakimon.sk.tsukuba.ac.jp (Postfix, from userid 1000) id AB9581A25C3; Sun, 18 May 2008 13:13:58 +0900 (JST)
In-Reply-To: <878wy8ny36.fsf@catnip.gol.com>
X-Mailer: VM ?bug? under XEmacs 21.5.21 (x86_64-unknown-linux)
X-detected-kernel: by monty-python.gnu.org: Linux 2.6, seldom 2.4 (older, 4)
X-detected-kernel: by monty-python.gnu.org: Linux 2.6, seldom 2.4 (older, 4)
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: "Emacs development discussions."
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.emacs.devel:97348 gmane.emacs.pretest.bugs:22368
Archived-At:

Miles Bader writes:

> "Stephen J. Turnbull" writes:
> > > is the attached xml file (simple.xml) really invalid (as indicated by
> > > nxhtml) or is this a bug in nxhtml?
> >
> > Neither.  Emacs is (arguably) reading it incorrectly.
>
> By "arguably" I presume you're referring to the "Microsoft does
> <stupid thing>, therefore everybody who doesn't do <stupid thing>
> is incorrect" tactic.

No, by "arguably" I'm referring to the fact that although the optional
UTF-8 signature has been part of ISO/IEC 10646-1 and Unicode for a
decade or so, not to mention Internet STD 63 (aka RFC 3629), I fully
expected somebody like you to pop up and argue about it.

It is a bad standard (see STD 63) and possibly Microsoft-induced, but
it *is* the standard and is showing no signs of going away; see
Section 16.8 of *The Unicode Standard*, v5.0.  In fact, the trend is
the other way around: the ancient RFCs 2044 and 2279 don't mention it
either way, but STD 63 found it necessary to *add* it.

> In general, other apps that read such files are not expecting the
> BOM, and won't be able to deal with it.  So Emacs wouldn't be doing
> the user any favors by hiding the BOM from him.

So pop up a warning to the effect that the BOM was stripped per the
Unicode standard, and that if it needs to be preserved, set
UNICODE_ME_SOFTLY in the environment or bind `unicode-me-softly'
around the codec.

Alternatively, sabotage the Microsoft users by silently eating the BOM
on the way in, and writing the file in GNU substandard[1] format on
the way out.

Footnotes:
[1]  A substandard is a standard with stupid optional features
subtracted. :-)
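
For illustration only, the "strip the signature, warn, and allow an opt-out" suggestion can be sketched in Python rather than in Emacs's own coding-system machinery. Everything named here is an assumption made for the sketch: `decode_utf8` is a hypothetical helper, and `UNICODE_ME_SOFTLY` is the tongue-in-cheek name from the message, not a real Emacs or Python setting.

```python
import codecs
import os
import warnings

# Hypothetical opt-out switch, borrowed from the joke name in the message.
PRESERVE_ENV = "UNICODE_ME_SOFTLY"


def decode_utf8(data: bytes) -> str:
    """Decode UTF-8 bytes, stripping a leading signature unless opted out."""
    # codecs.BOM_UTF8 is the three-byte signature b"\xef\xbb\xbf".
    has_signature = data.startswith(codecs.BOM_UTF8)
    if has_signature and not os.environ.get(PRESERVE_ENV):
        warnings.warn(
            "UTF-8 signature (BOM) stripped; set %s to preserve it" % PRESERVE_ENV
        )
        data = data[len(codecs.BOM_UTF8):]
    # If the signature was kept, it decodes to U+FEFF at the start of the text.
    return data.decode("utf-8")
```

With the variable unset, `decode_utf8(b"\xef\xbb\xbf<a/>")` warns and returns the text without the signature; with it set, the leading U+FEFF is left in place for the caller to deal with.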