From mboxrd@z Thu Jan 1 00:00:00 1970
From: Eli Zaretskii
Newsgroups: gmane.emacs.bugs
Subject: bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
Date: Sun, 10 Dec 2017 21:17:00 +0200
Message-ID: <838teatmtv.fsf@gnu.org>
References: <555E2912.7060509@gmx.net> <83iobl67ao.fsf@gnu.org> <555E44EB.6070604@gmx.net> <83egm95boc.fsf@gnu.org> <555F2D3C.6090608@gmx.net> <8660oxdyxy.fsf@realize.ch> <457eu2h1sk.fsf@fencepost.gnu.org> <837eu2xmor.fsf@gnu.org>
Cc: a.s@realize.ch, 20623@debbugs.gnu.org, sledergerber@gmx.net
To: Stefan Monnier
In-reply-to: (message from Stefan Monnier on Mon, 04 Dec 2017 16:08:14 -0500)

> From: Stefan Monnier
> Cc: rgm@gnu.org, a.s@realize.ch, sledergerber@gmx.net, 20623@debbugs.gnu.org
> Date: Mon, 04 Dec 2017 16:08:14 -0500
>
> > Isn't it better to fix this in sgml-xml-auto-coding-function? That's
> > where the root cause is, AFAIU.
>
> I'd expect the same problem would affect all other uses.

Not sure what you meant by "all other uses". Could you please
elaborate?

> > And I don't understand the comment about latin-1-mac: I don't think we
> > have such problems in Emacs. The -with-signature variety is
> > different, because it is not about EOL format.
>
> You might be right, but I don't know where/how this is handled.

I would like to propose the following alternative patch, which accepts
utf-8-with-signature and utf-8-hfs as variants of utf-8 for the
purposes of encoding of XML files.

Comments?

Do we want a similar treatment for UTF-16? (That doesn't seem to be
required by the bug report, and UTF-16 in XML files is non-standard
anyway. But what about HTML?)

diff --git a/lisp/international/mule.el b/lisp/international/mule.el
index 857fa80..5ff1acf 100644
--- a/lisp/international/mule.el
+++ b/lisp/international/mule.el
@@ -2493,7 +2493,17 @@ sgml-xml-auto-coding-function
             (let* ((match (match-string 1))
                    (sym (intern (downcase match))))
               (if (coding-system-p sym)
-                  sym
+                  ;; If the encoding tag is UTF-8 and the buffer's
+                  ;; encoding is one of the variants of UTF-8, use the
+                  ;; buffer's encoding. This allows, e.g., saving an
+                  ;; XML file as UTF-8 with BOM when the tag says UTF-8.
+                  (if (and (coding-system-equal 'utf-8
+                                                (coding-system-type sym))
+                           (coding-system-equal sym
+                                                (coding-system-type
+                                                 buffer-file-coding-system)))
+                      buffer-file-coding-system
+                    sym)
                 (message "Warning: unknown coding system \"%s\"" match)
                 nil))
           ;; Files without an encoding tag should be UTF-8. But users
@@ -2506,7 +2516,8 @@ sgml-xml-auto-coding-function
                    (coding-system-base
                     (detect-coding-region (point-min) size t)))))
             ;; Pure ASCII always comes back as undecided.
-            (if (memq detected '(utf-8 undecided))
+            (if (memq detected
+                      '(utf-8 utf-8-with-signature utf-8-hfs undecided))
                 'utf-8
               (warn "File contents detected as %s.
   Consider adding an encoding attribute to the xml declaration,