From: Simon Ledergerber
Newsgroups: gmane.emacs.bugs
Subject: bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
Date: Fri, 22 May 2015 15:21:00 +0200
Message-ID: <555F2D3C.6090608@gmx.net>
References: <555E2912.7060509@gmx.net> <83iobl67ao.fsf@gnu.org> <555E44EB.6070604@gmx.net> <83egm95boc.fsf@gnu.org>
To: Eli Zaretskii
Cc: 20623@debbugs.gnu.org

Hello Eli

I have done some more research to answer your questions. You will find
the details of my statement at the end of this mail.

On 22.05.2015 09:11, Eli Zaretskii wrote:
> [Please don't remove the bug address from the CC list, so that this
> discussion is recorded in the bug data base.]
>
>> Date: Thu, 21 May 2015 22:49:47 +0200
>> From: Simon Ledergerber
>>
>> From the documentation I understand that utf-8 is without BOM and
>> utf-8-with-signature is with BOM. Maybe I am wrong and should rather
>> understand that utf-8 is auto-detect. But then there is something like
>> utf-8-without-signature missing to specify explicitly that no BOM is
>> desired.
>>
>> In my opinion, it is correct when Emacs prefers utf-8 over
>> utf-8-with-signature when it opens a file without BOM that can still be
>> recognized as UTF-8.
>>
>> However when a file is opened with a BOM already present, it should
>> stick to the utf-8-with-signature coding system, because the BOM "EF BB
>> BF" unambiguously marks the file as UTF-8. (For UTF-16 for example,
>> there is a different BOM byte pattern. There are other coding systems
>> which do not have a BOM at all.)
> What do you mean by "stick to"? When I try visiting an XML file that
> is encoded with BOM, Emacs decodes the file correctly, and the value
> of buffer-file-coding-system is utf-8-with-signature. Isn't that what
> you want? If that's what you want, but it doesn't happen for you,
> please try in "emacs -Q". It's possible that the default you set:
>
> (setq-default buffer-file-coding-system 'utf-8-dos)
>
> is the reason for what you see. (I don't understand why you need such
> a default, and it sounds like a bad idea to me.)

You're right.
When I open a file that was really saved with BOM, Emacs detects its
encoding correctly, i.e. utf-8-with-signature-dos. But when I change
the content and save with C-x C-s, the encoding changes to utf-8-dos and
the BOM gets lost, even when I start Emacs with -Q. This is the actual
bug.

>
>> By doing C-x RET f and then saving it with C-x C-s, I expect to be
>> able to change the coding system. For example, if I specify utf-8-dos,
>> the BOM should be removed, if one was present, and CR LF should be
>> inserted for EOL. On the other side, if I choose
>> utf-8-with-signature-unix, a BOM should be written and LF be taken for
>> EOL. (The conversion between DOS and Unix works, just the BOM is the
>> problem.)
>>
>> I have found a link, where this topic was already discussed, but it
>> didn't help me further:
>> http://superuser.com/questions/41254/make-emacs-not-remove-the-bom-from-xml-files
>>
>> In that post Vebjorn Ljosa asked exactly the question I have. Richard
>> Hoskins replies with the answer to change the coding system with C-x
>> RET r utf-8-with-signature. Unfortunately, it didn't work for me -
>> after doing a change in the file and saving, it got back to utf-8
>> automatically - that's why I have filed the bug.
> That's not how you force a file to be saved in a specific encoding.
> You should do this instead:
>
> C-x RET c utf-8-with-signature RET C-x C-s
>
> The "C-x RET c" prefix forces the next Emacs operation to use the
> specified encoding. In this case, Emacs will ask for confirmation,
> because the encoding you specified is different from what the XML
> comment says.
>

This is true and it worked for me. Please see below for further
explanations.

Summary:

- C-x RET c utf-8-with-signature RET C-x C-s is a good workaround,
  because it really forces the file to be written with BOM. To have an
  effect, however, the file must be dirty, i.e. there must be a pending
  change.
  But before the command completes in this case, the prompt "Selected
  encoding utf-8-with-signature-dos disagrees with utf-8-dos specified
  by file contents. Really save (else edit coding cookies and try
  again)? (yes or no)" appears. I think this is what you mean by your
  sentence: "In this case, Emacs will ask for confirmation, because the
  encoding you specified is different from what the XML comment says."

- But consider the following: The encoding in the XML declaration or in
  the HTML charset declaration just specifies UTF-8 (or another
  encoding). It doesn't say anything about the presence or absence of
  the BOM. Therefore an editor that detects and decides on a file's
  encoding should not rely on this information alone.

- When such a file, which was saved successfully with BOM, is closed and
  reopened, Emacs detects its encoding correctly, say
  utf-8-with-signature-dos.

- However, when I change the file content and save it again with just
  C-x C-s (without C-x RET c ... first!), the encoding changes back to
  utf-8-dos. Yes, even if I start emacs with -Q! (That's the point.)

- I do not fully understand the criteria by which Emacs chooses the file
  encoding when I do C-x C-s. But I was able to reproduce the behavior
  several times by applying the procedures given in the bug report, even
  with -Q. As already stated above, this could be because Emacs favors
  (and forces) utf-8 whenever it sees something like XML or HTML that
  might be UTF-8-encoded.

-> Conclusion: C-x RET c utf-8-with-signature RET C-x C-s is a good way
to force the file to be written as I want. But what I still do not
understand: when I open a file with BOM and Emacs recognizes that, why
does it silently change the encoding and drop the BOM when I save
normally with C-x C-s, without even giving me a notice or warning?
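
To check outside of Emacs whether a saved file still starts with the
BOM bytes EF BB BF, and to see that the encoding/charset declaration
reads the same either way, a small Python script along the following
lines could be used. This is only a sketch under assumptions: the names
has_utf8_bom and declared_encoding are made up for illustration, the
regular expression is deliberately loose, and the sample data is
invented rather than taken from the bug report.

```python
import codecs
import re

# UTF-8 byte order mark: the three bytes EF BB BF.
UTF8_BOM = codecs.BOM_UTF8

# Hypothetical, deliberately loose pattern for encoding="..." (XML
# declaration) or charset=... (HTML); not a real XML/HTML parser.
_DECL_RE = re.compile(
    rb"""(?:encoding|charset)\s*=\s*["']?([A-Za-z0-9._-]+)""",
    re.IGNORECASE,
)


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the byte string starts with EF BB BF."""
    return data.startswith(UTF8_BOM)


def declared_encoding(data: bytes):
    """Return the encoding named in the first 1024 bytes, lower-cased, or None."""
    match = _DECL_RE.search(data[:1024])
    return match.group(1).decode("ascii").lower() if match else None


if __name__ == "__main__":
    # Invented sample content with DOS line endings.
    body = b'<?xml version="1.0" encoding="utf-8"?>\r\n<root/>\r\n'
    for label, data in (("with BOM", UTF8_BOM + body), ("without BOM", body)):
        # The declaration is identical in both cases; only the leading
        # three bytes differ.
        print(label, declared_encoding(data), has_utf8_bom(data))
```

For a real file, the same functions can be applied to the result of
open(path, "rb").read() before and after saving from Emacs.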