* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
From: Simon Ledergerber @ 2015-05-21 18:50 UTC
To: 20623

Hi

When I was editing XHTML and HTML files, I wanted to make sure the BOM
was written out to the file in order to make it easier for the browser
to detect the UTF-8 encoding. Therefore I changed the coding system for
the file buffer to utf-8-with-signature-dos (since I am working on a
Windows System) before saving the file.

After some time I got surprised because the browser (IE11), didn't
report UTF-8 as the file's encoding. Having checked the hexdump of my
(X)HTML file, I saw the BOM was definitely missing.

Obviously, when a "UTF-8" string appears in the <meta charset="utf-8">
(even if commented out, see later below) or <?xml version="1.0"
encoding="utf-8"?> declaration, Emacs switches the file coding system to
utf-8, when it saves the file, even if utf-8-with-signature was
specified explicitly before. This appears to me as a bug, because there
is no way anymore to restore the BOM using Emacs.

I was not sure, if my bug is related to bug #8282, so I decided to
report it (again).

My Emacs version is: 24.5.1 (x86_64-unknown-cygwin) of 2015-04-10 on
Windows 8.1 x64. I am running Emacs in text-mode only inside a Cygwin
console.

This is my .emacs.d/init.el:

    (line-number-mode)
    (column-number-mode)
    (setq-default fill-column 80)
    (setq-default buffer-file-coding-system 'utf-8-dos)
    (setq-default indent-tabs-mode nil)

With XML the problem can be reproduced in the most basic way, as
detailed in the following steps:

- Create a new file with C-x C-f in the current directory. Name it
  test.txt for example.
- Switch to fundamental mode with M-x fundamental-mode.
- Type the text '<?xml version="1.0"' (without the surrounding single
  quotes).
- Switch the encoding system to include the BOM:
  C-x RET f utf-8-with-signature-dos.
- Verify the current encoding system with C-h Shift-c RET: Yes, the
  encoding system for the file buffer is as specified before.
- Type C-x k to kill the help buffer if necessary and save the file
  with C-x C-s.
- Check the file with a hex editor. Under the Cygwin Bash shell,
  'od -Ax -t xCaz test.txt' will also do it: The UTF-8 BOM 'EF BB BF'
  was written at the beginning of the file.
- Complete the rest of the XML declaration as follows:
  ' encoding="utf-8"?>'
- Now save the file and check again: The encoding system for the
  buffer has changed to utf-8-dos and the BOM has disappeared from the
  file!

Now the steps for HTML:

- Create a new file test1.txt in the current directory.
- Fill it with the following simple and yet incomplete HTML5 document:

    <!doctype html>
    <html>
    <head>
    <title>Test</title>
    </head>
    <body>
    </body>
    </html>

- Change the coding system to utf-8-with-signature-dos and save the
  file.
- Verify that the coding system for the buffer is correct and the BOM
  is really written: Yes, it is.
- Insert the following *comment* between <head> and <title>:
  <!-- <meta charset="utf-8"> -->
- Save the file and verify: The coding system has changed to utf-8-dos
  and the BOM has vanished, even if it is just a comment and has no
  effect!

Regards
Simon

P. S.
Information as reported by M-x report-emacs-bug:

In GNU Emacs 24.5.1 (x86_64-unknown-cygwin)
 of 2015-04-10 on desktop-new
Configured using:
 `configure --srcdir=/home/kbrown/src/cygemacs/emacs-24.5-1.x86_64/src/emacs-24.5
 --prefix=/usr --exec-prefix=/usr --localstatedir=/var
 --sysconfdir=/etc --docdir=/usr/share/doc/emacs
 --htmldir=/usr/share/doc/emacs/html -C --with-x=no 'CFLAGS=-ggdb -O2
 -pipe -Wimplicit-function-declaration
 -fdebug-prefix-map=/home/kbrown/src/cygemacs/emacs-24.5-1.x86_64/build=/usr/src/debug/emacs-24.5-1
 -fdebug-prefix-map=/home/kbrown/src/cygemacs/emacs-24.5-1.x86_64/src/emacs-24.5=/usr/src/debug/emacs-24.5-1'
 CPPFLAGS= LDFLAGS='

Important settings:
  value of $LANG: en_US.UTF-8
  locale-coding-system: utf-8-unix

Major mode: Help

Minor modes in effect:
  tooltip-mode: t
  electric-indent-mode: t
  menu-bar-mode: t
  file-name-shadow-mode: t
  global-font-lock-mode: t
  font-lock-mode: t
  auto-composition-mode: t
  auto-encryption-mode: t
  auto-compression-mode: t
  buffer-read-only: t
  column-number-mode: t
  line-number-mode: t
  transient-mark-mode: t

Recent messages:
Beginning of buffer [3 times]
Saving file /cygdrive/c/users/.../html_basics/basic.xhtml...
Wrote /cygdrive/c/users/.../html_basics/basic.xhtml
Mark set [2 times]
Auto-saving...done
Mark set [2 times]
Saving file /cygdrive/c/users/.../html_basics/basic.xhtml...
Wrote /cygdrive/c/users/.../html_basics/basic.xhtml
No docstring slot for help-mode-setup
No docstring slot for help-mode-finish

Load-path shadows:
None found.

Features:
(shadow sort gnus-util mail-extr emacsbug message format-spec rfc822 mml
mml-sec mm-decode mm-bodies mm-encode mail-parse rfc2231 mailabbrev
gmm-utils mailheader sendmail rfc2047 rfc2045 ietf-drums mm-util
help-fns mail-prsvr mail-utils misearch multi-isearch mule-diag
help-mode easymenu regexp-opt sgml-mode xterm time-date tooltip electric
uniquify ediff-hook vc-hooks lisp-float-type tabulated-list newcomment
lisp-mode prog-mode register page menu-bar rfn-eshadow timer select
mouse jit-lock font-lock syntax facemenu font-core frame cham georgian
utf-8-lang misc-lang vietnamese tibetan thai tai-viet lao korean
japanese hebrew greek romanian slovak czech european ethiopic indian
cyrillic chinese case-table epa-hook jka-cmpr-hook help simple abbrev
minibuffer nadvice loaddefs button faces cus-face macroexp files
text-properties overlay sha1 md5 base64 format env code-pages mule
custom widget hashtable-print-readable backquote make-network-process
dbusbind gfilenotify multi-tty emacs)

Memory information:
((conses 16 81797 4691)
 (symbols 48 17091 0)
 (miscs 40 73 387)
 (strings 32 11233 4887)
 (string-bytes 1 291872)
 (vectors 16 7587)
 (vector-slots 8 342125 27930)
 (floats 8 57 393)
 (intervals 56 834 26)
 (buffers 960 21))
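The hexdump check used in the reproduction steps ('od' showing 'EF BB BF') can also be done from inside Emacs. The following is a minimal sketch, not something taken from the report; the function name `my-file-has-utf8-bom-p` is hypothetical. It reads the first three bytes literally, so no decoding step can hide the signature.

```elisp
;; Hypothetical helper (not part of Emacs): report whether FILE starts
;; with the UTF-8 byte order mark EF BB BF.
(defun my-file-has-utf8-bom-p (file)
  "Return non-nil if FILE begins with the bytes EF BB BF."
  (with-temp-buffer
    ;; Use a unibyte buffer and a literal read so we see raw bytes.
    (set-buffer-multibyte nil)
    (insert-file-contents-literally file nil 0 3)
    ;; Compare as a list of integers to avoid string-encoding pitfalls.
    (equal (append (buffer-string) nil) '(#xEF #xBB #xBF))))
```

Evaluating `(my-file-has-utf8-bom-p "test.txt")` after each save in the steps above would show at which save the signature disappears.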
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
From: Eli Zaretskii @ 2015-05-21 19:48 UTC
To: Simon Ledergerber; +Cc: 20623

> Date: Thu, 21 May 2015 20:50:58 +0200
> From: Simon Ledergerber <sledergerber@gmx.net>
>
> When I was editing XHTML and HTML files, I wanted to make sure the BOM
> was written out to the file in order to make it easier for the browser
> to detect the UTF-8 encoding. Therefore I changed the coding system for
> the file buffer to utf-8-with-signature-dos (since I am working on a
> Windows System) before saving the file.
>
> After some time I got surprised because the browser (IE11), didn't
> report UTF-8 as the file's encoding. Having checked the hexdump of my
> (X)HTML file, I saw the BOM was definitely missing.
>
> Obviously, when a "UTF-8" string appears in the <meta charset="utf-8">
> (even if commented out, see later below) or <?xml version="1.0"
> encoding="utf-8"?> declaration, Emacs switches the file coding system to
> utf-8, when it saves the file, even if utf-8-with-signature was
> specified explicitly before. This appears to me as a bug, because there
> is no way anymore to restore the BOM using Emacs.

What would you expect Emacs to do instead? It just obeys the stated
encoding, which says nothing about the BOM. How can Emacs know when
to use utf-8 and when utf-8-with-signature?
[parent not found: <555E44EB.6070604@gmx.net>]
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
From: Eli Zaretskii @ 2015-05-22 7:11 UTC
To: Simon Ledergerber; +Cc: 20623

[Please don't remove the bug address from the CC list, so that this
discussion is recorded in the bug data base.]

> Date: Thu, 21 May 2015 22:49:47 +0200
> From: Simon Ledergerber <sledergerber@gmx.net>
>
> From the documentation I understand that utf-8 is without BOM and
> utf-8-with-signature is with BOM. Maybe I am wrong and should rather
> understand that utf-8 is auto-detect. But then there is something like
> utf-8-without-signature missing to specify explicitly that no BOM is
> desired.
>
> In my opinion, it is correct when Emacs prefers utf-8 over
> utf-8-with-signature when it opens a file without BOM that can still be
> recognized as UTF-8.
>
> However when a file is opened with a BOM already present, it should
> stick to the utf-8-with-signature coding system, because the BOM "EF BB
> BF" unambiguously marks the file as UTF-8. (For UTF-16 for example,
> there is a different BOM byte pattern. There are other coding systems
> which do not have a BOM at all.)

What do you mean by "stick to"? When I try visiting an XML file that
is encoded with BOM, Emacs decodes the file correctly, and the value
of buffer-file-coding-system is utf-8-with-signature. Isn't that what
you want? If that's what you want, but it doesn't happen for you,
please try in "emacs -Q". It's possible that the default you set:

    (setq-default buffer-file-coding-system 'utf-8-dos)

is the reason for what you see. (I don't understand why you need such
a default, and it sounds like a bad idea to me.)

> By doing C-x <RET> f and then saving it with C-x C-s, I expect to be
> able to change the coding system. For example, if I specify utf-8-dos,
> the BOM should be removed, if one was present, and CR LF should be
> inserted for EOL. On the other side, if I choose
> utf-8-with-signature-unix, a BOM should be written and LF be taken for
> EOL. (The conversion between DOS and Unix works, just the BOM is the
> problem.)
>
> I have found a link, where this topic was already discussed, but it
> didn't help me further:
> http://superuser.com/questions/41254/make-emacs-not-remove-the-bom-from-xml-files
>
> In that post Vebjorn Ljosa asked exactly the question I have. Richard
> Hoskins replies with the answer to change the coding system with C-x
> <RET> r utf-8-with-signature. Unfortunately, it didn't work for me -
> after doing a change in the file and saving, it got back to utf-8
> automatically - that's why I have filed the bug.

That's not how you force a file to be saved in a specific encoding.
You should do this instead:

    C-x RET c utf-8-with-signature RET C-x C-s

The "C-x RET c" prefix forces the next Emacs operation to use the
specified encoding. In this case, Emacs will ask for confirmation,
because the encoding you specified is different from what the XML
comment says.
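For readers who script this rather than type it: the "C-x RET c" prefix works by binding `coding-system-for-write` (and `coding-system-for-read`) around the next command. The sketch below shows only that mechanism, with `write-region` on a temporary file rather than an interactive save, so no confirmation prompt is involved. The helper names `my-first-bytes` and `my-write-xml-with-coding` are hypothetical.

```elisp
;; Hypothetical helpers (not part of Emacs) illustrating that an
;; explicit `coding-system-for-write' binding decides whether the BOM
;; is written, regardless of the encoding="utf-8" declaration.
(defun my-first-bytes (file n)
  "Return the first N bytes of FILE as a list of integers."
  (with-temp-buffer
    (set-buffer-multibyte nil)
    (insert-file-contents-literally file nil 0 n)
    (append (buffer-string) nil)))

(defun my-write-xml-with-coding (file coding)
  "Write a one-line XML declaration to FILE, encoded with CODING."
  (let ((coding-system-for-write coding))
    (write-region "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n" nil file)))

(let ((file (make-temp-file "bom-test" nil ".xml")))
  (my-write-xml-with-coding file 'utf-8-with-signature-unix)
  (message "with signature:    %S" (my-first-bytes file 3))
  (my-write-xml-with-coding file 'utf-8-unix)
  (message "without signature: %S" (my-first-bytes file 3))
  (delete-file file))
```

This does not reproduce the bug itself, which concerns a plain C-x C-s with no such binding in effect.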
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
From: Simon Ledergerber @ 2015-05-22 13:21 UTC
To: Eli Zaretskii; +Cc: 20623

Hello Eli

I have done some more research to answer your questions. You will find
the details of my statement at the end of this mail.

On 22.05.2015 09:11, Eli Zaretskii wrote:
> [Please don't remove the bug address from the CC list, so that this
> discussion is recorded in the bug data base.]
>
>> Date: Thu, 21 May 2015 22:49:47 +0200
>> From: Simon Ledergerber <sledergerber@gmx.net>
>>
>> From the documentation I understand that utf-8 is without BOM and
>> utf-8-with-signature is with BOM. Maybe I am wrong and should rather
>> understand that utf-8 is auto-detect. But then there is something like
>> utf-8-without-signature missing to specify explicitly that no BOM is
>> desired.
>>
>> In my opinion, it is correct when Emacs prefers utf-8 over
>> utf-8-with-signature when it opens a file without BOM that can still be
>> recognized as UTF-8.
>>
>> However when a file is opened with a BOM already present, it should
>> stick to the utf-8-with-signature coding system, because the BOM "EF BB
>> BF" unambiguously marks the file as UTF-8. (For UTF-16 for example,
>> there is a different BOM byte pattern. There are other coding systems
>> which do not have a BOM at all.)
> What do you mean by "stick to"? When I try visiting an XML file that
> is encoded with BOM, Emacs decodes the file correctly, and the value
> of buffer-file-coding-system is utf-8-with-signature. Isn't that what
> you want? If that's what you want, but it doesn't happen for you,
> please try in "emacs -Q". It's possible that the default you set:
>
> (setq-default buffer-file-coding-system 'utf-8-dos)
>
> is the reason for what you see. (I don't understand why you need such
> a default, and it sounds like a bad idea to me.)

You're right. When I open a file that was really saved with BOM, Emacs
detects its encoding correctly, i. e. utf-8-with-signature-dos. But
when I change the content and save with C-x C-s, the encoding changes
to utf-8-dos and the BOM gets lost. Even when I start Emacs with -Q.
This is the actual bug.

>
>> By doing C-x <RET> f and then saving it with C-x C-s, I expect to be
>> able to change the coding system. For example, if I specify utf-8-dos,
>> the BOM should be removed, if one was present, and CR LF should be
>> inserted for EOL. On the other side, if I choose
>> utf-8-with-signature-unix, a BOM should be written and LF be taken for
>> EOL. (The conversion between DOS and Unix works, just the BOM is the
>> problem.)
>>
>> I have found a link, where this topic was already discussed, but it
>> didn't help me further:
>> http://superuser.com/questions/41254/make-emacs-not-remove-the-bom-from-xml-files
>>
>> In that post Vebjorn Ljosa asked exactly the question I have. Richard
>> Hoskins replies with the answer to change the coding system with C-x
>> <RET> r utf-8-with-signature. Unfortunately, it didn't work for me -
>> after doing a change in the file and saving, it got back to utf-8
>> automatically - that's why I have filed the bug.
> That's not how you force a file to be saved in a specific encoding.
> You should do this instead:
>
> C-x RET c utf-8-with-signature RET C-x C-s
>
> The "C-x RET c" prefix forces the next Emacs operation to use the
> specified encoding. In this case, Emacs will ask for confirmation,
> because the encoding you specified is different from what the XML
> comment says.
>

This is true and it worked for me. Please see below for further
explanations.

Summary:

- C-x RET c utf-8-with-signature RET C-x C-s is a good workaround,
  because it really forces the file being written with BOM. In order
  to have an effect however, the file must be dirty, i. e. there must
  be a pending change. But before the command completes in this case,
  the prompt "Selected encoding utf-8-with-signature-dos disagrees with
  utf-8-dos specified by file contents. Really save (else edit coding
  cookies and try again)? (yes or no)" appears. I think this is what
  you mean with your sentence: "In this case, Emacs will ask for
  confirmation, because the encoding you specified is different from
  what the XML comment says."
- But consider the following: The encoding in the XML declaration or
  in the HTML <meta charset="utf-8"> just specifies UTF-8 (or another
  encoding). It doesn't say anything about the presence or absence of
  the BOM. Therefore an editor detecting and deciding about the file's
  encoding should not rely on this information only.
- When such a file, which was saved successfully with BOM, is closed
  and reopened again, Emacs detects its encoding correctly, say
  utf-8-with-signature-dos.
- However, when I change the file content and save it again just with
  C-x C-s (without C-x RET c ... first!), then it changes back to
  utf-8-dos. Yes, even if I start emacs with -Q! (That's the point.)
- I do not fully understand the criterion for and the magic behind how
  Emacs chooses the file encoding when I do C-x C-s. But I was able to
  reproduce it several times by applying the procedures given in the
  bug report, even when -Q is on. As we already have stated above,
  this could be because Emacs favors (and forces) utf-8 whenever it
  sees something like XML or HTML that might be UTF-8-encoded.

-> Conclusion: C-x RET c utf-8-with-signature RET C-x C-s is a good way
to force the file being written as I want. But what I still do not
understand: When I open a file with BOM and Emacs recognizes that, why
does it change the encoding silently to drop the BOM when I regularly
save with C-x C-s - and this even without giving me a notice or
warning?
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
From: Alain Schneble @ 2016-10-12 21:44 UTC
To: Simon Ledergerber; +Cc: Stefan Monnier, 20623

I'm joining this discussion and would like to report a recipe to
reproduce this issue on Windows:

- emacs -Q
- C-x C-f utf-8-bom-test.xml
- Enter the following text in the new buffer:

    <?xml version="1.0" encoding="utf-8"?>
    <root></root>

- C-x RET c utf-8-with-signature-dos C-x C-s yes RET
- C-x k RET
- C-x C-f utf-8-bom-test.xml
- M-: buffer-file-coding-system
  => utf-8-with-signature-dos
- Change buffer content, e.g. add some text to the root element:

    <?xml version="1.0" encoding="utf-8"?>
    <root>test</root>

- C-x C-s
- M-: buffer-file-coding-system
  => utf-8-dos (expected coding system: utf-8-with-signature-dos)

As it was already mentioned in this thread, just by visiting the file,
then changing and saving the buffer, the BOM gets lost.

This is due to select-safe-coding-system (called by
choose_write_coding_system) fully trusting the coding system identified
by find-auto-coding. So far so good. The latter eventually calls
auto-coding-functions which in turn calls the built-in
sgml-xml-auto-coding-function which I think should take into account
some context to enrich the derived coding system with a signature if
needed. Similar to what select-safe-coding-system does to enrich the
coding with the proper eol-type.

Does that make sense to you? If so, I'll try to come up with a patch
that enhances sgml-xml-auto-coding-function to take into account
buffer-file-coding-system (buffer + default value) in case it carries
the same text-conversion but different signature. The proposed "auto
coding" shall inherit the signature in this case.

Thanks for any help.
Alain
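The step in this call chain that derives a coding system from the XML declaration can be observed directly by calling `find-auto-coding` on a buffer. A minimal sketch follows; the wrapper name `my-auto-coding-for-text` and the file name "example.xml" are hypothetical, and the exact coding system returned depends on the Emacs version and on the buffer's `buffer-file-coding-system`, so no expected value is shown.

```elisp
;; Hypothetical wrapper (not part of Emacs): ask `find-auto-coding'
;; what it would propose for a buffer containing TEXT.
(defun my-auto-coding-for-text (text)
  "Return the (CODING . SOURCE) pair `find-auto-coding' gives for TEXT."
  (with-temp-buffer
    (insert text)
    ;; `find-auto-coding' inspects the SIZE characters following point.
    (goto-char (point-min))
    (find-auto-coding "example.xml" (buffer-size))))

;; The cdr of the result names the mechanism that made the decision;
;; for an XML declaration it should be `auto-coding-functions'.
(message "%S"
         (my-auto-coding-for-text
          "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n<root></root>\n"))
```

The result is a cons whose car is the proposed coding system; nothing in that value records whether the visited file originally carried a signature, which is the gap described above.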
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
From: Glenn Morris @ 2017-12-04 16:54 UTC
To: Alain Schneble; +Cc: Simon Ledergerber, Stefan Monnier, 20623

Now reported with "fix this or get removed from the distribution"
severity at <https://bugs.debian.org/883434>.
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
From: Stefan Monnier @ 2017-12-04 17:38 UTC
To: Glenn Morris; +Cc: Simon Ledergerber, Alain Schneble, 20623

> Now reported with "fix this or get removed from the distribution"
> severity at <https://bugs.debian.org/883434>.

I'm curious to see if the OP's "grave" severity settings will stick.
"Grave" is defined in https://www.debian.org/Bugs/Developer#severities
as:

    makes the package in question unusable or mostly so, or causes data
    loss, or introduces a security hole allowing access to the accounts
    of users who use the package.

The only part that could arguably apply is "causes data loss", but even
that is stretching the meaning of those words, I think.

This said, we should indeed fix this bug. Not sure how to Do It Right
but at least this specific problem should be fixable with a patch along
the lines of the one below (guaranteed 100% untested).


        Stefan


diff --git a/lisp/international/mule.el b/lisp/international/mule.el
index 019e65b2c6..5c0675aa2f 100644
--- a/lisp/international/mule.el
+++ b/lisp/international/mule.el
@@ -1885,6 +1885,12 @@ auto-coding-alist-lookup
         (setq alist (cdr alist))))
     coding-system))
 
+(defun mule--coding-system-compatible-p (cs new-cs)
+  "Return non-nil if CS is one of the coding-systems described by NEW-CS."
+  (let ((base (coding-system-base cs)))
+    (or (eq base new-cs)
+        (eq base (intern (concat new-cs "-with-signature"))))))
+
 (put 'enable-character-translation 'permanent-local t)
 (put 'enable-character-translation 'safe-local-variable 'booleanp)
 
@@ -2038,8 +2044,12 @@ find-auto-coding
                                (save-excursion
                                  (goto-char (point-min))
                                  (funcall (pop funcs) size)))))
-       (if coding-system
-           (cons coding-system 'auto-coding-functions)))))
+       (and coding-system
+            ;; Don't override utf-8-with-signature with utf-8
+            ;; or latin-1-mac with latin-1 (bug#20623).
+            (not (mule--coding-system-compatible-p
+                  buffer-file-coding-system coding-system))
+            (cons coding-system 'auto-coding-functions)))))
 
 (defun set-auto-coding (filename size)
   "Return coding system for a file FILENAME of which SIZE bytes follow point.
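As a standalone illustration of the compatibility test in this patch, here is a sketch that can be evaluated on its own. It is not Stefan's code: the name `my-coding-system-compatible-p` is hypothetical, and it builds the "-with-signature" name with `format`, on the assumption that NEW-CS arrives as a symbol (as the values returned by the auto-coding functions do), since `concat` does not accept a symbol argument.

```elisp
;; Hypothetical standalone variant of the helper in the patch above.
(defun my-coding-system-compatible-p (cs new-cs)
  "Return non-nil if CS is NEW-CS or its -with-signature variant.
CS and NEW-CS are coding-system symbols.  The EOL variant of CS is
ignored because the comparison is made on its base coding system."
  (let ((base (coding-system-base cs)))
    (or (eq base new-cs)
        ;; `format' with %s turns the symbol into its name first.
        (eq base (intern (format "%s-with-signature" new-cs))))))

;; Example: the buffer is utf-8-with-signature-dos, the tag says utf-8.
(my-coding-system-compatible-p 'utf-8-with-signature-dos 'utf-8)
```

With such a predicate, the caller can keep the buffer's current coding system whenever the declared encoding merely names the same family.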
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
From: Eli Zaretskii @ 2017-12-04 20:28 UTC
To: Stefan Monnier; +Cc: a.s, 20623, sledergerber

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: Alain Schneble <a.s@realize.ch>, Simon Ledergerber <sledergerber@gmx.net>, 20623@debbugs.gnu.org, Eli Zaretskii <eliz@gnu.org>
> Date: Mon, 04 Dec 2017 12:38:57 -0500
>
> This said, we should indeed fix this bug.

Agreed.

> Not sure how to Do It Right but least this specific problem should be
> fixable with a patch along the lines of the one below (guaranteed 100%
> untested).

Isn't it better to fix this in sgml-xml-auto-coding-function? That's
where the root cause is, AFAIU.

And I don't understand the comment about latin-1-mac: I don't think we
have such problems in Emacs. The -with-signature variety is
different, because it is not about EOL format.

Thanks.
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
From: Stefan Monnier @ 2017-12-04 21:08 UTC
To: Eli Zaretskii; +Cc: a.s, 20623, sledergerber

> Isn't it better to fix this in sgml-xml-auto-coding-function? That's
> where the root cause is, AFAIU.

I'd expect the same problem would affect all other uses.

> And I don't understand the comment about latin-1-mac: I don't think we
> have such problems in Emacs. The -with-signature variety is
> different, because it is not about EOL format.

You might be right, but I don't know where/how this is handled.


        Stefan
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
From: Eli Zaretskii @ 2017-12-10 19:17 UTC
To: Stefan Monnier; +Cc: a.s, 20623, sledergerber

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: rgm@gnu.org, a.s@realize.ch, sledergerber@gmx.net, 20623@debbugs.gnu.org
> Date: Mon, 04 Dec 2017 16:08:14 -0500
>
> > Isn't it better to fix this in sgml-xml-auto-coding-function? That's
> > where the root cause is, AFAIU.
>
> I'd expect the same problem would affect all other uses.

Not sure what you meant by "all other uses". Could you please
elaborate?

> > And I don't understand the comment about latin-1-mac: I don't think we
> > have such problems in Emacs. The -with-signature variety is
> > different, because it is not about EOL format.
>
> You might be right, but I don't know where/how this is handled.

I would like to propose the following alternative patch, which accepts
utf-8-with-signature and utf-8-hfs as variants of utf-8 for the
purposes of encoding of XML files. Comments? Do we want a similar
treatment for UTF-16? (That doesn't seem to be required by the bug
report, and UTF-16 in XML files is non-standard anyway. But what
about HTML?)

diff --git a/lisp/international/mule.el b/lisp/international/mule.el
index 857fa80..5ff1acf 100644
--- a/lisp/international/mule.el
+++ b/lisp/international/mule.el
@@ -2493,7 +2493,17 @@ sgml-xml-auto-coding-function
             (let* ((match (match-string 1))
                    (sym (intern (downcase match))))
               (if (coding-system-p sym)
-                  sym
+                  ;; If the encoding tag is UTF-8 and the buffer's
+                  ;; encoding is one of the variants of UTF-8, use the
+                  ;; buffer's encoding. This allows, e.g., saving an
+                  ;; XML file as UTF-8 with BOM when the tag says UTF-8.
+                  (if (and (coding-system-equal 'utf-8
+                                                (coding-system-type sym))
+                           (coding-system-equal sym
+                                                (coding-system-type
+                                                 buffer-file-coding-system)))
+                      buffer-file-coding-system
+                    sym)
                 (message "Warning: unknown coding system \"%s\"" match)
                 nil))
           ;; Files without an encoding tag should be UTF-8. But users
@@ -2506,7 +2516,8 @@ sgml-xml-auto-coding-function
                            (coding-system-base
                             (detect-coding-region (point-min) size t)))))
               ;; Pure ASCII always comes back as undecided.
-              (if (memq detected '(utf-8 undecided))
+              (if (memq detected
+                        '(utf-8 'utf-8-with-signature 'utf-8-hfs undecided))
                   'utf-8
                 (warn "File contents detected as %s.
   Consider adding an encoding attribute to the xml declaration,
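The patch relies on the plain and signature variants of UTF-8 sharing one coding type while having different base coding systems. The sketch below lets a reader check that relationship in their own Emacs; the function names `my-utf-8-family-p` and `my-show-utf-8-variants` are hypothetical. The last two forms are a side note on list quoting: inside an already quoted list, an inner quote is read as a `(quote ...)` form, so such an element does not match a bare symbol under `memq`. Whether the version that was eventually committed differs from the posted hunk is not shown in this thread.

```elisp
;; Hypothetical helpers (not part of Emacs).
(defun my-utf-8-family-p (coding-system)
  "Return non-nil if CODING-SYSTEM has coding type `utf-8'."
  (eq (coding-system-type coding-system) 'utf-8))

(defun my-show-utf-8-variants ()
  "Print the coding type and base of some UTF-8 variants."
  (dolist (cs '(utf-8 utf-8-dos utf-8-with-signature
                utf-8-with-signature-dos))
    (message "%-26s type=%s base=%s"
             cs (coding-system-type cs) (coding-system-base cs))))

(my-show-utf-8-variants)

;; Quoting side note: the first list contains the two-element list
;; (quote utf-8-with-signature), not the symbol itself.
(memq 'utf-8-with-signature '(utf-8 'utf-8-with-signature undecided))
(memq 'utf-8-with-signature '(utf-8 utf-8-with-signature undecided))
```

Only the second `memq` form finds the symbol; the first returns nil.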
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
From: Eli Zaretskii @ 2017-12-15 9:08 UTC
To: monnier; +Cc: sledergerber, a.s, 20623-done

> Date: Sun, 10 Dec 2017 21:17:00 +0200
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: a.s@realize.ch, 20623@debbugs.gnu.org, sledergerber@gmx.net
>
> I would like to propose the following alternative patch, which accepts
> utf-8-with-signature and utf-8-hfs as variants of utf-8 for the
> purposes of encoding of XML files. Comments? Do we want a similar
> treatment for UTF-16? (That doesn't seem to be required by the bug
> report, and UTF-16 in XML files is non-standard anyway. But what
> about HTML?)

No further comments, so I've pushed the change and I'm marking this
bug done.
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration lose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
From: Glenn Morris @ 2018-08-01 18:07 UTC
To: Eli Zaretskii; +Cc: sledergerber, a.s, Stefan Monnier, 20623

The HTML (not XML) case specified in the original report
("Now the steps for HTML" in https://debbugs.gnu.org/20623#5) and in
https://bugs.debian.org/883434 seems unfixed.
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration lose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
From: Eli Zaretskii @ 2018-08-01 18:41 UTC
To: Glenn Morris; +Cc: sledergerber, a.s, monnier, 20623

> From: Glenn Morris <rgm@gnu.org>
> Cc: Stefan Monnier <monnier@iro.umontreal.ca>, 20623@debbugs.gnu.org, a.s@realize.ch, sledergerber@gmx.net
> Date: Wed, 01 Aug 2018 14:07:28 -0400
>
> The HTML (not XML) case specified in the original report
> ("Now the steps for HTML" in https://debbugs.gnu.org/20623#5) and in
> https://bugs.debian.org/883434 seems unfixed.

Should it be?
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration lose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
From: Glenn Morris @ 2018-08-07 19:14 UTC
To: Eli Zaretskii; +Cc: sledergerber, a.s, monnier, 20623

Eli Zaretskii wrote:

>> The HTML (not XML) case specified in the original report
>> ("Now the steps for HTML" in https://debbugs.gnu.org/20623#5) and in
>> https://bugs.debian.org/883434 seems unfixed.
>
> Should it be?

I think this is a bug that should be fixed, yes (if that is the
question).
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
From: Stefan Monnier @ 2018-08-11 12:45 UTC
To: Eli Zaretskii; +Cc: a.s, 20623, sledergerber

>> > Isn't it better to fix this in sgml-xml-auto-coding-function? That's
>> > where the root cause is, AFAIU.
>> I'd expect the same problem would affect all other uses.
> Not sure what you meant by "all other uses". Could you please
> elaborate?

Your commit ec6f588940e51013435408a456c10d33ddf98fb2 answers that
question: at least sgml-html-meta-auto-coding-function is one of those
"other uses".

> > And I don't understand the comment about latin-1-mac: I don't think we
> > have such problems in Emacs. The -with-signature variety is
> > different, because it is not about EOL format.
> You might be right, but I don't know where/how this is handled.

I still don't know where the EOL part is handled.


        Stefan
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
From: Eli Zaretskii @ 2018-08-11 13:54 UTC
To: Stefan Monnier; +Cc: a.s, 20623, sledergerber

> From: Stefan Monnier <monnier@IRO.UMontreal.CA>
> Cc: rgm@gnu.org, a.s@realize.ch, 20623@debbugs.gnu.org, sledergerber@gmx.net
> Date: Sat, 11 Aug 2018 08:45:15 -0400
>
> > > And I don't understand the comment about latin-1-mac: I don't think we
> > > have such problems in Emacs. The -with-signature variety is
> > > different, because it is not about EOL format.
> > You might be right, but I don't know where/how this is handled.
>
> I still don't know where the EOL part is handled.

If you tell me what you mean by "handled" in this context, I might be
able to help you understand where that happens.
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
From: Stefan Monnier @ 2018-08-12 0:04 UTC
To: Eli Zaretskii; +Cc: a.s, 20623, sledergerber

>> > > And I don't understand the comment about latin-1-mac: I don't think we
>> > > have such problems in Emacs. The -with-signature variety is
>> > > different, because it is not about EOL format.
>> > You might be right, but I don't know where/how this is handled.
>> I still don't know where the EOL part is handled.
> If you tell me what you mean by "handled" in this context, I might
> be able to help you understand where that happens.

You say that the code I wrote is not needed to make sure an existing
latin-1-mac setting isn't overwritten by a latin-1 guess. I expect this
is indeed true (otherwise I think we'd have had bug reports about it),
but I don't know where that is handled.

        Stefan
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
From: Eli Zaretskii @ 2018-08-12 19:07 UTC
To: Stefan Monnier; +Cc: a.s, 20623, sledergerber

> From: Stefan Monnier <monnier@IRO.UMontreal.CA>
> Cc: rgm@gnu.org, a.s@realize.ch, 20623@debbugs.gnu.org, sledergerber@gmx.net
> Date: Sat, 11 Aug 2018 20:04:05 -0400
>
> You say that the code I wrote is not needed to make sure an existing
> latin-1-mac setting isn't overwritten by a latin-1 guess. I expect this
> is indeed true (otherwise I think we'd have had bug-reports about it),
> but I don't know where that is handled.

It is handled inside select-safe-coding-system, which first invokes
find-auto-coding to decide which encoding is appropriate (and as part
of that, looks at XML or HTML charset information declared by the
text), and then, if the encoding it got doesn't specify the EOL
conversion, it uses the EOL conversion from the buffer's encoding or
from the appropriate defaults.

Since XML/HTML charset tags never specify the EOL conversion, it
follows that Emacs will never override the EOL conversion of the
buffer, it will only use the charset for "text conversion".

I hope this answers your question.
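
The rule described above can be modeled in a few lines. What follows
is a toy Python sketch, not Emacs code: coding systems are plain
strings, the helper names split_eol and merge_eol are invented for
illustration, and the fallback to "the appropriate defaults" is not
modeled. It only shows why a declared charset without an EOL suffix
keeps the buffer's EOL conversion, while nothing in this step carries
over the -with-signature part.

```python
# Toy model of the EOL-inheritance rule; this is not Emacs code.
EOL_SUFFIXES = ("-unix", "-dos", "-mac")


def split_eol(coding):
    """Split 'utf-8-dos' into ('utf-8', 'dos'); EOL is None if unspecified."""
    for suffix in EOL_SUFFIXES:
        if coding.endswith(suffix):
            return coding[: -len(suffix)], suffix[1:]
    return coding, None


def merge_eol(declared, buffer_coding):
    """Take the text conversion from `declared`, inheriting EOL from the buffer."""
    base, eol = split_eol(declared)
    if eol is None:
        # The declaration says nothing about line endings: keep the buffer's.
        _, eol = split_eol(buffer_coding)
    return base if eol is None else f"{base}-{eol}"


print(merge_eol("latin-1", "latin-1-mac"))             # latin-1-mac
print(merge_eol("utf-8", "utf-8-with-signature-dos"))  # utf-8-dos
```

In this model the second call keeps "dos" but drops "with-signature",
which matches the symptom originally reported for this bug.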
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
From: Vincent Lefevre @ 2018-08-08 9:47 UTC
To: Stefan Monnier; +Cc: Alain Schneble, 20623, Simon Ledergerber

On 2017-12-04 12:38:57 -0500, Stefan Monnier wrote:
> > Now reported with "fix this or get removed from the distribution"
> > severity at <https://bugs.debian.org/883434>.
>
> I'm curious to see if the OP's "grave" severity settings will stick.
> "Grave" is defined in https://www.debian.org/Bugs/Developer#severities as:
>
>     makes the package in question unusable or mostly so, or causes data
>     loss, or introduces a security hole allowing access to the accounts
>     of users who use the package.
>
> The only part that could arguably apply is "causes data loss", but even
> that is stretching the meaning of those words, I think.

Actually there's the issue that the coding system (in the Emacs sense)
is changed, but also the fact that this change is invisible to the
user (mainly because the BOM is usually not visible), which makes the
issue even worse. Basically, this is invisible data corruption. Even
though only three bytes are removed, this introduces breakage in other
applications, and it can take the user a long time to find the cause.

Emacs should not change the coding system when not needed, and when it
needs to, it must get confirmation from the user.

-- 
Vincent Lefèvre <vincent@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
From: Stefan Monnier @ 2018-08-08 14:45 UTC
To: Vincent Lefevre; +Cc: Alain Schneble, 20623, Simon Ledergerber

> Actually there's the issue that the coding system (in the Emacs sense)
> is changed, but also the fact that this change is invisible to the
> user (mainly because the BOM is usually not visible), which makes
> the issue even worse. Basically, this is invisible data corruption.
> Even though only three bytes are removed, this introduces breakage in
> other applications, and it can take the user a long time to find
> the cause.
>
> Emacs should not change the coding system when not needed, and when
> it needs to, it must get confirmation from the user.

FWIW, I agree: I don't think it meets Debian's definition of "grave",
but there is no doubt that it's a bug and that we should fix it.

        Stefan
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
From: Eli Zaretskii @ 2018-08-11 9:15 UTC
To: Vincent Lefevre; +Cc: a.s, monnier, 20623-done, sledergerber

> Date: Wed, 8 Aug 2018 11:47:48 +0200
> From: Vincent Lefevre <vincent@vinc17.net>
> Cc: Glenn Morris <rgm@gnu.org>, Simon Ledergerber <sledergerber@gmx.net>,
>     Eli Zaretskii <eliz@gnu.org>, Alain Schneble <a.s@realize.ch>,
>     20623@debbugs.gnu.org
>
> On 2017-12-04 12:38:57 -0500, Stefan Monnier wrote:
> > > Now reported with "fix this or get removed from the distribution"
> > > severity at <https://bugs.debian.org/883434>.
> >
> > I'm curious to see if the OP's "grave" severity settings will stick.
> > "Grave" is defined in https://www.debian.org/Bugs/Developer#severities as:
> >
> >     makes the package in question unusable or mostly so, or causes data
> >     loss, or introduces a security hole allowing access to the accounts
> >     of users who use the package.
> >
> > The only part that could arguably apply is "causes data loss", but even
> > that is stretching the meaning of those words, I think.
>
> Actually there's the issue that the coding system (in the Emacs sense)
> is changed, but also the fact that this change is invisible to the
> user (mainly because the BOM is usually not visible), which makes
> the issue even worse. Basically, this is invisible data corruption.
> Even though only three bytes are removed, this introduces breakage in
> other applications, and it can take the user a long time to find
> the cause.
>
> Emacs should not change the coding system when not needed, and when
> it needs to, it must get confirmation from the user.

I agree with the last paragraph, so I've now fixed the remaining issue
of this bug (with HTML files) on the emacs-26 branch.

However, I would respectfully request that in the future bug reports
be accurate and fair in the assigned severity, and in particular make
sure that the severity matches the actual behavior as judged
objectively.

In this case, I cannot but express my extreme surprise to see such a
minor issue described as "grave". The alleged data loss is minor, if
it exists at all (the BOM is not data important for the user, nor data
whose loss cannot be easily repaired). The unspecified "breakage in
other applications" cannot be considered without the missing details,
but in general I'd be surprised to hear about modern applications
(browsers?) that really need a BOM in UTF-8 encoded HTML files to the
degree that the lack of BOM causes them to "break" in some way; if
they do, it could arguably be a bug in those applications.

Bottom line: artificially and unreasonably increasing the severity
level doesn't help the motivation to fix the bug, and if anything, has
the opposite effect: the source of the bug report gets dismissed as
not serious. I'm sure we don't want that, certainly not for bugs
reported by Debian.

Thanks.
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
From: Vincent Lefevre @ 2018-08-11 10:13 UTC
To: Eli Zaretskii; +Cc: a.s, monnier, 20623, sledergerber

On 2018-08-11 12:15:31 +0300, Eli Zaretskii wrote:
> In this case, I cannot but express my extreme surprise to see such a
> minor issue described as "grave". The alleged data loss is minor, if
> it exists at all (the BOM is not data important for the user,

You're completely wrong. The presence of BOM or not is very important
for some applications, such as Firefox (not to determine the charset,
but the MIME type of local files).

> nor data whose loss cannot be easily repaired).

It can be repaired, but the problem is that the user doesn't know
what's going on and this breaks things. If some package removed the
execute permission of some utility in /bin, this would also be a grave
bug, though it can easily be repaired.

> The unspecified "breakage in
> other applications" cannot be considered without the missing details,
> but in general I'd be surprised to hear about modern applications
> (browsers?) that really need a BOM in UTF-8 encoded HTML files to the
> degree that the lack of BOM causes them to "break" in some way; if
> they do, it could arguably be a bug in those applications.

Firefox. And that's actually how I detected the bug, after hours of
trying to find out why it was behaving inconsistently.

-- 
Vincent Lefèvre <vincent@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)
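
Since the BOM is normally invisible in an editor, a quick check of the
saved bytes can spare some of the debugging time mentioned above. The
following is a minimal Python sketch, not part of the original
discussion; the helper names has_utf8_bom and file_has_utf8_bom are
invented for illustration, while codecs.BOM_UTF8 is the standard
library constant for the bytes EF BB BF.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if `data` begins with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def file_has_utf8_bom(path: str) -> bool:
    """Read only the first bytes of the file at `path` and test them."""
    with open(path, "rb") as handle:
        return has_utf8_bom(handle.read(len(codecs.BOM_UTF8)))


# The "utf-8-sig" encoder prepends the BOM; plain "utf-8" does not.
with_bom = "<!doctype html>".encode("utf-8-sig")
without_bom = "<!doctype html>".encode("utf-8")
print(has_utf8_bom(with_bom))     # True
print(has_utf8_bom(without_bom))  # False
```

Running file_has_utf8_bom on a file before and after saving it in the
editor would show whether the signature survived the save.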
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
From: Eli Zaretskii @ 2018-08-11 10:45 UTC
To: Vincent Lefevre; +Cc: a.s, monnier, 20623, sledergerber

> Date: Sat, 11 Aug 2018 12:13:41 +0200
> From: Vincent Lefevre <vincent@vinc17.net>
> Cc: monnier@iro.umontreal.ca, rgm@gnu.org, sledergerber@gmx.net,
>     a.s@realize.ch, 20623@debbugs.gnu.org
>
> On 2018-08-11 12:15:31 +0300, Eli Zaretskii wrote:
> > In this case, I cannot but express my extreme surprise to see such a
> > minor issue described as "grave". The alleged data loss is minor, if
> > it exists at all (the BOM is not data important for the user,
>
> You're completely wrong. The presence of BOM or not is very important
> for some applications, such as Firefox (not to determine the charset,
> but the MIME type of local files).

Please provide the details, including the use case, if possible. I'm
still in the dark regarding the importance of the BOM in UTF-8 encoded
HTML stuff.

> It can be repaired, but the problem is that the user doesn't know
> what's going on and this breaks things.

I agree about the user not knowing, but that doesn't yet qualify as
"data loss", which has a widely accepted meaning.

> If some package removed the execute permission of some utility in
> /bin, this would also be a grave bug, though it can easily be
> repaired.

Well, I disagree about the "grave" part, because that means the
package is unusable, causes data loss, or introduces a security hole
allowing access to the user account. None of that is true in the case
in point.
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
From: Vincent Lefevre @ 2018-08-11 15:41 UTC
To: Eli Zaretskii; +Cc: a.s, monnier, 20623, sledergerber

On 2018-08-11 13:45:17 +0300, Eli Zaretskii wrote:
> > Date: Sat, 11 Aug 2018 12:13:41 +0200
> > From: Vincent Lefevre <vincent@vinc17.net>
> > Cc: monnier@iro.umontreal.ca, rgm@gnu.org, sledergerber@gmx.net,
> >     a.s@realize.ch, 20623@debbugs.gnu.org
> >
> > On 2018-08-11 12:15:31 +0300, Eli Zaretskii wrote:
> > > In this case, I cannot but express my extreme surprise to see such a
> > > minor issue described as "grave". The alleged data loss is minor, if
> > > it exists at all (the BOM is not data important for the user,
> >
> > You're completely wrong. The presence of BOM or not is very important
> > for some applications, such as Firefox (not to determine the charset,
> > but the MIME type of local files).
>
> Please provide the details, including the use case, if possible. I'm
> still in the dark regarding the importance of the BOM in UTF-8 encoded
> HTML stuff.

  https://bugzilla.mozilla.org/show_bug.cgi?id=1422889

for HTML. Wontfix because of:

  https://mimesniff.spec.whatwg.org/#mime-type-sniffing-algorithm

For text/plain only (but this is another example that BOM can matter
in practice), there's

  https://bugzilla.mozilla.org/show_bug.cgi?id=1071816

(which is a bug that should be fixed).

> > It can be repaired, but the problem is that the user doesn't know
> > what's going on and this breaks things.
>
> I agree about the user not knowing, but that doesn't yet qualify as
> "data loss", which has a widely accepted meaning.

This is data corruption, which is a form of data loss, because some
information is lost in the process (recall that Emacs does not provide
any information to the user about this transformation).

-- 
Vincent Lefèvre <vincent@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
From: Eli Zaretskii @ 2018-08-11 16:27 UTC
To: Vincent Lefevre; +Cc: a.s, monnier, 20623, sledergerber

> Date: Sat, 11 Aug 2018 17:41:01 +0200
> From: Vincent Lefevre <vincent@vinc17.net>
> Cc: monnier@iro.umontreal.ca, rgm@gnu.org, sledergerber@gmx.net,
>     a.s@realize.ch, 20623@debbugs.gnu.org
>
> > > You're completely wrong. The presence of BOM or not is very important
> > > for some applications, such as Firefox (not to determine the charset,
> > > but the MIME type of local files).
> >
> > Please provide the details, including the use case, if possible. I'm
> > still in the dark regarding the importance of the BOM in UTF-8 encoded
> > HTML stuff.
>
> https://bugzilla.mozilla.org/show_bug.cgi?id=1422889
>
> for HTML. Wontfix because of:
>
> https://mimesniff.spec.whatwg.org/#mime-type-sniffing-algorithm
>
> For text/plain only (but this is another example that BOM can matter
> in practice), there's
>
> https://bugzilla.mozilla.org/show_bug.cgi?id=1071816
>
> (which is a bug that should be fixed).

Maybe I'm missing something, but none of these issues describes the
situation in this bug report, namely: an HTML file with an explicit
charset= tag, with or without a BOM. In fact, the first of these
issues happens only in files that _do_ have a BOM, so you could say
that Emacs did you a favor by removing it ;-)

> > I agree about the user not knowing, but that doesn't yet qualify as
> > "data loss", which has a widely accepted meaning.
>
> This is data corruption, which is a form of data loss, because some
> information is lost in the process (recall that Emacs does not
> provide any information to the user about this transformation).

That is the most inclusive interpretation of "data loss" I've ever
seen. "Some information is lost" is nowhere near what "grave bug"
means by "data loss", so I don't think "grave" applies here.

Anyway, the Emacs issue is now fixed.
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
From: Vincent Lefevre @ 2018-08-12 1:34 UTC
To: Eli Zaretskii; +Cc: a.s, monnier, 20623, sledergerber

On 2018-08-11 19:27:33 +0300, Eli Zaretskii wrote:
> Maybe I'm missing something, but none of these issues describes the
> situation in this bug report, namely: an HTML file with an explicit
> charset= tag, with or without a BOM. In fact, the first of these
> issues happens only in files that _do_ have a BOM, so you could say
> that Emacs did you a favor by removing it ;-)

In theory yes, but in practice, one does not want that when doing
file-loading tests. Otherwise the tests become meaningless.

This is just like a spellchecker that automatically corrects spelling
mistakes without the user's knowledge (even when the correction is
right): if the goal is to write something about a spelling mistake,
the text becomes meaningless. Or like characters being changed
automatically to improve typography (as can be seen with some blog
software when posting, with no preview), which can make the text
meaningless, e.g. when it is code.

> Anyway, the Emacs issue is now fixed.

OK, thanks.

-- 
Vincent Lefèvre <vincent@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
From: Stefan Monnier @ 2018-08-12 0:11 UTC
To: Vincent Lefevre; +Cc: a.s, 20623, sledergerber

>> > > In this case, I cannot but express my extreme surprise to see such a
>> > > minor issue described as "grave". The alleged data loss is minor, if
>> > > it exists at all (the BOM is not data important for the user,
>> > You're completely wrong. The presence of BOM or not is very important
>> > for some applications, such as Firefox (not to determine the charset,
>> > but the MIME type of local files).
>> Please provide the details, including the use case, if possible. I'm
>> still in the dark regarding the importance of the BOM in UTF-8 encoded
>> HTML stuff.
> https://bugzilla.mozilla.org/show_bug.cgi?id=1422889

I don't see any data loss there.

        Stefan

PS: We can all cook up contrived scenarios where this bug leads to a
serious loss of data. But in that case a problem in C-n which makes it
move to the wrong column would also qualify as "grave", because I can
just as well construct a contrived scenario where such a bug leads to
a serious loss of data.
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
From: Vincent Lefevre @ 2018-08-12 0:58 UTC
To: Stefan Monnier; +Cc: a.s, 20623, sledergerber

On 2018-08-11 20:11:49 -0400, Stefan Monnier wrote:
> >> Please provide the details, including the use case, if possible. I'm
> >> still in the dark regarding the importance of the BOM in UTF-8 encoded
> >> HTML stuff.
> > https://bugzilla.mozilla.org/show_bug.cgi?id=1422889
>
> I don't see any data loss there.

Because it is not there, it is in Emacs. What the Mozilla bug shows is
that the presence of BOM or not is important and yields very different
behavior.

-- 
Vincent Lefèvre <vincent@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
From: Stefan Monnier @ 2015-05-22 15:22 UTC
To: Eli Zaretskii; +Cc: Simon Ledergerber, 20623

> What would you expect Emacs to do instead? It just obeys the stated
> encoding, which says nothing about the BOM. How can Emacs know when
> to use utf-8 and when utf-8-with-signature?

To the extent that Emacs has seen the BOM when opening the file, it
would make sense for Emacs to try and preserve this detail.

IOW the utf-8 annotation in the XML metadata shouldn't mean "use the
utf-8 coding system" but "use a coding system compatible with utf-8".
So if the coding system is already compatible with utf-8 (e.g.
utf-8-with-signature), we should simply keep using that rather than
switch to the utf-8 coding-system.

        Stefan
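
One way to read the proposal above is sketched below as a toy Python
model. This is not Emacs code and not the change that was eventually
committed; the names family and choose_coding are invented for
illustration, and "compatible" is assumed, narrowly, to mean "same
name once the EOL suffix and the -with-signature suffix are stripped".

```python
# Toy model of "keep the current coding system if it is compatible
# with the declared charset"; this is not Emacs code.
EOL_SUFFIXES = ("-unix", "-dos", "-mac")
SIGNATURE = "-with-signature"


def family(coding):
    """Reduce a coding-system name to its bare charset name."""
    name = coding.lower()
    for suffix in EOL_SUFFIXES:
        if name.endswith(suffix):
            name = name[: -len(suffix)]
            break
    if name.endswith(SIGNATURE):
        name = name[: -len(SIGNATURE)]
    return name


def choose_coding(declared, current):
    """Prefer `current` when it already encodes the declared charset."""
    if family(declared) == family(current):
        return current   # compatible: preserve BOM and EOL choices
    return declared      # incompatible: follow the declaration


print(choose_coding("utf-8", "utf-8-with-signature-dos"))
# utf-8-with-signature-dos
print(choose_coding("utf-8", "latin-1-dos"))
# utf-8
```

Under these assumptions an encoding="utf-8" declaration no longer
downgrades a buffer that is already utf-8-with-signature, while a
genuinely different charset still follows the declaration.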
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
From: Eli Zaretskii @ 2015-05-22 15:26 UTC
To: Stefan Monnier; +Cc: sledergerber, 20623

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: Simon Ledergerber <sledergerber@gmx.net>, 20623@debbugs.gnu.org
> Date: Fri, 22 May 2015 11:22:27 -0400
>
> > What would you expect Emacs to do instead? It just obeys the stated
> > encoding, which says nothing about the BOM. How can Emacs know when
> > to use utf-8 and when utf-8-with-signature?
>
> To the extent that Emacs has seen the BOM when opening the file, it
> would make sense for Emacs to try and preserve this detail.

It does.
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
From: Stefan Monnier @ 2015-05-22 21:51 UTC
To: Eli Zaretskii; +Cc: sledergerber, 20623

>> > What would you expect Emacs to do instead? It just obeys the stated
>> > encoding, which says nothing about the BOM. How can Emacs know when
>> > to use utf-8 and when utf-8-with-signature?
>> To the extent that Emacs has seen the BOM when opening the file, it
>> would make sense for Emacs to try and preserve this detail.
> It does.

While there are cases where it does, this bug report is about a case
where it doesn't, IIUC.

        Stefan
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
From: Eli Zaretskii @ 2015-05-23 6:44 UTC
To: Stefan Monnier; +Cc: sledergerber, 20623

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: sledergerber@gmx.net, 20623@debbugs.gnu.org
> Date: Fri, 22 May 2015 17:51:07 -0400
>
> >> > What would you expect Emacs to do instead? It just obeys the stated
> >> > encoding, which says nothing about the BOM. How can Emacs know when
> >> > to use utf-8 and when utf-8-with-signature?
> >> To the extent that Emacs has seen the BOM when opening the file, it
> >> would make sense for Emacs to try and preserve this detail.
> > It does.
>
> While there are cases where it does, this bug report is about a case
> where it doesn't, IIUC.

AFAIU, that happened because the user has this in ~/.emacs:

  (setq-default buffer-file-coding-system 'utf-8-dos)

IMO, this bad customization should be removed, and then the problem
will go away.
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
From: Simon Ledergerber @ 2015-05-23 17:11 UTC
To: Eli Zaretskii, Stefan Monnier; +Cc: 20623

As already mentioned in my last post, even when I started Emacs with
the option -Q, which should disable my customizations, it made no
difference. So the source of the problem must be somewhere else.

-----Original Message-----
From: "Eli Zaretskii" <eliz@gnu.org>
Sent: 23.05.2015 08:44
To: "Stefan Monnier" <monnier@iro.umontreal.ca>
Cc: "sledergerber@gmx.net" <sledergerber@gmx.net>; "20623@debbugs.gnu.org" <20623@debbugs.gnu.org>
Subject: Re: bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: sledergerber@gmx.net, 20623@debbugs.gnu.org
> Date: Fri, 22 May 2015 17:51:07 -0400
>
> >> > What would you expect Emacs to do instead? It just obeys the stated
> >> > encoding, which says nothing about the BOM. How can Emacs know when
> >> > to use utf-8 and when utf-8-with-signature?
> >> To the extent that Emacs has seen the BOM when opening the file, it
> >> would make sense for Emacs to try and preserve this detail.
> > It does.
>
> While there are cases where it does, this bug report is about a case
> where it doesn't, IIUC.

AFAIU, that happened because the user has this in ~/.emacs:

  (setq-default buffer-file-coding-system 'utf-8-dos)

IMO, this bad customization should be removed, and then the problem
will go away.
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
From: Eli Zaretskii @ 2015-05-23 17:20 UTC
To: Simon Ledergerber; +Cc: 20623

> Cc: <20623@debbugs.gnu.org>
> From: Simon Ledergerber <sledergerber@gmx.net>
> Date: Sat, 23 May 2015 19:11:15 +0200
>
> As already mentioned in my last post, even when I started Emacs with
> the option -Q, which should disable my customizations, it made no
> difference. So the source of the problem must be somewhere else.

Doesn't happen to me. So please post the file you used and the exact
sequence of steps, starting from "emacs -Q", to reproduce the problem.

Thanks.