* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
From: Simon Ledergerber @ 2015-05-21 18:50 UTC
To: 20623
Hi
When I was editing XHTML and HTML files, I wanted to make sure the BOM
was written out to the file in order to make it easier for the browser
to detect the UTF-8 encoding. Therefore I changed the coding system for
the file buffer to utf-8-with-signature-dos (since I am working on a
Windows System) before saving the file.
After some time I was surprised to find that the browser (IE11) didn't
report UTF-8 as the file's encoding. When I checked a hex dump of my
(X)HTML file, I saw that the BOM was definitely missing.
Apparently, when a "utf-8" string appears in a <meta charset="utf-8">
element (even if commented out, see below) or in an <?xml version="1.0"
encoding="utf-8"?> declaration, Emacs switches the file's coding system
to utf-8 when it saves the file, even if utf-8-with-signature was
specified explicitly beforehand. This looks like a bug to me, because
there is then no longer any way to restore the BOM using Emacs.
I was not sure whether my bug is related to bug #8282, so I decided to
report it (again).
My Emacs version is: 24.5.1 (x86_64-unknown-cygwin) of 2015-04-10 on
Windows 8.1 x64.
I am running Emacs in text-mode only inside a Cygwin console.
This is my .emacs.d/init.el:
(line-number-mode)
(column-number-mode)
(setq-default fill-column 80)
(setq-default buffer-file-coding-system 'utf-8-dos)
(setq-default indent-tabs-mode nil)
With XML, the problem can be reproduced in the most basic way with the
following steps:
- Create a new file with C-x C-f in the current directory. Name it
test.txt for example.
- Switch to fundamental mode with M-x fundamental-mode.
- Type the text '<?xml version="1.0"' (without the surrounding single
quotes).
- Switch the coding system to one that includes the BOM: C-x RET f
utf-8-with-signature-dos.
- Verify the current coding system with C-h Shift-c RET: Yes, the
coding system for the file buffer is as specified before.
- Type C-x k to kill the help buffer if necessary and save the file with
C-x C-s.
- Check the file with a hex editor. Under the Cygwin Bash shell, 'od -Ax
-t xCaz test.txt' will also do it: The UTF-8 BOM 'EF BB BF' was written
at the beginning of the file.
- Complete the rest of the XML declaration as follows: ' encoding="utf-8"?>'
- Now save the file and check again: The coding system for the buffer
has changed to utf-8-dos and the BOM has disappeared from the file!
Now the steps for HTML:
- Create a new file test1.txt in the current directory.
- Fill it with the following simple and yet incomplete HTML5 document:
<!doctype html>
<html>
<head>
<title>Test</title>
</head>
<body>
</body>
</html>
- Change the coding system to utf-8-with-signature-dos and save the file.
- Verify that the coding system for the buffer is correct and the BOM is
really written: Yes, it is.
- Insert the following *comment* between <head> and <title>: <!-- <meta
charset="utf-8"> -->
- Save the file and verify: The coding system has changed to utf-8-dos
and the BOM has vanished, even though the <meta> element is just a
comment and has no effect!
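The hex-dump check used in both recipes can also be scripted. What follows is a minimal Python sketch, not part of the original report. The helper names `has_utf8_bom` and `write_sample` and the temporary files are illustrative assumptions, and Python's `utf-8-sig` codec merely stands in for a file saved with a BOM. The check itself only reads the first three bytes and compares them with EF BB BF.

```python
import codecs
import os
import tempfile


def has_utf8_bom(path):
    """Return True if the file at PATH starts with the UTF-8 BOM (EF BB BF)."""
    with open(path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8


def write_sample(text, encoding):
    """Write TEXT to a temporary file using ENCODING and return its path."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    # newline="" keeps the CR LF line ending exactly as given in TEXT.
    with os.fdopen(fd, "w", encoding=encoding, newline="") as f:
        f.write(text)
    return path


sample = '<?xml version="1.0" encoding="utf-8"?>\r\n'

# "utf-8-sig" is Python's codec name for UTF-8 with a leading BOM.
with_bom = write_sample(sample, "utf-8-sig")
without_bom = write_sample(sample, "utf-8")

bom_present = has_utf8_bom(with_bom)
bom_absent = not has_utf8_bom(without_bom)
print(bom_present, bom_absent)

# Remove the temporary files again.
os.remove(with_bom)
os.remove(without_bom)
```

Running such a check after each save step would show at which point the BOM disappears, without needing a hex editor.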
Regards
Simon
P. S. Information as reported by M-x report-emacs-bug:
In GNU Emacs 24.5.1 (x86_64-unknown-cygwin)
of 2015-04-10 on desktop-new
Configured using:
`configure
--srcdir=/home/kbrown/src/cygemacs/emacs-24.5-1.x86_64/src/emacs-24.5
--prefix=/usr --exec-prefix=/usr --localstatedir=/var --sysconfdir=/etc
--docdir=/usr/share/doc/emacs --htmldir=/usr/share/doc/emacs/html -C
--with-x=no 'CFLAGS=-ggdb -O2 -pipe -Wimplicit-function-declaration
-fdebug-prefix-map=/home/kbrown/src/cygemacs/emacs-24.5-1.x86_64/build=/usr/src/debug/emacs-24.5-1
-fdebug-prefix-map=/home/kbrown/src/cygemacs/emacs-24.5-1.x86_64/src/emacs-24.5=/usr/src/debug/emacs-24.5-1'
CPPFLAGS= LDFLAGS='
Important settings:
value of $LANG: en_US.UTF-8
locale-coding-system: utf-8-unix
Major mode: Help
Minor modes in effect:
tooltip-mode: t
electric-indent-mode: t
menu-bar-mode: t
file-name-shadow-mode: t
global-font-lock-mode: t
font-lock-mode: t
auto-composition-mode: t
auto-encryption-mode: t
auto-compression-mode: t
buffer-read-only: t
column-number-mode: t
line-number-mode: t
transient-mark-mode: t
Recent messages:
Beginning of buffer [3 times]
Saving file /cygdrive/c/users/.../html_basics/basic.xhtml...
Wrote /cygdrive/c/users/.../html_basics/basic.xhtml
Mark set [2 times]
Auto-saving...done
Mark set [2 times]
Saving file /cygdrive/c/users/.../html_basics/basic.xhtml...
Wrote /cygdrive/c/users/.../html_basics/basic.xhtml
No docstring slot for help-mode-setup
No docstring slot for help-mode-finish
Load-path shadows:
None found.
Features:
(shadow sort gnus-util mail-extr emacsbug message format-spec rfc822 mml
mml-sec mm-decode mm-bodies mm-encode mail-parse rfc2231 mailabbrev
gmm-utils mailheader sendmail rfc2047 rfc2045 ietf-drums mm-util
help-fns mail-prsvr mail-utils misearch multi-isearch mule-diag
help-mode easymenu regexp-opt sgml-mode xterm time-date tooltip electric
uniquify ediff-hook vc-hooks lisp-float-type tabulated-list newcomment
lisp-mode prog-mode register page menu-bar rfn-eshadow timer select
mouse jit-lock font-lock syntax facemenu font-core frame cham georgian
utf-8-lang misc-lang vietnamese tibetan thai tai-viet lao korean
japanese hebrew greek romanian slovak czech european ethiopic indian
cyrillic chinese case-table epa-hook jka-cmpr-hook help simple abbrev
minibuffer nadvice loaddefs button faces cus-face macroexp files
text-properties overlay sha1 md5 base64 format env code-pages mule
custom widget hashtable-print-readable backquote make-network-process
dbusbind gfilenotify multi-tty emacs)
Memory information:
((conses 16 81797 4691)
(symbols 48 17091 0)
(miscs 40 73 387)
(strings 32 11233 4887)
(string-bytes 1 291872)
(vectors 16 7587)
(vector-slots 8 342125 27930)
(floats 8 57 393)
(intervals 56 834 26)
(buffers 960 21))
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
From: Eli Zaretskii @ 2015-05-21 19:48 UTC
To: Simon Ledergerber; +Cc: 20623
> Date: Thu, 21 May 2015 20:50:58 +0200
> From: Simon Ledergerber <sledergerber@gmx.net>
>
> When I was editing XHTML and HTML files, I wanted to make sure the BOM
> was written out to the file in order to make it easier for the browser
> to detect the UTF-8 encoding. Therefore I changed the coding system for
> the file buffer to utf-8-with-signature-dos (since I am working on a
> Windows System) before saving the file.
>
> After some time I got surprised because the browser (IE11), didn't
> report UTF-8 as the file's encoding. Having checked the hexdump of my
> (X)HTML file, I saw the BOM was definitely missing.
>
> Obviously, when a "UTF-8" string appears in the <meta charset="utf-8">
> (even if commented out, see later below) or <?xml version="1.0"
> encoding="utf-8"?> declaration, Emacs switches the file coding system to
> utf-8, when it saves the file, even if utf-8-with-signature was
> specified explicitly before. This appears to me as a bug, because there
> is no way anymore to restore the BOM using Emacs.
What would you expect Emacs to do instead? It just obeys the stated
encoding, which says nothing about the BOM. How can Emacs know when
to use utf-8 and when utf-8-with-signature?
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
From: Eli Zaretskii @ 2015-05-22 7:11 UTC
To: Simon Ledergerber; +Cc: 20623
[Please don't remove the bug address from the CC list, so that this
discussion is recorded in the bug database.]
> Date: Thu, 21 May 2015 22:49:47 +0200
> From: Simon Ledergerber <sledergerber@gmx.net>
>
> From the documentation I understand that utf-8 is without BOM and
> utf-8-with-signature is with BOM. Maybe I am wrong and should rather
> understand that utf-8 is auto-detect. But then there is something like
> utf-8-without-signature missing to specify explicitly that no BOM is
> desired.
>
> In my opinion, it is correct when Emacs prefers utf-8 over
> utf-8-with-signature when it opens a file without BOM that can still be
> recognized as UTF-8.
>
> However when a file is opened with a BOM already present, it should
> stick to the utf-8-with-signature coding system, because the BOM "EF BB
> BF" unambiguously marks the file as UTF-8. (For UTF-16 for example,
> there is a different BOM byte pattern. There are other coding systems
> which do not have a BOM at all.)
What do you mean by "stick to"? When I try visiting an XML file that
is encoded with BOM, Emacs decodes the file correctly, and the value
of buffer-file-coding-system is utf-8-with-signature. Isn't that what
you want? If that's what you want, but it doesn't happen for you,
please try in "emacs -Q". It's possible that the default you set:
(setq-default buffer-file-coding-system 'utf-8-dos)
is the reason for what you see. (I don't understand why you need such
a default, and it sounds like a bad idea to me.)
> By doing C-x <RET> f and then saving it with C-x C-s, I expect to be
> able to change the coding system. For example, if I specify utf-8-dos,
> the BOM should be removed, if one was present, and CR LF should be
> inserted for EOL. On the other side, if I choose
> utf-8-with-signature-unix, a BOM should be written and LF be taken for
> EOL. (The conversion between DOS and Unix works, just the BOM is the
> problem.)
>
> I have found a link, where this topic was already discussed, but it
> didn't help me further:
> http://superuser.com/questions/41254/make-emacs-not-remove-the-bom-from-xml-files
>
> In that post Vebjorn Ljosa asked exactly the question I have. Richard
> Hoskins replies with the answer to change the coding system with C-x
> <RET> r utf-8-with-signature. Unfortunately, it didn't work for me -
> after doing a change in the file and saving, it got back to utf-8
> automatically - that's why I have filed the bug.
That's not how you force a file to be saved in a specific encoding.
You should do this instead:
C-x RET c utf-8-with-signature RET C-x C-s
The "C-x RET c" prefix forces the next Emacs operation to use the
specified encoding. In this case, Emacs will ask for confirmation,
because the encoding you specified is different from what the XML
comment says.
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
From: Simon Ledergerber @ 2015-05-22 13:21 UTC
To: Eli Zaretskii; +Cc: 20623
Hello Eli
I have done some more research to answer your questions. You will find
the details of my statement at the end of this mail.
On 22.05.2015 09:11, Eli Zaretskii wrote:
> [Please don't remove the bug address from the CC list, so that this
> discussion is recorded in the bug data base.]
>
>> Date: Thu, 21 May 2015 22:49:47 +0200
>> From: Simon Ledergerber <sledergerber@gmx.net>
>>
>> From the documentation I understand that utf-8 is without BOM and
>> utf-8-with-signature is with BOM. Maybe I am wrong and should rather
>> understand that utf-8 is auto-detect. But then there is something like
>> utf-8-without-signature missing to specify explicitly that no BOM is
>> desired.
>>
>> In my opinion, it is correct when Emacs prefers utf-8 over
>> utf-8-with-signature when it opens a file without BOM that can still be
>> recognized as UTF-8.
>>
>> However when a file is opened with a BOM already present, it should
>> stick to the utf-8-with-signature coding system, because the BOM "EF BB
>> BF" unambiguously marks the file as UTF-8. (For UTF-16 for example,
>> there is a different BOM byte pattern. There are other coding systems
>> which do not have a BOM at all.)
> What do you mean by "stick to"? When I try visiting an XML file that
> is encoded with BOM, Emacs decodes the file correctly, and the value
> of buffer-file-coding-system is utf-8-with-signature. Isn't that what
> you want? If that's what you want, but it doesn't happen for you,
> please try in "emacs -Q". It's possible that the default you set:
>
> (setq-default buffer-file-coding-system 'utf-8-dos)
>
> is the reason for what you see. (I don't understand why you need such
> a default, and it sounds like a bad idea to me.)
You're right. When I open a file that was really saved with a BOM, Emacs
detects its encoding correctly, i.e. utf-8-with-signature-dos. But when
I change the content and save with C-x C-s, the encoding changes to
utf-8-dos and the BOM gets lost, even when I start Emacs with -Q. This
is the actual bug.
>
>> By doing C-x <RET> f and then saving it with C-x C-s, I expect to be
>> able to change the coding system. For example, if I specify utf-8-dos,
>> the BOM should be removed, if one was present, and CR LF should be
>> inserted for EOL. On the other side, if I choose
>> utf-8-with-signature-unix, a BOM should be written and LF be taken for
>> EOL. (The conversion between DOS and Unix works, just the BOM is the
>> problem.)
>>
>> I have found a link, where this topic was already discussed, but it
>> didn't help me further:
>> http://superuser.com/questions/41254/make-emacs-not-remove-the-bom-from-xml-files
>>
>> In that post Vebjorn Ljosa asked exactly the question I have. Richard
>> Hoskins replies with the answer to change the coding system with C-x
>> <RET> r utf-8-with-signature. Unfortunately, it didn't work for me -
>> after doing a change in the file and saving, it got back to utf-8
>> automatically - that's why I have filed the bug.
> That's not how you force a file to be saved in a specific encoding.
> You should do this instead:
>
> C-x RET c utf-8-with-signature RET C-x C-s
>
> The "C-x RET c" prefix forces the next Emacs operation to use the
> specified encoding. In this case, Emacs will ask for confirmation,
> because the encoding you specified is different from what the XML
> comment says.
>
This is true and it worked for me. Please see below for further
explanations.
Summary:
- C-x RET c utf-8-with-signature RET C-x C-s is a good workaround,
because it really forces the file to be written with a BOM. To have any
effect, however, the file must be dirty, i.e. there must be a pending
change. In this case, before the command completes, the prompt
"Selected encoding utf-8-with-signature-dos disagrees with utf-8-dos
specified by file contents. Really save (else edit coding cookies and
try again)? (yes or no)" appears. I think this is what you meant by:
"In this case, Emacs will ask for confirmation, because the encoding
you specified is different from what the XML comment says."
- But consider the following: The encoding in the XML declaration or in
the HTML <meta charset="utf-8"> just specifies UTF-8 (or another
encoding). It doesn't say anything about the presence or absence of the
BOM. Therefore an editor detecting and deciding about the file's
encoding should not rely on this information only.
- When such a file, which was saved successfully with BOM, is closed and
reopened again, Emacs detects its encoding correctly, say
utf-8-with-signature-dos.
- However, when I change the file content and save it again just with
C-x C-s (without C-x RET c ... first!), then it changes back to
utf-8-dos. Yes, even if I start emacs with -Q! (That's the point.)
- I do not fully understand the criterion for and the magic behind how
Emacs chooses the file encoding when I do C-x C-s. But I was able to
reproduce it several times by applying the procedures given in the bug
report, even when -Q is on. As we already have stated above, this could
be because Emacs favors (and forces) utf-8 whenever it sees something
like XML or HTML that might be UTF-8-encoded.
-> Conclusion: C-x RET c utf-8-with-signature RET C-x C-s is a good way
to force the file to be written the way I want. But what I still do not
understand is this: when I open a file with a BOM and Emacs recognizes
it, why does Emacs silently change the encoding and drop the BOM when I
do a regular save with C-x C-s, without giving me any notice or warning?
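The point that the declared encoding says nothing about the BOM can be illustrated outside Emacs. The following small Python sketch is an illustration only: Python's `utf-8` and `utf-8-sig` codecs stand in for Emacs's utf-8 and utf-8-with-signature coding systems, which is an analogy and not a statement about Emacs internals.

```python
import codecs

declaration = '<?xml version="1.0" encoding="utf-8"?>\n<root></root>\n'

plain = declaration.encode("utf-8")       # no BOM
signed = declaration.encode("utf-8-sig")  # BOM followed by the same bytes

# The two byte streams differ only by the three-byte BOM prefix.
differs_only_by_bom = signed == codecs.BOM_UTF8 + plain

# Both decode to the same text, declaration included, so the declaration
# alone cannot tell a reader which of the two forms was on disk.
same_text = plain.decode("utf-8-sig") == signed.decode("utf-8-sig") == declaration

print(differs_only_by_bom, same_text)
```

In other words, whether a BOM was present has to be remembered separately from the declared character encoding if it is to survive a save.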
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
From: Stefan Monnier @ 2015-05-22 15:22 UTC
To: Eli Zaretskii; +Cc: Simon Ledergerber, 20623
> What would you expect Emacs to do instead? It just obeys the stated
> encoding, which says nothing about the BOM. How can Emacs know when
> to use utf-8 and when utf-8-with-signature?
To the extent that Emacs has seen the BOM when opening the file, it
would make sense for Emacs to try and preserve this detail. IOW the
utf-8 annotation in the XML metadata shouldn't mean "use the utf-8
coding system" but "use a coding system compatible with utf-8". So if
the coding system is already compatible with utf-8
(e.g. utf-8-with-signature), we should simply keep using that rather
than switch to the utf-8 coding-system.
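The "compatible with utf-8" rule described above can be modelled as follows. This is a hedged Python sketch that is not taken from the thread or from Emacs: the function names `split_eol` and `choose_save_coding` are hypothetical, and coding systems are treated as plain strings with name suffixes, whereas real Emacs coding systems carry this information as properties.

```python
BOM_SUFFIX = "-with-signature"
EOL_SUFFIXES = ("-unix", "-dos", "-mac")


def split_eol(name):
    """Split a coding-system name into (base, eol_suffix)."""
    for suffix in EOL_SUFFIXES:
        if name.endswith(suffix):
            return name[: -len(suffix)], suffix
    return name, ""


def choose_save_coding(current, declared):
    """Pick the coding system to save with (illustrative model only).

    CURRENT is the buffer's coding system and DECLARED is the one named
    in the XML/HTML declaration. If CURRENT is DECLARED or its BOM
    variant, keep CURRENT; otherwise follow DECLARED and inherit
    CURRENT's end-of-line type.
    """
    base, eol = split_eol(current)
    declared_base, _ = split_eol(declared)
    if base in (declared_base, declared_base + BOM_SUFFIX):
        return current
    return declared_base + eol


print(choose_save_coding("utf-8-with-signature-dos", "utf-8"))
print(choose_save_coding("iso-latin-1-unix", "utf-8"))
```

Under this model, a buffer that was visited as utf-8-with-signature-dos keeps its BOM when the declaration says utf-8, while a buffer in an unrelated coding system still follows the declaration.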
Stefan
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
From: Eli Zaretskii @ 2015-05-22 15:26 UTC
To: Stefan Monnier; +Cc: sledergerber, 20623
> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: Simon Ledergerber <sledergerber@gmx.net>, 20623@debbugs.gnu.org
> Date: Fri, 22 May 2015 11:22:27 -0400
>
> > What would you expect Emacs to do instead? It just obeys the stated
> > encoding, which says nothing about the BOM. How can Emacs know when
> > to use utf-8 and when utf-8-with-signature?
>
> To the extent that Emacs has seen the BOM when opening the file, it
> would make sense for Emacs to try and preserve this detail.
It does.
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
From: Stefan Monnier @ 2015-05-22 21:51 UTC
To: Eli Zaretskii; +Cc: sledergerber, 20623
>> > What would you expect Emacs to do instead? It just obeys the stated
>> > encoding, which says nothing about the BOM. How can Emacs know when
>> > to use utf-8 and when utf-8-with-signature?
>> To the extent that Emacs has seen the BOM when opening the file, it
>> would make sense for Emacs to try and preserve this detail.
> It does.
While there are cases where it does, this bug report is about a case
where it doesn't, IIUC.
Stefan
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
From: Eli Zaretskii @ 2015-05-23 6:44 UTC
To: Stefan Monnier; +Cc: sledergerber, 20623
> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: sledergerber@gmx.net, 20623@debbugs.gnu.org
> Date: Fri, 22 May 2015 17:51:07 -0400
>
> >> > What would you expect Emacs to do instead? It just obeys the stated
> >> > encoding, which says nothing about the BOM. How can Emacs know when
> >> > to use utf-8 and when utf-8-with-signature?
> >> To the extent that Emacs has seen the BOM when opening the file, it
> >> would make sense for Emacs to try and preserve this detail.
> > It does.
>
> While there are cases where it does, this bug report is about a case
> where it doesn't, IIUC.
AFAIU, that happened because the user has this in ~/.emacs:
(setq-default buffer-file-coding-system 'utf-8-dos)
IMO, this bad customization should be removed, and then the problem
will go away.
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
From: Simon Ledergerber @ 2015-05-23 17:11 UTC
To: Eli Zaretskii, Stefan Monnier; +Cc: 20623
As already mentioned in my last post, starting Emacs with the -Q option,
which should disable my customizations, made no difference. So the
source of the problem must be somewhere else.
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
From: Eli Zaretskii @ 2015-05-23 17:20 UTC
To: Simon Ledergerber; +Cc: 20623
> Cc: <20623@debbugs.gnu.org>
> From: Simon Ledergerber <sledergerber@gmx.net>
> Date: Sat, 23 May 2015 19:11:15 +0200
>
> As already mentioned in my last post, even when I started Emacs with the option
> -Q, which should opt out my customizations, it made no difference. So
> naturally, the source of the problem will be somewhere else.
Doesn't happen to me. So please post the file you used and the exact
sequence of steps, starting from "emacs -Q", to reproduce the problem.
Thanks.
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
From: Alain Schneble @ 2016-10-12 21:44 UTC
To: Simon Ledergerber; +Cc: Stefan Monnier, 20623
I'm joining this discussion and would like to report a recipe to
reproduce this issue on Windows:
- emacs -Q
- C-x C-f utf-8-bom-test.xml
- Enter the following text in the new buffer:
<?xml version="1.0" encoding="utf-8"?>
<root></root>
- C-x RET c utf-8-with-signature-dos C-x C-s yes RET
- C-x k RET
- C-x C-f utf-8-bom-test.xml
- M-: buffer-file-coding-system
=> utf-8-with-signature-dos
- Change buffer content, e.g. add some text to the root element:
<?xml version="1.0" encoding="utf-8"?>
<root>test</root>
- C-x C-s
- M-: buffer-file-coding-system
=> utf-8-dos
(expected coding system: utf-8-with-signature-dos)
As it was already mentioned in this thread, just by visiting the file,
then changing and saving the buffer, the BOM gets lost. This is due to
select-safe-coding-system (called by choose_write_coding_system) fully
trusting the coding system identified by find-auto-coding. So far so
good. The latter eventually calls auto-coding-functions which in turn
calls the built-in sgml-xml-auto-coding-function which I think should
take into account some context to enrich the derived coding system with
a signature if needed. Similar to what select-safe-coding-system does
to enrich the coding with the proper eol-type.
Does that make sense to you? If so, I'll try to come up with a patch
that enhances sgml-xml-auto-coding-function to take into account
buffer-file-coding-system (buffer + default value) in case it carries
the same text-conversion but different signature. The proposed "auto
coding" shall inherit the signature in this case.
Thanks for any help.
Alain
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
From: Glenn Morris @ 2017-12-04 16:54 UTC
To: Alain Schneble; +Cc: Simon Ledergerber, Stefan Monnier, 20623
Now reported with "fix this or get removed from the distribution"
severity at <https://bugs.debian.org/883434>.
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
From: Stefan Monnier @ 2017-12-04 17:38 UTC
To: Glenn Morris; +Cc: Simon Ledergerber, Alain Schneble, 20623
> Now reported with "fix this or get removed from the distribution"
> severity at <https://bugs.debian.org/883434>.
I'm curious to see if the OP's "grave" severity settings will stick.
"Grave" is defined in https://www.debian.org/Bugs/Developer#severities as:
makes the package in question unusable or mostly so, or causes data
loss, or introduces a security hole allowing access to the accounts
of users who use the package.
The only part that could arguably apply is "causes data loss", but even
that is stretching the meaning of those words, I think.
This said, we should indeed fix this bug.
Not sure how to Do It Right, but at least this specific problem should be
fixable with a patch along the lines of the one below (guaranteed 100%
untested).
Stefan
diff --git a/lisp/international/mule.el b/lisp/international/mule.el
index 019e65b2c6..5c0675aa2f 100644
--- a/lisp/international/mule.el
+++ b/lisp/international/mule.el
@@ -1885,6 +1885,12 @@ auto-coding-alist-lookup
(setq alist (cdr alist))))
coding-system))
+(defun mule--coding-system-compatible-p (cs new-cs)
+ "Return non-nil if CS is one of the coding-systems described by NEW-CS."
+ (let ((base (coding-system-base cs)))
+ (or (eq base new-cs)
+ (eq base (intern (concat (symbol-name new-cs) "-with-signature"))))))
+
(put 'enable-character-translation 'permanent-local t)
(put 'enable-character-translation 'safe-local-variable 'booleanp)
@@ -2038,8 +2044,12 @@ find-auto-coding
(save-excursion
(goto-char (point-min))
(funcall (pop funcs) size)))))
- (if coding-system
- (cons coding-system 'auto-coding-functions)))))
+ (and coding-system
+ ;; Don't override utf-8-with-signature with utf-8
+ ;; or latin-1-mac with latin-1 (bug#20623).
+ (not (mule--coding-system-compatible-p
+ buffer-file-coding-system coding-system))
+ (cons coding-system 'auto-coding-functions)))))
(defun set-auto-coding (filename size)
"Return coding system for a file FILENAME of which SIZE bytes follow point.
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
From: Eli Zaretskii @ 2017-12-04 20:28 UTC
To: Stefan Monnier; +Cc: a.s, 20623, sledergerber
> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: Alain Schneble <a.s@realize.ch>, Simon Ledergerber <sledergerber@gmx.net>, 20623@debbugs.gnu.org, Eli Zaretskii <eliz@gnu.org>
> Date: Mon, 04 Dec 2017 12:38:57 -0500
>
> This said, we should indeed fix this bug.
Agreed.
> Not sure how to Do It Right but least this specific problem should be
> fixable with a patch along the lines of the one below (guaranteed 100%
> untested).
Isn't it better to fix this in sgml-xml-auto-coding-function? That's
where the root cause is, AFAIU.
And I don't understand the comment about latin-1-mac: I don't think we
have such problems in Emacs. The -with-signature variety is
different, because it is not about EOL format.
Thanks.
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
From: Stefan Monnier @ 2017-12-04 21:08 UTC
To: Eli Zaretskii; +Cc: a.s, 20623, sledergerber
> Isn't it better to fix this in sgml-xml-auto-coding-function? That's
> where the root cause is, AFAIU.
I'd expect the same problem would affect all other uses.
> And I don't understand the comment about latin-1-mac: I don't think we
> have such problems in Emacs. The -with-signature variety is
> different, because it is not about EOL format.
You might be right, but I don't know where/how this is handled.
Stefan
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
From: Eli Zaretskii @ 2017-12-10 19:17 UTC
To: Stefan Monnier; +Cc: a.s, 20623, sledergerber
> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: rgm@gnu.org, a.s@realize.ch, sledergerber@gmx.net, 20623@debbugs.gnu.org
> Date: Mon, 04 Dec 2017 16:08:14 -0500
>
> > Isn't it better to fix this in sgml-xml-auto-coding-function? That's
> > where the root cause is, AFAIU.
>
> I'd expect the same problem would affect all other uses.
Not sure what you meant by "all other uses". Could you please
elaborate?
> > And I don't understand the comment about latin-1-mac: I don't think we
> > have such problems in Emacs. The -with-signature variety is
> > different, because it is not about EOL format.
>
> You might be right, but I don't know where/how this is handled.
I would like to propose the following alternative patch, which accepts
utf-8-with-signature and utf-8-hfs as variants of utf-8 for the
purposes of encoding of XML files. Comments? Do we want a similar
treatment for UTF-16? (That doesn't seem to be required by the bug
report, and UTF-16 in XML files is non-standard anyway. But what
about HTML?)
diff --git a/lisp/international/mule.el b/lisp/international/mule.el
index 857fa80..5ff1acf 100644
--- a/lisp/international/mule.el
+++ b/lisp/international/mule.el
@@ -2493,7 +2493,17 @@ sgml-xml-auto-coding-function
(let* ((match (match-string 1))
(sym (intern (downcase match))))
(if (coding-system-p sym)
- sym
+ ;; If the encoding tag is UTF-8 and the buffer's
+ ;; encoding is one of the variants of UTF-8, use the
+ ;; buffer's encoding. This allows, e.g., saving an
+ ;; XML file as UTF-8 with BOM when the tag says UTF-8.
+ (if (and (coding-system-equal 'utf-8
+ (coding-system-type sym))
+ (coding-system-equal sym
+ (coding-system-type
+ buffer-file-coding-system)))
+ buffer-file-coding-system
+ sym)
(message "Warning: unknown coding system \"%s\"" match)
nil))
;; Files without an encoding tag should be UTF-8. But users
@@ -2506,7 +2516,8 @@ sgml-xml-auto-coding-function
(coding-system-base
(detect-coding-region (point-min) size t)))))
;; Pure ASCII always comes back as undecided.
- (if (memq detected '(utf-8 undecided))
+ (if (memq detected
+ '(utf-8 utf-8-with-signature utf-8-hfs undecided))
'utf-8
(warn "File contents detected as %s.
Consider adding an encoding attribute to the xml declaration,
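
As an illustration of what the first hunk relies on, here is a minimal
sketch, separate from the patch above and assuming a stock Emacs where
these coding systems are defined. The BOM and non-BOM flavours share one
coding type and differ only in their properties, which is why the test
on coding-system-type can treat them as variants of the same encoding.

```elisp
;; Both names report the same coding type.
(coding-system-type 'utf-8)                     ; => utf-8
(coding-system-type 'utf-8-with-signature)      ; => utf-8

;; The :bom property is what sets the "with signature" flavour apart.
(coding-system-get 'utf-8-with-signature :bom)  ; => t

;; `coding-system-base' strips only the EOL suffix, so the detected
;; base of a BOM-carrying file is not plain `utf-8'.
(coding-system-base 'utf-8-with-signature-dos)  ; => utf-8-with-signature
```

The last form shows why the second hunk has to list the variants
explicitly: comparing the detected base against utf-8 alone would not
match a file that starts with a BOM.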
^ permalink raw reply related [flat|nested] 34+ messages in thread
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
2017-12-10 19:17 ` Eli Zaretskii
@ 2017-12-15 9:08 ` Eli Zaretskii
2018-08-01 18:07 ` bug#20623: XML and HTML files with encoding/charset="utf-8" declaration lose " Glenn Morris
2018-08-11 12:45 ` bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose " Stefan Monnier
2 siblings, 0 replies; 34+ messages in thread
From: Eli Zaretskii @ 2017-12-15 9:08 UTC (permalink / raw)
To: monnier; +Cc: sledergerber, a.s, 20623-done
> Date: Sun, 10 Dec 2017 21:17:00 +0200
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: a.s@realize.ch, 20623@debbugs.gnu.org, sledergerber@gmx.net
>
> I would like to propose the following alternative patch, which accepts
> utf-8-with-signature and utf-8-hfs as variants of utf-8 for the
> purposes of encoding of XML files. Comments? Do we want a similar
> treatment for UTF-16? (That doesn't seem to be required by the bug
> report, and UTF-16 in XML files is non-standard anyway. But what
> about HTML?)
No further comments, so I've pushed the change and I'm marking this
bug done.
^ permalink raw reply [flat|nested] 34+ messages in thread
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration lose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
2017-12-10 19:17 ` Eli Zaretskii
2017-12-15 9:08 ` Eli Zaretskii
@ 2018-08-01 18:07 ` Glenn Morris
2018-08-01 18:41 ` Eli Zaretskii
2018-08-11 12:45 ` bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose " Stefan Monnier
2 siblings, 1 reply; 34+ messages in thread
From: Glenn Morris @ 2018-08-01 18:07 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: sledergerber, a.s, Stefan Monnier, 20623
The HTML (not XML) case specified in the original report
("Now the steps for HTML" in https://debbugs.gnu.org/20623#5) and in
https://bugs.debian.org/883434 seems unfixed.
^ permalink raw reply [flat|nested] 34+ messages in thread
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration lose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
2018-08-01 18:07 ` bug#20623: XML and HTML files with encoding/charset="utf-8" declaration lose " Glenn Morris
@ 2018-08-01 18:41 ` Eli Zaretskii
2018-08-07 19:14 ` Glenn Morris
0 siblings, 1 reply; 34+ messages in thread
From: Eli Zaretskii @ 2018-08-01 18:41 UTC (permalink / raw)
To: Glenn Morris; +Cc: sledergerber, a.s, monnier, 20623
> From: Glenn Morris <rgm@gnu.org>
> Cc: Stefan Monnier <monnier@iro.umontreal.ca>, 20623@debbugs.gnu.org, a.s@realize.ch, sledergerber@gmx.net
> Date: Wed, 01 Aug 2018 14:07:28 -0400
>
> The HTML (not XML) case specified in the original report
> ("Now the steps for HTML" in https://debbugs.gnu.org/20623#5) and in
> https://bugs.debian.org/883434 seems unfixed.
Should it be?
^ permalink raw reply [flat|nested] 34+ messages in thread
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration lose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
2018-08-01 18:41 ` Eli Zaretskii
@ 2018-08-07 19:14 ` Glenn Morris
0 siblings, 0 replies; 34+ messages in thread
From: Glenn Morris @ 2018-08-07 19:14 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: sledergerber, a.s, monnier, 20623
Eli Zaretskii wrote:
>> The HTML (not XML) case specified in the original report
>> ("Now the steps for HTML" in https://debbugs.gnu.org/20623#5) and in
>> https://bugs.debian.org/883434 seems unfixed.
>
> Should it be?
I think this is a bug that should be fixed, yes (if that is the question).
^ permalink raw reply [flat|nested] 34+ messages in thread
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
2017-12-04 17:38 ` Stefan Monnier
2017-12-04 20:28 ` Eli Zaretskii
@ 2018-08-08 9:47 ` Vincent Lefevre
2018-08-08 14:45 ` Stefan Monnier
2018-08-11 9:15 ` Eli Zaretskii
1 sibling, 2 replies; 34+ messages in thread
From: Vincent Lefevre @ 2018-08-08 9:47 UTC (permalink / raw)
To: Stefan Monnier; +Cc: Alain Schneble, 20623, Simon Ledergerber
On 2017-12-04 12:38:57 -0500, Stefan Monnier wrote:
> > Now reported with "fix this or get removed from the distribution"
> > severity at <https://bugs.debian.org/883434>.
>
> I'm curious to see if the OP's "grave" severity settings will stick.
> "Grave" is defined in https://www.debian.org/Bugs/Developer#severities as:
>
> makes the package in question unusable or mostly so, or causes data
> loss, or introduces a security hole allowing access to the accounts
> of users who use the package.
>
> The only part that could arguably apply is "causes data loss", but even
> that is stretching the meaning of those words, I think.
Actually there's the issue that the coding system (in Emacs sense)
is changed, but also the fact that this change is invisible to the
user (mainly because the BOM is usually not visible), which makes
the issue even worse. Basically, this is invisible data corruption.
Even though only three bytes are removed, this introduces breakage in
other applications, and it can take the user much time to find
the cause.
Emacs should not change the coding system when not needed, and when
it needs to, it must make sure to have a confirmation from the user.
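
Because the BOM is not displayed, one way to check whether a saved file
still carries it is to look at the raw bytes. The following is a minimal
sketch; the helper name my/file-starts-with-utf8-bom-p is hypothetical
and not part of Emacs. It reads at most the first three bytes of the
file without decoding and compares them with the UTF-8 signature
EF BB BF.

```elisp
(defun my/file-starts-with-utf8-bom-p (file)
  "Return non-nil if FILE begins with the UTF-8 BOM bytes EF BB BF.
Hypothetical helper for illustration only."
  (with-temp-buffer
    ;; Work on raw bytes so nothing is decoded or stripped.
    (set-buffer-multibyte nil)
    ;; Read only bytes 0..3 of the file, literally.
    (insert-file-contents-literally file nil 0 3)
    ;; Turn the unibyte string into a list of byte values and compare.
    (equal (append (buffer-string) nil) '(#xEF #xBB #xBF))))
```

Calling it on a file before and after saving from Emacs shows whether
the signature survived the save.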
--
Vincent Lefèvre <vincent@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)
^ permalink raw reply [flat|nested] 34+ messages in thread
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
2018-08-08 9:47 ` Vincent Lefevre
@ 2018-08-08 14:45 ` Stefan Monnier
2018-08-11 9:15 ` Eli Zaretskii
1 sibling, 0 replies; 34+ messages in thread
From: Stefan Monnier @ 2018-08-08 14:45 UTC (permalink / raw)
To: Vincent Lefevre; +Cc: Alain Schneble, 20623, Simon Ledergerber
> Actually there's the issue that the coding system (in Emacs sense)
> is changed, but also the fact that this change is invisible to the
> user (mainly because the BOM is usually not visible), which makes
> the issue even worse. Basically, this is invisible data corruption.
> Even though only three bytes are removed, this introduces breakage in
> other applications, and it can take the user much time to find
> the cause.
>
> Emacs should not change the coding system when not needed, and when
> it needs to, it must make sure to have a confirmation from the user.
FWIW, I agree: I don't think it qualifies as Debian's definition of
"grave", but there is no doubt that it's a bug and that we should
fix it.
Stefan
^ permalink raw reply [flat|nested] 34+ messages in thread
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
2018-08-08 9:47 ` Vincent Lefevre
2018-08-08 14:45 ` Stefan Monnier
@ 2018-08-11 9:15 ` Eli Zaretskii
2018-08-11 10:13 ` Vincent Lefevre
1 sibling, 1 reply; 34+ messages in thread
From: Eli Zaretskii @ 2018-08-11 9:15 UTC (permalink / raw)
To: Vincent Lefevre; +Cc: a.s, monnier, 20623-done, sledergerber
> Date: Wed, 8 Aug 2018 11:47:48 +0200
> From: Vincent Lefevre <vincent@vinc17.net>
> Cc: Glenn Morris <rgm@gnu.org>, Simon Ledergerber <sledergerber@gmx.net>,
> Eli Zaretskii <eliz@gnu.org>, Alain Schneble <a.s@realize.ch>,
> 20623@debbugs.gnu.org
>
> On 2017-12-04 12:38:57 -0500, Stefan Monnier wrote:
> > > Now reported with "fix this or get removed from the distribution"
> > > severity at <https://bugs.debian.org/883434>.
> >
> > I'm curious to see if the OP's "grave" severity settings will stick.
> > "Grave" is defined in https://www.debian.org/Bugs/Developer#severities as:
> >
> > makes the package in question unusable or mostly so, or causes data
> > loss, or introduces a security hole allowing access to the accounts
> > of users who use the package.
> >
> > The only part that could arguably apply is "causes data loss", but even
> > that is stretching the meaning of those words, I think.
>
> Actually there's the issue that the coding system (in Emacs sense)
> is changed, but also the fact that this change is invisible to the
> user (mainly because the BOM is usually not visible), which makes
> the issue even worse. Basically, this is invisible data corruption.
> Even though only three bytes are removed, this introduces breakage in
> other applications, and it can take the user much time to find
> the cause.
>
> Emacs should not change the coding system when not needed, and when
> it needs to, it must make sure to have a confirmation from the user.
I agree with the last paragraph, so I've now fixed the remaining issue
of this bug (with HTML files) on the emacs-26 branch.
However, I would respectfully request that in the future bug reports
be accurate and fair in the assigned severity, and in particular make
sure that the severity matches the actual behavior as judged
objectively.
In this case, I cannot but express my extreme surprise to see such a
minor issue described as "grave". The alleged data loss is minor, if
it exists at all (the BOM is not data important for the user, nor data
whose loss cannot be easily repaired). The unspecified "breakage in
other applications" cannot be considered without the missing details,
but in general I'd be surprised to hear about modern applications
(browsers?) that really need a BOM in UTF-8 encoded HTML files to the
degree that the lack of BOM causes them to "break" in some way; if
they do, it could arguably be a bug in those applications.
Bottom line: artificially and unreasonably increasing the severity
level doesn't help the motivation to fix the bug, and if anything, has
the opposite effect of ignoring the source of the bug report as not
serious. I'm sure we don't want that, certainly not for bugs reported
by Debian.
Thanks.
^ permalink raw reply [flat|nested] 34+ messages in thread
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
2018-08-11 9:15 ` Eli Zaretskii
@ 2018-08-11 10:13 ` Vincent Lefevre
2018-08-11 10:45 ` Eli Zaretskii
0 siblings, 1 reply; 34+ messages in thread
From: Vincent Lefevre @ 2018-08-11 10:13 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: a.s, monnier, 20623, sledergerber
On 2018-08-11 12:15:31 +0300, Eli Zaretskii wrote:
> In this case, I cannot but express my extreme surprise to see such a
> minor issue described as "grave". The alleged data loss is minor, if
> it exists at all (the BOM is not data important for the user,
You're completely wrong. The presence of BOM or not is very important
for some applications, such as Firefox (not to determine the charset,
but the MIME type of local files).
> nor data whose loss cannot be easily repaired).
It can be repaired, but the problem is that the user doesn't know
what's going on and this breaks things. If some package removed
the execute permission of some utility in /bin, this would also
be a grave bug, though it can easily be repaired.
> The unspecified "breakage in
> other applications" cannot be considered without the missing details,
> but in general I'd be surprised to hear about modern applications
> (browsers?) that really need a BOM in UTF-8 encoded HTML files to the
> degree that the lack of BOM causes them to "break" in some way; if
> they do, it could arguably be a bug in those applications.
Firefox. And that's actually the way I detected the bug, after
hours of trying to find why it was behaving in an inconsistent way.
--
Vincent Lefèvre <vincent@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)
^ permalink raw reply [flat|nested] 34+ messages in thread
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
2018-08-11 10:13 ` Vincent Lefevre
@ 2018-08-11 10:45 ` Eli Zaretskii
2018-08-11 15:41 ` Vincent Lefevre
0 siblings, 1 reply; 34+ messages in thread
From: Eli Zaretskii @ 2018-08-11 10:45 UTC (permalink / raw)
To: Vincent Lefevre; +Cc: a.s, monnier, 20623, sledergerber
> Date: Sat, 11 Aug 2018 12:13:41 +0200
> From: Vincent Lefevre <vincent@vinc17.net>
> Cc: monnier@iro.umontreal.ca, rgm@gnu.org, sledergerber@gmx.net,
> a.s@realize.ch, 20623@debbugs.gnu.org
>
> On 2018-08-11 12:15:31 +0300, Eli Zaretskii wrote:
> > In this case, I cannot but express my extreme surprise to see such a
> > minor issue described as "grave". The alleged data loss is minor, if
> > it exists at all (the BOM is not data important for the user,
>
> You're completely wrong. The presence of BOM or not is very important
> for some applications, such as Firefox (not to determine the charset,
> but the MIME type of local files).
Please provide the details, including the use case, if possible. I'm
still in the dark regarding the importance of the BOM in UTF-8 encoded
HTML stuff.
> It can be repaired, but the problem is that the user doesn't know
> what's going on and this breaks things.
I agree about the user not knowing, but that doesn't yet qualify as
"data loss", which has a widely accepted meaning.
> If some package removed the execute permission of some utility in
> /bin, this would also be a grave bug, though it can easily be
> repaired.
Well, I disagree about the "grave" part, because that means the
package is unusable, causes data loss, or introduces a security hole
allowing access to the user account. None of that is true in the case
in point.
^ permalink raw reply [flat|nested] 34+ messages in thread
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
2017-12-10 19:17 ` Eli Zaretskii
2017-12-15 9:08 ` Eli Zaretskii
2018-08-01 18:07 ` bug#20623: XML and HTML files with encoding/charset="utf-8" declaration lose " Glenn Morris
@ 2018-08-11 12:45 ` Stefan Monnier
2018-08-11 13:54 ` Eli Zaretskii
2 siblings, 1 reply; 34+ messages in thread
From: Stefan Monnier @ 2018-08-11 12:45 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: a.s, 20623, sledergerber
>> > Isn't it better to fix this in sgml-xml-auto-coding-function? That's
>> > where the root cause is, AFAIU.
>> I'd expect the same problem would affect all other uses.
> Not sure what you meant by "all other uses". Could you please
> elaborate?
Your commit ec6f588940e51013435408a456c10d33ddf98fb2 answers that
question: at least sgml-html-meta-auto-coding-function is one of those
"other uses".
> > And I don't understand the comment about latin-1-mac: I don't think we
> > have such problems in Emacs. The -with-signature variety is
> > different, because it is not about EOL format.
> You might be right, but I don't know where/how this is handled.
I still don't know where the EOL part is handled.
Stefan
^ permalink raw reply [flat|nested] 34+ messages in thread
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
2018-08-11 12:45 ` bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose " Stefan Monnier
@ 2018-08-11 13:54 ` Eli Zaretskii
2018-08-12 0:04 ` Stefan Monnier
0 siblings, 1 reply; 34+ messages in thread
From: Eli Zaretskii @ 2018-08-11 13:54 UTC (permalink / raw)
To: Stefan Monnier; +Cc: a.s, 20623, sledergerber
> From: Stefan Monnier <monnier@IRO.UMontreal.CA>
> Cc: rgm@gnu.org, a.s@realize.ch, 20623@debbugs.gnu.org, sledergerber@gmx.net
> Date: Sat, 11 Aug 2018 08:45:15 -0400
>
> > > And I don't understand the comment about latin-1-mac: I don't think we
> > > have such problems in Emacs. The -with-signature variety is
> > > different, because it is not about EOL format.
> > You might be right, but I don't know where/how this is handled.
>
> I still don't know where the EOL part is handled.
If you tell me what you mean by "handled" in this context, I might
be able to help you understand where that happens.
^ permalink raw reply [flat|nested] 34+ messages in thread
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
2018-08-11 10:45 ` Eli Zaretskii
@ 2018-08-11 15:41 ` Vincent Lefevre
2018-08-11 16:27 ` Eli Zaretskii
2018-08-12 0:11 ` Stefan Monnier
0 siblings, 2 replies; 34+ messages in thread
From: Vincent Lefevre @ 2018-08-11 15:41 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: a.s, monnier, 20623, sledergerber
On 2018-08-11 13:45:17 +0300, Eli Zaretskii wrote:
> > Date: Sat, 11 Aug 2018 12:13:41 +0200
> > From: Vincent Lefevre <vincent@vinc17.net>
> > Cc: monnier@iro.umontreal.ca, rgm@gnu.org, sledergerber@gmx.net,
> > a.s@realize.ch, 20623@debbugs.gnu.org
> >
> > On 2018-08-11 12:15:31 +0300, Eli Zaretskii wrote:
> > > In this case, I cannot but express my extreme surprise to see such a
> > > minor issue described as "grave". The alleged data loss is minor, if
> > > it exists at all (the BOM is not data important for the user,
> >
> > You're completely wrong. The presence of BOM or not is very important
> > for some applications, such as Firefox (not to determine the charset,
> > but the MIME type of local files).
>
> Please provide the details, including the use case, if possible. I'm
> still in the dark regarding the importance of the BOM in UTF-8 encoded
> HTML stuff.
https://bugzilla.mozilla.org/show_bug.cgi?id=1422889
for HTML. Wontfix because of:
https://mimesniff.spec.whatwg.org/#mime-type-sniffing-algorithm
For text/plain only (but this is another example that BOM can matter
in practice), there's
https://bugzilla.mozilla.org/show_bug.cgi?id=1071816
(which is a bug that should be fixed).
> > It can be repaired, but the problem is that the user doesn't know
> > what's going on and this breaks things.
>
> I agree about the user not knowing, but that doesn't yet qualify as
> "data loss", which has a widely accepted meaning.
This is data corruption, which is a form of data loss, because some
information is lost in the process (I recall that Emacs does not
provide any information to the user about this transformation).
--
Vincent Lefèvre <vincent@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)
^ permalink raw reply [flat|nested] 34+ messages in thread
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
2018-08-11 15:41 ` Vincent Lefevre
@ 2018-08-11 16:27 ` Eli Zaretskii
2018-08-12 1:34 ` Vincent Lefevre
2018-08-12 0:11 ` Stefan Monnier
1 sibling, 1 reply; 34+ messages in thread
From: Eli Zaretskii @ 2018-08-11 16:27 UTC (permalink / raw)
To: Vincent Lefevre; +Cc: a.s, monnier, 20623, sledergerber
> Date: Sat, 11 Aug 2018 17:41:01 +0200
> From: Vincent Lefevre <vincent@vinc17.net>
> Cc: monnier@iro.umontreal.ca, rgm@gnu.org, sledergerber@gmx.net,
> a.s@realize.ch, 20623@debbugs.gnu.org
>
> > > You're completely wrong. The presence of BOM or not is very important
> > > for some applications, such as Firefox (not to determine the charset,
> > > but the MIME type of local files).
> >
> > Please provide the details, including the use case, if possible. I'm
> > still in the dark regarding the importance of the BOM in UTF-8 encoded
> > HTML stuff.
>
> https://bugzilla.mozilla.org/show_bug.cgi?id=1422889
>
> for HTML. Wontfix because of:
>
> https://mimesniff.spec.whatwg.org/#mime-type-sniffing-algorithm
>
> For text/plain only (but this is another example that BOM can matter
> in practice), there's
>
> https://bugzilla.mozilla.org/show_bug.cgi?id=1071816
>
> (which is a bug that should be fixed).
Maybe I'm missing something, but none of these issues describes the
situation in this bug report, namely: an HTML file with an explicit
charset= tag, with or without a BOM. In fact, the first of these
issues happens only in files that _do_ have a BOM, so you could say
that Emacs did you a favor by removing it ;-)
> > I agree about the user not knowing, but that doesn't yet qualify as
> > "data loss", which has a widely accepted meaning.
>
> This is data corruption, which is a form of data loss, because some
> information is lost in the process (I recall that Emacs does not
> provide any information to the user about this transformation).
That is the most inclusive interpretation of "data loss" I've ever
seen. "Some information is lost" is nowhere near what "grave bug"
means by "data loss", so I don't think "grave" applies here.
Anyway, the Emacs issue is now fixed.
^ permalink raw reply [flat|nested] 34+ messages in thread
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
2018-08-11 13:54 ` Eli Zaretskii
@ 2018-08-12 0:04 ` Stefan Monnier
2018-08-12 19:07 ` Eli Zaretskii
0 siblings, 1 reply; 34+ messages in thread
From: Stefan Monnier @ 2018-08-12 0:04 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: a.s, 20623, sledergerber
>> > > And I don't understand the comment about latin-1-mac: I don't think we
>> > > have such problems in Emacs. The -with-signature variety is
>> > > different, because it is not about EOL format.
>> > You might be right, but I don't know where/how this is handled.
>> I still don't know where the EOL part is handled.
> If you tell me what you mean by "handled" in this context, I might
> be able to help you understand where that happens.
You say that the code I wrote is not needed to make sure an existing
latin-1-mac setting isn't overwritten by a latin-1 guess. I expect this
is indeed true (otherwise I think we'd have had bug-reports about it),
but I don't know where that is handled.
Stefan
^ permalink raw reply [flat|nested] 34+ messages in thread
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
2018-08-11 15:41 ` Vincent Lefevre
2018-08-11 16:27 ` Eli Zaretskii
@ 2018-08-12 0:11 ` Stefan Monnier
2018-08-12 0:58 ` Vincent Lefevre
1 sibling, 1 reply; 34+ messages in thread
From: Stefan Monnier @ 2018-08-12 0:11 UTC (permalink / raw)
To: Vincent Lefevre; +Cc: a.s, 20623, sledergerber
>> > > In this case, I cannot but express my extreme surprise to see such a
>> > > minor issue described as "grave". The alleged data loss is minor, if
>> > > it exists at all (the BOM is not data important for the user,
>> > You're completely wrong. The presence of BOM or not is very important
>> > for some applications, such as Firefox (not to determine the charset,
>> > but the MIME type of local files).
>> Please provide the details, including the use case, if possible. I'm
>> still in the dark regarding the importance of the BOM in UTF-8 encoded
>> HTML stuff.
> https://bugzilla.mozilla.org/show_bug.cgi?id=1422889
I don't see any data loss there.
Stefan
PS: We can all cook up contrived scenarios where this bug leads to a serious
loss of data. But in that case a problem in C-n which makes it move to
the wrong column would also qualify as "grave" because I can just as
well construct contrived scenarios where such a bug leads to a serious
loss of data.
^ permalink raw reply [flat|nested] 34+ messages in thread
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
2018-08-12 0:11 ` Stefan Monnier
@ 2018-08-12 0:58 ` Vincent Lefevre
0 siblings, 0 replies; 34+ messages in thread
From: Vincent Lefevre @ 2018-08-12 0:58 UTC (permalink / raw)
To: Stefan Monnier; +Cc: a.s, 20623, sledergerber
On 2018-08-11 20:11:49 -0400, Stefan Monnier wrote:
> >> Please provide the details, including the use case, if possible. I'm
> >> still in the dark regarding the importance of the BOM in UTF-8 encoded
> >> HTML stuff.
> > https://bugzilla.mozilla.org/show_bug.cgi?id=1422889
>
> I don't see any data loss there.
Because it is not there, it is in Emacs. What the Mozilla bug shows
is that the presence of BOM or not is important and yields very
different behavior.
--
Vincent Lefèvre <vincent@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)
^ permalink raw reply [flat|nested] 34+ messages in thread
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
2018-08-11 16:27 ` Eli Zaretskii
@ 2018-08-12 1:34 ` Vincent Lefevre
0 siblings, 0 replies; 34+ messages in thread
From: Vincent Lefevre @ 2018-08-12 1:34 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: a.s, monnier, 20623, sledergerber
On 2018-08-11 19:27:33 +0300, Eli Zaretskii wrote:
> Maybe I'm missing something, but none of these issues describes the
> situation in this bug report, namely: an HTML file with an explicit
> charset= tag, with or without a BOM. In fact, the first of these
> issues happens only in files that _do_ have a BOM, so you could say
> that Emacs did you a favor by removing it ;-)
In theory yes, but in practice, one does not want that when doing
file-loading tests. Otherwise the tests become meaningless. This
is just like a spellchecker that automatically corrects spelling
mistakes without the user's knowledge (even when it is right): if
the goal is to write something about a spelling mistake, the
text becomes meaningless. The same happens when some characters are
changed automatically to improve typography (as can be seen with some
blog software when posting, with no preview), since this can make
the text meaningless, e.g. when it is code.
> Anyway, the Emacs issue is now fixed.
OK, thanks.
--
Vincent Lefèvre <vincent@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)
^ permalink raw reply [flat|nested] 34+ messages in thread
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
2018-08-12 0:04 ` Stefan Monnier
@ 2018-08-12 19:07 ` Eli Zaretskii
0 siblings, 0 replies; 34+ messages in thread
From: Eli Zaretskii @ 2018-08-12 19:07 UTC (permalink / raw)
To: Stefan Monnier; +Cc: a.s, 20623, sledergerber
> From: Stefan Monnier <monnier@IRO.UMontreal.CA>
> Cc: rgm@gnu.org, a.s@realize.ch, 20623@debbugs.gnu.org, sledergerber@gmx.net
> Date: Sat, 11 Aug 2018 20:04:05 -0400
>
> You say that the code I wrote is not needed to make sure an existing
> latin-1-mac setting isn't overwritten by a latin-1 guess. I expect this
> is indeed true (otherwise I think we'd have had bug-reports about it),
> but I don't know where that is handled.
It is handled inside select-safe-coding-system, which first invokes
find-auto-coding to decide which encoding is appropriate (and as part
of that, looks at XML or HTML charset information declared by the
text), and then, if the encoding it got doesn't specify the EOL
conversion, it uses the EOL conversion from the buffer's encoding or
from the appropriate defaults.
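
To see the first step of that sequence in isolation, find-auto-coding can
be called directly on text that carries a declaration. This is a minimal
sketch; the helper name my/declared-coding and the file name "test.txt"
are illustrative assumptions. The call returns a cons of the chosen
coding system and the source of the decision. The coding system reported
can depend on the Emacs version and on the buffer's own encoding, so no
specific value is shown for it here.

```elisp
(defun my/declared-coding (text)
  "Return what `find-auto-coding' reports for TEXT.
Hypothetical helper for illustration only."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    ;; SIZE is the amount of text following point to examine.
    (find-auto-coding "test.txt" (buffer-size))))

;; An XML declaration, looked at by `sgml-xml-auto-coding-function'.
(my/declared-coding "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n")

;; An HTML meta tag, looked at by `sgml-html-meta-auto-coding-function'.
(my/declared-coding
 "<html><head><meta charset=\"utf-8\"></head></html>\n")
```

In both cases the second element of the result should name
auto-coding-functions as the source, which is the path discussed in this
thread.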
Since XML/HTML charset tags never specify the EOL conversion, it
follows that Emacs will never override the EOL conversion of the
buffer, it will only use the charset for "text conversion".
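
The EOL part of this can be sketched as follows. This is not the code in
select-safe-coding-system, only a minimal illustration of the idea under
the assumption that the declared encoding has no EOL conversion of its
own; the helper name my/merge-eol is hypothetical.

```elisp
(defun my/merge-eol (declared buffer-cs)
  "Return DECLARED, taking the EOL conversion from BUFFER-CS if needed.
Hypothetical helper for illustration only."
  (let ((declared-eol (coding-system-eol-type declared))
        (buffer-eol (coding-system-eol-type buffer-cs)))
    ;; A vector means the EOL conversion is unspecified; an integer
    ;; (0 = unix, 1 = dos, 2 = mac) means it is fixed.
    (if (and (vectorp declared-eol) (integerp buffer-eol))
        (coding-system-change-eol-conversion declared buffer-eol)
      declared)))

;; The DOS line endings of the buffer survive, the signature does not.
(my/merge-eol 'utf-8 'utf-8-with-signature-dos)  ; => utf-8-dos
```

The result keeps the buffer's EOL format but not its BOM, which matches
the behaviour originally reported: only the EOL conversion was carried
over from the buffer's encoding.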
I hope this answers your question.
^ permalink raw reply [flat|nested] 34+ messages in thread