unofficial mirror of bug-gnu-emacs@gnu.org 
* bug#20623: XML and HTML files with encoding/charset="utf-8" declaration lose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
From: Simon Ledergerber @ 2015-05-21 18:50 UTC
  To: 20623

Hi

When editing XHTML and HTML files, I wanted to make sure the BOM was 
written to the file, to make it easier for the browser to detect the 
UTF-8 encoding. I therefore changed the coding system of the file buffer 
to utf-8-with-signature-dos (since I am working on a Windows system) 
before saving the file.

After some time I was surprised to find that the browser (IE11) did not 
report UTF-8 as the file's encoding. Checking a hexdump of my (X)HTML 
file, I saw that the BOM was indeed missing.

Apparently, when the string "utf-8" appears in a <meta charset="utf-8"> 
element (even a commented-out one, see below) or in an <?xml 
version="1.0" encoding="utf-8"?> declaration, Emacs switches the file 
coding system to utf-8 when it saves the file, even if 
utf-8-with-signature was specified explicitly before. This looks like a 
bug to me, because there is then no way to restore the BOM using Emacs.

I was not sure whether this is related to bug #8282, so I decided to 
report it (again).

My Emacs version is 24.5.1 (x86_64-unknown-cygwin) of 2015-04-10, on 
Windows 8.1 x64.

I run Emacs in text mode only (no GUI), inside a Cygwin console.

This is my .emacs.d/init.el:
(line-number-mode)
(column-number-mode)
(setq-default fill-column 80)
(setq-default buffer-file-coding-system 'utf-8-dos)
(setq-default indent-tabs-mode nil)

With XML, the problem can be reproduced in the most basic way with the 
following steps:

- Create a new file with C-x C-f in the current directory. Name it 
test.txt for example.

- Switch to fundamental mode with M-x fundamental-mode.

- Type the text '<?xml version="1.0"' (without the surrounding single 
quotes).

- Switch the coding system to one that includes the BOM: C-x RET f 
utf-8-with-signature-dos.

- Verify the current coding system with C-h C (Shift-c) RET: Yes, the 
coding system for the file buffer is as specified before.

- Type C-x k to kill the help buffer if necessary and save the file with 
C-x C-s.

- Check the file with a hex editor. Under the Cygwin Bash shell, 'od -Ax 
-t xCaz test.txt' will also do it: The UTF-8 BOM 'EF BB BF' was written 
at the beginning of the file.

- Complete the rest of the XML declaration as follows: ' encoding="utf-8"?>'

- Now save the file and check again: The coding system for the buffer 
has changed to utf-8-dos and the BOM has disappeared from the file!

Now the steps for HTML:

- Create a new file test1.txt in the current directory.

- Fill it with the following simple and yet incomplete HTML5 document:
<!doctype html>
<html>
     <head>
         <title>Test</title>
     </head>
     <body>
     </body>
</html>

- Change the coding system to utf-8-with-signature-dos and save the file.

- Verify that the coding system for the buffer is correct and the BOM is 
really written: Yes, it is.

- Insert the following *comment* between <head> and <title>: <!-- <meta 
charset="utf-8"> -->

- Save the file and verify: The coding system has changed to utf-8-dos 
and the BOM has vanished, even though the declaration is only a comment 
and has no effect!
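
One possible explanation for the comment case is that the declaration is 
found by a plain text search that knows nothing about HTML comments. 
This is only an assumption; the Python sketch below is not taken from 
the Emacs sources, and the pattern META_CHARSET and the function 
declared_charset are hypothetical. It merely shows that such a search 
finds the charset inside a comment as well.

```python
import re

# Hypothetical pattern for illustration only; not copied from Emacs.
META_CHARSET = re.compile(
    r'<meta\b[^>]*\bcharset\s*=\s*["\']?([A-Za-z0-9_-]+)',
    re.IGNORECASE,
)


def declared_charset(text):
    """Return the first charset named in a <meta ...> tag, or None.

    The search is purely textual, so it does not skip <!-- ... --> comments.
    """
    m = META_CHARSET.search(text)
    return m.group(1).lower() if m else None


DOC_WITH_COMMENT = """<!doctype html>
<html>
     <head>
         <!-- <meta charset="utf-8"> -->
         <title>Test</title>
     </head>
     <body>
     </body>
</html>
"""

DOC_WITHOUT_META = DOC_WITH_COMMENT.replace(
    '<!-- <meta charset="utf-8"> -->', ""
)

if __name__ == "__main__":
    print(declared_charset(DOC_WITH_COMMENT))
    print(declared_charset(DOC_WITHOUT_META))
```

With this naive search, the commented-out declaration is reported just 
like a real one, which would match the behaviour seen in the steps above.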

Regards

Simon

P. S. Information as reported by M-x report-emacs-bug:
In GNU Emacs 24.5.1 (x86_64-unknown-cygwin)
  of 2015-04-10 on desktop-new
Configured using:
  `configure
  --srcdir=/home/kbrown/src/cygemacs/emacs-24.5-1.x86_64/src/emacs-24.5
  --prefix=/usr --exec-prefix=/usr --localstatedir=/var --sysconfdir=/etc
  --docdir=/usr/share/doc/emacs --htmldir=/usr/share/doc/emacs/html -C
  --with-x=no 'CFLAGS=-ggdb -O2 -pipe -Wimplicit-function-declaration
  -fdebug-prefix-map=/home/kbrown/src/cygemacs/emacs-24.5-1.x86_64/build=/usr/src/debug/emacs-24.5-1
  -fdebug-prefix-map=/home/kbrown/src/cygemacs/emacs-24.5-1.x86_64/src/emacs-24.5=/usr/src/debug/emacs-24.5-1'
  CPPFLAGS= LDFLAGS='

Important settings:
   value of $LANG: en_US.UTF-8
   locale-coding-system: utf-8-unix

Major mode: Help

Minor modes in effect:
   tooltip-mode: t
   electric-indent-mode: t
   menu-bar-mode: t
   file-name-shadow-mode: t
   global-font-lock-mode: t
   font-lock-mode: t
   auto-composition-mode: t
   auto-encryption-mode: t
   auto-compression-mode: t
   buffer-read-only: t
   column-number-mode: t
   line-number-mode: t
   transient-mark-mode: t

Recent messages:
Beginning of buffer [3 times]
Saving file /cygdrive/c/users/.../html_basics/basic.xhtml...
Wrote /cygdrive/c/users/.../html_basics/basic.xhtml
Mark set [2 times]
Auto-saving...done
Mark set [2 times]
Saving file /cygdrive/c/users/.../html_basics/basic.xhtml...
Wrote /cygdrive/c/users/.../html_basics/basic.xhtml
No docstring slot for help-mode-setup
No docstring slot for help-mode-finish

Load-path shadows:
None found.

Features:
(shadow sort gnus-util mail-extr emacsbug message format-spec rfc822 mml
mml-sec mm-decode mm-bodies mm-encode mail-parse rfc2231 mailabbrev
gmm-utils mailheader sendmail rfc2047 rfc2045 ietf-drums mm-util
help-fns mail-prsvr mail-utils misearch multi-isearch mule-diag
help-mode easymenu regexp-opt sgml-mode xterm time-date tooltip electric
uniquify ediff-hook vc-hooks lisp-float-type tabulated-list newcomment
lisp-mode prog-mode register page menu-bar rfn-eshadow timer select
mouse jit-lock font-lock syntax facemenu font-core frame cham georgian
utf-8-lang misc-lang vietnamese tibetan thai tai-viet lao korean
japanese hebrew greek romanian slovak czech european ethiopic indian
cyrillic chinese case-table epa-hook jka-cmpr-hook help simple abbrev
minibuffer nadvice loaddefs button faces cus-face macroexp files
text-properties overlay sha1 md5 base64 format env code-pages mule
custom widget hashtable-print-readable backquote make-network-process
dbusbind gfilenotify multi-tty emacs)

Memory information:
((conses 16 81797 4691)
  (symbols 48 17091 0)
  (miscs 40 73 387)
  (strings 32 11233 4887)
  (string-bytes 1 291872)
  (vectors 16 7587)
  (vector-slots 8 342125 27930)
  (floats 8 57 393)
  (intervals 56 834 26)
  (buffers 960 21))
