* bug#61005: 28.1.91; Encoding not detected in HTML files inside archives
@ 2023-01-22 13:13 Benjamin Riefenstahl
2023-01-22 13:24 ` Benjamin Riefenstahl
0 siblings, 1 reply; 3+ messages in thread
From: Benjamin Riefenstahl @ 2023-01-22 13:13 UTC
To: 61005
[-- Attachment #1: Type: text/plain, Size: 483 bytes --]
Problem
----
* Given an HTML file with charset "windows-1255".
* Opening the file from disk detects the encoding correctly.
* Opening a ZIP archive with the same file inside and then opening the
HTML archive member does not detect the encoding; instead, the coding
system for saving is the default, according to M-x
describe-coding-system.
Attached are two files test.html and test.zip. Call "emacs -Q test.html
test.zip" and press RET on the archive member to reproduce.
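
The on-disk half of the recipe can also be checked from Lisp. This is only a
sketch: the helper name bug61005-visit-test-file and the file contents are
assumptions made for illustration (the attached test.html is not reproduced
here), and the archive half is left out because building a ZIP file needs an
external tool.

```elisp
;; Hypothetical helper for illustration; not part of the bug report.
(defun bug61005-visit-test-file ()
  "Write a small windows-1255 HTML file, visit it, return its coding system."
  (let ((file (make-temp-file "bug61005-" nil ".html"))
        ;; Force the encoding used when the temporary file is written.
        (coding-system-for-write 'windows-1255))
    (unwind-protect
        (progn
          (with-temp-file file
            (insert "<!doctype html><html><head>"
                    "<meta charset='windows-1255'>"
                    "</head><body>\u05e9\u05dc\u05d5\u05dd</body></html>\n"))
          ;; Visit the file the normal way and report what was detected.
          (with-current-buffer (find-file-noselect file)
            (prog1 buffer-file-coding-system
              (kill-buffer (current-buffer)))))
      (delete-file file))))
```

Evaluating (bug61005-visit-test-file) in "emacs -Q" should report a variant
of windows-1255, matching the "from disk" behavior described above.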
[-- Attachment #2: test.html --]
[-- Type: text/html, Size: 172 bytes --]
[-- Attachment #3: test.zip --]
[-- Type: application/zip, Size: 279 bytes --]
[-- Attachment #4: Type: text/plain, Size: 5626 bytes --]
Solution
----
The problem seems to be the function
sgml-html-meta-auto-coding-function. It is missing a condition similar
to the one added to sgml-xml-auto-coding-function in commit #df7ed10e
in 2018.
modified lisp/international/mule.el
@@ -2539,6 +2539,10 @@ sgml-html-meta-auto-coding-function
(bfcs-type
(coding-system-type buffer-file-coding-system)))
(if (and enable-multibyte-characters
+ ;; 'charset' will signal an error in
+ ;; coding-system-equal, since it isn't a
+ ;; coding-system. So test that up front.
+ (not (equal sym-type 'charset))
(coding-system-equal 'utf-8 sym-type)
(coding-system-equal 'utf-8 bfcs-type))
buffer-file-coding-system
I will send this as a patch as soon as I have a bug number to mention in
the commit message.
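
To see why the guard matters, here is a minimal sketch of the same test
outside mule.el. The helper name bug61005-tag-is-utf-8-p is hypothetical, and
like the patch it only guards against the 'charset' type, so it is meant to be
called with utf-8 or with a charset-type coding system such as windows-1255.

```elisp
;; Hypothetical helper for illustration; not part of mule.el.
(defun bug61005-tag-is-utf-8-p (sym)
  "Return non-nil if coding system SYM has the type `utf-8'.
Mirrors the guarded test in the patch: `and' stops at the first
nil, so `coding-system-equal' is never called with `charset'."
  (let ((sym-type (coding-system-type sym)))
    (and (not (equal sym-type 'charset))
         (coding-system-equal 'utf-8 sym-type))))

;; The type of windows-1255 is the symbol `charset', which is not
;; itself the name of a coding system.
(coding-system-type 'windows-1255)
(coding-system-p 'charset)

;; With the guard, the charset case returns nil instead of reaching
;; `coding-system-equal'.
(bug61005-tag-is-utf-8-p 'windows-1255)
(bug61005-tag-is-utf-8-p 'utf-8)
```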
----
In GNU Emacs 28.1.91 (build 1, x86_64-pc-linux-gnu, GTK+ Version 3.24.24, cairo version 1.16.0)
of 2022-08-29 built on arrian
Repository revision: f4168b8143008b787a11366462c928d761e90dd0
Repository branch: emacs-28
Windowing system distributor 'The X.Org Foundation', version 11.0.12011000
System Description: Debian GNU/Linux 11 (bullseye)
Configured features:
ACL CAIRO DBUS FREETYPE GIF GLIB GMP GNUTLS GPM GSETTINGS HARFBUZZ JPEG
JSON LCMS2 LIBOTF LIBSELINUX LIBXML2 M17N_FLT MODULES NOTIFY INOTIFY
PDUMPER PNG RSVG SECCOMP SOUND THREADS TIFF TOOLKIT_SCROLL_BARS X11 XDBE
XIM XPM GTK3 ZLIB
Important settings:
value of $LANG: en_US.UTF-8
locale-coding-system: utf-8-unix
Major mode: Dired by date
Minor modes in effect:
shell-dirtrack-mode: t
desktop-save-mode: t
display-time-mode: t
xclip-mode: t
xterm-mouse-mode: t
delete-selection-mode: t
cua-mode: t
display-battery-mode: t
tooltip-mode: t
global-eldoc-mode: t
show-paren-mode: t
electric-indent-mode: t
mouse-wheel-mode: t
menu-bar-mode: t
file-name-shadow-mode: t
global-font-lock-mode: t
font-lock-mode: t
blink-cursor-mode: t
auto-composition-mode: t
auto-encryption-mode: t
auto-compression-mode: t
buffer-read-only: t
column-number-mode: t
line-number-mode: t
transient-mark-mode: t
Load-path shadows:
~/Projects/ttf-mode/arc-mode-compat hides ~/emacs/arc-mode-compat
/home/benny/.emacs.d/elpa/transient-20210723.1601/transient hides /usr/local/share/emacs/28.1.91/lisp/transient
/home/benny/.emacs.d/elpa/dictionary-20201001.1727/dictionary hides /usr/local/share/emacs/28.1.91/lisp/net/dictionary
Features:
(shadow sort mail-extr emacsbug message rmc puny rfc822 mml mml-sec epa
epg rfc6068 epg-config gnus-util rmail rmail-loaddefs mm-decode
mm-bodies mm-encode mailabbrev gmm-utils mailheader arc-mode
archive-mode benny-images dirtrack shell pcomplete misearch
multi-isearch thai-util thai-word lao-util enriched view tabify
benny-auto-insert ttf-glyphs rng-xsd xsd-regexp rng-cmpct rng-nxml
rng-valid rng-loc rng-uri rng-parse nxml-parse rng-match rng-dt rng-util
rng-pttrn nxml-ns nxml-mode nxml-outln nxml-rap sgml-mode facemenu dom
nxml-util nxml-enc xmltok mule-util jka-compr dired-aux time-date
bug-reference imenu desktop frameset highline benny-calendar-cfg
ange-ftp generic-x autoinsert cc-mode cc-fonts cc-guess cc-menus
cc-styles cc-align cc-cmds cc-engine cc-vars cc-defs ps-print
ps-print-loaddefs ps-def lpr advice cl-extra help-mode dired
dired-loaddefs derived benny-x-clipboard disp-table time server protbuf
xclip term/xterm xterm xt-mouse cal-china lunar solar cal-dst cal-bahai
cal-islam cal-hebrew holidays hol-loaddefs vc-git diff-mode easy-mmode
vc-dispatcher vc-fossil diary-lib diary-loaddefs cal-menu calendar
cal-loaddefs delsel grep compile text-property-search comint ansi-color
ring cua-base cus-load format-spec battery dbus xml sendmail mail-utils
.loaddefs benny-tools autoload radix-tree lisp-mnt mail-parse rfc2231
rfc2047 rfc2045 mm-util ietf-drums mail-prsvr edmacro kmacro info
package browse-url url url-proxy url-privacy url-expand url-methods
url-history url-cookie url-domsuf url-util mailcap url-handlers
url-parse auth-source cl-seq eieio eieio-core cl-macs eieio-loaddefs
password-cache json subr-x map url-vars seq byte-opt gv bytecomp
byte-compile cconv cl-loaddefs cl-lib iso-transl tooltip eldoc paren
electric uniquify ediff-hook vc-hooks lisp-float-type elisp-mode mwheel
term/x-win x-win term/common-win x-dnd tool-bar dnd fontset image
regexp-opt fringe tabulated-list replace newcomment text-mode lisp-mode
prog-mode register page tab-bar menu-bar rfn-eshadow isearch easymenu
timer select scroll-bar mouse jit-lock font-lock syntax font-core
term/tty-colors frame minibuffer cl-generic cham georgian utf-8-lang
misc-lang vietnamese tibetan thai tai-viet lao korean japanese eucjp-ms
cp51932 hebrew greek romanian slovak czech european ethiopic indian
cyrillic chinese composite emoji-zwj charscript charprop case-table
epa-hook jka-cmpr-hook help simple abbrev obarray cl-preloaded nadvice
button loaddefs faces cus-face macroexp files window text-properties
overlay sha1 md5 base64 format env code-pages mule custom widget
hashtable-print-readable backquote threads dbusbind inotify lcms2
dynamic-setting system-font-setting font-render-setting cairo
move-toolbar gtk x-toolkit x multi-tty make-network-process emacs)
Memory information:
((conses 16 273770 13520)
(symbols 48 18619 1)
(strings 32 66582 2920)
(string-bytes 1 2318045)
(vectors 16 39996)
(vector-slots 8 1131973 174560)
(floats 8 762 66)
(intervals 56 1039 60)
(buffers 992 50))
* bug#61005: 28.1.91; Encoding not detected in HTML files inside archives
2023-01-22 13:13 bug#61005: 28.1.91; Encoding not detected in HTML files inside archives Benjamin Riefenstahl
@ 2023-01-22 13:24 ` Benjamin Riefenstahl
2023-01-22 14:09 ` Eli Zaretskii
0 siblings, 1 reply; 3+ messages in thread
From: Benjamin Riefenstahl @ 2023-01-22 13:24 UTC
To: 61005
[-- Attachment #1: Type: text/plain, Size: 200 bytes --]
The promised patch. This is against master.
Also a small test-suite for sgml-html-meta-auto-coding-function, if you
want that. If you care, I could also add one for
sgml-xml-auto-coding-function.
[-- Attachment #2: 0001-Fix-decoding-HTML-files-from-archives.patch --]
[-- Type: text/x-diff, Size: 1391 bytes --]
From 95b63baf1bf411422c61b76470abb1aa681f2db2 Mon Sep 17 00:00:00 2001
From: Benjamin Riefenstahl <b.riefenstahl@turtle-trading.net>
Date: Tue, 17 Jan 2023 20:08:15 +0200
Subject: [PATCH 1/2] Fix decoding HTML files from archives
* lisp/international/mule.el (sgml-html-meta-auto-coding-function): Avoid
signaling an error from coding-system-equal when the HTML meta tag
specifies an encoding whose type is 'charset'. (Bug#61005)
This is the same fix as in #df7ed10e for
sgml-xml-auto-coding-function.
---
lisp/international/mule.el | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/lisp/international/mule.el b/lisp/international/mule.el
index 4f6addea387..9480213be9a 100644
--- a/lisp/international/mule.el
+++ b/lisp/international/mule.el
@@ -2539,6 +2539,10 @@ sgml-html-meta-auto-coding-function
(bfcs-type
(coding-system-type buffer-file-coding-system)))
(if (and enable-multibyte-characters
+ ;; 'charset' will signal an error in
+ ;; coding-system-equal, since it isn't a
+ ;; coding-system. So test that up front.
+ (not (equal sym-type 'charset))
(coding-system-equal 'utf-8 sym-type)
(coding-system-equal 'utf-8 bfcs-type))
buffer-file-coding-system
--
2.30.2
[-- Attachment #3: 0002-Add-test-suite-for-sgml-html-meta-auto-coding-functi.patch --]
[-- Type: text/x-diff, Size: 3803 bytes --]
From 29996e07c23c9716f731dde224c8ca47e321e697 Mon Sep 17 00:00:00 2001
From: Benjamin Riefenstahl <b.riefenstahl@turtle-trading.net>
Date: Tue, 17 Jan 2023 20:13:39 +0200
Subject: [PATCH 2/2] Add test suite for sgml-html-meta-auto-coding-function
* test/lisp/international/mule-tests.el (sgml-html-meta-pre)
(sgml-html-meta-post, sgml-html-meta-run, sgml-html-meta-utf-8)
(sgml-html-meta-windows-hebrew, sgml-html-meta-none)
(sgml-html-meta-unknown-coding, sgml-html-meta-no-pre)
(sgml-html-meta-no-post-less-than-10lines)
(sgml-html-meta-no-post-10lines, sgml-html-meta-utf-8-with-bom): Add.
---
test/lisp/international/mule-tests.el | 66 +++++++++++++++++++++++++++
1 file changed, 66 insertions(+)
diff --git a/test/lisp/international/mule-tests.el b/test/lisp/international/mule-tests.el
index 4f70b275848..6e23d8c5421 100644
--- a/test/lisp/international/mule-tests.el
+++ b/test/lisp/international/mule-tests.el
@@ -70,6 +70,72 @@ mule-hz
;; The chinese-hz encoding is not ASCII compatible.
(should-not (coding-system-get 'chinese-hz :ascii-compatible-p)))
+;;; Testing `sgml-html-meta-auto-coding-function'.
+
+(defconst sgml-html-meta-pre "<!doctype html><html><head>"
+ "The beginning of a minimal HTML document.")
+
+(defconst sgml-html-meta-post "</head></html>"
+ "The end of a minimal HTML document.")
+
+(defun sgml-html-meta-run (coding-system)
+ "Run `sgml-html-meta-auto-coding-function' on a minimal HTML.
+When CODING-SYSTEM is not nil, insert it, wrapped in a '<meta>'
+element. When CODING-SYSTEM contains HTML meta characters or
+white space, insert it as-is, without additional formatting. Use
+the variables `sgml-html-meta-pre' and `sgml-html-meta-post' to
+provide HTML fragments. Some tests override those variables."
+ (with-temp-buffer
+ (insert sgml-html-meta-pre
+ (cond ((not coding-system)
+ "")
+ ((string-match "[<>'\"\n ]" coding-system)
+ coding-system)
+ (t
+ (format "<meta charset='%s'>" coding-system)))
+ sgml-html-meta-post)
+ (goto-char (point-min))
+ (sgml-html-meta-auto-coding-function (- (point-max) (point-min)))))
+
+(ert-deftest sgml-html-meta-utf-8 ()
+ "Baseline: UTF-8."
+ (should (eq 'utf-8 (sgml-html-meta-run "utf-8"))))
+
+(ert-deftest sgml-html-meta-windows-hebrew ()
+ "A non-Unicode charset."
+ (should (eq 'windows-1255 (sgml-html-meta-run "windows-1255"))))
+
+(ert-deftest sgml-html-meta-none ()
+ (should (eq nil (sgml-html-meta-run nil))))
+
+(ert-deftest sgml-html-meta-unknown-coding ()
+ (should (eq nil (sgml-html-meta-run "XXX"))))
+
+(ert-deftest sgml-html-meta-no-pre ()
+ "Without the prefix, so not HTML."
+ (let ((sgml-html-meta-pre ""))
+ (should (eq nil (sgml-html-meta-run "utf-8")))))
+
+(ert-deftest sgml-html-meta-no-post-less-than-10lines ()
+ "No '</head>', detect charset in the first 10 lines."
+ (let ((sgml-html-meta-post ""))
+ (should (eq 'utf-8 (sgml-html-meta-run
+ (concat "\n\n\n\n\n\n\n\n\n"
+ "<meta charset='utf-8'>"))))))
+
+(ert-deftest sgml-html-meta-no-post-10lines ()
+ "No '</head>', do not detect charset after the first 10 lines."
+ (let ((sgml-html-meta-post ""))
+ (should (eq nil (sgml-html-meta-run
+ (concat "\n\n\n\n\n\n\n\n\n\n"
+ "<meta charset='utf-8'>"))))))
+
+(ert-deftest sgml-html-meta-utf-8-with-bom ()
+ "Requesting 'UTF-8' does not override `utf-8-with-signature'.
+Check fix for Bug#20623."
+ (let ((buffer-file-coding-system 'utf-8-with-signature))
+ (should (eq 'utf-8-with-signature (sgml-html-meta-run "utf-8")))))
+
;; Stop "Local Variables" above causing confusion when visiting this file.
\f
--
2.30.2
* bug#61005: 28.1.91; Encoding not detected in HTML files inside archives
2023-01-22 13:24 ` Benjamin Riefenstahl
@ 2023-01-22 14:09 ` Eli Zaretskii
0 siblings, 0 replies; 3+ messages in thread
From: Eli Zaretskii @ 2023-01-22 14:09 UTC
To: Benjamin Riefenstahl; +Cc: 61005-done
> From: Benjamin Riefenstahl <b.riefenstahl@turtle-trading.net>
> Date: Sun, 22 Jan 2023 14:24:07 +0100
>
> The promised patch. This is against master.
>
> Also a small test-suite for sgml-html-meta-auto-coding-function, if you
> want that. If you care, I could also add one for
> sgml-xml-auto-coding-function.
Thanks, I installed this on the emacs-29 branch, and I'm closing the
bug.