all messages for Emacs-related lists mirrored at yhetil.org
* bug#61005: 28.1.91; Encoding not detected in HTML files inside archives
@ 2023-01-22 13:13 Benjamin Riefenstahl
  2023-01-22 13:24 ` Benjamin Riefenstahl
  0 siblings, 1 reply; 3+ messages in thread
From: Benjamin Riefenstahl @ 2023-01-22 13:13 UTC
  To: 61005

[-- Attachment #1: Type: text/plain, Size: 483 bytes --]

Problem
----

* Given an HTML file with charset "windows-1255". 

* Opening the file from disk detects the encoding correctly.

* Opening a ZIP archive that contains the same file and then opening
  the HTML archive member does not detect the encoding.  Instead, the
  coding system for saving is the default, according to M-x
  describe-coding-system.

Attached are two files test.html and test.zip.  Call "emacs -Q test.html
test.zip" and press RET on the archive member to reproduce.
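
For readers without the attachments, a comparable test file could
probably be generated along the following lines.  This is only a
sketch: the actual content of the attached test.html is not shown in
this thread, and the function name bug61005-write-test-html as well as
the body text are made up for illustration.

```elisp
(defun bug61005-write-test-html (directory)
  "Write a small windows-1255 encoded HTML file into DIRECTORY.
Return the name of the file written.  Hypothetical helper for
illustration; this is not the attachment from the report."
  (let ((file (expand-file-name "test.html" directory))
        ;; Force the on-disk encoding regardless of the defaults.
        (coding-system-for-write 'windows-1255))
    (write-region
     (concat "<!doctype html><html><head>"
             "<meta charset=\"windows-1255\">"
             "</head><body>"
             ;; Four Hebrew letters, so the file has non-ASCII bytes.
             "\u05e9\u05dc\u05d5\u05dd"
             "</body></html>\n")
     ;; START is a string, so END is ignored; the `silent' VISIT
     ;; argument suppresses the "Wrote file" message.
     nil file nil 'silent)
    file))

;; Example: (bug61005-write-test-html temporary-file-directory)
;; Then, outside Emacs and assuming a zip program is installed:
;;   zip test.zip test.html
```

Binding coding-system-for-write keeps the sketch independent of the
user's locale, which matters here because the report is about a
non-UTF-8 charset on a UTF-8 system.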


[-- Attachment #2: test.html --]
[-- Type: text/html, Size: 172 bytes --]

[-- Attachment #3: test.zip --]
[-- Type: application/zip, Size: 279 bytes --]

[-- Attachment #4: Type: text/plain, Size: 5626 bytes --]


Solution
----

The problem seems to be in the function
sgml-html-meta-auto-coding-function.  It is missing a condition similar
to the one added to sgml-xml-auto-coding-function with commit
#df7ed10e in 2018.

modified   lisp/international/mule.el
@@ -2539,6 +2539,10 @@ sgml-html-meta-auto-coding-function
                   (bfcs-type
                    (coding-system-type buffer-file-coding-system)))
               (if (and enable-multibyte-characters
+                       ;; 'charset' will signal an error in
+                       ;; coding-system-equal, since it isn't a
+                       ;; coding-system.  So test that up front.
+                       (not (equal sym-type 'charset))
                        (coding-system-equal 'utf-8 sym-type)
                        (coding-system-equal 'utf-8 bfcs-type))
                   buffer-file-coding-system

I will send this as a patch as soon as I have a bug number to mention in
the commit message.
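
The error that the added condition guards against can probably be
seen in isolation with something like the sketch below.  It assumes
that the type of windows-1255 is 'charset', which is what the diff
above implies; the two bug61005-demo-* functions are hypothetical
helpers written for this illustration and are not part of mule.el.

```elisp
(defun bug61005-demo-compare (coding-system)
  "Compare the type of CODING-SYSTEM with `utf-8', without a guard.
Return (ok . RESULT) if `coding-system-equal' returns normally, or
\(error . DATA) if it signals an error.  Hypothetical helper."
  (let ((sym-type (coding-system-type coding-system)))
    (condition-case err
        (cons 'ok (coding-system-equal 'utf-8 sym-type))
      (error (cons 'error err)))))

(defun bug61005-demo-compare-guarded (coding-system)
  "Like `bug61005-demo-compare', but test for `charset' up front.
This mirrors the condition added by the diff.  Hypothetical helper."
  (let ((sym-type (coding-system-type coding-system)))
    (and (not (equal sym-type 'charset))
         (coding-system-equal 'utf-8 sym-type))))

;; Try these in M-x ielm or with C-x C-e:
;;   (coding-system-type 'windows-1255)
;;   (bug61005-demo-compare 'utf-8)
;;   (bug61005-demo-compare 'windows-1255)
;;   (bug61005-demo-compare-guarded 'windows-1255)
```

The guard only covers the 'charset' type, the same as the diff; it is
not meant as a general check for every value that coding-system-type
can return.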

----

In GNU Emacs 28.1.91 (build 1, x86_64-pc-linux-gnu, GTK+ Version 3.24.24, cairo version 1.16.0)
 of 2022-08-29 built on arrian
Repository revision: f4168b8143008b787a11366462c928d761e90dd0
Repository branch: emacs-28
Windowing system distributor 'The X.Org Foundation', version 11.0.12011000
System Description: Debian GNU/Linux 11 (bullseye)

Configured features:
ACL CAIRO DBUS FREETYPE GIF GLIB GMP GNUTLS GPM GSETTINGS HARFBUZZ JPEG
JSON LCMS2 LIBOTF LIBSELINUX LIBXML2 M17N_FLT MODULES NOTIFY INOTIFY
PDUMPER PNG RSVG SECCOMP SOUND THREADS TIFF TOOLKIT_SCROLL_BARS X11 XDBE
XIM XPM GTK3 ZLIB

Important settings:
  value of $LANG: en_US.UTF-8
  locale-coding-system: utf-8-unix

Major mode: Dired by date

Minor modes in effect:
  shell-dirtrack-mode: t
  desktop-save-mode: t
  display-time-mode: t
  xclip-mode: t
  xterm-mouse-mode: t
  delete-selection-mode: t
  cua-mode: t
  display-battery-mode: t
  tooltip-mode: t
  global-eldoc-mode: t
  show-paren-mode: t
  electric-indent-mode: t
  mouse-wheel-mode: t
  menu-bar-mode: t
  file-name-shadow-mode: t
  global-font-lock-mode: t
  font-lock-mode: t
  blink-cursor-mode: t
  auto-composition-mode: t
  auto-encryption-mode: t
  auto-compression-mode: t
  buffer-read-only: t
  column-number-mode: t
  line-number-mode: t
  transient-mark-mode: t

Load-path shadows:
~/Projects/ttf-mode/arc-mode-compat hides ~/emacs/arc-mode-compat
/home/benny/.emacs.d/elpa/transient-20210723.1601/transient hides /usr/local/share/emacs/28.1.91/lisp/transient
/home/benny/.emacs.d/elpa/dictionary-20201001.1727/dictionary hides /usr/local/share/emacs/28.1.91/lisp/net/dictionary

Features:
(shadow sort mail-extr emacsbug message rmc puny rfc822 mml mml-sec epa
epg rfc6068 epg-config gnus-util rmail rmail-loaddefs mm-decode
mm-bodies mm-encode mailabbrev gmm-utils mailheader arc-mode
archive-mode benny-images dirtrack shell pcomplete misearch
multi-isearch thai-util thai-word lao-util enriched view tabify
benny-auto-insert ttf-glyphs rng-xsd xsd-regexp rng-cmpct rng-nxml
rng-valid rng-loc rng-uri rng-parse nxml-parse rng-match rng-dt rng-util
rng-pttrn nxml-ns nxml-mode nxml-outln nxml-rap sgml-mode facemenu dom
nxml-util nxml-enc xmltok mule-util jka-compr dired-aux time-date
bug-reference imenu desktop frameset highline benny-calendar-cfg
ange-ftp generic-x autoinsert cc-mode cc-fonts cc-guess cc-menus
cc-styles cc-align cc-cmds cc-engine cc-vars cc-defs ps-print
ps-print-loaddefs ps-def lpr advice cl-extra help-mode dired
dired-loaddefs derived benny-x-clipboard disp-table time server protbuf
xclip term/xterm xterm xt-mouse cal-china lunar solar cal-dst cal-bahai
cal-islam cal-hebrew holidays hol-loaddefs vc-git diff-mode easy-mmode
vc-dispatcher vc-fossil diary-lib diary-loaddefs cal-menu calendar
cal-loaddefs delsel grep compile text-property-search comint ansi-color
ring cua-base cus-load format-spec battery dbus xml sendmail mail-utils
.loaddefs benny-tools autoload radix-tree lisp-mnt mail-parse rfc2231
rfc2047 rfc2045 mm-util ietf-drums mail-prsvr edmacro kmacro info
package browse-url url url-proxy url-privacy url-expand url-methods
url-history url-cookie url-domsuf url-util mailcap url-handlers
url-parse auth-source cl-seq eieio eieio-core cl-macs eieio-loaddefs
password-cache json subr-x map url-vars seq byte-opt gv bytecomp
byte-compile cconv cl-loaddefs cl-lib iso-transl tooltip eldoc paren
electric uniquify ediff-hook vc-hooks lisp-float-type elisp-mode mwheel
term/x-win x-win term/common-win x-dnd tool-bar dnd fontset image
regexp-opt fringe tabulated-list replace newcomment text-mode lisp-mode
prog-mode register page tab-bar menu-bar rfn-eshadow isearch easymenu
timer select scroll-bar mouse jit-lock font-lock syntax font-core
term/tty-colors frame minibuffer cl-generic cham georgian utf-8-lang
misc-lang vietnamese tibetan thai tai-viet lao korean japanese eucjp-ms
cp51932 hebrew greek romanian slovak czech european ethiopic indian
cyrillic chinese composite emoji-zwj charscript charprop case-table
epa-hook jka-cmpr-hook help simple abbrev obarray cl-preloaded nadvice
button loaddefs faces cus-face macroexp files window text-properties
overlay sha1 md5 base64 format env code-pages mule custom widget
hashtable-print-readable backquote threads dbusbind inotify lcms2
dynamic-setting system-font-setting font-render-setting cairo
move-toolbar gtk x-toolkit x multi-tty make-network-process emacs)

Memory information:
((conses 16 273770 13520)
 (symbols 48 18619 1)
 (strings 32 66582 2920)
 (string-bytes 1 2318045)
 (vectors 16 39996)
 (vector-slots 8 1131973 174560)
 (floats 8 762 66)
 (intervals 56 1039 60)
 (buffers 992 50))


* bug#61005: 28.1.91; Encoding not detected in HTML files inside archives
  2023-01-22 13:13 bug#61005: 28.1.91; Encoding not detected in HTML files inside archives Benjamin Riefenstahl
@ 2023-01-22 13:24 ` Benjamin Riefenstahl
  2023-01-22 14:09   ` Eli Zaretskii
  0 siblings, 1 reply; 3+ messages in thread
From: Benjamin Riefenstahl @ 2023-01-22 13:24 UTC
  To: 61005

[-- Attachment #1: Type: text/plain, Size: 200 bytes --]

The promised patch.  This is against master.

Also a small test-suite for sgml-html-meta-auto-coding-function, if you
want that.  If you care, I could also add one for
sgml-xml-auto-coding-function.


[-- Attachment #2: 0001-Fix-decoding-HTML-files-from-archives.patch --]
[-- Type: text/x-diff, Size: 1391 bytes --]

From 95b63baf1bf411422c61b76470abb1aa681f2db2 Mon Sep 17 00:00:00 2001
From: Benjamin Riefenstahl <b.riefenstahl@turtle-trading.net>
Date: Tue, 17 Jan 2023 20:08:15 +0200
Subject: [PATCH 1/2] Fix decoding HTML files from archives

* lisp/international/mule.el (sgml-html-meta-auto-coding-function):
Avoid signaling an error from coding-system-equal when the HTML meta
tag specifies an encoding whose type is 'charset'.  (Bug#61005)

This is the same fix as in #df7ed10e for
sgml-xml-auto-coding-function.
---
 lisp/international/mule.el | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/lisp/international/mule.el b/lisp/international/mule.el
index 4f6addea387..9480213be9a 100644
--- a/lisp/international/mule.el
+++ b/lisp/international/mule.el
@@ -2539,6 +2539,10 @@ sgml-html-meta-auto-coding-function
                   (bfcs-type
                    (coding-system-type buffer-file-coding-system)))
               (if (and enable-multibyte-characters
+                       ;; 'charset' will signal an error in
+                       ;; coding-system-equal, since it isn't a
+                       ;; coding-system.  So test that up front.
+                       (not (equal sym-type 'charset))
                        (coding-system-equal 'utf-8 sym-type)
                        (coding-system-equal 'utf-8 bfcs-type))
                   buffer-file-coding-system
-- 
2.30.2


[-- Attachment #3: 0002-Add-test-suite-for-sgml-html-meta-auto-coding-functi.patch --]
[-- Type: text/x-diff, Size: 3803 bytes --]

From 29996e07c23c9716f731dde224c8ca47e321e697 Mon Sep 17 00:00:00 2001
From: Benjamin Riefenstahl <b.riefenstahl@turtle-trading.net>
Date: Tue, 17 Jan 2023 20:13:39 +0200
Subject: [PATCH 2/2] Add test suite for sgml-html-meta-auto-coding-function

* test/lisp/international/mule-tests.el (sgml-html-meta-pre)
(sgml-html-meta-post, sgml-html-meta-run, sgml-html-meta-utf-8)
(sgml-html-meta-windows-hebrew, sgml-html-meta-none)
(sgml-html-meta-unknown-coding, sgml-html-meta-no-pre)
(sgml-html-meta-no-post-less-than-10lines)
(sgml-html-meta-no-post-10lines, sgml-html-meta-utf-8-with-bom): Add.
---
 test/lisp/international/mule-tests.el | 66 +++++++++++++++++++++++++++
 1 file changed, 66 insertions(+)

diff --git a/test/lisp/international/mule-tests.el b/test/lisp/international/mule-tests.el
index 4f70b275848..6e23d8c5421 100644
--- a/test/lisp/international/mule-tests.el
+++ b/test/lisp/international/mule-tests.el
@@ -70,6 +70,72 @@ mule-hz
   ;; The chinese-hz encoding is not ASCII compatible.
   (should-not (coding-system-get 'chinese-hz :ascii-compatible-p)))
 
+;;; Testing `sgml-html-meta-auto-coding-function'.
+
+(defconst sgml-html-meta-pre "<!doctype html><html><head>"
+  "The beginning of a minimal HTML document.")
+
+(defconst sgml-html-meta-post "</head></html>"
+  "The end of a minimal HTML document.")
+
+(defun sgml-html-meta-run (coding-system)
+  "Run `sgml-html-meta-auto-coding-function' on a minimal HTML.
+When CODING-SYSTEM is not nil, insert it, wrapped in a '<meta>'
+element.  When CODING-SYSTEM contains HTML meta characters or
+white space, insert it as-is, without additional formatting.  Use
+the variables `sgml-html-meta-pre' and `sgml-html-meta-post' to
+provide HTML fragments.  Some tests override those variables."
+  (with-temp-buffer
+    (insert sgml-html-meta-pre
+            (cond ((not coding-system)
+                   "")
+                  ((string-match "[<>'\"\n ]" coding-system)
+                   coding-system)
+                  (t
+                   (format "<meta charset='%s'>" coding-system)))
+            sgml-html-meta-post)
+    (goto-char (point-min))
+    (sgml-html-meta-auto-coding-function (- (point-max) (point-min)))))
+
+(ert-deftest sgml-html-meta-utf-8 ()
+  "Baseline: UTF-8."
+  (should (eq 'utf-8 (sgml-html-meta-run "utf-8"))))
+
+(ert-deftest sgml-html-meta-windows-hebrew ()
+  "A non-Unicode charset."
+  (should (eq 'windows-1255 (sgml-html-meta-run "windows-1255"))))
+
+(ert-deftest sgml-html-meta-none ()
+  (should (eq nil (sgml-html-meta-run nil))))
+
+(ert-deftest sgml-html-meta-unknown-coding ()
+  (should (eq nil (sgml-html-meta-run "XXX"))))
+
+(ert-deftest sgml-html-meta-no-pre ()
+  "Without the prefix, so not HTML."
+  (let ((sgml-html-meta-pre ""))
+    (should (eq nil (sgml-html-meta-run "utf-8")))))
+
+(ert-deftest sgml-html-meta-no-post-less-than-10lines ()
+  "No '</head>', detect charset in the first 10 lines."
+  (let ((sgml-html-meta-post ""))
+    (should (eq 'utf-8 (sgml-html-meta-run
+                        (concat "\n\n\n\n\n\n\n\n\n"
+                                "<meta charset='utf-8'>"))))))
+
+(ert-deftest sgml-html-meta-no-post-10lines ()
+  "No '</head>', do not detect charset after the first 10 lines."
+  (let ((sgml-html-meta-post ""))
+    (should (eq nil (sgml-html-meta-run
+                     (concat "\n\n\n\n\n\n\n\n\n\n"
+                             "<meta charset='utf-8'>"))))))
+
+(ert-deftest sgml-html-meta-utf-8-with-bom ()
+  "Requesting 'UTF-8' does not override `utf-8-with-signature'.
+Check fix for Bug#20623."
+  (let ((buffer-file-coding-system 'utf-8-with-signature))
+    (should (eq 'utf-8-with-signature (sgml-html-meta-run "utf-8")))))
+
 ;; Stop "Local Variables" above causing confusion when visiting this file.
 \f
 
-- 
2.30.2



* bug#61005: 28.1.91; Encoding not detected in HTML files inside archives
  2023-01-22 13:24 ` Benjamin Riefenstahl
@ 2023-01-22 14:09   ` Eli Zaretskii
  0 siblings, 0 replies; 3+ messages in thread
From: Eli Zaretskii @ 2023-01-22 14:09 UTC
  To: Benjamin Riefenstahl; +Cc: 61005-done

> From: Benjamin Riefenstahl <b.riefenstahl@turtle-trading.net>
> Date: Sun, 22 Jan 2023 14:24:07 +0100
> 
> The promised patch.  This is against master.
> 
> Also a small test-suite for sgml-html-meta-auto-coding-function, if you
> want that.  If you care, I could also add one for
> sgml-xml-auto-coding-function.

Thanks, I installed this on the emacs-29 branch, and I'm closing the
bug.




