unofficial mirror of bug-gnu-emacs@gnu.org 
From: Benjamin Riefenstahl <b.riefenstahl@turtle-trading.net>
To: 61005@debbugs.gnu.org
Subject: bug#61005: 28.1.91; Encoding not detected in HTML files inside archives
Date: Sun, 22 Jan 2023 14:24:07 +0100	[thread overview]
Message-ID: <877cxeem88.fsf@turtle-trading.net> (raw)
In-Reply-To: <87bkmqempd.fsf@turtle-trading.net> (Benjamin Riefenstahl's message of "Sun, 22 Jan 2023 14:13:50 +0100")

[-- Attachment #1: Type: text/plain, Size: 200 bytes --]

Here is the promised patch, made against master.

I have also included a small test suite for
sgml-html-meta-auto-coding-function, if you want it.  If there is
interest, I could add one for sgml-xml-auto-coding-function as well.
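
For readers who want to see the guard in isolation, here is a minimal
sketch of the check the first patch adds.  It assumes, as the patch
comment states, that passing the symbol `charset' to
coding-system-equal signals an error because `charset' is a
coding-system type rather than a coding system.  The helper name
`my/safe-utf-8-type-p' is hypothetical and is not part of Emacs or of
the patch.

```elisp
;; Sketch only: `my/safe-utf-8-type-p' is a hypothetical helper used
;; for illustration; it is not part of Emacs or of the attached patch.
(defun my/safe-utf-8-type-p (type)
  "Return non-nil if TYPE, a `coding-system-type' result, equals `utf-8'.
TYPE may be the symbol `charset', which names a coding-system type
but is not itself a coding system, so test for it up front."
  (and (not (eq type 'charset))
       (coding-system-equal 'utf-8 type)))

;; Inspect the type of an encoding that an HTML file might declare.
(coding-system-type 'windows-1255)

;; Without the guard, the comparison is assumed to signal an error;
;; `condition-case' captures it instead of aborting.
(condition-case err
    (coding-system-equal 'utf-8 'charset)
  (error (message "signaled: %S" err)))

;; With the guard, `charset' is rejected before the comparison runs.
(my/safe-utf-8-type-p 'charset)
(my/safe-utf-8-type-p 'utf-8)
```

The patch tests for `charset' explicitly.  A broader guard based on
coding-system-p would be a different design and is not what the patch
does.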


[-- Attachment #2: 0001-Fix-decoding-HTML-files-from-archives.patch --]
[-- Type: text/x-diff, Size: 1391 bytes --]

From 95b63baf1bf411422c61b76470abb1aa681f2db2 Mon Sep 17 00:00:00 2001
From: Benjamin Riefenstahl <b.riefenstahl@turtle-trading.net>
Date: Tue, 17 Jan 2023 20:08:15 +0200
Subject: [PATCH 1/2] Fix decoding HTML files from archives

* lisp/international/mule.el (sgml-html-meta-auto-coding-function):
Avoid signaling an error from coding-system-equal when the HTML meta
tag specifies an encoding whose type is 'charset'.  (Bug#61005)

This is the same fix as in #df7ed10e for
sgml-xml-auto-coding-function.
---
 lisp/international/mule.el | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/lisp/international/mule.el b/lisp/international/mule.el
index 4f6addea387..9480213be9a 100644
--- a/lisp/international/mule.el
+++ b/lisp/international/mule.el
@@ -2539,6 +2539,10 @@ sgml-html-meta-auto-coding-function
                   (bfcs-type
                    (coding-system-type buffer-file-coding-system)))
               (if (and enable-multibyte-characters
+                       ;; 'charset' will signal an error in
+                       ;; coding-system-equal, since it isn't a
+                       ;; coding-system.  So test that up front.
+                       (not (equal sym-type 'charset))
                        (coding-system-equal 'utf-8 sym-type)
                        (coding-system-equal 'utf-8 bfcs-type))
                   buffer-file-coding-system
-- 
2.30.2


[-- Attachment #3: 0002-Add-test-suite-for-sgml-html-meta-auto-coding-functi.patch --]
[-- Type: text/x-diff, Size: 3803 bytes --]

From 29996e07c23c9716f731dde224c8ca47e321e697 Mon Sep 17 00:00:00 2001
From: Benjamin Riefenstahl <b.riefenstahl@turtle-trading.net>
Date: Tue, 17 Jan 2023 20:13:39 +0200
Subject: [PATCH 2/2] Add test suite for sgml-html-meta-auto-coding-function

* test/lisp/international/mule-tests.el (sgml-html-meta-pre)
(sgml-html-meta-post, sgml-html-meta-run, sgml-html-meta-utf-8)
(sgml-html-meta-windows-hebrew, sgml-html-meta-none)
(sgml-html-meta-unknown-coding, sgml-html-meta-no-pre)
(sgml-html-meta-no-post-less-than-10lines)
(sgml-html-meta-no-post-10lines, sgml-html-meta-utf-8-with-bom): Add.
---
 test/lisp/international/mule-tests.el | 66 +++++++++++++++++++++++++++
 1 file changed, 66 insertions(+)

diff --git a/test/lisp/international/mule-tests.el b/test/lisp/international/mule-tests.el
index 4f70b275848..6e23d8c5421 100644
--- a/test/lisp/international/mule-tests.el
+++ b/test/lisp/international/mule-tests.el
@@ -70,6 +70,72 @@ mule-hz
   ;; The chinese-hz encoding is not ASCII compatible.
   (should-not (coding-system-get 'chinese-hz :ascii-compatible-p)))
 
+;;; Testing `sgml-html-meta-auto-coding-function'.
+
+(defconst sgml-html-meta-pre "<!doctype html><html><head>"
+  "The beginning of a minimal HTML document.")
+
+(defconst sgml-html-meta-post "</head></html>"
+  "The end of a minimal HTML document.")
+
+(defun sgml-html-meta-run (coding-system)
+  "Run `sgml-html-meta-auto-coding-function' on a minimal HTML.
+When CODING-SYSTEM is not nil, insert it, wrapped in a '<meta>'
+element.  When CODING-SYSTEM contains HTML meta characters or
+white space, insert it as-is, without additional formatting.  Use
+the variables `sgml-html-meta-pre' and `sgml-html-meta-post' to
+provide HTML fragments.  Some tests override those variables."
+  (with-temp-buffer
+    (insert sgml-html-meta-pre
+            (cond ((not coding-system)
+                   "")
+                  ((string-match "[<>'\"\n ]" coding-system)
+                   coding-system)
+                  (t
+                   (format "<meta charset='%s'>" coding-system)))
+            sgml-html-meta-post)
+    (goto-char (point-min))
+    (sgml-html-meta-auto-coding-function (- (point-max) (point-min)))))
+
+(ert-deftest sgml-html-meta-utf-8 ()
+  "Baseline: UTF-8."
+  (should (eq 'utf-8 (sgml-html-meta-run "utf-8"))))
+
+(ert-deftest sgml-html-meta-windows-hebrew ()
+  "A non-Unicode charset."
+  (should (eq 'windows-1255 (sgml-html-meta-run "windows-1255"))))
+
+(ert-deftest sgml-html-meta-none ()
+  (should (eq nil (sgml-html-meta-run nil))))
+
+(ert-deftest sgml-html-meta-unknown-coding ()
+  (should (eq nil (sgml-html-meta-run "XXX"))))
+
+(ert-deftest sgml-html-meta-no-pre ()
+  "Without the prefix, so not HTML."
+  (let ((sgml-html-meta-pre ""))
+    (should (eq nil (sgml-html-meta-run "utf-8")))))
+
+(ert-deftest sgml-html-meta-no-post-less-than-10lines ()
+  "No '</head>', detect charset in the first 10 lines."
+  (let ((sgml-html-meta-post ""))
+    (should (eq 'utf-8 (sgml-html-meta-run
+                        (concat "\n\n\n\n\n\n\n\n\n"
+                                "<meta charset='utf-8'>"))))))
+
+(ert-deftest sgml-html-meta-no-post-10lines ()
+  "No '</head>', do not detect charset after the first 10 lines."
+  (let ((sgml-html-meta-post ""))
+    (should (eq nil (sgml-html-meta-run
+                     (concat "\n\n\n\n\n\n\n\n\n\n"
+                             "<meta charset='utf-8'>"))))))
+
+(ert-deftest sgml-html-meta-utf-8-with-bom ()
+  "Requesting 'UTF-8' does not override `utf-8-with-signature'.
+Check fix for Bug#20623."
+  (let ((buffer-file-coding-system 'utf-8-with-signature))
+    (should (eq 'utf-8-with-signature (sgml-html-meta-run "utf-8")))))
+
 ;; Stop "Local Variables" above causing confusion when visiting this file.
 \f
 
-- 
2.30.2



Thread overview: 3+ messages
2023-01-22 13:13 bug#61005: 28.1.91; Encoding not detected in HTML files inside archives Benjamin Riefenstahl
2023-01-22 13:24 ` Benjamin Riefenstahl [this message]
2023-01-22 14:09   ` Eli Zaretskii
