unofficial mirror of bug-gnu-emacs@gnu.org 
* bug#73133: 29.2; EWW fails to render some webpages
@ 2024-09-08 20:52 Ganimard via Bug reports for GNU Emacs, the Swiss army knife of text editors
  2024-09-10  6:06 ` Jim Porter
  2024-10-23 10:43 ` Mattias Engdegård
  0 siblings, 2 replies; 35+ messages in thread
From: Ganimard via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2024-09-08 20:52 UTC
  To: 73133

To Whom it may concern,

I have recently discovered the website gastonle.ru, however it does not
render with Emacs Web Wowser.  It appears to be a relatively simple
website and I cannot see what would prohibit it from rendering.

I have also tried it on an Ubuntu 22.04.4 LTS distro running Emacs 28.1
but it also fails to render.  This therefore appears to be a bug in EWW.
---

In GNU Emacs 29.2 (build 1, aarch64-apple-darwin21.6.0, NS
appkit-2113.60 Version 12.6.6 (Build 21G646)) of 2024-01-19 built on
armbob.lan
Windowing system distributor 'Apple', version 10.3.2487
System Description:  macOS 14.2.1

Configured using:
'configure --with-ns '--enable-locallisppath=/Library/Application
Support/Emacs/${version}/site-lisp:/Library/Application
Support/Emacs/site-lisp' --with-modules 'CFLAGS=-DFD_SETSIZE=10000
-DDARWIN_UNLIMITED_SELECT' --with-x-toolkit=no'

Configured features:
ACL GLIB GMP GNUTLS JPEG JSON LIBXML2 MODULES NOTIFY KQUEUE NS PDUMPER
PNG RSVG SQLITE3 THREADS TIFF TOOLKIT_SCROLL_BARS TREE_SITTER ZLIB

Important settings:
  value of $LANG: en_NZ.UTF-8
  locale-coding-system: utf-8-unix

Major mode: Markdown

Minor modes in effect:
  yas-global-mode: t
  yas-minor-mode: t
  global-git-commit-mode: t
  magit-auto-revert-mode: t
  shell-dirtrack-mode: t
  server-mode: t
  TeX-PDF-mode: t
  TeX-source-correlate-mode: t
  global-display-line-numbers-mode: t
  display-line-numbers-mode: t
  whitespace-mode: t
  global-page-break-lines-mode: t
  override-global-mode: t
  tooltip-mode: t
  global-eldoc-mode: t
  eldoc-mode: t
  show-paren-mode: t
  electric-indent-mode: t
  mouse-wheel-mode: t
  file-name-shadow-mode: t
  global-font-lock-mode: t
  font-lock-mode: t
  blink-cursor-mode: t
  line-number-mode: t
  transient-mark-mode: t
  auto-composition-mode: t
  auto-encryption-mode: t
  auto-compression-mode: t

Load-path shadows:
/Users/ganimard/.emacs.d/elpa/transient-20230919.2146/transient hides /Applications/Emacs.app/Contents/Resources/lisp/transient

Features:
(shadow sort mail-extr emacsbug files-x vc-hg vc-bzr vc-src vc-sccs
vc-svn vc-cvs vc-rcs log-view vc bug-reference help-fns radix-tree
magit-patch magit-subtree magit-gitignore magit-ediff ediff ediff-merg
ediff-mult ediff-wind ediff-diff ediff-help ediff-init ediff-util
magit-extras face-remap misearch multi-isearch vc-git vc-dispatcher
markdown-mode color dired-aux disp-table hl-todo flycheck forth-mode
forth-spec forth-smie smie forth-syntax llvm-mode splunk-mode ess
lisp-mnt ess-utils ess-custom go-mode find-file ffap etags fileloop xref
rust-utils rust-mode rust-rustfmt rust-playpen rust-compile rust-cargo
yasnippet magit-submodule magit-blame magit-stash magit-reflog
magit-bisect magit-push magit-pull magit-fetch magit-clone magit-remote
magit-commit magit-sequence magit-notes magit-worktree magit-tag
magit-merge magit-branch magit-reset magit-files magit-refs magit-status
magit magit-repos magit-apply magit-wip magit-log which-func imenu
magit-diff smerge-mode diff diff-mode git-commit log-edit pcvs-util
add-log magit-core magit-autorevert autorevert magit-margin
magit-transient magit-process with-editor shell server magit-mode
transient magit-git magit-base magit-section cursor-sensor dash
auctex-latexmk latex latex-flymake flymake-proc flymake project compile
warnings tex-ispell tex-style tex texmathp latex-preview-pane doc-view
filenotify jka-compr image-mode exif auctex ebib ebib-reading-list
ebib-notes org-element org-persist xdg org-id org-refile org ob
ob-tangle ob-ref ob-lob ob-table ob-exp org-macro org-src ob-comint
org-pcomplete pcomplete comint ansi-osc ansi-color org-list org-footnote
org-faces org-entities noutline outline icons ob-emacs-lisp ob-core
ob-eval org-cycle org-table org-keys oc org-loaddefs find-func cal-menu
calendar cal-loaddefs ol org-fold org-fold-core org-compat ring avl-tree
generator org-version org-macs ebib-filters ebib-keywords ebib-utils
ebib-db message sendmail yank-media puny dired dired-loaddefs rfc822 mml
mml-sec epa derived epg rfc6068 epg-config gnus-util
text-property-search mm-decode mm-bodies mm-encode mail-parse rfc2231
rfc2047 rfc2045 mm-util ietf-drums mail-prsvr mailabbrev mail-utils
gmm-utils mailheader format-spec parsebib rx hl-line pp crm bibtex
iso8601 time-date writeroom-mode visual-fill-column olivetti
multiple-cursors mc-separate-operations rectangular-region-mode
mc-mark-pop mc-edit-lines mc-hide-unmatched-lines-mode mc-mark-more
thingatpt mc-cycle-cursors multiple-cursors-core advice rect move-text
no-littering compat paredit edmacro kmacro display-line-numbers
whitespace page-break-lines smart-mode-line-atom-one-dark-theme cl-extra
help-mode atom-one-dark-theme use-package use-package-ensure
use-package-delight use-package-diminish use-package-bind-key bind-key
easy-mmode use-package-core finder-inf atom-one-dark-theme-autoloads
auctex-latexmk-autoloads auctex-autoloads tex-site company-autoloads
dracula-theme-autoloads ebib-autoloads ess-autoloads flycheck-autoloads
forth-mode-autoloads gdscript-mode-autoloads go-mode-autoloads
hl-todo-autoloads impatient-mode-autoloads htmlize-autoloads
julia-formatter-autoloads just-mode-autoloads
latex-preview-pane-autoloads llvm-ts-mode-autoloads lsp-docker-autoloads
lsp-julia-autoloads julia-mode-autoloads lsp-ui-autoloads
lsp-mode-autoloads ht-autoloads lv-autoloads magit-autoloads pcase
git-commit-autoloads magit-section-autoloads move-text-autoloads
multiple-cursors-autoloads no-littering-autoloads olivetti-autoloads
package-lint-autoloads page-break-lines-autoloads paredit-autoloads
parsebib-autoloads pkg-info-autoloads epl-autoloads
quelpa-use-package-autoloads quelpa-autoloads rustic-autoloads
markdown-mode-autoloads f-autoloads dash-autoloads rust-mode-autoloads
s-autoloads session-async-autoloads simple-httpd-autoloads
smart-mode-line-atom-one-dark-theme-autoloads smart-mode-line-autoloads
rich-minority-autoloads spinner-autoloads splunk-mode-autoloads
transient-autoloads with-editor-autoloads compat-autoloads info
writeroom-mode-autoloads visual-fill-column-autoloads
xterm-color-autoloads yaml-autoloads yaml-mode-autoloads
yasnippet-autoloads package browse-url url url-proxy url-privacy
url-expand url-methods url-history url-cookie generate-lisp-file
url-domsuf url-util mailcap url-handlers url-parse auth-source cl-seq
eieio eieio-core cl-macs password-cache json subr-x map byte-opt gv
bytecomp byte-compile url-vars cl-loaddefs cl-lib rmc iso-transl tooltip
cconv eldoc paren electric uniquify ediff-hook vc-hooks lisp-float-type
elisp-mode mwheel term/ns-win ns-win ucs-normalize mule-util
term/common-win tool-bar dnd fontset image regexp-opt fringe
tabulated-list replace newcomment text-mode lisp-mode prog-mode register
page tab-bar menu-bar rfn-eshadow isearch easymenu timer select
scroll-bar mouse jit-lock font-lock syntax font-core term/tty-colors
frame minibuffer nadvice seq simple cl-generic indonesian philippine
cham georgian utf-8-lang misc-lang vietnamese tibetan thai tai-viet lao
korean japanese eucjp-ms cp51932 hebrew greek romanian slovak czech
european ethiopic indian cyrillic chinese composite emoji-zwj charscript
charprop case-table epa-hook jka-cmpr-hook help abbrev obarray oclosure
cl-preloaded button loaddefs theme-loaddefs faces cus-face macroexp
files window text-properties overlay sha1 md5 base64 format env
code-pages mule custom widget keymap hashtable-print-readable backquote
threads kqueue cocoa ns multi-tty make-network-process emacs)

Memory information:
((conses 16 412027 70117)
(symbols 48 34112 0)
(strings 32 128155 6447)
(string-bytes 1 4038566)
(vectors 16 67754)
(vector-slots 8 739746 70880)
(floats 8 294 368)
(intervals 56 6200 53)
(buffers 984 43))

* bug#73133: 29.2; EWW fails to render some webpages
  2024-09-08 20:52 bug#73133: 29.2; EWW fails to render some webpages Ganimard via Bug reports for GNU Emacs, the Swiss army knife of text editors
@ 2024-09-10  6:06 ` Jim Porter
  2024-09-21  9:13   ` Eli Zaretskii
  2024-10-23 10:43 ` Mattias Engdegård
  1 sibling, 1 reply; 35+ messages in thread
From: Jim Porter @ 2024-09-10  6:06 UTC
  To: Ganimard, 73133

On 9/8/2024 1:52 PM, Ganimard via Bug reports for GNU Emacs, the Swiss 
army knife of text editors wrote:
> I have recently discovered the website gastonle.ru, however it does not
> render with Emacs Web Wowser.  It appears to be a relatively simple
> website and I cannot see what would prohibit it from rendering.

Checking that page via curl, it appears that it doesn't return a 
Content-Type header. In the absence of that header, EWW assumes that the 
page is plain text.

> I have also tried it on an Ubuntu 22.04.4 LTS distro running Emacs 28.1
> but it also fails to render.  This therefore appears to be a bug in EWW.

From my reading of RFC9110[1], this is *technically* a bug (we should
assume application/octet-stream, not text/plain), but that wouldn't fix 
the rendering here; it would probably make things worse. However, per 
the RFC, EWW would be within its rights to guess that the page is HTML, 
e.g. by checking for "<!doctype html>". It also recommends having that 
be an option that can be disabled, which is reasonable (and in keeping 
with Emacs's design principles anyway).

[1] https://www.rfc-editor.org/rfc/rfc9110#section-8.3-5
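
For illustration only, a minimal sketch of the kind of check described above: guess HTML from a leading doctype, behind an option that can be turned off. The names `my-guess-html-from-doctype` and `my-looks-like-html-p` are hypothetical and not part of eww.el, and the sketch assumes the response body is already available as a decoded string:

```elisp
;; Hypothetical option and predicate; neither exists in eww.el.
(defvar my-guess-html-from-doctype t
  "When non-nil, treat a leading HTML doctype as evidence of HTML.")

(defun my-looks-like-html-p (body)
  "Return non-nil if the string BODY starts with an HTML doctype.
Leading whitespace is ignored and the match is case-insensitive.
Always return nil when `my-guess-html-from-doctype' is nil."
  (and my-guess-html-from-doctype
       (let ((case-fold-search t))
         (string-match-p "\\`[ \t\r\n]*<!doctype[ \t\r\n]+html\\b" body))))
```

With the option set to nil the predicate never matches, so a caller would keep whatever default content type it already uses.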





* bug#73133: 29.2; EWW fails to render some webpages
  2024-09-10  6:06 ` Jim Porter
@ 2024-09-21  9:13   ` Eli Zaretskii
  2024-09-21 17:12     ` Jim Porter
  0 siblings, 1 reply; 35+ messages in thread
From: Eli Zaretskii @ 2024-09-21  9:13 UTC
  To: Jim Porter; +Cc: 73133, ganimard

> Date: Mon, 9 Sep 2024 23:06:56 -0700
> From: Jim Porter <jporterbugs@gmail.com>
> 
> On 9/8/2024 1:52 PM, Ganimard via Bug reports for GNU Emacs, the Swiss 
> army knife of text editors wrote:
> > I have recently discovered the website gastonle.ru, however it does not
> > render with Emacs Web Wowser.  It appears to be a relatively simple
> > website and I cannot see what would prohibit it from rendering.
> 
> Checking that page via curl, it appears that it doesn't return a 
> Content-Type header. In the absence of that header, EWW assumes that the 
> page is plain text.
> 
> > I have also tried it on an Ubuntu 22.04.4 LTS distro running Emacs 28.1
> > but it also fails to render.  This therefore appears to be a bug in EWW.
> 
> From my reading of RFC9110[1], this is *technically* a bug (we should
> assume application/octet-stream, not text/plain), but that wouldn't fix 
> the rendering here; it would probably make things worse. However, per 
> the RFC, EWW would be within its rights to guess that the page is HTML, 
> e.g. by checking for "<!doctype html>". It also recommends having that 
> be an option that can be disabled, which is reasonable (and in keeping 
> with Emacs's design principles anyway).
> 
> [1] https://www.rfc-editor.org/rfc/rfc9110#section-8.3-5

Thanks.  Would someone like to submit a patch along these lines?





* bug#73133: 29.2; EWW fails to render some webpages
  2024-09-21  9:13   ` Eli Zaretskii
@ 2024-09-21 17:12     ` Jim Porter
  2024-09-23 15:43       ` Sebastián Monía
  2024-09-23 15:56       ` Sebastián Monía
  0 siblings, 2 replies; 35+ messages in thread
From: Jim Porter @ 2024-09-21 17:12 UTC
  To: Eli Zaretskii; +Cc: 73133, ganimard

On 9/21/2024 2:13 AM, Eli Zaretskii wrote:
>> Date: Mon, 9 Sep 2024 23:06:56 -0700
>> From: Jim Porter <jporterbugs@gmail.com>
>>
>> From my reading of RFC9110[1], this is *technically* a bug (we should
>> assume application/octet-stream, not text/plain), but that wouldn't fix
>> the rendering here; it would probably make things worse. However, per
>> the RFC, EWW would be within its rights to guess that the page is HTML,
>> e.g. by checking for "<!doctype html>". It also recommends having that
>> be an option that can be disabled, which is reasonable (and in keeping
>> with Emacs's design principles anyway).
>>
>> [1] https://www.rfc-editor.org/rfc/rfc9110#section-8.3-5
> 
> Thanks.  Would someone like to submit a patch along these lines?

It'll probably be a couple weeks until I have time to write a patch, but 
if no one has done so by then, I'll look into it.





* bug#73133: 29.2; EWW fails to render some webpages
  2024-09-21 17:12     ` Jim Porter
@ 2024-09-23 15:43       ` Sebastián Monía
  2024-09-28 10:58         ` Eli Zaretskii
  2024-09-23 15:56       ` Sebastián Monía
  1 sibling, 1 reply; 35+ messages in thread
From: Sebastián Monía @ 2024-09-23 15:43 UTC
  To: Jim Porter; +Cc: Eli Zaretskii, 73133, ganimard

Jim Porter <jporterbugs@gmail.com> writes:

> On 9/21/2024 2:13 AM, Eli Zaretskii wrote:
>>> Date: Mon, 9 Sep 2024 23:06:56 -0700
>>> From: Jim Porter <jporterbugs@gmail.com>
>>>
>>> From my reading of RFC9110[1], this is *technically* a bug (we should
>>> assume application/octet-stream, not text/plain), but that wouldn't fix
>>> the rendering here; it would probably make things worse. However, per
>>> the RFC, EWW would be within its rights to guess that the page is HTML,
>>> e.g. by checking for "<!doctype html>". It also recommends having that
>>> be an option that can be disabled, which is reasonable (and in keeping
>>> with Emacs's design principles anyway).
>>>
>>> [1] https://www.rfc-editor.org/rfc/rfc9110#section-8.3-5
>> Thanks.  Would someone like to submit a patch along these lines?
>
> It'll probably be a couple weeks until I have time to write a patch,
> but if no one has done so by then, I'll look into it.

Would the patch attached work?


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: eww-use-doctype-fallback --]
[-- Type: text/x-patch, Size: 2863 bytes --]

From 499abe197e6d245228be853731314e19148bb658 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Sebasti=C3=A1n=20Mon=C3=ADa?=
 <sebastian.monia@sebasmonia.com>
Date: Mon, 23 Sep 2024 11:40:18 -0400
Subject: [PATCH] Add option eww-use-doctype-fallback, code to detect if a page
 has a valid doctype tag, and use it as alternative to a content-type header

---
 lisp/net/eww.el | 26 ++++++++++++++++++++++++--
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/lisp/net/eww.el b/lisp/net/eww.el
index a651d9d5020..59a146c8392 100644
--- a/lisp/net/eww.el
+++ b/lisp/net/eww.el
@@ -170,6 +170,14 @@ the first item is the program, and the rest are the arguments."
   :type '(choice (const :tag "Never" nil)
                  regexp))
 
+(defcustom eww-use-doctype-fallback t
+  "Accept a DOCTYPE tag as evidence that page content is HTML.
+This is used only when the page does not have a valid Content-Type
+header."
+  :version "30.1"
+  :group 'eww
+  :type 'boolean)
+
 (defcustom eww-browse-url-new-window-is-tab 'tab-bar
   "Whether to open up new windows in a tab or a new buffer.
 If t, then open the URL in a new tab rather than a new buffer if
@@ -630,6 +638,18 @@ Currently this means either text/html or application/xhtml+xml."
   (member content-type '("text/html"
 			 "application/xhtml+xml")))
 
+(defun eww--doctype-html-p (data-buffer)
+  "Return non-nil if DATA-BUFFER contains a doctype declaration."
+  ;; https://html.spec.whatwg.org/multipage/syntax.html#the-doctype
+  (let ((case-fold-search t)
+        (target
+         "<!doctype +html *\\(>\\|system +\\(\\\"\\|'\\)+about:legacy-compat\\)"))
+    (with-current-buffer data-buffer
+      (goto-char (point-min))
+      ;; match basic <!doctype html> and also legacy variants as
+      ;; specified in link above
+      (re-search-forward target nil t))))
+
 (defun eww--rename-buffer ()
   "Rename the current EWW buffer.
 The renaming scheme is performed in accordance with
@@ -695,7 +715,9 @@ The renaming scheme is performed in accordance with
                               url))
               (goto-char (point-min))
               (eww-display-html (or encode charset) url nil point buffer))
-	     ((eww-html-p (car content-type))
+	     ((or (eww-html-p (car content-type))
+                  (and eww-use-doctype-fallback
+                       (eww--doctype-html-p data-buffer)))
               (eww-display-html (or encode charset) url nil point buffer))
 	     ((equal (car content-type) "application/pdf")
 	      (eww-display-pdf))
@@ -717,7 +739,7 @@ The renaming scheme is performed in accordance with
               (setq buffer-undo-list nil)))
         (kill-buffer data-buffer)))
     (unless (buffer-live-p buffer)
-      (kill-buffer data-buffer))))
+      (kill-buffer data-buffer)))
 
 (defun eww-parse-headers ()
   (let ((headers nil))
-- 
2.45.2.windows.1



-- 
Sebastián Monía
https://site.sebasmonia.com/

* bug#73133: 29.2; EWW fails to render some webpages
  2024-09-21 17:12     ` Jim Porter
  2024-09-23 15:43       ` Sebastián Monía
@ 2024-09-23 15:56       ` Sebastián Monía
  2024-09-24 18:31         ` Jim Porter
  1 sibling, 1 reply; 35+ messages in thread
From: Sebastián Monía @ 2024-09-23 15:56 UTC
  To: Jim Porter; +Cc: Eli Zaretskii, 73133, ganimard


Hi all,

Would something like the attached patch work?

Thanks,
Seb

PS: I think I sent this to just one person by mistake instead of a wide
reply, my bad.


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: eww-use-doctype-fallback --]
[-- Type: text/x-patch, Size: 2863 bytes --]

From 499abe197e6d245228be853731314e19148bb658 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Sebasti=C3=A1n=20Mon=C3=ADa?=
 <sebastian.monia@sebasmonia.com>
Date: Mon, 23 Sep 2024 11:40:18 -0400
Subject: [PATCH] Add option eww-use-doctype-fallback, code to detect if a page
 has a valid doctype tag, and use it as alternative to a content-type header

---
 lisp/net/eww.el | 26 ++++++++++++++++++++++++--
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/lisp/net/eww.el b/lisp/net/eww.el
index a651d9d5020..59a146c8392 100644
--- a/lisp/net/eww.el
+++ b/lisp/net/eww.el
@@ -170,6 +170,14 @@ the first item is the program, and the rest are the arguments."
   :type '(choice (const :tag "Never" nil)
                  regexp))
 
+(defcustom eww-use-doctype-fallback t
+  "Accept a DOCTYPE tag as evidence that page content is HTML.
+This is used only when the page does not have a valid Content-Type
+header."
+  :version "30.1"
+  :group 'eww
+  :type 'boolean)
+
 (defcustom eww-browse-url-new-window-is-tab 'tab-bar
   "Whether to open up new windows in a tab or a new buffer.
 If t, then open the URL in a new tab rather than a new buffer if
@@ -630,6 +638,18 @@ Currently this means either text/html or application/xhtml+xml."
   (member content-type '("text/html"
 			 "application/xhtml+xml")))
 
+(defun eww--doctype-html-p (data-buffer)
+  "Return non-nil if DATA-BUFFER contains a doctype declaration."
+  ;; https://html.spec.whatwg.org/multipage/syntax.html#the-doctype
+  (let ((case-fold-search t)
+        (target
+         "<!doctype +html *\\(>\\|system +\\(\\\"\\|'\\)+about:legacy-compat\\)"))
+    (with-current-buffer data-buffer
+      (goto-char (point-min))
+      ;; match basic <!doctype html> and also legacy variants as
+      ;; specified in link above
+      (re-search-forward target nil t))))
+
 (defun eww--rename-buffer ()
   "Rename the current EWW buffer.
 The renaming scheme is performed in accordance with
@@ -695,7 +715,9 @@ The renaming scheme is performed in accordance with
                               url))
               (goto-char (point-min))
               (eww-display-html (or encode charset) url nil point buffer))
-	     ((eww-html-p (car content-type))
+	     ((or (eww-html-p (car content-type))
+                  (and eww-use-doctype-fallback
+                       (eww--doctype-html-p data-buffer)))
               (eww-display-html (or encode charset) url nil point buffer))
 	     ((equal (car content-type) "application/pdf")
 	      (eww-display-pdf))
@@ -717,7 +739,7 @@ The renaming scheme is performed in accordance with
               (setq buffer-undo-list nil)))
         (kill-buffer data-buffer)))
     (unless (buffer-live-p buffer)
-      (kill-buffer data-buffer))))
+      (kill-buffer data-buffer)))
 
 (defun eww-parse-headers ()
   (let ((headers nil))
-- 
2.45.2.windows.1



-- 
Sebastián Monía
https://site.sebasmonia.com/

* bug#73133: 29.2; EWW fails to render some webpages
  2024-09-23 15:56       ` Sebastián Monía
@ 2024-09-24 18:31         ` Jim Porter
  2024-09-25 20:46           ` Sebastián Monía
  0 siblings, 1 reply; 35+ messages in thread
From: Jim Porter @ 2024-09-24 18:31 UTC
  To: Sebastián Monía; +Cc: Eli Zaretskii, 73133, ganimard

On 9/23/2024 8:56 AM, Sebastián Monía wrote:
> Would something like the attached patch work?

I was actually thinking something more general, like a defcustom named 
'eww-guess-content-type-functions', which would be a list of functions 
where the first non-nil result is the guessed Content-Type. That way, we 
could extend this to other content types (for example, maybe we'd want 
to look for the magic headers for various image formats too; we don't 
have to do that in this bug).

I think your 'eww--doctype-html-p' function would work nicely with a 
couple small tweaks as one of the functions in 
'eww-guess-content-type-functions' though.
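
A rough sketch of how such a list of guessing functions could be consulted, under stated assumptions: every name below is hypothetical, the guessers take the response body as a string (the thread has not settled on a signature at this point), and "text/plain" is kept as the fallback because that is what EWW assumes today. The standard `run-hook-with-args-until-success` returns the first non-nil result, which matches the behaviour described above:

```elisp
;; All names here are hypothetical; none of them exist in eww.el.

(defun my-guess-html (body)
  "Return \"text/html\" if the string BODY starts with an HTML doctype."
  (let ((case-fold-search t))
    (when (string-match-p "\\`[ \t\r\n]*<!doctype[ \t\r\n]+html\\b" body)
      "text/html")))

(defun my-guess-pdf (body)
  "Return \"application/pdf\" if the string BODY starts with \"%PDF-\"."
  (when (string-prefix-p "%PDF-" body)
    "application/pdf"))

(defvar my-guess-content-type-functions '(my-guess-html my-guess-pdf)
  "Functions called in order with the body until one returns a type.")

(defun my-guess-content-type (body)
  "Return the first guess for BODY, or \"text/plain\" if none matches."
  (or (run-hook-with-args-until-success
       'my-guess-content-type-functions body)
      "text/plain"))
```

Keeping the guessers in a plain list means a user could add or remove entries without touching the dispatch function.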





* bug#73133: 29.2; EWW fails to render some webpages
  2024-09-24 18:31         ` Jim Porter
@ 2024-09-25 20:46           ` Sebastián Monía
  2024-09-26  1:59             ` Jim Porter
  0 siblings, 1 reply; 35+ messages in thread
From: Sebastián Monía @ 2024-09-25 20:46 UTC
  To: Jim Porter; +Cc: Eli Zaretskii, 73133, ganimard

Hi Jim,

Jim Porter <jporterbugs@gmail.com> writes:
> I was actually thinking something more general, like a defcustom named
> 'eww-guess-content-type-functions', which would be a list of functions
> where the first non-nil result is the guessed Content-Type. That way,
> we could extend this to other content types (for example, maybe we'd
> want to look for the magic headers for various image formats too; we
> don't have to do that in this bug).

I think the functions for the new defcustom should accept the
content-type, headers (since both are already parsed by that time), and
the entire buffer. If you agree, I can give your suggestion a shot; if
not, let me know what you think would work.

> I think your 'eww--doctype-html-p' function would work nicely with a
> couple small tweaks as one of the functions in
> 'eww-guess-content-type-functions' though.

Thanks!
I would also have the current '(eww-html-p (car content-type))' wrapped
in a function `eww--content-type-html-p` and put both functions in the
defcustom, first content type then doctype.



-- 
Sebastián Monía
https://site.sebasmonia.com/





* bug#73133: 29.2; EWW fails to render some webpages
  2024-09-25 20:46           ` Sebastián Monía
@ 2024-09-26  1:59             ` Jim Porter
  2024-09-30 17:10               ` Sebastián Monía
  0 siblings, 1 reply; 35+ messages in thread
From: Jim Porter @ 2024-09-26  1:59 UTC
  To: Sebastián Monía; +Cc: Eli Zaretskii, 73133, ganimard

On 9/25/2024 1:46 PM, Sebastián Monía wrote:
> Jim Porter <jporterbugs@gmail.com> writes:
>> I was actually thinking something more general, like a defcustom named
>> 'eww-guess-content-type-functions', which would be a list of functions
>> where the first non-nil result is the guessed Content-Type. That way,
>> we could extend this to other content types (for example, maybe we'd
>> want to look for the magic headers for various image formats too; we
>> don't have to do that in this bug).
> 
> I think the functions for the new defcustom should accept the
> content-type, headers (since both are already parsed by that time), and
> the entire buffer. If you agree, I can give your suggestion a shot; if
> not, let me know what you think would work.

I think we'd only want to run this hook if the Content-Type is absent 
from the headers (its job is to *guess* a content type, after all), so 
I'd expect the signature to be the list of headers + the buffer.
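
As an illustration of that signature, here is a sketch of one guessing function that takes the header alist and the response buffer. The name `my-guess-gif` is hypothetical, and the sketch assumes the buffer contains only the response body (no header block at the top); it checks for the "GIF87a"/"GIF89a" signatures as an example of the image magic headers mentioned earlier in the thread:

```elisp
;; Hypothetical guesser; not part of eww.el.  Assumes RESPONSE-BUFFER
;; holds only the response body, starting at `point-min'.
(defun my-guess-gif (_headers response-buffer)
  "Return \"image/gif\" if RESPONSE-BUFFER starts with a GIF signature.
The header alist argument is accepted for the shared signature but is
unused here.  Return nil when the signature is absent."
  (with-current-buffer response-buffer
    (save-excursion
      (goto-char (point-min))
      ;; The signature is case-sensitive ASCII: GIF87a or GIF89a.
      (let ((case-fold-search nil))
        (when (looking-at-p "GIF8[79]a")
          "image/gif")))))
```

Returning nil rather than a default lets the next function in the list have a turn, with the caller supplying the final fallback.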





* bug#73133: 29.2; EWW fails to render some webpages
  2024-09-23 15:43       ` Sebastián Monía
@ 2024-09-28 10:58         ` Eli Zaretskii
  2024-09-30 15:52           ` Sebastián Monía
  0 siblings, 1 reply; 35+ messages in thread
From: Eli Zaretskii @ 2024-09-28 10:58 UTC
  To: Sebastián Monía; +Cc: jporterbugs, 73133, ganimard

> From: Sebastián Monía <sebastian@sebasmonia.com>
> Cc: Eli Zaretskii <eliz@gnu.org>,  73133@debbugs.gnu.org,  ganimard@tuta.io
> Date: Mon, 23 Sep 2024 11:43:36 -0400
> 
> +(defcustom eww-use-doctype-fallback t
> +  "Accept a DOCTYPE tag as evidence that page content is HTML.

This should say

  "Whether to accept the DOCTYPE tag as evidence that page content is HTML."

> +This is used only when the page does not have a valid Content-Type
> +header."
> +  :version "30.1"
               ^^^^
This should be "31.1"

> +(defun eww--doctype-html-p (data-buffer)
> +  "Return non-nil if DATA-BUFFER contains a doctype declaration."

Not just "doctype declaration", but "HTML doctype declaration", right?





* bug#73133: 29.2; EWW fails to render some webpages
  2024-09-28 10:58         ` Eli Zaretskii
@ 2024-09-30 15:52           ` Sebastián Monía
  0 siblings, 0 replies; 35+ messages in thread
From: Sebastián Monía @ 2024-09-30 15:52 UTC
  To: Eli Zaretskii; +Cc: jporterbugs, 73133, ganimard


Eli Zaretskii <eliz@gnu.org> writes:
>> +(defcustom eww-use-doctype-fallback t
>> +  "Accept a DOCTYPE tag as evidence that page content is HTML.
>
> This should say
>
>   "Whether to accept the DOCTYPE tag as evidence that page content is HTML."

>> +  :version "30.1"
>                ^^^^
> This should be "31.1"

Will correct these (although the defcustom might change completely)

>> +(defun eww--doctype-html-p (data-buffer)
>> +  "Return non-nil if DATA-BUFFER contains a doctype declaration."
>
> Not just "doctype declaration", but "HTML doctype declaration", right?

Same here.

Thanks for the feedback!


-- 
Sebastián Monía
https://site.sebasmonia.com/





* bug#73133: 29.2; EWW fails to render some webpages
  2024-09-26  1:59             ` Jim Porter
@ 2024-09-30 17:10               ` Sebastián Monía
  2024-10-03 23:39                 ` Jim Porter
  0 siblings, 1 reply; 35+ messages in thread
From: Sebastián Monía @ 2024-09-30 17:10 UTC
  To: Jim Porter; +Cc: Eli Zaretskii, 73133, ganimard

Hello!

I was looking into this today and considering our options.

Jim Porter <jporterbugs@gmail.com> writes:
> On 9/25/2024 1:46 PM, Sebastián Monía wrote:
>> Jim Porter <jporterbugs@gmail.com> writes:
>>> I was actually thinking something more general, like a defcustom named
>>> 'eww-guess-content-type-functions', which would be a list of functions
>>> where the first non-nil result is the guessed Content-Type. That way,
>>> we could extend this to other content types (for example, maybe we'd
>>> want to look for the magic headers for various image formats too; we
>>> don't have to do that in this bug).

We aren't really guessing the content-type, at least in the scope of my
original patch, and probably this bug. We just want to know if the page
is HTML to render it, in these snippets (part of eww-render):

;; original cond
((eww-html-p (car content-type))
   (eww-display-html (or encode charset) url nil point buffer))

;; one possible alternative 
((or (eww-html-p (car content-type))
     ;; alternative mechanism to detect if the page is HTML
     ;; via <doctype...>, or other tests.
     )
   (eww-display-html (or encode charset) url nil point buffer))

We could instead change 'eww-html-p' to accept the content-type, other
headers and buffer. And in that function, as a fallback, call the
functions in 'eww-guess-content-type-functions' and return non-nil for
HTML.

The reason I am suggesting this is that there is no benefit to having a
generic mechanism to detect the Content-Type without heavily modifying
'eww-render'. It only matters when deciding whether to render the HTML
or display it as-is; other cases are already handled in 'eww-render'.

Hope that made sense!

I can always address Eli's comments in the context of my original patch,
too, for a much simpler (and of course, limited) solution.

-- 
Sebastián Monía
https://site.sebasmonia.com/





* bug#73133: 29.2; EWW fails to render some webpages
  2024-09-30 17:10               ` Sebastián Monía
@ 2024-10-03 23:39                 ` Jim Porter
  2024-10-09  3:30                   ` Sebastián Monía
  0 siblings, 1 reply; 35+ messages in thread
From: Jim Porter @ 2024-10-03 23:39 UTC
  To: Sebastián Monía; +Cc: Eli Zaretskii, 73133, ganimard

On 9/30/2024 10:10 AM, Sebastián Monía wrote:
> We aren't really guessing the content-type, at least in the scope of my
> original patch, and probably this bug. We just want to know if the page
> is HTML to render it, in these snippets (part of eww-render):

What I was thinking about was something like this (with some appropriate 
implementation for 'eww--guess-content-type', possibly accepting args as 
needed):

diff --git a/lisp/net/eww.el b/lisp/net/eww.el
index b5d2f20781a..1c134717cc9 100644
--- a/lisp/net/eww.el
+++ b/lisp/net/eww.el
@@ -659,7 +659,7 @@ eww-render
  	 (content-type
  	  (mail-header-parse-content-type
             (if (zerop (length (cdr (assoc "content-type" headers))))
-	       "text/plain"
+	       (eww--guess-content-type)
               (cdr (assoc "content-type" headers)))))
  	 (charset (intern
  		   (downcase






* bug#73133: 29.2; EWW fails to render some webpages
  2024-10-03 23:39                 ` Jim Porter
@ 2024-10-09  3:30                   ` Sebastián Monía
  2024-10-09  3:42                     ` Jim Porter
  0 siblings, 1 reply; 35+ messages in thread
From: Sebastián Monía @ 2024-10-09  3:30 UTC
  To: Jim Porter; +Cc: Eli Zaretskii, 73133, ganimard


Jim Porter <jporterbugs@gmail.com> writes:
> On 9/30/2024 10:10 AM, Sebastián Monía wrote:
>> We aren't really guessing the content-type, at least in the scope of my
>> original patch, and probably this bug. We just want to know if the page
>> is HTML to render it, in these snippets (part of eww-render):
>
> What I was thinking about was something like this (with some
> appropriate implementation for 'eww--guess-content-type', possibly
> accepting args as needed):
>
> diff --git a/lisp/net/eww.el b/lisp/net/eww.el
> index b5d2f20781a..1c134717cc9 100644
> --- a/lisp/net/eww.el
> +++ b/lisp/net/eww.el
> @@ -659,7 +659,7 @@ eww-render
>  	 (content-type
>  	  (mail-header-parse-content-type
>             (if (zerop (length (cdr (assoc "content-type" headers))))
> -	       "text/plain"
> +	       (eww--guess-content-type)
>               (cdr (assoc "content-type" headers)))))
>  	 (charset (intern
>  		   (downcase
Hello!

Attached a new patch that goes in the direction outlined above, let me
know what you think.

Cheers,
Seb


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: patch --]
[-- Type: text/x-patch, Size: 3056 bytes --]

From 309a7d729665f14964a550f57f589a79705e23d6 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Sebasti=C3=A1n=20Mon=C3=ADa?= <sebastian@sebasmonia.com>
Date: Tue, 8 Oct 2024 23:26:42 -0400
Subject: [PATCH] Add customization to let EWW guess content-type if needed
 (bug#73133)

---
 lisp/net/eww.el | 40 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/lisp/net/eww.el b/lisp/net/eww.el
index b5d2f20781a..0a9a621f3e5 100644
--- a/lisp/net/eww.el
+++ b/lisp/net/eww.el
@@ -108,6 +108,19 @@ eww-suggest-uris
              eww-current-url
              eww-bookmark-urls))
 
+(defcustom eww-guess-content-type-functions
+  '(eww--html-if-doctype)
+  "List of functions used to guess a page's content-type.
+These are only used when the page does not have a valid Content-Type
+header.  Functions are called in order, until one of them returns the
+value to be used as Content-Type.  They receive two parameters: an alist
+of headers, and the buffer that holds the complete response.  If the
+list is exhausted, eww assumes \"text/plain\" so the user can see the
+markup."
+  :version "31.1"
+  :group 'eww
+  :type '(repeat function))
+
 (defcustom eww-bookmarks-directory user-emacs-directory
   "Directory where bookmark files will be stored."
   :version "25.1"
@@ -630,6 +643,31 @@ eww-html-p
   (member content-type '("text/html"
 			 "application/xhtml+xml")))
 
+(defun eww--guess-content-type (headers response-buffer)
+  "Use HEADERS and RESPONSE to guess the Content-Type.
+Will call each function in `eww-guess-content-type-functions', until one
+of them returns a value.  This mechanism is used only if there isn't a
+valid Content-Type header.  If none of the functions can guess, return
+\"text/plain\", so at least the mark up is displayed."
+  (let ((first-guess (seq-some
+                      (lambda (f) (funcall f headers response-buffer))
+                      eww-guess-content-type-functions)))
+    (or first-guess "text/plain")))
+
+(defun eww--html-if-doctype (headers response-buffer)
+  "Return \"text/html\" if RESPONSE-BUFFER has an HTML doctype declaration.
+HEADERS is unused."
+  ;; https://html.spec.whatwg.org/multipage/syntax.html#the-doctype
+  (let ((case-fold-search t)
+        (target
+         "<!doctype +html *\\(>\\|system +\\(\\\"\\|'\\)+about:legacy-compat\\)"))
+    (with-current-buffer response-buffer
+      (goto-char (point-min))
+      ;; match basic <!doctype html> and also legacy variants as
+      ;; specified in link above
+      (when (re-search-forward target nil t)
+        "text/html"))))
+
 (defun eww--rename-buffer ()
   "Rename the current EWW buffer.
 The renaming scheme is performed in accordance with
@@ -659,7 +697,7 @@ eww-render
 	 (content-type
 	  (mail-header-parse-content-type
            (if (zerop (length (cdr (assoc "content-type" headers))))
-	       "text/plain"
+               (eww--guess-content-type headers buffer)
              (cdr (assoc "content-type" headers)))))
 	 (charset (intern
 		   (downcase
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* bug#73133: 29.2; EWW fails to render some webpages
  2024-10-09  3:30                   ` Sebastián Monía
@ 2024-10-09  3:42                     ` Jim Porter
  2024-10-10  2:08                       ` Sebastián Monía
  0 siblings, 1 reply; 35+ messages in thread
From: Jim Porter @ 2024-10-09  3:42 UTC (permalink / raw)
  To: Sebastián Monía; +Cc: Eli Zaretskii, 73133, ganimard

On 10/8/2024 8:30 PM, Sebastián Monía wrote:
> Attached a new patch that goes in the direction outlined above, let me
> know what you think.

Thanks, I think this looks good overall (though I haven't run with your 
patch locally). Just one comment below.

> +  (let ((first-guess (seq-some
> +                      (lambda (f) (funcall f headers response-buffer))
> +                      eww-guess-content-type-functions)))
> +    (or first-guess "text/plain")))

I believe this could be:

   (or (run-hook-with-args-until-success
        'eww-guess-content-type-functions headers response-buffer)
       "text/plain")

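As a rough sketch of why the two forms are interchangeable here: both
return the first non-nil value produced by the functions in the list.
The names below (my-guess-functions, my-never, my-always-html and the
two my-guess-with-* wrappers) are hypothetical and exist only for this
illustration; they are not part of eww.el.

```elisp
;; Illustration only: every `my-' name here is hypothetical.
(require 'seq)

(defvar my-guess-functions '(my-never my-always-html)
  "Hypothetical list of guessing functions, tried in order.")

(defun my-never (_headers _buffer)
  "Decline to guess by returning nil."
  nil)

(defun my-always-html (_headers _buffer)
  "Always guess HTML."
  "text/html")

(defun my-guess-with-seq-some (headers buffer)
  "Return the first non-nil guess via `seq-some', else \"text/plain\"."
  (or (seq-some (lambda (f) (funcall f headers buffer))
                my-guess-functions)
      "text/plain"))

(defun my-guess-with-hook (headers buffer)
  "Return the first non-nil guess via the hook runner, else \"text/plain\"."
  (or (run-hook-with-args-until-success
       'my-guess-functions headers buffer)
      "text/plain"))
```

One difference worth keeping in mind: the hook runner takes the
variable's symbol rather than its value, and it also copes with a
value that is a single function instead of a list.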




^ permalink raw reply	[flat|nested] 35+ messages in thread

* bug#73133: 29.2; EWW fails to render some webpages
  2024-10-09  3:42                     ` Jim Porter
@ 2024-10-10  2:08                       ` Sebastián Monía
  2024-10-14  4:35                         ` Jim Porter
  0 siblings, 1 reply; 35+ messages in thread
From: Sebastián Monía @ 2024-10-10  2:08 UTC (permalink / raw)
  To: Jim Porter; +Cc: Eli Zaretskii, 73133, ganimard

[-- Attachment #1: Type: text/plain, Size: 838 bytes --]


Jim Porter <jporterbugs@gmail.com> writes:
> I believe this could be:
>
>   (or (run-hook-with-args-until-success
>        'eww-guess-content-type-functions headers response-buffer)
>       "text/plain")

TIL. I landed in seq-some looking for something like
run-hook-with-args-until-success. So I actually learned two days in a
row! :)

Attached a modified patch. I also noticed and corrected another error,
that broke things when using the "g" (reload) command.

As for testing, I used this:

(defun do-ask (headers response)
  (when (y-or-n-p "decide?")
    (if (y-or-n-p "render?")
        "text/html"
      "text/plain")))

(setq eww-guess-content-type-functions
      '(do-ask eww--html-if-doctype))

And then reverse the order of the functions. Using "regular"
pages and the one reported in the bug.
Also tested with no functions.


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: patch bug 73133 --]
[-- Type: text/x-patch, Size: 2995 bytes --]

From 5239cf0add09f69276ae21c13efb2fe665297234 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Sebasti=C3=A1n=20Mon=C3=ADa?= <sebastian@sebasmonia.com>
Date: Tue, 8 Oct 2024 23:26:42 -0400
Subject: [PATCH] Add customization to let EWW guess content-type if needed
 (bug#73133)

---
 lisp/net/eww.el | 39 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 38 insertions(+), 1 deletion(-)

diff --git a/lisp/net/eww.el b/lisp/net/eww.el
index b5d2f20781a..30e780a44d9 100644
--- a/lisp/net/eww.el
+++ b/lisp/net/eww.el
@@ -108,6 +108,19 @@ eww-suggest-uris
              eww-current-url
              eww-bookmark-urls))
 
+(defcustom eww-guess-content-type-functions
+  '(eww--html-if-doctype)
+  "List of functions used to guess a page's content-type.
+These are only used when the page does not have a valid Content-Type
+header.  Functions are called in order, until one of them returns the
+value to be used as Content-Type.  They receive two parameters: an alist
+of headers, and the buffer that holds the complete response.  If the
+list is exhausted, eww assumes \"text/plain\" so the user can see the
+markup."
+  :version "31.1"
+  :group 'eww
+  :type '(repeat function))
+
 (defcustom eww-bookmarks-directory user-emacs-directory
   "Directory where bookmark files will be stored."
   :version "25.1"
@@ -630,6 +643,30 @@ eww-html-p
   (member content-type '("text/html"
 			 "application/xhtml+xml")))
 
+(defun eww--guess-content-type (headers response-buffer)
+  "Use HEADERS and RESPONSE to guess the Content-Type.
+Will call each function in `eww-guess-content-type-functions', until one
+of them returns a value.  This mechanism is used only if there isn't a
+valid Content-Type header.  If none of the functions can guess, return
+\"text/plain\", so at least the mark up is displayed."
+  (or (run-hook-with-args-until-success
+       'eww-guess-content-type-functions headers response-buffer)
+      "text/plain"))
+
+(defun eww--html-if-doctype (headers response-buffer)
+  "Return \"text/html\" if RESPONSE-BUFFER has an HTML doctype declaration.
+HEADERS is unused."
+  ;; https://html.spec.whatwg.org/multipage/syntax.html#the-doctype
+  (let ((case-fold-search t)
+        (target
+         "<!doctype +html *\\(>\\|system +\\(\\\"\\|'\\)+about:legacy-compat\\)"))
+    (with-current-buffer response-buffer
+      (goto-char (point-min))
+      ;; match basic <!doctype html> and also legacy variants as
+      ;; specified in link above
+      (when (re-search-forward target nil t)
+        "text/html"))))
+
 (defun eww--rename-buffer ()
   "Rename the current EWW buffer.
 The renaming scheme is performed in accordance with
@@ -659,7 +696,7 @@ eww-render
 	 (content-type
 	  (mail-header-parse-content-type
            (if (zerop (length (cdr (assoc "content-type" headers))))
-	       "text/plain"
+               (eww--guess-content-type headers (current-buffer))
              (cdr (assoc "content-type" headers)))))
 	 (charset (intern
 		   (downcase
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* bug#73133: 29.2; EWW fails to render some webpages
  2024-10-10  2:08                       ` Sebastián Monía
@ 2024-10-14  4:35                         ` Jim Porter
  2024-10-14 14:03                           ` Eli Zaretskii
  0 siblings, 1 reply; 35+ messages in thread
From: Jim Porter @ 2024-10-14  4:35 UTC (permalink / raw)
  To: Sebastián Monía; +Cc: Eli Zaretskii, 73133, ganimard

On 10/9/2024 7:08 PM, Sebastián Monía wrote:
> Attached a modified patch. I also noticed and corrected another error,
> that broke things when using the "g" (reload) command.

Thanks, I think this looks good overall. I just noticed one small nit 
(which I can fix when merging):

> +(defun eww--html-if-doctype (headers response-buffer)
> +  "Return \"text/html\" if RESPONSE-BUFFER has an HTML doctype declaration.
> +HEADERS is unused."

If an argument is unused, the convention is to prefix it with an 
underscore like "_headers". Then Flymake won't complain about an unused 
variable. :)
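
A minimal sketch of that convention, assuming a file compiled with
lexical binding; the function name my-response-nonempty-p is made up
for the example and is not part of the patch:

```elisp
;;; -*- lexical-binding: t; -*-
;; Hypothetical helper, for illustration only.  The leading underscore
;; on `_headers' marks the argument as intentionally unused, so the
;; byte compiler does not warn about it.
(defun my-response-nonempty-p (_headers response-buffer)
  "Return non-nil if RESPONSE-BUFFER contains any text.
The first argument is ignored."
  (> (buffer-size response-buffer) 0))
```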

One last question: do you have FSF copyright assignment paperwork filled 
out? If you haven't already, you'll need to fill that out before we can 
merge this. (I don't think I have access to the full list of people 
who've filled out paperwork, so I'm not sure if you've already done this.)





^ permalink raw reply	[flat|nested] 35+ messages in thread

* bug#73133: 29.2; EWW fails to render some webpages
  2024-10-14  4:35                         ` Jim Porter
@ 2024-10-14 14:03                           ` Eli Zaretskii
  2024-10-15 11:43                             ` Sebastián Monía
  0 siblings, 1 reply; 35+ messages in thread
From: Eli Zaretskii @ 2024-10-14 14:03 UTC (permalink / raw)
  To: Jim Porter; +Cc: sebastian, 73133, ganimard

> Date: Sun, 13 Oct 2024 21:35:33 -0700
> Cc: Eli Zaretskii <eliz@gnu.org>, 73133@debbugs.gnu.org, ganimard@tuta.io
> From: Jim Porter <jporterbugs@gmail.com>
> 
> One last question: do you have FSF copyright assignment paperwork filled 
> out?

AFAIK, Sebastián is in the middle of the assignment process, but it
was not yet completed.





^ permalink raw reply	[flat|nested] 35+ messages in thread

* bug#73133: 29.2; EWW fails to render some webpages
  2024-10-14 14:03                           ` Eli Zaretskii
@ 2024-10-15 11:43                             ` Sebastián Monía
  2024-10-19  7:46                               ` Eli Zaretskii
  0 siblings, 1 reply; 35+ messages in thread
From: Sebastián Monía @ 2024-10-15 11:43 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Jim Porter, 73133, ganimard


Eli Zaretskii <eliz@gnu.org> writes:
>> Date: Sun, 13 Oct 2024 21:35:33 -0700
>> Cc: Eli Zaretskii <eliz@gnu.org>, 73133@debbugs.gnu.org, ganimard@tuta.io
>> From: Jim Porter <jporterbugs@gmail.com>
>> 
>> One last question: do you have FSF copyright assignment paperwork filled 
>> out?
>
> AFAIK, Sebastián is in the middle of the assignment process, but it
> was not yet completed.

This is correct. I sent the form signed, didn't hear back yet.

-- 
Sebastián Monía
https://site.sebasmonia.com/





^ permalink raw reply	[flat|nested] 35+ messages in thread

* bug#73133: 29.2; EWW fails to render some webpages
  2024-10-15 11:43                             ` Sebastián Monía
@ 2024-10-19  7:46                               ` Eli Zaretskii
  2024-10-19 17:56                                 ` Sebastián Monía
  0 siblings, 1 reply; 35+ messages in thread
From: Eli Zaretskii @ 2024-10-19  7:46 UTC (permalink / raw)
  To: Sebastián Monía; +Cc: jporterbugs, 73133, ganimard

> From: Sebastián Monía <sebastian@sebasmonia.com>
> Cc: Jim Porter <jporterbugs@gmail.com>,  73133@debbugs.gnu.org,
>   ganimard@tuta.io
> Date: Tue, 15 Oct 2024 07:43:40 -0400
> 
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> >> Date: Sun, 13 Oct 2024 21:35:33 -0700
> >> Cc: Eli Zaretskii <eliz@gnu.org>, 73133@debbugs.gnu.org, ganimard@tuta.io
> >> From: Jim Porter <jporterbugs@gmail.com>
> >> 
> >> One last question: do you have FSF copyright assignment paperwork filled 
> >> out?
> >
> > AFAIK, Sebastián is in the middle of the assignment process, but it
> > was not yet completed.
> 
> This is correct. I sent the form signed, didn't hear back yet.

The legal paperwork is now done, so Sebastián, please update the patch
to fix the nit with unused argument HEADERS in eww--html-if-doctype,
and resubmit, so we could install the changes.

Thanks.





^ permalink raw reply	[flat|nested] 35+ messages in thread

* bug#73133: 29.2; EWW fails to render some webpages
  2024-10-19  7:46                               ` Eli Zaretskii
@ 2024-10-19 17:56                                 ` Sebastián Monía
  2024-10-20 19:17                                   ` Jim Porter
  0 siblings, 1 reply; 35+ messages in thread
From: Sebastián Monía @ 2024-10-19 17:56 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: jporterbugs, 73133, ganimard

[-- Attachment #1: Type: text/plain, Size: 449 bytes --]


Eli Zaretskii <eliz@gnu.org> writes:
> The legal paperwork is now done, so Sebastián, please update the patch
> to fix the nit with unused argument HEADERS in eww--html-if-doctype,
> and resubmit, so we could install the changes.
>
> Thanks.

What a momentous occasion :)
Attached the patch with that correction (and a small docstring fix that
'checkdoc' caught).


Thank you everyone for your help in this process.

Regards,
Seb


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: bug73133-doctype --]
[-- Type: text/x-patch, Size: 3003 bytes --]

From e35f4502383e368747d5f2bd8bcb9ed872315029 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Sebasti=C3=A1n=20Mon=C3=ADa?= <sebastian@sebasmonia.com>
Date: Tue, 8 Oct 2024 23:26:42 -0400
Subject: [PATCH] Add customization to let EWW guess content-type if needed
 (bug#73133)

---
 lisp/net/eww.el | 39 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 38 insertions(+), 1 deletion(-)

diff --git a/lisp/net/eww.el b/lisp/net/eww.el
index b5d2f20781a..147982057c5 100644
--- a/lisp/net/eww.el
+++ b/lisp/net/eww.el
@@ -108,6 +108,19 @@ eww-suggest-uris
              eww-current-url
              eww-bookmark-urls))
 
+(defcustom eww-guess-content-type-functions
+  '(eww--html-if-doctype)
+  "List of functions used to guess a page's content-type.
+These are only used when the page does not have a valid Content-Type
+header.  Functions are called in order, until one of them returns the
+value to be used as Content-Type.  They receive two parameters: an alist
+of headers, and the buffer that holds the complete response.  If the
+list is exhausted, eww assumes \"text/plain\" so the user can see the
+markup."
+  :version "31.1"
+  :group 'eww
+  :type '(repeat function))
+
 (defcustom eww-bookmarks-directory user-emacs-directory
   "Directory where bookmark files will be stored."
   :version "25.1"
@@ -630,6 +643,30 @@ eww-html-p
   (member content-type '("text/html"
 			 "application/xhtml+xml")))
 
+(defun eww--guess-content-type (headers response-buffer)
+  "Use HEADERS and RESPONSE-BUFFER to guess the Content-Type.
+Will call each function in `eww-guess-content-type-functions', until one
+of them returns a value.  This mechanism is used only if there isn't a
+valid Content-Type header.  If none of the functions can guess, return
+\"text/plain\", so at least the mark up is displayed."
+  (or (run-hook-with-args-until-success
+       'eww-guess-content-type-functions headers response-buffer)
+      "text/plain"))
+
+(defun eww--html-if-doctype (_headers response-buffer)
+  "Return \"text/html\" if RESPONSE-BUFFER has an HTML doctype declaration.
+HEADERS is unused."
+  ;; https://html.spec.whatwg.org/multipage/syntax.html#the-doctype
+  (let ((case-fold-search t)
+        (target
+         "<!doctype +html *\\(>\\|system +\\(\\\"\\|'\\)+about:legacy-compat\\)"))
+    (with-current-buffer response-buffer
+      (goto-char (point-min))
+      ;; match basic <!doctype html> and also legacy variants as
+      ;; specified in link above
+      (when (re-search-forward target nil t)
+        "text/html"))))
+
 (defun eww--rename-buffer ()
   "Rename the current EWW buffer.
 The renaming scheme is performed in accordance with
@@ -659,7 +696,7 @@ eww-render
 	 (content-type
 	  (mail-header-parse-content-type
            (if (zerop (length (cdr (assoc "content-type" headers))))
-	       "text/plain"
+               (eww--guess-content-type headers (current-buffer))
              (cdr (assoc "content-type" headers)))))
 	 (charset (intern
 		   (downcase
-- 
2.43.0


[-- Attachment #3: Type: text/plain, Size: 56 bytes --]


-- 
Sebastián Monía
https://site.sebasmonia.com/

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* bug#73133: 29.2; EWW fails to render some webpages
  2024-10-19 17:56                                 ` Sebastián Monía
@ 2024-10-20 19:17                                   ` Jim Porter
  2024-10-21  1:48                                     ` Sebastián Monía
  0 siblings, 1 reply; 35+ messages in thread
From: Jim Porter @ 2024-10-20 19:17 UTC (permalink / raw)
  To: Sebastián Monía, Eli Zaretskii; +Cc: 73133, ganimard

On 10/19/2024 10:56 AM, Sebastián Monía wrote:
> Thank you everyone for your help in this process.

One last thought before I merge this: I notice that when we can't guess 
a Content-Type, we use "text/plain" as a fallback. Per RFC-9110[1], the 
fallback should be "application/octet-stream".

I tested this out in EWW, and we still display 
"application/octet-stream" pages as text in EWW, so there's no 
difference in behavior by default vs "text/plain". However, users who 
customize 'eww-use-external-browser-for-content-type' could make pages 
like that open externally, which I think makes sense. For non-HTML pages 
with no actual Content-Type header, they're at least reasonably likely 
to be binary files, so you'd probably want to download them rather than 
display them.
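
For example, a user who wants that behavior might use something like
the following. This is only a sketch: it assumes the option holds a
regexp that is matched against the Content-Type, and the particular
pattern is illustrative rather than a recommended default.

```elisp
;; Illustrative setting, not a proposed default.  Assumes
;; `eww-use-external-browser-for-content-type' is a regexp matched
;; against the response's Content-Type.
(setq eww-use-external-browser-for-content-type
      "\\`\\(video/\\|audio/\\|application/octet-stream\\)")
```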

Does anyone else have any thoughts on the relative merits of falling 
back to "application/octet-stream" vs "text/plain"? If we go with the 
former, I can update the patch when I merge.

[1] https://www.rfc-editor.org/rfc/rfc9110#section-8.3-5






^ permalink raw reply	[flat|nested] 35+ messages in thread

* bug#73133: 29.2; EWW fails to render some webpages
  2024-10-20 19:17                                   ` Jim Porter
@ 2024-10-21  1:48                                     ` Sebastián Monía
  2024-10-22  4:59                                       ` Jim Porter
  0 siblings, 1 reply; 35+ messages in thread
From: Sebastián Monía @ 2024-10-21  1:48 UTC (permalink / raw)
  To: Jim Porter; +Cc: Eli Zaretskii, 73133, ganimard

Jim Porter <jporterbugs@gmail.com> writes:
> On 10/19/2024 10:56 AM, Sebastián Monía wrote:
>> Thank you everyone for your help in this process.
>
> One last thought before I merge this: I notice that when we can't
> guess a Content-Type, we use "text/plain" as a fallback. Per
> RFC-9110[1], the fallback should be "application/octet-stream".

I used text/plain only because it was the original behaviour, not a
particularly interesting reason!

> However, users who customize
> 'eww-use-external-browser-for-content-type' could make pages like that
> open externally, which I think makes sense.
> [...]
> Does anyone else have any thoughts on the relative merits of falling
> back to "application/octet-stream" vs "text/plain"? If we go with the
> former, I can update the patch when I merge.

I think it is a reasonable change. TIL about that option, too.

Regards,
Seb

-- 
Sebastián Monía
https://site.sebasmonia.com/





^ permalink raw reply	[flat|nested] 35+ messages in thread

* bug#73133: 29.2; EWW fails to render some webpages
  2024-10-21  1:48                                     ` Sebastián Monía
@ 2024-10-22  4:59                                       ` Jim Porter
  2024-10-22 12:35                                         ` Sebastián Monía
  0 siblings, 1 reply; 35+ messages in thread
From: Jim Porter @ 2024-10-22  4:59 UTC (permalink / raw)
  To: Sebastián Monía; +Cc: Eli Zaretskii, 73133-done, ganimard

On 10/20/2024 6:48 PM, Sebastián Monía wrote:
> I used text/plain only because it was the original behaviour, not a
> particularly interesting reason!

Thanks, I've now pushed this change to the master branch as 9074a9f496b, 
so I'm closing this bug. (Of course, if there's anything remaining to do 
here, just let me know.)





^ permalink raw reply	[flat|nested] 35+ messages in thread

* bug#73133: 29.2; EWW fails to render some webpages
  2024-10-22  4:59                                       ` Jim Porter
@ 2024-10-22 12:35                                         ` Sebastián Monía
  2024-10-22 12:36                                           ` Ganimard via Bug reports for GNU Emacs, the Swiss army knife of text editors
  0 siblings, 1 reply; 35+ messages in thread
From: Sebastián Monía @ 2024-10-22 12:35 UTC (permalink / raw)
  To: Jim Porter; +Cc: Eli Zaretskii, 73133-done, ganimard

Jim Porter <jporterbugs@gmail.com> writes:
> On 10/20/2024 6:48 PM, Sebastián Monía wrote:
>> I used text/plain only because it was the original behaviour, not a
>> particularly interesting reason!
>
> Thanks, I've now pushed this change to the master branch as
> 9074a9f496b, so I'm closing this bug. (Of course, if there's anything
> remaining to do here, just let me know.)

Not that I can think of. Thank you for fixing this changelog too, will
keep that in mind for future patches :)

-- 
Sebastián Monía
https://site.sebasmonia.com/





^ permalink raw reply	[flat|nested] 35+ messages in thread

* bug#73133: 29.2; EWW fails to render some webpages
  2024-10-22 12:35                                         ` Sebastián Monía
@ 2024-10-22 12:36                                           ` Ganimard via Bug reports for GNU Emacs, the Swiss army knife of text editors
  0 siblings, 0 replies; 35+ messages in thread
From: Ganimard via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2024-10-22 12:36 UTC (permalink / raw)
  To: Sebastián Monía; +Cc: Jim Porter, Eli Zaretskii, 73133 Done

[-- Attachment #1: Type: text/plain, Size: 712 bytes --]

Thanks for all your work, Jim and Sebastián and others!
G

23 Oct 2024, 1:35 am by sebastian@sebasmonia.com:

> Jim Porter <jporterbugs@gmail.com> writes:
>
>> On 10/20/2024 6:48 PM, Sebastián Monía wrote:
>>
>>> I used text/plain only because it was the original behaviour, not a
>>> particularly interesting reason!
>>>
>>
>> Thanks, I've now pushed this change to the master branch as
>> 9074a9f496b, so I'm closing this bug. (Of course, if there's anything
>> remaining to do here, just let me know.)
>>
>
> Not that I can think of. Thank you for fixing this changelog too, will
> keep that in mind for future patches :)
>
> -- 
> Sebastián Monía
> https://site.sebasmonia.com/
>

[-- Attachment #2: Type: text/html, Size: 1471 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* bug#73133: 29.2; EWW fails to render some webpages
  2024-09-08 20:52 bug#73133: 29.2; EWW fails to render some webpages Ganimard via Bug reports for GNU Emacs, the Swiss army knife of text editors
  2024-09-10  6:06 ` Jim Porter
@ 2024-10-23 10:43 ` Mattias Engdegård
  2024-10-23 16:19   ` Mattias Engdegård
                     ` (2 more replies)
  1 sibling, 3 replies; 35+ messages in thread
From: Mattias Engdegård @ 2024-10-23 10:43 UTC (permalink / raw)
  To: Sebastián Monía; +Cc: Jim Porter, Eli Zaretskii, 73133, ganimard

Sebastián, thanks for your contribution! A few minor points about this part:

 663   (let ((case-fold-search t)
 664         (target
 665          "<!doctype +html *\\(>\\|system +\\(\\\"\\|'\\)+about:legacy-compat\\)"))
 666     (with-current-buffer response-buffer

First of all, `case-fold-search` becomes buffer-local if set, so binding it before changing buffer won't help. You need to do it the other way around.

The regexp is a bit muddled. (Carets here apply to the quoted line below.)

 665          "<!doctype +html *\\(>\\|system +\\(\\\"\\|'\\)+about:legacy-compat\\)"))
...................................^
Why match the terminating `>` in one branch (without DOCTYPE legacy string) but not the other?

..................................................^^
Useless backslash(es) here. Did you mean to include something else?
(Relint found this one, which is what brought me here.)

.............................................................^
Why the `+`? According to the reference, there should be one single or double quote here.
(https://html.spec.whatwg.org/multipage/syntax.html#doctype-legacy-string)

................................^^^............^^^
These two capture groups don't seem to be used; you probably meant to use non-capturing \(?:...\) brackets.

..................................................^^^^^^^^
A character alternative would be better here: ["'].

An exact translation of your regexp to the rx notation might be:

  (rx "<!doctype" (+ " ") "html" (* " ")
      (group
       (| ">"
          (: "system" (+ " ") (+ (group (| "\"" "'")))
             "about:legacy-compat"))))

but perhaps you meant something like

  (rx "<!doctype" (+ " ") "html"
      (? (* " ") "system" (+ " ")
         (| "\"" "'") "about:legacy-compat" (| "\"" "'"))
      (* " ") ">")

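Putting the points above together, one possible shape for the function
is sketched below. It is not the committed code: the name
my-html-if-doctype is hypothetical, `case-fold-search' is bound after
switching to the response buffer, and the regexp follows the second rx
form with the quote alternation written as a character set.

```elisp
;;; -*- lexical-binding: t; -*-
;; Sketch under the assumptions stated above; `my-html-if-doctype' is
;; a hypothetical name, not the function in eww.el.
(defun my-html-if-doctype (_headers response-buffer)
  "Return \"text/html\" if RESPONSE-BUFFER has an HTML doctype declaration."
  (with-current-buffer response-buffer
    ;; Bind after switching buffers, so the binding applies to the
    ;; buffer that is actually searched.
    (let ((case-fold-search t))
      (save-excursion
        (goto-char (point-min))
        (when (re-search-forward
               (rx "<!doctype" (+ " ") "html"
                   (opt (* " ") "system" (+ " ")
                        (any "\"'") "about:legacy-compat" (any "\"'"))
                   (* " ") ">")
               nil t)
          "text/html")))))
```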






^ permalink raw reply	[flat|nested] 35+ messages in thread

* bug#73133: 29.2; EWW fails to render some webpages
  2024-10-23 10:43 ` Mattias Engdegård
@ 2024-10-23 16:19   ` Mattias Engdegård
  2024-10-23 18:51   ` Jim Porter
  2024-10-24  3:32   ` Sebastián Monía
  2 siblings, 0 replies; 35+ messages in thread
From: Mattias Engdegård @ 2024-10-23 16:19 UTC (permalink / raw)
  To: Sebastián Monía
  Cc: Jim Porter, Eli Zaretskii, control, 73133, ganimard

reopen 73133
stop

Re-opening the bug so that we don't forget to remedy the above points.






^ permalink raw reply	[flat|nested] 35+ messages in thread

* bug#73133: 29.2; EWW fails to render some webpages
  2024-10-23 10:43 ` Mattias Engdegård
  2024-10-23 16:19   ` Mattias Engdegård
@ 2024-10-23 18:51   ` Jim Porter
  2024-10-24  3:35     ` Sebastián Monía
  2024-10-24  3:32   ` Sebastián Monía
  2 siblings, 1 reply; 35+ messages in thread
From: Jim Porter @ 2024-10-23 18:51 UTC (permalink / raw)
  To: Mattias Engdegård, Sebastián Monía
  Cc: Eli Zaretskii, 73133, ganimard

On 10/23/2024 3:43 AM, Mattias Engdegård wrote:
> An exact translation of your regexp to the rx notation might be:
> 
>    (rx "<!doctype" (+ " ") "html" (* " ")
>        (group
>         (| ">"
>            (: "system" (+ " ") (+ (group (| "\"" "'")))
>               "about:legacy-compat"))))
> 
> but perhaps you meant something like
> 
>    (rx "<!doctype" (+ " ") "html"
>        (? (* " ") "system" (+ " ")
>           (| "\"" "'") "about:legacy-compat" (| "\"" "'"))
>        (* " ") ">")

Thoughts on just simplifying to checking for "<!doctype html"? That way, 
we'd also guess "text/html" for all the (mostly obsolete) HTML doctypes 
here: <https://www.w3.org/QA/2002/04/valid-dtd-list.html>.

(Technically the XHTML ones should be "application/xhtml+xml" but I 
don't think that makes any difference for EWW.)
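
A sketch of what that simplified check could look like; the function
name my-html-if-doctype-lax is hypothetical, and allowing one or more
spaces between "doctype" and "html" is an assumption on my part:

```elisp
;;; -*- lexical-binding: t; -*-
;; Hypothetical sketch of the simplified check; not the committed code.
(defun my-html-if-doctype-lax (_headers response-buffer)
  "Return \"text/html\" if RESPONSE-BUFFER contains \"<!doctype html\"."
  (with-current-buffer response-buffer
    (let ((case-fold-search t))
      (save-excursion
        (goto-char (point-min))
        ;; Only the start of the declaration is checked, so legacy
        ;; doctypes with PUBLIC or SYSTEM identifiers match as well.
        (when (re-search-forward "<!doctype +html" nil t)
          "text/html")))))
```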





^ permalink raw reply	[flat|nested] 35+ messages in thread

* bug#73133: 29.2; EWW fails to render some webpages
  2024-10-23 10:43 ` Mattias Engdegård
  2024-10-23 16:19   ` Mattias Engdegård
  2024-10-23 18:51   ` Jim Porter
@ 2024-10-24  3:32   ` Sebastián Monía
  2 siblings, 0 replies; 35+ messages in thread
From: Sebastián Monía @ 2024-10-24  3:32 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: Jim Porter, Eli Zaretskii, 73133, ganimard

Mattias Engdegård <mattias.engdegard@gmail.com> writes:
> Sebastián, thanks for your contribution! A few minor points about this part:
>
>  663   (let ((case-fold-search t)
>  664         (target
>  665          "<!doctype +html *\\(>\\|system +\\(\\\"\\|'\\)+about:legacy-compat\\)"))
>  666     (with-current-buffer response-buffer
>
> First of all, `case-fold-search` becomes buffer-local if set, so binding it before changing buffer won't help. You need to do it the other way around.

Thank you for picking this up! Makes me wonder what I did wrong when
testing, that it worked OK. Will correct it in the next patch.

> The regexp is a bit muddled. (Carets here apply to the quoted line below.)
>
>  665          "<!doctype +html *\\(>\\|system +\\(\\\"\\|'\\)+about:legacy-compat\\)"))
> ...................................^
> Why match the terminating `>` in one branch (without DOCTYPE legacy string) but not the other?

The idea was to match exactly a "modern" doctype declaration, or softly
a legacy one since they are more likely to have...wonky? markup.

> ..................................................^^
> Useless backslash(es) here. Did you mean to include something else?
> (Relint found this one, which is what brought me here.)

I don't think so, it is an honest mistake. I rarely write regexps in
elisp code (or any code, for that matter :) haha), only for interactive use.

> .............................................................^
> Why the `+`? According to the reference, there should be one single or double quote here.
> (https://html.spec.whatwg.org/multipage/syntax.html#doctype-legacy-string)
>
> ................................^^^............^^^
> These two capture groups don't seem to be used; you probably meant to use non-capturing \(?:...\) brackets.

This is correct (just read on non-capturing groups).

> ..................................................^^^^^^^^
> A character alternative would be better here: ["'].
>
> An exact translation of your regexp to the rx notation might be:

Despite all the mistakes in the regex above, and a few tries to
understand it, the rx notation doesn't really click for me.
I am more than happy to use either of the versions you provided.

Thank you for your review!

-- 
Sebastián Monía
https://site.sebasmonia.com/





^ permalink raw reply	[flat|nested] 35+ messages in thread

* bug#73133: 29.2; EWW fails to render some webpages
  2024-10-23 18:51   ` Jim Porter
@ 2024-10-24  3:35     ` Sebastián Monía
  2024-10-24 17:13       ` Sebastián Monía
  0 siblings, 1 reply; 35+ messages in thread
From: Sebastián Monía @ 2024-10-24  3:35 UTC (permalink / raw)
  To: Jim Porter; +Cc: 73133, Mattias Engdegård, Eli Zaretskii, ganimard

Jim Porter <jporterbugs@gmail.com> writes:
> Thoughts on just simplifying to checking for "<!doctype html"? That
> way, we'd also guess "text/html" for all the (mostly obsolete) HTML
> doctypes here: <https://www.w3.org/QA/2002/04/valid-dtd-list.html>.

It sounds like a good idea; I can provide a patch in a couple of days
(maybe tomorrow). That leaves some time for dissenting voices to express
any concerns with this approach.

-- 
Sebastián Monía
https://site.sebasmonia.com/






* bug#73133: 29.2; EWW fails to render some webpages
  2024-10-24  3:35     ` Sebastián Monía
@ 2024-10-24 17:13       ` Sebastián Monía
  2024-10-28 15:45         ` Mattias Engdegård
  0 siblings, 1 reply; 35+ messages in thread
From: Sebastián Monía @ 2024-10-24 17:13 UTC (permalink / raw)
  To: Jim Porter; +Cc: 73133, Mattias Engdegård, Eli Zaretskii, ganimard

[-- Attachment #1: Type: text/plain, Size: 558 bytes --]

Sebastián Monía <sebastian@sebasmonia.com> writes:
> Jim Porter <jporterbugs@gmail.com> writes:
>> Thoughts on just simplifying to checking for "<!doctype html"? That
>> way, we'd also guess "text/html" for all the (mostly obsolete) HTML
>> doctypes here: <https://www.w3.org/QA/2002/04/valid-dtd-list.html>.
>
> It sounds like a good idea; I can provide a patch in a couple of days
> (maybe tomorrow). That leaves some time for dissenting voices to express
> any concerns with this approach.

Attached a patch with the corrections mentioned so far.


[-- Attachment #2: bug#73133 --]
[-- Type: text/x-patch, Size: 1683 bytes --]

From 952930c78dcfe7e4bb3a32504805239ae32073e9 Mon Sep 17 00:00:00 2001
From: Sebastián Monía <sebastian.monia@sebasmonia.com>
Date: Thu, 24 Oct 2024 13:09:11 -0400
Subject: [PATCH] More lax doctype check in EWW (bug#73133)

The regexp to match doctype tags was simplified and will match
more legacy entries; also correct binding of case-fold-search.

* lisp/net/eww.el (eww--html-if-doctype): Update function.
---
 lisp/net/eww.el | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/lisp/net/eww.el b/lisp/net/eww.el
index 7bbbeadaedd..71e4d720b74 100644
--- a/lisp/net/eww.el
+++ b/lisp/net/eww.el
@@ -660,15 +660,14 @@ eww--html-if-doctype
   "Return \"text/html\" if RESPONSE-BUFFER has an HTML doctype declaration.
 HEADERS is unused."
   ;; https://html.spec.whatwg.org/multipage/syntax.html#the-doctype
-  (let ((case-fold-search t)
-        (target
-         "<!doctype +html *\\(>\\|system +\\(\\\"\\|'\\)+about:legacy-compat\\)"))
-    (with-current-buffer response-buffer
-      (goto-char (point-min))
-      ;; match basic <!doctype html> and also legacy variants as
-      ;; specified in link above
-      (when (re-search-forward target nil t)
-        "text/html"))))
+  (with-current-buffer response-buffer
+    (let ((case-fold-search t))
+      (save-excursion
+        (goto-char (point-min))
+        ;; match basic <!doctype html> and also legacy variants as
+        ;; specified in link above - being purposely lax about it
+        (when (re-search-forward "<!doctype html" nil t)
+          "text/html")))))
 
 (defun eww--rename-buffer ()
   "Rename the current EWW buffer.
-- 
2.45.2.windows.1


[-- Attachment #3: Type: text/plain, Size: 54 bytes --]


-- 
Sebastián Monía
https://site.sebasmonia.com/


* bug#73133: 29.2; EWW fails to render some webpages
  2024-10-24 17:13       ` Sebastián Monía
@ 2024-10-28 15:45         ` Mattias Engdegård
  2024-10-30 15:21           ` Sebastián Monía
  0 siblings, 1 reply; 35+ messages in thread
From: Mattias Engdegård @ 2024-10-28 15:45 UTC (permalink / raw)
  To: Sebastián Monía; +Cc: Jim Porter, Eli Zaretskii, 73133, ganimard

On 24 Oct 2024 at 19:13, Sebastián Monía <sebastian@sebasmonia.com> wrote:

> Attached a patch with the corrections mentioned so far.

Fine as far as I'm concerned. You could use `search-forward` instead of `re-search-forward` since you aren't actually using a regexp any more.
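
A minimal sketch of that suggestion, assuming a hypothetical helper `my-html-doctype-p` that works on a string rather than on the response buffer: `search-forward` takes a literal string and still honors `case-fold-search`, so nothing needs escaping.

```elisp
;; Sketch only: `my-html-doctype-p' is a hypothetical helper for
;; illustration; it is not part of eww.el.
(defun my-html-doctype-p (text)
  "Return t if TEXT contains an HTML doctype prefix, ignoring case."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    ;; Bind `case-fold-search' while the temp buffer is current, so the
    ;; binding applies to the buffer that is actually searched.
    (let ((case-fold-search t))
      ;; Literal search: no regexp special characters to escape.
      (and (search-forward "<!doctype html" nil t) t))))

;; Example calls: a legacy HTML 4.01 doctype and a non-HTML document.
(my-html-doctype-p "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01//EN\">")
(my-html-doctype-p "<?xml version=\"1.0\"?><feed/>")
```

Binding `case-fold-search` inside the buffer being searched matters because the variable is per-buffer; a binding made before switching buffers may not apply to the searched one.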







* bug#73133: 29.2; EWW fails to render some webpages
  2024-10-28 15:45         ` Mattias Engdegård
@ 2024-10-30 15:21           ` Sebastián Monía
  2024-11-02 11:35             ` Eli Zaretskii
  0 siblings, 1 reply; 35+ messages in thread
From: Sebastián Monía @ 2024-10-30 15:21 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: Jim Porter, Eli Zaretskii, 73133, ganimard

[-- Attachment #1: Type: text/plain, Size: 358 bytes --]

Mattias Engdegård <mattias.engdegard@gmail.com> writes:

> On 24 Oct 2024 at 19:13, Sebastián Monía <sebastian@sebasmonia.com> wrote:
>
>> Attached a patch with the corrections mentioned so far.
>
> Fine as far as I'm concerned. You could use `search-forward` instead
> of `re-search-forward` since you aren't actually using a regexp any
> more.
>


[-- Attachment #2: search-forward --]
[-- Type: text/x-patch, Size: 1680 bytes --]

From ab4a00e3ae5c8b2f6a9d3355df0ee406dbccaee8 Mon Sep 17 00:00:00 2001
From: Sebastián Monía <sebastian.monia@sebasmonia.com>
Date: Thu, 24 Oct 2024 13:09:11 -0400
Subject: [PATCH] More lax doctype check in EWW (bug#73133)

The regexp to match doctype tags was simplified and will match
more legacy entries; also correct binding of case-fold-search.

* lisp/net/eww.el (eww--html-if-doctype): Update function.
---
 lisp/net/eww.el | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/lisp/net/eww.el b/lisp/net/eww.el
index 7bbbeadaedd..ec2f4e494e4 100644
--- a/lisp/net/eww.el
+++ b/lisp/net/eww.el
@@ -660,15 +660,14 @@ eww--html-if-doctype
   "Return \"text/html\" if RESPONSE-BUFFER has an HTML doctype declaration.
 HEADERS is unused."
   ;; https://html.spec.whatwg.org/multipage/syntax.html#the-doctype
-  (let ((case-fold-search t)
-        (target
-         "<!doctype +html *\\(>\\|system +\\(\\\"\\|'\\)+about:legacy-compat\\)"))
-    (with-current-buffer response-buffer
-      (goto-char (point-min))
-      ;; match basic <!doctype html> and also legacy variants as
-      ;; specified in link above
-      (when (re-search-forward target nil t)
-        "text/html"))))
+  (with-current-buffer response-buffer
+    (let ((case-fold-search t))
+      (save-excursion
+        (goto-char (point-min))
+        ;; match basic <!doctype html> and also legacy variants as
+        ;; specified in link above - being purposely lax about it
+        (when (search-forward "<!doctype html" nil t)
+          "text/html")))))
 
 (defun eww--rename-buffer ()
   "Rename the current EWW buffer.
-- 
2.45.2.windows.1


[-- Attachment #3: Type: text/plain, Size: 95 bytes --]


New patch that uses search-forward :)

-- 
Sebastián Monía
https://site.sebasmonia.com/


* bug#73133: 29.2; EWW fails to render some webpages
  2024-10-30 15:21           ` Sebastián Monía
@ 2024-11-02 11:35             ` Eli Zaretskii
  0 siblings, 0 replies; 35+ messages in thread
From: Eli Zaretskii @ 2024-11-02 11:35 UTC (permalink / raw)
  To: Sebastián Monía
  Cc: jporterbugs, mattias.engdegard, 73133-done, ganimard

> From: Sebastián Monía <sebastian@sebasmonia.com>
> Cc: Jim Porter <jporterbugs@gmail.com>,  Eli Zaretskii <eliz@gnu.org>,
>   73133@debbugs.gnu.org,  ganimard@tuta.io
> Date: Wed, 30 Oct 2024 11:21:32 -0400
> 
> Mattias Engdegård <mattias.engdegard@gmail.com> writes:
> 
> > On 24 Oct 2024 at 19:13, Sebastián Monía <sebastian@sebasmonia.com> wrote:
> >
> >> Attached a patch with the corrections mentioned so far.
> >
> > Fine as far as I'm concerned. You could use `search-forward` instead
> > of `re-search-forward` since you aren't actually using a regexp any
> > more.
> >
> 
> From ab4a00e3ae5c8b2f6a9d3355df0ee406dbccaee8 Mon Sep 17 00:00:00 2001
> From: Sebastián Monía <sebastian.monia@sebasmonia.com>
> Date: Thu, 24 Oct 2024 13:09:11 -0400
> Subject: [PATCH] More lax doctype check in EWW (bug#73133)

Thanks, installed on master, and closing the bug.






Thread overview: 35+ messages
2024-09-08 20:52 bug#73133: 29.2; EWW fails to render some webpages Ganimard via Bug reports for GNU Emacs, the Swiss army knife of text editors
2024-09-10  6:06 ` Jim Porter
2024-09-21  9:13   ` Eli Zaretskii
2024-09-21 17:12     ` Jim Porter
2024-09-23 15:43       ` Sebastián Monía
2024-09-28 10:58         ` Eli Zaretskii
2024-09-30 15:52           ` Sebastián Monía
2024-09-23 15:56       ` Sebastián Monía
2024-09-24 18:31         ` Jim Porter
2024-09-25 20:46           ` Sebastián Monía
2024-09-26  1:59             ` Jim Porter
2024-09-30 17:10               ` Sebastián Monía
2024-10-03 23:39                 ` Jim Porter
2024-10-09  3:30                   ` Sebastián Monía
2024-10-09  3:42                     ` Jim Porter
2024-10-10  2:08                       ` Sebastián Monía
2024-10-14  4:35                         ` Jim Porter
2024-10-14 14:03                           ` Eli Zaretskii
2024-10-15 11:43                             ` Sebastián Monía
2024-10-19  7:46                               ` Eli Zaretskii
2024-10-19 17:56                                 ` Sebastián Monía
2024-10-20 19:17                                   ` Jim Porter
2024-10-21  1:48                                     ` Sebastián Monía
2024-10-22  4:59                                       ` Jim Porter
2024-10-22 12:35                                         ` Sebastián Monía
2024-10-22 12:36                                           ` Ganimard via Bug reports for GNU Emacs, the Swiss army knife of text editors
2024-10-23 10:43 ` Mattias Engdegård
2024-10-23 16:19   ` Mattias Engdegård
2024-10-23 18:51   ` Jim Porter
2024-10-24  3:35     ` Sebastián Monía
2024-10-24 17:13       ` Sebastián Monía
2024-10-28 15:45         ` Mattias Engdegård
2024-10-30 15:21           ` Sebastián Monía
2024-11-02 11:35             ` Eli Zaretskii
2024-10-24  3:32   ` Sebastián Monía
