unofficial mirror of bug-gnu-emacs@gnu.org 
* bug#48211: 28.0.50; eww strips whitespace between <mark> elements
@ 2021-05-03 23:16 Stefan Kangas
  2021-05-03 23:55 ` Basil L. Contovounesios
  0 siblings, 1 reply; 5+ messages in thread
From: Stefan Kangas @ 2021-05-03 23:16 UTC (permalink / raw)
  To: 48211; +Cc: Lars Ingebrigtsen

Opening an HTML file with <mark> elements in eww strips the whitespace
between those elements.

Steps to reproduce:

0. echo "<p><mark>foo</mark> <mark>bar</mark></p>" > /tmp/foo.html
1. emacs -Q
2. M-x eww RET file:///tmp/foo.html RET

Result is that I see, in the eww buffer:

    "foobar"

Expected result is:

    "foo bar"

For a real world example where this matters, see:

    https://dle.rae.es/palabra

In eww, I get:

  1. f. Unidadlingüística, dotadageneralmentedesignificado,
  queseseparadelasdemásmediantepausaspotencialesenlapronunciaciónyblancosenlaescritura.

In Firefox, I get:

  1. f. Unidad lingüística, dotada generalmente de significado, que se
  separa de las demás mediante pausas potenciales en la pronunciación y
  blancos en la escritura.






* bug#48211: 28.0.50; eww strips whitespace between <mark> elements
  2021-05-03 23:16 bug#48211: 28.0.50; eww strips whitespace between <mark> elements Stefan Kangas
@ 2021-05-03 23:55 ` Basil L. Contovounesios
  2021-05-04  0:35   ` Stefan Kangas
  0 siblings, 1 reply; 5+ messages in thread
From: Basil L. Contovounesios @ 2021-05-03 23:55 UTC (permalink / raw)
  To: Stefan Kangas; +Cc: Lars Ingebrigtsen, 48211

found 48211 24.1
quit

Stefan Kangas <stefan@marxist.se> writes:

> Opening an HTML file with <mark> elements in eww strips the whitespace
> between those elements.

I think this is because libxml-parse-html-region specifies
HTML_PARSE_NOBLANKS:

Return CDATA sections (like <style>foo</style>) as text nodes.
3c2317e891 2010-12-06 17:59:52 +0100
https://git.sv.gnu.org/cgit/emacs.git/commit/?id=3c2317e89100833812a7194c0d9d39ae0f52cb33

-- 
Basil






* bug#48211: 28.0.50; eww strips whitespace between <mark> elements
  2021-05-03 23:55 ` Basil L. Contovounesios
@ 2021-05-04  0:35   ` Stefan Kangas
  2021-05-04  0:51     ` Stefan Kangas
  2022-07-01 11:46     ` Lars Ingebrigtsen
  0 siblings, 2 replies; 5+ messages in thread
From: Stefan Kangas @ 2021-05-04  0:35 UTC (permalink / raw)
  To: Basil L. Contovounesios; +Cc: Lars Ingebrigtsen, 48211

"Basil L. Contovounesios" <contovob@tcd.ie> writes:

> I think this is because libxml-parse-html-region specifies
> HTML_PARSE_NOBLANKS:
>
> Return CDATA sections (like <style>foo</style>) as text nodes.
> 3c2317e891 2010-12-06 17:59:52 +0100
> https://git.sv.gnu.org/cgit/emacs.git/commit/?id=3c2317e89100833812a7194c0d9d39ae0f52cb33

Hmm, okay.  For now, I'm seeing this issue with basically any tag that
libxml2 does not already know about, e.g. "<summary>" or "<bdi>".

This is what I came up with before reading Basil's reply:

(with-temp-buffer
  (insert "<p><tt>foo</tt> <tt>bar</tt></p>")
  (libxml-parse-html-region (point-min) (point-max)))

=> (html nil (body nil (p nil (tt nil "foo") " " (tt nil "bar"))))

(with-temp-buffer
  (insert "<p><mark>foo</mark> <mark>bar</mark></p>")
  (libxml-parse-html-region (point-min) (point-max)))

=> (html nil (body nil (p nil (mark nil "foo") (mark nil "bar"))))
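The same comparison can be run over several tags at once, including the
<summary> and <bdi> mentioned above.  This is only a minimal sketch:
`my/whitespace-kept-p' is a hypothetical helper name, not part of
Emacs, and what it reports depends on the libxml2 version Emacs was
built against.

```elisp
(require 'dom)

(defun my/whitespace-kept-p (tag)
  "Return non-nil if the space between two TAG elements survives parsing.
TAG is a string such as \"mark\".  Hypothetical helper, not part of Emacs."
  (with-temp-buffer
    (insert (format "<p><%s>foo</%s> <%s>bar</%s></p>" tag tag tag tag))
    (let* ((dom (libxml-parse-html-region (point-min) (point-max)))
           (p (car (dom-by-tag dom 'p))))
      ;; The children of <p> contain the string " " only if libxml2 kept it.
      (and (member " " (dom-children p)) t))))

;; Compare a tag libxml2 knows about (tt) with the ones reported above.
(when (libxml-available-p)
  (dolist (tag '("tt" "mark" "summary" "bdi"))
    (message "%-8s whitespace kept: %s" tag (my/whitespace-kept-p tag))))
```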

I guess this is a bug in libxml2, so I reported it here:

    https://gitlab.gnome.org/GNOME/libxml2/-/issues/247

FWIW, the below diff works around this bug for me.

diff --git a/lisp/net/shr.el b/lisp/net/shr.el
index cbdeb65ba8..3eb3a5bc49 100644
--- a/lisp/net/shr.el
+++ b/lisp/net/shr.el
@@ -1485,6 +1485,12 @@ shr-tag-tt
   ;; The `tt' tag is deprecated in favor of `code'.
   (shr-tag-code dom))

+(defun shr-tag-mark (dom)
+  (shr-generic dom)
+  ;; Hack to work around bug in libxml2 (Bug#48211):
+  ;; https://gitlab.gnome.org/GNOME/libxml2/-/issues/247
+  (insert " "))
+
 (defun shr-tag-ins (cont)
   (let* ((start (point))
          (color "green")






* bug#48211: 28.0.50; eww strips whitespace between <mark> elements
  2021-05-04  0:35   ` Stefan Kangas
@ 2021-05-04  0:51     ` Stefan Kangas
  2022-07-01 11:46     ` Lars Ingebrigtsen
  1 sibling, 0 replies; 5+ messages in thread
From: Stefan Kangas @ 2021-05-04  0:51 UTC (permalink / raw)
  To: Basil L. Contovounesios; +Cc: Lars Ingebrigtsen, 48211

Stefan Kangas <stefan@marxist.se> writes:

> FWIW, the below diff works around this bug for me.
>
> diff --git a/lisp/net/shr.el b/lisp/net/shr.el
> index cbdeb65ba8..3eb3a5bc49 100644
> --- a/lisp/net/shr.el
> +++ b/lisp/net/shr.el
> @@ -1485,6 +1485,12 @@ shr-tag-tt
>    ;; The `tt' tag is deprecated in favor of `code'.
>    (shr-tag-code dom))
>
> +(defun shr-tag-mark (dom)
> +  (shr-generic dom)
> +  ;; Hack to work around bug in libxml2 (Bug#48211):
> +  ;; https://gitlab.gnome.org/GNOME/libxml2/-/issues/247
> +  (insert " "))
> +
>  (defun shr-tag-ins (cont)
>    (let* ((start (point))
>           (color "green")

Well, I should moderate that statement.

It doesn't exactly fix the bug as I'm now getting this instead:

    1. f. Unidad lingüística , dotada generalmente de significado , que
    se separa de las demás mediante pausas potenciales en la
    pronunciación y blancos en la escritura .

    2. f. Representación gráfica de la palabra hablada .

    3. f. Facultad de hablar .

IOW, whitespace is added even if the following character is
punctuation...
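
One conceivable way to tidy this up is to post-process the rendered
text.  The sketch below assumes it is acceptable to drop any space
that directly precedes common punctuation, which would be wrong for
text that puts such spaces there on purpose.
`my/squeeze-space-before-punctuation' is a hypothetical name, not part
of shr.el.

```elisp
(defun my/squeeze-space-before-punctuation (text)
  "Return TEXT with spaces directly before , . ; : ! or ? removed.
Hypothetical post-processing helper; not part of shr.el."
  (replace-regexp-in-string " +\\([,.;:!?]\\)" "\\1" text))

(my/squeeze-space-before-punctuation
 "Unidad lingüística , dotada generalmente de significado .")
;; => "Unidad lingüística, dotada generalmente de significado."
```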






* bug#48211: 28.0.50; eww strips whitespace between <mark> elements
  2021-05-04  0:35   ` Stefan Kangas
  2021-05-04  0:51     ` Stefan Kangas
@ 2022-07-01 11:46     ` Lars Ingebrigtsen
  1 sibling, 0 replies; 5+ messages in thread
From: Lars Ingebrigtsen @ 2022-07-01 11:46 UTC (permalink / raw)
  To: Stefan Kangas; +Cc: Basil L. Contovounesios, 48211

Stefan Kangas <stefan@marxist.se> writes:

> I guess this is a bug in libxml2, so I reported it here:
>
>     https://gitlab.gnome.org/GNOME/libxml2/-/issues/247

[...]

> +(defun shr-tag-mark (dom)
> +  (shr-generic dom)
> +  ;; Hack to work around bug in libxml2 (Bug#48211):
> +  ;; https://gitlab.gnome.org/GNOME/libxml2/-/issues/247
> +  (insert " "))

I've now pushed a variation of this to Emacs 29, and included a face and
stuff, as

https://www.w3schools.com/tags/tag_mark.asp

recommends.
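
For readers who want something similar on an older Emacs, a handler
along these lines could be registered from an init file.  This is only
a rough sketch and not the code that was committed: the names
`my/shr-mark' and `my/shr-tag-mark' and the yellow background are
assumptions made for illustration.

```elisp
(require 'shr)

(defface my/shr-mark
  '((t :background "yellow" :foreground "black"))
  "Hypothetical face for text inside <mark> elements.")

(defun my/shr-tag-mark (dom)
  "Render the <mark> element DOM with `my/shr-mark' and a trailing space.
Illustrative only; this is not the code committed for Bug#48211."
  (let ((start (point)))
    (shr-generic dom)
    (add-face-text-property start (point) 'my/shr-mark))
  ;; Same workaround as in the quoted patch: re-add the stripped space.
  (insert " "))

;; Register the handler without redefining anything in shr.el.
(add-to-list 'shr-external-rendering-functions
             '(mark . my/shr-tag-mark))
```

Entries in `shr-external-rendering-functions' take precedence over the
built-in `shr-tag-*' functions, so this does not touch shr.el itself.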

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no




