* bug#48211: 28.0.50; eww strips whitespace between <mark> elements
From: Stefan Kangas @ 2021-05-03 23:16 UTC
To: 48211; +Cc: Lars Ingebrigtsen
Opening an HTML file that contains <mark> elements in eww strips the
whitespace between those elements.
Steps to reproduce:
0. echo "<p><mark>foo</mark> <mark>bar</mark></p>" > /tmp/foo.html
1. emacs -Q
2. M-x eww RET file:///tmp/foo.html RET
Result is that I see, in the eww buffer:
"foobar"
Expected result is:
"foo bar"
For a real world example where this matters, see:
https://dle.rae.es/palabra
In eww, I get:
1. f. Unidadlingüística, dotadageneralmentedesignificado,
queseseparadelasdemásmediantepausaspotencialesenlapronunciaciónyblancosenlaescritura.
In Firefox, I get:
1. f. Unidad lingüística, dotada generalmente de significado, que se
separa de las demás mediante pausas potenciales en la pronunciación y
blancos en la escritura.
* bug#48211: 28.0.50; eww strips whitespace between <mark> elements
From: Basil L. Contovounesios @ 2021-05-03 23:55 UTC
To: Stefan Kangas; +Cc: Lars Ingebrigtsen, 48211
found 48211 24.1
quit
Stefan Kangas <stefan@marxist.se> writes:
> Opening an HTML file that contains <mark> elements in eww strips the
> whitespace between those elements.
I think this is because libxml-parse-html-region specifies
HTML_PARSE_NOBLANKS:
Return CDATA sections (like <style>foo</style>) as text nodes.
3c2317e891 2010-12-06 17:59:52 +0100
https://git.sv.gnu.org/cgit/emacs.git/commit/?id=3c2317e89100833812a7194c0d9d39ae0f52cb33
--
Basil
* bug#48211: 28.0.50; eww strips whitespace between <mark> elements
From: Stefan Kangas @ 2021-05-04 0:35 UTC
To: Basil L. Contovounesios; +Cc: Lars Ingebrigtsen, 48211
"Basil L. Contovounesios" <contovob@tcd.ie> writes:
> I think this is because libxml-parse-html-region specifies
> HTML_PARSE_NOBLANKS:
>
> Return CDATA sections (like <style>foo</style>) as text nodes.
> 3c2317e891 2010-12-06 17:59:52 +0100
> https://git.sv.gnu.org/cgit/emacs.git/commit/?id=3c2317e89100833812a7194c0d9d39ae0f52cb33
Hmm, okay. For now, I'm seeing this issue with basically any tag that
libxml2 does not already know about, e.g. "<summary>" or "<bdi>".
This is what I came up with before reading Basil's reply:
  (with-temp-buffer
    (insert "<p><tt>foo</tt> <tt>bar</tt></p>")
    (libxml-parse-html-region (point-min) (point-max)))
  => (html nil (body nil (p nil (tt nil "foo") " " (tt nil "bar"))))
  (with-temp-buffer
    (insert "<p><mark>foo</mark> <mark>bar</mark></p>")
    (libxml-parse-html-region (point-min) (point-max)))
  => (html nil (body nil (p nil (mark nil "foo") (mark nil "bar"))))
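The same comparison can be wrapped in a small helper to try other tags.
This is only a sketch: `my/whitespace-kept-p' is a hypothetical name, it
needs an Emacs built with libxml2 support, and the answer for any given
tag depends on the libxml2 version in use.

```elisp
(require 'dom)

(defun my/whitespace-kept-p (tag)
  "Return t if the parser keeps the space between two TAG elements.
TAG is a string such as \"tt\" or \"mark\".  Hypothetical helper,
not part of Emacs."
  (with-temp-buffer
    (insert (format "<p><%s>foo</%s> <%s>bar</%s></p>" tag tag tag tag))
    (let* ((dom (libxml-parse-html-region (point-min) (point-max)))
           (p (car (dom-by-tag dom 'p))))
      ;; The whitespace survives if a lone " " string is a child of <p>.
      (and (member " " (dom-children p)) t))))

;; Build an alist of (TAG . KEPT) for a few tags mentioned in this report.
(mapcar (lambda (tag) (cons tag (my/whitespace-kept-p tag)))
        '("tt" "mark" "summary" "bdi"))
```

No expected output is given here, since it varies with the libxml2
version.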
I guess this is a bug in libxml2, so I reported it here:
https://gitlab.gnome.org/GNOME/libxml2/-/issues/247
FWIW, the below diff works around this bug for me.
diff --git a/lisp/net/shr.el b/lisp/net/shr.el
index cbdeb65ba8..3eb3a5bc49 100644
--- a/lisp/net/shr.el
+++ b/lisp/net/shr.el
@@ -1485,6 +1485,12 @@ shr-tag-tt
   ;; The `tt' tag is deprecated in favor of `code'.
   (shr-tag-code dom))
 
+(defun shr-tag-mark (dom)
+  (shr-generic dom)
+  ;; Hack to work around bug in libxml2 (Bug#48211):
+  ;; https://gitlab.gnome.org/GNOME/libxml2/-/issues/247
+  (insert " "))
+
 (defun shr-tag-ins (cont)
   (let* ((start (point))
          (color "green")
* bug#48211: 28.0.50; eww strips whitespace between <mark> elements
From: Stefan Kangas @ 2021-05-04 0:51 UTC
To: Basil L. Contovounesios; +Cc: Lars Ingebrigtsen, 48211
Stefan Kangas <stefan@marxist.se> writes:
> FWIW, the below diff works around this bug for me.
>
> diff --git a/lisp/net/shr.el b/lisp/net/shr.el
> index cbdeb65ba8..3eb3a5bc49 100644
> --- a/lisp/net/shr.el
> +++ b/lisp/net/shr.el
> @@ -1485,6 +1485,12 @@ shr-tag-tt
>    ;; The `tt' tag is deprecated in favor of `code'.
>    (shr-tag-code dom))
>
> +(defun shr-tag-mark (dom)
> +  (shr-generic dom)
> +  ;; Hack to work around bug in libxml2 (Bug#48211):
> +  ;; https://gitlab.gnome.org/GNOME/libxml2/-/issues/247
> +  (insert " "))
> +
>  (defun shr-tag-ins (cont)
>    (let* ((start (point))
>           (color "green")
Well, I should moderate that statement.
It doesn't exactly fix the bug as I'm now getting this instead:
1. f. Unidad lingüística , dotada generalmente de significado , que
se separa de las demás mediante pausas potenciales en la
pronunciación y blancos en la escritura .
2. f. Representación gráfica de la palabra hablada .
3. f. Facultad de hablar .
IOW, whitespace is added even if the following character is
punctuation...
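One way around the stray space before punctuation is to insert the space
before each element instead of after it, and only when the previous
character is not already whitespace. The sketch below shows that idea on
a plain list of nodes rather than inside shr; `my/render-inline' and its
node format are hypothetical and exist only for illustration.

```elisp
(defun my/render-inline (nodes)
  "Insert NODES into the current buffer.
NODES is a list of strings and (mark TEXT) forms.  A space is
inserted before a mark form only when the preceding character is
not already whitespace.  Hypothetical helper, not part of shr."
  (dolist (node nodes)
    (if (stringp node)
        (insert node)
      (when (and (not (bobp))
                 (not (memq (char-before) '(?\s ?\n ?\t))))
        (insert " "))
      (insert (cadr node)))))

(with-temp-buffer
  (my/render-inline '((mark "Unidad") (mark "lingüística") ", "
                      (mark "dotada") (mark "generalmente") "."))
  (buffer-string))
;; => "Unidad lingüística, dotada generalmente."
```

This is still a heuristic: it also adds a space where the source had
none, for example in "foo<mark>bar</mark>".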
* bug#48211: 28.0.50; eww strips whitespace between <mark> elements
From: Lars Ingebrigtsen @ 2022-07-01 11:46 UTC
To: Stefan Kangas; +Cc: Basil L. Contovounesios, 48211
Stefan Kangas <stefan@marxist.se> writes:
> I guess this is a bug in libxml2, so I reported it here:
>
> https://gitlab.gnome.org/GNOME/libxml2/-/issues/247
[...]
> +(defun shr-tag-mark (dom)
> +  (shr-generic dom)
> +  ;; Hack to work around bug in libxml2 (Bug#48211):
> +  ;; https://gitlab.gnome.org/GNOME/libxml2/-/issues/247
> +  (insert " "))
I've now pushed a variation of this to Emacs 29, and included a face and
stuff, as
https://www.w3schools.com/tags/tag_mark.asp
recommends.
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no