* bug#40794: 26.3; HTML entities &star; and &starf; (inter alia) are not parsed by libxml-parse-html-region
From: Tim Landscheidt @ 2020-04-23 13:24 UTC
To: 40794
(Prologue: This bug showed up in the "ALT" attribute of an
"IMG" element of an HTML mail in Gnus. I am reasonably certain
that this stems from libxml-parse-html-region and should be
fixed there, but there may be more prudent solutions.)
With GNU Emacs 26.3 on Fedora:
| ELISP> (with-temp-buffer
| (insert "<!DOCTYPE html>
| <html lang=\"en\">
| <head><title>Title</title></head>
| <body>
| <p>Hello world</p>
| <p>&auml;</p>
| <p>&star;</p>
| <p>&starf;</p>
| </body>
| </html>")
| (libxml-parse-html-region (point-min) (point-max)))
| (html
| ((lang . "en"))
| (head nil
| (title nil "Title"))
| (body nil "\n "
| (p nil "Hello world")
| "\n "
| (p nil "ä")
| "\n "
| (p nil "&star;")
| "\n "
| (p nil "&starf;")
| "\n"))
| ELISP>
These should instead yield "ä" (228), "☆" (9734) and
"★" (9733).
lisp/leim/quail/sgml-input.el seems to contain the necessary
data for &star; and &starf; that could probably be fed to
libxml.
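One way the "feed the data to libxml" idea could be approximated without
touching the parser is to rewrite the missing named references into
numeric ones (&star; becomes &#9734;) before parsing. The following is
only a sketch under that assumption, written in C because that is the
layer libxml2 lives in. The table `extra_entities` and the function
`rewrite_extra_entities` are hypothetical names, not part of Emacs or
libxml2, and the two code points are the ones quoted in this report.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical supplementary table; names and values are the two
 * entities from this report (&star; = 9734, &starf; = 9733). */
struct extra_entity {
    const char *name;
    unsigned int value;
};

static const struct extra_entity extra_entities[] = {
    { "star",  9734 },
    { "starf", 9733 },
};

/* Copy src to dst, rewriting "&name;" to "&#NNNN;" for every name found
 * in extra_entities.  All other text, including references the table
 * does not know, is copied unchanged.
 * Returns 0 on success, -1 if dst is too small. */
int rewrite_extra_entities(const char *src, char *dst, size_t dst_size)
{
    const size_t n = sizeof extra_entities / sizeof extra_entities[0];
    size_t out = 0;

    assert(src != NULL && dst != NULL);
    if (dst_size == 0)
        return -1;

    while (*src != '\0') {
        size_t consumed = 0;

        if (*src == '&') {
            for (size_t i = 0; i < n; i++) {
                size_t len = strlen(extra_entities[i].name);

                /* The name must be followed directly by ';', so that
                 * "star" does not match the prefix of "&starf;". */
                if (strncmp(src + 1, extra_entities[i].name, len) == 0
                    && src[1 + len] == ';') {
                    int w = snprintf(dst + out, dst_size - out, "&#%u;",
                                     extra_entities[i].value);
                    if (w < 0 || (size_t)w >= dst_size - out)
                        return -1;
                    out += (size_t)w;
                    consumed = len + 2;   /* '&' + name + ';' */
                    break;
                }
            }
        }

        if (consumed > 0) {
            src += consumed;
            continue;
        }
        if (out + 1 >= dst_size)          /* keep room for the NUL */
            return -1;
        dst[out++] = *src++;
    }
    dst[out] = '\0';
    return 0;
}
```

Called on "<p>&star;</p>" with a large enough buffer, this writes
"<p>&#9734;</p>". In Emacs the same rewrite would more naturally be done
in Lisp on the buffer text before calling libxml-parse-html-region.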
From: Lars Ingebrigtsen @ 2020-07-29 5:26 UTC
To: Tim Landscheidt; +Cc: 40794
Tim Landscheidt <tim@tim-landscheidt.de> writes:
> (Prologue: This bug showed up in the "ALT" attribute of an
> "IMG" element of an HTML mail in Gnus. I am reasonably certain
> that this stems from libxml-parse-html-region and should be
> fixed there, but there may be more prudent solutions.)
[...]
> These should instead yield "ä" (228), "☆" (9734) and
> "★" (9733).
>
> lisp/leim/quail/sgml-input.el seems to contain the necessary
> data for &star; and &starf; that could probably be fed to
> libxml.
As far as I can tell, libxml2 doesn't take a list of entities as an
input when parsing HTML? I may have missed something...
Hm, a bit of googling shows http://xmlsoft.org/html/libxml-entities.html
and there is apparently a way to tell libxml2 about further entities?
But I think this all sounds more like a libxml2 than an Emacs bug,
really?
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
From: Lars Ingebrigtsen @ 2020-07-29 5:35 UTC
To: Tim Landscheidt; +Cc: 40794
I had a look at the libxml2 sources. The logic isn't really explained,
but apparently they include all the <255-value entities, and then a
selected number of the other entities (about 160 of them).
I have no idea what the logic behind this is... perhaps they've just
forgotten to add the new ones? Which makes me think that this is really
a libxml2 bug, and you should report it there instead.
Excerpt:
/************************************************************************
 *                                                                      *
 *              The list of HTML predefined entities                    *
 *                                                                      *
 ************************************************************************/
static const htmlEntityDesc html40EntitiesTable[] = {
/*
* the 4 absolute ones, plus apostrophe.
*/
{ 34, "quot", "quotation mark = APL quote, U+0022 ISOnum" },
{ 38, "amp", "ampersand, U+0026 ISOnum" },
{ 39, "apos", "single quote" },
{ 60, "lt", "less-than sign, U+003C ISOnum" },
{ 62, "gt", "greater-than sign, U+003E ISOnum" },
/*
* A bunch still in the 128-255 range
* Replacing them depend really on the charset used.
*/
{ 160, "nbsp", "no-break space = non-breaking space, U+00A0 ISOnum" },
{ 161, "iexcl","inverted exclamation mark, U+00A1 ISOnum" },
{ 162, "cent", "cent sign, U+00A2 ISOnum" },
[...]
{ 376, "Yuml", "latin capital letter Y with diaeresis, U+0178 ISOlat2" },
/*
* Anything below should really be kept as entities references
*/
{ 402, "fnof", "latin small f with hook = function = florin, U+0192 ISOtech" },
{ 710, "circ", "modifier letter circumflex accent, U+02C6 ISOpub" },
{ 732, "tilde","small tilde, U+02DC ISOdia" },
{ 913, "Alpha","greek capital letter alpha, U+0391" },
{ 914, "Beta", "greek capital letter beta, U+0392" },
{ 915, "Gamma","greek capital letter gamma, U+0393 ISOgrk3" },
{ 916, "Delta","greek capital letter delta, U+0394 ISOgrk3" },
[...]
{ 9824, "spades","black spade suit, U+2660 ISOpub" },
{ 9827, "clubs","black club suit = shamrock, U+2663 ISOpub" },
{ 9829, "hearts","black heart suit = valentine, U+2665 ISOpub" },
{ 9830, "diams","black diamond suit, U+2666 ISOpub" },
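To make the consequence of a fixed table concrete, here is a small
self-contained sketch of a name lookup over a few rows copied from the
excerpt above. The struct `entity_desc` and the function `lookup_entity`
are hypothetical stand-ins modeled on the excerpt, not the actual
libxml2 code. The assumption being illustrated is that a name absent
from the table yields no match, which would explain why the reference
is left in the output as literal text.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Modeled on the htmlEntityDesc rows in the excerpt; the type and
 * function names here are hypothetical. */
struct entity_desc {
    unsigned int value;   /* Unicode code point */
    const char *name;     /* entity name without '&' and ';' */
    const char *desc;     /* human-readable description */
};

/* A handful of rows copied from the excerpt, not the full table. */
static const struct entity_desc entity_table[] = {
    { 34,   "quot",   "quotation mark = APL quote, U+0022 ISOnum" },
    { 38,   "amp",    "ampersand, U+0026 ISOnum" },
    { 160,  "nbsp",   "no-break space = non-breaking space, U+00A0 ISOnum" },
    { 913,  "Alpha",  "greek capital letter alpha, U+0391" },
    { 9824, "spades", "black spade suit, U+2660 ISOpub" },
    { 9830, "diams",  "black diamond suit, U+2666 ISOpub" },
};

/* Linear search by name; returns NULL when the name is not listed. */
const struct entity_desc *lookup_entity(const char *name)
{
    const size_t n = sizeof entity_table / sizeof entity_table[0];

    assert(name != NULL);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(entity_table[i].name, name) == 0)
            return &entity_table[i];
    }
    return NULL;
}
```

With this table, lookup_entity("spades") returns the row with value
9824, while lookup_entity("star") and lookup_entity("starf") return
NULL, since neither name appears among the rows.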
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
From: Stefan Kangas @ 2020-09-09 13:22 UTC
To: Lars Ingebrigtsen; +Cc: 40794, Tim Landscheidt
Lars Ingebrigtsen <larsi@gnus.org> writes:
> I had a look at the libxml2 sources. The logic isn't really explained,
> but apparently they include all the <255-value entities, and then a
> selected number of the other entities (about 160 of them).
>
> I have no idea what the logic behind this is... perhaps they've just
> forgotten to add the new ones? Which makes me think that this is really
> a libxml2 bug, and you should report it there instead.
Agreed. Tim, could you please report this to the libxml2 developers?
From: Stefan Kangas @ 2020-11-25 10:03 UTC
To: Lars Ingebrigtsen; +Cc: 40794, Tim Landscheidt
tags 40794 notabug
close 40794
thanks
Stefan Kangas <stefan@marxist.se> writes:
> Lars Ingebrigtsen <larsi@gnus.org> writes:
>
>> I had a look at the libxml2 sources. The logic isn't really explained,
>> but apparently they include all the <255-value entities, and then a
>> selected number of the other entities (about 160 of them).
>>
>> I have no idea what the logic behind this is... perhaps they've just
>> forgotten to add the new ones? Which makes me think that this is really
>> a libxml2 bug, and you should report it there instead.
>
> Agreed. Tim, could you please report this to the libxml2 developers?
That was 10 weeks ago, and we seem to agree that this is not a bug in
Emacs. I'm therefore closing this bug report.
Please report this issue to the libxml2 developers if it is still an
issue.