From: Lars Ingebrigtsen <larsi@gnus.org>
To: Tim Landscheidt <tim@tim-landscheidt.de>
Cc: 40794@debbugs.gnu.org
Subject: bug#40794: 26.3; HTML entities ☆ and ★ (inter alia) are not parsed by libxml-parse-html-region
Date: Wed, 29 Jul 2020 07:35:51 +0200
Message-ID: <874kpq3mtk.fsf@gnus.org>
In-Reply-To: <878sf23n9k.fsf@gnus.org> (Lars Ingebrigtsen's message of "Wed, 29 Jul 2020 07:26:15 +0200")
I had a look at the libxml2 sources.  The logic isn't really explained,
but apparently they include all the entities with values below 256, and
then a selected number of the other entities (about 160 of them).

I have no idea what the logic behind this is... perhaps they've just
forgotten to add the new ones?  Which makes me think that this is really
a libxml2 bug, and you should report it there instead.

Excerpt:
/************************************************************************
* *
* The list of HTML predefined entities *
* *
************************************************************************/
static const htmlEntityDesc html40EntitiesTable[] = {
/*
* the 4 absolute ones, plus apostrophe.
*/
{ 34, "quot", "quotation mark = APL quote, U+0022 ISOnum" },
{ 38, "amp", "ampersand, U+0026 ISOnum" },
{ 39, "apos", "single quote" },
{ 60, "lt", "less-than sign, U+003C ISOnum" },
{ 62, "gt", "greater-than sign, U+003E ISOnum" },
/*
* A bunch still in the 128-255 range
* Replacing them depend really on the charset used.
*/
{ 160, "nbsp", "no-break space = non-breaking space, U+00A0 ISOnum" },
{ 161, "iexcl","inverted exclamation mark, U+00A1 ISOnum" },
{ 162, "cent", "cent sign, U+00A2 ISOnum" },
[...]
{ 376, "Yuml", "latin capital letter Y with diaeresis, U+0178 ISOlat2" },
/*
* Anything below should really be kept as entities references
*/
{ 402, "fnof", "latin small f with hook = function = florin, U+0192 ISOtech" },
{ 710, "circ", "modifier letter circumflex accent, U+02C6 ISOpub" },
{ 732, "tilde","small tilde, U+02DC ISOdia" },
{ 913, "Alpha","greek capital letter alpha, U+0391" },
{ 914, "Beta", "greek capital letter beta, U+0392" },
{ 915, "Gamma","greek capital letter gamma, U+0393 ISOgrk3" },
{ 916, "Delta","greek capital letter delta, U+0394 ISOgrk3" },
[...]
{ 9824, "spades","black spade suit, U+2660 ISOpub" },
{ 9827, "clubs","black club suit = shamrock, U+2663 ISOpub" },
{ 9829, "hearts","black heart suit = valentine, U+2665 ISOpub" },
{ 9830, "diams","black diamond suit, U+2666 ISOpub" },
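
To make the consequence of a fixed table concrete, here is a minimal sketch of a name lookup over a few rows copied from the excerpt. It is not libxml2's code: the type `entity_desc`, the table `sample_entities`, and the function `sample_entity_lookup` are hypothetical names invented for illustration, and the field layout is an assumption modelled on the excerpt. The point is only that a name missing from such a table cannot be resolved, however valid it is in current HTML.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one row of the table in the excerpt.
 * Field names are assumptions made for this sketch. */
typedef struct {
    unsigned int value;  /* code point the entity stands for */
    const char *name;    /* entity name without '&' and ';' */
    const char *desc;    /* human-readable description */
} entity_desc;

/* A handful of rows copied from the excerpt; the real table is longer. */
static const entity_desc sample_entities[] = {
    { 34,   "quot",   "quotation mark = APL quote, U+0022 ISOnum" },
    { 38,   "amp",    "ampersand, U+0026 ISOnum" },
    { 60,   "lt",     "less-than sign, U+003C ISOnum" },
    { 62,   "gt",     "greater-than sign, U+003E ISOnum" },
    { 160,  "nbsp",   "no-break space = non-breaking space, U+00A0 ISOnum" },
    { 9824, "spades", "black spade suit, U+2660 ISOpub" },
    { 9830, "diams",  "black diamond suit, U+2666 ISOpub" },
};

/* Linear search by entity name.  Returns NULL when the name is not in
 * the table, which is the case a caller has to handle as "unknown". */
static const entity_desc *
sample_entity_lookup(const char *name)
{
    size_t n = sizeof(sample_entities) / sizeof(sample_entities[0]);

    for (size_t i = 0; i < n; i++) {
        if (strcmp(sample_entities[i].name, name) == 0)
            return &sample_entities[i];
    }
    return NULL;
}
```

With this sketch, `sample_entity_lookup("spades")` returns the row with value 9824, while a name that has no row (say, the made-up `nosuchentity`) returns NULL.  A parser built on a table like this has no way to map such a name to a character.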
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no