* bug#40794: 26.3; HTML entities &star; and &starf; (inter alia) are not parsed by libxml-parse-html-region
From: Tim Landscheidt @ 2020-04-23 13:24 UTC
To: 40794

(Prologue: This bug showed up in the "ALT" attribute of an
"IMG" element of an HTML mail in Gnus.  I am reasonably
certain that this stems from libxml-parse-html-region and
should be fixed there, but there may be more prudent
solutions.)

With GNU Emacs 26.3 on Fedora:

| ELISP> (with-temp-buffer
|          (insert "<!DOCTYPE html>
| <html lang=\"en\">
| <head><title>Title</title></head>
| <body>
|  <p>Hello world</p>
|  <p>&auml;</p>
|  <p>&star;</p>
|  <p>&starf;</p>
| </body>
| </html>")
|          (libxml-parse-html-region (point-min) (point-max)))
| (html
|  ((lang . "en"))
|  (head nil
|        (title nil "Title"))
|  (body nil "\n "
|        (p nil "Hello world")
|        "\n "
|        (p nil "ä")
|        "\n "
|        (p nil "&star;")
|        "\n "
|        (p nil "&starf;")
|        "\n"))
|
| ELISP>

These should instead yield "ä" (228), "☆" (9734) and
"★" (9733).

lisp/leim/quail/sgml-input.el seems to contain the necessary
data for &star; and &starf; that could probably be fed to
libxml.

* bug#40794: 26.3; HTML entities &star; and &starf; (inter alia) are not parsed by libxml-parse-html-region
From: Lars Ingebrigtsen @ 2020-07-29 5:26 UTC
To: Tim Landscheidt; +Cc: 40794

Tim Landscheidt <tim@tim-landscheidt.de> writes:

> (Prologue: This bug showed up in the "ALT" attribute of an
> "IMG" element of an HTML mail in Gnus.  I am reasonably
> certain that this stems from libxml-parse-html-region and
> should be fixed there, but there may be more prudent
> solutions.)

[...]

> These should instead yield "ä" (228), "☆" (9734) and
> "★" (9733).
>
> lisp/leim/quail/sgml-input.el seems to contain the necessary
> data for &star; and &starf; that could probably be fed to
> libxml.

As far as I can tell, libxml2 doesn't take a list of entities as an
input when parsing HTML?  I may have missed something...

Hm, a bit of googling shows

http://xmlsoft.org/html/libxml-entities.html

and there is apparently a way to tell libxml2 about further entities?

But I think this all sounds more like a libxml2 than an Emacs bug,
really?

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no

* bug#40794: 26.3; HTML entities &star; and &starf; (inter alia) are not parsed by libxml-parse-html-region
From: Lars Ingebrigtsen @ 2020-07-29 5:35 UTC
To: Tim Landscheidt; +Cc: 40794

I had a look at the libxml2 sources.  The logic isn't really explained,
but apparently they include all the <255-value entities, and then a
selected number of the other entities (about 160 of them).

I have no idea what the logic behind this is... perhaps they've just
forgotten to add the new ones?  Which makes me think that this is really
a libxml2 bug, and you should report it there instead.

Excerpt:

/************************************************************************
 *                                                                      *
 *              The list of HTML predefined entities                    *
 *                                                                      *
 ************************************************************************/

static const htmlEntityDesc  html40EntitiesTable[] = {
/*
 * the 4 absolute ones, plus apostrophe.
 */
{ 34,   "quot", "quotation mark = APL quote, U+0022 ISOnum" },
{ 38,   "amp",  "ampersand, U+0026 ISOnum" },
{ 39,   "apos", "single quote" },
{ 60,   "lt",   "less-than sign, U+003C ISOnum" },
{ 62,   "gt",   "greater-than sign, U+003E ISOnum" },

/*
 * A bunch still in the 128-255 range
 * Replacing them depend really on the charset used.
 */
{ 160,  "nbsp", "no-break space = non-breaking space, U+00A0 ISOnum" },
{ 161,  "iexcl","inverted exclamation mark, U+00A1 ISOnum" },
{ 162,  "cent", "cent sign, U+00A2 ISOnum" },

[...]

{ 376,  "Yuml", "latin capital letter Y with diaeresis, U+0178 ISOlat2" },

/*
 * Anything below should really be kept as entities references
 */
{ 402,  "fnof", "latin small f with hook = function = florin, U+0192 ISOtech" },

{ 710,  "circ", "modifier letter circumflex accent, U+02C6 ISOpub" },
{ 732,  "tilde","small tilde, U+02DC ISOdia" },

{ 913,  "Alpha","greek capital letter alpha, U+0391" },
{ 914,  "Beta", "greek capital letter beta, U+0392" },
{ 915,  "Gamma","greek capital letter gamma, U+0393 ISOgrk3" },
{ 916,  "Delta","greek capital letter delta, U+0394 ISOgrk3" },

[...]

{ 9824, "spades","black spade suit, U+2660 ISOpub" },
{ 9827, "clubs","black club suit = shamrock, U+2663 ISOpub" },
{ 9829, "hearts","black heart suit = valentine, U+2665 ISOpub" },
{ 9830, "diams","black diamond suit, U+2666 ISOpub" },

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no
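
Editorial sketch, not part of the original mail: the table excerpt quoted
above suggests why "spades" is decoded while "star" and "starf" are left
alone.  The following is a minimal illustration in C, assuming a plain
linear search by name over rows of that shape.  The names entity_desc,
find_entity, lookup_entity and extra_entities are hypothetical and are
not libxml2 API.  The supplementary rows only encode the code points
given in the report (9734 and 9733), and their description strings are
made up for the example.  libxml2's real lookup may well work
differently.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical mirror of the { value, name, description } rows in the
   excerpt.  This is NOT libxml2's htmlEntityDesc type. */
struct entity_desc {
    unsigned int value;  /* Unicode code point */
    const char *name;    /* entity name without '&' and ';' */
    const char *desc;    /* human-readable description */
};

/* A few rows copied from the quoted excerpt. */
static const struct entity_desc builtin_entities[] = {
    { 38,   "amp",    "ampersand, U+0026 ISOnum" },
    { 160,  "nbsp",   "no-break space = non-breaking space, U+00A0 ISOnum" },
    { 9824, "spades", "black spade suit, U+2660 ISOpub" },
};

/* Assumed supplement: the two entities from the report, with the code
   points the reporter expects.  Descriptions are illustrative only. */
static const struct entity_desc extra_entities[] = {
    { 9734, "star",  "white star, U+2606" },
    { 9733, "starf", "black star, U+2605" },
};

#define COUNT_OF(a) (sizeof (a) / sizeof (a)[0])

/* Linear search by exact, case-sensitive name.  Returns NULL if absent. */
const struct entity_desc *
find_entity(const struct entity_desc *table, size_t count, const char *name)
{
    assert(table != NULL && name != NULL);
    for (size_t i = 0; i < count; i++) {
        if (strcmp(table[i].name, name) == 0)
            return &table[i];
    }
    return NULL;
}

/* Consult the built-in rows first.  Fall back to the supplement only
   when use_extra is nonzero.  A NULL result stands for "leave the
   reference in the text as written". */
const struct entity_desc *
lookup_entity(const char *name, int use_extra)
{
    const struct entity_desc *hit =
        find_entity(builtin_entities, COUNT_OF(builtin_entities), name);
    if (hit == NULL && use_extra)
        hit = find_entity(extra_entities, COUNT_OF(extra_entities), name);
    return hit;
}
```

With use_extra set to 0, lookup_entity("star", 0) returns NULL, which
corresponds to the unparsed "&star;" in the report.  With use_extra set
to 1 it returns the row holding 9734.  Whether such a supplement belongs
in libxml2 or in a caller is exactly the question the thread leaves
open.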

* bug#40794: 26.3; HTML entities &star; and &starf; (inter alia) are not parsed by libxml-parse-html-region
From: Stefan Kangas @ 2020-09-09 13:22 UTC
To: Lars Ingebrigtsen; +Cc: 40794, Tim Landscheidt

Lars Ingebrigtsen <larsi@gnus.org> writes:

> I had a look at the libxml2 sources.  The logic isn't really explained,
> but apparently they include all the <255-value entities, and then a
> selected number of the other entities (about 160 of them).
>
> I have no idea what the logic behind this is... perhaps they've just
> forgotten to add the new ones?  Which makes me think that this is really
> a libxml2 bug, and you should report it there instead.

Agreed.  Tim, could you please report this to the libxml2 developers?

* bug#40794: 26.3; HTML entities &star; and &starf; (inter alia) are not parsed by libxml-parse-html-region
From: Stefan Kangas @ 2020-11-25 10:03 UTC
To: Lars Ingebrigtsen; +Cc: 40794, Tim Landscheidt

tags 40794 notabug
close 40794
thanks

Stefan Kangas <stefan@marxist.se> writes:

> Lars Ingebrigtsen <larsi@gnus.org> writes:
>
>> I had a look at the libxml2 sources.  The logic isn't really explained,
>> but apparently they include all the <255-value entities, and then a
>> selected number of the other entities (about 160 of them).
>>
>> I have no idea what the logic behind this is... perhaps they've just
>> forgotten to add the new ones?  Which makes me think that this is really
>> a libxml2 bug, and you should report it there instead.
>
> Agreed.  Tim, could you please report this to the libxml2 developers?

That was 10 weeks ago, and we seem to agree that this is not a bug in
Emacs.  I'm therefore closing this bug report.

Please report this issue to the libxml2 developers if it is still an
issue.