unofficial mirror of bug-gnu-emacs@gnu.org 
* bug#40794: 26.3; HTML entities &star; and &starf; (inter alia) are not parsed by libxml-parse-html-region
From: Tim Landscheidt @ 2020-04-23 13:24 UTC
  To: 40794

(Prologue: This bug showed up in the "ALT" attribute of an
"IMG" element of an HTML mail in Gnus.  I am reasonably
certain that this stems from libxml-parse-html-region and
should be fixed there, but there may be more prudent
solutions.)

With GNU Emacs 26.3 on Fedora:

| ELISP> (with-temp-buffer
|          (insert "<!DOCTYPE html>
| <html lang=\"en\">
| <head><title>Title</title></head>
| <body>
|   <p>Hello world</p>
|   <p>&auml;</p>
|   <p>&star;</p>
|   <p>&starf;</p>
| </body>
| </html>")
|          (libxml-parse-html-region (point-min) (point-max)))
| (html
|  ((lang . "en"))
|  (head nil
|        (title nil "Title"))
|  (body nil "\n  "
|        (p nil "Hello world")
|        "\n  "
|        (p nil "ä")
|        "\n  "
|        (p nil "&star;")
|        "\n  "
|        (p nil "&starf;")
|        "\n"))

| ELISP>

These should instead yield "ä" (228), "☆" (9734) and
"★" (9733).

lisp/leim/quail/sgml-input.el seems to contain the necessary
data for &star; and &starf; that could probably be fed to
libxml.
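
Until the parser itself resolves these references, one possible
stopgap on the Emacs side is to post-process the parsed tree.  The
following is only a sketch and not something proposed in this thread:
the `my-' names and the two-entry table are hypothetical, and the
table is written out by hand rather than taken from sgml-input.el.

```elisp
;;; -*- lexical-binding: t -*-

;; Hypothetical table: entity names the parser left untouched,
;; mapped to their code points.
(defvar my-extra-html-entities
  '(("star"  . #x2606)   ; WHITE STAR, decimal 9734
    ("starf" . #x2605))  ; BLACK STAR, decimal 9733
  "Alist of (NAME . CODE-POINT) for entities to decode after parsing.")

(defun my-decode-leftover-entities (text)
  "Replace `&NAME;' in TEXT when NAME is in `my-extra-html-entities'.
References with unknown names are left unchanged."
  (let ((case-fold-search nil))
    (replace-regexp-in-string
     "&\\([[:alnum:]]+\\);"
     (lambda (match)
       (let ((char (cdr (assoc (match-string 1 match)
                               my-extra-html-entities))))
         (if char (string char) match)))
     text t t)))

(defun my-decode-dom-entities (node)
  "Return a copy of NODE with leftover entity references decoded.
NODE is a tree of the shape returned by `libxml-parse-html-region':
either a string or a list (TAG ATTRIBUTES . CHILDREN)."
  (cond
   ((stringp node) (my-decode-leftover-entities node))
   ((consp node)
    `(,(car node)
      ;; Attribute values are decoded too, since the original
      ;; problem showed up in an ALT attribute.
      ,(mapcar (lambda (attribute)
                 (if (stringp (cdr attribute))
                     (cons (car attribute)
                           (my-decode-leftover-entities (cdr attribute)))
                   attribute))
               (cadr node))
      ,@(mapcar #'my-decode-dom-entities (cddr node))))
   (t node)))

;; Usage on a hand-written tree:
;; (my-decode-dom-entities
;;  '(body nil (p nil "&star;") (img ((alt . "&starf;")))))
;; => (body nil (p nil "☆") (img ((alt . "★"))))
```

One caveat with this approach: text that was written as "&amp;star;"
in the source is also "&star;" after parsing, so it would be decoded
as well.  Doing the lookup inside the parser avoids that ambiguity.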






* bug#40794: 26.3; HTML entities &star; and &starf; (inter alia) are not parsed by libxml-parse-html-region
From: Lars Ingebrigtsen @ 2020-07-29  5:26 UTC
  To: Tim Landscheidt; +Cc: 40794

Tim Landscheidt <tim@tim-landscheidt.de> writes:

> (Prologue: This bug showed up in the "ALT" attribute of an
> "IMG" element of an HTML mail in Gnus.  I am reasonably
> certain that this stems from libxml-parse-html-region and
> should be fixed there, but there may be more prudent
> solutions.)

[...]

> These should instead yield "ä" (228), "☆" (9734) and
> "★" (9733).
>
> lisp/leim/quail/sgml-input.el seems to contain the necessary
> data for &star; and &starf; that could probably be fed to
> libxml.

As far as I can tell, libxml2 doesn't take a list of entities as an
input when parsing HTML?  I may have missed something...

Hm, a bit of googling shows http://xmlsoft.org/html/libxml-entities.html
and there is apparently a way to tell libxml2 about further entities?

But I think this all sounds more like a libxml2 than an Emacs bug,
really?

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no






* bug#40794: 26.3; HTML entities &star; and &starf; (inter alia) are not parsed by libxml-parse-html-region
From: Lars Ingebrigtsen @ 2020-07-29  5:35 UTC
  To: Tim Landscheidt; +Cc: 40794


I had a look at the libxml2 sources.  The logic isn't really explained,
but apparently they include all the <255-value entities, and then a
selected number of the other entities (about 160 of them).

I have no idea what the logic behind this is...  perhaps they've just
forgotten to add the new ones?  Which makes me think that this is really
a libxml2 bug, and you should report it there instead.

Excerpt:

/************************************************************************
 *									*
 *	The list of HTML predefined entities			*
 *									*
 ************************************************************************/

static const htmlEntityDesc  html40EntitiesTable[] = {
/*
 * the 4 absolute ones, plus apostrophe.
 */
{ 34,	"quot",	"quotation mark = APL quote, U+0022 ISOnum" },
{ 38,	"amp",	"ampersand, U+0026 ISOnum" },
{ 39,	"apos",	"single quote" },
{ 60,	"lt",	"less-than sign, U+003C ISOnum" },
{ 62,	"gt",	"greater-than sign, U+003E ISOnum" },

/*
 * A bunch still in the 128-255 range
 * Replacing them depend really on the charset used.
 */
{ 160,	"nbsp",	"no-break space = non-breaking space, U+00A0 ISOnum" },
{ 161,	"iexcl","inverted exclamation mark, U+00A1 ISOnum" },
{ 162,	"cent",	"cent sign, U+00A2 ISOnum" },

[...]

{ 376,	"Yuml",	"latin capital letter Y with diaeresis, U+0178 ISOlat2" },

/*
 * Anything below should really be kept as entities references
 */
{ 402,	"fnof",	"latin small f with hook = function = florin, U+0192 ISOtech" },

{ 710,	"circ",	"modifier letter circumflex accent, U+02C6 ISOpub" },
{ 732,	"tilde","small tilde, U+02DC ISOdia" },

{ 913,	"Alpha","greek capital letter alpha, U+0391" },
{ 914,	"Beta",	"greek capital letter beta, U+0392" },
{ 915,	"Gamma","greek capital letter gamma, U+0393 ISOgrk3" },
{ 916,	"Delta","greek capital letter delta, U+0394 ISOgrk3" },

[...]

{ 9824,	"spades","black spade suit, U+2660 ISOpub" },
{ 9827,	"clubs","black club suit = shamrock, U+2663 ISOpub" },
{ 9829,	"hearts","black heart suit = valentine, U+2665 ISOpub" },
{ 9830,	"diams","black diamond suit, U+2666 ISOpub" },
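
To see from Emacs which names a given libxml2 build actually
resolves, one can parse each reference on its own and check whether
it comes back unchanged.  This is a minimal sketch, assuming an Emacs
built with libxml2 support; `my-unresolved-entities' is a made-up
name, and the result depends on the libxml2 version in use.

```elisp
;;; -*- lexical-binding: t -*-
(require 'dom)

(defun my-unresolved-entities (names)
  "Return the members of NAMES that the HTML parser does not resolve.
Each NAME is parsed as `<p>&NAME;</p>'.  If the text that comes back
is still the literal reference, NAME is treated as unknown."
  (let (unresolved)
    (dolist (name names (nreverse unresolved))
      (let ((reference (format "&%s;" name)))
        (with-temp-buffer
          (insert "<p>" reference "</p>")
          (let ((dom (libxml-parse-html-region (point-min) (point-max))))
            ;; `dom-texts' concatenates all text below the root node.
            (when (string= (dom-texts dom "") reference)
              (push name unresolved))))))))

;; Example call:
;; (my-unresolved-entities '("auml" "spades" "star" "starf"))
```

With the behaviour reported at the top of this thread, "star" and
"starf" would be expected in the result, while names present in the
table above, such as "spades", would not.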


-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no







* bug#40794: 26.3; HTML entities &star; and &starf; (inter alia) are not parsed by libxml-parse-html-region
From: Stefan Kangas @ 2020-09-09 13:22 UTC
  To: Lars Ingebrigtsen; +Cc: 40794, Tim Landscheidt

Lars Ingebrigtsen <larsi@gnus.org> writes:

> I had a look at the libxml2 sources.  The logic isn't really explained,
> but apparently they include all the <255-value entities, and then a
> selected number of the other entities (about 160 of them).
>
> I have no idea what the logic behind this is...  perhaps they've just
> forgotten to add the new ones?  Which makes me think that this is really
> a libxml2 bug, and you should report it there instead.

Agreed.  Tim, could you please report this to the libxml2 developers?






* bug#40794: 26.3; HTML entities &star; and &starf; (inter alia) are not parsed by libxml-parse-html-region
From: Stefan Kangas @ 2020-11-25 10:03 UTC
  To: Lars Ingebrigtsen; +Cc: 40794, Tim Landscheidt

tags 40794 notabug
close 40794
thanks

Stefan Kangas <stefan@marxist.se> writes:

> Lars Ingebrigtsen <larsi@gnus.org> writes:
>
>> I had a look at the libxml2 sources.  The logic isn't really explained,
>> but apparently they include all the <255-value entities, and then a
>> selected number of the other entities (about 160 of them).
>>
>> I have no idea what the logic behind this is...  perhaps they've just
>> forgotten to add the new ones?  Which makes me think that this is really
>> a libxml2 bug, and you should report it there instead.
>
> Agreed.  Tim, could you please report this to the libxml2 developers?

That was 10 weeks ago, and we seem to agree that this is not a bug in
Emacs.  I'm therefore closing this bug report.

Please report this issue to the libxml2 developers if it is still an
issue.




