From mboxrd@z Thu Jan 1 00:00:00 1970
From: Lars Ingebrigtsen
Newsgroups: gmane.emacs.bugs
Subject: bug#40794: 26.3; HTML entities ☆ and ★ (inter alia) are not parsed by libxml-parse-html-region
Date: Wed, 29 Jul 2020 07:35:51 +0200
Message-ID: <874kpq3mtk.fsf@gnus.org>
References: <87368uwd1f.fsf@passepartout.tim-landscheidt.de> <878sf23n9k.fsf@gnus.org>
In-Reply-To: <878sf23n9k.fsf@gnus.org> (Lars Ingebrigtsen's message of "Wed, 29 Jul 2020 07:26:15 +0200")
Mime-Version: 1.0
Content-Type: text/plain
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/28.0.50 (gnu/linux)
Cc: 40794@debbugs.gnu.org
To: Tim Landscheidt

I had a look at the libxml2 sources.  The logic isn't really explained,
but apparently they include all the <255-value entities, and then a
selected number of the other entities (about 160 of them).

I have no idea what the logic behind this is... perhaps they've just
forgotten to add the new ones?

Which makes me think that this is really a libxml2 bug, and you should
report it there instead.

Excerpt:

/************************************************************************
 *                                                                      *
 *              The list of HTML predefined entities                    *
 *                                                                      *
 ************************************************************************/

static const htmlEntityDesc html40EntitiesTable[] = {
/*
 * the 4 absolute ones, plus apostrophe.
 */
{ 34, "quot", "quotation mark = APL quote, U+0022 ISOnum" },
{ 38, "amp", "ampersand, U+0026 ISOnum" },
{ 39, "apos", "single quote" },
{ 60, "lt", "less-than sign, U+003C ISOnum" },
{ 62, "gt", "greater-than sign, U+003E ISOnum" },

/*
 * A bunch still in the 128-255 range
 * Replacing them depend really on the charset used.
 */
{ 160, "nbsp", "no-break space = non-breaking space, U+00A0 ISOnum" },
{ 161, "iexcl","inverted exclamation mark, U+00A1 ISOnum" },
{ 162, "cent", "cent sign, U+00A2 ISOnum" },

[...]
{ 376, "Yuml", "latin capital letter Y with diaeresis, U+0178 ISOlat2" },

/*
 * Anything below should really be kept as entities references
 */
{ 402, "fnof", "latin small f with hook = function = florin, U+0192 ISOtech" },

{ 710, "circ", "modifier letter circumflex accent, U+02C6 ISOpub" },
{ 732, "tilde","small tilde, U+02DC ISOdia" },

{ 913, "Alpha","greek capital letter alpha, U+0391" },
{ 914, "Beta", "greek capital letter beta, U+0392" },
{ 915, "Gamma","greek capital letter gamma, U+0393 ISOgrk3" },
{ 916, "Delta","greek capital letter delta, U+0394 ISOgrk3" },

[...]

{ 9824, "spades","black spade suit, U+2660 ISOpub" },
{ 9827, "clubs","black club suit = shamrock, U+2663 ISOpub" },
{ 9829, "hearts","black heart suit = valentine, U+2665 ISOpub" },
{ 9830, "diams","black diamond suit, U+2666 ISOpub" },

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no
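
To illustrate why a name that is missing from a fixed table like the one
excerpted above stays unresolved, here is a minimal sketch of a linear
lookup by entity name.  The names entity_desc, sample_entities and
sample_entity_lookup are made up for this illustration and are not
libxml2's API; the table holds only a few rows copied from the excerpt,
and the actual lookup code in libxml2 may well work differently.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the rows shown in the excerpt:
 * code point, entity name, human-readable description. */
typedef struct {
    unsigned int value;
    const char *name;
    const char *desc;
} entity_desc;

/* A few rows copied from the excerpt.  This is NOT the full table. */
static const entity_desc sample_entities[] = {
    { 34,   "quot",   "quotation mark = APL quote, U+0022 ISOnum" },
    { 38,   "amp",    "ampersand, U+0026 ISOnum" },
    { 60,   "lt",     "less-than sign, U+003C ISOnum" },
    { 62,   "gt",     "greater-than sign, U+003E ISOnum" },
    { 160,  "nbsp",   "no-break space = non-breaking space, U+00A0 ISOnum" },
    { 9829, "hearts", "black heart suit = valentine, U+2665 ISOpub" },
};

/* Return the row whose name matches exactly, or NULL if the name is
 * not in the sample table.  Entity names are case-sensitive. */
const entity_desc *sample_entity_lookup(const char *name)
{
    size_t count = sizeof(sample_entities) / sizeof(sample_entities[0]);

    for (size_t i = 0; i < count; i++) {
        if (strcmp(sample_entities[i].name, name) == 0)
            return &sample_entities[i];
    }
    return NULL;
}
```

With this sample table, sample_entity_lookup("hearts") returns the row
for code point 9829, while sample_entity_lookup("star") returns NULL
simply because no such row exists in the sketch.  A parser built on a
table like this has nothing to substitute for a name it cannot find.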