From mboxrd@z Thu Jan 1 00:00:00 1970
Path: news.gmane.io!.POSTED.ciao.gmane.io!not-for-mail
From: Tim Landscheidt
Newsgroups: gmane.emacs.bugs
Subject: bug#40794: 26.3; HTML entities &star; and &starf; (inter alia) are not parsed by libxml-parse-html-region
Date: Thu, 23 Apr 2020 13:24:12 +0000
Organization: http://www.tim-landscheidt.de/
Message-ID: <87368uwd1f.fsf@passepartout.tim-landscheidt.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
Injection-Info: ciao.gmane.io; posting-host="ciao.gmane.io:159.69.161.202"; logging-data="94273"; mail-complaints-to="usenet@ciao.gmane.io"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux)
To: 40794@debbugs.gnu.org
Original-X-From: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane-mx.org@gnu.org Thu Apr 23 15:25:13 2020
Return-path:
Envelope-to: geb-bug-gnu-emacs@m.gmane-mx.org
Original-Received: from lists.gnu.org ([209.51.188.17]) by ciao.gmane.io with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.92) (envelope-from ) id 1jRbqq-000OPB-HT for geb-bug-gnu-emacs@m.gmane-mx.org; Thu, 23 Apr 2020 15:25:12 +0200
Original-Received: from localhost ([::1]:43198 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1jRbqp-0001xc-La for geb-bug-gnu-emacs@m.gmane-mx.org; Thu, 23 Apr 2020 09:25:11 -0400
Original-Received: from eggs.gnu.org ([2001:470:142:3::10]:42730) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1jRbqg-0001xM-VA for bug-gnu-emacs@gnu.org; Thu, 23 Apr 2020 09:25:03 -0400
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.90_1) (envelope-from ) id 1jRbqg-00028w-I2 for bug-gnu-emacs@gnu.org; Thu, 23 Apr 2020 09:25:02 -0400
Original-Received: from debbugs.gnu.org ([209.51.188.43]:42700) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.90_1) (envelope-from ) id 1jRbqg-00028S-68 for bug-gnu-emacs@gnu.org; Thu, 23 Apr 2020 09:25:02 -0400
Original-Received: from Debian-debbugs by debbugs.gnu.org with local (Exim 4.84_2) (envelope-from ) id 1jRbqg-0005hU-0X for bug-gnu-emacs@gnu.org; Thu, 23 Apr 2020 09:25:02 -0400
X-Loop: help-debbugs@gnu.org
Resent-From: Tim Landscheidt
Original-Sender: "Debbugs-submit"
Resent-CC: bug-gnu-emacs@gnu.org
Resent-Date: Thu, 23 Apr 2020 13:25:01 +0000
Resent-Message-ID:
Resent-Sender: help-debbugs@gnu.org
X-GNU-PR-Message: report 40794
X-GNU-PR-Package: emacs
X-Debbugs-Original-To: bug-gnu-emacs@gnu.org
Original-Received: via spool by submit@debbugs.gnu.org id=B.158764826121846 (code B ref -1); Thu, 23 Apr 2020 13:25:01 +0000
Original-Received: (at submit) by debbugs.gnu.org; 23 Apr 2020 13:24:21 +0000
Original-Received: from localhost ([127.0.0.1]:54246 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1jRbq1-0005gI-Ak for submit@debbugs.gnu.org; Thu, 23 Apr 2020 09:24:21 -0400
Original-Received: from lists.gnu.org ([209.51.188.17]:44262) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1jRbpz-0005gA-UJ for submit@debbugs.gnu.org; Thu, 23 Apr 2020 09:24:20 -0400
Original-Received: from eggs.gnu.org ([2001:470:142:3::10]:42588) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1jRbpz-0001TL-IS for bug-gnu-emacs@gnu.org; Thu, 23 Apr 2020 09:24:19 -0400
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.90_1) (envelope-from ) id 1jRbpz-00019P-01 for bug-gnu-emacs@gnu.org; Thu, 23 Apr 2020 09:24:19 -0400
Original-Received: from andalucia.tim-landscheidt.de ([116.203.78.250]:48564 helo=andalucia) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1jRbpy-00013q-Dd for bug-gnu-emacs@gnu.org; Thu, 23 Apr 2020 09:24:18 -0400
Original-Received: from dslb-090-186-010-106.090.186.pools.vodafone-ip.de ([90.186.10.106]:52708 helo=passepartout.tim-landscheidt.de) by andalucia with esmtpsa (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.89) (envelope-from ) id
 1jRbps-0003BM-Sg for bug-gnu-emacs@gnu.org; Thu, 23 Apr 2020 15:24:12 +0200
Received-SPF: pass client-ip=116.203.78.250; envelope-from=tim@tim-landscheidt.de; helo=andalucia
X-detected-operating-system: by eggs.gnu.org: First seen = 2020/04/23 09:24:15
X-ACL-Warn: Detected OS = Linux 3.11 and newer
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
X-Received-From: 209.51.188.43
X-BeenThere: bug-gnu-emacs@gnu.org
List-Id: "Bug reports for GNU Emacs, the Swiss army knife of text editors"
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Errors-To: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane-mx.org@gnu.org
Original-Sender: "bug-gnu-emacs"
Xref: news.gmane.io gmane.emacs.bugs:178852
Archived-At:

(Prologue: This bug showed up in the "ALT" attribute of an
"IMG" element of an HTML mail in Gnus.  I am reasonably
certain that this stems from libxml-parse-html-region and
should be fixed there, but there may be more prudent
solutions.)

With GNU Emacs 26.3 on Fedora:

| ELISP> (with-temp-buffer
|          (insert "<html lang=\"en\">
| <head>
| <title>Title</title>
| </head>
| <body>
|  <p>Hello world</p>
|  <p>&auml;</p>
|  <p>&star;</p>
|  <p>&starf;</p>
| </body>
| </html>
| ")
|          (libxml-parse-html-region (point-min) (point-max)))
| (html
|  ((lang . "en"))
|  (head nil
|        (title nil "Title"))
|  (body nil "\n "
|        (p nil "Hello world")
|        "\n "
|        (p nil "ä")
|        "\n "
|        (p nil "&star;")
|        "\n "
|        (p nil "&starf;")
|        "\n"))
|
| ELISP>

Only &auml; is decoded; &star; and &starf; are returned
verbatim.  The three entities should instead yield "ä" (228),
"☆" (9734) and "★" (9733).

lisp/leim/quail/sgml-input.el seems to contain the necessary
data for &star; and &starf; that could probably be fed to
libxml.
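
Until this is handled in libxml-parse-html-region itself, the
leftover entities could be patched up in the returned parse tree.
The following is only a minimal sketch of that idea, not a
proposed fix.  The names my-extra-html-entities,
my-decode-entity-string and my-decode-extra-entities are
hypothetical, the table lists just the two entities from this
report, and it assumes that undecoded entities appear verbatim
in text nodes and attribute values:

```elisp
;; Hypothetical table: only the two entities from this report.
(defvar my-extra-html-entities
  '(("&star;" . "☆")    ; U+2606 WHITE STAR (9734)
    ("&starf;" . "★"))  ; U+2605 BLACK STAR (9733)
  "Alist mapping undecoded entity strings to their replacements.")

(defun my-decode-entity-string (text)
  "Return TEXT with every entity in `my-extra-html-entities' replaced."
  (dolist (pair my-extra-html-entities text)
    (setq text (replace-regexp-in-string
                (regexp-quote (car pair)) (cdr pair) text t t))))

(defun my-decode-extra-entities (node)
  "Return a copy of parse tree NODE with leftover entities decoded.
NODE is a string or a list (TAG ATTRIBUTES . CHILDREN) in the shape
returned by `libxml-parse-html-region'.  Both attribute values and
text nodes are processed."
  (cond
   ((stringp node) (my-decode-entity-string node))
   ((consp node)
    (let ((attrs (mapcar (lambda (attr)
                           (cons (car attr)
                                 (my-decode-entity-string (cdr attr))))
                         (cadr node)))
          (children (mapcar #'my-decode-extra-entities (cddr node))))
      (append (list (car node) attrs) children)))
   (t node)))
```

It would be wrapped around the parser call, for example
(my-decode-extra-entities (libxml-parse-html-region (point-min)
(point-max))).  A real fix would presumably take its data from
sgml-input.el instead of a hand-written alist.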