From: Katsumi Yamaoka
Newsgroups: gmane.emacs.bugs
Subject: bug#30789: 26.0.91; xml-parse-region works but libxml-parse-html-region doesn't
Date: Tue, 13 Mar 2018 11:28:45 +0900
Organization: Emacsen advocacy group
To: Lars Ingebrigtsen, 積丹尼 Dan Jacobson
Cc: 30789@debbugs.gnu.org
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="=-=-="

--=-=-=

On Tue, 13 Mar 2018 01:44:22 +0100, Lars Ingebrigtsen wrote:
> libxml is more strict about correctness of the input than most other
> HTML parsers.  I don't think there's anything we can do about this
> problematic input other than ponder whether Emacs should use a different
> HTML parser, which I think sounds of unlikely.  :-)

I see.  I agree not to modify libxml.

Jidanni, how about trying the following patch personally if you
often get such broken mails?  Though I'm not quite sure if it
does not cause another problem, it fixes at least the mail in
question.

--=-=-=
Content-Type: text/x-patch
Content-Disposition: inline

--- mm-decode.el~	2018-02-28 02:01:37.897607000 +0000
+++ mm-decode.el	2018-03-13 02:23:04.321753900 +0000
@@ -1810,6 +1810,11 @@
             (when (and (or coding
                            (setq coding (mm-charset-to-coding-system charset nil t)))
                        (not (eq coding 'ascii)))
+              ;; Remove extra bytes in utf-8 encoded data.
+              (when (eq coding 'utf-8)
+                (goto-char (point-min))
+                (while (re-search-forward "[\x00-\x7f]+\\([\x80-\xbf]\\)" nil t)
+                  (replace-match "\\1")))
               (insert (prog1
                           (decode-coding-string (buffer-string) coding)
                         (erase-buffer)

--=-=-=--