unofficial mirror of bug-gnu-emacs@gnu.org 
* bug#30789: 26.0.91; xml-parse-region works but libxml-parse-html-region doesn't
@ 2018-03-12 23:38 Katsumi Yamaoka
  2018-03-13  0:44 ` Lars Ingebrigtsen
  0 siblings, 1 reply; 9+ messages in thread
From: Katsumi Yamaoka @ 2018-03-12 23:38 UTC (permalink / raw)
  To: 30789; +Cc: 積丹尼 Dan Jacobson

[-- Attachment #1: Type: text/plain, Size: 1284 bytes --]

Hi,

Jidanni mailed me an example HTML mail that contains broken
encoded text, as follows:

<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body>
    .......公告辦理現金救助及低利貸款\343\200
 \202因2月
 低溫危害農作物為延遲性損害,.......
  </body>
</html>

This is part of the contents.  The original is encoded in utf-8
and 8-bit (attached to this mail).  Here "\343\200\n \202" is the
encoded form of "。", i.e., "\343\200\202", but broken in the
middle of its bytes.  It seems that some stupid mail software does
this because of a long encoded line.
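
To make the breakage concrete, here is a minimal Python sketch of the
byte sequences described above.  It is not what Gnus or libxml do
internally; the helper name try_strict and the "before"/"after" sample
text are made up for illustration.  It only shows that the intact
three bytes decode to "。", that the split sequence is rejected by a
strict UTF-8 decoder, and that a lenient decoder can still keep the
text that follows the broken point.

```python
# "。" (U+3002) is three bytes in UTF-8: \343\200\202 in octal notation.
intact = "。".encode("utf-8")
assert intact == b"\343\200\202"

# The same character with a newline and a space inserted between the
# second and third bytes, as in the mail in question.
broken = b"\343\200\n \202"


def try_strict(data: bytes) -> str:
    """Hypothetical helper: decode strictly, or describe the failure."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as err:
        return f"error at byte {err.start}: {err.reason}"


print(try_strict(intact))
print(try_strict(broken))

# A lenient decoder substitutes replacement characters for the damaged
# bytes instead of stopping, so the surrounding text survives.
lenient = (b"before" + broken + b"after").decode("utf-8", errors="replace")
print(repr(lenient))
```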

When I read the mail using Gnus + shr, the text after the broken
point is all cut off.  That is what libxml-parse-html-region does,
whereas xml-parse-region doesn't cut it.  Moreover a web browser,
to which I send the html data using the `K H' command, shows all
the text (the broken character is shown as is, though).

This is not necessarily a libxml bug anyway, but I hope it works
like xml-parse.

Thanks.

In GNU Emacs 26.0.91 (build 1, x86_64-unknown-cygwin, GTK+ Version 3.22.28)
 of 2018-03-12 built on localhost
Windowing system distributor 'The Cygwin/X Project', version 11.0.11906000

[-- Attachment #2: example-html-mail.gz --]
[-- Type: application/x-gunzip, Size: 238 bytes --]

* bug#30789: 26.0.91; xml-parse-region works but libxml-parse-html-region doesn't
  2018-03-12 23:38 bug#30789: 26.0.91; xml-parse-region works but libxml-parse-html-region doesn't Katsumi Yamaoka
@ 2018-03-13  0:44 ` Lars Ingebrigtsen
  2018-03-13  2:28   ` Katsumi Yamaoka
  2018-03-13  2:55   ` 積丹尼 Dan Jacobson
  0 siblings, 2 replies; 9+ messages in thread
From: Lars Ingebrigtsen @ 2018-03-13  0:44 UTC (permalink / raw)
  To: Katsumi Yamaoka; +Cc: 30789, 積丹尼 Dan Jacobson

Katsumi Yamaoka <yamaoka@jpl.org> writes:

> When I read the mail using Gnus + shr, the text after the broken
> point is all cut off.  That is what libxml-parse-html-region does,
> whereas xml-parse-region doesn't cut it.  Moreover a web browser,
> to which I send the html data using the `K H' command, shows all
> the text (the broken character is shown as is, though).
>
> This is not necessarily a libxml bug anyway, but I hope it works
> like xml-parse.

libxml is more strict about correctness of the input than most other
HTML parsers.  I don't think there's anything we can do about this
problematic input other than ponder whether Emacs should use a different
HTML parser, which I think sounds unlikely.  :-)

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no





* bug#30789: 26.0.91; xml-parse-region works but libxml-parse-html-region doesn't
  2018-03-13  0:44 ` Lars Ingebrigtsen
@ 2018-03-13  2:28   ` Katsumi Yamaoka
  2018-03-13  3:29     ` 積丹尼 Dan Jacobson
  2018-03-13  3:31     ` Katsumi Yamaoka
  2018-03-13  2:55   ` 積丹尼 Dan Jacobson
  1 sibling, 2 replies; 9+ messages in thread
From: Katsumi Yamaoka @ 2018-03-13  2:28 UTC (permalink / raw)
  To: Lars Ingebrigtsen, 積丹尼 Dan Jacobson; +Cc: 30789

[-- Attachment #1: Type: text/plain, Size: 568 bytes --]

On Tue, 13 Mar 2018 01:44:22 +0100, Lars Ingebrigtsen wrote:
> libxml is more strict about correctness of the input than most other
> HTML parsers.  I don't think there's anything we can do about this
> problematic input other than ponder whether Emacs should use a different
> HTML parser, which I think sounds unlikely.  :-)

I see.  I agree that we shouldn't modify libxml.  Jidanni, how about
trying the following patch locally if you often get such broken mails?
I'm not sure that it won't cause other problems, but it does fix at
least the mail in question.


[-- Attachment #2: Type: text/x-patch, Size: 547 bytes --]

--- mm-decode.el~	2018-02-28 02:01:37.897607000 +0000
+++ mm-decode.el	2018-03-13 02:23:04.321753900 +0000
@@ -1810,6 +1810,11 @@
       (when (and (or coding
 		     (setq coding (mm-charset-to-coding-system charset nil t)))
 		 (not (eq coding 'ascii)))
+	;; Remove extra bytes in utf-8 encoded data.
+	(when (eq coding 'utf-8)
+	  (goto-char (point-min))
+	  (while (re-search-forward "[\x00-\x7f]+\\([\x80-\xbf]\\)" nil t)
+	    (replace-match "\\1")))
 	(insert (prog1
 		    (decode-coding-string (buffer-string) coding)
 		  (erase-buffer)

* bug#30789: 26.0.91; xml-parse-region works but libxml-parse-html-region doesn't
  2018-03-13  0:44 ` Lars Ingebrigtsen
  2018-03-13  2:28   ` Katsumi Yamaoka
@ 2018-03-13  2:55   ` 積丹尼 Dan Jacobson
  1 sibling, 0 replies; 9+ messages in thread
From: 積丹尼 Dan Jacobson @ 2018-03-13  2:55 UTC (permalink / raw)
  To: Lars Ingebrigtsen; +Cc: Katsumi Yamaoka, 30789

Expecting perfect input is OK for compilers, but not for browsers
https://blog.codinghorror.com/its-a-malformed-world/





* bug#30789: 26.0.91; xml-parse-region works but libxml-parse-html-region doesn't
  2018-03-13  2:28   ` Katsumi Yamaoka
@ 2018-03-13  3:29     ` 積丹尼 Dan Jacobson
  2018-03-13 11:28       ` Lars Ingebrigtsen
  2018-03-13  3:31     ` Katsumi Yamaoka
  1 sibling, 1 reply; 9+ messages in thread
From: 積丹尼 Dan Jacobson @ 2018-03-13  3:29 UTC (permalink / raw)
  To: Katsumi Yamaoka; +Cc: Lars Ingebrigtsen, 30789

Thank you for the patch but the real answer is to do what all other
browsers do... show as much as possible.

There is no browser out there that would dream of dying on the slightest
mistake.

Anyway if you guys are really going to use XML::LibXML::Parser (?) then
maybe loosen up some of

       recover
           /parser, html, reader/

           recover from errors; possible values are 0, 1, and 2

           A true value turns on recovery mode which allows one to parse broken XML or HTML data. The recovery mode allows the parser to return the successfully parsed
           portion of the input document. This is useful for almost well-formed documents, where for example a closing tag is missing somewhere. Still, XML::LibXML will
           only parse until the first fatal (non-recoverable) error occurs, reporting recoverable parsing errors as warnings. To suppress even these warnings, use
           recover=>2.

           Note that validation is switched off automatically in recovery mode.

       validation
           /parser, reader/

           validate with the DTD; possible values are 0 and 1


      ERROR REPORTING
       XML::LibXML throws exceptions during parsing, validation or XPath processing (and some other occasions). These errors can be caught by using eval blocks. The error
       is stored in $@. There are two implementations: the old one throws $@ which is just a message string, in the new one $@ is an object from the class
       XML::LibXML::Error; this class overrides the operator "" so that when printed, the object flattens to the usual error message.

       XML::LibXML throws errors as they occur. This is a very common misunderstanding in the use of XML::LibXML. If the eval is omitted, XML::LibXML will always halt your
       script by "croaking" (see Carp man page for details).

       Also note that an increasing number of functions throw errors if bad data is passed as arguments. If you cannot assure valid data passed to XML::LibXML you should
       eval these functions.

       Note: since version 1.59, get_last_error() is no longer available in XML::LibXML for thread-safety reasons.





* bug#30789: 26.0.91; xml-parse-region works but libxml-parse-html-region doesn't
  2018-03-13  2:28   ` Katsumi Yamaoka
  2018-03-13  3:29     ` 積丹尼 Dan Jacobson
@ 2018-03-13  3:31     ` Katsumi Yamaoka
  1 sibling, 0 replies; 9+ messages in thread
From: Katsumi Yamaoka @ 2018-03-13  3:31 UTC (permalink / raw)
  To: Lars Ingebrigtsen, 積丹尼 Dan Jacobson; +Cc: 30789

[-- Attachment #1: Type: text/plain, Size: 282 bytes --]

On Tue, 13 Mar 2018 11:28:45 +0900, Katsumi Yamaoka wrote:
> +	;; Remove extra bytes in utf-8 encoded data.
> +	(when (eq coding 'utf-8)
> +	  (goto-char (point-min))
> +	  (while (re-search-forward "[\x00-\x7f]+\\([\x80-\xbf]\\)" nil t)
> +	    (replace-match "\\1")))

Corrected:

[-- Attachment #2: Type: text/x-patch, Size: 589 bytes --]

--- mm-decode.el~	2018-02-28 02:01:37.897607000 +0000
+++ mm-decode.el	2018-03-13 03:27:56.885844100 +0000
@@ -1810,6 +1810,13 @@
       (when (and (or coding
 		     (setq coding (mm-charset-to-coding-system charset nil t)))
 		 (not (eq coding 'ascii)))
+	;; Remove extra bytes in utf-8 encoded data.
+	(when (eq coding 'utf-8)
+	  (goto-char (point-min))
+	  (while (re-search-forward
+		  "\\([\xc2-\xf7][\x80-\xbf]?\\)[\x00-\x7f]+\\([\x80-\xbf]\\)"
+		  nil t)
+	    (replace-match "\\1\\2")))
 	(insert (prog1
 		    (decode-coding-string (buffer-string) coding)
 		  (erase-buffer)
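
For readers who want to compare the two regexps from the patches, the
following Python sketch transcribes them to byte patterns.  This is an
approximation under stated assumptions: it assumes the Emacs buffer is
unibyte at that point so each regexp character matches one raw byte,
and it uses re.sub in place of the re-search-forward/replace-match
loop.  The function names and the sample inputs are hypothetical.

```python
import re

# Byte-level transcriptions of the two regexps (assumption: unibyte data).
FIRST = re.compile(rb"[\x00-\x7f]+([\x80-\xbf])")
CORRECTED = re.compile(rb"([\xc2-\xf7][\x80-\xbf]?)[\x00-\x7f]+([\x80-\xbf])")


def repair_first(data: bytes) -> bytes:
    """First patch: drop any ASCII run directly before a continuation byte."""
    return FIRST.sub(rb"\1", data)


def repair_corrected(data: bytes) -> bytes:
    """Corrected patch: only act when a lead byte precedes the ASCII run."""
    return CORRECTED.sub(rb"\1\2", data)


# The split "。" from the mail in question: both versions rejoin it.
split_char = b"\343\200\n \202"
assert repair_first(split_char) == b"\343\200\202"
assert repair_corrected(split_char) == b"\343\200\202"

# A stray continuation byte after ordinary ASCII text: the first
# version deletes the ASCII text, the corrected version leaves it alone.
stray = b"price: \x80 5"
assert repair_first(stray) == b"\x80 5"
assert repair_corrected(stray) == stray

# A four-byte sequence split before its last byte is not matched by the
# corrected pattern, since it allows at most one continuation byte
# between the lead byte and the inserted ASCII.
four = b"\xf0\x9f\x98\n \x80"
assert repair_corrected(four) == four
```

One plausible reading of the correction, then, is that anchoring on a
lead byte avoids deleting legitimate ASCII text in mislabelled mails,
at the cost of not covering every possible split position.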

* bug#30789: 26.0.91; xml-parse-region works but libxml-parse-html-region doesn't
  2018-03-13  3:29     ` 積丹尼 Dan Jacobson
@ 2018-03-13 11:28       ` Lars Ingebrigtsen
  2018-03-13 20:27         ` 積丹尼 Dan Jacobson
  0 siblings, 1 reply; 9+ messages in thread
From: Lars Ingebrigtsen @ 2018-03-13 11:28 UTC (permalink / raw)
  To: 積丹尼 Dan Jacobson; +Cc: Katsumi Yamaoka, 30789

積丹尼 Dan Jacobson <jidanni@jidanni.org> writes:

> There is no browser out there that would dream of dying on the slightest
> mistake.

I agree, and you should report these problems to the libxml2
maintainers.

> Anyway if you guys are really going to use XML::LibXML::Parser (?) then
> maybe loosen up some of

Our calls are as loose as they get, if I recall correctly.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no





* bug#30789: 26.0.91; xml-parse-region works but libxml-parse-html-region doesn't
  2018-03-13 11:28       ` Lars Ingebrigtsen
@ 2018-03-13 20:27         ` 積丹尼 Dan Jacobson
  2018-03-13 22:30           ` Lars Ingebrigtsen
  0 siblings, 1 reply; 9+ messages in thread
From: 積丹尼 Dan Jacobson @ 2018-03-13 20:27 UTC (permalink / raw)
  To: Lars Ingebrigtsen; +Cc: Katsumi Yamaoka, 30789

>>>>> "LI" == Lars Ingebrigtsen <larsi@gnus.org> writes:

LI> I agree, and you should report these problems to the libxml2
LI> maintainers.

I would not want to ruin my reputation by letting them know I was
inputting unvalidated XML and expecting whatever results.





* bug#30789: 26.0.91; xml-parse-region works but libxml-parse-html-region doesn't
  2018-03-13 20:27         ` 積丹尼 Dan Jacobson
@ 2018-03-13 22:30           ` Lars Ingebrigtsen
  0 siblings, 0 replies; 9+ messages in thread
From: Lars Ingebrigtsen @ 2018-03-13 22:30 UTC (permalink / raw)
  To: 積丹尼 Dan Jacobson; +Cc: Katsumi Yamaoka, 30789

積丹尼 Dan Jacobson <jidanni@jidanni.org> writes:

> LI> I agree, and you should report these problems to the libxml2
> LI> maintainers.
>
> I would not want to ruin my reputation by letting them know I was
> inputting unvalidated XML and expecting whatever results.

:-)

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no




