* bug#24831: shr mangling messages [not found] <87shrgvt8y.fsf@jidanni.org> @ 2016-10-31 2:45 ` 積丹尼 Dan Jacobson 2016-11-01 1:39 ` Katsumi Yamaoka ` (2 more replies) 0 siblings, 3 replies; 16+ messages in thread From: 積丹尼 Dan Jacobson @ 2016-10-31 2:45 UTC (permalink / raw) To: 24831; +Cc: Katsumi Yamaoka [-- Attachment #1: Type: text/plain, Size: 2387 bytes --] Gentelmen, the "shr" program is mangling messages. It could remove vital words, causing arguments: "I did include the address!" "No you didn't." "Yes I did. Your mail reader probably cut it out." We're talking data loss here. It may still be on the disk, but not shown to the user. True, the HTML might not be perfect, but at least Chromium, Firefox, etc. show it fine. >>>>> "KY" == Katsumi Yamaoka <yamaoka@jpl.org> writes: KY> Emacs-w3m renders it as: KY> http://w KY> Hi, you have a new email from Catherineme KY> [25] KY> View your inbox at http://www.travel-buddies.com/Inbox.aspx KY> © Travel Buddies 2015 | All rights reserved Hmmm, w3m -dump on the attachment shows the first URL in full. KY> However shr renders it as: KY> Travel Buddies KY> © Travel Buddies 2015 | All rights reserved KY> http://www.travel-buddies.com/ KY> * KY> There lacks the "Hi, you have a new mail" message. The return KY> value of `libxml-parse-html-region' contains the message as KY> (h1 nil (span nil "Hi, you have a new email from") "Catherineme") KY> (p nil "View your inbox at " KY> (a ((href . "http://www.travel-buddies.com/Inbox.aspx")) KY> "http://www.travel-buddies.com/Inbox.aspx")) KY> regardless of whether all style specs are removed[1] or not KY> (three nil portions above are replaced with style specs if they KY> are not removed). So, style specs are not cause of not KY> displaying some meaningful message in an html mail, I believe. KY> In that case, making shr display images does not help. KY> I think there's something wrong in shr.el, and what you should KY> do would be to send a bug report to the Emacs bug team, i.e., KY> M-x report-emacs-bug, with the sample html part (I'm not so KY> familiar with recent shr, sorry). Note that a mail containing KY> html part might be rejected by the server, so putting it in your KY> web site separately would be better. KY> [1] I tested it by modifying mm-shr so as to remove style specs. OK I'll send the message, [-- Attachment #2: message --] [-- Type: application/gzip, Size: 1786 bytes --] [-- Attachment #3: Type: text/plain, Size: 143 bytes --] here in this bug report about In GNU Emacs 24.5.1 (i686-pc-linux-gnu, GTK+ Version 3.21.5) of 2016-09-06 on x86-csail-01, modified by Debian. ^ permalink raw reply [flat|nested] 16+ messages in thread
* bug#24831: shr mangling messages 2016-10-31 2:45 ` bug#24831: shr mangling messages 積丹尼 Dan Jacobson @ 2016-11-01 1:39 ` Katsumi Yamaoka 2016-11-01 9:59 ` Katsumi Yamaoka 2016-11-01 11:22 ` 積丹尼 Dan Jacobson 2016-11-01 11:24 ` 積丹尼 Dan Jacobson 2 siblings, 1 reply; 16+ messages in thread From: Katsumi Yamaoka @ 2016-11-01 1:39 UTC (permalink / raw) To: 24831; +Cc: jidanni On Mon, 31 Oct 2016 10:45:58 +0800, Dan Jacobson wrote: > Gentelmen, the "shr" program is mangling messages. I found the cause of the problem that shr does not display the "Hi, you have a new email..." statement contained in the example message. That is, the message has a table in which the td element is omitted or lost. Here is a simplified html form (try `M-x shr-render-region RET' on it): --8<---------------cut here---------------start------------->8--- <html> <body> <table> <tr> <!--td--> <table> <tr> <td> Hi, you have a new email </td> </tr> </table> <!--/td--> </tr> </table> </body> </html> --8<---------------cut here---------------end--------------->8--- > True, the HTML might not be perfect, but at least Chromium, Firefox, > etc. show it fine. Yes, what is bad is the html message, but shr should show it. ^ permalink raw reply [flat|nested] 16+ messages in thread
* bug#24831: shr mangling messages 2016-11-01 1:39 ` Katsumi Yamaoka @ 2016-11-01 9:59 ` Katsumi Yamaoka 2016-11-01 10:06 ` Lars Ingebrigtsen 0 siblings, 1 reply; 16+ messages in thread From: Katsumi Yamaoka @ 2016-11-01 9:59 UTC (permalink / raw) To: 24831; +Cc: jidanni [-- Attachment #1: Type: text/plain, Size: 548 bytes --] On Tue, 01 Nov 2016 10:39:12 +0900, Katsumi Yamaoka wrote: > I found the cause of the problem that shr does not display the > "Hi, you have a new email..." > statement contained in the example message. That is, the message > has a table in which the td element is omitted or lost. I tried fixing it. A patch is below. But I feel it somewhat awkward, so I hope Lars or someone will review it. My patch simply adds the missing td tag as follows: (table nil (tr nil contents)) ↓ (table nil (tr nil (td nil contents))) Thanks. [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #2: Type: text/x-patch, Size: 1098 bytes --] --- shr.el~ 2016-11-01 02:35:57.788777000 +0000 +++ shr.el 2016-11-01 09:51:32.251984400 +0000 @@ -1759,6 +1759,7 @@ ;; we then render everything again with the new widths, and finally ;; insert all these boxes into the main buffer. (defun shr-tag-table-1 (dom) + (shr-add-missing-td dom) (setq dom (or (dom-child-by-tag dom 'tbody) dom)) (let* ((shr-inhibit-images t) (shr-table-depth (1+ shr-table-depth)) @@ -1787,6 +1788,19 @@ ;; Then render the table again with these new "hard" widths. (shr-insert-table (shr-make-table dom sketch-widths t) sketch-widths))) +(defun shr-add-missing-td (dom) + "Add missing td tag to table." + (let (tr td) + (dolist (elem (dom-children dom)) + (when (eq (car-safe elem) 'tr) + (setq tr elem + td nil + elem (cddr elem)) + (while (and (not td) elem) + (setq td (eq (car-safe (pop elem)) 'td))) + (unless td + (setcdr (cdr tr) (list (cons 'td (cons nil (cddr tr)))))))))) + (defun shr-table-body (dom) (let ((tbodies (seq-filter (lambda (child) (eq (dom-tag child) 'tbody)) ^ permalink raw reply [flat|nested] 16+ messages in thread
* bug#24831: shr mangling messages 2016-11-01 9:59 ` Katsumi Yamaoka @ 2016-11-01 10:06 ` Lars Ingebrigtsen 2016-11-01 10:12 ` Lars Ingebrigtsen 2016-11-01 18:43 ` Lars Ingebrigtsen 0 siblings, 2 replies; 16+ messages in thread From: Lars Ingebrigtsen @ 2016-11-01 10:06 UTC (permalink / raw) To: Katsumi Yamaoka; +Cc: 24831, jidanni Katsumi Yamaoka <yamaoka@jpl.org> writes: > I tried fixing it. A patch is below. But I feel it somewhat > awkward, so I hope Lars or someone will review it. My patch > simply adds the missing td tag as follows: > > (table nil (tr nil contents)) > ↓ > (table nil (tr nil (td nil contents))) I'm not sure I think it's worth trying to work around invalid HTML to this extent. In addition, other browsers do not correct "missing" TDs in this way: Instead they typically render non-TD/TH nodes before the table, which I think might be a better idea. -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 16+ messages in thread
* bug#24831: shr mangling messages 2016-11-01 10:06 ` Lars Ingebrigtsen @ 2016-11-01 10:12 ` Lars Ingebrigtsen 2016-11-01 18:43 ` Lars Ingebrigtsen 1 sibling, 0 replies; 16+ messages in thread From: Lars Ingebrigtsen @ 2016-11-01 10:12 UTC (permalink / raw) To: Katsumi Yamaoka; +Cc: 24831, jidanni Lars Ingebrigtsen <larsi@gnus.org> writes: > I'm not sure I think it's worth trying to work around invalid HTML to > this extent. Besides, there's often lots of empty space text nodes interspersed, aren't there? <table> <tr> <td> ... will have a node with "\n " before the TD node, I think? Those text nodes are supposed to be ignored. I'd prefer just to close this bug with a WONTFIX. -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 16+ messages in thread
* bug#24831: shr mangling messages 2016-11-01 10:06 ` Lars Ingebrigtsen 2016-11-01 10:12 ` Lars Ingebrigtsen @ 2016-11-01 18:43 ` Lars Ingebrigtsen 2016-11-02 9:49 ` Katsumi Yamaoka 1 sibling, 1 reply; 16+ messages in thread From: Lars Ingebrigtsen @ 2016-11-01 18:43 UTC (permalink / raw) To: Katsumi Yamaoka; +Cc: 24831, jidanni Lars Ingebrigtsen <larsi@gnus.org> writes: > In addition, other browsers do not correct "missing" TDs in this way: > Instead they typically render non-TD/TH nodes before the table, which I > think might be a better idea. And thinking about it a bit more, I think that would perhaps be the most likely solution for shr, too. That is, `shr-tag-table' could, at the end there, go through and find all non-blank non-td/th elements and insert them at the end. -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 16+ messages in thread
* bug#24831: shr mangling messages 2016-11-01 18:43 ` Lars Ingebrigtsen @ 2016-11-02 9:49 ` Katsumi Yamaoka 2016-11-04 7:19 ` Katsumi Yamaoka 0 siblings, 1 reply; 16+ messages in thread From: Katsumi Yamaoka @ 2016-11-02 9:49 UTC (permalink / raw) To: Lars Ingebrigtsen; +Cc: 24831, jidanni On Tue, 01 Nov 2016 19:43:23 +0100, Lars Ingebrigtsen wrote: > Lars Ingebrigtsen <larsi@gnus.org> writes: >> In addition, other browsers do not correct "missing" TDs in this way: >> Instead they typically render non-TD/TH nodes before the table, which I >> think might be a better idea. > And thinking about it a bit more, I think that would perhaps be the most > likely solution for shr, too. That is, `shr-tag-table' could, at the > end there, go through and find all non-blank non-td/th elements and > insert them at the end. Thanks. I'm trying it but not succeeded yet though, I think I understand what I should do. The function for it should gather only those extra elements, that are parts of a table tag of which the parent (of the parent ...) table has no TD/TH tag. It's a good brain teaser. :) ^ permalink raw reply [flat|nested] 16+ messages in thread
* bug#24831: shr mangling messages 2016-11-02 9:49 ` Katsumi Yamaoka @ 2016-11-04 7:19 ` Katsumi Yamaoka 2016-11-04 8:51 ` Lars Ingebrigtsen 0 siblings, 1 reply; 16+ messages in thread From: Katsumi Yamaoka @ 2016-11-04 7:19 UTC (permalink / raw) To: Lars Ingebrigtsen; +Cc: 24831, jidanni [-- Attachment #1: Type: text/plain, Size: 703 bytes --] On Wed, 02 Nov 2016 18:49:58 +0900, Katsumi Yamaoka wrote: > On Tue, 01 Nov 2016 19:43:23 +0100, Lars Ingebrigtsen wrote: >> And thinking about it a bit more, I think that would perhaps be the most >> likely solution for shr, too. That is, `shr-tag-table' could, at the >> end there, go through and find all non-blank non-td/th elements and >> insert them at the end. > Thanks. I'm trying it but not succeeded yet though,... I did it. A patch is below. Bad things in this version I know at least are: ・It does not support styles -- font, color, etc. ・No way to exclude text existing outside of <html>...</html>. Thers is no such problems in the first version I posted. ;-) [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #2: Type: text/x-patch, Size: 2060 bytes --] --- shr.el~ 2016-11-01 02:35:57.788777000 +0000 +++ shr.el 2016-11-04 07:17:19.789855000 +0000 @@ -1897,11 +1897,48 @@ (when (zerop shr-table-depth) (save-excursion (shr-expand-alignments start (point))) + ;; Insert also non-td/th strings excluding comments and styles. + (save-restriction + (narrow-to-region (point) (point)) + (insert (mapconcat #'identity + (shr-collect-extra-strings-in-table dom) + "\n")) + (shr-fill-lines (point-min) (point-max))) (dolist (elem (dom-by-tag dom 'object)) (shr-tag-object elem)) (dolist (elem (dom-by-tag dom 'img)) (shr-tag-img elem))))) +(defun shr-collect-extra-strings-in-table (dom &optional flags) + "Return extra strings in DOM of which the root is a table clause. +FLAGS is a cons of two flags that control whether to collect strings." + ;; If and only if the cdr is not set, the car will be set to t when + ;; a <td> or a <th> clause is found in the children of DOM, and reset + ;; to nil when a <table> clause is found in the children of DOM. + ;; The cdr will be set to t when a <table> clause is found if the car + ;; is not set then, and will never be reset. + ;; This function collects strings if the car of FLAGS is not set. + (unless flags (setq flags (cons nil nil))) + (cl-loop for child in (dom-children dom) + if (stringp child) + when (and (not (car flags)) + (string-match "\\(?:[^\t\n\r ]+[\t\n\r ]+\\)*[^\t\n\r ]+" + child)) + collect (match-string 0 child) + end + else + unless (let ((tag (dom-tag child))) + (or (memq tag '(comment style)) + (progn + (cond ((memq tag '(td th)) + (unless (cdr flags) (setcar flags t))) + ((eq tag 'table) + (if (car flags) + (unless (cdr flags) (setcar flags nil)) + (setcdr flags t)))) + nil))) + append (shr-collect-extra-strings-in-table child flags))) + (defun shr-insert-table (table widths) (let* ((collapse (equal (cdr (assq 'border-collapse shr-stylesheet)) "collapse")) ^ permalink raw reply [flat|nested] 16+ messages in thread
* bug#24831: shr mangling messages 2016-11-04 7:19 ` Katsumi Yamaoka @ 2016-11-04 8:51 ` Lars Ingebrigtsen 2016-11-04 10:28 ` Katsumi Yamaoka 0 siblings, 1 reply; 16+ messages in thread From: Lars Ingebrigtsen @ 2016-11-04 8:51 UTC (permalink / raw) To: Katsumi Yamaoka; +Cc: 24831, jidanni Katsumi Yamaoka <yamaoka@jpl.org> writes: > I did it. A patch is below. Great! Looks good to me. > Bad things in this version I know at least are: > > ・It does not support styles -- font, color, etc. I don't think that matters very much. The HTML is invalid. > ・No way to exclude text existing outside of <html>...</html>. Hm... I don't quite follow... -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 16+ messages in thread
* bug#24831: shr mangling messages 2016-11-04 8:51 ` Lars Ingebrigtsen @ 2016-11-04 10:28 ` Katsumi Yamaoka 2016-11-04 11:17 ` Lars Ingebrigtsen 0 siblings, 1 reply; 16+ messages in thread From: Katsumi Yamaoka @ 2016-11-04 10:28 UTC (permalink / raw) To: Lars Ingebrigtsen; +Cc: 24831, jidanni On Fri, 04 Nov 2016 09:51:52 +0100, Lars Ingebrigtsen wrote: > Katsumi Yamaoka <yamaoka@jpl.org> writes: >> I did it. A patch is below. > Great! Looks good to me. Thanks! I'll commit it to master. [...] >> ・No way to exclude text existing outside of <html>...</html>. > Hm... I don't quite follow... I found it in some mails from amazon.co.jp, but not so many and not so annoying. Here it is: <html> ... </html> --MuLtIpArT_BoUnDaRy-- Well, is this a reasonable operation? (with-temp-buffer (insert "<html><body>Foo</body></html>Bar") (libxml-parse-html-region (point-min) (point-max))) => (html nil (body nil "Foo") (html nil (p nil "Bar"))) ^ permalink raw reply [flat|nested] 16+ messages in thread
* bug#24831: shr mangling messages 2016-11-04 10:28 ` Katsumi Yamaoka @ 2016-11-04 11:17 ` Lars Ingebrigtsen 2016-11-06 23:32 ` Katsumi Yamaoka 0 siblings, 1 reply; 16+ messages in thread From: Lars Ingebrigtsen @ 2016-11-04 11:17 UTC (permalink / raw) To: Katsumi Yamaoka; +Cc: 24831, jidanni Katsumi Yamaoka <yamaoka@jpl.org> writes: > I found it in some mails from amazon.co.jp, but not so many and > not so annoying. Here it is: > > <html> > ... > </html> --MuLtIpArT_BoUnDaRy-- Oh, right... > Well, is this a reasonable operation? > > (with-temp-buffer > (insert "<html><body>Foo</body></html>Bar") > (libxml-parse-html-region (point-min) (point-max))) > => (html nil (body nil "Foo") (html nil (p nil "Bar"))) Yes, it's two <html> elements after each other. In HTML, the <html> start (and end) tags are optional. -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 16+ messages in thread
* bug#24831: shr mangling messages 2016-11-04 11:17 ` Lars Ingebrigtsen @ 2016-11-06 23:32 ` Katsumi Yamaoka 0 siblings, 0 replies; 16+ messages in thread From: Katsumi Yamaoka @ 2016-11-06 23:32 UTC (permalink / raw) To: Lars Ingebrigtsen; +Cc: 24831-done, jidanni On Fri, 04 Nov 2016 12:17:18 +0100, Lars Ingebrigtsen wrote: > Katsumi Yamaoka <yamaoka@jpl.org> writes: >> Well, is this a reasonable operation? >> >> (with-temp-buffer >> (insert "<html><body>Foo</body></html>Bar") >> (libxml-parse-html-region (point-min) (point-max))) >> => (html nil (body nil "Foo") (html nil (p nil "Bar"))) > Yes, it's two <html> elements after each other. In HTML, the <html> > start (and end) tags are optional. I see. But I'm sorry for my confusion; that extra text appearing is not due to my change. So, I'm closing this bug. Thanks. ^ permalink raw reply [flat|nested] 16+ messages in thread
* bug#24831: shr mangling messages 2016-10-31 2:45 ` bug#24831: shr mangling messages 積丹尼 Dan Jacobson 2016-11-01 1:39 ` Katsumi Yamaoka @ 2016-11-01 11:22 ` 積丹尼 Dan Jacobson 2016-11-01 17:16 ` Richard Stallman 2016-11-01 11:24 ` 積丹尼 Dan Jacobson 2 siblings, 1 reply; 16+ messages in thread From: 積丹尼 Dan Jacobson @ 2016-11-01 11:22 UTC (permalink / raw) To: Lars Ingebrigtsen; +Cc: Katsumi Yamaoka, 24831 Another idea would be first run it through a validator. If valid, proceed as before. If invalid, just spew out all the text nodes of the whole document, separated by spaces. Anything is better than vital sentences going missing. ^ permalink raw reply [flat|nested] 16+ messages in thread
* bug#24831: shr mangling messages 2016-11-01 11:22 ` 積丹尼 Dan Jacobson @ 2016-11-01 17:16 ` Richard Stallman 2016-11-04 18:18 ` Ted Zlatanov 0 siblings, 1 reply; 16+ messages in thread From: Richard Stallman @ 2016-11-01 17:16 UTC (permalink / raw) To: 積丹尼 Dan Jacobson; +Cc: larsi, yamaoka, 24831 [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] > Another idea would be first run it through a validator. > If valid, proceed as before. > If invalid, just spew out all the text nodes of the whole document, > separated by spaces. Do we have a validator in Emacs Lisp? Or would we run one as a child? What program is available? -- Dr Richard Stallman President, Free Software Foundation (gnu.org, fsf.org) Internet Hall-of-Famer (internethalloffame.org) Skype: No way! See stallman.org/skype.html. ^ permalink raw reply [flat|nested] 16+ messages in thread
* bug#24831: shr mangling messages 2016-11-01 17:16 ` Richard Stallman @ 2016-11-04 18:18 ` Ted Zlatanov 0 siblings, 0 replies; 16+ messages in thread From: Ted Zlatanov @ 2016-11-04 18:18 UTC (permalink / raw) To: Richard Stallman Cc: larsi, 積丹尼 Dan Jacobson, 24831, yamaoka On Tue, 01 Nov 2016 13:16:52 -0400 Richard Stallman <rms@gnu.org> wrote: >> Another idea would be first run it through a validator. >> If valid, proceed as before. >> If invalid, just spew out all the text nodes of the whole document, >> separated by spaces. RS> Do we have a validator in Emacs Lisp? Or would we run one as a child? RS> What program is available? IMHO validation is not a workable solution, both because of complexity and because real-world HTML authors are incredibly skilled at writing broken HTML that somehow renders in the browsers they support. Ted ^ permalink raw reply [flat|nested] 16+ messages in thread
* bug#24831: shr mangling messages 2016-10-31 2:45 ` bug#24831: shr mangling messages 積丹尼 Dan Jacobson 2016-11-01 1:39 ` Katsumi Yamaoka 2016-11-01 11:22 ` 積丹尼 Dan Jacobson @ 2016-11-01 11:24 ` 積丹尼 Dan Jacobson 2 siblings, 0 replies; 16+ messages in thread From: 積丹尼 Dan Jacobson @ 2016-11-01 11:24 UTC (permalink / raw) To: Lars Ingebrigtsen; +Cc: Katsumi Yamaoka, 24831 At least print a warning, *** Invalid HTML detected, some text might be missing *** in red, which stays at the top of the message. (Not in the fleeting minibuffer.) ^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2016-11-06 23:32 UTC | newest] Thread overview: 16+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- [not found] <87shrgvt8y.fsf@jidanni.org> 2016-10-31 2:45 ` bug#24831: shr mangling messages 積丹尼 Dan Jacobson 2016-11-01 1:39 ` Katsumi Yamaoka 2016-11-01 9:59 ` Katsumi Yamaoka 2016-11-01 10:06 ` Lars Ingebrigtsen 2016-11-01 10:12 ` Lars Ingebrigtsen 2016-11-01 18:43 ` Lars Ingebrigtsen 2016-11-02 9:49 ` Katsumi Yamaoka 2016-11-04 7:19 ` Katsumi Yamaoka 2016-11-04 8:51 ` Lars Ingebrigtsen 2016-11-04 10:28 ` Katsumi Yamaoka 2016-11-04 11:17 ` Lars Ingebrigtsen 2016-11-06 23:32 ` Katsumi Yamaoka 2016-11-01 11:22 ` 積丹尼 Dan Jacobson 2016-11-01 17:16 ` Richard Stallman 2016-11-04 18:18 ` Ted Zlatanov 2016-11-01 11:24 ` 積丹尼 Dan Jacobson
Code repositories for project(s) associated with this public inbox https://git.savannah.gnu.org/cgit/emacs.git This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).