unofficial mirror of bug-gnu-emacs@gnu.org 
* bug#31665: libxml-parse-html-region' doesn't extract text in tables
From: 積丹尼 Dan Jacobson @ 2018-05-31  9:55 UTC (permalink / raw)
  To: 31665; +Cc: Katsumi Yamaoka

Dear bug-gnu-emacs, `libxml-parse-html-region' doesn't extract text in <table>s.

KY> I found that Emacs' built-in function `libxml-parse-html-region'
KY> doesn't extract text existing in the table clause.






* bug#31665: libxml-parse-html-region' doesn't extract text in tables
From: Lars Ingebrigtsen @ 2018-05-31 10:58 UTC (permalink / raw)
  To: 積丹尼 Dan Jacobson; +Cc: Katsumi Yamaoka, 31665

積丹尼 Dan Jacobson <jidanni@jidanni.org> writes:

> Dear bug-gnu-emacs, `libxml-parse-html-region' doesn't extract text in
> <table>s.

Do you have an example table that `libxml-parse-html-region' doesn't
"extract" text from?

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no






* bug#31665: libxml-parse-html-region' doesn't extract text in tables
From: 積丹尼 Dan Jacobson @ 2018-06-06 20:50 UTC (permalink / raw)
  To: Lars Ingebrigtsen; +Cc: Katsumi Yamaoka, 31665

[-- Attachment #1: Type: text/plain, Size: 223 bytes --]

>>>>> "LI" == Lars Ingebrigtsen <larsi@gnus.org> writes:

LI> Do you have an example table that `libxml-parse-html-region' doesn't
LI> "extract" text from?

OK, here is a mail from which I have removed my personal phone bill:

[-- Attachment #2: gg.gz --]
[-- Type: application/gzip, Size: 2411 bytes --]


* bug#31665: libxml-parse-html-region' doesn't extract text in tables
From: Lars Ingebrigtsen @ 2019-09-29  8:34 UTC (permalink / raw)
  To: 積丹尼 Dan Jacobson; +Cc: Katsumi Yamaoka, 31665

積丹尼 Dan Jacobson <jidanni@jidanni.org> writes:

>>>>>> "LI" == Lars Ingebrigtsen <larsi@gnus.org> writes:
>
> LI> Do you have an example table that `libxml-parse-html-region' doesn't
> LI> "extract" text from?
>
> OK, here is a mail from which I have removed my personal phone bill:

What do you think is missing from that table?  I don't read Chinese,
but there didn't seem to be any text in that table, just a bunch of
images.


* bug#31665: libxml-parse-html-region' doesn't extract text in tables
From: 積丹尼 Dan Jacobson @ 2019-09-29 16:52 UTC (permalink / raw)
  To: Lars Ingebrigtsen; +Cc: Katsumi Yamaoka, 31665

>>>>> "LI" == Lars Ingebrigtsen <larsi@gnus.org> writes:
LI> 積丹尼 Dan Jacobson <jidanni@jidanni.org> writes:

>>>>>>> "LI" == Lars Ingebrigtsen <larsi@gnus.org> writes:
>> 
LI> Do you have an example table that `libxml-parse-html-region' doesn't
LI> "extract" text from?
>> 
>> OK, here is a mail from which I have removed my personal phone bill:

LI> What do you think is missing from that table?  I don't read Chinese,
LI> but there didn't seem to be any text in that table, just a bunch of
LI> images.

It should look like:

+----------------------------------------------------------------------------------------------------------------------------------------------------+
|+---------------------------------------------------------------------------------------------------------------------+                             |
||+------------------------------------------------------------------------------------------------------------------+ |                             |
|||[banner2]                                                                                                         | |                             |
|||------------------------------------------------------------------------------------------------------------------| |                             |
|||+---------------------------------------------------------------------------------------------------------------+ | |                             |
||||                                    |親愛的客戶,您好:                   |                                    | | |                             |
||||                                    |-------------------------------------|                                    | | |                             |
||||                                    |為保障您資料的安全,請輸入密碼開啟附 |                                    | | |                             |
||||                                    |加檔案瀏覽您本期的帳單,密碼為『身分 |                                    | | |                             |
||||               [IS1]                |證號碼』(英文字母須大寫),營業人客戶 |               [IS2]                | | |                             |
||||                                    |不需輸入密碼即可瀏覽。               |                                    | | |                             |
||||                                    |若無法開啟附加檔案,請先確認是否已下 |                                    | | |                             |
||||                                    |載Acrobat Reader軟體。               |                                    | | |                             |
||||                                    |-------------------------------------|                                    | | |                             |
|||+---------------------------------------------------------------------------------------------------------------+ | |                             |
||+------------------------------------------------------------------------------------------------------------------+ |                             |
||++                                                                                                                   |                             |
||||                                                                                                                   |                             |
||++                                                                                                                   |                             |
||+-------------------------------------------------------------------------------------------------------------------+|                             |
|||[new1]                                                                                                             ||                             |
|||+-----------------------------------------------------------------------------------------------------------------+||                             |
||||                                                        |                                                [enf201]|||                             |
||||                                                        |--------------------------------------------------------|||                             |
||||[end101]                                                |                                                [enl301]|||                             |
||||                                                        |--------------------------------------------------------|||                             |
||||                                                        |                                                [enl401]|||                             |
|||+-----------------------------------------------------------------------------------------------------------------+||                             |
||+-------------------------------------------------------------------------------------------------------------------+|                             |
||++                                                                                                                   |                             |
||||                                                                                                                   |                             |
||++                                                                                                                   |                             |
||+------------------------------------------------------------------------------------------------------------------+ |                             |
|||[hot1]                                                                                                            | |                             |
|||------------------------------------------------------------------------------------------------------------------| |                             |
|||+----------------------------------+                                                                              | |                             |
||||[hot1]|[hot2]|[hot3]|[hot4]|[hot5]|                                                                              | |                             |
|||+----------------------------------+                                                                              | |                             |
||+------------------------------------------------------------------------------------------------------------------+ |                             |
||++                                                                                                                   |                             |
||||                                                                                                                   |                             |
||++                                                                                                                   |                             |
||+------------------------------------------------------------------------------------------------------------------+ |                             |
|||[link1]                                                                                                           | |                             |
|||+-----------------------------------------------------------------+                                               | |                             |
||||||            |                |                |                |                                               | |                             |
||||++------------+----------------+----------------+----------------|                                               | |                             |
||||||電子帳單Q&A |    費率說明    |  客戶消費資訊  |    線上繳費    |                                               | |                             |
||||++------------+----------------+----------------+----------------|                                               | |                             |
||||||  服務專線  |    貼心提醒    |不可不知行動優惠| HiNet好康優惠  |                                               | |                             |
|||+-----------------------------------------------------------------+                                               | |                             |
||+------------------------------------------------------------------------------------------------------------------+ |                             |
||++                                                                                                                   |                             |
||||                                                                                                                   |                             |
||++                                                                                                                   |                             |
||+------------------------------------------------------------------------------------------------------------------+ |                             |
|||                                                      [cht]                                                       | |                             |
||+------------------------------------------------------------------------------------------------------------------+ |                             |
|+---------------------------------------------------------------------------------------------------------------------+                             |
+----------------------------------------------------------------------------------------------------------------------------------------------------+

But instead all we get is:

From: Phone Co. <p@cht.com.tw>
Subject: Phone Bill
To: "jidanni@jidanni.org" <jidanni@jidanni.org>
Date: Thu, 17 May 2018 12:12:06 +0800
Reply-To: x@cht.com.tw

[1. text/html]
中華電信電子帳單

*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*







* bug#31665: libxml-parse-html-region' doesn't extract text in tables
From: Lars Ingebrigtsen @ 2019-09-30  5:05 UTC (permalink / raw)
  To: 積丹尼 Dan Jacobson; +Cc: Katsumi Yamaoka, 31665

The HTML in that email is invalid.  It's basically of the form

<table>
  <tbody>
    foo
  </tbody>
</table>

"foo" won't be rendered by shr.

shr does try to deal with invalid tables, though.  If the <tbody>
elements hadn't been there, then the "foo" would have been rendered, so I
guess some more work is required in that area.
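
To make the invalid structure concrete, here is a minimal sketch that
flags text sitting directly inside a table container such as <tbody>.
It uses Python's standard html.parser rather than libxml or shr, so it
only illustrates the markup problem and says nothing about how Emacs
handles it internally.  The names StrayTableText and
find_stray_table_text are made up for this example.

```python
from html.parser import HTMLParser

# Elements that should hold only rows/cells, never bare text.
TABLE_CONTAINERS = {"table", "thead", "tbody", "tfoot", "tr"}
# Void elements have no end tag, so they are never pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class StrayTableText(HTMLParser):
    """Collect text whose direct parent is a table container."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open elements, innermost last
        self.stray = []  # (parent tag, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack and self.stack[-1] in TABLE_CONTAINERS:
            self.stray.append((self.stack[-1], text))


def find_stray_table_text(html):
    """Return a list of (parent tag, text) for misplaced table text."""
    parser = StrayTableText()
    parser.feed(html)
    parser.close()
    return parser.stray


print(find_stray_table_text("<table><tbody>foo</tbody></table>"))
```

Run on the reduced example above, this reports "foo" with <tbody> as
its parent; the same text wrapped in <tr><td>...</td></tr> is not
reported.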


* bug#31665: libxml-parse-html-region' doesn't extract text in tables
From: Lars Ingebrigtsen @ 2019-09-30  5:28 UTC (permalink / raw)
  To: 積丹尼 Dan Jacobson; +Cc: Katsumi Yamaoka, 31665

Lars Ingebrigtsen <larsi@gnus.org> writes:

> shr does try to deal with invalid tables, though.  If the <tbody>
> elements hadn't been there, then the "foo" would have been rendered, so I
> guess some more work is required in that area.

I've now fixed this on the trunk.


* bug#31665: libxml-parse-html-region' doesn't extract text in tables
From: Katsumi Yamaoka @ 2019-10-01  2:43 UTC (permalink / raw)
  To: Lars Ingebrigtsen; +Cc: 31665, 積丹尼 Dan Jacobson

On Mon, 30 Sep 2019 07:28:19 +0200, Lars Ingebrigtsen wrote:
> I've now fixed this on the trunk.

Verified.  Thank you for improving it!





