* bug#31665: libxml-parse-html-region' doesn't extract text in tables
[not found] ` <b4m1sf2izty.fsf@jpl.org>
@ 2018-05-31 9:55 ` 積丹尼 Dan Jacobson
2018-05-31 10:58 ` Lars Ingebrigtsen
0 siblings, 1 reply; 8+ messages in thread
From: 積丹尼 Dan Jacobson @ 2018-05-31 9:55 UTC (permalink / raw)
To: 31665; +Cc: Katsumi Yamaoka
Dear bug-gnu-emacs, libxml-parse-html-region' doesn't extract text in <table>s,
KY> I found that Emacs' built-in function `libxml-parse-html-region'
KY> doesn't extract text existing in the table clause.
* bug#31665: libxml-parse-html-region' doesn't extract text in tables
From: Lars Ingebrigtsen @ 2018-05-31 10:58 UTC (permalink / raw)
To: 積丹尼 Dan Jacobson; +Cc: Katsumi Yamaoka, 31665
積丹尼 Dan Jacobson <jidanni@jidanni.org> writes:
> Dear bug-gnu-emacs, libxml-parse-html-region' doesn't extract text in
> <table>s,
Do you have an example table that `libxml-parse-html-region' doesn't
"extract" text from?
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
* bug#31665: libxml-parse-html-region' doesn't extract text in tables
From: 積丹尼 Dan Jacobson @ 2018-06-06 20:50 UTC (permalink / raw)
To: Lars Ingebrigtsen; +Cc: Katsumi Yamaoka, 31665
[-- Attachment #1: Type: text/plain, Size: 223 bytes --]
>>>>> "LI" == Lars Ingebrigtsen <larsi@gnus.org> writes:
LI> Do you have an example table that `libxml-parse-html-region' doesn't
LI> "extract" text from?
OK here is a mail that I cleaned off my personal phone bill from:
[-- Attachment #2: gg.gz --]
[-- Type: application/gzip, Size: 2411 bytes --]
* bug#31665: libxml-parse-html-region' doesn't extract text in tables
From: Lars Ingebrigtsen @ 2019-09-29 8:34 UTC (permalink / raw)
To: 積丹尼 Dan Jacobson; +Cc: Katsumi Yamaoka, 31665
積丹尼 Dan Jacobson <jidanni@jidanni.org> writes:
>>>>>> "LI" == Lars Ingebrigtsen <larsi@gnus.org> writes:
>
> LI> Do you have an example table that `libxml-parse-html-region' doesn't
> LI> "extract" text from?
>
> OK here is a mail that I cleaned off my personal phone bill from:
What was it you think is missing from that table? I don't read Chinese,
but there didn't seem to be any text in that table, just a bunch of
images.
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
* bug#31665: libxml-parse-html-region' doesn't extract text in tables
From: 積丹尼 Dan Jacobson @ 2019-09-29 16:52 UTC (permalink / raw)
To: Lars Ingebrigtsen; +Cc: Katsumi Yamaoka, 31665
>>>>> "LI" == Lars Ingebrigtsen <larsi@gnus.org> writes:
LI> 積丹尼 Dan Jacobson <jidanni@jidanni.org> writes:
>>>>>>> "LI" == Lars Ingebrigtsen <larsi@gnus.org> writes:
>>
LI> Do you have an example table that `libxml-parse-html-region' doesn't
LI> "extract" text from?
>>
>> OK here is a mail that I cleaned off my personal phone bill from:
LI> What was it you think is missing from that table? I don't read Chinese,
LI> but there didn't seem to be any text in that table, just a bunch of
LI> images.
It should look like:
[banner2]
------------------------------------------------------------
[IS1] | Dear customer,                                  | [IS2]
      | ----------------------------------------------- |
      | To protect your data, please enter the password |
      | to open the attachment and view this period's   |
      | bill. The password is your "ID card number"     |
      | (letters must be uppercase); business customers |
      | can view it without entering a password.        |
      | If you cannot open the attachment, first check  |
      | that you have downloaded Acrobat Reader.        |
------------------------------------------------------------
[new1]
[end101] | [enf201]
         | [enl301]
         | [enl401]
------------------------------------------------------------
[hot1]
[hot1] | [hot2] | [hot3] | [hot4] | [hot5]
------------------------------------------------------------
[link1]
E-bill Q&A      | Rate information   | Customer usage information | Online payment
Service hotline | Friendly reminders | Must-know mobile offers    | HiNet special offers
------------------------------------------------------------
[cht]
But instead all we get is:
From: Phone Co. <p@cht.com.tw>
Subject: Phone Bill
To: "jidanni@jidanni.org" <jidanni@jidanni.org>
Date: Thu, 17 May 2018 12:12:06 +0800
Reply-To: x@cht.com.tw
[1. text/html]
Chunghwa Telecom e-bill
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
* bug#31665: libxml-parse-html-region' doesn't extract text in tables
From: Lars Ingebrigtsen @ 2019-09-30 5:05 UTC (permalink / raw)
To: 積丹尼 Dan Jacobson; +Cc: Katsumi Yamaoka, 31665
The HTML in that email is invalid. It's basically of the form
  <table>
    <tbody>
      foo
    </tbody>
  </table>
"foo" won't be rendered by shr.
shr does try to deal with invalid tables, though. If the <tbody>
elements hadn't been there, then the "foo" would have been, so I guess
some more work is required in that area.
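
To check whether a given mail has this problem, the following is a minimal sketch (not part of shr or Emacs) that lists text sitting directly inside table-structure elements. It uses Python's standard `html.parser`; the names `StrayTableText` and `find_stray_table_text` are made up for illustration, and the set of "structure" elements is an assumption based on the example above.

```python
from html.parser import HTMLParser

# Elements whose direct children should be other table elements, not text
# (assumed set, chosen to match the <table>/<tbody> example above).
TABLE_STRUCTURE = {"table", "thead", "tbody", "tfoot", "tr"}
# Void elements have no closing tag, so they are never pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class StrayTableText(HTMLParser):
    """Collect text that appears directly inside table structure elements."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open elements, innermost last
        self.stray = []  # (parent tag, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore unmatched closers.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack and self.stack[-1] in TABLE_STRUCTURE:
            self.stray.append((self.stack[-1], text))


def find_stray_table_text(html):
    """Return (parent tag, text) pairs for text outside any table cell."""
    parser = StrayTableText()
    parser.feed(html)
    parser.close()
    return parser.stray


if __name__ == "__main__":
    print(find_stray_table_text("<table><tbody>foo</tbody></table>"))
```

An empty result means all table text is inside cells such as `<td>`; any pairs returned are text that a renderer expecting valid table markup may skip.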
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
* bug#31665: libxml-parse-html-region' doesn't extract text in tables
From: Lars Ingebrigtsen @ 2019-09-30 5:28 UTC (permalink / raw)
To: 積丹尼 Dan Jacobson; +Cc: Katsumi Yamaoka, 31665
Lars Ingebrigtsen <larsi@gnus.org> writes:
> shr does try to deal with invalid tables, though. If the <tbody>
> elements hadn't been there, then the "foo" would have been, so I guess
> some more work is required in that area.
I've now fixed this on the trunk.
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
* bug#31665: libxml-parse-html-region' doesn't extract text in tables
From: Katsumi Yamaoka @ 2019-10-01 2:43 UTC (permalink / raw)
To: Lars Ingebrigtsen; +Cc: 31665, 積丹尼 Dan Jacobson
On Mon, 30 Sep 2019 07:28:19 +0200, Lars Ingebrigtsen wrote:
> I've now fixed this on the trunk.
Verified. Thank you for improving it!
Thread overview: 8+ messages
[not found] <8736zjmtsa.fsf@jidanni.org>
[not found] ` <b4m1sf2izty.fsf@jpl.org>
2018-05-31 9:55 ` bug#31665: libxml-parse-html-region' doesn't extract text in tables 積丹尼 Dan Jacobson
2018-05-31 10:58 ` Lars Ingebrigtsen
2018-06-06 20:50 ` 積丹尼 Dan Jacobson
2019-09-29 8:34 ` Lars Ingebrigtsen
2019-09-29 16:52 ` 積丹尼 Dan Jacobson
2019-09-30 5:05 ` Lars Ingebrigtsen
2019-09-30 5:28 ` Lars Ingebrigtsen
2019-10-01 2:43 ` Katsumi Yamaoka