unofficial mirror of notmuch@notmuchmail.org
* [PATCH] test: add known broken test for indexing html
@ 2017-03-18 13:25 David Bremner
  2017-03-18 13:37 ` Jeffrey Stedfast
  0 siblings, 1 reply; 7+ messages in thread
From: David Bremner @ 2017-03-18 13:25 UTC (permalink / raw)
  To: notmuch

'quite' on IRC reported that notmuch new was grinding to a halt during
initial indexing, and we eventually narrowed the problem down to some
html parts with large embedded images. These cause the number of terms
added to the Xapian database to explode (the first 400 messages
generated 4.6M unique terms), and of course the resulting terms are
not much use for searching.
---

I'm not sure of the best approach to fix this. Workarounds include
limiting the size of the part indexed, and skipping html parts. The
latter is easy, but probably too drastic.  A nice solution might be a
filter similar to the existing one that strips out uuencoded text but
for base64. Alas base64 crud seems to come with all kinds of syntactic
wrappers, so it's probably harder to filter.
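
A minimal sketch of the "filter for base64" idea, written in Python for
brevity. This is not notmuch code: the function name and the 40-character
threshold are assumptions chosen for illustration. It ignores the syntactic
wrappers entirely and drops any long run of base64-alphabet characters.

```python
import re

# Hypothetical threshold: a run at least this long, made up only of
# base64-alphabet characters, is treated as encoded data rather than words.
MIN_RUN = 40
BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{%d,}={0,2}" % MIN_RUN)


def strip_base64_runs(text):
    """Replace long base64-looking runs with a single space."""
    return BASE64_RUN.sub(" ", text)
```

The trade-off is false positives: any legitimate token of 40 or more
alphanumeric characters (a long hash, say) is dropped as well, and wrapped
base64 lines shorter than the threshold slip through.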


 test/T680-html-indexing.sh       | 12 +++++++
 test/corpora/README              |  3 ++
 test/corpora/html/embedded-image | 69 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 84 insertions(+)
 create mode 100755 test/T680-html-indexing.sh
 create mode 100644 test/corpora/html/embedded-image

diff --git a/test/T680-html-indexing.sh b/test/T680-html-indexing.sh
new file mode 100755
index 00000000..78768c4f
--- /dev/null
+++ b/test/T680-html-indexing.sh
@@ -0,0 +1,12 @@
+#!/usr/bin/env bash
+test_description="indexing of html parts"
+. ./test-lib.sh || exit 1
+
+add_email_corpus html
+
+test_begin_subtest 'embedded images should not be indexed'
+test_subtest_known_broken
+notmuch search kwpza7svrgjzqwi8fhb2msggwtxtwgqcxp4wbqr4wjddstqmeqa7 > OUTPUT
+test_expect_equal_file /dev/null OUTPUT
+
+test_done
diff --git a/test/corpora/README b/test/corpora/README
index 77c48e6e..c9a35fed 100644
--- a/test/corpora/README
+++ b/test/corpora/README
@@ -9,3 +9,6 @@ default
 broken
   The broken corpus contains messages that are broken and/or RFC
   non-compliant, ensuring we deal with them in a sane way.
+
+html
+  The html corpus contains html parts
diff --git a/test/corpora/html/embedded-image b/test/corpora/html/embedded-image
new file mode 100644
index 00000000..40851530
--- /dev/null
+++ b/test/corpora/html/embedded-image
@@ -0,0 +1,69 @@
+From: =?utf-8?b?bWFsbW9ib3Jn?= <daemon@lublin.se>
+To: =?utf-8?b?Ym9lbmRlLm1hbG1vYm9yZw==?= <daemon@lublin.se>
+Date: Tue, 19 Jul 2016 11:54:24 +0200
+X-Feed2Imap-Version: 1.2.5
+Message-Id: <boendemalmoborg-1834@eltanin.uberspace.de>
+Subject: =?utf-8?b?VGFjayBhbGxhIHRyYWZpa2FudGVyIG9jaCBmb3Rnw6RuZ2FyZSE=?=
+Content-Type: multipart/alternative; boundary="=-1468922508-176605-12427-9500-21-="
+MIME-Version: 1.0
+
+
+--=-1468922508-176605-12427-9500-21-=
+Content-Type: text/plain; charset=utf-8; format=flowed
+Content-Transfer-Encoding: 8bit
+
+<http://malmoborg.se/2016/07/tack-alla-trafikanter-och-fotgangare/>
+
+Malmö 2016-07-09
+
+I skrivande stund är vi i färd med att avetablera vår entreprenad på 
+Tigern 3, Regementsgatan 6 i Malmö. Fastigheten har genomgått ett större 
+dräneringsarbete som i sin tur har inneburit vissa 
+trafikbegränsningar på Regementsgatan samt Davidshallsgatan under några 
+veckors tid. Fastighetsägaren är mycket nöjd med vår arbetsinsats och vi 
+kan glatt meddela att båda vägfilerna kommer att öppnas inom kort. Nu 
+kommer den vackra fastigheten att klara sig torrskodd under många år 
+framöver [A]
+
+ 
+
+[A] http://malmoborg.se/wp-includes/images/smilies/icon_smile.gif
+-- 
+Feed: Förvaltnings AB Malmöborg
+<http://malmoborg.se>
+Item: Tack alla trafikanter och fotgängare!
+<http://malmoborg.se/2016/07/tack-alla-trafikanter-och-fotgangare/>
+Date: 2016-07-19 11:54:24 +0200
+Author: malmoborg
+Filed under: Nyheter
+
+--=-1468922508-176605-12427-9500-21-=
+Content-Type: text/html; charset=utf-8
+Content-Transfer-Encoding: 8bit
+
+<table border="1" width="100%" cellpadding="0" cellspacing="0" borderspacing="0"><tr><td>
+<table width="100%" bgcolor="#EDEDED" cellpadding="4" cellspacing="2">
+<tr><td align="right"><b>Feed:</b></td>
+<td width="100%"><a href="http://malmoborg.se">
+<b>Förvaltnings AB Malmöborg</b>
+</a>
+</td></tr><tr><td align="right"><b>Item:</b></td>
+<td width="100%"><a href="http://malmoborg.se/2016/07/tack-alla-trafikanter-och-fotgangare/"><b>Tack alla trafikanter och fotgängare!</b>
+</a>
+</td></tr></table></td></tr></table>
+
+<p>Malmö 2016-07-09</p>
+<p>I skrivande stund är vi i färd med att avetablera vår entreprenad på Tigern 3, Regementsgatan 6 i Malmö. Fastigheten har genomgått ett större dräneringsarbete som i sin tur har inneburit vissa trafikbegränsningar på Regementsgatan samt Davidshallsgatan under några veckors tid. Fastighetsägaren är mycket nöjd med vår arbetsinsats och vi kan glatt meddela att båda vägfilerna kommer att öppnas inom kort. Nu kommer den vackra fastigheten att klara sig torrskodd under många år framöver <img src="
+xzMzM///6//lAAAAAAAAACH5BAEAAA4ALAAAAAAPAA8AAARb0EkZap3YVabO
+GRcWcAgCnIMRTEEnCCfwpqt2mHEOagoOnz+CKnADxoKFyiHHBBCSAdOiCVg8
+KwPZa7sVrgJZQWI8FhB2msGgwTXTWGqCXP4WBQr4wjDDstQmEQA7
+" alt=":-)" class="wp-smiley" /> </p>
+<p>&nbsp;</p>
+<hr width="100%"/>
+<table width="100%" cellpadding="0" cellspacing="0">
+<tr><td align="right"><font color="#ababab">Date:</font>&nbsp;&nbsp;</td><td><font color="#ababab">2016-07-19 11:54:24 +0200</font></td></tr>
+<tr><td align="right"><font color="#ababab">Author:</font>&nbsp;&nbsp;</td><td><font color="#ababab">malmoborg</font></td></tr>
+<tr><td align="right"><font color="#ababab">Filed under:</font>&nbsp;&nbsp;</td><td><font color="#ababab">Nyheter</font></td></tr>
+</table>
+
+--=-1468922508-176605-12427-9500-21-=--
-- 
2.11.0


* RE: [PATCH] test: add known broken test for indexing html
  2017-03-18 13:25 [PATCH] test: add known broken test for indexing html David Bremner
@ 2017-03-18 13:37 ` Jeffrey Stedfast
  2017-03-18 15:04   ` David Bremner
  2017-03-18 15:08   ` David Bremner
  0 siblings, 2 replies; 7+ messages in thread
From: Jeffrey Stedfast @ 2017-03-18 13:37 UTC (permalink / raw)
  To: David Bremner, notmuch@notmuchmail.org

Hi David,

Base64 encoded inline image data is always within the src attribute value of an <img> tag and will always begin with "data:" followed by the mime-type and then followed by ";base64," so it's pretty easy to spot.
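
As a rough illustration of spotting that prefix without a full HTML parser,
here is a Python sketch. The pattern and function name are assumptions, not
anything from notmuch or GMime. Note that data URIs can also turn up outside
<img src>, for example in CSS url() values, so the pattern below does not
look at the surrounding tag at all.

```python
import re

# Matches "data:<mime-type>;base64,<payload>". The payload may be wrapped
# across several lines, so whitespace is allowed inside it.
DATA_URI = re.compile(
    r"data:[\w.+-]+/[\w.+-]+;base64,[A-Za-z0-9+/=\s]*",
    re.IGNORECASE,
)


def strip_data_uris(html):
    """Remove base64 data URIs, leaving the rest of the markup untouched."""
    return DATA_URI.sub("", html)
```

Inside a quoted attribute the match stops at the closing quote. In unquoted
contexts the payload pattern can swallow trailing words, so this is only a
heuristic.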

While on this topic, why index HTML attribute values at all, other than perhaps some known ones like the 'alt' value of <img> tags?

I would argue that the only portion of any HTML that you should be indexing at all for searching is the character data between tags.
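
To make that concrete, a small sketch using Python's standard html.parser
module. The class and function names are made up for this example, and it
says nothing about how notmuch or GMime would actually do it. It keeps the
character data between tags plus the alt text of <img> tags, and discards
every other attribute value.

```python
from html.parser import HTMLParser


class TextOnlyExtractor(HTMLParser):
    """Collect character data between tags, plus <img alt="..."> values."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                self.chunks.append(alt)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is not script or style content.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def indexable_text(html):
    """Return the text a search index would see for an HTML part."""
    parser = TextOnlyExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

With this approach the src attribute never reaches the index, so the base64
payload is dropped without needing any base64-specific detection.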

Hope my $0.02 helps,

Jeff



* RE: [PATCH] test: add known broken test for indexing html
  2017-03-18 13:37 ` Jeffrey Stedfast
@ 2017-03-18 15:04   ` David Bremner
  2017-03-18 16:21     ` Jeffrey Stedfast
  2017-03-18 15:08   ` David Bremner
  1 sibling, 1 reply; 7+ messages in thread
From: David Bremner @ 2017-03-18 15:04 UTC (permalink / raw)
  To: Jeffrey Stedfast, notmuch@notmuchmail.org

Jeffrey Stedfast <jestedfa@microsoft.com> writes:

> Hi David,
>
> Base64 encoded inline image data is always within the src attribute value of an <img> tag and will always begin with "data:" followed by the mime-type and then followed by ";base64," so it's pretty easy to spot.
>
> While on this topic, why index HTML attribute values at all? Other than perhaps some known ones like perhaps the 'alt' value of <img> tags?
>
> I would argue that the only portion of any HTML that you should be indexing at all for searching is the character data between tags.

We're not currently parsing the HTML, so none of these distinctions are
really available to us. Maybe adding an HTML parser is the right
solution, but it's a bit non-trivial.

d


* RE: [PATCH] test: add known broken test for indexing html
  2017-03-18 13:37 ` Jeffrey Stedfast
  2017-03-18 15:04   ` David Bremner
@ 2017-03-18 15:08   ` David Bremner
  1 sibling, 0 replies; 7+ messages in thread
From: David Bremner @ 2017-03-18 15:08 UTC (permalink / raw)
  To: Jeffrey Stedfast, notmuch@notmuchmail.org

Jeffrey Stedfast <jestedfa@microsoft.com> writes:

> Base64 encoded inline image data is always within the src attribute
> value of an <img> tag and will always begin with "data:" followed by
> the mime-type and then followed by ";base64," so it's pretty easy to
> spot.
>
> While on this topic, why index HTML attribute values at all? Other
> than perhaps some known ones like perhaps the 'alt' value of <img>
> tags?
>
> I would argue that the only portion of any HTML that you should be
> indexing at all for searching is the character data between tags.
>

I should mention that we also have a fair amount of base64 gunk from
inline PGP signatures. I'm not sure if it's just ugly to look at when
dumping the database terms, or if it actually makes a measurable
difference in time/space usage.
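
For illustration only, the signature armor could be dropped before indexing
with something like the Python sketch below. The function name is
hypothetical and this is not existing notmuch behaviour. It removes the
armored signature block and leaves the signed body text alone.

```python
import re

# An inline signature is delimited by these armor header and footer lines.
PGP_SIGNATURE = re.compile(
    r"-----BEGIN PGP SIGNATURE-----.*?-----END PGP SIGNATURE-----",
    re.DOTALL,
)


def strip_pgp_signatures(text):
    """Remove ASCII-armored PGP signature blocks from message text."""
    return PGP_SIGNATURE.sub("", text)
```

Whether this is worth doing depends on the measurement mentioned above: it
only matters if the signature terms make a noticeable difference.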

d


* RE: [PATCH] test: add known broken test for indexing html
  2017-03-18 15:04   ` David Bremner
@ 2017-03-18 16:21     ` Jeffrey Stedfast
  2017-03-18 18:14       ` David Bremner
  0 siblings, 1 reply; 7+ messages in thread
From: Jeffrey Stedfast @ 2017-03-18 16:21 UTC (permalink / raw)
  To: David Bremner, notmuch@notmuchmail.org

Hey David,

I actually have an HTML tokenizer for MimeKit for (among other things) this type of purpose. Perhaps I need to port that to C and include that with GMime 😊

https://github.com/jstedfast/MimeKit/tree/master/MimeKit/Text

Jeff



* RE: [PATCH] test: add known broken test for indexing html
  2017-03-18 16:21     ` Jeffrey Stedfast
@ 2017-03-18 18:14       ` David Bremner
  2017-03-19 17:24         ` Jeffrey Stedfast
  0 siblings, 1 reply; 7+ messages in thread
From: David Bremner @ 2017-03-18 18:14 UTC (permalink / raw)
  To: Jeffrey Stedfast, notmuch@notmuchmail.org

Jeffrey Stedfast <jestedfa@microsoft.com> writes:

> Hey David,
>
> I actually have an HTML tokenizer for MimeKit for (among other things) this type of purpose. Perhaps I need to port that to C and include that with GMime 😊
>
> https://github.com/jstedfast/MimeKit/tree/master/MimeKit/Text
>
> Jeff

That's probably a good idea in your abundant spare time ;).  More
generally though we've thought about letting users provide filters to
convert attachments (e.g. .odt / .docx / pdf) to text. I'm not sure
about the performance hit, but I guess that would work for html as well.
I guess in principle it should be possible to write a GMime filter that
manages the child process.
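
A rough sketch of the child-process idea in Python, outside of GMime
altogether. The function name and the timeout are assumptions, and the
command to run would come from user configuration that does not exist yet.
The part's bytes go to the filter's stdin and the converted text comes back
on stdout.

```python
import subprocess


def run_text_filter(command, data, timeout=30):
    """Pipe raw part content through an external converter, return text.

    `command` is an argument list supplied by the user (hypothetical
    configuration); `data` is the decoded part content as bytes.
    """
    result = subprocess.run(
        command,
        input=data,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        timeout=timeout,  # do not let a stuck converter stall indexing
        check=True,       # a failing converter raises CalledProcessError
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Spawning one process per part is where the performance question comes in.
The timeout and the check=True failure path are there so that a broken
converter can be skipped instead of hanging the whole indexing run.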

d


* RE: [PATCH] test: add known broken test for indexing html
  2017-03-18 18:14       ` David Bremner
@ 2017-03-19 17:24         ` Jeffrey Stedfast
  0 siblings, 0 replies; 7+ messages in thread
From: Jeffrey Stedfast @ 2017-03-19 17:24 UTC (permalink / raw)
  To: David Bremner, notmuch@notmuchmail.org



Hah, yeah... it'll probably be a while. I need to focus on GMime 3.0 first. Once I get that squared away, I can look at porting other handy features back from MimeKit 😊

Jeff

