From: David Bremner <david@tethera.net>
To: notmuch@notmuchmail.org
Subject: [rfc patch 1/6] test: add known broken test for indexing html
Date: Tue, 21 Mar 2017 10:15:44 -0300 [thread overview]
Message-ID: <20170321131549.19557-2-david@tethera.net> (raw)
In-Reply-To: <20170321131549.19557-1-david@tethera.net>
'quite' on IRC reported that notmuch new was grinding to a halt during
initial indexing, and we eventually narrowed the problem down to some
html parts with large embedded images. These cause the number of terms
added to the Xapian database to explode (the first 400 messages
generated 4.6M unique terms), and of course the resulting terms are
not much use for searching.
---
test/T680-html-indexing.sh | 12 +++++++
test/corpora/README | 3 ++
test/corpora/html/embedded-image | 69 ++++++++++++++++++++++++++++++++++++++++
3 files changed, 84 insertions(+)
create mode 100755 test/T680-html-indexing.sh
create mode 100644 test/corpora/html/embedded-image
diff --git a/test/T680-html-indexing.sh b/test/T680-html-indexing.sh
new file mode 100755
index 00000000..78768c4f
--- /dev/null
+++ b/test/T680-html-indexing.sh
@@ -0,0 +1,12 @@
+#!/usr/bin/env bash
+test_description="indexing of html parts"
+. ./test-lib.sh || exit 1
+
+add_email_corpus html
+
+test_begin_subtest 'embedded images should not be indexed'
+test_subtest_known_broken
+notmuch search kwpza7svrgjzqwi8fhb2msggwtxtwgqcxp4wbqr4wjddstqmeqa7 > OUTPUT
+test_expect_equal_file /dev/null OUTPUT
+
+test_done
diff --git a/test/corpora/README b/test/corpora/README
index 77c48e6e..c9a35fed 100644
--- a/test/corpora/README
+++ b/test/corpora/README
@@ -9,3 +9,6 @@ default
broken
The broken corpus contains messages that are broken and/or RFC
non-compliant, ensuring we deal with them in a sane way.
+
+html
+ The html corpus contains html parts
diff --git a/test/corpora/html/embedded-image b/test/corpora/html/embedded-image
new file mode 100644
index 00000000..40851530
--- /dev/null
+++ b/test/corpora/html/embedded-image
@@ -0,0 +1,69 @@
+From: =?utf-8?b?bWFsbW9ib3Jn?= <daemon@lublin.se>
+To: =?utf-8?b?Ym9lbmRlLm1hbG1vYm9yZw==?= <daemon@lublin.se>
+Date: Tue, 19 Jul 2016 11:54:24 +0200
+X-Feed2Imap-Version: 1.2.5
+Message-Id: <boendemalmoborg-1834@eltanin.uberspace.de>
+Subject: =?utf-8?b?VGFjayBhbGxhIHRyYWZpa2FudGVyIG9jaCBmb3Rnw6RuZ2FyZSE=?=
+Content-Type: multipart/alternative; boundary="=-1468922508-176605-12427-9500-21-="
+MIME-Version: 1.0
+
+
+--=-1468922508-176605-12427-9500-21-=
+Content-Type: text/plain; charset=utf-8; format=flowed
+Content-Transfer-Encoding: 8bit
+
+<http://malmoborg.se/2016/07/tack-alla-trafikanter-och-fotgangare/>
+
+Malmö 2016-07-09
+
+I skrivande stund är vi i färd med att avetablera vår entreprenad på
+Tigern 3, Regementsgatan 6 i Malmö. Fastigheten har genomgått ett större
+dräneringsarbete som i sin tur har inneburit vissa
+trafikbegränsningar på Regementsgatan samt Davidshallsgatan under några
+veckors tid. Fastighetsägaren är mycket nöjd med vår arbetsinsats och vi
+kan glatt meddela att båda vägfilerna kommer att öppnas inom kort. Nu
+kommer den vackra fastigheten att klara sig torrskodd under många år
+framöver [A]
+
+
+
+[A] http://malmoborg.se/wp-includes/images/smilies/icon_smile.gif
+--
+Feed: Förvaltnings AB Malmöborg
+<http://malmoborg.se>
+Item: Tack alla trafikanter och fotgängare!
+<http://malmoborg.se/2016/07/tack-alla-trafikanter-och-fotgangare/>
+Date: 2016-07-19 11:54:24 +0200
+Author: malmoborg
+Filed under: Nyheter
+
+--=-1468922508-176605-12427-9500-21-=
+Content-Type: text/html; charset=utf-8
+Content-Transfer-Encoding: 8bit
+
+<table border="1" width="100%" cellpadding="0" cellspacing="0" borderspacing="0"><tr><td>
+<table width="100%" bgcolor="#EDEDED" cellpadding="4" cellspacing="2">
+<tr><td align="right"><b>Feed:</b></td>
+<td width="100%"><a href="http://malmoborg.se">
+<b>Förvaltnings AB Malmöborg</b>
+</a>
+</td></tr><tr><td align="right"><b>Item:</b></td>
+<td width="100%"><a href="http://malmoborg.se/2016/07/tack-alla-trafikanter-och-fotgangare/"><b>Tack alla trafikanter och fotgängare!</b>
+</a>
+</td></tr></table></td></tr></table>
+
+<p>Malmö 2016-07-09</p>
+<p>I skrivande stund är vi i färd med att avetablera vår entreprenad på Tigern 3, Regementsgatan 6 i Malmö. Fastigheten har genomgått ett större dräneringsarbete som i sin tur har inneburit vissa trafikbegränsningar på Regementsgatan samt Davidshallsgatan under några veckors tid. Fastighetsägaren är mycket nöjd med vår arbetsinsats och vi kan glatt meddela att båda vägfilerna kommer att öppnas inom kort. Nu kommer den vackra fastigheten att klara sig torrskodd under många år framöver <img src="data:image/gif;base64,R0lGODlhDwAPALMOAP/qAEVFRQAAAP/OAP/JAP+0AP6dAP/+k//9E///////
+xzMzM///6//lAAAAAAAAACH5BAEAAA4ALAAAAAAPAA8AAARb0EkZap3YVabO
+GRcWcAgCnIMRTEEnCCfwpqt2mHEOagoOnz+CKnADxoKFyiHHBBCSAdOiCVg8
+KwPZa7sVrgJZQWI8FhB2msGgwTXTWGqCXP4WBQr4wjDDstQmEQA7
+" alt=":-)" class="wp-smiley" /> </p>
+<p> </p>
+<hr width="100%"/>
+<table width="100%" cellpadding="0" cellspacing="0">
+<tr><td align="right"><font color="#ababab">Date:</font> </td><td><font color="#ababab">2016-07-19 11:54:24 +0200</font></td></tr>
+<tr><td align="right"><font color="#ababab">Author:</font> </td><td><font color="#ababab">malmoborg</font></td></tr>
+<tr><td align="right"><font color="#ababab">Filed under:</font> </td><td><font color="#ababab">Nyheter</font></td></tr>
+</table>
+
+--=-1468922508-176605-12427-9500-21-=--
--
2.11.0
next prev parent reply other threads:[~2017-03-21 13:16 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-03-21 13:15 RFC: drop html tags David Bremner
2017-03-21 13:15 ` David Bremner [this message]
2017-03-21 13:15 ` [rfc patch 2/6] lib: add content type argument to uuencode filter David Bremner
2017-03-21 13:15 ` [rfc patch 3/6] lib/index: Add another layer of indirection in filtering David Bremner
2017-03-21 13:15 ` [rfc patch 4/6] lib/index: separate state table definition from scanner David Bremner
2017-03-21 13:15 ` [rfc patch 5/6] lib/index: generalize filter name David Bremner
2017-03-21 13:15 ` [rfc patch 6/6] lib/index: add simple html filter David Bremner
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: https://notmuchmail.org/
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20170321131549.19557-2-david@tethera.net \
--to=david@tethera.net \
--cc=notmuch@notmuchmail.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://yhetil.org/notmuch.git/
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).