Drop HTML tags when indexing

unofficial mirror of notmuch@notmuchmail.org
 help / color / mirror / code / Atom feed

* Drop HTML tags when indexing
@ 2017-03-22 11:22 David Bremner
  2017-03-22 11:23 ` [PATCH 1/7] test: add known broken test for indexing html David Bremner
                   ` (8 more replies)
  0 siblings, 9 replies; 12+ messages in thread
From: David Bremner @ 2017-03-22 11:22 UTC (permalink / raw)
  To: notmuch

Steven Allen pointed out [2] that the previous scanner [1] was a
little too simplistic. This version handles (or claims to) quoted
strings in attributes, which can apparently contain '>'and '<'
characters. This required generalizing the state machine runner a bit
[3] to handle states with out-degree more than two.


[1]: id:20170321131549.19557-1-david@tethera.net
[2]: id:87wpbipl9z.fsf@tesseract.cs.unb.ca
[3]:
diff --git a/lib/index.cc b/lib/index.cc
index 03223f7d..324e6e79 100644
--- a/lib/index.cc
+++ b/lib/index.cc
@@ -122,23 +122,25 @@ do_filter (const scanner_state_t states[],
     register const char *inptr = inbuf;
     const char *inend = inbuf + inlen;
     char *outptr;
-    int next;
+    int next, current;
     (void) prespace;
 
 
     g_mime_filter_set_size (gmime_filter, inlen, FALSE);
     outptr = gmime_filter->outbuf;
 
+    current = filter->state;
     while (inptr < inend) {
-	if (*inptr >= states[filter->state].a &&
-	    *inptr <= states[filter->state].b)
-	{
-	    next = states[filter->state].next_if_match;
-	}
-	else
-	{
-	    next = states[filter->state].next_if_not_match;
-	}
+	/* do "fake transitions" until we fire a rule, or run out of rules */
+	do {
+	    if (*inptr >= states[current].a && *inptr <= states[current].b)  {
+		next = states[current].next_if_match;
+	    } else  {
+		next = states[current].next_if_not_match;
+	    }
+
+	    current = next;
+	} while (next != states[next].state);
 
 	if (filter->state < first_skipping_state)
 	    *outptr++ = *inptr;
@@ -209,7 +211,11 @@ filter_filter_html (GMimeFilter *gmime_filter, char *inbuf, size_t inlen, size_t
 {
     static const scanner_state_t states[] = {
 	{0,  '<',  '<',  1,  0},
+	{1,  '\'', '\'', 4,  2},  /* scanning for quote or > */
+	{1,  '"',  '"',  5,  3},
 	{1,  '>',  '>',  0,  1},
+	{4,  '\'', '\'', 1,  4},  /* inside single quotes */
+	{5,  '"', '"',   1,  5},  /* inside double quotes */
     };
     do_filter(states, 1,
 	      gmime_filter, inbuf, inlen, prespace, outbuf, outlen, outprespace);
diff --git a/test/T680-html-indexing.sh b/test/T680-html-indexing.sh
index ee69209c..74f33708 100755
--- a/test/T680-html-indexing.sh
+++ b/test/T680-html-indexing.sh
@@ -8,4 +8,15 @@ test_begin_subtest 'embedded images should not be indexed'
 notmuch search kwpza7svrgjzqwi8fhb2msggwtxtwgqcxp4wbqr4wjddstqmeqa7 > OUTPUT
 test_expect_equal_file /dev/null OUTPUT
 
+test_begin_subtest 'ignore > in attribute text'
+notmuch search swordfish | notmuch_search_sanitize > OUTPUT
+test_expect_equal_file /dev/null OUTPUT
+
+test_begin_subtest 'non tag text should be indexed'
+notmuch search hunter2 | notmuch_search_sanitize > OUTPUT
+cat <<EOF > EXPECTED
+thread:XXX   2009-11-17 [1/1] David Bremner; test html attachment (inbox unread)
+EOF
+test_expect_equal_file EXPECTED OUTPUT
+
 test_done
diff --git a/test/corpora/html/attribute-text b/test/corpora/html/attribute-text
new file mode 100644
index 00000000..6dae8194
--- /dev/null
+++ b/test/corpora/html/attribute-text
@@ -0,0 +1,15 @@
+From: David Bremner <david@example.net>
+To: David Bremner <david@example.net>
+Subject: test html attachment
+Date: Tue, 17 Nov 2009 21:28:38 +0600
+Message-ID: <87d1dajhgf.fsf@example.net>
+MIME-Version: 1.0
+Content-Type: text/html
+Content-Disposition: inline; filename=test.html
+
+<html>
+  <body>
+    <input value="a>swordfish">
+  </body>
+  hunter2
+</html>

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 1/7] test: add known broken test for indexing html
  2017-03-22 11:22 Drop HTML tags when indexing David Bremner
@ 2017-03-22 11:23 ` David Bremner
  2017-04-20 10:05   ` David Bremner
  2017-07-16 12:44   ` David Bremner
  2017-03-22 11:23 ` [PATCH 2/7] lib: add content type argument to uuencode filter David Bremner
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 12+ messages in thread
From: David Bremner @ 2017-03-22 11:23 UTC (permalink / raw)
  To: notmuch

'quite' on IRC reported that notmuch new was grinding to a halt during
initial indexing, and we eventually narrowed the problem down to some
html parts with large embedded images. These cause the number of terms
added to the Xapian database to explode (the first 400 messages
generated 4.6M unique terms), and of course the resulting terms are
not much use for searching.

The second test is sanity check for any "improved" indexing of HTML.
---
 test/T680-html-indexing.sh       | 19 +++++++++++
 test/corpora/README              |  3 ++
 test/corpora/html/attribute-text | 15 +++++++++
 test/corpora/html/embedded-image | 69 ++++++++++++++++++++++++++++++++++++++++
 4 files changed, 106 insertions(+)
 create mode 100755 test/T680-html-indexing.sh
 create mode 100644 test/corpora/html/attribute-text
 create mode 100644 test/corpora/html/embedded-image

diff --git a/test/T680-html-indexing.sh b/test/T680-html-indexing.sh
new file mode 100755
index 00000000..5e9cc4cb
--- /dev/null
+++ b/test/T680-html-indexing.sh
@@ -0,0 +1,19 @@
+#!/usr/bin/env bash
+test_description="indexing of html parts"
+. ./test-lib.sh || exit 1
+
+add_email_corpus html
+
+test_begin_subtest 'embedded images should not be indexed'
+test_subtest_known_broken
+notmuch search kwpza7svrgjzqwi8fhb2msggwtxtwgqcxp4wbqr4wjddstqmeqa7 > OUTPUT
+test_expect_equal_file /dev/null OUTPUT
+
+test_begin_subtest 'non tag text should be indexed'
+notmuch search hunter2 | notmuch_search_sanitize > OUTPUT
+cat <<EOF > EXPECTED
+thread:XXX   2009-11-17 [1/1] David Bremner; test html attachment (inbox unread)
+EOF
+test_expect_equal_file EXPECTED OUTPUT
+
+test_done
diff --git a/test/corpora/README b/test/corpora/README
index 77c48e6e..c9a35fed 100644
--- a/test/corpora/README
+++ b/test/corpora/README
@@ -9,3 +9,6 @@ default
 broken
   The broken corpus contains messages that are broken and/or RFC
   non-compliant, ensuring we deal with them in a sane way.
+
+html
+  The html corpus contains html parts
diff --git a/test/corpora/html/attribute-text b/test/corpora/html/attribute-text
new file mode 100644
index 00000000..6dae8194
--- /dev/null
+++ b/test/corpora/html/attribute-text
@@ -0,0 +1,15 @@
+From: David Bremner <david@example.net>
+To: David Bremner <david@example.net>
+Subject: test html attachment
+Date: Tue, 17 Nov 2009 21:28:38 +0600
+Message-ID: <87d1dajhgf.fsf@example.net>
+MIME-Version: 1.0
+Content-Type: text/html
+Content-Disposition: inline; filename=test.html
+
+<html>
+  <body>
+    <input value="a>swordfish">
+  </body>
+  hunter2
+</html>
diff --git a/test/corpora/html/embedded-image b/test/corpora/html/embedded-image
new file mode 100644
index 00000000..40851530
--- /dev/null
+++ b/test/corpora/html/embedded-image
@@ -0,0 +1,69 @@
+From: =?utf-8?b?bWFsbW9ib3Jn?= <daemon@lublin.se>
+To: =?utf-8?b?Ym9lbmRlLm1hbG1vYm9yZw==?= <daemon@lublin.se>
+Date: Tue, 19 Jul 2016 11:54:24 +0200
+X-Feed2Imap-Version: 1.2.5
+Message-Id: <boendemalmoborg-1834@eltanin.uberspace.de>
+Subject: =?utf-8?b?VGFjayBhbGxhIHRyYWZpa2FudGVyIG9jaCBmb3Rnw6RuZ2FyZSE=?=
+Content-Type: multipart/alternative; boundary="=-1468922508-176605-12427-9500-21-="
+MIME-Version: 1.0
+
+
+--=-1468922508-176605-12427-9500-21-=
+Content-Type: text/plain; charset=utf-8; format=flowed
+Content-Transfer-Encoding: 8bit
+
+<http://malmoborg.se/2016/07/tack-alla-trafikanter-och-fotgangare/>
+
+Malmö 2016-07-09
+
+I skrivande stund är vi i färd med att avetablera vår entreprenad på 
+Tigern 3, Regementsgatan 6 i Malmö. Fastigheten har genomgått ett större 
+dräneringsarbete som i sin tur har inneburit vissa 
+trafikbegränsningar på Regementsgatan samt Davidshallsgatan under några 
+veckors tid. Fastighetsägaren är mycket nöjd med vår arbetsinsats och vi 
+kan glatt meddela att båda vägfilerna kommer att öppnas inom kort. Nu 
+kommer den vackra fastigheten att klara sig torrskodd under många år 
+framöver [A]
+
+ 
+
+[A] http://malmoborg.se/wp-includes/images/smilies/icon_smile.gif
+-- 
+Feed: Förvaltnings AB Malmöborg
+<http://malmoborg.se>
+Item: Tack alla trafikanter och fotgängare!
+<http://malmoborg.se/2016/07/tack-alla-trafikanter-och-fotgangare/>
+Date: 2016-07-19 11:54:24 +0200
+Author: malmoborg
+Filed under: Nyheter
+
+--=-1468922508-176605-12427-9500-21-=
+Content-Type: text/html; charset=utf-8
+Content-Transfer-Encoding: 8bit
+
+<table border="1" width="100%" cellpadding="0" cellspacing="0" borderspacing="0"><tr><td>
+<table width="100%" bgcolor="#EDEDED" cellpadding="4" cellspacing="2">
+<tr><td align="right"><b>Feed:</b></td>
+<td width="100%"><a href="http://malmoborg.se">
+<b>Förvaltnings AB Malmöborg</b>
+</a>
+</td></tr><tr><td align="right"><b>Item:</b></td>
+<td width="100%"><a href="http://malmoborg.se/2016/07/tack-alla-trafikanter-och-fotgangare/"><b>Tack alla trafikanter och fotgängare!</b>
+</a>
+</td></tr></table></td></tr></table>
+
+<p>Malmö 2016-07-09</p>
+<p>I skrivande stund är vi i färd med att avetablera vår entreprenad på Tigern 3, Regementsgatan 6 i Malmö. Fastigheten har genomgått ett större dräneringsarbete som i sin tur har inneburit vissa trafikbegränsningar på Regementsgatan samt Davidshallsgatan under några veckors tid. Fastighetsägaren är mycket nöjd med vår arbetsinsats och vi kan glatt meddela att båda vägfilerna kommer att öppnas inom kort. Nu kommer den vackra fastigheten att klara sig torrskodd under många år framöver <img src="data:image/gif;base64,R0lGODlhDwAPALMOAP/qAEVFRQAAAP/OAP/JAP+0AP6dAP/+k//9E///////
+xzMzM///6//lAAAAAAAAACH5BAEAAA4ALAAAAAAPAA8AAARb0EkZap3YVabO
+GRcWcAgCnIMRTEEnCCfwpqt2mHEOagoOnz+CKnADxoKFyiHHBBCSAdOiCVg8
+KwPZa7sVrgJZQWI8FhB2msGgwTXTWGqCXP4WBQr4wjDDstQmEQA7
+" alt=":-)" class="wp-smiley" /> </p>
+<p>&nbsp;</p>
+<hr width="100%"/>
+<table width="100%" cellpadding="0" cellspacing="0">
+<tr><td align="right"><font color="#ababab">Date:</font>&nbsp;&nbsp;</td><td><font color="#ababab">2016-07-19 11:54:24 +0200</font></td></tr>
+<tr><td align="right"><font color="#ababab">Author:</font>&nbsp;&nbsp;</td><td><font color="#ababab">malmoborg</font></td></tr>
+<tr><td align="right"><font color="#ababab">Filed under:</font>&nbsp;&nbsp;</td><td><font color="#ababab">Nyheter</font></td></tr>
+</table>
+
+--=-1468922508-176605-12427-9500-21-=--
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 2/7] lib: add content type argument to uuencode filter.
  2017-03-22 11:22 Drop HTML tags when indexing David Bremner
  2017-03-22 11:23 ` [PATCH 1/7] test: add known broken test for indexing html David Bremner
@ 2017-03-22 11:23 ` David Bremner
  2017-03-22 11:23 ` [PATCH 3/7] lib/index: Add another layer of indirection in filtering David Bremner
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 12+ messages in thread
From: David Bremner @ 2017-03-22 11:23 UTC (permalink / raw)
  To: notmuch

The idea is to support more general types of filtering, based on
content type.
---
 lib/index.cc | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/lib/index.cc b/lib/index.cc
index 8c145540..1c04cc3d 100644
--- a/lib/index.cc
+++ b/lib/index.cc
@@ -56,6 +56,7 @@ typedef struct _NotmuchFilterDiscardUuencodeClass NotmuchFilterDiscardUuencodeCl
  **/
 struct _NotmuchFilterDiscardUuencode {
     GMimeFilter parent_object;
+    GMimeContentType *content_type;
     int state;
 };
 
@@ -63,7 +64,7 @@ struct _NotmuchFilterDiscardUuencodeClass {
     GMimeFilterClass parent_class;
 };
 
-static GMimeFilter *notmuch_filter_discard_uuencode_new (void);
+static GMimeFilter *notmuch_filter_discard_uuencode_new (GMimeContentType *content);
 
 static void notmuch_filter_discard_uuencode_finalize (GObject *object);
 
@@ -102,8 +103,9 @@ notmuch_filter_discard_uuencode_finalize (GObject *object)
 static GMimeFilter *
 filter_copy (GMimeFilter *gmime_filter)
 {
-    (void) gmime_filter;
-    return notmuch_filter_discard_uuencode_new ();
+    NotmuchFilterDiscardUuencode *filter = (NotmuchFilterDiscardUuencode *) gmime_filter;
+
+    return notmuch_filter_discard_uuencode_new (filter->content_type);
 }
 
 static void
@@ -196,7 +198,7 @@ filter_reset (GMimeFilter *gmime_filter)
  * Returns: a new #NotmuchFilterDiscardUuencode filter.
  **/
 static GMimeFilter *
-notmuch_filter_discard_uuencode_new (void)
+notmuch_filter_discard_uuencode_new (GMimeContentType *content_type)
 {
     static GType type = 0;
     NotmuchFilterDiscardUuencode *filter;
@@ -220,6 +222,7 @@ notmuch_filter_discard_uuencode_new (void)
 
     filter = (NotmuchFilterDiscardUuencode *) g_object_newv (type, 0, NULL);
     filter->state = 0;
+    filter->content_type = content_type;
 
     return (GMimeFilter *) filter;
 }
@@ -396,7 +399,7 @@ _index_mime_part (notmuch_message_t *message,
     g_mime_stream_mem_set_owner (GMIME_STREAM_MEM (stream), FALSE);
 
     filter = g_mime_stream_filter_new (stream);
-    discard_uuencode_filter = notmuch_filter_discard_uuencode_new ();
+    discard_uuencode_filter = notmuch_filter_discard_uuencode_new (content_type);
 
     g_mime_stream_filter_add (GMIME_STREAM_FILTER (filter),
 			      discard_uuencode_filter);
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 3/7] lib/index: Add another layer of indirection in filtering
  2017-03-22 11:22 Drop HTML tags when indexing David Bremner
  2017-03-22 11:23 ` [PATCH 1/7] test: add known broken test for indexing html David Bremner
  2017-03-22 11:23 ` [PATCH 2/7] lib: add content type argument to uuencode filter David Bremner
@ 2017-03-22 11:23 ` David Bremner
  2017-03-22 11:23 ` [PATCH 4/7] lib/index: separate state table definition from scanner David Bremner
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 12+ messages in thread
From: David Bremner @ 2017-03-22 11:23 UTC (permalink / raw)
  To: notmuch

We could add a second gmime filter subclass, but prefer to avoid
duplicating the boilerplate.
---
 lib/index.cc | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/lib/index.cc b/lib/index.cc
index 1c04cc3d..74a750b9 100644
--- a/lib/index.cc
+++ b/lib/index.cc
@@ -29,6 +29,8 @@
 typedef struct _NotmuchFilterDiscardUuencode NotmuchFilterDiscardUuencode;
 typedef struct _NotmuchFilterDiscardUuencodeClass NotmuchFilterDiscardUuencodeClass;
 
+typedef void (*filter_fun) (GMimeFilter *filter, char *in, size_t len, size_t prespace,
+			    char **out, size_t *outlen, size_t *outprespace);
 /**
  * NotmuchFilterDiscardUuencode:
  *
@@ -57,6 +59,7 @@ typedef struct _NotmuchFilterDiscardUuencodeClass NotmuchFilterDiscardUuencodeCl
 struct _NotmuchFilterDiscardUuencode {
     GMimeFilter parent_object;
     GMimeContentType *content_type;
+    filter_fun real_filter;
     int state;
 };
 
@@ -110,7 +113,14 @@ filter_copy (GMimeFilter *gmime_filter)
 
 static void
 filter_filter (GMimeFilter *gmime_filter, char *inbuf, size_t inlen, size_t prespace,
-	       char **outbuf, size_t *outlen, size_t *outprespace)
+	       char **outbuf, size_t *outlen, size_t *outprespace) {
+    NotmuchFilterDiscardUuencode *filter = (NotmuchFilterDiscardUuencode *) gmime_filter;
+    (*filter->real_filter)(gmime_filter, inbuf, inlen, prespace, outbuf, outlen, outprespace);
+}
+
+static void
+filter_filter_uuencode (GMimeFilter *gmime_filter, char *inbuf, size_t inlen, size_t prespace,
+			char **outbuf, size_t *outlen, size_t *outprespace)
 {
     NotmuchFilterDiscardUuencode *filter = (NotmuchFilterDiscardUuencode *) gmime_filter;
     register const char *inptr = inbuf;
@@ -223,7 +233,7 @@ notmuch_filter_discard_uuencode_new (GMimeContentType *content_type)
     filter = (NotmuchFilterDiscardUuencode *) g_object_newv (type, 0, NULL);
     filter->state = 0;
     filter->content_type = content_type;
-
+    filter->real_filter = filter_filter_uuencode;
     return (GMimeFilter *) filter;
 }
 
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 4/7] lib/index: separate state table definition from scanner.
  2017-03-22 11:22 Drop HTML tags when indexing David Bremner
                   ` (2 preceding siblings ...)
  2017-03-22 11:23 ` [PATCH 3/7] lib/index: Add another layer of indirection in filtering David Bremner
@ 2017-03-22 11:23 ` David Bremner
  2017-03-22 11:23 ` [PATCH 5/7] lib/index: generalize filter name David Bremner
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 12+ messages in thread
From: David Bremner @ 2017-03-22 11:23 UTC (permalink / raw)
  To: notmuch

We want to reuse the scanner definition with a different table
---
 lib/index.cc | 81 +++++++++++++++++++++++++++++++++++-------------------------
 1 file changed, 47 insertions(+), 34 deletions(-)

diff --git a/lib/index.cc b/lib/index.cc
index 74a750b9..02b35b81 100644
--- a/lib/index.cc
+++ b/lib/index.cc
@@ -31,6 +31,15 @@ typedef struct _NotmuchFilterDiscardUuencodeClass NotmuchFilterDiscardUuencodeCl
 
 typedef void (*filter_fun) (GMimeFilter *filter, char *in, size_t len, size_t prespace,
 			    char **out, size_t *outlen, size_t *outprespace);
+
+typedef struct {
+    int state;
+    int a;
+    int b;
+    int next_if_match;
+    int next_if_not_match;
+} scanner_state_t;
+
 /**
  * NotmuchFilterDiscardUuencode:
  *
@@ -119,46 +128,18 @@ filter_filter (GMimeFilter *gmime_filter, char *inbuf, size_t inlen, size_t pres
 }
 
 static void
-filter_filter_uuencode (GMimeFilter *gmime_filter, char *inbuf, size_t inlen, size_t prespace,
-			char **outbuf, size_t *outlen, size_t *outprespace)
+do_filter (const scanner_state_t states[],
+	   int first_skipping_state,
+	   GMimeFilter *gmime_filter, char *inbuf, size_t inlen, size_t prespace,
+	   char **outbuf, size_t *outlen, size_t *outprespace)
 {
     NotmuchFilterDiscardUuencode *filter = (NotmuchFilterDiscardUuencode *) gmime_filter;
     register const char *inptr = inbuf;
     const char *inend = inbuf + inlen;
     char *outptr;
-
+    int next;
     (void) prespace;
 
-    /* Simple, linear state-transition diagram for our filter.
-     *
-     * If the character being processed is within the range of [a, b]
-     * for the current state then we transition next_if_match
-     * state. If not, we transition to the next_if_not_match state.
-     *
-     * The final two states are special in that they are the states in
-     * which we discard data. */
-    static const struct {
-	int state;
-	int a;
-	int b;
-	int next_if_match;
-	int next_if_not_match;
-    } states[] = {
-	{0,  'b',  'b',  1,  0},
-	{1,  'e',  'e',  2,  0},
-	{2,  'g',  'g',  3,  0},
-	{3,  'i',  'i',  4,  0},
-	{4,  'n',  'n',  5,  0},
-	{5,  ' ',  ' ',  6,  0},
-	{6,  '0',  '7',  7,  0},
-	{7,  '0',  '7',  8,  0},
-	{8,  '0',  '7',  9,  0},
-	{9,  ' ',  ' ',  10, 0},
-	{10, '\n', '\n', 11, 10},
-	{11, 'M',  'M',  12, 0},
-	{12, ' ',  '`',  12, 11}
-    };
-    int next;
 
     g_mime_filter_set_size (gmime_filter, inlen, FALSE);
     outptr = gmime_filter->outbuf;
@@ -174,7 +155,7 @@ filter_filter_uuencode (GMimeFilter *gmime_filter, char *inbuf, size_t inlen, si
 	    next = states[filter->state].next_if_not_match;
 	}
 
-	if (filter->state < 11)
+	if (filter->state < first_skipping_state)
 	    *outptr++ = *inptr;
 
 	filter->state = next;
@@ -187,6 +168,38 @@ filter_filter_uuencode (GMimeFilter *gmime_filter, char *inbuf, size_t inlen, si
 }
 
 static void
+filter_filter_uuencode (GMimeFilter *gmime_filter, char *inbuf, size_t inlen, size_t prespace,
+			char **outbuf, size_t *outlen, size_t *outprespace)
+{
+    /* Simple, linear state-transition diagram for our filter.
+     *
+     * If the character being processed is within the range of [a, b]
+     * for the current state then we transition next_if_match
+     * state. If not, we transition to the next_if_not_match state.
+     *
+     * The final two states are special in that they are the states in
+     * which we discard data. */
+    static const scanner_state_t states[] = {
+	{0,  'b',  'b',  1,  0},
+	{1,  'e',  'e',  2,  0},
+	{2,  'g',  'g',  3,  0},
+	{3,  'i',  'i',  4,  0},
+	{4,  'n',  'n',  5,  0},
+	{5,  ' ',  ' ',  6,  0},
+	{6,  '0',  '7',  7,  0},
+	{7,  '0',  '7',  8,  0},
+	{8,  '0',  '7',  9,  0},
+	{9,  ' ',  ' ',  10, 0},
+	{10, '\n', '\n', 11, 10},
+	{11, 'M',  'M',  12, 0},
+	{12, ' ',  '`',  12, 11}
+    };
+
+    do_filter(states, 11,
+	      gmime_filter, inbuf, inlen, prespace, outbuf, outlen, outprespace);
+}
+
+static void
 filter_complete (GMimeFilter *filter, char *inbuf, size_t inlen, size_t prespace,
 		 char **outbuf, size_t *outlen, size_t *outprespace)
 {
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 5/7] lib/index: generalize filter name
  2017-03-22 11:22 Drop HTML tags when indexing David Bremner
                   ` (3 preceding siblings ...)
  2017-03-22 11:23 ` [PATCH 4/7] lib/index: separate state table definition from scanner David Bremner
@ 2017-03-22 11:23 ` David Bremner
  2017-03-22 11:23 ` [PATCH 6/7] lib/index.cc: generalize filter state machine David Bremner
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 12+ messages in thread
From: David Bremner @ 2017-03-22 11:23 UTC (permalink / raw)
  To: notmuch

We can't very well call it uuencode if it is going to filter other
things as well.
---
 lib/index.cc | 92 +++++++++++++++++++++++++++++++-----------------------------
 1 file changed, 48 insertions(+), 44 deletions(-)

diff --git a/lib/index.cc b/lib/index.cc
index 02b35b81..3bb1ac1c 100644
--- a/lib/index.cc
+++ b/lib/index.cc
@@ -26,8 +26,8 @@
 
 /* Oh, how I wish that gobject didn't require so much noisy boilerplate!
  * (Though I have at least eliminated some of the stock set...) */
-typedef struct _NotmuchFilterDiscardUuencode NotmuchFilterDiscardUuencode;
-typedef struct _NotmuchFilterDiscardUuencodeClass NotmuchFilterDiscardUuencodeClass;
+typedef struct _NotmuchFilterDiscardNonTerms NotmuchFilterDiscardNonTerms;
+typedef struct _NotmuchFilterDiscardNonTermsClass NotmuchFilterDiscardNonTermsClass;
 
 typedef void (*filter_fun) (GMimeFilter *filter, char *in, size_t len, size_t prespace,
 			    char **out, size_t *outlen, size_t *outprespace);
@@ -41,44 +41,29 @@ typedef struct {
 } scanner_state_t;
 
 /**
- * NotmuchFilterDiscardUuencode:
+ * NotmuchFilterDiscardNonTerms:
  *
  * @parent_object: parent #GMimeFilter
  * @encode: encoding vs decoding
  * @state: State of the parser
  *
- * A filter to discard uuencoded portions of an email.
- *
- * A uuencoded portion is identified as beginning with a line
- * matching:
- *
- *	begin [0-7][0-7][0-7] .*
- *
- * After that detection, and beginning with the following line,
- * characters will be discarded as long as the first character of each
- * line begins with M and subsequent characters on the line are within
- * the range of ASCII characters from ' ' to '`'.
- *
- * This is not a perfect UUencode filter. It's possible to have a
- * message that will legitimately match that pattern, (so that some
- * legitimate content is discarded). And for most UUencoded files, the
- * final line of encoded data (the line not starting with M) will be
- * indexed.
+ * A filter to discard non terms portions of an email, i.e. stuff not
+ * worth indexing.
  **/
-struct _NotmuchFilterDiscardUuencode {
+struct _NotmuchFilterDiscardNonTerms {
     GMimeFilter parent_object;
     GMimeContentType *content_type;
     filter_fun real_filter;
     int state;
 };
 
-struct _NotmuchFilterDiscardUuencodeClass {
+struct _NotmuchFilterDiscardNonTermsClass {
     GMimeFilterClass parent_class;
 };
 
-static GMimeFilter *notmuch_filter_discard_uuencode_new (GMimeContentType *content);
+static GMimeFilter *notmuch_filter_discard_non_terms_new (GMimeContentType *content);
 
-static void notmuch_filter_discard_uuencode_finalize (GObject *object);
+static void notmuch_filter_discard_non_terms_finalize (GObject *object);
 
 static GMimeFilter *filter_copy (GMimeFilter *filter);
 static void filter_filter (GMimeFilter *filter, char *in, size_t len, size_t prespace,
@@ -91,14 +76,14 @@ static void filter_reset (GMimeFilter *filter);
 static GMimeFilterClass *parent_class = NULL;
 
 static void
-notmuch_filter_discard_uuencode_class_init (NotmuchFilterDiscardUuencodeClass *klass)
+notmuch_filter_discard_non_terms_class_init (NotmuchFilterDiscardNonTermsClass *klass)
 {
     GObjectClass *object_class = G_OBJECT_CLASS (klass);
     GMimeFilterClass *filter_class = GMIME_FILTER_CLASS (klass);
 
     parent_class = (GMimeFilterClass *) g_type_class_ref (GMIME_TYPE_FILTER);
 
-    object_class->finalize = notmuch_filter_discard_uuencode_finalize;
+    object_class->finalize = notmuch_filter_discard_non_terms_finalize;
 
     filter_class->copy = filter_copy;
     filter_class->filter = filter_filter;
@@ -107,7 +92,7 @@ notmuch_filter_discard_uuencode_class_init (NotmuchFilterDiscardUuencodeClass *k
 }
 
 static void
-notmuch_filter_discard_uuencode_finalize (GObject *object)
+notmuch_filter_discard_non_terms_finalize (GObject *object)
 {
     G_OBJECT_CLASS (parent_class)->finalize (object);
 }
@@ -115,15 +100,15 @@ notmuch_filter_discard_uuencode_finalize (GObject *object)
 static GMimeFilter *
 filter_copy (GMimeFilter *gmime_filter)
 {
-    NotmuchFilterDiscardUuencode *filter = (NotmuchFilterDiscardUuencode *) gmime_filter;
+    NotmuchFilterDiscardNonTerms *filter = (NotmuchFilterDiscardNonTerms *) gmime_filter;
 
-    return notmuch_filter_discard_uuencode_new (filter->content_type);
+    return notmuch_filter_discard_non_terms_new (filter->content_type);
 }
 
 static void
 filter_filter (GMimeFilter *gmime_filter, char *inbuf, size_t inlen, size_t prespace,
 	       char **outbuf, size_t *outlen, size_t *outprespace) {
-    NotmuchFilterDiscardUuencode *filter = (NotmuchFilterDiscardUuencode *) gmime_filter;
+    NotmuchFilterDiscardNonTerms *filter = (NotmuchFilterDiscardNonTerms *) gmime_filter;
     (*filter->real_filter)(gmime_filter, inbuf, inlen, prespace, outbuf, outlen, outprespace);
 }
 
@@ -133,7 +118,7 @@ do_filter (const scanner_state_t states[],
 	   GMimeFilter *gmime_filter, char *inbuf, size_t inlen, size_t prespace,
 	   char **outbuf, size_t *outlen, size_t *outprespace)
 {
-    NotmuchFilterDiscardUuencode *filter = (NotmuchFilterDiscardUuencode *) gmime_filter;
+    NotmuchFilterDiscardNonTerms *filter = (NotmuchFilterDiscardNonTerms *) gmime_filter;
     register const char *inptr = inbuf;
     const char *inend = inbuf + inlen;
     char *outptr;
@@ -167,6 +152,25 @@ do_filter (const scanner_state_t states[],
     *outbuf = gmime_filter->outbuf;
 }
 
+/*
+ *
+ * A uuencoded portion is identified as beginning with a line
+ * matching:
+ *
+ *	begin [0-7][0-7][0-7] .*
+ *
+ * After that detection, and beginning with the following line,
+ * characters will be discarded as long as the first character of each
+ * line begins with M and subsequent characters on the line are within
+ * the range of ASCII characters from ' ' to '`'.
+ *
+ * This is not a perfect UUencode filter. It's possible to have a
+ * message that will legitimately match that pattern, (so that some
+ * legitimate content is discarded). And for most UUencoded files, the
+ * final line of encoded data (the line not starting with M) will be
+ * indexed.
+ */
+
 static void
 filter_filter_uuencode (GMimeFilter *gmime_filter, char *inbuf, size_t inlen, size_t prespace,
 			char **outbuf, size_t *outlen, size_t *outprespace)
@@ -210,7 +214,7 @@ filter_complete (GMimeFilter *filter, char *inbuf, size_t inlen, size_t prespace
 static void
 filter_reset (GMimeFilter *gmime_filter)
 {
-    NotmuchFilterDiscardUuencode *filter = (NotmuchFilterDiscardUuencode *) gmime_filter;
+    NotmuchFilterDiscardNonTerms *filter = (NotmuchFilterDiscardNonTerms *) gmime_filter;
 
     filter->state = 0;
 }
@@ -218,32 +222,32 @@ filter_reset (GMimeFilter *gmime_filter)
 /**
  * notmuch_filter_discard_uuencode_new:
  *
- * Returns: a new #NotmuchFilterDiscardUuencode filter.
+ * Returns: a new #NotmuchFilterDiscardNonTerms filter.
  **/
 static GMimeFilter *
-notmuch_filter_discard_uuencode_new (GMimeContentType *content_type)
+notmuch_filter_discard_non_terms_new (GMimeContentType *content_type)
 {
     static GType type = 0;
-    NotmuchFilterDiscardUuencode *filter;
+    NotmuchFilterDiscardNonTerms *filter;
 
     if (!type) {
 	static const GTypeInfo info = {
-	    sizeof (NotmuchFilterDiscardUuencodeClass),
+	    sizeof (NotmuchFilterDiscardNonTermsClass),
 	    NULL, /* base_class_init */
 	    NULL, /* base_class_finalize */
-	    (GClassInitFunc) notmuch_filter_discard_uuencode_class_init,
+	    (GClassInitFunc) notmuch_filter_discard_non_terms_class_init,
 	    NULL, /* class_finalize */
 	    NULL, /* class_data */
-	    sizeof (NotmuchFilterDiscardUuencode),
+	    sizeof (NotmuchFilterDiscardNonTerms),
 	    0,    /* n_preallocs */
 	    NULL, /* instance_init */
 	    NULL  /* value_table */
 	};
 
-	type = g_type_register_static (GMIME_TYPE_FILTER, "NotmuchFilterDiscardUuencode", &info, (GTypeFlags) 0);
+	type = g_type_register_static (GMIME_TYPE_FILTER, "NotmuchFilterDiscardNonTerms", &info, (GTypeFlags) 0);
     }
 
-    filter = (NotmuchFilterDiscardUuencode *) g_object_newv (type, 0, NULL);
+    filter = (NotmuchFilterDiscardNonTerms *) g_object_newv (type, 0, NULL);
     filter->state = 0;
     filter->content_type = content_type;
     filter->real_filter = filter_filter_uuencode;
@@ -332,7 +336,7 @@ _index_mime_part (notmuch_message_t *message,
 		  GMimeObject *part)
 {
     GMimeStream *stream, *filter;
-    GMimeFilter *discard_uuencode_filter;
+    GMimeFilter *discard_non_terms_filter;
     GMimeDataWrapper *wrapper;
     GByteArray *byte_array;
     GMimeContentDisposition *disposition;
@@ -422,10 +426,10 @@ _index_mime_part (notmuch_message_t *message,
     g_mime_stream_mem_set_owner (GMIME_STREAM_MEM (stream), FALSE);
 
     filter = g_mime_stream_filter_new (stream);
-    discard_uuencode_filter = notmuch_filter_discard_uuencode_new (content_type);
+    discard_non_terms_filter = notmuch_filter_discard_non_terms_new (content_type);
 
     g_mime_stream_filter_add (GMIME_STREAM_FILTER (filter),
-			      discard_uuencode_filter);
+			      discard_non_terms_filter);
 
     charset = g_mime_object_get_content_type_parameter (part, "charset");
     if (charset) {
@@ -447,7 +451,7 @@ _index_mime_part (notmuch_message_t *message,
 
     g_object_unref (stream);
     g_object_unref (filter);
-    g_object_unref (discard_uuencode_filter);
+    g_object_unref (discard_non_terms_filter);
 
     g_byte_array_append (byte_array, (guint8 *) "\0", 1);
     body = (char *) g_byte_array_free (byte_array, FALSE);
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 6/7] lib/index.cc: generalize filter state machine
  2017-03-22 11:22 Drop HTML tags when indexing David Bremner
                   ` (4 preceding siblings ...)
  2017-03-22 11:23 ` [PATCH 5/7] lib/index: generalize filter name David Bremner
@ 2017-03-22 11:23 ` David Bremner
  2017-03-22 11:23 ` [PATCH 7/7] lib/index: add simple html filter David Bremner
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 12+ messages in thread
From: David Bremner @ 2017-03-22 11:23 UTC (permalink / raw)
  To: notmuch

To match things more complicated than fixed strings, we need states
with multiple out arrows.
---
 lib/index.cc | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/lib/index.cc b/lib/index.cc
index 3bb1ac1c..fd66762c 100644
--- a/lib/index.cc
+++ b/lib/index.cc
@@ -122,23 +122,25 @@ do_filter (const scanner_state_t states[],
     register const char *inptr = inbuf;
     const char *inend = inbuf + inlen;
     char *outptr;
-    int next;
+    int next, current;
     (void) prespace;
 
 
     g_mime_filter_set_size (gmime_filter, inlen, FALSE);
     outptr = gmime_filter->outbuf;
 
+    current = filter->state;
     while (inptr < inend) {
-	if (*inptr >= states[filter->state].a &&
-	    *inptr <= states[filter->state].b)
-	{
-	    next = states[filter->state].next_if_match;
-	}
-	else
-	{
-	    next = states[filter->state].next_if_not_match;
-	}
+	/* do "fake transitions" until we fire a rule, or run out of rules */
+	do {
+	    if (*inptr >= states[current].a && *inptr <= states[current].b)  {
+		next = states[current].next_if_match;
+	    } else  {
+		next = states[current].next_if_not_match;
+	    }
+
+	    current = next;
+	} while (next != states[next].state);
 
 	if (filter->state < first_skipping_state)
 	    *outptr++ = *inptr;
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 7/7] lib/index: add simple html filter
  2017-03-22 11:22 Drop HTML tags when indexing David Bremner
                   ` (5 preceding siblings ...)
  2017-03-22 11:23 ` [PATCH 6/7] lib/index.cc: generalize filter state machine David Bremner
@ 2017-03-22 11:23 ` David Bremner
  2017-03-22 13:20 ` Drop HTML tags when indexing Daniel Lublin (quite)
  2017-03-25 12:59 ` David Bremner
  8 siblings, 0 replies; 12+ messages in thread
From: David Bremner @ 2017-03-22 11:23 UTC (permalink / raw)
  To: notmuch

Just drop all tags
---
 lib/index.cc               | 21 ++++++++++++++++++++-
 test/T680-html-indexing.sh |  5 ++++-
 2 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/lib/index.cc b/lib/index.cc
index fd66762c..324e6e79 100644
--- a/lib/index.cc
+++ b/lib/index.cc
@@ -206,6 +206,22 @@ filter_filter_uuencode (GMimeFilter *gmime_filter, char *inbuf, size_t inlen, si
 }
 
 static void
+filter_filter_html (GMimeFilter *gmime_filter, char *inbuf, size_t inlen, size_t prespace,
+		    char **outbuf, size_t *outlen, size_t *outprespace)
+{
+    static const scanner_state_t states[] = {
+	{0,  '<',  '<',  1,  0},
+	{1,  '\'', '\'', 4,  2},  /* scanning for quote or > */
+	{1,  '"',  '"',  5,  3},
+	{1,  '>',  '>',  0,  1},
+	{4,  '\'', '\'', 1,  4},  /* inside single quotes */
+	{5,  '"', '"',   1,  5},  /* inside double quotes */
+    };
+    do_filter(states, 1,
+	      gmime_filter, inbuf, inlen, prespace, outbuf, outlen, outprespace);
+}
+
+static void
 filter_complete (GMimeFilter *filter, char *inbuf, size_t inlen, size_t prespace,
 		 char **outbuf, size_t *outlen, size_t *outprespace)
 {
@@ -252,7 +268,10 @@ notmuch_filter_discard_non_terms_new (GMimeContentType *content_type)
     filter = (NotmuchFilterDiscardNonTerms *) g_object_newv (type, 0, NULL);
     filter->state = 0;
     filter->content_type = content_type;
-    filter->real_filter = filter_filter_uuencode;
+    if (g_mime_content_type_is_type (content_type, "text", "html"))
+	filter->real_filter = filter_filter_html;
+    else
+	filter->real_filter = filter_filter_uuencode;
     return (GMimeFilter *) filter;
 }
 
diff --git a/test/T680-html-indexing.sh b/test/T680-html-indexing.sh
index 5e9cc4cb..74f33708 100755
--- a/test/T680-html-indexing.sh
+++ b/test/T680-html-indexing.sh
@@ -5,10 +5,13 @@ test_description="indexing of html parts"
 add_email_corpus html
 
 test_begin_subtest 'embedded images should not be indexed'
-test_subtest_known_broken
 notmuch search kwpza7svrgjzqwi8fhb2msggwtxtwgqcxp4wbqr4wjddstqmeqa7 > OUTPUT
 test_expect_equal_file /dev/null OUTPUT
 
+test_begin_subtest 'ignore > in attribute text'
+notmuch search swordfish | notmuch_search_sanitize > OUTPUT
+test_expect_equal_file /dev/null OUTPUT
+
 test_begin_subtest 'non tag text should be indexed'
 notmuch search hunter2 | notmuch_search_sanitize > OUTPUT
 cat <<EOF > EXPECTED
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: Drop HTML tags when indexing
  2017-03-22 11:22 Drop HTML tags when indexing David Bremner
                   ` (6 preceding siblings ...)
  2017-03-22 11:23 ` [PATCH 7/7] lib/index: add simple html filter David Bremner
@ 2017-03-22 13:20 ` Daniel Lublin (quite)
  2017-03-25 12:59 ` David Bremner
  8 siblings, 0 replies; 12+ messages in thread
From: Daniel Lublin (quite) @ 2017-03-22 13:20 UTC (permalink / raw)
  To: notmuch

This patch is good. notmuch now gets through my whole archive of 175k mails,
memory usage peaking at 430M.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Drop HTML tags when indexing
  2017-03-22 11:22 Drop HTML tags when indexing David Bremner
                   ` (7 preceding siblings ...)
  2017-03-22 13:20 ` Drop HTML tags when indexing Daniel Lublin (quite)
@ 2017-03-25 12:59 ` David Bremner
  8 siblings, 0 replies; 12+ messages in thread
From: David Bremner @ 2017-03-25 12:59 UTC (permalink / raw)
  To: notmuch

David Bremner <david@tethera.net> writes:

> Steven Allen pointed out [2] that the previous scanner [1] was a
> little too simplistic. This version handles (or claims to) quoted
> strings in attributes, which can apparently contain '>'and '<'
> characters. This required generalizing the state machine runner a bit
> [3] to handle states with out-degree more than two.

For what it is worth, this series shrunk my index by about the same
amount as skipping html messages entirely: I have about 15% messages
with html parts, and this series made the index about 15% smaller.

d

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/7] test: add known broken test for indexing html
  2017-03-22 11:23 ` [PATCH 1/7] test: add known broken test for indexing html David Bremner
@ 2017-04-20 10:05   ` David Bremner
  2017-07-16 12:44   ` David Bremner
  1 sibling, 0 replies; 12+ messages in thread
From: David Bremner @ 2017-04-20 10:05 UTC (permalink / raw)
  To: notmuch

David Bremner <david@tethera.net> writes:

> 'quite' on IRC reported that notmuch new was grinding to a halt during
> initial indexing, and we eventually narrowed the problem down to some
> html parts with large embedded images. These cause the number of terms
> added to the Xapian database to explode (the first 400 messages
> generated 4.6M unique terms), and of course the resulting terms are
> not much use for searching.
>
> The second test is sanity check for any "improved" indexing of HTML.

pushed the first patch in the series to master.

d

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/7] test: add known broken test for indexing html
  2017-03-22 11:23 ` [PATCH 1/7] test: add known broken test for indexing html David Bremner
  2017-04-20 10:05   ` David Bremner
@ 2017-07-16 12:44   ` David Bremner
  1 sibling, 0 replies; 12+ messages in thread
From: David Bremner @ 2017-07-16 12:44 UTC (permalink / raw)
  To: notmuch

David Bremner <david@tethera.net> writes:

> 'quite' on IRC reported that notmuch new was grinding to a halt during
> initial indexing, and we eventually narrowed the problem down to some
> html parts with large embedded images. These cause the number of terms
> added to the Xapian database to explode (the first 400 messages
> generated 4.6M unique terms), and of course the resulting terms are
> not much use for searching.

this bug is fixed in master / release

d

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2017-07-16 12:44 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-22 11:22 Drop HTML tags when indexing David Bremner
2017-03-22 11:23 ` [PATCH 1/7] test: add known broken test for indexing html David Bremner
2017-04-20 10:05   ` David Bremner
2017-07-16 12:44   ` David Bremner
2017-03-22 11:23 ` [PATCH 2/7] lib: add content type argument to uuencode filter David Bremner
2017-03-22 11:23 ` [PATCH 3/7] lib/index: Add another layer of indirection in filtering David Bremner
2017-03-22 11:23 ` [PATCH 4/7] lib/index: separate state table definition from scanner David Bremner
2017-03-22 11:23 ` [PATCH 5/7] lib/index: generalize filter name David Bremner
2017-03-22 11:23 ` [PATCH 6/7] lib/index.cc: generalize filter state machine David Bremner
2017-03-22 11:23 ` [PATCH 7/7] lib/index: add simple html filter David Bremner
2017-03-22 13:20 ` Drop HTML tags when indexing Daniel Lublin (quite)
2017-03-25 12:59 ` David Bremner

Code repositories for project(s) associated with this public inbox

	https://yhetil.org/notmuch.git/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).