RFC: drop html tags

unofficial mirror of notmuch@notmuchmail.org

* RFC: drop html tags
@ 2017-03-21 13:15 David Bremner
  2017-03-21 13:15 ` [rfc patch 1/6] test: add known broken test for indexing html David Bremner
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: David Bremner @ 2017-03-21 13:15 UTC (permalink / raw)
  To: notmuch

Although HTML as a whole is not a regular language (and the latest
incarnations are probably not anything sane), well-formed tags should
be regular, as far as I know.  Here is a simple fix for the problem of
giant embedded images in HTML: drop all tags.  One caveat: unbalanced
< > could cause an HTML part not to be indexed.
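
To make the idea concrete, here is a minimal standalone sketch of
"drop all tags", assuming a plain NUL-terminated buffer rather than a
GMime stream.  The name strip_tags is hypothetical and not part of
notmuch; unlike the table-driven scanner in the patches, this
simplified version also drops the opening '<'.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of notmuch: copy `in` to `out`,
 * dropping everything from a '<' up to and including the next '>'.
 * `out` must have room for strlen (in) + 1 bytes.  Returns the number
 * of bytes written, not counting the terminating NUL. */
static size_t
strip_tags (const char *in, char *out)
{
    int in_tag = 0;
    size_t n = 0;

    for (; *in; in++) {
        if (in_tag) {
            /* Inside a tag: discard until the closing '>'. */
            if (*in == '>')
                in_tag = 0;
        } else if (*in == '<') {
            in_tag = 1;
        } else {
            out[n++] = *in;
        }
    }
    out[n] = '\0';
    return n;
}
```

An embedded image lives inside the attributes of an <img ...> tag, so
its base64 payload is discarded along with the tag.  The flip side is
that a stray '<' with no matching '>' makes the sketch discard
everything that follows it.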

If the general approach seems sensible, then it can probably be tidied
up a bit, e.g. by storing a state table in the filter struct, rather
than creating a function that defines the appropriate state table and
jumping through a function pointer.  On the other hand, in principle
the function pointer approach is more flexible, as it does not insist
that all scanners are automata-based.  I originally wanted to try a
real HTML parser, but I couldn't see how to get the one I looked at
(gumbo) working easily in "stream" mode.
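
For comparison, the "state table in the filter struct" variant might
look roughly like the sketch below.  It is GMime-free and only
illustrative: sketch_filter_t, sketch_filter_init_html and
sketch_filter_run are hypothetical names, while the scanner_state_t
layout and the two-row HTML table are taken from patches 4/6 and 6/6.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Same layout as scanner_state_t in patch 4/6. */
typedef struct {
    int state;
    int a;
    int b;
    int next_if_match;
    int next_if_not_match;
} scanner_state_t;

/* Hypothetical stand-in for the filter struct: it carries the table
 * itself instead of a pointer to a function that defines the table. */
typedef struct {
    const scanner_state_t *states;
    int first_skipping_state;
    int state;
} sketch_filter_t;

/* The HTML table from patch 6/6. */
static const scanner_state_t sketch_html_states[] = {
    {0, '<', '<', 1, 0},
    {1, '>', '>', 0, 1},
};

static void
sketch_filter_init_html (sketch_filter_t *filter)
{
    filter->states = sketch_html_states;
    filter->first_skipping_state = 1;
    filter->state = 0;
}

/* Scan one chunk; `out` must have room for `len` bytes.  The state
 * lives in the struct, so a tag may be split across two calls.
 * Returns the number of bytes written to `out`. */
static size_t
sketch_filter_run (sketch_filter_t *filter, const char *in, size_t len,
                   char *out)
{
    size_t i, n = 0;

    for (i = 0; i < len; i++) {
        const scanner_state_t *s = &filter->states[filter->state];
        int next = (in[i] >= s->a && in[i] <= s->b) ? s->next_if_match
                                                    : s->next_if_not_match;

        /* As in the patch, the decision uses the state before the
         * transition, so the '<' that opens a tag is still copied. */
        if (filter->state < filter->first_skipping_state)
            out[n++] = in[i];
        filter->state = next;
    }
    return n;
}
```

Because the scanner state is kept between calls, this works in
"stream" mode: feeding "a<im" and then "g>b" yields "a<" followed by
"b".  The cost is the one mentioned above: every scanner has to be
expressible as such a table.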


* [rfc patch 1/6] test: add known broken test for indexing html
  2017-03-21 13:15 RFC: drop html tags David Bremner
@ 2017-03-21 13:15 ` David Bremner
  2017-03-21 13:15 ` [rfc patch 2/6] lib: add content type argument to uuencode filter David Bremner
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: David Bremner @ 2017-03-21 13:15 UTC (permalink / raw)
  To: notmuch

'quite' on IRC reported that notmuch new was grinding to a halt during
initial indexing, and we eventually narrowed the problem down to some
html parts with large embedded images. These cause the number of terms
added to the Xapian database to explode (the first 400 messages
generated 4.6M unique terms), and of course the resulting terms are
not much use for searching.
---
 test/T680-html-indexing.sh       | 12 +++++++
 test/corpora/README              |  3 ++
 test/corpora/html/embedded-image | 69 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 84 insertions(+)
 create mode 100755 test/T680-html-indexing.sh
 create mode 100644 test/corpora/html/embedded-image

diff --git a/test/T680-html-indexing.sh b/test/T680-html-indexing.sh
new file mode 100755
index 00000000..78768c4f
--- /dev/null
+++ b/test/T680-html-indexing.sh
@@ -0,0 +1,12 @@
+#!/usr/bin/env bash
+test_description="indexing of html parts"
+. ./test-lib.sh || exit 1
+
+add_email_corpus html
+
+test_begin_subtest 'embedded images should not be indexed'
+test_subtest_known_broken
+notmuch search kwpza7svrgjzqwi8fhb2msggwtxtwgqcxp4wbqr4wjddstqmeqa7 > OUTPUT
+test_expect_equal_file /dev/null OUTPUT
+
+test_done
diff --git a/test/corpora/README b/test/corpora/README
index 77c48e6e..c9a35fed 100644
--- a/test/corpora/README
+++ b/test/corpora/README
@@ -9,3 +9,6 @@ default
 broken
   The broken corpus contains messages that are broken and/or RFC
   non-compliant, ensuring we deal with them in a sane way.
+
+html
+  The html corpus contains html parts
diff --git a/test/corpora/html/embedded-image b/test/corpora/html/embedded-image
new file mode 100644
index 00000000..40851530
--- /dev/null
+++ b/test/corpora/html/embedded-image
@@ -0,0 +1,69 @@
+From: =?utf-8?b?bWFsbW9ib3Jn?= <daemon@lublin.se>
+To: =?utf-8?b?Ym9lbmRlLm1hbG1vYm9yZw==?= <daemon@lublin.se>
+Date: Tue, 19 Jul 2016 11:54:24 +0200
+X-Feed2Imap-Version: 1.2.5
+Message-Id: <boendemalmoborg-1834@eltanin.uberspace.de>
+Subject: =?utf-8?b?VGFjayBhbGxhIHRyYWZpa2FudGVyIG9jaCBmb3Rnw6RuZ2FyZSE=?=
+Content-Type: multipart/alternative; boundary="=-1468922508-176605-12427-9500-21-="
+MIME-Version: 1.0
+
+
+--=-1468922508-176605-12427-9500-21-=
+Content-Type: text/plain; charset=utf-8; format=flowed
+Content-Transfer-Encoding: 8bit
+
+<http://malmoborg.se/2016/07/tack-alla-trafikanter-och-fotgangare/>
+
+Malmö 2016-07-09
+
+I skrivande stund är vi i färd med att avetablera vår entreprenad på 
+Tigern 3, Regementsgatan 6 i Malmö. Fastigheten har genomgått ett större 
+dräneringsarbete som i sin tur har inneburit vissa 
+trafikbegränsningar på Regementsgatan samt Davidshallsgatan under några 
+veckors tid. Fastighetsägaren är mycket nöjd med vår arbetsinsats och vi 
+kan glatt meddela att båda vägfilerna kommer att öppnas inom kort. Nu 
+kommer den vackra fastigheten att klara sig torrskodd under många år 
+framöver [A]
+
+ 
+
+[A] http://malmoborg.se/wp-includes/images/smilies/icon_smile.gif
+-- 
+Feed: Förvaltnings AB Malmöborg
+<http://malmoborg.se>
+Item: Tack alla trafikanter och fotgängare!
+<http://malmoborg.se/2016/07/tack-alla-trafikanter-och-fotgangare/>
+Date: 2016-07-19 11:54:24 +0200
+Author: malmoborg
+Filed under: Nyheter
+
+--=-1468922508-176605-12427-9500-21-=
+Content-Type: text/html; charset=utf-8
+Content-Transfer-Encoding: 8bit
+
+<table border="1" width="100%" cellpadding="0" cellspacing="0" borderspacing="0"><tr><td>
+<table width="100%" bgcolor="#EDEDED" cellpadding="4" cellspacing="2">
+<tr><td align="right"><b>Feed:</b></td>
+<td width="100%"><a href="http://malmoborg.se">
+<b>Förvaltnings AB Malmöborg</b>
+</a>
+</td></tr><tr><td align="right"><b>Item:</b></td>
+<td width="100%"><a href="http://malmoborg.se/2016/07/tack-alla-trafikanter-och-fotgangare/"><b>Tack alla trafikanter och fotgängare!</b>
+</a>
+</td></tr></table></td></tr></table>
+
+<p>Malmö 2016-07-09</p>
+<p>I skrivande stund är vi i färd med att avetablera vår entreprenad på Tigern 3, Regementsgatan 6 i Malmö. Fastigheten har genomgått ett större dräneringsarbete som i sin tur har inneburit vissa trafikbegränsningar på Regementsgatan samt Davidshallsgatan under några veckors tid. Fastighetsägaren är mycket nöjd med vår arbetsinsats och vi kan glatt meddela att båda vägfilerna kommer att öppnas inom kort. Nu kommer den vackra fastigheten att klara sig torrskodd under många år framöver <img src="data:image/gif;base64,R0lGODlhDwAPALMOAP/qAEVFRQAAAP/OAP/JAP+0AP6dAP/+k//9E///////
+xzMzM///6//lAAAAAAAAACH5BAEAAA4ALAAAAAAPAA8AAARb0EkZap3YVabO
+GRcWcAgCnIMRTEEnCCfwpqt2mHEOagoOnz+CKnADxoKFyiHHBBCSAdOiCVg8
+KwPZa7sVrgJZQWI8FhB2msGgwTXTWGqCXP4WBQr4wjDDstQmEQA7
+" alt=":-)" class="wp-smiley" /> </p>
+<p>&nbsp;</p>
+<hr width="100%"/>
+<table width="100%" cellpadding="0" cellspacing="0">
+<tr><td align="right"><font color="#ababab">Date:</font>&nbsp;&nbsp;</td><td><font color="#ababab">2016-07-19 11:54:24 +0200</font></td></tr>
+<tr><td align="right"><font color="#ababab">Author:</font>&nbsp;&nbsp;</td><td><font color="#ababab">malmoborg</font></td></tr>
+<tr><td align="right"><font color="#ababab">Filed under:</font>&nbsp;&nbsp;</td><td><font color="#ababab">Nyheter</font></td></tr>
+</table>
+
+--=-1468922508-176605-12427-9500-21-=--
-- 
2.11.0


* [rfc patch 2/6] lib: add content type argument to uuencode filter.
  2017-03-21 13:15 RFC: drop html tags David Bremner
  2017-03-21 13:15 ` [rfc patch 1/6] test: add known broken test for indexing html David Bremner
@ 2017-03-21 13:15 ` David Bremner
  2017-03-21 13:15 ` [rfc patch 3/6] lib/index: Add another layer of indirection in filtering David Bremner
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: David Bremner @ 2017-03-21 13:15 UTC (permalink / raw)
  To: notmuch

The idea is to support more general types of filtering, based on
content type.
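
As a rough illustration of where this is heading, a filter could be
selected from the content type along these lines.  This is a
GMime-free sketch: the enum and both function names are hypothetical,
and the comparison is case-insensitive because MIME type and subtype
names are.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical tags for the two kinds of filtering in this series. */
typedef enum {
    SKETCH_FILTER_UUENCODE,
    SKETCH_FILTER_HTML
} sketch_filter_kind_t;

/* Case-insensitive comparison of two ASCII strings. */
static int
ascii_equal_nocase (const char *a, const char *b)
{
    while (*a && *b) {
        if (tolower ((unsigned char) *a) != tolower ((unsigned char) *b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;
}

/* Pick the HTML filter for text/html and fall back to the uuencode
 * filter for everything else. */
static sketch_filter_kind_t
sketch_pick_filter (const char *type, const char *subtype)
{
    if (ascii_equal_nocase (type, "text") &&
        ascii_equal_nocase (subtype, "html"))
        return SKETCH_FILTER_HTML;
    return SKETCH_FILTER_UUENCODE;
}
```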
---
 lib/index.cc | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/lib/index.cc b/lib/index.cc
index 8c145540..1c04cc3d 100644
--- a/lib/index.cc
+++ b/lib/index.cc
@@ -56,6 +56,7 @@ typedef struct _NotmuchFilterDiscardUuencodeClass NotmuchFilterDiscardUuencodeCl
  **/
 struct _NotmuchFilterDiscardUuencode {
     GMimeFilter parent_object;
+    GMimeContentType *content_type;
     int state;
 };
 
@@ -63,7 +64,7 @@ struct _NotmuchFilterDiscardUuencodeClass {
     GMimeFilterClass parent_class;
 };
 
-static GMimeFilter *notmuch_filter_discard_uuencode_new (void);
+static GMimeFilter *notmuch_filter_discard_uuencode_new (GMimeContentType *content);
 
 static void notmuch_filter_discard_uuencode_finalize (GObject *object);
 
@@ -102,8 +103,9 @@ notmuch_filter_discard_uuencode_finalize (GObject *object)
 static GMimeFilter *
 filter_copy (GMimeFilter *gmime_filter)
 {
-    (void) gmime_filter;
-    return notmuch_filter_discard_uuencode_new ();
+    NotmuchFilterDiscardUuencode *filter = (NotmuchFilterDiscardUuencode *) gmime_filter;
+
+    return notmuch_filter_discard_uuencode_new (filter->content_type);
 }
 
 static void
@@ -196,7 +198,7 @@ filter_reset (GMimeFilter *gmime_filter)
  * Returns: a new #NotmuchFilterDiscardUuencode filter.
  **/
 static GMimeFilter *
-notmuch_filter_discard_uuencode_new (void)
+notmuch_filter_discard_uuencode_new (GMimeContentType *content_type)
 {
     static GType type = 0;
     NotmuchFilterDiscardUuencode *filter;
@@ -220,6 +222,7 @@ notmuch_filter_discard_uuencode_new (void)
 
     filter = (NotmuchFilterDiscardUuencode *) g_object_newv (type, 0, NULL);
     filter->state = 0;
+    filter->content_type = content_type;
 
     return (GMimeFilter *) filter;
 }
@@ -396,7 +399,7 @@ _index_mime_part (notmuch_message_t *message,
     g_mime_stream_mem_set_owner (GMIME_STREAM_MEM (stream), FALSE);
 
     filter = g_mime_stream_filter_new (stream);
-    discard_uuencode_filter = notmuch_filter_discard_uuencode_new ();
+    discard_uuencode_filter = notmuch_filter_discard_uuencode_new (content_type);
 
     g_mime_stream_filter_add (GMIME_STREAM_FILTER (filter),
 			      discard_uuencode_filter);
-- 
2.11.0


* [rfc patch 3/6] lib/index: Add another layer of indirection in filtering
  2017-03-21 13:15 RFC: drop html tags David Bremner
  2017-03-21 13:15 ` [rfc patch 1/6] test: add known broken test for indexing html David Bremner
  2017-03-21 13:15 ` [rfc patch 2/6] lib: add content type argument to uuencode filter David Bremner
@ 2017-03-21 13:15 ` David Bremner
  2017-03-21 13:15 ` [rfc patch 4/6] lib/index: separate state table definition from scanner David Bremner
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: David Bremner @ 2017-03-21 13:15 UTC (permalink / raw)
  To: notmuch

We could add a second gmime filter subclass, but prefer to avoid
duplicating the boilerplate.
---
 lib/index.cc | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/lib/index.cc b/lib/index.cc
index 1c04cc3d..74a750b9 100644
--- a/lib/index.cc
+++ b/lib/index.cc
@@ -29,6 +29,8 @@
 typedef struct _NotmuchFilterDiscardUuencode NotmuchFilterDiscardUuencode;
 typedef struct _NotmuchFilterDiscardUuencodeClass NotmuchFilterDiscardUuencodeClass;
 
+typedef void (*filter_fun) (GMimeFilter *filter, char *in, size_t len, size_t prespace,
+			    char **out, size_t *outlen, size_t *outprespace);
 /**
  * NotmuchFilterDiscardUuencode:
  *
@@ -57,6 +59,7 @@ typedef struct _NotmuchFilterDiscardUuencodeClass NotmuchFilterDiscardUuencodeCl
 struct _NotmuchFilterDiscardUuencode {
     GMimeFilter parent_object;
     GMimeContentType *content_type;
+    filter_fun real_filter;
     int state;
 };
 
@@ -110,7 +113,14 @@ filter_copy (GMimeFilter *gmime_filter)
 
 static void
 filter_filter (GMimeFilter *gmime_filter, char *inbuf, size_t inlen, size_t prespace,
-	       char **outbuf, size_t *outlen, size_t *outprespace)
+	       char **outbuf, size_t *outlen, size_t *outprespace) {
+    NotmuchFilterDiscardUuencode *filter = (NotmuchFilterDiscardUuencode *) gmime_filter;
+    (*filter->real_filter)(gmime_filter, inbuf, inlen, prespace, outbuf, outlen, outprespace);
+}
+
+static void
+filter_filter_uuencode (GMimeFilter *gmime_filter, char *inbuf, size_t inlen, size_t prespace,
+			char **outbuf, size_t *outlen, size_t *outprespace)
 {
     NotmuchFilterDiscardUuencode *filter = (NotmuchFilterDiscardUuencode *) gmime_filter;
     register const char *inptr = inbuf;
@@ -223,7 +233,7 @@ notmuch_filter_discard_uuencode_new (GMimeContentType *content_type)
     filter = (NotmuchFilterDiscardUuencode *) g_object_newv (type, 0, NULL);
     filter->state = 0;
     filter->content_type = content_type;
-
+    filter->real_filter = filter_filter_uuencode;
     return (GMimeFilter *) filter;
 }
 
-- 
2.11.0


* [rfc patch 4/6] lib/index: separate state table definition from scanner.
  2017-03-21 13:15 RFC: drop html tags David Bremner
                   ` (2 preceding siblings ...)
  2017-03-21 13:15 ` [rfc patch 3/6] lib/index: Add another layer of indirection in filtering David Bremner
@ 2017-03-21 13:15 ` David Bremner
  2017-03-21 13:15 ` [rfc patch 5/6] lib/index: generalize filter name David Bremner
  2017-03-21 13:15 ` [rfc patch 6/6] lib/index: add simple html filter David Bremner
  5 siblings, 0 replies; 7+ messages in thread
From: David Bremner @ 2017-03-21 13:15 UTC (permalink / raw)
  To: notmuch

We want to reuse the scanner definition with a different table.
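
The shape of the refactoring, sketched without GMime: one scanner loop
that takes the table and the first skipping state as arguments, plus
thin wrappers that supply a table.  The names sketch_scan,
sketch_scan_uuencode and sketch_scan_html are hypothetical; the
uuencode table is the existing one, and the two-row HTML table is the
one added later in this series.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    int state;
    int a;
    int b;
    int next_if_match;
    int next_if_not_match;
} scanner_state_t;

/* Hypothetical analogue of do_filter: if the character is within
 * [a, b] for the current state, move to next_if_match, otherwise to
 * next_if_not_match.  Characters seen while in a state below
 * first_skipping_state are copied to `out` (room for `len` bytes). */
static size_t
sketch_scan (const scanner_state_t states[], int first_skipping_state,
             int *state, const char *in, size_t len, char *out)
{
    size_t i, n = 0;

    for (i = 0; i < len; i++) {
        const scanner_state_t *s = &states[*state];
        int next = (in[i] >= s->a && in[i] <= s->b) ? s->next_if_match
                                                    : s->next_if_not_match;

        if (*state < first_skipping_state)
            out[n++] = in[i];
        *state = next;
    }
    return n;
}

static size_t
sketch_scan_uuencode (int *state, const char *in, size_t len, char *out)
{
    static const scanner_state_t states[] = {
        {0,  'b',  'b',  1,  0},
        {1,  'e',  'e',  2,  0},
        {2,  'g',  'g',  3,  0},
        {3,  'i',  'i',  4,  0},
        {4,  'n',  'n',  5,  0},
        {5,  ' ',  ' ',  6,  0},
        {6,  '0',  '7',  7,  0},
        {7,  '0',  '7',  8,  0},
        {8,  '0',  '7',  9,  0},
        {9,  ' ',  ' ',  10, 0},
        {10, '\n', '\n', 11, 10},
        {11, 'M',  'M',  12, 0},
        {12, ' ',  '`',  12, 11}
    };

    return sketch_scan (states, 11, state, in, len, out);
}

static size_t
sketch_scan_html (int *state, const char *in, size_t len, char *out)
{
    static const scanner_state_t states[] = {
        {0,  '<',  '<',  1,  0},
        {1,  '>',  '>',  0,  1},
    };

    return sketch_scan (states, 1, state, in, len, out);
}
```

Tracing the uuencode table by hand on "begin 644 f", two lines starting
with M, and a final "end" line: the M lines are dropped, and the first
character after them is consumed by state 11 as well, so the output is
"begin 644 f\nnd\n".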
---
 lib/index.cc | 81 +++++++++++++++++++++++++++++++++++-------------------------
 1 file changed, 47 insertions(+), 34 deletions(-)

diff --git a/lib/index.cc b/lib/index.cc
index 74a750b9..02b35b81 100644
--- a/lib/index.cc
+++ b/lib/index.cc
@@ -31,6 +31,15 @@ typedef struct _NotmuchFilterDiscardUuencodeClass NotmuchFilterDiscardUuencodeCl
 
 typedef void (*filter_fun) (GMimeFilter *filter, char *in, size_t len, size_t prespace,
 			    char **out, size_t *outlen, size_t *outprespace);
+
+typedef struct {
+    int state;
+    int a;
+    int b;
+    int next_if_match;
+    int next_if_not_match;
+} scanner_state_t;
+
 /**
  * NotmuchFilterDiscardUuencode:
  *
@@ -119,46 +128,18 @@ filter_filter (GMimeFilter *gmime_filter, char *inbuf, size_t inlen, size_t pres
 }
 
 static void
-filter_filter_uuencode (GMimeFilter *gmime_filter, char *inbuf, size_t inlen, size_t prespace,
-			char **outbuf, size_t *outlen, size_t *outprespace)
+do_filter (const scanner_state_t states[],
+	   int first_skipping_state,
+	   GMimeFilter *gmime_filter, char *inbuf, size_t inlen, size_t prespace,
+	   char **outbuf, size_t *outlen, size_t *outprespace)
 {
     NotmuchFilterDiscardUuencode *filter = (NotmuchFilterDiscardUuencode *) gmime_filter;
     register const char *inptr = inbuf;
     const char *inend = inbuf + inlen;
     char *outptr;
-
+    int next;
     (void) prespace;
 
-    /* Simple, linear state-transition diagram for our filter.
-     *
-     * If the character being processed is within the range of [a, b]
-     * for the current state then we transition next_if_match
-     * state. If not, we transition to the next_if_not_match state.
-     *
-     * The final two states are special in that they are the states in
-     * which we discard data. */
-    static const struct {
-	int state;
-	int a;
-	int b;
-	int next_if_match;
-	int next_if_not_match;
-    } states[] = {
-	{0,  'b',  'b',  1,  0},
-	{1,  'e',  'e',  2,  0},
-	{2,  'g',  'g',  3,  0},
-	{3,  'i',  'i',  4,  0},
-	{4,  'n',  'n',  5,  0},
-	{5,  ' ',  ' ',  6,  0},
-	{6,  '0',  '7',  7,  0},
-	{7,  '0',  '7',  8,  0},
-	{8,  '0',  '7',  9,  0},
-	{9,  ' ',  ' ',  10, 0},
-	{10, '\n', '\n', 11, 10},
-	{11, 'M',  'M',  12, 0},
-	{12, ' ',  '`',  12, 11}
-    };
-    int next;
 
     g_mime_filter_set_size (gmime_filter, inlen, FALSE);
     outptr = gmime_filter->outbuf;
@@ -174,7 +155,7 @@ filter_filter_uuencode (GMimeFilter *gmime_filter, char *inbuf, size_t inlen, si
 	    next = states[filter->state].next_if_not_match;
 	}
 
-	if (filter->state < 11)
+	if (filter->state < first_skipping_state)
 	    *outptr++ = *inptr;
 
 	filter->state = next;
@@ -187,6 +168,38 @@ filter_filter_uuencode (GMimeFilter *gmime_filter, char *inbuf, size_t inlen, si
 }
 
 static void
+filter_filter_uuencode (GMimeFilter *gmime_filter, char *inbuf, size_t inlen, size_t prespace,
+			char **outbuf, size_t *outlen, size_t *outprespace)
+{
+    /* Simple, linear state-transition diagram for our filter.
+     *
+     * If the character being processed is within the range of [a, b]
+     * for the current state then we transition next_if_match
+     * state. If not, we transition to the next_if_not_match state.
+     *
+     * The final two states are special in that they are the states in
+     * which we discard data. */
+    static const scanner_state_t states[] = {
+	{0,  'b',  'b',  1,  0},
+	{1,  'e',  'e',  2,  0},
+	{2,  'g',  'g',  3,  0},
+	{3,  'i',  'i',  4,  0},
+	{4,  'n',  'n',  5,  0},
+	{5,  ' ',  ' ',  6,  0},
+	{6,  '0',  '7',  7,  0},
+	{7,  '0',  '7',  8,  0},
+	{8,  '0',  '7',  9,  0},
+	{9,  ' ',  ' ',  10, 0},
+	{10, '\n', '\n', 11, 10},
+	{11, 'M',  'M',  12, 0},
+	{12, ' ',  '`',  12, 11}
+    };
+
+    do_filter(states, 11,
+	      gmime_filter, inbuf, inlen, prespace, outbuf, outlen, outprespace);
+}
+
+static void
 filter_complete (GMimeFilter *filter, char *inbuf, size_t inlen, size_t prespace,
 		 char **outbuf, size_t *outlen, size_t *outprespace)
 {
-- 
2.11.0


* [rfc patch 5/6] lib/index: generalize filter name
  2017-03-21 13:15 RFC: drop html tags David Bremner
                   ` (3 preceding siblings ...)
  2017-03-21 13:15 ` [rfc patch 4/6] lib/index: separate state table definition from scanner David Bremner
@ 2017-03-21 13:15 ` David Bremner
  2017-03-21 13:15 ` [rfc patch 6/6] lib/index: add simple html filter David Bremner
  5 siblings, 0 replies; 7+ messages in thread
From: David Bremner @ 2017-03-21 13:15 UTC (permalink / raw)
  To: notmuch

We can't very well call it uuencode if it is going to filter other
things as well.
---
 lib/index.cc | 92 +++++++++++++++++++++++++++++++-----------------------------
 1 file changed, 48 insertions(+), 44 deletions(-)

diff --git a/lib/index.cc b/lib/index.cc
index 02b35b81..3bb1ac1c 100644
--- a/lib/index.cc
+++ b/lib/index.cc
@@ -26,8 +26,8 @@
 
 /* Oh, how I wish that gobject didn't require so much noisy boilerplate!
  * (Though I have at least eliminated some of the stock set...) */
-typedef struct _NotmuchFilterDiscardUuencode NotmuchFilterDiscardUuencode;
-typedef struct _NotmuchFilterDiscardUuencodeClass NotmuchFilterDiscardUuencodeClass;
+typedef struct _NotmuchFilterDiscardNonTerms NotmuchFilterDiscardNonTerms;
+typedef struct _NotmuchFilterDiscardNonTermsClass NotmuchFilterDiscardNonTermsClass;
 
 typedef void (*filter_fun) (GMimeFilter *filter, char *in, size_t len, size_t prespace,
 			    char **out, size_t *outlen, size_t *outprespace);
@@ -41,44 +41,29 @@ typedef struct {
 } scanner_state_t;
 
 /**
- * NotmuchFilterDiscardUuencode:
+ * NotmuchFilterDiscardNonTerms:
  *
  * @parent_object: parent #GMimeFilter
  * @encode: encoding vs decoding
  * @state: State of the parser
  *
- * A filter to discard uuencoded portions of an email.
- *
- * A uuencoded portion is identified as beginning with a line
- * matching:
- *
- *	begin [0-7][0-7][0-7] .*
- *
- * After that detection, and beginning with the following line,
- * characters will be discarded as long as the first character of each
- * line begins with M and subsequent characters on the line are within
- * the range of ASCII characters from ' ' to '`'.
- *
- * This is not a perfect UUencode filter. It's possible to have a
- * message that will legitimately match that pattern, (so that some
- * legitimate content is discarded). And for most UUencoded files, the
- * final line of encoded data (the line not starting with M) will be
- * indexed.
+ * A filter to discard non terms portions of an email, i.e. stuff not
+ * worth indexing.
  **/
-struct _NotmuchFilterDiscardUuencode {
+struct _NotmuchFilterDiscardNonTerms {
     GMimeFilter parent_object;
     GMimeContentType *content_type;
     filter_fun real_filter;
     int state;
 };
 
-struct _NotmuchFilterDiscardUuencodeClass {
+struct _NotmuchFilterDiscardNonTermsClass {
     GMimeFilterClass parent_class;
 };
 
-static GMimeFilter *notmuch_filter_discard_uuencode_new (GMimeContentType *content);
+static GMimeFilter *notmuch_filter_discard_non_terms_new (GMimeContentType *content);
 
-static void notmuch_filter_discard_uuencode_finalize (GObject *object);
+static void notmuch_filter_discard_non_terms_finalize (GObject *object);
 
 static GMimeFilter *filter_copy (GMimeFilter *filter);
 static void filter_filter (GMimeFilter *filter, char *in, size_t len, size_t prespace,
@@ -91,14 +76,14 @@ static void filter_reset (GMimeFilter *filter);
 static GMimeFilterClass *parent_class = NULL;
 
 static void
-notmuch_filter_discard_uuencode_class_init (NotmuchFilterDiscardUuencodeClass *klass)
+notmuch_filter_discard_non_terms_class_init (NotmuchFilterDiscardNonTermsClass *klass)
 {
     GObjectClass *object_class = G_OBJECT_CLASS (klass);
     GMimeFilterClass *filter_class = GMIME_FILTER_CLASS (klass);
 
     parent_class = (GMimeFilterClass *) g_type_class_ref (GMIME_TYPE_FILTER);
 
-    object_class->finalize = notmuch_filter_discard_uuencode_finalize;
+    object_class->finalize = notmuch_filter_discard_non_terms_finalize;
 
     filter_class->copy = filter_copy;
     filter_class->filter = filter_filter;
@@ -107,7 +92,7 @@ notmuch_filter_discard_uuencode_class_init (NotmuchFilterDiscardUuencodeClass *k
 }
 
 static void
-notmuch_filter_discard_uuencode_finalize (GObject *object)
+notmuch_filter_discard_non_terms_finalize (GObject *object)
 {
     G_OBJECT_CLASS (parent_class)->finalize (object);
 }
@@ -115,15 +100,15 @@ notmuch_filter_discard_uuencode_finalize (GObject *object)
 static GMimeFilter *
 filter_copy (GMimeFilter *gmime_filter)
 {
-    NotmuchFilterDiscardUuencode *filter = (NotmuchFilterDiscardUuencode *) gmime_filter;
+    NotmuchFilterDiscardNonTerms *filter = (NotmuchFilterDiscardNonTerms *) gmime_filter;
 
-    return notmuch_filter_discard_uuencode_new (filter->content_type);
+    return notmuch_filter_discard_non_terms_new (filter->content_type);
 }
 
 static void
 filter_filter (GMimeFilter *gmime_filter, char *inbuf, size_t inlen, size_t prespace,
 	       char **outbuf, size_t *outlen, size_t *outprespace) {
-    NotmuchFilterDiscardUuencode *filter = (NotmuchFilterDiscardUuencode *) gmime_filter;
+    NotmuchFilterDiscardNonTerms *filter = (NotmuchFilterDiscardNonTerms *) gmime_filter;
     (*filter->real_filter)(gmime_filter, inbuf, inlen, prespace, outbuf, outlen, outprespace);
 }
 
@@ -133,7 +118,7 @@ do_filter (const scanner_state_t states[],
 	   GMimeFilter *gmime_filter, char *inbuf, size_t inlen, size_t prespace,
 	   char **outbuf, size_t *outlen, size_t *outprespace)
 {
-    NotmuchFilterDiscardUuencode *filter = (NotmuchFilterDiscardUuencode *) gmime_filter;
+    NotmuchFilterDiscardNonTerms *filter = (NotmuchFilterDiscardNonTerms *) gmime_filter;
     register const char *inptr = inbuf;
     const char *inend = inbuf + inlen;
     char *outptr;
@@ -167,6 +152,25 @@ do_filter (const scanner_state_t states[],
     *outbuf = gmime_filter->outbuf;
 }
 
+/*
+ *
+ * A uuencoded portion is identified as beginning with a line
+ * matching:
+ *
+ *	begin [0-7][0-7][0-7] .*
+ *
+ * After that detection, and beginning with the following line,
+ * characters will be discarded as long as the first character of each
+ * line begins with M and subsequent characters on the line are within
+ * the range of ASCII characters from ' ' to '`'.
+ *
+ * This is not a perfect UUencode filter. It's possible to have a
+ * message that will legitimately match that pattern, (so that some
+ * legitimate content is discarded). And for most UUencoded files, the
+ * final line of encoded data (the line not starting with M) will be
+ * indexed.
+ */
+
 static void
 filter_filter_uuencode (GMimeFilter *gmime_filter, char *inbuf, size_t inlen, size_t prespace,
 			char **outbuf, size_t *outlen, size_t *outprespace)
@@ -210,7 +214,7 @@ filter_complete (GMimeFilter *filter, char *inbuf, size_t inlen, size_t prespace
 static void
 filter_reset (GMimeFilter *gmime_filter)
 {
-    NotmuchFilterDiscardUuencode *filter = (NotmuchFilterDiscardUuencode *) gmime_filter;
+    NotmuchFilterDiscardNonTerms *filter = (NotmuchFilterDiscardNonTerms *) gmime_filter;
 
     filter->state = 0;
 }
@@ -218,32 +222,32 @@ filter_reset (GMimeFilter *gmime_filter)
 /**
  * notmuch_filter_discard_uuencode_new:
  *
- * Returns: a new #NotmuchFilterDiscardUuencode filter.
+ * Returns: a new #NotmuchFilterDiscardNonTerms filter.
  **/
 static GMimeFilter *
-notmuch_filter_discard_uuencode_new (GMimeContentType *content_type)
+notmuch_filter_discard_non_terms_new (GMimeContentType *content_type)
 {
     static GType type = 0;
-    NotmuchFilterDiscardUuencode *filter;
+    NotmuchFilterDiscardNonTerms *filter;
 
     if (!type) {
 	static const GTypeInfo info = {
-	    sizeof (NotmuchFilterDiscardUuencodeClass),
+	    sizeof (NotmuchFilterDiscardNonTermsClass),
 	    NULL, /* base_class_init */
 	    NULL, /* base_class_finalize */
-	    (GClassInitFunc) notmuch_filter_discard_uuencode_class_init,
+	    (GClassInitFunc) notmuch_filter_discard_non_terms_class_init,
 	    NULL, /* class_finalize */
 	    NULL, /* class_data */
-	    sizeof (NotmuchFilterDiscardUuencode),
+	    sizeof (NotmuchFilterDiscardNonTerms),
 	    0,    /* n_preallocs */
 	    NULL, /* instance_init */
 	    NULL  /* value_table */
 	};
 
-	type = g_type_register_static (GMIME_TYPE_FILTER, "NotmuchFilterDiscardUuencode", &info, (GTypeFlags) 0);
+	type = g_type_register_static (GMIME_TYPE_FILTER, "NotmuchFilterDiscardNonTerms", &info, (GTypeFlags) 0);
     }
 
-    filter = (NotmuchFilterDiscardUuencode *) g_object_newv (type, 0, NULL);
+    filter = (NotmuchFilterDiscardNonTerms *) g_object_newv (type, 0, NULL);
     filter->state = 0;
     filter->content_type = content_type;
     filter->real_filter = filter_filter_uuencode;
@@ -332,7 +336,7 @@ _index_mime_part (notmuch_message_t *message,
 		  GMimeObject *part)
 {
     GMimeStream *stream, *filter;
-    GMimeFilter *discard_uuencode_filter;
+    GMimeFilter *discard_non_terms_filter;
     GMimeDataWrapper *wrapper;
     GByteArray *byte_array;
     GMimeContentDisposition *disposition;
@@ -422,10 +426,10 @@ _index_mime_part (notmuch_message_t *message,
     g_mime_stream_mem_set_owner (GMIME_STREAM_MEM (stream), FALSE);
 
     filter = g_mime_stream_filter_new (stream);
-    discard_uuencode_filter = notmuch_filter_discard_uuencode_new (content_type);
+    discard_non_terms_filter = notmuch_filter_discard_non_terms_new (content_type);
 
     g_mime_stream_filter_add (GMIME_STREAM_FILTER (filter),
-			      discard_uuencode_filter);
+			      discard_non_terms_filter);
 
     charset = g_mime_object_get_content_type_parameter (part, "charset");
     if (charset) {
@@ -447,7 +451,7 @@ _index_mime_part (notmuch_message_t *message,
 
     g_object_unref (stream);
     g_object_unref (filter);
-    g_object_unref (discard_uuencode_filter);
+    g_object_unref (discard_non_terms_filter);
 
     g_byte_array_append (byte_array, (guint8 *) "\0", 1);
     body = (char *) g_byte_array_free (byte_array, FALSE);
-- 
2.11.0


* [rfc patch 6/6] lib/index: add simple html filter
  2017-03-21 13:15 RFC: drop html tags David Bremner
                   ` (4 preceding siblings ...)
  2017-03-21 13:15 ` [rfc patch 5/6] lib/index: generalize filter name David Bremner
@ 2017-03-21 13:15 ` David Bremner
  5 siblings, 0 replies; 7+ messages in thread
From: David Bremner @ 2017-03-21 13:15 UTC (permalink / raw)
  To: notmuch

Just drop all tags.
---
 lib/index.cc               | 17 ++++++++++++++++-
 test/T680-html-indexing.sh |  1 -
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/lib/index.cc b/lib/index.cc
index 3bb1ac1c..03223f7d 100644
--- a/lib/index.cc
+++ b/lib/index.cc
@@ -204,6 +204,18 @@ filter_filter_uuencode (GMimeFilter *gmime_filter, char *inbuf, size_t inlen, si
 }
 
 static void
+filter_filter_html (GMimeFilter *gmime_filter, char *inbuf, size_t inlen, size_t prespace,
+		    char **outbuf, size_t *outlen, size_t *outprespace)
+{
+    static const scanner_state_t states[] = {
+	{0,  '<',  '<',  1,  0},
+	{1,  '>',  '>',  0,  1},
+    };
+    do_filter(states, 1,
+	      gmime_filter, inbuf, inlen, prespace, outbuf, outlen, outprespace);
+}
+
+static void
 filter_complete (GMimeFilter *filter, char *inbuf, size_t inlen, size_t prespace,
 		 char **outbuf, size_t *outlen, size_t *outprespace)
 {
@@ -250,7 +262,10 @@ notmuch_filter_discard_non_terms_new (GMimeContentType *content_type)
     filter = (NotmuchFilterDiscardNonTerms *) g_object_newv (type, 0, NULL);
     filter->state = 0;
     filter->content_type = content_type;
-    filter->real_filter = filter_filter_uuencode;
+    if (g_mime_content_type_is_type (content_type, "text", "html"))
+	filter->real_filter = filter_filter_html;
+    else
+	filter->real_filter = filter_filter_uuencode;
     return (GMimeFilter *) filter;
 }
 
diff --git a/test/T680-html-indexing.sh b/test/T680-html-indexing.sh
index 78768c4f..ee69209c 100755
--- a/test/T680-html-indexing.sh
+++ b/test/T680-html-indexing.sh
@@ -5,7 +5,6 @@ test_description="indexing of html parts"
 add_email_corpus html
 
 test_begin_subtest 'embedded images should not be indexed'
-test_subtest_known_broken
 notmuch search kwpza7svrgjzqwi8fhb2msggwtxtwgqcxp4wbqr4wjddstqmeqa7 > OUTPUT
 test_expect_equal_file /dev/null OUTPUT
 
-- 
2.11.0

