From: David Bremner
To: notmuch@notmuchmail.org
Subject: Drop HTML tags when indexing
Date: Wed, 22 Mar 2017 08:22:59 -0300
Message-Id: <20170322112306.12060-1-david@tethera.net>
X-Mailer: git-send-email 2.11.0
List-Id: "Use and development of the notmuch mail system."

Steven Allen pointed out [2] that the previous scanner [1] was a little
too simplistic. This version handles (or claims to) quoted strings in
attributes, which can apparently contain '>' and '<' characters. This
required generalizing the state machine runner a bit [3] to handle
states with out-degree more than two.

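To make the "out-degree more than two" point concrete, here is a minimal standalone sketch of the generalized runner, outside of GMime. The table is the one from the patch; the struct layout is inferred from the initializers and field names in the diff, and strip_tags() is a hypothetical helper invented for illustration, not part of notmuch. Several table rows share the same `state` value, and the inner loop walks those rows ("fake transitions") until it lands on a row whose index equals its own `state` field.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed layout, inferred from the initializers and field names in the diff. */
typedef struct {
    int state;              /* real state this rule belongs to */
    char a, b;              /* inclusive character range that fires the rule */
    int next_if_match;      /* table index to go to when the rule fires */
    int next_if_not_match;  /* table index to try otherwise */
} scanner_state_t;

/* Table copied from the patch: rows 1..3 are all rules for state 1. */
static const scanner_state_t states[] = {
    {0, '<',  '<',  1, 0},
    {1, '\'', '\'', 4, 2},  /* scanning for quote or > */
    {1, '"',  '"',  5, 3},
    {1, '>',  '>',  0, 1},
    {4, '\'', '\'', 1, 4},  /* inside single quotes */
    {5, '"',  '"',  1, 5},  /* inside double quotes */
};

/*
 * Hypothetical helper: copy `in` to `out`, skipping characters seen while
 * the scanner is in a skipping state. `out` must be at least as large as
 * `in` including the terminator. Returns the number of characters written.
 */
static size_t strip_tags(const char *in, char *out)
{
    const int first_skipping_state = 1;
    int state = 0;
    char *o = out;

    assert(in != NULL && out != NULL);

    for (; *in != '\0'; in++) {
        int current = state;
        int next;

        /* Follow "fake transitions" until we reach a real state. */
        do {
            if (*in >= states[current].a && *in <= states[current].b)
                next = states[current].next_if_match;
            else
                next = states[current].next_if_not_match;
            current = next;
        } while (next != states[next].state);

        /* The copy decision uses the state before the transition. */
        if (state < first_skipping_state)
            *o++ = *in;
        state = next;
    }
    *o = '\0';
    return (size_t) (o - out);
}
```

With this sketch, calling strip_tags() on the input a<b c="x>y">d writes a<d: the '>' inside the double-quoted attribute does not end the tag. The opening '<' is still copied, because in this sketch the copy decision is made on the state before the transition.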
[1]: id:20170321131549.19557-1-david@tethera.net
[2]: id:87wpbipl9z.fsf@tesseract.cs.unb.ca
[3]:

diff --git a/lib/index.cc b/lib/index.cc
index 03223f7d..324e6e79 100644
--- a/lib/index.cc
+++ b/lib/index.cc
@@ -122,23 +122,25 @@ do_filter (const scanner_state_t states[],
     register const char *inptr = inbuf;
     const char *inend = inbuf + inlen;
     char *outptr;
-    int next;
+    int next, current;
 
     (void) prespace;
 
     g_mime_filter_set_size (gmime_filter, inlen, FALSE);
     outptr = gmime_filter->outbuf;
 
+    current = filter->state;
     while (inptr < inend) {
-        if (*inptr >= states[filter->state].a &&
-            *inptr <= states[filter->state].b)
-        {
-            next = states[filter->state].next_if_match;
-        }
-        else
-        {
-            next = states[filter->state].next_if_not_match;
-        }
+        /* do "fake transitions" until we fire a rule, or run out of rules */
+        do {
+            if (*inptr >= states[current].a && *inptr <= states[current].b) {
+                next = states[current].next_if_match;
+            } else {
+                next = states[current].next_if_not_match;
+            }
+
+            current = next;
+        } while (next != states[next].state);
 
         if (filter->state < first_skipping_state)
             *outptr++ = *inptr;
@@ -209,7 +211,11 @@ filter_filter_html (GMimeFilter *gmime_filter, char *inbuf, size_t inlen, size_t
 {
     static const scanner_state_t states[] = {
         {0, '<', '<', 1, 0},
+        {1, '\'', '\'', 4, 2}, /* scanning for quote or > */
+        {1, '"', '"', 5, 3},
         {1, '>', '>', 0, 1},
+        {4, '\'', '\'', 1, 4}, /* inside single quotes */
+        {5, '"', '"', 1, 5}, /* inside double quotes */
     };
 
     do_filter(states, 1, gmime_filter, inbuf, inlen, prespace, outbuf, outlen, outprespace);
diff --git a/test/T680-html-indexing.sh b/test/T680-html-indexing.sh
index ee69209c..74f33708 100755
--- a/test/T680-html-indexing.sh
+++ b/test/T680-html-indexing.sh
@@ -8,4 +8,15 @@ test_begin_subtest 'embedded images should not be indexed'
 notmuch search kwpza7svrgjzqwi8fhb2msggwtxtwgqcxp4wbqr4wjddstqmeqa7 > OUTPUT
 test_expect_equal_file /dev/null OUTPUT
 
+test_begin_subtest 'ignore > in attribute text'
+notmuch search swordfish | notmuch_search_sanitize > OUTPUT
+test_expect_equal_file /dev/null OUTPUT
+
+test_begin_subtest 'non tag text should be indexed'
+notmuch search hunter2 | notmuch_search_sanitize > OUTPUT
+cat <<EOF > EXPECTED
+thread:XXX 2009-11-17 [1/1] David Bremner; test html attachment (inbox unread)
+EOF
+test_expect_equal_file EXPECTED OUTPUT
+
 test_done
diff --git a/test/corpora/html/attribute-text b/test/corpora/html/attribute-text
new file mode 100644
index 00000000..6dae8194
--- /dev/null
+++ b/test/corpora/html/attribute-text
@@ -0,0 +1,15 @@
+From: David Bremner
+To: David Bremner
+Subject: test html attachment
+Date: Tue, 17 Nov 2009 21:28:38 +0600
+Message-ID: <87d1dajhgf.fsf@example.net>
+MIME-Version: 1.0
+Content-Type: text/html
+Content-Disposition: inline; filename=test.html
+
+
+
+
+ hunter2
+