From mboxrd@z Thu Jan  1 00:00:00 1970
From: David Bremner
To: notmuch@notmuchmail.org
Subject: [rfc patch 1/6] test: add known broken test for indexing html
Date: Tue, 21 Mar 2017 10:15:44 -0300
Message-Id: <20170321131549.19557-2-david@tethera.net>
X-Mailer: git-send-email 2.11.0
In-Reply-To: <20170321131549.19557-1-david@tethera.net>
References: <20170321131549.19557-1-david@tethera.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
List-Id: "Use and development of the notmuch mail system."

'quite' on IRC reported that notmuch new was grinding to a halt during
initial indexing, and we eventually narrowed the problem down to some
html parts with large embedded images.
These cause the number of terms added to the Xapian database to explode
(the first 400 messages generated 4.6M unique terms), and of course the
resulting terms are not much use for searching.
---
 test/T680-html-indexing.sh       | 12 +++++++
 test/corpora/README              |  3 ++
 test/corpora/html/embedded-image | 69 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 84 insertions(+)
 create mode 100755 test/T680-html-indexing.sh
 create mode 100644 test/corpora/html/embedded-image

diff --git a/test/T680-html-indexing.sh b/test/T680-html-indexing.sh
new file mode 100755
index 00000000..78768c4f
--- /dev/null
+++ b/test/T680-html-indexing.sh
@@ -0,0 +1,12 @@
+#!/usr/bin/env bash
+test_description="indexing of html parts"
+. ./test-lib.sh || exit 1
+
+add_email_corpus html
+
+test_begin_subtest 'embedded images should not be indexed'
+test_subtest_known_broken
+notmuch search kwpza7svrgjzqwi8fhb2msggwtxtwgqcxp4wbqr4wjddstqmeqa7 > OUTPUT
+test_expect_equal_file /dev/null OUTPUT
+
+test_done
diff --git a/test/corpora/README b/test/corpora/README
index 77c48e6e..c9a35fed 100644
--- a/test/corpora/README
+++ b/test/corpora/README
@@ -9,3 +9,6 @@ default
 broken
   The broken corpus contains messages that are broken and/or RFC
   non-compliant, ensuring we deal with them in a sane way.
+
+html
+  The html corpus contains html parts
diff --git a/test/corpora/html/embedded-image b/test/corpora/html/embedded-image
new file mode 100644
index 00000000..40851530
--- /dev/null
+++ b/test/corpora/html/embedded-image
@@ -0,0 +1,69 @@
+From: =?utf-8?b?bWFsbW9ib3Jn?=
+To: =?utf-8?b?Ym9lbmRlLm1hbG1vYm9yZw==?=
+Date: Tue, 19 Jul 2016 11:54:24 +0200
+X-Feed2Imap-Version: 1.2.5
+Message-Id:
+Subject: =?utf-8?b?VGFjayBhbGxhIHRyYWZpa2FudGVyIG9jaCBmb3Rnw6RuZ2FyZSE=?=
+Content-Type: multipart/alternative; boundary="=-1468922508-176605-12427-9500-21-="
+MIME-Version: 1.0
+
+
+--=-1468922508-176605-12427-9500-21-=
+Content-Type: text/plain; charset=utf-8; format=flowed
+Content-Transfer-Encoding: 8bit
+
+
+
+Malmö 2016-07-09
+
+I skrivande stund är vi i färd med att avetablera vår entreprenad på
+Tigern 3, Regementsgatan 6 i Malmö. Fastigheten har genomgått ett större
+dräneringsarbete som i sin tur har inneburit vissa
+trafikbegränsningar på Regementsgatan samt Davidshallsgatan under några
+veckors tid. Fastighetsägaren är mycket nöjd med vår arbetsinsats och vi
+kan glatt meddela att båda vägfilerna kommer att öppnas inom kort. Nu
+kommer den vackra fastigheten att klara sig torrskodd under många år
+framöver [A]
+
+ 
+
+[A] http://malmoborg.se/wp-includes/images/smilies/icon_smile.gif
+--
+Feed: Förvaltnings AB Malmöborg
+
+Item: Tack alla trafikanter och fotgängare!
+
+Date: 2016-07-19 11:54:24 +0200
+Author: malmoborg
+Filed under: Nyheter
+
+--=-1468922508-176605-12427-9500-21-=
+Content-Type: text/html; charset=utf-8
+Content-Transfer-Encoding: 8bit
+
+
+ + + +
Feed: +Förvaltnings AB Malmöborg + +
Item:Tack alla trafikanter och fotgängare! + +
+ +

Malmö 2016-07-09

+

I skrivande stund är vi i färd med att avetablera vår entreprenad på Tigern 3, Regementsgatan 6 i Malmö. Fastigheten har genomgått ett större dräneringsarbete som i sin tur har inneburit vissa trafikbegränsningar på Regementsgatan samt Davidshallsgatan under några veckors tid. Fastighetsägaren är mycket nöjd med vår arbetsinsats och vi kan glatt meddela att båda vägfilerna kommer att öppnas inom kort. Nu kommer den vackra fastigheten att klara sig torrskodd under många år framöver :-)

+

 

+
+ + + + +
Date:  2016-07-19 11:54:24 +0200
Author:  malmoborg
Filed under:  Nyheter
+
+--=-1468922508-176605-12427-9500-21-=--
-- 
2.11.0
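
To illustrate why an embedded image inflates the term count, here is a rough sketch that compares the number of unique "terms" in ordinary prose with the number in a base64-like blob. Everything in it is an assumption made for illustration: count_unique_terms is a crude stand-in that splits on non-alphanumeric characters and is not how Xapian or notmuch tokenize text, and fake_base64 is a hypothetical helper that emits deterministic pseudo-random characters from the base64 alphabet rather than encoding a real image.

```shell
# Crude stand-in for a term generator (NOT Xapian's tokenizer):
# split stdin on non-alphanumeric characters, lowercase, count unique tokens.
count_unique_terms() {
    tr -cs '[:alnum:]' '\n' \
        | tr '[:upper:]' '[:lower:]' \
        | sed '/^$/d' \
        | sort -u \
        | wc -l \
        | tr -d ' '
}

# Hypothetical helper: print $1 characters drawn from the base64 alphabet
# using a small linear congruential generator, so the output is deterministic.
fake_base64() {
    awk -v n="$1" 'BEGIN {
        alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
        seed = 1
        for (i = 0; i < n; i++) {
            seed = (seed * 75 + 74) % 65537
            printf "%s", substr(alphabet, int(seed / 1024) % 64 + 1, 1)
        }
        print ""
    }'
}

# Prose reuses a small vocabulary, so repeating it adds no new terms.
prose_terms=$(for i in 1 2 3 4 5; do
    echo "the quick brown fox jumps over the lazy dog"
done | count_unique_terms)

# The blob is split only at "+" and "/", and almost every piece is new.
blob_terms=$(fake_base64 20000 | count_unique_terms)

echo "unique terms in prose: $prose_terms"
echo "unique terms in blob:  $blob_terms"
```

The prose input stays at a handful of terms however often it repeats, while the blob contributes a new term for nearly every fragment, and none of those fragments is something a user would search for.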