From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <bremner@tethera.net>
Received: from localhost (localhost [127.0.0.1])
 by arlo.cworth.org (Postfix) with ESMTP id 9FD436DE1655
 for <notmuch@notmuchmail.org>; Tue, 21 Mar 2017 06:16:02 -0700 (PDT)
X-Virus-Scanned: Debian amavisd-new at cworth.org
X-Spam-Flag: NO
X-Spam-Score: -0.005
X-Spam-Level: 
X-Spam-Status: No, score=-0.005 tagged_above=-999 required=5 tests=[AWL=0.006, 
 SPF_PASS=-0.001, T_RP_MATCHES_RCVD=-0.01] autolearn=disabled
Received: from arlo.cworth.org ([127.0.0.1])
 by localhost (arlo.cworth.org [127.0.0.1]) (amavisd-new, port 10024)
 with ESMTP id ofG99pZ3LIRl for <notmuch@notmuchmail.org>;
 Tue, 21 Mar 2017 06:15:59 -0700 (PDT)
Received: from fethera.tethera.net (fethera.tethera.net [198.245.60.197])
 by arlo.cworth.org (Postfix) with ESMTPS id C57666DE1416
 for <notmuch@notmuchmail.org>; Tue, 21 Mar 2017 06:15:59 -0700 (PDT)
Received: from remotemail by fethera.tethera.net with local (Exim 4.84_2)
 (envelope-from <bremner@tethera.net>) id 1cqJd4-0007Wk-AG
 for notmuch@notmuchmail.org; Tue, 21 Mar 2017 09:15:14 -0400
Received: (nullmailer pid 19661 invoked by uid 1000);
 Tue, 21 Mar 2017 13:15:55 -0000
From: David Bremner <david@tethera.net>
To: notmuch@notmuchmail.org
Subject: RFC: drop html tags
Date: Tue, 21 Mar 2017 10:15:43 -0300
Message-Id: <20170321131549.19557-1-david@tethera.net>
X-Mailer: git-send-email 2.11.0
X-BeenThere: notmuch@notmuchmail.org
X-Mailman-Version: 2.1.22
Precedence: list
List-Id: "Use and development of the notmuch mail system."
 <notmuch.notmuchmail.org>
List-Unsubscribe: <https://notmuchmail.org/mailman/options/notmuch>,
 <mailto:notmuch-request@notmuchmail.org?subject=unsubscribe>
List-Archive: <http://notmuchmail.org/pipermail/notmuch/>
List-Post: <mailto:notmuch@notmuchmail.org>
List-Help: <mailto:notmuch-request@notmuchmail.org?subject=help>
List-Subscribe: <https://notmuchmail.org/mailman/listinfo/notmuch>,
 <mailto:notmuch-request@notmuchmail.org?subject=subscribe>
X-List-Received-Date: Tue, 21 Mar 2017 13:16:02 -0000

Although HTML as a whole is not a regular language (and the latest
incarnations are probably not anything sane), well-formed tags should
be regular, as far as I know.  Here is a simple fix to the problem of
giant embedded images in HTML: drop all tags.  The cost is that
unbalanced < > could force an HTML part not to be indexed.
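
To make the idea concrete, here is a minimal sketch of a two-state
scanner that drops tags from a stream of chunks.  It is not the code
from the patch; the names scan_state and drop_tags are made up for
illustration, and it treats every '<' as the start of a tag.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scanner states; the names are illustrative only. */
enum scan_state { IN_TEXT, IN_TAG };

/*
 * Copy the `len` bytes at `in` to `out`, dropping everything from a '<'
 * up to and including the next '>'.  `*state` carries over between
 * calls, so a tag may span several chunks.  `out` needs room for `len`
 * bytes.  Returns the number of bytes written to `out`.
 */
static size_t
drop_tags (enum scan_state *state, const char *in, size_t len, char *out)
{
    size_t n = 0;

    assert (state != NULL && in != NULL && out != NULL);

    for (size_t i = 0; i < len; i++) {
        if (*state == IN_TEXT) {
            if (in[i] == '<')
                *state = IN_TAG;	/* start dropping */
            else
                out[n++] = in[i];	/* ordinary text is kept */
        } else if (in[i] == '>') {
            *state = IN_TEXT;		/* tag closed, resume copying */
        }
    }
    return n;
}
```

Because the state survives between calls, an <img> tag with a huge
inline data: URI is skipped even when it is split across many buffers.
If a '<' is never closed, the scanner stays in IN_TAG and the rest of
the part is dropped, which is the unbalanced < > case mentioned above.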

If the general approach seems sensible, then it can probably be tidied
up a bit, e.g. by storing a state table in the filter struct, rather
than creating a function to define the appropriate state table and
jumping through a function pointer.  On the other hand, in principle
the function-pointer approach is more flexible, as it does not insist
that all scanners are automaton-based.  I originally wanted to try a
real HTML parser, but I couldn't see how to get the one I looked at
(gumbo) working easily in "stream" mode.
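
As a rough illustration of that trade-off, the sketch below puts a
scanner behind a function pointer.  It is not the code from the patch,
and all names in it (scan_fn, filter_t, tag_scan, filter_run) are
hypothetical.  The tag scanner here happens to be table driven, but the
filter only sees the function pointer, so a scanner that is not an
automaton could be plugged in the same way.

```c
#include <assert.h>
#include <stddef.h>

/* A scanner returns 1 to keep byte c, 0 to drop it; it may update *state. */
typedef int (*scan_fn) (int *state, unsigned char c);

/* Hypothetical filter: it knows nothing about how the scanner works. */
typedef struct {
    scan_fn scan;
    int state;
} filter_t;

/*
 * Transition table for the tag scanner: tag_next[state][class].
 * States: 0 = text, 1 = inside a tag.
 * Classes: 0 = any other byte, 1 = '<', 2 = '>'.
 */
static const int tag_next[2][3] = {
    /* text */ { 0, 1, 0 },
    /* tag  */ { 1, 1, 0 },
};

static int
tag_scan (int *state, unsigned char c)
{
    int cls = (c == '<') ? 1 : (c == '>') ? 2 : 0;
    int prev = *state;

    *state = tag_next[prev][cls];
    /* Keep a byte only if it is read in text state and stays there. */
    return prev == 0 && *state == 0;
}

/* Run the filter over one chunk; returns the number of bytes kept. */
static size_t
filter_run (filter_t *f, const char *in, size_t len, char *out)
{
    size_t n = 0;

    assert (f != NULL && f->scan != NULL);

    for (size_t i = 0; i < len; i++)
        if (f->scan (&f->state, (unsigned char) in[i]))
            out[n++] = in[i];
    return n;
}
```

The alternative mentioned above would replace the scan member with the
table itself and have filter_run index it directly.  That saves one
indirection but ties every scanner to the table format.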