From: David Bremner
To: notmuch@notmuchmail.org
Subject: RFC: drop html tags
Date: Tue, 21 Mar 2017 10:15:43 -0300
Message-Id: <20170321131549.19557-1-david@tethera.net>

Although HTML as a whole is not a regular language (and its latest
incarnations probably do not fit any sane language class), well-formed
tags should be regular, as far as I know. Here is a simple fix to the
problem of giant embedded images in HTML: drop all tags. Unbalanced < >
could cause an HTML part not to be indexed.

If the general approach seems sensible, it can probably be tidied up a
bit, e.g. by storing a state table in the filter struct, rather than
creating a function to define the appropriate state table and jumping
through a function pointer. On the other hand, in principle the current
approach is more flexible, as it does not insist that all scanners are
automata based.

I originally wanted to try a real HTML parser, but I couldn't see how to
get the one I looked at (gumbo) working easily in "stream" mode.
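
To make the "drop all tags" idea concrete, here is a minimal sketch of a
table-driven scanner that works in stream mode. It is not the code from
the patch series: the names strip_state, transitions, char_class and
strip_tags are hypothetical, and the two-state table is an assumption
about the simplest automaton that could do the job. The state lives
outside the function, so a tag split across two chunks is still dropped.

```c
#include <stddef.h>

/* Hypothetical scanner states; not taken from the notmuch sources. */
enum strip_state { IN_TEXT, IN_TAG };

/* Transition table indexed by [state][input class].
 * Input classes: 0 = '<', 1 = '>', 2 = any other byte. */
static const enum strip_state transitions[2][3] = {
    /* IN_TEXT */ { IN_TAG, IN_TEXT, IN_TEXT },
    /* IN_TAG  */ { IN_TAG, IN_TEXT, IN_TAG  },
};

static int char_class(char c)
{
    return c == '<' ? 0 : c == '>' ? 1 : 2;
}

/* Copy the bytes of in[0..len) that lie outside tags into out and
 * return the number of bytes written. out must have room for len
 * bytes. *state persists between calls, so input can arrive in
 * arbitrary chunks. */
size_t strip_tags(enum strip_state *state, const char *in, size_t len,
                  char *out)
{
    size_t n = 0;

    for (size_t i = 0; i < len; i++) {
        enum strip_state next = transitions[*state][char_class(in[i])];

        /* Emit a byte only if we are in text both before and after
         * consuming it; this drops '<', the tag body, and the
         * closing '>'. */
        if (*state == IN_TEXT && next == IN_TEXT)
            out[n++] = in[i];
        *state = next;
    }
    return n;
}
```

With this table an unclosed < swallows everything after it, which is
the unbalanced < > problem mentioned above. A '>' inside a quoted
attribute value would also end the tag early; handling that would need
extra states.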