unofficial mirror of notmuch@notmuchmail.org
* RFC: drop html tags
@ 2017-03-21 13:15 David Bremner
  2017-03-21 13:15 ` [rfc patch 1/6] test: add known broken test for indexing html David Bremner
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: David Bremner @ 2017-03-21 13:15 UTC
  To: notmuch

Although HTML itself is not a regular language (and in its latest
incarnations probably not in any sane language class), well-formed tags
should be regular, as far as I know. Here is a simple fix to the
problem of giant embedded images in HTML: drop all tags. One caveat:
an unbalanced < or > could force an HTML part not to be indexed.
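
A minimal sketch of the "drop all tags" idea, to make the behaviour
concrete. This is not the code from the patch series: the names
drop_tags and tag_state are hypothetical, and the sketch assumes a tag
is simply everything from '<' through the next '>'.

```c
#include <stddef.h>

/* Hypothetical two-state scanner; not the scanner from the patches. */
enum tag_state { OUTSIDE_TAG, INSIDE_TAG };

/*
 * Copy len bytes from in to out, dropping everything from '<' through
 * the next '>'.  The state is kept by the caller so that input can be
 * fed in arbitrary chunks (stream mode).  out must have room for len
 * bytes.  Returns the number of bytes written to out.
 */
size_t
drop_tags (enum tag_state *state, const char *in, size_t len, char *out)
{
    size_t n = 0;

    for (size_t i = 0; i < len; i++) {
        if (*state == INSIDE_TAG) {
            /* Discard tag contents; only '>' changes the state. */
            if (in[i] == '>')
                *state = OUTSIDE_TAG;
        } else if (in[i] == '<') {
            *state = INSIDE_TAG;
        } else {
            out[n++] = in[i];
        }
    }
    return n;
}
```

In this sketch an unbalanced '<' leaves the scanner in INSIDE_TAG, so
all following text is discarded, which is one way the rest of a part
could end up unindexed. A '>' inside a quoted attribute value would end
the tag early; the sketch does not try to handle that.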

If the general approach seems sensible, then it can probably be tidied
up a bit, e.g. by storing a state table in the filter struct, rather
than creating a function to define the appropriate state table and
jumping through a function pointer. On the other hand, the
function-pointer approach is in principle more flexible, as it does not
insist that all scanners are automata-based. I originally wanted to try
a real HTML parser, but I couldn't see how to get the one I looked at
(gumbo) working easily in "stream" mode.
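
For illustration, the "state table in the filter struct" variant might
look roughly like the sketch below. All type and function names here
(transition_t, table_filter_t, table_filter_step, html_table) are
hypothetical and are not taken from the patches or from the notmuch
source; the layout of the real filter struct may differ.

```c
#include <stddef.h>

/* Hypothetical transition: in `state`, on byte `match`, go to `next`. */
typedef struct {
    int state;
    char match;
    int next;
} transition_t;

/* Hypothetical filter that carries its table as data, not as a callback. */
typedef struct {
    const transition_t *table;
    size_t table_len;
    int state;
} table_filter_t;

/* Tag-dropping automaton: state 0 is outside a tag, state 1 is inside. */
const transition_t html_table[] = {
    { 0, '<', 1 },
    { 1, '>', 0 },
};

/* Advance the filter by one byte; bytes with no matching entry keep the state. */
void
table_filter_step (table_filter_t *f, char c)
{
    for (size_t i = 0; i < f->table_len; i++) {
        if (f->table[i].state == f->state && f->table[i].match == c) {
            f->state = f->table[i].next;
            return;
        }
    }
}
```

With this layout a new scanner is just a new table, but every scanner
has to be expressible as one; a scanner reached through a function
pointer could instead wrap something that is not an automaton.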

Thread overview: 7+ messages
2017-03-21 13:15 RFC: drop html tags David Bremner
2017-03-21 13:15 ` [rfc patch 1/6] test: add known broken test for indexing html David Bremner
2017-03-21 13:15 ` [rfc patch 2/6] lib: add content type argument to uuencode filter David Bremner
2017-03-21 13:15 ` [rfc patch 3/6] lib/index: Add another layer of indirection in filtering David Bremner
2017-03-21 13:15 ` [rfc patch 4/6] lib/index: separate state table definition from scanner David Bremner
2017-03-21 13:15 ` [rfc patch 5/6] lib/index: generalize filter name David Bremner
2017-03-21 13:15 ` [rfc patch 6/6] lib/index: add simple html filter David Bremner

Code repositories for project(s) associated with this public inbox

	https://yhetil.org/notmuch.git/
