* [David Bremner] Re: RFC: drop html tags
From: David Bremner @ 2017-03-21 17:55 UTC
To: notmuch
[-- Attachment #0: Type: message/rfc822, Size: 1434 bytes --]
From: David Bremner <david@tethera.net>
To: Steven Allen <steven@stebalien.com>
Subject: Re: RFC: drop html tags
Date: Tue, 21 Mar 2017 14:03:10 -0300
Message-ID: <87zigepnox.fsf@tesseract.cs.unb.ca>
Steven Allen <steven@stebalien.com> writes:
> David Bremner <david@tethera.net> writes:
>> Although HTML itself is not regular (probably not anything sane in the
>> latest incarnations), well formed tags should be as far as I know.
>> Here is a simple fix to the problem of giant embedded images in HTML:
>> drop all tags. Unbalanced < > could force an HTML part not to be
>> indexed.
>
> What about attribute values?
>
> <input value="a<b">
>
> Contrary to a lot of misinformation on the web, I'm pretty sure this is
> perfectly legal in HTML (not XML).
>
> Docs: https://www.w3.org/TR/html5/syntax.html#attributes-0
>
> In the JavaScript regex format, I believe the correct way to parse this is:
>
> /<("[^"]*"|'[^']*'|[^"'>]*)*>/g
>
> Basically, while inside a tag, ignore everything between double and single quotes.
Thanks for the reality check. It should be possible to handle quotes. In
my limited understanding of that regex, we can do a bit better by
forcing pairs of quotes to match, since <chaos attribute="'"> is
probably legal.
Cheers,
d
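
A minimal sketch of the tag dropping discussed above, using the regex
quoted from Steven's message. The helper name dropTags, the naive
comparison pattern, and the sample input containing ">" inside an
attribute value are assumptions made for illustration; none of them
come from the thread or from notmuch itself.

```javascript
// Regex quoted in the thread: inside a tag, quoted attribute values are
// consumed whole, so "<" or ">" inside them cannot end the tag early.
const QUOTE_AWARE_TAG = /<("[^"]*"|'[^']*'|[^"'>]*)*>/g;

// Naive pattern, shown only for comparison (an assumption, not from the
// thread): the tag ends at the first ">" regardless of quoting.
const NAIVE_TAG = /<[^>]*>/g;

// Hypothetical helper: remove every match of `pattern` from `html`.
function dropTags(html, pattern) {
  return html.replace(pattern, "");
}

// The example from the thread: "<" inside a double-quoted value.
console.log(dropTags('before <input value="a<b"> after', QUOTE_AWARE_TAG));
// prints "before  after"

// A ">" inside the value is where the two patterns differ.
console.log(dropTags('before <input value="a>b"> after', NAIVE_TAG));
// prints 'before b"> after'
console.log(dropTags('before <input value="a>b"> after', QUOTE_AWARE_TAG));
// prints "before  after"
```

The naive pattern happens to cope with the "a<b" example, because the
value contains no ">"; it is the "a>b" case that leaves attribute
debris behind.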
* Re: [David Bremner] Re: RFC: drop html tags
From: David Bremner @ 2017-03-22 11:12 UTC
To: notmuch
David Bremner <david@tethera.net> writes:
> From: David Bremner <david@tethera.net>
> Subject: Re: RFC: drop html tags
> To: Steven Allen <steven@stebalien.com>
> Date: Tue, 21 Mar 2017 14:03:10 -0300
>
> Steven Allen <steven@stebalien.com> writes:
>
>> In the JavaScript regex format, I believe the correct way to parse this is:
>>
>> /<("[^"]*"|'[^']*'|[^"'>]*)*>/g
>>
>> Basically, while inside a tag, ignore everything between double and single quotes.
>
> Thanks for the reality check. It should be possible to handle quotes. In
> my limited understanding of that regex, we can do a bit better by
> forcing pairs of quotes to match, since <chaos attribute="'"> is
> probably legal.
Actually, I'm wrong. My eyes just glaze over when faced with any
non-trivial regex, I guess.
d
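
A quick check, sketched in JavaScript, of why the quoted regex seems to
need no change for mixed quotes: each quoted alternative only looks for
its own closing quote, so a lone ' inside "..." (or a " inside '...')
is ordinary content. The helper name findTags and the second sample
string are assumptions made for illustration.

```javascript
// Regex quoted in the thread.
const TAG = /<("[^"]*"|'[^']*'|[^"'>]*)*>/g;

// Hypothetical helper: return every tag the regex finds in `html`.
function findTags(html) {
  return html.match(TAG) || [];
}

// A single quote inside a double-quoted value: "[^"]*" accepts it.
console.log(findTags(`x <chaos attribute="'"> y`));
// prints [ `<chaos attribute="'">` ]

// Double quotes inside a single-quoted value: '[^']*' accepts them.
console.log(findTags(`<a title='say "hi"'>link</a>`));
// prints the opening tag (with its quoted title) and "</a>"
```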