unofficial mirror of notmuch@notmuchmail.org
* [David Bremner] Re: RFC: drop html tags
@ 2017-03-21 17:55 David Bremner
  2017-03-22 11:12 ` David Bremner
  0 siblings, 1 reply; 2+ messages in thread
From: David Bremner @ 2017-03-21 17:55 UTC
  To: notmuch


[-- Attachment #0: Type: message/rfc822, Size: 1434 bytes --]

From: David Bremner <david@tethera.net>
To: Steven Allen <steven@stebalien.com>
Subject: Re: RFC: drop html tags
Date: Tue, 21 Mar 2017 14:03:10 -0300
Message-ID: <87zigepnox.fsf@tesseract.cs.unb.ca>

Steven Allen <steven@stebalien.com> writes:

> David Bremner <david@tethera.net> writes:
>> Although HTML itself is not regular (probably not anything sane in the
>> latest incarnations), well-formed tags should be, as far as I know.
>> Here is a simple fix to the problem of giant embedded images in HTML:
>> drop all tags.  Unbalanced < > could force an HTML part not to be
>> indexed.
>
> What about attribute values?
>
>     <input value="a<b">
>
> Contrary to a lot of misinformation on the web, I'm pretty sure this is
> perfectly legal in HTML (not XML).
>
> Docs: https://www.w3.org/TR/html5/syntax.html#attributes-0
>
> In the JavaScript regex format, I believe the correct way to parse this is:
>
>     /<("[^"]*"|'[^']*'|[^"'>]*)*>/g
>
> Basically, while inside a tag, ignore everything between double and single quotes.

Thanks for the reality check. It should be possible to handle quotes. In
my limited understanding of that regex, we can do a bit better by
forcing pairs of quotes to match, since <chaos attribute="'"> is
probably legal.

Cheers,

d


* Re: [David Bremner] Re: RFC: drop html tags
  2017-03-21 17:55 [David Bremner] Re: RFC: drop html tags David Bremner
@ 2017-03-22 11:12 ` David Bremner
  0 siblings, 0 replies; 2+ messages in thread
From: David Bremner @ 2017-03-22 11:12 UTC
  To: notmuch

David Bremner <david@tethera.net> writes:

> From: David Bremner <david@tethera.net>
> Subject: Re: RFC: drop html tags
> To: Steven Allen <steven@stebalien.com>
> Date: Tue, 21 Mar 2017 14:03:10 -0300
>
> Steven Allen <steven@stebalien.com> writes:
>
>> In the JavaScript regex format, I believe the correct way to parse this is:
>>
>>     /<("[^"]*"|'[^']*'|[^"'>]*)*>/g
>>
>> Basically, while inside a tag, ignore everything between double and single quotes.
>
> Thanks for the reality check. It should be possible to handle quotes. In
> my limited understanding of that regex, we can do a bit better by
> forcing pairs of quotes to match, since <chaos attribute="'"> is
> probably legal.

Actually, I'm wrong. My eyes just glaze over when faced with any
non-trivial regex, I guess.
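
For the record, a minimal sketch of why the quoted regex already copes
with mixed quotes: each alternative consumes one complete quoted string,
so a ' inside "..." never opens a new string. The helper name dropTags
is hypothetical and only for illustration. It is not notmuch code, and
replacing each tag with a space is an assumption about how tags would be
dropped.

```javascript
// Regex quoted from the thread: inside a tag, skip over complete
// double-quoted or single-quoted attribute values before looking for '>'.
const TAG_RE = /<("[^"]*"|'[^']*'|[^"'>]*)*>/g;

// Hypothetical helper (not part of notmuch): replace every tag with a
// single space so the surrounding words stay separated.
function dropTags(html) {
  return html.replace(TAG_RE, " ");
}

// '<' inside a double-quoted value does not start a new tag.
console.log(dropTags('<input value="a<b">text'));      // " text"

// A single quote inside double quotes is just part of the value.
console.log(dropTags(`<chaos attribute="'">text`));    // " text"

// '>' inside a quoted value does not end the tag early.
console.log(dropTags('<a title="x > y">link</a>'));    // " link "
```

One caveat with this sketch: the nested quantifier can backtrack heavily
on a long tag that is never closed, so input size would need bounding
before using the regex as is.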

d
