[PATCH] quiet "Complex regular subexpression recursion limit" warnings

From: Eric Wong <e@yhbt.net>
To: meta@public-inbox.org
Subject: [PATCH] quiet "Complex regular subexpression recursion limit" warnings
Date: Fri,  3 Apr 2020 21:51:28 +0000
Message-ID: <20200403215128.34277-1-e@yhbt.net>

These warnings seem mostly harmless, since Perl will just truncate
the match and start a new one on a newline boundary in our case.
The only downside is that we end up with redundant <span> tags in
the HTML.

Limiting the number of lines matched ourselves with `{1,$NUM}'
doesn't seem prudent since lines vary in length, so we continue
to defer the job of limiting matches to the Perl regexp engine.

I've noticed this warning in practice on 100K+ line patches to
locale data.
---
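The trade-off described above can be sketched as a stand-alone
script.  This is only an illustration, not part of the patch:
split_quotes_sketch, split_quotes_bounded and $max are hypothetical
names, and the bounded variant is the `{1,$NUM}' alternative that
the message argues against.  Whether Perl actually emits the warning
depends on the Perl version and the input size, so the sketch only
shows that both approaches preserve the input.

```perl
use strict;
use warnings;

# Hypothetical stand-alone copy of split_quotes, for illustration only.
sub split_quotes_sketch {
	my ($text) = @_;
	# Lexically scoped: only regexp warnings in this sub are silenced.
	no warnings 'regexp';
	# The capturing group makes split() return the quoted runs
	# (the separators) along with the unquoted text between them.
	return split(/((?:^>[^\n]*\n)+)/sm, $text);
}

# Hypothetical bounded variant: at most $max lines per quoted section.
sub split_quotes_bounded {
	my ($text, $max) = @_;
	return split(/((?:^>[^\n]*\n){1,${max}})/sm, $text);
}

my $msg = "So and so wrote:\n" . ("> quoted\n" x 3) . "reply\n";
my @sections = split_quotes_sketch($msg);
my @bounded = split_quotes_bounded($msg, 2);

# Either way, no input is lost: joining the pieces restores the message.
print "unbounded ok\n" if join('', @sections) eq $msg;
print "bounded ok\n" if join('', @bounded) eq $msg;
```

The bounded variant splits one quoted run into several sections and
may leave empty strings between adjacent matches, which is the same
shape of result callers get when Perl truncates a match on its own.
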
 lib/PublicInbox/MsgIter.pm   | 10 ++++++++++
 lib/PublicInbox/SearchIdx.pm |  2 +-
 lib/PublicInbox/View.pm      |  2 +-
 lib/PublicInbox/ViewDiff.pm  | 11 +++++++++++
 t/msg_iter.t                 | 30 ++++++++++++++++++++++++++++++
 5 files changed, 53 insertions(+), 2 deletions(-)

diff --git a/lib/PublicInbox/MsgIter.pm b/lib/PublicInbox/MsgIter.pm
index 6c18d2bf..fa25564a 100644
--- a/lib/PublicInbox/MsgIter.pm
+++ b/lib/PublicInbox/MsgIter.pm
@@ -71,4 +71,14 @@ sub msg_part_text ($$) {
 	($s, $err);
 }
 
+# returns an array of quoted or unquoted sections
+sub split_quotes {
+	# Quiet "Complex regular subexpression recursion limit" warning
+	# in case an inconsiderate sender quotes 32K+ lines of text at once.
+	# The warning from Perl is harmless for us since our callers can
+	# tolerate less-than-ideal matches which work within Perl limits.
+	no warnings 'regexp';
+	split(/((?:^>[^\n]*\n)+)/sm, shift);
+}
+
 1;
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index fe00df53..89d8bc2b 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -302,7 +302,7 @@ sub index_xapian { # msg_iter callback
 	defined $s or return;
 
 	# split off quoted and unquoted blocks:
-	my @sections = split(/((?:^>[^\n]*\n)+)/sm, $s);
+	my @sections = PublicInbox::MsgIter::split_quotes($s);
 	$part = $s = undef;
 	index_body($self, $_, /\A>/ ? 0 : $doc) for @sections;
 }
diff --git a/lib/PublicInbox/View.pm b/lib/PublicInbox/View.pm
index c42654b6..70c10604 100644
--- a/lib/PublicInbox/View.pm
+++ b/lib/PublicInbox/View.pm
@@ -576,7 +576,7 @@ sub add_text_body { # callback for msg_iter
 	$s .= "\n" unless $s =~ /\n\z/s;
 
 	# split off quoted and unquoted blocks:
-	my @sections = split(/((?:^>[^\n]*\n)+)/sm, $s);
+	my @sections = PublicInbox::MsgIter::split_quotes($s);
 	$s = '';
 	my $rv = $ctx->{obuf};
 	if (defined($fn) || $depth > 0 || $err) {
diff --git a/lib/PublicInbox/ViewDiff.pm b/lib/PublicInbox/ViewDiff.pm
index d22c80b9..5d391a13 100644
--- a/lib/PublicInbox/ViewDiff.pm
+++ b/lib/PublicInbox/ViewDiff.pm
@@ -202,6 +202,17 @@ sub flush_diff ($$$) {
 			$dctx = diff_header($dst, \$x, $ctx, \@top);
 		} elsif ($dctx) {
 			my $after = '';
+
+			# Quiet "Complex regular subexpression recursion limit"
+			# warning.  Perl will truncate matches upon hitting
+			# that limit, giving us more (and shorter) scalars than
+			# would be ideal, but otherwise it's harmless.
+			#
+			# We could replace the `+' metacharacter with `{1,100}'
+			# to limit the matches ourselves to 100, but we can
+			# let Perl do it for us, quietly.
+			no warnings 'regexp';
+
 			for my $s (split(/((?:(?:^\+[^\n]*\n)+)|
 					(?:(?:^-[^\n]*\n)+)|
 					(?:^@@ [^\n]+\n))/xsm, $x)) {
diff --git a/t/msg_iter.t b/t/msg_iter.t
index e33bfc69..d303564f 100644
--- a/t/msg_iter.t
+++ b/t/msg_iter.t
@@ -78,5 +78,35 @@ use_ok('PublicInbox::MsgIter');
 		'got bullet point when X-UNKNOWN assumes UTF-8');
 }
 
+{ # API not finalized
+	my @warn;
+	local $SIG{__WARN__} = sub { push @warn, [ @_ ] };
+	my $attr = "So and so wrote:\n";
+	my $q = "> hello world\n" x 10;
+	my $nq = "hello world\n" x 10;
+	my @sections = PublicInbox::MsgIter::split_quotes($attr . $q . $nq);
+	is($sections[0], $attr, 'attribution matches');
+	is($sections[1], $q, 'quoted section matches');
+	is($sections[2], $nq, 'non-quoted section matches');
+	is(scalar(@sections), 3, 'only three sections for short message');
+	is_deeply(\@warn, [], 'no warnings');
+
+	$q x= 3300;
+	$nq x= 3300;
+	@sections = PublicInbox::MsgIter::split_quotes($attr . $q . $nq);
+	is_deeply(\@warn, [], 'no warnings on giant message');
+	is(join('', @sections), $attr . $q . $nq, 'result matches expected');
+	is(shift(@sections), $attr, 'attribution is first section');
+	my @check = ('', '');
+	while (defined(my $l = shift @sections)) {
+		next if $l eq '';
+		like($l, qr/\n\z/s, 'section ends with newline');
+		my $idx = ($l =~ /\A>/) ? 0 : 1;
+		$check[$idx] .= $l;
+	}
+	is($check[0], $q, 'long quoted section matches');
+	is($check[1], $nq, 'long non-quoted section matches');
+}
+
 done_testing();
 1;
