From: Eric Wong <e@80x24.org>
To: Konstantin Ryabitsev <konstantin@linuxfoundation.org>
Cc: meta@public-inbox.org
Subject: [PATCH] lei_to_mail+mbox_reader: fix handling of empty/bogus emails
Date: Sat, 4 Sep 2021 21:36:58 +0000
Message-ID: <20210904213658.GA27941@dcvr>
In-Reply-To: <20210903151500.h72mzcpqixgtytjs@meerkat.local>
Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> Yep, that seems to work fine. Question -- I noticed that lei just issues a
> regular query, retrieves results with curl and then parses the output. Is
> there a danger of potentially running into issues with parsing the regular
> HTML output if it changes in the future?
It's actually parsing gzipped mboxrd (&x=m). But you're right
we could use stronger safeguards in case we see gzipped HTML or
something else...
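To illustrate the kind of safeguard meant here, below is a minimal standalone sketch, not the actual PublicInbox::MboxReader code. The helper name each_mboxrd_msg is hypothetical, and it assumes the whole (already gunzipped) response fits in memory. It treats anything before the first "From " line as a non-message and skips chunks that end up empty, so an HTML response yields no messages instead of one bogus one.

```perl
use strict;
use warnings;

# Hypothetical helper, for illustration only (not part of public-inbox):
# split an mboxrd-style buffer on "From " separator lines and invoke
# $cb once per non-empty message.  Returns the number of messages seen.
sub each_mboxrd_msg {
	my ($buf, $cb) = @_;
	# limit of -1 keeps trailing empty fields so they are skipped explicitly
	my @chunks = split(/^From [^\n]*\n/m, $buf, -1);
	shift @chunks; # text before the first "From " line is not a message
	my $n = 0;
	for my $raw (@chunks) {
		$raw =~ s/^\r?\n\z//m; # drop the blank line ending a message
		$raw =~ s/^>(>*From )/$1/gm; # undo mboxrd "From " escaping
		next if $raw eq ''; # empty/bogus message, nothing to hand over
		$cb->($raw);
		$n++;
	}
	$n;
}

# HTML where an mboxrd was expected: no "From " line, so no messages
my $html = "<html><head><title>hi,</title></head>" .
	"<body>how are you</body></html>\n";
my @got;
each_mboxrd_msg($html, sub { push @got, $_[0] });
print scalar(@got), " message(s) from HTML input\n";
```

The patch below takes a different route inside the real reader: it adds Eml->raw_size and only invokes the callback when that is non-zero.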
----------8<---------
Subject: [PATCH] lei_to_mail+mbox_reader: fix handling of empty/bogus emails
We may be handling invalid mboxes, so just return no objects in
that case. While "lei q" on HTTP(S) externals expects a gzipped
mboxrd, there's always a chance something else gzipped can be
sent to us.
There are also changes to lei_to_mail to better handle emails
which lack a body and/or headers (e.g. t/solve/bare.patch).
Link: https://public-inbox.org/meta/20210903151500.h72mzcpqixgtytjs@meerkat.local/
---
lib/PublicInbox/Eml.pm | 8 ++++++++
lib/PublicInbox/LeiToMail.pm | 21 +++++++--------------
lib/PublicInbox/MboxReader.pm | 3 ++-
t/mbox_reader.t | 23 +++++++++++++++++++++++
4 files changed, 40 insertions(+), 15 deletions(-)
diff --git a/lib/PublicInbox/Eml.pm b/lib/PublicInbox/Eml.pm
index 955d6a96..0867a016 100644
--- a/lib/PublicInbox/Eml.pm
+++ b/lib/PublicInbox/Eml.pm
@@ -480,6 +480,14 @@ sub charset_set {
sub crlf { $_[0]->{crlf} // "\n" }
+sub raw_size {
+ my ($self) = @_;
+ my $len = length(${$self->{hdr}});
+ defined($self->{bdy}) and
+ $len += length(${$self->{bdy}}) + length($self->{crlf});
+ $len;
+}
+
# warnings to ignore when handling spam mailboxes and maybe other places
sub warn_ignore {
my $s = "@_";
diff --git a/lib/PublicInbox/LeiToMail.pm b/lib/PublicInbox/LeiToMail.pm
index 6e102a1d..1221d3c7 100644
--- a/lib/PublicInbox/LeiToMail.pm
+++ b/lib/PublicInbox/LeiToMail.pm
@@ -109,32 +109,25 @@ sub _mboxcl_common ($$$) {
$$buf .= 'Content-Length: '.length($$bdy).$crlf.
'Lines: '.$lines.$crlf.$crlf;
substr($$bdy, 0, 0, $$buf); # prepend header
- $_[0] = $bdy;
+ $$bdy .= $crlf;
+ $bdy;
}
# mboxcl still escapes "From " lines
sub eml2mboxcl {
my ($eml, $smsg) = @_;
my $buf = _mbox_hdr_buf($eml, 'mboxcl', $smsg);
- my $crlf = $eml->{crlf};
- if (my $bdy = delete $eml->{bdy}) {
- $$bdy =~ s/^From />From /gm;
- _mboxcl_common($buf, $bdy, $crlf);
- }
- $$buf .= $crlf;
- $buf;
+ my $bdy = delete($eml->{bdy}) // \(my $empty = '');
+ $$bdy =~ s/^From />From /gm;
+ _mboxcl_common($buf, $bdy, $eml->{crlf});
}
# mboxcl2 has no "From " escaping
sub eml2mboxcl2 {
my ($eml, $smsg) = @_;
my $buf = _mbox_hdr_buf($eml, 'mboxcl2', $smsg);
- my $crlf = $eml->{crlf};
- if (my $bdy = delete $eml->{bdy}) {
- _mboxcl_common($buf, $bdy, $crlf);
- }
- $$buf .= $crlf;
- $buf;
+ my $bdy = delete($eml->{bdy}) // \(my $empty = '');
+ _mboxcl_common($buf, $bdy, $eml->{crlf});
}
sub git_to_mail { # git->cat_async callback
diff --git a/lib/PublicInbox/MboxReader.pm b/lib/PublicInbox/MboxReader.pm
index 9291f00b..5a754cb8 100644
--- a/lib/PublicInbox/MboxReader.pm
+++ b/lib/PublicInbox/MboxReader.pm
@@ -41,7 +41,7 @@ sub _mbox_from {
$raw =~ s/^\r?\n\z//ms;
$raw =~ s/$from_re/$1/gms;
my $eml = PublicInbox::Eml->new(\$raw);
- $eml_cb->($eml, @arg);
+ $eml_cb->($eml, @arg) if $eml->raw_size;
}
return if $r == 0; # EOF
}
@@ -96,6 +96,7 @@ sub _mbox_cl ($$$;@) {
$$hdr =~ s/\A[\r\n]*From [^\n]*\n//s or
die "E: no 'From ' line in:\n", Dumper($hdr);
my $eml = PublicInbox::Eml->new($hdr);
+ next unless $eml->raw_size;
my @cl = $eml->header_raw('Content-Length');
my $n = scalar(@cl);
$n == 0 and die "E: Content-Length missing in:\n",
diff --git a/t/mbox_reader.t b/t/mbox_reader.t
index da0ce7f1..e5f57d7b 100644
--- a/t/mbox_reader.t
+++ b/t/mbox_reader.t
@@ -71,6 +71,12 @@ my $check_fmt = sub {
"Content-Length is correct $fmt $cur");
# clobber for ->as_string comparison below
$eml->header_set('Content-Length');
+
+ # special case for t/solve/bare.patch, not sure if we
+ # should even handle it...
+ if ($cl[0] eq '0' && ${$eml->{hdr}} eq '') {
+ delete $eml->{bdy};
+ }
} else {
is(scalar(@cl), 0, "Content-Length unset $fmt $cur");
}
@@ -121,4 +127,21 @@ exit 1
is(scalar(grep(/Final/, @x)), 0, 'no incomplete bit');
}
+{
+ my $html = <<EOM;
+<html><head><title>hi,</title></head><body>how are you</body></html>
+EOM
+ for my $m (qw(mboxrd mboxcl mboxcl2 mboxo)) {
+ my (@w, @x);
+ local $SIG{__WARN__} = sub { push @w, @_ };
+ open my $fh, '<', \$html or xbail 'PerlIO::scalar';
+ PublicInbox::MboxReader->$m($fh, sub {
+ push @x, $_[0]->as_string
+ });
+ is_deeply(\@x, [], "messages in invalid $m");
+ is_deeply([grep(!/^W: leftover/, @w)], [],
+ "no extra warnings besides leftover ($m)");
+ }
+}
+
done_testing;