Date: Sat, 4 Apr 2020 06:20:03 +0000
From: Eric Wong
To: Kyle Meyer
Cc: meta@public-inbox.org
Subject: [PATCH] inboxwritable: fix From_ line unescaping
Message-ID: <20200404062003.GA23899@dcvr>
References: <87lfnb3kz8.fsf@kyleam.com>
In-Reply-To: <87lfnb3kz8.fsf@kyleam.com>

Kyle Meyer wrote:
> I'm feeding mbox files created with Konstantin Ryabitsev's
> list-archive-maker.py script [^1] to import_vger_from_mbox.  Looking
> through the result, I noticed some ">From" lines.  Here's an example:
>
>   https://yhetil.org/orgmode/871rpt9zc4.fsf@kyleam.com/
>
> If I'm following the code correctly, that leads to an import_mbox call,
> which in turn calls mb_add:
>
>     sub mb_add ($$$$) {
>     	my ($im, $variant, $filter, $msg) = @_;
>     	$$msg =~ s/(\r?\n)+\z/$1/s;
>     	my $mime = PublicInbox::MIME->new($msg);
>     	if ($variant eq 'mboxrd') {
>     		$$msg =~ s/^>(>*From )/$1/sm;
>     	} elsif ($variant eq 'mboxo') {
>     		$$msg =~ s/^>From /From /sm;
>     	}
>     [...]

Yup, and that's buggy on first sight.  My fault :x

> So, it appears the ">From" _should_ be getting reversed.
> To eliminate any stupid things I may have done when creating the
> archive, I looked for a message on meta that has an in-body line
> starting with "From" and found
>
>   https://public-inbox.org/meta/20200121222924.ioz5ve2sg65zcuoy@chatter.i7.local/
>
> So I downloaded the public-inbox generated mbox and fed it to
> import_vger_from_mbox:
>
>   curl -s https://public-inbox.org/meta/20200121222924.ioz5ve2sg65zcuoy@chatter.i7.local/t.mbox.gz \
>     | zcat | scripts/import_vger_from_mbox testing emacs-orgmode@gnu.org ~/inboxes/testing
>
> That too leaves a ">From" in the body:
>
>   https://yhetil.org/testing/20200121222924.ioz5ve2sg65zcuoy@chatter.i7.local/

Thanks for the reproducible test case.  A fix is below
(only tested with your case, nothing in t/*.t yet)

> Any idea what's going wrong here?

Two bugs, actually, but only one affected your case.

> [^1]: https://git.kernel.org/pub/scm/linux/kernel/git/mricon/korg-helpers.git/plain/list-archive-maker.py

Can you confirm the following fixes things for you?
Thanks again for the excellent bug report and apologies for
my careless bug :x

----8<----
From: Eric Wong
Date: Sat, 04 Apr 2020 06:17:29 +0000
Subject: [PATCH] inboxwritable: fix From_ line unescaping

We can't rely on Email::MIME noticing the change to our scalar
ref after calling `PublicInbox::MIME->new'.  This is because
Email::MIME::body_set (unlike Email::Simple::body_set) will
copy the contents of the body into `->{body_raw}' as a new
scalar.

Furthermore, we need to unescape multiple From lines in the
body, not just the first one, using the `g' modifier
to `s//'.

Reported-by: Kyle Meyer
---
 lib/PublicInbox/InboxWritable.pm | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/lib/PublicInbox/InboxWritable.pm b/lib/PublicInbox/InboxWritable.pm
index ce979ea2..f2ba21fc 100644
--- a/lib/PublicInbox/InboxWritable.pm
+++ b/lib/PublicInbox/InboxWritable.pm
@@ -157,12 +157,12 @@ my $from_strict = qr/^From \S+ +\S+ \S+ +\S+ [^:]+:[^:]+:[^:]+ [^:]+/;
 sub mb_add ($$$$) {
 	my ($im, $variant, $filter, $msg) = @_;
 	$$msg =~ s/(\r?\n)+\z/$1/s;
-	my $mime = PublicInbox::MIME->new($msg);
 	if ($variant eq 'mboxrd') {
-		$$msg =~ s/^>(>*From )/$1/sm;
+		$$msg =~ s/^>(>*From )/$1/gms;
 	} elsif ($variant eq 'mboxo') {
-		$$msg =~ s/^>From /From /sm;
+		$$msg =~ s/^>From /From /gms;
 	}
+	my $mime = PublicInbox::MIME->new($msg);
 	if ($filter) {
 		my $ret = $filter->scrub($mime) or return;
 		return if $ret == REJECT();
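
For anyone following along without a Perl setup, here is a rough
Python sketch of the two problems the patch addresses.  It is only
an illustration, not the public-inbox code: ESCAPED is a made-up
message body, CopyingMessage is a hypothetical stand-in for a parser
that copies the body it is handed, and Python's `re.M' plays the role
of Perl's `m' modifier.  Limiting the substitution to `count=1' mimics
`s///' without `g'.

```python
import re

# Made-up mboxrd-escaped body containing two escaped lines.
ESCAPED = "Hi,\n>From the start\ntext\n>>From a quote\n"

# mboxrd rule: drop exactly one leading '>' from lines matching ^>+From<space>
MBOXRD = re.compile(r"^>(>*From )", re.M)


def unescape_first_only(body: str) -> str:
    """Old behaviour: without `g', only the first match is rewritten."""
    return MBOXRD.sub(r"\1", body, count=1)


def unescape_all(body: str) -> str:
    """Fixed behaviour: like `g', every matching line is rewritten."""
    return MBOXRD.sub(r"\1", body)


class CopyingMessage:
    """Hypothetical parser that snapshots the body at construction."""

    def __init__(self, buf: bytearray):
        self.body_raw = bytes(buf)  # private copy; later edits are not seen


def parse_then_unescape(buf: bytearray) -> bytes:
    """Old order: the parser copies the body before it is unescaped."""
    msg = CopyingMessage(buf)
    buf[:] = unescape_all(buf.decode()).encode()  # in-place edit, too late
    return msg.body_raw


def unescape_then_parse(buf: bytearray) -> bytes:
    """Fixed order: unescape first, then hand the result to the parser."""
    buf[:] = unescape_all(buf.decode()).encode()
    return CopyingMessage(buf).body_raw
```

With the old order, the parsed object still holds the escaped
text no matter how the buffer is edited afterwards, which matches
the ">From" lines Kyle saw in the imported messages.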