From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: 
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net
X-Spam-Level: 
X-Spam-ASN: 
X-Spam-Status: No, score=-4.0 required=3.0 tests=ALL_TRUSTED,AWL,BAYES_00
	shortcircuit=no autolearn=ham autolearn_force=no version=3.4.2
Received: from localhost (dcvr.yhbt.net [127.0.0.1])
	by dcvr.yhbt.net (Postfix) with ESMTP id 8DE4C1F4C0;
	Thu, 17 Oct 2019 11:22:15 +0000 (UTC)
Date: Thu, 17 Oct 2019 11:22:15 +0000
From: Eric Wong
To: meta@public-inbox.org
Subject: [RFC] v2writable: reindex handles 3-headed monsters
 [was: BUG: --reindex broken on multiple Message-IDs reuse]
Message-ID: <20191017112215.GA13175@dcvr>
References: <20191016211415.GA6084@dcvr>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <20191016211415.GA6084@dcvr>
List-Id: 

Eric Wong wrote:
> Initial indexing went fine,

Actually, no.  Initial indexing was broken when dealing with this
horror show.  But I think I can fix that...

> but --reindex fails because it's
> trying to use $sync->{regen} when it should not...

Yes, --reindex can be fixed if initial indexing isn't totally broken
to begin with, or if over.sqlite3 is deleted.  But --reindex won't
fix things if initial indexing was already broken...

> So that's a bug in the reindexing logic somewhere that
> needs to be fixed w/o disrupting normal indexing...
>
> One of the messages has 3 Message-IDs:
> https://lore.kernel.org/linux-renesas-soc/1923946.Jvi0TDUXFC@wasted.cogentembedded.com

So we had broken code trying to work around broken data :<

Anyways, the following patch is probably OK, but I'm way too tired
and need to do some more testing on existing repos.  Unfortunately,
--reindex alone isn't yet strong enough to fix cases that were
already broken...
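
For reference, this is roughly what the indexing code sees when it
hits one of these messages.  Illustrative sketch only: the shortened
headers and IDs below are made up (the real thing is at the
lore.kernel.org link quoted above); the point is just that mids()
(from PublicInbox::MID) returns all three Message-IDs from a single
header block:

use strict;
use PublicInbox::MIME;
use PublicInbox::MID qw(mids);

# a single message whose header block carries 3 sets of headers,
# so it competes for 3 msgmap slots at once:
my $mime = PublicInbox::MIME->new(<<'EOF');
Message-ID: <a@example.com>
Subject: one
Message-ID: <b@example.com>
Subject: two
Message-ID: <c@example.com>
Subject: three

3-headed monster body
EOF
my $mids = mids($mime->header_obj);
print scalar(@$mids), "\n"; # prints: 3

Any later, normal message reusing one of those three IDs then
collides with it during --reindex.
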
----------8<------------
Subject: [RFC] v2writable: reindex handles 3-headed monsters

And maybe 8-headed ones, too, but not very gracefully...

I noticed --reindex failing on the linux-renesas-soc mirror due to
one 3-headed monster of a message having 3 sets of headers, while
another normal message had a Message-ID that matched one of the 3
IDs of the 3-headed monster.

Since we do reindexing backwards, we try our best to preserve NNTP
article numbers by matching blob OIDs when the overview DB exists
at reindexing time.

Some of these may benefit from using List::Util::shuffle in case we
hit infinite loops with reindex_q...

Link: https://public-inbox.org/meta/20191016211415.GA6084@dcvr/
---
 TODO                          |   3 +
 lib/PublicInbox/OverIdx.pm    |  14 +++++
 lib/PublicInbox/V2Writable.pm | 107 ++++++++++++++++++++++++++++------
 t/v2reindex.t                 |  65 +++++++++++++++++++++
 4 files changed, 171 insertions(+), 18 deletions(-)

diff --git a/TODO b/TODO
index a327ca06..61c44a84 100644
--- a/TODO
+++ b/TODO
@@ -133,3 +133,6 @@ all need to be considered for everything we introduce)
   for coderepos
 
 * configurable diff output for solver-generated blobs
+
+* fix search for messages with multiple Subject:/To:/From:/Date:
+  headers (some wacky examples out there...)
diff --git a/lib/PublicInbox/OverIdx.pm b/lib/PublicInbox/OverIdx.pm
index 7fd1905d..1ff435d7 100644
--- a/lib/PublicInbox/OverIdx.pm
+++ b/lib/PublicInbox/OverIdx.pm
@@ -343,6 +343,20 @@ sub remove_oid {
 	$nr;
 }
 
+sub num_for_oid {
+	my ($self, $oid, $mid) = @_;
+	my $num;
+	$self->begin_lazy;
+	each_by_mid($self, $mid, ['ddd'], sub {
+		my ($smsg) = @_;
+		my $blob = $smsg->{blob};
+		return 1 if (!defined($blob) || $blob ne $oid); # continue;
+		$num = $smsg->{num};
+		0; # done
+	});
+	$num;
+}
+
 sub create_tables {
 	my ($dbh) = @_;
 
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index 6a88f62a..6e1ac5e3 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -845,6 +845,51 @@ sub mark_deleted ($$$$) {
 	}
 }
 
+sub regen_art_num ($$$$$) {
+	my ($self, $sync, $git, $oid, $mids) = @_;
+	my $num = $sync->{regen}--;
+	my $mm = $self->{mm};
+	if ($num > 0) {
+		foreach my $mid (reverse @$mids) {
+			if ($mm->mid_set($num, $mid) == 1) {
+				return ($num, $mid);
+			}
+		}
+
+		# nope, not using the regen number because either we
+		# have or overlap Message-IDs with another message.
+		$sync->{regen}++;
+		$num = undef;
+	}
+
+	# --reindex was broken when Message-IDs were reused with a
+	# message that already had multiple Message-IDs.  This causes
+	# failures when using --reindex again.  The least bad option is
+	# to show messages twice to NNTP clients, rather than lose/drop
+	# messages entirely.
+	#
+	# create a new article number?
+	for my $mid (@$mids) {
+		$num = $self->{mm}->mid_insert($mid);
+		return ($num, $mid) if defined $num;
+	}
+
+	# swap article numbers with a previously regenerated (newer message)
+	# which has an overlapping Message-ID.  The current $oid is older
+	# than $smsg->{blob} in git history since our regen walks backwards.
+	my $reindex_q = $sync->{reindex_q} //= [];
+	for my $mid (@$mids) {
+		my $n = $self->{mm}->num_for($mid);
+		next unless defined $n;
+		if (my $smsg = $self->{over}->get_art($n)) {
+			push @$reindex_q, $smsg->{blob};
+			return ($n, $mid);
+		}
+	}
+	warn "W: ran out of article numbers on $oid\n";
+	(undef, undef);
+}
+
 sub reindex_oid ($$$$) {
 	my ($self, $sync, $git, $oid) = @_;
 	my $len;
@@ -853,31 +898,43 @@ sub reindex_oid ($$$$) {
 	my $mids = mids($mime->header_obj);
 	my $cid = content_id($mime);
 
-	# get the NNTP article number we used before, highest number wins
-	# and gets deleted from sync->{mm_tmp};
+	# get the NNTP article number we used before, which gets deleted
+	# from sync->{mm_tmp};
 	my $mid0;
 	my $num = -1;
 	my $del = 0;
-	foreach my $mid (@$mids) {
-		$del += delete($sync->{D}->{"$mid\0$cid"}) ? 1 : 0;
-		my $n = $sync->{mm_tmp}->num_for($mid);
-		if (defined $n && $n > $num) {
-			$mid0 = $mid;
-			$num = $n;
-			$self->{mm}->mid_set($num, $mid0);
+	my $reindex = $sync->{reindex};
+	if ($reindex) {
+		# see if it's already in the overview DB, but keep in mind
+		# --reindex may be used blindly w/o overview DB.
+		foreach my $mid (@$mids) {
+			$del += delete($sync->{D}->{"$mid\0$cid"}) ? 1 : 0;
+			my $n = $self->{over}->num_for_oid($oid, $mid);
+			if (defined $n) {
+				($num, $mid0) = ($n, $mid);
+				$self->{mm}->mid_set($num, $mid0);
+				last; # yay! reused
+			}
 		}
 	}
-	if (!defined($mid0) && !$del) {
-		$num = $sync->{regen}--;
-		die "BUG: ran out of article numbers\n" if $num <= 0;
-		my $mm = $self->{mm};
-		foreach my $mid (reverse @$mids) {
-			if ($mm->mid_set($num, $mid) == 1) {
-				$mid0 = $mid;
-				last;
+
+	# not reindexing, or reindexing with broken/incomplete overview DB:
+	if (!defined($mid0)) {
+		# highest number wins for unseen messages
+		foreach my $mid (@$mids) {
+			$del += delete($sync->{D}->{"$mid\0$cid"}) ? 1 : 0;
+			my $n = $sync->{mm_tmp}->num_for($mid);
+			if (defined $n && $n > $num) {
+				($num, $mid0) = ($n, $mid);
+				$self->{mm}->mid_set($num, $mid0);
 			}
 		}
+	}
+
+	if (!defined($mid0) && !$del) {
+		($num, $mid0) = regen_art_num($self, $sync, $git, $oid, $mids);
 		if (!defined($mid0)) {
+			my $mm = $self->{mm};
 			my $id = '<' . join('> <', @$mids) . '>';
 			warn "Message-ID $id unusable for $num\n";
 			foreach my $mid (@$mids) {
@@ -890,7 +947,11 @@
 	if (!defined($mid0) || $del) {
 		if (!defined($mid0) && $del) { # expected for deletes
 			$num = $sync->{regen}--;
-			$self->{mm}->num_highwater($num) if !$sync->{reindex};
+			if (!$reindex) {
+				($num <= 0) and
+					die "BUG: ran out of article numbers\n";
+				$self->{mm}->num_highwater($num);
+			}
 			return
 		}
 
@@ -1147,6 +1208,16 @@ sub index_epoch ($$$) {
 		} elsif (/\A:\d{6} 100644 $x40 ($x40) [AM]\td$/o) {
 			mark_deleted($self, $sync, $git, $1);
 		}
+
+		while (my $reindex_q = delete $sync->{reindex_q}) {
+			my $all = $self->{-inbox}->git;
+			for my $oid (@$reindex_q) {
+				$self->{current_info} = "$i.git $oid";
+				warn "reindexing to shuffle article numbers\n";
+				reindex_oid($self, $sync, $all, $oid);
+			}
+			$all->cleanup;
+		}
 	}
 	$fh = undef;
 	delete $self->{reindex_pipe};
diff --git a/t/v2reindex.t b/t/v2reindex.t
index 7c5a6b07..5de067c6 100644
--- a/t/v2reindex.t
+++ b/t/v2reindex.t
@@ -431,4 +431,69 @@ ok(!-d $xap, 'Xapian directories removed again');
 	], 'msgmap as expected' );
 }
 
+# A real example from linux-renesas-soc on lore where a 3-headed monster
+# of a message has 3 sets of common headers.  Another normal message
+# previously existed with a single Message-ID that conflicts with one
+# of the Message-IDs in the 3-headed monster.
+{
+	my @warn;
+	local $SIG{__WARN__} = sub { push @warn, @_ };
+	my %config = %$ibx_config;
+	$config{indexlevel} = 'basic';
+	my $ibx = PublicInbox::Inbox->new(\%config);
+	my $im = PublicInbox::V2Writable->new($ibx);
+	my $m3 = PublicInbox::MIME->new(<<'EOF');
+Date: Tue, 24 May 2016 14:34:22 -0700 (PDT)
+Message-Id: <20160524.143422.552507610109476444.d@example.com>
+To: t@example.com
+Cc: c@example.com
+Subject: Re: [PATCH v2 2/2]
+From: 
+In-Reply-To: <1463825855-7363-2-git-send-email-y@example.com>
+References: <1463825855-7363-1-git-send-email-y@example.com>
+ <1463825855-7363-2-git-send-email-y@example.com>
+Date: Wed, 25 May 2016 10:01:51 +0900
+From: h@example.com
+To: g@example.com
+Cc: m@example.com
+Subject: Re: [PATCH]
+Message-ID: <20160525010150.GD7292@example.com>
+References: <1463498133-23918-1-git-send-email-g+r@example.com>
+In-Reply-To: <1463498133-23918-1-git-send-email-g+r@example.com>
+From: s@example.com
+To: h@example.com
+Cc: m@example.com
+Subject: [PATCH 12/13]
+Date: Wed, 01 Jun 2016 01:32:35 +0300
+Message-ID: <1923946.Jvi0TDUXFC@wasted.example.com>
+In-Reply-To: <13205049.n7pM8utpHF@wasted.example.com>
+References: <13205049.n7pM8utpHF@wasted.example.com>
+
+Somehow we got a message with 3 sets of headers into one
+message, could've been something broken on the archiver side.
+EOF
+
+	my $m1 = PublicInbox::MIME->new(<<'EOF');
+From: a@example.com
+To: t@example.com
+Subject: [PATCH 12/13]
+Date: Wed, 01 Jun 2016 01:32:35 +0300
+Message-ID: <1923946.Jvi0TDUXFC@wasted.example.com>
+In-Reply-To: <13205049.n7pM8utpHF@wasted.example.com>
+References: <13205049.n7pM8utpHF@wasted.example.com>
+
+This is probably one of the original messages
+
+EOF
+	$im->add($m1);
+	$im->add($m3);
+	$im->done;
+	remove_tree($xap);
+	eval { $im->index_sync() };
+	is($@, '', 'no error from initial indexing');
+	eval { $im->index_sync({reindex=>1}) };
+	is($@, '', 'no error from reindexing after reused Message-ID (x3)');
+	is_deeply(\@warn, [], 'no warnings');
+}
+
 done_testing();
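
Below is an untested sketch for spot-checking an existing inbox once
it has been reindexed with the above.  It assumes the usual $ibx->mm
(PublicInbox::Msgmap) and $ibx->over (PublicInbox::Over) accessors
plus Msgmap::minmax, Msgmap::mid_for and Config::lookup_name behave
the way I remember; none of them are touched by this patch:

#!/usr/bin/perl -w
# usage: perl this-script INBOX_NAME
# every live msgmap entry should still resolve to an overview row
# after --reindex; gaps left by deleted messages are skipped
use strict;
use PublicInbox::Config;

my $name = shift or die "usage: $0 INBOX_NAME\n";
my $ibx = PublicInbox::Config->new->lookup_name($name) or
	die "no inbox named `$name' configured\n";
my ($min, $max) = $ibx->mm->minmax;
my $missing = 0;
for my $num ($min..$max) {
	defined(my $mid = $ibx->mm->mid_for($num)) or next; # gap (deleted)
	next if $ibx->over->get_art($num);
	warn "article #$num <$mid> in msgmap, but not in over.sqlite3\n";
	$missing++;
}
print $missing ? "W: $missing article(s) missing from overview\n"
	: "msgmap and overview agree\n";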