* BUG: --reindex broken on multiple Message-IDs reuse @ 2019-10-16 21:14 Eric Wong 2019-10-17 11:22 ` [RFC] v2writable: reindex handles 3-headed monsters [was: BUG: --reindex broken on multiple Message-IDs reuse] Eric Wong 2019-10-21 11:02 ` [PATCH 0/3] fix reindex on multiple + overlapping Message-IDs Eric Wong 0 siblings, 2 replies; 11+ messages in thread From: Eric Wong @ 2019-10-16 21:14 UTC (permalink / raw) To: meta Initial indexing went fine, but --reindex fails because it's trying to use $sync->{regen} when it should not... So that's a bug in the reindexing logic somewhere that needs to be fixed w/o disrupting normal indexing... One of the messages has 3 Message-IDs: https://lore.kernel.org/linux-renesas-soc/1923946.Jvi0TDUXFC@wasted.cogentembedded.com :< Will look into it more later... ^ permalink raw reply [flat|nested] 11+ messages in thread
* [RFC] v2writable: reindex handles 3-headed monsters [was: BUG: --reindex broken on multiple Message-IDs reuse] 2019-10-16 21:14 BUG: --reindex broken on multiple Message-IDs reuse Eric Wong @ 2019-10-17 11:22 ` Eric Wong 2019-10-22 8:09 ` [RFC/HELP] search: multiple From/To/Cc/Subject (what about Date?) Eric Wong 2019-10-21 11:02 ` [PATCH 0/3] fix reindex on multiple + overlapping Message-IDs Eric Wong 1 sibling, 1 reply; 11+ messages in thread From: Eric Wong @ 2019-10-17 11:22 UTC (permalink / raw) To: meta Eric Wong <e@80x24.org> wrote: > Initial indexing went fine, Actually, no. Initial indexing was broken when dealing with this horror show. But I think I can fix that... > but --reindex fails because it's > trying to use $sync->{regen} when it should not... Yes, --reindex can be fixed if initial indexing isn't totally broken to begin with. Or, if over.sqlite3 is deleted.. But --reindex won't fix things if initial indexing was broken.... > So that's a bug in the reindexing logic somewhere that > needs to be fixed w/o disrupting normal indexing... > > One of the messages has 3 Message-IDs: > https://lore.kernel.org/linux-renesas-soc/1923946.Jvi0TDUXFC@wasted.cogentembedded.com So we had broken code trying to workaround broken data :< Anyways, the following patch is probably OK, but I'm way too tired and need to do some more testing on existing repos. Unfortunately, --reindex alone isn't yet strong enough to fix cases that were already broken... ----------8<------------ Subject: [RFC] v2writable: reindex handles 3-headed monsters And maybe 8-headed ones, too; but not very gracefully... I noticed --reindex failing on the linux-renesas-soc mirror due one 3-headed monster of a message having 3 sets of headers; while another normal message had a Message-ID that matched one of the 3 IDs of the 3-headed monster. Since we do reindexing backwards, we can try to do our best to preserve NNTP article numbers by matching blob OIDs if the overview DB exists by re-reindexing. Some of these may benefit from using List::Util::shuffle in case we hit infinite loops with reindex_q... Link: https://public-inbox.org/meta/20191016211415.GA6084@dcvr/ --- TODO | 3 + lib/PublicInbox/OverIdx.pm | 14 +++++ lib/PublicInbox/V2Writable.pm | 107 ++++++++++++++++++++++++++++------ t/v2reindex.t | 65 +++++++++++++++++++++ 4 files changed, 171 insertions(+), 18 deletions(-) diff --git a/TODO b/TODO index a327ca06..61c44a84 100644 --- a/TODO +++ b/TODO @@ -133,3 +133,6 @@ all need to be considered for everything we introduce) for coderepos * configurable diff output for solver-generated blobs + +* fix search for messages with multiple Subject:/To:/From:/Date: + headers (some wacky examples out there...) diff --git a/lib/PublicInbox/OverIdx.pm b/lib/PublicInbox/OverIdx.pm index 7fd1905d..1ff435d7 100644 --- a/lib/PublicInbox/OverIdx.pm +++ b/lib/PublicInbox/OverIdx.pm @@ -343,6 +343,20 @@ sub remove_oid { $nr; } +sub num_for_oid { + my ($self, $oid, $mid) = @_; + my $num; + $self->begin_lazy; + each_by_mid($self, $mid, ['ddd'], sub { + my ($smsg) = @_; + my $blob = $smsg->{blob}; + return 1 if (!defined($blob) || $blob ne $oid); # continue; + $num = $smsg->{num}; + 0; # done + }); + $num; +} + sub create_tables { my ($dbh) = @_; diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm index 6a88f62a..6e1ac5e3 100644 --- a/lib/PublicInbox/V2Writable.pm +++ b/lib/PublicInbox/V2Writable.pm @@ -845,6 +845,51 @@ sub mark_deleted ($$$$) { } } +sub regen_art_num ($$$$$) { + my ($self, $sync, $git, $oid, $mids) = @_; + my $num = $sync->{regen}--; + my $mm = $self->{mm}; + if ($num > 0) { + foreach my $mid (reverse @$mids) { + if ($mm->mid_set($num, $mid) == 1) { + return ($num, $mid); + } + } + + # nope, not using the regen number because either we + # have or overlap Message-IDs with another message. + $sync->{regen}++; + $num = undef; + } + + # --reindex was broken when Message-IDs were reused with a + # message that already had multiple Message-IDs. This causes + # failures when using --reindex again. The least bad option is + # to show messages twice to NNTP clients, rather than lose/drop + # messages entirely. + # + # create a new article number? + for my $mid (@$mids) { + $num = $self->{mm}->mid_insert($mid); + return ($num, $mid) if defined $num; + } + + # swap article numbers with a previously regenerated (newer message) + # which has an overlapping Message-ID. The current $oid is older + # than $smsg->{blob} in git history since we regen walks backwards. + my $reindex_q = $sync->{reindex_q} //= []; + for my $mid (@$mids) { + my $n = $self->{mm}->num_for($mid); + next unless defined $n; + if (my $smsg = $self->{over}->get_art($n)) { + push @$reindex_q, $smsg->{blob}; + return ($n, $mid); + } + } + warn "W: ran out of article numbers on $oid\n"; + (undef, undef); +} + sub reindex_oid ($$$$) { my ($self, $sync, $git, $oid) = @_; my $len; @@ -853,31 +898,43 @@ sub reindex_oid ($$$$) { my $mids = mids($mime->header_obj); my $cid = content_id($mime); - # get the NNTP article number we used before, highest number wins - # and gets deleted from sync->{mm_tmp}; + # get the NNTP article number we used before, which gets deleted + # from sync->{mm_tmp}; my $mid0; my $num = -1; my $del = 0; - foreach my $mid (@$mids) { - $del += delete($sync->{D}->{"$mid\0$cid"}) ? 1 : 0; - my $n = $sync->{mm_tmp}->num_for($mid); - if (defined $n && $n > $num) { - $mid0 = $mid; - $num = $n; - $self->{mm}->mid_set($num, $mid0); + my $reindex = $sync->{reindex}; + if ($reindex) { + # see if it's already in the overview DB, but keep in mind + # --reindex may be used blindly w/o overview DB. + foreach my $mid (@$mids) { + $del += delete($sync->{D}->{"$mid\0$cid"}) ? 1 : 0; + my $n = $self->{over}->num_for_oid($oid, $mid); + if (defined $n) { + ($num, $mid0) = ($n, $mid); + $self->{mm}->mid_set($num, $mid0); + last; # yay! reused + } } } - if (!defined($mid0) && !$del) { - $num = $sync->{regen}--; - die "BUG: ran out of article numbers\n" if $num <= 0; - my $mm = $self->{mm}; - foreach my $mid (reverse @$mids) { - if ($mm->mid_set($num, $mid) == 1) { - $mid0 = $mid; - last; + + # not reindexing, or reindexing with broken/incomplete overview DB: + if (!defined($mid0)) { + # highest number wins for unseen messages + foreach my $mid (@$mids) { + $del += delete($sync->{D}->{"$mid\0$cid"}) ? 1 : 0; + my $n = $sync->{mm_tmp}->num_for($mid); + if (defined $n && $n > $num) { + ($num, $mid0) = ($n, $mid); + $self->{mm}->mid_set($num, $mid0); } } + } + + if (!defined($mid0) && !$del) { + ($num, $mid0) = regen_art_num($self, $sync, $git, $oid, $mids); if (!defined($mid0)) { + my $mm = $self->{mm}; my $id = '<' . join('> <', @$mids) . '>'; warn "Message-ID $id unusable for $num\n"; foreach my $mid (@$mids) { @@ -890,7 +947,11 @@ sub reindex_oid ($$$$) { if (!defined($mid0) || $del) { if (!defined($mid0) && $del) { # expected for deletes $num = $sync->{regen}--; - $self->{mm}->num_highwater($num) if !$sync->{reindex}; + if (!$reindex) { + ($num <= 0) and + die "BUG: ran out of article numbers\n"; + $self->{mm}->num_highwater($num); + } return } @@ -1147,6 +1208,16 @@ sub index_epoch ($$$) { } elsif (/\A:\d{6} 100644 $x40 ($x40) [AM]\td$/o) { mark_deleted($self, $sync, $git, $1); } + + while (my $reindex_q = delete $sync->{reindex_q}) { + my $all = $self->{-inbox}->git; + for my $oid (@$reindex_q) { + $self->{current_info} = "$i.git $oid"; + warn "reindexing to shuffle article numbers\n"; + reindex_oid($self, $sync, $all, $oid); + } + $all->cleanup; + } } $fh = undef; delete $self->{reindex_pipe}; diff --git a/t/v2reindex.t b/t/v2reindex.t index 7c5a6b07..5de067c6 100644 --- a/t/v2reindex.t +++ b/t/v2reindex.t @@ -431,4 +431,69 @@ ok(!-d $xap, 'Xapian directories removed again'); ], 'msgmap as expected' ); } +# A real example from linux-renesas-soc on lore where a 3-headed monster +# of a message has 3 sets of common headers. Another normal message +# previously existed with a single Message-ID that conflicts with one +# of the Message-IDs in the 3-headed monster. +{ + my @warn; + local $SIG{__WARN__} = sub { push @warn, @_ }; + my %config = %$ibx_config; + $config{indexlevel} = 'basic'; + my $ibx = PublicInbox::Inbox->new(\%config); + my $im = PublicInbox::V2Writable->new($ibx); + my $m3 = PublicInbox::MIME->new(<<'EOF'); +Date: Tue, 24 May 2016 14:34:22 -0700 (PDT) +Message-Id: <20160524.143422.552507610109476444.d@example.com> +To: t@example.com +Cc: c@example.com +Subject: Re: [PATCH v2 2/2] +From: <f@example.com> +In-Reply-To: <1463825855-7363-2-git-send-email-y@example.com> +References: <1463825855-7363-1-git-send-email-y@example.com> + <1463825855-7363-2-git-send-email-y@example.com> +Date: Wed, 25 May 2016 10:01:51 +0900 +From: h@example.com +To: g@example.com +Cc: m@example.com +Subject: Re: [PATCH] +Message-ID: <20160525010150.GD7292@example.com> +References: <1463498133-23918-1-git-send-email-g+r@example.com> +In-Reply-To: <1463498133-23918-1-git-send-email-g+r@example.com> +From: s@example.com +To: h@example.com +Cc: m@example.com +Subject: [PATCH 12/13] +Date: Wed, 01 Jun 2016 01:32:35 +0300 +Message-ID: <1923946.Jvi0TDUXFC@wasted.example.com> +In-Reply-To: <13205049.n7pM8utpHF@wasted.example.com> +References: <13205049.n7pM8utpHF@wasted.example.com> + +Somehow we got a message with 3 sets of headers into one +message, could've been something broken on the archiver side. +EOF + + my $m1 = PublicInbox::MIME->new(<<'EOF'); +From: a@example.com +To: t@example.com +Subject: [PATCH 12/13] +Date: Wed, 01 Jun 2016 01:32:35 +0300 +Message-ID: <1923946.Jvi0TDUXFC@wasted.example.com> +In-Reply-To: <13205049.n7pM8utpHF@wasted.example.com> +References: <13205049.n7pM8utpHF@wasted.example.com> + +This is probably one of the original messages + +EOF + $im->add($m1); + $im->add($m3); + $im->done; + remove_tree($xap); + eval { $im->index_sync() }; + is($@, '', 'no error from initial indexing'); + eval { $im->index_sync({reindex=>1}) }; + is($@, '', 'no error from reindexing after reused Message-ID (x3)'); + is_deeply(\@warn, [], 'no warnings'); +} + done_testing(); ^ permalink raw reply related [flat|nested] 11+ messages in thread
* [RFC/HELP] search: multiple From/To/Cc/Subject (what about Date?) 2019-10-17 11:22 ` [RFC] v2writable: reindex handles 3-headed monsters [was: BUG: --reindex broken on multiple Message-IDs reuse] Eric Wong @ 2019-10-22 8:09 ` Eric Wong 0 siblings, 0 replies; 11+ messages in thread From: Eric Wong @ 2019-10-22 8:09 UTC (permalink / raw) To: meta We can easily support searching on messages with multiple From/To/Cc/Subject headers. Now, the display part in the thread skeleton could be trickier, but not impossible... Now, how are we supposed to support searching date ranges when messages have multiple Date: headers? e.g: https://lore.kernel.org/linux-renesas-soc/20160524.143422.552507610109476444.davem@davemloft.net/raw OTOH, I'm pretty sure that's just a mangled message somebody passed on because all 3 Message-IDs appear in standalone forms elsewhere (e.g. lkml) But, I also fully expect future messages will contain horribleness in headers unless anti-spam/abuse rules start blocking that before it hits public-inbox... --- lib/PublicInbox/SearchMsg.pm | 4 ++-- t/v2reindex.t | 16 ++++++++++++---- 2 files changed, 14 insertions(+), 6 deletions(-) diff --git a/lib/PublicInbox/SearchMsg.pm b/lib/PublicInbox/SearchMsg.pm index adadf92e..7561e7f2 100644 --- a/lib/PublicInbox/SearchMsg.pm +++ b/lib/PublicInbox/SearchMsg.pm @@ -107,8 +107,8 @@ sub __hdr ($$) { return $val if defined $val; my $mime = $self->{mime} or return; - $val = $mime->header($field); - $val = '' unless defined $val; + my @raw = $mime->header($field); + $val = join(', ', @raw); $val =~ tr/\t\n/ /; $val =~ tr/\r//d; $self->{$field} = $val; diff --git a/t/v2reindex.t b/t/v2reindex.t index 52711f8f..3e56ddfa 100644 --- a/t/v2reindex.t +++ b/t/v2reindex.t @@ -439,7 +439,7 @@ ok(!-d $xap, 'Xapian directories removed again'); my @warn; local $SIG{__WARN__} = sub { push @warn, @_ }; my %config = %$ibx_config; - $config{indexlevel} = 'basic'; + $config{indexlevel} = 'medium'; my $ibx = PublicInbox::Inbox->new(\%config); my $im = PublicInbox::V2Writable->new($ibx); my $m3 = PublicInbox::MIME->new(<<'EOF'); @@ -447,7 +447,7 @@ Date: Tue, 24 May 2016 14:34:22 -0700 (PDT) Message-Id: <20160524.143422.552507610109476444.d@example.com> To: t@example.com Cc: c@example.com -Subject: Re: [PATCH v2 2/2] +Subject: Re: [PATCH v2 2/2] uno From: <f@example.com> In-Reply-To: <1463825855-7363-2-git-send-email-y@example.com> References: <1463825855-7363-1-git-send-email-y@example.com> @@ -456,14 +456,14 @@ Date: Wed, 25 May 2016 10:01:51 +0900 From: h@example.com To: g@example.com Cc: m@example.com -Subject: Re: [PATCH] +Subject: Re: [PATCH] dos Message-ID: <20160525010150.GD7292@example.com> References: <1463498133-23918-1-git-send-email-g+r@example.com> In-Reply-To: <1463498133-23918-1-git-send-email-g+r@example.com> From: s@example.com To: h@example.com Cc: m@example.com -Subject: [PATCH 12/13] +Subject: [PATCH 12/13] tres Date: Wed, 01 Jun 2016 01:32:35 +0300 Message-ID: <1923946.Jvi0TDUXFC@wasted.example.com> In-Reply-To: <13205049.n7pM8utpHF@wasted.example.com> @@ -495,6 +495,14 @@ EOF eval { $im->index_sync({reindex=>1}) }; is($@, '', 'no error from reindexing after reused Message-ID (x3)'); is_deeply(\@warn, [], 'no warnings on reindex'); + + my %uniq; + for my $s (qw(uno dos tres)) { + my $msgs = $ibx->search->query("s:$s"); + is(scalar(@$msgs), 1, "only one result for `$s'"); + $uniq{$msgs->[0]->{num}}++; + } + is_deeply([values %uniq], [3], 'search on different subjects'); } done_testing(); ^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCH 0/3] fix reindex on multiple + overlapping Message-IDs 2019-10-16 21:14 BUG: --reindex broken on multiple Message-IDs reuse Eric Wong 2019-10-17 11:22 ` [RFC] v2writable: reindex handles 3-headed monsters [was: BUG: --reindex broken on multiple Message-IDs reuse] Eric Wong @ 2019-10-21 11:02 ` Eric Wong 2019-10-21 11:02 ` [PATCH 1/3] v2writable: set unindexed article number Eric Wong ` (3 more replies) 1 sibling, 4 replies; 11+ messages in thread From: Eric Wong @ 2019-10-21 11:02 UTC (permalink / raw) To: meta This fixes my initial problems around --reindex not picking up previously-skipped messages, and without having to add extra command-line options. It took several iterations and I considered relying on "git log --reverse" since v2 epochs are limited in size compared to v1. But I'm glad we won't have to rely on the expensive --reverse option and only internally reverse a few rare messages which have multiple Message-IDs. I'm actually satisfied with this because it doesn't involve adding new switches to workaround buggy behavior, since I considered (and rejected) a "--renumber" option. original RFC: https://public-inbox.org/meta/20191017112215.GA13175@dcvr/ I'm still testing this with various repos, but the linux-renesas-soc mirror from lore seems good when doing initial indexing with an old version of this and using --reindex with this patch, as all previously-missed messages are now indexed. Eric Wong (3): v2writable: set unindexed article number v2writable: improve "num_for" API and disambiguate v2writable: reindex handles 3-headered monsters TODO | 3 + lib/PublicInbox/OverIdx.pm | 14 ++ lib/PublicInbox/V2Writable.pm | 257 +++++++++++++++++++++++----------- t/v2reindex.t | 66 +++++++++ 4 files changed, 256 insertions(+), 84 deletions(-) Interdiff: diff --git a/lib/PublicInbox/OverIdx.pm b/lib/PublicInbox/OverIdx.pm index 1ff435d7..64277342 100644 --- a/lib/PublicInbox/OverIdx.pm +++ b/lib/PublicInbox/OverIdx.pm @@ -343,18 +343,18 @@ sub remove_oid { $nr; } -sub num_for_oid { +sub num_mid0_for_oid { my ($self, $oid, $mid) = @_; - my $num; + my ($num, $mid0); $self->begin_lazy; each_by_mid($self, $mid, ['ddd'], sub { my ($smsg) = @_; my $blob = $smsg->{blob}; return 1 if (!defined($blob) || $blob ne $oid); # continue; - $num = $smsg->{num}; + ($num, $mid0) = ($smsg->{num}, $smsg->{mid}); 0; # done }); - $num; + ($num, $mid0); } sub create_tables { diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm index 6e1ac5e3..7ece6b01 100644 --- a/lib/PublicInbox/V2Writable.pm +++ b/lib/PublicInbox/V2Writable.pm @@ -155,8 +155,7 @@ sub _add { # leaking FDs to it... $self->idx_init; - my $mid0; - my $num = num_for($self, $mime, \$mid0); + my ($num, $mid0) = v2_num_for($self, $mime); defined $num or return; # duplicate defined $mid0 or die "BUG: $mid0 undefined\n"; my $im = $self->importer; @@ -172,16 +171,15 @@ sub _add { $cmt; } -sub num_for { - my ($self, $mime, $mid0) = @_; +sub v2_num_for { + my ($self, $mime) = @_; my $mids = mids($mime->header_obj); if (@$mids) { my $mid = $mids->[0]; my $num = $self->{mm}->mid_insert($mid); if (defined $num) { # common case - $$mid0 = $mid; - return $num; - }; + return ($num, $mid); + } # crap, Message-ID is already known, hope somebody just resent: foreach my $m (@$mids) { @@ -190,7 +188,7 @@ sub num_for { # easy, don't store duplicates # note: do not add more diagnostic info here since # it gets noisy on public-inbox-watch restarts - return if $existing; + return () if $existing; } # AltId may pre-populate article numbers (e.g. X-Mail-Count @@ -201,8 +199,7 @@ sub num_for { my $num = $self->{mm}->num_for($mid); if (defined $num && !$self->{over}->get_art($num)) { - $$mid0 = $mid; - return $num; + return ($num, $mid); } } @@ -215,39 +212,38 @@ sub num_for { $num = $self->{mm}->mid_insert($m); if (defined $num) { warn "alternative <$m> for <$mid> found\n"; - $$mid0 = $m; - return $num; + return ($num, $m); } } } # none of the existing Message-IDs are good, generate a new one: - num_for_harder($self, $mime, $mid0); + v2_num_for_harder($self, $mime); } -sub num_for_harder { - my ($self, $mime, $mid0) = @_; +sub v2_num_for_harder { + my ($self, $mime) = @_; my $hdr = $mime->header_obj; my $dig = content_digest($mime); - $$mid0 = PublicInbox::Import::digest2mid($dig, $hdr); - my $num = $self->{mm}->mid_insert($$mid0); + my $mid0 = PublicInbox::Import::digest2mid($dig, $hdr); + my $num = $self->{mm}->mid_insert($mid0); unless (defined $num) { # it's hard to spoof the last Received: header my @recvd = $hdr->header_raw('Received'); $dig->add("Received: $_") foreach (@recvd); - $$mid0 = PublicInbox::Import::digest2mid($dig, $hdr); - $num = $self->{mm}->mid_insert($$mid0); + $mid0 = PublicInbox::Import::digest2mid($dig, $hdr); + $num = $self->{mm}->mid_insert($mid0); # fall back to a random Message-ID and give up determinism: until (defined($num)) { $dig->add(rand); - $$mid0 = PublicInbox::Import::digest2mid($dig, $hdr); - warn "using random Message-ID <$$mid0> as fallback\n"; - $num = $self->{mm}->mid_insert($$mid0); + $mid0 = PublicInbox::Import::digest2mid($dig, $hdr); + warn "using random Message-ID <$mid0> as fallback\n"; + $num = $self->{mm}->mid_insert($mid0); } } - PublicInbox::Import::append_mid($hdr, $$mid0); - $num; + PublicInbox::Import::append_mid($hdr, $mid0); + ($num, $mid0); } sub idx_shard { @@ -845,142 +841,157 @@ sub mark_deleted ($$$$) { } } -sub regen_art_num ($$$$$) { - my ($self, $sync, $git, $oid, $mids) = @_; - my $num = $sync->{regen}--; - my $mm = $self->{mm}; - if ($num > 0) { - foreach my $mid (reverse @$mids) { - if ($mm->mid_set($num, $mid) == 1) { - return ($num, $mid); - } - } +sub reindex_checkpoint ($$$) { + my ($self, $sync, $git) = @_; - # nope, not using the regen number because either we - # have or overlap Message-IDs with another message. - $sync->{regen}++; - $num = undef; - } - - # --reindex was broken when Message-IDs were reused with a - # message that already had multiple Message-IDs. This causes - # failures when using --reindex again. The least bad option is - # to show messages twice to NNTP clients, rather than lose/drop - # messages entirely. - # - # create a new article number? - for my $mid (@$mids) { - $num = $self->{mm}->mid_insert($mid); - return ($num, $mid) if defined $num; - } - - # swap article numbers with a previously regenerated (newer message) - # which has an overlapping Message-ID. The current $oid is older - # than $smsg->{blob} in git history since we regen walks backwards. - my $reindex_q = $sync->{reindex_q} //= []; - for my $mid (@$mids) { - my $n = $self->{mm}->num_for($mid); - next unless defined $n; - if (my $smsg = $self->{over}->get_art($n)) { - push @$reindex_q, $smsg->{blob}; - return ($n, $mid); - } + $git->cleanup; + $sync->{mm_tmp}->atfork_prepare; + $self->done; # release lock + + if (my $pr = $sync->{-opt}->{-progress}) { + my ($bn) = (split('/', $git->{git_dir}))[-1]; + $pr->("$bn ".sprintf($sync->{-regen_fmt}, $sync->{nr})); } - warn "W: ran out of article numbers on $oid\n"; - (undef, undef); + + # allow -watch or -mda to write... + $self->idx_init; # reacquire lock + $sync->{mm_tmp}->atfork_parent; } -sub reindex_oid ($$$$) { +# only for a few odd messages with multiple Message-IDs +sub reindex_oid_m ($$$$) { my ($self, $sync, $git, $oid) = @_; - my $len; + my ($num, $mid0, $len); my $msgref = $git->cat_file($oid, \$len); my $mime = PublicInbox::MIME->new($$msgref); my $mids = mids($mime->header_obj); my $cid = content_id($mime); + die "BUG: reindex_oid_m called for <=1 mids" if scalar(@$mids) <= 1; - # get the NNTP article number we used before, which gets deleted - # from sync->{mm_tmp}; - my $mid0; - my $num = -1; - my $del = 0; - my $reindex = $sync->{reindex}; - if ($reindex) { - # see if it's already in the overview DB, but keep in mind - # --reindex may be used blindly w/o overview DB. - foreach my $mid (@$mids) { - $del += delete($sync->{D}->{"$mid\0$cid"}) ? 1 : 0; - my $n = $self->{over}->num_for_oid($oid, $mid); - if (defined $n) { - ($num, $mid0) = ($n, $mid); - $self->{mm}->mid_set($num, $mid0); - last; # yay! reused - } + for my $mid (reverse @$mids) { + delete($sync->{D}->{"$mid\0$cid"}) and + die "BUG: reindex_oid should handle <$mid> delete"; + } + for my $mid (reverse @$mids) { + ($num, $mid0) = $self->{over}->num_mid0_for_oid($oid, $mid); + last if defined $num; + } + unless (defined($num)) { + for my $mid (reverse @$mids) { + # is this a number we got before? + $num = $sync->{mm_tmp}->num_for($mid); + next unless defined $num; + $mid0 = $mid; + last; } } - - # not reindexing, or reindexing with broken/incomplete overview DB: - if (!defined($mid0)) { - # highest number wins for unseen messages - foreach my $mid (@$mids) { - $del += delete($sync->{D}->{"$mid\0$cid"}) ? 1 : 0; - my $n = $sync->{mm_tmp}->num_for($mid); - if (defined $n && $n > $num) { - ($num, $mid0) = ($n, $mid); - $self->{mm}->mid_set($num, $mid0); + if (defined($num)) { + $sync->{mm_tmp}->num_delete($num); + } else { + $num = $sync->{regen}--; + if ($num <= 0) { + # fixup bugs in old mirrors on reindex + for my $mid (reverse @$mids) { + $num = $self->{mm}->mid_insert($mid); + next unless defined $num; + $mid0 = $mid; + last; + } + if (defined $mid0) { + if ($sync->{reindex}) { + warn "reindex added #$num <$mid0>\n"; + } + } else { + warn "E: cannot find article #\n"; + return; + } + } else { # $num > 0, use the new article number + for my $mid (reverse @$mids) { + $self->{mm}->mid_set($num, $mid) == 1 or next; + $mid0 = $mid; + last; + } + unless (defined $mid0) { + warn "E: cannot regen #$num\n"; + return; } } } + $sync->{nr}++; + if (do_idx($self, $msgref, $mime, $len, $num, $oid, $mid0)) { + reindex_checkpoint($self, $sync, $git); + } +} - if (!defined($mid0) && !$del) { - ($num, $mid0) = regen_art_num($self, $sync, $git, $oid, $mids); - if (!defined($mid0)) { - my $mm = $self->{mm}; - my $id = '<' . join('> <', @$mids) . '>'; - warn "Message-ID $id unusable for $num\n"; - foreach my $mid (@$mids) { - defined(my $n = $mm->num_for($mid)) or next; - warn "#$n previously mapped for <$mid>\n"; - } - } +sub check_unindexed ($$$) { + my ($self, $num, $mid0) = @_; + my $unindexed = $self->{unindexed} // {}; + my $n = delete($unindexed->{$mid0}); + defined $n or return; + if ($n != $num) { + die "BUG: unindexed $n != $num <$mid0>\n"; + } else { + $self->{mm}->mid_set($num, $mid0); } +} - if (!defined($mid0) || $del) { - if (!defined($mid0) && $del) { # expected for deletes - $num = $sync->{regen}--; - if (!$reindex) { - ($num <= 0) and - die "BUG: ran out of article numbers\n"; +sub reindex_oid ($$$$) { + my ($self, $sync, $git, $oid) = @_; + my ($num, $mid0, $len); + my $msgref = $git->cat_file($oid, \$len); + return if $len == 0; # purged + my $mime = PublicInbox::MIME->new($$msgref); + my $mids = mids($mime->header_obj); + my $cid = content_id($mime); + + if (scalar(@$mids) == 0) { + warn "E: $oid has no Message-ID, skipping\n"; + return; + } elsif (scalar(@$mids) == 1) { + my $mid = $mids->[0]; + + # was the file previously marked as deleted?, skip if so + if (delete($sync->{D}->{"$mid\0$cid"})) { + if (!$sync->{reindex}) { + $num = $sync->{regen}--; $self->{mm}->num_highwater($num); } - return + return; } - my $id = '<' . join('> <', @$mids) . '>'; - defined($mid0) or - warn "Skipping $id, no article number found\n"; - if ($del && defined($mid0)) { - warn "$id was deleted $del " . - "time(s) but mapped to article #$num\n"; + # is this a number we got before? + $num = $sync->{mm_tmp}->num_for($mid); + if (defined $num) { + $mid0 = $mid; + check_unindexed($self, $num, $mid0); + } else { + $num = $sync->{regen}--; + die "BUG: ran out of article numbers\n" if $num <= 0; + if ($self->{mm}->mid_set($num, $mid) != 1) { + warn "E: unable to assign $num => <$mid>\n"; + return; + } + $mid0 = $mid; + } + } else { # multiple MIDs are a weird case: + my $del = 0; + for (@$mids) { + $del += delete($sync->{D}->{"$_\0$cid"}) // 0; + } + if ($del) { + unindex_oid_remote($self, $oid, $_) for @$mids; + # do not delete from {mm_tmp}, since another + # single-MID message may use it. + } else { # handle them at the end: + push @{$sync->{multi_mid} //= []}, $oid; } return; - } $sync->{mm_tmp}->mid_delete($mid0) or die "failed to delete <$mid0> for article #$num\n"; $sync->{nr}++; if (do_idx($self, $msgref, $mime, $len, $num, $oid, $mid0)) { - $git->cleanup; - $sync->{mm_tmp}->atfork_prepare; - $self->done; # release lock - - if (my $pr = $sync->{-opt}->{-progress}) { - my ($bn) = (split('/', $git->{git_dir}))[-1]; - $pr->("$bn ".sprintf($sync->{-regen_fmt}, $sync->{nr})); - } - - # allow -watch or -mda to write... - $self->idx_init; # reacquire lock - $sync->{mm_tmp}->atfork_parent; + reindex_checkpoint($self, $sync, $git); } } @@ -1112,8 +1123,9 @@ sub unindex_oid_remote ($$$) { $self->{over}->remove_oid($oid, $mid); } -sub unindex_oid ($$$) { - my ($self, $git, $oid) = @_; +sub unindex_oid ($$$;$) { + my ($self, $git, $oid, $unindexed) = @_; + my $mm = $self->{mm}; my $msgref = $git->cat_file($oid); my $mime = PublicInbox::MIME->new($msgref); my $mids = mids($mime->header_obj); @@ -1133,8 +1145,11 @@ sub unindex_oid ($$$) { join(',',sort keys %gone), "\n"; } foreach my $num (keys %gone) { - $self->{unindexed}->{$_}++; - $self->{mm}->num_delete($num); + if ($unindexed) { + my $mid0 = $mm->mid_for($num); + $unindexed->{$mid0} = $num; + } + $mm->num_delete($num); } unindex_oid_remote($self, $oid, $mid); } @@ -1143,20 +1158,21 @@ sub unindex_oid ($$$) { my $x40 = qr/[a-f0-9]{40}/; sub unindex ($$$$) { my ($self, $sync, $git, $unindex_range) = @_; - my $un = $self->{unindexed} ||= {}; # num => removal count - my $before = scalar keys %$un; + my $unindexed = $self->{unindexed} ||= {}; # $mid0 => $num + my $before = scalar keys %$unindexed; + # order does not matter, here: my @cmd = qw(log --raw -r --no-notes --no-color --no-abbrev --no-renames); my $fh = $self->{reindex_pipe} = $git->popen(@cmd, $unindex_range); while (<$fh>) { /\A:\d{6} 100644 $x40 ($x40) [AM]\tm$/o or next; - unindex_oid($self, $git, $1); + unindex_oid($self, $git, $1, $unindexed); } delete $self->{reindex_pipe}; $fh = undef; return unless $sync->{-opt}->{prune}; - my $after = scalar keys %$un; + my $after = scalar keys %$unindexed; return if $before == $after; # ensure any blob can not longer be accessed via dumb HTTP @@ -1208,16 +1224,6 @@ sub index_epoch ($$$) { } elsif (/\A:\d{6} 100644 $x40 ($x40) [AM]\td$/o) { mark_deleted($self, $sync, $git, $1); } - - while (my $reindex_q = delete $sync->{reindex_q}) { - my $all = $self->{-inbox}->git; - for my $oid (@$reindex_q) { - $self->{current_info} = "$i.git $oid"; - warn "reindexing to shuffle article numbers\n"; - reindex_oid($self, $sync, $all, $oid); - } - $all->cleanup; - } } $fh = undef; delete $self->{reindex_pipe}; @@ -1259,10 +1265,22 @@ sub index_sync { # unindex is required for leftovers if "deletes" affect messages # in a previous fetch+index window: + my $git; if (my @leftovers = values %{delete $sync->{D}}) { - my $git = $self->{-inbox}->git; - unindex_oid($self, $git, $_) for @leftovers; - $git->cleanup; + $git = $self->{-inbox}->git; + for my $oid (@leftovers) { + $self->{current_info} = "leftover $oid"; + unindex_oid($self, $git, $oid); + } + } + if (my $multi_mid = delete $sync->{multi_mid}) { + $git //= $self->{-inbox}->git; + + while (defined(my $oid = pop(@$multi_mid))) { + $self->{current_info} = "multi_mid $oid"; + reindex_oid_m($self, $sync, $git, $oid); + } + $git->cleanup if $git; } $self->done; diff --git a/t/v2reindex.t b/t/v2reindex.t index 5de067c6..52711f8f 100644 --- a/t/v2reindex.t +++ b/t/v2reindex.t @@ -491,9 +491,10 @@ EOF remove_tree($xap); eval { $im->index_sync() }; is($@, '', 'no error from initial indexing'); + is_deeply(\@warn, [], 'no warnings from initial index'); eval { $im->index_sync({reindex=>1}) }; is($@, '', 'no error from reindexing after reused Message-ID (x3)'); - is_deeply(\@warn, [], 'no warnings'); + is_deeply(\@warn, [], 'no warnings on reindex'); } done_testing(); ^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCH 1/3] v2writable: set unindexed article number 2019-10-21 11:02 ` [PATCH 0/3] fix reindex on multiple + overlapping Message-IDs Eric Wong @ 2019-10-21 11:02 ` Eric Wong 2019-10-21 11:02 ` [PATCH 2/3] v2writable: improve "num_for" API and disambiguate Eric Wong ` (2 subsequent siblings) 3 siblings, 0 replies; 11+ messages in thread From: Eric Wong @ 2019-10-21 11:02 UTC (permalink / raw) To: meta We'll actually use the keys of this hash in future commits. --- lib/PublicInbox/V2Writable.pm | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm index 6a88f62a..dcbbbc77 100644 --- a/lib/PublicInbox/V2Writable.pm +++ b/lib/PublicInbox/V2Writable.pm @@ -1072,7 +1072,7 @@ sub unindex_oid ($$$) { join(',',sort keys %gone), "\n"; } foreach my $num (keys %gone) { - $self->{unindexed}->{$_}++; + $self->{unindexed}->{$num}++; $self->{mm}->num_delete($num); } unindex_oid_remote($self, $oid, $mid); ^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCH 2/3] v2writable: improve "num_for" API and disambiguate 2019-10-21 11:02 ` [PATCH 0/3] fix reindex on multiple + overlapping Message-IDs Eric Wong 2019-10-21 11:02 ` [PATCH 1/3] v2writable: set unindexed article number Eric Wong @ 2019-10-21 11:02 ` Eric Wong 2019-10-21 11:02 ` [PATCH 3/3] v2writable: reindex handles 3-headered monsters Eric Wong 2019-10-23 18:19 ` [PUSHED] fix reindex on multiple + overlapping Message-IDs Eric Wong 3 siblings, 0 replies; 11+ messages in thread From: Eric Wong @ 2019-10-21 11:02 UTC (permalink / raw) To: meta Make it obvious that we're not the Msgmap sub and return an array because it's less awkward than providing a modifiable ref to a function to write to. --- lib/PublicInbox/V2Writable.pm | 44 ++++++++++++++++------------------- 1 file changed, 20 insertions(+), 24 deletions(-) diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm index dcbbbc77..9d507828 100644 --- a/lib/PublicInbox/V2Writable.pm +++ b/lib/PublicInbox/V2Writable.pm @@ -155,8 +155,7 @@ sub _add { # leaking FDs to it... $self->idx_init; - my $mid0; - my $num = num_for($self, $mime, \$mid0); + my ($num, $mid0) = v2_num_for($self, $mime); defined $num or return; # duplicate defined $mid0 or die "BUG: $mid0 undefined\n"; my $im = $self->importer; @@ -172,16 +171,15 @@ sub _add { $cmt; } -sub num_for { - my ($self, $mime, $mid0) = @_; +sub v2_num_for { + my ($self, $mime) = @_; my $mids = mids($mime->header_obj); if (@$mids) { my $mid = $mids->[0]; my $num = $self->{mm}->mid_insert($mid); if (defined $num) { # common case - $$mid0 = $mid; - return $num; - }; + return ($num, $mid); + } # crap, Message-ID is already known, hope somebody just resent: foreach my $m (@$mids) { @@ -190,7 +188,7 @@ sub num_for { # easy, don't store duplicates # note: do not add more diagnostic info here since # it gets noisy on public-inbox-watch restarts - return if $existing; + return () if $existing; } # AltId may pre-populate article numbers (e.g. X-Mail-Count @@ -201,8 +199,7 @@ sub num_for { my $num = $self->{mm}->num_for($mid); if (defined $num && !$self->{over}->get_art($num)) { - $$mid0 = $mid; - return $num; + return ($num, $mid); } } @@ -215,39 +212,38 @@ sub num_for { $num = $self->{mm}->mid_insert($m); if (defined $num) { warn "alternative <$m> for <$mid> found\n"; - $$mid0 = $m; - return $num; + return ($num, $m); } } } # none of the existing Message-IDs are good, generate a new one: - num_for_harder($self, $mime, $mid0); + v2_num_for_harder($self, $mime); } -sub num_for_harder { - my ($self, $mime, $mid0) = @_; +sub v2_num_for_harder { + my ($self, $mime) = @_; my $hdr = $mime->header_obj; my $dig = content_digest($mime); - $$mid0 = PublicInbox::Import::digest2mid($dig, $hdr); - my $num = $self->{mm}->mid_insert($$mid0); + my $mid0 = PublicInbox::Import::digest2mid($dig, $hdr); + my $num = $self->{mm}->mid_insert($mid0); unless (defined $num) { # it's hard to spoof the last Received: header my @recvd = $hdr->header_raw('Received'); $dig->add("Received: $_") foreach (@recvd); - $$mid0 = PublicInbox::Import::digest2mid($dig, $hdr); - $num = $self->{mm}->mid_insert($$mid0); + $mid0 = PublicInbox::Import::digest2mid($dig, $hdr); + $num = $self->{mm}->mid_insert($mid0); # fall back to a random Message-ID and give up determinism: until (defined($num)) { $dig->add(rand); - $$mid0 = PublicInbox::Import::digest2mid($dig, $hdr); - warn "using random Message-ID <$$mid0> as fallback\n"; - $num = $self->{mm}->mid_insert($$mid0); + $mid0 = PublicInbox::Import::digest2mid($dig, $hdr); + warn "using random Message-ID <$mid0> as fallback\n"; + $num = $self->{mm}->mid_insert($mid0); } } - PublicInbox::Import::append_mid($hdr, $$mid0); - $num; + PublicInbox::Import::append_mid($hdr, $mid0); + ($num, $mid0); } sub idx_shard { ^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCH 3/3] v2writable: reindex handles 3-headered monsters 2019-10-21 11:02 ` [PATCH 0/3] fix reindex on multiple + overlapping Message-IDs Eric Wong 2019-10-21 11:02 ` [PATCH 1/3] v2writable: set unindexed article number Eric Wong 2019-10-21 11:02 ` [PATCH 2/3] v2writable: improve "num_for" API and disambiguate Eric Wong @ 2019-10-21 11:02 ` Eric Wong 2019-10-21 11:34 ` Eric Wong 2019-10-22 1:28 ` [PATCH 4/3] v2writable: move git->cleanup to the correct place Eric Wong 2019-10-23 18:19 ` [PUSHED] fix reindex on multiple + overlapping Message-IDs Eric Wong 3 siblings, 2 replies; 11+ messages in thread From: Eric Wong @ 2019-10-21 11:02 UTC (permalink / raw) To: meta And maybe 8-headered ones, too... I noticed --reindex failing on the linux-renesas-soc mirror due one 3-headed monster of a message having 3 sets of headers; while another normal message had a Message-ID that matched one of the 3 IDs of the 3-headed monster. We still try to do the majority of indexing backwards, but we defer indexing multi-Message-ID'd messages until the end to ensure we get all the "good" messages in before we process the multi-headered ones. Link: https://public-inbox.org/meta/20191016211415.GA6084@dcvr/ --- TODO | 3 + lib/PublicInbox/OverIdx.pm | 14 +++ lib/PublicInbox/V2Writable.pm | 213 ++++++++++++++++++++++++---------- t/v2reindex.t | 66 +++++++++++ 4 files changed, 236 insertions(+), 60 deletions(-) diff --git a/TODO b/TODO index a327ca06..61c44a84 100644 --- a/TODO +++ b/TODO @@ -133,3 +133,6 @@ all need to be considered for everything we introduce) for coderepos * configurable diff output for solver-generated blobs + +* fix search for messages with multiple Subject:/To:/From:/Date: + headers (some wacky examples out there...) diff --git a/lib/PublicInbox/OverIdx.pm b/lib/PublicInbox/OverIdx.pm index 7fd1905d..64277342 100644 --- a/lib/PublicInbox/OverIdx.pm +++ b/lib/PublicInbox/OverIdx.pm @@ -343,6 +343,20 @@ sub remove_oid { $nr; } +sub num_mid0_for_oid { + my ($self, $oid, $mid) = @_; + my ($num, $mid0); + $self->begin_lazy; + each_by_mid($self, $mid, ['ddd'], sub { + my ($smsg) = @_; + my $blob = $smsg->{blob}; + return 1 if (!defined($blob) || $blob ne $oid); # continue; + ($num, $mid0) = ($smsg->{num}, $smsg->{mid}); + 0; # done + }); + ($num, $mid0); +} + sub create_tables { my ($dbh) = @_; diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm index 9d507828..7ece6b01 100644 --- a/lib/PublicInbox/V2Writable.pm +++ b/lib/PublicInbox/V2Writable.pm @@ -841,81 +841,157 @@ sub mark_deleted ($$$$) { } } -sub reindex_oid ($$$$) { +sub reindex_checkpoint ($$$) { + my ($self, $sync, $git) = @_; + + $git->cleanup; + $sync->{mm_tmp}->atfork_prepare; + $self->done; # release lock + + if (my $pr = $sync->{-opt}->{-progress}) { + my ($bn) = (split('/', $git->{git_dir}))[-1]; + $pr->("$bn ".sprintf($sync->{-regen_fmt}, $sync->{nr})); + } + + # allow -watch or -mda to write... + $self->idx_init; # reacquire lock + $sync->{mm_tmp}->atfork_parent; +} + +# only for a few odd messages with multiple Message-IDs +sub reindex_oid_m ($$$$) { my ($self, $sync, $git, $oid) = @_; - my $len; + my ($num, $mid0, $len); my $msgref = $git->cat_file($oid, \$len); my $mime = PublicInbox::MIME->new($$msgref); my $mids = mids($mime->header_obj); my $cid = content_id($mime); + die "BUG: reindex_oid_m called for <=1 mids" if scalar(@$mids) <= 1; - # get the NNTP article number we used before, highest number wins - # and gets deleted from sync->{mm_tmp}; - my $mid0; - my $num = -1; - my $del = 0; - foreach my $mid (@$mids) { - $del += delete($sync->{D}->{"$mid\0$cid"}) ? 1 : 0; - my $n = $sync->{mm_tmp}->num_for($mid); - if (defined $n && $n > $num) { + for my $mid (reverse @$mids) { + delete($sync->{D}->{"$mid\0$cid"}) and + die "BUG: reindex_oid should handle <$mid> delete"; + } + for my $mid (reverse @$mids) { + ($num, $mid0) = $self->{over}->num_mid0_for_oid($oid, $mid); + last if defined $num; + } + unless (defined($num)) { + for my $mid (reverse @$mids) { + # is this a number we got before? + $num = $sync->{mm_tmp}->num_for($mid); + next unless defined $num; $mid0 = $mid; - $num = $n; - $self->{mm}->mid_set($num, $mid0); + last; } } - if (!defined($mid0) && !$del) { + if (defined($num)) { + $sync->{mm_tmp}->num_delete($num); + } else { $num = $sync->{regen}--; - die "BUG: ran out of article numbers\n" if $num <= 0; - my $mm = $self->{mm}; - foreach my $mid (reverse @$mids) { - if ($mm->mid_set($num, $mid) == 1) { + if ($num <= 0) { + # fixup bugs in old mirrors on reindex + for my $mid (reverse @$mids) { + $num = $self->{mm}->mid_insert($mid); + next unless defined $num; $mid0 = $mid; last; } - } - if (!defined($mid0)) { - my $id = '<' . join('> <', @$mids) . '>'; - warn "Message-ID $id unusable for $num\n"; - foreach my $mid (@$mids) { - defined(my $n = $mm->num_for($mid)) or next; - warn "#$n previously mapped for <$mid>\n"; + if (defined $mid0) { + if ($sync->{reindex}) { + warn "reindex added #$num <$mid0>\n"; + } + } else { + warn "E: cannot find article #\n"; + return; + } + } else { # $num > 0, use the new article number + for my $mid (reverse @$mids) { + $self->{mm}->mid_set($num, $mid) == 1 or next; + $mid0 = $mid; + last; + } + unless (defined $mid0) { + warn "E: cannot regen #$num\n"; + return; } } } + $sync->{nr}++; + if (do_idx($self, $msgref, $mime, $len, $num, $oid, $mid0)) { + reindex_checkpoint($self, $sync, $git); + } +} - if (!defined($mid0) || $del) { - if (!defined($mid0) && $del) { # expected for deletes - $num = $sync->{regen}--; - $self->{mm}->num_highwater($num) if !$sync->{reindex}; - return +sub check_unindexed ($$$) { + my ($self, $num, $mid0) = @_; + my $unindexed = $self->{unindexed} // {}; + my $n = delete($unindexed->{$mid0}); + defined $n or return; + if ($n != $num) { + die "BUG: unindexed $n != $num <$mid0>\n"; + } else { + $self->{mm}->mid_set($num, $mid0); + } +} + +sub reindex_oid ($$$$) { + my ($self, $sync, $git, $oid) = @_; + my ($num, $mid0, $len); + my $msgref = $git->cat_file($oid, \$len); + return if $len == 0; # purged + my $mime = PublicInbox::MIME->new($$msgref); + my $mids = mids($mime->header_obj); + my $cid = content_id($mime); + + if (scalar(@$mids) == 0) { + warn "E: $oid has no Message-ID, skipping\n"; + return; + } elsif (scalar(@$mids) == 1) { + my $mid = $mids->[0]; + + # was the file previously marked as deleted?, skip if so + if (delete($sync->{D}->{"$mid\0$cid"})) { + if (!$sync->{reindex}) { + $num = $sync->{regen}--; + $self->{mm}->num_highwater($num); + } + return; } - my $id = '<' . join('> <', @$mids) . '>'; - defined($mid0) or - warn "Skipping $id, no article number found\n"; - if ($del && defined($mid0)) { - warn "$id was deleted $del " . - "time(s) but mapped to article #$num\n"; + # is this a number we got before? + $num = $sync->{mm_tmp}->num_for($mid); + if (defined $num) { + $mid0 = $mid; + check_unindexed($self, $num, $mid0); + } else { + $num = $sync->{regen}--; + die "BUG: ran out of article numbers\n" if $num <= 0; + if ($self->{mm}->mid_set($num, $mid) != 1) { + warn "E: unable to assign $num => <$mid>\n"; + return; + } + $mid0 = $mid; + } + } else { # multiple MIDs are a weird case: + my $del = 0; + for (@$mids) { + $del += delete($sync->{D}->{"$_\0$cid"}) // 0; + } + if ($del) { + unindex_oid_remote($self, $oid, $_) for @$mids; + # do not delete from {mm_tmp}, since another + # single-MID message may use it. + } else { # handle them at the end: + push @{$sync->{multi_mid} //= []}, $oid; } return; - } $sync->{mm_tmp}->mid_delete($mid0) or die "failed to delete <$mid0> for article #$num\n"; $sync->{nr}++; if (do_idx($self, $msgref, $mime, $len, $num, $oid, $mid0)) { - $git->cleanup; - $sync->{mm_tmp}->atfork_prepare; - $self->done; # release lock - - if (my $pr = $sync->{-opt}->{-progress}) { - my ($bn) = (split('/', $git->{git_dir}))[-1]; - $pr->("$bn ".sprintf($sync->{-regen_fmt}, $sync->{nr})); - } - - # allow -watch or -mda to write... - $self->idx_init; # reacquire lock - $sync->{mm_tmp}->atfork_parent; + reindex_checkpoint($self, $sync, $git); } } @@ -1047,8 +1123,9 @@ sub unindex_oid_remote ($$$) { $self->{over}->remove_oid($oid, $mid); } -sub unindex_oid ($$$) { - my ($self, $git, $oid) = @_; +sub unindex_oid ($$$;$) { + my ($self, $git, $oid, $unindexed) = @_; + my $mm = $self->{mm}; my $msgref = $git->cat_file($oid); my $mime = PublicInbox::MIME->new($msgref); my $mids = mids($mime->header_obj); @@ -1068,8 +1145,11 @@ sub unindex_oid ($$$) { join(',',sort keys %gone), "\n"; } foreach my $num (keys %gone) { - $self->{unindexed}->{$num}++; - $self->{mm}->num_delete($num); + if ($unindexed) { + my $mid0 = $mm->mid_for($num); + $unindexed->{$mid0} = $num; + } + $mm->num_delete($num); } unindex_oid_remote($self, $oid, $mid); } @@ -1078,20 +1158,21 @@ sub unindex_oid ($$$) { my $x40 = qr/[a-f0-9]{40}/; sub unindex ($$$$) { my ($self, $sync, $git, $unindex_range) = @_; - my $un = $self->{unindexed} ||= {}; # num => removal count - my $before = scalar keys %$un; + my $unindexed = $self->{unindexed} ||= {}; # $mid0 => $num + my $before = scalar keys %$unindexed; + # order does not matter, here: my @cmd = qw(log --raw -r --no-notes --no-color --no-abbrev --no-renames); my $fh = $self->{reindex_pipe} = $git->popen(@cmd, $unindex_range); while (<$fh>) { /\A:\d{6} 100644 $x40 ($x40) [AM]\tm$/o or next; - unindex_oid($self, $git, $1); + unindex_oid($self, $git, $1, $unindexed); } delete $self->{reindex_pipe}; $fh = undef; return unless $sync->{-opt}->{prune}; - my $after = scalar keys %$un; + my $after = scalar keys %$unindexed; return if $before == $after; # ensure any blob can not longer be accessed via dumb HTTP @@ -1184,10 +1265,22 @@ sub index_sync { # unindex is required for leftovers if "deletes" affect messages # in a previous fetch+index window: + my $git; if (my @leftovers = values %{delete $sync->{D}}) { - my $git = $self->{-inbox}->git; - unindex_oid($self, $git, $_) for @leftovers; - $git->cleanup; + $git = $self->{-inbox}->git; + for my $oid (@leftovers) { + $self->{current_info} = "leftover $oid"; + unindex_oid($self, $git, $oid); + } + } + if (my $multi_mid = delete $sync->{multi_mid}) { + $git //= $self->{-inbox}->git; + + while (defined(my $oid = pop(@$multi_mid))) { + $self->{current_info} = "multi_mid $oid"; + reindex_oid_m($self, $sync, $git, $oid); + } + $git->cleanup if $git; } $self->done; diff --git a/t/v2reindex.t b/t/v2reindex.t index 7c5a6b07..52711f8f 100644 --- a/t/v2reindex.t +++ b/t/v2reindex.t @@ -431,4 +431,70 @@ ok(!-d $xap, 'Xapian directories removed again'); ], 'msgmap as expected' ); } +# A real example from linux-renesas-soc on lore where a 3-headed monster +# of a message has 3 sets of common headers. Another normal message +# previously existed with a single Message-ID that conflicts with one +# of the Message-IDs in the 3-headed monster. +{ + my @warn; + local $SIG{__WARN__} = sub { push @warn, @_ }; + my %config = %$ibx_config; + $config{indexlevel} = 'basic'; + my $ibx = PublicInbox::Inbox->new(\%config); + my $im = PublicInbox::V2Writable->new($ibx); + my $m3 = PublicInbox::MIME->new(<<'EOF'); +Date: Tue, 24 May 2016 14:34:22 -0700 (PDT) +Message-Id: <20160524.143422.552507610109476444.d@example.com> +To: t@example.com +Cc: c@example.com +Subject: Re: [PATCH v2 2/2] +From: <f@example.com> +In-Reply-To: <1463825855-7363-2-git-send-email-y@example.com> +References: <1463825855-7363-1-git-send-email-y@example.com> + <1463825855-7363-2-git-send-email-y@example.com> +Date: Wed, 25 May 2016 10:01:51 +0900 +From: h@example.com +To: g@example.com +Cc: m@example.com +Subject: Re: [PATCH] +Message-ID: <20160525010150.GD7292@example.com> +References: <1463498133-23918-1-git-send-email-g+r@example.com> +In-Reply-To: <1463498133-23918-1-git-send-email-g+r@example.com> +From: s@example.com +To: h@example.com +Cc: m@example.com +Subject: [PATCH 12/13] +Date: Wed, 01 Jun 2016 01:32:35 +0300 +Message-ID: <1923946.Jvi0TDUXFC@wasted.example.com> +In-Reply-To: <13205049.n7pM8utpHF@wasted.example.com> +References: <13205049.n7pM8utpHF@wasted.example.com> + +Somehow we got a message with 3 sets of headers into one +message, could've been something broken on the archiver side. +EOF + + my $m1 = PublicInbox::MIME->new(<<'EOF'); +From: a@example.com +To: t@example.com +Subject: [PATCH 12/13] +Date: Wed, 01 Jun 2016 01:32:35 +0300 +Message-ID: <1923946.Jvi0TDUXFC@wasted.example.com> +In-Reply-To: <13205049.n7pM8utpHF@wasted.example.com> +References: <13205049.n7pM8utpHF@wasted.example.com> + +This is probably one of the original messages + +EOF + $im->add($m1); + $im->add($m3); + $im->done; + remove_tree($xap); + eval { $im->index_sync() }; + is($@, '', 'no error from initial indexing'); + is_deeply(\@warn, [], 'no warnings from initial index'); + eval { $im->index_sync({reindex=>1}) }; + is($@, '', 'no error from reindexing after reused Message-ID (x3)'); + is_deeply(\@warn, [], 'no warnings on reindex'); +} + done_testing(); ^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH 3/3] v2writable: reindex handles 3-headered monsters 2019-10-21 11:02 ` [PATCH 3/3] v2writable: reindex handles 3-headered monsters Eric Wong @ 2019-10-21 11:34 ` Eric Wong 2019-10-22 1:28 ` [PATCH 4/3] v2writable: move git->cleanup to the correct place Eric Wong 1 sibling, 0 replies; 11+ messages in thread From: Eric Wong @ 2019-10-21 11:34 UTC (permalink / raw) To: meta Eric Wong <e@80x24.org> wrote: > +++ b/lib/PublicInbox/V2Writable.pm > +sub reindex_oid ($$$$) { <snip> > + } else { # multiple MIDs are a weird case: > + my $del = 0; > + for (@$mids) { > + $del += delete($sync->{D}->{"$_\0$cid"}) // 0; > + } > + if ($del) { > + unindex_oid_remote($self, $oid, $_) for @$mids; > + # do not delete from {mm_tmp}, since another > + # single-MID message may use it. > + } else { # handle them at the end: > + push @{$sync->{multi_mid} //= []}, $oid; Part of me worres @$multi_mid can be abused here by people trying to OOM the indexing process... <snip> > @@ -1184,10 +1265,22 @@ sub index_sync { > > # unindex is required for leftovers if "deletes" affect messages > # in a previous fetch+index window: > + my $git; > if (my @leftovers = values %{delete $sync->{D}}) { > - my $git = $self->{-inbox}->git; > - unindex_oid($self, $git, $_) for @leftovers; > - $git->cleanup; > + $git = $self->{-inbox}->git; > + for my $oid (@leftovers) { > + $self->{current_info} = "leftover $oid"; > + unindex_oid($self, $git, $oid); > + } > + } > + if (my $multi_mid = delete $sync->{multi_mid}) { > + $git //= $self->{-inbox}->git; > + > + while (defined(my $oid = pop(@$multi_mid))) { > + $self->{current_info} = "multi_mid $oid"; > + reindex_oid_m($self, $sync, $git, $oid); > + } > + $git->cleanup if $git; > } > $self->done; I suppose we could easily write to a temporary file and use fixed offsets, too; but fixed offsets would make git + SHA-256 more difficult. So maybe Tie::File (stdlib) or a SQLite-based stack could work, too... ^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH 4/3] v2writable: move git->cleanup to the correct place 2019-10-21 11:02 ` [PATCH 3/3] v2writable: reindex handles 3-headered monsters Eric Wong 2019-10-21 11:34 ` Eric Wong @ 2019-10-22 1:28 ` Eric Wong 2019-10-22 1:29 ` [PATCH 5/3] v2writable: use msgmap as multi_mid queue Eric Wong 1 sibling, 1 reply; 11+ messages in thread From: Eric Wong @ 2019-10-22 1:28 UTC (permalink / raw) To: meta We need to stop the git process to avoid leaking FDs to Xapian if we recurse ->index_sync on reindex. --- lib/PublicInbox/V2Writable.pm | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm index 7ece6b01..33c0038d 100644 --- a/lib/PublicInbox/V2Writable.pm +++ b/lib/PublicInbox/V2Writable.pm @@ -1280,8 +1280,8 @@ sub index_sync { $self->{current_info} = "multi_mid $oid"; reindex_oid_m($self, $sync, $git, $oid); } - $git->cleanup if $git; } + $git->cleanup if $git; $self->done; if (my $nr = $sync->{nr}) { ^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCH 5/3] v2writable: use msgmap as multi_mid queue 2019-10-22 1:28 ` [PATCH 4/3] v2writable: move git->cleanup to the correct place Eric Wong @ 2019-10-22 1:29 ` Eric Wong 0 siblings, 0 replies; 11+ messages in thread From: Eric Wong @ 2019-10-22 1:29 UTC (permalink / raw) To: meta Instead of storing Message-IDs in the Msgmap object, we can store the blob OID. For initial indexing of mirrors, this lets us preserve $sync->{regen} by storing the intended article number in the queue. On --reindex, the article number we store in Msgmap is ignored but only used for ordering purposes. This also allows us to avoid ENOMEM errors if somebody abuses our system by reusing Message-IDs; but we now risk ENOSPC instead (but systems tend to have more FS storage than RAM). --- lib/PublicInbox/V2Writable.pm | 120 ++++++++++++++++++++++------------ 1 file changed, 79 insertions(+), 41 deletions(-) diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm index 33c0038d..ad2e8e62 100644 --- a/lib/PublicInbox/V2Writable.pm +++ b/lib/PublicInbox/V2Writable.pm @@ -19,6 +19,7 @@ use PublicInbox::Msgmap; use PublicInbox::Spawn qw(spawn); use PublicInbox::SearchIdx; use IO::Handle; +use File::Temp qw(tempfile); # an estimate of the post-packed size to the raw uncompressed size my $PACKING_FACTOR = 0.4; @@ -763,7 +764,6 @@ sub import_init { # XXX experimental sub diff ($$$) { my ($mid, $cur, $new) = @_; - use File::Temp qw(tempfile); my ($ah, $an) = tempfile('email-cur-XXXXXXXX', TMPDIR => 1); print $ah $cur->as_string or die "print: $!"; @@ -859,8 +859,9 @@ sub reindex_checkpoint ($$$) { } # only for a few odd messages with multiple Message-IDs -sub reindex_oid_m ($$$$) { - my ($self, $sync, $git, $oid) = @_; +sub reindex_oid_m ($$$$;$) { + my ($self, $sync, $git, $oid, $regen_num) = @_; + $self->{current_info} = "multi_mid $oid"; my ($num, $mid0, $len); my $msgref = $git->cat_file($oid, \$len); my $mime = PublicInbox::MIME->new($$msgref); @@ -872,49 +873,51 @@ sub reindex_oid_m ($$$$) { delete($sync->{D}->{"$mid\0$cid"}) and die "BUG: reindex_oid should handle <$mid> delete"; } + my $over = $self->{over}; for my $mid (reverse @$mids) { - ($num, $mid0) = $self->{over}->num_mid0_for_oid($oid, $mid); - last if defined $num; + ($num, $mid0) = $over->num_mid0_for_oid($oid, $mid); + next unless defined $num; + if (defined($regen_num) && $regen_num != $num) { + die "BUG: regen(#$regen_num) != over(#$num)"; + } } unless (defined($num)) { for my $mid (reverse @$mids) { # is this a number we got before? - $num = $sync->{mm_tmp}->num_for($mid); - next unless defined $num; - $mid0 = $mid; + my $n = $sync->{mm_tmp}->num_for($mid); + next unless defined $n; + next if defined($regen_num) && $regen_num != $n; + ($num, $mid0) = ($n, $mid); last; } } if (defined($num)) { $sync->{mm_tmp}->num_delete($num); - } else { - $num = $sync->{regen}--; - if ($num <= 0) { - # fixup bugs in old mirrors on reindex - for my $mid (reverse @$mids) { - $num = $self->{mm}->mid_insert($mid); - next unless defined $num; - $mid0 = $mid; - last; - } - if (defined $mid0) { - if ($sync->{reindex}) { - warn "reindex added #$num <$mid0>\n"; - } - } else { - warn "E: cannot find article #\n"; - return; - } - } else { # $num > 0, use the new article number - for my $mid (reverse @$mids) { - $self->{mm}->mid_set($num, $mid) == 1 or next; - $mid0 = $mid; - last; - } - unless (defined $mid0) { - warn "E: cannot regen #$num\n"; - return; + } elsif (defined $regen_num) { + $num = $regen_num; + for my $mid (reverse @$mids) { + $self->{mm}->mid_set($num, $mid) == 1 or next; + $mid0 = $mid; + last; + } + unless (defined $mid0) { + warn "E: cannot regen #$num\n"; + return; + } + } else { # fixup bugs in old mirrors on reindex + for my $mid (reverse @$mids) { + $num = $self->{mm}->mid_insert($mid); + next unless defined $num; + $mid0 = $mid; + last; + } + if (defined $mid0) { + if ($sync->{reindex}) { + warn "reindex added #$num <$mid0>\n"; } + } else { + warn "E: cannot find article #\n"; + return; } } $sync->{nr}++; @@ -935,6 +938,30 @@ sub check_unindexed ($$$) { } } +# reuse Msgmap to store num => oid mapping (rather than num => mid) +sub multi_mid_q_new () { + my ($fh, $fn) = tempfile('multi_mid-XXXXXXX', EXLOCK => 0, TMPDIR => 1); + my $multi_mid = PublicInbox::Msgmap->new_file($fn, 1); + $multi_mid->{dbh}->do('PRAGMA synchronous = OFF'); + # for Msgmap->DESTROY: + $multi_mid->{tmp_name} = $fn; + $multi_mid->{pid} = $$; + close $fh or die "failed to close $fn: $!"; + $multi_mid +} + +sub multi_mid_q_push ($$) { + my ($sync, $oid) = @_; + my $multi_mid = $sync->{multi_mid} //= multi_mid_q_new(); + if ($sync->{reindex}) { # no regen on reindex + $multi_mid->mid_insert($oid); + } else { + my $num = $sync->{regen}--; + die "BUG: ran out of article numbers" if $num <= 0; + $multi_mid->mid_set($num, $oid); + } +} + sub reindex_oid ($$$$) { my ($self, $sync, $git, $oid) = @_; my ($num, $mid0, $len); @@ -966,7 +993,7 @@ sub reindex_oid ($$$$) { check_unindexed($self, $num, $mid0); } else { $num = $sync->{regen}--; - die "BUG: ran out of article numbers\n" if $num <= 0; + die "BUG: ran out of article numbers" if $num <= 0; if ($self->{mm}->mid_set($num, $mid) != 1) { warn "E: unable to assign $num => <$mid>\n"; return; @@ -983,7 +1010,7 @@ sub reindex_oid ($$$$) { # do not delete from {mm_tmp}, since another # single-MID message may use it. } else { # handle them at the end: - push @{$sync->{multi_mid} //= []}, $oid; + multi_mid_q_push($sync, $oid); } return; } @@ -1275,10 +1302,21 @@ sub index_sync { } if (my $multi_mid = delete $sync->{multi_mid}) { $git //= $self->{-inbox}->git; - - while (defined(my $oid = pop(@$multi_mid))) { - $self->{current_info} = "multi_mid $oid"; - reindex_oid_m($self, $sync, $git, $oid); + my ($min, $max) = $multi_mid->minmax; + if ($sync->{reindex}) { + # we may need to create new Message-IDs if mirrors + # were initially indexed with old versions + for (my $i = $max; $i >= $min; $i--) { + my $oid = $multi_mid->mid_for($i); + next unless defined $oid; + reindex_oid_m($self, $sync, $git, $oid); + } + } else { # regen on initial index + for my $num ($min..$max) { + my $oid = $multi_mid->mid_for($num); + next unless defined $oid; + reindex_oid_m($self, $sync, $git, $oid, $num); + } } } $git->cleanup if $git; ^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PUSHED] fix reindex on multiple + overlapping Message-IDs 2019-10-21 11:02 ` [PATCH 0/3] fix reindex on multiple + overlapping Message-IDs Eric Wong ` (2 preceding siblings ...) 2019-10-21 11:02 ` [PATCH 3/3] v2writable: reindex handles 3-headered monsters Eric Wong @ 2019-10-23 18:19 ` Eric Wong 3 siblings, 0 replies; 11+ messages in thread From: Eric Wong @ 2019-10-23 18:19 UTC (permalink / raw) To: meta Eric Wong <e@80x24.org> wrote: > I'm still testing this with various repos, but the > linux-renesas-soc mirror from lore seems good when doing initial > indexing with an old version of this and using --reindex with > this patch, as all previously-missed messages are now indexed. Seems fine and pushed all 5 patches. However, display issues still need to be fixed in the HTML... > Eric Wong (3): > v2writable: set unindexed article number > v2writable: improve "num_for" API and disambiguate > v2writable: reindex handles 3-headered monsters v2writable: move git->cleanup to the correct place v2writable: use msgmap as multi_mid queue ^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2019-10-23 18:19 UTC | newest] Thread overview: 11+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2019-10-16 21:14 BUG: --reindex broken on multiple Message-IDs reuse Eric Wong 2019-10-17 11:22 ` [RFC] v2writable: reindex handles 3-headed monsters [was: BUG: --reindex broken on multiple Message-IDs reuse] Eric Wong 2019-10-22 8:09 ` [RFC/HELP] search: multiple From/To/Cc/Subject (what about Date?) Eric Wong 2019-10-21 11:02 ` [PATCH 0/3] fix reindex on multiple + overlapping Message-IDs Eric Wong 2019-10-21 11:02 ` [PATCH 1/3] v2writable: set unindexed article number Eric Wong 2019-10-21 11:02 ` [PATCH 2/3] v2writable: improve "num_for" API and disambiguate Eric Wong 2019-10-21 11:02 ` [PATCH 3/3] v2writable: reindex handles 3-headered monsters Eric Wong 2019-10-21 11:34 ` Eric Wong 2019-10-22 1:28 ` [PATCH 4/3] v2writable: move git->cleanup to the correct place Eric Wong 2019-10-22 1:29 ` [PATCH 5/3] v2writable: use msgmap as multi_mid queue Eric Wong 2019-10-23 18:19 ` [PUSHED] fix reindex on multiple + overlapping Message-IDs Eric Wong
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).