From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net
X-Spam-Level:
X-Spam-Status: No, score=-4.0 required=3.0 tests=ALL_TRUSTED,BAYES_00
	shortcircuit=no autolearn=ham autolearn_force=no version=3.4.2
Received: from localhost (dcvr.yhbt.net [127.0.0.1])
	by dcvr.yhbt.net (Postfix) with ESMTP id 0BFF81F4B4;
	Tue, 5 Jan 2021 01:29:11 +0000 (UTC)
Date: Tue, 5 Jan 2021 01:29:10 +0000
From: Eric Wong
To: meta@public-inbox.org
Subject: [PATCH] v2writable: exact discontiguous history handling
Message-ID: <20210105012910.GA16722@dcvr>
References: <20201221212032.syunaxzrvcqcrose@chatter.i7.local>
	<20201221213914.GA9374@dcvr>
	<20201222062808.GA4522@dcvr>
	<20201228162218.zcnqxkgwa2i3nt66@chatter.i7.local>
	<20201228213139.GA17600@dcvr>
	<20210104201245.cbtqno6cyxw5iycu@chatter.i7.local>
	<20210105010643.GA20926@dcvr>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <20210105010643.GA20926@dcvr>
List-Id:

Eric Wong wrote:
> That would allow the new version of the edited message to be
> piped and seen by NNTP/IMAP readers.
>
> You *do* want to pipe the new version of the message you've
> edited, right?

---------8<--------
Subject: [PATCH] v2writable: exact discontiguous history handling

We've always temporarily unindexed messages before reindexing
them again if there's discontiguous history.

This change improves the mechanism we use to prevent NNTP and
IMAP clients from seeing duplicate messages.

Previously, we relied on mapping Message-IDs to NNTP article
numbers to ensure clients would not see the same message twice.
This worked for most messages, but not for messages with
reused or duplicate Message-IDs.

Instead of relying on Message-IDs as a key, we now rely on the
git blob object ID for exact content matching.  This allows
truly different messages to show up for NNTP|IMAP clients, while
still preventing those clients from seeing the same message again.
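
To illustrate the keying change described above, here is a minimal
standalone sketch.  It is not the V2Writable.pm code itself: the
helper names remember_unindexed and reuse_unindexed, the fake OIDs
and the Message-ID are hypothetical, and only the shape of the
mapping ($oidbin => [$num, $mid0, ...]) follows this patch.

```perl
use strict;
use warnings;

# Hypothetical helpers, not part of V2Writable.pm.
my %unindexed; # $oidbin => [ $num, $mid0, $num, $mid0, ... ]

# Remember the article number and Message-ID a blob had before unindexing.
sub remember_unindexed {
	my ($unindexed, $oid, $num, $mid0) = @_;
	my $oidbin = pack('H*', $oid); # 40 hex chars => 20 raw bytes
	push @{$unindexed->{$oidbin}}, $num + 0, $mid0;
}

# Take back the oldest remembered (num, mid0) pair for a blob, if any.
sub reuse_unindexed {
	my ($unindexed, $oid) = @_;
	my $oidbin = pack('H*', $oid);
	my $u = $unindexed->{$oidbin} or return;
	my ($num, $mid0) = splice(@$u, 0, 2);
	delete $unindexed->{$oidbin} if !@$u; # done with this OID
	($num, $mid0);
}

# Two different blobs sharing one Message-ID keep distinct numbers.
my $oid_a = 'a' x 40; # made-up blob OIDs for illustration
my $oid_b = 'b' x 40;
remember_unindexed(\%unindexed, $oid_a, 1, 'dup@example');
remember_unindexed(\%unindexed, $oid_b, 2, 'dup@example');
my @a = reuse_unindexed(\%unindexed, $oid_a);
my @b = reuse_unindexed(\%unindexed, $oid_b);
my @c = reuse_unindexed(\%unindexed, $oid_a); # nothing left for this OID
```

With a Message-ID key, the second remember_unindexed call would have
overwritten the first entry; keyed by blob OID, both article numbers
survive and each is handed back only to the exact content it came from.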
---
 lib/PublicInbox/V2Writable.pm | 21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index 459c7e86..54004fd7 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -888,12 +888,16 @@ sub index_oid { # cat_async callback
 	}
 
 	# {unindexed} is unlikely
-	if ((my $unindexed = $arg->{unindexed}) && scalar(@$mids) == 1) {
-		$num = delete($unindexed->{$mids->[0]});
+	if (my $unindexed = $arg->{unindexed}) {
+		my $oidbin = pack('H*', $oid);
+		my $u = $unindexed->{$oidbin};
+		($num, $mid0) = splice(@$u, 0, 2) if $u;
 		if (defined $num) {
-			$mid0 = $mids->[0];
 			$self->{mm}->mid_set($num, $mid0);
-			delete($arg->{unindexed}) if !keys(%$unindexed);
+			if (scalar(@$u) == 0) { # done with current OID
+				delete $unindexed->{$oidbin};
+				delete($arg->{unindexed}) if !keys(%$unindexed);
+			}
 		}
 	}
 	if (!defined($num)) { # reuse if reindexing (or duplicates)
@@ -1160,10 +1164,13 @@ sub unindex_oid ($$;$) { # git->cat_async callback
 		warn "BUG: multiple articles linked to $oid\n",
 			join(',',sort keys %gone), "\n";
 	}
-	foreach my $num (keys %gone) {
+	# reuse (num => mid) mapping in ascending numeric order
+	for my $num (sort { $a <=> $b } keys %gone) {
+		$num += 0;
 		if ($unindexed) {
 			my $mid0 = $mm->mid_for($num);
-			$unindexed->{$mid0} = $num;
+			my $oidbin = pack('H*', $oid);
+			push @{$unindexed->{$oidbin}}, $num, $mid0;
 		}
 		$mm->num_delete($num);
 	}
@@ -1179,7 +1186,7 @@ sub git { $_[0]->{ibx}->git }
 sub unindex_todo ($$$) {
 	my ($self, $sync, $unit) = @_;
 	my $unindex_range = delete($unit->{unindex_range}) // return;
-	my $unindexed = $sync->{unindexed} //= {}; # $mid0 => $num
+	my $unindexed = $sync->{unindexed} //= {}; # $oidbin => [$num, $mid0]
 	my $before = scalar keys %$unindexed;
 	# order does not matter, here:
 	my $fh = $unit->{git}->popen(qw(log --raw -r --no-notes --no-color