unofficial mirror of meta@public-inbox.org
 help / color / mirror / Atom feed
From: Eric Wong <e@80x24.org>
To: Konstantin Ryabitsev <konstantin@linuxfoundation.org>
Cc: meta@public-inbox.org
Subject: [PATCH] extindex: --gc removes messages from over, too
Date: Wed, 1 Sep 2021 00:17:32 +0000	[thread overview]
Message-ID: <20210901001732.GA18457@dcvr> (raw)
In-Reply-To: <20210830201723.dehoul4y6gpqf2cp@nitro.local>

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> This works, though there are some interesting side-effects to it. For example,
> I removed the /gitolite-transparency-log/* feed (it was too noisy and wasn't
> useful for anything, really). After running the --gc, I see lots of messages
> about things being removed, but /all/ still contains leftover entries from the
> inbox:
> 
> https://x-lore.kernel.org/all/?t=20210830182453
> 
> (look for "post-receive:" in the page)
> 
> Clicking on the link returns a thread with blank messages (but not all of
> them).
> 
> This is not a huge problem, as the messages are for sure gone from search
> results, but the behaviour is a bit odd.

Thanks, that was a bug, but not a data-loss one (nor the biting
ones that get worse every summer...):
----------8<---------
Subject: [PATCH] extindex: --gc removes messages from over, too

While messages from removed inboxes were removed from Xapian
search, --gc failed to remove messages from over.sqlite3
entirely.  They no longer show up in the topic summary view.

Reported-by: Konstantin Ryabitsev <konstantin@linuxfoundation.org>
Link: https://public-inbox.org/20210830201723.dehoul4y6gpqf2cp@nitro.local/
---
 lib/PublicInbox/ExtSearchIdx.pm | 19 +++++-----
 t/extsearch.t                   | 61 +++++++++++++++++++++++++++++++++
 2 files changed, 72 insertions(+), 8 deletions(-)

diff --git a/lib/PublicInbox/ExtSearchIdx.pm b/lib/PublicInbox/ExtSearchIdx.pm
index cf61237c..8cdad23d 100644
--- a/lib/PublicInbox/ExtSearchIdx.pm
+++ b/lib/PublicInbox/ExtSearchIdx.pm
@@ -343,20 +343,18 @@ sub _sync_inbox ($$$) {
 
 sub gc_unref_doc ($$$$) {
 	my ($self, $ibx_id, $eidx_key, $docid) = @_;
-	my $dbh = $self->{oidx}->dbh;
-
+	my $remain = 0;
 	# for debug/info purposes, oids may no longer be accessible
+	my $dbh = $self->{oidx}->dbh;
 	my $sth = $dbh->prepare_cached(<<'', undef, 1);
 SELECT oidbin FROM xref3 WHERE docid = ? AND ibx_id = ?
 
 	$sth->execute($docid, $ibx_id);
 	my @oid = map { unpack('H*', $_->[0]) } @{$sth->fetchall_arrayref};
-
-	$dbh->prepare_cached(<<'')->execute($docid, $ibx_id);
-DELETE FROM xref3 WHERE docid = ? AND ibx_id = ?
-
-	my $remain = $self->{oidx}->get_xref3($docid);
-	if (scalar(@$remain)) {
+	for my $oid (@oid) {
+		$remain += $self->{oidx}->remove_xref3($docid, $oid, $eidx_key);
+	}
+	if ($remain) {
 		$self->{oidx}->eidxq_add($docid); # enqueue for reindex
 		for my $oid (@oid) {
 			warn "I: unref #$docid $eidx_key $oid\n";
@@ -421,6 +419,11 @@ DELETE FROM xref3 WHERE docid NOT IN (SELECT num FROM over)
 
 	warn "I: eliminated $nr stale xref3 entries\n" if $nr != 0;
 
+	# fixup from old bugs:
+	$nr = $dbh->do(<<'');
+DELETE FROM over WHERE num NOT IN (SELECT docid FROM xref3)
+
+	warn "I: eliminated $nr stale over entries\n" if $nr != 0;
 	done($self);
 }
 
diff --git a/t/extsearch.t b/t/extsearch.t
index b03adc17..ad4f2c6d 100644
--- a/t/extsearch.t
+++ b/t/extsearch.t
@@ -466,4 +466,65 @@ SKIP: {
 		'--gc works after compact');
 }
 
+{ # ensure --gc removes non-xposted messages
+	my $old_size = -s $cfg_path // xbail "stat $cfg_path $!";
+	my $tmp_addr = 'v2tmp@example.com';
+	run_script([qw(-init v2tmp --indexlevel basic
+		--newsgroup v2tmp.example),
+		"$home/v2tmp", 'http://example.com/v2tmp', $tmp_addr ])
+		or xbail '-init';
+	$env = { ORIGINAL_RECIPIENT => $tmp_addr };
+	open $fh, '+>', undef or xbail "open $!";
+	$fh->autoflush(1);
+	my $mid = 'tmpmsg@example.com';
+	print $fh <<EOM or xbail "print $!";
+From: b\@z
+To: b\@r
+Message-Id: <$mid>
+Subject: tmpmsg
+Date: Tue, 19 Jan 2038 03:14:07 +0000
+
+EOM
+	seek $fh, 0, SEEK_SET or xbail "seek $!";
+	run_script([qw(-mda --no-precheck)], $env, {0 => $fh}) or xbail '-mda';
+	ok(run_script([qw(-extindex --all), "$home/extindex"]), 'update');
+	my $nr;
+	{
+		my $es = PublicInbox::ExtSearch->new("$home/extindex");
+		my ($id, $prv);
+		my $smsg = $es->over->next_by_mid($mid, \$id, \$prv);
+		ok($smsg, 'tmpmsg indexed');
+		my $mset = $es->search->mset("mid:$mid");
+		is($mset->size, 1, 'new message found');
+		$mset = $es->search->mset('z:0..');
+		$nr = $mset->size;
+	}
+	truncate($cfg_path, $old_size) or xbail "truncate $!";
+	my $rdr = { 2 => \(my $err) };
+	ok(run_script([qw(-extindex --gc), "$home/extindex"], undef, $rdr),
+		'gc to get rid of removed inbox');
+	is_deeply([ grep(!/^(?:I:|#)/, split(/^/m, $err)) ], [],
+		'no non-informational errors in stderr');
+
+	my $es = PublicInbox::ExtSearch->new("$home/extindex");
+	my $mset = $es->search->mset("mid:$mid");
+	is($mset->size, 0, 'tmpmsg gone from search');
+	my ($id, $prv);
+	is($es->over->next_by_mid($mid, \$id, \$prv), undef,
+		'tmpmsg gone from over');
+	$id = $prv = undef;
+	is($es->over->next_by_mid('testmessage@example.com', \$id, \$prv),
+		undef, 'remaining message not indavderover');
+	$mset = $es->search->mset('z:0..');
+	is($mset->size, $nr - 1, 'existing messages not clobbered from search');
+	my $o = $es->over->{dbh}->selectall_arrayref(<<EOM);
+SELECT num FROM over ORDER BY num
+EOM
+	is(scalar(@$o), $mset->size, 'over row count matches Xapian');
+	my $x = $es->over->{dbh}->selectall_arrayref(<<EOM);
+SELECT DISTINCT(docid) FROM xref3 ORDER BY docid
+EOM
+	is_deeply($x, $o, 'xref3 and over docids match');
+}
+
 done_testing;

      parent reply	other threads:[~2021-09-01  0:17 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-08-30 19:17 Removing an inbox from extindex Konstantin Ryabitsev
2021-08-30 19:27 ` Eric Wong
2021-08-30 20:17   ` Konstantin Ryabitsev
2021-08-30 20:22     ` Eric Wong
2021-09-01  0:17     ` Eric Wong [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://public-inbox.org/README

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210901001732.GA18457@dcvr \
    --to=e@80x24.org \
    --cc=konstantin@linuxfoundation.org \
    --cc=meta@public-inbox.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).