unofficial mirror of meta@public-inbox.org
* Removing an inbox from extindex
@ 2021-08-30 19:17 Konstantin Ryabitsev
  2021-08-30 19:27 ` Eric Wong
  0 siblings, 1 reply; 5+ messages in thread
From: Konstantin Ryabitsev @ 2021-08-30 19:17 UTC (permalink / raw)
  To: meta

Hello:

What's the proper procedure to remove an inbox from extindex? For example, if
I wanted to drop a source from being replicated to lore.kernel.org, how do I
properly remove it from /all/ search results?

-K

* Re: Removing an inbox from extindex
  2021-08-30 19:17 Removing an inbox from extindex Konstantin Ryabitsev
@ 2021-08-30 19:27 ` Eric Wong
  2021-08-30 20:17   ` Konstantin Ryabitsev
  0 siblings, 1 reply; 5+ messages in thread
From: Eric Wong @ 2021-08-30 19:27 UTC (permalink / raw)
  To: Konstantin Ryabitsev; +Cc: meta

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> Hello:
> 
> What's the proper procedure to remove an inbox from extindex? For example, if
> I wanted to drop a source from being replicated to lore.kernel.org, how do I
> properly remove it from /all/ search results?

public-inbox-extindex --gc $EXTINDEX_DIR

I just realized I forgot to put it into the manpage,
but there are tests for it.
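
As a rough sketch of the full procedure (not an official recipe): the --gc
run can only drop an inbox that is no longer configured, so this assumes the
inbox is first removed from the config file that extindex reads. The function
name `removal_commands`, the use of `git config --remove-section`, and the
`publicinbox.<name>` section layout are assumptions for illustration; only the
`public-inbox-extindex --gc` invocation comes from this thread.

```python
def removal_commands(config_path, inbox_name, extindex_dir):
    """Return the two argv lists for dropping an inbox and garbage-collecting.

    Nothing is executed here; pass each list to subprocess.run(..., check=True)
    after reviewing it.
    """
    return [
        # Assumed step: delete the [publicinbox "<name>"] section from the config.
        ["git", "config", "--file", config_path,
         "--remove-section", f"publicinbox.{inbox_name}"],
        # The command given in this reply.
        ["public-inbox-extindex", "--gc", extindex_dir],
    ]
```

Returning the commands instead of running them keeps the sketch safe to try
against a real config path before anything is deleted.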

* Re: Removing an inbox from extindex
  2021-08-30 19:27 ` Eric Wong
@ 2021-08-30 20:17   ` Konstantin Ryabitsev
  2021-08-30 20:22     ` Eric Wong
  2021-09-01  0:17     ` [PATCH] extindex: --gc removes messages from over, too Eric Wong
  0 siblings, 2 replies; 5+ messages in thread
From: Konstantin Ryabitsev @ 2021-08-30 20:17 UTC (permalink / raw)
  To: Eric Wong; +Cc: meta

On Mon, Aug 30, 2021 at 07:27:32PM +0000, Eric Wong wrote:
> Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> > Hello:
> > 
> > What's the proper procedure to remove an inbox from extindex? For example, if
> > I wanted to drop a source from being replicated to lore.kernel.org, how do I
> > properly remove it from /all/ search results?
> 
> public-inbox-extindex --gc $EXTINDEX_DIR

This works, though there are some interesting side-effects to it. For example,
I removed the /gitolite-transparency-log/* feed (it was too noisy and wasn't
useful for anything, really). After running the --gc, I see lots of messages
about things being removed, but /all/ still contains leftover entries from the
inbox:

https://x-lore.kernel.org/all/?t=20210830182453

(look for "post-receive:" in the page)

Clicking on the link returns a thread with blank messages (but not all of
them).

This is not a huge problem, as the messages are for sure gone from search
results, but the behaviour is a bit odd.

Thanks for your help,
-K

* Re: Removing an inbox from extindex
  2021-08-30 20:17   ` Konstantin Ryabitsev
@ 2021-08-30 20:22     ` Eric Wong
  2021-09-01  0:17     ` [PATCH] extindex: --gc removes messages from over, too Eric Wong
  1 sibling, 0 replies; 5+ messages in thread
From: Eric Wong @ 2021-08-30 20:22 UTC (permalink / raw)
  To: Konstantin Ryabitsev; +Cc: meta

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> On Mon, Aug 30, 2021 at 07:27:32PM +0000, Eric Wong wrote:
> > Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> > > Hello:
> > > 
> > > What's the proper procedure to remove an inbox from extindex? For example, if
> > > I wanted to drop a source from being replicated to lore.kernel.org, how do I
> > > properly remove it from /all/ search results?
> > 
> > public-inbox-extindex --gc $EXTINDEX_DIR
> 
> This works, though there are some interesting side-effects to it. For example,
> I removed the /gitolite-transparency-log/* feed (it was too noisy and wasn't
> useful for anything, really). After running the --gc, I see lots of messages
> about things being removed, but /all/ still contains leftover entries from the
> inbox:

Yeah, it needs some work...  I mostly forgot about it :x
(And somehow, I completely forgot about the existence of
 syslog at some point :x)

* [PATCH] extindex: --gc removes messages from over, too
  2021-08-30 20:17   ` Konstantin Ryabitsev
  2021-08-30 20:22     ` Eric Wong
@ 2021-09-01  0:17     ` Eric Wong
  1 sibling, 0 replies; 5+ messages in thread
From: Eric Wong @ 2021-09-01  0:17 UTC (permalink / raw)
  To: Konstantin Ryabitsev; +Cc: meta

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> This works, though there are some interesting side-effects to it. For example,
> I removed the /gitolite-transparency-log/* feed (it was too noisy and wasn't
> useful for anything, really). After running the --gc, I see lots of messages
> about things being removed, but /all/ still contains leftover entries from the
> inbox:
> 
> https://x-lore.kernel.org/all/?t=20210830182453
> 
> (look for "post-receive:" in the page)
> 
> Clicking on the link returns a thread with blank messages (but not all of
> them).
> 
> This is not a huge problem, as the messages are for sure gone from search
> results, but the behaviour is a bit odd.

Thanks, that was a bug, but not a data-loss one (nor the biting
ones that get worse every summer...):
----------8<---------
Subject: [PATCH] extindex: --gc removes messages from over, too

While --gc removed messages belonging to removed inboxes from
Xapian search, it failed to remove them from over.sqlite3
entirely.  With this change, they no longer show up in the
topic summary view.

Reported-by: Konstantin Ryabitsev <konstantin@linuxfoundation.org>
Link: https://public-inbox.org/20210830201723.dehoul4y6gpqf2cp@nitro.local/
---
 lib/PublicInbox/ExtSearchIdx.pm | 19 +++++-----
 t/extsearch.t                   | 61 +++++++++++++++++++++++++++++++++
 2 files changed, 72 insertions(+), 8 deletions(-)

diff --git a/lib/PublicInbox/ExtSearchIdx.pm b/lib/PublicInbox/ExtSearchIdx.pm
index cf61237c..8cdad23d 100644
--- a/lib/PublicInbox/ExtSearchIdx.pm
+++ b/lib/PublicInbox/ExtSearchIdx.pm
@@ -343,20 +343,18 @@ sub _sync_inbox ($$$) {
 
 sub gc_unref_doc ($$$$) {
 	my ($self, $ibx_id, $eidx_key, $docid) = @_;
-	my $dbh = $self->{oidx}->dbh;
-
+	my $remain = 0;
 	# for debug/info purposes, oids may no longer be accessible
+	my $dbh = $self->{oidx}->dbh;
 	my $sth = $dbh->prepare_cached(<<'', undef, 1);
 SELECT oidbin FROM xref3 WHERE docid = ? AND ibx_id = ?
 
 	$sth->execute($docid, $ibx_id);
 	my @oid = map { unpack('H*', $_->[0]) } @{$sth->fetchall_arrayref};
-
-	$dbh->prepare_cached(<<'')->execute($docid, $ibx_id);
-DELETE FROM xref3 WHERE docid = ? AND ibx_id = ?
-
-	my $remain = $self->{oidx}->get_xref3($docid);
-	if (scalar(@$remain)) {
+	for my $oid (@oid) {
+		$remain += $self->{oidx}->remove_xref3($docid, $oid, $eidx_key);
+	}
+	if ($remain) {
 		$self->{oidx}->eidxq_add($docid); # enqueue for reindex
 		for my $oid (@oid) {
 			warn "I: unref #$docid $eidx_key $oid\n";
@@ -421,6 +419,11 @@ DELETE FROM xref3 WHERE docid NOT IN (SELECT num FROM over)
 
 	warn "I: eliminated $nr stale xref3 entries\n" if $nr != 0;
 
+	# fixup from old bugs:
+	$nr = $dbh->do(<<'');
+DELETE FROM over WHERE num NOT IN (SELECT docid FROM xref3)
+
+	warn "I: eliminated $nr stale over entries\n" if $nr != 0;
 	done($self);
 }
 
diff --git a/t/extsearch.t b/t/extsearch.t
index b03adc17..ad4f2c6d 100644
--- a/t/extsearch.t
+++ b/t/extsearch.t
@@ -466,4 +466,65 @@ SKIP: {
 		'--gc works after compact');
 }
 
+{ # ensure --gc removes non-xposted messages
+	my $old_size = -s $cfg_path // xbail "stat $cfg_path $!";
+	my $tmp_addr = 'v2tmp@example.com';
+	run_script([qw(-init v2tmp --indexlevel basic
+		--newsgroup v2tmp.example),
+		"$home/v2tmp", 'http://example.com/v2tmp', $tmp_addr ])
+		or xbail '-init';
+	$env = { ORIGINAL_RECIPIENT => $tmp_addr };
+	open $fh, '+>', undef or xbail "open $!";
+	$fh->autoflush(1);
+	my $mid = 'tmpmsg@example.com';
+	print $fh <<EOM or xbail "print $!";
+From: b\@z
+To: b\@r
+Message-Id: <$mid>
+Subject: tmpmsg
+Date: Tue, 19 Jan 2038 03:14:07 +0000
+
+EOM
+	seek $fh, 0, SEEK_SET or xbail "seek $!";
+	run_script([qw(-mda --no-precheck)], $env, {0 => $fh}) or xbail '-mda';
+	ok(run_script([qw(-extindex --all), "$home/extindex"]), 'update');
+	my $nr;
+	{
+		my $es = PublicInbox::ExtSearch->new("$home/extindex");
+		my ($id, $prv);
+		my $smsg = $es->over->next_by_mid($mid, \$id, \$prv);
+		ok($smsg, 'tmpmsg indexed');
+		my $mset = $es->search->mset("mid:$mid");
+		is($mset->size, 1, 'new message found');
+		$mset = $es->search->mset('z:0..');
+		$nr = $mset->size;
+	}
+	truncate($cfg_path, $old_size) or xbail "truncate $!";
+	my $rdr = { 2 => \(my $err) };
+	ok(run_script([qw(-extindex --gc), "$home/extindex"], undef, $rdr),
+		'gc to get rid of removed inbox');
+	is_deeply([ grep(!/^(?:I:|#)/, split(/^/m, $err)) ], [],
+		'no non-informational errors in stderr');
+
+	my $es = PublicInbox::ExtSearch->new("$home/extindex");
+	my $mset = $es->search->mset("mid:$mid");
+	is($mset->size, 0, 'tmpmsg gone from search');
+	my ($id, $prv);
+	is($es->over->next_by_mid($mid, \$id, \$prv), undef,
+		'tmpmsg gone from over');
+	$id = $prv = undef;
+	is($es->over->next_by_mid('testmessage@example.com', \$id, \$prv),
+		undef, 'remaining message not indavderover');
+	$mset = $es->search->mset('z:0..');
+	is($mset->size, $nr - 1, 'existing messages not clobbered from search');
+	my $o = $es->over->{dbh}->selectall_arrayref(<<EOM);
+SELECT num FROM over ORDER BY num
+EOM
+	is(scalar(@$o), $mset->size, 'over row count matches Xapian');
+	my $x = $es->over->{dbh}->selectall_arrayref(<<EOM);
+SELECT DISTINCT(docid) FROM xref3 ORDER BY docid
+EOM
+	is_deeply($x, $o, 'xref3 and over docids match');
+}
+
 done_testing;
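
The invariant this patch restores (every row in `over` has at least one row in
`xref3`, and vice versa) can be tried out in isolation. The following is only
a sketch using Python's sqlite3 module and a heavily simplified, assumed
schema; the real over.sqlite3 tables have more columns, and `gc_inbox` and
`demo` are hypothetical names, not part of public-inbox. The second DELETE is
the statement added by the patch.

```python
import sqlite3


def gc_inbox(dbh, ibx_id):
    """Drop all xref3 rows for one inbox, then drop orphaned over rows.

    Returns the number of over rows that were eliminated.
    """
    dbh.execute("DELETE FROM xref3 WHERE ibx_id = ?", (ibx_id,))
    # Same statement as the "fixup from old bugs" hunk in the patch.
    cur = dbh.execute(
        "DELETE FROM over WHERE num NOT IN (SELECT docid FROM xref3)")
    dbh.commit()
    return cur.rowcount


def demo():
    dbh = sqlite3.connect(":memory:")
    # Simplified stand-ins for the real tables (assumed columns only).
    dbh.execute("CREATE TABLE over (num INTEGER PRIMARY KEY)")
    dbh.execute(
        "CREATE TABLE xref3 (docid INTEGER, ibx_id INTEGER, oidbin BLOB)")
    dbh.executemany("INSERT INTO over (num) VALUES (?)", [(1,), (2,), (3,)])
    dbh.executemany("INSERT INTO xref3 VALUES (?, ?, ?)", [
        (1, 10, b"\x01"),                    # only in inbox 10
        (2, 10, b"\x02"), (2, 20, b"\x02"),  # cross-posted to 10 and 20
        (3, 20, b"\x03"),                    # only in inbox 20
    ])
    stale = gc_inbox(dbh, 20)  # inbox 20 was removed from the config
    over = [r[0] for r in dbh.execute("SELECT num FROM over ORDER BY num")]
    xref = [r[0] for r in dbh.execute(
        "SELECT DISTINCT docid FROM xref3 ORDER BY docid")]
    return stale, over, xref
```

Calling `demo()` removes the message that existed only in the dropped inbox
and keeps the cross-posted one, which mirrors the final `is_deeply($x, $o, ...)`
check in the test above.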
