unofficial mirror of meta@public-inbox.org
From: Eric Wong <e@80x24.org>
To: meta@public-inbox.org
Subject: [PATCH 3/3] extsearchidx: enforce -index before -extindex
Date: Wed,  9 Dec 2020 09:25:12 +0000
Message-ID: <20201209092512.25282-4-e@80x24.org>
In-Reply-To: <20201209092512.25282-1-e@80x24.org>

We cannot set xref3 data without the `xnum' column to
tie it to the per-inbox over.sqlite3 DB.  So ensure we don't
read brand-new history that only exists in git, but instead
rely on last_commit and last_xap15-$EPOCH metadata in msgmap
to decide how far we can index.

Before this change, messages could go missing from the
extindex if -index had not run first (upcoming --reindex
support in -extindex will make that recoverable).
---
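Not part of the commit: a minimal standalone sketch of the stop-point
rule described above, for readers who want it outside the diff.  The
subs v1_range and v2_stop_points are hypothetical names and do not
exist in public-inbox; the commit IDs are made-up placeholders.  The
arguments only stand in for the values the patch reads via
$ibx->mm->last_commit, eidx_meta("lc-v1:...") and last_commit_xap.

```perl
#!/usr/bin/perl
# Illustration only: these subs are assumptions, not public-inbox API.
use strict;
use warnings;

# v1: $indexed_head stands in for $ibx->mm->last_commit and
# $ext_last for the "lc-v1:..." value stored in the extindex.
# Returns nothing when the inbox was never indexed, so the caller
# can warn and skip it instead of reading from HEAD.
sub v1_range {
	my ($indexed_head, $ext_last) = @_;
	return unless defined $indexed_head;
	return $ext_last ? "$ext_last..$indexed_head" : $indexed_head;
}

# v2: one stop point per epoch, taken from what the per-inbox index
# recorded (a stand-in for last_commit_xap), never from the git tip.
# Epochs with no recorded commit stay undef.
sub v2_stop_points {
	my ($epoch_max, $indexed_by_epoch) = @_;
	my @stops;
	$stops[$_] = $indexed_by_epoch->{$_} for 0..$epoch_max;
	return \@stops;
}

print v1_range('bbb222', 'aaa111'), "\n";	# incremental run
print v1_range('bbb222', undef), "\n";		# first -extindex run
print defined(v1_range(undef, undef)) ? "indexed\n" : "not indexed\n";
my $stops = v2_stop_points(1, { 0 => 'ccc333' });
print join(',', map { $_ // 'undef' } @$stops), "\n";
```

The point of the sketch is that both paths bound the walk by what
the per-inbox index has already recorded, so a commit that exists
only in git is left for a later run, after -index has seen it.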
 lib/PublicInbox/ExtSearchIdx.pm |  7 ++++-
 lib/PublicInbox/V2Writable.pm   | 20 +++++++++---
 t/extsearch.t                   | 54 +++++++++++++++++++++++++++++++++
 3 files changed, 76 insertions(+), 5 deletions(-)

diff --git a/lib/PublicInbox/ExtSearchIdx.pm b/lib/PublicInbox/ExtSearchIdx.pm
index b0a12bca..84449cb4 100644
--- a/lib/PublicInbox/ExtSearchIdx.pm
+++ b/lib/PublicInbox/ExtSearchIdx.pm
@@ -291,7 +291,12 @@ sub _sync_inbox ($$$) {
 	} elsif ($v == 1) {
 		my $uv = $ibx->uidvalidity;
 		my $lc = $self->{oidx}->eidx_meta("lc-v1:$ekey//$uv");
-		my $stk = prepare_stack($sync, $lc ? "$lc..HEAD" : 'HEAD');
+		my $head = $ibx->mm->last_commit;
+		unless (defined $head) {
+			warn "E: $ibx->{inboxdir} is not indexed\n";
+			return;
+		}
+		my $stk = prepare_stack($sync, $lc ? "$lc..$head" : $head);
 		my $unit = { stack => $stk, git => $ibx->git };
 		push @{$sync->{todo}}, $unit;
 	} else {
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index 07a7fa42..bef3a67a 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -1073,10 +1073,22 @@ sub sync_prepare ($$) {
 		$pfx //= $sync->{ibx}->{inboxdir};
 	}
 
-	# reindex stops at the current heads and we later rerun index_sync
-	# without {reindex}
-	my $reindex_heads = $self->last_commits($sync) if $sync->{reindex};
-
+	my $reindex_heads;
+	if ($self->{ibx_map}) {
+		# ExtSearchIdx won't index messages unless they're in
+		# over.sqlite3 for a given inbox, so don't read beyond
+		# what's in the per-inbox index.
+		$reindex_heads = [];
+		my $v = PublicInbox::Search::SCHEMA_VERSION;
+		my $mm = $sync->{ibx}->mm;
+		for my $i (0..$sync->{epoch_max}) {
+			$reindex_heads->[$i] = $mm->last_commit_xap($v, $i);
+		}
+	} elsif ($sync->{reindex}) { # V2 inbox
+		# reindex stops at the current heads and we later
+		# rerun index_sync without {reindex}
+		$reindex_heads = $self->last_commits($sync);
+	}
 	if ($sync->{max_size} = $sync->{-opt}->{max_size}) {
 		$sync->{index_oid} = $self->can('index_oid');
 	}
diff --git a/t/extsearch.t b/t/extsearch.t
index 96512227..70a60b5a 100644
--- a/t/extsearch.t
+++ b/t/extsearch.t
@@ -176,6 +176,60 @@ is(scalar(@it), 2, 'two inboxes');
 like($it[0]->get_document->get_data, qr/v2test/, 'docdata matched v2');
 like($it[1]->get_document->get_data, qr/v1test/, 'docdata matched v1');
 
+if ('inject w/o indexing') {
+	use PublicInbox::Import;
+	use PublicInbox::Search;
+	my $schema_version = PublicInbox::Search::SCHEMA_VERSION();
+	my $v1ibx = PublicInbox::Config->new->lookup_name('v1test');
+	my $last_v1_commit = $v1ibx->mm->last_commit;
+	my $v2ibx = PublicInbox::Config->new->lookup_name('v2test');
+	my $last_v2_commit = $v2ibx->mm->last_commit_xap($schema_version, 0);
+	my $git0 = PublicInbox::Git->new("$v2ibx->{inboxdir}/git/0.git");
+	chomp(my $cmt = $git0->qx(qw(rev-parse HEAD^0)));
+	is($last_v2_commit, $cmt, 'v2 index up-to-date');
+
+	my $v2im = PublicInbox::Import->new($git0, undef, undef, $v2ibx);
+	$v2im->{lock_path} = undef;
+	$v2im->{path_type} = 'v2';
+	$v2im->add(eml_load('t/mda-mime.eml'));
+	$v2im->done;
+	chomp(my $tip = $git0->qx(qw(rev-parse HEAD^0)));
+	isnt($tip, $cmt, '0.git v2 updated');
+
+	# inject a message w/o updating index
+	rename("$home/v1test/public-inbox", "$home/v1test/skip-index") or
+		BAIL_OUT $!;
+	open(my $eh, '<', 't/iso-2202-jp.eml') or BAIL_OUT $!;
+	run_script(['-mda', '--no-precheck'], $env, { 0 => $eh}) or
+		BAIL_OUT '-mda';
+	rename("$home/v1test/skip-index", "$home/v1test/public-inbox") or
+		BAIL_OUT $!;
+
+	my ($in, $out, $err);
+	$in = $out = $err = '';
+	my $opt = { 0 => \$in, 1 => \$out, 2 => \$err };
+	ok(run_script([qw(-extindex -v -v --all), "$home/extindex"],
+		undef, undef), 'extindex noop');
+	$es->{xdb}->reopen;
+	my $mset = $es->mset('mid:199707281508.AAA24167@hoyogw.example');
+	is($mset->size, 0, 'did not attempt to index unindexed v1 message');
+	$mset = $es->mset('mid:multipart-html-sucks@11');
+	is($mset->size, 0, 'did not attempt to index unindexed v2 message');
+	ok(run_script([qw(-index --all)]), 'indexed v1 and v2 inboxes');
+
+	isnt($v1ibx->mm->last_commit, $last_v1_commit, '-index v1 worked');
+	isnt($v2ibx->mm->last_commit_xap($schema_version, 0),
+		$last_v2_commit, '-index v2 worked');
+	ok(run_script([qw(-extindex --all), "$home/extindex"]),
+		'extindex updates');
+
+	$es->{xdb}->reopen;
+	$mset = $es->mset('mid:199707281508.AAA24167@hoyogw.example');
+	is($mset->size, 1, 'got v1 message');
+	$mset = $es->mset('mid:multipart-html-sucks@11');
+	is($mset->size, 1, 'got v2 message');
+}
+
 if ('remove v1test and test gc') {
 	xsys([qw(git config --unset publicinbox.v1test.inboxdir)],
 		{ GIT_CONFIG => $cfg_path });

