unofficial mirror of meta@public-inbox.org
* [PATCH 00/52] detached external index: mostly
@ 2020-10-27  7:54 Eric Wong
  2020-10-27  7:54 ` [PATCH 01/52] doc/standards: add RFCs for URL schemes Eric Wong
                   ` (53 more replies)
  0 siblings, 54 replies; 57+ messages in thread
From: Eric Wong @ 2020-10-27  7:54 UTC
  To: meta

...and mostly wired up for WWW, but requires manual config
editing atm.  Needs docs and tests, and IMAP support.

This will also form the basis of a mairix workalike client.

Not sure about the usability aspects, but I think this can
replace the need for per-inbox Xapian DBs and save a truckload
of disk space (and more importantly: cache space).  Per-inbox
over.sqlite3 remains required for compatibility with NNTP/IMAP
and existing WWW code.

I don't know if the command-line tool is going to be called
public-inbox-eindex or public-inbox-extindex, but probably the
latter...

"xindex" could be confusing, and "eindex" rhymes with "reindex"
which could also be confusing.  But I'm even more easily
confused than usual these days :x

Performance isn't great: it took 30+ hours to index my mirror of
lore on a SATA SSD, but the entire index is <200GB due to
deduplication of cross-posts.  -compact doesn't work with these
indices yet, but will eventually...

More changes on the way, still trying to fix my brain and get
through this year...

Eric Wong (52):
  doc/standards: add RFCs for URL schemes
  search: hoist out _xdb_sharded for v2 inboxes
  extsearch: start mocking out
  searchidx: expose INDEXLEVELS as `our'
  v2writable: add git method
  v2writable: make OO calls to last_commit-related methods
  search: xdb_sharded: make this a public method for ExtSearch
  searchidx: introduce "xref3" concept
  v2writable: prepare initialization for external indices
  v2writable: hoist out write_alternates
  searchidxshard: allow msgref to be undef
  v2writable: idx_shard: simplify callers
  v2writable: count_shards: allow working without {ibx}
  overidx: introduce changes for external index
  v2: some changes for ExtSearchIdx compatibility
  inboxwritable: eidx_key for external index
  v2writable: rename remaining "remote" terminology
  v2writable: checkpoint: account for lack of {mm}
  extsearchidx: initial implementation
  searchidx: index eidx_key as a boolean term
  searchidx: xref3 delete support
  searchidxshard: special init for eidx
  searchidx: put {ibx} into $sync state
  searchidx: log2stack: simplify callers
  v2writable: more generic sync setup code
  v2writable: allow OO method references
  v2writable: rename {v2w} field to {self}
  v2writable: make *last_commits and sync_prepare OO methods
  v2writable: move size check init to sync_prepare
  extsearchidx: more compatibility with V2Writable callers
  v2writable: reduce scope of epoch-aware code
  extsearchidx: remove {unindex_range} field
  v2writable: pass oid to uindex_oid
  extsearchidx: sync unit updates
  searchidx: export prepare_stack
  extsearchidx: sync updates
  searchidx: reduce inbox-dependency, wrap ->with_umask
  searchidx: favor $sync->{ibx} (over $self->{ibx})
  Makefile.PL: do not build manpage if POD is missing
  script: add preliminary eindex implementation
  index: eindex wiring
  over: store xref3 data in over.sqlite3
  searchidx: remove xref3 support for Xapian
  t/extsearch.t: verify results and xref3 ordering
  t/v2writable: remove pointless ->barrier call
  extsearch: wire up smsg_eml
  extsearchidx: handle edits
  extsearch: wire up remaining Inbox-like methods for WWW
  searchidx: ignore exceptions from ->remove_term
  extsearchidx: set current_info in warning callbacks
  extsearchidx: support --batch-size checkpoints
  searchidxshard: make warnings with eidx_key less confusing

 Documentation/standards.perl      |   3 +
 MANIFEST                          |   4 +
 Makefile.PL                       |  16 +-
 lib/PublicInbox/Config.pm         |  12 +
 lib/PublicInbox/ExtSearch.pm      |  69 +++++
 lib/PublicInbox/ExtSearchIdx.pm   | 404 ++++++++++++++++++++++++++++++
 lib/PublicInbox/Inbox.pm          |  53 ++--
 lib/PublicInbox/InboxWritable.pm  |  23 ++
 lib/PublicInbox/Over.pm           |  19 ++
 lib/PublicInbox/OverIdx.pm        | 122 ++++++++-
 lib/PublicInbox/Search.pm         |  62 ++---
 lib/PublicInbox/SearchIdx.pm      | 135 +++++++---
 lib/PublicInbox/SearchIdxShard.pm |  77 +++++-
 lib/PublicInbox/V2Writable.pm     | 310 ++++++++++++-----------
 lib/PublicInbox/WWW.pm            |   3 +-
 lib/PublicInbox/Xapcmd.pm         |   2 +-
 script/public-inbox-eindex        |  43 ++++
 script/public-inbox-index         |   3 +-
 t/extsearch.t                     |  75 ++++++
 t/over.t                          |  24 ++
 t/search.t                        |   2 -
 t/v2writable.t                    |   3 +-
 22 files changed, 1204 insertions(+), 260 deletions(-)
 create mode 100644 lib/PublicInbox/ExtSearch.pm
 create mode 100644 lib/PublicInbox/ExtSearchIdx.pm
 create mode 100644 script/public-inbox-eindex
 create mode 100644 t/extsearch.t
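
The space saving from deduplicating cross-posts can be pictured with a
small model.  This is only a sketch: it assumes an identical blob OID is
what identifies a cross-post (the cover letter does not spell out the
key), and %doc_by_oid and eidx_add are hypothetical names, not
public-inbox code.

```perl
use strict;
use warnings;

# Hypothetical in-memory model, not public-inbox code: one document per
# unique blob OID, with one cross-reference per inbox the blob appears in.
my %doc_by_oid;
my $next_docid = 0;

sub eidx_add {
	my ($ng_or_dir, $num, $oid) = @_;
	# a repeated OID reuses the existing document instead of adding one
	my $doc = $doc_by_oid{$oid} //= { docid => ++$next_docid, xref => [] };
	push @{$doc->{xref}}, "$ng_or_dir:$num";
	$doc->{docid};
}

# the same blob cross-posted to two lists is stored once
my $d1 = eidx_add('inbox.comp.example.foo', 1, 'deadbeef');
my $d2 = eidx_add('inbox.comp.example.bar', 42, 'deadbeef');
my $d3 = eidx_add('inbox.comp.example.bar', 43, 'c0ffee');
```

Here $d1 and $d2 are the same docid, so three deliveries cost two
documents; a per-inbox index would have stored three.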

* [PATCH 01/52] doc/standards: add RFCs for URL schemes
  2020-10-27  7:54 [PATCH 00/52] detached external index: mostly Eric Wong
@ 2020-10-27  7:54 ` Eric Wong
  2020-11-05  7:54   ` [PATCH v2] " Eric Wong
  2020-10-27  7:54 ` [PATCH 02/52] search: hoist out _xdb_sharded for v2 inboxes Eric Wong
                   ` (52 subsequent siblings)
  53 siblings, 1 reply; 57+ messages in thread
From: Eric Wong @ 2020-10-27  7:54 UTC
  To: meta

We linkify these in the WWW UI, and will support them in other
places.  These URL schemes may end up being stored in
external/detached indices for indexing non-git-based mail
stores.
---
 Documentation/standards.perl | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/Documentation/standards.perl b/Documentation/standards.perl
index 0ac6cc52..175c717f 100755
--- a/Documentation/standards.perl
+++ b/Documentation/standards.perl
@@ -25,6 +25,9 @@ EOF
 my $rfcs = [
 	3977 => 'NNTP',
 	977 => 'NNTP (old)',
+	1738 => 'Uniform resource locators',
+	5092 => 'IMAP URL scheme',
+	5538 => 'NNTP URI schemes',
 	6048 => 'NNTP additions to LIST command (TODO)',
 	8054 => 'NNTP compression',
 	4642 => 'NNTP TLS',

* [PATCH 02/52] search: hoist out _xdb_sharded for v2 inboxes
  2020-10-27  7:54 [PATCH 00/52] detached external index: mostly Eric Wong
  2020-10-27  7:54 ` [PATCH 01/52] doc/standards: add RFCs for URL schemes Eric Wong
@ 2020-10-27  7:54 ` Eric Wong
  2020-10-27  7:54 ` [PATCH 03/52] extsearch: start mocking out Eric Wong
                   ` (51 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: Eric Wong @ 2020-10-27  7:54 UTC
  To: meta

We'll be using this in detached (ext) Xapian indexes
in cross inbox search.
---
 lib/PublicInbox/Search.pm | 58 +++++++++++++++++++++------------------
 1 file changed, 31 insertions(+), 27 deletions(-)

diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index 0321ca93..6346d788 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -190,38 +190,42 @@ sub xdir ($;$) {
 	}
 }
 
-sub _xdb ($) {
+sub _xdb_sharded {
+	my ($self, $xpfx) = @_;
+	opendir(my $dh, $xpfx) or return; # not initialized yet
+
+	# We need numeric sorting so shard[0] is first for reading
+	# Xapian metadata, if needed
+	my $last = max(grep(/\A[0-9]+\z/, readdir($dh)));
+	return if !defined($last);
+	my (@xdb, $slow_phrase);
+	for (0..$last) {
+		my $shard_dir = "$xpfx/$_";
+		if (-d $shard_dir && -r _) {
+			push @xdb, $X{Database}->new($shard_dir);
+			$slow_phrase ||= -f "$shard_dir/iamchert";
+		} else { # gaps from missing epochs throw off mdocid()
+			warn "E: $shard_dir missing or unreadable\n";
+			return;
+		}
+	}
+	$self->{qp_flags} |= FLAG_PHRASE() if !$slow_phrase;
+	$self->{nshard} = scalar(@xdb);
+	my $xdb = shift @xdb;
+	$xdb->add_database($_) for @xdb;
+	$xdb;
+}
+
+sub _xdb {
 	my ($self) = @_;
 	my $dir = xdir($self, 1);
-	my ($xdb, $slow_phrase);
-	my $qpf = \($self->{qp_flags} ||= $QP_FLAGS);
+	$self->{qp_flags} //= $QP_FLAGS;
 	if ($self->{ibx_ver} >= 2) {
-		my @xdb;
-		opendir(my $dh, $dir) or return; # not initialized yet
-
-		# We need numeric sorting so shard[0] is first for reading
-		# Xapian metadata, if needed
-		my $last = max(grep(/\A[0-9]+\z/, readdir($dh)));
-		return if !defined($last);
-		for (0..$last) {
-			my $shard_dir = "$dir/$_";
-			if (-d $shard_dir && -r _) {
-				push @xdb, $X{Database}->new($shard_dir);
-				$slow_phrase ||= -f "$shard_dir/iamchert";
-			} else { # gaps from missing epochs throw off mdocid()
-				warn "E: $shard_dir missing or unreadable\n";
-				return;
-			}
-		}
-		$self->{nshard} = scalar(@xdb);
-		$xdb = shift @xdb;
-		$xdb->add_database($_) for @xdb;
+		_xdb_sharded($self, $dir);
 	} else {
-		$slow_phrase = -f "$dir/iamchert";
-		$xdb = $X{Database}->new($dir);
+		$self->{qp_flags} |= FLAG_PHRASE() if !-f "$dir/iamchert";
+		$X{Database}->new($dir);
 	}
-	$$qpf |= FLAG_PHRASE() unless $slow_phrase;
-	$xdb;
 }
 
 # v2 Xapian docids don't conflict, so they're identical to
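
The shard discovery in xdb_sharded can be tried without Xapian.  The
sketch below keeps only the directory logic from the hunk above (numeric
ordering, and bailing out on a gap); shard_dirs is a hypothetical helper
that returns paths instead of opening databases.

```perl
use strict;
use warnings;
use List::Util qw(max);
use File::Temp qw(tempdir);

# Hypothetical helper, not part of public-inbox: return the shard
# directories under $xpfx in numeric order, or an empty list if the
# index is uninitialized or any shard is missing.
sub shard_dirs {
	my ($xpfx) = @_;
	opendir(my $dh, $xpfx) or return; # not initialized yet
	my $last = max(grep(/\A[0-9]+\z/, readdir($dh)));
	return if !defined($last);
	my @dirs;
	for my $n (0..$last) {
		my $shard_dir = "$xpfx/$n";
		# a gap would shift the position of every later shard
		return unless -d $shard_dir && -r _;
		push @dirs, $shard_dir;
	}
	@dirs;
}

my $xpfx = tempdir(CLEANUP => 1);
for my $n (0..10) { mkdir("$xpfx/$n") or die "mkdir: $!" }
my @complete = shard_dirs($xpfx); # 11 entries, .../2 comes before .../10

rmdir("$xpfx/5") or die "rmdir: $!";
my @gapped = shard_dirs($xpfx); # empty: shard 5 is missing
```

Iterating 0..$last rather than sorting the readdir output is what gives
numeric order; a plain string sort would place "10" before "2".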

* [PATCH 03/52] extsearch: start mocking out
  2020-10-27  7:54 [PATCH 00/52] detached external index: mostly Eric Wong
  2020-10-27  7:54 ` [PATCH 01/52] doc/standards: add RFCs for URL schemes Eric Wong
  2020-10-27  7:54 ` [PATCH 02/52] search: hoist out _xdb_sharded for v2 inboxes Eric Wong
@ 2020-10-27  7:54 ` Eric Wong
  2020-10-27  7:54 ` [PATCH 04/52] searchidx: expose INDEXLEVELS as `our' Eric Wong
                   ` (50 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: Eric Wong @ 2020-10-27  7:54 UTC
  To: meta

This will provide a similar API to PublicInbox::Inbox for
read-only WWW, -imapd, and -nntpd interfaces.
---
 MANIFEST                     |  2 ++
 lib/PublicInbox/ExtSearch.pm | 40 ++++++++++++++++++++++++++++++++++++
 lib/PublicInbox/Search.pm    |  4 ++--
 t/extsearch.t                | 11 ++++++++++
 4 files changed, 55 insertions(+), 2 deletions(-)
 create mode 100644 lib/PublicInbox/ExtSearch.pm
 create mode 100644 t/extsearch.t

diff --git a/MANIFEST b/MANIFEST
index b6a681e9..60055d2b 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -121,6 +121,7 @@ lib/PublicInbox/Emergency.pm
 lib/PublicInbox/Eml.pm
 lib/PublicInbox/EmlContentFoo.pm
 lib/PublicInbox/ExtMsg.pm
+lib/PublicInbox/ExtSearch.pm
 lib/PublicInbox/FakeInotify.pm
 lib/PublicInbox/Feed.pm
 lib/PublicInbox/Filter/Base.pm
@@ -269,6 +270,7 @@ t/eml.t
 t/eml_content_disposition.t
 t/eml_content_type.t
 t/epoll.t
+t/extsearch.t
 t/fail-bin/spamc
 t/fake_inotify.t
 t/feed.t
diff --git a/lib/PublicInbox/ExtSearch.pm b/lib/PublicInbox/ExtSearch.pm
new file mode 100644
index 00000000..9bbe7857
--- /dev/null
+++ b/lib/PublicInbox/ExtSearch.pm
@@ -0,0 +1,40 @@
+# Copyright (C) 2020 all contributors <meta@public-inbox.org>
+# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
+
+# Read-only external (detached) index for cross inbox search.
+# This is a read-only counterpart to PublicInbox::ExtSearchIdx
+package PublicInbox::ExtSearch;
+use strict;
+use v5.10.1;
+use PublicInbox::Over;
+
+# for ->reopen, ->mset, ->mset_to_artnums
+use parent qw(PublicInbox::Search);
+
+sub new {
+	my (undef, $topdir) = @_;
+	bless {
+		topdir => $topdir,
+		# xpfx => 'ei15'
+		xpfx => "$topdir/ei".PublicInbox::Search::SCHEMA_VERSION
+	}, __PACKAGE__;
+}
+
+# overrides PublicInbox::Search::_xdb
+sub _xdb {
+	my ($self) = @_;
+	$self->_xdb_sharded($self->{xpfx});
+}
+
+# same as per-inbox ->over, for now...
+sub over {
+	my ($self) = @_;
+	$self->{over} //= PublicInbox::Over->new("$self->{xpfx}/over.sqlite3");
+}
+
+sub git {
+	my ($self) = @_;
+	$self->{git} //= PublicInbox::Git->new("$self->{topdir}/ALL.git");
+}
+
+1;
diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index 6346d788..5a57657f 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -245,9 +245,9 @@ sub mset_to_artnums {
 
 sub xdb ($) {
 	my ($self) = @_;
-	$self->{xdb} ||= do {
+	$self->{xdb} //= do {
 		load_xapian();
-		_xdb($self);
+		$self->_xdb;
 	};
 }
 
diff --git a/t/extsearch.t b/t/extsearch.t
new file mode 100644
index 00000000..7687f5f0
--- /dev/null
+++ b/t/extsearch.t
@@ -0,0 +1,11 @@
+#!perl -w
+# Copyright (C) 2020 all contributors <meta@public-inbox.org>
+# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
+use strict;
+use Test::More;
+use PublicInbox::TestCommon;
+require_git(2.6);
+require_mods(qw(DBD::SQLite Search::Xapian));
+use_ok 'PublicInbox::ExtSearch';
+
+done_testing;
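
The switch from _xdb($self) to $self->_xdb in Search.pm is what lets
ExtSearch override _xdb.  A minimal sketch of the difference, using
hypothetical My::Search and My::ExtSearch packages as stand-ins for the
real classes:

```perl
use strict;
use warnings;

package My::Search; # hypothetical stand-in for PublicInbox::Search
sub new { my ($class) = @_; bless {}, $class }
sub _xdb { 'per-inbox' }

# plain function call: always runs My::Search::_xdb
sub xdb_func { my ($self) = @_; $self->{xdb} //= _xdb($self) }

# method call: looked up through the class of the object
sub xdb_meth { my ($self) = @_; $self->{xdb} //= $self->_xdb }

package My::ExtSearch; # hypothetical stand-in for PublicInbox::ExtSearch
our @ISA = ('My::Search');
sub _xdb { 'detached' }

package main;
my $via_func = My::ExtSearch->new->xdb_func; # 'per-inbox'
my $via_meth = My::ExtSearch->new->xdb_meth; # 'detached'
```

Only the method call reaches the subclass, so callers of ->xdb need no
changes to work against the detached index.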

* [PATCH 04/52] searchidx: expose INDEXLEVELS as `our'
  2020-10-27  7:54 [PATCH 00/52] detached external index: mostly Eric Wong
                   ` (2 preceding siblings ...)
  2020-10-27  7:54 ` [PATCH 03/52] extsearch: start mocking out Eric Wong
@ 2020-10-27  7:54 ` Eric Wong
  2020-10-27  7:54 ` [PATCH 05/52] v2writable: add git method Eric Wong
                   ` (49 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: Eric Wong @ 2020-10-27  7:54 UTC
  To: meta

This will be used by external/detached indices, too.
---
 lib/PublicInbox/SearchIdx.pm | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 2aec2b73..af707ced 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -32,11 +32,11 @@ use constant DEBUG => !!$ENV{DEBUG};
 my $xapianlevels = qr/\A(?:full|medium)\z/;
 my $hex = '[a-f0-9]';
 my $OID = $hex .'{40,}';
+our $INDEXLEVELS = qr/\A(?:full|medium|basic)\z/;
 
 sub new {
 	my ($class, $ibx, $creat, $shard) = @_;
 	ref $ibx or die "BUG: expected PublicInbox::Inbox object: $ibx";
-	my $levels = qr/\A(?:full|medium|basic)\z/;
 	my $inboxdir = $ibx->{inboxdir};
 	my $version = $ibx->version;
 	my $indexlevel = 'full';
@@ -46,7 +46,7 @@ sub new {
 		$altid = [ map { PublicInbox::AltId->new($ibx, $_); } @$altid ];
 	}
 	if ($ibx->{indexlevel}) {
-		if ($ibx->{indexlevel} =~ $levels) {
+		if ($ibx->{indexlevel} =~ $INDEXLEVELS) {
 			$indexlevel = $ibx->{indexlevel};
 		} else {
 			die("Invalid indexlevel $ibx->{indexlevel}\n");

* [PATCH 05/52] v2writable: add git method
  2020-10-27  7:54 [PATCH 00/52] detached external index: mostly Eric Wong
                   ` (3 preceding siblings ...)
  2020-10-27  7:54 ` [PATCH 04/52] searchidx: expose INDEXLEVELS as `our' Eric Wong
@ 2020-10-27  7:54 ` Eric Wong
  2020-10-27  7:54 ` [PATCH 06/52] v2writable: make OO calls to last_commit-related methods Eric Wong
                   ` (48 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: Eric Wong @ 2020-10-27  7:54 UTC
  To: meta

This will make it easier to share code with ExtSearchIdx.
---
 lib/PublicInbox/V2Writable.pm | 26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index c04f0c59..9d08549f 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -467,7 +467,7 @@ sub git_hash_raw ($$) {
 	my ($self, $raw) = @_;
 	# grab the expected OID we have to reindex:
 	pipe(my($in, $w)) or die "pipe: $!";
-	my $git_dir = $self->{ibx}->git->{git_dir};
+	my $git_dir = $self->git->{git_dir};
 	my $cmd = ['git', "--git-dir=$git_dir", qw(hash-object --stdin)];
 	my $r = popen_rd($cmd, undef, { 0 => $in });
 	print $w $$raw or die "print \$w: $!";
@@ -531,11 +531,11 @@ W: $list
 	}
 
 	# make sure we really got the OID:
-	my ($blob, $type, $bytes) = $self->{ibx}->git->check($expect_oid);
+	my ($blob, $type, $bytes) = $self->git->check($expect_oid);
 	$blob eq $expect_oid or die "BUG: $expect_oid not found after replace";
 
 	# don't leak FDs to Xapian:
-	$self->{ibx}->git->cleanup;
+	$self->git->cleanup;
 
 	# reindex modified messages:
 	for my $smsg (@$need_reindex) {
@@ -671,7 +671,7 @@ sub done {
 	my $nbytes = $self->{total_bytes};
 	$self->{total_bytes} = 0;
 	$self->lock_release(!!$nbytes) if $shards;
-	$self->{ibx}->git->cleanup;
+	$self->git->cleanup;
 	die $err if $err;
 }
 
@@ -861,7 +861,7 @@ sub atfork_child {
 sub reindex_checkpoint ($$) {
 	my ($self, $sync) = @_;
 
-	$self->{ibx}->git->cleanup; # *async_wait
+	$self->git->cleanup; # *async_wait
 	${$sync->{need_checkpoint}} = 0;
 	my $mm_tmp = $sync->{mm_tmp};
 	$mm_tmp->atfork_prepare if $mm_tmp;
@@ -1066,13 +1066,12 @@ sub sync_prepare ($$$) {
 	if (my @leftovers = keys %{delete($sync->{D}) // {}}) {
 		warn('W: unindexing '.scalar(@leftovers)." leftovers\n");
 		my $arg = { v2w => $self };
-		my $all = $self->{ibx}->git;
 		for my $oid (@leftovers) {
 			$oid = unpack('H*', $oid);
 			$self->{current_info} = "leftover $oid";
-			$all->cat_async($oid, \&unindex_oid, $arg);
+			$self->git->cat_async($oid, \&unindex_oid, $arg);
 		}
-		$all->cat_async_wait;
+		$self->git->cat_async_wait;
 	}
 	if (!$regen_max) {
 		$sync->{-regen_fmt} = "%u/?\n";
@@ -1127,6 +1126,8 @@ sub unindex_oid ($$;$) { # git->cat_async callback
 	}
 }
 
+sub git { $_[0]->{ibx}->git }
+
 # this is rare, it only happens when we get discontiguous history in
 # a mirror because the source used -purge or -edit
 sub unindex ($$$$) {
@@ -1137,14 +1138,13 @@ sub unindex ($$$$) {
 	my @cmd = qw(log --raw -r
 			--no-notes --no-color --no-abbrev --no-renames);
 	my $fh = $git->popen(@cmd, $unindex_range);
-	my $all = $self->{ibx}->git;
 	local $sync->{in_unindex} = 1;
 	while (<$fh>) {
 		/\A:\d{6} 100644 $OID ($OID) [AM]\tm$/o or next;
-		$all->cat_async($1, \&unindex_oid, $sync);
+		$self->git->cat_async($1, \&unindex_oid, $sync);
 	}
 	close $fh or die "git log failed: \$?=$?";
-	$all->cat_async_wait;
+	$self->git->cat_async_wait;
 
 	return unless $sync->{-opt}->{prune};
 	my $after = scalar keys %$unindexed;
@@ -1211,7 +1211,7 @@ sub index_epoch ($$$) {
 	}
 	defined(my $stk = $sync->{stacks}->[$i]) or return;
 	$sync->{stacks}->[$i] = undef;
-	my $all = $self->{ibx}->git;
+	my $all = $self->git;
 	while (my ($f, $at, $ct, $oid) = $stk->pop_rec) {
 		$self->{current_info} = "$i.git $oid";
 		if ($f eq 'm') {
@@ -1259,7 +1259,7 @@ sub xapian_only {
 			index_xap_step($self, $sync, $art_beg, 1);
 		}
 	}
-	$self->{ibx}->git->cat_async_wait;
+	$self->git->cat_async_wait;
 	$self->done;
 }
 

* [PATCH 06/52] v2writable: make OO calls to last_commit-related methods
  2020-10-27  7:54 [PATCH 00/52] detached external index: mostly Eric Wong
                   ` (4 preceding siblings ...)
  2020-10-27  7:54 ` [PATCH 05/52] v2writable: add git method Eric Wong
@ 2020-10-27  7:54 ` Eric Wong
  2020-10-27  7:54 ` [PATCH 07/52] search: xdb_sharded: make this a public method for ExtSearch Eric Wong
                   ` (47 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: Eric Wong @ 2020-10-27  7:54 UTC
  To: meta

We'll try to reuse as much V2Writable code as possible for
external indices, but the way "last_commit" info is stored
must be different as external indices will deal with last_commit
info for multiple inboxes.
---
 lib/PublicInbox/V2Writable.pm | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index 9d08549f..de89b729 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -953,7 +953,7 @@ sub index_oid { # cat_async callback
 }
 
 # only update last_commit for $i on reindex iff newer than current
-sub update_last_commit ($$$$) {
+sub update_last_commit {
 	my ($self, $git, $i, $cmt) = @_;
 	my $last = last_epoch_commit($self, $i);
 	if (defined $last && is_ancestor($git, $last, $cmt)) {
@@ -1034,7 +1034,7 @@ sub sync_prepare ($$$) {
 
 	# reindex stops at the current heads and we later rerun index_sync
 	# without {reindex}
-	my $reindex_heads = last_commits($self, $epoch_max) if $sync->{reindex};
+	my $reindex_heads = $self->last_commits($epoch_max) if $sync->{reindex};
 
 	for (my $i = $epoch_max; $i >= 0; $i--) {
 		my $git_dir = git_dir_n($self, $i);
@@ -1229,7 +1229,7 @@ sub index_epoch ($$$) {
 		}
 	}
 	$all->async_wait_all;
-	update_last_commit($self, $git, $i, $stk->{latest_cmt});
+	$self->update_last_commit($git, $i, $stk->{latest_cmt});
 }
 
 sub xapian_only {

* [PATCH 07/52] search: xdb_sharded: make this a public method for ExtSearch
  2020-10-27  7:54 [PATCH 00/52] detached external index: mostly Eric Wong
                   ` (5 preceding siblings ...)
  2020-10-27  7:54 ` [PATCH 06/52] v2writable: make OO calls to last_commit-related methods Eric Wong
@ 2020-10-27  7:54 ` Eric Wong
  2020-10-27  7:54 ` [PATCH 08/52] searchidx: introduce "xref3" concept Eric Wong
                   ` (46 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: Eric Wong @ 2020-10-27  7:54 UTC
  To: meta

We can simplify callers by using $self->{xpfx} instead of
passing another arg on the stack.
---
 lib/PublicInbox/ExtSearch.pm |  2 +-
 lib/PublicInbox/Search.pm    | 10 +++++-----
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/lib/PublicInbox/ExtSearch.pm b/lib/PublicInbox/ExtSearch.pm
index 9bbe7857..8997cd54 100644
--- a/lib/PublicInbox/ExtSearch.pm
+++ b/lib/PublicInbox/ExtSearch.pm
@@ -23,7 +23,7 @@ sub new {
 # overrides PublicInbox::Search::_xdb
 sub _xdb {
 	my ($self) = @_;
-	$self->_xdb_sharded($self->{xpfx});
+	$self->xdb_sharded;
 }
 
 # same as per-inbox ->over, for now...
diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index 5a57657f..71417d5e 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -190,9 +190,9 @@ sub xdir ($;$) {
 	}
 }
 
-sub _xdb_sharded {
-	my ($self, $xpfx) = @_;
-	opendir(my $dh, $xpfx) or return; # not initialized yet
+sub xdb_sharded {
+	my ($self) = @_;
+	opendir(my $dh, $self->{xpfx}) or return; # not initialized yet
 
 	# We need numeric sorting so shard[0] is first for reading
 	# Xapian metadata, if needed
@@ -200,7 +200,7 @@ sub _xdb_sharded {
 	return if !defined($last);
 	my (@xdb, $slow_phrase);
 	for (0..$last) {
-		my $shard_dir = "$xpfx/$_";
+		my $shard_dir = "$self->{xpfx}/$_";
 		if (-d $shard_dir && -r _) {
 			push @xdb, $X{Database}->new($shard_dir);
 			$slow_phrase ||= -f "$shard_dir/iamchert";
@@ -221,7 +221,7 @@ sub _xdb {
 	my $dir = xdir($self, 1);
 	$self->{qp_flags} //= $QP_FLAGS;
 	if ($self->{ibx_ver} >= 2) {
-		_xdb_sharded($self, $dir);
+		xdb_sharded($self);
 	} else {
 		$self->{qp_flags} |= FLAG_PHRASE() if !-f "$dir/iamchert";
 		$X{Database}->new($dir);

* [PATCH 08/52] searchidx: introduce "xref3" concept
  2020-10-27  7:54 [PATCH 00/52] detached external index: mostly Eric Wong
                   ` (6 preceding siblings ...)
  2020-10-27  7:54 ` [PATCH 07/52] search: xdb_sharded: make this a public method for ExtSearch Eric Wong
@ 2020-10-27  7:54 ` Eric Wong
  2020-10-27  7:54 ` [PATCH 09/52] v2writable: prepare initialization for external indices Eric Wong
                   ` (45 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: Eric Wong @ 2020-10-27  7:54 UTC
  To: meta

This will be used to track cross-posted messages in the
external/detached index.
---
 lib/PublicInbox/SearchIdx.pm      | 78 ++++++++++++++++++++++++++-----
 lib/PublicInbox/SearchIdxShard.pm | 53 ++++++++++++++++++---
 lib/PublicInbox/Smsg.pm           | 13 ++++++
 t/search.t                        | 28 +++++++++--
 4 files changed, 150 insertions(+), 22 deletions(-)

diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index af707ced..283bdd6c 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -325,6 +325,16 @@ sub index_xapian { # msg_iter callback
 	}
 }
 
+sub index_list_id ($$$) {
+	my ($self, $doc, $hdr) = @_;
+	for my $l ($hdr->header_raw('List-Id')) {
+		$l =~ /<([^>]+)>/ or next;
+		my $lid = lc $1;
+		$doc->add_boolean_term('G' . $lid);
+		index_text($self, $lid, 1, 'XL'); # probabilistic
+	}
+}
+
 sub index_ids ($$$$) {
 	my ($self, $doc, $hdr, $mids) = @_;
 	for my $mid (@$mids) {
@@ -338,12 +348,7 @@ sub index_ids ($$$$) {
 		}
 	}
 	$doc->add_boolean_term('Q' . $_) for @$mids;
-	for my $l ($hdr->header_raw('List-Id')) {
-		$l =~ /<([^>]+)>/ or next;
-		my $lid = lc $1;
-		$doc->add_boolean_term('G' . $lid);
-		index_text($self, $lid, 1, 'XL'); # probabilistic
-	}
+	index_list_id($self, $doc, $hdr);
 }
 
 sub add_xapian ($$$$) {
@@ -363,6 +368,10 @@ sub add_xapian ($$$$) {
 	$tg->set_document($doc);
 	index_headers($self, $smsg);
 
+	if (my $ng_or_dir = $self->{ng_or_dir}) { # external index
+		$doc->add_boolean_term('P'.
+				"$ng_or_dir:$smsg->{num}:$smsg->{blob}");
+	}
 	msg_iter($eml, \&index_xapian, [ $self, $doc ]);
 	index_ids($self, $doc, $eml, $mids);
 
@@ -436,6 +445,56 @@ sub add_message {
 	$smsg->{num};
 }
 
+sub _get_doc ($$$) {
+	my ($self, $docid, $oid) = @_;
+	my $doc = eval { $self->{xdb}->get_document($docid) };
+	$doc // do {
+		warn "E: $@\n" if $@;
+		warn "E: #$docid $oid missing in Xapian\n";
+		undef;
+	}
+}
+
+sub add_xref3 {
+	my ($self, $docid, $xnum, $oid, $ng_or_dir, $eml) = @_;
+	begin_txn_lazy($self);
+	my $doc = _get_doc($self, $docid, $oid) or return;
+	term_generator($self)->set_document($doc);
+	$doc->add_boolean_term('P'."$ng_or_dir:$xnum:$oid");
+	index_list_id($self, $doc, $eml);
+	$self->{xdb}->replace_document($docid, $doc);
+}
+
+sub remove_xref3 {
+	my ($self, $docid, $oid, $ng_or_dir, $eml) = @_;
+	begin_txn_lazy($self);
+	my $doc = _get_doc($self, $docid, $oid) or return;
+	my $xref3 = PublicInbox::Smsg::xref3(undef, $doc);
+	for (grep(/\A\Q$ng_or_dir\E:[0-9]+:\Q$oid\E\z/, @$xref3)) {
+		$doc->remove_term('P' . $_);
+	}
+	for my $l ($eml->header_raw('List-Id')) {
+		$l =~ /<([^>]+)>/ or next;
+		my $lid = lc $1;
+		$doc->remove_term('G' . $lid);
+
+		# nb: we don't remove the XL probabilistic terms
+		# since terms may overlap if cross-posted.
+		#
+		# IOW, a message which has both <foo.example.com>
+		# and <bar.example.com> would have overlapping
+		# "XLexample" and "XLcom" as terms and which we
+		# wouldn't know if they're safe to remove if we just
+		# unindex <foo.example.com> while preserving
+		# <bar.example.com>.
+		#
+		# In any case, this entire sub will likely never
+		# be needed and users using the "l:" prefix are probably
+		# rarer.
+	}
+	$self->{xdb}->replace_document($docid, $doc);
+}
+
 sub get_val ($$) {
 	my ($doc, $col) = @_;
 	sortable_unserialise($doc->get_value($col));
@@ -457,12 +516,7 @@ sub xdb_remove {
 	my ($self, $oid, @removed) = @_;
 	my $xdb = $self->{xdb} or return;
 	for my $num (@removed) {
-		my $doc = eval { $xdb->get_document($num) };
-		unless ($doc) {
-			warn "E: $@\n" if $@;
-			warn "E: #$num $oid missing in Xapian\n";
-			next;
-		}
+		my $doc = _get_doc($self, $num, $oid) or next;
 		my $smsg = smsg_from_doc($doc);
 		my $blob = $smsg->{blob}; # may be undef if --skip-docdata
 		if (!defined($blob) || $blob eq $oid) {
diff --git a/lib/PublicInbox/SearchIdxShard.pm b/lib/PublicInbox/SearchIdxShard.pm
index f23d23d0..8e24aa1b 100644
--- a/lib/PublicInbox/SearchIdxShard.pm
+++ b/lib/PublicInbox/SearchIdxShard.pm
@@ -7,6 +7,7 @@ package PublicInbox::SearchIdxShard;
 use strict;
 use v5.10.1;
 use parent qw(PublicInbox::SearchIdx);
+use bytes qw(length);
 use IO::Handle (); # autoflush
 use PublicInbox::Eml;
 
@@ -47,6 +48,13 @@ sub spawn_worker {
 	close $r or die "failed to close: $!";
 }
 
+sub eml ($$) {
+	my ($r, $len) = @_;
+	my $n = read($r, my $bref, $len) or die "read: $!\n";
+	$n == $len or die "short read: $n != $len\n";
+	PublicInbox::Eml->new(\$bref);
+}
+
 # this reads all the writes to $self->{w} from the parent process
 sub shard_worker_loop ($$$$$) {
 	my ($self, $v2w, $r, $shard, $bnote) = @_;
@@ -65,25 +73,32 @@ sub shard_worker_loop ($$$$$) {
 					die "write failed for barrier $!\n";
 		} elsif ($line =~ /\AD ([a-f0-9]{40,}) ([0-9]+)\n\z/s) {
 			$self->remove_by_oid($1, $2 + 0);
+		} elsif ($line =~ s/\A\+X //) {
+			my ($len, $docid, $xnum, $oid, $ng_or_dir) =
+							split(/ /, $line, 5);
+			$self->add_xref3($docid, $xnum, $oid, $ng_or_dir,
+						eml($r, $len));
+		} elsif ($line =~ s/\A-X //) {
+			my ($len, $docid, $oid, $ng_or_dir) =
+							split(/ /, $line, 4);
+			$self->remove_xref3($docid, $oid,
+						$ng_or_dir, eml($r, $len));
 		} else {
 			chomp $line;
 			# n.b. $mid may contain spaces(!)
-			my ($to_read, $bytes, $num, $blob, $ds, $ts, $tid, $mid)
+			my ($len, $bytes, $num, $oid, $ds, $ts, $tid, $mid)
 				= split(/ /, $line, 8);
 			$self->begin_txn_lazy;
-			my $n = read($r, my $msg, $to_read) or die "read: $!\n";
-			$n == $to_read or die "short read: $n != $to_read\n";
-			my $mime = PublicInbox::Eml->new(\$msg);
 			my $smsg = bless {
 				bytes => $bytes,
 				num => $num + 0,
-				blob => $blob,
+				blob => $oid,
 				mid => $mid,
 				tid => $tid,
 				ds => $ds,
 				ts => $ts,
 			}, 'PublicInbox::Smsg';
-			$self->add_message($mime, $smsg);
+			$self->add_message(eml($r, $len), $smsg);
 		}
 	}
 	$self->worker_done;
@@ -107,6 +122,32 @@ sub index_raw {
 	}
 }
 
+sub shard_add_xref3 {
+	my ($self, $docid, $xnum, $oid, $xibx, $eml) = @_;
+	my $ng_or_dir = $xibx->{newsgroup} // $xibx->{inboxdir};
+	if (my $w = $self->{w}) {
+		my $hdr = $eml->header_obj->as_string;
+		my $len = length($hdr);
+		print $w "+X $len $docid $xnum $oid $ng_or_dir\n", $hdr or
+			die "failed to write shard: $!";
+	} else {
+		$self->add_xref3($docid, $xnum, $oid, $ng_or_dir, $eml);
+	}
+}
+
+sub shard_remove_xref3 {
+	my ($self, $docid, $oid, $xibx, $eml) = @_;
+	my $ng_or_dir = $xibx->{newsgroup} // $xibx->{inboxdir};
+	if (my $w = $self->{w}) {
+		my $hdr = $eml->header_obj->as_string;
+		my $len = length($hdr);
+		print $w "-X $len $docid $oid $ng_or_dir\n", $hdr or
+			die "failed to write shard: $!";
+	} else {
+		$self->remove_xref3($docid, $oid, $ng_or_dir, $eml);
+	}
+}
+
 sub atfork_child {
 	close $_[0]->{w} or die "failed to close write pipe: $!\n";
 }
diff --git a/lib/PublicInbox/Smsg.pm b/lib/PublicInbox/Smsg.pm
index 14086538..c0fd85fd 100644
--- a/lib/PublicInbox/Smsg.pm
+++ b/lib/PublicInbox/Smsg.pm
@@ -137,4 +137,17 @@ sub subject_normalized ($) {
 	$subj;
 }
 
+sub xref3 {
+	my ($self, $doc) = @_;
+	my $end = $doc->termlist_end;
+	my $it = $doc->termlist_begin;
+	$it->skip_to('P');
+	my @ret;
+	for (; $it != $end; $it++) {
+		my $val = $it->get_termname;
+		$val =~ s/\AP// and push @ret, $val;
+	}
+	\@ret;
+}
+
 1;
diff --git a/t/search.t b/t/search.t
index a66aec36..e789b81e 100644
--- a/t/search.t
+++ b/t/search.t
@@ -341,6 +341,14 @@ $ibx->with_umask(sub {
 		my $uid = PublicInbox::SearchIdx::get_val($doc, $col);
 		is($uid, $smsg->{num}, 'UID column matches {num}');
 		is($uid, $m->get_docid, 'UID column matches docid');
+
+		# check ->xref3 for external index:
+		is_deeply($smsg->xref3($doc), [], 'xref3 empty by default');
+		my $exp = "inbox.com.example:$uid:deadbeef";
+		$doc->add_boolean_term('P'.$exp);
+		is_deeply($smsg->xref3($doc), [ $exp ], 'xref3 can be set');
+		$doc->remove_term('P'.$exp);
+		is_deeply($smsg->xref3($doc), [], 'xref3 can be unset');
 	}
 
 	$mset = $ibx->search->mset('tc:list@example.com');
@@ -513,8 +521,13 @@ $ibx->with_umask(sub {
 	$rw_commit->();
 	my $doc_id = $rw->add_message(eml_load('t/data/message_embed.eml'));
 	ok($doc_id > 0, 'messages within messages');
-	$rw->commit_txn_lazy;
-	$ibx->search->reopen;
+
+	my $eml = PublicInbox::Eml->new(<<EOF);
+List-Id: <blahblah.example.com>
+
+EOF
+	$rw->add_xref3($doc_id, 1, 'deadbeef', 'newsgroup1.example', $eml);
+	$rw_commit->();
 	my $n_test_eml = $query->('n:test.eml');
 	is(scalar(@$n_test_eml), 1, 'got a result');
 	my $n_embed2x_eml = $query->('n:embed2x.eml');
@@ -532,8 +545,15 @@ $ibx->with_umask(sub {
 	is($query->('s:"mail header experiments"')->[0]->{mid},
 		'20200418222508.GA13918@dcvr',
 		'Subject search reaches inside message/rfc822');
+	is($query->('l:blahblah.example.com')->[0]->{num}, $doc_id,
+		'xref3 List-Id probabilistic works');
+	is($query->('lid:blahblah.example.com')->[0]->{num}, $doc_id,
+		'xref3 List-Id boolean term works');
+	$rw->remove_xref3($doc_id, 'deadbeef', 'newsgroup1.example', $eml);
+	$rw->commit_txn_lazy;
+	$ibx->search->reopen;
+	my $res = $query->('lid:blahblah.example.com');
+	is_deeply($res, [], '->remove_xref3 dropped boolean term');
 });
 
 done_testing();
-
-1;


* [PATCH 09/52] v2writable: prepare initialization for external indices
  2020-10-27  7:54 [PATCH 00/52] detached external index: mostly Eric Wong
                   ` (7 preceding siblings ...)
  2020-10-27  7:54 ` [PATCH 08/52] searchidx: introduce "xref3" concept Eric Wong
@ 2020-10-27  7:54 ` Eric Wong
  2020-10-27  7:54 ` [PATCH 10/52] v2writable: hoist out write_alternates Eric Wong
                   ` (44 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: Eric Wong @ 2020-10-27  7:54 UTC (permalink / raw)
  To: meta

External indices won't have $self->{ibx} since they need to
deal with multiple inboxes.  We can also hoist out
->parallel_init to make it easier to distinguish the
non-parallel control flow.
---
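A minimal sketch of the hoisted control flow, runnable outside of
public-inbox.  parallel_init_sketch is a hypothetical name and $self is
a plain hashref rather than a V2Writable object; the 1031 constant is
F_SETPIPE_SZ on Linux, as in the patch comment.

```perl
use strict;
use warnings;
use IO::Handle; # for ->autoflush

# Hypothetical standalone variant of parallel_init: $self is a plain
# hashref here, not a real V2Writable object.
sub parallel_init_sketch {
	my ($self, $indexlevel) = @_;
	if (($indexlevel // 'full') eq 'basic') {
		$self->{parallel} = 0; # non-parallel path, no pipe needed
	} else {
		pipe(my ($r, $w)) or die "pipe failed: $!";
		# shrinking the pipe is best-effort; 1031 is F_SETPIPE_SZ
		fcntl($w, 1031, 4096) if $^O eq 'linux';
		$self->{bnote} = [ $r, $w ];
		$w->autoflush(1);
	}
	$self;
}

my $basic = parallel_init_sketch({ parallel => 1 }, 'basic');
my $full = parallel_init_sketch({ parallel => 1 }, undef);
```

An undefined index level is treated as 'full', so only an explicit
'basic' disables the barrier-notification pipe.
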
 lib/PublicInbox/V2Writable.pm | 30 ++++++++++++++++++------------
 1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index de89b729..eecc702b 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -271,15 +271,30 @@ sub _idx_init { # with_umask callback
 	my $max = $self->{shards} - 1;
 	my $idx = $self->{idx_shards} = [];
 	push @$idx, PublicInbox::SearchIdxShard->new($self, $_) for (0..$max);
+	my $ibx = $self->{ibx} or return; # ExtIdxSearch
 
 	# Now that all subprocesses are up, we can open the FDs
 	# for SQLite:
 	my $mm = $self->{mm} = PublicInbox::Msgmap->new_file(
-				"$self->{ibx}->{inboxdir}/msgmap.sqlite3",
-				$self->{ibx}->{-no_fsync} ? 2 : 1);
+				"$ibx->{inboxdir}/msgmap.sqlite3",
+				$ibx->{-no_fsync} ? 2 : 1);
 	$mm->{dbh}->begin_work;
 }
 
+sub parallel_init ($$) {
+	my ($self, $indexlevel) = @_;
+	if (($indexlevel // 'full') eq 'basic') {
+		$self->{parallel} = 0;
+	} else {
+		pipe(my ($r, $w)) or die "pipe failed: $!";
+		# pipe for barrier notifications doesn't need to be big,
+		# 1031: F_SETPIPE_SZ
+		fcntl($w, 1031, 4096) if $^O eq 'linux';
+		$self->{bnote} = [ $r, $w ];
+		$w->autoflush(1);
+	}
+}
+
 # idempotent
 sub idx_init {
 	my ($self, $opt) = @_;
@@ -292,16 +307,7 @@ sub idx_init {
 	delete @$ibx{qw(mm search)};
 	$ibx->git->cleanup;
 
-	$self->{parallel} = 0 if ($ibx->{indexlevel}//'') eq 'basic';
-	if ($self->{parallel}) {
-		pipe(my ($r, $w)) or die "pipe failed: $!";
-		# pipe for barrier notifications doesn't need to be big,
-		# 1031: F_SETPIPE_SZ
-		fcntl($w, 1031, 4096) if $^O eq 'linux';
-		$self->{bnote} = [ $r, $w ];
-		$w->autoflush(1);
-	}
-
+	parallel_init($self, $ibx->{indexlevel});
 	$ibx->umask_prepare;
 	$ibx->with_umask(\&_idx_init, $self, $opt);
 }


* [PATCH 10/52] v2writable: hoist out write_alternates
  2020-10-27  7:54 [PATCH 00/52] detached external index: mostly Eric Wong
                   ` (8 preceding siblings ...)
  2020-10-27  7:54 ` [PATCH 09/52] v2writable: prepare initialization for external indices Eric Wong
@ 2020-10-27  7:54 ` Eric Wong
  2020-10-27  7:54 ` [PATCH 11/52] searchidxshard: allow msgref to be undef Eric Wong
                   ` (43 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: Eric Wong @ 2020-10-27  7:54 UTC (permalink / raw)
  To: meta

We'll be reusing this for external indices and possibly
other places.
---
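For readers unfamiliar with the pattern, a standalone sketch of the
write-to-temp-then-rename approach follows.  atomic_write_sketch and its
$name parameter are hypothetical (write_alternates always targets
"alternates"); the paths in the usage lines are made up.

```perl
use strict;
use warnings;
use File::Temp ();

# Hypothetical name; mirrors the write-to-temp-then-rename pattern.
sub atomic_write_sketch {
	my ($dir, $name, $mode, $lines) = @_;
	# temp file lives in the destination directory so rename(2)
	# never crosses a filesystem boundary
	my $fh = File::Temp->new(TEMPLATE => 'alt-XXXXXXXX', DIR => $dir);
	my $tmp = $fh->filename;
	print $fh @$lines or die "print $tmp: $!\n";
	chmod($mode, $fh) or die "fchmod $tmp: $!\n";
	close $fh or die "close $tmp: $!\n";
	my $dst = "$dir/$name";
	rename($tmp, $dst) or die "rename $tmp => $dst: $!\n";
	$fh->unlink_on_destroy(0); # $tmp is gone after a successful rename
	$dst;
}

my $dir = File::Temp->newdir;
my $dst = atomic_write_sketch("$dir", 'alternates', 0644,
			["/path/a/objects\n", "/path/b/objects\n"]);
```

If any step before the rename dies, File::Temp still unlinks the
temporary file and the old destination is left untouched.
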
 lib/PublicInbox/V2Writable.pm | 23 ++++++++++++++---------
 1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index eecc702b..aa812a6b 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -681,6 +681,18 @@ sub done {
 	die $err if $err;
 }
 
+sub write_alternates ($$$) {
+	my ($info_dir, $mode, $out) = @_;
+	my $fh = File::Temp->new(TEMPLATE => 'alt-XXXXXXXX', DIR => $info_dir);
+	my $tmp = $fh->filename;
+	print $fh @$out or die "print $tmp: $!\n";
+	chmod($mode, $fh) or die "fchmod $tmp: $!\n";
+	close $fh or die "close $tmp $!\n";
+	my $alt = "$info_dir/alternates";
+	rename($tmp, $alt) or die "rename $tmp => $alt: $!\n";
+	$fh->unlink_on_destroy(0);
+}
+
 sub fill_alternates ($$) {
 	my ($self, $epoch) = @_;
 
@@ -719,15 +731,8 @@ sub fill_alternates ($$) {
 		}
 	}
 	return unless $new;
-
-	my $fh = File::Temp->new(TEMPLATE => 'alt-XXXXXXXX', DIR => $info_dir);
-	my $tmp = $fh->filename;
-	print $fh join("\n", sort { $alt{$b} <=> $alt{$a} } keys %alt), "\n"
-		or die "print $tmp: $!\n";
-	chmod($mode, $fh) or die "fchmod $tmp: $!\n";
-	close $fh or die "close $tmp $!\n";
-	rename($tmp, $alt) or die "rename $tmp => $alt: $!\n";
-	$fh->unlink_on_destroy(0);
+	write_alternates($info_dir, $mode,
+		[join("\n", sort { $alt{$b} <=> $alt{$a} } keys %alt), "\n"]);
 }
 
 sub git_init {


* [PATCH 11/52] searchidxshard: allow msgref to be undef
  2020-10-27  7:54 [PATCH 00/52] detached external index: mostly Eric Wong
                   ` (9 preceding siblings ...)
  2020-10-27  7:54 ` [PATCH 10/52] v2writable: hoist out write_alternates Eric Wong
@ 2020-10-27  7:54 ` Eric Wong
  2020-10-27  7:54 ` [PATCH 12/52] v2writable: idx_shard: simplify callers Eric Wong
                   ` (42 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: Eric Wong @ 2020-10-27  7:54 UTC (permalink / raw)
  To: meta

We don't need to keep it in code paths which are guaranteed to
only see PublicInbox::Eml (and not Email::MIME or PublicInbox::MIME
which did not round-trip properly).  However, we must set
{raw_bytes} since PublicInbox::Eml may add an extra "\n" for
rare messages with no bodies.
---
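The two defaulting lines can be exercised in isolation with the sketch
below.  FakeEml is a hypothetical stand-in for PublicInbox::Eml that
only implements ->as_string, and fill_msgref_sketch is a hypothetical
helper, not a function in the tree.

```perl
use strict;
use warnings;

# Hypothetical stand-in for PublicInbox::Eml; only ->as_string is needed.
package FakeEml;
sub new { my ($class, $str) = @_; bless { str => $str }, $class }
sub as_string { $_[0]->{str} }

package main;

# Hypothetical helper isolating the two defaulting lines from index_raw
sub fill_msgref_sketch {
	my ($msgref, $eml, $smsg) = @_;
	$msgref //= \($eml->as_string); # serialize only when no ref given
	$smsg->{raw_bytes} //= length($$msgref); # keep caller's value if set
	$msgref;
}

my $eml = FakeEml->new("Subject: x\n\n");
my $smsg = {};
my $ref = fill_msgref_sketch(undef, $eml, $smsg);
# $smsg->{raw_bytes} is now 12, the length of the serialized message
```

A caller that already knows {raw_bytes} keeps its own value, since
`//=` only assigns when the left side is undefined.
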
 lib/PublicInbox/SearchIdxShard.pm | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/lib/PublicInbox/SearchIdxShard.pm b/lib/PublicInbox/SearchIdxShard.pm
index 8e24aa1b..8ff9ab8b 100644
--- a/lib/PublicInbox/SearchIdxShard.pm
+++ b/lib/PublicInbox/SearchIdxShard.pm
@@ -107,13 +107,15 @@ sub shard_worker_loop ($$$$$) {
 sub index_raw {
 	my ($self, $msgref, $eml, $smsg) = @_;
 	if (my $w = $self->{w}) {
+		$msgref //= \($eml->as_string);
+		$smsg->{raw_bytes} //= length($$msgref);
 		# mid must be last, it can contain spaces (but not LF)
 		print $w join(' ', @$smsg{qw(raw_bytes bytes
 						num blob ds ts tid mid)}),
 			"\n", $$msgref or die "failed to write shard $!\n";
 	} else {
 		if ($eml) {
-			undef $$msgref;
+			undef($$msgref) if $msgref;
 		} else { # --xapian-only + --sequential-shard:
 			$eml = PublicInbox::Eml->new($msgref);
 		}


* [PATCH 12/52] v2writable: idx_shard: simplify callers
  2020-10-27  7:54 [PATCH 00/52] detached external index: mostly Eric Wong
                   ` (10 preceding siblings ...)
  2020-10-27  7:54 ` [PATCH 11/52] searchidxshard: allow msgref to be undef Eric Wong
@ 2020-10-27  7:54 ` Eric Wong
  2020-10-27  7:54 ` [PATCH 13/52] v2writable: count_shards: allow working without {ibx} Eric Wong
                   ` (41 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: Eric Wong @ 2020-10-27  7:54 UTC (permalink / raw)
  To: meta

This will make it easier to use in ExtSearchIdx: callers pass the
message number and idx_shard performs the modulo itself.
---
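A small sketch of the new calling convention, with plain strings
standing in for PublicInbox::SearchIdxShard objects.  idx_shard_sketch
is a hypothetical name used only for illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in: shards are plain strings instead of
# PublicInbox::SearchIdxShard objects.
sub idx_shard_sketch {
	my ($self, $num) = @_;
	my $shards = $self->{idx_shards};
	# the modulo now lives here instead of in every caller
	$shards->[$num % scalar(@$shards)];
}

my $self = { idx_shards => [ qw(shard0 shard1 shard2) ] };
my @picked = map { idx_shard_sketch($self, $_) } (1, 2, 3, 4);
# @picked is (shard1, shard2, shard0, shard1)
```

Deriving the divisor from the array length means the helper works for
any object carrying {idx_shards}, whether or not {shards} is set.
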
 lib/PublicInbox/V2Writable.pm | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index aa812a6b..f575ba11 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -133,12 +133,17 @@ sub add {
 	$self->{ibx}->with_umask(\&_add, $self, $eml, $check_cb);
 }
 
+sub idx_shard ($$) {
+	my ($self, $num) = @_;
+	$self->{idx_shards}->[$num % scalar(@{$self->{idx_shards}})];
+}
+
 # indexes a message, returns true if checkpointing is needed
 sub do_idx ($$$$) {
 	my ($self, $msgref, $mime, $smsg) = @_;
 	$smsg->{bytes} = $smsg->{raw_bytes} + crlf_adjust($$msgref);
 	$self->{oidx}->add_overview($mime, $smsg);
-	my $idx = idx_shard($self, $smsg->{num} % $self->{shards});
+	my $idx = idx_shard($self, $smsg->{num});
 	$idx->index_raw($msgref, $mime, $smsg);
 	my $n = $self->{transact_bytes} += $smsg->{raw_bytes};
 	$n >= $self->{batch_bytes};
@@ -249,11 +254,6 @@ sub v2_num_for_harder {
 	($num, $mid0);
 }
 
-sub idx_shard {
-	my ($self, $shard_i) = @_;
-	$self->{idx_shards}->[$shard_i];
-}
-
 sub _idx_init { # with_umask callback
 	my ($self, $opt) = @_;
 	$self->lock_acquire unless $opt && $opt->{-skip_lock};
@@ -1102,7 +1102,7 @@ sub unindex_oid_remote ($$$) {
 	my ($self, $oid, $mid) = @_;
 	my @removed = $self->{oidx}->remove_oid($oid, $mid);
 	for my $num (@removed) {
-		my $idx = idx_shard($self, $num % $self->{shards});
+		my $idx = idx_shard($self, $num);
 		$idx->shard_remove($oid, $num);
 	}
 }
@@ -1183,7 +1183,7 @@ sub sync_ranges ($$$) {
 sub index_xap_only { # git->cat_async callback
 	my ($bref, $oid, $type, $size, $smsg) = @_;
 	my $self = $smsg->{v2w};
-	my $idx = idx_shard($self, $smsg->{num} % $self->{shards});
+	my $idx = idx_shard($self, $smsg->{num});
 	$smsg->{raw_bytes} = $size;
 	$idx->index_raw($bref, undef, $smsg);
 	$self->{transact_bytes} += $size;


* [PATCH 13/52] v2writable: count_shards: allow working without {ibx}
  2020-10-27  7:54 [PATCH 00/52] detached external index: mostly Eric Wong
                   ` (11 preceding siblings ...)
  2020-10-27  7:54 ` [PATCH 12/52] v2writable: idx_shard: simplify callers Eric Wong
@ 2020-10-27  7:54 ` Eric Wong
  2020-10-27  7:54 ` [PATCH 14/52] overidx: introduce changes for external index Eric Wong
                   ` (40 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: Eric Wong @ 2020-10-27  7:54 UTC (permalink / raw)
  To: meta

This will be needed for ExtSearchIdx which doesn't have a
persistent PublicInbox::Inbox object.
---
 lib/PublicInbox/V2Writable.pm | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index f575ba11..667a11f8 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -65,11 +65,13 @@ sub nproc_shards ($) {
 
 sub count_shards ($) {
 	my ($self) = @_;
-	# always load existing shards in case core count changes:
-	# Also, shard count may change while -watch is running
-	my $srch = $self->{ibx}->search or return 0;
-	delete $self->{ibx}->{search};
-	$srch->{nshard} // 0
+	$self->{ibx} ? do {
+		# always load existing shards in case core count changes:
+		# Also, shard count may change while -watch is running
+		my $srch = $self->{ibx}->search or return 0;
+		delete $self->{ibx}->{search};
+		$srch->{nshard} // 0
+	} : $self->{nshard}; # self->{nshard} is for ExtSearchIdx
 }
 
 sub new {


* [PATCH 14/52] overidx: introduce changes for external index
  2020-10-27  7:54 [PATCH 00/52] detached external index: mostly Eric Wong
                   ` (12 preceding siblings ...)
  2020-10-27  7:54 ` [PATCH 13/52] v2writable: count_shards: allow working without {ibx} Eric Wong
@ 2020-10-27  7:54 ` Eric Wong
  2020-10-27  7:54 ` [PATCH 15/52] v2: some changes for ExtSearchIdx compatibility Eric Wong
                   ` (39 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: Eric Wong @ 2020-10-27  7:54 UTC (permalink / raw)
  To: meta

Since external indices won't have msgmap.sqlite3, we'll need to
store last_commit-* metadata in over.sqlite3 instead.  This
has longer limits to account for path names or newsgroup names
stored in keys.

We'll also rely on built-in counters for Xapian document IDs,
since msgmap.sqlite3 no longer provides an AUTOINCREMENT column.
---
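The oidmap table stores object IDs in binary form.  The sketch below
shows only the hex-to-binary packing that oid2num and oid_add rely on,
without touching SQLite; hex2bin and bin2hex are hypothetical helper
names.

```perl
use strict;
use warnings;

# hex2bin/bin2hex are hypothetical helper names for illustration
sub hex2bin { pack('H*', $_[0]) }
sub bin2hex { unpack('H*', $_[0]) }

my $sha1_hex = 'deadbeef' x 5;		# 40 hex digits
my $sha256_hex = 'deadbeef' x 8;	# 64 hex digits
my $sha1_bin = hex2bin($sha1_hex);
my $sha256_bin = hex2bin($sha256_hex);
# length($sha1_bin) is 20 and length($sha256_bin) is 32
```

Binary storage halves the size of each key compared to hex, which
matters for a table holding one row per indexed blob.
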
 lib/PublicInbox/OverIdx.pm | 76 ++++++++++++++++++++++++++++++++++++++
 t/over.t                   | 11 ++++++
 2 files changed, 87 insertions(+)

diff --git a/lib/PublicInbox/OverIdx.pm b/lib/PublicInbox/OverIdx.pm
index db4b7738..09bca790 100644
--- a/lib/PublicInbox/OverIdx.pm
+++ b/lib/PublicInbox/OverIdx.pm
@@ -512,4 +512,80 @@ EOM
 	$pr->("I: rethread culled $total ghosts\n") if $pr && $total;
 }
 
+# used for cross-inbox search
+sub eidx_prep ($) {
+	my ($self) = @_;
+	$self->{-eidx_prep} //= do {
+		my $dbh = $self->dbh;
+		$dbh->do(<<'');
+INSERT OR IGNORE INTO counter (key) VALUES ('oidmap_num')
+
+		$dbh->do(<<'');
+INSERT OR IGNORE INTO counter (key) VALUES ('eidx_docid')
+
+		$dbh->do(<<'');
+CREATE TABLE IF NOT EXISTS oidmap (
+	num INTEGER NOT NULL, /* NNTP article number == IMAP UID */
+	oidbin VARBINARY, /* 20-byte SHA-1 or 32-byte SHA-256 */
+	UNIQUE (num),
+	UNIQUE (oidbin)
+)
+
+		$dbh->do(<<'');
+CREATE TABLE IF NOT EXISTS eidx_meta (
+	key VARCHAR(255) PRIMARY KEY,
+	val VARCHAR(255) NOT NULL
+)
+
+		$dbh;
+	};
+}
+
+sub eidx_meta { # requires transaction
+	my ($self, $key, $val) = @_;
+
+	my $sql = 'SELECT val FROM eidx_meta WHERE key = ? LIMIT 1';
+	my $dbh = $self->{dbh};
+	defined($val) or return $dbh->selectrow_array($sql, undef, $key);
+
+	my $prev = $dbh->selectrow_array($sql, undef, $key);
+	if (defined $prev) {
+		$sql = 'UPDATE eidx_meta SET val = ? WHERE key = ?';
+		$dbh->do($sql, undef, $val, $key);
+	} else {
+		$sql = 'INSERT INTO eidx_meta (key,val) VALUES (?,?)';
+		$dbh->do($sql, undef, $key, $val);
+	}
+	$prev;
+}
+
+sub eidx_max {
+	my ($self) = @_;
+	get_counter($self->{dbh}, 'eidx_docid');
+}
+
+sub oid2num {
+	my ($self, $oidhex) = @_;
+	my $dbh = eidx_prep($self);
+	my $sth = $dbh->prepare_cached(<<'', undef, 1);
+SELECT num FROM oidmap WHERE oidbin = ?
+
+	$sth->bind_param(1, pack('H*', $oidhex), SQL_BLOB);
+	$sth->execute;
+	$sth->fetchrow_array;
+}
+
+sub oid_add {
+	my ($self, $oidhex) = @_;
+	my $dbh = eidx_prep($self);
+	my $num = adj_counter($self, 'oidmap_num', '+');
+	my $sth = $dbh->prepare_cached(<<'');
+INSERT INTO oidmap (num, oidbin) VALUES (?,?)
+
+	$sth->bind_param(1, $num);
+	$sth->bind_param(2, pack('H*', $oidhex), SQL_BLOB);
+	$sth->execute;
+	$num;
+}
+
 1;
diff --git a/t/over.t b/t/over.t
index 4c8f8098..3e2860f8 100644
--- a/t/over.t
+++ b/t/over.t
@@ -74,4 +74,15 @@ SKIP: {
 		'WAL journal_mode not clobbered if manually set');
 }
 
+# ext index additions
+{
+	my $hex = 'deadbeefcafe';
+	my $n = $over->oid_add($hex);
+	ok($n > 0, 'oid_add returned number');
+	is($over->oid2num($hex), $n, 'oid2num works');
+	my $n2 = $over->oid_add($hex.$hex);
+	ok($n2 > $n, 'oid_add increments');
+	is($over->oid2num($hex.$hex), $n2, 'oid2num works again');
+}
+
 done_testing();


* [PATCH 15/52] v2: some changes for ExtSearchIdx compatibility
  2020-10-27  7:54 [PATCH 00/52] detached external index: mostly Eric Wong
                   ` (13 preceding siblings ...)
  2020-10-27  7:54 ` [PATCH 14/52] overidx: introduce changes for external index Eric Wong
@ 2020-10-27  7:54 ` Eric Wong
  2020-10-27  7:54 ` [PATCH 16/52] inboxwritable: eidx_key for external index Eric Wong
                   ` (38 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: Eric Wong @ 2020-10-27  7:54 UTC (permalink / raw)
  To: meta

We'll be using per-sync-state {ibx} refs instead, so make parts
of the v2 indexing code less dependent on $self->{ibx} where
$self is a V2Writable object.
---
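To illustrate what the relocated git_dir_latest does, here is a
free-function sketch that takes the inbox directory directly instead of
an InboxWritable object.  git_dir_latest_sketch is a hypothetical name
and the epoch numbers in the usage lines are arbitrary.

```perl
use strict;
use warnings;
use File::Temp ();

# Hypothetical free-function variant taking the inbox directory
# directly instead of an InboxWritable object.
sub git_dir_latest_sketch {
	my ($inboxdir, $max) = @_;
	$$max = -1; # stays -1 when no epochs exist
	my $pfx = "$inboxdir/git";
	return unless -d $pfx;
	my $latest;
	opendir(my $dh, $pfx) or die "opendir $pfx: $!\n";
	while (defined(my $git_dir = readdir($dh))) {
		$git_dir =~ m!\A([0-9]+)\.git\z! or next;
		if ($1 > $$max) { # numeric compare, so 10 beats 2
			$$max = $1;
			$latest = "$pfx/$git_dir";
		}
	}
	$latest;
}

my $tmp = File::Temp->newdir;
mkdir("$tmp/git") or die "mkdir: $!";
mkdir("$tmp/git/$_.git") or die "mkdir: $!" for (0, 2, 10);
my $epoch_max;
my $latest = git_dir_latest_sketch("$tmp", \$epoch_max);
```

Missing epochs (1 and 3..9 above) are skipped without complaint, in
line with the "missing epochs are fine" comments in the callers.
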
 lib/PublicInbox/InboxWritable.pm | 21 ++++++++++++++
 lib/PublicInbox/V2Writable.pm    | 49 +++++++++++---------------------
 lib/PublicInbox/Xapcmd.pm        |  2 +-
 3 files changed, 38 insertions(+), 34 deletions(-)

diff --git a/lib/PublicInbox/InboxWritable.pm b/lib/PublicInbox/InboxWritable.pm
index 752f1997..e97c7e2d 100644
--- a/lib/PublicInbox/InboxWritable.pm
+++ b/lib/PublicInbox/InboxWritable.pm
@@ -298,4 +298,25 @@ sub warn_ignore_cb {
 	}
 }
 
+# v2+ only
+sub git_dir_n { "$_[0]->{inboxdir}/git/$_[1].git" }
+
+# v2+ only
+sub git_dir_latest {
+	my ($self, $max) = @_;
+	$$max = -1;
+	my $pfx = "$self->{inboxdir}/git";
+	return unless -d $pfx;
+	my $latest;
+	opendir my $dh, $pfx or die "opendir $pfx: $!\n";
+	while (defined(my $git_dir = readdir($dh))) {
+		$git_dir =~ m!\A([0-9]+)\.git\z! or next;
+		if ($1 > $$max) {
+			$$max = $1;
+			$latest = "$pfx/$git_dir";
+		}
+	}
+	$latest;
+}
+
 1;
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index 667a11f8..6af50f5d 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -120,7 +120,7 @@ sub init_inbox {
 	$self->idx_init;
 	$self->{mm}->skip_artnum($skip_artnum) if defined $skip_artnum;
 	my $epoch_max = -1;
-	git_dir_latest($self, \$epoch_max);
+	$self->{ibx}->git_dir_latest(\$epoch_max);
 	if (defined $skip_epoch && $epoch_max == -1) {
 		$epoch_max = $skip_epoch;
 	}
@@ -320,12 +320,13 @@ sub idx_init {
 sub _replace_oids ($$$) {
 	my ($self, $mime, $replace_map) = @_;
 	$self->done;
-	my $pfx = "$self->{ibx}->{inboxdir}/git";
+	my $ibx = $self->{ibx};
+	my $pfx = "$ibx->{inboxdir}/git";
 	my $rewrites = []; # epoch => commit
 	my $max = $self->{epoch_max};
 
 	unless (defined($max)) {
-		defined(my $latest = git_dir_latest($self, \$max)) or return;
+		defined(my $latest = $ibx->git_dir_latest(\$max)) or return;
 		$self->{epoch_max} = $max;
 	}
 
@@ -748,23 +749,6 @@ sub git_init {
 	$git_dir
 }
 
-sub git_dir_latest {
-	my ($self, $max) = @_;
-	$$max = -1;
-	my $pfx = "$self->{ibx}->{inboxdir}/git";
-	return unless -d $pfx;
-	my $latest;
-	opendir my $dh, $pfx or die "opendir $pfx: $!\n";
-	while (defined(my $git_dir = readdir($dh))) {
-		$git_dir =~ m!\A([0-9]+)\.git\z! or next;
-		if ($1 > $$max) {
-			$$max = $1;
-			$latest = "$pfx/$git_dir";
-		}
-	}
-	$latest;
-}
-
 sub importer {
 	my ($self) = @_;
 	my $im = $self->{im};
@@ -783,7 +767,7 @@ sub importer {
 	}
 	my $epoch = 0;
 	my $max;
-	my $latest = git_dir_latest($self, \$max);
+	my $latest = $self->{ibx}->git_dir_latest(\$max);
 	if (defined $latest) {
 		my $git = PublicInbox::Git->new($latest);
 		my $packed_bytes = $git->packed_bytes;
@@ -977,8 +961,6 @@ sub update_last_commit {
 	last_epoch_commit($self, $i, $cmt);
 }
 
-sub git_dir_n ($$) { "$_[0]->{ibx}->{inboxdir}/git/$_[1].git" }
-
 sub last_commits ($$) {
 	my ($self, $epoch_max) = @_;
 	my $heads = [];
@@ -989,8 +971,8 @@ sub last_commits ($$) {
 }
 
 # returns a revision range for git-log(1)
-sub log_range ($$$$$) {
-	my ($self, $sync, $git, $i, $tip) = @_;
+sub log_range ($$$$) {
+	my ($sync, $git, $i, $tip) = @_;
 	my $opt = $sync->{-opt};
 	my $pr = $opt->{-progress} if (($opt->{verbose} || 0) > 1);
 	my $cur = $sync->{ranges}->[$i] or do {
@@ -1043,14 +1025,14 @@ sub sync_prepare ($$$) {
 	my ($self, $sync, $epoch_max) = @_;
 	my $pr = $sync->{-opt}->{-progress};
 	my $regen_max = 0;
-	my $head = $self->{ibx}->{ref_head} || 'HEAD';
+	my $head = $sync->{ibx}->{ref_head} || 'HEAD';
 
 	# reindex stops at the current heads and we later rerun index_sync
 	# without {reindex}
 	my $reindex_heads = $self->last_commits($epoch_max) if $sync->{reindex};
 
 	for (my $i = $epoch_max; $i >= 0; $i--) {
-		my $git_dir = git_dir_n($self, $i);
+		my $git_dir = $sync->{ibx}->git_dir_n($i);
 		-d $git_dir or next; # missing epochs are fine
 		my $git = PublicInbox::Git->new($git_dir);
 		if ($reindex_heads) {
@@ -1059,7 +1041,7 @@ sub sync_prepare ($$$) {
 		chomp(my $tip = $git->qx(qw(rev-parse -q --verify), $head));
 
 		next if $?; # new repo
-		my $range = log_range($self, $sync, $git, $i, $tip) or next;
+		my $range = log_range($sync, $git, $i, $tip) or next;
 		# can't use 'rev-list --count' if we use --diff-filter
 		$pr->("$i.git counting $range ... ") if $pr;
 		# Don't bump num_highwater on --reindex by using {D}.
@@ -1067,7 +1049,7 @@ sub sync_prepare ($$$) {
 		# because we want NNTP article number gaps from unindexed
 		# messages to show up in mirrors, too.
 		$sync->{D} //= $sync->{reindex} ? {} : undef; # OID_BIN => NR
-		my $stk = log2stack($sync, $git, $range, $self->{ibx});
+		my $stk = log2stack($sync, $git, $range, $sync->{ibx});
 		my $nr = $stk ? $stk->num_records : 0;
 		$pr->("$nr\n") if $pr;
 		$sync->{stacks}->[$i] = $stk if $stk;
@@ -1143,7 +1125,7 @@ sub git { $_[0]->{ibx}->git }
 
 # this is rare, it only happens when we get discontiguous history in
 # a mirror because the source used -purge or -edit
-sub unindex ($$$$) {
+sub unindex_epoch ($$$$) {
 	my ($self, $sync, $git, $unindex_range) = @_;
 	my $unindexed = $sync->{unindexed} //= {}; # $mid0 => $num
 	my $before = scalar keys %$unindexed;
@@ -1216,11 +1198,11 @@ sub index_xap_step ($$$;$) {
 sub index_epoch ($$$) {
 	my ($self, $sync, $i) = @_;
 
-	my $git_dir = git_dir_n($self, $i);
+	my $git_dir = $sync->{ibx}->git_dir_n($i);
 	-d $git_dir or return; # missing epochs are fine
 	my $git = PublicInbox::Git->new($git_dir);
 	if (my $unindex_range = delete $sync->{unindex_range}->{$i}) { # rare
-		unindex($self, $sync, $git, $unindex_range);
+		unindex_epoch($self, $sync, $git, $unindex_range);
 	}
 	defined(my $stk = $sync->{stacks}->[$i]) or return;
 	$sync->{stacks}->[$i] = undef;
@@ -1284,7 +1266,7 @@ sub index_sync {
 
 	my $pr = $opt->{-progress};
 	my $epoch_max;
-	my $latest = git_dir_latest($self, \$epoch_max);
+	my $latest = $self->{ibx}->git_dir_latest(\$epoch_max);
 	return unless defined $latest;
 
 	my $seq = $opt->{sequential_shard};
@@ -1301,6 +1283,7 @@ sub index_sync {
 		reindex => $opt->{reindex},
 		-opt => $opt,
 		v2w => $self,
+		ibx => $self->{ibx},
 	};
 	$sync->{ranges} = sync_ranges($self, $sync, $epoch_max);
 	if (sync_prepare($self, $sync, $epoch_max)) {
diff --git a/lib/PublicInbox/Xapcmd.pm b/lib/PublicInbox/Xapcmd.pm
index 6a74daf9..4332943c 100644
--- a/lib/PublicInbox/Xapcmd.pm
+++ b/lib/PublicInbox/Xapcmd.pm
@@ -110,7 +110,7 @@ sub prepare_reindex ($$$) {
 		}
 	} else { # v2
 		my $max;
-		$im->git_dir_latest(\$max) or return;
+		$ibx->git_dir_latest(\$max) or return;
 		my $from = $opt->{reindex}->{from};
 		my $mm = $ibx->mm;
 		my $v = PublicInbox::Search::SCHEMA_VERSION();


* [PATCH 16/52] inboxwritable: eidx_key for external index
  2020-10-27  7:54 [PATCH 00/52] detached external index: mostly Eric Wong
                   ` (14 preceding siblings ...)
  2020-10-27  7:54 ` [PATCH 15/52] v2: some changes for ExtSearchIdx compatibility Eric Wong
@ 2020-10-27  7:54 ` Eric Wong
  2020-10-27  7:54 ` [PATCH 17/52] v2writable: rename remaining "remote" terminology Eric Wong
                   ` (37 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: Eric Wong @ 2020-10-27  7:54 UTC (permalink / raw)
  To: meta

This is preferable to open-coding "newsgroup // inboxdir" everywhere.
---
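A sketch of how the key feeds into the xref3 boolean term, using plain
hashrefs in place of inbox objects.  eidx_key_sketch and xref3_term are
hypothetical names, and the newsgroup and directory values are made up.

```perl
use strict;
use warnings;

# Plain hashrefs stand in for inbox objects; eidx_key_sketch and
# xref3_term are hypothetical names.
sub eidx_key_sketch { $_[0]->{newsgroup} // $_[0]->{inboxdir} }

sub xref3_term {
	my ($ibx, $num, $oid) = @_;
	# same "P" . "$eidx_key:$num:$oid" layout as add_xref3
	'P' . join(':', eidx_key_sketch($ibx), $num, $oid);
}

my $with_ng = { newsgroup => 'inbox.comp.example', inboxdir => '/srv/a' };
my $no_ng = { inboxdir => '/srv/b' };
my @terms = (xref3_term($with_ng, 1, 'deadbeef'),
		xref3_term($no_ng, 2, 'deadbeef'));
# @terms is ('Pinbox.comp.example:1:deadbeef', 'P/srv/b:2:deadbeef')
```

The newsgroup name wins when both are present; inboxdir is only the
fallback for inboxes without one.
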
 lib/PublicInbox/InboxWritable.pm  |  2 ++
 lib/PublicInbox/SearchIdx.pm      | 12 ++++++------
 lib/PublicInbox/SearchIdxShard.pm | 32 ++++++++++++++++++++-----------
 3 files changed, 29 insertions(+), 17 deletions(-)

diff --git a/lib/PublicInbox/InboxWritable.pm b/lib/PublicInbox/InboxWritable.pm
index e97c7e2d..d3c255c7 100644
--- a/lib/PublicInbox/InboxWritable.pm
+++ b/lib/PublicInbox/InboxWritable.pm
@@ -319,4 +319,6 @@ sub git_dir_latest {
 	$latest;
 }
 
+sub eidx_key { $_[0]->{newsgroup} // $_[0]->{inboxdir} }
+
 1;
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 283bdd6c..061a8153 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -368,9 +368,9 @@ sub add_xapian ($$$$) {
 	$tg->set_document($doc);
 	index_headers($self, $smsg);
 
-	if (my $ng_or_dir = $self->{ng_or_dir}) { # external index
+	if (defined(my $eidx_key = $smsg->{eidx_key})) {
 		$doc->add_boolean_term('P'.
-				"$ng_or_dir:$smsg->{num}:$smsg->{blob}");
+				"$eidx_key:$smsg->{num}:$smsg->{blob}");
 	}
 	msg_iter($eml, \&index_xapian, [ $self, $doc ]);
 	index_ids($self, $doc, $eml, $mids);
@@ -456,21 +456,21 @@ sub _get_doc ($$$) {
 }
 
 sub add_xref3 {
-	my ($self, $docid, $xnum, $oid, $ng_or_dir, $eml) = @_;
+	my ($self, $docid, $xnum, $oid, $eidx_key, $eml) = @_;
 	begin_txn_lazy($self);
 	my $doc = _get_doc($self, $docid, $oid) or return;
 	term_generator($self)->set_document($doc);
-	$doc->add_boolean_term('P'."$ng_or_dir:$xnum:$oid");
+	$doc->add_boolean_term('P'."$eidx_key:$xnum:$oid");
 	index_list_id($self, $doc, $eml);
 	$self->{xdb}->replace_document($docid, $doc);
 }
 
 sub remove_xref3 {
-	my ($self, $docid, $oid, $ng_or_dir, $eml) = @_;
+	my ($self, $docid, $oid, $eidx_key, $eml) = @_;
 	begin_txn_lazy($self);
 	my $doc = _get_doc($self, $docid, $oid) or return;
 	my $xref3 = PublicInbox::Smsg::xref3(undef, $doc);
-	for (grep(/\A\Q$ng_or_dir\E:[0-9]+:\Q$oid\E\z/, @$xref3)) {
+	for (grep(/\A\Q$eidx_key\E:[0-9]+:\Q$oid\E\z/, @$xref3)) {
 		$doc->remove_term('P' . $_);
 	}
 	for my $l ($eml->header_raw('List-Id')) {
diff --git a/lib/PublicInbox/SearchIdxShard.pm b/lib/PublicInbox/SearchIdxShard.pm
index 8ff9ab8b..fa77a9f9 100644
--- a/lib/PublicInbox/SearchIdxShard.pm
+++ b/lib/PublicInbox/SearchIdxShard.pm
@@ -74,17 +74,21 @@ sub shard_worker_loop ($$$$$) {
 		} elsif ($line =~ /\AD ([a-f0-9]{40,}) ([0-9]+)\n\z/s) {
 			$self->remove_by_oid($1, $2 + 0);
 		} elsif ($line =~ s/\A\+X //) {
-			my ($len, $docid, $xnum, $oid, $ng_or_dir) =
+			my ($len, $docid, $xnum, $oid, $eidx_key) =
 							split(/ /, $line, 5);
-			$self->add_xref3($docid, $xnum, $oid, $ng_or_dir,
+			$self->add_xref3($docid, $xnum, $oid, $eidx_key,
 						eml($r, $len));
 		} elsif ($line =~ s/\A-X //) {
-			my ($len, $docid, $xnum, $oid, $ng_or_dir) =
+			my ($len, $docid, $xnum, $oid, $eidx_key) =
 							split(/ /, $line, 5);
 			$self->remove_xref3($docid, $xnum, $oid,
-						$ng_or_dir, eml($r, $len));
+						$eidx_key, eml($r, $len));
 		} else {
 			chomp $line;
+			my $eidx_key;
+			if ($line =~ s/\AX(.+)\0//) {
+				$eidx_key = $1;
+			}
 			# n.b. $mid may contain spaces(!)
 			my ($len, $bytes, $num, $oid, $ds, $ts, $tid, $mid)
 				= split(/ /, $line, 8);
@@ -98,6 +102,7 @@ sub shard_worker_loop ($$$$$) {
 				ds => $ds,
 				ts => $ts,
 			}, 'PublicInbox::Smsg';
+			$smsg->{eidx_key} = $eidx_key if defined($eidx_key);
 			$self->add_message(eml($r, $len), $smsg);
 		}
 	}
@@ -105,8 +110,12 @@ sub shard_worker_loop ($$$$$) {
 }
 
 sub index_raw {
-	my ($self, $msgref, $eml, $smsg) = @_;
+	my ($self, $msgref, $eml, $smsg, $ibx) = @_;
 	if (my $w = $self->{w}) {
+		if ($ibx) {
+			print $w 'X', $ibx->eidx_key, "\0" or die
+				"failed to write shard: $!\n";
+		}
 		$msgref //= \($eml->as_string);
 		$smsg->{raw_bytes} //= length($$msgref);
 		# mid must be last, it can contain spaces (but not LF)
@@ -120,33 +129,34 @@ sub index_raw {
 			$eml = PublicInbox::Eml->new($msgref);
 		}
 		$self->begin_txn_lazy;
+		$smsg->{eidx_key} = $ibx->eidx_key if $ibx;
 		$self->add_message($eml, $smsg);
 	}
 }
 
 sub shard_add_xref3 {
 	my ($self, $docid, $xnum, $oid, $xibx, $eml) = @_;
-	my $ng_or_dir = $xibx->{newsgroup} // $xibx->{inboxdir};
+	my $eidx_key = $xibx->eidx_key;
 	if (my $w = $self->{w}) {
 		my $hdr = $eml->header_obj->as_string;
 		my $len = length($hdr);
-		print $w "+X $len $docid $xnum $oid $ng_or_dir\n", $hdr or
+		print $w "+X $len $docid $xnum $oid $eidx_key\n", $hdr or
 			die "failed to write shard: $!";
 	} else {
-		$self->add_xref3($docid, $xnum, $oid, $ng_or_dir, $eml);
+		$self->add_xref3($docid, $xnum, $oid, $eidx_key, $eml);
 	}
 }
 
 sub shard_remove_xref3 {
 	my ($self, $docid, $oid, $xibx, $eml) = @_;
-	my $ng_or_dir = $xibx->{newsgroup} // $xibx->{inboxdir};
+	my $eidx_key = $xibx->eidx_key;
 	if (my $w = $self->{w}) {
 		my $hdr = $eml->header_obj->as_string;
 		my $len = length($hdr);
-		print $w "-X $len $docid $oid $ng_or_dir\n", $hdr or
+		print $w "-X $len $docid $oid $eidx_key\n", $hdr or
 			die "failed to write shard: $!";
 	} else {
-		$self->remove_xref3($docid, $oid, $ng_or_dir, $eml);
+		$self->remove_xref3($docid, $oid, $eidx_key, $eml);
 	}
 }
 


* [PATCH 17/52] v2writable: rename remaining "remote" terminology
  2020-10-27  7:54 [PATCH 00/52] detached external index: mostly Eric Wong
                   ` (15 preceding siblings ...)
  2020-10-27  7:54 ` [PATCH 16/52] inboxwritable: eidx_key for external index Eric Wong
@ 2020-10-27  7:54 ` Eric Wong
  2020-10-27  7:54 ` [PATCH 18/52] v2writable: checkpoint: account for lack of {mm} Eric Wong
                   ` (36 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: Eric Wong @ 2020-10-27  7:54 UTC (permalink / raw)
  To: meta

"remote" used to imply "child process on the same machine" which
was somewhat nonsensical, anyway.  And OverIdx has been in the
same process since v2 was finalized.  So use the suffix "aux"
for "auxiliary" since it can be safely jettisoned without
breaking URLs.
---
 lib/PublicInbox/V2Writable.pm | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index 6af50f5d..f8b7abe1 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -423,7 +423,7 @@ sub rewrite_internal ($$;$$$) {
 			} else { # ->purge or ->remove
 				$self->{mm}->num_delete($num);
 			}
-			unindex_oid_remote($self, $oid, $mid);
+			unindex_oid_aux($self, $oid, $mid);
 		}
 	}
 
@@ -631,7 +631,7 @@ sub checkpoint ($;$) {
 		}
 
 		# last_commit is special, don't commit these until
-		# remote shards are done:
+		# Xapian shards are done:
 		$dbh->begin_work;
 		set_last_commits($self);
 		$dbh->commit;
@@ -1082,7 +1082,7 @@ sub sync_prepare ($$$) {
 	$regen_max + $self->{mm}->num_highwater() || 0;
 }
 
-sub unindex_oid_remote ($$$) {
+sub unindex_oid_aux ($$$) {
 	my ($self, $oid, $mid) = @_;
 	my @removed = $self->{oidx}->remove_oid($oid, $mid);
 	for my $num (@removed) {
@@ -1117,7 +1117,7 @@ sub unindex_oid ($$;$) { # git->cat_async callback
 			}
 			$mm->num_delete($num);
 		}
-		unindex_oid_remote($self, $oid, $mid);
+		unindex_oid_aux($self, $oid, $mid);
 	}
 }
 


* [PATCH 18/52] v2writable: checkpoint: account for lack of {mm}
  2020-10-27  7:54 [PATCH 00/52] detached external index: mostly Eric Wong
                   ` (16 preceding siblings ...)
  2020-10-27  7:54 ` [PATCH 17/52] v2writable: rename remaining "remote" terminology Eric Wong
@ 2020-10-27  7:54 ` Eric Wong
  2020-10-27  7:54 ` [PATCH 19/52] extsearchidx: initial implementation Eric Wong
                   ` (35 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: Eric Wong @ 2020-10-27  7:54 UTC (permalink / raw)
  To: meta

ExtSearchIdx will not have Msgmap, since it may index
non-email blobs in the future (it'll still be usable
with IMAP, but not NNTP).
---
 lib/PublicInbox/V2Writable.pm | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index f8b7abe1..867e4d2b 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -609,10 +609,11 @@ sub checkpoint ($;$) {
 	}
 	my $shards = $self->{idx_shards};
 	if ($shards) {
-		my $dbh = $self->{mm}->{dbh};
+		my $mm = $self->{mm};
+		my $dbh = $mm ? $mm->{dbh} : undef;
 
 		# SQLite msgmap data is second in importance
-		$dbh->commit;
+		$dbh->commit if $dbh;
 
 		# SQLite overview is third
 		$self->{oidx}->commit_lazy;
@@ -632,11 +633,12 @@ sub checkpoint ($;$) {
 
 		# last_commit is special, don't commit these until
 		# Xapian shards are done:
-		$dbh->begin_work;
+		$dbh->begin_work if $dbh;
 		set_last_commits($self);
-		$dbh->commit;
-
-		$dbh->begin_work;
+		if ($dbh) {
+			$dbh->commit;
+			$dbh->begin_work;
+		}
 	}
 	$self->{total_bytes} += $self->{transact_bytes};
 	$self->{transact_bytes} = 0;


* [PATCH 19/52] extsearchidx: initial implementation
  2020-10-27  7:54 [PATCH 00/52] detached external index: mostly Eric Wong
                   ` (17 preceding siblings ...)
  2020-10-27  7:54 ` [PATCH 18/52] v2writable: checkpoint: account for lack of {mm} Eric Wong
@ 2020-10-27  7:54 ` Eric Wong
  2020-10-27  7:54 ` [PATCH 20/52] searchidx: index eidx_key as a boolean term Eric Wong
                   ` (34 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: Eric Wong @ 2020-10-27  7:54 UTC (permalink / raw)
  To: meta

It compiles...
---
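The git_blob_digest helper added below relies on how git names blobs:
SHA-1 over a "blob <size>\0" header followed by the raw bytes.  A
standalone sketch of that calculation, where `blob_oid_hex' is a
made-up name used only for illustration:

```perl
use strict;
use warnings;
use Digest::SHA;

# git hashes a blob as SHA-1 over "blob <size>\0" followed by the raw bytes
sub blob_oid_hex {
	my ($bref) = @_; # scalar ref to the raw blob
	my $dig = Digest::SHA->new(1);
	$dig->add('blob '.length($$bref)."\0");
	$dig->add($$bref);
	$dig->hexdigest;
}

my $empty = '';
my $hello = "hello\n";
print blob_oid_hex(\$empty), "\n"; # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print blob_oid_hex(\$hello), "\n"; # ce013625030ba8dba906f756967f9e9ca394464a
```

These should match what `git hash-object' reports for the same
content, which is why the result can be compared against $req->{oid}.
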
 MANIFEST                        |   1 +
 lib/PublicInbox/ExtSearchIdx.pm | 311 ++++++++++++++++++++++++++++++++
 t/extsearch.t                   |   1 +
 3 files changed, 313 insertions(+)
 create mode 100644 lib/PublicInbox/ExtSearchIdx.pm

diff --git a/MANIFEST b/MANIFEST
index 60055d2b..418a2f17 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -122,6 +122,7 @@ lib/PublicInbox/Eml.pm
 lib/PublicInbox/EmlContentFoo.pm
 lib/PublicInbox/ExtMsg.pm
 lib/PublicInbox/ExtSearch.pm
+lib/PublicInbox/ExtSearchIdx.pm
 lib/PublicInbox/FakeInotify.pm
 lib/PublicInbox/Feed.pm
 lib/PublicInbox/Filter/Base.pm
diff --git a/lib/PublicInbox/ExtSearchIdx.pm b/lib/PublicInbox/ExtSearchIdx.pm
new file mode 100644
index 00000000..edf17974
--- /dev/null
+++ b/lib/PublicInbox/ExtSearchIdx.pm
@@ -0,0 +1,311 @@
+# Copyright (C) 2020 all contributors <meta@public-inbox.org>
+# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
+
+# Detached/external index cross inbox search indexing support
+# read-write counterpart to PublicInbox::ExtSearch
+#
+# It's based on the same ideas as public-inbox-v2-format(5) using
+# over.sqlite3 for dedupe and sharded Xapian.  msgmap.sqlite3 is
+# missing, so there is no Message-ID conflict resolution, meaning
+# no NNTP support for now.
+#
+# v2 has a 1:1 mapping of index:inbox or msgmap for NNTP support.
+# This is intended to be an M:N index:inbox mapping, but it'll likely
+# be 1:N in common practice (M==1)
+
+package PublicInbox::ExtSearchIdx;
+use strict;
+use v5.10.1;
+use parent qw(PublicInbox::ExtSearch PublicInbox::Lock);
+use Carp qw(croak carp);
+use PublicInbox::Search;
+use PublicInbox::SearchIdx qw(crlf_adjust);
+use PublicInbox::OverIdx;
+use PublicInbox::V2Writable;
+use PublicInbox::InboxWritable;
+use PublicInbox::Eml;
+use File::Spec;
+
+sub new {
+	my (undef, $dir, $opt, $shard) = @_;
+	my $l = $opt->{indexlevel} // 'full';
+	$l !~ $PublicInbox::SearchIdx::INDEXLEVELS and
+		die "invalid indexlevel=$l\n";
+	$l eq 'basic' and die "E: indexlevel=basic not yet supported\n";
+	my $self = bless {
+		xpfx => "$dir/ei".PublicInbox::Search::SCHEMA_VERSION,
+		topdir => $dir,
+		creat => $opt->{creat},
+		ibx_map => {}, # (newsgroup//inboxdir) => $ibx
+		ibx_list => [],
+		indexlevel => $l,
+		transact_bytes => 0,
+		total_bytes => 0,
+		current_info => '',
+		parallel => 1,
+		lock_path => "$dir/ei.lock",
+	}, __PACKAGE__;
+	$self->{shards} = $self->count_shards || nproc_shards($opt->{creat});
+	my $oidx = PublicInbox::OverIdx->new("$self->{xpfx}/over.sqlite3");
+	$oidx->{-no_fsync} = 1 if $opt->{-no_fsync};
+	$self->{oidx} = $oidx;
+	$self
+}
+
+sub attach_inbox {
+	my ($self, $ibx) = @_;
+	my $key = $ibx->eidx_key;
+	if (!$ibx->over || !$ibx->mm) {
+		warn "W: skipping $key (unindexed)\n";
+		return;
+	}
+	if (!defined($ibx->uidvalidity)) {
+		warn "W: skipping $key (no UIDVALIDITY)\n";
+		return;
+	}
+	my $ibxdir = File::Spec->canonpath($ibx->{inboxdir});
+	if ($ibxdir ne $ibx->{inboxdir}) {
+		warn "W: `$ibx->{inboxdir}' canonicalized to `$ibxdir'\n";
+		$ibx->{inboxdir} = $ibxdir;
+	}
+	$ibx = PublicInbox::InboxWritable->new($ibx);
+	$self->{ibx_map}->{$key} //= do {
+		push @{$self->{ibx_list}}, $ibx;
+		$ibx;
+	}
+}
+
+sub _ibx_attach { # each_inbox callback
+	my ($ibx, $self) = @_;
+	attach_inbox($self, $ibx);
+}
+
+sub attach_config {
+	my ($self, $cfg) = @_;
+	$cfg->each_inbox(\&_ibx_attach, $self);
+}
+
+sub git_blob_digest ($) {
+	my ($bref) = @_;
+	my $dig = Digest::SHA->new(1); # XXX SHA256 later
+	$dig->add('blob '.length($$bref)."\0");
+	$dig->add($$bref);
+	$dig;
+}
+
+sub is_bad_blob ($$$$) {
+	my ($oid, $type, $size, $expect_oid) = @_;
+	if ($type ne 'blob') {
+		carp "W: $expect_oid is not a blob (type=$type)";
+		return 1;
+	}
+	croak "BUG: $oid != $expect_oid" if $oid ne $expect_oid;
+	$size == 0 ? 1 : 0; # size == 0 means purged
+}
+
+sub do_xpost ($$) {
+	my ($req, $smsg) = @_;
+	my $self = $req->{eidx};
+	my $docid = $smsg->{num};
+	my $idx = $self->idx_shard($docid);
+	my $oid = $req->{oid};
+	my $xibx = $self->{sync}->{ibx};
+	my $eml = $req->{eml};
+	if (my $new_smsg = $req->{new_smsg}) { # 'm' on cross-posted message
+		my $xnum = $req->{xnum};
+		$idx->shard_add_xref3($docid, $xnum, $oid, $xibx, $eml);
+	} else { # 'd'
+		$idx->shard_remove_xref3($docid, $oid, $xibx, $eml);
+	}
+}
+
+sub index_unseen ($) {
+	my ($req) = @_;
+	my $new_smsg = $req->{new_smsg} or die 'BUG: {new_smsg} unset';
+	$new_smsg->populate($req->{eml}, $req);
+	my $self = $req->{eidx};
+	my $docid = $self->{oidx}->adj_counter('eidx_docid', '+');
+	$new_smsg->{num} = $docid;
+	my $idx = $self->idx_shard($docid);
+	my $eml = delete $req->{eml};
+	$self->{oidx}->add_overview($eml, $new_smsg);
+	$idx->index_raw(undef, $eml, $new_smsg, delete $new_smsg->{ibx});
+}
+
+sub do_finalize ($) {
+	my ($req) = @_;
+	if (my $indexed = $req->{indexed}) {
+		do_xpost($req, $_) for @$indexed;
+	} elsif (exists $req->{new_smsg}) { # totally unseen message
+		index_unseen($req);
+	} else {
+		warn "W: ignoring delete $req->{oid} (not found)\n";
+	}
+}
+
+sub do_step ($) { # main iterator for adding messages to the index
+	my ($req) = @_;
+	my $self = $req->{eidx};
+	while (1) {
+		if (my $next_arg = $req->{next_arg}) {
+			if (my $smsg = $self->{oidx}->next_by_mid(@$next_arg)) {
+				$req->{cur_smsg} = $smsg;
+				$self->git->cat_async($smsg->{blob},
+							\&ck_existing, $req);
+				return; # ck_existing calls do_step
+			}
+			delete $req->{cur_smsg};
+			delete $req->{next_arg};
+		}
+		my $mid = shift(@{$req->{mids}});
+		last unless defined $mid;
+		my ($id, $prev);
+		$req->{next_arg} = [ $mid, \$id, \$prev ];
+		# loop again
+	}
+	do_finalize($req);
+}
+
+sub ck_existing { # git->cat_async callback
+	my ($bref, $oid, $type, $size, $req) = @_;
+	my $smsg = $req->{cur_smsg} or die 'BUG: {cur_smsg} missing';
+	return if is_bad_blob($oid, $type, $size, $smsg->{blob});
+	my $cur = PublicInbox::Eml->new($bref);
+	if (content_hash($cur) eq $req->{chash}) {
+		push @{$req->{indexed}}, $smsg; # for do_xpost
+	} # else { index_unseen later }
+	do_step($req);
+}
+
+# is the message visible in the inbox currently being indexed?
+# return the number if so
+sub cur_ibx_xnum ($$) {
+	my ($req, $bref) = @_;
+	my $ibx = $req->{sync}->{ibx} or die 'BUG: current {ibx} missing';
+
+	# XXX overkill?
+	git_blob_digest($bref)->hexdigest eq $req->{oid} or die
+		"BUG: blob mismatch $req->{oid}";
+
+	$req->{eml} = PublicInbox::Eml->new($bref);
+	$req->{chash} = content_hash($req->{eml});
+	$req->{mids} = mids($req->{eml});
+	my @q = @{$req->{mids}}; # copy
+	while (defined(my $mid = shift @q)) {
+		my ($id, $prev);
+		while (my $x = $ibx->over->next_by_mid($mid, \$id, \$prev)) {
+			return $x->{num} if $x->{blob} eq $req->{oid};
+		}
+	}
+	undef;
+}
+
+sub m_start { # git->cat_async callback for 'm'
+	my ($bref, $oid, $type, $size, $req) = @_;
+	return if is_bad_blob($oid, $type, $size, $req->{oid});
+	my $new_smsg = $req->{new_smsg} = bless {
+		blob => $oid,
+		raw_bytes => $size,
+	}, 'PublicInbox::Smsg';
+	$new_smsg->{bytes} = $new_smsg->{raw_bytes} + crlf_adjust($$bref);
+	defined($req->{xnum} = cur_ibx_xnum($req, $bref)) or return;
+	$new_smsg->{ibx} = $req->{sync}->{ibx};
+	do_step($req);
+}
+
+sub d_start { # git->cat_async callback for 'd'
+	my ($bref, $oid, $type, $size, $req) = @_;
+	return if is_bad_blob($oid, $type, $size, $req->{oid});
+	return if defined(cur_ibx_xnum($req, $bref)); # was re-added
+	do_step($req);
+}
+
+sub eidx_last_epoch_commit ($$$) {
+	my ($self, $sync, $epoch) = @_;
+	my $key = $sync->{ibx}->eidx_key;
+	my $uidvalidity = $sync->{ibx}->uidvalidity;
+	$self->{oidx}->eidx_meta("lc-v2:$key//$uidvalidity;$epoch");
+}
+
+sub _sync_inbox ($$$) {
+	my ($self, $opt, $ibx) = @_;
+	my $sync = {
+		need_checkpoint => \(my $bool = 0),
+		unindex_range => {}, # EPOCH => oid_old..oid_new
+		reindex => $opt->{reindex},
+		-opt => $opt,
+		eidx => $self,
+		ibx => $ibx,
+	};
+	my $key = $ibx->eidx_key;
+	my $u = $ibx->uidvalidity;
+	my $oidx = $self->{oidx};
+	my $v = $ibx->version;
+	if ($v == 2) {
+		my $epoch_max;
+		defined($ibx->git_dir_latest(\$epoch_max)) or return;
+		my $heads = $sync->{ranges} = [];
+		for my $i (0..$epoch_max) {
+			$heads->[$i] = $oidx->eidx_meta("lc-v2:$key//$u;$i");
+		}
+
+
+	} elsif ($v == 1) {
+		my $lc = $oidx->eidx_meta("lc-v1:$key//$u");
+		prepare_stack($sync, $lc ? "$lc..HEAD" : 'HEAD');
+	} else {
+		warn "E: $key unsupported inbox version (v$v)\n";
+		return;
+	}
+}
+
+sub eidx_sync { # main entry point
+	my ($self, $opt) = @_;
+	$self->idx_init($opt); # acquire lock via V2Writable::_idx_init
+	$self->{oidx}->rethread_prepare($opt);
+
+	_sync_inbox($self, $opt, $_) for (@{$self->{ibx_list}});
+}
+
+sub idx_init { # similar to V2Writable
+	my ($self, $opt) = @_;
+	return if $self->{idx_shards};
+
+	$self->git->cleanup;
+
+	my $ALL = $self->git->{git_dir}; # ALL.git
+	PublicInbox::Import::init_bare($ALL) unless -d $ALL;
+	my $info_dir = "$ALL/objects/info";
+	my $alt = "$info_dir/alternates";
+	my $mode = 0644;
+	my (%old, @old, %new, @new);
+	if (-e $alt) {
+		open(my $fh, '<', $alt) or die "open $alt: $!";
+		$mode = (stat($fh))[2] & 07777;
+		while (<$fh>) {
+			push @old, $_ if !$old{$_}++;
+		}
+	}
+	for my $ibx (@{$self->{ibx_list}}) {
+		my $line = $ibx->git->{git_dir} . "/objects\n";
+		next if $old{$line};
+		$new{$line} = 1;
+		push @new, $line;
+	}
+	push @old, @new;
+	PublicInbox::V2Writable::write_alternates($info_dir, $mode, \@old);
+	$self->parallel_init($self->{indexlevel});
+	$self->umask_prepare;
+	$self->with_umask(\&PublicInbox::V2Writable::_idx_init, $self, $opt);
+	$self->{oidx}->begin_lazy;
+	$self->{oidx}->eidx_prep;
+}
+
+no warnings 'once';
+*done = \&PublicInbox::V2Writable::done;
+*umask_prepare = \&PublicInbox::InboxWritable::umask_prepare;
+*with_umask = \&PublicInbox::InboxWritable::with_umask;
+*parallel_init = \&PublicInbox::V2Writable::parallel_init;
+*nproc_shards = \&PublicInbox::V2Writable::nproc_shards;
+
+1;
diff --git a/t/extsearch.t b/t/extsearch.t
index 7687f5f0..54927c50 100644
--- a/t/extsearch.t
+++ b/t/extsearch.t
@@ -7,5 +7,6 @@ use PublicInbox::TestCommon;
 require_git(2.6);
 require_mods(qw(DBD::SQLite Search::Xapian));
 use_ok 'PublicInbox::ExtSearch';
+use_ok 'PublicInbox::ExtSearchIdx';
 
 done_testing;


* [PATCH 20/52] searchidx: index eidx_key as a boolean term
  2020-10-27  7:54 [PATCH 00/52] detached external index: mostly Eric Wong
                   ` (18 preceding siblings ...)
  2020-10-27  7:54 ` [PATCH 19/52] extsearchidx: initial implementation Eric Wong
@ 2020-10-27  7:54 ` Eric Wong
  2020-10-27  7:54 ` [PATCH 21/52] searchidx: xref3 delete support Eric Wong
                   ` (33 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: Eric Wong @ 2020-10-27  7:54 UTC (permalink / raw)
  To: meta

Using `O' (owner) here (according to Xapian omega's
termprefixes.rst) since we could say the newsgroup or inbox is
the owner of the given message.
---
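A rough sketch of the term layout this gives a cross-posted message,
in plain Perl with no Xapian involved.  The `O' and `P' prefixes are
the ones used in this patch; the `eidx_terms' sub and the inbox names
are made up for illustration.

```perl
use strict;
use warnings;

# Build the boolean terms for one (inbox, article number, blob) triple.
# The 'O' and 'P' prefixes come from the patch; the sub name is made up.
sub eidx_terms {
	my ($eidx_key, $num, $oid) = @_;
	('O'.$eidx_key, 'P'."$eidx_key:$num:$oid");
}

my $oid = 'ce013625030ba8dba906f756967f9e9ca394464a';
# a cross-posted message: one document, one O/P pair per inbox
my %terms = map { $_ => 1 } (eidx_terms('inbox.comp.foo', 42, $oid),
				eidx_terms('inbox.comp.bar', 7, $oid));
print "$_\n" for sort keys %terms;
```

The `O' term allows filtering on the inbox alone, without needing to
know the article number or blob that the `P' term carries.
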
 lib/PublicInbox/SearchIdx.pm | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 061a8153..5171c610 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -369,6 +369,7 @@ sub add_xapian ($$$$) {
 	index_headers($self, $smsg);
 
 	if (defined(my $eidx_key = $smsg->{eidx_key})) {
+		$doc->add_boolean_term('O'.$eidx_key);
 		$doc->add_boolean_term('P'.
 				"$eidx_key:$smsg->{num}:$smsg->{blob}");
 	}
@@ -460,6 +461,7 @@ sub add_xref3 {
 	begin_txn_lazy($self);
 	my $doc = _get_doc($self, $docid, $oid) or return;
 	term_generator($self)->set_document($doc);
+	$doc->add_boolean_term('O'.$eidx_key);
 	$doc->add_boolean_term('P'."$eidx_key:$xnum:$oid");
 	index_list_id($self, $doc, $eml);
 	$self->{xdb}->replace_document($docid, $doc);


* [PATCH 21/52] searchidx: xref3 delete support
  2020-10-27  7:54 [PATCH 00/52] detached external index: mostly Eric Wong
                   ` (19 preceding siblings ...)
  2020-10-27  7:54 ` [PATCH 20/52] searchidx: index eidx_key as a boolean term Eric Wong
@ 2020-10-27  7:54 ` Eric Wong
  2020-10-27  7:54 ` [PATCH 22/52] searchidxshard: special init for eidx Eric Wong
                   ` (32 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: Eric Wong @ 2020-10-27  7:54 UTC (permalink / raw)
  To: meta

Not yet tested, but Perl compiles it!
---
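The decision made in remove_xref3 can be sketched without Xapian,
assuming a plain hash stands in for the document's term list.
`remove_xref3_sketch' and the sample inbox keys are hypothetical; only
the delete-vs-replace logic follows the patch.

```perl
use strict;
use warnings;

# Simulate a document's boolean terms as a hash set (term => 1).
# Returns 'delete' when no xref3 remains, 'replace' otherwise.
sub remove_xref3_sketch {
	my ($terms, $eidx_key, $oid) = @_;
	# strip the 'P' prefix to get "eidx_key:num:oid" entries
	my %x3 = map { substr($_, 1) => undef } grep(/\AP/, keys %$terms);
	for (grep(/\A\Q$eidx_key\E:[0-9]+:\Q$oid\E\z/, keys %x3)) {
		delete $x3{$_};
		delete $terms->{'P'.$_};
	}
	return 'delete' if !%x3; # last reference gone, drop the document
	# drop the owner term only if this inbox has no other xref3 left
	if (!grep(/\A\Q$eidx_key\E:/, keys %x3)) {
		delete $terms->{'O'.$eidx_key};
	}
	'replace';
}

my $oid = 'ce013625030ba8dba906f756967f9e9ca394464a';
my %doc = map { $_ => 1 } ("Pfoo:1:$oid", "Pbar:7:$oid", 'Ofoo', 'Obar');
my $first = remove_xref3_sketch(\%doc, 'foo', $oid);
my $second = remove_xref3_sketch(\%doc, 'bar', $oid);
print "$first $second\n"; # replace delete
```

Removing the message from `foo' keeps the document because `bar' still
references it; removing it from `bar' as well leaves no xref3, so the
document itself would be deleted.
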
 lib/PublicInbox/SearchIdx.pm | 50 ++++++++++++++++++++++--------------
 1 file changed, 31 insertions(+), 19 deletions(-)

diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 5171c610..0458d9c3 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -472,29 +472,41 @@ sub remove_xref3 {
 	begin_txn_lazy($self);
 	my $doc = _get_doc($self, $docid, $oid) or return;
 	my $xref3 = PublicInbox::Smsg::xref3(undef, $doc);
+	my %x3 = map { $_ => undef } @$xref3;
 	for (grep(/\A\Q$eidx_key\E:[0-9]+:\Q$oid\E\z/, @$xref3)) {
+		delete $x3{$_};
 		$doc->remove_term('P' . $_);
 	}
-	for my $l ($eml->header_raw('List-Id')) {
-		$l =~ /<([^>]+)>/ or next;
-		my $lid = lc $1;
-		$doc->remove_term('G' . $lid);
-
-		# nb: we don't remove the XL probabilistic terms
-		# since terms may overlap if cross-posted.
-		#
-		# IOW, a message which has both <foo.example.com>
-		# and <bar.example.com> would have overlapping
-		# "XLexample" and "XLcom" as terms and which we
-		# wouldn't know if they're safe to remove if we just
-		# unindex <foo.example.com> while preserving
-		# <bar.example.com>.
-		#
-		# In any case, this entire sub is will likely never
-		# be needed and users using the "l:" prefix are probably
-		# rarer.
+	if (scalar(keys(%x3)) == 0) {
+		$self->{xdb}->delete_document($docid);
+		if (my $del_fh = $self->{del_fh}) { # TODO
+			print $del_fh $docid, "\n" or die "E: print $!";
+		}
+	} else {
+		if (!grep(/\A\Q$eidx_key\E:/, keys %x3)) {
+			$doc->remove_term('O'.$eidx_key);
+		}
+		for my $l ($eml->header_raw('List-Id')) {
+			$l =~ /<([^>]+)>/ or next;
+			my $lid = lc $1;
+			$doc->remove_term('G' . $lid);
+
+			# nb: we don't remove the XL probabilistic terms
+			# since terms may overlap if cross-posted.
+			#
+			# IOW, a message which has both <foo.example.com>
+			# and <bar.example.com> would have overlapping
+			# "XLexample" and "XLcom" as terms, and we
+			# wouldn't know if they're safe to remove if we just
+			# unindex <foo.example.com> while preserving
+			# <bar.example.com>.
+			#
+			# In any case, this entire sub will likely never
+			# be needed and users using the "l:" prefix are probably
+			# rarer.
+		}
+		$self->{xdb}->replace_document($docid, $doc);
 	}
-	$self->{xdb}->replace_document($docid, $doc);
 }
 
 sub get_val ($$) {


* [PATCH 22/52] searchidxshard: special init for eidx
  2020-10-27  7:54 [PATCH 00/52] detached external index: mostly Eric Wong
                   ` (20 preceding siblings ...)
  2020-10-27  7:54 ` [PATCH 21/52] searchidx: xref3 delete support Eric Wong
@ 2020-10-27  7:54 ` Eric Wong
  2020-10-27  7:54 ` [PATCH 23/52] searchidx: put {ibx} into $sync state Eric Wong
                   ` (31 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: Eric Wong @ 2020-10-27  7:54 UTC (permalink / raw)
  To: meta

Having a special init path for external indices is probably
easier than further overloading SearchIdx->new initialization
to work without an Inbox object.
---
 lib/PublicInbox/SearchIdx.pm      | 13 +++++++++++++
 lib/PublicInbox/SearchIdxShard.pm |  7 ++++---
 2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 0458d9c3..029b2726 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -942,4 +942,17 @@ sub worker_done {
 	die "$$ $0 still in transaction\n" if $self->{txn};
 }
 
+sub eidx_shard_new {
+	my ($class, $eidx, $shard) = @_;
+	my $self = bless {
+		xpfx => $eidx->{xpfx},
+		indexlevel => $eidx->{indexlevel},
+		-skip_docdata => 1,
+		shard => $shard,
+		creat => 1,
+	}, $class;
+	$self->{-set_indexlevel_once} = 1 if $self->{indexlevel} eq 'medium';
+	$self;
+}
+
 1;
diff --git a/lib/PublicInbox/SearchIdxShard.pm b/lib/PublicInbox/SearchIdxShard.pm
index fa77a9f9..ac01340c 100644
--- a/lib/PublicInbox/SearchIdxShard.pm
+++ b/lib/PublicInbox/SearchIdxShard.pm
@@ -12,9 +12,10 @@ use IO::Handle (); # autoflush
 use PublicInbox::Eml;
 
 sub new {
-	my ($class, $v2w, $shard) = @_;
+	my ($class, $v2w, $shard) = @_; # v2w may be ExtSearchIdx
 	my $ibx = $v2w->{ibx};
-	my $self = $class->SUPER::new($ibx, 1, $shard);
+	my $self = $ibx ? $class->SUPER::new($ibx, 1, $shard)
+			: $class->eidx_shard_new($v2w, $shard);
 	# create the DB before forking:
 	$self->idx_acquire;
 	$self->set_metadata_once;
@@ -58,7 +59,7 @@ sub eml ($$) {
 # this reads all the writes to $self->{w} from the parent process
 sub shard_worker_loop ($$$$$) {
 	my ($self, $v2w, $r, $shard, $bnote) = @_;
-	$0 = "pi-v2-shard[$shard]";
+	$0 = "shard[$shard]";
 	$self->begin_txn_lazy;
 	while (my $line = readline($r)) {
 		$v2w->{current_info} = "[$shard] $line";


* [PATCH 23/52] searchidx: put {ibx} into $sync state
  2020-10-27  7:54 [PATCH 00/52] detached external index: mostly Eric Wong
                   ` (21 preceding siblings ...)
  2020-10-27  7:54 ` [PATCH 22/52] searchidxshard: special init for eidx Eric Wong
@ 2020-10-27  7:54 ` Eric Wong
  2020-10-27  7:54 ` [PATCH 24/52] searchidx: log2stack: simplify callers Eric Wong
                   ` (30 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: Eric Wong @ 2020-10-27  7:54 UTC (permalink / raw)
  To: meta

This will allow reuse by ExtSearchIdx.
---
 lib/PublicInbox/SearchIdx.pm | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 029b2726..d3c904c7 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -785,9 +785,9 @@ sub log2stack ($$$$) {
 	$stk->read_prepare;
 }
 
-sub prepare_stack ($$$) {
-	my ($self, $sync, $range) = @_;
-	my $git = $self->{ibx}->git;
+sub prepare_stack ($$) {
+	my ($sync, $range) = @_;
+	my $git = $sync->{ibx}->git;
 
 	if (index($range, '..') < 0) {
 		# don't show annoying git errors to users who run -index
@@ -796,7 +796,7 @@ sub prepare_stack ($$$) {
 		return PublicInbox::IdxStack->new->read_prepare if $?;
 	}
 	$sync->{D} = $sync->{reindex} ? {} : undef; # OID_BIN => NR
-	log2stack($sync, $git, $range, $self->{ibx});
+	log2stack($sync, $git, $range, $sync->{ibx});
 }
 
 # --is-ancestor requires git 1.8.0+
@@ -848,11 +848,11 @@ sub reindex_from ($$) {
 sub _index_sync {
 	my ($self, $opt) = @_;
 	my $tip = $opt->{ref} || 'HEAD';
-	my $git = $self->{ibx}->git;
+	my $ibx = $self->{ibx};
 	$self->{batch_bytes} = $opt->{batch_size} // $BATCH_BYTES;
-	$git->batch_prepare;
+	$ibx->git->batch_prepare;
 	my $pr = $opt->{-progress};
-	my $sync = { reindex => $opt->{reindex}, -opt => $opt };
+	my $sync = { reindex => $opt->{reindex}, -opt => $opt, ibx => $ibx };
 	my $xdb = $self->begin_txn_lazy;
 	$self->{oidx}->rethread_prepare($opt);
 	my $mm = _msgmap_init($self);
@@ -870,7 +870,7 @@ sub _index_sync {
 	my $lx = reindex_from($sync->{reindex}, $last_commit);
 	my $range = $lx eq '' ? $tip : "$lx..$tip";
 	$pr->("counting changes\n\t$range ... ") if $pr;
-	my $stk = prepare_stack($self, $sync, $range);
+	my $stk = prepare_stack($sync, $range);
 	$sync->{ntodo} = $stk ? $stk->num_records : 0;
 	$pr->("$sync->{ntodo}\n") if $pr; # continue previous line
 	process_stack($self, $sync, $stk);


* [PATCH 24/52] searchidx: log2stack: simplify callers
  2020-10-27  7:54 [PATCH 00/52] detached external index: mostly Eric Wong
                   ` (22 preceding siblings ...)
  2020-10-27  7:54 ` [PATCH 23/52] searchidx: put {ibx} into $sync state Eric Wong
@ 2020-10-27  7:54 ` Eric Wong
  2020-10-27  7:54 ` [PATCH 25/52] v2writable: more generic sync setup code Eric Wong
                   ` (29 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: Eric Wong @ 2020-10-27  7:54 UTC (permalink / raw)
  To: meta

Since we store {ibx} in $sync state, we no longer have to
pass it as an argument to log2stack.
---
 lib/PublicInbox/SearchIdx.pm  | 8 ++++----
 lib/PublicInbox/V2Writable.pm | 2 +-
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index d3c904c7..33c81ea8 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -736,11 +736,11 @@ sub process_stack {
 	v1_checkpoint($self, $sync, $stk);
 }
 
-sub log2stack ($$$$) {
-	my ($sync, $git, $range, $ibx) = @_;
+sub log2stack ($$$) {
+	my ($sync, $git, $range) = @_;
 	my $D = $sync->{D}; # OID_BIN => NR (if reindexing, undef otherwise)
 	my ($add, $del);
-	if ($ibx->version == 1) {
+	if ($sync->{ibx}->version == 1) {
 		my $path = $hex.'{2}/'.$hex.'{38}';
 		$add = qr!\A:000000 100644 \S+ ($OID) A\t$path$!;
 		$del = qr!\A:100644 000000 ($OID) \S+ D\t$path$!;
@@ -796,7 +796,7 @@ sub prepare_stack ($$) {
 		return PublicInbox::IdxStack->new->read_prepare if $?;
 	}
 	$sync->{D} = $sync->{reindex} ? {} : undef; # OID_BIN => NR
-	log2stack($sync, $git, $range, $sync->{ibx});
+	log2stack($sync, $git, $range);
 }
 
 # --is-ancestor requires git 1.8.0+
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index 867e4d2b..a403f22f 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -1051,7 +1051,7 @@ sub sync_prepare ($$$) {
 		# because we want NNTP article number gaps from unindexed
 		# messages to show up in mirrors, too.
 		$sync->{D} //= $sync->{reindex} ? {} : undef; # OID_BIN => NR
-		my $stk = log2stack($sync, $git, $range, $sync->{ibx});
+		my $stk = log2stack($sync, $git, $range);
 		my $nr = $stk ? $stk->num_records : 0;
 		$pr->("$nr\n") if $pr;
 		$sync->{stacks}->[$i] = $stk if $stk;


* [PATCH 25/52] v2writable: more generic sync setup code
  2020-10-27  7:54 [PATCH 00/52] detached external index: mostly Eric Wong
                   ` (23 preceding siblings ...)
  2020-10-27  7:54 ` [PATCH 24/52] searchidx: log2stack: simplify callers Eric Wong
@ 2020-10-27  7:54 ` Eric Wong
  2020-10-27  7:54 ` [PATCH 26/52] v2writable: allow OO method references Eric Wong
                   ` (28 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: Eric Wong @ 2020-10-27  7:54 UTC (permalink / raw)
  To: meta

We want to reuse this code for ExtSearchIdx, eventually.
---
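A small sketch of why artnum_max becomes a method, assuming two
stand-in classes (`WriterSketch' and `ExtSketch', both hypothetical).
How ExtSearchIdx really overrides it is not part of this patch; the
{docid_max} field below is only an assumption for illustration.

```perl
use strict;
use warnings;

{
	package WriterSketch; # stands in for V2Writable
	sub new {
		my ($class, %o) = @_;
		my $self = { %o };
		bless $self, $class;
	}
	# default: highest article number known to the msgmap
	sub artnum_max { $_[0]->{mm}->{highwater} }
	sub todo_max {
		my ($self, $regen_max) = @_;
		$regen_max + $self->artnum_max || 0;
	}
}
{
	package ExtSketch; # stands in for ExtSearchIdx, which has no {mm}
	our @ISA = ('WriterSketch');
	# hypothetical override: use a docid counter instead
	sub artnum_max { $_[0]->{docid_max} }
}

my $v2 = WriterSketch->new(mm => { highwater => 10 });
my $ext = ExtSketch->new(docid_max => 3);
print $v2->todo_max(5), "\n"; # 15
print $ext->todo_max(5), "\n"; # 8
```

The shared todo_max code never looks at {mm} directly, so a subclass
without a Msgmap only has to replace the one accessor.
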
 lib/PublicInbox/V2Writable.pm | 29 ++++++++++++++++-------------
 1 file changed, 16 insertions(+), 13 deletions(-)

diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index a403f22f..7c7be1bd 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -964,9 +964,9 @@ sub update_last_commit {
 }
 
 sub last_commits ($$) {
-	my ($self, $epoch_max) = @_;
+	my ($self, $sync) = @_;
 	my $heads = [];
-	for (my $i = $epoch_max; $i >= 0; $i--) {
+	for (my $i = $sync->{epoch_max}; $i >= 0; $i--) {
 		$heads->[$i] = last_epoch_commit($self, $i);
 	}
 	$heads;
@@ -1023,17 +1023,20 @@ $range
 	$range;
 }
 
-sub sync_prepare ($$$) {
-	my ($self, $sync, $epoch_max) = @_;
+# overridden by ExtSearchIdx
+sub artnum_max { $_[0]->{mm}->num_highwater }
+
+sub sync_prepare ($$) {
+	my ($self, $sync) = @_;
 	my $pr = $sync->{-opt}->{-progress};
 	my $regen_max = 0;
 	my $head = $sync->{ibx}->{ref_head} || 'HEAD';
 
 	# reindex stops at the current heads and we later rerun index_sync
 	# without {reindex}
-	my $reindex_heads = $self->last_commits($epoch_max) if $sync->{reindex};
+	my $reindex_heads = $self->last_commits($sync) if $sync->{reindex};
 
-	for (my $i = $epoch_max; $i >= 0; $i--) {
+	for (my $i = $sync->{epoch_max}; $i >= 0; $i--) {
 		my $git_dir = $sync->{ibx}->git_dir_n($i);
 		-d $git_dir or next; # missing epochs are fine
 		my $git = PublicInbox::Git->new($git_dir);
@@ -1081,7 +1084,7 @@ sub sync_prepare ($$$) {
 	$sync->{-regen_fmt} = "% ${pad}u/$regen_max\n";
 	$sync->{nr} = \(my $nr = 0);
 	return -1 if $sync->{reindex};
-	$regen_max + $self->{mm}->num_highwater() || 0;
+	$regen_max + $self->artnum_max || 0;
 }
 
 sub unindex_oid_aux ($$$) {
@@ -1152,11 +1155,10 @@ sub unindex_epoch ($$$$) {
 		qw(-c gc.reflogExpire=now gc --prune=all --quiet)]);
 }
 
-sub sync_ranges ($$$) {
-	my ($self, $sync, $epoch_max) = @_;
+sub sync_ranges ($$) {
+	my ($self, $sync) = @_;
 	my $reindex = $sync->{reindex};
-
-	return last_commits($self, $epoch_max) unless $reindex;
+	return $self->last_commits($sync) unless $reindex;
 	return [] if ref($reindex) ne 'HASH';
 
 	my $ranges = $reindex->{from}; # arrayref;
@@ -1286,9 +1288,10 @@ sub index_sync {
 		-opt => $opt,
 		v2w => $self,
 		ibx => $self->{ibx},
+		epoch_max => $epoch_max,
 	};
-	$sync->{ranges} = sync_ranges($self, $sync, $epoch_max);
-	if (sync_prepare($self, $sync, $epoch_max)) {
+	$sync->{ranges} = sync_ranges($self, $sync);
+	if (sync_prepare($self, $sync)) {
 		# tmp_clone seems to fail if inside a transaction, so
 		# we rollback here (because we opened {mm} for reading)
 		# Note: we do NOT rely on DBI transactions for atomicity;


* [PATCH 26/52] v2writable: allow OO method references
  2020-10-27  7:54 [PATCH 00/52] detached external index: mostly Eric Wong
                   ` (24 preceding siblings ...)
  2020-10-27  7:54 ` [PATCH 25/52] v2writable: more generic sync setup code Eric Wong
@ 2020-10-27  7:54 ` Eric Wong
  2020-10-27  7:54 ` [PATCH 27/52] v2writable: rename {v2w} field to {self} Eric Wong
                   ` (27 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: Eric Wong @ 2020-10-27  7:54 UTC (permalink / raw)
  To: meta

Using `->can(method)' allows subclasses to override `index_oid'
and `unindex_oid' methods.
---
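A minimal illustration of the difference, using made-up class names
(`BaseSketch' and `ChildSketch'): `\&name' is resolved in the package
where it is written, while `->can' follows the object's class.

```perl
use strict;
use warnings;

{
	package BaseSketch;
	sub new { bless {}, shift }
	sub index_oid { 'base' }
	# \&index_oid is resolved in this package, ignoring subclasses
	sub static_cb { \&index_oid }
	# ->can() follows the object's class, honoring overrides
	sub dynamic_cb { $_[0]->can('index_oid') }
}
{
	package ChildSketch;
	our @ISA = ('BaseSketch');
	sub index_oid { 'child' }
}

my $obj = ChildSketch->new;
print $obj->static_cb->(), "\n"; # base
print $obj->dynamic_cb->(), "\n"; # child
```

Both forms hand a plain code reference to the caller, so this change
stays compatible with callback APIs such as cat_async.
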
 lib/PublicInbox/V2Writable.pm | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index 7c7be1bd..c0306e82 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -1066,10 +1066,11 @@ sub sync_prepare ($$) {
 	if (my @leftovers = keys %{delete($sync->{D}) // {}}) {
 		warn('W: unindexing '.scalar(@leftovers)." leftovers\n");
 		my $arg = { v2w => $self };
+		my $unindex_oid = $self->can('unindex_oid');
 		for my $oid (@leftovers) {
 			$oid = unpack('H*', $oid);
 			$self->{current_info} = "leftover $oid";
-			$self->git->cat_async($oid, \&unindex_oid, $arg);
+			$self->git->cat_async($oid, $unindex_oid, $arg);
 		}
 		$self->git->cat_async_wait;
 	}
@@ -1139,9 +1140,10 @@ sub unindex_epoch ($$$$) {
 			--no-notes --no-color --no-abbrev --no-renames);
 	my $fh = $git->popen(@cmd, $unindex_range);
 	local $sync->{in_unindex} = 1;
+	my $unindex_oid = $self->can('unindex_oid');
 	while (<$fh>) {
 		/\A:\d{6} 100644 $OID ($OID) [AM]\tm$/o or next;
-		$self->git->cat_async($1, \&unindex_oid, $sync);
+		$self->git->cat_async($1, $unindex_oid, $sync);
 	}
 	close $fh or die "git log failed: \$?=$?";
 	$self->git->cat_async_wait;
@@ -1211,17 +1213,19 @@ sub index_epoch ($$$) {
 	defined(my $stk = $sync->{stacks}->[$i]) or return;
 	$sync->{stacks}->[$i] = undef;
 	my $all = $self->git;
+	my $index_oid = $self->can('index_oid');
+	my $unindex_oid = $self->can('unindex_oid');
 	while (my ($f, $at, $ct, $oid) = $stk->pop_rec) {
 		$self->{current_info} = "$i.git $oid";
+		my $req = { %$sync, autime => $at, cotime => $ct, oid => $oid };
 		if ($f eq 'm') {
-			my $arg = { %$sync, autime => $at, cotime => $ct };
 			if ($sync->{max_size}) {
-				$all->check_async($oid, \&check_size, $arg);
+				$all->check_async($oid, \&check_size, $req);
 			} else {
-				$all->cat_async($oid, \&index_oid, $arg);
+				$all->cat_async($oid, $index_oid, $req);
 			}
 		} elsif ($f eq 'd') {
-			$all->cat_async($oid, \&unindex_oid, $sync);
+			$all->cat_async($oid, $unindex_oid, $req);
 		}
 		if (${$sync->{need_checkpoint}}) {
 			reindex_checkpoint($self, $sync);
@@ -1308,7 +1312,7 @@ sub index_sync {
 		}
 	}
 	if ($sync->{max_size} = $opt->{max_size}) {
-		$sync->{index_oid} = \&index_oid;
+		$sync->{index_oid} = $self->can('index_oid');
 	}
 	# work forwards through history
 	index_epoch($self, $sync, $_) for (0..$epoch_max);


* [PATCH 27/52] v2writable: rename {v2w} field to {self}
  2020-10-27  7:54 [PATCH 00/52] detached external index: mostly Eric Wong
                   ` (25 preceding siblings ...)
  2020-10-27  7:54 ` [PATCH 26/52] v2writable: allow OO method references Eric Wong
@ 2020-10-27  7:54 ` Eric Wong
  2020-10-27  7:54 ` [PATCH 28/52] v2writable: make *last_commits and sync_prepare OO methods Eric Wong
                   ` (26 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: Eric Wong @ 2020-10-27  7:54 UTC (permalink / raw)
  To: meta

This will make it easier to reuse some indexing code for ExtSearchIdx.
---
 lib/PublicInbox/V2Writable.pm | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index c0306e82..3d3c25ec 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -882,7 +882,7 @@ sub index_oid { # cat_async callback
 	my $eml = PublicInbox::Eml->new($$bref);
 	my $mids = mids($eml);
 	my $chash = content_hash($eml);
-	my $self = $arg->{v2w};
+	my $self = $arg->{self};
 
 	if (scalar(@$mids) == 0) {
 		warn "E: $oid has no Message-ID, skipping\n";
@@ -1065,7 +1065,7 @@ sub sync_prepare ($$) {
 	# our code and blindly injects "d" file history into git repos
 	if (my @leftovers = keys %{delete($sync->{D}) // {}}) {
 		warn('W: unindexing '.scalar(@leftovers)." leftovers\n");
-		my $arg = { v2w => $self };
+		my $arg = { self => $self };
 		my $unindex_oid = $self->can('unindex_oid');
 		for my $oid (@leftovers) {
 			$oid = unpack('H*', $oid);
@@ -1099,7 +1099,7 @@ sub unindex_oid_aux ($$$) {
 
 sub unindex_oid ($$;$) { # git->cat_async callback
 	my ($bref, $oid, $type, $size, $sync) = @_;
-	my $self = $sync->{v2w};
+	my $self = $sync->{self};
 	my $unindexed = $sync->{in_unindex} ? $sync->{unindexed} : undef;
 	my $mm = $self->{mm};
 	my $mids = mids(PublicInbox::Eml->new($bref));
@@ -1172,7 +1172,7 @@ sub sync_ranges ($$) {
 
 sub index_xap_only { # git->cat_async callback
 	my ($bref, $oid, $type, $size, $smsg) = @_;
-	my $self = $smsg->{v2w};
+	my $self = $smsg->{self};
 	my $idx = idx_shard($self, $smsg->{num});
 	$smsg->{raw_bytes} = $size;
 	$idx->index_raw($bref, undef, $smsg);
@@ -1192,7 +1192,7 @@ sub index_xap_step ($$$;$) {
 	}
 	for (my $num = $beg; $num <= $end; $num += $step) {
 		my $smsg = $ibx->over->get_art($num) or next;
-		$smsg->{v2w} = $self;
+		$smsg->{self} = $self;
 		$ibx->git->cat_async($smsg->{blob}, \&index_xap_only, $smsg);
 		if ($self->{transact_bytes} >= $self->{batch_bytes}) {
 			${$sync->{nr}} = $num;
@@ -1245,7 +1245,7 @@ sub xapian_only {
 		$sync //= {
 			need_checkpoint => \(my $bool = 0),
 			-opt => $opt,
-			v2w => $self,
+			self => $self,
 			nr => \(my $nr = 0),
 			-regen_fmt => "%u/?\n",
 		};
@@ -1290,7 +1290,7 @@ sub index_sync {
 		unindex_range => {}, # EPOCH => oid_old..oid_new
 		reindex => $opt->{reindex},
 		-opt => $opt,
-		v2w => $self,
+		self => $self,
 		ibx => $self->{ibx},
 		epoch_max => $epoch_max,
 	};

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 28/52] v2writable: make *last_commits and sync_prepare OO methods
  2020-10-27  7:54 [PATCH 00/52] detached external index: mostly Eric Wong
                   ` (26 preceding siblings ...)
  2020-10-27  7:54 ` [PATCH 27/52] v2writable: rename {v2w} field to {self} Eric Wong
@ 2020-10-27  7:54 ` Eric Wong
  2020-10-27  7:54 ` [PATCH 29/52] v2writable: move size check init to sync_prepare Eric Wong
                   ` (25 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: Eric Wong @ 2020-10-27  7:54 UTC (permalink / raw)
  To: meta

This will allow ExtSearchIdx to override or reuse them more
easily.  Unfortunately, we lose prototype validation (Perl ignores
prototypes on method calls), but prototypes seem to be discouraged
anyway given the 'signatures' feature in Perl 5.20+.
---
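Illustration only, not part of the patch: a minimal sketch of why
the prototypes stop mattering once these subs are called as methods.
The package names (Base, Derived) and the "base-N"/"derived-N"
strings are made up for this example; only the sub names echo the
patch.

```perl
use strict;
use warnings;

package Base;
sub new { bless {}, shift }

# No prototype: meant to be called (and overridden) as a method.
sub last_commits {
	my ($self, $sync) = @_;
	[ map("base-$_", 0..$sync->{epoch_max}) ];
}

# Prototype kept: it is only checked for function-style calls compiled
# after this declaration; method calls never consult it.
sub sync_prepare ($$) {
	my ($self, $sync) = @_;
	$self->last_commits($sync); # dispatches on the class of $self
}

package Derived;
our @ISA = ('Base');

sub last_commits { # overrides Base::last_commits
	my ($self, $sync) = @_;
	[ map("derived-$_", 0..$sync->{epoch_max}) ];
}

package main;

my $sync = { epoch_max => 1 };
my $obj = Derived->new;

# function-style call into Base, yet the override is still picked up:
my $heads = Base::sync_prepare($obj, $sync);
print "@$heads\n";

# calling the sub by its full name skips method resolution entirely:
my $plain = Base::last_commits($obj, $sync);
print "@$plain\n";
```

Because the method is resolved at runtime, another class can supply
its own last_commits without any caller changing.
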
 lib/PublicInbox/V2Writable.pm | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index 3d3c25ec..ca60f2a1 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -952,8 +952,9 @@ sub index_oid { # cat_async callback
 }
 
 # only update last_commit for $i on reindex iff newer than current
+# $sync will be used by subclasses
 sub update_last_commit {
-	my ($self, $git, $i, $cmt) = @_;
+	my ($self, $sync, $git, $i, $cmt) = @_;
 	my $last = last_epoch_commit($self, $i);
 	if (defined $last && is_ancestor($git, $last, $cmt)) {
 		my @cmd = (qw(rev-list --count), "$last..$cmt");
@@ -963,7 +964,7 @@ sub update_last_commit {
 	last_epoch_commit($self, $i, $cmt);
 }
 
-sub last_commits ($$) {
+sub last_commits {
 	my ($self, $sync) = @_;
 	my $heads = [];
 	for (my $i = $sync->{epoch_max}; $i >= 0; $i--) {
@@ -1028,6 +1029,7 @@ sub artnum_max { $_[0]->{mm}->num_highwater }
 
 sub sync_prepare ($$) {
 	my ($self, $sync) = @_;
+	$sync->{ranges} = sync_ranges($self, $sync);
 	my $pr = $sync->{-opt}->{-progress};
 	my $regen_max = 0;
 	my $head = $sync->{ibx}->{ref_head} || 'HEAD';
@@ -1232,7 +1234,7 @@ sub index_epoch ($$$) {
 		}
 	}
 	$all->async_wait_all;
-	$self->update_last_commit($git, $i, $stk->{latest_cmt});
+	$self->update_last_commit($sync, $git, $i, $stk->{latest_cmt});
 }
 
 sub xapian_only {
@@ -1294,7 +1296,6 @@ sub index_sync {
 		ibx => $self->{ibx},
 		epoch_max => $epoch_max,
 	};
-	$sync->{ranges} = sync_ranges($self, $sync);
 	if (sync_prepare($self, $sync)) {
 		# tmp_clone seems to fail if inside a transaction, so
 		# we rollback here (because we opened {mm} for reading)

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 29/52] v2writable: move size check init to sync_prepare
  2020-10-27  7:54 [PATCH 00/52] detached external index: mostly Eric Wong
                   ` (27 preceding siblings ...)
  2020-10-27  7:54 ` [PATCH 28/52] v2writable: make *last_commits and sync_prepare OO methods Eric Wong
@ 2020-10-27  7:54 ` Eric Wong
  2020-10-27  7:54 ` [PATCH 30/52] extsearchidx: more compatibility with V2Writable callers Eric Wong
                   ` (24 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: Eric Wong @ 2020-10-27  7:54 UTC (permalink / raw)
  To: meta

This will let us use it from ExtSearchIdx.
---
 lib/PublicInbox/V2Writable.pm | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index ca60f2a1..d417b125 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -1038,6 +1038,9 @@ sub sync_prepare ($$) {
 	# without {reindex}
 	my $reindex_heads = $self->last_commits($sync) if $sync->{reindex};
 
+	if ($sync->{max_size} = $sync->{-opt}->{max_size}) {
+		$sync->{index_oid} = $self->can('index_oid');
+	}
 	for (my $i = $sync->{epoch_max}; $i >= 0; $i--) {
 		my $git_dir = $sync->{ibx}->git_dir_n($i);
 		-d $git_dir or next; # missing epochs are fine
@@ -1312,9 +1315,6 @@ sub index_sync {
 			$art_beg++ if defined($art_beg);
 		}
 	}
-	if ($sync->{max_size} = $opt->{max_size}) {
-		$sync->{index_oid} = $self->can('index_oid');
-	}
 	# work forwards through history
 	index_epoch($self, $sync, $_) for (0..$epoch_max);
 	$self->{oidx}->rethread_done($opt);

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 30/52] extsearchidx: more compatibility with V2Writable callers
  2020-10-27  7:54 [PATCH 00/52] detached external index: mostly Eric Wong
                   ` (28 preceding siblings ...)
  2020-10-27  7:54 ` [PATCH 29/52] v2writable: move size check init to sync_prepare Eric Wong
@ 2020-10-27  7:54 ` Eric Wong
  2020-10-27  7:54 ` [PATCH 31/52] v2writable: reduce scope of epoch-aware code Eric Wong
                   ` (23 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: Eric Wong @ 2020-10-27  7:54 UTC (permalink / raw)
  To: meta

We'll use `index_oid' and `unindex_oid' as our method names
so V2Writable methods may use `$self->can' to access them.
---
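Illustration only, not part of the patch: a rough sketch of how
`$self->can' plus a glob alias lets one class borrow a driver sub
from another while supplying its own callbacks.  Writer and ExtIdx
are hypothetical stand-ins, and index_all is an invented driver;
none of them exist in public-inbox.

```perl
use strict;
use warnings;

package Writer; # stands in for the class that owns the driver
sub new { bless { seen => [] }, shift }

sub index_oid { # callback, invoked as a plain function
	my ($oid, $req) = @_;
	push @{$req->{self}->{seen}}, "writer:$oid";
}

sub index_all { # driver: looks the callback up by name on $self
	my ($self, @oids) = @_;
	my $index_oid = $self->can('index_oid'); # coderef, resolved once
	$index_oid->($_, { self => $self }) for @oids;
	$self->{seen};
}

package ExtIdx; # unrelated class, no inheritance from Writer
sub new { bless { seen => [] }, shift }

sub index_oid { # same name, so the borrowed driver finds it
	my ($oid, $req) = @_;
	push @{$req->{self}->{seen}}, "ext:$oid";
}

{
	no warnings 'once';
	*index_all = \&Writer::index_all; # borrow the driver
}

package main;

my $seen = ExtIdx->new->index_all(qw(a b));
print "@$seen\n";
```

The shared method name is the whole contract here: the driver never
needs to know which class it is running on behalf of.
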
 lib/PublicInbox/ExtSearchIdx.pm | 64 +++++++++++++++++----------------
 1 file changed, 34 insertions(+), 30 deletions(-)

diff --git a/lib/PublicInbox/ExtSearchIdx.pm b/lib/PublicInbox/ExtSearchIdx.pm
index edf17974..609151e4 100644
--- a/lib/PublicInbox/ExtSearchIdx.pm
+++ b/lib/PublicInbox/ExtSearchIdx.pm
@@ -105,11 +105,11 @@ sub is_bad_blob ($$$$) {
 
 sub do_xpost ($$) {
 	my ($req, $smsg) = @_;
-	my $self = $req->{eidx};
+	my $self = $req->{self};
 	my $docid = $smsg->{num};
 	my $idx = $self->idx_shard($docid);
 	my $oid = $req->{oid};
-	my $xibx = $self->{sync}->{ibx};
+	my $xibx = $req->{ibx};
 	my $eml = $req->{eml};
 	if (my $new_smsg = $req->{new_smsg}) { # 'm' on cross-posted message
 		my $xnum = $req->{xnum};
@@ -119,17 +119,20 @@ sub do_xpost ($$) {
 	}
 }
 
+# called by V2Writable::sync_prepare
+sub artnum_max { $_[0]->{oidx}->get_counter('eidx_docid') }
+
 sub index_unseen ($) {
 	my ($req) = @_;
 	my $new_smsg = $req->{new_smsg} or die 'BUG: {new_smsg} unset';
-	$new_smsg->populate($req->{eml}, $req);
-	my $self = $req->{eidx};
+	my $eml = delete $req->{eml};
+	$new_smsg->populate($eml, $req);
+	my $self = $req->{self};
 	my $docid = $self->{oidx}->adj_counter('eidx_docid', '+');
 	$new_smsg->{num} = $docid;
 	my $idx = $self->idx_shard($docid);
-	my $eml = delete $req->{eml};
 	$self->{oidx}->add_overview($eml, $new_smsg);
-	$idx->index_raw(undef, $eml, $new_smsg, delete $new_smsg->{ibx});
+	$idx->index_raw(undef, $eml, $new_smsg, $req->{ibx});
 }
 
 sub do_finalize ($) {
@@ -145,7 +148,7 @@ sub do_finalize ($) {
 
 sub do_step ($) { # main iterator for adding messages to the index
 	my ($req) = @_;
-	my $self = $req->{eidx};
+	my $self = $req->{self};
 	while (1) {
 		if (my $next_arg = $req->{next_arg}) {
 			if (my $smsg = $self->{oidx}->next_by_mid(@$next_arg)) {
@@ -181,7 +184,7 @@ sub ck_existing { # git->cat_async callback
 # return the number if so
 sub cur_ibx_xnum ($$) {
 	my ($req, $bref) = @_;
-	my $ibx = $req->{sync}->{ibx} or die 'BUG: current {ibx} missing';
+	my $ibx = $req->{ibx} or die 'BUG: current {ibx} missing';
 
 	# XXX overkill?
 	git_blob_digest($bref)->hexdigest eq $req->{oid} or die
@@ -200,31 +203,34 @@ sub cur_ibx_xnum ($$) {
 	undef;
 }
 
-sub m_start { # git->cat_async callback for 'm'
+sub index_oid { # git->cat_async callback for 'm'
 	my ($bref, $oid, $type, $size, $req) = @_;
 	return if is_bad_blob($oid, $type, $size, $req->{oid});
 	my $new_smsg = $req->{new_smsg} = bless {
 		blob => $oid,
-		raw_bytes => $size,
 	}, 'PublicInbox::Smsg';
-	$new_smsg->{bytes} = $new_smsg->{raw_bytes} + crlf_adjust($$bref);
+	$new_smsg->{bytes} = $size + crlf_adjust($$bref);
 	defined($req->{xnum} = cur_ibx_xnum($req, $bref)) or return;
-	$new_smsg->{ibx} = $req->{sync}->{ibx};
 	do_step($req);
 }
 
-sub d_start { # git->cat_async callback for 'd'
+sub unindex_oid { # git->cat_async callback for 'd'
 	my ($bref, $oid, $type, $size, $req) = @_;
 	return if is_bad_blob($oid, $type, $size, $req->{oid});
 	return if defined(cur_ibx_xnum($req, $bref)); # was re-added
 	do_step($req);
 }
 
-sub eidx_last_epoch_commit ($$$) {
-	my ($self, $sync, $epoch) = @_;
-	my $key = $sync->{ibx}->eidx_key;
-	my $uidvalidity = $sync->{ibx}->uidvalidity;
-	$self->{oidx}->eidx_meta("lc-v2:$key//$uidvalidity;$epoch");
+# overrides V2Writable::last_commits, called by sync_ranges via sync_prepare
+sub last_commits {
+	my ($self, $sync) = @_;
+	my $heads = [];
+	my $ekey = $sync->{ibx}->eidx_key;
+	my $uv = $sync->{ibx}->uidvalidity;
+	for my $i (0..$sync->{epoch_max}) {
+		$heads->[$i] = $self->{oidx}->eidx_meta("lc-v2:$ekey//$uv;$i");
+	}
+	$heads;
 }
 
 sub _sync_inbox ($$$) {
@@ -234,27 +240,23 @@ sub _sync_inbox ($$$) {
 		unindex_range => {}, # EPOCH => oid_old..oid_new
 		reindex => $opt->{reindex},
 		-opt => $opt,
-		eidx => $self,
+		self => $self,
 		ibx => $ibx,
 	};
-	my $key = $ibx->eidx_key;
-	my $u = $ibx->uidvalidity;
-	my $oidx = $self->{oidx};
 	my $v = $ibx->version;
+	my $ekey = $ibx->eidx_key;
 	if ($v == 2) {
 		my $epoch_max;
 		defined($ibx->git_dir_latest(\$epoch_max)) or return;
-		my $heads = $sync->{ranges} = [];
-		for my $i (0..$epoch_max) {
-			$heads->[$i] = $oidx->eidx_meta("lc-v2:$key//$u;$i");
-		}
-
-
+		$sync->{epoch_max} = $epoch_max;
+		sync_prepare($self, $sync) or return;
+		index_epoch($self, $sync, $_) for (0..$epoch_max);
 	} elsif ($v == 1) {
-		my $lc = $oidx->eidx_meta("lc-v1:$key//$u");
+		my $uv = $ibx->uidvalidity;
+		my $lc = $self->{oidx}->eidx_meta("lc-v1:$ekey//$uv");
 		prepare_stack($sync, $lc ? "$lc..HEAD" : 'HEAD');
 	} else {
-		warn "E: $key unsupported inbox version (v$v)\n";
+		warn "E: $ekey unsupported inbox version (v$v)\n";
 		return;
 	}
 }
@@ -307,5 +309,7 @@ no warnings 'once';
 *with_umask = \&PublicInbox::InboxWritable::with_umask;
 *parallel_init = \&PublicInbox::V2Writable::parallel_init;
 *nproc_shards = \&PublicInbox::V2Writable::nproc_shards;
+*sync_prepare = \&PublicInbox::V2Writable::sync_prepare;
+*index_epoch = \&PublicInbox::V2Writable::index_epoch;
 
 1;

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 31/52] v2writable: reduce scope of epoch-aware code
  2020-10-27  7:54 [PATCH 00/52] detached external index: mostly Eric Wong
                   ` (29 preceding siblings ...)
  2020-10-27  7:54 ` [PATCH 30/52] extsearchidx: more compatibility with V2Writable callers Eric Wong
@ 2020-10-27  7:54 ` Eric Wong
  2020-10-27  7:54 ` [PATCH 32/52] extsearchidx: remove {unindex_range} field Eric Wong
                   ` (22 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: Eric Wong @ 2020-10-27  7:54 UTC (permalink / raw)
  To: meta

Reduce the scope of the epoch-aware code and label it clearly.
We may try to reuse some of this for v1 indexing code paths.
---
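Illustration only, not part of the patch: a simplified sketch of the
"unit" idea, assuming a unit is just a hash carrying everything one
pass needs.  prepare_todo and this two-argument index_todo are
hypothetical, stripped-down stand-ins; stacks are plain arrays and
{git} is a string rather than a PublicInbox::Git object.

```perl
use strict;
use warnings;

# Walk epochs newest-first and unshift, so {todo} ends up oldest-first.
sub prepare_todo {
	my ($sync, $stacks) = @_;
	for (my $i = $sync->{epoch_max}; $i >= 0; $i--) {
		exists $stacks->{$i} or next; # missing epochs are fine
		my $unit = {
			epoch => $i,
			git => "$i.git",
			stack => $stacks->{$i},
		};
		unshift @{$sync->{todo}}, $unit;
	}
}

# Everything epoch-specific is read from $unit, not from $sync.
sub index_todo {
	my ($unit, $out) = @_;
	if (defined(my $range = delete $unit->{unindex_range})) { # rare
		push @$out, "unindex $unit->{git} $range";
	}
	my $stk = delete($unit->{stack}) or return;
	push @$out, "index $unit->{git} $_" for @$stk;
}

my $sync = { epoch_max => 2 };
prepare_todo($sync, { 0 => [ 'oid0' ], 2 => [ qw(oid2a oid2b) ] });
$sync->{todo}->[1]->{unindex_range} = 'base..old'; # pretend rewrite

my @log;
index_todo($_, \@log) for @{$sync->{todo}};
print "$_\n" for @log;
```

Keeping the per-epoch state inside the unit means the loop body does
not care whether a unit came from a v2 epoch or from anything else
with a git handle and a stack.
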
 lib/PublicInbox/V2Writable.pm | 72 +++++++++++++++++------------------
 1 file changed, 35 insertions(+), 37 deletions(-)

diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index d417b125..c8b01a3d 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -954,14 +954,14 @@ sub index_oid { # cat_async callback
 # only update last_commit for $i on reindex iff newer than current
 # $sync will be used by subclasses
 sub update_last_commit {
-	my ($self, $sync, $git, $i, $cmt) = @_;
-	my $last = last_epoch_commit($self, $i);
-	if (defined $last && is_ancestor($git, $last, $cmt)) {
+	my ($self, $sync, $unit, $cmt) = @_;
+	my $last = last_epoch_commit($self, $unit->{epoch});
+	if (defined $last && is_ancestor($unit->{git}, $last, $cmt)) {
 		my @cmd = (qw(rev-list --count), "$last..$cmt");
-		chomp(my $n = $git->qx(@cmd));
+		chomp(my $n = $unit->{git}->qx(@cmd));
 		return if $n ne '' && $n == 0;
 	}
-	last_epoch_commit($self, $i, $cmt);
+	last_epoch_commit($self, $unit->{epoch}, $cmt);
 }
 
 sub last_commits {
@@ -974,10 +974,11 @@ sub last_commits {
 }
 
 # returns a revision range for git-log(1)
-sub log_range ($$$$) {
-	my ($sync, $git, $i, $tip) = @_;
+sub log_range ($$$) {
+	my ($sync, $unit, $tip) = @_;
 	my $opt = $sync->{-opt};
 	my $pr = $opt->{-progress} if (($opt->{verbose} || 0) > 1);
+	my $i = $unit->{epoch};
 	my $cur = $sync->{ranges}->[$i] or do {
 		$pr->("$i.git indexing all of $tip\n") if $pr;
 		return $tip; # all of it
@@ -991,9 +992,9 @@ sub log_range ($$$$) {
 
 	my $range = "$cur..$tip";
 	$pr->("$i.git checking contiguity... ") if $pr;
-	if (is_ancestor($git, $cur, $tip)) { # common case
+	if (is_ancestor($unit->{git}, $cur, $tip)) { # common case
 		$pr->("OK\n") if $pr;
-		my $n = $git->qx(qw(rev-list --count), $range);
+		my $n = $unit->{git}->qx(qw(rev-list --count), $range);
 		chomp($n);
 		if ($n == 0) {
 			$sync->{ranges}->[$i] = undef;
@@ -1005,9 +1006,9 @@ sub log_range ($$$$) {
 		$pr->("FAIL\n") if $pr;
 		warn <<"";
 discontiguous range: $range
-Rewritten history? (in $git->{git_dir})
+Rewritten history? (in $unit->{git}->{git_dir})
 
-		chomp(my $base = $git->qx('merge-base', $tip, $cur));
+		chomp(my $base = $unit->{git}->qx('merge-base', $tip, $cur));
 		if ($base) {
 			$range = "$base..$tip";
 			warn "found merge-base: $base\n"
@@ -1016,10 +1017,10 @@ Rewritten history? (in $git->{git_dir})
 			warn "discarding history at $cur\n";
 		}
 		warn <<"";
-reindexing $git->{git_dir} starting at
+reindexing $unit->{git}->{git_dir} starting at
 $range
 
-		$sync->{unindex_range}->{$i} = "$base..$cur";
+		$unit->{unindex_range} = "$base..$cur";
 	}
 	$range;
 }
@@ -1045,13 +1046,14 @@ sub sync_prepare ($$) {
 		my $git_dir = $sync->{ibx}->git_dir_n($i);
 		-d $git_dir or next; # missing epochs are fine
 		my $git = PublicInbox::Git->new($git_dir);
+		my $unit = { git => $git, epoch => $i };
 		if ($reindex_heads) {
 			$head = $reindex_heads->[$i] or next;
 		}
 		chomp(my $tip = $git->qx(qw(rev-parse -q --verify), $head));
-
 		next if $?; # new repo
-		my $range = log_range($sync, $git, $i, $tip) or next;
+
+		my $range = log_range($sync, $unit, $tip) or next;
 		# can't use 'rev-list --count' if we use --diff-filter
 		$pr->("$i.git counting $range ... ") if $pr;
 		# Don't bump num_highwater on --reindex by using {D}.
@@ -1062,7 +1064,8 @@ sub sync_prepare ($$) {
 		my $stk = log2stack($sync, $git, $range);
 		my $nr = $stk ? $stk->num_records : 0;
 		$pr->("$nr\n") if $pr;
-		$sync->{stacks}->[$i] = $stk if $stk;
+		$unit->{stack} = $stk; # may be undef
+		unshift @{$sync->{todo}}, $unit;
 		$regen_max += $nr;
 	}
 
@@ -1136,14 +1139,14 @@ sub git { $_[0]->{ibx}->git }
 
 # this is rare, it only happens when we get discontiguous history in
 # a mirror because the source used -purge or -edit
-sub unindex_epoch ($$$$) {
-	my ($self, $sync, $git, $unindex_range) = @_;
+sub unindex_todo ($$$) {
+	my ($self, $sync, $unit) = @_;
+	my $unindex_range = delete($unit->{unindex_range}) // return;
 	my $unindexed = $sync->{unindexed} //= {}; # $mid0 => $num
 	my $before = scalar keys %$unindexed;
 	# order does not matter, here:
-	my @cmd = qw(log --raw -r
-			--no-notes --no-color --no-abbrev --no-renames);
-	my $fh = $git->popen(@cmd, $unindex_range);
+	my $fh = $unit->{git}->popen(qw(log --raw -r --no-notes --no-color
+				--no-abbrev --no-renames), $unindex_range);
 	local $sync->{in_unindex} = 1;
 	my $unindex_oid = $self->can('unindex_oid');
 	while (<$fh>) {
@@ -1158,7 +1161,8 @@ sub unindex_epoch ($$$$) {
 	return if $before == $after;
 
 	# ensure any blob can not longer be accessed via dumb HTTP
-	PublicInbox::Import::run_die(['git', "--git-dir=$git->{git_dir}",
+	PublicInbox::Import::run_die(['git',
+		"--git-dir=$unit->{git}->{git_dir}",
 		qw(-c gc.reflogExpire=now gc --prune=all --quiet)]);
 }
 
@@ -1206,22 +1210,17 @@ sub index_xap_step ($$$;$) {
 	}
 }
 
-sub index_epoch ($$$) {
-	my ($self, $sync, $i) = @_;
-
-	my $git_dir = $sync->{ibx}->git_dir_n($i);
-	-d $git_dir or return; # missing epochs are fine
-	my $git = PublicInbox::Git->new($git_dir);
-	if (my $unindex_range = delete $sync->{unindex_range}->{$i}) { # rare
-		unindex_epoch($self, $sync, $git, $unindex_range);
-	}
-	defined(my $stk = $sync->{stacks}->[$i]) or return;
-	$sync->{stacks}->[$i] = undef;
+sub index_todo ($$$) {
+	my ($self, $sync, $unit) = @_;
+	unindex_todo($self, $sync, $unit);
+	my $stk = delete($unit->{stack}) or return;
 	my $all = $self->git;
 	my $index_oid = $self->can('index_oid');
 	my $unindex_oid = $self->can('unindex_oid');
+	my ($pfx) = ($unit->{git}->{git_dir} =~ m!/([^/]+)\z!g);
+	$pfx //= $unit->{git}->{git_dir};
 	while (my ($f, $at, $ct, $oid) = $stk->pop_rec) {
-		$self->{current_info} = "$i.git $oid";
+		$self->{current_info} = "$pfx $oid";
 		my $req = { %$sync, autime => $at, cotime => $ct, oid => $oid };
 		if ($f eq 'm') {
 			if ($sync->{max_size}) {
@@ -1237,7 +1236,7 @@ sub index_epoch ($$$) {
 		}
 	}
 	$all->async_wait_all;
-	$self->update_last_commit($sync, $git, $i, $stk->{latest_cmt});
+	$self->update_last_commit($sync, $unit, $stk->{latest_cmt});
 }
 
 sub xapian_only {
@@ -1292,7 +1291,6 @@ sub index_sync {
 	$self->{oidx}->rethread_prepare($opt);
 	my $sync = {
 		need_checkpoint => \(my $bool = 0),
-		unindex_range => {}, # EPOCH => oid_old..oid_new
 		reindex => $opt->{reindex},
 		-opt => $opt,
 		self => $self,
@@ -1316,7 +1314,7 @@ sub index_sync {
 		}
 	}
 	# work forwards through history
-	index_epoch($self, $sync, $_) for (0..$epoch_max);
+	index_todo($self, $sync, $_) for @{$sync->{todo}};
 	$self->{oidx}->rethread_done($opt);
 	$self->done;
 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 32/52] extsearchidx: remove {unindex_range} field
  2020-10-27  7:54 [PATCH 00/52] detached external index: mostly Eric Wong
                   ` (30 preceding siblings ...)
  2020-10-27  7:54 ` [PATCH 31/52] v2writable: reduce scope of epoch-aware code Eric Wong
@ 2020-10-27  7:54 ` Eric Wong
  2020-10-27  7:54 ` [PATCH 33/52] v2writable: pass oid to unindex_oid Eric Wong
                   ` (21 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: Eric Wong @ 2020-10-27  7:54 UTC (permalink / raw)
  To: meta

The unindex range is now stored in per-epoch "units".
---
 lib/PublicInbox/ExtSearchIdx.pm | 1 -
 1 file changed, 1 deletion(-)

diff --git a/lib/PublicInbox/ExtSearchIdx.pm b/lib/PublicInbox/ExtSearchIdx.pm
index 609151e4..5e72d65d 100644
--- a/lib/PublicInbox/ExtSearchIdx.pm
+++ b/lib/PublicInbox/ExtSearchIdx.pm
@@ -237,7 +237,6 @@ sub _sync_inbox ($$$) {
 	my ($self, $opt, $ibx) = @_;
 	my $sync = {
 		need_checkpoint => \(my $bool = 0),
-		unindex_range => {}, # EPOCH => oid_old..oid_new
 		reindex => $opt->{reindex},
 		-opt => $opt,
 		self => $self,

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 33/52] v2writable: pass oid to unindex_oid
  2020-10-27  7:54 [PATCH 00/52] detached external index: mostly Eric Wong
                   ` (31 preceding siblings ...)
  2020-10-27  7:54 ` [PATCH 32/52] extsearchidx: remove {unindex_range} field Eric Wong
@ 2020-10-27  7:54 ` Eric Wong
  2020-10-27  7:54 ` [PATCH 34/52] extsearchidx: sync unit updates Eric Wong
                   ` (20 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: Eric Wong @ 2020-10-27  7:54 UTC (permalink / raw)
  To: meta

We'll be validating against this in the future to stop
bugs from creeping in.
---
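Illustration only, not part of the patch: a small sketch of the
`{ %$sync, oid => $oid }' idiom, assuming the callback will later
compare the OID it receives against the one recorded in the request.
The queue and the check inside unindex_oid are invented for this
example; the real validation is left for a future change.

```perl
use strict;
use warnings;

my $sync = { self => 'indexer', in_unindex => 1 };

# Each request is a shallow copy of $sync plus its own {oid}, so the
# shared $sync hash is never modified.
my @queue;
for my $oid (qw(aaaa bbbb)) {
	my $req = { %$sync, oid => $oid };
	push @queue, [ $oid, $req ];
}

sub unindex_oid { # hypothetical callback with the validation added
	my ($oid, $req) = @_;
	$oid eq $req->{oid} or die "BUG: got $oid, expected $req->{oid}";
	"$req->{self} unindexed $oid";
}

my @done = map { unindex_oid(@$_) } @queue;
print "$_\n" for @done;
```

The copy is shallow: nested references inside $sync are still shared
between requests, only the top-level keys are per-request.
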
 lib/PublicInbox/V2Writable.pm | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index c8b01a3d..efda7907 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -1073,12 +1073,12 @@ sub sync_prepare ($$) {
 	# our code and blindly injects "d" file history into git repos
 	if (my @leftovers = keys %{delete($sync->{D}) // {}}) {
 		warn('W: unindexing '.scalar(@leftovers)." leftovers\n");
-		my $arg = { self => $self };
 		my $unindex_oid = $self->can('unindex_oid');
 		for my $oid (@leftovers) {
 			$oid = unpack('H*', $oid);
 			$self->{current_info} = "leftover $oid";
-			$self->git->cat_async($oid, $unindex_oid, $arg);
+			my $req = { %$sync, oid => $oid };
+			$self->git->cat_async($oid, $unindex_oid, $req);
 		}
 		$self->git->cat_async_wait;
 	}
@@ -1151,7 +1151,7 @@ sub unindex_todo ($$$) {
 	my $unindex_oid = $self->can('unindex_oid');
 	while (<$fh>) {
 		/\A:\d{6} 100644 $OID ($OID) [AM]\tm$/o or next;
-		$self->git->cat_async($1, $unindex_oid, $sync);
+		$self->git->cat_async($1, $unindex_oid, { %$sync, oid => $1 });
 	}
 	close $fh or die "git log failed: \$?=$?";
 	$self->git->cat_async_wait;

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 34/52] extsearchidx: sync unit updates
  2020-10-27  7:54 [PATCH 00/52] detached external index: mostly Eric Wong
                   ` (32 preceding siblings ...)
  2020-10-27  7:54 ` [PATCH 33/52] v2writable: pass oid to unindex_oid Eric Wong
@ 2020-10-27  7:54 ` Eric Wong
  2020-10-27  7:54 ` [PATCH 35/52] searchidx: export prepare_stack Eric Wong
                   ` (19 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: Eric Wong @ 2020-10-27  7:54 UTC (permalink / raw)
  To: meta

Now that the V2Writable code is more generic, we can
sync with it to use `units' which represent either
a v2 epoch or an entire v1 inbox.
---
 lib/PublicInbox/ExtSearchIdx.pm | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/lib/PublicInbox/ExtSearchIdx.pm b/lib/PublicInbox/ExtSearchIdx.pm
index 5e72d65d..32c57188 100644
--- a/lib/PublicInbox/ExtSearchIdx.pm
+++ b/lib/PublicInbox/ExtSearchIdx.pm
@@ -27,7 +27,7 @@ use PublicInbox::Eml;
 use File::Spec;
 
 sub new {
-	my (undef, $dir, $opt, $shard) = @_;
+	my (undef, $dir, $opt) = @_;
 	my $l = $opt->{indexlevel} // 'full';
 	$l !~ $PublicInbox::SearchIdx::INDEXLEVELS and
 		die "invalid indexlevel=$l\n";
@@ -249,15 +249,16 @@ sub _sync_inbox ($$$) {
 		defined($ibx->git_dir_latest(\$epoch_max)) or return;
 		$sync->{epoch_max} = $epoch_max;
 		sync_prepare($self, $sync) or return;
-		index_epoch($self, $sync, $_) for (0..$epoch_max);
 	} elsif ($v == 1) {
 		my $uv = $ibx->uidvalidity;
 		my $lc = $self->{oidx}->eidx_meta("lc-v1:$ekey//$uv");
-		prepare_stack($sync, $lc ? "$lc..HEAD" : 'HEAD');
+		my $stk = prepare_stack($sync, $lc ? "$lc..HEAD" : 'HEAD');
+		my $unit = { stack => $stk, git => $ibx->git };
 	} else {
 		warn "E: $ekey unsupported inbox version (v$v)\n";
 		return;
 	}
+	index_todo($self, $sync, $_) for @{$sync->{todo}};
 }
 
 sub eidx_sync { # main entry point
@@ -266,6 +267,8 @@ sub eidx_sync { # main entry point
 	$self->{oidx}->rethread_prepare($opt);
 
 	_sync_inbox($self, $opt, $_) for (@{$self->{ibx_list}});
+	$self->{oidx}->rethread_done($opt);
+	PublicInbox::V2Writable::done($self);
 }
 
 sub idx_init { # similar to V2Writable
@@ -309,6 +312,8 @@ no warnings 'once';
 *parallel_init = \&PublicInbox::V2Writable::parallel_init;
 *nproc_shards = \&PublicInbox::V2Writable::nproc_shards;
 *sync_prepare = \&PublicInbox::V2Writable::sync_prepare;
-*index_epoch = \&PublicInbox::V2Writable::index_epoch;
+*index_todo = \&PublicInbox::V2Writable::index_todo;
+*count_shards = \&PublicInbox::V2Writable::count_shards;
+*atfork_child = \&PublicInbox::V2Writable::atfork_child;
 
 1;

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 35/52] searchidx: export prepare_stack
  2020-10-27  7:54 [PATCH 00/52] detached external index: mostly Eric Wong
                   ` (33 preceding siblings ...)
  2020-10-27  7:54 ` [PATCH 34/52] extsearchidx: sync unit updates Eric Wong
@ 2020-10-27  7:54 ` Eric Wong
  2020-10-27  7:54 ` [PATCH 36/52] extsearchidx: sync updates Eric Wong
                   ` (18 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: Eric Wong @ 2020-10-27  7:54 UTC (permalink / raw)
  To: meta

We'll be needing it in ExtSearchIdx for the next commit.
---
 lib/PublicInbox/SearchIdx.pm | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 33c81ea8..0c0e844a 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -22,7 +22,7 @@ use PublicInbox::OverIdx;
 use PublicInbox::Spawn qw(spawn nodatacow_dir);
 use PublicInbox::Git qw(git_unquote);
 use PublicInbox::MsgTime qw(msg_timestamp msg_datestamp);
-our @EXPORT_OK = qw(crlf_adjust log2stack is_ancestor check_size);
+our @EXPORT_OK = qw(crlf_adjust log2stack is_ancestor check_size prepare_stack);
 my $X = \%PublicInbox::Search::X;
 my ($DB_CREATE_OR_OPEN, $DB_OPEN);
 our $DB_NO_SYNC = 0;

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 36/52] extsearchidx: sync updates
  2020-10-27  7:54 [PATCH 00/52] detached external index: mostly Eric Wong
                   ` (34 preceding siblings ...)
  2020-10-27  7:54 ` [PATCH 35/52] searchidx: export prepare_stack Eric Wong
@ 2020-10-27  7:54 ` Eric Wong
  2020-10-27  7:54 ` [PATCH 37/52] searchidx: reduce inbox-dependency, wrap ->with_umask Eric Wong
                   ` (17 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: Eric Wong @ 2020-10-27  7:54 UTC (permalink / raw)
  To: meta

A couple more things to prepare us to run syncs on
both v1 and v2 inboxes.
---
 lib/PublicInbox/ExtSearchIdx.pm | 26 ++++++++++++++++++++++----
 1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/lib/PublicInbox/ExtSearchIdx.pm b/lib/PublicInbox/ExtSearchIdx.pm
index 32c57188..790ee921 100644
--- a/lib/PublicInbox/ExtSearchIdx.pm
+++ b/lib/PublicInbox/ExtSearchIdx.pm
@@ -19,10 +19,12 @@ use v5.10.1;
 use parent qw(PublicInbox::ExtSearch PublicInbox::Lock);
 use Carp qw(croak carp);
 use PublicInbox::Search;
-use PublicInbox::SearchIdx qw(crlf_adjust);
+use PublicInbox::SearchIdx qw(crlf_adjust prepare_stack);
 use PublicInbox::OverIdx;
+use PublicInbox::MID qw(mids);
 use PublicInbox::V2Writable;
 use PublicInbox::InboxWritable;
+use PublicInbox::ContentHash qw(content_hash);
 use PublicInbox::Eml;
 use File::Spec;
 
@@ -54,6 +56,7 @@ sub new {
 
 sub attach_inbox {
 	my ($self, $ibx) = @_;
+	$ibx = PublicInbox::InboxWritable->new($ibx);
 	my $key = $ibx->eidx_key;
 	if (!$ibx->over || !$ibx->mm) {
 		warn "W: skipping $key (unindexed)\n";
@@ -120,7 +123,7 @@ sub do_xpost ($$) {
 }
 
 # called by V2Writable::sync_prepare
-sub artnum_max { $_[0]->{oidx}->get_counter('eidx_docid') }
+sub artnum_max { $_[0]->{oidx}->eidx_max }
 
 sub index_unseen ($) {
 	my ($req) = @_;
@@ -174,7 +177,7 @@ sub ck_existing { # git->cat_async callback
 	my $smsg = $req->{cur_smsg} or die 'BUG: {cur_smsg} missing';
 	return if is_bad_blob($oid, $type, $size, $smsg->{blob});
 	my $cur = PublicInbox::Eml->new($bref);
-	if (content_digest($cur) eq $req->{chash}) {
+	if (content_hash($cur) eq $req->{chash}) {
 		push @{$req->{indexed}}, $smsg; # for do_xpost
 	} # else { index_unseen later }
 	do_step($req);
@@ -248,12 +251,13 @@ sub _sync_inbox ($$$) {
 		my $epoch_max;
 		defined($ibx->git_dir_latest(\$epoch_max)) or return;
 		$sync->{epoch_max} = $epoch_max;
-		sync_prepare($self, $sync) or return;
+		sync_prepare($self, $sync) or return; # fills $sync->{todo}
 	} elsif ($v == 1) {
 		my $uv = $ibx->uidvalidity;
 		my $lc = $self->{oidx}->eidx_meta("lc-v1:$ekey//$uv");
 		my $stk = prepare_stack($sync, $lc ? "$lc..HEAD" : 'HEAD');
 		my $unit = { stack => $stk, git => $ibx->git };
+		push @{$sync->{todo}}, $unit;
 	} else {
 		warn "E: $ekey unsupported inbox version (v$v)\n";
 		return;
@@ -267,10 +271,23 @@ sub eidx_sync { # main entry point
 	$self->{oidx}->rethread_prepare($opt);
 
 	_sync_inbox($self, $opt, $_) for (@{$self->{ibx_list}});
+
 	$self->{oidx}->rethread_done($opt);
+
 	PublicInbox::V2Writable::done($self);
 }
 
+sub update_last_commit {
+	my ($self, $sync, $unit, $latest_cmt) = @_;
+
+	my $ALL = $self->git;
+	# while (scalar(@{$ALL->{inflight_c}}) || scalar(@{$ALL->{inflight}})) {
+		# $ALL->check_async_wait;
+		# $ALL->cat_async_wait;
+	# }
+	# TODO
+}
+
 sub idx_init { # similar to V2Writable
 	my ($self, $opt) = @_;
 	return if $self->{idx_shards};
@@ -315,5 +332,6 @@ no warnings 'once';
 *index_todo = \&PublicInbox::V2Writable::index_todo;
 *count_shards = \&PublicInbox::V2Writable::count_shards;
 *atfork_child = \&PublicInbox::V2Writable::atfork_child;
+*idx_shard = \&PublicInbox::V2Writable::idx_shard;
 
 1;

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 37/52] searchidx: reduce inbox-dependency, wrap ->with_umask
  2020-10-27  7:54 [PATCH 00/52] detached external index: mostly Eric Wong
                   ` (35 preceding siblings ...)
  2020-10-27  7:54 ` [PATCH 36/52] extsearchidx: sync updates Eric Wong
@ 2020-10-27  7:54 ` Eric Wong
  2020-10-27  7:54 ` [PATCH 38/52] searchidx: favor $sync->{ibx} (over $self->{ibx}) Eric Wong
                   ` (16 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: Eric Wong @ 2020-10-27  7:54 UTC (permalink / raw)
  To: meta

This will let us work consistently with both existing inboxes
and external indices.
---
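Illustration only, not part of the patch: a minimal sketch of the
delegating wrapper, assuming the owner object (an inbox or an
external index) provides with_umask.  Owner and Shard are
hypothetical classes, and this with_umask is a simplified stand-in
for the real one in InboxWritable.

```perl
use strict;
use warnings;

package Owner; # stands in for an inbox or an external index
sub new { my ($class, $mask) = @_; bless { mask => $mask }, $class }

sub with_umask { # run $cb under our umask, then restore the old one
	my ($self, $cb, @arg) = @_;
	my $old = umask($self->{mask});
	my @ret = eval { $cb->(@arg) };
	my $err = $@;
	umask($old);
	die $err if $err;
	wantarray ? @ret : $ret[0];
}

package Shard; # stands in for a SearchIdx-like object
sub new { my ($class, %owner) = @_; bless { %owner }, $class }

sub with_umask { # callers no longer care which kind of owner we have
	my $self = shift;
	($self->{ibx} // $self->{eidx})->with_umask(@_);
}

package main;

my $cur = sub { sprintf('%04o', umask) };
my $by_ibx = Shard->new(ibx => Owner->new(0027))->with_umask($cur);
my $by_eidx = Shard->new(eidx => Owner->new(0077))->with_umask($cur);
print "$by_ibx $by_eidx\n";
```

Call sites only ever see `$self->with_umask(...)', so adding a third
kind of owner would touch the wrapper alone.
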
 lib/PublicInbox/SearchIdx.pm | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 0c0e844a..ea884434 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -137,7 +137,7 @@ sub idx_acquire {
 		}
 	}
 	return unless defined $flag;
-	$flag |= $DB_NO_SYNC if $self->{ibx}->{-no_fsync};
+	$flag |= $DB_NO_SYNC if ($self->{ibx} // $self->{eidx})->{-no_fsync};
 	my $xdb = eval { ($X->{WritableDatabase})->new($dir, $flag) };
 	croak "Failed opening $dir: $@" if $@;
 	$self->{xdb} = $xdb;
@@ -631,11 +631,16 @@ sub unindex_both { # git->cat_async callback
 	unindex_eml($self, $oid, PublicInbox::Eml->new($bref));
 }
 
+sub with_umask {
+	my $self = shift;
+	($self->{ibx} // $self->{eidx})->with_umask(@_);
+}
+
 # called by public-inbox-index
 sub index_sync {
 	my ($self, $opt) = @_;
 	delete $self->{lock_path} if $opt->{-skip_lock};
-	$self->{ibx}->with_umask(\&_index_sync, $self, $opt);
+	$self->with_umask(\&_index_sync, $self, $opt);
 	if ($opt->{reindex}) {
 		my %again = %$opt;
 		delete @again{qw(rethread reindex)};
@@ -893,7 +898,7 @@ sub _begin_txn {
 
 sub begin_txn_lazy {
 	my ($self) = @_;
-	$self->{ibx}->with_umask(\&_begin_txn, $self) if !$self->{txn};
+	$self->with_umask(\&_begin_txn, $self) if !$self->{txn};
 }
 
 # store 'indexlevel=medium' in v2 shard=0 and v1 (only one shard)
@@ -931,7 +936,7 @@ sub _commit_txn {
 sub commit_txn_lazy {
 	my ($self) = @_;
 	delete($self->{txn}) and
-		$self->{ibx}->with_umask(\&_commit_txn, $self);
+		$self->with_umask(\&_commit_txn, $self);
 }
 
 sub worker_done {
@@ -945,6 +950,7 @@ sub worker_done {
 sub eidx_shard_new {
 	my ($class, $eidx, $shard) = @_;
 	my $self = bless {
+		eidx => $eidx,
 		xpfx => $eidx->{xpfx},
 		indexlevel => $eidx->{indexlevel},
 		-skip_docdata => 1,

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 38/52] searchidx: favor $sync->{ibx} (over $self->{ibx})
  2020-10-27  7:54 [PATCH 00/52] detached external index: mostly Eric Wong
                   ` (36 preceding siblings ...)
  2020-10-27  7:54 ` [PATCH 37/52] searchidx: reduce inbox-dependency, wrap ->with_umask Eric Wong
@ 2020-10-27  7:54 ` Eric Wong
  2020-10-27  7:54 ` [PATCH 39/52] Makefile.PL: do not build manpage if POD is missing Eric Wong
                   ` (15 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: Eric Wong @ 2020-10-27  7:54 UTC (permalink / raw)
  To: meta

In case we want to reuse code with ExtSearchIdx or V2Writable.
---
 lib/PublicInbox/SearchIdx.pm | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index ea884434..32fa16f5 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -691,7 +691,7 @@ sub v1_checkpoint ($$;$) {
 
 	$self->{oidx}->rethread_done($sync->{-opt}) if $newest; # all done
 	commit_txn_lazy($self);
-	$self->{ibx}->git->cleanup;
+	$sync->{ibx}->git->cleanup;
 	my $nr = ${$sync->{nr}};
 	idx_release($self, $nr);
 	# let another process do some work...
@@ -707,7 +707,7 @@ sub v1_checkpoint ($$;$) {
 # only for v1
 sub process_stack {
 	my ($self, $sync, $stk) = @_;
-	my $git = $self->{ibx}->git;
+	my $git = $sync->{ibx}->git;
 	my $max = $self->{batch_bytes};
 	my $nr = 0;
 	$sync->{nr} = \$nr;

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 39/52] Makefile.PL: do not build manpage if POD is missing
  2020-10-27  7:54 [PATCH 00/52] detached external index: mostly Eric Wong
                   ` (37 preceding siblings ...)
  2020-10-27  7:54 ` [PATCH 38/52] searchidx: favor $sync->{ibx} (over $self->{ibx}) Eric Wong
@ 2020-10-27  7:54 ` Eric Wong
  2020-10-27  7:54 ` [PATCH 40/52] script: add preliminary eindex implementation Eric Wong
                   ` (14 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: Eric Wong @ 2020-10-27  7:54 UTC (permalink / raw)
  To: meta

Warn about the missing POD instead.  This lets us test new or
throwaway commands more easily, since we don't have to start a
new POD for everything we want to dump in script/.
---
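The filtering in the diff below relies on a `map' block returning an
empty list to drop an element.  A rough standalone sketch of that
idiom follows; split_by_pod, the temporary directory and the demo-*
script names are made up for illustration and are not part of
Makefile.PL.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Partition script paths by whether "$docdir/<basename>.pod" exists.
# Returns (\@have_pod, \@no_pod).
sub split_by_pod {
	my ($docdir, @exe_files) = @_;
	my @no_pod;
	my @have_pod = map {
		my $x = (split('/'))[-1]; # basename of the script path
		if (-f "$docdir/$x.pod") {
			$x; # keep: contributes one element to the result
		} else {
			push @no_pod, $x;
			(); # drop: contributes nothing to the result
		}
	} @exe_files;
	(\@have_pod, \@no_pod);
}

my $docdir = tempdir(CLEANUP => 1);
open(my $fh, '>', "$docdir/demo-index.pod") or die "open: $!";
close($fh) or die "close: $!";
my ($have, $missing) = split_by_pod($docdir,
	qw(script/demo-index script/demo-eindex));
print "have POD: @$have\n";
print "missing POD: @$missing\n";
```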
 Makefile.PL | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/Makefile.PL b/Makefile.PL
index f6b7abb6..56679598 100644
--- a/Makefile.PL
+++ b/Makefile.PL
@@ -31,7 +31,18 @@ my @syn = (@EXE_FILES, grep(m!^lib/.*\.pm$!, @manifest), @scripts);
 @syn = grep(!/SaPlugin/, @syn) if !eval { require Mail::SpamAssasin };
 $v->{syn_files} = \@syn;
 $v->{my_syntax} = [map { "$_.syntax" } @syn];
-$v->{-m1} = [ map { (split('/'))[-1] } @EXE_FILES ];
+my @no_pod;
+$v->{-m1} = [ map {
+		my $x = (split('/'))[-1];
+		my $pod = "Documentation/$x.pod";
+		if (-f $pod) {
+			$x;
+		} else {
+			warn "W: $pod missing\n";
+			push @no_pod, $x;
+			();
+		}
+	} @EXE_FILES ];
 $v->{-m5} = [ qw(public-inbox-config public-inbox-v1-format
 		public-inbox-v2-format) ];
 $v->{-m7} = [ qw(public-inbox-overview public-inbox-tuning) ];
@@ -109,6 +120,7 @@ my %man3 = map {; # semi-colon tells Perl this is a BLOCK (and not EXPR)
 	$mod =~ s/\.\w+\z//;
 	"lib/PublicInbox/$_" => "blib/man3/PublicInbox::$mod.3"
 } qw(Git.pm Import.pm WWW.pod SaPlugin/ListMirror.pod);
+my $warn_no_pod = @no_pod ? "\n\t\@echo W: missing .pod: @no_pod\n" : '';
 
 WriteMakefile(
 	NAME => 'PublicInbox', # n.b. camel-case is not our choice
@@ -172,6 +184,8 @@ $VARS
 -include Documentation/include.mk
 $TGTS
 
+check-man ::$warn_no_pod
+
 # syntax checks are currently GNU make only:
 %.syntax :: %
 	@\$(PERL) -w -I lib -c \$<

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 40/52] script: add preliminary eindex implementation
  2020-10-27  7:54 [PATCH 00/52] detached external index: mostly Eric Wong
                   ` (38 preceding siblings ...)
  2020-10-27  7:54 ` [PATCH 39/52] Makefile.PL: do not build manpage if POD is missing Eric Wong
@ 2020-10-27  7:54 ` Eric Wong
  2020-10-27  7:54 ` [PATCH 41/52] index: eindex wiring Eric Wong
                   ` (13 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: Eric Wong @ 2020-10-27  7:54 UTC (permalink / raw)
  To: meta

Not documented, yet, but it runs...
---
 MANIFEST                   |  1 +
 script/public-inbox-eindex | 43 ++++++++++++++++++++++++++++++++++++++
 t/extsearch.t              | 26 +++++++++++++++++++++++
 3 files changed, 70 insertions(+)
 create mode 100644 script/public-inbox-eindex

diff --git a/MANIFEST b/MANIFEST
index 418a2f17..10561cd2 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -225,6 +225,7 @@ sa_config/user/.spamassassin/user_prefs
 script/public-inbox-compact
 script/public-inbox-convert
 script/public-inbox-edit
+script/public-inbox-eindex
 script/public-inbox-httpd
 script/public-inbox-imapd
 script/public-inbox-index
diff --git a/script/public-inbox-eindex b/script/public-inbox-eindex
new file mode 100644
index 00000000..c26edb93
--- /dev/null
+++ b/script/public-inbox-eindex
@@ -0,0 +1,43 @@
+#!perl -w
+# Copyright (C) 2020 all contributors <meta@public-inbox.org>
+# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
+# Basic tool to create a Xapian search index for a public-inbox.
+use strict;
+use v5.10.1;
+use Getopt::Long qw(:config gnu_getopt no_ignore_case auto_abbrev);
+my $help = <<EOF; # the following should fit w/o scrolling in 80x24 term:
+usage: public-inbox-eindex [options] EINDEX_DIR [INBOX_DIR]
+
+  Create and update external (detached) search indices
+
+  --no-fsync          speed up indexing, risk corruption on power outage
+  -L LEVEL            `medium', or `full' (default: full)
+  --all               index all configured inboxes
+  --jobs=NUM          set or disable parallelization (NUM=0)
+  --batch-size=BYTES  flush changes to OS after a given number of bytes
+  --max-size=BYTES    do not index messages larger than the given size
+  --verbose | -v      increase verbosity (may be repeated)
+
+BYTES may use `k', `m', and `g' suffixes (e.g. `10m' for 10 megabytes)
+See public-inbox-eindex(1) man page for full documentation.
+EOF
+my $opt = { quiet => -1, compact => 0, max_size => undef, fsync => 1 };
+GetOptions($opt, qw(verbose|v+ reindex rethread compact|c+ jobs|j=i
+		fsync|sync!
+		indexlevel|index-level|L=s max_size|max-size=s
+		batch_size|batch-size=s
+		skip-docdata all help|h))
+	or die $help;
+if ($opt->{help}) { print $help; exit 0 };
+die "--jobs must be >= 0\n" if defined $opt->{jobs} && $opt->{jobs} < 0;
+
+# require lazily to speed up --help
+my $eidx_dir = shift(@ARGV) // die "E: $help";
+require PublicInbox::Admin;
+my $cfg = PublicInbox::Config->new;
+my @ibxs = PublicInbox::Admin::resolve_inboxes(\@ARGV, $opt, $cfg);
+PublicInbox::Admin::require_or_die(qw(-search));
+require PublicInbox::ExtSearchIdx;
+my $eidx = PublicInbox::ExtSearchIdx->new($eidx_dir, $opt);
+$eidx->attach_inbox($_) for @ibxs;
+$eidx->eidx_sync($opt);
diff --git a/t/extsearch.t b/t/extsearch.t
index 54927c50..dfec6b6f 100644
--- a/t/extsearch.t
+++ b/t/extsearch.t
@@ -4,9 +4,35 @@
 use strict;
 use Test::More;
 use PublicInbox::TestCommon;
+use Fcntl qw(:seek);
 require_git(2.6);
 require_mods(qw(DBD::SQLite Search::Xapian));
 use_ok 'PublicInbox::ExtSearch';
 use_ok 'PublicInbox::ExtSearchIdx';
+my ($home, $for_destroy) = tmpdir();
+local $ENV{HOME} = $home;
+mkdir "$home/.public-inbox" or BAIL_OUT $!;
+open my $fh, '>', "$home/.public-inbox/config" or BAIL_OUT $!;
+print $fh <<EOF or BAIL_OUT $!;
+[publicinboxMda]
+	spamcheck = none
+EOF
+close $fh or BAIL_OUT $!;
+my $v2addr = 'v2test@example.com';
+my $v1addr = 'v1test@example.com';
+ok(run_script([qw(-init -V2 v2test), "$home/v2test",
+	'http://example.com/v2test', $v2addr ]), 'v2test init');
+my $env = { ORIGINAL_RECIPIENT => $v2addr };
+open($fh, '<', 't/utf8.eml') or BAIL_OUT("open t/utf8.eml: $!");
+run_script(['-mda', '--no-precheck'], $env, { 0 => $fh }) or BAIL_OUT '-mda';
+
+ok(run_script([qw(-init -V1 v1test), "$home/v1test",
+	'http://example.com/v1test', $v1addr ]), 'v1test init');
+$env = { ORIGINAL_RECIPIENT => $v1addr };
+seek($fh, 0, SEEK_SET) or BAIL_OUT $!;
+run_script(['-mda', '--no-precheck'], $env, { 0 => $fh }) or BAIL_OUT '-mda';
+run_script(['-index', "$home/v1test"]) or BAIL_OUT "index $?";
+
+ok(run_script([qw(-eindex --all), "$home/eindex"]), 'eindex init');
 
 done_testing;

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 41/52] index: eindex wiring
  2020-10-27  7:54 [PATCH 00/52] detached external index: mostly Eric Wong
                   ` (39 preceding siblings ...)
  2020-10-27  7:54 ` [PATCH 40/52] script: add preliminary eindex implementation Eric Wong
@ 2020-10-27  7:54 ` Eric Wong
  2020-10-27  7:54 ` [PATCH 42/52] over: store xref3 data in over.sqlite3 Eric Wong
                   ` (12 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: Eric Wong @ 2020-10-27  7:54 UTC (permalink / raw)
  To: meta

This doesn't do anything, yet, but it will once the rest
of the eindex stuff works.
---
 script/public-inbox-index | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/script/public-inbox-index b/script/public-inbox-index
index 5dad6ecb..55e4f641 100755
--- a/script/public-inbox-index
+++ b/script/public-inbox-index
@@ -11,12 +11,13 @@ use Getopt::Long qw(:config gnu_getopt no_ignore_case auto_abbrev);
 my $help = <<EOF; # the following should fit w/o scrolling in 80x24 term:
 usage: public-inbox-index [options] INBOX_DIR
 
-  Create and update search indices
+  Create and update per-inbox search indices
 
 options:
 
   --no-fsync          speed up indexing, risk corruption on power outage
   -L LEVEL            `basic', `medium', or `full' (default: full)
+  -E EIDX             update EIDX (e.g. `all')
   --all               index all configured inboxes
   --compact | -c      run public-inbox-compact(1) after indexing
   --sequential-shard  index Xapian shards sequentially for slow storage

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 42/52] over: store xref3 data in over.sqlite3
  2020-10-27  7:54 [PATCH 00/52] detached external index: mostly Eric Wong
                   ` (40 preceding siblings ...)
  2020-10-27  7:54 ` [PATCH 41/52] index: eindex wiring Eric Wong
@ 2020-10-27  7:54 ` Eric Wong
  2020-10-27  7:54 ` [PATCH 43/52] searchidx: remove xref3 support for Xapian Eric Wong
                   ` (11 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: Eric Wong @ 2020-10-27  7:54 UTC (permalink / raw)
  To: meta

We may not end up storing xref3 data in Xapian, actually.
This will make indexlevel=basic possible, along with
--sequential-shard indexing support for slow storage.

Making oidmap a separate table seems unnecessary, too, so
fold it into the xref3 table since it's unlikely a git blob
will be responsible for multiple xref3 rows.
---
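As a rough illustration of the string form get_xref3 returns for each
row ("eidx_key:xnum:oidhex"), the sketch below converts between the
packed OID stored in the oidbin column and its hex form.  xref3_str
and xref3_parse are hypothetical helpers written for this note; they
do not exist in public-inbox, and the parsing regexp is my own
assumption about how one might split the string again.

```perl
use strict;
use warnings;

# Build the "eidx_key:xnum:oidhex" string from one xref3 row.
# $oidbin is the packed binary OID, as bound with SQL_BLOB in the diff.
sub xref3_str {
	my ($eidx_key, $xnum, $oidbin) = @_;
	"$eidx_key:$xnum:".unpack('H*', $oidbin);
}

# Split such a string back apart.  The key may itself contain ':',
# so anchor on the two trailing fields instead of splitting on ':'.
sub xref3_parse {
	my ($str) = @_;
	$str =~ /\A(.+):([0-9]+):([a-f0-9]+)\z/s or return;
	($1, $2 + 0, $3);
}

my $oidbin = pack('H*', 'adeadba7cafe'); # 12 hex digits -> 6 bytes
my $str = xref3_str('example.key', 2019, $oidbin);
print "$str\n";
my ($key, $xnum, $oidhex) = xref3_parse($str);
print "key=$key xnum=$xnum oid=$oidhex\n";
```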
 lib/PublicInbox/Over.pm    | 19 +++++++++++
 lib/PublicInbox/OverIdx.pm | 64 ++++++++++++++++++++++----------------
 t/over.t                   | 27 +++++++++++-----
 3 files changed, 77 insertions(+), 33 deletions(-)

diff --git a/lib/PublicInbox/Over.pm b/lib/PublicInbox/Over.pm
index 08112386..f34e7fc1 100644
--- a/lib/PublicInbox/Over.pm
+++ b/lib/PublicInbox/Over.pm
@@ -260,6 +260,25 @@ SELECT num,tid,ds,ts,ddd FROM over WHERE num = ? LIMIT 1
 	$smsg ? load_from_row($smsg) : undef;
 }
 
+sub get_xref3 {
+	my ($self, $num) = @_;
+	my $dbh = dbh($self);
+	my $sth = $dbh->prepare_cached(<<'', undef, 1);
+SELECT ibx_id,xnum,oidbin FROM xref3 WHERE docid = ? ORDER BY ibx_id ASC
+
+	$sth->execute($num);
+	my $rows = $sth->fetchall_arrayref;
+	my $eidx_key_sth = $dbh->prepare_cached(<<'', undef, 1);
+SELECT eidx_key FROM inboxes WHERE ibx_id = ?
+
+	[ map {
+		my $r = $_;
+		$eidx_key_sth->execute($r->[0]);
+		my $eidx_key = $eidx_key_sth->fetchrow_array;
+		"$eidx_key:$r->[1]:".unpack('H*', $r->[2]);
+	} @$rows ];
+}
+
 sub next_by_mid {
 	my ($self, $mid, $id, $prev) = @_;
 	my $dbh = dbh($self);
diff --git a/lib/PublicInbox/OverIdx.pm b/lib/PublicInbox/OverIdx.pm
index 09bca790..dff2780d 100644
--- a/lib/PublicInbox/OverIdx.pm
+++ b/lib/PublicInbox/OverIdx.pm
@@ -517,20 +517,27 @@ sub eidx_prep ($) {
 	my ($self) = @_;
 	$self->{-eidx_prep} //= do {
 		my $dbh = $self->dbh;
-		$dbh->do(<<'');
-INSERT OR IGNORE INTO counter (key) VALUES ('oidmap_num')
+		$dbh->do(<<"");
+INSERT OR IGNORE INTO counter (key) VALUES ('eidx_docid')
 
 		$dbh->do(<<'');
-INSERT OR IGNORE INTO counter (key) VALUES ('eidx_docid')
+CREATE TABLE IF NOT EXISTS inboxes (
+	ibx_id INTEGER PRIMARY KEY AUTOINCREMENT,
+	eidx_key VARCHAR(255) NOT NULL, /* {newsgroup} // {inboxdir} */
+	UNIQUE (eidx_key)
+)
 
 		$dbh->do(<<'');
-CREATE TABLE IF NOT EXISTS oidmap (
-	num INTEGER NOT NULL, /* NNTP article number == IMAP UID */
-	oidbin VARBINARY, /* 20-byte SHA-1 or 32-byte SHA-256 */
-	UNIQUE (num),
-	UNIQUE (oidbin)
+CREATE TABLE IF NOT EXISTS xref3 (
+	docid INTEGER NOT NULL, /* <=> over.num */
+	ibx_id INTEGER NOT NULL, /* <=> inboxes.ibx_id */
+	xnum INTEGER NOT NULL, /* NNTP article number in ibx */
+	oidbin VARBINARY NOT NULL, /* 20-byte SHA-1 or 32-byte SHA-256 */
+	UNIQUE (docid, ibx_id, xnum, oidbin)
 )
 
+	$dbh->do('CREATE INDEX IF NOT EXISTS idx_docid ON xref3 (docid)');
+
 		$dbh->do(<<'');
 CREATE TABLE IF NOT EXISTS eidx_meta (
 	key VARCHAR(255) PRIMARY KEY,
@@ -564,28 +571,33 @@ sub eidx_max {
 	get_counter($self->{dbh}, 'eidx_docid');
 }
 
-sub oid2num {
-	my ($self, $oidhex) = @_;
-	my $dbh = eidx_prep($self);
-	my $sth = $dbh->prepare_cached(<<'', undef, 1);
-SELECT num FROM oidmap WHERE oidbin = ?
-
-	$sth->bind_param(1, pack('H*', $oidhex), SQL_BLOB);
+sub add_xref3 {
+	my ($self, $docid, $xnum, $oidhex, $eidx_key) = @_;
+	begin_lazy($self);
+	my $ibx_id = id_for($self, 'inboxes', 'ibx_id', eidx_key => $eidx_key);
+	my $oidbin = pack('H*', $oidhex);
+	my $sth = $self->{dbh}->prepare_cached(<<'');
+INSERT OR IGNORE INTO xref3 (docid, ibx_id, xnum, oidbin) VALUES (?, ?, ?, ?)
+
+	$sth->bind_param(1, $docid);
+	$sth->bind_param(2, $ibx_id);
+	$sth->bind_param(3, $xnum);
+	$sth->bind_param(4, $oidbin, SQL_BLOB);
 	$sth->execute;
-	$sth->fetchrow_array;
 }
 
-sub oid_add {
-	my ($self, $oidhex) = @_;
-	my $dbh = eidx_prep($self);
-	my $num = adj_counter($self, 'oidmap_num', '+');
-	my $sth = $dbh->prepare_cached(<<'');
-INSERT INTO oidmap (num, oidbin) VALUES (?,?)
-
-	$sth->bind_param(1, $num);
-	$sth->bind_param(2, pack('H*', $oidhex), SQL_BLOB);
+sub remove_xref3 {
+	my ($self, $docid, $oidhex, $eidx_key) = @_;
+	begin_lazy($self);
+	my $ibx_id = id_for($self, 'inboxes', 'ibx_id', eidx_key => $eidx_key);
+	my $oidbin = pack('H*', $oidhex);
+	my $sth = $self->{dbh}->prepare_cached(<<'');
+DELETE FROM xref3 WHERE docid = ? AND ibx_id = ? AND oidbin = ?
+
+	$sth->bind_param(1, $docid);
+	$sth->bind_param(2, $ibx_id);
+	$sth->bind_param(3, $oidbin, SQL_BLOB);
 	$sth->execute;
-	$num;
 }
 
 1;
diff --git a/t/over.t b/t/over.t
index 3e2860f8..56c20d01 100644
--- a/t/over.t
+++ b/t/over.t
@@ -75,14 +75,27 @@ SKIP: {
 }
 
 # ext index additions
+$over->eidx_prep;
 {
-	my $hex = 'deadbeefcafe';
-	my $n = $over->oid_add($hex);
-	ok($n > 0, 'oid_add returned number');
-	is($over->oid2num($hex), $n, 'oid2num works');
-	my $n2 = $over->oid_add($hex.$hex);
-	ok($n2 > $n, 'oid_add increments');
-	is($over->oid2num($hex.$hex), $n2, 'oid2num works again');
+	my @arg = qw(1349 2019 adeadba7cafe example.key);
+	ok($over->add_xref3(@arg), 'first add');
+	ok($over->add_xref3(@arg), 'add idempotent');
+	my $xref3 = $over->get_xref3(1349);
+	is_deeply($xref3, [ 'example.key:2019:adeadba7cafe' ], 'xref3 works');
+
+	@arg = qw(1349 2018 deadbeefcafe example.kee);
+	ok($over->add_xref3(@arg), 'add another xref3');
+	$xref3 = $over->get_xref3(1349);
+	is_deeply($xref3, [ 'example.key:2019:adeadba7cafe',
+			'example.kee:2018:deadbeefcafe' ],
+			'xref3 works for two');
+
+	@arg = qw(1349 adeadba7cafe example.key);
+	ok($over->remove_xref3(@arg), 'remove first');
+	$xref3 = $over->get_xref3(1349);
+	is_deeply($xref3, [ 'example.kee:2018:deadbeefcafe' ],
+		'confirm removal successful');
+	$over->rollback_lazy;
 }
 
 done_testing();

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 43/52] searchidx: remove xref3 support for Xapian
  2020-10-27  7:54 [PATCH 00/52] detached external index: mostly Eric Wong
                   ` (41 preceding siblings ...)
  2020-10-27  7:54 ` [PATCH 42/52] over: store xref3 data in over.sqlite3 Eric Wong
@ 2020-10-27  7:54 ` Eric Wong
  2020-10-27  7:54 ` [PATCH 44/52] t/extsearch.t: verify results and xref3 ordering Eric Wong
                   ` (10 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: Eric Wong @ 2020-10-27  7:54 UTC (permalink / raw)
  To: meta

It doesn't seem worth storing xref3 data in Xapian now that
the same info is in over.sqlite3.
---
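With xnum gone, the "+X" request sent to a shard worker carries four
space-separated fields before the header bytes.  The sketch below
shows why the eidx_key goes last and why split() gets a limit of 4: a
key containing spaces stays in one piece.  fmt_add and parse_add are
hypothetical names, and the chomp in parse_add is an assumption of
this sketch rather than something the worker loop in the diff does.

```perl
use strict;
use warnings;

# Format a "+X" request line followed by the raw header.
sub fmt_add {
	my ($docid, $oid, $eidx_key, $hdr) = @_;
	'+X '.length($hdr)." $docid $oid $eidx_key\n".$hdr;
}

# Parse the first line of such a request back into its fields.
sub parse_add {
	my ($line) = @_;
	$line =~ s/\A\+X // or return;
	chomp $line; # assumption: strip the newline before splitting
	# limit of 4: everything after the third space is the eidx_key
	my ($len, $docid, $oid, $eidx_key) = split(/ /, $line, 4);
	($len + 0, $docid + 0, $oid, $eidx_key);
}

my $hdr = "List-Id: <demo.example.com>\n\n";
my $req = fmt_add(42, 'deadbeef' x 5, '/srv/my inbox', $hdr);
my ($line) = ($req =~ /\A([^\n]*\n)/);
my ($len, $docid, $oid, $key) = parse_add($line);
print "$len $docid $oid [$key]\n";
```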
 lib/PublicInbox/ExtSearchIdx.pm   | 10 +++--
 lib/PublicInbox/SearchIdx.pm      | 64 +++++++++++--------------------
 lib/PublicInbox/SearchIdxShard.pm | 28 +++++++-------
 lib/PublicInbox/Smsg.pm           | 13 -------
 t/search.t                        | 26 +------------
 5 files changed, 46 insertions(+), 95 deletions(-)

diff --git a/lib/PublicInbox/ExtSearchIdx.pm b/lib/PublicInbox/ExtSearchIdx.pm
index 790ee921..026e1377 100644
--- a/lib/PublicInbox/ExtSearchIdx.pm
+++ b/lib/PublicInbox/ExtSearchIdx.pm
@@ -116,9 +116,10 @@ sub do_xpost ($$) {
 	my $eml = $req->{eml};
 	if (my $new_smsg = $req->{new_smsg}) { # 'm' on cross-posted message
 		my $xnum = $req->{xnum};
-		$idx->shard_add_xref3($docid, $xnum, $oid, $xibx, $eml);
+		$self->{oidx}->add_xref3($docid, $xnum, $oid, $xibx->eidx_key);
+		$idx->shard_add_eidx_info($docid, $oid, $xibx, $eml);
 	} else { # 'd'
-		$idx->shard_remove_xref3($docid, $oid, $xibx, $eml);
+		$idx->shard_remove_eidx_info($docid, $oid, $xibx, $eml);
 	}
 }
 
@@ -135,7 +136,10 @@ sub index_unseen ($) {
 	$new_smsg->{num} = $docid;
 	my $idx = $self->idx_shard($docid);
 	$self->{oidx}->add_overview($eml, $new_smsg);
-	$idx->index_raw(undef, $eml, $new_smsg, $req->{ibx});
+	my $oid = $new_smsg->{blob};
+	my $ibx = delete $req->{ibx} or die 'BUG: {ibx} unset';
+	$self->{oidx}->add_xref3($docid, $req->{xnum}, $oid, $ibx->eidx_key);
+	$idx->index_raw(undef, $eml, $new_smsg, $ibx);
 }
 
 sub do_finalize ($) {
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 32fa16f5..569efbb0 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -370,8 +370,6 @@ sub add_xapian ($$$$) {
 
 	if (defined(my $eidx_key = $smsg->{eidx_key})) {
 		$doc->add_boolean_term('O'.$eidx_key);
-		$doc->add_boolean_term('P'.
-				"$eidx_key:$smsg->{num}:$smsg->{blob}");
 	}
 	msg_iter($eml, \&index_xapian, [ $self, $doc ]);
 	index_ids($self, $doc, $eml, $mids);
@@ -456,57 +454,41 @@ sub _get_doc ($$$) {
 	}
 }
 
-sub add_xref3 {
-	my ($self, $docid, $xnum, $oid, $eidx_key, $eml) = @_;
+sub add_eidx_info {
+	my ($self, $docid, $oid, $eidx_key, $eml) = @_;
 	begin_txn_lazy($self);
 	my $doc = _get_doc($self, $docid, $oid) or return;
 	term_generator($self)->set_document($doc);
 	$doc->add_boolean_term('O'.$eidx_key);
-	$doc->add_boolean_term('P'."$eidx_key:$xnum:$oid");
 	index_list_id($self, $doc, $eml);
 	$self->{xdb}->replace_document($docid, $doc);
 }
 
-sub remove_xref3 {
+sub remove_eidx_info {
 	my ($self, $docid, $oid, $eidx_key, $eml) = @_;
 	begin_txn_lazy($self);
 	my $doc = _get_doc($self, $docid, $oid) or return;
-	my $xref3 = PublicInbox::Smsg::xref3(undef, $doc);
-	my %x3 = map { $_ => undef } @$xref3;
-	for (grep(/\A\Q$eidx_key\E:[0-9]+:\Q$oid\E\z/, @$xref3)) {
-		delete $x3{$_};
-		$doc->remove_term('P' . $_);
-	}
-	if (scalar(keys(%x3)) == 0) {
-		$self->{xdb}->delete_document($docid);
-		if (my $del_fh = $self->{del_fh}) { # TODO
-			print $del_fh $docid, "\n" or die "E: print $!";
-		}
-	} else {
-		if (!grep(/\A\Q$eidx_key\E:/, keys %x3)) {
-			$doc->remove_term('O'.$eidx_key);
-		}
-		for my $l ($eml->header_raw('List-Id')) {
-			$l =~ /<([^>]+)>/ or next;
-			my $lid = lc $1;
-			$doc->remove_term('G' . $lid);
-
-			# nb: we don't remove the XL probabilistic terms
-			# since terms may overlap if cross-posted.
-			#
-			# IOW, a message which has both <foo.example.com>
-			# and <bar.example.com> would have overlapping
-			# "XLexample" and "XLcom" as terms and which we
-			# wouldn't know if they're safe to remove if we just
-			# unindex <foo.example.com> while preserving
-			# <bar.example.com>.
-			#
-			# In any case, this entire sub is will likely never
-			# be needed and users using the "l:" prefix are probably
-			# rarer.
-		}
-		$self->{xdb}->replace_document($docid, $doc);
+	$doc->remove_term('O'.$eidx_key);
+	for my $l ($eml->header_raw('List-Id')) {
+		$l =~ /<([^>]+)>/ or next;
+		my $lid = lc $1;
+		$doc->remove_term('G' . $lid);
+
+		# nb: we don't remove the XL probabilistic terms
+		# since terms may overlap if cross-posted.
+		#
+		# IOW, a message which has both <foo.example.com>
+		# and <bar.example.com> would have overlapping
+		# "XLexample" and "XLcom" as terms and which we
+		# wouldn't know if they're safe to remove if we just
+		# unindex <foo.example.com> while preserving
+		# <bar.example.com>.
+		#
+		# In any case, this entire sub is will likely never
+		# be needed and users using the "l:" prefix are probably
+		# rarer.
 	}
+	$self->{xdb}->replace_document($docid, $doc);
 }
 
 sub get_val ($$) {
diff --git a/lib/PublicInbox/SearchIdxShard.pm b/lib/PublicInbox/SearchIdxShard.pm
index ac01340c..644d8b58 100644
--- a/lib/PublicInbox/SearchIdxShard.pm
+++ b/lib/PublicInbox/SearchIdxShard.pm
@@ -75,15 +75,15 @@ sub shard_worker_loop ($$$$$) {
 		} elsif ($line =~ /\AD ([a-f0-9]{40,}) ([0-9]+)\n\z/s) {
 			$self->remove_by_oid($1, $2 + 0);
 		} elsif ($line =~ s/\A\+X //) {
-			my ($len, $docid, $xnum, $oid, $eidx_key) =
-							split(/ /, $line, 5);
-			$self->add_xref3($docid, $xnum, $oid, $eidx_key,
-						eml($r, $len));
+			my ($len, $docid, $oid, $eidx_key) =
+							split(/ /, $line, 4);
+			$self->add_eidx_info($docid, $oid, $eidx_key,
+							eml($r, $len));
 		} elsif ($line =~ s/\A-X //) {
-			my ($len, $docid, $xnum, $oid, $eidx_key) =
-							split(/ /, $line, 5);
-			$self->remove_xref3($docid, $xnum, $oid,
-						$eidx_key, eml($r, $len));
+			my ($len, $docid, $oid, $eidx_key) =
+							split(/ /, $line, 4);
+			$self->remove_eidx_info($docid, $oid, $eidx_key,
+							eml($r, $len));
 		} else {
 			chomp $line;
 			my $eidx_key;
@@ -135,20 +135,20 @@ sub index_raw {
 	}
 }
 
-sub shard_add_xref3 {
-	my ($self, $docid, $xnum, $oid, $xibx, $eml) = @_;
+sub shard_add_eidx_info {
+	my ($self, $docid, $oid, $xibx, $eml) = @_;
 	my $eidx_key = $xibx->eidx_key;
 	if (my $w = $self->{w}) {
 		my $hdr = $eml->header_obj->as_string;
 		my $len = length($hdr);
-		print $w "+X $len $docid $xnum $oid $eidx_key\n", $hdr or
+		print $w "+X $len $docid $oid $eidx_key\n", $hdr or
 			die "failed to write shard: $!";
 	} else {
-		$self->add_xref3($docid, $xnum, $oid, $eidx_key, $eml);
+		$self->add_eidx_info($docid, $oid, $eidx_key, $eml);
 	}
 }
 
-sub shard_remove_xref3 {
+sub shard_remove_eidx_info {
 	my ($self, $docid, $oid, $xibx, $eml) = @_;
 	my $eidx_key = $xibx->eidx_key;
 	if (my $w = $self->{w}) {
@@ -157,7 +157,7 @@ sub shard_remove_xref3 {
 		print $w "-X $len $docid $oid $eidx_key\n", $hdr or
 			die "failed to write shard: $!";
 	} else {
-		$self->remove_xref3($docid, $oid, $eidx_key, $eml);
+		$self->remove_eidx_info($docid, $oid, $eidx_key, $eml);
 	}
 }
 
diff --git a/lib/PublicInbox/Smsg.pm b/lib/PublicInbox/Smsg.pm
index c0fd85fd..14086538 100644
--- a/lib/PublicInbox/Smsg.pm
+++ b/lib/PublicInbox/Smsg.pm
@@ -137,17 +137,4 @@ sub subject_normalized ($) {
 	$subj;
 }
 
-sub xref3 {
-	my ($self, $doc) = @_;
-	my $end = $doc->termlist_end;
-	my $it = $doc->termlist_begin;
-	$it->skip_to('P');
-	my @ret;
-	for (; $it != $end; $it++) {
-		my $val = $it->get_termname;
-		$val =~ s/\AP// and push @ret, $val;
-	}
-	\@ret;
-}
-
 1;
diff --git a/t/search.t b/t/search.t
index e789b81e..da9acb07 100644
--- a/t/search.t
+++ b/t/search.t
@@ -341,14 +341,6 @@ $ibx->with_umask(sub {
 		my $uid = PublicInbox::SearchIdx::get_val($doc, $col);
 		is($uid, $smsg->{num}, 'UID column matches {num}');
 		is($uid, $m->get_docid, 'UID column matches docid');
-
-		# check ->xref3 for external index:
-		is_deeply($smsg->xref3($doc), [], 'xref3 empty by default');
-		my $exp = "inbox.com.example:$uid:deadbeef";
-		$doc->add_boolean_term('P'.$exp);
-		is_deeply($smsg->xref3($doc), [ $exp ], 'xref3 can be set');
-		$doc->remove_term('P'.$exp);
-		is_deeply($smsg->xref3($doc), [], 'xref3 can be unset');
 	}
 
 	$mset = $ibx->search->mset('tc:list@example.com');
@@ -521,13 +513,8 @@ $ibx->with_umask(sub {
 	$rw_commit->();
 	my $doc_id = $rw->add_message(eml_load('t/data/message_embed.eml'));
 	ok($doc_id > 0, 'messages within messages');
-
-	my $eml = PublicInbox::Eml->new(<<EOF);
-List-Id: <blahblah.example.com>
-
-EOF
-	$rw->add_xref3($doc_id, 1, 'deadbeef', 'newsgroup1.example', $eml);
-	$rw_commit->();
+	$rw->commit_txn_lazy;
+	$ibx->search->reopen;
 	my $n_test_eml = $query->('n:test.eml');
 	is(scalar(@$n_test_eml), 1, 'got a result');
 	my $n_embed2x_eml = $query->('n:embed2x.eml');
@@ -545,15 +532,6 @@ EOF
 	is($query->('s:"mail header experiments"')->[0]->{mid},
 		'20200418222508.GA13918@dcvr',
 		'Subject search reaches inside message/rfc822');
-	is($query->('l:blahblah.example.com')->[0]->{num}, $doc_id,
-		'xref3 List-Id probabilistic works');
-	is($query->('lid:blahblah.example.com')->[0]->{num}, $doc_id,
-		'xref3 List-Id boolean term works');
-	$rw->remove_xref3($doc_id, 'deadbeef', 'newsgroup1.example', $eml);
-	$rw->commit_txn_lazy;
-	$ibx->search->reopen;
-	my $res = $query->('lid:blahblah.example.com');
-	is_deeply($res, [], '->remove_xref3 dropped boolean term');
 });
 
 done_testing();

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 44/52] t/extsearch.t: verify results and xref3 ordering
  2020-10-27  7:54 [PATCH 00/52] detached external index: mostly Eric Wong
                   ` (42 preceding siblings ...)
  2020-10-27  7:54 ` [PATCH 43/52] searchidx: remove xref3 support for Xapian Eric Wong
@ 2020-10-27  7:54 ` Eric Wong
  2020-10-27  7:54 ` [PATCH 45/52] t/v2writable: remove pointless ->barrier call Eric Wong
                   ` (9 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: Eric Wong @ 2020-10-27  7:54 UTC (permalink / raw)
  To: meta

We want NNTP clients to see consistent Xref: headers to ensure
client-side caches don't get confused.
---
 t/extsearch.t | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/t/extsearch.t b/t/extsearch.t
index dfec6b6f..108ffaeb 100644
--- a/t/extsearch.t
+++ b/t/extsearch.t
@@ -20,7 +20,7 @@ EOF
 close $fh or BAIL_OUT $!;
 my $v2addr = 'v2test@example.com';
 my $v1addr = 'v1test@example.com';
-ok(run_script([qw(-init -V2 v2test), "$home/v2test",
+ok(run_script([qw(-init -V2 v2test --newsgroup v2.example), "$home/v2test",
 	'http://example.com/v2test', $v2addr ]), 'v2test init');
 my $env = { ORIGINAL_RECIPIENT => $v2addr };
 open($fh, '<', 't/utf8.eml') or BAIL_OUT("open t/utf8.eml: $!");
@@ -35,4 +35,15 @@ run_script(['-index', "$home/v1test"]) or BAIL_OUT "index $?";
 
 ok(run_script([qw(-eindex --all), "$home/eindex"]), 'eindex init');
 
+{
+	my $es = PublicInbox::ExtSearch->new("$home/eindex");
+	my $smsg = $es->over->get_art(1);
+	ok($smsg, 'got first article');
+	is($es->over->get_art(2), undef, 'only one added');
+	my $xref3 = $es->over->get_xref3(1);
+	like($xref3->[0], qr/\A\Qv2.example\E:1:/, 'order preserved 1');
+	like($xref3->[1], qr!\A\Q$home/v1test\E:1:!, 'order preserved 2');
+	is(scalar(@$xref3), 2, 'only two entries');
+}
+
 done_testing;

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 45/52] t/v2writable: remove pointless ->barrier call
  2020-10-27  7:54 [PATCH 00/52] detached external index: mostly Eric Wong
                   ` (43 preceding siblings ...)
  2020-10-27  7:54 ` [PATCH 44/52] t/extsearch.t: verify results and xref3 ordering Eric Wong
@ 2020-10-27  7:54 ` Eric Wong
  2020-10-27  7:54 ` [PATCH 46/52] extsearch: wire up smsg_eml Eric Wong
                   ` (8 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: Eric Wong @ 2020-10-27  7:54 UTC (permalink / raw)
  To: meta

We don't actually use it anywhere, and may not need it in
the future.
---
 t/v2writable.t | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/t/v2writable.t b/t/v2writable.t
index 2f71fafa..358a2bb7 100644
--- a/t/v2writable.t
+++ b/t/v2writable.t
@@ -274,14 +274,13 @@ EOF
 	$mime->header_set('Message-ID', "<$y>");
 	$mime->header_set('References', "<$x>");
 	ok($im->add($mime), 'add excessively long References');
-	$im->barrier;
+	$im->done;
 
 	my $msgs = $ibx->over->get_thread('x'x244);
 	is(2, scalar(@$msgs), 'got both messages');
 	is($msgs->[0]->{mid}, 'x'x244, 'stored truncated mid');
 	is($msgs->[1]->{references}, '<'.('x'x244).'>', 'stored truncated ref');
 	is($msgs->[1]->{mid}, 'y'x244, 'stored truncated mid(2)');
-	$im->done;
 }
 
 my $tmp = {

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 46/52] extsearch: wire up smsg_eml
  2020-10-27  7:54 [PATCH 00/52] detached external index: mostly Eric Wong
                   ` (44 preceding siblings ...)
  2020-10-27  7:54 ` [PATCH 45/52] t/v2writable: remove pointless ->barrier call Eric Wong
@ 2020-10-27  7:54 ` Eric Wong
  2020-10-27  7:54 ` [PATCH 47/52] extsearchidx: handle edits Eric Wong
                   ` (7 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: Eric Wong @ 2020-10-27  7:54 UTC (permalink / raw)
  To: meta

We'll probably still need synchronous message retrieval
in a few places (tests, at least).
---
 lib/PublicInbox/ExtSearch.pm | 4 ++++
 lib/PublicInbox/Inbox.pm     | 2 +-
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/lib/PublicInbox/ExtSearch.pm b/lib/PublicInbox/ExtSearch.pm
index 8997cd54..3e8ca82c 100644
--- a/lib/PublicInbox/ExtSearch.pm
+++ b/lib/PublicInbox/ExtSearch.pm
@@ -7,6 +7,7 @@ package PublicInbox::ExtSearch;
 use strict;
 use v5.10.1;
 use PublicInbox::Over;
+use PublicInbox::Inbox;
 
 # for ->reopen, ->mset, ->mset_to_artnums
 use parent qw(PublicInbox::Search);
@@ -37,4 +38,7 @@ sub git {
 	$self->{git} //= PublicInbox::Git->new("$self->{topdir}/ALL.git");
 }
 
+no warnings 'once';
+*smsg_eml = \&PublicInbox::Inbox::smsg_eml;
+
 1;
diff --git a/lib/PublicInbox/Inbox.pm b/lib/PublicInbox/Inbox.pm
index cbb95b8d..cd5c098a 100644
--- a/lib/PublicInbox/Inbox.pm
+++ b/lib/PublicInbox/Inbox.pm
@@ -331,7 +331,7 @@ sub msg_by_smsg ($$) {
 	return unless defined $smsg;
 	defined(my $blob = $smsg->{blob}) or return;
 
-	git($self)->cat_file($blob);
+	$self->git->cat_file($blob);
 }
 
 sub smsg_eml {

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 47/52] extsearchidx: handle edits
  2020-10-27  7:54 [PATCH 00/52] detached external index: mostly Eric Wong
                   ` (45 preceding siblings ...)
  2020-10-27  7:54 ` [PATCH 46/52] extsearch: wire up smsg_eml Eric Wong
@ 2020-10-27  7:54 ` Eric Wong
  2020-10-27  7:54 ` [PATCH 48/52] extsearch: wire up remaining Inbox-like methods for WWW Eric Wong
                   ` (6 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: Eric Wong @ 2020-10-27  7:54 UTC (permalink / raw)
  To: meta

We can now handle cases where messages are edited in one inbox
but not another, bifurcating the message.

V2Writable::log_range handles some edge-cases which could happen
in v2-only code paths, as well, but weren't usually triggered
due to default git-gc knobs not pruning immediately.
---
 lib/PublicInbox/ExtSearchIdx.pm | 66 ++++++++++++++++++++++++++-------
 lib/PublicInbox/OverIdx.pm      | 44 +++++++++++++++++-----
 lib/PublicInbox/V2Writable.pm   | 26 ++++++++-----
 t/extsearch.t                   | 28 +++++++++++++-
 4 files changed, 132 insertions(+), 32 deletions(-)

diff --git a/lib/PublicInbox/ExtSearchIdx.pm b/lib/PublicInbox/ExtSearchIdx.pm
index 026e1377..bfe39891 100644
--- a/lib/PublicInbox/ExtSearchIdx.pm
+++ b/lib/PublicInbox/ExtSearchIdx.pm
@@ -19,7 +19,7 @@ use v5.10.1;
 use parent qw(PublicInbox::ExtSearch PublicInbox::Lock);
 use Carp qw(croak carp);
 use PublicInbox::Search;
-use PublicInbox::SearchIdx qw(crlf_adjust prepare_stack);
+use PublicInbox::SearchIdx qw(crlf_adjust prepare_stack is_ancestor);
 use PublicInbox::OverIdx;
 use PublicInbox::MID qw(mids);
 use PublicInbox::V2Writable;
@@ -119,6 +119,7 @@ sub do_xpost ($$) {
 		$self->{oidx}->add_xref3($docid, $xnum, $oid, $xibx->eidx_key);
 		$idx->shard_add_eidx_info($docid, $oid, $xibx, $eml);
 	} else { # 'd'
+		$self->{oidx}->remove_xref3($docid, $oid, $xibx->eidx_key);
 		$idx->shard_remove_eidx_info($docid, $oid, $xibx, $eml);
 	}
 }
@@ -176,14 +177,35 @@ sub do_step ($) { # main iterator for adding messages to the index
 	do_finalize($req);
 }
 
+sub _blob_missing ($) { # called when req->{cur_smsg}->{blob} is bad
+	my ($req) = @_;
+	my $smsg = $req->{cur_smsg} or die 'BUG: {cur_smsg} missing';
+	my $self = $req->{self};
+	my $xref3 = $self->{oidx}->get_xref3($smsg->{num});
+	my @keep = grep(!/:$smsg->{blob}\z/, @$xref3);
+	if (@keep) {
+		$keep[0] =~ /:([a-f0-9]{40,}+)\z/ or
+			die "BUG: xref $keep[0] has no OID";
+		my $oidhex = $1;
+		$self->{oidx}->remove_xref3($smsg->{num}, $smsg->{blob});
+		my $upd = $self->{oidx}->update_blob($smsg, $oidhex);
+		my $saved = $self->{oidx}->get_art($smsg->{num});
+	} else {
+		$self->{oidx}->delete_by_num($smsg->{num});
+	}
+}
+
 sub ck_existing { # git->cat_async callback
 	my ($bref, $oid, $type, $size, $req) = @_;
 	my $smsg = $req->{cur_smsg} or die 'BUG: {cur_smsg} missing';
-	return if is_bad_blob($oid, $type, $size, $smsg->{blob});
-	my $cur = PublicInbox::Eml->new($bref);
-	if (content_hash($cur) eq $req->{chash}) {
-		push @{$req->{indexed}}, $smsg; # for do_xpost
-	} # else { index_unseen later }
+	if ($type eq 'missing') {
+		_blob_missing($req);
+	} elsif (!is_bad_blob($oid, $type, $size, $smsg->{blob})) {
+		my $cur = PublicInbox::Eml->new($bref);
+		if (content_hash($cur) eq $req->{chash}) {
+			push @{$req->{indexed}}, $smsg; # for do_xpost
+		} # else { index_unseen later }
+	}
 	do_step($req);
 }
 
@@ -281,15 +303,33 @@ sub eidx_sync { # main entry point
 	PublicInbox::V2Writable::done($self);
 }
 
-sub update_last_commit {
+sub update_last_commit { # overrides V2Writable
 	my ($self, $sync, $unit, $latest_cmt) = @_;
+	return unless defined $latest_cmt;
 
-	my $ALL = $self->git;
-	# while (scalar(@{$ALL->{inflight_c}}) || scalar(@{$ALL->{inflight}})) {
-		# $ALL->check_async_wait;
-		# $ALL->cat_async_wait;
-	# }
-	# TODO
+	$self->git->async_wait_all;
+	my $ibx = $sync->{ibx} or die 'BUG: {ibx} missing';
+	my $ekey = $ibx->eidx_key;
+	my $uv = $ibx->uidvalidity;
+	my $epoch = $unit->{epoch};
+	my $meta_key;
+	my $v = $ibx->version;
+	if ($v == 2) {
+		die 'No {epoch} for v2 unit' unless defined $epoch;
+		$meta_key = "lc-v2:$ekey//$uv;$epoch";
+	} elsif ($v == 1) {
+		die 'Unexpected {epoch} for v1 unit' if defined $epoch;
+		$meta_key = "lc-v1:$ekey//$uv";
+	} else {
+		die "Unsupported inbox version: $v";
+	}
+	my $last = $self->{oidx}->eidx_meta($meta_key);
+	if (defined $last && is_ancestor($unit->{git}, $last, $latest_cmt)) {
+		my @cmd = (qw(rev-list --count), "$last..$latest_cmt");
+		chomp(my $n = $unit->{git}->qx(@cmd));
+		return if $n ne '' && $n == 0;
+	}
+	$self->{oidx}->eidx_meta($meta_key, $latest_cmt);
 }
 
 sub idx_init { # similar to V2Writable
diff --git a/lib/PublicInbox/OverIdx.pm b/lib/PublicInbox/OverIdx.pm
index dff2780d..173e3220 100644
--- a/lib/PublicInbox/OverIdx.pm
+++ b/lib/PublicInbox/OverIdx.pm
@@ -261,6 +261,13 @@ sub subject_path ($) {
 	lc($subj);
 }
 
+sub ddd_for ($) {
+	my ($smsg) = @_;
+	my $dd = $smsg->to_doc_data;
+	utf8::encode($dd);
+	compress($dd);
+}
+
 sub add_overview {
 	my ($self, $eml, $smsg) = @_;
 	$smsg->{lines} = $eml->body_raw =~ tr!\n!\n!;
@@ -272,10 +279,7 @@ sub add_overview {
 		$xpath = subject_path($subj);
 		$xpath = id_compress($xpath);
 	}
-	my $dd = $smsg->to_doc_data;
-	utf8::encode($dd);
-	$dd = compress($dd);
-	add_over($self, $smsg, $mids, $refs, $xpath, $dd);
+	add_over($self, $smsg, $mids, $refs, $xpath, ddd_for($smsg));
 }
 
 sub _add_over {
@@ -589,14 +593,36 @@ INSERT OR IGNORE INTO xref3 (docid, ibx_id, xnum, oidbin) VALUES (?, ?, ?, ?)
 sub remove_xref3 {
 	my ($self, $docid, $oidhex, $eidx_key) = @_;
 	begin_lazy($self);
-	my $ibx_id = id_for($self, 'inboxes', 'ibx_id', eidx_key => $eidx_key);
 	my $oidbin = pack('H*', $oidhex);
-	my $sth = $self->{dbh}->prepare_cached(<<'');
+	my $sth;
+	if (defined $eidx_key) {
+		my $ibx_id = id_for($self, 'inboxes', 'ibx_id',
+					eidx_key => $eidx_key);
+		$sth = $self->{dbh}->prepare_cached(<<'');
 DELETE FROM xref3 WHERE docid = ? AND ibx_id = ? AND oidbin = ?
 
-	$sth->bind_param(1, $docid);
-	$sth->bind_param(2, $ibx_id);
-	$sth->bind_param(3, $oidbin, SQL_BLOB);
+		$sth->bind_param(1, $docid);
+		$sth->bind_param(2, $ibx_id);
+		$sth->bind_param(3, $oidbin, SQL_BLOB);
+	} else {
+		$sth = $self->{dbh}->prepare_cached(<<'');
+DELETE FROM xref3 WHERE docid = ? AND oidbin = ?
+
+		$sth->bind_param(1, $docid);
+		$sth->bind_param(2, $oidbin, SQL_BLOB);
+	}
+	$sth->execute;
+}
+
+# for when an xref3 goes missing, this does NOT update {ts}
+sub update_blob {
+	my ($self, $smsg, $oidhex) = @_;
+	my $sth = $self->{dbh}->prepare(<<'');
+UPDATE over SET ddd = ? WHERE num = ?
+
+	$smsg->{blob} = $oidhex;
+	$sth->bind_param(1, ddd_for($smsg), SQL_BLOB);
+	$sth->bind_param(2, $smsg->{num});
 	$sth->execute;
 }
 
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index efda7907..0364857f 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -567,7 +567,7 @@ sub last_epoch_commit ($$;$) {
 	$self->{mm}->last_commit_xap($v, $i, $cmt);
 }
 
-sub set_last_commits ($) {
+sub set_last_commits ($) { # this is NOT for ExtSearchIdx
 	my ($self) = @_;
 	defined(my $epoch_max = $self->{epoch_max}) or return;
 	my $last_commit = $self->{last_commit};
@@ -992,9 +992,10 @@ sub log_range ($$$) {
 
 	my $range = "$cur..$tip";
 	$pr->("$i.git checking contiguity... ") if $pr;
-	if (is_ancestor($unit->{git}, $cur, $tip)) { # common case
+	my $git = $unit->{git};
+	if (is_ancestor($git, $cur, $tip)) { # common case
 		$pr->("OK\n") if $pr;
-		my $n = $unit->{git}->qx(qw(rev-list --count), $range);
+		my $n = $git->qx(qw(rev-list --count), $range);
 		chomp($n);
 		if ($n == 0) {
 			$sync->{ranges}->[$i] = undef;
@@ -1006,9 +1007,9 @@ sub log_range ($$$) {
 		$pr->("FAIL\n") if $pr;
 		warn <<"";
 discontiguous range: $range
-Rewritten history? (in $unit->{git}->{git_dir})
+Rewritten history? (in $git->{git_dir})
 
-		chomp(my $base = $unit->{git}->qx('merge-base', $tip, $cur));
+		chomp(my $base = $git->qx('merge-base', $tip, $cur));
 		if ($base) {
 			$range = "$base..$tip";
 			warn "found merge-base: $base\n"
@@ -1017,10 +1018,17 @@ Rewritten history? (in $unit->{git}->{git_dir})
 			warn "discarding history at $cur\n";
 		}
 		warn <<"";
-reindexing $unit->{git}->{git_dir} starting at
-$range
-
-		$unit->{unindex_range} = "$base..$cur";
+reindexing $git->{git_dir}
+starting at $range
+
+		# $cur^0 may no longer exist if pruned by git
+		if ($git->qx(qw(rev-parse -q --verify), "$cur^0")) {
+			$unit->{unindex_range} = "$base..$cur";
+		} elsif ($base && $git->qx(qw(rev-parse -q --verify), $base)) {
+			$unit->{unindex_range} = "$base..";
+		} else {
+			warn "W: unable to unindex before $range\n";
+		}
 	}
 	$range;
 }
diff --git a/t/extsearch.t b/t/extsearch.t
index 108ffaeb..8d2c1507 100644
--- a/t/extsearch.t
+++ b/t/extsearch.t
@@ -35,8 +35,8 @@ run_script(['-index', "$home/v1test"]) or BAIL_OUT "index $?";
 
 ok(run_script([qw(-eindex --all), "$home/eindex"]), 'eindex init');
 
+my $es = PublicInbox::ExtSearch->new("$home/eindex");
 {
-	my $es = PublicInbox::ExtSearch->new("$home/eindex");
 	my $smsg = $es->over->get_art(1);
 	ok($smsg, 'got first article');
 	is($es->over->get_art(2), undef, 'only one added');
@@ -46,4 +46,30 @@ ok(run_script([qw(-eindex --all), "$home/eindex"]), 'eindex init');
 	is(scalar(@$xref3), 2, 'only to entries');
 }
 
+{
+	my ($in, $out, $err);
+	$in = $out = $err = '';
+	my $opt = { 0 => \$in, 1 => \$out, 2 => \$err };
+	my $env = { MAIL_EDITOR => "$^X -i -p -e 's/test message/BEST MSG/'" };
+	my $cmd = [ qw(-edit -Ft/utf8.eml), "$home/v2test" ];
+	ok(run_script($cmd, $env, $opt), '-edit');
+	ok(run_script([qw(-eindex --all), "$home/eindex"], undef, $opt),
+		'eindex again');
+	like($err, qr/discontiguous range/, 'warned about discontiguous range');
+	my $msg1 = $es->over->get_art(1) or BAIL_OUT 'msg1 missing';
+	my $msg2 = $es->over->get_art(2) or BAIL_OUT 'msg2 missing';
+	is($msg1->{mid}, $msg2->{mid}, 'edited message indexed');
+	isnt($msg1->{blob}, $msg2->{blob}, 'blobs differ');
+	my $eml2 = $es->smsg_eml($msg2);
+	like($eml2->body, qr/BEST MSG/, 'edited body in #2');
+	unlike($eml2->body, qr/test message/, 'old body discarded in #2');
+	my $eml1 = $es->smsg_eml($msg1);
+	like($eml1->body, qr/test message/, 'original body in #1');
+	my $x1 = $es->over->get_xref3(1);
+	my $x2 = $es->over->get_xref3(2);
+	is(scalar(@$x1), 1, 'original only has one xref3');
+	is(scalar(@$x2), 1, 'new message has one xref3');
+	isnt($x1->[0], $x2->[0], 'xref3 differs');
+}
+
 done_testing;
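
The selection step in _blob_missing above can be sketched on its own: drop every xref3 entry that points at the missing blob, then take the OID from the first survivor. This is only a sketch. The helper name pick_replacement_oid and the sample "key:num:oid" strings are assumptions for illustration; the patch itself only relies on each entry ending in ":<hex OID>".

```perl
use strict;
use warnings;

# Hypothetical helper, not part of public-inbox: given xref3 strings that
# end in ":<hex OID>", return the OID of the first entry which does not
# reference $bad_blob, or undef when no other entry remains.
sub pick_replacement_oid {
	my ($xref3, $bad_blob) = @_;
	my @keep = grep { !/:\Q$bad_blob\E\z/ } @$xref3;
	return undef unless @keep;
	$keep[0] =~ /:([a-f0-9]{40,})\z/ or
		die "BUG: xref $keep[0] has no OID";
	return $1;
}

my $old = 'a' x 40; # made-up OIDs for the example
my $new = 'b' x 40;
my $oid = pick_replacement_oid(["inbox.one:1:$old", "inbox.two:7:$new"], $old);
print "replacement: $oid\n";
```

When nothing survives the filter, the helper returns undef, which corresponds to the delete_by_num branch in the patch.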


* [PATCH 48/52] extsearch: wire up remaining Inbox-like methods for WWW
  2020-10-27  7:54 [PATCH 00/52] detached external index: mostly Eric Wong
                   ` (46 preceding siblings ...)
  2020-10-27  7:54 ` [PATCH 47/52] extsearchidx: handle edits Eric Wong
@ 2020-10-27  7:54 ` Eric Wong
  2020-10-27  7:54 ` [PATCH 49/52] searchidx: ignore exceptions from ->remove_term Eric Wong
                   ` (5 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: Eric Wong @ 2020-10-27  7:54 UTC (permalink / raw)
  To: meta

This lets us pretend an ExtSearch object is an Inbox object
in most of the existing WWW code.
---
 lib/PublicInbox/Config.pm    | 12 +++++++++
 lib/PublicInbox/ExtSearch.pm | 25 ++++++++++++++++++
 lib/PublicInbox/Inbox.pm     | 51 ++++++++++++++++++------------------
 lib/PublicInbox/WWW.pm       |  3 ++-
 4 files changed, 64 insertions(+), 27 deletions(-)

diff --git a/lib/PublicInbox/Config.pm b/lib/PublicInbox/Config.pm
index d57c361a..d425cc9b 100644
--- a/lib/PublicInbox/Config.pm
+++ b/lib/PublicInbox/Config.pm
@@ -89,6 +89,11 @@ sub lookup_name ($$) {
 	$self->{-by_name}->{$name} // _fill($self, "publicinbox.$name");
 }
 
+sub lookup_ei {
+	my ($self, $name) = @_;
+	$self->{-ei_by_name}->{$name} //= _fill_ei($self, "eindex.$name");
+}
+
 sub each_inbox {
 	my ($self, $cb, @arg) = @_;
 	# may auto-vivify if config file is non-existent:
@@ -457,6 +462,13 @@ EOF
 	$ibx
 }
 
+sub _fill_ei ($$) {
+	my ($self, $pfx) = @_;
+	require PublicInbox::ExtSearch;
+	my $d = $self->{"$pfx.topdir"};
+	defined($d) && -d $d ? PublicInbox::ExtSearch->new($d) : undef;
+}
+
 sub urlmatch {
 	my ($self, $key, $url) = @_;
 	state $urlmatch_broken; # requires git 1.8.5
diff --git a/lib/PublicInbox/ExtSearch.pm b/lib/PublicInbox/ExtSearch.pm
index 3e8ca82c..66c99eaa 100644
--- a/lib/PublicInbox/ExtSearch.pm
+++ b/lib/PublicInbox/ExtSearch.pm
@@ -3,6 +3,7 @@
 
 # Read-only external (detached) index for cross inbox search.
 # This is a read-only counterpart to PublicInbox::ExtSearchIdx
+# and behaves like PublicInbox::Inbox AND PublicInbox::Search
 package PublicInbox::ExtSearch;
 use strict;
 use v5.10.1;
@@ -21,6 +22,8 @@ sub new {
 	}, __PACKAGE__;
 }
 
+sub search { $_[0] } # self
+
 # overrides PublicInbox::Search::_xdb
 sub _xdb {
 	my ($self) = @_;
@@ -38,7 +41,29 @@ sub git {
 	$self->{git} //= PublicInbox::Git->new("$self->{topdir}/ALL.git");
 }
 
+sub mm { undef }
+
+sub altid_map { {} }
+
+sub description {
+	my ($self) = @_;
+	($self->{description} //=
+		PublicInbox::Inbox::cat_desc("$self->{topdir}/description")) //
+		'$EINDEX_DIR/description missing';
+}
+
+sub cloneurl { [] } # TODO
+
+sub base_url { 'https://example.com/TODO/' }
+sub nntp_url { [] }
+
 no warnings 'once';
 *smsg_eml = \&PublicInbox::Inbox::smsg_eml;
+*smsg_by_mid = \&PublicInbox::Inbox::smsg_by_mid;
+*msg_by_mid = \&PublicInbox::Inbox::msg_by_mid;
+*modified = \&PublicInbox::Inbox::modified;
+*recent = \&PublicInbox::Inbox::recent;
+
+*max_git_epoch = *nntp_usable = *msg_by_path = \&mm; # undef
 
 1;
diff --git a/lib/PublicInbox/Inbox.pm b/lib/PublicInbox/Inbox.pm
index cd5c098a..1d18cdf1 100644
--- a/lib/PublicInbox/Inbox.pm
+++ b/lib/PublicInbox/Inbox.pm
@@ -225,16 +225,19 @@ sub try_cat {
 	$rv;
 }
 
+sub cat_desc ($) {
+	my $desc = try_cat($_[0]);
+	local $/ = "\n";
+	chomp $desc;
+	utf8::decode($desc);
+	$desc =~ s/\s+/ /smg;
+	$desc eq '' ? undef : $desc;
+}
+
 sub description {
 	my ($self) = @_;
-	($self->{description} //= do {
-		my $desc = try_cat("$self->{inboxdir}/description");
-		local $/ = "\n";
-		chomp $desc;
-		utf8::decode($desc);
-		$desc =~ s/\s+/ /smg;
-		$desc eq '' ? undef : $desc;
-	}) // '($INBOX_DIR/description missing)';
+	($self->{description} //= cat_desc("$self->{inboxdir}/description")) //
+		'($INBOX_DIR/description missing)';
 }
 
 sub cloneurl {
@@ -342,39 +345,35 @@ sub smsg_eml {
 	$eml;
 }
 
-sub mid2num($$) {
-	my ($self, $mid) = @_;
-	my $mm = mm($self) or return;
-	$mm->num_for($mid);
-}
-
 sub smsg_by_mid ($$) {
 	my ($self, $mid) = @_;
-	my $over = over($self) or return;
-	# favor the Message-ID we used for the NNTP article number:
-	defined(my $num = mid2num($self, $mid)) or return;
-	my $smsg = $over->get_art($num) or return;
-	PublicInbox::Smsg::psgi_cull($smsg);
+	my $over = $self->over or return;
+	my $smsg;
+	if (my $mm = $self->mm) {
+		# favor the Message-ID we used for the NNTP article number:
+		defined(my $num = $mm->num_for($mid)) or return;
+		$smsg = $over->get_art($num);
+	} else {
+		my ($id, $prev);
+		$smsg = $over->next_by_mid($mid, \$id, \$prev);
+	}
+	$smsg ? PublicInbox::Smsg::psgi_cull($smsg) : undef;
 }
 
 sub msg_by_mid ($$) {
 	my ($self, $mid) = @_;
-
-	over($self) or
-		return msg_by_path($self, mid2path($mid));
-
 	my $smsg = smsg_by_mid($self, $mid);
-	$smsg ? msg_by_smsg($self, $smsg) : undef;
+	$smsg ? msg_by_smsg($self, $smsg) : msg_by_path($self, mid2path($mid));
 }
 
 sub recent {
 	my ($self, $opts, $after, $before) = @_;
-	over($self)->recent($opts, $after, $before);
+	$self->over->recent($opts, $after, $before);
 }
 
 sub modified {
 	my ($self) = @_;
-	if (my $over = over($self)) {
+	if (my $over = $self->over) {
 		my $msgs = $over->recent({limit => 1});
 		if (my $smsg = $msgs->[0]) {
 			return $smsg->{ts};
diff --git a/lib/PublicInbox/WWW.pm b/lib/PublicInbox/WWW.pm
index e3b589cb..cdbcff1e 100644
--- a/lib/PublicInbox/WWW.pm
+++ b/lib/PublicInbox/WWW.pm
@@ -210,7 +210,8 @@ sub news_cgit_fallback ($) {
 # returns undef if valid, array ref response if invalid
 sub invalid_inbox ($$) {
 	my ($ctx, $inbox) = @_;
-	my $ibx = $ctx->{www}->{pi_config}->lookup_name($inbox);
+	my $ibx = $ctx->{www}->{pi_config}->lookup_name($inbox) //
+			$ctx->{www}->{pi_config}->lookup_ei($inbox);
 	if (defined $ibx) {
 		$ctx->{-inbox} = $ibx;
 		return;
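
The switch from over($self) to $self->over in Inbox.pm is what makes the typeglob aliases in ExtSearch.pm work. Here is a minimal sketch of that idea, using made-up Demo::Inbox and Demo::ExtSearch packages and made-up paths rather than the real classes:

```perl
use strict;
use warnings;

package Demo::Inbox; # stand-in for PublicInbox::Inbox
sub new { my ($class, $dir) = @_; bless { inboxdir => $dir }, $class }
sub over { "$_[0]->{inboxdir}/over.sqlite3" }

# Calling $self->over (rather than over($self)) dispatches on the
# object's class, so the sub keeps working when aliased elsewhere.
sub where_is_over { my ($self) = @_; 'over at ' . $self->over }

package Demo::ExtSearch; # stand-in for PublicInbox::ExtSearch
sub new { my ($class, $dir) = @_; bless { topdir => $dir }, $class }
sub over { "$_[0]->{topdir}/over.sqlite3" }
{
	no warnings 'once';
	*where_is_over = \&Demo::Inbox::where_is_over; # borrow the method
}

package main;
print Demo::Inbox->new('/path/inbox')->where_is_over, "\n";
print Demo::ExtSearch->new('/path/eidx')->where_is_over, "\n";
```

Had where_is_over called over($self) as a plain function, the borrowed copy would still resolve to Demo::Inbox::over and look up the wrong hash key.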


* [PATCH 49/52] searchidx: ignore exceptions from ->remove_term
  2020-10-27  7:54 [PATCH 00/52] detached external index: mostly Eric Wong
                   ` (47 preceding siblings ...)
  2020-10-27  7:54 ` [PATCH 48/52] extsearch: wire up remaining Inbox-like methods for WWW Eric Wong
@ 2020-10-27  7:54 ` Eric Wong
  2020-10-27  7:54 ` [PATCH 50/52] extsearchidx: set current_info in warning callbacks Eric Wong
                   ` (4 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: Eric Wong @ 2020-10-27  7:54 UTC (permalink / raw)
  To: meta

This seems necessary for some cross-posted messages (and we did
it historically before we used over.sqlite3).
---
 lib/PublicInbox/SearchIdx.pm | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 569efbb0..06d1a9f5 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -468,11 +468,13 @@ sub remove_eidx_info {
 	my ($self, $docid, $oid, $eidx_key, $eml) = @_;
 	begin_txn_lazy($self);
 	my $doc = _get_doc($self, $docid, $oid) or return;
-	$doc->remove_term('O'.$eidx_key);
+	eval { $doc->remove_term('O'.$eidx_key) };
+	warn "W: ->remove_term O$eidx_key: $@\n" if $@;
 	for my $l ($eml->header_raw('List-Id')) {
 		$l =~ /<([^>]+)>/ or next;
 		my $lid = lc $1;
-		$doc->remove_term('G' . $lid);
+		eval { $doc->remove_term('G' . $lid) };
+		warn "W: ->remove_term G$lid: $@\n" if $@;
 
 		# nb: we don't remove the XL probabilistic terms
 		# since terms may overlap if cross-posted.
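
The eval-and-warn pattern used above can be tried in isolation. The sketch below assumes, as the patch implies, that ->remove_term may throw when a term is absent; Demo::Doc and remove_terms_leniently are hypothetical stand-ins, not Xapian or public-inbox code.

```perl
use strict;
use warnings;

package Demo::Doc; # hypothetical stand-in for a Xapian document
sub new { bless { terms => { map { $_ => 1 } @_[1..$#_] } }, $_[0] }
sub remove_term {
	my ($self, $term) = @_;
	delete $self->{terms}->{$term} or die "term not present: $term\n";
}

package main;

# Remove each term, downgrading failures to warnings; returns the
# number of terms which could not be removed.
sub remove_terms_leniently {
	my ($doc, @terms) = @_;
	my $failed = 0;
	for my $t (@terms) {
		eval { $doc->remove_term($t) };
		if ($@) {
			warn "W: ->remove_term $t: $@";
			$failed++;
		}
	}
	$failed;
}

my $doc = Demo::Doc->new('Oinbox.demo', 'Glist.example.org');
my $failed = remove_terms_leniently($doc, 'Oinbox.demo', 'Gmissing.example');
print "failed removals: $failed\n";
```

The loop carries on after a failed removal, so one missing term does not stop the remaining terms from being processed.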


* [PATCH 50/52] extsearchidx: set current_info in warning callbacks
  2020-10-27  7:54 [PATCH 00/52] detached external index: mostly Eric Wong
                   ` (48 preceding siblings ...)
  2020-10-27  7:54 ` [PATCH 49/52] searchidx: ignore exceptions from ->remove_term Eric Wong
@ 2020-10-27  7:54 ` Eric Wong
  2020-10-27  7:54 ` [PATCH 51/52] extsearchidx: support --batch-size checkpoints Eric Wong
                   ` (3 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: Eric Wong @ 2020-10-27  7:54 UTC (permalink / raw)
  To: meta

This bit duplicates the per-Inbox indexing code in Admin;
I'm undecided whether this is the right place for it.
---
 lib/PublicInbox/ExtSearchIdx.pm | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/lib/PublicInbox/ExtSearchIdx.pm b/lib/PublicInbox/ExtSearchIdx.pm
index bfe39891..050c4252 100644
--- a/lib/PublicInbox/ExtSearchIdx.pm
+++ b/lib/PublicInbox/ExtSearchIdx.pm
@@ -296,6 +296,11 @@ sub eidx_sync { # main entry point
 	$self->idx_init($opt); # acquire lock via V2Writable::_idx_init
 	$self->{oidx}->rethread_prepare($opt);
 
+	my $warn_cb = $SIG{__WARN__} || sub { print STDERR @_ };
+	local $self->{current_info} = '';
+	local $SIG{__WARN__} = sub {
+		$warn_cb->($self->{current_info}, ': ', @_);
+	};
 	_sync_inbox($self, $opt, $_) for (@{$self->{ibx_list}});
 
 	$self->{oidx}->rethread_done($opt);
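
The warning wrapper added above can be exercised standalone. In this sketch with_warn_prefix is a hypothetical helper name; the body mirrors the three added lines, chaining to whatever handler was installed before and relying on local to undo both changes on scope exit.

```perl
use strict;
use warnings;

# Hypothetical helper: run $code with every warning prefixed by $info,
# chaining to any handler that was installed before.
sub with_warn_prefix {
	my ($state, $info, $code) = @_;
	my $warn_cb = $SIG{__WARN__} || sub { print STDERR @_ };
	local $state->{current_info} = $info;
	local $SIG{__WARN__} = sub {
		$warn_cb->($state->{current_info}, ': ', @_);
	};
	$code->();
}

my $state = { current_info => '' };
with_warn_prefix($state, 'demo.inbox', sub { warn "something odd\n" });
# the previous (empty) value is back once the helper returns
print "restored\n" if $state->{current_info} eq '';
```

Since the handler reads $state->{current_info} at warning time, code running inside the scope can update the field and later warnings will pick up the new prefix.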


* [PATCH 51/52] extsearchidx: support --batch-size checkpoints
  2020-10-27  7:54 [PATCH 00/52] detached external index: mostly Eric Wong
                   ` (49 preceding siblings ...)
  2020-10-27  7:54 ` [PATCH 50/52] extsearchidx: set current_info in warning callbacks Eric Wong
@ 2020-10-27  7:54 ` Eric Wong
  2020-11-03  9:08   ` Eric Wong
  2020-10-27  7:54 ` [PATCH 52/52] searchidxshard: make warnings with eidx_key less confusing Eric Wong
                   ` (2 subsequent siblings)
  53 siblings, 1 reply; 57+ messages in thread
From: Eric Wong @ 2020-10-27  7:54 UTC (permalink / raw)
  To: meta

This is needed to limit the RSS of processes and ensure the
stored data in over.sqlite3 and Xapian DBs are consistent if
interrupted.  Without checkpoints, indexing lore causes shard
workers to take several GB of memory and thrash/OOM smaller
systems.
---
 lib/PublicInbox/ExtSearchIdx.pm | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/lib/PublicInbox/ExtSearchIdx.pm b/lib/PublicInbox/ExtSearchIdx.pm
index 050c4252..9d576adb 100644
--- a/lib/PublicInbox/ExtSearchIdx.pm
+++ b/lib/PublicInbox/ExtSearchIdx.pm
@@ -106,6 +106,18 @@ sub is_bad_blob ($$$$) {
 	$size == 0 ? 1 : 0; # size == 0 means purged
 }
 
+sub check_batch_limit ($) {
+	my ($req) = @_;
+	my $self = $req->{self};
+	my $new_smsg = $req->{new_smsg};
+
+	# {raw_bytes} may be unset, so just use {bytes}
+	my $n = $self->{transact_bytes} += $new_smsg->{bytes};
+
+	# set flag for PublicInbox::V2Writable::index_todo:
+	${$req->{need_checkpoint}} = 1 if $n >= $self->{batch_bytes};
+}
+
 sub do_xpost ($$) {
 	my ($req, $smsg) = @_;
 	my $self = $req->{self};
@@ -118,6 +130,7 @@ sub do_xpost ($$) {
 		my $xnum = $req->{xnum};
 		$self->{oidx}->add_xref3($docid, $xnum, $oid, $xibx->eidx_key);
 		$idx->shard_add_eidx_info($docid, $oid, $xibx, $eml);
+		check_batch_limit($req);
 	} else { # 'd'
 		$self->{oidx}->remove_xref3($docid, $oid, $xibx->eidx_key);
 		$idx->shard_remove_eidx_info($docid, $oid, $xibx, $eml);
@@ -141,6 +154,7 @@ sub index_unseen ($) {
 	my $ibx = delete $req->{ibx} or die 'BUG: {ibx} unset';
 	$self->{oidx}->add_xref3($docid, $req->{xnum}, $oid, $ibx->eidx_key);
 	$idx->index_raw(undef, $eml, $new_smsg, $ibx);
+	check_batch_limit($req);
 }
 
 sub do_finalize ($) {
@@ -301,7 +315,11 @@ sub eidx_sync { # main entry point
 	local $SIG{__WARN__} = sub {
 		$warn_cb->($self->{current_info}, ': ', @_);
 	};
-	_sync_inbox($self, $opt, $_) for (@{$self->{ibx_list}});
+
+	# don't use $_ here, it'll get clobbered by reindex_checkpoint
+	for my $ibx (@{$self->{ibx_list}}) {
+		_sync_inbox($self, $opt, $ibx);
+	}
 
 	$self->{oidx}->rethread_done($opt);
 


* [PATCH 52/52] searchidxshard: make warnings with eidx_key less confusing
  2020-10-27  7:54 [PATCH 00/52] detached external index: mostly Eric Wong
                   ` (50 preceding siblings ...)
  2020-10-27  7:54 ` [PATCH 51/52] extsearchidx: support --batch-size checkpoints Eric Wong
@ 2020-10-27  7:54 ` Eric Wong
  2020-10-27 12:08 ` [PATCH 00/52] detached external index: mostly Konstantin Ryabitsev
  2020-11-10 18:53 ` detached external index: performance note Eric Wong
  53 siblings, 0 replies; 57+ messages in thread
From: Eric Wong @ 2020-10-27  7:54 UTC (permalink / raw)
  To: meta

Seeing "Xorg.foo.bar" can be confusing in warnings if the
eidx_key is only "org.foo.bar" with no relation to "Xorg" at
all.  Furthermore, printing "\0" to log or terminal output isn't
very nice and could throw off some users/tools.
---
 lib/PublicInbox/SearchIdxShard.pm | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/lib/PublicInbox/SearchIdxShard.pm b/lib/PublicInbox/SearchIdxShard.pm
index 644d8b58..e194b7e0 100644
--- a/lib/PublicInbox/SearchIdxShard.pm
+++ b/lib/PublicInbox/SearchIdxShard.pm
@@ -87,8 +87,9 @@ sub shard_worker_loop ($$$$$) {
 		} else {
 			chomp $line;
 			my $eidx_key;
-			if ($line =~ s/\AX(.+)\0//) {
+			if ($line =~ s/\AX=(.+)\0//) {
 				$eidx_key = $1;
+				$v2w->{current_info} =~ s/\0/\\0/;
 			}
 			# n.b. $mid may contain spaces(!)
 			my ($len, $bytes, $num, $oid, $ds, $ts, $tid, $mid)
@@ -114,7 +115,7 @@ sub index_raw {
 	my ($self, $msgref, $eml, $smsg, $ibx) = @_;
 	if (my $w = $self->{w}) {
 		if ($ibx) {
-			print $w 'X', $ibx->eidx_key, "\0" or die
+			print $w 'X=', $ibx->eidx_key, "\0" or die
 				"failed to write shard: $!\n";
 		}
 		$msgref //= \($eml->as_string);
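
To illustrate the "X=" prefix handling above, here is a small sketch of parsing the prefix and of making NUL bytes printable. The helper names and the sample line are made up; only the regex and the "X=<eidx_key>\0" framing come from the patch.

```perl
use strict;
use warnings;

# Split an optional "X=<eidx_key>\0" prefix off a shard command line.
# Returns (eidx_key or undef, remainder of the line).
sub split_eidx_prefix {
	my ($line) = @_;
	my $eidx_key;
	$eidx_key = $1 if $line =~ s/\AX=(.+)\0//;
	($eidx_key, $line);
}

# Make NUL bytes visible before printing to a log or terminal
sub show_nul {
	my ($s) = @_;
	$s =~ s/\0/\\0/g;
	$s;
}

my $raw = "X=org.foo.bar\0rest of command";
my ($key, $rest) = split_eidx_prefix($raw);
print "key=$key rest=$rest\n";
print show_nul($raw), "\n";
```

With the "=" separator, a warning that echoes the raw line shows "X=org.foo.bar" instead of something that looks like a key named "Xorg.foo.bar".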


* Re: [PATCH 00/52] detached external index: mostly
  2020-10-27  7:54 [PATCH 00/52] detached external index: mostly Eric Wong
                   ` (51 preceding siblings ...)
  2020-10-27  7:54 ` [PATCH 52/52] searchidxshard: make warnings with eidx_key less confusing Eric Wong
@ 2020-10-27 12:08 ` Konstantin Ryabitsev
  2020-11-10 18:53 ` detached external index: performance note Eric Wong
  53 siblings, 0 replies; 57+ messages in thread
From: Konstantin Ryabitsev @ 2020-10-27 12:08 UTC (permalink / raw)
  To: Eric Wong; +Cc: meta

On Tue, Oct 27, 2020 at 07:54:01AM +0000, Eric Wong wrote:
> ...and mostly wired up for WWW, but requires manual config
> editing atm.  Needs docs and tests, and IMAP support.

Great progress! I look forward to reading the forthcoming docs. :)

-K


* Re: [PATCH 51/52] extsearchidx: support --batch-size checkpoints
  2020-10-27  7:54 ` [PATCH 51/52] extsearchidx: support --batch-size checkpoints Eric Wong
@ 2020-11-03  9:08   ` Eric Wong
  0 siblings, 0 replies; 57+ messages in thread
From: Eric Wong @ 2020-11-03  9:08 UTC (permalink / raw)
  To: meta

Eric Wong <e@80x24.org> wrote:
> This is needed to limit the RSS of processes and ensure the
> stored data in over.sqlite3 and Xapian DBs are consistent if
> interrupted.  Without checkpoints, indexing lore causes shard
> workers to take several GB of memory and thrash/OOM smaller
> systems.

Ugh, the ~30 hours in the cover letter was without this patch.
Using even a 100m batch size(*) makes lore/* take ~70 hours :<

(*) 100m works fine for lore/lkml and even >10m is diminishing
    returns...


* [PATCH v2] doc/standards: add RFCs for URL schemes
  2020-10-27  7:54 ` [PATCH 01/52] doc/standards: add RFCs for URL schemes Eric Wong
@ 2020-11-05  7:54   ` Eric Wong
  0 siblings, 0 replies; 57+ messages in thread
From: Eric Wong @ 2020-11-05  7:54 UTC (permalink / raw)
  To: meta

v2 to resolve conflict from 44227c2624e4f954943d632cd5335396351373be
("nntp: delimit Newsgroup: header with commas").

Still extremely unhappy with performance of this series...

------8<------
Subject: [PATCH] doc/standards: add RFCs for URL schemes

We linkify these in the WWW UI, and will support them in other
places.  These URL schemes may end up being stored in
external/detached indices for indexing non-git-based mail
stores.
---
 Documentation/standards.perl | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/Documentation/standards.perl b/Documentation/standards.perl
index 1c56830e..3ae64ddf 100755
--- a/Documentation/standards.perl
+++ b/Documentation/standards.perl
@@ -28,6 +28,9 @@ my $rfcs = [
 	1036 => 'Standard for Interchange of USENET Messages',
 	5536 => 'Netnews Article Format',
 	5537 => 'Netnews Architecture and Protocols',
+	1738 => 'Uniform resource locators',
+	5092 => 'IMAP URL scheme',
+	5538 => 'NNTP URI schemes',
 	6048 => 'NNTP additions to LIST command (TODO)',
 	8054 => 'NNTP compression',
 	4642 => 'NNTP TLS',


* detached external index: performance note
  2020-10-27  7:54 [PATCH 00/52] detached external index: mostly Eric Wong
                   ` (52 preceding siblings ...)
  2020-10-27 12:08 ` [PATCH 00/52] detached external index: mostly Konstantin Ryabitsev
@ 2020-11-10 18:53 ` Eric Wong
  53 siblings, 0 replies; 57+ messages in thread
From: Eric Wong @ 2020-11-10 18:53 UTC (permalink / raw)
  To: meta

Eric Wong <e@80x24.org> wrote:
> Not sure about the usability aspects, but I think this can
> replace the need for per-inbox Xapian DBs and save a truckload
> of disk space (and more importantly: cache space).  Per-inbox
> over.sqlite3 remains required for compatibility with NNTP/IMAP
> and existing WWW code.

Keeping v2 indexlevel=basic (*.sqlite3) and git repos on HDD
and putting -extindex on SSD seems to work reasonably well.
Xapian on HDD is really painful.

> Performance isn't great, it took 30+ hours to index my mirror of
> lore on a SATA SSD, but the entire index is <200GB due to
> deduplication between cross posts.

Still a problem for RAM-starved systems and Xapian :<

Larger systems can use --batch-size, and maybe Sys::Meminfo
can be used (if installed) to determine a larger batch-size
by default.

Fortunately, Sys::Meminfo is packaged for FreeBSD, CentOS and
Debian so it can be an optional dependency.
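
A rough sketch of how an optional Sys::Meminfo could feed a default batch size follows. Everything here is an assumption for illustration: the 1/64 heuristic, the clamp limits, the fallback value, and the expectation that Sys::Meminfo::totalmem() returns bytes (check the module's documentation before relying on it).

```perl
use strict;
use warnings;

# Hypothetical heuristic, not from public-inbox: use 1/64 of total RAM
# as the batch size, clamped to the range [1 MiB, 1 GiB].
sub batch_bytes_for {
	my ($total_bytes) = @_;
	my ($min, $max) = (1 << 20, 1 << 30);
	my $n = int($total_bytes / 64);
	$n < $min ? $min : $n > $max ? $max : $n;
}

# Sys::Meminfo is optional; totalmem() returning bytes is an assumption
# to verify against the module's documentation.
my $total = eval {
	require Sys::Meminfo;
	Sys::Meminfo::totalmem();
};
my $batch = defined($total) ? batch_bytes_for($total) : (1 << 20);
print "batch size: $batch bytes\n";
```

Wrapping the require in eval keeps the module optional: when it is not installed, the code falls back to a fixed size.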


Thread overview: 57+ messages
2020-10-27  7:54 [PATCH 00/52] detached external index: mostly Eric Wong
2020-10-27  7:54 ` [PATCH 01/52] doc/standards: add RFCs for URL schemes Eric Wong
2020-11-05  7:54   ` [PATCH v2] " Eric Wong
2020-10-27  7:54 ` [PATCH 02/52] search: hoist out _xdb_sharded for v2 inboxes Eric Wong
2020-10-27  7:54 ` [PATCH 03/52] extsearch: start mocking out Eric Wong
2020-10-27  7:54 ` [PATCH 04/52] searchidx: expose INDEXLEVELS as `our' Eric Wong
2020-10-27  7:54 ` [PATCH 05/52] v2writable: add git method Eric Wong
2020-10-27  7:54 ` [PATCH 06/52] v2writable: make OO calls to last_commit-related methods Eric Wong
2020-10-27  7:54 ` [PATCH 07/52] search: xdb_sharded: make this a public method for ExtSearch Eric Wong
2020-10-27  7:54 ` [PATCH 08/52] searchidx: introduce "xref3" concept Eric Wong
2020-10-27  7:54 ` [PATCH 09/52] v2writable: prepare initialization for external indices Eric Wong
2020-10-27  7:54 ` [PATCH 10/52] v2writable: hoist out write_alternates Eric Wong
2020-10-27  7:54 ` [PATCH 11/52] searchidxshard: allow msgref to be undef Eric Wong
2020-10-27  7:54 ` [PATCH 12/52] v2writable: idx_shard: simplify callers Eric Wong
2020-10-27  7:54 ` [PATCH 13/52] v2writable: count_shards: allow working without {ibx} Eric Wong
2020-10-27  7:54 ` [PATCH 14/52] overidx: introduce changes for external index Eric Wong
2020-10-27  7:54 ` [PATCH 15/52] v2: some changes for ExtSearchIdx compatibility Eric Wong
2020-10-27  7:54 ` [PATCH 16/52] inboxwritable: eidx_key for external index Eric Wong
2020-10-27  7:54 ` [PATCH 17/52] v2writable: rename remaining "remote" terminology Eric Wong
2020-10-27  7:54 ` [PATCH 18/52] v2writable: checkpoint: account for lack of {mm} Eric Wong
2020-10-27  7:54 ` [PATCH 19/52] extsearchidx: initial implementation Eric Wong
2020-10-27  7:54 ` [PATCH 20/52] searchidx: index eidx_key as a boolean term Eric Wong
2020-10-27  7:54 ` [PATCH 21/52] searchidx: xref3 delete support Eric Wong
2020-10-27  7:54 ` [PATCH 22/52] searchidxshard: special init for eidx Eric Wong
2020-10-27  7:54 ` [PATCH 23/52] searchidx: put {ibx} into $sync state Eric Wong
2020-10-27  7:54 ` [PATCH 24/52] searchidx: log2stack: simplify callers Eric Wong
2020-10-27  7:54 ` [PATCH 25/52] v2writable: more generic sync setup code Eric Wong
2020-10-27  7:54 ` [PATCH 26/52] v2writable: allow OO method references Eric Wong
2020-10-27  7:54 ` [PATCH 27/52] v2writable: rename {v2w} field to {self} Eric Wong
2020-10-27  7:54 ` [PATCH 28/52] v2writable: make *last_commits and sync_prepare OO methods Eric Wong
2020-10-27  7:54 ` [PATCH 29/52] v2writable: move size check init to sync_prepare Eric Wong
2020-10-27  7:54 ` [PATCH 30/52] extsearchidx: more compatibility with V2Writable callers Eric Wong
2020-10-27  7:54 ` [PATCH 31/52] v2writable: reduce scope of epoch-aware code Eric Wong
2020-10-27  7:54 ` [PATCH 32/52] extsearchidx: remove {unindex_range} field Eric Wong
2020-10-27  7:54 ` [PATCH 33/52] v2writable: pass oid to uindex_oid Eric Wong
2020-10-27  7:54 ` [PATCH 34/52] extsearchidx: sync unit updates Eric Wong
2020-10-27  7:54 ` [PATCH 35/52] searchidx: export prepare_stack Eric Wong
2020-10-27  7:54 ` [PATCH 36/52] extsearchidx: sync updates Eric Wong
2020-10-27  7:54 ` [PATCH 37/52] searchidx: reduce inbox-dependency, wrap ->with_umask Eric Wong
2020-10-27  7:54 ` [PATCH 38/52] searchidx: favor $sync->{ibx} (over $self->{ibx}) Eric Wong
2020-10-27  7:54 ` [PATCH 39/52] Makefile.PL: do not build manpage if POD is missing Eric Wong
2020-10-27  7:54 ` [PATCH 40/52] script: add preliminary eindex implementation Eric Wong
2020-10-27  7:54 ` [PATCH 41/52] index: eindex wiring Eric Wong
2020-10-27  7:54 ` [PATCH 42/52] over: store xref3 data in over.sqlite3 Eric Wong
2020-10-27  7:54 ` [PATCH 43/52] searchidx: remove xref3 support for Xapian Eric Wong
2020-10-27  7:54 ` [PATCH 44/52] t/extsearch.t: verify results and xref3 ordering Eric Wong
2020-10-27  7:54 ` [PATCH 45/52] t/v2writable: remove pointless ->barrier call Eric Wong
2020-10-27  7:54 ` [PATCH 46/52] extsearch: wire up smsg_eml Eric Wong
2020-10-27  7:54 ` [PATCH 47/52] extsearchidx: handle edits Eric Wong
2020-10-27  7:54 ` [PATCH 48/52] extsearch: wire up remaining Inbox-like methods for WWW Eric Wong
2020-10-27  7:54 ` [PATCH 49/52] searchidx: ignore exceptions from ->remove_term Eric Wong
2020-10-27  7:54 ` [PATCH 50/52] extsearchidx: set current_info in warning callbacks Eric Wong
2020-10-27  7:54 ` [PATCH 51/52] extsearchidx: support --batch-size checkpoints Eric Wong
2020-11-03  9:08   ` Eric Wong
2020-10-27  7:54 ` [PATCH 52/52] searchidxshard: make warnings with eidx_key less confusing Eric Wong
2020-10-27 12:08 ` [PATCH 00/52] detached external index: mostly Konstantin Ryabitsev
2020-11-10 18:53 ` detached external index: performance note Eric Wong

This inbox may be cloned and mirrored by anyone:

	git clone --mirror https://yhetil.org/meta

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V1 meta meta/ https://yhetil.org/meta \
		meta@public-inbox.org
	public-inbox-index meta

Newsgroups are available over NNTP:
	nntp://news.yhetil.org/yhetil.mail.public-inbox.meta
	nntp://news.public-inbox.org/inbox.mail.public-inbox.meta
	nntp://news.gmane.io/gmane.mail.public-inbox.general


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git