unofficial mirror of meta@public-inbox.org
* [PATCH 00/12] extindex: speed up manifest.js.gz generation
@ 2020-11-23  7:05 Eric Wong
  2020-11-23  7:05 ` [PATCH 01/12] miscsearch: a new Xapian sub-DB for extindex Eric Wong
                   ` (11 more replies)
  0 siblings, 12 replies; 13+ messages in thread
From: Eric Wong @ 2020-11-23  7:05 UTC
  To: meta

manifest.js.gz generation gets faster with this series
(~1000ms => ~40ms) on the current set of lore.kernel.org inboxes.

We may still need to rely on varnish to scale to 30-100K
inboxes, so that manifest.js.gz generation won't monopolize the
-httpd event loop.

WwwListing (HTML) output still needs to be updated and searching
for inboxes needs to be implemented along with pagination for
30-100K inboxes.

Eric Wong (12):
  miscsearch: a new Xapian sub-DB for extindex
  move JSON module portability into PublicInbox::Config
  git: add manifest_entry method
  manifest: use ibx->git_epoch method for v2
  inbox: git_epoch: remove ->version check
  miscidx: put grokmirror manifest entries in Xapian docdata
  extsearch: fix remaining "eindex" references
  miscidx: cleanup git processes after manifest indexing
  miscidx: store absolute git_dir of each epoch in docdata
  extsearchidx: do not short-circuit MiscIdx on no-op v2 prepare
  manifest: support faster generation via [extindex "all"]
  *search: simplify retry_reopen users

 MANIFEST                         |   3 +
 lib/PublicInbox/Config.pm        |  15 ++++
 lib/PublicInbox/ExtSearch.pm     |   8 +-
 lib/PublicInbox/ExtSearchIdx.pm  |  18 ++++-
 lib/PublicInbox/Git.pm           |  53 +++++++++++++
 lib/PublicInbox/Inbox.pm         |   6 +-
 lib/PublicInbox/InboxWritable.pm |   2 -
 lib/PublicInbox/ManifestJsGz.pm  | 108 +++++++++-----------------
 lib/PublicInbox/MiscIdx.pm       | 125 +++++++++++++++++++++++++++++++
 lib/PublicInbox/MiscSearch.pm    |  98 ++++++++++++++++++++++++
 lib/PublicInbox/Search.pm        |  18 ++---
 lib/PublicInbox/SearchIdx.pm     |   7 +-
 lib/PublicInbox/V2Writable.pm    |   5 ++
 script/public-inbox-extindex     |   1 +
 t/extsearch.t                    |  14 +++-
 t/miscsearch.t                   |  57 ++++++++++++++
 t/www_listing.t                  |   5 +-
 17 files changed, 446 insertions(+), 97 deletions(-)
 create mode 100644 lib/PublicInbox/MiscIdx.pm
 create mode 100644 lib/PublicInbox/MiscSearch.pm
 create mode 100644 t/miscsearch.t


* [PATCH 01/12] miscsearch: a new Xapian sub-DB for extindex
  2020-11-23  7:05 [PATCH 00/12] extindex: speed up manifest.js.gz generation Eric Wong
@ 2020-11-23  7:05 ` Eric Wong
  2020-11-23  7:05 ` [PATCH 02/12] move JSON module portability into PublicInbox::Config Eric Wong
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Eric Wong @ 2020-11-23  7:05 UTC
  To: meta

This will be used to index and search Inbox objects and perhaps
individual git repositories/epochs for grokmirror manifest.js.gz
generation.  There is no sharding planned for this at the moment
since inbox count should remain low (~100K to 1M) compared to
message count.

Folding this into the existing sharded DBs could be possible,
but would likely increase query and maintenance costs, as well
as development complexity.  So we'll use a few more inodes and
FDs at runtime, instead.
---
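A minimal, standalone sketch of how the %PROB_PREFIX table in
MiscSearch.pm (in the diff below) fans out into Xapian prefixes.  The
table is copied from the patch; prefix_pairs() is a hypothetical helper
for illustration only and does not exist in public-inbox.  No Xapian
bindings are needed to run it.

```perl
use strict;
use warnings;

# Same table as %PROB_PREFIX in the patch.  The empty key lists every
# term prefix that an unprefixed query word should be matched against.
my %PROB_PREFIX = (
	description => 'S',
	address => 'A',
	listid => 'XLISTID',
	url => 'XURL',
	infourl => 'XINFOURL',
	name => 'XNAME',
	'' => 'S A XLISTID XNAME XURL XINFOURL',
);

# Hypothetical helper (not part of the patch): expand the table into
# [ field_name, term_prefix ] pairs, one per add_prefix call.
sub prefix_pairs {
	my ($map) = @_;
	my @pairs;
	for my $name (sort keys %$map) {
		push @pairs, [ $name, $_ ] for split(/ /, $map->{$name});
	}
	@pairs;
}

my @pairs = prefix_pairs(\%PROB_PREFIX);
for my $p (@pairs) {
	my $field = $p->[0] eq '' ? '(default)' : $p->[0];
	printf("%-12s => %s\n", $field, $p->[1]);
}
```

With this mapping, a query such as "name:hope" is restricted to XNAME
terms, while a bare "hope" is tried against all six prefixes.
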
 MANIFEST                        |   3 +
 lib/PublicInbox/ExtSearch.pm    |   6 ++
 lib/PublicInbox/ExtSearchIdx.pm |  11 +++-
 lib/PublicInbox/MiscIdx.pm      | 107 ++++++++++++++++++++++++++++++++
 lib/PublicInbox/MiscSearch.pm   |  79 +++++++++++++++++++++++
 lib/PublicInbox/Search.pm       |   8 +--
 lib/PublicInbox/SearchIdx.pm    |   7 ++-
 lib/PublicInbox/V2Writable.pm   |   5 ++
 t/extsearch.t                   |   3 +
 t/miscsearch.t                  |  54 ++++++++++++++++
 10 files changed, 275 insertions(+), 8 deletions(-)
 create mode 100644 lib/PublicInbox/MiscIdx.pm
 create mode 100644 lib/PublicInbox/MiscSearch.pm
 create mode 100644 t/miscsearch.t

diff --git a/MANIFEST b/MANIFEST
index fc79a134..544ec5f9 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -166,6 +166,8 @@ lib/PublicInbox/MIME.pm
 lib/PublicInbox/ManifestJsGz.pm
 lib/PublicInbox/Mbox.pm
 lib/PublicInbox/MboxGz.pm
+lib/PublicInbox/MiscIdx.pm
+lib/PublicInbox/MiscSearch.pm
 lib/PublicInbox/MsgIter.pm
 lib/PublicInbox/MsgTime.pm
 lib/PublicInbox/Msgmap.pm
@@ -319,6 +321,7 @@ t/mda.t
 t/mda_filter_rubylang.t
 t/mid.t
 t/mime.t
+t/miscsearch.t
 t/msg_iter-nested.eml
 t/msg_iter-order.eml
 t/msg_iter.t
diff --git a/lib/PublicInbox/ExtSearch.pm b/lib/PublicInbox/ExtSearch.pm
index eb665027..c41ae443 100644
--- a/lib/PublicInbox/ExtSearch.pm
+++ b/lib/PublicInbox/ExtSearch.pm
@@ -10,6 +10,7 @@ use v5.10.1;
 use PublicInbox::Over;
 use PublicInbox::Inbox;
 use File::Spec ();
+use PublicInbox::MiscSearch;
 
 # for ->reopen, ->mset, ->mset_to_artnums
 use parent qw(PublicInbox::Search);
@@ -24,6 +25,11 @@ sub new {
 	}, __PACKAGE__;
 }
 
+sub misc {
+	my ($self) = @_;
+	$self->{misc} //= PublicInbox::MiscSearch->new("$self->{xpfx}/misc");
+}
+
 sub search { $_[0] } # self
 
 # overrides PublicInbox::Search::_xdb
diff --git a/lib/PublicInbox/ExtSearchIdx.pm b/lib/PublicInbox/ExtSearchIdx.pm
index 91434b26..708f8a3e 100644
--- a/lib/PublicInbox/ExtSearchIdx.pm
+++ b/lib/PublicInbox/ExtSearchIdx.pm
@@ -21,6 +21,7 @@ use Carp qw(croak carp);
 use PublicInbox::Search;
 use PublicInbox::SearchIdx qw(crlf_adjust prepare_stack is_ancestor);
 use PublicInbox::OverIdx;
+use PublicInbox::MiscIdx;
 use PublicInbox::MID qw(mids);
 use PublicInbox::V2Writable;
 use PublicInbox::InboxWritable;
@@ -309,6 +310,7 @@ sub _sync_inbox ($$$) {
 		return;
 	}
 	index_todo($self, $sync, $_) for @{delete($sync->{todo}) // []};
+	$self->{midx}->index_ibx($ibx);
 }
 
 sub eidx_sync { # main entry point
@@ -374,6 +376,12 @@ sub update_last_commit { # overrides V2Writable
 	$self->{oidx}->eidx_meta($meta_key, $latest_cmt);
 }
 
+sub _idx_init { # with_umask callback
+	my ($self, $opt) = @_;
+	PublicInbox::V2Writable::_idx_init($self, $opt);
+	$self->{midx} = PublicInbox::MiscIdx->new($self);
+}
+
 sub idx_init { # similar to V2Writable
 	my ($self, $opt) = @_;
 	return if $self->{idx_shards};
@@ -406,9 +414,10 @@ sub idx_init { # similar to V2Writable
 	}
 	$self->parallel_init($self->{indexlevel});
 	$self->umask_prepare;
-	$self->with_umask(\&PublicInbox::V2Writable::_idx_init, $self, $opt);
+	$self->with_umask(\&_idx_init, $self, $opt);
 	$self->{oidx}->begin_lazy;
 	$self->{oidx}->eidx_prep;
+	$self->{midx}->begin_txn;
 }
 
 no warnings 'once';
diff --git a/lib/PublicInbox/MiscIdx.pm b/lib/PublicInbox/MiscIdx.pm
new file mode 100644
index 00000000..edc70f9b
--- /dev/null
+++ b/lib/PublicInbox/MiscIdx.pm
@@ -0,0 +1,107 @@
+# Copyright (C) 2020 all contributors <meta@public-inbox.org>
+# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
+
+# like PublicInbox::SearchIdx, but for searching for non-mail messages.
+# Things indexed include:
+# * inboxes themselves
+# * epoch information
+# * (maybe) git code repository information
+# Expect ~100K-1M documents with no parallelism opportunities,
+# so no sharding, here.
+#
+# See MiscSearch for read-only counterpart
+package PublicInbox::MiscIdx;
+use strict;
+use v5.10.1;
+use PublicInbox::InboxWritable;
+use PublicInbox::Search; # for SWIG Xapian and Search::Xapian compat
+use PublicInbox::SearchIdx qw(index_text term_generator add_val);
+use PublicInbox::Spawn qw(nodatacow_dir);
+use Carp qw(croak);
+use File::Path ();
+use PublicInbox::MiscSearch;
+
+sub new {
+	my ($class, $eidx) = @_;
+	PublicInbox::SearchIdx::load_xapian_writable();
+	my $mi_dir = "$eidx->{xpfx}/misc";
+	File::Path::mkpath($mi_dir);
+	nodatacow_dir($mi_dir);
+	my $flags = $PublicInbox::SearchIdx::DB_CREATE_OR_OPEN;
+	$flags |= $PublicInbox::SearchIdx::DB_NO_SYNC if $eidx->{-no_fsync};
+	bless {
+		mi_dir => $mi_dir,
+		flags => $flags,
+		indexlevel => 'full', # small DB, no point in medium?
+	}, $class;
+}
+
+sub begin_txn {
+	my ($self) = @_;
+	croak 'BUG: already in txn' if $self->{xdb}; # XXX make lazy?
+	my $wdb = $PublicInbox::Search::X{WritableDatabase};
+	my $xdb = eval { $wdb->new($self->{mi_dir}, $self->{flags}) };
+	croak "Failed opening $self->{mi_dir}: $@" if $@;
+	$self->{xdb} = $xdb;
+	$xdb->begin_transaction;
+}
+
+sub commit_txn {
+	my ($self) = @_;
+	croak 'BUG: not in txn' unless $self->{xdb}; # XXX make lazy?
+	delete($self->{xdb})->commit_transaction;
+}
+
+sub index_ibx {
+	my ($self, $ibx) = @_;
+	my $eidx_key = $ibx->eidx_key;
+	my $xdb = $self->{xdb};
+	# Q = uniQue in Xapian terminology
+	my $head = $xdb->postlist_begin('Q'.$eidx_key);
+	my $tail = $xdb->postlist_end('Q'.$eidx_key);
+	my ($docid, @drop);
+	for (; $head != $tail; $head++) {
+		if (defined $docid) {
+			my $i = $head->get_docid;
+			push @drop, $i;
+			warn <<EOF;
+W: multiple inboxes keyed to `$eidx_key', deleting #$i
+EOF
+		} else {
+			$docid = $head->get_docid;
+		}
+	}
+	$xdb->delete_document($_) for @drop; # just in case
+
+	my $doc = $PublicInbox::Search::X{Document}->new;
+
+	# allow sorting by modified
+	add_val($doc, $PublicInbox::MiscSearch::MODIFIED, $ibx->modified);
+
+	$doc->add_boolean_term('Q'.$eidx_key);
+	$doc->add_boolean_term('T'.'inbox');
+	term_generator($self)->set_document($doc);
+
+	# description = S/Subject (or title)
+	# address = A/Author
+	index_text($self, $ibx->description, 1, 'S');
+	my %map = (
+		address => 'A',
+		listid => 'XLISTID',
+		infourl => 'XINFOURL',
+		url => 'XURL'
+	);
+	while (my ($f, $pfx) = each %map) {
+		for my $v (@{$ibx->{$f} // []}) {
+			index_text($self, $v, 1, $pfx);
+		}
+	}
+	index_text($self, $ibx->{name}, 1, 'XNAME');
+	if (defined $docid) {
+		$xdb->replace_document($docid, $doc);
+	} else {
+		$xdb->add_document($doc);
+	}
+}
+
+1;
diff --git a/lib/PublicInbox/MiscSearch.pm b/lib/PublicInbox/MiscSearch.pm
new file mode 100644
index 00000000..8beb8349
--- /dev/null
+++ b/lib/PublicInbox/MiscSearch.pm
@@ -0,0 +1,79 @@
+# Copyright (C) 2020 all contributors <meta@public-inbox.org>
+# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
+
+# read-only counterpart to MiscIdx
+package PublicInbox::MiscSearch;
+use strict;
+use v5.10.1;
+use PublicInbox::Search qw(retry_reopen);
+
+# Xapian value columns:
+our $MODIFIED = 0;
+
+# avoid conflicting with message Search::prob_prefix for UI/UX reasons
+my %PROB_PREFIX = (
+	description => 'S', # $INBOX_DIR/description
+	address => 'A',
+	listid => 'XLISTID',
+	url => 'XURL',
+	infourl => 'XINFOURL',
+	name => 'XNAME',
+	'' => 'S A XLISTID XNAME XURL XINFOURL'
+);
+
+sub new {
+	my ($class, $dir) = @_;
+	bless {
+		xdb => $PublicInbox::Search::X{Database}->new($dir)
+	}, $class;
+}
+
+# read-only
+sub mi_qp_new ($) {
+	my ($self) = @_;
+	my $xdb = $self->{xdb};
+	my $qp = $PublicInbox::Search::X{QueryParser}->new;
+	$qp->set_default_op(PublicInbox::Search::OP_AND());
+	$qp->set_database($xdb);
+	$qp->set_stemmer(PublicInbox::Search::stemmer($self));
+	$qp->set_stemming_strategy(PublicInbox::Search::STEM_SOME());
+	my $cb = $qp->can('set_max_wildcard_expansion') //
+		$qp->can('set_max_expansion'); # Xapian 1.5.0+
+	$cb->($qp, 100);
+	$cb = $qp->can('add_valuerangeprocessor') //
+		$qp->can('add_rangeprocessor'); # Xapian 1.5.0+
+	while (my ($name, $prefix) = each %PROB_PREFIX) {
+		$qp->add_prefix($name, $_) for split(/ /, $prefix);
+	}
+	$qp->add_boolean_prefix('type', 'T');
+	$qp;
+}
+
+sub misc_enquire_once { # retry_reopen callback
+	my ($self, $qr, $opt) = @{$_[0]};
+	my $eq = $PublicInbox::Search::X{Enquire}->new($self->{xdb});
+	$eq->set_query($qr);
+	my $desc = !$opt->{asc};
+	my $rel = $opt->{relevance} // 0;
+	if ($rel == -1) { # ORDER BY docid/UID
+		$eq->set_docid_order($PublicInbox::Search::ENQ_ASCENDING);
+		$eq->set_weighting_scheme($PublicInbox::Search::X{BoolWeight}->new);
+	} elsif ($rel) {
+		$eq->set_sort_by_relevance_then_value($MODIFIED, $desc);
+	} else {
+		$eq->set_sort_by_value_then_relevance($MODIFIED, $desc);
+	}
+	$eq->get_mset($opt->{offset} || 0, $opt->{limit} || 200);
+}
+
+sub mset {
+	my ($self, $qs, $opt) = @_;
+	$opt ||= {};
+	my $qp = $self->{qp} //= mi_qp_new($self);
+	$qs = 'type:inbox' if $qs eq '';
+	my $qr = $qp->parse_query($qs, $PublicInbox::Search::QP_FLAGS);
+	$opt->{relevance} = 1 unless exists $opt->{relevance};
+	retry_reopen($self, \&misc_enquire_once, [ $self, $qr, $opt ]);
+}
+
+1;
diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index 71417d5e..05d5a133 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -6,7 +6,7 @@
 package PublicInbox::Search;
 use strict;
 use parent qw(Exporter);
-our @EXPORT_OK = qw(mdocid);
+our @EXPORT_OK = qw(mdocid retry_reopen);
 use List::Util qw(max);
 
 # values for searching, changing the numeric value breaks
@@ -54,11 +54,11 @@ use constant {
 
 use PublicInbox::Smsg;
 use PublicInbox::Over;
-my $QP_FLAGS;
+our $QP_FLAGS;
 our %X = map { $_ => 0 } qw(BoolWeight Database Enquire QueryParser Stem);
 our $Xap; # 'Search::Xapian' or 'Xapian'
-my $NVRP; # '$Xap::'.('NumberValueRangeProcessor' or 'NumberRangeProcessor')
-my $ENQ_ASCENDING;
+our $NVRP; # '$Xap::'.('NumberValueRangeProcessor' or 'NumberRangeProcessor')
+our $ENQ_ASCENDING;
 
 sub load_xapian () {
 	return 1 if defined $Xap;
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 6ff2cf94..18390602 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -22,9 +22,10 @@ use PublicInbox::OverIdx;
 use PublicInbox::Spawn qw(spawn nodatacow_dir);
 use PublicInbox::Git qw(git_unquote);
 use PublicInbox::MsgTime qw(msg_timestamp msg_datestamp);
-our @EXPORT_OK = qw(crlf_adjust log2stack is_ancestor check_size prepare_stack);
+our @EXPORT_OK = qw(crlf_adjust log2stack is_ancestor check_size prepare_stack
+	index_text term_generator add_val);
 my $X = \%PublicInbox::Search::X;
-my ($DB_CREATE_OR_OPEN, $DB_OPEN);
+our ($DB_CREATE_OR_OPEN, $DB_OPEN);
 our $DB_NO_SYNC = 0;
 our $BATCH_BYTES = $ENV{XAPIAN_FLUSH_THRESHOLD} ? 0x7fffffff : 1_000_000;
 use constant DEBUG => !!$ENV{DEBUG};
@@ -154,7 +155,7 @@ sub term_generator ($) { # write-only
 
 	$self->{term_generator} //= do {
 		my $tg = $X->{TermGenerator}->new;
-		$tg->set_stemmer($self->stemmer);
+		$tg->set_stemmer(PublicInbox::Search::stemmer($self));
 		$tg;
 	}
 }
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index ba7cef13..afba0220 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -631,6 +631,9 @@ sub checkpoint ($;$) {
 			$_->shard_commit for @$shards;
 		}
 
+		my $midx = $self->{midx}; # misc index
+		$midx->commit_txn if $midx;
+
 		# last_commit is special, don't commit these until
 		# Xapian shards are done:
 		$dbh->begin_work if $dbh;
@@ -639,6 +642,7 @@ sub checkpoint ($;$) {
 			$dbh->commit;
 			$dbh->begin_work;
 		}
+		$midx->begin_txn if $midx;
 	}
 	$self->{total_bytes} += $self->{transact_bytes};
 	$self->{transact_bytes} = 0;
@@ -678,6 +682,7 @@ sub done {
 	}
 	eval { $self->{oidx}->dbh_close };
 	$err .= "over close: $@\n" if $@;
+	delete $self->{midx};
 	delete $self->{bnote};
 	my $nbytes = $self->{total_bytes};
 	$self->{total_bytes} = 0;
diff --git a/t/extsearch.t b/t/extsearch.t
index 8792fd9e..e28e2f71 100644
--- a/t/extsearch.t
+++ b/t/extsearch.t
@@ -72,4 +72,7 @@ my $es = PublicInbox::ExtSearch->new("$home/eindex");
 	isnt($x1->[0], $x2->[0], 'xref3 differs');
 }
 
+my $misc = $es->misc;
+is(scalar($misc->mset('')->items), 2, 'two inboxes');
+
 done_testing;
diff --git a/t/miscsearch.t b/t/miscsearch.t
new file mode 100644
index 00000000..45a19da9
--- /dev/null
+++ b/t/miscsearch.t
@@ -0,0 +1,54 @@
+#!perl -w
+# Copyright (C) 2020 all contributors <meta@public-inbox.org>
+# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
+use strict;
+use Test::More;
+use PublicInbox::TestCommon;
+use PublicInbox::InboxWritable;
+require_mods(qw(Search::Xapian DBD::SQLite));
+use_ok 'PublicInbox::MiscSearch';
+use_ok 'PublicInbox::MiscIdx';
+
+my ($tmp, $for_destroy) = tmpdir();
+my $eidx = { xpfx => "$tmp/eidx", -no_fsync => 1 }; # mock ExtSearchIdx
+{
+	mkdir "$tmp/v1" or BAIL_OUT "mkdir $!";
+	open my $fh, '>', "$tmp/v1/description" or BAIL_OUT "open: $!";
+	print $fh "Everything sucks this year\n" or BAIL_OUT "print $!";
+	close $fh or BAIL_OUT "close $!";
+}
+{
+	my $v1 = PublicInbox::InboxWritable->new({
+		inboxdir => "$tmp/v1",
+		name => 'hope',
+		address => [ 'nope@example.com' ],
+		indexlevel => 'basic',
+		version => 1,
+	});
+	$v1->init_inbox;
+	my $mi = PublicInbox::MiscIdx->new($eidx);
+	$mi->begin_txn;
+	$mi->index_ibx($v1);
+	$mi->commit_txn;
+}
+
+my $ms = PublicInbox::MiscSearch->new("$tmp/eidx/misc");
+my $mset = $ms->mset('"everything sucks today"');
+is(scalar($mset->items), 0, 'no match on description phrase');
+
+$mset = $ms->mset('"everything sucks this year"');
+is(scalar($mset->items), 1, 'match phrase on description');
+
+$mset = $ms->mset('everything sucks');
+is(scalar($mset->items), 1, 'match words in description');
+
+$mset = $ms->mset('nope@example.com');
+is(scalar($mset->items), 1, 'match full address');
+
+$mset = $ms->mset('nope');
+is(scalar($mset->items), 1, 'match partial address');
+
+$mset = $ms->mset('hope');
+is(scalar($mset->items), 1, 'match name');
+
+done_testing;


* [PATCH 02/12] move JSON module portability into PublicInbox::Config
  2020-11-23  7:05 [PATCH 00/12] extindex: speed up manifest.js.gz generation Eric Wong
  2020-11-23  7:05 ` [PATCH 01/12] miscsearch: a new Xapian sub-DB for extindex Eric Wong
@ 2020-11-23  7:05 ` Eric Wong
  2020-11-23  7:05 ` [PATCH 03/12] git: add manifest_entry method Eric Wong
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Eric Wong @ 2020-11-23  7:05 UTC
  To: meta

We'll be using JSON in MiscIdx and MiscSearch, and
PublicInbox::Config seems like an appropriate place to put it.
---
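A standalone sketch of the module-fallback loop this patch moves into
PublicInbox::Config::json.  The candidate list and the ->ascii(1) call
come from the diff below; the sub name json_codec and the sample string
are made up for this example.  JSON::PP ships with Perl 5.14 and later,
so the last candidate should load on most installs.

```perl
use strict;
use warnings;
use v5.10.1; # enables `state'

# Try the faster XS implementations first, fall back to pure Perl.
# The result is cached so the probing only happens once per process.
sub json_codec {
	state $json;
	$json //= do {
		my $found;
		for my $mod (qw(Cpanel::JSON::XS JSON::MaybeXS JSON JSON::PP)) {
			eval "require $mod" or next;
			# ->ascii(1) escapes non-ASCII characters as "\uXXXX"
			$found = $mod->new->ascii(1) and last;
		}
		$found; # undef if no JSON module could be loaded
	};
}

my $codec = json_codec() or die "no JSON module available\n";
my $out = $codec->encode([ "caf\x{e9}" ]);
print "$out\n";
```

Because every candidate exposes the same ->new->ascii->encode interface,
callers do not need to know which module was picked.
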
 lib/PublicInbox/Config.pm       | 12 ++++++++++++
 lib/PublicInbox/ManifestJsGz.pm |  8 ++------
 2 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/lib/PublicInbox/Config.pm b/lib/PublicInbox/Config.pm
index d2010f7a..039eb445 100644
--- a/lib/PublicInbox/Config.pm
+++ b/lib/PublicInbox/Config.pm
@@ -488,4 +488,16 @@ sub urlmatch {
 	}
 }
 
+sub json {
+	state $json;
+	$json //= do {
+		for my $mod (qw(Cpanel::JSON::XS JSON::MaybeXS JSON JSON::PP)) {
+			eval "require $mod" or next;
+			# ->ascii encodes non-ASCII to "\uXXXX"
+			$json = $mod->new->ascii(1) and last;
+		}
+		$json;
+	};
+}
+
 1;
diff --git a/lib/PublicInbox/ManifestJsGz.pm b/lib/PublicInbox/ManifestJsGz.pm
index 16d2a87c..ab1478af 100644
--- a/lib/PublicInbox/ManifestJsGz.pm
+++ b/lib/PublicInbox/ManifestJsGz.pm
@@ -10,17 +10,13 @@ use Digest::SHA ();
 use File::Spec ();
 use bytes (); # length
 use PublicInbox::Inbox;
+use PublicInbox::Config;
 use PublicInbox::Git;
 use IO::Compress::Gzip qw(gzip);
 use HTTP::Date qw(time2str);
 *try_cat = \&PublicInbox::Inbox::try_cat;
 
-our $json;
-for my $mod (qw(Cpanel::JSON::XS JSON::MaybeXS JSON JSON::PP)) {
-	eval "require $mod" or next;
-	# ->ascii encodes non-ASCII to "\uXXXX"
-	$json = $mod->new->ascii(1) and last;
-}
+our $json = PublicInbox::Config::json();
 
 # called by WwwListing
 sub url_regexp {


* [PATCH 03/12] git: add manifest_entry method
  2020-11-23  7:05 [PATCH 00/12] extindex: speed up manifest.js.gz generation Eric Wong
  2020-11-23  7:05 ` [PATCH 01/12] miscsearch: a new Xapian sub-DB for extindex Eric Wong
  2020-11-23  7:05 ` [PATCH 02/12] move JSON module portability into PublicInbox::Config Eric Wong
@ 2020-11-23  7:05 ` Eric Wong
  2020-11-23  7:05 ` [PATCH 04/12] manifest: use ibx->git_epoch method for v2 Eric Wong
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Eric Wong @ 2020-11-23  7:05 UTC
  To: meta

We'll be using this for MiscIdx and pre-generating the necessary
JSON for manifest.js.gz, so make it easier to share code for
generating per-repo JSON entries for grokmirror.
---
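A rough sketch of the data a single manifest entry carries and how it
could end up in a gzipped JSON blob.  The field names, the SHA-1
fingerprint over `git show-ref' output, and the description fallback
follow the diff below; the show-ref line, timestamp, URL path, and the
epoch_desc() helper are invented for illustration.  Only core modules
are used, so no git repository is required to run it.

```perl
use strict;
use warnings;
use Digest::SHA ();
use JSON::PP ();
use IO::Compress::Gzip qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

# Hypothetical helper mirroring the description fallback in
# manifest_entry: unnamed epoch repos get "<default> [epoch N]".
sub epoch_desc {
	my ($desc, $epoch, $default_desc) = @_;
	$desc = 'Unnamed repository' if $desc eq '';
	if (defined $epoch && $desc =~ /\AUnnamed repository/) {
		$desc = "$default_desc [epoch $epoch]";
	}
	$desc;
}

# Made-up `git show-ref' output; the real method reads it from a pipe.
my $show_ref = "0123456789abcdef0123456789abcdef01234567 refs/heads/master\n";

my $ent = {
	fingerprint => Digest::SHA->new(1)->add($show_ref)->hexdigest,
	reference => undef, # no alternates in this example
	owner => undef, # gitweb.owner unset
	modified => 1606115100, # placeholder timestamp
	description => epoch_desc('', 0, 'example inbox'),
};

# Hypothetical URL path used as the manifest key.
my $manifest = { '/example/git/0.git' => $ent };
my $js = JSON::PP->new->ascii(1)->canonical(1)->encode($manifest);

my ($gz, $round_trip);
gzip(\$js => \$gz) or die "gzip failed: $GzipError\n";
gunzip(\$gz => \$round_trip) or die "gunzip failed: $GunzipError\n";
print "$round_trip\n";
```

Keeping entry generation in one method means the same hash can be
served directly or pre-generated and stored for later use.
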
 lib/PublicInbox/Git.pm          | 53 +++++++++++++++++++++++++++++
 lib/PublicInbox/ManifestJsGz.pm | 59 ++-------------------------------
 t/www_listing.t                 |  5 ++-
 3 files changed, 58 insertions(+), 59 deletions(-)

diff --git a/lib/PublicInbox/Git.pm b/lib/PublicInbox/Git.pm
index 86343ac9..917fa4a1 100644
--- a/lib/PublicInbox/Git.pm
+++ b/lib/PublicInbox/Git.pm
@@ -14,10 +14,12 @@ use POSIX ();
 use IO::Handle; # ->autoflush
 use Errno qw(EINTR);
 use File::Glob qw(bsd_glob GLOB_NOSORT);
+use File::Spec ();
 use Time::HiRes qw(stat);
 use PublicInbox::Spawn qw(popen_rd);
 use PublicInbox::Tmpfile;
 use Carp qw(croak);
+use Digest::SHA ();
 our @EXPORT_OK = qw(git_unquote git_quote);
 our $PIPE_BUFSIZ = 65536; # Linux default
 our $in_cleanup;
@@ -475,6 +477,57 @@ sub modified ($) {
 	$modified || time;
 }
 
+# for grokmirror, which doesn't read gitweb.description
+# templates/hooks--update.sample and git-multimail in git.git
+# only match "Unnamed repository", not the full contents of
+# templates/this--description in git.git
+sub manifest_entry {
+	my ($self, $epoch, $default_desc) = @_;
+	my ($fh, $pid) = $self->popen('show-ref');
+	my $dig = Digest::SHA->new(1);
+	while (read($fh, my $buf, 65536)) {
+		$dig->add($buf);
+	}
+	close $fh;
+	waitpid($pid, 0);
+	return if $?; # empty, uninitialized git repo
+	my $git_dir = $self->{git_dir};
+	my $ent = {
+		fingerprint => $dig->hexdigest,
+		reference => undef,
+		modified => modified($self),
+	};
+	chomp(my $owner = $self->qx('config', 'gitweb.owner'));
+	utf8::decode($owner);
+	$ent->{owner} = $owner eq '' ? undef : $owner;
+	my $desc = '';
+	if (open($fh, '<', "$git_dir/description")) {
+		local $/ = "\n";
+		chomp($desc = <$fh>);
+		utf8::decode($desc);
+	}
+	$desc = 'Unnamed repository' if $desc eq '';
+	if (defined $epoch && $desc =~ /\AUnnamed repository/) {
+		$desc = "$default_desc [epoch $epoch]";
+	}
+	$ent->{description} = $desc;
+	if (open($fh, '<', "$git_dir/objects/info/alternates")) {
+		# n.b.: GitPython doesn't seem to handle comments or C-quoted
+		# strings like native git does; and we don't for now, either.
+		local $/ = "\n";
+		chomp(my @alt = <$fh>);
+
+		# grokmirror only supports 1 alternate for "reference",
+		if (scalar(@alt) == 1) {
+			my $objdir = "$git_dir/objects";
+			my $ref = File::Spec->rel2abs($alt[0], $objdir);
+			$ref =~ s!/[^/]+/?\z!!; # basename
+			$ent->{reference} = $ref;
+		}
+	}
+	$ent;
+}
+
 1;
 __END__
 =pod
diff --git a/lib/PublicInbox/ManifestJsGz.pm b/lib/PublicInbox/ManifestJsGz.pm
index ab1478af..3d8a38ae 100644
--- a/lib/PublicInbox/ManifestJsGz.pm
+++ b/lib/PublicInbox/ManifestJsGz.pm
@@ -6,15 +6,12 @@ package PublicInbox::ManifestJsGz;
 use strict;
 use v5.10.1;
 use parent qw(PublicInbox::WwwListing);
-use Digest::SHA ();
-use File::Spec ();
 use bytes (); # length
 use PublicInbox::Inbox;
 use PublicInbox::Config;
 use PublicInbox::Git;
 use IO::Compress::Gzip qw(gzip);
 use HTTP::Date qw(time2str);
-*try_cat = \&PublicInbox::Inbox::try_cat;
 
 our $json = PublicInbox::Config::json();
 
@@ -26,21 +23,6 @@ sub url_regexp {
 	$ctx->SUPER::url_regexp('publicInbox.grokManifest', 'match=domain');
 }
 
-sub fingerprint ($) {
-	my ($git) = @_;
-	# TODO: convert to qspawn for fairness when there's
-	# thousands of repos
-	my ($fh, $pid) = $git->popen('show-ref');
-	my $dig = Digest::SHA->new(1);
-	while (read($fh, my $buf, 65536)) {
-		$dig->add($buf);
-	}
-	close $fh;
-	waitpid($pid, 0);
-	return if $?; # empty, uninitialized git repo
-	$dig->hexdigest;
-}
-
 sub manifest_add ($$;$$) {
 	my ($ctx, $ibx, $epoch, $default_desc) = @_;
 	my $url_path = "/$ibx->{name}";
@@ -51,48 +33,13 @@ sub manifest_add ($$;$$) {
 	}
 	return unless -d $git_dir;
 	my $git = PublicInbox::Git->new($git_dir);
-	my $fingerprint = fingerprint($git) or return; # no empty repos
-
-	chomp(my $owner = $git->qx('config', 'gitweb.owner'));
-	chomp(my $desc = try_cat("$git_dir/description"));
-	utf8::decode($owner);
-	utf8::decode($desc);
-	$owner = undef if $owner eq '';
-	$desc = 'Unnamed repository' if $desc eq '';
-
-	# templates/hooks--update.sample and git-multimail in git.git
-	# only match "Unnamed repository", not the full contents of
-	# templates/this--description in git.git
-	if ($desc =~ /\AUnnamed repository/) {
-		$desc = "$default_desc [epoch $epoch]" if defined($epoch);
-	}
-
-	my $reference;
-	chomp(my $alt = try_cat("$git_dir/objects/info/alternates"));
-	if ($alt) {
-		# n.b.: GitPython doesn't seem to handle comments or C-quoted
-		# strings like native git does; and we don't for now, either.
-		my @alt = split(/\n+/, $alt);
-
-		# grokmirror only supports 1 alternate for "reference",
-		if (scalar(@alt) == 1) {
-			my $objdir = "$git_dir/objects";
-			$reference = File::Spec->rel2abs($alt[0], $objdir);
-			$reference =~ s!/[^/]+/?\z!!; # basename
-		}
-	}
+	my $ent = $git->manifest_entry($epoch, $default_desc) or return;
 	$ctx->{-abs2urlpath}->{$git_dir} = $url_path;
-	my $modified = $git->modified;
+	my $modified = $ent->{modified};
 	if ($modified > ($ctx->{-mtime} // 0)) {
 		$ctx->{-mtime} = $modified;
 	}
-	$ctx->{manifest}->{$url_path} = {
-		owner => $owner,
-		reference => $reference,
-		description => $desc,
-		modified => $modified,
-		fingerprint => $fingerprint,
-	};
+	$ctx->{manifest}->{$url_path} = $ent;
 }
 
 sub ibx_entry {
diff --git a/t/www_listing.t b/t/www_listing.t
index 4309a5e1..63613371 100644
--- a/t/www_listing.t
+++ b/t/www_listing.t
@@ -21,8 +21,7 @@ use_ok 'PublicInbox::Git';
 my ($tmpdir, $for_destroy) = tmpdir();
 my $bare = PublicInbox::Git->new("$tmpdir/bare.git");
 PublicInbox::Import::init_bare($bare->{git_dir});
-is(PublicInbox::ManifestJsGz::fingerprint($bare), undef,
-	'empty repo has no fingerprint');
+is($bare->manifest_entry, undef, 'empty repo has no manifest entry');
 {
 	my $fi_data = './t/git.fast-import-data';
 	open my $fh, '<', $fi_data or die "open $fi_data: $!";
@@ -31,7 +30,7 @@ is(PublicInbox::ManifestJsGz::fingerprint($bare), undef,
 		'fast-import');
 }
 
-like(PublicInbox::ManifestJsGz::fingerprint($bare), qr/\A[a-f0-9]{40}\z/,
+like($bare->manifest_entry->{fingerprint}, qr/\A[a-f0-9]{40}\z/,
 	'got fingerprint with non-empty repo');
 
 sub tiny_test {


* [PATCH 04/12] manifest: use ibx->git_epoch method for v2
  2020-11-23  7:05 [PATCH 00/12] extindex: speed up manifest.js.gz generation Eric Wong
                   ` (2 preceding siblings ...)
  2020-11-23  7:05 ` [PATCH 03/12] git: add manifest_entry method Eric Wong
@ 2020-11-23  7:05 ` Eric Wong
  2020-11-23  7:05 ` [PATCH 05/12] inbox: git_epoch: remove ->version check Eric Wong
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Eric Wong @ 2020-11-23  7:05 UTC
  To: meta

We can slightly reduce the amount of version-specific logic,
here.
---
 lib/PublicInbox/Inbox.pm        |  1 +
 lib/PublicInbox/ManifestJsGz.pm | 12 +++++-------
 2 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/lib/PublicInbox/Inbox.pm b/lib/PublicInbox/Inbox.pm
index 1d18cdf1..64b12345 100644
--- a/lib/PublicInbox/Inbox.pm
+++ b/lib/PublicInbox/Inbox.pm
@@ -137,6 +137,7 @@ sub git_epoch {
 	$self->version == 2 or return;
 	$self->{"$epoch.git"} ||= do {
 		my $git_dir = "$self->{inboxdir}/git/$epoch.git";
+		return unless -d $git_dir;
 		my $g = PublicInbox::Git->new($git_dir);
 		$g->{-httpbackend_limiter} = $self->{-httpbackend_limiter};
 		# no cleanup needed, we never cat-file off this, only clone
diff --git a/lib/PublicInbox/ManifestJsGz.pm b/lib/PublicInbox/ManifestJsGz.pm
index 3d8a38ae..3b436827 100644
--- a/lib/PublicInbox/ManifestJsGz.pm
+++ b/lib/PublicInbox/ManifestJsGz.pm
@@ -7,9 +7,7 @@ use strict;
 use v5.10.1;
 use parent qw(PublicInbox::WwwListing);
 use bytes (); # length
-use PublicInbox::Inbox;
 use PublicInbox::Config;
-use PublicInbox::Git;
 use IO::Compress::Gzip qw(gzip);
 use HTTP::Date qw(time2str);
 
@@ -26,15 +24,15 @@ sub url_regexp {
 sub manifest_add ($$;$$) {
 	my ($ctx, $ibx, $epoch, $default_desc) = @_;
 	my $url_path = "/$ibx->{name}";
-	my $git_dir = $ibx->{inboxdir};
+	my $git;
 	if (defined $epoch) {
-		$git_dir .= "/git/$epoch.git";
 		$url_path .= "/git/$epoch.git";
+		$git = $ibx->git_epoch($epoch) or return;
+	} else {
+		$git = $ibx->git;
 	}
-	return unless -d $git_dir;
-	my $git = PublicInbox::Git->new($git_dir);
 	my $ent = $git->manifest_entry($epoch, $default_desc) or return;
-	$ctx->{-abs2urlpath}->{$git_dir} = $url_path;
+	$ctx->{-abs2urlpath}->{$git->{git_dir}} = $url_path;
 	my $modified = $ent->{modified};
 	if ($modified > ($ctx->{-mtime} // 0)) {
 		$ctx->{-mtime} = $modified;


* [PATCH 05/12] inbox: git_epoch: remove ->version check
  2020-11-23  7:05 [PATCH 00/12] extindex: speed up manifest.js.gz generation Eric Wong
                   ` (3 preceding siblings ...)
  2020-11-23  7:05 ` [PATCH 04/12] manifest: use ibx->git_epoch method for v2 Eric Wong
@ 2020-11-23  7:05 ` Eric Wong
  2020-11-23  7:05 ` [PATCH 06/12] miscidx: put grokmirror manifest entries in Xapian docdata Eric Wong
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Eric Wong @ 2020-11-23  7:05 UTC
  To: meta

If $epoch is supplied to this method, there are already epochs,
and an extra method call for ->version is a pointless waste of
CPU cycles.
---
 lib/PublicInbox/Inbox.pm | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/lib/PublicInbox/Inbox.pm b/lib/PublicInbox/Inbox.pm
index 64b12345..a1a072ad 100644
--- a/lib/PublicInbox/Inbox.pm
+++ b/lib/PublicInbox/Inbox.pm
@@ -133,8 +133,7 @@ sub new {
 sub version { $_[0]->{version} // 1 }
 
 sub git_epoch {
-	my ($self, $epoch) = @_;
-	$self->version == 2 or return;
+	my ($self, $epoch) = @_; # v2-only, callers always supply $epoch
 	$self->{"$epoch.git"} ||= do {
 		my $git_dir = "$self->{inboxdir}/git/$epoch.git";
 		return unless -d $git_dir;


* [PATCH 06/12] miscidx: put grokmirror manifest entries in Xapian docdata
  2020-11-23  7:05 [PATCH 00/12] extindex: speed up manifest.js.gz generation Eric Wong
                   ` (4 preceding siblings ...)
  2020-11-23  7:05 ` [PATCH 05/12] inbox: git_epoch: remove ->version check Eric Wong
@ 2020-11-23  7:05 ` Eric Wong
  2020-11-23  7:05 ` [PATCH 07/12] extsearch: fix remaining "eindex" references Eric Wong
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Eric Wong @ 2020-11-23  7:05 UTC
  To: meta

This should make it possible for us to quickly generate
manifest.js.gz files with less random I/O and process
spawning in the WWW code.
---
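A hedged sketch of what a reader of this docdata might do: decode each
inbox's JSON blob and merge the entries into one manifest without
touching git.  This is an assumption about the consuming side, which is
not part of this patch; merge_docdata() and the sample entries are
hypothetical, and plain strings stand in for Xapian documents.

```perl
use strict;
use warnings;
use JSON::PP ();

my $json = JSON::PP->new->ascii(1)->canonical(1);

# Stand-ins for Xapian docdata, one JSON string per indexed inbox.
# Fingerprints and timestamps are made up.
my @docdata = (
	$json->encode({
		'/v2test/git/0.git' => { modified => 100, fingerprint => 'aa' },
		'/v2test/git/1.git' => { modified => 300, fingerprint => 'bb' },
	}),
	$json->encode({
		'/v1test' => { modified => 200, fingerprint => 'cc' },
	}),
);

# Hypothetical helper: merge all per-inbox entries into one hash and
# track the newest `modified' time (e.g. for a Last-Modified header).
sub merge_docdata {
	my ($codec, @blobs) = @_;
	my (%manifest, $mtime);
	for my $blob (@blobs) {
		my $ents = $codec->decode($blob);
		for my $path (keys %$ents) {
			my $ent = $manifest{$path} = $ents->{$path};
			my $m = $ent->{modified} // 0;
			$mtime = $m if !defined($mtime) || $m > $mtime;
		}
	}
	(\%manifest, $mtime);
}

my ($manifest, $mtime) = merge_docdata($json, @docdata);
print $json->encode($manifest), "\n";
print "newest: $mtime\n";
```

The point of the design is that the expensive part (spawning git and
reading repository files) happens once at index time, leaving only JSON
decoding for request time.
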
 lib/PublicInbox/MiscIdx.pm   | 15 +++++++++++++++
 script/public-inbox-extindex |  1 +
 t/extsearch.t                |  7 ++++++-
 t/miscsearch.t               |  3 +++
 4 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/lib/PublicInbox/MiscIdx.pm b/lib/PublicInbox/MiscIdx.pm
index edc70f9b..9dcc96b7 100644
--- a/lib/PublicInbox/MiscIdx.pm
+++ b/lib/PublicInbox/MiscIdx.pm
@@ -20,6 +20,7 @@ use PublicInbox::Spawn qw(nodatacow_dir);
 use Carp qw(croak);
 use File::Path ();
 use PublicInbox::MiscSearch;
+use PublicInbox::Config;
 
 sub new {
 	my ($class, $eidx) = @_;
@@ -97,6 +98,20 @@ EOF
 		}
 	}
 	index_text($self, $ibx->{name}, 1, 'XNAME');
+	my $data = {};
+	if (defined(my $max = $ibx->max_git_epoch)) { # v2
+		my $desc = $ibx->description;
+		my $pfx = "/$ibx->{name}/git/";
+		for my $epoch (0..$max) {
+			my $git = $ibx->git_epoch($epoch) or return;
+			if (my $ent = $git->manifest_entry($epoch, $desc)) {
+				$data->{"$pfx$epoch.git"} = $ent;
+			}
+		}
+	} elsif (my $ent = $ibx->git->manifest_entry) { # v1
+		$data->{"/$ibx->{name}"} = $ent;
+	}
+	$doc->set_data(PublicInbox::Config::json()->encode($data));
 	if (defined $docid) {
 		$xdb->replace_document($docid, $doc);
 	} else {
diff --git a/script/public-inbox-extindex b/script/public-inbox-extindex
index 78d6d9d9..20a0737c 100644
--- a/script/public-inbox-extindex
+++ b/script/public-inbox-extindex
@@ -38,6 +38,7 @@ require PublicInbox::Admin;
 my $cfg = PublicInbox::Config->new;
 my @ibxs = PublicInbox::Admin::resolve_inboxes(\@ARGV, $opt, $cfg);
 PublicInbox::Admin::require_or_die(qw(-search));
+PublicInbox::Config::json() or die "Cpanel::JSON::XS or similar missing\n";
 PublicInbox::Admin::progress_prepare($opt);
 my $env = PublicInbox::Admin::index_prepare($opt, $cfg);
 local %ENV = (%ENV, %$env) if $env;
diff --git a/t/extsearch.t b/t/extsearch.t
index e28e2f71..dc825bf4 100644
--- a/t/extsearch.t
+++ b/t/extsearch.t
@@ -4,7 +4,9 @@
 use strict;
 use Test::More;
 use PublicInbox::TestCommon;
+use PublicInbox::Config;
 use Fcntl qw(:seek);
+my $json = PublicInbox::Config::json() or plan skip_all => 'JSON missing';
 require_git(2.6);
 require_mods(qw(DBD::SQLite Search::Xapian));
 use_ok 'PublicInbox::ExtSearch';
@@ -73,6 +75,9 @@ my $es = PublicInbox::ExtSearch->new("$home/eindex");
 }
 
 my $misc = $es->misc;
-is(scalar($misc->mset('')->items), 2, 'two inboxes');
+my @it = $misc->mset('')->items;
+is(scalar(@it), 2, 'two inboxes');
+like($it[0]->get_document->get_data, qr/v2test/, 'docdata matched v2');
+like($it[1]->get_document->get_data, qr/v1test/, 'docdata matched v1');
 
 done_testing;
diff --git a/t/miscsearch.t b/t/miscsearch.t
index 45a19da9..0ba79194 100644
--- a/t/miscsearch.t
+++ b/t/miscsearch.t
@@ -50,5 +50,8 @@ is(scalar($mset->items), 1, 'match partial address');
 
 $mset = $ms->mset('hope');
 is(scalar($mset->items), 1, 'match name');
+my $mi = ($mset->items)[0];
+my $doc = $mi->get_document;
+is($doc->get_data, '{}', 'stored empty data');
 
 done_testing;


* [PATCH 07/12] extsearch: fix remaining "eindex" references
  2020-11-23  7:05 [PATCH 00/12] extindex: speed up manifest.js.gz generation Eric Wong
                   ` (5 preceding siblings ...)
  2020-11-23  7:05 ` [PATCH 06/12] miscidx: put grokmirror manifest entries in Xapian docdata Eric Wong
@ 2020-11-23  7:05 ` Eric Wong
  2020-11-23  7:05 ` [PATCH 08/12] miscidx: cleanup git processes after manifest indexing Eric Wong
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Eric Wong @ 2020-11-23  7:05 UTC (permalink / raw)
  To: meta

We'll replace "$EINDEX_DIR" with "$EXTINDEX_DIR" in a user-visible
line, and "eindex" with "extindex" in some hacker-only tests.

"eindex" is no longer used because it rhymes with "reindex",
so remove the last instance of it.

Fixes: 6b0fed3b03263ba2 ("extsearch: rename -eindex to -extindex")
---
 lib/PublicInbox/ExtSearch.pm | 2 +-
 t/extsearch.t                | 6 +++---
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/lib/PublicInbox/ExtSearch.pm b/lib/PublicInbox/ExtSearch.pm
index c41ae443..dd93cd32 100644
--- a/lib/PublicInbox/ExtSearch.pm
+++ b/lib/PublicInbox/ExtSearch.pm
@@ -57,7 +57,7 @@ sub description {
 	my ($self) = @_;
 	($self->{description} //=
 		PublicInbox::Inbox::cat_desc("$self->{topdir}/description")) //
-		'$EINDEX_DIR/description missing';
+		'$EXTINDEX_DIR/description missing';
 }
 
 sub cloneurl { [] } # TODO
diff --git a/t/extsearch.t b/t/extsearch.t
index dc825bf4..0045294b 100644
--- a/t/extsearch.t
+++ b/t/extsearch.t
@@ -35,9 +35,9 @@ seek($fh, 0, SEEK_SET) or BAIL_OUT $!;
 run_script(['-mda', '--no-precheck'], $env, { 0 => $fh }) or BAIL_OUT '-mda';
 run_script(['-index', "$home/v1test"]) or BAIL_OUT "index $?";
 
-ok(run_script([qw(-extindex --all), "$home/eindex"]), 'extindex init');
+ok(run_script([qw(-extindex --all), "$home/extindex"]), 'extindex init');
 
-my $es = PublicInbox::ExtSearch->new("$home/eindex");
+my $es = PublicInbox::ExtSearch->new("$home/extindex");
 {
 	my $smsg = $es->over->get_art(1);
 	ok($smsg, 'got first article');
@@ -55,7 +55,7 @@ my $es = PublicInbox::ExtSearch->new("$home/eindex");
 	my $env = { MAIL_EDITOR => "$^X -i -p -e 's/test message/BEST MSG/'" };
 	my $cmd = [ qw(-edit -Ft/utf8.eml), "$home/v2test" ];
 	ok(run_script($cmd, $env, $opt), '-edit');
-	ok(run_script([qw(-extindex --all), "$home/eindex"], undef, $opt),
+	ok(run_script([qw(-extindex --all), "$home/extindex"], undef, $opt),
 		'extindex again');
 	like($err, qr/discontiguous range/, 'warned about discontiguous range');
 	my $msg1 = $es->over->get_art(1) or BAIL_OUT 'msg1 missing';


* [PATCH 08/12] miscidx: cleanup git processes after manifest indexing
  2020-11-23  7:05 [PATCH 00/12] extindex: speed up manifest.js.gz generation Eric Wong
                   ` (6 preceding siblings ...)
  2020-11-23  7:05 ` [PATCH 07/12] extsearch: fix remaining "eindex" references Eric Wong
@ 2020-11-23  7:05 ` Eric Wong
  2020-11-23  7:05 ` [PATCH 09/12] miscidx: store absolute git_dir of each epoch in docdata Eric Wong
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Eric Wong @ 2020-11-23  7:05 UTC (permalink / raw)
  To: meta

We shouldn't leave "cat-file --batch" processes around when
we're done with an epoch or inbox, since there could be
many thousands.
---
 lib/PublicInbox/ExtSearchIdx.pm | 1 +
 lib/PublicInbox/MiscIdx.pm      | 1 +
 2 files changed, 2 insertions(+)
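
To illustrate why per-iteration cleanup matters, here is a toy model
rather than the real code.  Demo::Git is a hypothetical stand-in for
PublicInbox::Git: it bumps a counter where the real ->modified would
start a "cat-file --batch" process, so the sketch only tracks how
many helpers would be alive at once.

```perl
use strict;
use warnings;

# Hypothetical stand-in for PublicInbox::Git: "starting" a helper bumps
# a counter in place of spawning a real `git cat-file --batch' process.
package Demo::Git;
our $live = 0; # helpers currently running
our $peak = 0; # most helpers alive at any one time

sub new { return bless { started => 0 }, shift }

sub modified { # lazily starts the helper on first use
	my ($self) = @_;
	unless ($self->{started}) {
		$self->{started} = 1;
		$peak = $live if ++$live > $peak;
	}
	return time;
}

sub cleanup { # only stops a helper that was actually started
	my ($self) = @_;
	$live-- if delete $self->{started};
	return;
}

package main;

for my $epoch (0..999) {
	my $git = Demo::Git->new;
	$git->modified;
	$git->cleanup; # without this, $live would climb to 1000
}
print "live=$Demo::Git::live peak=$Demo::Git::peak\n";
```

With the cleanup call inside the loop, at most one helper exists at
any time in this model, regardless of the number of epochs.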

diff --git a/lib/PublicInbox/ExtSearchIdx.pm b/lib/PublicInbox/ExtSearchIdx.pm
index 708f8a3e..890ac282 100644
--- a/lib/PublicInbox/ExtSearchIdx.pm
+++ b/lib/PublicInbox/ExtSearchIdx.pm
@@ -311,6 +311,7 @@ sub _sync_inbox ($$$) {
 	}
 	index_todo($self, $sync, $_) for @{delete($sync->{todo}) // []};
 	$self->{midx}->index_ibx($ibx);
+	$ibx->git->cleanup; # done with this inbox, now
 }
 
 sub eidx_sync { # main entry point
diff --git a/lib/PublicInbox/MiscIdx.pm b/lib/PublicInbox/MiscIdx.pm
index 9dcc96b7..acb49ce7 100644
--- a/lib/PublicInbox/MiscIdx.pm
+++ b/lib/PublicInbox/MiscIdx.pm
@@ -107,6 +107,7 @@ EOF
 			if (my $ent = $git->manifest_entry($epoch, $desc)) {
 				$data->{"$pfx$epoch.git"} = $ent;
 			}
+			$git->cleanup; # ->modified starts cat-file --batch
 		}
 	} elsif (my $ent = $ibx->git->manifest_entry) { # v1
 		$data->{"/$ibx->{name}"} = $ent;


* [PATCH 09/12] miscidx: store absolute git_dir of each epoch in docdata
  2020-11-23  7:05 [PATCH 00/12] extindex: speed up manifest.js.gz generation Eric Wong
                   ` (7 preceding siblings ...)
  2020-11-23  7:05 ` [PATCH 08/12] miscidx: cleanup git processes after manifest indexing Eric Wong
@ 2020-11-23  7:05 ` Eric Wong
  2020-11-23  7:06 ` [PATCH 10/12] extsearchidx: do not short-circuit MiscIdx on no-op v2 prepare Eric Wong
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Eric Wong @ 2020-11-23  7:05 UTC (permalink / raw)
  To: meta

This will make it possible to map reference repos in case
somebody uses the feature.
---
 lib/PublicInbox/MiscIdx.pm | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/lib/PublicInbox/MiscIdx.pm b/lib/PublicInbox/MiscIdx.pm
index acb49ce7..642d920b 100644
--- a/lib/PublicInbox/MiscIdx.pm
+++ b/lib/PublicInbox/MiscIdx.pm
@@ -106,10 +106,12 @@ EOF
 			my $git = $ibx->git_epoch($epoch) or return;
 			if (my $ent = $git->manifest_entry($epoch, $desc)) {
 				$data->{"$pfx$epoch.git"} = $ent;
+				$ent->{git_dir} = $git->{git_dir};
 			}
 			$git->cleanup; # ->modified starts cat-file --batch
 		}
 	} elsif (my $ent = $ibx->git->manifest_entry) { # v1
+		$ent->{git_dir} = $ibx->{inboxdir};
 		$data->{"/$ibx->{name}"} = $ent;
 	}
 	$doc->set_data(PublicInbox::Config::json()->encode($data));


* [PATCH 10/12] extsearchidx: do not short-circuit MiscIdx on no-op v2 prepare
  2020-11-23  7:05 [PATCH 00/12] extindex: speed up manifest.js.gz generation Eric Wong
                   ` (8 preceding siblings ...)
  2020-11-23  7:05 ` [PATCH 09/12] miscidx: store absolute git_dir of each epoch in docdata Eric Wong
@ 2020-11-23  7:06 ` Eric Wong
  2020-11-23  7:06 ` [PATCH 11/12] manifest: support faster generation via [extindex "all"] Eric Wong
  2020-11-23  7:06 ` [PATCH 12/12] *search: simplify retry_reopen users Eric Wong
  11 siblings, 0 replies; 13+ messages in thread
From: Eric Wong @ 2020-11-23  7:06 UTC (permalink / raw)
  To: meta

This was intended to make development easier, but it also allows
description, URL, and address changes to be picked up
independently of message history.
---
 lib/PublicInbox/ExtSearchIdx.pm | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/lib/PublicInbox/ExtSearchIdx.pm b/lib/PublicInbox/ExtSearchIdx.pm
index 890ac282..2cdc31cb 100644
--- a/lib/PublicInbox/ExtSearchIdx.pm
+++ b/lib/PublicInbox/ExtSearchIdx.pm
@@ -298,7 +298,7 @@ sub _sync_inbox ($$$) {
 		my $epoch_max;
 		defined($ibx->git_dir_latest(\$epoch_max)) or return;
 		$sync->{epoch_max} = $epoch_max;
-		sync_prepare($self, $sync) or return; # fills $sync->{todo}
+		sync_prepare($self, $sync); # or return # TODO: once MiscIdx is stable
 	} elsif ($v == 1) {
 		my $uv = $ibx->uidvalidity;
 		my $lc = $self->{oidx}->eidx_meta("lc-v1:$ekey//$uv");
@@ -309,8 +309,10 @@ sub _sync_inbox ($$$) {
 		warn "E: $ekey unsupported inbox version (v$v)\n";
 		return;
 	}
-	index_todo($self, $sync, $_) for @{delete($sync->{todo}) // []};
-	$self->{midx}->index_ibx($ibx);
+	unless ($sync->{quit}) {
+		index_todo($self, $sync, $_) for @{delete($sync->{todo}) // []};
+		$self->{midx}->index_ibx($ibx) unless $sync->{quit};
+	}
 	$ibx->git->cleanup; # done with this inbox, now
 }
 


* [PATCH 11/12] manifest: support faster generation via [extindex "all"]
  2020-11-23  7:05 [PATCH 00/12] extindex: speed up manifest.js.gz generation Eric Wong
                   ` (9 preceding siblings ...)
  2020-11-23  7:06 ` [PATCH 10/12] extsearchidx: do not short-circuit MiscIdx on no-op v2 prepare Eric Wong
@ 2020-11-23  7:06 ` Eric Wong
  2020-11-23  7:06 ` [PATCH 12/12] *search: simplify retry_reopen users Eric Wong
  11 siblings, 0 replies; 13+ messages in thread
From: Eric Wong @ 2020-11-23  7:06 UTC (permalink / raw)
  To: meta

For a mirror of lore.kernel.org with >140 inboxes, this speeds
up manifest.js.gz generation from ~1s to 40ms on my HW.  This
is still unacceptable when dealing with thousands of inboxes,
but gets us closer to where we need to be.
---
 lib/PublicInbox/Config.pm        |  3 +++
 lib/PublicInbox/Inbox.pm         |  2 ++
 lib/PublicInbox/InboxWritable.pm |  2 --
 lib/PublicInbox/ManifestJsGz.pm  | 39 ++++++++++++++++++++++++++------
 lib/PublicInbox/MiscSearch.pm    | 19 ++++++++++++++++
 5 files changed, 56 insertions(+), 9 deletions(-)
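
A standalone sketch of the fast path, assuming docdata shaped like
what MiscIdx stores.  inject_entry follows the helper in the diff
below; the docdata contents (paths, mtimes) are invented, and
JSON::PP stands in for the configured JSON module.  No Xapian
lookup is performed here.

```perl
use strict;
use warnings;
use JSON::PP; # core module, stands in for PublicInbox::Config::json()

# Record one manifest entry, following inject_entry in the diff.
sub inject_entry {
	my ($ctx, $url_path, $ent, $git_dir) = @_;
	# a stored git_dir is deleted from $ent, so it is not published
	$ctx->{-abs2urlpath}->{$git_dir // delete $ent->{git_dir}} = $url_path;
	my $modified = $ent->{modified};
	$ctx->{-mtime} = $modified if $modified > ($ctx->{-mtime} // 0);
	$ctx->{manifest}->{$url_path} = $ent;
	return;
}

# Hypothetical docdata for one v2 inbox; paths and times are invented.
my $docdata = JSON::PP->new->canonical->encode({
	'/v2test/git/0.git' =>
		{ modified => 100, git_dir => '/srv/v2test/git/0.git' },
	'/v2test/git/1.git' =>
		{ modified => 250, git_dir => '/srv/v2test/git/1.git' },
});

# Fast path: decode the stored JSON and inject each entry, with no
# git processes or per-epoch filesystem access involved.
my $ctx = {};
my $data = JSON::PP->new->decode($docdata);
for my $url_path (sort keys %$data) {
	inject_entry($ctx, $url_path, $data->{$url_path});
}
print 'newest mtime: ', $ctx->{-mtime}, "\n";
```

The slow path passes $git->{git_dir} as the fourth argument instead,
so both paths fill the same $ctx fields.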

diff --git a/lib/PublicInbox/Config.pm b/lib/PublicInbox/Config.pm
index 039eb445..251008a3 100644
--- a/lib/PublicInbox/Config.pm
+++ b/lib/PublicInbox/Config.pm
@@ -94,6 +94,9 @@ sub lookup_ei {
 	$self->{-ei_by_name}->{$name} //= _fill_ei($self, "extindex.$name");
 }
 
+# special case for [extindex "all"]
+sub ALL { lookup_ei($_[0], 'all') }
+
 sub each_inbox {
 	my ($self, $cb, @arg) = @_;
 	# may auto-vivify if config file is non-existent:
diff --git a/lib/PublicInbox/Inbox.pm b/lib/PublicInbox/Inbox.pm
index a1a072ad..5a22e40d 100644
--- a/lib/PublicInbox/Inbox.pm
+++ b/lib/PublicInbox/Inbox.pm
@@ -429,4 +429,6 @@ sub on_unlock {
 
 sub uidvalidity  { $_[0]->{uidvalidity} //= $_[0]->mm->created_at }
 
+sub eidx_key { $_[0]->{newsgroup} // $_[0]->{inboxdir} }
+
 1;
diff --git a/lib/PublicInbox/InboxWritable.pm b/lib/PublicInbox/InboxWritable.pm
index d3c255c7..e97c7e2d 100644
--- a/lib/PublicInbox/InboxWritable.pm
+++ b/lib/PublicInbox/InboxWritable.pm
@@ -319,6 +319,4 @@ sub git_dir_latest {
 	$latest;
 }
 
-sub eidx_key { $_[0]->{newsgroup} // $_[0]->{inboxdir} }
-
 1;
diff --git a/lib/PublicInbox/ManifestJsGz.pm b/lib/PublicInbox/ManifestJsGz.pm
index 3b436827..2c4a231d 100644
--- a/lib/PublicInbox/ManifestJsGz.pm
+++ b/lib/PublicInbox/ManifestJsGz.pm
@@ -21,6 +21,14 @@ sub url_regexp {
 	$ctx->SUPER::url_regexp('publicInbox.grokManifest', 'match=domain');
 }
 
+sub inject_entry ($$$;$) {
+	my ($ctx, $url_path, $ent, $git_dir) = @_;
+	$ctx->{-abs2urlpath}->{$git_dir // delete $ent->{git_dir}} = $url_path;
+	my $modified = $ent->{modified};
+	$ctx->{-mtime} = $modified if $modified > ($ctx->{-mtime} // 0);
+	$ctx->{manifest}->{$url_path} = $ent;
+}
+
 sub manifest_add ($$;$$) {
 	my ($ctx, $ibx, $epoch, $default_desc) = @_;
 	my $url_path = "/$ibx->{name}";
@@ -32,15 +40,10 @@ sub manifest_add ($$;$$) {
 		$git = $ibx->git;
 	}
 	my $ent = $git->manifest_entry($epoch, $default_desc) or return;
-	$ctx->{-abs2urlpath}->{$git->{git_dir}} = $url_path;
-	my $modified = $ent->{modified};
-	if ($modified > ($ctx->{-mtime} // 0)) {
-		$ctx->{-mtime} = $modified;
-	}
-	$ctx->{manifest}->{$url_path} = $ent;
+	inject_entry($ctx, $url_path, $ent, $git->{git_dir});
 }
 
-sub ibx_entry {
+sub slow_manifest_add ($$) {
 	my ($ctx, $ibx) = @_;
 	eval {
 		if (defined(my $max = $ibx->max_git_epoch)) {
@@ -52,6 +55,28 @@ sub ibx_entry {
 			manifest_add($ctx, $ibx);
 		}
 	};
+}
+
+sub eidx_manifest_add ($$$) {
+	my ($ctx, $ALL, $ibx) = @_;
+	if (my $data = $ALL->misc->inbox_data($ibx)) {
+		$data = $json->decode($data);
+		while (my ($url_path, $ent) = each %$data) {
+			inject_entry($ctx, $url_path, $ent);
+		}
+	} else {
+		warn "E: `${\$ibx->eidx_key}' not indexed by $ALL->{topdir}\n";
+	}
+}
+
+sub ibx_entry {
+	my ($ctx, $ibx) = @_;
+	my $ALL = $ctx->{www}->{pi_config}->ALL;
+	if ($ALL) {
+		eidx_manifest_add($ctx, $ALL, $ibx);
+	} else {
+		slow_manifest_add($ctx, $ibx);
+	}
 	warn "E: $@" if $@;
 }
 
diff --git a/lib/PublicInbox/MiscSearch.pm b/lib/PublicInbox/MiscSearch.pm
index 8beb8349..5a44d751 100644
--- a/lib/PublicInbox/MiscSearch.pm
+++ b/lib/PublicInbox/MiscSearch.pm
@@ -76,4 +76,23 @@ sub mset {
 	retry_reopen($self, \&misc_enquire_once, [ $self, $qr, $opt ]);
 }
 
+sub ibx_data_once {
+	my ($self, $ibx) = @{$_[0]};
+	my $xdb = $self->{xdb};
+	my $eidx_key = $ibx->eidx_key; # may be {inboxdir}, so private
+	my $head = $xdb->postlist_begin('Q'.$eidx_key);
+	my $tail = $xdb->postlist_end('Q'.$eidx_key);
+	if ($head != $tail) {
+		my $doc = $xdb->get_document($head->get_docid);
+		$doc->get_data;
+	} else {
+		undef;
+	}
+}
+
+sub inbox_data {
+	my ($self, $ibx) = @_;
+	retry_reopen($self, \&ibx_data_once, [ $self, $ibx ]);
+}
+
 1;


* [PATCH 12/12] *search: simplify retry_reopen users
  2020-11-23  7:05 [PATCH 00/12] extindex: speed up manifest.js.gz generation Eric Wong
                   ` (10 preceding siblings ...)
  2020-11-23  7:06 ` [PATCH 11/12] manifest: support faster generation via [extindex "all"] Eric Wong
@ 2020-11-23  7:06 ` Eric Wong
  11 siblings, 0 replies; 13+ messages in thread
From: Eric Wong @ 2020-11-23  7:06 UTC (permalink / raw)
  To: meta

Every callback uses `$self', and creating short-lived
array references is not necessary when it's just as
easy to copy the array in Perl (unlike C).
---
 lib/PublicInbox/MiscSearch.pm |  8 ++++----
 lib/PublicInbox/Search.pm     | 10 +++++-----
 2 files changed, 9 insertions(+), 9 deletions(-)
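
The new calling convention can be tried outside of Xapian with a
simplified loop.  retry_sketch and flaky_once are hypothetical names;
the real retry_reopen also reopens the DB and checks the exception
type, which this sketch leaves out.

```perl
use strict;
use warnings;

# Simplified retry loop: the callback is invoked as $cb->($self, @arg),
# so callers pass a flat list instead of building an array reference.
sub retry_sketch {
	my ($self, $cb, @arg) = @_;
	for my $i (1..10) {
		if (wantarray) {
			my @ret;
			eval { @ret = $cb->($self, @arg) };
			return @ret unless $@;
		} else {
			my $ret;
			eval { $ret = $cb->($self, @arg) };
			return $ret unless $@;
		}
		$self->{retries}++; # the real code would reopen the DB here
	}
	die "too many retries: $@";
}

# Callback unpacks @_ directly; it fails twice to simulate a DB that
# was modified while being read.
sub flaky_once {
	my ($self, $x, $y) = @_;
	die "simulated stale revision\n" if $self->{calls}++ < 2;
	return $x + $y;
}

my $self = { calls => 0, retries => 0 };
my $sum = retry_sketch($self, \&flaky_once, 2, 3);
print "sum=$sum after $self->{retries} retries\n";
```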

diff --git a/lib/PublicInbox/MiscSearch.pm b/lib/PublicInbox/MiscSearch.pm
index 5a44d751..48ef6914 100644
--- a/lib/PublicInbox/MiscSearch.pm
+++ b/lib/PublicInbox/MiscSearch.pm
@@ -50,7 +50,7 @@ sub mi_qp_new ($) {
 }
 
 sub misc_enquire_once { # retry_reopen callback
-	my ($self, $qr, $opt) = @{$_[0]};
+	my ($self, $qr, $opt) = @_;
 	my $eq = $PublicInbox::Search::X{Enquire}->new($self->{xdb});
 	$eq->set_query($qr);
         my $desc = !$opt->{asc};
@@ -73,11 +73,11 @@ sub mset {
 	$qs = 'type:inbox' if $qs eq '';
 	my $qr = $qp->parse_query($qs, $PublicInbox::Search::QP_FLAGS);
 	$opt->{relevance} = 1 unless exists $opt->{relevance};
-	retry_reopen($self, \&misc_enquire_once, [ $self, $qr, $opt ]);
+	retry_reopen($self, \&misc_enquire_once, $qr, $opt);
 }
 
 sub ibx_data_once {
-	my ($self, $ibx) = @{$_[0]};
+	my ($self, $ibx) = @_;
 	my $xdb = $self->{xdb};
 	my $eidx_key = $ibx->eidx_key; # may be {inboxdir}, so private
 	my $head = $xdb->postlist_begin('Q'.$eidx_key);
@@ -92,7 +92,7 @@ sub ibx_data_once {
 
 sub inbox_data {
 	my ($self, $ibx) = @_;
-	retry_reopen($self, \&ibx_data_once, [ $self, $ibx ]);
+	retry_reopen($self, \&ibx_data_once, $ibx);
 }
 
 1;
diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index 05d5a133..574bc145 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -291,15 +291,15 @@ sub mset {
 }
 
 sub retry_reopen {
-	my ($self, $cb, $arg) = @_;
+	my ($self, $cb, @arg) = @_;
 	for my $i (1..10) {
 		if (wantarray) {
 			my @ret;
-			eval { @ret = $cb->($arg) };
+			eval { @ret = $cb->($self, @arg) };
 			return @ret unless $@;
 		} else {
 			my $ret;
-			eval { $ret = $cb->($arg) };
+			eval { $ret = $cb->($self, @arg) };
 			return $ret unless $@;
 		}
 		# Exception: The revision being read has been discarded -
@@ -319,7 +319,7 @@ sub retry_reopen {
 
 sub _do_enquire {
 	my ($self, $query, $opts) = @_;
-	retry_reopen($self, \&_enquire_once, [ $self, $query, $opts ]);
+	retry_reopen($self, \&_enquire_once, $query, $opts);
 }
 
 # returns true if all docs have the THREADID value
@@ -329,7 +329,7 @@ sub has_threadid ($) {
 }
 
 sub _enquire_once { # retry_reopen callback
-	my ($self, $query, $opts) = @{$_[0]};
+	my ($self, $query, $opts) = @_;
 	my $xdb = xdb($self);
 	my $enquire = $X{Enquire}->new($xdb);
 	$enquire->set_query($query);


Thread overview: 13+ messages
2020-11-23  7:05 [PATCH 00/12] extindex: speed up manifest.js.gz generation Eric Wong
2020-11-23  7:05 ` [PATCH 01/12] miscsearch: a new Xapian sub-DB for extindex Eric Wong
2020-11-23  7:05 ` [PATCH 02/12] move JSON module portability into PublicInbox::Config Eric Wong
2020-11-23  7:05 ` [PATCH 03/12] git: add manifest_entry method Eric Wong
2020-11-23  7:05 ` [PATCH 04/12] manifest: use ibx->git_epoch method for v2 Eric Wong
2020-11-23  7:05 ` [PATCH 05/12] inbox: git_epoch: remove ->version check Eric Wong
2020-11-23  7:05 ` [PATCH 06/12] miscidx: put grokmirror manifest entries in Xapian docdata Eric Wong
2020-11-23  7:05 ` [PATCH 07/12] extsearch: fix remaining "eindex" references Eric Wong
2020-11-23  7:05 ` [PATCH 08/12] miscidx: cleanup git processes after manifest indexing Eric Wong
2020-11-23  7:05 ` [PATCH 09/12] miscidx: store absolute git_dir of each epoch in docdata Eric Wong
2020-11-23  7:06 ` [PATCH 10/12] extsearchidx: do not short-circuit MiscIdx on no-op v2 prepare Eric Wong
2020-11-23  7:06 ` [PATCH 11/12] manifest: support faster generation via [extindex "all"] Eric Wong
2020-11-23  7:06 ` [PATCH 12/12] *search: simplify retry_reopen users Eric Wong
