* [PATCH 00/12] extindex: speed up manifest.js.gz generation
@ 2020-11-23 7:05 Eric Wong
2020-11-23 7:05 ` [PATCH 01/12] miscsearch: a new Xapian sub-DB for extindex Eric Wong
` (11 more replies)
0 siblings, 12 replies; 13+ messages in thread
From: Eric Wong @ 2020-11-23 7:05 UTC
To: meta
manifest.js.gz generation gets faster with this series
(~1000ms => ~40ms) on the current set of lore.kernel.org inboxes.
We may need to rely on varnish to handle things up to 30-100K
inboxes, since manifest.js.gz generation won't monopolize the
-httpd event loop.
WwwListing (HTML) output still needs to be updated and searching
for inboxes needs to be implemented along with pagination for
30-100K inboxes.
Eric Wong (12):
miscsearch: a new Xapian sub-DB for extindex
move JSON module portability into PublicInbox::Config
git: add manifest_entry method
manifest: use ibx->git_epoch method for v2
inbox: git_epoch: remove ->version check
miscidx: put grokmirror manifest entries in Xapian docdata
extsearch: fix remaining "eindex" references
miscidx: cleanup git processes after manifest indexing
miscidx: store absolute git_dir of each epoch in docdata
extsearchidx: do not short-circuit MiscIdx on no-op v2 prepare
manifest: support faster generation via [extindex "all"]
*search: simplify retry_reopen users
MANIFEST | 3 +
lib/PublicInbox/Config.pm | 15 ++++
lib/PublicInbox/ExtSearch.pm | 8 +-
lib/PublicInbox/ExtSearchIdx.pm | 18 ++++-
lib/PublicInbox/Git.pm | 53 +++++++++++++
lib/PublicInbox/Inbox.pm | 6 +-
lib/PublicInbox/InboxWritable.pm | 2 -
lib/PublicInbox/ManifestJsGz.pm | 108 +++++++++-----------------
lib/PublicInbox/MiscIdx.pm | 125 +++++++++++++++++++++++++++++++
lib/PublicInbox/MiscSearch.pm | 98 ++++++++++++++++++++++++
lib/PublicInbox/Search.pm | 18 ++---
lib/PublicInbox/SearchIdx.pm | 7 +-
lib/PublicInbox/V2Writable.pm | 5 ++
script/public-inbox-extindex | 1 +
t/extsearch.t | 14 +++-
t/miscsearch.t | 57 ++++++++++++++
t/www_listing.t | 5 +-
17 files changed, 446 insertions(+), 97 deletions(-)
create mode 100644 lib/PublicInbox/MiscIdx.pm
create mode 100644 lib/PublicInbox/MiscSearch.pm
create mode 100644 t/miscsearch.t
* [PATCH 01/12] miscsearch: a new Xapian sub-DB for extindex
From: Eric Wong @ 2020-11-23 7:05 UTC
To: meta
This will be used to index and search Inbox objects and perhaps
individual git repositories/epochs for grokmirror manifest.js.gz
generation. There is no sharding planned for this at the moment
since inbox count should remain low (~100K to 1M) compared to
message count.
Folding this into the existing sharded DBs might be possible,
but it would likely increase query and maintenance costs, as well
as development complexity. So we'll use a few more inodes and
FDs at runtime instead.
---
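Note for reviewers: index_ibx keeps the first docid it finds under
the 'Q'.$eidx_key term and deletes any others before replacing the
survivor. A minimal sketch of that selection in plain Perl, with no
Xapian involved; partition_docids is a hypothetical helper name used
only for illustration and is not part of this patch:

```perl
use strict;
use warnings;

# Hypothetical helper (not in this series): given the docids found on the
# posting list of one unique term, keep the first and drop the others.
sub partition_docids {
	my (@docids) = @_;
	my ($keep, @drop);
	for my $i (@docids) {
		if (defined $keep) {
			push @drop, $i; # duplicate, to be deleted
		} else {
			$keep = $i; # first one wins and is replaced in place
		}
	}
	($keep, \@drop);
}

my ($keep, $drop) = partition_docids(7, 9, 12);
print "replace #$keep, delete @$drop\n"; # replace #7, delete 9 12
```

An empty posting list leaves the kept docid undefined, which
corresponds to the add_document branch in the patch.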
MANIFEST | 3 +
lib/PublicInbox/ExtSearch.pm | 6 ++
lib/PublicInbox/ExtSearchIdx.pm | 11 +++-
lib/PublicInbox/MiscIdx.pm | 107 ++++++++++++++++++++++++++++++++
lib/PublicInbox/MiscSearch.pm | 79 +++++++++++++++++++++++
lib/PublicInbox/Search.pm | 8 +--
lib/PublicInbox/SearchIdx.pm | 7 ++-
lib/PublicInbox/V2Writable.pm | 5 ++
t/extsearch.t | 3 +
t/miscsearch.t | 54 ++++++++++++++++
10 files changed, 275 insertions(+), 8 deletions(-)
create mode 100644 lib/PublicInbox/MiscIdx.pm
create mode 100644 lib/PublicInbox/MiscSearch.pm
create mode 100644 t/miscsearch.t
diff --git a/MANIFEST b/MANIFEST
index fc79a134..544ec5f9 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -166,6 +166,8 @@ lib/PublicInbox/MIME.pm
lib/PublicInbox/ManifestJsGz.pm
lib/PublicInbox/Mbox.pm
lib/PublicInbox/MboxGz.pm
+lib/PublicInbox/MiscIdx.pm
+lib/PublicInbox/MiscSearch.pm
lib/PublicInbox/MsgIter.pm
lib/PublicInbox/MsgTime.pm
lib/PublicInbox/Msgmap.pm
@@ -319,6 +321,7 @@ t/mda.t
t/mda_filter_rubylang.t
t/mid.t
t/mime.t
+t/miscsearch.t
t/msg_iter-nested.eml
t/msg_iter-order.eml
t/msg_iter.t
diff --git a/lib/PublicInbox/ExtSearch.pm b/lib/PublicInbox/ExtSearch.pm
index eb665027..c41ae443 100644
--- a/lib/PublicInbox/ExtSearch.pm
+++ b/lib/PublicInbox/ExtSearch.pm
@@ -10,6 +10,7 @@ use v5.10.1;
use PublicInbox::Over;
use PublicInbox::Inbox;
use File::Spec ();
+use PublicInbox::MiscSearch;
# for ->reopen, ->mset, ->mset_to_artnums
use parent qw(PublicInbox::Search);
@@ -24,6 +25,11 @@ sub new {
}, __PACKAGE__;
}
+sub misc {
+ my ($self) = @_;
+ $self->{misc} //= PublicInbox::MiscSearch->new("$self->{xpfx}/misc");
+}
+
sub search { $_[0] } # self
# overrides PublicInbox::Search::_xdb
diff --git a/lib/PublicInbox/ExtSearchIdx.pm b/lib/PublicInbox/ExtSearchIdx.pm
index 91434b26..708f8a3e 100644
--- a/lib/PublicInbox/ExtSearchIdx.pm
+++ b/lib/PublicInbox/ExtSearchIdx.pm
@@ -21,6 +21,7 @@ use Carp qw(croak carp);
use PublicInbox::Search;
use PublicInbox::SearchIdx qw(crlf_adjust prepare_stack is_ancestor);
use PublicInbox::OverIdx;
+use PublicInbox::MiscIdx;
use PublicInbox::MID qw(mids);
use PublicInbox::V2Writable;
use PublicInbox::InboxWritable;
@@ -309,6 +310,7 @@ sub _sync_inbox ($$$) {
return;
}
index_todo($self, $sync, $_) for @{delete($sync->{todo}) // []};
+ $self->{midx}->index_ibx($ibx);
}
sub eidx_sync { # main entry point
@@ -374,6 +376,12 @@ sub update_last_commit { # overrides V2Writable
$self->{oidx}->eidx_meta($meta_key, $latest_cmt);
}
+sub _idx_init { # with_umask callback
+ my ($self, $opt) = @_;
+ PublicInbox::V2Writable::_idx_init($self, $opt);
+ $self->{midx} = PublicInbox::MiscIdx->new($self);
+}
+
sub idx_init { # similar to V2Writable
my ($self, $opt) = @_;
return if $self->{idx_shards};
@@ -406,9 +414,10 @@ sub idx_init { # similar to V2Writable
}
$self->parallel_init($self->{indexlevel});
$self->umask_prepare;
- $self->with_umask(\&PublicInbox::V2Writable::_idx_init, $self, $opt);
+ $self->with_umask(\&_idx_init, $self, $opt);
$self->{oidx}->begin_lazy;
$self->{oidx}->eidx_prep;
+ $self->{midx}->begin_txn;
}
no warnings 'once';
diff --git a/lib/PublicInbox/MiscIdx.pm b/lib/PublicInbox/MiscIdx.pm
new file mode 100644
index 00000000..edc70f9b
--- /dev/null
+++ b/lib/PublicInbox/MiscIdx.pm
@@ -0,0 +1,107 @@
+# Copyright (C) 2020 all contributors <meta@public-inbox.org>
+# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
+
+# like PublicInbox::SearchIdx, but for searching for non-mail messages.
+# Things indexed include:
+# * inboxes themselves
+# * epoch information
+# * (maybe) git code repository information
+# Expect ~100K-1M documents with no parallelism opportunities,
+# so no sharding, here.
+#
+# See MiscSearch for read-only counterpart
+package PublicInbox::MiscIdx;
+use strict;
+use v5.10.1;
+use PublicInbox::InboxWritable;
+use PublicInbox::Search; # for SWIG Xapian and Search::Xapian compat
+use PublicInbox::SearchIdx qw(index_text term_generator add_val);
+use PublicInbox::Spawn qw(nodatacow_dir);
+use Carp qw(croak);
+use File::Path ();
+use PublicInbox::MiscSearch;
+
+sub new {
+ my ($class, $eidx) = @_;
+ PublicInbox::SearchIdx::load_xapian_writable();
+ my $mi_dir = "$eidx->{xpfx}/misc";
+ File::Path::mkpath($mi_dir);
+ nodatacow_dir($mi_dir);
+ my $flags = $PublicInbox::SearchIdx::DB_CREATE_OR_OPEN;
+ $flags |= $PublicInbox::SearchIdx::DB_NO_SYNC if $eidx->{-no_fsync};
+ bless {
+ mi_dir => $mi_dir,
+ flags => $flags,
+ indexlevel => 'full', # small DB, no point in medium?
+ }, $class;
+}
+
+sub begin_txn {
+ my ($self) = @_;
+ croak 'BUG: already in txn' if $self->{xdb}; # XXX make lazy?
+ my $wdb = $PublicInbox::Search::X{WritableDatabase};
+ my $xdb = eval { $wdb->new($self->{mi_dir}, $self->{flags}) };
+ croak "Failed opening $self->{mi_dir}: $@" if $@;
+ $self->{xdb} = $xdb;
+ $xdb->begin_transaction;
+}
+
+sub commit_txn {
+ my ($self) = @_;
+ croak 'BUG: not in txn' unless $self->{xdb}; # XXX make lazy?
+ delete($self->{xdb})->commit_transaction;
+}
+
+sub index_ibx {
+ my ($self, $ibx) = @_;
+ my $eidx_key = $ibx->eidx_key;
+ my $xdb = $self->{xdb};
+ # Q = uniQue in Xapian terminology
+ my $head = $xdb->postlist_begin('Q'.$eidx_key);
+ my $tail = $xdb->postlist_end('Q'.$eidx_key);
+ my ($docid, @drop);
+ for (; $head != $tail; $head++) {
+ if (defined $docid) {
+ my $i = $head->get_docid;
+ push @drop, $i;
+ warn <<EOF;
+W: multiple inboxes keyed to `$eidx_key', deleting #$i
+EOF
+ } else {
+ $docid = $head->get_docid;
+ }
+ }
+ $xdb->delete_document($_) for @drop; # just in case
+
+ my $doc = $PublicInbox::Search::X{Document}->new;
+
+ # allow sorting by modified
+ add_val($doc, $PublicInbox::MiscSearch::MODIFIED, $ibx->modified);
+
+ $doc->add_boolean_term('Q'.$eidx_key);
+ $doc->add_boolean_term('T'.'inbox');
+ term_generator($self)->set_document($doc);
+
+ # description = S/Subject (or title)
+ # address = A/Author
+ index_text($self, $ibx->description, 1, 'S');
+ my %map = (
+ address => 'A',
+ listid => 'XLISTID',
+ infourl => 'XINFOURL',
+ url => 'XURL'
+ );
+ while (my ($f, $pfx) = each %map) {
+ for my $v (@{$ibx->{$f} // []}) {
+ index_text($self, $v, 1, $pfx);
+ }
+ }
+ index_text($self, $ibx->{name}, 1, 'XNAME');
+ if (defined $docid) {
+ $xdb->replace_document($docid, $doc);
+ } else {
+ $xdb->add_document($doc);
+ }
+}
+
+1;
diff --git a/lib/PublicInbox/MiscSearch.pm b/lib/PublicInbox/MiscSearch.pm
new file mode 100644
index 00000000..8beb8349
--- /dev/null
+++ b/lib/PublicInbox/MiscSearch.pm
@@ -0,0 +1,79 @@
+# Copyright (C) 2020 all contributors <meta@public-inbox.org>
+# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
+
+# read-only counterpart to MiscIdx
+package PublicInbox::MiscSearch;
+use strict;
+use v5.10.1;
+use PublicInbox::Search qw(retry_reopen);
+
+# Xapian value columns:
+our $MODIFIED = 0;
+
+# avoid conflicting with message Search::prob_prefix for UI/UX reasons
+my %PROB_PREFIX = (
+ description => 'S', # $INBOX_DIR/description
+ address => 'A',
+ listid => 'XLISTID',
+ url => 'XURL',
+ infourl => 'XINFOURL',
+ name => 'XNAME',
+ '' => 'S A XLISTID XNAME XURL XINFOURL'
+);
+
+sub new {
+ my ($class, $dir) = @_;
+ bless {
+ xdb => $PublicInbox::Search::X{Database}->new($dir)
+ }, $class;
+}
+
+# read-only
+sub mi_qp_new ($) {
+ my ($self) = @_;
+ my $xdb = $self->{xdb};
+ my $qp = $PublicInbox::Search::X{QueryParser}->new;
+ $qp->set_default_op(PublicInbox::Search::OP_AND());
+ $qp->set_database($xdb);
+ $qp->set_stemmer(PublicInbox::Search::stemmer($self));
+ $qp->set_stemming_strategy(PublicInbox::Search::STEM_SOME());
+ my $cb = $qp->can('set_max_wildcard_expansion') //
+ $qp->can('set_max_expansion'); # Xapian 1.5.0+
+ $cb->($qp, 100);
+ $cb = $qp->can('add_valuerangeprocessor') //
+ $qp->can('add_rangeprocessor'); # Xapian 1.5.0+
+ while (my ($name, $prefix) = each %PROB_PREFIX) {
+ $qp->add_prefix($name, $_) for split(/ /, $prefix);
+ }
+ $qp->add_boolean_prefix('type', 'T');
+ $qp;
+}
+
+sub misc_enquire_once { # retry_reopen callback
+ my ($self, $qr, $opt) = @{$_[0]};
+ my $eq = $PublicInbox::Search::X{Enquire}->new($self->{xdb});
+ $eq->set_query($qr);
+ my $desc = !$opt->{asc};
+ my $rel = $opt->{relevance} // 0;
+ if ($rel == -1) { # ORDER BY docid/UID
+ $eq->set_docid_order($PublicInbox::Search::ENQ_ASCENDING);
+ $eq->set_weighting_scheme($PublicInbox::Search::X{BoolWeight}->new);
+ } elsif ($rel) {
+ $eq->set_sort_by_relevance_then_value($MODIFIED, $desc);
+ } else {
+ $eq->set_sort_by_value_then_relevance($MODIFIED, $desc);
+ }
+ $eq->get_mset($opt->{offset} || 0, $opt->{limit} || 200);
+}
+
+sub mset {
+ my ($self, $qs, $opt) = @_;
+ $opt ||= {};
+ my $qp = $self->{qp} //= mi_qp_new($self);
+ $qs = 'type:inbox' if $qs eq '';
+ my $qr = $qp->parse_query($qs, $PublicInbox::Search::QP_FLAGS);
+ $opt->{relevance} = 1 unless exists $opt->{relevance};
+ retry_reopen($self, \&misc_enquire_once, [ $self, $qr, $opt ]);
+}
+
+1;
diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index 71417d5e..05d5a133 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -6,7 +6,7 @@
package PublicInbox::Search;
use strict;
use parent qw(Exporter);
-our @EXPORT_OK = qw(mdocid);
+our @EXPORT_OK = qw(mdocid retry_reopen);
use List::Util qw(max);
# values for searching, changing the numeric value breaks
@@ -54,11 +54,11 @@ use constant {
use PublicInbox::Smsg;
use PublicInbox::Over;
-my $QP_FLAGS;
+our $QP_FLAGS;
our %X = map { $_ => 0 } qw(BoolWeight Database Enquire QueryParser Stem);
our $Xap; # 'Search::Xapian' or 'Xapian'
-my $NVRP; # '$Xap::'.('NumberValueRangeProcessor' or 'NumberRangeProcessor')
-my $ENQ_ASCENDING;
+our $NVRP; # '$Xap::'.('NumberValueRangeProcessor' or 'NumberRangeProcessor')
+our $ENQ_ASCENDING;
sub load_xapian () {
return 1 if defined $Xap;
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 6ff2cf94..18390602 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -22,9 +22,10 @@ use PublicInbox::OverIdx;
use PublicInbox::Spawn qw(spawn nodatacow_dir);
use PublicInbox::Git qw(git_unquote);
use PublicInbox::MsgTime qw(msg_timestamp msg_datestamp);
-our @EXPORT_OK = qw(crlf_adjust log2stack is_ancestor check_size prepare_stack);
+our @EXPORT_OK = qw(crlf_adjust log2stack is_ancestor check_size prepare_stack
+ index_text term_generator add_val);
my $X = \%PublicInbox::Search::X;
-my ($DB_CREATE_OR_OPEN, $DB_OPEN);
+our ($DB_CREATE_OR_OPEN, $DB_OPEN);
our $DB_NO_SYNC = 0;
our $BATCH_BYTES = $ENV{XAPIAN_FLUSH_THRESHOLD} ? 0x7fffffff : 1_000_000;
use constant DEBUG => !!$ENV{DEBUG};
@@ -154,7 +155,7 @@ sub term_generator ($) { # write-only
$self->{term_generator} //= do {
my $tg = $X->{TermGenerator}->new;
- $tg->set_stemmer($self->stemmer);
+ $tg->set_stemmer(PublicInbox::Search::stemmer($self));
$tg;
}
}
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index ba7cef13..afba0220 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -631,6 +631,9 @@ sub checkpoint ($;$) {
$_->shard_commit for @$shards;
}
+ my $midx = $self->{midx}; # misc index
+ $midx->commit_txn if $midx;
+
# last_commit is special, don't commit these until
# Xapian shards are done:
$dbh->begin_work if $dbh;
@@ -639,6 +642,7 @@ sub checkpoint ($;$) {
$dbh->commit;
$dbh->begin_work;
}
+ $midx->begin_txn if $midx;
}
$self->{total_bytes} += $self->{transact_bytes};
$self->{transact_bytes} = 0;
@@ -678,6 +682,7 @@ sub done {
}
eval { $self->{oidx}->dbh_close };
$err .= "over close: $@\n" if $@;
+ delete $self->{midx};
delete $self->{bnote};
my $nbytes = $self->{total_bytes};
$self->{total_bytes} = 0;
diff --git a/t/extsearch.t b/t/extsearch.t
index 8792fd9e..e28e2f71 100644
--- a/t/extsearch.t
+++ b/t/extsearch.t
@@ -72,4 +72,7 @@ my $es = PublicInbox::ExtSearch->new("$home/eindex");
isnt($x1->[0], $x2->[0], 'xref3 differs');
}
+my $misc = $es->misc;
+is(scalar($misc->mset('')->items), 2, 'two inboxes');
+
done_testing;
diff --git a/t/miscsearch.t b/t/miscsearch.t
new file mode 100644
index 00000000..45a19da9
--- /dev/null
+++ b/t/miscsearch.t
@@ -0,0 +1,54 @@
+#!perl -w
+# Copyright (C) 2020 all contributors <meta@public-inbox.org>
+# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
+use strict;
+use Test::More;
+use PublicInbox::TestCommon;
+use PublicInbox::InboxWritable;
+require_mods(qw(Search::Xapian DBD::SQLite));
+use_ok 'PublicInbox::MiscSearch';
+use_ok 'PublicInbox::MiscIdx';
+
+my ($tmp, $for_destroy) = tmpdir();
+my $eidx = { xpfx => "$tmp/eidx", -no_fsync => 1 }; # mock ExtSearchIdx
+{
+ mkdir "$tmp/v1" or BAIL_OUT "mkdir $!";
+ open my $fh, '>', "$tmp/v1/description" or BAIL_OUT "open: $!";
+ print $fh "Everything sucks this year\n" or BAIL_OUT "print $!";
+ close $fh or BAIL_OUT "close $!";
+}
+{
+ my $v1 = PublicInbox::InboxWritable->new({
+ inboxdir => "$tmp/v1",
+ name => 'hope',
+ address => [ 'nope@example.com' ],
+ indexlevel => 'basic',
+ version => 1,
+ });
+ $v1->init_inbox;
+ my $mi = PublicInbox::MiscIdx->new($eidx);
+ $mi->begin_txn;
+ $mi->index_ibx($v1);
+ $mi->commit_txn;
+}
+
+my $ms = PublicInbox::MiscSearch->new("$tmp/eidx/misc");
+my $mset = $ms->mset('"everything sucks today"');
+is(scalar($mset->items), 0, 'no match on description phrase');
+
+$mset = $ms->mset('"everything sucks this year"');
+is(scalar($mset->items), 1, 'match phrase on description');
+
+$mset = $ms->mset('everything sucks');
+is(scalar($mset->items), 1, 'match words in description');
+
+$mset = $ms->mset('nope@example.com');
+is(scalar($mset->items), 1, 'match full address');
+
+$mset = $ms->mset('nope');
+is(scalar($mset->items), 1, 'match partial address');
+
+$mset = $ms->mset('hope');
+is(scalar($mset->items), 1, 'match name');
+
+done_testing;
* [PATCH 02/12] move JSON module portability into PublicInbox::Config
From: Eric Wong @ 2020-11-23 7:05 UTC
To: meta
We'll be using JSON in MiscIdx and MiscSearch, and
PublicInbox::Config seems like an appropriate place to put it.
---
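Note for reviewers: a standalone sketch of the lazy loader, runnable
outside of public-inbox. pick_json is a hypothetical name standing in
for PublicInbox::Config::json; JSON::PP ships with perl, so the loop
should normally find at least one module:

```perl
use strict;
use warnings;
use v5.10.1; # enables "state"

# Sketch only: try the faster modules first and remember the result, so
# later calls skip the require loop entirely.
sub pick_json {
	state $json;
	$json //= do {
		my $obj;
		for my $mod (qw(Cpanel::JSON::XS JSON::MaybeXS JSON JSON::PP)) {
			eval "require $mod" or next;
			# ->ascii encodes non-ASCII to "\uXXXX"
			$obj = $mod->new->ascii(1) and last;
		}
		$obj;
	};
}

my $json = pick_json() or die "no JSON module available\n";
print $json->encode(["caf\x{e9}"]), "\n"; # the e-acute is \u-escaped
```

If none of the modules load, the sub returns undef on every call and
retries the loop each time, which is why callers check the return value.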
lib/PublicInbox/Config.pm | 12 ++++++++++++
lib/PublicInbox/ManifestJsGz.pm | 8 ++------
2 files changed, 14 insertions(+), 6 deletions(-)
diff --git a/lib/PublicInbox/Config.pm b/lib/PublicInbox/Config.pm
index d2010f7a..039eb445 100644
--- a/lib/PublicInbox/Config.pm
+++ b/lib/PublicInbox/Config.pm
@@ -488,4 +488,16 @@ sub urlmatch {
}
}
+sub json {
+ state $json;
+ $json //= do {
+ for my $mod (qw(Cpanel::JSON::XS JSON::MaybeXS JSON JSON::PP)) {
+ eval "require $mod" or next;
+ # ->ascii encodes non-ASCII to "\uXXXX"
+ $json = $mod->new->ascii(1) and last;
+ }
+ $json;
+ };
+}
+
1;
diff --git a/lib/PublicInbox/ManifestJsGz.pm b/lib/PublicInbox/ManifestJsGz.pm
index 16d2a87c..ab1478af 100644
--- a/lib/PublicInbox/ManifestJsGz.pm
+++ b/lib/PublicInbox/ManifestJsGz.pm
@@ -10,17 +10,13 @@ use Digest::SHA ();
use File::Spec ();
use bytes (); # length
use PublicInbox::Inbox;
+use PublicInbox::Config;
use PublicInbox::Git;
use IO::Compress::Gzip qw(gzip);
use HTTP::Date qw(time2str);
*try_cat = \&PublicInbox::Inbox::try_cat;
-our $json;
-for my $mod (qw(Cpanel::JSON::XS JSON::MaybeXS JSON JSON::PP)) {
- eval "require $mod" or next;
- # ->ascii encodes non-ASCII to "\uXXXX"
- $json = $mod->new->ascii(1) and last;
-}
+our $json = PublicInbox::Config::json();
# called by WwwListing
sub url_regexp {
* [PATCH 03/12] git: add manifest_entry method
From: Eric Wong @ 2020-11-23 7:05 UTC
To: meta
We'll be using this for MiscIdx and pre-generating the necessary
JSON for manifest.js.gz, so make it easier to share code for
generating per-repo JSON entries for grokmirror.
---
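Note for reviewers: the two derived fields in an entry are the
fingerprint (SHA-1 of the "git show-ref" output) and the reference
(the single alternate, minus its trailing "objects" component). A
rough standalone sketch follows; fingerprint_of and reference_of are
hypothetical names, the show-ref text and paths are made up, and an
empty string stands in for the non-zero exit status that the real
method checks via $?:

```perl
use strict;
use warnings;
use Digest::SHA ();
use File::Spec ();

# SHA-1 over whatever "git show-ref" printed; the output is passed in as
# a string so the sketch runs without a git repository.
sub fingerprint_of {
	my ($show_ref_out) = @_;
	return if $show_ref_out eq ''; # assumption: stands in for $? != 0
	Digest::SHA->new(1)->add($show_ref_out)->hexdigest;
}

# Map one line of objects/info/alternates to grokmirror's "reference":
# resolve it against $git_dir/objects, then strip the last path component.
sub reference_of {
	my ($git_dir, $alt) = @_;
	my $ref = File::Spec->rel2abs($alt, "$git_dir/objects");
	$ref =~ s!/[^/]+/?\z!!;
	$ref;
}

my $fp = fingerprint_of(('0123456789abcdef' x 2) . "01234567 refs/heads/master\n");
print "fingerprint: $fp\n";
# with Unix paths this prints /srv/inbox/all.git
print reference_of('/srv/inbox/git/1.git', '/srv/inbox/all.git/objects'), "\n";
```

Note that File::Spec->rel2abs does not collapse ".." components, so a
relative alternate keeps them in the resulting reference.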
lib/PublicInbox/Git.pm | 53 +++++++++++++++++++++++++++++
lib/PublicInbox/ManifestJsGz.pm | 59 ++-------------------------------
t/www_listing.t | 5 ++-
3 files changed, 58 insertions(+), 59 deletions(-)
diff --git a/lib/PublicInbox/Git.pm b/lib/PublicInbox/Git.pm
index 86343ac9..917fa4a1 100644
--- a/lib/PublicInbox/Git.pm
+++ b/lib/PublicInbox/Git.pm
@@ -14,10 +14,12 @@ use POSIX ();
use IO::Handle; # ->autoflush
use Errno qw(EINTR);
use File::Glob qw(bsd_glob GLOB_NOSORT);
+use File::Spec ();
use Time::HiRes qw(stat);
use PublicInbox::Spawn qw(popen_rd);
use PublicInbox::Tmpfile;
use Carp qw(croak);
+use Digest::SHA ();
our @EXPORT_OK = qw(git_unquote git_quote);
our $PIPE_BUFSIZ = 65536; # Linux default
our $in_cleanup;
@@ -475,6 +477,57 @@ sub modified ($) {
$modified || time;
}
+# for grokmirror, which doesn't read gitweb.description
+# templates/hooks--update.sample and git-multimail in git.git
+# only match "Unnamed repository", not the full contents of
+# templates/this--description in git.git
+sub manifest_entry {
+ my ($self, $epoch, $default_desc) = @_;
+ my ($fh, $pid) = $self->popen('show-ref');
+ my $dig = Digest::SHA->new(1);
+ while (read($fh, my $buf, 65536)) {
+ $dig->add($buf);
+ }
+ close $fh;
+ waitpid($pid, 0);
+ return if $?; # empty, uninitialized git repo
+ my $git_dir = $self->{git_dir};
+ my $ent = {
+ fingerprint => $dig->hexdigest,
+ reference => undef,
+ modified => modified($self),
+ };
+ chomp(my $owner = $self->qx('config', 'gitweb.owner'));
+ utf8::decode($owner);
+ $ent->{owner} = $owner eq '' ? undef : $owner;
+ my $desc = '';
+ if (open($fh, '<', "$git_dir/description")) {
+ local $/ = "\n";
+ chomp($desc = <$fh>);
+ utf8::decode($desc);
+ }
+ $desc = 'Unnamed repository' if $desc eq '';
+ if (defined $epoch && $desc =~ /\AUnnamed repository/) {
+ $desc = "$default_desc [epoch $epoch]";
+ }
+ $ent->{description} = $desc;
+ if (open($fh, '<', "$git_dir/objects/info/alternates")) {
+ # n.b.: GitPython doesn't seem to handle comments or C-quoted
+ # strings like native git does; and we don't for now, either.
+ local $/ = "\n";
+ chomp(my @alt = <$fh>);
+
+ # grokmirror only supports 1 alternate for "reference",
+ if (scalar(@alt) == 1) {
+ my $objdir = "$git_dir/objects";
+ my $ref = File::Spec->rel2abs($alt[0], $objdir);
+ $ref =~ s!/[^/]+/?\z!!; # basename
+ $ent->{reference} = $ref;
+ }
+ }
+ $ent;
+}
+
1;
__END__
=pod
diff --git a/lib/PublicInbox/ManifestJsGz.pm b/lib/PublicInbox/ManifestJsGz.pm
index ab1478af..3d8a38ae 100644
--- a/lib/PublicInbox/ManifestJsGz.pm
+++ b/lib/PublicInbox/ManifestJsGz.pm
@@ -6,15 +6,12 @@ package PublicInbox::ManifestJsGz;
use strict;
use v5.10.1;
use parent qw(PublicInbox::WwwListing);
-use Digest::SHA ();
-use File::Spec ();
use bytes (); # length
use PublicInbox::Inbox;
use PublicInbox::Config;
use PublicInbox::Git;
use IO::Compress::Gzip qw(gzip);
use HTTP::Date qw(time2str);
-*try_cat = \&PublicInbox::Inbox::try_cat;
our $json = PublicInbox::Config::json();
@@ -26,21 +23,6 @@ sub url_regexp {
$ctx->SUPER::url_regexp('publicInbox.grokManifest', 'match=domain');
}
-sub fingerprint ($) {
- my ($git) = @_;
- # TODO: convert to qspawn for fairness when there's
- # thousands of repos
- my ($fh, $pid) = $git->popen('show-ref');
- my $dig = Digest::SHA->new(1);
- while (read($fh, my $buf, 65536)) {
- $dig->add($buf);
- }
- close $fh;
- waitpid($pid, 0);
- return if $?; # empty, uninitialized git repo
- $dig->hexdigest;
-}
-
sub manifest_add ($$;$$) {
my ($ctx, $ibx, $epoch, $default_desc) = @_;
my $url_path = "/$ibx->{name}";
@@ -51,48 +33,13 @@ sub manifest_add ($$;$$) {
}
return unless -d $git_dir;
my $git = PublicInbox::Git->new($git_dir);
- my $fingerprint = fingerprint($git) or return; # no empty repos
-
- chomp(my $owner = $git->qx('config', 'gitweb.owner'));
- chomp(my $desc = try_cat("$git_dir/description"));
- utf8::decode($owner);
- utf8::decode($desc);
- $owner = undef if $owner eq '';
- $desc = 'Unnamed repository' if $desc eq '';
-
- # templates/hooks--update.sample and git-multimail in git.git
- # only match "Unnamed repository", not the full contents of
- # templates/this--description in git.git
- if ($desc =~ /\AUnnamed repository/) {
- $desc = "$default_desc [epoch $epoch]" if defined($epoch);
- }
-
- my $reference;
- chomp(my $alt = try_cat("$git_dir/objects/info/alternates"));
- if ($alt) {
- # n.b.: GitPython doesn't seem to handle comments or C-quoted
- # strings like native git does; and we don't for now, either.
- my @alt = split(/\n+/, $alt);
-
- # grokmirror only supports 1 alternate for "reference",
- if (scalar(@alt) == 1) {
- my $objdir = "$git_dir/objects";
- $reference = File::Spec->rel2abs($alt[0], $objdir);
- $reference =~ s!/[^/]+/?\z!!; # basename
- }
- }
+ my $ent = $git->manifest_entry($epoch, $default_desc) or return;
$ctx->{-abs2urlpath}->{$git_dir} = $url_path;
- my $modified = $git->modified;
+ my $modified = $ent->{modified};
if ($modified > ($ctx->{-mtime} // 0)) {
$ctx->{-mtime} = $modified;
}
- $ctx->{manifest}->{$url_path} = {
- owner => $owner,
- reference => $reference,
- description => $desc,
- modified => $modified,
- fingerprint => $fingerprint,
- };
+ $ctx->{manifest}->{$url_path} = $ent;
}
sub ibx_entry {
diff --git a/t/www_listing.t b/t/www_listing.t
index 4309a5e1..63613371 100644
--- a/t/www_listing.t
+++ b/t/www_listing.t
@@ -21,8 +21,7 @@ use_ok 'PublicInbox::Git';
my ($tmpdir, $for_destroy) = tmpdir();
my $bare = PublicInbox::Git->new("$tmpdir/bare.git");
PublicInbox::Import::init_bare($bare->{git_dir});
-is(PublicInbox::ManifestJsGz::fingerprint($bare), undef,
- 'empty repo has no fingerprint');
+is($bare->manifest_entry, undef, 'empty repo has no manifest entry');
{
my $fi_data = './t/git.fast-import-data';
open my $fh, '<', $fi_data or die "open $fi_data: $!";
@@ -31,7 +30,7 @@ is(PublicInbox::ManifestJsGz::fingerprint($bare), undef,
'fast-import');
}
-like(PublicInbox::ManifestJsGz::fingerprint($bare), qr/\A[a-f0-9]{40}\z/,
+like($bare->manifest_entry->{fingerprint}, qr/\A[a-f0-9]{40}\z/,
'got fingerprint with non-empty repo');
sub tiny_test {
* [PATCH 04/12] manifest: use ibx->git_epoch method for v2
From: Eric Wong @ 2020-11-23 7:05 UTC
To: meta
We can slightly reduce the amount of version-specific logic,
here.
---
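Note for reviewers: the "return unless -d" added to git_epoch sits
inside the do block on the right-hand side of ||=, so it returns from
the method itself and no Git object is built for a missing epoch. A
small sketch of that idiom; cached_git is a hypothetical stand-in and
a plain hashref replaces PublicInbox::Git:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical stand-in for git_epoch: build and cache one object per
# directory, but only when the directory exists.
sub cached_git {
	my ($cache, $git_dir) = @_;
	$cache->{$git_dir} ||= do {
		# "return" leaves cached_git itself, not just the do block,
		# so no object is built or stored for a missing directory
		return unless -d $git_dir;
		+{ git_dir => $git_dir };
	};
}

my $cache = {};
my $dir = tempdir(CLEANUP => 1);
my $hit = cached_git($cache, $dir);
my $miss = cached_git($cache, "$dir/no-such-epoch.git");
print defined($miss) ? "unexpected\n" : "missing epoch skipped\n";
print $hit == cached_git($cache, $dir) ? "cached\n" : "rebuilt\n";
```

This is what lets manifest_add write "$ibx->git_epoch($epoch) or return"
without its own -d check.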
lib/PublicInbox/Inbox.pm | 1 +
lib/PublicInbox/ManifestJsGz.pm | 12 +++++-------
2 files changed, 6 insertions(+), 7 deletions(-)
diff --git a/lib/PublicInbox/Inbox.pm b/lib/PublicInbox/Inbox.pm
index 1d18cdf1..64b12345 100644
--- a/lib/PublicInbox/Inbox.pm
+++ b/lib/PublicInbox/Inbox.pm
@@ -137,6 +137,7 @@ sub git_epoch {
$self->version == 2 or return;
$self->{"$epoch.git"} ||= do {
my $git_dir = "$self->{inboxdir}/git/$epoch.git";
+ return unless -d $git_dir;
my $g = PublicInbox::Git->new($git_dir);
$g->{-httpbackend_limiter} = $self->{-httpbackend_limiter};
# no cleanup needed, we never cat-file off this, only clone
diff --git a/lib/PublicInbox/ManifestJsGz.pm b/lib/PublicInbox/ManifestJsGz.pm
index 3d8a38ae..3b436827 100644
--- a/lib/PublicInbox/ManifestJsGz.pm
+++ b/lib/PublicInbox/ManifestJsGz.pm
@@ -7,9 +7,7 @@ use strict;
use v5.10.1;
use parent qw(PublicInbox::WwwListing);
use bytes (); # length
-use PublicInbox::Inbox;
use PublicInbox::Config;
-use PublicInbox::Git;
use IO::Compress::Gzip qw(gzip);
use HTTP::Date qw(time2str);
@@ -26,15 +24,15 @@ sub url_regexp {
sub manifest_add ($$;$$) {
my ($ctx, $ibx, $epoch, $default_desc) = @_;
my $url_path = "/$ibx->{name}";
- my $git_dir = $ibx->{inboxdir};
+ my $git;
if (defined $epoch) {
- $git_dir .= "/git/$epoch.git";
$url_path .= "/git/$epoch.git";
+ $git = $ibx->git_epoch($epoch) or return;
+ } else {
+ $git = $ibx->git;
}
- return unless -d $git_dir;
- my $git = PublicInbox::Git->new($git_dir);
my $ent = $git->manifest_entry($epoch, $default_desc) or return;
- $ctx->{-abs2urlpath}->{$git_dir} = $url_path;
+ $ctx->{-abs2urlpath}->{$git->{git_dir}} = $url_path;
my $modified = $ent->{modified};
if ($modified > ($ctx->{-mtime} // 0)) {
$ctx->{-mtime} = $modified;
* [PATCH 05/12] inbox: git_epoch: remove ->version check
From: Eric Wong @ 2020-11-23 7:05 UTC
To: meta
If $epoch is supplied to this method, epochs already exist, so
an extra method call to ->version is a pointless waste of CPU
cycles.
---
lib/PublicInbox/Inbox.pm | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/lib/PublicInbox/Inbox.pm b/lib/PublicInbox/Inbox.pm
index 64b12345..a1a072ad 100644
--- a/lib/PublicInbox/Inbox.pm
+++ b/lib/PublicInbox/Inbox.pm
@@ -133,8 +133,7 @@ sub new {
sub version { $_[0]->{version} // 1 }
sub git_epoch {
- my ($self, $epoch) = @_;
- $self->version == 2 or return;
+ my ($self, $epoch) = @_; # v2-only, callers always supply $epoch
$self->{"$epoch.git"} ||= do {
my $git_dir = "$self->{inboxdir}/git/$epoch.git";
return unless -d $git_dir;
* [PATCH 06/12] miscidx: put grokmirror manifest entries in Xapian docdata
From: Eric Wong @ 2020-11-23 7:05 UTC
To: meta
This should make it possible for us to quickly generate
manifest.js.gz files with less random I/O and process
spawning in the WWW code.
---
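Note for reviewers: each inbox document stores a JSON object mapping
URL paths to manifest entries. A rough sketch of how a reader might
merge that docdata into one manifest without spawning git; this is an
assumption about the consumer side and not code from this patch.
merge_docdata is a hypothetical name and the field values are made up:

```perl
use strict;
use warnings;
use JSON::PP ();

my $json = JSON::PP->new->ascii(1)->canonical(1);

# Indexing side: one document per inbox, docdata maps URL path => entry.
my $docdata = $json->encode({
	'/v2test/git/0.git' => {
		fingerprint => 'f' x 40,
		modified => 1606115100,
		owner => undef,
		reference => undef,
		description => 'test inbox [epoch 0]',
	},
});

# Reading side: fold every document's docdata into one manifest hash and
# return the newest "modified" value seen.
sub merge_docdata {
	my ($manifest, @docdata) = @_;
	my $mtime = 0;
	for my $data (@docdata) {
		my $ents = $json->decode($data);
		while (my ($path, $ent) = each %$ents) {
			$manifest->{$path} = $ent;
			$mtime = $ent->{modified} if $ent->{modified} > $mtime;
		}
	}
	$mtime;
}

my %manifest;
my $mtime = merge_docdata(\%manifest, $docdata, '{}');
print scalar(keys %manifest), " entry, mtime=$mtime\n"; # 1 entry, mtime=1606115100
```

The '{}' argument mirrors what t/miscsearch.t expects for an inbox
with no usable git repository: it decodes to an empty hash and
contributes nothing.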
lib/PublicInbox/MiscIdx.pm | 15 +++++++++++++++
script/public-inbox-extindex | 1 +
t/extsearch.t | 7 ++++++-
t/miscsearch.t | 3 +++
4 files changed, 25 insertions(+), 1 deletion(-)
diff --git a/lib/PublicInbox/MiscIdx.pm b/lib/PublicInbox/MiscIdx.pm
index edc70f9b..9dcc96b7 100644
--- a/lib/PublicInbox/MiscIdx.pm
+++ b/lib/PublicInbox/MiscIdx.pm
@@ -20,6 +20,7 @@ use PublicInbox::Spawn qw(nodatacow_dir);
use Carp qw(croak);
use File::Path ();
use PublicInbox::MiscSearch;
+use PublicInbox::Config;
sub new {
my ($class, $eidx) = @_;
@@ -97,6 +98,20 @@ EOF
}
}
index_text($self, $ibx->{name}, 1, 'XNAME');
+ my $data = {};
+ if (defined(my $max = $ibx->max_git_epoch)) { # v2
+ my $desc = $ibx->description;
+ my $pfx = "/$ibx->{name}/git/";
+ for my $epoch (0..$max) {
+ my $git = $ibx->git_epoch($epoch) or return;
+ if (my $ent = $git->manifest_entry($epoch, $desc)) {
+ $data->{"$pfx$epoch.git"} = $ent;
+ }
+ }
+ } elsif (my $ent = $ibx->git->manifest_entry) { # v1
+ $data->{"/$ibx->{name}"} = $ent;
+ }
+ $doc->set_data(PublicInbox::Config::json()->encode($data));
if (defined $docid) {
$xdb->replace_document($docid, $doc);
} else {
diff --git a/script/public-inbox-extindex b/script/public-inbox-extindex
index 78d6d9d9..20a0737c 100644
--- a/script/public-inbox-extindex
+++ b/script/public-inbox-extindex
@@ -38,6 +38,7 @@ require PublicInbox::Admin;
my $cfg = PublicInbox::Config->new;
my @ibxs = PublicInbox::Admin::resolve_inboxes(\@ARGV, $opt, $cfg);
PublicInbox::Admin::require_or_die(qw(-search));
+PublicInbox::Config::json() or die "Cpanel::JSON::XS or similar missing\n";
PublicInbox::Admin::progress_prepare($opt);
my $env = PublicInbox::Admin::index_prepare($opt, $cfg);
local %ENV = (%ENV, %$env) if $env;
diff --git a/t/extsearch.t b/t/extsearch.t
index e28e2f71..dc825bf4 100644
--- a/t/extsearch.t
+++ b/t/extsearch.t
@@ -4,7 +4,9 @@
use strict;
use Test::More;
use PublicInbox::TestCommon;
+use PublicInbox::Config;
use Fcntl qw(:seek);
+my $json = PublicInbox::Config::json() or plan skip_all => 'JSON missing';
require_git(2.6);
require_mods(qw(DBD::SQLite Search::Xapian));
use_ok 'PublicInbox::ExtSearch';
@@ -73,6 +75,9 @@ my $es = PublicInbox::ExtSearch->new("$home/eindex");
}
my $misc = $es->misc;
-is(scalar($misc->mset('')->items), 2, 'two inboxes');
+my @it = $misc->mset('')->items;
+is(scalar(@it), 2, 'two inboxes');
+like($it[0]->get_document->get_data, qr/v2test/, 'docdata matched v2');
+like($it[1]->get_document->get_data, qr/v1test/, 'docdata matched v1');
done_testing;
diff --git a/t/miscsearch.t b/t/miscsearch.t
index 45a19da9..0ba79194 100644
--- a/t/miscsearch.t
+++ b/t/miscsearch.t
@@ -50,5 +50,8 @@ is(scalar($mset->items), 1, 'match partial address');
$mset = $ms->mset('hope');
is(scalar($mset->items), 1, 'match name');
+my $mi = ($mset->items)[0];
+my $doc = $mi->get_document;
+is($doc->get_data, '{}', 'stored empty data');
done_testing;
* [PATCH 07/12] extsearch: fix remaining "eindex" references
2020-11-23 7:05 [PATCH 00/12] extindex: speed up manifest.js.gz generation Eric Wong
` (5 preceding siblings ...)
2020-11-23 7:05 ` [PATCH 06/12] miscidx: put grokmirror manifest entries in Xapian docdata Eric Wong
@ 2020-11-23 7:05 ` Eric Wong
2020-11-23 7:05 ` [PATCH 08/12] miscidx: cleanup git processes after manifest indexing Eric Wong
` (4 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: Eric Wong @ 2020-11-23 7:05 UTC (permalink / raw)
To: meta
Replace "$EINDEX_DIR" with "$EXTINDEX_DIR" in a user-visible
message, and rename "eindex" to "extindex" in some hacker-only tests.
"eindex" is no longer used because it rhymes with "reindex",
so remove the last instances of it.
Fixes: 6b0fed3b03263ba2 ("extsearch: rename -eindex to -extindex")
---
lib/PublicInbox/ExtSearch.pm | 2 +-
t/extsearch.t | 6 +++---
2 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/lib/PublicInbox/ExtSearch.pm b/lib/PublicInbox/ExtSearch.pm
index c41ae443..dd93cd32 100644
--- a/lib/PublicInbox/ExtSearch.pm
+++ b/lib/PublicInbox/ExtSearch.pm
@@ -57,7 +57,7 @@ sub description {
my ($self) = @_;
($self->{description} //=
PublicInbox::Inbox::cat_desc("$self->{topdir}/description")) //
- '$EINDEX_DIR/description missing';
+ '$EXTINDEX_DIR/description missing';
}
sub cloneurl { [] } # TODO
diff --git a/t/extsearch.t b/t/extsearch.t
index dc825bf4..0045294b 100644
--- a/t/extsearch.t
+++ b/t/extsearch.t
@@ -35,9 +35,9 @@ seek($fh, 0, SEEK_SET) or BAIL_OUT $!;
run_script(['-mda', '--no-precheck'], $env, { 0 => $fh }) or BAIL_OUT '-mda';
run_script(['-index', "$home/v1test"]) or BAIL_OUT "index $?";
-ok(run_script([qw(-extindex --all), "$home/eindex"]), 'extindex init');
+ok(run_script([qw(-extindex --all), "$home/extindex"]), 'extindex init');
-my $es = PublicInbox::ExtSearch->new("$home/eindex");
+my $es = PublicInbox::ExtSearch->new("$home/extindex");
{
my $smsg = $es->over->get_art(1);
ok($smsg, 'got first article');
@@ -55,7 +55,7 @@ my $es = PublicInbox::ExtSearch->new("$home/eindex");
my $env = { MAIL_EDITOR => "$^X -i -p -e 's/test message/BEST MSG/'" };
my $cmd = [ qw(-edit -Ft/utf8.eml), "$home/v2test" ];
ok(run_script($cmd, $env, $opt), '-edit');
- ok(run_script([qw(-extindex --all), "$home/eindex"], undef, $opt),
+ ok(run_script([qw(-extindex --all), "$home/extindex"], undef, $opt),
'extindex again');
like($err, qr/discontiguous range/, 'warned about discontiguous range');
my $msg1 = $es->over->get_art(1) or BAIL_OUT 'msg1 missing';
* [PATCH 08/12] miscidx: cleanup git processes after manifest indexing
2020-11-23 7:05 [PATCH 00/12] extindex: speed up manifest.js.gz generation Eric Wong
` (6 preceding siblings ...)
2020-11-23 7:05 ` [PATCH 07/12] extsearch: fix remaining "eindex" references Eric Wong
@ 2020-11-23 7:05 ` Eric Wong
2020-11-23 7:05 ` [PATCH 09/12] miscidx: store absolute git_dir of each epoch in docdata Eric Wong
` (3 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: Eric Wong @ 2020-11-23 7:05 UTC (permalink / raw)
To: meta
We shouldn't leave "cat-file --batch" processes around when
we're done with an epoch or inbox, since there could be
many thousands.
---
lib/PublicInbox/ExtSearchIdx.pm | 1 +
lib/PublicInbox/MiscIdx.pm | 1 +
2 files changed, 2 insertions(+)
diff --git a/lib/PublicInbox/ExtSearchIdx.pm b/lib/PublicInbox/ExtSearchIdx.pm
index 708f8a3e..890ac282 100644
--- a/lib/PublicInbox/ExtSearchIdx.pm
+++ b/lib/PublicInbox/ExtSearchIdx.pm
@@ -311,6 +311,7 @@ sub _sync_inbox ($$$) {
}
index_todo($self, $sync, $_) for @{delete($sync->{todo}) // []};
$self->{midx}->index_ibx($ibx);
+ $ibx->git->cleanup; # done with this inbox, now
}
sub eidx_sync { # main entry point
diff --git a/lib/PublicInbox/MiscIdx.pm b/lib/PublicInbox/MiscIdx.pm
index 9dcc96b7..acb49ce7 100644
--- a/lib/PublicInbox/MiscIdx.pm
+++ b/lib/PublicInbox/MiscIdx.pm
@@ -107,6 +107,7 @@ EOF
if (my $ent = $git->manifest_entry($epoch, $desc)) {
$data->{"$pfx$epoch.git"} = $ent;
}
+ $git->cleanup; # ->modified starts cat-file --batch
}
} elsif (my $ent = $ibx->git->manifest_entry) { # v1
$data->{"/$ibx->{name}"} = $ent;
* [PATCH 09/12] miscidx: store absolute git_dir of each epoch in docdata
2020-11-23 7:05 [PATCH 00/12] extindex: speed up manifest.js.gz generation Eric Wong
` (7 preceding siblings ...)
2020-11-23 7:05 ` [PATCH 08/12] miscidx: cleanup git processes after manifest indexing Eric Wong
@ 2020-11-23 7:05 ` Eric Wong
2020-11-23 7:06 ` [PATCH 10/12] extsearchidx: do not short-circuit MiscIdx on no-op v2 prepare Eric Wong
` (2 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: Eric Wong @ 2020-11-23 7:05 UTC (permalink / raw)
To: meta
This will make it possible to map reference repos in case
somebody uses the feature.
---
lib/PublicInbox/MiscIdx.pm | 2 ++
1 file changed, 2 insertions(+)
diff --git a/lib/PublicInbox/MiscIdx.pm b/lib/PublicInbox/MiscIdx.pm
index acb49ce7..642d920b 100644
--- a/lib/PublicInbox/MiscIdx.pm
+++ b/lib/PublicInbox/MiscIdx.pm
@@ -106,10 +106,12 @@ EOF
my $git = $ibx->git_epoch($epoch) or return;
if (my $ent = $git->manifest_entry($epoch, $desc)) {
$data->{"$pfx$epoch.git"} = $ent;
+ $ent->{git_dir} = $git->{git_dir};
}
$git->cleanup; # ->modified starts cat-file --batch
}
} elsif (my $ent = $ibx->git->manifest_entry) { # v1
+ $ent->{git_dir} = $ibx->{inboxdir};
$data->{"/$ibx->{name}"} = $ent;
}
$doc->set_data(PublicInbox::Config::json()->encode($data));
* [PATCH 10/12] extsearchidx: do not short-circuit MiscIdx on no-op v2 prepare
2020-11-23 7:05 [PATCH 00/12] extindex: speed up manifest.js.gz generation Eric Wong
` (8 preceding siblings ...)
2020-11-23 7:05 ` [PATCH 09/12] miscidx: store absolute git_dir of each epoch in docdata Eric Wong
@ 2020-11-23 7:06 ` Eric Wong
2020-11-23 7:06 ` [PATCH 11/12] manifest: support faster generation via [extindex "all"] Eric Wong
2020-11-23 7:06 ` [PATCH 12/12] *search: simplify retry_reopen users Eric Wong
11 siblings, 0 replies; 13+ messages in thread
From: Eric Wong @ 2020-11-23 7:06 UTC (permalink / raw)
To: meta
This was intended to make development easier, but it also allows
description, URL, and address changes to be picked up
independently of message history.
---
lib/PublicInbox/ExtSearchIdx.pm | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/lib/PublicInbox/ExtSearchIdx.pm b/lib/PublicInbox/ExtSearchIdx.pm
index 890ac282..2cdc31cb 100644
--- a/lib/PublicInbox/ExtSearchIdx.pm
+++ b/lib/PublicInbox/ExtSearchIdx.pm
@@ -298,7 +298,7 @@ sub _sync_inbox ($$$) {
my $epoch_max;
defined($ibx->git_dir_latest(\$epoch_max)) or return;
$sync->{epoch_max} = $epoch_max;
- sync_prepare($self, $sync) or return; # fills $sync->{todo}
+ sync_prepare($self, $sync); # or return # TODO: once MiscIdx is stable
} elsif ($v == 1) {
my $uv = $ibx->uidvalidity;
my $lc = $self->{oidx}->eidx_meta("lc-v1:$ekey//$uv");
@@ -309,8 +309,10 @@ sub _sync_inbox ($$$) {
warn "E: $ekey unsupported inbox version (v$v)\n";
return;
}
- index_todo($self, $sync, $_) for @{delete($sync->{todo}) // []};
- $self->{midx}->index_ibx($ibx);
+ unless ($sync->{quit}) {
+ index_todo($self, $sync, $_) for @{delete($sync->{todo}) // []};
+ $self->{midx}->index_ibx($ibx) unless $sync->{quit};
+ }
$ibx->git->cleanup; # done with this inbox, now
}
* [PATCH 11/12] manifest: support faster generation via [extindex "all"]
2020-11-23 7:05 [PATCH 00/12] extindex: speed up manifest.js.gz generation Eric Wong
` (9 preceding siblings ...)
2020-11-23 7:06 ` [PATCH 10/12] extsearchidx: do not short-circuit MiscIdx on no-op v2 prepare Eric Wong
@ 2020-11-23 7:06 ` Eric Wong
2020-11-23 7:06 ` [PATCH 12/12] *search: simplify retry_reopen users Eric Wong
11 siblings, 0 replies; 13+ messages in thread
From: Eric Wong @ 2020-11-23 7:06 UTC (permalink / raw)
To: meta
For a mirror of lore.kernel.org with >140 inboxes, this speeds
up manifest.js.gz generation from ~1s to ~40ms on my hardware.
This is still unacceptable when dealing with thousands of inboxes,
but gets us closer to where we need to be.
---
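A minimal sketch of the fast path, assuming only core Perl and JSON::PP:
decode the stored docdata once, then fold each entry into the manifest
while tracking the newest mtime. The `%ctx` hash, the paths, and the
timestamps are made up for illustration; the logic mirrors what
`inject_entry` does in the diff below.

```perl
use strict;
use warnings;
use JSON::PP ();

# Hypothetical docdata, shaped like what the indexer stores per inbox
my $docdata = JSON::PP->new->canonical->encode({
	'/example/git/0.git' => {
		modified => 100, git_dir => '/srv/example/git/0.git' },
	'/example/git/1.git' => {
		modified => 250, git_dir => '/srv/example/git/1.git' },
});

my %ctx; # stand-in for the request context
my $data = JSON::PP->new->decode($docdata);
for my $url_path (sort keys %$data) {
	my $ent = $data->{$url_path};
	# move git_dir out of the entry before it lands in the manifest
	$ctx{abs2urlpath}->{delete $ent->{git_dir}} = $url_path;
	my $modified = $ent->{modified};
	$ctx{mtime} = $modified if $modified > ($ctx{mtime} // 0);
	$ctx{manifest}->{$url_path} = $ent;
}
print "newest mtime: $ctx{mtime}\n";
```

No git process is spawned here; everything comes from one JSON decode
per inbox, which is where the speedup is expected to come from.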
lib/PublicInbox/Config.pm | 3 +++
lib/PublicInbox/Inbox.pm | 2 ++
lib/PublicInbox/InboxWritable.pm | 2 --
lib/PublicInbox/ManifestJsGz.pm | 39 ++++++++++++++++++++++++++------
lib/PublicInbox/MiscSearch.pm | 19 ++++++++++++++++
5 files changed, 56 insertions(+), 9 deletions(-)
diff --git a/lib/PublicInbox/Config.pm b/lib/PublicInbox/Config.pm
index 039eb445..251008a3 100644
--- a/lib/PublicInbox/Config.pm
+++ b/lib/PublicInbox/Config.pm
@@ -94,6 +94,9 @@ sub lookup_ei {
$self->{-ei_by_name}->{$name} //= _fill_ei($self, "extindex.$name");
}
+# special case for [extindex "all"]
+sub ALL { lookup_ei($_[0], 'all') }
+
sub each_inbox {
my ($self, $cb, @arg) = @_;
# may auto-vivify if config file is non-existent:
diff --git a/lib/PublicInbox/Inbox.pm b/lib/PublicInbox/Inbox.pm
index a1a072ad..5a22e40d 100644
--- a/lib/PublicInbox/Inbox.pm
+++ b/lib/PublicInbox/Inbox.pm
@@ -429,4 +429,6 @@ sub on_unlock {
sub uidvalidity { $_[0]->{uidvalidity} //= $_[0]->mm->created_at }
+sub eidx_key { $_[0]->{newsgroup} // $_[0]->{inboxdir} }
+
1;
diff --git a/lib/PublicInbox/InboxWritable.pm b/lib/PublicInbox/InboxWritable.pm
index d3c255c7..e97c7e2d 100644
--- a/lib/PublicInbox/InboxWritable.pm
+++ b/lib/PublicInbox/InboxWritable.pm
@@ -319,6 +319,4 @@ sub git_dir_latest {
$latest;
}
-sub eidx_key { $_[0]->{newsgroup} // $_[0]->{inboxdir} }
-
1;
diff --git a/lib/PublicInbox/ManifestJsGz.pm b/lib/PublicInbox/ManifestJsGz.pm
index 3b436827..2c4a231d 100644
--- a/lib/PublicInbox/ManifestJsGz.pm
+++ b/lib/PublicInbox/ManifestJsGz.pm
@@ -21,6 +21,14 @@ sub url_regexp {
$ctx->SUPER::url_regexp('publicInbox.grokManifest', 'match=domain');
}
+sub inject_entry ($$$;$) {
+ my ($ctx, $url_path, $ent, $git_dir) = @_;
+ $ctx->{-abs2urlpath}->{$git_dir // delete $ent->{git_dir}} = $url_path;
+ my $modified = $ent->{modified};
+ $ctx->{-mtime} = $modified if $modified > ($ctx->{-mtime} // 0);
+ $ctx->{manifest}->{$url_path} = $ent;
+}
+
sub manifest_add ($$;$$) {
my ($ctx, $ibx, $epoch, $default_desc) = @_;
my $url_path = "/$ibx->{name}";
@@ -32,15 +40,10 @@ sub manifest_add ($$;$$) {
$git = $ibx->git;
}
my $ent = $git->manifest_entry($epoch, $default_desc) or return;
- $ctx->{-abs2urlpath}->{$git->{git_dir}} = $url_path;
- my $modified = $ent->{modified};
- if ($modified > ($ctx->{-mtime} // 0)) {
- $ctx->{-mtime} = $modified;
- }
- $ctx->{manifest}->{$url_path} = $ent;
+ inject_entry($ctx, $url_path, $ent, $git->{git_dir});
}
-sub ibx_entry {
+sub slow_manifest_add ($$) {
my ($ctx, $ibx) = @_;
eval {
if (defined(my $max = $ibx->max_git_epoch)) {
@@ -52,6 +55,28 @@ sub ibx_entry {
manifest_add($ctx, $ibx);
}
};
+}
+
+sub eidx_manifest_add ($$$) {
+ my ($ctx, $ALL, $ibx) = @_;
+ if (my $data = $ALL->misc->inbox_data($ibx)) {
+ $data = $json->decode($data);
+ while (my ($url_path, $ent) = each %$data) {
+ inject_entry($ctx, $url_path, $ent);
+ }
+ } else {
+ warn "E: `${\$ibx->eidx_key}' not indexed by $ALL->{topdir}\n";
+ }
+}
+
+sub ibx_entry {
+ my ($ctx, $ibx) = @_;
+ my $ALL = $ctx->{www}->{pi_config}->ALL;
+ if ($ALL) {
+ eidx_manifest_add($ctx, $ALL, $ibx);
+ } else {
+ slow_manifest_add($ctx, $ibx);
+ }
warn "E: $@" if $@;
}
diff --git a/lib/PublicInbox/MiscSearch.pm b/lib/PublicInbox/MiscSearch.pm
index 8beb8349..5a44d751 100644
--- a/lib/PublicInbox/MiscSearch.pm
+++ b/lib/PublicInbox/MiscSearch.pm
@@ -76,4 +76,23 @@ sub mset {
retry_reopen($self, \&misc_enquire_once, [ $self, $qr, $opt ]);
}
+sub ibx_data_once {
+ my ($self, $ibx) = @{$_[0]};
+ my $xdb = $self->{xdb};
+ my $eidx_key = $ibx->eidx_key; # may be {inboxdir}, so private
+ my $head = $xdb->postlist_begin('Q'.$eidx_key);
+ my $tail = $xdb->postlist_end('Q'.$eidx_key);
+ if ($head != $tail) {
+ my $doc = $xdb->get_document($head->get_docid);
+ $doc->get_data;
+ } else {
+ undef;
+ }
+}
+
+sub inbox_data {
+ my ($self, $ibx) = @_;
+ retry_reopen($self, \&ibx_data_once, [ $self, $ibx ]);
+}
+
1;
* [PATCH 12/12] *search: simplify retry_reopen users
2020-11-23 7:05 [PATCH 00/12] extindex: speed up manifest.js.gz generation Eric Wong
` (10 preceding siblings ...)
2020-11-23 7:06 ` [PATCH 11/12] manifest: support faster generation via [extindex "all"] Eric Wong
@ 2020-11-23 7:06 ` Eric Wong
11 siblings, 0 replies; 13+ messages in thread
From: Eric Wong @ 2020-11-23 7:06 UTC (permalink / raw)
To: meta
Every callback uses `$self', and creating short-lived
array references is not necessary when it's just as
easy to copy the array in Perl (unlike C).
---
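The calling convention after this change can be sketched as follows.
`retry_call` and `flaky_lookup` are hypothetical stand-ins, not the
real retry_reopen or its callbacks, and the "reopen" step is reduced
to a counter; only the argument passing is the point.

```perl
use strict;
use warnings;

# Hypothetical stand-in for retry_reopen: retries $cb up to 10 times,
# forwarding $self and any extra arguments as a flat list.
sub retry_call {
	my ($self, $cb, @arg) = @_;
	for my $i (1..10) {
		my $ret;
		eval { $ret = $cb->($self, @arg) };
		return $ret unless $@;
		# a real implementation would inspect $@ and reopen the DB
		$self->{reopened}++;
	}
	die "too many retries: $@";
}

# The callback unpacks @_ directly; no @{$_[0]} dereference is needed.
sub flaky_lookup {
	my ($self, $key) = @_;
	die "transient\n" if $self->{fail}-- > 0;
	$self->{data}->{$key};
}

my $obj = { fail => 2, reopened => 0, data => { a => 42 } };
my $val = retry_call($obj, \&flaky_lookup, 'a');
print "$val after $obj->{reopened} reopen(s)\n";
```

Callers no longer build a temporary `[ $self, ... ]` array reference,
and callbacks read their arguments like any other method.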
lib/PublicInbox/MiscSearch.pm | 8 ++++----
lib/PublicInbox/Search.pm | 10 +++++-----
2 files changed, 9 insertions(+), 9 deletions(-)
diff --git a/lib/PublicInbox/MiscSearch.pm b/lib/PublicInbox/MiscSearch.pm
index 5a44d751..48ef6914 100644
--- a/lib/PublicInbox/MiscSearch.pm
+++ b/lib/PublicInbox/MiscSearch.pm
@@ -50,7 +50,7 @@ sub mi_qp_new ($) {
}
sub misc_enquire_once { # retry_reopen callback
- my ($self, $qr, $opt) = @{$_[0]};
+ my ($self, $qr, $opt) = @_;
my $eq = $PublicInbox::Search::X{Enquire}->new($self->{xdb});
$eq->set_query($qr);
my $desc = !$opt->{asc};
@@ -73,11 +73,11 @@ sub mset {
$qs = 'type:inbox' if $qs eq '';
my $qr = $qp->parse_query($qs, $PublicInbox::Search::QP_FLAGS);
$opt->{relevance} = 1 unless exists $opt->{relevance};
- retry_reopen($self, \&misc_enquire_once, [ $self, $qr, $opt ]);
+ retry_reopen($self, \&misc_enquire_once, $qr, $opt);
}
sub ibx_data_once {
- my ($self, $ibx) = @{$_[0]};
+ my ($self, $ibx) = @_;
my $xdb = $self->{xdb};
my $eidx_key = $ibx->eidx_key; # may be {inboxdir}, so private
my $head = $xdb->postlist_begin('Q'.$eidx_key);
@@ -92,7 +92,7 @@ sub ibx_data_once {
sub inbox_data {
my ($self, $ibx) = @_;
- retry_reopen($self, \&ibx_data_once, [ $self, $ibx ]);
+ retry_reopen($self, \&ibx_data_once, $ibx);
}
1;
diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index 05d5a133..574bc145 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -291,15 +291,15 @@ sub mset {
}
sub retry_reopen {
- my ($self, $cb, $arg) = @_;
+ my ($self, $cb, @arg) = @_;
for my $i (1..10) {
if (wantarray) {
my @ret;
- eval { @ret = $cb->($arg) };
+ eval { @ret = $cb->($self, @arg) };
return @ret unless $@;
} else {
my $ret;
- eval { $ret = $cb->($arg) };
+ eval { $ret = $cb->($self, @arg) };
return $ret unless $@;
}
# Exception: The revision being read has been discarded -
@@ -319,7 +319,7 @@ sub retry_reopen {
sub _do_enquire {
my ($self, $query, $opts) = @_;
- retry_reopen($self, \&_enquire_once, [ $self, $query, $opts ]);
+ retry_reopen($self, \&_enquire_once, $query, $opts);
}
# returns true if all docs have the THREADID value
@@ -329,7 +329,7 @@ sub has_threadid ($) {
}
sub _enquire_once { # retry_reopen callback
- my ($self, $query, $opts) = @{$_[0]};
+ my ($self, $query, $opts) = @_;
my $xdb = xdb($self);
my $enquire = $X{Enquire}->new($xdb);
$enquire->set_query($query);