* [PATCH 00/10] start optimizing startup w/ ALL->misc
@ 2020-12-23 8:38 Eric Wong
2020-12-23 8:38 ` [PATCH 01/10] miscsearch: load Xapian at initialization Eric Wong
` (9 more replies)
0 siblings, 10 replies; 11+ messages in thread
From: Eric Wong @ 2020-12-23 8:38 UTC (permalink / raw)
To: meta
The -nntpd change in [PATCH 05/10] is the single most significant improvement.
There are also some cleanups, and some general improvements independent of
indexing in patches 7-10 (patch 8 is already superseded by 10,
but kept separate for documentation purposes).
PublicInbox::Config->new is now over twice as fast.
Eric Wong (10):
miscsearch: load Xapian at initialization
xt: add create-many-inboxes helper test
inbox: git_epoch: correct false comment
inboxwritable: _init_v1: set created_at ASAP
miscsearch: index UIDVALIDITY, use as startup cache
extsearchidx: close SQLite handles after attaching
config: _fill: inbox name extraction optimization
config: git_config_dump: pre-compile RE for split
config: config_fh_parse: micro-optimize
config: config_fh_parse: micro-optimize harder
MANIFEST | 1 +
lib/PublicInbox/Config.pm | 26 ++++-----
lib/PublicInbox/ExtSearchIdx.pm | 25 +++++---
lib/PublicInbox/Inbox.pm | 2 +-
lib/PublicInbox/InboxWritable.pm | 3 +-
lib/PublicInbox/MiscIdx.pm | 26 ++++++---
lib/PublicInbox/MiscSearch.pm | 57 ++++++++++++++++--
lib/PublicInbox/NNTPD.pm | 4 +-
lib/PublicInbox/Search.pm | 9 ++-
lib/PublicInbox/SearchIdx.pm | 7 ---
t/search.t | 4 +-
xt/create-many-inboxes.t | 99 ++++++++++++++++++++++++++++++++
12 files changed, 213 insertions(+), 50 deletions(-)
create mode 100644 xt/create-many-inboxes.t
* [PATCH 01/10] miscsearch: load Xapian at initialization
2020-12-23 8:38 [PATCH 00/10] start optimizing startup w/ ALL->misc Eric Wong
@ 2020-12-23 8:38 ` Eric Wong
2020-12-23 8:38 ` [PATCH 02/10] xt: add create-many-inboxes helper test Eric Wong
` (8 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: Eric Wong @ 2020-12-23 8:38 UTC (permalink / raw)
To: meta
The Xapian bindings must be loaded before calling
(Search::)Xapian::Database->new.
---
lib/PublicInbox/MiscSearch.pm | 1 +
1 file changed, 1 insertion(+)
diff --git a/lib/PublicInbox/MiscSearch.pm b/lib/PublicInbox/MiscSearch.pm
index f2e31443..de587d35 100644
--- a/lib/PublicInbox/MiscSearch.pm
+++ b/lib/PublicInbox/MiscSearch.pm
@@ -23,6 +23,7 @@ my %PROB_PREFIX = (
sub new {
my ($class, $dir) = @_;
+ PublicInbox::Search::load_xapian();
bless {
xdb => $PublicInbox::Search::X{Database}->new($dir)
}, $class;
* [PATCH 02/10] xt: add create-many-inboxes helper test
2020-12-23 8:38 [PATCH 00/10] start optimizing startup w/ ALL->misc Eric Wong
2020-12-23 8:38 ` [PATCH 01/10] miscsearch: load Xapian at initialization Eric Wong
@ 2020-12-23 8:38 ` Eric Wong
2020-12-23 8:38 ` [PATCH 03/10] inbox: git_epoch: correct false comment Eric Wong
` (7 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: Eric Wong @ 2020-12-23 8:38 UTC (permalink / raw)
To: meta
I've been using something like this to mock out thousands
of inboxes for testing.
---
MANIFEST | 1 +
xt/create-many-inboxes.t | 99 ++++++++++++++++++++++++++++++++++++++++
2 files changed, 100 insertions(+)
create mode 100644 xt/create-many-inboxes.t
diff --git a/MANIFEST b/MANIFEST
index ac442606..a4cdedff 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -394,6 +394,7 @@ t/x-unknown-alpine.eml
t/xcpdb-reshard.t
xt/cmp-msgstr.t
xt/cmp-msgview.t
+xt/create-many-inboxes.t
xt/eml_check_limits.t
xt/git-http-backend.t
xt/git_async_cmp.t
diff --git a/xt/create-many-inboxes.t b/xt/create-many-inboxes.t
new file mode 100644
index 00000000..c92643b2
--- /dev/null
+++ b/xt/create-many-inboxes.t
@@ -0,0 +1,99 @@
+#!perl -w
+# Copyright (C) 2020 all contributors <meta@public-inbox.org>
+# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
+use strict;
+use Test::More;
+use PublicInbox::TestCommon;
+use PublicInbox::Eml;
+use File::Path qw(mkpath);
+use IO::Handle (); # autoflush
+use POSIX qw(_exit);
+use Cwd qw(getcwd abs_path);
+use File::Spec;
+my $many_root = $ENV{TEST_MANY_ROOT} or
+ plan skip_all => 'TEST_MANY_ROOT not defined';
+my $cwd = getcwd();
+mkpath($many_root);
+-d $many_root or BAIL_OUT "$many_root: $!";
+$many_root = abs_path($many_root);
+$many_root =~ m!\A\Q$cwd\E/! and BAIL_OUT "$many_root must not be in $cwd";
+require_git 2.6;
+require_mods(qw(DBD::SQLite Search::Xapian));
+use_ok 'PublicInbox::V2Writable';
+my $nr_inbox = $ENV{NR_INBOX} // 10;
+my $nproc = $ENV{NPROC} || PublicInbox::V2Writable::detect_nproc() || 2;
+my $indexlevel = $ENV{TEST_INDEXLEVEL} // 'basic';
+diag "NR_INBOX=$nr_inbox NPROC=$nproc TEST_INDEXLEVEL=$indexlevel";
+diag "TEST_MANY_ROOT=$many_root";
+my $level_cfg = $indexlevel eq 'full' ? '' : "\tindexlevel = $indexlevel\n";
+my $pfx = "$many_root/$nr_inbox-$indexlevel";
+mkpath($pfx);
+open my $cfg_fh, '>>', "$pfx/config" or BAIL_OUT $!;
+$cfg_fh->autoflush(1);
+my $v2_init_add = sub {
+ my ($i) = @_;
+ my $ibx = PublicInbox::Inbox->new({
+ inboxdir => "$pfx/test-$i",
+ name => "test-$i",
+ newsgroup => "inbox.comp.test.foo.test-$i",
+ address => [ "test-$i\@example.com" ],
+ url => [ "//example.com/test-$i" ],
+ version => 2,
+ });
+ $ibx->{indexlevel} = $indexlevel if $level_cfg ne '';
+ my $entry = <<EOF;
+[publicinbox "$ibx->{name}"]
+ address = $ibx->{-primary_address}
+ url = $ibx->{url}->[0]
+ newsgroup = $ibx->{newsgroup}
+ inboxdir = $ibx->{inboxdir}
+EOF
+ $entry .= $level_cfg;
+ print $cfg_fh $entry or die $!;
+ my $v2w = PublicInbox::V2Writable->new($ibx, { nproc => 0 });
+ $v2w->init_inbox(0);
+ $v2w->add(PublicInbox::Eml->new(<<EOM));
+Date: Sat, 02 Oct 2010 00:00:00 +0000
+From: Lorelei <l\@example.com>
+To: test-$i\@example.com
+Message-ID: <20101002-000000-$i\@example.com>
+Subject: hello world $i
+
+hi
+EOM
+ $v2w->done;
+};
+
+my @children;
+for my $i (1..$nproc) {
+ my ($r, $w);
+ pipe($r, $w) or BAIL_OUT $!;
+ my $pid = fork;
+ defined $pid or BAIL_OUT "fork: $!";
+ if ($pid == 0) {
+ close $w;
+ while (my $i = <$r>) {
+ chomp $i;
+ $v2_init_add->($i);
+ }
+ _exit(0);
+ }
+ close $r or BAIL_OUT $!;
+ push @children, [ $w, $pid ];
+ $w->autoflush(1);
+}
+
+for my $i (0..$nr_inbox) {
+ print { $children[$i % @children]->[0] } "$i\n" or BAIL_OUT $!;
+}
+
+for my $c (@children) {
+ close $c->[0] or BAIL_OUT "close $!";
+}
+my $i = 0;
+for my $c (@children) {
+ my $pid = waitpid($c->[1], 0);
+ is($?, 0, ++$i.' exited ok');
+}
+ok(close($cfg_fh), 'config written');
+done_testing;
* [PATCH 03/10] inbox: git_epoch: correct false comment
2020-12-23 8:38 [PATCH 00/10] start optimizing startup w/ ALL->misc Eric Wong
2020-12-23 8:38 ` [PATCH 01/10] miscsearch: load Xapian at initialization Eric Wong
2020-12-23 8:38 ` [PATCH 02/10] xt: add create-many-inboxes helper test Eric Wong
@ 2020-12-23 8:38 ` Eric Wong
2020-12-23 8:38 ` [PATCH 04/10] inboxwritable: _init_v1: set created_at ASAP Eric Wong
` (6 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: Eric Wong @ 2020-12-23 8:38 UTC (permalink / raw)
To: meta
The original comment hasn't been true since
PublicInbox::Git->modified was changed to use cat_async blob
responses. In any case, manifest.js.gz generation already
cleans up per-epoch git processes used for ->modified.
---
lib/PublicInbox/Inbox.pm | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/lib/PublicInbox/Inbox.pm b/lib/PublicInbox/Inbox.pm
index 863a5de4..ec8469e3 100644
--- a/lib/PublicInbox/Inbox.pm
+++ b/lib/PublicInbox/Inbox.pm
@@ -132,7 +132,7 @@ sub git_epoch {
return unless -d $git_dir;
my $g = PublicInbox::Git->new($git_dir);
$g->{-httpbackend_limiter} = $self->{-httpbackend_limiter};
- # no cleanup needed, we never cat-file off this, only clone
+ # caller must manually cleanup when done
$g;
};
}
* [PATCH 04/10] inboxwritable: _init_v1: set created_at ASAP
2020-12-23 8:38 [PATCH 00/10] start optimizing startup w/ ALL->misc Eric Wong
` (2 preceding siblings ...)
2020-12-23 8:38 ` [PATCH 03/10] inbox: git_epoch: correct false comment Eric Wong
@ 2020-12-23 8:38 ` Eric Wong
2020-12-23 8:38 ` [PATCH 05/10] miscsearch: index UIDVALIDITY, use as startup cache Eric Wong
` (5 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: Eric Wong @ 2020-12-23 8:38 UTC (permalink / raw)
To: meta
This ensures we have UIDVALIDITY to index earlier
rather than later for v1 inboxes, matching v2 behavior.
---
lib/PublicInbox/InboxWritable.pm | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/lib/PublicInbox/InboxWritable.pm b/lib/PublicInbox/InboxWritable.pm
index c0e88f3d..69275bb0 100644
--- a/lib/PublicInbox/InboxWritable.pm
+++ b/lib/PublicInbox/InboxWritable.pm
@@ -46,12 +46,13 @@ sub _init_v1 {
require PublicInbox::Msgmap;
my $sidx = PublicInbox::SearchIdx->new($self, 1); # just create
$sidx->begin_txn_lazy;
+ my $mm = PublicInbox::Msgmap->new($self->{inboxdir}, 1);
if (defined $skip_artnum) {
- my $mm = PublicInbox::Msgmap->new($self->{inboxdir}, 1);
$mm->{dbh}->begin_work;
$mm->skip_artnum($skip_artnum);
$mm->{dbh}->commit;
}
+ undef $mm; # ->created_at set
$sidx->commit_txn_lazy;
} else {
open my $fh, '>>', "$self->{inboxdir}/ssoma.lock" or
* [PATCH 05/10] miscsearch: index UIDVALIDITY, use as startup cache
2020-12-23 8:38 [PATCH 00/10] start optimizing startup w/ ALL->misc Eric Wong
` (3 preceding siblings ...)
2020-12-23 8:38 ` [PATCH 04/10] inboxwritable: _init_v1: set created_at ASAP Eric Wong
@ 2020-12-23 8:38 ` Eric Wong
2020-12-23 8:38 ` [PATCH 06/10] extsearchidx: close SQLite handles after attaching Eric Wong
` (4 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: Eric Wong @ 2020-12-23 8:38 UTC (permalink / raw)
To: meta
This brings -nntpd startup time down from ~35s to ~5s with 50K
inboxes.
Further improvements ought to be possible with deeper changes to
MiscIdx, since -mda having to load every inbox seems unreasonable;
but this general change is fairly unintrusive.
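The cache merge in refresh_groups can be illustrated in isolation.
This is only a minimal sketch: the hashes below are hypothetical
stand-ins for a PublicInbox::Inbox object and for the return value of
nntpd_cache_load, and the values are made up. It shows how a cache
entry is folded into an inbox so that later accessors need not touch
the disk:

```perl
use strict;
use warnings;

# Hypothetical stand-in for an inbox object (a plain hashref here)
my $ibx = {
	name => 'test-1',
	newsgroup => 'inbox.comp.test.foo.test-1',
};

# Hypothetical cache, keyed by newsgroup name
my $cache = {
	'inbox.comp.test.foo.test-1' => {
		uidvalidity => 1608712680,
		'-modified' => 1608712700,
		description => 'test inbox 1',
	},
};

# Fold the cached fields into the inbox; keys already in %$ibx are
# kept unless the cache entry overrides them
if (my $ce = $cache->{$ibx->{newsgroup}}) {
	%$ibx = (%$ibx, %$ce);
}

print "$ibx->{name}: $ibx->{description} ($ibx->{uidvalidity})\n";
```

An inbox without a cache entry is left untouched and falls back to
the slower ->nntp_usable check, as in the patch.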
---
lib/PublicInbox/ExtSearchIdx.pm | 22 +++++++------
lib/PublicInbox/MiscIdx.pm | 26 ++++++++++-----
lib/PublicInbox/MiscSearch.pm | 56 ++++++++++++++++++++++++++++++---
lib/PublicInbox/NNTPD.pm | 4 ++-
lib/PublicInbox/Search.pm | 9 +++++-
lib/PublicInbox/SearchIdx.pm | 7 -----
t/search.t | 4 +--
7 files changed, 96 insertions(+), 32 deletions(-)
diff --git a/lib/PublicInbox/ExtSearchIdx.pm b/lib/PublicInbox/ExtSearchIdx.pm
index f04e0443..9d64ff5a 100644
--- a/lib/PublicInbox/ExtSearchIdx.pm
+++ b/lib/PublicInbox/ExtSearchIdx.pm
@@ -61,16 +61,20 @@ sub new {
sub attach_inbox {
my ($self, $ibx) = @_;
- my $key = $ibx->eidx_key;
- if (!$ibx->over || !$ibx->mm) {
- warn "W: skipping $key (unindexed)\n";
- return;
- }
- if (!defined($ibx->uidvalidity)) {
- warn "W: skipping $key (no UIDVALIDITY)\n";
- return;
+ my $ekey = $ibx->eidx_key;
+ my $misc = $self->{misc};
+ if ($misc && $misc->inbox_data($ibx)) { # all good if already indexed
+ } else {
+ if (!$ibx->over || !$ibx->mm) {
+ warn "W: skipping $ekey (unindexed)\n";
+ return;
+ }
+ if (!defined($ibx->uidvalidity)) {
+ warn "W: skipping $ekey (no UIDVALIDITY)\n";
+ return;
+ }
}
- $self->{ibx_map}->{$key} //= do {
+ $self->{ibx_map}->{$ekey} //= do {
push @{$self->{ibx_list}}, $ibx;
$ibx;
}
diff --git a/lib/PublicInbox/MiscIdx.pm b/lib/PublicInbox/MiscIdx.pm
index 64591d05..a04dd1c5 100644
--- a/lib/PublicInbox/MiscIdx.pm
+++ b/lib/PublicInbox/MiscIdx.pm
@@ -21,6 +21,7 @@ use Carp qw(croak);
use File::Path ();
use PublicInbox::MiscSearch;
use PublicInbox::Config;
+my $json;
sub new {
my ($class, $eidx) = @_;
@@ -30,6 +31,7 @@ sub new {
nodatacow_dir($mi_dir);
my $flags = $PublicInbox::SearchIdx::DB_CREATE_OR_OPEN;
$flags |= $PublicInbox::SearchIdx::DB_NO_SYNC if $eidx->{-no_fsync};
+ $json //= PublicInbox::Config::json();
bless {
mi_dir => $mi_dir,
flags => $flags,
@@ -91,17 +93,27 @@ EOF
$xdb->delete_document($_) for @drop; # just in case
my $doc = $PublicInbox::Search::X{Document}->new;
+ term_generator($self)->set_document($doc);
- # allow sorting by modified
+ # allow sorting by modified and uidvalidity (created at)
add_val($doc, $PublicInbox::MiscSearch::MODIFIED, $ibx->modified);
+ add_val($doc, $PublicInbox::MiscSearch::UIDVALIDITY, $ibx->uidvalidity);
- $doc->add_boolean_term('Q'.$eidx_key);
- $doc->add_boolean_term('T'.'inbox');
- term_generator($self)->set_document($doc);
+ $doc->add_boolean_term('Q'.$eidx_key); # uniQue id
+ $doc->add_boolean_term('T'.'inbox'); # Type
+
+ if (defined($ibx->{newsgroup}) && $ibx->nntp_usable) {
+ $doc->add_boolean_term('T'.'newsgroup'); # additional Type
+ }
+
+ # force reread from disk, {description} could be loaded from {misc}
+ delete $ibx->{description};
+ my $desc = $ibx->description;
# description = S/Subject (or title)
# address = A/Author
- index_text($self, $ibx->description, 1, 'S');
+ index_text($self, $desc, 1, 'S');
+ index_text($self, $ibx->{name}, 1, 'XNAME');
my %map = (
address => 'A',
listid => 'XLISTID',
@@ -113,10 +125,8 @@ EOF
index_text($self, $v, 1, $pfx);
}
}
- index_text($self, $ibx->{name}, 1, 'XNAME');
my $data = {};
if (defined(my $max = $ibx->max_git_epoch)) { # v2
- my $desc = $ibx->description;
my $pfx = "/$ibx->{name}/git/";
for my $epoch (0..$max) {
my $git = $ibx->git_epoch($epoch) or return;
@@ -130,7 +140,7 @@ EOF
$ent->{git_dir} = $ibx->{inboxdir};
$data->{"/$ibx->{name}"} = $ent;
}
- $doc->set_data(PublicInbox::Config::json()->encode($data));
+ $doc->set_data($json->encode($data));
if (defined $docid) {
$xdb->replace_document($docid, $doc);
} else {
diff --git a/lib/PublicInbox/MiscSearch.pm b/lib/PublicInbox/MiscSearch.pm
index de587d35..c6ce255f 100644
--- a/lib/PublicInbox/MiscSearch.pm
+++ b/lib/PublicInbox/MiscSearch.pm
@@ -5,10 +5,12 @@
package PublicInbox::MiscSearch;
use strict;
use v5.10.1;
-use PublicInbox::Search qw(retry_reopen);
+use PublicInbox::Search qw(retry_reopen int_val);
+my $json;
# Xapian value columns:
our $MODIFIED = 0;
+our $UIDVALIDITY = 1; # (created time)
# avoid conflicting with message Search::prob_prefix for UI/UX reasons
my %PROB_PREFIX = (
@@ -24,6 +26,7 @@ my %PROB_PREFIX = (
sub new {
my ($class, $dir) = @_;
PublicInbox::Search::load_xapian();
+ $json //= PublicInbox::Config::json();
bless {
xdb => $PublicInbox::Search::X{Database}->new($dir)
}, $class;
@@ -120,11 +123,13 @@ sub newsgroup_matches {
sub ibx_data_once {
my ($self, $ibx) = @_;
my $xdb = $self->{xdb};
- my $eidx_key = $ibx->eidx_key; # may be {inboxdir}, so private
- my $head = $xdb->postlist_begin('Q'.$eidx_key);
- my $tail = $xdb->postlist_end('Q'.$eidx_key);
+ my $term = 'Q'.$ibx->eidx_key; # may be {inboxdir}, so private
+ my $head = $xdb->postlist_begin($term);
+ my $tail = $xdb->postlist_end($term);
if ($head != $tail) {
my $doc = $xdb->get_document($head->get_docid);
+ $ibx->{uidvalidity} //= int_val($doc, $UIDVALIDITY);
+ $ibx->{-modified} = int_val($doc, $MODIFIED);
$doc->get_data;
} else {
undef;
@@ -136,4 +141,47 @@ sub inbox_data {
retry_reopen($self, \&ibx_data_once, $ibx);
}
+sub ibx_cache_load {
+ my ($doc, $cache) = @_;
+ my $end = $doc->termlist_end;
+ my $cur = $doc->termlist_begin;
+ $cur->skip_to('Q');
+ return if $cur == $end;
+ my $eidx_key = $cur->get_termname;
+ $eidx_key =~ s/\AQ// or return; # expired
+ my $ce = $cache->{$eidx_key} = {};
+ $ce->{uidvalidity} = int_val($doc, $UIDVALIDITY);
+ $ce->{-modified} = int_val($doc, $MODIFIED);
+ $ce->{description} = do {
+ # extract description from manifest.js.gz epoch description
+ my $d;
+ my $data = $json->decode($doc->get_data);
+ for (values %$data) {
+ $d = $_->{description} // next;
+ $d =~ s/ \[epoch [0-9]+\]\z// or next;
+ last;
+ }
+ $d;
+ }
+}
+
+sub _nntpd_cache_load { # retry_reopen callback
+ my ($self) = @_;
+ my $opt = { limit => $self->{xdb}->get_doccount * 10, relevance => -1 };
+ my $mset = mset($self, 'type:newsgroup type:inbox', $opt);
+ my $cache = {};
+ for my $it ($mset->items) {
+ ibx_cache_load($it->get_document, $cache);
+ }
+ $cache
+}
+
+# returns { newsgroup => $cache_entry } mapping, $cache_entry contains
+# anything which may trigger seeks at startup, currently: description,
+# -modified, and uidvalidity.
+sub nntpd_cache_load {
+ my ($self) = @_;
+ retry_reopen($self, \&_nntpd_cache_load);
+}
+
1;
diff --git a/lib/PublicInbox/NNTPD.pm b/lib/PublicInbox/NNTPD.pm
index 7f9a1d58..6907a03c 100644
--- a/lib/PublicInbox/NNTPD.pm
+++ b/lib/PublicInbox/NNTPD.pm
@@ -36,10 +36,12 @@ sub refresh_groups {
my ($self, $sig) = @_;
my $pi_cfg = $sig ? PublicInbox::Config->new : $self->{pi_cfg};
my $groups = $pi_cfg->{-by_newsgroup}; # filled during each_inbox
+ my $cache = eval { $pi_cfg->ALL->misc->nntpd_cache_load } // {};
$pi_cfg->each_inbox(sub {
my ($ibx) = @_;
my $ngname = $ibx->{newsgroup} // return;
- if ($ibx->nntp_usable) {
+ my $ce = $cache->{$ngname};
+ if (($ce and (%$ibx = (%$ibx, %$ce))) || $ibx->nntp_usable) {
# only valid if msgmap and over works
# preload to avoid fragmentation:
$ibx->description;
diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index b1d38fb9..05c679c9 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -6,7 +6,7 @@
package PublicInbox::Search;
use strict;
use parent qw(Exporter);
-our @EXPORT_OK = qw(retry_reopen);
+our @EXPORT_OK = qw(retry_reopen int_val);
use List::Util qw(max);
# values for searching, changing the numeric value breaks
@@ -91,6 +91,7 @@ sub load_xapian () {
1 : Search::Xapian::ENQ_ASCENDING();
*sortable_serialise = $x.'::sortable_serialise';
+ *sortable_unserialise = $x.'::sortable_unserialise';
# n.b. FLAG_PURE_NOT is expensive not suitable for a public
# website as it could become a denial-of-service vector
# FLAG_PHRASE also seems to cause performance problems chert
@@ -436,4 +437,10 @@ sub help {
\@ret;
}
+sub int_val ($$) {
+ my ($doc, $col) = @_;
+ my $val = $doc->get_value($col) or return; # undefined is '' in Xapian
+ sortable_unserialise($val) + 0; # PV => IV conversion
+}
+
1;
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index cf2c2c55..d1b0c724 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -106,7 +106,6 @@ sub load_xapian_writable () {
}
eval 'require '.$X->{WritableDatabase} or die;
*sortable_serialise = $xap.'::sortable_serialise';
- *sortable_unserialise = $xap.'::sortable_unserialise';
$DB_CREATE_OR_OPEN = eval($xap.'::DB_CREATE_OR_OPEN()');
$DB_OPEN = eval($xap.'::DB_OPEN()');
my $ver = (eval($xap.'::major_version()') << 16) |
@@ -501,12 +500,6 @@ sub remove_eidx_info {
$self->{xdb}->replace_document($docid, $doc);
}
-sub int_val ($$) {
- my ($doc, $col) = @_;
- my $val = $doc->get_value($col) or return; # undefined is '' in Xapian
- sortable_unserialise($val) + 0; # PV => IV conversion
-}
-
sub smsg_from_doc ($) {
my ($doc) = @_;
my $data = $doc->get_data or return;
diff --git a/t/search.t b/t/search.t
index 11143204..3754717d 100644
--- a/t/search.t
+++ b/t/search.t
@@ -332,13 +332,13 @@ $ibx->with_umask(sub {
like($smsg->{to}, qr/\blist\@example\.com\b/, 'to appears');
my $doc = $m->get_document;
my $col = PublicInbox::Search::BYTES();
- my $bytes = PublicInbox::SearchIdx::int_val($doc, $col);
+ my $bytes = PublicInbox::Search::int_val($doc, $col);
like($bytes, qr/\A[0-9]+\z/, '$bytes stored as digit');
ok($bytes > 0, '$bytes is > 0');
is($bytes, $smsg->{bytes}, 'bytes Xapian value matches Over');
$col = PublicInbox::Search::UID();
- my $uid = PublicInbox::SearchIdx::int_val($doc, $col);
+ my $uid = PublicInbox::Search::int_val($doc, $col);
is($uid, $smsg->{num}, 'UID column matches {num}');
is($uid, $m->get_docid, 'UID column matches docid');
}
* [PATCH 06/10] extsearchidx: close SQLite handles after attaching
2020-12-23 8:38 [PATCH 00/10] start optimizing startup w/ ALL->misc Eric Wong
` (4 preceding siblings ...)
2020-12-23 8:38 ` [PATCH 05/10] miscsearch: index UIDVALIDITY, use as startup cache Eric Wong
@ 2020-12-23 8:38 ` Eric Wong
2020-12-23 8:38 ` [PATCH 07/10] config: _fill: inbox name extraction optimization Eric Wong
` (3 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: Eric Wong @ 2020-12-23 8:38 UTC (permalink / raw)
To: meta
This is needed to prevent us from running out of FDs when
indexing many inboxes. Perhaps checking these on attach_inbox
is unnecessary and may be removed entirely down the line.
---
lib/PublicInbox/ExtSearchIdx.pm | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/lib/PublicInbox/ExtSearchIdx.pm b/lib/PublicInbox/ExtSearchIdx.pm
index 9d64ff5a..fb627089 100644
--- a/lib/PublicInbox/ExtSearchIdx.pm
+++ b/lib/PublicInbox/ExtSearchIdx.pm
@@ -65,11 +65,14 @@ sub attach_inbox {
my $misc = $self->{misc};
if ($misc && $misc->inbox_data($ibx)) { # all good if already indexed
} else {
- if (!$ibx->over || !$ibx->mm) {
+ my @sqlite = ($ibx->over, $ibx->mm);
+ my $uidvalidity = $ibx->uidvalidity;
+ $ibx->{mm} = $ibx->{over} = undef;
+ if (scalar(@sqlite) != 2) {
warn "W: skipping $ekey (unindexed)\n";
return;
}
- if (!defined($ibx->uidvalidity)) {
+ if (!defined($uidvalidity)) {
warn "W: skipping $ekey (no UIDVALIDITY)\n";
return;
}
* [PATCH 07/10] config: _fill: inbox name extraction optimization
2020-12-23 8:38 [PATCH 00/10] start optimizing startup w/ ALL->misc Eric Wong
` (5 preceding siblings ...)
2020-12-23 8:38 ` [PATCH 06/10] extsearchidx: close SQLite handles after attaching Eric Wong
@ 2020-12-23 8:38 ` Eric Wong
2020-12-23 8:38 ` [PATCH 08/10] config: git_config_dump: pre-compile RE for split Eric Wong
` (2 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: Eric Wong @ 2020-12-23 8:38 UTC (permalink / raw)
To: meta
Using substr() instead of a string copy + s// substitution here
reduces ->fill_all from 4.00s to 3.88s with 50K inboxes on my
workstation.
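A minimal standalone comparison of the two approaches, assuming a
hypothetical $pfx value shaped like the section prefixes seen in
_fill:

```perl
use strict;
use warnings;

# Hypothetical section prefix; real values come from the config file
my $pfx = 'publicinbox.test-1';

# Old approach: copy the string, then strip the prefix with a regexp
my $old = $pfx;
$old =~ s/\Apublicinbox\.//;

# New approach: skip the fixed-length prefix with substr()
my $new = substr($pfx, length('publicinbox.'));

print "old=$old new=$new\n";
```

Note that substr() does not verify the prefix, so this is only
equivalent when $pfx is already known to start with "publicinbox.".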
---
lib/PublicInbox/Config.pm | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/lib/PublicInbox/Config.pm b/lib/PublicInbox/Config.pm
index 577337dc..cd8957a1 100644
--- a/lib/PublicInbox/Config.pm
+++ b/lib/PublicInbox/Config.pm
@@ -424,9 +424,7 @@ EOF
}
}
- my $name = $pfx;
- $name =~ s/\Apublicinbox\.//;
-
+ my $name = substr($pfx, length('publicinbox.'));
if (!valid_inbox_name($name)) {
warn "invalid inbox name: '$name'\n";
return;
* [PATCH 08/10] config: git_config_dump: pre-compile RE for split
2020-12-23 8:38 [PATCH 00/10] start optimizing startup w/ ALL->misc Eric Wong
` (6 preceding siblings ...)
2020-12-23 8:38 ` [PATCH 07/10] config: _fill: inbox name extraction optimization Eric Wong
@ 2020-12-23 8:38 ` Eric Wong
2020-12-23 8:38 ` [PATCH 09/10] config: config_fh_parse: micro-optimize Eric Wong
2020-12-23 8:38 ` [PATCH 10/10] config: config_fh_parse: micro-optimize harder Eric Wong
9 siblings, 0 replies; 11+ messages in thread
From: Eric Wong @ 2020-12-23 8:38 UTC (permalink / raw)
To: meta
It appears the Perl split() operator is not optimized for fixed
strings at all. With this change, PublicInbox::Config->new (w/o
->fill_all) time is reduced from 1.81s to 1.22s on a config file
with 50K inboxes.
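As a sanity check, the sketch below (using a hypothetical record, not
real config data) shows that a plain string separator and a
pre-compiled regexp split a record identically; only the speed is
expected to differ:

```perl
use strict;
use warnings;

# Hypothetical "key\nvalue" record, shaped like `git config -z -l` output
my $line = "publicinbox.test-1.address\ntest-1\@example.com";

my @results;
for my $fs ("\n", qr/\n/) { # plain string vs. pre-compiled regexp
	my ($k, $v) = split($fs, $line, 2);
	push @results, "$k=$v";
}

# both separators yield the same key and value
print $results[0] eq $results[1] ? "same\n" : "different\n";
```

The core Benchmark module could be used to measure the difference on
a given perl; no timings are claimed for this sketch.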
---
lib/PublicInbox/Config.pm | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/lib/PublicInbox/Config.pm b/lib/PublicInbox/Config.pm
index cd8957a1..4d143c6e 100644
--- a/lib/PublicInbox/Config.pm
+++ b/lib/PublicInbox/Config.pm
@@ -165,7 +165,7 @@ sub git_config_dump {
return {} unless -e $file;
my $cmd = [ qw(git config -z -l --includes), "--file=$file" ];
my $fh = popen_rd($cmd);
- my $rv = config_fh_parse($fh, "\0", "\n");
+ my $rv = config_fh_parse($fh, "\0", qr/\n/);
close $fh or die "failed to close (@$cmd) pipe: $?";
$rv;
}
* [PATCH 09/10] config: config_fh_parse: micro-optimize
2020-12-23 8:38 [PATCH 00/10] start optimizing startup w/ ALL->misc Eric Wong
` (7 preceding siblings ...)
2020-12-23 8:38 ` [PATCH 08/10] config: git_config_dump: pre-compile RE for split Eric Wong
@ 2020-12-23 8:38 ` Eric Wong
2020-12-23 8:38 ` [PATCH 10/10] config: config_fh_parse: micro-optimize harder Eric Wong
9 siblings, 0 replies; 11+ messages in thread
From: Eric Wong @ 2020-12-23 8:38 UTC (permalink / raw)
To: meta
We can avoid a slow regexp capture and instead rely on
rindex + substr to extract the section from the config file.
Then we use the defined-or-assignment (//=) operator combined
with the documented return value of `push' to ensure @section_order
is unique without repeating a hash lookup.
Finally, we avoid short-lived variables inside the loop and
declare them subroutine-wide to knock a teeny bit off allocation
time.
Combined, these optimizations bring the ~1.22s
PublicInbox::Config->new time down to ~0.98s with 50K inboxes.
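The section extraction and the //= + push idiom can be tried on their
own. This is a sketch with hypothetical keys, not the actual
config_fh_parse loop:

```perl
use strict;
use warnings;

# Hypothetical keys, shaped like `git config -l` output
my @keys = qw(publicinbox.a.address publicinbox.a.url
	publicinbox.b.address core.bare);

my (%section_seen, @section_order);
for my $k (@keys) {
	# everything before the last '.' is the section name
	my $section = substr($k, 0, rindex($k, '.'));

	# push() returns the new element count, which is always defined,
	# so //= only evaluates push() the first time a section is seen
	$section_seen{$section} //= push(@section_order, $section);
}

print join(',', @section_order), "\n";
```

With these keys, @section_order ends up as publicinbox.a,
publicinbox.b and core, each listed once in first-seen order.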
---
lib/PublicInbox/Config.pm | 17 ++++++-----------
1 file changed, 6 insertions(+), 11 deletions(-)
diff --git a/lib/PublicInbox/Config.pm b/lib/PublicInbox/Config.pm
index 4d143c6e..60107d45 100644
--- a/lib/PublicInbox/Config.pm
+++ b/lib/PublicInbox/Config.pm
@@ -132,20 +132,15 @@ sub default_file {
sub config_fh_parse ($$$) {
my ($fh, $rs, $fs) = @_;
- my %rv;
- my (%section_seen, @section_order);
+ my (%rv, %section_seen, @section_order, $line, $k, $v, $section, $cur);
local $/ = $rs;
- while (defined(my $line = <$fh>)) {
+ while (defined($line = <$fh>)) { # performance critical with giant configs
chomp $line;
- my ($k, $v) = split($fs, $line, 2);
- my ($section) = ($k =~ /\A(\S+)\.[^\.]+\z/);
- unless (defined $section_seen{$section}) {
- $section_seen{$section} = 1;
- push @section_order, $section;
- }
+ ($k, $v) = split($fs, $line, 2);
+ $section = substr($k, 0, rindex($k, '.'));
+ $section_seen{$section} //= push(@section_order, $section);
- my $cur = $rv{$k};
- if (defined $cur) {
+ if (defined($cur = $rv{$k})) {
if (ref($cur) eq "ARRAY") {
push @$cur, $v;
} else {
* [PATCH 10/10] config: config_fh_parse: micro-optimize harder
2020-12-23 8:38 [PATCH 00/10] start optimizing startup w/ ALL->misc Eric Wong
` (8 preceding siblings ...)
2020-12-23 8:38 ` [PATCH 09/10] config: config_fh_parse: micro-optimize Eric Wong
@ 2020-12-23 8:38 ` Eric Wong
9 siblings, 0 replies; 11+ messages in thread
From: Eric Wong @ 2020-12-23 8:38 UTC (permalink / raw)
To: meta
Instead of relying on split() and a regexp, we'll drop split()
entirely and rely on index() + two substr() calls to operate on
fixed strings. This brings PublicInbox::Config->new time down
from 0.98s to 0.84s.
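The index() + substr() parsing can be exercised against an in-memory
buffer. This is a simplified sketch: the sample data is hypothetical,
it assumes every record contains the field separator, and it does not
handle repeated keys the way config_fh_parse does:

```perl
use strict;
use warnings;

# Hypothetical sample in `git config -z -l` format:
# key, "\n", value, then a terminating "\0" per record
my $buf = "publicinbox.a.address\na\@example.com\0" .
	"publicinbox.a.inboxdir\n/path/to/a\0";
open my $fh, '<', \$buf or die "open: $!";

my ($rs, $fs) = ("\0", "\n"); # record and field separators
my %rv;
{
	local $/ = $rs; # read one NUL-terminated record at a time
	while (defined(my $line = <$fh>)) {
		my $i = index($line, $fs);
		my $k = substr($line, 0, $i);
		# the -1 length drops the trailing record separator
		my $v = substr($line, $i + 1, -1);
		$rv{$k} = $v;
	}
}
close $fh;

print "$_=$rv{$_}\n" for sort keys %rv;
```

Because the trailing separator is dropped by the negative length
passed to substr(), no separate chomp is needed.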
---
lib/PublicInbox/Config.pm | 13 +++++++------
1 file changed, 7 insertions(+), 6 deletions(-)
diff --git a/lib/PublicInbox/Config.pm b/lib/PublicInbox/Config.pm
index 60107d45..21f2161a 100644
--- a/lib/PublicInbox/Config.pm
+++ b/lib/PublicInbox/Config.pm
@@ -132,13 +132,14 @@ sub default_file {
sub config_fh_parse ($$$) {
my ($fh, $rs, $fs) = @_;
- my (%rv, %section_seen, @section_order, $line, $k, $v, $section, $cur);
+ my (%rv, %seen, @section_order, $line, $k, $v, $section, $cur, $i);
local $/ = $rs;
- while (defined($line = <$fh>)) { # performance critical with giant configs
- chomp $line;
- ($k, $v) = split($fs, $line, 2);
+ while (defined($line = <$fh>)) { # perf critical with giant configs
+ $i = index($line, $fs);
+ $k = substr($line, 0, $i);
+ $v = substr($line, $i + 1, -1); # chop off $rs
$section = substr($k, 0, rindex($k, '.'));
- $section_seen{$section} //= push(@section_order, $section);
+ $seen{$section} //= push(@section_order, $section);
if (defined($cur = $rv{$k})) {
if (ref($cur) eq "ARRAY") {
@@ -160,7 +161,7 @@ sub git_config_dump {
return {} unless -e $file;
my $cmd = [ qw(git config -z -l --includes), "--file=$file" ];
my $fh = popen_rd($cmd);
- my $rv = config_fh_parse($fh, "\0", qr/\n/);
+ my $rv = config_fh_parse($fh, "\0", "\n");
close $fh or die "failed to close (@$cmd) pipe: $?";
$rv;
}