Some miscellaneous help and cleanup things, too. Document data is no longer read from Xapian by read-only daemons; that data is redundant given over.sqlite3 always exists. This should should improve page cache hit rates for over.sqlite3 by a small bit. Being able to mass load a bunch of rows from SQLite speeds up the default search summary view by ~10%, too. --skip-docdata option to -init and -index can avoid writing Xapian document data, saving ~1.5% in Xapian space overhead (and associated I/O and page cache overheads). It breaks rollbacks to old versions, though, so it won't be the default. Eric Wong (23): doc: note -compact and -xcpdb are rarely used admin: progress shows the inbox being indexed compact: support --help/-? and perform lazy loading init: support --help and -? init: support --newsgroup option init: drop -N alias for --skip-artnum search: v2: ensure shards are numerically sorted xapcmd: simplify {reindex} parameter passing www: reduce long-lived PublicInbox::Search references search: improve comments around constants search: export mdocid subroutine searchquery: split off from searchview search: make qparse_new an internal function smsg: reduce utf8::decode call sites searchview: use over.sqlite3 instead of Xapian docdata searchview: speed up search summary by ~10% searchview: convert nested and Atom display to over.sqlite3 extmsg: avoid using Xapian docdata mbox: avoid Xapian docdata in search results smsg: remove from_mitem t/nntpd-v2: set PI_TEST_VERSION=2 properly init+index: support --skip-docdata for Xapian search: add mset_to_artnums method Documentation/public-inbox-compact.pod | 5 + Documentation/public-inbox-config.pod | 2 +- Documentation/public-inbox-index.pod | 8 ++ Documentation/public-inbox-init.pod | 37 +++++++- Documentation/public-inbox-xcpdb.pod | 3 + MANIFEST | 1 + lib/PublicInbox/Admin.pm | 15 ++- lib/PublicInbox/ExtMsg.pm | 19 ++-- lib/PublicInbox/IMAP.pm | 5 +- lib/PublicInbox/Inbox.pm | 11 ++- lib/PublicInbox/Mbox.pm | 24 +++-- lib/PublicInbox/Over.pm | 28 +++--- lib/PublicInbox/Search.pm | 125 ++++++++++++++----------- lib/PublicInbox/SearchIdx.pm | 35 +++++-- lib/PublicInbox/SearchIdxShard.pm | 2 +- lib/PublicInbox/SearchQuery.pm | 53 +++++++++++ lib/PublicInbox/SearchView.pm | 90 ++++-------------- lib/PublicInbox/Smsg.pm | 10 +- lib/PublicInbox/Xapcmd.pm | 20 ++-- script/public-inbox-compact | 39 ++++++-- script/public-inbox-convert | 3 +- script/public-inbox-index | 7 +- script/public-inbox-init | 101 +++++++++++++------- t/imapd.t | 6 +- t/inbox_idle.t | 2 +- t/index-git-times.t | 11 ++- t/init.t | 17 +++- t/nntpd-v2.t | 2 +- t/nntpd.t | 9 +- t/search.t | 34 ++++--- 30 files changed, 437 insertions(+), 287 deletions(-) create mode 100644 lib/PublicInbox/SearchQuery.pm
Slowly improving the learning curve... --- Documentation/public-inbox-compact.pod | 5 +++++ Documentation/public-inbox-xcpdb.pod | 3 +++ 2 files changed, 8 insertions(+) diff --git a/Documentation/public-inbox-compact.pod b/Documentation/public-inbox-compact.pod index 8e463ab1..4e9b6d9f 100644 --- a/Documentation/public-inbox-compact.pod +++ b/Documentation/public-inbox-compact.pod @@ -19,6 +19,11 @@ It enforces the use of the C<--no-renumber> option of L<xapian-compact(1)> which is required to work with the rest of the public-inbox search code. +This command is rarely needed for active inboxes. + +Using the C<--compact> option of L<public-inbox-index(1)> +is recommended, instead, and only when doing a C<--reindex>. + =head1 OPTIONS =over diff --git a/Documentation/public-inbox-xcpdb.pod b/Documentation/public-inbox-xcpdb.pod index 2b38f1da..52939894 100644 --- a/Documentation/public-inbox-xcpdb.pod +++ b/Documentation/public-inbox-xcpdb.pod @@ -17,6 +17,9 @@ accept parallel Xapian database modifications from L<public-inbox-watch(1)>, L<public-inbox-mda(1)>, L<public-inbox-learn(1)>, and L<public-inbox-index(1)>. +This command is rarely used, as Xapian DB formats rarely +change. + =head1 OPTIONS =over
This is helpful with --all, or when multiple inboxes are being indexed. --- lib/PublicInbox/Admin.pm | 3 +++ 1 file changed, 3 insertions(+) diff --git a/lib/PublicInbox/Admin.pm b/lib/PublicInbox/Admin.pm index d99a00b4..f5427af7 100644 --- a/lib/PublicInbox/Admin.pm +++ b/lib/PublicInbox/Admin.pm @@ -207,6 +207,9 @@ sub index_terminate { sub index_inbox { my ($ibx, $im, $opt) = @_; my $jobs = delete $opt->{jobs} if $opt; + if (my $pr = $opt->{-progress}) { + $pr->("indexing $ibx->{inboxdir} ...\n"); + } local %SIG = %SIG; setup_signals(\&index_terminate, $ibx); if (ref($ibx) && $ibx->version == 2) {
This probably won't be used much, but --help can still make sense. --- script/public-inbox-compact | 39 +++++++++++++++++++++++++++---------- 1 file changed, 29 insertions(+), 10 deletions(-) diff --git a/script/public-inbox-compact b/script/public-inbox-compact index b5fa0086..a6bb62bd 100755 --- a/script/public-inbox-compact +++ b/script/public-inbox-compact @@ -1,18 +1,37 @@ -#!/usr/bin/perl -w +#!perl -w # Copyright (C) 2018-2020 all contributors <meta@public-inbox.org> # License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt> use strict; -use warnings; +use v5.10.1; use Getopt::Long qw(:config gnu_getopt no_ignore_case auto_abbrev); -use PublicInbox::InboxWritable; -use PublicInbox::Xapcmd; -use PublicInbox::Admin; -PublicInbox::Admin::require_or_die('-index'); -my $usage = "Usage: public-inbox-compact INBOX_DIR\n"; +my $usage = 'public-inbox-compact INBOX_DIR'; my $opt = { compact => 1, -coarse_lock => 1 }; -GetOptions($opt, qw(all), @PublicInbox::Xapcmd::COMPACT_OPT) or - die "bad command-line args\n$usage"; -my @ibxs = PublicInbox::Admin::resolve_inboxes(\@ARGV, $opt) or die $usage; +my $help = <<EOF; # the following should fit w/o scrolling in 80x24 term: +Usage: $usage + + Compact Xapian DBs in an inbox + +options: + + --all index all configured inboxes + --jobs=NUM control parallelization + +See public-inbox-compact(1) man page for full documentation. +EOF +GetOptions($opt, qw(all help|?), + # compact options: + qw(jobs|j=i quiet|q blocksize|b=s no-full|n fuller|F), +) or die "bad command-line args\n$usage\n"; +if ($opt->{help}) { print $help; exit 0 }; + +require PublicInbox::Admin; +PublicInbox::Admin::require_or_die('-index'); +PublicInbox::Admin::progress_prepare($opt); + +require PublicInbox::InboxWritable; +require PublicInbox::Xapcmd; +my @ibxs = PublicInbox::Admin::resolve_inboxes(\@ARGV, $opt); +unless (@ibxs) { print STDERR "Usage: $usage\n"; exit 1 } foreach (@ibxs) { my $ibx = PublicInbox::InboxWritable->new($_); PublicInbox::Xapcmd::run($ibx, 'compact', $opt);
And speed those up with some lazy loading, too. --- script/public-inbox-init | 79 +++++++++++++++++++++++++--------------- 1 file changed, 50 insertions(+), 29 deletions(-) diff --git a/script/public-inbox-init b/script/public-inbox-init index 1c8066df..6852f64a 100755 --- a/script/public-inbox-init +++ b/script/public-inbox-init @@ -1,57 +1,76 @@ -#!/usr/bin/perl -w +#!perl -w # Copyright (C) 2014-2020 all contributors <meta@public-inbox.org> # License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt> -# -# Initializes a public-inbox, basically a wrapper for git-init(1) use strict; -use warnings; -sub usage { - print STDERR <<EOF; -Usage: public-inbox-init NAME INBOX_DIR HTTP_URL ADDRESS [ADDRESS..] -EOF - exit 1; -} +use v5.10.1; use Getopt::Long qw/:config gnu_getopt no_ignore_case auto_abbrev/; -use PublicInbox::Admin; -PublicInbox::Admin::require_or_die('-base'); -use PublicInbox::Config; -use PublicInbox::InboxWritable; -use PublicInbox::Import; -use File::Temp; -use PublicInbox::Lock; -use File::Basename qw/dirname/; -use File::Path qw/mkpath/; use Fcntl qw(:DEFAULT); -use Cwd qw/abs_path/; +my $usage = 'public-inbox-init NAME INBOX_DIR HTTP_URL ADDRESS [ADDRESS..]'; +my $help = <<EOF; # the following should fit w/o scrolling in 80x24 term: +usage: $usage + + Initialize a public-inbox + +required arguments: + + NAME the name of the inbox + INBOX_DIR pathname the inbox + HTTP_URL HTTP (or HTTPS) URL + ADDRESS email address(es), may be specified multiple times + +options: -my ($version, $indexlevel, $skip_epoch, $skip_artnum, $jobs); + -V2 use scalable public-inbox-v2-format(5) + -L LEVEL index level `basic', `medium', or `full' (default: full) + --skip-artnum=NUM NNTP article numbers to skip + --skip-epoch=NUM epochs to skip (-V2 only) + -J JOBS number of indexing jobs (-V2 only), (default: 4) + +See public-inbox-init(1) man page for full documentation. +EOF + +require PublicInbox::Admin; +PublicInbox::Admin::require_or_die('-base'); + +my ($version, $indexlevel, $skip_epoch, $skip_artnum, $jobs, $show_help); my %opts = ( 'V|version=i' => \$version, 'L|index-level|indexlevel=s' => \$indexlevel, 'S|skip|skip-epoch=i' => \$skip_epoch, 'N|skip-artnum=i' => \$skip_artnum, 'j|jobs=i' => \$jobs, + 'help|?' => \$show_help, ); -GetOptions(%opts) or usage(); +my $usage_cb = sub { + print STDERR "Usage: $usage\n"; + exit 1; +}; +GetOptions(%opts) or $usage_cb->(); +if ($show_help) { print $help; exit 0 }; PublicInbox::Admin::indexlevel_ok_or_die($indexlevel) if defined $indexlevel; -my $name = shift @ARGV or usage(); -my $inboxdir = shift @ARGV or usage(); -my $http_url = shift @ARGV or usage(); +my $name = shift @ARGV or $usage_cb->(); +my $inboxdir = shift @ARGV or $usage_cb->(); +my $http_url = shift @ARGV or $usage_cb->(); my (@address) = @ARGV; -@address or usage(); +@address or $usage_cb->(); my %seen; +require PublicInbox::Config; my $pi_config = PublicInbox::Config->default_file; -my $dir = dirname($pi_config); -mkpath($dir); # will croak on fatal errors +require File::Basename; +my $dir = File::Basename::dirname($pi_config); +require File::Path; +File::Path::mkpath($dir); # will croak on fatal errors # first, we grab a flock to prevent simultaneous public-inbox-init # processes from trampling over each other, or exiting with 255 on # O_EXCL failure below. This gets unlocked automatically on exit: +require PublicInbox::Lock; my $lock_obj = { lock_path => "$pi_config.flock" }; PublicInbox::Lock::lock_acquire($lock_obj); # git-config will operate on this (and rename on success): +require File::Temp; my $fh = File::Temp->new(TEMPLATE => 'pi-init-XXXXXXXX', DIR => $dir); # Now, we grab another lock to use git-config(1) locking, so it won't @@ -111,7 +130,8 @@ close($fh) or die "failed to close $pi_config_tmp: $!\n"; my $pfx = "publicinbox.$name"; my @x = (qw/git config/, "--file=$pi_config_tmp"); -$inboxdir = abs_path($inboxdir); +require Cwd; +$inboxdir = Cwd::abs_path($inboxdir); die "`\\n' not allowed in `$inboxdir'\n" if $inboxdir =~ /\n/s; if (-f "$inboxdir/inbox.lock") { if (!defined $version) { @@ -148,6 +168,7 @@ if (defined $jobs) { $creat_opt->{nproc} = $jobs; } +require PublicInbox::InboxWritable; $ibx = PublicInbox::InboxWritable->new($ibx, $creat_opt); $ibx->init_inbox(0, $skip_epoch, $skip_artnum);
We can reduce the need to edit the config file for NNTP group names this way. --- Documentation/public-inbox-config.pod | 2 +- Documentation/public-inbox-init.pod | 25 +++++++++++++++++++++---- script/public-inbox-init | 12 ++++++++++-- t/imapd.t | 6 ++---- t/nntpd.t | 9 +++------ 5 files changed, 37 insertions(+), 17 deletions(-) diff --git a/Documentation/public-inbox-config.pod b/Documentation/public-inbox-config.pod index 05b84819..1dfb926e 100644 --- a/Documentation/public-inbox-config.pod +++ b/Documentation/public-inbox-config.pod @@ -63,7 +63,7 @@ Default: none, optional =item publicinbox.<name>.newsgroup The NNTP group name for use with L<public-inbox-nntpd(8)>. This -may be any newsgroup name with hierarchies delimited by '.'. +may be any newsgroup name with hierarchies delimited by C<.>. For example, the newsgroup for L<mailto:meta@public-inbox.org> is: C<inbox.comp.mail.public-inbox.meta> diff --git a/Documentation/public-inbox-init.pod b/Documentation/public-inbox-init.pod index d0c87563..240959eb 100644 --- a/Documentation/public-inbox-init.pod +++ b/Documentation/public-inbox-init.pod @@ -39,6 +39,21 @@ See L<public-inbox-config(5)> for more information. Default: C<full> +=item --ng, --newsgroup NEWSGROUP + +The NNTP group name for use with L<public-inbox-nntpd(8)>. This +may be any newsgroup name with hierarchies delimited by C<.>. +For example, the newsgroup for L<mailto:meta@public-inbox.org> +is: C<inbox.comp.mail.public-inbox.meta> + +This may be set after-the-fact via C<publicinbox.$NAME.newsgroup> +in the configuration file. See L<public-inbox-config(5)> for more +info. + +Available since public-inbox 1.6.0 (PENDING). + +Default: none. + =item -N, --skip-artnum This option allows archivists to publish incomplete archives @@ -94,10 +109,12 @@ Used to override the default C<~/.public-inbox/config> value. =head1 LIMITATIONS -This tool predates NNTP support in public-inbox and is missing -C<newsgroup> and many of the options documented in -L<public-inbox-config(5)>. See L<public-inbox-config(5)> for all the -options which may be applied to a given inbox. +Some of the options documented in L<public-inbox-config(5)> +require editing the config file. Old versions lack the +C<-n>/C<--newsgroup> parameter + +See L<public-inbox-config(5)> for all the options which may be applied +to a given inbox. =head1 CONTACT diff --git a/script/public-inbox-init b/script/public-inbox-init index 6852f64a..90b32be8 100755 --- a/script/public-inbox-init +++ b/script/public-inbox-init @@ -22,6 +22,7 @@ options: -V2 use scalable public-inbox-v2-format(5) -L LEVEL index level `basic', `medium', or `full' (default: full) + --ng NEWSGROUP set NNTP newsgroup name --skip-artnum=NUM NNTP article numbers to skip --skip-epoch=NUM epochs to skip (-V2 only) -J JOBS number of indexing jobs (-V2 only), (default: 4) @@ -33,12 +34,14 @@ require PublicInbox::Admin; PublicInbox::Admin::require_or_die('-base'); my ($version, $indexlevel, $skip_epoch, $skip_artnum, $jobs, $show_help); +my $ng = ''; my %opts = ( 'V|version=i' => \$version, 'L|index-level|indexlevel=s' => \$indexlevel, 'S|skip|skip-epoch=i' => \$skip_epoch, 'N|skip-artnum=i' => \$skip_artnum, 'j|jobs=i' => \$jobs, + 'ng|newsgroup=s' => \$ng, 'help|?' => \$show_help, ); my $usage_cb = sub { @@ -53,7 +56,11 @@ my $inboxdir = shift @ARGV or $usage_cb->(); my $http_url = shift @ARGV or $usage_cb->(); my (@address) = @ARGV; @address or $usage_cb->(); -my %seen; + +$ng =~ m![^A-Za-z0-9/_\.\-\~\@\+\=:]! and + die "--newsgroup `$ng' is not valid\n"; +($ng =~ m!\A\.! || $ng =~ m!\.\z!) and + die "--newsgroup `$ng' must not start or end with `.'\n"; require PublicInbox::Config; my $pi_config = PublicInbox::Config->default_file; @@ -84,7 +91,7 @@ sysopen($lockfh, $lockfile, O_RDWR|O_CREAT|O_EXCL) or do { exit(255); }; my $auto_unlink = UnlinkMe->new($lockfile); -my $perm; +my ($perm, %seen); if (-e $pi_config) { open(my $oh, '<', $pi_config) or die "unable to read $pi_config: $!\n"; my @st = stat($oh); @@ -185,6 +192,7 @@ PublicInbox::Import::run_die([@x, "$pfx.inboxdir", $inboxdir]); if (defined($indexlevel)) { PublicInbox::Import::run_die([@x, "$pfx.indexlevel", $indexlevel]); } +PublicInbox::Import::run_die([@x, "$pfx.newsgroup", $ng]) if $ng ne ''; # needed for git prior to v2.1.0 if (defined $perm) { diff --git a/t/imapd.t b/t/imapd.t index 6cfced41..4d627af7 100644 --- a/t/imapd.t +++ b/t/imapd.t @@ -39,11 +39,9 @@ for my $V (@V) { my $url = "http://example.com/i$V"; my $inboxdir = "$tmpdir/$name"; my $folder = "inbox.i$V"; - my $cmd = ['-init', "-V$V", "-L$level", $name, $inboxdir, $url, $addr]; + my $cmd = ['-init', "-V$V", "-L$level", "--ng=$folder", + $name, $inboxdir, $url, $addr]; run_script($cmd) or BAIL_OUT("init $name"); - xsys(qw(git config), "--file=$ENV{HOME}/.public-inbox/config", - "publicinbox.$name.newsgroup", $folder) == 0 or - BAIL_OUT("setting newsgroup $V"); if ($V == 1) { xsys(qw(git config), "--file=$ENV{HOME}/.public-inbox/config", 'publicinboxmda.spamcheck', 'none') == 0 or diff --git a/t/nntpd.t b/t/nntpd.t index b9b9a63d..74e21a41 100644 --- a/t/nntpd.t +++ b/t/nntpd.t @@ -46,14 +46,11 @@ my $ibx = { $ibx = PublicInbox::Inbox->new($ibx); { local $ENV{HOME} = $home; - my @cmd = ('-init', $group, $inboxdir, 'http://example.com/', $addr); - push @cmd, "-V$version", '-Lbasic'; + my @cmd = ('-init', $group, $inboxdir, 'http://example.com/', $addr, + "-V$version", '-Lbasic', '--newsgroup', $group); ok(run_script(\@cmd), 'init OK'); - is(xsys(qw(git config), "--file=$home/.public-inbox/config", - "publicinbox.$group.newsgroup", $group), - 0, 'enabled newsgroup'); - my $len; + my $len; $ibx = PublicInbox::InboxWritable->new($ibx); my $im = $ibx->importer(0);
It may be too easily confused for --newsgroup or --ng. This is too rarely used and never made it into a release, so it should be fine. --- Documentation/public-inbox-init.pod | 2 +- script/public-inbox-init | 2 +- t/init.t | 4 ++-- 3 files changed, 4 insertions(+), 4 deletions(-) diff --git a/Documentation/public-inbox-init.pod b/Documentation/public-inbox-init.pod index 240959eb..4cc7e29f 100644 --- a/Documentation/public-inbox-init.pod +++ b/Documentation/public-inbox-init.pod @@ -54,7 +54,7 @@ Available since public-inbox 1.6.0 (PENDING). Default: none. -=item -N, --skip-artnum +=item --skip-artnum This option allows archivists to publish incomplete archives with only new mail while allowing NNTP article numbers diff --git a/script/public-inbox-init b/script/public-inbox-init index 90b32be8..b19c2321 100755 --- a/script/public-inbox-init +++ b/script/public-inbox-init @@ -39,7 +39,7 @@ my %opts = ( 'V|version=i' => \$version, 'L|index-level|indexlevel=s' => \$indexlevel, 'S|skip|skip-epoch=i' => \$skip_epoch, - 'N|skip-artnum=i' => \$skip_artnum, + 'skip-artnum=i' => \$skip_artnum, 'j|jobs=i' => \$jobs, 'ng|newsgroup=s' => \$ng, 'help|?' => \$show_help, diff --git a/t/init.t b/t/init.t index 6211bb58..4d2c5049 100644 --- a/t/init.t +++ b/t/init.t @@ -116,7 +116,7 @@ SKIP: { 'publicinboxmda.spamcheck', 'none') == 0 or BAIL_OUT "git config $?"; my $addr = 'skip3@example.com'; - $cmd = [ qw(-init -V2 -Lbasic -N12 skip3), "$tmpdir/skip3", + $cmd = [ qw(-init -V2 -Lbasic --skip-artnum=12 skip3), "$tmpdir/skip3", qw(http://example.com/skip3), $addr ]; ok(run_script($cmd), '--skip-artnum -V2'); my $env = { ORIGINAL_RECIPIENT => $addr }; @@ -131,7 +131,7 @@ SKIP: { $addr = 'skip4@example.com'; $env = { ORIGINAL_RECIPIENT => $addr }; - $cmd = [ qw(-init -V1 -N12 -Lmedium skip4), "$tmpdir/skip4", + $cmd = [ qw(-init -V1 --skip-artnum 12 -Lmedium skip4), "$tmpdir/skip4", qw(http://example.com/skip4), $addr ]; ok(run_script($cmd), '--skip-artnum -V1'); $err = '';
This seems required to correctly get the NNTP article number from Xapian docid on combined Xapian DBs. The default (ASCII-betical) sorting was only acceptable for -imapd users until somebody hit 11 (or more) shards, which is a rare case. --- lib/PublicInbox/Search.pm | 27 ++++++++++++++++----------- 1 file changed, 16 insertions(+), 11 deletions(-) diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm index 4e08aed7..4d02a7c1 100644 --- a/lib/PublicInbox/Search.pm +++ b/lib/PublicInbox/Search.pm @@ -177,19 +177,24 @@ sub _xdb ($) { my ($xdb, $slow_phrase); my $qpf = \($self->{qp_flags} ||= $QP_FLAGS); if ($self->{ibx_ver} >= 2) { - my $n = 0; - foreach my $shard (<$dir/*>) { - -d $shard && $shard =~ m!/[0-9]+\z! or next; - my $sub = $X{Database}->new($shard); - if ($xdb) { - $xdb->add_database($sub); - } else { - $xdb = $sub; + my @xdb; + opendir(my $dh, $dir) or return; # not initialized yet + + # We need numeric sorting so shard[0] is first for reading + # Xapian metadata, if needed + for (sort { $a <=> $b } grep(/\A[0-9]+\z/, readdir($dh))) { + my $shard_dir = "$dir/$_"; + if (-d $shard_dir && -r _) { + push @xdb, $X{Database}->new($shard_dir); + $slow_phrase ||= -f "$shard_dir/iamchert"; + } else { # gaps from missing epochs throw off mdocid() + warn "E: $shard_dir missing or unreadable\n"; + return; } - $slow_phrase ||= -f "$shard/iamchert"; - ++$n; } - $self->{nshard} = $n; + $self->{nshard} = scalar(@xdb); + $xdb = shift @xdb; + $xdb->add_database($_) for @xdb; } else { $slow_phrase = -f "$dir/iamchert"; $xdb = $X{Database}->new($dir);
No need to localize it, here, since we can just refer to it in the `$opt' hashref. Hopefully this improves readability for others like it does for me. I sometimes wonder if the concept of a stack in high-level languages is even necessary... --- lib/PublicInbox/Xapcmd.pm | 20 +++++++++----------- 1 file changed, 9 insertions(+), 11 deletions(-) diff --git a/lib/PublicInbox/Xapcmd.pm b/lib/PublicInbox/Xapcmd.pm index 46548a94..fffac99c 100644 --- a/lib/PublicInbox/Xapcmd.pm +++ b/lib/PublicInbox/Xapcmd.pm @@ -19,7 +19,6 @@ our @COMPACT_OPT = qw(jobs|j=i quiet|q blocksize|b=s no-full|n fuller|F); sub commit_changes ($$$$) { my ($ibx, $im, $tmp, $opt) = @_; my $reshard = $opt->{reshard}; - my $reindex = $opt->{reindex}; $SIG{INT} or die 'BUG: $SIG{INT} not handled'; my @old_shard; @@ -102,17 +101,17 @@ sub runnable_or_die ($) { } sub prepare_reindex ($$$) { - my ($ibx, $im, $reindex) = @_; + my ($ibx, $im, $opt) = @_; if ($ibx->version == 1) { my $dir = $ibx->search->xdir(1); my $xdb = $PublicInbox::Search::X{Database}->new($dir); if (my $lc = $xdb->get_metadata('last_commit')) { - $reindex->{from} = $lc; + $opt->{reindex}->{from} = $lc; } } else { # v2 my $max; $im->git_dir_latest(\$max) or return; - my $from = $reindex->{from}; + my $from = $opt->{reindex}->{from}; my $mm = $ibx->mm; my $v = PublicInbox::Search::SCHEMA_VERSION(); foreach my $i (0..$max) { @@ -238,14 +237,14 @@ sub prepare_run { sub check_compact () { runnable_or_die($XAPIAN_COMPACT) } sub _run { - my ($ibx, $cb, $opt, $reindex) = @_; + my ($ibx, $cb, $opt) = @_; my $im = $ibx->importer(0); $im->lock_acquire; my ($tmp, $queue) = prepare_run($ibx, $opt); # fine-grained locking if we prepare for reindex if (!$opt->{-coarse_lock}) { - prepare_reindex($ibx, $im, $reindex); + prepare_reindex($ibx, $im, $opt); $im->lock_release; } @@ -262,19 +261,18 @@ sub run { defined(my $dir = $ibx->{inboxdir}) or die "no inboxdir defined\n"; -d $dir or die "inboxdir=$dir does not exist\n"; check_compact() if $opt->{compact} && $ibx->search; - my $reindex; # v1:{ from => $x40 }, v2:{ from => [ $x40, $x40, .. ] } } if (!$opt->{-coarse_lock}) { - $reindex = $opt->{reindex} = { # per-epoch ranges for v2 - from => $ibx->version == 1 ? '' : [], - }; + # per-epoch ranges for v2 + # v1:{ from => $OID }, v2:{ from => [ $OID, $OID, $OID ] } } + $opt->{reindex} = { from => $ibx->version == 1 ? '' : [] }; PublicInbox::SearchIdx::load_xapian_writable(); } local %SIG = %SIG; setup_signals(); $ibx->umask_prepare; - $ibx->with_umask(\&_run, $ibx, $cb, $opt, $reindex); + $ibx->with_umask(\&_run, $ibx, $cb, $opt); } sub cpdb_retryable ($$) {
While this is unlikely to be a problem in current practice, keeping Xapian DBs open for long responses can interfere with free space recovery after -compact. In the future, it will interfere with inbox search grouping and lead to unexpected results. --- lib/PublicInbox/Inbox.pm | 11 ++++++++--- lib/PublicInbox/Mbox.pm | 6 +++--- lib/PublicInbox/SearchView.pm | 4 ++-- 3 files changed, 13 insertions(+), 8 deletions(-) diff --git a/lib/PublicInbox/Inbox.pm b/lib/PublicInbox/Inbox.pm index 267be4e3..55e546e1 100644 --- a/lib/PublicInbox/Inbox.pm +++ b/lib/PublicInbox/Inbox.pm @@ -191,14 +191,19 @@ sub mm { }; } -sub search ($;$) { - my ($self, $over_only) = @_; +sub search ($;$$) { + my ($self, $over_only, $ctx) = @_; my $srch = $self->{search} ||= eval { _cleanup_later($self); require PublicInbox::Search; PublicInbox::Search->new($self); }; - ($over_only || eval { $srch->xdb }) ? $srch : undef; + ($over_only || eval { $srch->xdb }) ? $srch : do { + $ctx and $ctx->{env}->{'psgi.errors'}->print(<<EOF); +`$self->{name}' search went away unexpectedly +EOF + undef; + }; } sub over ($) { diff --git a/lib/PublicInbox/Mbox.pm b/lib/PublicInbox/Mbox.pm index fc83a893..a83c0356 100644 --- a/lib/PublicInbox/Mbox.pm +++ b/lib/PublicInbox/Mbox.pm @@ -205,8 +205,8 @@ sub mbox_all_ids { sub results_cb { my ($ctx) = @_; + my $srch = $ctx->{-inbox}->search(undef, $ctx) or return; my $mset = $ctx->{mset}; - my $srch = $ctx->{srch}; while (1) { while (my $mi = (($mset->items)[$ctx->{iter}++])) { my $smsg = PublicInbox::Smsg::from_mitem($mi, @@ -227,8 +227,8 @@ sub mbox_all { return mbox_all_ids($ctx) if $query eq ''; my $qopts = $ctx->{qopts} = { mset => 2 }; - my $srch = $ctx->{srch} = $ctx->{-inbox}->search or - return PublicInbox::WWW::need($ctx, 'Search');; + my $srch = $ctx->{-inbox}->search or + return PublicInbox::WWW::need($ctx, 'Search'); my $mset = $ctx->{mset} = $srch->query($query, $qopts); $qopts->{offset} = $mset->size or return [404, [qw(Content-Type text/plain)], diff --git a/lib/PublicInbox/SearchView.pm b/lib/PublicInbox/SearchView.pm index 84c04c6c..5d77469e 100644 --- a/lib/PublicInbox/SearchView.pm +++ b/lib/PublicInbox/SearchView.pm @@ -321,7 +321,6 @@ sub adump { my ($cb, $mset, $q, $ctx) = @_; $ctx->{items} = [ $mset->items ]; $ctx->{search_query} = $q; # used by WwwAtomStream::atom_header - $ctx->{srch} = $ctx->{-inbox}->search; PublicInbox::WwwAtomStream->response($ctx, 200, \&adump_i); } @@ -329,8 +328,9 @@ sub adump { sub adump_i { my ($ctx) = @_; while (my $mi = shift @{$ctx->{items}}) { + my $srch = $ctx->{-inbox}->search(undef, $ctx) or return; my $smsg = eval { - PublicInbox::Smsg::from_mitem($mi, $ctx->{srch}); + PublicInbox::Smsg::from_mitem($mi, $srch); } or next; return $smsg; }
We'll probably be adding more value columns like THREADID to sort on. --- lib/PublicInbox/Search.pm | 63 +++++++++++++++++++++------------------ 1 file changed, 34 insertions(+), 29 deletions(-) diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm index 4d02a7c1..593040a8 100644 --- a/lib/PublicInbox/Search.pm +++ b/lib/PublicInbox/Search.pm @@ -6,15 +6,47 @@ package PublicInbox::Search; use strict; -# values for searching +# values for searching, changing the numeric value breaks +# compatibility with old indices (so don't change them it) use constant { TS => 0, # Received: header in Unix time (IMAP INTERNALDATE) YYYYMMDD => 1, # Date: header for searching in the WWW UI DT => 2, # Date: YYYYMMDDHHMMSS + + # added for public-inbox 1.6.0+ BYTES => 3, # IMAP RFC822.SIZE UID => 4, # IMAP UID == NNTP article number == Xapian docid + # TODO - # REPLYCNT => 4, # IMAP ANSWERED + # THREADID => ? + # REPLYCNT => ?, # IMAP ANSWERED + + # SCHEMA_VERSION history + # 0 - initial + # 1 - subject_path is lower-cased + # 2 - subject_path is id_compress in the index, only + # 3 - message-ID is compressed if it includes '%' (hack!) + # 4 - change "Re: " normalization, avoid circular Reference ghosts + # 5 - subject_path drops trailing '.' + # 6 - preserve References: order in document data + # 7 - remove references and inreplyto terms + # 8 - remove redundant/unneeded document data + # 9 - disable Message-ID compression (SHA-1) + # 10 - optimize doc for NNTP overviews + # 11 - merge threads when vivifying ghosts + # 12 - change YYYYMMDD value column to numeric + # 13 - fix threading for empty References/In-Reply-To + # (commit 83425ef12e4b65cdcecd11ddcb38175d4a91d5a0) + # 14 - fix ghost root vivification + # 15 - see public-inbox-v2-format(5) + # further bumps likely unnecessary, we'll suggest in-place + # "--reindex" use for further fixes and tweaks: + # + # public-inbox v1.5.0 adds (still SCHEMA_VERSION=15): + # * "lid:" and "l:" for List-Id searches + # + # v1.6.0 adds BYTES and UID values + SCHEMA_VERSION => 15, }; use PublicInbox::Smsg; @@ -61,33 +93,6 @@ sub load_xapian () { # a prefix common in patch emails our $LANG = 'english'; -use constant { - # SCHEMA_VERSION history - # 0 - initial - # 1 - subject_path is lower-cased - # 2 - subject_path is id_compress in the index, only - # 3 - message-ID is compressed if it includes '%' (hack!) - # 4 - change "Re: " normalization, avoid circular Reference ghosts - # 5 - subject_path drops trailing '.' - # 6 - preserve References: order in document data - # 7 - remove references and inreplyto terms - # 8 - remove redundant/unneeded document data - # 9 - disable Message-ID compression (SHA-1) - # 10 - optimize doc for NNTP overviews - # 11 - merge threads when vivifying ghosts - # 12 - change YYYYMMDD value column to numeric - # 13 - fix threading for empty References/In-Reply-To - # (commit 83425ef12e4b65cdcecd11ddcb38175d4a91d5a0) - # 14 - fix ghost root vivification - # 15 - see public-inbox-v2-format(5) - # further bumps likely unnecessary, we'll suggest in-place - # "--reindex" use for further fixes and tweaks - # - # public-inbox v1.5.0 adds (still SCHEMA_VERSION=15): - # * "lid:" and "l:" for List-Id searches - SCHEMA_VERSION => 15, -}; - # note: the non-X term prefix allocations are shared with # Xapian omega, see xapian-applications/omega/docs/termprefixes.rst my %bool_pfx_external = (
No need to have awkward globrefs for this. --- lib/PublicInbox/IMAP.pm | 3 +-- lib/PublicInbox/Search.pm | 2 ++ 2 files changed, 3 insertions(+), 2 deletions(-) diff --git a/lib/PublicInbox/IMAP.pm b/lib/PublicInbox/IMAP.pm index 3d66f930..562c59d4 100644 --- a/lib/PublicInbox/IMAP.pm +++ b/lib/PublicInbox/IMAP.pm @@ -40,9 +40,8 @@ use PublicInbox::Syscall qw(EPOLLIN EPOLLONESHOT); use PublicInbox::GitAsyncCat; use Text::ParseWords qw(parse_line); use Errno qw(EAGAIN); -use PublicInbox::Search; +use PublicInbox::Search qw(mdocid); use PublicInbox::IMAPsearchqp; -*mdocid = \&PublicInbox::Search::mdocid; my $Address; for my $mod (qw(Email::Address::XS Mail::Address)) { diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm index 593040a8..f98513d3 100644 --- a/lib/PublicInbox/Search.pm +++ b/lib/PublicInbox/Search.pm @@ -5,6 +5,8 @@ # Read-only search interface for use by the web and NNTP interfaces package PublicInbox::Search; use strict; +use parent qw(Exporter); +our @EXPORT_OK = qw(mdocid); # values for searching, changing the numeric value breaks # compatibility with old indices (so don't change them it)
Since this was already a separate package, split it off into its own file since SearchView may not handle inbox groups. --- MANIFEST | 1 + lib/PublicInbox/SearchQuery.pm | 53 ++++++++++++++++++++++++++++++++++ lib/PublicInbox/SearchView.pm | 53 ++-------------------------------- 3 files changed, 57 insertions(+), 50 deletions(-) create mode 100644 lib/PublicInbox/SearchQuery.pm diff --git a/MANIFEST b/MANIFEST index 6cb5f6bf..d86d3b15 100644 --- a/MANIFEST +++ b/MANIFEST @@ -173,6 +173,7 @@ lib/PublicInbox/SaPlugin/ListMirror.pod lib/PublicInbox/Search.pm lib/PublicInbox/SearchIdx.pm lib/PublicInbox/SearchIdxShard.pm +lib/PublicInbox/SearchQuery.pm lib/PublicInbox/SearchThread.pm lib/PublicInbox/SearchView.pm lib/PublicInbox/Sigfd.pm diff --git a/lib/PublicInbox/SearchQuery.pm b/lib/PublicInbox/SearchQuery.pm new file mode 100644 index 00000000..ce1eae12 --- /dev/null +++ b/lib/PublicInbox/SearchQuery.pm @@ -0,0 +1,53 @@ +# Copyright (C) 2015-2020 all contributors <meta@public-inbox.org> +# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt> + +# used by PublicInbox::SearchView +package PublicInbox::SearchQuery; +use strict; +use v5.10.1; +use URI::Escape qw(uri_escape); +use PublicInbox::MID qw(MID_ESC); +our $LIM = 200; + +sub new { + my ($class, $qp) = @_; + + my $r = $qp->{r}; + my ($l) = (($qp->{l} || '') =~ /([0-9]+)/); + $l = $LIM if !$l || $l > $LIM; + bless { + q => $qp->{'q'}, + x => $qp->{x} || '', + o => (($qp->{o} || '0') =~ /(-?[0-9]+)/), + l => $l, + r => (defined $r && $r ne '0'), + }, $class; +} + +sub qs_html { + my ($self, %override) = @_; + + if (scalar(keys(%override))) { + $self = bless { (%$self, %override) }, ref($self); + } + + my $q = uri_escape($self->{'q'}, MID_ESC); + $q =~ s/%20/+/g; # improve URL readability + my $qs = "q=$q"; + + if (my $o = $self->{o}) { # ignore o == 0 + $qs .= "&o=$o"; + } + if (my $l = $self->{l}) { + $qs .= "&l=$l" unless $l == $LIM; + } + if (my $r = $self->{r}) { + $qs .= "&r"; + } + if (my $x = $self->{x}) { + $qs .= "&x=$x" if ($x eq 't' || $x eq 'A' || $x eq 'm'); + } + $qs; +} + +1; diff --git a/lib/PublicInbox/SearchView.pm b/lib/PublicInbox/SearchView.pm index 5d77469e..28d9ce5d 100644 --- a/lib/PublicInbox/SearchView.pm +++ b/lib/PublicInbox/SearchView.pm @@ -4,15 +4,15 @@ # Displays search results for the web interface package PublicInbox::SearchView; use strict; -use warnings; -use URI::Escape qw(uri_unescape uri_escape); +use v5.10.1; +use URI::Escape qw(uri_unescape); use PublicInbox::Smsg; use PublicInbox::Hval qw(ascii_html obfuscate_addrs mid_href); use PublicInbox::View; use PublicInbox::WwwAtomStream; use PublicInbox::WwwStream qw(html_oneshot); use PublicInbox::SearchThread; -our $LIM = 200; +use PublicInbox::SearchQuery; my %rmap_inc; sub mbox_results { @@ -336,51 +336,4 @@ sub adump_i { } } -package PublicInbox::SearchQuery; -use strict; -use warnings; -use URI::Escape qw(uri_escape); -use PublicInbox::MID qw(MID_ESC); - -sub new { - my ($class, $qp) = @_; - - my $r = $qp->{r}; - my ($l) = (($qp->{l} || '') =~ /([0-9]+)/); - $l = $LIM if !$l || $l > $LIM; - bless { - q => $qp->{'q'}, - x => $qp->{x} || '', - o => (($qp->{o} || '0') =~ /(-?[0-9]+)/), - l => $l, - r => (defined $r && $r ne '0'), - }, $class; -} - -sub qs_html { - my ($self, %override) = @_; - - if (scalar(keys(%override))) { - $self = bless { (%$self, %override) }, ref($self); - } - - my $q = uri_escape($self->{'q'}, MID_ESC); - $q =~ s/%20/+/g; # improve URL readability - my $qs = "q=$q"; - - if (my $o = $self->{o}) { # ignore o == 0 - $qs .= "&o=$o"; - } - if (my $l = $self->{l}) { - $qs .= "&l=$l" unless $l == $LIM; - } - if (my $r = $self->{r}) { - $qs .= "&r"; - } - if (my $x = $self->{x}) { - $qs .= "&x=$x" if ($x eq 't' || $x eq 'A' || $x eq 'm'); - } - $qs; -} - 1;
We'll probably be reusing it from another package in a future commit. --- lib/PublicInbox/Search.pm | 16 ++++++---------- 1 file changed, 6 insertions(+), 10 deletions(-) diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm index f98513d3..e6200bfb 100644 --- a/lib/PublicInbox/Search.pm +++ b/lib/PublicInbox/Search.pm @@ -265,7 +265,7 @@ sub query { if ($query_string eq '' && !$opts->{mset}) { $self->{over_ro}->recent($opts); } else { - my $qp = qp($self); + my $qp = $self->{qp} //= qparse_new($self); my $qp_flags = $self->{qp_flags}; my $query = $qp->parse_query($query_string, $qp_flags); $opts->{relevance} = 1 unless exists $opts->{relevance}; @@ -334,17 +334,14 @@ sub _enquire_once { # retry_reopen callback sub stemmer { $X{Stem}->new($LANG) } # read-only -sub qp { +sub qparse_new ($) { my ($self) = @_; - my $qp = $self->{query_parser}; - return $qp if $qp; my $xdb = xdb($self); - # new parser - $qp = $X{QueryParser}->new; + my $qp = $X{QueryParser}->new; $qp->set_default_op(OP_AND()); $qp->set_database($xdb); - $qp->set_stemmer($self->stemmer); + $qp->set_stemmer(stemmer($self)); $qp->set_stemming_strategy(STEM_SOME()); $qp->set_max_wildcard_expansion(100); my $nvrp = $X{NumberValueRangeProcessor}; @@ -382,13 +379,12 @@ EOF while (my ($name, $prefix) = each %prob_prefix) { $qp->add_prefix($name, $_) foreach split(/ /, $prefix); } - - $self->{query_parser} = $qp; + $qp; } sub help { my ($self) = @_; - $self->qp; # parse altids + $self->{qp} //= qparse_new($self); # parse altids my @ret = @HELP; if (my $user_pfx = $self->{-user_pfx}) { push @ret, @$user_pfx;
Both callers of load_from_data call utf8::decode, so just do utf8::decode in load_from_data. --- lib/PublicInbox/Over.pm | 1 - lib/PublicInbox/Smsg.pm | 2 +- 2 files changed, 1 insertion(+), 2 deletions(-) diff --git a/lib/PublicInbox/Over.pm b/lib/PublicInbox/Over.pm index 2b314882..81b9fca7 100644 --- a/lib/PublicInbox/Over.pm +++ b/lib/PublicInbox/Over.pm @@ -68,7 +68,6 @@ sub load_from_row ($;$) { bless $smsg, 'PublicInbox::Smsg'; if (defined(my $data = delete $smsg->{ddd})) { $data = uncompress($data); - utf8::decode($data); PublicInbox::Smsg::load_from_data($smsg, $data); # saves over 600K for 1000+ message threads diff --git a/lib/PublicInbox/Smsg.pm b/lib/PublicInbox/Smsg.pm index 62cb951e..f22cd43e 100644 --- a/lib/PublicInbox/Smsg.pm +++ b/lib/PublicInbox/Smsg.pm @@ -40,6 +40,7 @@ sub to_doc_data { sub load_from_data ($$) { my ($self) = $_[0]; # data = $_[1] + utf8::decode($_[1]); ( $self->{subject}, $self->{from}, @@ -67,7 +68,6 @@ sub load_expand { my $dt = get_val($doc, PublicInbox::Search::DT()); my ($yyyy, $mon, $dd, $hh, $mm, $ss) = unpack('A4A2A2A2A2A2', $dt); $self->{ds} = timegm($ss, $mm, $hh, $dd, $mon - 1, $yyyy); - utf8::decode($data); load_from_data($self, $data); $self; }
This is a step towards improving kernel page cache hit rates by relying on over.sqlite3 for document data instead of Xapian. Some micro-optimization to over->get_art was required to maintain performance. --- lib/PublicInbox/Over.pm | 18 +++++------------- lib/PublicInbox/SearchView.pm | 10 +++++++--- 2 files changed, 12 insertions(+), 16 deletions(-) diff --git a/lib/PublicInbox/Over.pm b/lib/PublicInbox/Over.pm index 81b9fca7..80e57e62 100644 --- a/lib/PublicInbox/Over.pm +++ b/lib/PublicInbox/Over.pm @@ -57,6 +57,7 @@ sub new { sub disconnect { my ($self) = @_; if (my $dbh = delete $self->{dbh}) { + delete $self->{-get_art}; $self->{filename} = $dbh->sqlite_db_filename; } } @@ -201,8 +202,8 @@ SELECT COUNT(num) FROM over WHERE num > 0 sub get_art { my ($self, $num) = @_; - my $dbh = $self->connect; - my $sth = $dbh->prepare_cached(<<'', undef, 1); + # caching $sth ourselves is faster than prepare_cached + my $sth = $self->{-get_art} //= $self->connect->prepare(<<''); SELECT num,ds,ts,ddd FROM over WHERE num = ? LIMIT 1 $sth->execute($num); @@ -230,13 +231,7 @@ ORDER BY num ASC LIMIT 1 $sth->execute($$id, $$prev); my $num = $sth->fetchrow_array or return; $$prev = $num; - - $sth = $dbh->prepare_cached(<<"", undef, 1); -SELECT num,ts,ds,ddd FROM over WHERE num = ? LIMIT 1 - - $sth->execute($num); - my $smsg = $sth->fetchrow_hashref or return; - load_from_row($smsg); + get_art($self, $num); } # IMAP search, this is limited by callers to UID_SLICE size (50K) @@ -278,10 +273,7 @@ sub check_inodes { my $st = pack('dd', $st[0], $st[1]); # don't actually reopen, just let {dbh} be recreated later - if ($st ne ($self->{st} // $st)) { - delete($self->{dbh}); - $self->{filename} = $f; - } + disconnect($self) if $st ne ($self->{st} // $st); } else { warn "W: stat $f: $!\n"; } diff --git a/lib/PublicInbox/SearchView.pm b/lib/PublicInbox/SearchView.pm index 28d9ce5d..61534c25 100644 --- a/lib/PublicInbox/SearchView.pm +++ b/lib/PublicInbox/SearchView.pm @@ -13,6 +13,7 @@ use PublicInbox::WwwAtomStream; use PublicInbox::WwwStream qw(html_oneshot); use PublicInbox::SearchThread; use PublicInbox::SearchQuery; +use PublicInbox::Search qw(mdocid); my %rmap_inc; sub mbox_results { @@ -90,19 +91,22 @@ sub mset_summary { my $pfx = ' ' x $pad; my $res = \($ctx->{-html_tip}); my $ibx = $ctx->{-inbox}; - my $srch = $ibx->search; + my $over = $ibx->over; + my $nshard = $ibx->search->{nshard} // 1; my $obfs_ibx = $ibx->{obfuscate} ? $ibx : undef; foreach my $m ($mset->items) { my $rank = sprintf("%${pad}d", $m->get_rank + 1); my $pct = get_pct($m); - my $smsg = PublicInbox::Smsg::from_mitem($m, $srch); + my $num = mdocid($nshard, $m); + my $smsg = $over->get_art($num, 1); unless ($smsg) { eval { - $m = "$m ".$m->get_docid . " expired\n"; + $m = "$m $num expired\n"; $ctx->{env}->{'psgi.errors'}->print($m); }; next; } + PublicInbox::Smsg::psgi_cull($smsg); my $s = ascii_html($smsg->{subject}); my $f = ascii_html($smsg->{from_name}); if ($obfs_ibx) {
Instead of loading one article at-a-time from over.sqlite3, we can use SQL to mass-load IN (?,?, ...) all results with a single SQLite query. Despite SQLite being in-process and having no network latency, the reduction in SQL query executions from loading multiple rows at once speeds things up significantly. We'll keep the over->get_art optimizations from the previous commit, since it still speeds up long-lived responses, slightly. --- lib/PublicInbox/Over.pm | 9 +++++++++ lib/PublicInbox/SearchView.pm | 12 ++++++------ 2 files changed, 15 insertions(+), 6 deletions(-) diff --git a/lib/PublicInbox/Over.pm b/lib/PublicInbox/Over.pm index 80e57e62..a055b4cd 100644 --- a/lib/PublicInbox/Over.pm +++ b/lib/PublicInbox/Over.pm @@ -104,6 +104,15 @@ ORDER BY num ASC } +sub get_all { + my $self = shift; + my $nr = scalar(@_) or return []; + my $in = '?' . (',?' x ($nr - 1)); + do_get($self, <<"", { cull => 1, limit => $nr }, @_); +SELECT num,ds,ddd FROM over WHERE num IN ($in) + +} + sub nothing () { wantarray ? (0, []) : [] }; sub get_thread { diff --git a/lib/PublicInbox/SearchView.pm b/lib/PublicInbox/SearchView.pm index 61534c25..a3527057 100644 --- a/lib/PublicInbox/SearchView.pm +++ b/lib/PublicInbox/SearchView.pm @@ -91,22 +91,22 @@ sub mset_summary { my $pfx = ' ' x $pad; my $res = \($ctx->{-html_tip}); my $ibx = $ctx->{-inbox}; - my $over = $ibx->over; my $nshard = $ibx->search->{nshard} // 1; my $obfs_ibx = $ibx->{obfuscate} ? $ibx : undef; + my @nums = map { mdocid($nshard, $_) } $mset->items; + my %num2msg = map { $_->{num} => $_ } @{$ibx->over->get_all(@nums)}; + foreach my $m ($mset->items) { my $rank = sprintf("%${pad}d", $m->get_rank + 1); my $pct = get_pct($m); - my $num = mdocid($nshard, $m); - my $smsg = $over->get_art($num, 1); - unless ($smsg) { + my $num = shift @nums; + my $smsg = delete($num2msg{$num}) or do { eval { $m = "$m $num expired\n"; $ctx->{env}->{'psgi.errors'}->print($m); }; next; - } - PublicInbox::Smsg::psgi_cull($smsg); + }; my $s = ascii_html($smsg->{subject}); my $f = ascii_html($smsg->{from_name}); if ($obfs_ibx) {
git blob retrieval dominates on these, "&x=t" (nested) is roughly the same due to increased overhead for ->get_percent storage balancing out the mass-loading from SQLite. Atom "&x=A" is sped up slightly and uses less memory in the long-lived response. --- lib/PublicInbox/SearchView.pm | 25 ++++++++----------------- 1 file changed, 8 insertions(+), 17 deletions(-) diff --git a/lib/PublicInbox/SearchView.pm b/lib/PublicInbox/SearchView.pm index a3527057..ef1b9767 100644 --- a/lib/PublicInbox/SearchView.pm +++ b/lib/PublicInbox/SearchView.pm @@ -255,20 +255,13 @@ sub get_pct ($) { $n > 99 ? 99 : $n; } -sub load_msgs { - my ($mset) = @_; - [ map { - my $mi = $_; - my $smsg = PublicInbox::Smsg::from_mitem($mi); - $smsg->{pct} = get_pct($mi); - $smsg; - } ($mset->items) ] -} - sub mset_thread { my ($ctx, $mset, $q) = @_; my $ibx = $ctx->{-inbox}; - my $msgs = $ibx->search->retry_reopen(\&load_msgs, $mset); + my $nshard = $ibx->search->{nshard} // 1; + my %pct = map { mdocid($nshard, $_) => get_pct($_) } $mset->items; + my $msgs = $ibx->over->get_all(keys %pct); + $_->{pct} = $pct{$_->{num}} for @$msgs; my $r = $q->{r}; my $rootset = PublicInbox::SearchThread::thread($msgs, $r ? \&sort_relevance : \&PublicInbox::View::sort_ds, @@ -323,7 +316,8 @@ sub ctx_prepare { sub adump { my ($cb, $mset, $q, $ctx) = @_; - $ctx->{items} = [ $mset->items ]; + my $nshard = $ctx->{-inbox}->search->{nshard} // 1; + $ctx->{ids} = [ map { mdocid($nshard, $_) } $mset->items ]; $ctx->{search_query} = $q; # used by WwwAtomStream::atom_header PublicInbox::WwwAtomStream->response($ctx, 200, \&adump_i); } @@ -331,11 +325,8 @@ sub adump { # callback for PublicInbox::WwwAtomStream::getline sub adump_i { my ($ctx) = @_; - while (my $mi = shift @{$ctx->{items}}) { - my $srch = $ctx->{-inbox}->search(undef, $ctx) or return; - my $smsg = eval { - PublicInbox::Smsg::from_mitem($mi, $srch); - } or next; + while (my $num = shift @{$ctx->{ids}}) { + my $smsg = eval { $ctx->{-inbox}->over->get_art($num) } or next; return $smsg; } }
Once again, over.sqlite3 contains everything necessary for Message-ID resolution. Also, Xapian may be completely unnecessary with the advent of over.sqlite3, but that's for another time. --- lib/PublicInbox/ExtMsg.pm | 21 ++++++++++----------- 1 file changed, 10 insertions(+), 11 deletions(-) diff --git a/lib/PublicInbox/ExtMsg.pm b/lib/PublicInbox/ExtMsg.pm index d7917b34..5c23dc31 100644 --- a/lib/PublicInbox/ExtMsg.pm +++ b/lib/PublicInbox/ExtMsg.pm @@ -11,6 +11,7 @@ use warnings; use PublicInbox::Hval qw(ascii_html prurl mid_href); use PublicInbox::WwwStream qw(html_oneshot); use PublicInbox::Smsg; +use PublicInbox::Search qw(mdocid); our $MIN_PARTIAL_LEN = 16; # TODO: user-configurable @@ -29,13 +30,10 @@ our @EXT_URL = map { ascii_html($_) } ( sub PARTIAL_MAX () { 100 } -sub mids_from_mset { # Search::retry_reopen callback - [ map { PublicInbox::Smsg::from_mitem($_)->{mid} } $_[0]->items ]; -} - sub search_partial ($$) { - my ($srch, $mid) = @_; + my ($ibx, $mid) = @_; return if length($mid) < $MIN_PARTIAL_LEN; + my $srch = $ibx->search or return; my $opt = { limit => PARTIAL_MAX, mset => 2 }; my @try = ("m:$mid*"); my $chop = $mid; @@ -63,14 +61,17 @@ sub search_partial ($$) { } } + my $n = $srch->{nshard} // 1; foreach my $m (@try) { # If Xapian can't handle the wildcard since it # has too many results. $@ can be # Search::Xapian::QueryParserError or even: # "something terrible happened at ../Search/Xapian/Enquire.pm" my $mset = eval { $srch->query($m, $opt) } or next; - my $mids = $srch->retry_reopen(\&mids_from_mset, $mset); - return $mids if scalar(@$mids); + my @mids = map { + $_->{mid} + } @{$ibx->over->get_all(map { mdocid($n, $_) } $mset->items)}; + return \@mids if scalar(@mids); } } @@ -111,8 +112,7 @@ sub ext_msg { # fall back to partial MID matching my @partial; my $n_partial = 0; - my $srch = $cur->search; - my $mids = search_partial($srch, $mid) if $srch; + my $mids = search_partial($cur, $mid); if ($mids) { $n_partial = scalar(@$mids); push @partial, [ $cur, $mids ]; @@ -121,8 +121,7 @@ sub ext_msg { # can't find a partial match in current inbox, try the others: if (!$n_partial && length($mid) >= $MIN_PARTIAL_LEN) { foreach my $ibx (@$ibxs) { - $srch = $ibx->search or next; - $mids = search_partial($srch, $mid) or next; + $mids = search_partial($ibx, $mid) or next; $n_partial += scalar(@$mids); push @partial, [ $ibx, $mids]; last if $n_partial >= PARTIAL_MAX;
Another place where we can reduce kernel page cache overhead by hitting over.sqlite3 instead of docdata.glass. --- lib/PublicInbox/Mbox.pm | 23 ++++++++++++----------- 1 file changed, 12 insertions(+), 11 deletions(-) diff --git a/lib/PublicInbox/Mbox.pm b/lib/PublicInbox/Mbox.pm index a83c0356..0fa9a38d 100644 --- a/lib/PublicInbox/Mbox.pm +++ b/lib/PublicInbox/Mbox.pm @@ -10,6 +10,7 @@ use PublicInbox::MID qw/mid_escape/; use PublicInbox::Hval qw/to_filename/; use PublicInbox::Smsg; use PublicInbox::Eml; +use PublicInbox::Search qw(mdocid); # called by PSGI server as body response # this gets called twice for every message, once to return the header, @@ -205,20 +206,19 @@ sub mbox_all_ids { sub results_cb { my ($ctx) = @_; - my $srch = $ctx->{-inbox}->search(undef, $ctx) or return; - my $mset = $ctx->{mset}; + my $over = $ctx->{-inbox}->over or return; while (1) { - while (my $mi = (($mset->items)[$ctx->{iter}++])) { - my $smsg = PublicInbox::Smsg::from_mitem($mi, - $srch) or next; + while (defined(my $num = shift(@{$ctx->{ids}}))) { + my $smsg = $over->get_art($num) or next; return $smsg; } # refill result set - $mset = $ctx->{mset} = $srch->query($ctx->{query}, - $ctx->{qopts}); + my $srch = $ctx->{-inbox}->search(undef, $ctx) or return; + my $mset = $srch->query($ctx->{query}, $ctx->{qopts}); my $size = $mset->size or return; $ctx->{qopts}->{offset} += $size; - $ctx->{iter} = 0; + my $nshard = $srch->{nshard} // 1; + $ctx->{ids} = [ map { mdocid($nshard, $_) } $mset->items ]; } } @@ -226,15 +226,16 @@ sub mbox_all { my ($ctx, $query) = @_; return mbox_all_ids($ctx) if $query eq ''; - my $qopts = $ctx->{qopts} = { mset => 2 }; + my $qopts = $ctx->{qopts} = { mset => 2 }; # order by docid my $srch = $ctx->{-inbox}->search or return PublicInbox::WWW::need($ctx, 'Search'); - my $mset = $ctx->{mset} = $srch->query($query, $qopts); + my $mset = $srch->query($query, $qopts); $qopts->{offset} = $mset->size or return [404, [qw(Content-Type text/plain)], ["No results found\n"]]; - $ctx->{iter} = 0; $ctx->{query} = $query; + my $nshard = $srch->{nshard} // 1; + $ctx->{ids} = [ map { mdocid($nshard, $_) } $mset->items ]; require PublicInbox::MboxGz; PublicInbox::MboxGz::mbox_gz($ctx, \&results_cb, 'results-'.$query); }
We no longer read docdata.glass from anywhere in our code base. Some adjustments were needed to t/search.t to deal with the Xapian::WritableDatabase committing at different times, since our ->query is avoided from PublicInbox::SearchIdx to avoid needing a {over_ro} field. --- lib/PublicInbox/Search.pm | 11 +++++++---- lib/PublicInbox/Smsg.pm | 8 -------- t/search.t | 34 +++++++++++++++++++--------------- 3 files changed, 26 insertions(+), 27 deletions(-) diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm index e6200bfb..1c648299 100644 --- a/lib/PublicInbox/Search.pm +++ b/lib/PublicInbox/Search.pm @@ -324,10 +324,13 @@ sub _enquire_once { # retry_reopen callback my $limit = $opts->{limit} || 50; my $mset = $enquire->get_mset($offset, $limit); return $mset if $opts->{mset}; - my @msgs = map { PublicInbox::Smsg::from_mitem($_) } $mset->items; - return \@msgs unless wantarray; - - ($mset->get_matches_estimated, \@msgs) + my $nshard = $self->{nshard} // 1; + my $i = 0; + my %order = map { mdocid($nshard, $_) => ++$i } $mset->items; + my @msgs = sort { + $order{$a->{num}} <=> $order{$b->{num}} + } @{$self->{over_ro}->get_all(keys %order)}; + wantarray ? ($mset->get_matches_estimated, \@msgs) : \@msgs; } # read-write diff --git a/lib/PublicInbox/Smsg.pm b/lib/PublicInbox/Smsg.pm index f22cd43e..51226b8e 100644 --- a/lib/PublicInbox/Smsg.pm +++ b/lib/PublicInbox/Smsg.pm @@ -86,14 +86,6 @@ sub psgi_cull ($) { $self; } -# Only called by PSGI interface, not NNTP -sub from_mitem { - my ($mitem, $srch) = @_; - return $srch->retry_reopen(\&from_mitem, $mitem) if $srch; - my $self = bless {}, __PACKAGE__; - psgi_cull(load_expand($self, $mitem->get_document)); -} - # for Import and v1 non-SQLite WWW code paths sub populate { my ($self, $hdr, $sync) = @_; diff --git a/t/search.t b/t/search.t index 299f57c8..e2290ecd 100644 --- a/t/search.t +++ b/t/search.t @@ -263,13 +263,14 @@ To: list@example.com theatre fade EOF - my $res = $rw->query("theatre"); + $rw_commit->(); + my $res = $ro->reopen->query("theatre"); is(scalar(@$res), 2, "got both matches"); if (@$res == 2) { is($res->[0]->{mid}, 'nquote@a', 'non-quoted scores higher'); is($res->[1]->{mid}, 'quote@a', 'quoted result still returned'); } - $res = $rw->query("illusions"); + $res = $ro->query("illusions"); is(scalar(@$res), 1, "got a match for quoted text"); is($res->[0]->{mid}, 'quote@a', "quoted result returned if nothing else") if scalar(@$res); @@ -290,7 +291,8 @@ To: list\@example.com LOOP! EOF ok($doc_id > 0, "doc_id defined with circular reference"); - my $smsg = $rw->query('m:circle@a', {limit=>1})->[0]; + $rw_commit->(); + my $smsg = $ro->reopen->query('m:circle@a', {limit=>1})->[0]; is(defined($smsg), 1, 'found m:circl@a'); if (defined $smsg) { is($smsg->{references}, '', "no references created"); @@ -298,11 +300,22 @@ EOF } }); +{ + my $msgs = $ro->query('d:19931002..20101002'); + ok(scalar(@$msgs) > 0, 'got results within range'); + $msgs = $ro->query('d:20101003..'); + is(scalar(@$msgs), 0, 'nothing after 20101003'); + $msgs = $ro->query('d:..19931001'); + is(scalar(@$msgs), 0, 'nothing before 19931001'); +} + $ibx->with_umask(sub { my $mime = eml_load 't/utf8.eml'; my $doc_id = $rw->add_message($mime); ok($doc_id > 0, 'message indexed doc_id with UTF-8'); - my $msg = $rw->query('m:testmessage@example.com', {limit => 1})->[0]; + $rw_commit->(); + my $msg = $ro->reopen-> + query('m:testmessage@example.com', {limit => 1})->[0]; is(defined($msg), 1, 'found testmessage@example.com'); if (defined $msg) { is($mime->header('Subject'), $msg->{subject}, @@ -310,19 +323,10 @@ $ibx->with_umask(sub { } }); -{ - my $msgs = $ro->query('d:19931002..20101002'); - ok(scalar(@$msgs) > 0, 'got results within range'); - $msgs = $ro->query('d:20101003..'); - is(scalar(@$msgs), 0, 'nothing after 20101003'); - $msgs = $ro->query('d:..19931001'); - is(scalar(@$msgs), 0, 'nothing before 19931001'); -} - # names and addresses { my $mset = $ro->query('t:list@example.com', {mset => 1}); - is($mset->size, 6, 'searched To: successfully'); + is($mset->size, 9, 'searched To: successfully'); foreach my $m ($mset->items) { my $smsg = $ro->{over_ro}->get_art($m->get_docid); like($smsg->{to}, qr/\blist\@example\.com\b/, 'to appears'); @@ -340,7 +344,7 @@ $ibx->with_umask(sub { } $mset = $ro->query('tc:list@example.com', {mset => 1}); - is($mset->size, 6, 'searched To+Cc: successfully'); + is($mset->size, 9, 'searched To+Cc: successfully'); foreach my $m ($mset->items) { my $smsg = $ro->{over_ro}->get_art($m->get_docid); my $tocc = join("\n", $smsg->{to}, $smsg->{cc});
Numbers are hard :< --- t/nntpd-v2.t | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/t/nntpd-v2.t b/t/nntpd-v2.t index 7fc3447e..1dd992a0 100644 --- a/t/nntpd-v2.t +++ b/t/nntpd-v2.t @@ -1,4 +1,4 @@ # Copyright (C) 2019-2020 all contributors <meta@public-inbox.org> # License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt> -local $ENV{PI_TEST_VERSION} = 1; +local $ENV{PI_TEST_VERSION} = 2; require './t/nntpd.t';
Since we no longer read document data from Xapian, allow users to opt-out of storing it. This breaks compatibility with previous releases of public-inbox, but gives us a ~1.5% space savings on Xapian storage (and associated I/O and page cache pressure reduction). --- Documentation/public-inbox-index.pod | 8 +++++++ Documentation/public-inbox-init.pod | 10 ++++++++ lib/PublicInbox/Admin.pm | 12 ++++++---- lib/PublicInbox/SearchIdx.pm | 35 +++++++++++++++++++++------- lib/PublicInbox/SearchIdxShard.pm | 2 +- script/public-inbox-convert | 3 ++- script/public-inbox-index | 7 ++++-- script/public-inbox-init | 8 +++++++ t/inbox_idle.t | 2 +- t/index-git-times.t | 11 ++++++++- t/init.t | 13 +++++++++++ 11 files changed, 91 insertions(+), 20 deletions(-) diff --git a/Documentation/public-inbox-index.pod b/Documentation/public-inbox-index.pod index 1ed9f5e7..46a53825 100644 --- a/Documentation/public-inbox-index.pod +++ b/Documentation/public-inbox-index.pod @@ -145,6 +145,14 @@ below. Available in public-inbox 1.6.0 (PENDING). +=item --skip-docdata + +Stop storing document data in Xapian on an existing inbox. + +See L<public-inbox-init(1)/--skip-docdata> for description and caveats. + +Available in public-inbox 1.6.0 (PENDING). + =back =head1 FILES diff --git a/Documentation/public-inbox-init.pod b/Documentation/public-inbox-init.pod index 4cc7e29f..3f98807a 100644 --- a/Documentation/public-inbox-init.pod +++ b/Documentation/public-inbox-init.pod @@ -95,6 +95,16 @@ default due to contention in the top-level producer process. Default: the number of online CPUs, up to 4 +=item --skip-docdata + +Do not store document data in Xapian, reducing Xapian storage +overhead by around 1.5%. + +Warning: this option prevents rollbacks to public-inbox 1.5.0 +and earlier. + +Available since public-inbox 1.6.0 (PENDING). + =back =head1 ENVIRONMENT diff --git a/lib/PublicInbox/Admin.pm b/lib/PublicInbox/Admin.pm index f5427af7..b8ead6f7 100644 --- a/lib/PublicInbox/Admin.pm +++ b/lib/PublicInbox/Admin.pm @@ -48,13 +48,14 @@ sub resolve_repo_dir { sub detect_indexlevel ($) { my ($ibx) = @_; - # brand new or never before indexed inboxes default to full - return 'full' unless $ibx->over; - delete $ibx->{over}; # don't leave open FD lying around + my $over = $ibx->over; + my $srch = $ibx->search; + delete @$ibx{qw(over search)}; # don't leave open FDs lying around + # brand new or never before indexed inboxes default to full + return 'full' unless $over; my $l = 'basic'; - my $srch = $ibx->search or return $l; - delete $ibx->{search}; # don't leave open FD lying around + return $l unless $srch; if (my $xdb = $srch->xdb) { $l = 'full'; my $m = $xdb->get_metadata('indexlevel'); @@ -65,6 +66,7 @@ sub detect_indexlevel ($) { $ibx->{inboxdir} has unexpected indexlevel in Xapian: $m } + $ibx->{-skip_docdata} = 1 if $xdb->get_metadata('skip_docdata'); } $l; } diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm index 5c39f3d6..be46b2b9 100644 --- a/lib/PublicInbox/SearchIdx.pm +++ b/lib/PublicInbox/SearchIdx.pm @@ -61,6 +61,10 @@ sub new { }, $class; $self->xpfx_init; $self->{-set_indexlevel_once} = 1 if $indexlevel eq 'medium'; + if ($ibx->{-skip_docdata}) { + $self->{-set_skip_docdata_once} = 1; + $self->{-skip_docdata} = 1; + } $ibx->umask_prepare; if ($version == 1) { $self->{lock_path} = "$inboxdir/ssoma.lock"; @@ -359,10 +363,18 @@ sub add_xapian ($$$$) { msg_iter($eml, \&index_xapian, [ $self, $doc ]); index_ids($self, $doc, $eml, $mids); - $smsg->{to} = $smsg->{cc} = ''; # WWW doesn't need these, only NNTP - PublicInbox::OverIdx::parse_references($smsg, $eml, $mids); - my $data = $smsg->to_doc_data; - $doc->set_data($data); + + # by default, we maintain compatibility with v1.5.0 and earlier + # by writing to docdata.glass, users who never exect to downgrade can + # use --skip-docdata + if (!$self->{-skip_docdata}) { + # WWW doesn't need {to} or {cc}, only NNTP + $smsg->{to} = $smsg->{cc} = ''; + PublicInbox::OverIdx::parse_references($smsg, $eml, $mids); + my $data = $smsg->to_doc_data; + $doc->set_data($data); + } + if (my $altid = $self->{-altid}) { foreach my $alt (@$altid) { my $pfx = $alt->{xprefix}; @@ -831,23 +843,28 @@ sub begin_txn_lazy { # store 'indexlevel=medium' in v2 shard=0 and v1 (only one shard) # This metadata is read by Admin::detect_indexlevel: -sub set_indexlevel { +sub set_metadata_once { my ($self) = @_; - if (!$self->{shard} && # undef or 0, not >0 - delete($self->{-set_indexlevel_once})) { - my $xdb = $self->{xdb}; + return if $self->{shard}; # only continue if undef or 0, not >0 + my $xdb = $self->{xdb}; + + if (delete($self->{-set_indexlevel_once})) { my $level = $xdb->get_metadata('indexlevel'); if (!$level || $level ne 'medium') { $xdb->set_metadata('indexlevel', 'medium'); } } + if (delete($self->{-set_skip_docdata_once})) { + $xdb->get_metadata('skip_docdata') or + $xdb->set_metadata('skip_docdata', '1'); + } } sub _commit_txn { my ($self) = @_; if (my $xdb = $self->{xdb}) { - set_indexlevel($self); + set_metadata_once($self); $xdb->commit_transaction; } $self->{over}->commit_lazy if $self->{over}; diff --git a/lib/PublicInbox/SearchIdxShard.pm b/lib/PublicInbox/SearchIdxShard.pm index 59b36087..20077e08 100644 --- a/lib/PublicInbox/SearchIdxShard.pm +++ b/lib/PublicInbox/SearchIdxShard.pm @@ -16,7 +16,7 @@ sub new { my $self = $class->SUPER::new($ibx, 1, $shard); # create the DB before forking: $self->idx_acquire; - $self->set_indexlevel; + $self->set_metadata_once; $self->idx_release; $self->spawn_worker($v2w, $shard) if $v2w->{parallel}; $self; diff --git a/script/public-inbox-convert b/script/public-inbox-convert index d655dcc6..4ff198d1 100755 --- a/script/public-inbox-convert +++ b/script/public-inbox-convert @@ -77,7 +77,8 @@ if ($old) { die "Only conversion from v1 inboxes is supported\n" if $old->version >= 2; require PublicInbox::Admin; -$old->{indexlevel} //= PublicInbox::Admin::detect_indexlevel($old); +my $detected = PublicInbox::Admin::detect_indexlevel($old); +$old->{indexlevel} //= $detected; my $env; if ($opt->{'index'}) { my $mods = {}; diff --git a/script/public-inbox-index b/script/public-inbox-index index 30d24838..9855c67d 100755 --- a/script/public-inbox-index +++ b/script/public-inbox-index @@ -39,7 +39,7 @@ GetOptions($opt, qw(verbose|v+ reindex rethread compact|c+ jobs|j=i prune indexlevel|index-level|L=s max_size|max-size=s batch_size|batch-size=s sequential_shard|seq-shard|sequential-shard - all help|?)) + skip-docdata all help|?)) or die "bad command-line args\n$usage"; if ($opt->{help}) { print $help; exit 0 }; die "--jobs must be >= 0\n" if defined $opt->{jobs} && $opt->{jobs} < 0; @@ -58,9 +58,11 @@ unless (@ibxs) { print STDERR "Usage: $usage\n"; exit 1 } my $mods = {}; foreach my $ibx (@ibxs) { + # detect_indexlevel may also set $ibx->{-skip_docdata} + my $detected = PublicInbox::Admin::detect_indexlevel($ibx); # XXX: users can shoot themselves in the foot, with opt->{indexlevel} $ibx->{indexlevel} //= $opt->{indexlevel} // ($opt->{xapian_only} ? - 'full' : PublicInbox::Admin::detect_indexlevel($ibx)); + 'full' : $detected); PublicInbox::Admin::scan_ibx_modules($mods, $ibx); } @@ -75,6 +77,7 @@ for my $ibx (@ibxs) { PublicInbox::Xapcmd::run($ibx, 'compact', $opt->{compact_opt}); } $ibx->{-no_fsync} = 1 if !$opt->{fsync}; + $ibx->{-skip_docdata} //= $opt->{'skip-docdata'}; my $ibx_opt = $opt; if (defined(my $s = $ibx->{lc('indexSequentialShard')})) { diff --git a/script/public-inbox-init b/script/public-inbox-init index b19c2321..037e8e56 100755 --- a/script/public-inbox-init +++ b/script/public-inbox-init @@ -34,6 +34,7 @@ require PublicInbox::Admin; PublicInbox::Admin::require_or_die('-base'); my ($version, $indexlevel, $skip_epoch, $skip_artnum, $jobs, $show_help); +my $skip_docdata; my $ng = ''; my %opts = ( 'V|version=i' => \$version, @@ -42,6 +43,7 @@ my %opts = ( 'skip-artnum=i' => \$skip_artnum, 'j|jobs=i' => \$jobs, 'ng|newsgroup=s' => \$ng, + 'skip-docdata' => \$skip_docdata, 'help|?' => \$show_help, ); my $usage_cb = sub { @@ -177,6 +179,12 @@ if (defined $jobs) { require PublicInbox::InboxWritable; $ibx = PublicInbox::InboxWritable->new($ibx, $creat_opt); +if ($skip_docdata) { + $ibx->{indexlevel} //= 'full'; # ensure init_inbox writes xdb + $ibx->{indexlevel} eq 'basic' and + die "--skip-docdata ignored with --indexlevel=basic\n"; + $ibx->{-skip_docdata} = $skip_docdata; +} $ibx->init_inbox(0, $skip_epoch, $skip_artnum); # needed for git prior to v2.1.0 diff --git a/t/inbox_idle.t b/t/inbox_idle.t index 61287200..e16ee11b 100644 --- a/t/inbox_idle.t +++ b/t/inbox_idle.t @@ -29,7 +29,7 @@ for my $V (1, 2) { if ($V == 1) { my $sidx = PublicInbox::SearchIdx->new($ibx, 1); $sidx->idx_acquire; - $sidx->set_indexlevel; + $sidx->set_metadata_once; $sidx->idx_release; # allow watching on lockfile } my $pi_config = PublicInbox::Config->new(\<<EOF); diff --git a/t/index-git-times.t b/t/index-git-times.t index 2e9e88e8..8f80c866 100644 --- a/t/index-git-times.t +++ b/t/index-git-times.t @@ -4,6 +4,7 @@ use Test::More; use PublicInbox::TestCommon; use PublicInbox::Import; use PublicInbox::Config; +use PublicInbox::Admin; use File::Path qw(remove_tree); require_mods(qw(DBD::SQLite Search::Xapian)); @@ -47,11 +48,15 @@ EOF PublicInbox::Import::run_die($cmd, undef, { 0 => $r }); } -run_script(['-index', $v1dir]) or die 'v1 index failed'; +run_script(['-index', '--skip-docdata', $v1dir]) or die 'v1 index failed'; + my $smsg; { my $cfg = PublicInbox::Config->new; my $ibx = $cfg->lookup($addr); + my $lvl = PublicInbox::Admin::detect_indexlevel($ibx); + is($lvl, 'medium', 'indexlevel detected'); + is($ibx->{-skip_docdata}, 1, '--skip-docdata flag set on -index'); $smsg = $ibx->over->get_art(1); is($smsg->{ds}, 749520000, 'datestamp from git author time'); is($smsg->{ts}, 1285977600, 'timestamp from git committer time'); @@ -70,6 +75,10 @@ SKIP: { my $check_v2 = sub { my $ibx = PublicInbox::Inbox->new({inboxdir => $v2dir, address => $addr}); + my $lvl = PublicInbox::Admin::detect_indexlevel($ibx); + is($lvl, 'medium', 'indexlevel detected after convert'); + is($ibx->{-skip_docdata}, 1, + '--skip-docdata preserved after convert'); my $v2smsg = $ibx->over->get_art(1); is($v2smsg->{ds}, $smsg->{ds}, 'v2 datestamp from git author time'); diff --git a/t/init.t b/t/init.t index 4d2c5049..dad09435 100644 --- a/t/init.t +++ b/t/init.t @@ -95,6 +95,19 @@ SKIP: { my $ibx = PublicInbox::Inbox->new({ inboxdir => $dir }); is(PublicInbox::Admin::detect_indexlevel($ibx), $lvl, 'detected expected level w/o config'); + ok(!$ibx->{-skip_docdata}, 'docdata written by default'); + } + for my $v (1, 2) { + my $name = "v$v-skip-docdata"; + my $dir = "$tmpdir/$name"; + $cmd = [ '-init', $name, "-V$v", '--skip-docdata', + $dir, "http://example.com/$name", + "$name\@example.com" ]; + ok(run_script($cmd), "-init -V$v --skip-docdata"); + my $ibx = PublicInbox::Inbox->new({ inboxdir => $dir }); + is(PublicInbox::Admin::detect_indexlevel($ibx), 'full', + "detected default indexlevel -V$v"); + ok($ibx->{-skip_docdata}, "docdata skip set -V$v"); } # loop for idempotency
We can avoid importing mdocid() in several places by using this method, simplifying callers. --- lib/PublicInbox/ExtMsg.pm | 4 +--- lib/PublicInbox/IMAP.pm | 4 +--- lib/PublicInbox/Mbox.pm | 7 ++----- lib/PublicInbox/Search.pm | 6 ++++++ lib/PublicInbox/SearchView.pm | 6 ++---- 5 files changed, 12 insertions(+), 15 deletions(-) diff --git a/lib/PublicInbox/ExtMsg.pm b/lib/PublicInbox/ExtMsg.pm index 5c23dc31..65892161 100644 --- a/lib/PublicInbox/ExtMsg.pm +++ b/lib/PublicInbox/ExtMsg.pm @@ -11,7 +11,6 @@ use warnings; use PublicInbox::Hval qw(ascii_html prurl mid_href); use PublicInbox::WwwStream qw(html_oneshot); use PublicInbox::Smsg; -use PublicInbox::Search qw(mdocid); our $MIN_PARTIAL_LEN = 16; # TODO: user-configurable @@ -61,7 +60,6 @@ sub search_partial ($$) { } } - my $n = $srch->{nshard} // 1; foreach my $m (@try) { # If Xapian can't handle the wildcard since it # has too many results. $@ can be @@ -70,7 +68,7 @@ sub search_partial ($$) { my $mset = eval { $srch->query($m, $opt) } or next; my @mids = map { $_->{mid} - } @{$ibx->over->get_all(map { mdocid($n, $_) } $mset->items)}; + } @{$ibx->over->get_all(@{$srch->mset_to_artnums($mset)})}; return \@mids if scalar(@mids); } } diff --git a/lib/PublicInbox/IMAP.pm b/lib/PublicInbox/IMAP.pm index 562c59d4..abdb8fec 100644 --- a/lib/PublicInbox/IMAP.pm +++ b/lib/PublicInbox/IMAP.pm @@ -40,7 +40,6 @@ use PublicInbox::Syscall qw(EPOLLIN EPOLLONESHOT); use PublicInbox::GitAsyncCat; use Text::ParseWords qw(parse_line); use Errno qw(EAGAIN); -use PublicInbox::Search qw(mdocid); use PublicInbox::IMAPsearchqp; my $Address; @@ -1188,9 +1187,8 @@ sub refill_xap ($$$$) { my ($beg, $end) = @$range_info; my $srch = $self->{ibx}->search; my $opt = { mset => 2, limit => 1000 }; - my $nshard = $srch->{nshard} // 1; my $mset = $srch->query("$q uid:$beg..$end", $opt); - @$uids = map { mdocid($nshard, $_) } $mset->items; + @$uids = @{$srch->mset_to_artnums($mset)}; if (@$uids) { $range_info->[0] = $uids->[-1] + 1; # update $beg return; # possibly more diff --git a/lib/PublicInbox/Mbox.pm b/lib/PublicInbox/Mbox.pm index 0fa9a38d..873ff7be 100644 --- a/lib/PublicInbox/Mbox.pm +++ b/lib/PublicInbox/Mbox.pm @@ -10,7 +10,6 @@ use PublicInbox::MID qw/mid_escape/; use PublicInbox::Hval qw/to_filename/; use PublicInbox::Smsg; use PublicInbox::Eml; -use PublicInbox::Search qw(mdocid); # called by PSGI server as body response # this gets called twice for every message, once to return the header, @@ -217,8 +216,7 @@ sub results_cb { my $mset = $srch->query($ctx->{query}, $ctx->{qopts}); my $size = $mset->size or return; $ctx->{qopts}->{offset} += $size; - my $nshard = $srch->{nshard} // 1; - $ctx->{ids} = [ map { mdocid($nshard, $_) } $mset->items ]; + $ctx->{ids} = $srch->mset_to_artnums($mset); } } @@ -234,8 +232,7 @@ sub mbox_all { return [404, [qw(Content-Type text/plain)], ["No results found\n"]]; $ctx->{query} = $query; - my $nshard = $srch->{nshard} // 1; - $ctx->{ids} = [ map { mdocid($nshard, $_) } $mset->items ]; + $ctx->{ids} = $srch->mset_to_artnums($mset); require PublicInbox::MboxGz; PublicInbox::MboxGz::mbox_gz($ctx, \&results_cb, 'results-'.$query); } diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm index 1c648299..c18e19d4 100644 --- a/lib/PublicInbox/Search.pm +++ b/lib/PublicInbox/Search.pm @@ -219,6 +219,12 @@ sub mdocid { int(($docid - 1) / $nshard) + 1; } +sub mset_to_artnums { + my ($self, $mset) = @_; + my $nshard = $self->{nshard} // 1; + [ map { mdocid($nshard, $_) } $mset->items ]; +} + sub xdb ($) { my ($self) = @_; $self->{xdb} ||= do { diff --git a/lib/PublicInbox/SearchView.pm b/lib/PublicInbox/SearchView.pm index ef1b9767..aa8fa037 100644 --- a/lib/PublicInbox/SearchView.pm +++ b/lib/PublicInbox/SearchView.pm @@ -91,9 +91,8 @@ sub mset_summary { my $pfx = ' ' x $pad; my $res = \($ctx->{-html_tip}); my $ibx = $ctx->{-inbox}; - my $nshard = $ibx->search->{nshard} // 1; my $obfs_ibx = $ibx->{obfuscate} ? $ibx : undef; - my @nums = map { mdocid($nshard, $_) } $mset->items; + my @nums = @{$ibx->search->mset_to_artnums($mset)}; my %num2msg = map { $_->{num} => $_ } @{$ibx->over->get_all(@nums)}; foreach my $m ($mset->items) { @@ -316,8 +315,7 @@ sub ctx_prepare { sub adump { my ($cb, $mset, $q, $ctx) = @_; - my $nshard = $ctx->{-inbox}->search->{nshard} // 1; - $ctx->{ids} = [ map { mdocid($nshard, $_) } $mset->items ]; + $ctx->{ids} = $ctx->{-inbox}->search->mset_to_artnums($mset); $ctx->{search_query} = $q; # used by WwwAtomStream::atom_header PublicInbox::WwwAtomStream->response($ctx, 200, \&adump_i); }
Eric Wong <e@yhbt.net> wrote: > +Some of the options documented in L<public-inbox-config(5)> > +require editing the config file. Old versions lack the > +C<-n>/C<--newsgroup> parameter While working on this, I realized -n vs -N could be confusing, so I made the abbreviation --ng instead. So I'll squash this in before pushing: diff --git a/Documentation/public-inbox-init.pod b/Documentation/public-inbox-init.pod index 240959eb..ea2b38dd 100644 --- a/Documentation/public-inbox-init.pod +++ b/Documentation/public-inbox-init.pod @@ -111,7 +111,7 @@ Used to override the default C<~/.public-inbox/config> value. Some of the options documented in L<public-inbox-config(5)> require editing the config file. Old versions lack the -C<-n>/C<--newsgroup> parameter +C<--ng>/C<--newsgroup> parameter See L<public-inbox-config(5)> for all the options which may be applied to a given inbox.