* [PATCH 00/23] indexing: --skip-docdata + speedups
@ 2020-08-20 20:24 Eric Wong
2020-08-20 20:24 ` [PATCH 01/23] doc: note -compact and -xcpdb are rarely used Eric Wong
` (22 more replies)
0 siblings, 23 replies; 25+ messages in thread
From: Eric Wong @ 2020-08-20 20:24 UTC (permalink / raw)
To: meta
Some miscellaneous help and cleanup things, too.
Document data is no longer read from Xapian by read-only
daemons; that data is redundant given over.sqlite3 always
exists. This should should improve page cache hit rates for
over.sqlite3 by a small bit. Being able to mass load a bunch of
rows from SQLite speeds up the default search summary view by
~10%, too.
--skip-docdata option to -init and -index can avoid writing
Xapian document data, saving ~1.5% in Xapian space overhead
(and associated I/O and page cache overheads). It breaks
rollbacks to old versions, though, so it won't be the default.
Eric Wong (23):
doc: note -compact and -xcpdb are rarely used
admin: progress shows the inbox being indexed
compact: support --help/-? and perform lazy loading
init: support --help and -?
init: support --newsgroup option
init: drop -N alias for --skip-artnum
search: v2: ensure shards are numerically sorted
xapcmd: simplify {reindex} parameter passing
www: reduce long-lived PublicInbox::Search references
search: improve comments around constants
search: export mdocid subroutine
searchquery: split off from searchview
search: make qparse_new an internal function
smsg: reduce utf8::decode call sites
searchview: use over.sqlite3 instead of Xapian docdata
searchview: speed up search summary by ~10%
searchview: convert nested and Atom display to over.sqlite3
extmsg: avoid using Xapian docdata
mbox: avoid Xapian docdata in search results
smsg: remove from_mitem
t/nntpd-v2: set PI_TEST_VERSION=2 properly
init+index: support --skip-docdata for Xapian
search: add mset_to_artnums method
Documentation/public-inbox-compact.pod | 5 +
Documentation/public-inbox-config.pod | 2 +-
Documentation/public-inbox-index.pod | 8 ++
Documentation/public-inbox-init.pod | 37 +++++++-
Documentation/public-inbox-xcpdb.pod | 3 +
MANIFEST | 1 +
lib/PublicInbox/Admin.pm | 15 ++-
lib/PublicInbox/ExtMsg.pm | 19 ++--
lib/PublicInbox/IMAP.pm | 5 +-
lib/PublicInbox/Inbox.pm | 11 ++-
lib/PublicInbox/Mbox.pm | 24 +++--
lib/PublicInbox/Over.pm | 28 +++---
lib/PublicInbox/Search.pm | 125 ++++++++++++++-----------
lib/PublicInbox/SearchIdx.pm | 35 +++++--
lib/PublicInbox/SearchIdxShard.pm | 2 +-
lib/PublicInbox/SearchQuery.pm | 53 +++++++++++
lib/PublicInbox/SearchView.pm | 90 ++++--------------
lib/PublicInbox/Smsg.pm | 10 +-
lib/PublicInbox/Xapcmd.pm | 20 ++--
script/public-inbox-compact | 39 ++++++--
script/public-inbox-convert | 3 +-
script/public-inbox-index | 7 +-
script/public-inbox-init | 101 +++++++++++++-------
t/imapd.t | 6 +-
t/inbox_idle.t | 2 +-
t/index-git-times.t | 11 ++-
t/init.t | 17 +++-
t/nntpd-v2.t | 2 +-
t/nntpd.t | 9 +-
t/search.t | 34 ++++---
30 files changed, 437 insertions(+), 287 deletions(-)
create mode 100644 lib/PublicInbox/SearchQuery.pm
^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH 01/23] doc: note -compact and -xcpdb are rarely used
2020-08-20 20:24 [PATCH 00/23] indexing: --skip-docdata + speedups Eric Wong
@ 2020-08-20 20:24 ` Eric Wong
2020-08-20 20:24 ` [PATCH 02/23] admin: progress shows the inbox being indexed Eric Wong
` (21 subsequent siblings)
22 siblings, 0 replies; 25+ messages in thread
From: Eric Wong @ 2020-08-20 20:24 UTC (permalink / raw)
To: meta
Slowly improving the learning curve...
---
Documentation/public-inbox-compact.pod | 5 +++++
Documentation/public-inbox-xcpdb.pod | 3 +++
2 files changed, 8 insertions(+)
diff --git a/Documentation/public-inbox-compact.pod b/Documentation/public-inbox-compact.pod
index 8e463ab1..4e9b6d9f 100644
--- a/Documentation/public-inbox-compact.pod
+++ b/Documentation/public-inbox-compact.pod
@@ -19,6 +19,11 @@ It enforces the use of the C<--no-renumber> option of
L<xapian-compact(1)> which is required to work with the
rest of the public-inbox search code.
+This command is rarely needed for active inboxes.
+
+Using the C<--compact> option of L<public-inbox-index(1)>
+is recommended, instead, and only when doing a C<--reindex>.
+
=head1 OPTIONS
=over
diff --git a/Documentation/public-inbox-xcpdb.pod b/Documentation/public-inbox-xcpdb.pod
index 2b38f1da..52939894 100644
--- a/Documentation/public-inbox-xcpdb.pod
+++ b/Documentation/public-inbox-xcpdb.pod
@@ -17,6 +17,9 @@ accept parallel Xapian database modifications from
L<public-inbox-watch(1)>, L<public-inbox-mda(1)>,
L<public-inbox-learn(1)>, and L<public-inbox-index(1)>.
+This command is rarely used, as Xapian DB formats rarely
+change.
+
=head1 OPTIONS
=over
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH 02/23] admin: progress shows the inbox being indexed
2020-08-20 20:24 [PATCH 00/23] indexing: --skip-docdata + speedups Eric Wong
2020-08-20 20:24 ` [PATCH 01/23] doc: note -compact and -xcpdb are rarely used Eric Wong
@ 2020-08-20 20:24 ` Eric Wong
2020-08-20 20:24 ` [PATCH 03/23] compact: support --help/-? and perform lazy loading Eric Wong
` (20 subsequent siblings)
22 siblings, 0 replies; 25+ messages in thread
From: Eric Wong @ 2020-08-20 20:24 UTC (permalink / raw)
To: meta
This is helpful with --all, or when multiple inboxes
are being indexed.
---
lib/PublicInbox/Admin.pm | 3 +++
1 file changed, 3 insertions(+)
diff --git a/lib/PublicInbox/Admin.pm b/lib/PublicInbox/Admin.pm
index d99a00b4..f5427af7 100644
--- a/lib/PublicInbox/Admin.pm
+++ b/lib/PublicInbox/Admin.pm
@@ -207,6 +207,9 @@ sub index_terminate {
sub index_inbox {
my ($ibx, $im, $opt) = @_;
my $jobs = delete $opt->{jobs} if $opt;
+ if (my $pr = $opt->{-progress}) {
+ $pr->("indexing $ibx->{inboxdir} ...\n");
+ }
local %SIG = %SIG;
setup_signals(\&index_terminate, $ibx);
if (ref($ibx) && $ibx->version == 2) {
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH 03/23] compact: support --help/-? and perform lazy loading
2020-08-20 20:24 [PATCH 00/23] indexing: --skip-docdata + speedups Eric Wong
2020-08-20 20:24 ` [PATCH 01/23] doc: note -compact and -xcpdb are rarely used Eric Wong
2020-08-20 20:24 ` [PATCH 02/23] admin: progress shows the inbox being indexed Eric Wong
@ 2020-08-20 20:24 ` Eric Wong
2020-08-20 20:24 ` [PATCH 04/23] init: support --help and -? Eric Wong
` (19 subsequent siblings)
22 siblings, 0 replies; 25+ messages in thread
From: Eric Wong @ 2020-08-20 20:24 UTC (permalink / raw)
To: meta
This probably won't be used much, but --help can still
make sense.
---
script/public-inbox-compact | 39 +++++++++++++++++++++++++++----------
1 file changed, 29 insertions(+), 10 deletions(-)
diff --git a/script/public-inbox-compact b/script/public-inbox-compact
index b5fa0086..a6bb62bd 100755
--- a/script/public-inbox-compact
+++ b/script/public-inbox-compact
@@ -1,18 +1,37 @@
-#!/usr/bin/perl -w
+#!perl -w
# Copyright (C) 2018-2020 all contributors <meta@public-inbox.org>
# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
use strict;
-use warnings;
+use v5.10.1;
use Getopt::Long qw(:config gnu_getopt no_ignore_case auto_abbrev);
-use PublicInbox::InboxWritable;
-use PublicInbox::Xapcmd;
-use PublicInbox::Admin;
-PublicInbox::Admin::require_or_die('-index');
-my $usage = "Usage: public-inbox-compact INBOX_DIR\n";
+my $usage = 'public-inbox-compact INBOX_DIR';
my $opt = { compact => 1, -coarse_lock => 1 };
-GetOptions($opt, qw(all), @PublicInbox::Xapcmd::COMPACT_OPT) or
- die "bad command-line args\n$usage";
-my @ibxs = PublicInbox::Admin::resolve_inboxes(\@ARGV, $opt) or die $usage;
+my $help = <<EOF; # the following should fit w/o scrolling in 80x24 term:
+Usage: $usage
+
+ Compact Xapian DBs in an inbox
+
+options:
+
+ --all index all configured inboxes
+ --jobs=NUM control parallelization
+
+See public-inbox-compact(1) man page for full documentation.
+EOF
+GetOptions($opt, qw(all help|?),
+ # compact options:
+ qw(jobs|j=i quiet|q blocksize|b=s no-full|n fuller|F),
+) or die "bad command-line args\n$usage\n";
+if ($opt->{help}) { print $help; exit 0 };
+
+require PublicInbox::Admin;
+PublicInbox::Admin::require_or_die('-index');
+PublicInbox::Admin::progress_prepare($opt);
+
+require PublicInbox::InboxWritable;
+require PublicInbox::Xapcmd;
+my @ibxs = PublicInbox::Admin::resolve_inboxes(\@ARGV, $opt);
+unless (@ibxs) { print STDERR "Usage: $usage\n"; exit 1 }
foreach (@ibxs) {
my $ibx = PublicInbox::InboxWritable->new($_);
PublicInbox::Xapcmd::run($ibx, 'compact', $opt);
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH 04/23] init: support --help and -?
2020-08-20 20:24 [PATCH 00/23] indexing: --skip-docdata + speedups Eric Wong
` (2 preceding siblings ...)
2020-08-20 20:24 ` [PATCH 03/23] compact: support --help/-? and perform lazy loading Eric Wong
@ 2020-08-20 20:24 ` Eric Wong
2020-08-20 20:24 ` [PATCH 05/23] init: support --newsgroup option Eric Wong
` (18 subsequent siblings)
22 siblings, 0 replies; 25+ messages in thread
From: Eric Wong @ 2020-08-20 20:24 UTC (permalink / raw)
To: meta
And speed those up with some lazy loading, too.
---
script/public-inbox-init | 79 +++++++++++++++++++++++++---------------
1 file changed, 50 insertions(+), 29 deletions(-)
diff --git a/script/public-inbox-init b/script/public-inbox-init
index 1c8066df..6852f64a 100755
--- a/script/public-inbox-init
+++ b/script/public-inbox-init
@@ -1,57 +1,76 @@
-#!/usr/bin/perl -w
+#!perl -w
# Copyright (C) 2014-2020 all contributors <meta@public-inbox.org>
# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
-#
-# Initializes a public-inbox, basically a wrapper for git-init(1)
use strict;
-use warnings;
-sub usage {
- print STDERR <<EOF;
-Usage: public-inbox-init NAME INBOX_DIR HTTP_URL ADDRESS [ADDRESS..]
-EOF
- exit 1;
-}
+use v5.10.1;
use Getopt::Long qw/:config gnu_getopt no_ignore_case auto_abbrev/;
-use PublicInbox::Admin;
-PublicInbox::Admin::require_or_die('-base');
-use PublicInbox::Config;
-use PublicInbox::InboxWritable;
-use PublicInbox::Import;
-use File::Temp;
-use PublicInbox::Lock;
-use File::Basename qw/dirname/;
-use File::Path qw/mkpath/;
use Fcntl qw(:DEFAULT);
-use Cwd qw/abs_path/;
+my $usage = 'public-inbox-init NAME INBOX_DIR HTTP_URL ADDRESS [ADDRESS..]';
+my $help = <<EOF; # the following should fit w/o scrolling in 80x24 term:
+usage: $usage
+
+ Initialize a public-inbox
+
+required arguments:
+
+ NAME the name of the inbox
+ INBOX_DIR pathname the inbox
+ HTTP_URL HTTP (or HTTPS) URL
+ ADDRESS email address(es), may be specified multiple times
+
+options:
-my ($version, $indexlevel, $skip_epoch, $skip_artnum, $jobs);
+ -V2 use scalable public-inbox-v2-format(5)
+ -L LEVEL index level `basic', `medium', or `full' (default: full)
+ --skip-artnum=NUM NNTP article numbers to skip
+ --skip-epoch=NUM epochs to skip (-V2 only)
+ -J JOBS number of indexing jobs (-V2 only), (default: 4)
+
+See public-inbox-init(1) man page for full documentation.
+EOF
+
+require PublicInbox::Admin;
+PublicInbox::Admin::require_or_die('-base');
+
+my ($version, $indexlevel, $skip_epoch, $skip_artnum, $jobs, $show_help);
my %opts = (
'V|version=i' => \$version,
'L|index-level|indexlevel=s' => \$indexlevel,
'S|skip|skip-epoch=i' => \$skip_epoch,
'N|skip-artnum=i' => \$skip_artnum,
'j|jobs=i' => \$jobs,
+ 'help|?' => \$show_help,
);
-GetOptions(%opts) or usage();
+my $usage_cb = sub {
+ print STDERR "Usage: $usage\n";
+ exit 1;
+};
+GetOptions(%opts) or $usage_cb->();
+if ($show_help) { print $help; exit 0 };
PublicInbox::Admin::indexlevel_ok_or_die($indexlevel) if defined $indexlevel;
-my $name = shift @ARGV or usage();
-my $inboxdir = shift @ARGV or usage();
-my $http_url = shift @ARGV or usage();
+my $name = shift @ARGV or $usage_cb->();
+my $inboxdir = shift @ARGV or $usage_cb->();
+my $http_url = shift @ARGV or $usage_cb->();
my (@address) = @ARGV;
-@address or usage();
+@address or $usage_cb->();
my %seen;
+require PublicInbox::Config;
my $pi_config = PublicInbox::Config->default_file;
-my $dir = dirname($pi_config);
-mkpath($dir); # will croak on fatal errors
+require File::Basename;
+my $dir = File::Basename::dirname($pi_config);
+require File::Path;
+File::Path::mkpath($dir); # will croak on fatal errors
# first, we grab a flock to prevent simultaneous public-inbox-init
# processes from trampling over each other, or exiting with 255 on
# O_EXCL failure below. This gets unlocked automatically on exit:
+require PublicInbox::Lock;
my $lock_obj = { lock_path => "$pi_config.flock" };
PublicInbox::Lock::lock_acquire($lock_obj);
# git-config will operate on this (and rename on success):
+require File::Temp;
my $fh = File::Temp->new(TEMPLATE => 'pi-init-XXXXXXXX', DIR => $dir);
# Now, we grab another lock to use git-config(1) locking, so it won't
@@ -111,7 +130,8 @@ close($fh) or die "failed to close $pi_config_tmp: $!\n";
my $pfx = "publicinbox.$name";
my @x = (qw/git config/, "--file=$pi_config_tmp");
-$inboxdir = abs_path($inboxdir);
+require Cwd;
+$inboxdir = Cwd::abs_path($inboxdir);
die "`\\n' not allowed in `$inboxdir'\n" if $inboxdir =~ /\n/s;
if (-f "$inboxdir/inbox.lock") {
if (!defined $version) {
@@ -148,6 +168,7 @@ if (defined $jobs) {
$creat_opt->{nproc} = $jobs;
}
+require PublicInbox::InboxWritable;
$ibx = PublicInbox::InboxWritable->new($ibx, $creat_opt);
$ibx->init_inbox(0, $skip_epoch, $skip_artnum);
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH 05/23] init: support --newsgroup option
2020-08-20 20:24 [PATCH 00/23] indexing: --skip-docdata + speedups Eric Wong
` (3 preceding siblings ...)
2020-08-20 20:24 ` [PATCH 04/23] init: support --help and -? Eric Wong
@ 2020-08-20 20:24 ` Eric Wong
2020-08-20 21:10 ` Eric Wong
2020-08-20 20:24 ` [PATCH 06/23] init: drop -N alias for --skip-artnum Eric Wong
` (17 subsequent siblings)
22 siblings, 1 reply; 25+ messages in thread
From: Eric Wong @ 2020-08-20 20:24 UTC (permalink / raw)
To: meta
We can reduce the need to edit the config file for NNTP group names
this way.
---
Documentation/public-inbox-config.pod | 2 +-
Documentation/public-inbox-init.pod | 25 +++++++++++++++++++++----
script/public-inbox-init | 12 ++++++++++--
t/imapd.t | 6 ++----
t/nntpd.t | 9 +++------
5 files changed, 37 insertions(+), 17 deletions(-)
diff --git a/Documentation/public-inbox-config.pod b/Documentation/public-inbox-config.pod
index 05b84819..1dfb926e 100644
--- a/Documentation/public-inbox-config.pod
+++ b/Documentation/public-inbox-config.pod
@@ -63,7 +63,7 @@ Default: none, optional
=item publicinbox.<name>.newsgroup
The NNTP group name for use with L<public-inbox-nntpd(8)>. This
-may be any newsgroup name with hierarchies delimited by '.'.
+may be any newsgroup name with hierarchies delimited by C<.>.
For example, the newsgroup for L<mailto:meta@public-inbox.org>
is: C<inbox.comp.mail.public-inbox.meta>
diff --git a/Documentation/public-inbox-init.pod b/Documentation/public-inbox-init.pod
index d0c87563..240959eb 100644
--- a/Documentation/public-inbox-init.pod
+++ b/Documentation/public-inbox-init.pod
@@ -39,6 +39,21 @@ See L<public-inbox-config(5)> for more information.
Default: C<full>
+=item --ng, --newsgroup NEWSGROUP
+
+The NNTP group name for use with L<public-inbox-nntpd(8)>. This
+may be any newsgroup name with hierarchies delimited by C<.>.
+For example, the newsgroup for L<mailto:meta@public-inbox.org>
+is: C<inbox.comp.mail.public-inbox.meta>
+
+This may be set after-the-fact via C<publicinbox.$NAME.newsgroup>
+in the configuration file. See L<public-inbox-config(5)> for more
+info.
+
+Available since public-inbox 1.6.0 (PENDING).
+
+Default: none.
+
=item -N, --skip-artnum
This option allows archivists to publish incomplete archives
@@ -94,10 +109,12 @@ Used to override the default C<~/.public-inbox/config> value.
=head1 LIMITATIONS
-This tool predates NNTP support in public-inbox and is missing
-C<newsgroup> and many of the options documented in
-L<public-inbox-config(5)>. See L<public-inbox-config(5)> for all the
-options which may be applied to a given inbox.
+Some of the options documented in L<public-inbox-config(5)>
+require editing the config file. Old versions lack the
+C<-n>/C<--newsgroup> parameter
+
+See L<public-inbox-config(5)> for all the options which may be applied
+to a given inbox.
=head1 CONTACT
diff --git a/script/public-inbox-init b/script/public-inbox-init
index 6852f64a..90b32be8 100755
--- a/script/public-inbox-init
+++ b/script/public-inbox-init
@@ -22,6 +22,7 @@ options:
-V2 use scalable public-inbox-v2-format(5)
-L LEVEL index level `basic', `medium', or `full' (default: full)
+ --ng NEWSGROUP set NNTP newsgroup name
--skip-artnum=NUM NNTP article numbers to skip
--skip-epoch=NUM epochs to skip (-V2 only)
-J JOBS number of indexing jobs (-V2 only), (default: 4)
@@ -33,12 +34,14 @@ require PublicInbox::Admin;
PublicInbox::Admin::require_or_die('-base');
my ($version, $indexlevel, $skip_epoch, $skip_artnum, $jobs, $show_help);
+my $ng = '';
my %opts = (
'V|version=i' => \$version,
'L|index-level|indexlevel=s' => \$indexlevel,
'S|skip|skip-epoch=i' => \$skip_epoch,
'N|skip-artnum=i' => \$skip_artnum,
'j|jobs=i' => \$jobs,
+ 'ng|newsgroup=s' => \$ng,
'help|?' => \$show_help,
);
my $usage_cb = sub {
@@ -53,7 +56,11 @@ my $inboxdir = shift @ARGV or $usage_cb->();
my $http_url = shift @ARGV or $usage_cb->();
my (@address) = @ARGV;
@address or $usage_cb->();
-my %seen;
+
+$ng =~ m![^A-Za-z0-9/_\.\-\~\@\+\=:]! and
+ die "--newsgroup `$ng' is not valid\n";
+($ng =~ m!\A\.! || $ng =~ m!\.\z!) and
+ die "--newsgroup `$ng' must not start or end with `.'\n";
require PublicInbox::Config;
my $pi_config = PublicInbox::Config->default_file;
@@ -84,7 +91,7 @@ sysopen($lockfh, $lockfile, O_RDWR|O_CREAT|O_EXCL) or do {
exit(255);
};
my $auto_unlink = UnlinkMe->new($lockfile);
-my $perm;
+my ($perm, %seen);
if (-e $pi_config) {
open(my $oh, '<', $pi_config) or die "unable to read $pi_config: $!\n";
my @st = stat($oh);
@@ -185,6 +192,7 @@ PublicInbox::Import::run_die([@x, "$pfx.inboxdir", $inboxdir]);
if (defined($indexlevel)) {
PublicInbox::Import::run_die([@x, "$pfx.indexlevel", $indexlevel]);
}
+PublicInbox::Import::run_die([@x, "$pfx.newsgroup", $ng]) if $ng ne '';
# needed for git prior to v2.1.0
if (defined $perm) {
diff --git a/t/imapd.t b/t/imapd.t
index 6cfced41..4d627af7 100644
--- a/t/imapd.t
+++ b/t/imapd.t
@@ -39,11 +39,9 @@ for my $V (@V) {
my $url = "http://example.com/i$V";
my $inboxdir = "$tmpdir/$name";
my $folder = "inbox.i$V";
- my $cmd = ['-init', "-V$V", "-L$level", $name, $inboxdir, $url, $addr];
+ my $cmd = ['-init', "-V$V", "-L$level", "--ng=$folder",
+ $name, $inboxdir, $url, $addr];
run_script($cmd) or BAIL_OUT("init $name");
- xsys(qw(git config), "--file=$ENV{HOME}/.public-inbox/config",
- "publicinbox.$name.newsgroup", $folder) == 0 or
- BAIL_OUT("setting newsgroup $V");
if ($V == 1) {
xsys(qw(git config), "--file=$ENV{HOME}/.public-inbox/config",
'publicinboxmda.spamcheck', 'none') == 0 or
diff --git a/t/nntpd.t b/t/nntpd.t
index b9b9a63d..74e21a41 100644
--- a/t/nntpd.t
+++ b/t/nntpd.t
@@ -46,14 +46,11 @@ my $ibx = {
$ibx = PublicInbox::Inbox->new($ibx);
{
local $ENV{HOME} = $home;
- my @cmd = ('-init', $group, $inboxdir, 'http://example.com/', $addr);
- push @cmd, "-V$version", '-Lbasic';
+ my @cmd = ('-init', $group, $inboxdir, 'http://example.com/', $addr,
+ "-V$version", '-Lbasic', '--newsgroup', $group);
ok(run_script(\@cmd), 'init OK');
- is(xsys(qw(git config), "--file=$home/.public-inbox/config",
- "publicinbox.$group.newsgroup", $group),
- 0, 'enabled newsgroup');
- my $len;
+ my $len;
$ibx = PublicInbox::InboxWritable->new($ibx);
my $im = $ibx->importer(0);
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH 06/23] init: drop -N alias for --skip-artnum
2020-08-20 20:24 [PATCH 00/23] indexing: --skip-docdata + speedups Eric Wong
` (4 preceding siblings ...)
2020-08-20 20:24 ` [PATCH 05/23] init: support --newsgroup option Eric Wong
@ 2020-08-20 20:24 ` Eric Wong
2020-08-20 20:24 ` [PATCH 07/23] search: v2: ensure shards are numerically sorted Eric Wong
` (16 subsequent siblings)
22 siblings, 0 replies; 25+ messages in thread
From: Eric Wong @ 2020-08-20 20:24 UTC (permalink / raw)
To: meta
It may be too easily confused for --newsgroup or --ng. This is
too rarely used and never made it into a release, so it should
be fine.
---
Documentation/public-inbox-init.pod | 2 +-
script/public-inbox-init | 2 +-
t/init.t | 4 ++--
3 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/Documentation/public-inbox-init.pod b/Documentation/public-inbox-init.pod
index 240959eb..4cc7e29f 100644
--- a/Documentation/public-inbox-init.pod
+++ b/Documentation/public-inbox-init.pod
@@ -54,7 +54,7 @@ Available since public-inbox 1.6.0 (PENDING).
Default: none.
-=item -N, --skip-artnum
+=item --skip-artnum
This option allows archivists to publish incomplete archives
with only new mail while allowing NNTP article numbers
diff --git a/script/public-inbox-init b/script/public-inbox-init
index 90b32be8..b19c2321 100755
--- a/script/public-inbox-init
+++ b/script/public-inbox-init
@@ -39,7 +39,7 @@ my %opts = (
'V|version=i' => \$version,
'L|index-level|indexlevel=s' => \$indexlevel,
'S|skip|skip-epoch=i' => \$skip_epoch,
- 'N|skip-artnum=i' => \$skip_artnum,
+ 'skip-artnum=i' => \$skip_artnum,
'j|jobs=i' => \$jobs,
'ng|newsgroup=s' => \$ng,
'help|?' => \$show_help,
diff --git a/t/init.t b/t/init.t
index 6211bb58..4d2c5049 100644
--- a/t/init.t
+++ b/t/init.t
@@ -116,7 +116,7 @@ SKIP: {
'publicinboxmda.spamcheck', 'none') == 0 or
BAIL_OUT "git config $?";
my $addr = 'skip3@example.com';
- $cmd = [ qw(-init -V2 -Lbasic -N12 skip3), "$tmpdir/skip3",
+ $cmd = [ qw(-init -V2 -Lbasic --skip-artnum=12 skip3), "$tmpdir/skip3",
qw(http://example.com/skip3), $addr ];
ok(run_script($cmd), '--skip-artnum -V2');
my $env = { ORIGINAL_RECIPIENT => $addr };
@@ -131,7 +131,7 @@ SKIP: {
$addr = 'skip4@example.com';
$env = { ORIGINAL_RECIPIENT => $addr };
- $cmd = [ qw(-init -V1 -N12 -Lmedium skip4), "$tmpdir/skip4",
+ $cmd = [ qw(-init -V1 --skip-artnum 12 -Lmedium skip4), "$tmpdir/skip4",
qw(http://example.com/skip4), $addr ];
ok(run_script($cmd), '--skip-artnum -V1');
$err = '';
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH 07/23] search: v2: ensure shards are numerically sorted
2020-08-20 20:24 [PATCH 00/23] indexing: --skip-docdata + speedups Eric Wong
` (5 preceding siblings ...)
2020-08-20 20:24 ` [PATCH 06/23] init: drop -N alias for --skip-artnum Eric Wong
@ 2020-08-20 20:24 ` Eric Wong
2020-08-20 20:24 ` [PATCH 08/23] xapcmd: simplify {reindex} parameter passing Eric Wong
` (15 subsequent siblings)
22 siblings, 0 replies; 25+ messages in thread
From: Eric Wong @ 2020-08-20 20:24 UTC (permalink / raw)
To: meta
This seems required to correctly get the NNTP article number
from Xapian docid on combined Xapian DBs. The default
(ASCII-betical) sorting was only acceptable for -imapd users
until somebody hit 11 (or more) shards, which is a rare case.
---
lib/PublicInbox/Search.pm | 27 ++++++++++++++++-----------
1 file changed, 16 insertions(+), 11 deletions(-)
diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index 4e08aed7..4d02a7c1 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -177,19 +177,24 @@ sub _xdb ($) {
my ($xdb, $slow_phrase);
my $qpf = \($self->{qp_flags} ||= $QP_FLAGS);
if ($self->{ibx_ver} >= 2) {
- my $n = 0;
- foreach my $shard (<$dir/*>) {
- -d $shard && $shard =~ m!/[0-9]+\z! or next;
- my $sub = $X{Database}->new($shard);
- if ($xdb) {
- $xdb->add_database($sub);
- } else {
- $xdb = $sub;
+ my @xdb;
+ opendir(my $dh, $dir) or return; # not initialized yet
+
+ # We need numeric sorting so shard[0] is first for reading
+ # Xapian metadata, if needed
+ for (sort { $a <=> $b } grep(/\A[0-9]+\z/, readdir($dh))) {
+ my $shard_dir = "$dir/$_";
+ if (-d $shard_dir && -r _) {
+ push @xdb, $X{Database}->new($shard_dir);
+ $slow_phrase ||= -f "$shard_dir/iamchert";
+ } else { # gaps from missing epochs throw off mdocid()
+ warn "E: $shard_dir missing or unreadable\n";
+ return;
}
- $slow_phrase ||= -f "$shard/iamchert";
- ++$n;
}
- $self->{nshard} = $n;
+ $self->{nshard} = scalar(@xdb);
+ $xdb = shift @xdb;
+ $xdb->add_database($_) for @xdb;
} else {
$slow_phrase = -f "$dir/iamchert";
$xdb = $X{Database}->new($dir);
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH 08/23] xapcmd: simplify {reindex} parameter passing
2020-08-20 20:24 [PATCH 00/23] indexing: --skip-docdata + speedups Eric Wong
` (6 preceding siblings ...)
2020-08-20 20:24 ` [PATCH 07/23] search: v2: ensure shards are numerically sorted Eric Wong
@ 2020-08-20 20:24 ` Eric Wong
2020-08-20 20:24 ` [PATCH 09/23] www: reduce long-lived PublicInbox::Search references Eric Wong
` (14 subsequent siblings)
22 siblings, 0 replies; 25+ messages in thread
From: Eric Wong @ 2020-08-20 20:24 UTC (permalink / raw)
To: meta
No need to localize it, here, since we can just refer to it
in the `$opt' hashref. Hopefully this improves readability
for others like it does for me.
I sometimes wonder if the concept of a stack in high-level
languages is even necessary...
---
lib/PublicInbox/Xapcmd.pm | 20 +++++++++-----------
1 file changed, 9 insertions(+), 11 deletions(-)
diff --git a/lib/PublicInbox/Xapcmd.pm b/lib/PublicInbox/Xapcmd.pm
index 46548a94..fffac99c 100644
--- a/lib/PublicInbox/Xapcmd.pm
+++ b/lib/PublicInbox/Xapcmd.pm
@@ -19,7 +19,6 @@ our @COMPACT_OPT = qw(jobs|j=i quiet|q blocksize|b=s no-full|n fuller|F);
sub commit_changes ($$$$) {
my ($ibx, $im, $tmp, $opt) = @_;
my $reshard = $opt->{reshard};
- my $reindex = $opt->{reindex};
$SIG{INT} or die 'BUG: $SIG{INT} not handled';
my @old_shard;
@@ -102,17 +101,17 @@ sub runnable_or_die ($) {
}
sub prepare_reindex ($$$) {
- my ($ibx, $im, $reindex) = @_;
+ my ($ibx, $im, $opt) = @_;
if ($ibx->version == 1) {
my $dir = $ibx->search->xdir(1);
my $xdb = $PublicInbox::Search::X{Database}->new($dir);
if (my $lc = $xdb->get_metadata('last_commit')) {
- $reindex->{from} = $lc;
+ $opt->{reindex}->{from} = $lc;
}
} else { # v2
my $max;
$im->git_dir_latest(\$max) or return;
- my $from = $reindex->{from};
+ my $from = $opt->{reindex}->{from};
my $mm = $ibx->mm;
my $v = PublicInbox::Search::SCHEMA_VERSION();
foreach my $i (0..$max) {
@@ -238,14 +237,14 @@ sub prepare_run {
sub check_compact () { runnable_or_die($XAPIAN_COMPACT) }
sub _run {
- my ($ibx, $cb, $opt, $reindex) = @_;
+ my ($ibx, $cb, $opt) = @_;
my $im = $ibx->importer(0);
$im->lock_acquire;
my ($tmp, $queue) = prepare_run($ibx, $opt);
# fine-grained locking if we prepare for reindex
if (!$opt->{-coarse_lock}) {
- prepare_reindex($ibx, $im, $reindex);
+ prepare_reindex($ibx, $im, $opt);
$im->lock_release;
}
@@ -262,19 +261,18 @@ sub run {
defined(my $dir = $ibx->{inboxdir}) or die "no inboxdir defined\n";
-d $dir or die "inboxdir=$dir does not exist\n";
check_compact() if $opt->{compact} && $ibx->search;
- my $reindex; # v1:{ from => $x40 }, v2:{ from => [ $x40, $x40, .. ] } }
if (!$opt->{-coarse_lock}) {
- $reindex = $opt->{reindex} = { # per-epoch ranges for v2
- from => $ibx->version == 1 ? '' : [],
- };
+ # per-epoch ranges for v2
+ # v1:{ from => $OID }, v2:{ from => [ $OID, $OID, $OID ] } }
+ $opt->{reindex} = { from => $ibx->version == 1 ? '' : [] };
PublicInbox::SearchIdx::load_xapian_writable();
}
local %SIG = %SIG;
setup_signals();
$ibx->umask_prepare;
- $ibx->with_umask(\&_run, $ibx, $cb, $opt, $reindex);
+ $ibx->with_umask(\&_run, $ibx, $cb, $opt);
}
sub cpdb_retryable ($$) {
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH 09/23] www: reduce long-lived PublicInbox::Search references
2020-08-20 20:24 [PATCH 00/23] indexing: --skip-docdata + speedups Eric Wong
` (7 preceding siblings ...)
2020-08-20 20:24 ` [PATCH 08/23] xapcmd: simplify {reindex} parameter passing Eric Wong
@ 2020-08-20 20:24 ` Eric Wong
2020-08-20 20:24 ` [PATCH 10/23] search: improve comments around constants Eric Wong
` (13 subsequent siblings)
22 siblings, 0 replies; 25+ messages in thread
From: Eric Wong @ 2020-08-20 20:24 UTC (permalink / raw)
To: meta
While this is unlikely to be a problem in current practice,
keeping Xapian DBs open for long responses can interfere with
free space recovery after -compact.
In the future, it will interfere with inbox search grouping
and lead to unexpected results.
---
lib/PublicInbox/Inbox.pm | 11 ++++++++---
lib/PublicInbox/Mbox.pm | 6 +++---
lib/PublicInbox/SearchView.pm | 4 ++--
3 files changed, 13 insertions(+), 8 deletions(-)
diff --git a/lib/PublicInbox/Inbox.pm b/lib/PublicInbox/Inbox.pm
index 267be4e3..55e546e1 100644
--- a/lib/PublicInbox/Inbox.pm
+++ b/lib/PublicInbox/Inbox.pm
@@ -191,14 +191,19 @@ sub mm {
};
}
-sub search ($;$) {
- my ($self, $over_only) = @_;
+sub search ($;$$) {
+ my ($self, $over_only, $ctx) = @_;
my $srch = $self->{search} ||= eval {
_cleanup_later($self);
require PublicInbox::Search;
PublicInbox::Search->new($self);
};
- ($over_only || eval { $srch->xdb }) ? $srch : undef;
+ ($over_only || eval { $srch->xdb }) ? $srch : do {
+ $ctx and $ctx->{env}->{'psgi.errors'}->print(<<EOF);
+`$self->{name}' search went away unexpectedly
+EOF
+ undef;
+ };
}
sub over ($) {
diff --git a/lib/PublicInbox/Mbox.pm b/lib/PublicInbox/Mbox.pm
index fc83a893..a83c0356 100644
--- a/lib/PublicInbox/Mbox.pm
+++ b/lib/PublicInbox/Mbox.pm
@@ -205,8 +205,8 @@ sub mbox_all_ids {
sub results_cb {
my ($ctx) = @_;
+ my $srch = $ctx->{-inbox}->search(undef, $ctx) or return;
my $mset = $ctx->{mset};
- my $srch = $ctx->{srch};
while (1) {
while (my $mi = (($mset->items)[$ctx->{iter}++])) {
my $smsg = PublicInbox::Smsg::from_mitem($mi,
@@ -227,8 +227,8 @@ sub mbox_all {
return mbox_all_ids($ctx) if $query eq '';
my $qopts = $ctx->{qopts} = { mset => 2 };
- my $srch = $ctx->{srch} = $ctx->{-inbox}->search or
- return PublicInbox::WWW::need($ctx, 'Search');;
+ my $srch = $ctx->{-inbox}->search or
+ return PublicInbox::WWW::need($ctx, 'Search');
my $mset = $ctx->{mset} = $srch->query($query, $qopts);
$qopts->{offset} = $mset->size or
return [404, [qw(Content-Type text/plain)],
diff --git a/lib/PublicInbox/SearchView.pm b/lib/PublicInbox/SearchView.pm
index 84c04c6c..5d77469e 100644
--- a/lib/PublicInbox/SearchView.pm
+++ b/lib/PublicInbox/SearchView.pm
@@ -321,7 +321,6 @@ sub adump {
my ($cb, $mset, $q, $ctx) = @_;
$ctx->{items} = [ $mset->items ];
$ctx->{search_query} = $q; # used by WwwAtomStream::atom_header
- $ctx->{srch} = $ctx->{-inbox}->search;
PublicInbox::WwwAtomStream->response($ctx, 200, \&adump_i);
}
@@ -329,8 +328,9 @@ sub adump {
sub adump_i {
my ($ctx) = @_;
while (my $mi = shift @{$ctx->{items}}) {
+ my $srch = $ctx->{-inbox}->search(undef, $ctx) or return;
my $smsg = eval {
- PublicInbox::Smsg::from_mitem($mi, $ctx->{srch});
+ PublicInbox::Smsg::from_mitem($mi, $srch);
} or next;
return $smsg;
}
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH 10/23] search: improve comments around constants
2020-08-20 20:24 [PATCH 00/23] indexing: --skip-docdata + speedups Eric Wong
` (8 preceding siblings ...)
2020-08-20 20:24 ` [PATCH 09/23] www: reduce long-lived PublicInbox::Search references Eric Wong
@ 2020-08-20 20:24 ` Eric Wong
2020-08-20 20:24 ` [PATCH 11/23] search: export mdocid subroutine Eric Wong
` (12 subsequent siblings)
22 siblings, 0 replies; 25+ messages in thread
From: Eric Wong @ 2020-08-20 20:24 UTC (permalink / raw)
To: meta
We'll probably be adding more value columns like THREADID to sort
on.
---
lib/PublicInbox/Search.pm | 63 +++++++++++++++++++++------------------
1 file changed, 34 insertions(+), 29 deletions(-)
diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index 4d02a7c1..593040a8 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -6,15 +6,47 @@
package PublicInbox::Search;
use strict;
-# values for searching
+# values for searching, changing the numeric value breaks
+# compatibility with old indices (so don't change them it)
use constant {
TS => 0, # Received: header in Unix time (IMAP INTERNALDATE)
YYYYMMDD => 1, # Date: header for searching in the WWW UI
DT => 2, # Date: YYYYMMDDHHMMSS
+
+ # added for public-inbox 1.6.0+
BYTES => 3, # IMAP RFC822.SIZE
UID => 4, # IMAP UID == NNTP article number == Xapian docid
+
# TODO
- # REPLYCNT => 4, # IMAP ANSWERED
+ # THREADID => ?
+ # REPLYCNT => ?, # IMAP ANSWERED
+
+ # SCHEMA_VERSION history
+ # 0 - initial
+ # 1 - subject_path is lower-cased
+ # 2 - subject_path is id_compress in the index, only
+ # 3 - message-ID is compressed if it includes '%' (hack!)
+ # 4 - change "Re: " normalization, avoid circular Reference ghosts
+ # 5 - subject_path drops trailing '.'
+ # 6 - preserve References: order in document data
+ # 7 - remove references and inreplyto terms
+ # 8 - remove redundant/unneeded document data
+ # 9 - disable Message-ID compression (SHA-1)
+ # 10 - optimize doc for NNTP overviews
+ # 11 - merge threads when vivifying ghosts
+ # 12 - change YYYYMMDD value column to numeric
+ # 13 - fix threading for empty References/In-Reply-To
+ # (commit 83425ef12e4b65cdcecd11ddcb38175d4a91d5a0)
+ # 14 - fix ghost root vivification
+ # 15 - see public-inbox-v2-format(5)
+ # further bumps likely unnecessary, we'll suggest in-place
+ # "--reindex" use for further fixes and tweaks:
+ #
+ # public-inbox v1.5.0 adds (still SCHEMA_VERSION=15):
+ # * "lid:" and "l:" for List-Id searches
+ #
+ # v1.6.0 adds BYTES and UID values
+ SCHEMA_VERSION => 15,
};
use PublicInbox::Smsg;
@@ -61,33 +93,6 @@ sub load_xapian () {
# a prefix common in patch emails
our $LANG = 'english';
-use constant {
- # SCHEMA_VERSION history
- # 0 - initial
- # 1 - subject_path is lower-cased
- # 2 - subject_path is id_compress in the index, only
- # 3 - message-ID is compressed if it includes '%' (hack!)
- # 4 - change "Re: " normalization, avoid circular Reference ghosts
- # 5 - subject_path drops trailing '.'
- # 6 - preserve References: order in document data
- # 7 - remove references and inreplyto terms
- # 8 - remove redundant/unneeded document data
- # 9 - disable Message-ID compression (SHA-1)
- # 10 - optimize doc for NNTP overviews
- # 11 - merge threads when vivifying ghosts
- # 12 - change YYYYMMDD value column to numeric
- # 13 - fix threading for empty References/In-Reply-To
- # (commit 83425ef12e4b65cdcecd11ddcb38175d4a91d5a0)
- # 14 - fix ghost root vivification
- # 15 - see public-inbox-v2-format(5)
- # further bumps likely unnecessary, we'll suggest in-place
- # "--reindex" use for further fixes and tweaks
- #
- # public-inbox v1.5.0 adds (still SCHEMA_VERSION=15):
- # * "lid:" and "l:" for List-Id searches
- SCHEMA_VERSION => 15,
-};
-
# note: the non-X term prefix allocations are shared with
# Xapian omega, see xapian-applications/omega/docs/termprefixes.rst
my %bool_pfx_external = (
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH 11/23] search: export mdocid subroutine
2020-08-20 20:24 [PATCH 00/23] indexing: --skip-docdata + speedups Eric Wong
` (9 preceding siblings ...)
2020-08-20 20:24 ` [PATCH 10/23] search: improve comments around constants Eric Wong
@ 2020-08-20 20:24 ` Eric Wong
2020-08-20 20:24 ` [PATCH 12/23] searchquery: split off from searchview Eric Wong
` (11 subsequent siblings)
22 siblings, 0 replies; 25+ messages in thread
From: Eric Wong @ 2020-08-20 20:24 UTC (permalink / raw)
To: meta
No need to have awkward globrefs for this.
---
lib/PublicInbox/IMAP.pm | 3 +--
lib/PublicInbox/Search.pm | 2 ++
2 files changed, 3 insertions(+), 2 deletions(-)
diff --git a/lib/PublicInbox/IMAP.pm b/lib/PublicInbox/IMAP.pm
index 3d66f930..562c59d4 100644
--- a/lib/PublicInbox/IMAP.pm
+++ b/lib/PublicInbox/IMAP.pm
@@ -40,9 +40,8 @@ use PublicInbox::Syscall qw(EPOLLIN EPOLLONESHOT);
use PublicInbox::GitAsyncCat;
use Text::ParseWords qw(parse_line);
use Errno qw(EAGAIN);
-use PublicInbox::Search;
+use PublicInbox::Search qw(mdocid);
use PublicInbox::IMAPsearchqp;
-*mdocid = \&PublicInbox::Search::mdocid;
my $Address;
for my $mod (qw(Email::Address::XS Mail::Address)) {
diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index 593040a8..f98513d3 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -5,6 +5,8 @@
# Read-only search interface for use by the web and NNTP interfaces
package PublicInbox::Search;
use strict;
+use parent qw(Exporter);
+our @EXPORT_OK = qw(mdocid);
# values for searching, changing the numeric value breaks
# compatibility with old indices (so don't change them it)
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH 12/23] searchquery: split off from searchview
2020-08-20 20:24 [PATCH 00/23] indexing: --skip-docdata + speedups Eric Wong
` (10 preceding siblings ...)
2020-08-20 20:24 ` [PATCH 11/23] search: export mdocid subroutine Eric Wong
@ 2020-08-20 20:24 ` Eric Wong
2020-08-20 20:24 ` [PATCH 13/23] search: make qparse_new an internal function Eric Wong
` (10 subsequent siblings)
22 siblings, 0 replies; 25+ messages in thread
From: Eric Wong @ 2020-08-20 20:24 UTC (permalink / raw)
To: meta
Since this was already a separate package, split it off
into its own file since SearchView may not handle inbox
groups.
---
MANIFEST | 1 +
lib/PublicInbox/SearchQuery.pm | 53 ++++++++++++++++++++++++++++++++++
lib/PublicInbox/SearchView.pm | 53 ++--------------------------------
3 files changed, 57 insertions(+), 50 deletions(-)
create mode 100644 lib/PublicInbox/SearchQuery.pm
diff --git a/MANIFEST b/MANIFEST
index 6cb5f6bf..d86d3b15 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -173,6 +173,7 @@ lib/PublicInbox/SaPlugin/ListMirror.pod
lib/PublicInbox/Search.pm
lib/PublicInbox/SearchIdx.pm
lib/PublicInbox/SearchIdxShard.pm
+lib/PublicInbox/SearchQuery.pm
lib/PublicInbox/SearchThread.pm
lib/PublicInbox/SearchView.pm
lib/PublicInbox/Sigfd.pm
diff --git a/lib/PublicInbox/SearchQuery.pm b/lib/PublicInbox/SearchQuery.pm
new file mode 100644
index 00000000..ce1eae12
--- /dev/null
+++ b/lib/PublicInbox/SearchQuery.pm
@@ -0,0 +1,53 @@
+# Copyright (C) 2015-2020 all contributors <meta@public-inbox.org>
+# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
+
+# used by PublicInbox::SearchView
+package PublicInbox::SearchQuery;
+use strict;
+use v5.10.1;
+use URI::Escape qw(uri_escape);
+use PublicInbox::MID qw(MID_ESC);
+our $LIM = 200;
+
+sub new {
+ my ($class, $qp) = @_;
+
+ my $r = $qp->{r};
+ my ($l) = (($qp->{l} || '') =~ /([0-9]+)/);
+ $l = $LIM if !$l || $l > $LIM;
+ bless {
+ q => $qp->{'q'},
+ x => $qp->{x} || '',
+ o => (($qp->{o} || '0') =~ /(-?[0-9]+)/),
+ l => $l,
+ r => (defined $r && $r ne '0'),
+ }, $class;
+}
+
+sub qs_html {
+ my ($self, %override) = @_;
+
+ if (scalar(keys(%override))) {
+ $self = bless { (%$self, %override) }, ref($self);
+ }
+
+ my $q = uri_escape($self->{'q'}, MID_ESC);
+ $q =~ s/%20/+/g; # improve URL readability
+ my $qs = "q=$q";
+
+ if (my $o = $self->{o}) { # ignore o == 0
+ $qs .= "&o=$o";
+ }
+ if (my $l = $self->{l}) {
+ $qs .= "&l=$l" unless $l == $LIM;
+ }
+ if (my $r = $self->{r}) {
+ $qs .= "&r";
+ }
+ if (my $x = $self->{x}) {
+ $qs .= "&x=$x" if ($x eq 't' || $x eq 'A' || $x eq 'm');
+ }
+ $qs;
+}
+
+1;
diff --git a/lib/PublicInbox/SearchView.pm b/lib/PublicInbox/SearchView.pm
index 5d77469e..28d9ce5d 100644
--- a/lib/PublicInbox/SearchView.pm
+++ b/lib/PublicInbox/SearchView.pm
@@ -4,15 +4,15 @@
# Displays search results for the web interface
package PublicInbox::SearchView;
use strict;
-use warnings;
-use URI::Escape qw(uri_unescape uri_escape);
+use v5.10.1;
+use URI::Escape qw(uri_unescape);
use PublicInbox::Smsg;
use PublicInbox::Hval qw(ascii_html obfuscate_addrs mid_href);
use PublicInbox::View;
use PublicInbox::WwwAtomStream;
use PublicInbox::WwwStream qw(html_oneshot);
use PublicInbox::SearchThread;
-our $LIM = 200;
+use PublicInbox::SearchQuery;
my %rmap_inc;
sub mbox_results {
@@ -336,51 +336,4 @@ sub adump_i {
}
}
-package PublicInbox::SearchQuery;
-use strict;
-use warnings;
-use URI::Escape qw(uri_escape);
-use PublicInbox::MID qw(MID_ESC);
-
-sub new {
- my ($class, $qp) = @_;
-
- my $r = $qp->{r};
- my ($l) = (($qp->{l} || '') =~ /([0-9]+)/);
- $l = $LIM if !$l || $l > $LIM;
- bless {
- q => $qp->{'q'},
- x => $qp->{x} || '',
- o => (($qp->{o} || '0') =~ /(-?[0-9]+)/),
- l => $l,
- r => (defined $r && $r ne '0'),
- }, $class;
-}
-
-sub qs_html {
- my ($self, %override) = @_;
-
- if (scalar(keys(%override))) {
- $self = bless { (%$self, %override) }, ref($self);
- }
-
- my $q = uri_escape($self->{'q'}, MID_ESC);
- $q =~ s/%20/+/g; # improve URL readability
- my $qs = "q=$q";
-
- if (my $o = $self->{o}) { # ignore o == 0
- $qs .= "&o=$o";
- }
- if (my $l = $self->{l}) {
- $qs .= "&l=$l" unless $l == $LIM;
- }
- if (my $r = $self->{r}) {
- $qs .= "&r";
- }
- if (my $x = $self->{x}) {
- $qs .= "&x=$x" if ($x eq 't' || $x eq 'A' || $x eq 'm');
- }
- $qs;
-}
-
1;
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH 13/23] search: make qparse_new an internal function
2020-08-20 20:24 [PATCH 00/23] indexing: --skip-docdata + speedups Eric Wong
` (11 preceding siblings ...)
2020-08-20 20:24 ` [PATCH 12/23] searchquery: split off from searchview Eric Wong
@ 2020-08-20 20:24 ` Eric Wong
2020-08-20 20:24 ` [PATCH 14/23] smsg: reduce utf8::decode call sites Eric Wong
` (9 subsequent siblings)
22 siblings, 0 replies; 25+ messages in thread
From: Eric Wong @ 2020-08-20 20:24 UTC (permalink / raw)
To: meta
We'll probably be reusing it from another package in a future commit.
---
lib/PublicInbox/Search.pm | 16 ++++++----------
1 file changed, 6 insertions(+), 10 deletions(-)
diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index f98513d3..e6200bfb 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -265,7 +265,7 @@ sub query {
if ($query_string eq '' && !$opts->{mset}) {
$self->{over_ro}->recent($opts);
} else {
- my $qp = qp($self);
+ my $qp = $self->{qp} //= qparse_new($self);
my $qp_flags = $self->{qp_flags};
my $query = $qp->parse_query($query_string, $qp_flags);
$opts->{relevance} = 1 unless exists $opts->{relevance};
@@ -334,17 +334,14 @@ sub _enquire_once { # retry_reopen callback
sub stemmer { $X{Stem}->new($LANG) }
# read-only
-sub qp {
+sub qparse_new ($) {
my ($self) = @_;
- my $qp = $self->{query_parser};
- return $qp if $qp;
my $xdb = xdb($self);
- # new parser
- $qp = $X{QueryParser}->new;
+ my $qp = $X{QueryParser}->new;
$qp->set_default_op(OP_AND());
$qp->set_database($xdb);
- $qp->set_stemmer($self->stemmer);
+ $qp->set_stemmer(stemmer($self));
$qp->set_stemming_strategy(STEM_SOME());
$qp->set_max_wildcard_expansion(100);
my $nvrp = $X{NumberValueRangeProcessor};
@@ -382,13 +379,12 @@ EOF
while (my ($name, $prefix) = each %prob_prefix) {
$qp->add_prefix($name, $_) foreach split(/ /, $prefix);
}
-
- $self->{query_parser} = $qp;
+ $qp;
}
sub help {
my ($self) = @_;
- $self->qp; # parse altids
+ $self->{qp} //= qparse_new($self); # parse altids
my @ret = @HELP;
if (my $user_pfx = $self->{-user_pfx}) {
push @ret, @$user_pfx;
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH 14/23] smsg: reduce utf8::decode call sites
2020-08-20 20:24 [PATCH 00/23] indexing: --skip-docdata + speedups Eric Wong
` (12 preceding siblings ...)
2020-08-20 20:24 ` [PATCH 13/23] search: make qparse_new an internal function Eric Wong
@ 2020-08-20 20:24 ` Eric Wong
2020-08-20 20:24 ` [PATCH 15/23] searchview: use over.sqlite3 instead of Xapian docdata Eric Wong
` (8 subsequent siblings)
22 siblings, 0 replies; 25+ messages in thread
From: Eric Wong @ 2020-08-20 20:24 UTC (permalink / raw)
To: meta
Both callers of load_from_data call utf8::decode, so just
do utf8::decode in load_from_data.
---
lib/PublicInbox/Over.pm | 1 -
lib/PublicInbox/Smsg.pm | 2 +-
2 files changed, 1 insertion(+), 2 deletions(-)
diff --git a/lib/PublicInbox/Over.pm b/lib/PublicInbox/Over.pm
index 2b314882..81b9fca7 100644
--- a/lib/PublicInbox/Over.pm
+++ b/lib/PublicInbox/Over.pm
@@ -68,7 +68,6 @@ sub load_from_row ($;$) {
bless $smsg, 'PublicInbox::Smsg';
if (defined(my $data = delete $smsg->{ddd})) {
$data = uncompress($data);
- utf8::decode($data);
PublicInbox::Smsg::load_from_data($smsg, $data);
# saves over 600K for 1000+ message threads
diff --git a/lib/PublicInbox/Smsg.pm b/lib/PublicInbox/Smsg.pm
index 62cb951e..f22cd43e 100644
--- a/lib/PublicInbox/Smsg.pm
+++ b/lib/PublicInbox/Smsg.pm
@@ -40,6 +40,7 @@ sub to_doc_data {
sub load_from_data ($$) {
my ($self) = $_[0]; # data = $_[1]
+ utf8::decode($_[1]);
(
$self->{subject},
$self->{from},
@@ -67,7 +68,6 @@ sub load_expand {
my $dt = get_val($doc, PublicInbox::Search::DT());
my ($yyyy, $mon, $dd, $hh, $mm, $ss) = unpack('A4A2A2A2A2A2', $dt);
$self->{ds} = timegm($ss, $mm, $hh, $dd, $mon - 1, $yyyy);
- utf8::decode($data);
load_from_data($self, $data);
$self;
}
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH 15/23] searchview: use over.sqlite3 instead of Xapian docdata
2020-08-20 20:24 [PATCH 00/23] indexing: --skip-docdata + speedups Eric Wong
` (13 preceding siblings ...)
2020-08-20 20:24 ` [PATCH 14/23] smsg: reduce utf8::decode call sites Eric Wong
@ 2020-08-20 20:24 ` Eric Wong
2020-08-20 20:24 ` [PATCH 16/23] searchview: speed up search summary by ~10% Eric Wong
` (7 subsequent siblings)
22 siblings, 0 replies; 25+ messages in thread
From: Eric Wong @ 2020-08-20 20:24 UTC (permalink / raw)
To: meta
This is a step towards improving kernel page cache hit rates by
relying on over.sqlite3 for document data instead of Xapian.
Some micro-optimization to over->get_art was required to
maintain performance.
---
lib/PublicInbox/Over.pm | 18 +++++-------------
lib/PublicInbox/SearchView.pm | 10 +++++++---
2 files changed, 12 insertions(+), 16 deletions(-)
diff --git a/lib/PublicInbox/Over.pm b/lib/PublicInbox/Over.pm
index 81b9fca7..80e57e62 100644
--- a/lib/PublicInbox/Over.pm
+++ b/lib/PublicInbox/Over.pm
@@ -57,6 +57,7 @@ sub new {
sub disconnect {
my ($self) = @_;
if (my $dbh = delete $self->{dbh}) {
+ delete $self->{-get_art};
$self->{filename} = $dbh->sqlite_db_filename;
}
}
@@ -201,8 +202,8 @@ SELECT COUNT(num) FROM over WHERE num > 0
sub get_art {
my ($self, $num) = @_;
- my $dbh = $self->connect;
- my $sth = $dbh->prepare_cached(<<'', undef, 1);
+ # caching $sth ourselves is faster than prepare_cached
+ my $sth = $self->{-get_art} //= $self->connect->prepare(<<'');
SELECT num,ds,ts,ddd FROM over WHERE num = ? LIMIT 1
$sth->execute($num);
@@ -230,13 +231,7 @@ ORDER BY num ASC LIMIT 1
$sth->execute($$id, $$prev);
my $num = $sth->fetchrow_array or return;
$$prev = $num;
-
- $sth = $dbh->prepare_cached(<<"", undef, 1);
-SELECT num,ts,ds,ddd FROM over WHERE num = ? LIMIT 1
-
- $sth->execute($num);
- my $smsg = $sth->fetchrow_hashref or return;
- load_from_row($smsg);
+ get_art($self, $num);
}
# IMAP search, this is limited by callers to UID_SLICE size (50K)
@@ -278,10 +273,7 @@ sub check_inodes {
my $st = pack('dd', $st[0], $st[1]);
# don't actually reopen, just let {dbh} be recreated later
- if ($st ne ($self->{st} // $st)) {
- delete($self->{dbh});
- $self->{filename} = $f;
- }
+ disconnect($self) if $st ne ($self->{st} // $st);
} else {
warn "W: stat $f: $!\n";
}
diff --git a/lib/PublicInbox/SearchView.pm b/lib/PublicInbox/SearchView.pm
index 28d9ce5d..61534c25 100644
--- a/lib/PublicInbox/SearchView.pm
+++ b/lib/PublicInbox/SearchView.pm
@@ -13,6 +13,7 @@ use PublicInbox::WwwAtomStream;
use PublicInbox::WwwStream qw(html_oneshot);
use PublicInbox::SearchThread;
use PublicInbox::SearchQuery;
+use PublicInbox::Search qw(mdocid);
my %rmap_inc;
sub mbox_results {
@@ -90,19 +91,22 @@ sub mset_summary {
my $pfx = ' ' x $pad;
my $res = \($ctx->{-html_tip});
my $ibx = $ctx->{-inbox};
- my $srch = $ibx->search;
+ my $over = $ibx->over;
+ my $nshard = $ibx->search->{nshard} // 1;
my $obfs_ibx = $ibx->{obfuscate} ? $ibx : undef;
foreach my $m ($mset->items) {
my $rank = sprintf("%${pad}d", $m->get_rank + 1);
my $pct = get_pct($m);
- my $smsg = PublicInbox::Smsg::from_mitem($m, $srch);
+ my $num = mdocid($nshard, $m);
+ my $smsg = $over->get_art($num, 1);
unless ($smsg) {
eval {
- $m = "$m ".$m->get_docid . " expired\n";
+ $m = "$m $num expired\n";
$ctx->{env}->{'psgi.errors'}->print($m);
};
next;
}
+ PublicInbox::Smsg::psgi_cull($smsg);
my $s = ascii_html($smsg->{subject});
my $f = ascii_html($smsg->{from_name});
if ($obfs_ibx) {
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH 16/23] searchview: speed up search summary by ~10%
2020-08-20 20:24 [PATCH 00/23] indexing: --skip-docdata + speedups Eric Wong
` (14 preceding siblings ...)
2020-08-20 20:24 ` [PATCH 15/23] searchview: use over.sqlite3 instead of Xapian docdata Eric Wong
@ 2020-08-20 20:24 ` Eric Wong
2020-08-20 20:24 ` [PATCH 17/23] searchview: convert nested and Atom display to over.sqlite3 Eric Wong
` (6 subsequent siblings)
22 siblings, 0 replies; 25+ messages in thread
From: Eric Wong @ 2020-08-20 20:24 UTC (permalink / raw)
To: meta
Instead of loading one article at-a-time from over.sqlite3, we
can use SQL to mass-load IN (?,?, ...) all results with a single
SQLite query. Despite SQLite being in-process and having no
network latency, the reduction in SQL query executions from
loading multiple rows at once speeds things up significantly.
We'll keep the over->get_art optimizations from the previous
commit, since it still speeds up long-lived responses, slightly.
---
lib/PublicInbox/Over.pm | 9 +++++++++
lib/PublicInbox/SearchView.pm | 12 ++++++------
2 files changed, 15 insertions(+), 6 deletions(-)
diff --git a/lib/PublicInbox/Over.pm b/lib/PublicInbox/Over.pm
index 80e57e62..a055b4cd 100644
--- a/lib/PublicInbox/Over.pm
+++ b/lib/PublicInbox/Over.pm
@@ -104,6 +104,15 @@ ORDER BY num ASC
}
+sub get_all {
+ my $self = shift;
+ my $nr = scalar(@_) or return [];
+ my $in = '?' . (',?' x ($nr - 1));
+ do_get($self, <<"", { cull => 1, limit => $nr }, @_);
+SELECT num,ds,ddd FROM over WHERE num IN ($in)
+
+}
+
sub nothing () { wantarray ? (0, []) : [] };
sub get_thread {
diff --git a/lib/PublicInbox/SearchView.pm b/lib/PublicInbox/SearchView.pm
index 61534c25..a3527057 100644
--- a/lib/PublicInbox/SearchView.pm
+++ b/lib/PublicInbox/SearchView.pm
@@ -91,22 +91,22 @@ sub mset_summary {
my $pfx = ' ' x $pad;
my $res = \($ctx->{-html_tip});
my $ibx = $ctx->{-inbox};
- my $over = $ibx->over;
my $nshard = $ibx->search->{nshard} // 1;
my $obfs_ibx = $ibx->{obfuscate} ? $ibx : undef;
+ my @nums = map { mdocid($nshard, $_) } $mset->items;
+ my %num2msg = map { $_->{num} => $_ } @{$ibx->over->get_all(@nums)};
+
foreach my $m ($mset->items) {
my $rank = sprintf("%${pad}d", $m->get_rank + 1);
my $pct = get_pct($m);
- my $num = mdocid($nshard, $m);
- my $smsg = $over->get_art($num, 1);
- unless ($smsg) {
+ my $num = shift @nums;
+ my $smsg = delete($num2msg{$num}) or do {
eval {
$m = "$m $num expired\n";
$ctx->{env}->{'psgi.errors'}->print($m);
};
next;
- }
- PublicInbox::Smsg::psgi_cull($smsg);
+ };
my $s = ascii_html($smsg->{subject});
my $f = ascii_html($smsg->{from_name});
if ($obfs_ibx) {
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH 17/23] searchview: convert nested and Atom display to over.sqlite3
2020-08-20 20:24 [PATCH 00/23] indexing: --skip-docdata + speedups Eric Wong
` (15 preceding siblings ...)
2020-08-20 20:24 ` [PATCH 16/23] searchview: speed up search summary by ~10% Eric Wong
@ 2020-08-20 20:24 ` Eric Wong
2020-08-20 20:24 ` [PATCH 18/23] extmsg: avoid using Xapian docdata Eric Wong
` (5 subsequent siblings)
22 siblings, 0 replies; 25+ messages in thread
From: Eric Wong @ 2020-08-20 20:24 UTC (permalink / raw)
To: meta
git blob retrieval dominates on these, "&x=t" (nested) is
roughly the same due to increased overhead for ->get_percent
storage balancing out the mass-loading from SQLite.
Atom "&x=A" is sped up slightly and uses less memory in the
long-lived response.
---
lib/PublicInbox/SearchView.pm | 25 ++++++++-----------------
1 file changed, 8 insertions(+), 17 deletions(-)
diff --git a/lib/PublicInbox/SearchView.pm b/lib/PublicInbox/SearchView.pm
index a3527057..ef1b9767 100644
--- a/lib/PublicInbox/SearchView.pm
+++ b/lib/PublicInbox/SearchView.pm
@@ -255,20 +255,13 @@ sub get_pct ($) {
$n > 99 ? 99 : $n;
}
-sub load_msgs {
- my ($mset) = @_;
- [ map {
- my $mi = $_;
- my $smsg = PublicInbox::Smsg::from_mitem($mi);
- $smsg->{pct} = get_pct($mi);
- $smsg;
- } ($mset->items) ]
-}
-
sub mset_thread {
my ($ctx, $mset, $q) = @_;
my $ibx = $ctx->{-inbox};
- my $msgs = $ibx->search->retry_reopen(\&load_msgs, $mset);
+ my $nshard = $ibx->search->{nshard} // 1;
+ my %pct = map { mdocid($nshard, $_) => get_pct($_) } $mset->items;
+ my $msgs = $ibx->over->get_all(keys %pct);
+ $_->{pct} = $pct{$_->{num}} for @$msgs;
my $r = $q->{r};
my $rootset = PublicInbox::SearchThread::thread($msgs,
$r ? \&sort_relevance : \&PublicInbox::View::sort_ds,
@@ -323,7 +316,8 @@ sub ctx_prepare {
sub adump {
my ($cb, $mset, $q, $ctx) = @_;
- $ctx->{items} = [ $mset->items ];
+ my $nshard = $ctx->{-inbox}->search->{nshard} // 1;
+ $ctx->{ids} = [ map { mdocid($nshard, $_) } $mset->items ];
$ctx->{search_query} = $q; # used by WwwAtomStream::atom_header
PublicInbox::WwwAtomStream->response($ctx, 200, \&adump_i);
}
@@ -331,11 +325,8 @@ sub adump {
# callback for PublicInbox::WwwAtomStream::getline
sub adump_i {
my ($ctx) = @_;
- while (my $mi = shift @{$ctx->{items}}) {
- my $srch = $ctx->{-inbox}->search(undef, $ctx) or return;
- my $smsg = eval {
- PublicInbox::Smsg::from_mitem($mi, $srch);
- } or next;
+ while (my $num = shift @{$ctx->{ids}}) {
+ my $smsg = eval { $ctx->{-inbox}->over->get_art($num) } or next;
return $smsg;
}
}
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH 18/23] extmsg: avoid using Xapian docdata
2020-08-20 20:24 [PATCH 00/23] indexing: --skip-docdata + speedups Eric Wong
` (16 preceding siblings ...)
2020-08-20 20:24 ` [PATCH 17/23] searchview: convert nested and Atom display to over.sqlite3 Eric Wong
@ 2020-08-20 20:24 ` Eric Wong
2020-08-20 20:24 ` [PATCH 19/23] mbox: avoid Xapian docdata in search results Eric Wong
` (4 subsequent siblings)
22 siblings, 0 replies; 25+ messages in thread
From: Eric Wong @ 2020-08-20 20:24 UTC (permalink / raw)
To: meta
Once again, over.sqlite3 contains everything necessary for
Message-ID resolution. Also, Xapian may be completely
unnecessary with the advent of over.sqlite3, but that's for
another time.
---
lib/PublicInbox/ExtMsg.pm | 21 ++++++++++-----------
1 file changed, 10 insertions(+), 11 deletions(-)
diff --git a/lib/PublicInbox/ExtMsg.pm b/lib/PublicInbox/ExtMsg.pm
index d7917b34..5c23dc31 100644
--- a/lib/PublicInbox/ExtMsg.pm
+++ b/lib/PublicInbox/ExtMsg.pm
@@ -11,6 +11,7 @@ use warnings;
use PublicInbox::Hval qw(ascii_html prurl mid_href);
use PublicInbox::WwwStream qw(html_oneshot);
use PublicInbox::Smsg;
+use PublicInbox::Search qw(mdocid);
our $MIN_PARTIAL_LEN = 16;
# TODO: user-configurable
@@ -29,13 +30,10 @@ our @EXT_URL = map { ascii_html($_) } (
sub PARTIAL_MAX () { 100 }
-sub mids_from_mset { # Search::retry_reopen callback
- [ map { PublicInbox::Smsg::from_mitem($_)->{mid} } $_[0]->items ];
-}
-
sub search_partial ($$) {
- my ($srch, $mid) = @_;
+ my ($ibx, $mid) = @_;
return if length($mid) < $MIN_PARTIAL_LEN;
+ my $srch = $ibx->search or return;
my $opt = { limit => PARTIAL_MAX, mset => 2 };
my @try = ("m:$mid*");
my $chop = $mid;
@@ -63,14 +61,17 @@ sub search_partial ($$) {
}
}
+ my $n = $srch->{nshard} // 1;
foreach my $m (@try) {
# If Xapian can't handle the wildcard since it
# has too many results. $@ can be
# Search::Xapian::QueryParserError or even:
# "something terrible happened at ../Search/Xapian/Enquire.pm"
my $mset = eval { $srch->query($m, $opt) } or next;
- my $mids = $srch->retry_reopen(\&mids_from_mset, $mset);
- return $mids if scalar(@$mids);
+ my @mids = map {
+ $_->{mid}
+ } @{$ibx->over->get_all(map { mdocid($n, $_) } $mset->items)};
+ return \@mids if scalar(@mids);
}
}
@@ -111,8 +112,7 @@ sub ext_msg {
# fall back to partial MID matching
my @partial;
my $n_partial = 0;
- my $srch = $cur->search;
- my $mids = search_partial($srch, $mid) if $srch;
+ my $mids = search_partial($cur, $mid);
if ($mids) {
$n_partial = scalar(@$mids);
push @partial, [ $cur, $mids ];
@@ -121,8 +121,7 @@ sub ext_msg {
# can't find a partial match in current inbox, try the others:
if (!$n_partial && length($mid) >= $MIN_PARTIAL_LEN) {
foreach my $ibx (@$ibxs) {
- $srch = $ibx->search or next;
- $mids = search_partial($srch, $mid) or next;
+ $mids = search_partial($ibx, $mid) or next;
$n_partial += scalar(@$mids);
push @partial, [ $ibx, $mids];
last if $n_partial >= PARTIAL_MAX;
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH 19/23] mbox: avoid Xapian docdata in search results
2020-08-20 20:24 [PATCH 00/23] indexing: --skip-docdata + speedups Eric Wong
` (17 preceding siblings ...)
2020-08-20 20:24 ` [PATCH 18/23] extmsg: avoid using Xapian docdata Eric Wong
@ 2020-08-20 20:24 ` Eric Wong
2020-08-20 20:24 ` [PATCH 20/23] smsg: remove from_mitem Eric Wong
` (3 subsequent siblings)
22 siblings, 0 replies; 25+ messages in thread
From: Eric Wong @ 2020-08-20 20:24 UTC (permalink / raw)
To: meta
Another place where we can reduce kernel page cache overhead
by hitting over.sqlite3 instead of docdata.glass.
---
lib/PublicInbox/Mbox.pm | 23 ++++++++++++-----------
1 file changed, 12 insertions(+), 11 deletions(-)
diff --git a/lib/PublicInbox/Mbox.pm b/lib/PublicInbox/Mbox.pm
index a83c0356..0fa9a38d 100644
--- a/lib/PublicInbox/Mbox.pm
+++ b/lib/PublicInbox/Mbox.pm
@@ -10,6 +10,7 @@ use PublicInbox::MID qw/mid_escape/;
use PublicInbox::Hval qw/to_filename/;
use PublicInbox::Smsg;
use PublicInbox::Eml;
+use PublicInbox::Search qw(mdocid);
# called by PSGI server as body response
# this gets called twice for every message, once to return the header,
@@ -205,20 +206,19 @@ sub mbox_all_ids {
sub results_cb {
my ($ctx) = @_;
- my $srch = $ctx->{-inbox}->search(undef, $ctx) or return;
- my $mset = $ctx->{mset};
+ my $over = $ctx->{-inbox}->over or return;
while (1) {
- while (my $mi = (($mset->items)[$ctx->{iter}++])) {
- my $smsg = PublicInbox::Smsg::from_mitem($mi,
- $srch) or next;
+ while (defined(my $num = shift(@{$ctx->{ids}}))) {
+ my $smsg = $over->get_art($num) or next;
return $smsg;
}
# refill result set
- $mset = $ctx->{mset} = $srch->query($ctx->{query},
- $ctx->{qopts});
+ my $srch = $ctx->{-inbox}->search(undef, $ctx) or return;
+ my $mset = $srch->query($ctx->{query}, $ctx->{qopts});
my $size = $mset->size or return;
$ctx->{qopts}->{offset} += $size;
- $ctx->{iter} = 0;
+ my $nshard = $srch->{nshard} // 1;
+ $ctx->{ids} = [ map { mdocid($nshard, $_) } $mset->items ];
}
}
@@ -226,15 +226,16 @@ sub mbox_all {
my ($ctx, $query) = @_;
return mbox_all_ids($ctx) if $query eq '';
- my $qopts = $ctx->{qopts} = { mset => 2 };
+ my $qopts = $ctx->{qopts} = { mset => 2 }; # order by docid
my $srch = $ctx->{-inbox}->search or
return PublicInbox::WWW::need($ctx, 'Search');
- my $mset = $ctx->{mset} = $srch->query($query, $qopts);
+ my $mset = $srch->query($query, $qopts);
$qopts->{offset} = $mset->size or
return [404, [qw(Content-Type text/plain)],
["No results found\n"]];
- $ctx->{iter} = 0;
$ctx->{query} = $query;
+ my $nshard = $srch->{nshard} // 1;
+ $ctx->{ids} = [ map { mdocid($nshard, $_) } $mset->items ];
require PublicInbox::MboxGz;
PublicInbox::MboxGz::mbox_gz($ctx, \&results_cb, 'results-'.$query);
}
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH 20/23] smsg: remove from_mitem
2020-08-20 20:24 [PATCH 00/23] indexing: --skip-docdata + speedups Eric Wong
` (18 preceding siblings ...)
2020-08-20 20:24 ` [PATCH 19/23] mbox: avoid Xapian docdata in search results Eric Wong
@ 2020-08-20 20:24 ` Eric Wong
2020-08-20 20:24 ` [PATCH 21/23] t/nntpd-v2: set PI_TEST_VERSION=2 properly Eric Wong
` (2 subsequent siblings)
22 siblings, 0 replies; 25+ messages in thread
From: Eric Wong @ 2020-08-20 20:24 UTC (permalink / raw)
To: meta
We no longer read docdata.glass from anywhere in our code base.
Some adjustments were needed to t/search.t to deal with the
Xapian::WritableDatabase committing at different times, since
our ->query is avoided from PublicInbox::SearchIdx to avoid
needing a {over_ro} field.
---
lib/PublicInbox/Search.pm | 11 +++++++----
lib/PublicInbox/Smsg.pm | 8 --------
t/search.t | 34 +++++++++++++++++++---------------
3 files changed, 26 insertions(+), 27 deletions(-)
diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index e6200bfb..1c648299 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -324,10 +324,13 @@ sub _enquire_once { # retry_reopen callback
my $limit = $opts->{limit} || 50;
my $mset = $enquire->get_mset($offset, $limit);
return $mset if $opts->{mset};
- my @msgs = map { PublicInbox::Smsg::from_mitem($_) } $mset->items;
- return \@msgs unless wantarray;
-
- ($mset->get_matches_estimated, \@msgs)
+ my $nshard = $self->{nshard} // 1;
+ my $i = 0;
+ my %order = map { mdocid($nshard, $_) => ++$i } $mset->items;
+ my @msgs = sort {
+ $order{$a->{num}} <=> $order{$b->{num}}
+ } @{$self->{over_ro}->get_all(keys %order)};
+ wantarray ? ($mset->get_matches_estimated, \@msgs) : \@msgs;
}
# read-write
diff --git a/lib/PublicInbox/Smsg.pm b/lib/PublicInbox/Smsg.pm
index f22cd43e..51226b8e 100644
--- a/lib/PublicInbox/Smsg.pm
+++ b/lib/PublicInbox/Smsg.pm
@@ -86,14 +86,6 @@ sub psgi_cull ($) {
$self;
}
-# Only called by PSGI interface, not NNTP
-sub from_mitem {
- my ($mitem, $srch) = @_;
- return $srch->retry_reopen(\&from_mitem, $mitem) if $srch;
- my $self = bless {}, __PACKAGE__;
- psgi_cull(load_expand($self, $mitem->get_document));
-}
-
# for Import and v1 non-SQLite WWW code paths
sub populate {
my ($self, $hdr, $sync) = @_;
diff --git a/t/search.t b/t/search.t
index 299f57c8..e2290ecd 100644
--- a/t/search.t
+++ b/t/search.t
@@ -263,13 +263,14 @@ To: list@example.com
theatre
fade
EOF
- my $res = $rw->query("theatre");
+ $rw_commit->();
+ my $res = $ro->reopen->query("theatre");
is(scalar(@$res), 2, "got both matches");
if (@$res == 2) {
is($res->[0]->{mid}, 'nquote@a', 'non-quoted scores higher');
is($res->[1]->{mid}, 'quote@a', 'quoted result still returned');
}
- $res = $rw->query("illusions");
+ $res = $ro->query("illusions");
is(scalar(@$res), 1, "got a match for quoted text");
is($res->[0]->{mid}, 'quote@a',
"quoted result returned if nothing else") if scalar(@$res);
@@ -290,7 +291,8 @@ To: list\@example.com
LOOP!
EOF
ok($doc_id > 0, "doc_id defined with circular reference");
- my $smsg = $rw->query('m:circle@a', {limit=>1})->[0];
+ $rw_commit->();
+ my $smsg = $ro->reopen->query('m:circle@a', {limit=>1})->[0];
is(defined($smsg), 1, 'found m:circl@a');
if (defined $smsg) {
is($smsg->{references}, '', "no references created");
@@ -298,11 +300,22 @@ EOF
}
});
+{
+ my $msgs = $ro->query('d:19931002..20101002');
+ ok(scalar(@$msgs) > 0, 'got results within range');
+ $msgs = $ro->query('d:20101003..');
+ is(scalar(@$msgs), 0, 'nothing after 20101003');
+ $msgs = $ro->query('d:..19931001');
+ is(scalar(@$msgs), 0, 'nothing before 19931001');
+}
+
$ibx->with_umask(sub {
my $mime = eml_load 't/utf8.eml';
my $doc_id = $rw->add_message($mime);
ok($doc_id > 0, 'message indexed doc_id with UTF-8');
- my $msg = $rw->query('m:testmessage@example.com', {limit => 1})->[0];
+ $rw_commit->();
+ my $msg = $ro->reopen->
+ query('m:testmessage@example.com', {limit => 1})->[0];
is(defined($msg), 1, 'found testmessage@example.com');
if (defined $msg) {
is($mime->header('Subject'), $msg->{subject},
@@ -310,19 +323,10 @@ $ibx->with_umask(sub {
}
});
-{
- my $msgs = $ro->query('d:19931002..20101002');
- ok(scalar(@$msgs) > 0, 'got results within range');
- $msgs = $ro->query('d:20101003..');
- is(scalar(@$msgs), 0, 'nothing after 20101003');
- $msgs = $ro->query('d:..19931001');
- is(scalar(@$msgs), 0, 'nothing before 19931001');
-}
-
# names and addresses
{
my $mset = $ro->query('t:list@example.com', {mset => 1});
- is($mset->size, 6, 'searched To: successfully');
+ is($mset->size, 9, 'searched To: successfully');
foreach my $m ($mset->items) {
my $smsg = $ro->{over_ro}->get_art($m->get_docid);
like($smsg->{to}, qr/\blist\@example\.com\b/, 'to appears');
@@ -340,7 +344,7 @@ $ibx->with_umask(sub {
}
$mset = $ro->query('tc:list@example.com', {mset => 1});
- is($mset->size, 6, 'searched To+Cc: successfully');
+ is($mset->size, 9, 'searched To+Cc: successfully');
foreach my $m ($mset->items) {
my $smsg = $ro->{over_ro}->get_art($m->get_docid);
my $tocc = join("\n", $smsg->{to}, $smsg->{cc});
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH 21/23] t/nntpd-v2: set PI_TEST_VERSION=2 properly
2020-08-20 20:24 [PATCH 00/23] indexing: --skip-docdata + speedups Eric Wong
` (19 preceding siblings ...)
2020-08-20 20:24 ` [PATCH 20/23] smsg: remove from_mitem Eric Wong
@ 2020-08-20 20:24 ` Eric Wong
2020-08-20 20:24 ` [PATCH 22/23] init+index: support --skip-docdata for Xapian Eric Wong
2020-08-20 20:24 ` [PATCH 23/23] search: add mset_to_artnums method Eric Wong
22 siblings, 0 replies; 25+ messages in thread
From: Eric Wong @ 2020-08-20 20:24 UTC (permalink / raw)
To: meta
Numbers are hard :<
---
t/nntpd-v2.t | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/t/nntpd-v2.t b/t/nntpd-v2.t
index 7fc3447e..1dd992a0 100644
--- a/t/nntpd-v2.t
+++ b/t/nntpd-v2.t
@@ -1,4 +1,4 @@
# Copyright (C) 2019-2020 all contributors <meta@public-inbox.org>
# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
-local $ENV{PI_TEST_VERSION} = 1;
+local $ENV{PI_TEST_VERSION} = 2;
require './t/nntpd.t';
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH 22/23] init+index: support --skip-docdata for Xapian
2020-08-20 20:24 [PATCH 00/23] indexing: --skip-docdata + speedups Eric Wong
` (20 preceding siblings ...)
2020-08-20 20:24 ` [PATCH 21/23] t/nntpd-v2: set PI_TEST_VERSION=2 properly Eric Wong
@ 2020-08-20 20:24 ` Eric Wong
2020-08-20 20:24 ` [PATCH 23/23] search: add mset_to_artnums method Eric Wong
22 siblings, 0 replies; 25+ messages in thread
From: Eric Wong @ 2020-08-20 20:24 UTC (permalink / raw)
To: meta
Since we no longer read document data from Xapian, allow users
to opt-out of storing it.
This breaks compatibility with previous releases of
public-inbox, but gives us a ~1.5% space savings on Xapian
storage (and associated I/O and page cache pressure reduction).
---
Documentation/public-inbox-index.pod | 8 +++++++
Documentation/public-inbox-init.pod | 10 ++++++++
lib/PublicInbox/Admin.pm | 12 ++++++----
lib/PublicInbox/SearchIdx.pm | 35 +++++++++++++++++++++-------
lib/PublicInbox/SearchIdxShard.pm | 2 +-
script/public-inbox-convert | 3 ++-
script/public-inbox-index | 7 ++++--
script/public-inbox-init | 8 +++++++
t/inbox_idle.t | 2 +-
t/index-git-times.t | 11 ++++++++-
t/init.t | 13 +++++++++++
11 files changed, 91 insertions(+), 20 deletions(-)
diff --git a/Documentation/public-inbox-index.pod b/Documentation/public-inbox-index.pod
index 1ed9f5e7..46a53825 100644
--- a/Documentation/public-inbox-index.pod
+++ b/Documentation/public-inbox-index.pod
@@ -145,6 +145,14 @@ below.
Available in public-inbox 1.6.0 (PENDING).
+=item --skip-docdata
+
+Stop storing document data in Xapian on an existing inbox.
+
+See L<public-inbox-init(1)/--skip-docdata> for description and caveats.
+
+Available in public-inbox 1.6.0 (PENDING).
+
=back
=head1 FILES
diff --git a/Documentation/public-inbox-init.pod b/Documentation/public-inbox-init.pod
index 4cc7e29f..3f98807a 100644
--- a/Documentation/public-inbox-init.pod
+++ b/Documentation/public-inbox-init.pod
@@ -95,6 +95,16 @@ default due to contention in the top-level producer process.
Default: the number of online CPUs, up to 4
+=item --skip-docdata
+
+Do not store document data in Xapian, reducing Xapian storage
+overhead by around 1.5%.
+
+Warning: this option prevents rollbacks to public-inbox 1.5.0
+and earlier.
+
+Available since public-inbox 1.6.0 (PENDING).
+
=back
=head1 ENVIRONMENT
diff --git a/lib/PublicInbox/Admin.pm b/lib/PublicInbox/Admin.pm
index f5427af7..b8ead6f7 100644
--- a/lib/PublicInbox/Admin.pm
+++ b/lib/PublicInbox/Admin.pm
@@ -48,13 +48,14 @@ sub resolve_repo_dir {
sub detect_indexlevel ($) {
my ($ibx) = @_;
- # brand new or never before indexed inboxes default to full
- return 'full' unless $ibx->over;
- delete $ibx->{over}; # don't leave open FD lying around
+ my $over = $ibx->over;
+ my $srch = $ibx->search;
+ delete @$ibx{qw(over search)}; # don't leave open FDs lying around
+ # brand new or never before indexed inboxes default to full
+ return 'full' unless $over;
my $l = 'basic';
- my $srch = $ibx->search or return $l;
- delete $ibx->{search}; # don't leave open FD lying around
+ return $l unless $srch;
if (my $xdb = $srch->xdb) {
$l = 'full';
my $m = $xdb->get_metadata('indexlevel');
@@ -65,6 +66,7 @@ sub detect_indexlevel ($) {
$ibx->{inboxdir} has unexpected indexlevel in Xapian: $m
}
+ $ibx->{-skip_docdata} = 1 if $xdb->get_metadata('skip_docdata');
}
$l;
}
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 5c39f3d6..be46b2b9 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -61,6 +61,10 @@ sub new {
}, $class;
$self->xpfx_init;
$self->{-set_indexlevel_once} = 1 if $indexlevel eq 'medium';
+ if ($ibx->{-skip_docdata}) {
+ $self->{-set_skip_docdata_once} = 1;
+ $self->{-skip_docdata} = 1;
+ }
$ibx->umask_prepare;
if ($version == 1) {
$self->{lock_path} = "$inboxdir/ssoma.lock";
@@ -359,10 +363,18 @@ sub add_xapian ($$$$) {
msg_iter($eml, \&index_xapian, [ $self, $doc ]);
index_ids($self, $doc, $eml, $mids);
- $smsg->{to} = $smsg->{cc} = ''; # WWW doesn't need these, only NNTP
- PublicInbox::OverIdx::parse_references($smsg, $eml, $mids);
- my $data = $smsg->to_doc_data;
- $doc->set_data($data);
+
+ # by default, we maintain compatibility with v1.5.0 and earlier
+ # by writing to docdata.glass, users who never exect to downgrade can
+ # use --skip-docdata
+ if (!$self->{-skip_docdata}) {
+ # WWW doesn't need {to} or {cc}, only NNTP
+ $smsg->{to} = $smsg->{cc} = '';
+ PublicInbox::OverIdx::parse_references($smsg, $eml, $mids);
+ my $data = $smsg->to_doc_data;
+ $doc->set_data($data);
+ }
+
if (my $altid = $self->{-altid}) {
foreach my $alt (@$altid) {
my $pfx = $alt->{xprefix};
@@ -831,23 +843,28 @@ sub begin_txn_lazy {
# store 'indexlevel=medium' in v2 shard=0 and v1 (only one shard)
# This metadata is read by Admin::detect_indexlevel:
-sub set_indexlevel {
+sub set_metadata_once {
my ($self) = @_;
- if (!$self->{shard} && # undef or 0, not >0
- delete($self->{-set_indexlevel_once})) {
- my $xdb = $self->{xdb};
+ return if $self->{shard}; # only continue if undef or 0, not >0
+ my $xdb = $self->{xdb};
+
+ if (delete($self->{-set_indexlevel_once})) {
my $level = $xdb->get_metadata('indexlevel');
if (!$level || $level ne 'medium') {
$xdb->set_metadata('indexlevel', 'medium');
}
}
+ if (delete($self->{-set_skip_docdata_once})) {
+ $xdb->get_metadata('skip_docdata') or
+ $xdb->set_metadata('skip_docdata', '1');
+ }
}
sub _commit_txn {
my ($self) = @_;
if (my $xdb = $self->{xdb}) {
- set_indexlevel($self);
+ set_metadata_once($self);
$xdb->commit_transaction;
}
$self->{over}->commit_lazy if $self->{over};
diff --git a/lib/PublicInbox/SearchIdxShard.pm b/lib/PublicInbox/SearchIdxShard.pm
index 59b36087..20077e08 100644
--- a/lib/PublicInbox/SearchIdxShard.pm
+++ b/lib/PublicInbox/SearchIdxShard.pm
@@ -16,7 +16,7 @@ sub new {
my $self = $class->SUPER::new($ibx, 1, $shard);
# create the DB before forking:
$self->idx_acquire;
- $self->set_indexlevel;
+ $self->set_metadata_once;
$self->idx_release;
$self->spawn_worker($v2w, $shard) if $v2w->{parallel};
$self;
diff --git a/script/public-inbox-convert b/script/public-inbox-convert
index d655dcc6..4ff198d1 100755
--- a/script/public-inbox-convert
+++ b/script/public-inbox-convert
@@ -77,7 +77,8 @@ if ($old) {
die "Only conversion from v1 inboxes is supported\n" if $old->version >= 2;
require PublicInbox::Admin;
-$old->{indexlevel} //= PublicInbox::Admin::detect_indexlevel($old);
+my $detected = PublicInbox::Admin::detect_indexlevel($old);
+$old->{indexlevel} //= $detected;
my $env;
if ($opt->{'index'}) {
my $mods = {};
diff --git a/script/public-inbox-index b/script/public-inbox-index
index 30d24838..9855c67d 100755
--- a/script/public-inbox-index
+++ b/script/public-inbox-index
@@ -39,7 +39,7 @@ GetOptions($opt, qw(verbose|v+ reindex rethread compact|c+ jobs|j=i prune
indexlevel|index-level|L=s max_size|max-size=s
batch_size|batch-size=s
sequential_shard|seq-shard|sequential-shard
- all help|?))
+ skip-docdata all help|?))
or die "bad command-line args\n$usage";
if ($opt->{help}) { print $help; exit 0 };
die "--jobs must be >= 0\n" if defined $opt->{jobs} && $opt->{jobs} < 0;
@@ -58,9 +58,11 @@ unless (@ibxs) { print STDERR "Usage: $usage\n"; exit 1 }
my $mods = {};
foreach my $ibx (@ibxs) {
+ # detect_indexlevel may also set $ibx->{-skip_docdata}
+ my $detected = PublicInbox::Admin::detect_indexlevel($ibx);
# XXX: users can shoot themselves in the foot, with opt->{indexlevel}
$ibx->{indexlevel} //= $opt->{indexlevel} // ($opt->{xapian_only} ?
- 'full' : PublicInbox::Admin::detect_indexlevel($ibx));
+ 'full' : $detected);
PublicInbox::Admin::scan_ibx_modules($mods, $ibx);
}
@@ -75,6 +77,7 @@ for my $ibx (@ibxs) {
PublicInbox::Xapcmd::run($ibx, 'compact', $opt->{compact_opt});
}
$ibx->{-no_fsync} = 1 if !$opt->{fsync};
+ $ibx->{-skip_docdata} //= $opt->{'skip-docdata'};
my $ibx_opt = $opt;
if (defined(my $s = $ibx->{lc('indexSequentialShard')})) {
diff --git a/script/public-inbox-init b/script/public-inbox-init
index b19c2321..037e8e56 100755
--- a/script/public-inbox-init
+++ b/script/public-inbox-init
@@ -34,6 +34,7 @@ require PublicInbox::Admin;
PublicInbox::Admin::require_or_die('-base');
my ($version, $indexlevel, $skip_epoch, $skip_artnum, $jobs, $show_help);
+my $skip_docdata;
my $ng = '';
my %opts = (
'V|version=i' => \$version,
@@ -42,6 +43,7 @@ my %opts = (
'skip-artnum=i' => \$skip_artnum,
'j|jobs=i' => \$jobs,
'ng|newsgroup=s' => \$ng,
+ 'skip-docdata' => \$skip_docdata,
'help|?' => \$show_help,
);
my $usage_cb = sub {
@@ -177,6 +179,12 @@ if (defined $jobs) {
require PublicInbox::InboxWritable;
$ibx = PublicInbox::InboxWritable->new($ibx, $creat_opt);
+if ($skip_docdata) {
+ $ibx->{indexlevel} //= 'full'; # ensure init_inbox writes xdb
+ $ibx->{indexlevel} eq 'basic' and
+ die "--skip-docdata ignored with --indexlevel=basic\n";
+ $ibx->{-skip_docdata} = $skip_docdata;
+}
$ibx->init_inbox(0, $skip_epoch, $skip_artnum);
# needed for git prior to v2.1.0
diff --git a/t/inbox_idle.t b/t/inbox_idle.t
index 61287200..e16ee11b 100644
--- a/t/inbox_idle.t
+++ b/t/inbox_idle.t
@@ -29,7 +29,7 @@ for my $V (1, 2) {
if ($V == 1) {
my $sidx = PublicInbox::SearchIdx->new($ibx, 1);
$sidx->idx_acquire;
- $sidx->set_indexlevel;
+ $sidx->set_metadata_once;
$sidx->idx_release; # allow watching on lockfile
}
my $pi_config = PublicInbox::Config->new(\<<EOF);
diff --git a/t/index-git-times.t b/t/index-git-times.t
index 2e9e88e8..8f80c866 100644
--- a/t/index-git-times.t
+++ b/t/index-git-times.t
@@ -4,6 +4,7 @@ use Test::More;
use PublicInbox::TestCommon;
use PublicInbox::Import;
use PublicInbox::Config;
+use PublicInbox::Admin;
use File::Path qw(remove_tree);
require_mods(qw(DBD::SQLite Search::Xapian));
@@ -47,11 +48,15 @@ EOF
PublicInbox::Import::run_die($cmd, undef, { 0 => $r });
}
-run_script(['-index', $v1dir]) or die 'v1 index failed';
+run_script(['-index', '--skip-docdata', $v1dir]) or die 'v1 index failed';
+
my $smsg;
{
my $cfg = PublicInbox::Config->new;
my $ibx = $cfg->lookup($addr);
+ my $lvl = PublicInbox::Admin::detect_indexlevel($ibx);
+ is($lvl, 'medium', 'indexlevel detected');
+ is($ibx->{-skip_docdata}, 1, '--skip-docdata flag set on -index');
$smsg = $ibx->over->get_art(1);
is($smsg->{ds}, 749520000, 'datestamp from git author time');
is($smsg->{ts}, 1285977600, 'timestamp from git committer time');
@@ -70,6 +75,10 @@ SKIP: {
my $check_v2 = sub {
my $ibx = PublicInbox::Inbox->new({inboxdir => $v2dir,
address => $addr});
+ my $lvl = PublicInbox::Admin::detect_indexlevel($ibx);
+ is($lvl, 'medium', 'indexlevel detected after convert');
+ is($ibx->{-skip_docdata}, 1,
+ '--skip-docdata preserved after convert');
my $v2smsg = $ibx->over->get_art(1);
is($v2smsg->{ds}, $smsg->{ds},
'v2 datestamp from git author time');
diff --git a/t/init.t b/t/init.t
index 4d2c5049..dad09435 100644
--- a/t/init.t
+++ b/t/init.t
@@ -95,6 +95,19 @@ SKIP: {
my $ibx = PublicInbox::Inbox->new({ inboxdir => $dir });
is(PublicInbox::Admin::detect_indexlevel($ibx), $lvl,
'detected expected level w/o config');
+ ok(!$ibx->{-skip_docdata}, 'docdata written by default');
+ }
+ for my $v (1, 2) {
+ my $name = "v$v-skip-docdata";
+ my $dir = "$tmpdir/$name";
+ $cmd = [ '-init', $name, "-V$v", '--skip-docdata',
+ $dir, "http://example.com/$name",
+ "$name\@example.com" ];
+ ok(run_script($cmd), "-init -V$v --skip-docdata");
+ my $ibx = PublicInbox::Inbox->new({ inboxdir => $dir });
+ is(PublicInbox::Admin::detect_indexlevel($ibx), 'full',
+ "detected default indexlevel -V$v");
+ ok($ibx->{-skip_docdata}, "docdata skip set -V$v");
}
# loop for idempotency
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH 23/23] search: add mset_to_artnums method
2020-08-20 20:24 [PATCH 00/23] indexing: --skip-docdata + speedups Eric Wong
` (21 preceding siblings ...)
2020-08-20 20:24 ` [PATCH 22/23] init+index: support --skip-docdata for Xapian Eric Wong
@ 2020-08-20 20:24 ` Eric Wong
22 siblings, 0 replies; 25+ messages in thread
From: Eric Wong @ 2020-08-20 20:24 UTC (permalink / raw)
To: meta
We can avoid importing mdocid() in several places by using
this method, simplifying callers.
---
lib/PublicInbox/ExtMsg.pm | 4 +---
lib/PublicInbox/IMAP.pm | 4 +---
lib/PublicInbox/Mbox.pm | 7 ++-----
lib/PublicInbox/Search.pm | 6 ++++++
lib/PublicInbox/SearchView.pm | 6 ++----
5 files changed, 12 insertions(+), 15 deletions(-)
diff --git a/lib/PublicInbox/ExtMsg.pm b/lib/PublicInbox/ExtMsg.pm
index 5c23dc31..65892161 100644
--- a/lib/PublicInbox/ExtMsg.pm
+++ b/lib/PublicInbox/ExtMsg.pm
@@ -11,7 +11,6 @@ use warnings;
use PublicInbox::Hval qw(ascii_html prurl mid_href);
use PublicInbox::WwwStream qw(html_oneshot);
use PublicInbox::Smsg;
-use PublicInbox::Search qw(mdocid);
our $MIN_PARTIAL_LEN = 16;
# TODO: user-configurable
@@ -61,7 +60,6 @@ sub search_partial ($$) {
}
}
- my $n = $srch->{nshard} // 1;
foreach my $m (@try) {
# If Xapian can't handle the wildcard since it
# has too many results. $@ can be
@@ -70,7 +68,7 @@ sub search_partial ($$) {
my $mset = eval { $srch->query($m, $opt) } or next;
my @mids = map {
$_->{mid}
- } @{$ibx->over->get_all(map { mdocid($n, $_) } $mset->items)};
+ } @{$ibx->over->get_all(@{$srch->mset_to_artnums($mset)})};
return \@mids if scalar(@mids);
}
}
diff --git a/lib/PublicInbox/IMAP.pm b/lib/PublicInbox/IMAP.pm
index 562c59d4..abdb8fec 100644
--- a/lib/PublicInbox/IMAP.pm
+++ b/lib/PublicInbox/IMAP.pm
@@ -40,7 +40,6 @@ use PublicInbox::Syscall qw(EPOLLIN EPOLLONESHOT);
use PublicInbox::GitAsyncCat;
use Text::ParseWords qw(parse_line);
use Errno qw(EAGAIN);
-use PublicInbox::Search qw(mdocid);
use PublicInbox::IMAPsearchqp;
my $Address;
@@ -1188,9 +1187,8 @@ sub refill_xap ($$$$) {
my ($beg, $end) = @$range_info;
my $srch = $self->{ibx}->search;
my $opt = { mset => 2, limit => 1000 };
- my $nshard = $srch->{nshard} // 1;
my $mset = $srch->query("$q uid:$beg..$end", $opt);
- @$uids = map { mdocid($nshard, $_) } $mset->items;
+ @$uids = @{$srch->mset_to_artnums($mset)};
if (@$uids) {
$range_info->[0] = $uids->[-1] + 1; # update $beg
return; # possibly more
diff --git a/lib/PublicInbox/Mbox.pm b/lib/PublicInbox/Mbox.pm
index 0fa9a38d..873ff7be 100644
--- a/lib/PublicInbox/Mbox.pm
+++ b/lib/PublicInbox/Mbox.pm
@@ -10,7 +10,6 @@ use PublicInbox::MID qw/mid_escape/;
use PublicInbox::Hval qw/to_filename/;
use PublicInbox::Smsg;
use PublicInbox::Eml;
-use PublicInbox::Search qw(mdocid);
# called by PSGI server as body response
# this gets called twice for every message, once to return the header,
@@ -217,8 +216,7 @@ sub results_cb {
my $mset = $srch->query($ctx->{query}, $ctx->{qopts});
my $size = $mset->size or return;
$ctx->{qopts}->{offset} += $size;
- my $nshard = $srch->{nshard} // 1;
- $ctx->{ids} = [ map { mdocid($nshard, $_) } $mset->items ];
+ $ctx->{ids} = $srch->mset_to_artnums($mset);
}
}
@@ -234,8 +232,7 @@ sub mbox_all {
return [404, [qw(Content-Type text/plain)],
["No results found\n"]];
$ctx->{query} = $query;
- my $nshard = $srch->{nshard} // 1;
- $ctx->{ids} = [ map { mdocid($nshard, $_) } $mset->items ];
+ $ctx->{ids} = $srch->mset_to_artnums($mset);
require PublicInbox::MboxGz;
PublicInbox::MboxGz::mbox_gz($ctx, \&results_cb, 'results-'.$query);
}
diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index 1c648299..c18e19d4 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -219,6 +219,12 @@ sub mdocid {
int(($docid - 1) / $nshard) + 1;
}
+sub mset_to_artnums {
+ my ($self, $mset) = @_;
+ my $nshard = $self->{nshard} // 1;
+ [ map { mdocid($nshard, $_) } $mset->items ];
+}
+
sub xdb ($) {
my ($self) = @_;
$self->{xdb} ||= do {
diff --git a/lib/PublicInbox/SearchView.pm b/lib/PublicInbox/SearchView.pm
index ef1b9767..aa8fa037 100644
--- a/lib/PublicInbox/SearchView.pm
+++ b/lib/PublicInbox/SearchView.pm
@@ -91,9 +91,8 @@ sub mset_summary {
my $pfx = ' ' x $pad;
my $res = \($ctx->{-html_tip});
my $ibx = $ctx->{-inbox};
- my $nshard = $ibx->search->{nshard} // 1;
my $obfs_ibx = $ibx->{obfuscate} ? $ibx : undef;
- my @nums = map { mdocid($nshard, $_) } $mset->items;
+ my @nums = @{$ibx->search->mset_to_artnums($mset)};
my %num2msg = map { $_->{num} => $_ } @{$ibx->over->get_all(@nums)};
foreach my $m ($mset->items) {
@@ -316,8 +315,7 @@ sub ctx_prepare {
sub adump {
my ($cb, $mset, $q, $ctx) = @_;
- my $nshard = $ctx->{-inbox}->search->{nshard} // 1;
- $ctx->{ids} = [ map { mdocid($nshard, $_) } $mset->items ];
+ $ctx->{ids} = $ctx->{-inbox}->search->mset_to_artnums($mset);
$ctx->{search_query} = $q; # used by WwwAtomStream::atom_header
PublicInbox::WwwAtomStream->response($ctx, 200, \&adump_i);
}
^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [PATCH 05/23] init: support --newsgroup option
2020-08-20 20:24 ` [PATCH 05/23] init: support --newsgroup option Eric Wong
@ 2020-08-20 21:10 ` Eric Wong
0 siblings, 0 replies; 25+ messages in thread
From: Eric Wong @ 2020-08-20 21:10 UTC (permalink / raw)
To: meta
Eric Wong <e@yhbt.net> wrote:
> +Some of the options documented in L<public-inbox-config(5)>
> +require editing the config file. Old versions lack the
> +C<-n>/C<--newsgroup> parameter
While working on this, I realized -n vs -N could be confusing,
so I made the abbreviation --ng instead. So I'll squash
this in before pushing:
diff --git a/Documentation/public-inbox-init.pod b/Documentation/public-inbox-init.pod
index 240959eb..ea2b38dd 100644
--- a/Documentation/public-inbox-init.pod
+++ b/Documentation/public-inbox-init.pod
@@ -111,7 +111,7 @@ Used to override the default C<~/.public-inbox/config> value.
Some of the options documented in L<public-inbox-config(5)>
require editing the config file. Old versions lack the
-C<-n>/C<--newsgroup> parameter
+C<--ng>/C<--newsgroup> parameter
See L<public-inbox-config(5)> for all the options which may be applied
to a given inbox.
^ permalink raw reply related [flat|nested] 25+ messages in thread
end of thread, other threads:[~2020-08-20 21:10 UTC | newest]
Thread overview: 25+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2020-08-20 20:24 [PATCH 00/23] indexing: --skip-docdata + speedups Eric Wong
2020-08-20 20:24 ` [PATCH 01/23] doc: note -compact and -xcpdb are rarely used Eric Wong
2020-08-20 20:24 ` [PATCH 02/23] admin: progress shows the inbox being indexed Eric Wong
2020-08-20 20:24 ` [PATCH 03/23] compact: support --help/-? and perform lazy loading Eric Wong
2020-08-20 20:24 ` [PATCH 04/23] init: support --help and -? Eric Wong
2020-08-20 20:24 ` [PATCH 05/23] init: support --newsgroup option Eric Wong
2020-08-20 21:10 ` Eric Wong
2020-08-20 20:24 ` [PATCH 06/23] init: drop -N alias for --skip-artnum Eric Wong
2020-08-20 20:24 ` [PATCH 07/23] search: v2: ensure shards are numerically sorted Eric Wong
2020-08-20 20:24 ` [PATCH 08/23] xapcmd: simplify {reindex} parameter passing Eric Wong
2020-08-20 20:24 ` [PATCH 09/23] www: reduce long-lived PublicInbox::Search references Eric Wong
2020-08-20 20:24 ` [PATCH 10/23] search: improve comments around constants Eric Wong
2020-08-20 20:24 ` [PATCH 11/23] search: export mdocid subroutine Eric Wong
2020-08-20 20:24 ` [PATCH 12/23] searchquery: split off from searchview Eric Wong
2020-08-20 20:24 ` [PATCH 13/23] search: make qparse_new an internal function Eric Wong
2020-08-20 20:24 ` [PATCH 14/23] smsg: reduce utf8::decode call sites Eric Wong
2020-08-20 20:24 ` [PATCH 15/23] searchview: use over.sqlite3 instead of Xapian docdata Eric Wong
2020-08-20 20:24 ` [PATCH 16/23] searchview: speed up search summary by ~10% Eric Wong
2020-08-20 20:24 ` [PATCH 17/23] searchview: convert nested and Atom display to over.sqlite3 Eric Wong
2020-08-20 20:24 ` [PATCH 18/23] extmsg: avoid using Xapian docdata Eric Wong
2020-08-20 20:24 ` [PATCH 19/23] mbox: avoid Xapian docdata in search results Eric Wong
2020-08-20 20:24 ` [PATCH 20/23] smsg: remove from_mitem Eric Wong
2020-08-20 20:24 ` [PATCH 21/23] t/nntpd-v2: set PI_TEST_VERSION=2 properly Eric Wong
2020-08-20 20:24 ` [PATCH 22/23] init+index: support --skip-docdata for Xapian Eric Wong
2020-08-20 20:24 ` [PATCH 23/23] search: add mset_to_artnums method Eric Wong
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).