Nothing terribly exciting, since xcpdb isn't really used often.
But it'd be bad if it flooded the system with many parallel
processes on HDD because -index was configured for many small
shards.

So it now supports --sequential-shard and all the other index
options.

Eric Wong (6):
  xapcmd: simplify sub reference
  xcpdb: support --no-fsync from CLI
  xapcmd: reduce CPU idling when shards exceeds job count
  admin: don't warn when --jobs exceeds shards
  xcpdb: wire up new index options and --help
  v2writable: remove IdxStack import

 Documentation/public-inbox-xcpdb.pod | 19 +++++++-
 lib/PublicInbox/Admin.pm             |  4 +-
 lib/PublicInbox/V2Writable.pm        |  1 -
 lib/PublicInbox/Xapcmd.pm            | 12 +++--
 script/public-inbox-xcpdb            | 65 +++++++++++++++++++++++-----
 5 files changed, 83 insertions(+), 18 deletions(-)
We don't need to fully-qualify when referring to subs in the
same namespace, nor do we need to make a SCALAR ref only to
dereference it.

(Yes, still learning Perl :x)
---
 lib/PublicInbox/Xapcmd.pm | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/PublicInbox/Xapcmd.pm b/lib/PublicInbox/Xapcmd.pm
index 348621cef..6fcc9e90c 100644
--- a/lib/PublicInbox/Xapcmd.pm
+++ b/lib/PublicInbox/Xapcmd.pm
@@ -253,7 +253,7 @@ sub _run {
 
 sub run {
 	my ($ibx, $task, $opt) = @_; # task = 'cpdb' or 'compact'
-	my $cb = \&${\"PublicInbox::Xapcmd::$task"};
+	my $cb = \&$task;
 	PublicInbox::Admin::progress_prepare($opt ||= {});
 	defined(my $dir = $ibx->{inboxdir}) or die "no inboxdir defined\n";
 	-d $dir or die "inboxdir=$dir does not exist\n";
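For readers unfamiliar with the idiom, here is a minimal sketch of how `\&$task` looks up an unqualified sub name in the current package, which Perl permits even under `use strict`. The package `Demo::Xapcmd` and its stub subs are hypothetical stand-ins for illustration, not the real PublicInbox::Xapcmd code.

```perl
package Demo::Xapcmd; # hypothetical package, for illustration only
use strict;
use warnings;

# stand-ins for the real 'cpdb' and 'compact' tasks
sub cpdb { "cpdb(@_)" }
sub compact { "compact(@_)" }

sub run {
	my ($task, @args) = @_; # task = 'cpdb' or 'compact'
	# \&$name is exempt from "strict refs"; an unqualified
	# name resolves in the package this line was compiled in
	my $cb = \&$task;
	$cb->(@args);
}

package main;
print Demo::Xapcmd::run('cpdb', 'shard0'), "\n"; # prints "cpdb(shard0)"
```

The sketch only works because the task name comes from a fixed, trusted set; taking a reference to a name that has no sub defined succeeds silently and only fails once the reference is called.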
This was omitted in 8b1950055d51d436 :x

Fixes: 8b1950055d51d436 ("index+xcpdb: rename `--no-sync' to `--no-fsync'")
---
 script/public-inbox-xcpdb | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/script/public-inbox-xcpdb b/script/public-inbox-xcpdb
index fcd961488..2c91598cb 100755
--- a/script/public-inbox-xcpdb
+++ b/script/public-inbox-xcpdb
@@ -8,8 +8,9 @@ use PublicInbox::Xapcmd;
 use PublicInbox::Admin;
 PublicInbox::Admin::require_or_die('-search');
 my $usage = "Usage: public-inbox-xcpdb [--compact] INBOX_DIR\n";
-my $opt = { sync => 1 };
-my @opt = (qw(sync! compact reshard|R=i), @PublicInbox::Xapcmd::COMPACT_OPT);
+my $opt = { fsync => 1 };
+my @opt = (qw(fsync|sync! compact reshard|R=i),
+		@PublicInbox::Xapcmd::COMPACT_OPT);
 GetOptions($opt, @opt) or die "bad command-line args\n$usage";
 my @ibxs = PublicInbox::Admin::resolve_inboxes(\@ARGV) or die $usage;
 foreach (@ibxs) {
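As a rough illustration of what the `fsync|sync!` spec buys, the sketch below parses a few argument lists with Getopt::Long and shows that both `--no-fsync` and the older `--no-sync` spelling clear the same `fsync` key. The helper `parse_xcpdb_args` is hypothetical, and `@PublicInbox::Xapcmd::COMPACT_OPT` is left out so the example stays self-contained.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);
Getopt::Long::Configure(qw(gnu_getopt no_ignore_case auto_abbrev));

# parse_xcpdb_args: hypothetical helper, for illustration only
sub parse_xcpdb_args {
	my @argv = @_;
	my $opt = { fsync => 1 }; # fsync stays enabled unless negated
	# "fsync|sync!" is a negatable flag stored under its first
	# name, so --no-sync remains accepted as an alias
	GetOptionsFromArray(\@argv, $opt,
			qw(fsync|sync! compact reshard|R=i))
		or die "bad command-line args\n";
	$opt;
}

for my $args ([], ['--no-fsync'], ['--no-sync'], ['--compact', '-R', '4']) {
	my $opt = parse_xcpdb_args(@$args);
	print join(' ', @$args) || '(no args)',
		' => fsync=', ($opt->{fsync} ? 1 : 0), "\n";
}
```

Storing the value under `fsync` means code that reads the option hash only ever needs to check one key, whichever spelling the user typed.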
In case there are unbalanced shards AND we're limiting
parallelism while using many shards, spawn the next task in the
queue ASAP once a task is done, instead of waiting for all tasks
to finish before spawning the next batch.

Unbalanced shards probably aren't a big issue for most users;
however, many smaller shards with few jobs can be useful for HDD
users to reduce the effect of random writes.
---
 lib/PublicInbox/Xapcmd.pm | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/lib/PublicInbox/Xapcmd.pm b/lib/PublicInbox/Xapcmd.pm
index 6fcc9e90c..b6279218c 100644
--- a/lib/PublicInbox/Xapcmd.pm
+++ b/lib/PublicInbox/Xapcmd.pm
@@ -9,7 +9,7 @@ use PublicInbox::SearchIdx;
 use File::Temp 0.19 (); # ->newdir
 use File::Path qw(remove_tree);
 use File::Basename qw(dirname);
-use POSIX ();
+use POSIX qw(WNOHANG);
 
 # support testing with dev versions of Xapian which installs
 # commands with a version number suffix (e.g. "xapian-compact-1.5")
@@ -151,14 +151,17 @@ sub process_queue {
 			$pids{cb_spawn($cb, $args, $opt)} = $args;
 		}
 
+		my $flags = 0;
 		while (scalar keys %pids) {
-			my $pid = waitpid(-1, 0);
+			my $pid = waitpid(-1, $flags) or last;
+			last if $pid < 0;
 			my $args = delete $pids{$pid};
 			if ($args) {
 				die join(' ', @$args)." failed: $?\n" if $?;
 			} else {
 				warn "unknown PID($pid) reaped: $?\n";
 			}
+			$flags = WNOHANG if scalar(@$queue);
 		}
 	}
 }
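The scheduling change can be tried in isolation with a standalone sketch along the same lines: block for the first exit, then switch to `WNOHANG` polling while work remains so a free slot is refilled immediately. `run_queue` is a hypothetical helper written for this note, and it forks plain code refs where the real code goes through `cb_spawn`.

```perl
use strict;
use warnings;
use POSIX qw(WNOHANG);

# run_queue: hypothetical helper, not the patched process_queue().
# Runs each code ref in its own child, at most $max at a time,
# and returns the number of children reaped.
sub run_queue {
	my ($max, @queue) = @_;
	my (%pids, $reaped);
	$reaped = 0;
	while (@queue) {
		# fill every free slot
		while (@queue && keys(%pids) < $max) {
			my $task = shift @queue;
			my $pid = fork();
			die "fork: $!\n" unless defined $pid;
			if ($pid == 0) { # child
				$task->();
				POSIX::_exit(0);
			}
			$pids{$pid} = 1;
		}
		my $flags = 0; # block until at least one child exits
		while (%pids) {
			my $pid = waitpid(-1, $flags) or last; # 0: none ready
			last if $pid < 0; # no children left
			delete $pids{$pid} or warn "unknown PID($pid) reaped\n";
			die "PID($pid) failed: $?\n" if $?;
			$reaped++;
			# work remains: poll instead of blocking so the
			# outer loop can refill the free slot right away
			$flags = WNOHANG if @queue;
		}
	}
	$reaped;
}

# five tasks of uneven length, two at a time
my @tasks = map {
	my $n = $_;
	sub { select(undef, undef, undef, 0.01 * $n) }
} 1..5;
print run_queue(2, @tasks), " tasks finished\n"; # prints "5 tasks finished"
```

Once the queue is empty, the flags stay at 0, so the final batch is reaped with ordinary blocking waits and the loop never spins.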
Established tools like make(1), prove(1) and xargs(1) don't warn
when the desired parallelism level can't be met, either.
---
 lib/PublicInbox/Admin.pm | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/PublicInbox/Admin.pm b/lib/PublicInbox/Admin.pm
index ce720beb6..d99a00b4b 100644
--- a/lib/PublicInbox/Admin.pm
+++ b/lib/PublicInbox/Admin.pm
@@ -219,9 +219,9 @@ sub index_inbox {
 			$v2w->{parallel} = 0;
 		} else {
 			my $n = $v2w->{shards};
-			if ($jobs != ($n + 1) && !$opt->{reshard}) {
+			if ($jobs < ($n + 1) && !$opt->{reshard}) {
 				warn
-"Unable to respect --jobs=$jobs, inbox was created with $n shards\n";
+"Unable to respect --jobs=$jobs on index, inbox was created with $n shards\n";
 			}
 		}
 	}
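To make the effect of swapping `!=` for `<` concrete, the sketch below evaluates both conditions for a few job counts against an inbox with 3 shards. `warns_old` and `warns_new` are hypothetical names that only mirror the two expressions in the hunk above.

```perl
use strict;
use warnings;

# hypothetical predicates mirroring the old and new conditions
sub warns_old {
	my ($jobs, $n, $reshard) = @_;
	($jobs != ($n + 1) && !$reshard) ? 1 : 0;
}

sub warns_new {
	my ($jobs, $n, $reshard) = @_;
	($jobs < ($n + 1) && !$reshard) ? 1 : 0;
}

my $n = 3; # shards the inbox was created with
for my $jobs (2, 4, 8) {
	printf "jobs=%d old=%d new=%d\n",
		$jobs, warns_old($jobs, $n, 0), warns_new($jobs, $n, 0);
}
```

With 3 shards, only `--jobs=8` changes behavior: the old check warned and the new one stays quiet.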
--sequential-shard also disables the copy parallelism (--jobs),
so it can be useful for systems that cannot handle parallel
random I/O but still want many shards.

There was a missing "use strict", too, which is fixed.
---
 Documentation/public-inbox-xcpdb.pod | 19 +++++++-
 lib/PublicInbox/Xapcmd.pm            |  3 +-
 script/public-inbox-xcpdb            | 66 +++++++++++++++++++++++-----
 3 files changed, 75 insertions(+), 13 deletions(-)

diff --git a/Documentation/public-inbox-xcpdb.pod b/Documentation/public-inbox-xcpdb.pod
index 2ed4c5821..62a28c0a1 100644
--- a/Documentation/public-inbox-xcpdb.pod
+++ b/Documentation/public-inbox-xcpdb.pod
@@ -19,7 +19,7 @@ L<public-inbox-learn(1)>, and L<public-inbox-index(1)>.
 
 =over
 
-=item --compact
+=item -c, --compact
 
 In addition to performing the copy operation, run L<xapian-compact(1)>
 on each Xapian shard after copying but before finalizing it.
@@ -52,6 +52,23 @@ Disable L<fsync(2)> and L<fdatasync(2)>.
 
 Available in public-inbox 1.6.0 (PENDING).
 
+=item --sequential-shard
+
+Copy each shard sequentially, ignoring C<--jobs>. This also
+affects indexing done at the end of a run.
+
+=item --batch-size=BYTES
+
+=item --max-size=BYTES
+
+See L<public-inbox-index(1)> for a description of these options.
+
+These indexing options affect indexing at the end of a run.
+C<public-inbox-xcpdb> may run in parallel with
+L<public-inbox-index(1)>, and C<public-inbox-xcpdb> needs to
+reindex changes made to the old Xapian DBs by
+L<public-inbox-index(1)> while it was running.
+
 =back
 
 =head1 ENVIRONMENT
diff --git a/lib/PublicInbox/Xapcmd.pm b/lib/PublicInbox/Xapcmd.pm
index b6279218c..46548a948 100644
--- a/lib/PublicInbox/Xapcmd.pm
+++ b/lib/PublicInbox/Xapcmd.pm
@@ -82,7 +82,8 @@ sub commit_changes ($$$$) {
 				$im->{shards} = $n;
 			}
 		}
-
+		my $env = $opt->{-idx_env};
+		local %ENV = (%ENV, %$env) if $env;
 		PublicInbox::Admin::index_inbox($ibx, $im, $opt);
 	}
 }
diff --git a/script/public-inbox-xcpdb b/script/public-inbox-xcpdb
index 2c91598cb..718a34b77 100755
--- a/script/public-inbox-xcpdb
+++ b/script/public-inbox-xcpdb
@@ -1,20 +1,64 @@
-#!/usr/bin/perl -w
+#!perl -w
 # Copyright (C) 2019-2020 all contributors <meta@public-inbox.org>
 # License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
-# xcpdb: Xapian copy database, a wrapper around Xapian's copydatabase(1)
+use strict;
+use v5.10.1;
 use Getopt::Long qw(:config gnu_getopt no_ignore_case auto_abbrev);
-use PublicInbox::InboxWritable;
-use PublicInbox::Xapcmd;
+my $usage = 'Usage: public-inbox-xcpdb [options] INBOX_DIR';
+my $help = <<EOF; # the following should fit w/o scrolling in 80x24 term:
+usage: $usage
+
+  upgrade or reshard Xapian DB(s) used by public-inbox
+
+options:
+
+  --compact | -c      run public-inbox-compact(1) after indexing
+  --reshard=NUM       change the number of shards
+  --jobs=NUM          limit parallelism to JOBS count
+  --verbose | -v      increase verbosity (may be repeated)
+  --sequential-shard  copy+index Xapian shards sequentially (for slow HDD)
+  --help | -?         show this help
+
+index options (see public-inbox-index(1) man page for full description):
+
+  --no-fsync          speed up indexing, risk corruption on power outage
+  --batch-size=BYTES  flush changes to OS after a given number of bytes
+  --max-size=BYTES    do not index messages larger than the given size
+
+See public-inbox-xcpdb(1) man page for full documentation.
+EOF
+my $opt = { quiet => -1, compact => 0, fsync => 1 };
+GetOptions($opt, qw(
+	fsync|sync! compact|c reshard|R=i
+	max_size|max-size=s batch_size|batch-size=s
+	sequential_shard|seq-shard|sequential-shard
+	jobs|j=i quiet|q verbose|v
+	blocksize|b=s no-full|n fuller|F
+	help|?)) or die "bad command-line args\n$usage";
+if ($opt->{help}) { print $help; exit 0 };
+
 use PublicInbox::Admin;
 PublicInbox::Admin::require_or_die('-search');
-my $usage = "Usage: public-inbox-xcpdb [--compact] INBOX_DIR\n";
-my $opt = { fsync => 1 };
-my @opt = (qw(fsync|sync! compact reshard|R=i),
-		@PublicInbox::Xapcmd::COMPACT_OPT);
-GetOptions($opt, @opt) or die "bad command-line args\n$usage";
-my @ibxs = PublicInbox::Admin::resolve_inboxes(\@ARGV) or die $usage;
+
+require PublicInbox::Config;
+my $cfg = PublicInbox::Config->new;
+my @ibxs = PublicInbox::Admin::resolve_inboxes(\@ARGV, undef, $cfg) or
+	die $usage;
+my $idx_env = PublicInbox::Admin::index_prepare($opt, $cfg);
+
+# we only set XAPIAN_FLUSH_THRESHOLD for index, since cpdb doesn't
+# know sizes, only doccounts
+$opt->{-idx_env} = $idx_env;
+
+if ($opt->{sequential_shard} && ($opt->{jobs} // 1) > 1) {
+	warn "W: --jobs=$opt->{jobs} ignored with --sequential-shard\n";
+	$opt->{jobs} = 0;
+}
+
+require PublicInbox::InboxWritable;
+require PublicInbox::Xapcmd;
 foreach (@ibxs) {
 	my $ibx = PublicInbox::InboxWritable->new($_);
-	# we rely on --no-renumber to keep docids synched to NNTP
+	# we rely on --no-renumber to keep docids synched for NNTP
 	PublicInbox::Xapcmd::run($ibx, 'cpdb', $opt);
 }
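The `local %ENV = (%ENV, %$env) if $env;` line in commit_changes overlays extra variables on the environment only for the duration of the enclosing block. A minimal sketch of that scoping follows, assuming a hypothetical `with_env` helper and an arbitrary threshold value of 10000; neither comes from the patch.

```perl
use strict;
use warnings;

# with_env: hypothetical helper; runs $cb with %$env overlaid on %ENV
sub with_env {
	my ($env, $cb) = @_;
	# local restores the previous %ENV when this sub returns,
	# even if $cb dies
	local %ENV = (%ENV, %$env) if $env;
	$cb->();
}

delete $ENV{XAPIAN_FLUSH_THRESHOLD}; # start from a known state
my $seen = with_env({ XAPIAN_FLUSH_THRESHOLD => 10000 },
		sub { $ENV{XAPIAN_FLUSH_THRESHOLD} });
print "inside: $seen\n"; # prints "inside: 10000"
print "outside: ", $ENV{XAPIAN_FLUSH_THRESHOLD} // '(unset)', "\n";
```

Keeping the override scoped this way matches the comment in the script: the copy step does not see the threshold, and only the indexing call at the end of the run does.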
We use IdxStack via log2stack() from SearchIdx, now.
---
 lib/PublicInbox/V2Writable.pm | 1 -
 1 file changed, 1 deletion(-)

diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index 72198a298..d99e476aa 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -8,7 +8,6 @@ use strict;
 use v5.10.1;
 use parent qw(PublicInbox::Lock);
 use PublicInbox::SearchIdxShard;
-use PublicInbox::IdxStack;
 use PublicInbox::Eml;
 use PublicInbox::Git;
 use PublicInbox::Import;