* [PATCH 00/14] more indexing related improvements
@ 2020-08-10 2:11 Eric Wong
2020-08-10 2:11 ` [PATCH 01/14] index: require --reindex when using --xapian-only Eric Wong
` (13 more replies)
0 siblings, 14 replies; 18+ messages in thread
From: Eric Wong @ 2020-08-10 2:11 UTC (permalink / raw)
To: meta
publicInbox.indexSequentialShard now works incrementally
-convert also learned all the options -index learned,
so it can be less painful on HDDs.
Eric Wong (14):
index: require --reindex when using --xapian-only
index: --sequential-shard works incrementally
doc: index: some more notes about latest changes
doc: add some notes around -xcpdb / -edit / -purge
index+xcpdb: improve SIG{INT,TERM,HUP,PIPE} behavior
msgmap: tmp_clone: simplify + meaningful filename
avoid File::Temp::tempfile in more places
admin: use a generic veriable name
index: cleanup internal variables
searchidx: use singular `$opt' for consistency with v2
convert: support new -index options
convert: speed up --help
convert: check ARGV more correctly
convert: set No_COW on copied SQLite files
Documentation/public-inbox-convert.pod | 19 ++++
Documentation/public-inbox-edit.pod | 14 +++
Documentation/public-inbox-index.pod | 68 +++++++------
Documentation/public-inbox-init.pod | 2 +-
Documentation/public-inbox-purge.pod | 14 +++
Documentation/public-inbox-xcpdb.pod | 15 ++-
lib/PublicInbox/Admin.pm | 71 ++++++++++++--
lib/PublicInbox/Msgmap.pm | 19 ++--
lib/PublicInbox/SearchIdx.pm | 34 +++----
lib/PublicInbox/V2Writable.pm | 77 ++++++++-------
lib/PublicInbox/Xapcmd.pm | 28 ++++--
script/public-inbox-convert | 131 ++++++++++++++++---------
script/public-inbox-index | 69 ++++---------
script/public-inbox-init | 17 ++--
t/import.t | 5 +-
15 files changed, 357 insertions(+), 226 deletions(-)
^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH 01/14] index: require --reindex when using --xapian-only
2020-08-10 2:11 [PATCH 00/14] more indexing related improvements Eric Wong
@ 2020-08-10 2:11 ` Eric Wong
2020-08-10 2:11 ` [PATCH 02/14] index: --sequential-shard works incrementally Eric Wong
` (12 subsequent siblings)
13 siblings, 0 replies; 18+ messages in thread
From: Eric Wong @ 2020-08-10 2:11 UTC (permalink / raw)
To: meta
This to avoid user error of a currently undocumented switch;
since --xapian-only always goes through the full history at
the moment.
---
script/public-inbox-index | 3 +++
1 file changed, 3 insertions(+)
diff --git a/script/public-inbox-index b/script/public-inbox-index
index 73ca2953..9e0907be 100755
--- a/script/public-inbox-index
+++ b/script/public-inbox-index
@@ -42,6 +42,9 @@ GetOptions($opt, qw(verbose|v+ reindex rethread compact|c+ jobs|j=i prune
or die "bad command-line args\n$usage";
if ($opt->{help}) { print $help; exit 0 };
die "--jobs must be >= 0\n" if defined $opt->{jobs} && $opt->{jobs} < 0;
+if ($opt->{xapianonly} && !$opt->{reindex}) {
+ die "--xapian-only requires --reindex\n";
+}
# require lazily to speed up --help
require PublicInbox::Admin;
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH 02/14] index: --sequential-shard works incrementally
2020-08-10 2:11 [PATCH 00/14] more indexing related improvements Eric Wong
2020-08-10 2:11 ` [PATCH 01/14] index: require --reindex when using --xapian-only Eric Wong
@ 2020-08-10 2:11 ` Eric Wong
2020-08-10 2:11 ` [PATCH 03/14] doc: index: more notes about latest changes Eric Wong
` (11 subsequent siblings)
13 siblings, 0 replies; 18+ messages in thread
From: Eric Wong @ 2020-08-10 2:11 UTC (permalink / raw)
To: meta
We should never reindex all data in Xapian unless --reindex is
specified on the command-line. This means users who put
publicInbox.indexSequentialShard in their config file won't have
to put up with a full reindex at every invocation, only when
they specify --reindex.
We'll also cleanup the progress output to not emit non-sensical
ranges where the starting number is higher than the end.
---
lib/PublicInbox/V2Writable.pm | 36 ++++++++++++++++++++++-------------
1 file changed, 23 insertions(+), 13 deletions(-)
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index f7a318e5..0b527f18 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -1198,20 +1198,20 @@ sub index_xap_only { # git->cat_async callback
sub index_xap_step ($$$;$) {
my ($self, $sync, $beg, $step) = @_;
- my $ibx = $self->{ibx};
- my $all = $ibx->git;
- my $over = $ibx->over;
- my $batch_bytes = batch_bytes($self);
- $step //= $self->{shards};
my $end = $sync->{art_end};
+ return if $beg > $end; # nothing to do
+
+ $step //= $self->{shards};
+ my $ibx = $self->{ibx};
if (my $pr = $sync->{-opt}->{-progress}) {
$pr->("Xapian indexlevel=$ibx->{indexlevel} ".
"$beg..$end (% $step)\n");
}
+ my $batch_bytes = batch_bytes($self);
for (my $num = $beg; $num <= $end; $num += $step) {
- my $smsg = $over->get_art($num) or next;
+ my $smsg = $ibx->over->get_art($num) or next;
$smsg->{v2w} = $self;
- $all->cat_async($smsg->{blob}, \&index_xap_only, $smsg);
+ $ibx->git->cat_async($smsg->{blob}, \&index_xap_only, $smsg);
if ($self->{transact_bytes} >= $batch_bytes) {
${$sync->{nr}} = $num;
reindex_checkpoint($self, $sync);
@@ -1253,8 +1253,9 @@ sub index_epoch ($$$) {
}
sub xapian_only {
- my ($self, $opt, $sync) = @_;
+ my ($self, $opt, $sync, $art_beg) = @_;
my $seq = $opt->{sequentialshard};
+ $art_beg //= 0;
local $self->{parallel} = 0 if $seq;
$self->idx_init($opt); # acquire lock
if (my $art_end = $self->{ibx}->mm->max) {
@@ -1268,9 +1269,11 @@ sub xapian_only {
$sync->{art_end} = $art_end;
if ($seq || !$self->{parallel}) {
my $shard_end = $self->{shards} - 1;
- index_xap_step($self, $sync, $_) for (0..$shard_end);
+ for (0..$shard_end) {
+ index_xap_step($self, $sync, $art_beg + $_)
+ }
} else { # parallel (maybe)
- index_xap_step($self, $sync, 0, 1);
+ index_xap_step($self, $sync, $art_beg, 1);
}
}
$self->{ibx}->git->cat_async_wait;
@@ -1289,6 +1292,7 @@ sub index_sync {
return unless defined $latest;
my $seq = $opt->{sequentialshard};
+ my $art_beg; # the NNTP article number we start xapian_only at
my $idxlevel = $self->{ibx}->{indexlevel};
local $self->{ibx}->{indexlevel} = 'basic' if $seq;
@@ -1312,6 +1316,12 @@ sub index_sync {
$self->{mm}->{dbh}->begin_work;
$sync->{mm_tmp} =
$self->{mm}->tmp_clone($self->{ibx}->{inboxdir});
+
+ # xapian_only works incrementally w/o --reindex
+ if ($seq && !$opt->{reindex}) {
+ $art_beg = $sync->{mm_tmp}->max;
+ $art_beg++ if defined($art_beg);
+ }
}
if ($sync->{index_max_size} = $self->{ibx}->{index_max_size}) {
$sync->{index_oid} = \&index_oid;
@@ -1326,10 +1336,10 @@ sub index_sync {
$pr->('all.git '.sprintf($sync->{-regen_fmt}, $$nr)) if $pr;
}
- if ($seq) { # deal with Xapian shards sequentially
+ # deal with Xapian shards sequentially
+ if ($seq && delete($sync->{mm_tmp})) {
$self->{ibx}->{indexlevel} = $idxlevel;
- delete $sync->{mm_tmp};
- xapian_only($self, $opt, $sync);
+ xapian_only($self, $opt, $sync, $art_beg);
}
# reindex does not pick up new changes, so we rerun w/o it:
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH 03/14] doc: index: more notes about latest changes
2020-08-10 2:11 [PATCH 00/14] more indexing related improvements Eric Wong
2020-08-10 2:11 ` [PATCH 01/14] index: require --reindex when using --xapian-only Eric Wong
2020-08-10 2:11 ` [PATCH 02/14] index: --sequential-shard works incrementally Eric Wong
@ 2020-08-10 2:11 ` Eric Wong
2020-08-10 2:38 ` Kyle Meyer
2020-08-10 2:11 ` [PATCH 04/14] doc: add some notes around -xcpdb / -edit / -purge Eric Wong
` (10 subsequent siblings)
13 siblings, 1 reply; 18+ messages in thread
From: Eric Wong @ 2020-08-10 2:11 UTC (permalink / raw)
To: meta
With LKML on an HDD, a giant --batch-size of 500m ends up being
pretty useful. I was able to index LKML in ~16 hours on a
system that had other activity on it. The big downside was it
was eating up over 5g of RAM :x.
We'll also fix up a duplicated indexBatchSize section, fix
formatting around global vs per-inbox indexSequentialShard,
and ensure section 5 manpages are linked correctly.
---
Documentation/public-inbox-index.pod | 62 +++++++++++++++-------------
1 file changed, 33 insertions(+), 29 deletions(-)
diff --git a/Documentation/public-inbox-index.pod b/Documentation/public-inbox-index.pod
index 56dec993..3ae3b008 100644
--- a/Documentation/public-inbox-index.pod
+++ b/Documentation/public-inbox-index.pod
@@ -115,6 +115,11 @@ Sets or overrides L</publicinbox.indexBatchSize> on a
per-invocation basis. See L</publicinbox.indexBatchSize>
below.
+When using rotational storage but abundant RAM, using a large
+value (e.g. C<500m>) with C<--sequential-shard> can
+significantly speed up the initial index and full C<--reindex>
+invocations (but not incremental updates).
+
Available in public-inbox 1.6.0 (PENDING).
=item --no-fsync
@@ -136,11 +141,11 @@ Available in public-inbox 1.6.0 (PENDING).
=head1 FILES
-For v1 (ssoma) repositories described in L<public-inbox-v1-format>.
+For v1 (ssoma) repositories described in L<public-inbox-v1-format(5)>.
All public-inbox-specific files are contained within the
C<$GIT_DIR/public-inbox/> directory.
-v2 inboxes are described in L<public-inbox-v2-format>.
+v2 inboxes are described in L<public-inbox-v2-format(5)>.
=head1 CONFIGURATION
@@ -168,40 +173,25 @@ L<public-inbox-learn(1)>, and L<public-inbox-watch(1)>.
Increase this value on powerful systems to improve throughput at
the expense of memory use. The reduction of lock granularity
-may not be noticeable on fast systems.
-
-This option is available in public-inbox 1.6 or later.
-public-inbox 1.5 and earlier used the current default, C<1m>.
+may not be noticeable on fast systems. With SSDs, values above
+C<4m> have little benefit.
For L<public-inbox-v2-format(5)> inboxes, this value is
multiplied by the number of Xapian shards. Thus a typical v2
-inbox with 3 shards will flush every 3 megabytes by default.
-
-Default: 1m (one megabyte)
+inbox with 3 shards will flush every 3 megabytes by default
+when unless parallelism is disabled via C<--sequential-shard>
+or C<--jobs=0>.
-=item publicinbox.indexBatchSize
-
-Flushes changes to the filesystem and releases locks after
-indexing the given number of bytes. The default value of C<1m>
-(one megabyte) is low to minimize memory use and reduce
-contention with parallel invocations of L<public-inbox-mda(1)>,
-L<public-inbox-learn(1)>, and L<public-inbox-watch(1)>.
-
-Increase this value on powerful systems to improve throughput at
-the expense of memory use. The reduction of lock granularity
-may not be noticeable on fast systems.
+This influences memory usage of Xapian, but it is not exact.
+The actual memory used by Xapian and Perl has been observed
+in excess of 10x this value.
This option is available in public-inbox 1.6 or later.
public-inbox 1.5 and earlier used the current default, C<1m>.
-For L<public-inbox-v2-format(5)> inboxes, this value is
-multiplied by the number of Xapian shards. Thus a typical v2
-inbox with 3 shards will flush every 3 megabytes by default.
-
Default: 1m (one megabyte)
=item publicinbox.indexSequentialShard
-=item publicinbox.<inbox_name>.indexSequentialShard
For L<public-inbox-v2-format(5)> inboxes, setting this to C<true>
allows indexing Xapian shards in multiple passes. This speeds up
@@ -212,12 +202,23 @@ Using a higher-than-normal number of C<--jobs> with
L<public-inbox-init(1)> may be required to ensure individual
shards are small enough to fit into cache.
+Warning: interrupting C<public-inbox-index(1)> while this option
+is in use may leave the search indices out-of-date with respect
+to SQLite databases. WWW and IMAP users may notice incomplete
+search results, but it is otherwise non-fatal. Using C<--reindex>
+will bring everything back up-to-date.
+
Available in public-inbox 1.6.0 (PENDING).
This is ignored on L<public-inbox-v1-format(5)> inboxes.
Default: false, shards are indexed in parallel
+=item publicinbox.<name>.indexSequentialShard
+
+Identical to L</publicinbox.indexSequentialShard>,
+but only affect the inbox matching E<lt>nameE<gt>.
+
=back
=head1 ENVIRONMENT
@@ -235,10 +236,13 @@ disk. This environment is handled directly by Xapian, refer to
Xapian API documentation for more details.
For public-inbox 1.6 and later, use C<publicinbox.indexBatchSize>
-instead. Setting C<XAPIAN_FLUSH_THRESHOLD> for a large C<--reindex>
-may cause L<public-inbox-mda(1)>, L<public-inbox-learn(1)> and
-L<public-inbox-watch(1)> tasks to wait long periods of time
-during C<--reindex>.
+instead.
+
+Setting C<XAPIAN_FLUSH_THRESHOLD> or
+C<publicinbox.indexBatchSize> for a large C<--reindex> may cause
+L<public-inbox-mda(1)>, L<public-inbox-learn(1)> and
+L<public-inbox-watch(1)> tasks to wait long and unpredictable
+periods of time during C<--reindex>.
Default: none, uses C<publicinbox.indexBatchSize>
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH 04/14] doc: add some notes around -xcpdb / -edit / -purge
2020-08-10 2:11 [PATCH 00/14] more indexing related improvements Eric Wong
` (2 preceding siblings ...)
2020-08-10 2:11 ` [PATCH 03/14] doc: index: more notes about latest changes Eric Wong
@ 2020-08-10 2:11 ` Eric Wong
2020-08-10 2:11 ` [PATCH 05/14] index+xcpdb: improve SIG{INT,TERM,HUP,PIPE} behavior Eric Wong
` (9 subsequent siblings)
13 siblings, 0 replies; 18+ messages in thread
From: Eric Wong @ 2020-08-10 2:11 UTC (permalink / raw)
To: meta
These rarely-used commands have some caveats that needed
expanding on.
---
Documentation/public-inbox-edit.pod | 14 ++++++++++++++
Documentation/public-inbox-purge.pod | 14 ++++++++++++++
Documentation/public-inbox-xcpdb.pod | 15 +++++++++++++--
3 files changed, 41 insertions(+), 2 deletions(-)
diff --git a/Documentation/public-inbox-edit.pod b/Documentation/public-inbox-edit.pod
index c64b4da8..3853fa9c 100644
--- a/Documentation/public-inbox-edit.pod
+++ b/Documentation/public-inbox-edit.pod
@@ -91,6 +91,20 @@ See L<public-inbox-config(5)>
Only L<v2|public-inbox-v2-format(5)> repositories are supported.
+This is safe to run while normal inbox writing tools
+(L<public-inbox-mda(1)>, L<public-inbox-watch(1)>,
+L<public-inbox-learn(1)>) are active.
+
+Running this in parallel with L<public-inbox-xcpdb(1)> or
+C<"public-inbox-index --reindex"> can lead to errors or
+edited data remaining indexed.
+
+Incremental L<public-inbox-index(1)> (without C<--reindex>)
+is fine.
+
+Keep in mind this is a last resort, as it will be distruptive
+to anyone using L<git(1)> to mirror the inbox being edited.
+
=head1 CONTACT
Feedback welcome via plain-text mail to L<mailto:meta@public-inbox.org>
diff --git a/Documentation/public-inbox-purge.pod b/Documentation/public-inbox-purge.pod
index 56a3f95a..e20e18df 100644
--- a/Documentation/public-inbox-purge.pod
+++ b/Documentation/public-inbox-purge.pod
@@ -51,6 +51,20 @@ See L<public-inbox-config(5)>
Only L<public-inbox-v2-format(5)> inboxes are supported.
+This is safe to run while normal inbox writing tools
+(L<public-inbox-mda(1)>, L<public-inbox-watch(1)>,
+L<public-inbox-learn(1)>) are active.
+
+Running this in parallel with L<public-inbox-xcpdb(1)> or
+C<"public-inbox-index --reindex"> can lead to errors or
+purged data remaining indexed.
+
+Incremental L<public-inbox-index(1)> (without C<--reindex>)
+is fine.
+
+Keep in mind this is a last resort, as it will be distruptive
+to anyone using L<git(1)> to mirror the inbox being purged.
+
=head1 CONTACT
Feedback welcome via plain-text mail to L<mailto:meta@public-inbox.org>
diff --git a/Documentation/public-inbox-xcpdb.pod b/Documentation/public-inbox-xcpdb.pod
index 89eed079..2ed4c582 100644
--- a/Documentation/public-inbox-xcpdb.pod
+++ b/Documentation/public-inbox-xcpdb.pod
@@ -11,8 +11,9 @@ public-inbox-xcpdb - upgrade Xapian DB formats
public-inbox-xcpdb is similar to L<copydatabase(1)> for
upgrading to the latest database format supported by Xapian
(e.g. "glass" or "honey"), but is designed to tolerate and
-recover from Xapian database modifications from
-L<public-inbox-watch(1)> or L<public-inbox-mda(1)>.
+accept parallel Xapian database modifications from
+L<public-inbox-watch(1)>, L<public-inbox-mda(1)>,
+L<public-inbox-learn(1)>, and L<public-inbox-index(1)>.
=head1 OPTIONS
@@ -80,6 +81,16 @@ used by public-inbox, NOT users upgrading public-inbox itself.
In particular, it DOES NOT upgrade the schema used by the
PSGI search interface (see L<public-inbox-index(1)>).
+=head1 LIMITATIONS
+
+Do not use L<public-inbox-purge(1)> or L<public-inbox-edit(1)>
+while this is running; old (purged or edited data) may show up.
+
+Normal invocations L<public-inbox-index(1)> can safely run
+while this is running, too. However, reindexing via the
+L<public-inbox-index(1)/--reindex> switch will be a waste of
+computing resources.
+
=head1 CONTACT
Feedback welcome via plain-text mail to L<mailto:meta@public-inbox.org>
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH 05/14] index+xcpdb: improve SIG{INT,TERM,HUP,PIPE} behavior
2020-08-10 2:11 [PATCH 00/14] more indexing related improvements Eric Wong
` (3 preceding siblings ...)
2020-08-10 2:11 ` [PATCH 04/14] doc: add some notes around -xcpdb / -edit / -purge Eric Wong
@ 2020-08-10 2:11 ` Eric Wong
2020-08-10 2:11 ` [PATCH 06/14] msgmap: tmp_clone: simplify + meaningful filename Eric Wong
` (8 subsequent siblings)
13 siblings, 0 replies; 18+ messages in thread
From: Eric Wong @ 2020-08-10 2:11 UTC (permalink / raw)
To: meta
-index now invokes ->DESTROY like xcpdb does, which is necessary
to cleanup $INBOX_DIR/msgmap-XXXXXXX files. We'll also exit
with the expected values for various signals by adding 128
as described in <https://www.tldp.org/LDP/abs/html/exitcodes.html>
-xcpdb now terminates worker processes and xapian-compact(1)
invocations when prematurely killed, too.
---
lib/PublicInbox/Admin.pm | 29 +++++++++++++++++++++++++----
lib/PublicInbox/Xapcmd.pm | 28 +++++++++++++++++++---------
2 files changed, 44 insertions(+), 13 deletions(-)
diff --git a/lib/PublicInbox/Admin.pm b/lib/PublicInbox/Admin.pm
index e42b01e0..af2b3da9 100644
--- a/lib/PublicInbox/Admin.pm
+++ b/lib/PublicInbox/Admin.pm
@@ -5,14 +5,28 @@
# Unstable internal API
package PublicInbox::Admin;
use strict;
-use warnings;
-use Cwd 'abs_path';
-use base qw(Exporter);
-our @EXPORT_OK = qw(resolve_repo_dir);
+use parent qw(Exporter);
+use Cwd qw(abs_path);
+use POSIX ();
+our @EXPORT_OK = qw(resolve_repo_dir setup_signals);
use PublicInbox::Config;
use PublicInbox::Inbox;
use PublicInbox::Spawn qw(popen_rd);
+sub setup_signals {
+ my ($cb, $arg) = @_; # optional
+
+ # we call exit() here instead of _exit() so DESTROY methods
+ # get called (e.g. File::Temp::Dir and PublicInbox::Msgmap)
+ $SIG{INT} = $SIG{HUP} = $SIG{PIPE} = $SIG{TERM} = sub {
+ my ($sig) = @_;
+ # https://www.tldp.org/LDP/abs/html/exitcodes.html
+ eval { $cb->($sig, $arg) } if $cb;
+ $sig = 'SIG'.$sig;
+ exit(128 + POSIX->$sig);
+ };
+}
+
sub resolve_repo_dir {
my ($cd, $ver) = @_;
my $prefix = defined $cd ? $cd : './';
@@ -185,9 +199,16 @@ invalid indexlevel=$indexlevel (must be `basic', `medium', or `full')
die missing_mod_msg($err) ." required for indexlevel=$indexlevel\n";
}
+sub index_terminate {
+ my (undef, $ibx) = @_; # $_[0] = signal name
+ $ibx->git->cleanup;
+}
+
sub index_inbox {
my ($ibx, $im, $opt) = @_;
my $jobs = delete $opt->{jobs} if $opt;
+ local %SIG = %SIG;
+ setup_signals(\&index_terminate, $ibx);
if (ref($ibx) && $ibx->version == 2) {
eval { require PublicInbox::V2Writable };
die "v2 requirements not met: $@\n" if $@;
diff --git a/lib/PublicInbox/Xapcmd.pm b/lib/PublicInbox/Xapcmd.pm
index 714f6859..348621ce 100644
--- a/lib/PublicInbox/Xapcmd.pm
+++ b/lib/PublicInbox/Xapcmd.pm
@@ -2,8 +2,8 @@
# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
package PublicInbox::Xapcmd;
use strict;
-use warnings;
use PublicInbox::Spawn qw(which popen_rd nodatacow_dir);
+use PublicInbox::Admin qw(setup_signals);
use PublicInbox::Over;
use PublicInbox::SearchIdx;
use File::Temp 0.19 (); # ->newdir
@@ -126,6 +126,11 @@ sub same_fs_or_die ($$) {
die "$x and $y reside on different filesystems\n";
}
+sub kill_pids {
+ my ($sig, $pids) = @_;
+ kill($sig, keys %$pids); # pids may be empty
+}
+
sub process_queue {
my ($queue, $cb, $opt) = @_;
my $max = $opt->{jobs} // scalar(@$queue);
@@ -138,6 +143,8 @@ sub process_queue {
# run in parallel:
my %pids;
+ local %SIG = %SIG;
+ setup_signals(\&kill_pids, \%pids);
while (@$queue) {
while (scalar(keys(%pids)) < $max && scalar(@$queue)) {
my $args = shift @$queue;
@@ -156,12 +163,6 @@ sub process_queue {
}
}
-sub setup_signals () {
- # http://www.tldp.org/LDP/abs/html/exitcodes.html
- $SIG{INT} = sub { exit(130) };
- $SIG{HUP} = $SIG{PIPE} = $SIG{TERM} = sub { exit(1) };
-}
-
sub prepare_run {
my ($ibx, $opt) = @_;
my $tmp = {}; # old shard dir => File::Temp->newdir object or undef
@@ -294,6 +295,11 @@ sub progress_pfx ($) {
($p[-1] =~ /\A([0-9]+)/) ? "$p[-2]/$1" : $p[-1];
}
+sub kill_compact { # setup_signals callback
+ my ($sig, $pidref) = @_;
+ kill($sig, $$pidref) if defined($$pidref);
+}
+
# xapian-compact wrapper
sub compact ($$) {
my ($args, $opt) = @_;
@@ -319,14 +325,18 @@ sub compact ($$) {
}
$pr->("$pfx `".join(' ', @$cmd)."'\n") if $pr;
push @$cmd, $src, $dst;
- my $rd = popen_rd($cmd, undef, $rdr);
+ my ($rd, $pid);
+ local %SIG = %SIG;
+ setup_signals(\&kill_compact, \$pid);
+ ($rd, $pid) = popen_rd($cmd, undef, $rdr);
while (<$rd>) {
if ($pr) {
s/\r/\r$pfx /g;
$pr->("$pfx $_");
}
}
- close $rd or die join(' ', @$cmd)." failed: $?n";
+ waitpid($pid, 0);
+ die "@$cmd failed: \$?=$?\n" if $?;
}
sub cpdb_loop ($$$;$$) {
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH 06/14] msgmap: tmp_clone: simplify + meaningful filename
2020-08-10 2:11 [PATCH 00/14] more indexing related improvements Eric Wong
` (4 preceding siblings ...)
2020-08-10 2:11 ` [PATCH 05/14] index+xcpdb: improve SIG{INT,TERM,HUP,PIPE} behavior Eric Wong
@ 2020-08-10 2:11 ` Eric Wong
2020-08-10 2:11 ` [PATCH 07/14] avoid File::Temp::tempfile in more places Eric Wong
` (7 subsequent siblings)
13 siblings, 0 replies; 18+ messages in thread
From: Eric Wong @ 2020-08-10 2:11 UTC (permalink / raw)
To: meta
Trying to use the newer ->sqlite_backup_to_dbh method doesn't
seem worth it, as we'll have to support DBD::SQLite <= 1.60
another decade or more.
Dumping 'msgmap-XXXXXXX' into $INBOX_DIR can appear a bit
confusing to users, so give it a "mm_tmp-$PID-XXXXXXXX" name
to emphasize it's a temporary file tied to a given PID.
We also don't want to penalize read-only daemons with
loading File::Temp, so do it lazily.
---
lib/PublicInbox/Msgmap.pm | 19 ++++++-------------
1 file changed, 6 insertions(+), 13 deletions(-)
diff --git a/lib/PublicInbox/Msgmap.pm b/lib/PublicInbox/Msgmap.pm
index e7f7e2c9..7290959d 100644
--- a/lib/PublicInbox/Msgmap.pm
+++ b/lib/PublicInbox/Msgmap.pm
@@ -9,10 +9,8 @@
# This is maintained by ::SearchIdx
package PublicInbox::Msgmap;
use strict;
-use warnings;
use DBI;
use DBD::SQLite;
-use File::Temp qw(tempfile);
use PublicInbox::Over;
use PublicInbox::Spawn;
@@ -50,18 +48,13 @@ sub new_file {
# used to keep track of used numeric mappings for v2 reindex
sub tmp_clone {
my ($self, $dir) = @_;
- my ($fh, $fn) = tempfile('msgmap-XXXXXXXX', EXLOCK => 0, DIR => $dir);
+ require File::Temp;
+ my $tmp = "mm_tmp-$$-XXXXXX";
+ my ($fh, $fn) = File::Temp::tempfile($tmp, EXLOCK => 0, DIR => $dir);
PublicInbox::Spawn::nodatacow_fd(fileno($fh));
- my $tmp;
- if ($self->{dbh}->can('sqlite_backup_to_dbh')) {
- $tmp = ref($self)->new_file($fn, 2);
- $tmp->{dbh}->do('PRAGMA journal_mode = MEMORY');
- $self->{dbh}->sqlite_backup_to_dbh($tmp->{dbh});
- } else { # DBD::SQLite <= 1.61_01
- $self->{dbh}->sqlite_backup_to_file($fn);
- $tmp = ref($self)->new_file($fn, 2);
- $tmp->{dbh}->do('PRAGMA journal_mode = MEMORY');
- }
+ $self->{dbh}->sqlite_backup_to_file($fn);
+ $tmp = ref($self)->new_file($fn, 2);
+ $tmp->{dbh}->do('PRAGMA journal_mode = MEMORY');
$tmp->{pid} = $$;
$tmp;
}
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH 07/14] avoid File::Temp::tempfile in more places
2020-08-10 2:11 [PATCH 00/14] more indexing related improvements Eric Wong
` (5 preceding siblings ...)
2020-08-10 2:11 ` [PATCH 06/14] msgmap: tmp_clone: simplify + meaningful filename Eric Wong
@ 2020-08-10 2:11 ` Eric Wong
2020-08-10 2:11 ` [PATCH 08/14] admin: use a generic veriable name Eric Wong
` (6 subsequent siblings)
13 siblings, 0 replies; 18+ messages in thread
From: Eric Wong @ 2020-08-10 2:11 UTC (permalink / raw)
To: meta
We can use open(..., undef) natively in Perl in t/import.t
In places where we need a pathname, the File::Temp OO API
gives us auto-unlinking for free.
---
lib/PublicInbox/V2Writable.pm | 17 +++++++++--------
script/public-inbox-init | 9 ++++-----
t/import.t | 5 ++---
3 files changed, 15 insertions(+), 16 deletions(-)
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index 0b527f18..93646e57 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -20,7 +20,7 @@ use PublicInbox::Msgmap;
use PublicInbox::Spawn qw(spawn popen_rd);
use PublicInbox::SearchIdx qw(log2stack crlf_adjust is_ancestor check_size);
use IO::Handle; # ->autoflush
-use File::Temp qw(tempfile);
+use File::Temp ();
my $OID = qr/[a-f0-9]{40,}/;
# an estimate of the post-packed size to the raw uncompressed size
@@ -733,12 +733,14 @@ sub fill_alternates ($$) {
}
return unless $new;
- my ($fh, $tmp) = tempfile('alt-XXXXXXXX', DIR => $info_dir);
+ my $fh = File::Temp->new(TEMPLATE => 'alt-XXXXXXXX', DIR => $info_dir);
+ my $tmp = $fh->filename;
print $fh join("\n", sort { $alt{$b} <=> $alt{$a} } keys %alt), "\n"
or die "print $tmp: $!\n";
chmod($mode, $fh) or die "fchmod $tmp: $!\n";
close $fh or die "close $tmp $!\n";
rename($tmp, $alt) or die "rename $tmp => $alt: $!\n";
+ $fh->unlink_on_destroy(0);
}
sub git_init {
@@ -819,18 +821,17 @@ sub import_init {
sub diff ($$$) {
my ($mid, $cur, $new) = @_;
- my ($ah, $an) = tempfile('email-cur-XXXXXXXX', TMPDIR => 1);
+ my $ah = File::Temp->new(TEMPLATE => 'email-cur-XXXXXXXX', TMPDIR => 1);
print $ah $cur->as_string or die "print: $!";
- close $ah or die "close: $!";
- my ($bh, $bn) = tempfile('email-new-XXXXXXXX', TMPDIR => 1);
+ $ah->flush or die "flush: $!";
PublicInbox::Import::drop_unwanted_headers($new);
+ my $bh = File::Temp->new(TEMPLATE => 'email-new-XXXXXXXX', TMPDIR => 1);
print $bh $new->as_string or die "print: $!";
- close $bh or die "close: $!";
- my $cmd = [ qw(diff -u), $an, $bn ];
+ $bh->flush or die "flush: $!";
+ my $cmd = [ qw(diff -u), $ah->filename, $bh->filename ];
print STDERR "# MID conflict <$mid>\n";
my $pid = spawn($cmd, undef, { 1 => 2 });
waitpid($pid, 0) == $pid or die "diff did not finish";
- unlink($an, $bn);
}
sub get_blob ($$) {
diff --git a/script/public-inbox-init b/script/public-inbox-init
index b8d71f35..6a959db7 100755
--- a/script/public-inbox-init
+++ b/script/public-inbox-init
@@ -17,7 +17,7 @@ PublicInbox::Admin::require_or_die('-base');
use PublicInbox::Config;
use PublicInbox::InboxWritable;
use PublicInbox::Import;
-use File::Temp qw/tempfile/;
+use File::Temp;
use PublicInbox::Lock;
use File::Basename qw/dirname/;
use File::Path qw/mkpath/;
@@ -52,8 +52,7 @@ my $lock_obj = { lock_path => "$pi_config.flock" };
PublicInbox::Lock::lock_acquire($lock_obj);
# git-config will operate on this (and rename on success):
-my ($fh, $pi_config_tmp) = tempfile('pi-init-XXXXXXXX', DIR => $dir);
-my $cfg_tmp = UnlinkMe->new($pi_config_tmp);
+my $fh = File::Temp->new(TEMPLATE => 'pi-init-XXXXXXXX', DIR => $dir);
# Now, we grab another lock to use git-config(1) locking, so it won't
# wait on the lock, unlike some of our internal flock()-based locks.
@@ -110,7 +109,8 @@ if (-e $pi_config) {
}
}
}
-close $fh or die "failed to close $pi_config_tmp: $!\n";
+my $pi_config_tmp = $fh->filename;
+close($fh) or die "failed to close $pi_config_tmp: $!\n";
my $pfx = "publicinbox.$name";
my @x = (qw/git config/, "--file=$pi_config_tmp");
@@ -177,7 +177,6 @@ if (defined $perm) {
rename $pi_config_tmp, $pi_config or
die "failed to rename `$pi_config_tmp' to `$pi_config': $!\n";
-delete $cfg_tmp->{file};
$auto_unlink->DESTROY;
package UnlinkMe;
diff --git a/t/import.t b/t/import.t
index 440e8994..9a88416f 100644
--- a/t/import.t
+++ b/t/import.t
@@ -9,7 +9,6 @@ use PublicInbox::Git;
use PublicInbox::Import;
use PublicInbox::Spawn qw(spawn);
use Fcntl qw(:DEFAULT SEEK_SET);
-use File::Temp qw/tempfile/;
use PublicInbox::TestCommon;
use MIME::Base64 3.05; # Perl 5.10.0 / 5.9.2
my ($dir, $for_destroy) = tmpdir();
@@ -37,11 +36,11 @@ if ($v2) {
is($mime->as_string, $$raw_email, 'string matches');
is($smsg->{raw_bytes}, length($$raw_email), 'length matches');
my @cmd = ('git', "--git-dir=$git->{git_dir}", qw(hash-object --stdin));
- my $in = tempfile();
+ open my $in, '+<', undef or BAIL_OUT "open(+<): $!";
print $in $mime->as_string or die "write failed: $!";
$in->flush or die "flush failed: $!";
seek($in, 0, SEEK_SET);
- my $out = tempfile();
+ open my $out, '+<', undef or BAIL_OUT "open(+<): $!";
my $pid = spawn(\@cmd, {}, { 0 => $in, 1 => $out });
is(waitpid($pid, 0), $pid, 'waitpid succeeds on hash-object');
is($?, 0, 'hash-object');
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH 08/14] admin: use a generic veriable name
2020-08-10 2:11 [PATCH 00/14] more indexing related improvements Eric Wong
` (6 preceding siblings ...)
2020-08-10 2:11 ` [PATCH 07/14] avoid File::Temp::tempfile in more places Eric Wong
@ 2020-08-10 2:11 ` Eric Wong
2020-08-10 2:38 ` Kyle Meyer
2020-08-10 2:12 ` [PATCH 09/14] index: cleanup internal variables Eric Wong
` (5 subsequent siblings)
13 siblings, 1 reply; 18+ messages in thread
From: Eric Wong @ 2020-08-10 2:11 UTC (permalink / raw)
To: meta
We parse other options, too, not just --max-size
---
lib/PublicInbox/Admin.pm | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/lib/PublicInbox/Admin.pm b/lib/PublicInbox/Admin.pm
index af2b3da9..8a9a81c9 100644
--- a/lib/PublicInbox/Admin.pm
+++ b/lib/PublicInbox/Admin.pm
@@ -256,12 +256,12 @@ sub progress_prepare ($) {
# same unit factors as git:
sub parse_unsigned ($) {
- my ($max_size) = @_;
+ my ($val) = @_;
- $$max_size =~ /\A([0-9]+)([kmg])?\z/i or return;
+ $$val =~ /\A([0-9]+)([kmg])?\z/i or return;
my ($n, $unit_factor) = ($1, $2 // '');
my %u = ( k => 1024, m => 1024**2, g => 1024**3 );
- $$max_size = $n * ($u{lc($unit_factor)} // 1);
+ $$val = $n * ($u{lc($unit_factor)} // 1);
1;
}
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH 09/14] index: cleanup internal variables
2020-08-10 2:11 [PATCH 00/14] more indexing related improvements Eric Wong
` (7 preceding siblings ...)
2020-08-10 2:11 ` [PATCH 08/14] admin: use a generic veriable name Eric Wong
@ 2020-08-10 2:12 ` Eric Wong
2020-08-10 2:12 ` [PATCH 10/14] searchidx: use singular `$opt' for consistency with v2 Eric Wong
` (4 subsequent siblings)
13 siblings, 0 replies; 18+ messages in thread
From: Eric Wong @ 2020-08-10 2:12 UTC (permalink / raw)
To: meta
Move away from hard-to-read alllowercase naming and favor
snake_case or separated-by-dashes.
We'll keep `--indexlevel' as-is for now, since it's been around
for several releases; but we'll support `--index-level' in the
CLI and update our documentation in a few months.
We'll also clarify that publicInbox.indexMaxSize is only
intended for -index, and not -watch or -mda.
---
Documentation/public-inbox-index.pod | 6 ++++-
Documentation/public-inbox-init.pod | 2 +-
lib/PublicInbox/SearchIdx.pm | 14 +++++------
lib/PublicInbox/V2Writable.pm | 26 +++++++++------------
script/public-inbox-index | 35 ++++++++++++++--------------
script/public-inbox-init | 8 ++-----
6 files changed, 43 insertions(+), 48 deletions(-)
diff --git a/Documentation/public-inbox-index.pod b/Documentation/public-inbox-index.pod
index 3ae3b008..7a97432e 100644
--- a/Documentation/public-inbox-index.pod
+++ b/Documentation/public-inbox-index.pod
@@ -158,7 +158,11 @@ value. A single suffix modifier of C<k>, C<m> or C<g> is
supported, thus the value of C<1m> to prevents indexing of
messages larger than one megabyte.
-This is useful for avoiding memory exhaustion in mirrors.
+This is useful for avoiding memory exhaustion in mirrors
+via git. It does not prevent L<public-inbox-mda(1)> or
+L<public-inbox-watch(1)> from importing (and indexing)
+a message.
+
This option is only available in public-inbox 1.5 or later.
Default: none
diff --git a/Documentation/public-inbox-init.pod b/Documentation/public-inbox-init.pod
index fd9fc637..d0c87563 100644
--- a/Documentation/public-inbox-init.pod
+++ b/Documentation/public-inbox-init.pod
@@ -31,7 +31,7 @@ L<DBD::SQLite>.
Default: C<1>
-=item --indexlevel <basic|medium|full>
+=item -L, --indexlevel <basic|medium|full>
Controls the indexing level for L<public-inbox-index(1)>
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 1cf3e66c..7f2447fe 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -67,7 +67,6 @@ sub new {
my $dir = $self->xdir;
$self->{over} = PublicInbox::OverIdx->new("$dir/over.sqlite3");
$self->{over}->{-no_fsync} = 1 if $ibx->{-no_fsync};
- $self->{index_max_size} = $ibx->{index_max_size};
} elsif ($version == 2) {
defined $shard or die "shard is required for v2\n";
# shard is a number
@@ -553,10 +552,10 @@ sub index_sync {
sub check_size { # check_async cb for -index --max-size=...
my ($oid, $type, $size, $arg, $git) = @_;
(($type // '') eq 'blob') or die "E: bad $oid in $git->{git_dir}";
- if ($size <= $arg->{index_max_size}) {
+ if ($size <= $arg->{max_size}) {
$git->cat_async($oid, $arg->{index_oid}, $arg);
} else {
- warn "W: skipping $oid ($size > $arg->{index_max_size})\n";
+ warn "W: skipping $oid ($size > $arg->{max_size})\n";
}
}
@@ -573,7 +572,7 @@ sub v1_checkpoint ($$;$) {
$self->{mm}->last_commit($newest);
}
} else {
- ${$sync->{max}} = $BATCH_BYTES;
+ ${$sync->{max}} = $self->{batch_bytes};
}
$self->{mm}->{dbh}->commit;
@@ -603,7 +602,7 @@ sub v1_checkpoint ($$;$) {
sub process_stack {
my ($self, $sync, $stk) = @_;
my $git = $self->{ibx}->git;
- my $max = $BATCH_BYTES;
+ my $max = $self->{batch_bytes};
my $nr = 0;
$sync->{nr} = \$nr;
$sync->{max} = \$max;
@@ -617,13 +616,13 @@ sub process_stack {
$git->cat_async($oid, \&unindex_both, $self);
}
}
- if ($sync->{index_max_size} = $self->{ibx}->{index_max_size}) {
+ if ($sync->{max_size} = $sync->{-opt}->{max_size}) {
$sync->{index_oid} = \&index_both;
}
while (my ($f, $at, $ct, $oid) = $stk->pop_rec) {
if ($f eq 'm') {
my $arg = { %$sync, autime => $at, cotime => $ct };
- if ($sync->{index_max_size}) {
+ if ($sync->{max_size}) {
$git->check_async($oid, \&check_size, $arg);
} else {
$git->cat_async($oid, \&index_both, $arg);
@@ -749,6 +748,7 @@ sub _index_sync {
my ($self, $opts) = @_;
my $tip = $opts->{ref} || 'HEAD';
my $git = $self->{ibx}->git;
+ $self->{batch_bytes} = $opts->{batch_size} // $BATCH_BYTES;
$git->batch_prepare;
my $pr = $opts->{-progress};
my $sync = { reindex => $opts->{reindex}, -opt => $opts };
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index 93646e57..8e36b92c 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -152,12 +152,6 @@ sub add {
$self->{ibx}->with_umask(\&_add, $self, $eml, $check_cb);
}
-sub batch_bytes ($) {
- my ($self) = @_;
- ($self->{parallel} ? $self->{shards} : 1) *
- $PublicInbox::SearchIdx::BATCH_BYTES;
-}
-
# indexes a message, returns true if checkpointing is needed
sub do_idx ($$$$) {
my ($self, $msgref, $mime, $smsg) = @_;
@@ -166,7 +160,7 @@ sub do_idx ($$$$) {
my $idx = idx_shard($self, $smsg->{num} % $self->{shards});
$idx->index_raw($msgref, $mime, $smsg);
my $n = $self->{transact_bytes} += $smsg->{raw_bytes};
- $n >= batch_bytes($self);
+ $n >= $self->{batch_bytes};
}
sub _add {
@@ -287,6 +281,9 @@ sub _idx_init { # with_umask callback
# xcpdb can change shard count while -watch is idle
my $nshards = count_shards($self);
$self->{shards} = $nshards if $nshards && $nshards != $self->{shards};
+ $self->{batch_bytes} = $opt->{batch_size} //
+ $PublicInbox::SearchIdx::BATCH_BYTES;
+ $self->{batch_bytes} *= $self->{shards} if $self->{parallel};
# need to create all shards before initializing msgmap FD
# idx_shards must be visible to all forked processes
@@ -891,7 +888,7 @@ sub reindex_checkpoint ($$) {
}
# allow -watch or -mda to write...
- $self->idx_init; # reacquire lock
+ $self->idx_init($sync->{-opt}); # reacquire lock
$mm_tmp->atfork_parent if $mm_tmp;
}
@@ -1208,12 +1205,11 @@ sub index_xap_step ($$$;$) {
$pr->("Xapian indexlevel=$ibx->{indexlevel} ".
"$beg..$end (% $step)\n");
}
- my $batch_bytes = batch_bytes($self);
for (my $num = $beg; $num <= $end; $num += $step) {
my $smsg = $ibx->over->get_art($num) or next;
$smsg->{v2w} = $self;
$ibx->git->cat_async($smsg->{blob}, \&index_xap_only, $smsg);
- if ($self->{transact_bytes} >= $batch_bytes) {
+ if ($self->{transact_bytes} >= $self->{batch_bytes}) {
${$sync->{nr}} = $num;
reindex_checkpoint($self, $sync);
}
@@ -1236,7 +1232,7 @@ sub index_epoch ($$$) {
$self->{current_info} = "$i.git $oid";
if ($f eq 'm') {
my $arg = { %$sync, autime => $at, cotime => $ct };
- if ($sync->{index_max_size}) {
+ if ($sync->{max_size}) {
$all->check_async($oid, \&check_size, $arg);
} else {
$all->cat_async($oid, \&index_oid, $arg);
@@ -1255,7 +1251,7 @@ sub index_epoch ($$$) {
sub xapian_only {
my ($self, $opt, $sync, $art_beg) = @_;
- my $seq = $opt->{sequentialshard};
+ my $seq = $opt->{sequential_shard};
$art_beg //= 0;
local $self->{parallel} = 0 if $seq;
$self->idx_init($opt); # acquire lock
@@ -1285,14 +1281,14 @@ sub xapian_only {
sub index_sync {
my ($self, $opt) = @_;
$opt //= $_[1] //= {};
- goto \&xapian_only if $opt->{xapianonly};
+ goto \&xapian_only if $opt->{xapian_only};
my $pr = $opt->{-progress};
my $epoch_max;
my $latest = git_dir_latest($self, \$epoch_max);
return unless defined $latest;
- my $seq = $opt->{sequentialshard};
+ my $seq = $opt->{sequential_shard};
my $art_beg; # the NNTP article number we start xapian_only at
my $idxlevel = $self->{ibx}->{indexlevel};
local $self->{ibx}->{indexlevel} = 'basic' if $seq;
@@ -1324,7 +1320,7 @@ sub index_sync {
$art_beg++ if defined($art_beg);
}
}
- if ($sync->{index_max_size} = $self->{ibx}->{index_max_size}) {
+ if ($sync->{max_size} = $opt->{max_size}) {
$sync->{index_oid} = \&index_oid;
}
# work forwards through history
diff --git a/script/public-inbox-index b/script/public-inbox-index
index 9e0907be..b1d29ec1 100755
--- a/script/public-inbox-index
+++ b/script/public-inbox-index
@@ -17,7 +17,7 @@ usage: $usage
options:
--no-fsync speed up indexing, risk corruption on power outage
- --indexlevel=LEVEL `basic', 'medium', or `full' (default: full)
+ -L LEVEL `basic', `medium', or `full' (default: full)
--compact | -c run public-inbox-compact(1) after indexing
--sequential-shard index Xapian shards sequentially for slow storage
--jobs=NUM set or disable parallelization (NUM=0)
@@ -33,16 +33,17 @@ BYTES may use `k', `m', and `g' suffixes (e.g. `10m' for 10 megabytes)
See public-inbox-index(1) man page for full documentation.
EOF
my $compact_opt;
-my $opt = { quiet => -1, compact => 0, maxsize => undef, fsync => 1 };
+my $opt = { quiet => -1, compact => 0, max_size => undef, fsync => 1 };
GetOptions($opt, qw(verbose|v+ reindex rethread compact|c+ jobs|j=i prune
- fsync|sync! xapianonly|xapian-only
- indexlevel|L=s maxsize|max-size=s batchsize|batch-size=s
- sequentialshard|seq-shard|sequential-shard
+ fsync|sync! xapian_only|xapian-only
+ indexlevel|index-level|L=s max_size|max-size=s
+ batch_size|batch-size=s
+ sequential_shard|seq-shard|sequential-shard
help|?))
or die "bad command-line args\n$usage";
if ($opt->{help}) { print $help; exit 0 };
die "--jobs must be >= 0\n" if defined $opt->{jobs} && $opt->{jobs} < 0;
-if ($opt->{xapianonly} && !$opt->{reindex}) {
+if ($opt->{xapian_only} && !$opt->{reindex}) {
die "--xapian-only requires --reindex\n";
}
@@ -64,40 +65,38 @@ my @ibxs = PublicInbox::Admin::resolve_inboxes(\@ARGV, undef, $cfg);
PublicInbox::Admin::require_or_die('-index');
unless (@ibxs) { print STDERR "Usage: $usage\n"; exit 1 }
-my $max_size = $opt->{maxsize} // $cfg->{lc('publicInbox.indexMaxSize')};
+my $max_size = $opt->{max_size} // $cfg->{lc('publicInbox.indexMaxSize')};
if (defined $max_size) {
PublicInbox::Admin::parse_unsigned(\$max_size) or
die "`publicInbox.indexMaxSize=$max_size' not parsed\n";
+ $opt->{max_size} = $max_size;
}
-my $bs = $opt->{batchsize} // $cfg->{lc('publicInbox.indexBatchSize')};
+my $bs = $opt->{batch_size} // $cfg->{lc('publicInbox.indexBatchSize')};
if (defined $bs) {
PublicInbox::Admin::parse_unsigned(\$bs) or
die "`publicInbox.indexBatchSize=$bs' not parsed\n";
+ $opt->{batch_size} = $bs;
}
-no warnings 'once';
-local $PublicInbox::SearchIdx::BATCH_BYTES = $bs if defined($bs);
-use warnings 'once';
# out-of-the-box builds of Xapian 1.4.x are still limited to 32-bit
# https://getting-started-with-xapian.readthedocs.io/en/latest/concepts/indexing/limitations.html
local $ENV{XAPIAN_FLUSH_THRESHOLD} ||= '4294967295' if defined($bs);
-my $s = $opt->{sequentialshard} //
+my $s = $opt->{sequential_shard} //
$cfg->{lc('publicInbox.indexSequentialShard')};
if (defined $s) {
my $v = $cfg->git_bool($s);
defined($v) or
die "`publicInbox.indexSequentialShard=$s' not boolean\n";
- $opt->{sequentialshard} = $v;
+ $opt->{sequential_shard} = $v;
}
my $mods = {};
foreach my $ibx (@ibxs) {
# XXX: users can shoot themselves in the foot, with opt->{indexlevel}
- $ibx->{indexlevel} //= $opt->{indexlevel} // ($opt->{xapianonly} ?
+ $ibx->{indexlevel} //= $opt->{indexlevel} // ($opt->{xapian_only} ?
'full' : PublicInbox::Admin::detect_indexlevel($ibx));
- $ibx->{index_max_size} = $max_size;
PublicInbox::Admin::scan_ibx_modules($mods, $ibx);
}
@@ -112,15 +111,15 @@ for my $ibx (@ibxs) {
$ibx->{-no_fsync} = 1 if !$opt->{fsync};
my $ibx_opt = $opt;
- if (defined(my $s = $ibx->{indexsequentialshard})) {
+ if (defined(my $s = $ibx->{lc('indexSequentialShard')})) {
defined(my $v = $cfg->git_bool($s)) or die <<EOL;
publicInbox.$ibx->{name}.indexSequentialShard not boolean
EOL
- $ibx_opt = { %$opt, sequentialshard => $v };
+ $ibx_opt = { %$opt, sequential_shard => $v };
}
PublicInbox::Admin::index_inbox($ibx, undef, $ibx_opt);
if ($compact_opt) {
- local $compact_opt->{jobs} = 0 if $ibx_opt->{sequentialshard};
+ local $compact_opt->{jobs} = 0 if $ibx_opt->{sequential_shard};
PublicInbox::Xapcmd::run($ibx, 'compact', $compact_opt);
}
}
diff --git a/script/public-inbox-init b/script/public-inbox-init
index 6a959db7..1c8066df 100755
--- a/script/public-inbox-init
+++ b/script/public-inbox-init
@@ -27,7 +27,7 @@ use Cwd qw/abs_path/;
my ($version, $indexlevel, $skip_epoch, $skip_artnum, $jobs);
my %opts = (
'V|version=i' => \$version,
- 'L|indexlevel=s' => \$indexlevel,
+ 'L|index-level|indexlevel=s' => \$indexlevel,
'S|skip|skip-epoch=i' => \$skip_epoch,
'N|skip-artnum=i' => \$skip_artnum,
'j|jobs=i' => \$jobs,
@@ -103,11 +103,7 @@ if (-e $pi_config) {
exit(1) if $conflict;
my $ibx = $cfg->lookup_name($name);
- if ($ibx) {
- if (!defined($indexlevel) && $ibx->{indexlevel}) {
- $indexlevel = $ibx->{indexlevel};
- }
- }
+ $indexlevel //= $ibx->{indexlevel} if $ibx;
}
my $pi_config_tmp = $fh->filename;
close($fh) or die "failed to close $pi_config_tmp: $!\n";
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH 10/14] searchidx: use singular `$opt' for consistency with v2
2020-08-10 2:11 [PATCH 00/14] more indexing related improvements Eric Wong
` (8 preceding siblings ...)
2020-08-10 2:12 ` [PATCH 09/14] index: cleanup internal variables Eric Wong
@ 2020-08-10 2:12 ` Eric Wong
2020-08-10 2:12 ` [PATCH 11/14] convert: support new -index options Eric Wong
` (3 subsequent siblings)
13 siblings, 0 replies; 18+ messages in thread
From: Eric Wong @ 2020-08-10 2:12 UTC (permalink / raw)
To: meta
The rest of our indexing code uses `$opt' instead of `$opts'.
---
lib/PublicInbox/SearchIdx.pm | 22 +++++++++++-----------
1 file changed, 11 insertions(+), 11 deletions(-)
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 7f2447fe..5c39f3d6 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -539,11 +539,11 @@ sub unindex_both { # git->cat_async callback
# called by public-inbox-index
sub index_sync {
- my ($self, $opts) = @_;
- delete $self->{lock_path} if $opts->{-skip_lock};
- $self->{ibx}->with_umask(\&_index_sync, $self, $opts);
- if ($opts->{reindex}) {
- my %again = %$opts;
+ my ($self, $opt) = @_;
+ delete $self->{lock_path} if $opt->{-skip_lock};
+ $self->{ibx}->with_umask(\&_index_sync, $self, $opt);
+ if ($opt->{reindex}) {
+ my %again = %$opt;
delete @again{qw(rethread reindex)};
index_sync($self, \%again);
}
@@ -745,15 +745,15 @@ sub reindex_from ($$) {
# indexes all unindexed messages (v1 only)
sub _index_sync {
- my ($self, $opts) = @_;
- my $tip = $opts->{ref} || 'HEAD';
+ my ($self, $opt) = @_;
+ my $tip = $opt->{ref} || 'HEAD';
my $git = $self->{ibx}->git;
- $self->{batch_bytes} = $opts->{batch_size} // $BATCH_BYTES;
+ $self->{batch_bytes} = $opt->{batch_size} // $BATCH_BYTES;
$git->batch_prepare;
- my $pr = $opts->{-progress};
- my $sync = { reindex => $opts->{reindex}, -opt => $opts };
+ my $pr = $opt->{-progress};
+ my $sync = { reindex => $opt->{reindex}, -opt => $opt };
my $xdb = $self->begin_txn_lazy;
- $self->{over}->rethread_prepare($opts);
+ $self->{over}->rethread_prepare($opt);
my $mm = _msgmap_init($self);
if ($sync->{reindex}) {
my $last = $mm->last_commit;
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH 11/14] convert: support new -index options
2020-08-10 2:11 [PATCH 00/14] more indexing related improvements Eric Wong
` (9 preceding siblings ...)
2020-08-10 2:12 ` [PATCH 10/14] searchidx: use singular `$opt' for consistency with v2 Eric Wong
@ 2020-08-10 2:12 ` Eric Wong
2020-08-10 2:12 ` [PATCH 12/14] convert: speed up --help Eric Wong
` (2 subsequent siblings)
13 siblings, 0 replies; 18+ messages in thread
From: Eric Wong @ 2020-08-10 2:12 UTC (permalink / raw)
To: meta
Converting v1 inboxes from v2 can be a painful experience
on HDD. Some of the new options in the CLI or config
file make it less painful.
---
Documentation/public-inbox-convert.pod | 19 +++++++
lib/PublicInbox/Admin.pm | 36 ++++++++++++
script/public-inbox-convert | 77 +++++++++++++++++++-------
script/public-inbox-index | 47 ++--------------
4 files changed, 117 insertions(+), 62 deletions(-)
diff --git a/Documentation/public-inbox-convert.pod b/Documentation/public-inbox-convert.pod
index a8a5658c..a7958cf8 100644
--- a/Documentation/public-inbox-convert.pod
+++ b/Documentation/public-inbox-convert.pod
@@ -33,6 +33,25 @@ at 4 due to various bottlenecks. The number of Xapian shards
will be 1 less than the JOBS value, since there is a single
process which distributes work to the Xapian shards.
+=item -L LEVEL, --index-level=LEVEL
+
+=item -c, --compact
+
+=item -v, --verbose
+
+=item --no-fsync
+
+=item --sequential-shard
+
+=item --batch-size=BYTES
+
+=item --max-size=BYTES
+
+These options affect indexing. They have no effect if
+L</--no-index> is specified
+
+See L<public-inbox-index(1)> for a description of these options.
+
=back
=head1 ENVIRONMENT
diff --git a/lib/PublicInbox/Admin.pm b/lib/PublicInbox/Admin.pm
index 8a9a81c9..ce720beb 100644
--- a/lib/PublicInbox/Admin.pm
+++ b/lib/PublicInbox/Admin.pm
@@ -265,4 +265,40 @@ sub parse_unsigned ($) {
1;
}
+sub index_prepare ($$) {
+ my ($opt, $cfg) = @_;
+ my $env;
+ if ($opt->{compact}) {
+ require PublicInbox::Xapcmd;
+ PublicInbox::Xapcmd::check_compact();
+ $opt->{compact_opt} = { -coarse_lock => 1, compact => 1 };
+ if (defined(my $jobs = $opt->{jobs})) {
+ $opt->{compact_opt}->{jobs} = $jobs;
+ }
+ }
+ for my $k (qw(max_size batch_size)) {
+ my $git_key = "publicInbox.index".ucfirst($k);
+ $git_key =~ s/_([a-z])/\U$1/g;
+ defined(my $v = $opt->{$k} // $cfg->{lc($git_key)}) or next;
+ parse_unsigned(\$v) or die "`$git_key=$v' not parsed\n";
+ $v > 0 or die "`$git_key=$v' must be positive\n";
+ $opt->{$k} = $v;
+ }
+
+ # out-of-the-box builds of Xapian 1.4.x are still limited to 32-bit
+ # https://getting-started-with-xapian.readthedocs.io/en/latest/concepts/indexing/limitations.html
+ $opt->{batch_size} and
+ $env = { XAPIAN_FLUSH_THRESHOLD => '4294967295' };
+
+ for my $k (qw(sequential_shard)) {
+ my $git_key = "publicInbox.index".ucfirst($k);
+ $git_key =~ s/_([a-z])/\U$1/g;
+ defined(my $s = $opt->{$k} // $cfg->{lc($git_key)}) or next;
+ defined(my $v = $cfg->git_bool($s))
+ or die "`$git_key=$s' not boolean\n";
+ $opt->{$k} = $v;
+ }
+ $env;
+}
+
1;
diff --git a/script/public-inbox-convert b/script/public-inbox-convert
index dbb2bd38..ca16b0dc 100755
--- a/script/public-inbox-convert
+++ b/script/public-inbox-convert
@@ -12,26 +12,57 @@ use PublicInbox::Git;
use PublicInbox::Spawn qw(spawn);
use Cwd 'abs_path';
use File::Copy 'cp'; # preserves permissions:
-my $usage = "Usage: public-inbox-convert OLD NEW\n";
-my $jobs;
-my $index = 1;
-my %opts = (
- '--jobs|j=i' => \$jobs,
- '--index!' => \$index,
-);
-GetOptions(%opts) or die "bad command-line args\n$usage";
+my $usage = 'Usage: public-inbox-convert [options] OLD NEW';
+my $help = <<EOF; # the following should fit w/o scrolling in 80x24 term:
+usage: $usage
+
+ convert v1 format inboxes to v2
+
+options:
+
+ --no-index do not index after conversion
+ --jobs=NUM set shards (NUM=0)
+ --verbose | -v increase verbosity (may be repeated)
+ --help | -? show this help
+
+index options (see public-inbox-index(1) manpage for full description):
+
+ --no-fsync speed up indexing, risk corruption on power outage
+ -L LEVEL `basic', `medium', or `full' (default: full)
+ --compact | -c run public-inbox-compact(1) after indexing
+ --sequential-shard index Xapian shards sequentially for slow storage
+ --batch-size=BYTES flush changes to OS after a given number of bytes
+ --max-size=BYTES do not index messages larger than the given size
+
+See public-inbox-convert(1) man page for full documentation.
+EOF
+
+my $opt = {
+ index => 1,
+ # index defaults:
+ quiet => -1, compact => 0, maxsize => undef, fsync => 1,
+ reindex => 1, # we always reindex
+};
+GetOptions($opt, qw(jobs|j=i index! help|?),
+ # index options
+ qw(verbose|v+ rethread compact|c+ fsync|sync!
+ indexlevel|index-level|L=s max_size|max-size=s
+ batch_size|batch-size=s
+ sequential_shard|sequential-shard|seq-shard
+ )) or die <<EOF;
+bad command-line args\n$usage
+EOF
+if ($opt->{help}) { print $help; exit 0 };
my $old_dir = shift(@ARGV) or die $usage;
my $new_dir = shift(@ARGV) or die $usage;
die "$new_dir exists\n" if -d $new_dir;
die "$old_dir not a directory\n" unless -d $old_dir;
-my $config = PublicInbox::Config->new;
+my $cfg = PublicInbox::Config->new;
$old_dir = abs_path($old_dir);
my $old;
-if ($config) {
- $config->each_inbox(sub {
- $old = $_[0] if abs_path($_[0]->{inboxdir}) eq $old_dir;
- });
-}
+$cfg->each_inbox(sub {
+ $old = $_[0] if abs_path($_[0]->{inboxdir}) eq $old_dir;
+});
unless ($old) {
warn "W: $old_dir not configured in " .
PublicInbox::Config::default_file() . "\n";
@@ -48,16 +79,20 @@ if ($old->version >= 2) {
}
$old->{indexlevel} //= PublicInbox::Admin::detect_indexlevel($old);
-if ($index) {
+my $env;
+if ($opt->{'index'}) {
my $mods = {};
PublicInbox::Admin::scan_ibx_modules($mods, $old);
PublicInbox::Admin::require_or_die(keys %$mods);
+ PublicInbox::Admin::progress_prepare($opt);
+ $env = PublicInbox::Admin::index_prepare($opt, $cfg);
}
-
+local %ENV = (%$env, %ENV) if $env;
my $new = { %$old };
$new->{inboxdir} = abs_path($new_dir);
$new->{version} = 2;
-$new = PublicInbox::InboxWritable->new($new);
+$new = PublicInbox::InboxWritable->new($new, { nproc => $opt->{jobs} });
+$new->{-no_fsync} = 1 if !$opt->{fsync};
my $v2w;
$old->umask_prepare;
@@ -73,7 +108,7 @@ $old->with_umask(sub {
local $ENV{GIT_CONFIG} = $old_cfg;
my $new_cfg = "$new->{inboxdir}/all.git/config";
$v2w = PublicInbox::V2Writable->new($new, 1);
- $v2w->init_inbox($jobs);
+ $v2w->init_inbox(delete $opt->{jobs});
unlink $new_cfg;
link_or_copy($old_cfg, $new_cfg);
if (my $alt = $new->{altid}) {
@@ -98,7 +133,7 @@ $clone may not be valid after migrating to v2, not copying
my $state = '';
my $head = $old->{ref_head} || 'HEAD';
my ($rd, $pid) = $old->git->popen(qw(fast-export --use-done-feature), $head);
-$v2w->idx_init;
+$v2w->idx_init($opt);
my $im = $v2w->importer;
my ($r, $w) = $im->gfi_start;
my $h = '[0-9a-f]';
@@ -155,10 +190,10 @@ if (my $mm = $old->mm) {
# we want to trigger a reindex, not a from scratch index if
# we're reusing the msgmap from an existing v1 installation.
- $v2w->idx_init;
+ $v2w->idx_init($opt);
my $epoch0 = PublicInbox::Git->new($v2w->git_init(0));
chop(my $cmt = $epoch0->qx(qw(rev-parse --verify), $head));
$v2w->last_epoch_commit(0, $cmt);
}
-$v2w->index_sync({reindex => 1}) if $index;
+$v2w->index_sync($opt) if delete $opt->{'index'};
$v2w->done;
diff --git a/script/public-inbox-index b/script/public-inbox-index
index b1d29ec1..14d3afd4 100755
--- a/script/public-inbox-index
+++ b/script/public-inbox-index
@@ -32,7 +32,6 @@ options:
BYTES may use `k', `m', and `g' suffixes (e.g. `10m' for 10 megabytes)
See public-inbox-index(1) man page for full documentation.
EOF
-my $compact_opt;
my $opt = { quiet => -1, compact => 0, max_size => undef, fsync => 1 };
GetOptions($opt, qw(verbose|v+ reindex rethread compact|c+ jobs|j=i prune
fsync|sync! xapian_only|xapian-only
@@ -51,47 +50,11 @@ if ($opt->{xapian_only} && !$opt->{reindex}) {
require PublicInbox::Admin;
PublicInbox::Admin::require_or_die('-index');
-if ($opt->{compact}) {
- require PublicInbox::Xapcmd;
- PublicInbox::Xapcmd::check_compact();
- $compact_opt = { -coarse_lock => 1, compact => 1 };
- if (defined(my $jobs = $opt->{jobs})) {
- $compact_opt->{jobs} = $jobs;
- }
-}
-
my $cfg = PublicInbox::Config->new; # Config is loaded by Admin
my @ibxs = PublicInbox::Admin::resolve_inboxes(\@ARGV, undef, $cfg);
PublicInbox::Admin::require_or_die('-index');
unless (@ibxs) { print STDERR "Usage: $usage\n"; exit 1 }
-my $max_size = $opt->{max_size} // $cfg->{lc('publicInbox.indexMaxSize')};
-if (defined $max_size) {
- PublicInbox::Admin::parse_unsigned(\$max_size) or
- die "`publicInbox.indexMaxSize=$max_size' not parsed\n";
- $opt->{max_size} = $max_size;
-}
-
-my $bs = $opt->{batch_size} // $cfg->{lc('publicInbox.indexBatchSize')};
-if (defined $bs) {
- PublicInbox::Admin::parse_unsigned(\$bs) or
- die "`publicInbox.indexBatchSize=$bs' not parsed\n";
- $opt->{batch_size} = $bs;
-}
-
-# out-of-the-box builds of Xapian 1.4.x are still limited to 32-bit
-# https://getting-started-with-xapian.readthedocs.io/en/latest/concepts/indexing/limitations.html
-local $ENV{XAPIAN_FLUSH_THRESHOLD} ||= '4294967295' if defined($bs);
-
-my $s = $opt->{sequential_shard} //
- $cfg->{lc('publicInbox.indexSequentialShard')};
-if (defined $s) {
- my $v = $cfg->git_bool($s);
- defined($v) or
- die "`publicInbox.indexSequentialShard=$s' not boolean\n";
- $opt->{sequential_shard} = $v;
-}
-
my $mods = {};
foreach my $ibx (@ibxs) {
# XXX: users can shoot themselves in the foot, with opt->{indexlevel}
@@ -101,12 +64,14 @@ foreach my $ibx (@ibxs) {
}
PublicInbox::Admin::require_or_die(keys %$mods);
+my $env = PublicInbox::Admin::index_prepare($opt, $cfg);
+local %ENV = (%ENV, %$env) if $env;
require PublicInbox::InboxWritable;
PublicInbox::Admin::progress_prepare($opt);
for my $ibx (@ibxs) {
$ibx = PublicInbox::InboxWritable->new($ibx);
if ($opt->{compact} >= 2) {
- PublicInbox::Xapcmd::run($ibx, 'compact', $compact_opt);
+ PublicInbox::Xapcmd::run($ibx, 'compact', $opt->{compact_opt});
}
$ibx->{-no_fsync} = 1 if !$opt->{fsync};
@@ -118,8 +83,8 @@ EOL
$ibx_opt = { %$opt, sequential_shard => $v };
}
PublicInbox::Admin::index_inbox($ibx, undef, $ibx_opt);
- if ($compact_opt) {
- local $compact_opt->{jobs} = 0 if $ibx_opt->{sequential_shard};
- PublicInbox::Xapcmd::run($ibx, 'compact', $compact_opt);
+ if (my $copt = $opt->{compact_opt}) {
+ local $copt->{jobs} = 0 if $ibx_opt->{sequential_shard};
+ PublicInbox::Xapcmd::run($ibx, 'compact', $copt);
}
}
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH 12/14] convert: speed up --help
2020-08-10 2:11 [PATCH 00/14] more indexing related improvements Eric Wong
` (10 preceding siblings ...)
2020-08-10 2:12 ` [PATCH 11/14] convert: support new -index options Eric Wong
@ 2020-08-10 2:12 ` Eric Wong
2020-08-10 2:12 ` [PATCH 13/14] convert: check ARGV more correctly Eric Wong
2020-08-10 2:12 ` [PATCH 14/14] convert: set No_COW on copied SQLite files Eric Wong
13 siblings, 0 replies; 18+ messages in thread
From: Eric Wong @ 2020-08-10 2:12 UTC (permalink / raw)
To: meta
Lazy-loading dependencies speeds up --help by several hundred
milliseconds and is a huge step towards user-friendliness.
---
script/public-inbox-convert | 39 ++++++++++++++++++-------------------
1 file changed, 19 insertions(+), 20 deletions(-)
diff --git a/script/public-inbox-convert b/script/public-inbox-convert
index ca16b0dc..c9075207 100755
--- a/script/public-inbox-convert
+++ b/script/public-inbox-convert
@@ -2,16 +2,8 @@
# Copyright (C) 2018-2020 all contributors <meta@public-inbox.org>
# License: AGPL-3.0+ <http://www.gnu.org/licenses/agpl-3.0.txt>
use strict;
-use warnings;
+use v5.10.1;
use Getopt::Long qw(:config gnu_getopt no_ignore_case auto_abbrev);
-use PublicInbox::InboxWritable;
-use PublicInbox::Config;
-use PublicInbox::Admin;
-use PublicInbox::V2Writable;
-use PublicInbox::Git;
-use PublicInbox::Spawn qw(spawn);
-use Cwd 'abs_path';
-use File::Copy 'cp'; # preserves permissions:
my $usage = 'Usage: public-inbox-convert [options] OLD NEW';
my $help = <<EOF; # the following should fit w/o scrolling in 80x24 term:
usage: $usage
@@ -57,27 +49,33 @@ my $old_dir = shift(@ARGV) or die $usage;
my $new_dir = shift(@ARGV) or die $usage;
die "$new_dir exists\n" if -d $new_dir;
die "$old_dir not a directory\n" unless -d $old_dir;
-my $cfg = PublicInbox::Config->new;
+
+require Cwd;
+Cwd->import('abs_path');
+require PublicInbox::Config;
+require PublicInbox::InboxWritable;
+
$old_dir = abs_path($old_dir);
+my $cfg = PublicInbox::Config->new;
my $old;
$cfg->each_inbox(sub {
$old = $_[0] if abs_path($_[0]->{inboxdir}) eq $old_dir;
});
-unless ($old) {
+if ($old) {
+ $old = PublicInbox::InboxWritable->new($old);
+} else {
warn "W: $old_dir not configured in " .
PublicInbox::Config::default_file() . "\n";
- $old = {
+ $old = PublicInbox::InboxWritable->new({
inboxdir => $old_dir,
name => 'ignored',
+ -primary_address => 'old@example.com',
address => [ 'old@example.com' ],
- };
- $old = PublicInbox::Inbox->new($old);
-}
-$old = PublicInbox::InboxWritable->new($old);
-if ($old->version >= 2) {
- die "Only conversion from v1 inboxes is supported\n";
+ });
}
+die "Only conversion from v1 inboxes is supported\n" if $old->version >= 2;
+require PublicInbox::Admin;
$old->{indexlevel} //= PublicInbox::Admin::detect_indexlevel($old);
my $env;
if ($opt->{'index'}) {
@@ -100,14 +98,15 @@ sub link_or_copy ($$) {
my ($src, $dst) = @_;
link($src, $dst) and return;
$!{EXDEV} or warn "link $src, $dst failed: $!, trying cp\n";
- cp($src, $dst) or die "cp $src, $dst failed: $!\n";
+ require File::Copy; # preserves permissions:
+ File::Copy::cp($src, $dst) or die "cp $src, $dst failed: $!\n";
}
$old->with_umask(sub {
my $old_cfg = "$old->{inboxdir}/config";
local $ENV{GIT_CONFIG} = $old_cfg;
my $new_cfg = "$new->{inboxdir}/all.git/config";
- $v2w = PublicInbox::V2Writable->new($new, 1);
+ $v2w = $new->importer(1);
$v2w->init_inbox(delete $opt->{jobs});
unlink $new_cfg;
link_or_copy($old_cfg, $new_cfg);
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH 13/14] convert: check ARGV more correctly
2020-08-10 2:11 [PATCH 00/14] more indexing related improvements Eric Wong
` (11 preceding siblings ...)
2020-08-10 2:12 ` [PATCH 12/14] convert: speed up --help Eric Wong
@ 2020-08-10 2:12 ` Eric Wong
2020-08-10 2:12 ` [PATCH 14/14] convert: set No_COW on copied SQLite files Eric Wong
13 siblings, 0 replies; 18+ messages in thread
From: Eric Wong @ 2020-08-10 2:12 UTC (permalink / raw)
To: meta
Instead of silently ignoring excessive args, don't let a user
specify an extra directory. Furthermore, we'll support the odd
case where BOFH wants to name an $INBOX_DIR to be `0' :P
---
script/public-inbox-convert | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/script/public-inbox-convert b/script/public-inbox-convert
index c9075207..275857fa 100755
--- a/script/public-inbox-convert
+++ b/script/public-inbox-convert
@@ -45,8 +45,9 @@ GetOptions($opt, qw(jobs|j=i index! help|?),
bad command-line args\n$usage
EOF
if ($opt->{help}) { print $help; exit 0 };
-my $old_dir = shift(@ARGV) or die $usage;
-my $new_dir = shift(@ARGV) or die $usage;
+my $old_dir = shift(@ARGV) // '';
+my $new_dir = shift(@ARGV) // '';
+die $usage if (scalar(@ARGV) || $new_dir eq '' || $old_dir eq '');
die "$new_dir exists\n" if -d $new_dir;
die "$old_dir not a directory\n" unless -d $old_dir;
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH 14/14] convert: set No_COW on copied SQLite files
2020-08-10 2:11 [PATCH 00/14] more indexing related improvements Eric Wong
` (12 preceding siblings ...)
2020-08-10 2:12 ` [PATCH 13/14] convert: check ARGV more correctly Eric Wong
@ 2020-08-10 2:12 ` Eric Wong
13 siblings, 0 replies; 18+ messages in thread
From: Eric Wong @ 2020-08-10 2:12 UTC (permalink / raw)
To: meta
We'll use our existing logic and use sqlite_backup_from_file,
which appeared in 1.39 (along with sqlite_backup_to_file).
---
script/public-inbox-convert | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/script/public-inbox-convert b/script/public-inbox-convert
index 275857fa..d655dcc6 100755
--- a/script/public-inbox-convert
+++ b/script/public-inbox-convert
@@ -115,10 +115,10 @@ $old->with_umask(sub {
require PublicInbox::AltId;
foreach my $i (0..$#$alt) {
my $src = PublicInbox::AltId->new($old, $alt->[$i], 0);
- $src->mm_alt or next;
+ $src = $src->mm_alt or next;
+ $src = $src->{dbh}->sqlite_db_filename;
my $dst = PublicInbox::AltId->new($new, $alt->[$i], 1);
- $dst = $dst->{filename};
- $src->mm_alt->{dbh}->sqlite_backup_to_file($dst);
+ $dst->mm_alt->{dbh}->sqlite_backup_from_file($src);
}
}
my $desc = "$old->{inboxdir}/description";
@@ -184,13 +184,15 @@ waitpid($pid, 0) or die "waitpid failed: $!\n";
$? == 0 or die "fast-export failed: $?\n";
$r = $w = undef; # v2w->done does the actual close and error checking
$v2w->done;
-if (my $mm = $old->mm) {
+if (my $old_mm = $old->mm) {
$old->cleanup;
- $mm->{dbh}->sqlite_backup_to_file("$new_dir/msgmap.sqlite3");
+ $old_mm = $old_mm->{dbh}->sqlite_db_filename;
# we want to trigger a reindex, not a from scratch index if
# we're reusing the msgmap from an existing v1 installation.
$v2w->idx_init($opt);
+ $v2w->{mm}->{dbh}->sqlite_backup_from_file($old_mm);
+
my $epoch0 = PublicInbox::Git->new($v2w->git_init(0));
chop(my $cmt = $epoch0->qx(qw(rev-parse --verify), $head));
$v2w->last_epoch_commit(0, $cmt);
^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH 03/14] doc: index: more notes about latest changes
2020-08-10 2:11 ` [PATCH 03/14] doc: index: more notes about latest changes Eric Wong
@ 2020-08-10 2:38 ` Kyle Meyer
2020-08-10 6:29 ` Eric Wong
0 siblings, 1 reply; 18+ messages in thread
From: Kyle Meyer @ 2020-08-10 2:38 UTC (permalink / raw)
To: Eric Wong; +Cc: meta
Eric Wong writes:
> For L<public-inbox-v2-format(5)> inboxes, this value is
> multiplied by the number of Xapian shards. Thus a typical v2
> -inbox with 3 shards will flush every 3 megabytes by default.
> -
> -Default: 1m (one megabyte)
> +inbox with 3 shards will flush every 3 megabytes by default
> +when unless parallelism is disabled via C<--sequential-shard>
s/when unless/unless/ ?
> +or C<--jobs=0>.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 08/14] admin: use a generic veriable name
2020-08-10 2:11 ` [PATCH 08/14] admin: use a generic veriable name Eric Wong
@ 2020-08-10 2:38 ` Kyle Meyer
0 siblings, 0 replies; 18+ messages in thread
From: Kyle Meyer @ 2020-08-10 2:38 UTC (permalink / raw)
To: Eric Wong; +Cc: meta
Eric Wong writes:
> [PATCH 08/14] admin: use a generic veriable name
s/veriable/variable/
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 03/14] doc: index: more notes about latest changes
2020-08-10 2:38 ` Kyle Meyer
@ 2020-08-10 6:29 ` Eric Wong
0 siblings, 0 replies; 18+ messages in thread
From: Eric Wong @ 2020-08-10 6:29 UTC (permalink / raw)
To: Kyle Meyer; +Cc: meta
Kyle Meyer <kyle@kyleam.com> wrote:
> Eric Wong writes:
>
> > For L<public-inbox-v2-format(5)> inboxes, this value is
> > multiplied by the number of Xapian shards. Thus a typical v2
> > -inbox with 3 shards will flush every 3 megabytes by default.
> > -
> > -Default: 1m (one megabyte)
> > +inbox with 3 shards will flush every 3 megabytes by default
> > +when unless parallelism is disabled via C<--sequential-shard>
>
> s/when unless/unless/ ?
Yup, thanks. Will squash the folowing before pushing (and speling
for 8/14 shall be vary-able :>) :
diff --git a/Documentation/public-inbox-index.pod b/Documentation/public-inbox-index.pod
index 3ae3b008..10cf2d19 100644
--- a/Documentation/public-inbox-index.pod
+++ b/Documentation/public-inbox-index.pod
@@ -179,7 +179,7 @@ C<4m> have little benefit.
For L<public-inbox-v2-format(5)> inboxes, this value is
multiplied by the number of Xapian shards. Thus a typical v2
inbox with 3 shards will flush every 3 megabytes by default
-when unless parallelism is disabled via C<--sequential-shard>
+unless parallelism is disabled via C<--sequential-shard>
or C<--jobs=0>.
This influences memory usage of Xapian, but it is not exact.
^ permalink raw reply related [flat|nested] 18+ messages in thread
end of thread, other threads:[~2020-08-10 6:29 UTC | newest]
Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2020-08-10 2:11 [PATCH 00/14] more indexing related improvements Eric Wong
2020-08-10 2:11 ` [PATCH 01/14] index: require --reindex when using --xapian-only Eric Wong
2020-08-10 2:11 ` [PATCH 02/14] index: --sequential-shard works incrementally Eric Wong
2020-08-10 2:11 ` [PATCH 03/14] doc: index: more notes about latest changes Eric Wong
2020-08-10 2:38 ` Kyle Meyer
2020-08-10 6:29 ` Eric Wong
2020-08-10 2:11 ` [PATCH 04/14] doc: add some notes around -xcpdb / -edit / -purge Eric Wong
2020-08-10 2:11 ` [PATCH 05/14] index+xcpdb: improve SIG{INT,TERM,HUP,PIPE} behavior Eric Wong
2020-08-10 2:11 ` [PATCH 06/14] msgmap: tmp_clone: simplify + meaningful filename Eric Wong
2020-08-10 2:11 ` [PATCH 07/14] avoid File::Temp::tempfile in more places Eric Wong
2020-08-10 2:11 ` [PATCH 08/14] admin: use a generic veriable name Eric Wong
2020-08-10 2:38 ` Kyle Meyer
2020-08-10 2:12 ` [PATCH 09/14] index: cleanup internal variables Eric Wong
2020-08-10 2:12 ` [PATCH 10/14] searchidx: use singular `$opt' for consistency with v2 Eric Wong
2020-08-10 2:12 ` [PATCH 11/14] convert: support new -index options Eric Wong
2020-08-10 2:12 ` [PATCH 12/14] convert: speed up --help Eric Wong
2020-08-10 2:12 ` [PATCH 13/14] convert: check ARGV more correctly Eric Wong
2020-08-10 2:12 ` [PATCH 14/14] convert: set No_COW on copied SQLite files Eric Wong
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).