unofficial mirror of meta@public-inbox.org
* [PATCH 00/14] more indexing related improvements
@ 2020-08-10  2:11 Eric Wong
  2020-08-10  2:11 ` [PATCH 01/14] index: require --reindex when using --xapian-only Eric Wong
                   ` (13 more replies)
  0 siblings, 14 replies; 18+ messages in thread
From: Eric Wong @ 2020-08-10  2:11 UTC
  To: meta

publicInbox.indexSequentialShard now works incrementally.

-convert also learned all the options -index learned,
so it can be less painful on HDDs.

Eric Wong (14):
  index: require --reindex when using --xapian-only
  index: --sequential-shard works incrementally
  doc: index: some more notes about latest changes
  doc: add some notes around -xcpdb / -edit / -purge
  index+xcpdb: improve SIG{INT,TERM,HUP,PIPE} behavior
  msgmap: tmp_clone: simplify + meaningful filename
  avoid File::Temp::tempfile in more places
  admin: use a generic variable name
  index: cleanup internal variables
  searchidx: use singular `$opt' for consistency with v2
  convert: support new -index options
  convert: speed up --help
  convert: check ARGV more correctly
  convert: set No_COW on copied SQLite files

 Documentation/public-inbox-convert.pod |  19 ++++
 Documentation/public-inbox-edit.pod    |  14 +++
 Documentation/public-inbox-index.pod   |  68 +++++++------
 Documentation/public-inbox-init.pod    |   2 +-
 Documentation/public-inbox-purge.pod   |  14 +++
 Documentation/public-inbox-xcpdb.pod   |  15 ++-
 lib/PublicInbox/Admin.pm               |  71 ++++++++++++--
 lib/PublicInbox/Msgmap.pm              |  19 ++--
 lib/PublicInbox/SearchIdx.pm           |  34 +++----
 lib/PublicInbox/V2Writable.pm          |  77 ++++++++-------
 lib/PublicInbox/Xapcmd.pm              |  28 ++++--
 script/public-inbox-convert            | 131 ++++++++++++++++---------
 script/public-inbox-index              |  69 ++++---------
 script/public-inbox-init               |  17 ++--
 t/import.t                             |   5 +-
 15 files changed, 357 insertions(+), 226 deletions(-)


* [PATCH 01/14] index: require --reindex when using --xapian-only
  2020-08-10  2:11 [PATCH 00/14] more indexing related improvements Eric Wong
@ 2020-08-10  2:11 ` Eric Wong
  2020-08-10  2:11 ` [PATCH 02/14] index: --sequential-shard works incrementally Eric Wong
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 18+ messages in thread
From: Eric Wong @ 2020-08-10  2:11 UTC
  To: meta

This is to avoid user error with a currently undocumented switch,
since --xapian-only always goes through the full history at
the moment.
---
 script/public-inbox-index | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/script/public-inbox-index b/script/public-inbox-index
index 73ca2953..9e0907be 100755
--- a/script/public-inbox-index
+++ b/script/public-inbox-index
@@ -42,6 +42,9 @@ GetOptions($opt, qw(verbose|v+ reindex rethread compact|c+ jobs|j=i prune
 	or die "bad command-line args\n$usage";
 if ($opt->{help}) { print $help; exit 0 };
 die "--jobs must be >= 0\n" if defined $opt->{jobs} && $opt->{jobs} < 0;
+if ($opt->{xapianonly} && !$opt->{reindex}) {
+	die "--xapian-only requires --reindex\n";
+}
 
 # require lazily to speed up --help
 require PublicInbox::Admin;

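The check above can be tried in isolation.  What follows is only a
minimal sketch: check_opts() is a hypothetical helper, and the
option spec `xapianonly|xapian-only' is an assumption, since the
real spec in script/public-inbox-index is not visible in this hunk.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# check_opts is a hypothetical helper, not part of public-inbox.
# It returns an error string, or undef when the options are acceptable.
sub check_opts {
	my (@argv) = @_;
	my $opt = {};
	# assumed spec: the hash key is the primary name, "xapianonly"
	GetOptionsFromArray(\@argv, $opt, qw(reindex xapianonly|xapian-only))
		or return "bad command-line args\n";
	if ($opt->{xapianonly} && !$opt->{reindex}) {
		return "--xapian-only requires --reindex\n";
	}
	undef; # no error
}

print check_opts('--xapian-only') // "ok\n";
print check_opts('--xapian-only', '--reindex') // "ok\n";
```

Returning the message instead of calling die() keeps the sketch easy
to exercise; the patch itself dies immediately.
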

* [PATCH 02/14] index: --sequential-shard works incrementally
  2020-08-10  2:11 [PATCH 00/14] more indexing related improvements Eric Wong
  2020-08-10  2:11 ` [PATCH 01/14] index: require --reindex when using --xapian-only Eric Wong
@ 2020-08-10  2:11 ` Eric Wong
  2020-08-10  2:11 ` [PATCH 03/14] doc: index: more notes about latest changes Eric Wong
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 18+ messages in thread
From: Eric Wong @ 2020-08-10  2:11 UTC
  To: meta

We should never reindex all data in Xapian unless --reindex is
specified on the command-line.  This means users who put
publicInbox.indexSequentialShard in their config file won't have
to put up with a full reindex at every invocation, only when
they specify --reindex.

We'll also clean up the progress output so it no longer emits
nonsensical ranges where the starting number is higher than the end.
---
 lib/PublicInbox/V2Writable.pm | 36 ++++++++++++++++++++++-------------
 1 file changed, 23 insertions(+), 13 deletions(-)

diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index f7a318e5..0b527f18 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -1198,20 +1198,20 @@ sub index_xap_only { # git->cat_async callback
 
 sub index_xap_step ($$$;$) {
 	my ($self, $sync, $beg, $step) = @_;
-	my $ibx = $self->{ibx};
-	my $all = $ibx->git;
-	my $over = $ibx->over;
-	my $batch_bytes = batch_bytes($self);
-	$step //= $self->{shards};
 	my $end = $sync->{art_end};
+	return if $beg > $end; # nothing to do
+
+	$step //= $self->{shards};
+	my $ibx = $self->{ibx};
 	if (my $pr = $sync->{-opt}->{-progress}) {
 		$pr->("Xapian indexlevel=$ibx->{indexlevel} ".
 			"$beg..$end (% $step)\n");
 	}
+	my $batch_bytes = batch_bytes($self);
 	for (my $num = $beg; $num <= $end; $num += $step) {
-		my $smsg = $over->get_art($num) or next;
+		my $smsg = $ibx->over->get_art($num) or next;
 		$smsg->{v2w} = $self;
-		$all->cat_async($smsg->{blob}, \&index_xap_only, $smsg);
+		$ibx->git->cat_async($smsg->{blob}, \&index_xap_only, $smsg);
 		if ($self->{transact_bytes} >= $batch_bytes) {
 			${$sync->{nr}} = $num;
 			reindex_checkpoint($self, $sync);
@@ -1253,8 +1253,9 @@ sub index_epoch ($$$) {
 }
 
 sub xapian_only {
-	my ($self, $opt, $sync) = @_;
+	my ($self, $opt, $sync, $art_beg) = @_;
 	my $seq = $opt->{sequentialshard};
+	$art_beg //= 0;
 	local $self->{parallel} = 0 if $seq;
 	$self->idx_init($opt); # acquire lock
 	if (my $art_end = $self->{ibx}->mm->max) {
@@ -1268,9 +1269,11 @@ sub xapian_only {
 		$sync->{art_end} = $art_end;
 		if ($seq || !$self->{parallel}) {
 			my $shard_end = $self->{shards} - 1;
-			index_xap_step($self, $sync, $_) for (0..$shard_end);
+			for (0..$shard_end) {
+				index_xap_step($self, $sync, $art_beg + $_)
+			}
 		} else { # parallel (maybe)
-			index_xap_step($self, $sync, 0, 1);
+			index_xap_step($self, $sync, $art_beg, 1);
 		}
 	}
 	$self->{ibx}->git->cat_async_wait;
@@ -1289,6 +1292,7 @@ sub index_sync {
 	return unless defined $latest;
 
 	my $seq = $opt->{sequentialshard};
+	my $art_beg; # the NNTP article number we start xapian_only at
 	my $idxlevel = $self->{ibx}->{indexlevel};
 	local $self->{ibx}->{indexlevel} = 'basic' if $seq;
 
@@ -1312,6 +1316,12 @@ sub index_sync {
 		$self->{mm}->{dbh}->begin_work;
 		$sync->{mm_tmp} =
 			$self->{mm}->tmp_clone($self->{ibx}->{inboxdir});
+
+		# xapian_only works incrementally w/o --reindex
+		if ($seq && !$opt->{reindex}) {
+			$art_beg = $sync->{mm_tmp}->max;
+			$art_beg++ if defined($art_beg);
+		}
 	}
 	if ($sync->{index_max_size} = $self->{ibx}->{index_max_size}) {
 		$sync->{index_oid} = \&index_oid;
@@ -1326,10 +1336,10 @@ sub index_sync {
 		$pr->('all.git '.sprintf($sync->{-regen_fmt}, $$nr)) if $pr;
 	}
 
-	if ($seq) { # deal with Xapian shards sequentially
+	# deal with Xapian shards sequentially
+	if ($seq && delete($sync->{mm_tmp})) {
 		$self->{ibx}->{indexlevel} = $idxlevel;
-		delete $sync->{mm_tmp};
-		xapian_only($self, $opt, $sync);
+		xapian_only($self, $opt, $sync, $art_beg);
 	}
 
 	# reindex does not pick up new changes, so we rerun w/o it:

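To see which article numbers each sequential pass visits after this
change, here is a rough sketch.  shard_passes() is a hypothetical
helper written for illustration; it only reproduces the loop
arithmetic of index_xap_step and xapian_only shown above, and does
not touch Xapian, git or SQLite.

```perl
use strict;
use warnings;

# shard_passes is a hypothetical helper for illustration only;
# it lists the article numbers each sequential pass would visit.
sub shard_passes {
	my ($art_beg, $art_end, $shards) = @_;
	my @passes;
	for my $i (0..($shards - 1)) {
		my $beg = $art_beg + $i;
		my @nums;
		# mirrors the "return if $beg > $end" guard in the patch
		if ($beg <= $art_end) {
			for (my $n = $beg; $n <= $art_end; $n += $shards) {
				push @nums, $n;
			}
		}
		push @passes, \@nums;
	}
	\@passes;
}

# previous max article number was 100, so $art_beg is 101;
# the new max is 107 and there are 3 shards
my $passes = shard_passes(101, 107, 3);
print join(',', @$_), "\n" for @$passes;
```

Each pass steps by the shard count, so all numbers in one pass share
the same remainder modulo that count, and an empty pass corresponds
to the new early return.
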

* [PATCH 03/14] doc: index: more notes about latest changes
  2020-08-10  2:11 [PATCH 00/14] more indexing related improvements Eric Wong
  2020-08-10  2:11 ` [PATCH 01/14] index: require --reindex when using --xapian-only Eric Wong
  2020-08-10  2:11 ` [PATCH 02/14] index: --sequential-shard works incrementally Eric Wong
@ 2020-08-10  2:11 ` Eric Wong
  2020-08-10  2:38   ` Kyle Meyer
  2020-08-10  2:11 ` [PATCH 04/14] doc: add some notes around -xcpdb / -edit / -purge Eric Wong
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 18+ messages in thread
From: Eric Wong @ 2020-08-10  2:11 UTC
  To: meta

With LKML on an HDD, a giant --batch-size of 500m ends up being
pretty useful.  I was able to index LKML in ~16 hours on a
system that had other activity on it.  The big downside was it
was eating up over 5g of RAM :x.

We'll also fix up a duplicated indexBatchSize section, fix
formatting around global vs per-inbox indexSequentialShard,
and ensure section 5 manpages are linked correctly.
---
 Documentation/public-inbox-index.pod | 62 +++++++++++++++-------------
 1 file changed, 33 insertions(+), 29 deletions(-)

diff --git a/Documentation/public-inbox-index.pod b/Documentation/public-inbox-index.pod
index 56dec993..3ae3b008 100644
--- a/Documentation/public-inbox-index.pod
+++ b/Documentation/public-inbox-index.pod
@@ -115,6 +115,11 @@ Sets or overrides L</publicinbox.indexBatchSize> on a
 per-invocation basis.  See L</publicinbox.indexBatchSize>
 below.
 
+When using rotational storage but abundant RAM, using a large
+value (e.g. C<500m>) with C<--sequential-shard> can
+significantly speed up the initial index and full C<--reindex>
+invocations (but not incremental updates).
+
 Available in public-inbox 1.6.0 (PENDING).
 
 =item --no-fsync
@@ -136,11 +141,11 @@ Available in public-inbox 1.6.0 (PENDING).
 
 =head1 FILES
 
-For v1 (ssoma) repositories described in L<public-inbox-v1-format>.
+For v1 (ssoma) repositories described in L<public-inbox-v1-format(5)>.
 All public-inbox-specific files are contained within the
 C<$GIT_DIR/public-inbox/> directory.
 
-v2 inboxes are described in L<public-inbox-v2-format>.
+v2 inboxes are described in L<public-inbox-v2-format(5)>.
 
 =head1 CONFIGURATION
 
@@ -168,40 +173,25 @@ L<public-inbox-learn(1)>, and L<public-inbox-watch(1)>.
 
 Increase this value on powerful systems to improve throughput at
 the expense of memory use.  The reduction of lock granularity
-may not be noticeable on fast systems.
-
-This option is available in public-inbox 1.6 or later.
-public-inbox 1.5 and earlier used the current default, C<1m>.
+may not be noticeable on fast systems.  With SSDs, values above
+C<4m> have little benefit.
 
 For L<public-inbox-v2-format(5)> inboxes, this value is
 multiplied by the number of Xapian shards.  Thus a typical v2
-inbox with 3 shards will flush every 3 megabytes by default.
-
-Default: 1m (one megabyte)
+inbox with 3 shards will flush every 3 megabytes by default
+unless parallelism is disabled via C<--sequential-shard>
+or C<--jobs=0>.
 
-=item publicinbox.indexBatchSize
-
-Flushes changes to the filesystem and releases locks after
-indexing the given number of bytes.  The default value of C<1m>
-(one megabyte) is low to minimize memory use and reduce
-contention with parallel invocations of L<public-inbox-mda(1)>,
-L<public-inbox-learn(1)>, and L<public-inbox-watch(1)>.
-
-Increase this value on powerful systems to improve throughput at
-the expense of memory use.  The reduction of lock granularity
-may not be noticeable on fast systems.
+This influences memory usage of Xapian, but it is not exact.
+The actual memory used by Xapian and Perl has been observed
+in excess of 10x this value.
 
 This option is available in public-inbox 1.6 or later.
 public-inbox 1.5 and earlier used the current default, C<1m>.
 
-For L<public-inbox-v2-format(5)> inboxes, this value is
-multiplied by the number of Xapian shards.  Thus a typical v2
-inbox with 3 shards will flush every 3 megabytes by default.
-
 Default: 1m (one megabyte)
 
 =item publicinbox.indexSequentialShard
-=item publicinbox.<inbox_name>.indexSequentialShard
 
 For L<public-inbox-v2-format(5)> inboxes, setting this to C<true>
 allows indexing Xapian shards in multiple passes.  This speeds up
@@ -212,12 +202,23 @@ Using a higher-than-normal number of C<--jobs> with
 L<public-inbox-init(1)> may be required to ensure individual
 shards are small enough to fit into cache.
 
+Warning: interrupting C<public-inbox-index(1)> while this option
+is in use may leave the search indices out-of-date with respect
+to SQLite databases.  WWW and IMAP users may notice incomplete
+search results, but it is otherwise non-fatal.  Using C<--reindex>
+will bring everything back up-to-date.
+
 Available in public-inbox 1.6.0 (PENDING).
 
 This is ignored on L<public-inbox-v1-format(5)> inboxes.
 
 Default: false, shards are indexed in parallel
 
+=item publicinbox.<name>.indexSequentialShard
+
+Identical to L</publicinbox.indexSequentialShard>,
+but only affects the inbox matching E<lt>nameE<gt>.
+
 =back
 
 =head1 ENVIRONMENT
@@ -235,10 +236,13 @@ disk.  This environment is handled directly by Xapian, refer to
 Xapian API documentation for more details.
 
 For public-inbox 1.6 and later, use C<publicinbox.indexBatchSize>
-instead.  Setting C<XAPIAN_FLUSH_THRESHOLD> for a large C<--reindex>
-may cause L<public-inbox-mda(1)>, L<public-inbox-learn(1)> and
-L<public-inbox-watch(1)> tasks to wait long periods of time
-during C<--reindex>.
+instead.
+
+Setting C<XAPIAN_FLUSH_THRESHOLD> or
+C<publicinbox.indexBatchSize> for a large C<--reindex> may cause
+L<public-inbox-mda(1)>, L<public-inbox-learn(1)> and
+L<public-inbox-watch(1)> tasks to wait long and unpredictable
+periods of time during C<--reindex>.
 
 Default: none, uses C<publicinbox.indexBatchSize>
 

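The documented relationship between indexBatchSize and the number of
shards can be written out as a small calculation.  This is only a
sketch of the rule as the manpage text states it: flush_threshold()
is a hypothetical name, and it says nothing about actual memory use,
which the text above notes can exceed 10x the configured value.

```perl
use strict;
use warnings;

# flush_threshold is a hypothetical helper for illustration only;
# it restates the documented rule, not public-inbox internals.
sub flush_threshold {
	my ($batch_bytes, $shards, $parallel) = @_;
	# with parallel shards the value is multiplied by the shard count
	$parallel ? $batch_bytes * $shards : $batch_bytes;
}

my $one_mb = 1024 ** 2; # the default indexBatchSize of 1m
printf "parallel, 3 shards: %d bytes\n", flush_threshold($one_mb, 3, 1);
printf "--sequential-shard: %d bytes\n", flush_threshold($one_mb, 3, 0);
```
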

* [PATCH 04/14] doc: add some notes around -xcpdb / -edit / -purge
  2020-08-10  2:11 [PATCH 00/14] more indexing related improvements Eric Wong
                   ` (2 preceding siblings ...)
  2020-08-10  2:11 ` [PATCH 03/14] doc: index: more notes about latest changes Eric Wong
@ 2020-08-10  2:11 ` Eric Wong
  2020-08-10  2:11 ` [PATCH 05/14] index+xcpdb: improve SIG{INT,TERM,HUP,PIPE} behavior Eric Wong
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 18+ messages in thread
From: Eric Wong @ 2020-08-10  2:11 UTC
  To: meta

These rarely-used commands have some caveats that needed
expanding on.
---
 Documentation/public-inbox-edit.pod  | 14 ++++++++++++++
 Documentation/public-inbox-purge.pod | 14 ++++++++++++++
 Documentation/public-inbox-xcpdb.pod | 15 +++++++++++++--
 3 files changed, 41 insertions(+), 2 deletions(-)

diff --git a/Documentation/public-inbox-edit.pod b/Documentation/public-inbox-edit.pod
index c64b4da8..3853fa9c 100644
--- a/Documentation/public-inbox-edit.pod
+++ b/Documentation/public-inbox-edit.pod
@@ -91,6 +91,20 @@ See L<public-inbox-config(5)>
 
 Only L<v2|public-inbox-v2-format(5)> repositories are supported.
 
+This is safe to run while normal inbox writing tools
+(L<public-inbox-mda(1)>, L<public-inbox-watch(1)>,
+L<public-inbox-learn(1)>) are active.
+
+Running this in parallel with L<public-inbox-xcpdb(1)> or
+C<"public-inbox-index --reindex"> can lead to errors or
+edited data remaining indexed.
+
+Incremental L<public-inbox-index(1)> (without C<--reindex>)
+is fine.
+
+Keep in mind this is a last resort, as it will be disruptive
+to anyone using L<git(1)> to mirror the inbox being edited.
+
 =head1 CONTACT
 
 Feedback welcome via plain-text mail to L<mailto:meta@public-inbox.org>
diff --git a/Documentation/public-inbox-purge.pod b/Documentation/public-inbox-purge.pod
index 56a3f95a..e20e18df 100644
--- a/Documentation/public-inbox-purge.pod
+++ b/Documentation/public-inbox-purge.pod
@@ -51,6 +51,20 @@ See L<public-inbox-config(5)>
 
 Only L<public-inbox-v2-format(5)> inboxes are supported.
 
+This is safe to run while normal inbox writing tools
+(L<public-inbox-mda(1)>, L<public-inbox-watch(1)>,
+L<public-inbox-learn(1)>) are active.
+
+Running this in parallel with L<public-inbox-xcpdb(1)> or
+C<"public-inbox-index --reindex"> can lead to errors or
+purged data remaining indexed.
+
+Incremental L<public-inbox-index(1)> (without C<--reindex>)
+is fine.
+
+Keep in mind this is a last resort, as it will be disruptive
+to anyone using L<git(1)> to mirror the inbox being purged.
+
 =head1 CONTACT
 
 Feedback welcome via plain-text mail to L<mailto:meta@public-inbox.org>
diff --git a/Documentation/public-inbox-xcpdb.pod b/Documentation/public-inbox-xcpdb.pod
index 89eed079..2ed4c582 100644
--- a/Documentation/public-inbox-xcpdb.pod
+++ b/Documentation/public-inbox-xcpdb.pod
@@ -11,8 +11,9 @@ public-inbox-xcpdb - upgrade Xapian DB formats
 public-inbox-xcpdb is similar to L<copydatabase(1)> for
 upgrading to the latest database format supported by Xapian
 (e.g. "glass" or "honey"), but is designed to tolerate and
-recover from Xapian database modifications from
-L<public-inbox-watch(1)> or L<public-inbox-mda(1)>.
+accept parallel Xapian database modifications from
+L<public-inbox-watch(1)>, L<public-inbox-mda(1)>,
+L<public-inbox-learn(1)>, and L<public-inbox-index(1)>.
 
 =head1 OPTIONS
 
@@ -80,6 +81,16 @@ used by public-inbox, NOT users upgrading public-inbox itself.
 In particular, it DOES NOT upgrade the schema used by the
 PSGI search interface (see L<public-inbox-index(1)>).
 
+=head1 LIMITATIONS
+
+Do not use L<public-inbox-purge(1)> or L<public-inbox-edit(1)>
+while this is running; old (purged or edited) data may show up.
+
+Normal invocations of L<public-inbox-index(1)> can safely run
+while this is running, too.  However, reindexing via the
+L<public-inbox-index(1)/--reindex> switch will be a waste of
+computing resources.
+
 =head1 CONTACT
 
 Feedback welcome via plain-text mail to L<mailto:meta@public-inbox.org>


* [PATCH 05/14] index+xcpdb: improve SIG{INT,TERM,HUP,PIPE} behavior
  2020-08-10  2:11 [PATCH 00/14] more indexing related improvements Eric Wong
                   ` (3 preceding siblings ...)
  2020-08-10  2:11 ` [PATCH 04/14] doc: add some notes around -xcpdb / -edit / -purge Eric Wong
@ 2020-08-10  2:11 ` Eric Wong
  2020-08-10  2:11 ` [PATCH 06/14] msgmap: tmp_clone: simplify + meaningful filename Eric Wong
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 18+ messages in thread
From: Eric Wong @ 2020-08-10  2:11 UTC
  To: meta

-index now invokes ->DESTROY like xcpdb does, which is necessary
to clean up $INBOX_DIR/msgmap-XXXXXXX files.  We'll also exit
with the expected values for various signals by adding 128 to
the signal number, as described in
<https://www.tldp.org/LDP/abs/html/exitcodes.html>.

-xcpdb now terminates worker processes and xapian-compact(1)
invocations when prematurely killed, too.
---
 lib/PublicInbox/Admin.pm  | 29 +++++++++++++++++++++++++----
 lib/PublicInbox/Xapcmd.pm | 28 +++++++++++++++++++---------
 2 files changed, 44 insertions(+), 13 deletions(-)

diff --git a/lib/PublicInbox/Admin.pm b/lib/PublicInbox/Admin.pm
index e42b01e0..af2b3da9 100644
--- a/lib/PublicInbox/Admin.pm
+++ b/lib/PublicInbox/Admin.pm
@@ -5,14 +5,28 @@
 # Unstable internal API
 package PublicInbox::Admin;
 use strict;
-use warnings;
-use Cwd 'abs_path';
-use base qw(Exporter);
-our @EXPORT_OK = qw(resolve_repo_dir);
+use parent qw(Exporter);
+use Cwd qw(abs_path);
+use POSIX ();
+our @EXPORT_OK = qw(resolve_repo_dir setup_signals);
 use PublicInbox::Config;
 use PublicInbox::Inbox;
 use PublicInbox::Spawn qw(popen_rd);
 
+sub setup_signals {
+	my ($cb, $arg) = @_; # optional
+
+	# we call exit() here instead of _exit() so DESTROY methods
+	# get called (e.g. File::Temp::Dir and PublicInbox::Msgmap)
+	$SIG{INT} = $SIG{HUP} = $SIG{PIPE} = $SIG{TERM} = sub {
+		my ($sig) = @_;
+		# https://www.tldp.org/LDP/abs/html/exitcodes.html
+		eval { $cb->($sig, $arg) } if $cb;
+		$sig = 'SIG'.$sig;
+		exit(128 + POSIX->$sig);
+	};
+}
+
 sub resolve_repo_dir {
 	my ($cd, $ver) = @_;
 	my $prefix = defined $cd ? $cd : './';
@@ -185,9 +199,16 @@ invalid indexlevel=$indexlevel (must be `basic', `medium', or `full')
 	die missing_mod_msg($err) ." required for indexlevel=$indexlevel\n";
 }
 
+sub index_terminate {
+	my (undef, $ibx) = @_; # $_[0] = signal name
+	$ibx->git->cleanup;
+}
+
 sub index_inbox {
 	my ($ibx, $im, $opt) = @_;
 	my $jobs = delete $opt->{jobs} if $opt;
+	local %SIG = %SIG;
+	setup_signals(\&index_terminate, $ibx);
 	if (ref($ibx) && $ibx->version == 2) {
 		eval { require PublicInbox::V2Writable };
 		die "v2 requirements not met: $@\n" if $@;
diff --git a/lib/PublicInbox/Xapcmd.pm b/lib/PublicInbox/Xapcmd.pm
index 714f6859..348621ce 100644
--- a/lib/PublicInbox/Xapcmd.pm
+++ b/lib/PublicInbox/Xapcmd.pm
@@ -2,8 +2,8 @@
 # License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
 package PublicInbox::Xapcmd;
 use strict;
-use warnings;
 use PublicInbox::Spawn qw(which popen_rd nodatacow_dir);
+use PublicInbox::Admin qw(setup_signals);
 use PublicInbox::Over;
 use PublicInbox::SearchIdx;
 use File::Temp 0.19 (); # ->newdir
@@ -126,6 +126,11 @@ sub same_fs_or_die ($$) {
 	die "$x and $y reside on different filesystems\n";
 }
 
+sub kill_pids {
+	my ($sig, $pids) = @_;
+	kill($sig, keys %$pids); # pids may be empty
+}
+
 sub process_queue {
 	my ($queue, $cb, $opt) = @_;
 	my $max = $opt->{jobs} // scalar(@$queue);
@@ -138,6 +143,8 @@ sub process_queue {
 
 	# run in parallel:
 	my %pids;
+	local %SIG = %SIG;
+	setup_signals(\&kill_pids, \%pids);
 	while (@$queue) {
 		while (scalar(keys(%pids)) < $max && scalar(@$queue)) {
 			my $args = shift @$queue;
@@ -156,12 +163,6 @@ sub process_queue {
 	}
 }
 
-sub setup_signals () {
-	# http://www.tldp.org/LDP/abs/html/exitcodes.html
-	$SIG{INT} = sub { exit(130) };
-	$SIG{HUP} = $SIG{PIPE} = $SIG{TERM} = sub { exit(1) };
-}
-
 sub prepare_run {
 	my ($ibx, $opt) = @_;
 	my $tmp = {}; # old shard dir => File::Temp->newdir object or undef
@@ -294,6 +295,11 @@ sub progress_pfx ($) {
 	($p[-1] =~ /\A([0-9]+)/) ? "$p[-2]/$1" : $p[-1];
 }
 
+sub kill_compact { # setup_signals callback
+	my ($sig, $pidref) = @_;
+	kill($sig, $$pidref) if defined($$pidref);
+}
+
 # xapian-compact wrapper
 sub compact ($$) {
 	my ($args, $opt) = @_;
@@ -319,14 +325,18 @@ sub compact ($$) {
 	}
 	$pr->("$pfx `".join(' ', @$cmd)."'\n") if $pr;
 	push @$cmd, $src, $dst;
-	my $rd = popen_rd($cmd, undef, $rdr);
+	my ($rd, $pid);
+	local %SIG = %SIG;
+	setup_signals(\&kill_compact, \$pid);
+	($rd, $pid) = popen_rd($cmd, undef, $rdr);
 	while (<$rd>) {
 		if ($pr) {
 			s/\r/\r$pfx /g;
 			$pr->("$pfx $_");
 		}
 	}
-	close $rd or die join(' ', @$cmd)." failed: $?n";
+	waitpid($pid, 0);
+	die "@$cmd failed: \$?=$?\n" if $?;
 }
 
 sub cpdb_loop ($$$;$$) {

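The "128 + signal number" convention used by setup_signals can be
checked without installing any handlers.  The sketch below assumes a
hypothetical exit_code_for() helper and looks the constant up with
POSIX->can, which is a different lookup than the one in the patch.

```perl
use strict;
use warnings;
use POSIX ();

# exit_code_for is a hypothetical helper, not part of public-inbox
sub exit_code_for {
	my ($sig) = @_; # signal name without the "SIG" prefix, e.g. "INT"
	my $num = POSIX->can("SIG$sig") or die "unknown signal: $sig\n";
	128 + $num->();
}

printf "%s => %d\n", $_, exit_code_for($_) for qw(INT TERM HUP PIPE);
```

On Linux this gives 130 for INT, the same value the old Xapcmd
handler hard-coded; TERM, HUP and PIPE previously all exited with 1.
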

* [PATCH 06/14] msgmap: tmp_clone: simplify + meaningful filename
  2020-08-10  2:11 [PATCH 00/14] more indexing related improvements Eric Wong
                   ` (4 preceding siblings ...)
  2020-08-10  2:11 ` [PATCH 05/14] index+xcpdb: improve SIG{INT,TERM,HUP,PIPE} behavior Eric Wong
@ 2020-08-10  2:11 ` Eric Wong
  2020-08-10  2:11 ` [PATCH 07/14] avoid File::Temp::tempfile in more places Eric Wong
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 18+ messages in thread
From: Eric Wong @ 2020-08-10  2:11 UTC
  To: meta

Trying to use the newer ->sqlite_backup_to_dbh method doesn't
seem worth it, as we'll have to support DBD::SQLite <= 1.60
for another decade or more.

Dumping 'msgmap-XXXXXXX' into $INBOX_DIR can appear a bit
confusing to users, so give it a "mm_tmp-$PID-XXXXXXXX" name
to emphasize it's a temporary file tied to a given PID.

We also don't want to penalize read-only daemons with
loading File::Temp, so do it lazily.
---
 lib/PublicInbox/Msgmap.pm | 19 ++++++-------------
 1 file changed, 6 insertions(+), 13 deletions(-)

diff --git a/lib/PublicInbox/Msgmap.pm b/lib/PublicInbox/Msgmap.pm
index e7f7e2c9..7290959d 100644
--- a/lib/PublicInbox/Msgmap.pm
+++ b/lib/PublicInbox/Msgmap.pm
@@ -9,10 +9,8 @@
 # This is maintained by ::SearchIdx
 package PublicInbox::Msgmap;
 use strict;
-use warnings;
 use DBI;
 use DBD::SQLite;
-use File::Temp qw(tempfile);
 use PublicInbox::Over;
 use PublicInbox::Spawn;
 
@@ -50,18 +48,13 @@ sub new_file {
 # used to keep track of used numeric mappings for v2 reindex
 sub tmp_clone {
 	my ($self, $dir) = @_;
-	my ($fh, $fn) = tempfile('msgmap-XXXXXXXX', EXLOCK => 0, DIR => $dir);
+	require File::Temp;
+	my $tmp = "mm_tmp-$$-XXXXXX";
+	my ($fh, $fn) = File::Temp::tempfile($tmp, EXLOCK => 0, DIR => $dir);
 	PublicInbox::Spawn::nodatacow_fd(fileno($fh));
-	my $tmp;
-	if ($self->{dbh}->can('sqlite_backup_to_dbh')) {
-		$tmp = ref($self)->new_file($fn, 2);
-		$tmp->{dbh}->do('PRAGMA journal_mode = MEMORY');
-		$self->{dbh}->sqlite_backup_to_dbh($tmp->{dbh});
-	} else { # DBD::SQLite <= 1.61_01
-		$self->{dbh}->sqlite_backup_to_file($fn);
-		$tmp = ref($self)->new_file($fn, 2);
-		$tmp->{dbh}->do('PRAGMA journal_mode = MEMORY');
-	}
+	$self->{dbh}->sqlite_backup_to_file($fn);
+	$tmp = ref($self)->new_file($fn, 2);
+	$tmp->{dbh}->do('PRAGMA journal_mode = MEMORY');
 	$tmp->{pid} = $$;
 	$tmp;
 }

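The lazy-loading and PID-in-filename ideas can be shown apart from
SQLite.  This is a minimal sketch: tmp_for_pid() is a hypothetical
helper, it writes into the system temporary directory rather than
$INBOX_DIR, and it skips the No_COW and backup steps of tmp_clone.

```perl
use strict;
use warnings;

# tmp_for_pid is a hypothetical helper for illustration only
sub tmp_for_pid {
	my ($dir) = @_;
	require File::Temp; # loaded on first call, not at compile time
	my $template = "mm_tmp-$$-XXXXXX"; # $$ is the current PID
	my ($fh, $fn) = File::Temp::tempfile($template, DIR => $dir);
	($fh, $fn);
}

require File::Spec;
my ($fh, $fn) = tmp_for_pid(File::Spec->tmpdir);
print "$fn\n";
close $fh or die "close $fn: $!";
unlink $fn or die "unlink $fn: $!";
```

A process that never calls tmp_for_pid() never pays for loading
File::Temp, which is the point made for read-only daemons above.
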

* [PATCH 07/14] avoid File::Temp::tempfile in more places
  2020-08-10  2:11 [PATCH 00/14] more indexing related improvements Eric Wong
                   ` (5 preceding siblings ...)
  2020-08-10  2:11 ` [PATCH 06/14] msgmap: tmp_clone: simplify + meaningful filename Eric Wong
@ 2020-08-10  2:11 ` Eric Wong
  2020-08-10  2:11 ` [PATCH 08/14] admin: use a generic variable name Eric Wong
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 18+ messages in thread
From: Eric Wong @ 2020-08-10  2:11 UTC
  To: meta

We can use open(..., undef) natively in Perl in t/import.t

In places where we need a pathname, the File::Temp OO API
gives us auto-unlinking for free.
---
 lib/PublicInbox/V2Writable.pm | 17 +++++++++--------
 script/public-inbox-init      |  9 ++++-----
 t/import.t                    |  5 ++---
 3 files changed, 15 insertions(+), 16 deletions(-)

diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index 0b527f18..93646e57 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -20,7 +20,7 @@ use PublicInbox::Msgmap;
 use PublicInbox::Spawn qw(spawn popen_rd);
 use PublicInbox::SearchIdx qw(log2stack crlf_adjust is_ancestor check_size);
 use IO::Handle; # ->autoflush
-use File::Temp qw(tempfile);
+use File::Temp ();
 
 my $OID = qr/[a-f0-9]{40,}/;
 # an estimate of the post-packed size to the raw uncompressed size
@@ -733,12 +733,14 @@ sub fill_alternates ($$) {
 	}
 	return unless $new;
 
-	my ($fh, $tmp) = tempfile('alt-XXXXXXXX', DIR => $info_dir);
+	my $fh = File::Temp->new(TEMPLATE => 'alt-XXXXXXXX', DIR => $info_dir);
+	my $tmp = $fh->filename;
 	print $fh join("\n", sort { $alt{$b} <=> $alt{$a} } keys %alt), "\n"
 		or die "print $tmp: $!\n";
 	chmod($mode, $fh) or die "fchmod $tmp: $!\n";
 	close $fh or die "close $tmp $!\n";
 	rename($tmp, $alt) or die "rename $tmp => $alt: $!\n";
+	$fh->unlink_on_destroy(0);
 }
 
 sub git_init {
@@ -819,18 +821,17 @@ sub import_init {
 sub diff ($$$) {
 	my ($mid, $cur, $new) = @_;
 
-	my ($ah, $an) = tempfile('email-cur-XXXXXXXX', TMPDIR => 1);
+	my $ah = File::Temp->new(TEMPLATE => 'email-cur-XXXXXXXX', TMPDIR => 1);
 	print $ah $cur->as_string or die "print: $!";
-	close $ah or die "close: $!";
-	my ($bh, $bn) = tempfile('email-new-XXXXXXXX', TMPDIR => 1);
+	$ah->flush or die "flush: $!";
 	PublicInbox::Import::drop_unwanted_headers($new);
+	my $bh = File::Temp->new(TEMPLATE => 'email-new-XXXXXXXX', TMPDIR => 1);
 	print $bh $new->as_string or die "print: $!";
-	close $bh or die "close: $!";
-	my $cmd = [ qw(diff -u), $an, $bn ];
+	$bh->flush or die "flush: $!";
+	my $cmd = [ qw(diff -u), $ah->filename, $bh->filename ];
 	print STDERR "# MID conflict <$mid>\n";
 	my $pid = spawn($cmd, undef, { 1 => 2 });
 	waitpid($pid, 0) == $pid or die "diff did not finish";
-	unlink($an, $bn);
 }
 
 sub get_blob ($$) {
diff --git a/script/public-inbox-init b/script/public-inbox-init
index b8d71f35..6a959db7 100755
--- a/script/public-inbox-init
+++ b/script/public-inbox-init
@@ -17,7 +17,7 @@ PublicInbox::Admin::require_or_die('-base');
 use PublicInbox::Config;
 use PublicInbox::InboxWritable;
 use PublicInbox::Import;
-use File::Temp qw/tempfile/;
+use File::Temp;
 use PublicInbox::Lock;
 use File::Basename qw/dirname/;
 use File::Path qw/mkpath/;
@@ -52,8 +52,7 @@ my $lock_obj = { lock_path => "$pi_config.flock" };
 PublicInbox::Lock::lock_acquire($lock_obj);
 
 # git-config will operate on this (and rename on success):
-my ($fh, $pi_config_tmp) = tempfile('pi-init-XXXXXXXX', DIR => $dir);
-my $cfg_tmp = UnlinkMe->new($pi_config_tmp);
+my $fh = File::Temp->new(TEMPLATE => 'pi-init-XXXXXXXX', DIR => $dir);
 
 # Now, we grab another lock to use git-config(1) locking, so it won't
 # wait on the lock, unlike some of our internal flock()-based locks.
@@ -110,7 +109,8 @@ if (-e $pi_config) {
 		}
 	}
 }
-close $fh or die "failed to close $pi_config_tmp: $!\n";
+my $pi_config_tmp = $fh->filename;
+close($fh) or die "failed to close $pi_config_tmp: $!\n";
 
 my $pfx = "publicinbox.$name";
 my @x = (qw/git config/, "--file=$pi_config_tmp");
@@ -177,7 +177,6 @@ if (defined $perm) {
 
 rename $pi_config_tmp, $pi_config or
 	die "failed to rename `$pi_config_tmp' to `$pi_config': $!\n";
-delete $cfg_tmp->{file};
 $auto_unlink->DESTROY;
 
 package UnlinkMe;
diff --git a/t/import.t b/t/import.t
index 440e8994..9a88416f 100644
--- a/t/import.t
+++ b/t/import.t
@@ -9,7 +9,6 @@ use PublicInbox::Git;
 use PublicInbox::Import;
 use PublicInbox::Spawn qw(spawn);
 use Fcntl qw(:DEFAULT SEEK_SET);
-use File::Temp qw/tempfile/;
 use PublicInbox::TestCommon;
 use MIME::Base64 3.05; # Perl 5.10.0 / 5.9.2
 my ($dir, $for_destroy) = tmpdir();
@@ -37,11 +36,11 @@ if ($v2) {
 	is($mime->as_string, $$raw_email, 'string matches');
 	is($smsg->{raw_bytes}, length($$raw_email), 'length matches');
 	my @cmd = ('git', "--git-dir=$git->{git_dir}", qw(hash-object --stdin));
-	my $in = tempfile();
+	open my $in, '+<', undef or BAIL_OUT "open(+<): $!";
 	print $in $mime->as_string or die "write failed: $!";
 	$in->flush or die "flush failed: $!";
 	seek($in, 0, SEEK_SET);
-	my $out = tempfile();
+	open my $out, '+<', undef or BAIL_OUT "open(+<): $!";
 	my $pid = spawn(\@cmd, {}, { 0 => $in, 1 => $out });
 	is(waitpid($pid, 0), $pid, 'waitpid succeeds on hash-object');
 	is($?, 0, 'hash-object');

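Both techniques from the commit message can be tried in a few lines.
The sketch below is independent of public-inbox; the template name
`example-XXXXXXXX' is made up, and it uses the `+>' mode from
perlfunc for the anonymous file where t/import.t uses `+<'.

```perl
use strict;
use warnings;
use File::Temp ();

# 1) File::Temp OO API: the file is unlinked when the object goes away
my $path;
{
	my $tmp = File::Temp->new(TEMPLATE => 'example-XXXXXXXX', TMPDIR => 1);
	$path = $tmp->filename;
	print $tmp "hello\n" or die "print: $!";
	$tmp->flush or die "flush: $!";
	-s $path or die "expected data in $path";
} # $tmp is destroyed here, which removes the file
print -e $path ? "still exists\n" : "gone\n";

# 2) anonymous temporary file: no pathname is ever visible
open(my $anon, '+>', undef) or die "open: $!";
print $anon "data\n" or die "print: $!";
seek($anon, 0, 0) or die "seek: $!";
my $line = <$anon>;
print "read back: $line";
```

When the file must outlive the object, as in fill_alternates after
rename(), the patch calls ->unlink_on_destroy(0) instead.
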

* [PATCH 08/14] admin: use a generic variable name
  2020-08-10  2:11 [PATCH 00/14] more indexing related improvements Eric Wong
                   ` (6 preceding siblings ...)
  2020-08-10  2:11 ` [PATCH 07/14] avoid File::Temp::tempfile in more places Eric Wong
@ 2020-08-10  2:11 ` Eric Wong
  2020-08-10  2:38   ` Kyle Meyer
  2020-08-10  2:12 ` [PATCH 09/14] index: cleanup internal variables Eric Wong
                   ` (5 subsequent siblings)
  13 siblings, 1 reply; 18+ messages in thread
From: Eric Wong @ 2020-08-10  2:11 UTC
  To: meta

We parse other options, too, not just --max-size.
---
 lib/PublicInbox/Admin.pm | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/lib/PublicInbox/Admin.pm b/lib/PublicInbox/Admin.pm
index af2b3da9..8a9a81c9 100644
--- a/lib/PublicInbox/Admin.pm
+++ b/lib/PublicInbox/Admin.pm
@@ -256,12 +256,12 @@ sub progress_prepare ($) {
 
 # same unit factors as git:
 sub parse_unsigned ($) {
-	my ($max_size) = @_;
+	my ($val) = @_;
 
-	$$max_size =~ /\A([0-9]+)([kmg])?\z/i or return;
+	$$val =~ /\A([0-9]+)([kmg])?\z/i or return;
 	my ($n, $unit_factor) = ($1, $2 // '');
 	my %u = ( k => 1024, m => 1024**2, g => 1024**3 );
-	$$max_size = $n * ($u{lc($unit_factor)} // 1);
+	$$val = $n * ($u{lc($unit_factor)} // 1);
 	1;
 }
 

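For reference, here is how the renamed function behaves on a few
inputs.  The body is a standalone copy of the post-patch code above
(without the prototype); the sample inputs are made up for
illustration.

```perl
use strict;
use warnings;

# standalone copy of the logic shown in the diff above
sub parse_unsigned {
	my ($val) = @_; # scalar ref, modified in place on success
	$$val =~ /\A([0-9]+)([kmg])?\z/i or return;
	my ($n, $unit_factor) = ($1, $2 // '');
	my %u = ( k => 1024, m => 1024**2, g => 1024**3 );
	$$val = $n * ($u{lc($unit_factor)} // 1);
	1;
}

for my $in (qw(500m 4M 1g 123 10x)) {
	my $v = $in;
	if (parse_unsigned(\$v)) {
		print "$in => $v\n";
	} else {
		print "$in => invalid\n";
	}
}
```

The caller passes a reference, so a rejected value such as `10x' is
left untouched and can be quoted in an error message.
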

* [PATCH 09/14] index: cleanup internal variables
  2020-08-10  2:11 [PATCH 00/14] more indexing related improvements Eric Wong
                   ` (7 preceding siblings ...)
  2020-08-10  2:11 ` [PATCH 08/14] admin: use a generic variable name Eric Wong
@ 2020-08-10  2:12 ` Eric Wong
  2020-08-10  2:12 ` [PATCH 10/14] searchidx: use singular `$opt' for consistency with v2 Eric Wong
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 18+ messages in thread
From: Eric Wong @ 2020-08-10  2:12 UTC (permalink / raw)
  To: meta

Move away from hard-to-read alllowercase naming and favor
snake_case or separated-by-dashes.

We'll keep `--indexlevel' as-is for now, since it's been around
for several releases; but we'll support `--index-level' in the
CLI and update our documentation in a few months.

We'll also clarify that publicInbox.indexMaxSize is only
intended for -index, and not -watch or -mda.
---
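A minimal sketch of how the renamed keys end up in the `$opt' hash,
relying on Getopt::Long storing each value under the first name in
its spec when given a hash ref.  The sample @argv is made up; the
specs are a subset of the ones in script/public-inbox-index below.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray
		:config gnu_getopt no_ignore_case);

my $opt = {};
# made-up command line, only for illustration
my @argv = qw(--max-size=10m --sequential-shard -L medium);
GetOptionsFromArray(\@argv, $opt, qw(
		indexlevel|index-level|L=s
		max_size|max-size=s
		sequential_shard|seq-shard|sequential-shard
	)) or die "bad command-line args\n";

# keys are the snake_case primary names, not the dashed aliases
print "$_=$opt->{$_}\n" for sort keys %$opt;
```

The dashed spellings stay available on the command line, while
internal code only ever sees the snake_case key.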
 Documentation/public-inbox-index.pod |  6 ++++-
 Documentation/public-inbox-init.pod  |  2 +-
 lib/PublicInbox/SearchIdx.pm         | 14 +++++------
 lib/PublicInbox/V2Writable.pm        | 26 +++++++++------------
 script/public-inbox-index            | 35 ++++++++++++++--------------
 script/public-inbox-init             |  8 ++-----
 6 files changed, 43 insertions(+), 48 deletions(-)

diff --git a/Documentation/public-inbox-index.pod b/Documentation/public-inbox-index.pod
index 3ae3b008..7a97432e 100644
--- a/Documentation/public-inbox-index.pod
+++ b/Documentation/public-inbox-index.pod
@@ -158,7 +158,11 @@ value.  A single suffix modifier of C<k>, C<m> or C<g> is
 supported, thus the value of C<1m> to prevents indexing of
 messages larger than one megabyte.
 
-This is useful for avoiding memory exhaustion in mirrors.
+This is useful for avoiding memory exhaustion in mirrors
+via git.  It does not prevent L<public-inbox-mda(1)> or
+L<public-inbox-watch(1)> from importing (and indexing)
+a message.
+
 This option is only available in public-inbox 1.5 or later.
 
 Default: none
diff --git a/Documentation/public-inbox-init.pod b/Documentation/public-inbox-init.pod
index fd9fc637..d0c87563 100644
--- a/Documentation/public-inbox-init.pod
+++ b/Documentation/public-inbox-init.pod
@@ -31,7 +31,7 @@ L<DBD::SQLite>.
 
 Default: C<1>
 
-=item --indexlevel <basic|medium|full>
+=item -L, --indexlevel <basic|medium|full>
 
 Controls the indexing level for L<public-inbox-index(1)>
 
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 1cf3e66c..7f2447fe 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -67,7 +67,6 @@ sub new {
 		my $dir = $self->xdir;
 		$self->{over} = PublicInbox::OverIdx->new("$dir/over.sqlite3");
 		$self->{over}->{-no_fsync} = 1 if $ibx->{-no_fsync};
-		$self->{index_max_size} = $ibx->{index_max_size};
 	} elsif ($version == 2) {
 		defined $shard or die "shard is required for v2\n";
 		# shard is a number
@@ -553,10 +552,10 @@ sub index_sync {
 sub check_size { # check_async cb for -index --max-size=...
 	my ($oid, $type, $size, $arg, $git) = @_;
 	(($type // '') eq 'blob') or die "E: bad $oid in $git->{git_dir}";
-	if ($size <= $arg->{index_max_size}) {
+	if ($size <= $arg->{max_size}) {
 		$git->cat_async($oid, $arg->{index_oid}, $arg);
 	} else {
-		warn "W: skipping $oid ($size > $arg->{index_max_size})\n";
+		warn "W: skipping $oid ($size > $arg->{max_size})\n";
 	}
 }
 
@@ -573,7 +572,7 @@ sub v1_checkpoint ($$;$) {
 			$self->{mm}->last_commit($newest);
 		}
 	} else {
-		${$sync->{max}} = $BATCH_BYTES;
+		${$sync->{max}} = $self->{batch_bytes};
 	}
 
 	$self->{mm}->{dbh}->commit;
@@ -603,7 +602,7 @@ sub v1_checkpoint ($$;$) {
 sub process_stack {
 	my ($self, $sync, $stk) = @_;
 	my $git = $self->{ibx}->git;
-	my $max = $BATCH_BYTES;
+	my $max = $self->{batch_bytes};
 	my $nr = 0;
 	$sync->{nr} = \$nr;
 	$sync->{max} = \$max;
@@ -617,13 +616,13 @@ sub process_stack {
 			$git->cat_async($oid, \&unindex_both, $self);
 		}
 	}
-	if ($sync->{index_max_size} = $self->{ibx}->{index_max_size}) {
+	if ($sync->{max_size} = $sync->{-opt}->{max_size}) {
 		$sync->{index_oid} = \&index_both;
 	}
 	while (my ($f, $at, $ct, $oid) = $stk->pop_rec) {
 		if ($f eq 'm') {
 			my $arg = { %$sync, autime => $at, cotime => $ct };
-			if ($sync->{index_max_size}) {
+			if ($sync->{max_size}) {
 				$git->check_async($oid, \&check_size, $arg);
 			} else {
 				$git->cat_async($oid, \&index_both, $arg);
@@ -749,6 +748,7 @@ sub _index_sync {
 	my ($self, $opts) = @_;
 	my $tip = $opts->{ref} || 'HEAD';
 	my $git = $self->{ibx}->git;
+	$self->{batch_bytes} = $opts->{batch_size} // $BATCH_BYTES;
 	$git->batch_prepare;
 	my $pr = $opts->{-progress};
 	my $sync = { reindex => $opts->{reindex}, -opt => $opts };
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index 93646e57..8e36b92c 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -152,12 +152,6 @@ sub add {
 	$self->{ibx}->with_umask(\&_add, $self, $eml, $check_cb);
 }
 
-sub batch_bytes ($) {
-	my ($self) = @_;
-	($self->{parallel} ? $self->{shards} : 1) *
-		$PublicInbox::SearchIdx::BATCH_BYTES;
-}
-
 # indexes a message, returns true if checkpointing is needed
 sub do_idx ($$$$) {
 	my ($self, $msgref, $mime, $smsg) = @_;
@@ -166,7 +160,7 @@ sub do_idx ($$$$) {
 	my $idx = idx_shard($self, $smsg->{num} % $self->{shards});
 	$idx->index_raw($msgref, $mime, $smsg);
 	my $n = $self->{transact_bytes} += $smsg->{raw_bytes};
-	$n >= batch_bytes($self);
+	$n >= $self->{batch_bytes};
 }
 
 sub _add {
@@ -287,6 +281,9 @@ sub _idx_init { # with_umask callback
 	# xcpdb can change shard count while -watch is idle
 	my $nshards = count_shards($self);
 	$self->{shards} = $nshards if $nshards && $nshards != $self->{shards};
+	$self->{batch_bytes} = $opt->{batch_size} //
+				$PublicInbox::SearchIdx::BATCH_BYTES;
+	$self->{batch_bytes} *= $self->{shards} if $self->{parallel};
 
 	# need to create all shards before initializing msgmap FD
 	# idx_shards must be visible to all forked processes
@@ -891,7 +888,7 @@ sub reindex_checkpoint ($$) {
 	}
 
 	# allow -watch or -mda to write...
-	$self->idx_init; # reacquire lock
+	$self->idx_init($sync->{-opt}); # reacquire lock
 	$mm_tmp->atfork_parent if $mm_tmp;
 }
 
@@ -1208,12 +1205,11 @@ sub index_xap_step ($$$;$) {
 		$pr->("Xapian indexlevel=$ibx->{indexlevel} ".
 			"$beg..$end (% $step)\n");
 	}
-	my $batch_bytes = batch_bytes($self);
 	for (my $num = $beg; $num <= $end; $num += $step) {
 		my $smsg = $ibx->over->get_art($num) or next;
 		$smsg->{v2w} = $self;
 		$ibx->git->cat_async($smsg->{blob}, \&index_xap_only, $smsg);
-		if ($self->{transact_bytes} >= $batch_bytes) {
+		if ($self->{transact_bytes} >= $self->{batch_bytes}) {
 			${$sync->{nr}} = $num;
 			reindex_checkpoint($self, $sync);
 		}
@@ -1236,7 +1232,7 @@ sub index_epoch ($$$) {
 		$self->{current_info} = "$i.git $oid";
 		if ($f eq 'm') {
 			my $arg = { %$sync, autime => $at, cotime => $ct };
-			if ($sync->{index_max_size}) {
+			if ($sync->{max_size}) {
 				$all->check_async($oid, \&check_size, $arg);
 			} else {
 				$all->cat_async($oid, \&index_oid, $arg);
@@ -1255,7 +1251,7 @@ sub index_epoch ($$$) {
 
 sub xapian_only {
 	my ($self, $opt, $sync, $art_beg) = @_;
-	my $seq = $opt->{sequentialshard};
+	my $seq = $opt->{sequential_shard};
 	$art_beg //= 0;
 	local $self->{parallel} = 0 if $seq;
 	$self->idx_init($opt); # acquire lock
@@ -1285,14 +1281,14 @@ sub xapian_only {
 sub index_sync {
 	my ($self, $opt) = @_;
 	$opt //= $_[1] //= {};
-	goto \&xapian_only if $opt->{xapianonly};
+	goto \&xapian_only if $opt->{xapian_only};
 
 	my $pr = $opt->{-progress};
 	my $epoch_max;
 	my $latest = git_dir_latest($self, \$epoch_max);
 	return unless defined $latest;
 
-	my $seq = $opt->{sequentialshard};
+	my $seq = $opt->{sequential_shard};
 	my $art_beg; # the NNTP article number we start xapian_only at
 	my $idxlevel = $self->{ibx}->{indexlevel};
 	local $self->{ibx}->{indexlevel} = 'basic' if $seq;
@@ -1324,7 +1320,7 @@ sub index_sync {
 			$art_beg++ if defined($art_beg);
 		}
 	}
-	if ($sync->{index_max_size} = $self->{ibx}->{index_max_size}) {
+	if ($sync->{max_size} = $opt->{max_size}) {
 		$sync->{index_oid} = \&index_oid;
 	}
 	# work forwards through history
diff --git a/script/public-inbox-index b/script/public-inbox-index
index 9e0907be..b1d29ec1 100755
--- a/script/public-inbox-index
+++ b/script/public-inbox-index
@@ -17,7 +17,7 @@ usage: $usage
 options:
 
   --no-fsync          speed up indexing, risk corruption on power outage
-  --indexlevel=LEVEL  `basic', 'medium', or `full' (default: full)
+  -L LEVEL            `basic', `medium', or `full' (default: full)
   --compact | -c      run public-inbox-compact(1) after indexing
   --sequential-shard  index Xapian shards sequentially for slow storage
   --jobs=NUM          set or disable parallelization (NUM=0)
@@ -33,16 +33,17 @@ BYTES may use `k', `m', and `g' suffixes (e.g. `10m' for 10 megabytes)
 See public-inbox-index(1) man page for full documentation.
 EOF
 my $compact_opt;
-my $opt = { quiet => -1, compact => 0, maxsize => undef, fsync => 1 };
+my $opt = { quiet => -1, compact => 0, max_size => undef, fsync => 1 };
 GetOptions($opt, qw(verbose|v+ reindex rethread compact|c+ jobs|j=i prune
-		fsync|sync! xapianonly|xapian-only
-		indexlevel|L=s maxsize|max-size=s batchsize|batch-size=s
-		sequentialshard|seq-shard|sequential-shard
+		fsync|sync! xapian_only|xapian-only
+		indexlevel|index-level|L=s max_size|max-size=s
+		batch_size|batch-size=s
+		sequential_shard|seq-shard|sequential-shard
 		help|?))
 	or die "bad command-line args\n$usage";
 if ($opt->{help}) { print $help; exit 0 };
 die "--jobs must be >= 0\n" if defined $opt->{jobs} && $opt->{jobs} < 0;
-if ($opt->{xapianonly} && !$opt->{reindex}) {
+if ($opt->{xapian_only} && !$opt->{reindex}) {
 	die "--xapian-only requires --reindex\n";
 }
 
@@ -64,40 +65,38 @@ my @ibxs = PublicInbox::Admin::resolve_inboxes(\@ARGV, undef, $cfg);
 PublicInbox::Admin::require_or_die('-index');
 unless (@ibxs) { print STDERR "Usage: $usage\n"; exit 1 }
 
-my $max_size = $opt->{maxsize} // $cfg->{lc('publicInbox.indexMaxSize')};
+my $max_size = $opt->{max_size} // $cfg->{lc('publicInbox.indexMaxSize')};
 if (defined $max_size) {
 	PublicInbox::Admin::parse_unsigned(\$max_size) or
 		die "`publicInbox.indexMaxSize=$max_size' not parsed\n";
+	$opt->{max_size} = $max_size;
 }
 
-my $bs = $opt->{batchsize} // $cfg->{lc('publicInbox.indexBatchSize')};
+my $bs = $opt->{batch_size} // $cfg->{lc('publicInbox.indexBatchSize')};
 if (defined $bs) {
 	PublicInbox::Admin::parse_unsigned(\$bs) or
 		die "`publicInbox.indexBatchSize=$bs' not parsed\n";
+	$opt->{batch_size} = $bs;
 }
-no warnings 'once';
-local $PublicInbox::SearchIdx::BATCH_BYTES = $bs if defined($bs);
-use warnings 'once';
 
 # out-of-the-box builds of Xapian 1.4.x are still limited to 32-bit
 # https://getting-started-with-xapian.readthedocs.io/en/latest/concepts/indexing/limitations.html
 local $ENV{XAPIAN_FLUSH_THRESHOLD} ||= '4294967295' if defined($bs);
 
-my $s = $opt->{sequentialshard} //
+my $s = $opt->{sequential_shard} //
 			$cfg->{lc('publicInbox.indexSequentialShard')};
 if (defined $s) {
 	my $v = $cfg->git_bool($s);
 	defined($v) or
 		die "`publicInbox.indexSequentialShard=$s' not boolean\n";
-	$opt->{sequentialshard} = $v;
+	$opt->{sequential_shard} = $v;
 }
 
 my $mods = {};
 foreach my $ibx (@ibxs) {
 	# XXX: users can shoot themselves in the foot, with opt->{indexlevel}
-	$ibx->{indexlevel} //= $opt->{indexlevel} // ($opt->{xapianonly} ?
+	$ibx->{indexlevel} //= $opt->{indexlevel} // ($opt->{xapian_only} ?
 			'full' : PublicInbox::Admin::detect_indexlevel($ibx));
-	$ibx->{index_max_size} = $max_size;
 	PublicInbox::Admin::scan_ibx_modules($mods, $ibx);
 }
 
@@ -112,15 +111,15 @@ for my $ibx (@ibxs) {
 	$ibx->{-no_fsync} = 1 if !$opt->{fsync};
 
 	my $ibx_opt = $opt;
-	if (defined(my $s = $ibx->{indexsequentialshard})) {
+	if (defined(my $s = $ibx->{lc('indexSequentialShard')})) {
 		defined(my $v = $cfg->git_bool($s)) or die <<EOL;
 publicInbox.$ibx->{name}.indexSequentialShard not boolean
 EOL
-		$ibx_opt = { %$opt, sequentialshard => $v };
+		$ibx_opt = { %$opt, sequential_shard => $v };
 	}
 	PublicInbox::Admin::index_inbox($ibx, undef, $ibx_opt);
 	if ($compact_opt) {
-		local $compact_opt->{jobs} = 0 if $ibx_opt->{sequentialshard};
+		local $compact_opt->{jobs} = 0 if $ibx_opt->{sequential_shard};
 		PublicInbox::Xapcmd::run($ibx, 'compact', $compact_opt);
 	}
 }
diff --git a/script/public-inbox-init b/script/public-inbox-init
index 6a959db7..1c8066df 100755
--- a/script/public-inbox-init
+++ b/script/public-inbox-init
@@ -27,7 +27,7 @@ use Cwd qw/abs_path/;
 my ($version, $indexlevel, $skip_epoch, $skip_artnum, $jobs);
 my %opts = (
 	'V|version=i' => \$version,
-	'L|indexlevel=s' => \$indexlevel,
+	'L|index-level|indexlevel=s' => \$indexlevel,
 	'S|skip|skip-epoch=i' => \$skip_epoch,
 	'N|skip-artnum=i' => \$skip_artnum,
 	'j|jobs=i' => \$jobs,
@@ -103,11 +103,7 @@ if (-e $pi_config) {
 	exit(1) if $conflict;
 
 	my $ibx = $cfg->lookup_name($name);
-	if ($ibx) {
-		if (!defined($indexlevel) && $ibx->{indexlevel}) {
-			$indexlevel = $ibx->{indexlevel};
-		}
-	}
+	$indexlevel //= $ibx->{indexlevel} if $ibx;
 }
 my $pi_config_tmp = $fh->filename;
 close($fh) or die "failed to close $pi_config_tmp: $!\n";

* [PATCH 10/14] searchidx: use singular `$opt' for consistency with v2
  2020-08-10  2:11 [PATCH 00/14] more indexing related improvements Eric Wong
                   ` (8 preceding siblings ...)
  2020-08-10  2:12 ` [PATCH 09/14] index: cleanup internal variables Eric Wong
@ 2020-08-10  2:12 ` Eric Wong
  2020-08-10  2:12 ` [PATCH 11/14] convert: support new -index options Eric Wong
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 18+ messages in thread
From: Eric Wong @ 2020-08-10  2:12 UTC (permalink / raw)
  To: meta

The rest of our indexing code uses `$opt' instead of `$opts'.
---
 lib/PublicInbox/SearchIdx.pm | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 7f2447fe..5c39f3d6 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -539,11 +539,11 @@ sub unindex_both { # git->cat_async callback
 
 # called by public-inbox-index
 sub index_sync {
-	my ($self, $opts) = @_;
-	delete $self->{lock_path} if $opts->{-skip_lock};
-	$self->{ibx}->with_umask(\&_index_sync, $self, $opts);
-	if ($opts->{reindex}) {
-		my %again = %$opts;
+	my ($self, $opt) = @_;
+	delete $self->{lock_path} if $opt->{-skip_lock};
+	$self->{ibx}->with_umask(\&_index_sync, $self, $opt);
+	if ($opt->{reindex}) {
+		my %again = %$opt;
 		delete @again{qw(rethread reindex)};
 		index_sync($self, \%again);
 	}
@@ -745,15 +745,15 @@ sub reindex_from ($$) {
 
 # indexes all unindexed messages (v1 only)
 sub _index_sync {
-	my ($self, $opts) = @_;
-	my $tip = $opts->{ref} || 'HEAD';
+	my ($self, $opt) = @_;
+	my $tip = $opt->{ref} || 'HEAD';
 	my $git = $self->{ibx}->git;
-	$self->{batch_bytes} = $opts->{batch_size} // $BATCH_BYTES;
+	$self->{batch_bytes} = $opt->{batch_size} // $BATCH_BYTES;
 	$git->batch_prepare;
-	my $pr = $opts->{-progress};
-	my $sync = { reindex => $opts->{reindex}, -opt => $opts };
+	my $pr = $opt->{-progress};
+	my $sync = { reindex => $opt->{reindex}, -opt => $opt };
 	my $xdb = $self->begin_txn_lazy;
-	$self->{over}->rethread_prepare($opts);
+	$self->{over}->rethread_prepare($opt);
 	my $mm = _msgmap_init($self);
 	if ($sync->{reindex}) {
 		my $last = $mm->last_commit;

* [PATCH 11/14] convert: support new -index options
  2020-08-10  2:11 [PATCH 00/14] more indexing related improvements Eric Wong
                   ` (9 preceding siblings ...)
  2020-08-10  2:12 ` [PATCH 10/14] searchidx: use singular `$opt' for consistency with v2 Eric Wong
@ 2020-08-10  2:12 ` Eric Wong
  2020-08-10  2:12 ` [PATCH 12/14] convert: speed up --help Eric Wong
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 18+ messages in thread
From: Eric Wong @ 2020-08-10  2:12 UTC (permalink / raw)
  To: meta

Converting v1 inboxes to v2 can be a painful experience
on HDDs.  Some of the new options in the CLI or config
file make it less painful.
---
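A small sketch of how index_prepare (added to Admin.pm below) derives
the config key from an option name.  The helper name git_key_for is
hypothetical and only wraps the two lines used in the patch so they
can be run on their own.

```perl
use strict;
use warnings;

# hypothetical wrapper around the key derivation in index_prepare
sub git_key_for {
	my ($k) = @_;
	my $git_key = "publicInbox.index".ucfirst($k);
	$git_key =~ s/_([a-z])/\U$1/g; # snake_case => camelCase
	$git_key;
}

for my $k (qw(max_size batch_size sequential_shard)) {
	my $git_key = git_key_for($k);
	# the config hash is looked up with the lowercased key
	print "$k => $git_key (lookup: ", lc($git_key), ")\n";
}
```

So `max_size' maps to publicInbox.indexMaxSize for error messages,
and to its lowercased form for the actual lookup in $cfg.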
 Documentation/public-inbox-convert.pod | 19 +++++++
 lib/PublicInbox/Admin.pm               | 36 ++++++++++++
 script/public-inbox-convert            | 77 +++++++++++++++++++-------
 script/public-inbox-index              | 47 ++--------------
 4 files changed, 117 insertions(+), 62 deletions(-)

diff --git a/Documentation/public-inbox-convert.pod b/Documentation/public-inbox-convert.pod
index a8a5658c..a7958cf8 100644
--- a/Documentation/public-inbox-convert.pod
+++ b/Documentation/public-inbox-convert.pod
@@ -33,6 +33,25 @@ at 4 due to various bottlenecks.  The number of Xapian shards
 will be 1 less than the JOBS value, since there is a single
 process which distributes work to the Xapian shards.
 
+=item -L LEVEL, --index-level=LEVEL
+
+=item -c, --compact
+
+=item -v, --verbose
+
+=item --no-fsync
+
+=item --sequential-shard
+
+=item --batch-size=BYTES
+
+=item --max-size=BYTES
+
+These options affect indexing.  They have no effect if
+L</--no-index> is specified.
+
+See L<public-inbox-index(1)> for a description of these options.
+
 =back
 
 =head1 ENVIRONMENT
diff --git a/lib/PublicInbox/Admin.pm b/lib/PublicInbox/Admin.pm
index 8a9a81c9..ce720beb 100644
--- a/lib/PublicInbox/Admin.pm
+++ b/lib/PublicInbox/Admin.pm
@@ -265,4 +265,40 @@ sub parse_unsigned ($) {
 	1;
 }
 
+sub index_prepare ($$) {
+	my ($opt, $cfg) = @_;
+	my $env;
+	if ($opt->{compact}) {
+		require PublicInbox::Xapcmd;
+		PublicInbox::Xapcmd::check_compact();
+		$opt->{compact_opt} = { -coarse_lock => 1, compact => 1 };
+		if (defined(my $jobs = $opt->{jobs})) {
+			$opt->{compact_opt}->{jobs} = $jobs;
+		}
+	}
+	for my $k (qw(max_size batch_size)) {
+		my $git_key = "publicInbox.index".ucfirst($k);
+		$git_key =~ s/_([a-z])/\U$1/g;
+		defined(my $v = $opt->{$k} // $cfg->{lc($git_key)}) or next;
+		parse_unsigned(\$v) or die "`$git_key=$v' not parsed\n";
+		$v > 0 or die "`$git_key=$v' must be positive\n";
+		$opt->{$k} = $v;
+	}
+
+	# out-of-the-box builds of Xapian 1.4.x are still limited to 32-bit
+	# https://getting-started-with-xapian.readthedocs.io/en/latest/concepts/indexing/limitations.html
+	$opt->{batch_size} and
+		$env = { XAPIAN_FLUSH_THRESHOLD => '4294967295' };
+
+	for my $k (qw(sequential_shard)) {
+		my $git_key = "publicInbox.index".ucfirst($k);
+		$git_key =~ s/_([a-z])/\U$1/g;
+		defined(my $s = $opt->{$k} // $cfg->{lc($git_key)}) or next;
+		defined(my $v = $cfg->git_bool($s))
+					or die "`$git_key=$s' not boolean\n";
+		$opt->{$k} = $v;
+	}
+	$env;
+}
+
 1;
diff --git a/script/public-inbox-convert b/script/public-inbox-convert
index dbb2bd38..ca16b0dc 100755
--- a/script/public-inbox-convert
+++ b/script/public-inbox-convert
@@ -12,26 +12,57 @@ use PublicInbox::Git;
 use PublicInbox::Spawn qw(spawn);
 use Cwd 'abs_path';
 use File::Copy 'cp'; # preserves permissions:
-my $usage = "Usage: public-inbox-convert OLD NEW\n";
-my $jobs;
-my $index = 1;
-my %opts = (
-	'--jobs|j=i' => \$jobs,
-	'--index!' => \$index,
-);
-GetOptions(%opts) or die "bad command-line args\n$usage";
+my $usage = 'Usage: public-inbox-convert [options] OLD NEW';
+my $help = <<EOF; # the following should fit w/o scrolling in 80x24 term:
+usage: $usage
+
+  convert v1 format inboxes to v2
+
+options:
+
+  --no-index          do not index after conversion
+  --jobs=NUM          set shards (NUM=0)
+  --verbose | -v      increase verbosity (may be repeated)
+  --help | -?         show this help
+
+index options (see public-inbox-index(1) manpage for full description):
+
+  --no-fsync          speed up indexing, risk corruption on power outage
+  -L LEVEL            `basic', `medium', or `full' (default: full)
+  --compact | -c      run public-inbox-compact(1) after indexing
+  --sequential-shard  index Xapian shards sequentially for slow storage
+  --batch-size=BYTES  flush changes to OS after a given number of bytes
+  --max-size=BYTES    do not index messages larger than the given size
+
+See public-inbox-convert(1) man page for full documentation.
+EOF
+
+my $opt = {
+	index => 1,
+	# index defaults:
+	quiet => -1, compact => 0, max_size => undef, fsync => 1,
+	reindex => 1, # we always reindex
+};
+GetOptions($opt, qw(jobs|j=i index! help|?),
+		# index options
+		qw(verbose|v+ rethread compact|c+ fsync|sync!
+		indexlevel|index-level|L=s max_size|max-size=s
+		batch_size|batch-size=s
+		sequential_shard|sequential-shard|seq-shard
+		)) or die <<EOF;
+bad command-line args\n$usage
+EOF
+if ($opt->{help}) { print $help; exit 0 };
 my $old_dir = shift(@ARGV) or die $usage;
 my $new_dir = shift(@ARGV) or die $usage;
 die "$new_dir exists\n" if -d $new_dir;
 die "$old_dir not a directory\n" unless -d $old_dir;
-my $config = PublicInbox::Config->new;
+my $cfg = PublicInbox::Config->new;
 $old_dir = abs_path($old_dir);
 my $old;
-if ($config) {
-	$config->each_inbox(sub {
-		$old = $_[0] if abs_path($_[0]->{inboxdir}) eq $old_dir;
-	});
-}
+$cfg->each_inbox(sub {
+	$old = $_[0] if abs_path($_[0]->{inboxdir}) eq $old_dir;
+});
 unless ($old) {
 	warn "W: $old_dir not configured in " .
 		PublicInbox::Config::default_file() . "\n";
@@ -48,16 +79,20 @@ if ($old->version >= 2) {
 }
 
 $old->{indexlevel} //= PublicInbox::Admin::detect_indexlevel($old);
-if ($index) {
+my $env;
+if ($opt->{'index'}) {
 	my $mods = {};
 	PublicInbox::Admin::scan_ibx_modules($mods, $old);
 	PublicInbox::Admin::require_or_die(keys %$mods);
+	PublicInbox::Admin::progress_prepare($opt);
+	$env = PublicInbox::Admin::index_prepare($opt, $cfg);
 }
-
+local %ENV = (%$env, %ENV) if $env;
 my $new = { %$old };
 $new->{inboxdir} = abs_path($new_dir);
 $new->{version} = 2;
-$new = PublicInbox::InboxWritable->new($new);
+$new = PublicInbox::InboxWritable->new($new, { nproc => $opt->{jobs} });
+$new->{-no_fsync} = 1 if !$opt->{fsync};
 my $v2w;
 $old->umask_prepare;
 
@@ -73,7 +108,7 @@ $old->with_umask(sub {
 	local $ENV{GIT_CONFIG} = $old_cfg;
 	my $new_cfg = "$new->{inboxdir}/all.git/config";
 	$v2w = PublicInbox::V2Writable->new($new, 1);
-	$v2w->init_inbox($jobs);
+	$v2w->init_inbox(delete $opt->{jobs});
 	unlink $new_cfg;
 	link_or_copy($old_cfg, $new_cfg);
 	if (my $alt = $new->{altid}) {
@@ -98,7 +133,7 @@ $clone may not be valid after migrating to v2, not copying
 my $state = '';
 my $head = $old->{ref_head} || 'HEAD';
 my ($rd, $pid) = $old->git->popen(qw(fast-export --use-done-feature), $head);
-$v2w->idx_init;
+$v2w->idx_init($opt);
 my $im = $v2w->importer;
 my ($r, $w) = $im->gfi_start;
 my $h = '[0-9a-f]';
@@ -155,10 +190,10 @@ if (my $mm = $old->mm) {
 
 	# we want to trigger a reindex, not a from scratch index if
 	# we're reusing the msgmap from an existing v1 installation.
-	$v2w->idx_init;
+	$v2w->idx_init($opt);
 	my $epoch0 = PublicInbox::Git->new($v2w->git_init(0));
 	chop(my $cmt = $epoch0->qx(qw(rev-parse --verify), $head));
 	$v2w->last_epoch_commit(0, $cmt);
 }
-$v2w->index_sync({reindex => 1}) if $index;
+$v2w->index_sync($opt) if delete $opt->{'index'};
 $v2w->done;
diff --git a/script/public-inbox-index b/script/public-inbox-index
index b1d29ec1..14d3afd4 100755
--- a/script/public-inbox-index
+++ b/script/public-inbox-index
@@ -32,7 +32,6 @@ options:
 BYTES may use `k', `m', and `g' suffixes (e.g. `10m' for 10 megabytes)
 See public-inbox-index(1) man page for full documentation.
 EOF
-my $compact_opt;
 my $opt = { quiet => -1, compact => 0, max_size => undef, fsync => 1 };
 GetOptions($opt, qw(verbose|v+ reindex rethread compact|c+ jobs|j=i prune
 		fsync|sync! xapian_only|xapian-only
@@ -51,47 +50,11 @@ if ($opt->{xapian_only} && !$opt->{reindex}) {
 require PublicInbox::Admin;
 PublicInbox::Admin::require_or_die('-index');
 
-if ($opt->{compact}) {
-	require PublicInbox::Xapcmd;
-	PublicInbox::Xapcmd::check_compact();
-	$compact_opt = { -coarse_lock => 1, compact => 1 };
-	if (defined(my $jobs = $opt->{jobs})) {
-		$compact_opt->{jobs} = $jobs;
-	}
-}
-
 my $cfg = PublicInbox::Config->new; # Config is loaded by Admin
 my @ibxs = PublicInbox::Admin::resolve_inboxes(\@ARGV, undef, $cfg);
 PublicInbox::Admin::require_or_die('-index');
 unless (@ibxs) { print STDERR "Usage: $usage\n"; exit 1 }
 
-my $max_size = $opt->{max_size} // $cfg->{lc('publicInbox.indexMaxSize')};
-if (defined $max_size) {
-	PublicInbox::Admin::parse_unsigned(\$max_size) or
-		die "`publicInbox.indexMaxSize=$max_size' not parsed\n";
-	$opt->{max_size} = $max_size;
-}
-
-my $bs = $opt->{batch_size} // $cfg->{lc('publicInbox.indexBatchSize')};
-if (defined $bs) {
-	PublicInbox::Admin::parse_unsigned(\$bs) or
-		die "`publicInbox.indexBatchSize=$bs' not parsed\n";
-	$opt->{batch_size} = $bs;
-}
-
-# out-of-the-box builds of Xapian 1.4.x are still limited to 32-bit
-# https://getting-started-with-xapian.readthedocs.io/en/latest/concepts/indexing/limitations.html
-local $ENV{XAPIAN_FLUSH_THRESHOLD} ||= '4294967295' if defined($bs);
-
-my $s = $opt->{sequential_shard} //
-			$cfg->{lc('publicInbox.indexSequentialShard')};
-if (defined $s) {
-	my $v = $cfg->git_bool($s);
-	defined($v) or
-		die "`publicInbox.indexSequentialShard=$s' not boolean\n";
-	$opt->{sequential_shard} = $v;
-}
-
 my $mods = {};
 foreach my $ibx (@ibxs) {
 	# XXX: users can shoot themselves in the foot, with opt->{indexlevel}
@@ -101,12 +64,14 @@ foreach my $ibx (@ibxs) {
 }
 
 PublicInbox::Admin::require_or_die(keys %$mods);
+my $env = PublicInbox::Admin::index_prepare($opt, $cfg);
+local %ENV = (%ENV, %$env) if $env;
 require PublicInbox::InboxWritable;
 PublicInbox::Admin::progress_prepare($opt);
 for my $ibx (@ibxs) {
 	$ibx = PublicInbox::InboxWritable->new($ibx);
 	if ($opt->{compact} >= 2) {
-		PublicInbox::Xapcmd::run($ibx, 'compact', $compact_opt);
+		PublicInbox::Xapcmd::run($ibx, 'compact', $opt->{compact_opt});
 	}
 	$ibx->{-no_fsync} = 1 if !$opt->{fsync};
 
@@ -118,8 +83,8 @@ EOL
 		$ibx_opt = { %$opt, sequential_shard => $v };
 	}
 	PublicInbox::Admin::index_inbox($ibx, undef, $ibx_opt);
-	if ($compact_opt) {
-		local $compact_opt->{jobs} = 0 if $ibx_opt->{sequential_shard};
-		PublicInbox::Xapcmd::run($ibx, 'compact', $compact_opt);
+	if (my $copt = $opt->{compact_opt}) {
+		local $copt->{jobs} = 0 if $ibx_opt->{sequential_shard};
+		PublicInbox::Xapcmd::run($ibx, 'compact', $copt);
 	}
 }

* [PATCH 12/14] convert: speed up --help
  2020-08-10  2:11 [PATCH 00/14] more indexing related improvements Eric Wong
                   ` (10 preceding siblings ...)
  2020-08-10  2:12 ` [PATCH 11/14] convert: support new -index options Eric Wong
@ 2020-08-10  2:12 ` Eric Wong
  2020-08-10  2:12 ` [PATCH 13/14] convert: check ARGV more correctly Eric Wong
  2020-08-10  2:12 ` [PATCH 14/14] convert: set No_COW on copied SQLite files Eric Wong
  13 siblings, 0 replies; 18+ messages in thread
From: Eric Wong @ 2020-08-10  2:12 UTC (permalink / raw)
  To: meta

Lazy-loading dependencies speeds up --help by several hundred
milliseconds and is a huge step towards user-friendliness.
---
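The general pattern, as a sketch outside of public-inbox: handle
--help before anything expensive is loaded, then `require' modules at
runtime.  The script below is a made-up example that only uses Cwd
the same way the diff does.

```perl
use strict;
use v5.10.1;

# cheap path: nothing else is loaded if we only print help
if (grep { $_ eq '--help' } @ARGV) {
	print "usage: example [--help]\n";
	exit 0;
}

# loaded at runtime, only once we know there is real work to do
require Cwd;
Cwd->import('abs_path');

# parentheses are needed since abs_path was unknown at compile time
my $dir = abs_path('.');
print "$dir\n";
```

The trade-off is that `use'-time checks (prototypes, calls without
parentheses) are no longer available for the lazily loaded functions.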
 script/public-inbox-convert | 39 ++++++++++++++++++-------------------
 1 file changed, 19 insertions(+), 20 deletions(-)

diff --git a/script/public-inbox-convert b/script/public-inbox-convert
index ca16b0dc..c9075207 100755
--- a/script/public-inbox-convert
+++ b/script/public-inbox-convert
@@ -2,16 +2,8 @@
 # Copyright (C) 2018-2020 all contributors <meta@public-inbox.org>
 # License: AGPL-3.0+ <http://www.gnu.org/licenses/agpl-3.0.txt>
 use strict;
-use warnings;
+use v5.10.1;
 use Getopt::Long qw(:config gnu_getopt no_ignore_case auto_abbrev);
-use PublicInbox::InboxWritable;
-use PublicInbox::Config;
-use PublicInbox::Admin;
-use PublicInbox::V2Writable;
-use PublicInbox::Git;
-use PublicInbox::Spawn qw(spawn);
-use Cwd 'abs_path';
-use File::Copy 'cp'; # preserves permissions:
 my $usage = 'Usage: public-inbox-convert [options] OLD NEW';
 my $help = <<EOF; # the following should fit w/o scrolling in 80x24 term:
 usage: $usage
@@ -57,27 +49,33 @@ my $old_dir = shift(@ARGV) or die $usage;
 my $new_dir = shift(@ARGV) or die $usage;
 die "$new_dir exists\n" if -d $new_dir;
 die "$old_dir not a directory\n" unless -d $old_dir;
-my $cfg = PublicInbox::Config->new;
+
+require Cwd;
+Cwd->import('abs_path');
+require PublicInbox::Config;
+require PublicInbox::InboxWritable;
+
 $old_dir = abs_path($old_dir);
+my $cfg = PublicInbox::Config->new;
 my $old;
 $cfg->each_inbox(sub {
 	$old = $_[0] if abs_path($_[0]->{inboxdir}) eq $old_dir;
 });
-unless ($old) {
+if ($old) {
+	$old = PublicInbox::InboxWritable->new($old);
+} else {
 	warn "W: $old_dir not configured in " .
 		PublicInbox::Config::default_file() . "\n";
-	$old = {
+	$old = PublicInbox::InboxWritable->new({
 		inboxdir => $old_dir,
 		name => 'ignored',
+		-primary_address => 'old@example.com',
 		address => [ 'old@example.com' ],
-	};
-	$old = PublicInbox::Inbox->new($old);
-}
-$old = PublicInbox::InboxWritable->new($old);
-if ($old->version >= 2) {
-	die "Only conversion from v1 inboxes is supported\n";
+	});
 }
+die "Only conversion from v1 inboxes is supported\n" if $old->version >= 2;
 
+require PublicInbox::Admin;
 $old->{indexlevel} //= PublicInbox::Admin::detect_indexlevel($old);
 my $env;
 if ($opt->{'index'}) {
@@ -100,14 +98,15 @@ sub link_or_copy ($$) {
 	my ($src, $dst) = @_;
 	link($src, $dst) and return;
 	$!{EXDEV} or warn "link $src, $dst failed: $!, trying cp\n";
-	cp($src, $dst) or die "cp $src, $dst failed: $!\n";
+	require File::Copy; # preserves permissions:
+	File::Copy::cp($src, $dst) or die "cp $src, $dst failed: $!\n";
 }
 
 $old->with_umask(sub {
 	my $old_cfg = "$old->{inboxdir}/config";
 	local $ENV{GIT_CONFIG} = $old_cfg;
 	my $new_cfg = "$new->{inboxdir}/all.git/config";
-	$v2w = PublicInbox::V2Writable->new($new, 1);
+	$v2w = $new->importer(1);
 	$v2w->init_inbox(delete $opt->{jobs});
 	unlink $new_cfg;
 	link_or_copy($old_cfg, $new_cfg);

* [PATCH 13/14] convert: check ARGV more correctly
  2020-08-10  2:11 [PATCH 00/14] more indexing related improvements Eric Wong
                   ` (11 preceding siblings ...)
  2020-08-10  2:12 ` [PATCH 12/14] convert: speed up --help Eric Wong
@ 2020-08-10  2:12 ` Eric Wong
  2020-08-10  2:12 ` [PATCH 14/14] convert: set No_COW on copied SQLite files Eric Wong
  13 siblings, 0 replies; 18+ messages in thread
From: Eric Wong @ 2020-08-10  2:12 UTC (permalink / raw)
  To: meta

Instead of silently ignoring excessive args, don't let a user
specify an extra directory.  Furthermore, we'll support the odd
case where BOFH wants to name an $INBOX_DIR to be `0' :P
---
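To illustrate the difference: `shift(@ARGV) or die' rejects a
directory named `0' because the string is false in Perl, while `//'
only tests for definedness.  The helper old_new_dirs below is
hypothetical; it wraps the three lines added by this patch so they
can be exercised with sample argument lists.

```perl
use strict;
use v5.10.1;

# hypothetical helper around the checks added in the diff below;
# returns (old, new) on success, nothing on a usage error
sub old_new_dirs {
	my @argv = @_;
	my $old_dir = shift(@argv) // '';
	my $new_dir = shift(@argv) // '';
	return if (scalar(@argv) || $new_dir eq '' || $old_dir eq '');
	($old_dir, $new_dir);
}

# made-up argument lists
for my $args ([qw(0 new)], [qw(old)], [qw(old new extra)]) {
	my @dirs = old_new_dirs(@$args);
	print join(' ', @$args), ': ',
		(@dirs ? "ok (@dirs)" : 'usage error'), "\n";
}
```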
 script/public-inbox-convert | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/script/public-inbox-convert b/script/public-inbox-convert
index c9075207..275857fa 100755
--- a/script/public-inbox-convert
+++ b/script/public-inbox-convert
@@ -45,8 +45,9 @@ GetOptions($opt, qw(jobs|j=i index! help|?),
 bad command-line args\n$usage
 EOF
 if ($opt->{help}) { print $help; exit 0 };
-my $old_dir = shift(@ARGV) or die $usage;
-my $new_dir = shift(@ARGV) or die $usage;
+my $old_dir = shift(@ARGV) // '';
+my $new_dir = shift(@ARGV) // '';
+die $usage if (scalar(@ARGV) || $new_dir eq '' || $old_dir eq '');
 die "$new_dir exists\n" if -d $new_dir;
 die "$old_dir not a directory\n" unless -d $old_dir;
 

* [PATCH 14/14] convert: set No_COW on copied SQLite files
  2020-08-10  2:11 [PATCH 00/14] more indexing related improvements Eric Wong
                   ` (12 preceding siblings ...)
  2020-08-10  2:12 ` [PATCH 13/14] convert: check ARGV more correctly Eric Wong
@ 2020-08-10  2:12 ` Eric Wong
  13 siblings, 0 replies; 18+ messages in thread
From: Eric Wong @ 2020-08-10  2:12 UTC (permalink / raw)
  To: meta

We'll use our existing logic and use sqlite_backup_from_file,
which appeared in DBD::SQLite 1.39 (along with sqlite_backup_to_file).
---
 script/public-inbox-convert | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/script/public-inbox-convert b/script/public-inbox-convert
index 275857fa..d655dcc6 100755
--- a/script/public-inbox-convert
+++ b/script/public-inbox-convert
@@ -115,10 +115,10 @@ $old->with_umask(sub {
 		require PublicInbox::AltId;
 		foreach my $i (0..$#$alt) {
 			my $src = PublicInbox::AltId->new($old, $alt->[$i], 0);
-			$src->mm_alt or next;
+			$src = $src->mm_alt or next;
+			$src = $src->{dbh}->sqlite_db_filename;
 			my $dst = PublicInbox::AltId->new($new, $alt->[$i], 1);
-			$dst = $dst->{filename};
-			$src->mm_alt->{dbh}->sqlite_backup_to_file($dst);
+			$dst->mm_alt->{dbh}->sqlite_backup_from_file($src);
 		}
 	}
 	my $desc = "$old->{inboxdir}/description";
@@ -184,13 +184,15 @@ waitpid($pid, 0) or die "waitpid failed: $!\n";
 $? == 0 or die "fast-export failed: $?\n";
 $r = $w = undef; # v2w->done does the actual close and error checking
 $v2w->done;
-if (my $mm = $old->mm) {
+if (my $old_mm = $old->mm) {
 	$old->cleanup;
-	$mm->{dbh}->sqlite_backup_to_file("$new_dir/msgmap.sqlite3");
+	$old_mm = $old_mm->{dbh}->sqlite_db_filename;
 
 	# we want to trigger a reindex, not a from scratch index if
 	# we're reusing the msgmap from an existing v1 installation.
 	$v2w->idx_init($opt);
+	$v2w->{mm}->{dbh}->sqlite_backup_from_file($old_mm);
+
 	my $epoch0 = PublicInbox::Git->new($v2w->git_init(0));
 	chop(my $cmt = $epoch0->qx(qw(rev-parse --verify), $head));
 	$v2w->last_epoch_commit(0, $cmt);
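
The direction change from sqlite_backup_to_file to
sqlite_backup_from_file can be tried outside of public-inbox.  The
sketch below assumes DBI and DBD::SQLite 1.39+ are installed; the
table layout, file names and the dbh_for helper are made up for
illustration and are not the real msgmap schema.  The point is that
the destination file is created by our own code first (where
public-inbox can apply its No_COW logic), and is only then filled
from the source file.

```perl
use strict;
use warnings;
use DBI;
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);
my ($src, $dst) = ("$dir/old.sqlite3", "$dir/new.sqlite3");

# hypothetical helper: open (and create if needed) an SQLite file
sub dbh_for {
	my ($f) = @_;
	DBI->connect("dbi:SQLite:dbname=$f", '', '', { RaiseError => 1 });
}

# populate a source DB and remember its on-disk filename
my $old = dbh_for($src);
$old->do('CREATE TABLE msgmap (num INTEGER PRIMARY KEY, mid TEXT)');
$old->do('INSERT INTO msgmap (mid) VALUES (?)', undef, 'a@example');
my $src_file = $old->sqlite_db_filename;
$old->disconnect;

# the destination handle already exists; pull the contents into it
my $new = dbh_for($dst);
$new->sqlite_backup_from_file($src_file);
my ($mid) = $new->selectrow_array('SELECT mid FROM msgmap WHERE num = 1');
print "copied: $mid\n";
```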

* Re: [PATCH 03/14] doc: index: more notes about latest changes
  2020-08-10  2:11 ` [PATCH 03/14] doc: index: more notes about latest changes Eric Wong
@ 2020-08-10  2:38   ` Kyle Meyer
  2020-08-10  6:29     ` Eric Wong
  0 siblings, 1 reply; 18+ messages in thread
From: Kyle Meyer @ 2020-08-10  2:38 UTC (permalink / raw)
  To: Eric Wong; +Cc: meta

Eric Wong writes:

>  For L<public-inbox-v2-format(5)> inboxes, this value is
>  multiplied by the number of Xapian shards.  Thus a typical v2
> -inbox with 3 shards will flush every 3 megabytes by default.
> -
> -Default: 1m (one megabyte)
> +inbox with 3 shards will flush every 3 megabytes by default
> +when unless parallelism is disabled via C<--sequential-shard>

s/when unless/unless/ ?

> +or C<--jobs=0>.

* Re: [PATCH 08/14] admin: use a generic veriable name
  2020-08-10  2:11 ` [PATCH 08/14] admin: use a generic veriable name Eric Wong
@ 2020-08-10  2:38   ` Kyle Meyer
  0 siblings, 0 replies; 18+ messages in thread
From: Kyle Meyer @ 2020-08-10  2:38 UTC (permalink / raw)
  To: Eric Wong; +Cc: meta

Eric Wong writes:

> [PATCH 08/14] admin: use a generic veriable name

s/veriable/variable/

* Re: [PATCH 03/14] doc: index: more notes about latest changes
  2020-08-10  2:38   ` Kyle Meyer
@ 2020-08-10  6:29     ` Eric Wong
  0 siblings, 0 replies; 18+ messages in thread
From: Eric Wong @ 2020-08-10  6:29 UTC (permalink / raw)
  To: Kyle Meyer; +Cc: meta

Kyle Meyer <kyle@kyleam.com> wrote:
> Eric Wong writes:
> 
> >  For L<public-inbox-v2-format(5)> inboxes, this value is
> >  multiplied by the number of Xapian shards.  Thus a typical v2
> > -inbox with 3 shards will flush every 3 megabytes by default.
> > -
> > -Default: 1m (one megabyte)
> > +inbox with 3 shards will flush every 3 megabytes by default
> > +when unless parallelism is disabled via C<--sequential-shard>
> 
> s/when unless/unless/ ?

Yup, thanks.  Will squash the following before pushing (and speling
for 8/14 shall be vary-able :>) :

diff --git a/Documentation/public-inbox-index.pod b/Documentation/public-inbox-index.pod
index 3ae3b008..10cf2d19 100644
--- a/Documentation/public-inbox-index.pod
+++ b/Documentation/public-inbox-index.pod
@@ -179,7 +179,7 @@ C<4m> have little benefit.
 For L<public-inbox-v2-format(5)> inboxes, this value is
 multiplied by the number of Xapian shards.  Thus a typical v2
 inbox with 3 shards will flush every 3 megabytes by default
-when unless parallelism is disabled via C<--sequential-shard>
+unless parallelism is disabled via C<--sequential-shard>
 or C<--jobs=0>.
 
 This influences memory usage of Xapian, but it is not exact.

Thread overview: 18+ messages
2020-08-10  2:11 [PATCH 00/14] more indexing related improvements Eric Wong
2020-08-10  2:11 ` [PATCH 01/14] index: require --reindex when using --xapian-only Eric Wong
2020-08-10  2:11 ` [PATCH 02/14] index: --sequential-shard works incrementally Eric Wong
2020-08-10  2:11 ` [PATCH 03/14] doc: index: more notes about latest changes Eric Wong
2020-08-10  2:38   ` Kyle Meyer
2020-08-10  6:29     ` Eric Wong
2020-08-10  2:11 ` [PATCH 04/14] doc: add some notes around -xcpdb / -edit / -purge Eric Wong
2020-08-10  2:11 ` [PATCH 05/14] index+xcpdb: improve SIG{INT,TERM,HUP,PIPE} behavior Eric Wong
2020-08-10  2:11 ` [PATCH 06/14] msgmap: tmp_clone: simplify + meaningful filename Eric Wong
2020-08-10  2:11 ` [PATCH 07/14] avoid File::Temp::tempfile in more places Eric Wong
2020-08-10  2:11 ` [PATCH 08/14] admin: use a generic veriable name Eric Wong
2020-08-10  2:38   ` Kyle Meyer
2020-08-10  2:12 ` [PATCH 09/14] index: cleanup internal variables Eric Wong
2020-08-10  2:12 ` [PATCH 10/14] searchidx: use singular `$opt' for consistency with v2 Eric Wong
2020-08-10  2:12 ` [PATCH 11/14] convert: support new -index options Eric Wong
2020-08-10  2:12 ` [PATCH 12/14] convert: speed up --help Eric Wong
2020-08-10  2:12 ` [PATCH 13/14] convert: check ARGV more correctly Eric Wong
2020-08-10  2:12 ` [PATCH 14/14] convert: set No_COW on copied SQLite files Eric Wong

This inbox may be cloned and mirrored by anyone:

	git clone --mirror https://yhetil.org/meta

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V1 meta meta/ https://yhetil.org/meta \
		meta@public-inbox.org
	public-inbox-index meta

Newsgroups are available over NNTP:
	nntp://news.yhetil.org/yhetil.mail.public-inbox.meta
	nntp://news.public-inbox.org/inbox.mail.public-inbox.meta
	nntp://news.gmane.io/gmane.mail.public-inbox.general


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git