unofficial mirror of meta@public-inbox.org
* [PATCH 0/3] -init updates
@ 2020-06-21  0:21 Eric Wong
  2020-06-21  0:21 ` [PATCH 1/3] init: add -j / --jobs parameter Eric Wong
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Eric Wong @ 2020-06-21  0:21 UTC
  To: meta

Eric Wong (3):
  init: add -j / --jobs parameter
  init: refer to inboxes as "inbox" or "inboxes" in errors
  init: add --skip-artnum parameter

 Documentation/public-inbox-init.pod | 28 ++++++++++++++++++++++++++++
 lib/PublicInbox/InboxWritable.pm    | 13 ++++++++++++-
 lib/PublicInbox/Msgmap.pm           | 26 ++++++++++++++++++++++++++
 lib/PublicInbox/SearchIdx.pm        |  1 +
 lib/PublicInbox/V2Writable.pm       |  3 ++-
 script/public-inbox-init            | 21 ++++++++++++++-------
 t/init.t                            | 28 +++++++++++++++++++++++++++-
 t/v2mirror.t                        |  4 +++-
 8 files changed, 113 insertions(+), 11 deletions(-)

* [PATCH 1/3] init: add -j / --jobs parameter
  2020-06-21  0:21 [PATCH 0/3] -init updates Eric Wong
@ 2020-06-21  0:21 ` Eric Wong
  2020-06-21  0:21 ` [PATCH 2/3] init: refer to inboxes as "inbox" or "inboxes" in errors Eric Wong
  2020-06-21  0:21 ` [PATCH 3/3] init: add --skip-artnum parameter Eric Wong
  2 siblings, 0 replies; 5+ messages in thread
From: Eric Wong @ 2020-06-21  0:21 UTC
  To: meta

On a powerful (by my standards) machine with 16GB RAM and a
7200 RPM HDD marketed for "enterprise" use, indexing an 8.1G (in
git) LKML snapshot from Sep 2019 did not finish after 7 days
with the default number (3) of Xapian shards (`--jobs=4') and
`--batch-size=10m'.

Indexing started off fast, but got progressively slower as the
contents of the inbox (including Xapian + SQLite DBs) could no
longer be cached by the kernel.  Once the on-disk size
increased, HDD seek contention between the Xapian shard workers
slowed the process down to a crawl.

With a single shard, it still took around 3.5 days to index on
the HDD.  That's not good, but it's far better than not
finishing after 7 days.  So allow unfortunate HDD users to
easily specify a single shard on public-inbox-init.

For reference, a freshly TRIM-ed low-end TLC SSD on the SATA II
bus on the same machine indexes that same snapshot of LKML in
~7 hours with 3 shards and the same 10m batch size.  In the past,
a higher-end consumer-grade MLC SSD on similar hardware indexed
a similarly-sized data set in ~4 hours.
---
 Documentation/public-inbox-init.pod | 14 ++++++++++++++
 script/public-inbox-init            |  8 ++++++++
 t/v2mirror.t                        |  4 +++-
 3 files changed, 25 insertions(+), 1 deletion(-)
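
The test change in this patch checks the shard count by globbing
"xap*/?" under the inbox directory.  A minimal standalone sketch of
that check follows.  count_shards() is a hypothetical helper, not part
of public-inbox, and the "xap15" directory name in the fake layout is
an assumption made only so the example runs on its own.

```perl
use strict;
use warnings;
use File::Glob qw(bsd_glob);
use File::Path qw(make_path);
use File::Temp qw(tempdir);

# count_shards() is a hypothetical helper, not part of public-inbox.
# It reuses the "xap*/?" pattern from t/v2mirror.t, so it only sees
# shard directories with single-character names.
sub count_shards {
	my ($inboxdir) = @_;
	my @shards = grep { -d } bsd_glob("$inboxdir/xap*/?");
	scalar(@shards);
}

# Fake layout for demonstration only; "xap15" is an assumed name.
my $tmpdir = tempdir(CLEANUP => 1);
make_path(map { "$tmpdir/m/xap15/$_" } 0..2);
print count_shards("$tmpdir/m"), "\n"; # prints 3
```

Against a real inbox created with "-j1", the same helper should report
a single shard, which is what the updated test asserts.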

diff --git a/Documentation/public-inbox-init.pod b/Documentation/public-inbox-init.pod
index 4744da96..495a258f 100644
--- a/Documentation/public-inbox-init.pod
+++ b/Documentation/public-inbox-init.pod
@@ -48,6 +48,20 @@ added-after-the-fact (without affecting "git clone" followers).
 
 Default: unset, no epochs are skipped
 
+=item -j, --jobs=JOBS
+
+Control the number of Xapian index shards in a
+C<-V2> (L<public-inbox-v2-format(5)>) inbox.
+
+It is useful to use a single shard (C<-j1>) for inboxes on
+high-latency storage (e.g. rotational HDD) unless the system has
+enough RAM to cache 5-10x the size of the git repository.
+
+It is generally not useful to specify higher values than the
+default due to contention in the top-level producer process.
+
+Default: the number of online CPUs, up to 4
+
 =back
 
 =head1 ENVIRONMENT
diff --git a/script/public-inbox-init b/script/public-inbox-init
index 10d3ad45..00147db5 100755
--- a/script/public-inbox-init
+++ b/script/public-inbox-init
@@ -27,10 +27,12 @@ use Cwd qw/abs_path/;
 my $version = undef;
 my $indexlevel = undef;
 my $skip_epoch;
+my $jobs;
 my %opts = (
 	'V|version=i' => \$version,
 	'L|indexlevel=s' => \$indexlevel,
 	'S|skip|skip-epoch=i' => \$skip_epoch,
+	'j|jobs=i' => \$jobs,
 );
 GetOptions(%opts) or usage();
 PublicInbox::Admin::indexlevel_ok_or_die($indexlevel) if defined $indexlevel;
@@ -144,6 +146,12 @@ my $ibx = PublicInbox::Inbox->new({
 });
 
 my $creat_opt = {};
+if (defined $jobs) {
+	die "--jobs is only supported for -V2 inboxes\n" if $version == 1;
+	die "--jobs=$jobs must be >= 1\n" if $jobs <= 0;
+	$creat_opt->{nproc} = $jobs;
+}
+
 PublicInbox::InboxWritable->new($ibx, $creat_opt)->init_inbox(0, $skip_epoch);
 
 # needed for git prior to v2.1.0
diff --git a/t/v2mirror.t b/t/v2mirror.t
index fc03c3d7..b24528fe 100644
--- a/t/v2mirror.t
+++ b/t/v2mirror.t
@@ -80,9 +80,11 @@ foreach my $i (0..$epoch_max) {
 	ok(-d "$tmpdir/m/git/$i.git", "mirror $i OK");
 }
 
-@cmd = ("-init", '-V2', 'm', "$tmpdir/m", 'http://example.com/m',
+@cmd = ("-init", '-j1', '-V2', 'm', "$tmpdir/m", 'http://example.com/m',
 	'alt@example.com');
 ok(run_script(\@cmd), 'initialized public-inbox -V2');
+my @shards = glob("$tmpdir/m/xap*/?");
+is(scalar(@shards), 1, 'got a single shard on init');
 
 ok(run_script([qw(-index -j0), "$tmpdir/m"]), 'indexed');
 

* [PATCH 2/3] init: refer to inboxes as "inbox" or "inboxes" in errors
  2020-06-21  0:21 [PATCH 0/3] -init updates Eric Wong
  2020-06-21  0:21 ` [PATCH 1/3] init: add -j / --jobs parameter Eric Wong
@ 2020-06-21  0:21 ` Eric Wong
  2020-06-21  0:21 ` [PATCH 3/3] init: add --skip-artnum parameter Eric Wong
  2 siblings, 0 replies; 5+ messages in thread
From: Eric Wong @ 2020-06-21  0:21 UTC
  To: meta

Since V2 uses multiple git repositories, stop using
the word "repo" when referring to inboxes.
---
 script/public-inbox-init | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/script/public-inbox-init b/script/public-inbox-init
index 00147db5..e8dcf4fc 100755
--- a/script/public-inbox-init
+++ b/script/public-inbox-init
@@ -121,20 +121,20 @@ if (-f "$inboxdir/inbox.lock") {
 	if (!defined $version) {
 		$version = 2;
 	} elsif ($version != 2) {
-		die "$inboxdir is a -V2 repo, -V$version specified\n"
+		die "$inboxdir is a -V2 inbox, -V$version specified\n"
 	}
 } elsif (-d "$inboxdir/objects") {
 	if (!defined $version) {
 		$version = 1;
 	} elsif ($version != 1) {
-		die "$inboxdir is a -V1 repo, -V$version specified\n"
+		die "$inboxdir is a -V1 inbox, -V$version specified\n"
 	}
 }
 
 $version = 1 unless defined $version;
 
 if ($version == 1 && defined $skip_epoch) {
-	die "--skip-epoch is only supported for -V2 repos\n";
+	die "--skip-epoch is only supported for -V2 inboxes\n";
 }
 
 my $ibx = PublicInbox::Inbox->new({

* [PATCH 3/3] init: add --skip-artnum parameter
  2020-06-21  0:21 [PATCH 0/3] -init updates Eric Wong
  2020-06-21  0:21 ` [PATCH 1/3] init: add -j / --jobs parameter Eric Wong
  2020-06-21  0:21 ` [PATCH 2/3] init: refer to inboxes as "inbox" or "inboxes" in errors Eric Wong
@ 2020-06-21  0:21 ` Eric Wong
  2020-06-23 18:34   ` [PATCH] t/init: remove leftover find(1) call Eric Wong
  2 siblings, 1 reply; 5+ messages in thread
From: Eric Wong @ 2020-06-21  0:21 UTC
  To: meta

For archivists with only newer mail archives, this option allows
reserving NNTP article numbers for yet-to-be-archived
old messages.  Indexers will need to be updated to support this
feature in future commits.

-V1 inboxes will now be initialized with SQLite and Xapian
support if this option is used, or if --indexlevel= is
specified.
---
 Documentation/public-inbox-init.pod | 14 ++++++++++++++
 lib/PublicInbox/InboxWritable.pm    | 13 ++++++++++++-
 lib/PublicInbox/Msgmap.pm           | 26 ++++++++++++++++++++++++++
 lib/PublicInbox/SearchIdx.pm        |  1 +
 lib/PublicInbox/V2Writable.pm       |  3 ++-
 script/public-inbox-init            |  9 ++++-----
 t/init.t                            | 28 +++++++++++++++++++++++++++-
 7 files changed, 86 insertions(+), 8 deletions(-)
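
The reservation in Msgmap::skip_artnum below works by assigning a
throwaway Message-ID to the requested article number and deleting it
again, which leaves the high-water mark raised.  The following is a
rough in-memory model of those steps, not the real implementation:
FakeMsgmap and all of its methods are hypothetical, and it assumes the
real storage never hands out a number at or below the highest one ever
assigned.

```perl
use strict;
use warnings;

# In-memory stand-in for the message-number map; every name here is
# hypothetical and only mirrors the steps of skip_artnum().
package FakeMsgmap;

sub new { bless { high => 0, by_num => {} }, shift }

sub num_highwater { $_[0]->{high} }

# Assign an explicit number, or the next unused one when $num is undef.
# Returns the number used, or undef if that number is already taken.
sub mid_set {
	my ($self, $num, $mid) = @_;
	$num //= $self->{high} + 1;
	return undef if exists $self->{by_num}->{$num};
	$self->{by_num}->{$num} = $mid;
	$self->{high} = $num if $num > $self->{high};
	$num;
}

# Deleting frees the Message-ID but never lowers the high-water mark.
sub mid_delete {
	my ($self, $mid) = @_;
	my $by_num = $self->{by_num};
	delete @$by_num{grep { $by_num->{$_} eq $mid } keys %$by_num};
}

sub skip_artnum {
	my ($self, $skip) = @_;
	my $cur = $self->num_highwater;
	die "current article number $cur exceeds --skip-artnum=$skip\n"
		if $skip < $cur;
	my $mid = 'placeholder@example.com';
	defined($self->mid_set($skip, $mid)) or
		die "--skip-artnum failed\n";
	$self->mid_delete($mid);
}

package main;

my $mm = FakeMsgmap->new;
$mm->skip_artnum(12);
my $n = $mm->mid_set(undef, 'skip-artnum@example.com');
print "$n\n"; # prints 13
```

This matches the expectation in t/init.t, where the first message
delivered after "-N12" is expected to get article number 13.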

diff --git a/Documentation/public-inbox-init.pod b/Documentation/public-inbox-init.pod
index 495a258f..5714828d 100644
--- a/Documentation/public-inbox-init.pod
+++ b/Documentation/public-inbox-init.pod
@@ -39,6 +39,20 @@ See L<public-inbox-config(5)> for more information.
 
 Default: C<full>
 
+=item -N, --skip-artnum
+
+This option allows archivists to publish incomplete archives
+with only new mail while allowing NNTP article numbers
+to be reserved for yet-to-be-archived old mail.
+
+This is mainly intended for users of C<--skip-epoch> (documented below)
+but may be of use to L<public-inbox-v1-format(5)> users.
+
+There is no automatic way to use reserved NNTP article numbers
+when old mail is found, yet.
+
+Default: unset, no NNTP article numbers are skipped
+
 =item -S, --skip-epoch
 
 For C<-V2> (L<public-inbox-v2-format(5)>) inboxes only, this option
diff --git a/lib/PublicInbox/InboxWritable.pm b/lib/PublicInbox/InboxWritable.pm
index c54be046..f9e28502 100644
--- a/lib/PublicInbox/InboxWritable.pm
+++ b/lib/PublicInbox/InboxWritable.pm
@@ -39,10 +39,21 @@ sub assert_usable_dir {
 
 sub init_inbox {
 	my ($self, $shards, $skip_epoch, $skip_artnum) = @_;
-	# TODO: honor skip_artnum
 	if ($self->version == 1) {
 		my $dir = assert_usable_dir($self);
 		PublicInbox::Import::init_bare($dir);
+		if (defined($self->{indexlevel}) || defined($skip_artnum)) {
+			require PublicInbox::SearchIdx;
+			my $sidx = PublicInbox::SearchIdx->new($self, 1); # just create
+			$sidx->begin_txn_lazy;
+			$self->with_umask(sub {
+				my $mm = PublicInbox::Msgmap->new($dir, 1);
+				$mm->{dbh}->begin_work;
+				$mm->skip_artnum($skip_artnum);
+				$mm->{dbh}->commit;
+			}) if defined($skip_artnum);
+			$sidx->commit_txn_lazy;
+		}
 	} else {
 		my $v2w = importer($self);
 		$v2w->init_inbox($shards, $skip_epoch, $skip_artnum);
diff --git a/lib/PublicInbox/Msgmap.pm b/lib/PublicInbox/Msgmap.pm
index d115cbce..aa07e344 100644
--- a/lib/PublicInbox/Msgmap.pm
+++ b/lib/PublicInbox/Msgmap.pm
@@ -270,4 +270,30 @@ sub atfork_prepare {
 	%$self = (tmp_name => $f, pid => $$);
 }
 
+sub skip_artnum {
+	my ($self, $skip_artnum) = @_;
+	return meta_accessor($self, 'skip_artnum') if !defined($skip_artnum);
+
+	my $cur = num_highwater($self) // 0;
+	if ($skip_artnum < $cur) {
+		die "E: current article number $cur ",
+			"exceeds --skip-artnum=$skip_artnum\n";
+	} else {
+		my $ok;
+		for (1..10) {
+			my $mid = 'skip'.rand.'@'.rand.'.example.com';
+			$ok = mid_set($self, $skip_artnum, $mid);
+			if ($ok) {
+				mid_delete($self, $mid);
+				last;
+			}
+		}
+		$ok or die '--skip-artnum failed';
+
+		# in the future, the indexer may use this value for
+		# new messages in old epochs
+		meta_accessor($self, 'skip_artnum', $skip_artnum);
+	}
+}
+
 1;
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 85821ea7..00e63938 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -694,6 +694,7 @@ sub _git_log {
 		} else {
 			# normal regen is for for fresh data
 			$self->{regen_down} = $fcount;
+			$self->{regen_down} += $high unless $opts->{reindex};
 		}
 	} else {
 		# Give oldest messages the smallest numbers
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index 91379431..a0f041dd 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -128,12 +128,13 @@ sub new {
 
 # public (for now?)
 sub init_inbox {
-	my ($self, $shards, $skip_epoch) = @_;
+	my ($self, $shards, $skip_epoch, $skip_artnum) = @_;
 	if (defined $shards) {
 		$self->{parallel} = 0 if $shards == 0;
 		$self->{shards} = $shards if $shards > 0;
 	}
 	$self->idx_init;
+	$self->{mm}->skip_artnum($skip_artnum) if defined $skip_artnum;
 	my $epoch_max = -1;
 	git_dir_latest($self, \$epoch_max);
 	if (defined $skip_epoch && $epoch_max == -1) {
diff --git a/script/public-inbox-init b/script/public-inbox-init
index e8dcf4fc..c7f3da6f 100755
--- a/script/public-inbox-init
+++ b/script/public-inbox-init
@@ -24,14 +24,12 @@ use File::Path qw/mkpath/;
 use Fcntl qw(:DEFAULT);
 use Cwd qw/abs_path/;
 
-my $version = undef;
-my $indexlevel = undef;
-my $skip_epoch;
-my $jobs;
+my ($version, $indexlevel, $skip_epoch, $skip_artnum, $jobs);
 my %opts = (
 	'V|version=i' => \$version,
 	'L|indexlevel=s' => \$indexlevel,
 	'S|skip|skip-epoch=i' => \$skip_epoch,
+	'N|skip-artnum=i' => \$skip_artnum,
 	'j|jobs=i' => \$jobs,
 );
 GetOptions(%opts) or usage();
@@ -152,7 +150,8 @@ if (defined $jobs) {
 	$creat_opt->{nproc} = $jobs;
 }
 
-PublicInbox::InboxWritable->new($ibx, $creat_opt)->init_inbox(0, $skip_epoch);
+$ibx = PublicInbox::InboxWritable->new($ibx, $creat_opt);
+$ibx->init_inbox(0, $skip_epoch, $skip_artnum);
 
 # needed for git prior to v2.1.0
 umask(0077) if defined $perm;
diff --git a/t/init.t b/t/init.t
index 94c6184e..e3e8a229 100644
--- a/t/init.t
+++ b/t/init.t
@@ -93,12 +93,38 @@ SKIP: {
 		is_deeply($gits, ["$tmpdir/skip1/git/1.git"], 'skip OK');
 	}
 
-
 	$cmd = [ '-init', '-V2', '--skip-epoch=2', 'skip2', "$tmpdir/skip2",
 		   qw(http://example.com/skip2 skip2@example.com) ];
 	ok(run_script($cmd), "--skip-epoch 2");
 	my $gits = [ glob("$tmpdir/skip2/git/*.git") ];
 	is_deeply($gits, ["$tmpdir/skip2/git/2.git"], 'skipping 2 works, too');
+
+	xsys(qw(git config), "--file=$ENV{PI_DIR}/config",
+			'publicinboxmda.spamcheck', 'none') == 0 or
+			BAIL_OUT "git config $?";
+	my $addr = 'skip3@example.com';
+	$cmd = [ qw(-init -V2 -Lbasic -N12 skip3), "$tmpdir/skip3",
+		   qw(http://example.com/skip3), $addr ];
+	ok(run_script($cmd), '--skip-artnum -V2');
+	my $env = { ORIGINAL_RECIPIENT => $addr };
+	my $mid = 'skip-artnum@example.com';
+	my $msg = "Message-ID: <$mid>\n\n";
+	my $rdr = { 0 => \$msg, 2 => \(my $err = '')  };
+	ok(run_script([qw(-mda --no-precheck)], $env, $rdr), 'deliver V2');
+	my $mm = PublicInbox::Msgmap->new_file("$tmpdir/skip3/msgmap.sqlite3");
+	my $n = $mm->num_for($mid);
+	is($n, 13, 'V2 NNTP article numbers skipped via --skip-artnum');
+
+	$addr = 'skip4@example.com';
+	$env = { ORIGINAL_RECIPIENT => $addr };
+	$cmd = [ qw(-init -V1 -N12 -Lmedium skip4), "$tmpdir/skip4",
+		   qw(http://example.com/skip4), $addr ];
+	ok(run_script($cmd), '--skip-artnum -V1');
+	ok(run_script([qw(-mda --no-precheck)], $env, $rdr), 'deliver V1');
+	$mm = PublicInbox::Msgmap->new("$tmpdir/skip4");
+	system "find $tmpdir/skip4 >&2";
+	$n = $mm->num_for($mid);
+	is($n, 13, 'V1 NNTP article numbers skipped via --skip-artnum');
 }
 
 done_testing();

* [PATCH] t/init: remove leftover find(1) call
  2020-06-21  0:21 ` [PATCH 3/3] init: add --skip-artnum parameter Eric Wong
@ 2020-06-23 18:34   ` Eric Wong
  0 siblings, 0 replies; 5+ messages in thread
From: Eric Wong @ 2020-06-23 18:34 UTC
  To: meta

Eric Wong <e@yhbt.net> wrote:
> diff --git a/t/init.t b/t/init.t
> index 94c6184e..e3e8a229 100644
> --- a/t/init.t
> +++ b/t/init.t

> +	system "find $tmpdir/skip4 >&2";

Gah :x

---------8<--------
Subject: [PATCH] t/init: remove leftover find(1) call

I used find(1) here for debugging.  The "make check-run" test
target needs to be updated to make stderr spew more obvious.
---
 t/init.t | 1 -
 1 file changed, 1 deletion(-)

diff --git a/t/init.t b/t/init.t
index e3e8a229..f4ebc2f6 100644
--- a/t/init.t
+++ b/t/init.t
@@ -122,7 +122,6 @@ SKIP: {
 	ok(run_script($cmd), '--skip-artnum -V1');
 	ok(run_script([qw(-mda --no-precheck)], $env, $rdr), 'deliver V1');
 	$mm = PublicInbox::Msgmap->new("$tmpdir/skip4");
-	system "find $tmpdir/skip4 >&2";
 	$n = $mm->num_for($mid);
 	is($n, 13, 'V1 NNTP article numbers skipped via --skip-artnum');
 }
