From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net
X-Spam-Level:
X-Spam-Status: No, score=-4.0 required=3.0 tests=ALL_TRUSTED,BAYES_00
	shortcircuit=no autolearn=ham autolearn_force=no version=3.4.2
Received: from localhost (dcvr.yhbt.net [127.0.0.1])
	by dcvr.yhbt.net (Postfix) with ESMTP id 0F5801F5AF
	for ; Sun, 21 Jun 2020 00:21:34 +0000 (UTC)
From: Eric Wong
To: meta@public-inbox.org
Subject: [PATCH 1/3] init: add -j / --jobs parameter
Date: Sun, 21 Jun 2020 00:21:31 +0000
Message-Id: <20200621002133.9090-2-e@yhbt.net>
In-Reply-To: <20200621002133.9090-1-e@yhbt.net>
References: <20200621002133.9090-1-e@yhbt.net>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
List-Id:

On a powerful (by my standards) machine with 16GB RAM and a
7200 RPM HDD marketed for "enterprise" use, indexing an 8.1G (in
git) LKML snapshot from Sep 2019 did not finish after 7 days
with the default number (3) of Xapian shards (`--jobs=4') and
`--batch-size=10m'.

Indexing starts off fast, but gets progressively slower once
the contents of the inbox (including Xapian + SQLite DBs) can
no longer be cached by the kernel.  As the on-disk size grew,
HDD seek contention between the Xapian shard workers slowed the
process down to a crawl.

With a single shard, it still took around 3.5 days to index on
the HDD.  That's not good, but it's far better than not
finishing after 7 days.  So allow unfortunate HDD users to
easily specify a single shard on public-inbox-init.

For reference, a freshly TRIM-ed low-end TLC SSD on the SATA II
bus on the same machine indexes that same snapshot of LKML in
~7 hours with 3 shards and the same 10m batch size.  In the
past, a higher-end consumer-grade MLC SSD on similar hardware
indexed a similarly sized data set in ~4 hours.
---
 Documentation/public-inbox-init.pod | 14 ++++++++++++++
 script/public-inbox-init            |  8 ++++++++
 t/v2mirror.t                        |  4 +++-
 3 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/Documentation/public-inbox-init.pod b/Documentation/public-inbox-init.pod
index 4744da96..495a258f 100644
--- a/Documentation/public-inbox-init.pod
+++ b/Documentation/public-inbox-init.pod
@@ -48,6 +48,20 @@ added-after-the-fact (without affecting "git clone" followers).
 
 Default: unset, no epochs are skipped
 
+=item -j, --jobs=JOBS
+
+Control the number of Xapian index shards in a
+C<-V2> (L) inbox.
+
+It is useful to use a single shard (C<-j1>) for inboxes on
+high-latency storage (e.g. rotational HDD) unless the system has
+enough RAM to cache 5-10x the size of the git repository.
+
+It is generally not useful to specify higher values than the
+default due to contention in the top-level producer process.
+
+Default: the number of online CPUs, up to 4
+
 =back
 
 =head1 ENVIRONMENT
diff --git a/script/public-inbox-init b/script/public-inbox-init
index 10d3ad45..00147db5 100755
--- a/script/public-inbox-init
+++ b/script/public-inbox-init
@@ -27,10 +27,12 @@ use Cwd qw/abs_path/;
 my $version = undef;
 my $indexlevel = undef;
 my $skip_epoch;
+my $jobs;
 my %opts = (
 	'V|version=i' => \$version,
 	'L|indexlevel=s' => \$indexlevel,
 	'S|skip|skip-epoch=i' => \$skip_epoch,
+	'j|jobs=i' => \$jobs,
 );
 GetOptions(%opts) or usage();
 PublicInbox::Admin::indexlevel_ok_or_die($indexlevel) if defined $indexlevel;
@@ -144,6 +146,12 @@ my $ibx = PublicInbox::Inbox->new({
 });
 
 my $creat_opt = {};
+if (defined $jobs) {
+	die "--jobs is only supported for -V2 inboxes\n" if $version == 1;
+	die "--jobs=$jobs must be >= 1\n" if $jobs <= 0;
+	$creat_opt->{nproc} = $jobs;
+}
+
 PublicInbox::InboxWritable->new($ibx, $creat_opt)->init_inbox(0, $skip_epoch);
 
 # needed for git prior to v2.1.0
diff --git a/t/v2mirror.t b/t/v2mirror.t
index fc03c3d7..b24528fe 100644
--- a/t/v2mirror.t
+++ b/t/v2mirror.t
@@ -80,9 +80,11 @@ foreach my $i (0..$epoch_max) {
 	ok(-d "$tmpdir/m/git/$i.git", "mirror $i OK");
 }
 
-@cmd = ("-init", '-V2', 'm', "$tmpdir/m", 'http://example.com/m',
+@cmd = ("-init", '-j1', '-V2', 'm', "$tmpdir/m", 'http://example.com/m',
 	'alt@example.com');
 ok(run_script(\@cmd), 'initialized public-inbox -V2');
+my @shards = glob("$tmpdir/m/xap*/?");
+is(scalar(@shards), 1, 'got a single shard on init');
 
 ok(run_script([qw(-index -j0), "$tmpdir/m"]), 'indexed');
 