unofficial mirror of meta@public-inbox.org
* [PATCH 0/5] clone|fetch: flesh out partial mirror support
@ 2021-09-24 10:56 Eric Wong
  2021-09-24 10:56 ` [PATCH 1/5] clone|--mirror: support --epoch=RANGE for partial clones Eric Wong
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: Eric Wong @ 2021-09-24 10:56 UTC
  To: meta

This series implements the --epoch=RANGE feature discussed last
week[1].  It also includes several fixes and improvements for
handling partial fetches, building on work started last week.

There's also a significant amount of work done to ensure the
client-side code works on servers running old, pre-manifest
versions of public-inbox.

I'm not sure if there are pre-manifest.js.gz versions of
public-inbox still running out there, but manifest.js.gz support
is only ~2 years old and I can understand if some admins have
been preoccupied with the pandemic and unable to upgrade :/

[1] https://public-inbox.org/meta/20210917002204.GA13112@dcvr/T/#u

Eric Wong (5):
  clone|--mirror: support --epoch=RANGE for partial clones
  fetch: fix skipping with multi-epoch inboxes
  clone|--mirror: fix and test against pre-manifest WWW
  clone|fetch|--mirror: cull manifest in partial mirrors
  fetch: support v2 w/o manifest on old WWW

 Documentation/lei-add-external.pod   |  15 +++
 Documentation/public-inbox-clone.pod |  15 +++
 lib/PublicInbox/Fetch.pm             |  27 ++++--
 lib/PublicInbox/LEI.pm               |   2 +-
 lib/PublicInbox/LeiMirror.pm         | 130 +++++++++++++++++++++++---
 lib/PublicInbox/TestCommon.pm        |   1 +
 script/public-inbox-clone            |   3 +-
 t/lei-mirror.t                       |   8 ++
 t/v2mirror.t                         | 135 +++++++++++++++++++++++++--
 9 files changed, 306 insertions(+), 30 deletions(-)


* [PATCH 1/5] clone|--mirror: support --epoch=RANGE for partial clones
From: Eric Wong @ 2021-09-24 10:56 UTC
  To: meta; +Cc: Luis Chamberlain

Partial (v2) clones should be a useful addition for users wanting
to conserve storage while having fast access to recent messages.

Continuing work started in 876e74283ff3 (fetch: ignore
non-writable epoch dirs, 2021-09-17), this creates bare,
read-only epoch git repos for the skipped epochs.  These git
repos have their remotes pre-configured, but no objects are
fetched into them.

The goal is to allow users to set the writable bit on a
previously-skipped epoch and start fetching it.

Shell completion support may not be necessary here, given how
short the epoch ranges are.
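
To make the range semantics concrete, here is a minimal
standalone sketch of how `~N' offsets and open-ended ranges could
be resolved.  resolve_epoch() and expand_range() are hypothetical
names used only for illustration; this is not the parse_epochs()
in this patch, and unlike the patch it does not check literal
integers against the epochs actually available.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the patch): resolve one side of an
# --epoch=RANGE value against the sorted list of available epochs.
# A bare integer is taken literally; "~N" counts back from the newest.
sub resolve_epoch {
	my ($spec, @avail) = @_; # @avail must be sorted ascending
	return $spec + 0 if $spec =~ /\A[0-9]+\z/;
	if ($spec =~ /\A~([0-9]+)\z/) {
		my $idx = $#avail - $1;
		die "`$spec' is out of range\n" if $idx < 0;
		return $avail[$idx];
	}
	die "`$spec' not understood\n";
}

# Expand "LOW..HIGH", "LOW..", "..HIGH" or a single value into a list
# of epoch numbers.
sub expand_range {
	my ($range, @avail) = @_;
	my ($lo, $hi);
	if ($range =~ /\A(~?[0-9]+)?\.\.(~?[0-9]+)?\z/) {
		($lo, $hi) = ($1, $2);
		defined($lo) || defined($hi) or die "empty range\n";
		$lo = defined($lo) ? resolve_epoch($lo, @avail) : $avail[0];
		$hi = defined($hi) ? resolve_epoch($hi, @avail) : $avail[-1];
	} else { # single value
		$lo = $hi = resolve_epoch($range, @avail);
	}
	$lo > $hi and die "low value (`$lo') exceeds high (`$hi')\n";
	return ($lo .. $hi);
}

my @avail = (0 .. 5); # pretend the server has six epochs
print join(',', expand_range('~0', @avail)), "\n";   # 5
print join(',', expand_range('~2..', @avail)), "\n"; # 3,4,5
print join(',', expand_range('2..4', @avail)), "\n"; # 2,3,4
```

With six epochs, `~0' selects only the newest one and `~2..'
selects the three newest, matching the examples in the
documentation added by this patch.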

Cc: Luis Chamberlain <mcgrof@kernel.org>
Link: https://public-inbox.org/meta/20210917002204.GA13112@dcvr/T/#u
---
 Documentation/lei-add-external.pod   |  15 ++++
 Documentation/public-inbox-clone.pod |  15 ++++
 lib/PublicInbox/LEI.pm               |   2 +-
 lib/PublicInbox/LeiMirror.pm         | 101 +++++++++++++++++++++++----
 script/public-inbox-clone            |   3 +-
 t/lei-mirror.t                       |   8 +++
 t/v2mirror.t                         |  25 +++++++
 7 files changed, 155 insertions(+), 14 deletions(-)

diff --git a/Documentation/lei-add-external.pod b/Documentation/lei-add-external.pod
index 9c8bde0f..1ab65a16 100644
--- a/Documentation/lei-add-external.pod
+++ b/Documentation/lei-add-external.pod
@@ -30,6 +30,21 @@ Default: 0
 
 Create C<LOCATION> by mirroring the public-inbox at C<URL>.
 
+=item --epoch=RANGE
+
+Restrict clones of L<public-inbox-v2-format(5)> inboxes to the
+given range of epochs.  The range may be a single non-negative
+integer or a (possibly open-ended) C<LOW..HIGH> range of
+non-negative integers.  C<~> may be prefixed to either (or both)
+integer values to represent the offset from the maximum possible
+value.
+
+For example, C<--epoch=~0> alone clones only the latest epoch,
+C<--epoch=~2..> clones the three latest epochs.
+
+Default: C<0..~0> or C<0..> or C<..~0>
+(all epochs, all three examples are equivalent)
+
 =item -v
 
 =item --verbose
diff --git a/Documentation/public-inbox-clone.pod b/Documentation/public-inbox-clone.pod
index fdb57663..efee01ee 100644
--- a/Documentation/public-inbox-clone.pod
+++ b/Documentation/public-inbox-clone.pod
@@ -31,6 +31,21 @@ file to speed up subsequent L<public-inbox-fetch(1)>.
 
 =over
 
+=item --epoch=RANGE
+
+Restrict clones of L<public-inbox-v2-format(5)> inboxes to the
+given range of epochs.  The range may be a single non-negative
+integer or a (possibly open-ended) C<LOW..HIGH> range of
+non-negative integers.  C<~> may be prefixed to either (or both)
+integer values to represent the offset from the maximum possible
+value.
+
+For example, C<--epoch=~0> alone clones only the latest epoch,
+C<--epoch=~2..> clones the three latest epochs.
+
+Default: C<0..~0> or C<0..> or C<..~0>
+(all epochs, all three examples are equivalent)
+
 =item -q
 
 =item --quiet
diff --git a/lib/PublicInbox/LEI.pm b/lib/PublicInbox/LEI.pm
index 96f63805..9d5a5a46 100644
--- a/lib/PublicInbox/LEI.pm
+++ b/lib/PublicInbox/LEI.pm
@@ -206,7 +206,7 @@ our %CMD = ( # sorted in order of importance/use:
 
 'add-external' => [ 'LOCATION',
 	'add/set priority of a publicinbox|extindex for extra matches',
-	qw(boost=i mirror=s inbox-version=i verbose|v+),
+	qw(boost=i mirror=s inbox-version=i epoch=s verbose|v+),
 	@c_opt, index_opt(), @net_opt ],
 'ls-external' => [ '[FILTER]', 'list publicinbox|extindex locations',
 	qw(format|f=s z|0 globoff|g invert-match|v local remote), @c_opt ],
diff --git a/lib/PublicInbox/LeiMirror.pm b/lib/PublicInbox/LeiMirror.pm
index 6bfa4b6f..53f7dd31 100644
--- a/lib/PublicInbox/LeiMirror.pm
+++ b/lib/PublicInbox/LeiMirror.pm
@@ -49,8 +49,11 @@ sub try_scrape {
 	# since this is for old instances w/o manifest.js.gz, try v1 first
 	return clone_v1($self) if grep(m!\A\Q$url\E/*\z!, @urls);
 	if (my @v2_urls = grep(m!\A\Q$url\E/[0-9]+\z!, @urls)) {
-		my %v2_uris = map { $_ => URI->new($_) } @v2_urls; # uniq
-		return clone_v2($self, [ values %v2_uris ]);
+		my %v2_epochs = map {
+			my ($n) = (m!/([0-9]+)\z!);
+			$n => URI->new($_)
+		} @v2_urls; # uniq
+		return clone_v2($self, \%v2_epochs);
 	}
 
 	# filter out common URLs served by WWW (e.g /$MSGID/T/)
@@ -189,6 +192,8 @@ sub clone_v1 {
 	my $lei = $self->{lei};
 	my $curl = $self->{curl} //= PublicInbox::LeiCurl->new($lei) or return;
 	my $uri = URI->new($self->{src});
+	defined($lei->{opt}->{epoch}) and
+		die "$uri is a v1 inbox, --epoch is not supported\n";
 	my $pfx = $curl->torsocks($lei, $uri) or return;
 	my $cmd = [ @$pfx, clone_cmd($lei, my $opt = {}),
 			$uri->as_string, $self->{dst} ];
@@ -199,22 +204,89 @@ sub clone_v1 {
 	index_cloned_inbox($self, 1);
 }
 
-sub clone_v2 {
-	my ($self, $v2_uris) = @_;
+sub parse_epochs ($$) {
+	my ($opt_epochs, $v2_epochs) = @_; # $epcohs "LOW..HIGH"
+	$opt_epochs // return; # undef => all epochs
+	my ($lo, $dotdot, $hi, @extra) = split(/(\.\.)/, $opt_epochs);
+	undef($lo) if ($lo // '') eq '';
+	my $re = qr/\A~?[0-9]+\z/;
+	if (@extra || (($lo // '0') !~ $re) ||
+			(($hi // '0') !~ $re) ||
+			!(grep(defined, $lo, $hi))) {
+		die <<EOM;
+--epoch=$opt_epochs not in the form of `LOW..HIGH', `LOW..', nor `..HIGH'
+EOM
+	}
+	my @n = sort { $a <=> $b } keys %$v2_epochs;
+	for (grep(defined, $lo, $hi)) {
+		if (/\A[0-9]+\z/) {
+			$_ > $n[-1] and die
+"`$_' exceeds maximum available epoch ($n[-1])\n";
+			$_ < $n[0] and die
+"`$_' is lower than minimum available epoch ($n[0])\n";
+		} elsif (/\A~([0-9]+)/) {
+			my $off = -$1 - 1;
+			$n[$off] // die "`$_' is out of range\n";
+			$_ = $n[$off];
+		} else { die "`$_' not understood\n" }
+	}
+	defined($lo) && defined($hi) && $lo > $hi and die
+"low value (`$lo') exceeds high (`$hi')\n";
+	$lo //= $n[0] if $dotdot;
+	$hi //= $n[-1] if $dotdot;
+	$hi //= $lo;
+	my $want = {};
+	for ($lo..$hi) {
+		if (defined $v2_epochs->{$_}) {
+			$want->{$_} = 1;
+		} else {
+			warn
+"# epoch $_ is not available (non-fatal, $lo..$hi)\n";
+		}
+	}
+	$want
+}
+
+sub init_placeholder ($$) {
+	my ($src, $edst) = @_;
+	PublicInbox::Import::init_bare($edst);
+	my $f = "$edst/config";
+	open my $fh, '>>', $f or die "open($f): $!";
+	print $fh <<EOM or die "print($f): $!";
+[remote "origin"]
+	url = $src
+	fetch = +refs/*:refs/*
+	mirror = true
+
+; This git epoch was created read-only and "public-inbox-fetch"
+; will not fetch updates for it unless write permission is added.
+EOM
+	close $fh or die "close:($f): $!";
+}
+
+sub clone_v2 ($$) {
+	my ($self, $v2_epochs) = @_;
 	my $lei = $self->{lei};
 	my $curl = $self->{curl} //= PublicInbox::LeiCurl->new($lei) or return;
-	my $pfx //= $curl->torsocks($lei, $v2_uris->[0]) or return;
+	my $pfx = $curl->torsocks($lei, (values %$v2_epochs)[0]) or return;
 	my $dst = $self->{dst};
-	my @src_edst;
-	for my $uri (@$v2_uris) {
+	my $want = parse_epochs($lei->{opt}->{epoch}, $v2_epochs);
+	my (@src_edst, @read_only);
+	for my $nr (sort { $a <=> $b } keys %$v2_epochs) {
+		my $uri = $v2_epochs->{$nr};
 		my $src = $uri->as_string;
 		my $edst = $dst;
 		$src =~ m!/([0-9]+)(?:\.git)?\z! or die <<"";
 failed to extract epoch number from $src
 
-		my $nr = $1 + 0;
+		$1 + 0 == $nr or die "BUG: <$uri> miskeyed $1 != $nr";
 		$edst .= "/git/$nr.git";
-		push @src_edst, $src, $edst;
+		if (!$want || $want->{$nr}) {
+			push @src_edst, $src, $edst;
+		} else { # create a placeholder so users only need to chmod +w
+			init_placeholder($src, $edst);
+			push @read_only, $edst;
+		}
 	}
 	my $lk = bless { lock_path => "$dst/inbox.lock" }, 'PublicInbox::Lock';
 	_try_config($self);
@@ -229,6 +301,10 @@ failed to extract epoch number from $src
 	my $mg = PublicInbox::MultiGit->new($dst, 'all.git', 'git');
 	$mg->fill_alternates;
 	for my $i ($mg->git_epochs) { $mg->epoch_cfg_set($i) }
+	for my $edst (@read_only) {
+		my @st = stat($edst) or die "stat($edst): $!";
+		chmod($st[2] & 0555, $edst) or die "chmod(a-w, $edst): $!";
+	}
 	write_makefile($self->{dst}, 2);
 	undef $on_destroy; # unlock
 	index_cloned_inbox($self, 2);
@@ -291,11 +367,12 @@ sub try_manifest {
 # @v2_epochs
 # ignoring $v1_path (use --inbox-version=1 to force v1 instead)
 EOM
-		@v2_epochs = map {
+		my %v2_epochs = map {
 			$uri->path($path_pfx.$_);
-			$uri->clone
+			my ($n) = ("$uri" =~ m!/([0-9]+)\.git\z!);
+			$n => $uri->clone
 		} @v2_epochs;
-		clone_v2($self, \@v2_epochs);
+		clone_v2($self, \%v2_epochs);
 	} elsif (defined $v1_path) {
 		clone_v1($self);
 	} else {
diff --git a/script/public-inbox-clone b/script/public-inbox-clone
index 0efde1a8..54059d03 100755
--- a/script/public-inbox-clone
+++ b/script/public-inbox-clone
@@ -13,6 +13,7 @@ usage: public-inbox-clone INBOX_URL [DESTINATION]
 
 options:
 
+  --epoch=RANGE       range of v2 epochs to clone (e.g `2..5', `~0', `~1..')
   --torsocks VAL      whether or not to wrap git and curl commands with
                       torsocks (default: `auto')
                       Must be one of: `auto', `no' or `yes'
@@ -21,7 +22,7 @@ options:
     -C DIR            chdir to specified directory
 EOF
 GetOptions($opt, qw(help|h quiet|q verbose|v+ C=s@ c=s@
-		no-torsocks torsocks=s)) or die $help;
+		no-torsocks torsocks=s epoch=s)) or die $help;
 if ($opt->{help}) { print $help; exit };
 require PublicInbox::Admin; # loads Config
 PublicInbox::Admin::do_chdir(delete $opt->{C});
diff --git a/t/lei-mirror.t b/t/lei-mirror.t
index 7dd03b26..de5246b6 100644
--- a/t/lei-mirror.t
+++ b/t/lei-mirror.t
@@ -65,6 +65,14 @@ test_lei({ tmpdir => $tmpdir }, sub {
 	lei_ok('ls-external');
 	unlike($lei_out, qr!\Q$d\E!s, 'not added to ls-external');
 
+	$d = "$home/bad-epoch";
+	ok(!lei(qw(add-external -q --epoch=0.. --mirror), "$http/t1/", $d),
+		'v1 fails on --epoch');
+	ok(!-d $d, 'destination not created on unacceptable --epoch');
+	ok(!lei(qw(add-external -q --epoch=1 --mirror), "$http/t2/", $d),
+		'v2 fails on bad epoch range');
+	ok(!-d $d, 'destination not created on bad epoch');
+
 	my %phail = (
 		HTTPS => 'https://public-inbox.org/' . 'phail',
 		ONION =>
diff --git a/t/v2mirror.t b/t/v2mirror.t
index 665a4d59..20a8daaa 100644
--- a/t/v2mirror.t
+++ b/t/v2mirror.t
@@ -263,6 +263,31 @@ if ('test read-only epoch dirs') {
 		'fetch restored objects once GIT_DIR became writable');
 }
 
+{
+	my $dst = "$tmpdir/partial";
+	run_script([qw(-clone -q --epoch=~0), "http://$host:$port/v2/", $dst]);
+	is($?, 0, 'no error from partial clone');
+	my @g = glob("$dst/git/*.git");
+	my @w = grep { -w $_ } @g;
+	my @r = grep { ! -w $_ } @g;
+	is(scalar(@w), 1, 'one writable directory');
+	my ($w) = ($w[0] =~ m!/([0-9]+)\.git\z!);
+	is((grep {
+		m!/([0-9]+)\.git\z! or xbail "no digit in $_";
+		$w > ($1 + 0)
+	} @r), scalar(@r), 'writable epoch # exceeds read-only ones');
+	run_script([qw(-fetch -q)], undef, { -C => $dst });
+	is($?, 0, 'no error from partial fetch');
+	remove_tree($dst);
+
+	run_script([qw(-clone -q --epoch=~1..),
+			"http://$host:$port/v2/", $dst]);
+	my @g2 = glob("$dst/git/*.git") ;
+	is_deeply(\@g2, \@g, 'cloned again');
+	is(scalar(grep { -w $_ } @g2), scalar(@w) + 1,
+		'got one more cloned epoch');
+}
+
 ok($td->kill, 'killed httpd');
 $td->join;
 


* [PATCH 2/5] fetch: fix skipping with multi-epoch inboxes
From: Eric Wong @ 2021-09-24 10:56 UTC
  To: meta

We need to check every epoch for writability, so don't
break out of the loop when we find a URL.
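
A simplified sketch of the loop shape after this change follows.
scan_epochs() and the `writable'/`url' hash fields are made up
for illustration and are not the data structures do_fetch uses;
the point is only that every epoch is visited, so read-only ones
are still recorded as skipped after a URL has been found.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the loop in do_fetch: %$epochs maps an
# epoch number to { writable => BOOL, url => STRING }.
sub scan_epochs {
	my ($epochs) = @_;
	my (%skip, $git_url, $epoch);
	for my $nr (sort { $b <=> $a } keys %$epochs) { # newest first
		my $e = $epochs->{$nr};
		if (!$e->{writable}) {
			$skip{$nr} = 1;
			next;
		}
		# keep looping: only the URL choice is settled, older
		# epochs still need their writability checked
		next if defined $git_url;
		if (defined $e->{url}) {
			$git_url = $e->{url};
			$epoch = $nr;
		}
	}
	return (\%skip, $git_url, $epoch);
}

my ($skip, $url, $epoch) = scan_epochs({
	0 => { writable => 0, url => 'https://example.com/inbox/0' },
	1 => { writable => 0, url => 'https://example.com/inbox/1' },
	2 => { writable => 1, url => 'https://example.com/inbox/2' },
});
print join(',', sort keys %$skip), "\n"; # 0,1
print "$url (epoch $epoch)\n";
```

Had the loop used `last' once epoch 2 supplied a URL, epochs 0
and 1 would never have been added to the skip set.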
---
 lib/PublicInbox/Fetch.pm | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/PublicInbox/Fetch.pm b/lib/PublicInbox/Fetch.pm
index 0bd6502c..464ffe12 100644
--- a/lib/PublicInbox/Fetch.pm
+++ b/lib/PublicInbox/Fetch.pm
@@ -112,10 +112,10 @@ sub do_fetch { # main entry point
 				$skip->{$nr} = 1;
 				next;
 			}
+			next if defined $git_url;
 			if (defined(my $url = remote_url($lei, $edir))) {
 				$git_url = $url;
 				$epoch = $nr;
-				last;
 			} else {
 				warn "W: $edir missing remote.origin.url\n";
 			}


* [PATCH 3/5] clone|--mirror: fix and test against pre-manifest WWW
From: Eric Wong @ 2021-09-24 10:56 UTC
  To: meta

There may still be pre-manifest.js.gz versions of PublicInbox::WWW
running and serving v2 inboxes.

Since those versions do not understand $INBOX_URL/manifest.js.gz,
they assume it is a Message-ID and 301 to
"$INBOX_URL/manifest.js.gz/" with a trailing slash, so our 404
checks were ineffective.  Update our fallbacks to deal with the
301 by catching JSON decoding errors to trigger HTML scraping.
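
The fallback can be sketched in isolation as below.  This is an
illustration only: decode_manifest_or_undef() is a hypothetical
name, it uses the core JSON::PP module directly, and it leaves
out the gunzip step that the real decode_manifest performs.

```perl
use strict;
use warnings;
use JSON::PP ();

# Hypothetical helper: return the decoded manifest hashref, or undef
# when the body is not a JSON object (e.g. an HTML page reached via
# a redirect), so the caller can fall back to scraping.
sub decode_manifest_or_undef {
	my ($body) = @_;
	my $m = eval { JSON::PP->new->decode($body) };
	return ref($m) eq 'HASH' ? $m : undef;
}

for my $body ('{"/inbox/git/0.git":{"modified":1}}', '<html><body>') {
	if (my $m = decode_manifest_or_undef($body)) {
		print "manifest with ", scalar(keys %$m), " key(s)\n";
	} else {
		print "not a manifest, fall back to scraping\n";
	}
}
```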

For HTML parsing, be sure not to be fooled by potential
user-generated content and only scan the part after the last
<hr>.
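
As a rough standalone illustration of that rule (the helper name
footer_clone_urls() and the sample HTML are invented here and do
not reproduce the exact markup old WWW versions emit):

```perl
use strict;
use warnings;

# Hypothetical helper: return clone URLs found after the last <hr>
# only, so message content earlier in the page cannot inject URLs.
sub footer_clone_urls {
	my ($html) = @_;
	my @parts = split(/<hr>/, $html);
	return ($parts[-1] =~ m!\bgit clone --mirror ([a-z\+]+://\S+)!g);
}

my $html = <<'EOM';
<pre>Subject: try git clone --mirror http://evil.example/x
</pre><hr><pre>
git clone --mirror http://example.com/inbox/0 inbox/git/0.git
git clone --mirror http://example.com/inbox/1 inbox/git/1.git
</pre>
EOM
print "$_\n" for footer_clone_urls($html);
```

Only the two example.com URLs from the footer are printed; the
URL embedded in the Subject line before the <hr> is ignored.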

We also need to avoid propagating $? from curl unnecessarily
when we can continue safely.
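
The idea of handing the child status back while clearing $? can
be shown with plain system() (a sketch only; run_status() is a
hypothetical name, and the patch itself works with waitpid in
run_reap rather than system):

```perl
use strict;
use warnings;

# Hypothetical helper: run a command, return its wait status to the
# caller and clear $? so later code inspecting $? does not mistake a
# handled child failure for an error of its own.
sub run_status {
	my (@cmd) = @_;
	system(@cmd);
	my $ret = $?;
	$? = 0;
	return $ret;
}

# $^X is the running perl binary; the child exits with code 22,
# the same exit code the curl checks in this series look for
my $st = run_status($^X, '-e', 'exit 22');
print "child exit code: ", $st >> 8, ", \$? is now $?\n";
```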

Finally, update v2mirror.t with tests to use PublicInbox::WWW
from our "v1.1.0-pre1" tag to ensure these code paths get tested.
---
 lib/PublicInbox/LeiMirror.pm  | 13 ++++--
 lib/PublicInbox/TestCommon.pm |  1 +
 t/v2mirror.t                  | 78 +++++++++++++++++++++++++++++++----
 3 files changed, 82 insertions(+), 10 deletions(-)

diff --git a/lib/PublicInbox/LeiMirror.pm b/lib/PublicInbox/LeiMirror.pm
index 53f7dd31..fe81b967 100644
--- a/lib/PublicInbox/LeiMirror.pm
+++ b/lib/PublicInbox/LeiMirror.pm
@@ -42,7 +42,8 @@ sub try_scrape {
 
 	# we grep with URL below, we don't want Subject/From headers
 	# making us clone random URLs
-	my @urls = ($html =~ m!\bgit clone --mirror ([a-z\+]+://\S+)!g);
+	my @html = split(/<hr>/, $html);
+	my @urls = ($html[-1] =~ m!\bgit clone --mirror ([a-z\+]+://\S+)!g);
 	my $url = $uri->as_string;
 	chop($url) eq '/' or die "BUG: $uri not canonicalized";
 
@@ -184,7 +185,9 @@ sub run_reap {
 	my $reap = PublicInbox::OnDestroy->new($lei->can('sigint_reap'), $pid);
 	waitpid($pid, 0) == $pid or die "waitpid @$cmd: $!";
 	@$reap = (); # cancel reap
-	$?
+	my $ret = $?;
+	$? = 0; # don't let it influence normal exit
+	$ret;
 }
 
 sub clone_v1 {
@@ -358,7 +361,11 @@ sub try_manifest {
 		return try_scrape($self) if ($cerr >> 8) == 22; # 404 missing
 		return $lei->child_error($cerr, "@$cmd failed");
 	}
-	my $m = decode_manifest($ft, $fn, $uri);
+	my $m = eval { decode_manifest($ft, $fn, $uri) };
+	if ($@) {
+		warn $@;
+		return try_scrape($self);
+	}
 	my ($path_pfx, $v1_path, @v2_epochs) = deduce_epochs($m, $path);
 	if (@v2_epochs) {
 		# It may be possible to have v1 + v2 in parallel someday:
diff --git a/lib/PublicInbox/TestCommon.pm b/lib/PublicInbox/TestCommon.pm
index aff34853..cd706e0e 100644
--- a/lib/PublicInbox/TestCommon.pm
+++ b/lib/PublicInbox/TestCommon.pm
@@ -469,6 +469,7 @@ sub start_script {
 			$ENV{LISTEN_PID} = $$;
 			$ENV{LISTEN_FDS} = $fds;
 		}
+		if ($opt->{-C}) { chdir($opt->{-C}) or die "chdir: $!" }
 		$0 = join(' ', @$cmd);
 		if ($sub) {
 			eval { PublicInbox::DS->Reset };
diff --git a/t/v2mirror.t b/t/v2mirror.t
index 20a8daaa..1231b72d 100644
--- a/t/v2mirror.t
+++ b/t/v2mirror.t
@@ -5,6 +5,7 @@ use v5.10.1;
 use PublicInbox::TestCommon;
 use File::Path qw(remove_tree make_path);
 use Cwd qw(abs_path);
+use PublicInbox::Spawn qw(which);
 require_git(2.6);
 require_cmd('curl');
 local $ENV{HOME} = abs_path('t');
@@ -23,7 +24,8 @@ my $pi_config = "$tmpdir/config";
 	open my $fh, '>', $pi_config or die "open($pi_config): $!";
 	print $fh <<"" or die "print $pi_config: $!";
 [publicinbox "v2"]
-	inboxdir = $tmpdir/in
+; using "mainrepo" rather than "inboxdir" for v1.1.0-pre1 WWW compat below
+	mainrepo = $tmpdir/in
 	address = test\@example.com
 
 	close $fh or die "close($pi_config): $!";
@@ -62,11 +64,11 @@ $v2w->done;
 }
 $ibx->cleanup;
 
-my $sock = tcp_server();
+local $ENV{TEST_IPV4_ONLY} = 1; # plackup (below) doesn't do IPv6
+my $rdr = { 3 => tcp_server() };
 my @cmd = ('-httpd', '-W0', "--stdout=$tmpdir/out", "--stderr=$tmpdir/err");
-my $td = start_script(\@cmd, undef, { 3 => $sock });
-my ($host, $port) = tcp_host_port($sock);
-$sock = undef;
+my $td = start_script(\@cmd, undef, $rdr);
+my ($host, $port) = tcp_host_port(delete $rdr->{3});
 
 @cmd = (qw(-clone -q), "http://$host:$port/v2/", "$tmpdir/m");
 run_script(\@cmd) or xbail '-clone';
@@ -288,7 +290,69 @@ if ('test read-only epoch dirs') {
 		'got one more cloned epoch');
 }
 
-ok($td->kill, 'killed httpd');
-$td->join;
+my $err = '';
+my $v110 = xqx([qw(git rev-parse v1.1.0-pre1)], undef, { 2 => \$err });
+SKIP: {
+	skip("no detected public-inbox GIT_DIR ($err)", 1) if $?;
+	# using plackup to test old PublicInbox::WWW since -httpd from
+	# back then relied on some packages we no longer depend on
+	my $plackup = which('plackup') or skip('no plackup in path', 1);
+	require PublicInbox::Lock;
+	chomp $v110;
+	my ($base) = ($0 =~ m!\b([^/]+)\.[^\.]+\z!);
+	my $wt = "t/data-gen/$base.pre-manifest";
+	my $lk = bless { lock_path => __FILE__ }, 'PublicInbox::Lock';
+	$lk->lock_acquire;
+	my $psgi = "$wt/app.psgi";
+	if (!-f $psgi) { # checkout a pre-manifest.js.gz version
+		my $t = File::Temp->new(TEMPLATE => 'g-XXXX', TMPDIR => 1);
+		my $env = { GIT_INDEX_FILE => $t->filename };
+		xsys([qw(git read-tree), $v110], $env) and xbail 'read-tree';
+		xsys([qw(git checkout-index -a), "--prefix=$wt/"], $env)
+			and xbail 'checkout-index';
+		my $f = "$wt/app.psgi.tmp.$$";
+		open my $fh, '>', $f or xbail $!;
+		print $fh <<'EOM' or xbail $!;
+use Plack::Builder;
+use PublicInbox::WWW;
+my $www = PublicInbox::WWW->new;
+builder { enable 'Head'; sub { $www->call(@_) } }
+EOM
+		close $fh or xbail $!;
+		rename($f, $psgi) or xbail $!;
+	}
+	$lk->lock_release;
+
+	$rdr->{run_mode} = 0;
+	$rdr->{-C} = $wt;
+	my $cmd = [$plackup, qw(-Enone -Ilib), "--host=$host", "--port=$port"];
+	$td->join('TERM');
+	open $rdr->{2}, '>>', "$tmpdir/plackup.err.log" or xbail "open: $!";
+	open $rdr->{1}, '>>&', $rdr->{2} or xbail "open: $!";
+	$td = start_script($cmd, { PERL5LIB => 'lib' }, $rdr);
+	# wait for plackup socket()+bind()+listen()
+	my %opt = ( Proto => 'tcp', Type => Socket::SOCK_STREAM(),
+		PeerAddr => "$host:$port" );
+	for (0..50) {
+		tick();
+		last if IO::Socket::INET->new(%opt);
+	}
+	my $dst = "$tmpdir/scrape";
+	@cmd = (qw(-clone -q), "http://$host:$port/v2", $dst);
+	run_script(\@cmd, undef, { 2 => \(my $err = '') });
+	is($?, 0, 'scraping clone on old PublicInbox::WWW')
+		or diag $err;
+	my @g_all = glob("$dst/git/*.git");
+	ok(scalar(@g_all) > 1, 'cloned multiple epochs');
+
+	remove_tree($dst);
+	@cmd = (qw(-clone -q --epoch=~0), "http://$host:$port/v2", $dst);
+	run_script(\@cmd, undef, { 2 => \($err = '') });
+	is($?, 0, 'partial scraping clone on old PublicInbox::WWW');
+	my @g_last = grep { -w $_ } glob("$dst/git/*.git");
+	is_deeply(\@g_last, [ $g_all[-1] ], 'partial clone of ~0 worked');
+
+	$td->join('TERM');
+}
 
 done_testing;


* [PATCH 4/5] clone|fetch|--mirror: cull manifest in partial mirrors
From: Eric Wong @ 2021-09-24 10:56 UTC
  To: meta

This makes it easier for users to enable fetching on a
previously read-only epoch.  Prior to this change, users were
required to delete manifest.js.gz in addition to adding the
writable bit.  Now, they just have to "chmod +w $EPOCH_DIR".
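
The culling step can be sketched on its own as follows.
cull_manifest() is a hypothetical name, and the manifest keys are
made-up examples of the /$name/git/$N.git form.  The sketch wraps
the joined epoch numbers in a non-capturing group so the
alternation cannot bind more loosely than the surrounding path;
that grouping is a choice made for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: drop manifest entries for skipped epochs so a
# later fetch sees them as new once their directory is made writable.
sub cull_manifest {
	my ($m, @skip_nr) = @_;
	return $m unless @skip_nr;
	my $re = join('|', map { quotemeta } @skip_nr);
	my @del = grep(m!/git/(?:$re)\.git\z!, keys %$m);
	delete @$m{@del};
	return $m;
}

my $m = {
	'/inbox/git/0.git' => { modified => 1 },
	'/inbox/git/1.git' => { modified => 2 },
	'/inbox/git/2.git' => { modified => 3 },
};
cull_manifest($m, 0, 1); # epochs 0 and 1 were skipped
print join(',', sort keys %$m), "\n"; # /inbox/git/2.git
```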
---
 lib/PublicInbox/Fetch.pm     | 17 +++++++++++++++--
 lib/PublicInbox/LeiMirror.pm | 24 ++++++++++++++++++++----
 t/v2mirror.t                 | 24 ++++++++++++++++++++++++
 3 files changed, 59 insertions(+), 6 deletions(-)

diff --git a/lib/PublicInbox/Fetch.pm b/lib/PublicInbox/Fetch.pm
index 464ffe12..7f60b619 100644
--- a/lib/PublicInbox/Fetch.pm
+++ b/lib/PublicInbox/Fetch.pm
@@ -12,6 +12,8 @@ use PublicInbox::LEI;
 use PublicInbox::LeiCurl;
 use PublicInbox::LeiMirror;
 use File::Temp ();
+use PublicInbox::Config;
+use IO::Compress::Gzip qw(gzip $GzipError);
 
 sub new { bless {}, __PACKAGE__ }
 
@@ -81,7 +83,7 @@ sub do_manifest ($$$) {
 	}
 	my (undef, $v1_path, @v2_epochs) =
 		PublicInbox::LeiMirror::deduce_epochs($mdiff, $ibx_uri->path);
-	[ 200, $v1_path, \@v2_epochs, $muri, $ft, $mf ];
+	[ 200, $v1_path, \@v2_epochs, $muri, $ft, $mf, $m1 ];
 }
 
 sub get_fingerprint2 {
@@ -133,7 +135,7 @@ EOM
 	PublicInbox::LeiMirror::write_makefile($dir, $ibx_ver);
 	$lei->qerr("# inbox URL: $ibx_uri/");
 	my $res = do_manifest($lei, $dir, $ibx_uri) or return;
-	my ($code, $v1_path, $v2_epochs, $muri, $ft, $mf) = @$res;
+	my ($code, $v1_path, $v2_epochs, $muri, $ft, $mf, $m1) = @$res;
 	if ($code == 404) {
 		# any pre-manifest.js.gz instances running? Just fetch all
 		# existing ones and unconditionally try cloning the next
@@ -145,6 +147,7 @@ EOM
 	} else {
 		$code == 200 or die "BUG unexpected code $code\n";
 	}
+	my $mculled;
 	if ($ibx_ver == 2) {
 		defined($v1_path) and warn <<EOM;
 E: got v1 `$v1_path' when expecting v2 epoch(s) in <$muri>, WTF?
@@ -153,6 +156,12 @@ EOM
 				my ($nr) = (m!/([0-9]+)\.git\z!g);
 				$skip->{$nr} ? () : $nr;
 			} @$v2_epochs;
+		if ($m1 && scalar keys %$skip) {
+			my $re = join('|', keys %$skip);
+			my @del = grep(m!/git/$re\.git\z!, keys %$m1);
+			delete @$m1{@del};
+			$mculled = 1;
+		}
 	} else {
 		$git_dir[0] = $dir;
 	}
@@ -193,6 +202,10 @@ EOM
 	for my $i (@new_epoch) { $mg->epoch_cfg_set($i) }
 	if ($ft) {
 		my $fn = $ft->filename;
+		if ($mculled) {
+			my $json = PublicInbox::Config->json->encode($m1);
+			gzip(\$json => $fn) or die "gzip: $GzipError";
+		}
 		rename($fn, $mf) or die "E: rename($fn, $mf): $!\n";
 		$ft->unlink_on_destroy(0);
 	}
diff --git a/lib/PublicInbox/LeiMirror.pm b/lib/PublicInbox/LeiMirror.pm
index fe81b967..1ab5e0d8 100644
--- a/lib/PublicInbox/LeiMirror.pm
+++ b/lib/PublicInbox/LeiMirror.pm
@@ -6,7 +6,9 @@ package PublicInbox::LeiMirror;
 use strict;
 use v5.10.1;
 use parent qw(PublicInbox::IPC);
+use PublicInbox::Config;
 use IO::Uncompress::Gunzip qw(gunzip $GunzipError);
+use IO::Compress::Gzip qw(gzip $GzipError);
 use PublicInbox::Spawn qw(popen_rd spawn run_die);
 use File::Temp ();
 use Fcntl qw(SEEK_SET O_CREAT O_EXCL O_WRONLY);
@@ -267,14 +269,14 @@ EOM
 	close $fh or die "close:($f): $!";
 }
 
-sub clone_v2 ($$) {
-	my ($self, $v2_epochs) = @_;
+sub clone_v2 ($$;$) {
+	my ($self, $v2_epochs, $m) = @_; # $m => manifest.js.gz hashref
 	my $lei = $self->{lei};
 	my $curl = $self->{curl} //= PublicInbox::LeiCurl->new($lei) or return;
 	my $pfx = $curl->torsocks($lei, (values %$v2_epochs)[0]) or return;
 	my $dst = $self->{dst};
 	my $want = parse_epochs($lei->{opt}->{epoch}, $v2_epochs);
-	my (@src_edst, @read_only);
+	my (@src_edst, @read_only, @skip_nr);
 	for my $nr (sort { $a <=> $b } keys %$v2_epochs) {
 		my $uri = $v2_epochs->{$nr};
 		my $src = $uri->as_string;
@@ -289,8 +291,15 @@ failed to extract epoch number from $src
 		} else { # create a placeholder so users only need to chmod +w
 			init_placeholder($src, $edst);
 			push @read_only, $edst;
+			push @skip_nr, $nr;
 		}
 	}
+	if (@skip_nr) { # filter out the epochs we skipped
+		my $re = join('|', @skip_nr);
+		my @del = grep(m!/git/$re\.git\z!, keys %$m);
+		delete @$m{@del};
+		$self->{-culled_manifest} = 1;
+	}
 	my $lk = bless { lock_path => "$dst/inbox.lock" }, 'PublicInbox::Lock';
 	_try_config($self);
 	my $on_destroy = $lk->lock_for_scope($$);
@@ -379,13 +388,20 @@ EOM
 			my ($n) = ("$uri" =~ m!/([0-9]+)\.git\z!);
 			$n => $uri->clone
 		} @v2_epochs;
-		clone_v2($self, \%v2_epochs);
+		clone_v2($self, \%v2_epochs, $m);
 	} elsif (defined $v1_path) {
 		clone_v1($self);
 	} else {
 		die "E: confused by <$uri>, possible matches:\n\t",
 			join(', ', sort keys %$m), "\n";
 	}
+	if (delete $self->{-culled_manifest}) { # set by clone_v2
+		# write the smaller manifest if epochs were skipped so
+		# users won't have to delete manifest if they +w an
+		# epoch they no longer want to skip
+		my $json = PublicInbox::Config->json->encode($m);
+		gzip(\$json => $fn) or die "gzip: $GzipError";
+	}
 	my $fin = "$self->{dst}/manifest.js.gz";
 	rename($fn, $fin) or die "E: rename($fn, $fin): $!";
 	$ft->unlink_on_destroy(0);
diff --git a/t/v2mirror.t b/t/v2mirror.t
index 1231b72d..fa4a717d 100644
--- a/t/v2mirror.t
+++ b/t/v2mirror.t
@@ -9,6 +9,7 @@ use PublicInbox::Spawn qw(which);
 require_git(2.6);
 require_cmd('curl');
 local $ENV{HOME} = abs_path('t');
+use IO::Uncompress::Gunzip qw(gunzip $GunzipError);
 
 # Integration tests for HTTP cloning + mirroring
 require_mods(qw(Plack::Util Plack::Builder
@@ -288,6 +289,29 @@ if ('test read-only epoch dirs') {
 	is_deeply(\@g2, \@g, 'cloned again');
 	is(scalar(grep { -w $_ } @g2), scalar(@w) + 1,
 		'got one more cloned epoch');
+
+	# make 0.git writable and fetch into it, relies on culled manifest
+	chmod(0755, $g2[0]) or xbail "chmod: $!";
+	my @before = glob("$g2[0]/objects/*/*");
+	run_script([qw(-fetch -q)], undef, { -C => $dst });
+	is($?, 0, 'no error from partial fetch');
+	my @after = glob("$g2[0]/objects/*/*");
+	ok(scalar(@before) < scalar(@after), 'fetched after chmod 0755 0.git');
+
+	# ensure culled manifest is maintained after fetch
+	gunzip("$dst/manifest.js.gz" => \(my $m), MultiStream => 1) or
+		xbail "gunzip: $GunzipError";
+	$m = PublicInbox::Config->json->decode($m);
+	for my $k (keys %$m) { # /$name/git/$N.git
+		my ($nr) = ($k =~ m!/git/([0-9]+)\.git\z!);
+		ok(-w "$dst/git/$nr.git", "writable $nr.git in manifest");
+	}
+	for my $ro (grep { !-w $_ } @g2) {
+		my ($nr) = ($ro =~ m!/git/([0-9]+)\.git\z!);
+		is(grep(m!/git/$nr\.git\z!, keys %$m), 0,
+			"read-only $nr.git not in manifest")
+			or xbail([sort keys %$m]);
+	}
 }
 
 my $err = '';


* [PATCH 5/5] fetch: support v2 w/o manifest on old WWW
From: Eric Wong @ 2021-09-24 10:56 UTC
  To: meta

There may still be pre-manifest.js.gz versions of
PublicInbox::WWW running and serving v2 inboxes.

While -clone and "add-external --mirror" were working, -fetch
was failing due to a 301 redirect to $INBOX_URL/manifest.js.gz/
instead of the expected 404.  Update the code to deal with a JSON
decode error (from the 301) and ensure v2 epoch detection is
correct (and not using a shadowed variable).
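
For readers unfamiliar with the shadowing pitfall, a generic
illustration follows.  The sub names are hypothetical and the
code is unrelated to the layout of do_fetch beyond the pattern of
an inner `my' hiding an outer variable of the same name.

```perl
use strict;
use warnings;

sub shadowed {
	my @epochs;
	{
		# a new lexical which hides the outer @epochs and is
		# gone at the end of this block
		my @epochs = (0, 1, 2);
	}
	return scalar(@epochs); # outer array was never filled
}

sub fixed {
	my @epochs;
	{
		@epochs = (0, 1, 2); # assigns to the outer array
	}
	return scalar(@epochs);
}

print shadowed(), "\n"; # 0
print fixed(), "\n";    # 3
```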
---
 lib/PublicInbox/Fetch.pm | 12 +++++++-----
 t/v2mirror.t             |  8 ++++++++
 2 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/lib/PublicInbox/Fetch.pm b/lib/PublicInbox/Fetch.pm
index 7f60b619..7881b402 100644
--- a/lib/PublicInbox/Fetch.pm
+++ b/lib/PublicInbox/Fetch.pm
@@ -60,11 +60,13 @@ sub do_manifest ($$$) {
 	$opt->{$_} = $lei->{$_} for (0..2);
 	my $cerr = PublicInbox::LeiMirror::run_reap($lei, $curl_cmd, $opt);
 	if ($cerr) {
-		return [ 404 ] if ($cerr >> 8) == 22; # 404 Missing
+		return [ 404, $muri ] if ($cerr >> 8) == 22; # 404 Missing
 		$lei->child_error($cerr, "@$curl_cmd failed");
 		return;
 	}
-	my $m1 = PublicInbox::LeiMirror::decode_manifest($ft, $fn, $muri);
+	my $m1 = eval {
+		PublicInbox::LeiMirror::decode_manifest($ft, $fn, $muri);
+	} or return [ 404, $muri ];
 	my $mdiff = { %$m1 };
 
 	# filter out unchanged entries.  We check modified, too, since
@@ -83,7 +85,7 @@ sub do_manifest ($$$) {
 	}
 	my (undef, $v1_path, @v2_epochs) =
 		PublicInbox::LeiMirror::deduce_epochs($mdiff, $ibx_uri->path);
-	[ 200, $v1_path, \@v2_epochs, $muri, $ft, $mf, $m1 ];
+	[ 200, $muri, $v1_path, \@v2_epochs, $ft, $mf, $m1 ];
 }
 
 sub get_fingerprint2 {
@@ -106,7 +108,7 @@ sub do_fetch { # main entry point
 	} else { # v2:
 		require PublicInbox::MultiGit;
 		$mg = PublicInbox::MultiGit->new($dir, 'all.git', 'git');
-		my @epochs = $mg->git_epochs;
+		@epochs = $mg->git_epochs;
 		my ($git_url, $epoch);
 		for my $nr (@epochs) { # try newest epoch, first
 			my $edir = "$dir/git/$nr.git";
@@ -135,7 +137,7 @@ EOM
 	PublicInbox::LeiMirror::write_makefile($dir, $ibx_ver);
 	$lei->qerr("# inbox URL: $ibx_uri/");
 	my $res = do_manifest($lei, $dir, $ibx_uri) or return;
-	my ($code, $v1_path, $v2_epochs, $muri, $ft, $mf, $m1) = @$res;
+	my ($code, $muri, $v1_path, $v2_epochs, $ft, $mf, $m1) = @$res;
 	if ($code == 404) {
 		# any pre-manifest.js.gz instances running? Just fetch all
 		# existing ones and unconditionally try cloning the next
diff --git a/t/v2mirror.t b/t/v2mirror.t
index fa4a717d..a625646d 100644
--- a/t/v2mirror.t
+++ b/t/v2mirror.t
@@ -376,6 +376,14 @@ EOM
 	my @g_last = grep { -w $_ } glob("$dst/git/*.git");
 	is_deeply(\@g_last, [ $g_all[-1] ], 'partial clone of ~0 worked');
 
+	chmod(0755, $g_all[0]) or xbail "chmod $!";
+	my @before = glob("$g_all[0]/objects/*/*");
+	run_script([qw(-fetch -v)], undef, { -C => $dst, 2 => \($err = '') });
+	is($?, 0, 'scraping fetch on old PublicInbox::WWW') or diag $err;
+	my @after = glob("$g_all[0]/objects/*/*");
+	ok(scalar(@before) < scalar(@after),
+		'fetched 0.git after enabling write-bit');
+
 	$td->join('TERM');
 }
 

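The shadowed-variable fix in do_fetch can be illustrated with a small
standalone sketch.  The function names and epoch numbers are made up
for illustration; only the `my @epochs' vs `@epochs' difference
mirrors the diff above:

```perl
use strict;
use warnings;

sub count_epochs_buggy {
	my @epochs; # outer list, read after the block
	{
		# `my' declares a new lexical which shadows the outer
		# list and disappears at the end of this block
		my @epochs = (2, 1, 0);
	}
	scalar(@epochs); # the outer list was never filled
}

sub count_epochs_fixed {
	my @epochs;
	{
		@epochs = (2, 1, 0); # assigns to the outer list
	}
	scalar(@epochs);
}

print count_epochs_buggy(), "\n"; # prints 0
print count_epochs_fixed(), "\n"; # prints 3
```

Perl does not warn here since the second `my' is in a nested scope,
which is presumably how the original slipped through.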

* [PATCH 6/5] t/v2mirror: check dependencies for legacy test
  2021-09-24 10:56 [PATCH 0/5] clone|fetch: flesh out partial mirror support Eric Wong
                   ` (4 preceding siblings ...)
  2021-09-24 10:56 ` [PATCH 5/5] fetch: support v2 w/o manifest on old WWW Eric Wong
@ 2021-09-25  3:21 ` Eric Wong
  5 siblings, 0 replies; 7+ messages in thread
From: Eric Wong @ 2021-09-25  3:21 UTC (permalink / raw)
  To: meta

We still need Email::MIME to test against old revisions.
We'll also test against the revision just prior to the
introduction of manifest.js.gz to avoid loading Danga::Socket,
since it was getting loaded even with `plackup'.

Finally, we'll disable Inline::C usage with old Spawn.pm
since our old code included alloca.h, which is not
portable to FreeBSD.
---
 t/v2mirror.t | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/t/v2mirror.t b/t/v2mirror.t
index a625646d..63d17ebf 100644
--- a/t/v2mirror.t
+++ b/t/v2mirror.t
@@ -315,23 +315,26 @@ if ('test read-only epoch dirs') {
 }
 
 my $err = '';
-my $v110 = xqx([qw(git rev-parse v1.1.0-pre1)], undef, { 2 => \$err });
+my $oldrev = '0b3e19584c90d958a723ac2d3dec3f84f5513688~1';
+# 3e0e596105198cfa (wwwlisting: allow hiding entries from manifest, 2019-06-09)
+$oldrev = xqx([qw(git rev-parse), $oldrev], undef, { 2 => \$err });
 SKIP: {
 	skip("no detected public-inbox GIT_DIR ($err)", 1) if $?;
+	require_mods('Email::MIME', 1); # for legacy revision
 	# using plackup to test old PublicInbox::WWW since -httpd from
 	# back then relied on some packages we no longer depend on
 	my $plackup = which('plackup') or skip('no plackup in path', 1);
 	require PublicInbox::Lock;
-	chomp $v110;
+	chomp $oldrev;
 	my ($base) = ($0 =~ m!\b([^/]+)\.[^\.]+\z!);
-	my $wt = "t/data-gen/$base.pre-manifest";
+	my $wt = "t/data-gen/$base.pre-manifest-$oldrev";
 	my $lk = bless { lock_path => __FILE__ }, 'PublicInbox::Lock';
 	$lk->lock_acquire;
 	my $psgi = "$wt/app.psgi";
 	if (!-f $psgi) { # checkout a pre-manifest.js.gz version
 		my $t = File::Temp->new(TEMPLATE => 'g-XXXX', TMPDIR => 1);
 		my $env = { GIT_INDEX_FILE => $t->filename };
-		xsys([qw(git read-tree), $v110], $env) and xbail 'read-tree';
+		xsys([qw(git read-tree), $oldrev], $env) and xbail 'read-tree';
 		xsys([qw(git checkout-index -a), "--prefix=$wt/"], $env)
 			and xbail 'checkout-index';
 		my $f = "$wt/app.psgi.tmp.$$";
@@ -353,7 +356,8 @@ EOM
 	$td->join('TERM');
 	open $rdr->{2}, '>>', "$tmpdir/plackup.err.log" or xbail "open: $!";
 	open $rdr->{1}, '>>&', $rdr->{2} or xbail "open: $!";
-	$td = start_script($cmd, { PERL5LIB => 'lib' }, $rdr);
+	my $env = { PERL5LIB => 'lib', PERL_INLINE_DIRECTORY => undef };
+	$td = start_script($cmd, $env, $rdr);
 	# wait for plackup socket()+bind()+listen()
 	my %opt = ( Proto => 'tcp', Type => Socket::SOCK_STREAM(),
 		PeerAddr => "$host:$port" );
@@ -363,7 +367,7 @@ EOM
 	}
 	my $dst = "$tmpdir/scrape";
 	@cmd = (qw(-clone -q), "http://$host:$port/v2", $dst);
-	run_script(\@cmd, undef, { 2 => \(my $err = '') });
+	run_script(\@cmd, undef, { 2 => \($err = '') });
 	is($?, 0, 'scraping clone on old PublicInbox::WWW')
 		or diag $err;
 	my @g_all = glob("$dst/git/*.git");


Thread overview: 7+ messages
2021-09-24 10:56 [PATCH 0/5] clone|fetch: flesh out partial mirror support Eric Wong
2021-09-24 10:56 ` [PATCH 1/5] clone|--mirror: support --epoch=RANGE for partial clones Eric Wong
2021-09-24 10:56 ` [PATCH 2/5] fetch: fix skipping with multi-epoch inboxes Eric Wong
2021-09-24 10:56 ` [PATCH 3/5] clone|--mirror: fix and test against pre-manifest WWW Eric Wong
2021-09-24 10:56 ` [PATCH 4/5] clone|fetch|--mirror: cull manifest in partial mirrors Eric Wong
2021-09-24 10:56 ` [PATCH 5/5] fetch: support v2 w/o manifest on old WWW Eric Wong
2021-09-25  3:21 ` [PATCH 6/5] t/v2mirror: check dependencies for legacy test Eric Wong
