unofficial mirror of meta@public-inbox.org
 help / color / mirror / Atom feed
* [PATCH 00/27] v2 public-inbox-watch support
@ 2018-03-19  8:14 Eric Wong (Contractor, The Linux Foundation)
  2018-03-19  8:14 ` [PATCH 01/27] content_id: use Sender header if From is not available Eric Wong (Contractor, The Linux Foundation)
                   ` (26 more replies)
  0 siblings, 27 replies; 29+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-19  8:14 UTC (permalink / raw)
  To: meta

barrier support in v2writable makes checking duplicates during
large imports faster as it no longer requires tearing down
subprocesses.  This also makes checking deletes a bit faster
(we check deletes to prevent blindly writing the same deleted
 files over and over again)

On the subject of subprocesses for v2, they're now optional
(per-invocation), so -init obviously won't need to spawn
subprocesses, and normal -watch usage shouldn't, either, as mail
doesn't tend to arrive quickly.

public-inbox-watch seems to be working well enough on my desktop
with my Maildir for LKML (synched via offlineimap).

Eric Wong (Contractor, The Linux Foundation) (27):
  content_id: use Sender header if From is not available
  v2writable: support "barrier" operation to avoid reforking
  use string ref for Email::Simple->new
  v2writable: remove unnecessary idx_init call
  searchidx: do not delete documents while iterating
  search: allow ->reopen to be chainable
  v2writable: implement remove correctly
  skeleton: barrier init requires a lock
  import: (v2) delete writes the blob into history in subdir
  import: (v2): write deletes to a separate '_' subdirectory
  import: implement barrier operation for v1 repos
  mid: mid_mime uses v2-compatible mids function
  watchmaildir: use content_digest to generate Message-Id
  import: force Message-ID generation for v1 here
  import: switch to URL-safe Base64 for Message-IDs
  v2writable: test for idempotent removals
  import: enable locking under v2
  index: s/GIT_DIR/REPO_DIR/
  Lock: new base class for writable lockers
  t/watch_maildir: note the reason for FIFO creation
  v2writable: ensure ->done is idempotent
  watchmaildir: support v2 repositories
  searchidxpart: s/barrier/remote_barrier/
  v2writable: allow disabling parallelization
  scripts/import_vger_from_mbox: filter out same headers as MDA
  v2writable: add DEBUG_DIFF env support
  v2writable: remove "resent" message for duplicate Message-IDs

 Documentation/public-inbox-index.pod |   8 +-
 MANIFEST                             |   2 +
 lib/PublicInbox/ContentId.pm         |  16 +++-
 lib/PublicInbox/Import.pm            | 132 ++++++++++++++++++-----------
 lib/PublicInbox/Lock.pm              |  31 +++++++
 lib/PublicInbox/MID.pm               |   3 +-
 lib/PublicInbox/Msgmap.pm            |   8 ++
 lib/PublicInbox/Search.pm            |   1 +
 lib/PublicInbox/SearchIdx.pm         | 159 +++++++++++++++++++++++------------
 lib/PublicInbox/SearchIdxPart.pm     |  62 +++++++++-----
 lib/PublicInbox/SearchIdxSkeleton.pm | 112 +++++++++++++++++-------
 lib/PublicInbox/SearchMsg.pm         |   4 +-
 lib/PublicInbox/V2Writable.pm        | 136 ++++++++++++++++++++++++------
 lib/PublicInbox/WatchMaildir.pm      |  54 ++++--------
 script/public-inbox-index            |  12 +--
 script/public-inbox-init             |   3 +-
 scripts/import_vger_from_mbox        |   2 +
 t/v2writable.t                       |  51 +++++++++--
 t/watch_maildir.t                    |   3 +-
 t/watch_maildir_v2.t                 | 125 +++++++++++++++++++++++++++
 20 files changed, 683 insertions(+), 241 deletions(-)
 create mode 100644 lib/PublicInbox/Lock.pm
 create mode 100644 t/watch_maildir_v2.t

-- 
EW

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 01/27] content_id: use Sender header if From is not available
  2018-03-19  8:14 [PATCH 00/27] v2 public-inbox-watch support Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-19  8:14 ` Eric Wong (Contractor, The Linux Foundation)
  2018-03-19  8:14 ` [PATCH 02/27] v2writable: support "barrier" operation to avoid reforking Eric Wong (Contractor, The Linux Foundation)
                   ` (25 subsequent siblings)
  26 siblings, 0 replies; 29+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-19  8:14 UTC (permalink / raw)
  To: meta

We will be using Sender: in more places if the From: header
is not available, this is one of them.

Followup-to: ("import: fall back to Sender for extracting name and email")
---
 lib/PublicInbox/ContentId.pm | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/lib/PublicInbox/ContentId.pm b/lib/PublicInbox/ContentId.pm
index 8347de2..9082b76 100644
--- a/lib/PublicInbox/ContentId.pm
+++ b/lib/PublicInbox/ContentId.pm
@@ -11,9 +11,6 @@ use PublicInbox::MID qw(mids references);
 # not sure if less-widely supported hash families are worth bothering with
 use Digest::SHA;
 
-# Content-* headers are often no-ops, so maybe we don't need them
-my @ID_HEADERS = qw(Subject From Date To Cc);
-
 sub content_digest ($) {
 	my ($mime) = @_;
 	my $dig = Digest::SHA->new(256);
@@ -31,7 +28,18 @@ sub content_digest ($) {
 		next if $seen{$mid};
 		$dig->add('ref: '.$mid);
 	}
-	foreach my $h (@ID_HEADERS) {
+
+	# Only use Sender: if From is not present
+	foreach my $h (qw(From Sender)) {
+		my @v = $hdr->header_raw($h);
+		if (@v) {
+			$dig->add("$h: $_") foreach @v;
+			last;
+		}
+	}
+
+	# Content-* headers are often no-ops, so maybe we don't need them
+	foreach my $h (qw(Subject Date To Cc)) {
 		my @v = $hdr->header_raw($h);
 		$dig->add("$h: $_") foreach @v;
 	}
-- 
EW


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 02/27] v2writable: support "barrier" operation to avoid reforking
  2018-03-19  8:14 [PATCH 00/27] v2 public-inbox-watch support Eric Wong (Contractor, The Linux Foundation)
  2018-03-19  8:14 ` [PATCH 01/27] content_id: use Sender header if From is not available Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-19  8:14 ` Eric Wong (Contractor, The Linux Foundation)
  2018-03-19  8:14 ` [PATCH 03/27] use string ref for Email::Simple->new Eric Wong (Contractor, The Linux Foundation)
                   ` (24 subsequent siblings)
  26 siblings, 0 replies; 29+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-19  8:14 UTC (permalink / raw)
  To: meta

Stopping and starting a bunch of processes to look up duplicates
or removals is inefficient.  Take advantage of checkpointing
in "git fast-import" and transactions in Xapian and SQLite.
---
 lib/PublicInbox/Import.pm            | 10 ++++++++-
 lib/PublicInbox/SearchIdxPart.pm     | 12 ++++++++++
 lib/PublicInbox/SearchIdxSkeleton.pm | 43 ++++++++++++++++++++++++++++++++----
 lib/PublicInbox/V2Writable.pm        | 34 +++++++++++++++++++++++++++-
 t/v2writable.t                       |  3 ++-
 5 files changed, 95 insertions(+), 7 deletions(-)

diff --git a/lib/PublicInbox/Import.pm b/lib/PublicInbox/Import.pm
index 664bec6..8406c9e 100644
--- a/lib/PublicInbox/Import.pm
+++ b/lib/PublicInbox/Import.pm
@@ -133,7 +133,6 @@ sub check_remove_v1 {
 	(undef, $cur);
 }
 
-# used for v2 (maybe)
 sub checkpoint {
 	my ($self) = @_;
 	return unless $self->{pid};
@@ -141,6 +140,15 @@ sub checkpoint {
 	undef;
 }
 
+sub progress {
+	my ($self, $msg) = @_;
+	return unless $self->{pid};
+	print { $self->{out} } "progress $msg\n" or wfail;
+	$self->{in}->getline eq "progress $msg\n" or die
+		"progress $msg not received\n";
+	undef;
+}
+
 # used for v2
 sub get_mark {
 	my ($self, $mark) = @_;
diff --git a/lib/PublicInbox/SearchIdxPart.pm b/lib/PublicInbox/SearchIdxPart.pm
index 6d8cb2a..dd7ace6 100644
--- a/lib/PublicInbox/SearchIdxPart.pm
+++ b/lib/PublicInbox/SearchIdxPart.pm
@@ -49,6 +49,11 @@ sub partition_worker_loop ($$$) {
 		} elsif ($line eq "close\n") {
 			$self->_xdb_release;
 			$xdb = $txn = undef;
+		} elsif ($line eq "barrier\n") {
+			$xdb->commit_transaction if $txn;
+			$txn = undef;
+			print { $self->{skeleton}->{w} } "barrier $part\n" or
+					die "write failed to skeleton: $!\n";
 		} else {
 			chomp $line;
 			my ($len, $artnum, $oid, $mid0) = split(/ /, $line);
@@ -81,4 +86,11 @@ sub atfork_child {
 	close $_[0]->{w} or die "failed to close write pipe: $!\n";
 }
 
+# called by V2Writable:
+sub barrier {
+	my $w = $_[0]->{w};
+	print $w "barrier\n" or die "failed to print: $!";
+	$w->flush or die "failed to flush: $!";
+}
+
 1;
diff --git a/lib/PublicInbox/SearchIdxSkeleton.pm b/lib/PublicInbox/SearchIdxSkeleton.pm
index 40b28c5..4cb10f5 100644
--- a/lib/PublicInbox/SearchIdxSkeleton.pm
+++ b/lib/PublicInbox/SearchIdxSkeleton.pm
@@ -15,21 +15,25 @@ sub new {
 
 	my ($r, $w);
 	pipe($r, $w) or die "pipe failed: $!\n";
-	binmode $r, ':raw';
-	binmode $w, ':raw';
+	my ($barrier_wait, $barrier_note);
+	pipe($barrier_wait, $barrier_note) or die "pipe failed: $!\n";
+	binmode $_, ':raw' foreach ($r, $w, $barrier_wait, $barrier_note);
 	my $pid = fork;
 	defined $pid or die "fork failed: $!\n";
 	if ($pid == 0) {
 		$v2writable->atfork_child;
 		$v2writable = undef;
 		close $w;
-		eval { skeleton_worker_loop($self, $r) };
+		close $barrier_wait;
+		eval { skeleton_worker_loop($self, $r, $barrier_note) };
 		die "skeleton worker died: $@\n" if $@;
 		exit;
 	}
 	$self->{w} = $w;
 	$self->{pid} = $pid;
 	close $r;
+	close $barrier_note;
+	$self->{barrier_wait} = $barrier_wait;
 
 	$w->autoflush(1);
 
@@ -40,11 +44,13 @@ sub new {
 }
 
 sub skeleton_worker_loop {
-	my ($self, $r) = @_;
+	my ($self, $r, $barrier_note) = @_;
+	$barrier_note->autoflush(1);
 	$0 = 'pi-v2-skeleton';
 	my $xdb = $self->_xdb_acquire;
 	$xdb->begin_transaction;
 	my $txn = 1;
+	my $barrier = undef;
 	while (my $line = $r->getline) {
 		if ($line eq "commit\n") {
 			$xdb->commit_transaction if $txn;
@@ -52,6 +58,21 @@ sub skeleton_worker_loop {
 		} elsif ($line eq "close\n") {
 			$self->_xdb_release;
 			$xdb = $txn = undef;
+		} elsif ($line =~ /\Abarrier_init (\d+)\n\z/) {
+			my $n = $1 - 1;
+			die "barrier in-progress\n" if defined $barrier;
+			$barrier = { map { $_ => 1 } (0..$n) };
+		} elsif ($line =~ /\Abarrier (\d+)\n\z/) {
+			my $part = $1;
+			die "no barrier in-progress\n" unless defined $barrier;
+			delete $barrier->{$1} or die "unknown barrier: $part\n";
+			if ((scalar keys %$barrier) == 0) {
+				$barrier = undef;
+				$xdb->commit_transaction if $txn;
+				$txn = undef;
+				print $barrier_note "barrier_done\n" or die
+					"print failed to barrier note: $!";
+			}
 		} else {
 			my $len = int($line);
 			my $n = read($r, my $msg, $len) or die "read: $!\n";
@@ -107,4 +128,18 @@ sub index_skeleton_real ($$) {
 	$self->link_and_save($doc, $mids, \@refs, $num, $xpath);
 }
 
+# write to the subprocess
+sub barrier_init {
+	my ($self, $nparts) = @_;
+	my $w = $_[0]->{w};
+	print $w "barrier_init $nparts\n" or die "failed to write: $!";
+	$w->flush or die "failed to flush: $!";
+}
+
+sub barrier_wait {
+	my ($self) = @_;
+	my $l = $self->{barrier_wait}->getline;
+	$l eq "barrier_done\n" or die "bad response from barrier_wait: $l\n";
+}
+
 1;
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index 7728b91..6e2a8d6 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -113,7 +113,7 @@ sub num_for {
 		};
 
 		# crap, Message-ID is already known, hope somebody just resent:
-		$self->done; # write barrier, clears $self->{skel}
+		$self->barrier;
 		foreach my $m (@$mids) {
 			# read-only lookup now safe to do after above barrier
 			my $existing = $self->lookup_content($mime, $m);
@@ -228,6 +228,37 @@ sub checkpoint {
 	$self->searchidx_checkpoint(1);
 }
 
+# issue a write barrier to ensure all data is visible to other processes
+# and read-only ops.  Order of data importance is: git > SQLite > Xapian
+sub barrier {
+	my ($self) = @_;
+
+	# For safety, we ensure git checkpoint is complete before because
+	# the data in git is still more important than what is in Xapian.
+	# Performance may be gained by delaying ->progress call but we
+	# lose safety
+	if (my $im = $self->{im}) {
+		$im->checkpoint;
+		$im->progress('checkpoint');
+	}
+	my $skel = $self->{skel};
+	my $parts = $self->{idx_parts};
+	if ($parts && $skel) {
+		my $dbh = $skel->{mm}->{dbh};
+		$dbh->commit; # SQLite data is second in importance
+
+		# Now deal with Xapian
+		$skel->barrier_init(scalar(@$parts));
+		# each partition needs to issue a barrier command to skel:
+		$_->barrier foreach @$parts;
+
+		$skel->barrier_wait; # wait for each Xapian partition
+
+		$dbh->begin_work;
+	}
+	$self->{transact_bytes} = 0;
+}
+
 sub searchidx_checkpoint {
 	my ($self, $more) = @_;
 
@@ -349,6 +380,7 @@ sub lookup_content {
 	my $ibx = $self->{-inbox};
 
 	my $srch = $ibx->search;
+	$srch->reopen;
 	my $cid = content_id($mime);
 	my $found;
 	$srch->each_smsg_by_mid($mid, sub {
diff --git a/t/v2writable.t b/t/v2writable.t
index 404c865..7d276da 100644
--- a/t/v2writable.t
+++ b/t/v2writable.t
@@ -55,7 +55,7 @@ if ('ensure git configs are correct') {
 {
 	my @warn;
 	local $SIG{__WARN__} = sub { push @warn, @_ };
-	is(undef, $im->add($mime), 'obvious duplicate rejected');
+	is($im->add($mime), undef, 'obvious duplicate rejected');
 	like(join(' ', @warn), qr/resent/, 'warned about resent message');
 
 	@warn = ();
@@ -105,6 +105,7 @@ if ('ensure git configs are correct') {
 	ok($im->add($mime), 'message with multiple Message-ID');
 	$im->done;
 	my @found;
+	$ibx->search->reopen;
 	$ibx->search->each_smsg_by_mid('abcde@1', sub { push @found, @_; 1 });
 	is(scalar(@found), 1, 'message found by first MID');
 	$ibx->search->each_smsg_by_mid('abcde@2', sub { push @found, @_; 1 });
-- 
EW


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 03/27] use string ref for Email::Simple->new
  2018-03-19  8:14 [PATCH 00/27] v2 public-inbox-watch support Eric Wong (Contractor, The Linux Foundation)
  2018-03-19  8:14 ` [PATCH 01/27] content_id: use Sender header if From is not available Eric Wong (Contractor, The Linux Foundation)
  2018-03-19  8:14 ` [PATCH 02/27] v2writable: support "barrier" operation to avoid reforking Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-19  8:14 ` Eric Wong (Contractor, The Linux Foundation)
  2018-03-19  8:14 ` [PATCH 04/27] v2writable: remove unnecessary idx_init call Eric Wong (Contractor, The Linux Foundation)
                   ` (23 subsequent siblings)
  26 siblings, 0 replies; 29+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-19  8:14 UTC (permalink / raw)
  To: meta

Email::Simple is slightly faster this way, and Email::MIME
and PublicInbox::MIME both wrap that.
---
 lib/PublicInbox/Import.pm | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/PublicInbox/Import.pm b/lib/PublicInbox/Import.pm
index 8406c9e..e20c6e0 100644
--- a/lib/PublicInbox/Import.pm
+++ b/lib/PublicInbox/Import.pm
@@ -122,7 +122,7 @@ sub check_remove_v1 {
 	$n = read($r, my $lf, 1);
 	defined($n) or die "read final byte of cat-blob failed: $!";
 	die "bad read on final byte: <$lf>" if $lf ne "\n";
-	my $cur = PublicInbox::MIME->new($buf);
+	my $cur = PublicInbox::MIME->new(\$buf);
 	my $cur_s = $cur->header('Subject');
 	$cur_s = '' unless defined $cur_s;
 	my $cur_m = $mime->header('Subject');
-- 
EW


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 04/27] v2writable: remove unnecessary idx_init call
  2018-03-19  8:14 [PATCH 00/27] v2 public-inbox-watch support Eric Wong (Contractor, The Linux Foundation)
                   ` (2 preceding siblings ...)
  2018-03-19  8:14 ` [PATCH 03/27] use string ref for Email::Simple->new Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-19  8:14 ` Eric Wong (Contractor, The Linux Foundation)
  2018-03-19  8:14 ` [PATCH 05/27] searchidx: do not delete documents while iterating Eric Wong (Contractor, The Linux Foundation)
                   ` (22 subsequent siblings)
  26 siblings, 0 replies; 29+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-19  8:14 UTC (permalink / raw)
  To: meta

We no longer need it with ->barrier working
---
 lib/PublicInbox/V2Writable.pm | 1 -
 1 file changed, 1 deletion(-)

diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index 6e2a8d6..b6c46a2 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -125,7 +125,6 @@ sub num_for {
 
 		# very unlikely:
 		warn "<$mid> reused for mismatched content\n";
-		$self->idx_init;
 
 		# try the rest of the mids
 		foreach my $i (1..$#$mids) {
-- 
EW


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 05/27] searchidx: do not delete documents while iterating
  2018-03-19  8:14 [PATCH 00/27] v2 public-inbox-watch support Eric Wong (Contractor, The Linux Foundation)
                   ` (3 preceding siblings ...)
  2018-03-19  8:14 ` [PATCH 04/27] v2writable: remove unnecessary idx_init call Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-19  8:14 ` Eric Wong (Contractor, The Linux Foundation)
  2018-03-19  8:14 ` [PATCH 06/27] search: allow ->reopen to be chainable Eric Wong (Contractor, The Linux Foundation)
                   ` (21 subsequent siblings)
  26 siblings, 0 replies; 29+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-19  8:14 UTC (permalink / raw)
  To: meta

Followup-to: ebb59815035b42c2
  ("searchidx: do not modify Xapian DB while iterating")
---
 lib/PublicInbox/SearchIdx.pm | 50 +++++++++++++++++++++++---------------------
 1 file changed, 26 insertions(+), 24 deletions(-)

diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 725bbd8..ccec018 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -406,29 +406,38 @@ sub add_message {
 	$doc_id;
 }
 
-# returns deleted doc_id on success, undef on missing
+sub batch_do {
+	my ($self, $termval, $cb) = @_;
+	my $batch_size = 1000; # don't let @ids grow too large to avoid OOM
+	while (1) {
+		my ($head, $tail) = $self->find_doc_ids($termval);
+		return if $head == $tail;
+		my @ids;
+		for (; $head != $tail && @ids < $batch_size; $head->inc) {
+			push @ids, $head->get_docid;
+		}
+		$cb->(\@ids);
+	}
+}
+
 sub remove_message {
 	my ($self, $mid) = @_;
 	my $db = $self->{xdb};
-	my $doc_id;
+	my $called;
 	$mid = mid_clean($mid);
 
 	eval {
-		my ($head, $tail) = $self->find_doc_ids('Q' . $mid);
-		if ($head->equal($tail)) {
-			warn "cannot remove non-existent <$mid>\n";
-		}
-		for (; $head != $tail; $head->inc) {
-			my $docid = $head->get_docid;
-			$db->delete_document($docid);
-		}
+		batch_do($self, 'Q' . $mid, sub {
+			my ($ids) = @_;
+			$db->delete_document($_) for @$ids;
+			$called = 1;
+		});
 	};
-
 	if ($@) {
 		warn "failed to remove message <$mid>: $@\n";
-		return undef;
+	} elsif (!$called) {
+		warn "cannot remove non-existent <$mid>\n";
 	}
-	$doc_id;
 }
 
 sub term_generator { # write-only
@@ -795,22 +804,15 @@ sub merge_threads {
 	my ($self, $winner_tid, $loser_tid) = @_;
 	return if $winner_tid == $loser_tid;
 	my $db = $self->{xdb};
-
-	my $batch_size = 1000; # don't let @ids grow too large to avoid OOM
-	while (1) {
-		my ($head, $tail) = $self->find_doc_ids('G' . $loser_tid);
-		return if $head == $tail;
-		my @ids;
-		for (; $head != $tail && @ids < $batch_size; $head->inc) {
-			push @ids, $head->get_docid;
-		}
-		foreach my $docid (@ids) {
+	batch_do($self, 'G' . $loser_tid, sub {
+		my ($ids) = @_;
+		foreach my $docid (@$ids) {
 			my $doc = $db->get_document($docid);
 			$doc->remove_term('G' . $loser_tid);
 			$doc->add_boolean_term('G' . $winner_tid);
 			$db->replace_document($docid, $doc);
 		}
-	}
+	});
 }
 
 sub _read_git_config_perm {
-- 
EW


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 06/27] search: allow ->reopen to be chainable
  2018-03-19  8:14 [PATCH 00/27] v2 public-inbox-watch support Eric Wong (Contractor, The Linux Foundation)
                   ` (4 preceding siblings ...)
  2018-03-19  8:14 ` [PATCH 05/27] searchidx: do not delete documents while iterating Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-19  8:14 ` Eric Wong (Contractor, The Linux Foundation)
  2018-03-19  8:14 ` [PATCH 07/27] v2writable: implement remove correctly Eric Wong (Contractor, The Linux Foundation)
                   ` (20 subsequent siblings)
  26 siblings, 0 replies; 29+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-19  8:14 UTC (permalink / raw)
  To: meta

Makes life a little easier for V2Writable...
---
 lib/PublicInbox/Search.pm     | 1 +
 lib/PublicInbox/V2Writable.pm | 3 +--
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index 7cad31a..7e7c989 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -173,6 +173,7 @@ sub reopen {
 	if (my $skel = $self->{skel}) {
 		$skel->reopen;
 	}
+	$self; # make chaining easier
 }
 
 # read-only
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index b6c46a2..e673c25 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -378,8 +378,7 @@ sub lookup_content {
 	my ($self, $mime, $mid) = @_;
 	my $ibx = $self->{-inbox};
 
-	my $srch = $ibx->search;
-	$srch->reopen;
+	my $srch = $ibx->search->reopen;
 	my $cid = content_id($mime);
 	my $found;
 	$srch->each_smsg_by_mid($mid, sub {
-- 
EW


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 07/27] v2writable: implement remove correctly
  2018-03-19  8:14 [PATCH 00/27] v2 public-inbox-watch support Eric Wong (Contractor, The Linux Foundation)
                   ` (5 preceding siblings ...)
  2018-03-19  8:14 ` [PATCH 06/27] search: allow ->reopen to be chainable Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-19  8:14 ` Eric Wong (Contractor, The Linux Foundation)
  2018-03-19  8:14 ` [PATCH 08/27] skeleton: barrier init requires a lock Eric Wong (Contractor, The Linux Foundation)
                   ` (19 subsequent siblings)
  26 siblings, 0 replies; 29+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-19  8:14 UTC (permalink / raw)
  To: meta

We need to hide removals from anybody hitting the search engine.
---
 lib/PublicInbox/Msgmap.pm            |  8 ++++++
 lib/PublicInbox/SearchIdx.pm         | 32 ++++++++++++++++++++++
 lib/PublicInbox/SearchIdxPart.pm     |  8 ++++++
 lib/PublicInbox/SearchIdxSkeleton.pm | 18 +++++++++++++
 lib/PublicInbox/SearchMsg.pm         |  4 ++-
 lib/PublicInbox/V2Writable.pm        | 51 ++++++++++++++++++++++++++++--------
 t/v2writable.t                       | 18 ++++++++++++-
 7 files changed, 126 insertions(+), 13 deletions(-)

diff --git a/lib/PublicInbox/Msgmap.pm b/lib/PublicInbox/Msgmap.pm
index a147b9f..8e81fba 100644
--- a/lib/PublicInbox/Msgmap.pm
+++ b/lib/PublicInbox/Msgmap.pm
@@ -140,6 +140,14 @@ sub mid_delete {
 	$sth->execute;
 }
 
+sub num_delete {
+	my ($self, $num) = @_;
+	my $dbh = $self->{dbh};
+	my $sth = $dbh->prepare('DELETE FROM msgmap WHERE num = ?');
+	$sth->bind_param(1, $num);
+	$sth->execute;
+}
+
 sub create_tables {
 	my ($dbh) = @_;
 	my $e;
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index ccec018..ae2544d 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -440,6 +440,31 @@ sub remove_message {
 	}
 }
 
+# MID is a hint in V2
+sub remove_by_oid {
+	my ($self, $oid, $mid) = @_;
+	my $db = $self->{xdb};
+
+	# XXX careful, we cannot use batch_do here since we conditionally
+	# delete documents based on other factors, so we cannot call
+	# find_doc_ids twice.
+	my ($head, $tail) = $self->find_doc_ids('Q' . $mid);
+	return if $head == $tail;
+
+	# there is only ONE element in @delete unless we
+	# have bugs in our v2writable deduplication check
+	my @delete;
+	for (; $head != $tail; $head->inc) {
+		my $docid = $head->get_docid;
+		my $doc = $db->get_document($docid);
+		my $smsg = PublicInbox::SearchMsg->wrap($doc, $mid);
+		$smsg->load_expand;
+		push(@delete, $docid) if $smsg->{blob} eq $oid;
+	}
+	$db->delete_document($_) foreach @delete;
+	scalar(@delete);
+}
+
 sub term_generator { # write-only
 	my ($self) = @_;
 
@@ -896,4 +921,11 @@ sub remote_close {
 	$? == 0 or die ref($self)." pid:$pid exited with: $?";
 }
 
+# triggers remove_by_oid in partition or skeleton
+sub remote_remove {
+	my ($self, $oid, $mid) = @_;
+	print { $self->{w} } "D $oid $mid\n" or
+			die "failed to write remove $!";
+}
+
 1;
diff --git a/lib/PublicInbox/SearchIdxPart.pm b/lib/PublicInbox/SearchIdxPart.pm
index dd7ace6..c166078 100644
--- a/lib/PublicInbox/SearchIdxPart.pm
+++ b/lib/PublicInbox/SearchIdxPart.pm
@@ -54,6 +54,14 @@ sub partition_worker_loop ($$$) {
 			$txn = undef;
 			print { $self->{skeleton}->{w} } "barrier $part\n" or
 					die "write failed to skeleton: $!\n";
+		} elsif ($line =~ /\AD ([a-f0-9]{40,}) (.+)\n\z/s) {
+			my ($oid, $mid) = ($1, $2);
+			$xdb ||= $self->_xdb_acquire;
+			if (!$txn) {
+				$xdb->begin_transaction;
+				$txn = 1;
+			}
+			$self->remove_by_oid($oid, $mid);
 		} else {
 			chomp $line;
 			my ($len, $artnum, $oid, $mid0) = split(/ /, $line);
diff --git a/lib/PublicInbox/SearchIdxSkeleton.pm b/lib/PublicInbox/SearchIdxSkeleton.pm
index 4cb10f5..beb17b9 100644
--- a/lib/PublicInbox/SearchIdxSkeleton.pm
+++ b/lib/PublicInbox/SearchIdxSkeleton.pm
@@ -73,6 +73,14 @@ sub skeleton_worker_loop {
 				print $barrier_note "barrier_done\n" or die
 					"print failed to barrier note: $!";
 			}
+		} elsif ($line =~ /\AD ([a-f0-9]{40,}) (.*)\n\z/s) {
+			my ($oid, $mid) = ($1, $2);
+			$xdb ||= $self->_xdb_acquire;
+			if (!$txn) {
+				$xdb->begin_transaction;
+				$txn = 1;
+			}
+			$self->remove_by_oid($oid, $mid);
 		} else {
 			my $len = int($line);
 			my $n = read($r, my $msg, $len) or die "read: $!\n";
@@ -110,6 +118,16 @@ sub index_skeleton {
 	die "print failed: $err\n" if $err;
 }
 
+sub remote_remove {
+	my ($self, $oid, $mid) = @_;
+	my $err;
+	$self->_lock_acquire;
+	eval { $self->SUPER::remote_remove($oid, $mid) };
+	$err = $@;
+	$self->_lock_release;
+	die $err if $err;
+}
+
 # values: [ TS, NUM, BYTES, LINES, MID, XPATH, doc_data ]
 sub index_skeleton_real ($$) {
 	my ($self, $values) = @_;
diff --git a/lib/PublicInbox/SearchMsg.pm b/lib/PublicInbox/SearchMsg.pm
index 23478a2..a1cd0c2 100644
--- a/lib/PublicInbox/SearchMsg.pm
+++ b/lib/PublicInbox/SearchMsg.pm
@@ -64,7 +64,9 @@ sub load_doc {
 # :bytes and :lines metadata in RFC 3977
 sub bytes ($) { get_val($_[0]->{doc}, &PublicInbox::Search::BYTES) }
 sub lines ($) { get_val($_[0]->{doc}, &PublicInbox::Search::LINES) }
-sub num ($) { get_val($_[0]->{doc}, &PublicInbox::Search::NUM) }
+sub num ($) {
+	$_[0]->{num} ||= get_val($_[0]->{doc}, PublicInbox::Search::NUM)
+}
 
 sub __hdr ($$) {
 	my ($self, $field) = @_;
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index e673c25..656f069 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -199,18 +199,47 @@ sub idx_init {
 }
 
 sub remove {
-	my ($self, $mime, $msg) = @_;
-	my $existing = $self->lookup_content($mime) or return;
-
-	# don't touch ghosts or already junked messages
-	return unless $existing->type eq 'mail';
-
-	# always write removals to the current (latest) git repo since
-	# we process chronologically
+	my ($self, $mime, $cmt_msg) = @_;
+	$self->barrier;
+	$self->idx_init;
 	my $im = $self->importer;
-	my ($cmt, undef) = $im->remove($mime, $msg);
-	$cmt = $im->get_mark($cmt);
-	$self->unindex_msg($existing, $cmt);
+	my $ibx = $self->{-inbox};
+	my $srch = $ibx->search;
+	my $cid = content_id($mime);
+	my $skel = $self->{skel};
+	my $parts = $self->{idx_parts};
+	my $mm = $skel->{mm};
+	my $removed;
+	my $mids = mids($mime->header_obj);
+	foreach my $mid (@$mids) {
+		$srch->reopen->each_smsg_by_mid($mid, sub {
+			my ($smsg) = @_;
+			$smsg->load_expand;
+			my $msg = $ibx->msg_by_smsg($smsg);
+			if (!defined($msg)) {
+				warn "broken smsg for $mid\n";
+				return 1; # continue
+			}
+			my $cur = PublicInbox::MIME->new($msg);
+			if (content_id($cur) eq $cid) {
+				$mm->num_delete($smsg->num);
+				# $removed should only be set once assuming
+				# no bugs in our deduplication code:
+				$removed = $smsg;
+				$removed->{mime} = $cur;
+				$im->remove($cur, $cmt_msg);
+				$removed->num; # memoize this for callers
+
+				my $oid = $smsg->{blob};
+				foreach my $idx (@$parts, $skel) {
+					$idx->remote_remove($oid, $mid);
+				}
+			}
+			1; # continue
+		});
+		$self->barrier;
+	}
+	$removed;
 }
 
 sub done {
diff --git a/t/v2writable.t b/t/v2writable.t
index 7d276da..6e37b72 100644
--- a/t/v2writable.t
+++ b/t/v2writable.t
@@ -36,12 +36,13 @@ my $im = eval {
 };
 is($im->{partitions}, 1, 'one partition when forced');
 ok($im->add($mime), 'ordinary message added');
+my $git0;
 
 if ('ensure git configs are correct') {
 	my @cmd = (qw(git config), "--file=$mainrepo/all.git/config",
 		qw(core.sharedRepository 0644));
 	is(system(@cmd), 0, "set sharedRepository in all.git");
-	my $git0 = PublicInbox::Git->new("$mainrepo/git/0.git");
+	$git0 = PublicInbox::Git->new("$mainrepo/git/0.git");
 	my $fh = $git0->popen(qw(config core.sharedRepository));
 	my $v = eval { local $/; <$fh> };
 	chomp $v;
@@ -189,8 +190,23 @@ EOF
 };
 {
 	local $ENV{NPROC} = 2;
+	my @before = $git0->qx(qw(log --pretty=oneline));
 	$im = PublicInbox::V2Writable->new($ibx, 1);
 	is($im->{partitions}, 1, 'detected single partition from previous');
+	my $smsg = $im->remove($mime, 'test removal');
+	my @after = $git0->qx(qw(log --pretty=oneline));
+	$im->done;
+	my $tip = shift @after;
+	like($tip, qr/\A[a-f0-9]+ test removal\n\z/s,
+		'commit message propaged to git');
+	is_deeply(\@after, \@before, 'only one commit written to git');
+	is($ibx->mm->num_for($smsg->mid), undef, 'no longer in Msgmap by mid');
+	like($smsg->num, qr/\A\d+\z/, 'numeric number in return message');
+	is($ibx->mm->mid_for($smsg->num), undef, 'no longer in Msgmap by num');
+	my $srch = $ibx->search->reopen;
+	my @found = ();
+	$srch->each_smsg_by_mid($smsg->mid, sub { push @found, @_; 1 });
+	is(scalar(@found), 0, 'no longer found in Xapian skeleton');
 }
 
 done_testing();
-- 
EW


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 08/27] skeleton: barrier init requires a lock
  2018-03-19  8:14 [PATCH 00/27] v2 public-inbox-watch support Eric Wong (Contractor, The Linux Foundation)
                   ` (6 preceding siblings ...)
  2018-03-19  8:14 ` [PATCH 07/27] v2writable: implement remove correctly Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-19  8:14 ` Eric Wong (Contractor, The Linux Foundation)
  2018-03-19  8:14 ` [PATCH 09/27] import: (v2) delete writes the blob into history in subdir Eric Wong (Contractor, The Linux Foundation)
                   ` (18 subsequent siblings)
  26 siblings, 0 replies; 29+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-19  8:14 UTC (permalink / raw)
  To: meta

Writing to the main skeleton pipe requires a lock since it's
shared with partition processes.
---
 lib/PublicInbox/SearchIdxSkeleton.pm | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/lib/PublicInbox/SearchIdxSkeleton.pm b/lib/PublicInbox/SearchIdxSkeleton.pm
index beb17b9..51a88fd 100644
--- a/lib/PublicInbox/SearchIdxSkeleton.pm
+++ b/lib/PublicInbox/SearchIdxSkeleton.pm
@@ -149,9 +149,12 @@ sub index_skeleton_real ($$) {
 # write to the subprocess
 sub barrier_init {
 	my ($self, $nparts) = @_;
-	my $w = $_[0]->{w};
-	print $w "barrier_init $nparts\n" or die "failed to write: $!";
-	$w->flush or die "failed to flush: $!";
+	my $w = $self->{w};
+	my $err;
+	$self->_lock_acquire;
+	print $w "barrier_init $nparts\n" or $err = "failed to write: $!\n";
+	$self->_lock_release;
+	die $err if $err;
 }
 
 sub barrier_wait {
-- 
EW


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 09/27] import: (v2) delete writes the blob into history in subdir
  2018-03-19  8:14 [PATCH 00/27] v2 public-inbox-watch support Eric Wong (Contractor, The Linux Foundation)
                   ` (7 preceding siblings ...)
  2018-03-19  8:14 ` [PATCH 08/27] skeleton: barrier init requires a lock Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-19  8:14 ` Eric Wong (Contractor, The Linux Foundation)
  2018-03-19  8:14 ` [PATCH 10/27] import: (v2): write deletes to a separate '_' subdirectory Eric Wong (Contractor, The Linux Foundation)
                   ` (17 subsequent siblings)
  26 siblings, 0 replies; 29+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-19  8:14 UTC (permalink / raw)
  To: meta

This makes it easier to audit deletes with "git log -p" and
prevents an unstable specification of "content_id" from being
stored in history.

This should be cost-free if done in the same partition (and even
cheaper than before as it introduces no new blobs).  It does
have a higher cost across partitions, but is probably irrelevant
given the typical ham:spam ratio.
---
 lib/PublicInbox/Import.pm     | 15 ++++++++++-----
 lib/PublicInbox/V2Writable.pm |  4 +++-
 t/v2writable.t                |  9 +++++++++
 3 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/lib/PublicInbox/Import.pm b/lib/PublicInbox/Import.pm
index e20c6e0..94a49fe 100644
--- a/lib/PublicInbox/Import.pm
+++ b/lib/PublicInbox/Import.pm
@@ -11,7 +11,6 @@ use Fcntl qw(:flock :DEFAULT);
 use PublicInbox::Spawn qw(spawn);
 use PublicInbox::MID qw(mid_mime mid2path);
 use PublicInbox::Address;
-use PublicInbox::ContentId qw(content_id);
 use PublicInbox::MsgTime qw(msg_timestamp);
 
 sub new {
@@ -163,7 +162,6 @@ sub get_mark {
 # ('MISMATCH', Email::MIME) on mismatch
 # (:MARK, Email::MIME) on success
 #
-# For v2 inboxes, the content_id is returned instead of the msg
 # v2 callers should check with Xapian before calling this as
 # it is not idempotent.
 sub remove {
@@ -179,10 +177,17 @@ sub remove {
 		($err, $cur) = check_remove_v1($r, $w, $tip, $path, $mime);
 		return ($err, $cur) if $err;
 	} else {
-		$cur = content_id($mime);
-		my $len = length($cur);
+		my $sref;
+		if (ref($mime) eq 'SCALAR') { # optimization used by V2Writable
+			$sref = $mime;
+		} else { # XXX should not be necessary:
+			my $str = $mime->as_string;
+			$sref = \$str;
+		}
+		my $len = length($$sref);
 		$blob = $self->{mark}++;
-		print $w "blob\nmark :$blob\ndata $len\n$cur\n" or wfail;
+		print $w "blob\nmark :$blob\ndata $len\n",
+			$$sref, "\n" or wfail;
 	}
 
 	my $ref = $self->{ref};
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index 656f069..fd9bf61 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -220,6 +220,7 @@ sub remove {
 				warn "broken smsg for $mid\n";
 				return 1; # continue
 			}
+			my $orig = $$msg;
 			my $cur = PublicInbox::MIME->new($msg);
 			if (content_id($cur) eq $cid) {
 				$mm->num_delete($smsg->num);
@@ -227,7 +228,8 @@ sub remove {
 				# no bugs in our deduplication code:
 				$removed = $smsg;
 				$removed->{mime} = $cur;
-				$im->remove($cur, $cmt_msg);
+				$im->remove(\$orig, $cmt_msg);
+				$orig = undef;
 				$removed->num; # memoize this for callers
 
 				my $oid = $smsg->{blob};
diff --git a/t/v2writable.t b/t/v2writable.t
index 6e37b72..a5c982e 100644
--- a/t/v2writable.t
+++ b/t/v2writable.t
@@ -191,6 +191,7 @@ EOF
 {
 	local $ENV{NPROC} = 2;
 	my @before = $git0->qx(qw(log --pretty=oneline));
+	my $before = $git0->qx(qw(log --pretty=raw --raw -r --no-abbrev));
 	$im = PublicInbox::V2Writable->new($ibx, 1);
 	is($im->{partitions}, 1, 'detected single partition from previous');
 	my $smsg = $im->remove($mime, 'test removal');
@@ -207,6 +208,14 @@ EOF
 	my @found = ();
 	$srch->each_smsg_by_mid($smsg->mid, sub { push @found, @_; 1 });
 	is(scalar(@found), 0, 'no longer found in Xapian skeleton');
+
+	my $after = $git0->qx(qw(log -1 --pretty=raw --raw -r --no-abbrev));
+	if ($after =~ m!( [a-f0-9]+ )A\td$!) {
+		my $oid = $1;
+		ok(index($before, $oid) > 0, 'no new blob introduced');
+	} else {
+		fail('failed to extract blob from log output');
+	}
 }
 
 done_testing();
-- 
EW


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 10/27] import: (v2): write deletes to a separate '_' subdirectory
  2018-03-19  8:14 [PATCH 00/27] v2 public-inbox-watch support Eric Wong (Contractor, The Linux Foundation)
                   ` (8 preceding siblings ...)
  2018-03-19  8:14 ` [PATCH 09/27] import: (v2) delete writes the blob into history in subdir Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-19  8:14 ` Eric Wong (Contractor, The Linux Foundation)
  2018-03-19  8:14 ` [PATCH 11/27] import: implement barrier operation for v1 repos Eric Wong (Contractor, The Linux Foundation)
                   ` (16 subsequent siblings)
  26 siblings, 0 replies; 29+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-19  8:14 UTC (permalink / raw)
  To: meta

In the future, we may store "purged" content IDs or other
uncommon stuff under "_/" of the git tree.  This keeps the
top-level tree small and more amenable to deltafication.
This helps the the common case where "m" is most commonly
changed file at the top level.

Also, use 'D' instead of 'd' since it matches git's '--raw'
output format.
---
 lib/PublicInbox/Import.pm | 2 +-
 t/v2writable.t            | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/PublicInbox/Import.pm b/lib/PublicInbox/Import.pm
index 94a49fe..6a640e2 100644
--- a/lib/PublicInbox/Import.pm
+++ b/lib/PublicInbox/Import.pm
@@ -208,7 +208,7 @@ sub remove {
 	if (defined $path) {
 		print $w "D $path\n\n" or wfail;
 	} else {
-		print $w "M 100644 :$blob d\n\n" or wfail;
+		print $w "M 100644 :$blob _/D\n\n" or wfail;
 	}
 	$self->{nchg}++;
 	(($self->{tip} = ":$commit"), $cur);
diff --git a/t/v2writable.t b/t/v2writable.t
index a5c982e..c6bcefd 100644
--- a/t/v2writable.t
+++ b/t/v2writable.t
@@ -210,7 +210,7 @@ EOF
 	is(scalar(@found), 0, 'no longer found in Xapian skeleton');
 
 	my $after = $git0->qx(qw(log -1 --pretty=raw --raw -r --no-abbrev));
-	if ($after =~ m!( [a-f0-9]+ )A\td$!) {
+	if ($after =~ m!( [a-f0-9]+ )A\t_/D$!) {
 		my $oid = $1;
 		ok(index($before, $oid) > 0, 'no new blob introduced');
 	} else {
-- 
EW


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 11/27] import: implement barrier operation for v1 repos
  2018-03-19  8:14 [PATCH 00/27] v2 public-inbox-watch support Eric Wong (Contractor, The Linux Foundation)
                   ` (9 preceding siblings ...)
  2018-03-19  8:14 ` [PATCH 10/27] import: (v2): write deletes to a separate '_' subdirectory Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-19  8:14 ` Eric Wong (Contractor, The Linux Foundation)
  2018-03-19  8:14 ` [PATCH 12/27] mid: mid_mime uses v2-compatible mids function Eric Wong (Contractor, The Linux Foundation)
                   ` (15 subsequent siblings)
  26 siblings, 0 replies; 29+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-19  8:14 UTC (permalink / raw)
  To: meta

This will allow WatchMaildir to use ->barrier operations instead
of reaching inside for nchg.  This also ensures dumb HTTP
clients can see changes to V2 repos immediately.
---
 lib/PublicInbox/Import.pm       | 58 ++++++++++++++++++++++++++---------------
 lib/PublicInbox/V2Writable.pm   |  7 +----
 lib/PublicInbox/WatchMaildir.pm |  2 +-
 3 files changed, 39 insertions(+), 28 deletions(-)

diff --git a/lib/PublicInbox/Import.pm b/lib/PublicInbox/Import.pm
index 6a640e2..12df7d5 100644
--- a/lib/PublicInbox/Import.pm
+++ b/lib/PublicInbox/Import.pm
@@ -148,6 +148,42 @@ sub progress {
 	undef;
 }
 
+sub _update_git_info ($$) {
+	my ($self, $do_gc) = @_;
+	# for compatibility with existing ssoma installations
+	# we can probably remove this entirely by 2020
+	my $git_dir = $self->{git}->{git_dir};
+	my @cmd = ('git', "--git-dir=$git_dir");
+	my $index = "$git_dir/ssoma.index";
+	if (-e $index && !$ENV{FAST}) {
+		my $env = { GIT_INDEX_FILE => $index };
+		run_die([@cmd, qw(read-tree -m -v -i), $self->{ref}], $env);
+	}
+	run_die([@cmd, 'update-server-info'], undef);
+	($self->{path_type} eq '2/38') and eval {
+		require PublicInbox::SearchIdx;
+		my $inbox = $self->{inbox} || $git_dir;
+		my $s = PublicInbox::SearchIdx->new($inbox);
+		$s->index_sync({ ref => $self->{ref} });
+	};
+	eval { run_die([@cmd, qw(gc --auto)], undef) } if $do_gc;
+}
+
+sub barrier {
+	my ($self) = @_;
+
+	# For safety, we ensure git checkpoint is complete before because
+	# the data in git is still more important than what is in Xapian
+	# in v2.  Performance may be gained by delaying the ->progress
+	# call but we lose safety
+	if ($self->{nchg}) {
+		$self->checkpoint;
+		$self->progress('checkpoint');
+		_update_git_info($self, 0);
+		$self->{nchg} = 0;
+	}
+}
+
 # used for v2
 sub get_mark {
 	my ($self, $mark) = @_;
@@ -341,28 +377,8 @@ sub done {
 	my $pid = delete $self->{pid} or die 'BUG: missing {pid} when done';
 	waitpid($pid, 0) == $pid or die 'fast-import did not finish';
 	$? == 0 or die "fast-import failed: $?";
-	my $nchg = delete $self->{nchg};
 
-	# for compatibility with existing ssoma installations
-	# we can probably remove this entirely by 2020
-	my $git_dir = $self->{git}->{git_dir};
-	my @cmd = ('git', "--git-dir=$git_dir");
-	my $index = "$git_dir/ssoma.index";
-	if ($nchg && -e $index && !$ENV{FAST}) {
-		my $env = { GIT_INDEX_FILE => $index };
-		run_die([@cmd, qw(read-tree -m -v -i), $self->{ref}], $env);
-	}
-	if ($nchg) {
-		run_die([@cmd, 'update-server-info'], undef);
-		($self->{path_type} eq '2/38') and eval {
-			require PublicInbox::SearchIdx;
-			my $inbox = $self->{inbox} || $git_dir;
-			my $s = PublicInbox::SearchIdx->new($inbox);
-			$s->index_sync({ ref => $self->{ref} });
-		};
-
-		eval { run_die([@cmd, qw(gc --auto)], undef) };
-	}
+	_update_git_info($self, 1) if delete $self->{nchg};
 
 	$self->{ssoma_lock} or return;
 	my $lockfh = delete $self->{lockfh} or die "BUG: not locked: $!";
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index fd9bf61..fbc71c8 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -263,13 +263,8 @@ sub checkpoint {
 sub barrier {
 	my ($self) = @_;
 
-	# For safety, we ensure git checkpoint is complete before because
-	# the data in git is still more important than what is in Xapian.
-	# Performance may be gained by delaying ->progress call but we
-	# lose safety
 	if (my $im = $self->{im}) {
-		$im->checkpoint;
-		$im->progress('checkpoint');
+		$im->barrier;
 	}
 	my $skel = $self->{skel};
 	my $parts = $self->{idx_parts};
diff --git a/lib/PublicInbox/WatchMaildir.pm b/lib/PublicInbox/WatchMaildir.pm
index 3da6b27..c72d939 100644
--- a/lib/PublicInbox/WatchMaildir.pm
+++ b/lib/PublicInbox/WatchMaildir.pm
@@ -91,7 +91,7 @@ sub _done_for_now {
 	my ($self) = @_;
 	my $importers = $self->{importers};
 	foreach my $im (values %$importers) {
-		$im->done if $im->{nchg};
+		$im->barrier;
 	}
 
 	my $opendirs = $self->{opendirs};
-- 
EW


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 12/27] mid: mid_mime uses v2-compatible mids function
  2018-03-19  8:14 [PATCH 00/27] v2 public-inbox-watch support Eric Wong (Contractor, The Linux Foundation)
                   ` (10 preceding siblings ...)
  2018-03-19  8:14 ` [PATCH 11/27] import: implement barrier operation for v1 repos Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-19  8:14 ` Eric Wong (Contractor, The Linux Foundation)
  2018-03-19  8:14 ` [PATCH 13/27] watchmaildir: use content_digest to generate Message-Id Eric Wong (Contractor, The Linux Foundation)
                   ` (14 subsequent siblings)
  26 siblings, 0 replies; 29+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-19  8:14 UTC (permalink / raw)
  To: meta

This allows us to be more consistent in dealing with completely
empty Message-Ids.
---
 lib/PublicInbox/MID.pm | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/lib/PublicInbox/MID.pm b/lib/PublicInbox/MID.pm
index 422902f..117d3c4 100644
--- a/lib/PublicInbox/MID.pm
+++ b/lib/PublicInbox/MID.pm
@@ -50,7 +50,8 @@ sub mid2path {
 	"$x2/$x38";
 }
 
-sub mid_mime ($) { $_[0]->header_obj->header_raw('Message-ID') }
+# Only for v1 code paths:
+sub mid_mime ($) { mids($_[0]->header_obj)->[0] }
 
 sub mids ($) {
 	my ($hdr) = @_;
-- 
EW


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 13/27] watchmaildir: use content_digest to generate Message-Id
  2018-03-19  8:14 [PATCH 00/27] v2 public-inbox-watch support Eric Wong (Contractor, The Linux Foundation)
                   ` (11 preceding siblings ...)
  2018-03-19  8:14 ` [PATCH 12/27] mid: mid_mime uses v2-compatible mids function Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-19  8:14 ` Eric Wong (Contractor, The Linux Foundation)
  2018-03-19  8:14 ` [PATCH 14/27] import: force Message-ID generation for v1 here Eric Wong (Contractor, The Linux Foundation)
                   ` (13 subsequent siblings)
  26 siblings, 0 replies; 29+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-19  8:14 UTC (permalink / raw)
  To: meta

This can probably be moved to Import for code reuse.
---
 lib/PublicInbox/WatchMaildir.pm | 25 ++++++++-----------------
 1 file changed, 8 insertions(+), 17 deletions(-)

diff --git a/lib/PublicInbox/WatchMaildir.pm b/lib/PublicInbox/WatchMaildir.pm
index c72d939..f2d3db9 100644
--- a/lib/PublicInbox/WatchMaildir.pm
+++ b/lib/PublicInbox/WatchMaildir.pm
@@ -12,6 +12,8 @@ use PublicInbox::Import;
 use PublicInbox::MDA;
 use PublicInbox::Spawn qw(spawn);
 use File::Temp qw//;
+use PublicInbox::MID qw(mids);
+use PublicInbox::ContentId qw(content_digest);
 
 sub new {
 	my ($class, $config) = @_;
@@ -144,25 +146,14 @@ sub _remove_spam {
 	})
 }
 
-# used to hash the relevant portions of a message when there are conflicts
-sub _hash_mime2 {
-	my ($mime) = @_;
-	require Digest::SHA;
-	my $dig = Digest::SHA->new('SHA-1');
-	$dig->add($mime->header_obj->header_raw('Subject'));
-	$dig->add($mime->body_raw);
-	$dig->hexdigest;
-}
-
 sub _force_mid {
 	my ($mime) = @_;
-	# probably a bad idea, but we inject a Message-Id if
-	# one is missing, here..
-	my $mid = $mime->header_obj->header_raw('Message-Id');
-	if (!defined $mid || $mid =~ /\A\s*\z/) {
-		$mid = '<' . _hash_mime2($mime) . '@generated>';
-		$mime->header_set('Message-Id', $mid);
-	}
+	my $hdr = $mime->header_obj;
+	my $mids = mids($hdr);
+	return if @$mids;
+	my $dig = content_digest($mime);
+	my $mid = $dig->clone->hexdigest . '@localhost';
+	$hdr->header_set('Message-Id', $mid);
 }
 
 sub _try_path {
-- 
EW


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 14/27] import: force Message-ID generation for v1 here
  2018-03-19  8:14 [PATCH 00/27] v2 public-inbox-watch support Eric Wong (Contractor, The Linux Foundation)
                   ` (12 preceding siblings ...)
  2018-03-19  8:14 ` [PATCH 13/27] watchmaildir: use content_digest to generate Message-Id Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-19  8:14 ` Eric Wong (Contractor, The Linux Foundation)
  2018-03-19  8:14 ` [PATCH 15/27] import: switch to URL-safe Base64 for Message-IDs Eric Wong (Contractor, The Linux Foundation)
                   ` (12 subsequent siblings)
  26 siblings, 0 replies; 29+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-19  8:14 UTC (permalink / raw)
  To: meta

This allows us to share code for generating Message-IDs
between v1 and v2 repos.

For v1, this introduces a slight incompatibility in message
removal iff the original message lacked a Message-ID AND
the training request came from a message which did not
pass through the public-inbox:

The workaround for this would be to reuse the bad message from
the archive itself.
---
 lib/PublicInbox/Import.pm       | 15 +++++++++++++--
 lib/PublicInbox/V2Writable.pm   |  6 +++---
 lib/PublicInbox/WatchMaildir.pm | 14 --------------
 3 files changed, 16 insertions(+), 19 deletions(-)

diff --git a/lib/PublicInbox/Import.pm b/lib/PublicInbox/Import.pm
index 12df7d5..4c007b6 100644
--- a/lib/PublicInbox/Import.pm
+++ b/lib/PublicInbox/Import.pm
@@ -9,9 +9,10 @@ use strict;
 use warnings;
 use Fcntl qw(:flock :DEFAULT);
 use PublicInbox::Spawn qw(spawn);
-use PublicInbox::MID qw(mid_mime mid2path);
+use PublicInbox::MID qw(mids mid_mime mid2path);
 use PublicInbox::Address;
 use PublicInbox::MsgTime qw(msg_timestamp);
+use PublicInbox::ContentId qw(content_digest);
 
 sub new {
 	my ($class, $git, $name, $email, $ibx) = @_;
@@ -308,7 +309,12 @@ sub add {
 
 	my $path;
 	if ($path_type eq '2/38') {
-		$path = mid2path(mid_mime($mime));
+		my $mids = mids($mime->header_obj);
+		if (!scalar(@$mids)) {
+			my $dig = content_digest($mime);
+			@$mids = (digest2mid($dig));
+		}
+		$path = mid2path($mids->[0]);
 	} else { # v2 layout, one file:
 		$path = 'm';
 	}
@@ -393,6 +399,11 @@ sub atfork_child {
 	}
 }
 
+sub digest2mid ($) {
+	my ($dig) = @_;
+	$dig->clone->hexdigest . '@localhost';
+}
+
 1;
 __END__
 =pod
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index fbc71c8..a305842 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -146,19 +146,19 @@ sub num_for_harder {
 
 	my $hdr = $mime->header_obj;
 	my $dig = content_digest($mime);
-	$$mid0 = $dig->clone->hexdigest . '@localhost';
+	$$mid0 = PublicInbox::Import::digest2mid($dig);
 	my $num = $self->{skel}->{mm}->mid_insert($$mid0);
 	unless (defined $num) {
 		# it's hard to spoof the last Received: header
 		my @recvd = $hdr->header_raw('Received');
 		$dig->add("Received: $_") foreach (@recvd);
-		$$mid0 = $dig->clone->hexdigest . '@localhost';
+		$$mid0 = PublicInbox::Import::digest2mid($dig);
 		$num = $self->{skel}->{mm}->mid_insert($$mid0);
 
 		# fall back to a random Message-ID and give up determinism:
 		until (defined($num)) {
 			$dig->add(rand);
-			$$mid0 = $dig->clone->hexdigest . '@localhost';
+			$$mid0 = PublicInbox::Import::digest2mid($dig);
 			warn "using random Message-ID <$$mid0> as fallback\n";
 			$num = $self->{skel}->{mm}->mid_insert($$mid0);
 		}
diff --git a/lib/PublicInbox/WatchMaildir.pm b/lib/PublicInbox/WatchMaildir.pm
index f2d3db9..3adebdd 100644
--- a/lib/PublicInbox/WatchMaildir.pm
+++ b/lib/PublicInbox/WatchMaildir.pm
@@ -12,8 +12,6 @@ use PublicInbox::Import;
 use PublicInbox::MDA;
 use PublicInbox::Spawn qw(spawn);
 use File::Temp qw//;
-use PublicInbox::MID qw(mids);
-use PublicInbox::ContentId qw(content_digest);
 
 sub new {
 	my ($class, $config) = @_;
@@ -127,7 +125,6 @@ sub _remove_spam {
 	# path must be marked as (S)een
 	$path =~ /:2,[A-R]*S[T-Za-z]*\z/ or return;
 	my $mime = _path_to_mime($path) or return;
-	_force_mid($mime);
 	$self->{config}->each_inbox(sub {
 		my ($ibx) = @_;
 		eval {
@@ -146,16 +143,6 @@ sub _remove_spam {
 	})
 }
 
-sub _force_mid {
-	my ($mime) = @_;
-	my $hdr = $mime->header_obj;
-	my $mids = mids($hdr);
-	return if @$mids;
-	my $dig = content_digest($mime);
-	my $mid = $dig->clone->hexdigest . '@localhost';
-	$hdr->header_set('Message-Id', $mid);
-}
-
 sub _try_path {
 	my ($self, $path) = @_;
 	my @p = split(m!/+!, $path);
@@ -191,7 +178,6 @@ sub _try_path {
 		$mime = $ret;
 	}
 
-	_force_mid($mime);
 	$im->add($mime, $self->{spamcheck});
 }
 
-- 
EW


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 15/27] import: switch to URL-safe Base64 for Message-IDs
  2018-03-19  8:14 [PATCH 00/27] v2 public-inbox-watch support Eric Wong (Contractor, The Linux Foundation)
                   ` (13 preceding siblings ...)
  2018-03-19  8:14 ` [PATCH 14/27] import: force Message-ID generation for v1 here Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-19  8:14 ` Eric Wong (Contractor, The Linux Foundation)
  2018-03-19  8:14 ` [PATCH 16/27] v2writable: test for idempotent removals Eric Wong (Contractor, The Linux Foundation)
                   ` (11 subsequent siblings)
  26 siblings, 0 replies; 29+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-19  8:14 UTC (permalink / raw)
  To: meta

Hexdigests are too long and shorter Message-IDs are easier
to deal with.
---
 lib/PublicInbox/Import.pm | 11 ++++++++++-
 t/v2writable.t            | 10 ++++++----
 2 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/lib/PublicInbox/Import.pm b/lib/PublicInbox/Import.pm
index 4c007b6..77e74c1 100644
--- a/lib/PublicInbox/Import.pm
+++ b/lib/PublicInbox/Import.pm
@@ -401,7 +401,16 @@ sub atfork_child {
 
 sub digest2mid ($) {
 	my ($dig) = @_;
-	$dig->clone->hexdigest . '@localhost';
+	my $b64 = $dig->clone->b64digest;
+	# Make our own URLs nicer:
+	# See "Base 64 Encoding with URL and Filename Safe Alphabet" in RFC4648
+	$b64 =~ tr!+/=!-_!d;
+
+	# We can make this more meaningful with a date prefix or other things,
+	# but this is only needed for crap that fails to generate a Message-ID
+	# or reuses one.  In other words, it's usually spammers who hit this
+	# so they don't deserve nice Message-IDs :P
+	$b64 . '@localhost';
 }
 
 1;
diff --git a/t/v2writable.t b/t/v2writable.t
index c6bcefd..bbe6d14 100644
--- a/t/v2writable.t
+++ b/t/v2writable.t
@@ -68,6 +68,7 @@ if ('ensure git configs are correct') {
 		[ $mime->header_obj->header_raw('Message-Id') ],
 		'no new Message-Id added');
 
+	my $sane_mid = qr/\A<[\w\-]+\@localhost>\z/;
 	@warn = ();
 	$mime->header_set('Message-Id', '<a-mid@b>');
 	$mime->body_set('different');
@@ -75,13 +76,14 @@ if ('ensure git configs are correct') {
 	like(join(' ', @warn), qr/reused/, 'warned about reused MID');
 	my @mids = $mime->header_obj->header_raw('Message-Id');
 	is($mids[1], '<a-mid@b>', 'original mid not changed');
-	like($mids[0], qr/\A<\w+\@localhost>\z/, 'new MID added');
+	like($mids[0], $sane_mid, 'new MID added');
 	is(scalar(@mids), 2, 'only one new MID added');
 
 	@warn = ();
 	$mime->header_set('Message-Id', '<a-mid@b>');
 	$mime->body_set('this one needs a random mid');
-	my $gen = content_digest($mime)->hexdigest . '@localhost';
+	my $gen = PublicInbox::Import::digest2mid(content_digest($mime));
+	unlike($gen, qr![\+/=]!, 'no URL-unfriendly chars in Message-Id');
 	my $fake = PublicInbox::MIME->new($mime->as_string);
 	$fake->header_set('Message-Id', $gen);
 	ok($im->add($fake), 'fake added easily');
@@ -90,14 +92,14 @@ if ('ensure git configs are correct') {
 	like(join(' ', @warn), qr/using random/, 'warned about using random');
 	@mids = $mime->header_obj->header_raw('Message-Id');
 	is($mids[1], '<a-mid@b>', 'original mid not changed');
-	like($mids[0], qr/\A<\w+\@localhost>\z/, 'new MID added');
+	like($mids[0], $sane_mid, 'new MID added');
 	is(scalar(@mids), 2, 'only one new MID added');
 
 	@warn = ();
 	$mime->header_set('Message-Id');
 	ok($im->add($mime), 'random MID made for MID free message');
 	@mids = $mime->header_obj->header_raw('Message-Id');
-	like($mids[0], qr/\A<\w+\@localhost>\z/, 'mid was generated');
+	like($mids[0], $sane_mid, 'mid was generated');
 	is(scalar(@mids), 1, 'new generated');
 }
 
-- 
EW


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 16/27] v2writable: test for idempotent removals
  2018-03-19  8:14 [PATCH 00/27] v2 public-inbox-watch support Eric Wong (Contractor, The Linux Foundation)
                   ` (14 preceding siblings ...)
  2018-03-19  8:14 ` [PATCH 15/27] import: switch to URL-safe Base64 for Message-IDs Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-19  8:14 ` Eric Wong (Contractor, The Linux Foundation)
  2018-03-19  8:14 ` [PATCH 17/27] import: enable locking under v2 Eric Wong (Contractor, The Linux Foundation)
                   ` (10 subsequent siblings)
  26 siblings, 0 replies; 29+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-19  8:14 UTC (permalink / raw)
  To: meta

This will make reindexing easier.
---
 t/v2writable.t | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/t/v2writable.t b/t/v2writable.t
index bbe6d14..5245a84 100644
--- a/t/v2writable.t
+++ b/t/v2writable.t
@@ -218,6 +218,11 @@ EOF
 	} else {
 		fail('failed to extract blob from log output');
 	}
+	is($im->remove($mime, 'test removal'), undef,
+		'remove is idempotent');
+	$im->done;
+	is($git0->qx(qw(log -1 --pretty=raw --raw -r --no-abbrev)),
+		$after, 'no git history made with idempotent remove');
 }
 
 done_testing();
-- 
EW


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 17/27] import: enable locking under v2
  2018-03-19  8:14 [PATCH 00/27] v2 public-inbox-watch support Eric Wong (Contractor, The Linux Foundation)
                   ` (15 preceding siblings ...)
  2018-03-19  8:14 ` [PATCH 16/27] v2writable: test for idempotent removals Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-19  8:14 ` Eric Wong (Contractor, The Linux Foundation)
  2018-03-19  8:14 ` [PATCH 18/27] index: s/GIT_DIR/REPO_DIR/ Eric Wong (Contractor, The Linux Foundation)
                   ` (9 subsequent siblings)
  26 siblings, 0 replies; 29+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-19  8:14 UTC (permalink / raw)
  To: meta

Instead of using ssoma-based locking, enable locking via Import
for now.
---
 lib/PublicInbox/Import.pm     | 14 ++++++--------
 lib/PublicInbox/V2Writable.pm |  6 +++++-
 2 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/lib/PublicInbox/Import.pm b/lib/PublicInbox/Import.pm
index 77e74c1..ca30ac4 100644
--- a/lib/PublicInbox/Import.pm
+++ b/lib/PublicInbox/Import.pm
@@ -29,7 +29,7 @@ sub new {
 		ref => $ref,
 		inbox => $ibx,
 		path_type => '2/38', # or 'v2'
-		ssoma_lock => 1, # disable for v2
+		lock_path => "$git->{git_dir}/ssoma.lock", # v2 changes this
 		bytes_added => 0,
 	}, $class
 }
@@ -46,13 +46,12 @@ sub gfi_start {
 	my $git = $self->{git};
 	my $git_dir = $git->{git_dir};
 
-	my $lockfh;
-	if ($self->{ssoma_lock}) {
-		my $lockpath = "$git_dir/ssoma.lock";
-		sysopen($lockfh, $lockpath, O_WRONLY|O_CREAT) or
-			die "failed to open lock $lockpath: $!";
+	if (my $lock_path = $self->{lock_path}) {
+		sysopen(my $lockfh, $lock_path, O_WRONLY|O_CREAT) or
+			die "failed to open lock $lock_path: $!";
 		# wait for other processes to be done
 		flock($lockfh, LOCK_EX) or die "lock failed: $!\n";
+		$self->{lockfh} = $lockfh;
 	}
 
 	local $/ = "\n";
@@ -66,7 +65,6 @@ sub gfi_start {
 	$out_w->autoflush(1);
 	$self->{in} = $in_r;
 	$self->{out} = $out_w;
-	$self->{lockfh} = $lockfh;
 	$self->{pid} = $pid;
 	$self->{nchg} = 0;
 	binmode $out_w, ':raw' or die "binmode :raw failed: $!";
@@ -386,7 +384,7 @@ sub done {
 
 	_update_git_info($self, 1) if delete $self->{nchg};
 
-	$self->{ssoma_lock} or return;
+	$self->{lock_path} or return;
 	my $lockfh = delete $self->{lockfh} or die "BUG: not locked: $!";
 	flock($lockfh, LOCK_UN) or die "unlock failed: $!";
 	close $lockfh or die "close lock failed: $!";
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index a305842..bdde76b 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -26,10 +26,13 @@ sub nproc () {
 sub new {
 	my ($class, $v2ibx, $creat) = @_;
 	my $dir = $v2ibx->{mainrepo} or die "no mainrepo in inbox\n";
+	my $lock_path = "$dir/inbox.lock";
 	unless (-d $dir) {
 		if ($creat) {
 			require File::Path;
 			File::Path::mkpath($dir);
+			open my $fh, '>>', $lock_path or
+				die "failed to open $lock_path: $!\n";
 		} else {
 			die "$dir does not exist\n";
 		}
@@ -57,6 +60,7 @@ sub new {
 		xap_ro => undef,
 		partitions => $nparts,
 		transact_bytes => 0,
+		lock_path => "$dir/inbox.lock",
 		# limit each repo to 1GB or so
 		rotate_bytes => int((1024 * 1024 * 1024) / $PACKING_FACTOR),
 	};
@@ -395,7 +399,7 @@ sub import_init {
 	my $im = PublicInbox::Import->new($git, undef, undef, $self->{-inbox});
 	$im->{bytes_added} = int($packed_bytes / $PACKING_FACTOR);
 	$im->{want_object_info} = 1;
-	$im->{ssoma_lock} = 0;
+	$im->{lock_path} = $self->{lock_path};
 	$im->{path_type} = 'v2';
 	$self->{im} = $im;
 }
-- 
EW


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 18/27] index: s/GIT_DIR/REPO_DIR/
  2018-03-19  8:14 [PATCH 00/27] v2 public-inbox-watch support Eric Wong (Contractor, The Linux Foundation)
                   ` (16 preceding siblings ...)
  2018-03-19  8:14 ` [PATCH 17/27] import: enable locking under v2 Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-19  8:14 ` Eric Wong (Contractor, The Linux Foundation)
  2018-03-19  8:14 ` [PATCH 19/27] Lock: new base class for writable lockers Eric Wong (Contractor, The Linux Foundation)
                   ` (8 subsequent siblings)
  26 siblings, 0 replies; 29+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-19  8:14 UTC (permalink / raw)
  To: meta

No functional changes, yet, but this makes future changes
easier-to-read.
---
 Documentation/public-inbox-index.pod |  8 ++++----
 script/public-inbox-index            | 12 +++++++-----
 2 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/Documentation/public-inbox-index.pod b/Documentation/public-inbox-index.pod
index 838a206..acc9039 100644
--- a/Documentation/public-inbox-index.pod
+++ b/Documentation/public-inbox-index.pod
@@ -4,7 +4,7 @@ public-inbox-index - create and update search indices
 
 =head1 SYNOPSIS
 
-public-inbox-index [OPTIONS] GIT_DIR
+public-inbox-index [OPTIONS] REPO_DIR
 
 =head1 DESCRIPTION
 
@@ -46,14 +46,14 @@ This does not touch the NNTP article number database.
 =head1 FILES
 
 All public-inbox-specific files are contained within the
-C<$GIT_DIR/public-inbox/> directory.  All files are expected to
+C<$REPO_DIR/public-inbox/> directory.  All files are expected to
 grow in size as more messages are archived, so using compaction
 commands (e.g. L<xapian-compact(1)>) is not recommended unless
 the list is no longer active.
 
 =over
 
-=item $GIT_DIR/public-inbox/msgmap.sqlite3
+=item $REPO_DIR/public-inbox/msgmap.sqlite3
 
 The stable NNTP article number to Message-ID mapping is
 stored in an SQLite3 database.
@@ -70,7 +70,7 @@ messages.
 This file is relatively small, and typically less than 5%
 of the space of the mail stored in a packed git repository.
 
-=item $GIT_DIR/public-inbox/xapian*
+=item $REPO_DIR/public-inbox/xapian*
 
 The database used by L<Search::Xapian>.  This directory name is
 followed by a number indicating the index schema version this
diff --git a/script/public-inbox-index b/script/public-inbox-index
index 594a3d9..1debbaa 100755
--- a/script/public-inbox-index
+++ b/script/public-inbox-index
@@ -4,13 +4,13 @@
 # Basic tool to create a Xapian search index for a git repository
 # configured for public-inbox.
 # Usage with libeatmydata <https://www.flamingspork.com/projects/libeatmydata/>
-# highly recommended: eatmydata public-inbox-index GIT_DIR
+# highly recommended: eatmydata public-inbox-index REPO_DIR
 
 use strict;
 use warnings;
 use Getopt::Long qw(:config gnu_getopt no_ignore_case auto_abbrev);
 use Cwd 'abs_path';
-my $usage = "public-inbox-index GIT_DIR";
+my $usage = "public-inbox-index REPO_DIR";
 use PublicInbox::Config;
 my $config = eval { PublicInbox::Config->new } || eval {
 	warn "public-inbox unconfigured for serving, indexing anyways...\n";
@@ -28,8 +28,10 @@ GetOptions(%opts) or die "bad command-line args\n$usage";
 
 my @dirs;
 
-sub resolve_git_dir {
+sub resolve_repo_dir {
 	my ($cd) = @_;
+	my $prefix = defined $cd ? $cd : './';
+
 	my @cmd = qw(git rev-parse --git-dir);
 	my $cmd = join(' ', @cmd);
 	my $pid = open my $fh, '-|';
@@ -53,9 +55,9 @@ sub resolve_git_dir {
 }
 
 if (@ARGV) {
-	@dirs = map { resolve_git_dir($_) } @ARGV;
+	@dirs = map { resolve_repo_dir($_) } @ARGV;
 } else {
-	@dirs = (resolve_git_dir());
+	@dirs = (resolve_repo_dir());
 }
 
 sub usage { print STDERR "Usage: $usage\n"; exit 1 }
-- 
EW


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 19/27] Lock: new base class for writable lockers
  2018-03-19  8:14 [PATCH 00/27] v2 public-inbox-watch support Eric Wong (Contractor, The Linux Foundation)
                   ` (17 preceding siblings ...)
  2018-03-19  8:14 ` [PATCH 18/27] index: s/GIT_DIR/REPO_DIR/ Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-19  8:14 ` Eric Wong (Contractor, The Linux Foundation)
  2018-03-19  8:14 ` [PATCH 20/27] t/watch_maildir: note the reason for FIFO creation Eric Wong (Contractor, The Linux Foundation)
                   ` (7 subsequent siblings)
  26 siblings, 0 replies; 29+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-19  8:14 UTC (permalink / raw)
  To: meta

This reduces code duplication needed for locking and
and hopefully makes things easier to understand.
---
 MANIFEST                             |  1 +
 lib/PublicInbox/Import.pm            | 17 ++++-------------
 lib/PublicInbox/Lock.pm              | 31 +++++++++++++++++++++++++++++++
 lib/PublicInbox/SearchIdx.pm         | 27 +++------------------------
 lib/PublicInbox/SearchIdxSkeleton.pm | 15 +++++++--------
 lib/PublicInbox/V2Writable.pm        | 12 ++++++------
 6 files changed, 52 insertions(+), 51 deletions(-)
 create mode 100644 lib/PublicInbox/Lock.pm

diff --git a/MANIFEST b/MANIFEST
index a42b9e1..3b0b013 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -68,6 +68,7 @@ lib/PublicInbox/Import.pm
 lib/PublicInbox/Inbox.pm
 lib/PublicInbox/Linkify.pm
 lib/PublicInbox/Listener.pm
+lib/PublicInbox/Lock.pm
 lib/PublicInbox/MDA.pm
 lib/PublicInbox/MID.pm
 lib/PublicInbox/MIME.pm
diff --git a/lib/PublicInbox/Import.pm b/lib/PublicInbox/Import.pm
index ca30ac4..fc740fa 100644
--- a/lib/PublicInbox/Import.pm
+++ b/lib/PublicInbox/Import.pm
@@ -7,7 +7,7 @@
 package PublicInbox::Import;
 use strict;
 use warnings;
-use Fcntl qw(:flock :DEFAULT);
+use base qw(PublicInbox::Lock);
 use PublicInbox::Spawn qw(spawn);
 use PublicInbox::MID qw(mids mid_mime mid2path);
 use PublicInbox::Address;
@@ -44,19 +44,13 @@ sub gfi_start {
 	pipe($in_r, $in_w) or die "pipe failed: $!";
 	pipe($out_r, $out_w) or die "pipe failed: $!";
 	my $git = $self->{git};
-	my $git_dir = $git->{git_dir};
 
-	if (my $lock_path = $self->{lock_path}) {
-		sysopen(my $lockfh, $lock_path, O_WRONLY|O_CREAT) or
-			die "failed to open lock $lock_path: $!";
-		# wait for other processes to be done
-		flock($lockfh, LOCK_EX) or die "lock failed: $!\n";
-		$self->{lockfh} = $lockfh;
-	}
+	$self->lock_acquire;
 
 	local $/ = "\n";
 	chomp($self->{tip} = $git->qx(qw(rev-parse --revs-only), $self->{ref}));
 
+	my $git_dir = $git->{git_dir};
 	my @cmd = ('git', "--git-dir=$git_dir", qw(fast-import
 			--quiet --done --date-format=raw));
 	my $rdr = { 0 => fileno($out_r), 1 => fileno($in_w) };
@@ -384,10 +378,7 @@ sub done {
 
 	_update_git_info($self, 1) if delete $self->{nchg};
 
-	$self->{lock_path} or return;
-	my $lockfh = delete $self->{lockfh} or die "BUG: not locked: $!";
-	flock($lockfh, LOCK_UN) or die "unlock failed: $!";
-	close $lockfh or die "close lock failed: $!";
+	$self->lock_release;
 }
 
 sub atfork_child {
diff --git a/lib/PublicInbox/Lock.pm b/lib/PublicInbox/Lock.pm
new file mode 100644
index 0000000..ca6b33f
--- /dev/null
+++ b/lib/PublicInbox/Lock.pm
@@ -0,0 +1,31 @@
+# Copyright (C) 2018 all contributors <meta@public-inbox.org>
+# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
+
+# Base class for per-inbox locking
+package PublicInbox::Lock;
+use strict;
+use warnings;
+use Fcntl qw(:flock :DEFAULT);
+use Carp qw(croak);
+
+# we only acquire the flock if creating or reindexing;
+# PublicInbox::Import already has the lock on its own.
+sub lock_acquire {
+	my ($self) = @_;
+	croak 'already locked' if $self->{lockfh};
+	my $lock_path = $self->{lock_path} or return;
+	sysopen(my $lockfh, $lock_path, O_WRONLY|O_CREAT) or
+		die "failed to open lock $lock_path: $!\n";
+	flock($lockfh, LOCK_EX) or die "lock failed: $!\n";
+	$self->{lockfh} = $lockfh;
+}
+
+sub lock_release {
+	my ($self) = @_;
+	return unless $self->{lock_path};
+	my $lockfh = delete $self->{lockfh} or croak 'not locked';
+	flock($lockfh, LOCK_UN) or die "unlock failed: $!\n";
+	close $lockfh or die "close failed: $!\n";
+}
+
+1;
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index ae2544d..0b9fb4b 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -9,9 +9,8 @@
 package PublicInbox::SearchIdx;
 use strict;
 use warnings;
-use Fcntl qw(:flock :DEFAULT);
+use base qw(PublicInbox::Search PublicInbox::Lock);
 use PublicInbox::MIME;
-use base qw(PublicInbox::Search);
 use PublicInbox::MID qw/mid_clean id_compress mid_mime mids references/;
 use PublicInbox::MsgIter;
 use Carp qw(croak);
@@ -96,7 +95,7 @@ sub _xdb_release {
 	my ($self) = @_;
 	my $xdb = delete $self->{xdb} or croak 'not acquired';
 	$xdb->close;
-	_lock_release($self) if $self->{creat};
+	$self->lock_release if $self->{creat};
 	undef;
 }
 
@@ -107,33 +106,13 @@ sub _xdb_acquire {
 	my $flag = Search::Xapian::DB_OPEN;
 	if ($self->{creat}) {
 		require File::Path;
-		_lock_acquire($self);
+		$self->lock_acquire;
 		File::Path::mkpath($dir);
 		$flag = Search::Xapian::DB_CREATE_OR_OPEN;
 	}
 	$self->{xdb} = Search::Xapian::WritableDatabase->new($dir, $flag);
 }
 
-# we only acquire the flock if creating or reindexing;
-# PublicInbox::Import already has the lock on its own.
-sub _lock_acquire {
-	my ($self) = @_;
-	croak 'already locked' if $self->{lockfh};
-	my $lock_path = $self->{lock_path} or return;
-	sysopen(my $lockfh, $lock_path, O_WRONLY|O_CREAT) or
-		die "failed to open lock $lock_path: $!\n";
-	flock($lockfh, LOCK_EX) or die "lock failed: $!\n";
-	$self->{lockfh} = $lockfh;
-}
-
-sub _lock_release {
-	my ($self) = @_;
-	return unless $self->{lock_path};
-	my $lockfh = delete $self->{lockfh} or croak 'not locked';
-	flock($lockfh, LOCK_UN) or die "unlock failed: $!\n";
-	close $lockfh or die "close failed: $!\n";
-}
-
 sub add_val ($$$) {
 	my ($doc, $col, $num) = @_;
 	$num = Search::Xapian::sortable_serialise($num);
diff --git a/lib/PublicInbox/SearchIdxSkeleton.pm b/lib/PublicInbox/SearchIdxSkeleton.pm
index 51a88fd..54a59ab 100644
--- a/lib/PublicInbox/SearchIdxSkeleton.pm
+++ b/lib/PublicInbox/SearchIdxSkeleton.pm
@@ -38,8 +38,7 @@ sub new {
 	$w->autoflush(1);
 
 	# lock on only exists in parent, not in worker
-	my $l = $self->{lock_path} = $self->xdir . '/pi-v2-skeleton.lock';
-	open my $fh, '>>', $l or die "failed to create $l: $!\n";
+	$self->{lock_path} = $self->xdir . '/pi-v2-skeleton.lock';
 	$self;
 }
 
@@ -111,9 +110,9 @@ sub index_skeleton {
 	# multiple processes write to the same pipe, so use flock
 	# We can't avoid this lock for <=PIPE_BUF writes, either,
 	# because those atomic writes can break up >PIPE_BUF ones
-	$self->_lock_acquire;
+	$self->lock_acquire;
 	print $w $str or $err = $!;
-	$self->_lock_release;
+	$self->lock_release;
 
 	die "print failed: $err\n" if $err;
 }
@@ -121,10 +120,10 @@ sub index_skeleton {
 sub remote_remove {
 	my ($self, $oid, $mid) = @_;
 	my $err;
-	$self->_lock_acquire;
+	$self->lock_acquire;
 	eval { $self->SUPER::remote_remove($oid, $mid) };
 	$err = $@;
-	$self->_lock_release;
+	$self->lock_release;
 	die $err if $err;
 }
 
@@ -151,9 +150,9 @@ sub barrier_init {
 	my ($self, $nparts) = @_;
 	my $w = $self->{w};
 	my $err;
-	$self->_lock_acquire;
+	$self->lock_acquire;
 	print $w "barrier_init $nparts\n" or $err = "failed to write: $!\n";
-	$self->_lock_release;
+	$self->lock_release;
 	die $err if $err;
 }
 
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index bdde76b..36901cd 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -5,7 +5,7 @@
 package PublicInbox::V2Writable;
 use strict;
 use warnings;
-use Fcntl qw(:flock :DEFAULT);
+use base qw(PublicInbox::Lock);
 use PublicInbox::SearchIdxPart;
 use PublicInbox::SearchIdxSkeleton;
 use PublicInbox::MIME;
@@ -26,13 +26,10 @@ sub nproc () {
 sub new {
 	my ($class, $v2ibx, $creat) = @_;
 	my $dir = $v2ibx->{mainrepo} or die "no mainrepo in inbox\n";
-	my $lock_path = "$dir/inbox.lock";
 	unless (-d $dir) {
 		if ($creat) {
 			require File::Path;
 			File::Path::mkpath($dir);
-			open my $fh, '>>', $lock_path or
-				die "failed to open $lock_path: $!\n";
 		} else {
 			die "$dir does not exist\n";
 		}
@@ -64,7 +61,7 @@ sub new {
 		# limit each repo to 1GB or so
 		rotate_bytes => int((1024 * 1024 * 1024) / $PACKING_FACTOR),
 	};
-	bless $self, $class
+	bless $self, $class;
 }
 
 # returns undef on duplicate or spam
@@ -188,6 +185,8 @@ sub idx_init {
 	# frequently activated.
 	delete $ibx->{$_} foreach (qw(git mm search));
 
+	$self->lock_acquire;
+
 	# first time initialization, first we create the skeleton pipe:
 	my $skel = $self->{skel} = PublicInbox::SearchIdxSkeleton->new($self);
 
@@ -253,6 +252,7 @@ sub done {
 	my $im = delete $self->{im};
 	$im->done if $im; # PublicInbox::Import::done
 	$self->searchidx_checkpoint(0);
+	$self->lock_release;
 }
 
 sub checkpoint {
@@ -399,7 +399,7 @@ sub import_init {
 	my $im = PublicInbox::Import->new($git, undef, undef, $self->{-inbox});
 	$im->{bytes_added} = int($packed_bytes / $PACKING_FACTOR);
 	$im->{want_object_info} = 1;
-	$im->{lock_path} = $self->{lock_path};
+	$im->{lock_path} = undef;
 	$im->{path_type} = 'v2';
 	$self->{im} = $im;
 }
-- 
EW


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 20/27] t/watch_maildir: note the reason for FIFO creation
  2018-03-19  8:14 [PATCH 00/27] v2 public-inbox-watch support Eric Wong (Contractor, The Linux Foundation)
                   ` (18 preceding siblings ...)
  2018-03-19  8:14 ` [PATCH 19/27] Lock: new base class for writable lockers Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-19  8:14 ` Eric Wong (Contractor, The Linux Foundation)
  2018-03-19  8:14 ` [PATCH 21/27] v2writable: ensure ->done is idempotent Eric Wong (Contractor, The Linux Foundation)
                   ` (6 subsequent siblings)
  26 siblings, 0 replies; 29+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-19  8:14 UTC (permalink / raw)
  To: meta

I had to dig through commit history for this and we should
better document our tests (along with everything else).
---
 t/watch_maildir.t | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/t/watch_maildir.t b/t/watch_maildir.t
index 30e94c1..7178f29 100644
--- a/t/watch_maildir.t
+++ b/t/watch_maildir.t
@@ -31,7 +31,8 @@ Date: Sat, 18 Jun 2016 00:00:00 +0000
 something
 EOF
 PublicInbox::Emergency->new($maildir)->prepare(\$msg);
-ok(POSIX::mkfifo("$maildir/cur/fifo", 0777));
+ok(POSIX::mkfifo("$maildir/cur/fifo", 0777),
+	'create FIFO to ensure we do not get stuck on it :P');
 my $sem = PublicInbox::Emergency->new($spamdir); # create dirs
 
 my $config = PublicInbox::Config->new({
-- 
EW


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 21/27] v2writable: ensure ->done is idempotent
  2018-03-19  8:14 [PATCH 00/27] v2 public-inbox-watch support Eric Wong (Contractor, The Linux Foundation)
                   ` (19 preceding siblings ...)
  2018-03-19  8:14 ` [PATCH 20/27] t/watch_maildir: note the reason for FIFO creation Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-19  8:14 ` Eric Wong (Contractor, The Linux Foundation)
  2018-03-19  8:14 ` [PATCH 22/27] watchmaildir: support v2 repositories Eric Wong (Contractor, The Linux Foundation)
                   ` (5 subsequent siblings)
  26 siblings, 0 replies; 29+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-19  8:14 UTC (permalink / raw)
  To: meta

This matches Import::done behavior
---
 lib/PublicInbox/V2Writable.pm | 3 ++-
 t/v2writable.t                | 2 ++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index 36901cd..5c104d8 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -249,10 +249,11 @@ sub remove {
 
 sub done {
 	my ($self) = @_;
+	my $locked = defined $self->{idx_parts};
 	my $im = delete $self->{im};
 	$im->done if $im; # PublicInbox::Import::done
 	$self->searchidx_checkpoint(0);
-	$self->lock_release;
+	$self->lock_release if $locked;
 }
 
 sub checkpoint {
diff --git a/t/v2writable.t b/t/v2writable.t
index 5245a84..771e8c1 100644
--- a/t/v2writable.t
+++ b/t/v2writable.t
@@ -223,6 +223,8 @@ EOF
 	$im->done;
 	is($git0->qx(qw(log -1 --pretty=raw --raw -r --no-abbrev)),
 		$after, 'no git history made with idempotent remove');
+	eval { $im->done };
+	ok(!$@, '->done is idempotent');
 }
 
 done_testing();
-- 
EW


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 22/27] watchmaildir: support v2 repositories
  2018-03-19  8:14 [PATCH 00/27] v2 public-inbox-watch support Eric Wong (Contractor, The Linux Foundation)
                   ` (20 preceding siblings ...)
  2018-03-19  8:14 ` [PATCH 21/27] v2writable: ensure ->done is idempotent Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-19  8:14 ` Eric Wong (Contractor, The Linux Foundation)
  2018-03-19  8:14 ` [PATCH 23/27] searchidxpart: s/barrier/remote_barrier/ Eric Wong (Contractor, The Linux Foundation)
                   ` (4 subsequent siblings)
  26 siblings, 0 replies; 29+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-19  8:14 UTC (permalink / raw)
  To: meta

Unfortunately this gives up some minor performance tweaks we
made to avoid reforking import processes.
---
 MANIFEST                        |   1 +
 lib/PublicInbox/WatchMaildir.pm |  29 +++++-----
 t/watch_maildir_v2.t            | 125 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 139 insertions(+), 16 deletions(-)
 create mode 100644 t/watch_maildir_v2.t

diff --git a/MANIFEST b/MANIFEST
index 3b0b013..4346cd9 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -182,3 +182,4 @@ t/utf8.mbox
 t/v2writable.t
 t/view.t
 t/watch_maildir.t
+t/watch_maildir_v2.t
diff --git a/lib/PublicInbox/WatchMaildir.pm b/lib/PublicInbox/WatchMaildir.pm
index 3adebdd..2808b72 100644
--- a/lib/PublicInbox/WatchMaildir.pm
+++ b/lib/PublicInbox/WatchMaildir.pm
@@ -91,18 +91,6 @@ sub _done_for_now {
 	my ($self) = @_;
 	my $importers = $self->{importers};
 	foreach my $im (values %$importers) {
-		$im->barrier;
-	}
-
-	my $opendirs = $self->{opendirs};
-
-	# spamdir scanning means every importer remains open
-	my $spamdir = $self->{spamdir};
-	return if defined($spamdir) && $opendirs->{$spamdir};
-
-	foreach my $im (values %$importers) {
-		# not done if we're scanning
-		next if $opendirs->{$im->{git}->{git_dir}};
 		$im->done;
 	}
 }
@@ -267,10 +255,19 @@ sub _path_to_mime {
 sub _importer_for {
 	my ($self, $inbox) = @_;
 	my $im = $inbox->{-import} ||= eval {
-		my $git = $inbox->git;
-		my $name = $inbox->{name};
-		my $addr = $inbox->{-primary_address};
-		PublicInbox::Import->new($git, $name, $addr, $inbox);
+		my $v = $inbox->{version} || 1;
+		if ($v == 2) {
+			eval { require PublicInbox::V2Writable };
+			die "v2 not supported: $@\n" if $@;
+			PublicInbox::V2Writable->new($inbox);
+		} elsif ($v == 1) {
+			my $git = $inbox->git;
+			my $name = $inbox->{name};
+			my $addr = $inbox->{-primary_address};
+			PublicInbox::Import->new($git, $name, $addr, $inbox);
+		} else {
+			die "unsupported inbox version: $v\n";
+		}
 	};
 
 	my $importers = $self->{importers};
diff --git a/t/watch_maildir_v2.t b/t/watch_maildir_v2.t
new file mode 100644
index 0000000..85130e3
--- /dev/null
+++ b/t/watch_maildir_v2.t
@@ -0,0 +1,125 @@
+# Copyright (C) 2018 all contributors <meta@public-inbox.org>
+# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
+use Test::More;
+use File::Temp qw/tempdir/;
+use PublicInbox::MIME;
+use Cwd;
+use PublicInbox::Config;
+my @mods = qw(Filesys::Notify::Simple PublicInbox::V2Writable);
+foreach my $mod (@mods) {
+	eval "require $mod";
+	plan skip_all => "$mod missing for watch_maildir_v2.t" if $@;
+}
+
+my $tmpdir = tempdir('watch_maildir-v2-XXXXXX', TMPDIR => 1, CLEANUP => 1);
+my $mainrepo = "$tmpdir/v2";
+my $maildir = "$tmpdir/md";
+my $spamdir = "$tmpdir/spam";
+use_ok 'PublicInbox::WatchMaildir';
+use_ok 'PublicInbox::Emergency';
+my $cfgpfx = "publicinbox.test";
+my $addr = 'test-public@example.com';
+my @cmd = ('blib/script/public-inbox-init', '-V2', 'test', $mainrepo,
+	'http://example.com/v2list', $addr);
+local $ENV{PI_CONFIG} = "$tmpdir/pi_config";
+is(system(@cmd), 0, 'public-inbox init OK');
+
+my $msg = <<EOF;
+From: user\@example.com
+To: $addr
+Subject: spam
+Message-Id: <a\@b.com>
+Date: Sat, 18 Jun 2016 00:00:00 +0000
+
+something
+EOF
+PublicInbox::Emergency->new($maildir)->prepare(\$msg);
+ok(POSIX::mkfifo("$maildir/cur/fifo", 0777),
+	'create FIFO to ensure we do not get stuck on it :P');
+my $sem = PublicInbox::Emergency->new($spamdir); # create dirs
+
+my $config = PublicInbox::Config->new({
+	"$cfgpfx.address" => $addr,
+	"$cfgpfx.mainrepo" => $mainrepo,
+	"$cfgpfx.watch" => "maildir:$maildir",
+	"$cfgpfx.filter" => 'PublicInbox::Filter::Vger',
+	"publicinboxlearn.watchspam" => "maildir:$spamdir",
+});
+my $ibx = $config->lookup_name('test');
+ok($ibx, 'found inbox by name');
+my $srch = $ibx->search;
+
+PublicInbox::WatchMaildir->new($config)->scan('full');
+my $res = $srch->reopen->query('');
+is($res->{total}, 1, 'got one revision');
+
+# my $git = PublicInbox::Git->new("$mainrepo/git/0.git");
+# my @list = $git->qx(qw(rev-list refs/heads/master));
+# is(scalar @list, 1, 'one revision in rev-list');
+
+my $write_spam = sub {
+	is(scalar glob("$spamdir/new/*"), undef, 'no spam existing');
+	$sem->prepare(\$msg);
+	$sem->commit;
+	my @new = glob("$spamdir/new/*");
+	is(scalar @new, 1);
+	my @p = split(m!/+!, $new[0]);
+	ok(link($new[0], "$spamdir/cur/".$p[-1].":2,S"));
+	is(unlink($new[0]), 1);
+};
+$write_spam->();
+is(unlink(glob("$maildir/new/*")), 1, 'unlinked old spam');
+PublicInbox::WatchMaildir->new($config)->scan('full');
+is($srch->reopen->query('')->{total}, 0, 'deleted file');
+
+# check with scrubbing
+{
+	$msg .= qq(--
+To unsubscribe from this list: send the line "unsubscribe git" in
+the body of a message to majordomo\@vger.kernel.org
+More majordomo info at  http://vger.kernel.org/majordomo-info.html\n);
+	PublicInbox::Emergency->new($maildir)->prepare(\$msg);
+	PublicInbox::WatchMaildir->new($config)->scan('full');
+	$res = $srch->reopen->query('');
+	is($res->{total}, 1, 'got one file back');
+	my $mref = $ibx->msg_by_smsg($res->{msgs}->[0]);
+	like($$mref, qr/something\n\z/s, 'message scrubbed on import');
+
+	is(unlink(glob("$maildir/new/*")), 1, 'unlinked spam');
+	$write_spam->();
+	PublicInbox::WatchMaildir->new($config)->scan('full');
+	$res = $srch->reopen->query('');
+	is($res->{total}, 0, 'inbox is empty again');
+}
+
+{
+	my $fail_bin = getcwd()."/t/fail-bin";
+	ok(-x "$fail_bin/spamc", "mock spamc exists");
+	my $fail_path = "$fail_bin:$ENV{PATH}"; # for spamc ham mock
+	local $ENV{PATH} = $fail_path;
+	PublicInbox::Emergency->new($maildir)->prepare(\$msg);
+	$config->{'publicinboxwatch.spamcheck'} = 'spamc';
+	{
+		local $SIG{__WARN__} = sub {}; # quiet spam check warning
+		PublicInbox::WatchMaildir->new($config)->scan('full');
+	}
+	$res = $srch->reopen->query('');
+	is($res->{total}, 0, 'inbox is still empty');
+	is(unlink(glob("$maildir/new/*")), 1);
+}
+
+{
+	my $main_bin = getcwd()."/t/main-bin";
+	ok(-x "$main_bin/spamc", "mock spamc exists");
+	my $main_path = "$main_bin:$ENV{PATH}"; # for spamc ham mock
+	local $ENV{PATH} = $main_path;
+	PublicInbox::Emergency->new($maildir)->prepare(\$msg);
+	$config->{'publicinboxwatch.spamcheck'} = 'spamc';
+	PublicInbox::WatchMaildir->new($config)->scan('full');
+	$res = $srch->reopen->query('');
+	is($res->{total}, 1, 'inbox has one mail after spamc OK-ed a message');
+	my $mref = $ibx->msg_by_smsg($res->{msgs}->[0]);
+	like($$mref, qr/something\n\z/s, 'message scrubbed on import');
+}
+
+done_testing;
-- 
EW


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 23/27] searchidxpart: s/barrier/remote_barrier/
  2018-03-19  8:14 [PATCH 00/27] v2 public-inbox-watch support Eric Wong (Contractor, The Linux Foundation)
                   ` (21 preceding siblings ...)
  2018-03-19  8:14 ` [PATCH 22/27] watchmaildir: support v2 repositories Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-19  8:14 ` Eric Wong (Contractor, The Linux Foundation)
  2018-03-19  8:14 ` [PATCH 24/27] v2writable: allow disabling parallelization Eric Wong (Contractor, The Linux Foundation)
                   ` (3 subsequent siblings)
  26 siblings, 0 replies; 29+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-19  8:14 UTC (permalink / raw)
  To: meta

Be consistent with our "remote_" prefix for other IPC subs
---
 lib/PublicInbox/SearchIdxPart.pm | 2 +-
 lib/PublicInbox/V2Writable.pm    | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/PublicInbox/SearchIdxPart.pm b/lib/PublicInbox/SearchIdxPart.pm
index c166078..d8c8c8b 100644
--- a/lib/PublicInbox/SearchIdxPart.pm
+++ b/lib/PublicInbox/SearchIdxPart.pm
@@ -95,7 +95,7 @@ sub atfork_child {
 }
 
 # called by V2Writable:
-sub barrier {
+sub remote_barrier {
 	my $w = $_[0]->{w};
 	print $w "barrier\n" or die "failed to print: $!";
 	$w->flush or die "failed to flush: $!";
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index 5c104d8..261f9d9 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -280,7 +280,7 @@ sub barrier {
 		# Now deal with Xapian
 		$skel->barrier_init(scalar(@$parts));
 		# each partition needs to issue a barrier command to skel:
-		$_->barrier foreach @$parts;
+		$_->remote_barrier foreach @$parts;
 
 		$skel->barrier_wait; # wait for each Xapian partition
 
-- 
EW


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 24/27] v2writable: allow disabling parallelization
  2018-03-19  8:14 [PATCH 00/27] v2 public-inbox-watch support Eric Wong (Contractor, The Linux Foundation)
                   ` (22 preceding siblings ...)
  2018-03-19  8:14 ` [PATCH 23/27] searchidxpart: s/barrier/remote_barrier/ Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-19  8:14 ` Eric Wong (Contractor, The Linux Foundation)
  2018-03-19  8:14 ` [PATCH 25/27] scripts/import_vger_from_mbox: filter out same headers as MDA Eric Wong (Contractor, The Linux Foundation)
                   ` (2 subsequent siblings)
  26 siblings, 0 replies; 29+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-19  8:14 UTC (permalink / raw)
  To: meta

While parallel processes improves import speed for initial
imports; they are probably not necessary for daily mail imports
via WatchMaildir and certainly not for public-inbox-init.  Save
some memory for daily use and even helps improve readability of
some subroutines by showing which methods they call remotely.
---
 lib/PublicInbox/SearchIdx.pm         | 56 ++++++++++++++++++++++++++------
 lib/PublicInbox/SearchIdxPart.pm     | 62 ++++++++++++++++++-----------------
 lib/PublicInbox/SearchIdxSkeleton.pm | 63 +++++++++++++++++-------------------
 lib/PublicInbox/V2Writable.pm        |  3 +-
 lib/PublicInbox/WatchMaildir.pm      |  4 ++-
 script/public-inbox-init             |  3 +-
 t/v2writable.t                       |  6 ++--
 7 files changed, 118 insertions(+), 79 deletions(-)

diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 0b9fb4b..3d80b00 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -523,6 +523,7 @@ sub link_and_save {
 	$doc->add_boolean_term('Q' . $_) foreach @$mids;
 
 	my $vivified = 0;
+	$self->{skel} and die "Should not have read-only skel here\n";;
 	foreach my $mid (@$mids) {
 		$self->each_smsg_by_mid($mid, sub {
 			my ($cur) = @_;
@@ -887,24 +888,59 @@ sub DESTROY {
 # remote_* subs are only used by SearchIdxPart and SearchIdxSkeleton
 sub remote_commit {
 	my ($self) = @_;
-	print { $self->{w} } "commit\n" or die "failed to write commit: $!";
+	if (my $w = $self->{w}) {
+		print $w "commit\n" or die "failed to write commit: $!";
+	} else {
+		$self->commit_txn_lazy;
+		if (my $skel = $self->{skeleton}) {
+			$skel->commit_txn_lazy;
+		}
+	}
 }
 
 sub remote_close {
 	my ($self) = @_;
-	my $pid = delete $self->{pid} or die "no process to wait on\n";
-	my $w = delete $self->{w} or die "no pipe to write to\n";
-	print $w "close\n" or die "failed to write to pid:$pid: $!\n";
-	close $w or die "failed to close pipe for pid:$pid: $!\n";
-	waitpid($pid, 0) == $pid or die "remote process did not finish";
-	$? == 0 or die ref($self)." pid:$pid exited with: $?";
+	if (my $w = delete $self->{w}) {
+		my $pid = delete $self->{pid} or die "no process to wait on\n";
+		print $w "close\n" or die "failed to write to pid:$pid: $!\n";
+		close $w or die "failed to close pipe for pid:$pid: $!\n";
+		waitpid($pid, 0) == $pid or die "remote process did not finish";
+		$? == 0 or die ref($self)." pid:$pid exited with: $?";
+	} else {
+		die "transaction in progress $self\n" if $self->{txn};
+		$self->_xdb_release if $self->{xdb};
+	}
 }
 
-# triggers remove_by_oid in partition or skeleton
 sub remote_remove {
 	my ($self, $oid, $mid) = @_;
-	print { $self->{w} } "D $oid $mid\n" or
-			die "failed to write remove $!";
+	if (my $w = $self->{w}) {
+		# triggers remove_by_oid in partition or skeleton
+		print $w "D $oid $mid\n" or die "failed to write remove $!";
+	} else {
+		$self->begin_txn_lazy;
+		$self->remove_by_oid($oid, $mid);
+	}
+}
+
+sub begin_txn_lazy {
+	my ($self) = @_;
+	return if $self->{txn};
+	my $xdb = $self->{xdb} || $self->_xdb_acquire;
+	$xdb->begin_transaction;
+	$self->{txn} = 1;
+}
+
+sub commit_txn_lazy {
+	my ($self) = @_;
+	delete $self->{txn} or return;
+	$self->{xdb}->commit_transaction;
+}
+
+sub worker_done {
+	my ($self) = @_;
+	die "$$ $0 xdb not released\n" if $self->{xdb};
+	die "$$ $0 still in transaction\n" if $self->{txn};
 }
 
 1;
diff --git a/lib/PublicInbox/SearchIdxPart.pm b/lib/PublicInbox/SearchIdxPart.pm
index d8c8c8b..82f5c1b 100644
--- a/lib/PublicInbox/SearchIdxPart.pm
+++ b/lib/PublicInbox/SearchIdxPart.pm
@@ -9,6 +9,15 @@ sub new {
 	my ($class, $v2writable, $part, $skel) = @_;
 	my $self = $class->SUPER::new($v2writable->{-inbox}, 1, $part);
 	$self->{skeleton} = $skel;
+	# create the DB:
+	$self->_xdb_acquire;
+	$self->_xdb_release;
+	$self->spawn_worker($v2writable, $part) if $v2writable->{parallel};
+	$self;
+}
+
+sub spawn_worker {
+	my ($self, $v2writable, $part) = @_;
 	my ($r, $w);
 	pipe($r, $w) or die "pipe failed: $!\n";
 	binmode $r, ':raw';
@@ -32,44 +41,30 @@ sub new {
 	$self->{pid} = $pid;
 	$self->{w} = $w;
 	close $r;
-	$self;
 }
 
 sub partition_worker_loop ($$$) {
 	my ($self, $r, $part) = @_;
 	$0 = "pi-v2-partition[$part]";
-	my $xdb = $self->_xdb_acquire;
-	$xdb->begin_transaction;
-	my $txn = 1;
+	$self->begin_txn_lazy;
 	while (my $line = $r->getline) {
 		if ($line eq "commit\n") {
-			$xdb->commit_transaction if $txn;
-			$txn = undef;
+			$self->commit_txn_lazy;
 			$self->{skeleton}->remote_commit;
 		} elsif ($line eq "close\n") {
 			$self->_xdb_release;
-			$xdb = $txn = undef;
 		} elsif ($line eq "barrier\n") {
-			$xdb->commit_transaction if $txn;
-			$txn = undef;
+			$self->commit_txn_lazy;
 			print { $self->{skeleton}->{w} } "barrier $part\n" or
 					die "write failed to skeleton: $!\n";
 		} elsif ($line =~ /\AD ([a-f0-9]{40,}) (.+)\n\z/s) {
 			my ($oid, $mid) = ($1, $2);
-			$xdb ||= $self->_xdb_acquire;
-			if (!$txn) {
-				$xdb->begin_transaction;
-				$txn = 1;
-			}
+			$self->begin_txn_lazy;
 			$self->remove_by_oid($oid, $mid);
 		} else {
 			chomp $line;
 			my ($len, $artnum, $oid, $mid0) = split(/ /, $line);
-			$xdb ||= $self->_xdb_acquire;
-			if (!$txn) {
-				$xdb->begin_transaction;
-				$txn = 1;
-			}
+			$self->begin_txn_lazy;
 			my $n = read($r, my $msg, $len) or die "read: $!\n";
 			$n == $len or die "short read: $n != $len\n";
 			my $mime = PublicInbox::MIME->new(\$msg);
@@ -77,17 +72,21 @@ sub partition_worker_loop ($$$) {
 			$self->add_message($mime, $n, $artnum, $oid, $mid0);
 		}
 	}
-	warn "$$ still in transaction\n" if $txn;
-	warn "$$ xdb active\n" if $xdb;
+	$self->worker_done;
 }
 
 # called by V2Writable
 sub index_raw {
-	my ($self, $len, $msgref, $artnum, $object_id, $mid0) = @_;
-	my $w = $self->{w};
-	print $w "$len $artnum $object_id $mid0\n", $$msgref or die
-		"failed to write partition $!\n";
-	$w->flush or die "failed to flush: $!\n";
+	my ($self, $bytes, $msgref, $artnum, $oid, $mid0, $mime) = @_;
+	if (my $w = $self->{w}) {
+		print $w "$bytes $artnum $oid $mid0\n", $$msgref or die
+			"failed to write partition $!\n";
+		$w->flush or die "failed to flush: $!\n";
+	} else {
+		$$msgref = undef;
+		$self->begin_txn_lazy;
+		$self->add_message($mime, $bytes, $artnum, $oid, $mid0);
+	}
 }
 
 sub atfork_child {
@@ -96,9 +95,14 @@ sub atfork_child {
 
 # called by V2Writable:
 sub remote_barrier {
-	my $w = $_[0]->{w};
-	print $w "barrier\n" or die "failed to print: $!";
-	$w->flush or die "failed to flush: $!";
+	my ($self) = @_;
+	if (my $w = $self->{w}) {
+		print $w "barrier\n" or die "failed to print: $!";
+		$w->flush or die "failed to flush: $!";
+	} else {
+		$self->commit_txn_lazy;
+		$self->{skeleton}->remote_commit;
+	}
 }
 
 1;
diff --git a/lib/PublicInbox/SearchIdxSkeleton.pm b/lib/PublicInbox/SearchIdxSkeleton.pm
index 54a59ab..ba43969 100644
--- a/lib/PublicInbox/SearchIdxSkeleton.pm
+++ b/lib/PublicInbox/SearchIdxSkeleton.pm
@@ -12,7 +12,12 @@ sub new {
 	# create the DB:
 	$self->_xdb_acquire;
 	$self->_xdb_release;
+	$self->spawn_worker($v2writable) if $v2writable->{parallel};
+	$self
+}
 
+sub spawn_worker {
+	my ($self, $v2writable) = @_;
 	my ($r, $w);
 	pipe($r, $w) or die "pipe failed: $!\n";
 	my ($barrier_wait, $barrier_note);
@@ -39,24 +44,19 @@ sub new {
 
 	# lock on only exists in parent, not in worker
 	$self->{lock_path} = $self->xdir . '/pi-v2-skeleton.lock';
-	$self;
 }
 
 sub skeleton_worker_loop {
 	my ($self, $r, $barrier_note) = @_;
 	$barrier_note->autoflush(1);
 	$0 = 'pi-v2-skeleton';
-	my $xdb = $self->_xdb_acquire;
-	$xdb->begin_transaction;
-	my $txn = 1;
+	$self->begin_txn_lazy;
 	my $barrier = undef;
 	while (my $line = $r->getline) {
 		if ($line eq "commit\n") {
-			$xdb->commit_transaction if $txn;
-			$txn = undef;
+			$self->commit_txn_lazy;
 		} elsif ($line eq "close\n") {
 			$self->_xdb_release;
-			$xdb = $txn = undef;
 		} elsif ($line =~ /\Abarrier_init (\d+)\n\z/) {
 			my $n = $1 - 1;
 			die "barrier in-progress\n" if defined $barrier;
@@ -67,18 +67,13 @@ sub skeleton_worker_loop {
 			delete $barrier->{$1} or die "unknown barrier: $part\n";
 			if ((scalar keys %$barrier) == 0) {
 				$barrier = undef;
-				$xdb->commit_transaction if $txn;
-				$txn = undef;
+				$self->commit_txn_lazy;
 				print $barrier_note "barrier_done\n" or die
 					"print failed to barrier note: $!";
 			}
 		} elsif ($line =~ /\AD ([a-f0-9]{40,}) (.*)\n\z/s) {
 			my ($oid, $mid) = ($1, $2);
-			$xdb ||= $self->_xdb_acquire;
-			if (!$txn) {
-				$xdb->begin_transaction;
-				$txn = 1;
-			}
+			$self->begin_txn_lazy;
 			$self->remove_by_oid($oid, $mid);
 		} else {
 			my $len = int($line);
@@ -86,35 +81,34 @@ sub skeleton_worker_loop {
 			$n == $len or die "short read: $n != $len\n";
 			$msg = thaw($msg); # should raise on error
 			defined $msg or die "failed to thaw buffer\n";
-			$xdb ||= $self->_xdb_acquire;
-			if (!$txn) {
-				$xdb->begin_transaction;
-				$txn = 1;
-			}
+			$self->begin_txn_lazy;
 			eval { index_skeleton_real($self, $msg) };
 			warn "failed to index message <$msg->[-1]>: $@\n" if $@;
 		}
 	}
-	die "xdb not released\n" if $xdb;
-	die "in transaction\n" if $txn;
+	$self->worker_done;
 }
 
 # called by a partition worker
 sub index_skeleton {
 	my ($self, $values) = @_;
-	my $w = $self->{w};
-	my $err;
-	my $str = freeze($values);
-	$str = length($str) . "\n" . $str;
+	if (my $w = $self->{w}) {
+		my $err;
+		my $str = freeze($values);
+		$str = length($str) . "\n" . $str;
 
-	# multiple processes write to the same pipe, so use flock
-	# We can't avoid this lock for <=PIPE_BUF writes, either,
-	# because those atomic writes can break up >PIPE_BUF ones
-	$self->lock_acquire;
-	print $w $str or $err = $!;
-	$self->lock_release;
+		# multiple processes write to the same pipe, so use flock
+		# We can't avoid this lock for <=PIPE_BUF writes, either,
+		# because those atomic writes can break up >PIPE_BUF ones
+		$self->lock_acquire;
+		print $w $str or $err = $!;
+		$self->lock_release;
 
-	die "print failed: $err\n" if $err;
+		die "print failed: $err\n" if $err;
+	} else {
+		$self->begin_txn_lazy;
+		index_skeleton_real($self, $values);
+	}
 }
 
 sub remote_remove {
@@ -148,7 +142,7 @@ sub index_skeleton_real ($$) {
 # write to the subprocess
 sub barrier_init {
 	my ($self, $nparts) = @_;
-	my $w = $self->{w};
+	my $w = $self->{w} or return;
 	my $err;
 	$self->lock_acquire;
 	print $w "barrier_init $nparts\n" or $err = "failed to write: $!\n";
@@ -158,7 +152,8 @@ sub barrier_init {
 
 sub barrier_wait {
 	my ($self) = @_;
-	my $l = $self->{barrier_wait}->getline;
+	my $bw = $self->{barrier_wait} or return;
+	my $l = $bw->getline;
 	$l eq "barrier_done\n" or die "bad response from barrier_wait: $l\n";
 }
 
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index 261f9d9..8e1363e 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -56,6 +56,7 @@ sub new {
 		xap_rw => undef, # PublicInbox::V2SearchIdx
 		xap_ro => undef,
 		partitions => $nparts,
+		parallel => 1,
 		transact_bytes => 0,
 		lock_path => "$dir/inbox.lock",
 		# limit each repo to 1GB or so
@@ -93,7 +94,7 @@ sub add {
 	my $nparts = $self->{partitions};
 	my $part = $num % $nparts;
 	my $idx = $self->idx_part($part);
-	$idx->index_raw($len, $msgref, $num, $oid, $mid0);
+	$idx->index_raw($len, $msgref, $num, $oid, $mid0, $mime);
 	my $n = $self->{transact_bytes} += $len;
 	if ($n > (PublicInbox::SearchIdx::BATCH_BYTES * $nparts)) {
 		$self->checkpoint;
diff --git a/lib/PublicInbox/WatchMaildir.pm b/lib/PublicInbox/WatchMaildir.pm
index 2808b72..e28e602 100644
--- a/lib/PublicInbox/WatchMaildir.pm
+++ b/lib/PublicInbox/WatchMaildir.pm
@@ -259,7 +259,9 @@ sub _importer_for {
 		if ($v == 2) {
 			eval { require PublicInbox::V2Writable };
 			die "v2 not supported: $@\n" if $@;
-			PublicInbox::V2Writable->new($inbox);
+			my $v2w = PublicInbox::V2Writable->new($inbox);
+			$v2w->{parallel} = 0;
+			$v2w;
 		} elsif ($v == 1) {
 			my $git = $inbox->git;
 			my $name = $inbox->{name};
diff --git a/script/public-inbox-init b/script/public-inbox-init
index f7a60fb..fdad136 100755
--- a/script/public-inbox-init
+++ b/script/public-inbox-init
@@ -83,8 +83,9 @@ if ($version >= 2) {
 	};
 	$ibx = PublicInbox::Inbox->new($ibx);
 	my $v2w = PublicInbox::V2Writable->new($ibx, 1);
+	$v2w->{parallel} = 0;
+	$v2w->idx_init;
 	$v2w->git_init(0);
-	$v2w->idx_init(0);
 	$v2w->done;
 } elsif ($version == 1) {
 	x(qw(git init -q --bare), $mainrepo);
diff --git a/t/v2writable.t b/t/v2writable.t
index 771e8c1..2088f3f 100644
--- a/t/v2writable.t
+++ b/t/v2writable.t
@@ -108,10 +108,10 @@ if ('ensure git configs are correct') {
 	ok($im->add($mime), 'message with multiple Message-ID');
 	$im->done;
 	my @found;
-	$ibx->search->reopen;
-	$ibx->search->each_smsg_by_mid('abcde@1', sub { push @found, @_; 1 });
+	my $srch = $ibx->search;
+	$srch->reopen->each_smsg_by_mid('abcde@1', sub { push @found, @_; 1 });
 	is(scalar(@found), 1, 'message found by first MID');
-	$ibx->search->each_smsg_by_mid('abcde@2', sub { push @found, @_; 1 });
+	$srch->reopen->each_smsg_by_mid('abcde@2', sub { push @found, @_; 1 });
 	is(scalar(@found), 2, 'message found by second MID');
 	is($found[0]->{doc_id}, $found[1]->{doc_id}, 'same document');
 	ok($found[1]->{doc_id} > 0, 'doc_id is positive');
-- 
EW


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 25/27] scripts/import_vger_from_mbox: filter out same headers as MDA
  2018-03-19  8:14 [PATCH 00/27] v2 public-inbox-watch support Eric Wong (Contractor, The Linux Foundation)
                   ` (23 preceding siblings ...)
  2018-03-19  8:14 ` [PATCH 24/27] v2writable: allow disabling parallelization Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-19  8:14 ` Eric Wong (Contractor, The Linux Foundation)
  2018-03-19  8:14 ` [PATCH 26/27] v2writable: add DEBUG_DIFF env support Eric Wong (Contractor, The Linux Foundation)
  2018-03-19  8:14 ` [PATCH 27/27] v2writable: remove "resent" message for duplicate Message-IDs Eric Wong (Contractor, The Linux Foundation)
  26 siblings, 0 replies; 29+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-19  8:14 UTC (permalink / raw)
  To: meta

Perhaps we should filter these headers out in Import
---
 scripts/import_vger_from_mbox | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/scripts/import_vger_from_mbox b/scripts/import_vger_from_mbox
index 6a00fae..1edb987 100644
--- a/scripts/import_vger_from_mbox
+++ b/scripts/import_vger_from_mbox
@@ -8,6 +8,7 @@ use PublicInbox::MIME;
 use PublicInbox::Inbox;
 use PublicInbox::V2Writable;
 use PublicInbox::Import;
+use PublicInbox::MDA;
 my $usage = "usage: $0 NAME EMAIL DIR <MBOX\n";
 my $dry_run;
 my $version = 2;
@@ -57,6 +58,7 @@ sub do_add ($$) {
 	}
 	$mime = $vger->scrub($mime);
 	return unless $im;
+	$mime->header_set($_) foreach @PublicInbox::MDA::BAD_HEADERS;
 	$im->add($mime) or
 		warn "duplicate: ",
 			$mime->header_obj->header_raw('Message-ID'), "\n";
-- 
EW


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 26/27] v2writable: add DEBUG_DIFF env support
  2018-03-19  8:14 [PATCH 00/27] v2 public-inbox-watch support Eric Wong (Contractor, The Linux Foundation)
                   ` (24 preceding siblings ...)
  2018-03-19  8:14 ` [PATCH 25/27] scripts/import_vger_from_mbox: filter out same headers as MDA Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-19  8:14 ` Eric Wong (Contractor, The Linux Foundation)
  2018-03-19  8:14 ` [PATCH 27/27] v2writable: remove "resent" message for duplicate Message-IDs Eric Wong (Contractor, The Linux Foundation)
  26 siblings, 0 replies; 29+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-19  8:14 UTC (permalink / raw)
  To: meta

This can help us track down some differences during import,
if needed.
---
 lib/PublicInbox/V2Writable.pm | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index 8e1363e..30ca9ce 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -406,6 +406,26 @@ sub import_init {
 	$self->{im} = $im;
 }
 
+# XXX experimental
+sub diff ($$$) {
+	my ($mid, $cur, $new) = @_;
+	use File::Temp qw(tempfile);
+	use PublicInbox::Spawn qw(spawn);
+
+	my ($ah, $an) = tempfile('email-cur-XXXXXXXX');
+	print $ah $cur->as_string or die "print: $!";
+	close $ah or die "close: $!";
+	my ($bh, $bn) = tempfile('email-new-XXXXXXXX');
+	print $bh $new->as_string or die "print: $!";
+	close $bh or die "close: $!";
+	my $cmd = [ qw(diff -u), $an, $bn ];
+	print STDERR "# MID conflict <$mid>\n";
+	my $pid = spawn($cmd, undef, { 1 => 2 });
+	defined $pid or die "diff failed to spawn $!";
+	waitpid($pid, 0) == $pid or die "diff did not finish";
+	unlink($an, $bn);
+}
+
 sub lookup_content {
 	my ($self, $mime, $mid) = @_;
 	my $ibx = $self->{-inbox};
@@ -427,6 +447,10 @@ sub lookup_content {
 			$found = $smsg;
 			return 0; # break out of loop
 		}
+
+		# XXX DEBUG_DIFF is experimental and may be removed
+		diff($mid, $cur, $mime) if $ENV{DEBUG_DIFF};
+
 		1; # continue
 	});
 	$found;
-- 
EW


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 27/27] v2writable: remove "resent" message for duplicate Message-IDs
  2018-03-19  8:14 [PATCH 00/27] v2 public-inbox-watch support Eric Wong (Contractor, The Linux Foundation)
                   ` (25 preceding siblings ...)
  2018-03-19  8:14 ` [PATCH 26/27] v2writable: add DEBUG_DIFF env support Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-19  8:14 ` Eric Wong (Contractor, The Linux Foundation)
  2018-03-19  8:18   ` SQUASH: " Eric Wong
  26 siblings, 1 reply; 29+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-19  8:14 UTC (permalink / raw)
  To: meta

public-inbox-watch gets restarted on reboots and whatnot, so
it could get pointlessly noisy.  This message was only useful
during initial development and imports.
---
 lib/PublicInbox/V2Writable.pm | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index 30ca9ce..dc96b87 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -119,10 +119,10 @@ sub num_for {
 		foreach my $m (@$mids) {
 			# read-only lookup now safe to do after above barrier
 			my $existing = $self->lookup_content($mime, $m);
-			if ($existing) {
-				warn "<$m> resent\n";
-				return; # easy, don't store duplicates
-			}
+			# easy, don't store duplicates
+			# note: do not add more diagnostic info here since
+			# it gets noisy on public-inbox-watch restarts
+			return if $existing;
 		}
 
 		# very unlikely:
-- 
EW


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* SQUASH: [PATCH 27/27] v2writable: remove "resent" message for duplicate Message-IDs
  2018-03-19  8:14 ` [PATCH 27/27] v2writable: remove "resent" message for duplicate Message-IDs Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-19  8:18   ` Eric Wong
  0 siblings, 0 replies; 29+ messages in thread
From: Eric Wong @ 2018-03-19  8:18 UTC (permalink / raw)
  To: meta

Oops:

diff --git a/t/v2writable.t b/t/v2writable.t
index 2088f3f..85b48d2 100644
--- a/t/v2writable.t
+++ b/t/v2writable.t
@@ -57,7 +57,7 @@ if ('ensure git configs are correct') {
 	my @warn;
 	local $SIG{__WARN__} = sub { push @warn, @_ };
 	is($im->add($mime), undef, 'obvious duplicate rejected');
-	like(join(' ', @warn), qr/resent/, 'warned about resent message');
+	is(scalar(@warn), 0, 'no warning about resent message');
 
 	@warn = ();
 	$mime->header_set('Message-Id', '<a-mid@b>', '<c@d>');

^ permalink raw reply related	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2018-03-19  8:18 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2018-03-19  8:14 [PATCH 00/27] v2 public-inbox-watch support Eric Wong (Contractor, The Linux Foundation)
2018-03-19  8:14 ` [PATCH 01/27] content_id: use Sender header if From is not available Eric Wong (Contractor, The Linux Foundation)
2018-03-19  8:14 ` [PATCH 02/27] v2writable: support "barrier" operation to avoid reforking Eric Wong (Contractor, The Linux Foundation)
2018-03-19  8:14 ` [PATCH 03/27] use string ref for Email::Simple->new Eric Wong (Contractor, The Linux Foundation)
2018-03-19  8:14 ` [PATCH 04/27] v2writable: remove unnecessary idx_init call Eric Wong (Contractor, The Linux Foundation)
2018-03-19  8:14 ` [PATCH 05/27] searchidx: do not delete documents while iterating Eric Wong (Contractor, The Linux Foundation)
2018-03-19  8:14 ` [PATCH 06/27] search: allow ->reopen to be chainable Eric Wong (Contractor, The Linux Foundation)
2018-03-19  8:14 ` [PATCH 07/27] v2writable: implement remove correctly Eric Wong (Contractor, The Linux Foundation)
2018-03-19  8:14 ` [PATCH 08/27] skeleton: barrier init requires a lock Eric Wong (Contractor, The Linux Foundation)
2018-03-19  8:14 ` [PATCH 09/27] import: (v2) delete writes the blob into history in subdir Eric Wong (Contractor, The Linux Foundation)
2018-03-19  8:14 ` [PATCH 10/27] import: (v2): write deletes to a separate '_' subdirectory Eric Wong (Contractor, The Linux Foundation)
2018-03-19  8:14 ` [PATCH 11/27] import: implement barrier operation for v1 repos Eric Wong (Contractor, The Linux Foundation)
2018-03-19  8:14 ` [PATCH 12/27] mid: mid_mime uses v2-compatible mids function Eric Wong (Contractor, The Linux Foundation)
2018-03-19  8:14 ` [PATCH 13/27] watchmaildir: use content_digest to generate Message-Id Eric Wong (Contractor, The Linux Foundation)
2018-03-19  8:14 ` [PATCH 14/27] import: force Message-ID generation for v1 here Eric Wong (Contractor, The Linux Foundation)
2018-03-19  8:14 ` [PATCH 15/27] import: switch to URL-safe Base64 for Message-IDs Eric Wong (Contractor, The Linux Foundation)
2018-03-19  8:14 ` [PATCH 16/27] v2writable: test for idempotent removals Eric Wong (Contractor, The Linux Foundation)
2018-03-19  8:14 ` [PATCH 17/27] import: enable locking under v2 Eric Wong (Contractor, The Linux Foundation)
2018-03-19  8:14 ` [PATCH 18/27] index: s/GIT_DIR/REPO_DIR/ Eric Wong (Contractor, The Linux Foundation)
2018-03-19  8:14 ` [PATCH 19/27] Lock: new base class for writable lockers Eric Wong (Contractor, The Linux Foundation)
2018-03-19  8:14 ` [PATCH 20/27] t/watch_maildir: note the reason for FIFO creation Eric Wong (Contractor, The Linux Foundation)
2018-03-19  8:14 ` [PATCH 21/27] v2writable: ensure ->done is idempotent Eric Wong (Contractor, The Linux Foundation)
2018-03-19  8:14 ` [PATCH 22/27] watchmaildir: support v2 repositories Eric Wong (Contractor, The Linux Foundation)
2018-03-19  8:14 ` [PATCH 23/27] searchidxpart: s/barrier/remote_barrier/ Eric Wong (Contractor, The Linux Foundation)
2018-03-19  8:14 ` [PATCH 24/27] v2writable: allow disabling parallelization Eric Wong (Contractor, The Linux Foundation)
2018-03-19  8:14 ` [PATCH 25/27] scripts/import_vger_from_mbox: filter out same headers as MDA Eric Wong (Contractor, The Linux Foundation)
2018-03-19  8:14 ` [PATCH 26/27] v2writable: add DEBUG_DIFF env support Eric Wong (Contractor, The Linux Foundation)
2018-03-19  8:14 ` [PATCH 27/27] v2writable: remove "resent" message for duplicate Message-IDs Eric Wong (Contractor, The Linux Foundation)
2018-03-19  8:18   ` SQUASH: " Eric Wong

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).