unofficial mirror of meta@public-inbox.org
 help / color / mirror / Atom feed
* [PATCH 0/5] "mairix -t" workalike for mbox.gz downloads
@ 2020-08-22  6:06 Eric Wong
  2020-08-22  6:06 ` [PATCH 1/5] searchidxshard: clear $msgref buffer properly Eric Wong
                   ` (5 more replies)
  0 siblings, 6 replies; 11+ messages in thread
From: Eric Wong @ 2020-08-22  6:06 UTC (permalink / raw)
  To: meta

Actually, the Xapian aspect of it turned out to be easy once
I learned ->set_collapse_key.

Getting the tests and compatibility with existing (pre-upgrade)
inboxes was more work.

It requires "public-inbox-index --reindex" to activate;
but PATCH 5/5 makes it safe to upgrade WWW either before
or after --reindex.  That means BOFHs can upgrade without
regard to ordering.

Tested with w3m, links, and lynx (I actually split out
my lynx fix separately):
  https://public-inbox.org/meta/20200822004125.9458-1-e@80x24.org/

TODO: CLI tool support, HTML interface, JMAP, etc...

Eric Wong (5):
  searchidxshard: clear $msgref buffer properly
  searchidx: put all shard-related stuff in SearchIdxShard.pm
  searchidx: index THREADID in Xapian
  search: support downloading mboxes results with full thread
  mbox: disable "&t" on existing Xapian until full reindex

 Documentation/standards.perl      |  4 +++
 lib/PublicInbox/Mbox.pm           | 54 +++++++++++++++++++++++++------
 lib/PublicInbox/Over.pm           | 31 +++++++++++++++++-
 lib/PublicInbox/OverIdx.pm        | 18 +++++------
 lib/PublicInbox/Search.pm         | 16 +++++++--
 lib/PublicInbox/SearchIdx.pm      | 51 +++++++++--------------------
 lib/PublicInbox/SearchIdxShard.pm | 48 ++++++++++++++++++++++-----
 lib/PublicInbox/SearchQuery.pm    |  8 +++--
 lib/PublicInbox/SearchView.pm     | 30 +++++++++++------
 lib/PublicInbox/Smsg.pm           |  3 +-
 lib/PublicInbox/V2Writable.pm     | 19 ++++++++---
 t/init.t                          |  1 +
 t/over.t                          | 13 ++++----
 t/psgi_search.t                   | 39 ++++++++++++++++++++--
 14 files changed, 244 insertions(+), 91 deletions(-)

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 1/5] searchidxshard: clear $msgref buffer properly
  2020-08-22  6:06 [PATCH 0/5] "mairix -t" workalike for mbox.gz downloads Eric Wong
@ 2020-08-22  6:06 ` Eric Wong
  2020-08-22  6:06 ` [PATCH 2/5] searchidx: put all shard-related stuff in SearchIdxShard.pm Eric Wong
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: Eric Wong @ 2020-08-22  6:06 UTC (permalink / raw)
  To: meta

Merely assigning `undef' to a scalar does not free the
underlying buffer memory of a scalar.
---
 lib/PublicInbox/SearchIdxShard.pm | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/PublicInbox/SearchIdxShard.pm b/lib/PublicInbox/SearchIdxShard.pm
index 20077e08..75521b43 100644
--- a/lib/PublicInbox/SearchIdxShard.pm
+++ b/lib/PublicInbox/SearchIdxShard.pm
@@ -97,7 +97,7 @@ sub index_raw {
 			"\n", $$msgref or die "failed to write shard $!\n";
 	} else {
 		if ($eml) {
-			$$msgref = undef;
+			undef $$msgref;
 		} else { # --xapian-only + --sequential-shard:
 			$eml = PublicInbox::Eml->new($msgref);
 		}

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 2/5] searchidx: put all shard-related stuff in SearchIdxShard.pm
  2020-08-22  6:06 [PATCH 0/5] "mairix -t" workalike for mbox.gz downloads Eric Wong
  2020-08-22  6:06 ` [PATCH 1/5] searchidxshard: clear $msgref buffer properly Eric Wong
@ 2020-08-22  6:06 ` Eric Wong
  2020-08-22  6:06 ` [PATCH 3/5] searchidx: index THREADID in Xapian Eric Wong
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: Eric Wong @ 2020-08-22  6:06 UTC (permalink / raw)
  To: meta

We'll also rename the /^remote_/ prefix to "shard_", since
remote implies the process is on a different host.  These
methods only pass messages to a child process on the same host
OR perform operations within the same process.
---
 lib/PublicInbox/SearchIdx.pm      | 34 ---------------------------
 lib/PublicInbox/SearchIdxShard.pm | 39 +++++++++++++++++++++++++++----
 lib/PublicInbox/V2Writable.pm     |  8 +++----
 3 files changed, 39 insertions(+), 42 deletions(-)

diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index be46b2b9..098fead7 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -793,40 +793,6 @@ sub DESTROY {
 	$_[0]->{lockfh} = undef;
 }
 
-# remote_* subs are only used by SearchIdxPart
-sub remote_commit {
-	my ($self) = @_;
-	if (my $w = $self->{w}) {
-		print $w "commit\n" or die "failed to write commit: $!";
-	} else {
-		$self->commit_txn_lazy;
-	}
-}
-
-sub remote_close {
-	my ($self) = @_;
-	if (my $w = delete $self->{w}) {
-		my $pid = delete $self->{pid} or die "no process to wait on\n";
-		print $w "close\n" or die "failed to write to pid:$pid: $!\n";
-		close $w or die "failed to close pipe for pid:$pid: $!\n";
-		waitpid($pid, 0) == $pid or die "remote process did not finish";
-		$? == 0 or die ref($self)." pid:$pid exited with: $?";
-	} else {
-		die "transaction in progress $self\n" if $self->{txn};
-		idx_release($self) if $self->{xdb};
-	}
-}
-
-sub remote_remove {
-	my ($self, $oid, $num) = @_;
-	if (my $w = $self->{w}) {
-		# triggers remove_by_oid in a shard
-		print $w "D $oid $num\n" or die "failed to write remove $!";
-	} else {
-		$self->remove_by_oid($oid, $num);
-	}
-}
-
 sub _begin_txn {
 	my ($self) = @_;
 	my $xdb = $self->{xdb} || idx_acquire($self);
diff --git a/lib/PublicInbox/SearchIdxShard.pm b/lib/PublicInbox/SearchIdxShard.pm
index 75521b43..c0f8be89 100644
--- a/lib/PublicInbox/SearchIdxShard.pm
+++ b/lib/PublicInbox/SearchIdxShard.pm
@@ -1,7 +1,7 @@
 # Copyright (C) 2018-2020 all contributors <meta@public-inbox.org>
 # License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
 
-# used to interface with a single Xapian shard in V2 repos.
+# Internal interface for a single Xapian shard in V2 inboxes.
 # See L<public-inbox-v2-format(5)> for more info on how we shard Xapian
 package PublicInbox::SearchIdxShard;
 use strict;
@@ -47,6 +47,7 @@ sub spawn_worker {
 	close $r or die "failed to close: $!";
 }
 
+# this reads all the writes to $self->{w} from the parent process
 sub shard_worker_loop ($$$$$) {
 	my ($self, $v2w, $r, $shard, $bnote) = @_;
 	$0 = "pi-v2-shard[$shard]";
@@ -87,7 +88,6 @@ sub shard_worker_loop ($$$$$) {
 	$self->worker_done;
 }
 
-# called by V2Writable
 sub index_raw {
 	my ($self, $msgref, $eml, $smsg) = @_;
 	if (my $w = $self->{w}) {
@@ -110,8 +110,7 @@ sub atfork_child {
 	close $_[0]->{w} or die "failed to close write pipe: $!\n";
 }
 
-# called by V2Writable:
-sub remote_barrier {
+sub shard_barrier {
 	my ($self) = @_;
 	if (my $w = $self->{w}) {
 		print $w "barrier\n" or die "failed to print: $!";
@@ -120,4 +119,36 @@ sub remote_barrier {
 	}
 }
 
+sub shard_commit {
+	my ($self) = @_;
+	if (my $w = $self->{w}) {
+		print $w "commit\n" or die "failed to write commit: $!";
+	} else {
+		$self->commit_txn_lazy;
+	}
+}
+
+sub shard_close {
+	my ($self) = @_;
+	if (my $w = delete $self->{w}) {
+		my $pid = delete $self->{pid} or die "no process to wait on\n";
+		print $w "close\n" or die "failed to write to pid:$pid: $!\n";
+		close $w or die "failed to close pipe for pid:$pid: $!\n";
+		waitpid($pid, 0) == $pid or die "remote process did not finish";
+		$? == 0 or die ref($self)." pid:$pid exited with: $?";
+	} else {
+		die "transaction in progress $self\n" if $self->{txn};
+		$self->idx_release if $self->{xdb};
+	}
+}
+
+sub shard_remove {
+	my ($self, $oid, $num) = @_;
+	if (my $w = $self->{w}) { # triggers remove_by_oid in a shard child
+		print $w "D $oid $num\n" or die "failed to write remove $!";
+	} else { # same process
+		$self->remove_by_oid($oid, $num);
+	}
+}
+
 1;
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index 9c200288..0a91a132 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -630,12 +630,12 @@ sub checkpoint ($;$) {
 			my $barrier = $self->barrier_init(scalar @$shards);
 
 			# each shard needs to issue a barrier command
-			$_->remote_barrier for @$shards;
+			$_->shard_barrier for @$shards;
 
 			# wait for each Xapian shard
 			$self->barrier_wait($barrier);
 		} else {
-			$_->remote_commit for @$shards;
+			$_->shard_commit for @$shards;
 		}
 
 		# last_commit is special, don't commit these until
@@ -675,7 +675,7 @@ sub done {
 	my $shards = delete $self->{idx_shards};
 	if ($shards) {
 		for (@$shards) {
-			eval { $_->remote_close };
+			eval { $_->shard_close };
 			$err .= "shard close: $@\n" if $@;
 		}
 	}
@@ -1107,7 +1107,7 @@ sub unindex_oid_remote ($$$) {
 	my @removed = $self->{over}->remove_oid($oid, $mid);
 	for my $num (@removed) {
 		my $idx = idx_shard($self, $num % $self->{shards});
-		$idx->remote_remove($oid, $num);
+		$idx->shard_remove($oid, $num);
 	}
 }
 

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 3/5] searchidx: index THREADID in Xapian
  2020-08-22  6:06 [PATCH 0/5] "mairix -t" workalike for mbox.gz downloads Eric Wong
  2020-08-22  6:06 ` [PATCH 1/5] searchidxshard: clear $msgref buffer properly Eric Wong
  2020-08-22  6:06 ` [PATCH 2/5] searchidx: put all shard-related stuff in SearchIdxShard.pm Eric Wong
@ 2020-08-22  6:06 ` Eric Wong
  2020-08-22  6:06 ` [PATCH 4/5] search: support downloading mboxes results with full thread Eric Wong
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: Eric Wong @ 2020-08-22  6:06 UTC (permalink / raw)
  To: meta

This is the `tid' column from over.sqlite3; and will be used for
IMAP and JMAP search (among other things).
---
 Documentation/standards.perl      |  4 ++++
 lib/PublicInbox/Over.pm           |  2 +-
 lib/PublicInbox/OverIdx.pm        | 18 +++++++++---------
 lib/PublicInbox/Search.pm         |  4 ++--
 lib/PublicInbox/SearchIdx.pm      |  1 +
 lib/PublicInbox/SearchIdxShard.pm |  7 ++++---
 lib/PublicInbox/Smsg.pm           |  3 ++-
 t/over.t                          | 13 +++++++------
 8 files changed, 30 insertions(+), 22 deletions(-)

diff --git a/Documentation/standards.perl b/Documentation/standards.perl
index a64f033e..0ac6cc52 100755
--- a/Documentation/standards.perl
+++ b/Documentation/standards.perl
@@ -48,8 +48,12 @@ my $rfcs = [
 	# 5032 = 'WITHIN search extension for IMAP',
 	4978 => 'IMAP COMPRESS Extension',
 	# 5182 = 'IMAP Extension for Referencing the Last SEARCH Result',
+	# 5256 => 'IMAP SORT and THREAD extensions',
 	# 5738 =>  'IMAP Support for UTF-8',
 	# 8474 => 'IMAP Extension for Object Identifiers',
+
+	# 8620 => JSON Meta Application Protocol (JMAP)
+	# 8621 => JSON Meta Application Protocol (JMAP) for Mail
 	# ...
 
 	# TODO: flesh this out
diff --git a/lib/PublicInbox/Over.pm b/lib/PublicInbox/Over.pm
index a055b4cd..34d0b05d 100644
--- a/lib/PublicInbox/Over.pm
+++ b/lib/PublicInbox/Over.pm
@@ -213,7 +213,7 @@ sub get_art {
 	my ($self, $num) = @_;
 	# caching $sth ourselves is faster than prepare_cached
 	my $sth = $self->{-get_art} //= $self->connect->prepare(<<'');
-SELECT num,ds,ts,ddd FROM over WHERE num = ? LIMIT 1
+SELECT num,tid,ds,ts,ddd FROM over WHERE num = ? LIMIT 1
 
 	$sth->execute($num);
 	my $smsg = $sth->fetchrow_hashref;
diff --git a/lib/PublicInbox/OverIdx.pm b/lib/PublicInbox/OverIdx.pm
index 4543bfa1..d42d6fe7 100644
--- a/lib/PublicInbox/OverIdx.pm
+++ b/lib/PublicInbox/OverIdx.pm
@@ -284,7 +284,7 @@ sub add_overview {
 	my $dd = $smsg->to_doc_data;
 	utf8::encode($dd);
 	$dd = compress($dd);
-	add_over($self, [ @$smsg{qw(ts ds num)}, $mids, $refs, $xpath, $dd ]);
+	add_over($self, $smsg, $mids, $refs, $xpath, $dd);
 }
 
 sub _add_over {
@@ -311,10 +311,10 @@ sub _add_over {
 }
 
 sub add_over {
-	my ($self, $values) = @_;
-	my ($ts, $ds, $num, $mids, $refs, $xpath, $ddd) = @$values;
+	my ($self, $smsg, $mids, $refs, $xpath, $ddd) = @_;
 	my $old_tid;
 	my $vivified = 0;
+	my $num = $smsg->{num};
 
 	begin_lazy($self);
 	delete_by_num($self, $num, \$old_tid);
@@ -326,17 +326,17 @@ sub add_over {
 		$v > 1 and warn "BUG: vivified multiple ($v) ghosts for $mid\n";
 		$vivified += $v;
 	}
-	my $tid = $vivified ? $old_tid : link_refs($self, $refs, $old_tid);
-	my $sid = sid($self, $xpath);
+	$smsg->{tid} = $vivified ? $old_tid : link_refs($self, $refs, $old_tid);
+	$smsg->{sid} = sid($self, $xpath);
 	my $dbh = $self->{dbh};
 	my $sth = $dbh->prepare_cached(<<'');
 INSERT INTO over (num, tid, sid, ts, ds, ddd)
 VALUES (?,?,?,?,?,?)
 
-	my $n = 0;
-	my @v = ($num, $tid, $sid, $ts, $ds);
-	foreach (@v) { $sth->bind_param(++$n, $_) }
-	$sth->bind_param(++$n, $ddd, SQL_BLOB);
+	my $nc = 1;
+	$sth->bind_param($nc, $num);
+	$sth->bind_param(++$nc, $smsg->{$_}) for (qw(tid sid ts ds));
+	$sth->bind_param(++$nc, $ddd, SQL_BLOB);
 	$sth->execute;
 	$sth = $dbh->prepare_cached(<<'');
 INSERT INTO id2num (id, num) VALUES (?,?)
diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index c18e19d4..4cfb7b38 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -18,9 +18,9 @@ use constant {
 	# added for public-inbox 1.6.0+
 	BYTES => 3, # IMAP RFC822.SIZE
 	UID => 4, # IMAP UID == NNTP article number == Xapian docid
+	THREADID => 5, # RFC 8474, RFC 8621
 
 	# TODO
-	# THREADID => ?
 	# REPLYCNT => ?, # IMAP ANSWERED
 
 	# SCHEMA_VERSION history
@@ -47,7 +47,7 @@ use constant {
 	#      public-inbox v1.5.0 adds (still SCHEMA_VERSION=15):
 	#      * "lid:" and "l:" for List-Id searches
 	#
-	#      v1.6.0 adds BYTES and UID values
+	#      v1.6.0 adds BYTES, UID and THREADID values
 	SCHEMA_VERSION => 15,
 };
 
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 098fead7..baa6f41a 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -356,6 +356,7 @@ sub add_xapian ($$$$) {
 	add_val($doc, PublicInbox::Search::DT(), $dt);
 	add_val($doc, PublicInbox::Search::BYTES(), $smsg->{bytes});
 	add_val($doc, PublicInbox::Search::UID(), $smsg->{num});
+	add_val($doc, PublicInbox::Search::THREADID, $smsg->{tid});
 
 	my $tg = term_generator($self);
 	$tg->set_document($doc);
diff --git a/lib/PublicInbox/SearchIdxShard.pm b/lib/PublicInbox/SearchIdxShard.pm
index c0f8be89..f23d23d0 100644
--- a/lib/PublicInbox/SearchIdxShard.pm
+++ b/lib/PublicInbox/SearchIdxShard.pm
@@ -68,8 +68,8 @@ sub shard_worker_loop ($$$$$) {
 		} else {
 			chomp $line;
 			# n.b. $mid may contain spaces(!)
-			my ($to_read, $bytes, $num, $blob, $ds, $ts, $mid) =
-							split(/ /, $line, 7);
+			my ($to_read, $bytes, $num, $blob, $ds, $ts, $tid, $mid)
+				= split(/ /, $line, 8);
 			$self->begin_txn_lazy;
 			my $n = read($r, my $msg, $to_read) or die "read: $!\n";
 			$n == $to_read or die "short read: $n != $to_read\n";
@@ -79,6 +79,7 @@ sub shard_worker_loop ($$$$$) {
 				num => $num + 0,
 				blob => $blob,
 				mid => $mid,
+				tid => $tid,
 				ds => $ds,
 				ts => $ts,
 			}, 'PublicInbox::Smsg';
@@ -93,7 +94,7 @@ sub index_raw {
 	if (my $w = $self->{w}) {
 		# mid must be last, it can contain spaces (but not LF)
 		print $w join(' ', @$smsg{qw(raw_bytes bytes
-						num blob ds ts mid)}),
+						num blob ds ts tid mid)}),
 			"\n", $$msgref or die "failed to write shard $!\n";
 	} else {
 		if ($eml) {
diff --git a/lib/PublicInbox/Smsg.pm b/lib/PublicInbox/Smsg.pm
index 51226b8e..0a0384ef 100644
--- a/lib/PublicInbox/Smsg.pm
+++ b/lib/PublicInbox/Smsg.pm
@@ -82,7 +82,8 @@ sub psgi_cull ($) {
 
 	# drop NNTP-only fields which aren't relevant to PSGI results:
 	# saves ~80K on a 200 item search result:
-	delete @$self{qw(ts to cc bytes lines)};
+	# TODO: we may need to keep some of these for JMAP...
+	delete @$self{qw(ts tid to cc bytes lines)};
 	$self;
 }
 
diff --git a/t/over.t b/t/over.t
index 734fdaa3..07672aa7 100644
--- a/t/over.t
+++ b/t/over.t
@@ -40,22 +40,23 @@ $y = $over->create_ghost('NEVAR');
 is($y, $x + 1, 'integer tid for ghost increases');
 
 my $ddd = compress('');
+my $msg = sub { { ts => 0, ds => 0, num => $_[0] } };
 foreach my $s ('', undef) {
-	$over->add_over([0, 0, 98, [ 'a' ], [], $s, $ddd]);
-	$over->add_over([0, 0, 99, [ 'b' ], [], $s, $ddd]);
+	$over->add_over($msg->(98), [ 'a' ], [], $s, $ddd);
+	$over->add_over($msg->(99), [ 'b' ], [], $s, $ddd);
 	my $msgs = [ map { $_->{num} } @{$over->get_thread('a')} ];
 	is_deeply([98], $msgs,
 		'messages not linked by empty subject');
 }
 
-$over->add_over([0, 0, 98, [ 'a' ], [], 's', $ddd]);
-$over->add_over([0, 0, 99, [ 'b' ], [], 's', $ddd]);
+$over->add_over($msg->(98), [ 'a' ], [], 's', $ddd);
+$over->add_over($msg->(99), [ 'b' ], [], 's', $ddd);
 foreach my $mid (qw(a b)) {
 	my $msgs = [ map { $_->{num} } @{$over->get_thread('a')} ];
 	is_deeply([98, 99], $msgs, 'linked messages by subject');
 }
-$over->add_over([0, 0, 98, [ 'a' ], [], 's', $ddd]);
-$over->add_over([0, 0, 99, [ 'b' ], ['a'], 'diff', $ddd]);
+$over->add_over($msg->(98), [ 'a' ], [], 's', $ddd);
+$over->add_over($msg->(99), [ 'b' ], ['a'], 'diff', $ddd);
 foreach my $mid (qw(a b)) {
 	my $msgs = [ map { $_->{num} } @{$over->get_thread($mid)} ];
 	is_deeply([98, 99], $msgs, "linked messages by Message-ID: <$mid>");

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 4/5] search: support downloading mboxes results with full thread
  2020-08-22  6:06 [PATCH 0/5] "mairix -t" workalike for mbox.gz downloads Eric Wong
                   ` (2 preceding siblings ...)
  2020-08-22  6:06 ` [PATCH 3/5] searchidx: index THREADID in Xapian Eric Wong
@ 2020-08-22  6:06 ` Eric Wong
  2020-08-22  6:06 ` [PATCH 5/5] mbox: disable "&t" on existing Xapian until full reindex Eric Wong
  2020-08-22  6:42 ` [PATCH 0/5] "mairix -t" workalike for mbox.gz downloads Eric Wong
  5 siblings, 0 replies; 11+ messages in thread
From: Eric Wong @ 2020-08-22  6:06 UTC (permalink / raw)
  To: meta

Finally, the addition of THREADID for collapsing results
in Xapian lets us emulate the "mairix --threads" feature.
That is, instead of returning only the matching messages,
the entire thread is included in the downloaded mbox.gz

This requires a "public-inbox-index --reindex" to be usable.
---
 lib/PublicInbox/Mbox.pm        | 54 +++++++++++++++++++++++++++-------
 lib/PublicInbox/Over.pm        | 29 ++++++++++++++++++
 lib/PublicInbox/Search.pm      |  4 +++
 lib/PublicInbox/SearchQuery.pm |  8 +++--
 lib/PublicInbox/SearchView.pm  | 21 ++++++++-----
 5 files changed, 95 insertions(+), 21 deletions(-)

diff --git a/lib/PublicInbox/Mbox.pm b/lib/PublicInbox/Mbox.pm
index 873ff7be..c9b11c21 100644
--- a/lib/PublicInbox/Mbox.pm
+++ b/lib/PublicInbox/Mbox.pm
@@ -167,13 +167,13 @@ sub thread_mbox {
 sub emit_range {
 	my ($ctx, $range) = @_;
 
-	my $query;
+	my $q;
 	if ($range eq 'all') { # TODO: YYYY[-MM]
-		$query = '';
+		$q = '';
 	} else {
 		return [404, [qw(Content-Type text/plain)], []];
 	}
-	mbox_all($ctx, $query);
+	mbox_all($ctx, { q => $q });
 }
 
 sub all_ids_cb {
@@ -220,21 +220,55 @@ sub results_cb {
 	}
 }
 
-sub mbox_all {
-	my ($ctx, $query) = @_;
+sub results_thread_cb {
+	my ($ctx) = @_;
 
-	return mbox_all_ids($ctx) if $query eq '';
-	my $qopts = $ctx->{qopts} = { mset => 2 }; # order by docid
+	my $over = $ctx->{-inbox}->over or return;
+	while (1) {
+		while (defined(my $num = shift(@{$ctx->{xids}}))) {
+			my $smsg = $over->get_art($num) or next;
+			return $smsg;
+		}
+
+		# refills ctx->{xids}
+		next if $over->expand_thread($ctx);
+
+		# refill result set
+		my $srch = $ctx->{-inbox}->search(undef, $ctx) or return;
+		my $mset = $srch->query($ctx->{query}, $ctx->{qopts});
+		my $size = $mset->size or return;
+		$ctx->{qopts}->{offset} += $size;
+		$ctx->{ids} = $srch->mset_to_artnums($mset);
+	}
+
+}
+
+sub mbox_all {
+	my ($ctx, $q) = @_;
+	my $q_string = $q->{'q'};
+	return mbox_all_ids($ctx) if $q_string !~ /\S/;
 	my $srch = $ctx->{-inbox}->search or
 		return PublicInbox::WWW::need($ctx, 'Search');
-	my $mset = $srch->query($query, $qopts);
+	my $over = $ctx->{-inbox}->over or
+		return PublicInbox::WWW::need($ctx, 'Overview');
+
+	my $qopts = $ctx->{qopts} = { mset => 2 }; # order by docid
+	$qopts->{thread} = 1 if $q->{t};
+	my $mset = $srch->query($q_string, $qopts);
 	$qopts->{offset} = $mset->size or
 			return [404, [qw(Content-Type text/plain)],
 				["No results found\n"]];
-	$ctx->{query} = $query;
+	$ctx->{query} = $q_string;
 	$ctx->{ids} = $srch->mset_to_artnums($mset);
 	require PublicInbox::MboxGz;
-	PublicInbox::MboxGz::mbox_gz($ctx, \&results_cb, 'results-'.$query);
+	my $fn;
+	if ($q->{t}) {
+		$fn = 'results-thread-'.$q_string;
+		PublicInbox::MboxGz::mbox_gz($ctx, \&results_thread_cb, $fn);
+	} else {
+		$fn = 'results-'.$q_string;
+		PublicInbox::MboxGz::mbox_gz($ctx, \&results_cb, $fn);
+	}
 }
 
 1;
diff --git a/lib/PublicInbox/Over.pm b/lib/PublicInbox/Over.pm
index 34d0b05d..fba58d17 100644
--- a/lib/PublicInbox/Over.pm
+++ b/lib/PublicInbox/Over.pm
@@ -179,6 +179,35 @@ ORDER BY $sort_col DESC
 	($nr, $msgs);
 }
 
+# strict `tid' matches, only, for thread-expanded mbox.gz search results
+# and future CLI interface
+# returns true if we have IDs, undef if not
+sub expand_thread {
+	my ($self, $ctx) = @_;
+	my $dbh = $self->connect;
+	do {
+		defined(my $num = $ctx->{ids}->[0]) or return;
+		my ($tid) = $dbh->selectrow_array(<<'', undef, $num);
+SELECT tid FROM over WHERE num = ?
+
+		if (defined($tid)) {
+			my $sql = <<'';
+SELECT num FROM over WHERE tid = ? AND num > ?
+ORDER BY num ASC LIMIT 1000
+
+			my $xids = $dbh->selectcol_arrayref($sql, undef, $tid,
+							$ctx->{prev} // 0);
+			if (scalar(@$xids)) {
+				$ctx->{prev} = $xids->[-1];
+				$ctx->{xids} = $xids;
+				return 1; # success
+			}
+		}
+		$ctx->{prev} = 0;
+		shift @{$ctx->{ids}};
+	} while (1);
+}
+
 sub recent {
 	my ($self, $opts, $after, $before) = @_;
 	my ($s, @v);
diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index 4cfb7b38..bc820b64 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -326,6 +326,10 @@ sub _enquire_once { # retry_reopen callback
 	} else {
 		$enquire->set_sort_by_value_then_relevance(TS, $desc);
 	}
+
+	# `mairix -t / --threads' or JMAP collapseThreads
+	$enquire->set_collapse_key(THREADID) if $opts->{thread};
+
 	my $offset = $opts->{offset} || 0;
 	my $limit = $opts->{limit} || 50;
 	my $mset = $enquire->get_mset($offset, $limit);
diff --git a/lib/PublicInbox/SearchQuery.pm b/lib/PublicInbox/SearchQuery.pm
index ce1eae12..6724ae39 100644
--- a/lib/PublicInbox/SearchQuery.pm
+++ b/lib/PublicInbox/SearchQuery.pm
@@ -12,7 +12,8 @@ our $LIM = 200;
 sub new {
 	my ($class, $qp) = @_;
 
-	my $r = $qp->{r};
+	my $r = $qp->{r}; # relevance
+	my $t = $qp->{t}; # collapse threads
 	my ($l) = (($qp->{l} || '') =~ /([0-9]+)/);
 	$l = $LIM if !$l || $l > $LIM;
 	bless {
@@ -21,6 +22,7 @@ sub new {
 		o => (($qp->{o} || '0') =~ /(-?[0-9]+)/),
 		l => $l,
 		r => (defined $r && $r ne '0'),
+		t => (defined $t && $t ne '0'),
 	}, $class;
 }
 
@@ -41,8 +43,8 @@ sub qs_html {
 	if (my $l = $self->{l}) {
 		$qs .= "&amp;l=$l" unless $l == $LIM;
 	}
-	if (my $r = $self->{r}) {
-		$qs .= "&amp;r";
+	for my $bool (qw(r t)) {
+		$qs .= "&amp;$bool" if $self->{$bool};
 	}
 	if (my $x = $self->{x}) {
 		$qs .= "&amp;x=$x" if ($x eq 't' || $x eq 'A' || $x eq 'm');
diff --git a/lib/PublicInbox/SearchView.pm b/lib/PublicInbox/SearchView.pm
index 75e2d39d..dd69564a 100644
--- a/lib/PublicInbox/SearchView.pm
+++ b/lib/PublicInbox/SearchView.pm
@@ -19,10 +19,12 @@ my %rmap_inc;
 sub mbox_results {
 	my ($ctx) = @_;
 	my $q = PublicInbox::SearchQuery->new($ctx->{qp});
-	my $x = $q->{x};
+	if ($ctx->{env}->{'psgi.input'}->read(my $buf, 3)) {
+		$q->{t} = 1 if $buf =~ /\Ax=[^0]/;
+	}
 	require PublicInbox::Mbox;
-	return PublicInbox::Mbox::mbox_all($ctx, $q->{'q'}) if $x eq 'm';
-	sres_top_html($ctx);
+	$q->{x} eq 'm' ? PublicInbox::Mbox::mbox_all($ctx, $q) :
+			sres_top_html($ctx);
 }
 
 sub sres_top_html {
@@ -46,6 +48,7 @@ sub sres_top_html {
 		offset => $o,
 		mset => 1,
 		relevance => $q->{r},
+		thread => $q->{t},
 		asc => $asc,
 	};
 	my ($mset, $total, $err, $html);
@@ -151,7 +154,7 @@ sub err_txt {
 
 sub search_nav_top {
 	my ($mset, $q, $ctx) = @_;
-	my $m = $q->qs_html(x => 'm', r => undef);
+	my $m = $q->qs_html(x => 'm', r => undef, t => undef);
 	my $rv = qq{<form\naction="?$m"\nmethod="post"><pre>};
 	my $initial_q = $ctx->{-uxs_retried};
 	if (defined $initial_q) {
@@ -186,10 +189,12 @@ sub search_nav_top {
 	}
 	my $A = $q->qs_html(x => 'A', r => undef);
 	$rv .= qq{|<a\nhref="?$A">Atom feed</a>]} .
-		qq{\n\t\t\t\t\t\tdownload: } .
-		# lynx seems to require a name=, here, so just use 'z'
-		qq{<input\ntype=submit\nname=z\nvalue="mbox.gz"/>} .
-		q{</pre></form><pre>};
+		qq{\n\t\t\tdownload mbox.gz: } .
+		# we set name=z w/o using it since it seems required for
+		# lynx (but works fine for w3m).
+		qq{<input\ntype=submit\nname=z\nvalue="results only"/>|} .
+		qq{<input\ntype=submit\nname=x\nvalue="full threads"/>} .
+		qq{</pre></form><pre>};
 }
 
 sub search_nav_bot {

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 5/5] mbox: disable "&t" on existing Xapian until full reindex
  2020-08-22  6:06 [PATCH 0/5] "mairix -t" workalike for mbox.gz downloads Eric Wong
                   ` (3 preceding siblings ...)
  2020-08-22  6:06 ` [PATCH 4/5] search: support downloading mboxes results with full thread Eric Wong
@ 2020-08-22  6:06 ` Eric Wong
  2020-08-22  6:39   ` Eric Wong
  2020-08-22  6:42 ` [PATCH 0/5] "mairix -t" workalike for mbox.gz downloads Eric Wong
  5 siblings, 1 reply; 11+ messages in thread
From: Eric Wong @ 2020-08-22  6:06 UTC (permalink / raw)
  To: meta

Expanding threads via over.sqlite3 for mbox.gz downloads without
Xapian effectively collapsing on the THREADID column leads to
repeated messages getting downloaded.

To avoid that situation, use a "has_threadid" Xapian metadata
flag that's only set on --reindex (and brand new Xapian DBs).

This allows admins to upgrade WWW or do --reindex in any order;
without worrying about users eating up bandwidth and CPU cycles.
---
 lib/PublicInbox/Mbox.pm       |  2 +-
 lib/PublicInbox/Search.pm     | 10 ++++++++-
 lib/PublicInbox/SearchIdx.pm  | 16 ++++++++++++--
 lib/PublicInbox/SearchView.pm | 21 ++++++++++++-------
 lib/PublicInbox/V2Writable.pm | 11 ++++++++++
 t/init.t                      |  1 +
 t/psgi_search.t               | 39 +++++++++++++++++++++++++++++++++--
 7 files changed, 87 insertions(+), 13 deletions(-)

diff --git a/lib/PublicInbox/Mbox.pm b/lib/PublicInbox/Mbox.pm
index c9b11c21..0223bead 100644
--- a/lib/PublicInbox/Mbox.pm
+++ b/lib/PublicInbox/Mbox.pm
@@ -262,7 +262,7 @@ sub mbox_all {
 	$ctx->{ids} = $srch->mset_to_artnums($mset);
 	require PublicInbox::MboxGz;
 	my $fn;
-	if ($q->{t}) {
+	if ($q->{t} && $srch->has_threadid) {
 		$fn = 'results-thread-'.$q_string;
 		PublicInbox::MboxGz::mbox_gz($ctx, \&results_thread_cb, $fn);
 	} else {
diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index bc820b64..01bbe73d 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -311,6 +311,12 @@ sub _do_enquire {
 	retry_reopen($self, \&_enquire_once, [ $self, $query, $opts ]);
 }
 
+# returns true if all docs have the THREADID value
+sub has_threadid ($) {
+	my ($self) = @_;
+	(xdb($self)->get_metadata('has_threadid') // '') eq '1';
+}
+
 sub _enquire_once { # retry_reopen callback
 	my ($self, $query, $opts) = @{$_[0]};
 	my $xdb = xdb($self);
@@ -328,7 +334,9 @@ sub _enquire_once { # retry_reopen callback
 	}
 
 	# `mairix -t / --threads' or JMAP collapseThreads
-	$enquire->set_collapse_key(THREADID) if $opts->{thread};
+	if ($opts->{thread} && has_threadid($self)) {
+		$enquire->set_collapse_key(THREADID);
+	}
 
 	my $offset = $opts->{offset} || 0;
 	my $limit = $opts->{limit} || 50;
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index baa6f41a..ade55756 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -131,6 +131,7 @@ sub idx_acquire {
 				($is_shard && need_xapian($self)))) {
 			File::Path::mkpath($dir);
 			nodatacow_dir($dir);
+			$self->{-set_has_threadid_once} = 1;
 		}
 	}
 	return unless defined $flag;
@@ -590,9 +591,17 @@ sub v1_checkpoint ($$;$) {
 
 	$self->{mm}->{dbh}->commit;
 	if ($newest && need_xapian($self)) {
-		my $cur = $self->{xdb}->get_metadata('last_commit');
+		my $xdb = $self->{xdb};
+		my $cur = $xdb->get_metadata('last_commit');
 		if (need_update($self, $cur, $newest)) {
-			$self->{xdb}->set_metadata('last_commit', $newest);
+			$xdb->set_metadata('last_commit', $newest);
+		}
+
+		# let SearchView know a full --reindex was done so it can
+		# generate ->has_threadid-dependent links
+		if ($sync->{reindex} && !ref($sync->{reindex})) {
+			my $n = $xdb->get_metadata('has_threadid');
+			$xdb->set_metadata('has_threadid', '1') if $n ne '1';
 		}
 	}
 
@@ -816,6 +825,9 @@ sub set_metadata_once {
 	return if $self->{shard}; # only continue if undef or 0, not >0
 	my $xdb = $self->{xdb};
 
+	if (delete($self->{-set_has_threadid_once})) {
+		$xdb->set_metadata('has_threadid', '1');
+	}
 	if (delete($self->{-set_indexlevel_once})) {
 		my $level = $xdb->get_metadata('indexlevel');
 		if (!$level || $level ne 'medium') {
diff --git a/lib/PublicInbox/SearchView.pm b/lib/PublicInbox/SearchView.pm
index dd69564a..76428dfb 100644
--- a/lib/PublicInbox/SearchView.pm
+++ b/lib/PublicInbox/SearchView.pm
@@ -188,13 +188,20 @@ sub search_nav_top {
 		$rv .= qq{<a\nhref="?$s">summary</a>|<b>nested</b>};
 	}
 	my $A = $q->qs_html(x => 'A', r => undef);
-	$rv .= qq{|<a\nhref="?$A">Atom feed</a>]} .
-		qq{\n\t\t\tdownload mbox.gz: } .
-		# we set name=z w/o using it since it seems required for
-		# lynx (but works fine for w3m).
-		qq{<input\ntype=submit\nname=z\nvalue="results only"/>|} .
-		qq{<input\ntype=submit\nname=x\nvalue="full threads"/>} .
-		qq{</pre></form><pre>};
+	$rv .= qq{|<a\nhref="?$A">Atom feed</a>]};
+	if ($ctx->{-inbox}->search->has_threadid) {
+		$rv .= qq{\n\t\t\tdownload mbox.gz: } .
+			# we set name=z w/o using it since it seems required for
+			# lynx (but works fine for w3m).
+			qq{<input\ntype=submit\nname=z\n} .
+				q{value="results only"/>} .
+			qq{|<input\ntype=submit\nname=x\n} .
+				q{value="full threads"/>};
+	} else { # BOFH needs to --reindex
+		$rv .= qq{\n\t\t\t\t\t\tdownload: } .
+			qq{<input\ntype=submit\nname=z\nvalue="mbox.gz"/>}
+	}
+	$rv .= qq{</pre></form><pre>};
 }
 
 sub search_nav_bot {
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index 0a91a132..a32fdcf3 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -1337,6 +1337,17 @@ sub index_sync {
 		xapian_only($self, $opt, $sync, $art_beg);
 	}
 
+	# --reindex on the command-line
+	if ($opt->{reindex} && !ref($opt->{reindex}) && $idxlevel ne 'basic') {
+		local $self->{parallel} = 0;
+		my $s0 = PublicInbox::SearchIdx->new($self->{ibx}, 0, 0);
+		if (my $xdb = $s0->idx_acquire) {
+			my $n = $xdb->get_metadata('has_threadid');
+			$xdb->set_metadata('has_threadid', 1) if $n ne '1';
+		}
+		$s0->idx_release;
+	}
+
 	# reindex does not pick up new changes, so we rerun w/o it:
 	if ($opt->{reindex}) {
 		my %again = %$opt;
diff --git a/t/init.t b/t/init.t
index dad09435..a5a9debc 100644
--- a/t/init.t
+++ b/t/init.t
@@ -108,6 +108,7 @@ SKIP: {
 		is(PublicInbox::Admin::detect_indexlevel($ibx), 'full',
 			"detected default indexlevel -V$v");
 		ok($ibx->{-skip_docdata}, "docdata skip set -V$v");
+		ok($ibx->search->has_threadid, 'has_threadid flag set on new inbox');
 	}
 
 	# loop for idempotency
diff --git a/t/psgi_search.t b/t/psgi_search.t
index 5d537363..c1677eb3 100644
--- a/t/psgi_search.t
+++ b/t/psgi_search.t
@@ -3,6 +3,7 @@
 use strict;
 use warnings;
 use Test::More;
+use IO::Uncompress::Gunzip qw(gunzip);
 use PublicInbox::Eml;
 use PublicInbox::Config;
 use PublicInbox::Inbox;
@@ -39,6 +40,12 @@ To: git\@vger.kernel.org
 EOF
 $im->add($mime);
 
+$im->add(PublicInbox::Eml->new(<<""));
+Message-ID: <reply\@asdf>
+From: replier <r\@example.com>
+In-Reply-To: <$mid>
+Subject: mismatch
+
 $mime = PublicInbox::Eml->new(<<'EOF');
 Subject:
 Message-ID: <blank-subject@example.com>
@@ -79,6 +86,9 @@ test_psgi(sub { $www->call(@_) }, sub {
 	ok(index($html, 'by &#198;var Arnfj&#246;r&#240; Bjarmason') >= 0,
 		"displayed Ævar's name properly in HTML");
 
+	like($html, qr/download mbox\.gz: .*?"full threads"/s,
+		'"full threads" download option shown');
+
 	my $warn = [];
 	local $SIG{__WARN__} = sub { push @$warn, @_ };
 	$res = $cb->(GET('/test/?q=s:test&l=5e'));
@@ -118,8 +128,33 @@ test_psgi(sub { $www->call(@_) }, sub {
 	$res = $cb->(GET('/test/no-subject-at-all@example.com/t.mbox.gz'));
 	like($res->header('Content-Disposition'),
 		qr/filename=no-subject\.mbox\.gz/);
+
+	# "full threads" mbox.gz download
+	$res = $cb->(POST('/test/?q=s:test&x=m&t'));
+	is($res->code, 200, 'successful mbox download with threads');
+	gunzip(\($res->content) => \(my $before));
+	is_deeply([ "Message-ID: <$mid>\n", "Message-ID: <reply\@asdf>\n" ],
+		[ grep(/^Message-ID:/m, split(/^/m, $before)) ],
+		'got full thread');
+
+	# clobber has_threadid to emulate old versions:
+	{
+		my $sidx = PublicInbox::SearchIdx->new($ibx, 0);
+		my $xdb = $sidx->idx_acquire;
+		$xdb->set_metadata('has_threadid', '0');
+		$sidx->idx_release;
+	}
+	$config->each_inbox(sub { delete $_[0]->{search} });
+	$res = $cb->(GET('/test/?q=s:test'));
+	is($res->code, 200, 'successful search w/o has_threadid');
+	unlike($html, qr/download mbox\.gz: .*?"full threads"/s,
+		'"full threads" download option not shown w/o has_threadid');
+
+	# in case somebody uses curl to bypass <form>
+	$res = $cb->(POST('/test/?q=s:test&x=m&t'));
+	is($res->code, 200, 'successful mbox download w/ threads');
+	gunzip(\($res->content) => \(my $after));
+	isnt($before, $after);
 });
 
 done_testing();
-
-1;

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH 5/5] mbox: disable "&t" on existing Xapian until full reindex
  2020-08-22  6:06 ` [PATCH 5/5] mbox: disable "&t" on existing Xapian until full reindex Eric Wong
@ 2020-08-22  6:39   ` Eric Wong
  0 siblings, 0 replies; 11+ messages in thread
From: Eric Wong @ 2020-08-22  6:39 UTC (permalink / raw)
  To: meta

Eric Wong <e@yhbt.net> wrote:
> Expanding threads via over.sqlite3 for mbox.gz downloads without
> Xapian effectively collapsing on the THREADID column leads to
> repeated messages getting downloaded.
> 
> To avoid that situation, use a "has_threadid" Xapian metadata
> flag that's only set on --reindex (and brand new Xapian DBs).
> 
> This allows admins to upgrade WWW or do --reindex in any order;
> without worrying about users eating up bandwidth and CPU cycles.
> ---
>  lib/PublicInbox/Mbox.pm       |  2 +-
>  lib/PublicInbox/Search.pm     | 10 ++++++++-
>  lib/PublicInbox/SearchIdx.pm  | 16 ++++++++++++--
>  lib/PublicInbox/SearchView.pm | 21 ++++++++++++-------
>  lib/PublicInbox/V2Writable.pm | 11 ++++++++++
>  t/init.t                      |  1 +
>  t/psgi_search.t               | 39 +++++++++++++++++++++++++++++++++--
>  7 files changed, 87 insertions(+), 13 deletions(-)
> 
> diff --git a/lib/PublicInbox/Mbox.pm b/lib/PublicInbox/Mbox.pm
> index c9b11c21..0223bead 100644
> --- a/lib/PublicInbox/Mbox.pm
> +++ b/lib/PublicInbox/Mbox.pm
> @@ -262,7 +262,7 @@ sub mbox_all {
>  	$ctx->{ids} = $srch->mset_to_artnums($mset);
>  	require PublicInbox::MboxGz;
>  	my $fn;
> -	if ($q->{t}) {
> +	if ($q->{t} && $srch->has_threadid) {
>  		$fn = 'results-thread-'.$q_string;
>  		PublicInbox::MboxGz::mbox_gz($ctx, \&results_thread_cb, $fn);
>  	} else {
> diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
> index bc820b64..01bbe73d 100644
> --- a/lib/PublicInbox/Search.pm
> +++ b/lib/PublicInbox/Search.pm
> @@ -311,6 +311,12 @@ sub _do_enquire {
>  	retry_reopen($self, \&_enquire_once, [ $self, $query, $opts ]);
>  }
>  
> +# returns true if all docs have the THREADID value
> +sub has_threadid ($) {
> +	my ($self) = @_;
> +	(xdb($self)->get_metadata('has_threadid') // '') eq '1';
> +}
> +
>  sub _enquire_once { # retry_reopen callback
>  	my ($self, $query, $opts) = @{$_[0]};
>  	my $xdb = xdb($self);
> @@ -328,7 +334,9 @@ sub _enquire_once { # retry_reopen callback
>  	}
>  
>  	# `mairix -t / --threads' or JMAP collapseThreads
> -	$enquire->set_collapse_key(THREADID) if $opts->{thread};
> +	if ($opts->{thread} && has_threadid($self)) {
> +		$enquire->set_collapse_key(THREADID);
> +	}
>  
>  	my $offset = $opts->{offset} || 0;
>  	my $limit = $opts->{limit} || 50;
> diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
> index baa6f41a..ade55756 100644
> --- a/lib/PublicInbox/SearchIdx.pm
> +++ b/lib/PublicInbox/SearchIdx.pm
> @@ -131,6 +131,7 @@ sub idx_acquire {
>  				($is_shard && need_xapian($self)))) {
>  			File::Path::mkpath($dir);
>  			nodatacow_dir($dir);
> +			$self->{-set_has_threadid_once} = 1;
>  		}
>  	}
>  	return unless defined $flag;
> @@ -590,9 +591,17 @@ sub v1_checkpoint ($$;$) {
>  
>  	$self->{mm}->{dbh}->commit;
>  	if ($newest && need_xapian($self)) {
> -		my $cur = $self->{xdb}->get_metadata('last_commit');
> +		my $xdb = $self->{xdb};
> +		my $cur = $xdb->get_metadata('last_commit');
>  		if (need_update($self, $cur, $newest)) {
> -			$self->{xdb}->set_metadata('last_commit', $newest);
> +			$xdb->set_metadata('last_commit', $newest);
> +		}
> +
> +		# let SearchView know a full --reindex was done so it can
> +		# generate ->has_threadid-dependent links
> +		if ($sync->{reindex} && !ref($sync->{reindex})) {
> +			my $n = $xdb->get_metadata('has_threadid');
> +			$xdb->set_metadata('has_threadid', '1') if $n ne '1';
>  		}
>  	}
>  
> @@ -816,6 +825,9 @@ sub set_metadata_once {
>  	return if $self->{shard}; # only continue if undef or 0, not >0
>  	my $xdb = $self->{xdb};
>  
> +	if (delete($self->{-set_has_threadid_once})) {
> +		$xdb->set_metadata('has_threadid', '1');
> +	}
>  	if (delete($self->{-set_indexlevel_once})) {
>  		my $level = $xdb->get_metadata('indexlevel');
>  		if (!$level || $level ne 'medium') {
> diff --git a/lib/PublicInbox/SearchView.pm b/lib/PublicInbox/SearchView.pm
> index dd69564a..76428dfb 100644
> --- a/lib/PublicInbox/SearchView.pm
> +++ b/lib/PublicInbox/SearchView.pm
> @@ -188,13 +188,20 @@ sub search_nav_top {
>  		$rv .= qq{<a\nhref="?$s">summary</a>|<b>nested</b>};
>  	}
>  	my $A = $q->qs_html(x => 'A', r => undef);
> -	$rv .= qq{|<a\nhref="?$A">Atom feed</a>]} .
> -		qq{\n\t\t\tdownload mbox.gz: } .
> -		# we set name=z w/o using it since it seems required for
> -		# lynx (but works fine for w3m).
> -		qq{<input\ntype=submit\nname=z\nvalue="results only"/>|} .
> -		qq{<input\ntype=submit\nname=x\nvalue="full threads"/>} .
> -		qq{</pre></form><pre>};
> +	$rv .= qq{|<a\nhref="?$A">Atom feed</a>]};
> +	if ($ctx->{-inbox}->search->has_threadid) {
> +		$rv .= qq{\n\t\t\tdownload mbox.gz: } .
> +			# we set name=z w/o using it since it seems required for
> +			# lynx (but works fine for w3m).
> +			qq{<input\ntype=submit\nname=z\n} .
> +				q{value="results only"/>} .
> +			qq{|<input\ntype=submit\nname=x\n} .
> +				q{value="full threads"/>};
> +	} else { # BOFH needs to --reindex
> +		$rv .= qq{\n\t\t\t\t\t\tdownload: } .
> +			qq{<input\ntype=submit\nname=z\nvalue="mbox.gz"/>}
> +	}
> +	$rv .= qq{</pre></form><pre>};
>  }
>  
>  sub search_nav_bot {
> diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
> index 0a91a132..a32fdcf3 100644
> --- a/lib/PublicInbox/V2Writable.pm
> +++ b/lib/PublicInbox/V2Writable.pm
> @@ -1337,6 +1337,17 @@ sub index_sync {
>  		xapian_only($self, $opt, $sync, $art_beg);
>  	}
>  
> +	# --reindex on the command-line
> +	if ($opt->{reindex} && !ref($opt->{reindex}) && $idxlevel ne 'basic') {
> +		local $self->{parallel} = 0;
> +		my $s0 = PublicInbox::SearchIdx->new($self->{ibx}, 0, 0);
> +		if (my $xdb = $s0->idx_acquire) {
> +			my $n = $xdb->get_metadata('has_threadid');
> +			$xdb->set_metadata('has_threadid', 1) if $n ne '1';
> +		}
> +		$s0->idx_release;
> +	}

Assigning {parallel} there is useless; I originally used
SearchIdxShard instead of SearchIdx but avoided the *Shard
bit to avoid an extra idx_acquire.

Anyways, we actually need to flock the inbox.lock there to
avoid contending with -index/-watch/-mda etc.

So I'll squash this in:

diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index a32fdcf3..b0148dba 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -1339,13 +1339,14 @@ sub index_sync {
 
 	# --reindex on the command-line
 	if ($opt->{reindex} && !ref($opt->{reindex}) && $idxlevel ne 'basic') {
-		local $self->{parallel} = 0;
+		$self->lock_acquire;
 		my $s0 = PublicInbox::SearchIdx->new($self->{ibx}, 0, 0);
 		if (my $xdb = $s0->idx_acquire) {
 			my $n = $xdb->get_metadata('has_threadid');
 			$xdb->set_metadata('has_threadid', 1) if $n ne '1';
 		}
 		$s0->idx_release;
+		$self->lock_release;
 	}
 
 	# reindex does not pick up new changes, so we rerun w/o it:

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH 0/5] "mairix -t" workalike for mbox.gz downloads
  2020-08-22  6:06 [PATCH 0/5] "mairix -t" workalike for mbox.gz downloads Eric Wong
                   ` (4 preceding siblings ...)
  2020-08-22  6:06 ` [PATCH 5/5] mbox: disable "&t" on existing Xapian until full reindex Eric Wong
@ 2020-08-22  6:42 ` Eric Wong
  2020-08-22 20:12   ` Kyle Meyer
  5 siblings, 1 reply; 11+ messages in thread
From: Eric Wong @ 2020-08-22  6:42 UTC (permalink / raw)
  To: meta

Eric Wong <e@yhbt.net> wrote:
> It requires "public-inbox-index --reindex" to activate;
> but PATCH 5/5 makes it safe to upgrade WWW either before
> or after --reindex.  That means BOFHs can upgrade without
> regard to ordering.

public-inbox-watch users will need to restart -watch before
--reindex, though.  Don't think that's avoidable...

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 0/5] "mairix -t" workalike for mbox.gz downloads
  2020-08-22  6:42 ` [PATCH 0/5] "mairix -t" workalike for mbox.gz downloads Eric Wong
@ 2020-08-22 20:12   ` Kyle Meyer
  2020-08-22 20:30     ` Eric Wong
  0 siblings, 1 reply; 11+ messages in thread
From: Kyle Meyer @ 2020-08-22 20:12 UTC (permalink / raw)
  To: Eric Wong; +Cc: meta

Eric Wong writes:

> Eric Wong <e@yhbt.net> wrote:
>> It requires "public-inbox-index --reindex" to activate;
>> but PATCH 5/5 makes it safe to upgrade WWW either before
>> or after --reindex.  That means BOFHs can upgrade without
>> regard to ordering.
>
> public-inbox-watch users will need to restart -watch before
> --reindex, though.  Don't think that's avoidable...

Oops, the few times I've used --reindex I don't think I've given order
much thought.  Is it accurate to say that "restart services then
--reindex" is the recommended order in general?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 0/5] "mairix -t" workalike for mbox.gz downloads
  2020-08-22 20:12   ` Kyle Meyer
@ 2020-08-22 20:30     ` Eric Wong
  2020-08-22 21:04       ` Kyle Meyer
  0 siblings, 1 reply; 11+ messages in thread
From: Eric Wong @ 2020-08-22 20:30 UTC (permalink / raw)
  To: Kyle Meyer; +Cc: meta

Kyle Meyer <kyle@kyleam.com> wrote:
> Eric Wong writes:
> > Eric Wong <e@yhbt.net> wrote:
> >> It requires "public-inbox-index --reindex" to activate;
> >> but PATCH 5/5 makes it safe to upgrade WWW either before
> >> or after --reindex.  That means BOFHs can upgrade without
> >> regard to ordering.
> >
> > public-inbox-watch users will need to restart -watch before
> > --reindex, though.  Don't think that's avoidable...
> 
> Oops, the few times I've used --reindex I don't think I've given order
> much thought.  Is it accurate to say that "restart services then
> --reindex" is the recommended order in general?

Not really.  In the distant past (pre-SCHEMA_VERSION=15), it was
actually index (no need for --reindex) then restart daemons,
because the daemons were running on SCHEMA_VERSION=14 (or
whatever was before).

This order is only needed since we're trying to stick to
SCHEMA_VERSION=15 to avoid the space penalty of parallel
versions.

If we get to SCHEMA_VERSION=16, it'll again be index first (no
need for --reindex), then restart -watch, then restart read-only
daemons.  Right now I don't see a need to do a
SCHEMA_VERSION=16.

However, there will very likely be an optional multi-inbox index
in 1.7 which will run parallel to existing (xap15) indices.
This multi-inbox index may be able to completely replace
existing indices (and save space while doing so).  But it'll
be completely optional.

In any case, the release notes will be updated with any
necessary upgrade instructions.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 0/5] "mairix -t" workalike for mbox.gz downloads
  2020-08-22 20:30     ` Eric Wong
@ 2020-08-22 21:04       ` Kyle Meyer
  0 siblings, 0 replies; 11+ messages in thread
From: Kyle Meyer @ 2020-08-22 21:04 UTC (permalink / raw)
  To: Eric Wong; +Cc: meta

Eric Wong writes:

> Kyle Meyer <kyle@kyleam.com> wrote:
>> 
>> Oops, the few times I've used --reindex I don't think I've given order
>> much thought.  Is it accurate to say that "restart services then
>> --reindex" is the recommended order in general?
>
> Not really.  In the distant past (pre-SCHEMA_VERSION=15), it was
> actually index (no need for --reindex) then restart daemons,
> because the daemons were running on SCHEMA_VERSION=14 (or
> whatever was before).
>
> This order is only needed since we're trying to stick to
> SCHEMA_VERSION=15 to avoid the space penalty of parallel
> versions.

Got it.  Thanks.

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2020-08-22 21:04 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2020-08-22  6:06 [PATCH 0/5] "mairix -t" workalike for mbox.gz downloads Eric Wong
2020-08-22  6:06 ` [PATCH 1/5] searchidxshard: clear $msgref buffer properly Eric Wong
2020-08-22  6:06 ` [PATCH 2/5] searchidx: put all shard-related stuff in SearchIdxShard.pm Eric Wong
2020-08-22  6:06 ` [PATCH 3/5] searchidx: index THREADID in Xapian Eric Wong
2020-08-22  6:06 ` [PATCH 4/5] search: support downloading mboxes results with full thread Eric Wong
2020-08-22  6:06 ` [PATCH 5/5] mbox: disable "&t" on existing Xapian until full reindex Eric Wong
2020-08-22  6:39   ` Eric Wong
2020-08-22  6:42 ` [PATCH 0/5] "mairix -t" workalike for mbox.gz downloads Eric Wong
2020-08-22 20:12   ` Kyle Meyer
2020-08-22 20:30     ` Eric Wong
2020-08-22 21:04       ` Kyle Meyer

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).