Actually, the Xapian aspect of it turned out to be easy once I learned ->set_collapse_key. Getting the tests and compatibility with existing (pre-upgrade) inboxes was more work. It requires "public-inbox-index --reindex" to activate; but PATCH 5/5 makes it safe to upgrade WWW either before or after --reindex. That means BOFHs can upgrade without regard to ordering. Tested with w3m, links, and lynx (I actually split out my lynx fix separately): https://public-inbox.org/meta/20200822004125.9458-1-e@80x24.org/ TODO: CLI tool support, HTML interface, JMAP, etc... Eric Wong (5): searchidxshard: clear $msgref buffer properly searchidx: put all shard-related stuff in SearchIdxShard.pm searchidx: index THREADID in Xapian search: support downloading mboxes results with full thread mbox: disable "&t" on existing Xapian until full reindex Documentation/standards.perl | 4 +++ lib/PublicInbox/Mbox.pm | 54 +++++++++++++++++++++++++------ lib/PublicInbox/Over.pm | 31 +++++++++++++++++- lib/PublicInbox/OverIdx.pm | 18 +++++------ lib/PublicInbox/Search.pm | 16 +++++++-- lib/PublicInbox/SearchIdx.pm | 51 +++++++++-------------------- lib/PublicInbox/SearchIdxShard.pm | 48 ++++++++++++++++++++++----- lib/PublicInbox/SearchQuery.pm | 8 +++-- lib/PublicInbox/SearchView.pm | 30 +++++++++++------ lib/PublicInbox/Smsg.pm | 3 +- lib/PublicInbox/V2Writable.pm | 19 ++++++++--- t/init.t | 1 + t/over.t | 13 ++++---- t/psgi_search.t | 39 ++++++++++++++++++++-- 14 files changed, 244 insertions(+), 91 deletions(-)
Merely assigning `undef' to a scalar does not free the underlying buffer memory of a scalar. --- lib/PublicInbox/SearchIdxShard.pm | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/PublicInbox/SearchIdxShard.pm b/lib/PublicInbox/SearchIdxShard.pm index 20077e08..75521b43 100644 --- a/lib/PublicInbox/SearchIdxShard.pm +++ b/lib/PublicInbox/SearchIdxShard.pm @@ -97,7 +97,7 @@ sub index_raw { "\n", $$msgref or die "failed to write shard $!\n"; } else { if ($eml) { - $$msgref = undef; + undef $$msgref; } else { # --xapian-only + --sequential-shard: $eml = PublicInbox::Eml->new($msgref); }
We'll also rename the /^remote_/ prefix to "shard_", since remote implies the process is on a different host. These methods only pass messages to a child process on the same host OR perform operations within the same process. --- lib/PublicInbox/SearchIdx.pm | 34 --------------------------- lib/PublicInbox/SearchIdxShard.pm | 39 +++++++++++++++++++++++++++---- lib/PublicInbox/V2Writable.pm | 8 +++---- 3 files changed, 39 insertions(+), 42 deletions(-) diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm index be46b2b9..098fead7 100644 --- a/lib/PublicInbox/SearchIdx.pm +++ b/lib/PublicInbox/SearchIdx.pm @@ -793,40 +793,6 @@ sub DESTROY { $_[0]->{lockfh} = undef; } -# remote_* subs are only used by SearchIdxPart -sub remote_commit { - my ($self) = @_; - if (my $w = $self->{w}) { - print $w "commit\n" or die "failed to write commit: $!"; - } else { - $self->commit_txn_lazy; - } -} - -sub remote_close { - my ($self) = @_; - if (my $w = delete $self->{w}) { - my $pid = delete $self->{pid} or die "no process to wait on\n"; - print $w "close\n" or die "failed to write to pid:$pid: $!\n"; - close $w or die "failed to close pipe for pid:$pid: $!\n"; - waitpid($pid, 0) == $pid or die "remote process did not finish"; - $? == 0 or die ref($self)." pid:$pid exited with: $?"; - } else { - die "transaction in progress $self\n" if $self->{txn}; - idx_release($self) if $self->{xdb}; - } -} - -sub remote_remove { - my ($self, $oid, $num) = @_; - if (my $w = $self->{w}) { - # triggers remove_by_oid in a shard - print $w "D $oid $num\n" or die "failed to write remove $!"; - } else { - $self->remove_by_oid($oid, $num); - } -} - sub _begin_txn { my ($self) = @_; my $xdb = $self->{xdb} || idx_acquire($self); diff --git a/lib/PublicInbox/SearchIdxShard.pm b/lib/PublicInbox/SearchIdxShard.pm index 75521b43..c0f8be89 100644 --- a/lib/PublicInbox/SearchIdxShard.pm +++ b/lib/PublicInbox/SearchIdxShard.pm @@ -1,7 +1,7 @@ # Copyright (C) 2018-2020 all contributors <meta@public-inbox.org> # License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt> -# used to interface with a single Xapian shard in V2 repos. +# Internal interface for a single Xapian shard in V2 inboxes. # See L<public-inbox-v2-format(5)> for more info on how we shard Xapian package PublicInbox::SearchIdxShard; use strict; @@ -47,6 +47,7 @@ sub spawn_worker { close $r or die "failed to close: $!"; } +# this reads all the writes to $self->{w} from the parent process sub shard_worker_loop ($$$$$) { my ($self, $v2w, $r, $shard, $bnote) = @_; $0 = "pi-v2-shard[$shard]"; @@ -87,7 +88,6 @@ sub shard_worker_loop ($$$$$) { $self->worker_done; } -# called by V2Writable sub index_raw { my ($self, $msgref, $eml, $smsg) = @_; if (my $w = $self->{w}) { @@ -110,8 +110,7 @@ sub atfork_child { close $_[0]->{w} or die "failed to close write pipe: $!\n"; } -# called by V2Writable: -sub remote_barrier { +sub shard_barrier { my ($self) = @_; if (my $w = $self->{w}) { print $w "barrier\n" or die "failed to print: $!"; @@ -120,4 +119,36 @@ sub remote_barrier { } } +sub shard_commit { + my ($self) = @_; + if (my $w = $self->{w}) { + print $w "commit\n" or die "failed to write commit: $!"; + } else { + $self->commit_txn_lazy; + } +} + +sub shard_close { + my ($self) = @_; + if (my $w = delete $self->{w}) { + my $pid = delete $self->{pid} or die "no process to wait on\n"; + print $w "close\n" or die "failed to write to pid:$pid: $!\n"; + close $w or die "failed to close pipe for pid:$pid: $!\n"; + waitpid($pid, 0) == $pid or die "remote process did not finish"; + $? == 0 or die ref($self)." pid:$pid exited with: $?"; + } else { + die "transaction in progress $self\n" if $self->{txn}; + $self->idx_release if $self->{xdb}; + } +} + +sub shard_remove { + my ($self, $oid, $num) = @_; + if (my $w = $self->{w}) { # triggers remove_by_oid in a shard child + print $w "D $oid $num\n" or die "failed to write remove $!"; + } else { # same process + $self->remove_by_oid($oid, $num); + } +} + 1; diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm index 9c200288..0a91a132 100644 --- a/lib/PublicInbox/V2Writable.pm +++ b/lib/PublicInbox/V2Writable.pm @@ -630,12 +630,12 @@ sub checkpoint ($;$) { my $barrier = $self->barrier_init(scalar @$shards); # each shard needs to issue a barrier command - $_->remote_barrier for @$shards; + $_->shard_barrier for @$shards; # wait for each Xapian shard $self->barrier_wait($barrier); } else { - $_->remote_commit for @$shards; + $_->shard_commit for @$shards; } # last_commit is special, don't commit these until @@ -675,7 +675,7 @@ sub done { my $shards = delete $self->{idx_shards}; if ($shards) { for (@$shards) { - eval { $_->remote_close }; + eval { $_->shard_close }; $err .= "shard close: $@\n" if $@; } } @@ -1107,7 +1107,7 @@ sub unindex_oid_remote ($$$) { my @removed = $self->{over}->remove_oid($oid, $mid); for my $num (@removed) { my $idx = idx_shard($self, $num % $self->{shards}); - $idx->remote_remove($oid, $num); + $idx->shard_remove($oid, $num); } }
This is the `tid' column from over.sqlite3; and will be used for IMAP and JMAP search (among other things). --- Documentation/standards.perl | 4 ++++ lib/PublicInbox/Over.pm | 2 +- lib/PublicInbox/OverIdx.pm | 18 +++++++++--------- lib/PublicInbox/Search.pm | 4 ++-- lib/PublicInbox/SearchIdx.pm | 1 + lib/PublicInbox/SearchIdxShard.pm | 7 ++++--- lib/PublicInbox/Smsg.pm | 3 ++- t/over.t | 13 +++++++------ 8 files changed, 30 insertions(+), 22 deletions(-) diff --git a/Documentation/standards.perl b/Documentation/standards.perl index a64f033e..0ac6cc52 100755 --- a/Documentation/standards.perl +++ b/Documentation/standards.perl @@ -48,8 +48,12 @@ my $rfcs = [ # 5032 = 'WITHIN search extension for IMAP', 4978 => 'IMAP COMPRESS Extension', # 5182 = 'IMAP Extension for Referencing the Last SEARCH Result', + # 5256 => 'IMAP SORT and THREAD extensions', # 5738 => 'IMAP Support for UTF-8', # 8474 => 'IMAP Extension for Object Identifiers', + + # 8620 => JSON Meta Application Protocol (JMAP) + # 8621 => JSON Meta Application Protocol (JMAP) for Mail # ... # TODO: flesh this out diff --git a/lib/PublicInbox/Over.pm b/lib/PublicInbox/Over.pm index a055b4cd..34d0b05d 100644 --- a/lib/PublicInbox/Over.pm +++ b/lib/PublicInbox/Over.pm @@ -213,7 +213,7 @@ sub get_art { my ($self, $num) = @_; # caching $sth ourselves is faster than prepare_cached my $sth = $self->{-get_art} //= $self->connect->prepare(<<''); -SELECT num,ds,ts,ddd FROM over WHERE num = ? LIMIT 1 +SELECT num,tid,ds,ts,ddd FROM over WHERE num = ? LIMIT 1 $sth->execute($num); my $smsg = $sth->fetchrow_hashref; diff --git a/lib/PublicInbox/OverIdx.pm b/lib/PublicInbox/OverIdx.pm index 4543bfa1..d42d6fe7 100644 --- a/lib/PublicInbox/OverIdx.pm +++ b/lib/PublicInbox/OverIdx.pm @@ -284,7 +284,7 @@ sub add_overview { my $dd = $smsg->to_doc_data; utf8::encode($dd); $dd = compress($dd); - add_over($self, [ @$smsg{qw(ts ds num)}, $mids, $refs, $xpath, $dd ]); + add_over($self, $smsg, $mids, $refs, $xpath, $dd); } sub _add_over { @@ -311,10 +311,10 @@ sub _add_over { } sub add_over { - my ($self, $values) = @_; - my ($ts, $ds, $num, $mids, $refs, $xpath, $ddd) = @$values; + my ($self, $smsg, $mids, $refs, $xpath, $ddd) = @_; my $old_tid; my $vivified = 0; + my $num = $smsg->{num}; begin_lazy($self); delete_by_num($self, $num, \$old_tid); @@ -326,17 +326,17 @@ sub add_over { $v > 1 and warn "BUG: vivified multiple ($v) ghosts for $mid\n"; $vivified += $v; } - my $tid = $vivified ? $old_tid : link_refs($self, $refs, $old_tid); - my $sid = sid($self, $xpath); + $smsg->{tid} = $vivified ? $old_tid : link_refs($self, $refs, $old_tid); + $smsg->{sid} = sid($self, $xpath); my $dbh = $self->{dbh}; my $sth = $dbh->prepare_cached(<<''); INSERT INTO over (num, tid, sid, ts, ds, ddd) VALUES (?,?,?,?,?,?) - my $n = 0; - my @v = ($num, $tid, $sid, $ts, $ds); - foreach (@v) { $sth->bind_param(++$n, $_) } - $sth->bind_param(++$n, $ddd, SQL_BLOB); + my $nc = 1; + $sth->bind_param($nc, $num); + $sth->bind_param(++$nc, $smsg->{$_}) for (qw(tid sid ts ds)); + $sth->bind_param(++$nc, $ddd, SQL_BLOB); $sth->execute; $sth = $dbh->prepare_cached(<<''); INSERT INTO id2num (id, num) VALUES (?,?) diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm index c18e19d4..4cfb7b38 100644 --- a/lib/PublicInbox/Search.pm +++ b/lib/PublicInbox/Search.pm @@ -18,9 +18,9 @@ use constant { # added for public-inbox 1.6.0+ BYTES => 3, # IMAP RFC822.SIZE UID => 4, # IMAP UID == NNTP article number == Xapian docid + THREADID => 5, # RFC 8474, RFC 8621 # TODO - # THREADID => ? # REPLYCNT => ?, # IMAP ANSWERED # SCHEMA_VERSION history @@ -47,7 +47,7 @@ use constant { # public-inbox v1.5.0 adds (still SCHEMA_VERSION=15): # * "lid:" and "l:" for List-Id searches # - # v1.6.0 adds BYTES and UID values + # v1.6.0 adds BYTES, UID and THREADID values SCHEMA_VERSION => 15, }; diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm index 098fead7..baa6f41a 100644 --- a/lib/PublicInbox/SearchIdx.pm +++ b/lib/PublicInbox/SearchIdx.pm @@ -356,6 +356,7 @@ sub add_xapian ($$$$) { add_val($doc, PublicInbox::Search::DT(), $dt); add_val($doc, PublicInbox::Search::BYTES(), $smsg->{bytes}); add_val($doc, PublicInbox::Search::UID(), $smsg->{num}); + add_val($doc, PublicInbox::Search::THREADID, $smsg->{tid}); my $tg = term_generator($self); $tg->set_document($doc); diff --git a/lib/PublicInbox/SearchIdxShard.pm b/lib/PublicInbox/SearchIdxShard.pm index c0f8be89..f23d23d0 100644 --- a/lib/PublicInbox/SearchIdxShard.pm +++ b/lib/PublicInbox/SearchIdxShard.pm @@ -68,8 +68,8 @@ sub shard_worker_loop ($$$$$) { } else { chomp $line; # n.b. $mid may contain spaces(!) - my ($to_read, $bytes, $num, $blob, $ds, $ts, $mid) = - split(/ /, $line, 7); + my ($to_read, $bytes, $num, $blob, $ds, $ts, $tid, $mid) + = split(/ /, $line, 8); $self->begin_txn_lazy; my $n = read($r, my $msg, $to_read) or die "read: $!\n"; $n == $to_read or die "short read: $n != $to_read\n"; @@ -79,6 +79,7 @@ sub shard_worker_loop ($$$$$) { num => $num + 0, blob => $blob, mid => $mid, + tid => $tid, ds => $ds, ts => $ts, }, 'PublicInbox::Smsg'; @@ -93,7 +94,7 @@ sub index_raw { if (my $w = $self->{w}) { # mid must be last, it can contain spaces (but not LF) print $w join(' ', @$smsg{qw(raw_bytes bytes - num blob ds ts mid)}), + num blob ds ts tid mid)}), "\n", $$msgref or die "failed to write shard $!\n"; } else { if ($eml) { diff --git a/lib/PublicInbox/Smsg.pm b/lib/PublicInbox/Smsg.pm index 51226b8e..0a0384ef 100644 --- a/lib/PublicInbox/Smsg.pm +++ b/lib/PublicInbox/Smsg.pm @@ -82,7 +82,8 @@ sub psgi_cull ($) { # drop NNTP-only fields which aren't relevant to PSGI results: # saves ~80K on a 200 item search result: - delete @$self{qw(ts to cc bytes lines)}; + # TODO: we may need to keep some of these for JMAP... + delete @$self{qw(ts tid to cc bytes lines)}; $self; } diff --git a/t/over.t b/t/over.t index 734fdaa3..07672aa7 100644 --- a/t/over.t +++ b/t/over.t @@ -40,22 +40,23 @@ $y = $over->create_ghost('NEVAR'); is($y, $x + 1, 'integer tid for ghost increases'); my $ddd = compress(''); +my $msg = sub { { ts => 0, ds => 0, num => $_[0] } }; foreach my $s ('', undef) { - $over->add_over([0, 0, 98, [ 'a' ], [], $s, $ddd]); - $over->add_over([0, 0, 99, [ 'b' ], [], $s, $ddd]); + $over->add_over($msg->(98), [ 'a' ], [], $s, $ddd); + $over->add_over($msg->(99), [ 'b' ], [], $s, $ddd); my $msgs = [ map { $_->{num} } @{$over->get_thread('a')} ]; is_deeply([98], $msgs, 'messages not linked by empty subject'); } -$over->add_over([0, 0, 98, [ 'a' ], [], 's', $ddd]); -$over->add_over([0, 0, 99, [ 'b' ], [], 's', $ddd]); +$over->add_over($msg->(98), [ 'a' ], [], 's', $ddd); +$over->add_over($msg->(99), [ 'b' ], [], 's', $ddd); foreach my $mid (qw(a b)) { my $msgs = [ map { $_->{num} } @{$over->get_thread('a')} ]; is_deeply([98, 99], $msgs, 'linked messages by subject'); } -$over->add_over([0, 0, 98, [ 'a' ], [], 's', $ddd]); -$over->add_over([0, 0, 99, [ 'b' ], ['a'], 'diff', $ddd]); +$over->add_over($msg->(98), [ 'a' ], [], 's', $ddd); +$over->add_over($msg->(99), [ 'b' ], ['a'], 'diff', $ddd); foreach my $mid (qw(a b)) { my $msgs = [ map { $_->{num} } @{$over->get_thread($mid)} ]; is_deeply([98, 99], $msgs, "linked messages by Message-ID: <$mid>");
Finally, the addition of THREADID for collapsing results in Xapian lets us emulate the "mairix --threads" feature. That is, instead of returning only the matching messages, the entire thread is included in the downloaded mbox.gz This requires a "public-inbox-index --reindex" to be usable. --- lib/PublicInbox/Mbox.pm | 54 +++++++++++++++++++++++++++------- lib/PublicInbox/Over.pm | 29 ++++++++++++++++++ lib/PublicInbox/Search.pm | 4 +++ lib/PublicInbox/SearchQuery.pm | 8 +++-- lib/PublicInbox/SearchView.pm | 21 ++++++++----- 5 files changed, 95 insertions(+), 21 deletions(-) diff --git a/lib/PublicInbox/Mbox.pm b/lib/PublicInbox/Mbox.pm index 873ff7be..c9b11c21 100644 --- a/lib/PublicInbox/Mbox.pm +++ b/lib/PublicInbox/Mbox.pm @@ -167,13 +167,13 @@ sub thread_mbox { sub emit_range { my ($ctx, $range) = @_; - my $query; + my $q; if ($range eq 'all') { # TODO: YYYY[-MM] - $query = ''; + $q = ''; } else { return [404, [qw(Content-Type text/plain)], []]; } - mbox_all($ctx, $query); + mbox_all($ctx, { q => $q }); } sub all_ids_cb { @@ -220,21 +220,55 @@ sub results_cb { } } -sub mbox_all { - my ($ctx, $query) = @_; +sub results_thread_cb { + my ($ctx) = @_; - return mbox_all_ids($ctx) if $query eq ''; - my $qopts = $ctx->{qopts} = { mset => 2 }; # order by docid + my $over = $ctx->{-inbox}->over or return; + while (1) { + while (defined(my $num = shift(@{$ctx->{xids}}))) { + my $smsg = $over->get_art($num) or next; + return $smsg; + } + + # refills ctx->{xids} + next if $over->expand_thread($ctx); + + # refill result set + my $srch = $ctx->{-inbox}->search(undef, $ctx) or return; + my $mset = $srch->query($ctx->{query}, $ctx->{qopts}); + my $size = $mset->size or return; + $ctx->{qopts}->{offset} += $size; + $ctx->{ids} = $srch->mset_to_artnums($mset); + } + +} + +sub mbox_all { + my ($ctx, $q) = @_; + my $q_string = $q->{'q'}; + return mbox_all_ids($ctx) if $q_string !~ /\S/; my $srch = $ctx->{-inbox}->search or return PublicInbox::WWW::need($ctx, 'Search'); - my $mset = $srch->query($query, $qopts); + my $over = $ctx->{-inbox}->over or + return PublicInbox::WWW::need($ctx, 'Overview'); + + my $qopts = $ctx->{qopts} = { mset => 2 }; # order by docid + $qopts->{thread} = 1 if $q->{t}; + my $mset = $srch->query($q_string, $qopts); $qopts->{offset} = $mset->size or return [404, [qw(Content-Type text/plain)], ["No results found\n"]]; - $ctx->{query} = $query; + $ctx->{query} = $q_string; $ctx->{ids} = $srch->mset_to_artnums($mset); require PublicInbox::MboxGz; - PublicInbox::MboxGz::mbox_gz($ctx, \&results_cb, 'results-'.$query); + my $fn; + if ($q->{t}) { + $fn = 'results-thread-'.$q_string; + PublicInbox::MboxGz::mbox_gz($ctx, \&results_thread_cb, $fn); + } else { + $fn = 'results-'.$q_string; + PublicInbox::MboxGz::mbox_gz($ctx, \&results_cb, $fn); + } } 1; diff --git a/lib/PublicInbox/Over.pm b/lib/PublicInbox/Over.pm index 34d0b05d..fba58d17 100644 --- a/lib/PublicInbox/Over.pm +++ b/lib/PublicInbox/Over.pm @@ -179,6 +179,35 @@ ORDER BY $sort_col DESC ($nr, $msgs); } +# strict `tid' matches, only, for thread-expanded mbox.gz search results +# and future CLI interface +# returns true if we have IDs, undef if not +sub expand_thread { + my ($self, $ctx) = @_; + my $dbh = $self->connect; + do { + defined(my $num = $ctx->{ids}->[0]) or return; + my ($tid) = $dbh->selectrow_array(<<'', undef, $num); +SELECT tid FROM over WHERE num = ? + + if (defined($tid)) { + my $sql = <<''; +SELECT num FROM over WHERE tid = ? AND num > ? +ORDER BY num ASC LIMIT 1000 + + my $xids = $dbh->selectcol_arrayref($sql, undef, $tid, + $ctx->{prev} // 0); + if (scalar(@$xids)) { + $ctx->{prev} = $xids->[-1]; + $ctx->{xids} = $xids; + return 1; # success + } + } + $ctx->{prev} = 0; + shift @{$ctx->{ids}}; + } while (1); +} + sub recent { my ($self, $opts, $after, $before) = @_; my ($s, @v); diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm index 4cfb7b38..bc820b64 100644 --- a/lib/PublicInbox/Search.pm +++ b/lib/PublicInbox/Search.pm @@ -326,6 +326,10 @@ sub _enquire_once { # retry_reopen callback } else { $enquire->set_sort_by_value_then_relevance(TS, $desc); } + + # `mairix -t / --threads' or JMAP collapseThreads + $enquire->set_collapse_key(THREADID) if $opts->{thread}; + my $offset = $opts->{offset} || 0; my $limit = $opts->{limit} || 50; my $mset = $enquire->get_mset($offset, $limit); diff --git a/lib/PublicInbox/SearchQuery.pm b/lib/PublicInbox/SearchQuery.pm index ce1eae12..6724ae39 100644 --- a/lib/PublicInbox/SearchQuery.pm +++ b/lib/PublicInbox/SearchQuery.pm @@ -12,7 +12,8 @@ our $LIM = 200; sub new { my ($class, $qp) = @_; - my $r = $qp->{r}; + my $r = $qp->{r}; # relevance + my $t = $qp->{t}; # collapse threads my ($l) = (($qp->{l} || '') =~ /([0-9]+)/); $l = $LIM if !$l || $l > $LIM; bless { @@ -21,6 +22,7 @@ sub new { o => (($qp->{o} || '0') =~ /(-?[0-9]+)/), l => $l, r => (defined $r && $r ne '0'), + t => (defined $t && $t ne '0'), }, $class; } @@ -41,8 +43,8 @@ sub qs_html { if (my $l = $self->{l}) { $qs .= "&l=$l" unless $l == $LIM; } - if (my $r = $self->{r}) { - $qs .= "&r"; + for my $bool (qw(r t)) { + $qs .= "&$bool" if $self->{$bool}; } if (my $x = $self->{x}) { $qs .= "&x=$x" if ($x eq 't' || $x eq 'A' || $x eq 'm'); diff --git a/lib/PublicInbox/SearchView.pm b/lib/PublicInbox/SearchView.pm index 75e2d39d..dd69564a 100644 --- a/lib/PublicInbox/SearchView.pm +++ b/lib/PublicInbox/SearchView.pm @@ -19,10 +19,12 @@ my %rmap_inc; sub mbox_results { my ($ctx) = @_; my $q = PublicInbox::SearchQuery->new($ctx->{qp}); - my $x = $q->{x}; + if ($ctx->{env}->{'psgi.input'}->read(my $buf, 3)) { + $q->{t} = 1 if $buf =~ /\Ax=[^0]/; + } require PublicInbox::Mbox; - return PublicInbox::Mbox::mbox_all($ctx, $q->{'q'}) if $x eq 'm'; - sres_top_html($ctx); + $q->{x} eq 'm' ? PublicInbox::Mbox::mbox_all($ctx, $q) : + sres_top_html($ctx); } sub sres_top_html { @@ -46,6 +48,7 @@ sub sres_top_html { offset => $o, mset => 1, relevance => $q->{r}, + thread => $q->{t}, asc => $asc, }; my ($mset, $total, $err, $html); @@ -151,7 +154,7 @@ sub err_txt { sub search_nav_top { my ($mset, $q, $ctx) = @_; - my $m = $q->qs_html(x => 'm', r => undef); + my $m = $q->qs_html(x => 'm', r => undef, t => undef); my $rv = qq{<form\naction="?$m"\nmethod="post"><pre>}; my $initial_q = $ctx->{-uxs_retried}; if (defined $initial_q) { @@ -186,10 +189,12 @@ sub search_nav_top { } my $A = $q->qs_html(x => 'A', r => undef); $rv .= qq{|<a\nhref="?$A">Atom feed</a>]} . - qq{\n\t\t\t\t\t\tdownload: } . - # lynx seems to require a name=, here, so just use 'z' - qq{<input\ntype=submit\nname=z\nvalue="mbox.gz"/>} . - q{</pre></form><pre>}; + qq{\n\t\t\tdownload mbox.gz: } . + # we set name=z w/o using it since it seems required for + # lynx (but works fine for w3m). + qq{<input\ntype=submit\nname=z\nvalue="results only"/>|} . + qq{<input\ntype=submit\nname=x\nvalue="full threads"/>} . + qq{</pre></form><pre>}; } sub search_nav_bot {
Expanding threads via over.sqlite3 for mbox.gz downloads without Xapian effectively collapsing on the THREADID column leads to repeated messages getting downloaded. To avoid that situation, use a "has_threadid" Xapian metadata flag that's only set on --reindex (and brand new Xapian DBs). This allows admins to upgrade WWW or do --reindex in any order; without worrying about users eating up bandwidth and CPU cycles. --- lib/PublicInbox/Mbox.pm | 2 +- lib/PublicInbox/Search.pm | 10 ++++++++- lib/PublicInbox/SearchIdx.pm | 16 ++++++++++++-- lib/PublicInbox/SearchView.pm | 21 ++++++++++++------- lib/PublicInbox/V2Writable.pm | 11 ++++++++++ t/init.t | 1 + t/psgi_search.t | 39 +++++++++++++++++++++++++++++++++-- 7 files changed, 87 insertions(+), 13 deletions(-) diff --git a/lib/PublicInbox/Mbox.pm b/lib/PublicInbox/Mbox.pm index c9b11c21..0223bead 100644 --- a/lib/PublicInbox/Mbox.pm +++ b/lib/PublicInbox/Mbox.pm @@ -262,7 +262,7 @@ sub mbox_all { $ctx->{ids} = $srch->mset_to_artnums($mset); require PublicInbox::MboxGz; my $fn; - if ($q->{t}) { + if ($q->{t} && $srch->has_threadid) { $fn = 'results-thread-'.$q_string; PublicInbox::MboxGz::mbox_gz($ctx, \&results_thread_cb, $fn); } else { diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm index bc820b64..01bbe73d 100644 --- a/lib/PublicInbox/Search.pm +++ b/lib/PublicInbox/Search.pm @@ -311,6 +311,12 @@ sub _do_enquire { retry_reopen($self, \&_enquire_once, [ $self, $query, $opts ]); } +# returns true if all docs have the THREADID value +sub has_threadid ($) { + my ($self) = @_; + (xdb($self)->get_metadata('has_threadid') // '') eq '1'; +} + sub _enquire_once { # retry_reopen callback my ($self, $query, $opts) = @{$_[0]}; my $xdb = xdb($self); @@ -328,7 +334,9 @@ sub _enquire_once { # retry_reopen callback } # `mairix -t / --threads' or JMAP collapseThreads - $enquire->set_collapse_key(THREADID) if $opts->{thread}; + if ($opts->{thread} && has_threadid($self)) { + $enquire->set_collapse_key(THREADID); + } my $offset = $opts->{offset} || 0; my $limit = $opts->{limit} || 50; diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm index baa6f41a..ade55756 100644 --- a/lib/PublicInbox/SearchIdx.pm +++ b/lib/PublicInbox/SearchIdx.pm @@ -131,6 +131,7 @@ sub idx_acquire { ($is_shard && need_xapian($self)))) { File::Path::mkpath($dir); nodatacow_dir($dir); + $self->{-set_has_threadid_once} = 1; } } return unless defined $flag; @@ -590,9 +591,17 @@ sub v1_checkpoint ($$;$) { $self->{mm}->{dbh}->commit; if ($newest && need_xapian($self)) { - my $cur = $self->{xdb}->get_metadata('last_commit'); + my $xdb = $self->{xdb}; + my $cur = $xdb->get_metadata('last_commit'); if (need_update($self, $cur, $newest)) { - $self->{xdb}->set_metadata('last_commit', $newest); + $xdb->set_metadata('last_commit', $newest); + } + + # let SearchView know a full --reindex was done so it can + # generate ->has_threadid-dependent links + if ($sync->{reindex} && !ref($sync->{reindex})) { + my $n = $xdb->get_metadata('has_threadid'); + $xdb->set_metadata('has_threadid', '1') if $n ne '1'; } } @@ -816,6 +825,9 @@ sub set_metadata_once { return if $self->{shard}; # only continue if undef or 0, not >0 my $xdb = $self->{xdb}; + if (delete($self->{-set_has_threadid_once})) { + $xdb->set_metadata('has_threadid', '1'); + } if (delete($self->{-set_indexlevel_once})) { my $level = $xdb->get_metadata('indexlevel'); if (!$level || $level ne 'medium') { diff --git a/lib/PublicInbox/SearchView.pm b/lib/PublicInbox/SearchView.pm index dd69564a..76428dfb 100644 --- a/lib/PublicInbox/SearchView.pm +++ b/lib/PublicInbox/SearchView.pm @@ -188,13 +188,20 @@ sub search_nav_top { $rv .= qq{<a\nhref="?$s">summary</a>|<b>nested</b>}; } my $A = $q->qs_html(x => 'A', r => undef); - $rv .= qq{|<a\nhref="?$A">Atom feed</a>]} . - qq{\n\t\t\tdownload mbox.gz: } . - # we set name=z w/o using it since it seems required for - # lynx (but works fine for w3m). - qq{<input\ntype=submit\nname=z\nvalue="results only"/>|} . - qq{<input\ntype=submit\nname=x\nvalue="full threads"/>} . - qq{</pre></form><pre>}; + $rv .= qq{|<a\nhref="?$A">Atom feed</a>]}; + if ($ctx->{-inbox}->search->has_threadid) { + $rv .= qq{\n\t\t\tdownload mbox.gz: } . + # we set name=z w/o using it since it seems required for + # lynx (but works fine for w3m). + qq{<input\ntype=submit\nname=z\n} . + q{value="results only"/>} . + qq{|<input\ntype=submit\nname=x\n} . + q{value="full threads"/>}; + } else { # BOFH needs to --reindex + $rv .= qq{\n\t\t\t\t\t\tdownload: } . + qq{<input\ntype=submit\nname=z\nvalue="mbox.gz"/>} + } + $rv .= qq{</pre></form><pre>}; } sub search_nav_bot { diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm index 0a91a132..a32fdcf3 100644 --- a/lib/PublicInbox/V2Writable.pm +++ b/lib/PublicInbox/V2Writable.pm @@ -1337,6 +1337,17 @@ sub index_sync { xapian_only($self, $opt, $sync, $art_beg); } + # --reindex on the command-line + if ($opt->{reindex} && !ref($opt->{reindex}) && $idxlevel ne 'basic') { + local $self->{parallel} = 0; + my $s0 = PublicInbox::SearchIdx->new($self->{ibx}, 0, 0); + if (my $xdb = $s0->idx_acquire) { + my $n = $xdb->get_metadata('has_threadid'); + $xdb->set_metadata('has_threadid', 1) if $n ne '1'; + } + $s0->idx_release; + } + # reindex does not pick up new changes, so we rerun w/o it: if ($opt->{reindex}) { my %again = %$opt; diff --git a/t/init.t b/t/init.t index dad09435..a5a9debc 100644 --- a/t/init.t +++ b/t/init.t @@ -108,6 +108,7 @@ SKIP: { is(PublicInbox::Admin::detect_indexlevel($ibx), 'full', "detected default indexlevel -V$v"); ok($ibx->{-skip_docdata}, "docdata skip set -V$v"); + ok($ibx->search->has_threadid, 'has_threadid flag set on new inbox'); } # loop for idempotency diff --git a/t/psgi_search.t b/t/psgi_search.t index 5d537363..c1677eb3 100644 --- a/t/psgi_search.t +++ b/t/psgi_search.t @@ -3,6 +3,7 @@ use strict; use warnings; use Test::More; +use IO::Uncompress::Gunzip qw(gunzip); use PublicInbox::Eml; use PublicInbox::Config; use PublicInbox::Inbox; @@ -39,6 +40,12 @@ To: git\@vger.kernel.org EOF $im->add($mime); +$im->add(PublicInbox::Eml->new(<<"")); +Message-ID: <reply\@asdf> +From: replier <r\@example.com> +In-Reply-To: <$mid> +Subject: mismatch + $mime = PublicInbox::Eml->new(<<'EOF'); Subject: Message-ID: <blank-subject@example.com> @@ -79,6 +86,9 @@ test_psgi(sub { $www->call(@_) }, sub { ok(index($html, 'by Ævar Arnfjörð Bjarmason') >= 0, "displayed Ævar's name properly in HTML"); + like($html, qr/download mbox\.gz: .*?"full threads"/s, + '"full threads" download option shown'); + my $warn = []; local $SIG{__WARN__} = sub { push @$warn, @_ }; $res = $cb->(GET('/test/?q=s:test&l=5e')); @@ -118,8 +128,33 @@ test_psgi(sub { $www->call(@_) }, sub { $res = $cb->(GET('/test/no-subject-at-all@example.com/t.mbox.gz')); like($res->header('Content-Disposition'), qr/filename=no-subject\.mbox\.gz/); + + # "full threads" mbox.gz download + $res = $cb->(POST('/test/?q=s:test&x=m&t')); + is($res->code, 200, 'successful mbox download with threads'); + gunzip(\($res->content) => \(my $before)); + is_deeply([ "Message-ID: <$mid>\n", "Message-ID: <reply\@asdf>\n" ], + [ grep(/^Message-ID:/m, split(/^/m, $before)) ], + 'got full thread'); + + # clobber has_threadid to emulate old versions: + { + my $sidx = PublicInbox::SearchIdx->new($ibx, 0); + my $xdb = $sidx->idx_acquire; + $xdb->set_metadata('has_threadid', '0'); + $sidx->idx_release; + } + $config->each_inbox(sub { delete $_[0]->{search} }); + $res = $cb->(GET('/test/?q=s:test')); + is($res->code, 200, 'successful search w/o has_threadid'); + unlike($html, qr/download mbox\.gz: .*?"full threads"/s, + '"full threads" download option not shown w/o has_threadid'); + + # in case somebody uses curl to bypass <form> + $res = $cb->(POST('/test/?q=s:test&x=m&t')); + is($res->code, 200, 'successful mbox download w/ threads'); + gunzip(\($res->content) => \(my $after)); + isnt($before, $after); }); done_testing(); - -1;
Eric Wong <e@yhbt.net> wrote: > Expanding threads via over.sqlite3 for mbox.gz downloads without > Xapian effectively collapsing on the THREADID column leads to > repeated messages getting downloaded. > > To avoid that situation, use a "has_threadid" Xapian metadata > flag that's only set on --reindex (and brand new Xapian DBs). > > This allows admins to upgrade WWW or do --reindex in any order; > without worrying about users eating up bandwidth and CPU cycles. > --- > lib/PublicInbox/Mbox.pm | 2 +- > lib/PublicInbox/Search.pm | 10 ++++++++- > lib/PublicInbox/SearchIdx.pm | 16 ++++++++++++-- > lib/PublicInbox/SearchView.pm | 21 ++++++++++++------- > lib/PublicInbox/V2Writable.pm | 11 ++++++++++ > t/init.t | 1 + > t/psgi_search.t | 39 +++++++++++++++++++++++++++++++++-- > 7 files changed, 87 insertions(+), 13 deletions(-) > > diff --git a/lib/PublicInbox/Mbox.pm b/lib/PublicInbox/Mbox.pm > index c9b11c21..0223bead 100644 > --- a/lib/PublicInbox/Mbox.pm > +++ b/lib/PublicInbox/Mbox.pm > @@ -262,7 +262,7 @@ sub mbox_all { > $ctx->{ids} = $srch->mset_to_artnums($mset); > require PublicInbox::MboxGz; > my $fn; > - if ($q->{t}) { > + if ($q->{t} && $srch->has_threadid) { > $fn = 'results-thread-'.$q_string; > PublicInbox::MboxGz::mbox_gz($ctx, \&results_thread_cb, $fn); > } else { > diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm > index bc820b64..01bbe73d 100644 > --- a/lib/PublicInbox/Search.pm > +++ b/lib/PublicInbox/Search.pm > @@ -311,6 +311,12 @@ sub _do_enquire { > retry_reopen($self, \&_enquire_once, [ $self, $query, $opts ]); > } > > +# returns true if all docs have the THREADID value > +sub has_threadid ($) { > + my ($self) = @_; > + (xdb($self)->get_metadata('has_threadid') // '') eq '1'; > +} > + > sub _enquire_once { # retry_reopen callback > my ($self, $query, $opts) = @{$_[0]}; > my $xdb = xdb($self); > @@ -328,7 +334,9 @@ sub _enquire_once { # retry_reopen callback > } > > # `mairix -t / --threads' or JMAP collapseThreads > - $enquire->set_collapse_key(THREADID) if $opts->{thread}; > + if ($opts->{thread} && has_threadid($self)) { > + $enquire->set_collapse_key(THREADID); > + } > > my $offset = $opts->{offset} || 0; > my $limit = $opts->{limit} || 50; > diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm > index baa6f41a..ade55756 100644 > --- a/lib/PublicInbox/SearchIdx.pm > +++ b/lib/PublicInbox/SearchIdx.pm > @@ -131,6 +131,7 @@ sub idx_acquire { > ($is_shard && need_xapian($self)))) { > File::Path::mkpath($dir); > nodatacow_dir($dir); > + $self->{-set_has_threadid_once} = 1; > } > } > return unless defined $flag; > @@ -590,9 +591,17 @@ sub v1_checkpoint ($$;$) { > > $self->{mm}->{dbh}->commit; > if ($newest && need_xapian($self)) { > - my $cur = $self->{xdb}->get_metadata('last_commit'); > + my $xdb = $self->{xdb}; > + my $cur = $xdb->get_metadata('last_commit'); > if (need_update($self, $cur, $newest)) { > - $self->{xdb}->set_metadata('last_commit', $newest); > + $xdb->set_metadata('last_commit', $newest); > + } > + > + # let SearchView know a full --reindex was done so it can > + # generate ->has_threadid-dependent links > + if ($sync->{reindex} && !ref($sync->{reindex})) { > + my $n = $xdb->get_metadata('has_threadid'); > + $xdb->set_metadata('has_threadid', '1') if $n ne '1'; > } > } > > @@ -816,6 +825,9 @@ sub set_metadata_once { > return if $self->{shard}; # only continue if undef or 0, not >0 > my $xdb = $self->{xdb}; > > + if (delete($self->{-set_has_threadid_once})) { > + $xdb->set_metadata('has_threadid', '1'); > + } > if (delete($self->{-set_indexlevel_once})) { > my $level = $xdb->get_metadata('indexlevel'); > if (!$level || $level ne 'medium') { > diff --git a/lib/PublicInbox/SearchView.pm b/lib/PublicInbox/SearchView.pm > index dd69564a..76428dfb 100644 > --- a/lib/PublicInbox/SearchView.pm > +++ b/lib/PublicInbox/SearchView.pm > @@ -188,13 +188,20 @@ sub search_nav_top { > $rv .= qq{<a\nhref="?$s">summary</a>|<b>nested</b>}; > } > my $A = $q->qs_html(x => 'A', r => undef); > - $rv .= qq{|<a\nhref="?$A">Atom feed</a>]} . > - qq{\n\t\t\tdownload mbox.gz: } . > - # we set name=z w/o using it since it seems required for > - # lynx (but works fine for w3m). > - qq{<input\ntype=submit\nname=z\nvalue="results only"/>|} . > - qq{<input\ntype=submit\nname=x\nvalue="full threads"/>} . > - qq{</pre></form><pre>}; > + $rv .= qq{|<a\nhref="?$A">Atom feed</a>]}; > + if ($ctx->{-inbox}->search->has_threadid) { > + $rv .= qq{\n\t\t\tdownload mbox.gz: } . > + # we set name=z w/o using it since it seems required for > + # lynx (but works fine for w3m). > + qq{<input\ntype=submit\nname=z\n} . > + q{value="results only"/>} . > + qq{|<input\ntype=submit\nname=x\n} . > + q{value="full threads"/>}; > + } else { # BOFH needs to --reindex > + $rv .= qq{\n\t\t\t\t\t\tdownload: } . > + qq{<input\ntype=submit\nname=z\nvalue="mbox.gz"/>} > + } > + $rv .= qq{</pre></form><pre>}; > } > > sub search_nav_bot { > diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm > index 0a91a132..a32fdcf3 100644 > --- a/lib/PublicInbox/V2Writable.pm > +++ b/lib/PublicInbox/V2Writable.pm > @@ -1337,6 +1337,17 @@ sub index_sync { > xapian_only($self, $opt, $sync, $art_beg); > } > > + # --reindex on the command-line > + if ($opt->{reindex} && !ref($opt->{reindex}) && $idxlevel ne 'basic') { > + local $self->{parallel} = 0; > + my $s0 = PublicInbox::SearchIdx->new($self->{ibx}, 0, 0); > + if (my $xdb = $s0->idx_acquire) { > + my $n = $xdb->get_metadata('has_threadid'); > + $xdb->set_metadata('has_threadid', 1) if $n ne '1'; > + } > + $s0->idx_release; > + } Assigning {parallel} there is useless; I originally used SearchIdxShard instead of SearchIdx but avoided the *Shard bit to avoid an extra idx_acquire. Anyways, we actually need to flock the inbox.lock there to avoid contending with -index/-watch/-mda etc. So I'll squash this in: diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm index a32fdcf3..b0148dba 100644 --- a/lib/PublicInbox/V2Writable.pm +++ b/lib/PublicInbox/V2Writable.pm @@ -1339,13 +1339,14 @@ sub index_sync { # --reindex on the command-line if ($opt->{reindex} && !ref($opt->{reindex}) && $idxlevel ne 'basic') { - local $self->{parallel} = 0; + $self->lock_acquire; my $s0 = PublicInbox::SearchIdx->new($self->{ibx}, 0, 0); if (my $xdb = $s0->idx_acquire) { my $n = $xdb->get_metadata('has_threadid'); $xdb->set_metadata('has_threadid', 1) if $n ne '1'; } $s0->idx_release; + $self->lock_release; } # reindex does not pick up new changes, so we rerun w/o it:
Eric Wong <e@yhbt.net> wrote:
> It requires "public-inbox-index --reindex" to activate;
> but PATCH 5/5 makes it safe to upgrade WWW either before
> or after --reindex. That means BOFHs can upgrade without
> regard to ordering.
public-inbox-watch users will need to restart -watch before
--reindex, though. Don't think that's avoidable...
Eric Wong writes:
> Eric Wong <e@yhbt.net> wrote:
>> It requires "public-inbox-index --reindex" to activate;
>> but PATCH 5/5 makes it safe to upgrade WWW either before
>> or after --reindex. That means BOFHs can upgrade without
>> regard to ordering.
>
> public-inbox-watch users will need to restart -watch before
> --reindex, though. Don't think that's avoidable...
Oops, the few times I've used --reindex I don't think I've given order
much thought. Is it accurate to say that "restart services then
--reindex" is the recommended order in general?
Kyle Meyer <kyle@kyleam.com> wrote:
> Eric Wong writes:
> > Eric Wong <e@yhbt.net> wrote:
> >> It requires "public-inbox-index --reindex" to activate;
> >> but PATCH 5/5 makes it safe to upgrade WWW either before
> >> or after --reindex. That means BOFHs can upgrade without
> >> regard to ordering.
> >
> > public-inbox-watch users will need to restart -watch before
> > --reindex, though. Don't think that's avoidable...
>
> Oops, the few times I've used --reindex I don't think I've given order
> much thought. Is it accurate to say that "restart services then
> --reindex" is the recommended order in general?
Not really. In the distant past (pre-SCHEMA_VERSION=15), it was
actually index (no need for --reindex) then restart daemons,
because the daemons were running on SCHEMA_VERSION=14 (or
whatever was before).
This order is only needed since we're trying to stick to
SCHEMA_VERSION=15 to avoid the space penalty of parallel
versions.
If we get to SCHEMA_VERSION=16, it'll again be index first (no
need for --reindex), then restart -watch, then restart read-only
daemons. Right now I don't see a need to do a
SCHEMA_VERSION=16.
However, there will very likely be an optional multi-inbox index
in 1.7 which will run parallel to existing (xap15) indices.
This multi-inbox index may be able to completely replace
existing indices (and save space while doing so). But it'll
be completely optional.
In any case, the release notes will be updated with any
necessary upgrade instructions.
Eric Wong writes:
> Kyle Meyer <kyle@kyleam.com> wrote:
>>
>> Oops, the few times I've used --reindex I don't think I've given order
>> much thought. Is it accurate to say that "restart services then
>> --reindex" is the recommended order in general?
>
> Not really. In the distant past (pre-SCHEMA_VERSION=15), it was
> actually index (no need for --reindex) then restart daemons,
> because the daemons were running on SCHEMA_VERSION=14 (or
> whatever was before).
>
> This order is only needed since we're trying to stick to
> SCHEMA_VERSION=15 to avoid the space penalty of parallel
> versions.
Got it. Thanks.