unofficial mirror of meta@public-inbox.org
 help / color / mirror / Atom feed
* [PATCH 0/2] isearch: per-inbox search using ->ALL
@ 2020-12-05 10:11 Eric Wong
  2020-12-05 10:11 ` [PATCH 1/2] inbox: simplify ->search and callers Eric Wong
                   ` (3 more replies)
  0 siblings, 4 replies; 6+ messages in thread
From: Eric Wong @ 2020-12-05 10:11 UTC (permalink / raw)
  To: meta

1/2 is preparatory, 2/2 changes all the WWW code to support
isearch and avoid touching per-inbox Xapian indices.

Now, if extindex.all.topdir is configured, per-Inbox Xapian
search will use ->ALL, and not xap15/[0-9] (over.sqlite3 is
still required for each inbox).

Unfortunately, IMAP still needs much work and still uses per-inbox
Xapian indices.

Not sure about JMAP, but it looks like it'll be supportable
since we won't have to worry about MSN<=>UID mappings and
limiting mailbox size to accomodate (non-existent :P) MUA support

Eric Wong (2):
  inbox: simplify ->search and callers
  isearch: emulate per-inbox search with ->ALL

 MANIFEST                      |  1 +
 lib/PublicInbox/Config.pm     |  4 ++
 lib/PublicInbox/DummyInbox.pm |  2 +-
 lib/PublicInbox/ExtMsg.pm     |  2 +-
 lib/PublicInbox/ExtSearch.pm  |  1 +
 lib/PublicInbox/Inbox.pm      | 24 ++++++----
 lib/PublicInbox/Isearch.pm    | 87 +++++++++++++++++++++++++++++++++++
 lib/PublicInbox/Mbox.pm       | 16 +++++--
 lib/PublicInbox/Search.pm     |  5 +-
 lib/PublicInbox/SearchView.pm | 10 ++--
 lib/PublicInbox/SolverGit.pm  |  2 +-
 lib/PublicInbox/WWW.pm        |  2 +-
 lib/PublicInbox/WwwStream.pm  |  2 +-
 lib/PublicInbox/WwwText.pm    |  2 +-
 t/extsearch.t                 | 25 +++++++++-
 15 files changed, 157 insertions(+), 28 deletions(-)
 create mode 100644 lib/PublicInbox/Isearch.pm

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH 1/2] inbox: simplify ->search and callers
  2020-12-05 10:11 [PATCH 0/2] isearch: per-inbox search using ->ALL Eric Wong
@ 2020-12-05 10:11 ` Eric Wong
  2020-12-05 10:11 ` [PATCH 2/2] isearch: emulate per-inbox search with ->ALL Eric Wong
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 6+ messages in thread
From: Eric Wong @ 2020-12-05 10:11 UTC (permalink / raw)
  To: meta

Stop leaking WWW/PSGI-specific logic into classes like
PublicInbox::Inbox, which is used universally.

We'll also decouple $ibx->over from $ibx->search and just deal
with duplicate the code inside ->over to reduce argument
complexity in ->search.

This is also a step in moving away from using {psgi.errors}
to ease code sharing between IMAP, NNTP, and command-line
interfaces.  Perl's built-in `warn' and `local $SIG{__WARN__}'
provides all the flexibility we need to control warning output
and should be universally understood by Perl hackers who may
be unfamiliar with PSGI.
---
 lib/PublicInbox/Inbox.pm | 20 ++++++++++----------
 lib/PublicInbox/Mbox.pm  | 14 ++++++++++----
 2 files changed, 20 insertions(+), 14 deletions(-)

diff --git a/lib/PublicInbox/Inbox.pm b/lib/PublicInbox/Inbox.pm
index 5a22e40d..58651687 100644
--- a/lib/PublicInbox/Inbox.pm
+++ b/lib/PublicInbox/Inbox.pm
@@ -191,30 +191,30 @@ sub mm {
 	};
 }
 
-sub search ($;$$) {
-	my ($self, $over_only, $ctx) = @_;
-	my $srch = $self->{search} ||= eval {
+sub search {
+	my ($self) = @_;
+	my $srch = $self->{search} //= eval {
 		_cleanup_later($self);
 		require PublicInbox::Search;
 		PublicInbox::Search->new($self);
 	};
-	($over_only || eval { $srch->xdb }) ? $srch : do {
-		$ctx and $ctx->{env}->{'psgi.errors'}->print(<<EOF);
-`$self->{name}' search went away unexpectedly
-EOF
-		undef;
-	};
+	(eval { $srch->xdb }) ? $srch : undef;
 }
 
 sub over {
 	$_[0]->{over} //= eval {
-		my $srch = search($_[0], 1) or return;
+		my $srch = $_[0]->{search} //= eval {
+			_cleanup_later($_[0]);
+			require PublicInbox::Search;
+			PublicInbox::Search->new($_[0]);
+		};
 		my $over = PublicInbox::Over->new("$srch->{xpfx}/over.sqlite3");
 		$over->dbh; # may fail
 		$over;
 	};
 }
 
+
 sub try_cat {
 	my ($path) = @_;
 	my $rv = '';
diff --git a/lib/PublicInbox/Mbox.pm b/lib/PublicInbox/Mbox.pm
index 47025891..22516998 100644
--- a/lib/PublicInbox/Mbox.pm
+++ b/lib/PublicInbox/Mbox.pm
@@ -203,16 +203,22 @@ sub mbox_all_ids {
 	PublicInbox::MboxGz::mbox_gz($ctx, \&all_ids_cb, 'all');
 }
 
+sub gone ($$) {
+	my ($ctx, $what) = @_;
+	warn "W: `$ctx->{-inbox}->{inboxdir}' $what went away unexpectedly\n";
+	undef;
+}
+
 sub results_cb {
 	my ($ctx) = @_;
-	my $over = $ctx->{-inbox}->over or return;
+	my $over = $ctx->{-inbox}->over or return gone($ctx, 'over');
 	while (1) {
 		while (defined(my $num = shift(@{$ctx->{ids}}))) {
 			my $smsg = $over->get_art($num) or next;
 			return $smsg;
 		}
 		# refill result set
-		my $srch = $ctx->{-inbox}->search(undef, $ctx) or return;
+		my $srch = $ctx->{-inbox}->search or return gone($ctx,'search');
 		my $mset = $srch->mset($ctx->{query}, $ctx->{qopts});
 		my $size = $mset->size or return;
 		$ctx->{qopts}->{offset} += $size;
@@ -223,7 +229,7 @@ sub results_cb {
 sub results_thread_cb {
 	my ($ctx) = @_;
 
-	my $over = $ctx->{-inbox}->over or return;
+	my $over = $ctx->{-inbox}->over or return gone($ctx, 'over');
 	while (1) {
 		while (defined(my $num = shift(@{$ctx->{xids}}))) {
 			my $smsg = $over->get_art($num) or next;
@@ -234,7 +240,7 @@ sub results_thread_cb {
 		next if $over->expand_thread($ctx);
 
 		# refill result set
-		my $srch = $ctx->{-inbox}->search(undef, $ctx) or return;
+		my $srch = $ctx->{-inbox}->search or return gone($ctx,'search');
 		my $mset = $srch->mset($ctx->{query}, $ctx->{qopts});
 		my $size = $mset->size or return;
 		$ctx->{qopts}->{offset} += $size;

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH 2/2] isearch: emulate per-inbox search with ->ALL
  2020-12-05 10:11 [PATCH 0/2] isearch: per-inbox search using ->ALL Eric Wong
  2020-12-05 10:11 ` [PATCH 1/2] inbox: simplify ->search and callers Eric Wong
@ 2020-12-05 10:11 ` Eric Wong
  2020-12-05 11:10 ` [PATCH 3/2] imap: support isearch and reduce Xapian queries Eric Wong
  2020-12-10  0:41 ` One, All; Some? [was: isearch: per-inbox search using ->ALL] Eric Wong
  3 siblings, 0 replies; 6+ messages in thread
From: Eric Wong @ 2020-12-05 10:11 UTC (permalink / raw)
  To: meta

Using "eidx_key:" boolean prefix to limit results to a given
inbox, we can use ->ALL to emulate and replace per-Inbox
xap15/[0-9] search indices.

With this change, the presence of "extindex.all.topdir" in the
$PI_CONFIG will cause the WWW code to use that extindex and
ignore per-inbox Xapian DBs in xap15/[0-9].

Unfortunately IMAP search still requires old per-inbox indices,
for now.  Mapping extindex Xapian docids to per-Inbox UIDs and
vice-versa is proving tricky.  Fortunately, IMAP search is
rarely used and optional.  The RFCs don't specify expensive
phrase search, either, so `indexlevel=medium' can be used in
per-inbox Xapian indices to save space.

For primarily WWW (and future JMAP) users; this should result in
significant disk space, FD, and page cache footprint savings for
large instances with many inboxes and many cross-posted
messages.
---
 MANIFEST                      |  1 +
 lib/PublicInbox/Config.pm     |  4 ++
 lib/PublicInbox/DummyInbox.pm |  2 +-
 lib/PublicInbox/ExtMsg.pm     |  2 +-
 lib/PublicInbox/ExtSearch.pm  |  1 +
 lib/PublicInbox/Inbox.pm      |  4 ++
 lib/PublicInbox/Isearch.pm    | 87 +++++++++++++++++++++++++++++++++++
 lib/PublicInbox/Mbox.pm       |  6 +--
 lib/PublicInbox/Search.pm     |  5 +-
 lib/PublicInbox/SearchView.pm | 10 ++--
 lib/PublicInbox/SolverGit.pm  |  2 +-
 lib/PublicInbox/WWW.pm        |  2 +-
 lib/PublicInbox/WwwStream.pm  |  2 +-
 lib/PublicInbox/WwwText.pm    |  2 +-
 t/extsearch.t                 | 25 +++++++++-
 15 files changed, 139 insertions(+), 16 deletions(-)
 create mode 100644 lib/PublicInbox/Isearch.pm

diff --git a/MANIFEST b/MANIFEST
index 946e4b8a..b39f63db 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -156,6 +156,7 @@ lib/PublicInbox/In2Tie.pm
 lib/PublicInbox/Inbox.pm
 lib/PublicInbox/InboxIdle.pm
 lib/PublicInbox/InboxWritable.pm
+lib/PublicInbox/Isearch.pm
 lib/PublicInbox/KQNotify.pm
 lib/PublicInbox/Linkify.pm
 lib/PublicInbox/Listener.pm
diff --git a/lib/PublicInbox/Config.pm b/lib/PublicInbox/Config.pm
index ba0ead6e..1844f8b2 100644
--- a/lib/PublicInbox/Config.pm
+++ b/lib/PublicInbox/Config.pm
@@ -477,6 +477,10 @@ EOF
 			push @$repo_objs, $repo if $repo;
 		}
 	}
+	if (my $es = ALL($self)) {
+		require PublicInbox::Isearch;
+		$ibx->{isrch} = PublicInbox::Isearch->new($ibx, $es);
+	}
 	$self->{-by_eidx_key}->{$ibx->eidx_key} = $ibx;
 }
 
diff --git a/lib/PublicInbox/DummyInbox.pm b/lib/PublicInbox/DummyInbox.pm
index 02426f13..981043ce 100644
--- a/lib/PublicInbox/DummyInbox.pm
+++ b/lib/PublicInbox/DummyInbox.pm
@@ -16,7 +16,7 @@ no warnings 'once';
 *max = \&uidvalidity;
 *query_xover = \&uid_range;
 *over = \&mm;
-*search = *unsubscribe_unlock =
+*isrch = *search = *unsubscribe_unlock =
 	*get_art = *description = *base_url = \&subscribe_unlock;
 
 1;
diff --git a/lib/PublicInbox/ExtMsg.pm b/lib/PublicInbox/ExtMsg.pm
index 2a0a3e46..2a579c1b 100644
--- a/lib/PublicInbox/ExtMsg.pm
+++ b/lib/PublicInbox/ExtMsg.pm
@@ -32,7 +32,7 @@ sub PARTIAL_MAX () { 100 }
 sub search_partial ($$) {
 	my ($ibx, $mid) = @_;
 	return if length($mid) < $MIN_PARTIAL_LEN;
-	my $srch = $ibx->search or return;
+	my $srch = $ibx->search or return; # NOT ->isrch, we already try ->ALL
 	my $opt = { limit => PARTIAL_MAX, mset => 2 };
 	my @try = ("m:$mid*");
 	my $chop = $mid;
diff --git a/lib/PublicInbox/ExtSearch.pm b/lib/PublicInbox/ExtSearch.pm
index 80455d8d..2a560935 100644
--- a/lib/PublicInbox/ExtSearch.pm
+++ b/lib/PublicInbox/ExtSearch.pm
@@ -128,5 +128,6 @@ no warnings 'once';
 *recent = \&PublicInbox::Inbox::recent;
 
 *max_git_epoch = *nntp_usable = *msg_by_path = \&mm; # undef
+*isrch = *search;
 
 1;
diff --git a/lib/PublicInbox/Inbox.pm b/lib/PublicInbox/Inbox.pm
index 58651687..52aece7c 100644
--- a/lib/PublicInbox/Inbox.pm
+++ b/lib/PublicInbox/Inbox.pm
@@ -201,6 +201,10 @@ sub search {
 	(eval { $srch->xdb }) ? $srch : undef;
 }
 
+# isrch is preferred for read-only interfaces if available since it
+# reduces kernel cache and FD overhead
+sub isrch { $_[0]->{isrch} // search($_[0]) }
+
 sub over {
 	$_[0]->{over} //= eval {
 		my $srch = $_[0]->{search} //= eval {
diff --git a/lib/PublicInbox/Isearch.pm b/lib/PublicInbox/Isearch.pm
new file mode 100644
index 00000000..0ab3b19a
--- /dev/null
+++ b/lib/PublicInbox/Isearch.pm
@@ -0,0 +1,87 @@
+# Copyright (C) 2020 all contributors <meta@public-inbox.org>
+# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
+
+# Provides everything the PublicInbox::Search object does;
+# but uses global ExtSearch (->ALL) with an eidx_key query to
+# emulate per-Inbox search using ->ALL.
+package PublicInbox::Isearch;
+use strict;
+use v5.10.1;
+use PublicInbox::ExtSearch;
+use PublicInbox::Search;
+
+sub new {
+	my (undef, $ibx, $es) = @_;
+	bless { es => $es, eidx_key => $ibx->eidx_key }, __PACKAGE__;
+}
+
+sub mset {
+	my ($self, $str, $opt) = @_;
+	$self->{es}->mset($str, { $opt ? %$opt : (),
+				eidx_key => $self->{eidx_key} });
+}
+
+sub _ibx_id ($) {
+	my ($self) = @_;
+	my $sth = $self->{es}->over->dbh->prepare_cached(<<'', undef, 1);
+SELECT ibx_id FROM inboxes WHERE eidx_key = ? LIMIT 1
+
+	$sth->execute($self->{eidx_key});
+	$sth->fetchrow_array //
+		die "E: `$self->{eidx_key}' not in $self->{es}->{topdir}\n";
+}
+
+sub mset_to_artnums {
+	my ($self, $mset) = @_;
+	my $docids = PublicInbox::Search::mset_to_artnums($self->{es}, $mset);
+	my $ibx_id = $self->{-ibx_id} //= _ibx_id($self);
+	my $qmarks = join(',', map { '?' } @$docids);
+	my $rows = $self->{es}->over->dbh->
+			selectall_arrayref(<<"", undef, $ibx_id, @$docids);
+SELECT docid,xnum FROM xref3 WHERE ibx_id = ? AND docid IN ($qmarks)
+
+	my $i = -1;
+	my %order = map { $_ => ++$i } @$docids;
+	my @xnums;
+	for my $row (@$rows) { # @row = ($docid, $xnum)
+		my $idx = delete($order{$row->[0]}) // next;
+		$xnums[$idx] = $row->[1];
+	}
+	if (scalar keys %order) {
+		warn "W: $self->{es}->{topdir} #",
+			join(', #', sort keys %order),
+			" not mapped to `$self->{eidx_key}'\n";
+		warn "W: $self->{es}->{topdir} may need to be reindexed\n";
+		@xnums = grep { defined } @xnums;
+	}
+	\@xnums;
+}
+
+sub mset_to_smsg {
+	my ($self, $ibx, $mset) = @_; # $ibx is a real inbox, not eidx
+	my $xnums = mset_to_artnums($self, $mset);
+	my $i = -1;
+	my %order = map { $_ => ++$i } @$xnums;
+	my $unordered = $ibx->over->get_all(@$xnums);
+	my @msgs;
+	for my $smsg (@$unordered) {
+		my $idx = delete($order{$smsg->{num}}) // do {
+			warn "W: $ibx->{inboxdir} #$smsg->{num}\n";
+			next;
+		};
+		$msgs[$idx] = $smsg;
+	}
+	if (scalar keys %order) {
+		warn "W: $ibx->{inboxdir} #",
+			join(', #', sort keys %order),
+			" no longer valid\n";
+		warn "W: $self->{es}->{topdir} may need to be reindexed\n";
+	}
+	wantarray ? ($mset->get_matches_estimated, \@msgs) : \@msgs;
+}
+
+sub has_threadid { 1 }
+
+sub help { $_[0]->{es}->help }
+
+1;
diff --git a/lib/PublicInbox/Mbox.pm b/lib/PublicInbox/Mbox.pm
index 22516998..19459150 100644
--- a/lib/PublicInbox/Mbox.pm
+++ b/lib/PublicInbox/Mbox.pm
@@ -218,7 +218,7 @@ sub results_cb {
 			return $smsg;
 		}
 		# refill result set
-		my $srch = $ctx->{-inbox}->search or return gone($ctx,'search');
+		my $srch = $ctx->{-inbox}->isrch or return gone($ctx, 'search');
 		my $mset = $srch->mset($ctx->{query}, $ctx->{qopts});
 		my $size = $mset->size or return;
 		$ctx->{qopts}->{offset} += $size;
@@ -240,7 +240,7 @@ sub results_thread_cb {
 		next if $over->expand_thread($ctx);
 
 		# refill result set
-		my $srch = $ctx->{-inbox}->search or return gone($ctx,'search');
+		my $srch = $ctx->{-inbox}->isrch or return gone($ctx, 'search');
 		my $mset = $srch->mset($ctx->{query}, $ctx->{qopts});
 		my $size = $mset->size or return;
 		$ctx->{qopts}->{offset} += $size;
@@ -253,7 +253,7 @@ sub mbox_all {
 	my ($ctx, $q) = @_;
 	my $q_string = $q->{'q'};
 	return mbox_all_ids($ctx) if $q_string !~ /\S/;
-	my $srch = $ctx->{-inbox}->search or
+	my $srch = $ctx->{-inbox}->isrch or
 		return PublicInbox::WWW::need($ctx, 'Search');
 	my $over = $ctx->{-inbox}->over or
 		return PublicInbox::WWW::need($ctx, 'Overview');
diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index 7e72913f..ba239255 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -55,7 +55,7 @@ use constant {
 use PublicInbox::Smsg;
 use PublicInbox::Over;
 our $QP_FLAGS;
-our %X = map { $_ => 0 } qw(BoolWeight Database Enquire QueryParser Stem);
+our %X = map { $_ => 0 } qw(BoolWeight Database Enquire QueryParser Stem Query);
 our $Xap; # 'Search::Xapian' or 'Xapian'
 our $NVRP; # '$Xap::'.('NumberValueRangeProcessor' or 'NumberRangeProcessor')
 our $ENQ_ASCENDING;
@@ -331,6 +331,9 @@ sub has_threadid ($) {
 sub _enquire_once { # retry_reopen callback
 	my ($self, $query, $opts) = @_;
 	my $xdb = xdb($self);
+	if (defined(my $eidx_key = $opts->{eidx_key})) {
+		$query = $X{Query}->new(OP_FILTER(), $query, 'O'.$eidx_key);
+	}
 	my $enquire = $X{Enquire}->new($xdb);
 	$enquire->set_query($query);
 	$opts ||= {};
diff --git a/lib/PublicInbox/SearchView.pm b/lib/PublicInbox/SearchView.pm
index 26426c01..f3c96126 100644
--- a/lib/PublicInbox/SearchView.pm
+++ b/lib/PublicInbox/SearchView.pm
@@ -30,7 +30,7 @@ sub mbox_results {
 
 sub sres_top_html {
 	my ($ctx) = @_;
-	my $srch = $ctx->{-inbox}->search or
+	my $srch = $ctx->{-inbox}->isrch or
 		return PublicInbox::WWW::need($ctx, 'Search');
 	my $q = PublicInbox::SearchQuery->new($ctx->{qp});
 	my $x = $q->{x};
@@ -95,7 +95,7 @@ sub mset_summary {
 	my $res = \($ctx->{-html_tip});
 	my $ibx = $ctx->{-inbox};
 	my $obfs_ibx = $ibx->{obfuscate} ? $ibx : undef;
-	my @nums = @{$ibx->search->mset_to_artnums($mset)};
+	my @nums = @{$ibx->isrch->mset_to_artnums($mset)};
 	my %num2msg = map { $_->{num} => $_ } @{$ibx->over->get_all(@nums)};
 	my ($min, $max);
 
@@ -201,7 +201,7 @@ sub search_nav_top {
 	}
 	my $A = $q->qs_html(x => 'A', r => undef);
 	$rv .= qq{|<a\nhref="?$A">Atom feed</a>]};
-	if ($ctx->{-inbox}->search->has_threadid) {
+	if ($ctx->{-inbox}->isrch->has_threadid) {
 		$rv .= qq{\n\t\t\tdownload mbox.gz: } .
 			# we set name=z w/o using it since it seems required for
 			# lynx (but works fine for w3m).
@@ -288,7 +288,7 @@ sub mset_thread {
 	my ($ctx, $mset, $q) = @_;
 	my $ibx = $ctx->{-inbox};
 	my @pct = map { get_pct($_) } $mset->items;
-	my $msgs = $ibx->search->mset_to_smsg($ibx, $mset);
+	my $msgs = $ibx->isrch->mset_to_smsg($ibx, $mset);
 	my $i = 0;
 	$_->{pct} = $pct[$i++] for @$msgs;
 	my $r = $q->{r};
@@ -353,7 +353,7 @@ sub ctx_prepare {
 
 sub adump {
 	my ($cb, $mset, $q, $ctx) = @_;
-	$ctx->{ids} = $ctx->{-inbox}->search->mset_to_artnums($mset);
+	$ctx->{ids} = $ctx->{-inbox}->isrch->mset_to_artnums($mset);
 	$ctx->{search_query} = $q; # used by WwwAtomStream::atom_header
 	PublicInbox::WwwAtomStream->response($ctx, 200, \&adump_i);
 }
diff --git a/lib/PublicInbox/SolverGit.pm b/lib/PublicInbox/SolverGit.pm
index 83f7a4ee..a53f28b1 100644
--- a/lib/PublicInbox/SolverGit.pm
+++ b/lib/PublicInbox/SolverGit.pm
@@ -216,7 +216,7 @@ sub filename_query ($) {
 
 sub find_smsgs ($$$) {
 	my ($self, $ibx, $want) = @_;
-	my $srch = $ibx->search or return;
+	my $srch = $ibx->isrch or return;
 
 	my $post = $want->{oid_b} or die 'BUG: no {oid_b}';
 	$post =~ /\A[a-f0-9]+\z/ or die "BUG: oid_b not hex: $post";
diff --git a/lib/PublicInbox/WWW.pm b/lib/PublicInbox/WWW.pm
index cdbcff1e..fc208816 100644
--- a/lib/PublicInbox/WWW.pm
+++ b/lib/PublicInbox/WWW.pm
@@ -462,7 +462,7 @@ sub serve_git {
 sub mbox_results {
 	my ($ctx) = @_;
 	if ($ctx->{env}->{QUERY_STRING} =~ /(?:\A|[&;])q=/) {
-		$ctx->{-inbox}->search or return need($ctx, 'search');
+		$ctx->{-inbox}->isrch or return need($ctx, 'search');
 		require PublicInbox::SearchView;
 		return PublicInbox::SearchView::mbox_results($ctx);
 	}
diff --git a/lib/PublicInbox/WwwStream.pm b/lib/PublicInbox/WwwStream.pm
index 638f4e27..2527b8ed 100644
--- a/lib/PublicInbox/WwwStream.pm
+++ b/lib/PublicInbox/WwwStream.pm
@@ -54,7 +54,7 @@ sub html_top ($) {
 			qq(<a\nhref="$color">color</a> / ).
 			qq(<a\nhref=#mirror>mirror</a> / ).
 			qq(<a\nhref="$atom">Atom feed</a>);
-	if ($ibx->search) {
+	if ($ibx->isrch) {
 		my $q_val = delete($ctx->{-q_value_html}) // '';
 		$q_val = qq(\nvalue="$q_val") if $q_val ne '';
 		# XXX gross, for SearchView.pm
diff --git a/lib/PublicInbox/WwwText.pm b/lib/PublicInbox/WwwText.pm
index 04c9b1c4..8cc818df 100644
--- a/lib/PublicInbox/WwwText.pm
+++ b/lib/PublicInbox/WwwText.pm
@@ -250,7 +250,7 @@ EOF
 
 	# n.b. we use the Xapian DB for any regeneratable,
 	# order-of-arrival-independent data.
-	my $srch = $ibx->search;
+	my $srch = $ibx->isrch;
 	if ($srch) {
 		$$txt .= <<EOF;
 search
diff --git a/t/extsearch.t b/t/extsearch.t
index 2b8b88ea..97786b21 100644
--- a/t/extsearch.t
+++ b/t/extsearch.t
@@ -130,9 +130,32 @@ my $es = PublicInbox::ExtSearch->new("$home/extindex");
 	is($mset->size, 1, 'new message found');
 	$mset = $es->mset('b:"test message"');
 	is($mset->size, 1, 'old message found');
-
 	delete @$es{qw(git over xdb)}; # fork preparation
 
+	my $pi_cfg = PublicInbox::Config->new;
+	$pi_cfg->fill_all;
+	is(scalar($pi_cfg->ALL->mset('s:Testing')->items), 2,
+		'2 results in ->ALL');
+	my $res = {};
+	my $nr = 0;
+	$pi_cfg->each_inbox(sub {
+		$nr++;
+		my ($ibx) = @_;
+		local $SIG{__WARN__} = sub {}; # FIXME support --reindex
+		my $mset = $ibx->isrch->mset('s:Testing');
+		$res->{$ibx->eidx_key} = $ibx->isrch->mset_to_smsg($ibx, $mset);
+	});
+	is($nr, 2, 'two inboxes');
+	my $exp = {};
+	for my $v (qw(v1 v2)) {
+		my $ibx = $pi_cfg->lookup_newsgroup("$v.example");
+		my $smsg = $ibx->over->get_art(1);
+		$smsg->psgi_cull;
+		$exp->{"$v.example"} = [ $smsg ];
+	}
+	is_deeply($res, $exp, 'isearch limited results');
+	$pi_cfg = $res = $exp = undef;
+
 	open my $rmfh, '+>', undef or BAIL_OUT $!;
 	$rmfh->autoflush(1);
 	print $rmfh $eml2->as_string or BAIL_OUT $!;

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH 3/2] imap: support isearch and reduce Xapian queries
  2020-12-05 10:11 [PATCH 0/2] isearch: per-inbox search using ->ALL Eric Wong
  2020-12-05 10:11 ` [PATCH 1/2] inbox: simplify ->search and callers Eric Wong
  2020-12-05 10:11 ` [PATCH 2/2] isearch: emulate per-inbox search with ->ALL Eric Wong
@ 2020-12-05 11:10 ` Eric Wong
  2020-12-05 22:26   ` [PATCH 4/2] search: reinstate "uid:" internal search prefix Eric Wong
  2020-12-10  0:41 ` One, All; Some? [was: isearch: per-inbox search using ->ALL] Eric Wong
  3 siblings, 1 reply; 6+ messages in thread
From: Eric Wong @ 2020-12-05 11:10 UTC (permalink / raw)
  To: meta

Since IMAP search (either with Isearch or traditional per-Inbox
search) only returns UIDs, we can safely set the limit to the
UID slice size(*).  With isearch, we can also trust the Xapian
result to fit any docid range we specify.

Limiting Xapian results to 1000 was making ->ALL docid <=>
per-Inbox UID impossible since results could overlap between
ranges unpredictably.

Finally, we can map the ->ALL docids into per-Inbox UIDs and
show them to the client in the UID order of the Inbox, not the
docid order of the ->ALL extindex.

This also lets us get rid of the "uid:" query parser prefix
and use the Xapian::Query API directly to reduce our search
prefix footprint.

For mbox.gz downloads in WWW, we'll also make a best effort to
preserve the order from the Inbox, not the order of extindex;
though it's possible large result sets can have non-overlapping
windows.

(*) by definition, UID slice size is a "safe" value which
    shouldn't OOM either the server or clients.
---
 lib/PublicInbox/IMAP.pm    | 41 +++++++----------------------
 lib/PublicInbox/Isearch.pm | 54 +++++++++++++++++++++++++++++++++-----
 lib/PublicInbox/Mbox.pm    |  6 ++---
 lib/PublicInbox/Search.pm  |  8 +++++-
 4 files changed, 67 insertions(+), 42 deletions(-)

diff --git a/lib/PublicInbox/IMAP.pm b/lib/PublicInbox/IMAP.pm
index 9599f494..f123eb01 100644
--- a/lib/PublicInbox/IMAP.pm
+++ b/lib/PublicInbox/IMAP.pm
@@ -1122,33 +1122,6 @@ sub parse_query ($$) {
 	$q;
 }
 
-sub refill_xap ($$$$) {
-	my ($self, $uids, $range_info, $q) = @_;
-	my ($beg, $end) = @$range_info;
-	my $srch = $self->{ibx}->search;
-	my $opt = { mset => 2, limit => 1000 };
-	my $mset = $srch->mset("$q uid:$beg..$end", $opt);
-	@$uids = @{$srch->mset_to_artnums($mset)};
-	if (@$uids) {
-		$range_info->[0] = $uids->[-1] + 1; # update $beg
-		return; # possibly more
-	}
-	0; # all done
-}
-
-sub search_xap_range { # long_response
-	my ($self, $tag, $q, $range_info, $want_msn) = @_;
-	my $uids = [];
-	if (defined(my $err = refill_xap($self, $uids, $range_info, $q))) {
-		$err ||= 'OK Search done';
-		$self->write("\r\n$tag $err\r\n");
-		return;
-	}
-	msn_convert($self, $uids) if $want_msn;
-	$self->msg_more(join(' ', '', @$uids));
-	1; # more
-}
-
 sub search_common {
 	my ($self, $tag, $query, $want_msn) = @_;
 	my $ibx = $self->{ibx} or return "$tag BAD No mailbox selected\r\n";
@@ -1160,11 +1133,17 @@ sub search_common {
 		long_response($self, \&search_uid_range,
 				$tag, $sql, $range_info, $want_msn);
 	} elsif ($q = $q->{xap}) {
-		$self->{ibx}->search or
+		my $srch = $self->{ibx}->isrch or
 			return "$tag BAD search not available for mailbox\r\n";
-		$self->msg_more('* SEARCH');
-		long_response($self, \&search_xap_range,
-				$tag, $q, $range_info, $want_msn);
+		my $opt = {
+			mset => 2,
+			limit => UID_SLICE,
+			uid_range => $range_info
+		};
+		my $mset = $srch->mset($q, $opt);
+		my $uids = $srch->mset_to_artnums($mset, $opt);
+		msn_convert($self, $uids) if $want_msn;
+		"* SEARCH @$uids\r\n$tag OK Search done\r\n";
 	} else {
 		"$tag BAD Error\r\n";
 	}
diff --git a/lib/PublicInbox/Isearch.pm b/lib/PublicInbox/Isearch.pm
index 0ab3b19a..8a1f257a 100644
--- a/lib/PublicInbox/Isearch.pm
+++ b/lib/PublicInbox/Isearch.pm
@@ -15,12 +15,6 @@ sub new {
 	bless { es => $es, eidx_key => $ibx->eidx_key }, __PACKAGE__;
 }
 
-sub mset {
-	my ($self, $str, $opt) = @_;
-	$self->{es}->mset($str, { $opt ? %$opt : (),
-				eidx_key => $self->{eidx_key} });
-}
-
 sub _ibx_id ($) {
 	my ($self) = @_;
 	my $sth = $self->{es}->over->dbh->prepare_cached(<<'', undef, 1);
@@ -31,11 +25,57 @@ SELECT ibx_id FROM inboxes WHERE eidx_key = ? LIMIT 1
 		die "E: `$self->{eidx_key}' not in $self->{es}->{topdir}\n";
 }
 
+
+sub mset {
+	my ($self, $str, $opt) = @_;
+	my %opt = $opt ? %$opt : ();
+	$opt{eidx_key} = $self->{eidx_key};
+	if (my $uid_range = $opt{uid_range}) {
+		my ($beg, $end) = @$uid_range;
+		my $ibx_id = $self->{-ibx_id} //= _ibx_id($self);
+		my $dbh = $self->{es}->{over}->dbh;
+		my $sth = $dbh->prepare_cached(<<'', undef, 1);
+SELECT MIN(docid) FROM xref3 WHERE ibx_id = ? AND xnum >= ? AND xnum <= ?
+
+		$sth->execute($ibx_id, $beg, $end);
+		my @r = ($sth->fetchrow_array);
+
+		$sth = $dbh->prepare_cached(<<'', undef, 1);
+SELECT MAX(docid) FROM xref3 WHERE ibx_id = ? AND xnum >= ? AND xnum <= ?
+
+		$sth->execute($ibx_id, $beg, $end);
+		$r[1] = $sth->fetchrow_array;
+		if (defined($r[1]) && defined($r[0])) {
+			$opt{limit} = $r[1] - $r[0] + 1;
+		} else {
+			$r[1] //= 0xffffffff;
+			$r[0] //= 0;
+		}
+		$opt{uid_range} = \@r;
+	}
+	$self->{es}->mset($str, \%opt);
+}
+
 sub mset_to_artnums {
-	my ($self, $mset) = @_;
+	my ($self, $mset, $opt) = @_;
 	my $docids = PublicInbox::Search::mset_to_artnums($self->{es}, $mset);
 	my $ibx_id = $self->{-ibx_id} //= _ibx_id($self);
 	my $qmarks = join(',', map { '?' } @$docids);
+	if ($opt && ($opt->{mset} // 0) == 2) { # opt->{mset} = 2 was used
+		my $range = '';
+		my @r;
+		if (my $r = $opt->{uid_range}) {
+			$range = 'AND xnum >= ? AND xnum <= ?';
+			@r = @$r;
+		}
+		my $rows = $self->{es}->over->dbh->
+			selectall_arrayref(<<"", undef, $ibx_id, @$docids, @r);
+SELECT xnum FROM xref3 WHERE ibx_id = ? AND docid IN ($qmarks) $range
+ORDER BY xnum ASC
+
+		return [ map { $_->[0] } @$rows ];
+	}
+
 	my $rows = $self->{es}->over->dbh->
 			selectall_arrayref(<<"", undef, $ibx_id, @$docids);
 SELECT docid,xnum FROM xref3 WHERE ibx_id = ? AND docid IN ($qmarks)
diff --git a/lib/PublicInbox/Mbox.pm b/lib/PublicInbox/Mbox.pm
index 19459150..0df31e7f 100644
--- a/lib/PublicInbox/Mbox.pm
+++ b/lib/PublicInbox/Mbox.pm
@@ -222,7 +222,7 @@ sub results_cb {
 		my $mset = $srch->mset($ctx->{query}, $ctx->{qopts});
 		my $size = $mset->size or return;
 		$ctx->{qopts}->{offset} += $size;
-		$ctx->{ids} = $srch->mset_to_artnums($mset);
+		$ctx->{ids} = $srch->mset_to_artnums($mset, $ctx->{qopts});
 	}
 }
 
@@ -244,7 +244,7 @@ sub results_thread_cb {
 		my $mset = $srch->mset($ctx->{query}, $ctx->{qopts});
 		my $size = $mset->size or return;
 		$ctx->{qopts}->{offset} += $size;
-		$ctx->{ids} = $srch->mset_to_artnums($mset);
+		$ctx->{ids} = $srch->mset_to_artnums($mset, $ctx->{qopts});
 	}
 
 }
@@ -265,7 +265,7 @@ sub mbox_all {
 			return [404, [qw(Content-Type text/plain)],
 				["No results found\n"]];
 	$ctx->{query} = $q_string;
-	$ctx->{ids} = $srch->mset_to_artnums($mset);
+	$ctx->{ids} = $srch->mset_to_artnums($mset, $qopts);
 	require PublicInbox::MboxGz;
 	my $fn;
 	if ($q->{t} && $srch->has_threadid) {
diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index ba239255..7785d483 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -90,6 +90,7 @@ sub load_xapian () {
 		$ENQ_ASCENDING = $x eq 'Xapian' ?
 				1 : Search::Xapian::ENQ_ASCENDING();
 
+		*sortable_serialise = $x.'::sortable_serialise';
 		# n.b. FLAG_PURE_NOT is expensive not suitable for a public
 		# website as it could become a denial-of-service vector
 		# FLAG_PHRASE also seems to cause performance problems chert
@@ -334,6 +335,12 @@ sub _enquire_once { # retry_reopen callback
 	if (defined(my $eidx_key = $opts->{eidx_key})) {
 		$query = $X{Query}->new(OP_FILTER(), $query, 'O'.$eidx_key);
 	}
+	if (defined(my $uid_range = $opts->{uid_range})) {
+		my $range = $X{Query}->new(OP_VALUE_RANGE(), UID,
+					sortable_serialise($uid_range->[0]),
+					sortable_serialise($uid_range->[1]));
+		$query = $X{Query}->new(OP_FILTER(), $query, $range);
+	}
 	my $enquire = $X{Enquire}->new($xdb);
 	$enquire->set_query($query);
 	$opts ||= {};
@@ -389,7 +396,6 @@ sub qparse_new ($) {
 	# for IMAP, undocumented for WWW and may be split off go away
 	$cb->($qp, $NVRP->new(BYTES, 'bytes:'));
 	$cb->($qp, $NVRP->new(TS, 'ts:'));
-	$cb->($qp, $NVRP->new(UID, 'uid:'));
 
 	while (my ($name, $prefix) = each %bool_pfx_external) {
 		$qp->add_boolean_prefix($name, $_) foreach split(/ /, $prefix);

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH 4/2] search: reinstate "uid:" internal search prefix
  2020-12-05 11:10 ` [PATCH 3/2] imap: support isearch and reduce Xapian queries Eric Wong
@ 2020-12-05 22:26   ` Eric Wong
  0 siblings, 0 replies; 6+ messages in thread
From: Eric Wong @ 2020-12-05 22:26 UTC (permalink / raw)
  To: meta

User-supplied queries (via PublicInbox::IMAPsearchqp) may
restrict messages to certain UID ranges in addition to the
limits we impose ourselves for mailbox slices.  So we'll
continue to ask Xapian::QueryParser to "uid:" numeric ranges.

Fixes: 4b551c884a648b45 ("imap: support isearch and reduce Xapian queries")
---
 lib/PublicInbox/Search.pm | 1 +
 1 file changed, 1 insertion(+)

diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index 7785d483..803914b0 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -396,6 +396,7 @@ sub qparse_new ($) {
 	# for IMAP, undocumented for WWW and may be split off go away
 	$cb->($qp, $NVRP->new(BYTES, 'bytes:'));
 	$cb->($qp, $NVRP->new(TS, 'ts:'));
+	$cb->($qp, $NVRP->new(UID, 'uid:'));
 
 	while (my ($name, $prefix) = each %bool_pfx_external) {
 		$qp->add_boolean_prefix($name, $_) foreach split(/ /, $prefix);

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* One, All; Some? [was: isearch: per-inbox search using ->ALL]
  2020-12-05 10:11 [PATCH 0/2] isearch: per-inbox search using ->ALL Eric Wong
                   ` (2 preceding siblings ...)
  2020-12-05 11:10 ` [PATCH 3/2] imap: support isearch and reduce Xapian queries Eric Wong
@ 2020-12-10  0:41 ` Eric Wong
  3 siblings, 0 replies; 6+ messages in thread
From: Eric Wong @ 2020-12-10  0:41 UTC (permalink / raw)
  To: meta

Eric Wong <e@80x24.org> wrote:
> Now, if extindex.all.topdir is configured, per-Inbox Xapian
> search will use ->ALL, and not xap15/[0-9] (over.sqlite3 is
> still required for each inbox).

Just putting down some thoughts down:

This is done:

	[extindex "all"]	-> all (obviously)
		isearch		-> one


This is somewhere in the distant horizon:

		???		-> some (2+ inboxes, but not all of them)

> Unfortunately, IMAP still needs much work and still uses per-inbox
> Xapian indices.
> 
> Not sure about JMAP, but it looks like it'll be supportable
> since we won't have to worry about MSN<=>UID mappings and
> limiting mailbox size to accomodate (non-existent :P) MUA support

isearch was simple because we can use `num' directly from the per-inbox
over.sqlite3 DB.

For combining 2+ inboxes together ("???"), there still needs to
be a new over.sqlite3 file which provides IMAP UIDs.  Of course
Xapian is the storage hog, so "???" will use what's already in
"all" and not its own Xapian index.

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2020-12-10  0:41 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-12-05 10:11 [PATCH 0/2] isearch: per-inbox search using ->ALL Eric Wong
2020-12-05 10:11 ` [PATCH 1/2] inbox: simplify ->search and callers Eric Wong
2020-12-05 10:11 ` [PATCH 2/2] isearch: emulate per-inbox search with ->ALL Eric Wong
2020-12-05 11:10 ` [PATCH 3/2] imap: support isearch and reduce Xapian queries Eric Wong
2020-12-05 22:26   ` [PATCH 4/2] search: reinstate "uid:" internal search prefix Eric Wong
2020-12-10  0:41 ` One, All; Some? [was: isearch: per-inbox search using ->ALL] Eric Wong

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).