unofficial mirror of meta@public-inbox.org
* [PATCH 00/14] IT'S ALIVE! www loads cindex join data
@ 2023-11-28 14:56 Eric Wong
  2023-11-28 14:56 ` [PATCH 01/14] test_common: create_*: detect changes all parameters Eric Wong
                   ` (14 more replies)
  0 siblings, 15 replies; 19+ messages in thread
From: Eric Wong @ 2023-11-28 14:56 UTC (permalink / raw)
  To: meta

8/14 is the killer one which actually makes the cindex data
useful for WWW and powering solver.  Keep in mind, I've had
to cap solver at 3 coderepos as a temporary measure since
there's a lot of "weak" joins we should be weeding out.

More documentation coming, but cindex joins are very much
a fuzzy thing which will have to deal with false positives
and such, so figuring out a sane scoring scheme would
make sense...

Fortunately, --join=aggressive,reset only takes ~1 hour for me,
so probably 1/3 that on modern hardware.  Incremental
`-cindex --join' (no suboptions) usually takes <5 minutes if
done frequently.

New performance problem: solver could definitely be smarter
about dealing with common roots/groups.  For the longest time,
I've only had 1 coderepo per inbox; having hundreds is wacky.

Actual searching against the cindex isn't done, yet, but
that's kinda straightforward.

Eric Wong (14):
  test_common: create_*: detect changes all parameters
  t/cindex*: require SCM_RIGHTS for these tests
  codesearch: eliminate redundant substitutions
  solver: schedule cleanup after synchronous git->check
  xap_helper.h: move cindex endpoints to separate file
  xap_helper: implement mset endpoint for WWW, IMAP, etc...
  hval: use File::Spec to make relative paths for href
  www: load and use cindex join data
  git: speed up ->git_path for non-worktrees
  cindex: require `-g GIT_DIR' or `-r PROJECT_ROOT'
  git: speed up Git->new by 5% or so
  admin: resolve_git_dir respects symlinks
  cindex: extra quit checks
  www: start working on a repo listing

 Documentation/public-inbox-cindex.pod |   2 +-
 MANIFEST                              |   3 +
 Makefile.PL                           |   8 +-
 lib/PublicInbox/Admin.pm              |  25 +-
 lib/PublicInbox/CodeSearch.pm         | 162 ++++++++++-
 lib/PublicInbox/CodeSearchIdx.pm      |  52 ++--
 lib/PublicInbox/Config.pm             |  39 ++-
 lib/PublicInbox/Git.pm                |  27 +-
 lib/PublicInbox/Hval.pm               |  12 +-
 lib/PublicInbox/RepoList.pm           |  39 +++
 lib/PublicInbox/Search.pm             |  42 +++
 lib/PublicInbox/SearchIdx.pm          |  10 +-
 lib/PublicInbox/SolverGit.pm          |   9 +-
 lib/PublicInbox/TestCommon.pm         |  35 ++-
 lib/PublicInbox/View.pm               |   7 +-
 lib/PublicInbox/WWW.pm                |   1 +
 lib/PublicInbox/WwwCoderepo.pm        |  44 ++-
 lib/PublicInbox/WwwStream.pm          |  11 +-
 lib/PublicInbox/WwwText.pm            |  19 +-
 lib/PublicInbox/XapHelper.pm          |  51 ++--
 lib/PublicInbox/XapHelperCxx.pm       |  14 +-
 lib/PublicInbox/xap_helper.h          | 379 +++++++-------------------
 lib/PublicInbox/xh_cidx.h             | 244 +++++++++++++++++
 lib/PublicInbox/xh_mset.h             |  96 +++++++
 script/public-inbox-cindex            |  38 ++-
 t/admin.t                             |  12 +
 t/cindex-join.t                       |   9 +-
 t/cindex.t                            |  91 ++++++-
 t/xap_helper.t                        |  53 +++-
 xt/solver.t                           |   3 +-
 30 files changed, 1111 insertions(+), 426 deletions(-)
 create mode 100644 lib/PublicInbox/RepoList.pm
 create mode 100644 lib/PublicInbox/xh_cidx.h
 create mode 100644 lib/PublicInbox/xh_mset.h


* [PATCH 01/14] test_common: create_*: detect changes all parameters
  2023-11-28 14:56 [PATCH 00/14] IT'S ALIVE! www loads cindex join data Eric Wong
@ 2023-11-28 14:56 ` Eric Wong
  2023-11-28 14:56 ` [PATCH 02/14] t/cindex*: require SCM_RIGHTS for these tests Eric Wong
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 19+ messages in thread
From: Eric Wong @ 2023-11-28 14:56 UTC (permalink / raw)
  To: meta

Data::Dumper+B::Deparse seems fast enough to generate cache keys
with, so this makes updating and developing tests easier (as
opposed to forcing the developer to change the identifier).  The
main downside is we'll have to deal with cache expiration, but
"make clean" seems overly aggressive already (it keeps blowing
away the clones made by t/cindex-join.t :<)
---
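A minimal sketch of the cache-key idea above, not the code in this
patch: `param_key' is a hypothetical stand-in for my_sum, and
Digest::SHA is assumed in place of PublicInbox::SHA.  It only shows
that a key built this way ignores hash ordering but changes when
the body of a callback changes.

```perl
use strict;
use warnings;
use Data::Dumper;
use Digest::SHA qw(sha256_hex);

# hypothetical helper: short key derived from arbitrary parameters,
# with coderefs turned back into source text by B::Deparse
sub param_key {
	my $d = Data::Dumper->new([ @_ ]);
	$d->$_(1) for qw(Deparse Sortkeys Terse);
	substr(sha256_hex($d->Dump), 0, 8);
}

my %opt = (indexlevel => 'basic', version => 2);
my $k1 = param_key('master', sub { $_[0]->add('a') }, \%opt);
# same values, different hash literal order and a distinct coderef
my $k2 = param_key('master', sub { $_[0]->add('a') },
		{ version => 2, indexlevel => 'basic' });
# only the callback body differs
my $k3 = param_key('master', sub { $_[0]->add('b') }, \%opt);

print 'same params: ', ($k1 eq $k2 ? 'same key' : 'new key'), "\n";
print 'edited sub:  ', ($k1 eq $k3 ? 'same key' : 'new key'), "\n";
```

Values captured by a closure are not part of the deparsed text, so
a sketch like this only notices edits to the source of the callback
and to the parameters passed in explicitly.
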
 lib/PublicInbox/TestCommon.pm | 35 +++++++++++++++++++++--------------
 1 file changed, 21 insertions(+), 14 deletions(-)

diff --git a/lib/PublicInbox/TestCommon.pm b/lib/PublicInbox/TestCommon.pm
index 361a2356..8e7eb950 100644
--- a/lib/PublicInbox/TestCommon.pm
+++ b/lib/PublicInbox/TestCommon.pm
@@ -793,6 +793,19 @@ our %COMMIT_ENV = (
 	GIT_COMMITTER_EMAIL => 'c@example.com',
 );
 
+# for memoizing based on coderefs and various create_* params
+sub my_sum {
+	require PublicInbox::SHA;
+	require Data::Dumper;
+	my $d = Data::Dumper->new(\@_);
+	$d->$_(1) for qw(Deparse Sortkeys Terse);
+	my @l = split /\n/s, $d->Dump;
+	@l = grep !/\$\^H\{.+?[A-Z]+\(0x[0-9a-f]+\)/, @l; # autodie addresses
+	my @addr = grep /[A-Za-z]+\(0x[0-9a-f]+\)/, @l;
+	xbail 'undumpable addresses: ', \@addr if @addr;
+	substr PublicInbox::SHA::sha256_hex(join('', @l)), 0, 8;
+}
+
 sub create_coderepo ($$;@) {
 	my $ident = shift;
 	my $cb = pop;
@@ -801,15 +814,12 @@ sub create_coderepo ($$;@) {
 	require PublicInbox::Import;
 	my ($base) = ($0 =~ m!\b([^/]+)\.[^\.]+\z!);
 	my ($db) = (PublicInbox::Import::default_branch() =~ m!([^/]+)\z!);
-	my $dir = "t/data-gen/$base.$ident-$db";
-	my $new = !-d $dir;
-	if ($new && !CORE::mkdir($dir)) {
-		my $err = $!;
-		-d $dir or xbail "mkdir($dir): $err";
-	}
+	my $tmpdir = delete $opt{tmpdir};
+	my $dir = "t/data-gen/$base.$ident-".my_sum($db, $cb, \%opt);
+	require File::Path;
+	my $new = File::Path::make_path($dir);
 	my $lk = PublicInbox::Lock->new("$dir/creat.lock");
 	my $scope = $lk->lock_for_scope;
-	my $tmpdir = delete $opt{tmpdir};
 	if (!-f "$dir/creat.stamp") {
 		opendir(my $dfh, '.');
 		chdir($dir);
@@ -832,12 +842,10 @@ sub create_inbox ($;@) {
 	require PublicInbox::Import;
 	my ($base) = ($0 =~ m!\b([^/]+)\.[^\.]+\z!);
 	my ($db) = (PublicInbox::Import::default_branch() =~ m!([^/]+)\z!);
-	my $dir = "t/data-gen/$base.$ident-$db";
-	my $new = !-d $dir;
-	if ($new && !mkdir($dir)) {
-		my $err = $!;
-		-d $dir or xbail "mkdir($dir): $err";
-	}
+	my $tmpdir = delete $opt{tmpdir};
+	my $dir = "t/data-gen/$base.$ident-".my_sum($db, $cb, \%opt);
+	require File::Path;
+	my $new = File::Path::make_path($dir);
 	my $lk = PublicInbox::Lock->new("$dir/creat.lock");
 	$opt{inboxdir} = File::Spec->rel2abs($dir);
 	$opt{name} //= $ident;
@@ -846,7 +854,6 @@ sub create_inbox ($;@) {
 	$pre_cb->($dir) if $pre_cb && $new;
 	$opt{-no_fsync} = 1;
 	my $no_gc = delete $opt{-no_gc};
-	my $tmpdir = delete $opt{tmpdir};
 	my $addr = $opt{address} // [];
 	$opt{-primary_address} //= $addr->[0] // "$ident\@example.com";
 	my $parallel = delete($opt{importer_parallel}) // 0;


* [PATCH 02/14] t/cindex*: require SCM_RIGHTS for these tests
  2023-11-28 14:56 [PATCH 00/14] IT'S ALIVE! www loads cindex join data Eric Wong
  2023-11-28 14:56 ` [PATCH 01/14] test_common: create_*: detect changes all parameters Eric Wong
@ 2023-11-28 14:56 ` Eric Wong
  2024-01-29 21:23   ` [PATCH 0/2] pure Perl sendmsg/recvmsg on *BSD Eric Wong
  2023-11-28 14:56 ` [PATCH 03/14] codesearch: eliminate redundant substitutions Eric Wong
                   ` (12 subsequent siblings)
  14 siblings, 1 reply; 19+ messages in thread
From: Eric Wong @ 2023-11-28 14:56 UTC (permalink / raw)
  To: meta

Code search will require SCM_RIGHTS, and Inline::C on BSDs
probably isn't too onerous a dependency for new features as
all the ones I've tested have it packaged.

Furthermore, requiring SCM_RIGHTS isn't far off since OpenBSD's
Perl is patched to route the `syscall' perlop through libc[1],
while NetBSD[2] and FreeBSD[3] actually do strive for backwards
compatibility.  We'd just need to use the numbers and not rely
on syscall.ph shipped with Perl since the macro names themselves
are unstable.

[1] https://cvsweb.openbsd.org/src/gnu/usr.bin/perl/gen_syscall_emulator.pl
[2] https://www.netbsd.org/docs/internals/en/chap-processes.html#syscall_versioning
[3] https://wiki.freebsd.org/AddingSyscalls#Backward_compatibily
---
 t/cindex-join.t | 2 +-
 t/cindex.t      | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/t/cindex-join.t b/t/cindex-join.t
index 2836eb6c..ac90cd64 100644
--- a/t/cindex-join.t
+++ b/t/cindex-join.t
@@ -12,7 +12,7 @@ use autodie;
 use File::Spec;
 $ENV{TEST_REMOTE_JOIN} or plan skip_all => 'TEST_REMOTE_JOIN unset';
 local $ENV{TAIL_ALL} = $ENV{TAIL_ALL} // 1; # while features are unstable
-require_mods(qw(json Xapian DBD::SQLite));
+require_mods(qw(json Xapian DBD::SQLite +SCM_RIGHTS));
 my @code = qw(https://80x24.org/mwrap-perl.git
 		https://80x24.org/mwrap.git);
 my @inboxes = qw(https://80x24.org/mwrap-public 2 inbox.comp.lang.ruby.mwrap
diff --git a/t/cindex.t b/t/cindex.t
index 1a9e564a..261945bf 100644
--- a/t/cindex.t
+++ b/t/cindex.t
@@ -6,7 +6,7 @@ use PublicInbox::TestCommon;
 use Cwd qw(getcwd abs_path);
 use List::Util qw(sum);
 use autodie qw(close open rename);
-require_mods(qw(json Xapian));
+require_mods(qw(json Xapian +SCM_RIGHTS));
 use_ok 'PublicInbox::CodeSearchIdx';
 use PublicInbox::Import;
 my ($tmp, $for_destroy) = tmpdir();


* [PATCH 03/14] codesearch: eliminate redundant substitutions
  2023-11-28 14:56 [PATCH 00/14] IT'S ALIVE! www loads cindex join data Eric Wong
  2023-11-28 14:56 ` [PATCH 01/14] test_common: create_*: detect changes all parameters Eric Wong
  2023-11-28 14:56 ` [PATCH 02/14] t/cindex*: require SCM_RIGHTS for these tests Eric Wong
@ 2023-11-28 14:56 ` Eric Wong
  2023-11-28 14:56 ` [PATCH 04/14] solver: schedule cleanup after synchronous git->check Eric Wong
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 19+ messages in thread
From: Eric Wong @ 2023-11-28 14:56 UTC (permalink / raw)
  To: meta

We store the full path name and xap_terms already removes
the `P' character, so the loop and substr calls are a
no-op replacing `/' with `/'.
---
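To illustrate the no-op described above, here is a small sketch
using made-up paths (the @dirs contents are assumptions, not data
from a real cindex).  Once every entry is already a full path with
a leading `/', the 4-argument substr leaves it unchanged.

```perl
use strict;
use warnings;

# hypothetical values as they look once the term prefix is gone
my @dirs = ('/srv/git/b.git', '/srv/git/a.git');
my @before = @dirs;

# 4-arg substr: replace the first character of each entry with `/'
substr($_, 0, 1, '/') for @dirs;

print "unchanged\n" if "@before" eq "@dirs";
@dirs = sort @dirs; # the only step that still does any work
```
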
 lib/PublicInbox/CodeSearch.pm | 1 -
 1 file changed, 1 deletion(-)

diff --git a/lib/PublicInbox/CodeSearch.pm b/lib/PublicInbox/CodeSearch.pm
index 9051d85f..eb057525 100644
--- a/lib/PublicInbox/CodeSearch.pm
+++ b/lib/PublicInbox/CodeSearch.pm
@@ -191,7 +191,6 @@ sub roots2paths { # for diagnostics
 			}
 			$size = $mset->size;
 		} while ($size);
-		substr($_, 0, 1, '/') for @$dirs; # s!^P!/!
 		@$dirs = sort @$dirs;
 	}
 	\%ret;


* [PATCH 04/14] solver: schedule cleanup after synchronous git->check
  2023-11-28 14:56 [PATCH 00/14] IT'S ALIVE! www loads cindex join data Eric Wong
                   ` (2 preceding siblings ...)
  2023-11-28 14:56 ` [PATCH 03/14] codesearch: eliminate redundant substitutions Eric Wong
@ 2023-11-28 14:56 ` Eric Wong
  2023-11-28 14:56 ` [PATCH 05/14] xap_helper.h: move cindex endpoints to separate file Eric Wong
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 19+ messages in thread
From: Eric Wong @ 2023-11-28 14:56 UTC (permalink / raw)
  To: meta

We don't want hundreds of git cat-file processes for coderepos
lingering around.
---
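A toy model of why scheduling the cleanup on every check should be
cheap.  This is not PublicInbox::DS: %pending, add_uniq_timer_sketch
and run_due are hypothetical names, and the assumption is that a
"unique" timer keeps only one pending entry per id, so repeated
requests for the same object collapse into a single cleanup.

```perl
use strict;
use warnings;

my %pending; # id => [ deadline, callback, args ]

# assumption: only the first request per id is kept until it fires
sub add_uniq_timer_sketch {
	my ($id, $delay, $cb, @args) = @_;
	$pending{$id} //= [ time + $delay, $cb, \@args ];
}

# run and forget every timer whose deadline has passed
sub run_due {
	my ($now) = @_;
	for my $id (sort keys %pending) {
		my ($deadline, $cb, $args) = @{$pending{$id}};
		next if $deadline > $now;
		delete $pending{$id};
		$cb->(@$args);
	}
}

my $cleaned = 0;
my $git = {}; # stand-in for a PublicInbox::Git object

# 100 synchronous checks each ask for a cleanup 30s out
add_uniq_timer_sketch($git + 0, 30, sub { $cleaned++ }) for 1..100;
run_due(time + 31);
print "cleanups run: $cleaned\n";
```

Keying on the numeric address of the object ($git + 0) mirrors what
the patch passes as the timer id.
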
 lib/PublicInbox/Git.pm       | 7 ++++++-
 lib/PublicInbox/SolverGit.pm | 3 +++
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/lib/PublicInbox/Git.pm b/lib/PublicInbox/Git.pm
index fe834210..7c6e15b7 100644
--- a/lib/PublicInbox/Git.pm
+++ b/lib/PublicInbox/Git.pm
@@ -628,10 +628,15 @@ sub event_step {
 	}
 }
 
+sub schedule_cleanup {
+	my ($self) = @_;
+	PublicInbox::DS::add_uniq_timer($self+0, 30, \&cleanup, $self, 1);
+}
+
 # idempotently registers with DS epoll/kqueue/select/poll
 sub watch_async ($) {
 	my ($self) = @_;
-	PublicInbox::DS::add_uniq_timer($self+0, 30, \&cleanup, $self, 1);
+	schedule_cleanup($self);
 	$self->{epwatch} //= do {
 		$self->SUPER::new($self->{sock}, EPOLLIN);
 		\undef;
diff --git a/lib/PublicInbox/SolverGit.pm b/lib/PublicInbox/SolverGit.pm
index ba3c94cb..7cc10198 100644
--- a/lib/PublicInbox/SolverGit.pm
+++ b/lib/PublicInbox/SolverGit.pm
@@ -82,7 +82,10 @@ sub solve_existing ($$) {
 	my $try = $want->{try_gits} //= [ @{$self->{gits}} ]; # array copy
 	my $git = shift @$try or die 'BUG {try_gits} empty';
 	my $oid_b = $want->{oid_b};
+
+	# can't use async_check due to last_check_err :<
 	my ($oid_full, $type, $size) = $git->check($oid_b);
+	$git->schedule_cleanup if $self->{psgi_env}->{'pi-httpd.async'};
 
 	if ($oid_b eq ($oid_full // '') || (defined($type) &&
 				(!$self->{have_hints} || $type eq 'blob'))) {


* [PATCH 05/14] xap_helper.h: move cindex endpoints to separate file
  2023-11-28 14:56 [PATCH 00/14] IT'S ALIVE! www loads cindex join data Eric Wong
                   ` (3 preceding siblings ...)
  2023-11-28 14:56 ` [PATCH 04/14] solver: schedule cleanup after synchronous git->check Eric Wong
@ 2023-11-28 14:56 ` Eric Wong
  2023-11-28 14:56 ` [PATCH 06/14] xap_helper: implement mset endpoint for WWW, IMAP, etc Eric Wong
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 19+ messages in thread
From: Eric Wong @ 2023-11-28 14:56 UTC (permalink / raw)
  To: meta

It ought to help a bit with organization since xap_helper.h
is getting somewhat large and we'll need new endpoints to
support WWW, lei, and whatever else comes along.
---
 MANIFEST                        |   1 +
 lib/PublicInbox/XapHelperCxx.pm |  10 +-
 lib/PublicInbox/xap_helper.h    | 269 +-------------------------------
 lib/PublicInbox/xh_cidx.h       | 259 ++++++++++++++++++++++++++++++
 4 files changed, 272 insertions(+), 267 deletions(-)
 create mode 100644 lib/PublicInbox/xh_cidx.h

diff --git a/MANIFEST b/MANIFEST
index 85811133..bbbe0b91 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -378,6 +378,7 @@ lib/PublicInbox/XapHelperCxx.pm
 lib/PublicInbox/Xapcmd.pm
 lib/PublicInbox/gcf2_libgit2.h
 lib/PublicInbox/xap_helper.h
+lib/PublicInbox/xh_cidx.h
 sa_config/Makefile
 sa_config/README
 sa_config/root/etc/spamassassin/public-inbox.pre
diff --git a/lib/PublicInbox/XapHelperCxx.pm b/lib/PublicInbox/XapHelperCxx.pm
index f421c7bc..8a66fdcd 100644
--- a/lib/PublicInbox/XapHelperCxx.pm
+++ b/lib/PublicInbox/XapHelperCxx.pm
@@ -20,7 +20,7 @@ $ENV{PERL_INLINE_DIRECTORY} // die('BUG: PERL_INLINE_DIRECTORY unset');
 substr($dir, 0, 0) = "$ENV{PERL_INLINE_DIRECTORY}/";
 my $bin = "$dir/xap_helper";
 my ($srcpfx) = (__FILE__ =~ m!\A(.+/)[^/]+\z!);
-my @srcs = map { $srcpfx.$_ } qw(xap_helper.h);
+my @srcs = map { $srcpfx.$_ } qw(xap_helper.h xh_cidx.h);
 my @pm_dep = map { $srcpfx.$_ } qw(Search.pm CodeSearch.pm);
 my $ldflags = '-Wl,-O1';
 $ldflags .= ' -Wl,--compress-debug-sections=zlib' if $^O ne 'openbsd';
@@ -61,11 +61,9 @@ sub build () {
 	require PublicInbox::OnDestroy;
 	my ($prog) = ($bin =~ m!/([^/]+)\z!);
 	my $lk = PublicInbox::Lock->new("$dir/$prog.lock")->lock_for_scope;
-	open my $fh, '>', "$dir/$prog.cpp";
-	say $fh qq(# include "$_") for @srcs;
-	print $fh PublicInbox::Search::generate_cxx();
-	print $fh PublicInbox::CodeSearch::generate_cxx();
-	close $fh;
+	write_file '>', "$dir/$prog.cpp", qq{#include "xap_helper.h"\n},
+			PublicInbox::Search::generate_cxx(),
+			PublicInbox::CodeSearch::generate_cxx();
 
 	opendir my $dh, '.';
 	my $restore = PublicInbox::OnDestroy->new(\&chdir, $dh);
diff --git a/lib/PublicInbox/xap_helper.h b/lib/PublicInbox/xap_helper.h
index 5816c24c..89d151d9 100644
--- a/lib/PublicInbox/xap_helper.h
+++ b/lib/PublicInbox/xap_helper.h
@@ -146,6 +146,12 @@ struct worker {
 	unsigned nr;
 };
 
+struct fbuf {
+	FILE *fp;
+	char *ptr;
+	size_t len;
+};
+
 #define SPLIT2ARGV(dst,buf,len) split2argv(dst,buf,len,MY_ARRAY_SIZE(dst))
 static size_t split2argv(char **dst, char *buf, size_t len, size_t limit)
 {
@@ -253,87 +259,11 @@ static bool starts_with(const std::string *s, const char *pfx, size_t pfx_len)
 	return s->size() >= pfx_len && !memcmp(pfx, s->c_str(), pfx_len);
 }
 
-static void dump_ibx_term(struct req *req, const char *pfx,
-			Xapian::Document *doc, const char *ibx_id)
-{
-	Xapian::TermIterator cur = doc->termlist_begin();
-	Xapian::TermIterator end = doc->termlist_end();
-	size_t pfx_len = strlen(pfx);
-
-	for (cur.skip_to(pfx); cur != end; cur++) {
-		std::string tn = *cur;
-
-		if (starts_with(&tn, pfx, pfx_len)) {
-			fprintf(req->fp[0], "%s %s\n",
-				tn.c_str() + pfx_len, ibx_id);
-			++req->nr_out;
-		}
-	}
-}
-
 static int my_setlinebuf(FILE *fp) // glibc setlinebuf(3) can't report errors
 {
 	return setvbuf(fp, NULL, _IOLBF, 0);
 }
 
-static enum exc_iter dump_ibx_iter(struct req *req, const char *ibx_id,
-				Xapian::MSetIterator *i)
-{
-	try {
-		Xapian::Document doc = i->get_document();
-		for (int p = 0; p < req->pfxc; p++)
-			dump_ibx_term(req, req->pfxv[p], &doc, ibx_id);
-	} catch (const Xapian::DatabaseModifiedError & e) {
-		req->srch->db->reopen();
-		return ITER_RETRY;
-	} catch (const Xapian::DocNotFoundError & e) { // oh well...
-		warnx("doc not found: %s", e.get_description().c_str());
-	}
-	return ITER_OK;
-}
-
-static bool cmd_dump_ibx(struct req *req)
-{
-	if ((optind + 1) >= req->argc)
-		ABORT("usage: dump_ibx [OPTIONS] IBX_ID QRY_STR");
-	if (!req->pfxc)
-		ABORT("dump_ibx requires -A PREFIX");
-
-	const char *ibx_id = req->argv[optind];
-	if (my_setlinebuf(req->fp[0])) // for sort(1) pipe
-		EABORT("setlinebuf(fp[0])"); // WTF?
-	req->asc = true;
-	req->sort_col = -1;
-	Xapian::MSet mset = mail_mset(req, req->argv[optind + 1]);
-
-	// @UNIQ_FOLD in CodeSearchIdx.pm can handle duplicate lines fine
-	// in case we need to retry on DB reopens
-	for (Xapian::MSetIterator i = mset.begin(); i != mset.end(); i++) {
-		for (int t = 10; t > 0; --t)
-			switch (dump_ibx_iter(req, ibx_id, &i)) {
-			case ITER_OK: t = 0; break; // leave inner loop
-			case ITER_RETRY: break; // continue for-loop
-			case ITER_ABORT: return false; // error
-			}
-	}
-	emit_mset_stats(req, &mset);
-	return true;
-}
-
-struct fbuf {
-	FILE *fp;
-	char *ptr;
-	size_t len;
-};
-
-struct dump_roots_tmp {
-	struct stat sb;
-	void *mm_ptr;
-	char **entries;
-	struct fbuf wbuf;
-	int root2off_fd;
-};
-
 // n.b. __cleanup__ works fine with C++ exceptions, but not longjmp
 // Only clang and g++ are supported, as AFAIK there's no other
 // relevant Free(-as-in-speech) C++ compilers.
@@ -367,127 +297,6 @@ static size_t off2size(off_t n)
 	return (size_t)n;
 }
 
-#define CLEANUP_DUMP_ROOTS __attribute__((__cleanup__(dump_roots_ensure)))
-static void dump_roots_ensure(void *ptr)
-{
-	struct dump_roots_tmp *drt = (struct dump_roots_tmp *)ptr;
-	if (drt->root2off_fd >= 0)
-		xclose(drt->root2off_fd);
-	hdestroy(); // idempotent
-	size_t size = off2size(drt->sb.st_size);
-	if (drt->mm_ptr && munmap(drt->mm_ptr, size))
-		EABORT("BUG: munmap(%p, %zu)", drt->mm_ptr, size);
-	free(drt->entries);
-	fbuf_ensure(&drt->wbuf);
-}
-
-static bool root2offs_str(struct fbuf *root_offs, Xapian::Document *doc)
-{
-	Xapian::TermIterator cur = doc->termlist_begin();
-	Xapian::TermIterator end = doc->termlist_end();
-	ENTRY e, *ep;
-	fbuf_init(root_offs);
-	for (cur.skip_to("G"); cur != end; cur++) {
-		std::string tn = *cur;
-		if (!starts_with(&tn, "G", 1))
-			continue;
-		union { const char *in; char *out; } u;
-		u.in = tn.c_str() + 1;
-		e.key = u.out;
-		ep = hsearch(e, FIND);
-		if (!ep) ABORT("hsearch miss `%s'", e.key);
-		// ep->data is a NUL-terminated string matching /[0-9]+/
-		fputc(' ', root_offs->fp);
-		fputs((const char *)ep->data, root_offs->fp);
-	}
-	fputc('\n', root_offs->fp);
-	if (ferror(root_offs->fp) | fclose(root_offs->fp))
-		err(EXIT_FAILURE, "ferror|fclose(root_offs)"); // ENOMEM
-	root_offs->fp = NULL;
-	return true;
-}
-
-// writes term values matching @pfx for a given @doc, ending the line
-// with the contents of @root_offs
-static void dump_roots_term(struct req *req, const char *pfx,
-				struct dump_roots_tmp *drt,
-				struct fbuf *root_offs,
-				Xapian::Document *doc)
-{
-	Xapian::TermIterator cur = doc->termlist_begin();
-	Xapian::TermIterator end = doc->termlist_end();
-	size_t pfx_len = strlen(pfx);
-
-	for (cur.skip_to(pfx); cur != end; cur++) {
-		std::string tn = *cur;
-		if (!starts_with(&tn, pfx, pfx_len))
-			continue;
-		fputs(tn.c_str() + pfx_len, drt->wbuf.fp);
-		fwrite(root_offs->ptr, root_offs->len, 1, drt->wbuf.fp);
-		++req->nr_out;
-	}
-}
-
-// we may have lines which exceed PIPE_BUF, so we do our own
-// buffering and rely on flock(2), here
-static bool dump_roots_flush(struct req *req, struct dump_roots_tmp *drt)
-{
-	char *p;
-	int fd = fileno(req->fp[0]);
-	bool ok = true;
-
-	if (!drt->wbuf.fp) return true;
-	if (fd < 0) EABORT("BUG: fileno");
-	if (ferror(drt->wbuf.fp) | fclose(drt->wbuf.fp)) // ENOMEM?
-		err(EXIT_FAILURE, "ferror|fclose(drt->wbuf.fp)");
-	drt->wbuf.fp = NULL;
-	if (!drt->wbuf.len) goto done_free;
-	while (flock(drt->root2off_fd, LOCK_EX)) {
-		if (errno == EINTR) continue;
-		err(EXIT_FAILURE, "LOCK_EX"); // ENOLCK?
-	}
-	p = drt->wbuf.ptr;
-	do { // write to client FD
-		ssize_t n = write(fd, p, drt->wbuf.len);
-		if (n > 0) {
-			drt->wbuf.len -= n;
-			p += n;
-		} else {
-			perror(n ? "write" : "write (zero bytes)");
-			return false;
-		}
-	} while (drt->wbuf.len);
-	while (flock(drt->root2off_fd, LOCK_UN)) {
-		if (errno == EINTR) continue;
-		err(EXIT_FAILURE, "LOCK_UN"); // ENOLCK?
-	}
-done_free: // OK to skip on errors, dump_roots_ensure calls fbuf_ensure
-	free(drt->wbuf.ptr);
-	drt->wbuf.ptr = NULL;
-	return ok;
-}
-
-static enum exc_iter dump_roots_iter(struct req *req,
-				struct dump_roots_tmp *drt,
-				Xapian::MSetIterator *i)
-{
-	CLEANUP_FBUF struct fbuf root_offs = {}; // " $ID0 $ID1 $IDx..\n"
-	try {
-		Xapian::Document doc = i->get_document();
-		if (!root2offs_str(&root_offs, &doc))
-			return ITER_ABORT; // bad request, abort
-		for (int p = 0; p < req->pfxc; p++)
-			dump_roots_term(req, req->pfxv[p], drt,
-					&root_offs, &doc);
-	} catch (const Xapian::DatabaseModifiedError & e) {
-		req->srch->db->reopen();
-		return ITER_RETRY;
-	} catch (const Xapian::DocNotFoundError & e) { // oh well...
-		warnx("doc not found: %s", e.get_description().c_str());
-	}
-	return ITER_OK;
-}
-
 static char *hsearch_enter_key(char *s)
 {
 #if defined(__OpenBSD__) || defined(__DragonFly__)
@@ -507,70 +316,6 @@ static char *hsearch_enter_key(char *s)
 	return s;
 }
 
-static bool cmd_dump_roots(struct req *req)
-{
-	CLEANUP_DUMP_ROOTS struct dump_roots_tmp drt = {};
-	drt.root2off_fd = -1;
-	if ((optind + 1) >= req->argc)
-		ABORT("usage: dump_roots [OPTIONS] ROOT2ID_FILE QRY_STR");
-	if (!req->pfxc)
-		ABORT("dump_roots requires -A PREFIX");
-	const char *root2off_file = req->argv[optind];
-	drt.root2off_fd = open(root2off_file, O_RDONLY);
-	if (drt.root2off_fd < 0)
-		EABORT("open(%s)", root2off_file);
-	if (fstat(drt.root2off_fd, &drt.sb)) // ENOMEM?
-		err(EXIT_FAILURE, "fstat(%s)", root2off_file);
-	// each entry is at least 43 bytes ({OIDHEX}\0{INT}\0),
-	// so /32 overestimates the number of expected entries by
-	// ~%25 (as recommended by Linux hcreate(3) manpage)
-	size_t size = off2size(drt.sb.st_size);
-	size_t est = (size / 32) + 1; //+1 for "\0" termination
-	drt.mm_ptr = mmap(NULL, size, PROT_READ,
-				MAP_PRIVATE, drt.root2off_fd, 0);
-	if (drt.mm_ptr == MAP_FAILED)
-		err(EXIT_FAILURE, "mmap(%zu, %s)", size, root2off_file);
-	size_t asize = est * 2;
-	if (asize < est) ABORT("too many entries: %zu", est);
-	drt.entries = (char **)calloc(asize, sizeof(char *));
-	if (!drt.entries)
-		err(EXIT_FAILURE, "calloc(%zu * 2, %zu)", est, sizeof(char *));
-	size_t tot = split2argv(drt.entries, (char *)drt.mm_ptr, size, asize);
-	if (tot <= 0) return false; // split2argv already warned on error
-	if (!hcreate(est))
-		err(EXIT_FAILURE, "hcreate(%zu)", est);
-	for (size_t i = 0; i < tot; ) {
-		ENTRY e;
-		e.key = hsearch_enter_key(drt.entries[i++]); // dies on ENOMEM
-		e.data = drt.entries[i++];
-		if (!hsearch(e, ENTER))
-			err(EXIT_FAILURE, "hsearch(%s => %s, ENTER)", e.key,
-					(const char *)e.data);
-	}
-	req->asc = true;
-	req->sort_col = -1;
-	Xapian::MSet mset = commit_mset(req, req->argv[optind + 1]);
-
-	// @UNIQ_FOLD in CodeSearchIdx.pm can handle duplicate lines fine
-	// in case we need to retry on DB reopens
-	for (Xapian::MSetIterator i = mset.begin(); i != mset.end(); i++) {
-		if (!drt.wbuf.fp)
-			fbuf_init(&drt.wbuf);
-		for (int t = 10; t > 0; --t)
-			switch (dump_roots_iter(req, &drt, &i)) {
-			case ITER_OK: t = 0; break; // leave inner loop
-			case ITER_RETRY: break; // continue for-loop
-			case ITER_ABORT: return false; // error
-			}
-		if (!(req->nr_out & 0x3fff) && !dump_roots_flush(req, &drt))
-			return false;
-	}
-	if (!dump_roots_flush(req, &drt))
-		return false;
-	emit_mset_stats(req, &mset);
-	return true;
-}
-
 // for test usage only, we need to ensure the compiler supports
 // __cleanup__ when exceptions are thrown
 struct inspect { struct req *req; };
@@ -594,6 +339,8 @@ static bool cmd_test_inspect(struct req *req)
 	return false;
 }
 
+#include "xh_cidx.h" // CodeSearchIdx.pm stuff
+
 #define CMD(n) { .fn_len = sizeof(#n) - 1, .fn_name = #n, .fn = cmd_##n }
 static const struct cmd_entry {
 	size_t fn_len;
diff --git a/lib/PublicInbox/xh_cidx.h b/lib/PublicInbox/xh_cidx.h
new file mode 100644
index 00000000..c2d94162
--- /dev/null
+++ b/lib/PublicInbox/xh_cidx.h
@@ -0,0 +1,259 @@
+// Copyright (C) all contributors <meta@public-inbox.org>
+// License: GPL-2.0+ <https://www.gnu.org/licenses/gpl-2.0.txt>
+// This file is only intended to be included by xap_helper.h
+// it implements pieces used by CodeSearchIdx.pm
+
+static void dump_ibx_term(struct req *req, const char *pfx,
+			Xapian::Document *doc, const char *ibx_id)
+{
+	Xapian::TermIterator cur = doc->termlist_begin();
+	Xapian::TermIterator end = doc->termlist_end();
+	size_t pfx_len = strlen(pfx);
+
+	for (cur.skip_to(pfx); cur != end; cur++) {
+		std::string tn = *cur;
+
+		if (starts_with(&tn, pfx, pfx_len)) {
+			fprintf(req->fp[0], "%s %s\n",
+				tn.c_str() + pfx_len, ibx_id);
+			++req->nr_out;
+		}
+	}
+}
+
+static enum exc_iter dump_ibx_iter(struct req *req, const char *ibx_id,
+				Xapian::MSetIterator *i)
+{
+	try {
+		Xapian::Document doc = i->get_document();
+		for (int p = 0; p < req->pfxc; p++)
+			dump_ibx_term(req, req->pfxv[p], &doc, ibx_id);
+	} catch (const Xapian::DatabaseModifiedError & e) {
+		req->srch->db->reopen();
+		return ITER_RETRY;
+	} catch (const Xapian::DocNotFoundError & e) { // oh well...
+		warnx("doc not found: %s", e.get_description().c_str());
+	}
+	return ITER_OK;
+}
+
+static bool cmd_dump_ibx(struct req *req)
+{
+	if ((optind + 1) >= req->argc)
+		ABORT("usage: dump_ibx [OPTIONS] IBX_ID QRY_STR");
+	if (!req->pfxc)
+		ABORT("dump_ibx requires -A PREFIX");
+
+	const char *ibx_id = req->argv[optind];
+	if (my_setlinebuf(req->fp[0])) // for sort(1) pipe
+		EABORT("setlinebuf(fp[0])"); // WTF?
+	req->asc = true;
+	req->sort_col = -1;
+	Xapian::MSet mset = mail_mset(req, req->argv[optind + 1]);
+
+	// @UNIQ_FOLD in CodeSearchIdx.pm can handle duplicate lines fine
+	// in case we need to retry on DB reopens
+	for (Xapian::MSetIterator i = mset.begin(); i != mset.end(); i++) {
+		for (int t = 10; t > 0; --t)
+			switch (dump_ibx_iter(req, ibx_id, &i)) {
+			case ITER_OK: t = 0; break; // leave inner loop
+			case ITER_RETRY: break; // continue for-loop
+			case ITER_ABORT: return false; // error
+			}
+	}
+	emit_mset_stats(req, &mset);
+	return true;
+}
+
+struct dump_roots_tmp {
+	struct stat sb;
+	void *mm_ptr;
+	char **entries;
+	struct fbuf wbuf;
+	int root2off_fd;
+};
+
+#define CLEANUP_DUMP_ROOTS __attribute__((__cleanup__(dump_roots_ensure)))
+static void dump_roots_ensure(void *ptr)
+{
+	struct dump_roots_tmp *drt = (struct dump_roots_tmp *)ptr;
+	if (drt->root2off_fd >= 0)
+		xclose(drt->root2off_fd);
+	hdestroy(); // idempotent
+	size_t size = off2size(drt->sb.st_size);
+	if (drt->mm_ptr && munmap(drt->mm_ptr, size))
+		EABORT("BUG: munmap(%p, %zu)", drt->mm_ptr, size);
+	free(drt->entries);
+	fbuf_ensure(&drt->wbuf);
+}
+
+static bool root2offs_str(struct fbuf *root_offs, Xapian::Document *doc)
+{
+	Xapian::TermIterator cur = doc->termlist_begin();
+	Xapian::TermIterator end = doc->termlist_end();
+	ENTRY e, *ep;
+	fbuf_init(root_offs);
+	for (cur.skip_to("G"); cur != end; cur++) {
+		std::string tn = *cur;
+		if (!starts_with(&tn, "G", 1))
+			continue;
+		union { const char *in; char *out; } u;
+		u.in = tn.c_str() + 1;
+		e.key = u.out;
+		ep = hsearch(e, FIND);
+		if (!ep) ABORT("hsearch miss `%s'", e.key);
+		// ep->data is a NUL-terminated string matching /[0-9]+/
+		fputc(' ', root_offs->fp);
+		fputs((const char *)ep->data, root_offs->fp);
+	}
+	fputc('\n', root_offs->fp);
+	if (ferror(root_offs->fp) | fclose(root_offs->fp))
+		err(EXIT_FAILURE, "ferror|fclose(root_offs)"); // ENOMEM
+	root_offs->fp = NULL;
+	return true;
+}
+
+// writes term values matching @pfx for a given @doc, ending the line
+// with the contents of @root_offs
+static void dump_roots_term(struct req *req, const char *pfx,
+				struct dump_roots_tmp *drt,
+				struct fbuf *root_offs,
+				Xapian::Document *doc)
+{
+	Xapian::TermIterator cur = doc->termlist_begin();
+	Xapian::TermIterator end = doc->termlist_end();
+	size_t pfx_len = strlen(pfx);
+
+	for (cur.skip_to(pfx); cur != end; cur++) {
+		std::string tn = *cur;
+		if (!starts_with(&tn, pfx, pfx_len))
+			continue;
+		fputs(tn.c_str() + pfx_len, drt->wbuf.fp);
+		fwrite(root_offs->ptr, root_offs->len, 1, drt->wbuf.fp);
+		++req->nr_out;
+	}
+}
+
+// we may have lines which exceed PIPE_BUF, so we do our own
+// buffering and rely on flock(2), here
+static bool dump_roots_flush(struct req *req, struct dump_roots_tmp *drt)
+{
+	char *p;
+	int fd = fileno(req->fp[0]);
+	bool ok = true;
+
+	if (!drt->wbuf.fp) return true;
+	if (fd < 0) EABORT("BUG: fileno");
+	if (ferror(drt->wbuf.fp) | fclose(drt->wbuf.fp)) // ENOMEM?
+		err(EXIT_FAILURE, "ferror|fclose(drt->wbuf.fp)");
+	drt->wbuf.fp = NULL;
+	if (!drt->wbuf.len) goto done_free;
+	while (flock(drt->root2off_fd, LOCK_EX)) {
+		if (errno == EINTR) continue;
+		err(EXIT_FAILURE, "LOCK_EX"); // ENOLCK?
+	}
+	p = drt->wbuf.ptr;
+	do { // write to client FD
+		ssize_t n = write(fd, p, drt->wbuf.len);
+		if (n > 0) {
+			drt->wbuf.len -= n;
+			p += n;
+		} else {
+			perror(n ? "write" : "write (zero bytes)");
+			return false;
+		}
+	} while (drt->wbuf.len);
+	while (flock(drt->root2off_fd, LOCK_UN)) {
+		if (errno == EINTR) continue;
+		err(EXIT_FAILURE, "LOCK_UN"); // ENOLCK?
+	}
+done_free: // OK to skip on errors, dump_roots_ensure calls fbuf_ensure
+	free(drt->wbuf.ptr);
+	drt->wbuf.ptr = NULL;
+	return ok;
+}
+
+static enum exc_iter dump_roots_iter(struct req *req,
+				struct dump_roots_tmp *drt,
+				Xapian::MSetIterator *i)
+{
+	CLEANUP_FBUF struct fbuf root_offs = {}; // " $ID0 $ID1 $IDx..\n"
+	try {
+		Xapian::Document doc = i->get_document();
+		if (!root2offs_str(&root_offs, &doc))
+			return ITER_ABORT; // bad request, abort
+		for (int p = 0; p < req->pfxc; p++)
+			dump_roots_term(req, req->pfxv[p], drt,
+					&root_offs, &doc);
+	} catch (const Xapian::DatabaseModifiedError & e) {
+		req->srch->db->reopen();
+		return ITER_RETRY;
+	} catch (const Xapian::DocNotFoundError & e) { // oh well...
+		warnx("doc not found: %s", e.get_description().c_str());
+	}
+	return ITER_OK;
+}
+
+static bool cmd_dump_roots(struct req *req)
+{
+	CLEANUP_DUMP_ROOTS struct dump_roots_tmp drt = {};
+	drt.root2off_fd = -1;
+	if ((optind + 1) >= req->argc)
+		ABORT("usage: dump_roots [OPTIONS] ROOT2ID_FILE QRY_STR");
+	if (!req->pfxc)
+		ABORT("dump_roots requires -A PREFIX");
+	const char *root2off_file = req->argv[optind];
+	drt.root2off_fd = open(root2off_file, O_RDONLY);
+	if (drt.root2off_fd < 0)
+		EABORT("open(%s)", root2off_file);
+	if (fstat(drt.root2off_fd, &drt.sb)) // ENOMEM?
+		err(EXIT_FAILURE, "fstat(%s)", root2off_file);
+	// each entry is at least 43 bytes ({OIDHEX}\0{INT}\0),
+	// so /32 overestimates the number of expected entries by
+	// ~25% (as recommended by Linux hcreate(3) manpage)
+	size_t size = off2size(drt.sb.st_size);
+	size_t est = (size / 32) + 1; //+1 for "\0" termination
+	drt.mm_ptr = mmap(NULL, size, PROT_READ,
+				MAP_PRIVATE, drt.root2off_fd, 0);
+	if (drt.mm_ptr == MAP_FAILED)
+		err(EXIT_FAILURE, "mmap(%zu, %s)", size, root2off_file);
+	size_t asize = est * 2;
+	if (asize < est) ABORT("too many entries: %zu", est);
+	drt.entries = (char **)calloc(asize, sizeof(char *));
+	if (!drt.entries)
+		err(EXIT_FAILURE, "calloc(%zu * 2, %zu)", est, sizeof(char *));
+	size_t tot = split2argv(drt.entries, (char *)drt.mm_ptr, size, asize);
+	if (tot <= 0) return false; // split2argv already warned on error
+	if (!hcreate(est))
+		err(EXIT_FAILURE, "hcreate(%zu)", est);
+	for (size_t i = 0; i < tot; ) {
+		ENTRY e;
+		e.key = hsearch_enter_key(drt.entries[i++]); // dies on ENOMEM
+		e.data = drt.entries[i++];
+		if (!hsearch(e, ENTER))
+			err(EXIT_FAILURE, "hsearch(%s => %s, ENTER)", e.key,
+					(const char *)e.data);
+	}
+	req->asc = true;
+	req->sort_col = -1;
+	Xapian::MSet mset = commit_mset(req, req->argv[optind + 1]);
+
+	// @UNIQ_FOLD in CodeSearchIdx.pm can handle duplicate lines fine
+	// in case we need to retry on DB reopens
+	for (Xapian::MSetIterator i = mset.begin(); i != mset.end(); i++) {
+		if (!drt.wbuf.fp)
+			fbuf_init(&drt.wbuf);
+		for (int t = 10; t > 0; --t)
+			switch (dump_roots_iter(req, &drt, &i)) {
+			case ITER_OK: t = 0; break; // leave inner loop
+			case ITER_RETRY: break; // continue for-loop
+			case ITER_ABORT: return false; // error
+			}
+		if (!(req->nr_out & 0x3fff) && !dump_roots_flush(req, &drt))
+			return false;
+	}
+	if (!dump_roots_flush(req, &drt))
+		return false;
+	emit_mset_stats(req, &mset);
+	return true;
+}
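
Note on dump_roots_flush above: it combines two things, a loop that
keeps calling write(2) until the whole buffer is out, and a flock(2)
held across that loop, presumably so rows longer than PIPE_BUF from
parallel workers cannot interleave.  A minimal standalone sketch of
that pattern follows.  The names write_fully and locked_write are
hypothetical and not part of the patch, and unlike the function above
this sketch drops the lock before reporting a failed write.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdio>     // perror(3)
#include <sys/file.h> // flock(2)
#include <unistd.h>   // write(2)

// Hypothetical helper: call write(2) until all `len' bytes are written.
// write(2) may return a short count, so a single call is not enough.
static bool write_fully(int fd, const char *p, size_t len)
{
	assert(fd >= 0 && p);
	while (len) {
		ssize_t n = write(fd, p, len);
		if (n <= 0) { // error, or an unexpected zero-byte write
			perror(n ? "write" : "write (zero bytes)");
			return false;
		}
		p += n;
		len -= (size_t)n;
	}
	return true;
}

// Hypothetical helper: hold an exclusive lock on `lock_fd' while writing
// to `out_fd', and release it whether or not the write succeeded.
static bool locked_write(int lock_fd, int out_fd, const char *p, size_t len)
{
	while (flock(lock_fd, LOCK_EX)) {
		if (errno != EINTR) return false;
	}
	bool ok = write_fully(out_fd, p, len);
	while (flock(lock_fd, LOCK_UN)) {
		if (errno != EINTR) return false;
	}
	return ok;
}
```

A caller would pass any FD shared by all writers as lock_fd (the patch
uses the root2off file) and the client FD as out_fd.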

* [PATCH 06/14] xap_helper: implement mset endpoint for WWW, IMAP, etc...
  2023-11-28 14:56 [PATCH 00/14] IT'S ALIVE! www loads cindex join data Eric Wong
                   ` (4 preceding siblings ...)
  2023-11-28 14:56 ` [PATCH 05/14] xap_helper.h: move cindex endpoints to separate file Eric Wong
@ 2023-11-28 14:56 ` Eric Wong
  2023-11-28 14:56 ` [PATCH 07/14] hval: use File::Spec to make relative paths for href Eric Wong
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 19+ messages in thread
From: Eric Wong @ 2023-11-28 14:56 UTC (permalink / raw)
  To: meta

The C++ version will allow us to take full advantage of Xapian's
APIs for better queries, and the Perl bindings version can still
be advantageous in the future since we'll be able to support
timeouts effectively.
---
 MANIFEST                        |   1 +
 Makefile.PL                     |   8 ++-
 lib/PublicInbox/Search.pm       |  25 ++++++++
 lib/PublicInbox/XapHelper.pm    |  51 ++++++++++-----
 lib/PublicInbox/XapHelperCxx.pm |   6 +-
 lib/PublicInbox/xap_helper.h    | 110 ++++++++++++++++++++++++++------
 lib/PublicInbox/xh_cidx.h       |  37 ++++-------
 lib/PublicInbox/xh_mset.h       |  96 ++++++++++++++++++++++++++++
 t/cindex.t                      |  52 ++++++++++++++-
 t/xap_helper.t                  |  49 ++++++++++++--
 10 files changed, 363 insertions(+), 72 deletions(-)
 create mode 100644 lib/PublicInbox/xh_mset.h
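
Reading the diff below, the response of the new endpoint appears to be
one "mset.size=N" header line followed by one row per result.  Each row
is NUL-delimited: the docid first, then the percentage if -p was given,
then one prefixed term per match of each -A PREFIX, then the document
data if -D was given.  A minimal sketch of how a client might split
such a row follows.  split_mset_row is a hypothetical name, and the
sketch assumes the trailing newline was already removed and that no
field contains a NUL itself.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical client-side helper: split one response row into its
// NUL-delimited fields.  Field 0 is the docid; the meaning of later
// fields depends on which of -p, -A and -D the request used.
static std::vector<std::string> split_mset_row(const std::string &row)
{
	// precondition: the caller already chomped the trailing newline
	assert(row.empty() || row[row.size() - 1] != '\n');
	std::vector<std::string> fields;
	size_t start = 0;
	for (;;) {
		size_t nul = row.find('\0', start);
		if (nul == std::string::npos) { // last (or only) field
			fields.push_back(row.substr(start));
			return fields;
		}
		fields.push_back(row.substr(start, nul - start));
		start = nul + 1;
	}
}
```

With `-p -A Q', a row would split into three or more fields: docid,
percent, and one "Q"-prefixed term per Message-ID.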

diff --git a/MANIFEST b/MANIFEST
index bbbe0b91..7b6178f9 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -379,6 +379,7 @@ lib/PublicInbox/Xapcmd.pm
 lib/PublicInbox/gcf2_libgit2.h
 lib/PublicInbox/xap_helper.h
 lib/PublicInbox/xh_cidx.h
+lib/PublicInbox/xh_mset.h
 sa_config/Makefile
 sa_config/README
 sa_config/root/etc/spamassassin/public-inbox.pre
diff --git a/Makefile.PL b/Makefile.PL
index 38e030f5..28f8263e 100644
--- a/Makefile.PL
+++ b/Makefile.PL
@@ -273,14 +273,16 @@ pm_to_blib : lib/PublicInbox.pm
 lib/PublicInbox.pm : FORCE
 	VERSION=\$(VERSION) \$(PERL) -w ./version-gen.perl
 
+XH_TESTS = t/xap_helper.t t/cindex.t
+
 test-asan : pure_all
-	TEST_XH_CXX_ONLY=1 CXXFLAGS='-O0 -Wall -ggdb3 -fsanitize=address' \\
-		prove -bvw t/xap_helper.t
+	TEST_XH_CXX_ONLY=1 CXXFLAGS='-Wall -ggdb3 -fsanitize=address' \\
+		prove -bvw \$(XH_TESTS)
 
 VG_OPT = -v --trace-children=yes --track-fds=yes
 VG_OPT += --leak-check=yes --track-origins=yes
 test-valgrind : pure_all
 	TEST_XH_CXX_ONLY=1 VALGRIND="valgrind \$(VG_OPT)" \\
-		prove -bvw t/xap_helper.t
+		prove -bvw \$(XH_TESTS)
 EOF
 }
diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index 477f77dc..6145b027 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -76,6 +76,25 @@ our @MAIL_VMAP = (
 );
 our @MAIL_NRP;
 
+# Getopt::Long spec, only short options for portability in C++ implementation
+our @XH_SPEC = (
+	'a', # ascending sort
+	'c', # code search
+	'd=s@', # shard dirs
+	'g=s', # git dir (with -c)
+	'k=i', # sort column (like sort(1))
+	'm=i', # maximum number of results
+	'o=i', # offset
+	'p', # show percent
+	'r', # 1=relevance then column
+	't', # collapse threads
+	'A=s@', # prefixes
+	'D', # emit docdata
+	'K=i', # timeout kill after i seconds
+	'O=s', # eidx_key
+	'T=i', # threadid
+);
+
 sub load_xapian () {
 	return 1 if defined $Xap;
 	# n.b. PI_XAPIAN is intended for development use only
@@ -247,6 +266,12 @@ sub mdocid {
 	int(($docid - 1) / $nshard) + 1;
 }
 
+sub docids_to_artnums {
+	my $nshard = shift->{nshard};
+	# XXX does array vs arrayref make a difference in modern Perls?
+	map { int(($_ - 1) / $nshard) + 1 } @_;
+}
+
 sub mset_to_artnums {
 	my ($self, $mset) = @_;
 	my $nshard = $self->{nshard};
diff --git a/lib/PublicInbox/XapHelper.pm b/lib/PublicInbox/XapHelper.pm
index fe831b8f..b21e70a2 100644
--- a/lib/PublicInbox/XapHelper.pm
+++ b/lib/PublicInbox/XapHelper.pm
@@ -21,21 +21,6 @@ my $X = \%PublicInbox::Search::X;
 our (%SRCH, %WORKERS, $nworker, $workerset, $in);
 our $stderr = \*STDERR;
 
-# only short options for portability in C++ implementation
-our @SPEC = (
-	'a', # ascending sort
-	'c', # code search
-	'd=s@', # shard dirs
-	'k=i', # sort column (like sort(1))
-	'm=i', # maximum number of results
-	'o=i', # offset
-	'r', # 1=relevance then column
-	't', # collapse threads
-	'A=s@', # prefixes
-	'O=s', # eidx_key
-	'T=i', # timeout in seconds
-);
-
 sub cmd_test_inspect {
 	my ($req) = @_;
 	print { $req->{0} } "pid=$$ has_threadid=",
@@ -144,10 +129,44 @@ sub cmd_dump_roots {
 	emit_mset_stats($req, $mset);
 }
 
+sub mset_iter ($$) {
+	my ($req, $it) = @_;
+	eval {
+		my $buf = $it->get_docid;
+		$buf .= "\0".$it->get_percent if $req->{p};
+		my $doc = ($req->{A} || $req->{D}) ? $it->get_document : undef;
+		for my $p (@{$req->{A}}) {
+			$buf .= "\0".$p.$_ for xap_terms($p, $doc);
+		}
+		$buf .= "\0".$doc->get_data if $req->{D};
+		say { $req->{0} } $buf;
+	};
+	$@ ? iter_retry_check($req) : 0;
+}
+
+sub cmd_mset { # to be used by WWW + IMAP
+	my ($req, $qry_str) = @_;
+	$qry_str // die 'usage: mset [OPTIONS] QRY_STR';
+	my $opt = { limit => $req->{'m'}, offset => $req->{o} // 0 };
+	$opt->{relevance} = 1 if $req->{r};
+	$opt->{threads} = 1 if defined $req->{t};
+	$opt->{git_dir} = $req->{g} if defined $req->{g};
+	$opt->{eidx_key} = $req->{O} if defined $req->{O};
+	$opt->{threadid} = $req->{T} if defined $req->{T};
+	my $mset = $req->{srch}->mset($qry_str, $opt);
+	say { $req->{0} } 'mset.size=', $mset->size;
+	for my $it ($mset->items) {
+		for (my $t = 10; $t > 0; --$t) {
+			$t = mset_iter($req, $it) // $t;
+		}
+	}
+}
+
 sub dispatch {
 	my ($req, $cmd, @argv) = @_;
 	my $fn = $req->can("cmd_$cmd") or return;
-	$GLP->getoptionsfromarray(\@argv, $req, @SPEC) or return;
+	$GLP->getoptionsfromarray(\@argv, $req, @PublicInbox::Search::XH_SPEC)
+		or return;
 	my $dirs = delete $req->{d} or die 'no -d args';
 	my $key = join("\0", @$dirs);
 	$req->{srch} = $SRCH{$key} //= do {
diff --git a/lib/PublicInbox/XapHelperCxx.pm b/lib/PublicInbox/XapHelperCxx.pm
index 8a66fdcd..1aa75f2a 100644
--- a/lib/PublicInbox/XapHelperCxx.pm
+++ b/lib/PublicInbox/XapHelperCxx.pm
@@ -20,13 +20,15 @@ $ENV{PERL_INLINE_DIRECTORY} // die('BUG: PERL_INLINE_DIRECTORY unset');
 substr($dir, 0, 0) = "$ENV{PERL_INLINE_DIRECTORY}/";
 my $bin = "$dir/xap_helper";
 my ($srcpfx) = (__FILE__ =~ m!\A(.+/)[^/]+\z!);
-my @srcs = map { $srcpfx.$_ } qw(xap_helper.h xh_cidx.h);
+my @srcs = map { $srcpfx.$_ } qw(xh_mset.h xh_cidx.h xap_helper.h);
 my @pm_dep = map { $srcpfx.$_ } qw(Search.pm CodeSearch.pm);
 my $ldflags = '-Wl,-O1';
 $ldflags .= ' -Wl,--compress-debug-sections=zlib' if $^O ne 'openbsd';
 my $xflags = ($ENV{CXXFLAGS} // '-Wall -ggdb3 -pipe') . ' ' .
 	' -DTHREADID=' . PublicInbox::Search::THREADID .
-	' ' . ($ENV{LDFLAGS} // $ldflags);
+	' -DXH_SPEC="'.join('',
+		map { s/=.*/:/; $_ } @PublicInbox::Search::XH_SPEC) . '" ' .
+	($ENV{LDFLAGS} // $ldflags);
 my $xap_modversion;
 
 sub xap_cfg (@) {
diff --git a/lib/PublicInbox/xap_helper.h b/lib/PublicInbox/xap_helper.h
index 89d151d9..18665567 100644
--- a/lib/PublicInbox/xap_helper.h
+++ b/lib/PublicInbox/xap_helper.h
@@ -124,10 +124,12 @@ struct req { // argv and pfxv point into global rbuf
 	char *argv[MY_ARG_MAX];
 	char *pfxv[MY_ARG_MAX]; // -A <prefix>
 	struct srch *srch;
+	char *Pgit_dir;
 	char *Oeidx_key;
 	cmd fn;
 	unsigned long long max;
 	unsigned long long off;
+	unsigned long long threadid;
 	unsigned long timeout_sec;
 	size_t nr_out;
 	long sort_col; // value column, negative means BoolWeight
@@ -138,6 +140,8 @@ struct req { // argv and pfxv point into global rbuf
 	bool collapse_threads;
 	bool code_search;
 	bool relevance; // sort by relevance before column
+	bool emit_percent;
+	bool emit_docdata;
 	bool asc; // ascending sort
 };
 
@@ -230,12 +234,53 @@ static Xapian::MSet mail_mset(struct req *req, const char *qry_str)
 	return enquire_mset(req, &enq);
 }
 
+static bool starts_with(const std::string *s, const char *pfx, size_t pfx_len)
+{
+	return s->size() >= pfx_len && !memcmp(pfx, s->c_str(), pfx_len);
+}
+
+static void apply_roots_filter(struct req *req, Xapian::Query *qry)
+{
+	if (!req->Pgit_dir) return;
+	req->Pgit_dir[0] = 'P'; // modifies static rbuf
+	Xapian::Database *xdb = req->srch->db;
+	for (int i = 0; i < 9; i++) {
+		try {
+			std::string P = req->Pgit_dir;
+			Xapian::PostingIterator p = xdb->postlist_begin(P);
+			if (p == xdb->postlist_end(P)) {
+				warnx("W: %s not indexed?", req->Pgit_dir + 1);
+				return;
+			}
+			Xapian::TermIterator cur = xdb->termlist_begin(*p);
+			Xapian::TermIterator end = xdb->termlist_end(*p);
+			cur.skip_to("G");
+			if (cur == end) {
+				warnx("W: %s has no root commits?",
+					req->Pgit_dir + 1);
+				return;
+			}
+			Xapian::Query f = Xapian::Query(*cur);
+			for (++cur; cur != end; ++cur) {
+				std::string tn = *cur;
+				if (!starts_with(&tn, "G", 1))
+					continue;
+				f = Xapian::Query(Xapian::Query::OP_OR, f, tn);
+			}
+			*qry = Xapian::Query(Xapian::Query::OP_FILTER, *qry, f);
+			return;
+		} catch (const Xapian::DatabaseModifiedError & e) {
+			xdb->reopen();
+		}
+	}
+}
+
 // for cindex
 static Xapian::MSet commit_mset(struct req *req, const char *qry_str)
 {
 	struct srch *srch = req->srch;
 	Xapian::Query qry = srch->qp->parse_query(qry_str, srch->qp_flags);
-	// TODO: git_dir + roots_filter
+	apply_roots_filter(req, &qry);
 
 	// we only want commits:
 	qry = Xapian::Query(Xapian::Query::OP_FILTER, qry,
@@ -254,11 +299,6 @@ static void emit_mset_stats(struct req *req, const Xapian::MSet *mset)
 		ABORT("BUG: %s caller only passed 1 FD", req->argv[0]);
 }
 
-static bool starts_with(const std::string *s, const char *pfx, size_t pfx_len)
-{
-	return s->size() >= pfx_len && !memcmp(pfx, s->c_str(), pfx_len);
-}
-
 static int my_setlinebuf(FILE *fp) // glibc setlinebuf(3) can't report errors
 {
 	return setvbuf(fp, NULL, _IOLBF, 0);
@@ -284,6 +324,32 @@ static void fbuf_init(struct fbuf *fbuf)
 	if (!fbuf->fp) err(EXIT_FAILURE, "open_memstream(fbuf)");
 }
 
+static bool write_all(int fd, const struct fbuf *wbuf, size_t len)
+{
+	const char *p = wbuf->ptr;
+	assert(wbuf->len >= len);
+	do { // write to client FD
+		ssize_t n = write(fd, p, len);
+		if (n > 0) {
+			len -= n;
+			p += n;
+		} else {
+			perror(n ? "write" : "write (zero bytes)");
+			return false;
+		}
+	} while (len);
+	return true;
+}
+
+#define ERR_FLUSH(f) do { \
+	if (ferror(f) | fflush(f)) err(EXIT_FAILURE, "ferror|fflush "#f); \
+} while (0)
+
+#define ERR_CLOSE(f, e) do { \
+	if (ferror(f) | fclose(f)) \
+		e ? err(e, "ferror|fclose "#f) : perror("ferror|fclose "#f); \
+} while (0)
+
 static void xclose(int fd)
 {
 	if (close(fd) < 0 && errno != EINTR)
@@ -339,6 +405,7 @@ static bool cmd_test_inspect(struct req *req)
 	return false;
 }
 
+#include "xh_mset.h" // read-only (WWW, IMAP, lei) stuff
 #include "xh_cidx.h" // CodeSearchIdx.pm stuff
 
 #define CMD(n) { .fn_len = sizeof(#n) - 1, .fn_name = #n, .fn = cmd_##n }
@@ -348,6 +415,7 @@ static const struct cmd_entry {
 	cmd fn;
 } cmds[] = { // should be small enough to not need bsearch || gperf
 	// most common commands first
+	CMD(mset), // WWW and IMAP requests
 	CMD(dump_ibx), // many inboxes
 	CMD(dump_roots), // per-cidx shard
 	CMD(test_inspect), // least common commands last
@@ -520,7 +588,7 @@ static void dispatch(struct req *req)
 	char *end;
 	FILE *kfp;
 	struct srch **s;
-	req->fn = NULL;
+	req->threadid = ULLONG_MAX;
 	for (c = 0; c < (int)MY_ARRAY_SIZE(cmds); c++) {
 		if (cmds[c].fn_len == size &&
 			!memcmp(cmds[c].fn_name, req->argv[0], size)) {
@@ -540,12 +608,13 @@ static void dispatch(struct req *req)
 	optarg = NULL;
 	MY_DO_OPTRESET();
 
-	// keep sync with @PublicInbox::XapHelper::SPEC
-	while ((c = getopt(req->argc, req->argv, "acd:k:m:o:rtA:O:T:")) != -1) {
+	// XH_SPEC is generated from @PublicInbox::Search::XH_SPEC
+	while ((c = getopt(req->argc, req->argv, XH_SPEC)) != -1) {
 		switch (c) {
 		case 'a': req->asc = true; break;
 		case 'c': req->code_search = true; break;
 		case 'd': fwrite(optarg, strlen(optarg) + 1, 1, kfp); break;
+		case 'g': req->Pgit_dir = optarg - 1; break; // pad "P" prefix
 		case 'k':
 			req->sort_col = strtol(optarg, &end, 10);
 			if (*end) ABORT("-k %s", optarg);
@@ -563,6 +632,7 @@ static void dispatch(struct req *req)
 			if (*end || req->off == ULLONG_MAX)
 				ABORT("-o %s", optarg);
 			break;
+		case 'p': req->emit_percent = true; break;
 		case 'r': req->relevance = true; break;
 		case 't': req->collapse_threads = true; break;
 		case 'A':
@@ -570,17 +640,22 @@ static void dispatch(struct req *req)
 			if (MY_ARG_MAX == req->pfxc)
 				ABORT("too many -A");
 			break;
-		case 'O': req->Oeidx_key = optarg - 1; break; // pad "O" prefix
-		case 'T':
+		case 'D': req->emit_docdata = true; break;
+		case 'K':
 			req->timeout_sec = strtoul(optarg, &end, 10);
 			if (*end || req->timeout_sec == ULONG_MAX)
+				ABORT("-K %s", optarg);
+			break;
+		case 'O': req->Oeidx_key = optarg - 1; break; // pad "O" prefix
+		case 'T':
+			req->threadid = strtoull(optarg, &end, 10);
+			if (*end || req->threadid == ULLONG_MAX)
 				ABORT("-T %s", optarg);
 			break;
 		default: ABORT("bad switch `-%c'", c);
 		}
 	}
-	if (ferror(kfp) | fclose(kfp)) /* sets kbuf.srch */
-		err(EXIT_FAILURE, "ferror|fclose"); // likely ENOMEM
+	ERR_CLOSE(kfp, EXIT_FAILURE); // may ENOMEM, sets kbuf.srch
 	kbuf.srch->db = NULL;
 	kbuf.srch->qp = NULL;
 	kbuf.srch->paths_len = size - offsetof(struct srch, paths);
@@ -639,8 +714,7 @@ static void stderr_restore(FILE *tmp_err)
 	stderr = orig_err;
 	return;
 #endif
-	if (ferror(stderr) | fflush(stderr))
-		err(EXIT_FAILURE, "ferror|fflush stderr");
+	ERR_FLUSH(stderr);
 	while (dup2(orig_err_fd, STDERR_FILENO) < 0) {
 		if (errno != EINTR)
 			err(EXIT_FAILURE, "dup2(%d => 2)", orig_err_fd);
@@ -670,12 +744,10 @@ static void recv_loop(void) // worker process loop
 			stderr_set(req.fp[1]);
 		req.argc = (int)SPLIT2ARGV(req.argv, rbuf, len);
 		dispatch(&req);
-		if (ferror(req.fp[0]) | fclose(req.fp[0]))
-			perror("ferror|fclose fp[0]");
+		ERR_CLOSE(req.fp[0], 0);
 		if (req.fp[1]) {
 			stderr_restore(req.fp[1]);
-			if (ferror(req.fp[1]) | fclose(req.fp[1]))
-				perror("ferror|fclose fp[1]");
+			ERR_CLOSE(req.fp[1], 0);
 		}
 	}
 }
diff --git a/lib/PublicInbox/xh_cidx.h b/lib/PublicInbox/xh_cidx.h
index c2d94162..1980f9f6 100644
--- a/lib/PublicInbox/xh_cidx.h
+++ b/lib/PublicInbox/xh_cidx.h
@@ -107,8 +107,7 @@ static bool root2offs_str(struct fbuf *root_offs, Xapian::Document *doc)
 		fputs((const char *)ep->data, root_offs->fp);
 	}
 	fputc('\n', root_offs->fp);
-	if (ferror(root_offs->fp) | fclose(root_offs->fp))
-		err(EXIT_FAILURE, "ferror|fclose(root_offs)"); // ENOMEM
+	ERR_CLOSE(root_offs->fp, EXIT_FAILURE); // ENOMEM
 	root_offs->fp = NULL;
 	return true;
 }
@@ -138,38 +137,24 @@ static void dump_roots_term(struct req *req, const char *pfx,
 // buffering and rely on flock(2), here
 static bool dump_roots_flush(struct req *req, struct dump_roots_tmp *drt)
 {
-	char *p;
-	int fd = fileno(req->fp[0]);
 	bool ok = true;
+	off_t off = ftello(drt->wbuf.fp);
+	if (off < 0) EABORT("ftello");
+	if (!off) return ok;
+
+	ERR_FLUSH(drt->wbuf.fp); // ENOMEM
+	int fd = fileno(req->fp[0]);
 
-	if (!drt->wbuf.fp) return true;
-	if (fd < 0) EABORT("BUG: fileno");
-	if (ferror(drt->wbuf.fp) | fclose(drt->wbuf.fp)) // ENOMEM?
-		err(EXIT_FAILURE, "ferror|fclose(drt->wbuf.fp)");
-	drt->wbuf.fp = NULL;
-	if (!drt->wbuf.len) goto done_free;
 	while (flock(drt->root2off_fd, LOCK_EX)) {
 		if (errno == EINTR) continue;
 		err(EXIT_FAILURE, "LOCK_EX"); // ENOLCK?
 	}
-	p = drt->wbuf.ptr;
-	do { // write to client FD
-		ssize_t n = write(fd, p, drt->wbuf.len);
-		if (n > 0) {
-			drt->wbuf.len -= n;
-			p += n;
-		} else {
-			perror(n ? "write" : "write (zero bytes)");
-			return false;
-		}
-	} while (drt->wbuf.len);
+	ok = write_all(fd, &drt->wbuf, (size_t)off);
 	while (flock(drt->root2off_fd, LOCK_UN)) {
 		if (errno == EINTR) continue;
 		err(EXIT_FAILURE, "LOCK_UN"); // ENOLCK?
 	}
-done_free: // OK to skip on errors, dump_roots_ensure calls fbuf_ensure
-	free(drt->wbuf.ptr);
-	drt->wbuf.ptr = NULL;
+	if (fseeko(drt->wbuf.fp, 0, SEEK_SET)) EABORT("fseeko");
 	return ok;
 }
 
@@ -238,11 +223,11 @@ static bool cmd_dump_roots(struct req *req)
 	req->sort_col = -1;
 	Xapian::MSet mset = commit_mset(req, req->argv[optind + 1]);
 
+	fbuf_init(&drt.wbuf);
+
 	// @UNIQ_FOLD in CodeSearchIdx.pm can handle duplicate lines fine
 	// in case we need to retry on DB reopens
 	for (Xapian::MSetIterator i = mset.begin(); i != mset.end(); i++) {
-		if (!drt.wbuf.fp)
-			fbuf_init(&drt.wbuf);
 		for (int t = 10; t > 0; --t)
 			switch (dump_roots_iter(req, &drt, &i)) {
 			case ITER_OK: t = 0; break; // leave inner loop
diff --git a/lib/PublicInbox/xh_mset.h b/lib/PublicInbox/xh_mset.h
new file mode 100644
index 00000000..056fe22b
--- /dev/null
+++ b/lib/PublicInbox/xh_mset.h
@@ -0,0 +1,96 @@
+// Copyright (C) all contributors <meta@public-inbox.org>
+// License: GPL-2.0+ <https://www.gnu.org/licenses/gpl-2.0.txt>
+// This file is only intended to be included by xap_helper.h
+// it implements pieces used by WWW, IMAP and lei
+
+static void emit_doc_term(FILE *fp, const char *pfx, Xapian::Document *doc)
+{
+	Xapian::TermIterator cur = doc->termlist_begin();
+	Xapian::TermIterator end = doc->termlist_end();
+	size_t pfx_len = strlen(pfx);
+
+	for (cur.skip_to(pfx); cur != end; cur++) {
+		std::string tn = *cur;
+		if (!starts_with(&tn, pfx, pfx_len)) continue;
+		fputc(0, fp);
+		fwrite(tn.data(), tn.size(), 1, fp);
+	}
+}
+
+static enum exc_iter mset_iter(const struct req *req, FILE *fp, off_t off,
+				Xapian::MSetIterator *i)
+{
+	try {
+		fprintf(fp, "%llu", (unsigned long long)(*(*i))); // get_docid
+		if (req->emit_percent)
+			fprintf(fp, "%c%d", 0, i->get_percent());
+		if (req->pfxc || req->emit_docdata) {
+			Xapian::Document doc = i->get_document();
+			for (int p = 0; p < req->pfxc; p++)
+				emit_doc_term(fp, req->pfxv[p], &doc);
+			if (req->emit_docdata) {
+				std::string d = doc.get_data();
+				fputc(0, fp);
+				fwrite(d.data(), d.size(), 1, fp);
+			}
+		}
+		fputc('\n', fp);
+	} catch (const Xapian::DatabaseModifiedError & e) {
+		req->srch->db->reopen();
+		if (fseeko(fp, off, SEEK_SET) < 0) EABORT("fseeko");
+		return ITER_RETRY;
+	} catch (const Xapian::DocNotFoundError & e) { // oh well...
+		warnx("doc not found: %s", e.get_description().c_str());
+		if (fseeko(fp, off, SEEK_SET) < 0) EABORT("fseeko");
+	}
+	return ITER_OK;
+}
+
+#ifndef WBUF_FLUSH_THRESHOLD
+#	define WBUF_FLUSH_THRESHOLD (BUFSIZ - 1000)
+#endif
+#if WBUF_FLUSH_THRESHOLD < 0
+#	undef WBUF_FLUSH_THRESHOLD
+#	define WBUF_FLUSH_THRESHOLD BUFSIZ
+#endif
+
+static bool cmd_mset(struct req *req)
+{
+	if (optind >= req->argc) ABORT("usage: mset [OPTIONS] QRY_STR");
+	if (req->fp[1]) ABORT("mset only accepts 1 FD");
+	const char *qry_str = req->argv[optind];
+	CLEANUP_FBUF struct fbuf wbuf = {};
+	Xapian::MSet mset = req->code_search ? commit_mset(req, qry_str) :
+						mail_mset(req, qry_str);
+	fbuf_init(&wbuf);
+	fprintf(wbuf.fp, "mset.size=%llu\n", (unsigned long long)mset.size());
+	int fd = fileno(req->fp[0]);
+	for (Xapian::MSetIterator i = mset.begin(); i != mset.end(); i++) {
+		off_t off = ftello(wbuf.fp);
+		if (off < 0) EABORT("ftello");
+		/*
+		 * TODO verify our fflush + fseeko use isn't affected by a
+		 * glibc <2.25 bug:
+		 * https://sourceware.org/bugzilla/show_bug.cgi?id=20181
+		 * CentOS 7.x only has glibc 2.17.  In any case, bug #20181
+		 * shouldn't affect us since our use of fseeko is used to
+		 * effectively discard data.
+		 */
+		if (off > WBUF_FLUSH_THRESHOLD) {
+			ERR_FLUSH(wbuf.fp);
+			if (!write_all(fd, &wbuf, (size_t)off)) return false;
+			if (fseeko(wbuf.fp, 0, SEEK_SET)) EABORT("fseeko");
+			off = 0;
+		}
+		for (int t = 10; t > 0; --t)
+			switch (mset_iter(req, wbuf.fp, off, &i)) {
+			case ITER_OK: t = 0; break; // leave inner loop
+			case ITER_RETRY: break; // continue for-loop
+			case ITER_ABORT: return false; // error
+			}
+	}
+	off_t off = ftello(wbuf.fp);
+	if (off < 0) EABORT("ftello");
+	ERR_FLUSH(wbuf.fp);
+	return off > 0 ? write_all(fd, &wbuf, (size_t)off) : true;
+}
diff --git a/t/cindex.t b/t/cindex.t
index 261945bf..a9075092 100644
--- a/t/cindex.t
+++ b/t/cindex.t
@@ -121,22 +121,70 @@ my $no_metadata_set = sub {
 
 use_ok 'PublicInbox::CodeSearch';
 
+
+my @xh_args;
+my $exp = [ 'initial with NUL character', 'remove NUL character' ];
+my $zp_git = abs_path("$zp/.git");
 if ('multi-repo search') {
 	my $csrch = PublicInbox::CodeSearch->new("$tmp/ext");
 	my $mset = $csrch->mset('NUL');
 	is(scalar($mset->items), 2, 'got results');
-	my $exp = [ 'initial with NUL character', 'remove NUL character' ];
 	my @have = sort(map { $_->get_document->get_data } $mset->items);
 	is_xdeeply(\@have, $exp, 'got expected subjects');
 
 	$mset = $csrch->mset('NUL', { git_dir => "$tmp/wt0/.git" });
 	is(scalar($mset->items), 0, 'no results with other GIT_DIR');
 
-	$mset = $csrch->mset('NUL', { git_dir => abs_path("$zp/.git") });
+	$mset = $csrch->mset('NUL', { git_dir => $zp_git });
 	@have = sort(map { $_->get_document->get_data } $mset->items);
 	is_xdeeply(\@have, $exp, 'got expected subjects w/ GIT_DIR filter');
 	my @xdb = $csrch->xdb_shards_flat;
 	$no_metadata_set->(0, ['indexlevel'], \@xdb);
+	@xh_args = $csrch->xh_args;
+}
+
+my $test_xhc = sub {
+	my ($xhc) = @_;
+	my $impl = $xhc->{impl};
+	my ($r, @l);
+	$r = $xhc->mkreq([], qw(mset -D -c -g), $zp_git, @xh_args, 'NUL');
+	chomp(@l = <$r>);
+	is(shift(@l), 'mset.size=2', "got expected header $impl");
+	my %docid2data;
+	my @got = sort map {
+		my @f = split /\0/;
+		is scalar(@f), 2, 'got 2 entries';
+		$docid2data{$f[0]} = $f[1];
+		$f[1];
+	} @l;
+	is_deeply(\@got, $exp, "expected doc_data $impl");
+
+	$r = $xhc->mkreq([], qw(mset -c -g), "$tmp/wt0/.git", @xh_args, 'NUL');
+	chomp(@l = <$r>);
+	is(shift(@l), 'mset.size=0', "got miss in wrong dir $impl");
+	is_deeply(\@l, [], "no extra lines $impl");
+
+	my $csrch = PublicInbox::CodeSearch->new("$tmp/ext");
+	while (my ($did, $expect) = each %docid2data) {
+		is_deeply($csrch->xdb->get_document($did)->get_data,
+			$expect, "docid=$did data matches");
+	}
+	ok(!$xhc->{io}->close, "$impl close");
+	is($?, 66 << 8, "got EX_NOINPUT from $impl exit");
+};
+
+SKIP: {
+	require_mods('+SCM_RIGHTS', 1);
+	require PublicInbox::XapClient;
+	my $xhc = PublicInbox::XapClient::start_helper('-j0');
+	$test_xhc->($xhc);
+	skip 'PI_NO_CXX set', 1 if $ENV{PI_NO_CXX};
+	$xhc->{impl} =~ /Cxx/ or
+		skip 'C++ compiler or xapian development libs missing', 1;
+	skip 'TEST_XH_CXX_ONLY set', 1 if $ENV{TEST_XH_CXX_ONLY};
+	local $ENV{PI_NO_CXX} = 1; # force XS or SWIG binding test
+	$xhc = PublicInbox::XapClient::start_helper('-j0');
+	$test_xhc->($xhc);
 }
 
 if ('--update') {
diff --git a/t/xap_helper.t b/t/xap_helper.t
index e3abeded..ee25b2dc 100644
--- a/t/xap_helper.t
+++ b/t/xap_helper.t
@@ -40,6 +40,7 @@ my $v2 = create_inbox 'v2', indexlevel => 'medium', version => 2,
 };
 
 my @ibx_idx = glob("$v2->{inboxdir}/xap*/?");
+my @ibx_shard_args = map { ('-d', $_) } @ibx_idx;
 my (@int) = glob("$crepo/public-inbox-cindex/cidx*/?");
 my (@ext) = glob("$crepo/cidx-ext/cidx*/?");
 is(scalar(@ext), 2, 'have 2 external shards') or diag explain(\@ext);
@@ -76,8 +77,7 @@ my $test = sub {
 	is($cinfo{has_threadid}, '0', 'has_threadid false for cindex');
 	is($cinfo{pid}, $info{pid}, 'PID unchanged for cindex');
 
-	my @dump = (qw(dump_ibx -A XDFID), (map { ('-d', $_) } @ibx_idx),
-			qw(13 rt:0..));
+	my @dump = (qw(dump_ibx -A XDFID), @ibx_shard_args, qw(13 rt:0..));
 	$r = $doreq->($s, @dump);
 	my @res;
 	while (sysread($r, my $buf, 512) != 0) { push @res, $buf }
@@ -89,7 +89,8 @@ my $test = sub {
 	my $res = do { local $/; <$r> };
 	is(join('', @res), $res, 'got identical response w/ error pipe');
 	my $stats = do { local $/; <$err_rd> };
-	is($stats, "mset.size=6 nr_out=6\n", 'mset.size reported');
+	is($stats, "mset.size=6 nr_out=6\n", 'mset.size reported') or
+		diag "res=$res";
 
 	return wantarray ? ($ar, $s) : $ar if $cinfo{pid} == $pid;
 
@@ -198,7 +199,47 @@ for my $n (@NO_CXX) {
 	is(scalar(@res), scalar(grep(/\A[0-9a-f]{40,} [0-9]+\n\z/, @res)),
 		'entries match format');
 	$err = do { local $/; <$err_r> };
-	is($err, "mset.size=6 nr_out=5\n", "got expected status ($xhc->{impl})");
+	is $err, "mset.size=6 nr_out=5\n", "got expected status ($xhc->{impl})";
+
+	$r = $xhc->mkreq([], qw(mset -p -A XDFID -A Q), @ibx_shard_args,
+				'dfn:lib/PublicInbox/Search.pm');
+	chomp((my $hdr, @res) = readline($r));
+	is $hdr, 'mset.size=1', "got expected header via mset ($xhc->{impl})";
+	is scalar(@res), 1, 'got one result';
+	@res = split /\0/, $res[0];
+	{
+		my $doc = $v2->search->xdb->get_document($res[0]);
+		my @q = PublicInbox::Search::xap_terms('Q', $doc);
+		is_deeply \@q, [ $mid ], 'docid usable';
+	}
+	ok $res[1] > 0 && $res[1] <= 100, 'pct > 0 && <= 100';
+	is $res[2], 'XDFID'.$dfid, 'XDFID result matches';
+	is $res[3], 'Q'.$mid, 'Q (msgid) mset result matches';
+	is scalar(@res), 4, 'only 4 columns in result';
+
+	$r = $xhc->mkreq([], qw(mset -p -A XDFID -A Q), @ibx_shard_args,
+				'dt:19700101'.'000000..');
+	chomp(($hdr, @res) = readline($r));
+	is $hdr, 'mset.size=6',
+		"got expected header via multi-result mset ($xhc->{impl})";
+	is(scalar(@res), 6, 'got 6 rows');
+	for my $r (@res) {
+		my ($docid, $pct, @rest) = split /\0/, $r;
+		my $doc = $v2->search->xdb->get_document($docid);
+		ok $pct > 0 && $pct <= 100,
+			"pct > 0 && <= 100 #$docid ($xhc->{impl})";
+		my %terms;
+		for (@rest) {
+			s/\A([A-Z]+)// or xbail 'no prefix=', \@rest;
+			push @{$terms{$1}}, $_;
+		}
+		while (my ($pfx, $vals) = each %terms) {
+			@$vals = sort @$vals;
+			my @q = PublicInbox::Search::xap_terms($pfx, $doc);
+			is_deeply $vals, \@q,
+				"#$docid $pfx as expected ($xhc->{impl})";
+		}
+	}
 }
 
 done_testing;
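
Note on the XH_SPEC define in XapHelperCxx.pm above: the Perl
`s/=.*/:/' plus join('') turns the Getopt::Long spec into a getopt(3)
option string, so both implementations share one option list.  The
same transformation is sketched here in C++ purely for illustration.
to_optstring is a hypothetical name and nothing like it exists in the
patch, which does this step in Perl at build time.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical re-implementation of the Perl `s/=.*/:/' + join('') step:
// "d=s@" becomes "d:", while a bare flag such as "a" is kept as-is.
static std::string to_optstring(const std::vector<std::string> &spec)
{
	std::string out;
	for (size_t i = 0; i < spec.size(); ++i) {
		const std::string &s = spec[i];
		assert(!s.empty()); // every spec entry names an option
		size_t eq = s.find('=');
		if (eq == std::string::npos) {
			out += s; // boolean switch, no argument
		} else {
			out.append(s, 0, eq);
			out += ':'; // getopt(3): option takes an argument
		}
	}
	return out;
}
```

Fed the fifteen entries of @XH_SPEC from Search.pm, this yields
"acd:g:k:m:o:prtA:DK:O:T:".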

* [PATCH 07/14] hval: use File::Spec to make relative paths for href
  2023-11-28 14:56 [PATCH 00/14] IT'S ALIVE! www loads cindex join data Eric Wong
                   ` (5 preceding siblings ...)
  2023-11-28 14:56 ` [PATCH 06/14] xap_helper: implement mset endpoint for WWW, IMAP, etc Eric Wong
@ 2023-11-28 14:56 ` Eric Wong
  2023-11-28 14:56 ` [PATCH 08/14] www: load and use cindex join data Eric Wong
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 19+ messages in thread
From: Eric Wong @ 2023-11-28 14:56 UTC (permalink / raw)
  To: meta

File::Spec->abs2rel doesn't touch the filesystem at all when
given an absolute base arg ($env->{PATH_INFO}), so we can rely
on it to generate relative links that work with the `mount'
from Plack::Builder and also for people running `wget -r' mirrors.
---
 lib/PublicInbox/Hval.pm | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)
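
To illustrate what "doesn't touch the filesystem" means here: a
relative path between two absolute paths can be computed from the
strings alone, by dropping the common leading components and emitting
one ".." per remaining base component.  The sketch below is a
simplified C++ analogue, not File::Spec itself.  lexical_rel and
components are hypothetical names, and the sketch assumes Unix-style
absolute paths and does not normalize "." or ".." components.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// split a path on '/' and keep only its non-empty components
static std::vector<std::string> components(const std::string &path)
{
	std::vector<std::string> out;
	std::istringstream in(path);
	std::string part;
	while (std::getline(in, part, '/'))
		if (!part.empty()) out.push_back(part);
	return out;
}

// Hypothetical, simplified analogue of abs2rel for two absolute paths:
// pure string manipulation, no stat(2) or readlink(2) involved.
static std::string lexical_rel(const std::string &target,
				const std::string &base)
{
	assert(!target.empty() && target[0] == '/');
	assert(!base.empty() && base[0] == '/');
	std::vector<std::string> t = components(target);
	std::vector<std::string> b = components(base);
	size_t common = 0;
	while (common < t.size() && common < b.size() &&
			t[common] == b[common])
		++common;
	std::string rel;
	for (size_t i = common; i < b.size(); ++i) // climb out of base
		rel += rel.empty() ? ".." : "/..";
	for (size_t i = common; i < t.size(); ++i) { // descend to target
		if (!rel.empty()) rel += '/';
		rel += t[i];
	}
	return rel.empty() ? std::string(".") : rel;
}
```

For example, lexical_rel("/list/coderepo", "/list/T") gives
"../coderepo" without either path having to exist.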

diff --git a/lib/PublicInbox/Hval.pm b/lib/PublicInbox/Hval.pm
index e9b9ae64..b804254a 100644
--- a/lib/PublicInbox/Hval.pm
+++ b/lib/PublicInbox/Hval.pm
@@ -13,6 +13,7 @@ our @EXPORT_OK = qw/ascii_html obfuscate_addrs to_filename src_escape
 		to_attr prurl mid_href fmt_ts ts2str utf8_maybe/;
 use POSIX qw(strftime);
 my $enc_ascii = find_encoding('us-ascii');
+use File::Spec;
 
 # safe-ish acceptable filename pattern for portability
 our $FN = '[a-zA-Z0-9][a-zA-Z0-9_\-\.]+[a-zA-Z0-9]'; # needs \z anchor
@@ -69,7 +70,16 @@ sub prurl ($$) {
 		$u = $host_match[0] // $u->[0];
 		# fall through to below:
 	}
-	index($u, '//') == 0 ? "$env->{'psgi.url_scheme'}:$u" : $u;
+	my $dslash = index($u, '//');
+	if ($dslash == 0) {
+		"$env->{'psgi.url_scheme'}:$u"
+	} elsif ($dslash < 0 && substr($u, 0, 1) ne '/' &&
+			substr(my $path = $env->{PATH_INFO}, 0, 1) eq '/') {
+		# this won't touch the FS at all:
+		File::Spec->abs2rel("/$u", $path);
+	} else {
+		$u;
+	}
 }
 
 # for misguided people who believe in this stuff, give them a

* [PATCH 08/14] www: load and use cindex join data
  2023-11-28 14:56 [PATCH 00/14] IT'S ALIVE! www loads cindex join data Eric Wong
                   ` (6 preceding siblings ...)
  2023-11-28 14:56 ` [PATCH 07/14] hval: use File::Spec to make relative paths for href Eric Wong
@ 2023-11-28 14:56 ` Eric Wong
  2023-11-28 14:56 ` [PATCH 09/14] git: speed up ->git_path for non-worktrees Eric Wong
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 19+ messages in thread
From: Eric Wong @ 2023-11-28 14:56 UTC (permalink / raw)
  To: meta

This is a major step in solving the problem of having to
manually associate hundreds/thousands of coderepos with
hundreds/thousands of public-inboxes to power solver
(and more).
---
 lib/PublicInbox/CodeSearch.pm    | 153 +++++++++++++++++++++++++++++--
 lib/PublicInbox/CodeSearchIdx.pm |  42 ++++-----
 lib/PublicInbox/Config.pm        |  39 +++++++-
 lib/PublicInbox/Search.pm        |  17 ++++
 lib/PublicInbox/SearchIdx.pm     |  10 +-
 lib/PublicInbox/SolverGit.pm     |   6 +-
 lib/PublicInbox/View.pm          |   7 +-
 lib/PublicInbox/WWW.pm           |   1 +
 lib/PublicInbox/WwwCoderepo.pm   |  41 ++++++++-
 lib/PublicInbox/WwwText.pm       |  19 +++-
 t/cindex.t                       |  28 +++++-
 xt/solver.t                      |   3 +-
 12 files changed, 312 insertions(+), 54 deletions(-)

diff --git a/lib/PublicInbox/CodeSearch.pm b/lib/PublicInbox/CodeSearch.pm
index eb057525..7d7f6df6 100644
--- a/lib/PublicInbox/CodeSearch.pm
+++ b/lib/PublicInbox/CodeSearch.pm
@@ -21,7 +21,7 @@ use constant {
 our @CODE_NRP;
 our @CODE_VMAP = (
 	[ AT, 'd:' ], # mairix compat
-	[ AT, 'dt:' ], # mail compat
+	[ AT, 'dt:' ], # public-inbox mail compat
 	[ CT, 'ct:' ],
 );
 
@@ -51,7 +51,7 @@ my %prob_prefix = ( # copied from PublicInbox::Search
 sub new {
 	my ($cls, $dir, $cfg) = @_;
 	# can't have a PublicInbox::Config here due to circular refs
-	bless { xpfx => "$dir/cidx".CIDX_SCHEMA_VER,
+	bless { topdir => $dir, xpfx => "$dir/cidx".CIDX_SCHEMA_VER,
 		-cfg_f => $cfg->{-f} }, $cls;
 }
 
@@ -63,7 +63,20 @@ sub join_data {
 	my $cur = $self->xdb->get_metadata($key) or return;
 	$cur = eval { PublicInbox::Config::json()->decode(uncompress($cur)) };
 	warn "E: $@ (corrupt metadata in `$key' key?)" if $@;
-	$cur;
+	my @m = grep { ref($cur->{$_}) ne 'ARRAY' } qw(ekeys roots ibx2root);
+	if (@m) {
+		warn <<EOM;
+W: $self->{topdir} join data for $self->{-cfg_f} missing: @m
+EOM
+		undef;
+	} elsif (@{$cur->{ekeys}} != @{$cur->{ibx2root}}) {
+		warn <<EOM;
+W: $self->{topdir} join data for $self->{-cfg_f} mismatched ekeys and ibx2root
+EOM
+		undef;
+	} else {
+		$cur;
+	}
 }
 
 sub qparse_new ($) {
@@ -196,16 +209,136 @@ sub roots2paths { # for diagnostics
 	\%ret;
 }
 
-sub paths2roots { # for diagnostics
-	my ($self) = @_;
+sub root_oids ($$) {
+	my ($self, $git_dir) = @_;
+	my @ids = $self->docids_by_postlist('P'.$git_dir);
+	@ids or warn <<"";
+BUG? (non-fatal) `$git_dir' not indexed in $self->{topdir}
+
+	warn <<"" if @ids > 1;
+BUG: (non-fatal) $git_dir indexed multiple times in $self->{topdir}
+
 	my %ret;
-	my $tmp = roots2paths($self);
-	for my $root_oidhex (keys %$tmp) {
-		my $paths = delete $tmp->{$root_oidhex};
-		push @{$ret{$_}}, $root_oidhex for @$paths;
+	for my $docid (@ids) {
+		my @oids = xap_terms('G', $self->xdb, $docid);
+		@ret{@oids} = @oids;
+	}
+	sort keys %ret;
+}
+
+sub paths2roots {
+	my ($self, $paths) = @_;
+	my %ret;
+	if ($paths) {
+		for my $p (keys %$paths) { @{$ret{$p}} = root_oids($self, $p) }
+	} else {
+		my $tmp = roots2paths($self);
+		for my $root_oidhex (keys %$tmp) {
+			my $paths = delete $tmp->{$root_oidhex};
+			push @{$ret{$_}}, $root_oidhex for @$paths;
+		}
+		@$_ = sort(@$_) for values %ret;
 	}
-	@$_ = sort(@$_) for values %ret;
 	\%ret;
 }
 
+sub load_commit_times { # each_cindex callback
+	my ($self, $todo) = @_; # todo = [ [ time, git ], [ time, git ] ...]
+	my (@pending, $rec, $dir, @ids, $doc);
+	while ($rec = shift @$todo) {
+		@ids = $self->docids_by_postlist('P'.$rec->[1]->{git_dir});
+		if (@ids) {
+			warn <<EOM if @ids > 1;
+W: $rec->[1]->{git_dir} indexed multiple times in $self->{topdir}
+EOM
+			for (@ids) {
+				$doc = $self->get_doc($_) // next;
+				$rec->[0] = int_val($doc, CT);
+				last;
+			}
+		} else { # may be in another cindex:
+			push @pending, $rec;
+		}
+	}
+	@$todo = @pending;
+}
+
+sub load_coderepos { # each_cindex callback
+	my ($self, $pi_cfg) = @_;
+	my $name = $self->{name};
+	my $cfg_f = $pi_cfg->{-f};
+	my $lpfx = $self->{localprefix} or return warn <<EOM;
+W: cindex.$name.localprefix unset in $cfg_f, ignoring cindex.$name
+EOM
+	my $lre = join('|', map { $_ .= '/'; tr!/!/!s; quotemeta } @$lpfx);
+	$lre = qr!\A(?:$lre)!;
+	my $coderepos = $pi_cfg->{-coderepos};
+	my $nick_pfx = $name eq '' ? '' : "$name/";
+	my %dir2cr;
+	for my $p ($self->all_terms('P')) {
+		my $nick = $p;
+		$nick =~ s!$lre!$nick_pfx!s or next;
+		$dir2cr{$p} = $coderepos->{$nick} //= do {
+			my $git = PublicInbox::Git->new($p);
+			$git->{nick} = $nick; # for git->pub_urls
+			$git;
+		};
+	}
+	my $jd = join_data($self) or return warn <<EOM;
+W: cindex.$name.topdir=$self->{topdir} has no usable join data for $cfg_f
+EOM
+	my ($ekeys, $roots, $ibx2root) = @$jd{qw(ekeys roots ibx2root)};
+	my $roots2paths = roots2paths($self);
+	for my $root_offs (@$ibx2root) {
+		my $ekey = shift(@$ekeys) // die 'BUG: {ekeys} empty';
+		scalar(@$root_offs) or next;
+		my $ibx = $pi_cfg->lookup_eidx_key($ekey) // do {
+			warn "W: `$ekey' gone from $cfg_f\n";
+			next;
+		};
+		my $gits = $ibx->{-repo_objs} //= [];
+		my $cr_score = $ibx->{-cr_score} //= {};
+		my %ibx_p2g = map { $_->{git_dir} => $_ } @$gits;
+		my $ibx2self; # cindex has an association w/ inbox?
+		for (@$root_offs) { # sorted by $nr descending
+			my ($nr, $root_off) = @$_;
+			my $root_oid = $roots->[$root_off] // do {
+				warn <<EOM;
+BUG: root #$root_off invalid in join data for `$ekey' with $cfg_f
+EOM
+				next;
+			};
+			my $git_dirs = $roots2paths->{$root_oid};
+			my @gits = map { $dir2cr{$_} // () } @$git_dirs;
+			$cr_score->{$_->{nick}} //= $nr for @gits;
+			@$git_dirs = grep { !$ibx_p2g{$_} } @$git_dirs;
+			# @$git_dirs or warn "W: no matches for $root_oid\n";
+			for (@$git_dirs) {
+				if (my $git = $dir2cr{$_}) {
+					$ibx_p2g{$_} = $git;
+					$ibx2self = 1;
+					$ibx->{-hide}->{www} or
+						push @{$git->{ibx_score}},
+							[ $nr, $ibx->{name} ];
+					push @$gits, $git;
+				} else {
+					warn <<EOM;
+W: no coderepo available for $_ (localprefix=@$lpfx)
+EOM
+				}
+			}
+		}
+		if (@$gits) {
+			push @{$ibx->{-csrch}}, $self if $ibx2self;
+		} else {
+			delete $ibx->{-repo_objs};
+			delete $ibx->{-cr_score};
+		}
+	}
+	for my $git (values %dir2cr) {
+		my $s = $git->{ibx_score};
+		@$s = sort { $b->[0] <=> $a->[0] } @$s if $s;
+	}
+}
+
 1;
diff --git a/lib/PublicInbox/CodeSearchIdx.pm b/lib/PublicInbox/CodeSearchIdx.pm
index bb1d698b..a6cbe0b0 100644
--- a/lib/PublicInbox/CodeSearchIdx.pm
+++ b/lib/PublicInbox/CodeSearchIdx.pm
@@ -172,7 +172,7 @@ sub count_shards { scalar($_[0]->xdb_shards_flat) }
 sub update_commit ($$$) {
 	my ($self, $cmt, $roots) = @_; # fields from @FMT
 	my $x = 'Q'.$cmt->{H};
-	my ($docid, @extra) = sort { $a <=> $b } docids_by_postlist($self, $x);
+	my ($docid, @extra) = sort { $a <=> $b } $self->docids_by_postlist($x);
 	@extra and warn "W: $cmt->{H} indexed multiple times, pruning ",
 			join(', ', map { "#$_" } @extra), "\n";
 	$self->{xdb}->delete_document($_) for @extra;
@@ -377,15 +377,6 @@ sub seen ($$) {
 # used to select the shard for a GIT_DIR
 sub git_dir_hash ($) { hex(substr(sha256_hex($_[0]), 0, 8)) }
 
-sub docids_by_postlist ($$) { # consider moving to PublicInbox::Search
-	my ($self, $q) = @_;
-	my $cur = $self->{xdb}->postlist_begin($q);
-	my $end = $self->{xdb}->postlist_end($q);
-	my @ids;
-	for (; $cur != $end; $cur++) { push(@ids, $cur->get_docid) };
-	@ids;
-}
-
 sub _cb { # run_await cb
 	my ($pid, $cmd, undef, $opt, $cb, $self, $git, @arg) = @_;
 	return if $DO_QUIT;
@@ -452,7 +443,7 @@ sub prep_repo ($$) {
 
 sub check_existing { # retry_reopen callback
 	my ($shard, $self, $git) = @_;
-	my @docids = docids_by_postlist($shard, 'P'.$git->{git_dir});
+	my @docids = $shard->docids_by_postlist('P'.$git->{git_dir});
 	my $docid = shift(@docids) // return get_roots($self, $git);
 	my $doc = $shard->get_doc($docid) //
 			die "BUG: no #$docid ($git->{git_dir})";
@@ -778,7 +769,7 @@ sub prune_init { # via wq_io_do in IDX_SHARDS
 
 sub prune_one { # via wq_io_do in IDX_SHARDS
 	my ($self, $term) = @_;
-	my @docids = docids_by_postlist($self, $term);
+	my @docids = $self->docids_by_postlist($term);
 	for (@docids) {
 		$TXN_BYTES -= $self->{xdb}->get_doclength($_) * 42;
 		$self->{xdb}->delete_document($_);
@@ -894,10 +885,9 @@ sub current_join_data ($) {
 sub score_old_join_data ($$$) {
 	my ($self, $score, $ekeys_new) = @_;
 	my $old = ($JOIN{reset} ? undef : current_join_data($self)) or return;
-	my @old = @$old{qw(ekeys roots ibx2root)};
-	@old == 3 or return warn "W: ekeys/roots missing from old JOIN data\n";
 	progress($self, 'merging old join data...');
-	my ($ekeys_old, $roots_old, $ibx2root_old) = @old;
+	my ($ekeys_old, $roots_old, $ibx2root_old) =
+					@$old{qw(ekeys roots ibx2root)};
 	# score: "ibx_off root_off" => nr
 	my $i = -1;
 	my %root2id_new = map { $_ => ++$i } @OFF2ROOT;
@@ -905,16 +895,24 @@ sub score_old_join_data ($$$) {
 	my %ekey2id_new = map { $_ => ++$i } @$ekeys_new;
 	for my $ibx_off_old (0..$#$ibx2root_old) {
 		my $root_offs_old = $ibx2root_old->[$ibx_off_old];
-		my $ekey = $ekeys_old->[$ibx_off_old] //
-			warn "W: no ibx #$ibx_off_old in old JOIN data\n";
-		my $ibx_off_new = $ekey2id_new{$ekey // next} //
+		my $ekey = $ekeys_old->[$ibx_off_old] // do {
+			warn "W: no ibx #$ibx_off_old in old join data\n";
+			next;
+		};
+		my $ibx_off_new = $ekey2id_new{$ekey} // do {
 			warn "W: `$ekey' no longer exists\n";
+			next;
+		};
 		for (@$root_offs_old) {
 			my ($nr, $rid_old) = @$_;
-			my $root_old = $roots_old->[$rid_old] //
-				warn "W: no root #$rid_old in old JOIN data\n";
-			my $rid_new = $root2id_new{$root_old // next} //
+			my $root_old = $roots_old->[$rid_old] // do {
+				warn "W: no root #$rid_old in old data\n";
+				next;
+			};
+			my $rid_new = $root2id_new{$root_old} // do {
 				warn "W: root `$root_old' no longer exists\n";
+				next;
+			};
 			$score->{"$ibx_off_new $rid_new"} += $nr;
 		}
 	}
@@ -963,7 +961,7 @@ sub do_join {
 		progress($self, "$ekey => $root has $nr matches");
 		push @{$new->{ibx2root}->[$ibx_off]}, [ $nr, $root_off ];
 	}
-	for my $ary (values %$new) { # sort by nr
+	for my $ary (values %$new) { # sort by nr (largest first)
 		for (@$ary) { @$_ = sort { $b->[0] <=> $a->[0] } @$_ }
 	}
 	$new->{ekeys} = \@ekeys;
diff --git a/lib/PublicInbox/Config.pm b/lib/PublicInbox/Config.pm
index 9bee94b8..779e3140 100644
--- a/lib/PublicInbox/Config.pm
+++ b/lib/PublicInbox/Config.pm
@@ -412,8 +412,8 @@ sub get_1 {
 
 sub repo_objs {
 	my ($self, $ibxish) = @_;
-	my $ibx_coderepos = $ibxish->{coderepo} // return;
 	$ibxish->{-repo_objs} // do {
+		my $ibx_coderepos = $ibxish->{coderepo} // return;
 		parse_cgitrc($self, undef, 0);
 		my $coderepos = $self->{-coderepos};
 		my @repo_objs;
@@ -568,6 +568,43 @@ sub _fill_ei ($$) {
 	$es;
 }
 
+sub _fill_csrch ($$) {
+	my ($self, $name) = @_; # "" is a valid name for cindex
+	return if $name ne '' && !valid_foo_name($name, 'cindex');
+	eval { require PublicInbox::CodeSearch } or return;
+	my $pfx = "cindex.$name";
+	my $d = $self->{"$pfx.topdir"} // return;
+	-d $d or return;
+	if (index($d, "\n") >= 0) {
+		warn "E: `$d' must not contain `\\n'\n";
+		return;
+	}
+	my $csrch = PublicInbox::CodeSearch->new($d, $self);
+	for my $k (qw(localprefix)) {
+		my $v = $self->{"$pfx.$k"} // next;
+		$csrch->{$k} = _array($v);
+	}
+	$csrch->{name} = $name;
+	$csrch;
+}
+
+sub lookup_cindex ($$) {
+	my ($self, $name) = @_;
+	$self->{-csrch_by_name}->{$name} //= _fill_csrch($self, $name);
+}
+
+sub each_cindex {
+	my ($self, $cb, @arg) = @_;
+	my @csrch = map {
+		lookup_cindex($self, substr($_, length('cindex.'))) // ()
+	} grep(m!\Acindex\.[^\./]*\z!, @{$self->{-section_order}});
+	if (ref($cb) eq 'CODE') {
+		$cb->($_, @arg) for @csrch;
+	} else { # string function
+		$_->$cb(@arg) for @csrch;
+	}
+}
+
 sub config_cmd {
 	my ($self, $env, $opt) = @_;
 	my $f = $self->{-f} // default_file();
diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index 6145b027..8ef17d58 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -649,4 +649,21 @@ sub xh_args { # prep getopt args to feed to xap_helper.h socket
 	map { ('-d', $_) } shard_dirs($_[0]);
 }
 
+sub docids_by_postlist ($$) {
+	my ($self, $q) = @_;
+	my $cur = $self->xdb->postlist_begin($q);
+	my $end = $self->{xdb}->postlist_end($q);
+	my @ids;
+	for (; $cur != $end; $cur++) { push(@ids, $cur->get_docid) };
+	@ids;
+}
+
+sub get_doc ($$) {
+	my ($self, $docid) = @_;
+	eval { $self->{xdb}->get_document($docid) } // do {
+		die $@ if $@ && ref($@) !~ /\bDocNotFoundError\b/;
+		undef;
+	}
+}
+
 1;
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index f569428c..17538027 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -545,17 +545,9 @@ sub add_message {
 	$smsg->{num};
 }
 
-sub get_doc ($$) {
-	my ($self, $docid) = @_;
-	eval { $self->{xdb}->get_document($docid) } // do {
-		die $@ if $@ && ref($@) !~ /\bDocNotFoundError\b/;
-		undef;
-	}
-}
-
 sub _get_doc ($$) {
 	my ($self, $docid) = @_;
-	get_doc($self, $docid) // do {
+	$self->get_doc($docid) // do {
 		warn "E: #$docid missing in Xapian\n";
 		undef;
 	}
diff --git a/lib/PublicInbox/SolverGit.pm b/lib/PublicInbox/SolverGit.pm
index 7cc10198..4e79f750 100644
--- a/lib/PublicInbox/SolverGit.pm
+++ b/lib/PublicInbox/SolverGit.pm
@@ -643,9 +643,13 @@ sub resolve_patch ($$) {
 # so user_cb never references the SolverGit object
 sub new {
 	my ($class, $ibx, $user_cb, $uarg) = @_;
+	my $gits = $ibx ? $ibx->{-repo_objs} : undef;
+
+	# FIXME: cindex --join= is super-aggressive and may hit too many
+	$gits = [ @$gits[0..2] ] if $gits && @$gits > 3;
 
 	bless { # $ibx is undef if coderepo only (see WwwCoderepo)
-		gits => $ibx ? $ibx->{-repo_objs} : undef,
+		gits => $gits,
 		user_cb => $user_cb,
 		uarg => $uarg,
 		# -cur_di, -qsp_err, -msg => temp fields for Qspawn callbacks
diff --git a/lib/PublicInbox/View.pm b/lib/PublicInbox/View.pm
index e5f748f7..d81c66b7 100644
--- a/lib/PublicInbox/View.pm
+++ b/lib/PublicInbox/View.pm
@@ -80,7 +80,7 @@ sub msg_page {
 	# allow user to easily browse the range around this message if
 	# they have ->over
 	$ctx->{-t_max} = $smsg->{ts};
-	$ctx->{-spfx} = '../' if $ibx->{coderepo};
+	$ctx->{-spfx} = '../' if $ibx->{-repo_objs};
 	PublicInbox::WwwStream::aresponse($ctx, \&msg_page_i);
 }
 
@@ -443,7 +443,7 @@ sub thread_html {
 	my $ibx = $ctx->{ibx};
 	my ($nr, $msgs) = $ibx->over->get_thread($mid);
 	return missing_thread($ctx) if $nr == 0;
-	$ctx->{-spfx} = '../../' if $ibx->{coderepo};
+	$ctx->{-spfx} = '../../' if $ibx->{-repo_objs};
 
 	# link $INBOX_DIR/description text to "index_topics" view around
 	# the newest message in this thread
@@ -779,6 +779,9 @@ href=#t>this message</a>:
 <input type=submit value=search
 />\t(<a href=${upfx}_/text/help/#search>help</a>)</pre></form>
 EOM
+		# TODO: related codesearch
+		# my $csrchv = $ctx->{ibx}->{-csrch} // [];
+		# push @related, '<pre>'.ascii_html(Dumper($csrchv)).'</pre>';
 	}
 	if ($ctx->{ibx}->over) {
 		my $t = ts2str($ctx->{-t_max});
diff --git a/lib/PublicInbox/WWW.pm b/lib/PublicInbox/WWW.pm
index 6b616bd4..289599b8 100644
--- a/lib/PublicInbox/WWW.pm
+++ b/lib/PublicInbox/WWW.pm
@@ -189,6 +189,7 @@ sub preload {
 		}
 		$pi_cfg->ALL and require PublicInbox::Isearch;
 		$self->cgit;
+		$self->coderepo;
 		$self->stylesheets_prepare($_) for ('', '../', '../../');
 		$self->news_www;
 	}
diff --git a/lib/PublicInbox/WwwCoderepo.pm b/lib/PublicInbox/WwwCoderepo.pm
index 0eb4a2d6..8ab4911f 100644
--- a/lib/PublicInbox/WwwCoderepo.pm
+++ b/lib/PublicInbox/WwwCoderepo.pm
@@ -14,12 +14,14 @@ use PublicInbox::ViewVCS;
 use PublicInbox::WwwStatic qw(r);
 use PublicInbox::GitHTTPBackend;
 use PublicInbox::WwwStream;
-use PublicInbox::Hval qw(ascii_html utf8_maybe);
+use PublicInbox::Hval qw(prurl ascii_html utf8_maybe);
 use PublicInbox::ViewDiff qw(uri_escape_path);
 use PublicInbox::RepoSnapshot;
 use PublicInbox::RepoAtom;
 use PublicInbox::RepoTree;
 use PublicInbox::OnDestroy;
+use URI::Escape qw(uri_escape_utf8);
+use File::Spec;
 
 my @EACH_REF = (qw(git for-each-ref --sort=-creatordate),
 		"--format=%(HEAD)%00".join('%00', map { "%($_)" }
@@ -62,6 +64,7 @@ sub prepare_coderepos {
 		my $eidx = $pi_cfg->lookup_ei($k) // next;
 		$pi_cfg->repo_objs($eidx);
 	}
+	$pi_cfg->each_cindex('load_coderepos', $pi_cfg);
 }
 
 sub new {
@@ -119,6 +122,41 @@ sub _refs_tags_link {
 		"</a>$align ", ascii_html($s), " ($cd)", @snap_fmt, "\n");
 }
 
+sub emit_joined_inboxes ($) {
+	my ($ctx) = @_;
+	my $names = $ctx->{git}->{ibx_names}; # coderepo directives in config
+	my $score = $ctx->{git}->{ibx_score}; # generated w/ cindex --join
+	($names || $score) or return;
+	my $pi_cfg = $ctx->{wcr}->{pi_cfg};
+	my ($u, $h);
+	my $zfh = $ctx->zfh;
+	print $zfh "\n# associated public inboxes:",
+		"\n# (number on the left is used for dev purposes)";
+	my @ns = map { [ 0, $_ ] } @$names;
+	my $env = $ctx->{env};
+	for (@ns, @$score) {
+		my ($nr, $name) = @$_;
+		my $ibx = $pi_cfg->lookup_name($name) // do {
+			warn "W: inbox `$name' gone for $ctx->{git}->{git_dir}";
+			say $zfh '# ', ascii_html($name), ' (missing inbox?)';
+			next;
+		};
+		if (scalar(@{$ibx->{url} // []})) {
+			$u = $h = ascii_html(prurl($env, $ibx->{url}));
+		} else {
+			$h = ascii_html(prurl($env, uri_escape_utf8($name)));
+			$h .= '/';
+			$u = ascii_html($name);
+		}
+		if ($nr) {
+			printf $zfh "\n% 11u", $nr;
+		} else {
+			print $zfh "\n", ' 'x11;
+		}
+		print $zfh qq{ <a\nhref="$h">$u</a>};
+	}
+}
+
 sub summary_END { # called via OnDestroy
 	my ($ctx) = @_;
 	my $wcb = delete($ctx->{-wcb}) or return; # already done
@@ -174,6 +212,7 @@ EOM
 	for (@r) { print $zfh _refs_tags_link($_, './', $snap_pfx, @snap_fmt) }
 	print $zfh $NO_TAGS if !@r;
 	print $zfh qq(<a href="refs/tags/">...</a>\n) if $last;
+	emit_joined_inboxes $ctx;
 	$wcb->($ctx->html_done('</pre>'));
 }
 
diff --git a/lib/PublicInbox/WwwText.pm b/lib/PublicInbox/WwwText.pm
index f4508b3f..4b4b2e4c 100644
--- a/lib/PublicInbox/WwwText.pm
+++ b/lib/PublicInbox/WwwText.pm
@@ -7,7 +7,7 @@ use strict;
 use v5.10.1;
 use PublicInbox::Linkify;
 use PublicInbox::WwwStream;
-use PublicInbox::Hval qw(ascii_html prurl);
+use PublicInbox::Hval qw(ascii_html prurl fmt_ts);
 use HTTP::Date qw(time2str);
 use URI::Escape qw(uri_escape_utf8);
 use PublicInbox::GzipFilter qw(gzf_maybe);
@@ -248,14 +248,23 @@ EOS
 
 sub coderepos_raw ($$) {
 	my ($ctx, $top_url) = @_;
-	my $cr = $ctx->{ibx}->{coderepo} // return ();
 	my $cfg = $ctx->{www}->{pi_cfg};
+	my $cr = $cfg->repo_objs($ctx->{ibx}) or return ();
 	my $buf = 'Code repositories for project(s) associated with this '.
-		$ctx->{ibx}->thing_type . "\n";
-	for my $git (@{$ctx->{www}->{pi_cfg}->repo_objs($ctx->{ibx})}) {
+		$ctx->{ibx}->thing_type . ":\n";
+	my @recs = map { [ 0, $_ ] } @$cr;
+	my @todo = @recs;
+	$cfg->each_cindex('load_commit_times', \@todo);
+	@recs = sort { $b->[0] <=> $a->[0] } @recs;
+	my $cr_score = $ctx->{ibx}->{-cr_score};
+	for (@recs) {
+		my ($t, $git) = @$_;
 		for ($git->pub_urls($ctx->{env})) {
 			my $u = m!\A(?:[a-z\+]+:)?//!i ? $_ : $top_url.$_;
-			$buf .= "\n\t" . prurl($ctx->{env}, $u);
+			my $nr = $cr_score->{$git->{nick}};
+			$buf .= "\n";
+			$buf .= $nr ? sprintf('% 9u', $nr) : (' 'x9);
+			$buf .= ' '.fmt_ts($t).' '.prurl($ctx->{env}, $u);
 		}
 	}
 	($buf);
diff --git a/t/cindex.t b/t/cindex.t
index a9075092..29d88ca8 100644
--- a/t/cindex.t
+++ b/t/cindex.t
@@ -5,7 +5,7 @@ use v5.12;
 use PublicInbox::TestCommon;
 use Cwd qw(getcwd abs_path);
 use List::Util qw(sum);
-use autodie qw(close open rename);
+use autodie qw(close mkdir open rename);
 require_mods(qw(json Xapian +SCM_RIGHTS));
 use_ok 'PublicInbox::CodeSearchIdx';
 use PublicInbox::Import;
@@ -227,7 +227,7 @@ SKIP: { # --prune
 }
 
 File::Path::remove_tree("$tmp/ext");
-ok(mkdir("$tmp/ext", 0707), 'create $tmp/ext with odd permissions');
+mkdir("$tmp/ext", 0707);
 ok(run_script([qw(-cindex --dangerous -q -d), "$tmp/ext", $zp]),
 	'external on existing dir');
 {
@@ -265,4 +265,28 @@ EOM
 		'non-Xapian-enabled inbox noted');
 }
 
+# we need to support blank sections for top-level repos
+# (e.g. <https://example.com/my-project>)
+# git.kernel.org could use "pub" as section name, though, since all git repos
+# are currently under //git.kernel.org/pub/**/*
+{
+	mkdir(my $d = "$tmp/blanksection");
+	my $cfg = cfg_new($d, <<EOM);
+[cindex ""]
+	topdir = $tmp/ext
+	localprefix = $tmp
+EOM
+	my $csrch = $cfg->lookup_cindex('');
+	is ref($csrch), 'PublicInbox::CodeSearch', 'codesearch w/ blank name';
+	is_deeply $csrch->{localprefix}, [ "$tmp" ], 'localprefix respected';
+	my $nr = 0;
+	$cfg->each_cindex(sub {
+		my ($cs, @rest) = @_;
+		is $cs->{topdir}, $csrch->{topdir}, 'each_cindex works';
+		is_deeply \@rest, [ '.' ], 'got expected arg';
+		++$nr;
+	}, '.');
+	is $nr, 1, 'iterated through cindices';
+}
+
 done_testing;
diff --git a/xt/solver.t b/xt/solver.t
index 51b4144c..372d003b 100644
--- a/xt/solver.t
+++ b/xt/solver.t
@@ -10,6 +10,7 @@ use_ok($_) for @psgi;
 use_ok 'PublicInbox::WWW';
 my $cfg = PublicInbox::Config->new;
 my $www = PublicInbox::WWW->new($cfg);
+$www->preload;
 my $app = sub {
 	my $env = shift;
 	$env->{'psgi.errors'} = \*STDERR;
@@ -63,7 +64,7 @@ while (my ($ibx_name, $urls) = each %$todo) {
 			skip(qq{[publicinbox "$ibx_name"] not configured},
 				scalar(@$urls));
 		}
-		if (!defined($ibx->{coderepo})) {
+		if (!defined($ibx->{-repo_objs})) {
 			push @gone, $ibx_name;
 			skip(qq{publicinbox.$ibx_name.coderepo not configured},
 				scalar(@$urls));

* [PATCH 09/14] git: speed up ->git_path for non-worktrees
  2023-11-28 14:56 [PATCH 00/14] IT'S ALIVE! www loads cindex join data Eric Wong
                   ` (7 preceding siblings ...)
  2023-11-28 14:56 ` [PATCH 08/14] www: load and use cindex join data Eric Wong
@ 2023-11-28 14:56 ` Eric Wong
  2023-11-28 14:56 ` [PATCH 10/14] cindex: require `-g GIT_DIR' or `-r PROJECT_ROOT' Eric Wong
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 19+ messages in thread
From: Eric Wong @ 2023-11-28 14:56 UTC (permalink / raw)
  To: meta

Only worktrees need to use `git rev-parse --git-path', so avoid
the spawn overhead of a new process.  With the SolverGit.pm
limit on coderepo scans disabled and scanning over 800 git repos
for git@vger matches, this reduces xt/solver.t times by
roughly 25%.
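
A minimal standalone sketch of the idea (not the code from this
patch): check the filesystem first and only fall back to the
expensive path when the file is missing.  The $slow callback is
a hypothetical stand-in for spawning `git rev-parse --git-path',
and git_path_sketch() is an illustrative name.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $spawned = 0; # counts how often the slow path runs

# $slow is a hypothetical callback standing in for spawning
# `git rev-parse --git-path'
sub git_path_sketch {
	my ($cache, $git_dir, $path, $slow) = @_;
	$cache->{$path} //= do {
		my $d = "$git_dir/$path";
		-e $d ? $d : $slow->($git_dir, $path);
	};
}

my $git_dir = tempdir(CLEANUP => 1);
mkdir "$git_dir/objects" or die "mkdir: $!";
my $slow = sub { ++$spawned; "$_[0]/$_[1]" };
my %cache;

# exists on disk, so no fallback is needed:
my $objects = git_path_sketch(\%cache, $git_dir, 'objects', $slow);

# missing, so the fallback runs once and the result is cached:
my $missing = git_path_sketch(\%cache, $git_dir, 'no-such-file', $slow);
git_path_sketch(\%cache, $git_dir, 'no-such-file', $slow);

print "slow path ran $spawned time(s)\n";
```

Presumably the common case for bare repos and normal checkouts
is that the path exists directly under $GIT_DIR, so the spawn
is skipped entirely.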
---
 lib/PublicInbox/Git.pm | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/lib/PublicInbox/Git.pm b/lib/PublicInbox/Git.pm
index 7c6e15b7..a374649f 100644
--- a/lib/PublicInbox/Git.pm
+++ b/lib/PublicInbox/Git.pm
@@ -100,14 +100,17 @@ sub new {
 sub git_path ($$) {
 	my ($self, $path) = @_;
 	$self->{-git_path}->{$path} //= do {
-		local $/ = "\n";
-		chomp(my $str = $self->qx(qw(rev-parse --git-path), $path));
-
-		# git prior to 2.5.0 did not understand --git-path
-		if ($str eq "--git-path\n$path") {
-			$str = "$self->{git_dir}/$path";
+		my $d = "$self->{git_dir}/$path";
+		if (-e $d) {
+			$d;
+		} else {
+			local $/ = "\n";
+			my $s = $self->qx(qw(rev-parse --git-path), $path);
+			chomp $s;
+
+			# git prior to 2.5.0 did not understand --git-path
+			$s eq "--git-path\n$path" ? $d : $s;
 		}
-		$str;
 	};
 }
 

* [PATCH 10/14] cindex: require `-g GIT_DIR' or `-r PROJECT_ROOT'
  2023-11-28 14:56 [PATCH 00/14] IT'S ALIVE! www loads cindex join data Eric Wong
                   ` (8 preceding siblings ...)
  2023-11-28 14:56 ` [PATCH 09/14] git: speed up ->git_path for non-worktrees Eric Wong
@ 2023-11-28 14:56 ` Eric Wong
  2023-11-28 14:56 ` [PATCH 11/14] git: speed up Git->new by 5% or so Eric Wong
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 19+ messages in thread
From: Eric Wong @ 2023-11-28 14:56 UTC (permalink / raw)
  To: meta

Accepting @ARGV without switches ends up being ambiguous with
optional parameters for --join and --show.  Requiring users to
specify `--join=' or `--show=' is a bit awkward (as it is with
-clone --objstore= and the like), but that is historical baggage
we need to carry at this point...
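
A small sketch of the ambiguity, assuming Getopt::Long's default
configuration (the script itself also enables gnu_getopt and
friends, which this does not reproduce).  The parse_args()
helper and the example path are made up for illustration.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# returns (\%opt, \@leftover) for a copy of the given argument list
sub parse_args {
	my @argv = @_;
	my %opt;
	GetOptionsFromArray(\@argv, \%opt, 'join:s@', 'git-dir|g=s@')
		or die "bad options\n";
	(\%opt, \@argv);
}

# old style: the directory is swallowed as the --join suboption
my ($old, $old_rest) = parse_args(qw(--join /srv/git/example.git));

# new style: `-g' looks like an option, so --join gets an empty value
my ($new, $new_rest) = parse_args(qw(--join -g /srv/git/example.git));

print "old: join=@{$old->{join}} leftover=@$old_rest\n";
print "new: join=@{$new->{join}} git-dir=@{$new->{'git-dir'}}\n";
```

With a switch in front of every directory, an optional-value
option can no longer mistake a GIT_DIR for its own argument.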
---
 Documentation/public-inbox-cindex.pod |  2 +-
 lib/PublicInbox/CodeSearchIdx.pm      |  5 ++--
 script/public-inbox-cindex            | 38 ++++++++++++++++++---------
 t/cindex-join.t                       |  7 ++++-
 t/cindex.t                            |  9 ++++---
 t/xap_helper.t                        |  4 +--
 6 files changed, 42 insertions(+), 23 deletions(-)

diff --git a/Documentation/public-inbox-cindex.pod b/Documentation/public-inbox-cindex.pod
index 3ff394be..0c9c4bdb 100644
--- a/Documentation/public-inbox-cindex.pod
+++ b/Documentation/public-inbox-cindex.pod
@@ -4,7 +4,7 @@ public-inbox-cindex - create and update search for code repositories
 
 =head1 SYNOPSIS
 
-public-inbox-cindex [OPTIONS] GIT_DIR...
+public-inbox-cindex [OPTIONS] -g GIT_DIR [-g GIT_DIR]
 
 public-inbox-cindex [OPTIONS] --update
 
diff --git a/lib/PublicInbox/CodeSearchIdx.pm b/lib/PublicInbox/CodeSearchIdx.pm
index a6cbe0b0..d49e9a8d 100644
--- a/lib/PublicInbox/CodeSearchIdx.pm
+++ b/lib/PublicInbox/CodeSearchIdx.pm
@@ -1131,8 +1131,6 @@ sub init_join_prefork ($) {
 	} split(/,/, join(',', @$subopt));
 	require PublicInbox::CidxXapHelperAux;
 	require PublicInbox::XapClient;
-	my $cfg = $self->{-opt}->{-pi_cfg} // die 'BUG: -pi_cfg unset';
-	$self->{-cfg_f} = $cfg->{-f} = rel2abs_collapsed($cfg->{-f});
 	my @unknown;
 	my $pfx = $JOIN{prefixes} // 'patchid';
 	for (split /\+/, $pfx) {
@@ -1223,7 +1221,8 @@ sub cidx_run { # main entry point
 				$PublicInbox::SearchIdx::BATCH_BYTES;
 	local $MAX_SIZE = $self->{-opt}->{max_size};
 	local $self->{PENDING} = {}; # used by PublicInbox::CidxXapHelperAux
-	local $self->{-cfg_f};
+	my $cfg = $self->{-opt}->{-pi_cfg} // die 'BUG: -pi_cfg unset';
+	$self->{-cfg_f} = $cfg->{-f} = rel2abs_collapsed($cfg->{-f});
 	if (grep { $_ } @{$self->{-opt}}{qw(prune join)}) {
 		require File::Temp;
 		$TMPDIR = File::Temp->newdir('cidx-all-git-XXXX', TMPDIR => 1);
diff --git a/script/public-inbox-cindex b/script/public-inbox-cindex
index 97890c1b..a015d7a4 100755
--- a/script/public-inbox-cindex
+++ b/script/public-inbox-cindex
@@ -4,8 +4,8 @@
 use v5.12;
 use Getopt::Long qw(:config gnu_getopt no_ignore_case auto_abbrev);
 my $help = <<EOF; # the following should fit w/o scrolling in 80x24 term:
-usage: public-inbox-cindex [options] GIT_DIR...
-usage: public-inbox-cindex [options] --project-list=FILE PROJECT_ROOT
+usage: public-inbox-cindex [options] -g GIT_DIR...
+usage: public-inbox-cindex [options] --project-list=FILE -r PROJECT_ROOT
 
   Create and update search indices for code repos
 
@@ -29,7 +29,8 @@ GetOptions($opt, qw(quiet|q verbose|v+ reindex jobs|j=i fsync|sync! dangerous
 		indexlevel|index-level|L=s join:s@
 		batch_size|batch-size=s max_size|max-size=s
 		include|I=s@ only=s@ all show:s@
-		project-list=s exclude=s@
+		project-list=s exclude=s@ project-root|r=s
+		git-dir|g=s@
 		sort-parallel=s sort-compress-program=s sort-buffer-size=s
 		d=s update|u scan! prune dry-run|n C=s@ help|h))
 	or die $help;
@@ -50,23 +51,36 @@ PublicInbox::Admin::progress_prepare($opt);
 my $env = PublicInbox::Admin::index_prepare($opt, $cfg);
 %ENV = (%ENV, %$env) if $env;
 
-require PublicInbox::CodeSearchIdx; # unstable internal API
 my @git_dirs;
-if (defined(my $pl = $opt->{'project-list'})) {
-	my $pfx = shift @ARGV // die <<EOM;
+require PublicInbox::CodeSearchIdx; # unstable internal API
+if (@ARGV) {
+	my @g = map { "-g $_" } @ARGV;
+	die <<EOM;
+Specify git directories with `-g' (or --git-dir=): @g
+Or use --project-list=... and --project-root=...
+EOM
+} elsif (defined(my $pl = $opt->{'project-list'})) {
+	my $pfx = $opt->{'project-root'} // die <<EOM;
 PROJECTS_ROOT required for --project-list
 EOM
-	@ARGV and die <<EOM;
---project-list does not accept additional directories
-(@ARGV)
-beyond `$pfx'
+	$opt->{'git-dir'} and die <<EOM;
+--project-list does not accept additional --git-dir directories
+(@{$opt->{'git-dir'}})
 EOM
 	open my $fh, '<', $pl or die "open($pl): $!\n";
 	chomp(@git_dirs = <$fh>);
-	$_ = PublicInbox::Admin::resolve_git_dir("$pfx/$_") for @git_dirs;
+	$pfx .= '/';
+	$pfx =~ tr!/!/!s;
+	substr($_, 0, 0, $pfx) for @git_dirs;
+} elsif (my $gd = $opt->{'git-dir'}) {
+	@git_dirs = @$gd;
+} elsif (grep defined, @$opt{qw(show update prune)}) {
 } else {
-	@git_dirs = map { PublicInbox::Admin::resolve_git_dir($_) } @ARGV;
+	warn "No --git-dir= nor --project-list= + --project-root= specified\n";
+	die $help;
 }
+
+$_ = PublicInbox::Admin::resolve_git_dir($_) for @git_dirs;
 if (defined $cidx_dir) { # external index
 	die "`%' is not allowed in $cidx_dir\n" if $cidx_dir =~ /\%/;
 	my $cidx = PublicInbox::CodeSearchIdx->new($cidx_dir, $opt);
diff --git a/t/cindex-join.t b/t/cindex-join.t
index ac90cd64..c2e85332 100644
--- a/t/cindex-join.t
+++ b/t/cindex-join.t
@@ -70,7 +70,7 @@ my $cidxdir = "$tmpdir/cidx";
 my $rdr = { 1 => \my $cout, 2 => \my $cerr };
 ok run_script([qw(-cindex -v --all --show=join_data),
 		'--join=aggressive,dt:..2022-12-01',
-		'-d', $cidxdir, values %code ],
+		'-d', $cidxdir, map { ('-g', $_) } values %code ],
 		$env, $rdr), 'initial join inboxes w/ coderepos';
 my $out = PublicInbox::Config->json->decode($cout);
 is($out->{join_data}->{dt}->[0], '19700101'.'000000',
@@ -79,4 +79,9 @@ is($out->{join_data}->{dt}->[0], '19700101'.'000000',
 ok run_script([qw(-cindex -v --all -u --join --show),
 		'-d', $cidxdir], $env, $rdr), 'incremental --join';
 
+ok run_script([qw(-cindex -v --no-scan --show),
+		'-d', $cidxdir], $env, $rdr), 'show';
+$out = PublicInbox::Config->json->decode($cout);
+is ref($out->{join_data}), 'HASH', 'got hash join data';
+is $cerr, '', 'no warnings or errors in stderr w/ --show';
 done_testing;
diff --git a/t/cindex.t b/t/cindex.t
index 29d88ca8..0193cf18 100644
--- a/t/cindex.t
+++ b/t/cindex.t
@@ -33,7 +33,7 @@ git gc -q
 EOM
 }; # /create_coderepo
 
-ok(run_script([qw(-cindex --dangerous -q), "$tmp/wt0"]), 'cindex internal');
+ok(run_script([qw(-cindex --dangerous -q -g), "$tmp/wt0"]), 'cindex internal');
 {
 	my $exists = -e "$tmp/wt0/.git/public-inbox-cindex/cidx.lock";
 	my @st = stat(_);
@@ -67,13 +67,14 @@ git gc -q
 EOM
 }; # /create_coderepo
 
-ok(run_script([qw(-cindex --dangerous -q -d), "$tmp/ext", $zp, "$tmp/wt0"]),
+ok(run_script([qw(-cindex --dangerous -q -d), "$tmp/ext",
+		'-g', $zp, '-g', "$tmp/wt0" ]),
 	'cindex external');
 ok(-e "$tmp/ext/cidx.lock", 'external dir created');
 ok(!-d "$zp/.git/public-inbox-cindex", 'no cindex in original coderepo');
 
 ok(run_script([qw(-cindex -L medium --dangerous -q -d),
-	"$tmp/med", $zp, "$tmp/wt0"]), 'cindex external medium');
+	"$tmp/med", '-g', $zp, '-g', "$tmp/wt0"]), 'cindex external medium');
 
 
 SKIP: {
@@ -228,7 +229,7 @@ SKIP: { # --prune
 
 File::Path::remove_tree("$tmp/ext");
 mkdir("$tmp/ext", 0707);
-ok(run_script([qw(-cindex --dangerous -q -d), "$tmp/ext", $zp]),
+ok(run_script([qw(-cindex --dangerous -q -d), "$tmp/ext", '-g', $zp]),
 	'external on existing dir');
 {
 	my @st = stat("$tmp/ext/cidx.lock");
diff --git a/t/xap_helper.t b/t/xap_helper.t
index ee25b2dc..37679ae9 100644
--- a/t/xap_helper.t
+++ b/t/xap_helper.t
@@ -20,10 +20,10 @@ my $crepo = create_coderepo 'for-cindex', sub {
 	xsys_e([qw(git init -q --bare)]);
 	xsys_e([qw(git fast-import --quiet)], undef, { 0 => $fi_fh });
 	chdir($dh);
-	run_script([qw(-cindex --dangerous -L medium --no-fsync -q -j1), $d])
+	run_script([qw(-cindex --dangerous -L medium --no-fsync -q -j1), '-g', $d])
 		or xbail '-cindex internal';
 	run_script([qw(-cindex --dangerous -L medium --no-fsync -q -j3 -d),
-		"$d/cidx-ext", $d]) or xbail '-cindex "external"';
+		"$d/cidx-ext", '-g', $d]) or xbail '-cindex "external"';
 };
 $dh = $fi_fh = undef;
 

* [PATCH 11/14] git: speed up Git->new by 5% or so
  2023-11-28 14:56 [PATCH 00/14] IT'S ALIVE! www loads cindex join data Eric Wong
                   ` (9 preceding siblings ...)
  2023-11-28 14:56 ` [PATCH 10/14] cindex: require `-g GIT_DIR' or `-r PROJECT_ROOT' Eric Wong
@ 2023-11-28 14:56 ` Eric Wong
  2023-11-28 14:56 ` [PATCH 12/14] admin: resolve_git_dir respects symlinks Eric Wong
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 19+ messages in thread
From: Eric Wong @ 2023-11-28 14:56 UTC (permalink / raw)
  To: meta

This becomes noticeable when loading lots of coderepos on
my local mirror of git.kernel.org now that we can load repos
from cindex.
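
For reference, a standalone sketch comparing the two ways of
stripping trailing slashes (norm_regex and norm_chop are
illustrative names, not functions in the tree).  Appending one
slash before squeezing guarantees exactly one trailing slash,
so a plain chop can replace the regex substitution.

```perl
use strict;
use warnings;

sub norm_regex { # the approach being replaced
	my ($d) = @_;
	$d =~ tr!/!/!s; # squeeze repeated slashes
	$d =~ s!/*\z!!s; # strip the trailing slash, if any
	$d;
}

sub norm_chop { # the approach in this patch
	my ($d) = @_;
	$d .= '/'; # ensure a trailing slash exists
	$d =~ tr!/!/!s; # squeeze, leaving exactly one trailing slash
	chop $d; # drop that single trailing slash
	$d;
}

for my $d (qw(/srv/git/a.git /srv//git/a.git/ /srv/git/a.git///)) {
	norm_regex($d) eq norm_chop($d) or die "mismatch for $d\n";
	print norm_chop($d), "\n";
}
```

This only checks that both give the same result for a few
inputs; it makes no attempt to reproduce the timing difference.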
---
 lib/PublicInbox/Git.pm | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/lib/PublicInbox/Git.pm b/lib/PublicInbox/Git.pm
index a374649f..235a35cd 100644
--- a/lib/PublicInbox/Git.pm
+++ b/lib/PublicInbox/Git.pm
@@ -91,8 +91,9 @@ sub git_quote ($) {
 
 sub new {
 	my ($class, $git_dir) = @_;
+	$git_dir .= '/';
 	$git_dir =~ tr!/!/!s;
-	$git_dir =~ s!/*\z!!s;
+	chop $git_dir;
 	# may contain {-tmp} field for File::Temp::Dir
 	bless { git_dir => $git_dir }, $class
 }


* [PATCH 12/14] admin: resolve_git_dir respects symlinks
  2023-11-28 14:56 [PATCH 00/14] IT'S ALIVE! www loads cindex join data Eric Wong
                   ` (10 preceding siblings ...)
  2023-11-28 14:56 ` [PATCH 11/14] git: speed up Git->new by 5% or so Eric Wong
@ 2023-11-28 14:56 ` Eric Wong
  2023-11-28 14:56 ` [PATCH 13/14] cindex: extra quit checks Eric Wong
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 19+ messages in thread
From: Eric Wong @ 2023-11-28 14:56 UTC (permalink / raw)
  To: meta

Absolute pathnames of git coderepos are stored in the cindex,
but we should favor paths relative to $ENV{PWD} since it
respects symlinks in the hierarchy.

Respecting symlinks makes it easier to migrate cindex to
new storage as old storage wears out and to relocate the
storage device onto another machine.
---
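Not part of the commit: a standalone sketch of the $ENV{PWD}
validation idea, assuming a throwaway directory layout
(real.git and sym.git are hypothetical names).  valid_pwd is
copied from the patch; everything around it is demo scaffolding.

```perl
use v5.12;
use Cwd qw(abs_path getcwd);
use File::Temp qw(tempdir);

# same check as valid_pwd in the patch: only trust $ENV{PWD} if it
# names the same (inode, device) pair as the actual cwd
sub valid_pwd {
	my $pwd = $ENV{PWD} // return;
	my @st_pwd = stat $pwd or return;
	my @st_cwd = stat '.' or die "stat(.): $!";
	"@st_pwd[1,0]" eq "@st_cwd[1,0]" ? $pwd : undef;
}

my $tmp = tempdir(CLEANUP => 1); # hypothetical layout for the demo
my $orig_cwd = getcwd();
mkdir("$tmp/real.git") or die "mkdir: $!";
symlink("$tmp/real.git", "$tmp/sym.git") or die "symlink: $!";
chdir("$tmp/sym.git") or die "chdir: $!";

my ($trusted, $stale, $resolved);
{
	local $ENV{PWD} = "$tmp/sym.git"; # what a shell exports after `cd'
	$trusted = valid_pwd(); # the symlinked path survives
	$resolved = abs_path('.'); # the symlink is resolved away
	$ENV{PWD} = $tmp; # stale value naming a different directory
	$stale = valid_pwd(); # rejected since inode+device differ
}
chdir($orig_cwd) or die "chdir: $!";
say 'trusted:  ', $trusted // '(undef)';
say 'resolved: ', $resolved;
say 'stale:    ', $stale // '(undef)';
```

The stat comparison is what makes $ENV{PWD} safe to use: a stale or
forged value is ignored instead of being stored in the cindex.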
 lib/PublicInbox/Admin.pm | 25 +++++++++++++++++++++----
 t/admin.t                | 12 ++++++++++++
 2 files changed, 33 insertions(+), 4 deletions(-)

diff --git a/lib/PublicInbox/Admin.pm b/lib/PublicInbox/Admin.pm
index 893f4a1b..cc9d2171 100644
--- a/lib/PublicInbox/Admin.pm
+++ b/lib/PublicInbox/Admin.pm
@@ -63,15 +63,32 @@ sub resolve_inboxdir {
 	$dir;
 }
 
+sub valid_pwd {
+	my $pwd = $ENV{PWD} // return;
+	my @st_pwd = stat $pwd or return;
+	my @st_cwd = stat '.' or die "stat(.): $!";
+	"@st_pwd[1,0]" eq "@st_cwd[1,0]" ? $pwd : undef;
+}
+
 sub resolve_git_dir {
-	my ($cd) = @_;
+	my ($cd) = @_; # cd may be `undef' for cwd
 	# try v1 bare git dirs
+	my $pwd = valid_pwd();
+	my $env;
+	defined($pwd) && substr($cd // '/', 0, 1) ne '/' and
+		$env->{PWD} = "$pwd/$cd";
 	my $cmd = [ qw(git rev-parse --git-dir) ];
-	my $dir = run_qx($cmd, undef, {-C => $cd});
+	my $dir = run_qx($cmd, $env, { -C => $cd });
 	die "error in @$cmd (cwd:${\($cd // '.')}): $?\n" if $?;
 	chomp $dir;
-	# --absolute-git-dir requires git v2.13.0+
-	$dir = rel2abs_collapsed($dir, $cd) if $dir !~ m!\A/!;
+	# --absolute-git-dir requires git v2.13.0+, and we want to
+	# respect symlinks in $ENV{PWD} if $ENV{PWD} ne abs_path('.')
+	# since we store absolute GIT_DIR paths in cindex.
+	if (substr($dir, 0, 1) ne '/') {
+		substr($cd // '/', 0, 1) eq '/' or
+			$cd = File::Spec->rel2abs($cd, $pwd);
+		$dir = rel2abs_collapsed($dir, $cd);
+	}
 	$dir;
 }
 
diff --git a/t/admin.t b/t/admin.t
index 20e3deb7..586938d0 100644
--- a/t/admin.t
+++ b/t/admin.t
@@ -6,6 +6,7 @@ use v5.10.1;
 use PublicInbox::TestCommon;
 use PublicInbox::Import;
 use_ok 'PublicInbox::Admin';
+use autodie;
 my $v1 = create_inbox 'v1', -no_gc => 1, sub {};
 my ($tmpdir, $for_destroy) = tmpdir();
 my $git_dir = $v1->{inboxdir};
@@ -23,6 +24,17 @@ SKIP: {
 };
 
 *resolve_inboxdir = \&PublicInbox::Admin::resolve_inboxdir;
+*resolve_git_dir = \&PublicInbox::Admin::resolve_git_dir;
+
+{
+	symlink $git_dir, my $sym = "$tmpdir/v1-symlink.git";
+	for my $d ('') { # TODO: should work inside $sym/objects
+		local $ENV{PWD} = $sym.$d;
+		chdir $sym.$d;
+		is resolve_git_dir('.'), $sym,
+			"symlink preserved from {SYMLINKDIR}.git$d";
+	}
+}
 
 # v1
 is(resolve_inboxdir($git_dir), $git_dir, 'top-level GIT_DIR resolved');


* [PATCH 13/14] cindex: extra quit checks
  2023-11-28 14:56 [PATCH 00/14] IT'S ALIVE! www loads cindex join data Eric Wong
                   ` (11 preceding siblings ...)
  2023-11-28 14:56 ` [PATCH 12/14] admin: resolve_git_dir respects symlinks Eric Wong
@ 2023-11-28 14:56 ` Eric Wong
  2023-11-28 14:56 ` [PATCH 14/14] www: start working on a repo listing Eric Wong
  2023-11-28 17:55 ` [PATCH 15/14] www: load cindex join data for ->ALL, too Eric Wong
  14 siblings, 0 replies; 19+ messages in thread
From: Eric Wong @ 2023-11-28 14:56 UTC (permalink / raw)
  To: meta

We don't want to be accessing uninitialized variables on
process teardown since much of our control flow revolves
around DESTROY for dependency handling.
---
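Not part of the commit: a rough sketch of DESTROY-driven control
flow and why a quit check at the top of the triggered step helps.
The Guard package, next_stage and run_jobs are hypothetical names
for illustration only; the real code uses its own helpers and
state.

```perl
use v5.12;

package Guard; # hypothetical stand-in for an on-destroy helper
sub new { my ($class, $cb) = @_; bless { cb => $cb }, $class }
sub DESTROY { $_[0]->{cb}->() }

package main;
my $DO_QUIT; # a signal handler would set this in a real program
my @log;

sub next_stage { # only runs once every job has let go of the guard
	return push(@log, 'skipped') if $DO_QUIT; # the extra quit check
	push @log, 'started';
}

sub run_jobs {
	my $guard = Guard->new(\&next_stage);
	# each job holds its own reference to the guard
	my @jobs = map { my $g = $guard; sub { $g } } 1..3;
	undef $guard;
	$_->() for @jobs; # pretend to do work
	@jobs = (); # last reference dropped: DESTROY fires next_stage
}

run_jobs();
$DO_QUIT = 1;
run_jobs();
say "@log";
```

Because the follow-up step fires from DESTROY no matter why the
references went away, it has to check for a pending quit itself.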
 lib/PublicInbox/CodeSearchIdx.pm | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/lib/PublicInbox/CodeSearchIdx.pm b/lib/PublicInbox/CodeSearchIdx.pm
index d49e9a8d..7d696099 100644
--- a/lib/PublicInbox/CodeSearchIdx.pm
+++ b/lib/PublicInbox/CodeSearchIdx.pm
@@ -338,6 +338,9 @@ sub shard_done { # called via PktOp on shard_index completion
 
 sub repo_stored {
 	my ($self, $repo_ctx, $drs, $did) = @_;
+	# check @IDX_SHARDS instead of DO_QUIT to avoid wasting prior work
+	# because shard_commit is fast
+	return unless @IDX_SHARDS;
 	$did > 0 or die "BUG: $repo_ctx->{repo}->{git_dir}: docid=$did";
 	my ($c, $p) = PublicInbox::PktOp->pair;
 	$c->{ops}->{shard_done} = [ $self, $repo_ctx,
@@ -509,6 +512,7 @@ sub shard_commit { # via wq_io_do
 
 sub dump_roots_start {
 	my ($self, $do_join) = @_;
+	return if $DO_QUIT;
 	$XHC //= PublicInbox::XapClient::start_helper("-j$NPROC");
 	$do_join // die 'BUG: no $do_join';
 	progress($self, 'dumping IDs from coderepos');
@@ -562,6 +566,7 @@ EOM
 
 sub dump_ibx_start {
 	my ($self, $do_join) = @_;
+	return if $DO_QUIT;
 	$XHC //= PublicInbox::XapClient::start_helper("-j$NPROC");
 	my ($sort_opt, $fold_opt);
 	pipe(local $sort_opt->{0}, $DUMP_IBX_WPIPE);


* [PATCH 14/14] www: start working on a repo listing
  2023-11-28 14:56 [PATCH 00/14] IT'S ALIVE! www loads cindex join data Eric Wong
                   ` (12 preceding siblings ...)
  2023-11-28 14:56 ` [PATCH 13/14] cindex: extra quit checks Eric Wong
@ 2023-11-28 14:56 ` Eric Wong
  2023-11-28 17:55 ` [PATCH 15/14] www: load cindex join data for ->ALL, too Eric Wong
  14 siblings, 0 replies; 19+ messages in thread
From: Eric Wong @ 2023-11-28 14:56 UTC (permalink / raw)
  To: meta

The HTML is still extremely rough, but links seem to be mostly
working...
---
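Not part of the commit: a small sketch of the record-and-sort
pattern used by repos_sorted.  fake_load_commit_times and the
%times table are assumptions standing in for the real cindex
lookup, and plain hashes stand in for PublicInbox::Git objects.

```perl
use v5.12;

# stand-in for load_commit_times: fills in $rec->[0] from a table
sub fake_load_commit_times {
	my ($todo, $times) = @_;
	$_->[0] = $times->{$_->[1]->{nick}} // 0 for @$todo;
}

sub repos_sorted_sketch {
	my ($times, @gits) = @_;
	my @recs = map { [ 0, $_ ] } @gits; # [ commit_time, git ]
	my @todo = @recs; # same array refs, so updates show up in @recs
	fake_load_commit_times(\@todo, $times);
	sort { $b->[0] <=> $a->[0] } @recs; # newest first
}

my %times = ('a.git' => 100, 'b.git' => 300); # c.git has no entry
my @recs = repos_sorted_sketch(\%times,
	map { { nick => $_ } } qw(a.git b.git c.git));
say join(' ', map { "$_->[0]:$_->[1]->{nick}" } @recs);
```

Repos without a known commit time keep the initial 0 and sort last.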
 MANIFEST                       |  1 +
 lib/PublicInbox/CodeSearch.pm  |  8 +++++++
 lib/PublicInbox/RepoList.pm    | 39 ++++++++++++++++++++++++++++++++++
 lib/PublicInbox/WwwCoderepo.pm |  3 +++
 lib/PublicInbox/WwwStream.pm   | 11 +++++-----
 lib/PublicInbox/WwwText.pm     | 10 ++++-----
 6 files changed, 61 insertions(+), 11 deletions(-)
 create mode 100644 lib/PublicInbox/RepoList.pm

diff --git a/MANIFEST b/MANIFEST
index 7b6178f9..e22674b7 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -323,6 +323,7 @@ lib/PublicInbox/PktOp.pm
 lib/PublicInbox/Qspawn.pm
 lib/PublicInbox/Reply.pm
 lib/PublicInbox/RepoAtom.pm
+lib/PublicInbox/RepoList.pm
 lib/PublicInbox/RepoSnapshot.pm
 lib/PublicInbox/RepoTree.pm
 lib/PublicInbox/SHA.pm
diff --git a/lib/PublicInbox/CodeSearch.pm b/lib/PublicInbox/CodeSearch.pm
index 7d7f6df6..7c0dd063 100644
--- a/lib/PublicInbox/CodeSearch.pm
+++ b/lib/PublicInbox/CodeSearch.pm
@@ -341,4 +341,12 @@ EOM
 	}
 }
 
+sub repos_sorted {
+	my $pi_cfg = shift;
+	my @recs = map { [ 0, $_ ] } @_; # PublicInbox::Git objects
+	my @todo = @recs;
+	$pi_cfg->each_cindex(\&load_commit_times, \@todo);
+	@recs = sort { $b->[0] <=> $a->[0] } @recs;
+}
+
 1;
diff --git a/lib/PublicInbox/RepoList.pm b/lib/PublicInbox/RepoList.pm
new file mode 100644
index 00000000..4b313ed6
--- /dev/null
+++ b/lib/PublicInbox/RepoList.pm
@@ -0,0 +1,39 @@
+# Copyright (C) all contributors <meta@public-inbox.org>
+# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
+package PublicInbox::RepoList;
+use v5.12;
+use parent qw(PublicInbox::WwwStream);
+use PublicInbox::Hval qw(ascii_html prurl fmt_ts);
+require PublicInbox::CodeSearch;
+
+sub html_top_fallback { # WwwStream->html_repo_top
+	my ($ctx) = @_;
+	my $title = delete($ctx->{-title_html}) //
+		ascii_html("$ctx->{env}->{PATH_INFO}*");
+	my $upfx = $ctx->{-upfx} // '';
+	"<html><head><title>$title</title>" .
+		$ctx->{www}->style($upfx) . '</head><body>';
+}
+
+sub html ($$$) {
+	my ($wcr, $ctx, $pfx) = @_;
+	my $cr = $wcr->{pi_cfg}->{-coderepos};
+	my @nicks = grep(m!\A\Q$pfx\E/!, keys %$cr) or return; # 404
+	__PACKAGE__->html_init($ctx);
+	my $zfh = $ctx->zfh;
+	print $zfh "<pre>matching coderepos\n";
+	my @recs = PublicInbox::CodeSearch::repos_sorted($wcr->{pi_cfg},
+							@$cr{@nicks});
+	my $env = $ctx->{env};
+	for (@recs) {
+		my ($t, $git) = @$_;
+		my $nick = ascii_html("$git->{nick}");
+		for my $u ($git->pub_urls($env)) {
+			$u = prurl($env, $u);
+			print $zfh "\n".fmt_ts($t).qq{ <a\nhref="$u">$nick</a>}
+		}
+	}
+	$ctx->html_done('</pre>');
+}
+
+1;
diff --git a/lib/PublicInbox/WwwCoderepo.pm b/lib/PublicInbox/WwwCoderepo.pm
index 8ab4911f..d1354af5 100644
--- a/lib/PublicInbox/WwwCoderepo.pm
+++ b/lib/PublicInbox/WwwCoderepo.pm
@@ -19,6 +19,7 @@ use PublicInbox::ViewDiff qw(uri_escape_path);
 use PublicInbox::RepoSnapshot;
 use PublicInbox::RepoAtom;
 use PublicInbox::RepoTree;
+use PublicInbox::RepoList;
 use PublicInbox::OnDestroy;
 use URI::Escape qw(uri_escape_utf8);
 use File::Spec;
@@ -354,6 +355,8 @@ sub srv { # endpoint called by PublicInbox::WWW
 	} elsif ($path_info =~ m!\A/(.+?)/(refs/(?:heads|tags))/\z! and
 			($ctx->{git} = $pi_cfg->get_coderepo($1))) {
 		refs_foo($self, $ctx, $2);
+	} elsif ($path_info =~ m!\A/(.+?)/\z!) {
+		PublicInbox::RepoList::html($self, $ctx, $1) // r(404);
 	} elsif ($path_info =~ m!\A/(.+?)\z! and
 			($git = $pi_cfg->get_coderepo($1))) {
 		my $qs = $ctx->{env}->{QUERY_STRING};
diff --git a/lib/PublicInbox/WwwStream.pm b/lib/PublicInbox/WwwStream.pm
index 3a1d6edf..8d32074f 100644
--- a/lib/PublicInbox/WwwStream.pm
+++ b/lib/PublicInbox/WwwStream.pm
@@ -17,8 +17,9 @@ http://7fh6tueqddpjyxjmgtdiueylzoqt6pt7hec3pukyptlmohoowvhde4yd.onion/public-inb
 https://public-inbox.org/public-inbox.git) ];
 
 sub base_url ($) {
-	my $ctx = shift;
-	my $base_url = ($ctx->{ibx} // $ctx->{git})->base_url($ctx->{env});
+	my ($ctx) = @_;
+	my $thing = $ctx->{ibx} // $ctx->{git} // return;
+	my $base_url = $thing->base_url($ctx->{env});
 	chop $base_url; # no trailing slash for clone
 	$base_url;
 }
@@ -40,7 +41,7 @@ sub async_eml { # for async_blob_cb
 
 sub html_repo_top ($) {
 	my ($ctx) = @_;
-	my $git = $ctx->{git};
+	my $git = $ctx->{git} // return $ctx->html_top_fallback;
 	my $desc = ascii_html($git->description);
 	my $title = delete($ctx->{-title_html}) // $desc;
 	my $upfx = $ctx->{-upfx} // '';
@@ -265,11 +266,11 @@ sub aresponse {
 }
 
 sub html_init {
-	my ($ctx) = @_;
+	my $ctx = $_[-1];
 	$ctx->{base_url} = base_url($ctx);
 	my $h = $ctx->{-res_hdr} = ['Content-Type', 'text/html; charset=UTF-8'];
 	$ctx->{gz} = PublicInbox::GzipFilter::gz_or_noop($h, $ctx->{env});
-	bless $ctx, __PACKAGE__;
+	bless $ctx, @_ > 1 ? $_[0] : __PACKAGE__;
 	print { $ctx->zfh } html_top($ctx);
 }
 
diff --git a/lib/PublicInbox/WwwText.pm b/lib/PublicInbox/WwwText.pm
index 4b4b2e4c..5e23005e 100644
--- a/lib/PublicInbox/WwwText.pm
+++ b/lib/PublicInbox/WwwText.pm
@@ -252,19 +252,17 @@ sub coderepos_raw ($$) {
 	my $cr = $cfg->repo_objs($ctx->{ibx}) or return ();
 	my $buf = 'Code repositories for project(s) associated with this '.
 		$ctx->{ibx}->thing_type . ":\n";
-	my @recs = map { [ 0, $_ ] } @$cr;
-	my @todo = @recs;
-	$cfg->each_cindex('load_commit_times', \@todo);
-	@recs = sort { $b->[0] <=> $a->[0] } @recs;
+	my @recs = PublicInbox::CodeSearch::repos_sorted($cfg, @$cr);
 	my $cr_score = $ctx->{ibx}->{-cr_score};
+	my $env = $ctx->{env};
 	for (@recs) {
 		my ($t, $git) = @$_;
-		for ($git->pub_urls($ctx->{env})) {
+		for ($git->pub_urls($env)) {
 			my $u = m!\A(?:[a-z\+]+:)?//!i ? $_ : $top_url.$_;
 			my $nr = $cr_score->{$git->{nick}};
 			$buf .= "\n";
 			$buf .= $nr ? sprintf('% 9u', $nr) : (' 'x9);
-			$buf .= ' '.fmt_ts($t).' '.prurl($ctx->{env}, $u);
+			$buf .= ' '.fmt_ts($t).' '.prurl($env, $u);
 		}
 	}
 	($buf);


* [PATCH 15/14] www: load cindex join data for ->ALL, too
  2023-11-28 14:56 [PATCH 00/14] IT'S ALIVE! www loads cindex join data Eric Wong
                   ` (13 preceding siblings ...)
  2023-11-28 14:56 ` [PATCH 14/14] www: start working on a repo listing Eric Wong
@ 2023-11-28 17:55 ` Eric Wong
  14 siblings, 0 replies; 19+ messages in thread
From: Eric Wong @ 2023-11-28 17:55 UTC (permalink / raw)
  To: meta

This ensures the /all/ extindex can have auto-associations
with coderepos just like normal inboxes do.
---
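Not part of the commit: a sketch of the ordering and the `//='
defaulting with made-up data.  The repo names and scores below
are invented, and plain hashes stand in for the coderepo objects.

```perl
use v5.12;

my %coderepos = (
	'a.git' => { nick => 'a.git', ibx_score => [ [ 9, 1 ], [ 3, 2 ] ] },
	'b.git' => { nick => 'b.git' }, # never joined to any inbox
	'c.git' => { nick => 'c.git',
			ibx_score => [ [ 5, 1 ], [ 4, 2 ], [ 1, 3 ] ] },
);

# repos joined to the most inboxes come first
my @by_joins = sort {
	scalar @{$b->{ibx_score} // []} <=>
		scalar @{$a->{ibx_score} // []}
} values %coderepos;

my %cr_score = ('a.git' => 42); # pretend this was set earlier
# `//=' only fills in scores which are not already defined
$cr_score{$_->{nick}} //= scalar(@{$_->{ibx_score} // []}) for @by_joins;

say join(' ', map { "$_->{nick}=$cr_score{$_->{nick}}" } @by_joins);
```

The `// []' fallback lets repos with no join data count as zero
instead of dying on an undefined array reference.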
 lib/PublicInbox/CodeSearch.pm | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/lib/PublicInbox/CodeSearch.pm b/lib/PublicInbox/CodeSearch.pm
index 7c0dd063..5c5774cf 100644
--- a/lib/PublicInbox/CodeSearch.pm
+++ b/lib/PublicInbox/CodeSearch.pm
@@ -339,6 +339,15 @@ EOM
 		my $s = $git->{ibx_score};
 		@$s = sort { $b->[0] <=> $a->[0] } @$s if $s;
 	}
+	my $ALL = $pi_cfg->ALL or return;
+	my @alls_gits = sort {
+		scalar @{$b->{ibx_score} // []} <=>
+			scalar @{$a->{ibx_score} // []}
+	} values %$coderepos;
+	my $gits = $ALL->{-repo_objs} //= [];
+	push @$gits, @alls_gits;
+	my $cr_score = $ALL->{-cr_score} //= {};
+	$cr_score->{$_->{nick}} //= scalar(@{$_->{ibx_score}//[]}) for @$gits;
 }
 
 sub repos_sorted {


* [PATCH 0/2] pure Perl sendmsg/recvmsg on *BSD
  2023-11-28 14:56 ` [PATCH 02/14] t/cindex*: require SCM_RIGHTS for these tests Eric Wong
@ 2024-01-29 21:23   ` Eric Wong
  2024-01-29 21:23     ` [PATCH 1/2] syscall: update formatting to match our codebase Eric Wong
  2024-01-29 21:23     ` [PATCH 2/2] syscall: use pure Perl sendmsg/recvmsg on *BSD Eric Wong
  0 siblings, 2 replies; 19+ messages in thread
From: Eric Wong @ 2024-01-29 21:23 UTC (permalink / raw)
  To: meta

I was wrong about the `syscall' Perl function being unusable on
*BSD.  It turns out only the symbol names (e.g. from syscall.ph)
are unusable, but using the numbers is fine.

Eric Wong (2):
  syscall: update formatting to match our codebase
  syscall: use pure Perl sendmsg/recvmsg on *BSD

 devel/sysdefs-list         |   9 +-
 lib/PublicInbox/Syscall.pm | 527 +++++++++++++++++++------------------
 t/cmd_ipc.t                |   9 +-
 3 files changed, 286 insertions(+), 259 deletions(-)


* [PATCH 1/2] syscall: update formatting to match our codebase
  2024-01-29 21:23   ` [PATCH 0/2] pure Perl sendmsg/recvmsg on *BSD Eric Wong
@ 2024-01-29 21:23     ` Eric Wong
  2024-01-29 21:23     ` [PATCH 2/2] syscall: use pure Perl sendmsg/recvmsg on *BSD Eric Wong
  1 sibling, 0 replies; 19+ messages in thread
From: Eric Wong @ 2024-01-29 21:23 UTC (permalink / raw)
  To: meta

Sys::Syscall needs separate patches anyways (if it ever gets
updated), and having a mix of indentation styles in our codebase
gets confusing.  We'll also update cfarm-related comments for
the current URL.
---
 lib/PublicInbox/Syscall.pm | 427 ++++++++++++++++++-------------------
 1 file changed, 213 insertions(+), 214 deletions(-)

diff --git a/lib/PublicInbox/Syscall.pm b/lib/PublicInbox/Syscall.pm
index 96af2b22..9071e6b1 100644
--- a/lib/PublicInbox/Syscall.pm
+++ b/lib/PublicInbox/Syscall.pm
@@ -4,7 +4,7 @@
 #
 # See devel/sysdefs-list in the public-inbox source tree for maintenance
 # <https://80x24.org/public-inbox.git>, and machines from the GCC Farm:
-# <https://cfarm.tetaneutral.net/>
+# <https://portal.cfarm.net/>
 #
 # This license differs from the rest of public-inbox
 #
@@ -26,10 +26,10 @@ our $INOTIFY;
 
 # $VERSION = '0.25'; # Sys::Syscall version
 our @EXPORT_OK = qw(epoll_ctl epoll_create epoll_wait
-                  EPOLLIN EPOLLOUT EPOLLET
-                  EPOLL_CTL_ADD EPOLL_CTL_DEL EPOLL_CTL_MOD
-                  EPOLLONESHOT EPOLLEXCLUSIVE
-                  signalfd rename_noreplace %SIGNUM $F_SETPIPE_SZ);
+		EPOLLIN EPOLLOUT EPOLLET
+		EPOLL_CTL_ADD EPOLL_CTL_DEL EPOLL_CTL_MOD
+		EPOLLONESHOT EPOLLEXCLUSIVE
+		signalfd rename_noreplace %SIGNUM $F_SETPIPE_SZ);
 use constant {
 	EPOLLIN => 1,
 	EPOLLOUT => 4,
@@ -71,216 +71,216 @@ our $no_deprecated = 0;
 
 if ($^O eq "linux") {
 	$F_SETPIPE_SZ = 1031;
-    my (undef, undef, $release, undef, $machine) = POSIX::uname();
-    my ($maj, $min) = ($release =~ /\A([0-9]+)\.([0-9]+)/);
-    $SYS_renameat2 = 0 if "$maj.$min" < 3.15;
-    # whether the machine requires 64-bit numbers to be on 8-byte
-    # boundaries.
-    my $u64_mod_8 = 0;
+	my (undef, undef, $release, undef, $machine) = POSIX::uname();
+	my ($maj, $min) = ($release =~ /\A([0-9]+)\.([0-9]+)/);
+	$SYS_renameat2 = 0 if "$maj.$min" < 3.15;
+	# whether the machine requires 64-bit numbers to be on 8-byte
+	# boundaries.
+	my $u64_mod_8 = 0;
 
-    if ($Config{ptrsize} == 4) {
-	# if we're running on an x86_64 kernel, but a 32-bit process,
-	# we need to use the x32 or i386 syscall numbers.
-	if ($machine eq 'x86_64') {
-	    my $s = $Config{cppsymbols};
-	    $machine = ($s =~ /\b__ILP32__=1\b/ && $s =~ /\b__x86_64__=1\b/) ?
+	if ($Config{ptrsize} == 4) {
+		# if we're running on an x86_64 kernel, but a 32-bit process,
+		# we need to use the x32 or i386 syscall numbers.
+		if ($machine eq 'x86_64') {
+			my $s = $Config{cppsymbols};
+			$machine = ($s =~ /\b__ILP32__=1\b/ &&
+					$s =~ /\b__x86_64__=1\b/) ?
 				'x32' : 'i386'
-	} elsif ($machine eq 'mips64') { # similarly for mips64 vs mips
-	    $machine = 'mips';
+		} elsif ($machine eq 'mips64') { # similarly for mips64 vs mips
+			$machine = 'mips';
+		}
 	}
-    }
-
-    if ($machine =~ m/^i[3456]86$/) {
-        $SYS_epoll_create = 254;
-        $SYS_epoll_ctl    = 255;
-        $SYS_epoll_wait   = 256;
-        $SYS_signalfd4 = 327;
-        $SYS_renameat2 //= 353;
-	$SYS_fstatfs = 100;
-	$SYS_sendmsg = 370;
-	$SYS_recvmsg = 372;
-	$INOTIFY = { # usage: `use constant $PublicInbox::Syscall::INOTIFY'
-		SYS_inotify_init1 => 332,
-		SYS_inotify_add_watch => 292,
-		SYS_inotify_rm_watch => 293,
-	};
-	$FS_IOC_GETFLAGS = 0x80046601;
-	$FS_IOC_SETFLAGS = 0x40046602;
-    } elsif ($machine eq "x86_64") {
-        $SYS_epoll_create = 213;
-        $SYS_epoll_ctl    = 233;
-        $SYS_epoll_wait   = 232;
-        $SYS_signalfd4 = 289;
-	$SYS_renameat2 //= 316;
-	$SYS_fstatfs = 138;
-	$SYS_sendmsg = 46;
-	$SYS_recvmsg = 47;
-	$INOTIFY = {
-		SYS_inotify_init1 => 294,
-		SYS_inotify_add_watch => 254,
-		SYS_inotify_rm_watch => 255,
-	};
-	$FS_IOC_GETFLAGS = 0x80086601;
-	$FS_IOC_SETFLAGS = 0x40086602;
-    } elsif ($machine eq 'x32') {
-        $SYS_epoll_create = 1073742037;
-        $SYS_epoll_ctl = 1073742057;
-        $SYS_epoll_wait = 1073742056;
-        $SYS_signalfd4 = 1073742113;
-	$SYS_renameat2 //= 0x40000000 + 316;
-	$SYS_fstatfs = 138;
-	$SYS_sendmsg = 0x40000206;
-	$SYS_recvmsg = 0x40000207;
-	$FS_IOC_GETFLAGS = 0x80046601;
-	$FS_IOC_SETFLAGS = 0x40046602;
-	$INOTIFY = {
-		SYS_inotify_init1 => 1073742118,
-		SYS_inotify_add_watch => 1073742078,
-		SYS_inotify_rm_watch => 1073742079,
-	};
-    } elsif ($machine eq 'sparc64') {
-	$SYS_epoll_create = 193;
-	$SYS_epoll_ctl = 194;
-	$SYS_epoll_wait = 195;
-	$u64_mod_8 = 1;
-	$SYS_signalfd4 = 317;
-	$SYS_renameat2 //= 345;
-	$SFD_CLOEXEC = 020000000;
-	$SYS_fstatfs = 158;
-	$SYS_sendmsg = 114;
-	$SYS_recvmsg = 113;
-	$FS_IOC_GETFLAGS = 0x40086601;
-	$FS_IOC_SETFLAGS = 0x80086602;
-    } elsif ($machine =~ m/^parisc/) {
-        $SYS_epoll_create = 224;
-        $SYS_epoll_ctl    = 225;
-        $SYS_epoll_wait   = 226;
-        $u64_mod_8        = 1;
-        $SYS_signalfd4 = 309;
-        $SIGNUM{WINCH} = 23;
-    } elsif ($machine =~ m/^ppc64/) {
-        $SYS_epoll_create = 236;
-        $SYS_epoll_ctl    = 237;
-        $SYS_epoll_wait   = 238;
-        $u64_mod_8        = 1;
-        $SYS_signalfd4 = 313;
-	$SYS_renameat2 //= 357;
-	$SYS_fstatfs = 100;
-	$SYS_sendmsg = 341;
-	$SYS_recvmsg = 342;
-	$FS_IOC_GETFLAGS = 0x40086601;
-	$FS_IOC_SETFLAGS = 0x80086602;
-	$INOTIFY = {
-		SYS_inotify_init1 => 318,
-		SYS_inotify_add_watch => 276,
-		SYS_inotify_rm_watch => 277,
-	};
-    } elsif ($machine eq "ppc") {
-        $SYS_epoll_create = 236;
-        $SYS_epoll_ctl    = 237;
-        $SYS_epoll_wait   = 238;
-        $u64_mod_8        = 1;
-        $SYS_signalfd4 = 313;
-	$SYS_renameat2 //= 357;
-	$SYS_fstatfs = 100;
-	$FS_IOC_GETFLAGS = 0x40086601;
-	$FS_IOC_SETFLAGS = 0x80086602;
-    } elsif ($machine =~ m/^s390/) { # untested, no machine on cfarm
-        $SYS_epoll_create = 249;
-        $SYS_epoll_ctl    = 250;
-        $SYS_epoll_wait   = 251;
-        $u64_mod_8        = 1;
-        $SYS_signalfd4 = 322;
-	$SYS_renameat2 //= 347;
-	$SYS_fstatfs = 100;
-	$SYS_sendmsg = 370;
-	$SYS_recvmsg = 372;
-    } elsif ($machine eq 'ia64') { # untested, no machine on cfarm
-        $SYS_epoll_create = 1243;
-        $SYS_epoll_ctl    = 1244;
-        $SYS_epoll_wait   = 1245;
-        $u64_mod_8        = 1;
-        $SYS_signalfd4 = 289;
-    } elsif ($machine eq "alpha") { # untested, no machine on cfarm
-        # natural alignment, ints are 32-bits
-        $SYS_epoll_create = 407;
-        $SYS_epoll_ctl    = 408;
-        $SYS_epoll_wait   = 409;
-        $u64_mod_8        = 1;
-        $SYS_signalfd4 = 484;
-	$SFD_CLOEXEC = 010000000;
-    } elsif ($machine =~ /\A(?:loong|a)arch64\z/ || $machine eq 'riscv64') {
-        $SYS_epoll_create = 20;  # (sys_epoll_create1)
-        $SYS_epoll_ctl    = 21;
-        $SYS_epoll_wait   = 22;  # (sys_epoll_pwait)
-        $u64_mod_8        = 1;
-        $no_deprecated    = 1;
-        $SYS_signalfd4 = 74;
-	$SYS_renameat2 //= 276;
-	$SYS_fstatfs = 44;
-	$SYS_sendmsg = 211;
-	$SYS_recvmsg = 212;
-	$INOTIFY = {
-		SYS_inotify_init1 => 26,
-		SYS_inotify_add_watch => 27,
-		SYS_inotify_rm_watch => 28,
-	};
-	$FS_IOC_GETFLAGS = 0x80086601;
-	$FS_IOC_SETFLAGS = 0x40086602;
-    } elsif ($machine =~ m/arm(v\d+)?.*l/) { # ARM OABI (untested on cfarm)
-        $SYS_epoll_create = 250;
-        $SYS_epoll_ctl    = 251;
-        $SYS_epoll_wait   = 252;
-        $u64_mod_8        = 1;
-        $SYS_signalfd4 = 355;
-	$SYS_renameat2 //= 382;
-	$SYS_fstatfs = 100;
-	$SYS_sendmsg = 296;
-	$SYS_recvmsg = 297;
-    } elsif ($machine =~ m/^mips64/) { # cfarm only has 32-bit userspace
-        $SYS_epoll_create = 5207;
-        $SYS_epoll_ctl    = 5208;
-        $SYS_epoll_wait   = 5209;
-        $u64_mod_8        = 1;
-        $SYS_signalfd4 = 5283;
-	$SYS_renameat2 //= 5311;
-	$SYS_fstatfs = 5135;
-	$SYS_sendmsg = 5045;
-	$SYS_recvmsg = 5046;
-	$FS_IOC_GETFLAGS = 0x40046601;
-	$FS_IOC_SETFLAGS = 0x80046602;
-    } elsif ($machine =~ m/^mips/) { # 32-bit, tested on mips64 cfarm machine
-        $SYS_epoll_create = 4248;
-        $SYS_epoll_ctl    = 4249;
-        $SYS_epoll_wait   = 4250;
-        $u64_mod_8        = 1;
-        $SYS_signalfd4 = 4324;
-	$SYS_renameat2 //= 4351;
-	$SYS_fstatfs = 4100;
-	$SYS_sendmsg = 4179;
-	$SYS_recvmsg = 4177;
-	$FS_IOC_GETFLAGS = 0x40046601;
-	$FS_IOC_SETFLAGS = 0x80046602;
-	$SIGNUM{WINCH} = 20;
-	$INOTIFY = {
-		SYS_inotify_init1 => 4329,
-		SYS_inotify_add_watch => 4285,
-		SYS_inotify_rm_watch => 4286,
-	};
-    } else {
-        warn <<EOM;
+	if ($machine =~ m/^i[3456]86$/) {
+		$SYS_epoll_create = 254;
+		$SYS_epoll_ctl = 255;
+		$SYS_epoll_wait = 256;
+		$SYS_signalfd4 = 327;
+		$SYS_renameat2 //= 353;
+		$SYS_fstatfs = 100;
+		$SYS_sendmsg = 370;
+		$SYS_recvmsg = 372;
+		$INOTIFY = { # usage: `use constant $PublicInbox::Syscall::INOTIFY'
+			SYS_inotify_init1 => 332,
+			SYS_inotify_add_watch => 292,
+			SYS_inotify_rm_watch => 293,
+		};
+		$FS_IOC_GETFLAGS = 0x80046601;
+		$FS_IOC_SETFLAGS = 0x40046602;
+	} elsif ($machine eq "x86_64") {
+		$SYS_epoll_create = 213;
+		$SYS_epoll_ctl = 233;
+		$SYS_epoll_wait = 232;
+		$SYS_signalfd4 = 289;
+		$SYS_renameat2 //= 316;
+		$SYS_fstatfs = 138;
+		$SYS_sendmsg = 46;
+		$SYS_recvmsg = 47;
+		$INOTIFY = {
+			SYS_inotify_init1 => 294,
+			SYS_inotify_add_watch => 254,
+			SYS_inotify_rm_watch => 255,
+		};
+		$FS_IOC_GETFLAGS = 0x80086601;
+		$FS_IOC_SETFLAGS = 0x40086602;
+	} elsif ($machine eq 'x32') {
+		$SYS_epoll_create = 1073742037;
+		$SYS_epoll_ctl = 1073742057;
+		$SYS_epoll_wait = 1073742056;
+		$SYS_signalfd4 = 1073742113;
+		$SYS_renameat2 //= 0x40000000 + 316;
+		$SYS_fstatfs = 138;
+		$SYS_sendmsg = 0x40000206;
+		$SYS_recvmsg = 0x40000207;
+		$FS_IOC_GETFLAGS = 0x80046601;
+		$FS_IOC_SETFLAGS = 0x40046602;
+		$INOTIFY = {
+			SYS_inotify_init1 => 1073742118,
+			SYS_inotify_add_watch => 1073742078,
+			SYS_inotify_rm_watch => 1073742079,
+		};
+	} elsif ($machine eq 'sparc64') {
+		$SYS_epoll_create = 193;
+		$SYS_epoll_ctl = 194;
+		$SYS_epoll_wait = 195;
+		$u64_mod_8 = 1;
+		$SYS_signalfd4 = 317;
+		$SYS_renameat2 //= 345;
+		$SFD_CLOEXEC = 020000000;
+		$SYS_fstatfs = 158;
+		$SYS_sendmsg = 114;
+		$SYS_recvmsg = 113;
+		$FS_IOC_GETFLAGS = 0x40086601;
+		$FS_IOC_SETFLAGS = 0x80086602;
+	} elsif ($machine =~ m/^parisc/) { # untested, no machine on cfarm
+		$SYS_epoll_create = 224;
+		$SYS_epoll_ctl = 225;
+		$SYS_epoll_wait = 226;
+		$u64_mod_8 = 1;
+		$SYS_signalfd4 = 309;
+		$SIGNUM{WINCH} = 23;
+	} elsif ($machine =~ m/^ppc64/) {
+		$SYS_epoll_create = 236;
+		$SYS_epoll_ctl = 237;
+		$SYS_epoll_wait = 238;
+		$u64_mod_8 = 1;
+		$SYS_signalfd4 = 313;
+		$SYS_renameat2 //= 357;
+		$SYS_fstatfs = 100;
+		$SYS_sendmsg = 341;
+		$SYS_recvmsg = 342;
+		$FS_IOC_GETFLAGS = 0x40086601;
+		$FS_IOC_SETFLAGS = 0x80086602;
+		$INOTIFY = {
+			SYS_inotify_init1 => 318,
+			SYS_inotify_add_watch => 276,
+			SYS_inotify_rm_watch => 277,
+		};
+	} elsif ($machine eq "ppc") { # untested, no machine on cfarm
+		$SYS_epoll_create = 236;
+		$SYS_epoll_ctl = 237;
+		$SYS_epoll_wait = 238;
+		$u64_mod_8 = 1;
+		$SYS_signalfd4 = 313;
+		$SYS_renameat2 //= 357;
+		$SYS_fstatfs = 100;
+		$FS_IOC_GETFLAGS = 0x40086601;
+		$FS_IOC_SETFLAGS = 0x80086602;
+	} elsif ($machine =~ m/^s390/) { # untested, no machine on cfarm
+		$SYS_epoll_create = 249;
+		$SYS_epoll_ctl = 250;
+		$SYS_epoll_wait = 251;
+		$u64_mod_8 = 1;
+		$SYS_signalfd4 = 322;
+		$SYS_renameat2 //= 347;
+		$SYS_fstatfs = 100;
+		$SYS_sendmsg = 370;
+		$SYS_recvmsg = 372;
+	} elsif ($machine eq 'ia64') { # untested, no machine on cfarm
+		$SYS_epoll_create = 1243;
+		$SYS_epoll_ctl = 1244;
+		$SYS_epoll_wait = 1245;
+		$u64_mod_8 = 1;
+		$SYS_signalfd4 = 289;
+	} elsif ($machine eq "alpha") { # untested, no machine on cfarm
+		# natural alignment, ints are 32-bits
+		$SYS_epoll_create = 407;
+		$SYS_epoll_ctl = 408;
+		$SYS_epoll_wait = 409;
+		$u64_mod_8 = 1;
+		$SYS_signalfd4 = 484;
+		$SFD_CLOEXEC = 010000000;
+	} elsif ($machine =~ /\A(?:loong|a)arch64\z/ || $machine eq 'riscv64') {
+		$SYS_epoll_create = 20; # (sys_epoll_create1)
+		$SYS_epoll_ctl = 21;
+		$SYS_epoll_wait = 22; # (sys_epoll_pwait)
+		$u64_mod_8 = 1;
+		$no_deprecated = 1;
+		$SYS_signalfd4 = 74;
+		$SYS_renameat2 //= 276;
+		$SYS_fstatfs = 44;
+		$SYS_sendmsg = 211;
+		$SYS_recvmsg = 212;
+		$INOTIFY = {
+			SYS_inotify_init1 => 26,
+			SYS_inotify_add_watch => 27,
+			SYS_inotify_rm_watch => 28,
+		};
+		$FS_IOC_GETFLAGS = 0x80086601;
+		$FS_IOC_SETFLAGS = 0x40086602;
+	} elsif ($machine =~ m/arm(v\d+)?.*l/) { # ARM OABI (untested on cfarm)
+		$SYS_epoll_create = 250;
+		$SYS_epoll_ctl = 251;
+		$SYS_epoll_wait = 252;
+		$u64_mod_8 = 1;
+		$SYS_signalfd4 = 355;
+		$SYS_renameat2 //= 382;
+		$SYS_fstatfs = 100;
+		$SYS_sendmsg = 296;
+		$SYS_recvmsg = 297;
+	} elsif ($machine =~ m/^mips64/) { # cfarm only has 32-bit userspace
+		$SYS_epoll_create = 5207;
+		$SYS_epoll_ctl = 5208;
+		$SYS_epoll_wait = 5209;
+		$u64_mod_8 = 1;
+		$SYS_signalfd4 = 5283;
+		$SYS_renameat2 //= 5311;
+		$SYS_fstatfs = 5135;
+		$SYS_sendmsg = 5045;
+		$SYS_recvmsg = 5046;
+		$FS_IOC_GETFLAGS = 0x40046601;
+		$FS_IOC_SETFLAGS = 0x80046602;
+	} elsif ($machine =~ m/^mips/) { # 32-bit, tested on mips64 cfarm host
+		$SYS_epoll_create = 4248;
+		$SYS_epoll_ctl = 4249;
+		$SYS_epoll_wait = 4250;
+		$u64_mod_8 = 1;
+		$SYS_signalfd4 = 4324;
+		$SYS_renameat2 //= 4351;
+		$SYS_fstatfs = 4100;
+		$SYS_sendmsg = 4179;
+		$SYS_recvmsg = 4177;
+		$FS_IOC_GETFLAGS = 0x40046601;
+		$FS_IOC_SETFLAGS = 0x80046602;
+		$SIGNUM{WINCH} = 20;
+		$INOTIFY = {
+			SYS_inotify_init1 => 4329,
+			SYS_inotify_add_watch => 4285,
+			SYS_inotify_rm_watch => 4286,
+		};
+	} else {
+		warn <<EOM;
 machine=$machine ptrsize=$Config{ptrsize} has no syscall definitions
 git clone https://80x24.org/public-inbox.git and
 Send the output of ./devel/sysdefs-list to meta\@public-inbox.org
 EOM
-    }
-    if ($u64_mod_8) {
-        *epoll_wait = \&epoll_wait_mod8;
-        *epoll_ctl = \&epoll_ctl_mod8;
-    } else {
-        *epoll_wait = \&epoll_wait_mod4;
-        *epoll_ctl = \&epoll_ctl_mod4;
-    }
+	}
+	if ($u64_mod_8) {
+		*epoll_wait = \&epoll_wait_mod8;
+		*epoll_ctl = \&epoll_ctl_mod8;
+	} else {
+		*epoll_wait = \&epoll_wait_mod4;
+		*epoll_ctl = \&epoll_ctl_mod4;
+	}
 }
 
 # SFD_CLOEXEC is arch-dependent, so IN_CLOEXEC may be, too
@@ -291,10 +291,6 @@ $INOTIFY->{IN_CLOEXEC} //= 0x80000 if $INOTIFY;
 # use devel/sysdefs-list on Linux to detect new syscall numbers and
 # other system constants
 
-############################################################################
-# epoll functions
-############################################################################
-
 sub epoll_create {
 	syscall($SYS_epoll_create, $no_deprecated ? 0 : 100);
 }
@@ -302,10 +298,13 @@ sub epoll_create {
 # epoll_ctl wrapper
 # ARGS: (epfd, op, fd, events_mask)
 sub epoll_ctl_mod4 {
-    syscall($SYS_epoll_ctl, $_[0]+0, $_[1]+0, $_[2]+0, pack("LLL", $_[3], $_[2], 0));
+	syscall($SYS_epoll_ctl, $_[0]+0, $_[1]+0, $_[2]+0,
+		pack("LLL", $_[3], $_[2], 0));
 }
+
 sub epoll_ctl_mod8 {
-    syscall($SYS_epoll_ctl, $_[0]+0, $_[1]+0, $_[2]+0, pack("LLLL", $_[3], 0, $_[2], 0));
+	syscall($SYS_epoll_ctl, $_[0]+0, $_[1]+0, $_[2]+0,
+		pack("LLLL", $_[3], 0, $_[2], 0));
 }
 
 # epoll_wait wrapper


* [PATCH 2/2] syscall: use pure Perl sendmsg/recvmsg on *BSD
  2024-01-29 21:23   ` [PATCH 0/2] pure Perl sendmsg/recvmsg on *BSD Eric Wong
  2024-01-29 21:23     ` [PATCH 1/2] syscall: update formatting to match our codebase Eric Wong
@ 2024-01-29 21:23     ` Eric Wong
  1 sibling, 0 replies; 19+ messages in thread
From: Eric Wong @ 2024-01-29 21:23 UTC (permalink / raw)
  To: meta

Syscall symbols (e.g. SYS_*) have changed on us in FreeBSD during
the history of Sys::Syscall and this project, and that did bite
us in some cases, but the actual numbers don't get recycled for new
syscalls.  We're also fortunate that the sendmsg and recvmsg syscalls
and the associated msghdr and cmsg structs predate the BSD forks and
are compatible across all the BSDs I've tried.

OpenBSD routes Perl `syscall' through libc[1], while NetBSD[2] and
FreeBSD[3] document procedures for maintaining backwards compatibility.
It looks like Dragonfly follows FreeBSD here.

Tested on i386 OpenBSD, and amd64 {Free,Net,Open,Dragonfly}BSD.

This enables *BSD users to use lei, -cindex and future SCM_RIGHTS-only
features without needing Inline::C.

[1] https://cvsweb.openbsd.org/src/gnu/usr.bin/perl/gen_syscall_emulator.pl
[2] https://www.netbsd.org/docs/internals/en/chap-processes.html#syscall_versioning
[3] https://wiki.freebsd.org/AddingSyscalls#Backward_compatibily
---
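Not part of the commit: a sketch of packing an SCM_RIGHTS control
message in pure Perl, using the Linux-style layout expressed by
the constants this patch drops from lib/PublicInbox/Syscall.pm
(TMPL_size_t, SIZEOF_cmsghdr).  pack_scm_rights is a hypothetical
name, and other systems may lay the header out differently, which
is presumably why devel/sysdefs-list now reports cmsg_data_off.

```perl
use v5.12;
use Config;
use Socket qw(SOL_SOCKET SCM_RIGHTS);

use constant {
	SIZEOF_int => $Config{intsize},
	SIZEOF_size_t => $Config{sizesize},
};
use constant {
	TMPL_size_t => SIZEOF_size_t == 8 ? 'Q' : 'L',
	# cmsg_len, cmsg_level, cmsg_type (assumed Linux-style layout)
	SIZEOF_cmsghdr => SIZEOF_int * 2 + SIZEOF_size_t,
};

# returns a packed cmsghdr followed by the FD numbers as native ints
sub pack_scm_rights {
	my (@fds) = @_;
	my $nfds = scalar(@fds);
	my $cmsg_len = SIZEOF_cmsghdr + SIZEOF_int * $nfds;
	pack(TMPL_size_t . 'ii' . ('i' x $nfds),
		$cmsg_len, SOL_SOCKET, SCM_RIGHTS, @fds);
}

my $cmsg = pack_scm_rights(0, 1, 2);
my ($len, $level, $type, @fds) = unpack(TMPL_size_t . 'ii' . 'i*', $cmsg);
say "cmsg_len=$len fds=@fds";
```

The resulting buffer is what a msghdr's msg_control pointer would
refer to when the sendmsg syscall is invoked by number.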
 devel/sysdefs-list         |   9 +++-
 lib/PublicInbox/Syscall.pm | 102 +++++++++++++++++++++++--------------
 t/cmd_ipc.t                |   9 ++--
 3 files changed, 74 insertions(+), 46 deletions(-)

diff --git a/devel/sysdefs-list b/devel/sysdefs-list
index 61532cf2..ba51de6c 100755
--- a/devel/sysdefs-list
+++ b/devel/sysdefs-list
@@ -2,8 +2,6 @@
 # License: AGPL-3.0+ <http://www.gnu.org/licenses/agpl-3.0.txt>
 # Dump system-specific constant numbers this is to maintain
 # PublicInbox::Syscall and any other system-specific pieces.
-# DO NOT USE syscall numbers for *BSDs, none of the current BSD kernels
-# we know about promise stable syscall numbers (unlike Linux).
 # However, sysconf(3) constants are stable ABI on all safe to dump.
 eval 'exec perl -S $0 ${1+"$@"}' # no shebang
 	if 0; # running under some shell
@@ -179,5 +177,12 @@ int main(void)
 		PR_NUM(cmsg_type);
 	STRUCT_END;
 
+	{
+		struct cmsghdr cmsg;
+		uintptr_t cmsg_data_off;
+		cmsg_data_off = (uintptr_t)CMSG_DATA(&cmsg) - (uintptr_t)&cmsg;
+		D(cmsg_data_off);
+	}
+
 	return 0;
 }
diff --git a/lib/PublicInbox/Syscall.pm b/lib/PublicInbox/Syscall.pm
index 9071e6b1..829cfa3c 100644
--- a/lib/PublicInbox/Syscall.pm
+++ b/lib/PublicInbox/Syscall.pm
@@ -22,7 +22,7 @@ use POSIX qw(ENOENT ENOSYS EINVAL O_NONBLOCK);
 use Socket qw(SOL_SOCKET SCM_RIGHTS);
 use Config;
 our %SIGNUM = (WINCH => 28); # most Linux, {Free,Net,Open}BSD, *Darwin
-our $INOTIFY;
+our ($INOTIFY, %PACK);
 
 # $VERSION = '0.25'; # Sys::Syscall version
 our @EXPORT_OK = qw(epoll_ctl epoll_create epoll_wait
@@ -44,26 +44,21 @@ use constant {
 	EPOLL_CTL_MOD => 3,
 	SIZEOF_int => $Config{intsize},
 	SIZEOF_size_t => $Config{sizesize},
+	SIZEOF_ptr => $Config{ptrsize},
 	NUL => "\0",
 };
 
-use constant {
-	TMPL_size_t => SIZEOF_size_t == 8 ? 'Q' : 'L',
-	BYTES_4_hole => SIZEOF_size_t == 8 ? 'L' : '',
-	# cmsg_len, cmsg_level, cmsg_type
-	SIZEOF_cmsghdr => SIZEOF_int * 2 + SIZEOF_size_t,
-};
-
-my @BYTES_4_hole = BYTES_4_hole ? (0) : ();
+use constant TMPL_size_t => SIZEOF_size_t == 8 ? 'Q' : 'L';
 
 our ($SYS_epoll_create,
 	$SYS_epoll_ctl,
 	$SYS_epoll_wait,
 	$SYS_signalfd4,
 	$SYS_renameat2,
-	$F_SETPIPE_SZ);
+	$F_SETPIPE_SZ,
+	$SYS_sendmsg,
+	$SYS_recvmsg);
 
-my ($SYS_sendmsg, $SYS_recvmsg);
 my $SYS_fstatfs; # don't need fstatfs64, just statfs.f_type
 my ($FS_IOC_GETFLAGS, $FS_IOC_SETFLAGS);
 my $SFD_CLOEXEC = 02000000; # Perl does not expose O_CLOEXEC
@@ -78,7 +73,7 @@ if ($^O eq "linux") {
 	# boundaries.
 	my $u64_mod_8 = 0;
 
-	if ($Config{ptrsize} == 4) {
+	if (SIZEOF_ptr == 4) {
 		# if we're running on an x86_64 kernel, but a 32-bit process,
 		# we need to use the x32 or i386 syscall numbers.
 		if ($machine eq 'x86_64') {
@@ -281,16 +276,52 @@ EOM
 		*epoll_wait = \&epoll_wait_mod4;
 		*epoll_ctl = \&epoll_ctl_mod4;
 	}
+} elsif ($^O =~ /\A(?:freebsd|openbsd|netbsd|dragonfly)\z/) {
+# don't use syscall.ph here, name => number mappings are not stable on *BSD
+# but the actual numbers are.
+# OpenBSD perl redirects syscall perlop to libc functions
+# https://cvsweb.openbsd.org/src/gnu/usr.bin/perl/gen_syscall_emulator.pl
+# https://www.netbsd.org/docs/internals/en/chap-processes.html#syscall_versioning
+# https://wiki.freebsd.org/AddingSyscalls#Backward_compatibily
+# (I'm assuming Dragonfly copies FreeBSD, here, too)
+	$SYS_recvmsg = 27;
+	$SYS_sendmsg = 28;
+}
+
+BEGIN {
+	if ($^O eq 'linux') {
+		%PACK = (
+			TMPL_cmsg_len => TMPL_size_t,
+			# cmsg_len, cmsg_level, cmsg_type
+			SIZEOF_cmsghdr => SIZEOF_int * 2 + SIZEOF_size_t,
+			CMSG_DATA_off => '',
+			TMPL_msghdr => 'PL' . # msg_name, msg_namelen
+				'@'.(2 * SIZEOF_ptr).'P'. # msg_iov
+				'i'. # msg_iovlen
+				'@'.(4 * SIZEOF_ptr).'P'. # msg_control
+				'L'. # msg_controllen (socklen_t)
+				'i', # msg_flags
+		);
+	} elsif ($^O =~ /\A(?:freebsd|openbsd|netbsd|dragonfly)\z/) {
+		%PACK = (
+			TMPL_cmsg_len => 'L', # socklen_t
+			SIZEOF_cmsghdr => SIZEOF_int * 3,
+			CMSG_DATA_off => SIZEOF_ptr == 8 ? '@16' : '',
+			TMPL_msghdr => 'PL' . # msg_name, msg_namelen
+				'@'.(2 * SIZEOF_ptr).'P'. # msg_iov
+				TMPL_size_t. # msg_iovlen
+				'@'.(4 * SIZEOF_ptr).'P'. # msg_control
+				TMPL_size_t. # msg_controllen
+				'i', # msg_flags
+
+		)
+	}
+	$PACK{CMSG_ALIGN_size} = SIZEOF_size_t;
 }
 
 # SFD_CLOEXEC is arch-dependent, so IN_CLOEXEC may be, too
 $INOTIFY->{IN_CLOEXEC} //= 0x80000 if $INOTIFY;
 
-# use Inline::C for *BSD-only or general POSIX stuff.
-# Linux guarantees stable syscall numbering, BSDs only offer a stable libc
-# use devel/sysdefs-list on Linux to detect new syscall numbers and
-# other system constants
-
 sub epoll_create {
 	syscall($SYS_epoll_create, $no_deprecated ? 0 : 100);
 }
@@ -420,11 +451,13 @@ sub nodatacow_dir {
 	if (open my $fh, '<', $_[0]) { nodatacow_fh($fh) }
 }
 
-sub CMSG_ALIGN ($) { ($_[0] + SIZEOF_size_t - 1) & ~(SIZEOF_size_t - 1) }
+use constant \%PACK;
+sub CMSG_ALIGN ($) { ($_[0] + CMSG_ALIGN_size - 1) & ~(CMSG_ALIGN_size - 1) }
 use constant CMSG_ALIGN_SIZEOF_cmsghdr => CMSG_ALIGN(SIZEOF_cmsghdr);
 sub CMSG_SPACE ($) { CMSG_ALIGN($_[0]) + CMSG_ALIGN_SIZEOF_cmsghdr }
 sub CMSG_LEN ($) { CMSG_ALIGN_SIZEOF_cmsghdr + $_[0] }
-use constant msg_controllen => CMSG_SPACE(10 * SIZEOF_int) + 16; # 10 FDs
+use constant msg_controllen_max =>
+	CMSG_SPACE(10 * SIZEOF_int) + SIZEOF_cmsghdr; # space for 10 FDs
 
 if (defined($SYS_sendmsg) && defined($SYS_recvmsg)) {
 no warnings 'once';
@@ -436,20 +469,15 @@ require PublicInbox::CmdIPC4;
 			$_[2] // NUL, length($_[2] // NUL) || 1);
 	my $fd_space = scalar(@$fds) * SIZEOF_int;
 	my $msg_controllen = CMSG_SPACE($fd_space);
-	my $cmsghdr = pack(TMPL_size_t . # cmsg_len
+	my $cmsghdr = pack(TMPL_cmsg_len .
 			'LL' .  # cmsg_level, cmsg_type,
-			('i' x scalar(@$fds)) . # CMSG_DATA
+			CMSG_DATA_off.('i' x scalar(@$fds)). # CMSG_DATA
 			'@'.($msg_controllen - 1).'x1', # pad to space, not len
 			CMSG_LEN($fd_space), # cmsg_len
 			SOL_SOCKET, SCM_RIGHTS, # cmsg_{level,type}
 			@$fds); # CMSG_DATA
-	my $mh = pack('PL' . # msg_name, msg_namelen (socklen_t (U32))
-			BYTES_4_hole . # 4-byte padding on 64-bit
-			'P'.TMPL_size_t . # msg_iov, msg_iovlen,
-			'P'.TMPL_size_t . # msg_control, msg_controllen,
-			'i', # msg_flags
-			NUL, 0, # msg_name, msg_namelen (unused)
-			@BYTES_4_hole,
+	my $mh = pack(TMPL_msghdr,
+			undef, 0, # msg_name, msg_namelen (unused)
 			$iov, 1, # msg_iov, msg_iovlen
 			$cmsghdr, # msg_control
 			$msg_controllen,
@@ -465,18 +493,13 @@ require PublicInbox::CmdIPC4;
 *recv_cmd4 = sub ($$$) {
 	my ($sock, undef, $len) = @_;
 	vec($_[1] //= '', $len - 1, 8) = 0;
-	my $cmsghdr = "\0" x msg_controllen; # 10 * sizeof(int)
+	my $cmsghdr = "\0" x msg_controllen_max; # 10 * sizeof(int)
 	my $iov = pack('P'.TMPL_size_t, $_[1], $len);
-	my $mh = pack('PL' . # msg_name, msg_namelen (socklen_t (U32))
-			BYTES_4_hole . # 4-byte padding on 64-bit
-			'P'.TMPL_size_t . # msg_iov, msg_iovlen,
-			'P'.TMPL_size_t . # msg_control, msg_controllen,
-			'i', # msg_flags
-			NUL, 0, # msg_name, msg_namelen (unused)
-			@BYTES_4_hole,
+	my $mh = pack(TMPL_msghdr,
+			undef, 0, # msg_name, msg_namelen (unused)
 			$iov, 1, # msg_iov, msg_iovlen
 			$cmsghdr, # msg_control
-			msg_controllen,
+			msg_controllen_max,
 			0); # msg_flags
 	my $r;
 	do {
@@ -489,8 +512,9 @@ require PublicInbox::CmdIPC4;
 	substr($_[1], $r, length($_[1]), '');
 	my @ret;
 	if ($r > 0) {
-		my ($len, $lvl, $type, @fds) = unpack(TMPL_size_t . # cmsg_len
-					'LLi*', # cmsg_level, cmsg_type, @fds
+		my ($len, $lvl, $type, @fds) = unpack(TMPL_cmsg_len.
+					'LL'. # cmsg_level, cmsg_type
+					CMSG_DATA_off.'i*', # @fds
 					$cmsghdr);
 		if ($lvl == SOL_SOCKET && $type == SCM_RIGHTS) {
 			$len -= CMSG_ALIGN_SIZEOF_cmsghdr;
diff --git a/t/cmd_ipc.t b/t/cmd_ipc.t
index 08a4dcc3..c973c6f0 100644
--- a/t/cmd_ipc.t
+++ b/t/cmd_ipc.t
@@ -143,14 +143,13 @@ SKIP: {
 }
 
 SKIP: {
-	skip 'not Linux', 1 if $^O ne 'linux';
 	require_ok 'PublicInbox::Syscall';
 	$send = PublicInbox::Syscall->can('send_cmd4') or
-		skip 'send_cmd4 not defined for arch', 1;
+		skip "send_cmd4 not defined for $^O arch", 1;
 	$recv = PublicInbox::Syscall->can('recv_cmd4') or
-		skip 'recv_cmd4 not defined for arch', 1;
-	$do_test->(SOCK_STREAM, 0, 'PP Linux stream');
-	$do_test->(SOCK_SEQPACKET, 0, 'PP Linux seqpacket');
+		skip "recv_cmd4 not defined for $^O arch", 1;
+	$do_test->(SOCK_STREAM, 0, 'pure Perl stream');
+	$do_test->(SOCK_SEQPACKET, 0, 'pure Perl seqpacket');
 }
 
 done_testing;
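
The cmsghdr templates in the diff above can be exercised without any
sockets.  The following is only a sketch under stated assumptions: the
%layout values are hard-coded for a 64-bit *BSD (the real module derives
them from %Config), `pack_fds' and `unpack_fds' are hypothetical names,
and the buffer is padded with explicit NUL bytes rather than the
`@'/`x1' template suffix the patch uses.

```perl
use strict;
use warnings;
use Config;
use Socket qw(SOL_SOCKET SCM_RIGHTS);

# Assumed 64-bit *BSD values, mirroring the %PACK entries in the patch
my %layout = (
	TMPL_cmsg_len => 'L',	# socklen_t
	CMSG_DATA_off => '@16',	# CMSG_ALIGN(12) with 8-byte alignment
	hdr_aligned => 16,
	align => 8,
	sizeof_int => $Config{intsize},
);

# Hypothetical helper: build a control buffer carrying @fds
sub pack_fds {
	my ($l, @fds) = @_;
	my $data = scalar(@fds) * $l->{sizeof_int};
	my $mask = $l->{align} - 1;
	my $space = $l->{hdr_aligned} + (($data + $mask) & ~$mask);
	my $buf = pack($l->{TMPL_cmsg_len} . 'LL' . # len, level, type
			$l->{CMSG_DATA_off} . ('i' x scalar(@fds)),
			$l->{hdr_aligned} + $data, # cmsg_len
			SOL_SOCKET, SCM_RIGHTS, @fds);
	# pad up to CMSG_SPACE; cmsg_len itself stays unpadded
	$buf . ("\0" x ($space - length($buf)));
}

# Hypothetical helper: recover the FDs, ignoring trailing padding
sub unpack_fds {
	my ($l, $buf) = @_;
	my ($len, $lvl, $type, @fds) = unpack($l->{TMPL_cmsg_len} . 'LL' .
			$l->{CMSG_DATA_off} . 'i*', $buf);
	return unless $lvl == SOL_SOCKET && $type == SCM_RIGHTS;
	my $nfds = ($len - $l->{hdr_aligned}) / $l->{sizeof_int};
	@fds[0 .. $nfds - 1];
}

my $buf = pack_fds(\%layout, 3, 4, 5);
my @fds = unpack_fds(\%layout, $buf);
print "fds=@fds\n";
```

The round trip only checks that the pack and unpack templates agree
with each other; whether the kernel accepts the layout is what the
t/cmd_ipc.t changes above verify.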


end of thread, other threads:[~2024-01-29 21:27 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-11-28 14:56 [PATCH 00/14] IT'S ALIVE! www loads cindex join data Eric Wong
2023-11-28 14:56 ` [PATCH 01/14] test_common: create_*: detect changes all parameters Eric Wong
2023-11-28 14:56 ` [PATCH 02/14] t/cindex*: require SCM_RIGHTS for these tests Eric Wong
2024-01-29 21:23   ` [PATCH 0/2] pure Perl sendmsg/recvmsg on *BSD Eric Wong
2024-01-29 21:23     ` [PATCH 1/2] syscall: update formatting to match our codebase Eric Wong
2024-01-29 21:23     ` [PATCH 2/2] syscall: use pure Perl sendmsg/recvmsg on *BSD Eric Wong
2023-11-28 14:56 ` [PATCH 03/14] codesearch: eliminate redundant substitutions Eric Wong
2023-11-28 14:56 ` [PATCH 04/14] solver: schedule cleanup after synchronous git->check Eric Wong
2023-11-28 14:56 ` [PATCH 05/14] xap_helper.h: move cindex endpoints to separate file Eric Wong
2023-11-28 14:56 ` [PATCH 06/14] xap_helper: implement mset endpoint for WWW, IMAP, etc Eric Wong
2023-11-28 14:56 ` [PATCH 07/14] hval: use File::Spec to make relative paths for href Eric Wong
2023-11-28 14:56 ` [PATCH 08/14] www: load and use cindex join data Eric Wong
2023-11-28 14:56 ` [PATCH 09/14] git: speed up ->git_path for non-worktrees Eric Wong
2023-11-28 14:56 ` [PATCH 10/14] cindex: require `-g GIT_DIR' or `-r PROJECT_ROOT' Eric Wong
2023-11-28 14:56 ` [PATCH 11/14] git: speed up Git->new by 5% or so Eric Wong
2023-11-28 14:56 ` [PATCH 12/14] admin: resolve_git_dir respects symlinks Eric Wong
2023-11-28 14:56 ` [PATCH 13/14] cindex: extra quit checks Eric Wong
2023-11-28 14:56 ` [PATCH 14/14] www: start working on a repo listing Eric Wong
2023-11-28 17:55 ` [PATCH 15/14] www: load cindex join data for ->ALL, too Eric Wong
