unofficial mirror of meta@public-inbox.org
 help / color / mirror / Atom feed
* [PATCH 00/18] cindex: some --associate work
@ 2023-11-13 13:15 Eric Wong
  2023-11-13 13:15 ` [PATCH 01/18] cindex: check `say' errors w/ close or ->flush Eric Wong
                   ` (17 more replies)
  0 siblings, 18 replies; 19+ messages in thread
From: Eric Wong @ 2023-11-13 13:15 UTC (permalink / raw)
  To: meta

Still very much in flux, but some treewide cleanups in there...

And I've been wondering if "join" is a better word than
"associate" to denote the relationship between inboxes
and coderepos.

But "join" (even if we use join(1) internally) probably
implies strict relationships, whereas our current "associate"
is always going to be fuzzy due to patchids being fuzzy
and blobs OIDs being abbreviated in patches.

I'm also thinking about moving --associate-* CLI switches
into suboptions (e.g. what getsubopt(3) supports), so:

	--associate=aggressive,prefixes=patchid+dfblob

But Perl doesn't ship with getsubopt(3) emulation
out-of-the-box

Eric Wong (18):
  cindex: check `say' errors w/ close or ->flush
  tmpfile: check `stat' errors, use autodie for unlink
  cindex: use `local' for pipes between processes
  xap_helper_cxx: use write_file helper
  xap_helper_cxx: make the build process ccache-friendly
  xap_helper_cxx: use -pipe by default in CXXFLAGS
  xap_client: spawn C++ xap_helper directly
  treewide: update read_all to avoid eof|close checks
  spawn: don't append to scalarrefs on stdout/stderr
  cindex: imply --all with --associate w/o -I/--only
  cindex: delay associate until prune+indexing finish
  xap_helper: Perl dump_ibx respects `-m MAX'
  cidx_xap_helper_aux: complain about truncated inputs
  xap_helper: stricter and harsher error handling
  xap_helper: better variable naming for key buffer
  cindex: do not guess integer maximum for Xapian
  cindex: rename associate-max => window
  cindex: support --associate-aggressive shortcut

 lib/PublicInbox/CidxComm.pm         |   6 +-
 lib/PublicInbox/CidxXapHelperAux.pm |   6 +-
 lib/PublicInbox/CodeSearchIdx.pm    | 122 ++++++++++-----
 lib/PublicInbox/Gcf2.pm             |   3 +-
 lib/PublicInbox/IO.pm               |  18 ++-
 lib/PublicInbox/LeiInput.pm         |  10 +-
 lib/PublicInbox/LeiMirror.pm        |  10 +-
 lib/PublicInbox/LeiToMail.pm        |   3 +-
 lib/PublicInbox/Spawn.pm            |   4 +-
 lib/PublicInbox/TestCommon.pm       |   6 +-
 lib/PublicInbox/Tmpfile.pm          |  10 +-
 lib/PublicInbox/XapClient.pm        |  28 ++--
 lib/PublicInbox/XapHelper.pm        |  30 ++--
 lib/PublicInbox/XapHelperCxx.pm     |  55 +++----
 lib/PublicInbox/xap_helper.h        | 233 ++++++++++++----------------
 script/public-inbox-cindex          |   3 +-
 script/public-inbox-learn           |   2 +-
 script/public-inbox-mda             |   2 +-
 script/public-inbox-purge           |   2 +-
 t/spawn.t                           |   2 +-
 t/xap_helper.t                      |  27 ++--
 21 files changed, 287 insertions(+), 295 deletions(-)

Yay, less code!

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 01/18] cindex: check `say' errors w/ close or ->flush
  2023-11-13 13:15 [PATCH 00/18] cindex: some --associate work Eric Wong
@ 2023-11-13 13:15 ` Eric Wong
  2023-11-13 13:15 ` [PATCH 02/18] tmpfile: check `stat' errors, use autodie for unlink Eric Wong
                   ` (16 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: Eric Wong @ 2023-11-13 13:15 UTC (permalink / raw)
  To: meta

We actually need to rely on autodie `close' to check for errors,
since error-checking with `say' is not useful due to perlio
write buffering.  We'll also stop relying on `say ... or die'
since it's needless noise.

Fixes: 19f9089343c9 (cindex: drop redundant close on regular FH)
---
 lib/PublicInbox/CodeSearchIdx.pm | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/lib/PublicInbox/CodeSearchIdx.pm b/lib/PublicInbox/CodeSearchIdx.pm
index 0b00c303..5b80db44 100644
--- a/lib/PublicInbox/CodeSearchIdx.pm
+++ b/lib/PublicInbox/CodeSearchIdx.pm
@@ -468,12 +468,12 @@ sub partition_refs ($$$) {
 		chomp $cmt;
 		my $n = hex(substr($cmt, 0, 8)) % scalar(@RDONLY_XDB);
 		if ($REINDEX && $REINDEX->set_maybe(pack('H*', $cmt), '')) {
-			say { $shard_in[$n] } $cmt or die "say: $!";
+			say { $shard_in[$n] } $cmt;
 			++$NCHANGE;
 		} elsif (seen($RDONLY_XDB[$n], 'Q'.$cmt)) {
 			last if ++$seen > $SEEN_MAX;
 		} else {
-			say { $shard_in[$n] } $cmt or die "say: $!";
+			say { $shard_in[$n] } $cmt;
 			++$NCHANGE;
 			$seen = 0;
 		}
@@ -845,7 +845,7 @@ EOM
 		PublicInbox::Import::init_bare($git_dir, 'cidx-all', $fmt);
 		open $ALT_FH{$hexlen}, '>', "$git_dir/objects/info/alternates";
 	}
-	say { $ALT_FH{$hexlen} } $objdir or die "say: $!";
+	say { $ALT_FH{$hexlen} } $objdir;
 }
 
 sub prep_alternate_start {
@@ -969,7 +969,7 @@ sub dump_git_commits { # run_await cb
 	(defined($pid) && $?) and die "E: @PRUNE_BATCH: \$?=$?";
 	return if $DO_QUIT;
 	my ($hexlen) = keys(%ALT_FH) or return; # done
-	delete $ALT_FH{$hexlen};
+	close(delete $ALT_FH{$hexlen}); # flushes `say' buffer
 
 	$PRUNE_BATCH[1] = "--git-dir=$TMPDIR/hexlen$hexlen.git";
 	run_await(\@PRUNE_BATCH, undef, $batch_opt, \&dump_git_commits);

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 02/18] tmpfile: check `stat' errors, use autodie for unlink
  2023-11-13 13:15 [PATCH 00/18] cindex: some --associate work Eric Wong
  2023-11-13 13:15 ` [PATCH 01/18] cindex: check `say' errors w/ close or ->flush Eric Wong
@ 2023-11-13 13:15 ` Eric Wong
  2023-11-13 13:15 ` [PATCH 03/18] cindex: use `local' for pipes between processes Eric Wong
                   ` (15 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: Eric Wong @ 2023-11-13 13:15 UTC (permalink / raw)
  To: meta

`stat' can fail due to bugs on our end or ENOMEM, but there's
no autodie support for it.  So just die if `unlink' fails, since
the FS wouldn't be usable for tmpfiles in that state, anyways.
---
 lib/PublicInbox/Tmpfile.pm | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/lib/PublicInbox/Tmpfile.pm b/lib/PublicInbox/Tmpfile.pm
index 3040dd77..72dd9d24 100644
--- a/lib/PublicInbox/Tmpfile.pm
+++ b/lib/PublicInbox/Tmpfile.pm
@@ -1,9 +1,9 @@
-# Copyright (C) 2019-2021 all contributors <meta@public-inbox.org>
+# Copyright (C) all contributors <meta@public-inbox.org>
 # License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
 package PublicInbox::Tmpfile;
-use strict;
-use v5.10.1;
+use v5.12;
 use parent qw(Exporter);
+use autodie qw(unlink);
 our @EXPORT = qw(tmpfile);
 use Fcntl qw(:DEFAULT);
 use Errno qw(EEXIST);
@@ -21,7 +21,7 @@ sub tmpfile ($;$$) {
 	if (defined $sock) {
 		# add the socket inode number so we can figure out which
 		# socket it belongs to
-		my @st = stat($sock);
+		my @st = stat($sock) or die "stat($sock): $!";
 		$id .= '-ino:'.$st[1];
 	}
 	$id =~ tr!/!^!;
@@ -31,7 +31,7 @@ sub tmpfile ($;$$) {
 	do {
 		my $fn = File::Spec->tmpdir . "/$id-".time.'-'.rand;
 		if (sysopen(my $fh, $fn, $fl, 0600)) { # likely
-			unlink($fn) or warn "unlink($fn): $!"; # FS broken
+			unlink($fn);
 			return $fh; # success
 		}
 	} while ($! == EEXIST);

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 03/18] cindex: use `local' for pipes between processes
  2023-11-13 13:15 [PATCH 00/18] cindex: some --associate work Eric Wong
  2023-11-13 13:15 ` [PATCH 01/18] cindex: check `say' errors w/ close or ->flush Eric Wong
  2023-11-13 13:15 ` [PATCH 02/18] tmpfile: check `stat' errors, use autodie for unlink Eric Wong
@ 2023-11-13 13:15 ` Eric Wong
  2023-11-13 13:15 ` [PATCH 04/18] xap_helper_cxx: use write_file helper Eric Wong
                   ` (14 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: Eric Wong @ 2023-11-13 13:15 UTC (permalink / raw)
  To: meta

We can let these pipes get auto-closed upon leaving the process
subroutine scope.
---
 lib/PublicInbox/CodeSearchIdx.pm | 19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/lib/PublicInbox/CodeSearchIdx.pm b/lib/PublicInbox/CodeSearchIdx.pm
index 5b80db44..ffd443e1 100644
--- a/lib/PublicInbox/CodeSearchIdx.pm
+++ b/lib/PublicInbox/CodeSearchIdx.pm
@@ -522,8 +522,8 @@ sub dump_roots_start {
 	close $fh;
 	# dump_roots | sort -k1,1 | OFS=' ' uniq_fold >to_root_id
 	my ($sort_opt, $fold_opt);
-	pipe($sort_opt->{0}, my $sort_w);
-	pipe($fold_opt->{0}, $sort_opt->{1});
+	pipe(local $sort_opt->{0}, my $sort_w);
+	pipe(local $fold_opt->{0}, local $sort_opt->{1});
 	my @sort = (@SORT, '-k1,1');
 	my $dst = "$TMPDIR/to_root_id";
 	open $fold_opt->{1}, '>', $dst;
@@ -560,8 +560,8 @@ EOM
 sub dump_ibx_start {
 	my ($self, $associate) = @_;
 	my ($sort_opt, $fold_opt);
-	pipe($sort_opt->{0}, $DUMP_IBX_WPIPE);
-	pipe($fold_opt->{0}, $sort_opt->{1});
+	pipe(local $sort_opt->{0}, $DUMP_IBX_WPIPE);
+	pipe(local $fold_opt->{0}, local $sort_opt->{1});
 	my @sort = (@SORT, '-k1,1'); # sort only on ASSOC_PFX
 	# pipeline: dump_ibx | sort -k1,1 | uniq_fold >to_ibx_id
 	open $fold_opt->{1}, '>', "$TMPDIR/to_ibx_id";
@@ -952,8 +952,8 @@ sub init_prune ($) {
 	for (0..$#IDX_SHARDS) { push @delve, "$self->{xpfx}/$_" }
 	my $run_prune = PublicInbox::OnDestroy->new($$, \&run_prune, $self);
 	my ($sort_opt, $sed_opt, $delve_opt);
-	pipe($sed_opt->{0}, $delve_opt->{1});
-	pipe($sort_opt->{0}, $sed_opt->{1});
+	pipe(local $sed_opt->{0}, local $delve_opt->{1});
+	pipe(local $sort_opt->{0}, local $sed_opt->{1});
 	open($sort_opt->{1}, '+>', "$TMPDIR/indexed_commits");
 	run_await([@SORT, '-u'], $CMD_ENV, $sort_opt, \&cmd_done, $run_prune);
 	run_await(\@sed, $CMD_ENV, $sed_opt, \&cmd_done, $run_prune);
@@ -986,13 +986,12 @@ sub run_prune { # OnDestroy when `git config extensions.objectFormat' are done
 	# ) | awk | sort | comm | cidx_read_comm()
 	my ($awk_opt, $sort_opt, $batch_opt);
 	my $comm_opt = { -C => "$TMPDIR" };
-	pipe($awk_opt->{0}, $batch_opt->{1});
-	pipe($sort_opt->{0}, $awk_opt->{1});
-	pipe($comm_opt->{0}, $sort_opt->{1});
+	pipe(local $awk_opt->{0}, local $batch_opt->{1});
+	pipe(local $sort_opt->{0}, local $awk_opt->{1});
+	pipe(local $comm_opt->{0}, local $sort_opt->{1});
 	run_await(\@AWK, $CMD_ENV, $awk_opt, \&cmd_done);
 	run_await([@SORT, '-u'], $CMD_ENV, $sort_opt, \&cmd_done);
 	my $comm_rd = popen_rd(\@COMM, $CMD_ENV, $comm_opt, \&cmd_done, \@COMM);
-	%$_ = () for ($awk_opt, $sort_opt, $comm_opt); # comm_rd is blocking :<
 	PublicInbox::CidxComm->new($comm_rd, $self); # calls cidx_read_comm
 	my $git_ver = PublicInbox::Git::git_version();
 	push @PRUNE_BATCH, '--buffer' if $git_ver ge v2.6;

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 04/18] xap_helper_cxx: use write_file helper
  2023-11-13 13:15 [PATCH 00/18] cindex: some --associate work Eric Wong
                   ` (2 preceding siblings ...)
  2023-11-13 13:15 ` [PATCH 03/18] cindex: use `local' for pipes between processes Eric Wong
@ 2023-11-13 13:15 ` Eric Wong
  2023-11-13 13:15 ` [PATCH 05/18] xap_helper_cxx: make the build process ccache-friendly Eric Wong
                   ` (13 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: Eric Wong @ 2023-11-13 13:15 UTC (permalink / raw)
  To: meta

PublicInbox::IO already gets loaded by PublicInbox::Spawn, so
there's no avoiding it even if we want fast startup time :<
But startup time for this piece will be less relevant in the
near future...
---
 lib/PublicInbox/XapHelperCxx.pm | 16 ++++++----------
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/lib/PublicInbox/XapHelperCxx.pm b/lib/PublicInbox/XapHelperCxx.pm
index 908a71f4..0891074d 100644
--- a/lib/PublicInbox/XapHelperCxx.pm
+++ b/lib/PublicInbox/XapHelperCxx.pm
@@ -8,6 +8,7 @@
 package PublicInbox::XapHelperCxx;
 use v5.12;
 use PublicInbox::Spawn qw(run_qx which);
+use PublicInbox::IO qw(read_all write_file);
 use PublicInbox::Search;
 use Fcntl qw(SEEK_SET);
 use Config;
@@ -39,11 +40,11 @@ EOM
 
 sub needs_rebuild () {
 	open my $fh, '<', "$dir/XFLAGS" or return 1;
-	chomp(my $prev = <$fh>);
+	chomp(my $prev = read_all($fh));
 	return 1 if $prev ne $xflags;
 
 	open $fh, '<', "$dir/xap_modversion" or return 1;
-	chomp($prev = <$fh>);
+	chomp($prev = read_all($fh));
 	$prev or return 1;
 
 	$xap_modversion = xap_cfg('--modversion');
@@ -66,7 +67,7 @@ sub build () {
 	for (@srcs) {
 		say $fh qq(# line 1 "$_");
 		open my $rfh, '<', $_;
-		print $fh PublicInbox::IO::read_all $rfh;
+		print $fh read_all($rfh);
 	}
 	print $fh PublicInbox::Search::generate_cxx();
 	print $fh PublicInbox::CodeSearch::generate_cxx();
@@ -86,13 +87,8 @@ sub build () {
 
 	my $cmd = "$cxx $src $fl $xflags -o $tmp/$prog";
 	system($cmd) and die "$cmd failed: \$?=$?";
-	open $fh, '>', "$tmp/XFLAGS";
-	say $fh $xflags;
-	close $fh;
-
-	open $fh, '>', "$tmp/xap_modversion";
-	say $fh $xap_modversion;
-	close $fh;
+	write_file '>', "$tmp/XFLAGS", $xflags, "\n";
+	write_file '>', "$tmp/xap_modversion", $xap_modversion, "\n";
 	undef $xap_modversion; # do we ever build() twice?
 	# not quite atomic, but close enough :P
 	rename("$tmp/$_", "$dir/$_") for ($prog, qw(XFLAGS xap_modversion));

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 05/18] xap_helper_cxx: make the build process ccache-friendly
  2023-11-13 13:15 [PATCH 00/18] cindex: some --associate work Eric Wong
                   ` (3 preceding siblings ...)
  2023-11-13 13:15 ` [PATCH 04/18] xap_helper_cxx: use write_file helper Eric Wong
@ 2023-11-13 13:15 ` Eric Wong
  2023-11-13 13:15 ` [PATCH 06/18] xap_helper_cxx: use -pipe by default in CXXFLAGS Eric Wong
                   ` (12 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: Eric Wong @ 2023-11-13 13:15 UTC (permalink / raw)
  To: meta

We need to have stable filenames and separate compilation
from the linkage stage for ccache to hit.  So avoid the use
of a temporary directory and instead rely on a lock file to
guard against parallel builds.
---
 lib/PublicInbox/XapHelperCxx.pm | 32 +++++++++++++++++++-------------
 1 file changed, 19 insertions(+), 13 deletions(-)

diff --git a/lib/PublicInbox/XapHelperCxx.pm b/lib/PublicInbox/XapHelperCxx.pm
index 0891074d..7aa52416 100644
--- a/lib/PublicInbox/XapHelperCxx.pm
+++ b/lib/PublicInbox/XapHelperCxx.pm
@@ -7,7 +7,7 @@
 # The resulting executable is not linked to Perl in any way.
 package PublicInbox::XapHelperCxx;
 use v5.12;
-use PublicInbox::Spawn qw(run_qx which);
+use PublicInbox::Spawn qw(run_die run_qx which);
 use PublicInbox::IO qw(read_all write_file);
 use PublicInbox::Search;
 use Fcntl qw(SEEK_SET);
@@ -24,8 +24,8 @@ my @pm_dep = map { $srcpfx.$_ } qw(Search.pm CodeSearch.pm);
 my $ldflags = '-Wl,-O1';
 $ldflags .= ' -Wl,--compress-debug-sections=zlib' if $^O ne 'openbsd';
 my $xflags = ($ENV{CXXFLAGS} // '-Wall -ggdb3 -O0') . ' ' .
-	($ENV{LDFLAGS} // $ldflags) .
-	qq{ -DTHREADID=}.PublicInbox::Search::THREADID;
+	' -DTHREADID=' . PublicInbox::Search::THREADID .
+	' ' . ($ENV{LDFLAGS} // $ldflags);
 my $xap_modversion;
 
 sub xap_cfg (@) {
@@ -58,12 +58,12 @@ sub build () {
 		die "mkdir($dir): $err" if !-d $dir;
 	}
 	use autodie;
-	require File::Temp;
 	require PublicInbox::CodeSearch;
+	require PublicInbox::Lock;
+	require PublicInbox::OnDestroy;
 	my ($prog) = ($bin =~ m!/([^/]+)\z!);
-	my $tmp = File::Temp->newdir(DIR => $dir) // die "newdir: $!";
-	my $src = "$tmp/$prog.cpp";
-	open my $fh, '>', $src;
+	my $lk = PublicInbox::Lock->new("$dir/$prog.lock")->lock_for_scope;
+	open my $fh, '>', "$dir/$prog.cpp";
 	for (@srcs) {
 		say $fh qq(# line 1 "$_");
 		open my $rfh, '<', $_;
@@ -73,6 +73,10 @@ sub build () {
 	print $fh PublicInbox::CodeSearch::generate_cxx();
 	close $fh;
 
+	opendir my $dh, '.';
+	my $restore = PublicInbox::OnDestroy->new(\&chdir, $dh);
+	chdir $dir;
+
 	# xap_modversion may be set by needs_rebuild
 	$xap_modversion //= xap_cfg('--modversion');
 	my $fl = xap_cfg(qw(--libs --cflags));
@@ -84,14 +88,16 @@ sub build () {
 	# distributed packages.
 	$^O eq 'netbsd' and $fl =~ s/(\A|[ \t])\-L([^ \t]+)([ \t]|\z)/
 				"$1-L$2 -Wl,-rpath=$2$3"/egsx;
-
-	my $cmd = "$cxx $src $fl $xflags -o $tmp/$prog";
-	system($cmd) and die "$cmd failed: \$?=$?";
-	write_file '>', "$tmp/XFLAGS", $xflags, "\n";
-	write_file '>', "$tmp/xap_modversion", $xap_modversion, "\n";
+	my @xflags = split(/\s+/, "$fl $xflags");
+	my @cflags = grep(!/\A-(?:Wl|l|L)/, @xflags);
+	run_die([$cxx, '-c', "$prog.cpp", @cflags]);
+	run_die([$cxx, '-o', "$prog.tmp", "$prog.o", @xflags]);
+	unlink "$prog.cpp", "$prog.o";
+	write_file '>', 'XFLAGS.tmp', $xflags, "\n";
+	write_file '>', 'xap_modversion.tmp', $xap_modversion, "\n";
 	undef $xap_modversion; # do we ever build() twice?
 	# not quite atomic, but close enough :P
-	rename("$tmp/$_", "$dir/$_") for ($prog, qw(XFLAGS xap_modversion));
+	rename("$_.tmp", $_) for ($prog, qw(XFLAGS xap_modversion));
 }
 
 sub check_build () {

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 06/18] xap_helper_cxx: use -pipe by default in CXXFLAGS
  2023-11-13 13:15 [PATCH 00/18] cindex: some --associate work Eric Wong
                   ` (4 preceding siblings ...)
  2023-11-13 13:15 ` [PATCH 05/18] xap_helper_cxx: make the build process ccache-friendly Eric Wong
@ 2023-11-13 13:15 ` Eric Wong
  2023-11-13 13:15 ` [PATCH 07/18] xap_client: spawn C++ xap_helper directly Eric Wong
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: Eric Wong @ 2023-11-13 13:15 UTC (permalink / raw)
  To: meta

-ggdb3 is already used for g++ and clang, and -pipe is supported
by clang even if it's a no-op.  So just use it to speed up g++
since it saves me 30-40ms.

We'll also get rid of the explicit `-O0' since it's the default
for both clang and g++.
---
 lib/PublicInbox/XapHelperCxx.pm | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/PublicInbox/XapHelperCxx.pm b/lib/PublicInbox/XapHelperCxx.pm
index 7aa52416..3afdd69e 100644
--- a/lib/PublicInbox/XapHelperCxx.pm
+++ b/lib/PublicInbox/XapHelperCxx.pm
@@ -23,7 +23,7 @@ my @srcs = map { $srcpfx.$_ } qw(xap_helper.h);
 my @pm_dep = map { $srcpfx.$_ } qw(Search.pm CodeSearch.pm);
 my $ldflags = '-Wl,-O1';
 $ldflags .= ' -Wl,--compress-debug-sections=zlib' if $^O ne 'openbsd';
-my $xflags = ($ENV{CXXFLAGS} // '-Wall -ggdb3 -O0') . ' ' .
+my $xflags = ($ENV{CXXFLAGS} // '-Wall -ggdb3 -pipe') . ' ' .
 	' -DTHREADID=' . PublicInbox::Search::THREADID .
 	' ' . ($ENV{LDFLAGS} // $ldflags);
 my $xap_modversion;

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 07/18] xap_client: spawn C++ xap_helper directly
  2023-11-13 13:15 [PATCH 00/18] cindex: some --associate work Eric Wong
                   ` (5 preceding siblings ...)
  2023-11-13 13:15 ` [PATCH 06/18] xap_helper_cxx: use -pipe by default in CXXFLAGS Eric Wong
@ 2023-11-13 13:15 ` Eric Wong
  2023-11-13 13:15 ` [PATCH 08/18] treewide: update read_all to avoid eof|close checks Eric Wong
                   ` (10 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: Eric Wong @ 2023-11-13 13:15 UTC (permalink / raw)
  To: meta

No need to suffer through an extra dose of slow Perl load times
when we can drive the build in the big parent Perl process and
get the executable path name to pass to spawn directly.
---
 lib/PublicInbox/XapClient.pm    | 28 ++++++++++++++--------------
 lib/PublicInbox/XapHelperCxx.pm |  9 ++++-----
 t/xap_helper.t                  | 22 +++++++++-------------
 3 files changed, 27 insertions(+), 32 deletions(-)

diff --git a/lib/PublicInbox/XapClient.pm b/lib/PublicInbox/XapClient.pm
index 21c89265..dda5e044 100644
--- a/lib/PublicInbox/XapClient.pm
+++ b/lib/PublicInbox/XapClient.pm
@@ -11,7 +11,7 @@ use v5.12;
 use PublicInbox::Spawn qw(spawn);
 use Socket qw(AF_UNIX SOCK_SEQPACKET);
 use PublicInbox::IPC;
-use autodie qw(pipe socketpair);
+use autodie qw(fork pipe socketpair);
 
 sub mkreq {
 	my ($self, $ios, @arg) = @_;
@@ -28,19 +28,19 @@ sub mkreq {
 sub start_helper {
 	my @argv = @_;
 	socketpair(my $sock, my $in, AF_UNIX, SOCK_SEQPACKET, 0);
-	my $cls = ($ENV{PI_NO_CXX} ? undef : eval {
-			require PublicInbox::XapHelperCxx;
-			PublicInbox::XapHelperCxx::check_build();
-			'PublicInbox::XapHelperCxx';
-		}) // do {
-			require PublicInbox::XapHelper;
-			'PublicInbox::XapHelper';
-		};
-	# ensure the child process has the same @INC we do:
-	my $env = { PERL5LIB => join(':', @INC) };
-	my $pid = spawn([$^X, ($^W ? ('-w') : ()), "-M$cls", '-e',
-				$cls.'::start(@ARGV)', '--', @argv],
-			$env, { 0 => $in });
+	require PublicInbox::XapHelperCxx;
+	my $cls = 'PublicInbox::XapHelperCxx';
+	my $env;
+	my $cmd = eval { PublicInbox::XapHelperCxx::cmd() };
+	if ($@) { # fall back to Perl + XS|SWIG
+		require PublicInbox::XapHelper;
+		$cls = 'PublicInbox::XapHelper';
+		# ensure the child process has the same @INC we do:
+		$env = { PERL5LIB => join(':', @INC) };
+		$cmd = [$^X, ($^W ? ('-w') : ()), "-M$cls", '-e',
+			$cls.'::start(@ARGV)', '--' ];
+	}
+	my $pid = spawn($cmd, $env, { 0 => $in });
 	((bless { io => $sock, impl => $cls }, __PACKAGE__), $pid);
 }
 
diff --git a/lib/PublicInbox/XapHelperCxx.pm b/lib/PublicInbox/XapHelperCxx.pm
index 3afdd69e..e516b111 100644
--- a/lib/PublicInbox/XapHelperCxx.pm
+++ b/lib/PublicInbox/XapHelperCxx.pm
@@ -114,17 +114,16 @@ sub check_build () {
 	needs_rebuild() ? build() : 0;
 }
 
-sub start (@) {
+# returns spawn arg
+sub cmd {
 	check_build();
 	my @cmd;
 	if (my $v = $ENV{VALGRIND}) {
 		$v = 'valgrind -v' if $v eq '1';
 		@cmd = split(/\s+/, $v);
 	}
-	push @cmd, $bin, @_;
-	my $prog = $cmd[0];
-	$cmd[0] =~ s!\A.*?/([^/]+)\z!$1!;
-	exec { $prog } @cmd;
+	push @cmd, $bin;
+	\@cmd;
 }
 
 1;
diff --git a/t/xap_helper.t b/t/xap_helper.t
index 7890392d..83f59d7d 100644
--- a/t/xap_helper.t
+++ b/t/xap_helper.t
@@ -61,11 +61,11 @@ my $doreq = sub {
 
 my $env = { PERL5LIB => join(':', @INC) };
 my $test = sub {
-	my (@arg) = @_;
+	my (@cmd) = @_;
 	socketpair(my $s, my $y, AF_UNIX, SOCK_SEQPACKET, 0);
-	my $pid = spawn([$^X, '-w', @arg], $env, { 0 => $y });
+	my $pid = spawn(\@cmd, $env, { 0 => $y });
 	my $ar = PublicInbox::AutoReap->new($pid);
-	diag "$arg[-1] running pid=$pid";
+	diag "$cmd[-1] running pid=$pid";
 	close $y;
 	my $r = $doreq->($s, qw(test_inspect -d), $ibx_idx[0]);
 	my %info = map { split(/=/, $_, 2) } split(/ /, do { local $/; <$r> });
@@ -141,24 +141,20 @@ my $test = sub {
 
 my @NO_CXX = (1);
 unless ($ENV{TEST_XH_CXX_ONLY}) {
-	my $ar = $test->(qw[-MPublicInbox::XapHelper -e
+	my $ar = $test->($^X, qw[-w -MPublicInbox::XapHelper -e
 			PublicInbox::XapHelper::start('-j0')]);
-	($ar, my $s) = $test->(qw[-MPublicInbox::XapHelper -e
+	($ar, my $s) = $test->($^X, qw[-w -MPublicInbox::XapHelper -e
 			PublicInbox::XapHelper::start('-j1')]);
 	no_pollerfd($ar->{pid});
 }
 SKIP: {
-	eval {
-		require PublicInbox::XapHelperCxx;
-		PublicInbox::XapHelperCxx::check_build();
-	};
+	require PublicInbox::XapHelperCxx;
+	my $cmd = eval { PublicInbox::XapHelperCxx::cmd() };
 	skip "XapHelperCxx build: $@", 1 if $@ || $ENV{PI_NO_CXX};
 
 	@NO_CXX = $ENV{TEST_XH_CXX_ONLY} ? (0) : (0, 1);
-	my $ar = $test->(qw[-MPublicInbox::XapHelperCxx -e
-			PublicInbox::XapHelperCxx::start('-j0')]);
-	$ar = $test->(qw[-MPublicInbox::XapHelperCxx -e
-			PublicInbox::XapHelperCxx::start('-j1')]);
+	my $ar = $test->(@$cmd, '-j0');
+	$ar = $test->(@$cmd, '-j1');
 };
 
 require PublicInbox::CodeSearch;

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 08/18] treewide: update read_all to avoid eof|close checks
  2023-11-13 13:15 [PATCH 00/18] cindex: some --associate work Eric Wong
                   ` (6 preceding siblings ...)
  2023-11-13 13:15 ` [PATCH 07/18] xap_client: spawn C++ xap_helper directly Eric Wong
@ 2023-11-13 13:15 ` Eric Wong
  2023-11-13 13:15 ` [PATCH 09/18] spawn: don't append to scalarrefs on stdout/stderr Eric Wong
                   ` (9 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: Eric Wong @ 2023-11-13 13:15 UTC (permalink / raw)
  To: meta

read_all can be expanded to support FIFOs/pipes/sockets where
read-until-EOF behavior is desired.  We can also rely on
wantarray to support splitting on EOL markers, but it's
hard-coded to support only `$/ eq "\n"' since (AFAIK)
it's the only way we use the wantarray form `readline'.
---
 lib/PublicInbox/CodeSearchIdx.pm |  3 +--
 lib/PublicInbox/Gcf2.pm          |  3 +--
 lib/PublicInbox/IO.pm            | 18 +++++++++++++-----
 lib/PublicInbox/LeiInput.pm      | 10 ++++------
 lib/PublicInbox/LeiMirror.pm     | 10 ++++------
 lib/PublicInbox/LeiToMail.pm     |  3 +--
 lib/PublicInbox/Spawn.pm         |  4 +---
 lib/PublicInbox/TestCommon.pm    |  6 +++---
 script/public-inbox-learn        |  2 +-
 script/public-inbox-mda          |  2 +-
 script/public-inbox-purge        |  2 +-
 11 files changed, 31 insertions(+), 32 deletions(-)

diff --git a/lib/PublicInbox/CodeSearchIdx.pm b/lib/PublicInbox/CodeSearchIdx.pm
index ffd443e1..3601176c 100644
--- a/lib/PublicInbox/CodeSearchIdx.pm
+++ b/lib/PublicInbox/CodeSearchIdx.pm
@@ -627,8 +627,7 @@ sub index_repo { # run_git cb
 	my $repo = delete $git->{-repo} or return index_next($self);
 	my $roots_fh = delete $repo->{roots_fh} // die 'BUG: no {roots_fh}';
 	seek($roots_fh, 0, SEEK_SET);
-	chomp(my @roots = <$roots_fh>);
-	$roots_fh = eof($roots_fh) | close $roots_fh; # detect readline errors
+	chomp(my @roots = PublicInbox::IO::read_all $roots_fh);
 	if (!@roots) {
 		warn("E: $git->{git_dir} has no root commits\n");
 		return index_next($self);
diff --git a/lib/PublicInbox/Gcf2.pm b/lib/PublicInbox/Gcf2.pm
index 5490049d..e0219b55 100644
--- a/lib/PublicInbox/Gcf2.pm
+++ b/lib/PublicInbox/Gcf2.pm
@@ -81,8 +81,7 @@ sub add_alt ($$) {
 	#
 	# See https://bugs.debian.org/975607
 	if (open(my $fh, '<', "$objdir/info/alternates")) {
-		chomp(my @abs_alt = grep(m!^/!, <$fh>));
-		$fh = eof($fh) | close $fh; # detect readline errors
+		chomp(my @abs_alt = grep m!^/!, PublicInbox::IO::read_all $fh);
 		$gcf2->add_alternate($_) for @abs_alt;
 	}
 	$gcf2->add_alternate($objdir);
diff --git a/lib/PublicInbox/IO.pm b/lib/PublicInbox/IO.pm
index 0d303500..11ce9be1 100644
--- a/lib/PublicInbox/IO.pm
+++ b/lib/PublicInbox/IO.pm
@@ -62,13 +62,21 @@ sub poll_in ($;$) {
 	IO::Poll::_poll($_[1] // -1, fileno($_[0]), my $ev = POLLIN);
 }
 
-sub read_all ($;$$) {
+sub read_all ($;$$$) { # pass $len=0 to read until EOF for :utf8 handles
 	use autodie qw(read);
-	my ($io, $len, $bref) = @_;
+	my ($io, $len, $bref, $off) = @_;
 	$bref //= \(my $buf);
-	my $r = read($io, $$bref, $len //= -s $io);
-	croak("read($io) ($r != $len)") if $len != $r;
-	$$bref;
+	$off //= 0;
+	my $r = 0;
+	if (my $left = $len //= -s $io) { # known size (binmode :raw/:unix)
+		do { # retry for binmode :unix
+			$r = read($io, $$bref, $left, $off += $r) or croak(
+				"read($io) premature EOF ($left/$len remain)");
+		} while ($left -= $r);
+	} else { # read until EOF
+		while (($r = read($io, $$bref, 65536, $off += $r))) {}
+	}
+	wantarray ? split(/^/sm, $$bref) : $$bref
 }
 
 sub try_cat ($) {
diff --git a/lib/PublicInbox/LeiInput.pm b/lib/PublicInbox/LeiInput.pm
index 68c3c459..daba9a8e 100644
--- a/lib/PublicInbox/LeiInput.pm
+++ b/lib/PublicInbox/LeiInput.pm
@@ -79,12 +79,10 @@ sub input_net_cb { # imap_each, nntp_each cb
 sub input_fh {
 	my ($self, $ifmt, $fh, $name, @args) = @_;
 	if ($ifmt eq 'eml') {
-		my $buf = do { local $/; <$fh> };
-		my $ok = defined($buf) ? 1 : 0;
-		++$ok if eof($fh);
-		++$ok if $fh->close;
-		$ok == 3 or return $self->{lei}->child_error($?, <<"");
-error reading $name: $! (\$?=$?)
+		my $buf = eval { PublicInbox::IO::read_all $fh, 0 };
+		my $e = $@;
+		return $self->{lei}->child_error($?, <<"") if !$fh->close || $e;
+error reading $name: $! (\$?=$?) (\$@=$e)
 
 		PublicInbox::Eml::strip_from($buf);
 
diff --git a/lib/PublicInbox/LeiMirror.pm b/lib/PublicInbox/LeiMirror.pm
index 84266d03..e048a807 100644
--- a/lib/PublicInbox/LeiMirror.pm
+++ b/lib/PublicInbox/LeiMirror.pm
@@ -307,10 +307,9 @@ sub fgrp_update {
 	my $dstfh = delete $fgrp->{dstfh} or return;
 	seek($srcfh, 0, SEEK_SET);
 	seek($dstfh, 0, SEEK_SET);
-	my %src = map { chomp; split(/\0/) } (<$srcfh>);
-	my %dst = map { chomp; split(/\0/) } (<$dstfh>);
-	$srcfh = eof($srcfh) | close $srcfh; # detects readline errors
-	$dstfh = eof($dstfh) | close $dstfh; # ditto
+	my %src = map { chomp; split /\0/ } PublicInbox::IO::read_all $srcfh;
+	my %dst = map { chomp; split /\0/ } PublicInbox::IO::read_all $dstfh;
+	$srcfh = $dstfh = undef;
 	my $w = start_update_ref($fgrp) or return;
 	my $lei = $fgrp->{lei};
 	my $ndel;
@@ -508,8 +507,7 @@ EOM
 		my $l = File::Spec->abs2rel($alt, File::Spec->rel2abs($o));
 		open my $fh, '+>>', my $f = "$o/info/alternates";
 		seek($fh, 0, SEEK_SET); # Perl did SEEK_END when it saw '>>'
-		my $seen = grep(/\A\Q$l\E\n/, <$fh>); # error check with eof
-		eof($fh) or die "not at `$f' EOF ($!)"; # $! was set by readline
+		my $seen = grep /\A\Q$l\E\n/, PublicInbox::IO::read_all $fh;
 		print $fh "$l\n" if !$seen;
 		close $fh;
 	}
diff --git a/lib/PublicInbox/LeiToMail.pm b/lib/PublicInbox/LeiToMail.pm
index 0d2f586a..2928be45 100644
--- a/lib/PublicInbox/LeiToMail.pm
+++ b/lib/PublicInbox/LeiToMail.pm
@@ -687,8 +687,7 @@ sub _pre_augment_v2 {
 	my $d = "$lei->{ale}->{git}->{git_dir}/objects";
 	open my $fh, '+>>', my $f = "$dir/git/0.git/objects/info/alternates";
 	seek($fh, 0, SEEK_SET); # Perl did SEEK_END when it saw '>>'
-	my $seen = grep(/\A\Q$d\E\n/, <$fh>);
-	eof($fh) or die "not at `$f' EOF ($!)"; # $! was set by readline
+	my $seen = grep /\A\Q$d\E\n/, PublicInbox::IO::read_all $fh;
 	print $fh "$d\n" if !$seen;
 	close $fh;
 }
diff --git a/lib/PublicInbox/Spawn.pm b/lib/PublicInbox/Spawn.pm
index 8cc4dfaf..849438bc 100644
--- a/lib/PublicInbox/Spawn.pm
+++ b/lib/PublicInbox/Spawn.pm
@@ -396,14 +396,12 @@ sub popen_wr {
 
 sub read_out_err ($) {
 	my ($opt) = @_;
-	local $/;
 	for my $fd (1, 2) { # read stdout/stderr
 		my $fh = delete($opt->{"fh.$fd"}) // next;
 		seek($fh, 0, SEEK_SET);
 		my $dst = $opt->{$fd};
 		$dst = $opt->{$fd} = $dst->[1] if ref($dst) eq 'ARRAY';
-		$$dst .= <$fh>;
-		$fh = eof($fh) | close $fh; # detects readline errors
+		PublicInbox::IO::read_all $fh, 0, $dst, length($$dst);
 	}
 }
 
diff --git a/lib/PublicInbox/TestCommon.pm b/lib/PublicInbox/TestCommon.pm
index b84886a0..caf709c2 100644
--- a/lib/PublicInbox/TestCommon.pm
+++ b/lib/PublicInbox/TestCommon.pm
@@ -48,16 +48,16 @@ sub require_bsd (;$) {
 
 sub xbail (@) { BAIL_OUT join(' ', map { ref() ? (explain($_)) : ($_) } @_) }
 
-sub read_all ($;$$) {
+sub read_all ($;$$$) {
 	require PublicInbox::IO;
-	PublicInbox::IO::read_all($_[0], $_[1], $_[2])
+	PublicInbox::IO::read_all($_[0], $_[1], $_[2], $_[3])
 }
 
 sub eml_load ($) {
 	my ($path, $cb) = @_;
 	open(my $fh, '<', $path);
 	require PublicInbox::Eml;
-	PublicInbox::Eml->new(\(read_all($fh)));
+	PublicInbox::Eml->new(\(scalar read_all $fh));
 }
 
 sub tmpdir (;$) {
diff --git a/script/public-inbox-learn b/script/public-inbox-learn
index 6a1bc890..a955cdf6 100755
--- a/script/public-inbox-learn
+++ b/script/public-inbox-learn
@@ -42,7 +42,7 @@ local $PublicInbox::Import::DROP_UNIQUE_UNSUB;
 PublicInbox::Import::load_config($pi_cfg);
 my $err;
 my $mime = PublicInbox::Eml->new(do{
-	defined(my $data = do { local $/; <STDIN> }) or die "read STDIN: $!\n";
+	my $data = PublicInbox::IO::read_all \*STDIN;
 	PublicInbox::Eml::strip_from($data);
 
 	if ($train ne 'rm') {
diff --git a/script/public-inbox-mda b/script/public-inbox-mda
index b2e0908d..b463b07b 100755
--- a/script/public-inbox-mda
+++ b/script/public-inbox-mda
@@ -44,7 +44,7 @@ use PublicInbox::Spamcheck;
 # in case there's bugs in our code or user error.
 my $emergency = $ENV{PI_EMERGENCY} || "$ENV{HOME}/.public-inbox/emergency/";
 $ems = PublicInbox::Emergency->new($emergency);
-my $str = do { local $/; <STDIN> };
+my $str = PublicInbox::IO::read_all \*STDIN;
 PublicInbox::Eml::strip_from($str);
 $ems->prepare(\$str);
 my $eml = PublicInbox::Eml->new(\$str);
diff --git a/script/public-inbox-purge b/script/public-inbox-purge
index 8f9b0b16..618cfec4 100755
--- a/script/public-inbox-purge
+++ b/script/public-inbox-purge
@@ -33,7 +33,7 @@ PublicInbox::Admin::do_chdir(delete $opt->{C});
 my @ibxs = PublicInbox::Admin::resolve_inboxes(\@ARGV, $opt);
 PublicInbox::AdminEdit::check_editable(\@ibxs);
 
-defined(my $data = do { local $/; <STDIN> }) or die "read STDIN: $!\n";
+my $data = PublicInbox::IO::read_all \*STDIN;
 PublicInbox::Eml::strip_from($data);
 my $n_purged = 0;
 

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 09/18] spawn: don't append to scalarrefs on stdout/stderr
  2023-11-13 13:15 [PATCH 00/18] cindex: some --associate work Eric Wong
                   ` (7 preceding siblings ...)
  2023-11-13 13:15 ` [PATCH 08/18] treewide: update read_all to avoid eof|close checks Eric Wong
@ 2023-11-13 13:15 ` Eric Wong
  2023-11-13 13:15 ` [PATCH 10/18] cindex: imply --all with --associate w/o -I/--only Eric Wong
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: Eric Wong @ 2023-11-13 13:15 UTC (permalink / raw)
  To: meta

None of our current code relies on it, and I can't imagine it's
something we'd need in the future, actually...  This keeps the
door open for relying more on Spawn in TestCommon.
---
 lib/PublicInbox/Spawn.pm | 2 +-
 t/spawn.t                | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/PublicInbox/Spawn.pm b/lib/PublicInbox/Spawn.pm
index 849438bc..9c680690 100644
--- a/lib/PublicInbox/Spawn.pm
+++ b/lib/PublicInbox/Spawn.pm
@@ -401,7 +401,7 @@ sub read_out_err ($) {
 		seek($fh, 0, SEEK_SET);
 		my $dst = $opt->{$fd};
 		$dst = $opt->{$fd} = $dst->[1] if ref($dst) eq 'ARRAY';
-		PublicInbox::IO::read_all $fh, 0, $dst, length($$dst);
+		PublicInbox::IO::read_all $fh, 0, $dst
 	}
 }
 
diff --git a/t/spawn.t b/t/spawn.t
index 3479b6b3..48f541b8 100644
--- a/t/spawn.t
+++ b/t/spawn.t
@@ -26,7 +26,7 @@ require PublicInbox::DS;
 	is($out, 'in', 'stdin read and stdout captured');
 	$opt->{0} = \"IN\n3\nLINES";
 	my @out = run_qx(['sh', '-c', 'echo E >&2; cat'], undef, $opt);
-	is($e, "e\nE\n", 'captured stderr appended to string');
+	is($e, "E\n", 'captured stderr clobbers string');
 	is_deeply(\@out, [ "IN\n", "3\n", 'LINES' ], 'stdout array');
 }
 

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 10/18] cindex: imply --all with --associate w/o -I/--only
  2023-11-13 13:15 [PATCH 00/18] cindex: some --associate work Eric Wong
                   ` (8 preceding siblings ...)
  2023-11-13 13:15 ` [PATCH 09/18] spawn: don't append to scalarrefs on stdout/stderr Eric Wong
@ 2023-11-13 13:15 ` Eric Wong
  2023-11-13 13:15 ` [PATCH 11/18] cindex: delay associate until prune+indexing finish Eric Wong
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: Eric Wong @ 2023-11-13 13:15 UTC (permalink / raw)
  To: meta

I just forgot to use --all with --associate and it wasn't
easily apparent what was wrong.  We'll also show some extra
progress while we're at it.
---
 lib/PublicInbox/CodeSearchIdx.pm | 26 +++++++++++++++++++++-----
 1 file changed, 21 insertions(+), 5 deletions(-)

diff --git a/lib/PublicInbox/CodeSearchIdx.pm b/lib/PublicInbox/CodeSearchIdx.pm
index 3601176c..0bd26af2 100644
--- a/lib/PublicInbox/CodeSearchIdx.pm
+++ b/lib/PublicInbox/CodeSearchIdx.pm
@@ -1033,7 +1033,8 @@ sub init_associate_prefork ($) {
 	$self->{-pi_cfg} = PublicInbox::Config->new;
 	my @unknown;
 	my @pfx = @{$self->{-opt}->{'associate-prefixes'} // [ 'patchid' ]};
-	for (map { split(/\s*,\s*/) } @pfx) {
+	@pfx = map { split(/\s*,\s*/) } @pfx;
+	for (@pfx) {
 		my $v = $PublicInbox::Search::PATCH_BOOL_COMMON{$_} //
 			push(@unknown, $_);
 		push(@ASSOC_PFX, split(/ /, $v));
@@ -1045,13 +1046,28 @@ EOM
 	my %incl = map {
 		rel2abs_collapsed($_) => undef;
 	} (@{$self->{-opt}->{include} // []});
-	$self->{-pi_cfg}->each_inbox(\&_prep_ibx, $self, \%incl);
+	my $all = $self->{-opt}->{all};
+	if (my $only = $self->{-opt}->{only}) {
+		die <<'' if $all;
+E: --all is incompatible with --only
+
+		$incl{rel2abs_collapsed($_)} = undef for @$only;
+	}
+	unless (keys(%incl)) {
+		$all = 1;
+		warn <<EOM unless $self->{opt}->{quiet};
+# --all implied since no inboxes were specified with --only or --include
+EOM
+	}
+	$self->{-pi_cfg}->each_inbox(\&_prep_ibx, $self, \%incl, $all);
+	my $nr = scalar(@IBX) or die "E: no inboxes to associate\n";
+	progress($self, "will associate $nr inboxes in ",
+			$self->{-pi_cfg}->{-f}, " using: @pfx");
 }
 
 sub _prep_ibx { # each_inbox callback
-	my ($ibx, $self, $incl) = @_;
-	($self->{-opt}->{all} || exists($incl->{$ibx->{inboxdir}})) and
-		push @IBX, $ibx;
+	my ($ibx, $self, $incl, $all) = @_;
+	($all || exists($incl->{$ibx->{inboxdir}})) and push @IBX, $ibx;
 }
 
 sub show_roots { # for diagnostics

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 11/18] cindex: delay associate until prune+indexing finish
  2023-11-13 13:15 [PATCH 00/18] cindex: some --associate work Eric Wong
                   ` (9 preceding siblings ...)
  2023-11-13 13:15 ` [PATCH 10/18] cindex: imply --all with --associate w/o -I/--only Eric Wong
@ 2023-11-13 13:15 ` Eric Wong
  2023-11-13 13:15 ` [PATCH 12/18] xap_helper: Perl dump_ibx respects `-m MAX' Eric Wong
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: Eric Wong @ 2023-11-13 13:15 UTC (permalink / raw)
  To: meta

Prune can get rid of invalid commits while indexing can add new
candidates for association, so we don't dump coderepo roots for
association until those are squared away.  However, we can dump
inbox info since we don't touch inboxes while -cindex is running.
---
 lib/PublicInbox/CidxComm.pm      |  6 ++--
 lib/PublicInbox/CodeSearchIdx.pm | 48 +++++++++++++++++++++-----------
 2 files changed, 35 insertions(+), 19 deletions(-)

diff --git a/lib/PublicInbox/CidxComm.pm b/lib/PublicInbox/CidxComm.pm
index c7ab3c10..80a235e9 100644
--- a/lib/PublicInbox/CidxComm.pm
+++ b/lib/PublicInbox/CidxComm.pm
@@ -13,8 +13,8 @@ use parent qw(PublicInbox::DS);
 use PublicInbox::Syscall qw(EPOLLIN EPOLLONESHOT);
 
 sub new {
-	my ($cls, $rd, $cidx) = @_;
-	my $self = bless { cidx => $cidx }, $cls;
+	my ($cls, $rd, $cidx, $drs) = @_;
+	my $self = bless { cidx => $cidx, drs => $drs }, $cls;
 	$self->SUPER::new($rd, EPOLLIN|EPOLLONESHOT);
 }
 
@@ -22,7 +22,7 @@ sub event_step {
 	my ($self) = @_;
 	my $rd = $self->{sock} // return warn('BUG?: no {sock}');
 	$self->close; # EPOLL_CTL_DEL
-	delete($self->{cidx})->cidx_read_comm($rd);
+	delete($self->{cidx})->cidx_read_comm($rd, delete $self->{drs});
 }
 
 1;
diff --git a/lib/PublicInbox/CodeSearchIdx.pm b/lib/PublicInbox/CodeSearchIdx.pm
index 0bd26af2..04c514fe 100644
--- a/lib/PublicInbox/CodeSearchIdx.pm
+++ b/lib/PublicInbox/CodeSearchIdx.pm
@@ -76,8 +76,8 @@ our (
 	$NPROC,
 	$XHC, # XapClient
 	$REPO_CTX, # current repo being indexed in shards
-	$IDX_TODO, # [ $git0, $root0, $git1, $root1, ...]
-	$GIT_TODO, # [ GIT_DIR0, GIT_DIR1, ...]
+	$IDX_TODO, # PublicInbox::Git object arrayref
+	$GIT_TODO, # PublicInbox::Git object arrayref
 	%ALT_FH, # hexlen => tmp IO for TMPDIR git alternates
 	$TMPDIR, # File::Temp->newdir object for prune
 	@PRUNE_QUEUE, # GIT_DIRs to prepare for pruning
@@ -337,8 +337,10 @@ sub prune_done { # called via prune_do completion
 	return if $DO_QUIT || !$PRUNE_DONE;
 	die "BUG: \$PRUNE_DONE->[$n] already defined" if $PRUNE_DONE->[$n];
 	$PRUNE_DONE->[$n] = 1;
-	grep(defined, @$PRUNE_DONE) == @IDX_SHARDS and
-		progress($self, 'prune done')
+	if (grep(defined, @$PRUNE_DONE) == @IDX_SHARDS) {
+		progress($self, 'prune done');
+		index_next($self); # may kick dump_roots_start
+	}
 }
 
 sub seen ($$) {
@@ -506,10 +508,15 @@ sub assoc_max_init ($) {
 	$max < 0 ? ((2 ** 31) - 1) : $max;
 }
 
+sub start_xhc () {
+	my ($xhc, $pid) = PublicInbox::XapClient::start_helper("-j$NPROC");
+	awaitpid($pid, \&cmd_done, ['xap_helper', "-j$NPROC"]);
+	$xhc;
+}
+
 sub dump_roots_start {
 	my ($self, $associate) = @_;
-	($XHC, my $pid) = PublicInbox::XapClient::start_helper("-j$NPROC");
-	awaitpid($pid, \&cmd_done, ['xap_helper', "-j$NPROC"]);
+	$XHC //= start_xhc;
 	$associate // die 'BUG: no $associate';
 	$TODO{associating} = 1; # keep shards_active() happy
 	progress($self, 'dumping IDs from coderepos');
@@ -559,6 +566,7 @@ EOM
 
 sub dump_ibx_start {
 	my ($self, $associate) = @_;
+	$XHC //= start_xhc;
 	my ($sort_opt, $fold_opt);
 	pipe(local $sort_opt->{0}, $DUMP_IBX_WPIPE);
 	pipe(local $fold_opt->{0}, local $sort_opt->{1});
@@ -581,11 +589,10 @@ sub index_next ($) {
 		fp_start($self, $git, $prep_repo);
 		ct_start($self, $git, $prep_repo);
 	} elsif ($TMPDIR) {
-		return if delete($TODO{dump_roots_start});
+		delete $TODO{dump_roots_start};
 		delete $TODO{dump_ibx_start}; # runs OnDestroy once
 		return dump_ibx($self, shift @IBXQ) if @IBXQ;
 		undef $DUMP_IBX_WPIPE; # done dumping inboxes
-		undef $XHC;
 		delete $TODO{associate};
 	}
 	# else: wait for shards_active (post_loop_do) callback
@@ -604,7 +611,7 @@ sub next_repos { # OnDestroy cb
 }
 
 sub index_done { # OnDestroy cb called when done indexing each code repo
-	my ($repo_ctx) = @_;
+	my ($repo_ctx, $drs) = @_;
 	my ($self, $repo, $active) = @$repo_ctx{qw(self repo active)};
 
 	return if $DO_QUIT;
@@ -615,6 +622,7 @@ sub index_done { # OnDestroy cb called when done indexing each code repo
 	$active->{$n} = undef;
 	my ($c, $p) = PublicInbox::PktOp->pair;
 	$c->{ops}->{repo_stored} = [ $self, $repo_ctx ];
+	$c->{-cidx_dump_roots_start} = $drs if $drs;
 	$IDX_SHARDS[$n]->wq_io_do('store_repo', [ $p->{op_p} ], $repo);
 	# repo_stored will fire once store_repo is done
 }
@@ -638,8 +646,9 @@ sub index_repo { # run_git cb
 	$repo->{git_dir} = $git->{git_dir};
 	my $repo_ctx = $REPO_CTX = { self => $self, repo => $repo };
 	delete $git->{-cidx_gits_fini}; # may fire gits_fini
+	my $drs = delete $git->{-cidx_dump_roots_start};
 	my $index_done = PublicInbox::OnDestroy->new($$, \&index_done,
-							$repo_ctx);
+							$repo_ctx, $drs);
 	my ($c, $p) = PublicInbox::PktOp->pair;
 	$c->{ops}->{shard_done} = [ $self, $repo_ctx, $index_done ];
 	for my $n (0..$#shard_in) {
@@ -738,6 +747,7 @@ EOM
 	@shards;
 }
 
+# called when all git coderepos are done
 sub gits_fini {
 	undef $GITS_NR;
 	PublicInbox::DS::enqueue_reap(); # kick @post_loop_do
@@ -749,6 +759,9 @@ sub scan_git_dirs ($) {
 	$GITS_NR = @$GIT_TODO;
 	my $gits_fini = PublicInbox::OnDestroy->new($$, \&gits_fini);
 	$_->{-cidx_gits_fini} = $gits_fini for @$GIT_TODO;
+	if (my $drs = $TODO{dump_roots_start}) {
+		$_->{-cidx_dump_roots_start} = $drs for @$GIT_TODO;
+	}
 	progress($self, "scanning $GITS_NR code repositories...");
 }
 
@@ -797,7 +810,7 @@ sub kill_shards { $_->wq_kill(@_) for (@IDX_SHARDS) }
 
 sub parent_quit {
 	$DO_QUIT = POSIX->can("SIG$_[0]")->();
-	$XHC = undef;
+	$XHC = 0; # stops the process
 	kill_shards(@_);
 	warn "# SIG$_[0] received, quitting...\n";
 }
@@ -870,6 +883,7 @@ sub cmd_done { # run_await cb for sort, xapian-delve, sed failures
 sub associate {
 	my ($self) = @_;
 	return if $DO_QUIT;
+	$XHC = 0; # should not be recreated again
 	@IDX_SHARDS or return warn("# aborting on no shards\n");
 	unlink("$TMPDIR/root2id");
 	my @pending = keys %{$self->{PENDING}};
@@ -949,7 +963,8 @@ sub init_prune ($) {
 	require_progs('prune', 'xapian-delve' => \@delve, sed => \@sed,
 			comm => \@COMM, awk => \@AWK);
 	for (0..$#IDX_SHARDS) { push @delve, "$self->{xpfx}/$_" }
-	my $run_prune = PublicInbox::OnDestroy->new($$, \&run_prune, $self);
+	my $run_prune = PublicInbox::OnDestroy->new($$, \&run_prune, $self,
+						$TODO{dump_roots_start});
 	my ($sort_opt, $sed_opt, $delve_opt);
 	pipe(local $sed_opt->{0}, local $delve_opt->{1});
 	pipe(local $sort_opt->{0}, local $sed_opt->{1});
@@ -975,7 +990,7 @@ sub dump_git_commits { # run_await cb
 }
 
 sub run_prune { # OnDestroy when `git config extensions.objectFormat' are done
-	my ($self) = @_;
+	my ($self, $drs) = @_;
 	return if $DO_QUIT;
 	# setup the following pipeline: (
 	#	git --git-dir=hexlen40.git cat-file \
@@ -991,7 +1006,7 @@ sub run_prune { # OnDestroy when `git config extensions.objectFormat' are done
 	run_await(\@AWK, $CMD_ENV, $awk_opt, \&cmd_done);
 	run_await([@SORT, '-u'], $CMD_ENV, $sort_opt, \&cmd_done);
 	my $comm_rd = popen_rd(\@COMM, $CMD_ENV, $comm_opt, \&cmd_done, \@COMM);
-	PublicInbox::CidxComm->new($comm_rd, $self); # calls cidx_read_comm
+	PublicInbox::CidxComm->new($comm_rd, $self, $drs); # ->cidx_read_comm
 	my $git_ver = PublicInbox::Git::git_version();
 	push @PRUNE_BATCH, '--buffer' if $git_ver ge v2.6;
 
@@ -1007,7 +1022,7 @@ EOM
 }
 
 sub cidx_read_comm { # via PublicInbox::CidxComm::event_step
-	my ($self, $comm_rd) = @_;
+	my ($self, $comm_rd, $drs) = @_;
 	return if $DO_QUIT;
 	$_->wq_do('prune_init') for @IDX_SHARDS;
 	while (defined(my $cmt = <$comm_rd>)) {
@@ -1022,6 +1037,7 @@ sub cidx_read_comm { # via PublicInbox::CidxComm::event_step
 	}
 	my ($c, $p) = PublicInbox::PktOp->pair;
 	$c->{ops}->{prune_done} = [ $self ];
+	$c->{-cidx_dump_roots_start} = $drs;
 	$_->wq_io_do('prune_commit', [ $p->{op_p} ]) for @IDX_SHARDS;
 }
 
@@ -1103,8 +1119,8 @@ sub show_roots { # for diagnostics
 
 sub do_inits { # called via PublicInbox::DS::add_timer
 	my ($self) = @_;
-	init_prune($self);
 	init_associate_postfork($self);
+	init_prune($self);
 	scan_git_dirs($self) if $self->{-opt}->{scan} // 1;
 	my $max = $TODO{associate} ? max($LIVE_JOBS, $NPROC) : $LIVE_JOBS;
 	index_next($self) for (1..$max);

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 12/18] xap_helper: Perl dump_ibx respects `-m MAX'
  2023-11-13 13:15 [PATCH 00/18] cindex: some --associate work Eric Wong
                   ` (10 preceding siblings ...)
  2023-11-13 13:15 ` [PATCH 11/18] cindex: delay associate until prune+indexing finish Eric Wong
@ 2023-11-13 13:15 ` Eric Wong
  2023-11-13 13:15 ` [PATCH 13/18] cidx_xap_helper_aux: complain about truncated inputs Eric Wong
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: Eric Wong @ 2023-11-13 13:15 UTC (permalink / raw)
  To: meta

The C++ version does, so the Perl/XS version should, too;
even if we intentionally avoid using it right now.
---
 lib/PublicInbox/XapHelper.pm | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/PublicInbox/XapHelper.pm b/lib/PublicInbox/XapHelper.pm
index 1ee918e3..4157600f 100644
--- a/lib/PublicInbox/XapHelper.pm
+++ b/lib/PublicInbox/XapHelper.pm
@@ -78,7 +78,7 @@ sub cmd_dump_ibx {
 	my ($req, $ibx_id, $qry_str) = @_;
 	$qry_str // return warn('usage: dump_ibx [OPTIONS] IBX_ID QRY_STR');
 	$req->{A} or return warn('dump_ibx requires -A PREFIX');
-	my $max = $req->{srch}->{xdb}->get_doccount;
+	my $max = $req->{'m'} // $req->{srch}->{xdb}->get_doccount;
 	my $opt = { relevance => -1, limit => $max, offset => $req->{o} // 0 };
 	$opt->{eidx_key} = $req->{O} if defined $req->{O};
 	my $mset = $req->{srch}->mset($qry_str, $opt);

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 13/18] cidx_xap_helper_aux: complain about truncated inputs
  2023-11-13 13:15 [PATCH 00/18] cindex: some --associate work Eric Wong
                   ` (11 preceding siblings ...)
  2023-11-13 13:15 ` [PATCH 12/18] xap_helper: Perl dump_ibx respects `-m MAX' Eric Wong
@ 2023-11-13 13:15 ` Eric Wong
  2023-11-13 13:15 ` [PATCH 14/18] xap_helper: stricter and harsher error handling Eric Wong
                   ` (4 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: Eric Wong @ 2023-11-13 13:15 UTC (permalink / raw)
  To: meta

This will help us notice bugs and system resource limitations
sooner rather than later.
---
 lib/PublicInbox/CidxXapHelperAux.pm | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/lib/PublicInbox/CidxXapHelperAux.pm b/lib/PublicInbox/CidxXapHelperAux.pm
index f402bde0..91c9b021 100644
--- a/lib/PublicInbox/CidxXapHelperAux.pm
+++ b/lib/PublicInbox/CidxXapHelperAux.pm
@@ -26,7 +26,11 @@ sub event_step {
 	}
 	my $pfx = $self->{pfx};
 	if ($n == 0) {
-		$self->{cidx}->progress("$pfx $buf") if $buf ne '';
+		warn "BUG? $pfx buf=$buf" if $buf ne '';
+		if (delete $self->{cidx}->{PENDING}->{$pfx}) {
+			warn "BUG? $pfx did not get mset.size";
+			$self->{cidx}->index_next;
+		}
 		return $self->close;
 	}
 	my @lines = split(/^/m, $buf);

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 14/18] xap_helper: stricter and harsher error handling
  2023-11-13 13:15 [PATCH 00/18] cindex: some --associate work Eric Wong
                   ` (12 preceding siblings ...)
  2023-11-13 13:15 ` [PATCH 13/18] cidx_xap_helper_aux: complain about truncated inputs Eric Wong
@ 2023-11-13 13:15 ` Eric Wong
  2023-11-13 13:15 ` [PATCH 15/18] xap_helper: better variable naming for key buffer Eric Wong
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: Eric Wong @ 2023-11-13 13:15 UTC (permalink / raw)
  To: meta

We'll require an error stream for dump_ibx and dump_roots
commands; they're too important to ignore.  Instead of writing
code to provide diagnostics for errors, rely on abort(3) and the
-ggdb3 compiler flag to generate nice core dumps for gdb since
all commands sent to xap_helper are from internal users.
We'll even abort on most usage errors since they could be
bugs in split2argv or our use of getopt(3).

We'll also just exit on ENOMEM errors since it's the easiest way
to recover from those errors by starting a new process which
closes all open Xapian DB handles.
---
 lib/PublicInbox/XapHelper.pm |  28 ++---
 lib/PublicInbox/xap_helper.h | 202 ++++++++++++++---------------------
 t/xap_helper.t               |   5 +-
 3 files changed, 95 insertions(+), 140 deletions(-)

diff --git a/lib/PublicInbox/XapHelper.pm b/lib/PublicInbox/XapHelper.pm
index 4157600f..428b732e 100644
--- a/lib/PublicInbox/XapHelper.pm
+++ b/lib/PublicInbox/XapHelper.pm
@@ -16,6 +16,7 @@ use PublicInbox::DS qw(awaitpid);
 use autodie qw(open getsockopt);
 use POSIX qw(:signal_h);
 use Fcntl qw(LOCK_UN LOCK_EX);
+use Carp qw(croak);
 my $X = \%PublicInbox::Search::X;
 our (%SRCH, %WORKERS, $nworker, $workerset, $in);
 our $stderr = \*STDERR;
@@ -70,14 +71,14 @@ sub dump_ibx_iter ($$$) {
 
 sub emit_mset_stats ($$) {
 	my ($req, $mset) = @_;
-	my $err = $req->{1} or return;
+	my $err = $req->{1} or croak "BUG: caller only passed 1 FD";
 	say $err 'mset.size='.$mset->size.' nr_out='.$req->{nr_out}
 }
 
 sub cmd_dump_ibx {
 	my ($req, $ibx_id, $qry_str) = @_;
-	$qry_str // return warn('usage: dump_ibx [OPTIONS] IBX_ID QRY_STR');
-	$req->{A} or return warn('dump_ibx requires -A PREFIX');
+	$qry_str // die 'usage: dump_ibx [OPTIONS] IBX_ID QRY_STR';
+	$req->{A} or die 'dump_ibx requires -A PREFIX';
 	my $max = $req->{'m'} // $req->{srch}->{xdb}->get_doccount;
 	my $opt = { relevance => -1, limit => $max, offset => $req->{o} // 0 };
 	$opt->{eidx_key} = $req->{O} if defined $req->{O};
@@ -88,9 +89,7 @@ sub cmd_dump_ibx {
 			$t = dump_ibx_iter($req, $ibx_id, $it) // $t;
 		}
 	}
-	if (my $err = $req->{1}) {
-		say $err 'mset.size='.$mset->size.' nr_out='.$req->{nr_out}
-	}
+	emit_mset_stats($req, $mset);
 }
 
 sub dump_roots_iter ($$$) {
@@ -120,9 +119,8 @@ sub dump_roots_flush ($$) {
 
 sub cmd_dump_roots {
 	my ($req, $root2id_file, $qry_str) = @_;
-	$qry_str // return
-		warn('usage: dump_roots [OPTIONS] ROOT2ID_FILE QRY_STR');
-	$req->{A} or return warn('dump_roots requires -A PREFIX');
+	$qry_str // die 'usage: dump_roots [OPTIONS] ROOT2ID_FILE QRY_STR';
+	$req->{A} or die 'dump_roots requires -A PREFIX';
 	open my $fh, '<', $root2id_file;
 	my $root2id; # record format: $OIDHEX "\0" uint32_t
 	my @x = split(/\0/, read_all $fh);
@@ -150,7 +148,7 @@ sub dispatch {
 	my ($req, $cmd, @argv) = @_;
 	my $fn = $req->can("cmd_$cmd") or return;
 	$GLP->getoptionsfromarray(\@argv, $req, @SPEC) or return;
-	my $dirs = delete $req->{d} or return warn 'no -d args';
+	my $dirs = delete $req->{d} or die 'no -d args';
 	my $key = join("\0", @$dirs);
 	$req->{srch} = $SRCH{$key} //= do {
 		my $new = { qp_flags => $PublicInbox::Search::QP_FLAGS };
@@ -168,8 +166,7 @@ sub dispatch {
 		$new->{qp} = $new->qparse_new;
 		$new;
 	};
-	eval { $fn->($req, @argv) };
-	warn "E: $@" if $@;
+	$fn->($req, @argv);
 }
 
 sub recv_loop {
@@ -192,13 +189,10 @@ sub recv_loop {
 		my $i = 0;
 		open($req->{$i++}, '+<&=', $_) for @fds;
 		local $stderr = $req->{1} // \*STDERR;
-		if (chop($rbuf) ne "\0") {
-			warn "not NUL-terminated";
-			next;
-		}
+		die "not NUL-terminated" if chop($rbuf) ne "\0";
 		my @argv = split(/\0/, $rbuf);
 		$req->{nr_out} = 0;
-		eval { $req->dispatch(@argv) } if @argv;
+		$req->dispatch(@argv) if @argv;
 	}
 }
 
diff --git a/lib/PublicInbox/xap_helper.h b/lib/PublicInbox/xap_helper.h
index 8602170f..1380ffd0 100644
--- a/lib/PublicInbox/xap_helper.h
+++ b/lib/PublicInbox/xap_helper.h
@@ -78,6 +78,10 @@
 	assert(ckvar______ == (expect) && "BUG" && __FILE__ && __LINE__); \
 } while (0)
 
+// coredump on most usage errors since our only users are internal
+#define ABORT(...) do { warnx(__VA_ARGS__); abort(); } while (0)
+#define EABORT(...) do { warn(__VA_ARGS__); abort(); } while (0)
+
 // sock_fd is modified in signal handler, yes, it's SOCK_SEQPACKET
 static volatile int sock_fd = STDIN_FILENO;
 static sigset_t fullset, workerset;
@@ -145,10 +149,8 @@ struct worker {
 #define SPLIT2ARGV(dst,buf,len) split2argv(dst,buf,len,MY_ARRAY_SIZE(dst))
 static size_t split2argv(char **dst, char *buf, size_t len, size_t limit)
 {
-	if (buf[0] == 0 || len == 0 || buf[len - 1] != 0) {
-		warnx("bogus argument given");
-		return 0;
-	}
+	if (buf[0] == 0 || len == 0 || buf[len - 1] != 0)
+		ABORT("bogus argument given");
 	size_t nr = 0;
 	char *c = buf;
 	for (size_t i = 1; i < len; i++) {
@@ -156,11 +158,11 @@ static size_t split2argv(char **dst, char *buf, size_t len, size_t limit)
 			dst[nr++] = c;
 			c = buf + i + 1;
 		}
-		if (nr == limit) {
-			warnx("too many args: %zu", nr);
-			return 0;
-		}
+		if (nr == limit)
+			ABORT("too many args: %zu == %zu", nr, limit);
 	}
+	if (nr == 0) ABORT("no argument given");
+	if ((long)nr < 0) ABORT("too many args: %zu", nr);
 	return (long)nr;
 }
 
@@ -242,6 +244,8 @@ static void emit_mset_stats(struct req *req, const Xapian::MSet *mset)
 	if (req->fp[1])
 		fprintf(req->fp[1], "mset.size=%llu nr_out=%zu\n",
 			(unsigned long long)mset->size(), req->nr_out);
+	else
+		ABORT("BUG: %s caller only passed 1 FD", req->argv[0]);
 }
 
 static bool starts_with(const std::string *s, const char *pfx, size_t pfx_len)
@@ -290,20 +294,14 @@ static enum exc_iter dump_ibx_iter(struct req *req, const char *ibx_id,
 
 static bool cmd_dump_ibx(struct req *req)
 {
-	if ((optind + 1) >= req->argc) {
-		warnx("usage: dump_ibx [OPTIONS] IBX_ID QRY_STR");
-		return false; // need ibx_id + qry_str
-	}
-	if (!req->pfxc) {
-		warnx("dump_ibx requires -A PREFIX");
-		return false;
-	}
+	if ((optind + 1) >= req->argc)
+		ABORT("usage: dump_ibx [OPTIONS] IBX_ID QRY_STR");
+	if (!req->pfxc)
+		ABORT("dump_ibx requires -A PREFIX");
 
 	const char *ibx_id = req->argv[optind];
-	if (my_setlinebuf(req->fp[0])) { // for sort(1) pipe
-		perror("setlinebuf(fp[0])");
-		return false;
-	}
+	if (my_setlinebuf(req->fp[0])) // for sort(1) pipe
+		EABORT("setlinebuf(fp[0])"); // WTF?
 	req->asc = true;
 	req->sort_col = -1;
 	Xapian::MSet mset = mail_mset(req, req->argv[optind + 1]);
@@ -344,24 +342,22 @@ static void fbuf_ensure(void *ptr)
 {
 	struct fbuf *fbuf = (struct fbuf *)ptr;
 	if (fbuf->fp && fclose(fbuf->fp))
-		perror("fclose(fbuf->fp)");
+		err(EXIT_FAILURE, "fclose(fbuf->fp)"); // ENOMEM?
 	fbuf->fp = NULL;
 	free(fbuf->ptr);
 }
 
-static bool fbuf_init(struct fbuf *fbuf)
+static void fbuf_init(struct fbuf *fbuf)
 {
 	assert(!fbuf->ptr);
 	fbuf->fp = open_memstream(&fbuf->ptr, &fbuf->len);
-	if (fbuf->fp) return true;
-	perror("open_memstream(fbuf)");
-	return false;
+	if (!fbuf->fp) err(EXIT_FAILURE, "open_memstream(fbuf)");
 }
 
 static void xclose(int fd)
 {
 	if (close(fd) < 0 && errno != EINTR)
-		err(EXIT_FAILURE, "BUG: close");
+		EABORT("BUG: close");
 }
 
 #define CLEANUP_DUMP_ROOTS __attribute__((__cleanup__(dump_roots_ensure)))
@@ -372,19 +368,17 @@ static void dump_roots_ensure(void *ptr)
 		xclose(drt->root2id_fd);
 	hdestroy(); // idempotent
 	if (drt->mm_ptr && munmap(drt->mm_ptr, drt->sb.st_size))
-		err(EXIT_FAILURE, "BUG: munmap");
+		EABORT("BUG: munmap(%p, %zu)", drt->mm_ptr, drt->sb.st_size);
 	free(drt->entries);
 	fbuf_ensure(&drt->wbuf);
 }
 
 static bool root2ids_str(struct fbuf *root_ids, Xapian::Document *doc)
 {
-	if (!fbuf_init(root_ids)) return false;
-
-	bool ok = true;
 	Xapian::TermIterator cur = doc->termlist_begin();
 	Xapian::TermIterator end = doc->termlist_end();
 	ENTRY e, *ep;
+	fbuf_init(root_ids);
 	for (cur.skip_to("G"); cur != end; cur++) {
 		std::string tn = *cur;
 		if (!starts_with(&tn, "G", 1))
@@ -393,21 +387,16 @@ static bool root2ids_str(struct fbuf *root_ids, Xapian::Document *doc)
 		u.in = tn.c_str() + 1;
 		e.key = u.out;
 		ep = hsearch(e, FIND);
-		if (!ep) {
-			warnx("hsearch miss `%s'", e.key);
-			return false;
-		}
+		if (!ep) ABORT("hsearch miss `%s'", e.key);
 		// ep->data is a NUL-terminated string matching /[0-9]+/
 		fputc(' ', root_ids->fp);
 		fputs((const char *)ep->data, root_ids->fp);
 	}
 	fputc('\n', root_ids->fp);
-	if (ferror(root_ids->fp) | fclose(root_ids->fp)) {
-		perror("ferror|fclose(root_ids)");
-		ok = false;
-	}
+	if (ferror(root_ids->fp) | fclose(root_ids->fp))
+		err(EXIT_FAILURE, "ferror|fclose(root_ids)"); // ENOMEM
 	root_ids->fp = NULL;
-	return ok;
+	return true;
 }
 
 // writes term values matching @pfx for a given @doc, ending the line
@@ -440,20 +429,17 @@ static bool dump_roots_flush(struct req *req, struct dump_roots_tmp *drt)
 	bool ok = true;
 
 	if (!drt->wbuf.fp) return true;
-	if (fd < 0) err(EXIT_FAILURE, "BUG: fileno");
-	if (fclose(drt->wbuf.fp)) {
-		warn("fclose(drt->wbuf.fp)"); // malloc failure?
-		return false;
-	}
+	if (fd < 0) EABORT("BUG: fileno");
+	if (ferror(drt->wbuf.fp) | fclose(drt->wbuf.fp)) // ENOMEM?
+		err(EXIT_FAILURE, "ferror|fclose(drt->wbuf.fp)");
 	drt->wbuf.fp = NULL;
 	if (!drt->wbuf.len) goto done_free;
 	while (flock(drt->root2id_fd, LOCK_EX)) {
 		if (errno == EINTR) continue;
-		perror("LOCK_EX");
-		return false;
+		err(EXIT_FAILURE, "LOCK_EX"); // ENOLCK?
 	}
 	p = drt->wbuf.ptr;
-	do {
+	do { // write to client FD
 		ssize_t n = write(fd, p, drt->wbuf.len);
 		if (n > 0) {
 			drt->wbuf.len -= n;
@@ -465,10 +451,9 @@ static bool dump_roots_flush(struct req *req, struct dump_roots_tmp *drt)
 	} while (drt->wbuf.len);
 	while (flock(drt->root2id_fd, LOCK_UN)) {
 		if (errno == EINTR) continue;
-		perror("LOCK_UN");
-		return false;
+		err(EXIT_FAILURE, "LOCK_UN"); // ENOLCK?
 	}
-done_free:
+done_free: // OK to skip on errors, dump_roots_ensure calls fbuf_ensure
 	free(drt->wbuf.ptr);
 	drt->wbuf.ptr = NULL;
 	return ok;
@@ -501,7 +486,7 @@ static char *hsearch_enter_key(char *s)
 	// hdestroy frees each key on some platforms,
 	// so give it something to free:
 	char *ret = strdup(s);
-	if (!ret) perror("strdup");
+	if (!ret) err(EXIT_FAILURE, "strdup");
 	return ret;
 // AFAIK there's no way to detect musl, assume non-glibc Linux is musl:
 #elif defined(__GLIBC__) || defined(__linux__) || \
@@ -518,61 +503,42 @@ static bool cmd_dump_roots(struct req *req)
 {
 	CLEANUP_DUMP_ROOTS struct dump_roots_tmp drt = {};
 	drt.root2id_fd = -1;
-	if ((optind + 1) >= req->argc) {
-		warnx("usage: dump_roots [OPTIONS] ROOT2ID_FILE QRY_STR");
-		return false; // need file + qry_str
-	}
-	if (!req->pfxc) {
-		warnx("dump_roots requires -A PREFIX");
-		return false;
-	}
+	if ((optind + 1) >= req->argc)
+		ABORT("usage: dump_roots [OPTIONS] ROOT2ID_FILE QRY_STR");
+	if (!req->pfxc)
+		ABORT("dump_roots requires -A PREFIX");
 	const char *root2id_file = req->argv[optind];
 	drt.root2id_fd = open(root2id_file, O_RDONLY);
-	if (drt.root2id_fd < 0) {
-		warn("open(%s)", root2id_file);
-		return false;
-	}
-	if (fstat(drt.root2id_fd, &drt.sb)) {
-		warn("fstat(%s)", root2id_file);
-		return false;
-	}
+	if (drt.root2id_fd < 0)
+		EABORT("open(%s)", root2id_file);
+	if (fstat(drt.root2id_fd, &drt.sb)) // ENOMEM?
+		err(EXIT_FAILURE, "fstat(%s)", root2id_file);
 	// each entry is at least 43 bytes ({OIDHEX}\0{INT}\0),
 	// so /32 overestimates the number of expected entries by
 	// ~%25 (as recommended by Linux hcreate(3) manpage)
 	size_t est = (drt.sb.st_size / 32) + 1; //+1 for "\0" termination
-	if ((uint64_t)drt.sb.st_size > (uint64_t)SIZE_MAX) {
-		warnx("%s size too big (%lld bytes > %zu)", root2id_file,
-			(long long)drt.sb.st_size, SIZE_MAX);
-		return false;
-	}
+	if ((uint64_t)drt.sb.st_size > (uint64_t)SIZE_MAX)
+		err(EXIT_FAILURE, "%s size too big (%lld bytes > %zu)",
+			root2id_file, (long long)drt.sb.st_size, SIZE_MAX);
 	drt.mm_ptr = mmap(NULL, drt.sb.st_size, PROT_READ,
 				MAP_PRIVATE, drt.root2id_fd, 0);
-	if (drt.mm_ptr == MAP_FAILED) {
-		warn("mmap(%s)", root2id_file);
-		return false;
-	}
+	if (drt.mm_ptr == MAP_FAILED)
+		err(EXIT_FAILURE, "mmap(%zu, %s)", drt.sb.st_size,root2id_file);
 	drt.entries = (char **)calloc(est * 2, sizeof(char *));
-	if (!drt.entries) {
-		warn("calloc(%zu * 2, %zu)", est, sizeof(char *));
-		return false;
-	}
+	if (!drt.entries)
+		err(EXIT_FAILURE, "calloc(%zu * 2, %zu)", est, sizeof(char *));
 	size_t tot = split2argv(drt.entries, (char *)drt.mm_ptr,
 				drt.sb.st_size, est * 2);
 	if (tot <= 0) return false; // split2argv already warned on error
-	if (!hcreate(est)) {
-		warn("hcreate(%zu)", est);
-		return false;
-	}
+	if (!hcreate(est))
+		err(EXIT_FAILURE, "hcreate(%zu)", est);
 	for (size_t i = 0; i < tot; ) {
 		ENTRY e;
-		e.key = hsearch_enter_key(drt.entries[i++]);
-		if (!e.key) return false;
+		e.key = hsearch_enter_key(drt.entries[i++]); // dies on ENOMEM
 		e.data = drt.entries[i++];
-		if (!hsearch(e, ENTER)) {
-			warn("hsearch(%s => %s, ENTER)", e.key,
-				(const char *)e.data);
-			return false;
-		}
+		if (!hsearch(e, ENTER))
+			err(EXIT_FAILURE, "hsearch(%s => %s, ENTER)", e.key,
+					(const char *)e.data);
 	}
 	req->asc = true;
 	req->sort_col = -1;
@@ -581,8 +547,8 @@ static bool cmd_dump_roots(struct req *req)
 	// @UNIQ_FOLD in CodeSearchIdx.pm can handle duplicate lines fine
 	// in case we need to retry on DB reopens
 	for (Xapian::MSetIterator i = mset.begin(); i != mset.end(); i++) {
-		if (!drt.wbuf.fp && !fbuf_init(&drt.wbuf))
-			return false;
+		if (!drt.wbuf.fp)
+			fbuf_init(&drt.wbuf);
 		for (int t = 10; t > 0; --t)
 			switch (dump_roots_iter(req, &drt, &i)) {
 			case ITER_OK: t = 0; break; // leave inner loop
@@ -672,6 +638,8 @@ again:
 
 	// success! no signals for the rest of the request/response cycle
 	CHECK(int, 0, sigprocmask(SIG_SETMASK, &fullset, NULL));
+	if (r > 0 && msg.msg_flags)
+		ABORT("unexpected msg_flags");
 
 	*len = r;
 	if (cmsg.hdr.cmsg_level == SOL_SOCKET &&
@@ -806,7 +774,7 @@ static void dispatch(struct req *req)
 			break;
 		}
 	}
-	if (!req->fn) goto cmd_err;
+	if (!req->fn) ABORT("not handled: `%s'", req->argv[0]);
 
 	kfp = open_memstream(&fbuf.ptr, &size);
 	// write padding, first
@@ -825,53 +793,48 @@ static void dispatch(struct req *req)
 		case 'd': fwrite(optarg, strlen(optarg) + 1, 1, kfp); break;
 		case 'k':
 			req->sort_col = strtol(optarg, &end, 10);
-			if (*end) goto cmd_err;
+			if (*end) ABORT("-k %s", optarg);
 			switch (req->sort_col) {
-			case LONG_MAX: case LONG_MIN: goto cmd_err;
+			case LONG_MAX: case LONG_MIN: ABORT("-k %s", optarg);
 			}
 			break;
 		case 'm':
 			req->max = strtoull(optarg, &end, 10);
-			if (*end) goto cmd_err;
-			if (req->max == ULLONG_MAX) goto cmd_err;
+			if (*end || req->max == ULLONG_MAX)
+				ABORT("-m %s", optarg);
 			break;
 		case 'o':
 			req->off = strtoull(optarg, &end, 10);
-			if (*end) goto cmd_err;
-			if (req->off == ULLONG_MAX) goto cmd_err;
+			if (*end || req->off == ULLONG_MAX)
+				ABORT("-o %s", optarg);
 			break;
 		case 'r': req->relevance = true; break;
 		case 't': req->collapse_threads = true; break;
 		case 'A':
 			req->pfxv[req->pfxc++] = optarg;
-			if (MY_ARG_MAX == req->pfxc) goto cmd_err;
+			if (MY_ARG_MAX == req->pfxc)
+				ABORT("too many -A");
 			break;
 		case 'O': req->Oeidx_key = optarg - 1; break; // pad "O" prefix
 		case 'T':
 			req->timeout_sec = strtoul(optarg, &end, 10);
-			if (*end) goto cmd_err;
-			if (req->timeout_sec == ULONG_MAX) goto cmd_err;
+			if (*end || req->timeout_sec == ULONG_MAX)
+				ABORT("-T %s", optarg);
 			break;
-		default: goto cmd_err;
+		default: ABORT("bad switch `-%c'", c);
 		}
 	}
-	if (ferror(kfp) | fclose(kfp)) {
-		perror("ferror|fclose");
-		goto cmd_err;
-	}
+	if (ferror(kfp) | fclose(kfp))
+		err(EXIT_FAILURE, "ferror|fclose"); // likely ENOMEM
 	fbuf.srch->db = NULL;
 	fbuf.srch->qp = NULL;
 	fbuf.srch->paths_len = size - offsetof(struct srch, paths);
 	if (fbuf.srch->paths_len <= 0) {
 		free_srch(fbuf.srch);
-		warnx("no -d args");
-		goto cmd_err;
+		ABORT("no -d args");
 	}
 	s = (struct srch **)tsearch(fbuf.srch, &srch_tree, srch_cmp);
-	if (!s) {
-		perror("tsearch");
-		goto cmd_err;
-	}
+	if (!s) err(EXIT_FAILURE, "tsearch"); // likely ENOMEM
 	req->srch = *s;
 	if (req->srch != fbuf.srch) { // reuse existing
 		free_srch(fbuf.srch);
@@ -881,11 +844,11 @@ static void dispatch(struct req *req)
 		void *del = tdelete(fbuf.srch, &srch_tree, srch_cmp);
 		assert(del);
 		free_srch(fbuf.srch);
-		goto cmd_err;
+		goto cmd_err; // srch_init already warned
 	}
 	try {
 		if (!req->fn(req))
-			goto cmd_err;
+			warnx("`%s' failed", req->argv[0]);
 	} catch (const Xapian::Error & e) {
 		warnx("Xapian::Error: %s", e.get_description().c_str());
 	} catch (...) {
@@ -924,7 +887,7 @@ static void stderr_restore(FILE *tmp_err)
 	return;
 #endif
 	if (ferror(stderr) | fflush(stderr))
-		perror("ferror|fflush stderr");
+		err(EXIT_FAILURE, "ferror|fflush stderr");
 	while (dup2(orig_err_fd, STDERR_FILENO) < 0) {
 		if (errno != EINTR)
 			err(EXIT_FAILURE, "dup2(%d => 2)", orig_err_fd);
@@ -953,8 +916,7 @@ static void recv_loop(void) // worker process loop
 		if (req.fp[1])
 			stderr_set(req.fp[1]);
 		req.argc = (int)SPLIT2ARGV(req.argv, rbuf, len);
-		if (req.argc > 0)
-			dispatch(&req);
+		dispatch(&req);
 		if (ferror(req.fp[0]) | fclose(req.fp[0]))
 			perror("ferror|fclose fp[0]");
 		if (req.fp[1]) {
@@ -1066,7 +1028,7 @@ static void reaped_worker(pid_t pid, int st)
 	if (WIFEXITED(st) && WEXITSTATUS(st) == EX_NOINPUT)
 		alive = false;
 	else if (st)
-		warnx("worker[%lu] died $?=%d", nr, st);
+		warnx("worker[%lu] died $?=%d alive=%d", nr, st, (int)alive);
 	if (alive)
 		start_workers();
 }
diff --git a/t/xap_helper.t b/t/xap_helper.t
index 83f59d7d..9e0b234d 100644
--- a/t/xap_helper.t
+++ b/t/xap_helper.t
@@ -47,11 +47,10 @@ is(scalar(@int), 1, 'have 1 internal shard') or diag explain(\@int);
 
 my $doreq = sub {
 	my ($s, @arg) = @_;
-	my $err = pop @arg if ref($arg[-1]);
+	my $err = ref($arg[-1]) ? pop(@arg) : \*STDERR;
 	pipe(my $x, my $y);
 	my $buf = join("\0", @arg, '');
-	my @fds = fileno($y);
-	push @fds, fileno($err) if $err;
+	my @fds = (fileno($y), fileno($err));
 	my $n = $PublicInbox::IPC::send_cmd->($s, \@fds, $buf, 0) //
 		xbail "send: $!";
 	my $exp = length($buf);

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 15/18] xap_helper: better variable naming for key buffer
  2023-11-13 13:15 [PATCH 00/18] cindex: some --associate work Eric Wong
                   ` (13 preceding siblings ...)
  2023-11-13 13:15 ` [PATCH 14/18] xap_helper: stricter and harsher error handling Eric Wong
@ 2023-11-13 13:15 ` Eric Wong
  2023-11-13 13:15 ` [PATCH 16/18] cindex: do not guess integer maximum for Xapian Eric Wong
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: Eric Wong @ 2023-11-13 13:15 UTC (permalink / raw)
  To: meta

We'll use `kbuf' for the search object key, since we already use
the `fbuf' term in `struct fbuf'.  This also adds an extra check
for open_memstream(3) failures in case of ENOMEM.
---
 lib/PublicInbox/xap_helper.h | 33 ++++++++++++++++-----------------
 1 file changed, 16 insertions(+), 17 deletions(-)

diff --git a/lib/PublicInbox/xap_helper.h b/lib/PublicInbox/xap_helper.h
index 1380ffd0..0a652abd 100644
--- a/lib/PublicInbox/xap_helper.h
+++ b/lib/PublicInbox/xap_helper.h
@@ -762,7 +762,7 @@ static void dispatch(struct req *req)
 	union {
 		struct srch *srch;
 		char *ptr;
-	} fbuf;
+	} kbuf;
 	char *end;
 	FILE *kfp;
 	struct srch **s;
@@ -776,8 +776,9 @@ static void dispatch(struct req *req)
 	}
 	if (!req->fn) ABORT("not handled: `%s'", req->argv[0]);
 
-	kfp = open_memstream(&fbuf.ptr, &size);
-	// write padding, first
+	kfp = open_memstream(&kbuf.ptr, &size);
+	if (!kfp) err(EXIT_FAILURE, "open_memstream(kbuf)");
+	// write padding, first (contents don't matter)
 	fwrite(&req->argv[0], offsetof(struct srch, paths), 1, kfp);
 
 	// global getopt variables:
@@ -824,26 +825,24 @@ static void dispatch(struct req *req)
 		default: ABORT("bad switch `-%c'", c);
 		}
 	}
-	if (ferror(kfp) | fclose(kfp))
+	if (ferror(kfp) | fclose(kfp)) /* sets kbuf.srch */
 		err(EXIT_FAILURE, "ferror|fclose"); // likely ENOMEM
-	fbuf.srch->db = NULL;
-	fbuf.srch->qp = NULL;
-	fbuf.srch->paths_len = size - offsetof(struct srch, paths);
-	if (fbuf.srch->paths_len <= 0) {
-		free_srch(fbuf.srch);
+	kbuf.srch->db = NULL;
+	kbuf.srch->qp = NULL;
+	kbuf.srch->paths_len = size - offsetof(struct srch, paths);
+	if (kbuf.srch->paths_len <= 0)
 		ABORT("no -d args");
-	}
-	s = (struct srch **)tsearch(fbuf.srch, &srch_tree, srch_cmp);
+	s = (struct srch **)tsearch(kbuf.srch, &srch_tree, srch_cmp);
 	if (!s) err(EXIT_FAILURE, "tsearch"); // likely ENOMEM
 	req->srch = *s;
-	if (req->srch != fbuf.srch) { // reuse existing
-		free_srch(fbuf.srch);
+	if (req->srch != kbuf.srch) { // reuse existing
+		free_srch(kbuf.srch);
 	} else if (!srch_init(req)) {
-		assert(fbuf.srch == *((struct srch **)tfind(
-					fbuf.srch, &srch_tree, srch_cmp)));
-		void *del = tdelete(fbuf.srch, &srch_tree, srch_cmp);
+		assert(kbuf.srch == *((struct srch **)tfind(
+					kbuf.srch, &srch_tree, srch_cmp)));
+		void *del = tdelete(kbuf.srch, &srch_tree, srch_cmp);
 		assert(del);
-		free_srch(fbuf.srch);
+		free_srch(kbuf.srch);
 		goto cmd_err; // srch_init already warned
 	}
 	try {

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 16/18] cindex: do not guess integer maximum for Xapian
  2023-11-13 13:15 [PATCH 00/18] cindex: some --associate work Eric Wong
                   ` (14 preceding siblings ...)
  2023-11-13 13:15 ` [PATCH 15/18] xap_helper: better variable naming for key buffer Eric Wong
@ 2023-11-13 13:15 ` Eric Wong
  2023-11-13 13:15 ` [PATCH 17/18] cindex: rename associate-max => window Eric Wong
  2023-11-13 13:15 ` [PATCH 18/18] cindex: support --associate-aggressive shortcut Eric Wong
  17 siblings, 0 replies; 19+ messages in thread
From: Eric Wong @ 2023-11-13 13:15 UTC (permalink / raw)
  To: meta

We can return an array to allow the caller to omit the internal
`-m' arg entirely.  We'll also allow any non-positive values to
mean there's no limit; and we'll defer the "unlimited" case to
the XapHelper implementation.  This frees us of having to deal
with mismatches between Perl and Xapian if Xapian was compiled
with 64-bit docid support and we're stuck on a 32-bit Perl
build.
---
 lib/PublicInbox/CodeSearchIdx.pm | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/lib/PublicInbox/CodeSearchIdx.pm b/lib/PublicInbox/CodeSearchIdx.pm
index 04c514fe..8e6b921d 100644
--- a/lib/PublicInbox/CodeSearchIdx.pm
+++ b/lib/PublicInbox/CodeSearchIdx.pm
@@ -501,11 +501,10 @@ sub shard_commit { # via wq_io_do
 	send($op_p, "shard_done $self->{shard}", 0);
 }
 
-sub assoc_max_init ($) {
+sub assoc_max_args ($) {
 	my ($self) = @_;
 	my $max = $self->{-opt}->{'associate-max'} // $ASSOC_MAX;
-	$max = $ASSOC_MAX if !$max;
-	$max < 0 ? ((2 ** 31) - 1) : $max;
+	$max <= 0 ? () : ('-m', $max);
 }
 
 sub start_xhc () {
@@ -538,7 +537,7 @@ sub dump_roots_start {
 	run_await(\@sort, $CMD_ENV, $sort_opt, \&cmd_done, $associate);
 	run_await(\@UNIQ_FOLD, $fold_env, $fold_opt, \&cmd_done, $associate);
 	my @arg = ((map { ('-A', $_) } @ASSOC_PFX), '-c',
-		'-m', assoc_max_init($self), $root2id, $QRY_STR);
+		assoc_max_args($self), $root2id, $QRY_STR);
 	for my $d ($self->shard_dirs) {
 		pipe(my $err_r, my $err_w);
 		$XHC->mkreq([$sort_w, $err_w], qw(dump_roots -d), $d, @arg);
@@ -556,6 +555,8 @@ sub dump_ibx { # sends to xap_helper.h
 	my $srch = $ibx->isrch or return warn <<EOM;
 W: $ekey not indexed for search
 EOM
+	# note: we don't send associate_max_args to dump_ibx since we
+	# have to post-filter non-patch messages
 	my @cmd = ('dump_ibx', $srch->xh_args,
 			(map { ('-A', $_) } @ASSOC_PFX), $ibx_id, $QRY_STR);
 	pipe(my $r, my $w);

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 17/18] cindex: rename associate-max => window
  2023-11-13 13:15 [PATCH 00/18] cindex: some --associate work Eric Wong
                   ` (15 preceding siblings ...)
  2023-11-13 13:15 ` [PATCH 16/18] cindex: do not guess integer maximum for Xapian Eric Wong
@ 2023-11-13 13:15 ` Eric Wong
  2023-11-13 13:15 ` [PATCH 18/18] cindex: support --associate-aggressive shortcut Eric Wong
  17 siblings, 0 replies; 19+ messages in thread
From: Eric Wong @ 2023-11-13 13:15 UTC (permalink / raw)
  To: meta

"window" is probably a better term since it's an inexact thing
to match on.
---
 lib/PublicInbox/CodeSearchIdx.pm | 10 +++++-----
 script/public-inbox-cindex       |  2 +-
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/lib/PublicInbox/CodeSearchIdx.pm b/lib/PublicInbox/CodeSearchIdx.pm
index 8e6b921d..54ddb68e 100644
--- a/lib/PublicInbox/CodeSearchIdx.pm
+++ b/lib/PublicInbox/CodeSearchIdx.pm
@@ -97,7 +97,7 @@ our (
 our $SEEN_MAX = 100000;
 
 # window for commits/emails to determine a inbox <-> coderepo association
-my $ASSOC_MAX = 50000;
+my $ASSOC_WINDOW = 50000;
 
 our @PRUNE_BATCH = qw(git _ cat-file --batch-all-objects --batch-check);
 
@@ -501,10 +501,10 @@ sub shard_commit { # via wq_io_do
 	send($op_p, "shard_done $self->{shard}", 0);
 }
 
-sub assoc_max_args ($) {
+sub assoc_window_args ($) {
 	my ($self) = @_;
-	my $max = $self->{-opt}->{'associate-max'} // $ASSOC_MAX;
-	$max <= 0 ? () : ('-m', $max);
+	my $n = $self->{-opt}->{'associate-window'} // $ASSOC_WINDOW;
+	$n <= 0 ? () : ('-m', $n);
 }
 
 sub start_xhc () {
@@ -537,7 +537,7 @@ sub dump_roots_start {
 	run_await(\@sort, $CMD_ENV, $sort_opt, \&cmd_done, $associate);
 	run_await(\@UNIQ_FOLD, $fold_env, $fold_opt, \&cmd_done, $associate);
 	my @arg = ((map { ('-A', $_) } @ASSOC_PFX), '-c',
-		assoc_max_args($self), $root2id, $QRY_STR);
+		assoc_window_args($self), $root2id, $QRY_STR);
 	for my $d ($self->shard_dirs) {
 		pipe(my $err_r, my $err_w);
 		$XHC->mkreq([$sort_w, $err_w], qw(dump_roots -d), $d, @arg);
diff --git a/script/public-inbox-cindex b/script/public-inbox-cindex
index b5fe9cf8..b8133806 100755
--- a/script/public-inbox-cindex
+++ b/script/public-inbox-cindex
@@ -26,7 +26,7 @@ See public-inbox-cindex(1) man page for full documentation.
 EOF
 my $opt = { fsync => 1, scan => 1 }; # --no-scan is hidden
 GetOptions($opt, qw(quiet|q verbose|v+ reindex jobs|j=i fsync|sync! dangerous
-		indexlevel|index-level|L=s associate associate-max=i
+		indexlevel|index-level|L=s associate associate-window=i
 		associate-date-range=s associate-prefixes=s@
 		batch_size|batch-size=s max_size|max-size=s
 		include|I=s@ only=s@ all show-roots

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 18/18] cindex: support --associate-aggressive shortcut
  2023-11-13 13:15 [PATCH 00/18] cindex: some --associate work Eric Wong
                   ` (16 preceding siblings ...)
  2023-11-13 13:15 ` [PATCH 17/18] cindex: rename associate-max => window Eric Wong
@ 2023-11-13 13:15 ` Eric Wong
  17 siblings, 0 replies; 19+ messages in thread
From: Eric Wong @ 2023-11-13 13:15 UTC (permalink / raw)
  To: meta

This is shorthand to enabling --associate with the most
aggressive (and time-consuming) options available, starting from
the Unix epoch and having an unlimited window to join on.
---
 lib/PublicInbox/CodeSearchIdx.pm | 5 +++++
 script/public-inbox-cindex       | 1 +
 2 files changed, 6 insertions(+)

diff --git a/lib/PublicInbox/CodeSearchIdx.pm b/lib/PublicInbox/CodeSearchIdx.pm
index 54ddb68e..4ed5ea64 100644
--- a/lib/PublicInbox/CodeSearchIdx.pm
+++ b/lib/PublicInbox/CodeSearchIdx.pm
@@ -1146,6 +1146,11 @@ sub cidx_run { # main entry point
 	local $self->{ASSOC_PFX} = \@ASSOC_PFX;
 	local $self->{PENDING} = {};
 	local $self->{-pi_cfg};
+	if ($self->{-opt}->{'associate-aggressive'}) { # shortcut
+		$self->{-opt}->{'associate-date-range'} //= '19700101000000..';
+		$self->{-opt}->{'associate-window'} //= -1;
+		$self->{-opt}->{associate} //= 1;
+	}
 	if (grep { $_ } @{$self->{-opt}}{qw(prune associate)}) {
 		require File::Temp;
 		$TMPDIR = File::Temp->newdir('cidx-all-git-XXXX', TMPDIR => 1);
diff --git a/script/public-inbox-cindex b/script/public-inbox-cindex
index b8133806..feb4a7f4 100755
--- a/script/public-inbox-cindex
+++ b/script/public-inbox-cindex
@@ -27,6 +27,7 @@ EOF
 my $opt = { fsync => 1, scan => 1 }; # --no-scan is hidden
 GetOptions($opt, qw(quiet|q verbose|v+ reindex jobs|j=i fsync|sync! dangerous
 		indexlevel|index-level|L=s associate associate-window=i
+		associate-aggressive
 		associate-date-range=s associate-prefixes=s@
 		batch_size|batch-size=s max_size|max-size=s
 		include|I=s@ only=s@ all show-roots

^ permalink raw reply related	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2023-11-13 13:15 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-11-13 13:15 [PATCH 00/18] cindex: some --associate work Eric Wong
2023-11-13 13:15 ` [PATCH 01/18] cindex: check `say' errors w/ close or ->flush Eric Wong
2023-11-13 13:15 ` [PATCH 02/18] tmpfile: check `stat' errors, use autodie for unlink Eric Wong
2023-11-13 13:15 ` [PATCH 03/18] cindex: use `local' for pipes between processes Eric Wong
2023-11-13 13:15 ` [PATCH 04/18] xap_helper_cxx: use write_file helper Eric Wong
2023-11-13 13:15 ` [PATCH 05/18] xap_helper_cxx: make the build process ccache-friendly Eric Wong
2023-11-13 13:15 ` [PATCH 06/18] xap_helper_cxx: use -pipe by default in CXXFLAGS Eric Wong
2023-11-13 13:15 ` [PATCH 07/18] xap_client: spawn C++ xap_helper directly Eric Wong
2023-11-13 13:15 ` [PATCH 08/18] treewide: update read_all to avoid eof|close checks Eric Wong
2023-11-13 13:15 ` [PATCH 09/18] spawn: don't append to scalarrefs on stdout/stderr Eric Wong
2023-11-13 13:15 ` [PATCH 10/18] cindex: imply --all with --associate w/o -I/--only Eric Wong
2023-11-13 13:15 ` [PATCH 11/18] cindex: delay associate until prune+indexing finish Eric Wong
2023-11-13 13:15 ` [PATCH 12/18] xap_helper: Perl dump_ibx respects `-m MAX' Eric Wong
2023-11-13 13:15 ` [PATCH 13/18] cidx_xap_helper_aux: complain about truncated inputs Eric Wong
2023-11-13 13:15 ` [PATCH 14/18] xap_helper: stricter and harsher error handling Eric Wong
2023-11-13 13:15 ` [PATCH 15/18] xap_helper: better variable naming for key buffer Eric Wong
2023-11-13 13:15 ` [PATCH 16/18] cindex: do not guess integer maximum for Xapian Eric Wong
2023-11-13 13:15 ` [PATCH 17/18] cindex: rename associate-max => window Eric Wong
2023-11-13 13:15 ` [PATCH 18/18] cindex: support --associate-aggressive shortcut Eric Wong

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).