unofficial mirror of meta@public-inbox.org
* [PATCH 0/2] cindex: future-proof blob OID indexing
@ 2023-12-05  9:46 Eric Wong
  2023-12-05  9:46 ` [PATCH 1/2] searchidx: drop redundant decl in index_git_blob_id Eric Wong
  2023-12-05  9:46 ` [PATCH 2/2] cindex: index full (40/64 char) hex blob OIDs Eric Wong
  0 siblings, 2 replies; 3+ messages in thread
From: Eric Wong @ 2023-12-05  9:46 UTC (permalink / raw)
  To: meta

1/2 fixes a bug spotted while checking over the blob OID indexing code.

Eric Wong (2):
  searchidx: drop redundant decl in index_git_blob_id
  cindex: index full (40/64 char) hex blob OIDs

 lib/PublicInbox/CodeSearchIdx.pm | 15 +++++++++------
 lib/PublicInbox/SearchIdx.pm     |  1 -
 t/cindex.t                       |  3 +++
 3 files changed, 12 insertions(+), 7 deletions(-)


* [PATCH 1/2] searchidx: drop redundant decl in index_git_blob_id
  2023-12-05  9:46 [PATCH 0/2] cindex: future-proof blob OID indexing Eric Wong
@ 2023-12-05  9:46 ` Eric Wong
  2023-12-05  9:46 ` [PATCH 2/2] cindex: index full (40/64 char) hex blob OIDs Eric Wong
  1 sibling, 0 replies; 3+ messages in thread
From: Eric Wong @ 2023-12-05  9:46 UTC (permalink / raw)
  To: meta

Oddly, Perl did not warn about this.  Spotted while confirming
abbreviated OIDs are also indexed when unabbreviated OIDs
appear.
---
 lib/PublicInbox/SearchIdx.pm | 1 -
 1 file changed, 1 deletion(-)

diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 86c435fd..1bf471fc 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -715,7 +715,6 @@ sub nr_quiet_rm { delete($_[0]->{-quiet_rm}) // 0 }
 sub index_git_blob_id {
 	my ($doc, $pfx, $objid) = @_;
 
-	my $len = length($objid);
 	for (my $len = length($objid); $len >= 7; ) {
 		$doc->add_term($pfx.$objid);
 		$objid = substr($objid, 0, --$len);

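The loop in index_git_blob_id can be tried outside of Xapian with a
minimal sketch like the one below.  abbrev_terms and the 'XDFPOST'
prefix are hypothetical stand-ins (not names taken from this patch),
and terms are collected in a list instead of going through
$doc->add_term.  One plausible reason Perl stayed silent about the
redundant declaration: the `my $len' in the for-loop initializer
belongs to the loop's own scope, so it shadows the outer variable
rather than redeclaring it in the same scope.

```perl
use strict;
use warnings;

# Hypothetical stand-in for index_git_blob_id: returns the terms
# instead of adding them to a Xapian document.
sub abbrev_terms {
	my ($pfx, $objid) = @_;
	my @terms;
	# same loop shape as the patched code: emit the current OID,
	# then chop one hex character until fewer than 7 remain
	for (my $len = length($objid); $len >= 7; ) {
		push @terms, $pfx.$objid;
		$objid = substr($objid, 0, --$len);
	}
	return @terms;
}

my $oid = 'dba13ed2ddf783ee8118c6a581dbf75305f816a3'; # OID used in t/cindex.t
my @terms = abbrev_terms('XDFPOST', $oid); # prefix is a placeholder
printf "%d terms, longest=%s shortest=%s\n",
	scalar(@terms), $terms[0], $terms[-1];
```

The first term is always the OID as given, so whether the full
40/64 character form ends up indexed depends on the caller passing
an unabbreviated OID.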

* [PATCH 2/2] cindex: index full (40/64 char) hex blob OIDs
  2023-12-05  9:46 [PATCH 0/2] cindex: future-proof blob OID indexing Eric Wong
  2023-12-05  9:46 ` [PATCH 1/2] searchidx: drop redundant decl in index_git_blob_id Eric Wong
@ 2023-12-05  9:46 ` Eric Wong
  1 sibling, 0 replies; 3+ messages in thread
From: Eric Wong @ 2023-12-05  9:46 UTC (permalink / raw)
  To: meta

This future-proofs the index against git auto-abbreviation
needing more characters as the repo grows.  It'll be useful for
joining against inboxes using dfpre.

As with emails, we'll continue indexing abbreviated blob OIDs
down to 7 hex characters so a SHA-1 git repo will have all
abbreviations of the OID from 7-39 hex characters in addition
to the 40 character unabbreviated form.
---
 lib/PublicInbox/CodeSearchIdx.pm | 15 +++++++++------
 t/cindex.t                       |  3 +++
 2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/lib/PublicInbox/CodeSearchIdx.pm b/lib/PublicInbox/CodeSearchIdx.pm
index ec0fc6e3..20aac584 100644
--- a/lib/PublicInbox/CodeSearchIdx.pm
+++ b/lib/PublicInbox/CodeSearchIdx.pm
@@ -107,6 +107,8 @@ our (
 	$DUMP_IBX_WPIPE, # goes to sort(1)
 	$ANY_SHARD, # shard round-robin for scan fingerprinting
 	@OFF2ROOT,
+	$GIT_VER,
+	@NO_ABBREV,
 );
 
 # stop walking history if we see >$SEEN_MAX existing commits, this assumes
@@ -304,7 +306,7 @@ sub shard_index { # via wq_io_do in IDX_SHARDS
 	my $in = delete($self->{0}) // die 'BUG: no {0} input';
 	my $op_p = delete($self->{1}) // die 'BUG: no {1} op_p';
 	sysseek($in, 0, SEEK_SET);
-	my $cmd = $git->cmd(@LOG_STDIN);
+	my $cmd = $git->cmd(@NO_ABBREV, @LOG_STDIN);
 	my $rd = popen_rd($cmd, undef, { 0 => $in },
 				\&cidx_reap_log, $cmd, $self, $op_p);
 	PublicInbox::CidxLogP->new($rd, $self, $git, $roots);
@@ -1151,15 +1153,14 @@ sub run_prune { # OnDestroy when `git config extensions.objectFormat' are done
 	run_await([@SORT, '-u'], $CMD_ENV, $sort_opt, \&cmd_done);
 	my $comm_rd = popen_rd(\@COMM, $CMD_ENV, $comm_opt, \&cmd_done, \@COMM);
 	PublicInbox::CidxComm->new($comm_rd, $self, $drs); # ->cidx_read_comm
-	my $git_ver = PublicInbox::Git::git_version();
-	push @PRUNE_BATCH, '--buffer' if $git_ver ge v2.6;
+	push @PRUNE_BATCH, '--buffer' if $GIT_VER ge v2.6;
 
 	# Yes, we pipe --unordered git output to sort(1) because sorting
 	# inside git leads to orders-of-magnitude slowdowns on rotational
 	# storage.  GNU sort(1) also works well on larger-than-memory
 	# datasets, and it's not worth eliding sort(1) for old git.
-	push @PRUNE_BATCH, '--unordered' if $git_ver ge v2.19;
-	warn(sprintf(<<EOM, $git_ver)) if $git_ver lt v2.19;
+	push @PRUNE_BATCH, '--unordered' if $GIT_VER ge v2.19;
+	warn(sprintf(<<EOM, $GIT_VER)) if $GIT_VER lt v2.19;
 W: git v2.19+ recommended for high-latency storage (have git v%vd)
 EOM
 	dump_git_commits(undef, undef, undef, $batch_opt, $self);
@@ -1281,7 +1282,7 @@ sub cidx_run { # main entry point
 	local $SCANQ = [];
 	local ($DO_QUIT, $REINDEX, $TXN_BYTES, @GIT_DIR_GONE, @PRUNEQ,
 		$REPO_CTX, %ALT_FH, $TMPDIR, @AWK, @COMM, $CMD_ENV,
-		%TODO, @IBXQ, @IBX, @JOIN, %JOIN, @JOIN_PFX,
+		%TODO, @IBXQ, @IBX, @JOIN, %JOIN, @JOIN_PFX, @NO_ABBREV,
 		@JOIN_DT, $DUMP_IBX_WPIPE, @OFF2ROOT, $XHC, @SORT, $GITS_NR);
 	local $BATCH_BYTES = $self->{-opt}->{batch_size} //
 				$PublicInbox::SearchIdx::BATCH_BYTES;
@@ -1289,6 +1290,8 @@ sub cidx_run { # main entry point
 	local $self->{PENDING} = {}; # used by PublicInbox::CidxXapHelperAux
 	my $cfg = $self->{-opt}->{-pi_cfg} // die 'BUG: -pi_cfg unset';
 	$self->{-cfg_f} = $cfg->{-f} = rel2abs_collapsed($cfg->{-f});
+	local $GIT_VER = PublicInbox::Git::git_version();
+	@NO_ABBREV = ('-c', 'core.abbrev='.($GIT_VER lt v2.31.0 ? 40 : 'no'));
 	if (grep { $_ } @{$self->{-opt}}{qw(prune join)}) {
 		require File::Temp;
 		$TMPDIR = File::Temp->newdir('cidx-all-git-XXXX', TMPDIR => 1);
diff --git a/t/cindex.t b/t/cindex.t
index 0193cf18..716e5984 100644
--- a/t/cindex.t
+++ b/t/cindex.t
@@ -288,6 +288,9 @@ EOM
 		++$nr;
 	}, '.');
 	is $nr, 1, 'iterated through cindices';
+	my $oid = 'dba13ed2ddf783ee8118c6a581dbf75305f816a3';
+	my $mset = $csrch->mset("dfpost:$oid");
+	is $mset->size, 1, 'got result from full OID search';
 }
 
 done_testing;

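The version check behind @NO_ABBREV relies on Perl comparing
v-strings with the string operators.  Below is a minimal sketch of
that selection, assuming git_version() yields a value equivalent to
a three-component v-string; vstr and no_abbrev_args are hypothetical
helpers written for illustration and are not part of public-inbox.

```perl
use strict;
use warnings;

# Hypothetical helper: build a v-string-like value from a dotted
# version such as "2.31.0" (one character per component).
sub vstr { join('', map { chr($_) } split(/\./, $_[0])) }

# Hypothetical helper mirroring the expression added to cidx_run:
# older git gets an explicit length, newer git gets `no'.
sub no_abbrev_args {
	my ($git_ver) = @_;
	return ('-c', 'core.abbrev='.($git_ver lt v2.31.0 ? 40 : 'no'));
}

for my $v (qw(2.19.0 2.30.9 2.31.0 2.43.0)) {
	printf "git %-7s => %s\n", $v, join(' ', no_abbrev_args(vstr($v)));
}
```

Since `lt' is a plain string comparison, a two-component value such
as v2.31 sorts before v2.31.0, so this sketch only behaves as
intended when the version carries all three components.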
