unofficial mirror of meta@public-inbox.org
From: Eric Wong <e@80x24.org>
To: "Robin H. Johnson" <robbat2@orbis-terrarum.net>
Cc: meta@public-inbox.org
Subject: [RFC] altid: start supporting indexfilter type (was: Alternate permalink URLs)
Date: Mon, 20 Nov 2023 03:21:32 +0000
Message-ID: <20231120032132.M610564@dcvr>
In-Reply-To: <robbat2-20231119T232932-954868624Z@orbis-terrarum.net>

"Robin H. Johnson" <robbat2@orbis-terrarum.net> wrote:
> Hi,
> 
> This is more of a feature request / request for pointers on how to tweak
> the design to support something, and it might be suited to maintaining
> as a local patch.

Since the indexing internals are somewhat in flux and tied to
Xapian and Perl, I'm happy to carry it to ensure it stays
working (similar to the "altid" and existing Filter stuff).

> The permalinks offered by public-inbox are great, but at Gentoo Linux,
> we'd like to ALSO continue to offer our historical permalinks.

Would you want the historical permalinks displayed in the
PublicInbox::WWW HTML UI?  That is already the slowest and most
expensive part of public-inbox, so I'm hesitant to support more
options which slow it down (though I'm halfway considering
introducing more C to speed that part up...)

With the patch below, you should be able to use:

https://public-inbox.gentoo.org/gentoo-dev/?q=xarchiveshash:499b958da430b925dbd2f2b58e0f507e

The same way <https://public-inbox.org/git/?q=gmane:123> works.

(Maybe that would be better with an "I'm Feeling Lucky" search...)

> For those, the permalink slug portion was built when the mail arrived
> into the archives ingest pipeline.
> 
> Example legacy link:
> https://archives.gentoo.org/gentoo-dev/message/499b958da430b925dbd2f2b58e0f507e
> 
> We'd need to tweak the index somehow to expose it.
> 
> That same mail as visible in our public-inbox test site:
> https://public-inbox.gentoo.org/gentoo-dev/538ce05eef3f4df3468cbc7f7abfa90eb2ea7d51.camel@gentoo.org/raw
> 
> The permalink slug is in the header:
> X-Archives-Hash: 499b958da430b925dbd2f2b58e0f507e
> 
> This needs to end up in the Xapian index (which doesn't seem to index
> headers right now), and then get wired up as a route:
> On access, redirect to public-inbox permalink.
> 
> Pointers on where in the codebase to wire up the Xapian side greatly
> appreciated, since it doesn't seem to be indexing arbitrary headers
> right now.

The indexing+search part is something that's been requested by
others, too.  With the below patch, setting:

	altid = indexfilter:xarchiveshash:package=XArchivesHash

for a given inbox should let you search on
"xarchiveshash:$hash" the same way the "gmane:$INTEGER" altid
search works for public-inbox.org/git/.
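
As a rough sketch of how a legacy permalink could be turned into
such a search URL (e.g. for a redirect on the old archive host):
legacy_to_search_url is a hypothetical helper, not part of
public-inbox, and the URL layout is assumed from the two example
URLs in this thread.

```perl
use v5.12;

# Hypothetical helper (not part of public-inbox): rewrite a legacy
# archives.gentoo.org permalink into an altid search URL.
# The default $base and both URL layouts are assumptions taken from
# the example URLs in this thread.
sub legacy_to_search_url {
	my ($legacy, $base) = @_;
	$base //= 'https://public-inbox.gentoo.org';
	my $re = qr!\Ahttps://archives\.gentoo\.org/
		([\w.+-]+)/message/
		([a-f0-9]{32})\z!x; # same strict hash check as the filter
	$legacy =~ $re or return; # not a legacy permalink
	my ($list, $hash) = ($1, $2);
	return "$base/$list/?q=xarchiveshash:$hash";
}

say legacy_to_search_url('https://archives.gentoo.org/gentoo-dev/' .
	'message/499b958da430b925dbd2f2b58e0f507e');
# prints:
# https://public-inbox.gentoo.org/gentoo-dev/?q=xarchiveshash:499b958da430b925dbd2f2b58e0f507e
```

Anything that doesn't look like a legacy permalink returns
nothing, so the caller can fall back to a 404.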

Sidenote: Unfortunately, altid needs to be configured per-inbox, but
	I suppose indexfilter (unlike serial) makes sense to
	support globally in the future...

You can also replace "xarchiveshash" with any unused
all-lowercase prefix (my brain kept leaving out the "s" while
writing tests and I was puzzled why it didn't work at first :x).
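
To illustrate how the prefix in the altid spec relates to what
ends up in Xapian, here is a standalone sketch mirroring the
split logic in the patch's AltId->new.  parse_altid_spec is a
made-up name, and the sketch leaves out the uri_unescape and
package loading steps.

```perl
use v5.12;

# Standalone sketch (hypothetical function name) of how an altid
# spec is taken apart; modeled on AltId->new in the patch below.
sub parse_altid_spec {
	my ($spec) = @_;
	# the limit of 3 keeps any ':' in the query (e.g. "::") intact
	my ($type, $prefix, $query) = split(/:/, $spec, 3);
	my %params = map {
		my ($k, $v) = split(/=/, $_, 2);
		($k, $v // '');
	} split(/[&;]/, $query // '');
	return {
		type => $type,
		prefix => $prefix, # what users type before the ':'
		xprefix => 'X'.uc($prefix), # Xapian term prefix
		params => \%params,
	};
}

my $s = parse_altid_spec(
	'indexfilter:xarchiveshash:package=XArchivesHash');
say "$s->{prefix} => $s->{xprefix} ($s->{params}{package})";
# prints: xarchiveshash => XXARCHIVESHASH (XArchivesHash)
```

The same prefix string has to be used in the config and in the
query, which is why a dropped "s" yields no results rather than
an error.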

If you want to carry a private plugin to search on "foo:" using
MyPackage::Foo, you should be able to add this to the
publicinbox.$NAME section:

	altid = indexfilter:foo:package=MyPackage::Foo

But I'm a bit hesitant to declare the indexing internals a
stable API to support into eternity.  So I'd rather take a patch
to handle stuff in the PublicInbox::IndexFilter::* namespace.
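
For reference, a private plugin along those lines might look
roughly like the sketch below.  MyPackage::Foo, the X-Foo header
and the value pattern are invented for illustration; only the
new() and index_filter() entry points follow the XArchivesHash
filter in the patch.  MockDoc and MockEml are stand-ins so the
example runs without public-inbox or Xapian; a real plugin would
also need to be loadable via @INC.

```perl
use v5.12;

# Hypothetical private plugin (name, header and pattern are made up)
package MyPackage::Foo;

sub new { bless {}, __PACKAGE__ }

# $sidx: SearchIdx object (unused here), $doc: Xapian document,
# $eml: the message, $pfx: Xapian term prefix (e.g. "XFOO")
sub index_filter {
	my ($self, $sidx, $doc, $eml, $pfx) = @_;
	# only index values matching a strict pattern
	my @v = grep /\A[a-z0-9]{1,64}\z/, $eml->header_raw('X-Foo');
	$doc->add_boolean_term($pfx.$_) for @v;
}

# Minimal stand-ins for the Xapian document and PublicInbox::Eml
package MockDoc;
sub new { bless { terms => [] }, shift }
sub add_boolean_term { push @{$_[0]->{terms}}, $_[1] }

package MockEml;
sub new {
	my ($class, %hdr) = @_;
	my $self = { %hdr };
	bless $self, $class;
}
sub header_raw { my ($self, $name) = @_; @{$self->{$name} // []} }

package main;
my $doc = MockDoc->new;
my $eml = MockEml->new('X-Foo' => [ 'abc123', 'Not Valid' ]);
MyPackage::Foo->new->index_filter(undef, $doc, $eml, 'XFOO');
say for @{$doc->{terms}}; # prints: XFOOabc123
```

Since index_filter relies on unstable internals, a plugin like
this may need adjusting across public-inbox releases.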

------8<-------
Subject: [RFC] altid: start supporting indexfilter type

In addition to the traditional AltId serial numbers from
external sources (e.g. gmane), we can support Xapian-only
indexing filters using Perl packages in the
PublicInbox::IndexFilter::* namespace.

Unlike the old `serial' type, this requires no separate SQLite
DB since its data is expected to be contained within the raw
message.  `indexfilter' only affects Xapian indexing, and isn't
subject to the stricter `serial' type which enforced a 1:1
Message-ID <=> integer relationship used for NNTP.

Unlike the existing PublicInbox::Filter::* namespace, this
doesn't affect message delivery paths (-watch/-mda) at all
and can be used from (clone|fetch)-synchronized mirrors.

The new PublicInbox::IndexFilter::XArchivesHash may be a
starting point for Gentoo archives, but other packages can
be added for other hosts.

This depends on Perl modules being implemented for each case;
but I figure using Perl directly is preferable to having some
new syntax that gets translated (likely poorly!) to actual Perl.
In other words, we're trying not to reinvent or reimplement
procmail, sieve, or any other mail processing language.

Link: https://public-inbox.org/meta/robbat2-20231119T232932-954868624Z@orbis-terrarum.net/
---
 MANIFEST                                     |  2 +
 lib/PublicInbox/AltId.pm                     | 32 ++++---
 lib/PublicInbox/IndexFilter/XArchivesHash.pm | 30 +++++++
 lib/PublicInbox/Search.pm                    | 17 +++-
 lib/PublicInbox/SearchIdx.pm                 | 11 ++-
 t/watch_indexfilter_xarchiveshash.t          | 90 ++++++++++++++++++++
 6 files changed, 164 insertions(+), 18 deletions(-)
 create mode 100644 lib/PublicInbox/IndexFilter/XArchivesHash.pm
 create mode 100644 t/watch_indexfilter_xarchiveshash.t

diff --git a/MANIFEST b/MANIFEST
index e1c3dc97..d4173f20 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -226,6 +226,7 @@ lib/PublicInbox/In2Tie.pm
 lib/PublicInbox/Inbox.pm
 lib/PublicInbox/InboxIdle.pm
 lib/PublicInbox/InboxWritable.pm
+lib/PublicInbox/IndexFilter/XArchivesHash.pm
 lib/PublicInbox/Inotify.pm
 lib/PublicInbox/InputPipe.pm
 lib/PublicInbox/Isearch.pm
@@ -614,6 +615,7 @@ t/v2writable.t
 t/view.t
 t/watch_filter_rubylang.t
 t/watch_imap.t
+t/watch_indexfilter_xarchiveshash.t
 t/watch_maildir.t
 t/watch_maildir_v2.t
 t/watch_multiple_headers.t
diff --git a/lib/PublicInbox/AltId.pm b/lib/PublicInbox/AltId.pm
index 80757ceb..5b917edb 100644
--- a/lib/PublicInbox/AltId.pm
+++ b/lib/PublicInbox/AltId.pm
@@ -21,27 +21,37 @@ use PublicInbox::Msgmap;
 sub new {
 	my ($class, $ibx, $spec, $writable) = @_;
 	my ($type, $prefix, $query) = split(/:/, $spec, 3);
-	$type eq 'serial' or die "non-serial not supported, yet\n";
 	$prefix =~ /\A\w+\z/ or warn "non-word prefix not searchable\n";
 	my %params = map {
 		my ($k, $v) = split(/=/, uri_unescape($_), 2);
 		$v = '' unless defined $v;
 		($k, $v);
 	} split(/[&;]/, $query);
-	my $f = $params{file} or die "file: required for $type spec $spec\n";
-	unless (index($f, '/') == 0) {
-		if ($ibx->version == 1) {
-			$f = "$ibx->{inboxdir}/public-inbox/$f";
-		} else {
-			$f = "$ibx->{inboxdir}/$f";
-		}
-	}
-	bless {
-		filename => $f,
+	my $self = bless {
 		writable => $writable,
 		prefix => $prefix,
 		xprefix => 'X'.uc($prefix),
 	}, $class;
+	if ($type eq 'serial') { # traditional message-ID <=> integer mapping
+		my $f = $params{file} or die
+			"E: file required for $type altid=$spec\n";
+		unless (index($f, '/') == 0) {
+			$f = $ibx->version == 1 ?
+				"$ibx->{inboxdir}/public-inbox/$f" :
+				"$ibx->{inboxdir}/$f";
+		}
+		$self->{filename} = $f;
+	} elsif ($type eq 'indexfilter') {
+		my $pkg = $params{package} //
+			die "E: package= unset for altid=$spec\n";
+		$pkg =~ m!::! or $pkg = "PublicInbox::IndexFilter::$pkg";
+		eval "require $pkg";
+		die "E: could not load $pkg for altid=$spec: $@" if $@;
+		$self->{indexfilter} = $pkg->new;
+	} else {
+		die "non-serial/non-indexfilter not supported, yet ($type)\n"
+	}
+	$self;
 }
 
 sub mm_alt {
diff --git a/lib/PublicInbox/IndexFilter/XArchivesHash.pm b/lib/PublicInbox/IndexFilter/XArchivesHash.pm
new file mode 100644
index 00000000..238a5925
--- /dev/null
+++ b/lib/PublicInbox/IndexFilter/XArchivesHash.pm
@@ -0,0 +1,30 @@
+# Copyright (C) all contributors <meta@public-inbox.org>
+# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
+
+# allow searching on X-Archives-Hash
+package PublicInbox::IndexFilter::XArchivesHash;
+use v5.12;
+use Carp qw(carp);
+
+# attach sidx (SearchIdx) object to $self?
+sub new { bless {}, __PACKAGE__ }
+
+# called by SearchIdx (internal APIs are unstable)
+sub index_filter {
+	my ($self, $sidx, $doc, $eml, $pfx) = @_;
+	# $sidx may be used for index_phrase in packages
+	my @h = grep /\A(?:[a-f0-9]{32})\z/, # strict RE
+		$eml->header_raw('X-Archives-Hash');
+	if (scalar(@h) == 0) {
+		carp 'E: no hash in X-Archives-Hash <',
+			$eml->header_raw('Message-ID'), '>';
+	} elsif (scalar(@h) != 1) {
+		carp "W: multiple hashes in X-Archives-Hash: @h";
+		# fall-through to index all of them:
+	}
+	$doc->add_boolean_term($pfx.$_) for @h;
+}
+
+# TODO: unindex_filter? maybe unneeded since entire Xapian doc is deleted
+
+1;
diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index 477f77dc..bee86a6d 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -507,15 +507,26 @@ sub qparse_new {
 	# just parse the spec to avoid the extra DB handles for now.
 	if (my $altid = $self->{altid}) {
 		my $user_pfx = $self->{-user_pfx} = [];
+		# FIXME: consider moving some of this logic to AltId.pm
 		for (@$altid) {
 			# $_ = 'serial:gmane:/path/to/gmane.msgmap.sqlite3'
 			# note: Xapian supports multibyte UTF-8, /^[0-9]+$/,
 			# and '_' with prefixes matching \w+
-			/\Aserial:(\w+):/ or next;
-			my $pfx = $1;
-			push @$user_pfx, "$pfx:", <<EOF;
+			/\A(serial|indexfilter):(\w+):/ or do {
+				warn "W: unsupported altid=$_\n";
+				next;
+			};
+			my ($type, $pfx) = ($1, $2);
+			if ($type eq 'serial') {
+				push @$user_pfx, "$pfx:", <<EOF;
 alternate serial number  e.g. $pfx:12345 (boolean)
 EOF
+			} elsif ($type eq 'indexfilter') {
+				# TODO: support help in IndexFilter classes?
+				push @$user_pfx, "$pfx:", <<EOF;
+alternate prefix e.g. $pfx:xyz
+EOF
+			}
 			# gmane => XGMANE
 			$qp->add_boolean_prefix($pfx, 'X'.uc($pfx));
 		}
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 9566b14d..c5ddba45 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -474,10 +474,13 @@ sub eml2doc ($$$;$) {
 	if (my $altid = $self->{-altid}) {
 		foreach my $alt (@$altid) {
 			my $pfx = $alt->{xprefix};
-			foreach my $mid (@$mids) {
-				my $id = $alt->mid2alt($mid);
-				next unless defined $id;
-				$doc->add_boolean_term($pfx . $id);
+			if (my $idxf = $alt->{indexfilter}) {
+				$idxf->index_filter($self, $doc, $eml, $pfx);
+			} else { # traditional Message-ID <=> NNTP number map
+				for my $mid (@$mids) {
+					my $id = $alt->mid2alt($mid) // next;
+					$doc->add_boolean_term($pfx . $id);
+				}
 			}
 		}
 	}
diff --git a/t/watch_indexfilter_xarchiveshash.t b/t/watch_indexfilter_xarchiveshash.t
new file mode 100644
index 00000000..c0af8fcc
--- /dev/null
+++ b/t/watch_indexfilter_xarchiveshash.t
@@ -0,0 +1,90 @@
+# Copyright (C) all contributors <meta@public-inbox.org>
+# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
+use v5.12;
+use autodie;
+use PublicInbox::TestCommon;
+use PublicInbox::Eml;
+use PublicInbox::Emergency;
+use PublicInbox::IO qw(write_file);
+use PublicInbox::InboxIdle;
+use PublicInbox::Inbox;
+use PublicInbox::DS;
+use PublicInbox::Config;
+require_mods(qw(DBD::SQLite Xapian));
+my $tmpdir = tmpdir;
+my $config = "$tmpdir/pi_config";
+local $ENV{PI_CONFIG} = $config;
+delete local $ENV{PI_DIR};
+my @V = (1);
+my @creat_opt = (indexlevel => 'medium', sub {});
+my $v1 = create_inbox 'v1', tmpdir => "$tmpdir/v1", @creat_opt;
+my $fh = write_file '>', $config, <<EOM;
+[publicinbox "v1"]
+	inboxdir = $v1->{inboxdir}
+	address = v1\@example.com
+	watch = maildir:$tmpdir/v1-md
+	altid = indexfilter:xarchiveshash:package=XArchivesHash
+EOM
+
+SKIP: {
+	require_git(v2.6, 1);
+	push @V, 2;
+	my $v2 = create_inbox 'v2', tmpdir => "$tmpdir/v2", @creat_opt;
+	my $pkg = 'PublicInbox::IndexFilter::XArchivesHash';
+	print $fh <<EOM;
+[publicinbox "v2"]
+	inboxdir = $tmpdir/v2
+	address = v2\@example.com
+	watch = maildir:$tmpdir/v2-md
+	altid = indexfilter:xarchiveshash:package=$pkg
+EOM
+}
+close $fh;
+my $cfg = PublicInbox::Config->new;
+for my $v (@V) { for ('', qw(cur new tmp)) { mkdir "$tmpdir/v$v-md/$_" } }
+my $wm = start_script([qw(-watch)]);
+my $h1 = 'deadbeef' x 4;
+my @em = map {
+	my $v = $_;
+	my $em = PublicInbox::Emergency->new("$tmpdir/v$v-md");
+	$em->prepare(\(PublicInbox::Eml->new(<<EOM)->as_string));
+From: x\@example.com
+Message-ID: <i-1$v\@example.com>
+To: <v$v\@example.com>
+Date: Sat, 02 Oct 2010 00:00:00 +0000
+X-Archives-Hash: $h1
+
+EOM
+	$em;
+} @V;
+
+my $delivered = 0;
+my $cb = sub {
+	diag "message delivered to `$_[0]->{name}'";
+	++$delivered;
+};
+PublicInbox::DS->Reset;
+my $ii = PublicInbox::InboxIdle->new($cfg);
+my $obj = bless \$cb, 'PublicInbox::TestCommon::InboxWakeup';
+$cfg->each_inbox(sub { $_[0]->subscribe_unlock('ident', $obj) });
+local @PublicInbox::DS::post_loop_do = (sub { $delivered != @V });
+$_->commit for @em;
+diag 'waiting for -watch to import new message(s)';
+PublicInbox::DS::event_loop();
+$wm->join('TERM');
+$ii->close;
+
+$cfg->each_inbox(sub {
+	my ($ibx) = @_;
+	my $srch = $ibx->search;
+	my $mset = $srch->mset('xarchiveshash:miss');
+	is($mset->size, 0, 'got xarchiveshash:miss non-result');
+	$mset = $srch->mset("xarchiveshash:$h1");
+	is($mset->size, 1, 'got xarchiveshash: hit result') or return;
+	my $num = $srch->mset_to_artnums($mset);
+	my $eml = $ibx->smsg_eml($ibx->over->get_art($num->[0]));
+	is($eml->header_raw('X-Archives-Hash'), $h1,
+		'stored message with X-Archives-Hash');
+});
+
+done_testing;

Thread overview: 4+ messages
2023-11-19 23:47 Alternate permalink URLs - for migration from other/custom archive solutions Robin H. Johnson
2023-11-20  3:21 ` Eric Wong [this message]
2023-12-08 21:23   ` [RFC] altid: start supporting indexfilter type (was: Alternate permalink URLs) Eric Wong
2024-04-27  7:00     ` Eric Wong
