* [PATCH] search: support searching on List-Id
2020-04-22 22:17 ` Konstantin Ryabitsev
@ 2020-05-07 3:00 ` Eric Wong
From: Eric Wong @ 2020-05-07 3:00 UTC
To: meta
Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> On Mon, Apr 20, 2020 at 01:53:17AM +0000, Eric Wong wrote:
> > I'm probably going to start indexing List-Id: headers by
> > default, and have `lid:' be the search prefix for inboxes
> > which combine multiple lists and may have unstable email
> > addresses.
>
> This would be handy indeed!
Not sure if both `lid:' and `l:' are necessary, but it's
consistent with `mid:' and `m:' as far as exact (boolean) vs.
probabilistic search goes.
I figure `l:' is probably useful for lists/projects which change
domains/hosts. Patch below.
> > Anything else that should be indexed by default?
> >
> > There'll also be an option to define indexing for other headers,
> > such as bug-tracker-specific IDs and such.
>
> I think if this is configurable, then it's really the only thing that's
> needed. Everyone's needs are going to be different, so indexing headers
> that aren't interesting to many people is just going to lead to storage
> bloat.
The other thing is whether or not decoding RFC 2047 is necessary
or even correct for a particular header.
Email::MIME->{header_str,header} blindly decodes some things
which probably shouldn't be... I'm probably splitting hairs,
here, though.
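To make the raw-vs-decoded distinction concrete, here is a minimal
standalone sketch of pulling just the angle-bracketed identifier out of
raw List-Id values. `list_ids' is a hypothetical helper name used only for
illustration and is not part of the patch; the regexp is the same one the
patch uses. Since only the text inside `<' and `>' is kept, any RFC 2047
encoded phrase in front of it never needs decoding:

```perl
use strict;
use warnings;

# Hypothetical helper (not in the patch): return the identifier found
# inside <...> for each raw List-Id value, skipping values without one.
sub list_ids {
	my (@raw) = @_;
	my @ids;
	for my $l (@raw) {
		$l =~ /<([^>]+)>/ or next; # same pattern as the patch
		push @ids, $1;
	}
	@ids;
}

# The phrase before <...> is ignored, whether plain or RFC 2047 encoded.
print "$_\n" for list_ids(
	"I'm not mad <i.m.just.bored>",
	'=?UTF-8?Q?caf=C3=A9?= <cafe.example.com>',
	'no angle brackets here',
);
```

The third value has no angle brackets, so it is skipped entirely rather
than being indexed as-is.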
-------8<------
Subject: [PATCH] search: support searching on List-Id
We'll support both probabilistic matches via `l:' and exact
(boolean) matches via `lid:', similar to how both `m:' and
`mid:' are supported. Only text inside angle brackets (`<'
and `>') is indexed, since I'm not sure if there's value in
searching on the optional phrases (which would require decoding
with ->header_str instead of ->header_raw).
---
lib/PublicInbox/Search.pm | 9 +++++++++
lib/PublicInbox/SearchIdx.pm | 6 ++++++
t/search.t | 31 +++++++++++++++++++++++++++++++
3 files changed, 46 insertions(+)
diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index 86a6ad674b3..b7db2b9f7fc 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -77,11 +77,17 @@ use constant {
# 15 - see public-inbox-v2-format(5)
# further bumps likely unnecessary, we'll suggest in-place
# "--reindex" use for further fixes and tweaks
+ #
+ # public-inbox v1.5.0 adds (still SCHEMA_VERSION=15):
+ # * "lid:" and "l:" for List-Id searches
SCHEMA_VERSION => 15,
};
+# note: the non-X term prefix allocations are shared with
+# Xapian omega, see xapian-applications/omega/docs/termprefixes.rst
my %bool_pfx_external = (
mid => 'Q', # Message-ID (full/exact), this is mostly uniQue
+ lid => 'G', # newsGroup (or similar entity), just inside <>
dfpre => 'XDFPRE',
dfpost => 'XDFPOST',
dfblob => 'XDFPRE XDFPOST',
@@ -92,6 +98,7 @@ my %prob_prefix = (
# for mairix compatibility
s => 'S',
m => 'XM', # 'mid:' (bool) is exact, 'm:' (prob) can do partial
+ l => 'XL', # 'lid:' (bool) is exact, 'l:' (prob) can do partial
f => 'A',
t => 'XTO',
tc => 'XTO XCC',
@@ -134,6 +141,8 @@ EOF
'f:' => 'match within the From header',
'a:' => 'match within the To, Cc, and From headers',
'tc:' => 'match within the To and Cc headers',
+ 'lid:' => 'exact contents of the List-Id',
+ 'l:' => 'partial match contents of the List-Id header',
'bs:' => 'match within the Subject and body',
'dfn:' => 'match filename from diff',
'dfa:' => 'match diff removed (-) lines',
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 25118f43613..998341a7d4d 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -352,6 +352,12 @@ sub add_xapian ($$$$) {
}
}
$doc->add_boolean_term('Q' . $_) foreach @$mids;
+ for my $l ($hdr->header_raw('List-Id')) {
+ $l =~ /<([^>]+)>/ or next;
+ my $lid = $1;
+ $doc->add_boolean_term('G' . $lid);
+ index_text($self, $lid, 1, 'XL'); # probabilistic
+ }
$self->{xdb}->replace_document($smsg->{num}, $doc);
}
diff --git a/t/search.t b/t/search.t
index 83986837eaf..92f3305d556 100644
--- a/t/search.t
+++ b/t/search.t
@@ -66,6 +66,7 @@ Subject: Hello world
Message-ID: <root@s>
From: John Smith <js@example.com>
To: list@example.com
+List-Id: I'm not mad <i.m.just.bored>
\m/
EOF
@@ -77,6 +78,7 @@ Message-ID: <last@s>
From: John Smith <js@example.com>
To: list@example.com
Cc: foo@example.com
+List-Id: there's nothing <left.for.me.to.do>
goodbye forever :<
EOF
@@ -448,6 +450,35 @@ EOF
is($ro->query("m:Pine m:LNX m:10010260936330", {mset=>1})->size, 1);
});
+{ # List-Id searching
+ my $found = $ro->query('lid:i.m.just.bored');
+ is_deeply([ filter_mids($found) ], [ 'root@s' ],
+ 'got expected mid on exact lid: search');
+
+ $found = $ro->query('lid:just.bored');
+ is_deeply($found, [], 'got nothing on lid: search');
+
+ $found = $ro->query('lid:*.just.bored');
+ is_deeply($found, [], 'got nothing on lid: search');
+
+ $found = $ro->query('l:i.m.just.bored');
+ is_deeply([ filter_mids($found) ], [ 'root@s' ],
+ 'probabilistic search works on full List-Id contents');
+
+ $found = $ro->query('l:just.bored');
+ is_deeply([ filter_mids($found) ], [ 'root@s' ],
+ 'probabilistic search works on partial List-Id contents');
+
+ $found = $ro->query('lid:mad');
+ is_deeply($found, [], 'no match on phrase with lid:');
+
+ $found = $ro->query('lid:bored');
+ is_deeply($found, [], 'no match on partial List-Id with lid:');
+
+ $found = $ro->query('l:nothing');
+ is_deeply($found, [], 'no match on phrase with l:');
+}
+
done_testing();
1;