unofficial mirror of meta@public-inbox.org
From: Eric Wong <e@80x24.org>
To: meta@public-inbox.org
Subject: [PATCH] searchidx: do not index quoted Base-85 patches
Date: Mon, 20 Feb 2023 09:21:50 +0000
Message-ID: <20230220092150.379964-1-e@80x24.org>

Base-85 binary patches were a source of false positives in search
results, and we've filtered them out of non-quoted text since
July 2022.  Unfortunately, people were quoting binary patch contents
in replies (*sigh*) and triggering false positives in search
results.  So we must filter out base-85-looking contents from
quoted text, too.

Followup-to: 8fda04081acde705 (search: do not index base-85 binary patches, 2022-06-20)
Followup-to: 840785917bc74c8e (searchidx: skip "delta $N" sections for base-85, 2022-07-19)
---
 lib/PublicInbox/SearchIdx.pm | 10 ++++++++--
 t/search.t                   | 13 +++++++++++--
 2 files changed, 19 insertions(+), 4 deletions(-)
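
The quoted-text filtering described above can be tried outside the
indexer with a minimal standalone sketch.  It reuses the unanchored
$BASE85 pattern and the substitution from the SearchIdx.pm hunk below;
strip_quoted_base85() and the sample reply are hypothetical names and
data made up for illustration, and are not part of public-inbox.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same character class as $BASE85 in this patch; left unanchored so it
# can be embedded in larger patterns.
my $BASE85 = qr/[a-zA-Z0-9\!\#\$\%\&\(\)\*\+\-;<=>\?\@\^_`\{\|\}\~]+/;

# Hypothetical helper (not in public-inbox): drop base-85-looking lines
# which follow a quoted "literal $N" or "delta $N" header, but only if
# the quoted section mentions "GIT binary patch" at all.
sub strip_quoted_base85 {
	my ($txt) = @_;
	if ($txt =~ /^[>\t ]+GIT binary patch\r?/sm) {
		# keep the size header ($1), discard the payload lines
		$txt =~ s/^([>\h]+(?:literal|delta)
				\x20[0-9]+\r?\n)
			(?:[>\h]+$BASE85\h*\r?\n)+/$1/gsmx;
	}
	$txt;
}

# Made-up quoted reply; the payload strings are the ones queried
# in t/search.t.
my $quoted = <<'EOM';
> GIT binary patch
> literal 1
> IcmZPo000310RR91
>
> literal 0
> HcmV?d00001
>
> thanks, applied
EOM

print strip_quoted_base85($quoted);
```

In this sketch, the "literal $N" headers stay in the text while the
payload lines after them are dropped.  The match is a heuristic: a
quoted single-word line directly after a payload also consists only of
base-85 characters and would be dropped along with it.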

diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 257b83a5..fc464383 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -37,7 +37,7 @@ our $BATCH_BYTES = $ENV{XAPIAN_FLUSH_THRESHOLD} ? 0x7fffffff :
 	# typical 32-bit system:
 	(($Config{ptrsize} >= 8 ? 8192 : 1024) * 1024);
 use constant DEBUG => !!$ENV{DEBUG};
-my $BASE85 = qr/\A[a-zA-Z0-9\!\#\$\%\&\(\)\*\+\-;<=>\?\@\^_`\{\|\}\~]+\z/;
+my $BASE85 = qr/[a-zA-Z0-9\!\#\$\%\&\(\)\*\+\-;<=>\?\@\^_`\{\|\}\~]+/;
 my $xapianlevels = qr/\A(?:full|medium)\z/;
 my $hex = '[a-f0-9]';
 my $OID = $hex .'{40,}';
@@ -270,7 +270,7 @@ sub index_diff ($$$) {
 				push @$xnq, shift(@l);
 
 				# skip base85 and empty lines
-				while (@l && ($l[0] =~ /$BASE85/o ||
+				while (@l && ($l[0] =~ /\A$BASE85\h*\z/o ||
 						$l[0] !~ /\S/)) {
 					shift @l;
 				}
@@ -389,6 +389,12 @@ sub index_xapian { # msg_iter callback
 	undef $s; # free memory
 	for my $txt (@sections) {
 		if ($txt =~ /\A>/) {
+			if ($txt =~ /^[>\t ]+GIT binary patch\r?/sm) {
+				# get rid of Base-85 noise
+				$txt =~ s/^([>\h]+(?:literal|delta)
+						\x20[0-9]+\r?\n)
+					(?:[>\h]+$BASE85\h*\r?\n)+/$1/gsmx;
+			}
 			index_text($self, $txt, 0, 'XQUOT');
 		} else {
 			# does it look like a diff?
diff --git a/t/search.t b/t/search.t
index dded6c40..cf639a6d 100644
--- a/t/search.t
+++ b/t/search.t
@@ -534,7 +534,15 @@ $ibx->with_umask(sub {
 		'20200418222508.GA13918@dcvr',
 		'Subject search reaches inside message/rfc822');
 
-	$doc_id = $rw->add_message(eml_load('t/data/binary.patch'));
+	my $eml = eml_load('t/data/binary.patch');
+	my $body = $eml->body;
+	$rw->add_message($eml);
+
+	$body =~ s/^/> /gsm;
+	$eml = PublicInbox::Eml->new($eml->header_obj->as_string."\n".$body);
+	$eml->header_set('Message-ID', '<binary-patch-reply@example>');
+	$rw->add_message($eml);
+
 	$rw->commit_txn_lazy;
 	$ibx->search->reopen;
 	my $res = $query->('HcmV');
@@ -542,8 +550,9 @@ $ibx->with_umask(sub {
 	$res = $query->('IcmZPo000310RR91');
 	is_deeply($res, [], 'no results against 1-byte binary patch');
 	$res = $query->('"GIT binary patch"');
-	is(scalar(@$res), 1, 'got binary result from "GIT binary patch"');
+	is(scalar(@$res), 2, 'got binary results from "GIT binary patch"');
 	is($res->[0]->{mid}, 'binary-patch-test@example', 'msgid for binary');
+	is($res->[1]->{mid}, 'binary-patch-reply@example', 'msgid for reply');
 	my $s = $query->('"literal 1"');
 	is_deeply($s, $res, 'got binary result from exact literal size');
 	$s = $query->('"literal 2"');
