unofficial mirror of meta@public-inbox.org
 help / color / mirror / Atom feed
From: "Eric Wong (Contractor, The Linux Foundation)" <e@80x24.org>
To: meta@public-inbox.org
Subject: [PATCH 21/34] mid: truncate excessively long MIDs early
Date: Tue,  6 Mar 2018 08:42:29 +0000	[thread overview]
Message-ID: <20180306084242.19988-22-e@80x24.org> (raw)
In-Reply-To: <20180306084242.19988-1-e@80x24.org>

Since we support duplicate MIDs in v2, we can safely truncate
long MID terms in the database and let other normal duplicate
resolution sort it out.  It seems only spammers use excessively
long MIDs, and there'll always be abuse/misuse vectors for causing
mis-threaded messages, so it's not worth worrying about
excessively long MIDs.
---
 lib/PublicInbox/MID.pm               | 11 ++++++++++-
 lib/PublicInbox/SearchIdx.pm         | 15 ++++++---------
 lib/PublicInbox/SearchIdxSkeleton.pm |  3 ---
 3 files changed, 16 insertions(+), 13 deletions(-)

diff --git a/lib/PublicInbox/MID.pm b/lib/PublicInbox/MID.pm
index 9608539..422902f 100644
--- a/lib/PublicInbox/MID.pm
+++ b/lib/PublicInbox/MID.pm
@@ -10,7 +10,10 @@ our @EXPORT_OK = qw/mid_clean id_compress mid2path mid_mime mid_escape MID_ESC
 	mids references/;
 use URI::Escape qw(uri_escape_utf8);
 use Digest::SHA qw/sha1_hex/;
-use constant MID_MAX => 40; # SHA-1 hex length
+use constant {
+	MID_MAX => 40, # SHA-1 hex length # TODO: get rid of this
+	MAX_MID_SIZE => 244, # max term size (Xapian limitation) - length('Q')
+};
 
 sub mid_clean {
 	my ($mid) = @_;
@@ -61,6 +64,12 @@ sub mids ($) {
 			push(@mids, $v);
 		}
 	}
+	foreach my $i (0..$#mids) {
+		next if length($mids[$i]) <= MAX_MID_SIZE;
+		warn "Message-ID: <$mids[$i]> too long, truncating\n";
+		$mids[$i] = substr($mids[$i], 0, MAX_MID_SIZE);
+	}
+
 	uniq_mids(\@mids);
 }
 
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 3ef444d..a70e1eb 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -19,7 +19,6 @@ use POSIX qw(strftime);
 require PublicInbox::Git;
 
 use constant {
-	MAX_MID_SIZE => 244, # max term size (Xapian limitation) - length('Q')
 	PERM_UMASK => 0,
 	OLD_PERM_GROUP => 1,
 	OLD_PERM_EVERYBODY => 2,
@@ -311,12 +310,6 @@ sub add_message {
 	eval {
 		my $smsg = PublicInbox::SearchMsg->new($mime);
 		my $doc = $smsg->{doc};
-		foreach my $mid (@$mids) {
-			# FIXME: may be abused to prevent archival
-			length($mid) > MAX_MID_SIZE and
-				die 'Message-ID too long';
-			$doc->add_term('Q' . $mid);
-		}
 		my $subj = $smsg->subject;
 		my $xpath;
 		if ($subj ne '') {
@@ -392,9 +385,11 @@ sub add_message {
 				}
 			}
 		}
+
 		if ($skel) {
 			push @values, $mids, $xpath, $data;
 			$skel->index_skeleton(\@values);
+			$doc->add_boolean_term('Q' . $_) foreach @$mids;
 			$doc_id = $self->{xdb}->add_document($doc);
 		} else {
 			$doc_id = link_and_save($self, $doc, $mids, $refs,
@@ -469,9 +464,9 @@ sub parse_references ($) {
 	my %mids = map { $_ => 1 } @{mids($hdr)};
 	my @keep;
 	foreach my $ref (@$refs) {
-		# FIXME: this is an archive-prevention vector like X-No-Archive
-		if (length($ref) > MAX_MID_SIZE) {
+		if (length($ref) > PublicInbox::MID::MAX_MID_SIZE) {
 			warn "References: <$ref> too long, ignoring\n";
+			next;
 		}
 		next if $mids{$ref};
 		push @keep, $ref;
@@ -510,6 +505,8 @@ sub link_and_save {
 	my $doc_id;
 	$doc->add_boolean_term('XNUM' . $num) if defined $num;
 	$doc->add_boolean_term('XPATH' . $xpath) if defined $xpath;
+	$doc->add_boolean_term('Q' . $_) foreach @$mids;
+
 	my $vivified = 0;
 	foreach my $mid (@$mids) {
 		$self->each_smsg_by_mid($mid, sub {
diff --git a/lib/PublicInbox/SearchIdxSkeleton.pm b/lib/PublicInbox/SearchIdxSkeleton.pm
index 4066b59..40b28c5 100644
--- a/lib/PublicInbox/SearchIdxSkeleton.pm
+++ b/lib/PublicInbox/SearchIdxSkeleton.pm
@@ -98,9 +98,6 @@ sub index_skeleton_real ($$) {
 	my $ts = $values->[PublicInbox::Search::TS];
 	my $smsg = PublicInbox::SearchMsg->new(undef);
 	my $doc = $smsg->{doc};
-	foreach my $mid (@$mids) {
-		$doc->add_term('Q' . $mid);
-	}
 	PublicInbox::SearchIdx::add_values($doc, $values);
 	$doc->set_data($doc_data);
 	$smsg->{ts} = $ts;
-- 
EW


  parent reply	other threads:[~2018-03-06  8:42 UTC|newest]

Thread overview: 36+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-03-06  8:42 [v2 PATCH 00/34] duplicate handling, smaller Xapian DBs, date fixes Eric Wong (Contractor, The Linux Foundation)
2018-03-06  8:42 ` [PATCH 01/34] v2writable: delete ::Import obj when ->done Eric Wong (Contractor, The Linux Foundation)
2018-03-06  8:42 ` [PATCH 02/34] search: remove informational "warning" message Eric Wong (Contractor, The Linux Foundation)
2018-03-06  8:42 ` [PATCH 03/34] searchidx: add PID to error message when die-ing Eric Wong (Contractor, The Linux Foundation)
2018-03-06  8:42 ` [PATCH 04/34] content_id: special treatment for Message-Id headers Eric Wong (Contractor, The Linux Foundation)
2018-03-06  8:42 ` [PATCH 05/34] evcleanup: disable outside of daemon Eric Wong (Contractor, The Linux Foundation)
2018-03-06  8:42 ` [PATCH 06/34] v2writable: deduplicate detection on add Eric Wong (Contractor, The Linux Foundation)
2018-03-06  8:42 ` [PATCH 07/34] evcleanup: do not create event loop if nothing was registered Eric Wong (Contractor, The Linux Foundation)
2018-03-06  8:42 ` [PATCH 08/34] mid: add `mids' and `references' methods for extraction Eric Wong (Contractor, The Linux Foundation)
2018-03-06  8:42 ` [PATCH 09/34] content_id: use `mids' and `references' for MID extraction Eric Wong (Contractor, The Linux Foundation)
2018-03-06  8:42 ` [PATCH 10/34] searchidx: use new `references' method for parsing References Eric Wong (Contractor, The Linux Foundation)
2018-03-06  8:42 ` [PATCH 11/34] content_id: no need to be human-friendly Eric Wong (Contractor, The Linux Foundation)
2018-03-06  8:42 ` [PATCH 12/34] v2writable: inject new Message-IDs on true duplicates Eric Wong (Contractor, The Linux Foundation)
2018-03-06  8:42 ` [PATCH 13/34] search: revert to using 'Q' as a uniQue id per-Xapian conventions Eric Wong (Contractor, The Linux Foundation)
2018-03-06  8:42 ` [PATCH 14/34] searchidx: support indexing multiple MIDs Eric Wong (Contractor, The Linux Foundation)
2018-03-06  8:42 ` [PATCH 15/34] mid: be strict with References, but loose on Message-Id Eric Wong (Contractor, The Linux Foundation)
2018-03-06  8:42 ` [PATCH 16/34] searchidx: avoid excessive XNQ indexing with diffs Eric Wong (Contractor, The Linux Foundation)
2018-03-06  8:42 ` [PATCH 17/34] searchidxskeleton: add a note about locking Eric Wong (Contractor, The Linux Foundation)
2018-03-06  8:42 ` [PATCH 18/34] v2writable: generated Message-ID goes first Eric Wong (Contractor, The Linux Foundation)
2018-03-06  8:42 ` [PATCH 19/34] searchidx: use add_boolean_term for internal terms Eric Wong (Contractor, The Linux Foundation)
2018-03-06  8:42 ` [PATCH 20/34] searchidx: add NNTP article number as a searchable term Eric Wong (Contractor, The Linux Foundation)
2018-03-06  8:42 ` Eric Wong (Contractor, The Linux Foundation) [this message]
2018-03-06  8:42 ` [PATCH 22/34] nntp: use NNTP article numbers for lookups Eric Wong (Contractor, The Linux Foundation)
2018-03-06  8:42 ` [PATCH 23/34] nntp: fix NEWNEWS command Eric Wong (Contractor, The Linux Foundation)
2018-03-06  8:42 ` [PATCH 24/34] searchidx: store the primary MID in doc data for NNTP Eric Wong (Contractor, The Linux Foundation)
2018-03-06  8:42 ` [PATCH 25/34] import: consolidate object info for v2 imports Eric Wong (Contractor, The Linux Foundation)
2018-03-06  8:42 ` [PATCH 26/34] v2: avoid redundant/repeated configs for git partition repos Eric Wong (Contractor, The Linux Foundation)
2018-03-06  8:42 ` [PATCH 27/34] INSTALL: document more optional dependencies Eric Wong (Contractor, The Linux Foundation)
2018-03-06  8:42 ` [PATCH 28/34] search: favor skeleton DB for lookup_mail Eric Wong (Contractor, The Linux Foundation)
2018-03-06  8:42 ` [PATCH 29/34] search: each_smsg_by_mid uses skeleton if available Eric Wong (Contractor, The Linux Foundation)
2018-03-06  8:42 ` [PATCH 30/34] v2writable: remove unnecessary skeleton commit Eric Wong (Contractor, The Linux Foundation)
2018-03-06  8:42 ` [PATCH 31/34] favor Received: date over Date: header globally Eric Wong (Contractor, The Linux Foundation)
2018-03-06  8:42 ` [PATCH 32/34] import: fall back to Sender for extracting name and email Eric Wong (Contractor, The Linux Foundation)
2018-03-06  8:42 ` [PATCH 33/34] scripts/import_vger_from_mbox: perform mboxrd or mboxo escaping Eric Wong (Contractor, The Linux Foundation)
2018-03-06  8:42 ` [PATCH 34/34] v2writable: detect and use previous partition count Eric Wong (Contractor, The Linux Foundation)
2018-03-06  8:53 ` [v2 PATCH 00/34] duplicate handling, smaller Xapian DBs, date fixes Eric Wong

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://public-inbox.org/README

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180306084242.19988-22-e@80x24.org \
    --to=e@80x24.org \
    --cc=meta@public-inbox.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).