unofficial mirror of meta@public-inbox.org
 help / color / mirror / Atom feed
From: Eric Wong <e@80x24.org>
To: meta@public-inbox.org
Subject: [PATCH] lei: make --dedupe=content always account for Message-IDs
Date: Thu, 15 Jun 2023 09:50:53 +0000	[thread overview]
Message-ID: <20230615095053.481470-1-e@80x24.org> (raw)

The content dedupe logic was originally designed for v2 public
inboxes as a fallback for when the importer sees identical
Message-IDs.  Thus it did not account for Message-ID(s) in
the message itself.

This change doesn't affect saved searches (the default when
writing to a pathname or IMAP).  It affects --no-save, and
outputs to stdout (even if stdout is redirected to a file).

Prior to this change, lei reused the v2 logic as-is without
accounting for Message-IDs anywhere with `--dedupe=content'
(the default).  This could cause messages to be skipped when
the content matches despite Message-IDs being different.

So with this change, `lei q --dedupe=content' will hash the
Message-ID(s) in the message to ensure messages with different
Message-IDs are NOT deduplicated.

Whether or not this change is a bug fix or introduces regression
is actually debatable.  In my mind, it is better to err on the
side of showing too many messages rather than too few, even if
the actual contents of the message are identical.  Making saved
searches deduplicate without accounting for Message-IDs would be
more difficult, too.
---
 lib/PublicInbox/ContentHash.pm | 15 +++++++++++----
 lib/PublicInbox/LeiDedupe.pm   |  9 +++++++--
 t/lei_dedupe.t                 |  6 +++++-
 3 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/lib/PublicInbox/ContentHash.pm b/lib/PublicInbox/ContentHash.pm
index fc94257c..95ca2929 100644
--- a/lib/PublicInbox/ContentHash.pm
+++ b/lib/PublicInbox/ContentHash.pm
@@ -54,16 +54,23 @@ sub content_dig_i {
 	$dig->add($s);
 }
 
-sub content_digest ($;$) {
-	my ($eml, $dig) = @_;
+sub content_digest ($;$$) {
+	my ($eml, $dig, $hash_mids) = @_;
 	$dig //= Digest::SHA->new(256);
 
 	# References: and In-Reply-To: get used interchangeably
 	# in some "duplicates" in LKML.  We treat them the same
 	# in SearchIdx, so treat them the same for this:
 	# do NOT consider the Message-ID as part of the content_hash
-	# if we got here, we've already got Message-ID reuse
-	my %seen = map { $_ => 1 } @{mids($eml)};
+	# if we got here, we've already got Message-ID reuse for v2.
+	#
+	# However, `lei q --dedupe=content' does use $hash_mids since
+	# it doesn't have any other dedupe
+	my $mids = mids($eml);
+	if ($hash_mids) {
+		$dig->add("mid\0$_\0") for @$mids;
+	}
+	my %seen = map { $_ => 1 } @$mids;
 	for (grep { !$seen{$_}++ } @{references($eml)}) {
 		utf8::encode($_);
 		$dig->add("ref\0$_\0");
diff --git a/lib/PublicInbox/LeiDedupe.pm b/lib/PublicInbox/LeiDedupe.pm
index 86cd8490..eda54d79 100644
--- a/lib/PublicInbox/LeiDedupe.pm
+++ b/lib/PublicInbox/LeiDedupe.pm
@@ -2,7 +2,7 @@
 # License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
 package PublicInbox::LeiDedupe;
 use v5.12;
-use PublicInbox::ContentHash qw(content_hash git_sha);
+use PublicInbox::ContentHash qw(content_hash content_digest git_sha);
 use PublicInbox::SHA qw(sha256);
 
 # n.b. mutt sets most of these headers not sure about Bytes
@@ -69,7 +69,12 @@ sub dedupe_content ($) {
 	my ($skv) = @_;
 	(sub { # may be called in a child process
 		my ($eml) = @_; # $oidhex = $_[1], ignored
-		$skv->set_maybe(content_hash($eml), '');
+
+		# we must account for Message-ID via hash_mids, since
+		# (unlike v2 dedupe) Message-ID is not accounted for elsewhere:
+		$skv->set_maybe(content_digest($eml, PublicInbox::SHA->new(256),
+				1 # hash_mids
+				)->digest, '');
 	}, sub {
 		my ($smsg) = @_;
 		$skv->set_maybe(smsg_hash($smsg), '');
diff --git a/t/lei_dedupe.t b/t/lei_dedupe.t
index e1944d02..13fc1f3b 100644
--- a/t/lei_dedupe.t
+++ b/t/lei_dedupe.t
@@ -1,5 +1,5 @@
 #!perl -w
-# Copyright (C) 2020-2021 all contributors <meta@public-inbox.org>
+# Copyright (C) all contributors <meta@public-inbox.org>
 # License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
 use strict;
 use v5.10.1;
@@ -10,6 +10,8 @@ use PublicInbox::Smsg;
 require_mods(qw(DBD::SQLite));
 use_ok 'PublicInbox::LeiDedupe';
 my $eml = eml_load('t/plack-qp.eml');
+my $sameish = eml_load('t/plack-qp.eml');
+$sameish->header_set('Message-ID', '<cuepee@example.com>');
 my $mid = $eml->header_raw('Message-ID');
 my $different = eml_load('t/msg_iter-order.eml');
 $different->header_set('Message-ID', $mid);
@@ -47,6 +49,8 @@ for my $strat (undef, 'content') {
 	ok(!$dd->is_dup($different), "different is_dup with $desc dedupe");
 	ok(!$dd->is_smsg_dup($smsg), "is_smsg_dup pass w/ $desc dedupe");
 	ok($dd->is_smsg_dup($smsg), "is_smsg_dup reject w/ $desc dedupe");
+	ok(!$dd->is_dup($sameish),
+		"Message-ID accounted for w/ same content otherwise");
 }
 $lei->{opt}->{dedupe} = 'bogus';
 eval { PublicInbox::LeiDedupe->new($lei) };

                 reply	other threads:[~2023-06-15  9:50 UTC|newest]

Thread overview: [no followups] expand[flat|nested]  mbox.gz  Atom feed

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://public-inbox.org/README

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230615095053.481470-1-e@80x24.org \
    --to=e@80x24.org \
    --cc=meta@public-inbox.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).