unofficial mirror of meta@public-inbox.org
 help / color / mirror / Atom feed
* [PATCH 0/4] mail diff updates
@ 2023-04-25 10:50 Eric Wong
  2023-04-25 10:50 ` [PATCH 1/4] mid+contenthash: eliminate needless local variable captures Eric Wong
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: Eric Wong @ 2023-04-25 10:50 UTC (permalink / raw)
  To: meta

Some things which I noticed while reading some cross-posted LKML
messages.  These affect the /$INBOX/$MSGID/d/ WWW endpoint as
well as `lei mail-diff'

I'm considering making tweaks to ContentHash to ignore the
name+comment parts of To/Cc headers and only rely on the
lowercased email address itself, too.  That would affect
dedupe across the board for v2 and extindex...

Eric Wong (4):
  mid+contenthash: eliminate needless local variable captures
  mail_diff: match ContentHash EOL and EOM behavior more closely
  mail_diff: show headers differences in WWW /$MSGID/d/ view
  content_digest_dbg: improve display of To:/Cc: diffs

 lib/PublicInbox/ContentDigestDbg.pm |  7 +++++--
 lib/PublicInbox/ContentHash.pm      |  8 +++-----
 lib/PublicInbox/MID.pm              |  6 ++----
 lib/PublicInbox/MailDiff.pm         | 11 ++++-------
 4 files changed, 14 insertions(+), 18 deletions(-)

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH 1/4] mid+contenthash: eliminate needless local variable captures
  2023-04-25 10:50 [PATCH 0/4] mail diff updates Eric Wong
@ 2023-04-25 10:50 ` Eric Wong
  2023-04-25 10:50 ` [PATCH 2/4] mail_diff: match ContentHash EOL and EOM behavior more closely Eric Wong
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 5+ messages in thread
From: Eric Wong @ 2023-04-25 10:50 UTC (permalink / raw)
  To: meta

It's possible in theory that Perl could be smarter and free
memory a tad sooner this way.  Regardless, fewer lines of code
is easier-to-navigate/read and can save optree size and reduce
parsing times.
---
 lib/PublicInbox/ContentHash.pm | 6 ++----
 lib/PublicInbox/MID.pm         | 6 ++----
 2 files changed, 4 insertions(+), 8 deletions(-)

diff --git a/lib/PublicInbox/ContentHash.pm b/lib/PublicInbox/ContentHash.pm
index d3ff146a..a4f6196f 100644
--- a/lib/PublicInbox/ContentHash.pm
+++ b/lib/PublicInbox/ContentHash.pm
@@ -76,8 +76,7 @@ sub content_digest ($;$) {
 		last;
 	}
 	foreach my $h (qw(Subject Date)) {
-		my @v = $eml->header($h);
-		foreach my $v (@v) {
+		for my $v ($eml->header($h)) {
 			utf8::encode($v);
 			$dig->add("$h\0$v\0");
 		}
@@ -86,8 +85,7 @@ sub content_digest ($;$) {
 	# not in the original message.  For the purposes of deduplication,
 	# do not take it into account:
 	foreach my $h (qw(To Cc)) {
-		my @v = $eml->header($h);
-		digest_addr($dig, $h, $_) foreach @v;
+		digest_addr($dig, $h, $_) for ($eml->header($h));
 	}
 	msg_iter($eml, \&content_dig_i, $dig);
 	$dig;
diff --git a/lib/PublicInbox/MID.pm b/lib/PublicInbox/MID.pm
index 4819cc25..b1ae9939 100644
--- a/lib/PublicInbox/MID.pm
+++ b/lib/PublicInbox/MID.pm
@@ -92,8 +92,7 @@ sub references ($) {
 	my ($hdr) = @_;
 	my @mids;
 	foreach my $f (qw(References In-Reply-To)) {
-		my @v = $hdr->header_raw($f);
-		foreach my $v (@v) {
+		for my $v ($hdr->header_raw($f)) {
 			push(@mids, ($v =~ /$MID_EXTRACT/g));
 		}
 	}
@@ -104,8 +103,7 @@ sub references ($) {
 	my %addr = ( y => 1, n => 1 );
 
 	foreach my $f (qw(To From Cc)) {
-		my @v = $hdr->header_raw($f);
-		foreach my $v (@v) {
+		for my $v ($hdr->header_raw($f)) {
 			$addr{$_} = 1 for (PublicInbox::Address::emails($v));
 		}
 	}

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [PATCH 2/4] mail_diff: match ContentHash EOL and EOM behavior more closely
  2023-04-25 10:50 [PATCH 0/4] mail diff updates Eric Wong
  2023-04-25 10:50 ` [PATCH 1/4] mid+contenthash: eliminate needless local variable captures Eric Wong
@ 2023-04-25 10:50 ` Eric Wong
  2023-04-25 10:50 ` [PATCH 3/4] mail_diff: show headers differences in WWW /$MSGID/d/ view Eric Wong
  2023-04-25 10:50 ` [PATCH 4/4] content_digest_dbg: improve display of To:/Cc: diffs Eric Wong
  3 siblings, 0 replies; 5+ messages in thread
From: Eric Wong @ 2023-04-25 10:50 UTC (permalink / raw)
  To: meta

ContentHash currently doesn't convert CRCRLF to LF.  Perhaps it
should, but for now, have diff behavior match the actual
comparison behavior used for dedupe and omit all trailing
whitespace for diff.
---
 lib/PublicInbox/ContentHash.pm | 2 +-
 lib/PublicInbox/MailDiff.pm    | 5 +++--
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/lib/PublicInbox/ContentHash.pm b/lib/PublicInbox/ContentHash.pm
index a4f6196f..fc94257c 100644
--- a/lib/PublicInbox/ContentHash.pm
+++ b/lib/PublicInbox/ContentHash.pm
@@ -45,7 +45,7 @@ sub content_dig_i {
 	my $ct = $part->content_type || 'text/plain';
 	my ($s, undef) = msg_part_text($part, $ct);
 	if (defined $s) {
-		$s =~ s/\r\n/\n/gs;
+		$s =~ s/\r\n/\n/gs; # TODO: consider \r+\n to match View
 		$s =~ s/\s*\z//s;
 		utf8::encode($s);
 	} else {
diff --git a/lib/PublicInbox/MailDiff.pm b/lib/PublicInbox/MailDiff.pm
index 7511144c..d9733ed4 100644
--- a/lib/PublicInbox/MailDiff.pm
+++ b/lib/PublicInbox/MailDiff.pm
@@ -11,7 +11,7 @@ use PublicInbox::GitAsyncCat;
 sub write_part { # Eml->each_part callback
 	my ($ary, $self) = @_;
 	my ($part, $depth, $idx) = @$ary;
-	if ($idx ne '1' || $self->{-raw_hdr}) {
+	if ($idx ne '1' || $self->{-raw_hdr}) { # lei mail-diff --raw-header
 		open my $fh, '>', "$self->{curdir}/$idx.hdr" or die "open: $!";
 		print $fh ${$part->{hdr}} or die "print $!";
 		close $fh or die "close $!";
@@ -20,7 +20,8 @@ sub write_part { # Eml->each_part callback
 	my ($s, $err) = msg_part_text($part, $ct);
 	my $sfx = defined($s) ? 'txt' : 'bin';
 	$s //= $part->body;
-	$s =~ s/\r+\n/\n/sg;
+	$s =~ s/\r\n/\n/gs; # TODO: consider \r+\n to match View
+	$s =~ s/\s*\z//s;
 	open my $fh, '>:utf8', "$self->{curdir}/$idx.$sfx" or die "open: $!";
 	print $fh $s or die "print $!";
 	close $fh or die "close $!";

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [PATCH 3/4] mail_diff: show headers differences in WWW /$MSGID/d/ view
  2023-04-25 10:50 [PATCH 0/4] mail diff updates Eric Wong
  2023-04-25 10:50 ` [PATCH 1/4] mid+contenthash: eliminate needless local variable captures Eric Wong
  2023-04-25 10:50 ` [PATCH 2/4] mail_diff: match ContentHash EOL and EOM behavior more closely Eric Wong
@ 2023-04-25 10:50 ` Eric Wong
  2023-04-25 10:50 ` [PATCH 4/4] content_digest_dbg: improve display of To:/Cc: diffs Eric Wong
  3 siblings, 0 replies; 5+ messages in thread
From: Eric Wong @ 2023-04-25 10:50 UTC (permalink / raw)
  To: meta

Some messages only differ in the To/Cc headers because some
MTAs seem to normalize them.  I was getting confused when I
saw some /d/ endpoints with no visible differences
---
 lib/PublicInbox/ContentDigestDbg.pm | 1 -
 lib/PublicInbox/MailDiff.pm         | 6 +-----
 2 files changed, 1 insertion(+), 6 deletions(-)

diff --git a/lib/PublicInbox/ContentDigestDbg.pm b/lib/PublicInbox/ContentDigestDbg.pm
index 5de0ee8a..1e60364f 100644
--- a/lib/PublicInbox/ContentDigestDbg.pm
+++ b/lib/PublicInbox/ContentDigestDbg.pm
@@ -1,6 +1,5 @@
 # Copyright (C) all contributors <meta@public-inbox.org>
 # License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
-# only loaded in lei
 package PublicInbox::ContentDigestDbg; # cf. PublicInbox::ContentDigest
 use v5.12;
 use Data::Dumper;
diff --git a/lib/PublicInbox/MailDiff.pm b/lib/PublicInbox/MailDiff.pm
index d9733ed4..994c7851 100644
--- a/lib/PublicInbox/MailDiff.pm
+++ b/lib/PublicInbox/MailDiff.pm
@@ -7,6 +7,7 @@ use PublicInbox::ContentHash qw(content_digest);
 use PublicInbox::MsgIter qw(msg_part_text);
 use PublicInbox::ViewDiff qw(flush_diff);
 use PublicInbox::GitAsyncCat;
+use PublicInbox::ContentDigestDbg;
 
 sub write_part { # Eml->each_part callback
 	my ($ary, $self) = @_;
@@ -33,11 +34,6 @@ sub dump_eml ($$$) {
 	local $self->{curdir} = $dir;
 	mkdir $dir or die "mkdir($dir): $!";
 	$eml->each_part(\&write_part, $self);
-
-	return if $self->{ctx}; # don't need content_digest noise in WWW UI
-	require PublicInbox::ContentDigestDbg;
-
-	# XXX is this even useful?  perhaps hide it behind a CLI switch
 	open my $fh, '>', "$dir/content_digest" or die "open: $!";
 	my $dig = PublicInbox::ContentDigestDbg->new($fh);
 	content_digest($eml, $dig);

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [PATCH 4/4] content_digest_dbg: improve display of To:/Cc: diffs
  2023-04-25 10:50 [PATCH 0/4] mail diff updates Eric Wong
                   ` (2 preceding siblings ...)
  2023-04-25 10:50 ` [PATCH 3/4] mail_diff: show headers differences in WWW /$MSGID/d/ view Eric Wong
@ 2023-04-25 10:50 ` Eric Wong
  3 siblings, 0 replies; 5+ messages in thread
From: Eric Wong @ 2023-04-25 10:50 UTC (permalink / raw)
  To: meta

To: and Cc: headers can be long and differences in long lines
are easier to view when broken apart.  Just split by /,/ since
Data::Dumper will delimit with "," anyways.
---
 lib/PublicInbox/ContentDigestDbg.pm | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/lib/PublicInbox/ContentDigestDbg.pm b/lib/PublicInbox/ContentDigestDbg.pm
index 1e60364f..31d0f707 100644
--- a/lib/PublicInbox/ContentDigestDbg.pm
+++ b/lib/PublicInbox/ContentDigestDbg.pm
@@ -10,7 +10,11 @@ sub new { bless [ PublicInbox::SHA->new(256), $_[1] ], __PACKAGE__ }
 
 sub add {
 	$_[0]->[0]->add($_[1]);
-	print { $_[0]->[1] } Dumper([split(/^/sm, $_[1])]) or die "print $!";
+	my @dbg = split(/^/sm, $_[1]);
+	if ($dbg[0] =~ /\A(To|Cc)\0/) { # fold excessively long lines
+		@dbg = map { split(/,/s, $_) } @dbg;
+	}
+	print { $_[0]->[1] } Dumper(\@dbg) or die "print $!";
 }
 
 sub hexdigest { $_[0]->[0]->hexdigest }

^ permalink raw reply related	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2023-04-25 10:50 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-04-25 10:50 [PATCH 0/4] mail diff updates Eric Wong
2023-04-25 10:50 ` [PATCH 1/4] mid+contenthash: eliminate needless local variable captures Eric Wong
2023-04-25 10:50 ` [PATCH 2/4] mail_diff: match ContentHash EOL and EOM behavior more closely Eric Wong
2023-04-25 10:50 ` [PATCH 3/4] mail_diff: show headers differences in WWW /$MSGID/d/ view Eric Wong
2023-04-25 10:50 ` [PATCH 4/4] content_digest_dbg: improve display of To:/Cc: diffs Eric Wong

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).