unofficial mirror of meta@public-inbox.org
* [PATCH/RFC 0/2] recurse into message/rfc822 parts
@ 2020-05-16 10:03 Eric Wong
  2020-05-16 10:03 ` [PATCH 1/2] t/psgi_attach: assert message/* parts are downloadable Eric Wong
  2020-05-16 10:03 ` [PATCH 2/2] descend into message/(rfc822|news|global) parts Eric Wong
  0 siblings, 2 replies; 3+ messages in thread
From: Eric Wong @ 2020-05-16 10:03 UTC
  To: meta

Multipart parts aren't the only things which nest;
message/rfc822 attachments can contain any sort of full
message.
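
To make the nesting concrete, here is a minimal sketch (not the code
in this series).  It assumes a part whose Content-Type is message/*
carries a complete message as its body, and it ignores multipart
boundaries, folded headers and Content-Transfer-Encoding entirely.
The helper name subjects() is made up for illustration.

```perl
use strict;
use warnings;

# message/* subtypes worth descending into (same list as this series)
my %descend = map { $_ => 1 } qw(rfc822 news global rfc2822);

# Hypothetical helper: return the Subject of $raw and of every message
# nested directly inside it via a message/* Content-Type.
sub subjects {
	my ($raw, $depth) = @_;
	$depth //= 0;
	return if $depth > 20; # arbitrary recursion guard
	my ($head, $body) = split(/\r?\n\r?\n/, $raw, 2);
	return unless defined $head;
	my @found;
	push @found, $1 if $head =~ /^Subject:[ \t]*([^\r\n]*)/mi;
	if (defined($body) &&
			$head =~ m!^Content-Type:[ \t]*message/([\w-]+)!mi &&
			$descend{lc $1}) {
		# the body of a message/* part is itself a complete message
		push @found, subjects($body, $depth + 1);
	}
	@found;
}

my $raw = <<'EOF';
Subject: outer
Content-Type: message/rfc822

Subject: inner
Content-Type: text/plain

hello
EOF
print join(', ', subjects($raw)), "\n"; # prints "outer, inner"
```

A real parser also has to split multipart bodies on their boundary
before looking at each part's Content-Type; that is left out here.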

I noticed gmime supports this while evaluating(*) it to replace
Email::MIME, and it seems needed for IMAP support.

Email::MIME seemed to attempt to support descending into
message/*, but didn't do it properly, so it never got
triggered.

There are definitely some message/rfc822 attachments in various
archives out there, and it looks like message/global is becoming
a thing, and some message/news for legacy stuff...

gmime supports message/rfc2822, too, which doesn't seem
specified anywhere...

Search indexing multiple From/To/Cc/Subject/Message-ID/List-Id
headers is straightforward, Date is not...
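
As a rough illustration of the Message-ID side, the sketch below
mirrors the splitting done by index_ids() in patch 2/2.  mid_terms()
is a made-up helper which only returns the strings that would be
handed to index_text(); it does not touch Xapian at all.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the Message-ID handling in index_ids():
# returns the strings which would be indexed for a single $mid.
sub mid_terms {
	my ($mid) = @_;
	my @terms = ($mid);
	# only split Message-IDs containing a long run of word characters
	if ($mid =~ /\w{12,}/) {
		my @long = ($mid =~ /(\w{3,}+)/g);
		push @terms, join(' ', @long);
	}
	@terms;
}

# prints the full Message-ID, then "20200418222508 GA13918 dcvr"
print "$_\n" for mid_terms('20200418222508.GA13918@dcvr');
```

Since each embedded message contributes its own Message-ID, the same
helper would simply be called once per ID found in the outer message
and its attachments.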

Also, note the t/data/message_embed.eml example includes
a circular reference :)   I have no intention of doing threading
for attached messages (AFAIK no mail client does this), but
maybe making the contents of References / In-Reply-To searchable is
a helpful thing in general.

Eric Wong (2):
  t/psgi_attach: assert message/* parts are downloadable
  descend into message/(rfc822|news|global) parts

 MANIFEST                     |   1 +
 lib/PublicInbox/Eml.pm       |  37 ++++++--
 lib/PublicInbox/MsgIter.pm   |   6 +-
 lib/PublicInbox/SearchIdx.pm |  47 ++++++----
 lib/PublicInbox/View.pm      |  30 ++++++-
 t/data/message_embed.eml     | 163 +++++++++++++++++++++++++++++++++++
 t/eml.t                      |  28 ++++++
 t/psgi_attach.t              |  27 ++++++
 t/search.t                   |  25 ++++++
 9 files changed, 336 insertions(+), 28 deletions(-)
 create mode 100644 t/data/message_embed.eml



* [PATCH 1/2] t/psgi_attach: assert message/* parts are downloadable
  2020-05-16 10:03 [PATCH/RFC 0/2] recurse into message/rfc822 parts Eric Wong
@ 2020-05-16 10:03 ` Eric Wong
  2020-05-16 10:03 ` [PATCH 2/2] descend into message/(rfc822|news|global) parts Eric Wong
  1 sibling, 0 replies; 3+ messages in thread
From: Eric Wong @ 2020-05-16 10:03 UTC
  To: meta

We'll be adding support to descend into message/rfc822 (and
legacy message/news) attachments.  First, we must ensure
existing message/rfc822 attachments can be downloaded and remain
downloadable in future commits.
---
 MANIFEST                 |   1 +
 t/data/message_embed.eml | 163 +++++++++++++++++++++++++++++++++++++++
 t/psgi_attach.t          |  18 +++++
 3 files changed, 182 insertions(+)
 create mode 100644 t/data/message_embed.eml

diff --git a/MANIFEST b/MANIFEST
index 7997bc9906c..24f95faa942 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -226,6 +226,7 @@ t/config_limiter.t
 t/content_hash.t
 t/convert-compact.t
 t/data/0001.patch
+t/data/message_embed.eml
 t/ds-kqxs.t
 t/ds-leak.t
 t/ds-poll.t
diff --git a/t/data/message_embed.eml b/t/data/message_embed.eml
new file mode 100644
index 00000000000..a7aa88acee3
--- /dev/null
+++ b/t/data/message_embed.eml
@@ -0,0 +1,163 @@
+Received: from localhost (dcvr.yhbt.net [127.0.0.1])
+	by dcvr.yhbt.net (Postfix) with ESMTP id 977481F45A;
+	Sat, 18 Apr 2020 22:25:08 +0000 (UTC)
+Date: Sat, 18 Apr 2020 22:25:08 +0000
+From: Eric Wong <e@yhbt.net>
+To: test@public-inbox.org
+Subject: Re: embedded message test
+Message-ID: <20200418222508.GA13918@dcvr>
+References: <20200418222020.GA2745@dcvr>
+MIME-Version: 1.0
+Content-Type: multipart/mixed; boundary="TB36FDmn/VVEgNH/"
+Content-Disposition: inline
+In-Reply-To: <20200418222020.GA2745@dcvr>
+
+
+--TB36FDmn/VVEgNH/
+Content-Type: text/plain; charset=utf-8
+Content-Disposition: inline
+
+testing embedded message harder
+
+--TB36FDmn/VVEgNH/
+Content-Type: message/rfc822
+Content-Disposition: attachment; filename="embed2x.eml"
+
+Date: Sat, 18 Apr 2020 22:20:20 +0000
+From: Eric Wong <e@yhbt.net>
+To: test@public-inbox.org
+Subject: embedded message test
+Message-ID: <20200418222020.GA2745@dcvr>
+MIME-Version: 1.0
+Content-Type: multipart/mixed; boundary="/04w6evG8XlLl3ft"
+Content-Disposition: inline
+
+--/04w6evG8XlLl3ft
+Content-Type: text/plain; charset=utf-8
+Content-Disposition: inline
+
+testing embedded message
+
+--/04w6evG8XlLl3ft
+Content-Type: message/rfc822
+Content-Disposition: attachment; filename="test.eml"
+
+From: Eric Wong <e@yhbt.net>
+To: spew@80x24.org
+Subject: [PATCH] mail header experiments
+Date: Sat, 18 Apr 2020 21:41:14 +0000
+Message-Id: <20200418214114.7575-1-e@yhbt.net>
+MIME-Version: 1.0
+Content-Transfer-Encoding: 8bit
+
+---
+ lib/PublicInbox/MailHeader.pm | 55 +++++++++++++++++++++++++++++++++++
+ t/mail_header.t               | 31 ++++++++++++++++++++
+ 2 files changed, 86 insertions(+)
+ create mode 100644 lib/PublicInbox/MailHeader.pm
+ create mode 100644 t/mail_header.t
+
+diff --git a/lib/PublicInbox/MailHeader.pm b/lib/PublicInbox/MailHeader.pm
+new file mode 100644
+index 00000000..166baf91
+--- /dev/null
++++ b/lib/PublicInbox/MailHeader.pm
+@@ -0,0 +1,55 @@
++# Copyright (C) 2020 all contributors <meta@public-inbox.org>
++# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
++package PublicInbox::MailHeader;
++use strict;
++use HTTP::Parser::XS qw(parse_http_response HEADERS_AS_ARRAYREF);
++use bytes (); #bytes::length
++my %casemap;
++
++sub _headerx_to_list {
++	my (undef, $head, $crlf) = @_;
++
++	# picohttpparser uses `int' as the return value, so the
++	# actual limit is 2GB on most platforms.  However, headers
++	# exceeding (or even close to) 1MB seems unreasonable
++	die 'headers too big' if bytes::length($$head) > 0x100000;
++	my ($ret, undef, undef, undef, $headers) =
++		parse_http_response('HTTP/1.0 1 X'. $crlf . $$head,
++					HEADERS_AS_ARRAYREF);
++	die 'failed to parse headers' if $ret <= 0;
++	# %casemap = map {; lc($_) => $_ } ($$head =~ m/^([^:]+):/gsm);
++	# my $nr = @$headers;
++	for (my $i = 0; $i < @$headers; $i += 2) {
++		my $key = $headers->[$i]; # = $casemap{$headers->[$i]};
++		my $val = $headers->[$i + 1];
++		(my $trimmed = $val) =~ s/\r?\n\s+/ /;
++		$headers->[$i + 1] = [
++			$trimmed,
++			"$key: $val"
++		]
++	}
++	$headers;
++}
++
++sub _header_to_list {
++	my (undef, $head, $crlf) = @_;
++	my @tmp = ($$head =~ m/^(([^ \t:][^:\n]*):[ \t]*
++			([^\n]*\n(?:[ \t]+[^\n]*\n)*))/gsmx);
++	my @headers;
++	$#headers = scalar @tmp;
++	@headers = ();
++	while (@tmp) {
++		my ($orig, $key, $val) = splice(@tmp, 0, 3);
++		# my $v = $tmp[$i + 2];
++		# $v =~ s/\r?\n[ \t]+/ /sg;
++		# $v =~ s/\r?\n\z//s;
++		$val =~ s/\n[ \t]+/ /sg;
++		chomp($val, $orig);
++		# $val =~ s/\r?\n\z//s;
++		# $orig =~ s/\r?\n\z//s;
++		push @headers, $key, [ $val, $orig ];
++	}
++	\@headers;
++}
++
++1;
+diff --git a/t/mail_header.t b/t/mail_header.t
+new file mode 100644
+index 00000000..4dc62c50
+--- /dev/null
++++ b/t/mail_header.t
+@@ -0,0 +1,31 @@
++# Copyright (C) 2020 all contributors <meta@public-inbox.org>
++# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
++use strict;
++use Test::More;
++use PublicInbox::TestCommon;
++require_mods('PublicInbox::MailHeader');
++
++my $head = <<'EOF';
++From d0147582e289fdd4cdd84e91d8b0f8ae9c230124 Mon Sep 17 00:00:00 2001
++From: Eric Wong <e@yhbt.net>
++Date: Fri, 17 Apr 2020 09:28:49 +0000
++Subject: [PATCH] searchthread: reduce indirection by removing container
++
++EOF
++my $orig = $head;
++use Email::Simple;
++my $xshdr = PublicInbox::MailHeader->_header_to_list(\$head, "\n");
++my $simpl = Email::Simple::Header->_header_to_list(\$head, "\n");
++is_deeply($xshdr, $simpl);
++use Benchmark qw(:all);
++my $res = timethese(100000, {
++	pmh => sub {
++		PublicInbox::MailHeader->_header_to_list(\$head, "\n");
++	},
++	esh =>  sub {
++		PublicInbox::MailHeader->_header_to_list(\$head, "\n");
++	}
++});
++is($head, $orig);
++use Data::Dumper; diag Dumper($res);
++done_testing;
+
+
+--/04w6evG8XlLl3ft--
+
+
+--TB36FDmn/VVEgNH/--
diff --git a/t/psgi_attach.t b/t/psgi_attach.t
index 9a2b241164a..12f9e6eeecd 100644
--- a/t/psgi_attach.t
+++ b/t/psgi_attach.t
@@ -15,6 +15,7 @@ use_ok 'PublicInbox::WWW';
 use PublicInbox::Import;
 use PublicInbox::Git;
 use PublicInbox::Config;
+use PublicInbox::Eml;
 use_ok 'PublicInbox::WwwAttach';
 my $config = PublicInbox::Config->new(\<<EOF);
 $cfgpfx.address=$addr
@@ -30,6 +31,7 @@ $im->init_bare;
 	my $txt = "plain\ntext\npass\nthrough\n";
 	my $dot = "dotfile\n";
 	$im->add(eml_load('t/psgi_attach.eml'));
+	$im->add(eml_load('t/data/message_embed.eml'));
 	$im->done;
 
 	my $www = PublicInbox::WWW->new($config);
@@ -67,6 +69,22 @@ $im->init_bare;
 		ok(length($dot_res) >= length($dot), 'dot almost matches');
 		$res = $cb->(GET('/test/Z%40B/4-any-filename.txt'));
 		is($res->content, $dot_res, 'user-specified filename is OK');
+
+		my $mid = '20200418222508.GA13918@dcvr';
+		my $irt = '20200418222020.GA2745@dcvr';
+		$res = $cb->(GET("/test/$mid/"));
+		like($res->content, qr/\bhref="2-embed2x\.eml"/s,
+			'href to message/rfc822 attachment visible');
+		$res = $cb->(GET("/test/$mid/2-embed2x.eml"));
+		my $eml = PublicInbox::Eml->new(\($res->content));
+		is_deeply([ $eml->header_raw('Message-ID') ], [ "<$irt>" ],
+			'got attached eml');
+		my @subs = $eml->subparts;
+		is(scalar(@subs), 2, 'attachment had 2 subparts');
+		like($subs[0]->body_str, qr/^testing embedded message\n*\z/sm,
+			'1st attachment is as expected');
+		is($subs[1]->header('Content-Type'), 'message/rfc822',
+			'2nd attachment is as expected');
 	});
 }
 done_testing();


* [PATCH 2/2] descend into message/(rfc822|news|global) parts
  2020-05-16 10:03 [PATCH/RFC 0/2] recurse into message/rfc822 parts Eric Wong
  2020-05-16 10:03 ` [PATCH 1/2] t/psgi_attach: assert message/* parts are downloadable Eric Wong
@ 2020-05-16 10:03 ` Eric Wong
  1 sibling, 0 replies; 3+ messages in thread
From: Eric Wong @ 2020-05-16 10:03 UTC
  To: meta

Email::MIME never supported this properly, but there are real
instances of forwarded messages as message/rfc822 attachments.
message/news is a legacy thing which we'll see in archives, and
message/global appears to be the new thing.

gmime also supports message/rfc2822, so we'll support it anyways
despite lacking other evidence of its existence.

Existing attachments remain downloadable as a whole message,
but individual attachments of subparts are now downloadable
and can be displayed in HTML, too.

Furthermore, ensure Xapian can now search for common headers
inside those messages as well as the message bodies.
---
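Note on part numbering: the dotted indices used in attachment URLs
(e.g. "2.1.2-test.eml") and the depths asserted in t/eml.t can be
reproduced with the toy walk below.  It is only a sketch over a
hand-written tree standing in for t/data/message_embed.eml; the tree
layout and number_parts() are assumptions for illustration and not
the PublicInbox::Eml API.

```perl
use strict;
use warnings;

# Toy tree; each node is [ label, @children ].
my $tree = [ 'top (multipart/mixed)',
	[ 'text/plain' ],
	[ 'embed2x.eml (message/rfc822)',
		[ 'embedded message (multipart/mixed)',
			[ 'text/plain' ],
			[ 'test.eml (message/rfc822)',
				[ 'patch message' ] ] ] ] ];

# Hypothetical helper: pre-order walk returning [ depth, index, label ]
# for every node below $node.
sub number_parts {
	my ($node, @idx) = @_;
	my (undef, @children) = @$node;
	my ($n, @out) = (0);
	for my $child (@children) {
		my $dotted = join('.', @idx, ++$n);
		# depth is the number of dot-separated components
		my $depth = ($dotted =~ tr/././) + 1;
		push @out, [ $depth, $dotted, $child->[0] ];
		push @out, number_parts($child, @idx, $n);
	}
	@out;
}

printf("%d %-8s %s\n", @$_) for number_parts($tree);
```

The walk yields six entries (1, 2, 2.1, 2.1.1, 2.1.2, 2.1.2.1), which
is the same count and numbering the new t/eml.t test expects.
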
 lib/PublicInbox/Eml.pm       | 37 +++++++++++++++++++++++-----
 lib/PublicInbox/MsgIter.pm   |  6 ++++-
 lib/PublicInbox/SearchIdx.pm | 47 ++++++++++++++++++++++--------------
 lib/PublicInbox/View.pm      | 30 ++++++++++++++++++++---
 t/eml.t                      | 28 +++++++++++++++++++++
 t/psgi_attach.t              |  9 +++++++
 t/search.t                   | 25 +++++++++++++++++++
 7 files changed, 154 insertions(+), 28 deletions(-)

diff --git a/lib/PublicInbox/Eml.pm b/lib/PublicInbox/Eml.pm
index ef401141c13..6f6874cd237 100644
--- a/lib/PublicInbox/Eml.pm
+++ b/lib/PublicInbox/Eml.pm
@@ -60,6 +60,14 @@ my %DECODE_FULL = (
 our %STR_TYPE = (text => 1);
 our %STR_SUBTYPE = (plain => 1, html => 1);
 
+# message/* subtypes we descend into
+our %MESSAGE_DESCEND = (
+	news => 1, # RFC 1849 (obsolete, but archives are forever)
+	rfc822 => 1, # RFC 2046
+	rfc2822 => 1, # gmime handles this (but not rfc5322)
+	global => 1, # RFC 6532
+);
+
 my %re_memo;
 sub re_memo ($) {
 	my ($k) = @_;
@@ -149,13 +157,25 @@ sub ct ($) {
 }
 
 # returns a queue of sub-parts iff it's worth descending into
-# TODO: descend into message/rfc822 parts (Email::MIME didn't)
 sub mp_descend ($$) {
 	my ($self, $nr) = @_; # or $once for top-level
-	my $bnd = ct($self)->{attributes}->{boundary} // return; # single-part
+	my $ct = ct($self);
+	my $type = lc($ct->{type});
+	if ($type eq 'message' && $MESSAGE_DESCEND{lc($ct->{subtype})}) {
+		my $nxt = new(undef, body_raw($self));
+		$self->{-call_cb} = $nxt->{is_submsg} = 1;
+		return [ $nxt ];
+	}
+	return if $type ne 'multipart';
+	my $bnd = $ct->{attributes}->{boundary} // return; # single-part
 	return if $bnd eq '' || length($bnd) >= $mime_boundary_length_limit;
 	$bnd = quotemeta($bnd);
 
+	# this is a multipart message that didn't get descended into in
+	# public-inbox <= 1.5.0, so ensure we call the user callback for
+	# this part to not break PSGI downloads.
+	$self->{-call_cb} = $self->{is_submsg};
+
 	# "multipart" messages can exist w/o a body
 	my $bdy = ($nr ? delete($self->{bdy}) : \(body_raw($self))) or return;
 
@@ -189,14 +209,15 @@ sub mp_descend ($$) {
 		# compatibility with Email::MIME
 		$parts[-1] =~ s/\n\r?\n\z/\n/s if $epilogue_missing;
 
-		@parts = grep /[^ \t\r\n]/s, @parts; # ignore empty parts
+		# ignore empty parts
+		@parts = map { new_sub(undef, \$_) } grep /[^ \t\r\n]/s, @parts;
 
 		# Keep "From: someone..." from preamble in old,
 		# buggy versions of git-send-email, otherwise drop it
 		# There's also a case where quoted text showed up in the
 		# preamble
 		# <20060515162817.65F0F1BBAE@citi.umich.edu>
-		unshift(@parts, $pre) if $pre =~ /:/s;
+		unshift(@parts, new_sub(undef, \$pre)) if $pre =~ /:/s;
 		return \@parts;
 	}
 	# "multipart", but no boundary found, treat as single part
@@ -217,6 +238,9 @@ sub each_part {
 	my ($self, $cb, $arg, $once) = @_;
 	my $p = mp_descend($self, $once // 0) or
 					return $cb->([$self, 0, 0], $arg);
+
+	$cb->([$self, 0, 0], $arg) if $self->{-call_cb}; # rare
+
 	$p = [ $p, 0 ];
 	my @s; # our virtual stack
 	my $nr = 0;
@@ -226,11 +250,12 @@ sub each_part {
 		my (undef, @idx) = @$p;
 		@idx = (join('.', @idx));
 		my $depth = ($idx[0] =~ tr/././) + 1;
-		my $sub = new_sub(undef, \(shift @{$p->[0]}));
+		my $sub = shift @{$p->[0]};
 		if ($depth < $mime_nesting_limit &&
 				(my $nxt = mp_descend($sub, $nr))) {
 			push(@s, $p) if scalar @{$p->[0]};
 			$p = [ $nxt, @idx, 0 ];
+			$cb->([$sub, $depth, @idx], $arg) if $sub->{-call_cb};
 		} else { # a leaf node
 			$cb->([$sub, $depth, @idx], $arg);
 		}
@@ -270,7 +295,7 @@ sub subparts {
 	if ($$bdy =~ /^--\Q$bnd\E--[ \t]*\r?\n(.+)\z/sm) {
 		$self->{epilogue} = $1;
 	}
-	map { new_sub(undef, \$_) } @$parts;
+	@$parts;
 }
 
 sub parts_set {
diff --git a/lib/PublicInbox/MsgIter.pm b/lib/PublicInbox/MsgIter.pm
index 7c28d019abc..5ec2a4d9c7f 100644
--- a/lib/PublicInbox/MsgIter.pm
+++ b/lib/PublicInbox/MsgIter.pm
@@ -64,8 +64,12 @@ sub msg_part_text ($$) {
 	# times when it should not have been:
 	#   <87llgalspt.fsf@free.fr>
 	#   <200308111450.h7BEoOu20077@mail.osdl.org>
+	# But also do not try this with ->{is_submsg} (message/rfc822),
+	# since a broken multipart/mixed inside a message/rfc822 part
+	# has not been seen in the wild, yet...
 	if ($err && ($ct =~ m!\btext/\b!i ||
-			$ct =~ m!\bmultipart/mixed\b!i)) {
+			(!$part->{is_submsg} &&
+				$ct =~ m!\bmultipart/mixed\b!i) ) ) {
 		my $cte = $part->header_raw('Content-Transfer-Encoding');
 		if (defined($cte) && $cte =~ /\b7bit\b/i) {
 			$s = $part->body;
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 4bdd69f540b..5f5ae895e43 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -284,6 +284,13 @@ sub index_xapian { # msg_iter callback
 	if (defined $fn && $fn ne '') {
 		index_text($self, $fn, 1, 'XFN');
 	}
+	if ($part->{is_submsg}) {
+		my $mids = mids_for_index($part);
+		index_ids($self, $doc, $part, $mids);
+		my $smsg = PublicInbox::Smsg->new($part);
+		index_users($self, $smsg);
+		index_text($self, $smsg->subject, 1, 'S') if $smsg->subject;
+	}
 
 	my ($s, undef) = msg_part_text($part, $ct);
 	defined $s or return;
@@ -307,6 +314,27 @@ sub index_xapian { # msg_iter callback
 	}
 }
 
+sub index_ids ($$$$) {
+	my ($self, $doc, $hdr, $mids) = @_;
+	for my $mid (@$mids) {
+		index_text($self, $mid, 1, 'XM');
+
+		# because too many Message-IDs are prefixed with
+		# "Pine.LNX."...
+		if ($mid =~ /\w{12,}/) {
+			my @long = ($mid =~ /(\w{3,}+)/g);
+			index_text($self, join(' ', @long), 1, 'XM');
+		}
+	}
+	$doc->add_boolean_term('Q' . $_) for @$mids;
+	for my $l ($hdr->header_raw('List-Id')) {
+		$l =~ /<([^>]+)>/ or next;
+		my $lid = $1;
+		$doc->add_boolean_term('G' . $lid);
+		index_text($self, $lid, 1, 'XL'); # probabilistic
+	}
+}
+
 sub add_xapian ($$$$) {
 	my ($self, $mime, $smsg, $mids) = @_;
 	$smsg->{mime} = $mime; # XXX dangerous
@@ -321,22 +349,12 @@ sub add_xapian ($$$$) {
 	add_val($doc, PublicInbox::Search::DT(), $dt);
 
 	my $tg = term_generator($self);
-
 	$tg->set_document($doc);
 	index_text($self, $subj, 1, 'S') if $subj;
 	index_users($self, $smsg);
 
 	msg_iter($mime, \&index_xapian, [ $self, $doc ]);
-	foreach my $mid (@$mids) {
-		index_text($self, $mid, 1, 'XM');
-
-		# because too many Message-IDs are prefixed with
-		# "Pine.LNX."...
-		if ($mid =~ /\w{12,}/) {
-			my @long = ($mid =~ /(\w{3,}+)/g);
-			index_text($self, join(' ', @long), 1, 'XM');
-		}
-	}
+	index_ids($self, $doc, $hdr, $mids);
 	$smsg->{to} = $smsg->{cc} = ''; # WWW doesn't need these, only NNTP
 	PublicInbox::OverIdx::parse_references($smsg, $hdr, $mids);
 	my $data = $smsg->to_doc_data;
@@ -351,13 +369,6 @@ sub add_xapian ($$$$) {
 			}
 		}
 	}
-	$doc->add_boolean_term('Q' . $_) foreach @$mids;
-	for my $l ($hdr->header_raw('List-Id')) {
-		$l =~ /<([^>]+)>/ or next;
-		my $lid = $1;
-		$doc->add_boolean_term('G' . $lid);
-		index_text($self, $lid, 1, 'XL'); # probabilistic
-	}
 	$self->{xdb}->replace_document($smsg->{num}, $doc);
 }
 
diff --git a/lib/PublicInbox/View.pm b/lib/PublicInbox/View.pm
index ef5f4b3a25e..a1920212194 100644
--- a/lib/PublicInbox/View.pm
+++ b/lib/PublicInbox/View.pm
@@ -17,6 +17,7 @@ use PublicInbox::Address;
 use PublicInbox::WwwStream;
 use PublicInbox::Reply;
 use PublicInbox::ViewDiff qw(flush_diff);
+use PublicInbox::Eml;
 use POSIX qw(strftime);
 use Time::Local qw(timegm);
 use PublicInbox::Smsg qw(subject_normalized);
@@ -480,6 +481,21 @@ sub multipart_text_as_html {
 	$_[0]->each_part(\&add_text_body, $_[1], 1);
 }
 
+sub submsg_hdr ($$) {
+	my ($ctx, $eml) = @_;
+	my $obfs_ibx = $ctx->{-obfs_ibx};
+	my $rv = $ctx->{obuf};
+	$$rv .= "\n";
+	for my $h (qw(From To Cc Subject Date Message-ID X-Alt-Message-ID)) {
+		my @v = $eml->header($h);
+		for my $v (@v) {
+			obfuscate_addrs($obfs_ibx, $v) if $obfs_ibx;
+			$v = ascii_html($v);
+			$$rv .= "$h: $v\n";
+		}
+	}
+}
+
 sub attach_link ($$$$;$) {
 	my ($ctx, $ct, $p, $fn, $err) = @_;
 	my ($part, $depth, $idx) = @$p;
@@ -511,6 +527,9 @@ EOF
 	$desc = ascii_html($desc);
 	$$rv .= ($desc eq '') ? "$ts --]" : "$desc --]\n[-- $ts --]";
 	$$rv .= "</a>\n";
+
+	submsg_hdr($ctx, $part) if $part->{is_submsg};
+
 	undef;
 }
 
@@ -518,6 +537,7 @@ sub add_text_body { # callback for each_part
 	my ($p, $ctx) = @_;
 	my $upfx = $ctx->{mhref};
 	my $ibx = $ctx->{-inbox};
+	my $l = $ctx->{-linkify} //= PublicInbox::Linkify->new;
 	# $p - from each_part: [ Email::MIME-like, depth, $idx ]
 	my ($part, $depth, $idx) = @$p;
 	my $ct = $part->content_type || 'text/plain';
@@ -525,6 +545,12 @@ sub add_text_body { # callback for each_part
 	my ($s, $err) = msg_part_text($part, $ct);
 	return attach_link($ctx, $ct, $p, $fn) unless defined $s;
 
+	my $rv = $ctx->{obuf};
+	if ($part->{is_submsg}) {
+		submsg_hdr($ctx, $part);
+		$$rv .= "\n";
+	}
+
 	# makes no difference to browsers, and don't screw up filename
 	# link generation in diffs with the extra '%0D'
 	$s =~ s/\r\n/\n/sg;
@@ -571,13 +597,11 @@ sub add_text_body { # callback for each_part
 	# split off quoted and unquoted blocks:
 	my @sections = PublicInbox::MsgIter::split_quotes($s);
 	undef $s; # free memory
-	my $rv = $ctx->{obuf};
-	if (defined($fn) || $depth > 0 || $err) {
+	if (defined($fn) || ($depth > 0 && !$part->{is_submsg}) || $err) {
 		# badly-encoded message with $err? tell the world about it!
 		attach_link($ctx, $ct, $p, $fn, $err);
 		$$rv .= "\n";
 	}
-	my $l = $ctx->{-linkify} //= PublicInbox::Linkify->new;
 	foreach my $cur (@sections) {
 		if ($cur =~ /\A>/) {
 			# we use a <span> here to allow users to specify
diff --git a/t/eml.t b/t/eml.t
index c91deb3ab29..b7f58ac7069 100644
--- a/t/eml.t
+++ b/t/eml.t
@@ -117,6 +117,34 @@ EOF
 		'', 'each_part can clobber body');
 }
 
+if ('descend into message/rfc822') {
+	my $eml = eml_load 't/data/message_embed.eml';
+	my @parts;
+	$eml->each_part(sub {
+		my ($part, $level, @ex) = @{$_[0]};
+		push @parts, [ $part, $level, @ex ];
+	});
+	is(scalar(@parts), 6, 'got all parts');
+	like($parts[0]->[0]->body, qr/^testing embedded message harder\n/sm,
+		'first part found');
+	is_deeply([ @{$parts[0]}[1..2] ], [ 1, '1' ],
+		'got expected depth and level for part #0');
+	is($parts[1]->[0]->filename, 'embed2x.eml',
+		'attachment filename found');
+	is_deeply([ @{$parts[1]}[1..2] ], [ 1, '2' ],
+		'got expected depth and level for part #1');
+	is_deeply([ @{$parts[2]}[1..2] ], [ 2, '2.1' ],
+		'got expected depth and level for part #2');
+	is_deeply([ @{$parts[3]}[1..2] ], [ 3, '2.1.1' ],
+		'got expected depth and level for part #3');
+	is_deeply([ @{$parts[4]}[1..2] ], [ 3, '2.1.2' ],
+		'got expected depth and level for part #4');
+	is($parts[4]->[0]->filename, 'test.eml',
+		'another attachment filename found');
+	is_deeply([ @{$parts[5]}[1..2] ], [ 4, '2.1.2.1' ],
+		'got expected depth and level for part #5');
+}
+
 # body-less, boundary-less
 for my $cls (@classes) {
 	my $call = 0;
diff --git a/t/psgi_attach.t b/t/psgi_attach.t
index 12f9e6eeecd..c6f8072ff9a 100644
--- a/t/psgi_attach.t
+++ b/t/psgi_attach.t
@@ -75,6 +75,9 @@ $im->init_bare;
 		$res = $cb->(GET("/test/$mid/"));
 		like($res->content, qr/\bhref="2-embed2x\.eml"/s,
 			'href to message/rfc822 attachment visible');
+		like($res->content, qr/\bhref="2\.1\.2-test\.eml"/s,
+			'href to nested message/rfc822 attachment visible');
+
 		$res = $cb->(GET("/test/$mid/2-embed2x.eml"));
 		my $eml = PublicInbox::Eml->new(\($res->content));
 		is_deeply([ $eml->header_raw('Message-ID') ], [ "<$irt>" ],
@@ -85,6 +88,12 @@ $im->init_bare;
 			'1st attachment is as expected');
 		is($subs[1]->header('Content-Type'), 'message/rfc822',
 			'2nd attachment is as expected');
+
+		$res = $cb->(GET("/test/$mid/2.1.2-test.eml"));
+		$eml = PublicInbox::Eml->new(\($res->content));
+		is_deeply([ $eml->header_raw('Message-ID') ],
+			[ '<20200418214114.7575-1-e@yhbt.net>' ],
+			'nested eml retrieved');
 	});
 }
 done_testing();
diff --git a/t/search.t b/t/search.t
index 6dd5047454a..9d74f5e0532 100644
--- a/t/search.t
+++ b/t/search.t
@@ -479,6 +479,31 @@ EOF
 	is_deeply($found, [], 'matched on phrase with l:');
 }
 
+$ibx->with_umask(sub {
+	$rw_commit->();
+	my $doc_id = $rw->add_message(eml_load('t/data/message_embed.eml'));
+	ok($doc_id > 0, 'messages within messages');
+	$rw->commit_txn_lazy;
+	$ro->reopen;
+	my $n_test_eml = $ro->query('n:test.eml');
+	is(scalar(@$n_test_eml), 1, 'got a result');
+	my $n_embed2x_eml = $ro->query('n:embed2x.eml');
+	is_deeply($n_test_eml, $n_embed2x_eml, '.eml filenames searchable');
+	for my $m (qw(20200418222508.GA13918@dcvr 20200418222020.GA2745@dcvr
+			20200418214114.7575-1-e@yhbt.net)) {
+		is($ro->query("m:$m")->[0]->{mid},
+			'20200418222508.GA13918@dcvr', 'probabilistic m:'.$m);
+		is($ro->query("mid:$m")->[0]->{mid},
+			'20200418222508.GA13918@dcvr', 'boolean mid:'.$m);
+	}
+	is($ro->query('dfpost:4dc62c50')->[0]->{mid},
+		'20200418222508.GA13918@dcvr',
+		'diff search reaches inside message/rfc822');
+	is($ro->query('s:"mail header experiments"')->[0]->{mid},
+		'20200418222508.GA13918@dcvr',
+		'Subject search reaches inside message/rfc822');
+});
+
 done_testing();
 
 1;

