From: Eric Wong <e@yhbt.net>
To: meta@public-inbox.org
Subject: [PATCH] smsg: handle wide characters in raw mail headers
Date: Wed, 19 Aug 2020 08:15:49 +0000
Message-ID: <20200819081549.24617-1-e@yhbt.net>
There may be messages in the wild with wide characters in
headers which aren't RFC2047-encoded. Assume UTF-8 so those
fields can round trip through the `ddd' (doc-data-deflated)
column of over.sqlite3.

This doesn't affect docdata.glass in Xapian (at least not with
Search::Xapian), but it does affect how over.sqlite3 stores the
same data via Compress::Zlib::compress().

Noticed while working on patches to remove docdata storage from
Xapian in favor of using over.sqlite3.
---
lib/PublicInbox/Smsg.pm | 3 +++
t/psgi_search.t | 6 +++++-
2 files changed, 8 insertions(+), 1 deletion(-)
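
A minimal sketch of the round trip described above, not the actual
over.sqlite3 code path. It assumes the `ddd' column is written by
encoding the value to UTF-8 octets before Compress::Zlib::compress()
and decoded again after uncompress(); that step is not shown in this
patch, and deflate_chars() and inflate_chars() are hypothetical
helpers introduced only for illustration.

```perl
use strict;
use warnings;
use Compress::Zlib qw(compress uncompress);

# Hypothetical helper: serialize a string the way a compressed column
# might (characters -> UTF-8 octets, then deflate).
sub deflate_chars {
	my ($str) = @_;
	utf8::encode($str); # in place, but only on our copy
	compress($str);
}

# Hypothetical helper: the inverse of deflate_chars().
sub inflate_chars {
	my ($blob) = @_;
	my $str = uncompress($blob);
	utf8::decode($str); # UTF-8 octets -> characters
	$str;
}

# Raw header value as octets: "test Ævar" in undeclared UTF-8
my $raw = "test \xc3\x86var";

# Without decoding first, each octet is treated as its own character,
# so the value comes back with \xc3 and \x86 as two characters
my $undecoded = inflate_chars(deflate_chars($raw));

# What the patch adds: decode up front, so "Æ" is a single character.
# utf8::decode() leaves the string untouched if it is not valid UTF-8.
my $val = $raw;
utf8::decode($val);
my $decoded = inflate_chars(deflate_chars($val));

printf "undecoded: %d chars, decoded: %d chars\n",
	length($undecoded), length($decoded);
```

Under those assumptions, only the decoded value yields the original
octets when it is encoded once more for output; the undecoded one
ends up double-encoded.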
diff --git a/lib/PublicInbox/Smsg.pm b/lib/PublicInbox/Smsg.pm
index aaf88f35..62cb951e 100644
--- a/lib/PublicInbox/Smsg.pm
+++ b/lib/PublicInbox/Smsg.pm
@@ -105,6 +105,9 @@ sub populate {
# to protect git and NNTP clients
$val =~ tr/\0\t\n/ /;
+ # rare: in case headers have wide chars (not RFC2047-encoded)
+ utf8::decode($val);
+
# lower-case fields for read-only stuff
$self->{lc($f)} = $val;
diff --git a/t/psgi_search.t b/t/psgi_search.t
index 2d12ba6a..5d537363 100644
--- a/t/psgi_search.t
+++ b/t/psgi_search.t
@@ -28,8 +28,10 @@ my $im = $ibx->importer(0);
my $digits = '10010260936330';
my $ua = 'Pine.LNX.4.10';
my $mid = "$ua.$digits.2460-100000\@penguin.transmeta.com";
+
+# n.b. these headers are not properly RFC2047-encoded
my $mime = PublicInbox::Eml->new(<<EOF);
-Subject: test
+Subject: test Ævar
Message-ID: <$mid>
From: Ævar Arnfjörð Bjarmason <avarab\@example>
To: git\@vger.kernel.org
@@ -102,6 +104,8 @@ test_psgi(sub { $www->call(@_) }, sub {
'subject-less message linked from "/$INBOX/"');
like($html, qr/\bhref="blank-subject[^>]+>\(no subject\)</,
'blank subject message linked from "/$INBOX/"');
+ like($html, qr/test Ævar/,
+ "displayed Ævar's name properly in topic view");
$res = $cb->(GET('/test/?q=tc:git'));
like($html, qr/\bhref="no-subject-at-all[^>]+>\(no subject\)</,