unofficial mirror of meta@public-inbox.org
* [PATCH] smsg: handle wide characters in raw mail headers
@ 2020-08-19  8:15 Eric Wong
From: Eric Wong @ 2020-08-19  8:15 UTC (permalink / raw)
  To: meta

There may be messages in the wild with wide characters in
headers which aren't RFC2047-encoded.  Assume UTF-8 so those
fields can round-trip through the `ddd' (doc-data-deflated)
column of over.sqlite3.

This doesn't affect docdata.glass in Xapian (at least not with
Search::Xapian), but it does affect how over.sqlite3 stores the
same data via Compress::Zlib::compress().

Noticed while working on patches to remove docdata storage from
Xapian in favor of using over.sqlite3.
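
A minimal sketch of why the up-front decode matters.  It assumes
the stored value goes through utf8::encode() before compress() and
through utf8::decode() after uncompress(); that step is not shown
in this patch, and ddd_store/ddd_load are hypothetical names used
only for illustration, not public-inbox API.

```perl
use strict;
use warnings;
use Compress::Zlib qw(compress uncompress);

# Hypothetical helpers approximating a store/load cycle (assumption:
# encode before deflating, decode after inflating).
sub ddd_store {
	my ($str) = @_;
	utf8::encode($str); # characters -> UTF-8 octets (on our copy)
	compress($str);
}

sub ddd_load {
	my ($blob) = @_;
	my $str = uncompress($blob);
	utf8::decode($str); # UTF-8 octets -> characters
	$str;
}

# Raw header octets as read from a message: "Ævar" in UTF-8, undecoded
my $raw = "\xc3\x86var";

# Without decoding, each octet is treated as its own character, so the
# store step encodes it a second time and the load step undoes only that.
my $undecoded = ddd_load(ddd_store($raw));

# Decoding once up front, as the patch does, yields a real U+00C6.
my $val = $raw;
utf8::decode($val);
my $decoded = ddd_load(ddd_store($val));

printf "undecoded: %d chars, decoded: %d chars\n",
	length($undecoded), length($decoded);
```

In the undecoded case the value never becomes the single character
U+00C6; it stays as the two characters U+00C3 U+0086, which is what
later consumers would see.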
---
 lib/PublicInbox/Smsg.pm | 3 +++
 t/psgi_search.t         | 6 +++++-
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/lib/PublicInbox/Smsg.pm b/lib/PublicInbox/Smsg.pm
index aaf88f35..62cb951e 100644
--- a/lib/PublicInbox/Smsg.pm
+++ b/lib/PublicInbox/Smsg.pm
@@ -105,6 +105,9 @@ sub populate {
 		# to protect git and NNTP clients
 		$val =~ tr/\0\t\n/   /;
 
+		# rare: in case headers have wide chars (not RFC2047-encoded)
+		utf8::decode($val);
+
 		# lower-case fields for read-only stuff
 		$self->{lc($f)} = $val;
 
diff --git a/t/psgi_search.t b/t/psgi_search.t
index 2d12ba6a..5d537363 100644
--- a/t/psgi_search.t
+++ b/t/psgi_search.t
@@ -28,8 +28,10 @@ my $im = $ibx->importer(0);
 my $digits = '10010260936330';
 my $ua = 'Pine.LNX.4.10';
 my $mid = "$ua.$digits.2460-100000\@penguin.transmeta.com";
+
+# n.b. these headers are not properly RFC2047-encoded
 my $mime = PublicInbox::Eml->new(<<EOF);
-Subject: test
+Subject: test Ævar
 Message-ID: <$mid>
 From: Ævar Arnfjörð Bjarmason <avarab\@example>
 To: git\@vger.kernel.org
@@ -102,6 +104,8 @@ test_psgi(sub { $www->call(@_) }, sub {
 		'subject-less message linked from "/$INBOX/"');
 	like($html, qr/\bhref="blank-subject[^>]+>\(no subject\)</,
 		'blank subject message linked from "/$INBOX/"');
+	like($html, qr/test &#198;var/,
+		"displayed Ævar's name properly in topic view");
 
 	$res = $cb->(GET('/test/?q=tc:git'));
 	like($html, qr/\bhref="no-subject-at-all[^>]+>\(no subject\)</,
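
The new assertion looks for `&#198;' because U+00C6 (the character
Æ) is 198 in decimal.  A rough sketch of the difference the decode
makes when non-ASCII characters are emitted as numeric character
references; to_ncr is a hypothetical helper written for this
example and is not the escaping routine public-inbox uses.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every non-ASCII character with a
# decimal numeric character reference.
sub to_ncr {
	my ($s) = @_;
	$s =~ s/([^\x00-\x7f])/sprintf('&#%d;', ord($1))/ge;
	$s;
}

my $subject = "test \xc3\x86var"; # raw UTF-8 octets, undecoded

# Undecoded: each octet is escaped on its own
my $before = to_ncr($subject); # "test &#195;&#134;var"

# Decoded: the two octets become one character, U+00C6
utf8::decode($subject);
my $after = to_ncr($subject);  # "test &#198;var"

print "$before\n$after\n";
```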
