* Two small issues when importing old archives @ 2020-02-24 20:45 Leah Neukirchen 2020-02-25 9:23 ` [RFC] msgtime: do not require tz offset with Date::Parse fallback Eric Wong 2020-02-25 9:28 ` weird From: lines [was: Two small issues when importing old archives] Eric Wong 0 siblings, 2 replies; 5+ messages in thread From: Leah Neukirchen @ 2020-02-24 20:45 UTC (permalink / raw) To: meta Hi, I've recently imported some sizable archives (~100k messages) of old mailing lists and noticed some slight inconveniences: 1) RFC5322/822 invalid Date: headers should be parsed more gracefully Some old mails had Date: headers without time zones, e.g. Date: Sat, 27 Sep 1997 10:02:32 This results in public-inbox asserting this is the current date. But this assumption makes no sense (literally every other guess would be more likely), and also results in these messages showing up on the first page of the archive. Furthermore, sorting is then not stable, pressing F5 make the threads jump around. I'd recommend falling back to +0000 instead. 2) Weird From: lines crash the whole import From: "=?iso-8859-1?Q?Jochen_K=FCpper?= <usenet"@jochen-kuepper.de This funny line broke import_maildir: fatal: Missing > in ident string: =?iso-8859-1?Q?Jochen_K=FCpper?= usenet <"=?iso-8859-1?Q?Jochen_K=FCpper?= <usenet"@jochen-kuepper.de> 1101853296 +0100 fast-import: dumping crash report to /var/lib/public-inbox/repositories/ding.git/fast_import_crash_31402 EOF from fast-import: at /usr/share/perl5/vendor_perl/PublicInbox/Import.pm line 96, <$r> line 54681. I fixed it manually. (But I think it's actually a valid mail address, even in this botched state.) I'm not sure what added the ">", it's not in the original mail. (I use public-inbox-1.3.0/git-2.25.0 on Void Linux.) thx, -- Leah Neukirchen <leah@vuxu.org> https://leahneukirchen.org/ ^ permalink raw reply [flat|nested] 5+ messages in thread
* [RFC] msgtime: do not require tz offset with Date::Parse fallback 2020-02-24 20:45 Two small issues when importing old archives Leah Neukirchen @ 2020-02-25 9:23 ` Eric Wong 2020-03-01 23:31 ` [pushed] msgtime: assume +0000 if TZ missing when using Date::Parse Eric Wong 2020-02-25 9:28 ` weird From: lines [was: Two small issues when importing old archives] Eric Wong 1 sibling, 1 reply; 5+ messages in thread From: Eric Wong @ 2020-02-25 9:23 UTC (permalink / raw) To: Leah Neukirchen; +Cc: meta, Eric W. Biederman Leah Neukirchen <leah@vuxu.org> wrote: > Hi, > > I've recently imported some sizable archives (~100k messages) of old > mailing lists and noticed some slight inconveniences: Thanks for the reports, will answer 2. separately. > 1) RFC5322/822 invalid Date: headers should be parsed more gracefully > > Some old mails had Date: headers without time zones, e.g. > Date: Sat, 27 Sep 1997 10:02:32 > > This results in public-inbox asserting this is the current date. > But this assumption makes no sense (literally every other guess > would be more likely), and also results in these messages showing up > on the first page of the archive. Furthermore, sorting is then not > stable, pressing F5 make the threads jump around. I'd recommend > falling back to +0000 instead. I think a fallback to +0000 makes sense, too. It's not a new bug in 1.3.0 (which makes Date::Parse optional). Looks like that regression was introduced a while ago in commit ae80a3fdb53d70142624f2691ed8ed84eddda66b ("MsgTime.pm: Use strptime to compute the time zone") Cc-ing Eric W. Biederman in case he has any input on this. Now, I'm not sure if this fallback is worth adding for users without Date::Parse. The non-Date::Parse path is a bit faster and optimized for common (correct) dates... ------------8<------------ Subject: [RFC] msgtime: assume +0000 if TZ missing when using Date::Parse Reported-by: Leah Neukirchen <leah@vuxu.org> Link: https://public-inbox.org/meta/87h7zfemur.fsf@vuxu.org/ Fixes: ae80a3fdb53d7014 ("MsgTime.pm: Use strptime to compute the time zone") --- lib/PublicInbox/MsgTime.pm | 3 ++- t/msgtime.t | 7 +++++++ 2 files changed, 9 insertions(+), 1 deletion(-) diff --git a/lib/PublicInbox/MsgTime.pm b/lib/PublicInbox/MsgTime.pm index 8eee9a75..8703d7bc 100644 --- a/lib/PublicInbox/MsgTime.pm +++ b/lib/PublicInbox/MsgTime.pm @@ -104,7 +104,8 @@ sub str2date_zone ($) { # off is the time zone offset in seconds from GMT my ($ss,$mm,$hh,$day,$month,$year,$off) = Date::Parse::strptime($date); - return undef unless(defined $off); + return unless defined($year); + $off //= 0; # Compute the time zone from offset my $sign = ($off < 0) ? '-' : '+'; diff --git a/t/msgtime.t b/t/msgtime.t index 5c4636a2..7c95e547 100644 --- a/t/msgtime.t +++ b/t/msgtime.t @@ -5,6 +5,8 @@ use warnings; use Test::More; use PublicInbox::MIME; use PublicInbox::MsgTime; +use PublicInbox::TestCommon; + our $received_date = 'Mon, 22 Jan 2007 13:16:24 -0500'; sub datestamp ($) { my ($date) = @_; @@ -102,6 +104,11 @@ is_datestamp('Thu, 14 Dec 2006 00:20:24 +0480', [1166036424, '+0520']); is_datestamp('Thu, 14 Dec 2006 00:20:24 -0480', [1166074824, '-0520']); is_datestamp('Mon, 14 Apr 2014 07:59:01 -0007', [1397462761, '-0007']); +SKIP: { + require_mods('Date::Parse', 1); + is_datestamp('Sat, 27 Sep 1997 10:02:32', [875354552, '+0000']); +} + # obsolete formats described in RFC2822 for (qw(UT GMT Z)) { is_datestamp('Fri, 02 Oct 1993 00:00:00 '.$_, [ 749520000, '+0000']); ^ permalink raw reply related [flat|nested] 5+ messages in thread
* [pushed] msgtime: assume +0000 if TZ missing when using Date::Parse 2020-02-25 9:23 ` [RFC] msgtime: do not require tz offset with Date::Parse fallback Eric Wong @ 2020-03-01 23:31 ` Eric Wong 0 siblings, 0 replies; 5+ messages in thread From: Eric Wong @ 2020-03-01 23:31 UTC (permalink / raw) To: Leah Neukirchen; +Cc: meta, Eric W. Biederman Eric Wong <e@yhbt.net> wrote: > Leah Neukirchen <leah@vuxu.org> wrote: > > 1) RFC5322/822 invalid Date: headers should be parsed more gracefully > > > > Some old mails had Date: headers without time zones, e.g. > > Date: Sat, 27 Sep 1997 10:02:32 > > > > This results in public-inbox asserting this is the current date. > > But this assumption makes no sense (literally every other guess > > would be more likely), and also results in these messages showing up > > on the first page of the archive. Furthermore, sorting is then not > > stable, pressing F5 make the threads jump around. I'd recommend > > falling back to +0000 instead. > > I think a fallback to +0000 makes sense, too. > It's not a new bug in 1.3.0 (which makes Date::Parse optional). > > Looks like that regression was introduced a while ago in > commit ae80a3fdb53d70142624f2691ed8ed84eddda66b > ("MsgTime.pm: Use strptime to compute the time zone") > > Cc-ing Eric W. Biederman in case he has any input on this. > ------------8<------------ > Subject: [RFC] msgtime: assume +0000 if TZ missing when using Date::Parse Pushed as commit d857e7dc0d816b635a7ead09c3273f8c2d2434be with a more descriptive commit message: msgtime: assume +0000 if TZ missing when using Date::Parse Some old emails don't have timezone offsets, since our Date::Parse code path takes a liberal interpretation of dates, fallback to using "+0000" as the timezone offset since it's closer to the actual date of the message than whatever the current date is. Reported-by: Leah Neukirchen <leah@vuxu.org> Link: https://public-inbox.org/meta/87h7zfemur.fsf@vuxu.org/ Fixes: ae80a3fdb53d7014 ("MsgTime.pm: Use strptime to compute the time zone") Thanks ^ permalink raw reply [flat|nested] 5+ messages in thread
* weird From: lines [was: Two small issues when importing old archives] 2020-02-24 20:45 Two small issues when importing old archives Leah Neukirchen 2020-02-25 9:23 ` [RFC] msgtime: do not require tz offset with Date::Parse fallback Eric Wong @ 2020-02-25 9:28 ` Eric Wong 2020-02-26 10:21 ` [PATCH] import: drop '<' and '>' characters in addresses Eric Wong 1 sibling, 1 reply; 5+ messages in thread From: Eric Wong @ 2020-02-25 9:28 UTC (permalink / raw) To: Leah Neukirchen; +Cc: meta Leah Neukirchen <leah@vuxu.org> wrote: > 2) Weird From: lines crash the whole import > > From: "=?iso-8859-1?Q?Jochen_K=FCpper?= <usenet"@jochen-kuepper.de > > This funny line broke import_maildir: > > fatal: Missing > in ident string: =?iso-8859-1?Q?Jochen_K=FCpper?= usenet <"=?iso-8859-1?Q?Jochen_K=FCpper?= <usenet"@jochen-kuepper.de> 1101853296 +0100 > fast-import: dumping crash report to /var/lib/public-inbox/repositories/ding.git/fast_import_crash_31402 > EOF from fast-import: at /usr/share/perl5/vendor_perl/PublicInbox/Import.pm line 96, <$r> line 54681. > > I fixed it manually. (But I think it's actually a valid mail address, > even in this botched state.) I'm not sure what added the ">", it's > not in the original mail. > > (I use public-inbox-1.3.0/git-2.25.0 on Void Linux.) Gah, this looks like it's because Email::Address::XS leaves a "<" in the name... Perhaps Import should delete all [<>] characters unconditionally? (or swap in appropriate Unicode homographs and assume users have the necessary glyphs...) ---------8<---------- Subject: [RFC] t/address.t: dump failing case "PublicInbox::Address" (w/o "PP") is Email::Address::XS 1.04 from Debian 10: PublicInbox::Address names: $VAR1 = [ '=?iso-8859-1?Q?Jochen_K=FCpper?= <usenet' ]; PublicInbox::Address emails: $VAR1 = [ '"=?iso-8859-1?Q?Jochen_K=FCpper?= <usenet"@example.de' ]; PublicInbox::AddressPP names: $VAR1 = [ '=?iso-8859-1?Q?Jochen_K=FCpper?=' ]; PublicInbox::AddressPP emails: $VAR1 = [ 'usenet"@example.de' ]; --- t/address.t | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/t/address.t b/t/address.t index 6f4bff6c..8c39f04b 100644 --- a/t/address.t +++ b/t/address.t @@ -14,6 +14,11 @@ sub test_pkg { [$emails->('User <e@example.com>, e@example.org')], 'address extraction works as expected'); + my $odd = '"=?iso-8859-1?Q?Jochen_K=FCpper?= <usenet"@example.de'; + use Data::Dumper; + diag "$pkg names: " . Dumper([$names->($odd)]); + diag "$pkg emails: " . Dumper([$emails->($odd)]); + is_deeply(['user@example.com'], [$emails->('<user@example.com (Comment)>')], 'comment after domain accepted before >'); ^ permalink raw reply related [flat|nested] 5+ messages in thread
* [PATCH] import: drop '<' and '>' characters in addresses 2020-02-25 9:28 ` weird From: lines [was: Two small issues when importing old archives] Eric Wong @ 2020-02-26 10:21 ` Eric Wong 0 siblings, 0 replies; 5+ messages in thread From: Eric Wong @ 2020-02-26 10:21 UTC (permalink / raw) To: Leah Neukirchen; +Cc: meta Eric Wong <e@yhbt.net> wrote: > Leah Neukirchen <leah@vuxu.org> wrote: > > 2) Weird From: lines crash the whole import > > > > From: "=?iso-8859-1?Q?Jochen_K=FCpper?= <usenet"@jochen-kuepper.de > > > > This funny line broke import_maildir: > > > > fatal: Missing > in ident string: =?iso-8859-1?Q?Jochen_K=FCpper?= usenet <"=?iso-8859-1?Q?Jochen_K=FCpper?= <usenet"@jochen-kuepper.de> 1101853296 +0100 > > fast-import: dumping crash report to /var/lib/public-inbox/repositories/ding.git/fast_import_crash_31402 > > EOF from fast-import: at /usr/share/perl5/vendor_perl/PublicInbox/Import.pm line 96, <$r> line 54681. > > > > I fixed it manually. (But I think it's actually a valid mail address, > > even in this botched state.) I'm not sure what added the ">", it's > > not in the original mail. > > > > (I use public-inbox-1.3.0/git-2.25.0 on Void Linux.) > > Gah, this looks like it's because Email::Address::XS leaves a > "<" in the name... Perhaps Import should delete all [<>] > characters unconditionally? (or swap in appropriate Unicode > homographs and assume users have the necessary glyphs...) So we already do `$name =~ tr/<>//d', so I think doing the same with `$email' is appropiate for fast-import. The "correct" address featuring '<' will still be indexed in Xapian, at least. -------------8<------------- Subject: [PATCH] import: drop '<' and '>' characters in addresses Some strange "From:" lines will cause Email::Address::XS to leave '<' (and presumably '>') in the address which git-fast-import won't accept even if quoted. Workaround this problem by deleting '<' and '>' the same way we delete them for the ident name. Reported-by: Leah Neukirchen <leah@vuxu.org> Link: https://public-inbox.org/meta/87h7zfemur.fsf@vuxu.org/ --- lib/PublicInbox/Import.pm | 4 ++++ t/import.t | 2 ++ 2 files changed, 6 insertions(+) diff --git a/lib/PublicInbox/Import.pm b/lib/PublicInbox/Import.pm index d8dc49b8..68dc0c7e 100644 --- a/lib/PublicInbox/Import.pm +++ b/lib/PublicInbox/Import.pm @@ -293,6 +293,10 @@ sub extract_cmt_info ($) { } } if (defined $email) { + # Email::Address::XS may leave quoted '<' in addresses, + # which git-fast-import doesn't like + $email =~ tr/<>//d; + # quiet down wide character warnings with utf8::encode utf8::encode($email); } else { diff --git a/t/import.t b/t/import.t index e71dd714..b88d308e 100644 --- a/t/import.t +++ b/t/import.t @@ -55,6 +55,8 @@ $im->done; my @revs = $git->qx(qw(rev-list HEAD)); is(scalar @revs, 1, 'one revision created'); +my $odd = '"=?iso-8859-1?Q?J_K=FCpper?= <usenet"@example.de'; +$mime->header_set('From', $odd); $mime->header_set('Message-ID', '<b@example.com>'); $mime->header_set('Subject', 'msg2'); like($im->add($mime, sub { $mime }), qr/\A:\d+\z/, 'added 2nd message'); ^ permalink raw reply related [flat|nested] 5+ messages in thread
end of thread, other threads:[~2020-03-01 23:31 UTC | newest] Thread overview: 5+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2020-02-24 20:45 Two small issues when importing old archives Leah Neukirchen 2020-02-25 9:23 ` [RFC] msgtime: do not require tz offset with Date::Parse fallback Eric Wong 2020-03-01 23:31 ` [pushed] msgtime: assume +0000 if TZ missing when using Date::Parse Eric Wong 2020-02-25 9:28 ` weird From: lines [was: Two small issues when importing old archives] Eric Wong 2020-02-26 10:21 ` [PATCH] import: drop '<' and '>' characters in addresses Eric Wong
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).