* how to gracefully handle spaces in Message-IDs?
From: Eric Wong @ 2020-03-31 8:32 UTC
To: meta

There exist Message-IDs with spaces in them (and maybe other
strangeness).

Take this example:
https://lore.kernel.org/lkml/200203040330.g243URr05337@3%20(NXDOMAIN)%20/

That is:

	Message-ID: <200203040330.g243URr05337@3 (NXDOMAIN) >

RFC 3977 (NNTP) struggles with that in HDR/XHDR commands
because of its split-on-spaces-or-tabs behavior.

Not only that, even with a successful attempt to handle parsing
of spaces in the Message-ID for -nntpd requests, Net::NNTP has
trouble parsing responses with spaces in the Message-ID.

I haven't tried other NNTP clients, but I don't expect clients
to know what to do with invalid Message-IDs in responses,
either...

RFC 5322, Appendix A.6.3. Obsolete White Space and Comments
<https://tools.ietf.org/html/rfc5322#appendix-A.6.3>
has a particularly nasty example:

	Message-ID : <1234 @ local(blah) .machine .example>

And RFC 733 is full of examples with spaces in Message-IDs
for the historically-inclined:
<https://tools.ietf.org/html/rfc733>

But I haven't found relevant docs on how to handle that case for
NNTP in RFC 977 or 3977...

In innd(*), the nnrpd/article.c::CMDpat function for HDR/XHDR
commands calls lib/messageid.c::IsValidMessageID with the
`stripspaces' parameter as `true', but `stripspaces' only strips
leading and trailing whitespace.

So I'm thinking at least stripping leading+trailing spaces is
something we should be doing, and spaces in the middle of the
Message-ID need to be preserved.

But, maybe non-printable control characters can also be filtered
out entirely, since I've definitely seen those in headers when
they don't belong.  I suspect those were introduced by hardware
errors or software bugs.

Anyways, my head hurts :<

(*) svn co https://inn.eyrie.org/svn/trunk innd
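The normalization floated above (strip leading and trailing whitespace,
keep interior spaces, drop control characters) could look roughly like
the sketch below.  clean_mid is a hypothetical helper, not a function in
public-inbox or INN, and it assumes the caller has already removed the
surrounding angle brackets.  Whether a tab inside the Message-ID should
be kept is left open in the message above; this sketch keeps it.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption for illustration only).
# Input: the text between "<" and ">" of a Message-ID header.
sub clean_mid {
	my ($mid) = @_;
	# Drop ASCII control characters except tab, which is treated
	# as whitespace in the next step.
	$mid =~ tr/\x00-\x08\x0A-\x1F\x7F//d;
	# Strip leading and trailing spaces/tabs only; whitespace in
	# the middle of the Message-ID is preserved.
	$mid =~ s/\A[ \t]+//;
	$mid =~ s/[ \t]+\z//;
	$mid;
}

print '<', clean_mid('200203040330.g243URr05337@3 (NXDOMAIN) '), ">\n";
```

Removing control characters before trimming means a stray control byte
at either end cannot shield adjacent whitespace from being stripped.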
* [WIP 1/?] v2writable: index Message-IDs w/ spaces properly
From: Eric Wong @ 2020-03-31 8:49 UTC
To: meta

Message-IDs can apparently contain spaces and other weird
characters.  Ensure we pass those properly to shard subprocesses
when importing messages in parallel mode.

Our NNTP parser does not deal with spaces in the Message-ID,
yet, and I don't expect most NNTP clients to, either.
---
 lib/PublicInbox/SearchIdxShard.pm |  8 +++++---
 t/v2writable.t                    | 11 ++++++++++-
 2 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/lib/PublicInbox/SearchIdxShard.pm b/lib/PublicInbox/SearchIdxShard.pm
index 1ea01095..06bcd403 100644
--- a/lib/PublicInbox/SearchIdxShard.pm
+++ b/lib/PublicInbox/SearchIdxShard.pm
@@ -69,8 +69,9 @@ sub shard_worker_loop ($$$$$) {
 			$self->remove_by_oid($oid, $mid);
 		} else {
 			chomp $line;
-			my ($bytes, $num, $blob, $mid, $ds, $ts) =
-							split(/ /, $line);
+			# n.b. $mid may contain spaces(!)
+			my ($bytes, $num, $blob, $ds, $ts, $mid) =
+							split(/ /, $line, 6);
 			$self->begin_txn_lazy;
 			my $n = read($r, my $msg, $bytes) or die "read: $!\n";
 			$n == $bytes or die "short read: $n != $bytes\n";
@@ -93,7 +94,8 @@ sub shard_worker_loop ($$$$$) {
 sub index_raw {
 	my ($self, $msgref, $mime, $smsg) = @_;
 	if (my $w = $self->{w}) {
-		print $w join(' ', @$smsg{qw(bytes num blob mid ds ts)}),
+		# mid must be last, it can contain spaces (but not LF)
+		print $w join(' ', @$smsg{qw(bytes num blob ds ts mid)}),
 			"\n", $$msgref or die "failed to write shard $!\n";
 	} else {
 		$$msgref = undef;
diff --git a/t/v2writable.t b/t/v2writable.t
index cdcfe4d0..8167e4de 100644
--- a/t/v2writable.t
+++ b/t/v2writable.t
@@ -109,6 +109,11 @@ if ('ensure git configs are correct') {
 	@mids = $mime->header_obj->header_raw('Message-Id');
 	like($mids[0], $sane_mid, 'mid was generated');
 	is(scalar(@mids), 1, 'new generated');
+
+	@warn = ();
+	$mime->header_set('Message-Id', '<space@ (NXDOMAIN) >');
+	ok($im->add($mime), 'message added with space in Message-Id');
+	is_deeply([], \@warn);
 }
 
 {
@@ -175,8 +180,12 @@ EOF
 		is($uniq{$mid}++, 0, "MID for $num is unique in XOVER");
 		is_deeply($n->xhdr('Message-ID', $num),
 			 { $num => $mid }, "XHDR lookup OK on num $num");
+
+		# FIXME NNTP.pm doesn't handle spaces in Message-ID
+		next if $mid =~ / /;
+
 		is_deeply($n->xhdr('Message-ID', $mid),
-			 { $mid => $mid }, "XHDR lookup OK on MID $num");
+			 { $mid => $mid }, "XHDR lookup OK on MID $mid ($num)");
 	}
 	my %nn;
 	foreach my $mid (@{$n->newnews(0, $group)}) {
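For readers skimming the patch, here is a standalone sketch of why
moving mid to the end and passing a limit to split() is enough.  The
field order matches the patch; pack_line and unpack_line are
hypothetical names, and every value in %smsg is made up for
illustration.

```perl
use strict;
use warnings;

my @fields = qw(bytes num blob ds ts mid);

# Made-up sample record; only the field order comes from the patch.
my %smsg = (
	bytes => 1234,
	num => 1,
	blob => 'deadbeef',
	ds => 1585643520,
	ts => 1585643520,
	mid => 'space@ (NXDOMAIN) ',
);

# Writer side: mid goes last because it may contain spaces (but not LF).
sub pack_line {
	my ($smsg) = @_;
	join(' ', @$smsg{@fields});
}

# Reader side: with a limit of 6, split() stops after five separators
# and leaves the rest of the line, spaces included, in the last field.
sub unpack_line {
	my ($line) = @_;
	my %h;
	@h{@fields} = split(/ /, $line, scalar(@fields));
	\%h;
}

my $h = unpack_line(pack_line(\%smsg));
print "mid=<$h->{mid}> ds=$h->{ds} ts=$h->{ts}\n";
```

A positive limit also keeps a trailing space in the last field, which an
unlimited split(/ /, $line) would discard along with any other trailing
empty fields.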
* Re: [WIP 1/?] v2writable: index Message-IDs w/ spaces properly
From: Eric Wong @ 2020-04-01 0:05 UTC
To: meta

Eric Wong <e@yhbt.net> wrote:
> Message-IDs can apparently contain spaces and other weird
> characters.  Ensure we pass those properly to shard subprocesses
> when importing messages in parallel mode.
>
> Our NNTP parser does not deal with spaces in the Message-ID,
> yet, and I don't expect most NNTP clients to, either.

Nor does Net::NNTP on the client side...

But regardless of what happens with Message-IDs on the NNTP
side, this patch will remain correct and fixes an indexing
problem when Message-IDs contain spaces.

This bug was exacerbated by the changes to pass date and
timestamps from the git commit into the shard when mirroring,
but has always been with us when using multi-process indexing.

> diff --git a/t/v2writable.t b/t/v2writable.t
> index cdcfe4d0..8167e4de 100644
> --- a/t/v2writable.t
> +++ b/t/v2writable.t
> @@ -175,8 +180,12 @@ EOF
>  		is($uniq{$mid}++, 0, "MID for $num is unique in XOVER");
>  		is_deeply($n->xhdr('Message-ID', $num),
>  			 { $num => $mid }, "XHDR lookup OK on num $num");
> +
> +		# FIXME NNTP.pm doesn't handle spaces in Message-ID
> +		next if $mid =~ / /;
> +

Pushed with the following squashed in:

diff --git a/t/v2writable.t b/t/v2writable.t
index 8167e4de..66d5663e 100644
--- a/t/v2writable.t
+++ b/t/v2writable.t
@@ -181,7 +181,8 @@ EOF
 		is_deeply($n->xhdr('Message-ID', $num),
 			 { $num => $mid }, "XHDR lookup OK on num $num");
 
-		# FIXME NNTP.pm doesn't handle spaces in Message-ID
+		# FIXME PublicInbox::NNTP (server) doesn't handle spaces in
+		# Message-ID, but neither does Net::NNTP (client)
 		next if $mid =~ / /;
 
 		is_deeply($n->xhdr('Message-ID', $mid),