* encoding of message-ids
@ 2016-02-16 12:38 David Bremner
2016-02-16 19:02 ` Daniel Kahn Gillmor
0 siblings, 1 reply; 4+ messages in thread
From: David Bremner @ 2016-02-16 12:38 UTC (permalink / raw)
To: notmuch
I spent a little time this morning staring at the code, and it seems
that all of the message-ids are parsed via g_mime_decode_text, which
deals with RFC2047 encodings and makes guesses at decoding 8bit
characters. In practice this means that in the notmuch database all
headers are UTF-8. Since message-ids are supposed to be printable ASCII
[at least in RFC 5322], this seems like not such a terrible decision, but
I wonder if we should document this potential conversion somewhere?
d
* Re: encoding of message-ids
2016-02-16 12:38 encoding of message-ids David Bremner
@ 2016-02-16 19:02 ` Daniel Kahn Gillmor
2016-02-17 13:34 ` David Bremner
0 siblings, 1 reply; 4+ messages in thread
From: Daniel Kahn Gillmor @ 2016-02-16 19:02 UTC (permalink / raw)
To: David Bremner, notmuch
On Tue 2016-02-16 07:38:09 -0500, David Bremner wrote:
> I spent a little time this morning staring at the code, and it seems
> that all of the message-ids are parsed via g_mime_decode_text, which
> deals with RFC2047 encodings and makes guesses at decoding 8bit
> characters. In practice this means that in the notmuch database all
> headers are UTF-8. Since message-ids are supposed to be printable ASCII
> [at least in RFC 5322], this seems like not such a terrible decision, but
> I wonder if we should document this potential conversion somewhere?
i think you mean g_mime_utils_header_decode_text, not g_mime_decode_text,
right?
What do you think are the potential risks here?
* if all incoming message-ids are standards-compliant (lower-case
  ascii, with an @ sign in the middle and surrounded by angle-brackets)
  [0], then they cannot be interpreted as RFC 2047 text because they do
  not have the leading =? or the trailing ?=, so gmime shouldn't
  translate them.
* if some incoming message-ids are not standards-compliant, then it's
  possible that they will be transformed into other,
  non-standards-compliant message IDs. Some of them might even be
  transformed into standards-compliant message IDs. For example,
  '=?UTF-8?q?<abc@example.net>?=' will be transformed into
  '<abc@example.net>'.
The main risk, i suppose, is that someone could craft a message with a
different literal Message-ID than an existing message, and could trigger
an otherwise undetectable message ID collision. This seems not much
worse than the existing (detectable) message ID collision problems
notmuch already has.
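To make the collision concrete, here is a minimal sketch. It uses Python's
standard-library RFC 2047 decoder as a stand-in for
g_mime_utils_header_decode_text, so it only illustrates the idea; GMime's
behaviour may differ in detail. The helper name decode_rfc2047 is made up
for this example.

```python
from email.header import decode_header, make_header


def decode_rfc2047(raw: str) -> str:
    """Decode any RFC 2047 encoded-words found in a header value.

    Stand-in for GMime's decoder (an assumption, not notmuch's code).
    """
    return str(make_header(decode_header(raw)))


plain = "<abc@example.net>"
wrapped = "=?UTF-8?q?<abc@example.net>?="

# No =? ... ?= markers, so the value passes through unchanged.
assert decode_rfc2047(plain) == plain

# The two literal header values differ, yet they decode to the same
# string, so they would end up as the same message ID after decoding.
assert plain != wrapped
assert decode_rfc2047(wrapped) == decode_rfc2047(plain)
```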
That said, RFC 2047 suggests [1] that its encodings are only relevant in
places where a "text" token would be used. Message-ID (and References
and In-Reply-To) are intended to only contain dot-atom-text tokens. So
probably it would be more correct to avoid applying RFC 2047 decoding to
these specific fields.
i dunno that it's a big deal though, given the analysis above.
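For illustration, a rough check for the strict form could look like the
sketch below. It only covers the dot-atom-text "@" dot-atom-text shape from
RFC 5322; the no-fold-literal and obsolete forms are deliberately left out,
and the names MSG_ID and is_strict_msg_id are hypothetical.

```python
import re

# atext from RFC 5322 section 3.2.3: letters, digits and a fixed set of
# printable ASCII specials.
ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"
DOT_ATOM_TEXT = rf"{ATEXT}+(?:\.{ATEXT}+)*"

# Simplified msg-id: "<" dot-atom-text "@" dot-atom-text ">".
# Assumption: no-fold-literal and obs-* alternatives are ignored here.
MSG_ID = re.compile(rf"<{DOT_ATOM_TEXT}@{DOT_ATOM_TEXT}>")


def is_strict_msg_id(value: str) -> bool:
    """Return True if value matches the simplified msg-id grammar."""
    return MSG_ID.fullmatch(value) is not None


assert is_strict_msg_id("<abc@example.net>")
# The wrapped form does not start with "<", so it is rejected.
assert not is_strict_msg_id("=?UTF-8?q?<abc@example.net>?=")
# Non-ASCII characters are not atext.
assert not is_strict_msg_id("<caf\u00e9@example.net>")
```

A value that passes this check contains only ASCII and could be stored
as-is without any decoding step.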
--dkg
[0] https://tools.ietf.org/html/rfc5322#section-3.6.4
[1] https://tools.ietf.org/html/rfc2047#section-5
* Re: encoding of message-ids
2016-02-16 19:02 ` Daniel Kahn Gillmor
@ 2016-02-17 13:34 ` David Bremner
2016-02-24 17:15 ` W. Trevor King
0 siblings, 1 reply; 4+ messages in thread
From: David Bremner @ 2016-02-17 13:34 UTC (permalink / raw)
To: Daniel Kahn Gillmor, notmuch
Daniel Kahn Gillmor <dkg@fifthhorseman.net> writes:
> That said, RFC 2047 suggests that its encodings are only relevant in
> places where a "text" token would be used. Message-ID (and References
> and In-Reply-To) are intended to only contain dot-atom-text tokens. So
> probably it would be more correct to avoid applying RFC 2047 decoding to
> these specific fields.
>
> i dunno that it's a big deal though, given the analysis above.
I guess there are two separate issues. One is the (mildly bogus)
application of RFC2047 decoding to message-ids. The other is the
coercion into UTF-8 from whatever wacky 8bit encoding some creative
person might use in a message-id.
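A small sketch of the second issue, assuming a decoder that tries UTF-8
first and otherwise falls back to a guessed 8-bit charset. The ISO-8859-1
fallback and the helper name coerce_to_utf8 are assumptions made for this
example; GMime's actual guessing heuristic is not reproduced here.

```python
def coerce_to_utf8(raw: bytes, fallback: str = "iso-8859-1") -> bytes:
    """Return raw re-encoded as UTF-8, guessing a charset if needed.

    Assumption: try UTF-8 first, then fall back to a single 8-bit
    charset. Real decoders may guess differently.
    """
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        text = raw.decode(fallback)
    return text.encode("utf-8")


# Pure ASCII ids are untouched.
assert coerce_to_utf8(b"<abc@example.net>") == b"<abc@example.net>"

# An id with a raw Latin-1 byte (0xE9) comes out as different bytes,
# so the stored id no longer matches the bytes in the message file.
raw_id = b"<caf\xe9@example.net>"
stored = coerce_to_utf8(raw_id)
assert stored == b"<caf\xc3\xa9@example.net>"
assert stored != raw_id
```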
d
* Re: encoding of message-ids
2016-02-17 13:34 ` David Bremner
@ 2016-02-24 17:15 ` W. Trevor King
0 siblings, 0 replies; 4+ messages in thread
From: W. Trevor King @ 2016-02-24 17:15 UTC (permalink / raw)
To: David Bremner; +Cc: Daniel Kahn Gillmor, notmuch
On Wed, Feb 17, 2016 at 09:34:29AM -0400, David Bremner wrote:
> Daniel Kahn Gillmor writes:
> > That said, RFC 2047 suggests that its encodings are only relevant
> > in places where a "text" token would be used. Message-ID (and
> > References and In-Reply-To) are intended to only contain
> > dot-atom-text tokens. So probably it would be more correct to
> > avoid applying RFC 2047 decoding to these specific fields.
> >
> > i dunno that it's a big deal though, given the analysis above.
>
> I guess there are two separate issues. One is the (mildly bogus)
> application of RFC2047 decoding to message-ids. The other is
> the coercion into UTF-8 from whatever wacky 8bit encoding some
> creative person might use in a message-id.
It looks like there's already an “implicit encodings are complicated”
RFC discussing this issue [1]. RFC 6532 overrides (among other
things) the atext behind message-id [2,3] for message/global messages.
Other related RFCs cover internationalized domain names [4] and
internationalized email addresses [5]. I think we should:
* Store message IDs as NFKC UTF-8 in notmuch (do we already do this?).
* For message/global messages:
  * Convert headers to Unicode using UTF-8 (per RFC 6532).
* For non-message/global messages:
  * Ignore any RFC 2047 =? encoding or RFC 5890 xn-- encoding that may
    be present.
  * Convert to Unicode by percent-encoding [6] (e.g. ‘ü%’ represented
    as the three UTF-8 bytes ‘\xc3\xbc\x25’ would be represented by
    the Unicode ‘%C3%BC%25’).
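A rough sketch of the two branches above, in Python. The function names are
hypothetical, and the choice of which bytes stay literal (printable ASCII
other than "%") is an assumption for this example, not something any RFC or
notmuch prescribes.

```python
import unicodedata


def global_id_to_stored(raw: bytes) -> str:
    """message/global branch: decode as UTF-8, then normalize to NFKC."""
    return unicodedata.normalize("NFKC", raw.decode("utf-8"))


def legacy_id_to_stored(raw: bytes) -> str:
    """Non-message/global branch: percent-encode instead of decoding.

    Assumption: printable ASCII other than "%" is kept literally;
    every other byte becomes %XX with uppercase hex digits.
    """
    out = []
    for byte in raw:
        if 0x21 <= byte <= 0x7E and byte != 0x25:
            out.append(chr(byte))
        else:
            out.append(f"%{byte:02X}")
    return "".join(out)


# The example from the list above: the bytes \xc3\xbc\x25.
assert legacy_id_to_stored(b"\xc3\xbc\x25") == "%C3%BC%25"

# Plain ASCII ids are unchanged, and =? is not treated specially.
assert legacy_id_to_stored(b"<abc@example.net>") == "<abc@example.net>"

# NFKC folds compatibility characters, e.g. fullwidth "a" (U+FF41).
assert global_id_to_stored("<\uff41bc@example.net>".encode("utf-8")) == "<abc@example.net>"
```

Because "%" itself is escaped, the result is unambiguous and the original
bytes can be recovered.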
Cheers,
Trevor
[1]: https://tools.ietf.org/html/rfc6055
[2]: https://tools.ietf.org/html/rfc5322#section-3.6.4
[3]: https://tools.ietf.org/html/rfc5322#section-3.2.3
[4]: https://tools.ietf.org/html/rfc5890
[5]: https://tools.ietf.org/html/rfc6530
[6]: https://tools.ietf.org/html/rfc3986#section-2
[7]: https://tools.ietf.org/html/rfc2606#section-2
--
This email may be signed or encrypted with GnuPG (http://www.gnupg.org).
For more information, see http://en.wikipedia.org/wiki/Pretty_Good_Privacy