On Wed, Feb 17, 2016 at 09:34:29AM -0400, David Bremner wrote:
> Daniel Kahn Gillmor writes:
> > That said, RFC 2047 suggests that its encodings are only relevant
> > in places where a "text" token would be used.  Message-ID (and
> > References and In-Reply-To) are intended to only contain
> > dot-atom-text tokens.  So probably it would be more correct to
> > avoid applying it to these specific fields.
> >
> > i dunno that it's a big deal though, given the analysis above.
>
> I guess there are two separate issues.  One is the (mildly bogus)
> application of RFC2047 decoding to message-ids.  The other is
> the coercion into utf8 from whatever wacky 8bit encoding some
> creative person might use in a message-id.

It looks like there's already an “implicit encodings are complicated”
RFC discussing this issue [1].  RFC 6532 overrides (among other
things) the atext behind message-id [2,3] for message/global
messages.  Other related RFCs cover internationalized domain names [4]
and internationalized email addresses [5].

I think we should:

* Store message IDs as NFKC UTF-8 in notmuch (do we already do this?).
* For message/global messages:
  * Convert headers to Unicode using UTF-8 (per RFC 6532).
* For non-message/global messages:
  * Ignore any RFC 2047 =? encoding or RFC 5890 xn-- encoding that
    may be present.
  * Convert to Unicode by percent-encoding [6] (e.g. ‘ü%’ represented
    as the three UTF-8 bytes ‘\xc3\xbc\x25’ would be represented by
    the Unicode ‘%C3%BC%25’).

Cheers,
Trevor

[1]: https://tools.ietf.org/html/rfc6055
[2]: https://tools.ietf.org/html/rfc5322#section-3.6.4
[3]: https://tools.ietf.org/html/rfc5322#section-3.2.3
[4]: https://tools.ietf.org/html/rfc5890
[5]: https://tools.ietf.org/html/rfc6530
[6]: https://tools.ietf.org/html/rfc3986#section-2
[7]: https://tools.ietf.org/html/rfc2606#section-2

-- 
This email may be signed or encrypted with GnuPG (http://www.gnupg.org).
For more information, see http://en.wikipedia.org/wiki/Pretty_Good_Privacy
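
A rough sketch of the proposal above, in Python rather than notmuch's
own C, just to make the two branches concrete.  The function names and
the message_global flag are hypothetical, not notmuch API.  The proposal
does not say which ASCII bytes stay literal in the percent-encoding
branch; this sketch assumes visible ASCII other than '%' is kept and
everything else (including space) is escaped.

```python
import unicodedata


def percent_encode(raw: bytes) -> str:
    """Escape every byte that is not visible ASCII, plus '%' itself.

    Assumption: bytes 0x21-0x7E except 0x25 ('%') are kept literally.
    """
    out = []
    for b in raw:
        if 0x21 <= b <= 0x7E and b != 0x25:
            out.append(chr(b))
        else:
            out.append(f"%{b:02X}")
    return "".join(out)


def normalize_message_id(raw: bytes, message_global: bool) -> str:
    """Hypothetical helper: map raw Message-ID bytes to NFKC Unicode."""
    if message_global:
        # message/global: headers are UTF-8 (per RFC 6532).
        text = raw.decode("utf-8")
    else:
        # Otherwise: no RFC 2047 or xn-- decoding is attempted, so any
        # "=?" or "xn--" sequences pass through as literal text.
        text = percent_encode(raw)
    # Store the result NFKC-normalized.
    return unicodedata.normalize("NFKC", text)


# The example from the proposal: 'ü%' as UTF-8 bytes.
print(normalize_message_id(b"\xc3\xbc\x25", message_global=False))
```

Escaping '%' itself keeps the mapping reversible, so two distinct raw
IDs in the non-message/global branch cannot collide after encoding.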