From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from localhost (localhost [127.0.0.1]) by olra.theworths.org (Postfix) with ESMTP id 79437431FBF for ; Sun, 29 Mar 2015 22:29:02 -0700 (PDT) X-Virus-Scanned: Debian amavisd-new at olra.theworths.org X-Spam-Flag: NO X-Spam-Score: 2.338 X-Spam-Level: ** X-Spam-Status: No, score=2.338 tagged_above=-999 required=5 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DNS_FROM_AHBL_RHSBL=2.438, RCVD_IN_DNSWL_NONE=-0.0001] autolearn=disabled Received: from olra.theworths.org ([127.0.0.1]) by localhost (olra.theworths.org [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id tmuG3yIFMfl1 for ; Sun, 29 Mar 2015 22:28:56 -0700 (PDT) Received: from resqmta-po-05v.sys.comcast.net (resqmta-po-05v.sys.comcast.net [96.114.154.164]) (using TLSv1 with cipher DHE-RSA-AES128-SHA (128/128 bits)) (No client certificate requested) by olra.theworths.org (Postfix) with ESMTPS id AF391431FAE for ; Sun, 29 Mar 2015 22:28:55 -0700 (PDT) Received: from resomta-po-01v.sys.comcast.net ([96.114.154.225]) by resqmta-po-05v.sys.comcast.net with comcast id 9hUu1q0014s37d401hUuw4; Mon, 30 Mar 2015 05:28:54 +0000 Received: from odin.tremily.us ([67.168.81.176]) by resomta-po-01v.sys.comcast.net with comcast id 9hSs1q00A3oF5yT01hSsVG; Mon, 30 Mar 2015 05:26:54 +0000 Received: by odin.tremily.us (Postfix, from userid 1000) id EC07E16F8071; Sun, 29 Mar 2015 22:26:51 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=tremily.us; s=odin; t=1427693211; bh=Jp/untkSO9O2zjt3gTGzIm69gcEKhGrC/KVtFAygSqY=; h=Date:From:To:Cc:Subject:References:In-Reply-To; b=V1wiJ5Vm0J1kkqDscyfp5v34UciutSH9FMW1lKguUTF8z3c2oRZPyJIpYGVUAas18 68xXcQrBrQ24ZPFSa1LndQiyQ1/Zj+/kg8AyTfwuOUdlmy6Qa3nT7/qwhRZNi8AdP5 XMoKP092vX0tOqtBR7sFhfljD0fZW3OIR7no3EQA= Date: Sun, 29 Mar 2015 22:26:51 -0700 From: "W. 
Trevor King" To: Sebastian Fischmeister Subject: Re: UnicodeDecodeError with python API Message-ID: <20150330052651.GS22036@odin.tremily.us> References: <874mp4q7e7.fsf@uwaterloo.ca> <20150329163658.GK22036@odin.tremily.us> <87ego7pfia.fsf@uwaterloo.ca> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="vSzmLLrdioqyxIBH" Content-Disposition: inline In-Reply-To: <87ego7pfia.fsf@uwaterloo.ca> OpenPGP: id=39A2F3FA2AB17E5D8764F388FC29BDCDF15F5BE8; url=http://tremily.us/pubkey.txt User-Agent: Mutt/1.5.23 (2014-03-12) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=comcast.net; s=q20140121; t=1427693334; bh=ZLlalfBAUD2HcSb1rb6a9XSYwlKCk342wNtpu3Oh9jM=; h=Received:Received:Received:Date:From:To:Subject:Message-ID: MIME-Version:Content-Type; b=Vk0Zx+blGwRXlFoLN9FG+kw6m3YT+GG7Xdq0R9yefDvp3g8QZtHiz+m/RqrSrbq5P FT56oT+KJnVgi0xG5pKTdl7DkvVwbP0PiKFD0Kan+PawR4Y2ABNZx/K+I6e/jUKGn0 EF6dZmgKEg185XPo8Gm2mXVQr2+Qmy+tdJUrSU+h9nchQHgMGwTIJ1c7t1QSjesr/H twdKzu079jt1RAhEz18TLd8ErSU3l2IK92ix3zQXEbRcWzjdD99Id6TiTumM2O0znZ 6jUxeKYUCmCIwiDWwdzx6xvyIzGPxOiPzJVC9Tq648tyTBgr44G0N0GsvHq3v2SHwW sF5vhcBMxmIlA== Cc: notmuch X-BeenThere: notmuch@notmuchmail.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: "Use and development of the notmuch mail system." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 30 Mar 2015 05:29:04 -0000 --vSzmLLrdioqyxIBH Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Sun, Mar 29, 2015 at 07:10:53PM -0400, Sebastian Fischmeister wrote: > > My first guess is that the file's encoding doesn't match your > > locale. Do you have a non-ASCII locale set? You can check with: >=20 > It seems to be more tricky than I thought. I didn't have a locale set. 
>
> When I set one, I can parse some emails with this:
>
>   export LANG=en_US.latin-1
>
> Others with this:
>
>   export LANG=en_US.UTF-8
>
> Others fail with either of the two.

Hmm, that's surprising. In hindsight, the locale should only be
affecting the *output* (e.g., a non-Unicode locale might cause a
UnicodeEncodeError). However, you're getting your errors on input.
I'd expect the files to be loaded and parsed as byte-streams, but
maybe there's a bug in Python's email parser. It wouldn't be the
first time it's had trouble with bytes-vs-Unicode (see these old bugs
with similar tracebacks from the initial transition to 3.0 [1,2], or
search “unicode email” on http://bugs.python.org/). I'd try to
reproduce this failure by calling email.message_from_file(…) directly
(getting notmuch out of the loop), and then file a bug against Python
once you have a pure-Python reproduction.

Cheers,
Trevor

[1]: http://bugs.python.org/issue1086
[2]: http://bugs.python.org/issue1258#msg56470

-- 
This email may be signed or encrypted with GnuPG (http://www.gnupg.org).
For more information, see http://en.wikipedia.org/wiki/Pretty_Good_Privacy --vSzmLLrdioqyxIBH Content-Type: application/pgp-signature; name="signature.asc" Content-Description: OpenPGP digital signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAEBAgAGBQJVGN6ZAAoJEG8/JgBt8ol81h0P/2Gcb33fy+HnXdaZTa4fWAg7 ylyHLfa0UUv3MasF92a4MB6jnEMcQhgKfnAr0Tp3z/N0R2UXk6n6upetTwr5pLk1 zyJX5MSk2aEedyjh3n9lid0z75sGjqygEsTppUCTkidLeFQzWFB4j2tdH+xKVcOU /HXmgcmDlh+HtOu1DRSIEETQxio1LiPBqUXFKZ/FGDlornAlFbBViX1XZaudj+9A /LeyD92SYkRiDaUvKpogFBpM0pClxSUzezXfIVRho18nwft5tmmatGgIWLMrntsD iaBLAqJcjUPgYuhcxtVAC4JAk+L2IWaJr4HuufN1UfLu5Uj5IwUvaIjFevkblMkl 0RQ2Hf8IhueN59d+QtbGbRiWHJBbf6PfBDXHukkeQSvrYjkwiJi2hoiLnJ4OhhDn 3wkVKyIp0fGZwsq1xTySFqlqd8rTGOG9vhnHYYDEurr5+AXYARH6/33MoprVjrhc gdtSZJfPhvj1mv2ilTBWOsVOV9/ar3qOMD3dhjqhQvxprghknUf63y7L7e1FwFKx Uj9LA1tbI2wxiX4enWSUeYxkjQU8bDwHaOmxBl7OUPiOtq7zNnGnfLBFSqJ/NVUM f7fZlBajQkoMuwuYgcPnwGp0C//RUNpIreCqSbzB9hi6D/hY5Wqas8ax3Pm4jrx4 a8akA1PFycrAaIj79HTV =lybv -----END PGP SIGNATURE----- --vSzmLLrdioqyxIBH--
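
A minimal sketch of the pure-Python reproduction suggested in the message, assuming Python 3.2 or later. The helper name try_parse and its encoding parameter are hypothetical, introduced only for illustration; email.message_from_file and email.message_from_binary_file are standard-library functions. The sketch parses each file once through a text-mode handle (where open() decodes the bytes with the given encoding, or with the locale's preferred encoding when none is given) and once through a binary handle, so you can see which path raises UnicodeDecodeError.

```python
import email
import sys


def try_parse(path, encoding=None):
    """Parse one message file in text mode and in binary mode.

    Hypothetical helper: returns a dict mapping "text" and "binary" to
    None on success, or to the UnicodeDecodeError that was raised.
    """
    results = {}

    # Text mode: open() decodes the bytes before the email parser sees
    # them. With encoding=None the locale's preferred encoding is used.
    try:
        with open(path, "r", encoding=encoding) as handle:
            email.message_from_file(handle)
        results["text"] = None
    except UnicodeDecodeError as error:
        results["text"] = error

    # Binary mode: the parser receives the raw bytes, so the locale
    # plays no part in reading the file.
    try:
        with open(path, "rb") as handle:
            email.message_from_binary_file(handle)
        results["binary"] = None
    except UnicodeDecodeError as error:
        results["binary"] = error

    return results


if __name__ == "__main__":
    # Report the outcome of both parsing modes for each path given.
    for path in sys.argv[1:]:
        for mode, error in try_parse(path).items():
            print(path, mode, "ok" if error is None else repr(error))
```

Run it as `python3 repro.py path/to/message` (the script name is hypothetical). If only the text-mode parse fails, the decode error probably comes from how the file is opened rather than from the email parser, which would be worth noting in any bug report.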