From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from localhost (localhost [127.0.0.1])
 by arlo.cworth.org (Postfix) with ESMTP id 638326DE0C66
 for ; Tue, 31 Oct 2017 12:21:46 -0700 (PDT)
X-Virus-Scanned: Debian amavisd-new at cworth.org
X-Spam-Flag: NO
X-Spam-Score: 0.129
X-Spam-Level:
X-Spam-Status: No, score=0.129 tagged_above=-999 required=5
 tests=[AWL=0.140, SPF_PASS=-0.001, T_RP_MATCHES_RCVD=-0.01]
 autolearn=disabled
Received: from arlo.cworth.org ([127.0.0.1])
 by localhost (arlo.cworth.org [127.0.0.1]) (amavisd-new, port 10024)
 with ESMTP id ZFEhqSo9MTYj for ; Tue, 31 Oct 2017 12:21:45 -0700 (PDT)
Received: from istari.evenmere.org (istari.evenmere.org [136.248.125.194])
 by arlo.cworth.org (Postfix) with ESMTP id 36B1B6DE0C3F
 for ; Tue, 31 Oct 2017 12:21:45 -0700 (PDT)
Received: by istari.evenmere.org (Postfix, from userid 1000)
 id 7DBD51E0063; Tue, 31 Oct 2017 15:21:40 -0400 (EDT)
From: Brian Sniffen
To: Matthew Lear
Cc: Daniel Kahn Gillmor, Jani Nikula, Vladimir Panteleev,
 notmuch@notmuchmail.org
Subject: Re: web interface to notmuch
In-Reply-To:
References: <87tvyvp4f2.fsf@istari.evenmere.org>
 <87376f13ho.fsf@fifthhorseman.net>
 <87r2tww9tr.fsf@nikula.org>
 <87wp3ow39i.fsf@fifthhorseman.net>
 <27e53def-32b4-45ab-1192-77cc0e837a93@gmail.com>
 <87zi8eopgq.fsf@istari.evenmere.org>
 <877evhy53k.fsf@fifthhorseman.net>
 <87she5nsmy.fsf@istari.evenmere.org>
 <87inf1gm7l.fsf@fifthhorseman.net>
 <87mv4co4vz.fsf@istari.evenmere.org>
Date: Tue, 31 Oct 2017 15:21:40 -0400
Message-ID: <87h8ufnmwr.fsf@istari.evenmere.org>
MIME-Version: 1.0
Content-Type: text/plain
X-BeenThere: notmuch@notmuchmail.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: "Use and development of the notmuch mail system."
List-Unsubscribe:
List-Archive:
List-Post:
List-Help:
List-Subscribe:
X-List-Received-Date: Tue, 31 Oct 2017 19:21:46 -0000

> just remove it), but along the way of searching and viewing mail, I've
> encountered quite a few occurrences of failing to UnicodeEncode. An example
> backtrace looks like this:
>
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/dist-packages/web/application.py", line 239, in
> process
>     return self.handle()
>   File "/usr/lib/python2.7/dist-packages/web/application.py", line 230, in
> handle
>     return self._delegate(fn, self.fvars, args)
>   File "/usr/lib/python2.7/dist-packages/web/application.py", line 420, in
> _delegate
>     return handle_class(cls)
>   File "/usr/lib/python2.7/dist-packages/web/application.py", line 396, in
> handle_class
>     return tocall(*args)
>   File "/b/git/notmuch-brians.git/contrib/notmuch-web/nmweb.py", line 153,
> in GET
>     sprefix=webprefix)
>   File "/usr/lib/python2.7/dist-packages/jinja2/environment.py", line 989,
> in render
>     return self.environment.handle_exception(exc_info, True)
>   File "/usr/lib/python2.7/dist-packages/jinja2/environment.py", line 754,
> in handle_exception
>     reraise(exc_type, exc_value, tb)
>   File "templates/show.html", line 1, in top-level template code
>     {% extends "base.html" %}
>   File "templates/base.html", line 32, in top-level template code
>     {% block content %}
>   File "templates/show.html", line 12, in block "content"
>     {% for part in format_message(m.get_filename(),mid): %}{{ part|safe
> }}{% endfor %}
>   File "/b/git/notmuch-brians.git/contrib/notmuch-web/nmweb.py", line 245,
> in format_message_walk
>     tags=safe_tags).encode(part.get_content_charset('ascii')))
> UnicodeEncodeError: 'latin-1' codec can't encode character u'\u201c' in
> position 1141: ordinal not in range(256)
>
> 127.0.0.1:60968 - - [31/Oct/2017 17:00:02] "HTTP/1.1 GET /show/
> 665d8c5c2b024898ae21951c4b8b4f93@CO2PR05MB747.namprd05.prod.outlook.com" -
> 500 Internal Server Error
>
> I'm no Python expert, but from a quick google it would seem like the cause
> of such an exception is related to not using utf-8.

Neat. So to get there, this has to be a text/html part. It has to
have been decoded, either with the charset declared in its content type
or with ascii. If a \u201c (left double quote) showed up, it didn't get
decoded as ascii---and indeed, it looks like the content-type specifies
latin-1. But now when we try to encode back, using the same latin-1,
it fails? That's really neat.

> Brian - do you think something needs modifying in nmweb.py to cater for
> this type of thing, or is this somehow related to my own mailstore (not sure
> why that would be as my messages haven't been modified).

Lots of mail has busted encoding. I've done some defensive work
against that---look at decodeAnyway and shed a tear for purity---but
clearly not enough. Can you send me a message that causes the problem?

In the meantime, I think line 245 ought to be, appropriately indented:

    tags=safe_tags).encode(part.get_content_charset('ascii'), 'xmlcharrefreplace'))

Thanks for the report---investigating it showed me that the search box
doesn't tolerate that character either.

-Brian
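
P.S. For anyone who wants to see the effect of that extra argument in isolation, here is a minimal sketch. It assumes Python 3, not the Python 2.7 in the traceback, and encode_part is a hypothetical helper written for this illustration, not a function in nmweb.py. The only thing taken from the suggested change is the 'xmlcharrefreplace' error handler, which is a standard Python codec error handler.

```python
def encode_part(text, charset="ascii"):
    """Encode text in the given charset.

    Characters the charset cannot represent become XML numeric character
    references (e.g. &#8220;) instead of raising UnicodeEncodeError.
    Hypothetical helper for illustration only.
    """
    return text.encode(charset, "xmlcharrefreplace")


# Sample text containing U+201C and U+201D, which latin-1 cannot represent.
html = "He said \u201chello\u201d"

# The default (strict) error handling fails, as in the reported traceback.
try:
    html.encode("latin-1")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# With the replacement handler, the encode succeeds.
encoded = encode_part(html, "latin-1")

print(strict_failed)
print(encoded)
```

Since the part is text/html, a browser should render the numeric references as the original quote characters, so nothing is lost in the displayed page. Characters that the charset can represent are encoded as usual.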