From: David Bremner
To: "W. Trevor King", notmuch@notmuchmail.org
Subject: Re: [PATCH] nmbug: Allow Unicode tags and IDs in Python 2
Date: Tue, 16 Feb 2016 09:04:07 -0400
Message-ID: <87lh6kvmbc.fsf@zancas.localnet>

"W. Trevor King" writes:

> Avoid a UnicodeWarning and broken pipe on 'nmbug commit' in Python 2
> when a tag or message ID contains non-ASCII characters [1].
>
> There are a number of Python bugs associated with this behavior
> [2,3,4,5,6].
> There's also some useful background in [8]. [3] lead to
> the currently working Python 3 implementation, which encodes to UTF-8
> by default and has 'encoding' and 'errors' arguments [7]. This commit
> follows that approach in a way that's compatible with both Python 2
> and Python 3. Coercing to UTF-8 (regardless of locale) gives us
> consistent tag IDs for sharing between users.

I'm not sure what "tag IDs" are. Do you mean message-ids here? Or "tags
and IDs"?

At first I thought there might be problems with non-UTF-8 message-ids,
but that turns out not to be the case [1]. It seems like it would take a
fairly heroic effort to get non-UTF-8 tags into the database (perhaps by
calling the library interface with bad strings?), so we can probably
ignore this case. It might be good to document the limitation though,
since AFAIK, dump and restore can roundtrip any old crap.

>
> The 'isnumeric' check identifies Unicode instances in both Python 2
> [9] and Python 3 [10].
>

I still haven't really tried to understand this part, but it probably
deserves inline documentation.

> ---
> I haven't checked the other commands for issues with Unicode IDs or
> tags. It's possible that in addition to this explicit encoding to
> UTF-8, we'll also want explicit decoding from UTF-8 when reading from
> Git trees (for 'nmbug checkout' and 'nmbug status').

Yes, this seems to be a problem. With the patch applied I can commit,
but the same UTF-8 message-id still causes problems:
bremner@zancas:~/software/upstream/notmuch$ ./devel/nmbug/nmbug status
U D1B4DEBCAFFC4A05A4D4349A6EC5C9D8@ÃÂÃÂ¥ÃÂ°ÃÂ£ÃÂ¥ÃÂ©-ÃÂÃÂ unread
A D1B4DEBCAFFC4A05A4D4349A6EC5C9D8@ÃÃ¥Ã°Ã£Ã¥Ã©-ÃÃ unread
bremner@zancas:~/software/upstream/notmuch$ delve -a -1 ~/Maildir/.notmuch/xapian | grep D1B4DEBCAFFC4A05A4D4349A6EC5C9D8
QD1B4DEBCAFFC4A05A4D4349A6EC5C9D8@Ñåðãåé-ÏÊ

[1]: id:87si0svnim.fsf@zancas.localnet
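
For reference, here is a minimal sketch of the two things discussed above. It is not the code from the patch. The helper names (to_utf8_bytes, misdecode_round, visible) are made up for illustration, and the guess that the extra encoding rounds go through Latin-1 is only an inference from the byte patterns in the status output, not something checked against nmbug itself.

```python
# Hypothetical illustration only; none of these helpers exist in nmbug.

def to_utf8_bytes(value, encoding='utf-8', errors='strict'):
    """Encode text exactly once; pass byte strings through untouched.

    Only text types have .isnumeric(): 'unicode' in Python 2 and 'str'
    in Python 3.  Byte strings ('str' in Python 2, 'bytes' in Python 3)
    do not, so the hasattr() test works on both versions.
    """
    if hasattr(value, 'isnumeric'):
        return value.encode(encoding, errors)
    return value


def misdecode_round(data):
    """Simulate one bad round trip (assumption: via Latin-1).

    The UTF-8 bytes are misread as Latin-1 text and then encoded to
    UTF-8 again, so every non-ASCII byte becomes two bytes.
    """
    return data.decode('latin-1').encode('utf-8')


def visible(data):
    """Decode as UTF-8 and drop C1 control characters (U+0080..U+009F),
    which do not show up when output is pasted from a terminal."""
    text = data.decode('utf-8')
    return u''.join(c for c in text if not 0x80 <= ord(c) <= 0x9f)


# Host part of the message-id as stored in Xapian (the 'Q' term above).
host = u'\u00d1\u00e5\u00f0\u00e3\u00e5\u00e9-\u00cf\u00ca'

once = to_utf8_bytes(host)          # correct: encoded a single time
assert to_utf8_bytes(once) is once  # bytes are not encoded again

twice = misdecode_round(once)       # one extra round of encoding
thrice = misdecode_round(twice)     # two extra rounds of encoding

twice_shown = visible(twice)
thrice_shown = visible(thrice)
```

Under those assumptions, twice_shown and thrice_shown come out the same as the host parts of the 'A' and 'U' lines in the status output above (read as UTF-8 text). That would fit the ID being encoded on the way into Git without a matching decode on the way back out, but I haven't confirmed where in nmbug that happens.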