unofficial mirror of notmuch@notmuchmail.org
* [PATCH] nmbug: Allow Unicode tags and IDs in Python 2
@ 2016-02-15  5:30 W. Trevor King
  2016-02-16 13:04 ` David Bremner
  0 siblings, 1 reply; 4+ messages in thread
From: W. Trevor King @ 2016-02-15  5:30 UTC
  To: notmuch

Avoid a UnicodeWarning and broken pipe on 'nmbug commit' in Python 2
when a tag or message ID contains non-ASCII characters [1].

There are a number of Python bugs associated with this behavior
[2,3,4,5,6].  There's also some useful background in [8].  [3] led to
the currently working Python 3 implementation, which encodes to UTF-8
by default and has 'encoding' and 'errors' arguments [7].  This commit
follows that approach in a way that's compatible with both Python 2
and Python 3.  Coercing to UTF-8 (regardless of locale) gives us
consistent tag IDs for sharing between users.

The 'isnumeric' check identifies Unicode instances in both Python 2
[9] and Python 3 [10].

[1]: id:87twlbv5vj.fsf@zancas.localnet
     http://thread.gmane.org/gmane.mail.notmuch.general/21855/focus=21862
     Subject: Re: problems with nmbug and empty prefix (UnicodeWarning and broken pipe)
     Date: Sun, 14 Feb 2016 08:22:24 -0400
[2]: http://bugs.python.org/issue2637
[3]: http://bugs.python.org/issue3300
[4]: http://bugs.python.org/issue22231
[5]: http://bugs.python.org/issue23885
[6]: http://bugs.python.org/issue1712522
[7]: https://docs.python.org/3/library/urllib.parse.html#urllib.parse.quote
[8]: https://mail.python.org/pipermail/python-dev/2006-July/067335.html
[9]: https://docs.python.org/2/library/stdtypes.html#unicode.isnumeric
[10]: https://docs.python.org/3/library/stdtypes.html#str.isnumeric
---
I haven't checked the other commands for issues with Unicode IDs or
tags.  It's possible that in addition to this explicit encoding to
UTF-8, we'll also want explicit decoding from UTF-8 when reading from
Git trees (for 'nmbug checkout' and 'nmbug status').

Cheers,
Trevor

 devel/nmbug/nmbug | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/devel/nmbug/nmbug b/devel/nmbug/nmbug
index 81f582c..284d374 100755
--- a/devel/nmbug/nmbug
+++ b/devel/nmbug/nmbug
@@ -1,6 +1,6 @@
 #!/usr/bin/env python
 #
-# Copyright (c) 2011-2014 David Bremner <david@tethera.net>
+# Copyright (c) 2011-2016 David Bremner <david@tethera.net>
 #                         W. Trevor King <wking@tremily.us>
 #
 # This program is free software: you can redistribute it and/or modify
@@ -95,7 +95,7 @@ except AttributeError:  # Python < 3.2
     _tempfile.TemporaryDirectory = _TemporaryDirectory
 
 
-def _hex_quote(string, safe='+@=:,'):
+def _hex_quote(string, safe='+@=:,', encoding='utf-8', errors='strict'):
     """
     quote('abc def') -> 'abc%20def'.
 
@@ -103,6 +103,15 @@ def _hex_quote(string, safe='+@=:,'):
     addition to letters, digits, and '_.-') and lowercase hex digits
     (e.g. '%3a' instead of '%3A').
     """
+    if hasattr(string, 'isnumeric'):
+        string = string.encode(encoding, errors)
+    if hasattr(safe, 'isnumeric'):
+        safe_bytes = safe.encode(encoding, errors)
+        if len(safe_bytes) != len(safe):
+            raise ValueError(
+                'some safe characters are encoded as multiple bytes '
+                '({!r} -> {!r})'.format(safe, safe_bytes))
+        safe = safe_bytes
     uppercase_escapes = _quote(string, safe)
     return _HEX_ESCAPE_REGEX.sub(
         lambda match: match.group(0).lower(),
-- 
2.1.0.60.g85f0837
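
For readers who want to try the patched helper outside nmbug, here is a
minimal standalone sketch for Python 3.  It is not the code in
devel/nmbug/nmbug: the names hex_quote and HEX_ESCAPE_REGEX are
stand-ins, and it assumes that nmbug's _quote is urllib.parse.quote and
that _HEX_ESCAPE_REGEX matches uppercase '%XX' escapes.

```python
import re
from urllib.parse import quote

# Assumption: stands in for nmbug's _HEX_ESCAPE_REGEX.
HEX_ESCAPE_REGEX = re.compile('%[0-9A-F]{2}')


def hex_quote(string, safe='+@=:,', encoding='utf-8', errors='strict'):
    """Percent-quote with lowercase hex digits, accepting Unicode input."""
    # Convert from Unicode if necessary.
    if hasattr(string, 'isnumeric'):
        string = string.encode(encoding, errors)
    if hasattr(safe, 'isnumeric'):
        safe_bytes = safe.encode(encoding, errors)
        # A multi-byte safe character cannot be expressed as one safe byte.
        if len(safe_bytes) != len(safe):
            raise ValueError(
                'some safe characters are encoded as multiple bytes '
                '({!r} -> {!r})'.format(safe, safe_bytes))
        safe = safe_bytes
    uppercase_escapes = quote(string, safe)
    return HEX_ESCAPE_REGEX.sub(
        lambda match: match.group(0).lower(),
        uppercase_escapes)


print(hex_quote('abc def'))
print(hex_quote('caf\u00e9'))
```

Because the input is always encoded to UTF-8 before quoting, the result
should not depend on the caller's locale.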


* Re: [PATCH] nmbug: Allow Unicode tags and IDs in Python 2
  2016-02-15  5:30 [PATCH] nmbug: Allow Unicode tags and IDs in Python 2 W. Trevor King
@ 2016-02-16 13:04 ` David Bremner
  2016-02-16 17:56   ` W. Trevor King
  0 siblings, 1 reply; 4+ messages in thread
From: David Bremner @ 2016-02-16 13:04 UTC
  To: W. Trevor King, notmuch

"W. Trevor King" <wking@tremily.us> writes:

> Avoid a UnicodeWarning and broken pipe on 'nmbug commit' in Python 2
> when a tag or message ID contains non-ASCII characters [1].
>
> There are a number of Python bugs associated with this behavior
> [2,3,4,5,6].  There's also some useful background in [8].  [3] led to
> the currently working Python 3 implementation, which encodes to UTF-8
> by default and has 'encoding' and 'errors' arguments [7].  This commit
> follows that approach in a way that's compatible with both Python 2
> and Python 3.  Coercing to UTF-8 (regardless of locale) gives us
> consistent tag IDs for sharing between users.

I'm not sure what "tag IDs" are. Do you mean message-ids here? or "tags
and IDs"?

At first I thought there might be problems with non-utf8 message-ids,
but that turns out not to be the case [1].  It seems like it would take
a fairly heroic effort to get non-UTF8 tags into the database (perhaps
by calling the library interface with bad strings?) so we can probably
ignore this case. It might be good to document the limitation though,
since AFAIK, dump and restore can roundtrip any old crap.


>
> The 'isnumeric' check identifies Unicode instances in both Python 2
> [9] and Python 3 [10].
>

I still haven't really tried to understand this part, but probably it
deserves inline documentation.

> ---
> I haven't checked the other commands for issues with Unicode IDs or
> tags.  It's possible that in addition to this explicit encoding to
> UTF-8, we'll also want explicit decoding from UTF-8 when reading from
> Git trees (for 'nmbug checkout' and 'nmbug status').

Yes, this seems to be a problem.  With the patch applied I can commit,
but the same UTF-8 message-id still causes problems in 'nmbug status':

bremner@zancas:~/software/upstream/notmuch$ ./devel/nmbug/nmbug status
U	D1B4DEBCAFFC4A05A4D4349A6EC5C9D8@ÃÂÃ¥ðãÃ¥é-ÃÂÃÂ	unread
A	D1B4DEBCAFFC4A05A4D4349A6EC5C9D8@Ãåðãåé-ÃÃ	unread

bremner@zancas:~/software/upstream/notmuch$ delve -a -1 ~/Maildir/.notmuch/xapian | grep D1B4DEBCAFFC4A05A4D4349A6EC5C9D8
QD1B4DEBCAFFC4A05A4D4349A6EC5C9D8@Ñåðãåé-ÏÊ
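
The explicit decoding suggested in the patch notes can be illustrated
with a small Python 3 sketch.  It is an illustration only, not a
diagnosis of the output above: the shortened message ID is made up, and
the Latin-1 decode is just an assumed example of a wrong codec.

```python
from urllib.parse import quote, unquote_to_bytes

# Hypothetical, shortened message ID with non-ASCII characters.
message_id = 'D1B4@Ñåðãåé-ÏÊ'

# Encode to UTF-8 before quoting, as the patch does on commit.
stored = quote(message_id.encode('utf-8'), safe='+@=:,')

# Reading back: unquote to bytes, then decode explicitly as UTF-8.
raw = unquote_to_bytes(stored)
recovered = raw.decode('utf-8')

# Decoding the same bytes with the wrong codec expands every non-ASCII
# character into two characters.
garbled = raw.decode('latin-1')

print(recovered == message_id)
print(len(message_id), len(garbled))
```

If the read side decodes with the same codec that the write side
encoded with, the round trip is lossless.  A mismatch produces longer,
garbled IDs.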

[1]: id:87si0svnim.fsf@zancas.localnet


* Re: [PATCH] nmbug: Allow Unicode tags and IDs in Python 2
  2016-02-16 13:04 ` David Bremner
@ 2016-02-16 17:56   ` W. Trevor King
  2016-02-16 18:37     ` David Bremner
  0 siblings, 1 reply; 4+ messages in thread
From: W. Trevor King @ 2016-02-16 17:56 UTC
  To: David Bremner; +Cc: notmuch

On Tue, Feb 16, 2016 at 09:04:07AM -0400, David Bremner wrote:
> W. Trevor King writes:
> > Coercing to UTF-8 (regardless of locale) gives us consistent tag
> > IDs for sharing between users.
> 
> I'm not sure what "tag IDs" are. Do you mean message-ids here? or "tags
> and IDs"?

Yeah.  I'll fix that in v2.

> At first I thought there might be problems with non-utf8 message-ids,
> but that turns out not to be the case [1].  It seems like it would take
> a fairly heroic effort to get non-UTF8 tags into the database (perhaps
> by calling the library interface with bad strings?) so we can probably
> ignore this case. It might be good to document the limitation though,
> since AFAIK, dump and restore can roundtrip any old crap.

How about in a NEWS entry in v2 of this series, and then echoing that
NEWS entry in the notmuch-dtags (or whatever) man page once I work up
that series?

> > The 'isnumeric' check identifies Unicode instances in both Python
> > 2 [9] and Python 3 [10].
> 
> I still haven't really tried to understand this part, but probably
> it deserves inline documentation.

It's just “if you have a Unicode instance (str in Python 3, unicode in
Python 2), convert it to bytes (bytes in Python 3, str in Python 2)”.
Only Unicode instances have an ‘isnumeric’ method, so it's a
convenient marker for switching that logic.  I'll add a “convert from
Unicode if necessary” comment to v2.
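
The check described above can be sketched in isolation.  The helper
name to_bytes is hypothetical and does not appear in nmbug; the sketch
relies only on the fact that text strings have an 'isnumeric' method
and byte strings do not.

```python
def to_bytes(value, encoding='utf-8', errors='strict'):
    """Convert from Unicode if necessary (hypothetical helper)."""
    # Only Unicode instances (str in Python 3, unicode in Python 2)
    # have an 'isnumeric' method; byte strings do not.
    if hasattr(value, 'isnumeric'):
        return value.encode(encoding, errors)
    return value


print(to_bytes(u'caf\u00e9'))
print(to_bytes(b'abc'))
```

Testing for the method avoids naming the 'unicode' type, which does not
exist in Python 3, so the same source runs on both major versions.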

> > I haven't checked the other commands for issues with Unicode IDs
> > or tags.  It's possible that in addition to this explicit encoding
> > to UTF-8, we'll also want explicit decoding from UTF-8 when
> > reading from Git trees (for 'nmbug checkout' and 'nmbug status').
> 
> Yes, this seems to be a problem, with the patch applied I can
> commit, but the same utf-8 message-id causes problems.

Ugh.  Thanks for checking.  I'll try to fix all the places where this
would have an impact in v2 of this series.

Cheers,
Trevor

-- 
This email may be signed or encrypted with GnuPG (http://www.gnupg.org).
For more information, see http://en.wikipedia.org/wiki/Pretty_Good_Privacy


* Re: [PATCH] nmbug: Allow Unicode tags and IDs in Python 2
  2016-02-16 17:56   ` W. Trevor King
@ 2016-02-16 18:37     ` David Bremner
  0 siblings, 0 replies; 4+ messages in thread
From: David Bremner @ 2016-02-16 18:37 UTC
  To: W. Trevor King; +Cc: notmuch

"W. Trevor King" <wking@tremily.us> writes:

>
>> At first I thought there might be problems with non-utf8 message-ids,
>> but that turns out not to be the case [1].  It seems like it would take
>> a fairly heroic effort to get non-UTF8 tags into the database (perhaps
>> by calling the library interface with bad strings?) so we can probably
>> ignore this case. It might be good to document the limitation though,
>> since AFAIK, dump and restore can roundtrip any old crap.
>
> How about in a NEWS entry in v2 of this series, and then echoing that
> NEWS entry in the notmuch-dtags (or whatever) man page once I work up
> that series?

Sure, sounds fine

d
