* Encodings
@ 2011-07-11 14:04 Sebastian Spaeth
  2011-07-11 15:03 ` Encodings Carl Worth
  2011-07-12 21:29 ` Encodings Patrick Totzke
  0 siblings, 2 replies; 6+ messages in thread
From: Sebastian Spaeth @ 2011-07-11 14:04 UTC
  To: Notmuch developer list

Hi all,

after I was notified about how notmuch's python bindings perform
differently depending on whether we hand them (byte-based) ASCII strings
or unicode, I tried to disentangle which encodings to expect from the
library and which to send to it.

The answer is that things are very implicit. notmuch.h speaks of
strings but never mentions encodings, xapian docs don't mention
encodings but ojwb confirmed that it expects utf-8.

So, can we document what encoding we are expected to pass in the various
APIs and where we can guarantee to actually return UTF-8 encoded
strings? For some of the stuff we read directly from the files, eg
arbitrary headers, we can probably be least sure, but are e.g. the
returned tags always utf-8?

I would love to make the python bindings use unicode() instances in
cases where we can be sure to actually receive utf-8 encoded strings.

Encodings make my brain hurt. Unfortunately one cannot simply ignore
them.

Sebastian

* Re: Encodings
  2011-07-11 14:04 Encodings Sebastian Spaeth
@ 2011-07-11 15:03 ` Carl Worth
  2011-07-12 20:27   ` Encodings Patrick Totzke
  2011-07-12 21:29 ` Encodings Patrick Totzke
  1 sibling, 1 reply; 6+ messages in thread
From: Carl Worth @ 2011-07-11 15:03 UTC
  To: Sebastian Spaeth, Notmuch developer list

On Mon, 11 Jul 2011 16:04:17 +0200, Sebastian Spaeth <Sebastian@SSpaeth.de> wrote:
> The answer is that things are very implicit. notmuch.h speaks of
> strings but never mentions encodings

Much of this was intentional on my part.

For example, I intentionally avoided restrictions on what could be
stored as a tag in the database, (other than the terminating character
implied by "string" of course).

> So, can we document what encoding we are expected to pass in the various
> APIs

Yes, let's clarify documentation wherever we need to.

> For some of the stuff we read directly from the files, eg
> arbitrary headers, we can probably be least sure

The headers should be decoded to utf-8, (via
g_mime_utils_header_decode_text), before being stored in the database.

> but are e.g. the returned tags always utf-8?

No. The tag data is returned exactly as the user presented it.

> I would love to make the python bindings use unicode() instances in
> cases where we can be sure to actually receive utf-8 encoded strings.
>
> Encodings make my brain hurt. Unfortunately one cannot simply ignore
> them.

I think a lot of the pain here is due to some bad design decisions in
python itself. Of course, my saying that doesn't make things any easier
for you.

But do tell me what more we can do to clarify behavior or documentation.

-Carl

-- 
carl.d.worth@intel.com

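Carl's reply separates two cases: header values, which are recoded to
UTF-8 before they are stored, and tags, which come back byte-for-byte as
the user supplied them. A minimal Python 3 sketch of what that could mean
for a binding follows. The helpers header_to_text and tag_passthrough are
hypothetical names, not notmuch API, and bytes/str here play the roles
that str/unicode play in the Python 2 code discussed in this thread.

```python
def header_to_text(raw):
    """Decode a header value, assuming it was stored as UTF-8."""
    return raw.decode('utf-8')


def tag_passthrough(raw):
    """Return tag bytes untouched; their encoding is whatever the user chose."""
    return raw


# Illustrative values only, not taken from a real database.
subject = 'Gr\u00fc\u00dfe'.encode('utf-8')      # header bytes, UTF-8 by assumption
tag = 'wichtig-\u00e4'.encode('latin-1')         # a legal tag that is not valid UTF-8

subject_text = header_to_text(subject)           # safe to decode
tag_bytes = tag_passthrough(tag)                 # left as bytes on purpose
```

Under these assumptions only the header path hands text to the caller;
the tag path leaves the decoding decision to whoever knows how the tag
was written.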
* Re: Encodings
  2011-07-11 15:03 ` Encodings Carl Worth
@ 2011-07-12 20:27   ` Patrick Totzke
  0 siblings, 0 replies; 6+ messages in thread
From: Patrick Totzke @ 2011-07-12 20:27 UTC
  To: Carl Worth; +Cc: Notmuch developer list

Hi!
As discussed on irc, if notmuch stores header values in utf8,
it's safe to decode them to unicode instances here.
best,
/p

On Mon, Jul 11, 2011 at 08:03:38AM -0700, Carl Worth wrote:
> On Mon, 11 Jul 2011 16:04:17 +0200, Sebastian Spaeth <Sebastian@SSpaeth.de> wrote:
> > The answer is that things are very implicit. notmuch.h speaks of
> > strings but never mentions encodings
>
> Much of this was intentional on my part.
>
> For example, I intentionally avoided restrictions on what could be
> stored as a tag in the database, (other than the terminating character
> implied by "string" of course).
>
> > So, can we document what encoding we are expected to pass in the various
> > APIs
>
> Yes, let's clarify documentation wherever we need to.
>
> > For some of the stuff we read directly from the files, eg
> > arbitrary headers, we can probably be least sure
>
> The headers should be decoded to utf-8, (via
> g_mime_utils_header_decode_text), before being stored in the database.
>
> > but are e.g. the returned tags always utf-8?
>
> No. The tag data is returned exactly as the user presented it.
>
> > I would love to make the python bindings use unicode() instances in
> > cases where we can be sure to actually receive utf-8 encoded strings.
> >
> > Encodings make my brain hurt. Unfortunately one cannot simply ignore
> > them.
>
> I think a lot of the pain here is due to some bad design decisions in
> python itself. Of course, my saying that doesn't make things any easier
> for you.
>
> But do tell me what more we can do to clarify behavior or documentation.
>
> -Carl
>
> --
> carl.d.worth@intel.com

[-- Attachment #1.2: 0001-unicode-return-value-for-Message.get_header.patch --]
[-- Type: text/x-diff, Size: 1774 bytes --]

From 988a9832d714dfa0f91b2b1185a50acb4a6ca4b5 Mon Sep 17 00:00:00 2001
From: pazz <patricktotzke@gmail.com>
Date: Tue, 12 Jul 2011 19:47:39 +0100
Subject: [PATCH 1/8] unicode return value for Message.get_header()

As discussed in IRC, notmuch recodes mailheaders to utf-8, so we can
safely decode them into unicode instances.
---
 bindings/python/notmuch/message.py |    8 +++++---
 1 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/bindings/python/notmuch/message.py b/bindings/python/notmuch/message.py
index 763d2c6..4a43a88 100644
--- a/bindings/python/notmuch/message.py
+++ b/bindings/python/notmuch/message.py
@@ -379,14 +379,16 @@ class Message(object):
 
         :param header: The name of the header to be retrieved.
                        It is not case-sensitive (TODO: confirm).
-        :type header: str
-        :returns: The header value as string
+        :type header: str or unicode instance
+        :returns: The header value as a unicode string
         :exception: :exc:`NotmuchError`
 
                     * STATUS.NOT_INITIALIZED if the message
                       is not initialized.
                     * STATUS.NULL_POINTER, if no header was found
         """
+        if isinstance(header, unicode):
+            header = header.encode('utf-8')
         if self._msg is None:
             raise NotmuchError(STATUS.NOT_INITIALIZED)
 
@@ -394,7 +396,7 @@ class Message(object):
         header = Message._get_header (self._msg, header)
         if header == None:
             raise NotmuchError(STATUS.NULL_POINTER)
-        return header
+        return header.decode('utf-8')
 
     def get_filename(self):
         """Returns the file path of the message file
-- 
1.7.4.1

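The patch above follows an encode-on-input, decode-on-output pattern at
the boundary to the C library. A self-contained Python 3 sketch of the
same idea, under stated assumptions: _lookup and _FAKE_HEADERS are
hypothetical stand-ins for the C-level call and the database, and
LookupError stands in for the NotmuchError(STATUS.NULL_POINTER) that the
real patch raises.

```python
# Hypothetical stand-in for the stored headers: bytes names, UTF-8 values.
_FAKE_HEADERS = {b'subject': 'Encodings \u00e4\u00f6\u00fc'.encode('utf-8')}


def _lookup(name):
    """Pretend C call: takes a bytes name, returns UTF-8 bytes or None."""
    return _FAKE_HEADERS.get(name.lower())


def get_header(name):
    """Accept text or bytes for the header name; always return text."""
    if isinstance(name, str):
        name = name.encode('utf-8')      # encode on the way in
    value = _lookup(name)
    if value is None:
        # Stand-in for NotmuchError(STATUS.NULL_POINTER) in the patch.
        raise LookupError('header not found')
    return value.decode('utf-8')         # decode on the way out
```

With this shape, callers may pass either 'Subject' or b'Subject' and
get text back in both cases, which is only sound as long as the stored
value really is UTF-8.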
* Re: Encodings
  2011-07-11 14:04 Encodings Sebastian Spaeth
  2011-07-11 15:03 ` Encodings Carl Worth
@ 2011-07-12 21:29 ` Patrick Totzke
  2011-07-13 7:04   ` Encodings Uwe Kleine-König
  1 sibling, 1 reply; 6+ messages in thread
From: Patrick Totzke @ 2011-07-12 21:29 UTC
  To: Sebastian Spaeth; +Cc: Notmuch developer list

Hiya,

I noticed that commit 687366b920caa5de6ea0b66b70cf2a11e5399f7b
breaks things with Database.get_all_tags:

-------------------------------------->%-------------------------------------
AttributeError                            Traceback (most recent call last)

/home/pazz/projects/alot/<ipython console> in <module>()

/usr/local/lib/python2.7/dist-packages/notmuch/tag.pyc in next(self)
     86             # No need to call nmlib.notmuch_tags_valid(self._tags);
     87             # Tags._get safely returns None, if there is no more valid tag.
---> 88             tag = Tags._get(self._tags).decode('utf-8')
     89             if tag is None:
     90                 self._tags = None

AttributeError: 'NoneType' object has no attribute 'decode'
------------------------------------%<---------------------------------------

The reason is that Tags.next() tries to decode before it tests whether tag is None.
Now, we _could_ apply a patch like this one here:

---------------------------------->%-----------------------------------------
diff --git a/bindings/python/notmuch/tag.py b/bindings/python/notmuch/tag.py
index 65a9118..2ae670d 100644
--- a/bindings/python/notmuch/tag.py
+++ b/bindings/python/notmuch/tag.py
@@ -85,12 +85,12 @@ class Tags(object):
             raise NotmuchError(STATUS.NOT_INITIALIZED)
         # No need to call nmlib.notmuch_tags_valid(self._tags);
         # Tags._get safely returns None, if there is no more valid tag.
-        tag = Tags._get(self._tags).decode('utf-8')
+        tag = Tags._get(self._tags)
         if tag is None:
             self._tags = None
             raise StopIteration
         nmlib.notmuch_tags_move_to_next(self._tags)
-        return tag
+        return tag.decode('utf-8')
 
     def __nonzero__(self):
         """Implement bool(Tags) check that can be repeatedly used
-------------------------------------------%<-----------------------------

But as Carl says, we cannot guarantee that a tag is utf8 encoded anyway.
So I'd suggest we just revert the commit in question.

best,
/p

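The ordering bug described above (decoding before the None test) can be
reproduced and avoided in a few lines. The following Python 3 sketch is
a toy iterator, not the real notmuch Tags class: the list-backed storage
and the _pos counter are assumptions made so the example runs on its
own, and it still decodes as UTF-8, which this thread goes on to
question.

```python
class Tags:
    """Toy iterator over raw tag bytes; not the real notmuch.Tags class."""

    def __init__(self, raw_tags):
        self._raw = list(raw_tags)
        self._pos = 0

    def _get(self):
        # Mirrors the behaviour described in the thread: None when exhausted.
        if self._pos < len(self._raw):
            return self._raw[self._pos]
        return None

    def __iter__(self):
        return self

    def __next__(self):
        tag = self._get()
        if tag is None:                  # test for exhaustion first ...
            raise StopIteration
        self._pos += 1
        return tag.decode('utf-8')       # ... and only then decode
```

Calling .decode() directly on the result of _get(), as the broken
commit did, would raise AttributeError on the final iteration step
instead of ending the loop cleanly.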
* Re: Encodings
  2011-07-12 21:29 ` Encodings Patrick Totzke
@ 2011-07-13 7:04   ` Uwe Kleine-König
  2011-07-13 9:03     ` Encodings Patrick Totzke
  0 siblings, 1 reply; 6+ messages in thread
From: Uwe Kleine-König @ 2011-07-13 7:04 UTC
  To: Patrick Totzke; +Cc: Notmuch developer list

Hi Patrick,

On Tue, Jul 12, 2011 at 10:29:58PM +0100, Patrick Totzke wrote:
> I noticed that commit 687366b920caa5de6ea0b66b70cf2a11e5399f7b
> breaks things with Database.get_all_tags:
>
> -------------------------------------->%-------------------------------------
> AttributeError                            Traceback (most recent call last)
>
> /home/pazz/projects/alot/<ipython console> in <module>()
>
> /usr/local/lib/python2.7/dist-packages/notmuch/tag.pyc in next(self)
>      86             # No need to call nmlib.notmuch_tags_valid(self._tags);
>      87             # Tags._get safely returns None, if there is no more valid tag.
> ---> 88             tag = Tags._get(self._tags).decode('utf-8')
>      89             if tag is None:
>      90                 self._tags = None
>
> AttributeError: 'NoneType' object has no attribute 'decode'
> ------------------------------------%<---------------------------------------
>
> The reason is that the Tags.next() tries to decode before it tests if tag is None.
> Now, we _could_ apply a patch like this one here:
>
> ---------------------------------->%-----------------------------------------
> diff --git a/bindings/python/notmuch/tag.py b/bindings/python/notmuch/tag.py
> index 65a9118..2ae670d 100644
> --- a/bindings/python/notmuch/tag.py
> +++ b/bindings/python/notmuch/tag.py
> @@ -85,12 +85,12 @@ class Tags(object):
>              raise NotmuchError(STATUS.NOT_INITIALIZED)
>          # No need to call nmlib.notmuch_tags_valid(self._tags);
>          # Tags._get safely returns None, if there is no more valid tag.
> -        tag = Tags._get(self._tags).decode('utf-8')
> +        tag = Tags._get(self._tags)
>          if tag is None:
>              self._tags = None
>              raise StopIteration
>          nmlib.notmuch_tags_move_to_next(self._tags)
> -        return tag
> +        return tag.decode('utf-8')
>
>      def __nonzero__(self):
>          """Implement bool(Tags) check that can be repeatedly used
> -------------------------------------------%<-----------------------------
>
> But as Carl says, we cannot guarantee that a tag is utf8 encoded anyway.
I think it would be right to enforce that tags are utf-8 encoded.
Otherwise the users get strange results if they change their locale.

Best regards
Uwe

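Uwe's locale concern can be made concrete with a small Python 3 sketch.
It assumes, purely for illustration, that a user interface encodes tag
text with the charset of the current locale before handing it to
notmuch; the tag text and the two charsets are made-up examples.

```python
text = 'M\u00e4rz'                       # the tag as the user typed it

as_latin1 = text.encode('latin-1')       # bytes a Latin-1 locale would store
as_utf8 = text.encode('utf-8')           # bytes a UTF-8 locale would store

# Same tag text, different bytes in the database.
same_bytes = as_latin1 == as_utf8

# Reading the Latin-1 bytes back as UTF-8 does not round-trip:
# the invalid byte becomes U+FFFD, the replacement character.
read_back = as_latin1.decode('utf-8', errors='replace')
```

A tag written under one locale therefore neither matches nor displays
like the "same" tag under another, which is the kind of strange result
that enforcing a single encoding would rule out.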
* Re: Encodings
  2011-07-13 7:04 ` Encodings Uwe Kleine-König
@ 2011-07-13 9:03   ` Patrick Totzke
  0 siblings, 0 replies; 6+ messages in thread
From: Patrick Totzke @ 2011-07-13 9:03 UTC
  To: Uwe Kleine-König; +Cc: Patrick Totzke, Notmuch developer list

Hi Uwe,

On Wed, Jul 13, 2011 at 09:04:47AM +0200, Uwe Kleine-König wrote:
> > But as Carl says, we cannot guarantee that a tag is utf8 encoded anyway.
> I think it would be right to enforce that tags are utf-8 encoded.
> Otherwise the users get strange results if they change their locale.

I agree that it would be very nice indeed if it was safe to assume all
tags are utf-8. But I also see that it's a bit of an effort to ensure
this, as all UIs would have to explicitly recode anything that isn't
utf-8.

It seems to be a consciously made design decision to allow other
encodings for tags, which is up for discussion of course. All I'm saying
is that the bindings should conform. And if it's not safe to assume
utf-8 here, we shouldn't decode as such.

I'm unsure what happens in all the new get_part() parts of the api. If
all MIME part text there is also returned as utf-8, it would only be
consistent to bend tag encodings to utf-8 as well. But I doubt that's
the case. Can anyone clarify this?

/Patrick