unofficial mirror of notmuch@notmuchmail.org
* Encodings
@ 2011-07-11 14:04 Sebastian Spaeth
  2011-07-11 15:03 ` Encodings Carl Worth
  2011-07-12 21:29 ` Encodings Patrick Totzke
  0 siblings, 2 replies; 6+ messages in thread
From: Sebastian Spaeth @ 2011-07-11 14:04 UTC (permalink / raw)
  To: Notmuch developer list

[-- Attachment #1: Type: text/plain, Size: 928 bytes --]

Hi all,
after I was notified that notmuch's Python bindings behave differently
depending on whether we hand them (byte-based) ASCII strings or unicode,
I tried to disentangle which encodings to expect from the library and
which to send to it. The answer is that things are very implicit.
notmuch.h speaks of strings but never mentions encodings; the Xapian
docs don't mention encodings either, but ojwb confirmed that Xapian
expects UTF-8.

So, can we document what encoding we are expected to pass in to the
various APIs, and where we can guarantee that we actually return UTF-8
encoded strings? For some of the stuff we read directly from the files,
e.g. arbitrary headers, we can probably be least sure, but are e.g. the
returned tags always UTF-8?

I would love to make the python bindings use unicode() instances in
cases where we can be sure to actually receive utf-8 encoded strings.
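
A minimal sketch of the boundary in question, written for Python 3, where
the Python 2 `unicode` and `str` types discussed in this thread correspond
to `str` and `bytes`. The helper name `to_c_bytes` is hypothetical and not
part of the notmuch bindings; it only illustrates encoding text to UTF-8
before it crosses into a C API through ctypes:

```python
import ctypes


def to_c_bytes(value, encoding="utf-8"):
    """Encode text to bytes for a C API; pass bytes through unchanged.

    Hypothetical helper, not part of the notmuch bindings.
    """
    if isinstance(value, str):
        return value.encode(encoding)
    return value


query = "tag:caf\u00e9"        # text, as a UI would typically hold it
raw = to_c_bytes(query)        # UTF-8 encoded bytes
c_arg = ctypes.c_char_p(raw)   # c_char_p accepts bytes, not text

print(raw)
print(c_arg.value == raw)
```

In Python 3, `ctypes.c_char_p` rejects `str` outright, so the encoding step
has to happen explicitly somewhere on the Python side.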

Encodings make my brain hurt. Unfortunately one cannot simply ignore
them.

Sebastian


* Re: Encodings
  2011-07-11 14:04 Encodings Sebastian Spaeth
@ 2011-07-11 15:03 ` Carl Worth
  2011-07-12 20:27   ` Encodings Patrick Totzke
  2011-07-12 21:29 ` Encodings Patrick Totzke
  1 sibling, 1 reply; 6+ messages in thread
From: Carl Worth @ 2011-07-11 15:03 UTC (permalink / raw)
  To: Sebastian Spaeth, Notmuch developer list

[-- Attachment #1: Type: text/plain, Size: 1408 bytes --]

On Mon, 11 Jul 2011 16:04:17 +0200, Sebastian Spaeth <Sebastian@SSpaeth.de> wrote:
> The answer is that things are very implicit. notmuch.h speaks of
> strings but never mentions encodings

Much of this was intentional on my part.

For example, I intentionally avoided restrictions on what could be
stored as a tag in the database, (other than the terminating character
implied by "string" of course).

> So, can we document what encoding we are expected to pass in the various
> APIs

Yes, let's clarify documentation wherever we need to.

> For some of the stuff we read directly from the files, e.g.
> arbitrary headers, we can probably be least sure

The headers should be decoded to utf-8, (via
g_mime_utils_header_decode_text), before being stored in the database.

> but are e.g. the returned tags always utf-8?

No. The tag data is returned exactly as the user presented it.
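
Since tags come back exactly as they were stored, a binding cannot assume
that they decode as UTF-8. A minimal sketch of what that means in practice,
in Python 3 (`bytes`/`str` rather than the Python 2 `str`/`unicode`
discussed here); `decode_tag` is a hypothetical helper, not a notmuch API:

```python
def decode_tag(raw, encoding="utf-8"):
    """Try to decode a raw tag.

    Returns (text, ok). When the bytes are not valid in `encoding`,
    ok is False and undecodable bytes are replaced with U+FFFD.
    Hypothetical helper for illustration only.
    """
    try:
        return raw.decode(encoding), True
    except UnicodeDecodeError:
        return raw.decode(encoding, errors="replace"), False


utf8_tag = "b\u00fcro".encode("utf-8")      # tag typed in a UTF-8 locale
latin1_tag = "b\u00fcro".encode("latin-1")  # same tag typed in a Latin-1 locale

print(decode_tag(utf8_tag))
print(decode_tag(latin1_tag))
```

The second call reports failure because the single byte 0xFC is not valid
UTF-8, which is the case a binding would have to handle if it decoded tags
unconditionally.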

> I would love to make the python bindings use unicode() instances in
> cases where we can be sure to actually receive utf-8 encoded strings.
> 
> Encodings make my brain hurt. Unfortunately one cannot simply ignore
> them.

I think a lot of the pain here is due to some bad design decisions in
python itself. Of course, my saying that doesn't make things any easier
for you.

But do tell me what more we can do to clarify behavior or documentation.

-Carl

-- 
carl.d.worth@intel.com


* Re: Encodings
  2011-07-11 15:03 ` Encodings Carl Worth
@ 2011-07-12 20:27   ` Patrick Totzke
  0 siblings, 0 replies; 6+ messages in thread
From: Patrick Totzke @ 2011-07-12 20:27 UTC (permalink / raw)
  To: Carl Worth; +Cc: Notmuch developer list


[-- Attachment #1.1: Type: text/plain, Size: 1849 bytes --]

Hi!

As discussed on IRC, if notmuch stores header values in UTF-8,
it's safe to decode them to unicode instances, as the attached patch does.
best,
/p


On Mon, Jul 11, 2011 at 08:03:38AM -0700, Carl Worth wrote:
> On Mon, 11 Jul 2011 16:04:17 +0200, Sebastian Spaeth <Sebastian@SSpaeth.de> wrote:
> > The answer is that things are very implicit. notmuch.h speaks of
> > strings but never mentions encodings
> 
> Much of this was intentional on my part.
> 
> For example, I intentionally avoided restrictions on what could be
> stored as a tag in the database, (other than the terminating character
> implied by "string" of course).
> 
> > So, can we document what encoding we are expected to pass in the various
> > APIs
> 
> Yes, let's clarify documentation wherever we need to.
> 
> > For some of the stuff we read directly from the files, e.g.
> > arbitrary headers, we can probably be least sure
> 
> The headers should be decoded to utf-8, (via
> g_mime_utils_header_decode_text), before being stored in the database.
> 
> > but are e.g. the returned tags always utf-8?
> 
> No. The tag data is returned exactly as the user presented it.
> 
> > I would love to make the python bindings use unicode() instances in
> > cases where we can be sure to actually receive utf-8 encoded strings.
> > 
> > Encodings make my brain hurt. Unfortunately one cannot simply ignore
> > them.
> 
> I think a lot of the pain here is due to some bad design decisions in
> python itself. Of course, my saying that doesn't make things any easier
> for you.
> 
> But do tell me what more we can do to clarify behavior or documentation.
> 
> -Carl
> 
> -- 
> carl.d.worth@intel.com





[-- Attachment #1.2: 0001-unicode-return-value-for-Message.get_header.patch --]
[-- Type: text/x-diff, Size: 1774 bytes --]

From 988a9832d714dfa0f91b2b1185a50acb4a6ca4b5 Mon Sep 17 00:00:00 2001
From: pazz <patricktotzke@gmail.com>
Date: Tue, 12 Jul 2011 19:47:39 +0100
Subject: [PATCH 1/8] unicode return value for Message.get_header()

As discussed in IRC, notmuch recodes mailheaders to
utf-8, so we can safely decode them into unicode instances.
---
 bindings/python/notmuch/message.py |    8 +++++---
 1 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/bindings/python/notmuch/message.py b/bindings/python/notmuch/message.py
index 763d2c6..4a43a88 100644
--- a/bindings/python/notmuch/message.py
+++ b/bindings/python/notmuch/message.py
@@ -379,14 +379,16 @@ class Message(object):
 
         :param header: The name of the header to be retrieved.
                        It is not case-sensitive (TODO: confirm).
-        :type header: str
-        :returns: The header value as string
+        :type header: str or unicode instance
+        :returns: The header value as a unicode string
         :exception: :exc:`NotmuchError`
 
                     * STATUS.NOT_INITIALIZED if the message 
                       is not initialized.
                     * STATUS.NULL_POINTER, if no header was found
         """
+        if isinstance(header, unicode):
+            header = header.encode('utf-8')
         if self._msg is None:
             raise NotmuchError(STATUS.NOT_INITIALIZED)
 
@@ -394,7 +396,7 @@ class Message(object):
         header = Message._get_header (self._msg, header)
         if header == None:
             raise NotmuchError(STATUS.NULL_POINTER)
-        return header
+        return header.decode('utf-8')
 
     def get_filename(self):
         """Returns the file path of the message file
-- 
1.7.4.1
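
The behaviour this patch aims for can be sketched in isolation. The
following is a Python 3 approximation (`str`/`bytes` instead of
`unicode`/`str`). It uses a hypothetical `FakeMessage` class with an
in-memory dict in place of the real C call, and `LookupError` in place of
`NotmuchError(STATUS.NULL_POINTER)`, so it shows only the encode-in,
decode-out shape and not the actual bindings:

```python
class FakeMessage:
    """Hypothetical stand-in for the bindings' Message class."""

    def __init__(self, headers):
        # Keep names and values as UTF-8 bytes, mimicking what a C
        # library would hand back after recoding headers to UTF-8.
        self._headers = {
            name.encode("utf-8"): value.encode("utf-8")
            for name, value in headers.items()
        }

    def _get_header(self, name):
        # Stand-in for the C call: bytes in, bytes or None out.
        return self._headers.get(name)

    def get_header(self, header):
        # Accept text or bytes for the header name.
        if isinstance(header, str):
            header = header.encode("utf-8")
        raw = self._get_header(header)
        # Check for a missing header before decoding.
        if raw is None:
            raise LookupError("no such header")
        return raw.decode("utf-8")


msg = FakeMessage({"Subject": "Gr\u00fc\u00dfe"})
print(msg.get_header("Subject"))
print(msg.get_header(b"Subject"))
```

Both calls return the same text value, whichever type the caller used for
the header name.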



* Re: Encodings
  2011-07-11 14:04 Encodings Sebastian Spaeth
  2011-07-11 15:03 ` Encodings Carl Worth
@ 2011-07-12 21:29 ` Patrick Totzke
  2011-07-13  7:04   ` Encodings Uwe Kleine-König
  1 sibling, 1 reply; 6+ messages in thread
From: Patrick Totzke @ 2011-07-12 21:29 UTC (permalink / raw)
  To: Sebastian Spaeth; +Cc: Notmuch developer list

[-- Attachment #1: Type: text/plain, Size: 2094 bytes --]

Hiya,

I noticed that commit 687366b920caa5de6ea0b66b70cf2a11e5399f7b
breaks things with Database.get_all_tags:

-------------------------------------->%-------------------------------------
AttributeError                            Traceback (most recent call last)

/home/pazz/projects/alot/<ipython console> in <module>()

/usr/local/lib/python2.7/dist-packages/notmuch/tag.pyc in next(self)
     86         # No need to call nmlib.notmuch_tags_valid(self._tags);

     87         # Tags._get safely returns None, if there is no more valid tag.

---> 88         tag = Tags._get(self._tags).decode('utf-8')
     89         if tag is None:
     90             self._tags = None

AttributeError: 'NoneType' object has no attribute 'decode'
------------------------------------%<---------------------------------------

The reason is that Tags.next() tries to decode the tag before it tests whether it is None.
Now, we _could_ apply a patch like this one here:

---------------------------------->%-----------------------------------------
diff --git a/bindings/python/notmuch/tag.py b/bindings/python/notmuch/tag.py
index 65a9118..2ae670d 100644
--- a/bindings/python/notmuch/tag.py
+++ b/bindings/python/notmuch/tag.py
@@ -85,12 +85,12 @@ class Tags(object):
             raise NotmuchError(STATUS.NOT_INITIALIZED)
         # No need to call nmlib.notmuch_tags_valid(self._tags);
         # Tags._get safely returns None, if there is no more valid tag.
-        tag = Tags._get(self._tags).decode('utf-8')
+        tag = Tags._get(self._tags)
         if tag is None:
             self._tags = None
             raise StopIteration
         nmlib.notmuch_tags_move_to_next(self._tags)
-        return tag
+        return tag.decode('utf-8')
 
     def __nonzero__(self):
         """Implement bool(Tags) check that can be repeatedly used
-------------------------------------------%<-----------------------------

But as Carl says, we cannot guarantee that a tag is UTF-8 encoded anyway.
So I'd suggest we just revert the commit in question.
best,
/p
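
The ordering issue behind the traceback above can be reproduced without
notmuch. A minimal Python 3 sketch, assuming a hypothetical `TagIterator`
whose `_get` returns None once it is exhausted, as `Tags._get` is described
as doing in the quoted source comments:

```python
class TagIterator:
    """Hypothetical stand-in for the bindings' Tags iterator."""

    def __init__(self, raw_tags):
        self._raw = list(raw_tags)
        self._pos = 0

    def _get(self):
        # Returns None when there is no more valid tag.
        if self._pos < len(self._raw):
            return self._raw[self._pos]
        return None

    def __iter__(self):
        return self

    def __next__(self):
        tag = self._get()
        # Test for exhaustion first; calling .decode() on None would
        # raise AttributeError instead of ending the iteration.
        if tag is None:
            raise StopIteration
        self._pos += 1
        return tag.decode("utf-8")


print(list(TagIterator([b"inbox", b"unread"])))
print(list(TagIterator([])))
```

This only addresses the None check. Whether decoding tags as UTF-8 is
appropriate at all is the separate question raised in this thread.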


* Re: Encodings
  2011-07-12 21:29 ` Encodings Patrick Totzke
@ 2011-07-13  7:04   ` Uwe Kleine-König
  2011-07-13  9:03     ` Encodings Patrick Totzke
  0 siblings, 1 reply; 6+ messages in thread
From: Uwe Kleine-König @ 2011-07-13  7:04 UTC (permalink / raw)
  To: Patrick Totzke; +Cc: Notmuch developer list

Hi Patrick,

On Tue, Jul 12, 2011 at 10:29:58PM +0100, Patrick Totzke wrote:
> I noticed that commit 687366b920caa5de6ea0b66b70cf2a11e5399f7b
> breaks things with Database.get_all_tags:
> 
> -------------------------------------->%-------------------------------------
> AttributeError                            Traceback (most recent call last)
> 
> /home/pazz/projects/alot/<ipython console> in <module>()
> 
> /usr/local/lib/python2.7/dist-packages/notmuch/tag.pyc in next(self)
>      86         # No need to call nmlib.notmuch_tags_valid(self._tags);
> 
>      87         # Tags._get safely returns None, if there is no more valid tag.
> 
> ---> 88         tag = Tags._get(self._tags).decode('utf-8')
>      89         if tag is None:
>      90             self._tags = None
> 
> AttributeError: 'NoneType' object has no attribute 'decode'
> ------------------------------------%<---------------------------------------
> 
> The reason is that the Tags.next() tries to decode before it tests if tag is None.
> Now, we _could_ apply a patch like this one here:
> 
> ---------------------------------->%-----------------------------------------
> diff --git a/bindings/python/notmuch/tag.py b/bindings/python/notmuch/tag.py
> index 65a9118..2ae670d 100644
> --- a/bindings/python/notmuch/tag.py
> +++ b/bindings/python/notmuch/tag.py
> @@ -85,12 +85,12 @@ class Tags(object):
>              raise NotmuchError(STATUS.NOT_INITIALIZED)
>          # No need to call nmlib.notmuch_tags_valid(self._tags);
>          # Tags._get safely returns None, if there is no more valid tag.
> -        tag = Tags._get(self._tags).decode('utf-8')
> +        tag = Tags._get(self._tags)
>          if tag is None:
>              self._tags = None
>              raise StopIteration
>          nmlib.notmuch_tags_move_to_next(self._tags)
> -        return tag
> +        return tag.decode('utf-8')
>  
>      def __nonzero__(self):
>          """Implement bool(Tags) check that can be repeatedly used
> -------------------------------------------%<-----------------------------
> 
> But as Carl says, we cannot guarantee that a tag is utf8 encoded anyway.
I think it would be right to enforce that tags are utf-8 encoded.
Otherwise the users get strange results if they change their locale.
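
A small sketch of the kind of strange result meant here, in Python 3. The
tag text and the choice of Latin-1 as the second encoding are illustrative
assumptions, not taken from a real report:

```python
tag = "caf\u00e9"

# The same tag, entered once under a UTF-8 locale and once under Latin-1.
stored_utf8 = tag.encode("utf-8")
stored_latin1 = tag.encode("latin-1")

# A byte-exact comparison treats them as two different tags.
same_tag = stored_utf8 == stored_latin1

# Displaying the UTF-8 bytes under a Latin-1 locale gives mojibake.
shown_in_latin1 = stored_utf8.decode("latin-1")

print(same_tag)
print(shown_in_latin1)
```

With a single enforced encoding, the stored bytes would be the same in both
cases and only the display layer would need to know about the locale.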

Best regards
Uwe


* Re: Encodings
  2011-07-13  7:04   ` Encodings Uwe Kleine-König
@ 2011-07-13  9:03     ` Patrick Totzke
  0 siblings, 0 replies; 6+ messages in thread
From: Patrick Totzke @ 2011-07-13  9:03 UTC (permalink / raw)
  To: Uwe Kleine-König; +Cc: Patrick Totzke, Notmuch developer list

[-- Attachment #1: Type: text/plain, Size: 1024 bytes --]

Hi Uwe,

On Wed, Jul 13, 2011 at 09:04:47AM +0200, Uwe Kleine-König wrote:
> > But as Carl says, we cannot guarantee that a tag is utf8 encoded anyway.
> I think it would be right to enforce that tags are utf-8 encoded.
> Otherwise the users get strange results if they change their locale.

I agree that it would be very nice indeed if it were safe to assume
that all tags are UTF-8. But I also see that it takes some effort to
ensure this, as all UIs would have to explicitly recode anything that
isn't UTF-8.
It seems to be a consciously made design decision to allow other
encodings for tags, which is up for discussion of course.
All I'm saying is that the bindings should conform. And if it's
not safe to assume UTF-8 here, we shouldn't decode as such.
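
For reference, the recoding step a UI would need looks roughly like this in
Python 3. `recode_to_utf8` is a hypothetical helper, and the source encoding
is passed in explicitly here; a real UI would presumably take it from the
user's locale, for example via `locale.getpreferredencoding()`:

```python
def recode_to_utf8(raw, source_encoding):
    """Re-encode bytes from source_encoding to UTF-8.

    Raises UnicodeDecodeError if raw is not valid in source_encoding.
    Hypothetical helper for illustration only.
    """
    return raw.decode(source_encoding).encode("utf-8")


typed = b"caf\xe9"                        # user input in a Latin-1 locale
stored = recode_to_utf8(typed, "latin-1")  # what would be handed to notmuch

print(stored)
```

The step itself is short; the effort lies in every UI knowing the source
encoding reliably and applying the conversion everywhere tags are entered.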

I'm unsure what happens in all the new get_part() parts of the API.
If all MIME part text there is also returned as UTF-8, it would only
be consistent to bend tag encodings to UTF-8 as well. But I doubt that's
the case. Can anyone clarify this?
/Patrick

