Unicode Paths

unofficial mirror of notmuch@notmuchmail.org

* Unicode Paths
@ 2011-09-14  3:55 Martin Owens
  2011-09-14  4:38 ` Kan-Ru Chen
  2011-09-15 17:52 ` Austin Clements
  0 siblings, 2 replies; 5+ messages in thread
From: Martin Owens @ 2011-09-14  3:55 UTC (permalink / raw)
  To: Notmuch developer list

Hello Again,

I notice that in the library code, notmuch_database_open() and
notmuch_database_create() take a const char *path for the
directory path input. Is this Unicode safe?

The Python bindings (and the ctypes docs) seem to suggest using something
called 'wchar_t *' for accepting Unicode, but that's for C, not C++.

Is this something that should be patched?

Best Regards, Martin Owens


* Re: Unicode Paths
  2011-09-14  3:55 Unicode Paths Martin Owens
@ 2011-09-14  4:38 ` Kan-Ru Chen
  2011-09-15 16:52   ` Martin Owens
  2011-09-15 17:52 ` Austin Clements
  1 sibling, 1 reply; 5+ messages in thread
From: Kan-Ru Chen @ 2011-09-14  4:38 UTC (permalink / raw)
  To: Martin Owens; +Cc: Notmuch developer list

Martin Owens <doctormo@gmail.com> writes:

> Hello Again,
>
> I notice in the lib code notmuch_database_open(),
> notmuch_database_create() these functions use const char *path for the
> directory path input. Is this unicode safe?
>
> The python bindings (and ctype docs) seem to suggest using something
> called 'wchar_t *' for accepting unicode but that's for C not C++.
>
> Is this something that should be patched?

I think it is safe as long as the path does not contain an embedded null
character. Most POSIX filesystems do not allow null characters in
filenames, so you cannot use UTF-16 or UTF-32 to encode a Unicode
path.
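
To illustrate the point about embedded nulls, here is a minimal Python
sketch. The example path and the nul_count() helper are made up for
illustration; it only counts zero bytes in each encoding of the same
string, and assumes nothing about notmuch itself.

```python
# Hypothetical path containing a non-ASCII character.
path = "/home/user/Mail/caf\u00e9"


def nul_count(text, encoding):
    """Count the zero bytes in the encoded form of text."""
    return text.encode(encoding).count(b"\0")


for enc in ("utf-8", "utf-16-le", "utf-32-le"):
    # UTF-8 never produces a zero byte for a non-NUL character;
    # the fixed-width encodings pad ASCII characters with zero bytes.
    print(enc, nul_count(path, enc))
```

Only the UTF-8 form is free of zero bytes, which is why it can travel
through a const char * argument unchanged.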

-- 
Kanru


* Re: Unicode Paths
  2011-09-14  4:38 ` Kan-Ru Chen
@ 2011-09-15 16:52   ` Martin Owens
  0 siblings, 0 replies; 5+ messages in thread
From: Martin Owens @ 2011-09-15 16:52 UTC (permalink / raw)
  To: Kan-Ru Chen; +Cc: Notmuch developer list

It looks like the Python variables do include nulls; my investigations
show that the problem also affects tag names.

The symptoms can be seen when using the Python interface with
Unicode tag names or paths. Instead of seeing 'mytag1' we see 'm',
and instead of '/my/path/to/mail' we see '/', causing issues where
the db, amusingly, was trying to write to root.

I'll see if there is a way to remove the nulls from the strings in the
python bindings.
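
The truncation described above can be reproduced with a short Python
sketch. It assumes the string reaches the C side as little-endian
UTF-32 (a wchar_t-style layout), which is an assumption about the
cause rather than something confirmed in the bindings; as_seen_by_c()
is a hypothetical helper.

```python
import ctypes


def as_seen_by_c(text):
    """Return what C code reading a char* would see if it were handed
    the little-endian UTF-32 form of text."""
    wide = text.encode("utf-32-le") + b"\0\0\0\0"
    # c_char_p.value stops at the first zero byte, like any C string.
    return ctypes.c_char_p(wide).value


print(as_seen_by_c("mytag1"))            # b'm'
print(as_seen_by_c("/my/path/to/mail"))  # b'/'
```

Each ASCII character is followed by three zero bytes in that layout,
so the C string ends right after the first character.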

Martin,

On Wed, 2011-09-14 at 12:38 +0800, Kan-Ru Chen wrote:
> I think as long as the path does not contain embedded null character
> then it is safe. Most posix filesystem does not allow null character in
> the filename so you cannot use UTF-16 or UTF-32 to encode the unicode
> path.


* Re: Unicode Paths
  2011-09-14  3:55 Unicode Paths Martin Owens
  2011-09-14  4:38 ` Kan-Ru Chen
@ 2011-09-15 17:52 ` Austin Clements
  2011-09-16 10:58   ` Sebastian Spaeth
  1 sibling, 1 reply; 5+ messages in thread
From: Austin Clements @ 2011-09-15 17:52 UTC (permalink / raw)
  To: Martin Owens; +Cc: Notmuch developer list

On Tue, Sep 13, 2011 at 11:55 PM, Martin Owens <doctormo@gmail.com> wrote:
> Hello Again,
>
> I notice in the lib code notmuch_database_open(),
> notmuch_database_create() these functions use const char *path for the
> directory path input. Is this unicode safe?
>
> The python bindings (and ctype docs) seem to suggest using something
> called 'wchar_t *' for accepting unicode but that's for C not C++.
>
> Is this something that should be patched?

char* is the correct type for paths on POSIX systems.  The *meaning*
of those bytes is a more complicated matter and depends on your locale
settings.  On old systems it was generally ASCII, on modern systems
it's generally UTF-8, and it can be many other things.  However, as a
consequence of UNIX's C heritage, it is *always* terminated with a
NUL byte and cannot contain embedded NULs.  Any encoding that
doesn't satisfy this would not be a valid encoding for file names (you
couldn't even pass such a file name to the open() system call, because
it expects a NUL-terminated byte string).
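
A small Python sketch of both points, for illustration only. The
reaches_os() helper and the example bytes are made up; os.fsdecode()
and os.fsencode() are the standard-library functions that convert
between path bytes and str using the filesystem encoding.

```python
import os


def reaches_os(path):
    """Return False if Python refuses to pass path to the OS because it
    contains an embedded NUL, True if the OS was actually asked."""
    try:
        os.stat(path)
    except ValueError:
        # Raised before any system call is made.
        return False
    except OSError:
        # The OS was asked and answered, e.g. "no such file".
        return True
    return True


print(reaches_os("/tmp/a\0b"))

# The bytes are what the kernel sees; how they decode depends on the
# locale, but the byte string itself survives the round trip.
raw = b"/tmp/caf\xc3\xa9"
print(os.fsencode(os.fsdecode(raw)) == raw)
```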

wchar_t is another matter entirely.  wchar_t is the type used by C to
represent wide strings internally, which generally (but not
necessarily!) means it stores a Unicode code point.  However, this
isn't an encoding, and different compilers can give wchar_t different
meanings, so wchar_t strings aren't generally appropriate for storing
or sharing between processes or with the kernel.
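
As a rough illustration of why wchar_t is not a portable
representation, this Python sketch asks ctypes how wide wchar_t is on
the current platform and dumps the raw bytes of a short wide string.
The variable names are arbitrary; the size is commonly 4 bytes on
Linux and 2 on Windows.

```python
import ctypes

# Width of wchar_t as seen by the C compiler that built Python.
wchar_size = ctypes.sizeof(ctypes.c_wchar)

# wchar_t[3]: 'h', 'i', and the terminating zero element.
buf = ctypes.create_unicode_buffer("hi")
raw = ctypes.string_at(ctypes.addressof(buf), ctypes.sizeof(buf))

print(wchar_size, raw)
```

The same two characters occupy a different number of bytes depending
on the platform, and in either case the buffer is not the UTF-8 byte
string that a char * path would contain.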


* Re: Unicode Paths
  2011-09-15 17:52 ` Austin Clements
@ 2011-09-16 10:58   ` Sebastian Spaeth
  0 siblings, 0 replies; 5+ messages in thread
From: Sebastian Spaeth @ 2011-09-16 10:58 UTC (permalink / raw)
  To: Austin Clements, Martin Owens; +Cc: Notmuch developer list


On Thu, 15 Sep 2011 13:52:12 -0400, Austin Clements <amdragon@mit.edu> wrote:
> On Tue, Sep 13, 2011 at 11:55 PM, Martin Owens <doctormo@gmail.com> wrote:
> > Hello Again,
> >
> > I notice in the lib code notmuch_database_open(),
> > notmuch_database_create() these functions use const char *path for the
> > directory path input. Is this unicode safe?
> >
> > The python bindings (and ctype docs) seem to suggest using something
> > called 'wchar_t *' for accepting unicode but that's for C not C++.
> >
> > Is this something that should be patched?
> 
> char* is the correct type for paths on POSIX systems.  The *meaning*
> of those bytes is a more complicated matter and depends on your locale
> settings.  On old systems it was generally ASCII, on modern systems
> it's generally UTF-8, and it can be many other things.  However, as a
> consequence of UNIX's C heritage, it is *always* terminated with a
> NULL byte and cannot contain embedded NULL's.

Right, that's what we are doing, passing in utf-8 encoded unicode
strings to char*, which should be just fine if that is what the
underlying OS uses.
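
A minimal sketch of that approach, assuming UTF-8 is the encoding the
OS expects. to_c_string() is a hypothetical helper written for this
example, not the code used in the notmuch Python bindings.

```python
import ctypes


def to_c_string(text):
    """Hypothetical helper: encode a Python str as UTF-8 so it can be
    passed to a C function taking const char *."""
    data = text.encode("utf-8")
    if b"\0" in data:
        # A C string would silently end here, so refuse instead.
        raise ValueError("embedded NUL byte")
    return ctypes.c_char_p(data)


arg = to_c_string("/home/user/Mail/caf\u00e9")
print(arg.value)
```

The explicit NUL check turns a silent truncation on the C side into
an error on the Python side.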

> wchar_t is another matter entirely.  wchar_t is the type used by C to
> represent wide strings internally, which generally (but not
> necessarily!) means it stores a Unicode code point.  However, this
> isn't an encoding, and different compilers can give wchar_t different
> meanings, so wchar_t strings aren't generally appropriate for storing
> or sharing between processes or with the kernel.

Mmh, I remember I attempted to use wchar_t to pass in unicode objects
directly, and it failed miserably.

Sebastian
