* Unicode Paths
From: Martin Owens @ 2011-09-14 3:55 UTC
To: Notmuch developer list

Hello Again,

I notice that in the lib code, notmuch_database_open() and notmuch_database_create() use const char *path for the directory path input. Is this Unicode safe?

The Python bindings (and the ctypes docs) seem to suggest using something called 'wchar_t *' for accepting Unicode, but that's for C not C++.

Is this something that should be patched?

Best Regards, Martin Owens

* Re: Unicode Paths
From: Kan-Ru Chen @ 2011-09-14 4:38 UTC
To: Martin Owens; +Cc: Notmuch developer list

Martin Owens <doctormo@gmail.com> writes:

> Hello Again,
>
> I notice in the lib code notmuch_database_open(),
> notmuch_database_create() these functions use const char *path for the
> directory path input. Is this unicode safe?
>
> The python bindings (and ctype docs) seem to suggest using something
> called 'wchar_t *' for accepting unicode but that's for C not C++.
>
> Is this something that should be patched?

I think that as long as the path does not contain an embedded null character, it is safe. Most POSIX filesystems do not allow a null character in a filename, so you cannot use UTF-16 or UTF-32 to encode the Unicode path.

--
Kanru
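
The point about embedded null characters can be checked directly. The following is a minimal sketch, not notmuch code: the helper name has_embedded_nul and the example path are made up for illustration. It encodes one non-ASCII path three ways and reports which encodings produce zero bytes.

```python
def has_embedded_nul(text: str, encoding: str) -> bool:
    """Return True if `text`, encoded with `encoding`, contains a zero byte."""
    return b"\x00" in text.encode(encoding)


# Hypothetical mail directory with non-ASCII characters (an assumption).
path = "/home/user/Почта"

for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
    print(encoding, has_embedded_nul(path, encoding))
```

UTF-8 only emits a zero byte for the character U+0000 itself, so this path is free of them, while UTF-16 and UTF-32 pad every ASCII character such as '/' with zero bytes.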

* Re: Unicode Paths
From: Martin Owens @ 2011-09-15 16:52 UTC
To: Kan-Ru Chen; +Cc: Notmuch developer list

It looks like the Python variables do include nulls; my investigations show that the problem also affects tag names.

The symptoms can be seen when using the Python interface with Unicode tag names or paths. Instead of seeing 'mytag1' we see 'm', and instead of '/my/path/to/mail' we see '/', thus causing issues where the db, amusingly, was trying to write to root.

I'll see if there is a way to remove the nulls from the strings in the Python bindings.

Martin

On Wed, 2011-09-14 at 12:38 +0800, Kan-Ru Chen wrote:
> I think as long as the path does not contain embedded null character
> then it is safe. Most posix filesystem does not allow null character in
> the filename so you cannot use UTF-16 or UTF-32 to encode the unicode
> path.
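
The symptoms described above ('mytag1' seen as 'm', '/my/path/to/mail' seen as '/') are what one would expect if wide-character data reached a char * parameter. The sketch below reproduces that effect under stated assumptions: the wide data is modelled as UTF-32 little-endian, and the helper seen_by_c is hypothetical. The thread does not confirm that this is exactly what the bindings did.

```python
import ctypes


def seen_by_c(raw: bytes) -> bytes:
    """Return what C code reading `raw` as a NUL-terminated char * would see."""
    # .value on a ctypes char buffer stops at the first zero byte,
    # the same place strlen() or strcpy() would stop.
    return ctypes.create_string_buffer(raw).value


# Assumption: wide characters laid out as UTF-32 little-endian.
wide_tag = "mytag1".encode("utf-32-le")
wide_path = "/my/path/to/mail".encode("utf-32-le")

print(seen_by_c(wide_tag))                  # only the first character survives
print(seen_by_c(wide_path))                 # only the leading slash survives
print(seen_by_c("mytag1".encode("utf-8")))  # UTF-8 passes through intact
```

Because every ASCII character in the wide layout is followed by zero bytes, the C side treats the string as finished after one character.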

* Re: Unicode Paths
From: Austin Clements @ 2011-09-15 17:52 UTC
To: Martin Owens; +Cc: Notmuch developer list

On Tue, Sep 13, 2011 at 11:55 PM, Martin Owens <doctormo@gmail.com> wrote:
> Hello Again,
>
> I notice in the lib code notmuch_database_open(),
> notmuch_database_create() these functions use const char *path for the
> directory path input. Is this unicode safe?
>
> The python bindings (and ctype docs) seem to suggest using something
> called 'wchar_t *' for accepting unicode but that's for C not C++.
>
> Is this something that should be patched?

char* is the correct type for paths on POSIX systems. The *meaning* of those bytes is a more complicated matter and depends on your locale settings. On old systems it was generally ASCII, on modern systems it's generally UTF-8, and it can be many other things. However, as a consequence of UNIX's C heritage, it is *always* terminated with a NUL byte and cannot contain embedded NULs. Any encoding that doesn't satisfy this would not be a valid encoding for file names (you couldn't even pass such a file name to the open() system call, because it expects a NUL-terminated byte string).

wchar_t is another matter entirely. wchar_t is the type used by C to represent wide strings internally, which generally (but not necessarily!) means it stores a Unicode code point. However, this isn't an encoding, and different compilers can give wchar_t different meanings, so wchar_t strings aren't generally appropriate for storing or sharing between processes or with the kernel.
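
The remark that wchar_t has no fixed meaning can be observed from Python. This is only an illustrative sketch: it prints the size of the platform's wchar_t as exposed by ctypes and the raw in-memory bytes of a one-character wide buffer, both of which may differ between systems.

```python
import ctypes

# Size in bytes of the platform's wchar_t, as exposed by ctypes.
wchar_size = ctypes.sizeof(ctypes.c_wchar)
print("sizeof(wchar_t) =", wchar_size)

# A wide buffer holding U+00E9 plus its terminating wide NUL.
buf = ctypes.create_unicode_buffer("\u00e9")

# Copy out the raw memory; the layout depends on wchar_t size and byte order.
raw = ctypes.string_at(ctypes.addressof(buf), ctypes.sizeof(buf))
print(raw)
```

The same text yields different bytes depending on the platform, which is why such buffers are a poor fit for anything that crosses a process or kernel boundary.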

* Re: Unicode Paths
From: Sebastian Spaeth @ 2011-09-16 10:58 UTC
To: Austin Clements, Martin Owens; +Cc: Notmuch developer list

On Thu, 15 Sep 2011 13:52:12 -0400, Austin Clements <amdragon@mit.edu> wrote:
> On Tue, Sep 13, 2011 at 11:55 PM, Martin Owens <doctormo@gmail.com> wrote:
> > Hello Again,
> >
> > I notice in the lib code notmuch_database_open(),
> > notmuch_database_create() these functions use const char *path for the
> > directory path input. Is this unicode safe?
> >
> > The python bindings (and ctype docs) seem to suggest using something
> > called 'wchar_t *' for accepting unicode but that's for C not C++.
> >
> > Is this something that should be patched?
>
> char* is the correct type for paths on POSIX systems. The *meaning*
> of those bytes is a more complicated matter and depends on your locale
> settings. On old systems it was generally ASCII, on modern systems
> it's generally UTF-8, and it can be many other things. However, as a
> consequence of UNIX's C heritage, it is *always* terminated with a
> NULL byte and cannot contain embedded NULL's.

Right, that's what we are doing: passing UTF-8 encoded Unicode strings to char*, which should be just fine if that is what the underlying OS uses.

> wchar_t is another matter entirely. wchar_t is the type used by C to
> represent wide strings internally, which generally (but not
> necessarily!) means it stores a Unicode code point. However, this
> isn't an encoding, and different compilers can give wchar_t different
> meanings, so wchar_t strings aren't generally appropriate for storing
> or sharing between processes or with the kernel.

Mmh, I remember I attempted to use wchar_t to pass in Unicode objects directly, and it failed miserably.

Sebastian
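
The approach of passing UTF-8 encoded strings to char * can be sketched as follows. This is not the code from the notmuch Python bindings: the helper to_c_path and the example path are hypothetical, and UTF-8 is assumed to be the encoding the underlying OS uses for file names.

```python
import ctypes


def to_c_path(path: str) -> ctypes.c_char_p:
    """Encode a text path as UTF-8 bytes suitable for a const char * argument."""
    encoded = path.encode("utf-8")  # assumption: the filesystem expects UTF-8
    if b"\x00" in encoded:
        # A zero byte would silently truncate the string on the C side.
        raise ValueError("path contains an embedded NUL byte")
    return ctypes.c_char_p(encoded)


# Hypothetical mail directory with non-ASCII characters.
arg = to_c_path("/home/user/Почта")
print(arg.value)
```

Rejecting embedded zero bytes up front turns a silent truncation into a visible error. Where the locale is not UTF-8, the standard library's os.fsencode() is an alternative that follows the filesystem encoding.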

Thread overview: 5+ messages, end of thread [~2011-09-16 10:58 UTC]

2011-09-14  3:55 Unicode Paths Martin Owens
2011-09-14  4:38 ` Kan-Ru Chen
2011-09-15 16:52   ` Martin Owens
2011-09-15 17:52 ` Austin Clements
2011-09-16 10:58   ` Sebastian Spaeth