unofficial mirror of bug-guile@gnu.org 
* bug#20822: environment mangled by locale
From: Zefram @ 2015-06-16  4:17 UTC
  To: 20822

When guile-2.0 is asked to read environment variables, via getenv,
it always decodes the underlying octet string according to the current
locale's nominal character encoding.  This is a problem, because the
environment variable's value is not necessarily encoded that way, and
may not even be an encoding of a character string at all.  Wherever the
octet string isn't valid in the locale's encoding, the decoding is lossy,
so the original octet string cannot be recovered from the mangled form.
I don't see any Scheme interface that retrieves the
environment without locale decoding.

The decoding is governed by the currently selected locale at the time that
getenv is called, so this can be controlled to some extent by setlocale.
However, this doesn't provide a way round the lossy decoding problem,
because there is no guarantee of a cooperative locale being available
(and especially being available under a predictable name).  On my Debian
system here, the "POSIX" and "C" locales' nominal character encoding is
ASCII, so decoding under these locales results in all high-half octets
being turned into question marks.  Retrieving the environment without
calling setlocale at all also yields this lossy ASCII decode.

Demos:

$ env - FOO=$'L\xc3\xa9on' guile-2.0 -c '(write (map char->integer (string->list (getenv "FOO")))) (newline)'
(76 63 63 111 110)
$ env - FOO=$'L\xc3\xa9on' guile-2.0 -c '(setlocale LC_ALL "POSIX") (write (map char->integer (string->list (getenv "FOO")))) (newline)'
(76 63 63 111 110)
$ env - FOO=$'L\xc3\xa9on' guile-2.0 -c '(setlocale LC_ALL "de_DE.utf8") (write (map char->integer (string->list (getenv "FOO")))) (newline)'
(76 233 111 110)
$ env - FOO=$'L\xc3\xa9on' guile-2.0 -c '(setlocale LC_ALL "de_DE.iso88591") (write (map char->integer (string->list (getenv "FOO")))) (newline)'
(76 195 169 111 110)
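
For comparison, the octets that these demos place in FOO can be inspected without going through Guile at all. The following is a minimal sketch, assuming a POSIX shell with od, tr and sed available; the variable name octets is only illustrative. A child process prints the value of FOO and od dumps it as unsigned decimal octets, with no character decoding involved.

```shell
# FOO is the variable name used in the demos above; the value is "L",
# the two octets 0xC3 0xA9, then "on".
FOO=$(printf 'L\303\251on')
export FOO

# Have a child process emit its copy of FOO, then dump it as unsigned
# decimal octets.  od does no character decoding, so nothing is
# substituted or lost.  tr and sed only normalise od's column padding.
octets=$(sh -c 'printf %s "$FOO"' | od -An -tu1 | tr '\n' ' ' | sed 's/  */ /g; s/^ //; s/ $//')
echo "$octets"
```

This should print 76 195 169 111 110, the same numbers as the de_DE.iso88591 demo, which is the only one of the four that preserves every octet.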

The actual data passed between processes is an octet string, and there
really needs to be some reliable way to access that octet string.
There's an obvious parallel with reading data from an input port.
If setlocale is called, then input is by default decoded according
to locale, including the very lossy ASCII decode for C/POSIX.  But if
setlocale has not been called, then input is by default decoded according
to ISO-8859-1, preserving the actual octets.  It would probably be most
sensible that, if setlocale hasn't been called, getenv should likewise
decode according to ISO-8859-1.  It might also be sensible to offer
some explicit control over the encoding to be used with the environment,
just as I/O ports have a concept of per-port selected encoding.
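
The reason ISO-8859-1 is a safe default can be illustrated outside Guile. This is a rough sketch, assuming the iconv command-line tool is installed (glibc or libiconv); the helper name octets and the variables roundtrip and ascii_ok are illustrative only. Every octet is a valid ISO-8859-1 character, so decoding and re-encoding returns the original octets, whereas a strict ASCII decode cannot represent the high-half octets at all.

```shell
# Helper (hypothetical name): dump stdin as space-separated decimal octets.
octets() {
    od -An -tu1 | tr '\n' ' ' | sed 's/  */ /g; s/^ //; s/ $//'
}

# Decode the demo value as ISO-8859-1 (here: convert it to UTF-8), then
# encode it back to ISO-8859-1.  The octets survive the round trip.
roundtrip=$(printf 'L\303\251on' \
    | iconv -f ISO-8859-1 -t UTF-8 \
    | iconv -f UTF-8 -t ISO-8859-1 \
    | octets)
echo "round trip: $roundtrip"

# A strict ASCII decode rejects the same input, because 0xC3 and 0xA9 are
# not ASCII.  iconv reports an error here; Guile, per the demos above,
# substitutes question marks instead.
if printf 'L\303\251on' | iconv -f ASCII -t UTF-8 >/dev/null 2>&1; then
    ascii_ok=yes
else
    ascii_ok=no
fi
echo "ASCII decode succeeded: $ascii_ok"
```

The round trip only shows that ISO-8859-1 loses no information; it says nothing about how Guile implements its port default.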

The same issue applies to other environment access functions too.
For setenv the corresponding problem is the inability to *write* an
arbitrary octet string to an environment variable.  Obviously all the
functions should have mutually consistent behaviour.

-zefram






Thread overview: 8+ messages
2015-06-16  4:17 bug#20822: environment mangled by locale Zefram
2015-06-16  6:26 ` John Darrington
2015-06-16 20:03   ` Andreas Rottmann
2015-06-16 20:50     ` John Darrington
2016-03-04 23:22 ` Zefram
2016-06-24  5:57 ` Andy Wingo
2016-06-26  1:10   ` Mark H Weaver
2016-06-26 10:33     ` Zefram
