unofficial mirror of bug-guile@gnu.org 
From: Zefram <zefram@fysh.org>
To: 20823@debbugs.gnu.org
Subject: bug#20823: argv mangled by locale
Date: Tue, 16 Jun 2015 05:33:00 +0100
Message-ID: <20150616043300.GB2718@fysh.org>

When guile-2.0 stores argv for later access via program-arguments,
it sometimes decodes the underlying octet string according to the
nominal character encoding of the locale suggested by the environment.
This is a problem, because the arguments are not necessarily encoded
that way, and may not even be encodings of character strings at all.
Where the octet string isn't valid in that character encoding, the
decoding is lossy, so the original octet string cannot be recovered
from the mangled form.  I don't see any Scheme interface that reliably
retrieves the command line arguments without locale decoding.
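
To illustrate why the mangled form is not enough, here is a minimal
Python sketch (not Guile code).  It assumes a decoder that substitutes
a replacement character for invalid input.  Python substitutes U+FFFD
where Guile's output shows "?", but the effect is the same: distinct
octet strings collapse into one string, whereas ISO-8859-1 decoding
round-trips.

```python
# Illustration only: Python's codecs stand in for Guile's locale decoding.
a = b"L\xff\xfeon"   # not valid UTF-8, and possibly not text at all
b = b"L\xfe\xffon"   # a different octet string

# Decoding as UTF-8 with substitution replaces each invalid octet.
da = a.decode("utf-8", errors="replace")
db = b.decode("utf-8", errors="replace")
print(da == db)  # True: the two arguments can no longer be told apart

# ISO-8859-1 maps every octet to exactly one code point, so it is reversible.
print(a.decode("iso-8859-1").encode("iso-8859-1") == a)  # True
```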

The decoding doesn't follow the usual rules for locale control.  It is
not at all sensitive to setlocale, which is understandable because the
arguments are acquired before any of the actual program's code runs.
Empirically, if the environment nominates no locale, "POSIX", or a
non-existent locale, then argv is decoded according to ISO-8859-1, thus
preserving the octets.  If the environment nominates an extant locale
other than "POSIX", then argv is decoded according to that locale's
nominal character encoding.

Demos:

$ env - guile-2.0 -c '(write (map char->integer (string->list (cadr (program-arguments))))) (newline)' $'L\xc3\xa9on'  
(76 195 169 111 110)
$ env - LANG=C guile-2.0 -c '(write (map char->integer (string->list (cadr (program-arguments))))) (newline)' $'L\xc3\xa9on' 
(76 63 63 111 110)
$ env - LANG=de_DE.utf8 guile-2.0 -c '(write (map char->integer (string->list (cadr (program-arguments))))) (newline)' $'L\xc3\xa9on' 
(76 233 111 110)
$ env - LANG=de_DE.iso88591 guile-2.0 -c '(write (map char->integer (string->list (cadr (program-arguments))))) (newline)' $'L\xc3\xa9on' 
(76 195 169 111 110)
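
The four outputs above are consistent with decoding the same five
octets under three different rules.  The Python sketch below reproduces
the numbers.  The pairing of each environment with a decoding rule is
inferred from the output, not taken from Guile's source, and the helper
decode_ascii_with_question_marks is hypothetical.

```python
ARG = b"L\xc3\xa9on"  # the argument passed in each demo


def code_points(s):
    """Return the code points of a decoded string as a list of integers."""
    return [ord(c) for c in s]


def decode_ascii_with_question_marks(octets):
    # Hypothetical helper: mimics the "?" substitution seen under LANG=C.
    return "".join(chr(o) if o < 0x80 else "?" for o in octets)


# No locale in the environment, or LANG=de_DE.iso88591: octets preserved.
print(code_points(ARG.decode("iso-8859-1")))                # [76, 195, 169, 111, 110]
# LANG=C: each non-ASCII octet becomes "?" (code point 63).
print(code_points(decode_ascii_with_question_marks(ARG)))   # [76, 63, 63, 111, 110]
# LANG=de_DE.utf8: the two octets C3 A9 decode to one character, U+00E9.
print(code_points(ARG.decode("utf-8")))                     # [76, 233, 111, 110]
```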

The actual data passed between processes is an octet string, and
there really needs to be some reliable way to access that octet string.
My comments about resolution in bug#20822 "environment mangled by locale"
mostly apply here too, with a slight change: it seems necessary to store
the original octet strings and decode at the time program-arguments is
called.  With that change, the decoding can be responsive to setlocale
(and in particular can reliably use ISO-8859-1 in the absence of
setlocale).
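
A rough sketch of that arrangement, in Python rather than Guile's C,
might look like the following.  All names here (StoredArguments,
set_locale_encoding, raw_arguments) are hypothetical and only model the
idea: keep the octets, decode on each call, and fall back to ISO-8859-1
until a locale has been set.

```python
class StoredArguments:
    """Keeps argv as octet strings and decodes only when asked."""

    def __init__(self, raw_args):
        self._raw = [bytes(a) for a in raw_args]
        self._encoding = None  # stands in for "setlocale not yet called"

    def set_locale_encoding(self, encoding):
        # Hypothetical hook: models the effect of a later setlocale call.
        self._encoding = encoding

    def raw_arguments(self):
        # Lossless access to the original octet strings.
        return list(self._raw)

    def program_arguments(self):
        # Decode at call time; ISO-8859-1 is the assumed default.
        encoding = self._encoding or "iso-8859-1"
        return [a.decode(encoding, errors="replace") for a in self._raw]


args = StoredArguments([b"guile", b"L\xc3\xa9on"])
before = [ord(c) for c in args.program_arguments()[1]]
args.set_locale_encoding("utf-8")
after = [ord(c) for c in args.program_arguments()[1]]
print(before)  # [76, 195, 169, 111, 110]
print(after)   # [76, 233, 111, 110]
```

Because the octets are never discarded, a program that needs the exact
bytes can still get them after any number of locale changes.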

-zefram





Thread overview: 5+ messages
2015-06-16  4:33 Zefram [this message]
2016-03-04 23:24 ` bug#20823: argv mangled by locale Zefram
2016-06-24  6:11 ` Andy Wingo
2016-06-24  8:42   ` Zefram
2016-08-14 21:36   ` Zefram
