bug#20822: environment mangled by locale

unofficial mirror of bug-guile@gnu.org 

* bug#20822: environment mangled by locale
@ 2015-06-16  4:17 Zefram
  2015-06-16  6:26 ` John Darrington
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Zefram @ 2015-06-16  4:17 UTC
  To: 20822

When guile-2.0 is asked to read environment variables, via getenv,
it always decodes the underlying octet string according to the current
locale's nominal character encoding.  This is a problem, because the
environment variable's value is not necessarily encoded that way, and
may not even be an encoding of a character string at all.  The decoding
is lossy, where the octet string isn't consistent with the character
encoding, so the original octet string cannot be recovered from the
mangled form.  I don't see any Scheme interface that retrieves the
environment without locale decoding.

The decoding is governed by the currently selected locale at the time that
getenv is called, so this can be controlled to some extent by setlocale.
However, this doesn't provide a way round the lossy decoding problem,
because there is no guarantee of a cooperative locale being available
(and especially being available under a predictable name).  On my Debian
system here, the "POSIX" and "C" locales' nominal character encoding is
ASCII, so decoding under these locales results in all high-half octets
being turned into question marks.  Retrieving environment without calling
setlocale at all also yields this lossy ASCII decode.

Demos:

$ env - FOO=$'L\xc3\xa9on' guile-2.0 -c '(write (map char->integer (string->list (getenv "FOO")))) (newline)'
(76 63 63 111 110)
$ env - FOO=$'L\xc3\xa9on' guile-2.0 -c '(setlocale LC_ALL "POSIX") (write (map char->integer (string->list (getenv "FOO")))) (newline)'
(76 63 63 111 110)
$ env - FOO=$'L\xc3\xa9on' guile-2.0 -c '(setlocale LC_ALL "de_DE.utf8") (write (map char->integer (string->list (getenv "FOO")))) (newline)'
(76 233 111 110)
$ env - FOO=$'L\xc3\xa9on' guile-2.0 -c '(setlocale LC_ALL "de_DE.iso88591") (write (map char->integer (string->list (getenv "FOO")))) (newline)'
(76 195 169 111 110)
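
The four results above follow directly from how each encoding treats the five octets 4c c3 a9 6f 6e. As an illustration only, here is a Python sketch (not Guile code) that decodes the same octets three ways. The '?' substitution for the ASCII case is written out by hand to mirror the behaviour reported above, and all names are made up for this example.

```python
raw = b"L\xc3\xa9on"  # the octets placed in FOO by the demos above

# ASCII-only decode, substituting '?' (63) for every high-half octet,
# mirroring what the C/POSIX demos report.
ascii_lossy = "".join(chr(b) if b < 0x80 else "?" for b in raw)

utf8 = raw.decode("utf-8")          # c3 a9 is one character, U+00E9
latin1 = raw.decode("iso-8859-1")   # every octet maps to one code point


def code_points(s):
    """Return the code points of a string as a list of integers."""
    return [ord(c) for c in s]


print(code_points(ascii_lossy))  # [76, 63, 63, 111, 110]
print(code_points(utf8))         # [76, 233, 111, 110]
print(code_points(latin1))       # [76, 195, 169, 111, 110]

# Only a byte-transparent decode can be reversed to recover the octets.
print(latin1.encode("iso-8859-1") == raw)  # True
print(ascii_lossy.encode("ascii") == raw)  # False
```

The last two lines show the core of the complaint: after the ASCII decode, the original octets cannot be reconstructed from the string.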

The actual data passed between processes is an octet string, and there
really needs to be some reliable way to access that octet string.
There's an obvious parallel with reading data from an input port.
If setlocale is called, then input is by default decoded according
to locale, including the very lossy ASCII decode for C/POSIX.  But if
setlocale has not been called, then input is by default decoded according
to ISO-8859-1, preserving the actual octets.  It would probably be most
sensible that, if setlocale hasn't been called, getenv should likewise
decode according to ISO-8859-1.  It might also be sensible to offer
some explicit control over the encoding to be used with the environment,
just as I/O ports have a concept of per-port selected encoding.

The same issue applies to other environment access functions too.
For setenv the corresponding problem is the inability to *write* an
arbitrary octet string to an environment variable.  Obviously all the
functions should have mutually consistent behaviour.

-zefram

* bug#20822: environment mangled by locale
  2015-06-16  4:17 bug#20822: environment mangled by locale Zefram
@ 2015-06-16  6:26 ` John Darrington
  2015-06-16 20:03   ` Andreas Rottmann
  2016-03-04 23:22 ` Zefram
  2016-06-24  5:57 ` Andy Wingo
  2 siblings, 1 reply; 8+ messages in thread
From: John Darrington @ 2015-06-16  6:26 UTC
  To: Zefram; +Cc: 20822

Can we configure this mailing list better?

Many (all?) of the messages posted have no obvious indication of which
mailing list they are coming from.

The subject line is something like "bug#12345: description"
The To: field is 12354@debbugs.gnu.org

In general, it takes a lot of detective work to discover that a message relates to Guile.

Can it not be configured to prepend Bug-Guile or something similar to the Subject: line?
That way it'd be easier to manage - either manually or automatically.

J'

-- 
PGP Public key ID: 1024D/2DE827B3 
fingerprint = 8797 A26D 0854 2EAB 0285  A290 8A67 719C 2DE8 27B3
See http://sks-keyservers.net or any PGP keyserver for public key.

* bug#20822: environment mangled by locale
  2015-06-16  6:26 ` John Darrington
@ 2015-06-16 20:03   ` Andreas Rottmann
  2015-06-16 20:50     ` John Darrington
  0 siblings, 1 reply; 8+ messages in thread
From: Andreas Rottmann @ 2015-06-16 20:03 UTC
  To: John Darrington; +Cc: 20822, Zefram

John Darrington <john@darrington.wattle.id.au> writes:

> Can we configure this mailing list better?
>
> Many (all?) of the messages posted have no obvious indication of which
> mailing list they are coming from.
>
> The subject line is something like "bug#12345: description"
> The To: field is 12354@debbugs.gnu.org
>
> In general, it takes a lot of detective work to discover that message
> relates to guile.
>
No, it doesn't, there's a List-Id header in all messages sent out via
the list:

List-Id: "Bug reports for GUILE, GNU's Ubiquitous Extension Language" <bug-guile.gnu.org>

You should use an email client (MUA) or mail delivery agent (MDA) which
can act on that header, and group all (e.g.) bug-guile mails into their
own folder (or view, or whatever your client calls it). Putting an
identifier into the subject just clutters the view for people who have
set up their email clients appropriately for use with mailing lists.

From the email headers of your post, it seems you use mutt; I don't know
if mutt has built-in support for grouping based on List-Id (I'd guess
no), but you can use a tool (MDA) like "maildrop"[1], "scmail"[2] or
"procmail"[3] to automatically put the email you receive via mailing
lists into different (e.g.) IMAP mailboxes.

[1] http://www.courier-mta.org/maildrop/
[2] http://0xcc.net/scmail/index.html.en
[3] http://www.procmail.org/

Personally, I use scmail, as I quite like its Scheme-based configuration
file format; here's what I've done on a Debian system to set this up
(I'm hopefully not forgetting something here):

Create a .forward file containing the following single line in your
$HOME, to process incoming mail with scmail:

| /usr/bin/scmail-deliver

Then, follow the instructions on the scmail homepage[2], creating
~/.scmail/config and ~/.scmail/deliver-rules to split your incoming mail
into multiple mailboxes; I use the following rules for the Guile lists:

(add-filter-rule!
  '(list-id (#/guile-devel\.gnu\.org/i "lists/guile-devel"))
  '(list-id (#/guile-user\.gnu\.org/i "lists/guile-user"))
  '(list-id (#/bug-guile\.gnu\.org/i "lists/guile-bug")))
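
For readers without scmail, the same List-Id matching can be sketched with Python's standard email module. This is an illustration rather than a replacement for an MDA: the rule table copies the three patterns above, while the function name, the "inbox" default and the sample message are assumptions made for this example.

```python
import re
from email import message_from_string

# (pattern, destination folder) pairs mirroring the scmail rules above.
RULES = [
    (re.compile(r"guile-devel\.gnu\.org", re.I), "lists/guile-devel"),
    (re.compile(r"guile-user\.gnu\.org", re.I), "lists/guile-user"),
    (re.compile(r"bug-guile\.gnu\.org", re.I), "lists/guile-bug"),
]


def folder_for(raw_message, default="inbox"):
    """Pick a destination folder from the List-Id header, if any."""
    msg = message_from_string(raw_message)
    list_id = msg.get("List-Id", "")
    for pattern, folder in RULES:
        if pattern.search(list_id):
            return folder
    return default


sample = (
    'List-Id: "Bug reports for GUILE, GNU\'s Ubiquitous Extension Language"'
    " <bug-guile.gnu.org>\n"
    "Subject: bug#20822: environment mangled by locale\n"
    "\n"
    "body\n"
)
print(folder_for(sample))  # lists/guile-bug
```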

The exact destinations you can use (e.g. "lists/guile-devel") depend on
which program accesses your mail (IMAP server, local MUA). For a mutt
instance running on the same system, my config looks like [4]. Note that
this config is not polished at all; I use mutt on the server only as a
fallback.

[4] http://rotty.xx.vu/git/dotfiles/mutt/tree/.muttrc

> Can it not be configured to Prepend the Subject: line with Bug-Guile
> or something similar?  That way it'd be easier to manage - either
> manually or automatically.
>
As mentioned above, this is not a good idea.

Kind regards, Rotty
-- 
Andreas Rottmann -- <http://rotty.xx.vu/>

* bug#20822: environment mangled by locale
  2015-06-16 20:03   ` Andreas Rottmann
@ 2015-06-16 20:50     ` John Darrington
  0 siblings, 0 replies; 8+ messages in thread
From: John Darrington @ 2015-06-16 20:50 UTC
  To: Andreas Rottmann; +Cc: 20822, Zefram

On Tue, Jun 16, 2015 at 10:03:48PM +0200, Andreas Rottmann wrote:
     John Darrington <john@darrington.wattle.id.au> writes:
     
     > Can we configure this mailing list better?
     >
     > Many (all?) of the messages posted have no obvious indication of which
     > mailing list they are coming from.
     >
     > The subject line is something like "bug#12345: description"
     > The To: field is 12354@debbugs.gnu.org
     >
     > In general, it takes a lot of detective work to discover that message
     > relates to guile.
     >
     No, it doesn't, there's a List-Id header in all messages sent out via
     the list:
     
     List-Id: "Bug reports for GUILE, GNU's Ubiquitous Extension Language" <bug-guile.gnu.org>

OK Thanks.  If that is invariant, then I'll set a rule accordingly.  
Now that I know, I can do that.
     
     Putting an
     identifier into the subject just clutters the view for people who have
     set up their email clients appropriately for use with mailing lists.

I don't agree.  In fact, I set my email client for use with mailing lists.  That is why
I made the suggestion.  I like to know if I'm receiving personally addressed mail, or
mail via a list (without having to explicitly check the envelope and all headers).
     
     From the email headers of your post, it seems you use mutt; I don't know
     if mutt has built-in support for grouping based on List-Id (I'd guess
     no), but you can use a tool (MDA) like "maildrop"[1], "scmail"[2] or
     "procmail"[3] to automatically put the email you receive via mailing
     lists into different (e.g.) IMAP mailboxes.

I do know how to use my computer - I just didn't know what field this list
used to identify itself.  But thanks for reminding me anyway.
     
     > Can it not be configured to Prepend the Subject: line with Bug-Guile
     > or something similar?  That way it'd be easier to manage - either
     > manually or automatically.
     >
     As mentioned above, this is not a good idea.

There are a lot of email conventions which are not good ideas.  They are nevertheless
ubiquitous, and refusing to conform to them is also not a good idea.
     
Anyway, I'll set a rule on the List-Id field as you suggested and hopefully that'll
fix the problem.

Sorry for the noise.

J'

-- 
PGP Public key ID: 1024D/2DE827B3 
fingerprint = 8797 A26D 0854 2EAB 0285  A290 8A67 719C 2DE8 27B3
See http://sks-keyservers.net or any PGP keyserver for public key.

* bug#20822: environment mangled by locale
  2015-06-16  4:17 bug#20822: environment mangled by locale Zefram
  2015-06-16  6:26 ` John Darrington
@ 2016-03-04 23:22 ` Zefram
  2016-06-24  5:57 ` Andy Wingo
  2 siblings, 0 replies; 8+ messages in thread
From: Zefram @ 2016-03-04 23:22 UTC
  To: 20822

I wrote:
>There's an obvious parallel with reading data from an input port.
>If setlocale is called, then input is by default decoded according
>to locale, including the very lossy ASCII decode for C/POSIX.  But if
>setlocale has not been called, then input is by default decoded according
>to ISO-8859-1, preserving the actual octets.  It would probably be most
>sensible that, if setlocale hasn't been called, getenv should likewise
>decode according to ISO-8859-1.  It might also be sensible to offer
>some explicit control over the encoding to be used with the environment,
>just as I/O ports have a concept of per-port selected encoding.

In the light of what I've learned recently about Guile's locale handling,
this needs some revision.  What I thought was a well-defined "setlocale
not called" state is a mirage.  The encoding of ports is not reliably
fixed at ISO-8859-1; per bug#22910 it can be affected by ostensibly
read-only calls to setlocale, and seems to be only accidentally
ISO-8859-1 until that's done.  So that's not a good model.  Due to the
GUILE_INSTALL_LOCALE mechanism, a program wanting no locale selected
can't just never call setlocale in write mode.  So setlocale not having
been called is not really available as a way to control anything.

So it would seem to be necessary to use some explicit control of character
encoding for environment access.  (This must be control of encoding
per se, not merely of which locale to use for environment access,
because, as I noted in the original report, there's no guarantee of a
locale with a suitable encoding.)  This could be an optional parameter
to the environment access functions, or a settable variable that takes
precedence over locale to determine encoding for all environment access.
The latter would match the encoding model used by ports.
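
To make the two proposed shapes concrete, here is a small Python sketch of the idea; it is not Guile code and not an existing API. The names getenv_decoded and set_environment_encoding are hypothetical, the environment is passed in as a plain mapping of octet strings, and combining both mechanisms (explicit argument first, then the settable default, then the locale) is an assumption of this example rather than something proposed above.

```python
import locale

# Hypothetical process-wide default, analogous to a port's default encoding.
_settings = {"encoding": None}


def set_environment_encoding(encoding):
    """Set (or clear, with None) the default encoding for environment access."""
    _settings["encoding"] = encoding


def getenv_decoded(env, name, encoding=None):
    """Decode env[name] with an explicit, default, or locale encoding."""
    raw = env.get(name)
    if raw is None:
        return None
    chosen = (
        encoding
        or _settings["encoding"]
        or locale.getpreferredencoding(False)
    )
    # Undecodable octets become U+FFFD here; a byte-transparent encoding
    # such as ISO-8859-1 never triggers that substitution.
    return raw.decode(chosen, errors="replace")


env = {b"FOO": b"L\xc3\xa9on"}
print(getenv_decoded(env, b"FOO", encoding="utf-8") == "L\u00e9on")  # True

set_environment_encoding("iso-8859-1")
text = getenv_decoded(env, b"FOO")
print(text.encode("iso-8859-1") == env[b"FOO"])  # True
```

With the default set to ISO-8859-1, the caller can always get back the original octets, which no locale-dependent decode can promise.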

-zefram

* bug#20822: environment mangled by locale
  2015-06-16  4:17 bug#20822: environment mangled by locale Zefram
  2015-06-16  6:26 ` John Darrington
  2016-03-04 23:22 ` Zefram
@ 2016-06-24  5:57 ` Andy Wingo
  2016-06-26  1:10   ` Mark H Weaver
  2 siblings, 1 reply; 8+ messages in thread
From: Andy Wingo @ 2016-06-24  5:57 UTC
  To: Zefram; +Cc: 20822, ludo

On Tue 16 Jun 2015 06:17, Zefram <zefram@fysh.org> writes:

> When guile-2.0 is asked to read environment variables, via getenv,
> it always decodes the underlying octet string according to the current
> locale's nominal character encoding.  This is a problem, because the
> environment variable's value is not necessarily encoded that way, and
> may not even be an encoding of a character string at all.  The decoding
> is lossy, where the octet string isn't consistent with the character
> encoding, so the original octet string cannot be recovered from the
> mangled form.  I don't see any Scheme interface that retrieves the
> environment without locale decoding.

Options:

  Add optional "encoding" arg to scm_getenv; encoding is a string

  Add alternate getenv interface that returns a bytevector

We'll have to do the same for setenv too, I think.

I think I would go with adding an encoding argument to getenv.  WDYT
Mark and Ludovic?

Andy

* bug#20822: environment mangled by locale
  2016-06-24  5:57 ` Andy Wingo
@ 2016-06-26  1:10   ` Mark H Weaver
  2016-06-26 10:33     ` Zefram
  0 siblings, 1 reply; 8+ messages in thread
From: Mark H Weaver @ 2016-06-26  1:10 UTC
  To: Andy Wingo; +Cc: 20822, Zefram, ludo

Andy Wingo <wingo@pobox.com> writes:

> On Tue 16 Jun 2015 06:17, Zefram <zefram@fysh.org> writes:
>
>> When guile-2.0 is asked to read environment variables, via getenv,
>> it always decodes the underlying octet string according to the current
>> locale's nominal character encoding.  This is a problem, because the
>> environment variable's value is not necessarily encoded that way, and
>> may not even be an encoding of a character string at all.  The decoding
>> is lossy, where the octet string isn't consistent with the character
>> encoding, so the original octet string cannot be recovered from the
>> mangled form.  I don't see any Scheme interface that retrieves the
>> environment without locale decoding.
>
> Options:
>
>   Add optional "encoding" arg to scm_getenv; encoding is a string
>
>   Add alternate getenv interface that returns a bytevector
>
> We'll have to do the same for setenv too, I think.
>
> I think I would go with adding an encoding argument to getenv.  WDYT
> Mark and Ludovic?

I just don't see how this could be used sanely in actual practice.
These things are conceptually strings, and by convention they are
supposed to be encoded in the locale encoding.  If that convention is
violated, I don't see what a program could do about it.

Can someone show me a realistic example of how this would be used in
practice?

      Mark

* bug#20822: environment mangled by locale
  2016-06-26  1:10   ` Mark H Weaver
@ 2016-06-26 10:33     ` Zefram
  0 siblings, 0 replies; 8+ messages in thread
From: Zefram @ 2016-06-26 10:33 UTC
  To: Mark H Weaver; +Cc: 20822, ludo

Mark H Weaver wrote:
>                                           by convention they are
>supposed to be encoded in the locale encoding.

This convention is bunk.  The encoding aspect of the locale system is
fundamentally broken: the model is that every string in the universe
(every file content, filename, command line argument, etc.) is encoded
in the same way, and the locale environment variable tells you which
universe you're in.  But in the real universe, files, filenames, and so
on turn up encoded how their authors liked to encode them, and that's
not always the same.  In the real universe we have to cope with data
that is not encoded in our preferred way.

>                                             If that convention is
>violated, I don't see what a program could do about it.

If the convention is violated, then there is some difficulty in presenting
correctly-encoded (or even consistently-encoded) output to the user, but
it is not insuperable.  Perhaps the program knows by some non-locale means
how a string is encoded, and can explicitly convert.  Perhaps it doesn't
know the real encoding, but can trust that the user will understand the
octet string if it is passed through with neither decoding of input nor
encoding for output.  Or perhaps the program doesn't need to put the
string into textual output at all, but only to use it with some API or file
format that's expecting an encodingless octet string.

So there are many things a program can reasonably do about it, and which
one to do depends on the application.

>Can someone show me a realistic example of how this would be used in
>practice?

Looking specifically at environment variables: an environment
variable could give the name of a file that is to be consulted under
specified circumstances, and the right file may happen to have a name
that is inconsistent with the encoding used by the user's terminal.
(The filename is not required for output; it only needs to be passed as
an uninterpreted octet string to the open(2) syscall.)  An environment
variable could specify a Unicode-using name of a language module to be
loaded, while the user doesn't otherwise use Unicode, or doesn't use
an encoding encompassing enough of it.  (Name not required on output,
again; will be either transformed into a filename or looked up in a file
format that specifies its own encoding.)  The program could be env(1), not
interpreting the environment but needing to output the octets correctly.
The program could be saving an uninterpreted environment, for a cron
job to later run some other program with equivalent settings.
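
The last case, saving an uninterpreted environment, can be sketched in a few lines. This is a Python illustration only, with hypothetical function names; it assumes the environment is available as a mapping of octet strings and uses the NUL-terminated NAME=VALUE layout, which is safe because neither names nor values can contain a NUL octet and names cannot contain '='.

```python
def dump_env(env):
    """Serialise a bytes-to-bytes environment as NAME=VALUE NUL entries."""
    return b"".join(name + b"=" + value + b"\0" for name, value in env.items())


def load_env(blob):
    """Parse the output of dump_env back into a bytes-to-bytes dict."""
    env = {}
    for entry in blob.split(b"\0"):
        if not entry:
            continue  # skip the empty piece after the final NUL
        name, _, value = entry.partition(b"=")  # split at the first '=' only
        env[name] = value
    return env


saved = {
    b"FOO": b"L\xc3\xa9on",     # UTF-8 octets
    b"BAR": b"L\xe9on",         # ISO-8859-1 octets, not valid UTF-8
    b"OPTS": b"a=b=c",          # '=' inside the value survives
}
print(load_env(dump_env(saved)) == saved)  # True
```

No decoding happens at any point, so variables in different encodings, or in none, round-trip unchanged.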

-zefram
