Improving the handling of system data (env, users, paths, ...)

From: Rob Browning <rlb@defaultvalue.org>
To: guile-devel@gnu.org
Subject: Improving the handling of system data (env, users, paths, ...)
Date: Sat, 06 Jul 2024 15:32:17 -0500
Message-ID: <878qyeqn1q.fsf@trouble.defaultvalue.org>

* Problem

System data such as environment variables, user names, group names, file
paths, and extended attributes (xattr) are binary data on some systems
(like Linux), and may not be decodable as a string in the current
locale.  For Linux, as an example, only the null byte (and "/" within a
single file name) is an invalid user/group/filename byte, while UTF-8
accepts only a much smaller set of byte sequences[1].

As an example, "µ" (the micro sign, U+00B5) encoded as Latin-1 is the
single byte 0xb5, which is not valid UTF-8 on its own (it is a
continuation byte with no lead byte), but is a perfectly legitimate
Linux file name.  As a result, (readdir dir) will return a corrupted
value for that name when the locale is set to UTF-8.

You can try it yourself from bash if your current locale uses an
LC_CTYPE that's incompatible with 0xb5:

    $ locale | grep LC_CTYPE
    LC_CTYPE="en_US.utf8"
    $ guile -c '(write (program-arguments)) (newline)' $'\xb5'
    ("guile" "?")

You end up with a question mark instead of the correct value.  This
makes it difficult to write programs that don't risk silent corruption
unless all the relevant system data is known to be compatible with the
user's current locale.
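
The byte-level facts above can be checked without touching the locale.
What follows is only a minimal sketch, assuming Guile's (ice-9 iconv)
and (rnrs bytevectors) modules; utf8-decodable? is a hypothetical helper
name, not an existing procedure:

```scheme
(use-modules (rnrs bytevectors)   ; u8-list->bytevector, bytevector=?
             (ice-9 iconv))       ; bytevector->string, string->bytevector

;; Hypothetical helper: #t if BV decodes as UTF-8, #f if decoding
;; raises any error.
(define (utf8-decodable? bv)
  (catch #t
    (lambda () (bytevector->string bv "UTF-8") #t)
    (lambda _ #f)))

(define lone-b5 (u8-list->bytevector '(#xb5)))          ; Latin-1 "µ"
(define utf8-micro (u8-list->bytevector '(#xc2 #xb5)))  ; UTF-8 "µ"

(display (utf8-decodable? lone-b5)) (newline)     ; expected: #f
(display (utf8-decodable? utf8-micro)) (newline)  ; expected: #t

;; Latin-1 assigns a character to every byte, so decoding cannot fail
;; and re-encoding gives back the original bytes.
(define as-latin-1 (bytevector->string lone-b5 "ISO-8859-1"))
(display (bytevector=? lone-b5
                       (string->bytevector as-latin-1 "ISO-8859-1")))
(newline)                                         ; expected: #t
```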

It's perhaps worth noting that, while typically unlikely, any given
directory could contain paths in an arbitrary collection of encodings:
UTF-8, SHIFT-JIS, Latin-1, etc., and so if you really want to try to
handle them as strings (maybe you want to correctly upcase/downcase
them), you have to know (somehow) the encoding that applies to each one.
Otherwise, in the limiting case, you can only assume "bytes".

* Improvements

At a minimum, I suggest Guile should produce an error by default
(instead of generating incorrect data) when the system bytes cannot be
decoded in the current locale.
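
To illustrate why an error is preferable to substitution, here is a
rough sketch using the conversion-strategy argument of bytevector->string
from (ice-9 iconv).  The helper names are hypothetical, and "UTF-8" is
hard-coded as a stand-in for the current locale's encoding:

```scheme
(use-modules (rnrs bytevectors)
             (ice-9 iconv))

;; Hypothetical helper: strict decoding, raises an error on bytes that
;; are not valid UTF-8.
(define (decode-system-bytes bv)
  (bytevector->string bv "UTF-8" 'error))

;; Lossy variant for comparison: invalid bytes are replaced.
(define (decode-system-bytes/lossy bv)
  (bytevector->string bv "UTF-8" 'substitute))

(define name-a (u8-list->bytevector '(#xb5)))
(define name-b (u8-list->bytevector '(#xb6)))

;; With substitution, two distinct names become the same string, so the
;; original bytes can no longer be recovered.
(display (string=? (decode-system-bytes/lossy name-a)
                   (decode-system-bytes/lossy name-b)))
(newline)

;; With the strict variant the caller finds out immediately.
(display (catch #t
           (lambda () (decode-system-bytes name-a))
           (lambda (key . args) 'decoding-failed)))
(newline)
```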

There should also be some straightforward, thread-safe way to write code
that accesses and manipulates system data efficiently and without
corruption.

As an incremental step, and as has been discussed elsewhere a bit, we
might add support for uselocale()[2] and then document that the current
recommendation is to always use ISO-8859-1 (i.e. Latin-1)[3] for system
data unless you're certain your program doesn't need to be general
purpose (perhaps you're sure you only care about UTF-8 systems).

A program intended to work everywhere might then do something like
this:

    ...
      #:use-module ((guile locale)
                    #:select (iso-8859-1 with-locale))
    ...

    (define (environment name)
      (with-locale iso-8859-1 (getenv name)))

There are disadvantages to this approach, but it's a fairly easy
improvement.

Some potential disadvantages:

  - In cases where the system data was actually UTF-8, non-ASCII
    characters will be displayed "completely wrong", i.e. mapped to
    "random" other characters according to the Latin-1 correspondences.

  - You have to pay whatever cost is involved in switching locales, and
    in encoding/decoding the bytes, even if you only care about the
    bytes.

  - If any manipulations of the string representing the system data end
    up performing Unicode canonicalizations or normalizations, the data
    could still be corrupted.  I don't *think* Guile itself ever does
    that implicitly.

  - Less importantly, if we switch the internal string representation to
    UTF-8 (proposed[4]), then non-ASCII bytes in the data will require
    two bytes in memory.
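
The first and last points can be seen in a small sketch, again assuming
only (rnrs bytevectors) and (ice-9 iconv); the variable names are
illustrative:

```scheme
(use-modules (rnrs bytevectors)
             (ice-9 iconv))

;; "é" encoded as UTF-8 is the two bytes 0xC3 0xA9.
(define utf8-bytes (u8-list->bytevector '(#xc3 #xa9)))

;; Viewed as Latin-1, each byte becomes its own character, so the text
;; displays as two unrelated characters (U+00C3 and U+00A9).
(define carrier (bytevector->string utf8-bytes "ISO-8859-1"))
(display (string-length carrier)) (newline)    ; 2

;; The bytes are intact, though, so the intended text is recoverable
;; once the real encoding is known.
(define recovered
  (bytevector->string (string->bytevector carrier "ISO-8859-1") "UTF-8"))
(display (string-length recovered)) (newline)  ; 1

;; Stored as UTF-8, each non-ASCII character of the Latin-1 view takes
;; two bytes: the two original bytes become four.
(display (bytevector-length (string->utf8 carrier))) (newline)  ; 4
```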

The most direct (and compact, if we do convert to UTF-8) representation
would be bytevectors, but then you would have a much more limited set of
operations available (strings have all of srfi-13, srfi-14, etc.)
unless we expanded the bytevector operations (likely re-using the
existing code paths).  Of course you could still convert to Latin-1,
perform the operation, and convert back, but that's not ideal.
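
As a sketch of that convert, operate, convert-back pattern (the helper
names and the "NAME=value" example are illustrative assumptions, not a
proposed API):

```scheme
(use-modules (rnrs bytevectors)
             (ice-9 iconv))

;; Hypothetical helper: call PROC on the Latin-1 string view of BV.
(define (call-with-latin-1-view bv proc)
  (proc (bytevector->string bv "ISO-8859-1")))

;; Split a "NAME=value" entry at the first "=" using an ordinary string
;; operation, then convert both halves back to bytes.  Returns a pair of
;; bytevectors, or #f if there is no "=".
(define (split-env-entry bv)
  (call-with-latin-1-view bv
    (lambda (str)
      (let ((i (string-index str #\=)))
        (and i
             (cons (string->bytevector (substring str 0 i) "ISO-8859-1")
                   (string->bytevector (substring str (+ i 1))
                                       "ISO-8859-1")))))))

;; "K=" followed by the byte 0xb5.
(define entry (u8-list->bytevector '(#x4b #x3d #xb5)))
(define parts (split-env-entry entry))

(display (bytevector->u8-list (car parts))) (newline)  ; (75)
(display (bytevector->u8-list (cdr parts))) (newline)  ; (181)
```

Each call pays for a decode and an encode, which is the cost noted
above.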

Finally, while I'm not sure how I feel about it, one notable precedent
is Python's "surrogateescape" approach[5], which shifts any unencodable
bytes into "lone Unicode surrogates", a process which can (and of course
must) be safely reversed before handing the data back to the system.  It
has its own trade-offs and (security) concerns, as mentioned in the PEP.
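
For reference, the byte-to-surrogate mapping described in the PEP is
plain arithmetic: an undecodable byte in the range 0x80..0xFF becomes the
code point U+DC80..U+DCFF.  The sketch below shows only that mapping,
not a full decoder, and the procedure names are made up for
illustration.  Code points are kept as integers because, as far as I
know, Guile characters are restricted to Unicode scalar values and so
cannot hold a lone surrogate:

```scheme
;; Hypothetical helpers showing only the PEP 383 mapping.

;; Map an undecodable byte (0x80..0xFF) to its escape code point.
(define (byte->escape-code-point byte)
  (+ #xdc00 byte))

;; #t if CP is one of the code points used for escaped bytes.
(define (escape-code-point? cp)
  (<= #xdc80 cp #xdcff))

;; Reverse the mapping before handing data back to the system.
(define (escape-code-point->byte cp)
  (- cp #xdc00))

(display (number->string (byte->escape-code-point #xb5) 16))
(newline)                                              ; dcb5
(display (escape-code-point->byte (byte->escape-code-point #xb5)))
(newline)                                              ; 181
```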

[1] https://en.wikipedia.org/wiki/UTF-8#Encoding
[2] https://pubs.opengroup.org/onlinepubs/9699919799/functions/uselocale.html
[3] https://en.wikipedia.org/wiki/ISO/IEC_8859-1
[4] https://codeberg.org/rlb/guile/src/branch/utf8
[5] https://peps.python.org/pep-0383/

Thanks, and I'm happy to help with the implementation of whatever
improvements we choose, if we come to a consensus.

-- 
Rob Browning
rlb @defaultvalue.org and @debian.org
GPG as of 2011-07-10 E6A9 DA3C C9FD 1FF8 C676 D2C4 C0F0 39E9 ED1B 597A
GPG as of 2002-11-03 14DD 432F AE39 534D B592 F9A0 25C8 D377 8C7E 73A4
