unofficial mirror of guile-devel@gnu.org 
From: <tomas@tuxteam.de>
To: Rob Browning <rlb@defaultvalue.org>
Cc: guile-devel@gnu.org
Subject: Re: Improving the handling of system data (env, users, paths, ...)
Date: Sun, 7 Jul 2024 06:59:05 +0200
Message-ID: <ZoogmcyX9IsGMLRe@tuxteam.de>
In-Reply-To: <878qyeqn1q.fsf@trouble.defaultvalue.org>

On Sat, Jul 06, 2024 at 03:32:17PM -0500, Rob Browning wrote:
> 
> 
> * Problem
> 
> System data like environment variables, user names, group names, file
> paths and extended attributes (xattr), etc. are on some systems (like
> Linux) binary data, and may not be encodable as a string in the current
> locale.

Since this might get lost in the ensuing discussion, yes: in Linux (and
relatives) file names are byte arrays, not strings.

> It's perhaps worth noting, that while typically unlikely, any given
> directory could contain paths in an arbitrary collection of encodings:

Exactly: it's the creating process's locale that calls the shots. So
if you are in a multi-locale environment (e.g. users with different
encodings) this will happen.
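
To make that concrete, here is a small Python sketch (Python only because
it is convenient for byte-level illustration; the name and the helper
`decodes_as` are made up for this example). It shows the same name as
written by a UTF-8 process and by a Latin-1 process, and that only one of
the two resulting byte sequences is valid UTF-8:

```python
# Illustration only: the same name as written by processes in two locales.
name = "café"
utf8_bytes = name.encode("utf-8")      # what a UTF-8 locale would write
latin1_bytes = name.encode("latin-1")  # what a Latin-1 locale would write


def decodes_as(raw: bytes, encoding: str) -> bool:
    """Return True if raw decodes cleanly under the given encoding."""
    try:
        raw.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# A directory can hold both entries side by side; they are different names.
directory_listing = [utf8_bytes, latin1_bytes]
print([decodes_as(entry, "utf-8") for entry in directory_listing])
# prints [True, False]
```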

> At a minimum, I suggest Guile should produce an error by default
> (instead of generating incorrect data) when the system bytes cannot be
> encoded in the current locale.

Yes, perhaps.

[iso-8859-1]

> There are disadvantages to this approach, but it's a fairly easy
> improvement.

I'm not a fan of this one: watching Emacs's development, people end
up using Latin-1 as a poor substitute for a "byte array" :-)
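
A rough Python sketch of why Latin-1 is tempting as a stand-in for a byte
array, and what it costs (variable names are illustrative only): every
byte value maps to exactly one code point, so the round trip never fails,
but text that was really UTF-8 is silently misread.

```python
# Latin-1 maps each of the 256 byte values to one code point,
# so decoding never fails and the round trip is lossless.
all_bytes = bytes(range(256))
as_text = all_bytes.decode("latin-1")
round_tripped = as_text.encode("latin-1")
assert round_tripped == all_bytes

# The cost: bytes that were really UTF-8 come out as the wrong characters,
# and nothing signals that anything went wrong.
misread = "é".encode("utf-8").decode("latin-1")
print(len(misread))  # prints 2: one character has become two
```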

> The most direct (and compact, if we do convert to UTF-8) representation
> would be bytevectors, but then you would have a much more limited set of
> operations available (i.e. strings have all of srfi-13, srfi-14, etc.)
> unless we expanded them (likely re-using the existing code paths).  Of
> course you could still convert to Latin-1, perform the operation, and
> convert back, but that's not ideal.

It would be the right one: users would deal with explicit conversions
from/to strings, and so see the issues as they happen. But alas, you are
right: it's very inconvenient.

> Finally, while I'm not sure how I feel about it, one notable precedent
> is Python's "surrogateescape" approach[5], which shifts any unencodable
> bytes into "lone Unicode surrogates", a process which can (and of course
> must) be safely reversed before handing the data back to the system.  It
> has its own trade-offs/(security)-concerns, as mentioned in the PEP.

FWIW, that's more or less what Emacs's internal encoding does: it is roughly
UTF-8, but reserves some code points for odd bytes (which it then displays
as backslash sequences). It's round-trip safe, but has its own set of sharp
edges, and naive [1] users get caught on them from time to time.
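
For reference, the Python behaviour mentioned in the quoted text can be
sketched as follows (this shows Python's `surrogateescape` error handler,
not Emacs or Guile; the file name and the helper `strict_encode_fails` are
made up for the example). It shows both the safe round trip and one of
the sharp edges: the escaped string cannot be encoded as strict UTF-8.

```python
# A file name containing a byte (0xE9) that is not valid UTF-8 here.
raw = b"caf\xe9.txt"

# Undecodable bytes are shifted into lone surrogates (0xE9 -> U+DCE9).
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
back = text.encode("utf-8", errors="surrogateescape")
assert back == raw


def strict_encode_fails(s: str) -> bool:
    """Return True if s cannot be encoded as strict UTF-8."""
    try:
        s.encode("utf-8")
        return False
    except UnicodeEncodeError:
        return True


# Sharp edge: the string holds a lone surrogate, so any code that encodes
# it strictly (writing to a file, sending over a socket) will raise.
print(strict_encode_fails(text))  # prints True
```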

What's my point? Basically, that we shouldn't try to get it 100% right,
because there's possibly no way to, and we would pile up a lot of complexity
which is very difficult to get rid of (most languages have stories to tell
about their painful transitions).

I think it's ok to try some guesswork to make users' lives easier, but
it's perhaps better to (by default) fail noisily at the least suspicion
than to carry on happily with wrong results.

Guessing UTF-8 seems a safe bet: for one thing, everybody (except JavaScript)
is moving in that direction; for another, you notice quickly when it isn't
(as opposed to ISO-8859-x, which will trundle along, producing funny content).
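
A minimal Python sketch of the "guess UTF-8, fail noisily" default,
assuming a hypothetical helper `decode_system_bytes` (not an existing or
proposed Guile API). Strict UTF-8 decoding rejects a stray Latin-1 byte
immediately instead of returning wrong text:

```python
def decode_system_bytes(raw: bytes) -> str:
    """Hypothetical helper: guess UTF-8 and raise instead of guessing wrong."""
    return raw.decode("utf-8")  # strict error handling is the default


# Valid UTF-8 passes through unchanged.
ok = decode_system_bytes("naïve".encode("utf-8"))

# A Latin-1 encoded name is rejected rather than silently mangled.
try:
    decode_system_bytes("naïve".encode("latin-1"))
    noticed = False
except UnicodeDecodeError:
    noticed = True

print(noticed)  # prints True
```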

Cheers
-- 
t
