unofficial mirror of guile-devel@gnu.org 
From: Maxime Devos <maximedevos@telenet.be>
To: Rob Browning <rlb@defaultvalue.org>,
	 "guile-devel@gnu.org" <guile-devel@gnu.org>
Subject: RE: Improving the handling of system data (env, users, paths, ...)
Date: Sun, 7 Jul 2024 12:24:25 +0200
Message-ID: <20240707122425.kaQQ2C00E4hwdlW06aQRe0@michel.telenet-ops.be>
In-Reply-To: <878qyeqn1q.fsf@trouble.defaultvalue.org>

>* Problem
>
>System data like environment variables, user names, group names, file
>paths and extended attributes (xattr), etc. are on some systems (like
>Linux) binary data, and may not be encodable as a string in the current
>locale.  For Linux, as an example, only the null character is an invalid
>user/group/filename byte, while for UTF-8, a much smaller set of bytes
>are valid[1].
>[...]
>You end up with a question mark instead of the correct value.  This
>makes it difficult to write programs that don't risk silent corruption
>unless all the relevant system data is known to be compatible with the
>user's current locale.

>It's perhaps worth noting, that while typically unlikely, any given
>directory could contain paths in an arbitrary collection of encodings:
>UTF-8, SHIFT-JIS, Latin-1, etc., and so if you really want to try to
>handle them as strings (maybe you want to correctly upcase/downcase
>them), you have to know (somehow) the encoding that applies to each one.
>Otherwise, in the limiting case, you can only assume "bytes".

>* Improvements

>At a minimum, I suggest Guile should produce an error by default
>(instead of generating incorrect data) when the system bytes cannot be
>encoded in the current locale.

I totally agree on this.
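
To make the "error instead of ?" behaviour concrete, here is a minimal sketch using the existing (ice-9 iconv) module. The helper name ‘decode-strictly’ is made up for illustration and is not an existing Guile procedure; it only shows what strict decoding looks like from the caller's side.

```scheme
(use-modules (ice-9 iconv)        ; bytevector->string
             (rnrs bytevectors))  ; u8-list->bytevector

;; The bytes of "caf" followed by #xE9.  #xE9 is "é" in Latin-1, but a
;; lone #xE9 is not valid UTF-8.
(define raw-name (u8-list->bytevector '(#x63 #x61 #x66 #xE9)))

;; Hypothetical helper: decode with the 'error conversion strategy and
;; return #f when the bytes do not fit ENCODING, instead of returning
;; an approximated string.
(define (decode-strictly bv encoding)
  (catch #t
    (lambda () (bytevector->string bv encoding 'error))
    (lambda _ #f)))

(decode-strictly raw-name "UTF-8")       ; decoding fails, so #f
(decode-strictly raw-name "ISO-8859-1")  ; four characters, the last is é
```

A real API would presumably let the exception propagate rather than turn it into #f; the #f is only there to keep the example short.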

>There should also be some straightforward, thread-safe way to write code
>that accesses and manipulates system data efficiently and without
>corruption.

>As an incremental step, and as has been discussed elsewhere a bit, we
>might add support for uselocale()[2] and then document that the current
>recommendation is to always use ISO-8859-1 (i.e. Latin-1)[3] for system
>data unless you're certain your program doesn't need to be general
>purpose (perhaps you're sure you only care about UTF-8 systems).

I’d rather not. uselocale() is rather stateful and hence non-trivial to compose.
Also, a locale is not only about the encoding of text (file names, environment variables, xattrs, ...),
but also about language. Setting the language as well is excessive in this case.

>A program intended to work everywhere might then do something like
>this:

>   ...
>      #:use-module ((guile locale)
>                   #:select (iso-8859-1 with-locale))
>    ...
>
>    (define (environment name)
>      (with-locale iso-8859-1 (getenv name)))

This, OTOH, seems a bit better – ‘with-locale’ is like ‘parameterize’ and hence pretty composable.
However, it still suffers from the problem that it sets too much (also, there is no such thing as an “iso-8859-1” locale?).

Instead, I would propose something like:

;; [todo: add validation]
;; if #false, default to what is implied by the locale
(define system-encoding (make-parameter #false))
;; if #false, default to system-encoding
(define file-name-encoding (make-parameter #false))
[...]

;; let’s say that for some reason, we know the file names have this encoding,
;; but we don’t have information on other things so we leave the decision
;; on other encodings to the caller.
(define (some-proc)
  (parameterize ((file-name-encoding "UTF-8"))
    [open some file and do stuff with it]))

This also has the advantage of separating the different things a bit – I can imagine a multi-user system where the usernames are encoded differently from the file names in the users’ home directories (not an insurmountable problem for ‘with-locale’, but this seems a bit more straightforward to use when external libraries are involved).

(I’m not too sure about this splitting of parameter objects)
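
To spell out the "if #false, default to ..." chain, here is a rough sketch of how a lookup could work. ‘effective-file-name-encoding’ and ‘locale-implied-encoding’ are hypothetical names; the latter is a placeholder returning a fixed string, where a real implementation would ask the locale.

```scheme
;; #f means "defer to the next, more general setting".
(define system-encoding (make-parameter #f))
(define file-name-encoding (make-parameter #f))

;; Placeholder (assumption): stands in for whatever encoding the
;; current locale implies.
(define (locale-implied-encoding) "UTF-8")

;; Hypothetical resolver: the most specific parameter wins, then the
;; general one, then the locale.
(define (effective-file-name-encoding)
  (or (file-name-encoding)
      (system-encoding)
      (locale-implied-encoding)))

;; A caller that only knows something about system data in general:
(parameterize ((system-encoding "ISO-8859-1"))
  (effective-file-name-encoding))    ; "ISO-8859-1"

;; A library that knows the file names specifically overrides that:
(parameterize ((system-encoding "ISO-8859-1")
               (file-name-encoding "SHIFT_JIS"))
  (effective-file-name-encoding))    ; "SHIFT_JIS"
```

Since these are ordinary parameter objects, the settings nest and are restored on exit, which is the composability argument made above.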

>There are disadvantages to this approach, but it's a fairly easy
>improvement.

>Some potential disadvantages:

>  - In cases where the system data was actually UTF-8, non-ASCII
>    characters will be displayed "completely wrong", i.e. mapped to
>    "random" other characters according to the Latin-1 correspondences.

This is why I wouldn’t recommend always using ISO-8859-1 by default.
The situation where the encodings of things differ is the exception
(and a historical artifact of pre-UTF-8), not the norm.

I think changing the ‘?’ into ‘throw an exception’, and providing an _option_ (i.e. temporarily changing the locale to ISO-8859-1) to support this historical artifact as well, is sufficient.

>  - You have to pay whatever cost is involved in switching locales, and
>    in encoding/decoding the bytes, even if you only care about the
>    bytes.

IIRC, in ISO-8859-1 there is a direct correspondence between bytes and characters
(and Guile recognises this), so there is no cost beyond mere copying.
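
The correspondence itself (not the cost) can be checked with the existing (ice-9 iconv) procedures; this is just a quick sketch:

```scheme
(use-modules (ice-9 iconv)        ; bytevector->string, string->bytevector
             (rnrs bytevectors)   ; u8-list->bytevector, bytevector=?
             (srfi srfi-1))       ; every

;; Every possible byte value, 0 through 255.
(define all-bytes (u8-list->bytevector (iota 256)))

;; Decode as ISO-8859-1 and encode back again.
(define as-string (bytevector->string all-bytes "ISO-8859-1"))
(define round-tripped (string->bytevector as-string "ISO-8859-1"))

;; The round trip preserves every byte ...
(bytevector=? all-bytes round-tripped)

;; ... and each character's code point equals the byte it came from.
(every (lambda (char byte) (= (char->integer char) byte))
       (string->list as-string)
       (iota 256))
```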

>  - If any manipulations of the string representing the system data end
>    up performing Unicode canonicalizations or normalizations, the data
>    could still be corrupted.  I don't *think* Guile itself ever does
>    that implicitly.

Pretty sure it doesn’t.

>  - Less importantly, if we switch the internal string representation to
>    UTF-8 (proposed[4]), then non-ASCII bytes in the data will require
>    two bytes in memory.

>The most direct (and compact, if we do convert to UTF-8) representation
>would bytevectors, but then you would have a much more limited set of
>operations available (i.e. strings have all of srfi-13, srfi-14, etc.)
>unless we expanded them (likely re-using the existing code paths).  Of
>course you could still convert to Latin-1, perform the operation, and
>convert back, but that's not ideal.

>Finally, while I'm not sure how I feel about it, one notable precedent
>is Python's "surrogateescape" approach[5], which shifts any unencodable
>bytes into "lone Unicode surrogates", a process which can (and of course
>must) be safely reversed before handing the data back to the system.  It
>has its own trade-offs/(security)-concerns, as mentioned in the PEP.

IIRC, surrogates have code points, but are not characters. As a consequence, strings would contain non-characters, and (char? (string-ref s index)) might be #false. I’d rather not: such an object does not sound like a string to me.
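
As far as I know, this is already visible in Guile today: ‘integer->char’ refuses surrogate code points. A small sketch (‘code-point-is-char?’ is a made-up helper; U+DC80 is the lone surrogate that surrogateescape uses for the byte #x80):

```scheme
;; Hypothetical helper: #t if CODE-POINT can be turned into a Guile
;; character, #f if integer->char rejects it.
(define (code-point-is-char? code-point)
  (catch #t
    (lambda () (char? (integer->char code-point)))
    (lambda _ #f)))

(code-point-is-char? #x41)    ; LATIN CAPITAL LETTER A, a character
(code-point-is-char? #xDC80)  ; a lone surrogate, rejected
```

So supporting surrogateescape would mean either relaxing what a character is, or having strings whose elements are not characters.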

Here is an alternative solution:

1. Define a new object type ‘<unencoded-string>’ (wrapping a bytevector). This represents things that are _conceptually_ a string instead of a mere sequence of bytes, but we don’t know the actual encoding so we can’t let it be a string.
2. Also define a bunch of procedures for converting between bytes, unencoded-strings and strings. Also, a ‘string-like?’ predicate that includes both ‘<string>’ and ‘<unencoded-string>’.
3. Procedures like ‘open-file’ etc. are extended to support <unencoded-string>.
4. Maybe do the same for SRFI-N stuff (maybe as part of (srfi srfi-N gnu) extensions).
(I don’t know if (string-append unencoded encoded) should be supported.)
5. When a procedure would return a filename, it first looks at some parameter objects. These parameters determine what the encoding is and what to do when the bytes are not valid according to that encoding (approximate via ? and the like, throw an exception, or return an <unencoded-string>) – or even to return an <unencoded-string> unconditionally.
6. Also do the same for ‘getenv’ and the like, maybe with a different set of parameter objects.

(Name pending, <unencoded-string> not being a subtype of <string> is bad naming.)
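
A rough sketch of steps 1, 2 and 5, to give an idea of the shape. All names except ‘string-like?’ and ‘<unencoded-string>’ are invented here (‘decode-file-name’, ‘file-name-decoding-policy’, the policy symbols), and the "UTF-8" default is only an assumption for the example.

```scheme
(use-modules (srfi srfi-9)        ; define-record-type
             (ice-9 iconv)        ; bytevector->string
             (rnrs bytevectors))  ; u8-list->bytevector

;; Step 1: conceptually a string, but of unknown encoding.
(define-record-type <unencoded-string>
  (bytevector->unencoded-string bytes)
  unencoded-string?
  (bytes unencoded-string->bytevector))

;; Step 2: a predicate covering both string-ish types.
(define (string-like? obj)
  (or (string? obj) (unencoded-string? obj)))

;; Step 5: parameters consulted whenever a file name is returned.
;; Policy is one of: error, substitute, unencoded, always-unencoded.
(define file-name-encoding (make-parameter "UTF-8"))
(define file-name-decoding-policy (make-parameter 'error))

(define (decode-file-name bv)
  (case (file-name-decoding-policy)
    ;; Never decode; the caller only cares about the bytes.
    ((always-unencoded) (bytevector->unencoded-string bv))
    ;; Decode when possible, otherwise keep the raw bytes.
    ((unencoded)
     (catch #t
       (lambda () (bytevector->string bv (file-name-encoding) 'error))
       (lambda _ (bytevector->unencoded-string bv))))
    ;; Approximate, e.g. for a basic directory listing.
    ((substitute)
     (bytevector->string bv (file-name-encoding) 'substitute))
    ;; Default: raise an exception on undecodable bytes.
    (else (bytevector->string bv (file-name-encoding) 'error))))

;; #xE9 on its own is not valid UTF-8.
(define odd-name (u8-list->bytevector '(#x61 #xE9)))

(parameterize ((file-name-decoding-policy 'unencoded))
  (string-like? (decode-file-name odd-name)))
```

Procedures like ‘open-file’ (step 3) would then accept anything satisfying ‘string-like?’ and pass the bytes of an <unencoded-string> through unchanged.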

I think this combines most of the positive qualities and avoids most of the negative qualities (with the exception of the surrogate-encoding stuff, which I see mostly as a negative):

• “unless we expanded them (likely re-using the existing code paths)”

This seems doable.

• “- In cases where the system data was actually UTF-8, non-ASCII characters will be displayed "completely wrong", i.e. mapped to "random" other characters according to the Latin-1 correspondences.”

By distinguishing <string> from <unencoded-string>, for the most part this is non-applicable (depending on the encodings involved, <insert-encoding> might be incorrectly interpreted as UTF-8, but this seems rare).

• “even if you only care about the bytes.”

If you only care about the bytes, set the relevant parameter objects such that <unencoded-string> objects are returned.

• “At a minimum, I suggest Guile should produce an error by default (instead of generating incorrect data) when the system bytes cannot be encoded in the current locale.”

Included. Also, in the rare situation where approximating things is appropriate (e.g. a basic directory listing), generating incorrect data is also possible.

A negative quality is that there now are two string-ish object types, but since the two types represent different situations, one of them requires more care than the other, and many operations are supported for both, I don’t think that’s too bad.

(It might also be possible to replace <unencoded-string> directly by a bytevector, but if you do this, then remember that on the C level you need to deal with the lack of trailing \0.)
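
For the Scheme side of that, a minimal sketch of adding the terminator before handing bytes to C (‘bytevector->nul-terminated’ is a made-up name; inside libguile this would more likely happen in C):

```scheme
(use-modules (rnrs bytevectors))  ; make-bytevector, bytevector-copy!

;; Hypothetical helper: return a fresh bytevector holding the bytes of
;; BV followed by a single 0 byte, as C string functions expect.
(define (bytevector->nul-terminated bv)
  (let* ((len (bytevector-length bv))
         (out (make-bytevector (+ len 1) 0)))
    (bytevector-copy! bv 0 out 0 len)
    out))

;; The bytes of "/tmp" become "/tmp" plus a trailing 0.
(bytevector->nul-terminated (u8-list->bytevector '(47 116 109 112)))
```

Note that this copies, which an <unencoded-string> wrapper could avoid by always storing the terminator internally.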

Best regards,
Maxime Devos.

