On Sunday, 7 July 2024 at 17:25 +0300, Eli Zaretskii wrote:
> 
> If Guile restricts itself to Unicode characters and only them, it will
> lack important features.  So my suggestion is not to have this
> restriction.
> 
> I think the fact that this discussion is held, and that Rob suggested
> to use Latin-1 for the purpose of supporting raw bytes is a clear
> indication that Guile, too, needs to deal with "character-like" data
> that does not fit the Unicode framework.  So I think saying that
> strings in Guile can only hold Unicode characters will not give you
> what this discussion attempts to give.  In particular, how will you
> handle the situations described by Rob where a file has a name that is
> not a valid UTF-8 sequence (thus not "characters" as long as you
> interpret text as UTF-8)?


Whatever the details of aref in Emacs are (which I have not studied),
I think we all agree that

a) Strings in Scheme have the semantics of arrays of something called
   "characters".

b) According to Scheme standards and in current Guile, a character
   is a wrapper around a Unicode scalar value.

   (NB I wasn't precise enough in my previous email. R6RS explicitly
   disallows surrogate code points, so characters really correspond to
   scalar values and not to code points).

c) If we want Guile strings to losslessly represent arbitrary byte
   sequences, Guile's definition of a character needs to be expanded
   to include things other than Unicode scalar values.
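
To make (b) and (c) concrete, here is a minimal sketch in Rust, whose
`char` type is also defined as a Unicode scalar value. The helper names
`is_scalar_value` and `is_valid_utf8` are made up for this illustration;
only `char::from_u32` and `std::str::from_utf8` are standard library calls.

```rust
/// Returns true if `cp` is a Unicode scalar value, i.e. a code point
/// up to U+10FFFF that is not in the surrogate range D800..=DFFF.
/// (Hypothetical helper; `char::from_u32` does the actual check.)
fn is_scalar_value(cp: u32) -> bool {
    char::from_u32(cp).is_some()
}

/// Returns true if `bytes` decodes as UTF-8 without loss.
/// (Hypothetical helper built on `std::str::from_utf8`.)
fn is_valid_utf8(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}
```

With these, `is_scalar_value(0xD800)` is false (a surrogate), and
`is_valid_utf8(&[0xb5])` is false as well: the lone byte 0xB5 cannot be
mapped to any sequence of scalar values, which is why a string of
"characters" in the sense of (b) cannot represent such a file name.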

So what would it entail for Guile to change its string model in this
way?

First, Guile would become technically not R6RS-compliant. I'm not
sure how much of a problem this would actually be.

There are non-trivial backwards compatibility implications. To give
a concrete case: LilyPond definitely has code that would break if
passed a string whose "conversion to UTF-8" gave something not valid
UTF-8. (Examples off the top of my head: passing strings to the Pango
API and to GLib's PCRE-based regex API. By the way, running
"emacs $'\xb5'" gives a Pango warning on the terminal, I assume because
it tries to display the file name as the window title.)

From the implementation point of view: conversion from one encoding to
another could no longer use libiconv, because it stops on invalid
multibyte sequences. Likewise, Guile could probably not use libunistring
anymore. This means a large implementation cost to reimplement all
of this in Guile.

I don't think it's worth it. If anybody's going to work on this problem,
I'd recommend simply adding APIs like program-arguments-bytevector,
getenv-bytevector and the like, returning raw bytevectors instead of strings,
and letting programs which need to be reliable against invalid UTF-8
in the environment use these.

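That is also the approach taken in, e.g., Rust, which offers
std::env::args_os and std::env::var_os next to std::env::args and
std::env::var (except that due to the static typing, you are forced to
handle the "invalid UTF-8" error case when you use, e.g., std::env::var,
which returns a Result; std::env::args instead panics on an argument
that is not valid Unicode).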
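
For illustration, a rough sketch of how a Rust program can stay reliable
against invalid UTF-8 in its arguments. The function `split_args` is a
made-up name for this example; `OsString::into_string` is the standard
library call that either yields a `String` or hands back the original
`OsString` unchanged.

```rust
use std::ffi::OsString;

/// Splits OS-level arguments into those that are valid Unicode
/// (returned as `String`) and those that are not (kept as raw
/// `OsString`, so no bytes are lost). Hypothetical helper.
fn split_args<I>(args: I) -> (Vec<String>, Vec<OsString>)
where
    I: IntoIterator<Item = OsString>,
{
    let mut valid = Vec::new();
    let mut invalid = Vec::new();
    for arg in args {
        // into_string() returns Err(original) when the argument
        // is not valid Unicode.
        match arg.into_string() {
            Ok(s) => valid.push(s),
            Err(raw) => invalid.push(raw),
        }
    }
    (valid, invalid)
}
```

A program would call it as `split_args(std::env::args_os())`. The
bytevector-returning Guile procedures suggested above would play the
role of `args_os` here, with the existing string-returning ones kept
for programs that do not care.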