On Sunday, 7 July 2024 at 17:25 +0300, Eli Zaretskii wrote:
>
> If Guile restricts itself to Unicode characters and only them, it will
> lack important features.  So my suggestion is not to have this
> restriction.
>
> I think the fact that this discussion is held, and that Rob suggested
> to use Latin-1 for the purpose of supporting raw bytes is a clear
> indication that Guile, too, needs to deal with "character-like" data
> that does not fit the Unicode framework.  So I think saying that
> strings in Guile can only hold Unicode characters will not give you
> what this discussion attempts to give.  In particular, how will you
> handle the situations described by Rob where a file has a name that is
> not a valid UTF-8 sequence (thus not "characters" as long as you
> interpret text as UTF-8)?

Whatever the details of aref in Emacs are (which I have not studied), I
think we all agree that

a) Strings in Scheme have the semantics of arrays of something called
   "characters".

b) According to the Scheme standards and in current Guile, a character
   is a wrapper around a Unicode scalar value. (NB I wasn't precise
   enough in my previous email. R6RS explicitly disallows surrogate code
   points, so characters really correspond to scalar values and not to
   code points.)

c) If we want Guile strings to losslessly represent arbitrary byte
   sequences, Guile's definition of a character needs to be expanded to
   include things other than Unicode scalar values.

So what would it entail for Guile to change its string model in this
way?

First, Guile would technically no longer be R6RS-compliant. I'm not sure
how much of a problem this would actually be.

There are non-trivial backwards compatibility implications. To give a
concrete case: LilyPond definitely has code that would break if passed a
string whose "conversion to UTF-8" gave something that is not valid
UTF-8. (An example off the top of my head: passing strings to the Pango
API and to GLib's PCRE-based regex API.
By the way, running "emacs $'\xb5'" gives a Pango warning on the
terminal, I assume because it tries to display the file name as the
window title.)

From the implementation point of view: conversion from one encoding to
another could no longer use libiconv, because it stops on invalid
multibyte sequences. Likewise, Guile could probably not use
libunistring anymore. This means a large implementation cost to
reimplement all of this in Guile.

I don't think it's worth it. If anybody's going to work on this
problem, I'd recommend simply adding APIs like
program-arguments-bytevector, getenv-bytevector and the like, which
return raw bytevectors instead of strings, and letting programs that
need to be robust against invalid UTF-8 in the environment use them.

That is also the approach taken in, e.g., Rust (except that, due to the
static typing, you are forced to handle the "invalid UTF-8" error case
when you use, e.g., std::env::args as opposed to std::env::args_os).
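
To make the Rust comparison concrete, here is a rough sketch of how the
"raw" and "text" views are kept apart there. It only illustrates the
general shape and is not a proposal for how the Guile API should look.
The names Decoded and decode are made up for this example; args_os,
var_os, OsString::into_string and to_string_lossy are from the Rust
standard library.

```rust
use std::ffi::OsString;

/// Outcome of trying to view an OS-provided value as text.
/// (Hypothetical type, introduced only for this illustration.)
#[derive(Debug, PartialEq)]
enum Decoded {
    /// The value was valid Unicode and is returned as a String.
    Text(String),
    /// The value was not valid Unicode; the original is handed back untouched.
    Raw(OsString),
}

/// Hypothetical helper: classify a value without ever losing data.
/// OsString::into_string returns Err(original) when the contents are
/// not valid Unicode, so the caller has to deal with both cases.
fn decode(value: OsString) -> Decoded {
    match value.into_string() {
        Ok(text) => Decoded::Text(text),
        Err(raw) => Decoded::Raw(raw),
    }
}

fn main() {
    // args_os yields OsString values, so arbitrary bytes in an argument
    // (e.g. a file name such as $'\xb5') come through unchanged.
    for arg in std::env::args_os().skip(1) {
        match decode(arg) {
            Decoded::Text(t) => println!("text: {}", t),
            // to_string_lossy substitutes U+FFFD for invalid sequences,
            // so it is only suitable for display, not for round-tripping.
            Decoded::Raw(r) => println!("raw:  {:?} (lossy: {})", r, r.to_string_lossy()),
        }
    }

    // var_os is the environment-variable counterpart of args_os.
    if let Some(home) = std::env::var_os("HOME") {
        println!("HOME = {:?}", decode(home));
    }
}
```

The point of the design is that the lossless value (OsString here, a
bytevector in the APIs suggested above) is what the system hands you,
and turning it into a string is a separate step that can fail.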