On Sunday, 7 July 2024 at 08:33 +0300, Eli Zaretskii wrote:
> 
>     - The internal representation is a superset of UTF-8, in that it
>       is capable of representing characters for which there are no
>       Unicode codepoints (such as GB 18030, some of whose characters
>       don't have Unicode counterparts; and raw bytes, used to
>       represent byte sequences that cannot be decoded).  It uses
>       5-byte UTF-8-like sequences for these extensions.


Guile is a Scheme implementation, bound by Scheme standards and compatibility
with other Scheme implementations (and backwards compatibility too).

I just evaluated (aref (cadr command-line-args) 0) in a lisp-interaction-mode
buffer after launching Emacs as "emacs $'\xb5'". It returned 4194229 = 0x3fffb5,
which, as expected, lies outside the Unicode code point range 0 to 0x10FFFF.

That approach is not available to Guile: in Scheme semantics a character
is a Unicode code point, so there is no room for such out-of-range values.
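
To make the arithmetic concrete, here is a minimal sketch. The offset
0x3FFF00 is my inference from the value observed above (0x3fffb5 for the
byte 0xb5), and both helper names are made up for illustration; nothing
here is a Guile or Emacs API.

```scheme
;; Sketch only: the offset is inferred from the value observed above
;; (#x3FFFB5 for the byte #xB5); it is not taken from any library.
(define raw-byte-offset #x3FFF00)

;; Largest Unicode code point.
(define max-code-point #x10FFFF)

;; Hypothetical helper: the out-of-range integer standing for a raw byte.
(define (raw-byte->emacs-code byte)
  (+ raw-byte-offset byte))

;; True if N lies within the Unicode code point range.
(define (unicode-code-point? n)
  (<= 0 n max-code-point))

(display (raw-byte->emacs-code #xB5))                        ; prints 4194229
(newline)
(display (unicode-code-point? (raw-byte->emacs-code #xB5)))  ; prints #f
(newline)
```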


>     - Emacs has its own code for code-conversion, for moving by
>       characters through multibyte sequences, for producing a Unicode
>       codepoint from a byte sequence in the super-UTF-8 representation
>       and back, etc., so it doesn't use libc routines for that, and
>       thus doesn't depend on the current locale for these operations.


Guile's encoding conversions don't rely on the libc locale; they use
GNU libiconv. The issue at hand is specific to argv: it is converted at
startup using the locale encoding as the default (AFAICT Guile uses
environ_locale_charset from gnulib to turn the libc locale into an
encoding name usable by libiconv), and Guile doesn't keep the original
argv bytes.
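
A small illustration of why the missing original bytes matter. This does
not reproduce Guile's actual startup code; it only assumes a UTF-8 locale
and uses the 'substitute conversion strategy of (ice-9 iconv) to show that
a byte which cannot be decoded is not recoverable from the resulting string.

```scheme
(use-modules (ice-9 iconv))

;; The single byte passed on the command line in the test above.
(define raw-arg #vu8(181))               ; 181 = #xB5

;; Decode as a UTF-8 locale would, replacing what cannot be decoded.
(define decoded (bytevector->string raw-arg "UTF-8" 'substitute))

;; Re-encode the decoded string and compare with the original byte.
(define re-encoded (string->bytevector decoded "UTF-8"))

(display (equal? raw-arg re-encoded))
(newline)
```

The comparison should come out false: whatever replacement character the
decoder picks, encoding it again cannot give back the lone byte 0xb5.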


>     - APIs are provided for "manual" encoding and decoding.  A Lisp
>       program can read a byte stream, then decode it "manually" using
>       a particular codeset, as deemed appropriate.  This allows to
>       handle complex situations where a program receives stuff whose
>       encoding can only be determined by examining the raw byte stream
>       (a typical example is a multipart email message with MIME
>       charset header for each part).


These exist in Guile too; see the (ice-9 iconv) module.
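
For instance, a sketch of decoding one message part "manually" once its
charset is known. decode-part is a hypothetical helper of mine; only
bytevector->string and string->bytevector come from (ice-9 iconv).

```scheme
(use-modules (ice-9 iconv))

;; Hypothetical helper: decode one message part given its declared charset.
(define (decode-part bytes charset)
  (bytevector->string bytes charset))

;; The byte #xB5 is MICRO SIGN (U+00B5) in ISO-8859-1.
(define part (decode-part #vu8(181) "ISO-8859-1"))

(display (char->integer (string-ref part 0)))   ; prints 181
(newline)

;; Encoding the same string as UTF-8 yields the two bytes #xC2 #xB5.
(display (string->bytevector part "UTF-8"))     ; prints #vu8(194 181)
(newline)
```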


>     - Emacs also has tables of Unicode attributes of characters
>       (produced by parsing the relevant Unicode data files at build
>       time), so it can up/down-case characters, determine their
>       category (letters, digits, punctuation, etc.) and script to
>       which they belong, etc. -- all with its own code, independent of
>       the underlying libc.


This also exists in Guile and, AFAICT, is built on GNU libunistring. See
string-upcase, char-general-category, etc.
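
A quick sketch of these procedures at work. The sample character is built
from its code point so that the example does not depend on the encoding of
the source file.

```scheme
;; U+00E9, LATIN SMALL LETTER E WITH ACUTE.
(define e-acute (integer->char #xE9))

(display (char-general-category e-acute))        ; prints Ll
(newline)
(display (char-general-category #\5))            ; prints Nd
(newline)
(display (char->integer (char-upcase e-acute)))  ; prints 201, i.e. U+00C9
(newline)
(display (string-upcase "guile"))                ; prints GUILE
(newline)
```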
