From: Eli Zaretskii
Sent: Sunday, 7 July 2024 13:05
To: Jean Abou Samra
Cc: rlb@defaultvalue.org; guile-devel@gnu.org
Subject: Re: Improving the handling of system data (env, users, paths, ...)
> From: Jean Abou Samra <jean@abou-samra.fr>
> Cc: guile-devel@gnu.org
> Date: Sun, 07 Jul 2024 12:03:06 +0200
>
> Le dimanche 07 juillet 2024 à 08:33 +0300, Eli Zaretskii a écrit :
> >
> > - The internal representation is a superset of UTF-8, in that it
> > is capable of representing characters for which there are no
> > Unicode codepoints (such as GB 18030, some of whose characters
> > don't have Unicode counterparts; and raw bytes, used to
> > represent byte sequences that cannot be decoded). It uses
> > 5-byte UTF-8-like sequences for these extensions.
>
>
>> Guile is a Scheme implementation, bound by Scheme standards and compatibility
>> with other Scheme implementations (and backwards compatibility too).
>
>Yes, I understand that.
Going by what you are saying below, I think you don’t.
>> I just tried (aref (cadr command-line-args) 0) in a lisp-interaction-mode
>> Emacs buffer after launching "emacs $'\xb5'". It gave 4194229 = 0x3fffb5,
>> which quite logically is outside the Unicode code point range 0 - 0x110000.
>That's not how you get a raw byte from a multibyte string in Emacs.
>IOW, your code is wrong, if what you wanted was to get the 0xb5 byte.
>I guess you assumed something about 'aref' in Emacs that is not true
>with multibyte strings that include raw bytes. So what you got
>instead is the internal Emacs "codepoint" for raw bytes, which are in
>the 0x3fff00..0x3fffff range.
I’m pretty sure that they weren’t intending to get the 0xb5 byte. Rather, they were using the equivalent of ‘string-ref’ (i.e., ‘aref’) and demonstrating that the result is bogus in Scheme. In Scheme, ‘(string-ref ...)’ needs to return a character, and there exists no (Unicode) character with codepoint 4194229, so what Emacs returns here would be bogus for (Guile) Scheme.
From the Emacs manual:
>For example, you can access individual characters in a string using the function aref (see Functions that Operate on Arrays).
Thus, (aref the-string index) is the equivalent of (string-ref the-string index). I do not see any indication that they were trying to extract the byte itself. Rather, they were extracting the _character_ corresponding to the byte, and demonstrating that this ‘character’ is, in fact, not a character in Scheme (in other words, no such character exists in Scheme).
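To make the point concrete, here is a minimal sketch in plain Scheme that checks whether the number Emacs returned could be a Scheme character at all. The helper name valid-scalar-value? is my own, made up for illustration; it only encodes the Unicode scalar value range (0 to #x10FFFF, minus the surrogates), which is what Scheme's integer->char accepts.

```scheme
;; The value Emacs returned for the raw byte 0xb5 (from the example above).
(define raw-byte-codepoint #x3fffb5)

;; Hypothetical helper: is N a Unicode scalar value, i.e. something
;; that could be passed to integer->char to obtain a Scheme character?
(define (valid-scalar-value? n)
  (or (<= 0 n #xD7FF)
      (<= #xE000 n #x10FFFF)))

(display (= raw-byte-codepoint 4194229))           ; same number as in the example
(newline)
(display (valid-scalar-value? #xb5))               ; U+00B5 is an ordinary character
(newline)
(display (valid-scalar-value? raw-byte-codepoint)) ; the Emacs value is not
(newline)
```

So whatever 'string-ref' would return for such an element, it could not be built with integer->char, which is the sense in which it is not a Scheme character.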
>> This doesn't work for Guile, since a character is a Unicode code point
>> in the Scheme semantics.
>See above: the problem doesn't exist if one uses the correct APIs.
AFAICT, there are no correct APIs. Fundamentally (whether for compatibility or by choice), characters in (Guile) Scheme are _Unicode_ characters and (Scheme) strings consist of _only_ such _Unicode characters_. Elisp strings, however, can contain more: characters from Emacs’ extended set, or a mixture of Unicode characters and raw bytes. In both cases the Elisp APIs that would return characters return things that aren’t _Unicode_ characters, and hence aren’t appropriate APIs for Guile.
This doesn’t mean that Emacs’ model can’t be adopted – rather, it could perhaps be partially adopted, but whenever the resulting ‘string’ contains things that aren’t (Unicode) characters, the result may not be called a ‘string’, and some of the things in the not-string may not be called ‘characters’.
> > - Emacs has its own code for code-conversion, for moving by
> > characters through multibyte sequences, for producing a Unicode
> > codepoint from a byte sequence in the super-UTF-8 representation
> > and back, etc., so it doesn't use libc routines for that, and
> > thus doesn't depend on the current locale for these operations.
>
> Guile's encoding conversions don't rely on the libc locale. They use
> GNU libiconv.
>That's okay, but what about other APIs, like conversion between characters and their multibyte representations,
This is not _another_ API; it is precisely the (ice-9 iconv) API. See string->bytevector and bytevector->string (you need to turn the single character into a string consisting of that one character first, but this is trivial: simply do (string [insert-character-here])).
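As a minimal sketch of what I mean, assuming Guile with its (ice-9 iconv) and (rnrs bytevectors) modules: the helper names char->bytes and bytes->string are made up for this example, and the encoding names are passed straight to iconv.

```scheme
(use-modules (ice-9 iconv)        ; string->bytevector, bytevector->string
             (rnrs bytevectors))  ; bytevector->u8-list, u8-list->bytevector

;; Hypothetical helper: the multibyte representation of one character
;; in ENCODING, obtained by wrapping the character in a 1-character string.
(define (char->bytes c encoding)
  (string->bytevector (string c) encoding))

;; Hypothetical helper: decode a bytevector back into a string.
(define (bytes->string bv encoding)
  (bytevector->string bv encoding))

(define micro (integer->char #xb5)) ; U+00B5 MICRO SIGN

;; Two bytes in UTF-8, one byte in ISO-8859-1.
(display (bytevector->u8-list (char->bytes micro "UTF-8")))
(newline)
(display (bytevector->u8-list (char->bytes micro "ISO-8859-1")))
(newline)

;; Round trip: the byte 0xb5 decoded as ISO-8859-1 gives the character back.
(display (char=? micro
                 (string-ref (bytes->string (u8-list->bytevector '(#xb5))
                                            "ISO-8859-1")
                             0)))
(newline)
```

Note that nothing here consults the libc locale; the encoding is an explicit argument.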
> returning the length of a string in characters, etc.? AFAIK, libiconv doesn't provide these facilities.
This is a basic string API: just use string-length, as in (all?) Schemes. In Scheme, strings consist of characters, so string-length returns the length of a string in characters.
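For illustration, a small sketch contrasting the length in characters with the length of an encoded form, assuming Guile's (rnrs bytevectors) module for string->utf8 and bytevector-length. The example string is made up.

```scheme
(use-modules (rnrs bytevectors)) ; string->utf8, bytevector-length

;; A two-character string: U+00B5 MICRO SIGN followed by #\a.
(define s (string (integer->char #xb5) #\a))

;; Length in characters: what string-length is defined to return.
(display (string-length s))
(newline)

;; Length in bytes of the UTF-8 encoding: a property of the
;; bytevector, not of the string (U+00B5 takes two bytes in UTF-8).
(display (bytevector-length (string->utf8 s)))
(newline)
```

The byte count only appears once you explicitly ask for an encoded representation; the string itself is just a sequence of characters.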
Best regards,
Maxime Devos.