From: Eli Zaretskii
Sent: Sunday, 7 July 2024 13:05
To: Jean Abou Samra
Cc: rlb@defaultvalue.org; guile-devel@gnu.org
Subject: Re: Improving the handling of system data (env, users, paths, ...)

 

> From: Jean Abou Samra <jean@abou-samra.fr>

> Cc: guile-devel@gnu.org

> Date: Sun, 07 Jul 2024 12:03:06 +0200

>

> On Sunday, 7 July 2024 at 08:33 +0300, Eli Zaretskii wrote:

> >

> >     - The internal representation is a superset of UTF-8, in that it

> >       is capable of representing characters for which there are no

> >       Unicode codepoints (such as GB 18030, some of whose characters

> >       don't have Unicode counterparts; and raw bytes, used to

> >       represent byte sequences that cannot be decoded).  It uses

> >       5-byte UTF-8-like sequences for these extensions.

>

>

>> Guile is a Scheme implementation, bound by Scheme standards and compatibility

>> with other Scheme implementations (and backwards compatibility too).

> 

>Yes, I understand that.

 

Going by what you are saying below, I think you don’t.

 

>> I just tried (aref (cadr command-line-args) 0) in a lisp-interaction-mode

>> Emacs buffer after launching "emacs $'\xb5'". It gave 4194229 = 0x3fffb5,

>> which quite logically is outside the Unicode code point range 0 - 0x110000.

>That's not how you get a raw byte from a multibyte string in Emacs.

>IOW, your code is wrong, if what you wanted was to get the 0xb5 byte.

>I guess you assumed something about 'aref' in Emacs that is not true

>with multibyte strings that include raw bytes.  So what you got

>instead is the internal Emacs "codepoint" for raw bytes, which are in

>the 0x3fff00..0x3fffff range.

 

I’m pretty sure that they weren’t intending to get the 0xb5 byte. Rather, they were using the equivalent of ‘string-ref’ (i.e., ‘aref’) and demonstrating that the result is bogus in Scheme.  In Scheme, ‘(string-ref ...)’ needs to return a character, and there exists no (Unicode) character with codepoint 4194229, so what Emacs returns here would be bogus for (Guile) Scheme.

 

From the Emacs manual:

 

>For example, you can access individual characters in a string using the function aref (see Functions that Operate on Arrays).

 

Thus, (aref the-string index) is the equivalent of (string-ref the-string index). I do not see any indication that they were trying to extract the byte itself. Rather, they were extracting the _character_ corresponding to the byte, and demonstrating that this ‘character’ is not actually a character in Scheme (in other words, no such character exists in Scheme).
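
To make the point concrete, here is a minimal Guile sketch. The helper name unicode-scalar-value? is my own invention for illustration (it is not a Guile API), and I am assuming that integer->char signals an out-of-range error for such values, which is what I would expect from Guile 3.0.

```scheme
;; The value Emacs returned for the raw byte #xb5: 4194229 = #x3fffb5.
(define emacs-raw-byte-code #x3fffb5)

;; Hypothetical helper (not a Guile API): is N a Unicode scalar value,
;; i.e. a code point that a Scheme character is allowed to have?
(define (unicode-scalar-value? n)
  (and (integer? n) (exact? n)
       (or (<= 0 n #xd7ff)            ; below the surrogate range
           (<= #xe000 n #x10ffff))))  ; above it, up to the Unicode maximum

(define raw-byte-is-char? (unicode-scalar-value? emacs-raw-byte-code))
(define micro-sign-is-char? (unicode-scalar-value? #xb5))

;; Assumption: integer->char rejects the Emacs value with an error.
;; false-if-exception turns that error into #f so the script keeps going.
(define converted
  (false-if-exception (integer->char emacs-raw-byte-code)))

(display (list raw-byte-is-char? micro-sign-is-char? converted))
(newline)
```

So #xb5 on its own is a perfectly good character (U+00B5), but the Emacs-internal value for the raw byte is not something string-ref could ever return in Scheme.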

 

>> This doesn't work for Guile, since a character is a Unicode code point

>> in the Scheme semantics.

>See above: the problem doesn't exist if one uses the correct APIs.

 

AFAICT, there are no correct APIs. Fundamentally (whether for compatibility or by choice), characters in (Guile) Scheme are _Unicode_ characters and (Scheme) strings consist of _only_ such _Unicode characters_. Yet in Elisp, strings consist of more stuff – whether that be characters from Emacs’ extended set, or a mixture of Unicode and raw bytes. In both cases, the Elisp APIs that would return characters return things that aren’t _Unicode_ characters, and hence aren’t appropriate APIs for Guile.

 

This doesn’t mean that Emacs’ model can’t be adopted – rather, it could perhaps be partially adopted, but whenever the resulting ‘string’ contains things that aren’t (Unicode) characters, the result may not be called a ‘string’, and some of the things in the not-string may not be called ‘characters’.

 

> >     - Emacs has its own code for code-conversion, for moving by

> >       characters through multibyte sequences, for producing a Unicode

> >       codepoint from a byte sequence in the super-UTF-8 representation

> >       and back, etc., so it doesn't use libc routines for that, and

> >       thus doesn't depend on the current locale for these operations.

>

> Guile's encoding conversions don't rely on the libc locale. They use

> GNU libiconv.

 

>That's okay, but what about other APIs, like conversion between

>characters and their multibyte representations,

 

This is not some _other_ API; this is precisely the (ice-9 iconv) API. See string->bytevector and bytevector->string. (You need to turn the single character into a one-character string first, but this is trivial: simply do (string [insert-character-here]).)
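
A minimal sketch of what I mean, using only string->bytevector and bytevector->string from (ice-9 iconv). The wrapper names char->bytes and bytes->char are hypothetical, made up for this example; they are not part of Guile.

```scheme
(use-modules (ice-9 iconv)        ; string->bytevector, bytevector->string
             (rnrs bytevectors))  ; bytevector=?

;; Hypothetical wrapper: encode a single character in ENCODING.
(define (char->bytes c encoding)
  (string->bytevector (string c) encoding))

;; Hypothetical wrapper: decode BV in ENCODING and take the first character.
(define (bytes->char bv encoding)
  (string-ref (bytevector->string bv encoding) 0))

;; U+00B5 MICRO SIGN, written via its code point so that the example
;; does not depend on the encoding of the source file.
(define micro (integer->char #xb5))

(define utf8-bytes (char->bytes micro "UTF-8"))         ; bytes 194 181
(define latin1-bytes (char->bytes micro "ISO-8859-1"))  ; byte 181
(define round-trip (bytes->char utf8-bytes "UTF-8"))    ; micro again

(display (list utf8-bytes latin1-bytes (char=? round-trip micro)))
(newline)
```

Note that the encoding is passed explicitly, so none of this depends on the current locale.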

 

> returning the length of a string in characters, etc.?  AFAIK, libiconv doesn't provide

> these facilities.

 

This is a basic string API: just use string-length, as in (all?) Schemes. In Scheme, strings consist of characters, so string-length returns the length of a string in characters.
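
For illustration, a small sketch contrasting the length in characters with the length in bytes of one particular encoding (the variable names are mine, chosen for the example):

```scheme
(use-modules (ice-9 iconv)        ; string->bytevector
             (rnrs bytevectors))  ; bytevector-length

;; A two-character string: U+00B5 MICRO SIGN followed by "m".
(define s (string (integer->char #xb5) #\m))

;; Length in characters, independent of any encoding.
(define length-in-chars (string-length s))

;; Length in bytes, which only makes sense once an encoding is chosen.
(define length-in-utf8-bytes
  (bytevector-length (string->bytevector s "UTF-8")))

(display (list length-in-chars length-in-utf8-bytes))  ; prints (2 3)
(newline)
```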

 

Best regards,

Maxime Devos.