From: Eli Zaretskii
Sent: Sunday, 7 July 2024 13:05
To: Jean Abou Samra
Cc: rlb@defaultvalue.org; guile-devel@gnu.org
Subject: Re: Improving the handling of system data (env, users, paths, ...)

>> From: Jean Abou Samra
>> Cc: guile-devel@gnu.org
>> Date: Sun, 07 Jul 2024 12:03:06 +0200
>>
>> On Sunday, 07 July 2024 at 08:33 +0300, Eli Zaretskii wrote:
>>>     - The internal representation is a superset of UTF-8, in that it
>>>       is capable of representing characters for which there are no
>>>       Unicode codepoints (such as GB 18030, some of whose characters
>>>       don't have Unicode counterparts; and raw bytes, used to
>>>       represent byte sequences that cannot be decoded).  It uses
>>>       5-byte UTF-8-like sequences for these extensions.
>>
>> Guile is a Scheme implementation, bound by Scheme standards and
>> compatibility with other Scheme implementations (and backwards
>> compatibility too).

> Yes, I understand that.

Going by what you are saying below, I think you don’t.

>> I just tried (aref (cadr command-line-args) 0) in a
>> lisp-interaction-mode Emacs buffer after launching "emacs $'\xb5'".
>> It gave 4194229 = 0x3fffb5, which quite logically is outside the
>> Unicode code point range 0 - 0x110000.

> That's not how you get a raw byte from a multibyte string in Emacs.
> IOW, you code is wrong, if what you wanted was to get the 0xb5 byte.
> I guess you assumed something about 'aref' in Emacs that is not true
> with multibyte strings that include raw bytes. So what you got
> instead is the internal Emacs "codepoint" for raw bytes, which are in
> the 0x3fff00..0x3fffff range.

I’m pretty sure that they weren’t intending to get the 0xb5 byte.
Rather, they were using the equivalent of ‘string-ref’ (i.e., ‘aref’)
and demonstrating that the result is bogus in Scheme.
In Scheme, ‘(string-ref ...)’ needs to return a character, and there
exists no (Unicode) character with codepoint 4194229, so what Emacs
returns here would be bogus for (Guile) Scheme. From the Emacs manual:

> For example, you can access individual characters in a string using
> the function aref (see Functions that Operate on Arrays). Thus,
> (aref the-string index) is the equivalent of
> (string-ref the-string index).

I do not see any indication they were trying to extract the byte
itself. Rather, they were extracting the _character_ corresponding to
the byte, and demonstrating that this ‘character’ is, in fact, not
actually a character in Scheme (in other words, no such character
exists in Scheme).

>> This doesn't work for Guile, since a character is a Unicode code
>> point in the Scheme semantics.

> See above: the problem doesn't exist if one uses the correct APIs.

AFAICT, there are no correct APIs. Fundamentally (whether for
compatibility or by choice), characters in (Guile) Scheme are
_Unicode_ characters and (Scheme) strings consist of _only_ such
_Unicode characters_. Elisp strings, however, consist of more than
that. Whether the extra content is characters from Emacs’ extended set
or a mixture of Unicode and raw bytes, in both cases the Elisp APIs
that would return characters return things that aren’t _Unicode_
characters, and hence aren’t appropriate APIs for Guile.

This doesn’t mean that Emacs’ model can’t be adopted. Rather, it could
perhaps be partially adopted, but whenever the resulting ‘string’
contains things that aren’t (Unicode) characters, the result may not
be called a ‘string’, and some of the things in the not-string may not
be called ‘characters’.
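
To make the point about 4194229 concrete, here is a minimal sketch for
Guile (I'm assuming Guile 3.x; valid-code-point? and try-integer->char
are hypothetical helper names of my own, not part of Guile). It checks
whether the value that aref returned is a Unicode scalar value, and
whether integer->char is willing to make a character out of it.

```scheme
;; The value Emacs returned for the raw byte 0xb5 (from the quoted test).
(define emacs-raw-byte-value #x3fffb5)   ; = 4194229

;; Hypothetical helper: is N a Unicode scalar value?
;; Scalar values are 0..#xD7FF and #xE000..#x10FFFF (surrogates excluded).
(define (valid-code-point? n)
  (or (<= 0 n #xD7FF)
      (<= #xE000 n #x10FFFF)))

;; Hypothetical helper: return the character for N, or #f if
;; integer->char signals an error instead of producing a character.
(define (try-integer->char n)
  (catch #t
    (lambda () (integer->char n))
    (lambda (key . args) #f)))

(valid-code-point? emacs-raw-byte-value)   ; => #f
(try-integer->char emacs-raw-byte-value)   ; => #f, no such character
(try-integer->char #xb5)                   ; => #\µ, an ordinary character
```

So a Scheme string-ref has nothing it could return for that element:
the only character available at 0xb5 is U+00B5, which is a different
thing from the raw byte 0xb5.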

>>>     - Emacs has its own code for code-conversion, for moving by
>>>       characters through multibyte sequences, for producing a Unicode
>>>       codepoint from a byte sequence in the super-UTF-8 representation
>>>       and back, etc., so it doesn't use libc routines for that, and
>>>       thus doesn't depend on the current locale for these operations.
>>
>> Guile's encoding conversions don't rely on the libc locale. They use
>> GNU libiconv.

> That's okay, but what about other APIs, like conversion between
> characters and their multibyte representations,

This is not an _other_ API; this is precisely the (ice-9 iconv) API.
See string->bytevector and bytevector->string. (You need to turn the
single character into a string consisting of that one character first,
but this is trivial: simply do (string [insert-character-here]).)

> returning the length of a string in characters, etc.? AFAIK, libiconv
> doesn't provide these facilities.

This is a basic string API: just use string-length, as in (all?)
Schemes. In Scheme, strings consist of characters, so string-length
returns the length of a string in characters.

Best regards,
Maxime Devos.