From: Eli Zaretskii <eliz@gnu.org>
To: Maxime Devos <maximedevos@telenet.be>
Cc: jean@abou-samra.fr, rlb@defaultvalue.org, guile-devel@gnu.org
Subject: Re: Improving the handling of system data (env, users, paths, ...)
Date: Sun, 07 Jul 2024 17:25:06 +0300
Message-ID: <8634ol2sal.fsf@gnu.org>
In-Reply-To: <20240707133527.kbbT2C0064hwdlW01bbTq5@baptiste.telenet-ops.be> (message from Maxime Devos on Sun, 7 Jul 2024 13:35:27 +0200)
> Cc: "rlb@defaultvalue.org" <rlb@defaultvalue.org>,
> "guile-devel@gnu.org" <guile-devel@gnu.org>
> From: Maxime Devos <maximedevos@telenet.be>
> Date: Sun, 7 Jul 2024 13:35:27 +0200
>
> >> Guile is a Scheme implementation, bound by Scheme standards and compatibility
> >> with other Scheme implementations (and backwards compatibility too).
> >
> >Yes, I understand that.
>
> Going by what you are saying below, I think you don’t.
Thank you for your vote of confidence.
> >> I just tried (aref (cadr command-line-args) 0) in a lisp-interaction-mode
> >> Emacs buffer after launching "emacs $'\xb5'". It gave 4194229 = 0x3fffb5,
> >> which quite logically is outside the Unicode code point range 0 - 0x110000.
> >That's not how you get a raw byte from a multibyte string in Emacs.
> >IOW, your code is wrong, if what you wanted was to get the 0xb5 byte.
> >I guess you assumed something about 'aref' in Emacs that is not true
> >with multibyte strings that include raw bytes. So what you got
> >instead is the internal Emacs "codepoint" for raw bytes, which are in
> >the 0x3fff00..0x3fffff range.
>
> I’m pretty sure that they weren’t intending to get the 0xb5 byte. Rather, they were using the equivalent of ‘string-ref’ (i.e., ‘aref’) and demonstrating that the result is bogus in Scheme. In Scheme, ‘(string-ref ...)’ needs to return a character, and there exists no (Unicode) character with codepoint 4194229, so what Emacs returns here would be bogus for (Guile) Scheme.
aref in Emacs and string-ref in Guile are not the same, and if Guile
needs to produce a raw byte in this scenario, it can be easily
arranged. In Emacs we have other goals.
IOW, I think this argument is pointless, since it is easy to adapt the
mechanism to what Guile needs.
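To make the numbers quoted above concrete, here is a minimal sketch of the arithmetic only. It is plain Python, not Emacs or Guile code; the helper name raw_byte_of is hypothetical, and the range is the 0x3fff00..0x3fffff one mentioned earlier in this message.

```python
# Sketch of the arithmetic only; this does not call any Emacs or Guile API.
RAW_BYTE_BASE = 0x3FFF00  # start of the raw-byte range quoted in this message
RAW_BYTE_LAST = 0x3FFFFF  # end of that range
UNICODE_MAX = 0x10FFFF    # largest Unicode code point


def raw_byte_of(code):
    """Return the byte behind a raw-byte code, or None for anything else.

    Hypothetical helper, named here for illustration.
    """
    if RAW_BYTE_BASE <= code <= RAW_BYTE_LAST:
        return code - RAW_BYTE_BASE
    return None


code = 4194229  # the value reported for (aref (cadr command-line-args) 0)
print(hex(code))               # the same value in hexadecimal
print(code > UNICODE_MAX)      # whether it lies outside the Unicode range
print(hex(raw_byte_of(code)))  # the original byte recovered by subtraction
```

Mapping such a value back to a byte is a single subtraction, which is the sense in which the mechanism could be adapted.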
> >From the Emacs manual:
>
> >For example, you can access individual characters in a string using the function aref (see Functions that Operate on Arrays).
>
> Thus, (aref the-string index) is the equivalent of (string-ref the-string index).
No, because a raw byte is not a character.
> I do not see any indication they were trying to extract the byte itself, rather they were extracting the _character_ corresponding to the byte, and demonstrating that this ‘character’ is, in fact, not actually a character in Scheme (or in other words, no such character exists in Scheme).
> >> This doesn't work for Guile, since a character is a Unicode code point
> >> in the Scheme semantics.
> >See above: the problem doesn't exist if one uses the correct APIs.
>
> AFAICT, there are no correct APIs. Fundamentally (whether for compatibility or by choice), characters in (Guile) Scheme are _Unicode_ characters and (Scheme) strings consist of _only_ such _Unicode characters_. Yet, in Elisp strings consist of more stuff – whether that be characters from Emacs’ extended set, or a mixture of Unicode and raw bytes, in both cases the Elisp APIs that would return characters return things that aren’t _Unicode_ characters, and hence aren’t appropriate APIs for Guile.
If Guile restricts itself to Unicode characters and only them, it will
lack important features. So my suggestion is not to have this
restriction.
I think the fact that this discussion is held, and that Rob suggested
to use Latin-1 for the purpose of supporting raw bytes is a clear
indication that Guile, too, needs to deal with "character-like" data
that does not fit the Unicode framework. So I think saying that
strings in Guile can only hold Unicode characters will not give you
what this discussion attempts to give. In particular, how will you
handle the situations described by Rob where a file has a name that is
not a valid UTF-8 sequence (thus not "characters" as long as you
interpret text as UTF-8)?
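As a small illustration of that situation, the sketch below takes the single byte 0xb5 from the example earlier in this thread as a file name. It checks that the byte is not valid UTF-8, and that decoding it as Latin-1 (the approach attributed to Rob above) round-trips. It is plain Python, used only to show the encoding facts; the helper name is_valid_utf8 is hypothetical, and none of this is a Guile API.

```python
# Sketch of the encoding facts only; not a Guile or Emacs API.
raw_name = bytes([0xB5])  # the single-byte name from the 0xb5 example


def is_valid_utf8(data):
    """Return True if the bytes decode as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Latin-1 maps every byte 0..255 to the code point with the same number,
# so any byte sequence decodes and encodes back unchanged.
as_latin1 = raw_name.decode("latin-1")
round_tripped = as_latin1.encode("latin-1")

print(is_valid_utf8(raw_name))    # is the name valid UTF-8?
print(round_tripped == raw_name)  # does the Latin-1 round trip preserve it?
```

The Latin-1 result is an ordinary Unicode string, but it no longer records whether the original bytes were text or an undecodable sequence.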
> This doesn’t mean that Emacs’ model can’t be adopted – rather, it could perhaps be partially adopted, but whenever the resulting ‘string’ contains things that aren’t (Unicode) characters, the result may not be called a ‘string’, and some of the things in the not-string may not be called ‘characters’.
I agree.