From: Maxime Devos
Newsgroups: gmane.lisp.guile.devel
Subject: RE: Improving the handling of system data (env, users, paths, ...)
Date: Sun, 7 Jul 2024 13:35:27 +0200
Message-ID: <20240707133527.kbbT2C0064hwdlW01bbTq5@baptiste.telenet-ops.be>
References: <878qyeqn1q.fsf@trouble.defaultvalue.org> <86jzhx3gxe.fsf@gnu.org> <9985c529ffbbabaa259ee62226ced1feec8c7810.camel@abou-samra.fr> <865xth31kq.fsf@gnu.org>
In-Reply-To: <865xth31kq.fsf@gnu.org>
To: Eli Zaretskii, Jean Abou Samra
Cc: rlb@defaultvalue.org, guile-devel@gnu.org
List-Id: "Developers list for Guile, the GNU extensibility library"

From: Eli Zaretskii
Sent: Sunday, 7 July 2024 13:05
To: Jean Abou Samra
Cc: rlb@defaultvalue.org; guile-devel@gnu.org
Subject: Re: Improving the handling of system data (env, users, paths, ...)

> From: Jean Abou Samra
> Cc: guile-devel@gnu.org
> Date: Sun, 07 Jul 2024 12:03:06 +0200
>
> On Sunday, 7 July 2024 at 08:33 +0300, Eli Zaretskii wrote:
> >
> >     - The internal representation is a superset of UTF-8, in that it
> >       is capable of representing characters for which there are no
> >       Unicode codepoints (such as GB 18030, some of whose characters
> >       don't have Unicode counterparts; and raw bytes, used to
> >       represent byte sequences that cannot be decoded).  It uses
> >       5-byte UTF-8-like sequences for these extensions.
>
>
>> Guile is a Scheme implementation, bound by Scheme standards and compatibility
>> with other Scheme implementations (and backwards compatibility too).
>
>Yes, I understand that.

Going by what you are saying below, I think you don’t.

>> I just tried (aref (cadr command-line-args) 0) in a lisp-interaction-mode
>> Emacs buffer after launching "emacs $'\xb5'". It gave 4194229 = 0x3fffb5,
>> which quite logically is outside the Unicode code point range 0 - 0x110000.
>
>That's not how you get a raw byte from a multibyte string in Emacs.
>IOW, your code is wrong, if what you wanted was to get the 0xb5 byte.
>I guess you assumed something about 'aref' in Emacs that is not true
>with multibyte strings that include raw bytes.  So what you got
>instead is the internal Emacs "codepoint" for raw bytes, which are in
>the 0x3fff00..0x3fffff range.

I’m pretty sure that they weren’t intending to get the 0xb5 byte.
Rather, they were using the equivalent of ‘string-ref’ (i.e., ‘aref’)
and demonstrating that the result is bogus in Scheme.
In Scheme, ‘(string-ref ...)’ needs to return a character, and there
exists no (Unicode) character with codepoint 4194229, so what Emacs
returns here would be bogus for (Guile) Scheme.

From the Emacs manual:

>For example, you can access individual characters in a string using
>the function aref (see Functions that Operate on Arrays).

Thus, (aref the-string index) is the equivalent of
(string-ref the-string index). I do not see any indication that they
were trying to extract the byte itself; rather, they were extracting
the _character_ corresponding to the byte, and demonstrating that this
‘character’ is, in fact, not a character in Scheme (in other words, no
such character exists in Scheme).

>> This doesn't work for Guile, since a character is a Unicode code point
>> in the Scheme semantics.
>
>See above: the problem doesn't exist if one uses the correct APIs.

AFAICT, there are no correct APIs. Fundamentally (whether for
compatibility or by choice), characters in (Guile) Scheme are _Unicode_
characters and (Scheme) strings consist of _only_ such _Unicode
characters_. Elisp strings, by contrast, contain more than that,
whether characters from Emacs’ extended set or a mixture of Unicode
and raw bytes. In both cases, the Elisp APIs that would return
characters return things that aren’t _Unicode_ characters, and hence
aren’t appropriate APIs for Guile.

This doesn’t mean that Emacs’ model can’t be adopted – rather, it could
perhaps be partially adopted, but whenever the resulting ‘string’
contains things that aren’t (Unicode) characters, the result may not be
called a ‘string’, and some of the things in the not-string may not be
called ‘characters’.
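
To make the point about 4194229 concrete, here is a minimal sketch,
assuming a reasonably recent Guile (2.x or 3.x). The helper
unicode-scalar-value? and the variable emacs-raw-byte-code are names
made up for this illustration; they are not part of Guile or Emacs.

```scheme
;; The value Emacs reported for the raw byte 0xb5 (from the quoted mail).
(define emacs-raw-byte-code #x3fffb5)   ; = 4194229

;; Hypothetical helper (not a Guile API): is N a Unicode scalar value,
;; i.e. an integer that can correspond to a Scheme character?
(define (unicode-scalar-value? n)
  (and (integer? n)
       (exact? n)
       (or (<= 0 n #xD7FF)            ; below the surrogate range
           (<= #xE000 n #x10FFFF))))  ; above it, up to the Unicode maximum

(unicode-scalar-value? #xb5)                 ; U+00B5 MICRO SIGN qualifies
(unicode-scalar-value? emacs-raw-byte-code)  ; the Emacs value does not

;; string-ref yields a character object, so a string element can always
;; be mapped back to a code point with char->integer.
(define micro-string (string (integer->char #xb5)))
(char->integer (string-ref micro-string 0))

;; Asking Guile for a character with the Emacs value is expected to
;; signal an error; the handler below only turns that into a symbol.
(catch #t
  (lambda () (integer->char emacs-raw-byte-code))
  (lambda (key . args) 'not-a-character))
```

The range check is only there to show why the Emacs value cannot be a
Scheme character; real code would normally just rely on integer->char
rejecting it.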

> >     - Emacs has its own code for code-conversion, for moving by
> >       characters through multibyte sequences, for producing a Unicode
> >       codepoint from a byte sequence in the super-UTF-8 representation
> >       and back, etc., so it doesn't use libc routines for that, and
> >       thus doesn't depend on the current locale for these operations.
>
> Guile's encoding conversions don't rely on the libc locale. They use
> GNU libiconv.
>
>That's okay, but what about other APIs, like conversion between
>characters and their multibyte representations,

This is not some _other_ API; it is precisely the (ice-9 iconv) API.
See string->bytevector and bytevector->string. (You need to turn the
single character into a one-character string first, but that is
trivial: simply do (string [insert-character-here]).)

> returning the length of a string in characters, etc.?  AFAIK, libiconv
> doesn't provide these facilities.

This is a basic string API: just use string-length, like in (all?)
Schemes. In Scheme, strings consist of characters, so string-length
returns the length of a string in characters.

Best regards,
Maxime Devos.
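
As an illustration of the two replies above, here is a minimal sketch,
assuming Guile with the (ice-9 iconv) and (rnrs bytevectors) modules
available. The variable names (micro, utf8-bytes, latin1-bytes,
round-trip) are made up for this example.

```scheme
(use-modules (ice-9 iconv)        ; string->bytevector, bytevector->string
             (rnrs bytevectors))  ; bytevector->u8-list, bytevector-length

;; Character -> multibyte representation: wrap the character in a
;; one-character string, then encode that string.
(define micro (integer->char #xb5))   ; U+00B5 MICRO SIGN
(define utf8-bytes (string->bytevector (string micro) "UTF-8"))
(define latin1-bytes (string->bytevector (string micro) "ISO-8859-1"))

(bytevector->u8-list utf8-bytes)      ; => (194 181), i.e. 0xC2 0xB5
(bytevector->u8-list latin1-bytes)    ; => (181), i.e. 0xB5

;; Multibyte representation -> string, decoding with the same encoding.
(define round-trip (bytevector->string utf8-bytes "UTF-8"))

;; string-length counts characters; bytevector-length counts bytes.
(string-length round-trip)            ; => 1
(bytevector-length utf8-bytes)        ; => 2
```

The encoding is passed explicitly in every call, so nothing in this
sketch consults the current locale.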
