From: Eli Zaretskii
Newsgroups: gmane.lisp.guile.devel
Subject: Re: Improving the handling of system data (env, users, paths, ...)
Date: Sun, 07 Jul 2024 14:04:37 +0300
Message-ID: <865xth31kq.fsf@gnu.org>
In-Reply-To: <9985c529ffbbabaa259ee62226ced1feec8c7810.camel@abou-samra.fr>
References: <878qyeqn1q.fsf@trouble.defaultvalue.org> <86jzhx3gxe.fsf@gnu.org> <9985c529ffbbabaa259ee62226ced1feec8c7810.camel@abou-samra.fr>
To: Jean Abou Samra
Cc: rlb@defaultvalue.org, guile-devel@gnu.org

> From: Jean Abou Samra
> Cc: guile-devel@gnu.org
> Date: Sun, 07 Jul 2024 12:03:06 +0200
>
> On Sunday, 07 July 2024 at 08:33 +0300, Eli Zaretskii wrote:
> >
> >     - The internal representation is a superset of UTF-8, in that it
> >       is capable of representing characters for which there are no
> >       Unicode codepoints (such as GB 18030, some of whose characters
> >       don't have Unicode counterparts; and raw bytes, used to
> >       represent byte sequences that cannot be decoded).  It uses
> >       5-byte UTF-8-like sequences for these extensions.
>
>
> Guile is a Scheme implementation, bound by Scheme standards and compatibility
> with other Scheme implementations (and backwards compatibility too).

Yes, I understand that.

> I just tried (aref (cadr command-line-args) 0) in a lisp-interaction-mode
> Emacs buffer after launching "emacs $'\xb5'". It gave 4194229 = 0x3fffb5,
> which quite logically is outside the Unicode code point range 0 - 0x110000.

That's not how you get a raw byte from a multibyte string in Emacs.
IOW, your code is wrong, if what you wanted was to get the 0xb5 byte.
I guess you assumed something about 'aref' in Emacs that is not true
of multibyte strings that include raw bytes.  So what you got instead
is the internal Emacs "codepoint" for that raw byte; these codepoints
are in the 0x3fff00..0x3fffff range.

Note that (cadr command-line-args), for example, yields "\265", as
expected.  That is, in situations where the caller's intent is clear,
Emacs converts back to a single byte automatically.  That's part of
the heuristics that took us some releases to get right.

> This doesn't work for Guile, since a character is a Unicode code point
> in the Scheme semantics.

See above: the problem doesn't exist if one uses the correct APIs.

> >     - Emacs has its own code for code-conversion, for moving by
> >       characters through multibyte sequences, for producing a Unicode
> >       codepoint from a byte sequence in the super-UTF-8 representation
> >       and back, etc., so it doesn't use libc routines for that, and
> >       thus doesn't depend on the current locale for these operations.
>
> Guile's encoding conversions don't rely on the libc locale. They use
> GNU libiconv.

That's okay, but what about other APIs, like conversion between
characters and their multibyte representations, returning the length
of a string in characters, etc.?  AFAIK, libiconv doesn't provide
these facilities.

> >     - Emacs also has tables of Unicode attributes of characters
> >       (produced by parsing the relevant Unicode data files at build
> >       time), so it can up/down-case characters, determine their
> >       category (letters, digits, punctuation, etc.) and script to
> >       which they belong, etc. -- all with its own code, independent of
> >       the underlying libc.
>
> Also exists, and AFAICT uses GNU libunistring. See string-upcase,
> char-general-category, etc.

Fine, then it should be easier for Guile than I maybe thought to adopt
the same scheme.
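
To make the numbers in the raw-byte discussion above concrete, here is a small sketch of the arithmetic only.  It is not Emacs code and does not use any Emacs API: the function names are hypothetical, and the range bounds are simply the ones quoted in this message.  It shows that 4194229 is 0x3fffb5, that it lies above the last Unicode code point (0x10FFFF), and that subtracting the base of the quoted range gives back the 0xb5 byte, which is "\265" in octal.

```python
# Hypothetical model of the numbers discussed in this message.
# This is NOT Emacs's API; names and structure are assumptions for illustration.

RAW_BYTE_BASE = 0x3FFF00   # start of the raw-byte range quoted in the message
RAW_BYTE_LAST = 0x3FFFFF   # end of the raw-byte range quoted in the message
UNICODE_MAX = 0x10FFFF     # highest valid Unicode code point


def is_raw_byte_codepoint(c: int) -> bool:
    """Return True if c falls in the internal raw-byte range quoted above."""
    return RAW_BYTE_BASE <= c <= RAW_BYTE_LAST


def raw_byte_of(c: int) -> int:
    """Map an internal raw-byte codepoint back to the single byte it stands for."""
    if not is_raw_byte_codepoint(c):
        raise ValueError(f"{c:#x} is not in the raw-byte range")
    return c - RAW_BYTE_BASE


if __name__ == "__main__":
    c = 4194229                      # value reported for the aref call
    print(hex(c))                    # the same value in hexadecimal
    print(c > UNICODE_MAX)           # outside the Unicode range
    byte = raw_byte_of(c)
    print(hex(byte), format(byte, "o"))  # the byte in hex and in octal
```

The point of the sketch is only that the value returned by 'aref' and the original byte differ by a fixed offset, so nothing is lost; which Emacs function performs that conversion is a separate question not covered here.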