From: Rob Browning
To: Jean Abou Samra, guile-devel@gnu.org
Newsgroups: gmane.lisp.guile.devel
Subject: Re: Improving the handling of system data (env, users, paths, ...)
Date: Sun, 07 Jul 2024 14:25:06 -0500
Message-ID: <87y16d0zu5.fsf@trouble.defaultvalue.org>
In-Reply-To: <1058bc6ec96dfe613eb6f5879cea585213f9d243.camel@abou-samra.fr>
References: <878qyeqn1q.fsf@trouble.defaultvalue.org> <1058bc6ec96dfe613eb6f5879cea585213f9d243.camel@abou-samra.fr>

Jean Abou Samra writes:

> latin1 locale is a terrible default. Virtually no Linux system these days
> has a locale encoding different than UTF-8. Except perhaps for the "C" locale,
> which people still use by habit with "LC_ALL=C" as a way to say "speak English
> please", although most Linux distros have a C.UTF-8 locale these days.

Given this thread, it might have been good if I'd included a few other
bits of context in my original post.

- Personally, as someone who spends a lot of time on a tool that's more
  like tar/cp/rsync/etc. (and I suspect this sentiment applies to
  anyone doing something similar), I'd be happier without "help",
  i.e. at a minimum, I'd prefer solid bytevector support, and then
  I'll handle any conversions when needed. But I was trying to propose
  something incremental that comports with previous (off-list)
  discussions, i.e. something that might be acceptable in the near to
  medium term.

  In truth, for system tools, I have no interest in "strings" most of
  the time, and would rather not pay anything for them (imagine
  regularly processing a few hundred million filesystem paths), and if
  I *do* care (say for regular-expression based exclusions), then "OK,
  first you have to tell us where the paths came from", i.e. we have
  no way of knowing what the encodings are, other than guessing.

  That said, I'd be more than happy to have *help*, e.g. bytevector
  variants of various srfi-13/srfi-14 functions, and/or (as I think
  was suggested elsewhere in the thread) maybe even some hybrid type
  with additional conveniences (if that were to make sense).

  Further, you could imagine having more specific types like the
  "path" type many languages have, depending on what your
  cross-platform goals are, since paths aren't "just bytes"
  everywhere, something which even varies in Linux per-filesystem type
  -- but I didn't consider any of that "in scope" for now.

- Using Latin-1 is, of course, a hack; a pragmatic hack, but a hack
  (it wasn't even my suggestion, originally). Choosing that "for now"
  would just be trying to take advantage of the facts that it's likely
  to pass through without corruption, and still allows easier
  manipulation via the existing string APIs for some common, important
  cases, i.e. where you can still get the job done while only
  referring to the ASCII bits (split/join on "/", for example), but
  no, it's not ideal.

  It also intends to avoid having to decide, and to do, anything
  further (in the short term) regarding the *many* existing relevant
  system calls. You can just call them as-is with a temporarily
  adjusted locale.

- I have no idea where Guile might eventually end up, but given
  current resources, it seemed likely that what's potentially in scope
  for now is "incremental".

I'll also say that the broader discussion is interesting, and I do like
to better understand how other systems work.
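
As a rough illustration of the Latin-1 pass-through point above (a
sketch only, not Guile code and not something proposed in this thread):
Latin-1 maps each byte to the code point with the same value, so
arbitrary path bytes survive a decode/encode round trip, and ASCII-only
operations such as split/join on "/" still work in between. Python is
used here purely because it is easy to run; the sample path and the
helper name decodes_as_utf8 are made up for the example.

```python
# Hypothetical path bytes that are NOT valid UTF-8
# (0xff and 0xfe never occur in well-formed UTF-8).
raw_path = b"/srv/data/\xff\xfe-backup/file.txt"


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 (illustrative helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Latin-1 decoding cannot fail: every byte 0x00-0xff has a code point.
as_text = raw_path.decode("latin-1")

# String operations that only inspect ASCII characters remain usable.
components = as_text.split("/")
rejoined = "/".join(components[:-1] + ["renamed.txt"])

# Encoding back yields the original bytes, untouched bytes included.
round_tripped = as_text.encode("latin-1")
new_path = rejoined.encode("latin-1")

print(decodes_as_utf8(raw_path))   # strict UTF-8 decoding rejects these bytes
print(round_tripped == raw_path)   # the Latin-1 round trip is lossless
print(new_path)
```

The price is the one the bullet already concedes: every operation
involves a decode and an encode, and the intermediate "string" is only
meaningful for its ASCII characters.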

> On Saturday, 6 July 2024 at 15:32 -0500, Rob Browning wrote:
>
>> The most direct (and compact, if we do convert to UTF-8) representation
>> would be bytevectors, but then you would have a much more limited set of
>> operations available (i.e. strings have all of srfi-13, srfi-14, etc.)
>> unless we expanded them (likely re-using the existing code paths). Of
>> course you could still convert to Latin-1, perform the operation, and
>> convert back, but that's not ideal.

> Why is that "not ideal"? The (ice-9 iconv) API is convenient, locale-independent
> and thread-safe.

I meant that round-tripping through Latin-1 every time you want to
call, say, string-split on "/" isn't ideal as compared to a
bytevector-friendly splitter. And if we do switch to UTF-8 internally,
it'll also require copying/converting the bytes, since non-ASCII bytes
become multibyte.

(Given the UTF-8 work, I've also speculated about the fact that we
could probably re-use many, if not all, of the optimized "ascii paths"
that I've included in the various functions there (srfi-13, srfi-14,
etc.), to implement bytevector-friendly variants without much
additional work.)

Thanks
-- 
Rob Browning
rlb @defaultvalue.org and @debian.org
GPG as of 2011-07-10 E6A9 DA3C C9FD 1FF8 C676 D2C4 C0F0 39E9 ED1B 597A
GPG as of 2002-11-03 14DD 432F AE39 534D B592 F9A0 25C8 D377 8C7E 73A4