From mboxrd@z Thu Jan 1 00:00:00 1970
Newsgroups: gmane.lisp.guile.devel
Subject: Re: Improving the handling of system data (env, users, paths, ...)
Date: Sun, 7 Jul 2024 06:59:05 +0200
References: <878qyeqn1q.fsf@trouble.defaultvalue.org>
In-Reply-To: <878qyeqn1q.fsf@trouble.defaultvalue.org>
To: Rob Browning
Cc: guile-devel@gnu.org
List-Id: "Developers list for Guile, the GNU extensibility library"
Xref: news.gmane.io gmane.lisp.guile.devel:22549
Mime-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
	protocol="application/pgp-signature"; boundary="3JhvTUt1wo2F841w"
Content-Disposition: inline

--3JhvTUt1wo2F841w
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Sat, Jul 06, 2024 at 03:32:17PM -0500, Rob Browning wrote:
>
> * Problem
>
> System data like environment variables, user names, group names, file
> paths and extended attributes (xattr), etc. are on some systems (like
> Linux) binary data, and may not be encodable as a string in the current
> locale.

Since this might get lost in the ensuing discussion, yes: in Linux (and
relatives) file names are byte arrays, not strings.

> It's perhaps worth noting, that while typically unlikely, any given
> directory could contain paths in an arbitrary collection of encodings:

Exactly: it's the creating process's locale that calls the shots. So if
you are in a multi-locale environment (e.g. users with different
encodings) this will happen.

> At a minimum, I suggest Guile should produce an error by default
> (instead of generating incorrect data) when the system bytes cannot be
> encoded in the current locale.

Yes, perhaps.

[iso-8859-1]

> There are disadvantages to this approach, but it's a fairly easy
> improvement.

I'm not a fan of this one: watching Emacs's development, people end up
using Latin-1 as a poor substitute for "byte array" :-)

> The most direct (and compact, if we do convert to UTF-8) representation
> would be bytevectors, but then you would have a much more limited set of
> operations available (i.e. strings have all of srfi-13, srfi-14, etc.)
> unless we expanded them (likely re-using the existing code paths). Of
> course you could still convert to Latin-1, perform the operation, and
> convert back, but that's not ideal.

Bytevectors would be the right representation: users would have to deal
with explicit conversions from/to strings, so they would see the issues
as they happen. But alas, you are right: it's very inconvenient.

> Finally, while I'm not sure how I feel about it, one notable precedent
> is Python's "surrogateescape" approach[5], which shifts any unencodable
> bytes into "lone Unicode surrogates", a process which can (and of course
> must) be safely reversed before handing the data back to the system. It
> has its own trade-offs/(security)-concerns, as mentioned in the PEP.

FWIW, that's more or less what Emacs's internal encoding does: it is
roughly UTF-8, but reserves some code points for odd bytes (which it
then displays as backslash sequences).
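
For illustration only, here is a minimal Python sketch of the
"surrogateescape" round trip mentioned in the quoted text. The file name
is made up (valid UTF-8 plus one stray Latin-1 byte), and the variable
names are hypothetical. It is not a proposal for Guile's API, just a
picture of the mechanism and of one of its sharp edges.

```python
# Hypothetical file name: "caf\xc3\xa9" is valid UTF-8, the lone \xe9 is not.
raw_name = b"caf\xc3\xa9-\xe9.txt"

# Undecodable bytes are mapped to lone surrogates in U+DC80..U+DCFF.
as_text = raw_name.decode("utf-8", errors="surrogateescape")

# Ordinary string operations work on the escaped result.
has_txt_suffix = as_text.endswith(".txt")

# Encoding with the same error handler restores the original bytes exactly.
round_tripped = as_text.encode("utf-8", errors="surrogateescape")

# Sharp edge: a strict encoder refuses the escaped string, so any code
# path that forgets the error handler fails on such names.
try:
    as_text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

On POSIX systems Python's own os.fsdecode() and os.fsencode() apply this
error handler with the file system encoding, which is why such names
survive a trip through str there.
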
Emacs's approach is round-trip safe, but has its own set of sharp edges,
and naive [1] users get caught in them from time to time.

What's my point? Basically, that we shouldn't try to get it 100% right:
there is possibly no way to do so, and we would pile up a lot of
complexity which is very difficult to get rid of (most languages have
their painful transitions to tell stories about).

I think it's OK to try some guesswork to make users' lives easier, but
it is perhaps better to (by default) fail noisily at the least suspicion
than to carry on happily with wrong results.

Guessing UTF-8 seems a safe bet: for one, everybody (except JavaScript)
is moving in that direction; for another, you notice quickly when the
data isn't UTF-8 (as opposed to ISO-8859-x, which will trundle along,
producing funny content).

Cheers
-- 
t

--3JhvTUt1wo2F841w
Content-Type: application/pgp-signature; name="signature.asc"

-----BEGIN PGP SIGNATURE-----

iF0EABECAB0WIQRp53liolZD6iXhAoIFyCz1etHaRgUCZoogkQAKCRAFyCz1etHa
RpsLAJwNM8VL2JLtX2/of8GSmsS998YxPwCeI74nEiiFI9JIyYm4TdbRKgb+w9Q=
=XXR0
-----END PGP SIGNATURE-----

--3JhvTUt1wo2F841w--
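
The "guess UTF-8 and fail noisily" default argued for in the message can
be pictured with a small Python sketch. The helper name
decode_system_bytes and the sample names are hypothetical, chosen only
for illustration. The sketch contrasts a wrong UTF-8 guess, which raises
immediately, with a wrong ISO-8859-1 guess, which succeeds and silently
produces garbage.

```python
def decode_system_bytes(raw: bytes) -> str:
    """Hypothetical helper: guess UTF-8 and refuse anything else."""
    try:
        return raw.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError as err:
        raise ValueError(f"not valid UTF-8: {raw!r}") from err


# A name written by a process running in a Latin-1 locale.
latin1_name = "caf\u00e9".encode("iso-8859-1")  # b'caf\xe9'

# Guessing UTF-8: the wrong guess is noticed right away.
try:
    decode_system_bytes(latin1_name)
    noticed = False
except ValueError:
    noticed = True

# Guessing ISO-8859-1 for UTF-8 data never fails, because every byte is
# a valid Latin-1 character. The result is simply wrong.
utf8_name = "caf\u00e9".encode("utf-8")  # b'caf\xc3\xa9'
mojibake = utf8_name.decode("iso-8859-1")
```

Here mojibake ends in two characters where the original name had one,
and nothing signals the mistake. That silent corruption is the "funny
content" the message warns about.
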