From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.io!.POSTED.blaine.gmane.org!not-for-mail
From: Maxime Devos <maximedevos@telenet.be>
Newsgroups: gmane.lisp.guile.devel
Subject: RE: Improving the handling of system data (env, users, paths, ...)
Date: Sun, 7 Jul 2024 12:24:25 +0200
Message-ID: <20240707122425.kaQQ2C00E4hwdlW06aQRe0@michel.telenet-ops.be>
References: <878qyeqn1q.fsf@trouble.defaultvalue.org>
Mime-Version: 1.0
Content-Type: multipart/alternative;
 boundary="_B9E8F1CD-F083-4C6D-BF54-03DDBAC444BB_"
Injection-Info: ciao.gmane.io; posting-host="blaine.gmane.org:116.202.254.214";
	logging-data="11468"; mail-complaints-to="usenet@ciao.gmane.io"
To: Rob Browning <rlb@defaultvalue.org>, 
 "guile-devel@gnu.org" <guile-devel@gnu.org>
Original-X-From: guile-devel-bounces+guile-devel=m.gmane-mx.org@gnu.org Sun Jul 07 12:30:14 2024
Return-path: <guile-devel-bounces+guile-devel=m.gmane-mx.org@gnu.org>
Envelope-to: guile-devel@m.gmane-mx.org
Original-Received: from lists.gnu.org ([209.51.188.17])
	by ciao.gmane.io with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
	(Exim 4.92)
	(envelope-from <guile-devel-bounces+guile-devel=m.gmane-mx.org@gnu.org>)
	id 1sQP9p-0002m3-EV
	for guile-devel@m.gmane-mx.org; Sun, 07 Jul 2024 12:30:13 +0200
Original-Received: from localhost ([::1] helo=lists1p.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.90_1)
	(envelope-from <guile-devel-bounces@gnu.org>)
	id 1sQP9J-0006Ol-0c; Sun, 07 Jul 2024 06:29:41 -0400
Original-Received: from eggs.gnu.org ([2001:470:142:3::10])
 by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
 (Exim 4.90_1) (envelope-from <maximedevos@telenet.be>)
 id 1sQP9H-0006OL-1Z
 for guile-devel@gnu.org; Sun, 07 Jul 2024 06:29:39 -0400
Original-Received: from michel.telenet-ops.be ([2a02:1800:110:4::f00:18])
 by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128)
 (Exim 4.90_1) (envelope-from <maximedevos@telenet.be>)
 id 1sQP9C-0005Pi-0N
 for guile-devel@gnu.org; Sun, 07 Jul 2024 06:29:38 -0400
Original-Received: from [IPv6:2a02:1811:8c0e:ef00:95f6:12f6:aa85:7dcc]
 ([IPv6:2a02:1811:8c0e:ef00:95f6:12f6:aa85:7dcc])
 by michel.telenet-ops.be with bizsmtp
 id kaQQ2C00E4hwdlW06aQRe0; Sun, 07 Jul 2024 12:24:25 +0200
Importance: normal
X-Priority: 3
In-Reply-To: <878qyeqn1q.fsf@trouble.defaultvalue.org>
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=telenet.be; s=r24;
 t=1720347865; bh=/9xi7Kgmbe+45pVLo7sGxDH9+dj5wVqX4aqHNP3rebQ=;
 h=To:From:Subject:Date:In-Reply-To:References;
 b=anH94094W3OQktzxDUR5/lHYkfv1r0r/YVhMYdJ9tEBaCI0Wf170abawSfmSl/eby
 ZLjdjP5fTkZjCfOI2ql81+Oz+TgHZccrcuhsnUvrPLvWn9aTzsaUvNY4IPJpCRsHAJ
 mCnZDDzBbfRFi4DgdyK94VNgFJUaGZOB21+sz8Zh9+pOrcKpGCiKlBhiIBTxYma3ar
 RjqwSm1a6qGg88fuwlSUILZPozmgHF9ErAhhtQv+ZeppykK23aZtDRPtlkwwbKSM6A
 TDUfMmiAtX7e9yd4Z8QTHlSpQoDLz/XaABQYbMUKrlsZfELINBnuY+F9RzVvMCyt+f
 Ymb58DUX+km9g==
Received-SPF: pass client-ip=2a02:1800:110:4::f00:18;
 envelope-from=maximedevos@telenet.be; helo=michel.telenet-ops.be
X-Spam_score_int: -27
X-Spam_score: -2.8
X-Spam_bar: --
X-Spam_report: (-2.8 / 5.0 requ) BAYES_00=-1.9, DKIM_SIGNED=0.1,
 DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, FREEMAIL_FROM=0.001,
 HTML_MESSAGE=0.001, RCVD_IN_DNSWL_LOW=-0.7, SPF_HELO_PASS=-0.001,
 SPF_PASS=-0.001 autolearn=ham autolearn_force=no
X-Spam_action: no action
X-BeenThere: guile-devel@gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: "Developers list for Guile,
 the GNU extensibility library" <guile-devel.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/guile-devel>,
 <mailto:guile-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <https://lists.gnu.org/archive/html/guile-devel>
List-Post: <mailto:guile-devel@gnu.org>
List-Help: <mailto:guile-devel-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/guile-devel>,
 <mailto:guile-devel-request@gnu.org?subject=subscribe>
Errors-To: guile-devel-bounces+guile-devel=m.gmane-mx.org@gnu.org
Original-Sender: guile-devel-bounces+guile-devel=m.gmane-mx.org@gnu.org
Xref: news.gmane.io gmane.lisp.guile.devel:22553
Archived-At: <http://permalink.gmane.org/gmane.lisp.guile.devel/22553>

--_B9E8F1CD-F083-4C6D-BF54-03DDBAC444BB_
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

>* Problem
>
>System data like environment variables, user names, group names, file
>paths and extended attributes (xattr), etc. are on some systems (like
>Linux) binary data, and may not be encodable as a string in the current
>locale.  For Linux, as an example, only the null character is an invalid
>user/group/filename byte, while for UTF-8, a much smaller set of bytes
>are valid[1].
>[...]
>You end up with a question mark instead of the correct value.  This
>makes it difficult to write programs that don't risk silent corruption
>unless all the relevant system data is known to be compatible with the
>user's current locale.

>It's perhaps worth noting, that while typically unlikely, any given
>directory could contain paths in an arbitrary collection of encodings:
>UTF-8, SHIFT-JIS, Latin-1, etc., and so if you really want to try to
>handle them as strings (maybe you want to correctly upcase/downcase
>them), you have to know (somehow) the encoding that applies to each one.
>Otherwise, in the limiting case, you can only assume "bytes".

>* Improvements

>At a minimum, I suggest Guile should produce an error by default
>(instead of generating incorrect data) when the system bytes cannot be
>encoded in the current locale.

I totally agree on this.

>There should also be some straightforward, thread-safe way to write code
>that accesses and manipulates system data efficiently and without
>corruption.

>As an incremental step, and as has been discussed elsewhere a bit, we
>might add support for uselocale()[2] and then document that the current
>recommendation is to always use ISO-8859-1 (i.e. Latin-1)[3] for system
>data unless you're certain your program doesn't need to be general
>purpose (perhaps you're sure you only care about UTF-8 systems).

I=E2=80=99d rather not. It=E2=80=99s rather stateful and hence non-trivial =
to compose.
Also, locale is not only about the encoding of text [file name/env encoding=
s/xattr/...],
but also about language. Also setting the language is excessive in this cas=
e.

>A program intended to work everywhere might then do something like
>this:

>   ...
>      #:use-module ((guile locale)
>                   #:select (iso-8859-1 with-locale))
>    ...
>
>    (define (environment name)
>      (with-locale iso-8859-1 (getenv name)))

This, OTOH, seems a bit better =E2=80=93 =E2=80=98with-locale=E2=80=99 is l=
ike =E2=80=98parameterize=E2=80=99 and hence pretty composable.
However, it still suffers from the problem that it sets too much (also, th=
ere is no such thing as the =E2=80=9Ciso-8859-1=E2=80=9D locale?).

Instead, I would propose something like:

;; [todo: add validation]
;; if #false, default to what is implied by the locale
(define system-encoding (make-parameter #false))
;; if #false, default to system-encoding
(define file-name-encoding (make-parameter #false))
[...]

;; let=E2=80=99s say that for some reason, we know the file names have this=
 encoding,
;; but we don=E2=80=99t have information on other things so we leave the de=
cision
;; on other encodings to the caller.
(define (some-proc)
  (parameterize ((file-name-encoding "UTF-8"))
    [open some file and do stuff with it]))

This also has the advantage of separating the different things a bit =E2=80=
=93 I can imagine a multi-user system where the usernames are encoded diffe=
rently from the file names in the user home directory (not an unsurmountabl=
e problem for =E2=80=98with-locale=E2=80=99, but this seems a bit more stra=
ightforward to use when external libraries are involved).

(I=E2=80=99m not too sure about this splitting of parameter objects)
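
A rough sketch of how the fallback between these parameters could be
resolved. Nothing here exists in Guile today: the procedure names and the
'locale-default-encoding' placeholder are made up for illustration.

```scheme
;; Hypothetical parameters, as proposed above.
;; #false means "defer to the next, more general setting".
(define system-encoding (make-parameter #false))
(define file-name-encoding (make-parameter #false))

;; Placeholder for whatever the current locale implies.
;; A real implementation would ask the locale instead of hardcoding.
(define (locale-default-encoding) "UTF-8")

;; The most specific parameter that is set wins.
(define (effective-system-encoding)
  (or (system-encoding) (locale-default-encoding)))

(define (effective-file-name-encoding)
  (or (file-name-encoding) (effective-system-encoding)))

;; A caller that only knows about file names overrides just that one.
(define (demo)
  (parameterize ((system-encoding "ISO-8859-1"))
    (list (effective-file-name-encoding)
          (parameterize ((file-name-encoding "UTF-8"))
            (effective-file-name-encoding)))))
```

With these definitions, (demo) gives ("ISO-8859-1" "UTF-8"): the inner
'parameterize' only affects file names, and everything else keeps
following 'system-encoding'.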

>There are disadvantages to this approach, but it's a fairly easy
>improvement.

>Some potential disadvantages:

>  - In cases where the system data was actually UTF-8, non-ASCII
>    characters will be displayed "completely wrong", i.e. mapped to
>    "random" other characters according to the Latin-1 correspondences.

This is why I wouldn=E2=80=99t recommend always using ISO-8859-1 by default.
The situation where the encodings of things are different is the exception
(and a historical artifact of pre-UTF-8), not the norm.

I think changing the =E2=80=98?=E2=80=99 into =E2=80=98throw an exception=
=E2=80=99, and providing an _option_ (i.e. temporarily changing the locale =
to ISO-8859-1) to also support this historical artifact is sufficient.

>  - You have to pay whatever cost is involved in switching locales, and
>    in encoding/decoding the bytes, even if you only care about the
>    bytes.

IIRC, in ISO-8859-1 there is a direct correspondence between bytes and cha=
racters
(and Guile recognises this), so there is no cost beyond mere copying.
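
A small check of that correspondence, using (ice-9 iconv). This is only
a sketch; 'all-bytes' and 'round-trip' are names chosen for the example.

```scheme
(use-modules (ice-9 iconv) (rnrs bytevectors))

;; Every byte value except NUL, which cannot occur in file names anyway.
(define all-bytes (u8-list->bytevector (cdr (iota 256))))

;; Decode as ISO-8859-1, then encode again.
(define (round-trip bv)
  (string->bytevector (bytevector->string bv "ISO-8859-1")
                      "ISO-8859-1"))

;; Each byte maps to the character with the same code point, so the
;; string has one character per byte and nothing is lost.
(define decoded (bytevector->string all-bytes "ISO-8859-1"))
(define lossless? (equal? all-bytes (round-trip all-bytes)))
(define same-length?
  (eqv? (string-length decoded) (bytevector-length all-bytes)))
```

Both 'lossless?' and 'same-length?' should be #t, which is what makes
Latin-1 usable as a byte-preserving escape hatch.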

>  - If any manipulations of the string representing the system data end
>    up performing Unicode canonicalizations or normalizations, the data
>    could still be corrupted.  I don't *think* Guile itself ever does
>    that implicitly.

Pretty sure it doesn=E2=80=99t.

>  - Less importantly, if we switch the internal string representation to
>    UTF-8 (proposed[4]), then non-ASCII bytes in the data will require
>    two bytes in memory.

>The most direct (and compact, if we do convert to UTF-8) representation
>would bytevectors, but then you would have a much more limited set of
>operations available (i.e. strings have all of srfi-13, srfi-14, etc.)
>unless we expanded them (likely re-using the existing code paths).  Of
>course you could still convert to Latin-1, perform the operation, and
>convert back, but that's not ideal.

>Finally, while I'm not sure how I feel about it, one notable precedent
>is Python's "surrogateescape" approach[5], which shifts any unencodable
>bytes into "lone Unicode surrogates", a process which can (and of course
>must) be safely reversed before handing the data back to the system.  It
>has its own trade-offs/(security)-concerns, as mentioned in the PEP.

IIRC, surrogates are code points but not characters (they are not Unicode s=
calar values). As a consequence, strings would contain non-characters, and =
(char? (string-ref s index)) might be #false. I=E2=80=99d rather not; such =
an object does not sound like a string to me.

Here is an alternative solution:

1. Define a new object type =E2=80=98<unencoded-string>=E2=80=99 (wrapping =
a bytevector). This represents things that are _conceptually_ a string inst=
ead of a mere sequence of bytes, but we don=E2=80=99t know the actual encod=
ing so we can=E2=80=99t let it be a string.
2. Also define a bunch of procedures for converting between bytes, unencode=
d-strings and strings. Also, a =E2=80=98string-like?=E2=80=99 predicate tha=
t includes both =E2=80=98<string>=E2=80=99 and =E2=80=98<unencoded-string>=
=E2=80=99.
3. Procedures like =E2=80=98open-file=E2=80=99 etc. are extended to support=
 <unencoded-string>.
4. Maybe do the same for SRFI-N stuff (maybe as part of (srfi srfi-N gnu) e=
xtensions).
(I don=E2=80=99t know if (string-append unencoded encoded) should be suppor=
ted.)
5. When a procedure would return a filename, it first looks at some paramet=
er objects. These parameter objects determine what the encoding is, what t=
o do when it is not valid according to the encoding (approximate via ? and =
the like, throw an exception, or return an <unencoded-string>) =E2=80=93 or=
 even return an <unencoded-string> unconditionally.
6. Also do the same for =E2=80=98getenv=E2=80=99 and the like, maybe with a=
 different set of parameter objects.

(Name pending, <unencoded-string> not being a subtype of <string> is bad na=
ming.)
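
To make points 1 and 2 concrete, a minimal sketch using SRFI-9 records.
All names are placeholders, and the conversions simply lean on
(ice-9 iconv); none of this is existing Guile API.

```scheme
(use-modules (srfi srfi-9) (ice-9 iconv) (rnrs bytevectors))

;; Conceptually a string, but with unknown encoding: wraps the raw bytes.
(define-record-type <unencoded-string>
  (make-unencoded-string bytes)
  unencoded-string?
  (bytes unencoded-string-bytes))

;; True for both ordinary strings and unencoded strings.
(define (string-like? obj)
  (or (string? obj) (unencoded-string? obj)))

;; Explicit conversions; the caller has to name the encoding.
;; With the 'error strategy, invalid input raises instead of
;; being replaced by '?'.
(define (unencoded-string->string us encoding)
  (bytevector->string (unencoded-string-bytes us) encoding 'error))

(define (string->unencoded-string str encoding)
  (make-unencoded-string (string->bytevector str encoding 'error)))

;; Example: the bytes of "caf" followed by #xE9 (e-acute in Latin-1).
(define name
  (make-unencoded-string (u8-list->bytevector '(99 97 102 233))))
(define as-latin-1 (unencoded-string->string name "ISO-8859-1"))
```

Decoding 'name' as ISO-8859-1 gives a four-character string, whereas
decoding it as UTF-8 should fail, since a lone #xE9 byte is not valid
UTF-8.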

I think this combines most of the positive qualities and avoids most of the=
 negative qualities (with the exception of the surrogate-encoding stuff, wh=
ich I see mostly as a negative):

=E2=80=A2 =E2=80=9Cunless we expanded them (likely re-using the existing co=
de paths)=E2=80=9D

This seems doable.
=E2=80=A2 =E2=80=9C- In cases where the system data was actually UTF-8, non=
-ASCII characters will be displayed "completely wrong", i.e. mapped to "ran=
dom" other characters according to the Latin-1 correspondences.=E2=80=9D

By distinguishing <string> from <unencoded-string>, for the most part this =
is non-applicable (depending on the encodings involved, <insert-encoding> m=
ight be incorrectly interpreted as UTF-8, but this seems rare).
=E2=80=A2 =E2=80=9Ceven if you only care about the bytes.=E2=80=9D
If you only care about the bytes, set the relevant parameter objects such t=
hat <unencoded-string> objects are returned.
=E2=80=A2 =E2=80=9CAt a minimum, I suggest Guile should produce an error by=
 default (instead of generating incorrect data) when the system bytes canno=
t be encoded in the current locale.=E2=80=9D

Included. Also, in the rare situation where approximating things is appropr=
iate (e.g. a basic directory listing), generating incorrect data is also po=
ssible.
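
A sketch of how such parameter objects could drive the decoding of a
file name, covering the cases above. The parameter names, the policy
symbols and 'decode-file-name' are all invented here.

```scheme
(use-modules (srfi srfi-9) (ice-9 iconv) (rnrs bytevectors))

(define-record-type <unencoded-string>
  (make-unencoded-string bytes)
  unencoded-string?
  (bytes unencoded-string-bytes))

;; Hypothetical parameters. The policy is one of:
;;   error       throw when the bytes are invalid in the encoding
;;   substitute  approximate invalid bytes (e.g. for a listing)
;;   fallback    return an <unencoded-string> when decoding fails
;;   bytes       always return an <unencoded-string>
(define file-name-encoding (make-parameter "UTF-8"))
(define file-name-policy (make-parameter 'error))

(define (decode-file-name bv)
  (let ((encoding (file-name-encoding)))
    (case (file-name-policy)
      ((error) (bytevector->string bv encoding 'error))
      ((substitute) (bytevector->string bv encoding 'substitute))
      ((fallback)
       (catch #t
         (lambda () (bytevector->string bv encoding 'error))
         (lambda args (make-unencoded-string bv))))
      ((bytes) (make-unencoded-string bv))
      (else (error "unknown file name policy" (file-name-policy))))))

;; "foo" followed by #xFF, which is never valid in UTF-8.
(define raw (u8-list->bytevector '(102 111 111 255)))

;; A program that only cares about the bytes.
(define as-bytes
  (parameterize ((file-name-policy 'bytes))
    (decode-file-name raw)))

;; A program that wants strings when possible.
(define as-fallback
  (parameterize ((file-name-policy 'fallback))
    (decode-file-name raw)))
```

Both 'as-bytes' and 'as-fallback' should end up as <unencoded-string>
objects holding the original four bytes, so nothing is lost on the way
back to the system.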

A negative quality is that there now are two string-ish object types, but s=
ince the two types represent different situations, one of them requires mor=
e care than the other, and many operations are supported for both, I don=E2=
=80=99t think that=E2=80=99s too bad.

(It might also be possible to replace <unencoded-string> directly by a byte=
vector, but if you do this, then remember that on the C level you need to d=
eal with the lack of trailing \0.)

Best regards,
Maxime Devos.


--_B9E8F1CD-F083-4C6D-BF54-03DDBAC444BB_
Content-Transfer-Encoding: quoted-printable
Content-Type: text/html; charset="utf-8"

<html xmlns:o=3D"urn:schemas-microsoft-com:office:office" xmlns:w=3D"urn:sc=
hemas-microsoft-com:office:word" xmlns:m=3D"http://schemas.microsoft.com/of=
fice/2004/12/omml" xmlns=3D"http://www.w3.org/TR/REC-html40"><head><meta ht=
tp-equiv=3DContent-Type content=3D"text/html; charset=3Dutf-8"><meta name=
=3DGenerator content=3D"Microsoft Word 15 (filtered medium)"><style><!--
/* Font Definitions */
@font-face
	{font-family:Wingdings;
	panose-1:5 0 0 0 0 0 0 0 0 0;}
@font-face
	{font-family:"Cambria Math";
	panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
	{font-family:Calibri;
	panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
	{margin:0cm;
	font-size:11.0pt;
	font-family:"Calibri",sans-serif;}
p.MsoListParagraph, li.MsoListParagraph, div.MsoListParagraph
	{mso-style-priority:34;
	margin-top:0cm;
	margin-right:0cm;
	margin-bottom:0cm;
	margin-left:36.0pt;
	font-size:11.0pt;
	font-family:"Calibri",sans-serif;}
.MsoChpDefault
	{mso-style-type:export-only;}
@page WordSection1
	{size:612.0pt 792.0pt;
	margin:72.0pt 72.0pt 72.0pt 72.0pt;}
div.WordSection1
	{page:WordSection1;}
/* List Definitions */
@list l0
	{mso-list-id:1270773722;
	mso-list-type:hybrid;
	mso-list-template-ids:1416290320 -246487508 1442726318 134807557 134807553=
 134807555 134807557 134807553 134807555 134807557;}
@list l0:level1
	{mso-level-tab-stop:none;
	mso-level-number-position:left;
	margin-left:23.0pt;
	text-indent:-18.0pt;
	mso-ascii-font-family:Calibri;
	mso-fareast-font-family:"Times New Roman";
	mso-hansi-font-family:Calibri;
	mso-bidi-font-family:"Times New Roman";}
@list l0:level2
	{mso-level-number-format:bullet;
	mso-level-text:\F0B7;
	mso-level-tab-stop:none;
	mso-level-number-position:left;
	margin-left:59.0pt;
	text-indent:-18.0pt;
	font-family:Symbol;
	mso-fareast-font-family:"Times New Roman";
	mso-bidi-font-family:"Times New Roman";}
@list l0:level3
	{mso-level-number-format:bullet;
	mso-level-text:\F0A7;
	mso-level-tab-stop:none;
	mso-level-number-position:left;
	margin-left:95.0pt;
	text-indent:-18.0pt;
	font-family:Wingdings;}
@list l0:level4
	{mso-level-number-format:bullet;
	mso-level-text:\F0B7;
	mso-level-tab-stop:none;
	mso-level-number-position:left;
	margin-left:131.0pt;
	text-indent:-18.0pt;
	font-family:Symbol;}
@list l0:level5
	{mso-level-number-format:bullet;
	mso-level-text:o;
	mso-level-tab-stop:none;
	mso-level-number-position:left;
	margin-left:167.0pt;
	text-indent:-18.0pt;
	font-family:"Courier New";}
@list l0:level6
	{mso-level-number-format:bullet;
	mso-level-text:\F0A7;
	mso-level-tab-stop:none;
	mso-level-number-position:left;
	margin-left:203.0pt;
	text-indent:-18.0pt;
	font-family:Wingdings;}
@list l0:level7
	{mso-level-number-format:bullet;
	mso-level-text:\F0B7;
	mso-level-tab-stop:none;
	mso-level-number-position:left;
	margin-left:239.0pt;
	text-indent:-18.0pt;
	font-family:Symbol;}
@list l0:level8
	{mso-level-number-format:bullet;
	mso-level-text:o;
	mso-level-tab-stop:none;
	mso-level-number-position:left;
	margin-left:275.0pt;
	text-indent:-18.0pt;
	font-family:"Courier New";}
@list l0:level9
	{mso-level-number-format:bullet;
	mso-level-text:\F0A7;
	mso-level-tab-stop:none;
	mso-level-number-position:left;
	margin-left:311.0pt;
	text-indent:-18.0pt;
	font-family:Wingdings;}
ol
	{margin-bottom:0cm;}
ul
	{margin-bottom:0cm;}
--></style></head><body lang=3Den-BE link=3Dblue vlink=3D"#954F72" style=3D=
'word-wrap:break-word'><div class=3DWordSection1><p class=3DMsoNormal><span=
 lang=3Den-BE>&gt;* Problem<o:p></o:p></span></p><p class=3DMsoNormal><span=
 lang=3Den-BE>&gt;<o:p>&nbsp;</o:p></span></p><p class=3DMsoNormal><span la=
ng=3Den-BE>&gt;System data like environment variables, user names, group na=
mes, file<o:p></o:p></span></p><p class=3DMsoNormal><span lang=3Den-BE>path=
s and extended attributes (xattr), etc. are on some systems (like<o:p></o:p=
></span></p><p class=3DMsoNormal><span lang=3Den-BE>Linux) binary data, and=
 may not be encodable as a string in the current<o:p></o:p></span></p><p cl=
ass=3DMsoNormal><span lang=3Den-BE>locale.=C2=A0 For Linux, as an example, =
only the null character is an invalid<o:p></o:p></span></p><p class=3DMsoNo=
rmal><span lang=3Den-BE>user/group/filename byte, while for UTF-8, a much s=
maller set of bytes<o:p></o:p></span></p><p class=3DMsoNormal><span lang=3D=
en-BE>are valid[1].<o:p></o:p></span></p><p class=3DMsoNormal><span lang=3D=
en-BE>&gt;[...]<o:p></o:p></span></p><p class=3DMsoNormal><span lang=3Den-B=
E>&gt;You end up with a question mark instead of the correct value.=C2=A0 T=
his<o:p></o:p></span></p><p class=3DMsoNormal><span lang=3Den-BE>makes it d=
ifficult to write programs that don't risk silent corruption<o:p></o:p></sp=
an></p><p class=3DMsoNormal><span lang=3Den-BE>unless all the relevant syst=
em data is known to be compatible with the<o:p></o:p></span></p><p class=3D=
MsoNormal><span lang=3Den-BE>user's current locale.<o:p></o:p></span></p><p=
 class=3DMsoNormal><span lang=3Den-BE><o:p>&nbsp;</o:p></span></p><p class=
=3DMsoNormal><span lang=3Den-BE>&gt;It's perhaps worth noting, that while t=
ypically unlikely, any given<o:p></o:p></span></p><p class=3DMsoNormal><spa=
n lang=3Den-BE>directory could contain paths in an arbitrary collection of =
encodings:<o:p></o:p></span></p><p class=3DMsoNormal><span lang=3Den-BE>UTF=
-8, SHIFT-JIS, Latin-1, etc., and so if you really want to try to<o:p></o:p=
></span></p><p class=3DMsoNormal><span lang=3Den-BE>handle them as strings =
(maybe you want to correctly upcase/downcase<o:p></o:p></span></p><p class=
=3DMsoNormal><span lang=3Den-BE>them), you have to know (somehow) the encod=
ing that applies to each one.<o:p></o:p></span></p><p class=3DMsoNormal><sp=
an lang=3Den-BE>Otherwise, in the limiting case, you can only assume &quot;=
bytes&quot;.<o:p></o:p></span></p><p class=3DMsoNormal><span lang=3Den-BE><=
o:p>&nbsp;</o:p></span></p><p class=3DMsoNormal><span lang=3Den-BE>&gt;* Im=
provements<o:p></o:p></span></p><p class=3DMsoNormal><span lang=3Den-BE><o:=
p>&nbsp;</o:p></span></p><p class=3DMsoNormal><span lang=3Den-BE>&gt;At a m=
inimum, I suggest Guile should produce an error by default<o:p></o:p></span=
></p><p class=3DMsoNormal><span lang=3Den-BE>(instead of generating incorre=
ct data) when the system bytes cannot be<o:p></o:p></span></p><p class=3DMs=
oNormal><span lang=3Den-BE>encoded in the current locale.<o:p></o:p></span>=
</p><p class=3DMsoNormal><span lang=3Den-BE><o:p>&nbsp;</o:p></span></p><p =
class=3DMsoNormal><span lang=3Den-BE>I totally agree on this.<o:p></o:p></s=
pan></p><p class=3DMsoNormal><span lang=3Den-BE><o:p>&nbsp;</o:p></span></p=
><p class=3DMsoNormal><span lang=3Den-BE>&gt;There should also be some stra=
ightforward, thread-safe way to write code<o:p></o:p></span></p><p class=3D=
MsoNormal><span lang=3Den-BE>that accesses and manipulates system data effi=
ciently and without<o:p></o:p></span></p><p class=3DMsoNormal><span lang=3D=
en-BE>corruption.<o:p></o:p></span></p><p class=3DMsoNormal><span lang=3Den=
-BE><o:p>&nbsp;</o:p></span></p><p class=3DMsoNormal><span lang=3Den-BE>&gt=
;As an incremental step, and as has been discussed elsewhere a bit, we<o:p>=
</o:p></span></p><p class=3DMsoNormal><span lang=3Den-BE>might add support =
for uselocale()[2] and then document that the current<o:p></o:p></span></p>=
<p class=3DMsoNormal><span lang=3Den-BE>recommendation is to always use ISO=
-8859-1 (i.e. Latin-1)[3] for system<o:p></o:p></span></p><p class=3DMsoNor=
mal><span lang=3Den-BE>data unless you're certain your program doesn't need=
 to be general<o:p></o:p></span></p><p class=3DMsoNormal><span lang=3Den-BE=
>purpose (perhaps you're sure you only care about UTF-8 systems).<o:p></o:p=
></span></p><p class=3DMsoNormal><span lang=3Den-BE><o:p>&nbsp;</o:p></span=
></p><p class=3DMsoNormal><span lang=3Den-BE>I=E2=80=99d rather not. It=E2=
=80=99s rather stateful and hence non-trivial to compose.<o:p></o:p></span>=
</p><p class=3DMsoNormal><span lang=3Den-BE>Also, locale is not only about =
the encoding of text [file name/env encodings/xattr/...],<o:p></o:p></span>=
</p><p class=3DMsoNormal><span lang=3Den-BE>but also about language. Also s=
etting the language is excessive in this case.<o:p></o:p></span></p><p clas=
s=3DMsoNormal><span lang=3Den-BE><o:p>&nbsp;</o:p></span></p><p class=3DMso=
Normal><span lang=3Den-BE>&gt;A program intended to work everywhere might t=
hen do something like<o:p></o:p></span></p><p class=3DMsoNormal><span lang=
=3Den-BE>this:<o:p></o:p></span></p><p class=3DMsoNormal><span lang=3Den-BE=
><o:p>&nbsp;</o:p></span></p><p class=3DMsoNormal><span lang=3Den-BE>&gt;=
=C2=A0=C2=A0 ...<o:p></o:p></span></p><p class=3DMsoNormal><span lang=3Den-=
BE>&gt;=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 #:use-module ((guile locale)<o:p></o:=
p></span></p><p class=3DMsoNormal><span lang=3Den-BE>&gt;=C2=A0=C2=A0=C2=A0=
=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=
=A0=C2=A0=C2=A0 #:select (iso-8859-1 with-locale))<o:p></o:p></span></p><p =
class=3DMsoNormal><span lang=3Den-BE>&gt;=C2=A0=C2=A0=C2=A0 ...<o:p></o:p><=
/span></p><p class=3DMsoNormal><span lang=3Den-BE>&gt;<o:p>&nbsp;</o:p></sp=
an></p><p class=3DMsoNormal><span lang=3Den-BE>&gt;=C2=A0=C2=A0=C2=A0 (defi=
ne (environment name)<o:p></o:p></span></p><p class=3DMsoNormal><span lang=
=3Den-BE>&gt;=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 (with-locale iso-8859-1 (getenv=
 name)))<o:p></o:p></span></p><p class=3DMsoNormal><span lang=3Den-BE><o:p>=
&nbsp;</o:p></span></p><p class=3DMsoNormal><span lang=3Den-BE>This, OTOH, =
seems a bit better =E2=80=93 =E2=80=98with-locale=E2=80=99 is like =E2=80=
=98parameterize=E2=80=99 and hence pretty composable.<o:p></o:p></span></p>=
<p class=3DMsoNormal><span lang=3Den-BE>However, it still suffers from the=
 problem that it sets too much (also, there is no such thing as the =E2=80=
=9Ciso-8859-1=E2=80=9D locale?).<o:p></o:p></span></p><p class=3DMsoNormal>=
<span lang=3Den-BE><o:p>&nbsp;</o:p></span></p><p class=3DMsoNormal><span l=
ang=3Den-BE>Instead, I would propose something like:<o:p></o:p></span></p><=
p class=3DMsoNormal><span lang=3Den-BE><o:p>&nbsp;</o:p></span></p><p class=
=3DMsoNormal><span lang=3Den-BE>;; [todo: add validation]<o:p></o:p></span>=
</p><p class=3DMsoNormal><span lang=3Den-BE>;; if #false, default to what i=
s implied by the locale<o:p></o:p></span></p><p class=3DMsoNormal><span lan=
g=3Den-BE>(define system-encoding (make-parameter #false))<o:p></o:p></span=
></p><p class=3DMsoNormal><span lang=3Den-BE>;; if #false, default to syste=
m-encoding<o:p></o:p></span></p><p class=3DMsoNormal><span lang=3Den-BE>(de=
fine file-name-encoding (make-parameter #false))<o:p></o:p></span></p><p cl=
ass=3DMsoNormal><span lang=3Den-BE>[...]<o:p></o:p></span></p><p class=3DMs=
oNormal><span lang=3Den-BE><o:p>&nbsp;</o:p></span></p><p class=3DMsoNormal=
><span lang=3Den-BE>;; let=E2=80=99s say that for some reason, we know the =
file names have this encoding,<o:p></o:p></span></p><p class=3DMsoNormal><s=
pan lang=3Den-BE>;; but we don=E2=80=99t have information on other things s=
o we leave the decision<o:p></o:p></span></p><p class=3DMsoNormal><span lan=
g=3Den-BE>;; on other encodings to the caller.<o:p></o:p></span></p><p clas=
s=3DMsoNormal><span lang=3Den-BE>(define (some-proc)<o:p></o:p></span></p><=
p class=3DMsoNormal><span lang=3Den-BE>=C2=A0 (parameterize ((file-name-enc=
oding &quot;UTF-8&quot;))<o:p></o:p></span></p><p class=3DMsoNormal><span l=
ang=3Den-BE>=C2=A0=C2=A0=C2=A0 [open some file and do stuff with it]))<o:p>=
</o:p></span></p><p class=3DMsoNormal><span lang=3Den-BE><o:p>&nbsp;</o:p><=
/span></p><p class=3DMsoNormal><span lang=3Den-BE>This also has the advanta=
ge of separating the different things a bit =E2=80=93 I can imagine a multi=
-user system where the usernames are encoded differently from the file name=
s in the user home directory (not an unsurmountable problem for =E2=80=98wi=
th-locale=E2=80=99, but this seems a bit more straightforward to use when e=
xternal libraries are involved).<o:p></o:p></span></p><p class=3DMsoNormal>=
<span lang=3Den-BE><o:p>&nbsp;</o:p></span></p><p class=3DMsoNormal><span l=
ang=3Den-BE>(I=E2=80=99m not too sure about this splitting of parameter obj=
ects)<o:p></o:p></span></p><p class=3DMsoNormal><span lang=3Den-BE><o:p>&nb=
sp;</o:p></span></p><p class=3DMsoNormal><span lang=3Den-BE>&gt;There are d=
isadvantages to this approach, but it's a fairly easy<o:p></o:p></span></p>=
<p class=3DMsoNormal><span lang=3Den-BE>improvement.<o:p></o:p></span></p><=
p class=3DMsoNormal><span lang=3Den-BE><o:p>&nbsp;</o:p></span></p><p class=
=3DMsoNormal><span lang=3Den-BE>&gt;Some potential disadvantages:<o:p></o:p=
></span></p><p class=3DMsoNormal><span lang=3Den-BE><o:p>&nbsp;</o:p></span=
></p><p class=3DMsoNormal><span lang=3Den-BE>&gt;=C2=A0 - In cases where th=
e system data was actually UTF-8, non-ASCII<o:p></o:p></span></p><p class=
=3DMsoNormal><span lang=3Den-BE>&gt;=C2=A0=C2=A0=C2=A0 characters will be d=
isplayed &quot;completely wrong&quot;, i.e. mapped to<o:p></o:p></span></p>=
<p class=3DMsoNormal><span lang=3Den-BE>&gt;=C2=A0=C2=A0=C2=A0 &quot;random=
&quot; other characters according to the Latin-1 correspondences.<o:p></o:p=
></span></p><p class=3DMsoNormal><span lang=3Den-BE><o:p>&nbsp;</o:p></span=
></p><p class=3DMsoNormal><span lang=3Den-BE>This is why I wouldn=E2=80=99=
t recommend always using ISO-8859-1 by default.<o:p></o:p></span></p><p cla=
ss=3DMsoNormal><span lang=3Den-BE>The situation where the encodings of =
thing=
s are different is the exception<o:p></o:p></span></p><p class=3DMsoNormal>=
<span lang=3Den-BE>(and a historical artifact of pre-UTF-8), not the norm.<=
o:p></o:p></span></p><p class=3DMsoNormal><span lang=3Den-BE><o:p>&nbsp;</o=
:p></span></p><p class=3DMsoNormal><span lang=3Den-BE>I think changing the =
=E2=80=98?=E2=80=99 into =E2=80=98throw an exception=E2=80=99, and providin=
g an _<i>option</i>_ (i.e. temporarily changing the locale to ISO-8859-1) =
to also support this historical artifact is sufficient.<o:p></o:p></span>=
</p><p=
 class=3DMsoNormal><span lang=3Den-BE><o:p>&nbsp;</o:p></span></p><p class=
=3DMsoNormal><span lang=3Den-BE>&gt;=C2=A0 - You have to pay whatever cost =
is involved in switching locales, and<o:p></o:p></span></p><p class=3DMsoNo=
rmal><span lang=3Den-BE>&gt;=C2=A0=C2=A0=C2=A0 in encoding/decoding the byt=
es, even if you only care about the<o:p></o:p></span></p><p class=3DMsoNorm=
al><span lang=3Den-BE>&gt;=C2=A0=C2=A0=C2=A0 bytes.<o:p></o:p></span></p><p=
 class=3DMsoNormal><span lang=3Den-BE><o:p>&nbsp;</o:p></span></p><p class=
=3DMsoNormal><span lang=3Den-BE>IIRC, in ISO-8859-1 there is a direct corr=
espondence between bytes and characters<o:p></o:p></span></p><p class=3DMso=
Normal><span lang=3Den-BE>(and Guile recognises this), so there is no cost =
beyond mere copying.<o:p></o:p></span></p><p class=3DMsoNormal><span lang=
=3Den-BE><o:p>&nbsp;</o:p></span></p><p class=3DMsoNormal><span lang=3Den-B=
E>&gt;=C2=A0 - If any manipulations of the string representing the system d=
ata end<o:p></o:p></span></p><p class=3DMsoNormal><span lang=3Den-BE>&gt;=
=C2=A0=C2=A0=C2=A0 up performing Unicode canonicalizations or normalization=
s, the data<o:p></o:p></span></p><p class=3DMsoNormal><span lang=3Den-BE>&g=
t;=C2=A0=C2=A0=C2=A0 could still be corrupted.=C2=A0 I don't *think* Guile =
itself ever does<o:p></o:p></span></p><p class=3DMsoNormal><span lang=3Den-=
BE>&gt;=C2=A0=C2=A0=C2=A0 that implicitly.<o:p></o:p></span></p><p class=3D=
MsoNormal><span lang=3Den-BE><o:p>&nbsp;</o:p></span></p><p class=3DMsoNorm=
al><span lang=3Den-BE>Pretty sure it doesn=E2=80=99t.<o:p></o:p></span></p>=
<p class=3DMsoNormal><span lang=3Den-BE><o:p>&nbsp;</o:p></span></p><p clas=
s=3DMsoNormal><span lang=3Den-BE>&gt;=C2=A0 - Less importantly, if we switc=
h the internal string representation to<o:p></o:p></span></p><p class=3DMso=
Normal><span lang=3Den-BE>=C2=A0=C2=A0=C2=A0 UTF-8 (proposed[4]), then non-=
ASCII bytes in the data will require<o:p></o:p></span></p><p class=3DMsoNor=
mal><span lang=3Den-BE>=C2=A0=C2=A0=C2=A0 two bytes in memory.<o:p></o:p></=
span></p><p class=3DMsoNormal><span lang=3Den-BE><o:p>&nbsp;</o:p></span></=
p><p class=3DMsoNormal><span lang=3Den-BE>&gt;The most direct (and compact,=
 if we do convert to UTF-8) representation<o:p></o:p></span></p><p class=3D=
MsoNormal><span lang=3Den-BE>would bytevectors, but then you would have a m=
uch more limited set of<o:p></o:p></span></p><p class=3DMsoNormal><span lan=
g=3Den-BE>operations available (i.e. strings have all of srfi-13, srfi-14, =
etc.)<o:p></o:p></span></p><p class=3DMsoNormal><span lang=3Den-BE>unless w=
e expanded them (likely re-using the existing code paths).=C2=A0 Of<o:p></o=
:p></span></p><p class=3DMsoNormal><span lang=3Den-BE>course you could stil=
l convert to Latin-1, perform the operation, and<o:p></o:p></span></p><p cl=
ass=3DMsoNormal><span lang=3Den-BE>convert back, but that's not ideal.<o:p>=
</o:p></span></p><p class=3DMsoNormal><span lang=3Den-BE><o:p>&nbsp;</o:p><=
/span></p><p class=3DMsoNormal><span lang=3Den-BE>&gt;Finally, while I'm no=
t sure how I feel about it, one notable precedent<o:p></o:p></span></p><p c=
lass=3DMsoNormal><span lang=3Den-BE>is Python's &quot;surrogateescape&quot;=
 approach[5], which shifts any unencodable<o:p></o:p></span></p><p class=3D=
MsoNormal><span lang=3Den-BE>bytes into &quot;lone Unicode surrogates&quot;=
, a process which can (and of course<o:p></o:p></span></p><p class=3DMsoNor=
mal><span lang=3Den-BE>must) be safely reversed before handing the data bac=
k to the system.=C2=A0 It<o:p></o:p></span></p><p class=3DMsoNormal><span l=
ang=3Den-BE>has its own trade-offs/(security)-concerns, as mentioned in the=
 PEP.<o:p></o:p></span></p><p class=3DMsoNormal><span lang=3Den-BE><o:p>&nb=
sp;</o:p></span></p><p class=3DMsoNormal><span lang=3Den-BE>IIRC, surrogate=
s are code points but not characters (they are not Unicode scalar values). =
As a consequence, strings would =
contain non-characters, and (char? (string-ref s index)) might be #false. I=
=E2=80=99d rather not; such an object does not sound like a string to me.<o=
:p></o:p></span></p><p class=3DMsoNormal><span lang=3Den-BE><o:p>&nbsp;</o:=
p></span></p><p class=3DMsoNormal><span lang=3Den-BE>Here is an alternative=
 solution:<o:p></o:p></span></p><p class=3DMsoNormal><span lang=3Den-BE><o:=
p>&nbsp;</o:p></span></p><ol style=3D'margin-top:0cm' start=3D1 type=3D1><l=
i class=3DMsoListParagraph style=3D'margin-left:-13.0pt;mso-list:l0 level1 =
lfo1'><span lang=3Den-BE>Define a new object type =E2=80=98&lt;unencoded-st=
ring&gt;=E2=80=99 (wrapping a bytevector). This represents things that =
are _=
<i>conceptually</i>_ a string instead of a mere sequence of bytes, but we d=
on=E2=80=99t know the actual encoding so we can=E2=80=99t let it be a strin=
g.<o:p></o:p></span></li><li class=3DMsoListParagraph style=3D'margin-left:=
-13.0pt;mso-list:l0 level1 lfo1'><span lang=3Den-BE>Also define a bunch of =
procedures for converting between bytes, unencoded-strings and strings. =
Also=
, a =E2=80=98string-like?=E2=80=99 predicate that includes both =E2=80=98&l=
t;string&gt;=E2=80=99 and =E2=80=98&lt;unencoded-string&gt;=E2=80=99.<o:p><=
/o:p></span></li><li class=3DMsoListParagraph style=3D'margin-left:-13.0pt;=
mso-list:l0 level1 lfo1'><span lang=3Den-BE>Procedures like =E2=80=98open-f=
ile=E2=80=99 etc. are extended to support &lt;unencoded-string&gt;.<o:p></o=
:p></span></li><li class=3DMsoListParagraph style=3D'margin-left:-13.0pt;ms=
o-list:l0 level1 lfo1'><span lang=3Den-BE>Maybe do the same for SRFI-N stuf=
f (maybe as part of (srfi srfi-N gnu) extensions).<o:p></o:p></span></li></=
ol><p class=3DMsoListParagraph style=3D'margin-left:23.0pt'><span lang=3Den=
-BE>(I don=E2=80=99t know if (string-append unencoded encoded) should be su=
pported.)<o:p></o:p></span></p><ol style=3D'margin-top:0cm' start=3D5 type=
=3D1><li class=3DMsoListParagraph style=3D'margin-left:-13.0pt;mso-list:l0 =
level1 lfo1'><span lang=3Den-BE>When a procedure would return a filename, i=
t first looks at some parameter objects. These parameter objects determine=
 what the encoding is, what to do when it is not valid according to the enc=
oding (approximate via ? and the like, throw an exception, or return an &lt=
;unencoded-string&gt;) =E2=80=93 or even return an &lt;unencoded-string&gt;=
 unconditionally.<o:p></o:p></span></li><li class=3DMsoListParagraph style=
=3D'margin-left:-13.0pt;mso-list:l0 level1 lfo1'><span lang=3Den-BE>Also do=
 the same for =E2=80=98getenv=E2=80=99 and the like, maybe with a different=
 set of parameter objects.<o:p></o:p></span></li></ol><p class=3DMsoNormal =
style=3D'margin-left:5.0pt'><span lang=3Den-BE><o:p>&nbsp;</o:p></span></p>=
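<p class=3DMsoNormal style=3D'margin-left:5.0pt'><span lang=3Den-BE>As a
rough illustration of points 1, 2 and 5, here is a minimal sketch. Every
name it defines (make-unencoded-string, string-like?,
%file-name-encoding, %file-name-fallback, bytes-&gt;file-name) is
hypothetical; nothing like this exists in Guile. It only assumes SRFI-9
records, parameter objects and bytevector-&gt;string from (ice-9
iconv).<o:p></o:p></span></p>
<pre>
(use-modules (srfi srfi-9) (rnrs bytevectors) (ice-9 iconv))

;; Hypothetical type: bytes that are conceptually text, encoding unknown.
(define-record-type &lt;unencoded-string&gt;
  (make-unencoded-string bytes)
  unencoded-string?
  (bytes unencoded-string-bytes))

(define (string-like? obj)
  (or (string? obj) (unencoded-string? obj)))

;; Hypothetical parameter objects: the assumed encoding, and what to do
;; with invalid bytes: 'error, 'substitute, 'unencoded (fall back to an
;; unencoded string) or 'always-unencoded.
(define %file-name-encoding (make-parameter "UTF-8"))
(define %file-name-fallback (make-parameter 'error))

;; What a procedure returning a file name could do with the raw bytes.
;; Assumes bytevector-&gt;string raises an error on invalid input when
;; given the 'error strategy.
(define (bytes-&gt;file-name bv)
  (case (%file-name-fallback)
    ((always-unencoded) (make-unencoded-string bv))
    ((unencoded)
     (or (false-if-exception
          (bytevector-&gt;string bv (%file-name-encoding) 'error))
         (make-unencoded-string bv)))
    (else
     (bytevector-&gt;string bv (%file-name-encoding)
                        (%file-name-fallback)))))

;; Usage: ask for the raw bytes back when decoding fails.
(parameterize ((%file-name-fallback 'unencoded))
  (bytes-&gt;file-name (u8-list-&gt;bytevector '(102 111 111 255))))
</pre>
<p class=3DMsoNormal style=3D'margin-left:5.0pt'><span lang=3Den-BE>With
the default 'error setting, the same call would raise an exception
instead (assuming bytevector-&gt;string rejects the invalid
byte).<o:p></o:p></span></p>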
<p class=3DMsoNormal style=3D'margin-left:5.0pt'><span lang=3Den-BE>(Name p=
ending; &lt;unencoded-string&gt; not being a subtype of &lt;string&gt; is b=
ad naming.)<o:p></o:p></span></p><p class=3DMsoNormal style=3D'margin-left:=
5.0pt'><span lang=3Den-BE><o:p>&nbsp;</o:p></span></p><p class=3DMsoNormal =
style=3D'margin-left:5.0pt'><span lang=3Den-BE>I think this combines most o=
f the positive qualities and avoids most of the negative qualities (with th=
e exception of the surrogate-encoding stuff, which I see mostly as a negati=
ve):<o:p></o:p></span></p><p class=3DMsoNormal style=3D'margin-left:5.0pt'>=
<span lang=3Den-BE><o:p>&nbsp;</o:p></span></p><ol style=3D'margin-top:0cm'=
 start=3D6 type=3D1><ul style=3D'margin-top:0cm' type=3Ddisc><li class=3DMs=
oListParagraph style=3D'margin-left:-13.0pt;mso-list:l0 level2 lfo1'><span =
lang=3Den-BE>=E2=80=9Cunless we expanded them (likely re-using the existing=
 code paths)=E2=80=9D<o:p></o:p></span></li></ul></ol><p class=3DMsoListPar=
agraph style=3D'margin-left:59.0pt'><span lang=3Den-BE><o:p>&nbsp;</o:p></s=
pan></p><p class=3DMsoListParagraph style=3D'margin-left:59.0pt'><span lang=
=3Den-BE>This seems doable.<o:p></o:p></span></p><ol style=3D'margin-top:0c=
m' start=3D6 type=3D1><ul style=3D'margin-top:0cm' type=3Ddisc><li class=3D=
MsoListParagraph style=3D'margin-left:-13.0pt;mso-list:l0 level2 lfo1'><spa=
n lang=3Den-BE>=E2=80=9C- In cases where the system data was actually UTF-8=
, non-ASCII=C2=A0 characters will be displayed &quot;completely wrong&quot;=
, i.e. mapped to=C2=A0 &quot;random&quot; other characters according to the=
 Latin-1 correspondences.<o:p></o:p></span></li></ul></ol><p class=3DMsoLis=
tParagraph style=3D'margin-left:59.0pt'><span lang=3Den-BE><o:p>&nbsp;</o:p=
></span></p><p class=3DMsoListParagraph style=3D'margin-left:59.0pt'><span =
lang=3Den-BE>By distinguishing &lt;string&gt; from &lt;unencoded-string&gt;=
, for the most part this is not applicable (depending on the encodings invo=
lved, &lt;insert-encoding&gt; might be incorrectly interpreted as UTF-8, bu=
t this seems rare).<o:p></o:p></span></p><ol style=3D'margin-top:0cm' start=
=3D6 type=3D1><ul style=3D'margin-top:0cm' type=3Ddisc><li class=3DMsoListP=
aragraph style=3D'margin-left:-13.0pt;mso-list:l0 level2 lfo1'><span lang=
=3Den-BE>=E2=80=9Ceven if you only care about the bytes.=E2=80=9D<o:p></o:p=
></span></li></ul></ol><p class=3DMsoListParagraph style=3D'margin-left:59.=
0pt'><span lang=3Den-BE>If you only care about the bytes, set the relevant =
parameter objects such that &lt;unencoded-string&gt; objects are returned.=
<o:p></o:p></span></p><ol style=3D'margin-top:0cm' start=3D6 type=3D1><ul s=
tyle=3D'margin-top:0cm' type=3Ddisc><li class=3DMsoListParagraph style=3D'm=
argin-left:-13.0pt;mso-list:l0 level2 lfo1'><span lang=3Den-BE>=E2=80=9CAt =
a minimum, I suggest Guile should produce an error by default (instead of g=
enerating incorrect data) when the system bytes cannot be encoded in the cu=
rrent locale.=E2=80=9D<o:p></o:p></span></li></ul></ol><p class=3DMsoListPa=
ragraph style=3D'margin-left:59.0pt'><span lang=3Den-BE><o:p>&nbsp;</o:p></=
span></p><p class=3DMsoListParagraph style=3D'margin-left:59.0pt'><span lan=
g=3Den-BE>Included. Also, in the rare situation where approximating things =
is appropriate (e.g. a basic directory listing), generating incorrect data =
is also possible.<o:p></o:p></span></p><p class=3DMsoNormal style=3D'margin=
-left:5.0pt'><span lang=3Den-BE><o:p>&nbsp;</o:p></span></p><p class=3DMsoN=
ormal style=3D'margin-left:5.0pt'><span lang=3Den-BE>A negative quality is =
that there now are two string-ish object types, but since the two types rep=
resent different situations, one of them requires more care than the other,=
 and many operations are supported for both, I don=E2=80=99t think that=E2=
=80=99s too bad.<o:p></o:p></span></p><p class=3DMsoNormal style=3D'margin-=
left:5.0pt'><span lang=3Den-BE><o:p>&nbsp;</o:p></span></p><p class=3DMsoNo=
rmal style=3D'margin-left:5.0pt'><span lang=3Den-BE>(It might also be possi=
ble to replace &lt;unencoded-string&gt; directly by a bytevector, but if yo=
u do this, then remember that on the C level you need to deal with the lack=
 of trailing \0.)<o:p></o:p></span></p><p class=3DMsoNormal style=3D'margin=
-left:5.0pt'><span lang=3Den-BE><o:p>&nbsp;</o:p></span></p><p class=3DMsoN=
ormal style=3D'margin-left:5.0pt'><span lang=3Den-BE>Best regards,<o:p></o:=
p></span></p><p class=3DMsoNormal style=3D'margin-left:5.0pt'><span lang=3D=
en-BE>Maxime Devos.<o:p></o:p></span></p><p class=3DMsoNormal><span lang=3D=
en-BE><o:p>&nbsp;</o:p></span></p></div></body></html>=

--_B9E8F1CD-F083-4C6D-BF54-03DDBAC444BB_--