From: Marko Rauhamaa
Newsgroups: gmane.lisp.guile.user
Subject: Re: Running script from directory with UTF-8 characters
Date: Tue, 22 Dec 2015 23:39:28 +0200
Message-ID: <87d1tycgdr.fsf@elektro.pacujo.net>
References: <87twnbfkzb.fsf@elektro.pacujo.net>
 <20151222003447.198ea945@bother.homenet>
 <87io3rffo5.fsf@elektro.pacujo.net>
 <20151222142125.17ba7368@bother.homenet>
 <87bn9ieaup.fsf@elektro.pacujo.net>
 <20151222201240.3a66fd94@bother.homenet>
 <87oadicjbc.fsf@elektro.pacujo.net>
 <83wps6p5d2.fsf@gnu.org>
In-Reply-To: <83wps6p5d2.fsf@gnu.org> (Eli Zaretskii's message of "Tue, 22 Dec 2015 22:59:05 +0200")
To: Eli Zaretskii
Cc: guile-user@gnu.org

Eli Zaretskii:

>> From: Marko Rauhamaa
>> By setting the character set artificially to Latin-1 in Guile, all
>> pathnames are accessible to it.
>
> No, they aren't, not as file names. E.g., you cannot meaningfully
> downcase or upcase such "characters", you cannot count characters (as
> opposed to bytes), you cannot calculate how much screen estate will be
> needed to display them, with some Far Eastern encodings you cannot
> correctly search them for some specific ASCII characters (because they
> can be part of a multibyte sequence), etc. etc. IOW, you cannot work
> with file names as human-readable text, which is something many
> programs need to do.

You can, in a roundabout way. You do the low-level file I/O in Latin-1.
Then you reinterpret the same bytes as UTF-8, and if you get an
exception, you deal with the situation.

Otherwise, you may not even be able to remove a file with a non-UTF-8
name.

> File names _are_ strings, there's no way around that.

Linux pathnames are classic C strings: NUL-terminated byte sequences
with no encoding attached.

> They are strings because _people_ name files and give them meaningful
> names and extensions.

The Linux kernel just doesn't care, and shouldn't.

It's acceptable for Guile to create a higher-level illusion, but it
shouldn't sacrifice completeness while doing so. You should be able to
manipulate every conceivable filename from Guile code.

(Python 3.x accepts bytevectors, which it calls bytes objects, as well
as strings throughout its file-system functions. For example, listing a
directory returns strings if the directory name is given as a string. It
returns bytevectors if the directory name is given as a bytevector.
Python's bytevector literals accept ASCII, which makes this rather
convenient.)

> If Guile cannot easily work with file names encoded in a codeset other
> than the current locale's one, then Guile should be extended to allow
> a program to tell it in which encoding to interpret a particular name.

A program usually has no clue how a pathname has been encoded.

> (I think Guile already supports that, but maybe I misremember.) But
> lobbying for treating file names as byte streams, let alone Latin-1
> characters, is a large step backwards, to 1990s when we didn't know
> better. We've come a long way since then and learned a lot on the way.

At least our backwardness allowed Linux to jump directly to UTF-8 and
not be afflicted by UCS-2 like Windows and Java.

I'm not saying bytevectors are elegant, but we should not replace them
with wishful thinking.

Ideally, we should have a bijective bytevector-to-string mapping.
(Python 3.x uses Unicode surrogate code points for that purpose but
doesn't quite achieve bijection, unfortunately.)

I'm a bit sorry that Guile repeated Python 3's mistake and brought
(Unicode) strings to the center. Strings are a highly special-purpose
data structure; I really never had a real need for them in my decades of
programming. Also, I suspect strings are much too simplistic for any
serious typesetting or GUI work. It seems the sweet spot of strings is
text/plain mail messages and Usenet postings.

Guile 1.x's and Python 2.x's bytevector/string confusion was actually a
very happy medium.
Neither the OS nor the programming language placed any interpretation
on the byte sequences. That was left to the application.


Marko
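
For illustration only, here is a minimal Python 3 sketch of the roundabout approach described in the message: keep every file name accessible as raw bytes (or, equivalently, as a Latin-1 string), then try to reinterpret those bytes as UTF-8 and let the application decide what to do when that fails. The helper names latin1_to_text and list_names are hypothetical and appear neither in the thread nor in any library. The sketch relies only on the behavior that os.listdir returns bytes objects when it is given a bytes path.

```python
import os
from typing import Iterator, Optional, Tuple


def latin1_to_text(name_latin1: str) -> str:
    """Reinterpret a Latin-1-decoded file name as UTF-8 text.

    Raises UnicodeDecodeError if the underlying bytes are not valid UTF-8.
    """
    # Latin-1 maps code points 0..255 to bytes 0..255 one-to-one, so this
    # step recovers the original byte sequence exactly.
    raw = name_latin1.encode("latin-1")
    return raw.decode("utf-8")


def list_names(directory: bytes) -> Iterator[Tuple[bytes, str, Optional[str]]]:
    """Yield (raw_bytes, latin1_name, text_or_None) for each directory entry."""
    for raw in os.listdir(directory):      # bytes path in, bytes names out
        as_latin1 = raw.decode("latin-1")  # cannot fail for any byte sequence
        try:
            text = latin1_to_text(as_latin1)
        except UnicodeDecodeError:
            text = None                    # not UTF-8; left to the application
        yield raw, as_latin1, text
```

With this arrangement a name that is not valid UTF-8 still shows up in the listing, with text set to None, and the raw bytes value can be passed back to calls such as os.remove, which also accept bytes paths. The Latin-1 string is only a lossless carrier for the bytes; it is not meant to be shown to a user as the file's name.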