* Running script from directory with UTF-8 characters
@ 2015-12-21 21:09 Vicente Vera
  2015-12-21 23:19 ` Marko Rauhamaa
  0 siblings, 1 reply; 21+ messages in thread
From: Vicente Vera @ 2015-12-21 21:09 UTC (permalink / raw)
  To: guile-user

Hello. I'm sorry if this is the wrong list (I'm not sure if it's a bug).

I wrote a small test script:

#!/usr/bin/guile -s
!#
;; coding: utf-8
(display "hey")
(newline)

This happens when I try to run it from a directory with UTF-8 characters:

$ cd ~/código/
$ ./test.scm
;;; Stat of /home/me/c??digo/./test.scm failed:
;;; ERROR: In procedure stat: No such file or directory: "/home/me/c??digo/./test.scm"
Backtrace:
In ice-9/boot-9.scm:
 157: 8 [catch #t #<catch-closure 9949e00> ...]
In unknown file:
   ?: 7 [apply-smob/1 #<catch-closure 9949e00>]
In ice-9/boot-9.scm:
  63: 6 [call-with-prompt prompt0 ...]
In ice-9/eval.scm:
 432: 5 [eval # #]
In ice-9/boot-9.scm:
2401: 4 [save-module-excursion #<procedure 9957cc0 at ice-9/boot-9.scm:4045:3 ()>]
4052: 3 [#<procedure 9957cc0 at ice-9/boot-9.scm:4045:3 ()>]
1724: 2 [%start-stack load-stack ...]
1729: 1 [#<procedure 995e738 ()>]
In unknown file:
   ?: 0 [primitive-load "/home/me/c??digo/./test.scm"]

ERROR: In procedure primitive-load:
ERROR: In procedure open-file: No such file or directory: "/home/me/c??digo/./test.scm"

If I remove the UTF-8 character the script works just fine (mv -T ~/código ~/codigo).

My locale is en_US.UTF-8 & Guile version:

$ guile -v
guile (GNU Guile) 2.0.11
Packaged by Debian (2.0.11-deb+1-9)
...

What is happening and how can I solve this?

Thank you!

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Running script from directory with UTF-8 characters 2015-12-21 21:09 Running script from directory with UTF-8 characters Vicente Vera @ 2015-12-21 23:19 ` Marko Rauhamaa 2015-12-22 0:34 ` Chris Vine 2015-12-22 14:32 ` Vicente Vera 0 siblings, 2 replies; 21+ messages in thread From: Marko Rauhamaa @ 2015-12-21 23:19 UTC (permalink / raw) To: Vicente Vera; +Cc: guile-user Vicente Vera <vicentemvp@gmail.com>: > Hello. I'm sorry if this is the wrong list (I'm not sure if it's a > bug). Must be a bug. > I wrote a small test script: The error is reproduced with an empty scm file: touch test.scm guile test.scm [...] ERROR: In procedure open-file: No such file or directory: [...] Marko
* Re: Running script from directory with UTF-8 characters 2015-12-21 23:19 ` Marko Rauhamaa @ 2015-12-22 0:34 ` Chris Vine 2015-12-22 1:14 ` Marko Rauhamaa 2015-12-22 14:32 ` Vicente Vera 1 sibling, 1 reply; 21+ messages in thread From: Chris Vine @ 2015-12-22 0:34 UTC (permalink / raw) To: guile-user On Tue, 22 Dec 2015 01:19:36 +0200 Marko Rauhamaa <marko@pacujo.net> wrote: > Vicente Vera <vicentemvp@gmail.com>: > > > Hello. I'm sorry if this is the wrong list (I'm not sure if it's a > > bug). > > Must be a bug. > > > I wrote a small test script: > > The error is reproduced with an empty scm file: > > touch test.scm > guile test.scm > [...] > ERROR: In procedure open-file: No such file or directory: [...] I think the problem is that calling the native 'primitive-load' procedure on a filename with UTF-8 encoding with a character outside the ASCII range (when the locale encoding is also UTF-8) fails to work unless you call '(setlocale LC_ALL "")' in the program first. Of course you can't do that when passing guile a filename as a program argument. It does seem like a weakness, even if not a bug. Chris
* Re: Running script from directory with UTF-8 characters 2015-12-22 0:34 ` Chris Vine @ 2015-12-22 1:14 ` Marko Rauhamaa 2015-12-22 14:21 ` Chris Vine 0 siblings, 1 reply; 21+ messages in thread From: Marko Rauhamaa @ 2015-12-22 1:14 UTC (permalink / raw) To: Chris Vine; +Cc: guile-user Chris Vine <chris@cvine.freeserve.co.uk>: > I think the problem is that calling the native 'primitive-load' > procedure on a filename with UTF-8 encoding with a character outside > the ASCII range (when the locale encoding is also UTF-8) fails to work > unless you call '(setlocale LC_ALL "")' in the program first. > > Of course you can't do that when passing guile a filename as a program > argument. It does seem like a weakness, even if not a bug. How can it not be a bug? Also, Linux pathnames can contain any bytes other than NUL regardless of the locale (and quite often do) so I hope Guile doesn't paint itself too deep in the Unicode corner. Python is struggling with analogous issues but has been careful to at least make it possible to deal with bytevector pathnames and bytevector standard ports. For example, scheme@(guile-user)> (opendir ".") $1 = #<directory stream f7ffa0> [...] scheme@(guile-user)> (readdir $1) $4 = "?9t\x1b[" scheme@(guile-user)> (open-file $4 "r") ERROR: In procedure open-file: ERROR: In procedure open-file: No such file or directory: "?9t\x1b[" Marko
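For comparison, here is a minimal sketch of the Python behaviour the message alludes to: the same directory can be listed either as bytes or as text, and a name that is not valid UTF-8 survives both ways. It assumes a Linux-style filesystem that accepts arbitrary bytes in names (some platforms reject such names); the file name `caf\xe9.txt` is made up for illustration.

```python
import os
import tempfile

# A file name that is not valid UTF-8: "caf" followed by a Latin-1 e-acute.
raw_name = b"caf\xe9.txt"

with tempfile.TemporaryDirectory() as tmp:
    tmp_bytes = os.fsencode(tmp)

    # Create the file through the bytes API so that no decoding is involved.
    with open(os.path.join(tmp_bytes, raw_name), "wb") as f:
        f.write(b"hello")

    # Bytes in, bytes out: the name comes back exactly as stored on disk.
    names_as_bytes = os.listdir(tmp_bytes)

    # Str in, str out: the undecodable byte is carried along in escaped form.
    names_as_str = os.listdir(tmp)

    # The bytes name can be used directly to reopen the file.
    with open(os.path.join(tmp_bytes, names_as_bytes[0]), "rb") as f:
        content = f.read()
```

The text form is not meant for display, but `os.fsencode` turns it back into the original bytes, so the file stays reachable from either representation.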
* Re: Running script from directory with UTF-8 characters 2015-12-22 1:14 ` Marko Rauhamaa @ 2015-12-22 14:21 ` Chris Vine 2015-12-22 15:55 ` Marko Rauhamaa 0 siblings, 1 reply; 21+ messages in thread From: Chris Vine @ 2015-12-22 14:21 UTC (permalink / raw) To: guile-user On Tue, 22 Dec 2015 03:14:18 +0200 Marko Rauhamaa <marko@pacujo.net> wrote: > Chris Vine <chris@cvine.freeserve.co.uk>: > > > I think the problem is that calling the native 'primitive-load' > > procedure on a filename with UTF-8 encoding with a character outside > > the ASCII range (when the locale encoding is also UTF-8) fails to > > work unless you call '(setlocale LC_ALL "")' in the program first. > > > > Of course you can't do that when passing guile a filename as a > > program argument. It does seem like a weakness, even if not a bug. > > How can it not be a bug? > > Also, Linux pathnames can contain any bytes other than NUL regardless > of the locale (and quite often do) so I hope Guile doesn't paint > itself too deep in the Unicode corner. Python is struggling with > analogous issues but has been careful to at least make it possible to > deal with bytevector pathnames and bytevector standard ports. > > For example, > > scheme@(guile-user)> (opendir ".") > $1 = #<directory stream f7ffa0> > [...] > scheme@(guile-user)> (readdir $1) > $4 = "?9t\x1b[" > scheme@(guile-user)> (open-file $4 "r") > ERROR: In procedure open-file: > ERROR: In procedure open-file: No such file or directory: > "?9t\x1b[" You can set the locale in the REPL, if that is where you are working from (as in your example), and then UTF-8 pathnames will work fine. The problem with this is that although a developer might use the REPL, and therefore maybe assume that that is what everyone else does, the end user probably just wants to run the script by passing guile a file name on the command line. To that extent I agree it is a bug. But the response to the filing of such a bug might be that that is how it is meant to work. 
Chris
* Re: Running script from directory with UTF-8 characters 2015-12-22 14:21 ` Chris Vine @ 2015-12-22 15:55 ` Marko Rauhamaa 2015-12-22 20:12 ` Chris Vine 0 siblings, 1 reply; 21+ messages in thread From: Marko Rauhamaa @ 2015-12-22 15:55 UTC (permalink / raw) To: Chris Vine; +Cc: guile-user Chris Vine <chris@cvine.freeserve.co.uk>: > On Tue, 22 Dec 2015 03:14:18 +0200 > Marko Rauhamaa <marko@pacujo.net> wrote: >> For example, >> >> scheme@(guile-user)> (opendir ".") >> $1 = #<directory stream f7ffa0> >> [...] >> scheme@(guile-user)> (readdir $1) >> $4 = "?9t\x1b[" >> scheme@(guile-user)> (open-file $4 "r") >> ERROR: In procedure open-file: >> ERROR: In procedure open-file: No such file or directory: >> "?9t\x1b[" > > You can set the locale in the REPL, if that is where you are working > from (as in your example), and then UTF-8 pathnames will work fine. You misunderstood me. The problem is that Guile cannot deal with non-UTF-8 pathnames in a UTF-8 locale. IOW, Linux pathnames are *not* strings. They are bytevectors. Guile 1.x (as well as Python 2.x) was fine with bytevector pathnames, but Guile 2.x (as well as Python 3.x) wants to pretend filenames are strings. That leads to trouble, potentially even to security vulnerabilities. A very typical case is a tarball that contains, say, Latin-1 filenames. If you should expand the tarball in a UTF-8 environment, Guile wouldn't be able to deal with the situation. Marko
* Re: Running script from directory with UTF-8 characters 2015-12-22 15:55 ` Marko Rauhamaa @ 2015-12-22 20:12 ` Chris Vine 2015-12-22 20:36 ` Marko Rauhamaa 0 siblings, 1 reply; 21+ messages in thread From: Chris Vine @ 2015-12-22 20:12 UTC (permalink / raw) To: guile-user On Tue, 22 Dec 2015 17:55:58 +0200 Marko Rauhamaa <marko@pacujo.net> wrote: > Chris Vine <chris@cvine.freeserve.co.uk> wrote: > > On Tue, 22 Dec 2015 03:14:18 +0200 > > Marko Rauhamaa <marko@pacujo.net> wrote: > >> For example, > >> > >> scheme@(guile-user)> (opendir ".") > >> $1 = #<directory stream f7ffa0> > >> [...] > >> scheme@(guile-user)> (readdir $1) > >> $4 = "?9t\x1b[" > >> scheme@(guile-user)> (open-file $4 "r") > >> ERROR: In procedure open-file: > >> ERROR: In procedure open-file: No such file or directory: > >> "?9t\x1b[" > > > > You can set the locale in the REPL, if that is where you are working > > from (as in your example), and then UTF-8 pathnames will work > > fine. > > You misunderstood me. The problem is that Guile cannot deal with > non-UTF-8 pathnames in a UTF-8 locale. IOW, Linux pathnames are *not* > strings. They are bytevectors. Guile 1.x (as well as Python 2.x) was > fine with bytevector pathnames, but Guile 2.x (as well as Python 3.x) wants > to pretend filenames are strings. That leads to trouble, potentially > even to security vulnerabilities. > > A very typical case is a tarball that contains, say, Latin-1 > filenames. If you should expand the tarball in a UTF-8 environment, > Guile wouldn't be able to deal with the situation. Yes, you exceeded my powers of deduction (or clairvoyance, depending on how you look at it). More to the point, unix-like pathnames are at the implementation level just a collection of bytes terminated by NUL and with '/' as the directory separator. Having said that, the POSIX Portable Filename Character Set (§3.278 of the SUS) doesn't even cover all of ASCII, let alone Unicode. It can be useful to handle filenames as strings in the program. 
My main objection is not that filenames are not treated as collections of bytes, but that guile assumes the filename character set is the same as the locale character set, which on distributed file systems may be completely false. I may be wrong, but I do not think you can set the filename codeset programmatically in guile, which most other libraries permit. So I guess the best rule is that, even if you don't stick to the Portable Filename Character Set, stick to ASCII for filenames/paths. Chris ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Running script from directory with UTF-8 characters 2015-12-22 20:12 ` Chris Vine @ 2015-12-22 20:36 ` Marko Rauhamaa 2015-12-22 20:59 ` Eli Zaretskii 0 siblings, 1 reply; 21+ messages in thread From: Marko Rauhamaa @ 2015-12-22 20:36 UTC (permalink / raw) To: Chris Vine; +Cc: guile-user Chris Vine <chris@cvine.freeserve.co.uk>: > So I guess the best rule is that, even if you don't stick to the > Portable Filename Character Set, stick to ASCII for filenames/paths. The filenames are not in my control or Guile's. Guile can't simply wish the filenames to be strings, or, like Python, it would at least need some special measures to handle the exceptional cases. Well, of course, when there's a will, there's a way. By setting the character set artificially to Latin-1 in Guile, all pathnames are accessible to it. Marko ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Running script from directory with UTF-8 characters 2015-12-22 20:36 ` Marko Rauhamaa @ 2015-12-22 20:59 ` Eli Zaretskii 2015-12-22 21:39 ` Marko Rauhamaa 0 siblings, 1 reply; 21+ messages in thread From: Eli Zaretskii @ 2015-12-22 20:59 UTC (permalink / raw) To: Marko Rauhamaa; +Cc: guile-user > From: Marko Rauhamaa <marko@pacujo.net> > Date: Tue, 22 Dec 2015 22:36:07 +0200 > Cc: guile-user@gnu.org > > By setting the character set artificially to Latin-1 in Guile, all > pathnames are accessible to it. No, they aren't, not as file names. E.g., you cannot meaningfully downcase or upcase such "characters", you cannot count characters (as opposed to bytes), you cannot calculate how much screen estate will be needed to display them, with some Far Eastern encodings you cannot correctly search them for some specific ASCII characters (because they can be part of a multibyte sequence), etc. etc. IOW, you cannot work with file names as human-readable text, which is something many programs need to do. File names _are_ strings, there's no way around that. They are strings because _people_ name files and give them meaningful names and extensions. If Guile cannot easily work with file names encoded in a codeset other than the current locale's one, then Guile should be extended to allow a program to tell it in which encoding to interpret a particular name. (I think Guile already supports that, but maybe I misremember.) But lobbying for treating file names as byte streams, let alone Latin-1 characters, is a large step backwards, to 1990s when we didn't know better. We've come a long way since then and learned a lot on the way. ^ permalink raw reply [flat|nested] 21+ messages in thread
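The remark about Far Eastern encodings can be made concrete with a small sketch, here in Python. In Shift_JIS the katakana "ソ" occupies two bytes, the second of which has the same value as an ASCII backslash, so a byte-level search finds a character that is not there. The file name is invented for illustration.

```python
# "ソ" (U+30BD) encodes in Shift_JIS as 0x83 0x5C; 0x5C is also ASCII "\".
name = "ソ.txt"
raw = name.encode("shift_jis")

# A byte-level search reports a backslash that is not present as a character.
found_in_bytes = b"\\" in raw
found_in_text = "\\" in raw.decode("shift_jis")

# Counting bytes and counting characters disagree as well.
byte_len, char_len = len(raw), len(name)
```

Only after decoding with the right codec do searching and counting operate on what a person would call the characters of the name.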
* Re: Running script from directory with UTF-8 characters 2015-12-22 20:59 ` Eli Zaretskii @ 2015-12-22 21:39 ` Marko Rauhamaa 2015-12-23 18:28 ` Eli Zaretskii 2015-12-24 16:13 ` Barry Schwartz 0 siblings, 2 replies; 21+ messages in thread From: Marko Rauhamaa @ 2015-12-22 21:39 UTC (permalink / raw) To: Eli Zaretskii; +Cc: guile-user Eli Zaretskii <eliz@gnu.org>: >> From: Marko Rauhamaa <marko@pacujo.net> >> By setting the character set artificially to Latin-1 in Guile, all >> pathnames are accessible to it. > > No, they aren't, not as file names. E.g., you cannot meaningfully > downcase or upcase such "characters", you cannot count characters (as > opposed to bytes), you cannot calculate how much screen estate will be > needed to display them, with some Far Eastern encodings you cannot > correctly search them for some specific ASCII characters (because they > can be part of a multibyte sequence), etc. etc. IOW, you cannot work > with file names as human-readable text, which is something many > programs need to do. You can, in a roundabout way. You do the low-level file I/O in Latin-1. Then, you reencode into UTF-8, and if you get an exception, you deal with the situation. Otherwise, you may not even be able to remove a file with a non-UTF-8 name. > File names _are_ strings, there's no way around that. Linux pathnames are classic C strings. > They are strings because _people_ name files and give them meaningful > names and extensions. The Linux kernel just doesn't care, and shouldn't. It's acceptable for Guile to create a higher-level illusion, but it shouldn't sacrifice completeness while doing so. You should be able to manipulate every conceivable filename from Guile code. (Python 3.x accepts bytevectors as well as strings everywhere. For example, listing a directory returns strings if the directory name is given as a string. It returns bytevectors if the directory name is given as a bytevector. 
Python's bytevector literals accept ASCII, which makes this rather convenient.) > If Guile cannot easily work with file names encoded in a codeset other > than the current locale's one, then Guile should be extended to allow > a program to tell it in which encoding to interpret a particular name. A program usually has no clue how a pathname has been encoded. > (I think Guile already supports that, but maybe I misremember.) But > lobbying for treating file names as byte streams, let alone Latin-1 > characters, is a large step backwards, to 1990s when we didn't know > better. We've come a long way since then and learned a lot on the way. At least our backwardness allowed Linux to jump directly to UTF-8 and not be afflicted by UCS-2 like Windows and Java. I'm not saying bytevectors are elegant, but we should not replace them with wishful thinking. Ideally, we should have a bijective bytevector-to-string mapping. (Python 3.x uses Unicode surrogate code points for that purpose but doesn't quite achieve bijection, unfortunately.) I'm a bit sorry that Guile repeated Python 3's mistake and brought (Unicode) strings to the center. Strings are a highly special-purpose data structure; I really never had a real need for them in my decades of programming. Also, I suspect strings are much too simplistic for any serious typesetting or GUI work. It seems the sweet spot of strings are text/plain mail messages and Usenet postings. Guile 1.x's and Python 2.x's bytevector/string confusion was actually a very happy medium. Neither the OS nor the programming language placed any interpretation to the byte sequences. That was left to the application. Marko ^ permalink raw reply [flat|nested] 21+ messages in thread
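The "roundabout way" described above can be sketched as follows, in Python for illustration. The helper name `classify_name` is hypothetical; the idea is only that a Latin-1 decode never fails and loses nothing, so it can serve as the lossless handle while UTF-8 is attempted for display.

```python
def classify_name(raw: bytes):
    """Hypothetical helper: return (text, is_utf8) for a raw file name."""
    # Latin-1 maps every byte 0x00-0xFF to the code point of the same value,
    # so this decode cannot fail and the original bytes are recoverable.
    as_latin1 = raw.decode("latin-1")
    assert as_latin1.encode("latin-1") == raw

    try:
        # Reinterpret the same bytes as UTF-8 to get human-readable text.
        return raw.decode("utf-8"), True
    except UnicodeDecodeError:
        # Not UTF-8: keep the lossless Latin-1 view so the file can still
        # be opened, renamed or removed by the application.
        return as_latin1, False


utf8_result = classify_name("código".encode("utf-8"))
other_result = classify_name(b"c\xf3digo")
```

The second return value tells the application whether the text is trustworthy as text or is merely a byte-for-byte stand-in.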
* Re: Running script from directory with UTF-8 characters 2015-12-22 21:39 ` Marko Rauhamaa @ 2015-12-23 18:28 ` Eli Zaretskii 2015-12-23 19:18 ` Marko Rauhamaa 2015-12-24 16:13 ` Barry Schwartz 1 sibling, 1 reply; 21+ messages in thread From: Eli Zaretskii @ 2015-12-23 18:28 UTC (permalink / raw) To: Marko Rauhamaa; +Cc: guile-user > From: Marko Rauhamaa <marko@pacujo.net> > Date: Tue, 22 Dec 2015 23:39:28 +0200 > Cc: guile-user@gnu.org > > > No, they aren't, not as file names. E.g., you cannot meaningfully > > downcase or upcase such "characters", you cannot count characters (as > > opposed to bytes), you cannot calculate how much screen estate will be > > needed to display them, with some Far Eastern encodings you cannot > > correctly search them for some specific ASCII characters (because they > > can be part of a multibyte sequence), etc. etc. IOW, you cannot work > > with file names as human-readable text, which is something many > > programs need to do. > > You can, in a roundabout way. You do the low-level file I/O in Latin-1. > Then, you reencode into UTF-8 IOW, from the application-level perspective, the file names are encoded in UTF-8 (in this example). The low-level reading as byte stream (NOT Latin-1!) is out of scope as long as you consider a Guile Scheme program that needs to manipulate the file names. So we are in violent agreement. > Otherwise, you may not even be able to remove a file with a non-UTF-8 > name. What do you mean by a non-UTF-8 file name? A file name that includes byte sequences that are not valid UTF-8? For that, Guile needs to acquire a capability of representing raw bytes, similar to what Emacs does. This capability is an add-on; it should not replace the ability to interpret file names as character strings encoded in some recognizable encoding, either forced by the application or deduced from some meta-data, user preferences, locale's defaults, etc. 
> > They are strings because _people_ name files and give them meaningful > > names and extensions. > > The Linux kernel just doesn't care, and shouldn't. Guile is not an OS kernel. Guile is an environment for writing applications. On the application level, you _should_ care, or else you won't be able to manipulate file names in meaningful ways. > It's acceptable for Guile to create a higher-level illusion, but it > shouldn't sacrifice completeness while doing so. You should be able to > manipulate every conceivable filename from Guile code. We are again in violent agreement about the goal. But the means towards that goal is NOT to abandon interpretation of file names as strings of characters, the means is to be able to represent raw bytes on top of a meaningful character representation. > > If Guile cannot easily work with file names encoded in a codeset other > > than the current locale's one, then Guile should be extended to allow > > a program to tell it in which encoding to interpret a particular name. > > A program usually has no clue how a pathname has been encoded. The programmer does, or should. The user does, sometimes (e.g., the capability presented in many browsers and editors to force text encoding). Some encodings can be deduced by analyzing the bit stream. And there are locale defaults if nothing else works. If none of that is done, the program cannot manipulate these file names in any meaningful way. The kernel can duck that problem because it's the kernel: it doesn't interact with users, and its filesystem layer is not required to understand the meaning of, say, the file-name extensions. We have no such luxury on the application level. So we cannot simply copycat the kernel techniques into Guile, it won't work. It also won't work to expect applications to do that, as that is too complex and subtle (and tedious) for applications to get it right every time. Once again, I suggest studying how Emacs solves this very problem. 
The solution used there is satisfactory, and fits all of your requirements above. It's not without some subtleties in rare cases, but the problem is complex and there's no way around that complexity. > > (I think Guile already supports that, but maybe I misremember.) But > > lobbying for treating file names as byte streams, let alone Latin-1 > > characters, is a large step backwards, to 1990s when we didn't know > > better. We've come a long way since then and learned a lot on the way. > > At least our backwardness allowed Linux to jump directly to UTF-8 and > not be afflicted by UCS-2 like Windows and Java. Once again, Guile is not an OS kernel. It cannot simply adopt kernel solutions. > I'm not saying bytevectors are elegant, but we should not replace them > with wishful thinking. No need for wishful thinking. Study what Emacs does and do something similar. > I'm a bit sorry that Guile repeated Python 3's mistake and brought > (Unicode) strings to the center. Everybody makes that mistake. Emacs did it as well, but that was years ago, and since then the mistakes were identified and corrected. The basis must be Unicode, the trick is to build additions on top of that which allow raw bytes and Unicode text strings to coexist, more or less transparently to the application level. ("More or less" because handling raw bytes as part of strings requires some care; fortunately, such use cases are rare.) > Strings are a highly special-purpose data structure; I really never > had a real need for them in my decades of programming. Also, I > suspect strings are much too simplistic for any serious typesetting > or GUI work. It seems the sweet spot of strings are text/plain mail > messages and Usenet postings. My experience indicates otherwise (in particular, processing and displaying plain text strings is what the Unicode Standard is all about), but I think that issue is tangential to this discussion. 
> Guile 1.x's and Python 2.x's bytevector/string confusion was actually > a very happy medium. Neither the OS nor the programming language placed > any interpretation to the byte sequences. That was left to the > application. And that is wrong. Applications cannot handle that, they need some heavy help from the infrastructure. Applications actually love to have normal human-readable text strings, after the infrastructure decoded the byte stream into characters for them. Most file names are encoded in locale's codeset (otherwise file browsers and other interactive programs that accept and display file names won't be able to handle them), so at least this popular and very important use case should "just work" without requiring each application to reinvent the wheel of decoding byte sequences into characters, dealing with EILSEQ, etc. An environment that doesn't provide at least that won't fly. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Running script from directory with UTF-8 characters 2015-12-23 18:28 ` Eli Zaretskii @ 2015-12-23 19:18 ` Marko Rauhamaa 2015-12-23 19:33 ` Eli Zaretskii 0 siblings, 1 reply; 21+ messages in thread From: Marko Rauhamaa @ 2015-12-23 19:18 UTC (permalink / raw) To: Eli Zaretskii; +Cc: guile-user Eli Zaretskii <eliz@gnu.org>: >> From: Marko Rauhamaa <marko@pacujo.net> >> The Linux kernel just doesn't care, and shouldn't. > > Guile is not an OS kernel. Guile is an environment for writing > applications. On the application level, you _should_ care, or else you > won't be able to manipulate file names in meaningful ways. To me, a programming language is a medium of writing programs for an operating system. I don't think a programming language should "shield" me from the OS. Instead, it should make the whole gamut of the OS facilities available to me. >> I'm not saying bytevectors are elegant, but we should not replace >> them with wishful thinking. > > No need for wishful thinking. Study what Emacs does and do something > similar. Why don't you tell me already what emacs does? >> Guile 1.x's and Python 2.x's bytevector/string confusion was actually >> a very happy medium. Neither the OS nor the programming language >> placed any interpretation to the byte sequences. That was left to the >> application. > > And that is wrong. Applications cannot handle that, they need some > heavy help from the infrastructure. That can be managed through support libraries. Marko ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Running script from directory with UTF-8 characters 2015-12-23 19:18 ` Marko Rauhamaa @ 2015-12-23 19:33 ` Eli Zaretskii 2015-12-23 21:15 ` Marko Rauhamaa 2015-12-23 21:53 ` David Kastrup 0 siblings, 2 replies; 21+ messages in thread From: Eli Zaretskii @ 2015-12-23 19:33 UTC (permalink / raw) To: Marko Rauhamaa; +Cc: guile-user > From: Marko Rauhamaa <marko@pacujo.net> > Cc: guile-user@gnu.org > Date: Wed, 23 Dec 2015 21:18:28 +0200 > > Eli Zaretskii <eliz@gnu.org>: > > >> From: Marko Rauhamaa <marko@pacujo.net> > >> The Linux kernel just doesn't care, and shouldn't. > > > > Guile is not an OS kernel. Guile is an environment for writing > > applications. On the application level, you _should_ care, or else you > > won't be able to manipulate file names in meaningful ways. > > To me, a programming language is a medium of writing programs for an > operating system. I don't think a programming language should "shield" > me from the OS. Instead, it should make the whole gamut of the OS > facilities available to me. I see no contradiction here, as long as you acknowledge that Guile should be good for more than just OS level stuff. > >> I'm not saying bytevectors are elegant, but we should not replace > >> them with wishful thinking. > > > > No need for wishful thinking. Study what Emacs does and do something > > similar. > > Why don't you tell me already what emacs does? I did, you elided that. It represents text as superset of UTF-8, and uses high codepoints above the Unicode space for raw bytes. > >> Guile 1.x's and Python 2.x's bytevector/string confusion was actually > >> a very happy medium. Neither the OS nor the programming language > >> placed any interpretation to the byte sequences. That was left to the > >> application. > > > > And that is wrong. Applications cannot handle that, they need some > > heavy help from the infrastructure. > > That can be managed through support libraries. Guile is one huge support library, so it should include that built-in. 
* Re: Running script from directory with UTF-8 characters 2015-12-23 19:33 ` Eli Zaretskii @ 2015-12-23 21:15 ` Marko Rauhamaa 2015-12-23 21:53 ` David Kastrup 1 sibling, 0 replies; 21+ messages in thread From: Marko Rauhamaa @ 2015-12-23 21:15 UTC (permalink / raw) To: Eli Zaretskii; +Cc: guile-user Eli Zaretskii <eliz@gnu.org>: >> From: Marko Rauhamaa <marko@pacujo.net> >> Why don't you tell me already what emacs does? > > I did, you elided that. It represents text as superset of UTF-8, and > uses high codepoints above the Unicode space for raw bytes. Excellent. If that works, Guile needs the same thing. (I'm afraid, though, the approach is not without its problems as the concatenation of two raw bytes might yield a valid UTF-8 encoding of a single character. I don't know if full bijectivity can be achieved.) >> That can be managed through support libraries. > > Guile is one huge support library, so it should include that built-in. As long as Guile can manage anything Linux throws at it, I'm fine with it. As it stands, a couple of chinks in the armor have been identified. Marko ^ permalink raw reply [flat|nested] 21+ messages in thread
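The bijectivity worry can be demonstrated with Python's `surrogateescape` error handler, the mechanism mentioned earlier in the thread; whether the same holds for the Emacs representation is not shown here. In this sketch the helper names `to_text` and `to_bytes` are hypothetical.

```python
def to_text(raw: bytes) -> str:
    # Undecodable bytes 0x80-0xFF become lone surrogates U+DC80-U+DCFF.
    return raw.decode("utf-8", errors="surrogateescape")


def to_bytes(text: str) -> bytes:
    # Lone surrogates U+DC80-U+DCFF are turned back into single bytes.
    return text.encode("utf-8", errors="surrogateescape")


# bytes -> str -> bytes is lossless, even for input that is not UTF-8.
raw = b"caf\xe9"
roundtrip_ok = to_bytes(to_text(raw)) == raw

# str -> bytes -> str is not: two escaped bytes that together form a valid
# UTF-8 sequence come back as one ordinary character.
escaped_pair = "\udcc3\udca9"
collapsed = to_text(to_bytes(escaped_pair))
```

So every byte sequence has a string, but not every string of escaped bytes survives a trip through bytes, which is the concatenation case raised in the message.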
* Re: Running script from directory with UTF-8 characters 2015-12-23 19:33 ` Eli Zaretskii 2015-12-23 21:15 ` Marko Rauhamaa @ 2015-12-23 21:53 ` David Kastrup 2015-12-23 22:20 ` Marko Rauhamaa 1 sibling, 1 reply; 21+ messages in thread From: David Kastrup @ 2015-12-23 21:53 UTC (permalink / raw) To: guile-user Eli Zaretskii <eliz@gnu.org> writes: > From: Marko Rauhamaa <marko@pacujo.net> > >> Why don't you tell me already what emacs does? > > I did, you elided that. It represents text as superset of UTF-8, and > uses high codepoints above the Unicode space for raw bytes. Incorrect. It uses overlong encodings of 0x00-0x7f for raw bytes in the 0x80-0xff range (0x00-0x7f are always represented as themselves). Those are not allowed in properly encoded UTF-8 and take only two bytes (byte patterns 0xc0 0x80–0xbf and 0xc1 0x80–0xbf), so random byte patterns get inflated by somewhat less than 50% on average (every pattern allowed in properly encoded UTF-8 is left unchanged, of course). That's more economical than Python's method which uses the encodings of surrogate words not allowed in properly encoded UTF-8, taking 3 bytes rather than the 2 Emacs makes do with. Using high codepoints above the Unicode space would even take 4 bytes. -- David Kastrup ^ permalink raw reply [flat|nested] 21+ messages in thread
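The byte patterns described in this message can be written out as a short sketch, here in Python. This follows only the description given above (raw bytes 0x80-0xFF stored as the two-byte sequences 0xC0/0xC1 followed by 0x80-0xBF); it is not taken from the Emacs sources, and the helper names are hypothetical.

```python
def encode_raw_byte(b: int) -> bytes:
    """Hypothetical helper: two-byte escaped form of a raw byte 0x80-0xFF."""
    if not 0x80 <= b <= 0xFF:
        raise ValueError("only bytes 0x80-0xFF need escaping")
    # Overlong two-byte encoding of (b - 0x80): the lead byte is 0xC0 or
    # 0xC1, the continuation byte is in 0x80-0xBF.
    return bytes([0xC0 | ((b >> 6) & 0x01), 0x80 | (b & 0x3F)])


def decode_raw_byte(pair: bytes) -> int:
    """Inverse of encode_raw_byte."""
    lead, cont = pair
    if lead not in (0xC0, 0xC1) or not 0x80 <= cont <= 0xBF:
        raise ValueError("not an escaped raw byte")
    return 0x80 | ((lead & 0x01) << 6) | (cont & 0x3F)


# For comparison: a lone surrogate such as U+DC80 takes three bytes when
# written out in UTF-8 form.
surrogate_len = len("\udc80".encode("utf-8", errors="surrogatepass"))
```

Because 0xC0 and 0xC1 never start a valid UTF-8 sequence, a strict decoder rejects the escaped form, so it cannot be confused with properly encoded text.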
* Re: Running script from directory with UTF-8 characters 2015-12-23 21:53 ` David Kastrup @ 2015-12-23 22:20 ` Marko Rauhamaa 2015-12-23 22:25 ` David Kastrup 0 siblings, 1 reply; 21+ messages in thread From: Marko Rauhamaa @ 2015-12-23 22:20 UTC (permalink / raw) To: David Kastrup; +Cc: guile-user David Kastrup <dak@gnu.org>: > That's more economical than Python's method which uses the encodings > of surrogate words not allowed in properly encoded UTF-8, taking > 3 bytes rather than the 2 Emacs makes do with. Using high codepoints > above the Unicode space would even take 4 bytes. Actually, CPython represents strings internally even less "economically:" it uses single-byte strings if it can (Latin-1). If it can't, it uses all-two-byte strings (UCS-2). If it can't do even that, it uses all-four-byte strings (UCS-4). Thus, even a single code point above 65535 will cause the whole string to consist of 4-byte integers. Marko ^ permalink raw reply [flat|nested] 21+ messages in thread
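The storage behaviour described here can be observed with `sys.getsizeof`. This is a sketch that assumes CPython 3.3 or later; other Python implementations may store strings differently, and exact sizes vary by version, so only the relative growth is meaningful.

```python
import sys

n = 1000
one_byte = "a" * n                          # every code point fits in 1 byte
two_byte = "a" * (n - 1) + "\u20ac"         # one code point needs 2 bytes
four_byte = "a" * (n - 1) + "\U0001f600"    # one code point above 65535

# Memory used by each string object, in bytes.
sizes = [sys.getsizeof(s) for s in (one_byte, two_byte, four_byte)]
```

All three strings have the same length in characters, yet a single wide code point changes the per-character width of the whole string.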
* Re: Running script from directory with UTF-8 characters 2015-12-23 22:20 ` Marko Rauhamaa @ 2015-12-23 22:25 ` David Kastrup 0 siblings, 0 replies; 21+ messages in thread From: David Kastrup @ 2015-12-23 22:25 UTC (permalink / raw) To: guile-user Marko Rauhamaa <marko@pacujo.net> writes: > David Kastrup <dak@gnu.org>: > >> That's more economical than Python's method which uses the encodings >> of surrogate words not allowed in properly encoded UTF-8, taking >> 3 bytes rather than the 2 Emacs makes do with. Using high codepoints >> above the Unicode space would even take 4 bytes. > > Actually, CPython represents strings internally even less > "economically:" it uses single-byte strings if it can (Latin-1). If it > can't, it uses all-two-byte strings (UCS-2). If it can't do even that, > it uses all-four-byte strings (UCS-4). Thus, even a single code point > above 65535 will cause the whole string to consist of 4-byte integers. Maybe I confused Python and Perl here. No idea. But I'm pretty sure about Emacs. -- David Kastrup ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Running script from directory with UTF-8 characters
  2015-12-22 21:39 ` Marko Rauhamaa
  2015-12-23 18:28 ` Eli Zaretskii
@ 2015-12-24 16:13 ` Barry Schwartz
From: Barry Schwartz @ 2015-12-24 16:13 UTC
To: Marko Rauhamaa; +Cc: guile-user

Marko Rauhamaa <marko@pacujo.net> wrote:

> I'm a bit sorry that Guile repeated Python 3's mistake and brought
> (Unicode) strings to the center. Strings are a highly special-purpose
> data structure; I really never had a real need for them in my decades of
> programming. Also, I suspect strings are much too simplistic for any
> serious typesetting or GUI work.

This is an understatement.
* Re: Running script from directory with UTF-8 characters
  2015-12-21 23:19 ` Marko Rauhamaa
  2015-12-22 0:34 ` Chris Vine
@ 2015-12-22 14:32 ` Vicente Vera
  2015-12-22 15:56 ` Marko Rauhamaa
From: Vicente Vera @ 2015-12-22 14:32 UTC
To: Marko Rauhamaa; +Cc: guile-user

Should this be sent to the bugs list?

2015-12-21 20:19 GMT-03:00 Marko Rauhamaa <marko@pacujo.net>:
> Vicente Vera <vicentemvp@gmail.com>:
>
>> Hello. I'm sorry if this is the wrong list (I'm not sure if its a
>> bug).
>
> Must be a bug.
>
>> I wrote a small test script:
>
> The error is reproduced with an empty scm file:
>
>    touch test.scm
>    guile test.scm
>    [...]
>    ERROR: In procedure open-file: No such file or directory: [...]
>
>
> Marko
* Re: Running script from directory with UTF-8 characters
  2015-12-22 14:32 ` Vicente Vera
@ 2015-12-22 15:56 ` Marko Rauhamaa
  2015-12-26 1:57 ` Vicente Vera
From: Marko Rauhamaa @ 2015-12-22 15:56 UTC
To: Vicente Vera; +Cc: guile-user

Vicente Vera <vicentemvp@gmail.com>:

> Should this be sent to the bugs list?

Go ahead.


Marko
* Re: Running script from directory with UTF-8 characters
  2015-12-22 15:56 ` Marko Rauhamaa
@ 2015-12-26 1:57 ` Vicente Vera
From: Vicente Vera @ 2015-12-26 1:57 UTC
To: Marko Rauhamaa; +Cc: guile-user

Reported as bug #22229:
https://debbugs.gnu.org/cgi/bugreport.cgi?bug=22229

2015-12-22 12:56 GMT-03:00 Marko Rauhamaa <marko@pacujo.net>:
> Vicente Vera <vicentemvp@gmail.com>:
>
>> Should this be sent to the bugs list?
>
> Go ahead.
>
>
> Marko