Running script from directory with UTF-8 characters

unofficial mirror of guile-user@gnu.org 

* Running script from directory with UTF-8 characters
@ 2015-12-21 21:09 Vicente Vera
  2015-12-21 23:19 ` Marko Rauhamaa
  0 siblings, 1 reply; 21+ messages in thread
From: Vicente Vera @ 2015-12-21 21:09 UTC (permalink / raw)
  To: guile-user

Hello. I'm sorry if this is the wrong list (I'm not sure if it's a bug).

I wrote a small test script:

#!/usr/bin/guile -s
!#
;; coding: utf-8
(display "hey")
(newline)

This happens when I try to run it from a directory with UTF-8 characters:

$ cd ~/código/
$ ./test.scm
;;; Stat of /home/me/c??digo/./test.scm failed:
;;; ERROR: In procedure stat: No such file or directory:
"/home/me/c??digo/./test.scm"
Backtrace:
In ice-9/boot-9.scm:
 157: 8 [catch #t #<catch-closure 9949e00> ...]
In unknown file:
   ?: 7 [apply-smob/1 #<catch-closure 9949e00>]
In ice-9/boot-9.scm:
  63: 6 [call-with-prompt prompt0 ...]
In ice-9/eval.scm:
 432: 5 [eval # #]
In ice-9/boot-9.scm:
2401: 4 [save-module-excursion #<procedure 9957cc0 at
ice-9/boot-9.scm:4045:3 ()>]
4052: 3 [#<procedure 9957cc0 at ice-9/boot-9.scm:4045:3 ()>]
1724: 2 [%start-stack load-stack ...]
1729: 1 [#<procedure 995e738 ()>]
In unknown file:
   ?: 0 [primitive-load "/home/me/c??digo/./test.scm"]

ERROR: In procedure primitive-load:
ERROR: In procedure open-file: No such file or directory:
"/home/me/c??digo/./test.scm"

If I remove the UTF-8 character the script works just fine (mv -T
~/código ~/codigo).

My locale is en_US.UTF-8 & Guile version:
$ guile -v
guile (GNU Guile) 2.0.11
Packaged by Debian (2.0.11-deb+1-9)
...

What is happening, and how can I solve this? Thank you!



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Running script from directory with UTF-8 characters
  2015-12-21 21:09 Running script from directory with UTF-8 characters Vicente Vera
@ 2015-12-21 23:19 ` Marko Rauhamaa
  2015-12-22  0:34   ` Chris Vine
  2015-12-22 14:32   ` Vicente Vera
  0 siblings, 2 replies; 21+ messages in thread
From: Marko Rauhamaa @ 2015-12-21 23:19 UTC (permalink / raw)
  To: Vicente Vera; +Cc: guile-user

Vicente Vera <vicentemvp@gmail.com>:

> Hello. I'm sorry if this is the wrong list (I'm not sure if it's a
> bug).

Must be a bug.

> I wrote a small test script:

The error is reproduced with an empty scm file:

   touch test.scm
   guile test.scm
   [...]
   ERROR: In procedure open-file: No such file or directory: [...]


Marko




* Re: Running script from directory with UTF-8 characters
  2015-12-21 23:19 ` Marko Rauhamaa
@ 2015-12-22  0:34   ` Chris Vine
  2015-12-22  1:14     ` Marko Rauhamaa
  2015-12-22 14:32   ` Vicente Vera
  1 sibling, 1 reply; 21+ messages in thread
From: Chris Vine @ 2015-12-22  0:34 UTC (permalink / raw)
  To: guile-user

On Tue, 22 Dec 2015 01:19:36 +0200
Marko Rauhamaa <marko@pacujo.net> wrote:

> Vicente Vera <vicentemvp@gmail.com>:
> 
> > Hello. I'm sorry if this is the wrong list (I'm not sure if it's a
> > bug).  
> 
> Must be a bug.
> 
> > I wrote a small test script:  
> 
> The error is reproduced with an empty scm file:
> 
>    touch test.scm
>    guile test.scm
>    [...]
>    ERROR: In procedure open-file: No such file or directory: [...]

I think the problem is that calling the native 'primitive-load'
procedure on a filename with UTF-8 encoding with a character outside
the ASCII range (when the locale encoding is also UTF-8) fails to work
unless you call '(setlocale LC_ALL "")' in the program first.

Of course you can't do that when passing guile a filename as a program
argument.  It does seem like a weakness, even if not a bug.
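
As a rough illustration of the symptom only (Guile's actual conversion
path is an assumption here, and Python stands in for it): if the bytes
of a UTF-8 pathname are decoded as plain ASCII, as in the default "C"
locale, each byte of the two-byte sequence for "ó" becomes its own
placeholder, which would match the "c??digo" in the error message.

```python
# Sketch only: mimics decoding a UTF-8 pathname under an ASCII-only locale.
path = "/home/me/código/./test.scm"
raw = path.encode("utf-8")            # the bytes the kernel actually stores

# "ó" (U+00F3) occupies two bytes in UTF-8: 0xC3 0xB3.
assert "ó".encode("utf-8") == b"\xc3\xb3"

# Decoding as ASCII with replacement turns each non-ASCII byte into U+FFFD;
# showing U+FFFD as "?" is a choice made here for display.
mangled = raw.decode("ascii", errors="replace").replace("\ufffd", "?")
print(mangled)
```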

Chris


* Re: Running script from directory with UTF-8 characters
  2015-12-22  0:34   ` Chris Vine
@ 2015-12-22  1:14     ` Marko Rauhamaa
  2015-12-22 14:21       ` Chris Vine
  0 siblings, 1 reply; 21+ messages in thread
From: Marko Rauhamaa @ 2015-12-22  1:14 UTC (permalink / raw)
  To: Chris Vine; +Cc: guile-user

Chris Vine <chris@cvine.freeserve.co.uk>:

> I think the problem is that calling the native 'primitive-load'
> procedure on a filename with UTF-8 encoding with a character outside
> the ASCII range (when the locale encoding is also UTF-8) fails to work
> unless you call '(setlocale LC_ALL "")' in the program first.
>
> Of course you can't do that when passing guile a filename as a program
> argument. It does seem like a weakness, even if not a bug.

How can it not be a bug?

Also, Linux pathnames can contain any bytes other than NUL regardless of
the locale (and quite often do) so I hope Guile doesn't paint itself too
deep in the Unicode corner. Python is struggling with analogous issues
but has been careful to at least make it possible to deal with
bytevector pathnames and bytevector standard ports.

For example,

    scheme@(guile-user)> (opendir ".")
    $1 = #<directory stream f7ffa0>
    [...]
    scheme@(guile-user)> (readdir $1)
    $4 = "?9t\x1b["
    scheme@(guile-user)> (open-file $4 "r")
    ERROR: In procedure open-file:
    ERROR: In procedure open-file: No such file or directory: "?9t\x1b["


Marko




* Re: Running script from directory with UTF-8 characters
  2015-12-22  1:14     ` Marko Rauhamaa
@ 2015-12-22 14:21       ` Chris Vine
  2015-12-22 15:55         ` Marko Rauhamaa
  0 siblings, 1 reply; 21+ messages in thread
From: Chris Vine @ 2015-12-22 14:21 UTC (permalink / raw)
  To: guile-user

On Tue, 22 Dec 2015 03:14:18 +0200
Marko Rauhamaa <marko@pacujo.net> wrote:
> Chris Vine <chris@cvine.freeserve.co.uk>:
> 
> > I think the problem is that calling the native 'primitive-load'
> > procedure on a filename with UTF-8 encoding with a character outside
> > the ASCII range (when the locale encoding is also UTF-8) fails to
> > work unless you call '(setlocale LC_ALL "")' in the program first.
> >
> > Of course you can't do that when passing guile a filename as a
> > program argument. It does seem like a weakness, even if not a bug.  
> 
> How can it not be a bug?
> 
> Also, Linux pathnames can contain any bytes other than NUL regardless
> of the locale (and quite often do) so I hope Guile doesn't paint
> itself too deep in the Unicode corner. Python is struggling with
> analogous issues but has been careful to at least make it possible to
> deal with bytevector pathnames and bytevector standard ports.
> 
> For example,
> 
>     scheme@(guile-user)> (opendir ".")
>     $1 = #<directory stream f7ffa0>
>     [...]
>     scheme@(guile-user)> (readdir $1)
>     $4 = "?9t\x1b["
>     scheme@(guile-user)> (open-file $4 "r")
>     ERROR: In procedure open-file:
>     ERROR: In procedure open-file: No such file or directory:
> "?9t\x1b["

You can set the locale in the REPL, if that is where you are working
from (as in your example), and then UTF-8 pathnames will work fine.

The problem with this is that although a developer might use the REPL,
and therefore maybe assume that that is what everyone else does, the end
user probably just wants to run the script by passing guile a file name
on the command line.  To that extent I agree it is a bug.  But the
response to the filing of such a bug might be that that is how it is
meant to work.

Chris




* Re: Running script from directory with UTF-8 characters
  2015-12-21 23:19 ` Marko Rauhamaa
  2015-12-22  0:34   ` Chris Vine
@ 2015-12-22 14:32   ` Vicente Vera
  2015-12-22 15:56     ` Marko Rauhamaa
  1 sibling, 1 reply; 21+ messages in thread
From: Vicente Vera @ 2015-12-22 14:32 UTC (permalink / raw)
  To: Marko Rauhamaa; +Cc: guile-user

Should this be sent to the bugs list?

2015-12-21 20:19 GMT-03:00 Marko Rauhamaa <marko@pacujo.net>:
> Vicente Vera <vicentemvp@gmail.com>:
>
>> Hello. I'm sorry if this is the wrong list (I'm not sure if it's a
>> bug).
>
> Must be a bug.
>
>> I wrote a small test script:
>
> The error is reproduced with an empty scm file:
>
>    touch test.scm
>    guile test.scm
>    [...]
>    ERROR: In procedure open-file: No such file or directory: [...]
>
>
> Marko




* Re: Running script from directory with UTF-8 characters
  2015-12-22 14:21       ` Chris Vine
@ 2015-12-22 15:55         ` Marko Rauhamaa
  2015-12-22 20:12           ` Chris Vine
  0 siblings, 1 reply; 21+ messages in thread
From: Marko Rauhamaa @ 2015-12-22 15:55 UTC (permalink / raw)
  To: Chris Vine; +Cc: guile-user

Chris Vine <chris@cvine.freeserve.co.uk>:

> On Tue, 22 Dec 2015 03:14:18 +0200
> Marko Rauhamaa <marko@pacujo.net> wrote:
>> For example,
>> 
>>     scheme@(guile-user)> (opendir ".")
>>     $1 = #<directory stream f7ffa0>
>>     [...]
>>     scheme@(guile-user)> (readdir $1)
>>     $4 = "?9t\x1b["
>>     scheme@(guile-user)> (open-file $4 "r")
>>     ERROR: In procedure open-file:
>>     ERROR: In procedure open-file: No such file or directory:
>> "?9t\x1b["
>
> You can set the locale in the REPL, if that is where you are working
> from (as in your example), and then UTF-8 pathnames will work fine.

You misunderstood me. The problem is that Guile cannot deal with
non-UTF-8 pathnames in a UTF-8 locale. IOW, Linux pathnames are *not*
strings. They are bytevectors. Guile 1.x (as well as Python 2.x) was
fine with bytevector pathnames, but Guile 2.x (as well as Python 3.x) wants
to pretend filenames are strings. That leads to trouble, potentially
even to security vulnerabilities.

A very typical case is a tarball that contains, say, Latin-1 filenames.
If you should expand the tarball in a UTF-8 environment, Guile wouldn't
be able to deal with the situation.

Marko


* Re: Running script from directory with UTF-8 characters
  2015-12-22 14:32   ` Vicente Vera
@ 2015-12-22 15:56     ` Marko Rauhamaa
  2015-12-26  1:57       ` Vicente Vera
  0 siblings, 1 reply; 21+ messages in thread
From: Marko Rauhamaa @ 2015-12-22 15:56 UTC (permalink / raw)
  To: Vicente Vera; +Cc: guile-user

Vicente Vera <vicentemvp@gmail.com>:

> Should this be sent to the bugs list?

Go ahead.


Marko




* Re: Running script from directory with UTF-8 characters
  2015-12-22 15:55         ` Marko Rauhamaa
@ 2015-12-22 20:12           ` Chris Vine
  2015-12-22 20:36             ` Marko Rauhamaa
  0 siblings, 1 reply; 21+ messages in thread
From: Chris Vine @ 2015-12-22 20:12 UTC (permalink / raw)
  To: guile-user

On Tue, 22 Dec 2015 17:55:58 +0200
Marko Rauhamaa <marko@pacujo.net> wrote:
> Chris Vine <chris@cvine.freeserve.co.uk> wrote:
> > On Tue, 22 Dec 2015 03:14:18 +0200
> > Marko Rauhamaa <marko@pacujo.net> wrote:  
> >> For example,
> >> 
> >>     scheme@(guile-user)> (opendir ".")
> >>     $1 = #<directory stream f7ffa0>
> >>     [...]
> >>     scheme@(guile-user)> (readdir $1)
> >>     $4 = "?9t\x1b["
> >>     scheme@(guile-user)> (open-file $4 "r")
> >>     ERROR: In procedure open-file:
> >>     ERROR: In procedure open-file: No such file or directory:
> >> "?9t\x1b["  
> >
> > You can set the locale in the REPL, if that is where you are working
> > from (as in your example), and then UTF-8 pathnames will work
> > fine.  
> 
> You misunderstood me. The problem is that Guile cannot deal with
> non-UTF-8 pathnames in a UTF-8 locale. IOW, Linux pathnames are *not*
> strings. They are bytevectors. Guile 1.x (as well as Python 2.x) was
> fine with bytevector pathnames, but Guile 2.x (as well as Python 3.x) wants
> to pretend filenames are strings. That leads to trouble, potentially
> even to security vulnerabilities.
> 
> A very typical case is a tarball that contains, say, Latin-1
> filenames. If you should expand the tarball in a UTF-8 environment,
> Guile wouldn't be able to deal with the situation.

Yes, you exceeded my powers of deduction (or clairvoyance, depending on
how you look at it).

More to the point, Unix-like pathnames are at the implementation level
just a collection of bytes terminated by NUL and with '/' as the
directory separator. Having said that, the POSIX Portable Filename
Character Set (§3.278 of the SUS) doesn't even cover all of ASCII, let
alone Unicode.

It can be useful to handle filenames as strings in the program.  My
main objection is not that filenames are not treated as collections of
bytes, but that guile assumes the filename character set is the same as
the locale character set, which on distributed file systems may be
completely false.  I may be wrong, but I do not think you can set the
filename codeset programmatically in guile, which most other libraries
permit.

So I guess the best rule is that, even if you don't stick to the
Portable Filename Character Set, stick to ASCII for filenames/paths.

Chris


* Re: Running script from directory with UTF-8 characters
  2015-12-22 20:12           ` Chris Vine
@ 2015-12-22 20:36             ` Marko Rauhamaa
  2015-12-22 20:59               ` Eli Zaretskii
  0 siblings, 1 reply; 21+ messages in thread
From: Marko Rauhamaa @ 2015-12-22 20:36 UTC (permalink / raw)
  To: Chris Vine; +Cc: guile-user

Chris Vine <chris@cvine.freeserve.co.uk>:

> So I guess the best rule is that, even if you don't stick to the
> Portable Filename Character Set, stick to ASCII for filenames/paths.

The filenames are not in my control or Guile's. Guile can't simply wish
the filenames to be strings, or, like Python, it would at least need
some special measures to handle the exceptional cases.

Well, of course, when there's a will, there's a way. By setting the
character set artificially to Latin-1 in Guile, all pathnames are
accessible to it.
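
The reason the Latin-1 trick reaches every pathname can be sketched in
Python (used here only as a stand-in; nothing below is Guile API):
Latin-1 assigns exactly one code point to each byte value, so decoding
never fails and encoding restores the original bytes.

```python
# Sketch: Latin-1 maps every byte value 0..255 to exactly one code point,
# so any byte sequence survives a decode/encode round trip.
# (NUL and "/" cannot occur inside a filename; they are included here
# only to cover the full byte range.)
all_bytes = bytes(range(256))
as_text = all_bytes.decode("latin-1")      # never raises
round_trip = as_text.encode("latin-1")

print(len(as_text), round_trip == all_bytes)
```

The resulting "characters" carry no meaning beyond their byte values.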

Marko


* Re: Running script from directory with UTF-8 characters
  2015-12-22 20:36             ` Marko Rauhamaa
@ 2015-12-22 20:59               ` Eli Zaretskii
  2015-12-22 21:39                 ` Marko Rauhamaa
  0 siblings, 1 reply; 21+ messages in thread
From: Eli Zaretskii @ 2015-12-22 20:59 UTC (permalink / raw)
  To: Marko Rauhamaa; +Cc: guile-user

> From: Marko Rauhamaa <marko@pacujo.net>
> Date: Tue, 22 Dec 2015 22:36:07 +0200
> Cc: guile-user@gnu.org
> 
> By setting the character set artificially to Latin-1 in Guile, all
> pathnames are accessible to it.

No, they aren't, not as file names.  E.g., you cannot meaningfully
downcase or upcase such "characters", you cannot count characters (as
opposed to bytes), you cannot calculate how much screen estate will be
needed to display them, with some Far Eastern encodings you cannot
correctly search them for some specific ASCII characters (because they
can be part of a multibyte sequence), etc. etc.  IOW, you cannot work
with file names as human-readable text, which is something many
programs need to do.

File names _are_ strings, there's no way around that.  They are
strings because _people_ name files and give them meaningful names and
extensions.  If Guile cannot easily work with file names encoded in a
codeset other than the current locale's one, then Guile should be
extended to allow a program to tell it in which encoding to interpret
a particular name.  (I think Guile already supports that, but maybe I
misremember.)  But lobbying for treating file names as byte streams,
let alone Latin-1 characters, is a large step backwards, to 1990s when
we didn't know better.  We've come a long way since then and learned a
lot on the way.


* Re: Running script from directory with UTF-8 characters
  2015-12-22 20:59               ` Eli Zaretskii
@ 2015-12-22 21:39                 ` Marko Rauhamaa
  2015-12-23 18:28                   ` Eli Zaretskii
  2015-12-24 16:13                   ` Barry Schwartz
  0 siblings, 2 replies; 21+ messages in thread
From: Marko Rauhamaa @ 2015-12-22 21:39 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: guile-user

Eli Zaretskii <eliz@gnu.org>:

>> From: Marko Rauhamaa <marko@pacujo.net>
>> By setting the character set artificially to Latin-1 in Guile, all
>> pathnames are accessible to it.
>
> No, they aren't, not as file names. E.g., you cannot meaningfully
> downcase or upcase such "characters", you cannot count characters (as
> opposed to bytes), you cannot calculate how much screen estate will be
> needed to display them, with some Far Eastern encodings you cannot
> correctly search them for some specific ASCII characters (because they
> can be part of a multibyte sequence), etc. etc. IOW, you cannot work
> with file names as human-readable text, which is something many
> programs need to do.

You can, in a roundabout way. You do the low-level file I/O in Latin-1.
Then, you reencode into UTF-8, and if you get an exception, you deal
with the situation.
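
A minimal Python sketch of that roundabout way, assuming the name was
read through a Latin-1 decoding layer (the helper name latin1_to_text
is made up for this illustration):

```python
def latin1_to_text(name: str) -> str:
    """Reinterpret a Latin-1-decoded pathname as UTF-8 text.

    Raises UnicodeDecodeError if the underlying bytes are not valid UTF-8.
    """
    raw = name.encode("latin-1")    # recover the original bytes
    return raw.decode("utf-8")      # interpret them as UTF-8


# A UTF-8 name as it would look after being read through Latin-1.
low_level = "código".encode("utf-8").decode("latin-1")
readable = latin1_to_text(low_level)
print(readable)

try:
    latin1_to_text("c\xf3digo")     # the bytes were really Latin-1
    handled = False
except UnicodeDecodeError:
    handled = True                  # the application decides what to do
    print("not UTF-8; fall back to the raw name")
```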

Otherwise, you may not even be able to remove a file with a non-UTF-8
name.

> File names _are_ strings, there's no way around that.

Linux pathnames are classic C strings.

> They are strings because _people_ name files and give them meaningful
> names and extensions.

The Linux kernel just doesn't care, and shouldn't.

It's acceptable for Guile to create a higher-level illusion, but it
shouldn't sacrifice completeness while doing so. You should be able to
manipulate every conceivable filename from Guile code.

(Python 3.x accepts bytevectors as well as strings everywhere. For
example, listing a directory returns strings if the directory name is
given as a string. It returns bytevectors if the directory name is given
as a bytevector. Python's bytevector literals accept ASCII, which makes
this rather convenient.)
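
The Python behaviour described above can be checked with a short
sketch; the temporary directory and the file name test.scm are
placeholders chosen for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    # Create one empty file so the listing is not empty.
    open(os.path.join(d, "test.scm"), "w").close()

    as_str = os.listdir(d)                  # str argument, str results
    as_bytes = os.listdir(os.fsencode(d))   # bytes argument, bytes results

print(as_str, as_bytes)
```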

> If Guile cannot easily work with file names encoded in a codeset other
> than the current locale's one, then Guile should be extended to allow
> a program to tell it in which encoding to interpret a particular name.

A program usually has no clue how a pathname has been encoded.

> (I think Guile already supports that, but maybe I misremember.) But
> lobbying for treating file names as byte streams, let alone Latin-1
> characters, is a large step backwards, to 1990s when we didn't know
> better. We've come a long way since then and learned a lot on the way.

At least our backwardness allowed Linux to jump directly to UTF-8 and
not be afflicted by UCS-2 like Windows and Java.

I'm not saying bytevectors are elegant, but we should not replace them
with wishful thinking. Ideally, we should have a bijective
bytevector-to-string mapping. (Python 3.x uses Unicode surrogate code
points for that purpose but doesn't quite achieve bijection,
unfortunately.)
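
For reference, the Python mechanism is the "surrogateescape" error
handler. A small sketch of the bytes-to-string direction, which is
lossless:

```python
raw = b"c\xf3digo"                  # Latin-1 bytes, invalid as UTF-8

# The undecodable byte 0xF3 becomes the lone surrogate U+DCF3.
text = raw.decode("utf-8", errors="surrogateescape")
print(ascii(text))

# Encoding with the same handler restores the original bytes.
restored = text.encode("utf-8", errors="surrogateescape")

# A strict encode refuses the lone surrogate.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```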

I'm a bit sorry that Guile repeated Python 3's mistake and brought
(Unicode) strings to the center. Strings are a highly special-purpose
data structure; I really never had a real need for them in my decades of
programming. Also, I suspect strings are much too simplistic for any
serious typesetting or GUI work. It seems the sweet spot of strings is
text/plain mail messages and Usenet postings.

Guile 1.x's and Python 2.x's bytevector/string confusion was actually
a very happy medium. Neither the OS nor the programming language placed
any interpretation to the byte sequences. That was left to the
application.

Marko


* Re: Running script from directory with UTF-8 characters
  2015-12-22 21:39                 ` Marko Rauhamaa
@ 2015-12-23 18:28                   ` Eli Zaretskii
  2015-12-23 19:18                     ` Marko Rauhamaa
  2015-12-24 16:13                   ` Barry Schwartz
  1 sibling, 1 reply; 21+ messages in thread
From: Eli Zaretskii @ 2015-12-23 18:28 UTC (permalink / raw)
  To: Marko Rauhamaa; +Cc: guile-user

> From: Marko Rauhamaa <marko@pacujo.net>
> Date: Tue, 22 Dec 2015 23:39:28 +0200
> Cc: guile-user@gnu.org
> 
> > No, they aren't, not as file names. E.g., you cannot meaningfully
> > downcase or upcase such "characters", you cannot count characters (as
> > opposed to bytes), you cannot calculate how much screen estate will be
> > needed to display them, with some Far Eastern encodings you cannot
> > correctly search them for some specific ASCII characters (because they
> > can be part of a multibyte sequence), etc. etc. IOW, you cannot work
> > with file names as human-readable text, which is something many
> > programs need to do.
> 
> You can, in a roundabout way. You do the low-level file I/O in Latin-1.
> Then, you reencode into UTF-8

IOW, from the application-level perspective, the file names are
encoded in UTF-8 (in this example).  The low-level reading as byte
stream (NOT Latin-1!) is out of scope as long as you consider a Guile
Scheme program that needs to manipulate the file names.

So we are in violent agreement.

> Otherwise, you may not even be able to remove a file with a non-UTF-8
> name.

What do you mean by a non-UTF-8 file name?  A file name that includes
byte sequences that are not valid UTF-8?  For that, Guile needs to
acquire a capability of representing raw bytes, similar to what Emacs
does.  This capability is an add-on; it should not replace the ability
to interpret file names as character strings encoded in some
recognizable encoding, either forced by the application or deduced
from some meta-data, user preferences, locale's defaults, etc.

> > They are strings because _people_ name files and give them meaningful
> > names and extensions.
> 
> The Linux kernel just doesn't care, and shouldn't.

Guile is not an OS kernel.  Guile is an environment for writing
applications.  On the application level, you _should_ care, or else
you won't be able to manipulate file names in meaningful ways.

> It's acceptable for Guile to create a higher-level illusion, but it
> shouldn't sacrifice completeness while doing so. You should be able to
> manipulate every conceivable filename from Guile code.

We are again in violent agreement about the goal.  But the means
towards that goal is NOT to abandon interpretation of file names as
strings of characters, the means is to be able to represent raw bytes
on top of a meaningful character representation.

> > If Guile cannot easily work with file names encoded in a codeset other
> > than the current locale's one, then Guile should be extended to allow
> > a program to tell it in which encoding to interpret a particular name.
> 
> A program usually has no clue how a pathname has been encoded.

The programmer does, or should.  The user does, sometimes (e.g.,
the capability presented in many browsers and editors to force text
encoding).  Some encodings can be deduced by analyzing the bit stream.
And there are locale defaults if nothing else works.  If none of that
is done, the program cannot manipulate these file names in any
meaningful way.  The kernel can duck that problem because it's the
kernel: it doesn't interact with users, and its filesystem layer is
not required to understand the meaning of, say, the file-name
extensions.  We have no such luxury on the application level.  So we
cannot simply copycat the kernel techniques into Guile, it won't work.
It also won't work to expect applications to do that, as that is too
complex and subtle (and tedious) for applications to get right every
time.

Once again, I suggest studying how Emacs solves this very problem.
The solution used there is satisfactory, and fits all of your
requirements above.  It's not without some subtleties in rare cases,
but the problem is complex and there's no way around that complexity.

> > (I think Guile already supports that, but maybe I misremember.) But
> > lobbying for treating file names as byte streams, let alone Latin-1
> > characters, is a large step backwards, to 1990s when we didn't know
> > better. We've come a long way since then and learned a lot on the way.
> 
> At least our backwardness allowed Linux to jump directly to UTF-8 and
> not be afflicted by UCS-2 like Windows and Java.

Once again, Guile is not an OS kernel.  It cannot simply adopt kernel
solutions.

> I'm not saying bytevectors are elegant, but we should not replace them
> with wishful thinking.

No need for wishful thinking.  Study what Emacs does and do something
similar.

> I'm a bit sorry that Guile repeated Python 3's mistake and brought
> (Unicode) strings to the center.

Everybody makes that mistake.  Emacs did it as well, but that was years
ago, and since then the mistakes were identified and corrected.  The
basis must be Unicode, the trick is to build additions on top of that
which allow raw bytes and Unicode text strings to coexist, more or
less transparently to the application level.  ("More or less" because
handling raw bytes as part of strings requires some care; fortunately,
such use cases are rare.)

> Strings are a highly special-purpose data structure; I really never
> had a real need for them in my decades of programming. Also, I
> suspect strings are much too simplistic for any serious typesetting
> or GUI work. It seems the sweet spot of strings is text/plain mail
> messages and Usenet postings.

My experience indicates otherwise (in particular, processing and
displaying plain text strings is what the Unicode Standard is all
about), but I think that issue is tangential to this discussion.

> Guile 1.x's and Python 2.x's bytevector/string confusion was actually
> a very happy medium. Neither the OS nor the programming language placed
> any interpretation to the byte sequences. That was left to the
> application.

And that is wrong.  Applications cannot handle that, they need some
heavy help from the infrastructure.  Applications actually love to
have normal human-readable text strings, after the infrastructure
decoded the byte stream into characters for them.  Most file names are
encoded in locale's codeset (otherwise file browsers and other
interactive programs that accept and display file names won't be able
to handle them), so at least this popular and very important use case
should "just work" without requiring each application to reinvent the
wheel of decoding byte sequences into characters, dealing with EILSEQ,
etc.  An environment that doesn't provide at least that won't fly.


* Re: Running script from directory with UTF-8 characters
  2015-12-23 18:28                   ` Eli Zaretskii
@ 2015-12-23 19:18                     ` Marko Rauhamaa
  2015-12-23 19:33                       ` Eli Zaretskii
  0 siblings, 1 reply; 21+ messages in thread
From: Marko Rauhamaa @ 2015-12-23 19:18 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: guile-user

Eli Zaretskii <eliz@gnu.org>:

>> From: Marko Rauhamaa <marko@pacujo.net>
>> The Linux kernel just doesn't care, and shouldn't.
>
> Guile is not an OS kernel. Guile is an environment for writing
> applications. On the application level, you _should_ care, or else you
> won't be able to manipulate file names in meaningful ways.

To me, a programming language is a medium of writing programs for an
operating system. I don't think a programming language should "shield"
me from the OS. Instead, it should make the whole gamut of the OS
facilities available to me.

>> I'm not saying bytevectors are elegant, but we should not replace
>> them with wishful thinking.
>
> No need for wishful thinking. Study what Emacs does and do something
> similar.

Why don't you tell me already what emacs does?

>> Guile 1.x's and Python 2.x's bytevector/string confusion was actually
>> a very happy medium. Neither the OS nor the programming language
>> placed any interpretation to the byte sequences. That was left to the
>> application.
>
> And that is wrong. Applications cannot handle that, they need some
> heavy help from the infrastructure.

That can be managed through support libraries.


Marko




* Re: Running script from directory with UTF-8 characters
  2015-12-23 19:18                     ` Marko Rauhamaa
@ 2015-12-23 19:33                       ` Eli Zaretskii
  2015-12-23 21:15                         ` Marko Rauhamaa
  2015-12-23 21:53                         ` David Kastrup
  0 siblings, 2 replies; 21+ messages in thread
From: Eli Zaretskii @ 2015-12-23 19:33 UTC (permalink / raw)
  To: Marko Rauhamaa; +Cc: guile-user

> From: Marko Rauhamaa <marko@pacujo.net>
> Cc: guile-user@gnu.org
> Date: Wed, 23 Dec 2015 21:18:28 +0200
> 
> Eli Zaretskii <eliz@gnu.org>:
> 
> >> From: Marko Rauhamaa <marko@pacujo.net>
> >> The Linux kernel just doesn't care, and shouldn't.
> >
> > Guile is not an OS kernel. Guile is an environment for writing
> > applications. On the application level, you _should_ care, or else you
> > won't be able to manipulate file names in meaningful ways.
> 
> To me, a programming language is a medium of writing programs for an
> operating system. I don't think a programming language should "shield"
> me from the OS. Instead, it should make the whole gamut of the OS
> facilities available to me.

I see no contradiction here, as long as you acknowledge that Guile
should be good for more than just OS level stuff.

> >> I'm not saying bytevectors are elegant, but we should not replace
> >> them with wishful thinking.
> >
> > No need for wishful thinking. Study what Emacs does and do something
> > similar.
> 
> Why don't you tell me already what emacs does?

I did, you elided that.  It represents text as a superset of UTF-8, and
uses high codepoints above the Unicode space for raw bytes.

> >> Guile 1.x's and Python 2.x's bytevector/string confusion was actually
> >> a very happy medium. Neither the OS nor the programming language
> >> placed any interpretation to the byte sequences. That was left to the
> >> application.
> >
> > And that is wrong. Applications cannot handle that, they need some
> > heavy help from the infrastructure.
> 
> That can be managed through support libraries.

Guile is one huge support library, so it should include that built-in.




* Re: Running script from directory with UTF-8 characters
  2015-12-23 19:33                       ` Eli Zaretskii
@ 2015-12-23 21:15                         ` Marko Rauhamaa
  2015-12-23 21:53                         ` David Kastrup
  1 sibling, 0 replies; 21+ messages in thread
From: Marko Rauhamaa @ 2015-12-23 21:15 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: guile-user

Eli Zaretskii <eliz@gnu.org>:

>> From: Marko Rauhamaa <marko@pacujo.net>
>> Why don't you tell me already what emacs does?
>
> I did, you elided that. It represents text as a superset of UTF-8, and
> uses high codepoints above the Unicode space for raw bytes.

Excellent. If that works, Guile needs the same thing.

(I'm afraid, though, the approach is not without its problems as the
concatenation of two raw bytes might yield a valid UTF-8 encoding of a
single character. I don't know if full bijectivity can be achieved.)
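
The concatenation worry can be made concrete with Python's
"surrogateescape" handler. This only models the general problem; it is
not Emacs's scheme, whose behaviour is not tested here.

```python
def dec(b: bytes) -> str:
    """Decode as UTF-8, escaping undecodable bytes as lone surrogates."""
    return b.decode("utf-8", errors="surrogateescape")


def enc(s: str) -> bytes:
    """Inverse of dec() for strings that dec() produced."""
    return s.encode("utf-8", errors="surrogateescape")


# 0xC3 0xA9 is the UTF-8 encoding of "é", but each byte alone is invalid.
joined_after = dec(b"\xc3") + dec(b"\xa9")    # two escaped raw bytes
joined_before = dec(b"\xc3" + b"\xa9")        # the single character "é"

# Two different strings map to the same byte sequence.
print(joined_after != joined_before, enc(joined_after) == enc(joined_before))
```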

>> That can be managed through support libraries.
>
> Guile is one huge support library, so it should include that built-in.

As long as Guile can manage anything Linux throws at it, I'm fine with
it. As it stands, a couple of chinks in the armor have been identified.

Marko


* Re: Running script from directory with UTF-8 characters
  2015-12-23 19:33                       ` Eli Zaretskii
  2015-12-23 21:15                         ` Marko Rauhamaa
@ 2015-12-23 21:53                         ` David Kastrup
  2015-12-23 22:20                           ` Marko Rauhamaa
  1 sibling, 1 reply; 21+ messages in thread
From: David Kastrup @ 2015-12-23 21:53 UTC (permalink / raw)
  To: guile-user

Eli Zaretskii <eliz@gnu.org> writes:

> From: Marko Rauhamaa <marko@pacujo.net>
>
>> Why don't you tell me already what emacs does?
>
> I did, you elided that.  It represents text as a superset of UTF-8, and
> uses high codepoints above the Unicode space for raw bytes.

Incorrect.  It uses overlong encodings of 0x00-0x7f for raw bytes in the
0x80-0xff range (0x00-0x7f are always represented as themselves).  Those
are not allowed in properly encoded UTF-8 and take only two bytes (byte
patterns 0xc0 0x80–0xbf and 0xc1 0x80–0xbf), so random byte patterns get
inflated by somewhat less than 50% on average (every pattern allowed in
properly encoded UTF-8 is left unchanged, of course).

That's more economical than Python's method which uses the encodings of
surrogate words not allowed in properly encoded UTF-8, taking 3 bytes
rather than the 2 Emacs makes do with.  Using high codepoints above the
Unicode space would even take 4 bytes.
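
The size comparison can be checked with a small Python sketch. The helper
names below are hypothetical, and the first function only reproduces the
byte patterns described above; it is not taken from the Emacs sources.
The second one encodes the lone surrogate U+DC00 + b that PEP 383 assigns
to an undecodable byte b.

```python
def raw_byte_to_two_bytes(b: int) -> bytes:
    """Map a raw byte 0x80..0xFF to the two-byte pattern described above,
    i.e. the overlong UTF-8 form of (b - 0x80)."""
    if not 0x80 <= b <= 0xFF:
        raise ValueError("only bytes 0x80..0xFF need escaping")
    low7 = b - 0x80                        # value in 0x00..0x7F
    return bytes([0xC0 | (low7 >> 6),      # lead byte: 0xC0 or 0xC1
                  0x80 | (low7 & 0x3F)])   # continuation byte: 0x80..0xBF


def raw_byte_to_surrogate(b: int) -> bytes:
    """Encode the lone surrogate U+DC00 + b in UTF-8 style (three bytes)."""
    if not 0x80 <= b <= 0xFF:
        raise ValueError("only bytes 0x80..0xFF need escaping")
    return chr(0xDC00 + b).encode("utf-8", "surrogatepass")


for b in (0x80, 0xBF, 0xC0, 0xFF):
    print(hex(b),
          raw_byte_to_two_bytes(b).hex(),   # c080, c0bf, c180, c1bf
          raw_byte_to_surrogate(b).hex())   # edb280, edb2bf, edb380, edb3bf
```

A strict UTF-8 decoder rejects both forms, which is what makes them usable
as escapes: neither can collide with properly encoded text.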

-- 
David Kastrup


* Re: Running script from directory with UTF-8 characters
  2015-12-23 21:53                         ` David Kastrup
@ 2015-12-23 22:20                           ` Marko Rauhamaa
  2015-12-23 22:25                             ` David Kastrup
  0 siblings, 1 reply; 21+ messages in thread
From: Marko Rauhamaa @ 2015-12-23 22:20 UTC (permalink / raw)
  To: David Kastrup; +Cc: guile-user

David Kastrup <dak@gnu.org>:

> That's more economical than Python's method which uses the encodings
> of surrogate words not allowed in properly encoded UTF-8, taking
> 3 bytes rather than the 2 Emacs makes do with. Using high codepoints
> above the Unicode space would even take 4 bytes.

Actually, CPython represents strings internally even less
"economically": it uses single-byte strings if it can (Latin-1). If it
can't, it uses all-two-byte strings (UCS-2). If it can't do even that,
it uses all-four-byte strings (UCS-4). Thus, even a single code point
above 65535 will cause the whole string to consist of 4-byte integers.
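
This can be observed with sys.getsizeof. The sketch below assumes CPython
3.3 or later (the flexible string representation of PEP 393); other
Python implementations may report different sizes. The helper name
bytes_per_char is made up for illustration.

```python
import sys


def bytes_per_char(ch: str, n: int = 1000) -> int:
    """Estimate storage per code point: compare two freshly built strings
    of n and 2*n copies of ch, so the fixed object header cancels out."""
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n


print(bytes_per_char("a"))            # ASCII
print(bytes_per_char("\u00e9"))       # Latin-1 range
print(bytes_per_char("\u20ac"))       # above Latin-1, within the BMP
print(bytes_per_char("\U0001F600"))   # above U+FFFF

# One code point above U+FFFF widens every character in the string.
ascii_only = "a" * 1000
one_astral = ascii_only + "\U0001F600"
print(sys.getsizeof(one_astral) - sys.getsizeof(ascii_only))
```

On CPython this should print 1, 1, 2 and 4, followed by a difference of
a little over 3000 bytes for a single appended character.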

Marko


* Re: Running script from directory with UTF-8 characters
  2015-12-23 22:20                           ` Marko Rauhamaa
@ 2015-12-23 22:25                             ` David Kastrup
  0 siblings, 0 replies; 21+ messages in thread
From: David Kastrup @ 2015-12-23 22:25 UTC (permalink / raw)
  To: guile-user

Marko Rauhamaa <marko@pacujo.net> writes:

> David Kastrup <dak@gnu.org>:
>
>> That's more economical than Python's method which uses the encodings
>> of surrogate words not allowed in properly encoded UTF-8, taking
>> 3 bytes rather than the 2 Emacs makes do with. Using high codepoints
>> above the Unicode space would even take 4 bytes.
>
> Actually, CPython represents strings internally even less
> "economically": it uses single-byte strings if it can (Latin-1). If it
> can't, it uses all-two-byte strings (UCS-2). If it can't do even that,
> it uses all-four-byte strings (UCS-4). Thus, even a single code point
> above 65535 will cause the whole string to consist of 4-byte integers.

Maybe I confused Python and Perl here.  No idea.  But I'm pretty sure
about Emacs.

-- 
David Kastrup





* Re: Running script from directory with UTF-8 characters
  2015-12-22 21:39                 ` Marko Rauhamaa
  2015-12-23 18:28                   ` Eli Zaretskii
@ 2015-12-24 16:13                   ` Barry Schwartz
  1 sibling, 0 replies; 21+ messages in thread
From: Barry Schwartz @ 2015-12-24 16:13 UTC (permalink / raw)
  To: Marko Rauhamaa; +Cc: guile-user

Marko Rauhamaa <marko@pacujo.net> skribis:
> I'm a bit sorry that Guile repeated Python 3's mistake and brought
> (Unicode) strings to the center. Strings are a highly special-purpose
> data structure; I really never had a real need for them in my decades of
> programming. Also, I suspect strings are much too simplistic for any
> serious typesetting or GUI work.

This is an understatement.




* Re: Running script from directory with UTF-8 characters
  2015-12-22 15:56     ` Marko Rauhamaa
@ 2015-12-26  1:57       ` Vicente Vera
  0 siblings, 0 replies; 21+ messages in thread
From: Vicente Vera @ 2015-12-26  1:57 UTC (permalink / raw)
  To: Marko Rauhamaa; +Cc: guile-user

Reported as bug #22229:

https://debbugs.gnu.org/cgi/bugreport.cgi?bug=22229

2015-12-22 12:56 GMT-03:00 Marko Rauhamaa <marko@pacujo.net>:
> Vicente Vera <vicentemvp@gmail.com>:
>
>> Should this be sent to the bugs list?
>
> Go ahead.
>
>
> Marko




Thread overview: 21+ messages
2015-12-21 21:09 Running script from directory with UTF-8 characters Vicente Vera
2015-12-21 23:19 ` Marko Rauhamaa
2015-12-22  0:34   ` Chris Vine
2015-12-22  1:14     ` Marko Rauhamaa
2015-12-22 14:21       ` Chris Vine
2015-12-22 15:55         ` Marko Rauhamaa
2015-12-22 20:12           ` Chris Vine
2015-12-22 20:36             ` Marko Rauhamaa
2015-12-22 20:59               ` Eli Zaretskii
2015-12-22 21:39                 ` Marko Rauhamaa
2015-12-23 18:28                   ` Eli Zaretskii
2015-12-23 19:18                     ` Marko Rauhamaa
2015-12-23 19:33                       ` Eli Zaretskii
2015-12-23 21:15                         ` Marko Rauhamaa
2015-12-23 21:53                         ` David Kastrup
2015-12-23 22:20                           ` Marko Rauhamaa
2015-12-23 22:25                             ` David Kastrup
2015-12-24 16:13                   ` Barry Schwartz
2015-12-22 14:32   ` Vicente Vera
2015-12-22 15:56     ` Marko Rauhamaa
2015-12-26  1:57       ` Vicente Vera
