unofficial mirror of guile-user@gnu.org 
* guile can't find a chinese named file
@ 2016-11-27 11:58 Thomas Morley
  2016-11-27 12:16 ` Chaos Eternal
  2017-01-30 14:20 ` Ludovic Courtès
  0 siblings, 2 replies; 110+ messages in thread
From: Thomas Morley @ 2016-11-27 11:58 UTC (permalink / raw)
  To: guile-user

[-- Attachment #1: Type: text/plain, Size: 1501 bytes --]

Hi all,

a Chinese user came up with a weird problem.

He wants to further process the string returned by (command-line), and
his file name contains some Chinese characters.
I tracked it down to the attached minimal example.
With guile-2.0.13 I get:

guile filename_名字.scm
;;; Stat of /home/hermann/Desktop/filename_??.scm failed:
;;; ERROR: In procedure stat: No such file or directory:
"/home/hermann/Desktop/filename_\u540d\u5b57.scm"
Backtrace:
In ice-9/boot-9.scm:
 160: 8 [catch #t #<catch-closure 55cae58fe4e0> ...]
In unknown file:
   ?: 7 [apply-smob/1 #<catch-closure 55cae58fe4e0>]
In ice-9/boot-9.scm:
  66: 6 [call-with-prompt prompt0 ...]
In ice-9/eval.scm:
 432: 5 [eval # #]
In ice-9/boot-9.scm:
2404: 4 [save-module-excursion #<procedure 55cae59209c0 at
ice-9/boot-9.scm:4051:3 ()>]
4058: 3 [#<procedure 55cae59209c0 at ice-9/boot-9.scm:4051:3 ()>]
1727: 2 [%start-stack load-stack ...]
1732: 1 [#<procedure 55cae5935db0 ()>]
In unknown file:
   ?: 0 [primitive-load "/home/hermann/Desktop/filename_\u540d\u5b57.scm"]

ERROR: In procedure primitive-load:
ERROR: In procedure open-file: No such file or directory:
"/home/hermann/Desktop/filename_\u540d\u5b57.scm"

What to do to make it work?
I messed around setting locales, to no avail so far.

Btw, the same test with guile-1.8:
guile-1.8 filename_名字.scm

("filename_�\x90\x8d字.scm")
(filename_名字.scm)

The string is strange, but at least the file is found.

Cheers,
  Harm

[-- Attachment #2: filename_名字.scm --]
[-- Type: text/x-scheme, Size: 88 bytes --]

(newline)
(write (command-line))

(newline)
(write (map string->symbol (command-line)))


* Re: guile can't find a chinese named file
  2016-11-27 11:58 guile can't find a chinese named file Thomas Morley
@ 2016-11-27 12:16 ` Chaos Eternal
  2016-11-28  8:54   ` Thomas Morley
  2017-01-30 14:20 ` Ludovic Courtès
  1 sibling, 1 reply; 110+ messages in thread
From: Chaos Eternal @ 2016-11-27 12:16 UTC (permalink / raw)
  To: Thomas Morley, guile-user

It seems that the UTF-8-encoded string has been converted to Unicode
before calling `open', but on the filesystem the file name is a UTF-8
byte string.
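
To make the distinction concrete, here is a minimal sketch that compares the
Unicode code points Guile holds for the two characters with the UTF-8 bytes
stored on the filesystem. It assumes Guile 2.0 or later with the
(rnrs bytevectors) module; the names `name', `code-points' and `utf8-bytes'
are only illustrative.

```scheme
(use-modules (rnrs bytevectors))

;; The two non-ASCII characters of the file name, as code-point escapes.
(define name "\u540d\u5b57")

;; What Guile holds internally: one code point per character.
(define code-points (map char->integer (string->list name)))

;; What a UTF-8 filesystem stores: three bytes per character here.
(define utf8-bytes (bytevector->u8-list (string->utf8 name)))

(write code-points) (newline)   ; (21517 23383), i.e. U+540D U+5B57
(write utf8-bytes)  (newline)   ; (229 144 141 229 173 151)
```

The bytes 144 and 141 are the \x90\x8d visible in the guile-1.8 output above.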





* Re: guile can't find a chinese named file
  2016-11-27 12:16 ` Chaos Eternal
@ 2016-11-28  8:54   ` Thomas Morley
  2017-01-26 21:59     ` Linas Vepstas
  0 siblings, 1 reply; 110+ messages in thread
From: Thomas Morley @ 2016-11-28  8:54 UTC (permalink / raw)
  To: Chaos Eternal; +Cc: guile-user

2016-11-27 13:16 GMT+01:00 Chaos Eternal <chaoseternal@shlug.org>:
> Seems that UTF-8 encoded string has been converted to unicode before calling
> `open',
> but on filesystem the filename is utf8 string

Your analysis is surely correct, but what to do?
I expected
guile filename_名字.scm
to work out of the box.

Am I missing something?
Or is it a bug, worth reporting?

Cheers,
  Harm




* Re: guile can't find a chinese named file
  2016-11-28  8:54   ` Thomas Morley
@ 2017-01-26 21:59     ` Linas Vepstas
  0 siblings, 0 replies; 110+ messages in thread
From: Linas Vepstas @ 2017-01-26 21:59 UTC (permalink / raw)
  To: Thomas Morley; +Cc: Guile User

It's a bug. Guile's UTF-8 handling has had bugs on and off.
One of the guile-2.0 versions does almost everything right, but UTF-8
is semi-broken again in 2.2: some things work, but various things
that used to work great are now broken (again).  I'm guessing that
Guile has a weak or non-existent unit-test suite for UTF-8.

Please open bug reports.  I could even (maybe) volunteer to fix some of
them, because I know where they are and why they occur, but I would need
encouragement/help from one of the core Guile devs to create
acceptable patches.

(I'm doing Chinese text processing in guile, right now, so each of
these bugs is painful to hit, and requires yet another work-around in
my code.)

--linas




* Re: guile can't find a chinese named file
  2016-11-27 11:58 guile can't find a chinese named file Thomas Morley
  2016-11-27 12:16 ` Chaos Eternal
@ 2017-01-30 14:20 ` Ludovic Courtès
  2017-01-30 15:48   ` David Kastrup
  2017-01-30 15:54   ` Marko Rauhamaa
  1 sibling, 2 replies; 110+ messages in thread
From: Ludovic Courtès @ 2017-01-30 14:20 UTC (permalink / raw)
  To: guile-user

Hi!

Thomas Morley <thomasmorley65@gmail.com> skribis:

> guile filename_名字.scm
> ;;; Stat of /home/hermann/Desktop/filename_??.scm failed:
> ;;; ERROR: In procedure stat: No such file or directory:
> "/home/hermann/Desktop/filename_\u540d\u5b57.scm"
> Backtrace:
> In ice-9/boot-9.scm:
>  160: 8 [catch #t #<catch-closure 55cae58fe4e0> ...]
> In unknown file:
>    ?: 7 [apply-smob/1 #<catch-closure 55cae58fe4e0>]
> In ice-9/boot-9.scm:
>   66: 6 [call-with-prompt prompt0 ...]
> In ice-9/eval.scm:
>  432: 5 [eval # #]
> In ice-9/boot-9.scm:
> 2404: 4 [save-module-excursion #<procedure 55cae59209c0 at
> ice-9/boot-9.scm:4051:3 ()>]
> 4058: 3 [#<procedure 55cae59209c0 at ice-9/boot-9.scm:4051:3 ()>]
> 1727: 2 [%start-stack load-stack ...]
> 1732: 1 [#<procedure 55cae5935db0 ()>]
> In unknown file:
>    ?: 0 [primitive-load "/home/hermann/Desktop/filename_\u540d\u5b57.scm"]
>
> ERROR: In procedure primitive-load:
> ERROR: In procedure open-file: No such file or directory:
> "/home/hermann/Desktop/filename_\u540d\u5b57.scm"

In C, argv is just an array of byte sequences, but in Guile,
(command-line) returns a list of strings, not a list of bytevectors.

Guile decodes its arguments according to the encoding of the current
locale.  So if you’re in a UTF-8 locale (say, zh_CN.utf8 or en_US.utf8),
Guile assumes its command-line arguments are UTF-8-encoded and decodes
them accordingly.

In the example above, it seems that the file name encoding was different
from the locale encoding, leading to this error.

HTH!

Ludo’.





* Re: guile can't find a chinese named file
  2017-01-30 14:20 ` Ludovic Courtès
@ 2017-01-30 15:48   ` David Kastrup
  2017-01-30 16:41     ` Ludovic Courtès
  2017-01-30 15:54   ` Marko Rauhamaa
  1 sibling, 1 reply; 110+ messages in thread
From: David Kastrup @ 2017-01-30 15:48 UTC (permalink / raw)
  To: guile-user

ludo@gnu.org (Ludovic Courtès) writes:

> Hi!
>
> Thomas Morley <thomasmorley65@gmail.com> skribis:
>
>> guile filename_名字.scm
>> ;;; Stat of /home/hermann/Desktop/filename_??.scm failed:
>> ;;; ERROR: In procedure stat: No such file or directory:
>> "/home/hermann/Desktop/filename_\u540d\u5b57.scm"
>> Backtrace:
>> In ice-9/boot-9.scm:
>>  160: 8 [catch #t #<catch-closure 55cae58fe4e0> ...]
>> In unknown file:
>>    ?: 7 [apply-smob/1 #<catch-closure 55cae58fe4e0>]
>> In ice-9/boot-9.scm:
>>   66: 6 [call-with-prompt prompt0 ...]
>> In ice-9/eval.scm:
>>  432: 5 [eval # #]
>> In ice-9/boot-9.scm:
>> 2404: 4 [save-module-excursion #<procedure 55cae59209c0 at
>> ice-9/boot-9.scm:4051:3 ()>]
>> 4058: 3 [#<procedure 55cae59209c0 at ice-9/boot-9.scm:4051:3 ()>]
>> 1727: 2 [%start-stack load-stack ...]
>> 1732: 1 [#<procedure 55cae5935db0 ()>]
>> In unknown file:
>>    ?: 0 [primitive-load "/home/hermann/Desktop/filename_\u540d\u5b57.scm"]
>>
>> ERROR: In procedure primitive-load:
>> ERROR: In procedure open-file: No such file or directory:
>> "/home/hermann/Desktop/filename_\u540d\u5b57.scm"
>
> In C, argv is just an array of byte sequences, but in Guile,
> (command-line) returns a list of strings, not a list of bytevectors.
>
> Guile decodes its arguments according to the encoding of the current
> locale.  So if you’re in a UTF-8 locale (say, zh_CN.utf8 or en_US.utf8),
> Guile assumes its command-line arguments are UTF-8-encoded and decodes
> them accordingly.
>
> In the example above, it seems that the file name encoding was different
> from the locale encoding, leading to this error.
>
> HTH!

Did you actually test this?

dak@lola:/usr/local/tmp/lilypond$ locale
LANG=en_US.UTF-8
LANGUAGE=en
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC=en_US.UTF-8
LC_TIME=en_US.UTF-8
LC_COLLATE="en_US.UTF-8"
LC_MONETARY=en_US.UTF-8
LC_MESSAGES="en_US.UTF-8"
LC_PAPER=en_US.UTF-8
LC_NAME=en_US.UTF-8
LC_ADDRESS=en_US.UTF-8
LC_TELEPHONE=en_US.UTF-8
LC_MEASUREMENT=en_US.UTF-8
LC_IDENTIFICATION=en_US.UTF-8
LC_ALL=
dak@lola:/usr/local/tmp/lilypond$ touch /tmp/f♯.scm
dak@lola:/usr/local/tmp/lilypond$ guile-2.0 /tmp/f♯.scm
;;; Stat of /tmp/f?.scm failed:
;;; ERROR: In procedure stat: No such file or directory: "/tmp/f\u266f.scm"
Backtrace:
In ice-9/boot-9.scm:
 160: 8 [catch #t #<catch-closure a0a4710> ...]
In unknown file:
   ?: 7 [apply-smob/1 #<catch-closure a0a4710>]
In ice-9/boot-9.scm:
  66: 6 [call-with-prompt prompt0 ...]
In ice-9/eval.scm:
 432: 5 [eval # #]
In ice-9/boot-9.scm:
2404: 4 [save-module-excursion #<procedure a0b5ce0 at ice-9/boot-9.scm:4051:3 ()>]
4056: 3 [#<procedure a0b5ce0 at ice-9/boot-9.scm:4051:3 ()>]
1727: 2 [%start-stack load-stack ...]
1732: 1 [#<procedure a0bb690 ()>]
In unknown file:
   ?: 0 [primitive-load "/tmp/f\u266f.scm"]

ERROR: In procedure primitive-load:
ERROR: In procedure open-file: No such file or directory: "/tmp/f\u266f.scm"
dak@lola:/usr/local/tmp/lilypond$ ls -l /tmp/f*.scm
-rw-rw-r-- 1 dak dak 0 Jan 30 16:42 /tmp/f♯.scm

-- 
David Kastrup





* Re: guile can't find a chinese named file
  2017-01-30 14:20 ` Ludovic Courtès
  2017-01-30 15:48   ` David Kastrup
@ 2017-01-30 15:54   ` Marko Rauhamaa
  2017-01-30 16:19     ` David Kastrup
  1 sibling, 1 reply; 110+ messages in thread
From: Marko Rauhamaa @ 2017-01-30 15:54 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: guile-user

ludo@gnu.org (Ludovic Courtès):

> In C, argv is just an array of byte sequences, but in Guile,
> (command-line) returns a list of strings, not a list of bytevectors.
>
> Guile decodes its arguments according to the encoding of the current
> locale. So if you’re in a UTF-8 locale (say, zh_CN.utf8 or
> en_US.utf8), Guile assumes its command-line arguments are
> UTF-8-encoded and decodes them accordingly.
>
> In the example above, it seems that the file name encoding was
> different from the locale encoding, leading to this error.

I'm afraid that choice (which Python made, as well) was a bad one
because Linux doesn't guarantee UTF-8 purity.


Marko




* Re: guile can't find a chinese named file
  2017-01-30 15:54   ` Marko Rauhamaa
@ 2017-01-30 16:19     ` David Kastrup
  2017-01-30 16:33       ` Marko Rauhamaa
  0 siblings, 1 reply; 110+ messages in thread
From: David Kastrup @ 2017-01-30 16:19 UTC (permalink / raw)
  To: guile-user

Marko Rauhamaa <marko@pacujo.net> writes:

> ludo@gnu.org (Ludovic Courtès):
>
>> In C, argv is just an array of byte sequences, but in Guile,
>> (command-line) returns a list of strings, not a list of bytevectors.
>>
>> Guile decodes its arguments according to the encoding of the current
>> locale. So if you’re in a UTF-8 locale (say, zh_CN.utf8 or
>> en_US.utf8), Guile assumes its command-line arguments are
>> UTF-8-encoded and decodes them accordingly.
>>
>> In the example above, it seems that the file name encoding was
>> different from the locale encoding, leading to this error.
>
> I'm afraid that choice (which Python made, as well) was a bad one
> because Linux doesn't guarantee UTF-8 purity.

Have you looked at the error messages?  They are all perfect UTF-8.  As
was the command line locale.

Here, have another data point:

dak@lola:/usr/local/tmp/lilypond$ guile-2.0 /tmp/f♯.scm 
;;; Stat of /tmp/f?.scm failed:
;;; ERROR: In procedure stat: No such file or directory: "/tmp/f\u266f.scm"
Backtrace:
In ice-9/boot-9.scm:
 160: 8 [catch #t #<catch-closure 9ca5710> ...]
In unknown file:
   ?: 7 [apply-smob/1 #<catch-closure 9ca5710>]
In ice-9/boot-9.scm:
  66: 6 [call-with-prompt prompt0 ...]
In ice-9/eval.scm:
 432: 5 [eval # #]
In ice-9/boot-9.scm:
2404: 4 [save-module-excursion #<procedure 9cb6ce0 at ice-9/boot-9.scm:4051:3 ()>]
4056: 3 [#<procedure 9cb6ce0 at ice-9/boot-9.scm:4051:3 ()>]
1727: 2 [%start-stack load-stack ...]
1732: 1 [#<procedure 9cbc690 ()>]
In unknown file:
   ?: 0 [primitive-load "/tmp/f\u266f.scm"]

ERROR: In procedure primitive-load:
ERROR: In procedure open-file: No such file or directory: "/tmp/f\u266f.scm"
dak@lola:/usr/local/tmp/lilypond$ guile-2.0 
GNU Guile 2.0.13
Copyright (C) 1995-2016 Free Software Foundation, Inc.

Guile comes with ABSOLUTELY NO WARRANTY; for details type `,show w'.
This program is free software, and you are welcome to redistribute it
under certain conditions; type `,show c' for details.

Enter `,help' for help.
scheme@(guile-user)> (open-input-
open-input-file    open-input-string  
scheme@(guile-user)> (open-input-file "/tmp/f\u266f.scm")
$1 = #<input: /tmp/f♯.scm 9>
scheme@(guile-user)> (open-input-file "/tmp/non-existent")
ERROR: In procedure open-file:
ERROR: In procedure open-file: No such file or directory: "/tmp/non-existent"

Entering a new prompt.  Type `,bt' for a backtrace or `,q' to continue.
scheme@(guile-user) [1]> 

Apparently, Guile can open the file just fine, and it sees the command
line just fine as encoded in utf-8.

But during command line processing rather than afterwards, it fails
opening the file.

So I really, really, really suggest that before people post their
theories that they actually bother cross-checking them with Guile.

-- 
David Kastrup





* Re: guile can't find a chinese named file
  2017-01-30 16:19     ` David Kastrup
@ 2017-01-30 16:33       ` Marko Rauhamaa
  2017-01-30 16:42         ` David Kastrup
  0 siblings, 1 reply; 110+ messages in thread
From: Marko Rauhamaa @ 2017-01-30 16:33 UTC (permalink / raw)
  To: David Kastrup; +Cc: guile-user

David Kastrup <dak@gnu.org>:

> Marko Rauhamaa <marko@pacujo.net> writes:
>> ludo@gnu.org (Ludovic Courtès):
>>> Guile assumes its command-line arguments are UTF-8-encoded and
>>> decodes them accordingly.
>>
>> I'm afraid that choice (which Python made, as well) was a bad one
>> because Linux doesn't guarantee UTF-8 purity.
>
> Have you looked at the error messages? They are all perfect UTF-8. As
> was the command line locale.

I was responding to Ludovic.

> Apparently, Guile can open the file just fine, and it sees the command
> line just fine as encoded in utf-8.

My problem is when it is not valid UTF-8.

> So I really, really, really suggest that before people post their
> theories that they actually bother cross-checking them with Guile.

Well, execute these commands from bash:

   $ touch $'\xee'
   $ touch xyz
   $ ls -a
   .  ..  ''$'\356'  xyz

Then, execute this guile program:

========================================================================
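;; Print every readable entry of the current directory.
(let ((dir (opendir ".")))
  (let loop ()
    (let ((filename (readdir dir)))
      (if (eof-object? filename)
          (closedir dir)              ; end of directory: release the stream
          (begin
            ;; access? must turn the string back into bytes for access(2)
            (if (access? filename R_OK)
                (format #t "~s\n" filename))
            (loop))))))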
========================================================================

It outputs:

   ".."
   "."
   "xyz"

skipping a file. This is a security risk. Files like these appear easily
when extracting zip files, for example.
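
Why such a name is troublesome can be seen without touching the filesystem.
A lone byte 0xEE is the lead byte of a three-byte UTF-8 sequence with no
continuation bytes, so a strict decoder has to reject it. The sketch below
assumes that `utf8->string' from (rnrs bytevectors) signals an error on
invalid input, as Guile's implementation is understood to do;
`decodes-as-utf8?' is a hypothetical helper, not part of Guile.

```scheme
(use-modules (rnrs bytevectors))

;; Hypothetical helper: #t if BV decodes as UTF-8, #f if decoding fails.
(define (decodes-as-utf8? bv)
  (catch #t
    (lambda () (utf8->string bv) #t)
    (lambda args #f)))

;; An ordinary ASCII name versus the single byte created by touch $'\xee'.
(write (decodes-as-utf8? (string->utf8 "xyz")))           (newline)
(write (decodes-as-utf8? (u8-list->bytevector '(#xee))))  (newline)
```

A name that cannot be decoded has no faithful string form, so it cannot be
encoded back to the bytes the kernel expects.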


Marko




* Re: guile can't find a chinese named file
  2017-01-30 15:48   ` David Kastrup
@ 2017-01-30 16:41     ` Ludovic Courtès
  2017-01-30 17:04       ` David Kastrup
  0 siblings, 1 reply; 110+ messages in thread
From: Ludovic Courtès @ 2017-01-30 16:41 UTC (permalink / raw)
  To: guile-user

Hey Dave!

David Kastrup <dak@gnu.org> skribis:

> ludo@gnu.org (Ludovic Courtès) writes:

[...]

>>> ERROR: In procedure open-file: No such file or directory:
>>> "/home/hermann/Desktop/filename_\u540d\u5b57.scm"
>>
>> In C, argv is just an array of byte sequences, but in Guile,
>> (command-line) returns a list of strings, not a list of bytevectors.
>>
>> Guile decodes its arguments according to the encoding of the current
>> locale.  So if you’re in a UTF-8 locale (say, zh_CN.utf8 or en_US.utf8),
>> Guile assumes its command-line arguments are UTF-8-encoded and decodes
>> them accordingly.
>>
>> In the example above, it seems that the file name encoding was different
>> from the locale encoding, leading to this error.
>>
>> HTH!
>
> Did you actually test this?

Oops, let me clarify.

Command-line arguments are indeed decoded according to the locale
encoding (that’s commit ed4c3739668b4b111b38555b8bc101cb74c87c1c).

When making a syscall like open(2), Guile converts strings to the locale
encoding.

However, in 2.0, the current locale is *not* installed; you have to
either call ‘setlocale’ explicitly (like in C), or set this environment
variable (info "(guile) Environment Variables"):

  GUILE_INSTALL_LOCALE=1

When you do that (and this will be the default in 2.2), things work as
expected:

--8<---------------cut here---------------start------------->8---
$ GUILE_INSTALL_LOCALE=1 guile λ.scm
;;; note: auto-compilation is enabled, set GUILE_AUTO_COMPILE=0
;;;       or pass the --no-auto-compile argument to disable.
;;; compiling /home/ludo/src/guile/λ.scm
;;; compiled /home/ludo/.cache/guile/ccache/2.0-LE-8-2.0/home/ludo/src/guile/λ.scm.go
hello λ!
$ locale
LANG=en_US.utf8
LC_CTYPE="en_US.utf8"
LC_NUMERIC="en_US.utf8"
LC_TIME="en_US.utf8"
LC_COLLATE="en_US.utf8"
LC_MONETARY="en_US.utf8"
LC_MESSAGES="en_US.utf8"
LC_PAPER=fr_FR.utf8
LC_NAME="en_US.utf8"
LC_ADDRESS="en_US.utf8"
LC_TELEPHONE="en_US.utf8"
LC_MEASUREMENT="en_US.utf8"
LC_IDENTIFICATION="en_US.utf8"
LC_ALL=
--8<---------------cut here---------------end--------------->8---
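
The other route, calling ‘setlocale’ explicitly, might look like the
following sketch. It can only help with file names that the script handles
itself, since the script's own file is opened before any of its code runs.
`report-arguments' is a hypothetical helper introduced for illustration.

```scheme
;; Install the locale named by LANG / LC_* so that strings passed to
;; system calls are encoded with it, much as GUILE_INSTALL_LOCALE=1 would.
(setlocale LC_ALL "")

;; Hypothetical helper: report whether each argument names an existing file.
(define (report-arguments args)
  (for-each (lambda (arg)
              (format #t "~s exists: ~a\n" arg (file-exists? arg)))
            args))

;; Skip the first element, which is the script name itself.
(report-arguments (cdr (command-line)))
```

Run as, for example, `guile check.scm λ.scm' (with `check.scm' being an
assumed ASCII name for the sketch above).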

Sorry for the confusion!

Ludo’.





* Re: guile can't find a chinese named file
  2017-01-30 16:33       ` Marko Rauhamaa
@ 2017-01-30 16:42         ` David Kastrup
  2017-01-30 17:58           ` Marko Rauhamaa
  0 siblings, 1 reply; 110+ messages in thread
From: David Kastrup @ 2017-01-30 16:42 UTC (permalink / raw)
  To: guile-user

Marko Rauhamaa <marko@pacujo.net> writes:

> David Kastrup <dak@gnu.org>:
>
>> Marko Rauhamaa <marko@pacujo.net> writes:
>>> ludo@gnu.org (Ludovic Courtès):
>>>> Guile assumes its command-line arguments are UTF-8-encoded and
>>>> decodes them accordingly.
>>>
>>> I'm afraid that choice (which Python made, as well) was a bad one
>>> because Linux doesn't guarantee UTF-8 purity.
>>
>> Have you looked at the error messages? They are all perfect UTF-8. As
>> was the command line locale.
>
> I was responding to Ludovic.
>
>> Apparently, Guile can open the file just fine, and it sees the command
>> line just fine as encoded in utf-8.
>
> My problem is when it is not valid UTF-8.
>
>> So I really, really, really suggest that before people post their
>> theories that they actually bother cross-checking them with Guile.
>
> Well, execute these commands from bash:
>
>    $ touch $'\xee'
>    $ touch xyz
>    $ ls -a
>    .  ..  ''$'\356'  xyz

We are not talking about file names not encoded in UTF-8.  It is
well-known that Guile is unable to work with strings in UTF-8-encoding
when their byte-pattern is not valid UTF-8.

This is a red herring.  The problem is not that Guile is unable to deal
with badly encoded UTF-8 file names.  The problem is that Guile is
unable to deal with properly encoded UTF-8 file names when it is
supposed to execute them from the command line.

> Then, execute this guile program:
>
> ========================================================================
> (let ((dir (opendir ".")))
>   (let loop ()
>     (let ((filename (readdir dir)))
>       (if (not (eof-object? filename))
>           (begin
>             (if (access? filename R_OK)
>                 (format #t "~s\n" filename))
>             (loop))))))
> ========================================================================
>
> It outputs:
>
>    ".."
>    "."
>    "xyz"
>
> skipping a file. This is a security risk. Files like these appear easily
> when extracting zip files, for example.

I am surprised this does not just throw a bad encoding exception.

But at any rate, this cannot easily be fixed since Guile uses libraries
for encoding/decoding that cannot deal reproducibly with improper byte
patterns.

The problem here is that Guile cannot even deal with _properly_ encoded
UTF-8 file names on the command line.

-- 
David Kastrup





* Re: guile can't find a chinese named file
  2017-01-30 16:41     ` Ludovic Courtès
@ 2017-01-30 17:04       ` David Kastrup
  0 siblings, 0 replies; 110+ messages in thread
From: David Kastrup @ 2017-01-30 17:04 UTC (permalink / raw)
  To: guile-user

ludo@gnu.org (Ludovic Courtès) writes:

[...]

> However, in 2.0, the current locale is *not* installed; you have to
> either call ‘setlocale’ explicitly (like in C), or set this environment
> variable (info "(guile) Environment Variables"):
>
>   GUILE_INSTALL_LOCALE=1
>
> When you do that (and this will be the default in 2.2), things work as
> expected:

But shouldn't that be done temporarily by default when processing the
command line?  Or alternatively, shouldn't Guile just pass the command
line byte-transparently to the file open calls?

It seems strange that Guile is unable to just pass what it received to
the file open call: if it is in 8bit-mode, this should work, and if it
is in UTF-8 mode (and the error messages suggest that it is), this
should work as well.
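
One way to picture byte-transparent handling is an 8-bit encoding in which
every byte maps to the code point of the same value (ISO-8859-1). The sketch
below only illustrates that property; the two helpers are hypothetical and
say nothing about how Guile actually processes its command line.

```scheme
(use-modules (rnrs bytevectors))

;; Hypothetical helper: view each byte as the character with that code point.
(define (bytes->latin1-string bv)
  (list->string (map integer->char (bytevector->u8-list bv))))

;; Hypothetical helper: the inverse; valid only for code points below 256.
(define (latin1-string->bytes s)
  (u8-list->bytevector (map char->integer (string->list s))))

;; Any byte sequence survives the round trip, valid UTF-8 or not.
(define valid   (string->utf8 "f\u266f.scm"))
(define invalid (u8-list->bytevector '(#xee)))

(write (bytevector=? valid   (latin1-string->bytes (bytes->latin1-string valid))))
(newline)
(write (bytevector=? invalid (latin1-string->bytes (bytes->latin1-string invalid))))
(newline)
```

The price is that such a string shows the individual bytes rather than the
characters a user typed, which is what the guile-1.8 output earlier in the
thread looks like.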

-- 
David Kastrup





* Re: guile can't find a chinese named file
  2017-01-30 16:42         ` David Kastrup
@ 2017-01-30 17:58           ` Marko Rauhamaa
  2017-01-30 18:32             ` David Kastrup
  2017-02-14 19:58             ` Linas Vepstas
  0 siblings, 2 replies; 110+ messages in thread
From: Marko Rauhamaa @ 2017-01-30 17:58 UTC (permalink / raw)
  To: David Kastrup; +Cc: guile-user

David Kastrup <dak@gnu.org>:

> But at any rate, this cannot easily be fixed since Guile uses libraries
> for encoding/decoding that cannot deal reproducibly with improper byte
> patterns.

Guile's mistake was to move to Unicode strings in the operating system
interface.

> The problem here is that Guile cannot even deal with _properly_
> encoded UTF-8 file names on the command line.

Ok.


Marko




* Re: guile can't find a chinese named file
  2017-01-30 17:58           ` Marko Rauhamaa
@ 2017-01-30 18:32             ` David Kastrup
  2017-01-30 18:50               ` Eli Zaretskii
  2017-01-30 19:01               ` Marko Rauhamaa
  2017-02-14 19:58             ` Linas Vepstas
  1 sibling, 2 replies; 110+ messages in thread
From: David Kastrup @ 2017-01-30 18:32 UTC (permalink / raw)
  To: Marko Rauhamaa; +Cc: guile-user

Marko Rauhamaa <marko@pacujo.net> writes:

> David Kastrup <dak@gnu.org>:
>
>> But at any rate, this cannot easily be fixed since Guile uses libraries
>> for encoding/decoding that cannot deal reproducibly with improper byte
>> patterns.
>
> Guile's mistake was to move to Unicode strings in the operating system
> interface.

Emacs uses a UTF-8-based encoding internally: basically, valid UTF-8 is
represented as itself, a number of code points beyond the actual limit
of UTF-8 are used for non-Unicode character sets, and single bytes not
properly belonging to the read encoding are represented as 0x00...0x7f,
0xc0 0x80 ... 0xc0 0xbf and 0xc1 0x80 ... 0xc1 0xbf (the latter two
ranges are "overlong" encodings of 0x00...0x7f and consequently also not
valid UTF-8).

The result is that random binary files read as "utf-8" grow by less than
50% in the internal representation (0x00-0x7f gets represented as
itself, and 0x80-0xff gets encoded with two bytes only when not being a
part of a valid utf-8 sequence).  The internal representation has
several guarantees for processing.  And when reencoding to utf-8 as
output encoding, the input gets reconstructed perfectly even when it
wasn't actually utf-8 to start with.
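
The two-byte form for stray bytes can be sketched in a few lines. This is an
illustration of the mapping as described above, written in Guile rather than
taken from the Emacs sources; `raw-byte->internal' and `internal->raw-byte'
are hypothetical names.

```scheme
;; Map a stray byte (#x80..#xff) to the two-byte internal form:
;; #xc0 or #xc1 followed by a continuation byte in #x80..#xbf.
(define (raw-byte->internal b)
  (list (logior #xc0 (logand (ash b -6) 1))
        (logior #x80 (logand b #x3f))))

;; Inverse mapping: recover the original byte from the two-byte form.
(define (internal->raw-byte lead trail)
  (logior #x80
          (ash (logand lead 1) 6)
          (logand trail #x3f)))

(write (raw-byte->internal #xee)) (newline)       ; (193 174), i.e. #xc1 #xae
(write (internal->raw-byte #xc1 #xae)) (newline)  ; 238, i.e. #xee
```

Because no valid UTF-8 text contains the lead bytes 0xc0 or 0xc1, the two
representations never collide, which is what makes the round trip lossless.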

Emacs does not use "Unicode strings in the operating system interface"
but rather has a number of explicit encodings:

file-name-coding-system is a variable defined in ‘C source code’.
Its value is nil

Documentation:
Coding system for encoding file names.
If it is nil, ‘default-file-name-coding-system’ (which see) is used.

On MS-Windows, the value of this variable is largely ignored if
‘w32-unicode-filenames’ (which see) is non-nil.  Emacs on Windows
behaves as if file names were encoded in ‘utf-8’.

[back]


Coding system for saving this buffer:
  U -- utf-8-emacs-unix (alias: emacs-internal)

Default coding system (for new files):
  U -- utf-8-unix (alias: mule-utf-8-unix)

Coding system for keyboard input:
  U -- utf-8-unix (alias: mule-utf-8-unix)

Coding system for terminal output:
  U -- utf-8-unix (alias: mule-utf-8-unix)

Coding system for inter-client cut and paste:
  nil
Defaults for subprocess I/O:
  decoding: U -- utf-8-unix (alias: mule-utf-8-unix)

  encoding: U -- utf-8-unix (alias: mule-utf-8-unix)


Priority order for recognizing coding systems when reading files:
  1. utf-8 (alias: mule-utf-8)
  2. iso-2022-7bit 
  3. iso-latin-1 (alias: iso-8859-1 latin-1)
  4. iso-2022-7bit-lock (alias: iso-2022-int-1)
  5. iso-2022-8bit-ss2 
  6. emacs-mule 
  7. raw-text 
  8. iso-2022-jp (alias: junet)
  9. in-is13194-devanagari (alias: devanagari)
  10. chinese-iso-8bit (alias: cn-gb-2312 euc-china euc-cn cn-gb gb2312)
  11. utf-8-auto 
  12. utf-8-with-signature 
  13. utf-16 
  14. utf-16be-with-signature (alias: utf-16-be)
  15. utf-16le-with-signature (alias: utf-16-le)
  16. utf-16be 
  17. utf-16le 
  18. japanese-shift-jis (alias: shift_jis sjis)
  19. chinese-big5 (alias: big5 cn-big5 cp950)
  20. undecided 

  Other coding systems cannot be distinguished automatically
  from these, and therefore cannot be recognized automatically
  with the present coding system priorities.

Particular coding systems specified for certain file names:

  OPERATION	TARGET PATTERN		CODING SYSTEM(s)
  ---------	--------------		----------------
  File I/O	"\\.dz\\'"		(no-conversion . no-conversion)
		"\\.txz\\'"		(no-conversion . no-conversion)
		"\\.xz\\'"		(no-conversion . no-conversion)
		"\\.lzma\\'"		(no-conversion . no-conversion)
		"\\.lz\\'"		(no-conversion . no-conversion)
		"\\.g?z\\'"		(no-conversion . no-conversion)
		"\\.\\(?:tgz\\|svgz\\|sifz\\)\\'"
					(no-conversion . no-conversion)
		"\\.tbz2?\\'"		(no-conversion . no-conversion)
		"\\.bz2\\'"		(no-conversion . no-conversion)
		"\\.Z\\'"		(no-conversion . no-conversion)
		"\\.elc\\'"		utf-8-emacs
		"\\.el\\'"		prefer-utf-8
		"\\.utf\\(-8\\)?\\'"	utf-8
		"\\.xml\\'"		xml-find-file-coding-system
		"\\(\\`\\|/\\)loaddefs.el\\'"
					(raw-text . raw-text-unix)
		"\\.tar\\'"		(no-conversion . no-conversion)
		"\\.po[tx]?\\'\\|\\.po\\."
					po-find-file-coding-system
		"\\.\\(tex\\|ltx\\|dtx\\|drv\\)\\'"
					latexenc-find-file-coding-system
		""			(undecided)
  Process I/O	nothing specified
  Network I/O	nothing specified

[back]


So in short: this is a rather complex domain.  And Elisp, as a
text-manipulating platform, has a whole lot of tools and bells and
whistles to deal with it well enough that you usually won't even notice.

It took a number of years to arrive there and caused the last large
migration to XEmacs.

-- 
David Kastrup




* Re: guile can't find a chinese named file
  2017-01-30 18:32             ` David Kastrup
@ 2017-01-30 18:50               ` Eli Zaretskii
  2017-01-30 19:00                 ` David Kastrup
  2017-01-30 19:01               ` Marko Rauhamaa
  1 sibling, 1 reply; 110+ messages in thread
From: Eli Zaretskii @ 2017-01-30 18:50 UTC (permalink / raw)
  To: David Kastrup; +Cc: guile-user

> From: David Kastrup <dak@gnu.org>
> Date: Mon, 30 Jan 2017 19:32:14 +0100
> Cc: guile-user@gnu.org
> 
> Emacs uses an UTF-8 based encoding internally: basically, valid UTF-8 is
> represented as itself, there is a number of coding points beyond the
> actual limit of UTF-8 that is used for non-Unicode character sets, and
> single bytes not properly belonging to the read encoding are represented
> with 0x00...0x7f, 0xc0 0x80 ... 0xc0 0xbf and 0xc1 0x80 ... 0xbf (the
> latter two ranges are "overlong" encodings of 0x00...0x7f and
> consequently also not valid utf-8).

One other crucial detail is that Emacs also has unibyte strings
(arrays of bytes), which are necessary during startup, when Emacs
doesn't yet know how to decode non-ASCII strings.  Without that, you
wouldn't be able to start Emacs in a directory whose name includes
non-ASCII characters, because it couldn't access files it needs to
read to set up some of its decoding machinery.




* Re: guile can't find a chinese named file
  2017-01-30 18:50               ` Eli Zaretskii
@ 2017-01-30 19:00                 ` David Kastrup
  2017-01-30 19:32                   ` Eli Zaretskii
  0 siblings, 1 reply; 110+ messages in thread
From: David Kastrup @ 2017-01-30 19:00 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: guile-user

Eli Zaretskii <eliz@gnu.org> writes:

>> From: David Kastrup <dak@gnu.org>
>> Date: Mon, 30 Jan 2017 19:32:14 +0100
>> Cc: guile-user@gnu.org
>> 
>> Emacs uses an UTF-8 based encoding internally: basically, valid UTF-8 is
>> represented as itself, there is a number of coding points beyond the
>> actual limit of UTF-8 that is used for non-Unicode character sets, and
>> single bytes not properly belonging to the read encoding are represented
>> with 0x00...0x7f, 0xc0 0x80 ... 0xc0 0xbf and 0xc1 0x80 ... 0xbf (the
>> latter two ranges are "overlong" encodings of 0x00...0x7f and
>> consequently also not valid utf-8).
>
> One other crucial detail is that Emacs also has unibyte strings
> (arrays of bytes), which are necessary during startup, when Emacs
> doesn't yet know how to decode non-ASCII strings.  Without that, you
> wouldn't be able to start Emacs in a directory whose name includes
> non-ASCII characters, because it couldn't access files it needs to
> read to set up some of its decoding machinery.

Hm, I know that XEmacs-Mule emphatically does not have unibyte strings
(and Stephen considers them a complication and abomination that should
never have been left in Emacs), so it must be possible to get away
without them.  And I don't think that the comparatively worse Mule
implementation of XEmacs is due to that decision.

-- 
David Kastrup



^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: guile can't find a chinese named file
  2017-01-30 18:32             ` David Kastrup
  2017-01-30 18:50               ` Eli Zaretskii
@ 2017-01-30 19:01               ` Marko Rauhamaa
  2017-01-30 19:27                 ` David Kastrup
  2017-01-30 19:41                 ` Eli Zaretskii
  1 sibling, 2 replies; 110+ messages in thread
From: Marko Rauhamaa @ 2017-01-30 19:01 UTC (permalink / raw)
  To: David Kastrup; +Cc: guile-user

David Kastrup <dak@gnu.org>:

> Marko Rauhamaa <marko@pacujo.net> writes:
>> Guile's mistake was to move to Unicode strings in the operating system
>> interface.
>
> Emacs uses an UTF-8 based encoding internally [...]

C uses 8-bit characters. That is a model worth emulating.

UTF-8 beautifully bridges the interpretation gap between 8-bit character
strings and text. However, the interpretation step should be done in the
application and not in the programming language. Support libraries for
Unicode are naturally welcome.

Plain Unicode text is actually quite a rare programming need. It is
woefully inadequate for the human interface, which generally requires
numerous other typesetting effects. But it is also causing unnecessary
grief in the computer-computer interface, where the classic textual
naming and textual protocols are actually cutely chosen octet-aligned
binary formats.


Marko



^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: guile can't find a chinese named file
  2017-01-30 19:01               ` Marko Rauhamaa
@ 2017-01-30 19:27                 ` David Kastrup
  2017-02-14 20:10                   ` Linas Vepstas
  2017-01-30 19:41                 ` Eli Zaretskii
  1 sibling, 1 reply; 110+ messages in thread
From: David Kastrup @ 2017-01-30 19:27 UTC (permalink / raw)
  To: Marko Rauhamaa; +Cc: guile-user

Marko Rauhamaa <marko@pacujo.net> writes:

> David Kastrup <dak@gnu.org>:
>
>> Marko Rauhamaa <marko@pacujo.net> writes:
>>> Guile's mistake was to move to Unicode strings in the operating system
>>> interface.
>>
>> Emacs uses an UTF-8 based encoding internally [...]
>
> C uses 8-bit characters. That is a model worth emulating.

That's Guile-1.8.  Guile-2 uses either Latin-1 or UCS-32 in its string
internals, either Latin-1 or UTF-8 in its string API, and UTF-8 in its
string port internals.

> UTF-8 beautifully bridges the interpretation gap between 8-bit
> character strings and text. However, the interpretation step should be
> done in the application and not in the programming language.

Elisp is focused enough about text that I think its choice of going
UTF-8 internally with a Unicode character type reasonably sane.  Its
strings (the quirky unibyte strings excluded) are its own variant of
UTF-8 internally, and its string port equivalent (buffers) are that same
variant of UTF-8.  And its API talks UTF-8 for strings, Unicode (or
higher) for characters, and it indexes strings and buffers via Unicode
character counts.  Not O(1), but with enough trickery that it works well
enough in practice.  If strings are to be implemented strictly
Scheme-standard-conforming, they need to be O(1) indexable.  The Scheme
standard is rather silent about Unicode however.  I am not sure that
sticking to the standard where it does not deal with reality is the best
choice.

I think the case for Guile-2 to _also_ support "unibyte strings" would
be considerably stronger than for Emacs (byte arrays and binary string ports
don't allow using Guile's string processing functions).  As it stands,
the design of Guile-2 in my book currently involves too many mandatory
conversions for just passing data around with Guile itself and
Guile-based applications.

> Support libraries for Unicode are naturally welcome.
>
> Plain Unicode text is actually quite a rare programming need. It is
> woefully inadequate for the human interface, which generally requires
> numerous other typesetting effects. But it is also causing unnecessary
> grief in the computer-computer interface, where the classic textual
> naming and textual protocols are actually cutely chosen octet-aligned
> binary formats.

Sometimes yes, sometimes not.  As long as Guile wants to be a
general-purpose programming and extension language, it should deal
reliably and robustly and reproducibly with whatever is thrown at it.
Its choice of libraries does not currently make it so, but that could be
fixed by either working on the (GNU) libraries or by giving Guile its
own implementation.

But that needs to be considered a priority.  Nobody will do this just
for fun and kicks.

-- 
David Kastrup



^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: guile can't find a chinese named file
  2017-01-30 19:00                 ` David Kastrup
@ 2017-01-30 19:32                   ` Eli Zaretskii
  2017-01-30 19:59                     ` Eli Zaretskii
  0 siblings, 1 reply; 110+ messages in thread
From: Eli Zaretskii @ 2017-01-30 19:32 UTC (permalink / raw)
  To: David Kastrup; +Cc: guile-user

> From: David Kastrup <dak@gnu.org>
> Cc: marko@pacujo.net,  guile-user@gnu.org
> Date: Mon, 30 Jan 2017 20:00:03 +0100
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> > One other crucial detail is that Emacs also has unibyte strings
> > (arrays of bytes), which are necessary during startup, when Emacs
> > doesn't yet know how to decode non-ASCII strings.  Without that, you
> > wouldn't be able to start Emacs in a directory whose name includes
> > non-ASCII characters, because it couldn't access files it needs to
> > read to set up some of its decoding machinery.
> 
> Hm, I know that XEmacs-Mule emphatically does not have unibyte strings
> (and Stephen considers them a complication and abomination that should
> never have been left in Emacs), so it must be possible to get away
> without them.

I doubt that's possible, at least not in general.  (You could get away
if you assumed UTF-8 encoded file names.)  Some translation tables for
some encodings must load files using the likes of load-path, and if
that includes non-ASCII file names, you are screwed unless you can use
unibyte strings.  That is why all Emacs primitives that accept file
names support both unibyte and multibyte strings as file names.

> And I don't think that the comparatively worse Mule implementation
> of XEmacs is due to that decision.

Emacs 20 vintage Mule didn't have all the sophisticated Unicode
support machinery we have today, so maybe for that subset the above
wasn't necessary.  Then again, Emacs couldn't be safely built or
started in a non-ASCII directory until just a few years ago, so
perhaps no one bothered to test that thoroughly with XEmacs, except in
ISO 2022 locales.



^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: guile can't find a chinese named file
  2017-01-30 19:01               ` Marko Rauhamaa
  2017-01-30 19:27                 ` David Kastrup
@ 2017-01-30 19:41                 ` Eli Zaretskii
  2017-01-30 20:46                   ` Marko Rauhamaa
  1 sibling, 1 reply; 110+ messages in thread
From: Eli Zaretskii @ 2017-01-30 19:41 UTC (permalink / raw)
  To: Marko Rauhamaa; +Cc: guile-user, dak

> From: Marko Rauhamaa <marko@pacujo.net>
> Date: Mon, 30 Jan 2017 21:01:31 +0200
> Cc: guile-user@gnu.org
> 
> UTF-8 beautifully bridges the interpretation gap between 8-bit character
> strings and text. However, the interpretation step should be done in the
> application and not in the programming language.

You can't do that in an environment that specifically targets
sophisticated multi-lingual text processing independent of the outside
locale.  Unless you can interpret byte sequences as characters, you
will be unable to even count characters in a range of text, let alone
render it for display.  And you cannot request applications to do
those low-level chores.

> Support libraries for Unicode are naturally welcome.

Well, in that case Emacs core is one huge "support library".  And I
don't see why Guile couldn't be another one; it should, IMO.

> Plain Unicode text is actually quite a rare programming need. It is
> woefully inadequate for the human interface, which generally requires
> numerous other typesetting effects.

You do need "other typesetting effects", naturally, but that doesn't
mean you can get away without more or less full support of Unicode
nowadays.  You are talking about programming, but we should instead
think about applications -- those of them which need to process text,
or even access files, as this discussion shows, do need decent Unicode
support.  E.g., users generally expect that decomposed and composed
character sequences behave and are treated identically, although they
are different byte-stream wise.
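
A small Python illustration of the composed/decomposed point (Python is
used here only for convenience; same_text is a made-up helper name):

```python
import unicodedata

composed = "\u00e9"      # 'é' as a single code point
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```

The two strings differ as code point sequences and as UTF-8 bytes
(0xc3 0xa9 versus 0x65 0xcc 0x81), yet same_text treats them as equal.
That comparison needs the bytes interpreted as characters first.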

> But it is also causing unnecessary grief in the computer-computer
> interface, where the classic textual naming and textual protocols
> are actually cutely chosen octet-aligned binary formats.

The universal acceptance of UTF-8 nowadays makes this much less of an
issue, IME.



^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: guile can't find a chinese named file
  2017-01-30 19:32                   ` Eli Zaretskii
@ 2017-01-30 19:59                     ` Eli Zaretskii
  2017-01-30 20:42                       ` Mike Gran
  0 siblings, 1 reply; 110+ messages in thread
From: Eli Zaretskii @ 2017-01-30 19:59 UTC (permalink / raw)
  To: dak; +Cc: guile-user

> Date: Mon, 30 Jan 2017 21:32:41 +0200
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: guile-user@gnu.org
> 
> > Hm, I know that XEmacs-Mule emphatically does not have unibyte strings
> > (and Stephen considers them a complication and abomination that should
> > never have been left in Emacs), so it must be possible to get away
> > without them.
> 
> I doubt that's possible, at least not in general.  (You could get away
> if you assumed UTF-8 encoded file names.)  Some translation tables for
> some encodings must load files using the likes of load-path, and if
> that includes non-ASCII file names, you are screwed unless you can use
> unibyte strings.

Actually, the need arises even sooner.  Consider how load-path is set
up during startup: it starts with the directory from which Emacs was
invoked, either from argv[0] or by looking up PATH.  Either way, you
get a file name that is encoded in the locale-specific encoding.  Then
you cons load-path by expanding file names relative to the startup
directory.  So you immediately need to be able to create file names
from directories, check whether a file exists and is a directory,
etc. -- all of that before you even know in what locale you started,
so you cannot decode these file names into the internal
representation, before using them.
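
As a rough sketch of what "file names as unibyte strings" buys you
during startup, here is the same idea in Python, where path functions
accept bytes.  The function name and the directory names are invented
for the example; this is not how Emacs does it internally.

```python
import posixpath


def build_load_path(startup_dir: bytes, subdirs) -> list:
    """Join directory names as raw bytes, with no decoding step.

    Works before the locale is known because the names are never
    interpreted as characters.
    """
    return [posixpath.join(startup_dir, d) for d in subdirs]
```

For example, build_load_path(b"/opt/caf\xe9/emacs", [b"lisp"]) returns
[b"/opt/caf\xe9/emacs/lisp"] even though 0xe9 on its own is not valid
UTF-8.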



^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: guile can't find a chinese named file
  2017-01-30 19:59                     ` Eli Zaretskii
@ 2017-01-30 20:42                       ` Mike Gran
  2017-01-31  3:31                         ` Eli Zaretskii
  0 siblings, 1 reply; 110+ messages in thread
From: Mike Gran @ 2017-01-30 20:42 UTC (permalink / raw)
  To: Eli Zaretskii, dak@gnu.org; +Cc: guile-user@gnu.org


On Monday, January 30, 2017 12:00 PM, Eli Zaretskii <eliz@gnu.org> wrote:
> Actually, the need arises even sooner.  Consider how load-path is set
> up during startup: it starts with the directory from which Emacs was
> invoked, either from argv[0] or by looking up PATH.  Either way, you
> get a file name that is encoded in the locale-specific encoding.  Then
> you cons load-path by expanding file names relative to the startup
> directory.  So you immediately need to be able to create file names
> from directories, check whether a file exists and is a directory,
> etc. -- all of that before you even know in what locale you started,
> so you cannot decode these file names into the internal
> representation, before using them.

Earlier in the 2.0.x release series, Guile had a hack where it started
up in a Latin-1 encoding, which would be capable of storing any
8-bit string of bytes, even if they weren't Latin-1.  I was the author
of the first version of that hack.  Anyway, while it was technically
incorrect, it did get the job done for some of these locale-free
byte string problems.  It could open non-ASCII paths without really
having an encoding, if I recall correctly.


It was an uneasy middle ground, tho.  Error messages with regard
to file names would be mojibake.  And string ports were a mess.  And
what was supposed to happen after setlocale was called?
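
The effect is easy to reproduce outside Guile.  A minimal Python
sketch, with made-up helper names, assuming the "8-bit encoding" maps
byte N to code point N:

```python
def bytes_to_narrow(data: bytes) -> str:
    """Store arbitrary bytes as code points 0..255 (never fails)."""
    return data.decode("latin-1")


def narrow_to_bytes(text: str) -> bytes:
    """Recover the original bytes from such a string."""
    return text.encode("latin-1")


def redecode(text: str, encoding: str) -> str:
    """Reinterpret the stored bytes once the real encoding is known."""
    return narrow_to_bytes(text).decode(encoding)
```

Every byte string round-trips, so the file can be opened.  But a UTF-8
name such as filename_名字.scm shows up as 'filename_å\x90\x8då\xad\x97.scm'
until it is redecoded, which is the mojibake mentioned above.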


As an aside, GTK and GLIB based applications often use a method where
you may need to set the environment variable G_FILENAME_ENCODING
if your filename encoding is different from your locale encoding.
GTK/GLIB also likes to store strings internally as UTF-8, and will
convert to UTF-8 from either the locale or the G_FILENAME_ENCODING-
specified encoding.

As another aside, OpenBSD removed support for non-UTF8 locales.


-Mike Gran



^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: guile can't find a chinese named file
  2017-01-30 19:41                 ` Eli Zaretskii
@ 2017-01-30 20:46                   ` Marko Rauhamaa
  2017-01-31 12:20                     ` tomas
  0 siblings, 1 reply; 110+ messages in thread
From: Marko Rauhamaa @ 2017-01-30 20:46 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: guile-user, dak

Eli Zaretskii <eliz@gnu.org>:

>> From: Marko Rauhamaa <marko@pacujo.net>
>> 
>> UTF-8 beautifully bridges the interpretation gap between 8-bit character
>> strings and text. However, the interpretation step should be done in the
>> application and not in the programming language.
>
> You can't do that in an environment that specifically targets
> sophisticated multi-lingual text processing independent of the outside
> locale.  Unless you can interpret byte sequences as characters, you
> will be unable to even count characters in a range of text,

If you need to operate on Unicode text, have the application invoke the
UTF-8 (or locale-specific) decoder. However, have the application
request it instead of guessing that the environment is all Unicode.

> You do need "other typesetting effects", naturally, but that doesn't
> mean you can get away without more or less full support of Unicode
> nowadays.

Do support it, fully even, but let the application invoke the
conversion when appropriate.

> You are talking about programming, but we should instead think about
> applications -- those of them which need to process text, or even
> access files, as this discussion shows, do need decent Unicode
> support.

Why should opening a file require Unicode support if the underlying
operating system knows nothing about Unicode? I can open any given
file in a tiny C program without any Unicode support, under Linux, that
is.
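
The same can be shown in Python, whose file functions also accept byte
strings.  A small sketch (roundtrip_by_bytes is an invented name, and
the temporary directory is only there to keep the example
self-contained):

```python
import os
import tempfile


def roundtrip_by_bytes(name: bytes, payload: bytes):
    """Create, list and re-read a file whose name is handled as bytes only."""
    directory = os.fsencode(tempfile.mkdtemp())
    path = os.path.join(directory, name)
    with open(path, "wb") as f:
        f.write(payload)
    names = os.listdir(directory)      # bytes in, bytes out
    with open(path, "rb") as f:
        data = f.read()
    os.remove(path)
    os.rmdir(directory)
    return names, data
```

Nothing in this function decodes the name; it is passed to the
operating system exactly as given.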

> E.g., users generally expect that decomposed and composed character
> sequences behave and are treated identically, although they are
> different byte-stream wise.

Linux begs to differ. Regardless of the locale, two different octet
sequences that ought to be equivalent UTF-8-wise will be considered
different pathnames under Linux.

I don't need a helicopter to walk across the street.

>> But it is also causing unnecessary grief in the computer-computer
>> interface, where the classic textual naming and textual protocols
>> are actually cutely chosen octet-aligned binary formats.
>
> The universal acceptance of UTF-8 nowadays makes this much less of an
> issue, IME.

You are jumping the gun. Linux won't be there for a long time if ever.
Nothing prevents a pathname, or a command-line argument, or an
environment variable, or the standard input from containing illegal
UTF-8.
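
One existing way to carry such input through a Unicode string type
without losing bytes is Python's "surrogateescape" error handler
(PEP 383).  It is shown here purely as an illustration of the idea,
not as something Guile offers; to_text and to_bytes are made-up names.

```python
def to_text(raw: bytes) -> str:
    """Decode as UTF-8, mapping each undecodable byte to a lone surrogate."""
    return raw.decode("utf-8", errors="surrogateescape")


def to_bytes(text: str) -> bytes:
    """Encode back, turning the escape surrogates into the original bytes."""
    return text.encode("utf-8", errors="surrogateescape")
```

With this, the Latin-1 encoded name b"caf\xe9.txt" becomes
"caf\udce9.txt" and converts back to exactly the same bytes, while a
strict decoder would raise an error on it.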

I also wouldn't like my SMTP server to throw a UTF-8 decoding exception
on parsing a command.

(Also note that even Windows allows pathnames with illegal Unicode in
them if I'm not mistaken.)


Marko



^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: guile can't find a chinese named file
  2017-01-30 20:42                       ` Mike Gran
@ 2017-01-31  3:31                         ` Eli Zaretskii
  2017-01-31  6:16                           ` Mike Gran
  2017-01-31  8:51                           ` David Kastrup
  0 siblings, 2 replies; 110+ messages in thread
From: Eli Zaretskii @ 2017-01-31  3:31 UTC (permalink / raw)
  To: Mike Gran; +Cc: guile-user, dak

> Date: Mon, 30 Jan 2017 20:42:38 +0000 (UTC)
> From: Mike Gran <spk121@yahoo.com>
> Cc: "guile-user@gnu.org" <guile-user@gnu.org>
> 
> Earlier in the 2.0.x release series, Guile had a hack where it started
> up in a Latin-1 encoding, which would be capable of storing any
> 8-bit string of bytes, even if they weren't Latin-1.

Latin-1 has holes in the 0..255 range, so it isn't very appropriate in
this situation.

> And what was supposed to happen after setlocale was called?

What Emacs does is explicitly decode any variable produced until that
moment that is known to hold unibyte strings.

> As an aside, GTK and GLIB based applications often use a method where
> you may need to set the environment variable G_FILENAME_ENCODING
> if your filename encoding is different from your locale encoding.
> GTK/GLIB also likes to store strings internally as UTF-8, and will
> convert to UTF-8 from either the locale or the G_FILENAME_ENCODING-
> specified encoding.

Emacs stores all environment variables in their original
locale-specific encoding, as unibyte strings, and only decodes them
when they are actually used or handed to Lisp.



^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: guile can't find a chinese named file
  2017-01-31  3:31                         ` Eli Zaretskii
@ 2017-01-31  6:16                           ` Mike Gran
  2017-01-31  8:51                           ` David Kastrup
  1 sibling, 0 replies; 110+ messages in thread
From: Mike Gran @ 2017-01-31  6:16 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: guile-user@gnu.org, dak@gnu.org

On Monday, January 30, 2017 7:31 PM, Eli Zaretskii <eliz@gnu.org> wrote:

> Latin-1 has holes in the 0..255 range, so it isn't very appropriate in
> this situation.

I was being imprecise.  Internally, in Guile, if a string consists of
Unicode codepoints zero to 255, it is stored as what Guile calls a
"narrow" string.  This is still true in 2.1.x, I think.

The first 256 codepoints of Unicode consist of C0 controls, the
US-ASCII set, the C1 controls from ECMA-48, and the right hand part
of Latin-1.

In early 2.0.x Guile versions, before setlocale was called, Guile would
map unspecified 8-bit clean file paths to this 8-bit encoding
that consisted of the first 256 codepoints in the Unicode standard.  

> Emacs stores all environment variables in their original
> locale-specific encoding, as unibyte strings, and only decodes them
> when they are actually used or handed to Lisp.

Another method is that of Perl6, where all strings are UTF-8.  Perl6
on MoarVM assumes all environment variables and strings are UTF-8
unless otherwise specified, but it uses its utf8-c8 encoding, which
can encode and decode invalid UTF-8.



^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: guile can't find a chinese named file
  2017-01-31  3:31                         ` Eli Zaretskii
  2017-01-31  6:16                           ` Mike Gran
@ 2017-01-31  8:51                           ` David Kastrup
  1 sibling, 0 replies; 110+ messages in thread
From: David Kastrup @ 2017-01-31  8:51 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: guile-user

Eli Zaretskii <eliz@gnu.org> writes:

>> Date: Mon, 30 Jan 2017 20:42:38 +0000 (UTC)
>> From: Mike Gran <spk121@yahoo.com>
>> Cc: "guile-user@gnu.org" <guile-user@gnu.org>
>> 
>> Earlier in the 2.0.x release series, Guile had a hack where it started
>> up in a Latin-1 encoding, which would be capable of storing any
>> 8-bit string of bytes, even if they weren't Latin-1.
>
> Latin-1 has holes in the 0..255 range, so it isn't very appropriate in
> this situation.

What Guile calls "Latin-1" is really unibyte.

-- 
David Kastrup



^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: guile can't find a chinese named file
  2017-01-30 20:46                   ` Marko Rauhamaa
@ 2017-01-31 12:20                     ` tomas
  0 siblings, 0 replies; 110+ messages in thread
From: tomas @ 2017-01-31 12:20 UTC (permalink / raw)
  To: guile-user

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Mon, Jan 30, 2017 at 10:46:26PM +0200, Marko Rauhamaa wrote:

[...]

> You are jumping the gun. Linux won't be there for a long time if ever.
> Nothing prevents a pathname, or a command-line argument, or an
> environment variable, or the standard input from containing illegal
> UTF-8.

To put more emphasis on this point: as things stand, you won't be able
to avoid having paths with illegal UTF-8.

Consider mounting a file system with UTF-8 file names over one
containing Latin-1 file names. Then even if both are consistent
in themselves (a far stretch, because there might be multiple apps
under multiple locales creating and naming files), the combined
path might be bad UTF-8.

As long as the OS keeps out of this business I guess the application
will need a "layer" from which to look at paths (and env variables
and argv and all that) as mere byte sequences, before any interpretation
takes place.

> (Also note that even Windows allows pathnames with illegal Unicode in
> them if I'm not mistaken.)

This is so Microsoft: the file name character set in the file system
is Unicode... except when it isn't :-)

regards
- -- t
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)

iEYEARECAAYFAliQgRgACgkQBcgs9XrR2kaOFwCeLo5oUgkTM9NPWo2aK+1SRObY
3yoAniTL3HgxPqh1mdXMx774fmohmnYb
=YuFv
-----END PGP SIGNATURE-----



^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: guile can't find a chinese named file
  2017-01-30 17:58           ` Marko Rauhamaa
  2017-01-30 18:32             ` David Kastrup
@ 2017-02-14 19:58             ` Linas Vepstas
  2017-02-26 21:33               ` Andy Wingo
  1 sibling, 1 reply; 110+ messages in thread
From: Linas Vepstas @ 2017-02-14 19:58 UTC (permalink / raw)
  To: Marko Rauhamaa; +Cc: Guile User, David Kastrup

On Mon, Jan 30, 2017 at 11:58 AM, Marko Rauhamaa <marko@pacujo.net> wrote:
> David Kastrup <dak@gnu.org>:
>
>> But at any rate, this cannot easily be fixed since Guile uses libraries
>> for encoding/decoding that cannot deal reproducibly with improper byte
>> patterns.
>
> Guile's mistake was to move to Unicode strings in the operating system
> interface.

Guile's mistake is that it does lots of pointless conversions from utf8 strings
to wide-char arrays, and back, which is a) a cpu suck, and b) a breeding
ground for bugs.   The current 2.1 guile, in git as of a few weeks ago, has
multiple utf8 handling bugs.

I believe most or all of these bugs are "internal", fixable without any
changes to the API or user code.   It's just that the utf8 unit tests for guile
are weak, and don't test some of the common usages, and thus allow bugs
to breed.

>> The problem here is that Guile cannot even deal with _properly_
>> encoded UTF-8 file names on the command line.
>
> Ok.

Well, yes, that too.

--linas



^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: guile can't find a chinese named file
  2017-01-30 19:27                 ` David Kastrup
@ 2017-02-14 20:10                   ` Linas Vepstas
  2017-02-14 20:54                     ` Mike Gran
  2017-02-14 22:26                     ` Ludovic Courtès
  0 siblings, 2 replies; 110+ messages in thread
From: Linas Vepstas @ 2017-02-14 20:10 UTC (permalink / raw)
  To: David Kastrup; +Cc: Guile User

On Mon, Jan 30, 2017 at 1:27 PM, David Kastrup <dak@gnu.org> wrote:
> Marko Rauhamaa <marko@pacujo.net> writes:
>> David Kastrup <dak@gnu.org>:
>>> Marko Rauhamaa <marko@pacujo.net> writes:
>>>> Guile's mistake was to move to Unicode strings in the operating system
>>>> interface.
>>>
>>> Emacs uses an UTF-8 based encoding internally [...]
>>
>> C uses 8-bit characters. That is a model worth emulating.
>
> That's Guile-1.8.  Guile-2 uses either Latin-1 or UCS-32 in its string
> internals, either Latin-1 or UTF-8 in its string API, and UTF-8 in its
> string port internals.

Which seems to be a bad decision. I've got strings, 10MBytes long, holding
chinese in UTF8, and guile converts these internally, to UCS-32 which is a
complete and total waste of CPU time. WTF.  It then has to  convert them
back to UTF8 before passing them to my C++ code that actually does stuff
with them.

All I get for this design decision is poor performance, and endless
complaints from boehm-gc:
"GC Warning: Repeated allocation of very large block (appr. size 3944448):
        May lead to memory leak and poor performance."

It's just not very friendly.

Although I am complaining, I do have vague plans on submitting patches to
fix this; I've read through the guile source, and can see where the code
does this.  Reading the code, it's written in a
late-night-not-thinking-clearly coding style, and should be relatively easy
to fix.
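
To put a number on the size cost of that conversion, here is a quick
Python check (encoded_sizes is just an illustrative helper; "utf-32-le"
is used so that no byte-order mark is counted):

```python
def encoded_sizes(text: str):
    """Return (size in UTF-8 bytes, size in UTF-32 bytes) for a string."""
    return len(text.encode("utf-8")), len(text.encode("utf-32-le"))
```

Common Chinese characters take 3 bytes in UTF-8 and 4 in a 32-bit
representation, so text made only of such characters grows by a factor
of 4/3 (10 MB becomes roughly 13.3 MB), and every ASCII character in
the same string grows by a factor of 4.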

--linas


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: guile can't find a chinese named file
  2017-02-14 20:10                   ` Linas Vepstas
@ 2017-02-14 20:54                     ` Mike Gran
  2017-02-14 21:07                       ` Marko Rauhamaa
  2017-02-14 23:58                       ` David Kastrup
  2017-02-14 22:26                     ` Ludovic Courtès
  1 sibling, 2 replies; 110+ messages in thread
From: Mike Gran @ 2017-02-14 20:54 UTC (permalink / raw)
  To: linasvepstas@gmail.com, David Kastrup; +Cc: Guile User


On Tuesday, February 14, 2017 12:11 PM, Linas Vepstas <linasvepstas@gmail.com> wrote:

> Which seems to be a bad decision. I've got strings, 10MBytes long, holding
> chinese in UTF8, and guile converts these internally, to UCS-32 which is a
> complete and total waste of CPU time. WTF.  It then has to  convert them
> back to UTF8 before passing them to my C++ code that actually does stuff
> with them.

> All I get for this design decision is poor performance, and endless
> complaints from boehm-gc:

I almost hate to wade in here, because no matter what I say, the
response is likely to be withering.

But, for what it is worth, the Latin-1/UCS-32 design decision came from
a couple of conflicting requirements.  The switch happened in the 1.9.x
series.


There were several examples of legacy C code using Guile as an extension
language that accessed the bytes of a string directly, using
SCM_STRING_CHARS or scm_i_string_chars.  To keep from breaking legacy code,
we needed to retain this (then already deprecated) capability to have C
programs access 8-bit-locale string internals directly.

Also, in R6RS, there was the requirement that functions like "string-ref"
act in "constant time". This suggested either a codepoint-array
representation for strings, or a UTF-8 array representation with some
indexing to allow for constant-time access.

Note that the constant time access requirement was dropped in R7RS, if
I understand it correctly.

Guile wasn't the only language to make this decision.  Python strings
are similar, as you can see in PEP 393, though Guile's usage of such
an encoding scheme came first.
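
The selection rule behind such a scheme is simple.  A sketch in Python
of the two-width variant described in this thread (storage_width is a
hypothetical name, not a Guile function; PEP 393 itself also has a
2-byte form, which is left out here):

```python
def storage_width(text: str) -> int:
    """Bytes per character: 1 if every code point is below 256, else 4."""
    return 1 if all(ord(ch) < 0x100 for ch in text) else 4
```

So "filename.scm" and "café" would be stored narrow, while a single
character such as 名 forces the whole string into the wide form.  In
both forms the N-th character sits at a fixed offset, which is what
gives constant-time string-ref.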

I still maintain that this design decision was a good one based on
the simplicity of implementation.  When I helped out with the coding of the
Unicode support, I had three different prototypes: a UTF-32-only
Guile, a UTF-8 Guile, and the current scheme.

The great difficulty with the UTF-8 Guile prototype was the need to
interrogate every string access or index to decide if it was a codepoint
index or a byte index. I abandoned that effort because it was doing my
head in.  Had we chosen that route, the result would likely have been
a long, long process of squashing difficult bugs related to byte vs
codepoint index confusion.

But, for what it is worth, we've had a few years of the internal
representation of strings being private, so any modification of the
internal representation of strings would be easier in 2017 than it
was in 2007, when the guts of strings were exposed to the C
API.


Thanks,
Mike

(N.B. dak at gnu is on my block list, so I won't see any such response.)



^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: guile can't find a chinese named file
  2017-02-14 20:54                     ` Mike Gran
@ 2017-02-14 21:07                       ` Marko Rauhamaa
  2017-02-14 21:52                         ` Mike Gran
  2017-02-14 23:58                       ` David Kastrup
  1 sibling, 1 reply; 110+ messages in thread
From: Marko Rauhamaa @ 2017-02-14 21:07 UTC (permalink / raw)
  To: Mike Gran; +Cc: David Kastrup, Guile User

Mike Gran <spk121@yahoo.com>:

> The great difficulty with the UTF-8 Guile prototype was the need to
> interrogate every string access or index to decide if it was a
> codepoint index or a byte index.

Unicode strings are a special data type that have relatively little
practical use. Byte strings are much more fundamental. C's "char *" is
perfect.

In particular, filenames are *not*, nor can they be mapped to, Unicode
strings in Linux.


Marko



^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: guile can't find a chinese named file
  2017-02-14 21:07                       ` Marko Rauhamaa
@ 2017-02-14 21:52                         ` Mike Gran
  2017-02-14 22:12                           ` Marko Rauhamaa
  2017-02-14 22:19                           ` Chris Vine
  0 siblings, 2 replies; 110+ messages in thread
From: Mike Gran @ 2017-02-14 21:52 UTC (permalink / raw)
  To: Marko Rauhamaa; +Cc: David Kastrup, Guile User






On Tuesday, February 14, 2017 1:07 PM, Marko Rauhamaa <marko@pacujo.net> wrote:

> Unicode strings are a special data type that have relatively little
> practical use. Byte strings are much more fundamental. C's "char *" is
> perfect.


Human language itself is of limited practical use except for
communicating information to people that read languages that have
a text representation.

> In particular, filenames are *not*, nor can they be mapped to, Unicode
> strings in Linux.

True. Linux should follow OpenBSD and make all locales UTF-8.



^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: guile can't find a chinese named file
  2017-02-14 21:52                         ` Mike Gran
@ 2017-02-14 22:12                           ` Marko Rauhamaa
  2017-02-14 22:19                           ` Chris Vine
  1 sibling, 0 replies; 110+ messages in thread
From: Marko Rauhamaa @ 2017-02-14 22:12 UTC (permalink / raw)
  To: Mike Gran; +Cc: David Kastrup, Guile User

Mike Gran <spk121@yahoo.com>:

> On Tuesday, February 14, 2017 1:07 PM, Marko Rauhamaa
> <marko@pacujo.net> wrote:
>> Unicode strings are a special data type that have relatively little
>> practical use. Byte strings are much more fundamental. C's "char *"
>> is perfect.
>
> Human language itself is of limited practical use except for
> communicating information to people that read languages that have a
> text representation.

Unicode is useful, don't get me wrong. However, Unicode is not the same
as "human language itself". Unicode is a huge can of worms, and yet not
big enough. It is best reserved for the use of text-processing
applications. It shouldn't be shoved down the throat of each and every
application.

A much more fundamental data type is the byte string, which can
represent many things, including Unicode. With UTF-8, I mostly don't
need an interpretive step to deal with plain text. Sure, I can't know
the visual width of my plain text string, but it's not simply the number
of Unicode code points, either, because of diacritics and other similar
complications.
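
To illustrate the last point with a small Python sketch (Python is
used here only for illustration; nothing below is Guile API): byte
length, code point count and the number of visible characters are
three different numbers even for a single accented letter.  Counting
non-combining code points is only a rough stand-in for visual width.

```python
import unicodedata

# "é" written as a base letter plus a combining acute accent (U+0301).
text = "e\u0301"

byte_length = len(text.encode("utf-8"))   # bytes in the UTF-8 form
code_points = len(text)                   # Unicode code points
# Count only code points that are not combining marks: a rough
# approximation of "visible characters", not a real width computation.
base_chars = sum(1 for ch in text if unicodedata.combining(ch) == 0)

print(byte_length, code_points, base_chars)   # prints "3 2 1"
```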

>> In particular, filenames are *not*, nor can they be mapped to,
>> Unicode strings in Linux.
>
> True. Linux should follow OpenBSD and make all locales UTF-8.

Maybe, but Guile should wait until Linux has made the transition. There
are no signs of such a transition at the moment. Linux deals in bytes
and couldn't care less about interpreting those bytes.


Marko




* Re: guile can't find a chinese named file
  2017-02-14 21:52                         ` Mike Gran
  2017-02-14 22:12                           ` Marko Rauhamaa
@ 2017-02-14 22:19                           ` Chris Vine
  2017-02-15  7:15                             ` Marko Rauhamaa
  2017-02-15  9:18                             ` tomas
  1 sibling, 2 replies; 110+ messages in thread
From: Chris Vine @ 2017-02-14 22:19 UTC (permalink / raw)
  To: guile-user

On Tue, 14 Feb 2017 21:52:01 +0000 (UTC)
Mike Gran <spk121@yahoo.com> wrote:
[snip]
> > In particular, filenames are *not*, nor can they be mapped to,
> > Unicode strings in Linux.
> 
> True. Linux should follow OpenBSD and make all locales UTF-8.

Filenames and locales are not necessarily related.  When you access a
networked file system, you get the filename encoding you are given,
which may or may not be the same as the particular locale encoding on
your particular machine on one particular day, and may or may not be a
unicode encoding.  Glib, for example, enables you to set this with the
G_FILENAME_ENCODING environmental variable, and has separate
g_filename_to_utf8() and g_filename_from_utf8() functions for this
purpose.  You can tie the filename encoding to the locale encoding by
defining the G_BROKEN_FILENAMES environmental variable but that is
deprecated (the name suggests what they think about that idea).

You may possibly agree with this: I am not clear from your post what
connection you were making between locales and filenames.  But if
OpenBSD requires all _filenames_ to be in valid UTF-8, that is a bad
decision in my view.

Linux is capable of treating filenames as just a null-terminated array
of bytes with '/' as the directory separator.  It is encoding agnostic,
and that works just fine.
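
As a sketch of how a language runtime can stay out of the way here,
Python (used only as an illustration, and assuming a POSIX system)
accepts and returns file names as bytes, and its os.fsdecode() and
os.fsencode() pair round-trips names that are not valid in the
assumed encoding.  The file name below is made up for the example.

```python
import os

# A name that is not valid UTF-8: "caf" followed by Latin-1 "é" (0xE9).
raw_name = b"caf\xe9.txt"

# os.fsdecode() turns bytes into str using the filesystem encoding; on
# POSIX it uses the "surrogateescape" error handler, so undecodable
# bytes are preserved instead of being rejected.
as_text = os.fsdecode(raw_name)

# os.fsencode() is the inverse and restores the original bytes.
round_tripped = os.fsencode(as_text)
print(round_tripped == raw_name)

# Passing a bytes path to os.listdir() yields bytes names, untouched.
names = os.listdir(b".")
print(all(isinstance(name, bytes) for name in names))
```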




* Re: guile can't find a chinese named file
  2017-02-14 20:10                   ` Linas Vepstas
  2017-02-14 20:54                     ` Mike Gran
@ 2017-02-14 22:26                     ` Ludovic Courtès
  2017-02-26 21:23                       ` Andy Wingo
  1 sibling, 1 reply; 110+ messages in thread
From: Ludovic Courtès @ 2017-02-14 22:26 UTC (permalink / raw)
  To: guile-user

Linas Vepstas <linasvepstas@gmail.com> skribis:

> On Mon, Jan 30, 2017 at 1:27 PM, David Kastrup <dak@gnu.org> wrote:
>> Marko Rauhamaa <marko@pacujo.net> writes:
>>> David Kastrup <dak@gnu.org>:
>>>> Marko Rauhamaa <marko@pacujo.net> writes:
>>>>> Guile's mistake was to move to Unicode strings in the operating system
>>>>> interface.
>>>>
>>>> Emacs uses an UTF-8 based encoding internally [...]
>>>
>>> C uses 8-bit characters. That is a model worth emulating.
>>
>> That's Guile-1.8.  Guile-2 uses either Latin-1 or UTF-32 in its string
>> internals, either Latin-1 or UTF-8 in its string API, and UTF-8 in its
>> string port internals.
>
> Which seems to be a bad decision. I've got strings, 10MBytes long, holding
> Chinese in UTF-8, and Guile converts these internally to UTF-32, which is a
> complete and total waste of CPU time. WTF.  It then has to convert them
> back to UTF-8 before passing them to my C++ code that actually does stuff
> with them.

I see this as an interaction problem: Guile 2.0 uses UTF-32 internally,
and your code uses UTF-8.  It could have been the other way around.

There were discussions to move to UTF-8 internally in 2.2.  As Mike
explained, that was not really an option in 2.0 mostly due to the
requirement to support O(1) random access.

<https://github.com/larcenists/larceny/wiki/StringRepresentations> lists
various options and the tradeoffs involved.
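
The O(1) requirement can be illustrated with a short Python sketch
(an illustration only, not Guile's implementation; the helper names
utf8_offset and utf32_offset are made up here).  Finding code point
number n in UTF-8 means scanning from the start, while in a
fixed-width encoding the offset is plain arithmetic.

```python
def utf8_offset(data: bytes, index: int) -> int:
    """Return the byte offset of code point number `index` in valid UTF-8.

    This is O(n): every byte before the target has to be inspected.
    """
    seen = 0
    for offset, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; any other byte
        # starts a new code point.
        if byte & 0xC0 != 0x80:
            if seen == index:
                return offset
            seen += 1
    raise IndexError(index)


def utf32_offset(index: int) -> int:
    """With four bytes per code point the offset is O(1) arithmetic."""
    return 4 * index


text = "\u540d\u5b57.scm"            # the file name from this thread
utf8 = text.encode("utf-8")
utf32 = text.encode("utf-32-le")     # little-endian, no byte order mark

# Code point 2 is the "." after the two Chinese characters.
print(utf8_offset(utf8, 2), utf32_offset(2))   # prints "6 8"
```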

Ludo’.





* Re: guile can't find a chinese named file
  2017-02-14 20:54                     ` Mike Gran
  2017-02-14 21:07                       ` Marko Rauhamaa
@ 2017-02-14 23:58                       ` David Kastrup
  2017-02-15 10:12                         ` tomas
  2017-02-26 21:20                         ` Andy Wingo
  1 sibling, 2 replies; 110+ messages in thread
From: David Kastrup @ 2017-02-14 23:58 UTC (permalink / raw)
  To: guile-user

Mike Gran <spk121@yahoo.com> writes:

> But, for what it is worth, the Latin-1/UTF-32 design decision came
> from a couple of conflicting requirements.  The switch happened in the
> 1.9.x series.
>
> There were several examples of legacy C code using Guile as an
> extension language that accessed the bytes of a string directly, using
> SCM_STRING_CHARS or scm_i_string_chars.  To keep from breaking legacy
> code, we needed to retain this (then already deprecated) capability
> for C programs to access 8-bit-locale string internals directly.

But if you don't know whether the strings are Latin-1 or UTF-32, that's
sort of academic.

> Also, in R6RS, there was the requirement that functions like
> "string-ref" act in "constant time". This suggested either a
> codepoint-array representation for strings, or a UTF-8 array
> representation with some indexing to allow for constant-time access.

The problem is not that Guile has an idiosyncratic internal string
representation.  As you note, other programs have that.

The problem is that Guile does not have an API for passing/processing
strings in that representation.  That means that passing strings in and
out of Guile is expensive.  And when working with string ports, even
keeping data purely inside of Guile requires conversion processes, and
string port positions are calculated in UTF-8 byte offsets while
strings are indexed in characters.

The problem is that Guile is _constantly_ required to recode strings it
is processing.  And to add insult to injury, it cannot do this without
data loss when its string encoding assumptions are wrong.

PostScript files are usually encoded in Latin-1 with occasional UTF-16
passages.  Reading and writing and copying such files byte-correctly
while trying to actually parse their contents is not feasible with
Guile.
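
For comparison, a common general-purpose workaround for byte-exact
copying is to treat the data as Latin-1, since that encoding assigns a
character to every byte value.  The Python sketch below is only an
illustration of that property; whether it suits a given parser is a
separate question, and the sample data is made up.

```python
# Arbitrary bytes: Latin-1 text followed by a UTF-16 passage (with BOM).
original = "caf\u00e9 ".encode("latin-1") + "\u540d\u5b57".encode("utf-16")

# Latin-1 maps each of the 256 byte values to one character, so decoding
# never fails and re-encoding gives back exactly the same bytes.
as_text = original.decode("latin-1")
restored = as_text.encode("latin-1")

print(restored == original)            # the copy is byte-exact
print(len(as_text) == len(original))   # one character per byte
```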

> I still maintain that this design decision was a good one based on the
> simplicity of implementation.

As I said: the problem is not the chosen internal representation.  The
problem is that there is no API to access it, and it does not even map
to string ports.

> The great difficulty with the UTF-8 Guile prototype was the need to
> interrogate every string access or index to decide if it was a
> codepoint index or a byte index. I abandoned that effort because it
> was doing my head in.

Emacs tried this in version 20.2, and got rid of it in version 20.4 or
so, obliterating byte-based indexing completely.  Anything else would
not have worked in the long run.  That was when, 16 years ago?

> Had we chosen that route, the result would likely have been a long,
> long process of squashing difficult bugs related to byte vs codepoint
> index confusion.
>
> But, for what it is worth, we've had a few years of the internal
> representation of strings being private, so any modification of the
> internal representation of strings would be easier in 2017 than it
> was in 2007, when the guts of strings were exposed to the C API.

> (N.B. dak at gnu is on my block list, so I won't see any such
> response.)

Not just on yours.  LilyPond is probably the largest application using
Guile as its extension language, with pretty much the worst impacts of
Guile-2 design decisions.  So obviously nobody wants to hear from its
most active developer.  This is even more important now that LilyPond is
getting removed from Debian and other distributions because it is still
hopeless to get it to run under Guile-2 (the experimental support has
encoding and stability problems and runs about a factor of 5 slower than
Guile-1).  The less one hears of that, the better for morale.

-- 
David Kastrup





* Re: guile can't find a chinese named file
  2017-02-14 22:19                           ` Chris Vine
@ 2017-02-15  7:15                             ` Marko Rauhamaa
  2017-02-15  9:18                             ` tomas
  1 sibling, 0 replies; 110+ messages in thread
From: Marko Rauhamaa @ 2017-02-15  7:15 UTC (permalink / raw)
  To: Chris Vine; +Cc: guile-user

Chris Vine <chris@cvine.freeserve.co.uk>:

> On Tue, 14 Feb 2017 21:52:01 +0000 (UTC)
> Mike Gran <spk121@yahoo.com> wrote:
>> True. Linux should follow OpenBSD and make all locales UTF-8.
>
> Filenames and locales are not necessarily related.

Linux *could* force that reality.

> When you access a networked file system, you get the filename encoding
> you are given, which may or may not be the same as the particular
> locale encoding on your particular machine on one particular day, and
> may or may not be a unicode encoding.

Linux *could* provide a seamless translation, or it *could* refuse to
mount filesystems that don't comply with its requirements.

> Linux is capable of treating filenames as just a null-terminated array
> of bytes with '/' as the directory separator. It is encoding agnostic,
> and that works just fine.

Bottom line: that's just the way things are in Linux. Programming
languages like Guile should not try to create another abstraction layer
on top of it.


Marko




* Re: guile can't find a chinese named file
  2017-02-14 22:19                           ` Chris Vine
  2017-02-15  7:15                             ` Marko Rauhamaa
@ 2017-02-15  9:18                             ` tomas
  2017-02-15  9:54                               ` David Kastrup
                                                 ` (2 more replies)
  1 sibling, 3 replies; 110+ messages in thread
From: tomas @ 2017-02-15  9:18 UTC (permalink / raw)
  To: guile-user

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Tue, Feb 14, 2017 at 10:19:14PM +0000, Chris Vine wrote:
> On Tue, 14 Feb 2017 21:52:01 +0000 (UTC)
> Mike Gran <spk121@yahoo.com> wrote:
> [snip]
> > > In particular, filenames are *not*, nor can they be mapped to,
> > > Unicode strings in Linux.
> > 
> > True. Linux should follow OpenBSD and make all locales UTF-8.
> 
> Filenames and locales are not necessarily related.  When you access a
> networked file system, you get the filename encoding you are given,
> which may or may not be the same as the particular locale encoding on
> your particular machine on one particular day, and may or may not be a
> unicode encoding.  Glib, for example, enables you to set this with the
> G_FILENAME_ENCODING environmental variable [...]

which is, btw., "just a better approximation", but still wrong: the
application creating a directory might have been "in" a different
locale (and thus having a different encoding) than the one creating
the file within that directory.

Most notably, the whole path might cross several mount points, thus
the whole path can well have fragments coming from several file systems.

I think the only sane way to see a Linux file system path is the way
Linux sees it: as a byte string.

Sure, some helper infrastructure to try to make characters of that
mess will be welcome, but that should be absolutely robust wrt.
unexpected input (e.g. bad UTF-8) and leave control to the application.

Not easy.

> g_filename_to_utf8() and g_filename_from_utf8() functions for this
> purpose.

To me, that seems insufficient, unless this just applies to one
(e.g. the last) path element. Skimming the docs I can't see whether
you are only supposed to do that or whether you can dump whole paths
(or path fragments) into those functions.

>          You can tie the filename encoding to the locale encoding by
> defining the G_BROKEN_FILENAMES environmental variable but that is
> deprecated (the name suggests what they think about that idea).
> 
> You may possibly agree with this: I am not clear from your post what
> connection you were making between locales and filenames.  But if
> OpenBSD requires all _filenames_ to be in valid UTF-8, that is a bad
> decision in my view.

NT has done that too. I don't know: there are arguments for both
approaches -- that depends whether you think file names are composed
of characters (makes sense, no?) or whether the OS doesn't care
what's in them (just leave null and slash alone!).

It's moving between those two views that's hard. Personally, I'd
tend to have Guile being agnostic (i.e. byte arrays) at the lowest
level (no conversions), and offer the application what it knows
(on BSD or "modern" Windows say: "yes, that's UTF-8" and on Linux
say "No idea, but you can try to convert").

Current locale is just a weak hint one might use in heuristics.
For things like environment variables and command line arguments,
locale is a stronger hint (but not 100%).

> Linux is capable of treating filenames as just a null-terminated array
> of bytes with '/' as the directory separator.  It is encoding agnostic,
> and that works just fine.

Or not. For the OS all is fine, for the applications it's a small
hell -- see those Glib functions you quoted, which -- given their
interfaces -- can't possibly do the right thing (dropping their
names in a search engine to skim their documentation turns up
quite a lot of failure modes, if you know what I mean).

regards
- -- tomás
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)

iEYEARECAAYFAlikHOgACgkQBcgs9XrR2kYBLACggihOlLCNLcUjlrsWh0vQMuH8
JxEAnRye7C4d1GNDJi7x6nLgI1PMamex
=+A5K
-----END PGP SIGNATURE-----




* Re: guile can't find a chinese named file
  2017-02-15  9:18                             ` tomas
@ 2017-02-15  9:54                               ` David Kastrup
  2017-02-15 10:10                                 ` tomas
  2017-02-15 10:50                                 ` Marko Rauhamaa
  2017-02-15 10:15                               ` Chris Vine
  2017-02-15 16:59                               ` Eli Zaretskii
  2 siblings, 2 replies; 110+ messages in thread
From: David Kastrup @ 2017-02-15  9:54 UTC (permalink / raw)
  To: guile-user

<tomas@tuxteam.de> writes:

> On Tue, Feb 14, 2017 at 10:19:14PM +0000, Chris Vine wrote:
>> On Tue, 14 Feb 2017 21:52:01 +0000 (UTC)
>> Mike Gran <spk121@yahoo.com> wrote:
>> [snip]
>> > > In particular, filenames are *not*, nor can they be mapped to,
>> > > Unicode strings in Linux.
>> > 
>> > True. Linux should follow OpenBSD and make all locales UTF-8.
>> 
>> Filenames and locales are not necessarily related.  When you access a
>> networked file system, you get the filename encoding you are given,
>> which may or may not be the same as the particular locale encoding on
>> your particular machine on one particular day, and may or may not be a
>> unicode encoding.  Glib, for example, enables you to set this with the
>> G_FILENAME_ENCODING environmental variable [...]
>
> which is, btw., "just a better approximation", but still wrong: the
> application creating a directory might have been "in" a different
> locale (and thus having a different encoding) than the one creating
> the file within that directory.
>
> Most notably, the whole path might cross several mount points, thus
> the whole path can well have fragments coming from several file systems.
>
> I think the only sane way to see a Linux file system path is the way
> Linux sees it: as a byte string.
>
> Sure, some helper infrastructure to try to make characters of that
> mess will be welcome, but that should be absolutely robust wrt.
> unexpected input (e.g. bad UTF-8) and leave control to the application.
>
> Not easy.

If you tell Emacs that some external entity is in UTF-8, it will
represent all valid UTF-8 sequences as properly decoded characters, and
it has special codes for all bytes not part of valid UTF-8.

As a result, it works with valid UTF-8 perfectly as expected but will
reproduce arbitrary byte streams thrown at it perfectly when decoding as
UTF-8 and then reencoding into UTF-8 again.

Guile is lacking this byte stream reproducibility when
decoding/reencoding.  That makes it a whole lot less robust for dealing
with externally provided material.
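
What "byte stream reproducibility" means can be shown with a small
Python sketch (Python only as an illustration; this is not a
description of Guile's or Emacs's internals).  A decoder that replaces
bad bytes loses information, while one that escapes them, here
Python's "surrogateescape" error handler, gives the original bytes
back after re-encoding.

```python
# 0xFF can never appear in valid UTF-8.
original = b"name_\xff.scm"

# Lossy: the bad byte becomes U+FFFD, so re-encoding yields other bytes.
lossy = original.decode("utf-8", errors="replace").encode("utf-8")

# Lossless: "surrogateescape" maps the bad byte to a lone surrogate
# (U+DCFF here) and maps it back to the same byte when encoding.
text = original.decode("utf-8", errors="surrogateescape")
lossless = text.encode("utf-8", errors="surrogateescape")

print(lossy == original, lossless == original)   # prints "False True"
```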

-- 
David Kastrup





* Re: guile can't find a chinese named file
  2017-02-15  9:54                               ` David Kastrup
@ 2017-02-15 10:10                                 ` tomas
  2017-02-15 17:04                                   ` Eli Zaretskii
  2017-02-15 10:50                                 ` Marko Rauhamaa
  1 sibling, 1 reply; 110+ messages in thread
From: tomas @ 2017-02-15 10:10 UTC (permalink / raw)
  To: David Kastrup; +Cc: guile-user

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Wed, Feb 15, 2017 at 10:54:06AM +0100, David Kastrup wrote:
> <tomas@tuxteam.de> writes:

[...]

> > Not easy.
> 
> If you tell Emacs that some external entity is in UTF-8, it will
> represent all valid UTF-8 sequences as properly decoded characters, and
> it has special codes for all bytes not part of valid UTF-8.
> 
> As a result, it works with valid UTF-8 perfectly as expected but will
> reproduce arbitrary byte streams thrown at it perfectly when decoding as
> UTF-8 and then reencoding into UTF-8 again.

Yes, Emacs is the text specialist.

It has taken years and a bunch of very smart, experienced and *patient*
folks to achieve that. But then Emacs has users who don't run away
screaming when the GUI widget shows funky stuff. They mostly go
"oh, that's interesting" :-)

(My point: not easy).

A Glib function (as a servant to Gtk) is trying to display a string
"correctly", even if that string is in a mixed (say Latin-1/UTF-8)
encoding.  It just can't. Not unless you throw heuristics and voodoo in,
and you don't want a library doing that behind your back.

> Guile is lacking this byte stream reproducibility when
> decoding/reencoding.  That makes it a whole lot less robust for dealing
> with externally provided material.

Definitely. Perhaps the Emacs approach might be the right one for Guile?

regards
- -- t
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)

iEYEARECAAYFAlikKRgACgkQBcgs9XrR2kaYCwCeIvEPwGvCRsy1Tm+BRLuOMaV2
7kkAnjyoGca+RsKdr4SzMGdzQKf2hEhx
=mb3P
-----END PGP SIGNATURE-----




* Re: guile can't find a chinese named file
  2017-02-14 23:58                       ` David Kastrup
@ 2017-02-15 10:12                         ` tomas
  2017-02-15 12:04                           ` Marko Rauhamaa
  2017-02-26 21:20                         ` Andy Wingo
  1 sibling, 1 reply; 110+ messages in thread
From: tomas @ 2017-02-15 10:12 UTC (permalink / raw)
  To: David Kastrup; +Cc: guile-user

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Wed, Feb 15, 2017 at 12:58:41AM +0100, David Kastrup wrote:

[...]

> Not just on yours.  LilyPond is probably the largest application using
> Guile as its extension language, with pretty much the worst impacts of
> Guile-2 design decisions.  So obviously nobody wants to hear from its
> most active developer.  This is even more important now that LilyPond is
> getting removed from Debian and other distributions because it is still
> hopeless to get it to run under Guile-2 (the experimental support has
> encoding and stability problems and runs about a factor of 5 slower than
> Guile-1).  The less one hears of that, the better for morale.

This is definitely sad. There should be a way out of that.

regards
- -- tomás
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)

iEYEARECAAYFAlikKYwACgkQBcgs9XrR2kYu/gCeK0OWrAZgVjqqogr5/bf4aMer
QDsAniO94+sfyWgiauIEEOeAP4clMvXR
=vZbF
-----END PGP SIGNATURE-----




* Re: guile can't find a chinese named file
  2017-02-15  9:18                             ` tomas
  2017-02-15  9:54                               ` David Kastrup
@ 2017-02-15 10:15                               ` Chris Vine
  2017-02-15 11:48                                 ` tomas
  2017-02-26 20:52                                 ` Andy Wingo
  2017-02-15 16:59                               ` Eli Zaretskii
  2 siblings, 2 replies; 110+ messages in thread
From: Chris Vine @ 2017-02-15 10:15 UTC (permalink / raw)
  To: guile-user

On Wed, 15 Feb 2017 10:18:32 +0100
<tomas@tuxteam.de> wrote:
> On Tue, Feb 14, 2017 at 10:19:14PM +0000, Chris Vine wrote:
[snip]
> > Filenames and locales are not necessarily related.  When you access
> > a networked file system, you get the filename encoding you are
> > given, which may or may not be the same as the particular locale
> > encoding on your particular machine on one particular day, and may
> > or may not be a unicode encoding.  Glib, for example, enables you
> > to set this with the G_FILENAME_ENCODING environmental variable
> > [...]  
> 
> which is, btw., "just a better approximation", but still wrong: the
> application creating a directory might have been "in" a different
> locale (and thus having a different encoding) than the one creating
> the file within that directory.
> 
> Most notably, the whole path might cross several mount points, thus
> the whole path can well have fragments coming from several file
> systems.
> 
> I think the only sane way to see a Linux file system path is the way
> Linux sees it: as a byte string.
> 
> Sure, some helper infrastructure to try to make characters of that
> mess will be welcome, but that should be absolutely robust wrt.
> unexpected input (e.g. bad UTF-8) and leave control to the application.
> 
> Not easy.

I don't disagree.  My purpose was to point out that in the modern
world of networking and plug-in devices, locales and filenames are
disjoint.

The glib approach is better than assuming all filenames are in locale
encoding, but it is by no means perfect.  I came across exactly this
problem when writing a small application, mainly for my own use, to
manage music files (actually mainly podcasts) on a USB music stick.
The stick had its filenames in UTF-8 (somewhat confusingly the text in
its index files, which had UTF-8 names, was in UTF-16).  This meant
that if the computer on which the stick was mounted used a different
filename encoding, any file with path could be in a mixed encoding.
Because gio's GFile insists that its filenames with path are in the
encoding set by G_FILENAME_ENCODING, this meant GFile was only
guaranteed to work when the stick was mounted on a computer with
filename encoding set to UTF-8.

In the end I just used the standard POSIX functions to open, close,
read and write files which, because linux is codeset agnostic, worked
fine.  To display filenames in GTK+, I was able to apply
g_filename_to_utf8() to the mount point only and know that the
remainder of the file name was guaranteed to be in UTF-8 already.
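
The same idea, sketched in Python rather than with the GLib calls
(the helper display_name, the paths and the encodings are all made up
for illustration): decode the mount point with the host's filename
encoding and everything below it as UTF-8.

```python
def display_name(path: bytes, mount_point: bytes, host_encoding: str) -> str:
    """Decode a path whose mount-point prefix and remainder differ in encoding.

    Hypothetical helper: `host_encoding` is the filename encoding of the
    machine, while everything below `mount_point` is assumed to be UTF-8.
    """
    if not path.startswith(mount_point):
        raise ValueError("path is not below the mount point")
    prefix = mount_point.decode(host_encoding)
    # errors="replace" keeps the result usable for display even if the
    # UTF-8 assumption turns out to be wrong for some name.
    rest = path[len(mount_point):].decode("utf-8", errors="replace")
    return prefix + rest


# Mount point named in Latin-1, file on the stick named in UTF-8.
mount = "/media/m\u00fcsik".encode("latin-1")
path = mount + "/podcasts/\u540d\u5b57.mp3".encode("utf-8")

shown = display_name(path, mount, "latin-1")
print(shown == "/media/m\u00fcsik/podcasts/\u540d\u5b57.mp3")
```

The result is meant for display only; the original bytes are still
what gets passed to open() and friends.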

> > g_filename_to_utf8() and g_filename_from_utf8() functions for this
> > purpose.  
> 
> To me, that seems insufficient, unless this just applies to one
> (e.g. the last) path element. Skimming the docs I can't see whether
> you are only supposed to do that or whether you can dump whole paths
> (or path fragments) into those functions.

You can do whatever you want with these functions.  They just convert a
text fragment from filename encoding to UTF-8 (if different).  They are
the filename encoding equivalent of g_locale_to_utf8() and
g_locale_from_utf8() for the locale encoding.  If you pass them a
filename with path, and that is in a mixed encoding, it won't work.
There are variants which will gracefully degrade in case of encoding
errors - g_filename_display_name() and g_filename_display_basename().

[snip]
> It's moving between those two views that's hard. Personally, I'd
> tend to have Guile being agnostic (i.e. byte arrays) at the lowest
> level (no conversions), and offer the application what it knows
> (on BSD or "modern" Windows say: "yes, that's UTF-8" and on Linux
> say "No idea, but you can try to convert").
> 
> Current locale is just a weak hint one might use in heuristics.
> For things like environment variables and command line arguments,
> locale is a stronger hint (but not 100%).

I would prefer guile to make the filename encoding a fluid.  It wouldn't
deal with files mounted with mixed encodings, but it would cater for
everything else.

Chris




* Re: guile can't find a chinese named file
  2017-02-15  9:54                               ` David Kastrup
  2017-02-15 10:10                                 ` tomas
@ 2017-02-15 10:50                                 ` Marko Rauhamaa
  2017-02-15 11:18                                   ` David Kastrup
  1 sibling, 1 reply; 110+ messages in thread
From: Marko Rauhamaa @ 2017-02-15 10:50 UTC (permalink / raw)
  To: David Kastrup; +Cc: guile-user

David Kastrup <dak@gnu.org>:

> If you tell Emacs that some external entity is in UTF-8, it will
> represent all valid UTF-8 sequences as properly decoded characters,
> and it has special codes for all bytes not part of valid UTF-8.
>
> As a result, it works with valid UTF-8 perfectly as expected but will
> reproduce arbitrary byte streams thrown at it perfectly when decoding
> as UTF-8 and then reencoding into UTF-8 again.
>
> Guile is lacking this byte stream reproducibility when
> decoding/reencoding. That makes it a whole lot less robust for dealing
> with externally provided material.

Python3 supports this by abusing the surrogate code points. I don't
recommend following Python's lead.

Instead, when decoding a byte string into Unicode, the application
should be returned a list:

   ( chars bytes chars bytes ... chars )

or some similar mechanism.
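
A minimal Python sketch of what such a list could look like (the
function names decode_segments and encode_segments are hypothetical,
and this is one possible reading of the proposal, not an existing
API): valid UTF-8 runs become str items, runs of undecodable bytes
become bytes items, and the two alternate.

```python
def decode_segments(data: bytes) -> list:
    """Split `data` into alternating str (valid UTF-8) and bytes (invalid) items.

    The first and last items are always str, possibly empty.
    """
    segments = [""]
    pos = 0
    while pos < len(data):
        try:
            segments[-1] += data[pos:].decode("utf-8")
            break
        except UnicodeDecodeError as err:
            # err.start is relative to the slice data[pos:].
            bad = pos + err.start
            segments[-1] += data[pos:bad].decode("utf-8")
            if segments[-1] == "" and len(segments) > 1:
                # Two bad bytes in a row: extend the previous bytes item.
                segments[-2] += data[bad:bad + 1]
            else:
                segments += [data[bad:bad + 1], ""]
            pos = bad + 1
    return segments


def encode_segments(segments: list) -> bytes:
    """Inverse of decode_segments(): reproduces the original bytes."""
    return b"".join(
        item.encode("utf-8") if isinstance(item, str) else item
        for item in segments
    )


print(decode_segments(b"caf\xe9.txt"))   # ['caf', b'\xe9', '.txt']
```

This sketch rescans the tail after every bad byte, so it is meant to
show the data shape rather than to be efficient.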


Marko




* Re: guile can't find a chinese named file
  2017-02-15 10:50                                 ` Marko Rauhamaa
@ 2017-02-15 11:18                                   ` David Kastrup
  0 siblings, 0 replies; 110+ messages in thread
From: David Kastrup @ 2017-02-15 11:18 UTC (permalink / raw)
  To: Marko Rauhamaa; +Cc: guile-user

Marko Rauhamaa <marko@pacujo.net> writes:

> David Kastrup <dak@gnu.org>:
>
>> If you tell Emacs that some external entity is in UTF-8, it will
>> represent all valid UTF-8 sequences as properly decoded characters,
>> and it has special codes for all bytes not part of valid UTF-8.
>>
>> As a result, it works with valid UTF-8 perfectly as expected but will
>> reproduce arbitrary byte streams thrown at it perfectly when decoding
>> as UTF-8 and then reencoding into UTF-8 again.
>>
>> Guile is lacking this byte stream reproducibility when
>> decoding/reencoding. That makes it a whole lot less robust for dealing
>> with externally provided material.
>
> Python3 supports this by abusing the surrogate code points. I don't
> recommend following Python's lead.

Emacs uses overlong byte sequences for 0x00 to 0x7f to represent bytes
with values 0x80 to 0xff not part of valid UTF-8 sequences.  Those
cannot occur in valid UTF-8, but they behave nicely internally with
regard to detecting character boundaries in string/character handling.
Basically, those are patterns 0xc0 0x80 ... 0xc0 0xbf and 0xc1 0x80
... 0xc1 0xbf for representing 0x80 ... 0xbf and 0xc0 ... 0xff when the
latter are not part of proper (and consequently uniquely encoded) UTF-8.

Which means that random byte sequences get blown up by less than 50%
internally (less because some bytes 0x80...0xff end up in combinations
constituting valid UTF-8 sequences and thus will pass transparently).
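
The byte mapping described above can be written down in a few lines.
This is a Python sketch of the arithmetic only, based on the ranges
given in this message; the function names are made up and nothing
here is taken from the Emacs sources.

```python
def escape_raw_byte(byte: int) -> bytes:
    """Map a stray byte 0x80..0xFF to a two-byte overlong sequence."""
    if not 0x80 <= byte <= 0xFF:
        raise ValueError("only bytes 0x80..0xFF need escaping")
    value = byte - 0x80                      # 0x00..0x7F
    # Two-byte UTF-8 layout 110xxxxx 10xxxxxx, used here for a value
    # that would normally fit in one byte (hence "overlong").
    return bytes([0xC0 | (value >> 6), 0x80 | (value & 0x3F)])


def unescape_raw_byte(pair: bytes) -> int:
    """Inverse of escape_raw_byte()."""
    lead, trail = pair
    if lead not in (0xC0, 0xC1) or not 0x80 <= trail <= 0xBF:
        raise ValueError("not an escaped raw byte")
    return 0x80 + (((lead & 0x1F) << 6) | (trail & 0x3F))


# The four range endpoints named in the text.
for byte in (0x80, 0xBF, 0xC0, 0xFF):
    print(hex(byte), escape_raw_byte(byte).hex())
```

The loop prints c080, c0bf, c180 and c1bf for the four endpoints,
matching the patterns listed above.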

> Instead, when decoding a byte string into Unicode, the application
> should be returned a list:
>
>    ( chars bytes chars bytes ... chars )
>
> or some similar mechanism.

This would seriously inflate random byte sequences and require string
handling to special-case the counters.  The Emacs way is comparatively
modest, and the internal representation meets most of the UTF-8
invariants important for fast string processing.  Perhaps the most
astonishing thing is that this reencoding results in sensible sort
orders: "Isolated bytes 0x80...0xff" sort right after 0x00...0x7f.

-- 
David Kastrup




* Re: guile can't find a chinese named file
  2017-02-15 10:15                               ` Chris Vine
@ 2017-02-15 11:48                                 ` tomas
  2017-02-15 12:13                                   ` Chris Vine
  2017-02-26 20:52                                 ` Andy Wingo
  1 sibling, 1 reply; 110+ messages in thread
From: tomas @ 2017-02-15 11:48 UTC (permalink / raw)
  To: guile-user

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Wed, Feb 15, 2017 at 10:15:33AM +0000, Chris Vine wrote:

[...]

> I don't disagree.  My purpose was to point out that in the modern
> world of networking and plug-in devices, locales and filenames are
> disjoint.
> 
> The glib approach is better than assuming all filenames are in locale
> encoding, but it is by no means perfect.  I came across exactly this
> problem when writing a small application, mainly for my own use, to
> manage music files (actually mainly podcasts) on a USB music stick.
> The stick had its filenames in UTF-8 (somewhat confusingly the text in
> its index files, which had UTF-8 names, was in UTF-16).  This meant
> that if the computer on which the stick was mounted used a different
> filename encoding, any file with path could be in a mixed encoding.
> Because gio's GFile insists that its filenames with path are in the
> encoding set by G_FILENAME_ENCODING, this meant GFile was only
> guaranteed to work when the stick was mounted on a computer with
> filename encoding set to UTF-8.

A very instructive example, thanks :-)

> In the end I just used the standard POSIX functions to open, close,
> read and write files which, because linux is codeset agnostic, worked
> fine.  To display filenames in GTK+, I was able to apply
> g_filename_to_utf8() to the mount point only and know that the
> remainder of the file name was guaranteed to be in UTF-8 already.

[...]

> I would prefer guile to make the filename encoding a fluid.  It wouldn't
> deal with files mounted with mixed encodings, but it would cater for
> everything else.

But why? I think either (a) having an internal encoding which is
"mostly UTF-8" but has space for raw bytes, as David describes,
or (b) staying out of it completely and dealing in arrays of bytes,
providing the filename encoding just as an advisory value
("as far as we can know, those file names are encoded FOO"),
seems far superior, since it will deal even with mixed encodings.

Of course (b) has a price too. I've seen XML parsers weep because
someone did a "substring" by hand, cutting in half a poor multibyte
sequence. Now it's *your* problem to do string operations right [1].
Loads of fun :-)

All in all, the Emacs way looks most enticing to me.

[1] And that's because file names *are* character strings, after
   all -- that's the point on which I disagree with Marko. But this
   is a "soft", "social" thing, not a technical one.

regards
- -- tomás
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)

iEYEARECAAYFAlikQAQACgkQBcgs9XrR2kZ5aQCdGuJYv4NUSMN3xqavXIi5wH06
TDIAni0035zTUynyBjm5VbLwkbDlAXZ6
=oten
-----END PGP SIGNATURE-----
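
The "substring by hand" hazard of option (b) can be sketched in a few
lines. This is an illustration in Python rather than Guile, the helper
names (naive_prefix, safe_prefix, is_valid_utf8) are made up for the
example, and it assumes the name is valid UTF-8 to begin with.

```python
# Option (b): the file name is an array of bytes; decoding is only for display.
name = "filename_名字.scm".encode("utf-8")  # "filename_" is 9 bytes, each CJK char 3 bytes


def naive_prefix(raw: bytes, n: int) -> bytes:
    """Cut after n bytes, ignoring character boundaries."""
    return raw[:n]


def safe_prefix(raw: bytes, n: int) -> bytes:
    """Cut after at most n bytes, backing up to a UTF-8 character boundary."""
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    while 0 < n < len(raw) and (raw[n] & 0xC0) == 0x80:
        n -= 1
    return raw[:n]


def is_valid_utf8(raw: bytes) -> bool:
    """Report whether the bytes decode as strict UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


broken = naive_prefix(name, 10)  # ends in the middle of the first CJK character
intact = safe_prefix(name, 10)   # backs up to b"filename_"
```

With a byte-oriented interface this boundary bookkeeping falls to the
caller, which is the price of (b) mentioned above.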




* Re: guile can't find a chinese named file
  2017-02-15 10:12                         ` tomas
@ 2017-02-15 12:04                           ` Marko Rauhamaa
  0 siblings, 0 replies; 110+ messages in thread
From: Marko Rauhamaa @ 2017-02-15 12:04 UTC (permalink / raw)
  To: tomas; +Cc: guile-user, David Kastrup

<tomas@tuxteam.de>:

> On Wed, Feb 15, 2017 at 12:58:41AM +0100, David Kastrup wrote:
>> LilyPond is getting removed from Debian and other distributions
>> because it is still hopeless to get it to run under Guile-2 (the
>> experimental support has encoding and stability problems and runs
>> about a factor of 5 slower than Guile-1). The less one hears of that,
>> the better for morale.
>
> This is definitely sad. There should be a way out of that.

LilyPond is worth half a distro. A true gem in the history of software.


Marko




* Re: guile can't find a chinese named file
  2017-02-15 11:48                                 ` tomas
@ 2017-02-15 12:13                                   ` Chris Vine
  2017-02-15 12:41                                     ` tomas
  2017-02-15 17:07                                     ` Eli Zaretskii
  0 siblings, 2 replies; 110+ messages in thread
From: Chris Vine @ 2017-02-15 12:13 UTC (permalink / raw)
  To: guile-user

On Wed, 15 Feb 2017 12:48:20 +0100
<tomas@tuxteam.de> wrote:
> On Wed, Feb 15, 2017 at 10:15:33AM +0000, Chris Vine wrote:
[snip]
> > I would prefer guile to make the filename encoding a fluid.  It
> > wouldn't deal with files mounted with mixed encodings, but it would
> > cater for everything else.  
> 
> But why? I think either (a) have an internal encoding which is
> "mostly UTF-8", but has space for raw bytes, as David describes
> or (b) staying out of it completely and dealing in arrays of bytes,
> and providing the filename encoding just as an advisory value
> ("as far as we can know, those file names are encoded FOO")
> seems far superior, since it will deal even with mixed encodings.

I would be happy with that.  But we have to work in the land of the
achievable.  Making the filename encoding a fluid shouldn't be a major
exercise.  Guile-2.0 already converts filenames from its internal string
representation (Latin-1 or UCS-4) to the locale encoding when opening up
files and the like in C; this would need instead to do the conversion
by reference to the fluid value (which could default to the locale
encoding).

Option (a) you mention would seem to require rewriting the string
implementation.  Option (b) may be more tractable (perhaps there could
be an option to pass file names using bytevectors) but someone has
still got to do it and I am not sure how that would work with windows.

Chris
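
The idea of a dynamically scoped filename encoding can be sketched as
follows. This is Python, not Guile: a contextvars.ContextVar stands in
for a fluid, and the names filename_encoding, with_filename_encoding
and encode_filename are hypothetical. Nothing here is an existing Guile
interface.

```python
import contextvars
import sys
from contextlib import contextmanager

# Stand-in for the proposed fluid; the default is the interpreter's
# filesystem encoding, playing the role of "the locale encoding".
filename_encoding = contextvars.ContextVar(
    "filename_encoding", default=sys.getfilesystemencoding()
)


@contextmanager
def with_filename_encoding(encoding: str):
    """Bind the filename encoding for the duration of a block."""
    token = filename_encoding.set(encoding)
    try:
        yield
    finally:
        filename_encoding.reset(token)


def encode_filename(name: str) -> bytes:
    """Convert an internal string to the bytes handed to the OS."""
    return name.encode(filename_encoding.get())


with with_filename_encoding("latin-1"):
    on_latin1_device = encode_filename("café.txt")  # b"caf\xe9.txt"

with with_filename_encoding("utf-8"):
    on_utf8_device = encode_filename("café.txt")    # b"caf\xc3\xa9.txt"
```

A single binding applies to the whole path, so a path whose components
use different encodings is still out of reach, as noted above.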




* Re: guile can't find a chinese named file
  2017-02-15 12:13                                   ` Chris Vine
@ 2017-02-15 12:41                                     ` tomas
  2017-02-15 13:11                                       ` Chris Vine
  2017-02-15 17:07                                     ` Eli Zaretskii
  1 sibling, 1 reply; 110+ messages in thread
From: tomas @ 2017-02-15 12:41 UTC (permalink / raw)
  To: guile-user

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Wed, Feb 15, 2017 at 12:13:09PM +0000, Chris Vine wrote:
> On Wed, 15 Feb 2017 12:48:20 +0100
> <tomas@tuxteam.de> wrote:
> > On Wed, Feb 15, 2017 at 10:15:33AM +0000, Chris Vine wrote:
> [snip]
> > > I would prefer guile to make the filename encoding a fluid.  It
> > > wouldn't deal with files mounted with mixed encodings, but it would
> > > cater for everything else.  
> > 
> > But why? I think either (a) have an internal encoding which is
> > "mostly UTF-8", but has space for raw bytes, as David describes
> > or (b) staying out of it completely and dealing in arrays of bytes,
> > and providing the filename encoding just as an advisory value
> > ("as far as we can know, those file names are encoded FOO")
> > seems far superior, since it will deal even with mixed encodings.
> 
> I would be happy with that.  But we have to work in the land of the
> achievable.  Making the filename encoding a fluid shouldn't be a major
> exercise.  Guile-2.0 already converts filenames from its internal string
> representation (Latin-1 or UCS-4) to the locale encoding when opening up
> files and the like in C; this would need instead to do the conversion
> by reference to the fluid value (which could default to the locale
> encoding).
> 
> Option (a) you mention would seem to require rewriting the string
> implementation.  Option (b) may be more tractable (perhaps there could
> be an option to pass file names using bytevectors) but someone has
> still got to do it and I am not sure how that would work with windows.

Of course it can only be done piecemeal. For (a), the first step would
be to have a separate "omnivore string" representation (possibly as
a smob) and the necessary I/O operations. For (b), have the I/O
operations to get the things (file names, environments) as raw
byte arrays. This way, people could at least fight, when necessary.

But true, I haven't even a clue how difficult that would be.

What I don't like about the fluid is that it still doesn't give you
an escape hatch in hard cases (your USB stick example).

regards
- -- tomás
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)

iEYEARECAAYFAlikTHQACgkQBcgs9XrR2kbfjQCfTvSlVxYvPtZrLK9iUu7vfjSn
/L8An3WuJW+NHZhA5sOJK2GwMJQ5bbSV
=AZ7S
-----END PGP SIGNATURE-----




* Re: guile can't find a chinese named file
  2017-02-15 12:41                                     ` tomas
@ 2017-02-15 13:11                                       ` Chris Vine
  2017-02-15 13:31                                         ` tomas
  0 siblings, 1 reply; 110+ messages in thread
From: Chris Vine @ 2017-02-15 13:11 UTC (permalink / raw)
  To: guile-user

On Wed, 15 Feb 2017 13:41:24 +0100
<tomas@tuxteam.de> wrote:
[snip]
> What I don't like about the fluid is that it still doesn't give you
> an escape hatch in hard cases (your USB stick example).

The program would just have to document that any mount point must have
a path in a character set (e.g. ASCII) that is compatible with the
filename encoding used by the pluggable device in question, which is
the encoding the fluid is to be set to.

Not ideal, but in practice probably good enough for anyone using guile,
who is likely to be sufficiently knowledgeable to understand what this
means.




* Re: guile can't find a chinese named file
  2017-02-15 13:11                                       ` Chris Vine
@ 2017-02-15 13:31                                         ` tomas
  0 siblings, 0 replies; 110+ messages in thread
From: tomas @ 2017-02-15 13:31 UTC (permalink / raw)
  To: guile-user

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Wed, Feb 15, 2017 at 01:11:57PM +0000, Chris Vine wrote:
> On Wed, 15 Feb 2017 13:41:24 +0100
> <tomas@tuxteam.de> wrote:
> [snip]
> > What I don't like about the fluid is that it still doesn't give you
> > an escape hatch in hard cases (your USB stick example).
> 
> The program would just have to document that any mount point must have
> a path in a character set (e.g. ASCII) that is compatible with the
> filename encoding used by the pluggable device in question, which is
> the encoding the fluid is to be set to.

Yes, but exactly that is out of the application's control :-(

> Not ideal, but in practice probably good enough for anyone using guile,
> who is likely to be sufficiently knowledgeable to understand what this
> means.

None of the solutions bandied around are ideal, I fear.

regards
- -- t
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)

iEYEARECAAYFAlikWD8ACgkQBcgs9XrR2kb5dgCdHJ4JbjrDAPBxeIOCMS9shjWq
QlwAn3BP5p6He+N6fEqJiTUR9UzNES/P
=zGLl
-----END PGP SIGNATURE-----




* Re: guile can't find a chinese named file
  2017-02-15  9:18                             ` tomas
  2017-02-15  9:54                               ` David Kastrup
  2017-02-15 10:15                               ` Chris Vine
@ 2017-02-15 16:59                               ` Eli Zaretskii
  2017-02-15 17:53                                 ` Marko Rauhamaa
  2017-02-15 20:20                                 ` tomas
  2 siblings, 2 replies; 110+ messages in thread
From: Eli Zaretskii @ 2017-02-15 16:59 UTC (permalink / raw)
  To: tomas; +Cc: guile-user

> Date: Wed, 15 Feb 2017 10:18:32 +0100
> From: <tomas@tuxteam.de>
> 
> > Filenames and locales are not necessarily related.  When you access a
> > networked file system, you get the filename encoding you are given,
> > which may or may not be the same as the particular locale encoding on
> > your particular machine on one particular day, and may or may not be a
> > unicode encoding.  Glib, for example, enables you to set this with the
> > G_FILENAME_ENCODING environmental variable [...]
> 
> which is, btw., "just a better approximation", but still wrong: the
> application creating a directory might have been "in" a different
> locale (and thus having a different encoding) than the one creating
> the file within that directory.
> 
> Most notably, the whole path might cross several mount points, thus
> the whole path can well have fragments coming from several file systems.

A possible solution would be to decode each mount point's part as it
is being resolved.

> I think the only sane way to see a Linux file system path is the way
> Linux sees it: as a byte string.

This would lose a lot in 99% of use cases.  You are, in effect,
suggesting a "reverse optimization", whereby the majority of use cases
is punished in favor of a small minority, based on theoretical
intractability.

> Sure, some helper infrastructure to try to make characters of that
> mess will be welcome, but that should be absolutely robust wrt.
> unexpected input (e.g. bad UTF-8) and leave control to the application.

Most applications won't like this burden, because most application
programmers don't know enough about the issues to solve them correctly,
especially for users of other OSes and locales.

> > But if OpenBSD requires all _filenames_ to be in valid UTF-8, that
> > is a bad decision in my view.
> 
> NT has done that too.

Windows can do that because it also transparently translates file
names to the locale's encoding when files are accessed with ANSI APIs.
Without such translation, this kind of decision is unwise, IMO.
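
What translating a stored UTF-16 name to a locale codepage involves can
be sketched as follows. This is Python, not Windows API code:
to_codepage is a hypothetical helper, and the "replace" error handler
is Python's own, chosen only to make the lossy case visible.

```python
# A name as it might be stored: UTF-16 code units (little-endian here).
stored = "filename_名字.scm".encode("utf-16-le")


def to_codepage(utf16_name: bytes, codepage: str) -> bytes:
    """Translate a UTF-16 name to a byte-oriented codepage.

    Characters the codepage cannot represent are replaced by "?".
    """
    return utf16_name.decode("utf-16-le").encode(codepage, errors="replace")


western = to_codepage(stored, "cp1252")  # no CJK in this codepage
chinese = to_codepage(stored, "cp936")   # a simplified-Chinese codepage

print(western)                  # b'filename_??.scm'
print(chinese.decode("cp936"))  # filename_名字.scm
```

The translation is lossless only when the codepage covers every
character in the name; otherwise the byte-oriented caller sees a name
that no longer identifies the file.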




* Re: guile can't find a chinese named file
  2017-02-15 10:10                                 ` tomas
@ 2017-02-15 17:04                                   ` Eli Zaretskii
  2017-02-15 20:07                                     ` tomas
  0 siblings, 1 reply; 110+ messages in thread
From: Eli Zaretskii @ 2017-02-15 17:04 UTC (permalink / raw)
  To: tomas; +Cc: guile-user, dak

> Date: Wed, 15 Feb 2017 11:10:33 +0100
> From: <tomas@tuxteam.de>
> Cc: guile-user@gnu.org
> 
> Yes, Emacs is the text specialist.
> 
> It has taken years and a bunch of very smart, experienced and *patient*
> folks to achieve that.

It took many years because those smart, experienced, and patient
people made bad decisions, twice, and had to correct them later, which
required rewriting several important internal mechanisms.  Which tells
you that smarts, experience, and patience are not enough to get this
right the first time.

But Guile doesn't need to go through all that, it can just use what
Emacs learned the hard way.

> But then Emacs has users who don't run away screaming when the GUI
> widget shows funky stuff.

Oh yes, they did.  How they did!  Those screams are the reason why the
stuff was rewritten.

Once again, Guile doesn't need to go through all that, it just needs
to learn from others' experience.

> Perhaps the Emacs approach might be the right one for Guile?

For some reason, Guile developers are not necessarily convinced,
AFAIU.




* Re: guile can't find a chinese named file
  2017-02-15 12:13                                   ` Chris Vine
  2017-02-15 12:41                                     ` tomas
@ 2017-02-15 17:07                                     ` Eli Zaretskii
  2017-02-26 20:58                                       ` Andy Wingo
  1 sibling, 1 reply; 110+ messages in thread
From: Eli Zaretskii @ 2017-02-15 17:07 UTC (permalink / raw)
  To: Chris Vine; +Cc: guile-user

> Date: Wed, 15 Feb 2017 12:13:09 +0000
> From: Chris Vine <chris@cvine.freeserve.co.uk>
> 
> I am not sure how that would work with windows.

Emacs has solved that problem as well: the MS-Windows port presents
file names to Emacs internals as if they were encoded in UTF-8, and
shadows the relevant system APIs that accept or return file names, like
fopen, opendir/readdir, stat, etc., with its own versions that convert
UTF-8 to and from UTF-16 before calling the real OS APIs.

Once again, just use that experience, and maybe even some
infrastructure code.
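
The shape of such a shadowing layer can be sketched like this. It is
Python, not the Emacs implementation: make_shadow and fake_wide_open
are invented for the illustration, and fake_wide_open merely stands in
for a wide-character OS call.

```python
def utf8_to_utf16(name_utf8: bytes) -> bytes:
    """Convert a UTF-8 file name to UTF-16 (little-endian) code units."""
    return name_utf8.decode("utf-8").encode("utf-16-le")


def utf16_to_utf8(name_utf16: bytes) -> bytes:
    """Convert a UTF-16 (little-endian) file name back to UTF-8."""
    return name_utf16.decode("utf-16-le").encode("utf-8")


def make_shadow(real_api):
    """Wrap a UTF-16 API so that callers can keep passing UTF-8 names."""
    def shadow(name_utf8: bytes):
        return real_api(utf8_to_utf16(name_utf8))
    return shadow


seen = []  # records what the stand-in "OS call" received


def fake_wide_open(name_utf16: bytes) -> int:
    """Stand-in for a real wide-character open; returns a dummy handle."""
    seen.append(name_utf16)
    return len(seen)


shadow_open = make_shadow(fake_wide_open)
handle = shadow_open("filename_名字.scm".encode("utf-8"))
```

The rest of the program only ever sees UTF-8; the conversion is
confined to the thin layer around the system calls.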




* Re: guile can't find a chinese named file
  2017-02-15 16:59                               ` Eli Zaretskii
@ 2017-02-15 17:53                                 ` Marko Rauhamaa
  2017-02-15 20:20                                 ` tomas
  1 sibling, 0 replies; 110+ messages in thread
From: Marko Rauhamaa @ 2017-02-15 17:53 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: guile-user

Eli Zaretskii <eliz@gnu.org>:

>> Date: Wed, 15 Feb 2017 10:18:32 +0100
>> From: <tomas@tuxteam.de>
>> I think the only sane way to see a Linux file system path is the way
>> Linux sees it: as a byte string.
>
> This would lose a lot in 99% of use cases. You are, in effect,
> suggesting a "reverse optimization", whereby the majority of use cases
> is punished in favor of a small minority, based on theoretical
> intractability.

I think this is a question of software security as well. These
"theoretical" loopholes could be used for sabotage that evades testing.

>> Sure, some helper infrastructure to try to make characters of that
>> mess will be welcome, but that should be absolutely robust wrt.
>> unexpected input (e.g. bad UTF-8) and leave control to the
>> application.
>
> Most applications won't like this burden, because most application
> programmers don't know enough about the issues to solve them correctly,
> especially for users of other OSes and locales.

AFAIK, Windows allows pathnames that are illegal Unicode as well, namely
pathnames with isolated surrogate code points (<URL:
https://github.com/rust-lang/rust/issues/12056>).

I don't have access to a Windows machine so maybe somebody else could
confirm my suspicion.


Marko
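
Why an isolated surrogate is a problem for a strict Unicode string type
can be shown without a Windows machine. The sketch below is Python and
says nothing about what any particular Windows version accepts; it only
shows how strict codecs treat a 16-bit sequence containing a lone
0xD800. The helper name fails_with is made up.

```python
# Three UTF-16-LE code units: 'a', an unpaired high surrogate 0xD800, 'b'.
wide_name = b"a\x00\x00\xd8b\x00"


def fails_with(exc_type, func) -> bool:
    """Return True if calling func raises exc_type."""
    try:
        func()
    except exc_type:
        return True
    return False


# A strict UTF-16 decoder rejects the sequence outright.
strict_decode_fails = fails_with(
    UnicodeDecodeError, lambda: wide_name.decode("utf-16-le")
)

# Python's "surrogatepass" handler lets the lone surrogate through...
as_text = wide_name.decode("utf-16-le", errors="surrogatepass")

# ...but the result still cannot be encoded as strict UTF-8.
strict_encode_fails = fails_with(
    UnicodeEncodeError, lambda: as_text.encode("utf-8")
)
```

A string type that only admits valid Unicode scalar values has no way
to hold such a name, which is the same difficulty as undecodable bytes
on a POSIX file system.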




* Re: guile can't find a chinese named file
  2017-02-15 17:04                                   ` Eli Zaretskii
@ 2017-02-15 20:07                                     ` tomas
  2017-02-15 20:22                                       ` Eli Zaretskii
  0 siblings, 1 reply; 110+ messages in thread
From: tomas @ 2017-02-15 20:07 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: guile-user, dak

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Wed, Feb 15, 2017 at 07:04:10PM +0200, Eli Zaretskii wrote:
> > Date: Wed, 15 Feb 2017 11:10:33 +0100
> > From: <tomas@tuxteam.de>
> > Cc: guile-user@gnu.org
> > 
> > Yes, Emacs is the text specialist.
> > 
> > It has taken years and a bunch of very smart, experienced and *patient*
> > folks to achieve that.
> 
> It took many years because those smart, experienced, and patient
> people made bad decisions, twice, and had to correct them later, which
> required rewriting several important internal mechanisms.  Which tells
> you that smarts, experience, and patience are not enough to get this
> right the first time.

That's in my view part of being smart (and yes, you are one of those
smart people I had in mind: thanks *a lot* for that!).

And thanks for your insights.

regards
- -- tomás
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)

iEYEARECAAYFAliktRkACgkQBcgs9XrR2kZeSgCfW3G7w1/J0na7ELAZaae9U5rz
pTEAnRcBbtCbhSQo1g5HL47Mw+rwy44+
=0jCm
-----END PGP SIGNATURE-----




* Re: guile can't find a chinese named file
  2017-02-15 16:59                               ` Eli Zaretskii
  2017-02-15 17:53                                 ` Marko Rauhamaa
@ 2017-02-15 20:20                                 ` tomas
  2017-02-15 20:32                                   ` Eli Zaretskii
  1 sibling, 1 reply; 110+ messages in thread
From: tomas @ 2017-02-15 20:20 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: guile-user

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Wed, Feb 15, 2017 at 06:59:14PM +0200, Eli Zaretskii wrote:
> > Date: Wed, 15 Feb 2017 10:18:32 +0100
> > From: <tomas@tuxteam.de>

[...]

> > Most notably, the whole path might cross several mount points, thus
> > the whole path can well have fragments coming from several file systems.
> 
> A possible solution would be to decode each mount point's part as it
> is being resolved.

...which can only be based on guesswork: there's no reliable info on
the encoding used for that file system (if it's consistent at all).

What can we do? Try different encodings until one "works"? That amounts
to trying UTF-8 and then some Latin-x, and the latter would "fit" for
any x.

> > I think the only sane way to see a Linux file system path is the way
> > Linux sees it: as a byte string.
> 
> This would lose a lot in 99% of use cases.  You are, in effect,
> suggesting a "reverse optimization", whereby the majority of use cases
> is punished in favor of a small minority, based on theoretical
> intractability.

I feel queasy doing some voodoo without the application having
a word on it. In the Emacs context it's a bit easier, because in
the "normal" case things are pretty quickly deferred to the user
(usually).

> > Sure, some helper infrastructure to try to make characters of that
> > mess will be welcome, but that should be absolutely robust wrt.
> > unexpected input (e.g. bad UTF-8) and leave control to the application.
> 
> Most applications won't like this burden, because most application
> programmers don't know enough about the issues to solve them correctly,
> especially for users of other OSes and locales.
> 
> > > But if OpenBSD requires all _filenames_ to be in valid UTF-8, that
> > > is a bad decision in my view.
> > 
> > NT has done that too.
> 
> Windows can do that because it also transparently translates file
> names to the locale's encoding when files are accessed with ANSI APIs.
> Without such translation, this kind of decision is unwise, IMO.

I guess (I don't *know*) Windows stores information about the encoding
at file system level (and keeps that consistent). Linux hasn't that,
it just keeps out of it. It hasn't even a place to state the encoding
used.

Thanks&regards
- -- tomás
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)

iEYEARECAAYFAlikuCgACgkQBcgs9XrR2kauCACfTpfRpHhL2iUJXET5zqokA6US
+pkAnjIc7Q+hBPj9Vi9Pk46AsmI3yA5m
=RXAn
-----END PGP SIGNATURE-----
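
The weakness of "try encodings until one works" can be sketched in
Python. The function guess_decode and its candidate list are
hypothetical; the point is only that a fallback such as Latin-1 accepts
every byte sequence, so "it decoded" proves nothing.

```python
def guess_decode(raw: bytes, candidates=("utf-8", "latin-1")):
    """Return (encoding, text) for the first candidate that decodes."""
    for encoding in candidates:
        try:
            return encoding, raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding fits")


# A UTF-8 name is recognised, because strict UTF-8 is picky.
utf8_guess = guess_decode("名字".encode("utf-8"))

# A name written under ISO 8859-2 is not valid UTF-8, so the guess falls
# through to Latin-1, which "fits" but yields the wrong character.
latin2_bytes = "ž".encode("iso8859-2")
latin2_guess = guess_decode(latin2_bytes)

# Latin-1 maps each of the 256 byte values to a character, so it never fails.
all_bytes_decode = len(bytes(range(256)).decode("latin-1")) == 256
```

Only the UTF-8 step carries real information; whatever 8-bit encoding
comes next is accepted unconditionally.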




* Re: guile can't find a chinese named file
  2017-02-15 20:07                                     ` tomas
@ 2017-02-15 20:22                                       ` Eli Zaretskii
  0 siblings, 0 replies; 110+ messages in thread
From: Eli Zaretskii @ 2017-02-15 20:22 UTC (permalink / raw)
  To: tomas; +Cc: guile-user, dak

> Date: Wed, 15 Feb 2017 21:07:53 +0100
> From: tomas@tuxteam.de
> Cc: tomas@tuxteam.de, dak@gnu.org, guile-user@gnu.org
> 
> > It took many years because those smart, experienced, and patient
> > people made bad decisions, twice, and had to correct them later, which
> > required rewriting several important internal mechanisms.  Which tells
> > you that smarts, experience, and patience are not enough to get this
> > right the first time.
> 
> That's in my view part of being smart (and yes, you are one of those
> smart people I had in mind: thanks *a lot* for that!).

Actually, at the time I was a relatively passive observer.  I did very
little for the related code itself (with the exception of the Windows
handling of file names, which was done years later).  So it isn't my
credit to take.

But thanks anyway.

> And thanks for your insights.

You are welcome.




* Re: guile can't find a chinese named file
  2017-02-15 20:20                                 ` tomas
@ 2017-02-15 20:32                                   ` Eli Zaretskii
  2017-02-15 21:04                                     ` Marko Rauhamaa
  2017-02-15 21:15                                     ` tomas
  0 siblings, 2 replies; 110+ messages in thread
From: Eli Zaretskii @ 2017-02-15 20:32 UTC (permalink / raw)
  To: tomas; +Cc: guile-user

> Date: Wed, 15 Feb 2017 21:20:56 +0100
> From: tomas@tuxteam.de
> Cc: guile-user@gnu.org
> 
> > > Most notably, the whole path might cross several mount points, thus
> > > the whole path can well have fragments coming from several file systems.
> > 
> > A possible solution would be to decode each mount point's part as it
> > is being resolved.
> 
> ...which can only be based on guesswork: there's no reliable info on
> the encoding used for that file system (if it's consistent at all).

You could maintain a database of encodings per file system, perhaps
user-defined, or derived by some other means.  E.g., for volumes that
physically reside on Windows or macOS the encoding is pretty much
known in advance.

> > > I think the only sane way to see a Linux file system path is the way
> > > Linux sees it: as a byte string.
> > 
> > This would lose a lot in 99% of use cases.  You are, in effect,
> > suggesting a "reverse optimization", whereby the majority of use cases
> > is punished in favor of a small minority, based on theoretical
> > intractability.
> 
> I feel queasy doing some voodoo without the application having
> a word on it. In the Emacs context it's a bit easier, because in
> the "normal" case things are pretty quickly deferred to the user
> (usually).

Not really, there are a lot of internal operations that access files
and directories, and would wreak major havoc if they don't succeed,
silently, in the absolute majority of uses.

> > > NT has done that too.
> > 
> > Windows can do that because it also transparently translates file
> > names to the locale's encoding when files are accessed with ANSI APIs.
> > Without such translation, this kind of decision is unwise, IMO.
> 
> I guess (I don't *know*) Windows stores information about the encoding
> at file system level (and keeps that consistent).

No.  At the file system level (for NTFS volumes at least) Windows file
names are always UTF-16 encoded, and Windows just "knows" that.
Windows converts that to the locale's codepage when you access files
via an API that communicates file names encoded in that codepage.  (If
the conversion fails, you get question marks instead of the characters
that couldn't be converted.)

> Linux hasn't that, it just keeps out of it. It hasn't even a place
> to state the encoding used.

Exactly.  Which is why forcing a single file-name encoding on
Linux/Unix filesystems is IMO a bad idea.
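
A per-file-system encoding table, with each path component decoded
according to the file system that holds it, might look like the sketch
below. It is Python; the table contents, the mount point and the
function decode_path are all invented for the example, and it handles
absolute paths only. Undecodable bytes are kept via Python's
"surrogateescape" handler rather than rejected.

```python
# Hypothetical, user-maintained table: mount point (as bytes) -> encoding.
root_encoding = "utf-8"
stick_mount = "/média/stick".encode(root_encoding)
MOUNT_ENCODINGS = {b"/": root_encoding, stick_mount: "latin-1"}


def decode_path(path: bytes, table: dict) -> str:
    """Decode an absolute path component by component.

    A directory entry is decoded with the encoding of the file system
    that contains it, so a mount point's own name uses the parent's
    encoding and everything below it uses the mounted volume's.
    """
    encoding = table.get(b"/", "utf-8")
    prefix = b""
    decoded = []
    for part in path.split(b"/"):
        if not part:
            continue
        decoded.append(part.decode(encoding, errors="surrogateescape"))
        prefix += b"/" + part
        encoding = table.get(prefix, encoding)  # switch when crossing a mount
    return "/" + "/".join(decoded)


mixed = stick_mount + b"/" + "café.ogg".encode("latin-1")
text = decode_path(mixed, MOUNT_ENCODINGS)
```

This covers the mixed-encoding path across mount points, but not the
case where one file system holds names in several encodings.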




* Re: guile can't find a chinese named file
  2017-02-15 20:32                                   ` Eli Zaretskii
@ 2017-02-15 21:04                                     ` Marko Rauhamaa
  2017-02-16  5:44                                       ` Eli Zaretskii
  2017-02-15 21:15                                     ` tomas
  1 sibling, 1 reply; 110+ messages in thread
From: Marko Rauhamaa @ 2017-02-15 21:04 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: guile-user

Eli Zaretskii <eliz@gnu.org>:

> At the file system level (for NTFS volumes at least) Windows file
> names are always UTF-16 encoded, and Windows just "knows" that.

Hm, I had the impression NTFS filenames were UCS-2 (<URL:
https://en.wikipedia.org/wiki/Talk%3AUTF-16/UCS-2>).


Marko




* Re: guile can't find a chinese named file
  2017-02-15 20:32                                   ` Eli Zaretskii
  2017-02-15 21:04                                     ` Marko Rauhamaa
@ 2017-02-15 21:15                                     ` tomas
  2017-02-16  5:54                                       ` Eli Zaretskii
  1 sibling, 1 reply; 110+ messages in thread
From: tomas @ 2017-02-15 21:15 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: guile-user

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Wed, Feb 15, 2017 at 10:32:57PM +0200, Eli Zaretskii wrote:
> > Date: Wed, 15 Feb 2017 21:20:56 +0100
> > From: tomas@tuxteam.de
> > Cc: guile-user@gnu.org
> > 
> > > > Most notably, the whole path might cross several mount points, thus
> > > > the whole path can well have fragments coming from several file systems.
> > > 
> > > A possible solution would be to decode each mount point's part as it
> > > is being resolved.
> > 
> > ...which can only be based on guesswork: there's no reliable info on
> > the encoding used for that file system (if it's consistent at all).
> 
> You could maintain a database of encodings per file system, perhaps
> user-defined, or derived by some other means.  E.g., for volumes that
> physically reside on Windows or macOS the encoding is pretty much
> known in advance.

This is what I mean by "voodoo". We don't even know the encoding to be
consistent within one file system. An example would be the home dirs
of different users running under different locales (an extreme example:
they may have different 8 bit locales!).

[...]

> > I feel queasy doing some voodoo without the application having
> > a word on it. In the Emacs context it's a bit easier, because in
> > the "normal" case things are pretty quickly deferred to the user
> > (usually).
> 
> Not really, there are a lot of internal operations that access files
> and directories, and would wreak major havoc if they don't succeed,
> silently, in the absolute majority of uses.

That was the "a bit" part :-)

Anyway, having an encoding à la Emacs eases things a lot, since a
string can at least survive unharmed a plain round trip. The problem
of properly displaying that remains unsolved. Plus operations on that
string (concatenation, e.g.).

[...]

> > I guess (I don't *know*) Windows stores information about the encoding
> > at file system level (and keeps that consistent).
> 
> No.  At the file system level (for NTFS volumes at least) Windows file
> names are always UTF-16 encoded, and Windows just "knows" that.
> Windows converts that to the locale's codepage when you access files
> via an API that communicates file names encoded in that codepage.  (If
> the conversion fails, you get question marks instead of the characters
> that couldn't be converted.)

I see. That means that Windows has to use surrogates for everything
beyond the BMP, right? The heritage from the times Unicode was "just"
16 bit...

> > Linux hasn't that, it just keeps out of it. It hasn't even a place
> > to state the encoding used.
> 
> Exactly.  Which is why forcing a single file-name encoding on
> Linux/Unix filesystems is IMO a bad idea.

Agreed, that can't be done. It'd be nice to have one encoding per file
system, but we don't even have that :-(

regards
- -- tomás
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)

iEYEARECAAYFAlikxQgACgkQBcgs9XrR2kbCpQCfcLLffP3e3JdW1gg4DVylHQeo
cjAAnRwVgtZR0qIce7IkU73vUHpLSvMG
=jl5p
-----END PGP SIGNATURE-----
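
The surrogate mechanism for characters beyond the BMP can be checked
directly. The sketch is Python; utf16_units is a small helper written
for this illustration.

```python
def utf16_units(text: str) -> list:
    """Return the 16-bit code units of text in UTF-16."""
    raw = text.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


bmp = utf16_units("字")              # U+5B57, inside the BMP
astral = utf16_units("\U0001D11E")   # U+1D11E MUSICAL SYMBOL G CLEF

print([hex(u) for u in bmp])     # ['0x5b57']
print([hex(u) for u in astral])  # ['0xd834', '0xdd1e']
```

A BMP character takes one code unit, while anything above U+FFFF takes
a high surrogate followed by a low surrogate.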




* Re: guile can't find a chinese named file
  2017-02-15 21:04                                     ` Marko Rauhamaa
@ 2017-02-16  5:44                                       ` Eli Zaretskii
  2017-02-16  6:15                                         ` Marko Rauhamaa
  0 siblings, 1 reply; 110+ messages in thread
From: Eli Zaretskii @ 2017-02-16  5:44 UTC (permalink / raw)
  To: Marko Rauhamaa; +Cc: guile-user

> From: Marko Rauhamaa <marko@pacujo.net>
> Cc: tomas@tuxteam.de,  guile-user@gnu.org
> Date: Wed, 15 Feb 2017 23:04:52 +0200
> 
> Eli Zaretskii <eliz@gnu.org>:
> 
> > At the file system level (for NTFS volumes at least) Windows file
> > names are always UTF-16 encoded, and Windows just "knows" that.
> 
> Hm, I had the impression NTFS filenames were UCS-2 (<URL:
> https://en.wikipedia.org/wiki/Talk%3AUTF-16/UCS-2>).

What is the difference, in the context of this discussion?




* Re: guile can't find a chinese named file
  2017-02-15 21:15                                     ` tomas
@ 2017-02-16  5:54                                       ` Eli Zaretskii
  0 siblings, 0 replies; 110+ messages in thread
From: Eli Zaretskii @ 2017-02-16  5:54 UTC (permalink / raw)
  To: tomas; +Cc: guile-user

> Date: Wed, 15 Feb 2017 22:15:52 +0100
> From: tomas@tuxteam.de
> Cc: guile-user@gnu.org
> 
> > > > A possible solution would be to decode each mount point's part as it
> > > > is being resolved.
> > > 
> > > ...which can only be based on guesswork: there's no reliable info on
> > > the encoding used for that file system (if it's consistent at all).
> > 
> > You could maintain a database of encodings per file system, perhaps
> > user-defined, or derived by some other means.  E.g., for volumes that
> > physically reside on Windows or macOS the encoding is pretty much
> > known in advance.
> 
> This is what I mean by "voodoo".

Such "voodoo" is what Emacs does, more or less (not in this particular
use case, though).  This is what makes it so useful and successful.
Refusing to use such techniques because they are theoretically
imperfect is an obstacle to making useful software systems that
support multi-lingual environments.

> We don't even know the encoding to be consistent within one file
> system.

In almost all cases, it is.  Once again, the 99% vs 1% issue.

> An example would be the home dirs of different users running under
> different locales (an extreme example: they may have different 8 bit
> locales!).

Did you ever see such a use case in practice?

Besides, my suggestion works there as well, given a large enough
database that users can augment.

> Anyway, having an encoding à la Emacs eases things a lot, since a
> string can at least survive unharmed a plain round trip.

That's a basic requirement, yes.

> The problem of properly displaying that remains unsolved.

This must be solved sufficiently in the majority of use cases; doing
that is not hard.  For the rest, there should be optional
settings/commands to get the correct display.  Example: the (now
largely unnecessary) rmail-redecode-body command in Rmail.

> Plus operations on that string (concatenation, e.g.).

No, this can be easily coded to support raw bytes.  Emacs does that.

> > No.  At the file system level (for NTFS volumes at least) Windows file
> > names are always UTF-16 encoded, and Windows just "knows" that.
> > Windows converts that to the locale's codepage when you access files
> > via an API that communicates file names encoded in that codepage.  (If
> > the conversion fails, you get question marks instead of the characters
> > that couldn't be converted.)
> 
> I see. That means that Windows has to use surrogates for everything
> beyond the BMP, right?

Yes.
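
A minimal sketch of what that means in practice, written in Python only
because its standard codecs make the code units easy to inspect (the
helper name and the sample characters are chosen for illustration): a
code point beyond the BMP occupies two 16-bit units in UTF-16, a high
surrogate followed by a low surrogate.

```python
import struct

def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))

bmp = utf16_units("\u540d")          # U+540D, inside the BMP: one unit
astral = utf16_units("\U00020000")   # U+20000, outside the BMP: two units

print([hex(u) for u in bmp])
print([hex(u) for u in astral])
```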




* Re: guile can't find a chinese named file
  2017-02-16  5:44                                       ` Eli Zaretskii
@ 2017-02-16  6:15                                         ` Marko Rauhamaa
  2017-02-16  6:29                                           ` Eli Zaretskii
  0 siblings, 1 reply; 110+ messages in thread
From: Marko Rauhamaa @ 2017-02-16  6:15 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: guile-user

Eli Zaretskii <eliz@gnu.org>:

>> From: Marko Rauhamaa <marko@pacujo.net>
>> Cc: tomas@tuxteam.de,  guile-user@gnu.org
>> Date: Wed, 15 Feb 2017 23:04:52 +0200
>> 
>> Eli Zaretskii <eliz@gnu.org>:
>> 
>> > At the file system level (for NTFS volumes at least) Windows file
>> > names are always UTF-16 encoded, and Windows just "knows" that.
>> 
>> Hm, I had the impression NTFS filenames were UCS-2 (<URL:
>> https://en.wikipedia.org/wiki/Talk%3AUTF-16/UCS-2>).
>
> What is the difference, in the context of this discussion?

It is possible to have illegal Unicode even in Windows filenames, ie,
filenames not expressible using Guile's strings.


Marko




* Re: guile can't find a chinese named file
  2017-02-16  6:15                                         ` Marko Rauhamaa
@ 2017-02-16  6:29                                           ` Eli Zaretskii
  2017-02-16  6:41                                             ` Eli Zaretskii
  2017-02-16  7:02                                             ` Marko Rauhamaa
  0 siblings, 2 replies; 110+ messages in thread
From: Eli Zaretskii @ 2017-02-16  6:29 UTC (permalink / raw)
  To: Marko Rauhamaa; +Cc: guile-user

> From: Marko Rauhamaa <marko@pacujo.net>
> Cc: tomas@tuxteam.de,  guile-user@gnu.org
> Date: Thu, 16 Feb 2017 08:15:57 +0200
> 
> It is possible to have illegal Unicode even in Windows filenames, ie,
> filenames not expressible using Guile's strings.

Is it really possible?  Can you show a code example that would create
such an illegal filename on Windows?




* Re: guile can't find a chinese named file
  2017-02-16  6:29                                           ` Eli Zaretskii
@ 2017-02-16  6:41                                             ` Eli Zaretskii
  2017-02-16  7:16                                               ` Marko Rauhamaa
  2017-02-16  7:02                                             ` Marko Rauhamaa
  1 sibling, 1 reply; 110+ messages in thread
From: Eli Zaretskii @ 2017-02-16  6:41 UTC (permalink / raw)
  To: marko; +Cc: guile-user

> Date: Thu, 16 Feb 2017 08:29:14 +0200
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: guile-user@gnu.org
> 
> > From: Marko Rauhamaa <marko@pacujo.net>
> > Cc: tomas@tuxteam.de,  guile-user@gnu.org
> > Date: Thu, 16 Feb 2017 08:15:57 +0200
> > 
> > It is possible to have illegal Unicode even in Windows filenames, ie,
> > filenames not expressible using Guile's strings.
> 
> Is it really possible?  Can you show a code example that would create
> such an illegal filename on Windows?

Btw, if by "UCS-2" you meant to say that only characters within the
BMP are supported in file names on Windows, then this is wrong: since
Windows XP, NTFS volumes support file names with characters outside of
the BMP.  I've just successfully created files with such file names on
Windows XP using Emacs.




* Re: guile can't find a chinese named file
  2017-02-16  6:29                                           ` Eli Zaretskii
  2017-02-16  6:41                                             ` Eli Zaretskii
@ 2017-02-16  7:02                                             ` Marko Rauhamaa
  2017-02-16 15:47                                               ` Eli Zaretskii
  1 sibling, 1 reply; 110+ messages in thread
From: Marko Rauhamaa @ 2017-02-16  7:02 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: guile-user

Eli Zaretskii <eliz@gnu.org>:

>> From: Marko Rauhamaa <marko@pacujo.net>
>> Cc: tomas@tuxteam.de,  guile-user@gnu.org
>> Date: Thu, 16 Feb 2017 08:15:57 +0200
>> 
>> It is possible to have illegal Unicode even in Windows filenames, ie,
>> filenames not expressible using Guile's strings.
>
> Is it really possible? Can you show a code example that would create
> such an illegal filename on Windows?

I have to rely on hearsay since I don't have Windows at my disposal:

   NTFS allows any sequence of 16-bit values for name encoding (file
   names, stream names, index names, etc.) except 0x0000.

   <URL: https://en.wikipedia.org/wiki/NTFS#Internals>

Not all sequences of 16-bit values are legal UTF-16.
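
To make that concrete, here is a small Python sketch (the function name
and sample values are invented for illustration; it says nothing about
what any Windows API accepts). It only checks whether a sequence of
16-bit values is well-formed UTF-16, using Python's strict decoder.

```python
def decodes_as_utf16(units):
    """Return True if the 16-bit values form well-formed UTF-16."""
    raw = b"".join(u.to_bytes(2, "little") for u in units)
    try:
        raw.decode("utf-16-le")   # strict decoding rejects lone surrogates
        return True
    except UnicodeDecodeError:
        return False

paired = [0xD840, 0xDC00]   # high + low surrogate: one valid character
lone = [0xD800, 0x0041]     # high surrogate followed by "A": ill-formed

print(decodes_as_utf16(paired), decodes_as_utf16(lone))
```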


Marko




* Re: guile can't find a chinese named file
  2017-02-16  6:41                                             ` Eli Zaretskii
@ 2017-02-16  7:16                                               ` Marko Rauhamaa
  2017-02-16  8:26                                                 ` David Kastrup
  2017-02-16 16:06                                                 ` Eli Zaretskii
  0 siblings, 2 replies; 110+ messages in thread
From: Marko Rauhamaa @ 2017-02-16  7:16 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: guile-user

Eli Zaretskii <eliz@gnu.org>:

> Btw, if by "UCS-2" you meant to say that only characters within the
> BMP are supported in file names on Windows, then this is wrong

No, I'm claiming Windows allows pathnames to contain isolated surrogate
code points, which cannot be decoded back to Unicode with UTF-16.

The situation is completely analogous to Linux pathnames that can
contain illegal UTF-8.

> : since Windows XP, NTFS volumes support file names with characters
> outside of the BMP. I've just successfully created files with such
> file names on Windows XP using Emacs.

Both Windows and Linux filenames support all of Unicode. Trouble is,
both of them support more than Unicode, making it impossible to use
Guile's strings for an arbitrary filename.

Python solves the problem by using a Unicode superset in its strings. I
think that's misguided, and Guile is correct in sticking to Unicode.
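
For reference, a minimal sketch of the Python behaviour meant here,
assuming the "surrogateescape" error handler (PEP 383) is the mechanism
in question; the file name is made up. Undecodable bytes become lone
surrogate code points in the string, which is why the result is a
superset of Unicode, and encoding with the same handler restores the
original bytes.

```python
raw = b"filename_\xe9.scm"   # Latin-1 "e acute": not valid UTF-8

# The undecodable byte 0xE9 becomes the lone surrogate U+DCE9.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original byte.
roundtrip = text.encode("utf-8", errors="surrogateescape")

# A strict encoder refuses the string, since U+DCE9 is not a valid scalar.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print(ascii(text), roundtrip == raw, strict_ok)
```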

If I understood it correctly, someone just told us emacs maps illegal
UTF-8 to another form of illegal UTF-8 and back. That's better in that
it's bytes to bytes (leaving Unicode out), but it's not immediately
obvious to me why you have to transform the byte sequence at all.

Look at the problem of concatenation. We could have a case where two
illegal UTF-8 (or UTF-16) snippets are concatenated to get valid UTF-8
(or UTF-16). That operation fails if you try to translate the snippets
to strings before concatenation. Such concatenation operations are
commonplace when dealing with filenames (eg, split(1)).


Marko




* Re: guile can't find a chinese named file
  2017-02-16  7:16                                               ` Marko Rauhamaa
@ 2017-02-16  8:26                                                 ` David Kastrup
  2017-02-16 10:21                                                   ` Marko Rauhamaa
  2017-02-16 16:06                                                 ` Eli Zaretskii
  1 sibling, 1 reply; 110+ messages in thread
From: David Kastrup @ 2017-02-16  8:26 UTC (permalink / raw)
  To: guile-user

Marko Rauhamaa <marko@pacujo.net> writes:

> Eli Zaretskii <eliz@gnu.org>:
>
>> Btw, if by "UCS-2" you meant to say that only characters within the
>> BMP are supported in file names on Windows, then this is wrong
>
> No, I'm claiming Windows allows pathnames to contain isolated surrogate
> code points, which cannot be decoded back to Unicode with UTF-16.
>
> The situation is completely analogous to Linux pathnames that can
> contain illegal UTF-8.
>
>> : since Windows XP, NTFS volumes support file names with characters
>> outside of the BMP. I've just successfully created files with such
>> file names on Windows XP using Emacs.
>
> Both Windows and Linux filenames support all of Unicode. Trouble is,
> both of them support more than Unicode, making it impossible to use
> Guile's strings for an arbitrary filename.
>
> Python solves the problem by using a Unicode superset in its strings. I
> think that's misguided, and Guile is correct in sticking to Unicode.
>
> If I understood it correctly, someone just told us emacs maps illegal
> UTF-8 to another form of illegal UTF-8 and back. That's better in that
> it's bytes to bytes (leaving Unicode out), but it's not immediately
> obvious to me why you have to transform the byte sequence at all.

After the transformation, the resulting string satisfies the invariants
of the UTF-8 encoding scheme: characters are either a single byte in the
range 0x00 to 0x7f, or they are a sequence of a byte starting with n+1
high bits set and n bytes of value 0x80 to 0xbf following.

All string operators are able to access the individual _characters_ of
such strings (rather than bytes) by working with the general coding
scheme.  In addition, valid UTF-8 remains valid UTF-8 and all string
search/regexp operations work fine on it.  In fact, string search/regexp
operations even work with Emacs strings representing _invalid_ UTF-8
sequences.

The purpose of the transform is to _not_ have a flavorless chunk of
bytes but rather something organized into characters.  Of course this
only makes sense when _typically_ those characters correspond to valid
UTF-8.  If you know your string to be UTF-16, a different reencoding
would make sense, but preferably also one preserving both valid and
invalid content.

> Look at the problem of concatenation. We could have a case where two
> illegal UTF-8 (or UTF-16) snippets are concatenated to get valid UTF-8
> (or UTF-16).

Not possible in the internal Emacs representation.  Its character
boundaries are established on decoding.  Only after reencoding such a
concatenation, valid UTF-8 may fall out.  If you transfer text in blocks
of a given byte size, this may happen.  But concatenating in the decoded
character domain will not magically change the character count: you'll
retain two (or more) pieces of a split UTF-8 character.  If you want to
do processing on concatenated blocks, you better only decode after
concatenation.  Or reencode and redecode, but that seems wasteful.  But
it would work.

> That operation fails if you try to translate the snippets to strings
> before concatenation. Such concatenation operations are commonplace
> when dealing with filenames (eg, split(1)).

split(1) does not "deal with filenames" when splitting, but the
individual files may be split inside of UTF-8 sequences.  See above.

-- 
David Kastrup





* Re: guile can't find a chinese named file
  2017-02-16  8:26                                                 ` David Kastrup
@ 2017-02-16 10:21                                                   ` Marko Rauhamaa
  2017-02-16 10:43                                                     ` David Kastrup
  0 siblings, 1 reply; 110+ messages in thread
From: Marko Rauhamaa @ 2017-02-16 10:21 UTC (permalink / raw)
  To: David Kastrup; +Cc: guile-user

David Kastrup <dak@gnu.org>:

> Marko Rauhamaa <marko@pacujo.net> writes:
>> That operation fails if you try to translate the snippets to strings
>> before concatenation. Such concatenation operations are commonplace
>> when dealing with filenames (eg, split(1)).
>
> split(1) does not "deal with filenames" when splitting, but the
> individual files may be split inside of UTF-8 sequences.  See above.

You probably cannot produce valid UTF-8 out of invalid UTF-8 snippets
with split(1). However split(1) does form filenames out of its arguments
by concatenation:

    split --additional-suffix=suffix file prefix

produces these kinds of filenames:

    <prefix><ordinal><suffix>


Marko




* Re: guile can't find a chinese named file
  2017-02-16 10:21                                                   ` Marko Rauhamaa
@ 2017-02-16 10:43                                                     ` David Kastrup
  2017-02-16 11:04                                                       ` Marko Rauhamaa
  0 siblings, 1 reply; 110+ messages in thread
From: David Kastrup @ 2017-02-16 10:43 UTC (permalink / raw)
  To: Marko Rauhamaa; +Cc: guile-user

Marko Rauhamaa <marko@pacujo.net> writes:

> David Kastrup <dak@gnu.org>:
>
>> Marko Rauhamaa <marko@pacujo.net> writes:
>>> That operation fails if you try to translate the snippets to strings
>>> before concatenation. Such concatenation operations are commonplace
>>> when dealing with filenames (eg, split(1)).
>>
>> split(1) does not "deal with filenames" when splitting, but the
>> individual files may be split inside of UTF-8 sequences.  See above.
>
> You probably cannot produce valid UTF-8 out of invalid UTF-8 snippets
> with split(1). However split(1) does form filenames out of its arguments
> by concatenation:
>
>     split --additional-suffix=suffix file prefix
>
> produces these kinds of filenames:
>
>     <prefix><ordinal><suffix>

I don't really get your point here.  Why would you start with invalid
UTF-8 sequences in the filenames?

-- 
David Kastrup




* Re: guile can't find a chinese named file
  2017-02-16 10:43                                                     ` David Kastrup
@ 2017-02-16 11:04                                                       ` Marko Rauhamaa
  2017-02-16 11:11                                                         ` David Kastrup
  0 siblings, 1 reply; 110+ messages in thread
From: Marko Rauhamaa @ 2017-02-16 11:04 UTC (permalink / raw)
  To: David Kastrup; +Cc: guile-user

David Kastrup <dak@gnu.org>:

> Marko Rauhamaa <marko@pacujo.net> writes:
>> You probably cannot produce valid UTF-8 out of invalid UTF-8 snippets
>> with split(1). However split(1) does form filenames out of its
>> arguments by concatenation:
>>
>>     split --additional-suffix=suffix file prefix
>>
>> produces these kinds of filenames:
>>
>>     <prefix><ordinal><suffix>
>
> I don't really get your point here.  Why would you start with invalid
> UTF-8 sequences in the filenames?

There's nothing preventing such filenames from appearing on a Linux
system. They might come from a zip file with Latin-1-encoded names, for
example.

I have files older than UTF-8 on my Linux system. I have files encoded
in Latin-3, for example.

Worst of all, they might be part of an attack on your system. For
example, files whose names contain invalid UTF-8 could evade file
listing altogether, they might make your program crash in unexpected
ways or you might not be able to remove them.


Marko




* Re: guile can't find a chinese named file
  2017-02-16 11:04                                                       ` Marko Rauhamaa
@ 2017-02-16 11:11                                                         ` David Kastrup
  2017-02-16 11:32                                                           ` Marko Rauhamaa
  0 siblings, 1 reply; 110+ messages in thread
From: David Kastrup @ 2017-02-16 11:11 UTC (permalink / raw)
  To: Marko Rauhamaa; +Cc: guile-user

Marko Rauhamaa <marko@pacujo.net> writes:

> David Kastrup <dak@gnu.org>:
>
>> Marko Rauhamaa <marko@pacujo.net> writes:
>>> You probably cannot produce valid UTF-8 out of invalid UTF-8 snippets
>>> with split(1). However split(1) does form filenames out of its
>>> arguments by concatenation:
>>>
>>>     split --additional-suffix=suffix file prefix
>>>
>>> produces these kinds of filenames:
>>>
>>>     <prefix><ordinal><suffix>
>>
>> I don't really get your point here.  Why would you start with invalid
>> UTF-8 sequences in the filenames?
>
> There's nothing preventing such filenames from appearing on a Linux
> system. They might come from a zip file with Latin-1-encoded names, for
> example.

I still don't get your point.  split does not use <file> for generating
file names, only for getting its contents.  The generated file names are
built from the <prefix> and additional characters.

> I have files older than UTF-8 on my Linux system. I have files encoded
> in Latin-3, for example.

It's still irrelevant since split does not _use_ the existing file name
for constructing new file names.

-- 
David Kastrup




* Re: guile can't find a chinese named file
  2017-02-16 11:11                                                         ` David Kastrup
@ 2017-02-16 11:32                                                           ` Marko Rauhamaa
  2017-02-16 11:49                                                             ` David Kastrup
  0 siblings, 1 reply; 110+ messages in thread
From: Marko Rauhamaa @ 2017-02-16 11:32 UTC (permalink / raw)
  To: David Kastrup; +Cc: guile-user

David Kastrup <dak@gnu.org>:
> It's still irrelevant since split does not _use_ the existing file name
> for constructing new file names.

Split was just an example of a command that concatenates byte sequences
to get pathnames, nothing more.

Such concatenation is commonplace in Linux programs of all kinds.

And the point of bringing concatenation into the discussion was that
remapping byte sequences to byte sequences breaks concatenation
additivity:

   U(x) + U(y) = U(x + y)


Marko




* Re: guile can't find a chinese named file
  2017-02-16 11:32                                                           ` Marko Rauhamaa
@ 2017-02-16 11:49                                                             ` David Kastrup
  2017-02-16 12:14                                                               ` Marko Rauhamaa
  0 siblings, 1 reply; 110+ messages in thread
From: David Kastrup @ 2017-02-16 11:49 UTC (permalink / raw)
  To: Marko Rauhamaa; +Cc: guile-user

Marko Rauhamaa <marko@pacujo.net> writes:

> David Kastrup <dak@gnu.org>:
>> It's still irrelevant since split does not _use_ the existing file name
>> for constructing new file names.
>
> Split was just an example of a command that concatenates byte sequences
> to get pathnames, nothing more.
>
> Such concatenation is commonplace in Linux programs of all kinds.
>
> And the point of bringing concatenation into the discussion was that
> remapping byte sequences to byte sequences breaks concatenation
> additivity:
>
>    U(x) + U(y) = U(x + y)

But Emacs' implementation doesn't in any respect "break concatenation
additivity".

If you split an arbitrary byte stream (including material invalid as
UTF-8) at an arbitrary point (including in the middle of a UTF-8
character), decode the resulting pieces as UTF-8 (as one of several
"reversible" encodings Emacs can interpret), concatenate the resulting
Emacs strings and reencode the result as UTF-8 (since you actually need
to provide a byte sequence to open(2) or similar), you will retain the
original byte stream.  No ifs and buts.

The _decoded_ concatenated string might differ from decoding the unsplit
byte string: it might contain "byte 0xc2, byte 0x80" (represented as
0xc1 0x82 0xc0 0x80) at the concatenation point rather than "character
0x80" (represented as 0xc2 0x80).  But the moment you use this
concatenation of half-sequences as a file name, it gets reencoded into
the bytes 0xc2 and 0x80 and works just fine.
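
The same property can be sketched outside Emacs.  The Python below is
only an analogy: it uses the "surrogateescape" error handler as a
stand-in for a reversible decoding, not the Emacs coding system itself.
The helper raw_byte_internal is hypothetical; its bit layout is inferred
solely from the two byte pairs quoted above and is not taken from the
Emacs sources.

```python
def decode(b):
    """Reversible decode: undecodable bytes are kept as escape code points."""
    return b.decode("utf-8", errors="surrogateescape")

def encode(s):
    """Inverse of decode(): escape code points turn back into raw bytes."""
    return s.encode("utf-8", errors="surrogateescape")

def raw_byte_internal(b):
    # ASSUMPTION: two-byte internal form of a raw byte, inferred from the
    # examples "0xc2 -> 0xc1 0x82" and "0x80 -> 0xc0 0x80" in the text.
    return bytes([0xC0 | ((b >> 6) & 1), 0x80 | (b & 0x3F)])

whole = b"\xc2\x80"                  # UTF-8 encoding of U+0080
left, right = whole[:1], whole[1:]   # split in the middle of the character

joined = decode(left) + decode(right)

print(joined == decode(whole))   # decoded forms differ
print(encode(joined) == whole)   # but the bytes survive the round trip
```

The decoded concatenation holds two escaped bytes where decoding the
unsplit input gives one character, yet reencoding yields the original
two bytes.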

-- 
David Kastrup




* Re: guile can't find a chinese named file
  2017-02-16 11:49                                                             ` David Kastrup
@ 2017-02-16 12:14                                                               ` Marko Rauhamaa
  2017-02-16 16:21                                                                 ` Eli Zaretskii
  0 siblings, 1 reply; 110+ messages in thread
From: Marko Rauhamaa @ 2017-02-16 12:14 UTC (permalink / raw)
  To: David Kastrup; +Cc: guile-user

David Kastrup <dak@gnu.org>:

> Marko Rauhamaa <marko@pacujo.net> writes:
>> And the point of bringing concatenation into the discussion was that
>> remapping byte sequences to byte sequences breaks concatenation
>> additivity:
>>
>>    U(x) + U(y) = U(x + y)
>
> But Emacs' implementation doesn't in any respect "break concatenation
> additivity".
>
> If you split an arbitrary byte stream (including material invalid as
> UTF-8) at an arbitrary point (including in the middle of a UTF-8
> character), decode the resulting pieces as UTF-8 (as one of several
> "reversible" encodings Emacs can interpret), concatenate the resulting
> Emacs strings and reencode the result as UTF-8 (since you actually
> need to provide a byte sequence to open(2) or similar), you will
> retain the original byte stream. No ifs and buts.
>
> The _decoded_ concatenated string might differ from decoding the
> unsplit byte string: it might contain "byte 0xc2, byte 0x80"
> (represented as 0xc1 0x82 0xc0 0x80) at the concatenation point rather
> than "character 0x80" (represented as 0xc2 0x80). But the moment you
> use this concatenation of half-sequences as a file name, it gets
> reencoded into the bytes 0xc2 and 0x80 and works just fine.

That is already a lot, maybe even enough.

(On the other side of the equation, expressing a filename in Unicode may
not produce an unambiguous code point sequence... <URL:
http://unicode.org/faq/normalization.html>)


Marko




* Re: guile can't find a chinese named file
  2017-02-16  7:02                                             ` Marko Rauhamaa
@ 2017-02-16 15:47                                               ` Eli Zaretskii
  0 siblings, 0 replies; 110+ messages in thread
From: Eli Zaretskii @ 2017-02-16 15:47 UTC (permalink / raw)
  To: Marko Rauhamaa; +Cc: guile-user

> From: Marko Rauhamaa <marko@pacujo.net>
> Cc: tomas@tuxteam.de,  guile-user@gnu.org
> Date: Thu, 16 Feb 2017 09:02:09 +0200
> 
> Eli Zaretskii <eliz@gnu.org>:
> 
> >> From: Marko Rauhamaa <marko@pacujo.net>
> >> Cc: tomas@tuxteam.de,  guile-user@gnu.org
> >> Date: Thu, 16 Feb 2017 08:15:57 +0200
> >> 
> >> It is possible to have illegal Unicode even in Windows filenames, ie,
> >> filenames not expressible using Guile's strings.
> >
> > Is it really possible? Can you show a code example that would create
> > such an illegal filename on Windows?
> 
> I have to rely on hearsay since I don't have Windows at my disposal:
> 
>    NTFS allows any sequence of 16-bit values for name encoding (file
>    names, stream names, index names, etc.) except 0x0000.
> 
>    <URL: https://en.wikipedia.org/wiki/NTFS#Internals>
> 
> Not all sequences of 16-bit values are legal UTF-16.

Of course.  But unlike on Unix, it is much harder to create such file
names, because Windows APIs won't allow that.  You probably will have
to access the directory entries on a very low level.

Anyway, this is not the issue at hand.  I only mentioned UTF-16
encoding on Windows because Tomás thought file names on Windows can be
encoded in several different encodings.




* Re: guile can't find a chinese named file
  2017-02-16  7:16                                               ` Marko Rauhamaa
  2017-02-16  8:26                                                 ` David Kastrup
@ 2017-02-16 16:06                                                 ` Eli Zaretskii
  2017-02-16 16:35                                                   ` Marko Rauhamaa
  1 sibling, 1 reply; 110+ messages in thread
From: Eli Zaretskii @ 2017-02-16 16:06 UTC (permalink / raw)
  To: Marko Rauhamaa; +Cc: guile-user

> From: Marko Rauhamaa <marko@pacujo.net>
> Cc: guile-user@gnu.org
> Date: Thu, 16 Feb 2017 09:16:21 +0200
> 
> If I understood it correctly, someone just told us emacs maps illegal
> UTF-8 to another form of illegal UTF-8 and back. That's better in that
> it's bytes to bytes (leaving Unicode out), but it's not immediately
> obvious to me why you have to transform the byte sequence at all.

Because it allows solving all the problems you raise in the rest of
this thread.

> Look at the problem of concatenation. We could have a case where two
> illegal UTF-8 (or UTF-16) snippets are concatenated to get valid UTF-8
> (or UTF-16). That operation fails if you try to translate the snippets
> to strings before concatenation. Such concatenation operations are
> commonplace when dealing with filenames (eg, split(1)).

You assume that Emacs concatenates strings by just splicing its bytes.
But that's a far cry from what Emacs does, precisely to countermand
such problems.  I think David described enough of what's happening to
explain why Emacs is not susceptible to such failures.

These tricks, which all happen seamlessly and transparently, are
exactly why it took Emacs so long to get where it is today.  It takes
many moons to see the problems, analyze them, devise solutions that
don't break, and implement both the 90%-successful heuristics and the
opt-in solutions for the other 10%.  The important point for Guile is
that the solution is there, in Free Software, documented well enough,
and people who understand the implementation and can explain its
subtleties are still here, ready to help.  All it takes is for Guile
to decide it wants to implement something similar.




* Re: guile can't find a chinese named file
  2017-02-16 12:14                                                               ` Marko Rauhamaa
@ 2017-02-16 16:21                                                                 ` Eli Zaretskii
  2017-02-16 16:38                                                                   ` Marko Rauhamaa
  0 siblings, 1 reply; 110+ messages in thread
From: Eli Zaretskii @ 2017-02-16 16:21 UTC (permalink / raw)
  To: Marko Rauhamaa; +Cc: guile-user, dak

> From: Marko Rauhamaa <marko@pacujo.net>
> Date: Thu, 16 Feb 2017 14:14:41 +0200
> Cc: guile-user@gnu.org
> 
> (On the other side of the equation, expressing a filename in Unicode may
> not produce an unambiguous code point sequence... <URL:
> http://unicode.org/faq/normalization.html>)

Why is that a problem?  Unicode generally mandates that equivalent
character (a.k.a. "codepoint") sequences shall be handled the same by
applications, both while processing the text (e.g., searching it etc.)
and when displaying it.




* Re: guile can't find a chinese named file
  2017-02-16 16:06                                                 ` Eli Zaretskii
@ 2017-02-16 16:35                                                   ` Marko Rauhamaa
  2017-02-16 17:41                                                     ` Eli Zaretskii
  2017-02-16 18:30                                                     ` Mike Gran
  0 siblings, 2 replies; 110+ messages in thread
From: Marko Rauhamaa @ 2017-02-16 16:35 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: guile-user

Eli Zaretskii <eliz@gnu.org>:

> You assume that Emacs concatenates strings by just splicing its bytes.
> But that's a far cry from what Emacs does, precisely to countermand
> such problems.

Good to hear. If Guile is to adopt a similar approach, it should pay
attention to these details as well.

> The important point for Guile is that the solution is there, in Free
> Software, documented well enough, and people who understand the
> implementation and can explain its subtleties are still here, ready to
> help. All it takes is for Guile to decide it wants to implement
> something similar.

It would be important for Guile to be a sufficient basis for emacs. On
the other hand, emacs' needs might be far too high for any simple string
type. For example, Guile might treat strings as simple sequences of code
points while emacs might impose some Unicode normalization requirements
or vice versa.

For example, what should

   (string= "Åström" "Åström")

return?

Emacs 25.1 doesn't see the strings as equal. Neither does Firefox.
However, Chrome thinks they are one and the same thing.


Marko




* Re: guile can't find a chinese named file
  2017-02-16 16:21                                                                 ` Eli Zaretskii
@ 2017-02-16 16:38                                                                   ` Marko Rauhamaa
  2017-02-16 17:46                                                                     ` Eli Zaretskii
  0 siblings, 1 reply; 110+ messages in thread
From: Marko Rauhamaa @ 2017-02-16 16:38 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: guile-user, dak

Eli Zaretskii <eliz@gnu.org>:

> Why is that a problem?  Unicode generally mandates that equivalent
> character (a.k.a. "codepoint") sequences shall be handled the same by
> applications, both while processing the text (e.g., searching it etc.)
> and when displaying it.

As I just said in another reply, emacs 25.1 isn't handling them the same
even though it maybe should.

Now, should Guile handle them the same?

(Python thinks they are different.)


Marko




* Re: guile can't find a chinese named file
  2017-02-16 16:35                                                   ` Marko Rauhamaa
@ 2017-02-16 17:41                                                     ` Eli Zaretskii
  2017-02-16 18:30                                                     ` Mike Gran
  1 sibling, 0 replies; 110+ messages in thread
From: Eli Zaretskii @ 2017-02-16 17:41 UTC (permalink / raw)
  To: Marko Rauhamaa; +Cc: guile-user

> From: Marko Rauhamaa <marko@pacujo.net>
> Cc: guile-user@gnu.org
> Date: Thu, 16 Feb 2017 18:35:48 +0200
> 
> Eli Zaretskii <eliz@gnu.org>:
> 
> > You assume that Emacs concatenates strings by just splicing its bytes.
> > But that's a far cry from what Emacs does, precisely to countermand
> > such problems.
> 
> Good to hear. If Guile is to adopt a similar approach, it should pay
> attention to these details as well.

Indeed.

> > The important point for Guile is that the solution is there, in Free
> > Software, documented well enough, and people who understand the
> > implementation and can explain its subtleties are still here, ready to
> > help. All it takes is for Guile to decide it wants to implement
> > something similar.
> 
> It would be important for Guile to be a sufficient basis for emacs.

That's not my point.  My point is that the Emacs model, or some minor
variant thereof, should be a good model for Guile (or any other
environment that seeks to support complex multi-lingual applications),
_regardless_ of whether Guile will ever become the core of the Emacs
Lisp interpreter.  IOW, it's good for Guile itself.

> On the other hand, emacs' needs might be far too high for any simple
> string type. For example, Guile might treat strings as simple
> sequences of code points while emacs might impose some Unicode
> normalization requirements or vice versa.
> 
> For example, what should
> 
>    (string= "Åström" "Åström")
> 
> return?
> 
> Emacs 25.1 doesn't see the strings as equal.

As it should, IMO.  Testing strings for equivalence under canonical or
compatibility decompositions is not the job of string=, it requires a
separate API.  (Emacs provides in ucs-normalize.el the functionality
required for that.)  There are situations where you want the former,
and others where you want the latter.

That's why Unicode normalization is not implemented in Emacs on the
same level as the string data type, and the application needs to
explicitly request normalization in order for it to happen.

In general, string equivalence is in many use cases an
application-level feature (think interactive text searching), and
needs to be language- and locale-sensitive to satisfy users (e.g., it
turns out users of Spanish locales don't consider "ñ" (one character),
to be equivalent to "ñ" (two characters)).
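
To illustrate the distinction between plain comparison and comparison
under canonical equivalence, here is a minimal sketch in Python (not
Emacs Lisp or Guile).  The helper names are hypothetical, and the two
spellings of the name are written with explicit escapes so that the
precomposed and decomposed forms cannot be confused:

```python
import unicodedata

# Two spellings of the same name.
precomposed = "\u00c5str\u00f6m"      # A-ring and o-diaeresis as single code points
decomposed = "A\u030astro\u0308m"     # base letters followed by combining marks

def same_code_points(a, b):
    """Plain code-point comparison, analogous to what string= does."""
    return a == b

def canonically_equivalent(a, b):
    """Comparison after explicit NFC normalization (a separate, opt-in step)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(same_code_points(precomposed, decomposed))        # False
print(canonically_equivalent(precomposed, decomposed))  # True
```

The point of keeping the two functions separate is that the caller
decides which notion of equality applies.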




* Re: guile can't find a chinese named file
  2017-02-16 16:38                                                                   ` Marko Rauhamaa
@ 2017-02-16 17:46                                                                     ` Eli Zaretskii
  2017-02-16 18:38                                                                       ` Marko Rauhamaa
  0 siblings, 1 reply; 110+ messages in thread
From: Eli Zaretskii @ 2017-02-16 17:46 UTC (permalink / raw)
  To: Marko Rauhamaa; +Cc: guile-user, dak

> From: Marko Rauhamaa <marko@pacujo.net>
> Cc: dak@gnu.org,  guile-user@gnu.org
> Date: Thu, 16 Feb 2017 18:38:48 +0200
> 
> Eli Zaretskii <eliz@gnu.org>:
> 
> > Why is that a problem?  Unicode generally mandates that equivalent
> > character (a.k.a. "codepoint") sequences shall be handled the same by
> > applications, both while processing the text (e.g., searching it etc.)
> > and when displaying it.
> 
> As I just said in another reply, emacs 25.1 isn't handling them the same
> even though it maybe should.

Yes, it does -- where that's TRT.  For example, when displaying them.
And sometimes this is a user option; e.g., see character-folding in
Isearch.

> Now, should Guile handle them the same?

IMO, Guile should provide the facilities to handle them the same, and
leave it to higher-level code to use whatever is suitable.

In any case, this is unrelated to how strings are implemented, because
the basic level of string implementation _must_ support binary,
character by character (and byte by byte) comparison.  Otherwise, you
won't be able to compare file names equal, for example, at least on
Unix and Windows (macOS is another matter).




* Re: guile can't find a chinese named file
  2017-02-16 16:35                                                   ` Marko Rauhamaa
  2017-02-16 17:41                                                     ` Eli Zaretskii
@ 2017-02-16 18:30                                                     ` Mike Gran
  2017-02-16 18:48                                                       ` David Kastrup
  1 sibling, 1 reply; 110+ messages in thread
From: Mike Gran @ 2017-02-16 18:30 UTC (permalink / raw)
  To: Marko Rauhamaa, Eli Zaretskii; +Cc: guile-user@gnu.org

On Thursday, February 16, 2017 9:39 AM, Marko Rauhamaa <marko@pacujo.net> wrote:
> Eli Zaretskii <eliz@gnu.org>:
>
>> You assume that Emacs concatenates strings by just splicing its bytes.
>> But that's a far cry from what Emacs does, precisely to countermand
>> such problems.
>
> Good to hear. If Guile is to adopt a similar approach, it should pay
> attention to these details as well.

Guile stores strings as codepoints, and it concatenates and splices
strings in codepoint space.  It never concatenates strings as bytes.




* Re: guile can't find a chinese named file
  2017-02-16 17:46                                                                     ` Eli Zaretskii
@ 2017-02-16 18:38                                                                       ` Marko Rauhamaa
  2017-02-16 18:46                                                                         ` Eli Zaretskii
  0 siblings, 1 reply; 110+ messages in thread
From: Marko Rauhamaa @ 2017-02-16 18:38 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: guile-user, dak

Eli Zaretskii <eliz@gnu.org>:

> In any case, this is unrelated to how strings are implemented, because
> the basic level of string implementation _must_ support binary,
> character by character (and byte by byte) comparison. Otherwise, you
> won't be able to compare file names equal, for example, at least on
> Unix and Windows (macOS is another matter).

Your statement is true only if you want to use character strings when
interfacing the operating system. You could leave character strings to
application libraries for newsreaders, IRC clients etc, and have a
separate byte string data type for the system interface.

Python kinda does it both ways, preferring the string interface, but
duplicating almost all functionality for byte strings as well.

If emacs managed to restore a binary/text unification (and infect Guile
in the process), that would be quite an accomplishment.

Now, if we only could get rid of locales while we are at it...


Marko




* Re: guile can't find a chinese named file
  2017-02-16 18:38                                                                       ` Marko Rauhamaa
@ 2017-02-16 18:46                                                                         ` Eli Zaretskii
  2017-02-16 19:35                                                                           ` Marko Rauhamaa
  0 siblings, 1 reply; 110+ messages in thread
From: Eli Zaretskii @ 2017-02-16 18:46 UTC (permalink / raw)
  To: Marko Rauhamaa; +Cc: guile-user, dak

> From: Marko Rauhamaa <marko@pacujo.net>
> Cc: dak@gnu.org,  guile-user@gnu.org
> Date: Thu, 16 Feb 2017 20:38:31 +0200
> 
> Eli Zaretskii <eliz@gnu.org>:
> 
> > In any case, this is unrelated to how strings are implemented, because
> > the basic level of string implementation _must_ support binary,
> > character by character (and byte by byte) comparison. Otherwise, you
> > won't be able to compare file names equal, for example, at least on
> > Unix and Windows (macOS is another matter).
> 
> Your statement is true only if you want to use character strings when
> interfacing the operating system.

Why, because I mentioned comparison of file names?  That's just one
example that came to my mind within 5 sec of thought; there are many
others.

My point is that there's a place for both types of string comparisons,
and therefore both should be available.  Which means the lowest level
of string implementation should not automatically normalize strings.
(You could also have a separate string variant where normalization
happens automatically, but that has the disadvantage that you need to
decide in advance which variant you want, when you don't necessarily
know enough yet.)

> You could leave character strings to application libraries for
> newsreaders, IRC clients etc, and have a separate byte string data
> type for the system interface.

I don't know what you mean by "application libraries", but if that's
something applications should provide, and Guile shouldn't, then I
disagree: application writers will generally not know enough to
implement this non-trivial functionality.

> If emacs managed to restore a binary/text unification (and infect Guile
> in the process), that would be quite an accomplishment.

I don't understand what "binary/text unification" means, sorry.

> Now, if we only could get rid of locales while we are at it...

Dream on.




* Re: guile can't find a chinese named file
  2017-02-16 18:30                                                     ` Mike Gran
@ 2017-02-16 18:48                                                       ` David Kastrup
  0 siblings, 0 replies; 110+ messages in thread
From: David Kastrup @ 2017-02-16 18:48 UTC (permalink / raw)
  To: guile-user

Mike Gran <spk121@yahoo.com> writes:

> On Thursday, February 16, 2017 9:39 AM, Marko Rauhamaa <marko@pacujo.net> wrote:
>> Eli Zaretskii <eliz@gnu.org>:
>
>>> You assume that Emacs concatenates strings by just splicing its bytes.
>>> But that's a far cry from what Emacs does, precisely to countermand
>>> such problems.
>
>> Good to hear. If Guile is to adopt a similar approach, it should pay
>> attention to these details as well.
>
> Guile stores strings as codepoints, and it concatenates and splices
> strings in codepoint space.  It never concatenates strings as bytes.

Code points are an abstraction while the discussion is about the actual
implementation.

-- 
David Kastrup





* Re: guile can't find a chinese named file
  2017-02-16 18:46                                                                         ` Eli Zaretskii
@ 2017-02-16 19:35                                                                           ` Marko Rauhamaa
  2017-02-16 20:10                                                                             ` Eli Zaretskii
  0 siblings, 1 reply; 110+ messages in thread
From: Marko Rauhamaa @ 2017-02-16 19:35 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: guile-user, dak

Eli Zaretskii <eliz@gnu.org>:

>> From: Marko Rauhamaa <marko@pacujo.net>
>> You could leave character strings to application libraries for
>> newsreaders, IRC clients etc, and have a separate byte string data
>> type for the system interface.
>
> I don't know what you mean by "application libraries", but if that's
> something applications should provide, and Guile shouldn't, then I
> disagree: application writers will generally not know enough to
> implement this non-trivial functionality.

Guile can provide those.

I don't know if I ever had a need for them. In my decades of
programming, I don't remember ever needing anything but byte strings.

>> If emacs managed to restore a binary/text unification (and infect Guile
>> in the process), that would be quite an accomplishment.
>
> I don't understand what "binary/text unification" means, sorry.

I say filenames are byte strings. Guile says they are character strings.
You are saying they are both at once.


Marko




* Re: guile can't find a chinese named file
  2017-02-16 19:35                                                                           ` Marko Rauhamaa
@ 2017-02-16 20:10                                                                             ` Eli Zaretskii
  2017-02-16 20:52                                                                               ` David Kastrup
  0 siblings, 1 reply; 110+ messages in thread
From: Eli Zaretskii @ 2017-02-16 20:10 UTC (permalink / raw)
  To: Marko Rauhamaa; +Cc: guile-user, dak

> From: Marko Rauhamaa <marko@pacujo.net>
> Cc: dak@gnu.org,  guile-user@gnu.org
> Date: Thu, 16 Feb 2017 21:35:12 +0200
> 
> >> If emacs managed to restore a binary/text unification (and infect Guile
> >> in the process), that would be quite an accomplishment.
> >
> > I don't understand what "binary/text unification" means, sorry.
> 
> I say filenames are byte strings. Guile says they are character strings.
> You are saying they are both at once.

Yes, to be viable in real-life situations, Guile needs to support
character strings with occasional embedded raw bytes that cannot be
interpreted as characters.  Which means string implementation needs to
have a special representation for these raw bytes that would allow
lossless round-trip, and at the same time avoid the pitfalls some of
which were mentioned here.




* Re: guile can't find a chinese named file
  2017-02-16 20:10                                                                             ` Eli Zaretskii
@ 2017-02-16 20:52                                                                               ` David Kastrup
  2017-02-16 21:13                                                                                 ` Marko Rauhamaa
  2017-02-17  6:32                                                                                 ` Eli Zaretskii
  0 siblings, 2 replies; 110+ messages in thread
From: David Kastrup @ 2017-02-16 20:52 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: guile-user

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Marko Rauhamaa <marko@pacujo.net>
>> Cc: dak@gnu.org,  guile-user@gnu.org
>> Date: Thu, 16 Feb 2017 21:35:12 +0200
>> 
>> >> If emacs managed to restore a binary/text unification (and infect Guile
>> >> in the process), that would be quite an accomplishment.
>> >
>> > I don't understand what "binary/text unification" means, sorry.
>> 
>> I say filenames are byte strings. Guile says they are character strings.
>> You are saying they are both at once.
>
> Yes, to be viable in real-life situations, Guile needs to support
> character strings with occasional embedded raw bytes that cannot be
> interpreted as characters.

They can be interpreted as "characters", just not inside the _Unicode_
character range.  Raw bytes 0x00 to 0xff could be assigned character
codes -256 to -1 (when decoding UTF-8, only "raw bytes" 0x80 to 0xff
will occur since 0x00 to 0x7f is always represented as its own Unicode
code point).  That would make it easy to do a blanket check for invalid
sequences.
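
A minimal sketch of this numbering, written in Python purely for
illustration.  The function names are hypothetical, and Python's
"surrogateescape" error handler is used only as a convenient way to
find the bytes that are not valid UTF-8; it is not part of the proposal
itself:

```python
def decode_with_raw_bytes(data):
    """Decode UTF-8 bytes into a list of integer codes.

    Valid sequences become ordinary code points (>= 0).  Each byte that
    is not part of valid UTF-8 becomes a negative code: byte value - 256.
    """
    codes = []
    # surrogateescape maps an undecodable byte 0xNN to U+DCNN.
    for ch in data.decode("utf-8", errors="surrogateescape"):
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            codes.append((cp - 0xDC00) - 256)  # raw 0x80..0xff -> -128..-1
        else:
            codes.append(cp)
    return codes

def encode_with_raw_bytes(codes):
    """Inverse of decode_with_raw_bytes: negative codes become raw bytes."""
    out = bytearray()
    for c in codes:
        if c < 0:
            out.append(c + 256)
        else:
            out += chr(c).encode("utf-8")
    return bytes(out)

def has_invalid_input(codes):
    """The blanket check: any negative code marks an undecodable byte."""
    return any(c < 0 for c in codes)

codes = decode_with_raw_bytes(b"ab\xddc")
print(codes)                         # [97, 98, -35, 99]
print(has_invalid_input(codes))      # True
print(encode_with_raw_bytes(codes))  # b'ab\xddc'
```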

> Which means string implementation needs to have a special
> representation for these raw bytes that would allow lossless
> round-trip, and at the same time avoid the pitfalls some of which were
> mentioned here.
>

-- 
David Kastrup




* Re: guile can't find a chinese named file
  2017-02-16 20:52                                                                               ` David Kastrup
@ 2017-02-16 21:13                                                                                 ` Marko Rauhamaa
  2017-02-17  6:44                                                                                   ` Eli Zaretskii
  2017-02-17  6:32                                                                                 ` Eli Zaretskii
  1 sibling, 1 reply; 110+ messages in thread
From: Marko Rauhamaa @ 2017-02-16 21:13 UTC (permalink / raw)
  To: David Kastrup; +Cc: guile-user

David Kastrup <dak@gnu.org>:
> Eli Zaretskii <eliz@gnu.org> writes:
>> Yes, to be viable in real-life situations, Guile needs to support
>> character strings with occasional embedded raw bytes that cannot be
>> interpreted as characters.
>
> They can be interpreted as "characters", just not inside the _Unicode_
> character range. Raw bytes 0x00 to 0xff could be assigned character
> codes -256 to -1 (when decoding UTF-8, only "raw bytes" 0x80 to 0xff
> will occur since 0x00 to 0x7f is always represented as its own Unicode
> code point). That would make it easy to do a blanket check for invalid
> sequences.

Python uses the surrogate hole in the middle of the Unicode range to
represent such stray bytes, but only when naming files. Unlike Guile,
Python character strings permit surrogate code points for arbitrary
purposes.

Internally, CPython (the principal implementation) has Latin-1, UCS-2
and UCS-4 strings to optimize memory use while maintaining fixed-width
character representation.
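
As a small sketch of the mechanism, the "surrogateescape" error
handler can be applied by hand to a made-up file name containing a
stray byte.  The name below is hypothetical; os.fsdecode() and
os.fsencode() apply the same handler on POSIX systems, but they depend
on the locale, so explicit calls are used here:

```python
raw_name = b"filename_\xdd.scm"   # hypothetical name; 0xDD is not valid UTF-8 here

# Decoding turns the stray byte 0xDD into the lone surrogate U+DCDD.
as_text = raw_name.decode("utf-8", errors="surrogateescape")
print(ascii(as_text))             # 'filename_\udcdd.scm'

# Encoding with the same handler restores the original bytes.
round_trip = as_text.encode("utf-8", errors="surrogateescape")
print(round_trip == raw_name)     # True

# A strict encode refuses the lone surrogate.
try:
    as_text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
print(strict_ok)                  # False
```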


Marko




* Re: guile can't find a chinese named file
  2017-02-16 20:52                                                                               ` David Kastrup
  2017-02-16 21:13                                                                                 ` Marko Rauhamaa
@ 2017-02-17  6:32                                                                                 ` Eli Zaretskii
  1 sibling, 0 replies; 110+ messages in thread
From: Eli Zaretskii @ 2017-02-17  6:32 UTC (permalink / raw)
  To: David Kastrup; +Cc: guile-user

> From: David Kastrup <dak@gnu.org>
> Cc: Marko Rauhamaa <marko@pacujo.net>,  guile-user@gnu.org
> Date: Thu, 16 Feb 2017 21:52:48 +0100
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> > Yes, to be viable in real-life situations, Guile needs to support
> > character strings with occasional embedded raw bytes that cannot be
> > interpreted as characters.
> 
> They can be interpreted as "characters", just not inside the _Unicode_
> character range.

Yes.  Emacs considers them to belong to some imaginary character set,
called "eight-bit".  My point is, those are not human-readable
characters.




* Re: guile can't find a chinese named file
  2017-02-16 21:13                                                                                 ` Marko Rauhamaa
@ 2017-02-17  6:44                                                                                   ` Eli Zaretskii
  2017-02-17  8:46                                                                                     ` Marko Rauhamaa
  0 siblings, 1 reply; 110+ messages in thread
From: Eli Zaretskii @ 2017-02-17  6:44 UTC (permalink / raw)
  To: Marko Rauhamaa; +Cc: guile-user, dak

> From: Marko Rauhamaa <marko@pacujo.net>
> Cc: Eli Zaretskii <eliz@gnu.org>,  guile-user@gnu.org
> Date: Thu, 16 Feb 2017 23:13:35 +0200
> 
> Python uses the surrogate hole in the middle of the Unicode range to
> represent such stray bytes, but only when naming files.

IMO, it makes no sense to limit this to file names, because (a) you
don't always know on all levels of the code which string is a file
name or a part thereof; and (b) because situations where non-ASCII
bytes cannot be properly decoded into Unicode happen with text that is
not file names, and users still expect Emacs to silently produce the
same byte stream on round-trip operations, e.g., when copying text
from one file to another.

> Internally, CPython (the principal implementation) has Latin-1, UCS-2
> and UCS-4 strings to optimize memory use while maintaining fixed-width
> character representation.

Emacs uses a superset of UTF-8 internally.  We have found that the
variable-length encoding doesn't slow down Emacs enough to worry
about, because the need to go back in a string or buffer text is rare.
It wasn't worth the complication of maintaining different
representations, with the corresponding risk of bugs (because it is
very easy in Emacs to gain access to the internal representation of
text).




* Re: guile can't find a chinese named file
  2017-02-17  6:44                                                                                   ` Eli Zaretskii
@ 2017-02-17  8:46                                                                                     ` Marko Rauhamaa
  2017-02-17  9:04                                                                                       ` David Kastrup
  2017-02-17  9:07                                                                                       ` Eli Zaretskii
  0 siblings, 2 replies; 110+ messages in thread
From: Marko Rauhamaa @ 2017-02-17  8:46 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: guile-user, dak

Eli Zaretskii <eliz@gnu.org>:
>> From: Marko Rauhamaa <marko@pacujo.net>
>> Python uses the surrogate hole in the middle of the Unicode range to
>> represent such stray bytes, but only when naming files.
>
> IMO, it makes no sense to limit this to file names, because (a) you
> don't always know on all levels of the code which string is a file
> name or a part thereof; and (b) because situations where non-ASCII
> bytes cannot be properly decoded into Unicode happen with text that is
> not file names, and users still expect Emacs to silently produce the
> same byte stream on round-trip operations, e.g., when copying text
> from one file to another.

Python just barfs:

   $ python3 -c "import sys; print(sys.stdin.read(30))" <<<$'\xdd'
   Traceback (most recent call last):
     File "<string>", line 1, in <module>
     File "/usr/lib64/python3.5/codecs.py", line 321, in decode
       (result, consumed) = self._buffer_decode(data, self.errors, final)
   UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdd in position \
   0: invalid continuation byte

The situation is a bit difficult to recover from.
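
For comparison, a minimal sketch of the same read with a different
error handler.  It uses in-memory streams instead of stdin, and the
helper name copy_text is hypothetical.  The strict handler is what
produced the traceback above; "surrogateescape" lets the stray byte
pass through unchanged:

```python
import io

def copy_text(src_bytes, errors):
    """Read bytes through a UTF-8 text layer and write them back out."""
    reader = io.TextIOWrapper(io.BytesIO(src_bytes), encoding="utf-8",
                              errors=errors, newline="")
    sink = io.BytesIO()
    writer = io.TextIOWrapper(sink, encoding="utf-8",
                              errors=errors, newline="")
    writer.write(reader.read())
    writer.flush()
    return sink.getvalue()

# The stray byte survives the text round trip.
print(copy_text(b"\xdd\n", "surrogateescape"))   # b'\xdd\n'

# The strict handler raises on the same input.
try:
    copy_text(b"\xdd\n", "strict")
except UnicodeDecodeError as exc:
    print("strict failed:", exc.reason)
```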


Marko




* Re: guile can't find a chinese named file
  2017-02-17  8:46                                                                                     ` Marko Rauhamaa
@ 2017-02-17  9:04                                                                                       ` David Kastrup
  2017-02-17  9:57                                                                                         ` tomas
  2017-02-17  9:07                                                                                       ` Eli Zaretskii
  1 sibling, 1 reply; 110+ messages in thread
From: David Kastrup @ 2017-02-17  9:04 UTC (permalink / raw)
  To: Marko Rauhamaa; +Cc: guile-user

Marko Rauhamaa <marko@pacujo.net> writes:

> Eli Zaretskii <eliz@gnu.org>:
>>> From: Marko Rauhamaa <marko@pacujo.net>
>>> Python uses the surrogate hole in the middle of the Unicode range to
>>> represent such stray bytes, but only when naming files.
>>
>> IMO, it makes no sense to limit this to file names, because (a) you
>> don't always know on all levels of the code which string is a file
>> name or a part thereof; and (b) because situations where non-ASCII
>> bytes cannot be properly decoded into Unicode happen with text that is
>> not file names, and users still expect Emacs to silently produce the
>> same byte stream on round-trip operations, e.g., when copying text
>> from one file to another.
>
> Python just barfs:
>
>    $ python3 -c "import sys; print(sys.stdin.read(30))" <<<$'\xdd'
>    Traceback (most recent call last):
>      File "<string>", line 1, in <module>
>      File "/usr/lib64/python3.5/codecs.py", line 321, in decode
>        (result, consumed) = self._buffer_decode(data, self.errors, final)
>    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdd in position \
>    0: invalid continuation byte
>
> The situation is a bit difficult to recover from.

You can load an executable into an Emacs buffer and do a
search-and-replace on UTF-8 strings, then save again.  Assuming that the
replacement has been by a string of the same length and that the string
does not appear as part of symbols for the linker, the executable will
likely work fine afterwards.

I don't think that XEmacs (another Emacs implementation that migrated a
lot more leisurely to multibyte encodings) would stand up to the same
sort of abuse.  And probably quite a few text editors would throw in the
towel as well.  But once you view Emacs as a text processing platform,
it's a reasonable conclusion that failure is not a good option.

For a general-purpose programming language like Python or Guile, I
should think it should be at least as important that strings can
represent input accurately without having to digress outside of string
processing and use stuff like byte arrays.

-- 
David Kastrup




* Re: guile can't find a chinese named file
  2017-02-17  8:46                                                                                     ` Marko Rauhamaa
  2017-02-17  9:04                                                                                       ` David Kastrup
@ 2017-02-17  9:07                                                                                       ` Eli Zaretskii
  1 sibling, 0 replies; 110+ messages in thread
From: Eli Zaretskii @ 2017-02-17  9:07 UTC (permalink / raw)
  To: Marko Rauhamaa; +Cc: guile-user, dak

> From: Marko Rauhamaa <marko@pacujo.net>
> Cc: dak@gnu.org,  guile-user@gnu.org
> Date: Fri, 17 Feb 2017 10:46:32 +0200
> 
> > IMO, it makes no sense to limit this to file names, because (a) you
> > don't always know on all levels of the code which string is a file
> > name or a part thereof; and (b) because situations where non-ASCII
> > bytes cannot be properly decoded into Unicode happen with text that is
> > not file names, and users still expect Emacs to silently produce the
> > same byte stream on round-trip operations, e.g., when copying text
> > from one file to another.
> 
> Python just barfs:
> 
>    $ python3 -c "import sys; print(sys.stdin.read(30))" <<<$'\xdd'
>    Traceback (most recent call last):
>      File "<string>", line 1, in <module>
>      File "/usr/lib64/python3.5/codecs.py", line 321, in decode
>        (result, consumed) = self._buffer_decode(data, self.errors, final)
>    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdd in position \
>    0: invalid continuation byte

Which is a bad idea, IME: users won't appreciate their program barfing
when all they want is to copy a chunk of text from one place to
another, without changing anything in it.

> The situation is a bit difficult to recover from.

If you assume valid UTF-8 everywhere, certainly.  The world is more
complex than that.  Emacs is known to be used for, e.g., searching
binary executable files for text patterns; if it required the user to
say in advance that the file was binary, so that Emacs could handle it
as a byte array, that would be a major annoyance, and worse: it would
prevent the users from searching valid non-ASCII text in such a binary
file.  So Emacs allows treating binary files as text files with a
certain encoding that have some raw bytes which don't fit that
encoding.  IMO, Guile will do its users a service if it provides
similar features, because applications with similar needs are entirely
reasonable in today's world.
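
A rough sketch of that kind of use, in Python rather than Emacs Lisp:
treat arbitrary bytes as UTF-8 text with some raw bytes mixed in,
replace a non-ASCII string, and write the rest back untouched.  The
function name and the sample data are made up for illustration, and
this says nothing about how Emacs implements it:

```python
def replace_in_binary(data, old, new):
    """Replace text `old` with `new` inside arbitrary bytes.

    Bytes that are not valid UTF-8 are carried through unchanged via the
    surrogateescape error handler.
    """
    if len(old.encode("utf-8")) != len(new.encode("utf-8")):
        raise ValueError("replacement must have the same encoded length")
    text = data.decode("utf-8", errors="surrogateescape")
    return text.replace(old, new).encode("utf-8", errors="surrogateescape")

# Made-up "binary" blob: invalid bytes around a valid UTF-8 string.
blob = (b"\x7fELF\xff\xfe\x00"
        + "gr\u00fc\u00df dich".encode("utf-8")
        + b"\x00\xc3")
patched = replace_in_binary(blob, "gr\u00fc\u00df", "Gr\u00fc\u00df")

print(len(patched) == len(blob))                  # True
print(patched.startswith(b"\x7fELF\xff\xfe\x00"))  # True
```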




* Re: guile can't find a chinese named file
  2017-02-17  9:04                                                                                       ` David Kastrup
@ 2017-02-17  9:57                                                                                         ` tomas
  0 siblings, 0 replies; 110+ messages in thread
From: tomas @ 2017-02-17  9:57 UTC (permalink / raw)
  To: guile-user

On Fri, Feb 17, 2017 at 10:04:29AM +0100, David Kastrup wrote:

[...]

> You can load an executable into an Emacs buffer and do a
> search-and-replace on UTF-8 strings, then save again.  Assuming that the
> replacement has been by a string of the same length and that the string
> does not appear as part of symbols for the linker, the executable will
> likely work fine afterwards.

:-)

I gathered that much, especially after this long and enlightening thread.
But thanks for driving that home in such a compact yet eloquent way.

I'll keep that around, for delight and preference

thanks
-- t




* Re: guile can't find a chinese named file
  2017-02-15 10:15                               ` Chris Vine
  2017-02-15 11:48                                 ` tomas
@ 2017-02-26 20:52                                 ` Andy Wingo
  1 sibling, 0 replies; 110+ messages in thread
From: Andy Wingo @ 2017-02-26 20:52 UTC (permalink / raw)
  To: Chris Vine; +Cc: guile-user

Hi,

On Wed 15 Feb 2017 11:15, Chris Vine <chris@cvine.freeserve.co.uk> writes:

> I would prefer guile to make the filename encoding a fluid.  It wouldn't
> deal with files mounted with mixed encodings, but it would cater for
> everything else.

Sounds like a good idea.

Andy, who is trying to pick actionable things from this thread




* Re: guile can't find a chinese named file
  2017-02-15 17:07                                     ` Eli Zaretskii
@ 2017-02-26 20:58                                       ` Andy Wingo
  2017-02-27 16:02                                         ` Eli Zaretskii
  0 siblings, 1 reply; 110+ messages in thread
From: Andy Wingo @ 2017-02-26 20:58 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: guile-user

Hi,

On Wed 15 Feb 2017 18:07, Eli Zaretskii <eliz@gnu.org> writes:

> the [Emacs] MS-Windows port pretends towards Emacs internals that file
> names are encoded in UTF-8, and shadows relevant system APIs that
> accept or return file names, like fopen, opendir/readdir, stat,
> etc. with its own versions that convert UTF-8 to and from UTF-16
> before calling the real OS APIs.
>
> Once again, just use that experience, and maybe even some
> infrastructure code.

FWIW we are up for good suggestions.  It's clear that file names (and
command line arguments and environment variables) aren't handled ideally
in Guile as they aren't fundamentally strings of characters in any
particular encoding, and hence this class of bug.

Andy




* Re: guile can't find a chinese named file
  2017-02-14 23:58                       ` David Kastrup
  2017-02-15 10:12                         ` tomas
@ 2017-02-26 21:20                         ` Andy Wingo
  2017-02-27  9:10                           ` David Kastrup
  2017-02-27 16:07                           ` Eli Zaretskii
  1 sibling, 2 replies; 110+ messages in thread
From: Andy Wingo @ 2017-02-26 21:20 UTC (permalink / raw)
  To: guile-user

Hello,

I feel the need to correct points in this mail for the benefit of
guile-user.  No reply is needed.

On Wed 15 Feb 2017 00:58, David Kastrup <dak@gnu.org> writes:

> Mike Gran <spk121@yahoo.com> writes:
>
>> But, for what it is worth, the Latin-1/UCS-32 design decision came
>> from a couple of conflicting requirements.  The switch happened in the
>> 1.9.x series.
>>
>> There were several examples of legacy C code using Guile for an
>> extension language that accessed the bytes of a string directly, using
>> SCM_STRING_CHARS or scm_i_string_chars.  To keep from breaking legacy
>> code, we needed to retain this (then already deprecated) capability
>> to have C programs access 8-bit-locale string internals directly.
>
> But if you don't know whether the strings are Latin-1 or UCS-32, that's
> sort of academical.

Not at all.  Legacy programs don't use codepoints >255.  For UTF-32,
attempting to get the string data would throw an exception.  The
SCM_STRING_CHARS hack was a good trade-off.

> The problem is that Guile is _constantly_ required to recode strings it
> is processing.  And to add insult to injury, it cannot do this without
> data loss when its string encoding assumptions are wrong.

In Scheme, strings are sequences of characters.  Encoding and decoding
is only needed when going to and from bytes.  Guile supports a finite
number of encodings, so in general some encoding/decoding will always be
needed.  The specific encoding may change over time.

> PostScript files are usually encoded in Latin-1 with occasional UCS-16
> passages.  Reading and writing and copying such files byte-correctly
> while trying to actually parse their contents is not feasible with
> Guile.

Works perfectly well.  The web server for example reads the request as
Latin-1 and the body as something else.  Just re-set the port encoding
and there you go.
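
As a rough analogue in Python (not Guile's port API), the same idea is
to decode the header lines as Latin-1 and then decode the rest of the
same byte stream with whatever encoding the body uses.  The function
name and the sample message are hypothetical:

```python
import io

def read_message(stream, body_encoding):
    """Read header lines as Latin-1, then the remaining bytes as body_encoding."""
    headers = []
    while True:
        line = stream.readline()
        if line in (b"\r\n", b"\n", b""):
            break  # a blank line (or end of input) ends the header section
        headers.append(line.rstrip(b"\r\n").decode("latin-1"))
    body = stream.read().decode(body_encoding)
    return headers, body

raw = (b"Content-Type: text/plain; charset=utf-8\r\n\r\n"
       + "\u540d\u5b57".encode("utf-8"))
headers, body = read_message(io.BytesIO(raw), "utf-8")
print(headers)               # ['Content-Type: text/plain; charset=utf-8']
print(body == "\u540d\u5b57")  # True
```

Latin-1 is a safe choice for the header part because every byte value
decodes to some character, so that step cannot fail.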

>> I still maintain that this design decision was a good one based on the
>> simplicity of implementation.
>
> As I said: the problem is not the chosen internal representation.  The
> problem is that there is no API to access it, and it does not even map
> to string ports.

String ports have nothing to do with the discussion AFAIU.  (Ports in
Guile are sequences of bytes also.  They may be accessed using textual
interfaces as well.  Therefore a string port must have an associated
encoding, to read/write the bytes.  But no error is possible for textual
I/O with the default UTF-8 encoding as all characters are representable.
Encoding to UTF-8 is fast and space-efficient.)

Andy




* Re: guile can't find a chinese named file
  2017-02-14 22:26                     ` Ludovic Courtès
@ 2017-02-26 21:23                       ` Andy Wingo
  0 siblings, 0 replies; 110+ messages in thread
From: Andy Wingo @ 2017-02-26 21:23 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: guile-user

On Tue 14 Feb 2017 23:26, ludo@gnu.org (Ludovic Courtès) writes:

> There were discussions to move to UTF-8 internally in 2.2.  As Mike
> explained, that was not really an option in 2.0 mostly due to the
> requirement to support O(1) random access.

AFAIU this requirement was relaxed in R7RS.  I think we could revisit
it; there are tradeoffs but the trade winds could blow due UTF-8-wards
in the end.  My memory of the state of things was that Mark was going to
have a look at it but he is affected by a global shortage of round tuits
:)

Andy




* Re: guile can't find a chinese named file
  2017-02-14 19:58             ` Linas Vepstas
@ 2017-02-26 21:33               ` Andy Wingo
  0 siblings, 0 replies; 110+ messages in thread
From: Andy Wingo @ 2017-02-26 21:33 UTC (permalink / raw)
  To: Linas Vepstas; +Cc: Guile User

On Tue 14 Feb 2017 20:58, Linas Vepstas <linasvepstas@gmail.com> writes:

> Guile's mistake is that it does lots of pointless conversions from utf8 strings
> to wide-char arrays, and back, which is a) a cpu suck, and b) a breeding
> ground for bugs.   The current 2.1 guile, in git as of a few weeks ago, has
> multiple utf8 handling bugs.

What bugs are these?  I just found 25397 which I think was initially a
misunderstanding on your part, then you added more expected behaviors
that were not specified in the manual, and vaguely hint at more bugs;
perplexing!  Please be more direct when reporting bugs (i.e. bug-guile
etc).  Thanks in advance.

Andy




* Re: guile can't find a chinese named file
  2017-02-26 21:20                         ` Andy Wingo
@ 2017-02-27  9:10                           ` David Kastrup
  2017-02-27 11:02                             ` Andy Wingo
  2017-02-27 16:07                           ` Eli Zaretskii
  1 sibling, 1 reply; 110+ messages in thread
From: David Kastrup @ 2017-02-27  9:10 UTC (permalink / raw)
  To: guile-user

Andy Wingo <wingo@pobox.com> writes:

> Hello,
>
> I feel the need to correct points in this mail for the benefit of
> guile-user.  No reply is needed.
>
> On Wed 15 Feb 2017 00:58, David Kastrup <dak@gnu.org> writes:
>
>> Mike Gran <spk121@yahoo.com> writes:
>>
>>> But, for what it is worth, the Latin-1/UCS-32 design decision came
>>> from a couple of conflicting requirements.  The switch happened in the
>>> 1.9.x series.
>>>
>>> There was several examples of legacy C code using Guile for an
>>> extension language that accessed the bytes of a string directly, using
>>>
>>> SCM_STRING_CHARS or scm_i_string_chars.  To keep from breaking legacy
>>> code, we needed to retain the capability to use this (then already
>>> deprecated) capability to have C programs access 8-bit-locale string
>>> internals directly.
>>
>> But if you don't know whether the strings are Latin-1 or UCS-32, that's
>> sort of academical.
>
> Not at all.  Legacy programs don't use codepoints >255.

Sort of a moot point when Guile makes the decision to interpret external
files with codepoints >255.  Not all data processed by a "legacy
program" originates from inside the program.

>> The problem is that Guile is _constantly_ required to recode strings
>> it is processing.  And to add insult to injury, it cannot do this
>> without data loss when its string encoding assumptions are wrong.
>
> In Scheme, strings are sequences of characters.  Encoding and decoding
> is only needed when going to and from bytes.

A string port is strictly passing characters to characters completely
inside of Guile and its data structures and yet it needs to encode and
decode from Latin-1/UCS-32 to UTF-8.  A string port is _explicitly_ not
a binary stream (there are special binary ports for that) but a
character sequence and yet Guile is encoding and decoding for working
with its own internal data.

And the string API contains only scm_from_utf8_string (which always
requires reencoding) for accessing the whole character set.  It isn't
named scm_decode_utf8_bytestream: its target conceptually is a _string_,
yet it is expensive to pass into Guile and back out and there is no
cheaper or more transparent mechanism available.

>> PostScript files are usually encoded in Latin-1 with occasional UCS-16
>> passages.  Reading and writing and copying such files byte-correctly
>> while trying to actually parse their contents is not feasible with
>> Guile.
>
> Works perfectly well.  The web server for example reads the request as
> Latin-1 and the body as something else.  Just re-set the port encoding
> and there you go.

Reading and writing and copying cannot always afford to _parse_ and
switch encodings based on the content.  It needs to work even when you
don't do that.

>> As I said: the problem is not the chosen internal representation.
>> The problem is that there is no API to access it, and it does not
>> even map to string ports.
>
> String ports have nothing to do with the discussion AFAIU.  (Ports in
> Guile are sequences of bytes also.

Which is exactly the problem.

> They may be accessed using textual interfaces as well.

They can _only_ be accessed using textual interfaces.  They are
character-in/character-out.

> Therefore a string port must have an associated encoding, to
> read/write the bytes.

Why does a pure character-in/character-out structure need an associated
encoding?  The semi-equivalent in Emacs are buffers (which have a
manipulation point where you can write/read but are also random-access,
so it's sort of a superset).  Buffers have an _internal_ encoding but it
isn't exposed and it is identical to strings' internal encodings.

In contrast, the internal encoding of Guile string ports _is_ exposed
since its positioning uses byte offsets rather than character offsets
and thus is not compatible with string addressing.

Emacs got rid of this catastrophic user interface mistake (responsible
for the last major wave of migration to its competitor XEmacs) in Emacs
20.3 or 20.4.  Buffers are only ever addressed using character positions
from Emacs Lisp.

It's just painful to see Guile go through all of the expensive mistakes
Emacs made 15 or 20 years ago, just at a tenth of the speed since
getting encodings wrong was seen as more of a deal-breaker with Emacs.

> But no error is possible for textual I/O with the default UTF-8
> encoding as all characters are representable.

But all bytes aren't.

> Encoding to UTF-8 is fast and space-efficient.)

There is a reason that LilyPond on Guile-2.0 runs slower by a factor
of 5 than on Guile-1.8, and the large costs associated with constant
string reencoding are definitely contributing.

-- 
David Kastrup





* Re: guile can't find a chinese named file
  2017-02-27  9:10                           ` David Kastrup
@ 2017-02-27 11:02                             ` Andy Wingo
  2017-02-27 12:09                               ` David Kastrup
  0 siblings, 1 reply; 110+ messages in thread
From: Andy Wingo @ 2017-02-27 11:02 UTC (permalink / raw)
  To: guile-user

Hello,

On Mon 27 Feb 2017 10:10, David Kastrup <dak@gnu.org> writes:

> Andy Wingo <wingo@pobox.com> writes:
>
>> Legacy programs don't use codepoints >255.
>
> Sort of a moot point when Guile makes the decision to interpret external
> files with codepoints >255.  Not every data processed by a "legacy
> program" originates from inside the program.

Not a moot point at all.  If you want to decode/encode characters
to/from ports, you have to call Guile's setlocale function; that's a
choice you can make.  In Guile 1.8 and earlier you would just get
ISO-8859-1, one character per byte, regardless, so no significant change here.
If you would prefer to continue to use this encoding with every port in
your program, you can do that.

>> In Scheme, strings are sequences of characters.  Encoding and decoding
>> is only needed when going to and from bytes.
>
> A string port is strictly passing characters to characters completely
> inside of Guile

This is an implementation concern.  May I remind you and the list that
we have kindly asked you to not post to guile-devel because
implementation discussions with you are not productive.  I'm not
interested in having similar discussions, only on another list.  Thanks.

>>> PostScript files are usually encoded in Latin-1 with occasional UCS-16
>>> passages.  Reading and writing and copying such files byte-correctly
>>> while trying to actually parse their contents is not feasible with
>>> Guile.
>>
>> Works perfectly well.  The web server for example reads the request as
>> Latin-1 and the body as something else.  Just re-set the port encoding
>> and there you go.
>
> Reading and writing and copying cannot always afford to _parse_ and
> switch encodings based on the content.  It needs to work even when you
> don't do that.

If you would like to read just the bytes and parse yourself, you can do
that too.  Re-setting the encoding while parsing from a port can often
be more efficient though, as you don't have to read all of the data and
then parse it all; you can parse incrementally.

>> String ports have nothing to do with the discussion AFAIU.  (Ports in
>> Guile are sequences of bytes also.  They may be accessed using
>> textual interfaces as well.
>
> They can _only_ be accessed using textual interfaces.  They are
> character-in/character-out.

You misunderstand what Guile ports are.  I seriously invite you to read
the fine manual, specifically the first four subsections of this node:

  https://www.gnu.org/software/guile/docs/master/guile.html/Input-and-Output.html

Thanks,

Andy




* Re: guile can't find a chinese named file
  2017-02-27 11:02                             ` Andy Wingo
@ 2017-02-27 12:09                               ` David Kastrup
  2017-02-27 12:33                                 ` Andy Wingo
  0 siblings, 1 reply; 110+ messages in thread
From: David Kastrup @ 2017-02-27 12:09 UTC (permalink / raw)
  To: guile-user

Andy Wingo <wingo@pobox.com> writes:

> On Mon 27 Feb 2017 10:10, David Kastrup <dak@gnu.org> writes:
>
>>> String ports have nothing to do with the discussion AFAIU.  (Ports in
>>> Guile are sequences of bytes also.  They may be accessed using
>>> textual interfaces as well.
>>
>> They can _only_ be accessed using textual interfaces.  They are
>> character-in/character-out.
>
> You misunderstand what Guile ports are.

The topic was "string ports".  String ports and soft ports operate on
characters and strings in Guile's encoding which makes a separate
reencoding pass both error-prone and inefficient, in particular
since one use of string ports is reading sexps, which involves frequent
peek/unget operations for which Guile reencodes characters only to
decode them again on the next read.

> I seriously invite you to read the fine manual, specifically the first
> four subsections of this node:
>
>   https://www.gnu.org/software/guile/docs/master/guile.html/Input-and-Output.html

The number of errors I reported with regard to Guile's string and port
handling (and which ultimately got fixed) as well as the fact that I did
all of the low-level work for migrating probably the largest existing
Guile-based application from Guile-1.8 to Guile-2.0 makes it somewhat
unlikely that my thoughts are merely the outcome of incompetency.

Where in

(with-output-to-string
  (lambda ()
    (format #t "~s\n" (make-list 42))))

do you see an encoding inherent?  From the Scheme side of things, it is
characters and strings which are involved here exclusively.  Reencoding
into an external coding system does not make sense here.  The situation
is similar for soft ports.

-- 
David Kastrup





* Re: guile can't find a chinese named file
  2017-02-27 12:09                               ` David Kastrup
@ 2017-02-27 12:33                                 ` Andy Wingo
  0 siblings, 0 replies; 110+ messages in thread
From: Andy Wingo @ 2017-02-27 12:33 UTC (permalink / raw)
  To: David Kastrup; +Cc: guile-user

On Mon 27 Feb 2017 13:09, David Kastrup <dak@gnu.org> writes:

> Andy Wingo <wingo@pobox.com> writes:
>> I seriously invite you to read the fine manual, specifically the first
>> four subsections of this node:
>>
>>   https://www.gnu.org/software/guile/docs/master/guile.html/Input-and-Output.html
>
> ...somewhat unlikely that my thoughts are merely the outcome of
> incompetency.

Me suggesting that you read the manual is not suggesting you are
incompetent.  I would appreciate you not reading more into my words than
what I wrote.  I merely invite you to read the manual, especially since
it has been updated in 2.2.  It explains clearly what Guile ports are and
are not.  This part of the manual was less clear in the past.

You seem to be specifically under the misconception that binary I/O on
string ports is impossible or not useful.  This is a misunderstanding on
your part.

Andy




* Re: guile can't find a chinese named file
  2017-02-26 20:58                                       ` Andy Wingo
@ 2017-02-27 16:02                                         ` Eli Zaretskii
  0 siblings, 0 replies; 110+ messages in thread
From: Eli Zaretskii @ 2017-02-27 16:02 UTC (permalink / raw)
  To: Andy Wingo; +Cc: guile-user

> From: Andy Wingo <wingo@pobox.com>
> Cc: Chris Vine <chris@cvine.freeserve.co.uk>,  guile-user@gnu.org
> Date: Sun, 26 Feb 2017 21:58:00 +0100
> 
> On Wed 15 Feb 2017 18:07, Eli Zaretskii <eliz@gnu.org> writes:
> 
> > the [Emacs] MS-Windows port pretends towards Emacs internals that file
> > names are encoded in UTF-8, and shadows relevant system APIs that
> > accept or return file names, like fopen, opendir/readdir, stat,
> > etc. with its own versions that convert UTF-8 to and from UTF-16
> > before calling the real OS APIs.
> >
> > Once again, just use that experience, and maybe even some
> > infrastructure code.
> 
> FWIW we are up for good suggestions.  It's clear that file names (and
> command line arguments and environment variables) aren't handled ideally
> in Guile as they aren't fundamentally strings of characters in any
> particular encoding, and hence this class of bug.

Let me know what kind of suggestions would help.  E.g., if you need a
more detailed description of how Emacs goes about these issues, I can
do that (guile-devel is probably a better place for that).




* Re: guile can't find a chinese named file
  2017-02-26 21:20                         ` Andy Wingo
  2017-02-27  9:10                           ` David Kastrup
@ 2017-02-27 16:07                           ` Eli Zaretskii
  2017-02-27 19:29                             ` Andy Wingo
  1 sibling, 1 reply; 110+ messages in thread
From: Eli Zaretskii @ 2017-02-27 16:07 UTC (permalink / raw)
  To: Andy Wingo; +Cc: guile-user

> From: Andy Wingo <wingo@pobox.com>
> Date: Sun, 26 Feb 2017 22:20:31 +0100
> 
> In Scheme, strings are sequences of characters.  Encoding and decoding
> is only needed when going to and from bytes.  Guile supports a finite
> number of encodings, so in general some encoding/decoding will always be
> needed.  The specific encoding may change over time.

The lesson of Emacs development is that there's a need for
"characters" that represent raw bytes which cannot be decoded into the
internal representation, for whatever reasons.  These special
"characters" need to be representable in strings, among "normal"
recognizable characters (and thus distinguishable from the latter
kind), and they need to be converted back to their single-byte form
when the string is output to the external world.  An implementation of
text that doesn't include these features will always fail to support
some important use cases.
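
To make the failure mode concrete, here is a small sketch (an
illustration only, not Emacs or Guile internals) of a byte sequence
that does not survive a decode/encode round trip through UTF-8, while
it does survive one through Latin-1.  It assumes the (ice-9 iconv)
module is available; the names raw and round-trip and the sample bytes
are hypothetical.

```scheme
(use-modules (ice-9 iconv)        ; bytevector->string, string->bytevector
             (rnrs bytevectors))  ; bytevector=?

;; Hypothetical input: "f", then the byte 0xFF (never valid in
;; UTF-8), then "o".
(define raw #vu8(102 255 111))

(define (round-trip bytes encoding)
  ;; Decode BYTES as ENCODING, substituting for anything that cannot
  ;; be decoded, then encode the resulting string back to bytes.
  (string->bytevector
   (bytevector->string bytes encoding 'substitute)
   encoding
   'substitute))

;; Latin-1 maps every byte to a character, so nothing is lost.
(display (bytevector=? raw (round-trip raw "ISO-8859-1")))
(newline)

;; With UTF-8 the stray byte is replaced on input, so the original
;; byte cannot be recovered on output.
(display (bytevector=? raw (round-trip raw "UTF-8")))
(newline)
```

A raw-byte "character" of the kind described above would let the
second round trip return the original bytes as well.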




* Re: guile can't find a chinese named file
  2017-02-27 16:07                           ` Eli Zaretskii
@ 2017-02-27 19:29                             ` Andy Wingo
  2017-02-27 20:24                               ` Jan Wedekind
  0 siblings, 1 reply; 110+ messages in thread
From: Andy Wingo @ 2017-02-27 19:29 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: guile-user

Hi :)

On Mon 27 Feb 2017 17:07, Eli Zaretskii <eliz@gnu.org> writes:

>> From: Andy Wingo <wingo@pobox.com>
>> Date: Sun, 26 Feb 2017 22:20:31 +0100
>> 
>> In Scheme, strings are sequences of characters.  Encoding and decoding
>> is only needed when going to and from bytes.  Guile supports a finite
>> number of encodings, so in general some encoding/decoding will always be
>> needed.  The specific encoding may change over time.
>
> The lesson of Emacs development is that there's a need for
> "characters" that represent raw bytes which cannot be decoded into the
> internal representation, for whatever reasons.  These special
> "characters" need to be representable in strings, among "normal"
> recognizable characters (and thus distinguishable from the latter
> kind), and they need to be converted back to their single-byte form
> when the string is output to the external world.  An implementation of
> text that doesn't include these features will always fail to support
> some important use cases.

Thanks for this note (and upthread).  I didn't know Emacs settled on
this strategy.  It could fit in as a new "conversion strategy" (see
Encoding in the manual).

I think this feature will probably slip for 2.2.0 for lack of time,
though.  When someone does go to look at it, this thread is a useful
resource, or parts of it anyway :) I especially appreciated the
tradeoffs between surrogates and strange UTF-8 hacks.

Andy




* Re: guile can't find a chinese named file
  2017-02-27 19:29                             ` Andy Wingo
@ 2017-02-27 20:24                               ` Jan Wedekind
  2017-02-27 20:33                                 ` Eli Zaretskii
  0 siblings, 1 reply; 110+ messages in thread
From: Jan Wedekind @ 2017-02-27 20:24 UTC (permalink / raw)
  To: Andy Wingo; +Cc: guile-user

> Hi :)
>
> On Mon 27 Feb 2017 17:07, Eli Zaretskii <eliz@gnu.org> writes:
>
>>> From: Andy Wingo <wingo@pobox.com>
>>> Date: Sun, 26 Feb 2017 22:20:31 +0100
>>>
>>> In Scheme, strings are sequences of characters.  Encoding and decoding
>>> is only needed when going to and from bytes.  Guile supports a finite
>>> number of encodings, so in general some encoding/decoding will always be
>>> needed.  The specific encoding may change over time.
>>
>> The lesson of Emacs development is that there's a need for
>> "characters" that represent raw bytes which cannot be decoded into the
>> internal representation, for whatever reasons.  These special
>> "characters" need to be representable in strings, among "normal"
>> recognizable characters (and thus distinguishable from the latter
>> kind), and they need to be converted back to their single-byte form
>> when the string is output to the external world.  An implementation of
>> text that doesn't include these features will always fail to support
>> some important use cases.
>
> Thanks for this note (and upthread).  I didn't know Emacs settled on
> this strategy.  It could fit in as a new "conversion strategy" (see
> Encoding in the manual).
>
> I think this feature will probably slip for 2.2.0 for lack of time,
> though.  When someone does go to look at it, this thread is a useful
> resource, or parts of it anyway :) I especially appreciated the
> tradeoffs between surrogates and strange UTF-8 hacks.
>
> Andy
>
>
>

The encoding support of the Ruby programming language [1] is IMHO pretty 
good. It can handle different encodings for source code, input/output, 
string variables, and regular expressions. UTF-8 is the preferred encoding 
but other encodings are required. E.g. Ruby is used a lot in Japan and 
there are many "Kanji" which are currently not covered by UTF-8.

[1] http://nuclearsquid.com/writings/ruby-1-9-encodings/




* Re: guile can't find a chinese named file
  2017-02-27 20:24                               ` Jan Wedekind
@ 2017-02-27 20:33                                 ` Eli Zaretskii
  0 siblings, 0 replies; 110+ messages in thread
From: Eli Zaretskii @ 2017-02-27 20:33 UTC (permalink / raw)
  To: Jan Wedekind; +Cc: wingo, guile-user

> Date: Mon, 27 Feb 2017 20:24:19 +0000 (GMT)
> From: Jan Wedekind <jan@wedesoft.de>
> cc: Eli Zaretskii <eliz@gnu.org>, guile-user@gnu.org
> 
> The encoding support of the Ruby programming language [1] is IMHO pretty 
> good. It can handle different encodings for source code, input/output, 
> string variables, and regular expressions. UTF-8 is the preferred encoding 
> but other encodings are required. E.g. Ruby is used a lot in Japan and 
> there are many "Kanji" which are currently not covered by UTF-8.

Emacs solves the latter problem as well, by using codepoints beyond
the end of the Unicode range.  (Don't forget that the Emacs m17n
features were designed and implemented by people who came from Japan.)
The advantage of the Emacs solution is that the internal
representation is still (a superset of) UTF-8, even though the byte
sequences for these codepoints could be longer than the maximum of the
standard UTF-8.
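
For readers unfamiliar with the idea, here is a rough sketch of how
the UTF-8 bit layout generalizes to code points above U+10FFFF by
allowing a five-byte form.  It only illustrates the general scheme and
is not Emacs's actual implementation; the procedure names
extended-utf8-encode and continuation-bytes are made up, and no
special handling of surrogates or raw bytes is attempted.

```scheme
(define (continuation-bytes cp n)
  ;; Return the N trailing bytes (each of the form 10xxxxxx) that
  ;; carry the low 6*N bits of CP, most significant first.
  (let loop ((k (- n 1)) (acc '()))
    (if (< k 0)
        (reverse acc)
        (loop (- k 1)
              (cons (logior #x80 (logand (ash cp (* -6 k)) #x3F))
                    acc)))))

(define (extended-utf8-encode cp)
  ;; Encode code point CP as a list of byte values, using the usual
  ;; UTF-8 layout for one to four bytes and a 111110xx lead byte for
  ;; the five-byte form.
  (cond ((< cp 0)
         (error "code point out of range:" cp))
        ((< cp #x80)
         (list cp))
        ((< cp #x800)
         (cons (logior #xC0 (ash cp -6)) (continuation-bytes cp 1)))
        ((< cp #x10000)
         (cons (logior #xE0 (ash cp -12)) (continuation-bytes cp 2)))
        ((< cp #x200000)
         (cons (logior #xF0 (ash cp -18)) (continuation-bytes cp 3)))
        ((< cp #x4000000)
         (cons (logior #xF8 (ash cp -24)) (continuation-bytes cp 4)))
        (else
         (error "code point out of range:" cp))))

;; U+540D, a character from the file name that started this thread.
(write (extended-utf8-encode #x540D))
(newline)

;; A code point beyond the Unicode range needs the five-byte form.
(write (extended-utf8-encode #x300000))
(newline)
```

Code points inside the Unicode range come out exactly as in standard
UTF-8, which is what makes such a representation a superset.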





Thread overview: 110+ messages
2016-11-27 11:58 guile can't find a chinese named file Thomas Morley
2016-11-27 12:16 ` Chaos Eternal
2016-11-28  8:54   ` Thomas Morley
2017-01-26 21:59     ` Linas Vepstas
2017-01-30 14:20 ` Ludovic Courtès
2017-01-30 15:48   ` David Kastrup
2017-01-30 16:41     ` Ludovic Courtès
2017-01-30 17:04       ` David Kastrup
2017-01-30 15:54   ` Marko Rauhamaa
2017-01-30 16:19     ` David Kastrup
2017-01-30 16:33       ` Marko Rauhamaa
2017-01-30 16:42         ` David Kastrup
2017-01-30 17:58           ` Marko Rauhamaa
2017-01-30 18:32             ` David Kastrup
2017-01-30 18:50               ` Eli Zaretskii
2017-01-30 19:00                 ` David Kastrup
2017-01-30 19:32                   ` Eli Zaretskii
2017-01-30 19:59                     ` Eli Zaretskii
2017-01-30 20:42                       ` Mike Gran
2017-01-31  3:31                         ` Eli Zaretskii
2017-01-31  6:16                           ` Mike Gran
2017-01-31  8:51                           ` David Kastrup
2017-01-30 19:01               ` Marko Rauhamaa
2017-01-30 19:27                 ` David Kastrup
2017-02-14 20:10                   ` Linas Vepstas
2017-02-14 20:54                     ` Mike Gran
2017-02-14 21:07                       ` Marko Rauhamaa
2017-02-14 21:52                         ` Mike Gran
2017-02-14 22:12                           ` Marko Rauhamaa
2017-02-14 22:19                           ` Chris Vine
2017-02-15  7:15                             ` Marko Rauhamaa
2017-02-15  9:18                             ` tomas
2017-02-15  9:54                               ` David Kastrup
2017-02-15 10:10                                 ` tomas
2017-02-15 17:04                                   ` Eli Zaretskii
2017-02-15 20:07                                     ` tomas
2017-02-15 20:22                                       ` Eli Zaretskii
2017-02-15 10:50                                 ` Marko Rauhamaa
2017-02-15 11:18                                   ` David Kastrup
2017-02-15 10:15                               ` Chris Vine
2017-02-15 11:48                                 ` tomas
2017-02-15 12:13                                   ` Chris Vine
2017-02-15 12:41                                     ` tomas
2017-02-15 13:11                                       ` Chris Vine
2017-02-15 13:31                                         ` tomas
2017-02-15 17:07                                     ` Eli Zaretskii
2017-02-26 20:58                                       ` Andy Wingo
2017-02-27 16:02                                         ` Eli Zaretskii
2017-02-26 20:52                                 ` Andy Wingo
2017-02-15 16:59                               ` Eli Zaretskii
2017-02-15 17:53                                 ` Marko Rauhamaa
2017-02-15 20:20                                 ` tomas
2017-02-15 20:32                                   ` Eli Zaretskii
2017-02-15 21:04                                     ` Marko Rauhamaa
2017-02-16  5:44                                       ` Eli Zaretskii
2017-02-16  6:15                                         ` Marko Rauhamaa
2017-02-16  6:29                                           ` Eli Zaretskii
2017-02-16  6:41                                             ` Eli Zaretskii
2017-02-16  7:16                                               ` Marko Rauhamaa
2017-02-16  8:26                                                 ` David Kastrup
2017-02-16 10:21                                                   ` Marko Rauhamaa
2017-02-16 10:43                                                     ` David Kastrup
2017-02-16 11:04                                                       ` Marko Rauhamaa
2017-02-16 11:11                                                         ` David Kastrup
2017-02-16 11:32                                                           ` Marko Rauhamaa
2017-02-16 11:49                                                             ` David Kastrup
2017-02-16 12:14                                                               ` Marko Rauhamaa
2017-02-16 16:21                                                                 ` Eli Zaretskii
2017-02-16 16:38                                                                   ` Marko Rauhamaa
2017-02-16 17:46                                                                     ` Eli Zaretskii
2017-02-16 18:38                                                                       ` Marko Rauhamaa
2017-02-16 18:46                                                                         ` Eli Zaretskii
2017-02-16 19:35                                                                           ` Marko Rauhamaa
2017-02-16 20:10                                                                             ` Eli Zaretskii
2017-02-16 20:52                                                                               ` David Kastrup
2017-02-16 21:13                                                                                 ` Marko Rauhamaa
2017-02-17  6:44                                                                                   ` Eli Zaretskii
2017-02-17  8:46                                                                                     ` Marko Rauhamaa
2017-02-17  9:04                                                                                       ` David Kastrup
2017-02-17  9:57                                                                                         ` tomas
2017-02-17  9:07                                                                                       ` Eli Zaretskii
2017-02-17  6:32                                                                                 ` Eli Zaretskii
2017-02-16 16:06                                                 ` Eli Zaretskii
2017-02-16 16:35                                                   ` Marko Rauhamaa
2017-02-16 17:41                                                     ` Eli Zaretskii
2017-02-16 18:30                                                     ` Mike Gran
2017-02-16 18:48                                                       ` David Kastrup
2017-02-16  7:02                                             ` Marko Rauhamaa
2017-02-16 15:47                                               ` Eli Zaretskii
2017-02-15 21:15                                     ` tomas
2017-02-16  5:54                                       ` Eli Zaretskii
2017-02-14 23:58                       ` David Kastrup
2017-02-15 10:12                         ` tomas
2017-02-15 12:04                           ` Marko Rauhamaa
2017-02-26 21:20                         ` Andy Wingo
2017-02-27  9:10                           ` David Kastrup
2017-02-27 11:02                             ` Andy Wingo
2017-02-27 12:09                               ` David Kastrup
2017-02-27 12:33                                 ` Andy Wingo
2017-02-27 16:07                           ` Eli Zaretskii
2017-02-27 19:29                             ` Andy Wingo
2017-02-27 20:24                               ` Jan Wedekind
2017-02-27 20:33                                 ` Eli Zaretskii
2017-02-14 22:26                     ` Ludovic Courtès
2017-02-26 21:23                       ` Andy Wingo
2017-01-30 19:41                 ` Eli Zaretskii
2017-01-30 20:46                   ` Marko Rauhamaa
2017-01-31 12:20                     ` tomas
2017-02-14 19:58             ` Linas Vepstas
2017-02-26 21:33               ` Andy Wingo
