unofficial mirror of guile-devel@gnu.org 
* Improving the handling of system data (env, users, paths, ...)
@ 2024-07-06 20:32 Rob Browning
  2024-07-07  4:59 ` tomas
                   ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: Rob Browning @ 2024-07-06 20:32 UTC (permalink / raw)
  To: guile-devel



* Problem

System data like environment variables, user names, group names, file
paths and extended attributes (xattr), etc. are on some systems (like
Linux) binary data, and may not be encodable as a string in the current
locale.  For Linux, as an example, only the null character is an invalid
user/group/filename byte, while for UTF-8, a much smaller set of bytes
are valid[1].

As an example, "µ" (Greek Mu) when encoded as Latin-1 is 0xb5, which is
a completely invalid UTF-8 byte, but a perfectly legitimate Linux file
name.  As a result, (readdir dir) will return a corrupted value when the
locale is set to UTF-8.

You can try it yourself from bash if your current locale uses an
LC_CTYPE that's incompatible with 0xb5:

    $ locale | grep LC_CTYPE
    LC_CTYPE="en_US.utf8"
    $ guile -c '(write (program-arguments)) (newline)' $'\xb5'
    ("guile" "?")

You end up with a question mark instead of the correct value.  This
makes it difficult to write programs that don't risk silent corruption
unless all the relevant system data is known to be compatible with the
user's current locale.

It's perhaps worth noting, that while typically unlikely, any given
directory could contain paths in an arbitrary collection of encodings:
UTF-8, SHIFT-JIS, Latin-1, etc., and so if you really want to try to
handle them as strings (maybe you want to correctly upcase/downcase
them), you have to know (somehow) the encoding that applies to each one.
Otherwise, in the limiting case, you can only assume "bytes".


* Improvements

At a minimum, I suggest Guile should produce an error by default
(instead of generating incorrect data) when the system bytes cannot be
encoded in the current locale.

There should also be some straightforward, thread-safe way to write code
that accesses and manipulates system data efficiently and without
corruption.

As an incremental step, and as has been discussed elsewhere a bit, we
might add support for uselocale()[2] and then document that the current
recommendation is to always use ISO-8859-1 (i.e. Latin-1)[3] for system
data unless you're certain your program doesn't need to be general
purpose (perhaps you're sure you only care about UTF-8 systems).

A program intended to work everywhere might then do something like
this:

    ...
      #:use-module ((guile locale)
                    #:select (iso-8859-1 with-locale))
    ...

    (define (environment name)
      (with-locale iso-8859-1 (getenv name)))

There are disadvantages to this approach, but it's a fairly easy
improvement.

Some potential disadvantages:

  - In cases where the system data was actually UTF-8, non-ASCII
    characters will be displayed "completely wrong", i.e. mapped to
    "random" other characters according to the Latin-1 correspondences.
    
  - You have to pay whatever cost is involved in switching locales, and
    in encoding/decoding the bytes, even if you only care about the
    bytes.

  - If any manipulations of the string representing the system data end
    up performing Unicode canonicalizations or normalizations, the data
    could still be corrupted.  I don't *think* Guile itself ever does
    that implicitly.

  - Less importantly, if we switch the internal string representation to
    UTF-8 (proposed[4]), then non-ASCII bytes in the data will require
    two bytes in memory.

The most direct (and compact, if we do convert to UTF-8) representation
would be bytevectors, but then you would have a much more limited set of
operations available (i.e. strings have all of srfi-13, srfi-14, etc.)
unless we expanded them (likely re-using the existing code paths).  Of
course you could still convert to Latin-1, perform the operation, and
convert back, but that's not ideal.
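For concreteness, the "convert, operate, convert back" dance might look
roughly like this with the existing (ice-9 iconv) API (the helper names
here are purely illustrative, not a proposal):

    ;; Sketch only: ISO-8859-1 maps every byte 0x00-0xff to the code point
    ;; with the same value, so the round trip preserves the bytes exactly.
    (use-modules (ice-9 iconv))

    (define (bytes-suffix? suffix-bv bv)
      ;; Reuse an existing string operation on raw bytes.
      (string-suffix? (bytevector->string suffix-bv "ISO-8859-1")
                      (bytevector->string bv "ISO-8859-1")))

    (define (bytes-upcase bv)   ; only meaningful if bv really is Latin-1 text
      (string->bytevector (string-upcase (bytevector->string bv "ISO-8859-1"))
                          "ISO-8859-1"))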

Finally, while I'm not sure how I feel about it, one notable precedent
is Python's "surrogateescape" approach[5], which shifts any unencodable
bytes into "lone Unicode surrogates", a process which can (and of course
must) be safely reversed before handing the data back to the system.  It
has its own trade-offs/(security)-concerns, as mentioned in the PEP.
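Roughly, PEP 383 maps each undecodable byte 0x80-0xff to a lone surrogate
in U+DC80-U+DCFF and back.  As plain integer arithmetic (Guile characters
can't hold lone surrogates, so this only illustrates the mapping, it isn't
a drop-in implementation):

    (define (escape-byte b)          ; undecodable byte -> surrogate code point
      (+ #xdc00 b))                  ; #x80..#xff -> #xdc80..#xdcff
    (define (unescape-code-point cp) ; reversed before handing bytes back
      (- cp #xdc00))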

[1] https://en.wikipedia.org/wiki/UTF-8#Encoding
[2] https://pubs.opengroup.org/onlinepubs/9699919799/functions/uselocale.html
[3] https://en.wikipedia.org/wiki/ISO/IEC_8859-1
[4] https://codeberg.org/rlb/guile/src/branch/utf8
[5] https://peps.python.org/pep-0383/

Thanks, and I'm happy to help with the implementation of whatever
improvements we choose, if we come to a consensus.

-- 
Rob Browning
rlb @defaultvalue.org and @debian.org
GPG as of 2011-07-10 E6A9 DA3C C9FD 1FF8 C676 D2C4 C0F0 39E9 ED1B 597A
GPG as of 2002-11-03 14DD 432F AE39 534D B592 F9A0 25C8 D377 8C7E 73A4



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Improving the handling of system data (env, users, paths, ...)
  2024-07-06 20:32 Improving the handling of system data (env, users, paths, ...) Rob Browning
@ 2024-07-07  4:59 ` tomas
  2024-07-07  5:33 ` Eli Zaretskii
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 18+ messages in thread
From: tomas @ 2024-07-07  4:59 UTC (permalink / raw)
  To: Rob Browning; +Cc: guile-devel

[-- Attachment #1: Type: text/plain, Size: 3142 bytes --]

On Sat, Jul 06, 2024 at 03:32:17PM -0500, Rob Browning wrote:
> 
> 
> * Problem
> 
> System data like environment variables, user names, group names, file
> paths and extended attributes (xattr), etc. are on some systems (like
> Linux) binary data, and may not be encodable as a string in the current
> locale.

Since this might get lost in the ensuing discussion, yes: in Linux (and
relatives) file names are byte arrays, not strings.

> It's perhaps worth noting, that while typically unlikely, any given
> directory could contain paths in an arbitrary collection of encodings:

Exactly: it's the creating process's locale that calls the shots. So
if you are in a multi-locale environment (e.g. users with different
encodings) this will happen.

> At a minimum, I suggest Guile should produce an error by default
> (instead of generating incorrect data) when the system bytes cannot be
> encoded in the current locale.

Yes, perhaps.

[iso-8859-1]

> There are disadvantages to this approach, but it's a fairly easy
> improvement.

I'm not a fan of this one: watching Emacs's development, people end
up using Latin-1 as a poor substitute for "byte array" :-)

> The most direct (and compact, if we do convert to UTF-8) representation
> would bytevectors, but then you would have a much more limited set of
> operations available (i.e. strings have all of srfi-13, srfi-14, etc.)
> unless we expanded them (likely re-using the existing code paths).  Of
> course you could still convert to Latin-1, perform the operation, and
> convert back, but that's not ideal.

It would be the right one, and let users deal with explicit conversions
from/to strings, so they see the issues happening, but alas, you are
right: it's very inconvenient.

> Finally, while I'm not sure how I feel about it, one notable precedent
> is Python's "surrogateescape" approach[5], which shifts any unencodable
> bytes into "lone Unicode surrogates", a process which can (and of course
> must) be safely reversed before handing the data back to the system.  It
> has its own trade-offs/(security)-concerns, as mentioned in the PEP.

FWIW, that's more or less what Emacs's internal encoding does: it is roughly
UTF-8, but reserves some code points for odd bytes (which it then displays
as backslash sequences). It's round-trip safe, but has its own set of sharp
edges, and naive [1] users get caught in them from time to time.

What's my point? Basically, that we shouldn't try to get it 100% right,
because there's possibly no way, and we'd pile up a lot of complexity which
is very difficult to get rid of (most languages have painful transition
stories to tell).

I think it's ok to try some guesswork to make users' lives easier, but
it's perhaps better to (by default) fail noisily at the least suspicion than
to carry on happily with wrong results.

Guessing UTF-8 seems a safe bet: for one, everybody (except Javascript) is
moving in that direction; for another, you notice quickly when it isn't
(as opposed to ISO-8859-x, which will trundle along, producing funny content).

Cheers
-- 
t

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Improving the handling of system data (env, users, paths, ...)
  2024-07-06 20:32 Improving the handling of system data (env, users, paths, ...) Rob Browning
  2024-07-07  4:59 ` tomas
@ 2024-07-07  5:33 ` Eli Zaretskii
  2024-07-07 10:03   ` Jean Abou Samra
  2024-07-07  9:45 ` Jean Abou Samra
  2024-07-07 10:24 ` Maxime Devos
  3 siblings, 1 reply; 18+ messages in thread
From: Eli Zaretskii @ 2024-07-07  5:33 UTC (permalink / raw)
  To: Rob Browning; +Cc: guile-devel

> From: Rob Browning <rlb@defaultvalue.org>
> Date: Sat, 06 Jul 2024 15:32:17 -0500
> 
> * Problem
> 
> System data like environment variables, user names, group names, file
> paths and extended attributes (xattr), etc. are on some systems (like
> Linux) binary data, and may not be encodable as a string in the current
> locale.  For Linux, as an example, only the null character is an invalid
> user/group/filename byte, while for UTF-8, a much smaller set of bytes
> are valid[1].
> 
> As an example, "µ" (Greek Mu) when encoded as Latin-1 is 0xb5, which is
> a completely invalid UTF-8 byte, but a perfectly legitimate Linux file
> name.  As a result, (readdir dir) will return a corrupted value when the
> locale is set to UTF-8.
> 
> You can try it yourself from bash if your current locale uses an
> LC_CTYPE that's incompatible with 0xb5:
> 
>     $ locale | grep LC_CTYPE
>     LC_CTYPE="en_US.utf8"
>     $ guile -c '(write (program-arguments)) (newline)' $'\xb5'
>     ("guile" "?")
> 
> You end up with a question mark instead of the correct value.  This
> makes it difficult to write programs that don't risk silent corruption
> unless all the relevant system data is known to be compatible with the
> user's current locale.
> 
> It's perhaps worth noting, that while typically unlikely, any given
> directory could contain paths in an arbitrary collection of encodings:
> UTF-8, SHIFT-JIS, Latin-1, etc., and so if you really want to try to
> handle them as strings (maybe you want to correctly upcase/downcase
> them), you have to know (somehow) the encoding that applies to each one.
> Otherwise, in the limiting case, you can only assume "bytes".

Why not learn from GNU Emacs, which already solved this very hard
problem, and has many years of user and programming experience to
prove it, instead of inventing Guile's own solution?

Here's what we have learned in Emacs since 1997 (when Emacs 20.1 was
released, the first version that tried to provide an environment that
supports multiple languages and encodings at the same time):

 . Locales are not a good mechanism for this.  A locale supports a
   single language/encoding, and switching the locale each time you
   need a different one is costly and makes many simple operations
   cumbersome, and the code hard to read.
 . It follows that relying on libc functions that process non-ASCII
   characters is also not the best idea: those functions depend on the
   locale, and thus force the programmer to use locales and switch
   them as needed.
 . Byte sequences that cannot be decoded for some reason are a fact of
   life, and any real-life programming system must be able to deal
   with them in a reasonable and efficient way.
 . Therefore, Emacs has arrived at the following system, which we have
   used for the last 15 years without any significant changes:

    - When text is read from an external source, it is _decoded_ into
      the internal representation of characters.  When text is written
      to an external destination, it is _encoded_ using an appropriate
      codeset.
    - The internal representation is a superset of UTF-8, in that it
      is capable of representing characters for which there are no
      Unicode codepoints (such as GB 18030, some of whose characters
      don't have Unicode counterparts; and raw bytes, used to
      represent byte sequences that cannot be decoded).  It uses
      5-byte UTF-8-like sequences for these extensions.
    - The codesets used to decode and encode can be selected by simple
      settings, and have defaults which are locale- and
      language-aware.  When the encoding of external text is not
      known, Emacs uses a series of guesses, driven by the locale, the
      nature of the source (e.g., file name), user preferences, etc.
      Encoding generally reuses the same codeset used to decode (which
      is recorded with the text), and the Lisp program can override
      that.
    - Separate global variables and corresponding functions are
      provided for decoding/encoding stuff that comes from several
      important sources and goes to the corresponding destinations.
      Examples include en/decoding of file names, en/decoding of text
      from files, en/decoding values of environment variables and
      system messages (e.g., messages from strerror), and en/decoding
      text from subordinate processes.  Each of these gets the default
      value based on the locale and the language detected at startup,
      but a Lisp program can modify each one of them, either
      temporarily or globally.  There are also facilities for adapting
      these to specific requirements of particular external sources
      and destinations: for example, one can define special codesets
      for encoding and decoding text from/to specific programs run by
      Emacs, based on the program names.  (E.g., Git generally wants
      UTF-8 encoding regardless of the locale.)  Similarly, some
      specific file names are known to use certain encodings.  All of
      these are used to determine the proper codeset when the caller
      didn't specify one.
    - Emacs has its own code for code-conversion, for moving by
      characters through multibyte sequences, for producing a Unicode
      codepoint from a byte sequence in the super-UTF-8 representation
      and back, etc., so it doesn't use libc routines for that, and
      thus doesn't depend on the current locale for these operations.
    - APIs are provided for "manual" encoding and decoding.  A Lisp
      program can read a byte stream, then decode it "manually" using
      a particular codeset, as deemed appropriate.  This makes it possible to
      handle complex situations where a program receives stuff whose
      encoding can only be determined by examining the raw byte stream
      (a typical example is a multipart email message with MIME
      charset header for each part).
    - Emacs also has tables of Unicode attributes of characters
      (produced by parsing the relevant Unicode data files at build
      time), so it can up/down-case characters, determine their
      category (letters, digits, punctuation, etc.) and script to
      which they belong, etc. -- all with its own code, independent of
      the underlying libc.

This is no doubt a complex system that needs a lot of code.  But it
does work, and works well, as proven by years of experience.  Nowadays
at least some of the functionality can be found in free libraries
which Guile could perhaps use, instead of rolling its own
implementations.  And the code used by Emacs is, of course, freely
available for study and reuse.

> At a minimum, I suggest Guile should produce an error by default
> (instead of generating incorrect data) when the system bytes cannot be
> encoded in the current locale.

In our experience, this is a mistake.  Signaling an error for each
decoding problem produces unreliable applications that punt in too
many cases.  Emacs leaves the problematic bytes alone, as raw bytes
(which are representable in the internal representation, see above),
and leaves it to higher-level application code or to the user to deal
with the results.  The "generation of incorrect data" alternative is
thus avoided, because Emacs does not replace undecodable bytes with
something else.

> As an incremental step, and as has been discussed elsewhere a bit, we
> might add support for uselocale()[2] and then document that the current
> recommendation is to always use ISO-8859-1 (i.e. Latin-1)[3] for system
> data unless you're certain your program doesn't need to be general
> purpose (perhaps you're sure you only care about UTF-8 systems).

A Latin-1 locale comes with its baggage of rules, for example up- and
down-casing, character classification (letters vs punctuation etc.),
and other stuff.  Representing raw bytes pretending they are Latin-1
characters is therefore problematic and will lead to programmatic
errors, whereby a program cannot distinguish between a raw byte and a
Latin-1 character that have the same 8-bit value.

Feel free to ask any questions about the details.

HTH



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Improving the handling of system data (env, users, paths, ...)
  2024-07-06 20:32 Improving the handling of system data (env, users, paths, ...) Rob Browning
  2024-07-07  4:59 ` tomas
  2024-07-07  5:33 ` Eli Zaretskii
@ 2024-07-07  9:45 ` Jean Abou Samra
  2024-07-07 19:25   ` Rob Browning
  2024-07-07 10:24 ` Maxime Devos
  3 siblings, 1 reply; 18+ messages in thread
From: Jean Abou Samra @ 2024-07-07  9:45 UTC (permalink / raw)
  To: Rob Browning, guile-devel

[-- Attachment #1: Type: text/plain, Size: 1521 bytes --]

Le samedi 06 juillet 2024 à 15:32 -0500, Rob Browning a écrit :
> At a minimum, I suggest Guile should produce an error by default
> (instead of generating incorrect data) when the system bytes cannot be
> encoded in the current locale.


I agree that an error would be better than replacing with a question mark.


> As an incremental step, and as has been discussed elsewhere a bit, we
> might add support for uselocale()[2] and then document that the current
> recommendation is to always use ISO-8859-1 (i.e. Latin-1)[3] for system
> data unless you're certain your program doesn't need to be general
> purpose (perhaps you're sure you only care about UTF-8 systems).


A latin1 locale is a terrible default. Virtually no Linux system these days
has a locale encoding other than UTF-8, except perhaps the "C" locale,
which people still use out of habit with "LC_ALL=C" as a way to say "speak English
please", although most Linux distros have a C.UTF-8 locale these days.



> The most direct (and compact, if we do convert to UTF-8) representation
> would bytevectors, but then you would have a much more limited set of
> operations available (i.e. strings have all of srfi-13, srfi-14, etc.)
> unless we expanded them (likely re-using the existing code paths).  Of
> course you could still convert to Latin-1, perform the operation, and
> convert back, but that's not ideal.


Why is that "not ideal"? The (ice-9 iconv) API is convenient, locale-independent
and thread-safe.


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Improving the handling of system data (env, users, paths, ...)
  2024-07-07  5:33 ` Eli Zaretskii
@ 2024-07-07 10:03   ` Jean Abou Samra
  2024-07-07 11:04     ` Eli Zaretskii
  0 siblings, 1 reply; 18+ messages in thread
From: Jean Abou Samra @ 2024-07-07 10:03 UTC (permalink / raw)
  To: Eli Zaretskii, Rob Browning; +Cc: guile-devel

[-- Attachment #1: Type: text/plain, Size: 2821 bytes --]

Le dimanche 07 juillet 2024 à 08:33 +0300, Eli Zaretskii a écrit :
> 
>     - The internal representation is a superset of UTF-8, in that it
>       is capable of representing characters for which there are no
>       Unicode codepoints (such as GB 18030, some of whose characters
>       don't have Unicode counterparts; and raw bytes, used to
>       represent byte sequences that cannot be decoded).  It uses
>       5-byte UTF-8-like sequences for these extensions.


Guile is a Scheme implementation, bound by Scheme standards and compatibility
with other Scheme implementations (and backwards compatibility too).

I just tried (aref (cadr command-line-args) 0) in a lisp-interaction-mode
Emacs buffer after launching "emacs $'\xb5'". It gave 4194229 = 0x3fffb5,
which quite logically is outside the Unicode code point range 0 - 0x110000.

This doesn't work for Guile, since a character is a Unicode code point
in the Scheme semantics.


>     - Emacs has its own code for code-conversion, for moving by
>       characters through multibyte sequences, for producing a Unicode
>       codepoint from a byte sequence in the super-UTF-8 representation
>       and back, etc., so it doesn't use libc routines for that, and
>       thus doesn't depend on the current locale for these operations.


Guile's encoding conversions don't rely on the libc locale. They use
GNU libiconv. The issue at hand is that for argv specifically, the
conversion happens at startup with the locale encoding as a default
(AFAICT Guile uses environ_locale_charset from gnulib to convert the
C locale to an encoding name usable by libiconv) and Guile doesn't store
the original argv bytes.


>     - APIs are provided for "manual" encoding and decoding.  A Lisp
>       program can read a byte stream, then decode it "manually" using
>       a particular codeset, as deemed appropriate.  This allows to
>       handle complex situations where a program receives stuff whose
>       encoding can only be determined by examining the raw byte stream
>       (a typical example is a multipart email message with MIME
>       charset header for each part).


These exist, see (ice-9 iconv).
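E.g., a sketch of the "decode each part with its own declared charset"
case (part-bytes and part-charset being placeholders for whatever the
MIME parser produced):

    (use-modules (ice-9 iconv))

    (define (decode-mime-part part-bytes part-charset)
      ;; part-charset is an iconv encoding name, e.g. "ISO-8859-15"
      (bytevector->string part-bytes part-charset))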


>     - Emacs also has tables of Unicode attributes of characters
>       (produced by parsing the relevant Unicode data files at build
>       time), so it can up/down-case characters, determine their
>       category (letters, digits, punctuation, etc.) and script to
>       which they belong, etc. -- all with its own code, independent of
>       the underlying libc.


Also exists, and AFAICT uses GNU libunistring. See string-upcase,
char-general-category, etc.


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: Improving the handling of system data (env, users, paths, ...)
  2024-07-06 20:32 Improving the handling of system data (env, users, paths, ...) Rob Browning
                   ` (2 preceding siblings ...)
  2024-07-07  9:45 ` Jean Abou Samra
@ 2024-07-07 10:24 ` Maxime Devos
  2024-07-07 19:40   ` Rob Browning
  3 siblings, 1 reply; 18+ messages in thread
From: Maxime Devos @ 2024-07-07 10:24 UTC (permalink / raw)
  To: Rob Browning, guile-devel@gnu.org

[-- Attachment #1: Type: text/plain, Size: 9015 bytes --]

>* Problem
>
>System data like environment variables, user names, group names, file
paths and extended attributes (xattr), etc. are on some systems (like
Linux) binary data, and may not be encodable as a string in the current
locale.  For Linux, as an example, only the null character is an invalid
user/group/filename byte, while for UTF-8, a much smaller set of bytes
are valid[1].
>[...]
>You end up with a question mark instead of the correct value.  This
makes it difficult to write programs that don't risk silent corruption
unless all the relevant system data is known to be compatible with the
user's current locale.

>It's perhaps worth noting, that while typically unlikely, any given
directory could contain paths in an arbitrary collection of encodings:
UTF-8, SHIFT-JIS, Latin-1, etc., and so if you really want to try to
handle them as strings (maybe you want to correctly upcase/downcase
them), you have to know (somehow) the encoding that applies to each one.
Otherwise, in the limiting case, you can only assume "bytes".

>* Improvements

>At a minimum, I suggest Guile should produce an error by default
(instead of generating incorrect data) when the system bytes cannot be
encoded in the current locale.

I totally agree on this.

>There should also be some straightforward, thread-safe way to write code
that accesses and manipulates system data efficiently and without
corruption.

>As an incremental step, and as has been discussed elsewhere a bit, we
might add support for uselocale()[2] and then document that the current
recommendation is to always use ISO-8859-1 (i.e. Latin-1)[3] for system
data unless you're certain your program doesn't need to be general
purpose (perhaps you're sure you only care about UTF-8 systems).

I’d rather not. It’s rather stateful and hence non-trivial to compose.
Also, locale is not only about the encoding of text [file name/env encodings/xattr/...],
but also about language, and setting the language is excessive in this case.

>A program intended to work everywhere might then do something like
this:

>   ...
>      #:use-module ((guile locale)
>                   #:select (iso-8859-1 with-locale))
>    ...
>
>    (define (environment name)
>      (with-locale iso-8859-1 (getenv name)))

This, OTOH, seems a bit better – ‘with-locale’ is like ‘parameterize’ and hence pretty composable.
However, it still suffers from the problem that it sets too much (also, there is no such thing as the “iso-8859-1” locale?).

Instead, I would propose something like:

;; [todo: add validation]
;; if #false, default to what is implied by the locale
(define system-encoding (make-parameter #false))
;; if #false, default to system-encoding
(define file-name-encoding (make-parameter #false))
[...]

;; let’s say that for some reason, we know the file names have this encoding,
;; but we don’t have information on other things so we leave the decision
;; on other encodings to the caller.
(define (some-proc)
  (parameterize ((file-name-encoding "UTF-8"))
    [open some file and do stuff with it]))

This also has the advantage of separating the different things a bit – I can imagine a multi-user system where the usernames are encoded differently from the file names in the user home directory (not an unsurmountable problem for ‘with-locale’, but this seems a bit more straightforward to use when external libraries are involved).

(I’m not too sure about this splitting of parameter objects)

>There are disadvantages to this approach, but it's a fairly easy
improvement.

>Some potential disadvantages:

>  - In cases where the system data was actually UTF-8, non-ASCII
>    characters will be displayed "completely wrong", i.e. mapped to
>    "random" other characters according to the Latin-1 correspondences.

This is why I wouldn’t recommend always using ISO-8859-1 by default.
The situation where the encoding of things is different is the exception
(and a historical artifact of the pre-UTF-8 era), not the norm.

I think changing the ‘?’ into ‘throw an exception’, and providing an _option_ (i.e. temporarily changing the locale to ISO-8859-1) to also support this historical artifact, is sufficient.

>  - You have to pay whatever cost is involved in switching locales, and
>    in encoding/decoding the bytes, even if you only care about the
>    bytes.

IIRC, in ISO-8859-1 there is a direct correspondence between bytes and characters
(and Guile recognises this), so there is no cost beyond mere copying.
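A quick illustration of that byte-preserving property (sketch):

    (use-modules (ice-9 iconv))
    (string->bytevector (bytevector->string #vu8(181 0 255) "ISO-8859-1")
                        "ISO-8859-1")
    ;; => #vu8(181 0 255), i.e. the original bytes come back unchanged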

>  - If any manipulations of the string representing the system data end
>    up performing Unicode canonicalizations or normalizations, the data
>    could still be corrupted.  I don't *think* Guile itself ever does
>    that implicitly.

Pretty sure it doesn’t.

>  - Less importantly, if we switch the internal string representation to
>    UTF-8 (proposed[4]), then non-ASCII bytes in the data will require
>    two bytes in memory.

>The most direct (and compact, if we do convert to UTF-8) representation
would bytevectors, but then you would have a much more limited set of
operations available (i.e. strings have all of srfi-13, srfi-14, etc.)
unless we expanded them (likely re-using the existing code paths).  Of
course you could still convert to Latin-1, perform the operation, and
convert back, but that's not ideal.

>Finally, while I'm not sure how I feel about it, one notable precedent
is Python's "surrogateescape" approach[5], which shifts any unencodable
bytes into "lone Unicode surrogates", a process which can (and of course
must) be safely reversed before handing the data back to the system.  It
has its own trade-offs/(security)-concerns, as mentioned in the PEP.

IIRC, surrogates have codepoints, but are not characters. As a consequence, strings would contain non-characters, and (char? (string-ref s index)) might be #false. I’d rather not; such an object does not sound like a string to me.

Here is an alternative solution:

1. Define a new object type ‘<unencoded-string>’ (wrapping a bytevector). This represents things that are _conceptually_ a string rather than a mere sequence of bytes, but whose actual encoding we don’t know, so we can’t let it be a string.
2. Also define a bunch of procedures for converting between bytes, unencoded-strings and strings, plus a ‘string-like?’ predicate that includes both ‘<string>’ and ‘<unencoded-string>’ (a rough sketch follows after this list).
3. Procedures like ‘open-file’ etc. are extended to support <unencoded-string>.
4. Maybe do the same for SRFI-N stuff (maybe as part of (srfi srfi-N gnu) extensions).
(I don’t know if (string-append unencoded encoded) should be supported.)
5. When a procedure would return a filename, it first looks at some parameter objects. These parameter objects determine what the encoding is, and what to do when the data is not valid according to that encoding (approximate via ? and the like, throw an exception, or return an <unencoded-string>) – or even return an <unencoded-string> unconditionally.
6. Also do the same for ‘getenv’ and the like, maybe with a different set of parameter objects.

(Name pending, <unencoded-string> not being a subtype of <string> is bad naming.)
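A rough sketch of items 1 and 2 above, assuming a SRFI-9 record wrapping a
bytevector (all names provisional):

    (use-modules (srfi srfi-9) (ice-9 iconv))

    (define-record-type <unencoded-string>
      (bytes->unencoded-string bytes)   ; constructor: wrap raw bytes
      unencoded-string?
      (bytes unencoded-string-bytes))

    (define (string-like? x)
      (or (string? x) (unencoded-string? x)))

    ;; Wrap a string's bytes in the given encoding ...
    (define (string->unencoded-string s encoding)
      (bytes->unencoded-string (string->bytevector s encoding)))

    ;; ... and attempt the reverse, with the caller naming the encoding it
    ;; believes applies (this is the direction that can fail).
    (define (unencoded-string->string us encoding)
      (bytevector->string (unencoded-string-bytes us) encoding))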

I think this combines most of the positive qualities and avoids most of the negative qualities (with the exception of the surrogate-encoding stuff, which I see mostly as a negative):

• “unless we expanded them (likely re-using the existing code paths)”

This seems doable.
• “- In cases where the system data was actually UTF-8, non-ASCII  characters will be displayed "completely wrong", i.e. mapped to  "random" other characters according to the Latin-1 correspondences.”

By distinguishing <string> from <unencoded-string>, for the most part this is non-applicable (depending on the encodings involved, <insert-encoding> might be incorrectly interpreted as UTF-8, but this seems rare).
• “even if you only care about the bytes.”
If you only care about the bytes, set the relevant parameter objects such that <unencoded-string> objects are returned.
• “At a minimum, I suggest Guile should produce an error by default (instead of generating incorrect data) when the system bytes cannot be encoded in the current locale.”

Included. Also, in the rare situation where approximating things is appropriate (e.g. a basic directory listing), generating incorrect data is also possible.

A negative quality is that there now are two string-ish object types, but since the two types represent different situations, one of them requires more care than the other, and many operations are supported for both, I don’t think that’s too bad.

(It might also be possible to replace <unencoded-string> directly by a bytevector, but if you do this, then remember that on the C level you need to deal with the lack of trailing \0.)

Best regards,
Maxime Devos.


[-- Attachment #2: Type: text/html, Size: 23636 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Improving the handling of system data (env, users, paths, ...)
  2024-07-07 10:03   ` Jean Abou Samra
@ 2024-07-07 11:04     ` Eli Zaretskii
  2024-07-07 11:35       ` Maxime Devos
  0 siblings, 1 reply; 18+ messages in thread
From: Eli Zaretskii @ 2024-07-07 11:04 UTC (permalink / raw)
  To: Jean Abou Samra; +Cc: rlb, guile-devel

> From: Jean Abou Samra <jean@abou-samra.fr>
> Cc: guile-devel@gnu.org
> Date: Sun, 07 Jul 2024 12:03:06 +0200
> 
> Le dimanche 07 juillet 2024 à 08:33 +0300, Eli Zaretskii a écrit :
> > 
> >     - The internal representation is a superset of UTF-8, in that it
> >       is capable of representing characters for which there are no
> >       Unicode codepoints (such as GB 18030, some of whose characters
> >       don't have Unicode counterparts; and raw bytes, used to
> >       represent byte sequences that cannot be decoded).  It uses
> >       5-byte UTF-8-like sequences for these extensions.
> 
> 
> Guile is a Scheme implementation, bound by Scheme standards and compatibility
> with other Scheme implementations (and backwards compatibility too).

Yes, I understand that.

> I just tried (aref (cadr command-line-args) 0) in a lisp-interaction-mode
> Emacs buffer after launching "emacs $'\xb5'". It gave 4194229 = 0x3fffb5,
> which quite logically is outside the Unicode code point range 0 - 0x110000.

That's not how you get a raw byte from a multibyte string in Emacs.
IOW, your code is wrong, if what you wanted was to get the 0xb5 byte.
I guess you assumed something about 'aref' in Emacs that is not true
with multibyte strings that include raw bytes.  So what you got
instead is the internal Emacs "codepoint" for raw bytes, which are in
the 0x3fff00..0x3fffff range.

Note that (cadr command-line-args), for example, yields "\265", as
expected.  That is, in situation where the caller's intent is clear,
Emacs converts back to a single byte automatically.  That's part of
heuristics that took us some releases to get right.

> This doesn't work for Guile, since a character is a Unicode code point
> in the Scheme semantics.

See above: the problem doesn't exist if one uses the correct APIs.

> >     - Emacs has its own code for code-conversion, for moving by
> >       characters through multibyte sequences, for producing a Unicode
> >       codepoint from a byte sequence in the super-UTF-8 representation
> >       and back, etc., so it doesn't use libc routines for that, and
> >       thus doesn't depend on the current locale for these operations.
> 
> Guile's encoding conversions don't rely on the libc locale. They use
> GNU libiconv.

That's okay, but what about other APIs, like conversion between
characters and their multibyte representations, returning the length
of a string in characters, etc.?  AFAIK, libiconv doesn't provide
these facilities.

> >     - Emacs also has tables of Unicode attributes of characters
> >       (produced by parsing the relevant Unicode data files at build
> >       time), so it can up/down-case characters, determine their
> >       category (letters, digits, punctuation, etc.) and script to
> >       which they belong, etc. -- all with its own code, independent of
> >       the underlying libc.
> 
> Also exists, and AFAICT uses GNU libunistring. See string-upcase,
> char-general-category, etc.

Fine, then it should be easier than I thought for Guile to adopt
the same scheme.



^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: Improving the handling of system data (env, users, paths, ...)
  2024-07-07 11:04     ` Eli Zaretskii
@ 2024-07-07 11:35       ` Maxime Devos
  2024-07-07 14:25         ` Eli Zaretskii
  0 siblings, 1 reply; 18+ messages in thread
From: Maxime Devos @ 2024-07-07 11:35 UTC (permalink / raw)
  To: Eli Zaretskii, Jean Abou Samra; +Cc: rlb@defaultvalue.org, guile-devel@gnu.org

[-- Attachment #1: Type: text/plain, Size: 4910 bytes --]



Sent from Mail for Windows

From: Eli Zaretskii
Sent: Sunday, 7 July 2024 13:05
To: Jean Abou Samra
Cc: rlb@defaultvalue.org; guile-devel@gnu.org
Subject: Re: Improving the handling of system data (env, users, paths, ...)

> From: Jean Abou Samra <jean@abou-samra.fr>
> Cc: guile-devel@gnu.org
> Date: Sun, 07 Jul 2024 12:03:06 +0200
> 
> Le dimanche 07 juillet 2024 à 08:33 +0300, Eli Zaretskii a écrit :
> > 
> >     - The internal representation is a superset of UTF-8, in that it
> >       is capable of representing characters for which there are no
> >       Unicode codepoints (such as GB 18030, some of whose characters
> >       don't have Unicode counterparts; and raw bytes, used to
> >       represent byte sequences that cannot be decoded).  It uses
> >       5-byte UTF-8-like sequences for these extensions.
> 
> 
>> Guile is a Scheme implementation, bound by Scheme standards and compatibility
>> with other Scheme implementations (and backwards compatibility too).
>
>Yes, I understand that.

Going by what you are saying below, I think you don’t.

>> I just tried (aref (cadr command-line-args) 0) in a lisp-interaction-mode
>> Emacs buffer after launching "emacs $'\xb5'". It gave 4194229 = 0x3fffb5,
>> which quite logically is outside the Unicode code point range 0 - 0x110000.
>That's not how you get a raw byte from a multibyte string in Emacs.
>IOW, you code is wrong, if what you wanted was to get the 0xb5 byte.
>I guess you assumed something about 'aref' in Emacs that is not true
>with multibyte strings that include raw bytes.  So what you got
>instead is the internal Emacs "codepoint" for raw bytes, which are in
>the 0x3fff00..0x3fffff range.

I’m pretty sure that they weren’t intending to get the 0xb5 byte. Rather, they were using the equivalent of ‘string-ref’ (i.e., ‘aref’) and demonstrating that the result is bogus in Scheme.  In Scheme, ‘(string-ref ...)’ needs to return a character, and there exists no (Unicode) character with codepoint 4194229, so what Emacs returns here would be bogus for (Guile) Scheme.

From the Emacs manual:

>For example, you can access individual characters in a string using the function aref (see Functions that Operate on Arrays).

Thus, (aref the-string index) is the equivalent of (string-ref the-string index). I do not see any indication they were trying to extract the byte itself, rather they were extracting the _character_ corresponding to the byte, and demonstrating that this ‘character’ is, in fact, not actually a character in Scheme (or in other words, no such character exists in Scheme).

>> This doesn't work for Guile, since a character is a Unicode code point
>> in the Scheme semantics.
>See above: the problem doesn't exist if one uses the correct APIs.

AFAICT, there are no correct APIs. Fundamentally (whether for compatibility or by choice), characters in (Guile) Scheme are _Unicode_ characters and (Scheme) strings consist of _only_ such _Unicode characters_. Yet, in Elisp strings consist of more stuff – whether that be characters from Emacs’ extended set, or a mixture of Unicode and raw bytes; in both cases the Elisp APIs that would return characters return things that aren’t _Unicode_ characters, and hence aren’t appropriate APIs for Guile.

This doesn’t mean that Emacs’ model can’t be adopted – rather, it could perhaps be partially adopted, but whenever the resulting ‘string’ contains things that aren’t (Unicode) characters, the result may not be called a ‘string’, and some of the things in the not-string may not be called ‘characters’.

> >     - Emacs has its own code for code-conversion, for moving by
> >       characters through multibyte sequences, for producing a Unicode
> >       codepoint from a byte sequence in the super-UTF-8 representation
> >       and back, etc., so it doesn't use libc routines for that, and
> >       thus doesn't depend on the current locale for these operations.
> 
> Guile's encoding conversions don't rely on the libc locale. They use
> GNU libiconv.

>That's okay, but what about other APIs, like conversion between
characters and their multibyte representations,

This is not an _other_ API, this is precisely the (ice-9 iconv) API. See string->bytevector and bytevector->string (well, you need to turn the single character into a string consisting of a single character first, but this is trivial, simply do (string [insert-character-here])).

> returning the length of a string in characters, etc.?  AFAIK, libiconv doesn't provide
these facilities.

This is a basic string API, just do string-length like in (all?) Schemes. In Scheme, strings consists of characters, so string-length returns the length of a string in characters.

Best regards,
Maxime Devos.

[-- Attachment #2: Type: text/html, Size: 9631 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Improving the handling of system data (env, users, paths, ...)
  2024-07-07 11:35       ` Maxime Devos
@ 2024-07-07 14:25         ` Eli Zaretskii
  2024-07-07 14:59           ` Maxime Devos
  2024-07-07 15:16           ` Jean Abou Samra
  0 siblings, 2 replies; 18+ messages in thread
From: Eli Zaretskii @ 2024-07-07 14:25 UTC (permalink / raw)
  To: Maxime Devos; +Cc: jean, rlb, guile-devel

> Cc: "rlb@defaultvalue.org" <rlb@defaultvalue.org>, 
> 	"guile-devel@gnu.org" <guile-devel@gnu.org>
> From: Maxime Devos <maximedevos@telenet.be>
> Date: Sun, 7 Jul 2024 13:35:27 +0200
> 
> >> Guile is a Scheme implementation, bound by Scheme standards and compatibility
> >> with other Scheme implementations (and backwards compatibility too).
> >
> >Yes, I understand that.
> 
> Going by what you are saying below, I think you don’t.

Thank you for your vote of confidence.

> >> I just tried (aref (cadr command-line-args) 0) in a lisp-interaction-mode
> >> Emacs buffer after launching "emacs $'\xb5'". It gave 4194229 = 0x3fffb5,
> >> which quite logically is outside the Unicode code point range 0 - 0x110000.
> >That's not how you get a raw byte from a multibyte string in Emacs.
> >IOW, you code is wrong, if what you wanted was to get the 0xb5 byte.
> >I guess you assumed something about 'aref' in Emacs that is not true
> >with multibyte strings that include raw bytes.  So what you got
> >instead is the internal Emacs "codepoint" for raw bytes, which are in
> >the 0x3fff00..0x3fffff range.
> 
> I’m pretty sure that they weren’t intending to get the 0xb5 byte. Rather, they were using the equivalent of ‘string-ref’ (i.e., ‘aref’) and demonstrating that the result is bogus in Scheme.  In Scheme, ‘(string-ref ...)’ needs to return a character, and there exists no (Unicode) character with codepoint 4194229, so what Emacs returns here would be bogus for (Guile) Scheme.

aref in Emacs and string-ref in Guile are not the same, and if Guile
needs to produce a raw byte in this scenario, it can be easily
arranged.  In Emacs we have other goals.

IOW, I think this argument is pointless, since it is easy to adapt the
mechanism to what Guile needs.

> >From the Emacs manual:
> 
> >For example, you can access individual characters in a string using the function aref (see Functions that Operate on Arrays).
> 
> Thus, (aref the-string index) is the equivalent of (string-ref the-string index).

No, because a raw byte is not a character.

> I do not see any indication they were trying to extract the byte itself, rather they were extracting the _character_ corresponding to the byte, and demonstrating that this ‘character’ is, in fact, not actually a character in Scheme (or in other words, no such character exists in Scheme).

> >> This doesn't work for Guile, since a character is a Unicode code point
> >> in the Scheme semantics.
> >See above: the problem doesn't exist if one uses the correct APIs.
> 
> AFAICT, there are no correct APIs. Fundamentally (whether for compatibility or by choice), characters in (Guile) Scheme are _Unicode_ characters and (Scheme) strings consists of _only_ such _Unicode characters_. Yet, in Elisp strings consists of more stuff – whether that be characters from Emacs’ extended set, or a mixture of Unicode and raw bytes, in both cases the Elisp APIs that would return characters return things that aren’t _Unicode_ characters, and hence aren’t appropriate APIs for Guile.

If Guile restricts itself to Unicode characters and only them, it will
lack important features.  So my suggestion is not to have this
restriction.

I think the fact that this discussion is held, and that Rob suggested
to use Latin-1 for the purpose of supporting raw bytes is a clear
indication that Guile, too, needs to deal with "character-like" data
that does not fit the Unicode framework.  So I think saying that
strings in Guile can only hold Unicode characters will not give you
what this discussion attempts to give.  In particular, how will you
handle the situations described by Rob where a file has a name that is
not a valid UTF-8 sequence (thus not "characters" as long as you
interpret text as UTF-8)?

> This doesn’t mean that Emacs’ model can’t be adopted – rather, it could perhaps be partially adopted, but whenever the resulting ‘string’ contains things that aren’t (Unicode) characters, the result may not be called a ‘string’, and some of the things in the not-string may not be called ‘characters’.

I agree.



^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: Improving the handling of system data (env, users, paths, ...)
  2024-07-07 14:25         ` Eli Zaretskii
@ 2024-07-07 14:59           ` Maxime Devos
  2024-07-07 15:43             ` Eli Zaretskii
  2024-07-07 15:16           ` Jean Abou Samra
  1 sibling, 1 reply; 18+ messages in thread
From: Maxime Devos @ 2024-07-07 14:59 UTC (permalink / raw)
  To: Eli Zaretskii
  Cc: jean@abou-samra.fr, rlb@defaultvalue.org, guile-devel@gnu.org

[-- Attachment #1: Type: text/plain, Size: 5264 bytes --]

>> >> Guile is a Scheme implementation, bound by Scheme standards and compatibility
>> >> with other Scheme implementations (and backwards compatibility too).
>> >
>> >Yes, I understand that.
>> 
>> Going by what you are saying below, I think you don’t.
>
>Thank you for your vote of confidence.

That was not a vote of confidence, if anything, it’s the contrary.

> I’m pretty sure that they weren’t intending to get the 0xb5 byte. Rather, they were using the equivalent of ‘string-ref’ (i.e., ‘aref’) and demonstrating that the result is bogus in Scheme.  In Scheme, ‘(string-ref ...)’ needs to return a character, and there exists no (Unicode) character with codepoint 4194229, so what Emacs returns here would be bogus for (Guile) Scheme.

>aref in Emacs and string-ref in Guile are not the same, and if Guile
needs to produce a raw byte in this scenario, it can be easily
arranged.  In Emacs we have other goals.

It is the opposite. In Guile, string-ref does not need to produce bytes, but characters – just like aref (modulo difference in how Scheme and Emacs define ‘character’).

>IOW, I think this argument is pointless, since it is easy to adapt the
mechanism to what Guile needs.

No – the argument is about how it is impossible to adapt the mechanism to Guile, since bytes aren’t characters in Unicode.

> >From the Emacs manual:
> 
> >For example, you can access individual characters in a string using the function aref (see Functions that Operate on Arrays).
> 
> Thus, (aref the-string index) is the equivalent of (string-ref the-string index).

>No, because a raw byte is not a character.

Yes, because characters are characters. Both string-ref and aref return characters. This is documented in both the Emacs and Guile manual:

Again, from the Emacs manual:

> A string is a fixed sequence of characters. [...] Since strings are arrays, and therefore sequences as well, you can operate on them with the general array and sequence functions documented in Sequences, Arrays, and Vectors. For example, you can access individual characters in a string using the function aref (see Functions that Operate on Arrays).

Hence, (aref the-string index) returns (Emacs) characters.

Likewise, from the Guile manual:

> Scheme Procedure: string-ref str k
>C Function: scm_string_ref (str, k)
Return character k of str using zero-origin indexing. k must be a valid index of str.

Clearly, these are equivalent (modulo difference in the meaning of ‘characters’).

>If Guile restricts itself to Unicode characters and only them, it will
lack important features.  So my suggestion is not to have this
restriction.

Guile restricting strings to Unicode _is_ an important feature (simplicity, and compatibility).

Guile extending strings beyond Unicode is a _limitation_ (compatibility and other trickiness for applications).

I could imagine that in the far future there might be too few codepoints left in Unicode, in which case the range of what Guile (and more generally, Scheme and Unicode) considers characters would need to be extended (even if that has some compatibility implications), but that time hasn’t arrived yet.

The important feature of this thread is supporting file names (and getenv stuff, etc.) that don’t fit properly in the ‘string’ model. As mentioned earlier (in the initial message, even), there are solutions that do not impose the ‘let characters go beyond Unicode’ limitation.

>I think the fact that this discussion is held, and that Rob suggested
to use Latin-1 for the purpose of supporting raw bytes is a clear
indication that Guile, too, needs to deal with "character-like" data
that does not fit the Unicode framework. 

True, and I never claimed otherwise.

> So I think saying that strings in Guile can only hold Unicode characters will not give you what this discussion attempts to give.

Sure, and I wasn’t trying to. What I (and IIUC, the other person as well) was doing was pointing out that the Emacs approach isn’t a solution either. (Whether because of backwards compatibility, or because of not _wanting_ to conflate bytes with characters (and not wanting to go beyond Unicode), with all the consequences this conflation would imply for applications.)

> In particular, how will you
handle the situations described by Rob where a file has a name that is
not a valid UTF-8 sequence (thus not "characters" as long as you
interpret text as UTF-8)?

Scheme does not interpret text as UTF-8; that’s an internal implementation detail and a matter of things like locales. Instead, to Scheme, text is (Unicode) characters.

I have outlined a solution (that does not conflate characters with bytes) in another response. IIRC, it was in a response to Rob. I would propose actually, you know, reading it. I’m not sure, but IIRC Rob also mentioned another solution (i.e., just accept bytevectors in some locations, or do Latin-1).

Also, this structure makes no sense. Even if I did not provide an alternative solution of my own, that wouldn’t mean Emacs’s thing is the answer. (Negative) criticism can be valid without providing alternatives.

Best regards,
Maxime Devos.

[-- Attachment #2: Type: text/html, Size: 14125 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Improving the handling of system data (env, users, paths, ...)
  2024-07-07 14:25         ` Eli Zaretskii
  2024-07-07 14:59           ` Maxime Devos
@ 2024-07-07 15:16           ` Jean Abou Samra
  2024-07-07 15:18             ` Jean Abou Samra
  2024-07-07 15:58             ` Eli Zaretskii
  1 sibling, 2 replies; 18+ messages in thread
From: Jean Abou Samra @ 2024-07-07 15:16 UTC (permalink / raw)
  To: Eli Zaretskii, Maxime Devos; +Cc: rlb, guile-devel

[-- Attachment #1: Type: text/plain, Size: 3028 bytes --]

Le dimanche 07 juillet 2024 à 17:25 +0300, Eli Zaretskii a écrit :
> 
> If Guile restricts itself to Unicode characters and only them, it will
> lack important features.  So my suggestion is not to have this
> restriction.
> 
> I think the fact that this discussion is held, and that Rob suggested
> to use Latin-1 for the purpose of supporting raw bytes is a clear
> indication that Guile, too, needs to deal with "character-like" data
> that does not fit the Unicode framework.  So I think saying that
> strings in Guile can only hold Unicode characters will not give you
> what this discussion attempts to give.  In particular, how will you
> handle the situations described by Rob where a file has a name that is
> not a valid UTF-8 sequence (thus not "characters" as long as you
> interpret text as UTF-8)?


Whatever the details of aref in Emacs are (which I have not studied),
I think we all agree that

a) Strings in Scheme have the semantics of arrays of something called
   "characters".

b) According to Scheme standards and in current Guile, a character
   is a wrapper around a Unicode scalar value.

   (NB I wasn't precise enough in my previous email. R6RS explicitly
   disallows surrogate code points, so characters really correspond to
   scalar values and not to code points).

c) If we want Guile strings to losslessly represent arbitrary byte
   sequences, Guile's definition of a character needs to be expanded
   to include things other than Unicode scalar values.

So what would it entail for Guile to change its string model in this
way?

First, Guile would become technically not R6RS-compliant. I'm not
sure how much of a problem this would actually be.

There are non-trivial backwards compatibility implications. To give
a concrete case: LilyPond definitely has code that would break if
passed a string whose "conversion to UTF-8" gave something not valid
UTF-8. (An example off the top: passing strings to the Pango API and to
GLib's PCRE-based regex API. By the way, running "emacs $'\xb5'"
gives a Pango warning on the terminal, I assume because of trying
to display the file name as the window title.)

From the implementation point of view: conversion from an encoding to
another could no longer use libiconv, because it stops on invalid
multibyte sequences. Likewise, Guile could probably not use libiconv
anymore. This means a large implementation cost to reimplement all
of this in Guile.

I don't think it's worth it. If anybody's going to work on this problem,
I'd recommend simply adding APIs like program-arguments-bytevector,
getenv-bytevector and the like, returning raw bytevectors instead of strings,
and letting programs which need to be reliable against invalid UTF-8
in the environment use these.
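Usage could then look something like this (getenv-bytevector is the
proposed, not-yet-existing procedure; bytevector->string is the existing
(ice-9 iconv) one, here with its optional conversion strategy set to
'error so that undecodable input signals rather than substitutes --
a sketch, not tested):

    (use-modules (ice-9 iconv))

    (define (getenv-utf8-or-bytes name)
      (let ((bv (getenv-bytevector name)))        ; hypothetical API
        (and bv
             (catch #t
               (lambda () (bytevector->string bv "UTF-8" 'error))
               (lambda _ bv)))))                  ; keep the raw bytes on failure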

That is also the approach taken in, e.g., Rust (except that due to the
static typing, you are forced to handle the "invalid UTF-8" error case
when you use, e.g., std::env::args as opposed to std::env::args_os).


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Improving the handling of system data (env, users, paths, ...)
  2024-07-07 15:16           ` Jean Abou Samra
@ 2024-07-07 15:18             ` Jean Abou Samra
  2024-07-07 15:58             ` Eli Zaretskii
  1 sibling, 0 replies; 18+ messages in thread
From: Jean Abou Samra @ 2024-07-07 15:18 UTC (permalink / raw)
  To: Eli Zaretskii, Maxime Devos; +Cc: rlb, guile-devel

[-- Attachment #1: Type: text/plain, Size: 393 bytes --]

Le dimanche 07 juillet 2024 à 17:16 +0200, Jean Abou Samra a écrit :
> From the implementation point of view: conversion from an encoding to
> another could no longer use libiconv, because it stops on invalid
> multibyte sequences. Likewise, Guile could probably not use libiconv
                                                              ^^^^^^^^

Sorry, I meant libunistring.


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Improving the handling of system data (env, users, paths, ...)
  2024-07-07 14:59           ` Maxime Devos
@ 2024-07-07 15:43             ` Eli Zaretskii
  0 siblings, 0 replies; 18+ messages in thread
From: Eli Zaretskii @ 2024-07-07 15:43 UTC (permalink / raw)
  To: Maxime Devos; +Cc: jean, rlb, guile-devel

> Cc: "jean@abou-samra.fr" <jean@abou-samra.fr>, 
> 	"rlb@defaultvalue.org" <rlb@defaultvalue.org>, 
> 	"guile-devel@gnu.org" <guile-devel@gnu.org>
> From: Maxime Devos <maximedevos@telenet.be>
> Date: Sun, 7 Jul 2024 16:59:10 +0200
> 
> >> >> Guile is a Scheme implementation, bound by Scheme standards and compatibility
> >> >> with other Scheme implementations (and backwards compatibility too).
> >> >
> >> >Yes, I understand that.
> >> 
> >> Going by what you are saying below, I think you don’t.
> >
> >Thank you for your vote of confidence.
> 
> That was not a vote of confidence, if anything, it’s the contrary.

You don't say!

> > I’m pretty sure that they weren’t intending to get the 0xb5 byte. Rather, they were using the equivalent of ‘string-ref’ (i.e., ‘aref’) and demonstrating that the result is bogus in Scheme.  In Scheme, ‘(string-ref ...)’ needs to return a character, and there exists no (Unicode) character with codepoint 4194229, so what Emacs returns here would be bogus for (Guile) Scheme.
> 
> >aref in Emacs and string-ref in Guile are not the same, and if Guile
> needs to produce a raw byte in this scenario, it can be easily
> arranged.  In Emacs we have other goals.
> 
> It is the opposite. In Guile, string-ref does not need to produce bytes, but characters – just like aref (modulo difference in how Scheme and Emacs define ‘byte’).

But a raw byte is not a character.

> >IOW, I think this argument is pointless, since it is easy to adapt the
> mechanism to what Guile needs.
> 
> No – the argument is about how it is impossible to adapt the mechanism to Guile, since bytes aren’t characters in Unicode.

I'm saying that Guile needs to support raw bytes as well, because they
happen in Real Life, including as part of otherwise legible text.

> > >From the Emacs manual:
> > 
> > >For example, you can access individual characters in a string using the function aref (see Functions that Operate on Arrays).
> > 
> > Thus, (aref the-string index) is the equivalent of (string-ref the-string index).
> 
> >No, because a raw byte is not a character.
> 
> Yes, because characters are characters. Both string-ref and aref return characters. This is documented in both the Emacs and Guile manual:
> 
> Again, from the Emacs manual:
> 
> > A string is a fixed sequence of characters. [...] Since strings are arrays, and therefore sequences as well, you can operate on them with the general array and sequence functions documented in Sequences, Arrays, and Vectors. For example, you can access individual characters in a string using the function aref (see Functions that Operate on Arrays).
> 
> Hence, (aref the-string index) returns (Emacs) characters.

You missed the description of raw bytes and unibyte strings, I guess.

> >If Guile restricts itself to Unicode characters and only them, it will
> lack important features.  So my suggestion is not to have this
> restriction.
> 
> Guile restricting strings to Unicode _is_ an important feature (simplicity, and compatibility).
> 
> Guile extending strings beyond Unicode is a _limitation_ (compatibility and other trickiness for applications).
> 
> I could imagine that in the far future there might be too few codepoints left in Unicode, in which case the range of what Guile (and more generally, Scheme and Unicode) considers characters needs to be extended (even if that has some compatibility implications), but that time hasn’t arrived yet.
> 
> The important feature of this thread is supporting file names (and getenv stuff, etc.) that don’t fit properly in the ‘string’ model. As mentioned earlier (in the initial message, even), there are solutions that do not impose the ‘let characters go beyond Unicode’ limitation.
> 
> >I think the fact that this discussion is held, and that Rob suggested
> to use Latin-1 for the purpose of supporting raw bytes is a clear
> indication that Guile, too, needs to deal with "character-like" data
> that does not fit the Unicode framework. 
> 
> True, and I never claimed otherwise.
> 
> > So I think saying that strings in Guile can only hold Unicode characters will not give you what this discussion attempts to give.
> 
> Sure, and I wasn’t trying to. What I (and IIUC, the other person as well) was doing was pointing out that the Emacs approach isn’t a solution either. (Whether because of backwards compatibility, or because of not _wanting_ to conflate bytes with characters (and not wanting to go beyond Unicode), with all the consequences this conflation would imply for applications.)
> 
> > In particular, how will you
> handle the situations described by Rob where a file has a name that is
> not a valid UTF-8 sequence (thus not "characters" as long as you
> interpret text as UTF-8)?
> 
> Scheme does not interpret text as UTF-8, that’s an internal implementation detail and a matter of things like locales. Instead, to Scheme text is (Unicode) characters.
> 
> I have outlined a solution (that does not conflate characters with bytes) in another response. IIRC, it was in a response to Rob. I would propose actually, you know, reading it. I’m not sure, but IIRC Rob also mentioned another solution (i.e., just accept bytevectors in some locations, or do Latin-1).
> 
> Also, this structure makes no sense. Even if I did not provide an alternative solution of my own, that wouldn’t mean Emacs’s thing is the answer. (Negative) criticism can be valid without providing alternatives.

That's fine by me.  I described what we have done in Emacs because I
think it works and works well.  For many years.  So I thought
describing it will be useful to Guile and will allow you to consider
if something like that could solve your problems, which I think are
very similar if not identical.  It is up to you whether to reject that
solution without trying to adapt it to Guile, and in that case I wish
you all the luck in finding your own solutions.



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Improving the handling of system data (env, users, paths, ...)
  2024-07-07 15:16           ` Jean Abou Samra
  2024-07-07 15:18             ` Jean Abou Samra
@ 2024-07-07 15:58             ` Eli Zaretskii
  2024-07-07 16:09               ` Jean Abou Samra
  2024-07-07 16:56               ` Mike Gran
  1 sibling, 2 replies; 18+ messages in thread
From: Eli Zaretskii @ 2024-07-07 15:58 UTC (permalink / raw)
  To: Jean Abou Samra; +Cc: maximedevos, rlb, guile-devel

> From: Jean Abou Samra <jean@abou-samra.fr>
> Cc: rlb@defaultvalue.org, guile-devel@gnu.org
> Date: Sun, 07 Jul 2024 17:16:31 +0200
> 
> There are non-trivial backwards compatibility implications. To give
> a concrete case: LilyPond definitely has code that would break if
> passed a string whose "conversion to UTF-8" gave something not valid
> UTF-8. (An example off the top: passing strings to the Pango API and to
> GLib's PCRE-based regex API. By the way, running "emacs $'\xb5'"
> gives a Pango warning on the terminal, I assume because of trying
> to display the file name as the window title.)

Probably.  Do you consider it a problem in Emacs or in Pango?

> From the implementation point of view: conversion from an encoding to
> another could no longer use libiconv, because it stops on invalid
> multibyte sequences. Likewise, Guile could probably not use libiconv
> anymore. This means a large implementation cost to reimplement all
> of this in Guile.

Or relatively small additions to libiconv, should their developers
agree with such an extension.

> I don't think it's worth it. If anybody's going to work on this problem,
> I'd recommend simply adding APIs like program-arguments-bytevector,
> getenv-bytevector and the like, returning raw bytevectors instead of strings,
> and letting programs which need to be reliable against invalid UTF-8
> in the environment use these.
> 
> That is also the approach taken in, e.g., Rust (except that due to the
> static typing, you are forced to handle the "invalid UTF-8" error case
> when you use, e.g., std::env::args as opposed to std::env::args_os).

The Emacs experience shows that (rare) raw bytes as part of otherwise
completely valid text are a fact of life.  They happen all the time,
for whatever reasons.  Granted, those reasons are most probably
something misconfigured somewhere, but as long as that happens in a
program other than the one you are developing, or even on another
computer, the ability of the user, let alone the programmer, to fix
the whole world is, how shall I put it, somewhat limited.  The
question is what you do when this stuff happens, and how you prepare
your package to deal with it as well as reasonably possible?

Here's an example just from today: I've received an email from RMS, no
less, with obviously garbled address:

  To: BjÃ¶rn Bidar <bjorn.bidar@thaodan.de>

Now, this is a typical case of misinterpreting UTF-8 as Latin-1 (on
RMS's machine, not on mine); the correct name is Björn Bidar.  But
when you get such mojibake from your MTA, what do you do?  Signal an
error and refuse to show the message?  Good luck explaining to your
users that you are right to behave like that!  We in Emacs decided
differently, but that's us.
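
(For what it's worth, this kind of mojibake is trivial to reproduce,
e.g. with Guile's (ice-9 iconv): encode the text to UTF-8 bytes, then
decode those bytes as if they were Latin-1.)

    (use-modules (ice-9 iconv))

    ;; UTF-8 bytes of "Björn", misread as Latin-1:
    (bytevector->string (string->bytevector "Björn" "UTF-8")
                        "ISO-8859-1")
    ;; => "BjÃ¶rn"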

Once again, I described what we do in Emacs in the hope that it will
help you find your own solution.  If it doesn't help, that's fine by
me; there's no need to argue as long as what we do is understood.



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Improving the handling of system data (env, users, paths, ...)
  2024-07-07 15:58             ` Eli Zaretskii
@ 2024-07-07 16:09               ` Jean Abou Samra
  2024-07-07 16:56               ` Mike Gran
  1 sibling, 0 replies; 18+ messages in thread
From: Jean Abou Samra @ 2024-07-07 16:09 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: maximedevos, rlb, guile-devel

[-- Attachment #1: Type: text/plain, Size: 1094 bytes --]

On Sunday, July 7, 2024 at 18:58 +0300, Eli Zaretskii wrote:
> Probably.  Do you consider it a problem in Emacs or in Pango?


The Pango documentation explicitly states that Pango validates the input
and "renders invalid UTF-8 with a placeholder glyph".

  https://docs.gtk.org/Pango/method.Layout.set_text.html

Should it really emit a warning, in which case Emacs should do the
replacement itself to avoid the warning? Or should it just render the
text, with the placeholder glyph, without warning? I don't really
know.



> 
> [...]
> Once again, I described what we do in Emacs in the hope that it will
> help you find your own solution.  If it doesn't help, that's fine by
> me; there's no need to argue as long as what we do is understood.


Thank you for your insight; I think it's valuable, even though I don't
think this solution is the best for the particular tradeoffs of Guile
(but I agree that the problem is significant and the tradeoffs are not
easy).

BTW, let me make it clear that I'm not a Guile maintainer, just a lurker
like you.


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Improving the handling of system data (env, users, paths, ...)
  2024-07-07 15:58             ` Eli Zaretskii
  2024-07-07 16:09               ` Jean Abou Samra
@ 2024-07-07 16:56               ` Mike Gran
  1 sibling, 0 replies; 18+ messages in thread
From: Mike Gran @ 2024-07-07 16:56 UTC (permalink / raw)
  To: Jean Abou Samra, Eli Zaretskii
  Cc: maximedevos@telenet.be, rlb@defaultvalue.org, guile-devel@gnu.org

On Sunday, July 7, 2024 at 08:58:34 AM PDT, Eli Zaretskii <eliz@gnu.org> wrote: 
>> I don't think it's worth it. If anybody's going to work on this problem,
>> I'd recommend simply adding APIs like program-arguments-bytevector,
>> getenv-bytevector and the like, returning raw bytevectors instead of strings,
>> and letting programs which need to be reliable against invalid UTF-8
>> in the environment use these.
> 
>> That is also the approach taken in, e.g., Rust (except that due to the
>> static typing, you are forced to handle the "invalid UTF-8" error case
>> when you use, e.g., std::env::args as opposed to std::env::args_os).

> The Emacs experience shows that (rare) raw bytes as part of otherwise
> completely valid text are a fact of life.  They happen all the time,
> for whatever reasons.  Granted, those reasons are most probably
> something misconfigured somewhere, but as long as that happens in a
> program other than the one you are developing, or even on another
> computer, the ability of the user, let alone the programmer, to fix
> the whole world is, how shall I put it, somewhat limited.  The
> question is what you do when this stuff happens, and how you prepare
> your package to deal with it as well as reasonably possible?

To halfway follow Emacs's lead, Guile could use some of Unicode's
Private Use Area characters to represent raw bytes.

Raw bytes 0x00 to 0xFF could map to U+100000 to U+1000FF, for example.

We could add an encoding option such that, when converting from the
locale to the internal Guile representation fails, the offending raw
bytes are transcoded this way for storage.  When such a string is
later output as a locale string, those characters are written back
out as raw bytes; when it is output as UTF-8, they can either remain
PUA characters or be converted to the U+FFFD Replacement Character.

It would create corner cases: what if you use more than one non-UTF-8
locale, or what if you actually wanted to use SMP PUA characters...
And it would not be memory efficient; however, it would be simple
enough to do given Guile's internals.
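
A rough sketch of that byte <-> PUA mapping, with made-up names (a
real version would only escape the bytes that the locale conversion
actually rejects, rather than everything non-ASCII):

    (use-modules (rnrs bytevectors))

    (define pua-base #x100000)  ; U+100000 .. U+1000FF for bytes 0..255

    (define (byte->escape-char b)
      (integer->char (+ pua-base b)))

    (define (escape-char->byte c)
      (let ((cp (char->integer c)))
        (and (<= pua-base cp (+ pua-base #xff))
             (- cp pua-base))))

    ;; Naive decoder: keep ASCII as-is, escape everything else.
    (define (locale-bytes->string bv)
      (list->string
       (map (lambda (b)
              (if (< b #x80) (integer->char b) (byte->escape-char b)))
            (bytevector->u8-list bv))))

    ;; Inverse: turn escape characters back into the original bytes.
    ;; (Assumes any non-escape characters are ASCII.)
    (define (string->locale-bytes s)
      (u8-list->bytevector
       (map (lambda (c)
              (or (escape-char->byte c) (char->integer c)))
            (string->list s))))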

Regards,
Mike



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Improving the handling of system data (env, users, paths, ...)
  2024-07-07  9:45 ` Jean Abou Samra
@ 2024-07-07 19:25   ` Rob Browning
  0 siblings, 0 replies; 18+ messages in thread
From: Rob Browning @ 2024-07-07 19:25 UTC (permalink / raw)
  To: Jean Abou Samra, guile-devel

Jean Abou Samra <jean@abou-samra.fr> writes:

> latin1 locale is a terrible default. Virtually no Linux system these days
> has a locale encoding different than UTF-8. Except perhaps for the "C" locale,
> which people still use by habit with "LC_ALL=C" as a way to say "speak English
> please", although most Linux distros have a C.UTF-8 locale these days.

Given this thread, it might have been good if I'd included a few other
bits of context in my original post.

  - Personally, as someone who spends a lot of time on a tool that's
    more like tar/cp/rsync/etc. (and I suspect this sentiment applies
    to anyone doing something similar), I'd be happier without "help",
    i.e. at a minimum, I'd prefer solid bytevector support, and then
    I'll handle any conversions when needed.

    But I was trying to propose something incremental that comports
    with previous (off-list) discussions, i.e. something that might be
    acceptable in the near to medium term.

    In truth, for system tools, I have no interest in "strings" most of
    the time, and would rather not pay anything for them (imagine
    regularly processing a few hundred million filesystem paths).  And
    if I *do* care (say for regular-expression based exclusions), then
    it's "OK, first you have to tell us where the paths came from",
    i.e. we have no way of knowing what the encodings are other than
    guessing.

    That said, I'd be more than happy to have *help*, e.g. bytevector
    variants of various srfi-13/srfi-14 functions, and/or (as I think
    suggested elsewhere in the thread) maybe even some hybrid type with
    additional conveniences (if that were to make sense).

    Further, you could imagine having more specific types like the
    "path" type many languages have, depending on what your
    cross-platform goals are, since paths aren't "just bytes"
    everywhere, something which even varies in Linux per-filesystem type
    -- but I didn't consider any of that "in scope" for now.

  - Using Latin-1 is, of course, a hack, a pragmatic hack, but a hack
    (it wasn't even my suggestion originally).  Choosing it "for now"
    would just be taking advantage of the fact that it's likely to
    pass through without corruption, while still allowing easier
    manipulation via the existing string APIs for some common,
    important cases, i.e. where you can still get the job done while
    only referring to the ASCII bits (split/join on "/", for example).
    But no, it's not ideal.

    It's also intended to avoid having to decide, or do, anything
    further (in the short term) regarding the *many* existing relevant
    system calls.  You can just call them as-is with a temporarily
    adjusted locale.

  - I have no idea where Guile might eventually end up, but given
    current resources, it seemed likely that what's potentially in scope
    for now is "incremental".

I'll also say that the broader discussion is interesting, and I do like
to better understand how other systems work.

> On Saturday, July 6, 2024 at 15:32 -0500, Rob Browning wrote:
>
>> The most direct (and compact, if we do convert to UTF-8) representation
>> would be bytevectors, but then you would have a much more limited set of
>> operations available (i.e. strings have all of srfi-13, srfi-14, etc.)
>> unless we expanded them (likely re-using the existing code paths).  Of
>> course you could still convert to Latin-1, perform the operation, and
>> convert back, but that's not ideal.

> Why is that "not ideal"? The (ice-9 iconv) API is convenient, locale-independent
> and thread-safe.

I meant that round-tripping through Latin-1 every time you want to
call, say, string-split on "/" isn't ideal compared to a
bytevector-friendly splitter.  And if we do switch to UTF-8
internally, it'll also require copying/converting the bytes, since
non-ASCII bytes become multibyte.
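
To make the comparison concrete, here's a sketch of the two shapes;
bytevector-split and subbytevector are made-up names here, not
existing Guile procedures:

    (use-modules (ice-9 iconv)
                 (rnrs bytevectors))

    ;; Round-trip variant: decode as Latin-1, split, re-encode.
    (define (split-path/latin1 bv)
      (map (lambda (s) (string->bytevector s "ISO-8859-1"))
           (string-split (bytevector->string bv "ISO-8859-1") #\/)))

    ;; Hypothetical direct variant: no decoding at all.
    (define (subbytevector bv start end)
      (let ((out (make-bytevector (- end start))))
        (bytevector-copy! bv start out 0 (- end start))
        out))

    (define (bytevector-split bv byte)
      (let loop ((i 0) (start 0) (acc '()))
        (cond ((= i (bytevector-length bv))
               (reverse (cons (subbytevector bv start i) acc)))
              ((= (bytevector-u8-ref bv i) byte)
               (loop (+ i 1) (+ i 1)
                     (cons (subbytevector bv start i) acc)))
              (else (loop (+ i 1) start acc)))))

    ;; (bytevector-split (string->bytevector "a/µb" "ISO-8859-1") #x2f)
    ;; => (#vu8(97) #vu8(181 98))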

(Given the UTF-8 work, I've also speculated that we could probably
 re-use many, if not all, of the optimized "ASCII paths" that I've
 included in the various functions there (srfi-13, srfi-14, etc.) to
 implement bytevector-friendly variants without much additional
 work.)

Thanks
-- 
Rob Browning
rlb @defaultvalue.org and @debian.org
GPG as of 2011-07-10 E6A9 DA3C C9FD 1FF8 C676 D2C4 C0F0 39E9 ED1B 597A
GPG as of 2002-11-03 14DD 432F AE39 534D B592 F9A0 25C8 D377 8C7E 73A4



^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: Improving the handling of system data (env, users, paths, ...)
  2024-07-07 10:24 ` Maxime Devos
@ 2024-07-07 19:40   ` Rob Browning
  0 siblings, 0 replies; 18+ messages in thread
From: Rob Browning @ 2024-07-07 19:40 UTC (permalink / raw)
  To: Maxime Devos, guile-devel@gnu.org

Maxime Devos <maximedevos@telenet.be> writes:

> I’d rather not. It’s rather stateful and hence non-trivial to compose.
> Also, locale is not only about the encoding of text [file name/env
> encodings/xattr/...], but also about language. Also setting the
> language is excessive in this case.

The proposal would be that you'd only change the "CTYPE" to Latin-1;
it's strictly for the purpose of getting *bytes*, since Latin-1 will
do that with no possibility of crashing on unencodable data.

And of course there's no way of knowing what the *real* encoding is
without out-of-band information.  That's true for getenv, and also
true for, say, every call to get a user or group name from the system.
Each user name *could* (but won't, outside generative testing, you'd
hope) have a different encoding.

> This, OTOH, seems a bit better – ‘with-locale’ is like ‘parameterize’
> and hence pretty composable.  However, it still suffers from the
> problem that it sets too much (also, there is no such thing as the
> “iso-8859-1” locale?).

Oh, I was just writing pseudo-code, and right, you'd only want to
change the CTYPE for the current purposes, and I'd expect whatever we
end up with to make that easy/efficient/safe to do.
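
For illustration only (call-with-ctype is a made-up name, the locale
name depends on what's installed, and plain setlocale is
process-global rather than thread-safe, which is exactly why
uselocale() keeps coming up):

    ;; Non-thread-safe sketch using the existing setlocale; a real
    ;; mechanism would presumably use something like uselocale().
    (define (call-with-ctype ctype thunk)
      (let ((old (setlocale LC_CTYPE)))
        (dynamic-wind
          (lambda () (setlocale LC_CTYPE ctype))
          thunk
          (lambda () (setlocale LC_CTYPE old)))))

    ;; Fetch an environment variable as Latin-1-decoded bytes; the
    ;; locale name here is illustrative.
    (define (environment name)
      (call-with-ctype "en_US.ISO-8859-1"
                       (lambda () (getenv name))))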

> IIRC, in ISO-8859-1 there is a direct correspondence between bytes and characters
> (and Guile recognises this), so there is no cost beyond mere copying.

While it may change, I believe the current plan is to switch Guile to
UTF-8 internally, which is why I've been including that in
considerations.

> Here is an alternative solution:

Right, there are a lot of options if we're in the market for a "broader"
solution, but my impression was that we aren't right now (see my other
followup message).

Thanks
-- 
Rob Browning
rlb @defaultvalue.org and @debian.org
GPG as of 2011-07-10 E6A9 DA3C C9FD 1FF8 C676 D2C4 C0F0 39E9 ED1B 597A
GPG as of 2002-11-03 14DD 432F AE39 534D B592 F9A0 25C8 D377 8C7E 73A4



^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2024-07-07 19:40 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-07-06 20:32 Improving the handling of system data (env, users, paths, ...) Rob Browning
2024-07-07  4:59 ` tomas
2024-07-07  5:33 ` Eli Zaretskii
2024-07-07 10:03   ` Jean Abou Samra
2024-07-07 11:04     ` Eli Zaretskii
2024-07-07 11:35       ` Maxime Devos
2024-07-07 14:25         ` Eli Zaretskii
2024-07-07 14:59           ` Maxime Devos
2024-07-07 15:43             ` Eli Zaretskii
2024-07-07 15:16           ` Jean Abou Samra
2024-07-07 15:18             ` Jean Abou Samra
2024-07-07 15:58             ` Eli Zaretskii
2024-07-07 16:09               ` Jean Abou Samra
2024-07-07 16:56               ` Mike Gran
2024-07-07  9:45 ` Jean Abou Samra
2024-07-07 19:25   ` Rob Browning
2024-07-07 10:24 ` Maxime Devos
2024-07-07 19:40   ` Rob Browning

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).