unofficial mirror of bug-guix@gnu.org 
From: Maxime Devos <maximedevos@telenet.be>
To: Attila Lendvai <attila@lendvai.name>
Cc: 54893@debbugs.gnu.org
Subject: bug#54893: guix-daemon, locale, LANG, and unicode in git tag names
Date: Wed, 13 Apr 2022 10:22:30 +0200
Message-ID: <d7cf39802973624ab080d53efc4c84a33c397707.camel@telenet.be>
In-Reply-To: <4sSjKaCcadx8brYQC5HZuP-SyMku3BlXRTZwaCUH13qSv01N33lk9vyUWzzE6R889ZuQRpI_6Pl4Q_51v8jMhUhwh6f9rly5h0EhlUqHG80=@lendvai.name>

Attila Lendvai wrote on Wed 13-04-2022 at 07:51 [+0000]:
> i'm not sure why the wrong locale breaks file-system walking and deleting, though.
> 
> i assume if every function in guile uses/assumes the same locale (character
> encoding), then both directions through the guile FFI should be idempotent, no?
> and i think both ASCII and UTF-8 are idempotent wrt C bytes <-> scheme string
> conversions.

The problem is that the default character encoding is ANSI_X3.4-1968
(US-ASCII), and any byte above 127 is not valid ASCII.

Also, the string procedures internally always use UTF-8 (or possibly
ISO-8859-1 as an optimisation?).  Strings are not raw bytes; instead
they can be considered a vector of characters (string-ref returns
characters, not bytes, and doesn't use byte positions).
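
To illustrate the point about characters versus bytes, here is a minimal
sketch, assuming a recent Guile (2.2 or 3.0).  The variable names are
only for illustration.  The string is built from a code point so the
result does not depend on the encoding of the terminal or source file.

```scheme
;; Sketch: a Guile string is indexed by character, not by byte.
(use-modules (rnrs bytevectors))

;; "é" is code point #xE9 (233).
(define e-acute (string (integer->char #xE9)))

;; Number of characters in the string.
(define char-count (string-length e-acute))

;; Number of bytes once the string is encoded as UTF-8.
(define utf8-byte-count (bytevector-length (string->utf8 e-acute)))

;; string-ref hands back a character object, not a byte.
(define first-char (string-ref e-acute 0))

(display (list char-count utf8-byte-count (char->integer first-char)))
(newline)
;; prints (1 2 233)
```

One character, two bytes in UTF-8: the byte count only appears once an
encoding is chosen.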

>  IOW, it's only the displaying of the chars that should be broken,
>  not file operations.

LANG=bogus guile
(guile-user)> (setlocale LC_ALL)
(guile-user)> (use-modules (ice-9 i18n))
(guile-user)> (locale-encoding)
$2 = "ANSI_X3.4-1968"

Apparently the fallback encoding is ‘ANSI_X3.4-1968’.  Let's take a
look at this encoding.  According to IANA
(https://www.iana.org/assignments/character-sets/character-sets.xhtml),
this character encoding can also be named ‘US-ASCII’ and is specified
in RFC 2046.  An excerpt:

   "US-ASCII" does not indicate an arbitrary 7-bit
   character set[sic], but specifies that all octets in the body must
   be interpreted as characters according to the US-ASCII character
   set.

So it looks like, say, é cannot be encoded as US-ASCII, since it does
not belong to the character set of the encoding.  More generally,
anything beyond Unicode code point 127 cannot be encoded in
ANSI_X3.4-1968.
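
The code point rule can be written down as a small predicate.  This is
only a sketch: ‘ascii-only?’ is a hypothetical helper name, not
something Guile or Guix provides.

```scheme
;; Hypothetical helper: #t when every character of STR has a code point
;; below 128, i.e. when STR can be encoded as US-ASCII without loss.
(define (ascii-only? str)
  (string-every (lambda (c) (< (char->integer c) 128)) str))

(define plain-ok (ascii-only? "guix"))
;; Built from a code point to stay independent of the source encoding.
(define accented-ok (ascii-only? (string (integer->char #xE9))))

(display (list plain-ok accented-ok))
(newline)
;; prints (#t #f)
```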

Let's test this (in a new REPL with a UTF-8 locale):

((@ (ice-9 iconv) string->bytevector) "é" "ANSI_X3.4-1968")
ice-9/boot-9.scm:1669:16: In procedure raise-exception:
Throw to key `encoding-error' with args `("put-char" "conversion to port encoding failed" 84 #<output: string 7fd5bbc23ee0> #\é)'.

((@ (ice-9 iconv) string->bytevector) "é" "ANSI_X3.4-1968" 'substitute)
$2 = #vu8(63)
((@ (rnrs bytevectors) utf8->string) #vu8(63))
$3 = "?"

and the other direction:

((@ (ice-9 iconv) bytevector->string) #vu8(128) "ANSI_X3.4-1968" 'substitute)
$5 = "�" ;; why #\� and not #\?? I don't know, I guess Guile is inconsistent

(FWIW, I would throw a decoding-error here instead of silently corrupting the
file names.)
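
For comparison, a strict variant can be sketched with the 'error
conversion strategy of (ice-9 iconv) in place of 'substitute.
‘decode-strictly’ is a hypothetical helper, and returning #f on failure
is just one possible design; letting the exception propagate would be
another.

```scheme
(use-modules (ice-9 iconv))

;; Hypothetical helper: decode BV according to ENCODING.  Return the
;; string, or #f when the conversion raises an exception, so that no
;; substituted characters ever end up in the result.
(define (decode-strictly bv encoding)
  (catch #t
    (lambda () (bytevector->string bv encoding 'error))
    (lambda (key . args) #f)))

;; Bytes 104 and 105 are valid ASCII ("hi"); byte 128 is not.
(define valid-result (decode-strictly #vu8(104 105) "ANSI_X3.4-1968"))
(define invalid-result (decode-strictly #vu8(128) "ANSI_X3.4-1968"))
```

A caller can then refuse to touch a file name whose decoding failed,
rather than operate on a name that no longer matches what is on disk.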

Greetings,
Maxime.
