* Problem with UTF-8, "write, " and some characters using initial locale
From: Taylor Venable @ 2010-11-20 15:58 UTC
To: Guile Development List
Hi there, I'm having a strange problem using "write" in recent Git
versions. When I include certain characters in a string passed to
write, it prints odd hex representations of the Latin-1 encodings of
those characters: "odd" because the result is not valid UTF-8 even
though I believe my environment indicates it should be outputting in
UTF-8. I've put an example interaction on my website:
[http://metasyntax.net/tmp/guile.txt] (opening it in a hex editor is
helpful to check that the characters which are correct are properly
UTF-8 encoded). After I queried my locale environment using (setlocale
LC_ALL), everything was written properly. From the documentation
it seemed to me that this call should not have a side effect unless the
"locale" argument was provided. I'm using Guile 1.9.13.91-b7106 on
Linux x86_64. It seemed to me like it had bug potential, but maybe my
understanding of locales and encodings is flawed. Please let me know
if there's any other information I can provide or things I can test.
Best regards,
--
Taylor C. Venable
http://metasyntax.net/
* Re: Problem with UTF-8, "write, " and some characters using initial locale
From: Mike Gran @ 2010-11-20 17:53 UTC
To: Taylor Venable, Guile Development List
> From: Taylor Venable <taylor@metasyntax.net>
> Hi there, I'm having a strange problem using "write" in recent Git
> versions. When I include certain characters in a string passed to
> write, it prints odd hex representations of the Latin-1 encodings of
> those characters: "odd" because the result is not valid UTF-8 even
> though I believe my environment indicates it should be outputting in
> UTF-8. I've put an example interaction on my website:
> [http://metasyntax.net/tmp/guile.txt] (opening it in a hex editor is
> helpful to check that the characters which are correct are properly
> UTF-8 encoded). After I queried my locale environment using (setlocale
> LC_ALL), everything was written properly. From the documentation
> it seemed to me that this call should not have a side effect unless the
> "locale" argument was provided. I'm using Guile 1.9.13.91-b7106 on
> Linux x86_64. It seemed to me like it had bug potential, but maybe my
> understanding of locales and encodings is flawed. Please let me know
> if there's any other information I can provide or things I can test.
>
> Best regards,
Hi Taylor,
You should basically always call (setlocale LC_ALL "") before
working on non-ASCII code.
Guile starts up in Latin-1. It may seem that Guile should
pick up your environment's LANG or LC_* settings on startup, but
C programs (including those built with gcc) start in the "C" locale
and don't do that by default either.
When you call setlocale, Guile picks up the locale of your session.
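As a minimal sketch of that advice, a script might install the session
locale before it touches any non-ASCII text. The name "sample" is only
for illustration, and the character is written as a \u escape so the
example does not depend on how the source file is encoded. What gets
printed depends on your locale and terminal, so no output is shown.

```scheme
;; Install the locale named by the environment (LANG, LC_ALL, ...).
;; Passing the empty string asks setlocale to use the session's settings.
(setlocale LC_ALL "")

;; U+0110, LATIN CAPITAL LETTER D WITH STROKE, written as an escape.
(define sample "\u0110")

;; With a UTF-8 locale installed, write should emit the character itself.
(write sample)
(newline)
```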
So, in the first line of your example, you pasted in a string of
UTF-8 text. Guile read it in as raw bytes, treating each byte as one
Latin-1 character, and never tried to unpack those bytes into Unicode
characters. You can prove it to yourself by passing your string to the
string-length procedure. You'll get the number of UTF-8 bytes in your
string, not the actual number of characters.
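To make the character/byte distinction concrete, here is a small sketch
using the (rnrs bytevectors) module. The names d-stroke, char-count and
byte-count are made up for the example. A correctly decoded stroked D is
one character, while its UTF-8 encoding is two bytes, so if string-length
reports 2 for a pasted "Đ", the input was not decoded as UTF-8.

```scheme
(use-modules (rnrs bytevectors))

;; U+0110 as an escape, so the result does not depend on how the
;; source file or terminal input is decoded.
(define d-stroke "\u0110")

;; Number of characters in the string.
(define char-count (string-length d-stroke))

;; Number of bytes in its UTF-8 encoding.
(define byte-count (bytevector-length (string->utf8 d-stroke)))

;; Prints: characters: 1, UTF-8 bytes: 2
(format #t "characters: ~a, UTF-8 bytes: ~a~%" char-count byte-count)
```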
The weird escapes come from trying to write a string of utf-8 bytes
in the latin-1 encoding. The latin-1 characters from 0x80 to 0x9F
are the ISO-8859-1 C1 control characters and not printable.
So, (write) prints them as escapes instead.
For example, the stroked D (U+0110)
- is passed in to Guile as the UTF-8 representation 0xC4 0x90
- Guile knows that ISO-8859-1 0x90 is an unprintable control character
- Guile prints the 0xC4 as the ISO-8859-1 character A with umlaut and
  prints 0x90 as the escape string "\x90"
- Your terminal sees a 0xC4 byte that is not followed by a continuation
  byte, which is invalid under UTF-8, and probably prints a question mark
- and then your terminal prints the "\x90" string
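The steps above can be reproduced roughly as follows. This is a sketch
rather than a trace of what the REPL does internally: it assumes the
(ice-9 iconv) module is available (it ships with current Guile releases,
but may be missing from a 1.9.x snapshot), and the names utf8-bytes,
misread and code-points are hypothetical.

```scheme
(use-modules (rnrs bytevectors) (ice-9 iconv))

;; The two bytes a UTF-8 terminal sends for U+0110.
(define utf8-bytes (string->utf8 "\u0110"))

;; Decode those bytes as ISO-8859-1: each byte becomes one character.
(define misread (bytevector->string utf8-bytes "ISO-8859-1"))

;; Collect the code point of each resulting character.
(define code-points (map char->integer (string->list misread)))

(format #t "bytes: ~s~%" (bytevector->u8-list utf8-bytes))
(format #t "code points after Latin-1 decoding: ~s~%" code-points)
```

Both lists come out as (196 144), that is 0xC4 and 0x90: the first is
A with umlaut, the second is a C1 control character, which is the one
write falls back to an escape for.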
So, counterintuitive, but, not a bug.
Thanks,
Mike Gran
* Re: Problem with UTF-8, "write, " and some characters using initial locale
From: Taylor Venable @ 2010-11-20 20:34 UTC
To: Mike Gran; +Cc: Guile Development List
On Sat, Nov 20, 2010 at 12:53 PM, Mike Gran <spk121@yahoo.com> wrote:
> You should basically always call (setlocale LC_ALL "") before
> working on non-ASCII code.
>
> Guile starts up in Latin-1. It may seem that Guile should
> pick up your environment's LANG or LOCALE on startup, but, most
> compilers (including gcc) don't do that by default.
>
> When you call setlocale, Guile picks up the locale of your session.
>
> So, in your first line in your example, you pasted in a string of
> utf-8 text. Guile read it in raw bytes and never tried to unpack
> those bytes into Unicode characters. You can prove it to yourself by
> passing your string to the string-length procedure. You'll get
> the length of the utf-8 bytes of your string, not the actual number
> of characters.
>
> The weird escapes come from trying to write a string of utf-8 bytes
> in the latin-1 encoding. The latin-1 characters from 0x80 to 0x9F
> are the ISO-8859-1 C1 control characters and not printable.
> So, (write) prints them as escapes instead.
Oh, I see now! I hadn't thought of it like that, thanks for the explanation.
--
Taylor C. Venable
http://metasyntax.net/