* UTF-8 regression in guile 1.9.5
@ 2009-12-06 18:43 Linas Vepstas
2009-12-06 19:16 ` Mike Gran
0 siblings, 1 reply; 16+ messages in thread
From: Linas Vepstas @ 2009-12-06 18:43 UTC (permalink / raw)
To: Guile Development, bug-guile
Hi,
I seem to see either a regression in guile-1.9.5 with regard
to UTF-8 strings, or at least some sort of incompatible change.
In guile-1.8.6, I am able to do the following:
SCM new_node (SCM sname)
{
  char * cname = scm_to_locale_string(sname);
  printf ("The name is %s\n", cname);
  free (cname);
  return SCM_EOL;
}
scm_c_define_gsubr("new-node", 1, 0, 0, new_node);
Then, from the guile prompt, I can evaluate the following:
(new-node "てみました。")
and get the output "The name is てみました。"
However, in guile-1.9.5, the above gives me:
"The name is ã¦ã¿ã¾ããã"
Now, it is very possible that I've forgotten to say
(use-modules some-new-utf8-module)
but I am unclear on what that module is (and why it's not
specified by default).
In both cases, my shell has: LANG=en_US.UTF-8
--linas
* Re: UTF-8 regression in guile 1.9.5
2009-12-06 18:43 UTF-8 regression in guile 1.9.5 Linas Vepstas
@ 2009-12-06 19:16 ` Mike Gran
2009-12-06 19:33 ` Linas Vepstas
0 siblings, 1 reply; 16+ messages in thread
From: Mike Gran @ 2009-12-06 19:16 UTC (permalink / raw)
To: linasvepstas, Guile Development, bug-guile
> From: Linas Vepstas <linasvepstas@gmail.com>
> Then, from the guile prompt, I can evaluate the following:
>
> (new-node "てみました。")
>
> and get the output "The name is てみました。"
>
>
> However, in guile-1.9.5, the above gives me:
>
> "The name is ã¦ã¿ã¾ããã"
Hmm. The "ã" is a dead giveaway that you are printing a UTF-8 string
that is being interpreted as an ISO-8859-1 string.
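
For anyone who wants to see the effect in isolation, here is a minimal C
sketch of what happens when UTF-8 bytes are read as ISO-8859-1 and then
written back out as UTF-8. The helper name latin1_bytes_to_utf8 is made up
for this example; it is not part of Guile and only mimics the misreading.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a Guile API): treat each input byte as a
   Latin-1 code point (U+0000..U+00FF) and re-encode it as UTF-8.
   Returns the number of bytes written, not counting the trailing NUL,
   or (size_t)-1 if the output buffer is too small. */
size_t latin1_bytes_to_utf8(const unsigned char *in, size_t in_len,
                            unsigned char *out, size_t out_size)
{
    assert(in != NULL && out != NULL);

    size_t o = 0;
    for (size_t i = 0; i < in_len; i++) {
        unsigned char b = in[i];
        if (b < 0x80) {
            /* ASCII is identical in Latin-1 and UTF-8. */
            if (o + 1 >= out_size)
                return (size_t)-1;
            out[o++] = b;
        } else {
            /* U+0080..U+00FF need two bytes in UTF-8. */
            if (o + 2 >= out_size)
                return (size_t)-1;
            out[o++] = (unsigned char)(0xC0 | (b >> 6));
            out[o++] = (unsigned char)(0x80 | (b & 0x3F));
        }
    }
    if (o >= out_size)
        return (size_t)-1;
    out[o] = '\0';
    return o;
}
```

Feeding it the three UTF-8 bytes of the first character in the report
(E3 81 A6) yields the six bytes C3 A3 C2 81 C2 A6: an "ã", an invisible
control character, and a "¦", which is how the garbled output starts.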
You've already said that you're in a UTF-8 locale. It could be that you
need to call (setlocale LC_ALL "") from the command line before entering
(new-node "てみました。") as well as having a setlocale call in your program.
Thanks,
Mike Gran
* Re: UTF-8 regression in guile 1.9.5
2009-12-06 19:16 ` Mike Gran
@ 2009-12-06 19:33 ` Linas Vepstas
2009-12-06 20:40 ` Mike Gran
0 siblings, 1 reply; 16+ messages in thread
From: Linas Vepstas @ 2009-12-06 19:33 UTC (permalink / raw)
To: Mike Gran; +Cc: bug-guile, Guile Development
2009/12/6 Mike Gran <spk121@yahoo.com>:
>> From: Linas Vepstas <linasvepstas@gmail.com>
>
>
>> Then, from the guile prompt, I can evaluate the following:
>>
>> (new-node "てみました。")
>>
>> and get the output "The name is てみました。"
>>
>>
>> However, in guile-1.9.5, the above gives me:
>>
>> "The name is ã¦ã¿ã¾ããã"
>
> Hmm. The "ã" is a dead giveaway that you are printing a UTF-8 string
> that is being interpreted as an ISO-8859-1 string.
>
> You've already said that you're in a UTF-8 locale. It could be that you
> need to call (setlocale LC_ALL "")
That cured it.
> as well as having a setlocale call in your program.
Doesn't seem to be required, after the above.
Thanks!
Why this happened is strange; I'm now investigating. Sorry to
have bothered you with something that is dohh .. basic.
--linas
* Re: UTF-8 regression in guile 1.9.5
2009-12-06 19:33 ` Linas Vepstas
@ 2009-12-06 20:40 ` Mike Gran
2009-12-06 20:43 ` Linas Vepstas
0 siblings, 1 reply; 16+ messages in thread
From: Mike Gran @ 2009-12-06 20:40 UTC (permalink / raw)
To: linasvepstas; +Cc: bug-guile, Guile Development
> > Hmm. The "ã" is a dead giveaway that you are printing a UTF-8 string
> > that is being interpreted as an ISO-8859-1 string.
> >
> > You've already said that you're in a UTF-8 locale. It could be that you
> > need to call (setlocale LC_ALL "")
>
> That cured it.
>
> > as well as having a setlocale call in your program.
>
> Doesn't seem to be required, after the above.
>
> Thanks!
>
> Why this happened is strange; I'm now investigating. Sorry to
> have bothered you with something that is dohh .. basic.
1.9.x does work fundamentally differently w.r.t. strings.
The reason is how strings are now stored.
In 1.8.x, a character was a byte. In 1.9.x a character is a
codepoint.
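
To make the byte/codepoint distinction concrete, here is a small C sketch
that counts both for the string from the original report. The function
name utf8_codepoint_count and the constant sample_utf8 are made up for
this example, and the counter assumes its input is already well-formed
UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a Guile API): counts code points in a
   NUL-terminated string that is assumed to be well-formed UTF-8.
   Every byte that is not a continuation byte (10xxxxxx) starts a new
   code point. */
size_t utf8_codepoint_count(const char *s)
{
    assert(s != NULL);

    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}

/* The string from the original report, written out as UTF-8 bytes so
   the example does not depend on the source file's encoding. */
const char sample_utf8[] =
    "\xE3\x81\xA6" "\xE3\x81\xBF" "\xE3\x81\xBE"
    "\xE3\x81\x97" "\xE3\x81\x9F" "\xE3\x80\x82";
```

Here sizeof sample_utf8 - 1 is 18 (the byte count, which is what a
one-char-per-byte string would call its length), while
utf8_codepoint_count (sample_utf8) is 6.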
But for Guile to store characters as codepoints, declaring a locale
is pretty much a requirement now.
-Mike
* Re: UTF-8 regression in guile 1.9.5
2009-12-06 20:40 ` Mike Gran
@ 2009-12-06 20:43 ` Linas Vepstas
2009-12-11 10:29 ` Andy Wingo
0 siblings, 1 reply; 16+ messages in thread
From: Linas Vepstas @ 2009-12-06 20:43 UTC (permalink / raw)
To: Mike Gran; +Cc: bug-guile, Guile Development
2009/12/6 Mike Gran <spk121@yahoo.com>:
>
>> > need to call (setlocale LC_ALL "")
>
> But for Guile to store characters as codepoints, declaring a locale
> is pretty much a requirement now.
Would it make sense to add (setlocale LC_ALL "") to some default,
e.g. boot-9.scm ?
--linas
* Re: UTF-8 regression in guile 1.9.5
2009-12-06 20:43 ` Linas Vepstas
@ 2009-12-11 10:29 ` Andy Wingo
2009-12-11 15:05 ` Mike Gran
0 siblings, 1 reply; 16+ messages in thread
From: Andy Wingo @ 2009-12-11 10:29 UTC (permalink / raw)
To: linasvepstas; +Cc: bug-guile, Guile Development, Mike Gran
Hi,
On Sun 06 Dec 2009 21:43, Linas Vepstas <linasvepstas@gmail.com> writes:
> 2009/12/6 Mike Gran <spk121@yahoo.com>:
>>
>>> > need to call (setlocale LC_ALL "")
>>
>> But for Guile to store characters as codepoints, declaring a locale
>> is pretty much a requirement now.
>
> Would it make sense to add (setlocale LC_ALL "") to some default,
> e.g. boot-9.scm ?
Mike I admit I don't follow this completely. Does Linas' suggestion make
sense? I somehow thought that locales would magically just work.
Cheers,
Andy
--
http://wingolog.org/
* Re: UTF-8 regression in guile 1.9.5
2009-12-11 10:29 ` Andy Wingo
@ 2009-12-11 15:05 ` Mike Gran
2009-12-11 15:40 ` Linas Vepstas
` (2 more replies)
0 siblings, 3 replies; 16+ messages in thread
From: Mike Gran @ 2009-12-11 15:05 UTC (permalink / raw)
To: Andy Wingo, linasvepstas; +Cc: bug-guile, Guile Development
> From: Andy Wingo <wingo@pobox.com>
> Hi,
>
> On Sun 06 Dec 2009 21:43, Linas Vepstas <linasvepstas@gmail.com> writes:
>
> > 2009/12/6 Mike Gran <spk121@yahoo.com>:
> >>
> >>> > need to call (setlocale LC_ALL "")
> >>
> >> But for Guile to store characters as codepoints, declaring a locale
> >> is pretty much a requirement now.
> >
> > Would it make sense to add (setlocale LC_ALL "") to some default,
> > e.g. boot-9.scm ?
>
> Mike I admit I don't follow this completely. Does Linas' suggestion
> make sense? I somehow thought that locales would magically just
> work.
If we always call setlocale, legacy code that used UTF-8 and other
non-Latin locales will just work. Legacy code that used strings to
contain binary data would break.
(Of course, UTF-8 strings only worked on Guile 1.8.x so long
as you either never looked at substrings or chars, or did
UTF-8 parsing yourself.)
As it is now, the opposite is true: legacy code with strings
containing binary data will just work; strings containing non-8-bit
locale encoded strings will break.
| 1.8.x          | setlocale     |
| strings        | called in     | Guile 2.0
| contain        | 1.8   | 2.0   | will
-----------------------------------------------------------------
| ASCII          | Y/N   | Y/N   | just work
-----------------------------------------------------------------
| locale-encoded | Y/N   | Y     | just work
| strings        |       |       |
-----------------------------------------------------------------
| locale-encoded | Y/N   | N     | interpret string bytes as
| strings        |       |       | Latin-1
-----------------------------------------------------------------
| binary data    | Y/N   | Y     | if locale is Latin-1: just work
|                |       |       |
|                |       |       | if locale is not Latin-1:
|                |       |       | interpret string bytes using
|                |       |       | locale encoding
-----------------------------------------------------------------
| binary data    | Y/N   | N     | just work
|                |       |       |
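
As a rough illustration of the "binary data" rows: every byte sequence
is a valid Latin-1 string (one byte, one character), but most binary
data is not valid UTF-8, so decoding it under a UTF-8 locale cannot
round-trip. The checker below is a sketch written for this message;
is_valid_utf8 is a made-up name, not a Guile or libc function.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest code point that legitimately needs 1, 2 or 3 continuation
   bytes; anything smaller is an overlong encoding. */
static const unsigned long utf8_min_cp[] = {0, 0x80, 0x800, 0x10000};

/* Hypothetical helper: returns 1 if the n bytes at s are well-formed
   UTF-8 (no truncated sequences, stray continuation bytes, overlong
   forms, surrogates, or values above U+10FFFF), 0 otherwise. */
int is_valid_utf8(const unsigned char *s, size_t n)
{
    assert(s != NULL || n == 0);

    size_t i = 0;
    while (i < n) {
        unsigned char b = s[i];
        size_t extra;        /* number of continuation bytes expected */
        unsigned long cp;    /* code point being assembled */

        if (b < 0x80) { i++; continue; }
        else if ((b & 0xE0) == 0xC0) { extra = 1; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { extra = 3; cp = b & 0x07; }
        else return 0;       /* stray continuation byte or 0xF8..0xFF */

        if (n - i - 1 < extra)
            return 0;        /* sequence runs past the end of the data */

        for (size_t k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;
            cp = (cp << 6) | (unsigned long)(s[i + k] & 0x3F);
        }

        if (cp < utf8_min_cp[extra] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;

        i += extra + 1;
    }
    return 1;
}
```

The bytes of "て。" pass, while something like FF FE 00 01 is rejected at
the first byte.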
I think I prefer that the coder take the responsibility of calling
setlocale, but, I only think that because it is how C works. I'm used
to that convention.
Thanks,
Mike
* Re: UTF-8 regression in guile 1.9.5
2009-12-11 15:05 ` Mike Gran
@ 2009-12-11 15:40 ` Linas Vepstas
2009-12-11 22:50 ` Ludovic Courtès
2010-01-09 18:07 ` Andy Wingo
2 siblings, 0 replies; 16+ messages in thread
From: Linas Vepstas @ 2009-12-11 15:40 UTC (permalink / raw)
To: Mike Gran; +Cc: Andy Wingo, bug-guile, Guile Development
2009/12/11 Mike Gran <spk121@yahoo.com>:
> I think I prefer that the coder take the responsibility of calling
> setlocale, but, I only think that because it is how C works. I'm used
> to that convention.
>
OK works for me.
--linas
* Re: UTF-8 regression in guile 1.9.5
2009-12-11 15:05 ` Mike Gran
2009-12-11 15:40 ` Linas Vepstas
@ 2009-12-11 22:50 ` Ludovic Courtès
2010-01-09 18:07 ` Andy Wingo
2 siblings, 0 replies; 16+ messages in thread
From: Ludovic Courtès @ 2009-12-11 22:50 UTC (permalink / raw)
To: bug-guile; +Cc: guile-devel
Hi,
Mike Gran <spk121@yahoo.com> writes:
> I think I prefer that the coder take the responsibility of calling
> setlocale, but, I only think that because it is how C works. I'm used
> to that convention.
+1.
Thanks,
Ludo’.
* Re: UTF-8 regression in guile 1.9.5
2009-12-11 15:05 ` Mike Gran
2009-12-11 15:40 ` Linas Vepstas
2009-12-11 22:50 ` Ludovic Courtès
@ 2010-01-09 18:07 ` Andy Wingo
2010-01-10 22:00 ` Mike Gran
2 siblings, 1 reply; 16+ messages in thread
From: Andy Wingo @ 2010-01-09 18:07 UTC (permalink / raw)
To: Mike Gran; +Cc: bug-guile, linasvepstas, Guile Development
Hi,
Reviving an old thread...
On Fri 11 Dec 2009 16:05, Mike Gran <spk121@yahoo.com> writes:
>> On Sun 06 Dec 2009 21:43, Linas Vepstas <linasvepstas@gmail.com> writes:
>>
>> > 2009/12/6 Mike Gran <spk121@yahoo.com>:
>> >>
>> >>> > need to call (setlocale LC_ALL "")
>> >>
>> >> But for Guile to store characters as codepoints, declaring a locale
>> >> is pretty much a requirement now.
>> >
>> > Would it make sense to add (setlocale LC_ALL "") to some default,
>> > e.g. boot-9.scm ?
>
> If we always call setlocale, legacy code that used UTF-8 and other
> non-Latin locales will just work. Legacy code that used strings to
> contain binary data would break.
>
> (Of course, UTF-8 strings only worked on Guile 1.8.x so long
> as you either never looked at substrings or chars, or did
> UTF-8 parsing yourself.)
>
> As it is now, the opposite is true: legacy code with strings
> containing binary data will just work; strings containing non-8-bit
> locale encoded strings will break.
>
> | 1.8.x          | setlocale     |
> | strings        | called in     | Guile 2.0
> | contain        | 1.8   | 2.0   | will
> -----------------------------------------------------------------
> | ASCII          | Y/N   | Y/N   | just work
> -----------------------------------------------------------------
> | locale-encoded | Y/N   | Y     | just work
> | strings        |       |       |
> -----------------------------------------------------------------
> | locale-encoded | Y/N   | N     | interpret string bytes as
> | strings        |       |       | Latin-1
> -----------------------------------------------------------------
> | binary data    | Y/N   | Y     | if locale is Latin-1: just work
> |                |       |       |
> |                |       |       | if locale is not Latin-1:
> |                |       |       | interpret string bytes using
> |                |       |       | locale encoding
> -----------------------------------------------------------------
> | binary data    | Y/N   | N     | just work
> |                |       |       |
>
> I think I prefer that the coder take the responsibility of calling
> setlocale, but, I only think that because it is how C works. I'm used
> to that convention.
I would still prefer ponies and magic, but I realized: if we do a
setlocale(LC_ALL, "") at the beginning, might that not change e.g. the
floating point format, or some other locale-related variable, which
would make Guile modules unreadable, or otherwise semantically different
or invalid?
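
To show the kind of change I mean at the libc level, here is a small C
sketch: printf-style number formatting follows LC_NUMERIC, so the same
call can produce a different decimal separator once a locale is
installed. Whether any of Guile's reader or printer paths actually go
through such calls is exactly the open question; this only demonstrates
the libc behaviour. The helper name format_one_and_a_half is made up
for the example.

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: formats 1.5 under the LC_NUMERIC locale named
   by locale_name, then restores the previous LC_NUMERIC setting.
   Returns 0 on success, -1 if the locale could not be installed. */
int format_one_and_a_half(const char *locale_name, char *buf, size_t buf_size)
{
    assert(locale_name != NULL && buf != NULL && buf_size > 0);

    /* setlocale(category, NULL) only queries. Copy the result, because
       a later setlocale call may overwrite the returned storage. */
    char saved[256];
    const char *current = setlocale(LC_NUMERIC, NULL);
    if (current == NULL || strlen(current) >= sizeof saved)
        return -1;
    strcpy(saved, current);

    if (setlocale(LC_NUMERIC, locale_name) == NULL)
        return -1;

    snprintf(buf, buf_size, "%.1f", 1.5);

    setlocale(LC_NUMERIC, saved);
    return 0;
}
```

With "C" the buffer holds "1.5". With a locale such as "de_DE.UTF-8"
(assuming it is installed, which is not a given) glibc would typically
produce "1,5" instead.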
I'm asking because I ran into this bug now:
scheme@(guile-user)> ,pr (resolve-module '(gnome gtk))
Throw to key `wrong-type-arg' with args `("procedure-name" "Wrong type argument in position ~A: ~S" (1 #<dynamic-object "libgw-guile-gnome-pango">) (#<dynamic-object "libgw-guile-gnome-pango">))'.
Entering the debugger. Type `bt' for a backtrace or `c' to continue.
0 debug> bt
In current input:
<unknown-location>: 13 ERROR: cannot convert to output locale "NONE": ""dynamic-wind""
So I guess we need a special case for NONE there, or something. I really
don't understand i18n/l10n.
FWIW, it seems that both ruby and python require the user to call
setlocale.
Regards,
Andy
--
http://wingolog.org/
* Re: UTF-8 regression in guile 1.9.5
2010-01-09 18:07 ` Andy Wingo
@ 2010-01-10 22:00 ` Mike Gran
2010-01-11 13:38 ` Ludovic Courtès
0 siblings, 1 reply; 16+ messages in thread
From: Mike Gran @ 2010-01-10 22:00 UTC (permalink / raw)
To: Andy Wingo; +Cc: bug-guile, linasvepstas, Guile Development
> From: Andy Wingo <wingo@pobox.com>
> Hi,
>
> Reviving an old thread...
>
> > I think I prefer that the coder take the responsibility of calling
> > setlocale, but, I only think that because it is how C works. I'm used
> > to that convention.
>
> I would still prefer ponies and magic, but I realized: if we do a
> setlocale(LC_ALL, "") at the beginning, might that not change e.g. the
> floating point format, or some other locale-related variable, which
> would make Guile modules unreadable, or otherwise semantically different
> or invalid?
>
> I'm asking because I ran into this bug now:
>
> scheme@(guile-user)> ,pr (resolve-module '(gnome gtk))
> Throw to key `wrong-type-arg' with args `("procedure-name" "Wrong type
> argument in position ~A: ~S" (1 #<dynamic-object "libgw-guile-gnome-pango">)
> (#<dynamic-object "libgw-guile-gnome-pango">))'.
> Entering the debugger. Type `bt' for a backtrace or `c' to continue.
> 0 debug> bt
> In current input:
> <unknown-location>: 13 ERROR: cannot convert to output locale "NONE":
> ""dynamic-wind""
>
> So I guess we need a special case for NONE there, or something. I really
> don't understand i18n/l10n.
Setting LOCALE=NONE is the same as setting the locale to any undefined value, e.g.
LOCALE=martian_mars. There isn't a locale called 'none', so the system
can provide no clues on how I/O, date format, etc, should be done.
All programs are supposed to start with locale="C", so I guess the
NONE locale is being set explicitly at some point. On my box, I can't
(setlocale LC_ALL "NONE") since I don't have a NONE locale.
So, the fact that the locale is NONE at all seems like a bug to me.
For Guile string conversion, if locale=NONE has some non-buggy meaning,
I'd probably suggest making NONE the same as UTF-8. We can do whatever
we want, since the result of the operation of conversion into NONE
is undefined.
But as far as the greater question of the side effects of setting locale
early on startup... The parsing of any source code files after locale
is set will be done in that context. I don't think it would do anything
unexpected. The reader and the port routines tend to do their own parsing,
and don't tend to rely on libc locale-specific routines. Even so, it
would take some auditing to prove that there would be no effect.
If you were to set the locale in Guile, you would need to add
a condition to catch if the current LANG envvar isn't set to a valid
locale so you can fall back to the "C" locale.
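
A minimal C sketch of that fallback, assuming it is done with plain
libc setlocale at startup. The function name install_user_locale_or_c
is made up for this example; it is not an existing Guile entry point.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: try the locale named by the environment
   (LC_ALL, the LC_* variables, LANG). If libc cannot honor it,
   setlocale returns NULL and we fall back to the "C" locale.
   Returns 1 if the environment's locale was installed, 0 on fallback. */
int install_user_locale_or_c(void)
{
    if (setlocale(LC_ALL, "") != NULL)
        return 1;

    /* The "C" locale always exists, so this is expected to succeed. */
    const char *fallback = setlocale(LC_ALL, "C");
    assert(fallback != NULL);
    (void)fallback;
    return 0;
}
```

Note that some C libraries accept locale names they do not know, so a
return value of 1 does not by itself prove the requested locale is
meaningful; it only says setlocale did not refuse it.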
-Mike Gran
* Re: UTF-8 regression in guile 1.9.5
2010-01-10 22:00 ` Mike Gran
@ 2010-01-11 13:38 ` Ludovic Courtès
2010-01-11 21:18 ` Andy Wingo
0 siblings, 1 reply; 16+ messages in thread
From: Ludovic Courtès @ 2010-01-11 13:38 UTC (permalink / raw)
To: guile-devel; +Cc: bug-guile
Hi,
Mike Gran <spk121@yahoo.com> writes:
> But as far as the greater question of the side effects of setting locale
> early on startup... The parsing of any source code files after locale
> is set will be done in that context. I don't think it would do anything
> unexpected. The reader and the port routines tend to do their own parsing,
> and don't tend to rely on libc locale-specific routines. Even so, it
> would take some auditing to prove that there would be no effect.
Source files should have the right ‘coding:’ meta anyway. I just
changed the compiler to install the current user locale [0], as that’s
typically what a standalone program does. It makes it necessary for
source files to have the right ‘coding:’ since otherwise they could get
read with the current user’s locale encoding, which could be anything [1].
[0] http://git.savannah.gnu.org/cgit/guile.git/commit/?id=e6251e7bd98fbc64e9dbf489c8afaf426af46919
[1] http://git.savannah.gnu.org/cgit/guile.git/commit/?id=bce5cb56413da437c29628c529cec47649d12eb9
> If you were to set the locale in Guile,
[...]
I currently think we shouldn’t do it since (1) Guile can be embedded and
it’s the application’s responsibility to set the locale, and (2) it
would be a departure from previous versions of Guile and from POSIX
behavior.
Thanks,
Ludo’.
* Re: UTF-8 regression in guile 1.9.5
2010-01-11 13:38 ` Ludovic Courtès
@ 2010-01-11 21:18 ` Andy Wingo
2010-01-12 11:25 ` Ludovic Courtès
0 siblings, 1 reply; 16+ messages in thread
From: Andy Wingo @ 2010-01-11 21:18 UTC (permalink / raw)
To: Ludovic Courtès; +Cc: bug-guile, guile-devel
On Mon 11 Jan 2010 14:38, ludo@gnu.org (Ludovic Courtès) writes:
> Mike Gran <spk121@yahoo.com> writes:
>
>> But as far as the greater question of the side effects of setting locale
>> early on startup... The parsing of any source code files after locale
>> is set will be done in that context. I don't think it would do anything
>> unexpected. The reader and the port routines tend to do their own parsing,
>> and don't tend to rely on libc locale-specific routines. Even so, it
>> would take some auditing to prove that there would be no effect.
>
> Source files should have the right ‘coding:’ meta anyway. I just
> changed the compiler to install the current user locale [0], as that’s
> typically what a standalone program does.
If we're taking this tack, perhaps we should setlocale in the `guile'
binary (but not by default when used by a library).
Andy
--
http://wingolog.org/
* Re: UTF-8 regression in guile 1.9.5
2010-01-11 21:18 ` Andy Wingo
@ 2010-01-12 11:25 ` Ludovic Courtès
2010-01-12 19:36 ` Andy Wingo
0 siblings, 1 reply; 16+ messages in thread
From: Ludovic Courtès @ 2010-01-12 11:25 UTC (permalink / raw)
To: bug-guile; +Cc: guile-devel
Hello,
Andy Wingo <wingo@pobox.com> writes:
> On Mon 11 Jan 2010 14:38, ludo@gnu.org (Ludovic Courtès) writes:
>
>> Mike Gran <spk121@yahoo.com> writes:
>>
>>> But as far as the greater question of the side effects of setting locale
>>> early on startup... The parsing of any source code files after locale
>>> is set will be done in that context. I don't think it would do anything
>>> unexpected. The reader and the port routines tend to do their own parsing,
>>> and don't tend to rely on libc locale-specific routines. Even so, it
>>> would take some auditing to prove that there would be no effect.
>>
>> Source files should have the right ‘coding:’ meta anyway. I just
>> changed the compiler to install the current user locale [0], as that’s
>> typically what a standalone program does.
>
> If we're taking this tack, perhaps we should setlocale in the `guile'
> binary (but not by default when used by a library).
We could, but it would break programs that have been assuming the ‘C’
locale, e.g., when parsing or printing data, etc...
Thanks,
Ludo’.
* Re: UTF-8 regression in guile 1.9.5
2010-01-12 11:25 ` Ludovic Courtès
@ 2010-01-12 19:36 ` Andy Wingo
2010-01-12 22:26 ` Ludovic Courtès
0 siblings, 1 reply; 16+ messages in thread
From: Andy Wingo @ 2010-01-12 19:36 UTC (permalink / raw)
To: Ludovic Courtès; +Cc: bug-guile, guile-devel
Hi,
On Tue 12 Jan 2010 12:25, ludo@gnu.org (Ludovic Courtès) writes:
> Andy Wingo <wingo@pobox.com> writes:
>
>> perhaps we should setlocale in the `guile' binary (but not by default
>> when used by a library).
>
> We could, but it would break programs that have been assuming the ‘C’
> locale, e.g., when parsing or printing data, etc...
But surely it is the Right Thing; is there no way to make a transition
to having it there by default?
A
--
http://wingolog.org/
* Re: UTF-8 regression in guile 1.9.5
2010-01-12 19:36 ` Andy Wingo
@ 2010-01-12 22:26 ` Ludovic Courtès
0 siblings, 0 replies; 16+ messages in thread
From: Ludovic Courtès @ 2010-01-12 22:26 UTC (permalink / raw)
To: Andy Wingo; +Cc: bug-guile, guile-devel
Hello,
Andy Wingo <wingo@pobox.com> writes:
> On Tue 12 Jan 2010 12:25, ludo@gnu.org (Ludovic Courtès) writes:
>
>> Andy Wingo <wingo@pobox.com> writes:
>>
>>> perhaps we should setlocale in the `guile' binary (but not by default
>>> when used by a library).
>>
>> We could, but it would break programs that have been assuming the ‘C’
>> locale, e.g., when parsing or printing data, etc...
>
> But surely it is the Right Thing;
Perhaps, I don’t know. Apparently Perl does that, for instance. What
do others do? Is there no real reason for POSIX to be this way, other
than backward compatibility?
> is there no way to make a transition to having it there by default?
Not that I can think of.
Thanks,
Ludo’.