unofficial mirror of guile-devel@gnu.org 
 help / color / mirror / Atom feed
* [PATCH] fix locale string reading
@ 2011-11-07 19:12 Nala Ginrut
  2011-11-08 13:31 ` Peter Brett
  2011-11-11 20:16 ` Ludovic Courtès
  0 siblings, 2 replies; 14+ messages in thread
From: Nala Ginrut @ 2011-11-07 19:12 UTC (permalink / raw)
  To: guile-devel

[-- Attachment #1: Type: text/plain, Size: 2160 bytes --]

I've read a mail from a Chinese user which sent to guile-user. And I
realized that this locale string reading error is a real problem to a
Chinese user.
And I talked with Wingo & ijp about this topic. Finally I found what's the
problem.
All the string seems to be handled all by libunistring now, and
"u32_conv_from_encoding" should return a valid string under "zh_CN.UTF-8",
but it didn't.
We need to call setlocale(LC_ALL, ""), or the user must do it in their code
every time.
The better solution maybe set locale each time scm_*_locale_string is
called,  because the locale could be changed during the runtime.

Here is the patch. It solved the problem like these:
------------------cut begin--------------
...
(write (command-line))
...
-------------------cut end---------------

# ./test.scm 我靠
==>("./ad.scm" "我靠")

Now it works.



---------------------------------------------patch
cut---------------------------------------------------------

From a424c2f7022cc81d163971cf961581d560f0d4ed Mon Sep 17 00:00:00 2001
From: NalaGinrut <NalaGinrut@gmail.com>
Date: Tue, 8 Nov 2011 02:56:03 +0800
Subject: [PATCH] fix locale string reading

---
 libguile/strings.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/libguile/strings.c b/libguile/strings.c
index 666a951..391c983 100644
--- a/libguile/strings.c
+++ b/libguile/strings.c
@@ -29,6 +29,7 @@
 #include <uninorm.h>
 #include <unistr.h>
 #include <uniconv.h>
+#include <locale.h>

 #include "striconveh.h"

@@ -1532,6 +1533,7 @@ scm_from_locale_string (const char *str)
 SCM
 scm_from_locale_stringn (const char *str, size_t len)
 {
+  setlocale (LC_ALL, "");
   return scm_from_stringn (str, len, locale_charset (),
                            scm_i_get_conversion_strategy (SCM_BOOL_F));
 }
@@ -1758,6 +1760,7 @@ scm_to_locale_string (SCM str)
 char *
 scm_to_locale_stringn (SCM str, size_t *lenp)
 {
+  setlocale (LC_ALL, "");
   return scm_to_stringn (str, lenp,
                          locale_charset (),
                          scm_i_get_conversion_strategy (SCM_BOOL_F));
-- 
1.7.0.4

[-- Attachment #2: Type: text/html, Size: 3020 bytes --]

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH] fix locale string reading
  2011-11-07 19:12 [PATCH] fix locale string reading Nala Ginrut
@ 2011-11-08 13:31 ` Peter Brett
       [not found]   ` <CAPjoZoewXz+MMMD2N=8gp57AwsNG99UHc9KkhsizdK8U7djLvg@mail.gmail.com>
  2011-11-11 20:16 ` Ludovic Courtès
  1 sibling, 1 reply; 14+ messages in thread
From: Peter Brett @ 2011-11-08 13:31 UTC (permalink / raw)
  To: guile-devel

Nala Ginrut <nalaginrut@gmail.com> writes:

> Here is the patch. It solved the problem like these:
> @@ -1532,6 +1533,7 @@ scm_from_locale_string (const char *str)
>  SCM
>  scm_from_locale_stringn (const char *str, size_t len)
>  {
> +  setlocale (LC_ALL, "");
>    return scm_from_stringn (str, len, locale_charset (),
>                             scm_i_get_conversion_strategy (SCM_BOOL_F));
>  }
> @@ -1758,6 +1760,7 @@ scm_to_locale_string (SCM str)
>  char *
>  scm_to_locale_stringn (SCM str, size_t *lenp)
>  {
> +  setlocale (LC_ALL, "");
>    return scm_to_stringn (str, lenp, 
>                           locale_charset (),
>                           scm_i_get_conversion_strategy (SCM_BOOL_F));

This patch *breaks* scm_to_locale_string() and
scm_from_locale_string().  The documentation for these functions quite
clearly states that they use "the current locale", not "the locale
specified by the environment variables". Not to mention the fact that
they change the current locale without restoring it afterwards.

So no, please don't do this.  It's obviously the Wrong Thing.

                            Peter

-- 
Peter Brett <peter@peter-b.co.uk>
Remote Sensing Research Group
Surrey Space Centre




^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] fix locale string reading
       [not found]   ` <CAPjoZoewXz+MMMD2N=8gp57AwsNG99UHc9KkhsizdK8U7djLvg@mail.gmail.com>
@ 2011-11-09  2:59     ` Nala Ginrut
  2011-11-09  3:11       ` Nala Ginrut
  0 siblings, 1 reply; 14+ messages in thread
From: Nala Ginrut @ 2011-11-09  2:59 UTC (permalink / raw)
  To: guile-devel

[-- Attachment #1: Type: text/plain, Size: 1586 bytes --]

On Tue, Nov 8, 2011 at 9:31 PM, Peter Brett <peter@peter-b.co.uk> wrote:

> Nala Ginrut <nalaginrut@gmail.com> writes:
>
> > Here is the patch. It solved the problem like these:
> > @@ -1532,6 +1533,7 @@ scm_from_locale_string (const char *str)
> >  SCM
> >  scm_from_locale_stringn (const char *str, size_t len)
> >  {
> > +  setlocale (LC_ALL, "");
> >    return scm_from_stringn (str, len, locale_charset (),
> >                             scm_i_get_conversion_strategy (SCM_BOOL_F));
> >  }
> > @@ -1758,6 +1760,7 @@ scm_to_locale_string (SCM str)
> >  char *
> >  scm_to_locale_stringn (SCM str, size_t *lenp)
> >  {
> > +  setlocale (LC_ALL, "");
> >    return scm_to_stringn (str, lenp,
> >                           locale_charset (),
> >                           scm_i_get_conversion_strategy (SCM_BOOL_F));
>
> This patch *breaks* scm_to_locale_string() and
> scm_from_locale_string().  The documentation for these functions quite
> clearly states that they use "the current locale", not "the locale
> specified by the environment variables". Not to mention the fact that
> they change the current locale without restoring it afterwards.
>

NO, I don't agree.
setlocale(LC_ALL ,"") means the current locale is only queried, not
modified.
So it just make sure that locale_charset() can get the correct locale ,not
a modified locale.

I tested this in C code. Current locale is "zh_CN.UTF-8",  locale_charset()
will return "ANSI_X3.4-1968".
It's definitely wrong. We must use setlocale(LC_ALL,"") to query current
locale first, and locale_charset() will return "zh_CN.UTF-8".

[-- Attachment #2: Type: text/html, Size: 2342 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] fix locale string reading
  2011-11-09  2:59     ` Nala Ginrut
@ 2011-11-09  3:11       ` Nala Ginrut
  2011-11-09  4:10         ` Nala Ginrut
  0 siblings, 1 reply; 14+ messages in thread
From: Nala Ginrut @ 2011-11-09  3:11 UTC (permalink / raw)
  To: guile-devel, Peter Brett

[-- Attachment #1: Type: text/plain, Size: 124 bytes --]

Sorry, it's my fault.
setlocale(LC_ALL,"") is not query.
setlocale(LC_ALL,NULL) does.
So it's a wrong patch. Thanks Petter.

[-- Attachment #2: Type: text/html, Size: 192 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] fix locale string reading
  2011-11-09  3:11       ` Nala Ginrut
@ 2011-11-09  4:10         ` Nala Ginrut
  2011-11-09 10:20           ` Peter Brett
  0 siblings, 1 reply; 14+ messages in thread
From: Nala Ginrut @ 2011-11-09  4:10 UTC (permalink / raw)
  To: guile-devel, Peter Brett

[-- Attachment #1: Type: text/plain, Size: 1273 bytes --]

Well, I've read the manual about scm_*_locale_string.
It shows:
-----------------------------------
Note that these functions should not be used to convert C string constants,
because
there is no guarantee that the current locale will match that of the source
code. To
convert C string constants, use scm_from_latin1_string, scm_from_utf8_string
or scm_from_utf32_string.
------------------------------------

I think there's in a dilemma. According to the manual, we should convert
string or change locale in the users' code.
But Guile will break in (command-line) proc, because Chinese string as
command arguments can not get valid result from  "u32_conv_from_encoding"
called by "scm_from_stringn", and raised an error.
So we don't have any chance to convert it or change locale from environment
in the users' code because Guile has already crashed by "decoding-error".

Now I have two questons:
1. Maybe we raised this "decoding-error" too early. Can we let
"scm_from_locale_stringn" just return the string which wasn't recognized as
a bytevector? Then we may convert it as wish. But it's confused for
"scm_from_locale_stringn" returning a bytevector;
2. Why not let "scm_from_locale_stringn" return string according to the
locale from current environment as my patch?

[-- Attachment #2: Type: text/html, Size: 1911 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] fix locale string reading
  2011-11-09  4:10         ` Nala Ginrut
@ 2011-11-09 10:20           ` Peter Brett
  2011-11-09 10:46             ` Nala Ginrut
  0 siblings, 1 reply; 14+ messages in thread
From: Peter Brett @ 2011-11-09 10:20 UTC (permalink / raw)
  To: guile-devel

Nala Ginrut <nalaginrut@gmail.com> writes:

> But Guile will break in (command-line) proc, because Chinese string as
> command arguments can not get valid result from
>  "u32_conv_from_encoding" called by "scm_from_stringn", and raised an
> error. 

Probably what should happen is that Guile's command-line parsing code
should use the environment-provided locale by taking the following
steps:

1) Save current locale.
2) Set locale from environment.
3) Call scm_*_locale_string() functions.
4) Restore original locale.

However, note that this still may cause decoding errors, because there's
no guarantee that argv is in the same encoding as the environment
specifies, or indeed in any valid encoding at all.  So consider *also*
adding e.g. a (command-line-bv) function to return the command line
without attempting to decode it.

> So we don't have any chance to convert it or change locale from
> environment in the users' code because Guile has already crashed by
> "decoding-error".

Hang on -- are you saying that if you run Guile with badly-encoded argv
then it will die before running any user code?  That would obviously be a bug.

> Now I have two questons:
> 1. Maybe we raised this "decoding-error" too early. Can we let
> "scm_from_locale_stringn" just return the string which
> wasn't recognized as a bytevector? Then we may convert it as wish. But
> it's confused for "scm_from_locale_stringn" returning a bytevector;

No, you can't do that, because scm_*_locale_string is supposed to either
return a string or raise "decoding-error".

> 2. Why not let "scm_from_locale_stringn" return string according to
> the locale from current environment as my patch?

Because it breaks existing code that relies on current behaviour *and*
doesn't actually fix the problem.

                      Peter

-- 
Peter Brett <peter@peter-b.co.uk>
Remote Sensing Research Group
Surrey Space Centre




^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] fix locale string reading
  2011-11-09 10:20           ` Peter Brett
@ 2011-11-09 10:46             ` Nala Ginrut
  2011-11-09 13:45               ` Peter Brett
  0 siblings, 1 reply; 14+ messages in thread
From: Nala Ginrut @ 2011-11-09 10:46 UTC (permalink / raw)
  To: Peter Brett, guile-devel

[-- Attachment #1: Type: text/plain, Size: 2243 bytes --]

On Wed, Nov 9, 2011 at 6:20 PM, Peter Brett <peter@peter-b.co.uk> wrote:

> Nala Ginrut <nalaginrut@gmail.com> writes:
>
> > But Guile will break in (command-line) proc, because Chinese string as
> > command arguments can not get valid result from
> >  "u32_conv_from_encoding" called by "scm_from_stringn", and raised an
> > error.
>
> Probably what should happen is that Guile's command-line parsing code
> should use the environment-provided locale by taking the following
> steps:
>
> 1) Save current locale.
> 2) Set locale from environment.
> 3) Call scm_*_locale_string() functions.
> 4) Restore original locale.
>
> However, note that this still may cause decoding errors, because there's
> no guarantee that argv is in the same encoding as the environment
> specifies, or indeed in any valid encoding at all.  So consider *also*
> adding e.g. a (command-line-bv) function to return the command line
> without attempting to decode it.
>
> This couldn't be the final solution.
Even we add a (command-line-bv), it may cause encoding-error. Because
(command-line) would read argv too , and raise the error.
Unless we use (command-line-bv) and delete (command-line).


> > So we don't have any chance to convert it or change locale from
> > environment in the users' code because Guile has already crashed by
> > "decoding-error".
>
> Hang on -- are you saying that if you run Guile with badly-encoded argv
> then it will die before running any user code?  That would obviously be a
> bug.


I think so. I mentioned it in the first mail of this thread.
The badly-encoded argv can not get valid result but NULL from
"u32_conv_from_encoding".
So scm_from_locale_stringn will raise encording-error directly and show the
argv as bytevector.
But even none-badly-encoded argv can not get valid result either. I checked
out the code, current_charset() can not return the correct current locale.
I must run setlocale(LC_ALL,"") to query locale from environment first.
But I think what you mean is *not query locale from envrionment*.

If this is not a bug. And locale string can not get result from environment
locale. The solution maybe get rid of (command-line), use
(command-line-bv), it's the easiest way. But I don't think it's the best
way.

[-- Attachment #2: Type: text/html, Size: 3936 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] fix locale string reading
  2011-11-09 10:46             ` Nala Ginrut
@ 2011-11-09 13:45               ` Peter Brett
  2011-11-10  0:36                 ` Nala Ginrut
  0 siblings, 1 reply; 14+ messages in thread
From: Peter Brett @ 2011-11-09 13:45 UTC (permalink / raw)
  To: guile-devel

Nala Ginrut <nalaginrut@gmail.com> writes:

> This couldn't be the final solution.
> Even we add a (command-line-bv), it may cause encoding-error. Because
> (command-line) would read argv too , and raise the error.
> Unless we use (command-line-bv) and delete (command-line).

(command-line-bv) should never cause an "encoding-error", because it
should never try to decode the command line.

Users of (command-line) should catch a potential "encoding-error" and
handle it appropriately.

> I think so. I mentioned it in the first mail of this thread.
> The badly-encoded argv can not get valid result but NULL
> from  "u32_conv_from_encoding". So scm_from_locale_stringn will raise
> encording-error directly and show the argv as bytevector.
> But even none-badly-encoded argv can not get valid result either. I checked out
> the code, current_charset() can not return the correct current locale. I must
> run setlocale(LC_ALL,"") to query locale from environment first. 
> But I think what you mean is *not query locale from envrionment*. 
>
> If this is not a bug. And locale string can not get result from environment
> locale. The solution maybe get rid of (command-line), use (command-line-bv),
> it's the easiest way. But I don't think it's the best way.

Well, you still need to provide (command-line) and (program-arguments)
functions compatible with existing code.  I think the root of the
problem is that Guile calls scm_set_program_arguments() [feature.c:72]
early in initialisation, and this can fail messily.

I still think your proposed "solution" causes WAY more problems than it
claims to solve, though.  What do Andy & Ludovic think?

                         Peter

-- 
Peter Brett <peter@peter-b.co.uk>
Remote Sensing Research Group
Surrey Space Centre




^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] fix locale string reading
  2011-11-09 13:45               ` Peter Brett
@ 2011-11-10  0:36                 ` Nala Ginrut
  0 siblings, 0 replies; 14+ messages in thread
From: Nala Ginrut @ 2011-11-10  0:36 UTC (permalink / raw)
  To: Peter Brett, guile-devel

[-- Attachment #1: Type: text/plain, Size: 1238 bytes --]

On Wed, Nov 9, 2011 at 9:45 PM, Peter Brett <peter@peter-b.co.uk> wrote:

> Nala Ginrut <nalaginrut@gmail.com> writes:
>
> > This couldn't be the final solution.
> > Even we add a (command-line-bv), it may cause encoding-error. Because
> > (command-line) would read argv too , and raise the error.
> > Unless we use (command-line-bv) and delete (command-line).
>
> (command-line-bv) should never cause an "encoding-error", because it
> should never try to decode the command line.
>
> Users of (command-line) should catch a potential "encoding-error" and
> handle it appropriately.
>
> Well, one of the approaches could be "call command-line-bv after
encoding-error rased".
We can put command-line-bv in the error handler of encoding-error.
Argv will be bytevector if encoding error occurs.

Well, you still need to provide (command-line) and (program-arguments)
> functions compatible with existing code.  I think the root of the
> problem is that Guile calls scm_set_program_arguments() [feature.c:72]
> early in initialisation, and this can fail messily.
>

Hmm...I mentioned *deal with argv too early* in IRC with Andy.
But I chosen an easier way to fix this problem with query locale
from environment.
Now it seems to be improper...

[-- Attachment #2: Type: text/html, Size: 2015 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] fix locale string reading
  2011-11-07 19:12 [PATCH] fix locale string reading Nala Ginrut
  2011-11-08 13:31 ` Peter Brett
@ 2011-11-11 20:16 ` Ludovic Courtès
  2011-11-11 23:39   ` Mark H Weaver
  1 sibling, 1 reply; 14+ messages in thread
From: Ludovic Courtès @ 2011-11-11 20:16 UTC (permalink / raw)
  To: guile-devel

Hi,

Nala Ginrut <nalaginrut@gmail.com> skribis:

> +  setlocale (LC_ALL, "");

Currently, like in C, it’s the programmer’s responsibility to install
the locale with such a call, and none of Guile’s business.

Thanks,
Ludo’.




^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] fix locale string reading
  2011-11-11 20:16 ` Ludovic Courtès
@ 2011-11-11 23:39   ` Mark H Weaver
  2011-11-14 22:42     ` Ludovic Courtès
  0 siblings, 1 reply; 14+ messages in thread
From: Mark H Weaver @ 2011-11-11 23:39 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: guile-devel

ludo@gnu.org (Ludovic Courtès) writes:
> Nala Ginrut <nalaginrut@gmail.com> skribis:
>> +  setlocale (LC_ALL, "");
>
> Currently, like in C, it’s the programmer’s responsibility to install
> the locale with such a call, and none of Guile’s business.

Unfortunately, I don't see a way for the user to call setlocale before a
Guile script converts the command-line arguments to Scheme strings, at
least not without providing their own `main' function in C.

It wouldn't be too hard to postpone the conversion until the first time
(command-line) or (program-arguments) is called.  That would change the
semantics somewhat, e.g. if the program modifies the argv[] array before
the first call to (command-line), but it might still be a worthwhile
change.

However, what should we do about scripts that use `-e'?
For example:

     #!/usr/local/bin/guile \
     -e main -s
     !#

     (define (main args)
       (display args)
       (newline))

I think we should consider decoding the command-line arguments using the
locale specified by the environment variables, at least in cases like
this where there's no way for the user to call setlocale before the
conversion happens.

     Mark



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] fix locale string reading
  2011-11-11 23:39   ` Mark H Weaver
@ 2011-11-14 22:42     ` Ludovic Courtès
  2011-11-15  0:02       ` Mark H Weaver
  0 siblings, 1 reply; 14+ messages in thread
From: Ludovic Courtès @ 2011-11-14 22:42 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: guile-devel

[-- Attachment #1: Type: text/plain, Size: 1494 bytes --]

Hi Mark,

Mark H Weaver <mhw@netris.org> skribis:

> ludo@gnu.org (Ludovic Courtès) writes:
>> Nala Ginrut <nalaginrut@gmail.com> skribis:
>>> +  setlocale (LC_ALL, "");
>>
>> Currently, like in C, it’s the programmer’s responsibility to install
>> the locale with such a call, and none of Guile’s business.
>
> Unfortunately, I don't see a way for the user to call setlocale before a
> Guile script converts the command-line arguments to Scheme strings, at
> least not without providing their own `main' function in C.

Hmm, very good point.

[...]

> I think we should consider decoding the command-line arguments using the
> locale specified by the environment variables, at least in cases like
> this where there's no way for the user to call setlocale before the
> conversion happens.

Below is a patch that does roughly that (we should get ‘locale_encoding’
reviewed and perhaps added to Gnulib.)

It solves the problem:

--8<---------------cut here---------------start------------->8---
# With the patch.
$ ./meta/guile -c '(setlocale LC_ALL "en_US.UTF8")(display (command-line))' -- λ
(/home/ludo/src/guile/libguile/.libs/guile -- λ)

# Previously.
$ guile -c '(setlocale LC_ALL "en_US.UTF8")(display (command-line))' -- λ
(guile -- ??)
--8<---------------cut here---------------end--------------->8---

(Note that the ‘setlocale’ call here is just so that the output port is
correctly set up.)

WDYT?

Thanks,
Ludo’.


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: Type: text/x-patch, Size: 3059 bytes --]

diff --git a/libguile/script.c b/libguile/script.c
index 5e0685a..20d7b9e 100644
--- a/libguile/script.c
+++ b/libguile/script.c
@@ -26,6 +26,7 @@
 #include <stdio.h>
 #include <errno.h>
 #include <ctype.h>
+#include <uniconv.h>
 
 #include "libguile/_scm.h"
 #include "libguile/eval.h"
@@ -368,6 +369,74 @@ scm_shell_usage (int fatal, char *message)
                : SCM_BOOL_F));
 }
 
+/* Return the name of the locale encoding suggested by environment
+   variables, even if it's not current, or NULL if no encoding is
+   defined.  Based on Gnulib's `localcharset.c'.  */
+static const char *
+locale_encoding (void)
+{
+  const char *locale, *codeset = NULL;
+
+  /* Allow user to override the codeset, as set in the operating system,
+     with standard language environment variables.  */
+  locale = getenv ("LC_ALL");
+  if (locale == NULL || locale[0] == '\0')
+    {
+      locale = getenv ("LC_CTYPE");
+      if (locale == NULL || locale[0] == '\0')
+        locale = getenv ("LANG");
+    }
+  if (locale != NULL && locale[0] != '\0')
+    {
+      /* If the locale name contains an encoding after the dot, return it.  */
+      const char *dot = strchr (locale, '.');
+
+      if (dot != NULL)
+        {
+	  static char buf[2 + 10 + 1];
+          const char *modifier;
+
+          dot++;
+          /* Look for the possible @... trailer and remove it, if any.  */
+          modifier = strchr (dot, '@');
+          if (modifier == NULL)
+            return dot;
+          if (modifier - dot < sizeof (buf))
+            {
+              memcpy (buf, dot, modifier - dot);
+              buf [modifier - dot] = '\0';
+              return buf;
+            }
+        }
+
+      /* Resolve through the charset.alias file.  */
+      codeset = locale;
+    }
+
+  return codeset;
+}
+
+/* Return a list of strings from ARGV, which contains ARGC strings
+   assumed to be encoded in the current locale.  Use `locale_charset'
+   instead of relying on `scm_from_locale_string' because the user
+   hasn't had a change to call (setlocale LC_ALL "") yet.  */
+static SCM
+locale_arguments_to_string_list (int argc, char **const argv)
+{
+  int i;
+  SCM lst;
+  const char *encoding;
+
+  encoding = locale_encoding ();
+  for (i = argc - 1, lst = SCM_EOL;
+       i >= 0;
+       i--)
+    lst = scm_cons (scm_from_stringn (argv[i], (size_t) -1, encoding,
+				      SCM_FAILED_CONVERSION_ESCAPE_SEQUENCE),
+		    lst);
+
+  return lst;
+}
 
 /* Given an array of command-line switches, return a Scheme expression
    to carry out the actions specified by the switches.
@@ -378,7 +447,7 @@ scm_compile_shell_switches (int argc, char **argv)
 {
   return scm_call_2 (scm_c_public_ref ("ice-9 command-line",
                                        "compile-shell-switches"),
-                     scm_makfromstrs (argc, argv),
+		     locale_arguments_to_string_list (argc, argv),
                      (scm_usage_name
                       ? scm_from_locale_string (scm_usage_name)
                       : scm_from_latin1_string ("guile")));

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH] fix locale string reading
  2011-11-14 22:42     ` Ludovic Courtès
@ 2011-11-15  0:02       ` Mark H Weaver
  2011-11-15 23:51         ` Ludovic Courtès
  0 siblings, 1 reply; 14+ messages in thread
From: Mark H Weaver @ 2011-11-15  0:02 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: guile-devel

Hi Ludovic!

ludo@gnu.org (Ludovic Courtès) writes:
>> I think we should consider decoding the command-line arguments using the
>> locale specified by the environment variables, at least in cases like
>> this where there's no way for the user to call setlocale before the
>> conversion happens.
>
> Below is a patch that does roughly that (we should get ‘locale_encoding’
> reviewed and perhaps added to Gnulib.)
>
> It solves the problem:
>
> # With the patch.
> $ ./meta/guile -c '(setlocale LC_ALL "en_US.UTF8")(display (command-line))' -- λ
> (/home/ludo/src/guile/libguile/.libs/guile -- λ)
>
> # Previously.
> $ guile -c '(setlocale LC_ALL "en_US.UTF8")(display (command-line))' -- λ
> (guile -- ??)

Looks great, thanks! :)

I have one question though.  You fixed scm_compile_shell_switches, but I
see another place where command-line arguments are converted to Scheme
strings before the user is able to call setlocale: guile.c and init.c.

main (guile.c) calls scm_boot_guile (init.c), which uses
invoke_main_func (init.c), which calls scm_set_program_arguments
(feature.c).  Does this code need to be fixed also?

    Thanks,
      Mark


> diff --git a/libguile/script.c b/libguile/script.c
> index 5e0685a..20d7b9e 100644
> --- a/libguile/script.c
> +++ b/libguile/script.c
> @@ -26,6 +26,7 @@
>  #include <stdio.h>
>  #include <errno.h>
>  #include <ctype.h>
> +#include <uniconv.h>
>  
>  #include "libguile/_scm.h"
>  #include "libguile/eval.h"
> @@ -368,6 +369,74 @@ scm_shell_usage (int fatal, char *message)
>                 : SCM_BOOL_F));
>  }
>  
> +/* Return the name of the locale encoding suggested by environment
> +   variables, even if it's not current, or NULL if no encoding is
> +   defined.  Based on Gnulib's `localcharset.c'.  */
> +static const char *
> +locale_encoding (void)
> +{
> +  const char *locale, *codeset = NULL;
> +
> +  /* Allow user to override the codeset, as set in the operating system,
> +     with standard language environment variables.  */
> +  locale = getenv ("LC_ALL");
> +  if (locale == NULL || locale[0] == '\0')
> +    {
> +      locale = getenv ("LC_CTYPE");
> +      if (locale == NULL || locale[0] == '\0')
> +        locale = getenv ("LANG");
> +    }
> +  if (locale != NULL && locale[0] != '\0')
> +    {
> +      /* If the locale name contains an encoding after the dot, return it.  */
> +      const char *dot = strchr (locale, '.');
> +
> +      if (dot != NULL)
> +        {
> +	  static char buf[2 + 10 + 1];
> +          const char *modifier;
> +
> +          dot++;
> +          /* Look for the possible @... trailer and remove it, if any.  */
> +          modifier = strchr (dot, '@');
> +          if (modifier == NULL)
> +            return dot;
> +          if (modifier - dot < sizeof (buf))
> +            {
> +              memcpy (buf, dot, modifier - dot);
> +              buf [modifier - dot] = '\0';
> +              return buf;
> +            }
> +        }
> +
> +      /* Resolve through the charset.alias file.  */
> +      codeset = locale;
> +    }
> +
> +  return codeset;
> +}
> +
> +/* Return a list of strings from ARGV, which contains ARGC strings
> +   assumed to be encoded in the current locale.  Use `locale_charset'
> +   instead of relying on `scm_from_locale_string' because the user
> +   hasn't had a change to call (setlocale LC_ALL "") yet.  */
> +static SCM
> +locale_arguments_to_string_list (int argc, char **const argv)
> +{
> +  int i;
> +  SCM lst;
> +  const char *encoding;
> +
> +  encoding = locale_encoding ();
> +  for (i = argc - 1, lst = SCM_EOL;
> +       i >= 0;
> +       i--)
> +    lst = scm_cons (scm_from_stringn (argv[i], (size_t) -1, encoding,
> +				      SCM_FAILED_CONVERSION_ESCAPE_SEQUENCE),
> +		    lst);
> +
> +  return lst;
> +}
>  
>  /* Given an array of command-line switches, return a Scheme expression
>     to carry out the actions specified by the switches.
> @@ -378,7 +447,7 @@ scm_compile_shell_switches (int argc, char **argv)
>  {
>    return scm_call_2 (scm_c_public_ref ("ice-9 command-line",
>                                         "compile-shell-switches"),
> -                     scm_makfromstrs (argc, argv),
> +		     locale_arguments_to_string_list (argc, argv),
>                       (scm_usage_name
>                        ? scm_from_locale_string (scm_usage_name)
>                        : scm_from_latin1_string ("guile")));



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] fix locale string reading
  2011-11-15  0:02       ` Mark H Weaver
@ 2011-11-15 23:51         ` Ludovic Courtès
  0 siblings, 0 replies; 14+ messages in thread
From: Ludovic Courtès @ 2011-11-15 23:51 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: guile-devel

[-- Attachment #1: Type: text/plain, Size: 2151 bytes --]

Hi Mark!

Mark H Weaver <mhw@netris.org> skribis:

> ludo@gnu.org (Ludovic Courtès) writes:
>>> I think we should consider decoding the command-line arguments using the
>>> locale specified by the environment variables, at least in cases like
>>> this where there's no way for the user to call setlocale before the
>>> conversion happens.
>>
>> Below is a patch that does roughly that (we should get ‘locale_encoding’
>> reviewed and perhaps added to Gnulib.)
>>
>> It solves the problem:
>>
>> # With the patch.
>> $ ./meta/guile -c '(setlocale LC_ALL "en_US.UTF8")(display (command-line))' -- λ
>> (/home/ludo/src/guile/libguile/.libs/guile -- λ)
>>
>> # Previously.
>> $ guile -c '(setlocale LC_ALL "en_US.UTF8")(display (command-line))' -- λ
>> (guile -- ??)
>
> Looks great, thanks! :)
>
> I have one question though.  You fixed scm_compile_shell_switches, but I
> see another place where command-line arguments are converted to Scheme
> strings before the user is able to call setlocale: guile.c and init.c.
>
> main (guile.c) calls scm_boot_guile (init.c), which uses
> invoke_main_func (init.c), which calls scm_set_program_arguments
> (feature.c).  Does this code need to be fixed also?

Yes, good catch!

An updated patch is attached.  It seems to fulfill its mission:

--8<---------------cut here---------------start------------->8---
# Now:
$ ./meta/guile -c '(setlocale LC_ALL "en_US.UTF8")(display (list (command-line) (program-arguments)))' -- λ 
((/home/ludo/src/guile/libguile/.libs/guile -- λ) (/home/ludo/src/guile/libguile/.libs/guile -- λ))

# Before:
$ guile -c '(setlocale LC_ALL "en_US.UTF8")(display (list (command-line) (program-arguments)))' -- λ
((guile -- ??) (guile -- ??))
--8<---------------cut here---------------end--------------->8---

Note that the code uses SCM_FAILED_CONVERSION_ESCAPE_SEQUENCE, but I
wonder if we couldn’t do better.  For instance, upon conversion failure,
we could pass the argument as a bytevector instead of a string and let
the application cope with it.  OTOH, that would be an API change.

Thoughts?

Thanks,
Ludo’.


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: Type: text/x-patch, Size: 7723 bytes --]

diff --git a/libguile/feature.c b/libguile/feature.c
index 7007403..f3bddc7 100644
--- a/libguile/feature.c
+++ b/libguile/feature.c
@@ -1,5 +1,6 @@
-/* Copyright (C) 1995,1996,1998,1999,2000,2001,2002, 2003, 2004, 2006, 2007, 2009 Free Software Foundation, Inc.
- * 
+/* Copyright (C) 1995, 1996, 1998, 1999, 2000, 2001, 2002, 2003, 2004,
+ *   2006, 2007, 2009, 2011 Free Software Foundation, Inc.
+ *
  * This library is free software; you can redistribute it and/or
  * modify it under the terms of the GNU Lesser General Public License
  * as published by the Free Software Foundation; either version 3 of
@@ -36,7 +37,8 @@
 
 \f
 
-static SCM progargs_fluid;
+SCM scm_program_arguments_fluid;
+
 static SCM features_var;
 
 void
@@ -58,7 +60,7 @@ SCM_DEFINE (scm_program_arguments, "program-arguments", 0, 0, 0,
 	    "options like @code{-e} and @code{-l}.")
 #define FUNC_NAME s_scm_program_arguments
 {
-  return scm_fluid_ref (progargs_fluid);
+  return scm_fluid_ref (scm_program_arguments_fluid);
 }
 #undef FUNC_NAME
 
@@ -74,7 +76,7 @@ scm_set_program_arguments (int argc, char **argv, char *first)
   SCM args = scm_makfromstrs (argc, argv);
   if (first)
     args = scm_cons (scm_from_locale_string (first), args);
-  scm_fluid_set_x (progargs_fluid, args);
+  scm_fluid_set_x (scm_program_arguments_fluid, args);
 }
 
 SCM_DEFINE (scm_set_program_arguments_scm, "set-program-arguments", 1, 0, 0, 
@@ -89,7 +91,7 @@ SCM_DEFINE (scm_set_program_arguments_scm, "set-program-arguments", 1, 0, 0,
 	    "strings within it are copied, so should not be modified later.")
 #define FUNC_NAME s_scm_set_program_arguments_scm
 {
-  return scm_fluid_set_x (progargs_fluid, lst);
+  return scm_fluid_set_x (scm_program_arguments_fluid, lst);
 }
 #undef FUNC_NAME
 
@@ -99,7 +101,7 @@ SCM_DEFINE (scm_set_program_arguments_scm, "set-program-arguments", 1, 0, 0,
 void
 scm_init_feature()
 {
-  progargs_fluid = scm_make_fluid ();
+  scm_program_arguments_fluid = scm_make_fluid ();
 
   features_var = scm_c_define ("*features*", SCM_EOL);
 #ifndef _Windows
diff --git a/libguile/feature.h b/libguile/feature.h
index d373bc7..467f9ed 100644
--- a/libguile/feature.h
+++ b/libguile/feature.h
@@ -3,7 +3,8 @@
 #ifndef SCM_FEATURE_H
 #define SCM_FEATURE_H
 
-/* Copyright (C) 1995,1996,1999,2000,2001, 2006, 2007, 2008 Free Software Foundation, Inc.
+/* Copyright (C) 1995, 1996, 1999, 2000, 2001, 2006, 2007, 2008,
+ *   2011 Free Software Foundation, Inc.
  *
  * This library is free software; you can redistribute it and/or
  * modify it under the terms of the GNU Lesser General Public License
@@ -29,6 +30,8 @@ SCM_API void scm_add_feature (const char* str);
 SCM_API SCM scm_program_arguments (void);
 SCM_API void scm_set_program_arguments (int argc, char **argv, char *first);
 SCM_API SCM scm_set_program_arguments_scm (SCM lst);
+
+SCM_INTERNAL SCM scm_program_arguments_fluid;
 SCM_INTERNAL void scm_init_feature (void);
 
 #endif  /* SCM_FEATURE_H */
diff --git a/libguile/init.c b/libguile/init.c
index 8e3888d..633f8c6 100644
--- a/libguile/init.c
+++ b/libguile/init.c
@@ -332,7 +332,7 @@ invoke_main_func (void *body_data)
 {
   struct main_func_closure *closure = (struct main_func_closure *) body_data;
 
-  scm_set_program_arguments (closure->argc, closure->argv, 0);
+  scm_i_set_boot_program_arguments (closure->argc, closure->argv);
   (*closure->main_func) (closure->closure, closure->argc, closure->argv);
 
   scm_restore_signals ();
diff --git a/libguile/script.c b/libguile/script.c
index 5e0685a..b1d3327 100644
--- a/libguile/script.c
+++ b/libguile/script.c
@@ -26,6 +26,7 @@
 #include <stdio.h>
 #include <errno.h>
 #include <ctype.h>
+#include <uniconv.h>
 
 #include "libguile/_scm.h"
 #include "libguile/eval.h"
@@ -368,6 +369,87 @@ scm_shell_usage (int fatal, char *message)
                : SCM_BOOL_F));
 }
 
+/* Return the name of the locale encoding suggested by environment
+   variables, even if it's not current, or NULL if no encoding is
+   defined.  Based on Gnulib's `localcharset.c'.  */
+static const char *
+locale_encoding (void)
+{
+  static char buf[2 + 10 + 1];
+  const char *locale, *codeset = NULL;
+
+  /* Allow user to override the codeset, as set in the operating system,
+     with standard language environment variables.  */
+  locale = getenv ("LC_ALL");
+  if (locale == NULL || locale[0] == '\0')
+    {
+      locale = getenv ("LC_CTYPE");
+      if (locale == NULL || locale[0] == '\0')
+        locale = getenv ("LANG");
+    }
+  if (locale != NULL && locale[0] != '\0')
+    {
+      /* If the locale name contains an encoding after the dot, return it.  */
+      const char *dot = strchr (locale, '.');
+
+      if (dot != NULL)
+        {
+          const char *modifier;
+
+          dot++;
+          /* Look for the possible @... trailer and remove it, if any.  */
+          modifier = strchr (dot, '@');
+          if (modifier == NULL)
+            return dot;
+          if (modifier - dot < sizeof (buf))
+            {
+              memcpy (buf, dot, modifier - dot);
+              buf [modifier - dot] = '\0';
+              return buf;
+            }
+        }
+      else if (strcmp (locale, "C") == 0)
+	{
+	  strcpy (buf, "ASCII");
+	  return buf;
+	}
+
+      /* Resolve through the charset.alias file.  */
+      codeset = locale;
+    }
+
+  return codeset;
+}
+
+/* Return a list of strings from ARGV, which contains ARGC strings
+   assumed to be encoded in the current locale.  Use `locale_charset'
+   instead of relying on `scm_from_locale_string' because the user
+   hasn't had a change to call (setlocale LC_ALL "") yet.  */
+static SCM
+locale_arguments_to_string_list (int argc, char **const argv)
+{
+  int i;
+  SCM lst;
+  const char *encoding;
+
+  encoding = locale_encoding ();
+  for (i = argc - 1, lst = SCM_EOL;
+       i >= 0;
+       i--)
+    lst = scm_cons (scm_from_stringn (argv[i], (size_t) -1, encoding,
+				      SCM_FAILED_CONVERSION_ESCAPE_SEQUENCE),
+		    lst);
+
+  return lst;
+}
+
+/* Set the value returned by `program-arguments', given ARGC and ARGV.  */
+void
+scm_i_set_boot_program_arguments (int argc, char *argv[])
+{
+  scm_fluid_set_x (scm_program_arguments_fluid,
+		   locale_arguments_to_string_list (argc, argv));
+}
 
 /* Given an array of command-line switches, return a Scheme expression
    to carry out the actions specified by the switches.
@@ -378,7 +460,7 @@ scm_compile_shell_switches (int argc, char **argv)
 {
   return scm_call_2 (scm_c_public_ref ("ice-9 command-line",
                                        "compile-shell-switches"),
-                     scm_makfromstrs (argc, argv),
+		     locale_arguments_to_string_list (argc, argv),
                      (scm_usage_name
                       ? scm_from_locale_string (scm_usage_name)
                       : scm_from_latin1_string ("guile")));
diff --git a/libguile/script.h b/libguile/script.h
index 7e3828a..cf0162a 100644
--- a/libguile/script.h
+++ b/libguile/script.h
@@ -3,7 +3,7 @@
 #ifndef SCM_SCRIPT_H
 #define SCM_SCRIPT_H
 
-/* Copyright (C) 1997,1998,2000, 2006, 2008 Free Software Foundation, Inc.
+/* Copyright (C) 1997,1998,2000, 2006, 2008, 2011 Free Software Foundation, Inc.
  *
  * This library is free software; you can redistribute it and/or
  * modify it under the terms of the GNU Lesser General Public License
@@ -37,6 +37,7 @@ SCM_API void scm_shell_usage (int fatal, char *message);
 SCM_API SCM scm_compile_shell_switches (int argc, char **argv);
 SCM_API void scm_shell (int argc, char **argv);
 SCM_API char *scm_usage_name;
+SCM_INTERNAL void scm_i_set_boot_program_arguments (int argc, char *argv[]);
 SCM_INTERNAL void scm_init_script (void);
 
 #endif  /* SCM_SCRIPT_H */

^ permalink raw reply related	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2011-11-15 23:51 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-11-07 19:12 [PATCH] fix locale string reading Nala Ginrut
2011-11-08 13:31 ` Peter Brett
     [not found]   ` <CAPjoZoewXz+MMMD2N=8gp57AwsNG99UHc9KkhsizdK8U7djLvg@mail.gmail.com>
2011-11-09  2:59     ` Nala Ginrut
2011-11-09  3:11       ` Nala Ginrut
2011-11-09  4:10         ` Nala Ginrut
2011-11-09 10:20           ` Peter Brett
2011-11-09 10:46             ` Nala Ginrut
2011-11-09 13:45               ` Peter Brett
2011-11-10  0:36                 ` Nala Ginrut
2011-11-11 20:16 ` Ludovic Courtès
2011-11-11 23:39   ` Mark H Weaver
2011-11-14 22:42     ` Ludovic Courtès
2011-11-15  0:02       ` Mark H Weaver
2011-11-15 23:51         ` Ludovic Courtès

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).