unofficial mirror of guile-devel@gnu.org 
* make check fails if no en_US.iso88591 locale
@ 2009-09-09  0:45 Neil Jerram
  2009-09-09  1:28 ` Mike Gran
  2009-09-09  8:08 ` Ludovic Courtès
  0 siblings, 2 replies; 11+ messages in thread
From: Neil Jerram @ 2009-09-09  0:45 UTC (permalink / raw)
  To: Guile Development

make check fails for me in regexp.test:

  ...
  Running regexp.test
  guile: uncaught throw to unresolved: ()

because I don't have an en_US.iso88591 locale installed, and so

  (with-locale "en_US.iso88591" ...)

throws an 'unresolved exception.

I can allow make check to complete by changing that line to

  (false-if-exception (with-locale "en_US.iso88591"

but I doubt that's the best fix.  Is the "en_US.iso88591" locale
actually important for the enclosed tests?

Thanks,
        Neil






* Re: make check fails if no en_US.iso88591 locale
  2009-09-09  0:45 make check fails if no en_US.iso88591 locale Neil Jerram
@ 2009-09-09  1:28 ` Mike Gran
  2009-09-09 21:53   ` Neil Jerram
  2009-09-09  8:08 ` Ludovic Courtès
  1 sibling, 1 reply; 11+ messages in thread
From: Mike Gran @ 2009-09-09  1:28 UTC (permalink / raw)
  To: Neil Jerram, Guile Development

> From: Neil Jerram <neil@ossau.uklinux.net>
> 
> make check fails for me in regexp.test:
> 
>   ...
>   Running regexp.test
>   guile: uncaught throw to unresolved: ()
> 
> because I don't have an en_US.iso88591 locale installed, and so
> 
>   (with-locale "en_US.iso88591" ...)
> 
> throws an 'unresolved exception.
> 

My bad.  Actually, I should have enclosed the 'with-locale' in the
context of a 'pass-if', which would have caught the exception.

> I can allow make check to complete by changing that line to
> 
>   (false-if-exception (with-locale "en_US.iso88591"
> 
> but I doubt that's the best fix.  Is the "en_US.iso88591" locale
> actually important for the enclosed tests?

It is important.  This is one of the problems with the whole Unicode
effort.  There is no Unicode-capable regex library.  The regexp.test
tries matching all bytes from 0 to 255, and it uses scm_to_locale_string
to prep the string for dispatch to the libc regex calls and
scm_from_locale_string to send them back.  

If the current locale is C or ASCII, bytes above 127 will cause errors.
If the current locale is UTF-8, bytes above 127 will be converted into
multibyte sequences that won't be matched by the regular expression
being tested.  To pass the test in regexp.test, we need to use the 
encoding that maps all of the codepoints 0 to 255 to single-byte
characters, which is ISO-8859-1.
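
To make the encoding argument concrete, here is a minimal C sketch (not
taken from Guile; the function names are made up for illustration) of
how a code point between 0 and 255 is laid out in ISO-8859-1 and in
UTF-8.  Code point 233 (é) is the single byte 0xE9 in Latin-1 but the
two bytes 0xC3 0xA9 in UTF-8, which is why a byte-oriented pattern
built for one encoding does not line up with the other.

```c
#include <stddef.h>

/* Encode a code point in the range 0..255 as ISO-8859-1: always one
   byte whose value equals the code point.  Returns the number of bytes
   written to OUT.  */
size_t
encode_latin1 (unsigned int cp, unsigned char *out)
{
  out[0] = (unsigned char) cp;
  return 1;
}

/* Encode a code point in the range 0..255 as UTF-8: one byte below 128,
   two bytes otherwise.  Returns the number of bytes written to OUT,
   which must have room for two bytes.  */
size_t
encode_utf8_below_256 (unsigned int cp, unsigned char *out)
{
  if (cp < 0x80)
    {
      out[0] = (unsigned char) cp;
      return 1;
    }
  out[0] = (unsigned char) (0xC0 | (cp >> 6));   /* lead byte 110xxxxx */
  out[1] = (unsigned char) (0x80 | (cp & 0x3F)); /* continuation 10xxxxxx */
  return 2;
}
```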

So until a better regex comes along, wrapping regex in an
8-bit-clean-friendly locale like Latin-1 is necessary to avoid encoding
errors when encoding arbitrary 8-bit data like the test does.

The reason why this problem is cropping up now and didn't occur before
is because the old scm_to_locale_string was just a stub that passed
8-bit data through unmodified.

This regex library actually can be used with arbitrary Unicode data
but it takes extra care.  UTF-8 can be used as the locale, and then the
regular expression must be written keeping in mind that each non-ASCII
character is really a multibyte string.

> 
> Thanks,
>         Neil

Thanks,

Mike





* Re: make check fails if no en_US.iso88591 locale
  2009-09-09  0:45 make check fails if no en_US.iso88591 locale Neil Jerram
  2009-09-09  1:28 ` Mike Gran
@ 2009-09-09  8:08 ` Ludovic Courtès
  1 sibling, 0 replies; 11+ messages in thread
From: Ludovic Courtès @ 2009-09-09  8:08 UTC (permalink / raw)
  To: guile-devel

Hi,

Neil Jerram <neil@ossau.uklinux.net> writes:

> because I don't have an en_US.iso88591 locale installed, and so
>
>   (with-locale "en_US.iso88591" ...)
>
> throws an 'unresolved exception.

I’d suggest using ‘with-latin1-locale’ as in ‘bytevectors.test’ to
mitigate this problem.

(Something akin to Gnulib’s ‘locale-*.m4’ could be a good starting
point, too.)

Thanks,
Ludo’.






* Re: make check fails if no en_US.iso88591 locale
  2009-09-09  1:28 ` Mike Gran
@ 2009-09-09 21:53   ` Neil Jerram
  2009-09-10  2:36     ` Mike Gran
  0 siblings, 1 reply; 11+ messages in thread
From: Neil Jerram @ 2009-09-09 21:53 UTC (permalink / raw)
  To: Mike Gran; +Cc: Guile Development

Mike Gran <spk121@yahoo.com> writes:

> My bad.  Actually, I should have enclosed the 'with-locale' in the
> context of a 'pass-if', which would have caught the exception.

Yes, but at the cost of not running the tests...

>> I can allow make check to complete by changing that line to
>> 
>>   (false-if-exception (with-locale "en_US.iso88591"
>> 
>> but I doubt that's the best fix.  Is the "en_US.iso88591" locale
>> actually important for the enclosed tests?
>
> It is important.  This is one of the problems with the whole Unicode
> effort.  There is no Unicode-capable regex library.  The regexp.test
> tries matching all bytes from 0 to 255, and it uses scm_to_locale_string
> to prep the string for dispatch to the libc regex calls and
> scm_from_locale_string to send them back.  
>
> If the current locale is C or ASCII, bytes above 127 will cause errors.
> If the current locale is UTF-8, bytes above 127 will be converted into
> multibyte sequences that won't be matched by the regular expression
> being tested.  To pass the test in regexp.test, we need to use the 
> encoding that maps all of the codepoints 0 to 255 to single-byte
> characters, which is ISO-8859-1.
>
> So until a better regex comes along, wrapping regex in an
> 8-bit-clean-friendly locale like Latin-1 is necessary to avoid encoding
> errors when encoding arbitrary 8-bit data like the test does.
>
> The reason why this problem is cropping up now and didn't occur before
> is because the old scm_to_locale_string was just a stub that passed
> 8-bit data through unmodified.

Thanks for explaining; I think I understand now.  So then Ludovic's
suggestion of with-latin1-locale should work, shouldn't it?

> This regex library actually can be used with arbitrary Unicode data
> but it takes extra care.  UTF-8 can be used as the locale, and then the
> regular expression must be written keeping in mind that each non-ASCII
> character is really a multibyte string.

Can you give an example of what that ("keeping in mind...") means?  Is
it being careful with repetition counts (as in "[a-z]{3}"), for
example?
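
One way to make the repetition-count question concrete is the C sketch
below.  It is only an illustration, not Guile code: the helper name
regex_matches is hypothetical, and it assumes the process stays in the
default "C" locale, where the libc matcher works byte by byte (a UTF-8
locale may well behave differently).  Under that assumption a count
such as {2} binds only to the last byte of a UTF-8 encoded "é", so the
character has to be grouped for the count to apply to all of it.

```c
#include <regex.h>
#include <stddef.h>

/* Return 1 if the extended regular expression PATTERN matches somewhere
   in TEXT, 0 if it does not, and -1 if PATTERN fails to compile.  No
   setlocale call is made, so matching happens in the "C" locale.  */
int
regex_matches (const char *pattern, const char *text)
{
  regex_t rx;
  int status;

  if (regcomp (&rx, pattern, REG_EXTENDED | REG_NOSUB) != 0)
    return -1;
  status = regexec (&rx, text, 0, NULL, 0);
  regfree (&rx);
  return status == 0 ? 1 : 0;
}
```

With "é" written as the bytes \xc3\xa9, the pattern "\xc3\xa9{2}" asks
for 0xC3 followed by two 0xA9 bytes and so should not match the text
"\xc3\xa9\xc3\xa9", whereas "(\xc3\xa9){2}" should.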

Thanks,
        Neil





* Re: make check fails if no en_US.iso88591 locale
  2009-09-09 21:53   ` Neil Jerram
@ 2009-09-10  2:36     ` Mike Gran
  2009-09-10 10:27       ` Ludovic Courtès
  2009-09-10 19:34       ` Neil Jerram
  0 siblings, 2 replies; 11+ messages in thread
From: Mike Gran @ 2009-09-10  2:36 UTC (permalink / raw)
  To: Neil Jerram; +Cc: Guile Devel

On Wed, 2009-09-09 at 22:53 +0100, Neil Jerram wrote:
> > It is important.  This is one of the problems with the whole Unicode
> > effort.  There is no Unicode-capable regex library.  The regexp.test
> > tries matching all bytes from 0 to 255, and it uses scm_to_locale_string
> > to prep the string for dispatch to the libc regex calls and
> > scm_from_locale_string to send them back.  

[...]

> Thanks for explaining; I think I understand now.  So then Ludovic's
> suggestion of with-latin1-locale should work, shouldn't it?

Yeah.  I went with that idea.

> 
> > This regex library actually can be used with arbitrary Unicode data
> > but it takes extra care.  UTF-8 can be used as the locale, and then the
> > regular expression must be written keeping in mind that each non-ASCII
> > character is really a multibyte string.
> 
> Can you give an example of what that ("keeping in mind...") means?  Is
> it being careful with repetition counts (as in "[a-z]{3}"), for
> example?

I'm not much of a regex guy, but here are a couple of examples.  First,
one that sort of works as expected.

guile> (string-match "sé" "José") 
==> #("José" (2 . 5))

Regex properly matches the word, but, the match struct (2 . 5) is
referring to the bytes of the string, not the characters of the string.
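
The same byte-versus-character distinction can be sketched directly
against the libc regex calls.  This is an illustration under stated
assumptions rather than what Guile does internally: the helper names
match_byte_offsets and utf8_char_offset are made up, and the strings
are spelled out as UTF-8 bytes so the result does not depend on the
source file encoding.

```c
#include <regex.h>
#include <stddef.h>

/* Count the characters in the first NBYTES bytes of the UTF-8 string S
   by skipping continuation bytes (those of the form 10xxxxxx).  */
size_t
utf8_char_offset (const char *s, size_t nbytes)
{
  size_t chars = 0;
  for (size_t i = 0; i < nbytes; i++)
    if (((unsigned char) s[i] & 0xC0) != 0x80)
      chars++;
  return chars;
}

/* Match PATTERN against TEXT and store the byte offsets of the match in
   *START and *END.  Returns 0 on a match, nonzero otherwise.  */
int
match_byte_offsets (const char *pattern, const char *text,
                    size_t *start, size_t *end)
{
  regex_t rx;
  regmatch_t m;
  int status;

  if (regcomp (&rx, pattern, REG_EXTENDED) != 0)
    return -1;
  status = regexec (&rx, text, 1, &m, 0);
  regfree (&rx);
  if (status == 0)
    {
      *start = (size_t) m.rm_so;   /* rm_so and rm_eo count bytes */
      *end = (size_t) m.rm_eo;
    }
  return status;
}
```

Matching "s\xc3\xa9" against "Jos\xc3\xa9" gives the byte offsets 2 and
5; passing each through utf8_char_offset turns them into the character
offsets 2 and 4.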

Here's one that doesn't work as expected.

guile> (string-match "[:lower:]" "Hi, mom")
==> #("Hi, mom" (5 . 6))
guile> (string-match "[:lower:]" "Hí, móm")
==> #f

Once you add accents on the vowels, nothing matches.

Thanks,

Mike






* Re: make check fails if no en_US.iso88591 locale
  2009-09-10  2:36     ` Mike Gran
@ 2009-09-10 10:27       ` Ludovic Courtès
  2009-09-10 12:44         ` Mike Gran
  2009-09-10 19:34       ` Neil Jerram
  1 sibling, 1 reply; 11+ messages in thread
From: Ludovic Courtès @ 2009-09-10 10:27 UTC (permalink / raw)
  To: guile-devel

Hello!

I built today’s ‘master’ on a ppc64 box and there are many
regexp-related errors and a surprisingly high number of unresolved
regexp-related tests:

  http://autobuild.josefsson.org/guile/log-200909100539539848000.txt

This machine only has the following locales:

  C
  en_US.utf8
  POSIX

Thanks,
Ludo’.






* Re: make check fails if no en_US.iso88591 locale
  2009-09-10 10:27       ` Ludovic Courtès
@ 2009-09-10 12:44         ` Mike Gran
  2009-09-10 15:33           ` Ludovic Courtès
  0 siblings, 1 reply; 11+ messages in thread
From: Mike Gran @ 2009-09-10 12:44 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: guile-devel

On Thu, 2009-09-10 at 12:27 +0200, Ludovic Courtès wrote:
> Hello!
> 
> I built today’s ‘master’ on a ppc64 box and there are many
> regexp-related errors and a surprisingly high number of unresolved
> regexp-related tests:
> 
>   http://autobuild.josefsson.org/guile/log-200909100539539848000.txt
> 
> This machine only has the following locales:
> 
>   C
>   en_US.utf8
>   POSIX
> 

I'm not surprised to see the unresolved, since I'd wrapped a lot of
those tests to throw unresolved if a Latin-1 locale wasn't found.  The
errors are a surprise: they indicate that my strategy for wrapping in a
Latin-1 locale isn't correct.

The reason for declaring a Latin-1 locale was to allow
scm_to/from_locale_string to convert a scheme string with values from 0
to 255 to an 8-bit binary C string.  The regexp.test runs on arbitrary
binary data, which wasn't a problem in guile-1.8 since
scm_to/from_locale_string did no real locale conversion.

I could fix the test by testing only characters 0 to 127 in a C locale
if a Latin-1 locale can't be found.  I can also fix the test by using
the 'setbinary' function to force the encodings on stdin and stdout to a
default value that will pass through binary data, instead of calling
'setlocale'.  The procedure 'setbinary' was always a hack, and I kind of
want to get rid of it, but, this is why it was created.
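
The first option (fall back to characters 0 to 127 when no Latin-1
locale is installed) could look roughly like the C sketch below.  It
is not the code in regexp.test, which is Scheme; the function name
highest_testable_char is hypothetical, and the locale names a caller
passes in are assumptions, since installed names vary between systems.

```c
#include <locale.h>
#include <stddef.h>

/* Try each candidate locale name in turn.  Return 255 if one of them
   can be selected for LC_CTYPE, meaning the whole single-byte range can
   be tested.  Return 127 if none is installed; in that case the "C"
   locale is selected and only ASCII should be tested.  */
int
highest_testable_char (const char *const candidates[], size_t n)
{
  for (size_t i = 0; i < n; i++)
    if (setlocale (LC_CTYPE, candidates[i]) != NULL)
      return 255;

  setlocale (LC_CTYPE, "C");
  return 127;
}
```

A caller would pass something like { "en_US.iso88591",
"en_US.ISO-8859-1" } and loop from 0 up to the returned limit.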

I looked in the POSIX spec on Regex for specific advice using 128-255 in
regex in the C locale.  I didn't see anything offhand.  The spec does
spend a lot of time talking about the interaction between the locale and
regular expressions.  I get the impression from the spec that using
regex on 128-255 in the C locale is an unexpected use of regular
expressions.

Thanks,
Mike






* Re: make check fails if no en_US.iso88591 locale
  2009-09-10 12:44         ` Mike Gran
@ 2009-09-10 15:33           ` Ludovic Courtès
  2009-09-11  4:28             ` Mike Gran
  0 siblings, 1 reply; 11+ messages in thread
From: Ludovic Courtès @ 2009-09-10 15:33 UTC (permalink / raw)
  To: guile-devel

[-- Attachment #1: Type: text/plain, Size: 2623 bytes --]

Mike Gran <spk121@yahoo.com> writes:

> I could fix the test by testing only characters 0 to 127 in a C locale
> if a Latin-1 locale can't be found.

Yes, that'd be nice.

> I can also fix the test by using the 'setbinary' function

--8<---------------cut here---------------start------------->8---
scheme@(guile-user)> (help setbinary)
`setbinary' is a primitive procedure in the (guile) module.

 -- Scheme Procedure: setbinary
     Sets the encoding for the current input, output, and error ports
     to ISO-8859-1.  That character encoding allows ports to operate on
     binary data.

     It also sets the default encoding for newly created ports to
     ISO-8859-1.

     The previous default encoding for new ports is returned
--8<---------------cut here---------------end--------------->8---

It seems to do a lot of things, which aren't clear from the name.  ;-)

What can be done about it?

At least it should be renamed, to `set-port-binary-mode!' or similar.

Then it'd be nice if that functionality could be split in several
functions, some operating on a per-port basis.  After all, one can
already do:

  (for-each (lambda (p)
              (set-port-encoding! p "ISO-8859-1"))
            (list (current-input-port) (current-output-port)
                  (current-error-port)))

So we just lack:

  ;; encoding for newly created ports
  (set-default-port-encoding! "ISO-8859-1")

With that `setbinary' can be implemented in Scheme.

> to force the encodings on stdin and stdout to a default value that
> will pass through binary data, instead of calling 'setlocale'.

Hmm, I think I'd still prefer `setlocale'.

regexec(3) doesn't say anything about the string encoding.  Do libc
implementations actually expect plain ASCII or Latin-1?  Or do they
adapt to the current locale's encoding?

> I looked in the POSIX spec on Regex for specific advice using 128-255 in
> regex in the C locale.  I didn't see anything offhand.  The spec does
> spend a lot of time talking about the interaction between the locale and
> regular expressions.  I get the impression from the spec that using
> regex on 128-255 in the C locale is an unexpected use of regular
> expressions.

http://www.opengroup.org/onlinepubs/9699919799/functions/regexec.html
reads:

  If, when regexec() is called, the locale is different from when the
  regular expression was compiled, the result is undefined.

It makes me think that, if a process runs with a UTF-8 locale and passes
raw UTF-8 bytes to regcomp(3) and regexec(3), it may work.

Hmm, the program below, with UTF-8-encoded source, works both with a
Latin-1 and a UTF-8 locale:


[-- Attachment #2: Type: text/x-csrc, Size: 295 bytes --]

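#include <stdlib.h>
#include <regex.h>
#include <locale.h>

int
main (int argc, char *argv[])
{
  regex_t rx;
  regmatch_t match;
  int status;

  /* Select a UTF-8 locale; if it is not installed, the "C" locale
     stays in effect.  */
  setlocale (LC_ALL, "fr_FR.utf8");

  /* Compile and execute under the same locale, as POSIX requires.  */
  if (regcomp (&rx, "ça", REG_EXTENDED) != 0)
    return EXIT_FAILURE;

  status = regexec (&rx, "ça va ?", 1, &match, 0);
  regfree (&rx);

  return status == 0 ? EXIT_SUCCESS : EXIT_FAILURE;
}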

[-- Attachment #3: Type: text/plain, Size: 89 bytes --]


Do you think it would work to just leave `regexp.test' as it is in 1.8?

Thanks,
Ludo'.


* Re: make check fails if no en_US.iso88591 locale
  2009-09-10  2:36     ` Mike Gran
  2009-09-10 10:27       ` Ludovic Courtès
@ 2009-09-10 19:34       ` Neil Jerram
  2009-09-10 21:17         ` Mike Gran
  1 sibling, 1 reply; 11+ messages in thread
From: Neil Jerram @ 2009-09-10 19:34 UTC (permalink / raw)
  To: Mike Gran; +Cc: Guile Devel

Mike Gran <spk121@yahoo.com> writes:

> I'm not much of a regex guy, but here are a couple of examples.  First,
> one that sort of works as expected.
>
> guile> (string-match "sé" "José") 
> ==> #("José" (2 . 5))
>
> Regex properly matches the word, but, the match struct (2 . 5) is
> referring to the bytes of the string, not the characters of the string.

That's with a UTF-8 locale, isn't it?  With latin-1 I suppose the
numbers would be (2 . 4), right?

> Here's one that doesn't work as expected.
>
> guile> (string-match "[:lower:]" "Hi, mom")
> ==> #("Hi, mom" (5 . 6))
> guile> (string-match "[:lower:]" "Hí, móm")
> ==> #f
>
> Once you add accents on the vowels, nothing matches.
>
> Thanks,

Thank you!  Do you think it would be good to add these examples to the
manual?  (I'm happy to do that if so.)

       Neil








* Re: make check fails if no en_US.iso88591 locale
  2009-09-10 19:34       ` Neil Jerram
@ 2009-09-10 21:17         ` Mike Gran
  0 siblings, 0 replies; 11+ messages in thread
From: Mike Gran @ 2009-09-10 21:17 UTC (permalink / raw)
  To: Neil Jerram; +Cc: Guile Devel

> From: Neil Jerram <neil@ossau.uklinux.net>
> Mike Gran writes:

> > Here's one that doesn't work as expected.
> >
> > guile> (string-match "[:lower:]" "Hi, mom")
> > ==> #("Hi, mom" (5 . 6))
> > guile> (string-match "[:lower:]" "Hí, móm")
> > ==> #f
> >
> > Once you add accents on the vowels, nothing matches.

Doh!  This one doesn't work because it is nonsense.

It should have been [[:lower:]], not [:lower:]
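
For the record, a small C sketch of the difference, using the libc
regex calls directly (the helper name first_match_offset is made up
for illustration).  "[:lower:]" is an ordinary bracket expression
listing the characters ':', 'l', 'o', 'w', 'e' and 'r', which is why
the earlier example reported a match at the 'o' of "mom" and found
nothing once the 'o' became 'ó'.  "[[:lower:]]" is the character class.

```c
#include <regex.h>

/* Return the byte offset at which the extended regular expression
   PATTERN first matches TEXT, or -1 if it does not match or cannot be
   compiled.  */
int
first_match_offset (const char *pattern, const char *text)
{
  regex_t rx;
  regmatch_t m;
  int status;

  if (regcomp (&rx, pattern, REG_EXTENDED) != 0)
    return -1;
  status = regexec (&rx, text, 1, &m, 0);
  regfree (&rx);
  return status == 0 ? (int) m.rm_so : -1;
}
```

Against "Hi, mom", "[:lower:]" first matches at offset 5 (the 'o'),
while "[[:lower:]]" matches at offset 1 (the 'i').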

Thanks,

Mike





* Re: make check fails if no en_US.iso88591 locale
  2009-09-10 15:33           ` Ludovic Courtès
@ 2009-09-11  4:28             ` Mike Gran
  0 siblings, 0 replies; 11+ messages in thread
From: Mike Gran @ 2009-09-11  4:28 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: Guile Devel

On Thu, 2009-09-10 at 17:33 +0200, Ludovic Courtès wrote:

> Do you think it would work to just leave `regexp.test' as it is in 1.8?

It would probably work, but, it offends my sense of aesthetics that the
names of the tests would be displayed in the wrong locale for the
terminal.  I'm uploading yet another attempt at doing the right thing in
regexp.test.  Third time's a charm.

> 
> Thanks,
> Ludo'.




