* make check fails if no en_US.iso88591 locale
From: Neil Jerram @ 2009-09-09 0:45 UTC
To: Guile Development

make check fails for me in regexp.test:

  ...
  Running regexp.test
  guile: uncaught throw to unresolved: ()

because I don't have an en_US.iso88591 locale installed, and so

  (with-locale "en_US.iso88591" ...)

throws an 'unresolved exception.

I can allow make check to complete by changing that line to

  (false-if-exception (with-locale "en_US.iso88591"

but I doubt that's the best fix.  Is the "en_US.iso88591" locale
actually important for the enclosed tests?

Thanks,
Neil

* Re: make check fails if no en_US.iso88591 locale
From: Mike Gran @ 2009-09-09 1:28 UTC
To: Neil Jerram, Guile Development

> From: Neil Jerram <neil@ossau.uklinux.net>
>
> make check fails for me in regexp.test:
>
> ...
> Running regexp.test
> guile: uncaught throw to unresolved: ()
>
> because I don't have an en_US.iso88591 locale installed, and so
>
> (with-locale "en_US.iso88591" ...)
>
> throws an 'unresolved exception.

My bad.  Actually, I should have enclosed the 'with-locale' in the
context of a 'pass-if', which would have caught the exception.

> I can allow make check to complete by changing that line to
>
> (false-if-exception (with-locale "en_US.iso88591"
>
> but I doubt that's the best fix.  Is the "en_US.iso88591" locale
> actually important for the enclosed tests?

It is important.  This is one of the problems with the whole Unicode
effort.  There is no Unicode-capable regex library.  The regexp.test
tries matching all bytes from 0 to 255, and it uses scm_to_locale_string
to prep the string for dispatch to the libc regex calls and
scm_from_locale_string to send them back.

If the current locale is C or ASCII, bytes above 127 will cause errors.
If the current locale is UTF-8, bytes above 127 will be converted into
multibyte sequences that won't be matched by the regular expression
being tested.  To pass the test in regexp.test, we need to use the
encoding that maps all of the codepoints 0 to 255 to single-byte
characters, which is ISO-8859-1.

So until a better regex comes along, wrapping regex in an
8-bit-clean-friendly locale like Latin-1 is necessary to avoid encoding
errors when encoding arbitrary 8-bit data like the test does.

The reason why this problem is cropping up now and didn't occur before
is that the old scm_to_locale_string was just a stub that passed
8-bit data through unmodified.

This regex library actually can be used with arbitrary Unicode data
but it takes extra care.  UTF-8 can be used as the locale, and then
the regular expression must be written keeping in mind that each
non-ASCII character is really a multibyte string.

> Thanks,
> Neil

Thanks,
Mike

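The encoding argument above can be checked with a small, locale-independent C sketch. It is only an illustration, not code from Guile or libc: the helper names `encode_latin1` and `encode_utf8_low` are hypothetical, and both handle only code points 0 to 255, which is the range regexp.test exercises.

```c
#include <assert.h>
#include <stddef.h>

/* Encode a code point in the range 0..255 as ISO-8859-1 (Latin-1).
   Every such code point maps to exactly one byte with the same value. */
static size_t
encode_latin1 (unsigned int cp, unsigned char *out)
{
  assert (cp <= 0xFF);
  out[0] = (unsigned char) cp;
  return 1;
}

/* Encode a code point in the range 0..255 as UTF-8.
   0..127 stay a single byte; 128..255 become a two-byte sequence. */
static size_t
encode_utf8_low (unsigned int cp, unsigned char *out)
{
  assert (cp <= 0xFF);
  if (cp < 0x80)
    {
      out[0] = (unsigned char) cp;
      return 1;
    }
  out[0] = (unsigned char) (0xC0 | (cp >> 6));   /* lead byte 110xxxxx */
  out[1] = (unsigned char) (0x80 | (cp & 0x3F)); /* continuation 10xxxxxx */
  return 2;
}
```

For U+00E9 (é), `encode_latin1` yields the single byte 0xE9, while `encode_utf8_low` yields the two bytes 0xC3 0xA9. A pattern written to match one byte per character therefore no longer lines up with the data once the locale is UTF-8.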
* Re: make check fails if no en_US.iso88591 locale
From: Neil Jerram @ 2009-09-09 21:53 UTC
To: Mike Gran; +Cc: Guile Development

Mike Gran <spk121@yahoo.com> writes:

> My bad.  Actually, I should have enclosed the 'with-locale' in the
> context of a 'pass-if', which would have caught the exception.

Yes, but at the cost of not running the tests...

>> I can allow make check to complete by changing that line to
>>
>> (false-if-exception (with-locale "en_US.iso88591"
>>
>> but I doubt that's the best fix.  Is the "en_US.iso88591" locale
>> actually important for the enclosed tests?
>
> It is important.  This is one of the problems with the whole Unicode
> effort.  There is no Unicode-capable regex library.  The regexp.test
> tries matching all bytes from 0 to 255, and it uses scm_to_locale_string
> to prep the string for dispatch to the libc regex calls and
> scm_from_locale_string to send them back.
>
> If the current locale is C or ASCII, bytes above 127 will cause errors.
> If the current locale is UTF-8, bytes above 127 will be converted into
> multibyte sequences that won't be matched by the regular expression
> being tested.  To pass the test in regexp.test, we need to use the
> encoding that maps all of the codepoints 0 to 255 to single-byte
> characters, which is ISO-8859-1.
>
> So until a better regex comes along, wrapping regex in an
> 8-bit-clean-friendly locale like Latin-1 is necessary to avoid encoding
> errors when encoding arbitrary 8-bit data like the test does.
>
> The reason why this problem is cropping up now and didn't occur before
> is that the old scm_to_locale_string was just a stub that passed
> 8-bit data through unmodified.

Thanks for explaining; I think I understand now.  So then Ludovic's
suggestion of with-latin1-locale should work, shouldn't it?

> This regex library actually can be used with arbitrary Unicode data
> but it takes extra care.  UTF-8 can be used as the locale, and then
> the regular expression must be written keeping in mind that each
> non-ASCII character is really a multibyte string.

Can you give an example of what that ("keeping in mind...") means?  Is
it being careful with repetition counts (as in "[a-z]{3}"), for
example?

Thanks,
Neil

* Re: make check fails if no en_US.iso88591 locale
From: Mike Gran @ 2009-09-10 2:36 UTC
To: Neil Jerram; +Cc: Guile Devel

On Wed, 2009-09-09 at 22:53 +0100, Neil Jerram wrote:

> > It is important.  This is one of the problems with the whole Unicode
> > effort.  There is no Unicode-capable regex library.  The regexp.test
> > tries matching all bytes from 0 to 255, and it uses scm_to_locale_string
> > to prep the string for dispatch to the libc regex calls and
> > scm_from_locale_string to send them back.

[...]

> Thanks for explaining; I think I understand now.  So then Ludovic's
> suggestion of with-latin1-locale should work, shouldn't it?

Yeah.  I went with that idea.

> > This regex library actually can be used with arbitrary Unicode data
> > but it takes extra care.  UTF-8 can be used as the locale, and then
> > the regular expression must be written keeping in mind that each
> > non-ASCII character is really a multibyte string.
>
> Can you give an example of what that ("keeping in mind...") means?  Is
> it being careful with repetition counts (as in "[a-z]{3}"), for
> example?

I'm not much of a regex guy, but here are a couple of examples.  First,
one that sort of works as expected.

guile> (string-match "sé" "José")
==> #("José" (2 . 5))

Regex properly matches the word, but the match struct (2 . 5) is
referring to the bytes of the string, not the characters of the string.

Here's one that doesn't work as expected.

guile> (string-match "[:lower:]" "Hi, mom")
==> #("Hi, mom" (5 . 6))
guile> (string-match "[:lower:]" "Hí, móm")
==> #f

Once you add accents on the vowels, nothing matches.

Thanks,
Mike

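The byte-versus-character distinction in the "José" example can be made concrete with a short C sketch. It assumes the string is valid UTF-8 and that the offset falls on a character boundary; the helper name `utf8_byte_to_char_offset` is hypothetical and is not part of Guile or libc.

```c
#include <assert.h>
#include <stddef.h>

/* Count the UTF-8 characters in the first BYTE_OFFSET bytes of S.
   Continuation bytes have the bit pattern 10xxxxxx, so every byte that
   does not match that pattern starts a new character.  Assumes S is
   valid UTF-8 and BYTE_OFFSET falls on a character boundary. */
static size_t
utf8_byte_to_char_offset (const char *s, size_t byte_offset)
{
  size_t chars = 0;
  for (size_t i = 0; i < byte_offset; i++)
    {
      assert (s[i] != '\0');    /* the offset must lie within the string */
      if (((unsigned char) s[i] & 0xC0) != 0x80)
        chars++;
    }
  return chars;
}
```

Applied to the UTF-8 bytes of "José" (J, o, s, 0xC3, 0xA9), byte offset 2 maps to character offset 2 and byte offset 5 maps to character offset 4, so the byte-based match (2 . 5) corresponds to characters (2 . 4).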
* Re: make check fails if no en_US.iso88591 locale
From: Ludovic Courtès @ 2009-09-10 10:27 UTC
To: guile-devel

Hello!

I built today’s ‘master’ on a ppc64 box and there are many
regexp-related errors and a surprisingly high number of unresolved
regexp-related tests:

  http://autobuild.josefsson.org/guile/log-200909100539539848000.txt

This machine only has the following locales:

  C
  en_US.utf8
  POSIX

Thanks,
Ludo’.

* Re: make check fails if no en_US.iso88591 locale
From: Mike Gran @ 2009-09-10 12:44 UTC
To: Ludovic Courtès; +Cc: guile-devel

On Thu, 2009-09-10 at 12:27 +0200, Ludovic Courtès wrote:

> Hello!
>
> I built today’s ‘master’ on a ppc64 box and there are many
> regexp-related errors and a surprisingly high number of unresolved
> regexp-related tests:
>
> http://autobuild.josefsson.org/guile/log-200909100539539848000.txt
>
> This machine only has the following locales:
>
> C
> en_US.utf8
> POSIX

I'm not surprised to see the unresolved tests, since I'd wrapped a lot
of those tests to throw unresolved if a Latin-1 locale wasn't found.
The errors are a surprise: they indicate that my strategy for wrapping
in a Latin-1 locale isn't correct.

The reason for declaring a Latin-1 locale was to allow
scm_to/from_locale_string to convert a Scheme string with values from
0 to 255 to an 8-bit binary C string.  The regexp.test runs on
arbitrary binary data, which wasn't a problem in guile-1.8 since
scm_to/from_locale_string did no real locale conversion.

I could fix the test by testing only characters 0 to 127 in a C locale
if a Latin-1 locale can't be found.

I can also fix the test by using the 'setbinary' function to force the
encodings on stdin and stdout to a default value that will pass through
binary data, instead of calling 'setlocale'.  The procedure 'setbinary'
was always a hack, and I kind of want to get rid of it, but this is why
it was created.

I looked in the POSIX spec on Regex for specific advice on using
128-255 in regex in the C locale.  I didn't see anything offhand.  The
spec does spend a lot of time talking about the interaction between the
locale and regular expressions.  I get the impression from the spec
that using regex on 128-255 in the C locale is an unexpected use of
regular expressions.

Thanks,
Mike

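The fallback idea mentioned above (test only 0 to 127 when no Latin-1 locale is installed) could look roughly like the following C sketch. This is not how Guile's test suite is written, which is in Scheme; the helper names `select_first_available_locale` and `max_testable_byte` are hypothetical, and only the standard `setlocale` call is assumed.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>

/* Try each name in CANDIDATES (a NULL-terminated array) as the LC_CTYPE
   locale and return the first one the system accepts, or NULL if none
   is installed.  On success that locale is left in effect. */
static const char *
select_first_available_locale (const char *const *candidates)
{
  assert (candidates != NULL);
  for (size_t i = 0; candidates[i] != NULL; i++)
    if (setlocale (LC_CTYPE, candidates[i]) != NULL)
      return candidates[i];
  return NULL;
}

/* Highest byte value a test should exercise: the full 8-bit range when
   a Latin-1 locale was found, otherwise only ASCII (0..127). */
static int
max_testable_byte (int have_latin1)
{
  return have_latin1 ? 255 : 127;
}
```

A caller might pass a list such as "en_US.iso88591" (the name used in the thread) followed by other spellings of a Latin-1 locale, which vary between systems and are guesses here, and then fall back to the "C" locale with `max_testable_byte (0)` when the result is NULL.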
* Re: make check fails if no en_US.iso88591 locale
From: Ludovic Courtès @ 2009-09-10 15:33 UTC
To: guile-devel

[-- Attachment #1: Type: text/plain, Size: 2623 bytes --]

Mike Gran <spk121@yahoo.com> writes:

> I could fix the test by testing only characters 0 to 127 in a C locale
> if a Latin-1 locale can't be found.

Yes, that'd be nice.

> I can also fix the test by using the 'setbinary' function

--8<---------------cut here---------------start------------->8---
scheme@(guile-user)> (help setbinary)
`setbinary' is a primitive procedure in the (guile) module.

 -- Scheme Procedure: setbinary
     Sets the encoding for the current input, output, and error ports
     to ISO-8859-1.  That character encoding allows ports to operate on
     binary data.

     It also sets the default encoding for newly created ports to
     ISO-8859-1.

     The previous default encoding for new ports is returned
--8<---------------cut here---------------end--------------->8---

It seems to do a lot of things, which aren't clear from the name.  ;-)

What can be done about it?

At least it should be renamed, to `set-port-binary-mode!' or similar.

Then it'd be nice if that functionality could be split in several
functions, some operating on a per-port basis.  After all, one can
already do:

  (for-each (lambda (p)
              (set-port-encoding! p "ISO-8859-1"))
            (list (current-input-port) (current-output-port)
                  (current-error-port)))

So we just lack:

  ;; encoding for newly created ports
  (set-default-port-encoding! "ISO-8859-1")

With that `setbinary' can be implemented in Scheme.

> to force the encodings on stdin and stdout to a default value that
> will pass through binary data, instead of calling 'setlocale'.

Hmm, I think I'd still prefer `setlocale'.

regexec(3) doesn't say anything about the string encoding.  Do libc
implementations actually expect plain ASCII or Latin-1?  Or do they
adapt to the current locale's encoding?

> I looked in the POSIX spec on Regex for specific advice using 128-255 in
> regex in the C locale.  I didn't see anything offhand.  The spec does
> spend a lot of time talking about the interaction between the locale and
> regular expressions.  I get the impression from the spec that using
> regex on 128-255 in the C locale is an unexpected use of regular
> expressions.

http://www.opengroup.org/onlinepubs/9699919799/functions/regexec.html
reads:

  If, when regexec() is called, the locale is different from when the
  regular expression was compiled, the result is undefined.

It makes me think that, if a process runs with a UTF-8 locale and passes
raw UTF-8 bytes to regcomp(3) and regexec(3), it may work.

Hmm, the program below, with UTF-8-encoded source, works both with a
Latin-1 and a UTF-8 locale:

[-- Attachment #2: Type: text/x-csrc, Size: 295 bytes --]

#include <stdlib.h>
#include <regex.h>
#include <locale.h>

int
main (int argc, char *argv[])
{
  regex_t rx;
  regmatch_t match;

  setlocale (LC_ALL, "fr_FR.utf8");

  regcomp (&rx, "ça", REG_EXTENDED);

  return regexec (&rx, "ça va ?", 1, &match, 0) == 0
    ? EXIT_SUCCESS : EXIT_FAILURE;
}

[-- Attachment #3: Type: text/plain, Size: 89 bytes --]

Do you think it would work to just leave `regexp.test' as it is in 1.8?

Thanks,
Ludo'.

* Re: make check fails if no en_US.iso88591 locale
From: Mike Gran @ 2009-09-11 4:28 UTC
To: Ludovic Courtès; +Cc: Guile Devel

On Thu, 2009-09-10 at 17:33 +0200, Ludovic Courtès wrote:

> Do you think it would work to just leave `regexp.test' as it is in 1.8?

It would probably work, but it offends my sense of aesthetics that the
names of the tests would be displayed in the wrong locale for the
terminal.

I'm uploading yet another attempt at doing the right thing in
regexp.test.  Third time's a charm.

> Thanks,
> Ludo'.

* Re: make check fails if no en_US.iso88591 locale
From: Neil Jerram @ 2009-09-10 19:34 UTC
To: Mike Gran; +Cc: Guile Devel

Mike Gran <spk121@yahoo.com> writes:

> I'm not much of a regex guy, but, here's a couple of examples.  First
> one that sort of works as expected.
>
> guile> (string-match "sé" "José")
> ==> #("José" (2 . 5))
>
> Regex properly matches the word, but, the match struct (2 . 5) is
> referring to the bytes of the string, not the characters of the string.

That's with a UTF-8 locale, isn't it?  With latin-1 I suppose the
numbers would be (2 . 4), right?

> Here's one that doesn't work as expected.
>
> guile> (string-match "[:lower:]" "Hi, mom")
> ==> #("Hi, mom" (5 . 6))
> guile> (string-match "[:lower:]" "Hí, móm")
> ==> #f
>
> Once you add accents on the vowels, nothing matches.
>
> Thanks,

Thank you!  Do you think it would be good to add these examples to the
manual?  (I'm happy to do that if so.)

Neil

* Re: make check fails if no en_US.iso88591 locale
From: Mike Gran @ 2009-09-10 21:17 UTC
To: Neil Jerram; +Cc: Guile Devel

> From: Neil Jerram <neil@ossau.uklinux.net>
> Mike Gran writes:
> > Here's one that doesn't work as expected.
> >
> > guile> (string-match "[:lower:]" "Hi, mom")
> > ==> #("Hi, mom" (5 . 6))
> > guile> (string-match "[:lower:]" "Hí, móm")
> > ==> #f
> >
> > Once you add accents on the vowels, nothing matches.

Doh!  This one doesn't work because it is nonsense.  It should have
been [[:lower:]], not [:lower:].

Thanks,
Mike

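The difference between `[:lower:]` and `[[:lower:]]` can be reproduced directly against the POSIX `regcomp`/`regexec` interface. The sketch below is an illustration only: `first_match` is a hypothetical helper, it runs in whatever locale the process already has (the default "C" locale unless `setlocale` was called), and it assumes a libc that, like the one behind the Guile results quoted above, accepts `[:lower:]` as an ordinary bracket expression listing the characters ':', 'l', 'o', 'w', 'e' and 'r'.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Compile PATTERN as a POSIX extended regex and search TEXT.
   On a match, store the byte offsets of the match in *START and *END
   and return 1; return 0 if nothing matches. */
static int
first_match (const char *pattern, const char *text, int *start, int *end)
{
  regex_t rx;
  regmatch_t m;
  int rc = regcomp (&rx, pattern, REG_EXTENDED);
  assert (rc == 0);             /* this sketch expects a valid pattern */
  rc = regexec (&rx, text, 1, &m, 0);
  regfree (&rx);
  if (rc != 0)
    return 0;
  *start = (int) m.rm_so;
  *end = (int) m.rm_eo;
  return 1;
}
```

With "Hi, mom", the pattern `[:lower:]` first matches the 'o' at bytes 5 to 6, because 'o' is one of the listed characters, whereas `[[:lower:]]` first matches the 'i' at bytes 1 to 2. With the accented "Hí, móm", `[:lower:]` finds none of its six listed characters, which explains the #f result.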
* Re: make check fails if no en_US.iso88591 locale
From: Ludovic Courtès @ 2009-09-09 8:08 UTC
To: guile-devel

Hi,

Neil Jerram <neil@ossau.uklinux.net> writes:

> because I don't have an en_US.iso88591 locale installed, and so
>
> (with-locale "en_US.iso88591" ...)
>
> throws an 'unresolved exception.

I’d suggest using ‘with-latin1-locale’ as in ‘bytevectors.test’ to
mitigate this problem.  (Something akin to Gnulib’s ‘locale-*.m4’ could
be a good starting point, too.)

Thanks,
Ludo’.