Hi,

Neil Jerram <neil@ossau.uklinux.net> writes:

> Yes; based on Kevin's and Ludovic's latest emails, I'm happy now with
> the isalpha() solution if we can make it leverage this "i18n"
> classification.

Below is a patch that does what we agreed on: keep using the <ctype.h>
functions for character classification, and recompute the standard
SRFI-14 char sets upon successful `setlocale'.  It also makes char set
computation more efficient and fixes `char-set:punctuation' and
`char-set:symbol' in ASCII.
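
To illustrate the approach (this is only a rough sketch, not the code from
the patch): each set is rebuilt by running a <ctype.h> predicate over all
256 character codes, and the rebuild happens only when `setlocale'
succeeds.  The `charset8' type and all function and variable names below
are made up for the example.

```c
#include <assert.h>
#include <ctype.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical 8-bit char set: member[c] is 1 if code C is in the set.  */
typedef struct { unsigned char member[256]; } charset8;

/* Hypothetical stand-ins for a few of the standard SRFI-14 char sets.  */
static charset8 cs_letter, cs_lower, cs_upper, cs_digit, cs_whitespace;

/* Fill SET with every code in 0..255 for which PRED is true in the
   current locale.  */
static void
fill_charset (charset8 *set, int (*pred) (int))
{
  int c;

  assert (set != NULL && pred != NULL);
  for (c = 0; c < 256; c++)
    set->member[c] = pred (c) != 0;
}

/* Recompute all locale-dependent sets in one go.  */
static void
recompute_charsets (void)
{
  fill_charset (&cs_letter, isalpha);
  fill_charset (&cs_lower, islower);
  fill_charset (&cs_upper, isupper);
  fill_charset (&cs_digit, isdigit);
  fill_charset (&cs_whitespace, isspace);
}

/* Wrapper around `setlocale': the sets are rebuilt only on success, so a
   failed call leaves them consistent with the locale still in effect.  */
static char *
set_locale_and_recompute (int category, const char *name)
{
  char *result = setlocale (category, name);

  if (result != NULL)
    recompute_charsets ();
  return result;
}
```

Calling `set_locale_and_recompute (LC_ALL, "")' would then pick up the
user's locale and refresh the sets in the same step.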

There are still issues.  The bug in `char-set:punctuation' and
`char-set:symbol' I mention above is due to the fact that there is no
<ctype.h> equivalent to those char sets (in particular, `ispunct ()'
does not match `char-set:punctuation').  Fixing it for ASCII was easy,
but it's not so easy for Latin-1.
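
For the ASCII case, the mismatch can be shown with a small sketch (again
not the patch itself; the function names are invented for the example).
In the "C" locale `ispunct ()' accepts 32 characters, which SRFI-14
splits into 23 punctuation characters and the 9 symbols $+<=>^`|~.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* The ASCII characters that SRFI-14 lists under `char-set:symbol'.  */
static const char ascii_symbols[] = "$+<=>^`|~";

/* Hypothetical predicate for the ASCII part of `char-set:symbol'.
   C must be positive so that `strchr' cannot match the terminating NUL.  */
static int
srfi14_ascii_symbol_p (int c)
{
  return c > 0 && c < 128 && strchr (ascii_symbols, c) != NULL;
}

/* Hypothetical predicate for the ASCII part of `char-set:punctuation':
   whatever `ispunct ()' accepts, minus the symbols above.  */
static int
srfi14_ascii_punctuation_p (int c)
{
  return c >= 0 && c < 128 && ispunct (c) && !srfi14_ascii_symbol_p (c);
}

/* Count the ASCII codes for which PRED is true.  */
static int
count_ascii (int (*pred) (int))
{
  int c, n = 0;

  assert (pred != NULL);
  for (c = 0; c < 128; c++)
    if (pred (c))
      n++;
  return n;
}
```

The same subtraction does not carry over to Latin-1, because deciding
which of the upper 128 codes are symbols would require exactly the kind
of charset-specific table discussed below.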

The reason we can hardly get `char-set:punctuation' and
`char-set:symbol' right for Latin-1 is that we don't want to hard-code
too much Latin-1-specific knowledge: one goal is to have SRFI-14 also
provide sensible results for non-Latin-1 8-bit charsets.

With this patch, all standard char sets are those expected by SRFI-14 in
ASCII.  In Latin-1, `char-set:letter', as well as `lower-case',
`upper-case', and `iso-control' are correct (at least, using current
glibc locales), but `punctuation', for instance, is a superset of what
SRFI-14 expects while `symbol' is (correspondingly) a subset of what it
should be, and `blank' lacks the "no-break space" character (#\0240).

I'm not sure we can do much better than that until Guile fully supports
Unicode.  The right solution, in the end, would be to process the whole
`UnicodeData.txt' and generate a character classification strictly
following the SRFI-14 rules.  In the meantime, I think this patch can be
an acceptable solution.

I'd be glad if some of you could test it, and especially run the test
cases.  I added Latin-1-specific test cases, but they require that a
Latin-1 locale be available, and the test suite will try to guess what
its name can be (yes, it looks quite hackish but I couldn't think of
anything better...).
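
The guessing amounts to something like the following C sketch (the
actual test cases are not written this way; the candidate locale names
and the function name are assumptions for the example, and none of the
names is guaranteed to be installed on a given system).

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Assumed candidate names; which ones exist depends on the system.  */
static const char *const latin1_candidates[] = {
  "en_US.ISO-8859-1", "en_US.iso88591",
  "fr_FR.ISO-8859-1", "de_DE.ISO-8859-1"
};

/* Return the first candidate that `setlocale' accepts, or NULL if none
   is available.  The previous LC_CTYPE locale is restored on return.  */
static const char *
find_latin1_locale (void)
{
  const char *current = setlocale (LC_CTYPE, NULL);
  const char *found = NULL;
  char *saved;
  size_t i;

  assert (current != NULL);

  /* Copy the name: later `setlocale' calls may overwrite the string.  */
  saved = malloc (strlen (current) + 1);
  if (saved == NULL)
    return NULL;
  strcpy (saved, current);

  for (i = 0; i < sizeof latin1_candidates / sizeof latin1_candidates[0]; i++)
    if (setlocale (LC_CTYPE, latin1_candidates[i]) != NULL)
      {
        found = latin1_candidates[i];
        break;
      }

  setlocale (LC_CTYPE, saved);
  free (saved);
  return found;
}
```

When the function returns NULL, the Latin-1 tests cannot be run and
should be skipped rather than reported as failures.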

Comments welcome.

Thanks,
Ludovic.