* `regexp-exec' and non-ascii strings
@ 2011-03-06 19:52 Clinton Ebadi
2011-03-17 17:54 ` Andy Wingo
0 siblings, 1 reply; 2+ messages in thread
From: Clinton Ebadi @ 2011-03-06 19:52 UTC (permalink / raw)
To: guile-devel
Greetings,
While debugging[0] an issue with Bobot++ (poor sneek!) aborting after
calling scm_regexp_exec on any utf-8 strings I eventually realized
that... the string was actually single-byte encoded internally. After
taking that down the wrong path I eventually tested `regexp-exec' with a
*valid* latin-1 string and that too aborted in `fixup_multibyte_match'.
I have attached a patch that I think is correct. Rather than
unconditionally calling `fixup_multibyte_match' whenever wchar_t is
available, it first checks whether the Scheme string being matched is
actually a multibyte string. This lets applications that set no string
encoding still match non-ASCII strings.
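To make the byte-versus-character distinction concrete, here is a minimal sketch of the kind of conversion the fixup step has to perform: walking a locale-encoded C string with `mbrlen' and translating a byte offset (as reported by POSIX regexec) into a character offset. The function name `char_offset_of_byte' is hypothetical and this is not the code in libguile/regex-posix.c; in particular it returns -1 where the real function aborts.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper, not Guile's fixup_multibyte_match.
   Returns the character index that corresponds to BYTE_OFF in the
   locale-encoded string STR, or -1 if STR is not valid in the current
   locale or BYTE_OFF does not fall on a character boundary.  */
long
char_offset_of_byte (const char *str, size_t byte_off)
{
  mbstate_t state;
  size_t byte_idx = 0;
  long char_idx = 0;

  assert (str != NULL);
  memset (&state, 0, sizeof state);

  while (byte_idx < byte_off)
    {
      /* Length in bytes of the next character under LC_CTYPE.  */
      size_t nbytes = mbrlen (str + byte_idx, MB_LEN_MAX, &state);

      if (nbytes == (size_t) -1 || nbytes == (size_t) -2)
        return -1;              /* invalid or incomplete sequence */
      if (nbytes == 0)
        return -1;              /* reached the NUL before BYTE_OFF */

      byte_idx += nbytes;
      char_idx++;
    }

  return byte_idx == byte_off ? char_idx : -1;
}
```

For a pure ASCII string the two offsets coincide, so the conversion only changes anything when some character occupies more than one byte in the current locale.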
If you call `setlocale' with any locale, things sort of work. In the
case of "C", non-ASCII characters are escaped upon read, and in the case
of "latin1", `mbrlen' will not reject the char code (AFAICT, I'm not an
expert in this area).
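One way to probe the locale dependence described above outside of Guile is a small C check that selects a locale and asks `mbrlen' whether a byte string decodes under it. This is only a sketch: `valid_in_locale' is a hypothetical name, and note that it changes the process-wide LC_CTYPE setting as a side effect.

```c
#include <assert.h>
#include <limits.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: select LOCALE_NAME for LC_CTYPE, then report
   whether STR decodes cleanly under it.
   Returns 1 if valid, 0 if mbrlen rejects some byte sequence, and -1 if
   the locale could not be selected.  */
int
valid_in_locale (const char *locale_name, const char *str)
{
  mbstate_t state;
  size_t nbytes;

  assert (locale_name != NULL && str != NULL);

  if (setlocale (LC_CTYPE, locale_name) == NULL)
    return -1;                  /* locale not available on this system */

  memset (&state, 0, sizeof state);
  while ((nbytes = mbrlen (str, MB_LEN_MAX, &state)) != 0)
    {
      if (nbytes == (size_t) -1 || nbytes == (size_t) -2)
        return 0;               /* invalid or incomplete sequence */
      str += nbytes;
    }
  return 1;
}
```

Under a Latin-1 locale the bytes 200, 201, 202 should each be accepted as one character, while a UTF-8 locale should reject them as an invalid sequence. What the "C" locale does with bytes above 127 differs between C libraries, so that case is worth checking on the system at hand.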
Unfortunately this means I don't see an easy way to write a test for the
suite--it only happens in the case where the locale is "C" and no port
encoder is set. <http://paste.lisp.org/display/120245#5> is what I was
going for and will show the bug if run by hand.
I'm not entirely certain this is the *correct* solution, but I think it
should be--it seems bad to abort() applications that use regexps but
haven't set their locale yet!
(My papers for Guile are on file AFAIK FWIW)
[0] http://paste.lisp.org/display/120245
[-- Attachment #1.2: 0001-2011-03-05-Clinton-Ebadi-clinton-unknownlamer.org.patch --]
[-- Type: text/x-diff, Size: 914 bytes --]
From 61900d7e93780dd9d7d6db02fe3ad07a72a8a45b Mon Sep 17 00:00:00 2001
From: Clinton Ebadi <clinton@unknownlamer.org>
Date: Sat, 5 Mar 2011 23:44:23 -0500
Subject: [PATCH] 2011-03-05 Clinton Ebadi <clinton@unknownlamer.org>
* libguile/regex-posix.c (scm_regexp_exec): Only fixup byte to
character offset when the string is actually multibyte encoded.
---
libguile/regex-posix.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/libguile/regex-posix.c b/libguile/regex-posix.c
index 3423099..db76e36 100644
--- a/libguile/regex-posix.c
+++ b/libguile/regex-posix.c
@@ -305,7 +305,7 @@ SCM_DEFINE (scm_regexp_exec, "regexp-exec", 2, 2, 0,
scm_to_int (flags));
#ifdef HAVE_WCHAR_H
- if (!status)
+ if ((!status) && (scm_to_int (scm_string_bytes_per_char (substr)) > 1))
fixup_multibyte_match (matches, nmatches, c_str);
#endif
--
1.6.6.1
* Re: `regexp-exec' and non-ascii strings
2011-03-06 19:52 `regexp-exec' and non-ascii strings Clinton Ebadi
@ 2011-03-17 17:54 ` Andy Wingo
0 siblings, 0 replies; 2+ messages in thread
From: Andy Wingo @ 2011-03-17 17:54 UTC (permalink / raw)
To: Clinton Ebadi; +Cc: guile-devel
On Sun 06 Mar 2011 20:52, Clinton Ebadi <clinton@unknownlamer.org> writes:
> While debugging[0] an issue with Bobot++ (poor sneek!) aborting after
> calling scm_regexp_exec on any utf-8 strings I eventually realized
> that... the string was actually single-byte encoded internally. After
> taking that down the wrong path I eventually tested `regexp-exec' with a
> *valid* latin-1 string and that too aborted in `fixup_multibyte_match'.
This was actually due to scm_to_locale_string() not producing a valid
locale string. Having fixed that, I verified that
meta/guile -c '(display (regexp-exec (make-regexp "(.)(.)(.)") (string (integer->char 200) (integer->char 201) (integer->char 202))))'
no longer triggers the abort.
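For comparison, a rough C analogue of the same three-group match, calling POSIX regcomp/regexec directly rather than going through Guile. It is a sketch only, with the hypothetical helper name `match_offsets', and it shows that the offsets regexec hands back are byte offsets into the C string, which is why a conversion to character offsets is needed afterwards for multibyte input.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: compile PATTERN as a POSIX extended regexp and
   match it against STR, storing up to NMATCHES results in MATCHES.
   Returns 0 on a match, -1 if the pattern does not compile, and the
   nonzero regexec status (e.g. REG_NOMATCH) otherwise.  */
int
match_offsets (const char *pattern, const char *str,
               regmatch_t *matches, size_t nmatches)
{
  regex_t rx;
  int status;

  assert (pattern != NULL && str != NULL);

  if (regcomp (&rx, pattern, REG_EXTENDED) != 0)
    return -1;

  /* rm_so/rm_eo in MATCHES are byte offsets into STR.  */
  status = regexec (&rx, str, nmatches, matches, 0);
  regfree (&rx);
  return status;
}
```

With an ASCII subject such as "abc", the whole match spans bytes 0 to 3 and each group covers a single byte. With a multibyte subject the same fields would count bytes, not characters.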
Thanks for the report,
Andy
--
http://wingolog.org/