unofficial mirror of guile-devel@gnu.org 
* `regexp-exec' and non-ascii strings
From: Clinton Ebadi @ 2011-03-06 19:52 UTC
  To: guile-devel


[-- Attachment #1.1: Type: text/plain, Size: 1441 bytes --]


Greetings,

While debugging[0] an issue with Bobot++ (poor sneek!) aborting after
calling scm_regexp_exec on any utf-8 strings I eventually realized
that... the string was actually single-byte encoded internally. After
taking that down the wrong path I eventually tested `regexp-exec' with a
*valid* latin-1 string and that too aborted in `fixup_multibyte_match'.

I have attached a patch that I think is correct. Instead of
unconditionally calling `fixup_multibyte_match' when wchar_t is
available, it first checks whether the Scheme string being matched is
actually a multibyte string. This lets applications that have not set a
string encoding match non-ASCII strings.

If you call `setlocale' with any locale, things sort of work. In the case
of "C", non-ASCII characters are escaped upon read, and in the case of
"latin1", `mbrlen' will not reject the char code (AFAICT; I'm not an
expert in this area).
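
To make the byte-versus-character distinction concrete, here is a minimal
sketch of the kind of offset conversion involved, assuming it is done with
`mbrlen'. The helper name `byte_to_char_offset' is hypothetical and not
part of libguile; it returns -1 in the case where, per the report above,
`fixup_multibyte_match' aborts. Whether a non-ASCII byte is rejected
depends on the current locale and the C library, so the sketch reports
that case instead of assuming an outcome.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper, not part of libguile: convert a byte offset into
   STR to a character offset under the current locale's encoding.
   Returns -1 if the bytes before BYTE_OFFSET do not form a valid
   multibyte sequence in that locale.  */
long
byte_to_char_offset (const char *str, size_t byte_offset)
{
  mbstate_t state;
  size_t len, byte_idx = 0;
  long char_idx = 0;

  assert (str != NULL);
  len = strlen (str);
  if (byte_offset > len)
    byte_offset = len;          /* clamp to the end of the string */
  memset (&state, 0, sizeof state);

  while (byte_idx < byte_offset)
    {
      /* Length in bytes of the next character, looking no further
         than the end of the string.  */
      size_t nbytes = mbrlen (str + byte_idx, len - byte_idx, &state);

      if (nbytes == (size_t) -1 || nbytes == (size_t) -2)
        return -1;              /* invalid or truncated sequence */
      if (nbytes == 0)
        break;                  /* unexpected NUL; stop counting */

      byte_idx += nbytes;
      char_idx++;
    }
  return char_idx;
}
```

For a pure-ASCII string the two offsets coincide, so
`byte_to_char_offset ("abc", 2)' is 2. For a string containing byte 200
and no `setlocale' call, the result is either -1 or a character count,
depending on how the C library treats high bytes in the "C" locale.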

Unfortunately this means I don't see an easy way to write a test for the
suite--it only happens in the case where the locale is "C" and no port
encoder is set. <http://paste.lisp.org/display/120245#5> is what I was
going for and will show the bug if run by hand.

I'm not entirely certain this is the *correct* solution, but I think it
should be--it seems bad to abort() applications that use regexps but
haven't set their locale yet!

(My papers for Guile are on file AFAIK FWIW)

[0] http://paste.lisp.org/display/120245


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1.2: 0001-2011-03-05-Clinton-Ebadi-clinton-unknownlamer.org.patch --]
[-- Type: text/x-diff, Size: 914 bytes --]

From 61900d7e93780dd9d7d6db02fe3ad07a72a8a45b Mon Sep 17 00:00:00 2001
From: Clinton Ebadi <clinton@unknownlamer.org>
Date: Sat, 5 Mar 2011 23:44:23 -0500
Subject: [PATCH] 2011-03-05  Clinton Ebadi  <clinton@unknownlamer.org>

	* libguile/regex-posix.c (scm_regexp_exec): Only fixup byte to
	character offset when the string is actually multibyte encoded.
---
 libguile/regex-posix.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/libguile/regex-posix.c b/libguile/regex-posix.c
index 3423099..db76e36 100644
--- a/libguile/regex-posix.c
+++ b/libguile/regex-posix.c
@@ -305,7 +305,7 @@ SCM_DEFINE (scm_regexp_exec, "regexp-exec", 2, 2, 0,
 		    scm_to_int (flags));
 
 #ifdef HAVE_WCHAR_H
-  if (!status)
+  if ((!status) && (scm_to_int (scm_string_bytes_per_char (substr)) > 1))
     fixup_multibyte_match (matches, nmatches, c_str);
 #endif
 
-- 
1.6.6.1


[-- Attachment #1.3: Type: text/plain, Size: 70 bytes --]


-- 
Jessie: but today i was a nerd
Jessie: i even read slashdot.

[-- Attachment #2: Type: application/pgp-signature, Size: 229 bytes --]


* Re: `regexp-exec' and non-ascii strings
From: Andy Wingo @ 2011-03-17 17:54 UTC
  To: Clinton Ebadi; +Cc: guile-devel

On Sun 06 Mar 2011 20:52, Clinton Ebadi <clinton@unknownlamer.org> writes:

> While debugging[0] an issue with Bobot++ (poor sneek!) aborting after
> calling scm_regexp_exec on any utf-8 strings I eventually realized
> that... the string was actually single-byte encoded internally. After
> taking that down the wrong path I eventually tested `regexp-exec' with a
> *valid* latin-1 string and that too aborted in `fixup_multibyte_match'.

This was actually due to scm_to_locale_string() not producing a valid
locale string.  Having fixed that, I verified that

  meta/guile -c '(display (regexp-exec (make-regexp "(.)(.)(.)") (string (integer->char 200) (integer->char 201) (integer->char 202))))'

no longer triggers the abort.
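
As a rough illustration of what "a valid locale string" means here, the
following sketch checks whether a C string decodes cleanly in the current
locale. It is not the fix that was applied to scm_to_locale_string(); the
helper name `is_valid_locale_string' is hypothetical and the approach
(counting with `mbsrtowcs' and a NULL destination) is only one way to
test the property.

```c
#include <string.h>
#include <wchar.h>

/* Hypothetical helper, not part of libguile: return 1 if STR decodes
   cleanly as a multibyte string in the current locale, 0 otherwise.  */
int
is_valid_locale_string (const char *str)
{
  mbstate_t state;
  const char *p = str;

  memset (&state, 0, sizeof state);
  /* With a NULL destination, mbsrtowcs only counts the wide characters
     and returns (size_t) -1 if it meets an invalid sequence.  */
  return mbsrtowcs (NULL, &p, 0, &state) != (size_t) -1;
}
```

A string that passes this check can be walked with `mbrlen' without
hitting the error return that leads to the abort discussed in this
thread.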

Thanks for the report,

Andy
-- 
http://wingolog.org/


