* [Patch] SRFI-13 string-tokenize is wrong
@ 2002-03-12 17:35 Matthias Koeppe
2002-04-24 19:58 ` Marius Vollmer
0 siblings, 1 reply; 6+ messages in thread
From: Matthias Koeppe @ 2002-03-12 17:35 UTC (permalink / raw)
Cc: guile-devel, haus
[-- Attachment #1: Type: text/plain, Size: 1539 bytes --]
Hi,
the Guile implementation of SRFI-13 `string-tokenize' gets the meaning
of the `token-set' argument wrong.
Quoting the SRFI:
| string-tokenize s [token-set start end] -> list
|
| Split the string s into a list of substrings, where each substring
| is a maximal non-empty contiguous sequence of characters from the
| character set token-set.
|
| * token-set defaults to char-set:graphic (see SRFI 14 for more
| on character sets and char-set:graphic).
|
| [...]
|
| (string-tokenize "Help make programs run, run, RUN!")
| => ("Help" "make" "programs" "run," "run," "RUN!")
In Guile (1.5 branch):
(string-tokenize "Help make programs run, run, RUN!")
=> ("Help" "make" "programs" "run," "run," "RUN!") ; OK
but:
(string-tokenize "Help make programs run, run, RUN!" char-set:graphic)
=> (" " " " " " " " " ") ; WRONG
The corresponding tests in srfi-13.test are also wrong.
I suggest fixing this bug in both the stable and the unstable branch,
so that incorrect uses of `string-tokenize' in user code are avoided.
The attached patch fixes the bug and also removes the Guile-specific
extension of `string-tokenize' to accept a character as the
`token-set' argument because it is inconsistent with both the
Guile-specific procedure documentation and with the correct behavior
of `string-tokenize' when a character set is passed as `token-set'.
--
Matthias Köppe -- http://www.math.uni-magdeburg.de/~mkoeppe
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: Type: text/x-patch, Size: 2529 bytes --]
--- srfi-13.c.~1.11.2.4.~ Tue Mar 12 17:03:03 2002
+++ srfi-13.c Tue Mar 12 18:03:23 2002
@@ -2798,13 +2798,14 @@
SCM_DEFINE (scm_string_tokenize, "string-tokenize", 1, 3, 0,
- (SCM s, SCM token_char, SCM start, SCM end),
+ (SCM s, SCM token_set, SCM start, SCM end),
"Split the string @var{s} into a list of substrings, where each\n"
"substring is a maximal non-empty contiguous sequence of\n"
- "characters equal to the character @var{token_char}, or\n"
- "whitespace, if @var{token_char} is not given. If\n"
- "@var{token_char} is a character set, it is used for finding the\n"
- "token borders.")
+ "characters from the character set @var{token_set}, which\n"
+ "defaults to an equivalent of @code{char-set:graphic}.\n"
+ "If @var{start} or @var{end} indices are provided, they restrict\n"
+ "@code{string-tokenize} to operating on the indicated substring\n"
+ "of @var{s}.")
#define FUNC_NAME s_scm_string_tokenize
{
char * cstr;
@@ -2814,7 +2815,7 @@
SCM_VALIDATE_SUBSTRING_SPEC_COPY (1, s, cstr,
3, start, cstart,
4, end, cend);
- if (SCM_UNBNDP (token_char))
+ if (SCM_UNBNDP (token_set))
{
int idx;
@@ -2838,7 +2839,7 @@
result = scm_cons (scm_mem2string (cstr + cend, idx - cend), result);
}
}
- else if (SCM_CHARSETP (token_char))
+ else if (SCM_CHARSETP (token_set))
{
int idx;
@@ -2846,7 +2847,7 @@
{
while (cstart < cend)
{
- if (!SCM_CHARSET_GET (token_char, cstr[cend - 1]))
+ if (SCM_CHARSET_GET (token_set, cstr[cend - 1]))
break;
cend--;
}
@@ -2855,41 +2856,14 @@
idx = cend;
while (cstart < cend)
{
- if (SCM_CHARSET_GET (token_char, cstr[cend - 1]))
- break;
- cend--;
- }
- result = scm_cons (scm_mem2string (cstr + cend, idx - cend), result);
- }
- }
- else
- {
- int idx;
- char chr;
-
- SCM_VALIDATE_CHAR (2, token_char);
- chr = SCM_CHAR (token_char);
-
- while (cstart < cend)
- {
- while (cstart < cend)
- {
- if (cstr[cend - 1] != chr)
- break;
- cend--;
- }
- if (cstart >= cend)
- break;
- idx = cend;
- while (cstart < cend)
- {
- if (cstr[cend - 1] == chr)
+ if (!SCM_CHARSET_GET (token_set, cstr[cend - 1]))
break;
cend--;
}
result = scm_cons (scm_mem2string (cstr + cend, idx - cend), result);
}
}
+ else SCM_WRONG_TYPE_ARG (2, token_set);
return result;
}
#undef FUNC_NAME
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [Patch] SRFI-13 string-tokenize is wrong
2002-03-12 17:35 [Patch] SRFI-13 string-tokenize is wrong Matthias Koeppe
@ 2002-04-24 19:58 ` Marius Vollmer
2002-04-26 8:27 ` Matthias Koeppe
0 siblings, 1 reply; 6+ messages in thread
From: Marius Vollmer @ 2002-04-24 19:58 UTC (permalink / raw)
Cc: Martin Grabmueller, guile-devel, haus
Matthias Koeppe <mkoeppe@mail.Math.Uni-Magdeburg.De> writes:
> Hi,
>
> the Guile implementation of SRFI-13 `string-tokenize' gets the meaning
> of the `token-set' argument wrong.
Yep, good catch. However, your patch did not make an unspecified
token-set equivalent to char-set:graphic, but to (char-set-complement
char-set:whitespace).
Could you update your patch accordingly?
Thanks!
> I suggest fixing this bug in both the stable and the unstable branch,
> so that incorrect uses of `string-tokenize' in user code are avoided.
Yes, I have recorded this as bug 'string-tokenize-is-wrong' and tagged
it as release critical.
(I know this might be controversial. My argument is that in the rare
case where we do have a standard to conform to, and we violate it so
spectacularly, and the fix is so easy, then it is release critical.)
_______________________________________________
Guile-devel mailing list
Guile-devel@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-devel
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [Patch] SRFI-13 string-tokenize is wrong
2002-04-24 19:58 ` Marius Vollmer
@ 2002-04-26 8:27 ` Matthias Koeppe
2002-04-26 18:18 ` Marius Vollmer
0 siblings, 1 reply; 6+ messages in thread
From: Matthias Koeppe @ 2002-04-26 8:27 UTC (permalink / raw)
Cc: Martin Grabmueller, guile-devel, haus
Marius Vollmer <mvo@zagadka.ping.de> writes:
> Matthias Koeppe <mkoeppe@mail.Math.Uni-Magdeburg.De> writes:
>> the Guile implementation of SRFI-13 `string-tokenize' gets the meaning
>> of the `token-set' argument wrong.
>
> Yep, good catch. However, your patch did not make an unspecified
> token-set equivalent to char-set:graphic, but to (char-set-complement
> char-set:whitespace).
Thanks for pointing this out. I've updated the patch; see below.
Now this works correctly:
(string-tokenize "Help make programs\arun, run,\nRUN!")
==> ("Help" "make" "programs" "run," "run," "RUN!")
-- Matthias
Index: srfi-13.c
===================================================================
RCS file: /cvs/guile/guile-core/srfi/srfi-13.c,v
retrieving revision 1.11.2.5
diff -u -u -r1.11.2.5 srfi-13.c
--- srfi-13.c 14 Mar 2002 05:32:48 -0000 1.11.2.5
+++ srfi-13.c 26 Apr 2002 08:20:04 -0000
@@ -2798,13 +2798,14 @@
SCM_DEFINE (scm_string_tokenize, "string-tokenize", 1, 3, 0,
- (SCM s, SCM token_char, SCM start, SCM end),
+ (SCM s, SCM token_set, SCM start, SCM end),
"Split the string @var{s} into a list of substrings, where each\n"
"substring is a maximal non-empty contiguous sequence of\n"
- "characters equal to the character @var{token_char}, or\n"
- "whitespace, if @var{token_char} is not given. If\n"
- "@var{token_char} is a character set, it is used for finding the\n"
- "token borders.")
+ "characters from the character set @var{token_set}, which\n"
+ "defaults to an equivalent of @code{char-set:graphic}.\n"
+ "If @var{start} or @var{end} indices are provided, they restrict\n"
+ "@code{string-tokenize} to operating on the indicated substring\n"
+ "of @var{s}.")
#define FUNC_NAME s_scm_string_tokenize
{
char * cstr;
@@ -2814,7 +2815,7 @@
SCM_VALIDATE_SUBSTRING_SPEC_COPY (1, s, cstr,
3, start, cstart,
4, end, cend);
- if (SCM_UNBNDP (token_char))
+ if (SCM_UNBNDP (token_set))
{
int idx;
@@ -2822,7 +2823,7 @@
{
while (cstart < cend)
{
- if (!isspace (cstr[cend - 1]))
+ if (isgraph (cstr[cend - 1]))
break;
cend--;
}
@@ -2831,14 +2832,14 @@
idx = cend;
while (cstart < cend)
{
- if (isspace (cstr[cend - 1]))
+ if (!isgraph (cstr[cend - 1]))
break;
cend--;
}
result = scm_cons (scm_mem2string (cstr + cend, idx - cend), result);
}
}
- else if (SCM_CHARSETP (token_char))
+ else if (SCM_CHARSETP (token_set))
{
int idx;
@@ -2846,7 +2847,7 @@
{
while (cstart < cend)
{
- if (!SCM_CHARSET_GET (token_char, cstr[cend - 1]))
+ if (SCM_CHARSET_GET (token_set, cstr[cend - 1]))
break;
cend--;
}
@@ -2855,41 +2856,14 @@
idx = cend;
while (cstart < cend)
{
- if (SCM_CHARSET_GET (token_char, cstr[cend - 1]))
- break;
- cend--;
- }
- result = scm_cons (scm_mem2string (cstr + cend, idx - cend), result);
- }
- }
- else
- {
- int idx;
- char chr;
-
- SCM_VALIDATE_CHAR (2, token_char);
- chr = SCM_CHAR (token_char);
-
- while (cstart < cend)
- {
- while (cstart < cend)
- {
- if (cstr[cend - 1] != chr)
- break;
- cend--;
- }
- if (cstart >= cend)
- break;
- idx = cend;
- while (cstart < cend)
- {
- if (cstr[cend - 1] == chr)
+ if (!SCM_CHARSET_GET (token_set, cstr[cend - 1]))
break;
cend--;
}
result = scm_cons (scm_mem2string (cstr + cend, idx - cend), result);
}
}
+ else SCM_WRONG_TYPE_ARG (2, token_set);
return result;
}
#undef FUNC_NAME
--
Matthias Koeppe -- http://www.math.uni-magdeburg.de/~mkoeppe
_______________________________________________
Guile-devel mailing list
Guile-devel@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-devel
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [Patch] SRFI-13 string-tokenize is wrong
2002-04-26 8:27 ` Matthias Koeppe
@ 2002-04-26 18:18 ` Marius Vollmer
2002-04-29 9:21 ` Matthias Koeppe
0 siblings, 1 reply; 6+ messages in thread
From: Marius Vollmer @ 2002-04-26 18:18 UTC (permalink / raw)
Cc: Martin Grabmueller, guile-devel, haus
Matthias Koeppe <mkoeppe@mail.Math.Uni-Magdeburg.De> writes:
> Thanks for pointing this out. I've updated the patch; see below.
Thanks; and sorry for being nitpicky: can we be sure that isgraphic is
the same as charset:graphic?
I'll apply the patch regardless.
_______________________________________________
Guile-devel mailing list
Guile-devel@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-devel
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [Patch] SRFI-13 string-tokenize is wrong
2002-04-26 18:18 ` Marius Vollmer
@ 2002-04-29 9:21 ` Matthias Koeppe
2002-05-06 18:54 ` Marius Vollmer
0 siblings, 1 reply; 6+ messages in thread
From: Matthias Koeppe @ 2002-04-29 9:21 UTC (permalink / raw)
Cc: Martin Grabmueller, guile-devel, haus
Marius Vollmer <mvo@zagadka.ping.de> writes:
> Thanks; and sorry for being nitpicky: can we be sure that isgraphic is
> the same as charset:graphic?
We can't. That's why I wrote that TOKEN_SET defaults to "an
equivalent" of CHAR-SET:GRAPHIC.
The whole internationalization stuff is, of course, broken. Some
Guile functions depend on the current locale setting; others depend on
the locale setting at load time; others silently do ASCII only. This
clearly needs to be worked on, but I don't think STRING-TOKENIZE would
be the place to start.
BTW, when I tried to make an example of the described behavior, I got
a segmentation fault caused by an array being indexed by a signed
char (on Solaris 2.7 with the Forte compiler):
(use-modules (srfi srfi-13) (srfi srfi-14))
(string-tokenize "charsetsäareäfun" char-set:graphic)
==> segfault
Here is a fix:
--- srfi-14.h.~1.3.2.6.~ Tue Sep 25 13:00:41 2001
+++ srfi-14.h Mon Apr 29 11:13:03 2002
@@ -48,15 +48,15 @@
#define SCM_CHARSET_SIZE 256
-/* We expect 8-bit bytes here. Shoule be no problem in the year
+/* We expect 8-bit bytes here. Should be no problem in the year
2001. */
#ifndef SCM_BITS_PER_LONG
# define SCM_BITS_PER_LONG (sizeof (long) * 8)
#endif
#define SCM_CHARSET_GET(cs, idx) (((long *) SCM_SMOB_DATA (cs))\
- [(idx) / SCM_BITS_PER_LONG] &\
- (1L << ((idx) % SCM_BITS_PER_LONG)))
+ [((unsigned char) (idx)) / SCM_BITS_PER_LONG] &\
+ (1L << (((unsigned char) (idx)) % SCM_BITS_PER_LONG)))
#define SCM_CHARSETP(x) (!SCM_IMP (x) && (SCM_TYP16 (x) == scm_tc16_charset))
--
Matthias Köppe -- http://www.math.uni-magdeburg.de/~mkoeppe
_______________________________________________
Guile-devel mailing list
Guile-devel@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-devel
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [Patch] SRFI-13 string-tokenize is wrong
2002-04-29 9:21 ` Matthias Koeppe
@ 2002-05-06 18:54 ` Marius Vollmer
0 siblings, 0 replies; 6+ messages in thread
From: Marius Vollmer @ 2002-05-06 18:54 UTC (permalink / raw)
Cc: Martin Grabmueller, guile-devel, haus
Matthias Koeppe <mkoeppe@mail.Math.Uni-Magdeburg.De> writes:
> Marius Vollmer <mvo@zagadka.ping.de> writes:
>
> > Thanks; and sorry for being nitpicky: can we be sure that isgraphic is
> > the same as charset:graphic?
>
> We can't. That's why I wrote that TOKEN_SET defaults to "an
> equivalent" of CHAR-SET:GRAPHIC.
Err, I meant "can we be sure that isgraphic is an equivalent of
charset:graphic?". I suppose it would be better to use the real
charset:graphic but we would have to import it into C land. (time
passes.) I have done this now in unstable, but not in stable.
> The whole internationalization stuff is, of course, broken.
Yeah, true.
> BTW, when I tried to make an example of the described behavior, I
> got a segmentation fault caused by an array being indexed by a
> signed char (on Solaris 2.7 with the Forte compiler):
Yes, we hit on a couple of these already.
> Here is a fix:
Thanks, applied!
_______________________________________________
Guile-devel mailing list
Guile-devel@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-devel
^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2002-05-06 18:54 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2002-03-12 17:35 [Patch] SRFI-13 string-tokenize is wrong Matthias Koeppe
2002-04-24 19:58 ` Marius Vollmer
2002-04-26 8:27 ` Matthias Koeppe
2002-04-26 18:18 ` Marius Vollmer
2002-04-29 9:21 ` Matthias Koeppe
2002-05-06 18:54 ` Marius Vollmer
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).