[Patch] SRFI-13 string-tokenize is wrong

From: Matthias Koeppe @ 2002-03-12 17:35 UTC
  Cc: guile-devel, haus

[-- Attachment #1: Type: text/plain, Size: 1539 bytes --]

Hi,

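the Guile implementation of SRFI-13 `string-tokenize' gets the meaning
of the `token-set' argument wrong: it treats the characters in
`token-set' as delimiters instead of as the characters that make up
the tokens.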

Quoting the SRFI:

| string-tokenize s [token-set start end] -> list
| 
|     Split the string s into a list of substrings, where each substring
|     is a maximal non-empty contiguous sequence of characters from the
|     character set token-set.
| 
|         * token-set defaults to char-set:graphic (see SRFI 14 for more
|           on character sets and char-set:graphic).
| 
|     [...]    
| 
|     (string-tokenize "Help make programs run, run, RUN!") 
|     => ("Help" "make" "programs" "run," "run," "RUN!")
 
In Guile (1.5 branch):

      (string-tokenize "Help make programs run, run, RUN!") 
      => ("Help" "make" "programs" "run," "run," "RUN!")  ; OK

but:

      (string-tokenize "Help make programs run, run, RUN!" char-set:graphic)
      => (" " " " " " " " " ")  ; WRONG

The corresponding tests in srfi-13.test are also wrong.

I suggest fixing this bug in both the stable and the unstable branch,
so that incorrect uses of `string-tokenize' in user code are avoided.

The attached patch fixes the bug.  It also removes the Guile-specific
extension that lets `string-tokenize' accept a single character as the
`token-set' argument, because that extension is inconsistent with both
the Guile-specific procedure documentation and the correct behavior of
`string-tokenize' when a character set is passed as `token-set'.
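
To make the intended semantics concrete, here is a minimal standalone
sketch in plain C, not the libguile code.  Everything in it is an
assumption made for illustration: the `charset_fn' predicate type stands
in for a SRFI-14 character set, `in_graphic' approximates
`char-set:graphic' with isgraph(), and the fixed-size output array
replaces the Scheme list.  It scans left to right, whereas the patched
code scans from the end and conses the result.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a SRFI-14 character set: a membership test. */
typedef int (*charset_fn) (int c);

/* Approximation of char-set:graphic (assumption: isgraph is close enough). */
static int
in_graphic (int c)
{
  return isgraph ((unsigned char) c) != 0;
}

/* Approximation of char-set:whitespace, used to mimic the old behavior. */
static int
in_whitespace (int c)
{
  return isspace ((unsigned char) c) != 0;
}

#define MAX_TOKENS 16
#define MAX_TOKEN_LEN 32

/* Split S into maximal non-empty runs of characters that ARE members of
   TOKEN_SET.  Stores at most MAX_TOKENS tokens in OUT, in order, and
   returns how many were stored.  Overlong tokens are truncated.  */
static size_t
tokenize (const char *s, charset_fn token_set,
          char out[MAX_TOKENS][MAX_TOKEN_LEN])
{
  size_t n = 0, i = 0, len;

  assert (s != NULL && token_set != NULL);
  len = strlen (s);

  while (i < len && n < MAX_TOKENS)
    {
      size_t start, tok_len;

      /* Skip characters that are not in TOKEN_SET (the delimiters).  */
      while (i < len && !token_set (s[i]))
        i++;
      if (i >= len)
        break;

      /* Collect a maximal run of characters that are in TOKEN_SET.  */
      start = i;
      while (i < len && token_set (s[i]))
        i++;

      tok_len = i - start;
      if (tok_len >= MAX_TOKEN_LEN)
        tok_len = MAX_TOKEN_LEN - 1;
      memcpy (out[n], s + start, tok_len);
      out[n][tok_len] = '\0';
      n++;
    }
  return n;
}
```

Called with `in_graphic' on the example string, this sketch yields the
six tokens shown in the SRFI.  Called with `in_whitespace', it yields
five single-space strings, which matches what the unpatched Guile code
returned when given `char-set:graphic'.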

-- 
Matthias Köppe -- http://www.math.uni-magdeburg.de/~mkoeppe

[-- Attachment #2: Type: text/x-patch, Size: 2529 bytes --]

--- srfi-13.c.~1.11.2.4.~	Tue Mar 12 17:03:03 2002
+++ srfi-13.c	Tue Mar 12 18:03:23 2002
@@ -2798,13 +2798,14 @@
 
 
 SCM_DEFINE (scm_string_tokenize, "string-tokenize", 1, 3, 0,
-	    (SCM s, SCM token_char, SCM start, SCM end),
+	    (SCM s, SCM token_set, SCM start, SCM end),
 	    "Split the string @var{s} into a list of substrings, where each\n"
 	    "substring is a maximal non-empty contiguous sequence of\n"
-	    "characters equal to the character @var{token_char}, or\n"
-	    "whitespace, if @var{token_char} is not given.  If\n"
-	    "@var{token_char} is a character set, it is used for finding the\n"
-	    "token borders.")
+	    "characters from the character set @var{token_set}, which\n"
+	    "defaults to an equivalent of @code{char-set:graphic}.\n"
+	    "If @var{start} or @var{end} indices are provided, they restrict\n"
+	    "@code{string-tokenize} to operating on the indicated substring\n"
+	    "of @var{s}.")
 #define FUNC_NAME s_scm_string_tokenize
 {
   char * cstr;
@@ -2814,7 +2815,7 @@
   SCM_VALIDATE_SUBSTRING_SPEC_COPY (1, s, cstr,
 				    3, start, cstart,
 				    4, end, cend);
-  if (SCM_UNBNDP (token_char))
+  if (SCM_UNBNDP (token_set))
     {
       int idx;
 
@@ -2838,7 +2839,7 @@
 	  result = scm_cons (scm_mem2string (cstr + cend, idx - cend), result);
 	}
     }
-  else if (SCM_CHARSETP (token_char))
+  else if (SCM_CHARSETP (token_set))
     {
       int idx;
 
@@ -2846,7 +2847,7 @@
 	{
 	  while (cstart < cend)
 	    {
-	      if (!SCM_CHARSET_GET (token_char, cstr[cend - 1]))
+	      if (SCM_CHARSET_GET (token_set, cstr[cend - 1]))
 		break;
 	      cend--;
 	    }
@@ -2855,41 +2856,14 @@
 	  idx = cend;
 	  while (cstart < cend)
 	    {
-	      if (SCM_CHARSET_GET (token_char, cstr[cend - 1]))
-		break;
-	      cend--;
-	    }
-	  result = scm_cons (scm_mem2string (cstr + cend, idx - cend), result);
-	}
-    }
-  else
-    {
-      int idx;
-      char chr;
-
-      SCM_VALIDATE_CHAR (2, token_char);
-      chr = SCM_CHAR (token_char);
-
-      while (cstart < cend)
-	{
-	  while (cstart < cend)
-	    {
-	      if (cstr[cend - 1] != chr)
-		break;
-	      cend--;
-	    }
-	  if (cstart >= cend)
-	    break;
-	  idx = cend;
-	  while (cstart < cend)
-	    {
-	      if (cstr[cend - 1] == chr)
+	      if (!SCM_CHARSET_GET (token_set, cstr[cend - 1]))
 		break;
 	      cend--;
 	    }
 	  result = scm_cons (scm_mem2string (cstr + cend, idx - cend), result);
 	}
     }
+  else SCM_WRONG_TYPE_ARG (2, token_set);
   return result;
 }
 #undef FUNC_NAME
