unofficial mirror of guile-devel@gnu.org 
 help / color / mirror / Atom feed
* [PATCH] In string-split, add support for character sets and predicates.
@ 2012-10-08 11:23 Daniel Hartwig
  2012-10-08 15:40 ` Mark H Weaver
  0 siblings, 1 reply; 14+ messages in thread
From: Daniel Hartwig @ 2012-10-08 11:23 UTC (permalink / raw)
  To: guile-devel

[-- Attachment #1: Type: text/plain, Size: 454 bytes --]

Following up on the thread from last time regexp-split was discussed.

On 8 January 2012 07:05, Andy Wingo <wingo@pobox.com> wrote:
> On Sat 31 Dec 2011 06:54, Daniel Hartwig <mandyke@gmail.com> writes:
>> * [Vanilla `string-split' expanded to support the CHAR_PRED
>>   semantics of `string-index' et al.]
>
> Makes sense to me.
>

Attached, with perhaps too many test cases as well.  Commit message
contains the details on why it's done the way it is.

[-- Attachment #2: 0001-In-string-split-add-support-for-character-sets-and-p.patch --]
[-- Type: application/octet-stream, Size: 7459 bytes --]

From 0aeed16baa70eca143fec05e864f98d95d7267e8 Mon Sep 17 00:00:00 2001
From: Daniel Hartwig <mandyke@gmail.com>
Date: Mon, 8 Oct 2012 18:35:00 +0800
Subject: [PATCH] In string-split, add support for character sets and
 predicates.

* libguile/srfi-13.c (string-split): Add support for splitting on
  character sets and predicates, like string-index and others.  Keep the
  original (fast) path when splitting by character and refactor using
  string-index-right for other types; the later involves handling SCM
  values so there is less chance to optimize anyway.
* test-suite/tests/strings.test (string-split): Add tests covering
  the new argument types.
---
 libguile/srfi-13.c            |   53 ++++++++++++++++++++++++++++++----
 libguile/srfi-13.h            |    2 +-
 test-suite/tests/strings.test |   62 ++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 108 insertions(+), 9 deletions(-)

diff --git a/libguile/srfi-13.c b/libguile/srfi-13.c
index 2834553..1874754 100644
--- a/libguile/srfi-13.c
+++ b/libguile/srfi-13.c
@@ -2993,11 +2993,22 @@ SCM_DEFINE (scm_string_tokenize, "string-tokenize", 1, 3, 0,
 #undef FUNC_NAME
 
 SCM_DEFINE (scm_string_split, "string-split", 2, 0, 0,
-	    (SCM str, SCM chr),
+	    (SCM str, SCM char_pred),
 	    "Split the string @var{str} into a list of the substrings delimited\n"
-	    "by appearances of the character @var{chr}.  Note that an empty substring\n"
-	    "between separator characters will result in an empty string in the\n"
-	    "result list.\n"
+	    "by appearances characters which\n"
+            "\n"
+            "@itemize @bullet\n"
+            "@item\n"
+            "equals @var{char_pred}, if it is a character,\n"
+            "\n"
+            "@item\n"
+            "satisfies the predicate @var{char_pred}, if it is a procedure,\n"
+            "\n"
+            "@item\n"
+            "is in the set @var{char_pred}, if it is a character set.\n"
+            "@end itemize\n\n"
+            "Note that an empty substring between separator characters\n"
+            "will result in an empty string in the result list.\n"
 	    "\n"
 	    "@lisp\n"
 	    "(string-split \"root:x:0:0:root:/root:/bin/bash\" #\\:)\n"
@@ -3014,13 +3025,39 @@ SCM_DEFINE (scm_string_split, "string-split", 2, 0, 0,
 	    "@end lisp")
 #define FUNC_NAME s_scm_string_split
 {
+  SCM sidx, slast_idx;
   long idx, last_idx;
   int narrow;
   SCM res = SCM_EOL;
 
   SCM_VALIDATE_STRING (1, str);
-  SCM_VALIDATE_CHAR (2, chr);
   
+  if (SCM_CHARP (char_pred))
+    {
+      goto split_char;
+    }
+  else if (!SCM_CHARSETP (char_pred))
+    {
+      SCM_ASSERT (scm_is_true (scm_procedure_p (char_pred)),
+                  char_pred, SCM_ARG2, FUNC_NAME);
+    }
+
+  sidx = scm_string_length (str);
+  slast_idx = SCM_BOOL_F;
+  while (scm_is_true (sidx))
+    {
+      slast_idx = sidx;
+      sidx = scm_string_index_right (str, char_pred, SCM_INUM0, slast_idx);
+      if (scm_is_true (sidx))
+        {
+          SCM substr = scm_substring (str, scm_oneplus (sidx), slast_idx);
+          res = scm_cons (substr, res);
+        }
+    }
+  res = scm_cons (scm_substring (str, SCM_INUM0, slast_idx), res);
+  goto done;
+
+ split_char:
   /* This is explicit wide/narrow logic (instead of using
      scm_i_string_ref) is a speed optimization.  */
   idx = scm_i_string_length (str);
@@ -3031,7 +3068,7 @@ SCM_DEFINE (scm_string_split, "string-split", 2, 0, 0,
       while (idx >= 0)
         {
           last_idx = idx;
-          while (idx > 0 && buf[idx-1] != (char) SCM_CHAR(chr))
+          while (idx > 0 && buf[idx-1] != (char) SCM_CHAR(char_pred))
             idx--;
           if (idx >= 0)
             {
@@ -3046,7 +3083,7 @@ SCM_DEFINE (scm_string_split, "string-split", 2, 0, 0,
       while (idx >= 0)
         {
           last_idx = idx;
-          while (idx > 0 && buf[idx-1] != SCM_CHAR(chr))
+          while (idx > 0 && buf[idx-1] != SCM_CHAR(char_pred))
             idx--;
           if (idx >= 0)
             {
@@ -3055,6 +3092,8 @@ SCM_DEFINE (scm_string_split, "string-split", 2, 0, 0,
             }
         }
     }
+
+ done:
   scm_remember_upto_here_1 (str);
   return res;
 }
diff --git a/libguile/srfi-13.h b/libguile/srfi-13.h
index f63239a..325e222 100644
--- a/libguile/srfi-13.h
+++ b/libguile/srfi-13.h
@@ -110,7 +110,7 @@ SCM_API SCM scm_xsubstring (SCM s, SCM from, SCM to, SCM start, SCM end);
 SCM_API SCM scm_string_xcopy_x (SCM target, SCM tstart, SCM s, SCM sfrom, SCM sto, SCM start, SCM end);
 SCM_API SCM scm_string_replace (SCM s1, SCM s2, SCM start1, SCM end1, SCM start2, SCM end2);
 SCM_API SCM scm_string_tokenize (SCM s, SCM token_char, SCM start, SCM end);
-SCM_API SCM scm_string_split (SCM s, SCM chr);
+SCM_API SCM scm_string_split (SCM s, SCM char_pred);
 SCM_API SCM scm_string_filter (SCM char_pred, SCM s, SCM start, SCM end);
 SCM_API SCM scm_string_delete (SCM char_pred, SCM s, SCM start, SCM end);
 
diff --git a/test-suite/tests/strings.test b/test-suite/tests/strings.test
index d892b70..679e173 100644
--- a/test-suite/tests/strings.test
+++ b/test-suite/tests/strings.test
@@ -557,7 +557,67 @@
   (pass-if "char 255"
     (equal? '("a" "b")
 	    (string-split (string #\a (integer->char 255) #\b)
-			  (integer->char 255)))))
+			  (integer->char 255))))
+
+  (pass-if "empty string - char"
+    (equal? '("")
+            (string-split "" #\:)))
+
+  (pass-if "non-empty - char - no delimiters"
+    (equal? '("foobarfrob")
+            (string-split "foobarfrob" #\:)))
+
+  (pass-if "non-empty - char - delimiters"
+    (equal? '("foo" "bar" "frob")
+            (string-split "foo:bar:frob" #\:)))
+
+  (pass-if "non-empty - char - leading delimiters"
+    (equal? '("" "" "foo" "bar" "frob")
+            (string-split "::foo:bar:frob" #\:)))
+
+  (pass-if "non-empty - char - trailing delimiters"
+    (equal? '("foo" "bar" "frob" "" "")
+            (string-split "foo:bar:frob::" #\:)))
+
+  (pass-if "empty string - charset"
+    (equal? '("")
+            (string-split "" (char-set #\:))))
+
+  (pass-if "non-empty - charset - no delimiters"
+    (equal? '("foobarfrob")
+            (string-split "foobarfrob" (char-set #\:))))
+
+  (pass-if "non-empty - charset - delimiters"
+    (equal? '("foo" "bar" "frob")
+            (string-split "foo:bar:frob" (char-set #\:))))
+
+  (pass-if "non-empty - charset - leading delimiters"
+    (equal? '("" "" "foo" "bar" "frob")
+            (string-split "::foo:bar:frob" (char-set #\:))))
+
+  (pass-if "non-empty - charset - trailing delimiters"
+    (equal? '("foo" "bar" "frob" "" "")
+            (string-split "foo:bar:frob::" (char-set #\:))))
+
+  (pass-if "empty string - pred"
+    (equal? '("")
+            (string-split "" (negate char-alphabetic?))))
+
+  (pass-if "non-empty - pred - no delimiters"
+    (equal? '("foobarfrob")
+            (string-split "foobarfrob" (negate char-alphabetic?))))
+
+  (pass-if "non-empty - pred - delimiters"
+    (equal? '("foo" "bar" "frob")
+            (string-split "foo:bar:frob" (negate char-alphabetic?))))
+
+  (pass-if "non-empty - pred - leading delimiters"
+    (equal? '("" "" "foo" "bar" "frob")
+            (string-split "::foo:bar:frob" (negate char-alphabetic?))))
+
+  (pass-if "non-empty - pred - trailing delimiters"
+    (equal? '("foo" "bar" "frob" "" "")
+            (string-split "foo:bar:frob::" (negate char-alphabetic?)))))
 
 (with-test-prefix "substring-move!"
 
-- 
1.7.9


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH] In string-split, add support for character sets and predicates.
  2012-10-08 11:23 [PATCH] In string-split, add support for character sets and predicates Daniel Hartwig
@ 2012-10-08 15:40 ` Mark H Weaver
  2012-10-09  3:34   ` Daniel Hartwig
  0 siblings, 1 reply; 14+ messages in thread
From: Mark H Weaver @ 2012-10-08 15:40 UTC (permalink / raw)
  To: Daniel Hartwig; +Cc: guile-devel

Hi Daniel,

Thanks for the patch!  I have a few comments.

Daniel Hartwig <mandyke@gmail.com> writes:
> From 0aeed16baa70eca143fec05e864f98d95d7267e8 Mon Sep 17 00:00:00 2001
> From: Daniel Hartwig <mandyke@gmail.com>
> Date: Mon, 8 Oct 2012 18:35:00 +0800
> Subject: [PATCH] In string-split, add support for character sets and
>  predicates.
>
> * libguile/srfi-13.c (string-split): Add support for splitting on
>   character sets and predicates, like string-index and others.  Keep the
>   original (fast) path when splitting by character and refactor using
>   string-index-right for other types; the later involves handling SCM
>   values so there is less chance to optimize anyway.

As Ludovic frequently reminds us (and I agree), rationales should be in
the source code comments, not in the commit message.

> * test-suite/tests/strings.test (string-split): Add tests covering
>   the new argument types.
> ---
>  libguile/srfi-13.c            |   53 ++++++++++++++++++++++++++++++----
>  libguile/srfi-13.h            |    2 +-
>  test-suite/tests/strings.test |   62 ++++++++++++++++++++++++++++++++++++++++-
>  3 files changed, 108 insertions(+), 9 deletions(-)
>
> diff --git a/libguile/srfi-13.c b/libguile/srfi-13.c
> index 2834553..1874754 100644
> --- a/libguile/srfi-13.c
> +++ b/libguile/srfi-13.c
> @@ -2993,11 +2993,22 @@ SCM_DEFINE (scm_string_tokenize, "string-tokenize", 1, 3, 0,
>  #undef FUNC_NAME
>  
>  SCM_DEFINE (scm_string_split, "string-split", 2, 0, 0,
> -	    (SCM str, SCM chr),
> +	    (SCM str, SCM char_pred),
>  	    "Split the string @var{str} into a list of the substrings delimited\n"
> -	    "by appearances of the character @var{chr}.  Note that an empty substring\n"
> -	    "between separator characters will result in an empty string in the\n"
> -	    "result list.\n"
> +	    "by appearances characters which\n"

"by appearances of characters that"
                ^^            ^^^^

(the difference between 'which' and 'that' is described in
<http://www.kentlaw.edu/academics/lrw/grinker/LwtaThat_Versus_Which.htm>)

> +            "\n"
> +            "@itemize @bullet\n"
> +            "@item\n"
> +            "equals @var{char_pred}, if it is a character,\n"

Should be "equal", not "equals", because the subject "characters" is
plural.

> +            "\n"
> +            "@item\n"
> +            "satisfies the predicate @var{char_pred}, if it is a procedure,\n"

Should be "satisfy" for the same reason.

> +            "\n"
> +            "@item\n"
> +            "is in the set @var{char_pred}, if it is a character set.\n"

"are in the set".

> +            "@end itemize\n\n"
> +            "Note that an empty substring between separator characters\n"
> +            "will result in an empty string in the result list.\n"
>  	    "\n"
>  	    "@lisp\n"
>  	    "(string-split \"root:x:0:0:root:/root:/bin/bash\" #\\:)\n"
> @@ -3014,13 +3025,39 @@ SCM_DEFINE (scm_string_split, "string-split", 2, 0, 0,
>  	    "@end lisp")
>  #define FUNC_NAME s_scm_string_split
>  {
> +  SCM sidx, slast_idx;
>    long idx, last_idx;
>    int narrow;
>    SCM res = SCM_EOL;
>  
>    SCM_VALIDATE_STRING (1, str);
> -  SCM_VALIDATE_CHAR (2, chr);
>    
> +  if (SCM_CHARP (char_pred))
> +    {
> +      goto split_char;

I'd prefer to avoid the use of 'goto' here, and instead use nested 'if's
here.  (I admit that 'goto's occasionally make code simpler and more
readable, but not in this case IMO).

Can you please put the code between 'split_char' and 'done' within this
'if', and all the code from here to 'split_char' within the 'else'?

> +    }
> +  else if (!SCM_CHARSETP (char_pred))
> +    {
> +      SCM_ASSERT (scm_is_true (scm_procedure_p (char_pred)),
> +                  char_pred, SCM_ARG2, FUNC_NAME);
> +    }
> +
> +  sidx = scm_string_length (str);
> +  slast_idx = SCM_BOOL_F;
> +  while (scm_is_true (sidx))
> +    {
> +      slast_idx = sidx;
> +      sidx = scm_string_index_right (str, char_pred, SCM_INUM0, slast_idx);
> +      if (scm_is_true (sidx))
> +        {
> +          SCM substr = scm_substring (str, scm_oneplus (sidx), slast_idx);
> +          res = scm_cons (substr, res);
> +        }
> +    }

It is needlessly inefficient to test 'scm_is_true (sidx)' twice per
iteration of this loop.  The first test is also a waste.  Here's one way
to avoid the redundant tests:

  slast_idx = scm_string_length (str);
  for (;;)
    {
      sidx = scm_string_index_right (str, char_pred, SCM_INUM0, slast_idx);
      if (scm_is_false (sidx))
        break;
      res = scm_cons (scm_substring (str, scm_oneplus (sidx), slast_idx), res);
      slast_idx = sidx;
    }

> +  res = scm_cons (scm_substring (str, SCM_INUM0, slast_idx), res);
> +  goto done;
> +
> + split_char:
>    /* This is explicit wide/narrow logic (instead of using
>       scm_i_string_ref) is a speed optimization.  */
>    idx = scm_i_string_length (str);
> @@ -3031,7 +3068,7 @@ SCM_DEFINE (scm_string_split, "string-split", 2, 0, 0,
>        while (idx >= 0)
>          {
>            last_idx = idx;
> -          while (idx > 0 && buf[idx-1] != (char) SCM_CHAR(chr))
> +          while (idx > 0 && buf[idx-1] != (char) SCM_CHAR(char_pred))
>              idx--;
>            if (idx >= 0)
>              {
> @@ -3046,7 +3083,7 @@ SCM_DEFINE (scm_string_split, "string-split", 2, 0, 0,
>        while (idx >= 0)
>          {
>            last_idx = idx;
> -          while (idx > 0 && buf[idx-1] != SCM_CHAR(chr))
> +          while (idx > 0 && buf[idx-1] != SCM_CHAR(char_pred))
>              idx--;
>            if (idx >= 0)
>              {
> @@ -3055,6 +3092,8 @@ SCM_DEFINE (scm_string_split, "string-split", 2, 0, 0,
>              }
>          }
>      }
> +
> + done:
>    scm_remember_upto_here_1 (str);
>    return res;
>  }

[...]

Everything else looks good.  Thanks for including a full set of tests!

     Mark



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] In string-split, add support for character sets and predicates.
  2012-10-08 15:40 ` Mark H Weaver
@ 2012-10-09  3:34   ` Daniel Hartwig
  2012-10-09 17:48     ` Mark H Weaver
  0 siblings, 1 reply; 14+ messages in thread
From: Daniel Hartwig @ 2012-10-09  3:34 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: guile-devel

[-- Attachment #1: Type: text/plain, Size: 924 bytes --]

Hi Mark

Thanks for the speedy response.

General ACK on all your comments.  Some additional notes:

On 8 October 2012 23:40, Mark H Weaver <mhw@netris.org> wrote:
>> [which vs. that]

This was straight out of the doc string for string-index.  Changed.

>> +  if (SCM_CHARP (char_pred))
>> +    {
>> +      goto split_char;
>
> I'd prefer to avoid the use of 'goto' here, and instead use nested 'if's
> here.  (I admit that 'goto's occasionally make code simpler and more
> readable, but not in this case IMO).

I had done this only to avoid diff noise on the original code path
since I wasn't changing it.  I agree that it adds nothing to the
structure of the code.  Changed.

> Can you please put the code between 'split_char' and 'done' within this
> 'if', and all the code from here to 'split_char' within the 'else'?

Done.  Also adjusted the scope of the branch-specific variables.

Updated version attached.

Regards

[-- Attachment #2: 0001-In-string-split-add-support-for-character-sets-and-p.patch --]
[-- Type: application/octet-stream, Size: 8472 bytes --]

From 46178db9eecc9ca402d9571c3ee520074f15ef5a Mon Sep 17 00:00:00 2001
From: Daniel Hartwig <mandyke@gmail.com>
Date: Mon, 8 Oct 2012 18:35:00 +0800
Subject: [PATCH] In string-split, add support for character sets and
 predicates.

* libguile/srfi-13.c (string-split): Add support for splitting on
  character sets and predicates, like string-index and others.
* test-suite/tests/strings.test (string-split): Add tests covering
  the new argument types.
---
 libguile/srfi-13.c            |  102 +++++++++++++++++++++++++++++------------
 libguile/srfi-13.h            |    2 +-
 test-suite/tests/strings.test |   62 ++++++++++++++++++++++++-
 3 files changed, 134 insertions(+), 32 deletions(-)

diff --git a/libguile/srfi-13.c b/libguile/srfi-13.c
index 2834553..605ba21 100644
--- a/libguile/srfi-13.c
+++ b/libguile/srfi-13.c
@@ -2993,11 +2993,22 @@ SCM_DEFINE (scm_string_tokenize, "string-tokenize", 1, 3, 0,
 #undef FUNC_NAME
 
 SCM_DEFINE (scm_string_split, "string-split", 2, 0, 0,
-	    (SCM str, SCM chr),
+	    (SCM str, SCM char_pred),
 	    "Split the string @var{str} into a list of the substrings delimited\n"
-	    "by appearances of the character @var{chr}.  Note that an empty substring\n"
-	    "between separator characters will result in an empty string in the\n"
-	    "result list.\n"
+	    "by appearances of characters that\n"
+            "\n"
+            "@itemize @bullet\n"
+            "@item\n"
+            "equal @var{char_pred}, if it is a character,\n"
+            "\n"
+            "@item\n"
+            "satisfy the predicate @var{char_pred}, if it is a procedure,\n"
+            "\n"
+            "@item\n"
+            "are in the set @var{char_pred}, if it is a character set.\n"
+            "@end itemize\n\n"
+            "Note that an empty substring between separator characters\n"
+            "will result in an empty string in the result list.\n"
 	    "\n"
 	    "@lisp\n"
 	    "(string-split \"root:x:0:0:root:/root:/bin/bash\" #\\:)\n"
@@ -3014,47 +3025,78 @@ SCM_DEFINE (scm_string_split, "string-split", 2, 0, 0,
 	    "@end lisp")
 #define FUNC_NAME s_scm_string_split
 {
-  long idx, last_idx;
-  int narrow;
   SCM res = SCM_EOL;
 
   SCM_VALIDATE_STRING (1, str);
-  SCM_VALIDATE_CHAR (2, chr);
   
-  /* This is explicit wide/narrow logic (instead of using
-     scm_i_string_ref) is a speed optimization.  */
-  idx = scm_i_string_length (str);
-  narrow = scm_i_is_narrow_string (str);
-  if (narrow)
+  if (SCM_CHARP (char_pred))
     {
-      const char *buf = scm_i_string_chars (str);
-      while (idx >= 0)
+      long idx, last_idx;
+      int narrow;
+
+      /* This is explicit wide/narrow logic (instead of using
+         scm_i_string_ref) is a speed optimization.  */
+      idx = scm_i_string_length (str);
+      narrow = scm_i_is_narrow_string (str);
+      if (narrow)
         {
-          last_idx = idx;
-          while (idx > 0 && buf[idx-1] != (char) SCM_CHAR(chr))
-            idx--;
-          if (idx >= 0)
+          const char *buf = scm_i_string_chars (str);
+          while (idx >= 0)
             {
-              res = scm_cons (scm_i_substring (str, idx, last_idx), res);
-              idx--;
+              last_idx = idx;
+              while (idx > 0 && buf[idx-1] != (char) SCM_CHAR(char_pred))
+                idx--;
+              if (idx >= 0)
+                {
+                  res = scm_cons (scm_i_substring (str, idx, last_idx), res);
+                  idx--;
+                }
             }
         }
+      else
+        {
+          const scm_t_wchar *buf = scm_i_string_wide_chars (str);
+          while (idx >= 0)
+            {
+              last_idx = idx;
+              while (idx > 0 && buf[idx-1] != SCM_CHAR(char_pred))
+                idx--;
+              if (idx >= 0)
+                {
+                  res = scm_cons (scm_i_substring (str, idx, last_idx), res);
+                  idx--;
+                }
+            }
+        }
+      goto done;
     }
   else
     {
-      const scm_t_wchar *buf = scm_i_string_wide_chars (str);
-      while (idx >= 0)
+      SCM sidx, slast_idx;
+
+      if (!SCM_CHARSETP (char_pred))
         {
-          last_idx = idx;
-          while (idx > 0 && buf[idx-1] != SCM_CHAR(chr))
-            idx--;
-          if (idx >= 0)
-            {
-              res = scm_cons (scm_i_substring (str, idx, last_idx), res);
-              idx--;
-            }
+          SCM_ASSERT (scm_is_true (scm_procedure_p (char_pred)),
+                      char_pred, SCM_ARG2, FUNC_NAME);
+        }
+
+      /* Supporting predicates and character sets involves handling SCM
+         values so there is less chance to optimize. */
+      slast_idx = scm_string_length (str);
+      for (;;)
+        {
+          sidx = scm_string_index_right (str, char_pred, SCM_INUM0, slast_idx);
+          if (scm_is_false (sidx))
+            break;
+          res = scm_cons (scm_substring (str, scm_oneplus (sidx), slast_idx), res);
+          slast_idx = sidx;
         }
+
+      res = scm_cons (scm_substring (str, SCM_INUM0, slast_idx), res);
+      goto done;
     }
+
+ done:
   scm_remember_upto_here_1 (str);
   return res;
 }
diff --git a/libguile/srfi-13.h b/libguile/srfi-13.h
index f63239a..325e222 100644
--- a/libguile/srfi-13.h
+++ b/libguile/srfi-13.h
@@ -110,7 +110,7 @@ SCM_API SCM scm_xsubstring (SCM s, SCM from, SCM to, SCM start, SCM end);
 SCM_API SCM scm_string_xcopy_x (SCM target, SCM tstart, SCM s, SCM sfrom, SCM sto, SCM start, SCM end);
 SCM_API SCM scm_string_replace (SCM s1, SCM s2, SCM start1, SCM end1, SCM start2, SCM end2);
 SCM_API SCM scm_string_tokenize (SCM s, SCM token_char, SCM start, SCM end);
-SCM_API SCM scm_string_split (SCM s, SCM chr);
+SCM_API SCM scm_string_split (SCM s, SCM char_pred);
 SCM_API SCM scm_string_filter (SCM char_pred, SCM s, SCM start, SCM end);
 SCM_API SCM scm_string_delete (SCM char_pred, SCM s, SCM start, SCM end);
 
diff --git a/test-suite/tests/strings.test b/test-suite/tests/strings.test
index d892b70..679e173 100644
--- a/test-suite/tests/strings.test
+++ b/test-suite/tests/strings.test
@@ -557,7 +557,67 @@
   (pass-if "char 255"
     (equal? '("a" "b")
 	    (string-split (string #\a (integer->char 255) #\b)
-			  (integer->char 255)))))
+			  (integer->char 255))))
+
+  (pass-if "empty string - char"
+    (equal? '("")
+            (string-split "" #\:)))
+
+  (pass-if "non-empty - char - no delimiters"
+    (equal? '("foobarfrob")
+            (string-split "foobarfrob" #\:)))
+
+  (pass-if "non-empty - char - delimiters"
+    (equal? '("foo" "bar" "frob")
+            (string-split "foo:bar:frob" #\:)))
+
+  (pass-if "non-empty - char - leading delimiters"
+    (equal? '("" "" "foo" "bar" "frob")
+            (string-split "::foo:bar:frob" #\:)))
+
+  (pass-if "non-empty - char - trailing delimiters"
+    (equal? '("foo" "bar" "frob" "" "")
+            (string-split "foo:bar:frob::" #\:)))
+
+  (pass-if "empty string - charset"
+    (equal? '("")
+            (string-split "" (char-set #\:))))
+
+  (pass-if "non-empty - charset - no delimiters"
+    (equal? '("foobarfrob")
+            (string-split "foobarfrob" (char-set #\:))))
+
+  (pass-if "non-empty - charset - delimiters"
+    (equal? '("foo" "bar" "frob")
+            (string-split "foo:bar:frob" (char-set #\:))))
+
+  (pass-if "non-empty - charset - leading delimiters"
+    (equal? '("" "" "foo" "bar" "frob")
+            (string-split "::foo:bar:frob" (char-set #\:))))
+
+  (pass-if "non-empty - charset - trailing delimiters"
+    (equal? '("foo" "bar" "frob" "" "")
+            (string-split "foo:bar:frob::" (char-set #\:))))
+
+  (pass-if "empty string - pred"
+    (equal? '("")
+            (string-split "" (negate char-alphabetic?))))
+
+  (pass-if "non-empty - pred - no delimiters"
+    (equal? '("foobarfrob")
+            (string-split "foobarfrob" (negate char-alphabetic?))))
+
+  (pass-if "non-empty - pred - delimiters"
+    (equal? '("foo" "bar" "frob")
+            (string-split "foo:bar:frob" (negate char-alphabetic?))))
+
+  (pass-if "non-empty - pred - leading delimiters"
+    (equal? '("" "" "foo" "bar" "frob")
+            (string-split "::foo:bar:frob" (negate char-alphabetic?))))
+
+  (pass-if "non-empty - pred - trailing delimiters"
+    (equal? '("foo" "bar" "frob" "" "")
+            (string-split "foo:bar:frob::" (negate char-alphabetic?)))))
 
 (with-test-prefix "substring-move!"
 
-- 
1.7.9


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH] In string-split, add support for character sets and predicates.
  2012-10-09  3:34   ` Daniel Hartwig
@ 2012-10-09 17:48     ` Mark H Weaver
  2012-10-10  1:37       ` Daniel Hartwig
  0 siblings, 1 reply; 14+ messages in thread
From: Mark H Weaver @ 2012-10-09 17:48 UTC (permalink / raw)
  To: Daniel Hartwig; +Cc: guile-devel

Hi Daniel,

The updated patch looks good except for a few minor nitpicks:

Daniel Hartwig <mandyke@gmail.com> writes:
> From 46178db9eecc9ca402d9571c3ee520074f15ef5a Mon Sep 17 00:00:00 2001
> From: Daniel Hartwig <mandyke@gmail.com>
> Date: Mon, 8 Oct 2012 18:35:00 +0800
> Subject: [PATCH] In string-split, add support for character sets and
>  predicates.
>
> * libguile/srfi-13.c (string-split): Add support for splitting on
>   character sets and predicates, like string-index and others.
> * test-suite/tests/strings.test (string-split): Add tests covering
>   the new argument types.
> ---
>  libguile/srfi-13.c            |  102 +++++++++++++++++++++++++++++------------
>  libguile/srfi-13.h            |    2 +-
>  test-suite/tests/strings.test |   62 ++++++++++++++++++++++++-
>  3 files changed, 134 insertions(+), 32 deletions(-)
>
> diff --git a/libguile/srfi-13.c b/libguile/srfi-13.c
> index 2834553..605ba21 100644
> --- a/libguile/srfi-13.c
> +++ b/libguile/srfi-13.c
> @@ -2993,11 +2993,22 @@ SCM_DEFINE (scm_string_tokenize, "string-tokenize", 1, 3, 0,
>  #undef FUNC_NAME
>  
>  SCM_DEFINE (scm_string_split, "string-split", 2, 0, 0,
> -	    (SCM str, SCM chr),
> +	    (SCM str, SCM char_pred),
>  	    "Split the string @var{str} into a list of the substrings delimited\n"
> -	    "by appearances of the character @var{chr}.  Note that an empty substring\n"
> -	    "between separator characters will result in an empty string in the\n"
> -	    "result list.\n"
> +	    "by appearances of characters that\n"
> +            "\n"
> +            "@itemize @bullet\n"
> +            "@item\n"
> +            "equal @var{char_pred}, if it is a character,\n"
> +            "\n"
> +            "@item\n"
> +            "satisfy the predicate @var{char_pred}, if it is a procedure,\n"
> +            "\n"
> +            "@item\n"
> +            "are in the set @var{char_pred}, if it is a character set.\n"
> +            "@end itemize\n\n"
> +            "Note that an empty substring between separator characters\n"
> +            "will result in an empty string in the result list.\n"
>  	    "\n"
>  	    "@lisp\n"
>  	    "(string-split \"root:x:0:0:root:/root:/bin/bash\" #\\:)\n"
> @@ -3014,47 +3025,78 @@ SCM_DEFINE (scm_string_split, "string-split", 2, 0, 0,
>  	    "@end lisp")
>  #define FUNC_NAME s_scm_string_split
>  {
> -  long idx, last_idx;
> -  int narrow;
>    SCM res = SCM_EOL;
>  
>    SCM_VALIDATE_STRING (1, str);
> -  SCM_VALIDATE_CHAR (2, chr);
>    
> -  /* This is explicit wide/narrow logic (instead of using
> -     scm_i_string_ref) is a speed optimization.  */
> -  idx = scm_i_string_length (str);
> -  narrow = scm_i_is_narrow_string (str);
> -  if (narrow)
> +  if (SCM_CHARP (char_pred))
>      {
> -      const char *buf = scm_i_string_chars (str);
> -      while (idx >= 0)
> +      long idx, last_idx;
> +      int narrow;
> +
> +      /* This is explicit wide/narrow logic (instead of using
> +         scm_i_string_ref) is a speed optimization.  */
> +      idx = scm_i_string_length (str);
> +      narrow = scm_i_is_narrow_string (str);
> +      if (narrow)
>          {
> -          last_idx = idx;
> -          while (idx > 0 && buf[idx-1] != (char) SCM_CHAR(chr))
> -            idx--;
> -          if (idx >= 0)
> +          const char *buf = scm_i_string_chars (str);
> +          while (idx >= 0)
>              {
> -              res = scm_cons (scm_i_substring (str, idx, last_idx), res);
> -              idx--;
> +              last_idx = idx;
> +              while (idx > 0 && buf[idx-1] != (char) SCM_CHAR(char_pred))
> +                idx--;
> +              if (idx >= 0)
> +                {
> +                  res = scm_cons (scm_i_substring (str, idx, last_idx), res);
> +                  idx--;
> +                }
>              }
>          }
> +      else
> +        {
> +          const scm_t_wchar *buf = scm_i_string_wide_chars (str);
> +          while (idx >= 0)
> +            {
> +              last_idx = idx;
> +              while (idx > 0 && buf[idx-1] != SCM_CHAR(char_pred))
> +                idx--;
> +              if (idx >= 0)
> +                {
> +                  res = scm_cons (scm_i_substring (str, idx, last_idx), res);
> +                  idx--;
> +                }
> +            }
> +        }
> +      goto done;

This 'goto done' is a no-op.  Please remove it.

>      }
>    else
>      {
> -      const scm_t_wchar *buf = scm_i_string_wide_chars (str);
> -      while (idx >= 0)
> +      SCM sidx, slast_idx;
> +
> +      if (!SCM_CHARSETP (char_pred))
>          {
> -          last_idx = idx;
> -          while (idx > 0 && buf[idx-1] != SCM_CHAR(chr))
> -            idx--;
> -          if (idx >= 0)
> -            {
> -              res = scm_cons (scm_i_substring (str, idx, last_idx), res);
> -              idx--;
> -            }
> +          SCM_ASSERT (scm_is_true (scm_procedure_p (char_pred)),
> +                      char_pred, SCM_ARG2, FUNC_NAME);
> +        }

Please drop the unneeded curly braces above.

> +
> +      /* Supporting predicates and character sets involves handling SCM
> +         values so there is less chance to optimize. */
> +      slast_idx = scm_string_length (str);
> +      for (;;)
> +        {
> +          sidx = scm_string_index_right (str, char_pred, SCM_INUM0, slast_idx);
> +          if (scm_is_false (sidx))
> +            break;
> +          res = scm_cons (scm_substring (str, scm_oneplus (sidx), slast_idx), res);
> +          slast_idx = sidx;
>          }
> +
> +      res = scm_cons (scm_substring (str, SCM_INUM0, slast_idx), res);
> +      goto done;

This 'goto done' is a no-op.  Please remove it.

>      }
> +
> + done:

and remove this label.

>    scm_remember_upto_here_1 (str);
>    return res;
>  }

After these changes, I think the patch will be ready to push, unless
anyone else has suggestions.

    Thanks!
      Mark



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] In string-split, add support for character sets and predicates.
  2012-10-09 17:48     ` Mark H Weaver
@ 2012-10-10  1:37       ` Daniel Hartwig
  2012-10-10  2:14         ` Mark H Weaver
  0 siblings, 1 reply; 14+ messages in thread
From: Daniel Hartwig @ 2012-10-10  1:37 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: guile-devel

[-- Attachment #1: Type: text/plain, Size: 1051 bytes --]

On 10 October 2012 01:48, Mark H Weaver <mhw@netris.org> wrote:
>> +      if (!SCM_CHARSETP (char_pred))
>>          {
>> -          last_idx = idx;
>> -          while (idx > 0 && buf[idx-1] != SCM_CHAR(chr))
>> -            idx--;
>> -          if (idx >= 0)
>> -            {
>> -              res = scm_cons (scm_i_substring (str, idx, last_idx), res);
>> -              idx--;
>> -            }
>> +          SCM_ASSERT (scm_is_true (scm_procedure_p (char_pred)),
>> +                      char_pred, SCM_ARG2, FUNC_NAME);
>> +        }
>
> Please drop the unneeded curly braces above.

Are guile's macros safe to always use that way?  In libguile/__scm.h
the SCM_ASSERT definition is conditional on SCM_RECKLESS and appears
unsafe without the block.

In NEWS it is mentioned that SCM_RECKLESS has been “removed
completely” between 0.6 and 0.8, so I wonder if this instance was
missed.  The conditional in __scm.h dates to 1996.

In any case, applied your suggestions and reattached; user of
-DSCM_RECKLESS beware.

[-- Attachment #2: 0001-In-string-split-add-support-for-character-sets-and-p.patch --]
[-- Type: application/octet-stream, Size: 8404 bytes --]

From 47b8437fa32edae52790851d05e5f40cc709c6a7 Mon Sep 17 00:00:00 2001
From: Daniel Hartwig <mandyke@gmail.com>
Date: Mon, 8 Oct 2012 18:35:00 +0800
Subject: [PATCH] In string-split, add support for character sets and
 predicates.

* libguile/srfi-13.c (string-split): Add support for splitting on
  character sets and predicates, like string-index and others.
* test-suite/tests/strings.test (string-split): Add tests covering
  the new argument types.
---
 libguile/srfi-13.c            |   97 ++++++++++++++++++++++++++++-------------
 libguile/srfi-13.h            |    2 +-
 test-suite/tests/strings.test |   62 ++++++++++++++++++++++++++-
 3 files changed, 129 insertions(+), 32 deletions(-)

diff --git a/libguile/srfi-13.c b/libguile/srfi-13.c
index 2834553..609af7d 100644
--- a/libguile/srfi-13.c
+++ b/libguile/srfi-13.c
@@ -2993,11 +2993,22 @@ SCM_DEFINE (scm_string_tokenize, "string-tokenize", 1, 3, 0,
 #undef FUNC_NAME
 
 SCM_DEFINE (scm_string_split, "string-split", 2, 0, 0,
-	    (SCM str, SCM chr),
+	    (SCM str, SCM char_pred),
 	    "Split the string @var{str} into a list of the substrings delimited\n"
-	    "by appearances of the character @var{chr}.  Note that an empty substring\n"
-	    "between separator characters will result in an empty string in the\n"
-	    "result list.\n"
+	    "by appearances of characters that\n"
+            "\n"
+            "@itemize @bullet\n"
+            "@item\n"
+            "equal @var{char_pred}, if it is a character,\n"
+            "\n"
+            "@item\n"
+            "satisfy the predicate @var{char_pred}, if it is a procedure,\n"
+            "\n"
+            "@item\n"
+            "are in the set @var{char_pred}, if it is a character set.\n"
+            "@end itemize\n\n"
+            "Note that an empty substring between separator characters\n"
+            "will result in an empty string in the result list.\n"
 	    "\n"
 	    "@lisp\n"
 	    "(string-split \"root:x:0:0:root:/root:/bin/bash\" #\\:)\n"
@@ -3014,47 +3025,73 @@ SCM_DEFINE (scm_string_split, "string-split", 2, 0, 0,
 	    "@end lisp")
 #define FUNC_NAME s_scm_string_split
 {
-  long idx, last_idx;
-  int narrow;
   SCM res = SCM_EOL;
 
   SCM_VALIDATE_STRING (1, str);
-  SCM_VALIDATE_CHAR (2, chr);
   
-  /* This is explicit wide/narrow logic (instead of using
-     scm_i_string_ref) is a speed optimization.  */
-  idx = scm_i_string_length (str);
-  narrow = scm_i_is_narrow_string (str);
-  if (narrow)
+  if (SCM_CHARP (char_pred))
     {
-      const char *buf = scm_i_string_chars (str);
-      while (idx >= 0)
+      long idx, last_idx;
+      int narrow;
+
+      /* This is explicit wide/narrow logic (instead of using
+         scm_i_string_ref) is a speed optimization.  */
+      idx = scm_i_string_length (str);
+      narrow = scm_i_is_narrow_string (str);
+      if (narrow)
+        {
+          const char *buf = scm_i_string_chars (str);
+          while (idx >= 0)
+            {
+              last_idx = idx;
+              while (idx > 0 && buf[idx-1] != (char) SCM_CHAR(char_pred))
+                idx--;
+              if (idx >= 0)
+                {
+                  res = scm_cons (scm_i_substring (str, idx, last_idx), res);
+                  idx--;
+                }
+            }
+        }
+      else
         {
-          last_idx = idx;
-          while (idx > 0 && buf[idx-1] != (char) SCM_CHAR(chr))
-            idx--;
-          if (idx >= 0)
+          const scm_t_wchar *buf = scm_i_string_wide_chars (str);
+          while (idx >= 0)
             {
-              res = scm_cons (scm_i_substring (str, idx, last_idx), res);
-              idx--;
+              last_idx = idx;
+              while (idx > 0 && buf[idx-1] != SCM_CHAR(char_pred))
+                idx--;
+              if (idx >= 0)
+                {
+                  res = scm_cons (scm_i_substring (str, idx, last_idx), res);
+                  idx--;
+                }
             }
         }
     }
   else
     {
-      const scm_t_wchar *buf = scm_i_string_wide_chars (str);
-      while (idx >= 0)
+      SCM sidx, slast_idx;
+
+      if (!SCM_CHARSETP (char_pred))
+        SCM_ASSERT (scm_is_true (scm_procedure_p (char_pred)),
+                    char_pred, SCM_ARG2, FUNC_NAME);
+
+      /* Supporting predicates and character sets involves handling SCM
+         values so there is less chance to optimize. */
+      slast_idx = scm_string_length (str);
+      for (;;)
         {
-          last_idx = idx;
-          while (idx > 0 && buf[idx-1] != SCM_CHAR(chr))
-            idx--;
-          if (idx >= 0)
-            {
-              res = scm_cons (scm_i_substring (str, idx, last_idx), res);
-              idx--;
-            }
+          sidx = scm_string_index_right (str, char_pred, SCM_INUM0, slast_idx);
+          if (scm_is_false (sidx))
+            break;
+          res = scm_cons (scm_substring (str, scm_oneplus (sidx), slast_idx), res);
+          slast_idx = sidx;
         }
+
+      res = scm_cons (scm_substring (str, SCM_INUM0, slast_idx), res);
     }
+
   scm_remember_upto_here_1 (str);
   return res;
 }
diff --git a/libguile/srfi-13.h b/libguile/srfi-13.h
index f63239a..325e222 100644
--- a/libguile/srfi-13.h
+++ b/libguile/srfi-13.h
@@ -110,7 +110,7 @@ SCM_API SCM scm_xsubstring (SCM s, SCM from, SCM to, SCM start, SCM end);
 SCM_API SCM scm_string_xcopy_x (SCM target, SCM tstart, SCM s, SCM sfrom, SCM sto, SCM start, SCM end);
 SCM_API SCM scm_string_replace (SCM s1, SCM s2, SCM start1, SCM end1, SCM start2, SCM end2);
 SCM_API SCM scm_string_tokenize (SCM s, SCM token_char, SCM start, SCM end);
-SCM_API SCM scm_string_split (SCM s, SCM chr);
+SCM_API SCM scm_string_split (SCM s, SCM char_pred);
 SCM_API SCM scm_string_filter (SCM char_pred, SCM s, SCM start, SCM end);
 SCM_API SCM scm_string_delete (SCM char_pred, SCM s, SCM start, SCM end);
 
diff --git a/test-suite/tests/strings.test b/test-suite/tests/strings.test
index d892b70..679e173 100644
--- a/test-suite/tests/strings.test
+++ b/test-suite/tests/strings.test
@@ -557,7 +557,67 @@
   (pass-if "char 255"
     (equal? '("a" "b")
 	    (string-split (string #\a (integer->char 255) #\b)
-			  (integer->char 255)))))
+			  (integer->char 255))))
+
+  (pass-if "empty string - char"
+    (equal? '("")
+            (string-split "" #\:)))
+
+  (pass-if "non-empty - char - no delimiters"
+    (equal? '("foobarfrob")
+            (string-split "foobarfrob" #\:)))
+
+  (pass-if "non-empty - char - delimiters"
+    (equal? '("foo" "bar" "frob")
+            (string-split "foo:bar:frob" #\:)))
+
+  (pass-if "non-empty - char - leading delimiters"
+    (equal? '("" "" "foo" "bar" "frob")
+            (string-split "::foo:bar:frob" #\:)))
+
+  (pass-if "non-empty - char - trailing delimiters"
+    (equal? '("foo" "bar" "frob" "" "")
+            (string-split "foo:bar:frob::" #\:)))
+
+  (pass-if "empty string - charset"
+    (equal? '("")
+            (string-split "" (char-set #\:))))
+
+  (pass-if "non-empty - charset - no delimiters"
+    (equal? '("foobarfrob")
+            (string-split "foobarfrob" (char-set #\:))))
+
+  (pass-if "non-empty - charset - delimiters"
+    (equal? '("foo" "bar" "frob")
+            (string-split "foo:bar:frob" (char-set #\:))))
+
+  (pass-if "non-empty - charset - leading delimiters"
+    (equal? '("" "" "foo" "bar" "frob")
+            (string-split "::foo:bar:frob" (char-set #\:))))
+
+  (pass-if "non-empty - charset - trailing delimiters"
+    (equal? '("foo" "bar" "frob" "" "")
+            (string-split "foo:bar:frob::" (char-set #\:))))
+
+  (pass-if "empty string - pred"
+    (equal? '("")
+            (string-split "" (negate char-alphabetic?))))
+
+  (pass-if "non-empty - pred - no delimiters"
+    (equal? '("foobarfrob")
+            (string-split "foobarfrob" (negate char-alphabetic?))))
+
+  (pass-if "non-empty - pred - delimiters"
+    (equal? '("foo" "bar" "frob")
+            (string-split "foo:bar:frob" (negate char-alphabetic?))))
+
+  (pass-if "non-empty - pred - leading delimiters"
+    (equal? '("" "" "foo" "bar" "frob")
+            (string-split "::foo:bar:frob" (negate char-alphabetic?))))
+
+  (pass-if "non-empty - pred - trailing delimiters"
+    (equal? '("foo" "bar" "frob" "" "")
+            (string-split "foo:bar:frob::" (negate char-alphabetic?)))))
 
 (with-test-prefix "substring-move!"
 
-- 
1.7.9


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH] In string-split, add support for character sets and predicates.
  2012-10-10  1:37       ` Daniel Hartwig
@ 2012-10-10  2:14         ` Mark H Weaver
  2012-10-10  3:15           ` Daniel Hartwig
  0 siblings, 1 reply; 14+ messages in thread
From: Mark H Weaver @ 2012-10-10  2:14 UTC (permalink / raw)
  To: Daniel Hartwig; +Cc: guile-devel

Daniel Hartwig <mandyke@gmail.com> writes:
> On 10 October 2012 01:48, Mark H Weaver <mhw@netris.org> wrote:
>> Please drop the unneeded curly braces above.
>
> Are guile's macros safe to always use that way?

They should be.  If they aren't, that's a bug IMO.

> In libguile/__scm.h
> the SCM_ASSERT definition is conditional on SCM_RECKLESS and appears
> unsafe without the block.

Looks safe to me.  If SCM_RECKLESS is set, then 'SCM_ASSERT(...)'
becomes the empty string, but the immediately following ';' remains
and becomes a C null statement.

Your latest patch looks good, but I just remembered one more thing:
doc/ref/api-data.texi needs to be updated.

Are you willing? :)

   Thanks!
     Mark



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] In string-split, add support for character sets and predicates.
  2012-10-10  2:14         ` Mark H Weaver
@ 2012-10-10  3:15           ` Daniel Hartwig
  2012-10-10  3:25             ` Mark H Weaver
  0 siblings, 1 reply; 14+ messages in thread
From: Daniel Hartwig @ 2012-10-10  3:15 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: guile-devel

On 10 October 2012 10:14, Mark H Weaver <mhw@netris.org> wrote:
> Your latest patch looks good, but I just remembered one more thing:
> doc/ref/api-data.texi needs to be updated.

I had assumed that was updated automatically from the doc strings,
which are in the right format.

?



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] In string-split, add support for character sets and predicates.
  2012-10-10  3:15           ` Daniel Hartwig
@ 2012-10-10  3:25             ` Mark H Weaver
  2012-10-10  3:28               ` Daniel Hartwig
  2012-10-10 20:42               ` Ludovic Courtès
  0 siblings, 2 replies; 14+ messages in thread
From: Mark H Weaver @ 2012-10-10  3:25 UTC (permalink / raw)
  To: Daniel Hartwig; +Cc: guile-devel

Daniel Hartwig <mandyke@gmail.com> writes:

> On 10 October 2012 10:14, Mark H Weaver <mhw@netris.org> wrote:
>> Your latest patch looks good, but I just remembered one more thing:
>> doc/ref/api-data.texi needs to be updated.
>
> I had assumed that was updated automatically from the doc strings,
> which are in the right format.

No, only doc/ref/standard-library.texi is automatically generated.

I vaguely recall hearing somewhere that this autogeneration of docs was
considered a failed experiment, but we'd have to ask Andy or Ludovic
about that.

     Mark



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] In string-split, add support for character sets and predicates.
  2012-10-10  3:25             ` Mark H Weaver
@ 2012-10-10  3:28               ` Daniel Hartwig
  2012-10-10  7:59                 ` Mark H Weaver
  2012-10-10 20:42               ` Ludovic Courtès
  1 sibling, 1 reply; 14+ messages in thread
From: Daniel Hartwig @ 2012-10-10  3:28 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: guile-devel

On 10 October 2012 11:25, Mark H Weaver <mhw@netris.org> wrote:
> Daniel Hartwig <mandyke@gmail.com> writes:
>
>> On 10 October 2012 10:14, Mark H Weaver <mhw@netris.org> wrote:
>>> Your latest patch looks good, but I just remembered one more thing:
>>> doc/ref/api-data.texi needs to be updated.
>>
>> I had assumed that was updated automatically from the doc strings,
>> which are in the right format.
>
> No, only doc/ref/standard-library.texi is automatically generated.
>
> I vaguely recall hearing somewhere that this autogeneration of docs was
> considered a failed experiment, but we'd have to ask Andy or Ludovic
> about that.

Ah, ok.

One final thing before I resubmit.  I just notice that most of this
file has uses hard tabs, though there are places with soft.  Is it
preferable to change the patch to also use hard tabs?



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] In string-split, add support for character sets and predicates.
  2012-10-10  3:28               ` Daniel Hartwig
@ 2012-10-10  7:59                 ` Mark H Weaver
  2012-10-10 20:44                   ` Ludovic Courtès
  2012-10-12  6:38                   ` Daniel Hartwig
  0 siblings, 2 replies; 14+ messages in thread
From: Mark H Weaver @ 2012-10-10  7:59 UTC (permalink / raw)
  To: Daniel Hartwig; +Cc: guile-devel

Daniel Hartwig <mandyke@gmail.com> writes:
> One final thing before I resubmit.  I just notice that most of this
> file has uses hard tabs, though there are places with soft.  Is it
> preferable to change the patch to also use hard tabs?

IMO, even when editing code that currently uses hard tabs, all new lines
should use soft tabs, and ideally all *changed* lines should also be
converted to soft tabs.  Makefiles are a special case of course.

Hard tabs cause several problems, e.g. messed up indentation in patches,
even if the file uses hard tabs consistently.  Personally, I'd like to
work toward a future with no soft tabs, even if it excerbates the messed
up indentation problem in the short term.

What do you think?

    Mark

PS: I recommend setting 'indent-tabs-mode' to nil in Emacs.



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] In string-split, add support for character sets and predicates.
  2012-10-10  3:25             ` Mark H Weaver
  2012-10-10  3:28               ` Daniel Hartwig
@ 2012-10-10 20:42               ` Ludovic Courtès
  1 sibling, 0 replies; 14+ messages in thread
From: Ludovic Courtès @ 2012-10-10 20:42 UTC (permalink / raw)
  To: guile-devel

Hi!

Mark H Weaver <mhw@netris.org> skribis:

> I vaguely recall hearing somewhere that this autogeneration of docs was
> considered a failed experiment, but we'd have to ask Andy or Ludovic
> about that.

Well, look at this section of the manual (especially on hard copy), and
compare it to hand-written sections...

Ludo’.




^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] In string-split, add support for character sets and predicates.
  2012-10-10  7:59                 ` Mark H Weaver
@ 2012-10-10 20:44                   ` Ludovic Courtès
  2012-10-12  6:38                   ` Daniel Hartwig
  1 sibling, 0 replies; 14+ messages in thread
From: Ludovic Courtès @ 2012-10-10 20:44 UTC (permalink / raw)
  To: guile-devel

Hi,

FWIW, I think tabs are OK in C files, but not in Scheme files, because
in the latter case, we want to be able to copy/paste without triggering
tab completion.

Our .dir-locals.el follows that.

Thanks,
Ludo’.




^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] In string-split, add support for character sets and predicates.
  2012-10-10  7:59                 ` Mark H Weaver
  2012-10-10 20:44                   ` Ludovic Courtès
@ 2012-10-12  6:38                   ` Daniel Hartwig
  2012-10-12 12:23                     ` Mark H Weaver
  1 sibling, 1 reply; 14+ messages in thread
From: Daniel Hartwig @ 2012-10-12  6:38 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: guile-devel

[-- Attachment #1: Type: text/plain, Size: 31 bytes --]

Patch with .texi updated also.

[-- Attachment #2: 0001-In-string-split-add-support-for-character-sets-and-p.patch --]
[-- Type: application/octet-stream, Size: 9659 bytes --]

From 8923e727d3d77933aa3446f2f72f27fc40b46b1d Mon Sep 17 00:00:00 2001
From: Daniel Hartwig <mandyke@gmail.com>
Date: Mon, 8 Oct 2012 18:35:00 +0800
Subject: [PATCH] In string-split, add support for character sets and
 predicates.

* libguile/srfi-13.c (string-split): Add support for splitting on
  character sets and predicates, like string-index and others.
* test-suite/tests/strings.test (string-split): Add tests covering
  the new argument types.
* doc/ref/api-data.texi (string-split): Update.
---
 doc/ref/api-data.texi         |   22 +++++++--
 libguile/srfi-13.c            |   97 ++++++++++++++++++++++++++++-------------
 libguile/srfi-13.h            |    2 +-
 test-suite/tests/strings.test |   62 ++++++++++++++++++++++++++-
 4 files changed, 146 insertions(+), 37 deletions(-)

diff --git a/doc/ref/api-data.texi b/doc/ref/api-data.texi
index 39c9790..6d8de2b 100644
--- a/doc/ref/api-data.texi
+++ b/doc/ref/api-data.texi
@@ -3152,12 +3152,24 @@ These procedures are useful for similar tasks.
 Convert the string @var{str} into a list of characters.
 @end deffn
 
-@deffn {Scheme Procedure} string-split str chr
-@deffnx {C Function} scm_string_split (str, chr)
+@deffn {Scheme Procedure} string-split str char_pred
+@deffnx {C Function} scm_string_split (str, char_pred)
 Split the string @var{str} into a list of substrings delimited
-by appearances of the character @var{chr}.  Note that an empty substring
-between separator characters will result in an empty string in the
-result list.
+by appearances of characters that
+
+@itemize @bullet
+@item
+equal @var{char_pred}, if it is a character,
+
+@item
+satisfy the predicate @var{char_pred}, if it is a procedure,
+
+@item
+are in the set @var{char_pred}, if it is a character set.
+@end itemize
+
+Note that an empty substring between separator characters will result in
+an empty string in the result list.
 
 @lisp
 (string-split "root:x:0:0:root:/root:/bin/bash" #\:)
diff --git a/libguile/srfi-13.c b/libguile/srfi-13.c
index 2834553..97c5a1d 100644
--- a/libguile/srfi-13.c
+++ b/libguile/srfi-13.c
@@ -2993,11 +2993,22 @@ SCM_DEFINE (scm_string_tokenize, "string-tokenize", 1, 3, 0,
 #undef FUNC_NAME
 
 SCM_DEFINE (scm_string_split, "string-split", 2, 0, 0,
-	    (SCM str, SCM chr),
+	    (SCM str, SCM char_pred),
 	    "Split the string @var{str} into a list of the substrings delimited\n"
-	    "by appearances of the character @var{chr}.  Note that an empty substring\n"
-	    "between separator characters will result in an empty string in the\n"
-	    "result list.\n"
+            "by appearances of characters that\n"
+            "\n"
+            "@itemize @bullet\n"
+            "@item\n"
+            "equal @var{char_pred}, if it is a character,\n"
+            "\n"
+            "@item\n"
+            "satisfy the predicate @var{char_pred}, if it is a procedure,\n"
+            "\n"
+            "@item\n"
+            "are in the set @var{char_pred}, if it is a character set.\n"
+            "@end itemize\n\n"
+            "Note that an empty substring between separator characters\n"
+            "will result in an empty string in the result list.\n"
 	    "\n"
 	    "@lisp\n"
 	    "(string-split \"root:x:0:0:root:/root:/bin/bash\" #\\:)\n"
@@ -3014,47 +3025,73 @@ SCM_DEFINE (scm_string_split, "string-split", 2, 0, 0,
 	    "@end lisp")
 #define FUNC_NAME s_scm_string_split
 {
-  long idx, last_idx;
-  int narrow;
   SCM res = SCM_EOL;
 
   SCM_VALIDATE_STRING (1, str);
-  SCM_VALIDATE_CHAR (2, chr);
   
-  /* This is explicit wide/narrow logic (instead of using
-     scm_i_string_ref) is a speed optimization.  */
-  idx = scm_i_string_length (str);
-  narrow = scm_i_is_narrow_string (str);
-  if (narrow)
+  if (SCM_CHARP (char_pred))
     {
-      const char *buf = scm_i_string_chars (str);
-      while (idx >= 0)
+      long idx, last_idx;
+      int narrow;
+
+      /* This is explicit wide/narrow logic (instead of using
+         scm_i_string_ref) is a speed optimization.  */
+      idx = scm_i_string_length (str);
+      narrow = scm_i_is_narrow_string (str);
+      if (narrow)
+        {
+          const char *buf = scm_i_string_chars (str);
+          while (idx >= 0)
+            {
+              last_idx = idx;
+              while (idx > 0 && buf[idx-1] != (char) SCM_CHAR(char_pred))
+                idx--;
+              if (idx >= 0)
+                {
+                  res = scm_cons (scm_i_substring (str, idx, last_idx), res);
+                  idx--;
+                }
+            }
+        }
+      else
         {
-          last_idx = idx;
-          while (idx > 0 && buf[idx-1] != (char) SCM_CHAR(chr))
-            idx--;
-          if (idx >= 0)
+          const scm_t_wchar *buf = scm_i_string_wide_chars (str);
+          while (idx >= 0)
             {
-              res = scm_cons (scm_i_substring (str, idx, last_idx), res);
-              idx--;
+              last_idx = idx;
+              while (idx > 0 && buf[idx-1] != SCM_CHAR(char_pred))
+                idx--;
+              if (idx >= 0)
+                {
+                  res = scm_cons (scm_i_substring (str, idx, last_idx), res);
+                  idx--;
+                }
             }
         }
     }
   else
     {
-      const scm_t_wchar *buf = scm_i_string_wide_chars (str);
-      while (idx >= 0)
+      SCM sidx, slast_idx;
+
+      if (!SCM_CHARSETP (char_pred))
+        SCM_ASSERT (scm_is_true (scm_procedure_p (char_pred)),
+                    char_pred, SCM_ARG2, FUNC_NAME);
+
+      /* Supporting predicates and character sets involves handling SCM
+         values so there is less chance to optimize. */
+      slast_idx = scm_string_length (str);
+      for (;;)
         {
-          last_idx = idx;
-          while (idx > 0 && buf[idx-1] != SCM_CHAR(chr))
-            idx--;
-          if (idx >= 0)
-            {
-              res = scm_cons (scm_i_substring (str, idx, last_idx), res);
-              idx--;
-            }
+          sidx = scm_string_index_right (str, char_pred, SCM_INUM0, slast_idx);
+          if (scm_is_false (sidx))
+            break;
+          res = scm_cons (scm_substring (str, scm_oneplus (sidx), slast_idx), res);
+          slast_idx = sidx;
         }
+
+      res = scm_cons (scm_substring (str, SCM_INUM0, slast_idx), res);
     }
+
   scm_remember_upto_here_1 (str);
   return res;
 }
diff --git a/libguile/srfi-13.h b/libguile/srfi-13.h
index f63239a..325e222 100644
--- a/libguile/srfi-13.h
+++ b/libguile/srfi-13.h
@@ -110,7 +110,7 @@ SCM_API SCM scm_xsubstring (SCM s, SCM from, SCM to, SCM start, SCM end);
 SCM_API SCM scm_string_xcopy_x (SCM target, SCM tstart, SCM s, SCM sfrom, SCM sto, SCM start, SCM end);
 SCM_API SCM scm_string_replace (SCM s1, SCM s2, SCM start1, SCM end1, SCM start2, SCM end2);
 SCM_API SCM scm_string_tokenize (SCM s, SCM token_char, SCM start, SCM end);
-SCM_API SCM scm_string_split (SCM s, SCM chr);
+SCM_API SCM scm_string_split (SCM s, SCM char_pred);
 SCM_API SCM scm_string_filter (SCM char_pred, SCM s, SCM start, SCM end);
 SCM_API SCM scm_string_delete (SCM char_pred, SCM s, SCM start, SCM end);
 
diff --git a/test-suite/tests/strings.test b/test-suite/tests/strings.test
index d892b70..679e173 100644
--- a/test-suite/tests/strings.test
+++ b/test-suite/tests/strings.test
@@ -557,7 +557,67 @@
   (pass-if "char 255"
     (equal? '("a" "b")
 	    (string-split (string #\a (integer->char 255) #\b)
-			  (integer->char 255)))))
+			  (integer->char 255))))
+
+  (pass-if "empty string - char"
+    (equal? '("")
+            (string-split "" #\:)))
+
+  (pass-if "non-empty - char - no delimiters"
+    (equal? '("foobarfrob")
+            (string-split "foobarfrob" #\:)))
+
+  (pass-if "non-empty - char - delimiters"
+    (equal? '("foo" "bar" "frob")
+            (string-split "foo:bar:frob" #\:)))
+
+  (pass-if "non-empty - char - leading delimiters"
+    (equal? '("" "" "foo" "bar" "frob")
+            (string-split "::foo:bar:frob" #\:)))
+
+  (pass-if "non-empty - char - trailing delimiters"
+    (equal? '("foo" "bar" "frob" "" "")
+            (string-split "foo:bar:frob::" #\:)))
+
+  (pass-if "empty string - charset"
+    (equal? '("")
+            (string-split "" (char-set #\:))))
+
+  (pass-if "non-empty - charset - no delimiters"
+    (equal? '("foobarfrob")
+            (string-split "foobarfrob" (char-set #\:))))
+
+  (pass-if "non-empty - charset - delimiters"
+    (equal? '("foo" "bar" "frob")
+            (string-split "foo:bar:frob" (char-set #\:))))
+
+  (pass-if "non-empty - charset - leading delimiters"
+    (equal? '("" "" "foo" "bar" "frob")
+            (string-split "::foo:bar:frob" (char-set #\:))))
+
+  (pass-if "non-empty - charset - trailing delimiters"
+    (equal? '("foo" "bar" "frob" "" "")
+            (string-split "foo:bar:frob::" (char-set #\:))))
+
+  (pass-if "empty string - pred"
+    (equal? '("")
+            (string-split "" (negate char-alphabetic?))))
+
+  (pass-if "non-empty - pred - no delimiters"
+    (equal? '("foobarfrob")
+            (string-split "foobarfrob" (negate char-alphabetic?))))
+
+  (pass-if "non-empty - pred - delimiters"
+    (equal? '("foo" "bar" "frob")
+            (string-split "foo:bar:frob" (negate char-alphabetic?))))
+
+  (pass-if "non-empty - pred - leading delimiters"
+    (equal? '("" "" "foo" "bar" "frob")
+            (string-split "::foo:bar:frob" (negate char-alphabetic?))))
+
+  (pass-if "non-empty - pred - trailing delimiters"
+    (equal? '("foo" "bar" "frob" "" "")
+            (string-split "foo:bar:frob::" (negate char-alphabetic?)))))
 
 (with-test-prefix "substring-move!"
 
-- 
1.7.9


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH] In string-split, add support for character sets and predicates.
  2012-10-12  6:38                   ` Daniel Hartwig
@ 2012-10-12 12:23                     ` Mark H Weaver
  0 siblings, 0 replies; 14+ messages in thread
From: Mark H Weaver @ 2012-10-12 12:23 UTC (permalink / raw)
  To: Daniel Hartwig; +Cc: guile-devel

Daniel Hartwig <mandyke@gmail.com> writes:
> Patch with .texi updated also.

Applied, thanks! :)

     Mark



^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2012-10-12 12:23 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-10-08 11:23 [PATCH] In string-split, add support for character sets and predicates Daniel Hartwig
2012-10-08 15:40 ` Mark H Weaver
2012-10-09  3:34   ` Daniel Hartwig
2012-10-09 17:48     ` Mark H Weaver
2012-10-10  1:37       ` Daniel Hartwig
2012-10-10  2:14         ` Mark H Weaver
2012-10-10  3:15           ` Daniel Hartwig
2012-10-10  3:25             ` Mark H Weaver
2012-10-10  3:28               ` Daniel Hartwig
2012-10-10  7:59                 ` Mark H Weaver
2012-10-10 20:44                   ` Ludovic Courtès
2012-10-12  6:38                   ` Daniel Hartwig
2012-10-12 12:23                     ` Mark H Weaver
2012-10-10 20:42               ` Ludovic Courtès

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).