unofficial mirror of guile-devel@gnu.org 
* regexp-split for Guile
@ 2012-09-17 14:01 Chris K. Jester-Young
  2012-09-17 19:32 ` Thien-Thi Nguyen
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Chris K. Jester-Young @ 2012-09-17 14:01 UTC
  To: guile-devel

Hi there,

I'm currently implementing regexp-split for Guile, which provides a
Perl-style split function (including correctly implementing the "limit"
parameter), minus the special awk-style whitespace handling (that is
used with a pattern of " ", as opposed to / /, with Perl's split).

Attached is a couple of patches, to support the regexp-split function
which I'm proposing at the bottom of this message:

1. The first fixes the behaviour of fold-matches and list-matches when
   the pattern contains a ^ (identical to the patch in my last email).
2. The second adds the ability to limit the number of matches done.
   This applies on top of the first patch.
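
For illustration, the effect the first patch relies on can be seen by
calling regexp-exec directly. This is only a minimal sketch, assuming
stock (ice-9 regex); the names rx, second-with-notbol and first-match
are made up for the example.

```scheme
(use-modules (ice-9 regex))

(define rx (make-regexp "^foo"))

;; Searching from offset 3 with regexp/notbol: ^ may not match at the
;; search start, so the second "foo" is rejected.
(define second-with-notbol (regexp-exec rx "foofoo" 3 regexp/notbol))

;; At offset 0 the flag is left out, so the first "foo" still matches.
(define first-match (regexp-exec rx "foofoo" 0))
```

This mirrors what the patched fold-matches does on each iteration: no
extra flag at position zero, regexp/notbol everywhere else.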

Some comments about the regexp-split implementation: the value that's
being passed to regexp-split-fold is a cons, where the car is the last
match's end position, and the cdr is the substrings so far collected.

The special check in regexp-split-fold for match-end being zero is to
emulate a specific behaviour as documented for Perl's split: "Empty
leading fields are produced when there are positive-width matches at
the beginning of the string; a zero-width match at the beginning of the
string does not produce an empty field."

Below is the implementation; comments are welcome! If it all looks good,
I'll write tests and documentation, with a view to eventually putting it
into (ice-9 regex).

Thanks,
Chris.

			*	*	*

(define (regexp-split-fold match prev)
  (if (zero? (match:end match)) prev
      (cons* (match:end match)
             (substring (match:string match) (car prev) (match:start match))
             (cdr prev))))

(define (string-empty? str)
  (zero? (string-length str)))

(define* (regexp-split pat str #:optional (limit 0))
  (let* ((result (fold-matches pat str '(0) regexp-split-fold 0
                               (if (positive? limit) (1- limit) #f)))
         (final (cons (substring str (car result)) (cdr result))))
    (reverse! (if (zero? limit) (drop-while string-empty? final) final))))
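
As a rough trace of the accumulator described above, the fold step can
be run without a limit, in which case the existing five-argument form
of fold-matches is enough. The names acc and fields are only for this
sketch.

```scheme
(use-modules (ice-9 regex))

;; Same fold step as above: the accumulator is (last-end . fields-so-far).
(define (regexp-split-fold match prev)
  (if (zero? (match:end match))
      prev
      (cons* (match:end match)
             (substring (match:string match) (car prev) (match:start match))
             (cdr prev))))

;; Fold over the two ":" matches in "a:b:c", starting from (0).
(define acc (fold-matches ":" "a:b:c" '(0) regexp-split-fold))
;; acc => (4 "b" "a")

;; Append the text after the last match, then put the fields in order.
(define fields (reverse (cons (substring "a:b:c" (car acc)) (cdr acc))))
;; fields => ("a" "b" "c")
```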

[-- Attachment #2: 0001-In-fold-matches-set-regexp-notbol-unless-matching-st.patch --]
[-- Type: text/x-diff, Size: 2018 bytes --]

From da8b0cd523f6e9bf9e1d46829cccf01e3115c614 Mon Sep 17 00:00:00 2001
From: "Chris K. Jester-Young" <cky944@gmail.com>
Date: Sun, 16 Sep 2012 02:20:56 -0400
Subject: [PATCH 1/2] In fold-matches, set regexp/notbol unless matching
 string start.

* module/ice-9/regex.scm (fold-matches): Set regexp/notbol if the
  starting position is nonzero.
* test-suite/tests/regexp.test (fold-matches): Check that when
  matching /^foo/ against "foofoofoofoo", only one match results.
---
 module/ice-9/regex.scm       |    3 ++-
 test-suite/tests/regexp.test |    9 ++++++++-
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/module/ice-9/regex.scm b/module/ice-9/regex.scm
index f7b94b7..08ae2c2 100644
--- a/module/ice-9/regex.scm
+++ b/module/ice-9/regex.scm
@@ -172,8 +172,9 @@
     (let loop ((start 0)
                (value init)
                (abuts #f))              ; True if start abuts a previous match.
+      (define bol (if (zero? start) 0 regexp/notbol))
       (let ((m (if (> start (string-length string)) #f
-                   (regexp-exec regexp string start flags))))
+                   (regexp-exec regexp string start (logior flags bol)))))
         (cond
          ((not m) value)
          ((and (= (match:start m) (match:end m)) abuts)
diff --git a/test-suite/tests/regexp.test b/test-suite/tests/regexp.test
index ef59465..d549df2 100644
--- a/test-suite/tests/regexp.test
+++ b/test-suite/tests/regexp.test
@@ -132,7 +132,14 @@
                    (lambda (match result)
                      (cons (match:substring match)
                            result))
-                   (logior regexp/notbol regexp/noteol)))))
+                   (logior regexp/notbol regexp/noteol))))
+
+  (pass-if "regexp/notbol is set correctly"
+    (equal? '("foo")
+            (fold-matches "^foo" "foofoofoofoo" '()
+                          (lambda (match result)
+                            (cons (match:substring match)
+                                  result))))))
 
 
 ;;;
-- 
1.7.9.5


[-- Attachment #3: 0002-Add-limit-parameter-to-fold-matches-and-list-matches.patch --]
[-- Type: text/x-diff, Size: 5194 bytes --]

From 147dc0d7fd9ab04d10b4f13cecf47a32c5b6c4b6 Mon Sep 17 00:00:00 2001
From: "Chris K. Jester-Young" <cky944@gmail.com>
Date: Mon, 17 Sep 2012 01:06:07 -0400
Subject: [PATCH 2/2] Add "limit" parameter to fold-matches and list-matches.

* doc/ref/api-regex.texi: Document new "limit" parameter.

* module/ice-9/regex.scm (fold-matches, list-matches): Optionally take
  a "limit" argument that, if specified, limits how many times the
  pattern is matched.

* test-suite/tests/regexp.test (fold-matches): Add tests for the correct
  functioning of the limit parameter.
---
 doc/ref/api-regex.texi       |   10 ++++++----
 module/ice-9/regex.scm       |   18 ++++++++++--------
 test-suite/tests/regexp.test |   16 +++++++++++++++-
 3 files changed, 31 insertions(+), 13 deletions(-)

diff --git a/doc/ref/api-regex.texi b/doc/ref/api-regex.texi
index 082fb87..2d2243f 100644
--- a/doc/ref/api-regex.texi
+++ b/doc/ref/api-regex.texi
@@ -189,11 +189,12 @@ or @code{#f} otherwise.
 @end deffn
 
 @sp 1
-@deffn {Scheme Procedure} list-matches regexp str [flags]
+@deffn {Scheme Procedure} list-matches regexp str [flags [limit]]
 Return a list of match structures which are the non-overlapping
 matches of @var{regexp} in @var{str}.  @var{regexp} can be either a
 pattern string or a compiled regexp.  The @var{flags} argument is as
-per @code{regexp-exec} above.
+per @code{regexp-exec} above.  The @var{limit} argument, if specified,
+limits how many times @var{regexp} is matched.
 
 @example
 (map match:substring (list-matches "[a-z]+" "abc 42 def 78"))
@@ -201,11 +202,12 @@ per @code{regexp-exec} above.
 @end  example
 @end deffn
 
-@deffn {Scheme Procedure} fold-matches regexp str init proc [flags]
+@deffn {Scheme Procedure} fold-matches regexp str init proc [flags [limit]]
 Apply @var{proc} to the non-overlapping matches of @var{regexp} in
 @var{str}, to build a result.  @var{regexp} can be either a pattern
 string or a compiled regexp.  The @var{flags} argument is as per
-@code{regexp-exec} above.
+@code{regexp-exec} above.  The @var{limit} argument, if specified,
+limits how many times @var{regexp} is matched.
 
 @var{proc} is called as @code{(@var{proc} match prev)} where
 @var{match} is a match structure and @var{prev} is the previous return
diff --git a/module/ice-9/regex.scm b/module/ice-9/regex.scm
index 08ae2c2..0ffe74c 100644
--- a/module/ice-9/regex.scm
+++ b/module/ice-9/regex.scm
@@ -167,26 +167,28 @@
 ;;; `b'.  Around or within `xxx', only the match covering all three
 ;;; x's counts, because the rest are not maximal.
 
-(define* (fold-matches regexp string init proc #:optional (flags 0))
+(define* (fold-matches regexp string init proc #:optional (flags 0) limit)
   (let ((regexp (if (regexp? regexp) regexp (make-regexp regexp))))
     (let loop ((start 0)
+               (count 0)
                (value init)
                (abuts #f))              ; True if start abuts a previous match.
-      (define bol (if (zero? start) 0 regexp/notbol))
-      (let ((m (if (> start (string-length string)) #f
-                   (regexp-exec regexp string start (logior flags bol)))))
+      (let* ((bol (if (zero? start) 0 regexp/notbol))
+             (m (and (or (not limit) (< count limit))
+                     (<= start (string-length string))
+                     (regexp-exec regexp string start (logior flags bol)))))
         (cond
          ((not m) value)
          ((and (= (match:start m) (match:end m)) abuts)
           ;; We matched an empty string, but that would overlap the
           ;; match immediately before.  Try again at a position
           ;; further to the right.
-          (loop (+ start 1) value #f))
+          (loop (1+ start) count value #f))
          (else
-          (loop (match:end m) (proc m value) #t)))))))
+          (loop (match:end m) (1+ count) (proc m value) #t)))))))
 
-(define* (list-matches regexp string #:optional (flags 0))
-  (reverse! (fold-matches regexp string '() cons flags)))
+(define* (list-matches regexp string #:optional (flags 0) limit)
+  (reverse! (fold-matches regexp string '() cons flags limit)))
 
 (define (regexp-substitute/global port regexp string . items)
 
diff --git a/test-suite/tests/regexp.test b/test-suite/tests/regexp.test
index d549df2..c3ba698 100644
--- a/test-suite/tests/regexp.test
+++ b/test-suite/tests/regexp.test
@@ -139,7 +139,21 @@
             (fold-matches "^foo" "foofoofoofoo" '()
                           (lambda (match result)
                             (cons (match:substring match)
-                                  result))))))
+                                  result)))))
+
+  (pass-if "without limit"
+    (equal? '("foo" "foo" "foo" "foo")
+            (fold-matches "foo" "foofoofoofoo" '()
+                          (lambda (match result)
+                            (cons (match:substring match)
+                                  result)))))
+
+  (pass-if "with limit"
+    (equal? '("foo" "foo")
+            (fold-matches "foo" "foofoofoofoo" '()
+                          (lambda (match result)
+                            (cons (match:substring match)
+                                  result)) 0 2))))
 
 
 ;;;
-- 
1.7.9.5



* Re: regexp-split for Guile
  2012-09-17 14:01 regexp-split for Guile Chris K. Jester-Young
@ 2012-09-17 19:32 ` Thien-Thi Nguyen
  2012-09-17 20:06   ` Chris K. Jester-Young
  2012-09-18 12:59 ` nalaginrut
  2012-10-04 21:47 ` Ludovic Courtès
  2 siblings, 1 reply; 18+ messages in thread
From: Thien-Thi Nguyen @ 2012-09-17 19:32 UTC
  To: guile-devel

() "Chris K. Jester-Young" <cky944@gmail.com>
() Mon, 17 Sep 2012 10:01:33 -0400

   (define (string-empty? str)
     (zero? (string-length str)))

You can use ‘string-null?’ instead.

   (define* (regexp-split pat str #:optional (limit 0))
     (let* ((result (fold-matches pat str '(0) regexp-split-fold 0
                                  (if (positive? limit) (1- limit) #f)))
            (final (cons (substring str (car result)) (cdr result))))
       (reverse! (if (zero? limit) (drop-while string-empty? final) final))))

Style nit: I find it easier to read ‘if’ expressions w/ the condition,
then and else expressions on separate lines.  Similarly ‘cons’.  E.g.:

(define* (regexp-split pat str #:optional (limit 0))
  (let* ((result (fold-matches pat str '(0) regexp-split-fold 0
                               (if (positive? limit)
                                   (1- limit)
                                   #f)))
         (final (cons (substring str (car result))
                      (cdr result))))
    (reverse! (if (zero? limit)
                  (drop-while string-empty? final)
                  final))))

It is easier because the eye can flowingly bump along the indentation
w/o the doubtful mind jerking it to the right to fully identify and then
verify forks and merges.  Does that make sense?  (If not, just ignore.)

A more substantial line of questioning: What happens if ‘regexp-split’
is called w/ negative ‘limit’?  Should that be handled in ‘regexp-split’
or will the procs it calls DTRT?  What is TRT, anyway?  In the absence
of explicit validation, maybe a comment here will help the non-expert.

-- 
Thien-Thi Nguyen ..................................... GPG key: 4C807502
.                  NB: ttn at glug dot org is not me                   .
.                 (and has not been since 2007 or so)                  .
.                        ACCEPT NO SUBSTITUTES                         .
........... please send technical questions to mailing lists ...........



* Re: regexp-split for Guile
  2012-09-17 19:32 ` Thien-Thi Nguyen
@ 2012-09-17 20:06   ` Chris K. Jester-Young
  2012-09-18  7:06     ` Sjoerd van Leent Privé
  2012-09-18 19:59     ` Chris K. Jester-Young
  0 siblings, 2 replies; 18+ messages in thread
From: Chris K. Jester-Young @ 2012-09-17 20:06 UTC
  To: guile-devel

On Mon, Sep 17, 2012 at 09:32:14PM +0200, Thien-Thi Nguyen wrote:
>    (define (string-empty? str)
>      (zero? (string-length str)))
> 
> You can use ‘string-null?’ instead.

Ah, nice! Thanks for the pointer.

> Style nit: I find it easier to read ‘if’ expressions w/ the condition,
> then and else expressions on separate lines.  Similarly ‘cons’.  E.g.:

Right, that sounds like a good idea. It does make the code longer, and
so for simple cases of "if" and "cons", I'd probably still keep it in
one line, but in this case you do make a very clear case with the "cons"
(which involves somewhat lengthier subexpressions).

> A more substantial line of questioning: What happens if ‘regexp-split’
> is called w/ negative ‘limit’?  Should that be handled in ‘regexp-split’
> or will the procs it calls DTRT?  What is TRT, anyway?  In the absence
> of explicit validation, maybe a comment here will help the non-expert.

So, basically, the Perl split's limit is used this way:

1. Positive limit: Return this many fields at most:

    (regexp-split ":" "foo:bar:baz:qux:" 3)
    => ("foo" "bar" "baz:qux:")

2. Negative limit: Return all fields:

    (regexp-split ":" "foo:bar:baz:qux:" -1)
    => ("foo" "bar" "baz" "qux" "")

3. Zero limit: Return all fields, after removing trailing blank fields:

    (regexp-split ":" "foo:bar:baz:qux:" 0)
    => ("foo" "bar" "baz" "qux")

Because of this, the specific negative value doesn't matter; they are
all treated the same. This is why the code checks for a positive limit
and passes #f to fold-matches if it's not positive. I hope this makes
sense. :-)
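
The three cases can also be mimicked in stock Guile, without the
patched fold-matches. The sketch below collects every match with
list-matches and applies the limit afterwards; the name split/limit is
hypothetical, and the special case for a zero-width match at the start
of the string is left out for brevity.

```scheme
(use-modules (ice-9 regex) (srfi srfi-1))

(define (split/limit pat str limit)
  (let* ((all (list-matches pat str))
         ;; A positive limit of N allows at most N-1 split points.
         (ms (if (positive? limit)
                 (take all (min (1- limit) (length all)))
                 all))
         ;; Cut STR around the chosen matches, left to right.
         (fields
          (let loop ((ms ms) (start 0) (acc '()))
            (if (null? ms)
                (reverse (cons (substring str start) acc))
                (loop (cdr ms)
                      (match:end (car ms))
                      (cons (substring str start (match:start (car ms)))
                            acc))))))
    ;; Only a zero limit trims trailing empty fields.
    (if (zero? limit)
        (reverse (drop-while string-null? (reverse fields)))
        fields)))
```

Called with limits 3, -1 and 0 on "foo:bar:baz:qux:", this should give
the same three results as listed above.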

Thanks so much for your feedback. I'll incorporate your comments.

Cheers,
Chris.




* Re: regexp-split for Guile
  2012-09-17 20:06   ` Chris K. Jester-Young
@ 2012-09-18  7:06     ` Sjoerd van Leent Privé
  2012-09-18 19:31       ` Chris K. Jester-Young
  2012-09-18 19:59     ` Chris K. Jester-Young
  1 sibling, 1 reply; 18+ messages in thread
From: Sjoerd van Leent Privé @ 2012-09-18  7:06 UTC
  To: Chris K. Jester-Young, guile-devel

Hi Chris,

I have been following your thread about regexp-split. I do have some 
thoughts about this to make the interface more versatile.

> So, basically, the Perl split's limit is used this way:
>
> 1. Positive limit: Return this many fields at most:
>
>      (regexp-split ":" "foo:bar:baz:qux:" 3)
>      => ("foo" "bar" "baz:qux:")
>
> 2. Negative limit: Return all fields:
>
>      (regexp-split ":" "foo:bar:baz:qux:" -1)
>      => ("foo" "bar" "baz" "qux" "")

It might just be me, but would it not be more sensible for Scheme to 
do the opposite? Return the same number of fields at most, but 
starting from the end, thus:

(regexp-split ":" "foo:bar:baz:qux:" -3)
=> ("foo:bar" "baz" "qux" "")

This is practical for paths etc.

The problem described in your second case could be solved by using a 
keyword, such as #:all, or something similar.

>
> 3. Zero limit: Return all fields, after removing trailing blank fields:
>
>      (regexp-split ":" "foo:bar:baz:qux:" 0)
>      => ("foo" "bar" "baz" "qux")
>
Regards,
Sjoerd





* Re: regexp-split for Guile
  2012-09-17 14:01 regexp-split for Guile Chris K. Jester-Young
  2012-09-17 19:32 ` Thien-Thi Nguyen
@ 2012-09-18 12:59 ` nalaginrut
  2012-09-18 19:55   ` Chris K. Jester-Young
  2012-10-04 21:47 ` Ludovic Courtès
  2 siblings, 1 reply; 18+ messages in thread
From: nalaginrut @ 2012-09-18 12:59 UTC
  To: Chris K. Jester-Young; +Cc: guile-devel

I had the same topic before:
http://lists.gnu.org/archive/html/guile-devel/2011-12/msg00247.html
Actually, there's an older thread than mine before:
http://old.nabble.com/regex-split-for-Guile-td31093245.html

Anyway, if so many people like this nice thing, why don't we add
it (whichever of these three implementations) to ice-9?


On Mon, 2012-09-17 at 10:01 -0400, Chris K. Jester-Young wrote: 
> Hi there,
> 
> I'm currently implementing regexp-split for Guile, which provides a
> Perl-style split function (including correctly implementing the "limit"
> parameter), minus the special awk-style whitespace handling (that is
> used with a pattern of " ", as opposed to / /, with Perl's split).
> 
> Attached is a couple of patches, to support the regexp-split function
> which I'm proposing at the bottom of this message:
> 
> 1. The first fixes the behaviour of fold-matches and list-matches when
>    the pattern contains a ^ (identical to the patch in my last email).
> 2. The second adds the ability to limit the number of matches done.
>    This applies on top of the first patch.
> 
> Some comments about the regexp-split implementation: the value that's
> being passed to regexp-split-fold is a cons, where the car is the last
> match's end position, and the cdr is the substrings so far collected.
> 
> The special check in regexp-split-fold for match-end being zero is to
> emulate a specific behaviour as documented for Perl's split: "Empty
> leading fields are produced when there are positive-width matches at
> the beginning of the string; a zero-width match at the beginning of the
> string does not produce an empty field."
> 
> Below is the implementation; comments are welcome! If it all looks good,
> I'll write tests and documentation, with a view to eventually putting it
> into (ice-9 regex).
> 
> Thanks,
> Chris.
> 
> 			*	*	*
> 
> (define (regexp-split-fold match prev)
>   (if (zero? (match:end match)) prev
>       (cons* (match:end match)
>              (substring (match:string match) (car prev) (match:start match))
>              (cdr prev))))
> 
> (define (string-empty? str)
>   (zero? (string-length str)))
> 
> (define* (regexp-split pat str #:optional (limit 0))
>   (let* ((result (fold-matches pat str '(0) regexp-split-fold 0
>                                (if (positive? limit) (1- limit) #f)))
>          (final (cons (substring str (car result)) (cdr result))))
>     (reverse! (if (zero? limit) (drop-while string-empty? final) final))))

* Re: regexp-split for Guile
  2012-09-18  7:06     ` Sjoerd van Leent Privé
@ 2012-09-18 19:31       ` Chris K. Jester-Young
  0 siblings, 0 replies; 18+ messages in thread
From: Chris K. Jester-Young @ 2012-09-18 19:31 UTC
  To: guile-devel

On Tue, Sep 18, 2012 at 09:06:55AM +0200, Sjoerd van Leent Privé wrote:
> It might just be me, but would it not be more sensible for Scheme to
> do the opposite? Return the same number of fields at most,
> but starting from the end, thus:
> 
> (regexp-split ":" "foo:bar:baz:qux:" -3)
> => ("foo:bar" "baz" "qux" "")

Unfortunately, this is not ideal to implement:

1. Regexes are inherently forward-searching only. This means that
   matches are always built up from left-to-right.

2. Thus, there is no sensible way to implement right-fold-matches,
   which is what would be required for what you're proposing. What
   would instead have to happen is that you have to do list-matches
   with no limit, then ignore the front matches. This complicates
   the code significantly.

3. It's not compatible with Perl's split limits. The appeal of the
   Perl semantics is that it's already implemented exactly the same
   way in Ruby and Java, so the learning curve is lower.
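
To make point 2 concrete, here is roughly what a split counted from
the end would have to do in stock Guile: list every match first, then
ignore the front ones. This is only a sketch under assumptions, not a
proposed interface. The name regexp-split-right is hypothetical, and n
is taken to be the maximum number of fields, which is just one
possible reading of the suggestion.

```scheme
(use-modules (ice-9 regex) (srfi srfi-1))

(define (regexp-split-right pat str n)
  (let* ((all (list-matches pat str))              ; every match, left to right
         (keep (min (max (- n 1) 0) (length all))) ; split points to honour
         (used (take-right all keep)))             ; ignore the front matches
    ;; Cut STR around the retained matches only.
    (let loop ((ms used) (start 0) (acc '()))
      (if (null? ms)
          (reverse (cons (substring str start) acc))
          (loop (cdr ms)
                (match:end (car ms))
                (cons (substring str start (match:start (car ms)))
                      acc))))))

;; (regexp-split-right ":" "foo:bar:baz:qux:" 3)
;; => ("foo:bar:baz" "qux" "")
```

Note that the whole string is scanned before any match can be
discarded, so a limit from the end saves no work, unlike a limit from
the front.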

> The problem described in your second case could be solved by using a
> keyword, such as #:all, or something similar.

I'm not a fan of this. :-( Because then you'd have to add another
keyword, like #:trim, to trim off the blanks (which is what happens with
a limit of 0), and then the interface suddenly got a lot more verbose.

Wanting all the fields, or trimming blank ones, are the commonest use
cases, and people who want to use it should not be punished with having
to use a verbose syntax for it.

Thanks for your feedback! Unfortunately, it sounds like implementing it
would complicate things too much, in my opinion. :-( YMMV.

Cheers,
Chris.




* Re: regexp-split for Guile
  2012-09-18 12:59 ` nalaginrut
@ 2012-09-18 19:55   ` Chris K. Jester-Young
  2012-09-19  0:30     ` nalaginrut
  0 siblings, 1 reply; 18+ messages in thread
From: Chris K. Jester-Young @ 2012-09-18 19:55 UTC
  To: guile-devel

On Tue, Sep 18, 2012 at 08:59:33PM +0800, nalaginrut wrote:
> Anyway, if so many people like this nice thing, why don't we add
> it (whichever of these three implementations) to ice-9?

Oh noes! This is where the bikeshedding begins. ;-)

Seriously, I do think having a regexp-split in (ice-9 regex) would be
good. The question is what interface is best. The one I presented is
a Perl-style one (as also used in Ruby and Java), which is easy to use,
easy to learn (if you come from one of those languages), and easy to
implement. So do we go with it, or do people have a better idea?

Cheers,
Chris.




* Re: regexp-split for Guile
  2012-09-17 20:06   ` Chris K. Jester-Young
  2012-09-18  7:06     ` Sjoerd van Leent Privé
@ 2012-09-18 19:59     ` Chris K. Jester-Young
  2012-10-07  2:38       ` Daniel Hartwig
  1 sibling, 1 reply; 18+ messages in thread
From: Chris K. Jester-Young @ 2012-09-18 19:59 UTC
  To: guile-devel

Here's a revised version, implementing Thien-Thi Nguyen's comments. I
added line breaks for the "cons" and the bottom "if" (I feel that the
top "if" is still simple enough to keep on the same line).

Cheers,
Chris.

			*	*	*

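;; Requires (ice-9 regex) with the "limit" patch to fold-matches applied,
;; plus drop-while from (srfi srfi-1).
(use-modules (ice-9 regex) (srfi srfi-1))

;; Fold step.  PREV is (last-end . fields-so-far).  A match ending at
;; position 0 is a zero-width match at the start of the string and
;; produces no field.
(define (regexp-split-fold match prev)
  (if (zero? (match:end match)) prev
      (cons* (match:end match)
             (substring (match:string match) (car prev) (match:start match))
             (cdr prev))))

;; LIMIT follows Perl's split: positive caps the number of fields,
;; negative keeps all fields, zero drops trailing empty fields.
(define* (regexp-split pat str #:optional (limit 0))
  (let* ((result (fold-matches pat str '(0) regexp-split-fold 0
                               (if (positive? limit) (1- limit) #f)))
         (final (cons (substring str (car result))
                      (cdr result))))
    (reverse (if (zero? limit)
                 (drop-while string-null? final)
                 final))))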




* Re: regexp-split for Guile
  2012-09-18 19:55   ` Chris K. Jester-Young
@ 2012-09-19  0:30     ` nalaginrut
  0 siblings, 0 replies; 18+ messages in thread
From: nalaginrut @ 2012-09-19  0:30 UTC
  To: Chris K. Jester-Young; +Cc: guile-devel

On Tue, 2012-09-18 at 15:55 -0400, Chris K. Jester-Young wrote: 
> On Tue, Sep 18, 2012 at 08:59:33PM +0800, nalaginrut wrote:
> > Anyway, if so many people like this nice thing, why don't we add
> > it (whichever of these three implementations) to ice-9?
> 
> Oh noes! This is where the bikeshedding begins. ;-)
> 
> Seriously, I do think having a regexp-split in (ice-9 regex) would be
> good. The question is what interface is best. The one I presented is
> a Perl-style one (as also used in Ruby and Java), which is easy to use,
> easy to learn (if you come from one of those languages), and easy to
> implement. So do we go with it, or do people have a better idea?
> 

Yeah~ my version is Python-style, but I don't stick to any specific
style. Anyway, my opinion is "we do need such a thing". ;-D

> Cheers,
> Chris.
> 






* Re: regexp-split for Guile
  2012-09-17 14:01 regexp-split for Guile Chris K. Jester-Young
  2012-09-17 19:32 ` Thien-Thi Nguyen
  2012-09-18 12:59 ` nalaginrut
@ 2012-10-04 21:47 ` Ludovic Courtès
  2 siblings, 0 replies; 18+ messages in thread
From: Ludovic Courtès @ 2012-10-04 21:47 UTC (permalink / raw)
  To: guile-devel

Hi Chris,

"Chris K. Jester-Young" <cky944@gmail.com> skribis:

> I'm currently implementing regexp-split for Guile, which provides a
> Perl-style split function (including correctly implementing the "limit"
> parameter), minus the special awk-style whitespace handling (that is
> used with a pattern of " ", as opposed to / /, with Perl's split).

Wow, I don’t understand what you’re saying.  :-)

> Attached is a couple of patches, to support the regexp-split function
> which I'm proposing at the bottom of this message:
>
> 1. The first fixes the behaviour of fold-matches and list-matches when
>    the pattern contains a ^ (identical to the patch in my last email).

This one was already applied.

> 2. The second adds the ability to limit the number of matches done.
>    This applies on top of the first patch.
>
> Some comments about the regexp-split implementation: the value that's
> being passed to regexp-split-fold is a cons, where the car is the last
> match's end position, and the cdr is the substrings so far collected.
>
> The special check in regexp-split-fold for match-end being zero is to
> emulate a specific behaviour as documented for Perl's split: "Empty
> leading fields are produced when there are positive-width matches at
> the beginning of the string; a zero-width match at the beginning of the
> string does not produce an empty field."

OK.  The semantics of ‘limit’ in ‘regexp-split’ look somewhat awkward
to me, but I’ve no better idea, and I understand the rationale and
appeal to Perl hackers (yuk!).

Stylistic comments:

> From 147dc0d7fd9ab04d10b4f13cecf47a32c5b6c4b6 Mon Sep 17 00:00:00 2001
> From: "Chris K. Jester-Young" <cky944@gmail.com>
> Date: Mon, 17 Sep 2012 01:06:07 -0400
> Subject: [PATCH 2/2] Add "limit" parameter to fold-matches and list-matches.
>
> * doc/ref/api-regex.texi: Document new "limit" parameter.
>
> * module/ice-9/regex.scm (fold-matches, list-matches): Optionally take
>   a "limit" argument that, if specified, limits how many times the
>   pattern is matched.
>
> * test-suite/tests/regexp.test (fold-matches): Add tests for the correct
>   functioning of the limit parameter.
> ---
>  doc/ref/api-regex.texi       |   10 ++++++----
>  module/ice-9/regex.scm       |   18 ++++++++++--------
>  test-suite/tests/regexp.test |   16 +++++++++++++++-
>  3 files changed, 31 insertions(+), 13 deletions(-)
>
>
> diff --git a/doc/ref/api-regex.texi b/doc/ref/api-regex.texi
> index 082fb87..2d2243f 100644
> --- a/doc/ref/api-regex.texi
> +++ b/doc/ref/api-regex.texi
> @@ -189,11 +189,12 @@ or @code{#f} otherwise.
>  @end deffn
>  
>  @sp 1
> -@deffn {Scheme Procedure} list-matches regexp str [flags]
> +@deffn {Scheme Procedure} list-matches regexp str [flags [limit]]
>  Return a list of match structures which are the non-overlapping
>  matches of @var{regexp} in @var{str}.  @var{regexp} can be either a
>  pattern string or a compiled regexp.  The @var{flags} argument is as
> -per @code{regexp-exec} above.
> +per @code{regexp-exec} above.  The @var{limit} argument, if specified,
> +limits how many times @var{regexp} is matched.

Rather something non-ambiguous like:

  Match @var{regexp} at most @var{limit} times, or an indefinite number
  of times when @var{limit} is omitted or equal to @code{0}.

> -@deffn {Scheme Procedure} fold-matches regexp str init proc [flags]
> +@deffn {Scheme Procedure} fold-matches regexp str init proc [flags [limit]]
>  Apply @var{proc} to the non-overlapping matches of @var{regexp} in
>  @var{str}, to build a result.  @var{regexp} can be either a pattern
>  string or a compiled regexp.  The @var{flags} argument is as per
> -@code{regexp-exec} above.
> +@code{regexp-exec} above.  The @var{limit} argument, if specified,
> +limits how many times @var{regexp} is matched.

Likewise.

>  @var{proc} is called as @code{(@var{proc} match prev)} where
>  @var{match} is a match structure and @var{prev} is the previous return
> diff --git a/module/ice-9/regex.scm b/module/ice-9/regex.scm
> index 08ae2c2..0ffe74c 100644
> --- a/module/ice-9/regex.scm
> +++ b/module/ice-9/regex.scm
> @@ -167,26 +167,28 @@
>  ;;; `b'.  Around or within `xxx', only the match covering all three
>  ;;; x's counts, because the rest are not maximal.
>  
> -(define* (fold-matches regexp string init proc #:optional (flags 0))
> +(define* (fold-matches regexp string init proc #:optional (flags 0) limit)
>    (let ((regexp (if (regexp? regexp) regexp (make-regexp regexp))))
>      (let loop ((start 0)
> +               (count 0)
>                 (value init)
>                 (abuts #f))              ; True if start abuts a previous match.
> -      (define bol (if (zero? start) 0 regexp/notbol))
> -      (let ((m (if (> start (string-length string)) #f
> -                   (regexp-exec regexp string start (logior flags bol)))))
> +      (let* ((bol (if (zero? start) 0 regexp/notbol))
> +             (m (and (or (not limit) (< count limit))
> +                     (<= start (string-length string))
> +                     (regexp-exec regexp string start (logior flags bol)))))
>          (cond
>           ((not m) value)
>           ((and (= (match:start m) (match:end m)) abuts)
>            ;; We matched an empty string, but that would overlap the
>            ;; match immediately before.  Try again at a position
>            ;; further to the right.
> -          (loop (+ start 1) value #f))
> +          (loop (1+ start) count value #f))
>           (else
> -          (loop (match:end m) (proc m value) #t)))))))
> +          (loop (match:end m) (1+ count) (proc m value) #t)))))))
>  
> -(define* (list-matches regexp string #:optional (flags 0))
> -  (reverse! (fold-matches regexp string '() cons flags)))
> +(define* (list-matches regexp string #:optional (flags 0) limit)
> +  (reverse! (fold-matches regexp string '() cons flags limit)))
>  
>  (define (regexp-substitute/global port regexp string . items)

[...]

> +  (pass-if "with limit"
> +    (equal? '("foo" "foo")
> +            (fold-matches "foo" "foofoofoofoo" '()
> +                          (lambda (match result)
> +                            (cons (match:substring match)
> +                                  result)) 0 2))))

Indent like Thien-Thi suggested.

Could you send an updated patch?

Thanks!

Ludo’.





* Re: regexp-split for Guile
  2012-09-18 19:59     ` Chris K. Jester-Young
@ 2012-10-07  2:38       ` Daniel Hartwig
  2012-10-12 21:57         ` Mark H Weaver
  0 siblings, 1 reply; 18+ messages in thread
From: Daniel Hartwig @ 2012-10-07  2:38 UTC (permalink / raw)
  To: guile-devel

On 19 September 2012 03:59, Chris K. Jester-Young <cky944@gmail.com> wrote:
> (define* (regexp-split pat str #:optional (limit 0))
> […]
>     (reverse (if (zero? limit)
>                  (drop-while string-null? final)
>                  final))))
>

Please simplify this limit arg, removing the maybe-drop-empty-strings
behaviour.  Either positive limit or #f for all matches.  It is
trivial for the caller to remove the empty strings if desired, and
simplifies the docs for regexp-split.  Matching perl semantics is not
necessarily desirable.

The discussion following the previous thread (started by Nala) is
quite useful.  Semantics of many other implementations were examined
and a good summary later included.  (Which does remind me to get back
to that TODO list…)

Regards




* Re: regexp-split for Guile
  2012-10-07  2:38       ` Daniel Hartwig
@ 2012-10-12 21:57         ` Mark H Weaver
  2012-10-20  4:01           ` Chris K. Jester-Young
  0 siblings, 1 reply; 18+ messages in thread
From: Mark H Weaver @ 2012-10-12 21:57 UTC (permalink / raw)
  To: Daniel Hartwig; +Cc: guile-devel

Daniel Hartwig <mandyke@gmail.com> writes:
> On 19 September 2012 03:59, Chris K. Jester-Young <cky944@gmail.com> wrote:
>> (define* (regexp-split pat str #:optional (limit 0))
>> […]
>>     (reverse (if (zero? limit)
>>                  (drop-while string-null? final)
>>                  final))))
>>
>
> Please simplify this limit arg, removing the maybe-drop-empty-strings
> behaviour.  Either positive limit or #f for all matches.  It is
> trivial for the caller to remove the empty strings if desired, and
> simplifies the docs for regexp-split.  Matching perl semantics is not
> necessarily desirable.

FWIW, I agree with Daniel.  I dislike the complicated semantics of this
'limit' argument, which combines into a single number two different
concepts:

* What limiting mode to use:

   [A] return 'limit' many fields at most
   [B] return all fields
   [C] return all fields except trailing blank fields

* How many fields, if using limiting mode [A].

Beyond matters of taste, I don't like this because it makes bugs less
likely to be caught.  Suppose 'limit' is a computed value, normally
expected to be positive.  Code that follows may implicitly assume that
the returned list has no more than 'limit' elements.  Now suppose that
due to a bug or exceptional circumstance, the computed 'limit' ends up
being less than 1.  Now 'regexp-split' switches to a qualitatively
different mode of behavior.

I'd prefer for a numeric limit to be interpreted in a uniform way.  That
suggests that a non-positive 'limit' should raise an exception.
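
For illustration only, a uniform check of that kind might look like the
sketch below. The helper name check-limit is hypothetical and is not part
of any patch in this thread; it assumes #f is the way to say "no limit".

```scheme
;; Hypothetical helper (not from the patches in this thread): accept #f
;; (meaning "no limit") or an exact positive integer, and signal an error
;; for anything else instead of silently switching to another mode.
(define (check-limit limit)
  (if (or (not limit)
          (and (integer? limit) (exact? limit) (positive? limit)))
      limit
      (error "regexp-split: limit must be #f or a positive integer:" limit)))
```

With this in place, a computed limit that ends up as 0 or -1 fails loudly
at the call site rather than returning a list of unexpected length.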

Limiting modes [B] and [C] could be indicated in a few different ways.
One possibility would be to pass special symbol values for the 'limit'
argument to indicate these two other modes.

Another possibility is to add a 'drop-right-while' procedure (analogous
to SRFI-1's 'drop-while'), and then users who want this could do:

  (drop-right-while string-null?
                    (regexp-split ...))

    Regards,
      Mark




* Re: regexp-split for Guile
  2012-10-12 21:57         ` Mark H Weaver
@ 2012-10-20  4:01           ` Chris K. Jester-Young
  2012-10-20 13:27             ` Mark H Weaver
  0 siblings, 1 reply; 18+ messages in thread
From: Chris K. Jester-Young @ 2012-10-20  4:01 UTC (permalink / raw)
  To: guile-devel

On Fri, Oct 12, 2012 at 05:57:11PM -0400, Mark H Weaver wrote:
> FWIW, I agree with Daniel.  I dislike the complicated semantics of this
> 'limit' argument, which combines into a single number two different
> concepts:

First, I want to thank both Daniel and Mark for their feedback. I'm
sorry I haven't had a chance to reply until now; last weekend I went
to (and presented at) RacketCon, so I didn't have a lot of time for
replying to emails.

(And if you want to see my RacketCon presentation, feel free to visit
https://speakerdeck.com/u/cky/p/rackona :-))

> Beyond matters of taste, I don't like this because it makes bugs less
> likely to be caught.  Suppose 'limit' is a computed value, normally
> expected to be positive.  Code that follows may implicitly assume that
> the returned list has no more than 'limit' elements.  Now suppose that
> due to a bug or exceptional circumstance, the computed 'limit' ends up
> being less than 1.  Now 'regexp-split' switches to a qualitatively
> different mode of behavior.

I am sympathetic to this. It would definitely be good for the limit to
mean only that, and not have two other meanings attached to it.

So, in this spirit, below is my proposal for something that I hope would
fit within the character of your feedback, while not making the common
use cases needlessly verbose: we should favour the common use cases by
making them easy to use.

Before I begin, remember that in Perl's split, the default limit is 0,
which is to strip off all the blank trailing fields. This is the common
use case when using whitespace as a delimiter, where you simply want to
ignore all the end-of-line whitespace. Making the calling code manually
call drop-right-while is counter-productive for this common use case.

Here is my proposal:

    (regexp-split pat str #:key limit (trim? (not limit)))

With no optional arguments specified (so, #:limit is #f and #:trim? is
#t), it behaves like limit == 0 in Perl. i.e., return all fields, minus
blank trailing ones.

With a #:limit specified (which must be a positive integer), return
that number of fields at most (subsequent ones are not split out, and
are returned as part of the last field, with all delimiters intact).

With #:trim? given a false value, return all fields, including blank
trailing ones. This is false by default iff #:limit is specified.

Rationale: The common use case is the most succinct version. The next
most common use case has a relatively short formulation (#:trim?).
Also, the default for #:trim? is based on common use cases depending on
whether #:limit is specified. (Trim-with-limit is not supported in Perl,
but it seemed to take more work to ban it here than just let it be.)

Examples:

    (regexp-split " +" "foo  bar  baz  ")
      => ("foo" "bar" "baz")
    (regexp-split " +" "foo  bar  baz  " #:trim? #f)
      => ("foo" "bar" "baz" "")
    (regexp-split " +" "foo  bar  baz  " #:limit 4)
      => ("foo" "bar" "baz" "")
    (regexp-split " +" "foo  bar  baz  " #:limit 4 #:trim? #t)
      => ("foo" "bar" "baz")
    (regexp-split " +" "foo  bar  baz  " #:limit 3)
      => ("foo" "bar" "baz  ")
    (regexp-split " +" "foo  bar  baz  " #:limit 2)
      => ("foo" "bar  baz  ")
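
A minimal, self-contained sketch of this keyword interface follows. It is
an illustration under stated assumptions, not the proposed patch: it loops
over regexp-exec directly instead of using the patched fold-matches, and
it simply skips zero-width matches (Perl would split between characters
there).

```scheme
(use-modules (ice-9 regex) (srfi srfi-1))

;; Sketch only.  Assumption: zero-width matches are skipped entirely.
(define* (regexp-split pat str #:key limit (trim? (not limit)))
  (let ((rx (if (regexp? pat) pat (make-regexp pat)))
        (len (string-length str)))
    (let loop ((search 0)       ; where the next regexp-exec starts
               (field-start 0)  ; where the field being built starts
               (count 1)        ; fields collected so far, plus the open one
               (acc '()))       ; collected fields, most recent first
      (let ((m (and (or (not limit) (< count limit))
                    (<= search len)
                    (regexp-exec rx str search
                                 (if (zero? search) 0 regexp/notbol)))))
        (cond
         ((not m)
          ;; Everything after the last delimiter is the final field.
          (let ((fields (cons (substring str field-start) acc)))
            (reverse (if trim?
                         (drop-while string-null? fields)
                         fields))))
         ((= (match:start m) (match:end m))
          ;; Zero-width match: retry one position further right.
          (loop (1+ search) field-start count acc))
         (else
          (loop (match:end m) (match:end m) (1+ count)
                (cons (substring str field-start (match:start m))
                      acc))))))))
```

Because the accumulated list is still reversed when trimming happens, a
plain drop-while is enough to strip the blank trailing fields.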

Does that sound reasonable?

Comments welcome,
Chris.




* Re: regexp-split for Guile
  2012-10-20  4:01           ` Chris K. Jester-Young
@ 2012-10-20 13:27             ` Mark H Weaver
  2012-10-20 14:16               ` Mark H Weaver
  0 siblings, 1 reply; 18+ messages in thread
From: Mark H Weaver @ 2012-10-20 13:27 UTC (permalink / raw)
  To: guile-devel

Hi Chris,

"Chris K. Jester-Young" <cky944@gmail.com> writes:
> On Fri, Oct 12, 2012 at 05:57:11PM -0400, Mark H Weaver wrote:
>> Beyond matters of taste, I don't like this because it makes bugs less
>> likely to be caught.  Suppose 'limit' is a computed value, normally
>> expected to be positive.  Code that follows may implicitly assume that
>> the returned list has no more than 'limit' elements.  Now suppose that
>> due to a bug or exceptional circumstance, the computed 'limit' ends up
>> being less than 1.  Now 'regexp-split' switches to a qualitatively
>> different mode of behavior.
>
> I am sympathetic to this. It would definitely be good for the limit to
> mean only that, and not have two other meanings attached to it.
>
> So, in this spirit, below is my proposal for something that I hope would
> fit within the character of your feedback, while not making the common
> use cases needlessly verbose: we should favour the common use cases by
> making them easy to use.
>
> Before I begin, remember that in Perl's split, the default limit is 0,
> which is to strip off all the blank trailing fields. This is the common
> use case when using whitespace as a delimiter, where you simply want to
> ignore all the end-of-line whitespace. Making the calling code manually
> call drop-right-while is counter-productive for this common use case.
>
> Here is my proposal:
>
>     (regexp-split pat str #:key limit (trim? (not limit)))
>
> With no optional arguments specified (so, #:limit is #f and #:trim? is
> #t), it behaves like limit == 0 in Perl. i.e., return all fields, minus
> blank trailing ones.
>
> With a #:limit specified (which must be a positive integer), return
> that number of fields at most (subsequent ones are not split out, and
> are returned as part of the last field, with all delimiters intact).
>
> With #:trim? given a false value, return all fields, including blank
> trailing ones. This is false by default iff #:limit is specified.
>
> Rationale: The common use case is the most succinct version. The next
> most common use case has a relatively short formulation (#:trim?).
> Also, the default for #:trim? is based on common use cases depending on
> whether #:limit is specified. (Trim-with-limit is not supported in Perl,
> but it seemed to take more work to ban it here than just let it be.)

I generally like your new proposal, but after mulling it over some more,
I think that trimming should be off by default, regardless of how limit
is set.  The thing is, it seems to me that the only time #:trim? #t
makes sense is when you're splitting based on whitespace.  In most other
cases, trimming is not a sensible default.

As a programmer, I don't want basic tools like 'regexp-split' adding a
post-processing pass on the results without me explicitly asking for it.
Furthermore, if I add (or remove) the #:limit argument, I'd be
unpleasantly surprised to see any other changes in behavior.

While it's sometimes reasonable for _user_ interfaces to try to guess
what the user wanted to enable shorter commands, programming interfaces
should not do so, IMO.  This kind of cleverness is expected in Perl
circles, but not in the Scheme world.

Also, if we're going to add a built-in trimmer to 'regexp-split', I'd
like to see a "trim both ends" mode as well.  When splitting by
whitespace, I suspect #:trim 'both is wanted as often as #:trim 'right.

So how about something like this?

    (regexp-split pat str #:key limit trim)

where (member trim '(#f both right left))

For example:

    (regexp-split "[/\\\\]" "foo/bar\\baz/")
      => ("foo" "bar" "baz" "")
    (regexp-split " +" "  foo  bar  baz  ")
      => ("" "foo" "bar" "baz" "")
    (regexp-split " +" "  foo  bar  baz  " #:trim 'right)
      => ("" "foo" "bar" "baz")
    (regexp-split " +" "  foo  bar  baz  " #:trim 'both)
      => ("foo" "bar" "baz")
    (regexp-split " +" "  foo  bar  baz  " #:limit 5)
      => ("" "foo" "bar" "baz" "")
    (regexp-split " +" "  foo  bar  baz  " #:limit 5 #:trim 'right)
      => ("" "foo" "bar" "baz")
    (regexp-split " +" "  foo  bar  baz  " #:limit 5 #:trim 'both)
      => ("foo" "bar" "baz")
    (regexp-split " +" "  foo  bar  baz  " #:limit 3 #:trim 'both)
      => ("foo" "bar" "baz")
    (regexp-split " +" "  foo  bar  baz  " #:limit 2 #:trim 'both)
      => ("foo" "bar")

What do you think?

Thanks for working on this!

     Mark




* Re: regexp-split for Guile
  2012-10-20 13:27             ` Mark H Weaver
@ 2012-10-20 14:16               ` Mark H Weaver
  2012-10-21  8:20                 ` Daniel Hartwig
  2012-10-21 16:08                 ` Chris K. Jester-Young
  0 siblings, 2 replies; 18+ messages in thread
From: Mark H Weaver @ 2012-10-20 14:16 UTC (permalink / raw)
  To: guile-devel

I wrote:
>     (regexp-split " +" "  foo  bar  baz  " #:limit 3 #:trim 'both)
>       => ("foo" "bar" "baz")
>     (regexp-split " +" "  foo  bar  baz  " #:limit 2 #:trim 'both)
>       => ("foo" "bar")

Sorry, that last example is wrong of course, but both of these examples
raise an interesting question about how #:limit and #:trim should
interact.  To my mind, the top example above is correct.  I think the
last result should be "baz", not "baz  ".

I guess I'd prefer to think of #:trim as trimming *before* splitting,
instead of trimming empty elements *after* splitting, so:

     (regexp-split " +" "  foo  bar  baz  " #:limit 3 #:trim 'both)
       => ("foo" "bar" "baz")
     (regexp-split " +" "  foo  bar  baz  " #:limit 2 #:trim 'both)
       => ("foo" "bar  baz")

Note also that if you trim empty elements *after* splitting, then
there's a bad interaction with #:limit if you trim the left side.
Consider:

     (regexp-split " +" "  foo  bar  baz  " #:limit 3 #:trim 'both)

If we first split, taking into account the limit, we get:

     ("" "foo" "bar  baz  ")

and then we trim empty elements from both ends to get the final result:

       => ("foo" "bar  baz  ")

which seems wrong, given that I asked for #:limit 3.

Honestly, this question makes me wonder if the proposed 'regexp-split'
is too complicated.  If you want to trim whitespace, how about using
'string-trim-right' or 'string-trim-both' before splitting?  It seems
more likely to do what I would expect.
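
To make the pre-trimming idea concrete, here is a small sketch. The name
split-all is hypothetical: it is a limit-free splitter built on the
existing five-argument fold-matches, written only so the example is
self-contained.

```scheme
(use-modules (ice-9 regex))

;; Hypothetical limit-free splitter.  The fold state is
;; (last-match-end . fields-so-far), with fields in reverse order.
(define (split-all pat str)
  (let ((result (fold-matches pat str '(0)
                              (lambda (m prev)
                                (cons* (match:end m)
                                       (substring str (car prev)
                                                  (match:start m))
                                       (cdr prev))))))
    (reverse (cons (substring str (car result))
                   (cdr result)))))

;; Splitting the raw string keeps the empty fields at both ends:
(split-all " +" "  foo  bar  baz  ")
;; => ("" "foo" "bar" "baz" "")

;; Trimming first leaves nothing for the splitter to clean up:
(split-all " +" (string-trim-both "  foo  bar  baz  "))
;; => ("foo" "bar" "baz")
```

The splitter never has to know about trimming; the caller decides which
of string-trim, string-trim-right or string-trim-both to apply.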

What do you think?

    Regards,
      Mark




* Re: regexp-split for Guile
  2012-10-20 14:16               ` Mark H Weaver
@ 2012-10-21  8:20                 ` Daniel Hartwig
  2012-10-21 19:23                   ` Chris K. Jester-Young
  2012-10-21 16:08                 ` Chris K. Jester-Young
  1 sibling, 1 reply; 18+ messages in thread
From: Daniel Hartwig @ 2012-10-21  8:20 UTC (permalink / raw)
  To: guile-devel

On 20 October 2012 22:16, Mark H Weaver <mhw@netris.org> wrote:
> Honestly, this question makes me wonder if the proposed 'regexp-split'
> is too complicated.  If you want to trim whitespace, how about using
> 'string-trim-right' or 'string-trim-both' before splitting?  It seems
> more likely to do what I would expect.

Yes.  Keep it simple.  Operations like trim-whitespace and
drop-empty-strings-from-the-result (mentioned in the previous
discussion) are so easy to do outside of regexp-split, why complicate
the semantics?

Limit is arguably more fundamental to the procedure.  Anything else is
pre- or post-processing.




* Re: regexp-split for Guile
  2012-10-20 14:16               ` Mark H Weaver
  2012-10-21  8:20                 ` Daniel Hartwig
@ 2012-10-21 16:08                 ` Chris K. Jester-Young
  1 sibling, 0 replies; 18+ messages in thread
From: Chris K. Jester-Young @ 2012-10-21 16:08 UTC (permalink / raw)
  To: guile-devel

On Sat, Oct 20, 2012 at 10:16:49AM -0400, Mark H Weaver wrote:
> Sorry, that last example is wrong of course, but both of these examples
> raise an interesting question about how #:limit and #:trim should
> interact.  To my mind, the top example above is correct.  I think the
> last result should be "baz", not "baz  ".
[...]
> Honestly, this question makes me wonder if the proposed 'regexp-split'
> is too complicated.  If you want to trim whitespace, how about using
> 'string-trim-right' or 'string-trim-both' before splitting?  It seems
> more likely to do what I would expect.

Thanks so much for your feedback, Mark! I appreciate it.

Yeah, I think given the left-to-right nature of regex matching, the
only kind of trimming that makes sense is a right trim. And then once
you do that, people start asking for left trim, and mayhem begins. ;-)

I do want to consider the string pre-trimming approach, as it's more
clear what's going on, and is less "magical" (where "magic" is a plus
in the Perl world, and not so much of a plus in other languages).

Thankfully, the string-trim{,-right,-both} functions you mentioned use
substring behind the scenes, which uses copy-on-write. So that solves
one of my potential concerns, which is that a pre-trim would require
copying most of the string.

			*	*	*

Granted, if you want trimming-with-complicated-regex-delimiter, and
not just whitespace, then your best bet is to trim the output list.
This is slightly more complicated, because my original code simply
uses drop-while before reversing the output list for return, but since
the caller doesn't receive the reversed list, they either have to
reverse+trim+reverse (yuck), or we have to implement drop-right-while
(like you mentioned previously).

In that regard, here's one implementation of drop-right-while (that I
just wrote on the spot):

    (define (drop-right-while pred lst)
      (let recur ((lst lst))
        (if (null? lst) '()
            (let ((elem (car lst))
                  (next (recur (cdr lst))))
              (if (and (null? next) (pred elem)) '()
                  (cons elem next))))))
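
For comparison, the same behaviour can also be expressed with SRFI-1's
fold-right. This is only an alternative sketch; the name
drop-right-while/fold is made up here to avoid clashing with the
definition above.

```scheme
(use-modules (srfi srfi-1))

;; Alternative sketch: fold-right visits the elements right to left, so
;; KEPT stays empty for as long as only PRED-satisfying elements have
;; been seen at the tail of the list.
(define (drop-right-while/fold pred lst)
  (fold-right (lambda (elem kept)
                (if (and (null? kept) (pred elem))
                    '()
                    (cons elem kept)))
              '()
              lst))

(drop-right-while/fold string-null? '("" "foo" "" "bar" "" ""))
;; => ("" "foo" "" "bar")
```

Only the trailing run is removed; empty strings at the front or in the
middle of the list are kept.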

One could theoretically write drop-right-while! also (I can think of
two different implementation strategies) but it sounds like it's more
work than it's worth.

So, that's our last hurdle: we "just" have to get drop-right-while
integrated into Guile, then we can separate out the splitting and
trimming processes. And everybody will be happy. :-)

Comments welcome,
Chris.




* Re: regexp-split for Guile
  2012-10-21  8:20                 ` Daniel Hartwig
@ 2012-10-21 19:23                   ` Chris K. Jester-Young
  0 siblings, 0 replies; 18+ messages in thread
From: Chris K. Jester-Young @ 2012-10-21 19:23 UTC (permalink / raw)
  To: guile-devel

On Sun, Oct 21, 2012 at 04:20:09PM +0800, Daniel Hartwig wrote:
> Yes.  Keep it simple.  Operations like trim-whitespace and
> drop-empty-strings-from-the-result (mentioned in the previous
> discussion) are so easy to do outside of regexp-split, why complicate
> the semantics?

"So easy", but so verbose. We should prefer to make common use cases
easy (and succinct) to use, and not optimise for uncommon ones.

Anyway, in my response to Mark, I mentioned that if we can get
drop-right-while in-tree, we have a middle ground that should make
"everyone" happy. I am against requiring users of regexp-split to
reinvent that wheel each time. Leave the reinvention to Phil Bewig.[1]

Cheers,
Chris.

[1] http://lists.nongnu.org/archive/html/chicken-users/2009-05/msg00024.html
    "I never use SRFI-1. My personal standard library has take, drop,
     take-while, drop-while, range, iterate, filter, zip, [...]"


