From: ludo@gnu.org (Ludovic Courtès)
To: guile-devel@gnu.org
Subject: Re: regexp-split for Guile
Date: Thu, 04 Oct 2012 23:47:32 +0200
Message-ID: <87y5jlhpu3.fsf@gnu.org>
In-Reply-To: 20120917140133.GA6315@yarrow
Hi Chris,
"Chris K. Jester-Young" <cky944@gmail.com> skribis:
> I'm currently implementing regexp-split for Guile, which provides a
> Perl-style split function (including correctly implementing the "limit"
> parameter), minus the special awk-style whitespace handling (that is
> used with a pattern of " ", as opposed to / /, with Perl's split).
Wow, I don’t understand what you’re saying. :-)
> Attached is a couple of patches, to support the regexp-split function
> which I'm proposing at the bottom of this message:
>
> 1. The first fixes the behaviour of fold-matches and list-matches when
> the pattern contains a ^ (identical to the patch in my last email).
This one was already applied.
> 2. The second adds the ability to limit the number of matches done.
> This applies on top of the first patch.
>
> Some comments about the regexp-split implementation: the value that's
> being passed to regexp-split-fold is a cons, where the car is the last
> match's end position, and the cdr is the substrings so far collected.
>
> The special check in regexp-split-fold for match-end being zero is to
> emulate a specific behaviour as documented for Perl's split: "Empty
> leading fields are produced when there are positive-width matches at
> the beginning of the string; a zero-width match at the beginning of the
> string does not produce an empty field."
OK. The semantics of ‘limit’ in ‘regexp-split’ look somewhat awkward
to me, but I’ve no better idea, and I understand the rationale and
appeal to Perl hackers (yuk!).
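For readers following along, the accumulator described in the quoted text (a pair whose car is the end of the last match and whose cdr is the list of substrings collected so far) can be sketched on top of the stock ‘fold-matches’. This is only an illustration: ‘simple-regexp-split’ is a hypothetical name, it ignores the proposed "limit" argument entirely, it does not strip trailing empty fields, and it is not the code from the patch.

```scheme
(use-modules (ice-9 regex))

;; Hypothetical helper, NOT the regexp-split from the patch:
;; no "limit" handling and no trimming of trailing empty fields.
(define (simple-regexp-split pattern str)
  (let* ((acc (fold-matches
               pattern str
               (cons 0 '())            ; (last-match-end . fields-so-far)
               (lambda (m prev)
                 (if (zero? (match:end m))
                     ;; A zero-width match at the very start of the
                     ;; string produces no empty leading field.
                     prev
                     (cons (match:end m)
                           (cons (substring str (car prev) (match:start m))
                                 (cdr prev)))))))
         (last-end (car acc)))
    ;; Whatever follows the last match is the final field.
    (reverse (cons (substring str last-end) (cdr acc)))))

(simple-regexp-split "," "a,b,c")   ; => ("a" "b" "c")
(simple-regexp-split "," ",a")      ; => ("" "a")
```

The second call shows the other half of the Perl rule quoted above: a positive-width match at the beginning of the string does produce an empty leading field.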
Stylistic comments:
> From 147dc0d7fd9ab04d10b4f13cecf47a32c5b6c4b6 Mon Sep 17 00:00:00 2001
> From: "Chris K. Jester-Young" <cky944@gmail.com>
> Date: Mon, 17 Sep 2012 01:06:07 -0400
> Subject: [PATCH 2/2] Add "limit" parameter to fold-matches and list-matches.
>
> * doc/ref/api-regex.texi: Document new "limit" parameter.
>
> * module/ice-9/regex.scm (fold-matches, list-matches): Optionally take
> a "limit" argument that, if specified, limits how many times the
> pattern is matched.
>
> * test-suite/tests/regexp.test (fold-matches): Add tests for the correct
> functioning of the limit parameter.
> ---
> doc/ref/api-regex.texi | 10 ++++++----
> module/ice-9/regex.scm | 18 ++++++++++--------
> test-suite/tests/regexp.test | 16 +++++++++++++++-
> 3 files changed, 31 insertions(+), 13 deletions(-)
>
>
> diff --git a/doc/ref/api-regex.texi b/doc/ref/api-regex.texi
> index 082fb87..2d2243f 100644
> --- a/doc/ref/api-regex.texi
> +++ b/doc/ref/api-regex.texi
> @@ -189,11 +189,12 @@ or @code{#f} otherwise.
> @end deffn
>
> @sp 1
> -@deffn {Scheme Procedure} list-matches regexp str [flags]
> +@deffn {Scheme Procedure} list-matches regexp str [flags [limit]]
> Return a list of match structures which are the non-overlapping
> matches of @var{regexp} in @var{str}. @var{regexp} can be either a
> pattern string or a compiled regexp. The @var{flags} argument is as
> -per @code{regexp-exec} above.
> +per @code{regexp-exec} above. The @var{limit} argument, if specified,
> +limits how many times @var{regexp} is matched.
Rather something non-ambiguous like:
Match @var{regexp} at most @var{limit} times, or an indefinite number
of times when @var{limit} is omitted or equal to @code{0}.
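As a rough illustration of the semantics in that wording (at most @var{limit} matches, with an omitted or zero limit meaning "no bound"), here is a sketch written against the unpatched ‘list-matches’. The name ‘list-matches/limit’ is hypothetical, and the approach differs from the patch: it collects every match first and truncates afterwards, so it only demonstrates the behaviour and is not an efficient implementation.

```scheme
(use-modules (ice-9 regex)
             (srfi srfi-1))            ; for `take'

;; Hypothetical wrapper, NOT the patched list-matches: it matches
;; everything with the stock procedure, then keeps the first LIMIT.
(define* (list-matches/limit regexp str #:optional (flags 0) (limit 0))
  (let ((all (list-matches regexp str flags)))
    (if (or (zero? limit)              ; 0 (or omitted) means unbounded
            (<= (length all) limit))
        all
        (take all limit))))

(map match:substring
     (list-matches/limit "foo" "foofoofoofoo" 0 2))   ; => ("foo" "foo")
```

Note that the condition in the quoted patch, ‘(or (not limit) (< count limit))’, would treat a limit of 0 as "match nothing", so the wording above and the code would need to agree one way or the other.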
> -@deffn {Scheme Procedure} fold-matches regexp str init proc [flags]
> +@deffn {Scheme Procedure} fold-matches regexp str init proc [flags [limit]]
> Apply @var{proc} to the non-overlapping matches of @var{regexp} in
> @var{str}, to build a result. @var{regexp} can be either a pattern
> string or a compiled regexp. The @var{flags} argument is as per
> -@code{regexp-exec} above.
> +@code{regexp-exec} above. The @var{limit} argument, if specified,
> +limits how many times @var{regexp} is matched.
Likewise.
> @var{proc} is called as @code{(@var{proc} match prev)} where
> @var{match} is a match structure and @var{prev} is the previous return
> diff --git a/module/ice-9/regex.scm b/module/ice-9/regex.scm
> index 08ae2c2..0ffe74c 100644
> --- a/module/ice-9/regex.scm
> +++ b/module/ice-9/regex.scm
> @@ -167,26 +167,28 @@
> ;;; `b'. Around or within `xxx', only the match covering all three
> ;;; x's counts, because the rest are not maximal.
>
> -(define* (fold-matches regexp string init proc #:optional (flags 0))
> +(define* (fold-matches regexp string init proc #:optional (flags 0) limit)
> (let ((regexp (if (regexp? regexp) regexp (make-regexp regexp))))
> (let loop ((start 0)
> + (count 0)
> (value init)
> (abuts #f)) ; True if start abuts a previous match.
> - (define bol (if (zero? start) 0 regexp/notbol))
> - (let ((m (if (> start (string-length string)) #f
> - (regexp-exec regexp string start (logior flags bol)))))
> + (let* ((bol (if (zero? start) 0 regexp/notbol))
> + (m (and (or (not limit) (< count limit))
> + (<= start (string-length string))
> + (regexp-exec regexp string start (logior flags bol)))))
> (cond
> ((not m) value)
> ((and (= (match:start m) (match:end m)) abuts)
> ;; We matched an empty string, but that would overlap the
> ;; match immediately before. Try again at a position
> ;; further to the right.
> - (loop (+ start 1) value #f))
> + (loop (1+ start) count value #f))
> (else
> - (loop (match:end m) (proc m value) #t)))))))
> + (loop (match:end m) (1+ count) (proc m value) #t)))))))
>
> -(define* (list-matches regexp string #:optional (flags 0))
> - (reverse! (fold-matches regexp string '() cons flags)))
> +(define* (list-matches regexp string #:optional (flags 0) limit)
> + (reverse! (fold-matches regexp string '() cons flags limit)))
>
> (define (regexp-substitute/global port regexp string . items)
[...]
> + (pass-if "with limit"
> + (equal? '("foo" "foo")
> + (fold-matches "foo" "foofoofoofoo" '()
> + (lambda (match result)
> + (cons (match:substring match)
> + result)) 0 2))))
Indent like Thien-Thi suggested.
Could you send an updated patch?
Thanks!
Ludo’.