From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Chris K. Jester-Young"
Newsgroups: gmane.lisp.guile.devel
Subject: regexp-split for Guile
Date: Mon, 17 Sep 2012 10:01:33 -0400
Message-ID: <20120917140133.GA6315@yarrow>
To: guile-devel@gnu.org
List-Id: "Developers list for Guile, the GNU extensibility library"
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="Qxx1br4bt0+wmkIi"

--Qxx1br4bt0+wmkIi
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

Hi there,

I'm currently implementing regexp-split for Guile, which provides a
Perl-style split function (including correctly implementing the "limit"
parameter), minus the special awk-style whitespace handling (that is
used with a pattern of " ", as opposed to / /, with Perl's split).

Attached are a couple of patches to support the regexp-split function,
which I'm proposing at the bottom of this message:

1. The first fixes the behaviour of fold-matches and list-matches when
   the pattern contains a ^ (identical to the patch in my last email).
2. The second adds the ability to limit the number of matches done.
   This applies on top of the first patch.

Some comments about the regexp-split implementation: the value that's
being passed to regexp-split-fold is a cons, where the car is the last
match's end position, and the cdr is the substrings so far collected.

The special check in regexp-split-fold for match-end being zero is to
emulate a specific behaviour as documented for Perl's split: "Empty
leading fields are produced when there are positive-width matches at
the beginning of the string; a zero-width match at the beginning of
the string does not produce an empty field."

Below is the implementation; comments are welcome! If it all looks good,
I'll write tests and documentation, with a view to eventually putting
it into (ice-9 regex).

Thanks,
Chris.

                * * *

(define (regexp-split-fold match prev)
  (if (zero? (match:end match)) prev
      (cons* (match:end match)
             (substring (match:string match) (car prev) (match:start match))
             (cdr prev))))

(define (string-empty? str)
  (zero? (string-length str)))

(define* (regexp-split pat str #:optional (limit 0))
  (let* ((result (fold-matches pat str '(0) regexp-split-fold 0
                               (if (positive? limit) (1- limit) #f)))
         (final (cons (substring str (car result)) (cdr result))))
    (reverse! (if (zero? limit)
                  (drop-while string-empty? final)
                  final))))

--Qxx1br4bt0+wmkIi
Content-Type: text/x-diff; charset=us-ascii
Content-Disposition: inline; filename="0001-In-fold-matches-set-regexp-notbol-unless-matching-st.patch"

>From da8b0cd523f6e9bf9e1d46829cccf01e3115c614 Mon Sep 17 00:00:00 2001
From: "Chris K. Jester-Young"
Date: Sun, 16 Sep 2012 02:20:56 -0400
Subject: [PATCH 1/2] In fold-matches, set regexp/notbol unless matching string start.

* module/ice-9/regex.scm (fold-matches): Set regexp/notbol if the
  starting position is nonzero.
* test-suite/tests/regexp.test (fold-matches): Check that when
  matching /^foo/ against "foofoofoofoo", only one match results.
---
 module/ice-9/regex.scm       |    3 ++-
 test-suite/tests/regexp.test |    9 ++++++++-
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/module/ice-9/regex.scm b/module/ice-9/regex.scm
index f7b94b7..08ae2c2 100644
--- a/module/ice-9/regex.scm
+++ b/module/ice-9/regex.scm
@@ -172,8 +172,9 @@
     (let loop ((start 0)
                (value init)
                (abuts #f))              ; True if start abuts a previous match.
+      (define bol (if (zero? start) 0 regexp/notbol))
       (let ((m (if (> start (string-length string)) #f
-                   (regexp-exec regexp string start flags))))
+                   (regexp-exec regexp string start (logior flags bol)))))
         (cond
          ((not m) value)
          ((and (= (match:start m) (match:end m)) abuts)
diff --git a/test-suite/tests/regexp.test b/test-suite/tests/regexp.test
index ef59465..d549df2 100644
--- a/test-suite/tests/regexp.test
+++ b/test-suite/tests/regexp.test
@@ -132,7 +132,14 @@
                          (lambda (match result)
                            (cons (match:substring match)
                                  result))
-                         (logior regexp/notbol regexp/noteol)))))
+                         (logior regexp/notbol regexp/noteol))))
+
+  (pass-if "regexp/notbol is set correctly"
+    (equal? '("foo")
+            (fold-matches "^foo" "foofoofoofoo" '()
+                          (lambda (match result)
+                            (cons (match:substring match)
+                                  result))))))
 
 
 ;;;
-- 
1.7.9.5

--Qxx1br4bt0+wmkIi
Content-Type: text/x-diff; charset=us-ascii
Content-Disposition: inline; filename="0002-Add-limit-parameter-to-fold-matches-and-list-matches.patch"

>From 147dc0d7fd9ab04d10b4f13cecf47a32c5b6c4b6 Mon Sep 17 00:00:00 2001
From: "Chris K. Jester-Young"
Date: Mon, 17 Sep 2012 01:06:07 -0400
Subject: [PATCH 2/2] Add "limit" parameter to fold-matches and list-matches.

* doc/ref/api-regex.texi: Document new "limit" parameter.

* module/ice-9/regex.scm (fold-matches, list-matches): Optionally take
  a "limit" argument that, if specified, limits how many times the
  pattern is matched.

* test-suite/tests/regexp.test (fold-matches): Add tests for the correct
  functioning of the limit parameter.
---
 doc/ref/api-regex.texi       |   10 ++++++----
 module/ice-9/regex.scm       |   18 ++++++++++--------
 test-suite/tests/regexp.test |   16 +++++++++++++++-
 3 files changed, 31 insertions(+), 13 deletions(-)

diff --git a/doc/ref/api-regex.texi b/doc/ref/api-regex.texi
index 082fb87..2d2243f 100644
--- a/doc/ref/api-regex.texi
+++ b/doc/ref/api-regex.texi
@@ -189,11 +189,12 @@ or @code{#f} otherwise.
 @end deffn
 
 @sp 1
-@deffn {Scheme Procedure} list-matches regexp str [flags]
+@deffn {Scheme Procedure} list-matches regexp str [flags [limit]]
 Return a list of match structures which are the non-overlapping
 matches of @var{regexp} in @var{str}.  @var{regexp} can be either a
 pattern string or a compiled regexp.  The @var{flags} argument is as
-per @code{regexp-exec} above.
+per @code{regexp-exec} above.  The @var{limit} argument, if specified,
+limits how many times @var{regexp} is matched.
 
 @example
 (map match:substring (list-matches "[a-z]+" "abc 42 def 78"))
@@ -201,11 +202,12 @@ per @code{regexp-exec} above.
 @end example
 @end deffn
 
-@deffn {Scheme Procedure} fold-matches regexp str init proc [flags]
+@deffn {Scheme Procedure} fold-matches regexp str init proc [flags [limit]]
 Apply @var{proc} to the non-overlapping matches of @var{regexp} in
 @var{str}, to build a result.  @var{regexp} can be either a pattern
 string or a compiled regexp.  The @var{flags} argument is as per
-@code{regexp-exec} above.
+@code{regexp-exec} above.  The @var{limit} argument, if specified,
+limits how many times @var{regexp} is matched.
 
 @var{proc} is called as @code{(@var{proc} match prev)} where
 @var{match} is a match structure and @var{prev} is the previous return
diff --git a/module/ice-9/regex.scm b/module/ice-9/regex.scm
index 08ae2c2..0ffe74c 100644
--- a/module/ice-9/regex.scm
+++ b/module/ice-9/regex.scm
@@ -167,26 +167,28 @@
 ;;; `b'.  Around or within `xxx', only the match covering all three
 ;;; x's counts, because the rest are not maximal.
 
-(define* (fold-matches regexp string init proc #:optional (flags 0))
+(define* (fold-matches regexp string init proc #:optional (flags 0) limit)
   (let ((regexp (if (regexp? regexp) regexp (make-regexp regexp))))
     (let loop ((start 0)
+               (count 0)
                (value init)
                (abuts #f))              ; True if start abuts a previous match.
-      (define bol (if (zero? start) 0 regexp/notbol))
-      (let ((m (if (> start (string-length string)) #f
-                   (regexp-exec regexp string start (logior flags bol)))))
+      (let* ((bol (if (zero? start) 0 regexp/notbol))
+             (m (and (or (not limit) (< count limit))
+                     (<= start (string-length string))
+                     (regexp-exec regexp string start (logior flags bol)))))
         (cond
          ((not m) value)
          ((and (= (match:start m) (match:end m)) abuts)
           ;; We matched an empty string, but that would overlap the
           ;; match immediately before.  Try again at a position
           ;; further to the right.
-          (loop (+ start 1) value #f))
+          (loop (1+ start) count value #f))
          (else
-          (loop (match:end m) (proc m value) #t)))))))
+          (loop (match:end m) (1+ count) (proc m value) #t)))))))
 
-(define* (list-matches regexp string #:optional (flags 0))
-  (reverse! (fold-matches regexp string '() cons flags)))
+(define* (list-matches regexp string #:optional (flags 0) limit)
+  (reverse! (fold-matches regexp string '() cons flags limit)))
 
 (define (regexp-substitute/global port regexp string . items)
 
diff --git a/test-suite/tests/regexp.test b/test-suite/tests/regexp.test
index d549df2..c3ba698 100644
--- a/test-suite/tests/regexp.test
+++ b/test-suite/tests/regexp.test
@@ -139,7 +139,21 @@
             (fold-matches "^foo" "foofoofoofoo" '()
                           (lambda (match result)
                             (cons (match:substring match)
-                                  result))))))
+                                  result)))))
+
+  (pass-if "without limit"
+    (equal? '("foo" "foo" "foo" "foo")
+            (fold-matches "foo" "foofoofoofoo" '()
+                          (lambda (match result)
+                            (cons (match:substring match)
+                                  result)))))
+
+  (pass-if "with limit"
+    (equal? '("foo" "foo")
+            (fold-matches "foo" "foofoofoofoo" '()
+                          (lambda (match result)
+                            (cons (match:substring match)
+                                  result)) 0 2))))
 
 
 ;;;
-- 
1.7.9.5

--Qxx1br4bt0+wmkIi--
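
For anyone who wants to try the proposed regexp-split without applying
the patches first, here is a minimal sketch. It assumes an unpatched
Guile 2.0 or later with regex support. The name fold-matches/limit is
hypothetical and not part of (ice-9 regex); it is a local copy of the
loop from the second patch, so that the "limit" argument is available.
The drop-while procedure is taken from (srfi srfi-1). The results in the
comments are what Perl's documented split semantics give for the same
inputs.

```scheme
(use-modules (ice-9 regex)    ; match:start, match:end, match:string
             (srfi srfi-1))   ; drop-while

;; Hypothetical stand-in for the patched fold-matches (same loop as in
;; patch 2/2), defined locally so this runs on an unpatched Guile.
(define* (fold-matches/limit regexp string init proc
                             #:optional (flags 0) limit)
  (let ((regexp (if (regexp? regexp) regexp (make-regexp regexp))))
    (let loop ((start 0) (count 0) (value init) (abuts #f))
      (let* ((bol (if (zero? start) 0 regexp/notbol))
             (m (and (or (not limit) (< count limit))
                     (<= start (string-length string))
                     (regexp-exec regexp string start (logior flags bol)))))
        (cond
         ((not m) value)
         ((and (= (match:start m) (match:end m)) abuts)
          ;; Empty match abutting the previous one: retry one step right.
          (loop (1+ start) count value #f))
         (else
          (loop (match:end m) (1+ count) (proc m value) #t)))))))

;; prev is (last-match-end . substrings-collected-so-far).
(define (regexp-split-fold match prev)
  (if (zero? (match:end match)) prev
      (cons* (match:end match)
             (substring (match:string match) (car prev) (match:start match))
             (cdr prev))))

(define (string-empty? str)
  (zero? (string-length str)))

(define* (regexp-split pat str #:optional (limit 0))
  (let* ((result (fold-matches/limit pat str '(0) regexp-split-fold 0
                                     (if (positive? limit) (1- limit) #f)))
         (final (cons (substring str (car result)) (cdr result))))
    (reverse! (if (zero? limit)
                  (drop-while string-empty? final)
                  final))))

(regexp-split "," "a,b,c")      ; => ("a" "b" "c")
(regexp-split "," "a,b,c" 2)    ; => ("a" "b,c")
(regexp-split "," "a,b,,")      ; => ("a" "b"), trailing empties dropped
(regexp-split "," "a,b,," -1)   ; => ("a" "b" "" ""), trailing empties kept
```

A limit of zero strips trailing empty fields, a positive limit caps the
number of fields returned, and a negative limit keeps every field. The
list in "final" is built in reverse, which is why drop-while can remove
trailing empty fields from its front before the final reverse!.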