From: ludo@gnu.org (Ludovic Courtès)
Newsgroups: gmane.lisp.guile.devel
To: guile-devel@gnu.org
Subject: Re: regexp-split for Guile
Date: Thu, 04 Oct 2012 23:47:32 +0200
Message-ID: <87y5jlhpu3.fsf@gnu.org>
References: <20120917140133.GA6315@yarrow>

Hi Chris,

"Chris K. Jester-Young" wrote:

> I'm currently implementing regexp-split for Guile, which provides a
> Perl-style split function (including correctly implementing the "limit"
> parameter), minus the special awk-style whitespace handling (that is
> used with a pattern of " ", as opposed to / /, with Perl's split).

Woow, I don’t understand what you’re saying.  :-)

> Attached is a couple of patches, to support the regexp-split function
> which I'm proposing at the bottom of this message:
>
> 1. The first fixes the behaviour of fold-matches and list-matches when
>    the pattern contains a ^ (identical to the patch in my last email).

This one was already applied.

> 2. The second adds the ability to limit the number of matches done.
>    This applies on top of the first patch.
>
> Some comments about the regexp-split implementation: the value that's
> being passed to regexp-split-fold is a cons, where the car is the last
> match's end position, and the cdr is the substrings so far collected.
>
> The special check in regexp-split-fold for match-end being zero is to
> emulate a specific behaviour as documented for Perl's split: "Empty
> leading fields are produced when there are positive-width matches at
> the beginning of the string; a zero-width match at the beginning of the
> string does not produce an empty field."

OK.

The semantics of ‘limit’ in ‘regexp-split’ look somewhat awkward to me,
but I’ve no better idea, and I understand the rationale and appeal to
Perl hackers (yuk!).

Stylistic comments:

> From 147dc0d7fd9ab04d10b4f13cecf47a32c5b6c4b6 Mon Sep 17 00:00:00 2001
> From: "Chris K. Jester-Young"
> Date: Mon, 17 Sep 2012 01:06:07 -0400
> Subject: [PATCH 2/2] Add "limit" parameter to fold-matches and list-matches.
>
> * doc/ref/api-regex.texi: Document new "limit" parameter.
>
> * module/ice-9/regex.scm (fold-matches, list-matches): Optionally take
>   a "limit" argument that, if specified, limits how many times the
>   pattern is matched.
>
> * test-suite/tests/regexp.test (fold-matches): Add tests for the correct
>   functioning of the limit parameter.
> ---
>  doc/ref/api-regex.texi       |   10 ++++++----
>  module/ice-9/regex.scm       |   18 ++++++++++--------
>  test-suite/tests/regexp.test |   16 +++++++++++++++-
>  3 files changed, 31 insertions(+), 13 deletions(-)
>
>
> diff --git a/doc/ref/api-regex.texi b/doc/ref/api-regex.texi
> index 082fb87..2d2243f 100644
> --- a/doc/ref/api-regex.texi
> +++ b/doc/ref/api-regex.texi
> @@ -189,11 +189,12 @@ or @code{#f} otherwise.
>  @end deffn
>
>  @sp 1
> -@deffn {Scheme Procedure} list-matches regexp str [flags]
> +@deffn {Scheme Procedure} list-matches regexp str [flags [limit]]
>  Return a list of match structures which are the non-overlapping
>  matches of @var{regexp} in @var{str}.  @var{regexp} can be either a
>  pattern string or a compiled regexp.  The @var{flags} argument is as
> -per @code{regexp-exec} above.
> +per @code{regexp-exec} above.  The @var{limit} argument, if specified,
> +limits how many times @var{regexp} is matched.

Rather something non-ambiguous like:

  Match @var{regexp} at most @var{limit} times, or an indefinite number
  of times when @var{limit} is omitted or equal to @code{0}.

> -@deffn {Scheme Procedure} fold-matches regexp str init proc [flags]
> +@deffn {Scheme Procedure} fold-matches regexp str init proc [flags [limit]]
>  Apply @var{proc} to the non-overlapping matches of @var{regexp} in
>  @var{str}, to build a result.  @var{regexp} can be either a pattern
>  string or a compiled regexp.  The @var{flags} argument is as per
> -@code{regexp-exec} above.
> +@code{regexp-exec} above.  The @var{limit} argument, if specified,
> +limits how many times @var{regexp} is matched.

Likewise.

>  @var{proc} is called as @code{(@var{proc} match prev)} where
>  @var{match} is a match structure and @var{prev} is the previous return
> diff --git a/module/ice-9/regex.scm b/module/ice-9/regex.scm
> index 08ae2c2..0ffe74c 100644
> --- a/module/ice-9/regex.scm
> +++ b/module/ice-9/regex.scm
> @@ -167,26 +167,28 @@
>  ;;; `b'.  Around or within `xxx', only the match covering all three
>  ;;; x's counts, because the rest are not maximal.
>
> -(define* (fold-matches regexp string init proc #:optional (flags 0))
> +(define* (fold-matches regexp string init proc #:optional (flags 0) limit)
>    (let ((regexp (if (regexp? regexp) regexp (make-regexp regexp))))
>      (let loop ((start 0)
> +               (count 0)
>                 (value init)
>                 (abuts #f))              ; True if start abuts a previous match.
> -      (define bol (if (zero? start) 0 regexp/notbol))
> -      (let ((m (if (> start (string-length string)) #f
> -                   (regexp-exec regexp string start (logior flags bol)))))
> +      (let* ((bol (if (zero? start) 0 regexp/notbol))
> +             (m (and (or (not limit) (< count limit))
> +                     (<= start (string-length string))
> +                     (regexp-exec regexp string start (logior flags bol)))))
>          (cond
>           ((not m) value)
>           ((and (= (match:start m) (match:end m)) abuts)
>            ;; We matched an empty string, but that would overlap the
>            ;; match immediately before.  Try again at a position
>            ;; further to the right.
> -          (loop (+ start 1) value #f))
> +          (loop (1+ start) count value #f))
>           (else
> -          (loop (match:end m) (proc m value) #t)))))))
> +          (loop (match:end m) (1+ count) (proc m value) #t)))))))
>
> -(define* (list-matches regexp string #:optional (flags 0))
> -  (reverse! (fold-matches regexp string '() cons flags)))
> +(define* (list-matches regexp string #:optional (flags 0) limit)
> +  (reverse! (fold-matches regexp string '() cons flags limit)))
>
>  (define (regexp-substitute/global port regexp string . items)

[...]

> +    (pass-if "with limit"
> +      (equal? '("foo" "foo")
> +              (fold-matches "foo" "foofoofoofoo" '()
> +                            (lambda (match result)
> +                              (cons (match:substring match)
> +                                    result)) 0 2))))

Indent like Thien-Thi suggested.

Could you send an updated patch?

Thanks!

Ludo’.
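
For anyone who wants to try the "with limit" behaviour without patching
Guile, here is a minimal sketch, assuming Guile 2.0 with (ice-9 regex).
It copies the patched loop from the diff above into a procedure under a
hypothetical name, ‘fold-matches/limit’, so the stock ‘fold-matches’ is
left untouched; ‘matched-substrings’ is likewise a made-up helper and
not part of the patch.

```scheme
(use-modules (ice-9 regex))

;; Hypothetical standalone copy of the patched `fold-matches'; the name
;; `fold-matches/limit' is an assumption made for this sketch only.
(define* (fold-matches/limit regexp string init proc
                             #:optional (flags 0) limit)
  (let ((regexp (if (regexp? regexp) regexp (make-regexp regexp))))
    (let loop ((start 0)
               (count 0)                ; Matches passed to PROC so far.
               (value init)
               (abuts #f))              ; True if start abuts a previous match.
      (let* ((bol (if (zero? start) 0 regexp/notbol))
             ;; Stop once LIMIT matches were folded, or when START runs
             ;; past the end of STRING.
             (m (and (or (not limit) (< count limit))
                     (<= start (string-length string))
                     (regexp-exec regexp string start (logior flags bol)))))
        (cond
         ((not m) value)
         ((and (= (match:start m) (match:end m)) abuts)
          ;; Empty match right after a previous match: retry one
          ;; position further to the right without counting it.
          (loop (1+ start) count value #f))
         (else
          (loop (match:end m) (1+ count) (proc m value) #t)))))))

;; Hypothetical helper: the matched substrings, in order of occurrence.
(define (matched-substrings pattern string limit)
  (reverse (fold-matches/limit pattern string '()
                               (lambda (match result)
                                 (cons (match:substring match) result))
                               0 limit)))

;; Same input as the new test case, but reversed into match order.
(write (matched-substrings "foo" "foofoofoofoo" 2))   ; prints ("foo" "foo")
(newline)
```

Note that with the loop as posted, a limit of 0 produces no matches at
all, whereas the documentation wording proposed above treats 0 as "no
limit"; the sketch follows the patch, not the proposed wording.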