From: Eli Barzilay
Newsgroups: gmane.lisp.guile.devel
Subject: Re: add regexp-split: a summary and new proposal
Date: Sat, 31 Dec 2011 16:13:41 -0500
Message-ID: <20223.31493.852823.636906@winooski.ccs.neu.edu>
References: <20222.47629.522520.63683@winooski.ccs.neu.edu>
To: Daniel Hartwig
Cc: guile-devel@gnu.org

11 hours ago, Daniel Hartwig wrote:
> On 31 December 2011 15:30, Eli Barzilay wrote:
> > But there's one more point that bugs me about the python thing: the
> > resulting list has both the matches and the non-matching gaps, and
> > knowing which is which is tricky.  For example, if you do this (I'll
> > use our syntax here, so note the minor differences):
> >
> >   (define (foo rx)
> >     (regexp-split rx "some string"))
> >
> > then you can't tell which is which in its output without knowing how
> > many grouping parens are in the input regexp.  It therefore makes
> > sense to me to have this instead:
> >
> >   > (regexp-explode #rx"([^0-9])" "123+456*/")
> >   '("123" ("+") "456" ("*") "" ("/") "")
> >
> > and now it's easy to know which is which.
> > This is of course a simple
> > example with a single group so it doesn't look like much help, but
> > with more than one group things can get confusing otherwise: for
> > example, in python you can get `None's in the result:
> >
> >   >>> re.split('([^0-9](4)?)', '123+456*/')
> >   ['123', '+4', '4', '56', '*', None, '', '/', None, '']
> >
> > but with the above, this becomes:
> >
> >   > (regexp-explode #rx"([^0-9](4)?)" "123+456*/")
> >   '("123" ("+4" "4") "56" ("*" #f) "" ("/" #f) "")
> >
> > so you can rely on the odd-numbered elements to be strings.  This is
> > probably going to be different for you, since you allow string
> > predicates instead of regexps.
> >
> > Finally, the Racket implementation will probably be a little different
> > still -- our `regexp-match' returns a list with the matched substring
> > first, and then the matches for the capturing groups.  Following this,
>
> The format is the same in Guile, substring followed by capturing
> groups:
>
> scheme@(guile-user)> (string-match "([^0-9])" "123+456*/")
> $7 = #("123+456*/" (3 . 4) (3 . 4))
>
> Though that is more of an analogue to `regexp-match-positions'.

(I guess, if I understand the output to have yet another first value
which is the string that the positions apply to.  We'd get only the
two pairs.)


> > a more uniform behavior for a `regexp-explode' would be to return
> > these lists, so we'd actually get:
> >
> >   > (regexp-explode #rx"[^0-9]" "123+456*/")
> >   '("123" ("+") "456" ("*") "" ("/") "")
> >   > (regexp-explode #rx"([^0-9])" "123+456*/")
> >   '("123" ("+" "+") "456" ("*" "*") "" ("/" "/") "")
>
> This is a very interesting way to return the results.
>
> Now that the `explode' has been separated from `split' I am actually
> quite partial to always including the matched substring in the result.
> This makes even more sense considering the output would be the same
> using a char-predicate or regexp with no capturing groups:
>
> scheme@(guile-user)> (string-explode "123+456*/" (negate char-numeric?))
> $8 = ("123" "+" "456" "*" "" "/" "")
> scheme@(guile-user)> (string-explode "123+456*/" (make-regexp "[^0-9]"))
> $9 = ("123" "+" "456" "*" "" "/" "")
>
> And the result is compatible with using `string-concatenate' as an
> inverse operation:
>
> scheme@(guile-user)> (string-concatenate $9)
> $10 = "123+456*/"
>
> Bonus!

You mean keep the python thing, or have only the full matches rather
than the groups?  (If you keep the groups, then you get that bonus
only when there are no groups, of course, otherwise you get a
semi-random character salad.)


> WRT to all the capturing groups as a list:
>
> + as you mention earlier the user can be somewhat ignorant of the
>   number of capturing groups (why not just use `split'?);

(Because of the usual reasons...  It's hiding as some random utility
that takes in a string from an api-level function, and now it needs to
parse it if you need to know the number of groups.)


> + easier to handle collectively;
>
> - result is no longer a flat list (I *do* like sexps, really);

Well, given a `flatten' function it's trivial to get the flat form
back...


> - moving away from *all* existing implementations;
>
> * trivial to transform between styles assuming one knows how many
>   capturing groups;

...but the flattened form loses information, which means that getting
from it to the nested one is impossible without information about the
(number of groups in the) regexp.


> So now I am thinking about both `string-explode' (flat output) and
> `regexp-explode' with the nested output.

(I'm not familiar enough with your conventional differences between
`string-x' and `regexp-x', but that seems potentially confusing...)


> > And again, this looks silly in this simple example, but would be
> > more useful in more complex ones.  We would also have a similar
> > `regexp-explode-positions' function that returns position pairs
> > for cases where you don't want to allocate all substrings.
>
> ... or need to know the positioning information.

Obviously.


> [BTW, substrings in Guile share copy-on-write memory with their super
> so I don't see string allocation as an issue on the Guile front.  Not
> sure about substrings in Racket.]

We have the ability to share substrings, but I don't think that we're
using it for these things.  It seems dangerous to me -- what if I do
something like:

  (define x (substring (make-string 1000000000 #\space) 0 1))

?  With a naive implementation you'd get the whole gb in memory just
for that tiny string...  In any case, we also allow regexp operations
on ports, and in that case allocation is an issue no matter what you
do.

-- 
          ((lambda (x) (x x)) (lambda (x) (x x)))          Eli Barzilay:
                    http://barzilay.org/                   Maze is Life!
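
For comparison with the `re.split' output quoted above, here is a
minimal Python sketch of the nested form, where each match is returned
as a list holding the full match followed by its groups.  The names
`regexp_explode' and `flatten_like_split' are hypothetical (they are
not part of Python's `re' module, Guile, or Racket), and the sketch
assumes the pattern never matches the empty string.

```python
import re


def regexp_explode(pattern, text):
    """Return gaps as strings and each match as [full_match, group1, ...].

    Hypothetical helper for illustration only.  Groups that did not
    participate in a match show up as None.  Patterns that can match
    the empty string are not given any special treatment here.
    """
    result = []
    pos = 0
    for m in re.finditer(pattern, text):
        result.append(text[pos:m.start()])        # gap before this match
        result.append([m.group(0), *m.groups()])  # full match, then groups
        pos = m.end()
    result.append(text[pos:])                     # trailing gap
    return result


def flatten_like_split(exploded):
    """Flatten the nested form into what re.split returns (groups only)."""
    flat = []
    for item in exploded:
        if isinstance(item, list):
            flat.extend(item[1:])  # drop the full match, keep the groups
        else:
            flat.append(item)
    return flat


nested = regexp_explode(r"([^0-9](4)?)", "123+456*/")
# nested == ['123', ['+4', '+4', '4'], '56', ['*', '*', None], '',
#            ['/', '/', None], '']
flat = flatten_like_split(nested)
# flat == ['123', '+4', '4', '56', '*', None, '', '/', None, '']
```

Going from the nested form to the flat one needs no knowledge of the
pattern, since every list element is a match and every string element
is a gap.  Going the other way requires the number of groups, which is
the information-loss point made above.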