From: Eli Barzilay
Newsgroups: gmane.lisp.guile.devel
Subject: Re: add regexp-split: a summary and new proposal
Date: Sat, 31 Dec 2011 02:30:21 -0500
Message-ID: <20222.47629.522520.63683@winooski.ccs.neu.edu>
To: Daniel Hartwig
Cc: guile-devel@gnu.org

An hour ago, Daniel Hartwig wrote:
>
> Anyway, what do people think of this proposal which tries to address
> that whole discussion:
>
> * [Vanilla `string-split' expanded to support the CHAR_PRED
>   semantics of `string-index' et al.]
>
> * New function `string-explode' similar to `string-split' but returns
>   the delimiters in its result.
>
> * Regex module replaces both of these with regexp-enhanced versions.

Aha -- I was looking for a new name, and `-explode' sounds good and
not misleading like `-split' (misleading in that I wouldn't have
expected a "split" function to return stuff from the gaps).

But there's one more point that bugs me about the Python thing: the
resulting list has both the matches and the non-matching gaps, and
knowing which is which is tricky.  For example, if you do this (I'll
use our syntax here, so note the minor differences):

  (define (foo rx)
    (regexp-split rx "some string"))

then you can't tell which is which in its output without knowing how
many grouping parens are in the input regexp.  It therefore makes
sense to me to have this instead:

  > (regexp-explode #rx"([^0-9])" "123+456*/")
  '("123" ("+") "456" ("*") "" ("/") "")

and now it's easy to know which is which.
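
For comparison with `re.split', here is a minimal Python sketch of the
grouped output described above.  The name `regexp_explode' and the use
of `None' where Racket would give `#f' are assumptions made for
illustration only; this is not the Guile or Racket implementation.

```python
import re


def regexp_explode(pattern, string):
    """Split `string` on `pattern`, keeping each match's groups in a sub-list.

    Hypothetical helper: gaps come back as plain strings, and every match
    comes back as a list of its capturing groups (None for a group that
    did not participate in the match).
    """
    result = []
    pos = 0
    for m in re.finditer(pattern, string):
        result.append(string[pos:m.start()])  # gap before this match
        result.append(list(m.groups()))       # capturing groups of this match
        pos = m.end()
    result.append(string[pos:])               # trailing gap after the last match
    return result


print(regexp_explode(r"([^0-9])", "123+456*/"))
# ['123', ['+'], '456', ['*'], '', ['/'], '']
```

Because the groups are nested, gaps and matches alternate no matter
how many capturing groups the pattern has.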
This is of course a simple example with a single group so it doesn't
look like much help, but with more than one group things can get
confusing otherwise: for example, in Python you can get `None's in the
result:

  >>> re.split('([^0-9](4)?)', '123+456*/')
  ['123', '+4', '4', '56', '*', None, '', '/', None, '']

but with the above, this becomes:

  > (regexp-explode #rx"([^0-9](4)?)" "123+456*/")
  '("123" ("+4" "4") "56" ("*" #f) "" ("/" #f) "")

so you can rely on the odd-numbered elements to be strings.  This is
probably going to be different for you, since you allow string
predicates instead of regexps.

Finally, the Racket implementation will probably be a little different
still -- our `regexp-match' returns a list with the matched substring
first, and then the matches for the capturing groups.  Following this,
a more uniform behavior for a `regexp-explode' would be to return
these lists, so we'd actually get:

  > (regexp-explode #rx"[^0-9]" "123+456*/")
  '("123" ("+") "456" ("*") "" ("/") "")
  > (regexp-explode #rx"([^0-9])" "123+456*/")
  '("123" ("+" "+") "456" ("*" "*") "" ("/" "/") "")

And again, this looks silly in this simple example, but would be more
useful in more complex ones.  We would also have a similar
`regexp-explode-positions' function that returns position pairs for
cases where you don't want to allocate all substrings.

One last not-too-related note: this is IMO all a by-product of a bad
choice in common regexp practice, where a capturing group always
refers to its last match only.  In a world that would have made a
better choice, I'd expect:

  > (regexp-match #rx"(foo+)+ bar" "blah foofoooo bar")
  '("foofoooo bar" ("foo" "foooo"))

and, of course:

  > (regexp-match #rx"(fo(o)+)+ bar" "blah foofoooo bar")
  '("foofoooo bar" (("foo" ("o")) ("foooo" ("o" "o" "o"))))

But my guess is that many people wouldn't like that much...
(Probably similar to disliking sexprs which are needed for the results
of these things.)
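
To make the "last match only" point concrete, here is a small Python
sketch that emulates the all-repetitions result for the first example
above.  The helper name `match_all_repeats' and its two-pattern
interface are hypothetical; stock `re' has no such feature, so the
repetitions are recovered with a second scan over the matched span.

```python
import re


def match_all_repeats(outer, inner, string):
    """Return [whole_match, [every repetition of `inner`]], or None.

    Hypothetical helper: `outer` must wrap the repeated part in exactly
    one capturing group, and `inner` is the pattern for one repetition
    (written without capturing groups, so findall returns strings).
    """
    m = re.search(outer, string)
    if m is None:
        return None
    # Re-scan only the text covered by the repeated group.
    return [m.group(0), re.findall(inner, m.group(1))]


# Stock behavior: the group keeps only its last repetition.
print(re.search(r"(foo+)+ bar", "blah foofoooo bar").group(1))
# foooo

# Emulated behavior: every repetition is kept.
print(match_all_repeats(r"((?:foo+)+) bar", r"foo+", "blah foofoooo bar"))
# ['foofoooo bar', ['foo', 'foooo']]
```

This only handles one level of repetition; the nested `(fo(o)+)+'
case would need the same trick applied recursively.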
With such a thing, many of these additional constructs wouldn't be
necessary -- for example, we have `regexp-match*' that returns all
matches, and that wouldn't have been necessary.  `regexp-split' would
probably not have been necessary either.

-- 
          ((lambda (x) (x x)) (lambda (x) (x x)))          Eli Barzilay:
                    http://barzilay.org/                   Maze is Life!