Re: add regexp-split: a summary and new proposal

unofficial mirror of guile-devel@gnu.org 
 help / color / mirror / Atom feed

From: Eli Barzilay <eli@barzilay.org>
To: Daniel Hartwig <mandyke@gmail.com>
Cc: guile-devel@gnu.org
Subject: Re: add regexp-split: a summary and new proposal
Date: Sat, 31 Dec 2011 02:30:21 -0500	[thread overview]
Message-ID: <20222.47629.522520.63683@winooski.ccs.neu.edu> (raw)
In-Reply-To: <CAN3veRcgnoRN3XdfSrF_=T1Aq6DZ_kQMpV6T8AG46X=EPNanGw@mail.gmail.com>

An hour ago, Daniel Hartwig wrote:
> 
> Anyway, what do people think of this proposal which tries to address
> that whole discussion:
> 
> * [Vanilla `string-split' expanded to support the CHAR_PRED
>   semantics of `string-index' et al.]
> 
> * New function `string-explode' similar to `string-split' but returns
>   the deliminators in it's result.
> 
> * Regex module replaces both of these with regexp-enhanced versions.

Aha -- I was looking for a new name, and `-explode' sounds good and
not misleading like `-split' (misleading in that I wouldn't have
expected a "split" function to return stuff from the gaps).

But there's one more point that bugs me about the python thing: the
resulting list has both the matches and the non-matching gaps, and
knowing which is which is tricky.  For example, if you do this (I'll
use our syntax here, so note the minor differences):

  (define (foo rx)
    (regexp-split rx "some string"))

then you can't tell which is which in its output without knowing how
many grouping parens are in the input regexp.  It therefore makes
sense to me to have this instead:

  > (regexp-explode #rx"([^0-9])" "123+456*/")
  '("123" ("+") "456" ("*") "" ("/") "")

and now it's easy to know which is which.  This is of course a simple
example with a single group so it doesn't look like much help, but
when with more than one group things can get confusing otherwise: for
example, in python you can get `None's in the result:

  >>> re.split('([^0-9](4)?)', '123+456*/')
  ['123', '+4', '4', '56', '*', None, '', '/', None, '']

but with the above, this becomes:

  > (regexp-explode #rx"([^0-9](4)?)" "123+456*/")
  '("123" ("+4" "4") "456" ("*" #f) "" ("/" #f) "")

so you can rely on the odd-numbered elements to be strings.  This is
probably going to be different for you, since you allow string
predicates instead of regexps.

Finally, the Racket implementation will probably be a little different
still -- our `regexp-match' returns a list with the matched substring
first, and then the matches for the capturing groups.  Following this,
a more uniform behavior for a `regexp-explode' would be to return
these lists, so we'd actually get:

  > (regexp-explode #rx"[^0-9]" "123+456*/")
  '("123" ("+") "456" ("*") "" ("/") "")
  > (regexp-explode #rx"([^0-9])" "123+456*/")
  '("123" ("+" "+") "456" ("*" "*") "" ("/" "/") "")

And again, this looks silly in this simple example, but would be more
useful in more complex ones.  We would also have a similar
`regexp-explode-positions' function that returns position pairs for
cases where you don't want to allocate all substrings.

One last not-too-related note: this is IMO all a by-product of a bad
choice of common regexp practices where capturing groups always refer
to the last match only.  In a world that would have made a better
choice, I'd expect:

  > (regexp-match #rx"(foo+)+ bar" "blah foofoooo bar")
  '("foofoooo bar" ("foo" "foooo"))

and, of course:

  > (regexp-match #rx"(fo(o)+)+ bar" "blah foofoooo bar")
  '("foofoooo bar" (("foo" ("o")) ("foooo" ("o" "o" "o"))))

But my guess is that many people wouldn't like that much...  (Probably
similar to disliking sexprs which are needed for the results of these
things.)  With such a thing, many of these additional constructs
wouldn't be necessary -- for exampe, we have `regexp-match*' that
returns all matches, and that wouldn't have been necessary.
`regexp-split' would probably not have been necessary too.

-- 
          ((lambda (x) (x x)) (lambda (x) (x x)))          Eli Barzilay:
                    http://barzilay.org/                   Maze is Life!

next prev parent reply	other threads:[~2011-12-31  7:30 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-12-31  5:54 add regexp-split: a summary and new proposal Daniel Hartwig
2011-12-31  7:30 ` Eli Barzilay [this message]
2011-12-31  9:30   ` Daniel Hartwig
2011-12-31 21:13     ` Eli Barzilay
2012-01-07 23:05 ` Andy Wingo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.gnu.org/software/guile/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20222.47629.522520.63683@winooski.ccs.neu.edu \
    --to=eli@barzilay.org \
    --cc=guile-devel@gnu.org \
    --cc=mandyke@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).