From: Daniel Hartwig <mandyke@gmail.com>
To: Eli Barzilay <eli@barzilay.org>
Cc: guile-devel@gnu.org
Subject: Re: add regexp-split: a summary and new proposal
Date: Sat, 31 Dec 2011 17:30:58 +0800
Message-ID: <CAN3veRer09F=oag5iTsf7uFXpv9idn5G5wyGCmWEgKLN0HsZDw@mail.gmail.com>
In-Reply-To: <20222.47629.522520.63683@winooski.ccs.neu.edu>
On 31 December 2011 15:30, Eli Barzilay <eli@barzilay.org> wrote:
> But there's one more point that bugs me about the python thing: the
> resulting list has both the matches and the non-matching gaps, and
> knowing which is which is tricky. For example, if you do this (I'll
> use our syntax here, so note the minor differences):
>
> (define (foo rx)
>   (regexp-split rx "some string"))
>
> then you can't tell which is which in its output without knowing how
> many grouping parens are in the input regexp. It therefore makes
> sense to me to have this instead:
>
> > (regexp-explode #rx"([^0-9])" "123+456*/")
> '("123" ("+") "456" ("*") "" ("/") "")
>
> and now it's easy to know which is which. This is of course a simple
> example with a single group so it doesn't look like much help, but
> with more than one group things can get confusing otherwise: for
> example, in python you can get `None's in the result:
>
> >>> re.split('([^0-9](4)?)', '123+456*/')
> ['123', '+4', '4', '56', '*', None, '', '/', None, '']
>
> but with the above, this becomes:
>
> > (regexp-explode #rx"([^0-9](4)?)" "123+456*/")
> '("123" ("+4" "4") "56" ("*" #f) "" ("/" #f) "")
>
> so you can rely on the odd-numbered elements to be strings. This is
> probably going to be different for you, since you allow string
> predicates instead of regexps.
>
> Finally, the Racket implementation will probably be a little different
> still -- our `regexp-match' returns a list with the matched substring
> first, and then the matches for the capturing groups. Following this,

The format is the same in Guile, substring followed by capturing
groups:

  scheme@(guile-user)> (string-match "([^0-9])" "123+456*/")
  $7 = #("123+456*/" (3 . 4) (3 . 4))

Though that is more of an analogue to `regexp-match-positions'.

> a more uniform behavior for a `regexp-explode' would be to return
> these lists, so we'd actually get:
>
> > (regexp-explode #rx"[^0-9]" "123+456*/")
> '("123" ("+") "456" ("*") "" ("/") "")
> > (regexp-explode #rx"([^0-9])" "123+456*/")
> '("123" ("+" "+") "456" ("*" "*") "" ("/" "/") "")

This is a very interesting way to return the results.

Now that the `explode' has been separated from `split' I am actually
quite partial to always including the matched substring in the result.
This makes even more sense considering the output would be the same
using a char-predicate or regexp with no capturing groups:

  scheme@(guile-user)> (string-explode "123+456*/" (negate char-numeric?))
  $8 = ("123" "+" "456" "*" "" "/" "")
  scheme@(guile-user)> (string-explode "123+456*/" (make-regexp "[^0-9]"))
  $9 = ("123" "+" "456" "*" "" "/" "")

And the result is compatible with using `string-concatenate' as an
inverse operation:

  scheme@(guile-user)> (string-concatenate $9)
  $10 = "123+456*/"

Bonus!
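
To make the flat behaviour concrete, here is a rough sketch in Python
(the language of the `re.split' example quoted above), not Guile code.
The name `string_explode' and its argument order are assumptions that
only mirror the proposal; nothing like it exists in Python's `re'
module. It accepts either a compiled regexp or a one-character
predicate and keeps only the whole match, ignoring capturing groups:

```python
import re

def string_explode(s, sep):
    """Split s, keeping each separator between the gaps (flat list).

    sep is either a compiled regexp or a predicate on single characters.
    Hypothetical helper that mirrors the proposed `string-explode'.
    """
    if callable(sep):
        # Predicate case: every matching character is its own separator.
        spans = [(i, i + 1) for i, ch in enumerate(s) if sep(ch)]
    else:
        # Regexp case: keep the whole match only, capturing groups are ignored.
        spans = [m.span() for m in sep.finditer(s)]
    out, pos = [], 0
    for start, end in spans:
        out.append(s[pos:start])  # gap before the separator (may be empty)
        out.append(s[start:end])  # the separator itself
        pos = end
    out.append(s[pos:])           # trailing gap (may be empty)
    return out

by_regexp = string_explode("123+456*/", re.compile("[^0-9]"))
by_pred = string_explode("123+456*/", lambda ch: not ch.isdigit())
print(by_regexp)           # ['123', '+', '456', '*', '', '/', '']
print("".join(by_regexp))  # 123+456*/
```

Because every character of the input ends up in exactly one element,
joining the list always reproduces the original string.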

Regarding returning all the capturing groups as a list:

 + as you mention earlier, the user can be somewhat ignorant of the
   number of capturing groups (why not just use `split'?);
 + easier to handle collectively;
 - result is no longer a flat list (I *do* like sexps, really);
 - moving away from *all* existing implementations;
 * trivial to transform between styles, assuming one knows how many
   capturing groups there are.

So now I am thinking about both `string-explode' (flat output) and
`regexp-explode' with the nested output.

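
As an illustration of the nested style and of how cheap the conversion
is, here is a Python sketch. Both function names are hypothetical and
the nesting follows Eli's second variant (whole match first, then the
groups); it is not an existing API in Python, Guile or Racket:

```python
import re

def regexp_explode(rx, s):
    """Gaps stay strings; each match becomes [whole match, group 1, ...]."""
    out, pos = [], 0
    for m in re.finditer(rx, s):
        out.append(s[pos:m.start()])           # gap before the match
        out.append([m.group(0), *m.groups()])  # unmatched groups are None
        pos = m.end()
    out.append(s[pos:])                        # trailing gap
    return out

def flatten_python_style(exploded):
    """Convert the nested result to the flat list that re.split returns."""
    flat = []
    for item in exploded:
        if isinstance(item, list):
            flat.extend(item[1:])  # drop the whole match, keep the groups
        else:
            flat.append(item)
    return flat

nested = regexp_explode(r"([^0-9](4)?)", "123+456*/")
print(nested)
# ['123', ['+4', '+4', '4'], '56', ['*', '*', None], '', ['/', '/', None], '']
print(flatten_python_style(nested))
# ['123', '+4', '4', '56', '*', None, '', '/', None, '']
```

Going from nested to flat needs no knowledge of the regexp. The
reverse direction is where the number of capturing groups has to be
known in advance.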
> And again, this looks silly in this simple example, but would be more
> useful in more complex ones. We would also have a similar
> `regexp-explode-positions' function that returns position pairs for
> cases where you don't want to allocate all substrings.

... or need to know the positioning information.

[BTW, substrings in Guile share copy-on-write memory with their parent
string, so I don't see string allocation as an issue on the Guile front.
Not sure about substrings in Racket.]
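
For the positions variant, a comparable Python sketch might look like
the following. The name `regexp_explode_positions' and the choice to
report gaps as (start, end) tuples and matches as lists of such tuples
are assumptions made for illustration only:

```python
import re

def regexp_explode_positions(rx, s):
    """Like the nested explode, but with (start, end) pairs, not substrings."""
    out, pos = [], 0
    for m in re.finditer(rx, s):
        out.append((pos, m.start()))  # span of the gap before the match
        spans = [m.span(i) for i in range(m.re.groups + 1)]
        # Python reports an unmatched group as (-1, -1); use None instead.
        out.append([sp if sp != (-1, -1) else None for sp in spans])
        pos = m.end()
    out.append((pos, len(s)))         # span of the trailing gap
    return out

positions = regexp_explode_positions(r"([^0-9])", "123+456*/")
print(positions)
# [(0, 3), [(3, 4), (3, 4)], (4, 7), [(7, 8), (7, 8)], (8, 8), [(8, 9), (8, 9)], (9, 9)]
```

Empty gaps show up as pairs with equal start and end, so the caller
can still tell where they sit in the string.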