Re: add regexp-split: a summary and new proposal

From: Daniel Hartwig <mandyke@gmail.com>
To: Eli Barzilay <eli@barzilay.org>
Cc: guile-devel@gnu.org
Subject: Re: add regexp-split: a summary and new proposal
Date: Sat, 31 Dec 2011 17:30:58 +0800
Message-ID: <CAN3veRer09F=oag5iTsf7uFXpv9idn5G5wyGCmWEgKLN0HsZDw@mail.gmail.com>
In-Reply-To: <20222.47629.522520.63683@winooski.ccs.neu.edu>

On 31 December 2011 15:30, Eli Barzilay <eli@barzilay.org> wrote:
> But there's one more point that bugs me about the python thing: the
> resulting list has both the matches and the non-matching gaps, and
> knowing which is which is tricky.  For example, if you do this (I'll
> use our syntax here, so note the minor differences):
>
>  (define (foo rx)
>    (regexp-split rx "some string"))
>
> then you can't tell which is which in its output without knowing how
> many grouping parens are in the input regexp.  It therefore makes
> sense to me to have this instead:
>
>  > (regexp-explode #rx"([^0-9])" "123+456*/")
>  '("123" ("+") "456" ("*") "" ("/") "")
>
> and now it's easy to know which is which.  This is of course a simple
> example with a single group so it doesn't look like much help, but
> with more than one group things can get confusing otherwise: for
> example, in python you can get `None's in the result:
>
>  >>> re.split('([^0-9](4)?)', '123+456*/')
>  ['123', '+4', '4', '56', '*', None, '', '/', None, '']
>
> but with the above, this becomes:
>
>  > (regexp-explode #rx"([^0-9](4)?)" "123+456*/")
>  '("123" ("+4" "4") "56" ("*" #f) "" ("/" #f) "")
>
> so you can rely on the odd-numbered elements to be strings.  This is
> probably going to be different for you, since you allow string
> predicates instead of regexps.
>
> Finally, the Racket implementation will probably be a little different
> still -- our `regexp-match' returns a list with the matched substring
> first, and then the matches for the capturing groups.  Following this,

The format is the same in Guile, substring followed by capturing
groups:

scheme@(guile-user)> (string-match "([^0-9])" "123+456*/")
$7 = #("123+456*/" (3 . 4) (3 . 4))

Though that is more of an analogue to `regexp-match-positions'.
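
Since the thread already quotes Python, here is a rough Python sketch of
the same ordering: index 0 is the whole match and the following indices
are the capturing groups, available either as substrings or as position
pairs.  It only illustrates the layout and is not Guile's API.

```python
import re

m = re.search(r"([^0-9])", "123+456*/")

# Index 0 is the whole match; index 1 is the first capturing group.
substrings = [m.group(i) for i in range(m.re.groups + 1)]
positions = [m.span(i) for i in range(m.re.groups + 1)]

print(substrings)  # ['+', '+']
print(positions)   # [(3, 4), (3, 4)]
```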

> a more uniform behavior for a `regexp-explode' would be to return
> these lists, so we'd actually get:
>
>  > (regexp-explode #rx"[^0-9]" "123+456*/")
>  '("123" ("+") "456" ("*") "" ("/") "")
>  > (regexp-explode #rx"([^0-9])" "123+456*/")
>  '("123" ("+" "+") "456" ("*" "*") "" ("/" "/") "")

This is a very interesting way to return the results.

Now that `explode' has been separated from `split', I am actually
quite partial to always including the matched substring in the result.
This makes even more sense considering that the output would be the
same whether using a char-predicate or a regexp with no capturing groups:

scheme@(guile-user)> (string-explode "123+456*/" (negate char-numeric?))
$8 = ("123" "+" "456" "*" "" "/" "")
scheme@(guile-user)> (string-explode "123+456*/" (make-regexp "[^0-9]"))
$9 = ("123" "+" "456" "*" "" "/" "")

And the result is compatible with using `string-concatenate' as an
inverse operation:

scheme@(guile-user)> (string-concatenate $9)
$10 = "123+456*/"

Bonus!
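
A minimal sketch of that flat behaviour in Python, for comparison.  The
name `string_explode' and the argument order mirror the proposal and are
hypothetical here; skipping empty matches is my own assumption, not
something settled in this thread.

```python
import re

def string_explode(text, pattern):
    """Split text on pattern, keeping each whole match in a flat list."""
    pieces = []
    pos = 0
    for m in re.finditer(pattern, text):
        if m.end() == m.start():
            continue  # assumption: ignore empty matches
        pieces.append(text[pos:m.start()])  # gap before the match
        pieces.append(m.group(0))           # the matched substring itself
        pos = m.end()
    pieces.append(text[pos:])               # trailing gap, possibly empty
    return pieces

parts = string_explode("123+456*/", r"[^0-9]")
print(parts)           # ['123', '+', '456', '*', '', '/', '']
print("".join(parts))  # 123+456*/
```

Because every character of the input lands in exactly one piece, joining
the pieces always reproduces the original string.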

WRT returning all the capturing groups as a list:

 + as you mentioned earlier, the user can be somewhat ignorant of the
   number of capturing groups (why not just use `split'?);
 + easier to handle collectively;

 - result is no longer a flat list (I *do* like sexps, really);
 - moving away from *all* existing implementations;

 * trivial to transform between styles, assuming one knows how many
   capturing groups there are.
So now I am thinking about providing both `string-explode' (flat
output) and `regexp-explode' (nested output).

> And again, this looks silly in this simple example, but would be more
> useful in more complex ones.  We would also have a similar
> `regexp-explode-positions' function that returns position pairs for
> cases where you don't want to allocate all substrings.

... or need to know the positioning information.
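
A possible shape for the positions variant, again sketched in Python.
The name `regexp_explode_positions' follows your suggestion and is
hypothetical; gaps are (start, end) pairs, each match is a list of pairs
with the whole match first, and empty matches are skipped as an
assumption of mine.

```python
import re

def regexp_explode_positions(pattern, text):
    """Like a nested explode, but return index pairs instead of substrings."""
    rx = re.compile(pattern)
    result = []
    pos = 0
    for m in rx.finditer(text):
        if m.end() == m.start():
            continue  # assumption: ignore empty matches
        result.append((pos, m.start()))  # gap before the match
        # Whole match first, then one pair per capturing group.
        result.append([m.span(i) for i in range(rx.groups + 1)])
        pos = m.end()
    result.append((pos, len(text)))      # trailing gap
    return result

positions = regexp_explode_positions(r"([^0-9])", "123+456*/")
print(positions)
# [(0, 3), [(3, 4), (3, 4)], (4, 7), [(7, 8), (7, 8)], (8, 8), [(8, 9), (8, 9)], (9, 9)]
```

Note that Python reports a group that did not participate in the match
as (-1, -1), where the substring variants above would show #f or None.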

[BTW, substrings in Guile share copy-on-write memory with their parent
string, so I don't see string allocation as an issue on the Guile
front.  I am not sure about substrings in Racket.]
