add regexp-split: a summary and new proposal

* add regexp-split: a summary and new proposal
@ 2011-12-31  5:54 Daniel Hartwig
  2011-12-31  7:30 ` Eli Barzilay
  2012-01-07 23:05 ` Andy Wingo
  0 siblings, 2 replies; 5+ messages in thread
From: Daniel Hartwig @ 2011-12-31  5:54 UTC (permalink / raw)
  To: guile-devel; +Cc: Eli Barzilay

An attempt to summarize the pertinent points of the thread [1].

[1] http://lists.gnu.org/archive/html/guile-devel/2011-12/msg00241.html

* Semantics, generally

  `regexp-split' is similar to `string-split'.  However, its semantics
  vary between implementations on the following two points.  It is
  important to remain appropriately compatible with these other
  implementations whilst still offering the user a good set of
  functionality.

* Captured groups

  The Python [2] implementation contains unique semantics whereby the
  text of any captured groups in the pattern are included in the
  result:

  >>> re.split('\W+', 'Words, words, words.')
  ['Words', 'words', 'words', '']
  >>> re.split('(\W+)', 'Words, words, words.')
  ['Words', ', ', 'words', ', ', 'words', '.', '']

  This is considered useful functionality to have [3], though not
  necessarily by default.  Consider a simple parser [4], which needs
  access to the tokens for processing.

  Other implementations such as Racket [3], Chicken [5], and Perl do not
  return the captured groups in their result.

  If there were two separate functions (or one function with an
  optional argument controlling the output) then the user could have a
  single regexp perform both the task of just splitting and the task
  of extracting the tokens. [6]

  [2] http://docs.python.org/library/re.html#re.split
  [3] http://lists.gnu.org/archive/html/guile-devel/2011-12/msg00257.html
  [4] http://80.68.89.23/2003/Oct/26/reSplit/
  [5] http://lists.gnu.org/archive/html/guile-devel/2011-12/msg00249.html
  [6] http://lists.gnu.org/archive/html/guile-devel/2011-12/msg00266.html
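
  As a rough illustration of the "one function with an optional
  argument" idea, here is a minimal Python sketch.  The name
  `regexp_split' and the `keep_groups' parameter are hypothetical and
  not part of any proposal in this thread; patterns that can match the
  empty string are not given any special treatment.

```python
import re

def regexp_split(pattern, text, keep_groups=False):
    """Split text on pattern; optionally interleave the captured groups."""
    fields, pos = [], 0
    for m in re.finditer(pattern, text):
        fields.append(text[pos:m.start()])   # the gap before this match
        if keep_groups:
            fields.extend(m.groups())        # Python-style: groups in the result
        pos = m.end()
    fields.append(text[pos:])                # the trailing gap
    return fields

rx = r"(\W+)"
print(regexp_split(rx, "Words, words, words."))
# ['Words', 'words', 'words', '']
print(regexp_split(rx, "Words, words, words.", keep_groups=True))
# ['Words', ', ', 'words', ', ', 'words', '.', '']
```

  The same regexp serves both tasks; only the flag changes.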

* Empty strings

  Some implementations (e.g. Chicken and Perl) drop (some) empty
  strings from their result.  In the case of Perl this is likely due
  to making things "nice" for the user in the majority case, but it is
  hard to revert this. [7]

  As per the example of `string-split', having empty strings in the
  result is useful to keep track of which "field" is which.

  In Scheme, if the empty strings are not desired, it is trivial to
  remove them:
   (filter (negate string-null?) lst)

  [7] http://lists.gnu.org/archive/html/guile-devel/2011-12/msg00269.html
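
  The same point in Python terms, as a small sketch with a made-up
  record: keeping the empty string preserves field positions, and
  dropping it afterwards is a one-liner.

```python
line = "alice,,42"

fields = line.split(",")
print(fields)      # ['alice', '', '42']
print(fields[2])   # 42  (still the third field)

# Dropping empty fields is easy, but the positions shift.
compact = [f for f in fields if f != ""]
print(compact)     # ['alice', '42']
```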

* Naming

  > Also, to me the name seems unintuitive -- it is STR being split, not
  > RE -- perhaps this can be folded in to the existing string-split
  > function. [8]

  [8] http://lists.gnu.org/archive/html/guile-devel/2011-12/msg00245.html

Hopefully I have not missed out anything important :-)

Anyway, what do people think of this proposal which tries to address
that whole discussion:

* [Vanilla `string-split' expanded to support the CHAR_PRED
  semantics of `string-index' et al.]

* New function `string-explode' similar to `string-split' but returns
  the deliminators in its result.

* Regex module replaces both of these with regexp-enhanced versions.

Thus:

scheme@(guile-user)> ;; with a char predicate
scheme@(guile-user)> (string-split "123+456*/" (negate char-numeric?))
$8 = ("123" "456" "" "")
scheme@(guile-user)> (string-explode "123+456*/" (negate char-numeric?))
$9 = ("123" "+" "456" "*" "" "/" "")
scheme@(guile-user)> ;; with a regular expression
scheme@(guile-user)> (use-modules (ice-9 regex))
scheme@(guile-user)> (define rx (make-regexp "([^0-9])"))
scheme@(guile-user)> (string-split "123+456*/" rx)
$10 = ("123" "456" "" "")
scheme@(guile-user)> ;; didn't want empty strings
scheme@(guile-user)> (filter (negate string-null?) $10)
$11 = ("123" "456")
scheme@(guile-user)> (string-explode "123+456*/" rx)
$12 = ("123" "+" "456" "*" "" "/" "")

and so on.
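
The char-predicate half of the transcript above could be modelled
roughly as follows.  This is a Python sketch of the proposed
behaviour only, not Guile code; the function names simply mirror the
proposed `string-split' and `string-explode'.

```python
def string_explode(text, is_delim):
    """Return the fields of text with each delimiter character in between."""
    parts, current = [], []
    for ch in text:
        if is_delim(ch):
            parts.append("".join(current))  # field before the delimiter
            parts.append(ch)                # the delimiter itself
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))          # trailing field, possibly empty
    return parts

def string_split(text, is_delim):
    """Fields only: every second element of the exploded list."""
    return string_explode(text, is_delim)[::2]

not_digit = lambda ch: not ch.isdigit()
print(string_split("123+456*/", not_digit))
# ['123', '456', '', '']
print(string_explode("123+456*/", not_digit))
# ['123', '+', '456', '*', '', '/', '']
```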

I'm happy to throw together a patch for the above, but would like
some feedback first :-)

Regards

* Re: add regexp-split: a summary and new proposal
  2011-12-31  5:54 add regexp-split: a summary and new proposal Daniel Hartwig
@ 2011-12-31  7:30 ` Eli Barzilay
  2011-12-31  9:30   ` Daniel Hartwig
  2012-01-07 23:05 ` Andy Wingo
  1 sibling, 1 reply; 5+ messages in thread
From: Eli Barzilay @ 2011-12-31  7:30 UTC (permalink / raw)
  To: Daniel Hartwig; +Cc: guile-devel

An hour ago, Daniel Hartwig wrote:
> 
> Anyway, what do people think of this proposal which tries to address
> that whole discussion:
> 
> * [Vanilla `string-split' expanded to support the CHAR_PRED
>   semantics of `string-index' et al.]
> 
> * New function `string-explode' similar to `string-split' but returns
>   the deliminators in its result.
> 
> * Regex module replaces both of these with regexp-enhanced versions.

Aha -- I was looking for a new name, and `-explode' sounds good and
not misleading like `-split' (misleading in that I wouldn't have
expected a "split" function to return stuff from the gaps).

But there's one more point that bugs me about the python thing: the
resulting list has both the matches and the non-matching gaps, and
knowing which is which is tricky.  For example, if you do this (I'll
use our syntax here, so note the minor differences):

  (define (foo rx)
    (regexp-split rx "some string"))

then you can't tell which is which in its output without knowing how
many grouping parens are in the input regexp.  It therefore makes
sense to me to have this instead:

  > (regexp-explode #rx"([^0-9])" "123+456*/")
  '("123" ("+") "456" ("*") "" ("/") "")

and now it's easy to know which is which.  This is of course a simple
example with a single group so it doesn't look like much help, but
with more than one group things can get confusing otherwise: for
example, in python you can get `None's in the result:

  >>> re.split('([^0-9](4)?)', '123+456*/')
  ['123', '+4', '4', '56', '*', None, '', '/', None, '']

but with the above, this becomes:

  > (regexp-explode #rx"([^0-9](4)?)" "123+456*/")
  '("123" ("+4" "4") "56" ("*" #f) "" ("/" #f) "")

so you can rely on the odd-numbered elements to be strings.  This is
probably going to be different for you, since you allow string
predicates instead of regexps.
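
A minimal Python sketch of this nested shape, assuming the groups of
each match are collected into their own list.  The name
`regexp_explode' just mirrors the hypothetical function above.  Note
that the optional group consumes the "4", so the gap that follows the
first match is "56".

```python
import re

def regexp_explode(pattern, text):
    """Alternate gaps (strings) with per-match lists of captured groups."""
    out, pos = [], 0
    for m in re.finditer(pattern, text):
        out.append(text[pos:m.start()])  # gap: always a string
        out.append(list(m.groups()))     # groups: always a list, None if unmatched
        pos = m.end()
    out.append(text[pos:])
    return out

print(regexp_explode(r"([^0-9])", "123+456*/"))
# ['123', ['+'], '456', ['*'], '', ['/'], '']
print(regexp_explode(r"([^0-9](4)?)", "123+456*/"))
# ['123', ['+4', '4'], '56', ['*', None], '', ['/', None], '']
```

Every second element, starting with the first, is a gap string,
whatever the number of groups in the regexp.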

Finally, the Racket implementation will probably be a little different
still -- our `regexp-match' returns a list with the matched substring
first, and then the matches for the capturing groups.  Following this,
a more uniform behavior for a `regexp-explode' would be to return
these lists, so we'd actually get:

  > (regexp-explode #rx"[^0-9]" "123+456*/")
  '("123" ("+") "456" ("*") "" ("/") "")
  > (regexp-explode #rx"([^0-9])" "123+456*/")
  '("123" ("+" "+") "456" ("*" "*") "" ("/" "/") "")

And again, this looks silly in this simple example, but would be more
useful in more complex ones.  We would also have a similar
`regexp-explode-positions' function that returns position pairs for
cases where you don't want to allocate all substrings.
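
To make the two variants concrete, here is a rough Python sketch.
Both function names are hypothetical stand-ins for the
`regexp-explode' and `regexp-explode-positions' described above, and
the exact output shapes are an assumption.

```python
import re

def regexp_explode_full(pattern, text):
    """Gaps alternate with [full match, group 1, group 2, ...] lists."""
    out, pos = [], 0
    for m in re.finditer(pattern, text):
        out.append(text[pos:m.start()])
        out.append([m.group(0), *m.groups()])
        pos = m.end()
    out.append(text[pos:])
    return out

def regexp_explode_positions(pattern, text):
    """Same shape, but (start, end) pairs instead of substrings."""
    out, pos = [], 0
    for m in re.finditer(pattern, text):
        out.append((pos, m.start()))
        # m.re.groups is the number of capturing groups in the pattern
        out.append([m.span(i) for i in range(m.re.groups + 1)])
        pos = m.end()
    out.append((pos, len(text)))
    return out

print(regexp_explode_full(r"[^0-9]", "123+456*/"))
# ['123', ['+'], '456', ['*'], '', ['/'], '']
print(regexp_explode_full(r"([^0-9])", "123+456*/"))
# ['123', ['+', '+'], '456', ['*', '*'], '', ['/', '/'], '']
print(regexp_explode_positions(r"[^0-9]", "123+456*/"))
# [(0, 3), [(3, 4)], (4, 7), [(7, 8)], (8, 8), [(8, 9)], (9, 9)]
```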

One last not-too-related note: this is IMO all a by-product of a bad
choice of common regexp practices where capturing groups always refer
to the last match only.  In a world that would have made a better
choice, I'd expect:

  > (regexp-match #rx"(foo+)+ bar" "blah foofoooo bar")
  '("foofoooo bar" ("foo" "foooo"))

and, of course:

  > (regexp-match #rx"(fo(o)+)+ bar" "blah foofoooo bar")
  '("foofoooo bar" (("foo" ("o")) ("foooo" ("o" "o" "o"))))

But my guess is that many people wouldn't like that much...  (Probably
similar to disliking sexprs which are needed for the results of these
things.)  With such a thing, many of these additional constructs
wouldn't be necessary -- for example, we have `regexp-match*' that
returns all matches, and that wouldn't have been necessary.
`regexp-split' would probably not have been necessary too.
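
For comparison, Python's `re' follows the conventional rule: a
repeated group reports only its last repetition.  The sketch below
shows that, plus one possible workaround that captures the whole
repeated run and rescans it.  The workaround is only an illustration
for this particular pattern, not a general solution.

```python
import re

text = "blah foofoooo bar"

# Conventional behaviour: only the last repetition of the group survives.
last_only = re.search(r"(foo+)+ bar", text)
print(last_only.group(1))   # foooo

# Workaround: capture the whole run, then rescan it for each repetition.
run = re.search(r"((?:foo+)+) bar", text)
repetitions = re.findall(r"foo+", run.group(1))
print(repetitions)          # ['foo', 'foooo']
```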

-- 
          ((lambda (x) (x x)) (lambda (x) (x x)))          Eli Barzilay:
                    http://barzilay.org/                   Maze is Life!

* Re: add regexp-split: a summary and new proposal
  2011-12-31  7:30 ` Eli Barzilay
@ 2011-12-31  9:30   ` Daniel Hartwig
  2011-12-31 21:13     ` Eli Barzilay
  0 siblings, 1 reply; 5+ messages in thread
From: Daniel Hartwig @ 2011-12-31  9:30 UTC (permalink / raw)
  To: Eli Barzilay; +Cc: guile-devel

On 31 December 2011 15:30, Eli Barzilay <eli@barzilay.org> wrote:
> But there's one more point that bugs me about the python thing: the
> resulting list has both the matches and the non-matching gaps, and
> knowing which is which is tricky.  For example, if you do this (I'll
> use our syntax here, so note the minor differences):
>
>  (define (foo rx)
>    (regexp-split rx "some string"))
>
> then you can't tell which is which in its output without knowing how
> many grouping parens are in the input regexp.  It therefore makes
> sense to me to have this instead:
>
>  > (regexp-explode #rx"([^0-9])" "123+456*/")
>  '("123" ("+") "456" ("*") "" ("/") "")
>
> and now it's easy to know which is which.  This is of course a simple
> example with a single group so it doesn't look like much help, but
> with more than one group things can get confusing otherwise: for
> example, in python you can get `None's in the result:
>
>  >>> re.split('([^0-9](4)?)', '123+456*/')
>  ['123', '+4', '4', '56', '*', None, '', '/', None, '']
>
> but with the above, this becomes:
>
>  > (regexp-explode #rx"([^0-9](4)?)" "123+456*/")
>  '("123" ("+4" "4") "56" ("*" #f) "" ("/" #f) "")
>
> so you can rely on the odd-numbered elements to be strings.  This is
> probably going to be different for you, since you allow string
> predicates instead of regexps.
>
> Finally, the Racket implementation will probably be a little different
> still -- our `regexp-match' returns a list with the matched substring
> first, and then the matches for the capturing groups.  Following this,

The format is the same in Guile, substring followed by capturing
groups:

scheme@(guile-user)> (string-match "([^0-9])" "123+456*/")
$7 = #("123+456*/" (3 . 4) (3 . 4))

Though that is more of an analogue to `regexp-match-positions'.

> a more uniform behavior for a `regexp-explode' would be to return
> these lists, so we'd actually get:
>
>  > (regexp-explode #rx"[^0-9]" "123+456*/")
>  '("123" ("+") "456" ("*") "" ("/") "")
>  > (regexp-explode #rx"([^0-9])" "123+456*/")
>  '("123" ("+" "+") "456" ("*" "*") "" ("/" "/") "")

This is a very interesting way to return the results.

Now that the `explode' has been separated from `split' I am actually
quite partial to always including the matched substring in the result.
This makes even more sense considering the output would be the same
using a char-predicate or regexp with no capturing groups:

scheme@(guile-user)> (string-explode "123+456*/" (negate char-numeric?))
$8 = ("123" "+" "456" "*" "" "/" "")
scheme@(guile-user)> (string-explode "123+456*/" (make-regexp "[^0-9]"))
$9 = ("123" "+" "456" "*" "" "/" "")

And the result is compatible with using `string-concatenate' as an
inverse operation:

scheme@(guile-user)> (string-concatenate $9)
$10 = "123+456*/"

Bonus!
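
The same round trip can be checked in Python, where wrapping the whole
pattern in a single group makes `re.split' return the full matches in
between the gaps:

```python
import re

text = "123+456*/"
parts = re.split(r"([^0-9])", text)
print(parts)                   # ['123', '+', '456', '*', '', '/', '']
print("".join(parts) == text)  # True
```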

WRT all the capturing groups as a list:

 + as you mention earlier the user can be somewhat ignorant of the
   number of capturing groups (why not just use `split'?);
 + easier to handle collectively;

 - result is no longer a flat list (I *do* like sexps, really);
 - moving away from *all* existing implementations;

 * trivial to transform between styles assuming one knows how many
   capturing groups;

So now I am thinking about both `string-explode' (flat output) and
`regexp-explode' with the nested output.

> And again, this looks silly in this simple example, but would be more
> useful in more complex ones.  We would also have a similar
> `regexp-explode-positions' function that returns position pairs for
> cases where you don't want to allocate all substrings.

... or need to know the positioning information.

[BTW, substrings in Guile share copy-on-write memory with their super
so I don't see string allocation as an issue on the Guile front.  Not
sure about substrings in Racket.]



* Re: add regexp-split: a summary and new proposal
  2011-12-31  9:30   ` Daniel Hartwig
@ 2011-12-31 21:13     ` Eli Barzilay
  0 siblings, 0 replies; 5+ messages in thread
From: Eli Barzilay @ 2011-12-31 21:13 UTC (permalink / raw)
  To: Daniel Hartwig; +Cc: guile-devel

11 hours ago, Daniel Hartwig wrote:
> On 31 December 2011 15:30, Eli Barzilay <eli@barzilay.org> wrote:
> > But there's one more point that bugs me about the python thing: the
> > resulting list has both the matches and the non-matching gaps, and
> > knowing which is which is tricky.  For example, if you do this (I'll
> > use our syntax here, so note the minor differences):
> >
> >  (define (foo rx)
> >    (regexp-split rx "some string"))
> >
> > then you can't tell which is which in its output without knowing how
> > many grouping parens are in the input regexp.  It therefore makes
> > sense to me to have this instead:
> >
> >  > (regexp-explode #rx"([^0-9])" "123+456*/")
> >  '("123" ("+") "456" ("*") "" ("/") "")
> >
> > and now it's easy to know which is which.  This is of course a simple
> > example with a single group so it doesn't look like much help, but
> > with more than one group things can get confusing otherwise: for
> > example, in python you can get `None's in the result:
> >
> >  >>> re.split('([^0-9](4)?)', '123+456*/')
> >  ['123', '+4', '4', '56', '*', None, '', '/', None, '']
> >
> > but with the above, this becomes:
> >
> >  > (regexp-explode #rx"([^0-9](4)?)" "123+456*/")
> >  '("123" ("+4" "4") "56" ("*" #f) "" ("/" #f) "")
> >
> > so you can rely on the odd-numbered elements to be strings.  This is
> > probably going to be different for you, since you allow string
> > predicates instead of regexps.
> >
> > Finally, the Racket implementation will probably be a little different
> > still -- our `regexp-match' returns a list with the matched substring
> > first, and then the matches for the capturing groups.  Following this,
> 
> The format is the same in Guile, substring followed by capturing
> groups:
> 
> scheme@(guile-user)> (string-match "([^0-9])" "123+456*/")
> $7 = #("123+456*/" (3 . 4) (3 . 4))
> 
> Though that is more of an analogue to `regexp-match-positions'.

(I guess so, if I understand correctly that the output has an
additional first value, which is the string that the positions apply
to.  We'd get only the two pairs.)


> > a more uniform behavior for a `regexp-explode' would be to return
> > these lists, so we'd actually get:
> >
> >  > (regexp-explode #rx"[^0-9]" "123+456*/")
> >  '("123" ("+") "456" ("*") "" ("/") "")
> >  > (regexp-explode #rx"([^0-9])" "123+456*/")
> >  '("123" ("+" "+") "456" ("*" "*") "" ("/" "/") "")
> 
> This is a very interesting way to return the results.
> 
> Now that the `explode' has been separated from `split' I am actually
> quite partial to always including the matched substring in the result.
> This makes even more sense considering the output would be the same
> using a char-predicate or regexp with no capturing groups:
> 
> scheme@(guile-user)> (string-explode "123+456*/" (negate char-numeric?))
> $8 = ("123" "+" "456" "*" "" "/" "")
> scheme@(guile-user)> (string-explode "123+456*/" (make-regexp "[^0-9]"))
> $9 = ("123" "+" "456" "*" "" "/" "")
> 
> And the result is compatible with using `string-concatenate' as an
> inverse operation:
> 
> scheme@(guile-user)> (string-concatenate $9)
> $10 = "123+456*/"
> 
> Bonus!

You mean keep the python thing, or have only the full matches rather
than the groups?

(If you keep the groups, then you get that bonus only when there are
no groups, of course, otherwise you get a semi-random character
salad.)
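
A quick Python check of the "salad" case, using the two-group regexp
from earlier in the thread.  The `None' entries have to be skipped
before joining, and the nested group text is then duplicated:

```python
import re

text = "123+456*/"
parts = re.split(r"([^0-9](4)?)", text)
print(parts)
# ['123', '+4', '4', '56', '*', None, '', '/', None, '']

joined = "".join(p for p in parts if p is not None)
print(joined)          # 123+4456*/
print(joined == text)  # False
```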


> WRT all the capturing groups as a list:
> 
>  + as you mention earlier the user can be somewhat ignorant of the
>    number of capturing groups (why not just use `split'?);

(For the usual reasons...  The call may be hidden in some random
utility that receives the regexp as a string from an API-level
function, and that utility would then have to parse the regexp just to
learn the number of groups.)


>  + easier to handle collectively;
> 
>  - result is no longer a flat list (I *do* like sexps, really);

Well, given a `flatten' function it's trivial to get the flat form
back...

>  - moving away from *all* existing implementations;
> 
>  * trivial to transform between styles assuming one knows how many
>    capturing groups;

...but the flattened form loses information, which means that getting
from it to the nested one is impossible without information about the
(number of groups in the) regexp.
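
A small Python sketch of that asymmetry.  `flatten' and `regroup' are
hypothetical helpers written for this illustration; the nested list is
the shape discussed above, with `None' standing in for `#f'.

```python
def flatten(nested):
    """Nested -> flat: needs no knowledge of the regexp."""
    flat = []
    for item in nested:
        if isinstance(item, list):
            flat.extend(item)
        else:
            flat.append(item)
    return flat

def regroup(flat, n_groups):
    """Flat -> nested: only works if the number of groups is known."""
    step = n_groups + 1
    nested = []
    for i in range(0, len(flat), step):
        nested.append(flat[i])           # the gap
        groups = flat[i + 1:i + step]    # the groups that followed it
        if groups:
            nested.append(groups)
    return nested

nested = ["123", ["+4", "4"], "56", ["*", None], "", ["/", None], ""]
flat = flatten(nested)
print(flat)
# ['123', '+4', '4', '56', '*', None, '', '/', None, '']
print(regroup(flat, 2) == nested)  # True
print(regroup(flat, 1) == nested)  # False: wrong group count, wrong structure
```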


> So now I am thinking about both `string-explode' (flat output) and
> `regexp-explode' with the nested output.

(I'm not familiar enough with your conventional differences between
`string-x' and `regexp-x', but that seems potentially confusing...)


> > And again, this looks silly in this simple example, but would be
> > more useful in more complex ones.  We would also have a similar
> > `regexp-explode-positions' function that returns position pairs
> > for cases where you don't want to allocate all substrings.
> 
> ... or need to know the positioning information.

Obviously.


> [BTW, substrings in Guile share copy-on-write memory with their super
> so I don't see string allocation as an issue on the Guile front.  Not
> sure about substrings in Racket.]

We have the ability to share substrings, but I don't think that we're
using it for these things.  It seems dangerous to me -- what if I do
something like:

  (define x (substring (make-string 1000000000 #\space) 0 1))

?  With a naive implementation you'd keep the whole GB in memory just
for that tiny string...
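
The concern can be reproduced with Python's `memoryview', used here
purely as an analogy.  It says nothing about how Guile or Racket
actually implement shared substrings, and a much smaller buffer is
used to keep the example cheap.

```python
big = bytes(1_000_000)        # stand-in for the huge string
tiny = memoryview(big)[0:1]   # a one-byte view sharing big's storage
del big                       # the name is gone, the buffer is not

print(len(tiny))              # 1
print(len(tiny.obj))          # 1000000: the view keeps the whole buffer alive

independent = bytes(tiny)     # copying breaks the link to the big buffer
print(len(independent))       # 1
```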

In any case, we also allow regexp operations on ports, and in that
case allocation is an issue no matter what you do.

-- 
          ((lambda (x) (x x)) (lambda (x) (x x)))          Eli Barzilay:
                    http://barzilay.org/                   Maze is Life!



* Re: add regexp-split: a summary and new proposal
  2011-12-31  5:54 add regexp-split: a summary and new proposal Daniel Hartwig
  2011-12-31  7:30 ` Eli Barzilay
@ 2012-01-07 23:05 ` Andy Wingo
  1 sibling, 0 replies; 5+ messages in thread
From: Andy Wingo @ 2012-01-07 23:05 UTC (permalink / raw)
  To: Daniel Hartwig; +Cc: guile-devel

On Sat 31 Dec 2011 06:54, Daniel Hartwig <mandyke@gmail.com> writes:

> An attempt to summarize the pertinent points of the thread [1].
>
> [1]
> http://lists.gnu.org/archive/html/guile-devel/2011-12/msg00241.html

Thanks for doing this!  I replied earlier without having read this
thread.  Sorry for being a serial processor :)

> * [Vanilla `string-split' expanded to support the CHAR_PRED
>   semantics of `string-index' et al.]

Makes sense to me.

> * New function `string-explode' similar to `string-split' but returns
>   the deliminators in its result.

"delimiters" :)  Also, "Explode" has a meaning in the PHP world like our
"split", but oh well.  I would be OK with this.

> * Regex module replaces both of these with regexp-enhanced versions.

Sure.  I like Eli's suggestion of having the delimiters be the full set
of matching groups, as a list, but perhaps this could be controlled by a
keyword argument.

MHO at least :)

Andy
-- 
http://wingolog.org/
