will we ever have zero width assertions in regexps?

unofficial mirror of help-gnu-emacs@gnu.org
 help / color / mirror / Atom feed

* will we ever have zero width assertions in regexps?
@ 2011-01-26 14:55 Le Wang
  0 siblings, 0 replies; 15+ messages in thread
From: Le Wang @ 2011-01-26 14:55 UTC (permalink / raw)
  To: help-gnu-emacs

[-- Attachment #1: Type: text/plain, Size: 135 bytes --]

I mean positive/negative look-ahead/look-behind similar to what is available
in Perl.  http://perldoc.perl.org/perlreref.html


-- 
Le

[-- Attachment #2: Type: text/html, Size: 225 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: will we ever have zero width assertions in regexps?
       [not found] <mailman.1.1296054361.23496.help-gnu-emacs@gnu.org>
@ 2011-01-26 15:58 ` Stefan Monnier
  2011-01-27  1:45   ` Le Wang
       [not found]   ` <mailman.6.1296092730.6982.help-gnu-emacs@gnu.org>
  0 siblings, 2 replies; 15+ messages in thread
From: Stefan Monnier @ 2011-01-26 15:58 UTC (permalink / raw)
  To: help-gnu-emacs

> I mean positive/negative look-ahead/look-behind similar to what is available
> in Perl.  http://perldoc.perl.org/perlreref.html

The future is full of mysteries.


        Stefan


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: will we ever have zero width assertions in regexps?
  2011-01-26 15:58 ` will we ever have zero width assertions in regexps? Stefan Monnier
@ 2011-01-27  1:45   ` Le Wang
       [not found]   ` <mailman.6.1296092730.6982.help-gnu-emacs@gnu.org>
  1 sibling, 0 replies; 15+ messages in thread
From: Le Wang @ 2011-01-27  1:45 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: help-gnu-emacs

[-- Attachment #1: Type: text/plain, Size: 499 bytes --]

Indeed.  I'm using Emacs 23.2.1, so I don't know whether this is already in
the latest CVS version. But assuming it's not, is this feature on the radar
of the developers?  Google turned up no discussion of it.

On Wed, Jan 26, 2011 at 11:58 PM, Stefan Monnier
<monnier@iro.umontreal.ca>wrote:

> > I mean positive/negative look-ahead/look-behind similar to what is
> available
> > in Perl.  http://perldoc.perl.org/perlreref.html
>
> The future is full of mysteries.
>
>
>        Stefan
>



-- 
Le

[-- Attachment #2: Type: text/html, Size: 922 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: will we ever have zero width assertions in regexps?
       [not found]   ` <mailman.6.1296092730.6982.help-gnu-emacs@gnu.org>
@ 2011-01-27  2:21     ` Stefan Monnier
  2011-01-27  6:34       ` Ilya Zakharevich
  0 siblings, 1 reply; 15+ messages in thread
From: Stefan Monnier @ 2011-01-27  2:21 UTC (permalink / raw)
  To: help-gnu-emacs

> Indeed.  I'm using Emacs 23.2.1, so I don't know whether this is already in
> the latest CVS version.  But assuming it's not, is this feature on the radar
> of the developers?

No.  There hasn't been much demand/interest for it, tho IIRC someone
posted a patch to implement it at some point in the past.

I think there'd be more interest in extending regexps to PEG in a way
similar to Lua's Lpeg.  I myself would also be more interested in
replacing the backtracking matcher with a non-backtracking one (for the
cases where backtracking is not required by backrefs).

        Stefan

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: will we ever have zero width assertions in regexps?
  2011-01-27  2:21     ` Stefan Monnier
@ 2011-01-27  6:34       ` Ilya Zakharevich
  2011-01-27 16:10         ` Stefan Monnier
  0 siblings, 1 reply; 15+ messages in thread
From: Ilya Zakharevich @ 2011-01-27  6:34 UTC (permalink / raw)
  To: help-gnu-emacs

On 2011-01-27, Stefan Monnier <monnier@iro.umontreal.ca> wrote:
> similar to Lua's Lpeg.  I myself would also be more interested in
> replacing the backtracking matcher with a non-backtracking one (for the
> cases where backtracking is not required by backrefs).

What for?  [After my final round of backtracking-optimizations for
Perl REx engine, I do not recollect people (loudly) wanting anything
like this for Perl anymore...]

Ilya


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: will we ever have zero width assertions in regexps?
  2011-01-27  6:34       ` Ilya Zakharevich
@ 2011-01-27 16:10         ` Stefan Monnier
  2011-01-28 23:49           ` Ilya Zakharevich
  0 siblings, 1 reply; 15+ messages in thread
From: Stefan Monnier @ 2011-01-27 16:10 UTC (permalink / raw)
  To: help-gnu-emacs

>> similar to Lua's Lpeg.  I myself would also be more interested in
>> replacing the backtracking matcher with a non-backtracking one (for the
>> cases where backtracking is not required by backrefs).
> What for?  [After my final round of backtracking-optimizations for
> Perl REx engine, I do not recollect people (loudly) wanting anything
> like this for Perl anymore...]

To get rid of the occasional pathological case where matching takes
forever and Emacs appears to be frozen.  Programmers who are used to
backtracking matchers will usually intuitively stay away from regexps
that can show such behaviors, but not all programmers do, and even if
you're careful there are cases that are hard to avoid.

Another minor reason is that it can be handy to have an incremental
matching primitive, so you can match over a long string one chunk at
a time.  I'm not sure how often this would be useful, but I've come
across a few cases where it seemed like it could be put to good use
(tho, for lack of experience with it, I can't sweat that it would turn
out to be a good idea).

        Stefan

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: will we ever have zero width assertions in regexps?
  2011-01-27 16:10         ` Stefan Monnier
@ 2011-01-28 23:49           ` Ilya Zakharevich
  2011-01-29  2:51             ` Stefan Monnier
  0 siblings, 1 reply; 15+ messages in thread
From: Ilya Zakharevich @ 2011-01-28 23:49 UTC (permalink / raw)
  To: help-gnu-emacs

On 2011-01-27, Stefan Monnier <monnier@iro.umontreal.ca> wrote:
>>> similar to Lua's Lpeg.  I myself would also be more interested in
>>> replacing the backtracking matcher with a non-backtracking one (for the
>>> cases where backtracking is not required by backrefs).
>> What for?  [After my final round of backtracking-optimizations for
>> Perl REx engine, I do not recollect people (loudly) wanting anything
>> like this for Perl anymore...]
>
> To get rid of the occasional pathological case where matching takes
> forever and Emacs appears to be frozen.  Programmers who are used to
> backtracking matchers will usually intuitively stay away from regexps
> that can show such behaviors, but not all programmers do, and even if
> you're careful there are cases that are hard to avoid.

Did you try it with Perl recently (last 10 years or so)?  As I said, I
put some optimizations which in most (AFAIK) practical senses remove
such pathologies.  (The underlying problems remain; the optimizations
are only "heuristic"; but one needs to be extra inventive to
circumvent the optimizations.)

> Another minor reason is that it can be handy to have an incremental
> matching primitive, so you can match over a long string one chunk at
> a time.  I'm not sure how often this would be useful, but I've come
> across a few cases where it seemed like it could be put to good use
> (tho, for lack of experience with it, I can't sweat that it would turn
> out to be a good idea).

Do not know what you mean by this...

Ilya


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: will we ever have zero width assertions in regexps?
  2011-01-28 23:49           ` Ilya Zakharevich
@ 2011-01-29  2:51             ` Stefan Monnier
  2011-01-29 22:28               ` Ilya Zakharevich
  0 siblings, 1 reply; 15+ messages in thread
From: Stefan Monnier @ 2011-01-29  2:51 UTC (permalink / raw)
  To: help-gnu-emacs

>> To get rid of the occasional pathological case where matching takes
>> forever and Emacs appears to be frozen.  Programmers who are used to
>> backtracking matchers will usually intuitively stay away from regexps
>> that can show such behaviors, but not all programmers do, and even if
>> you're careful there are cases that are hard to avoid.
> Did you try it with Perl recently (last 10 years or so)?

No, but neither have I bumped into pathological cases in Perl before
that (when I did use it).

> As I said, I put some optimizations which in most (AFAIK) practical
> senses remove such pathologies.  (The underlying problems remain; the
> optimizations are only "heuristic"; but one needs to be extra
> inventive to circumvent the optimizations.)

A typical case could look something like "foo *(.*?) *bar". when
matching "foo ..<many space>.. baZ".
Emacs's regexp engine is not very clever and doesn't do much in terms of
avoiding backtracking (it mostly takes care of <foo>*<bar> when <foo>
can only match a single char and <bar> can only start with a char that's
not matched by <foo>), but I can't think of too many ways to handle the
above one efficiently within a "backtracking regexp matcher" framework.

>> Another minor reason is that it can be handy to have an incremental
>> matching primitive, so you can match over a long string one chunk at
>> a time.  I'm not sure how often this would be useful, but I've come
>> across a few cases where it seemed like it could be put to good use
>> (tho, for lack of experience with it, I can't sweat that it would turn
>> out to be a good idea).
> Do not know what you mean by this...

Basically, provide a primitive like (match-string RE STRING LIMIT) that
can not only say "matched between START and END", but also "reached
LIMIT within yet finding a match, here's the suspended SEARCH-STATE at
LIMIT", so you can later resume the search starting at LIMIT by passing
that state.

        Stefan

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: will we ever have zero width assertions in regexps?
  2011-01-29  2:51             ` Stefan Monnier
@ 2011-01-29 22:28               ` Ilya Zakharevich
  2011-01-31 16:08                 ` Stefan Monnier
  0 siblings, 1 reply; 15+ messages in thread
From: Ilya Zakharevich @ 2011-01-29 22:28 UTC (permalink / raw)
  To: help-gnu-emacs

On 2011-01-29, Stefan Monnier <monnier@iro.umontreal.ca> wrote:
>> As I said, I put some optimizations which in most (AFAIK) practical
>> senses remove such pathologies.  (The underlying problems remain; the
>> optimizations are only "heuristic"; but one needs to be extra
>> inventive to circumvent the optimizations.)
>
> A typical case could look something like "foo *(.*?) *bar". when
> matching "foo ..<many space>.. baZ".

No, this is a polynomial-time problem.  My optimization does nothing
for such cases.  And I do not think such a REx would provide any
problem in real life - unless you have many hundreds of consecutive
spaces.  (And unless Emacs' REx engine is particularly slow per OPCODE.)

But I start to see the difference - it is in usage scenarios.  Many
Perl REx matches are done "per-line", not "per-file".  And in Emacs,
one would rarely narrow the buffer before a match.  This creates a
major skew in REx matches - in Emacs the string to match against would
be quite often a couple of orders of magnitude larger.

Essentially, in Perl, a sloppy REx which leads to a non-linear
polynomial time matching would be mostly unnoticed speed-wise.  Only
those which lead to exponential-time match bite hard.

So my patches for Perl improve only the exponential cases of
"sloppyness".  In Emacs, polynomial-time may give quite noticable
problems too.  Hmm, AFAICS, one can slighly modify my patches to
handle polynomial-time matches too.  I would think about it more (but
do not promise any action! ;-).

> Basically, provide a primitive like (match-string RE STRING LIMIT) that
> can not only say "matched between START and END", but also "reached
> LIMIT within yet finding a match, here's the suspended SEARCH-STATE at
> LIMIT", so you can later resume the search starting at LIMIT by passing
> that state.

match-with-continuation.  An interesting idea.  I already implemented
it for Perl (to support (??{}), but it is not exposed to the user.
Would one want this in non-interactive situations?

Thanks,
Ilya

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: will we ever have zero width assertions in regexps?
  2011-01-29 22:28               ` Ilya Zakharevich
@ 2011-01-31 16:08                 ` Stefan Monnier
  2011-01-31 17:10                   ` Ilya Zakharevich
  0 siblings, 1 reply; 15+ messages in thread
From: Stefan Monnier @ 2011-01-31 16:08 UTC (permalink / raw)
  To: help-gnu-emacs

>> A typical case could look something like "foo *(.*?) *bar". when
>> matching "foo ..<many space>.. baZ".

> No, this is a polynomial-time problem.  My optimization does nothing
> for such cases.  And I do not think such a REx would provide any
> problem in real life - unless you have many hundreds of consecutive
> spaces.

Such problems tend to show up (in hard to fix ways that is) with regexps
that are built in pieces (e.g. by combining existing regexps like
comment-start-skip and paragraph-start or things like that).

And yes, these tend to work just fine in practice, which is why they end
up in real code, and then a couple years later someone complains that
Emacs freezes when he opens his funny file with some odd long line.

> (And unless Emacs' REx engine is particularly slow per OPCODE.)

Emacs's REx engine isn't particularly fast, I think, but I don't think
it's the problem.

> But I start to see the difference - it is in usage scenarios.

Probably.

> Many Perl REx matches are done "per-line", not "per-file".

That's one difference.  Another is that many regexps are used all the
time without the user explicitly asking for it, and on text which we
assume takes a particular shape, even though it may take a completely
different form (e.g. regexps used for the *compile* buffer).

> match-with-continuation.  An interesting idea.  I already implemented
> it for Perl (to support (??{}), but it is not exposed to the user.
> Would one want this in non-interactive situations?

I can't think of interactive uses, but I'd like to try and use it for
to let font-lock find elements that span several lines, even when it
works one-line-at-a-time.

        Stefan

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: will we ever have zero width assertions in regexps?
  2011-01-31 16:08                 ` Stefan Monnier
@ 2011-01-31 17:10                   ` Ilya Zakharevich
  2011-01-31 21:29                     ` Stefan Monnier
  0 siblings, 1 reply; 15+ messages in thread
From: Ilya Zakharevich @ 2011-01-31 17:10 UTC (permalink / raw)
  To: help-gnu-emacs

On 2011-01-31, Stefan Monnier <monnier@iro.umontreal.ca> wrote:
>> match-with-continuation.  An interesting idea.  I already implemented
>> it for Perl (to support (??{}), but it is not exposed to the user.
>> Would one want this in non-interactive situations?
>
> I can't think of interactive uses, but I'd like to try and use it for
> to let font-lock find elements that span several lines, even when it
> works one-line-at-a-time.

So you have a REx which is matched against a line, but you want (in
addition to the usual effects of matching) to know whether it "wanted"
the match to overflow into the following line?

If so, it looks like "reusing the continuation state" would not be a
serious optimization - it would add just a small multiplicative
constant to the "use only the hypothetical bit" scenario...

Do I miss anything?
Ilya

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: will we ever have zero width assertions in regexps?
  2011-01-31 17:10                   ` Ilya Zakharevich
@ 2011-01-31 21:29                     ` Stefan Monnier
  2011-02-02 15:09                       ` Ilya Zakharevich
  0 siblings, 1 reply; 15+ messages in thread
From: Stefan Monnier @ 2011-01-31 21:29 UTC (permalink / raw)
  To: help-gnu-emacs

> So you have a REx which is matched against a line, but you want (in
> addition to the usual effects of matching) to know whether it "wanted"
> the match to overflow into the following line?

> If so, it looks like "reusing the continuation state" would not be a
> serious optimization - it would add just a small multiplicative
> constant to the "use only the hypothetical bit" scenario...

We could probably make it work with just that extra bit, indeed.
But with the full intermediate state, we get to just "start the search
with last line's state" instead of having to "start the search from the
previous N lines since they all ended with the <wantmore> bit set", so
it will happily work with many-lines cases without having to reparse
those many lines N times.

Admittedly, in most cases, font-lock element should not be long (in most
cases, such long elements should be handled differently, e.g. via things
like font-lock-syntactic-keywords, marking the beginning and the end of
the long element), but again, it's good to handle pathological cases,
because sooner or later users will bump into them.

        Stefan

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: will we ever have zero width assertions in regexps?
  2011-01-31 21:29                     ` Stefan Monnier
@ 2011-02-02 15:09                       ` Ilya Zakharevich
  2011-02-07 20:30                         ` Stefan Monnier
  0 siblings, 1 reply; 15+ messages in thread
From: Ilya Zakharevich @ 2011-02-02 15:09 UTC (permalink / raw)
  To: help-gnu-emacs

On 2011-01-31, Stefan Monnier <monnier@iro.umontreal.ca> wrote:
>> So you have a REx which is matched against a line, but you want (in
>> addition to the usual effects of matching) to know whether it "wanted"
>> the match to overflow into the following line?
>
>> If so, it looks like "reusing the continuation state" would not be a
>> serious optimization - it would add just a small multiplicative
>> constant to the "use only the hypothetical bit" scenario...
>
> We could probably make it work with just that extra bit, indeed.
> But with the full intermediate state, we get to just "start the search
> with last line's state" instead of having to "start the search from the
> previous N lines since they all ended with the <wantmore> bit set", so
> it will happily work with many-lines cases without having to reparse
> those many lines N times.

Hmm, I thought about a different scenario: if the bit is set, then one
switches to a DIFFERENT REx designed for a multi-line case.  Otherwise
why not just run it against the rest of the buffer, instead of
one-line?

Ilya


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: will we ever have zero width assertions in regexps?
  2011-02-02 15:09                       ` Ilya Zakharevich
@ 2011-02-07 20:30                         ` Stefan Monnier
  2011-02-08 22:41                           ` Ilya Zakharevich
  0 siblings, 1 reply; 15+ messages in thread
From: Stefan Monnier @ 2011-02-07 20:30 UTC (permalink / raw)
  To: help-gnu-emacs

>>> So you have a REx which is matched against a line, but you want (in
>>> addition to the usual effects of matching) to know whether it "wanted"
>>> the match to overflow into the following line?
>> 
>>> If so, it looks like "reusing the continuation state" would not be a
>>> serious optimization - it would add just a small multiplicative
>>> constant to the "use only the hypothetical bit" scenario...
>> 
>> We could probably make it work with just that extra bit, indeed.
>> But with the full intermediate state, we get to just "start the search
>> with last line's state" instead of having to "start the search from the
>> previous N lines since they all ended with the <wantmore> bit set", so
>> it will happily work with many-lines cases without having to reparse
>> those many lines N times.

> Hmm, I thought about a different scenario: if the bit is set, then one
> switches to a DIFFERENT REx designed for a multi-line case.  Otherwise
> why not just run it against the rest of the buffer, instead of
> one-line?

Because we don't want to match those regexps against the whole buffer
every time the buffer is modified (the buffer may be large).

Also it can be tricky to match only some of the font-lock regexps
against the whole buffer, since font-lock-keywords is normally defined
with the assumption that the regexps are applied in turn, that earlier
ones prevent subsequent ones from being applied and that we start with
a fresh un-highlighted buffer.  So in general, if we want to apply one
of the font-lock-keywords to the whole buffer, we have to do it for all
of them.

BTW, another reason to want a non-backtracking matcher can be seen in
the recent thread "Stack overflow in regexp matcher".

        Stefan

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: will we ever have zero width assertions in regexps?
  2011-02-07 20:30                         ` Stefan Monnier
@ 2011-02-08 22:41                           ` Ilya Zakharevich
  0 siblings, 0 replies; 15+ messages in thread
From: Ilya Zakharevich @ 2011-02-08 22:41 UTC (permalink / raw)
  To: help-gnu-emacs

On 2011-02-07, Stefan Monnier <monnier@iro.umontreal.ca> wrote:
>> Hmm, I thought about a different scenario: if the bit is set, then one
>> switches to a DIFFERENT REx designed for a multi-line case.  Otherwise
>> why not just run it against the rest of the buffer, instead of
>> one-line?
>
> Because we don't want to match those regexps against the whole buffer
> every time the buffer is modified (the buffer may be large).

I think there is some confusion here: I did not want to mean that you
run "another REx" via re-search(), but via looking-at() at the
corresponding position...  Essentially: do you want the REx to start
match on the current line, or end match on the current line?

Or maybe you mean a different thing, and
WANT e.g. [^a-z] to actually "mean" [^a-z\n] ?  (So only an explicit
non-negated \n in a REx may match \n in a buffer?)

> BTW, another reason to want a non-backtracking matcher can be seen in
> the recent thread "Stack overflow in regexp matcher".

I think this is a red herring.  Try to stack-overflow the Perl
matcher...  (Possible, but one must be very malicious to hit these
situations.)

Ilya

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2011-02-08 22:41 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <mailman.1.1296054361.23496.help-gnu-emacs@gnu.org>
2011-01-26 15:58 ` will we ever have zero width assertions in regexps? Stefan Monnier
2011-01-27  1:45   ` Le Wang
     [not found]   ` <mailman.6.1296092730.6982.help-gnu-emacs@gnu.org>
2011-01-27  2:21     ` Stefan Monnier
2011-01-27  6:34       ` Ilya Zakharevich
2011-01-27 16:10         ` Stefan Monnier
2011-01-28 23:49           ` Ilya Zakharevich
2011-01-29  2:51             ` Stefan Monnier
2011-01-29 22:28               ` Ilya Zakharevich
2011-01-31 16:08                 ` Stefan Monnier
2011-01-31 17:10                   ` Ilya Zakharevich
2011-01-31 21:29                     ` Stefan Monnier
2011-02-02 15:09                       ` Ilya Zakharevich
2011-02-07 20:30                         ` Stefan Monnier
2011-02-08 22:41                           ` Ilya Zakharevich
2011-01-26 14:55 Le Wang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).