unofficial mirror of emacs-devel@gnu.org 
 help / color / mirror / code / Atom feed
From: Daniel Colascione <dancol@dancol.org>
To: Stefan Monnier <monnier@iro.umontreal.ca>
Cc: emacs-devel@gnu.org, Colin Fraizer <emacs-devel@cfraizer.com>,
	t.matsuyama.pub@gmail.com
Subject: Re: Patch for lookaround assertion in regexp
Date: Tue, 14 Feb 2012 09:53:07 -0800	[thread overview]
Message-ID: <4F3A9F83.8060307@dancol.org> (raw)
In-Reply-To: <jwv7h0iike3.fsf-monnier+emacs@gnu.org>

[-- Attachment #1: Type: text/plain, Size: 2311 bytes --]

On 1/23/12 6:11 AM, Stefan Monnier wrote:
>> In 2009, Tomohiro Matsuyama sent a message to this list with a patch
>> to add lookahead/lookbehind assertions to Emacs regular expressions
>> (regexps).  Is there any plan to incorporate this useful feature into
>> an official release?
>> [IMHO, it is incredibly useful. The only responses to Matsuyama-san's
>> message were positive.]
> 
> I'd like to replace the current regexp engine with one that does not
> suffer from exponential blowup (i.e. using "Thompson's NFA").
> Every feature added to the current regexp engine will make it more
> difficult to switch, so I'm not too thrilled at the idea of adding
> lookaround operators.
> OTOH, noone has submitted code to replace the current regexp engine, and
> I don't forsee I'll have the time to write it myself, so maybe I should
> just give up on this plan.

Implementing a fully general NFA-based regular expression matching
engine that support submatches is hard. The only two useful
implementations of which I'm aware are RE2 and Ville Laurikari's TRE,
both of which are two-clause BSD licensed. Laurikari wrote his thesis
[2] on the latter. TRE is the better of the two libraries, IMHO,
because it's single-pass and can work over arbitrary kinds of input
stream (like characters in a gap buffer). TRE's approximate matching
support is occasionally useful as well.

That said, I'd actually prefer to head in the other direction and
allow users to express arbitrarily rich grammars using an extended rx
syntax. The idea is basically to support parser combinator grammars,
which can be efficiently matched using a scannerless GRL parser, which
is O(N^3) time, or a memozing and backtracking "packrat" parser, which
is O(N) time and O(n) space. The end result would look a bit like Perl
6's rules.

I have some ultra-preliminary work at
https://github.com/dcolascione/jezebel.

Basically, instead of matching regular expressions more efficiently,
we should add support for efficient parsing using
declaratively-constructed grammars, of which regular expressions are
simply a special case. Such support would be incredibly useful for
processing languages like Perl and C++.

[1] http://laurikari.net/tre/about/
[2] http://laurikari.net/ville/regex-submatch.pdf



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 235 bytes --]

  parent reply	other threads:[~2012-02-14 17:53 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-01-23 11:17 Patch for lookaround assertion in regexp Colin Fraizer
2012-01-23 14:11 ` Stefan Monnier
2012-01-23 14:44   ` Tom
2012-01-23 14:50     ` Andreas Schwab
2012-01-23 15:19       ` Tom
2012-01-23 16:14         ` Andreas Schwab
2012-01-23 17:11           ` Stefan Monnier
2012-01-23 18:45             ` Štěpán Němec
2012-01-30  0:31               ` Juri Linkov
2012-01-23 15:31     ` Stefan Monnier
2012-01-24  8:41   ` Nikolai Weibull
2012-01-24 14:40     ` Stefan Monnier
2012-01-24 15:09       ` Nikolai Weibull
2012-01-24 17:34         ` Stefan Monnier
2012-01-24 23:27     ` David De La Harpe Golden
2012-01-25  6:07       ` Nikolai Weibull
2012-02-14 17:53   ` Daniel Colascione [this message]
2012-02-14 18:36     ` Stefan Monnier
2012-02-20 16:19     ` Dimitri Fontaine
2012-09-26  6:55 ` Tomohiro Matsuyama
  -- strict thread matches above, loose matches on Subject: below --
2009-06-03 23:04 Tomohiro MATSUYAMA
2009-06-04  4:47 ` Miles Bader
2009-06-04  8:27   ` Deniz Dogan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.gnu.org/software/emacs/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4F3A9F83.8060307@dancol.org \
    --to=dancol@dancol.org \
    --cc=emacs-devel@cfraizer.com \
    --cc=emacs-devel@gnu.org \
    --cc=monnier@iro.umontreal.ca \
    --cc=t.matsuyama.pub@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).