From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Stefan Monnier <monnier@IRO.UMontreal.CA>
Newsgroups: gmane.emacs.devel
Subject: Re: Patch for lookaround assertion in regexp
Date: Tue, 14 Feb 2012 13:36:32 -0500
Message-ID: <jwvaa4ls24n.fsf-monnier+emacs@gnu.org>
References: <009001ccd9c0$9bde09f0$d39a1dd0$@cfraizer.com>
	<jwv7h0iike3.fsf-monnier+emacs@gnu.org> <4F3A9F83.8060307@dancol.org>
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: text/plain
X-Trace: dough.gmane.org 1329245341 3361 80.91.229.3 (14 Feb 2012 18:49:01 GMT)
X-Complaints-To: usenet@dough.gmane.org
NNTP-Posting-Date: Tue, 14 Feb 2012 18:49:01 +0000 (UTC)
Cc: emacs-devel@gnu.org, Colin Fraizer <emacs-devel@cfraizer.com>,
	t.matsuyama.pub@gmail.com
To: Daniel Colascione <dancol@dancol.org>
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Tue Feb 14 19:49:00 2012
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Envelope-to: ged-emacs-devel@m.gmane.org
Original-Received: from lists.gnu.org ([140.186.70.17])
	by plane.gmane.org with esmtp (Exim 4.69)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1RxNQx-0007iv-I3
	for ged-emacs-devel@m.gmane.org; Tue, 14 Feb 2012 19:48:59 +0100
Original-Received: from localhost ([::1]:53523 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1RxNFB-0000k5-Pm
	for ged-emacs-devel@m.gmane.org; Tue, 14 Feb 2012 13:36:49 -0500
Original-Received: from eggs.gnu.org ([140.186.70.92]:43923)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <monnier@iro.umontreal.ca>) id 1RxNF9-0000jM-8g
	for emacs-devel@gnu.org; Tue, 14 Feb 2012 13:36:48 -0500
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <monnier@iro.umontreal.ca>) id 1RxNF5-000773-6q
	for emacs-devel@gnu.org; Tue, 14 Feb 2012 13:36:47 -0500
Original-Received: from pruche.dit.umontreal.ca ([132.204.246.22]:41364)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <monnier@iro.umontreal.ca>) id 1RxNF4-00076e-Hg
	for emacs-devel@gnu.org; Tue, 14 Feb 2012 13:36:43 -0500
Original-Received: from faina.iro.umontreal.ca (lechon.iro.umontreal.ca
	[132.204.27.242])
	by pruche.dit.umontreal.ca (8.14.1/8.14.1) with ESMTP id q1EIaWWP017827;
	Tue, 14 Feb 2012 13:36:32 -0500
Original-Received: by faina.iro.umontreal.ca (Postfix, from userid 20848)
	id 4895B1300C9; Tue, 14 Feb 2012 13:36:32 -0500 (EST)
In-Reply-To: <4F3A9F83.8060307@dancol.org> (Daniel Colascione's message of
	"Tue, 14 Feb 2012 09:53:07 -0800")
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/24.0.92 (gnu/linux)
X-NAI-Spam-Flag: NO
X-NAI-Spam-Threshold: 5
X-NAI-Spam-Score: 0
X-NAI-Spam-Rules: 1 Rules triggered
	RV4132=0
X-NAI-Spam-Version: 2.2.0.9309 : core <4132> : streams <728629> : uri <1065365>
X-detected-operating-system: by eggs.gnu.org: Genre and OS details not
	recognized.
X-Received-From: 132.204.246.22
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/emacs-devel>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.emacs.devel:148615
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/148615>

> Implementing a fully general NFA-based regular expression matching
> engine that support submatches is hard. The only two useful
> implementations of which I'm aware are RE2 and Ville Laurikari's TRE,
> both of which are two-clause BSD licensed. Laurikari wrote his thesis
> [2] on the latter. TRE is the better of the two libraries, IMHO,
> because it's single-pass and can work over arbitrary kinds of input
> stream (like characters in a gap buffer). TRE's approximate matching
> support is occasionally useful as well.

I'm familiar with the work, yes.  TRE seemed like the best option last
time I looked around.

> That said, I'd actually prefer to head in the other direction and
> allow users to express arbitrarily rich grammars using an extended rx
> syntax.

I think that would be orthogonal: we want regexp support because it's
efficient (yes, our current implementation is super slow in some cases,
but it's also efficient in many important cases).

I also would like a new regexp engine to fix the "backward matching"
problem so that looking-back can work the way most people would expect,
and doesn't need a `greedy' hack.  The fact that regexps are symmetric
is a very neat property (operator precedence grammars enjoy the same
property, which is one of the reasons why I chose them as the basis for
SMIE).

> The idea is basically to support parser combinator grammars,
> which can be efficiently matched using a scannerless GRL parser, which
> is O(N^3) time, or a memozing and backtracking "packrat" parser, which
> is O(N) time and O(n) space. The end result would look a bit like Perl
> 6's rules.

While these are algorithmically reasonably efficient, it can be
difficult to make them as efficient as a regexp-only matcher for many
typical regexps.  Also it can be difficult to make them work backwards.
IOW I don't think that can replace regexps given the amount of regexps
out there we have to support.


        Stefan