From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!not-for-mail From: Daniel Colascione Newsgroups: gmane.emacs.devel Subject: Re: Patch for lookaround assertion in regexp Date: Tue, 14 Feb 2012 09:53:07 -0800 Message-ID: <4F3A9F83.8060307@dancol.org> References: <009001ccd9c0$9bde09f0$d39a1dd0$@cfraizer.com> NNTP-Posting-Host: plane.gmane.org Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="------------enigD627FD100A7808E27788718F" X-Trace: dough.gmane.org 1329242058 9582 80.91.229.3 (14 Feb 2012 17:54:18 GMT) X-Complaints-To: usenet@dough.gmane.org NNTP-Posting-Date: Tue, 14 Feb 2012 17:54:18 +0000 (UTC) Cc: emacs-devel@gnu.org, Colin Fraizer , t.matsuyama.pub@gmail.com To: Stefan Monnier Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Tue Feb 14 18:54:16 2012 Return-path: Envelope-to: ged-emacs-devel@m.gmane.org Original-Received: from lists.gnu.org ([140.186.70.17]) by plane.gmane.org with esmtp (Exim 4.69) (envelope-from ) id 1RxMZx-0006tB-B6 for ged-emacs-devel@m.gmane.org; Tue, 14 Feb 2012 18:54:13 +0100 Original-Received: from localhost ([::1]:34961 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1RxMZw-0004WR-3v for ged-emacs-devel@m.gmane.org; Tue, 14 Feb 2012 12:54:12 -0500 Original-Received: from eggs.gnu.org ([140.186.70.92]:35576) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1RxMZo-0004W2-Q3 for emacs-devel@gnu.org; Tue, 14 Feb 2012 12:54:10 -0500 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1RxMZi-0005j7-KG for emacs-devel@gnu.org; Tue, 14 Feb 2012 12:54:04 -0500 Original-Received: from dancol.org ([96.126.100.184]:37032) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1RxMZi-0005j1-F7 for emacs-devel@gnu.org; Tue, 14 Feb 2012 12:53:58 -0500 Original-Received: from c-24-18-179-193.hsd1.wa.comcast.net ([24.18.179.193] helo=edith.local) by dancol.org with esmtpsa (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32) (Exim 4.72) (envelope-from ) id 1RxMZf-0002qy-MR; Tue, 14 Feb 2012 09:53:55 -0800 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:10.0.1) Gecko/20120208 Thunderbird/10.0.1 In-Reply-To: X-Enigmail-Version: 1.3.5 X-detected-operating-system: by eggs.gnu.org: Genre and OS details not recognized. X-Received-From: 96.126.100.184 X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: "Emacs development discussions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Xref: news.gmane.org gmane.emacs.devel:148612 Archived-At: This is an OpenPGP/MIME signed message (RFC 2440 and 3156) --------------enigD627FD100A7808E27788718F Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable On 1/23/12 6:11 AM, Stefan Monnier wrote: >> In 2009, Tomohiro Matsuyama sent a message to this list with a patch >> to add lookahead/lookbehind assertions to Emacs regular expressions >> (regexps). Is there any plan to incorporate this useful feature into >> an official release? >> [IMHO, it is incredibly useful. The only responses to Matsuyama-san's >> message were positive.] >=20 > I'd like to replace the current regexp engine with one that does not > suffer from exponential blowup (i.e. using "Thompson's NFA"). > Every feature added to the current regexp engine will make it more > difficult to switch, so I'm not too thrilled at the idea of adding > lookaround operators. > OTOH, noone has submitted code to replace the current regexp engine, an= d > I don't forsee I'll have the time to write it myself, so maybe I should= > just give up on this plan. Implementing a fully general NFA-based regular expression matching engine that support submatches is hard. The only two useful implementations of which I'm aware are RE2 and Ville Laurikari's TRE, both of which are two-clause BSD licensed. Laurikari wrote his thesis [2] on the latter. TRE is the better of the two libraries, IMHO, because it's single-pass and can work over arbitrary kinds of input stream (like characters in a gap buffer). TRE's approximate matching support is occasionally useful as well. That said, I'd actually prefer to head in the other direction and allow users to express arbitrarily rich grammars using an extended rx syntax. The idea is basically to support parser combinator grammars, which can be efficiently matched using a scannerless GRL parser, which is O(N^3) time, or a memozing and backtracking "packrat" parser, which is O(N) time and O(n) space. The end result would look a bit like Perl 6's rules. I have some ultra-preliminary work at https://github.com/dcolascione/jezebel. Basically, instead of matching regular expressions more efficiently, we should add support for efficient parsing using declaratively-constructed grammars, of which regular expressions are simply a special case. Such support would be incredibly useful for processing languages like Perl and C++. [1] http://laurikari.net/tre/about/ [2] http://laurikari.net/ville/regex-submatch.pdf --------------enigD627FD100A7808E27788718F Content-Type: application/pgp-signature; name="signature.asc" Content-Description: OpenPGP digital signature Content-Disposition: attachment; filename="signature.asc" -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (Darwin) Comment: GPGTools - http://gpgtools.org iEYEARECAAYFAk86n4MACgkQ17c2LVA10VtdwQCfTdruoOicYpHveQ1ncJIBlZnq t70AoLCSM+L+EcWfkbu0wZRGBq9UN3bh =sOob -----END PGP SIGNATURE----- --------------enigD627FD100A7808E27788718F--