From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!not-for-mail From: Stefan Monnier Newsgroups: gmane.emacs.devel Subject: Re: Patch for lookaround assertion in regexp Date: Tue, 14 Feb 2012 13:36:32 -0500 Message-ID: References: <009001ccd9c0$9bde09f0$d39a1dd0$@cfraizer.com> <4F3A9F83.8060307@dancol.org> NNTP-Posting-Host: plane.gmane.org Mime-Version: 1.0 Content-Type: text/plain X-Trace: dough.gmane.org 1329245341 3361 80.91.229.3 (14 Feb 2012 18:49:01 GMT) X-Complaints-To: usenet@dough.gmane.org NNTP-Posting-Date: Tue, 14 Feb 2012 18:49:01 +0000 (UTC) Cc: emacs-devel@gnu.org, Colin Fraizer , t.matsuyama.pub@gmail.com To: Daniel Colascione Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Tue Feb 14 19:49:00 2012 Return-path: Envelope-to: ged-emacs-devel@m.gmane.org Original-Received: from lists.gnu.org ([140.186.70.17]) by plane.gmane.org with esmtp (Exim 4.69) (envelope-from ) id 1RxNQx-0007iv-I3 for ged-emacs-devel@m.gmane.org; Tue, 14 Feb 2012 19:48:59 +0100 Original-Received: from localhost ([::1]:53523 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1RxNFB-0000k5-Pm for ged-emacs-devel@m.gmane.org; Tue, 14 Feb 2012 13:36:49 -0500 Original-Received: from eggs.gnu.org ([140.186.70.92]:43923) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1RxNF9-0000jM-8g for emacs-devel@gnu.org; Tue, 14 Feb 2012 13:36:48 -0500 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1RxNF5-000773-6q for emacs-devel@gnu.org; Tue, 14 Feb 2012 13:36:47 -0500 Original-Received: from pruche.dit.umontreal.ca ([132.204.246.22]:41364) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1RxNF4-00076e-Hg for emacs-devel@gnu.org; Tue, 14 Feb 2012 13:36:43 -0500 Original-Received: from faina.iro.umontreal.ca (lechon.iro.umontreal.ca [132.204.27.242]) by pruche.dit.umontreal.ca (8.14.1/8.14.1) with ESMTP id q1EIaWWP017827; Tue, 14 Feb 2012 13:36:32 -0500 Original-Received: by faina.iro.umontreal.ca (Postfix, from userid 20848) id 4895B1300C9; Tue, 14 Feb 2012 13:36:32 -0500 (EST) In-Reply-To: <4F3A9F83.8060307@dancol.org> (Daniel Colascione's message of "Tue, 14 Feb 2012 09:53:07 -0800") User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/24.0.92 (gnu/linux) X-NAI-Spam-Flag: NO X-NAI-Spam-Threshold: 5 X-NAI-Spam-Score: 0 X-NAI-Spam-Rules: 1 Rules triggered RV4132=0 X-NAI-Spam-Version: 2.2.0.9309 : core <4132> : streams <728629> : uri <1065365> X-detected-operating-system: by eggs.gnu.org: Genre and OS details not recognized. X-Received-From: 132.204.246.22 X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: "Emacs development discussions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Xref: news.gmane.org gmane.emacs.devel:148615 Archived-At: > Implementing a fully general NFA-based regular expression matching > engine that support submatches is hard. The only two useful > implementations of which I'm aware are RE2 and Ville Laurikari's TRE, > both of which are two-clause BSD licensed. Laurikari wrote his thesis > [2] on the latter. TRE is the better of the two libraries, IMHO, > because it's single-pass and can work over arbitrary kinds of input > stream (like characters in a gap buffer). TRE's approximate matching > support is occasionally useful as well. I'm familiar with the work, yes. TRE seemed like the best option last time I looked around. > That said, I'd actually prefer to head in the other direction and > allow users to express arbitrarily rich grammars using an extended rx > syntax. I think that would be orthogonal: we want regexp support because it's efficient (yes, our current implementation is super slow in some cases, but it's also efficient in many important cases). I also would like a new regexp engine to fix the "backward matching" problem so that looking-back can work the way most people would expect, and doesn't need a `greedy' hack. The fact that regexps are symmetric is a very neat property (operator precedence grammars enjoy the same property, which is one of the reasons why I chose them as the basis for SMIE). > The idea is basically to support parser combinator grammars, > which can be efficiently matched using a scannerless GRL parser, which > is O(N^3) time, or a memozing and backtracking "packrat" parser, which > is O(N) time and O(n) space. The end result would look a bit like Perl > 6's rules. While these are algorithmically reasonably efficient, it can be difficult to make them as efficient as a regexp-only matcher for many typical regexps. Also it can be difficult to make them work backwards. IOW I don't think that can replace regexps given the amount of regexps out there we have to support. Stefan