From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Thomas Lord <lord@emf.net>
Newsgroups: gmane.emacs.devel
Subject: Re: Compiling Elisp to a native code with a GCC plugin
Date: Wed, 15 Sep 2010 09:28:57 -0700
Message-ID: <1284568137.3622.22.camel@dell-desktop.example.com>
References: <87bp805ecr.fsf@gmail.com> <m3mxrk6rzl.fsf@fleche.redhat.com>
	<m18w33cmgz.fsf@cam.ac.uk> <jwvd3sfnlyo.fsf-monnier+emacs@gnu.org>
	<m2hbhrgjz9.fsf@gmail.com> <jwv4odrm4xj.fsf-monnier+emacs@gnu.org>
	<m2zkvjf1rs.fsf@gmail.com>
NNTP-Posting-Host: lo.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit
X-Trace: dough.gmane.org 1284568443 26850 80.91.229.12 (15 Sep 2010 16:34:03 GMT)
X-Complaints-To: usenet@dough.gmane.org
NNTP-Posting-Date: Wed, 15 Sep 2010 16:34:03 +0000 (UTC)
Cc: emacs-devel@gnu.org
To: Helmut Eller <eller.helmut@gmail.com>
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Wed Sep 15 18:33:58 2010
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Envelope-to: ged-emacs-devel@m.gmane.org
Original-Received: from lists.gnu.org ([199.232.76.165])
	by lo.gmane.org with esmtp (Exim 4.69)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1Ovuvl-0002SJ-UE
	for ged-emacs-devel@m.gmane.org; Wed, 15 Sep 2010 18:33:58 +0200
Original-Received: from localhost ([127.0.0.1]:58290 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43)
	id 1Ovuvl-0000AW-0L
	for ged-emacs-devel@m.gmane.org; Wed, 15 Sep 2010 12:33:57 -0400
Original-Received: from [140.186.70.92] (port=47223 helo=eggs.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1Ovur2-0004PU-Ns
	for emacs-devel@gnu.org; Wed, 15 Sep 2010 12:29:09 -0400
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.69)
	(envelope-from <lord@emf.net>) id 1Ovuqx-0000SU-HK
	for emacs-devel@gnu.org; Wed, 15 Sep 2010 12:29:04 -0400
Original-Received: from smtp141.iad.emailsrvr.com ([207.97.245.141]:53840)
	by eggs.gnu.org with esmtp (Exim 4.69) (envelope-from <lord@emf.net>)
	id 1Ovuqx-0000SN-Ai
	for emacs-devel@gnu.org; Wed, 15 Sep 2010 12:28:59 -0400
Original-Received: from localhost (localhost.localdomain [127.0.0.1])
	by smtp34.relay.iad1a.emailsrvr.com (SMTP Server) with ESMTP id
	16C1F380327; Wed, 15 Sep 2010 12:28:59 -0400 (EDT)
X-Virus-Scanned: OK
Original-Received: by smtp34.relay.iad1a.emailsrvr.com (Authenticated sender:
	lord-AT-emf.net) with ESMTPSA id 800D8380347; 
	Wed, 15 Sep 2010 12:28:58 -0400 (EDT)
In-Reply-To: <m2zkvjf1rs.fsf@gmail.com>
X-Mailer: Evolution 2.28.3 
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6 (newer, 3)
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <http://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/emacs-devel>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <http://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.emacs.devel:130225
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/130225>

On Wed, 2010-09-15 at 17:46 +0200, Helmut Eller wrote:
> * Stefan Monnier [2010-09-15 14:59] writes:

> > IIRC regexps without back-refs can be matched (and searched) in O(N)
> > where N is the length of the input.  

Not quite.   Let R be the length of the pattern
and L be the length of the string we seek to match.

Assume that the pattern is a true regular expression
(no back-references, no sub-expression position
reporting, etc.)

The problem itself is O(R * L):  no algorithm can
guarantee doing better than the product of the two
lengths.

There is an algorithm (compiling the pattern to a DFA
first) which is O(2^R + L). Formally it is suboptimal.
There is a catch though.   For many common patterns,
either R is very small or we don't actually need
anything like 2^R steps to compile the pattern,
and so the *expected* case (for these patterns)
is O(L).

The Thompson "DFA caching algorithm" described briefly
in the Dragon compiler book is quite attractive
because it offers (for true regular expressions) a 
worst case complexity of O(R * L) ... so it is optimal ...
yet also delivers much closer to O(L) for many
common patterns.   The algorithm is a bit 
unattractive because it is tricky to implement 
well, very hard to generalize beyond true regular
expressions, and can easily lose to more naive NFA
algorithms by small but annoying constant factors.


> With back-refs, I think (not sure)
> > the theoretical bound is O(N^2), which requires
> > a non-backtracking algorithm.

> > So yes, we'd need to handle back-refs specially.  Several regexp engines
> > do that already (they have a few different inner engines and choose
> > which one to use based on the particular regexp at hand).
> 
> After googleing a bit I found this page 
> http://swtch.com/~rsc/regexp/regexp1.html
> which again links to this
> http://perl.plover.com/NPC/NPC-3SAT.html
> which says that regexp matching with backreferences is NP-complete.

The best known algorithms are not O(N^2) but,
rather, O(2^N) in the worst case.   It's dismal
how such an innocent seeming feature can cause
such havoc.


> Cox (the first page) seems to say that backtracking-with-memoization is
> linear time at the expense of O(N) space.

If he said that regarding worst case performance
and without mentioning some specific subset of 
true regular expressions that he's talking about,
he must have made a mistake.

For true regular expressions, the simpler case,
you can not beat O(R * L) time as the worst case.

-t