From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!not-for-mail From: Alan Mackenzie Newsgroups: gmane.emacs.devel Subject: Re: Documentation on debugging regexp performance Date: Thu, 21 Jan 2016 17:16:07 +0000 Message-ID: <20160121171607.GC1795@acm.fritz.box> References: <56A06CD6.2090707@gmail.com> <20160121152742.GA1795@acm.fritz.box> <56A1095C.2070107@gmail.com> NNTP-Posting-Host: plane.gmane.org Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Transfer-Encoding: 8bit X-Trace: ger.gmane.org 1453396444 3202 80.91.229.3 (21 Jan 2016 17:14:04 GMT) X-Complaints-To: usenet@ger.gmane.org NNTP-Posting-Date: Thu, 21 Jan 2016 17:14:04 +0000 (UTC) Cc: Emacs developers To: =?iso-8859-1?Q?Cl=E9ment?= Pit--Claudel Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Thu Jan 21 18:13:56 2016 Return-path: Envelope-to: ged-emacs-devel@m.gmane.org Original-Received: from lists.gnu.org ([208.118.235.17]) by plane.gmane.org with esmtp (Exim 4.69) (envelope-from ) id 1aMInz-0000wp-90 for ged-emacs-devel@m.gmane.org; Thu, 21 Jan 2016 18:13:55 +0100 Original-Received: from localhost ([::1]:49008 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1aMIny-00027H-Gx for ged-emacs-devel@m.gmane.org; Thu, 21 Jan 2016 12:13:54 -0500 Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:50687) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1aMInr-00023l-N8 for emacs-devel@gnu.org; Thu, 21 Jan 2016 12:13:51 -0500 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1aMIno-0004sc-H8 for emacs-devel@gnu.org; Thu, 21 Jan 2016 12:13:47 -0500 Original-Received: from mail.muc.de ([193.149.48.3]:53410) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1aMIno-0004sX-7b for emacs-devel@gnu.org; Thu, 21 Jan 2016 12:13:44 -0500 Original-Received: (qmail 13104 invoked by uid 3782); 21 Jan 2016 17:13:42 -0000 Original-Received: from acm.muc.de (p5B146A47.dip0.t-ipconnect.de [91.20.106.71]) by colin.muc.de (tmda-ofmipd) with ESMTP; Thu, 21 Jan 2016 18:13:39 +0100 Original-Received: (qmail 3852 invoked by uid 1000); 21 Jan 2016 17:16:07 -0000 Content-Disposition: inline In-Reply-To: <56A1095C.2070107@gmail.com> User-Agent: Mutt/1.5.24 (2015-08-30) X-Delivery-Agent: TMDA/1.1.12 (Macallan) X-Primary-Address: acm@muc.de X-detected-operating-system: by eggs.gnu.org: FreeBSD 9.x X-Received-From: 193.149.48.3 X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: "Emacs development discussions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Xref: news.gmane.org gmane.emacs.devel:198506 Archived-At: Hello again Clément. On Thu, Jan 21, 2016 at 11:37:48AM -0500, Clément Pit--Claudel wrote: > On 01/21/2016 10:27 AM, Alan Mackenzie wrote: > Hi Alan! > > " +[^:=]+ +:=?" is an ill-formed regexp - if you get lots of spaces in > > a non-match, the Emacs regexp engine will try all possible ways of > > matching these spaces before giving up. You have three concatenated > > sub-expressions, all of which match any number of spaces, namely: > > " +[^:=]+ +" > > 1122222233 > > I would suggest reformulating it thus: > > " +[^:= ][^:=]+ " > > 112222223333334 > I think this has different semantics: my original regexp requires at > least three spaces. But I think prepending spaces to yours fixes that. Sorry, yes, I'd extracted the interesting bit of your regexp, and forgot that I'd done so. > > Subexpression 1 matches ALL the leading spaces. > > Subexp 2 is exactly one > > character which can't be a space. Subexp 3 matches almost anything, > > including spaces, and subexp 4 matches a single space at the end (to make > > sure there is at least one space there). > This is helpful, thanks! I realize however that maybe I > oversimplified. The issue is that what I really want is something like > this: > " +\\([^:=]+\\) +:=?" > IOW, I want to capture that first group. That is ambiguous. But if we can assume that the first group always begins with a non-space, and always ends with a non-space, then we can reformulate the above as: " +\\([^:= ]\\([^:=]+[^:= ]\\)?\\) +:=?" ^ (or something similar - I've not actually tested it). The ? inside the first expression is to cope with there just being 1 single character matched by the group. > > All the best with your regexp! > Thanks. Your points about backtracking were helpful as well. Do you > know if there are technical reasons why Emacs chooses a backtracking > implementation for this regexp (instead of compiling it to a > linear-time matcher)? I'm afraid I don't know. It might be that compiling a regexp for a linear-time matcher would be slower. Or, possibly, nobody has sat down and hacked out a better regexp engine. > Clément. -- Alan Mackenzie (Nuremberg, Germany).