From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!not-for-mail From: Alan Mackenzie Newsgroups: gmane.emacs.devel Subject: Re: Fixing ill-conditioned regular expressions. Proof of concept. Date: Thu, 26 Feb 2015 13:09:18 +0000 Message-ID: <20150226130917.GC19320@acm.fritz.box> References: <20150223181205.GA2861@acm.fritz.box> <54EB85AC.1030800@cs.ucla.edu> <20150223202114.GB2861@acm.fritz.box> <54EBA757.5030901@cs.ucla.edu> <20150223224245.GC2861@acm.fritz.box> <54EBB9C4.1020505@cs.ucla.edu> <20150225100834.GA3502@acm.fritz.box> <54EEDD82.4010502@cs.ucla.edu> <20150226101137.GA19320@acm.fritz.box> <87fv9tc4qm.fsf@gnu.org> NNTP-Posting-Host: plane.gmane.org Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Trace: ger.gmane.org 1424956225 15139 80.91.229.3 (26 Feb 2015 13:10:25 GMT) X-Complaints-To: usenet@ger.gmane.org NNTP-Posting-Date: Thu, 26 Feb 2015 13:10:25 +0000 (UTC) Cc: Paul Eggert To: emacs-devel@gnu.org Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Thu Feb 26 14:10:14 2015 Return-path: Envelope-to: ged-emacs-devel@m.gmane.org Original-Received: from lists.gnu.org ([208.118.235.17]) by plane.gmane.org with esmtp (Exim 4.69) (envelope-from ) id 1YQyCe-0008NE-LJ for ged-emacs-devel@m.gmane.org; Thu, 26 Feb 2015 14:10:08 +0100 Original-Received: from localhost ([::1]:58938 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1YQyCd-0005Yn-Pc for ged-emacs-devel@m.gmane.org; Thu, 26 Feb 2015 08:10:07 -0500 Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:53928) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1YQyCM-0005Vf-GQ for emacs-devel@gnu.org; Thu, 26 Feb 2015 08:09:51 -0500 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1YQyCH-00009q-Mw for emacs-devel@gnu.org; Thu, 26 Feb 2015 08:09:50 -0500 Original-Received: from colin.muc.de ([193.149.48.1]:28385 helo=mail.muc.de) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1YQyCH-00009O-DV for emacs-devel@gnu.org; Thu, 26 Feb 2015 08:09:45 -0500 Original-Received: (qmail 58117 invoked by uid 3782); 26 Feb 2015 13:09:43 -0000 Original-Received: from acm.muc.de (pD951AF12.dip0.t-ipconnect.de [217.81.175.18]) by colin.muc.de (tmda-ofmipd) with ESMTP; Thu, 26 Feb 2015 14:09:42 +0100 Original-Received: (qmail 20068 invoked by uid 1000); 26 Feb 2015 13:09:18 -0000 Content-Disposition: inline In-Reply-To: <87fv9tc4qm.fsf@gnu.org> User-Agent: Mutt/1.5.22 (2013-10-16) X-Delivery-Agent: TMDA/1.1.12 (Macallan) X-Primary-Address: acm@muc.de X-detected-operating-system: by eggs.gnu.org: FreeBSD 8.x X-Received-From: 193.149.48.1 X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: "Emacs development discussions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Xref: news.gmane.org gmane.emacs.devel:183511 Archived-At: Hello, Tassilo. On Thu, Feb 26, 2015 at 12:05:37PM +0100, Tassilo Horn wrote: > Alan Mackenzie writes: > Hi Alan, > >> Sure, but you could remember how the \(...\) constructs were > >> renumbered, and fix the match data after the underlying regexp call > >> returned. It shouldn't be a big deal. > > Unfortunately, it's not that simple. Consider the RE > > \(R\)+E*\(R\)+ > > 1 1 2 2 > > . This gets transformed to > > \(R\)+\(?:E+\(R\)+\|\(R\)\) > > 1 1 2 2 2 2 > > . What was subexpression 2 in the original has become two > > subexpressions straddling an \| sign in the transformation. I don't > > think there's a way of transforming R+E*R+ that preserves the > > numbering of the subexpressions. > Couldn't you use explicitly numbered groups, i.e., the regex would > translate to > \(?1:R\)+\(?:E+\(?2:R\)+\|\(?2:R\)\) > ? Thanks. That's a brilliant idea! I think it would work. > As long as the groups with the same number are exclusive there shouldn't > be a problem. I think that's true. There might be a slight problem with groups which match only the empty string. Something like: R*\(\)R* , but anybody who writes such regexps deserves what she gets. > Bye, > Tassilo -- Alan Mackenzie (Nuremberg, Germany).