From: Alan Mackenzie
Newsgroups: gmane.emacs.devel
Subject: Re: Fixing ill-conditioned regular expressions. Proof of concept.
Date: Thu, 26 Feb 2015 20:01:08 +0000
Message-ID: <20150226200108.GE19320@acm.fritz.box>
To: Stefan Monnier
Cc: Paul Eggert, emacs-devel@gnu.org

Hello, Stefan.

On Thu, Feb 26, 2015 at 02:12:32PM -0500, Stefan Monnier wrote:
> >> >     R*\(\)R*
> >> > , but anybody who writes such regexps deserves what she gets.
> >> What is it that I deserve to get?
> > You deserve, perhaps, to lose (match-beginning 1) and (match-end 1),
> > which were ill-defined anyway.
>
> Why do you think so?  They seem perfectly well-defined to me.
> They're just always equal to one another, of course, but to the extent
> that the regexp syntax only forces me to put "named positions" in pairs,
> if I need a single position, it's fairly natural to just use \(\).

I really did mean R*\(\)R*, with R being the same on both sides of the
\(\), but the *s possibly being +s.  _That_ is nasty and undefined.

> > Have you really written a regexp like this (apart from for testing
> > purposes)?  If so, what's it for?
>
>    grep '\\\\(\\\\)' **/*.el
>
> finds 27 matches.
> Taking one example from the list:
>
>    lisp/emacs-lisp/smie.el:       ((looking-at "\\s(\\|\\s)\\(\\)")
>
> what this does is to let me use (match-beginning 1) to figure out which
> of the two alternatives was matched.  I could have written this as
>
>    ((looking-at "\\s(\\|\\(\\s)\\)")
>
> but this would be (marginally) slower, because we'd always push
> a "group-start" marker before trying to match "\\s)", whereas with the
> other rule, we only do that when we know "\\s)" has matched.

OK.

> > By the way, how do you see the prospects of this file becoming
> > incorporated into Emacs at some stage?
>
> To be honest, I haven't looked at it at all, yet.
> The vague understanding I have of what it might be sounds interesting.
> It's just a patch trying to cover up the worst aspects of the
> current regexp engine, but since there doesn't seem to be much interest
> in improving/overhauling the regexp engine, maybe it's a good stop-gap.

Thanks.  I'll continue working on it, adding a decent set of test cases
too.

>         Stefan

-- 
Alan Mackenzie (Nuremberg, Germany).
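
For anyone following the thread, the empty-group trick quoted above from smie.el can be tried out with a minimal sketch like the following.  The helper name `my-paren-kind' is hypothetical and is not code from smie.el; the sketch also assumes the standard syntax table (as in a fresh temporary buffer), where ( and ) have open and close paren syntax.

```elisp
;; Sketch only: `my-paren-kind' is a hypothetical helper, not part of
;; smie.el or any other Emacs library.
(defun my-paren-kind ()
  "Return `open', `close', or nil for the character after point.
The empty group in the regexp serves purely as a position marker."
  (cond
   ;; Neither alternative matches: not at a paren.
   ((not (looking-at "\\s(\\|\\s)\\(\\)")) nil)
   ;; Group 1 sits after \\s), so it is set only when the second
   ;; alternative matched; otherwise `match-beginning' returns nil.
   ((match-beginning 1) 'close)
   (t 'open)))

;; Usage, assuming the standard syntax table of a temporary buffer.
(with-temp-buffer
  (insert "(a)")
  (list (progn (goto-char 1) (my-paren-kind))
        (progn (goto-char 2) (my-paren-kind))
        (progn (goto-char 3) (my-paren-kind))))
;; => (open nil close)
```

The point of putting \(\) after \s) rather than wrapping it is the one made in the quoted text: the group marker is only recorded once the second alternative is known to have matched.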