From: Alan Mackenzie
Newsgroups: gmane.emacs.devel
Subject: Re: Ugly regexps
Date: Wed, 3 Mar 2021 20:46:12 +0000
To: Stefan Kangas
Cc: Stefan Monnier, emacs-devel@gnu.org

Hello, Stefan.

On Tue, Mar 02, 2021 at 19:32:23 -0600, Stefan Kangas wrote:
> Stefan Monnier writes:

> > BTW, while this theme of ugly regexps keeps coming up, how 'bout we add
> > a new function `ere` which converts between the ERE style of regexps
> > where grouping parens are not escaped (and plain chars meant to match
> > an actual paren need to be escaped instead) to ELisp-style regexps?

> > So you can do

> >     (string-match (ere "\\(def(macro|un|subst) .{1,}"))

> > instead of

> >     (string-match "(def\\(macro\\|un\\|subst\\) .\\{1,\\}")

> > ?

> Sounds good to me.

> I was going to ask why not just do PCRE, but then I realized I'm not
> exactly sure what the syntactical differences are. (We obviously lack
> some features.) AFAIR, Emacs regexps don't exactly match GNU grep,
> egrep, Perl, or anything else really.

These things don't exactly match each other, do they?

> So I cranked out my dusty old copy of Mastering Regular Expressions and
> found this overview:

>     grep        egrep       Emacs           Perl
>     \? \+ \|    ? + |       ? + \|          ? + |
>     \( \)       ( )         \( \)           ( )
>                 \< \>       \< \> \b \B     \b \B

> (Excerpt from Mastering Regular Expressions: Table 3-3: A (Very)
> Superficial Look at the Flavor of a Few Common Tools)

> This shows the differences that most commonly bites you, in my
> experience.

The "biting" effect is surely small.
I have little difficulty using grep, egrep and awk, all of whose regexp
notations differ somewhat.

> While we're at it, has it ever been discussed to add support for the
> pcre library side-by-side with our homegrown regexp.c? It would give us
> sane (standard) syntax and some useful features "for free"
> (e.g. lookaround). I didn't test but a priori I would also assume the
> code to be much more performant than anything we could ever cook up
> ourselves. It is used by several high-profile projects.

> I would imagine we'd introduce entirely new function names for it.
> Perhaps even a completely new and improved API like Lars suggested a
> while back.

No, No, No, No! All these tools have one overarching thing in common,
and that is they each have a single variety of regexp. That is, with
the exception of Emacs, which also has a radically different source
form, namely rx.

Somebody pointed out the relatively small use of rx, and the same might
happen for a new regexp notation. Or it might not, and we'd have two
different notations side by side. This is surely something to avoid.

There's not a lot wrong with Emacs's regexp notation. It works, works
well, and we're all familiar with it. And there are many thousands of
lines of lisp containing regexps, all of which are in the same variety.
With the exception of those written with rx.

To introduce a second (string) variety alongside Emacs regexps would
cause confusion, and suck up effort better used for productive work.
Just how is one meant to search for a regexp using grep, when one
doesn't even know whether it follows Emacs conventions or some foreign
set of conventions?

-- 
Alan Mackenzie (Nuremberg, Germany).