From: Paul Eggert
Newsgroups: gmane.emacs.bugs
Subject: bug#37659: rx additions: anychar, unmatchable, unordered-or
Date: Tue, 22 Oct 2019 10:33:40 -0700
Organization: UCLA Computer Science Department
To: Mattias Engdegård
Cc: 37659@debbugs.gnu.org
In-Reply-To: <88571301-3F15-428F-82F9-60A23D817EF8@acm.org>

On 10/22/19 8:14 AM, Mattias Engdegård wrote:

> 'regexp-opt' always generates a regexp preferring long matches. This is
> undocumented, but useful enough that I would be surprised if this
> property wasn't exploited (perhaps unknowingly) by callers. It's quite
> natural: given a set of strings, surely the caller want them all to be
> candidates for a match, even if there is no following anchoring pattern.

Yes, the longstanding tradition is that regular expressions are greedy.

> Thus, instead of 'unordered-or', define the operator in terms of long
> matches: 'or-max' (working name) would work like 'or' but guarantee a
> longest match, and only permit strings and 'or-max' forms as arguments.

That's an odd restriction. I'm not sure it's a good idea to add an
operator with such a restriction. That is, I know why the restriction is
there (it's because of limitations in the Emacs regexp matcher), but
it's not clear that users should have to know and understand these
details.

Moreover, if greed is the longstanding tradition for regexp-opt,
shouldn't plain "or" be greedy, to be consistent with other operators?
That is true for POSIX regular expressions involving "|".
For example, the shell command:

    echo abbc | awk '{n=split($0, a, /b|bb/); for (i=1;i<=n;i++) print a[i]}'

outputs the two lines "a" and "c" (not the three lines "a", "", and "c")
because the "b|bb" matches greedily.

If it's too much trouble to make plain "or" greedy, I suggest just
documenting it as possibly being greedy and possibly not (that is,
document it as being unordered, even if it happens to be ordered now).
This will give us more opportunity for optimization later.

More generally, surely it would be better to improve the underlying
Emacs regular expression matcher to have a greedy "or", or a stingy
"or", or whatever.
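
For comparison with the awk command above, here is a minimal sketch of the same split under an ordered "or". It uses Python's `re` module purely as a stand-in for a backtracking matcher that tries alternatives left to right; it is not Emacs code. The helper `split_longest_first` is a hypothetical name, and sorting the literal strings longest-first is only one simple way to get long-match preference from an ordered matcher, not a description of how regexp-opt works internally.

```python
import re


def split_longest_first(text, alternatives):
    """Split TEXT at any of the literal ALTERNATIVES, preferring long matches.

    Hypothetical helper for illustration: with literal strings only,
    listing the longest alternative first makes an ordered matcher pick
    the longest alternative that matches at each position.
    """
    by_length = sorted(alternatives, key=len, reverse=True)
    pattern = "|".join(re.escape(s) for s in by_length)
    return re.split(pattern, text)


# Ordered alternation: "b" is tried first and wins at each position,
# so "bb" is consumed as two separate one-character matches.
ordered = re.split(r"b|bb", "abbc")

# Longest-first ordering: "bb" is consumed as a single match,
# which agrees with the awk result.
longest = split_longest_first("abbc", ["b", "bb"])

print(ordered)  # ['a', '', 'c']
print(longest)  # ['a', 'c']
```

The first result is the three-piece split that awk does not produce; the second matches awk's two lines. Reordering works here only because every argument is a literal string, which is presumably the same reason the proposed operator is restricted to strings.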