From mboxrd@z Thu Jan 1 00:00:00 1970
Path: news.gmane.org!.POSTED.blaine.gmane.org!not-for-mail
From: Mattias Engdegård
Newsgroups: gmane.emacs.devel
Subject: Re: Scan of regexps in Emacs (March 17)
Date: Thu, 21 Mar 2019 12:15:57 +0100
Message-ID: <21CCFA3D-B391-44E1-9ED5-1D37009F1988@acm.org>
References: <5363970c-3207-1bb4-8b30-74a7d12277cc@cs.ucla.edu>
 <05269D79-B016-4FCB-94B8-068BF7D1C2D2@acm.org>
 <3974269b-6cad-0744-bd1f-66c067f94192@cs.ucla.edu>
 <4b1164c4-e302-ce41-07c3-145d31a97b4c@cs.ucla.edu>
Mime-Version: 1.0 (Mac OS X Mail 12.2 \(3445.102.3\))
Content-Type: multipart/mixed; boundary="Apple-Mail=_90DA0F4C-9C43-403D-8F54-BE46ECBC8F46"
Injection-Info: blaine.gmane.org; posting-host="blaine.gmane.org:195.159.176.226";
 logging-data="7684"; mail-complaints-to="usenet@blaine.gmane.org"
Cc: Stefan Monnier , emacs-devel@gnu.org
To: Paul Eggert
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Thu Mar 21 12:16:32 2019
Return-path:
Envelope-to: ged-emacs-devel@m.gmane.org
Original-Received: from lists.gnu.org ([209.51.188.17])
 by blaine.gmane.org with esmtps (TLS1.0:RSA_AES_256_CBC_SHA1:256) (Exim 4.89)
 (envelope-from ) id 1h6vgV-0001tS-9F
 for ged-emacs-devel@m.gmane.org; Thu, 21 Mar 2019 12:16:31 +0100
Original-Received: from localhost ([127.0.0.1]:35083 helo=lists.gnu.org)
 by lists.gnu.org with esmtp (Exim 4.71)
 (envelope-from ) id 1h6vgT-0001AD-Sz
 for ged-emacs-devel@m.gmane.org; Thu, 21 Mar 2019 07:16:29 -0400
Original-Received: from eggs.gnu.org ([209.51.188.92]:52546)
 by lists.gnu.org with esmtp (Exim 4.71)
 (envelope-from ) id 1h6vgN-00019y-1e
 for emacs-devel@gnu.org; Thu, 21 Mar 2019 07:16:24 -0400
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from ) id 1h6vgH-0008S1-TH
 for emacs-devel@gnu.org; Thu, 21 Mar 2019 07:16:23 -0400
Original-Received: from mail175c50.megamailservers.eu ([91.136.10.185]:32856 helo=mail50c50.megamailservers.eu)
 by eggs.gnu.org with esmtps
 (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32) (Exim 4.71)
 (envelope-from ) id 1h6vgG-0008GI-Q9
 for emacs-devel@gnu.org; Thu, 21 Mar 2019 07:16:17 -0400
X-Authenticated-User: mattiase@bredband.net
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=megamailservers.eu;
 s=maildub; t=1553166962;
 bh=8tzX4+SyW949kj4TBXjjsCtgDA2ZFyu3eODekEWvYgA=;
 h=From:Subject:Date:In-Reply-To:Cc:To:References:From;
 b=F6p3+jqASy67swWLUDSOn+fnwbb11LRsbyK9Ntmg+7n6nD5aM3PIZBFu4HCQRSke8
 7bPKQ6PxCJXANWLZ8zBIfEDPXyh/gtEULKCwM0G/Ii8Q041+AHJOjL9TD5w2qTc/LF
 2CsgGIe5diAQDIIOpbumtlrhNoYp8XOXqueN1TwI=
Feedback-ID: mattiase@acm.or
Original-Received: from [192.168.1.64] (c-e636e253.032-75-73746f71.bbcust.telenor.se [83.226.54.230])
 (authenticated bits=0)
 by mail50c50.megamailservers.eu (8.14.9/8.13.1) with ESMTP id x2LBFvve031833;
 Thu, 21 Mar 2019 11:15:59 +0000
In-Reply-To: <4b1164c4-e302-ce41-07c3-145d31a97b4c@cs.ucla.edu>
X-Mailer: Apple Mail (2.3445.102.3)
X-CTCH-RefID: str=0001.0A0B0201.5C937272.0072, ss=1, re=0.000, recu=0.000, reip=0.000, cl=1, cld=1, fgs=0
X-CTCH-VOD: Unknown
X-CTCH-Spam: Unknown
X-CTCH-Score: 0.000
X-CTCH-Flags: 0
X-CTCH-ScoreCust: 0.000
X-CSC: 0
X-CHA: v=2.3 cv=a4UeC3aF c=1 sm=1 tr=0 a=M+GU/qJco4WXjv8D6jB2IA==:117
 a=M+GU/qJco4WXjv8D6jB2IA==:17 a=jpOVt7BSZ2e4Z31A5e1TngXxSK0=:19
 a=3ORN8VxVkAU1rkrWeyYA:9 a=QEXdDO2ut3YA:10 a=OYk9R4avAdcWoGfWFRcA:9
 a=ITdVHhY7-e0A:10
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x (no timestamps) [generic]
X-Received-From: 91.136.10.185
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.21
Precedence: list
List-Id: "Emacs development discussions."
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Original-Sender: "Emacs-devel"
Xref: news.gmane.org gmane.emacs.devel:234458
Archived-At:

--Apple-Mail=_90DA0F4C-9C43-403D-8F54-BE46ECBC8F46
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset=utf-8

On 20 March 2019 at 23:01, Paul Eggert wrote:
>
> On 3/19/19 7:20 PM, Stefan Monnier wrote:
>> I wonder why the doc doesn't just say that `-` should be the last
>> character and not mention the other possibilities which just make the
>> rule unnecessarily complex.

Agreed, that is what the 'how to write regexps' part of the docs should
say. But don't we also need a precise description of exactly how they
are interpreted by the engine? Otherwise, a user cannot read and
understand existing code. (Unless he or she uses xr!) Perhaps there
needs to be a separate 'gritty details' section.

> * The doc already says that regular expressions like "*foo" and "+foo"
> are problematic (they're confusing, and POSIX says the behavior is
> undefined) and should be avoided. REs like "[a-m-z]" and "[!-[:alpha:]]"
> and "[[:alpha:]-~]" are problematic in the same way and also should be
> avoided.

I'm with Stefan here; `-' should go last. Anything else is a gritty
detail.

> * The doc doesn't clearly say when the Emacs range behavior is an
> extension to POSIX; saying this will help people know better when they
> can export Emacs regular expressions to other programs.

Documenting differences from POSIX regexps is useful. Do you prefer
having those differences spread out, or all concentrated into one
section?

These days, a user may be more familiar with the various PCRE dialects
than traditional or extended POSIX. Should that be taken into account?

> * The doc is confused (and there's a comment about this) about what
> happens when one end of a range is unibyte and the other is multibyte. I
> added something saying that if one bound is a raw 8-bit byte then the
> other should be a unibyte character (either ASCII, or a raw 8-bit byte).
> I don't see any good way to specify the behavior when one bound is a raw
> 8-bit byte and the other bound is a multibyte character, in such a way
> that it's a natural extension of the documented behavior, so the
> documentation now recommends against that.

The terminology is a bit confusing. Is 'raw 8-bit byte' included in
'unibyte'? Is \x7f ever a raw 8-bit byte?

I agree that [å-\xff], say, should be invalid but I've never seen such
constructs.

> * We might as well go ahead and say that [b-a] matches nothing, as
> enough code (ab)uses regexps in that way, and there is value in having a
> simple regular expression that always fails to match. However, I expect
> that we should say that users should avoid wilder examples like [~-!] so
> that the trawler can catch them as typos.

It already does, and some bugs were found that way. As a special case,
it no longer complains about z-a because that is unlikely to be an
accident and occurs in some code on purpose.

I'm not sure it's a good idea to document reversed ranges as a
recommended way to match any or no character (although the description
of the semantics would belong in a 'gritty details' section), and only
to use [Y-X] where Y=X+1. More about that in a separate post.

> These new recommendations ("should"s in the attached patch) will give
> the trawler license to diagnose questionable REs like "[a-m-z]",
> "[!-[:alpha:]]", "[~-!]", and (my favorite) "[\u00FF-\xFF]". There is no
> change to actual Emacs behavior.

As an experiment, I added detection of 'chained' ranges like [a-m-z] to
xr and found a handful in both Emacs and GNU ELPA, but none of them
carried a freeload of bugs. Keeping that check didn't seem worthwhile;
the regexps may be a bit odd-looking, but aren't wrong.

[!-[:alpha:]] is already detected since xr parses it correctly and will
complain about the duplication of ':'.
The reverse, [[:digit:]-z], is seen occasionally but again does not seem
to be a serious bug proxy.

Much as I would like to outlaw ranges where a typical programmer has to
consult an ASCII table to understand what's included, they just seem too
common, with too many false positives, to merit inclusion in xr.
Nevertheless I had a quick look and extracted a few that might merit
attention; see attachment.

Similarly, a rule finding [X-Y] where Y=X+1 found one or two
questionable cases in a sea of false positives (also in the attachment).

--Apple-Mail=_90DA0F4C-9C43-403D-8F54-BE46ECBC8F46
Content-Disposition: attachment; filename=possibly-broken-regexps.log
Content-Type: application/octet-stream; x-unix-mode=0644; name="possibly-broken-regexps.log"
Content-Transfer-Encoding: 7bit

/Users/mattias/emacs/lisp/vc/diff-mode.el:2215:19: In call to re-search-forward: Unintuitive range `+-<' (pos 3)
  "\n[!+-<>]\\(-- [0-9]+\\(,[0-9]+\\)? ----\n\\( .*\n\\)*[+]\\)?"
   ....^
/Users/mattias/emacs/lisp/speedbar.el:2852:42: In call to re-search-forward: Unintuitive range `+-?' (pos 21)
  "^\\([0-9]+\\):\\s-*[[<][+-?][]>] "
   ........................^
/Users/mattias/emacs/lisp/speedbar.el:2903:42: In call to re-search-forward: Unintuitive range `+-?' (pos 19)
  "^\\([0-9]+\\):\\s-*\\[[+-?]\\] "
   .......................^
/Users/mattias/emacs/lisp/woman.el:3514:26: In call to looking-at: Unintuitive range `+-/' (pos 1)
  "[+-/*%]"
   .^
/Users/mattias/emacs/lisp/net/webjump.el:345:39: In call to string-match: Two-character range `.-/' (pos 8)
  "[a-zA-Z_.-/]"
   ........^
/Users/mattias/emacs/lisp/align.el:386:3: In align-rules-list (perl-assignment): Two-character range `*-+' (pos 6)
  "[^=!^&*-+<>/| \t\n]\\(\\s-*\\)=[~>]?\\(\\s-*\\)\\([^>= \t\n]\\|$\\)"
   ......^

--Apple-Mail=_90DA0F4C-9C43-403D-8F54-BE46ECBC8F46--
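
The ranges flagged in the log are easier to judge once they are expanded
into the characters they cover. The following is a minimal Python sketch,
assuming plain ASCII code-point order; `expand_range` is a hypothetical
helper written for illustration and is not part of xr or Emacs.

```python
def expand_range(lo: str, hi: str) -> str:
    """Return every character from lo to hi inclusive, in code-point order.

    A reversed range (lo > hi) yields the empty string.
    """
    return "".join(chr(c) for c in range(ord(lo), ord(hi) + 1))


# Ranges reported in possibly-broken-regexps.log, written as "X-Y".
flagged = ["+-<", "+-?", "+-/", ".-/", "*-+"]

for r in flagged:
    lo, hi = r[0], r[2]  # first and last character of the range
    print(f"[{r}] covers {expand_range(lo, hi)!r}")
```

For `[+-/*%]` from woman.el, for instance, the range `+-/` also takes in
`,` and `.`, which may or may not be what the author of that regexp meant.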