From: Mattias Engdegård
Newsgroups: gmane.emacs.devel
Subject: Re: Scan of regexps in Emacs (March 17)
Date: Tue, 2 Apr 2019 16:15:13 +0200
Message-ID: <09AE372B-3A30-4596-8C4E-B9F4CBF6E348@acm.org>
To: Paul Eggert
Cc: Stefan Monnier, emacs-devel@gnu.org

On 2 Apr 2019, at 09:33, Paul Eggert wrote:
>
>> don't we also need a precise description of exactly how they are
>> interpreted by the engine?
>
> In other parts of Emacs, we are typically OK with specs that don't
> completely specify behavior. This gives us more freedom to make
> changes in the undocumented behavior later. I think it makes sense to
> do that here too, for regular expressions like "[z-a-m]" that most
> readers would find confusing.

Then where does a user go to understand extant regexps? (Do we have any
latitude at all for changing even obscure corners of regexp syntax and
semantics today?) That's why I favour expounding on the details in a
separate section.

>> The terminology is a bit confusing. Is 'raw 8-bit byte' included in
>> 'unibyte'? Is \x7f ever a raw 8-bit byte?
>> I agree that [å-\xff], say, should be invalid but I've never seen
>> such constructs.
>
> After looking into it I realized that I don't really know the
> semantics here (the text I recently added there seems to be wrong, in
> some cases), and I have my doubts that anyone else knows the semantics
> either. The attached patch simply gets rid of that section, leaving
> the area undocumented. User beware!

Apparently I don't really know it either -- I just discovered that:

(string-match "\xff" "\xff") => 0
(string-match "[\xff]" "\xff") => 0
(string-match "\xffé?" "\xff") => nil
(string-match "[\xff]é?" "\xff") => 0

(string-match "\xff" "\xffé") => 0
(string-match "[\xff]" "\xffé") => nil
(string-match "\xffé?" "\xffé") => 0
(string-match "[\xff]é?" "\xffé") => nil

> OK, then we should document z-a as the preferred syntax (best go with
> the flow...). Done in the attached patch.

Actually, the only place where I saw z-a was in auctex (in negated form,
[^z-a]).
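
For readers who have not met the z-a idiom, here is a minimal sketch of what it is used for. It assumes that the engine treats a reversed range as empty, which is the behaviour the patch proposes to document; the expected values in the comments follow from that assumption and are worth checking in your own Emacs.

```elisp
;; Assumption: a reversed range such as z-a is empty, i.e. it contains
;; no characters at all.
(string-match "[z-a]" "abc")    ; expected nil: an empty set never matches
(string-match "[^z-a]" "\n")    ; expected 0: the negation matches anything

;; By contrast, "." does not match a newline, which is presumably why
;; the negated form is used where "any character at all" is wanted.
(string-match "." "\n")         ; expected nil
```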

>> As an experiment, I added detection of 'chained' ranges like [a-m-z]
>> to xr and found a handful in both Emacs and GNU ELPA, but none of
>> them carried a freeload of bugs. Keeping that check didn't seem
>> worthwhile; the regexps may be a bit odd-looking, but aren't wrong.
>
> It depends on what one means by "wrong". If one wants to use the
> ranges in both Emacs and grep they are "wrong", so it's reasonable for
> the manual to recommend against them.

Definitely agree that it should be discouraged. I've attached the ones
found by a modified relint/xr, in case you are interested.

> It might also help for the trawler to warn about [X-Z] where Z = X+2.
> [XYZ] is clearer and less error-prone than [X-Z]. I shoehorned that
> into the attached patch too.

These seem to be rare; I found exactly one occurrence
(lisp/gnus/message.el:1291):

"[ \t]\\|[][!\"#$%&'()*+,-./0-9;<=>?@A-Z\\^_`a-z{|}~]+:"

which uses the punny range ,-. (possibly by benign accident).
Similarly, singleton ranges, X-X, are non-existent save for --- which I
presume is an XEmacs workaround.

The latest xr version warns about 2-character ranges, except within
digits because [0-1] etc. was found to be common and harmless.

diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi
index 748ab586af..72ee9233a3 100644
--- a/doc/lispref/searching.texi
+++ b/doc/lispref/searching.texi
...
+A character alternative can include duplicates. For example,
+@samp{[XYa-yYb-zX]} is less clear than @samp{[XYa-z]}.

Certainly, but does this need to be mentioned? Overlapping ranges are
rarely written on purpose. Besides, duplication isn't confined to
ranges.

More useful, I think, would be to recommend ranges to stay within
natural sequences (letters, digits, etc) so that a reader needn't
consult a table to see what is included. Thus [0-9.:/] good, [.-:] bad,
even though they denote the same set.
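
To make the "consult a table" point concrete, the ranges can be expanded by enumerating their code points. This is only an illustrative sketch using standard functions (`number-sequence`, `concat`, `cl-every`, `string-match-p`); the equivalence check is limited to ASCII, which is an assumption made to keep the example small.

```elisp
(require 'cl-lib)

;; Spell out the range .-: by listing every character from ?. to ?:.
(concat (number-sequence ?\. ?\:))      ; "./0123456789:"

;; The same expansion applied to the range ,-. from message.el: the
;; range happens to contain exactly the characters used to write it.
(concat (number-sequence ?\, ?\.))      ; ",-."

;; Check that [0-9.:/] and [.-:] accept exactly the same ASCII
;; characters by trying each one against both regexps.
(cl-every (lambda (c)
            (eq (null (string-match-p "[0-9.:/]" (string c)))
                (null (string-match-p "[.-:]" (string c)))))
          (number-sequence 0 127))      ; t
```

A reader can verify [0-9.:/] at a glance, whereas [.-:] needs the first expression above (or an ASCII table) to be understood.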

+@item
+A @samp{-} also appear at the beginning of a character alternative, or

'appears'

Attachment: chained-ranges.log

;; -*- compilation -*-
lisp/gnus/nndoc.el:704:30: In call to re-search-forward: Attempt at chained ranges `A-Z-\' (pos 47)
  "^\\\\\\\\\n\\(Paper\\( (\\*cross-listing\\*)\\)?: [a-zA-Z-\\.]+/[0-9]+\\|arXiv:\\)"
   .........................................................^
lisp/gnus/nndoc.el:735:27: In call to looking-at: Attempt at chained ranges `A-Z-\' (pos 34)
  "^\\(Paper.*: \\|arXiv:\\)\\([0-9a-zA-Z-\\./]+\\)"
   ......................................^
lisp/org/org-eshell.el:40:29: In call to string-match: Attempt at chained ranges `0-9-+' (pos 12)
  "\\([A-Za-z0-9-+*]+\\):\\(.*\\)"
   .............^
lisp/org/org.el:432:3: In org-deadline-time-hour-regexp: Attempt at chained ranges `0-9-+' (pos 48)
  "\\]+[0-9]\\{1,2\\}:[0-9]\\{2\\}[0-9-+:hdwmy \t.]*\\)>"
   ......................................................^
lisp/org/org.el:448:3: In org-scheduled-time-hour-regexp: Attempt at chained ranges `0-9-+' (pos 49)
  "\\]+[0-9]\\{1,2\\}:[0-9]\\{2\\}[0-9-+:hdwmy \t.]*\\)>"
   .......................................................^
lisp/progmodes/bat-mode.el:68:3: In bat-font-lock-keywords: Attempt at chained ranges `0-9-_' (pos 39)
  "\\_<\\(call\\|goto\\)\\_>[ \t]+%?\\([A-Za-z0-9-_\\:.]+\\)%?"
   ..............................................^
lisp/progmodes/bug-reference.el:72:3: In bug-reference-bug-regexp: Attempt at chained ranges `a-z-+' (pos 42)
  "\\([Bb]ug ?#?\\|[Pp]atch ?#\\|RFE ?#\\|PR [a-z-+]+/\\)\\([0-9]+\\(?:#[0-9]+\\)?\\)"
   ..............................................^
lisp/textmodes/less-css-mode.el:196:3: In less-css-font-lock-keywords: Attempt at chained ranges `a-z-_' (pos 12)
  "@[a-z_-][a-z-_0-9]*"
   ............^
lisp/textmodes/less-css-mode.el:196:3: In less-css-font-lock-keywords: Attempt at chained ranges `a-z-_' (pos 30)
  "\\(?:[ \t{;]\\|^\\)\\(\\.[a-z_-][a-z-_0-9]*\\)[ \t]*;"
   ....................................^
lisp/vc/vc-cvs.el:1090:27: In call to string-match: Attempt at chained ranges `A-Z-_' (pos 11)
  "[^a-z0-9A-Z-_]"
   ...........^
lisp/vc/vc-svn.el:762:27: In call to string-match: Attempt at chained ranges `A-Z-_' (pos 11)
  "[^a-z0-9A-Z-_]"
   ...........^
lisp/files.el:6319:28: In call to string-match: Attempt at chained ranges `0-9-_' (pos 11)
  "[^A-Za-z0-9-_.~#+]"
   ...........^
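
The kind of check that produced the log above can be approximated with a few lines of Lisp. This is a rough sketch, not the code used in xr or relint: `demo-chained-ranges` is a hypothetical name, and the function assumes it is handed only the text between the brackets, with no leading ^, no character classes such as [:alpha:], and no special handling of a leading - or ].

```elisp
(defun demo-chained-ranges (items)
  "Return a list of chained ranges such as \"A-Z-_\" found in ITEMS.
ITEMS is the text between [ and ] of a character alternative.
Hypothetical helper for illustration; it ignores a leading ^,
character classes and other corner cases."
  (let ((i 0)
        (n (length items))
        (found nil))
    (while (< i n)
      (if (and (< (+ i 2) n)
               (eq (aref items (1+ i)) ?-))
          ;; ITEMS[i..i+2] has the shape X-Y, i.e. a range.
          (progn
            (when (and (< (+ i 4) n)
                       (eq (aref items (+ i 3)) ?-))
              ;; The range is directly followed by "-Z": record X-Y-Z.
              (push (substring items i (+ i 5)) found))
            ;; Continue after Y, so the following "-" is not taken as
            ;; the start of a new range.
            (setq i (+ i 3)))
        ;; A single character: just step over it.
        (setq i (1+ i))))
    (nreverse found)))

;; Bracket contents taken from three of the log entries:
(demo-chained-ranges "a-zA-Z-\\.")      ; ("A-Z-\\")
(demo-chained-ranges "A-Za-z0-9-+*")    ; ("0-9-+")
(demo-chained-ranges "a-z-_0-9")        ; ("a-z-_")
;; A trailing hyphen is not reported:
(demo-chained-ranges "a-z_-")           ; nil
```

Skipping past Y after each range is what keeps [a-z_-] quiet while [a-z-_] is flagged, which matches the two less-css-mode.el strings in the log.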