From mboxrd@z Thu Jan 1 00:00:00 1970
Path: news.gmane.org!.POSTED.blaine.gmane.org!not-for-mail
From: Paul Eggert
Newsgroups: gmane.emacs.devel
Subject: Re: Scan of regexps in Emacs (March 17)
Date: Tue, 2 Apr 2019 00:33:28 -0700
Organization: UCLA Computer Science Department
Message-ID:
References: <5363970c-3207-1bb4-8b30-74a7d12277cc@cs.ucla.edu> <05269D79-B016-4FCB-94B8-068BF7D1C2D2@acm.org> <3974269b-6cad-0744-bd1f-66c067f94192@cs.ucla.edu> <4b1164c4-e302-ce41-07c3-145d31a97b4c@cs.ucla.edu> <21CCFA3D-B391-44E1-9ED5-1D37009F1988@acm.org>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="------------C879513D13D2CCFC7FA51290"
Injection-Info: blaine.gmane.org; posting-host="blaine.gmane.org:195.159.176.226"; logging-data="139516"; mail-complaints-to="usenet@blaine.gmane.org"
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.6.1
Cc: Stefan Monnier , emacs-devel@gnu.org
To: =?UTF-8?Q?Mattias_Engdeg=c3=a5rd?=
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Tue Apr 02 09:42:05 2019
Return-path:
Envelope-to: ged-emacs-devel@m.gmane.org
Original-Received: from lists.gnu.org ([209.51.188.17]) by blaine.gmane.org with esmtps (TLS1.0:RSA_AES_256_CBC_SHA1:256) (Exim 4.89) (envelope-from ) id 1hBE3Y-000a67-NI for ged-emacs-devel@m.gmane.org; Tue, 02 Apr 2019 09:42:04 +0200
Original-Received: from localhost ([127.0.0.1]:49938 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1hBE3X-0004aw-Pq for ged-emacs-devel@m.gmane.org; Tue, 02 Apr 2019 03:42:03 -0400
Original-Received: from eggs.gnu.org ([209.51.188.92]:60199) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1hBDvN-0006E7-IG for emacs-devel@gnu.org; Tue, 02 Apr 2019 03:33:40 -0400
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1hBDvL-0001SF-G1 for emacs-devel@gnu.org; Tue, 02 Apr 2019 03:33:37 -0400
Original-Received: from zimbra.cs.ucla.edu ([131.179.128.68]:43630) by eggs.gnu.org with esmtps (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32) (Exim 4.71) (envelope-from ) id 1hBDvK-0001Hp-SE for emacs-devel@gnu.org; Tue, 02 Apr 2019 03:33:35 -0400
Original-Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 603431613D4; Tue, 2 Apr 2019 00:33:30 -0700 (PDT)
Original-Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id SGIBjphuH8Cj; Tue, 2 Apr 2019 00:33:29 -0700 (PDT)
Original-Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 15C1C1613E2; Tue, 2 Apr 2019 00:33:29 -0700 (PDT)
X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu
Original-Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id g3_-x70mGB57; Tue, 2 Apr 2019 00:33:28 -0700 (PDT)
Original-Received: from [192.168.1.9] (cpe-23-242-74-103.socal.res.rr.com [23.242.74.103]) by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id C3F5E1612C8; Tue, 2 Apr 2019 00:33:28 -0700 (PDT)
In-Reply-To: <21CCFA3D-B391-44E1-9ED5-1D37009F1988@acm.org>
Content-Language: en-US
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 3.x
X-Received-From: 131.179.128.68
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.21
Precedence: list
List-Id: "Emacs development discussions."
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Original-Sender: "Emacs-devel"
Xref: news.gmane.org gmane.emacs.devel:234856
Archived-At:

This is a multi-part message in MIME format.
--------------C879513D13D2CCFC7FA51290
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: quoted-printable

Mattias Engdegård wrote:
> > don't we also need a precise description of exactly how they are interpreted by the engine?

In other parts of Emacs, we are typically OK with specs that don't completely
specify behavior. This gives us more freedom to make changes in the undocumented
behavior later. I think it makes sense to do that here too, for regular
expressions like "[z-a-m]" that most readers would find confusing.

> I'm with Stefan here; `-' should go last. Anything else is a gritty detail.

Stefan already changed the doc in master to say that. The attached patch
tightens up the wording (and still says that "-" should go last).

> Documenting differences from POSIX regexps is useful. Do you prefer having those differences being spread out, or all concentrated into one section?

I don't have a strong preference. I wrote it concentrated originally, and that
form seems to work well.

> These days, a user may be more familiar with the various PCRE dialects than traditional or extended POSIX. Should that be taken into account?

It might be helpful. However, PCRE is further away from Emacs regexps than POSIX
is, and a comparison of PCRE and POSIX regexps is probably best put into a
different section. It's not a section I'd like to write, to be honest; PCRE is
pretty hairy.

> The terminology is a bit confusing. Is 'raw 8-bit byte' included in 'unibyte'? Is \x7f ever a raw 8-bit byte?
> I agree that [å-\xff], say, should be invalid but I've never seen such constructs.

After looking into it I realized that I don't really know the semantics here
(the text I recently added there seems to be wrong, in some cases), and I have
my doubts that anyone else knows the semantics either. The attached patch simply
gets rid of that section, leaving the area undocumented. User beware!

> It already does, and some bugs were found that way. As a special case, it no longer complains about z-a because that is unlikely to be an accident and occurs in some code on purpose.

OK, then we should document z-a as the preferred syntax (best go with the
flow...). Done in the attached patch.

> As an experiment, I added detection of 'chained' ranges like [a-m-z] to xr and found a handful in both Emacs and GNU ELPA, but none of them carried a freeload of bugs. Keeping that check didn't seem worthwhile; the regexps may be a bit odd-looking, but aren't wrong.

It depends on what one means by "wrong". If one wants to use the ranges in both
Emacs and grep they are "wrong", so it's reasonable for the manual to recommend
against them.

> a rule finding [X-Y] where Y=X+1 found one or two questionable cases in a sea of false positives (also in the attachment).

It might also help for the trawler to warn about [X-Z] where Z = X+2. [XYZ] is
clearer and less error-prone than [X-Z]. I shoehorned that into the attached
patch too.

--------------C879513D13D2CCFC7FA51290
Content-Type: text/x-patch; name="0001-More-regexp-advice-and-clarifications.patch"
Content-Disposition: attachment; filename="0001-More-regexp-advice-and-clarifications.patch"
Content-Transfer-Encoding: quoted-printable

>From 076ed98ff6d7debff3929beab048c8a90e48dbb8 Mon Sep 17 00:00:00 2001
From: Paul Eggert
Date: Tue, 2 Apr 2019 00:17:37 -0700
Subject: [PATCH] More regexp advice and clarifications
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* doc/lispref/searching.texi (Regexp Special):
Simplify style advice for order of ], ^, and - in character
alternatives.  Stick with saying that it’s not a good idea to put
‘-’ after a range.  Remove the special case about raw 8-bit bytes
and unibyte characters, as this documentation is confusing and
seems to be incorrect in some cases.  Say that z-a is the
preferred style for reversed ranges, since it’s clearer and is
typically what’s used in practice.
Mention some bad styles: duplicates in character alternatives,
ranges that denote <=3 characters, and ‘-’ as the first character.
---
 doc/lispref/searching.texi | 52 +++++++++++++++++++++++---------------
 1 file changed, 31 insertions(+), 21 deletions(-)

diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi
index 748ab586af..72ee9233a3 100644
--- a/doc/lispref/searching.texi
+++ b/doc/lispref/searching.texi
@@ -398,17 +398,11 @@ Regexp Special
 The usual regexp special characters are not special inside a
 character alternative. A completely different set of characters is
 special inside character alternatives: @samp{]}, @samp{-} and @samp{^}.
-
-To include a @samp{]} in a character alternative, you must make it the first
-character. For example, @samp{[]a]} matches @samp{]} or @samp{a}. To include
-a @samp{-}, write @samp{-} as the last character of the character alternative,
-tho you can also put it first or after a range. Thus, @samp{[]-]} matches both
-@samp{]} and @samp{-}. (As explained below, you cannot use @samp{\]} to
-include a @samp{]} inside a character alternative, since @samp{\} is not
-special there.)
-
-To include @samp{^} in a character alternative, put it anywhere but at
-the beginning.
+To include @samp{]} in a character alternative, put it at the
+beginning. To include @samp{^}, put it anywhere but at the beginning.
+To include @samp{-}, put it at the end. Thus, @samp{[]^-]} matches
+all three of these special characters. You cannot use @samp{\} to
+escape these three characters, since @samp{\} is not special here.
 
 The following aspects of ranges are specific to Emacs, in that POSIX
 allows but does not require this behavior and programs other than
@@ -426,17 +420,33 @@ Regexp Special
 outside the C or POSIX locale.
 
 @item
-As a special case, if either bound of a range is a raw 8-bit byte, the
-other bound should be a unibyte character, and the range matches only
-unibyte characters.
+If the lower bound of a range is greater than its upper bound, the
+range is empty and represents no characters. Thus, @samp{[z-a]}
+always fails to match, and @samp{[^z-a]} matches any character,
+including newline. However, a reversed range should always be from
+the letter @samp{z} to the letter @samp{a} to make it clear that it is
+not a typo; for example, @samp{[+-*/]} should be avoided, because it
+matches only @samp{/} rather than the likely-intended four characters.
+@end enumerate
+
+Some kinds of character alternatives are not the best style even
+though they are standardized by POSIX and are portable. They include:
 
+@enumerate
 @item
-If the lower bound of a range is greater than its upper bound, the
-range is empty and represents no characters. Thus, @samp{[b-a]}
-always fails to match, and @samp{[^b-a]} matches any character,
-including newline. However, the lower bound should be at most one
-greater than the upper bound; for example, @samp{[c-a]} should be
-avoided.
+A character alternative can include duplicates. For example,
+@samp{[XYa-yYb-zX]} is less clear than @samp{[XYa-z]}.
+
+@item
+A range can denote just one, two, or three characters. For example,
+@samp{[(-(]} is less clear than @samp{[(]}, @samp{[*-+]} is less clear
+than @samp{[*+]}, and @samp{[*-,]} is less clear than @samp{[*+,]}.
+
+@item
+A @samp{-} can also appear at the beginning of a character alternative, or
+as the upper bound of a range. For example, although @samp{[-a-z]} is
+valid, @samp{[a-z-]} is better style; and although @samp{[!--/]} is
+valid, @samp{[!-,/-]} is clearer.
 @end enumerate
 
 A character alternative can also specify named character classes
@@ -452,7 +462,7 @@ Regexp Special
 @cindex @samp{^} in regexp
 @samp{[^} begins a @dfn{complemented character alternative}. This
 matches any character except the ones specified. Thus,
-@samp{[^a-z0-9A-Z]} matches all characters @emph{except} letters and
+@samp{[^a-z0-9A-Z]} matches all characters @emph{except} ASCII letters and
 digits.
 
 @samp{^} is not special in a character alternative unless it is the first
-- 
2.17.1

--------------C879513D13D2CCFC7FA51290--