From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!.POSTED.blaine.gmane.org!not-for-mail
From: Paul Eggert <eggert@cs.ucla.edu>
Newsgroups: gmane.emacs.devel
Subject: Re: Scan of regexps in Emacs (March 17)
Date: Tue, 2 Apr 2019 00:33:28 -0700
Organization: UCLA Computer Science Department
Message-ID: <f0edb8ac-9a9a-6cd6-3594-ea12cdbcd03b@cs.ucla.edu>
References: <C25133A0-1564-4B27-AA4D-DDAD4A2FB03F@acm.org>
	<5363970c-3207-1bb4-8b30-74a7d12277cc@cs.ucla.edu>
	<05269D79-B016-4FCB-94B8-068BF7D1C2D2@acm.org>
	<3974269b-6cad-0744-bd1f-66c067f94192@cs.ucla.edu>
	<jwvbm26s39r.fsf-monnier+emacs@gnu.org>
	<4b1164c4-e302-ce41-07c3-145d31a97b4c@cs.ucla.edu>
	<21CCFA3D-B391-44E1-9ED5-1D37009F1988@acm.org>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="------------C879513D13D2CCFC7FA51290"
Injection-Info: blaine.gmane.org; posting-host="blaine.gmane.org:195.159.176.226";
	logging-data="139516"; mail-complaints-to="usenet@blaine.gmane.org"
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101
	Thunderbird/60.6.1
Cc: Stefan Monnier <monnier@iro.umontreal.ca>, emacs-devel@gnu.org
To: =?UTF-8?Q?Mattias_Engdeg=c3=a5rd?= <mattiase@acm.org>
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Tue Apr 02 09:42:05 2019
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Envelope-to: ged-emacs-devel@m.gmane.org
Original-Received: from lists.gnu.org ([209.51.188.17])
	by blaine.gmane.org with esmtps (TLS1.0:RSA_AES_256_CBC_SHA1:256)
	(Exim 4.89)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1hBE3Y-000a67-NI
	for ged-emacs-devel@m.gmane.org; Tue, 02 Apr 2019 09:42:04 +0200
Original-Received: from localhost ([127.0.0.1]:49938 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1hBE3X-0004aw-Pq
	for ged-emacs-devel@m.gmane.org; Tue, 02 Apr 2019 03:42:03 -0400
Original-Received: from eggs.gnu.org ([209.51.188.92]:60199)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <eggert@cs.ucla.edu>) id 1hBDvN-0006E7-IG
	for emacs-devel@gnu.org; Tue, 02 Apr 2019 03:33:40 -0400
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <eggert@cs.ucla.edu>) id 1hBDvL-0001SF-G1
	for emacs-devel@gnu.org; Tue, 02 Apr 2019 03:33:37 -0400
Original-Received: from zimbra.cs.ucla.edu ([131.179.128.68]:43630)
	by eggs.gnu.org with esmtps (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32)
	(Exim 4.71) (envelope-from <eggert@cs.ucla.edu>) id 1hBDvK-0001Hp-SE
	for emacs-devel@gnu.org; Tue, 02 Apr 2019 03:33:35 -0400
Original-Received: from localhost (localhost [127.0.0.1])
	by zimbra.cs.ucla.edu (Postfix) with ESMTP id 603431613D4;
	Tue,  2 Apr 2019 00:33:30 -0700 (PDT)
Original-Received: from zimbra.cs.ucla.edu ([127.0.0.1])
	by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032)
	with ESMTP id SGIBjphuH8Cj; Tue,  2 Apr 2019 00:33:29 -0700 (PDT)
Original-Received: from localhost (localhost [127.0.0.1])
	by zimbra.cs.ucla.edu (Postfix) with ESMTP id 15C1C1613E2;
	Tue,  2 Apr 2019 00:33:29 -0700 (PDT)
X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu
Original-Received: from zimbra.cs.ucla.edu ([127.0.0.1])
	by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026)
	with ESMTP id g3_-x70mGB57; Tue,  2 Apr 2019 00:33:28 -0700 (PDT)
Original-Received: from [192.168.1.9] (cpe-23-242-74-103.socal.res.rr.com
	[23.242.74.103])
	by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id C3F5E1612C8;
	Tue,  2 Apr 2019 00:33:28 -0700 (PDT)
In-Reply-To: <21CCFA3D-B391-44E1-9ED5-1D37009F1988@acm.org>
Content-Language: en-US
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 3.x
X-Received-From: 131.179.128.68
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.21
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/emacs-devel/>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Original-Sender: "Emacs-devel" <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Xref: news.gmane.org gmane.emacs.devel:234856
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/234856>

This is a multi-part message in MIME format.
--------------C879513D13D2CCFC7FA51290
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: quoted-printable

Mattias Engdeg=C3=A5rd wrote:
 >
> don't we also need a precise description of exactly how they are interp=
reted by the engine?

In other parts of Emacs, we are typically OK with specs that don't comple=
tely=20
specify behavior. This gives us more freedom to make changes in the undoc=
umented=20
behavior later. I think it makes sense to do that here too, for regular=20
expressions like "[z-a-m]" that most readers would find confusing.

> I'm with Stefan here; `-' should go last. Anything else is a gritty det=
ail.

Stefan already changed the doc in master to say that. The attached patch=20
tightens up the wording (and still says that "-" should go last).

> Documenting differences from POSIX regexps is useful. Do you prefer hav=
ing those differences being spread out, or all concentrated into one sect=
ion?

I don't have a strong preference. I wrote it concentrated originally, and=
 that=20
form seems to work well.

> These days, a user may be more familiar with the various PCRE dialects =
than traditional or extended POSIX. Should that be taken into account?

It might be helpful. However, PCRE is further away from Emacs regexps tha=
n POSIX=20
is, and a comparison of PCRE and POSIX regexps is probably best put into =
a=20
different section. It's not a section I'd like to write, to be honest; PC=
RE is=20
pretty hairy.

> The terminology is a bit confusing. Is 'raw 8-bit byte' included in 'un=
ibyte'? Is \x7f ever a raw 8-bit byte?
> I agree that [=C3=A5-\xff], say, should be invalid but I've never seen =
such constructs.

After looking into it I realized that I don't really know the semantics h=
ere=20
(the text I recently added there seems to be wrong, in some cases), and I=
 have=20
my doubts that anyone else knows the semantics either. The attached patch=
 simply=20
gets rid of that section, leaving the area undocumented. User beware!

> It already does, and some bugs were found that way. As a special case, =
it no longer complains about z-a because that is unlikely to be an accide=
nt and occurs in some code on purpose.

OK, then we should document z-a as the preferred syntax (best go with the=
=20
flow...). Done in the attached patch.
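
For illustration, the z-a exemption could be sketched roughly as below.
This is a minimal Python sketch, not the scanner's actual code: the
function names are hypothetical, and it assumes the input is the text
between the brackets and contains no named classes such as [:alpha:].

```python
def ranges(body):
    """Yield (lo, hi) for each range in the text between '[' and ']'."""
    # Skip a complementing '^' if present (simplifying assumption).
    i = 1 if body.startswith("^") else 0
    while i < len(body):
        # 'X-Y' is a range only when a character follows the '-'.
        if i + 2 < len(body) and body[i + 1] == "-":
            yield body[i], body[i + 2]
            i += 3
        else:
            i += 1


def suspicious_reversed(body):
    """Return reversed ranges, except the conventional z-a."""
    return [lo + "-" + hi
            for lo, hi in ranges(body)
            if lo > hi and (lo, hi) != ("z", "a")]


print(suspicious_reversed("+-*/"))  # ['+-*']
print(suspicious_reversed("z-a"))   # []
```

The first call flags "+-*" because "+" sorts after "*", which is the
likely-typo case; z-a is let through as a deliberate empty range.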

> As an experiment, I added detection of 'chained' ranges like [a-m-z] to=
 xr and found a handful in both Emacs and GNU ELPA, but none of them carr=
ied a freeload of bugs. Keeping that check didn't seem worthwhile; the re=
gexps may be a bit odd-looking, but aren't wrong.

It depends on what one means by "wrong". If one wants to use the ranges i=
n both=20
Emacs and grep they are "wrong", so it's reasonable for the manual to rec=
ommend=20
against them.

> a rule finding [X-Y] where Y=3DX+1 found one or two questionable cases =
in a sea of false positives (also in the attachment).

It might also help for the trawler to warn about [X-Z] where Z =3D X+2. [=
XYZ] is=20
clearer and less error-prone than [X-Z]. I shoehorned that into the attac=
hed=20
patch too.
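
A check along these lines might look like the following. It is only a
rough Python sketch under stated assumptions: small_ranges and its
limit parameter are hypothetical names, the input is the text between
the brackets, and named classes such as [:alpha:] are not handled.

```python
def small_ranges(body, limit=3):
    """Find ranges spanning at most `limit` characters.

    Returns (range, spelled_out) pairs, where spelled_out lists the
    same characters explicitly.
    """
    found = []
    # Skip a complementing '^' if present (simplifying assumption).
    i = 1 if body.startswith("^") else 0
    while i < len(body):
        # 'X-Y' is a range only when a character follows the '-'.
        if i + 2 < len(body) and body[i + 1] == "-":
            lo, hi = ord(body[i]), ord(body[i + 2])
            size = hi - lo + 1
            if 1 <= size <= limit:
                chars = "".join(chr(c) for c in range(lo, hi + 1))
                found.append((body[i:i + 3], chars))
            i += 3
        else:
            i += 1
    return found


print(small_ranges("*-,"))  # [('*-,', '*+,')]
print(small_ranges("a-z"))  # []
```

Reversed ranges have a non-positive size here and are deliberately
left to a separate check.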

--------------C879513D13D2CCFC7FA51290
Content-Type: text/x-patch;
 name="0001-More-regexp-advice-and-clarifications.patch"
Content-Disposition: attachment;
 filename="0001-More-regexp-advice-and-clarifications.patch"
Content-Transfer-Encoding: quoted-printable

>From 076ed98ff6d7debff3929beab048c8a90e48dbb8 Mon Sep 17 00:00:00 2001
From: Paul Eggert <eggert@cs.ucla.edu>
Date: Tue, 2 Apr 2019 00:17:37 -0700
Subject: [PATCH] More regexp advice and clarifications
MIME-Version: 1.0
Content-Type: text/plain; charset=3DUTF-8
Content-Transfer-Encoding: 8bit

* doc/lispref/searching.texi (Regexp Special): Simplify style
advice for order of ], ^, and - in character alternatives.
Stick with saying that it=E2=80=99s not a good idea to put =E2=80=98-=E2=80=
=99 after a
range.  Remove the special case about raw 8-bit bytes and
unibyte characters, as this documentation is confusing and
seems to be incorrect in some cases.  Say that z-a is the
preferred style for reversed ranges, since it=E2=80=99s clearer and is
typically what=E2=80=99s used in practice.  Mention some bad styles:
duplicates in character alternatives, ranges that denote <=3D3
characters, and =E2=80=98-=E2=80=99 as the first character.
---
 doc/lispref/searching.texi | 52 +++++++++++++++++++++++---------------
 1 file changed, 31 insertions(+), 21 deletions(-)

diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi
index 748ab586af..72ee9233a3 100644
--- a/doc/lispref/searching.texi
+++ b/doc/lispref/searching.texi
@@ -398,17 +398,11 @@ Regexp Special
 The usual regexp special characters are not special inside a
 character alternative.  A completely different set of characters is
 special inside character alternatives: @samp{]}, @samp{-} and @samp{^}.
-
-To include a @samp{]} in a character alternative, you must make it the f=
irst
-character.  For example, @samp{[]a]} matches @samp{]} or @samp{a}.  To i=
nclude
-a @samp{-}, write @samp{-} as the last character of the character altern=
ative,
-tho you can also put it first or after a range.  Thus, @samp{[]-]} match=
es both
-@samp{]} and @samp{-}.  (As explained below, you cannot use @samp{\]} to
-include a @samp{]} inside a character alternative, since @samp{\} is not
-special there.)
-
-To include @samp{^} in a character alternative, put it anywhere but at
-the beginning.
+To include @samp{]} in a character alternative, put it at the
+beginning.  To include @samp{^}, put it anywhere but at the beginning.
+To include @samp{-}, put it at the end.  Thus, @samp{[]^-]} matches
+all three of these special characters.  You cannot use @samp{\} to
+escape these three characters, since @samp{\} is not special here.
=20
 The following aspects of ranges are specific to Emacs, in that POSIX
 allows but does not require this behavior and programs other than
@@ -426,17 +420,33 @@ Regexp Special
 outside the C or POSIX locale.
=20
 @item
-As a special case, if either bound of a range is a raw 8-bit byte, the
-other bound should be a unibyte character, and the range matches only
-unibyte characters.
+If the lower bound of a range is greater than its upper bound, the
+range is empty and represents no characters.  Thus, @samp{[z-a]}
+always fails to match, and @samp{[^z-a]} matches any character,
+including newline.  However, a reversed range should always be from
+the letter @samp{z} to the letter @samp{a} to make it clear that it is
+not a typo; for example, @samp{[+-*/]} should be avoided, because it
+matches only @samp{/} rather than the likely-intended four characters.
+@end enumerate
+
+Some kinds of character alternatives are not the best style even
+though they are standardized by POSIX and are portable.  They include:
=20
+@enumerate
 @item
-If the lower bound of a range is greater than its upper bound, the
-range is empty and represents no characters.  Thus, @samp{[b-a]}
-always fails to match, and @samp{[^b-a]} matches any character,
-including newline.  However, the lower bound should be at most one
-greater than the upper bound; for example, @samp{[c-a]} should be
-avoided.
+A character alternative can include duplicates.  For example,
+@samp{[XYa-yYb-zX]} is less clear than @samp{[XYa-z]}.
+
+@item
+A range can denote just one, two, or three characters.  For example,
+@samp{[(-(]} is less clear than @samp{[(]}, @samp{[*-+]} is less clear
+than @samp{[*+]}, and @samp{[*-,]} is less clear than @samp{[*+,]}.
+
+@item
+A @samp{-} can also appear at the beginning of a character alternative, or
+as the upper bound of a range.  For example, although @samp{[-a-z]} is
+valid, @samp{[a-z-]} is better style; and although @samp{[!--/]} is
+valid, @samp{[!-,/-]} is clearer.
 @end enumerate
=20
 A character alternative can also specify named character classes
@@ -452,7 +462,7 @@ Regexp Special
 @cindex @samp{^} in regexp
 @samp{[^} begins a @dfn{complemented character alternative}.  This
 matches any character except the ones specified.  Thus,
-@samp{[^a-z0-9A-Z]} matches all characters @emph{except} letters and
+@samp{[^a-z0-9A-Z]} matches all characters @emph{except} ASCII letters a=
nd
 digits.
=20
 @samp{^} is not special in a character alternative unless it is the firs=
t
--=20
2.17.1


--------------C879513D13D2CCFC7FA51290--