From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!.POSTED.blaine.gmane.org!not-for-mail
From: Paul Eggert <eggert@cs.ucla.edu>
Newsgroups: gmane.emacs.devel
Subject: Re: Scan of regexps in Emacs (March 17)
Date: Wed, 20 Mar 2019 15:01:51 -0700
Organization: UCLA Computer Science Department
Message-ID: <4b1164c4-e302-ce41-07c3-145d31a97b4c@cs.ucla.edu>
References: <C25133A0-1564-4B27-AA4D-DDAD4A2FB03F@acm.org>
	<5363970c-3207-1bb4-8b30-74a7d12277cc@cs.ucla.edu>
	<05269D79-B016-4FCB-94B8-068BF7D1C2D2@acm.org>
	<3974269b-6cad-0744-bd1f-66c067f94192@cs.ucla.edu>
	<jwvbm26s39r.fsf-monnier+emacs@gnu.org>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="------------093F95A9F138DA31103C0149"
Injection-Info: blaine.gmane.org; posting-host="blaine.gmane.org:195.159.176.226";
	logging-data="152980"; mail-complaints-to="usenet@blaine.gmane.org"
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101
	Thunderbird/60.5.3
Cc: =?UTF-8?Q?Mattias_Engdeg=c3=a5rd?= <mattiase@acm.org>, emacs-devel@gnu.org
To: Stefan Monnier <monnier@iro.umontreal.ca>
In-Reply-To: <jwvbm26s39r.fsf-monnier+emacs@gnu.org>
Content-Language: en-US
Xref: news.gmane.org gmane.emacs.devel:234430
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/234430>

This is a multi-part message in MIME format.
--------------093F95A9F138DA31103C0149
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

On 3/19/19 7:20 PM, Stefan Monnier wrote:
> I wonder why the doc doesn't just say that `-` should be the last
> character and not mention the other possibilities which just make the
> rule unnecessarily complex.


'-' can also be the first character in a bracket expression; this is
pretty common and is standard. POSIX also says '-' can be the upper
bound of a range, which is a bit weird (but hey! it's standard).

I went through the documentation and tried to describe this mess better
by installing the attached patch into the emacs-26 branch. The basic
ideas are:

* The doc already says that regular expressions like "*foo" and "+foo"
are problematic (they're confusing, and POSIX says the behavior is
undefined) and should be avoided. REs like "[a-m-z]", "[!-[:alpha:]]"
and "[[:alpha:]-~]" are problematic in the same way and should also be
avoided.

* The doc doesn't clearly say when the Emacs range behavior is an
extension to POSIX; saying this will help people know better when they
can export Emacs regular expressions to other programs.

* The doc is confused (and there's a comment about this) about what
happens when one end of a range is unibyte and the other is multibyte. I
added something saying that if one bound is a raw 8-bit byte then the
other should be a unibyte character (either ASCII, or a raw 8-bit byte).
I don't see any good way to specify the behavior when one bound is a raw
8-bit byte and the other bound is a multibyte character, in such a way
that it's a natural extension of the documented behavior, so the
documentation now recommends against that.

* We might as well go ahead and say that [b-a] matches nothing, as
enough code (ab)uses regexps in that way, and there is value in having a
simple regular expression that always fails to match. However, I think
we should tell users to avoid wilder examples like [~-!] so that the
trawler can catch them as typos.

These new recommendations ("should"s in the attached patch) will give
the trawler license to diagnose questionable REs like "[a-m-z]",
"[!-[:alpha:]]", "[~-!]", and (my favorite) "[\u00FF-\xFF]". There is no
change to actual Emacs behavior.
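
The recommendations above can be read as a small set of checks on a
single bracket expression. What follows is only a rough sketch in
Python of how such checks might look; it is not the trawler's actual
implementation. The function name questionable_ranges and the warning
strings are made up for illustration, and the raw 8-bit byte versus
multibyte check is left out because Python strings have no direct
equivalent of an Emacs raw byte.

```python
import re

# A named class such as [:alpha:] inside a bracket expression.
CLASS = re.compile(r"\[:[a-z]+:\]")


def tokenize(body):
    """Split the inside of a bracket expression into tokens.

    Each token is ("class", "[:alpha:]") or ("char", "a").
    """
    tokens, i = [], 0
    while i < len(body):
        m = CLASS.match(body, i)
        if m:
            tokens.append(("class", m.group()))
            i = m.end()
        else:
            tokens.append(("char", body[i]))
            i += 1
    return tokens


def questionable_ranges(bracket):
    """Return a list of warnings for one bracket expression like "[a-m-z]".

    Hypothetical helper: it only mirrors the "should" rules in the
    message above and does not change or reproduce Emacs matching.
    """
    if not (bracket.startswith("[") and bracket.endswith("]")):
        raise ValueError("expected a complete bracket expression")
    body = bracket[1:-1]
    if body.startswith("^"):
        body = body[1:]          # negation does not affect the checks
    toks = tokenize(body)
    dash = ("char", "-")
    warnings = []
    i = 0
    while i < len(toks):
        # A range is LOWER, '-', UPPER with the '-' strictly in the middle.
        if i + 2 < len(toks) and toks[i + 1] == dash:
            lo, hi = toks[i], toks[i + 2]
            if lo[0] == "class" or hi[0] == "class":
                warnings.append("class as range bound")
            elif ord(lo[1]) > ord(hi[1]) + 1:
                # [b-a] is tolerated as "match nothing"; [~-!] is not.
                warnings.append("reversed range")
            i += 3
            # A '-' right after a range that is not the last character
            # makes the end of one range look like the start of another.
            if i < len(toks) - 1 and toks[i] == dash:
                warnings.append("chained range")
                i += 1
        else:
            i += 1
    return warnings


if __name__ == "__main__":
    for regexp in ["[a-m-z]", "[!-[:alpha:]]", "[~-!]", "[b-a]", "[]-]"]:
        print(regexp, questionable_ranges(regexp))
```

Under these assumptions "[b-a]" and "[]-]" produce no warnings, while
"[a-m-z]", "[!-[:alpha:]]", "[[:alpha:]-~]" and "[~-!]" are each flagged.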


--------------093F95A9F138DA31103C0149
Content-Type: text/x-patch;
 name="0001-Say-which-regexp-ranges-should-be-avoided.patch"
Content-Disposition: attachment;
 filename="0001-Say-which-regexp-ranges-should-be-avoided.patch"
Content-Transfer-Encoding: quoted-printable

>From 981bd72cb5fee582067a691cc0de94c6b6fd1f1d Mon Sep 17 00:00:00 2001
From: Paul Eggert <eggert@cs.ucla.edu>
Date: Wed, 20 Mar 2019 14:43:30 -0700
Subject: [PATCH] Say which regexp ranges should be avoided
MIME-Version: 1.0
Content-Type: text/plain; charset=3DUTF-8
Content-Transfer-Encoding: 8bit

* doc/lispref/searching.texi (Regexp Special): Say that
regular expressions like "[a-m-z]" and "[[:alpha:]-~]" should
be avoided, for the same reason that regular expressions like
"+" and "*" should be avoided: POSIX says their behavior is
undefined, and they are confusing anyway.  Also, explain
better what happens when the bound of a range is a raw 8-bit
byte; the old explanation appears to have been obsolete
anyway.  Finally, say that ranges like "[\u00FF-\xFF]" that
mix non-ASCII characters and raw 8-bit bytes should be
avoided, since it=E2=80=99s not clear what they should mean.
---
 doc/lispref/searching.texi | 54 ++++++++++++++++++++++++--------------
 1 file changed, 35 insertions(+), 19 deletions(-)

diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi
index 7546863dde..0cf527b6ac 100644
--- a/doc/lispref/searching.texi
+++ b/doc/lispref/searching.texi
@@ -391,25 +391,18 @@ Regexp Special
 Thus, @samp{[a-z]} matches any lower-case @acronym{ASCII} letter.
 Ranges may be intermixed freely with individual characters, as in
 @samp{[a-z$%.]}, which matches any lower case @acronym{ASCII} letter
-or @samp{$}, @samp{%} or period.
+or @samp{$}, @samp{%} or period.  However, the ending character of one
+range should not be the starting point of another one; for example,
+@samp{[a-m-z]} should be avoided.
=20
-If @code{case-fold-search} is non-@code{nil}, @samp{[a-z]} also
-matches upper-case letters.  Note that a range like @samp{[a-z]} is
-not affected by the locale's collation sequence, it always represents
-a sequence in @acronym{ASCII} order.
-@c This wasn't obvious to me, since, e.g., the grep manual "Character
-@c Classes and Bracket Expressions" specifically notes the opposite
-@c behavior.  But by experiment Emacs seems unaffected by LC_COLLATE
-@c in this regard.
-
-Note also that the usual regexp special characters are not special inside a
+The usual regexp special characters are not special inside a
 character alternative.  A completely different set of characters is
 special inside character alternatives: @samp{]}, @samp{-} and @samp{^}.
=20
 To include a @samp{]} in a character alternative, you must make it the
 first character.  For example, @samp{[]a]} matches @samp{]} or @samp{a}.
 To include a @samp{-}, write @samp{-} as the first or last character of
-the character alternative, or put it after a range.  Thus, @samp{[]-]}
+the character alternative, or as the upper bound of a range.  Thus, @samp{[]-]}
 matches both @samp{]} and @samp{-}.  (As explained below, you cannot
 use @samp{\]} to include a @samp{]} inside a character alternative,
 since @samp{\} is not special there.)
@@ -417,13 +410,34 @@ Regexp Special
 To include @samp{^} in a character alternative, put it anywhere but at
 the beginning.
=20
-@c What if it starts with a multibyte and ends with a unibyte?
-@c That doesn't seem to match anything...?
-If a range starts with a unibyte character @var{c} and ends with a
-multibyte character @var{c2}, the range is divided into two parts: one
-spans the unibyte characters @samp{@var{c}..?\377}, the other the
-multibyte characters @samp{@var{c1}..@var{c2}}, where @var{c1} is the
-first character of the charset to which @var{c2} belongs.
+The following aspects of ranges are specific to Emacs, in that POSIX
+allows but does not require this behavior and programs other than
+Emacs may behave differently:
+
+@enumerate
+@item
+If @code{case-fold-search} is non-@code{nil}, @samp{[a-z]} also
+matches upper-case letters.
+
+@item
+A range is not affected by the locale's collation sequence: it always
+represents the set of characters with codepoints ranging between those
+of its bounds, so that @samp{[a-z]} matches only ASCII letters, even
+outside the C or POSIX locale.
+
+@item
+As a special case, if either bound of a range is a raw 8-bit byte, the
+other bound should be a unibyte character, and the range matches only
+unibyte characters.
+
+@item
+If the lower bound of a range is greater than its upper bound, the
+range is empty and represents no characters.  Thus, @samp{[b-a]}
+always fails to match, and @samp{[^b-a]} matches any character,
+including newline.  However, the lower bound should be at most one
+greater than the upper bound; for example, @samp{[c-a]} should be
+avoided.
+@end enumerate
=20
 A character alternative can also specify named character classes
 (@pxref{Char Classes}).  This is a POSIX feature.  For example,
@@ -431,6 +445,8 @@ Regexp Special
 Using a character class is equivalent to mentioning each of the
 characters in that class; but the latter is not feasible in practice,
 since some classes include thousands of different characters.
+A character class should not appear as the lower or upper bound
+of a range.
=20
 @item @samp{[^ @dots{} ]}
 @cindex @samp{^} in regexp
--=20
2.20.1


--------------093F95A9F138DA31103C0149--