From mboxrd@z Thu Jan 1 00:00:00 1970
From: Paul Eggert
Newsgroups: gmane.emacs.devel
Subject: Re: Scan of regexps in Emacs (March 17)
Date: Wed, 20 Mar 2019 15:01:51 -0700
Organization: UCLA Computer Science Department
Message-ID: <4b1164c4-e302-ce41-07c3-145d31a97b4c@cs.ucla.edu>
References: <5363970c-3207-1bb4-8b30-74a7d12277cc@cs.ucla.edu>
 <05269D79-B016-4FCB-94B8-068BF7D1C2D2@acm.org>
 <3974269b-6cad-0744-bd1f-66c067f94192@cs.ucla.edu>
Mime-Version: 1.0
Content-Type: multipart/mixed;
 boundary="------------093F95A9F138DA31103C0149"
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.5.3
Cc: =?UTF-8?Q?Mattias_Engdeg=c3=a5rd?=, emacs-devel@gnu.org
To: Stefan Monnier
Content-Language: en-US
Precedence: list
List-Id: "Emacs development discussions."
Xref: news.gmane.org gmane.emacs.devel:234430

This is a multi-part message in MIME format.
--------------093F95A9F138DA31103C0149
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

On 3/19/19 7:20 PM, Stefan Monnier wrote:
> I wonder why the doc doesn't just say that `-` should be the last
> character and not mention the other possibilities which just make the
> rule unnecessarily complex.

'-' can also be the first character in a bracket expression; this is
pretty common and is standard. POSIX also says '-' can be the upper
bound of a range, which is a bit weird (but hey! it's standard).

I went through the documentation and tried to make it describe this
mess better, by installing the attached patch into the emacs-26 branch.
The basic ideas are:

* The doc already says that regular expressions like "*foo" and "+foo"
  are problematic (they're confusing, and POSIX says the behavior is
  undefined) and should be avoided. REs like "[a-m-z]" and
  "[!-[:alpha:]]" and "[[:alpha:]-~]" are problematic in the same way
  and also should be avoided.

* The doc doesn't clearly say when the Emacs range behavior is an
  extension to POSIX; saying this will help people know better when
  they can export Emacs regular expressions to other programs.

* The doc is confused (and there's a comment about this) about what
  happens when one end of a range is unibyte and the other is
  multibyte. I added something saying that if one bound is a raw 8-bit
  byte then the other should be a unibyte character (either ASCII, or a
  raw 8-bit byte). I don't see any good way to specify the behavior
  when one bound is a raw 8-bit byte and the other bound is a multibyte
  character, in such a way that it's a natural extension of the
  documented behavior, so the documentation now recommends against
  that.

* We might as well go ahead and say that [b-a] matches nothing, as
  enough code (ab)uses regexps in that way, and there is value in
  having a simple regular expression that always fails to match.
  However, I expect that we should say that users should avoid wilder
  examples like [~-!] so that the trawler can catch them as typos.

These new recommendations ("should"s in the attached patch) will give
the trawler license to diagnose questionable REs like "[a-m-z]",
"[!-[:alpha:]]", "[~-!]", and (my favorite) "[\u00FF-\xFF]". There is
no change to actual Emacs behavior.

--------------093F95A9F138DA31103C0149
Content-Type: text/x-patch;
 name="0001-Say-which-regexp-ranges-should-be-avoided.patch"
Content-Disposition: attachment;
 filename="0001-Say-which-regexp-ranges-should-be-avoided.patch"
Content-Transfer-Encoding: quoted-printable

>From 981bd72cb5fee582067a691cc0de94c6b6fd1f1d Mon Sep 17 00:00:00 2001
From: Paul Eggert
Date: Wed, 20 Mar 2019 14:43:30 -0700
Subject: [PATCH] Say which regexp ranges should be avoided
MIME-Version: 1.0
Content-Type: text/plain; charset=3DUTF-8
Content-Transfer-Encoding: 8bit

* doc/lispref/searching.texi (Regexp Special): Say that regular
expressions like "[a-m-z]" and "[[:alpha:]-~]" should be avoided, for
the same reason that regular expressions like "+" and "*" should be
avoided: POSIX says their behavior is undefined, and they are
confusing anyway. Also, explain better what happens when the bound
of a range is a raw 8-bit byte; the old explanation appears to have
been obsolete anyway. Finally, say that ranges like "[\u00FF-\xFF]"
that mix non-ASCII characters and raw 8-bit bytes should be avoided,
since it=E2=80=99s not clear what they should mean.
---
 doc/lispref/searching.texi | 54 ++++++++++++++++++++++++--------------
 1 file changed, 35 insertions(+), 19 deletions(-)

diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi
index 7546863dde..0cf527b6ac 100644
--- a/doc/lispref/searching.texi
+++ b/doc/lispref/searching.texi
@@ -391,25 +391,18 @@ Regexp Special
 Thus, @samp{[a-z]} matches any lower-case @acronym{ASCII} letter.
 Ranges may be intermixed freely with individual characters, as in
 @samp{[a-z$%.]}, which matches any lower case @acronym{ASCII} letter
-or @samp{$}, @samp{%} or period.
+or @samp{$}, @samp{%} or period. However, the ending character of one
+range should not be the starting point of another one; for example,
+@samp{[a-m-z]} should be avoided.
=20
-If @code{case-fold-search} is non-@code{nil}, @samp{[a-z]} also
-matches upper-case letters. Note that a range like @samp{[a-z]} is
-not affected by the locale's collation sequence, it always represents
-a sequence in @acronym{ASCII} order.
-@c This wasn't obvious to me, since, e.g., the grep manual "Character
-@c Classes and Bracket Expressions" specifically notes the opposite
-@c behavior. But by experiment Emacs seems unaffected by LC_COLLATE
-@c in this regard.
-
-Note also that the usual regexp special characters are not special insid=
e a
+The usual regexp special characters are not special inside a
 character alternative. A completely different set of characters is
 special inside character alternatives: @samp{]}, @samp{-} and @samp{^}.
=20
 To include a @samp{]} in a character alternative, you must make it the
 first character. For example, @samp{[]a]} matches @samp{]} or @samp{a}.
 To include a @samp{-}, write @samp{-} as the first or last character of
-the character alternative, or put it after a range. Thus, @samp{[]-]}
+the character alternative, or as the upper bound of a range. Thus, @sam=
p{[]-]}
 matches both @samp{]} and @samp{-}. (As explained below, you cannot
 use @samp{\]} to include a @samp{]} inside a character alternative,
 since @samp{\} is not special there.)
@@ -417,13 +410,34 @@ Regexp Special
 To include @samp{^} in a character alternative, put it anywhere but at
 the beginning.
=20
-@c What if it starts with a multibyte and ends with a unibyte?
-@c That doesn't seem to match anything...?
-If a range starts with a unibyte character @var{c} and ends with a
-multibyte character @var{c2}, the range is divided into two parts: one
-spans the unibyte characters @samp{@var{c}..?\377}, the other the
-multibyte characters @samp{@var{c1}..@var{c2}}, where @var{c1} is the
-first character of the charset to which @var{c2} belongs.
+The following aspects of ranges are specific to Emacs, in that POSIX
+allows but does not require this behavior and programs other than
+Emacs may behave differently:
+
+@enumerate
+@item
+If @code{case-fold-search} is non-@code{nil}, @samp{[a-z]} also
+matches upper-case letters.
+
+@item
+A range is not affected by the locale's collation sequence: it always
+represents the set of characters with codepoints ranging between those
+of its bounds, so that @samp{[a-z]} matches only ASCII letters, even
+outside the C or POSIX locale.
+
+@item
+As a special case, if either bound of a range is a raw 8-bit byte, the
+other bound should be a unibyte character, and the range matches only
+unibyte characters.
+
+@item
+If the lower bound of a range is greater than its upper bound, the
+range is empty and represents no characters. Thus, @samp{[b-a]}
+always fails to match, and @samp{[^b-a]} matches any character,
+including newline. However, the lower bound should be at most one
+greater than the upper bound; for example, @samp{[c-a]} should be
+avoided.
+@end enumerate
=20
 A character alternative can also specify named character classes
 (@pxref{Char Classes}). This is a POSIX feature. For example,
@@ -431,6 +445,8 @@ Regexp Special
 Using a character class is equivalent to mentioning each of the
 characters in that class; but the latter is not feasible in practice,
 since some classes include thousands of different characters.
+A character class should not appear as the lower or upper bound
+of a range.
=20
 @item @samp{[^ @dots{} ]}
 @cindex @samp{^} in regexp
--=20
2.20.1

--------------093F95A9F138DA31103C0149--
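
The message notes that knowing which range behavior is Emacs-specific helps when exporting regular expressions to other programs. As a rough illustration only, the sketch below probes a few of the bracket expressions discussed above with Python's `re` module, used here as one example of a program other than Emacs. Python is not a POSIX regexp implementation, and the helper name `probe` and the `SAMPLES` list are hypothetical, introduced only for this sketch. It says nothing about what Emacs itself does with these patterns beyond what the patch documents.

```python
import re


def probe(pattern, samples):
    """Return the samples fully matched by PATTERN under Python's re,
    or the string "rejected" if re refuses to compile the pattern.

    Hypothetical helper for illustration; not part of Emacs or the patch.
    """
    try:
        compiled = re.compile(pattern)
    except re.error:
        return "rejected"
    return [s for s in samples if compiled.fullmatch(s)]


# Single-character test strings (an arbitrary choice for this sketch).
SAMPLES = ["a", "m", "n", "z", "-", "]"]

if __name__ == "__main__":
    # '-' as the first or last character of a bracket expression.
    print(probe(r"[-a]", SAMPLES))
    print(probe(r"[a-]", SAMPLES))
    # The patch's own example: ']' first, '-' last.
    print(probe(r"[]-]", SAMPLES))
    # The end of one range starting another: the patch says to avoid this.
    print(probe(r"[a-m-z]", SAMPLES))
    # Reversed range: documented by the patch as empty in Emacs.
    print(probe(r"[b-a]", SAMPLES))
```

Under Python's `re`, the first three patterns treat `-` as a literal, `[a-m-z]` is read as the range a-m plus the literals `-` and `z`, and `[b-a]` fails to compile. That last result differs from the empty-range behavior the patch documents for Emacs, which is the kind of discrepancy the "should be avoided" recommendations are meant to flag.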