From mboxrd@z Thu Jan 1 00:00:00 1970
From: Luc Teirlinck
Newsgroups: gmane.emacs.devel
Subject: Re: Unquoted special characters in regexps
Date: Mon, 6 Mar 2006 23:52:44 -0600 (CST)
Message-ID: <200603070552.k275qiG12547@raven.dms.auburn.edu>
References: <4400AD8E.5050001@gmx.at> <4400BBB1.2050800@gmx.at>
 <200602252213.k1PMDBP24413@raven.dms.auburn.edu>
 <4401A98D.3070809@gmx.at> <4401E0F2.7030800@gmx.at>
 <4401FCBA.1070206@gmx.at>
 <200602280059.k1S0xYD07415@raven.dms.auburn.edu>
Original-To: rms@gnu.org
Cc: rudalics@gmx.at, schwab@suse.de, emacs-devel@gnu.org
In-reply-to: (message from Richard Stallman on Mon, 06 Mar 2006 07:52:07 -0500)
Xref: news.gmane.org gmane.emacs.devel:51315

Richard Stallman wrote:

   I think the manual needs to explain both levels--the first level so
   beginners can begin to understand, and the second level for precise
   thinking about counterintuitive regexps.

   I could certainly do that, but I am terribly overloaded.  Would
   someone else like to try it?

What about the following patch, which I can install if desired?

It includes one unrelated change dealing with a problem I noticed in
the process.  It moves a paragraph currently in the description of `*'
to the description of `+'.  (From diff's perspective, it instead moves
the definition of `+' up to just before that paragraph.  Everything is
relative, I guess.)  The reason is that the paragraph discusses the
regexp "\(x+y*\)*a" before the meaning of `+' is explained.  This
makes `x+y' look like it is the sum of x and y.  Also, the remarks in
the paragraph apply to both `*' and `+'.

===File ~/searching.texi-diff===============================
*** searching.texi 06 Feb 2006 16:02:08 -0600 1.68
--- searching.texi 06 Mar 2006 23:47:42 -0600
***************
*** 235,246 ****
  
  Regular expressions have a syntax in which a few characters are
  special constructs and the rest are @dfn{ordinary}. An ordinary
! character is a simple regular expression that matches that character and
! nothing else. The special characters are @samp{.}, @samp{*}, @samp{+},
! @samp{?}, @samp{[}, @samp{]}, @samp{^}, @samp{$}, and @samp{\}; no new
! special characters will be defined in the future. Any other character
! appearing in a regular expression is ordinary, unless a @samp{\}
! precedes it.
  
  For example, @samp{f} is not a special character, so it is ordinary, and
  therefore @samp{f} is a regular expression that matches the string
--- 235,249 ----
  
  Regular expressions have a syntax in which a few characters are
  special constructs and the rest are @dfn{ordinary}. An ordinary
! character is a simple regular expression that matches that character
! and nothing else. The special characters are @samp{.}, @samp{*},
! @samp{+}, @samp{?}, @samp{[}, @samp{^}, @samp{$}, and @samp{\}; no new
! special characters will be defined in the future. The character
! @samp{]} is special if it ends a character alternative (see later).
! The character @samp{-} is special inside a character alternative. A
! @samp{[:} and balancing @samp{:]} enclose a character class inside a
! character alternative. Any other character appearing in a regular
! expression is ordinary, unless a @samp{\} precedes it.
  
  For example, @samp{f} is not a special character, so it is ordinary, and
  therefore @samp{f} is a regular expression that matches the string
***************
*** 301,306 ****
--- 304,316 ----
  The next alternative is for @samp{a*} to match only two @samp{a}s.
  With this choice, the rest of the regexp matches successfully.@refill
  
+ @item @samp{+}
+ @cindex @samp{+} in regexp
+ is a postfix operator, similar to @samp{*} except that it must match
+ the preceding expression at least once. So, for example, @samp{ca+r}
+ matches the strings @samp{car} and @samp{caaaar} but not the string
+ @samp{cr}, whereas @samp{ca*r} matches all three strings.
+ 
  Nested repetition operators take a long time, or even forever, if they
  lead to ambiguous matching. For example, trying to match the regular
  expression @samp{\(x+y*\)*a} against the string
***************
*** 311,323 ****
  it causes an infinite loop. To avoid these problems, check nested
  repetitions carefully.
  
- @item @samp{+}
- @cindex @samp{+} in regexp
- is a postfix operator, similar to @samp{*} except that it must match
- the preceding expression at least once. So, for example, @samp{ca+r}
- matches the strings @samp{car} and @samp{caaaar} but not the string
- @samp{cr}, whereas @samp{ca*r} matches all three strings.
- 
  @item @samp{?}
  @cindex @samp{?} in regexp
  is a postfix operator, similar to @samp{*} except that it must match the
--- 321,326 ----
***************
*** 468,473 ****
--- 471,504 ----
  can act. It is poor practice to depend on this behavior; quote the
  special character anyway, regardless of where it appears.@refill
  
+ As a @samp{\} is not special inside a character alternative, it can
+ never remove the special meaning of @samp{-} or @samp{]}. So you
+ should not quote these characters when they have no special meaning
+ either. This would not clarify anything, since backslashes can
+ legitimately precede these characters where they @emph{have} special
+ meaning, as in @code{[^\]} (@code{"[^\\]"} for Lisp string syntax),
+ which matches any single character except a backslash.
+ 
+ In practice, most @samp{]} that occur in regular expressions close a
+ character alternative and hence are special. However, occasionally a
+ regular expression may try to match a complex pattern of literal
+ @samp{[} and @samp{]}. In such situations, it sometimes may be
+ necessary to carefully parse the regexp from the start to determine
+ which square brackets enclose a character alternative. For example,
+ @code{[^][]]}, consists of the complemented character alternative
+ @code{[^][]}, which matches any single character that is not a square
+ bracket, followed by a literal @samp{]}.
+ 
+ The exact rules are that at the beginning of a regexp, @samp{[} is
+ special and @samp{]} not. This lasts until the first unquoted
+ @samp{[}, after which we are in a character alternative; @samp{[} is
+ no longer special (except if it starts a character class) but @samp{]}
+ is special, unless it immediately follows the special @samp{[} or that
+ @samp{[} followed by a @samp{^}. This lasts until the next special
+ @samp{]} that does not end a character class. This ends the character
+ alternative and restores the ordinary syntax of regular expressions;
+ an unquoted @samp{[} is special again and a @samp{]} not.
+ 
  @node Char Classes
  @subsubsection Character Classes
  @cindex character classes in regexp
***************
*** 740,747 ****
  
  @kindex invalid-regexp
  Not every string is a valid regular expression. For example, a string
! with unbalanced square brackets is invalid (with a few exceptions, such
! as @samp{[]]}), and so is a string that ends with a single @samp{\}. If
  an invalid regular expression is passed to any of the search functions,
  an @code{invalid-regexp} error is signaled.
  
--- 771,778 ----
  
  @kindex invalid-regexp
  Not every string is a valid regular expression. For example, a string
! that ends inside a character alternative without terminating @samp{]}
! is invalid, and so is a string that ends with a single @samp{\}. If
  an invalid regular expression is passed to any of the search functions,
  an @code{invalid-regexp} error is signaled.
  
============================================================
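
The "exact rules" paragraph in the patch can be read as a small state
machine.  What follows is only a sketch of that reading in Python, not
the Emacs regexp compiler: the function name classify_brackets and the
role labels are hypothetical, and it assumes that a character class
always has the shape [:name:] with a lowercase name.  It does not model
the other way a regexp can be invalid (ending with a single backslash).

```python
import re

# Assumption: a character class is "[:" + lowercase name + ":]".
_CLASS = re.compile(r"\[:[a-z]+:\]")


def classify_brackets(regexp):
    """Label each unquoted '[' and ']' following the rules in the patch.

    Returns (index, char, role) tuples; the role names are illustrative.
    """
    roles = []
    i, n = 0, len(regexp)
    while i < n:
        ch = regexp[i]
        if ch == "\\":
            i += 2  # outside an alternative, '\' quotes the next character
            continue
        if ch == "]":
            roles.append((i, "]", "literal"))  # ']' is ordinary out here
            i += 1
            continue
        if ch != "[":
            i += 1
            continue
        # An unquoted '[' opens a character alternative.
        roles.append((i, "[", "open"))
        i += 1
        if i < n and regexp[i] == "^":
            i += 1
        if i < n and regexp[i] == "]":
            # ']' right after '[' or '[^' does not close the alternative.
            roles.append((i, "]", "literal"))
            i += 1
        while True:
            if i >= n:
                raise ValueError("regexp ends inside a character alternative")
            ch = regexp[i]
            if ch == "]":
                roles.append((i, "]", "close"))
                i += 1
                break
            m = _CLASS.match(regexp, i) if ch == "[" else None
            if m:
                # The ']' that ends a class does not end the alternative.
                roles.append((i, "[", "class-open"))
                roles.append((m.end() - 1, "]", "class-close"))
                i = m.end()
            else:
                if ch == "[":
                    roles.append((i, "[", "literal"))
                i += 1  # note: '\' is not special inside an alternative
    return roles
```

For the example from the patch, classify_brackets("[^][]]") labels the
brackets at indices 0, 2, 3, 4 and 5 as open, literal, literal, close
and literal respectively, which agrees with the description of
[^][] followed by a literal `]'.  A string such as "[a" raises
ValueError, loosely mirroring the invalid-regexp case in the last hunk.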