From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Luc Teirlinck <teirllm@dms.auburn.edu>
Newsgroups: gmane.emacs.devel
Subject: Re: Unquoted special characters in regexps
Date: Mon, 6 Mar 2006 23:52:44 -0600 (CST)
Message-ID: <200603070552.k275qiG12547@raven.dms.auburn.edu>
References: <wkek1rz72f.fsf@gmx.at>
	<jeek1rz3e8.fsf@sykes.suse.de>	<4400AD8E.5050001@gmx.at>
	<jeaccfz142.fsf@sykes.suse.de>	<4400BBB1.2050800@gmx.at>	<200602252213.k1PMDBP24413@raven.dms.auburn.edu>	<4401A98D.3070809@gmx.at>
	<jefym6ut3l.fsf@sykes.suse.de>	<4401E0F2.7030800@gmx.at>
	<je7j7ic8cf.fsf@sykes.suse.de> <4401FCBA.1070206@gmx.at>
	<E1FDneo-00050N-Nn@fencepost.gnu.org>
	<200602280059.k1S0xYD07415@raven.dms.auburn.edu>
	<E1FGFC3-00079F-G2@fencepost.gnu.org>
NNTP-Posting-Host: main.gmane.org
X-Trace: sea.gmane.org 1141744020 14087 80.91.229.2 (7 Mar 2006 15:07:00 GMT)
X-Complaints-To: usenet@sea.gmane.org
NNTP-Posting-Date: Tue, 7 Mar 2006 15:07:00 +0000 (UTC)
Cc: rudalics@gmx.at, schwab@suse.de, emacs-devel@gnu.org
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Tue Mar 07 16:06:56 2006
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Envelope-to: ged-emacs-devel@m.gmane.org
Original-Received: from lists.gnu.org ([199.232.76.165])
	by ciao.gmane.org with esmtp (Exim 4.43)
	id 1FGdlZ-0006Bm-JJ
	for ged-emacs-devel@m.gmane.org; Tue, 07 Mar 2006 16:06:26 +0100
Original-Received: from localhost ([127.0.0.1] helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43)
	id 1FGdlY-0004iG-V8
	for ged-emacs-devel@m.gmane.org; Tue, 07 Mar 2006 10:06:25 -0500
Original-Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43)
	id 1FGVlC-0007Aa-Ea
	for emacs-devel@gnu.org; Tue, 07 Mar 2006 01:33:30 -0500
Original-Received: from exim by lists.gnu.org with spam-scanned (Exim 4.43)
	id 1FGVHI-0001kG-IN
	for emacs-devel@gnu.org; Tue, 07 Mar 2006 01:03:00 -0500
Original-Received: from [199.232.76.173] (helo=monty-python.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1FGVHI-0001gZ-AZ
	for emacs-devel@gnu.org; Tue, 07 Mar 2006 01:02:36 -0500
Original-Received: from [131.204.53.104] (helo=manatee.dms.auburn.edu)
	by monty-python.gnu.org with esmtp (Exim 4.52)
	id 1FGVFH-0003gQ-7l; Tue, 07 Mar 2006 01:00:31 -0500
Original-Received: from raven.dms.auburn.edu (raven.dms.auburn.edu [131.204.53.29])
	by manatee.dms.auburn.edu (8.13.3+Sun/8.13.3) with ESMTP id
	k275vgS8008077; Mon, 6 Mar 2006 23:57:42 -0600 (CST)
Original-Received: (from teirllm@localhost)
	by raven.dms.auburn.edu (8.11.7p1+Sun/8.11.7) id k275qiG12547;
	Mon, 6 Mar 2006 23:52:44 -0600 (CST)
X-Authentication-Warning: raven.dms.auburn.edu: teirllm set sender to
	teirllm@dms.auburn.edu using -f
Original-To: rms@gnu.org
In-reply-to: <E1FGFC3-00079F-G2@fencepost.gnu.org> (message from Richard
	Stallman on Mon, 06 Mar 2006 07:52:07 -0500)
X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-2.0.1
	(manatee.dms.auburn.edu [131.204.53.104]);
	Mon, 06 Mar 2006 23:57:43 -0600 (CST)
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <http://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/pipermail/emacs-devel>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <http://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.emacs.devel:51315
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/51315>

Richard Stallman wrote:

   I think the manual needs to explain both levels--the first level so
   beginners can begin to understand, and the second level for precise
   thinking about counterintuitive regexps.

   I could certainly do that, but I am terribly overloaded.  Would
   someone else like to try it?

What about the following patch, which I can install if desired?

It includes one unrelated change dealing with a problem I noticed in
the process.  It moves a paragraph that currently occurs in the
description of `*' to the description of `+'.  (Although, from diff's
perspective, it instead moves the definition of `+' up to just before
that paragraph.  Everything is relative, I guess.)  The reason is that
the paragraph discusses the regexp "\(x+y*\)*a" before the meaning of
`+' is explained.  This makes `x+y' look like it is the sum of x and y.
Also, the remarks in the paragraph apply to both `*' and `+'.

===File ~/searching.texi-diff===============================
*** searching.texi	06 Feb 2006 16:02:08 -0600	1.68
--- searching.texi	06 Mar 2006 23:47:42 -0600	
***************
*** 235,246 ****
  
    Regular expressions have a syntax in which a few characters are
  special constructs and the rest are @dfn{ordinary}.  An ordinary
! character is a simple regular expression that matches that character and
! nothing else.  The special characters are @samp{.}, @samp{*}, @samp{+},
! @samp{?}, @samp{[}, @samp{]}, @samp{^}, @samp{$}, and @samp{\}; no new
! special characters will be defined in the future.  Any other character
! appearing in a regular expression is ordinary, unless a @samp{\}
! precedes it.
  
    For example, @samp{f} is not a special character, so it is ordinary, and
  therefore @samp{f} is a regular expression that matches the string
--- 235,249 ----
  
    Regular expressions have a syntax in which a few characters are
  special constructs and the rest are @dfn{ordinary}.  An ordinary
! character is a simple regular expression that matches that character
! and nothing else.  The special characters are @samp{.}, @samp{*},
! @samp{+}, @samp{?}, @samp{[}, @samp{^}, @samp{$}, and @samp{\}; no new
! special characters will be defined in the future.  The character
! @samp{]} is special if it ends a character alternative (see later).
! The character @samp{-} is special inside a character alternative.  A
! @samp{[:} and balancing @samp{:]} enclose a character class inside a
! character alternative.  Any other character appearing in a regular
! expression is ordinary, unless a @samp{\} precedes it.
  
    For example, @samp{f} is not a special character, so it is ordinary, and
  therefore @samp{f} is a regular expression that matches the string
***************
*** 301,306 ****
--- 304,316 ----
  The next alternative is for @samp{a*} to match only two @samp{a}s.  With
  this choice, the rest of the regexp matches successfully.@refill
  
+ @item @samp{+}
+ @cindex @samp{+} in regexp
+ is a postfix operator, similar to @samp{*} except that it must match
+ the preceding expression at least once.  So, for example, @samp{ca+r}
+ matches the strings @samp{car} and @samp{caaaar} but not the string
+ @samp{cr}, whereas @samp{ca*r} matches all three strings.
+ 
  Nested repetition operators take a long time, or even forever, if they
  lead to ambiguous matching.  For example, trying to match the regular
  expression @samp{\(x+y*\)*a} against the string
***************
*** 311,323 ****
  it causes an infinite loop.  To avoid these problems, check nested
  repetitions carefully.
  
- @item @samp{+}
- @cindex @samp{+} in regexp
- is a postfix operator, similar to @samp{*} except that it must match
- the preceding expression at least once.  So, for example, @samp{ca+r}
- matches the strings @samp{car} and @samp{caaaar} but not the string
- @samp{cr}, whereas @samp{ca*r} matches all three strings.
- 
  @item @samp{?}
  @cindex @samp{?} in regexp
  is a postfix operator, similar to @samp{*} except that it must match the
--- 321,326 ----
***************
*** 468,473 ****
--- 471,504 ----
  can act.  It is poor practice to depend on this behavior; quote the
  special character anyway, regardless of where it appears.@refill
  
+ As a @samp{\} is not special inside a character alternative, it can
+ never remove the special meaning of @samp{-} or @samp{]}.  So you
+ should not quote these characters when they have no special meaning
+ either.  This would not clarify anything, since backslashes can
+ legitimately precede these characters where they @emph{have} special
+ meaning, as in @code{[^\]} (@code{"[^\\]"} for Lisp string syntax),
+ which matches any single character except a backslash.
+ 
+ In practice, most @samp{]} that occur in regular expressions close a
+ character alternative and hence are special.  However, occasionally a
+ regular expression may try to match a complex pattern of literal
+ @samp{[} and @samp{]}.  In such situations, it sometimes may be
+ necessary to carefully parse the regexp from the start to determine
+ which square brackets enclose a character alternative.  For example,
+ @code{[^][]]} consists of the complemented character alternative
+ @code{[^][]}, which matches any single character that is not a square
+ bracket, followed by a literal @samp{]}.
+ 
+ The exact rules are that at the beginning of a regexp, @samp{[} is
+ special and @samp{]} not.  This lasts until the first unquoted
+ @samp{[}, after which we are in a character alternative; @samp{[} is
+ no longer special (except if it starts a character class) but @samp{]}
+ is special, unless it immediately follows the special @samp{[} or that
+ @samp{[} followed by a @samp{^}.  This lasts until the next special
+ @samp{]} that does not end a character class.  This ends the character
+ alternative and restores the ordinary syntax of regular expressions;
+ an unquoted @samp{[} is special again and a @samp{]} not.
+ 
  @node Char Classes
  @subsubsection Character Classes
  @cindex character classes in regexp
***************
*** 740,747 ****
  
  @kindex invalid-regexp
    Not every string is a valid regular expression.  For example, a string
! with unbalanced square brackets is invalid (with a few exceptions, such
! as @samp{[]]}), and so is a string that ends with a single @samp{\}.  If
  an invalid regular expression is passed to any of the search functions,
  an @code{invalid-regexp} error is signaled.
  
--- 771,778 ----
  
  @kindex invalid-regexp
    Not every string is a valid regular expression.  For example, a string
! that ends inside a character alternative without terminating @samp{]}
! is invalid, and so is a string that ends with a single @samp{\}.  If
  an invalid regular expression is passed to any of the search functions,
  an @code{invalid-regexp} error is signaled.
  
============================================================
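
For anyone who wants to check the "exact rules" paragraph against
concrete regexps, here is a minimal sketch of those rules as a
left-to-right scanner.  It is written in Python purely for
illustration and is not how Emacs itself parses regexps.  The
function name `classify_brackets' and the role labels are made up for
this example.  It also assumes that any `[:' with a later `:]' is a
character class, without checking the class name, and it does not
check for a trailing single `\'.

```python
def classify_brackets(regexp):
    """Return a list of (index, char, role) for each square bracket.

    Hypothetical helper: role is "open" or "close" for the brackets
    that delimit a character alternative, "class-open" or
    "class-close" for the outer brackets of [:...:], and "literal"
    otherwise.  Raises ValueError if the regexp ends inside a
    character alternative.
    """
    roles = []
    i, n = 0, len(regexp)
    while i < n:
        c = regexp[i]
        if c == "\\":
            # Outside an alternative, a backslash quotes the next character.
            if i + 1 < n and regexp[i + 1] in "[]":
                roles.append((i + 1, regexp[i + 1], "literal"))
            i += 2
            continue
        if c == "]":
            # Outside an alternative, ] is not special.
            roles.append((i, c, "literal"))
            i += 1
            continue
        if c != "[":
            i += 1
            continue
        # An unquoted [ outside an alternative is special: it opens one.
        roles.append((i, c, "open"))
        i += 1
        if i < n and regexp[i] == "^":
            i += 1
        if i < n and regexp[i] == "]":
            # ] right after the special [ (or after [^) is literal.
            roles.append((i, "]", "literal"))
            i += 1
        while True:
            if i >= n:
                raise ValueError("regexp ends inside a character alternative")
            c = regexp[i]
            if c == "]":
                # The next special ] ends the alternative.
                roles.append((i, c, "close"))
                i += 1
                break
            if c == "[":
                # Assumption: [: plus a later :] encloses a character class.
                end = regexp.find(":]", i + 2) if regexp.startswith("[:", i) else -1
                if end != -1:
                    roles.append((i, "[", "class-open"))
                    roles.append((end + 1, "]", "class-close"))
                    i = end + 2
                    continue
                roles.append((i, c, "literal"))
            # Inside an alternative a backslash is an ordinary character.
            i += 1
    return roles
```

Applied to "[^][]]", the sketch marks position 0 as "open", positions
2 and 3 as "literal", position 4 as "close" and position 5 as
"literal", which agrees with the reading given in the patch.  For
"[^\]" (written "[^\\]" as a Python string, too) it reports only the
opening bracket and the closing one at position 3, since the
backslash quotes nothing inside the alternative.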