From mboxrd@z Thu Jan  1 00:00:00 1970
Path: main.gmane.org!not-for-mail
From: Jim Blandy <jimb@redhat.com>
Newsgroups: gmane.emacs.devel
Subject: Re: Implement new symbol-start and symbol-end regexp operators
Date: 12 May 2004 12:36:49 -0500
Sender: emacs-devel-bounces+emacs-devel=quimby.gnus.org@gnu.org
Message-ID: <vt2oeotpmxa.fsf@zenia.home>
References: <vt27jvyo0is.fsf@zenia.home> <E1BJr2v-0001Wz-L7@fencepost.gnu.org>
	<vt2isfc3stj.fsf@zenia.home> <u4qqve6pb.fsf@gnu.org>
	<vt2ad0n1ju6.fsf@zenia.home>
NNTP-Posting-Host: deer.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Trace: sea.gmane.org 1084384048 19332 80.91.224.253 (12 May 2004 17:47:28 GMT)
X-Complaints-To: usenet@sea.gmane.org
NNTP-Posting-Date: Wed, 12 May 2004 17:47:28 +0000 (UTC)
Original-X-From: emacs-devel-bounces+emacs-devel=quimby.gnus.org@gnu.org Wed May 12 19:47:17 2004
Return-path: <emacs-devel-bounces+emacs-devel=quimby.gnus.org@gnu.org>
Original-Received: from quimby.gnus.org ([80.91.224.244])
	by deer.gmane.org with esmtp (Exim 3.35 #1 (Debian))
	id 1BNxp7-0006iv-00
	for <emacs-devel@deer.gmane.org>; Wed, 12 May 2004 19:47:17 +0200
Original-Received: from monty-python.gnu.org ([199.232.76.173])
	by quimby.gnus.org with esmtp (Exim 3.35 #1 (Debian))
	id 1BNxp6-0003cP-00
	for <emacs-devel@quimby.gnus.org>; Wed, 12 May 2004 19:47:16 +0200
Original-Received: from localhost ([127.0.0.1] helo=monty-python.gnu.org)
	by monty-python.gnu.org with esmtp (Exim 4.34)
	id 1BNxjF-0003hq-IN
	for emacs-devel@quimby.gnus.org; Wed, 12 May 2004 13:41:13 -0400
Original-Received: from list by monty-python.gnu.org with tmda-scanned (Exim 4.34)
	id 1BNxgE-0002wf-Fi
	for emacs-devel@gnu.org; Wed, 12 May 2004 13:38:06 -0400
Original-Received: from mail by monty-python.gnu.org with spam-scanned (Exim 4.34)
	id 1BNxfc-0002mF-1D
	for emacs-devel@gnu.org; Wed, 12 May 2004 13:37:59 -0400
Original-Received: from [66.187.233.31] (helo=mx1.redhat.com)
	by monty-python.gnu.org with esmtp (TLSv1:DES-CBC3-SHA:168)
	(Exim 4.34) id 1BNxfb-0002kV-0b
	for emacs-devel@gnu.org; Wed, 12 May 2004 13:37:27 -0400
Original-Received: from int-mx1.corp.redhat.com (int-mx1.corp.redhat.com
	[172.16.52.254])
	by mx1.redhat.com (8.12.10/8.12.10) with ESMTP id i4CHbL0m010306
	for <emacs-devel@gnu.org>; Wed, 12 May 2004 13:37:21 -0400
Original-Received: from zenia.home.redhat.com (porkchop.devel.redhat.com [172.16.58.2])
	by int-mx1.corp.redhat.com (8.11.6/8.11.6) with ESMTP id
	i4CHbG321865; Wed, 12 May 2004 13:37:17 -0400
Original-To: emacs-devel@gnu.org
In-Reply-To: <vt2ad0n1ju6.fsf@zenia.home>
Original-Lines: 365
User-Agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.3
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.4
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <http://mail.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://mail.gnu.org/pipermail/emacs-devel>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <http://mail.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
Errors-To: emacs-devel-bounces+emacs-devel=quimby.gnus.org@gnu.org
Xref: main.gmane.org gmane.emacs.devel:23265
X-Report-Spam: http://spam.gmane.org/gmane.emacs.devel:23265


Has anyone had a chance to try this patch out?

Jim Blandy <jimb@redhat.com> writes:

> Eli Zaretskii <eliz@gnu.org> writes:
> 
> > > From: Jim Blandy <jimb@redhat.com>
> > > Date: 04 May 2004 14:17:44 -0500
> > >   
> > > + @item \_<
> > > + @cindex @samp{\_<} in regexp
> > 
> > IMHO, an additional index entry here, something like
> > 
> >       @cindex matching symbols in regexp
> > 
> > would be useful.
> 
> Okay, I added:
> 
> + @cindex symbols, matching in regexp
> 
> 
> src/ChangeLog:
> 2004-04-29  Jim Blandy  <jimb@redhat.com>
> 
> 	Add support for new '\_<' and '\_>' regexp operators, matching the
> 	beginnings and ends of symbols.
> 	* regex.c (enum syntaxcode): Add Ssymbol.
> 	(init_syntax_once): Set the syntax for '_' to Ssymbol, not Sword.
> 	(symbeg, symend): New opcodes.
> 	(print_partial_compiled_pattern): Print the new opcodes properly.
> 	(regex_compile): Parse the new operators.
> 	(analyze_first): symbeg and symend match only the empty string.
> 	(mutually_exclusive_p): symend is mutually exclusive with \s_ and
> 	\sw; symbeg is mutually exclusive with \S_ and \Sw.
> 	(re_match_2_internal): Add code for symbeg and symend.
> 	* search.c (trivial_regexp_p): \_ is no longer a trivial regexp.
> 
> man/ChangeLog:
> 2004-04-29  Jim Blandy  <jimb@redhat.com>
> 
> 	* search.texi (Regexps): Document the \_< and \_> regexp operators.
> 
> lispref/ChangeLog:
> 2004-05-04  Jim Blandy  <jimb@redhat.com>
> 
> 	* searching.texi (Regexp Backslash): Document new \_< and \_>
> 	operators.
> 
> *** src/regex.c.~2~	2004-04-29 15:56:53.000000000 -0500
> --- src/regex.c	2004-04-29 17:44:24.000000000 -0500
> ***************
> *** 219,225 ****
>   /* Define the syntax stuff for \<, \>, etc.  */
>   
>   /* Sword must be nonzero for the wordchar pattern commands in re_match_2.  */
> ! enum syntaxcode { Swhitespace = 0, Sword = 1 };
>   
>   # ifdef SWITCH_ENUM_BUG
>   #  define SWITCH_ENUM_CAST(x) ((int)(x))
> --- 219,225 ----
>   /* Define the syntax stuff for \<, \>, etc.  */
>   
>   /* Sword must be nonzero for the wordchar pattern commands in re_match_2.  */
> ! enum syntaxcode { Swhitespace = 0, Sword = 1, Ssymbol = 2 };
>   
>   # ifdef SWITCH_ENUM_BUG
>   #  define SWITCH_ENUM_CAST(x) ((int)(x))
> ***************
> *** 399,405 ****
>        if (ISALNUM (c))
>   	re_syntax_table[c] = Sword;
>   
> !    re_syntax_table['_'] = Sword;
>   
>      done = 1;
>   }
> --- 399,405 ----
>        if (ISALNUM (c))
>   	re_syntax_table[c] = Sword;
>   
> !    re_syntax_table['_'] = Ssymbol;
>   
>      done = 1;
>   }
> ***************
> *** 656,661 ****
> --- 656,664 ----
>     wordbound,	/* Succeeds if at a word boundary.  */
>     notwordbound,	/* Succeeds if not at a word boundary.	*/
>   
> +   symbeg,       /* Succeeds if at symbol beginning.  */
> +   symend,       /* Succeeds if at symbol end.  */
> + 
>   	/* Matches any character whose syntax is specified.  Followed by
>   	   a byte which contains a syntax code, e.g., Sword.  */
>     syntaxspec,
> ***************
> *** 1095,1100 ****
> --- 1098,1111 ----
>   	case wordend:
>   	  printf ("/wordend");
>   
> + 	case symbeg:
> + 	  printf ("/symbeg");
> + 	  break;
> + 
> + 	case symend:
> + 	  printf ("/symend");
> + 	  break;
> + 
>   	case syntaxspec:
>   	  printf ("/syntaxspec");
>   	  mcnt = *p++;
> ***************
> *** 3135,3140 ****
> --- 3145,3163 ----
>   	      BUF_PUSH (wordend);
>   	      break;
>   
> + 	    case '_':
> + 	      if (syntax & RE_NO_GNU_OPS)
> + 		goto normal_char;
> +               laststart = b;
> +               PATFETCH (c);
> +               if (c == '<')
> +                 BUF_PUSH (symbeg);
> +               else if (c == '>')
> +                 BUF_PUSH (symend);
> +               else
> +                 FREE_STACK_RETURN (REG_BADPAT);
> +               break;
> + 
>   	    case 'b':
>   	      if (syntax & RE_NO_GNU_OPS)
>   		goto normal_char;
> ***************
> *** 3629,3634 ****
> --- 3652,3659 ----
>   	case notwordbound:
>   	case wordbeg:
>   	case wordend:
> + 	case symbeg:
> + 	case symend:
>   	  continue;
>   
>   
> ***************
> *** 4396,4409 ****
>         break;
>   
>       case wordend:
> !     case notsyntaxspec:
>         return ((re_opcode_t) *p1 == syntaxspec
> ! 	      && p1[1] == (op2 == wordend ? Sword : p2[1]));
>   
>       case wordbeg:
> !     case syntaxspec:
>         return ((re_opcode_t) *p1 == notsyntaxspec
> ! 	      && p1[1] == (op2 == wordbeg ? Sword : p2[1]));
>   
>       case wordbound:
>         return (((re_opcode_t) *p1 == notsyntaxspec
> --- 4421,4440 ----
>         break;
>   
>       case wordend:
> !       return ((re_opcode_t) *p1 == syntaxspec && p1[1] == Sword);
> !     case symend:
>         return ((re_opcode_t) *p1 == syntaxspec
> !               && (p1[1] == Ssymbol || p1[1] == Sword));
> !     case notsyntaxspec:
> !       return ((re_opcode_t) *p1 == syntaxspec && p1[1] == p2[1]);
>   
>       case wordbeg:
> !       return ((re_opcode_t) *p1 == notsyntaxspec && p1[1] == Sword);
> !     case symbeg:
>         return ((re_opcode_t) *p1 == notsyntaxspec
> !               && (p1[1] == Ssymbol || p1[1] == Sword));
> !     case syntaxspec:
> !       return ((re_opcode_t) *p1 == notsyntaxspec && p1[1] == p2[1]);
>   
>       case wordbound:
>         return (((re_opcode_t) *p1 == notsyntaxspec
> ***************
> *** 5528,5533 ****
> --- 5559,5650 ----
>   	    }
>   	  break;
>   
> + 	case symbeg:
> + 	  DEBUG_PRINT1 ("EXECUTING symbeg.\n");
> + 
> + 	  /* We FAIL in one of the following cases: */
> + 
> + 	  /* Case 1: D is at the end of string.	 */
> + 	  if (AT_STRINGS_END (d))
> + 	    goto fail;
> + 	  else
> + 	    {
> + 	      /* C1 is the character before D, S1 is the syntax of C1, C2
> + 		 is the character at D, and S2 is the syntax of C2.  */
> + 	      re_wchar_t c1, c2;
> + 	      int s1, s2;
> + #ifdef emacs
> + 	      int offset = PTR_TO_OFFSET (d);
> + 	      int charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset);
> + 	      UPDATE_SYNTAX_TABLE (charpos);
> + #endif
> + 	      PREFETCH ();
> + 	      c2 = RE_STRING_CHAR (d, dend - d);
> + 	      s2 = SYNTAX (c2);
> + 	
> + 	      /* Case 2: S2 is neither Sword nor Ssymbol. */
> + 	      if (s2 != Sword && s2 != Ssymbol)
> + 		goto fail;
> + 
> + 	      /* Case 3: D is not at the beginning of string ... */
> + 	      if (!AT_STRINGS_BEG (d))
> + 		{
> + 		  GET_CHAR_BEFORE_2 (c1, d, string1, end1, string2, end2);
> + #ifdef emacs
> + 		  UPDATE_SYNTAX_TABLE_BACKWARD (charpos - 1);
> + #endif
> + 		  s1 = SYNTAX (c1);
> + 
> + 		  /* ... and S1 is Sword or Ssymbol.  */
> + 		  if (s1 == Sword || s1 == Ssymbol)
> + 		    goto fail;
> + 		}
> + 	    }
> + 	  break;
> + 
> + 	case symend:
> + 	  DEBUG_PRINT1 ("EXECUTING symend.\n");
> + 
> + 	  /* We FAIL in one of the following cases: */
> + 
> + 	  /* Case 1: D is at the beginning of string.  */
> + 	  if (AT_STRINGS_BEG (d))
> + 	    goto fail;
> + 	  else
> + 	    {
> + 	      /* C1 is the character before D, S1 is the syntax of C1, C2
> + 		 is the character at D, and S2 is the syntax of C2.  */
> + 	      re_wchar_t c1, c2;
> + 	      int s1, s2;
> + #ifdef emacs
> + 	      int offset = PTR_TO_OFFSET (d) - 1;
> + 	      int charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset);
> + 	      UPDATE_SYNTAX_TABLE (charpos);
> + #endif
> + 	      GET_CHAR_BEFORE_2 (c1, d, string1, end1, string2, end2);
> + 	      s1 = SYNTAX (c1);
> + 
> + 	      /* Case 2: S1 is neither Ssymbol nor Sword.  */
> + 	      if (s1 != Sword && s1 != Ssymbol)
> + 		goto fail;
> + 
> + 	      /* Case 3: D is not at the end of string ... */
> + 	      if (!AT_STRINGS_END (d))
> + 		{
> + 		  PREFETCH_NOLIMIT ();
> + 		  c2 = RE_STRING_CHAR (d, dend - d);
> + #ifdef emacs
> + 		  UPDATE_SYNTAX_TABLE_FORWARD (charpos);
> + #endif
> + 		  s2 = SYNTAX (c2);
> + 
> + 		  /* ... and S2 is Sword or Ssymbol.  */
> + 		  if (s2 == Sword || s2 == Ssymbol)
> +                     goto fail;
> + 		}
> + 	    }
> + 	  break;
> + 
>   	case syntaxspec:
>   	case notsyntaxspec:
>   	  not = (re_opcode_t) *(p - 1) == notsyntaxspec;
> *** src/search.c.~1~	2002-05-12 19:04:16.000000000 -0500
> --- src/search.c	2004-04-29 17:30:17.000000000 -0500
> ***************
> *** 962,968 ****
>   	    {
>   	    case '|': case '(': case ')': case '`': case '\'': case 'b':
>   	    case 'B': case '<': case '>': case 'w': case 'W': case 's':
> ! 	    case 'S': case '=': case '{': case '}':
>   	    case 'c': case 'C':	/* for categoryspec and notcategoryspec */
>   	    case '1': case '2': case '3': case '4': case '5':
>   	    case '6': case '7': case '8': case '9':
> --- 962,968 ----
>   	    {
>   	    case '|': case '(': case ')': case '`': case '\'': case 'b':
>   	    case 'B': case '<': case '>': case 'w': case 'W': case 's':
> ! 	    case 'S': case '=': case '{': case '}': case '_':
>   	    case 'c': case 'C':	/* for categoryspec and notcategoryspec */
>   	    case '1': case '2': case '3': case '4': case '5':
>   	    case '6': case '7': case '8': case '9':
> *** man/search.texi.~1~	2002-07-06 08:44:06.000000000 -0500
> --- man/search.texi	2004-04-29 17:38:41.000000000 -0500
> ***************
> *** 672,677 ****
> --- 672,689 ----
>   @item \W
>   matches any character that is not a word-constituent.
>   
> + @item \_<
> + matches the empty string, but only at the beginning of a symbol.  A
> + symbol is a sequence of one or more word or symbol constituent
> + characters.  @samp{\_<} matches at the beginning of the buffer only if
> + a symbol-constituent character follows.
> + 
> + @item \_>
> + matches the empty string, but only at the end of a symbol.  A symbol
> + is a sequence of one or more word or symbol constituent characters.
> + @samp{\_>} matches at the end of the buffer only if the contents end
> + with a symbol-constituent character.
> + 
>   @item \s@var{c}
>   matches any character whose syntax is @var{c}.  Here @var{c} is a
>   character that designates a particular syntax class: thus, @samp{w}
> *** searching.texi.~1.48.~	2004-02-16 20:09:15.000000000 -0500
> --- searching.texi	2004-05-05 01:12:38.000000000 -0500
> ***************
> *** 666,671 ****
> --- 666,686 ----
>   with a word-constituent character.
> + 
> + @item \_<
> + @cindex @samp{\_<} in regexp
> + @cindex symbols, matching in regexp
> + matches the empty string, but only at the beginning of a symbol.  A
> + symbol is a sequence of one or more word or symbol constituent
> + characters.  @samp{\_<} matches at the beginning of the buffer (or
> + string) only if a symbol-constituent character follows.
> + 
> + @item \_>
> + @cindex @samp{\_>} in regexp
> + matches the empty string, but only at the end of a symbol.  A symbol
> + is a sequence of one or more word or symbol constituent characters.
> + @samp{\_>} matches at the end of the buffer (or string) only if the
> + contents end with a symbol-constituent character.
>   @end table
>   
>   @kindex invalid-regexp
>     Not every string is a valid regular expression.  For example, a string
>   with unbalanced square brackets is invalid (with a few exceptions, such
> *** etc/NEWS.~1.950.~	2004-04-27 17:02:27.000000000 -0500
> --- etc/NEWS	2004-05-04 14:15:33.000000000 -0500
> ***************
> *** 90,95 ****
> --- 90,101 ----
>   
>   * Changes in Emacs 21.4
>   
> + +++
> + ** There are now two new regular expression operators, \_< and \_>,
> + for matching the beginning and end of a symbol.  A symbol is a
> + non-empty sequence of either word or symbol constituent characters, as
> + specified by the syntax table.
> + 
>   ---
>   ** The IELM prompt is now, by default, read-only.  This can be
>   controlled with the new user option `ielm-prompt-read-only'.
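
For anyone who wants to check the boundary rules without building
Emacs, here is a minimal standalone sketch of the three failure cases
that the symbeg and symend code in re_match_2_internal tests.  It is
not the patch itself: char_syntax, symbol_constituent_p,
at_symbol_start and at_symbol_end are hypothetical names, and the
classifier only imitates the non-Emacs default table from
init_syntax_once (alphanumerics are Sword, '_' is Ssymbol, everything
else is treated as whitespace).  A real Emacs buffer would consult its
own syntax table instead.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Same three classes as the patched enum in regex.c.  */
enum syntaxcode { Swhitespace = 0, Sword = 1, Ssymbol = 2 };

/* Hypothetical classifier, assumed to mirror init_syntax_once.  */
enum syntaxcode
char_syntax (unsigned char c)
{
  if (isalnum (c))
    return Sword;
  if (c == '_')
    return Ssymbol;
  return Swhitespace;
}

/* A symbol is made of word OR symbol constituents.  */
int
symbol_constituent_p (unsigned char c)
{
  enum syntaxcode s = char_syntax (c);
  return s == Sword || s == Ssymbol;
}

/* Nonzero if \_< would match the empty string at offset POS in STR.  */
int
at_symbol_start (const char *str, size_t pos)
{
  size_t len = strlen (str);
  assert (pos <= len);

  /* Case 1: at the end of the string.  */
  if (pos == len)
    return 0;
  /* Case 2: the character at POS cannot be part of a symbol.  */
  if (!symbol_constituent_p ((unsigned char) str[pos]))
    return 0;
  /* Case 3: the previous character is already inside a symbol.  */
  if (pos > 0 && symbol_constituent_p ((unsigned char) str[pos - 1]))
    return 0;
  return 1;
}

/* Nonzero if \_> would match the empty string at offset POS in STR.  */
int
at_symbol_end (const char *str, size_t pos)
{
  size_t len = strlen (str);
  assert (pos <= len);

  /* Case 1: at the beginning of the string.  */
  if (pos == 0)
    return 0;
  /* Case 2: the previous character cannot be part of a symbol.  */
  if (!symbol_constituent_p ((unsigned char) str[pos - 1]))
    return 0;
  /* Case 3: the symbol continues past POS.  */
  if (pos < len && symbol_constituent_p ((unsigned char) str[pos]))
    return 0;
  return 1;
}
```

With the string "foo_bar baz", at_symbol_start is true only at offsets
0 and 8, and at_symbol_end only at offsets 7 and 11.  Offset 4 (just
after the underscore) is not a symbol start, although under the
simplified table above it would be a word start, which is the
distinction the new operators are meant to capture.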