unofficial mirror of emacs-devel@gnu.org 
* Explicitly numbered subgroups in regular expressions
From: Stefan Monnier @ 2007-06-09 20:37 UTC
  To: emacs-devel


Any objection to the patch below for the trunk (it'll come with a NEWS
and ChangeLog entry, of course)?

What it does is add a new regexp syntax \(?<num>:<regexp>\) which is like
\(<regexp>\) except that it specifies explicitly the number of the subgroup.
E.g. (and (string-match "\\(?3:a\\)" "a") (match-data)) returns (0 1 nil
nil nil nil 0 1).  There is no backward compatibility issue with this patch:
such regexps are currently rejected as invalid.
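
The shape of that return value follows from how match data is laid out:
group N occupies positions 2N and 2N+1 of a flat list, so giving the only
group the number 3 leaves groups 1 and 2 as nil pairs.  A minimal C sketch
of that layout, assuming a hypothetical `struct registers` (not the type
regex.c actually uses) in which -1 marks a group that did not match:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical model of match registers, for illustration only:
   start[i]/end[i] describe group i; -1 means the group did not match.  */
struct registers
{
  int num;          /* number of registers, including group 0 */
  int start[10];
  int end[10];
};

/* Render the registers as a flat sequence of start/end pairs, so that
   group N lands at positions 2N and 2N+1.  Unmatched groups print as
   "nil nil".  Output is truncated if BUF is too small.  */
static void
format_match_data (const struct registers *regs, char *buf, size_t size)
{
  buf[0] = '\0';
  for (int i = 0; i < regs->num; i++)
    {
      size_t used = strlen (buf);
      if (used + 1 >= size)
        break;
      if (regs->start[i] < 0)
        snprintf (buf + used, size - used, "%snil nil", i ? " " : "");
      else
        snprintf (buf + used, size - used, "%s%d %d", i ? " " : "",
                  regs->start[i], regs->end[i]);
    }
}
```

With registers 0 and 3 both set to (0, 1) and registers 1 and 2 unset, the
buffer reads "0 1 nil nil nil nil 0 1", matching the list quoted above.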

Cases where this is useful:

1 - when we need to match either '<regexp>' or "<regexp>" and we need to
    extract the text within the quotes.  Currently we either use
    "[\"']\\(<regexp>\\)[\"']" which is not quite correct, or we use
    "\"\\(<regexp>\\)\"\\|'\\(<regexp>\\)'" and then have to use
    (or (match-string 1) (match-string 2)).
    With this patch we can use "\"\\(?1:<regexp>\\)\"\\|'\\(?1:<regexp>\\)'".
    In log-view-mode, we have such a situation where we regularly need to
    add one case and hence change some of the code from (or (match-string 1)
    (match-string 2)) to (or (match-string 1) (match-string 2) (match-string 3))
    and update other match-data indices.

2 - there are several places where some customizable data specifies
    a regular expression along with the match-data indices where the
    relevant subdata can be found.  E.g. in compilation-error-regexp-alist.
    With such a patch, we could use instead a scheme where the customizable
    data only includes a regexp because the match-data indices in which the
    relevant subdata can be found are always the same.

3 - in some rare cases such as comment-start-skip we have declared that
    subgroup N has a special meaning.  Problem is that this can
    occasionally be problematic.

Number 3 is a rare circumstance (see the comment around fortran-mode's
setting of comment-start-skip) and I can't remember it being anything other
than a minor problem.
Number 2 seems useful, but I have to admit that I have not actually tried it.
Number 1 was the motivating case.


        Stefan


--- regex.c	29 jan 2007 13:35:35 -0500	1.222
+++ regex.c	09 jun 2007 16:13:59 -0400	
@@ -2482,11 +2482,6 @@
      last -- ends with a forward jump of this sort.  */
   unsigned char *fixup_alt_jump = 0;
 
-  /* Counts open-groups as they are encountered.  Remembered for the
-     matching close-group on the compile stack, so the same register
-     number is put in the stop_memory as the start_memory.  */
-  regnum_t regnum = 0;
-
   /* Work area for range table of charset.  */
   struct range_table_work_area range_table_work;
 
@@ -3123,28 +3118,54 @@
 	    handle_open:
 	      {
 		int shy = 0;
+		regnum_t regnum = 0;
 		if (p+1 < pend)
 		  {
 		    /* Look for a special (?...) construct */
 		    if ((syntax & RE_SHY_GROUPS) && *p == '?')
 		      {
 			PATFETCH (c); /* Gobble up the '?'.  */
+			while (!shy)
+			  {
 			PATFETCH (c);
 			switch (c)
 			  {
 			  case ':': shy = 1; break;
+			      case '0':
+				/* An explicitly specified regnum must start
+				   with non-0. */
+				if (regnum == 0)
+				  FREE_STACK_RETURN (REG_BADPAT);
+			      case '1': case '2': case '3': case '4':
+			      case '5': case '6': case '7': case '8': case '9':
+				regnum = 10*regnum + (c - '0'); break;
 			  default:
 			    /* Only (?:...) is supported right now. */
 			    FREE_STACK_RETURN (REG_BADPAT);
 			  }
 		      }
 		  }
+		  }
 
 		if (!shy)
-		  {
-		    bufp->re_nsub++;
-		    regnum++;
+		  regnum = ++bufp->re_nsub;
+		else if (regnum)
+		  { /* It's actually not shy, but explicitly numbered.  */
+		    shy = 0;
+		    if (regnum > bufp->re_nsub)
+		      bufp->re_nsub = regnum;
+		    else if (regnum > bufp->re_nsub
+			     /* Ideally, we'd want to check that the specified
+				group can't have matched (i.e. all subgroups
+				using the same regnum are in other branches of
+				OR patterns), but we don't currently keep track
+				of enough info to do that easily.  */
+			     || group_in_compile_stack (compile_stack, regnum))
+		      FREE_STACK_RETURN (REG_BADPAT);
 		  }
+		else
+		  /* It's really shy.  */
+		  regnum = - bufp->re_nsub;
 
 		if (COMPILE_STACK_FULL)
 		  {
@@ -3163,12 +3184,11 @@
 		COMPILE_STACK_TOP.fixup_alt_jump
 		  = fixup_alt_jump ? fixup_alt_jump - bufp->buffer + 1 : 0;
 		COMPILE_STACK_TOP.laststart_offset = b - bufp->buffer;
-		COMPILE_STACK_TOP.regnum = shy ? -regnum : regnum;
+		COMPILE_STACK_TOP.regnum = regnum;
 
-		/* Do not push a
-		   start_memory for groups beyond the last one we can
-		   represent in the compiled pattern.  */
-		if (regnum <= MAX_REGNUM && !shy)
+		/* Do not push a start_memory for groups beyond the last one
+		   we can represent in the compiled pattern.  */
+		if (regnum <= MAX_REGNUM && regnum > 0)
 		  BUF_PUSH_2 (start_memory, regnum);
 
 		compile_stack.avail++;
@@ -3213,7 +3233,7 @@
 		/* We don't just want to restore into `regnum', because
 		   later groups should continue to be numbered higher,
 		   as in `(ab)c(de)' -- the second group is #2.  */
-		regnum_t this_group_regnum;
+		regnum_t regnum;
 
 		compile_stack.avail--;
 		begalt = bufp->buffer + COMPILE_STACK_TOP.begalt_offset;
@@ -3222,7 +3242,7 @@
 		    ? bufp->buffer + COMPILE_STACK_TOP.fixup_alt_jump - 1
 		    : 0;
 		laststart = bufp->buffer + COMPILE_STACK_TOP.laststart_offset;
-		this_group_regnum = COMPILE_STACK_TOP.regnum;
+		regnum = COMPILE_STACK_TOP.regnum;
 		/* If we've reached MAX_REGNUM groups, then this open
 		   won't actually generate any code, so we'll have to
 		   clear pending_exact explicitly.  */
@@ -3230,8 +3250,8 @@
 
 		/* We're at the end of the group, so now we know how many
 		   groups were inside this one.  */
-		if (this_group_regnum <= MAX_REGNUM && this_group_regnum > 0)
-		  BUF_PUSH_2 (stop_memory, this_group_regnum);
+		if (regnum <= MAX_REGNUM && regnum > 0)
+		  BUF_PUSH_2 (stop_memory, regnum);
 	      }
 	      break;
 
@@ -3557,8 +3577,9 @@
 
 		reg = c - '0';
 
-		/* Can't back reference to a subexpression before its end.  */
-		if (reg > regnum || group_in_compile_stack (compile_stack, reg))
+		if (reg > bufp->re_nsub || reg < 1
+		    /* Can't back reference to a subexp before its end.  */
+		    || group_in_compile_stack (compile_stack, reg))
 		  FREE_STACK_RETURN (REG_ESUBREG);
 
 		laststart = b;
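
The last hunk changes the back-reference check: since explicit numbers mean
a running counter no longer equals the highest register in use, the test is
made against bufp->re_nsub instead.  A minimal sketch of that rule in
isolation, where `group_is_open` is a hypothetical stand-in for
group_in_compile_stack and the open groups are passed as a plain array
(both are assumptions made for illustration, not the regex.c data
structures):

```c
#include <stddef.h>

/* Hypothetical stand-in for group_in_compile_stack: return 1 if REG is
   among the groups that have been opened but not yet closed.  */
static int
group_is_open (const int *open_groups, size_t n_open, int reg)
{
  for (size_t i = 0; i < n_open; i++)
    if (open_groups[i] == reg)
      return 1;
  return 0;
}

/* A back reference \REG is acceptable only if REG names a register that
   exists (1..NSUB) and the corresponding group has already been closed.  */
static int
backref_is_valid (int reg, int nsub, const int *open_groups, size_t n_open)
{
  if (reg < 1 || reg > nsub)
    return 0;
  return !group_is_open (open_groups, n_open, reg);
}
```

For example, with four registers in use and group 3 still open, a reference
to group 4 is accepted while references to groups 3 and 5 are rejected.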


* Re: Explicitly numbered subgroups in regular expressions
From: Richard Stallman @ 2007-06-10 13:19 UTC
  To: Stefan Monnier; +Cc: emacs-devel

    What it does is add a new regexp syntax \(?<num>:<regexp>\) which is like
    \(<regexp>\) except that it specifies explicitly the number of the subgroup.
    E.g. (and (string-match "\\(?3:a\\)" "a") (match-data)) returns (0 1 nil
    nil nil nil 0 1).  There is no backward compatibility issue with this patch:
    such regexps are currently rejected as invalid.

The feature seems useful, but how does this affect the numbering of
groups that don't specify a number?  That has to be done right, then
documented in the two manuals.


* Re: Explicitly numbered subgroups in regular expressions
From: Stefan Monnier @ 2007-06-10 13:44 UTC
  To: rms; +Cc: emacs-devel

>     What it does is add a new regexp syntax \(?<num>:<regexp>\) which is
>     like \(<regexp>\) except that it specifies explicitly the number of
>     the subgroup.  E.g. (and (string-match "\\(?3:a\\)" "a") (match-data))
>     returns (0 1 nil nil nil nil 0 1).  There is no backward compatibility
>     issue with this patch: such regexps are currently rejected as invalid.

> The feature seems useful, but how does this affect the numbering of
> groups that don't specify a number?

A subgroup that doesn't explicitly specify a number gets the "smallest
natural number larger than all previous ones".  I.e.

  (and (string-match "\\(?3:a\\)\\(b\\)" "ab") (match-data))

returns

  (0 2 nil nil nil nil 0 1 1 2)

since the second group gets number 4.
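
The numbering rule can be sketched as a small standalone scanner.  This is
an illustrative model rather than the regex.c code: `number_groups` is a
hypothetical helper, it takes the regexp as the engine sees it (single
backslashes), it ignores bracket expressions, and it omits the patch's
check that rejects reusing the number of a group that is still open.

```c
/* Hypothetical helper, not part of regex.c.  Scan PAT and store in OUT
   the register number each "\(" group would receive, in order of
   appearance; shy groups "\(?:" are stored as 0.  *NSUB receives the
   highest register number in use.  Returns the number of groups seen,
   or -1 for a malformed "\(?" prefix.  */
static int
number_groups (const char *pat, int *out, int max_out, int *nsub)
{
  int count = 0;
  const char *p = pat;

  *nsub = 0;
  while (*p)
    {
      if (p[0] == '\\' && p[1] == '(')
        {
          int regnum = 0;

          p += 2;
          if (*p == '?')
            {
              p++;
              /* Accumulate digits up to the ':' that ends the prefix.  */
              while (*p != ':')
                {
                  if (*p < '0' || *p > '9')
                    return -1;      /* also catches end of string */
                  if (*p == '0' && regnum == 0)
                    return -1;      /* explicit number can't start with 0 */
                  regnum = 10 * regnum + (*p - '0');
                  p++;
                }
              p++;                  /* skip the ':' */
              /* regnum == 0 here means a plain shy group.  */
              if (regnum > *nsub)
                *nsub = regnum;     /* explicit number raises the maximum */
            }
          else
            regnum = ++*nsub;       /* implicit: one past the highest so far */

          if (count < max_out)
            out[count] = regnum;
          count++;
        }
      else if (p[0] == '\\' && p[1] != '\0')
        p += 2;                     /* skip any other escaped character */
      else
        p++;
    }
  return count;
}
```

Under these assumptions, the pattern \(?3:a\)\(b\) yields the numbers 3 and
4, and a pattern such as "\(?1:x\)"\|'\(?1:x\)' yields 1 for both groups
with a highest register of 1.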

> That has to be done right, then documented in the two manuals.

Sure,


        Stefan


* Re: Explicitly numbered subgroups in regular expressions
From: Richard Stallman @ 2007-06-11  9:44 UTC
  To: Stefan Monnier; +Cc: emacs-devel

    A subgroup that doesn't explicitly specify a number gets the "smallest
    natural number larger than all previous ones".

That sounds good.


* Re: Explicitly numbered subgroups in regular expressions
From: Stefan Monnier @ 2007-06-12 18:42 UTC
  To: rms; +Cc: emacs-devel

> The feature seems useful, but how does this affect the numbering of
> groups that don't specify a number?  That has to be done right, then
> documented in the two manuals.

Done.  Except I have only documented it in the Elisp manual: I believe it's
only useful for Lisp programming, not for interactive regexp use.


        Stefan


* Re: Explicitly numbered subgroups in regular expressions
From: Juri Linkov @ 2007-06-12 20:56 UTC
  To: Stefan Monnier; +Cc: rms, emacs-devel

>> The feature seems useful, but how does this affect the numbering of
>> groups that don't specify a number?  That has to be done right, then
>> documented in the two manuals.
>
> Done.  Except I have only documented it in the Elisp manual: I believe it's
> only useful for Lisp programming, not for interactive regexp use.

This feature makes regexps more readable and less error-prone, so I think
it would be useful for interactive usage like in query-replace-regexp.

Now that it has been installed, I noticed that in Emacs Lisp files the
group numbers in regexps are not highlighted in
font-lock-regexp-grouping-construct face the way other grouping constructs
are.  The patch below adds highlighting for them.  It also uses this new
feature in the same regexp ;)

Index: lisp/font-lock.el
===================================================================
RCS file: /sources/emacs/emacs/lisp/font-lock.el,v
retrieving revision 1.317
diff -c -r1.317 font-lock.el
*** lisp/font-lock.el	21 Apr 2007 14:30:25 -0000	1.317
--- lisp/font-lock.el	12 Jun 2007 20:55:41 -0000
***************
*** 2279,2285 ****
              ;; that do not occur in strings.  The associated regexp matches one
              ;; of `\\\\' `\\(' `\\(?:' `\\|' `\\)'.  `\\\\' has been included to
              ;; avoid highlighting, for example, `\\(' in `\\\\('.
!             (while (re-search-forward "\\(\\\\\\\\\\)\\(?:\\(\\\\\\\\\\)\\|\\((\\(?:\\?:\\)?\\|[|)]\\)\\)" bound t)
                (unless (match-beginning 2)
                  (let ((face (get-text-property (1- (point)) 'face)))
                    (when (or (and (listp face)
--- 2279,2285 ----
              ;; that do not occur in strings.  The associated regexp matches one
              ;; of `\\\\' `\\(' `\\(?:' `\\|' `\\)'.  `\\\\' has been included to
              ;; avoid highlighting, for example, `\\(' in `\\\\('.
!             (while (re-search-forward "\\(?1:\\\\\\\\\\)\\(?:\\(?2:\\\\\\\\\\)\\|\\(?3:(\\(?:\\?[0-9]*:\\)?\\|[|)]\\)\\)" bound t)
                (unless (match-beginning 2)
                  (let ((face (get-text-property (1- (point)) 'face)))
                    (when (or (and (listp face)

-- 
Juri Linkov
http://www.jurta.org/emacs/


* Re: Explicitly numbered subgroups in regular expressions
From: Richard Stallman @ 2007-06-13 16:23 UTC
  To: Stefan Monnier; +Cc: emacs-devel

    Done.  Except I have only documented it in the Elisp manual: I believe it's
    only useful for Lisp programming, not for interactive regexp use.

I think you are right.
