* Explicitly numbered subgroups in regular expressions
@ 2007-06-09 20:37 Stefan Monnier
2007-06-10 13:19 ` Richard Stallman
0 siblings, 1 reply; 7+ messages in thread
From: Stefan Monnier @ 2007-06-09 20:37 UTC
To: emacs-devel
Any objection to the patch below for the trunk (it'll come with a NEWS
and ChangeLog entry, of course)?
What it does is add a new regexp syntax \(?<num>:<regexp>\) which is like
\(<regexp>\) except that it specifies explicitly the number of the subgroup.
E.g. (and (string-match "\\(?3:a\\)" "a") (match-data)) returns (0 1 nil
nil nil nil 0 1). There is no backward compatibility issue with this patch:
such regexps are currently rejected as invalid.
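To see why that list contains four nils, it may help to spell out the layout of the match data: a start/end pair for the whole match (group 0), then one pair per group in numeric order, with nil for each position of a group that did not take part in the match. The following is only a minimal sketch in C of that layout; `format_match_data` and `UNSET` are hypothetical names introduced for illustration and are not part of regex.c or of this patch.

```c
#include <stdio.h>

#define UNSET (-1)  /* hypothetical marker: this group did not match */

/* Render the registers as a flat list shaped like the value of
   (match-data): start and end of group 0, then of group 1, and so
   on, printing "nil nil" for every group that is unset.  */
void
format_match_data (const int *start, const int *end, int ngroups,
                   char *buf, size_t size)
{
  size_t used = 0;
  int i;

  used += snprintf (buf + used, size - used, "(");
  for (i = 0; i < ngroups && used < size; i++)
    {
      const char *sep = i ? " " : "";
      if (start[i] == UNSET)
        used += snprintf (buf + used, size - used, "%snil nil", sep);
      else
        used += snprintf (buf + used, size - used, "%s%d %d", sep,
                          start[i], end[i]);
    }
  if (used < size)
    snprintf (buf + used, size - used, ")");
}
```

With group 0 and group 3 both covering positions 0 to 1, and groups 1 and 2 unset, the buffer holds the same list as in the example: the two unset groups account for the four nils.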
Cases where this is useful:
1 - when we need to match either '<regexp>' or "<regexp>" and we need to
extract the text within the quotes. Currently we either use
"[\"']\\(<regexp>\\)[\"']" which is not quite correct, or we use
"\"\\(<regexp>\\)\"\\|'\\(<regexp>\\)'" and then have to use
(or (match-string 1) (match-string 2)).
With this patch we can use "\"\\(?1:<regexp>\\)\"\\|'\\(?1:<regexp>\\)'".
In log-view-mode, we have such a situation where we regularly need to
add one case and hence change some of the code from (or (match-string 1)
(match-string 2)) to (or (match-string 1) (match-string 2) (match-string 3))
and update other match-data indices.
2 - there are several places where some customizable data specifies
a regular expression along with the match-data indices where the
relevant subdata can be found. E.g. in compilation-error-regexp-alist.
With such a patch, we could use instead a scheme where the customizable
data only includes a regexp because the match-data indices in which the
relevant subdata can be found are always the same.
3 - in some rare cases such as comment-start-skip we have declared that
subgroup N has a special meaning. The problem is that this can
occasionally be problematic.
Number 3 is a rare circumstance (see the comment around fortran-mode's
setting of comment-start-skip) and I can't remember it being anything
other than a minor problem.
Number 2 seems useful, but I have to admit that I have not actually tried it.
Number 1 was the motivating case.
Stefan
--- regex.c 29 jan 2007 13:35:35 -0500 1.222
+++ regex.c 09 jun 2007 16:13:59 -0400
@@ -2482,11 +2482,6 @@
last -- ends with a forward jump of this sort. */
unsigned char *fixup_alt_jump = 0;
- /* Counts open-groups as they are encountered. Remembered for the
- matching close-group on the compile stack, so the same register
- number is put in the stop_memory as the start_memory. */
- regnum_t regnum = 0;
-
/* Work area for range table of charset. */
struct range_table_work_area range_table_work;
@@ -3123,28 +3118,54 @@
handle_open:
{
int shy = 0;
+ regnum_t regnum = 0;
if (p+1 < pend)
{
/* Look for a special (?...) construct */
if ((syntax & RE_SHY_GROUPS) && *p == '?')
{
PATFETCH (c); /* Gobble up the '?'. */
+ while (!shy)
+ {
PATFETCH (c);
switch (c)
{
case ':': shy = 1; break;
+ case '0':
+ /* An explicitly specified regnum must start
+ with non-0. */
+ if (regnum == 0)
+ FREE_STACK_RETURN (REG_BADPAT);
+ case '1': case '2': case '3': case '4':
+ case '5': case '6': case '7': case '8': case '9':
+ regnum = 10*regnum + (c - '0'); break;
default:
/* Only (?:...) is supported right now. */
FREE_STACK_RETURN (REG_BADPAT);
}
}
}
+ }
if (!shy)
- {
- bufp->re_nsub++;
- regnum++;
+ regnum = ++bufp->re_nsub;
+ else if (regnum)
+ { /* It's actually not shy, but explicitly numbered. */
+ shy = 0;
+ if (regnum > bufp->re_nsub)
+ bufp->re_nsub = regnum;
+ else if (regnum > bufp->re_nsub
+ /* Ideally, we'd want to check that the specified
+ group can't have matched (i.e. all subgroups
+ using the same regnum are in other branches of
+ OR patterns), but we don't currently keep track
+ of enough info to do that easily. */
+ || group_in_compile_stack (compile_stack, regnum))
+ FREE_STACK_RETURN (REG_BADPAT);
}
+ else
+ /* It's really shy. */
+ regnum = - bufp->re_nsub;
if (COMPILE_STACK_FULL)
{
@@ -3163,12 +3184,11 @@
COMPILE_STACK_TOP.fixup_alt_jump
= fixup_alt_jump ? fixup_alt_jump - bufp->buffer + 1 : 0;
COMPILE_STACK_TOP.laststart_offset = b - bufp->buffer;
- COMPILE_STACK_TOP.regnum = shy ? -regnum : regnum;
+ COMPILE_STACK_TOP.regnum = regnum;
- /* Do not push a
- start_memory for groups beyond the last one we can
- represent in the compiled pattern. */
- if (regnum <= MAX_REGNUM && !shy)
+ /* Do not push a start_memory for groups beyond the last one
+ we can represent in the compiled pattern. */
+ if (regnum <= MAX_REGNUM && regnum > 0)
BUF_PUSH_2 (start_memory, regnum);
compile_stack.avail++;
@@ -3213,7 +3233,7 @@
/* We don't just want to restore into `regnum', because
later groups should continue to be numbered higher,
as in `(ab)c(de)' -- the second group is #2. */
- regnum_t this_group_regnum;
+ regnum_t regnum;
compile_stack.avail--;
begalt = bufp->buffer + COMPILE_STACK_TOP.begalt_offset;
@@ -3222,7 +3242,7 @@
? bufp->buffer + COMPILE_STACK_TOP.fixup_alt_jump - 1
: 0;
laststart = bufp->buffer + COMPILE_STACK_TOP.laststart_offset;
- this_group_regnum = COMPILE_STACK_TOP.regnum;
+ regnum = COMPILE_STACK_TOP.regnum;
/* If we've reached MAX_REGNUM groups, then this open
won't actually generate any code, so we'll have to
clear pending_exact explicitly. */
@@ -3230,8 +3250,8 @@
/* We're at the end of the group, so now we know how many
groups were inside this one. */
- if (this_group_regnum <= MAX_REGNUM && this_group_regnum > 0)
- BUF_PUSH_2 (stop_memory, this_group_regnum);
+ if (regnum <= MAX_REGNUM && regnum > 0)
+ BUF_PUSH_2 (stop_memory, regnum);
}
break;
@@ -3557,8 +3577,9 @@
reg = c - '0';
- /* Can't back reference to a subexpression before its end. */
- if (reg > regnum || group_in_compile_stack (compile_stack, reg))
+ if (reg > bufp->re_nsub || reg < 1
+ /* Can't back reference to a subexp before its end. */
+ || group_in_compile_stack (compile_stack, reg))
FREE_STACK_RETURN (REG_ESUBREG);
laststart = b;
* Re: Explicitly numbered subgroups in regular expressions
2007-06-09 20:37 Explicitly numbered subgroups in regular expressions Stefan Monnier
@ 2007-06-10 13:19 ` Richard Stallman
2007-06-10 13:44 ` Stefan Monnier
2007-06-12 18:42 ` Stefan Monnier
0 siblings, 2 replies; 7+ messages in thread
From: Richard Stallman @ 2007-06-10 13:19 UTC
To: Stefan Monnier; +Cc: emacs-devel
    What it does is add a new regexp syntax \(?<num>:<regexp>\) which is like
    \(<regexp>\) except that it specifies explicitly the number of the subgroup.
    E.g. (and (string-match "\\(?3:a\\)" "a") (match-data)) returns (0 1 nil
    nil nil nil 0 1). There is no backward compatibility issue with this patch:
    such regexps are currently rejected as invalid.
The feature seems useful, but how does this affect the numbering of
groups that don't specify a number? That has to be done right, then
documented in the two manuals.
* Re: Explicitly numbered subgroups in regular expressions
2007-06-10 13:19 ` Richard Stallman
@ 2007-06-10 13:44 ` Stefan Monnier
2007-06-11 9:44 ` Richard Stallman
2007-06-12 18:42 ` Stefan Monnier
1 sibling, 1 reply; 7+ messages in thread
From: Stefan Monnier @ 2007-06-10 13:44 UTC
To: rms; +Cc: emacs-devel
> What it does is add a new regexp syntax \(?<num>:<regexp>\) which is
> like \(<regexp>\) except that it specifies explicitly the number of
> the subgroup. E.g. (and (string-match "\\(?3:a\\)" "a") (match-data))
> returns (0 1 nil nil nil nil 0 1). There is no backward compatibility
> issue with this patch: such regexps are currently rejected as invalid.
> The feature seems useful, but how does this affect the numbering of
> groups that don't specify a number?
A subgroup that doesn't explicitly specify a number gets the "smallest
natural number larger than all previous ones". I.e.
(and (string-match "\\(?3:a\\)\\(b\\)" "ab") (match-data))
returns
(0 2 nil nil nil nil 0 1 1 2)
since the second group gets number 4.
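The rule can be sketched as a small stand-alone scanner. This is a simplified illustration under stated assumptions, not the code in regex.c: `assign_group_numbers` is a hypothetical helper, it takes the pattern as the regexp compiler sees it (single backslashes), it does not treat bracket expressions specially, and it omits the patch's check that an explicitly numbered group is not still open on the compile stack.

```c
#include <ctype.h>

/* Store in OUT the number assigned to each `\(' of PATTERN, in
   order of appearance; a shy `\(?:' group is recorded as 0.
   Return the number of groups seen, or -1 if a `\(?' prefix is
   malformed or OUT is too small.  */
int
assign_group_numbers (const char *pattern, int *out, int max_out)
{
  int re_nsub = 0;  /* highest group number used so far */
  int count = 0;
  const char *p = pattern;

  while (*p)
    {
      if (p[0] != '\\' || p[1] == '\0')
        {
          p++;
          continue;
        }
      if (p[1] != '(')
        {
          p += 2;  /* some other escape, e.g. `\)' or `\|' */
          continue;
        }
      p += 2;
      {
        int regnum = 0;
        int shy = 0;

        if (*p == '?')
          {
            p++;
            while (isdigit ((unsigned char) *p))
              {
                /* An explicit number must not start with 0.  */
                if (*p == '0' && regnum == 0)
                  return -1;
                regnum = 10 * regnum + (*p - '0');
                p++;
              }
            if (*p != ':')
              return -1;
            p++;
            shy = (regnum == 0);
          }

        if (shy)
          regnum = 0;            /* really shy: no register */
        else if (regnum == 0)
          regnum = ++re_nsub;    /* implicit: next number above all previous */
        else if (regnum > re_nsub)
          re_nsub = regnum;      /* explicit: raises the high-water mark */

        if (count >= max_out)
          return -1;
        out[count++] = regnum;
      }
    }
  return count;
}
```

For the pattern above the scanner records 3 and then 4. For a pattern that reuses an explicit number in two alternatives, such as `\(?1:a\)\|\(?1:b\)`, it records 1 twice, and a following unnumbered group would get number 2.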
> That has to be done right, then documented in the two manuals.
Sure,
Stefan
* Re: Explicitly numbered subgroups in regular expressions
2007-06-10 13:19 ` Richard Stallman
2007-06-10 13:44 ` Stefan Monnier
@ 2007-06-12 18:42 ` Stefan Monnier
2007-06-12 20:56 ` Juri Linkov
2007-06-13 16:23 ` Richard Stallman
1 sibling, 2 replies; 7+ messages in thread
From: Stefan Monnier @ 2007-06-12 18:42 UTC
To: rms; +Cc: emacs-devel
> The feature seems useful, but how does this affect the numbering of
> groups that don't specify a number? That has to be done right, then
> documented in the two manuals.
Done. Except I have only documented it in the Elisp manual: I believe it's
only useful for Lisp programming, not for interactive regexp use.
Stefan
* Re: Explicitly numbered subgroups in regular expressions
2007-06-12 18:42 ` Stefan Monnier
@ 2007-06-12 20:56 ` Juri Linkov
2007-06-13 16:23 ` Richard Stallman
1 sibling, 0 replies; 7+ messages in thread
From: Juri Linkov @ 2007-06-12 20:56 UTC
To: Stefan Monnier; +Cc: rms, emacs-devel
>> The feature seems useful, but how does this affect the numbering of
>> groups that don't specify a number? That has to be done right, then
>> documented in the two manuals.
>
> Done. Except I have only documented it in the Elisp manual: I believe it's
> only useful for Lisp programming, not for interactive regexp use.
This feature makes regexps more readable and less error-prone, so I think
it would also be useful interactively, e.g. in query-replace-regexp.
Now that it has been installed, I noticed that in Emacs Lisp files the
group numbers in regexps are not highlighted in the
font-lock-regexp-grouping-construct face the way the other grouping
constructs are. The patch below adds highlighting for them. It also uses
this new feature in the same regexp ;)
Index: lisp/font-lock.el
===================================================================
RCS file: /sources/emacs/emacs/lisp/font-lock.el,v
retrieving revision 1.317
diff -c -r1.317 font-lock.el
*** lisp/font-lock.el 21 Apr 2007 14:30:25 -0000 1.317
--- lisp/font-lock.el 12 Jun 2007 20:55:41 -0000
***************
*** 2279,2285 ****
;; that do not occur in strings. The associated regexp matches one
;; of `\\\\' `\\(' `\\(?:' `\\|' `\\)'. `\\\\' has been included to
;; avoid highlighting, for example, `\\(' in `\\\\('.
! (while (re-search-forward "\\(\\\\\\\\\\)\\(?:\\(\\\\\\\\\\)\\|\\((\\(?:\\?:\\)?\\|[|)]\\)\\)" bound t)
(unless (match-beginning 2)
(let ((face (get-text-property (1- (point)) 'face)))
(when (or (and (listp face)
--- 2279,2285 ----
;; that do not occur in strings. The associated regexp matches one
;; of `\\\\' `\\(' `\\(?:' `\\|' `\\)'. `\\\\' has been included to
;; avoid highlighting, for example, `\\(' in `\\\\('.
! (while (re-search-forward "\\(?1:\\\\\\\\\\)\\(?:\\(?2:\\\\\\\\\\)\\|\\(?3:(\\(?:\\?[0-9]*:\\)?\\|[|)]\\)\\)" bound t)
(unless (match-beginning 2)
(let ((face (get-text-property (1- (point)) 'face)))
(when (or (and (listp face)
--
Juri Linkov
http://www.jurta.org/emacs/
* Re: Explicitly numbered subgroups in regular expressions
2007-06-12 18:42 ` Stefan Monnier
2007-06-12 20:56 ` Juri Linkov
@ 2007-06-13 16:23 ` Richard Stallman
1 sibling, 0 replies; 7+ messages in thread
From: Richard Stallman @ 2007-06-13 16:23 UTC
To: Stefan Monnier; +Cc: emacs-devel
    Done. Except I have only documented it in the Elisp manual: I believe it's
    only useful for Lisp programming, not for interactive regexp use.
I think you are right.