unofficial mirror of emacs-devel@gnu.org 
* using non-Emacs regexp syntax
@ 2006-11-28 20:56 Paul Pogonyshev
  2006-11-29 16:26 ` Richard Stallman
  0 siblings, 1 reply; 15+ messages in thread
From: Paul Pogonyshev @ 2006-11-28 20:56 UTC (permalink / raw)


Hi,

Is there a function to convert non-Emacs regexps (e.g. "ab(c+|d)") to
Emacs regexps (for example, to "ab\(c+\|d\)")?

If there is none, are you interested in adding such functions?  (Of
course, not now, but after the release.)  I assume it is not worth it
to implement in C, so a Lisp implementation is in order?

(Abstract task is like this: be able to read regexps from an (XML)
file, which should be readable not only by Emacs; since Emacs syntax
is not widespread, regexps would use a different syntax.)

Paul
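
As an illustration of the conversion asked about above, here is a minimal sketch. It is not from the thread: the name `my-ere-to-emacs-regexp` is hypothetical, and it assumes the input is egrep-style syntax in which only ( ) | { } differ from Emacs syntax. Bracket expressions are copied through unchanged, and a POSIX class such as [[:alpha:]] is only handled in the simple case where nothing special follows it inside the brackets.

```elisp
(defun my-ere-to-emacs-regexp (ere)
  "Convert the egrep-style regexp ERE to Emacs syntax (hypothetical helper).
Only grouping, alternation and interval operators are swapped;
bracket expressions are copied through unchanged."
  (let ((i 0) (len (length ere)) (out '()))
    (while (< i len)
      (let ((ch (aref ere i)))
        (cond
         ;; Backslash escape: an escaped operator is a literal, which
         ;; Emacs writes bare; any other escape is kept as it is.
         ((and (eq ch ?\\) (< (1+ i) len))
          (let ((next (aref ere (1+ i))))
            (if (memq next '(?\( ?\) ?\| ?{ ?}))
                (push (string next) out)
              (push (string ch next) out)))
          (setq i (+ i 2)))
         ;; Bracket expression: copy up to the closing bracket.  A `]'
         ;; directly after `[' or `[^' is a literal member.
         ((eq ch ?\[)
          (let ((j (1+ i)))
            (when (and (< j len) (eq (aref ere j) ?^)) (setq j (1+ j)))
            (when (and (< j len) (eq (aref ere j) ?\])) (setq j (1+ j)))
            (while (and (< j len) (not (eq (aref ere j) ?\])))
              (setq j (1+ j)))
            (setq j (min len (1+ j)))
            (push (substring ere i j) out)
            (setq i j)))
         ;; Bare operator in the input: Emacs wants it backslashed.
         ((memq ch '(?\( ?\) ?\| ?{ ?}))
          (push (string ?\\ ch) out)
          (setq i (1+ i)))
         (t (push (string ch) out)
            (setq i (1+ i))))))
    (apply #'concat (nreverse out))))
```

With this sketch, (my-ere-to-emacs-regexp "ab(c+|d)") returns the Lisp string "ab\\(c+\\|d\\)", that is, the regexp ab\(c+\|d\).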

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: using non-Emacs regexp syntax
  2006-11-28 20:56 using non-Emacs regexp syntax Paul Pogonyshev
@ 2006-11-29 16:26 ` Richard Stallman
  2006-11-29 16:38   ` Drew Adams
  2006-11-29 19:06   ` Paul Pogonyshev
  0 siblings, 2 replies; 15+ messages in thread
From: Richard Stallman @ 2006-11-29 16:26 UTC (permalink / raw)
  Cc: emacs-devel

    Is there a function to convert non-Emacs regexps (e.g. "ab(c+|d)" to
    Emacs regexps (example to "ab\(c+\|d\)")?

The first form appears to be an "extended regexp" or egrep-style regexp.
The second appears to be a "basic regexp" or grep-style regexp.

This conversion feature in Lisp would be useful to add after the release.

* RE: using non-Emacs regexp syntax
  2006-11-29 16:26 ` Richard Stallman
@ 2006-11-29 16:38   ` Drew Adams
  2006-11-29 17:23     ` David Kastrup
  2006-11-29 19:06   ` Paul Pogonyshev
  1 sibling, 1 reply; 15+ messages in thread
From: Drew Adams @ 2006-11-29 16:38 UTC (permalink / raw)


>     Is there a function to convert non-Emacs regexps (e.g. "ab(c+|d)" to
>     Emacs regexps (example to "ab\(c+\|d\)")?
>
> The first form appears to be an "extended regexp" or egrep-style regexp.
> The second appears to be a "basic regexp" or grep-style regexp.
>
> This conversion feature in Lisp would be useful to add after the release.

Very glad to hear that.

I'm hoping there will also be support for toggling the newline sensitivity
of dot. This means a "dot-matches-newline" mode (aka "single-line" mode)
where `.' will also match newline. Please see the thread "short regexp to
match any character?" from 2006/03/04 and 03/11.

* Re: using non-Emacs regexp syntax
  2006-11-29 16:38   ` Drew Adams
@ 2006-11-29 17:23     ` David Kastrup
  2006-11-29 19:13       ` Paul Pogonyshev
  2006-12-01 20:30       ` Stuart D. Herring
  0 siblings, 2 replies; 15+ messages in thread
From: David Kastrup @ 2006-11-29 17:23 UTC (permalink / raw)
  Cc: emacs-devel

"Drew Adams" <drew.adams@oracle.com> writes:

>>     Is there a function to convert non-Emacs regexps (e.g. "ab(c+|d)" to
>>     Emacs regexps (example to "ab\(c+\|d\)")?
>>
>> The first form appears to be an "extended regexp" or egrep-style regexp.
>> The second appears to be a "basic regexp" or grep-style regexp.
>>
>> This conversion feature in Lisp would be useful to add after the release.
>
> Very glad to hear that.
>
> I'm hoping there will also be support for toggling the newline sensitivity
> of dot. This means a "dot-matches-newline" mode (aka "single-line" mode)
> where `.' will also match newline. Please see the thread "short regexp to
> match any character?" from 2006/03/04 and 03/11.

I don't know of any other matcher where dot matches a newline.  Far
more relevant would be negated character ranges like [^A-Z] that do
_not_ match newline by default.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum

* Re: using non-Emacs regexp syntax
  2006-11-29 16:26 ` Richard Stallman
  2006-11-29 16:38   ` Drew Adams
@ 2006-11-29 19:06   ` Paul Pogonyshev
  1 sibling, 0 replies; 15+ messages in thread
From: Paul Pogonyshev @ 2006-11-29 19:06 UTC (permalink / raw)


Richard Stallman wrote:
>     Is there a function to convert non-Emacs regexps (e.g. "ab(c+|d)" to
>     Emacs regexps (example to "ab\(c+\|d\)")?
> 
> The first form appears to be an "extended regexp" or egrep-style regexp.
> The second appears to be a "basic regexp" or grep-style regexp.
> 
> This conversion feature in Lisp would be useful to add after the release.

If you don't mind, I'll work on it now.  Changes can be added to whatever
.el file in the distribution later.

Also, is there sense in supporting conversion to and from several formats?
E.g. some require that plus operator is escaped, while everything else is
not.  E.g. something like this:

	(convert-regexp :sed :emacs some-regexp)
			FROM   TO   PATTERN-STRING

Of course, it will add more complexity, but it shouldn't be much of a
problem for users of this function and implementing it in Lisp should still
not be hard.

Paul

* Re: using non-Emacs regexp syntax
  2006-11-29 17:23     ` David Kastrup
@ 2006-11-29 19:13       ` Paul Pogonyshev
  2006-11-29 20:53         ` Jari Aalto
  2006-11-30  2:11         ` Drew Adams
  2006-12-01 20:30       ` Stuart D. Herring
  1 sibling, 2 replies; 15+ messages in thread
From: Paul Pogonyshev @ 2006-11-29 19:13 UTC (permalink / raw)
  Cc: Drew Adams

David Kastrup wrote:
> "Drew Adams" <drew.adams@oracle.com> writes:
> 
> >>     Is there a function to convert non-Emacs regexps (e.g. "ab(c+|d)" to
> >>     Emacs regexps (example to "ab\(c+\|d\)")?
> >>
> >> The first form appears to be an "extended regexp" or egrep-style regexp.
> >> The second appears to be a "basic regexp" or grep-style regexp.
> >>
> >> This conversion feature in Lisp would be useful to add after the release.
> >
> > Very glad to hear that.
> >
> > I'm hoping there will also be support for toggling the newline sensitivity
> > > of dot. This means a "dot-matches-newline" mode (aka "single-line" mode)
> > where `.' will also match newline. Please see the thread "short regexp to
> > match any character?" from 2006/03/04 and 03/11.
> 
> I don't know any other matcher where dot matches a newline.  Quite
> more relevant would be inverse character ranges like [^A-Z] that do
> _not_ match newline by default.

As far as I remember, Perl regexp syntax has a flag to match or not match
newline by default.  Emacs could adopt a similar flag facility, or use native
flag variables (like `case-fold-search', just with a different meaning.)

Paul

* Re: using non-Emacs regexp syntax
  2006-11-29 19:13       ` Paul Pogonyshev
@ 2006-11-29 20:53         ` Jari Aalto
  2006-11-30  2:11         ` Drew Adams
  1 sibling, 0 replies; 15+ messages in thread
From: Jari Aalto @ 2006-11-29 20:53 UTC (permalink / raw)


Paul Pogonyshev <pogonyshev@gmx.net> writes:

> David Kastrup wrote:
> > "Drew Adams" <drew.adams@oracle.com> writes:
> > 
> > >>     Is there a function to convert non-Emacs regexps (e.g. "ab(c+|d)" to
> > >>     Emacs regexps (example to "ab\(c+\|d\)")?
> > >>
> > >> The first form appears to be an "extended regexp" or egrep-style regexp.
> > >> The second appears to be a "basic regexp" or grep-style regexp.
> > >>
> > >> This conversion feature in Lisp would be useful to add after the release.
> > >
> > > Very glad to hear that.
> > >
> > > I'm hoping there will also be support for toggling the newline sensitivity
> > > > of dot. This means a "dot-matches-newline" mode (aka "single-line" mode)
> > > where `.' will also match newline. Please see the thread "short regexp to
> > > match any character?" from 2006/03/04 and 03/11.
> > 
> > I don't know any other matcher where dot matches a newline.  Quite
> > more relevant would be inverse character ranges like [^A-Z] that do
> > _not_ match newline by default.
> 
> As far as I remember, Perl regexp syntax has a flag to match or not match
> newline by default.  

Yes, it does. It goes like this:

    "This\nLine"

    /.*/        "This"
    /.*/s       Make dot match across lines: "This\nLine"

Common modifiers used are:

    i           Case (i)nsensitive match.
    g           (g)lobal match; all occurrences until the last one
    m           (m)ultiline anchors. The ^ and $ match in between the
                lines, not just at the beginning or end of string.
    s           (s)ingle-line. Change semantics of "." to also match \n.
    x           e(x)tended. Treat all white space in regexp as
                non-significant. Makes it possible to write readable
                regular expressions /Like  \s  This/x
    
Jari
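
For comparison, here is a rough sketch of how three of these modifiers map onto existing Emacs behaviour. It is not from the thread, and the function name `my-perl-flag-demo` is hypothetical. It relies only on `string-match' and `case-fold-search'. Emacs has no direct counterpart to /s, so the last form spells out "dot or newline" by hand.

```elisp
(defun my-perl-flag-demo ()
  "Return match positions that mirror Perl's /i, /m and /s (hypothetical demo)."
  (list
   ;; /i: case folding is controlled by the variable `case-fold-search'.
   (let ((case-fold-search t))
     (string-match "this" "THIS\nLine"))
   ;; /m: in Emacs, ^ already matches after every newline in a string...
   (string-match "^Line" "This\nLine")
   ;; ...and \` is the anchor that matches only at the very beginning.
   (string-match "\\`Line" "This\nLine")
   ;; Without /s: `.' does not match a newline.
   (string-match "This.Line" "This\nLine")
   ;; /s written out by hand: dot or newline.
   (string-match "This\\(?:.\\|\n\\)Line" "This\nLine")))
```

Evaluating (my-perl-flag-demo) gives (0 5 nil nil 0).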

* RE: using non-Emacs regexp syntax
  2006-11-29 19:13       ` Paul Pogonyshev
  2006-11-29 20:53         ` Jari Aalto
@ 2006-11-30  2:11         ` Drew Adams
  2006-11-30 14:26           ` Stefan Monnier
  1 sibling, 1 reply; 15+ messages in thread
From: Drew Adams @ 2006-11-30  2:11 UTC (permalink / raw)


> > I don't know any other matcher where dot matches a newline.  Quite
> > more relevant would be inverse character ranges like [^A-Z] that do
> > _not_ match newline by default.
>
> As far as I remember, Perl regexp syntax has a flag to
> match or not match newline by default.

Exactly. This (perl's "single-line" mode) was mentioned in the thread I
cited, as well as the reason (use) for such a feature for Emacs.

> Emacs could adopt a similar flag facility,
> or use native flag variables (like `case-fold-search',
> just with a different meaning.)

Juri proposed a simple implementation: 'setting a new variable
`search-dot-regexp' to "\\(.\\|[\\n]\\)".'

Please see the (3-message) thread. This would be very handy for interactive
use of regexps, IMO, especially with a toggle command (bound to, say,
`C-.').
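
A minimal sketch of how such a rewrite could look in Lisp follows. It is an illustration only: the names `my-search-dot-regexp` and `my-dot-matches-newline` are hypothetical, and it uses a literal newline in a shy group instead of the quoted "[\\n]", because as a Lisp string "[\\n]" yields a bracket expression containing a backslash and the letter n. Only an unescaped `.' outside bracket expressions is replaced.

```elisp
(defvar my-search-dot-regexp "\\(?:.\\|\n\\)"
  "Hypothetical replacement text for a `.' that should also match newline.")

(defun my-dot-matches-newline (regexp)
  "Return REGEXP with each unescaped `.' outside brackets made to match newline."
  (let ((i 0) (len (length regexp)) (out '()))
    (while (< i len)
      (let ((ch (aref regexp i)))
        (cond
         ;; Keep escapes such as \. untouched.
         ((and (eq ch ?\\) (< (1+ i) len))
          (push (substring regexp i (+ i 2)) out)
          (setq i (+ i 2)))
         ;; Copy a bracket expression verbatim; `.' is literal inside it.
         ((eq ch ?\[)
          (let ((j (1+ i)))
            (when (and (< j len) (eq (aref regexp j) ?^)) (setq j (1+ j)))
            (when (and (< j len) (eq (aref regexp j) ?\])) (setq j (1+ j)))
            (while (and (< j len) (not (eq (aref regexp j) ?\])))
              (setq j (1+ j)))
            (setq j (min len (1+ j)))
            (push (substring regexp i j) out)
            (setq i j)))
         ;; A bare dot: substitute the dot-or-newline group.
         ((eq ch ?\.)
          (push my-search-dot-regexp out)
          (setq i (1+ i)))
         (t (push (string ch) out)
            (setq i (1+ i))))))
    (apply #'concat (nreverse out))))
```

For example, (string-match "a.b" "a\nb") returns nil, while (string-match (my-dot-matches-newline "a.b") "a\nb") returns 0. A toggle command would then only need to decide whether to pass the user's regexp through this function.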

* Re: using non-Emacs regexp syntax
  2006-11-30  2:11         ` Drew Adams
@ 2006-11-30 14:26           ` Stefan Monnier
  2006-12-05  5:16             ` Drew Adams
  0 siblings, 1 reply; 15+ messages in thread
From: Stefan Monnier @ 2006-11-30 14:26 UTC (permalink / raw)
  Cc: emacs-devel, Paul Pogonyshev

> Juri proposed a simple implementation: 'setting a new variable
> `search-dot-regexp' to "\\(.\\|[\\n]\\)".'

There's already a much more efficient implementation.  See regex.h:

   /* If this bit is set, then . matches newline.
      If not set, then it doesn't.  */
   #define RE_DOT_NEWLINE (RE_CONTEXT_INVALID_OPS << 1)


-- Stefan

* Re: using non-Emacs regexp syntax
  2006-11-29 17:23     ` David Kastrup
  2006-11-29 19:13       ` Paul Pogonyshev
@ 2006-12-01 20:30       ` Stuart D. Herring
  1 sibling, 0 replies; 15+ messages in thread
From: Stuart D. Herring @ 2006-12-01 20:30 UTC (permalink / raw)
  Cc: Drew Adams, emacs-devel

> I don't know any other matcher where dot matches a newline.  Quite
> more relevant would be inverse character ranges like [^A-Z] that do
> _not_ match newline by default.

Not to compare the editors, or even really to inform the decision, but
just for complete information: dot always matches a newline in sed(1),
although often sed is matching against strings that contain no newlines,
since it takes a line at a time without the newline.  However, if newlines
are added to the pattern space, . matches them.

Davis

-- 
This product is sold by volume, not by mass.  If it appears too dense or
too sparse, it is because mass-energy conversion has occurred during
shipping.

* Re: using non-Emacs regexp syntax
@ 2006-12-01 22:35 Stuart D. Herring
  2006-12-01 22:54 ` Paul Pogonyshev
  2006-12-02  2:38 ` Stefan Monnier
  0 siblings, 2 replies; 15+ messages in thread
From: Stuart D. Herring @ 2006-12-01 22:35 UTC (permalink / raw)
  Cc: rms, emacs-devel

[-- Attachment #1: Type: text/plain, Size: 2007 bytes --]

> If you don't mind, I'll work on it now.  Changes can be added to whatever
> .el file in the distribution later.
>
> Also, is there sense in supporting conversion to and from several formats?
> E.g. some require that plus operator is escaped, while everything else is
> not.  E.g. something like this:
>
> 	(convert-regexp :sed :emacs some-regexp)
> 			FROM   TO   PATTERN-STRING
>
> Of course, it will add more complexity, but it shouldn't be much of a
> problem for users of this function and implementing it in Lisp should
> still not be hard.

I've already started on this sort of thing, writing a converter just
between the two formats supported by GNU grep.  (These are
"GNU-extended-basic-RE" and "extended-RE with backreferences".)  As it
happens, that conversion can be done with one function because the formats
are so similar.  I had planned to go on to the more general case, but for
now I'll just provide what I have for comment and/or use.  (I have papers,
so any use is fine.)  If, Paul, you'd like, we can collaborate on this, or
one of us of your choice can go on with it.

For reference/goal purposes, I've been looking at the (somewhat outdated)
Mastering Regular Expressions and it describes these syntaxes:
1.  vi
2. (modern) grep
3. egrep
4. sed
5. lex
6. old awk
7. new awk(s) (don't know how different they really are from each other or
from old awk)
8. Emacs
9. Perl (obviously we can only convert a subset of Perl's syntax...)
10. Tcl
11. a Tcl library called Expect (although I don't know if/why it has a
different syntax from Tcl itself)
12. Python (complicated by the old regex and the new re packages, and how
the former had a variable syntax)

Hope it's helpful,
Davis

PS - I originally wrote this using some convenience macros of mine.  It
seems to work after I standardized it, but that's probably why if it
doesn't.

-- 
This product is sold by volume, not by mass.  If it appears too dense or
too sparse, it is because mass-energy conversion has occurred during
shipping.

[-- Attachment #2: convert-re.el --]
[-- Type: application/octet-stream, Size: 1396 bytes --]

;; Remember the exceedingly-basic regexes as used by sed(1)... might need to
;; support them too, although converting into them can be a pain.  Obviously,
;; in general you can't have just one function.

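(defun convert-regexp (re)
  "Convert the regexp RE from basic to extended format or back.
The two GNU grep syntaxes differ only in whether ?+()|{} need a
backslash to act as operators, so one function covers both directions."
  ;; BACKSLASH is t on the iteration that reads a backslash, 0 while the
  ;; character after it is processed, and nil otherwise.
  (let ((chars (string-to-list re)) ret backslash)
    (while chars
      (let ((curchar (car chars)))
        (cond
         ((eq curchar ?\\)
          ;; A second consecutive backslash is an escaped backslash.
          (unless (setq backslash (not backslash))
            (push ?\\ ret) (push ?\\ ret)))
         ((eq curchar ?\[)
          (if backslash (progn (push ?\\ ret) (push ?\[ ret))
            ;; Otherwise, it's a character class: copy it verbatim.
            (push ?\[ ret)
            (setq chars (cdr chars))
            ;; FIRST is non-nil while a `]' would still be a literal
            ;; member, i.e. directly after `[' or `[^'.
            (let ((level 1) (first 'start))
              (while (and chars (> level 0))
                (let ((clch (car chars)))
                  (push clch ret)
                  (cond
                   ;; Only [: [. and [= open a nested bracket.
                   ((and (eq clch ?\[) (memq (cadr chars) '(?: ?\. ?=)))
                    (setq level (1+ level)))
                   ((and (eq clch ?\]) (not first))
                    (setq level (1- level))))
                  (setq first (and (eq first 'start) (eq clch ?^) 'caret)))
                (unless (zerop level) (setq chars (cdr chars)))))))
         ((memq curchar (string-to-list "?+()|{}"))
          ;; Toggle the escaping of the operators that differ.
          (unless backslash (push ?\\ ret))
          (push curchar ret))
         (t (if backslash (push ?\\ ret)) (push curchar ret))))
      (setq backslash (and backslash (unless (numberp backslash) 0))
            chars (cdr chars)))
    (concat (nreverse ret))))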

* Re: using non-Emacs regexp syntax
  2006-12-01 22:35 Stuart D. Herring
@ 2006-12-01 22:54 ` Paul Pogonyshev
  2006-12-03 20:22   ` Juri Linkov
  2006-12-02  2:38 ` Stefan Monnier
  1 sibling, 1 reply; 15+ messages in thread
From: Paul Pogonyshev @ 2006-12-01 22:54 UTC (permalink / raw)
  Cc: rms

Stuart D. Herring wrote:
> > If you don't mind, I'll work on it now.  Changes can be added to whatever
> > .el file in the distribution later.
> >
> > Also, is there sense in supporting conversion to and from several formats?
> > E.g. some require that plus operator is escaped, while everything else is
> > not.  E.g. something like this:
> >
> > 	(convert-regexp :sed :emacs some-regexp)
> > 			FROM   TO   PATTERN-STRING
> >
> > Of course, it will add more complexity, but it shouldn't be much of a
> > problem for users of this function and implementing it in Lisp should
> > still not be hard.
> 
> I've already started on this sort of thing, writing a converter just
> between the two formats supported by GNU grep.  (These are
> "GNU-extended-basic-RE" and "extended-RE with backreferences".)  As it
> happens, that conversion can be done with one function because the formats
> are so similar.  I had planned to go on to the more general case, but for
> now I'll just provide what I have for comment and/or use.  (I have papers,
> so any use is fine.)  If, Paul, you'd like, we can collaborate on this, or
> one of us of your choice can go on with it.
> 
> [...]

I will happily pass this to you if you wish.  I planned a more generic
implementation which can be briefly described as this:

* Each implemented format provides a table of associations
  construct-name -> construct-generator (some constructs, like []
  character class, will require a parameter.)  In the simplest form,
  construct-generator can be just a fixed string, which will suffice in
  most cases.

* Each format also provides a parser that splits a regexp into a list
  of construct names.

* Entry function (or a helper for it) combines together a table for
  output format and a parser for input format.  The result is a regexp
  in output format.

Maybe it is too slow, though.  However, given that Emacs lived happily
without this sort of function, it can hardly be too slow.  But maybe
you can come up with a simpler solution.
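
The three points above can be sketched roughly as follows. This is an illustration under several assumptions, not the planned implementation: all `my-...` names are hypothetical, every generator is a fixed string, one generic parser does a reverse lookup in the input format's table instead of each format having its own parser, and bracket expressions and escapes such as \w are not handled. The :gnu-bre entry stands for the GNU extensions to basic regexps, where plus must be escaped.

```elisp
(defvar my-regexp-syntaxes
  '((:ere     (group-open . "(")   (group-close . ")")   (alt . "|")
              (any . ".") (star . "*") (plus . "+")   (opt . "?"))
    (:emacs   (group-open . "\\(") (group-close . "\\)") (alt . "\\|")
              (any . ".") (star . "*") (plus . "+")   (opt . "?"))
    (:gnu-bre (group-open . "\\(") (group-close . "\\)") (alt . "\\|")
              (any . ".") (star . "*") (plus . "\\+") (opt . "\\?")))
  "Hypothetical tables mapping construct names to their spelling per format.")

(defun my-regexp-parse (regexp syntax)
  "Split REGEXP, written in SYNTAX, into construct names and literal chars."
  (let ((table (cdr (assq syntax my-regexp-syntaxes)))
        (i 0) (len (length regexp)) (tokens '()))
    (while (< i len)
      (let* ((one (substring regexp i (1+ i)))
             (two (and (< (1+ i) len) (substring regexp i (+ i 2))))
             (hit1 (car (rassoc one table)))
             (hit2 (and two (car (rassoc two table)))))
        (cond
         ;; Two-character construct such as \( in Emacs syntax.
         (hit2 (push hit2 tokens) (setq i (+ i 2)))
         ;; Any other backslash pair is an escaped literal character.
         ((and two (eq (aref two 0) ?\\))
          (push (aref two 1) tokens) (setq i (+ i 2)))
         ;; One-character construct such as ( in ERE syntax.
         (hit1 (push hit1 tokens) (setq i (1+ i)))
         (t (push (aref one 0) tokens) (setq i (1+ i))))))
    (nreverse tokens)))

(defun my-regexp-generate (tokens syntax)
  "Spell TOKENS, as produced by `my-regexp-parse', in SYNTAX."
  (let ((table (cdr (assq syntax my-regexp-syntaxes))))
    (mapconcat
     (lambda (tok)
       (cond
        ((symbolp tok) (cdr (assq tok table)))
        ;; A literal that would read as a construct needs a backslash.
        ((or (eq tok ?\\) (rassoc (string tok) table))
         (string ?\\ tok))
        (t (string tok))))
     tokens "")))

(defun my-convert-regexp (from to regexp)
  "Convert REGEXP from format FROM to format TO (hypothetical entry point)."
  (my-regexp-generate (my-regexp-parse regexp from) to))
```

With these definitions, (my-convert-regexp :ere :emacs "ab(c+|d)") returns "ab\\(c+\\|d\\)", and (my-convert-regexp :emacs :gnu-bre "ab\\(c+\\|d\\)") returns "ab\\(c\\+\\|d\\)". Adding a format then means adding one table, at least for the constructs that are fixed strings.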

(One more thing: it probably makes sense to add conversion function
for replacement strings too.  E.g. some formats require $N, some
(like Emacs) use \N for referencing the matched group.)
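
A minimal sketch of such a replacement-string conversion, from $N to \N, might look like this. The function name is hypothetical, and it assumes the input format has no escaping convention of its own beyond $1 through $9; literal backslashes are doubled because that is how an Emacs replacement string spells one backslash.

```elisp
(defun my-dollar-to-backslash-replacement (replacement)
  "Convert $N group references in REPLACEMENT to Emacs form (hypothetical).
Assumes that only $1 to $9 are special in the input."
  ;; First double every literal backslash (LITERAL is t, so the
  ;; replacement text is inserted as it stands).
  (let ((doubled (replace-regexp-in-string "\\\\" "\\\\" replacement t t)))
    ;; Then turn $N into a backslash followed by the same digit.
    (replace-regexp-in-string "\\$\\([1-9]\\)" "\\\\\\1" doubled t)))
```

For example, (my-dollar-to-backslash-replacement "$2-$1") returns "\\2-\\1", which `replace-regexp-in-string' and `replace-match' understand as "group 2, a dash, group 1".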

Paul

* Re: using non-Emacs regexp syntax
  2006-12-01 22:35 Stuart D. Herring
  2006-12-01 22:54 ` Paul Pogonyshev
@ 2006-12-02  2:38 ` Stefan Monnier
  1 sibling, 0 replies; 15+ messages in thread
From: Stefan Monnier @ 2006-12-02  2:38 UTC (permalink / raw)
  Cc: emacs-devel, rms, Paul Pogonyshev

> I've already started on this sort of thing, writing a converter just
> between the two formats supported by GNU grep.  (These are

BTW, if the output of your function is only ever passed to Emacs, then it
may be worth instead exposing to elisp the full functionality of the
underlying regex.c code (which was originally not specific to Emacs and
has flags to support various syntax options, including whether { ( | and
friends should be backslashed or not; see regex.h).
Of course, maybe both would be useful, depending on the application.


        Stefan

* Re: using non-Emacs regexp syntax
  2006-12-01 22:54 ` Paul Pogonyshev
@ 2006-12-03 20:22   ` Juri Linkov
  0 siblings, 0 replies; 15+ messages in thread
From: Juri Linkov @ 2006-12-03 20:22 UTC (permalink / raw)
  Cc: rms, emacs-devel

> * Each implemented format provides a table of associations
>   construct-name -> construct-generator (some constructs,  like []
>   character class, will require a parameter.)  In the simplest form,
>   construct-generator can be just a fixed string, which will suffice in
>   most cases.
>
> * Each format also provides a parser that splits a regexp into a list
>   of construct-name.
>
> * Entry function (or a helper for it) combines together a table for
>   output format and a parser for input format.  The result is a regexp
>   in output format.

A good implementation would parse a regexp string to the sregex
Lisp-like syntax (qv emacs-lisp/sregex.el) which would allow doing
such things as (sregexq (a-new-function-to-parse-Perl-regexp PERL-REGEXP))
to convert a non-Emacs regexp to the Emacs one.
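
To illustrate the idea of going through a Lisp-like form, here is a small sketch that uses `rx' rather than sregex. That is a substitution: `rx' is a different library with a similar purpose, and the sketch assumes it is available. The form below is written by hand; producing it from a Perl or egrep regexp string is the part that would still need the new parsing function. The variable names are hypothetical.

```elisp
(require 'rx)

(defvar my-example-form '(seq "ab" (or (+ "c") "d"))
  "Hand-written symbolic form of the egrep-style regexp ab(c+|d).")

(defvar my-example-regexp (rx-to-string my-example-form t)
  "Emacs regexp string generated from `my-example-form'.")

;; The generated string can be passed to the usual matching functions.
(string-match my-example-regexp "xabccc")
```

The last form returns 1, the position where "abccc" starts.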

PS: I would much prefer to use this feature in query-replace-regexp
with an option to use extended regexps instead of basic regexps in its
interactive arguments.

-- 
Juri Linkov
http://www.jurta.org/emacs/

* RE: using non-Emacs regexp syntax
  2006-11-30 14:26           ` Stefan Monnier
@ 2006-12-05  5:16             ` Drew Adams
  0 siblings, 0 replies; 15+ messages in thread
From: Drew Adams @ 2006-12-05  5:16 UTC (permalink / raw)


> > Juri proposed a simple implementation: 'setting a new variable
> > `search-dot-regexp' to "\\(.\\|[\\n]\\)".'
>
> There's already a much more efficient implementation.  See regex.h:
>
>    /* If this bit is set, then . matches newline.
>       If not set, then it doesn't.  */
>    #define RE_DOT_NEWLINE (RE_CONTEXT_INVALID_OPS << 1)

Fine, as far as implementation goes. What's still needed then is Lisp access
(Lisp variable) and making a toggle command for the variable.

end of thread, other threads:[~2006-12-05  5:16 UTC | newest]

Thread overview: 15+ messages
2006-11-28 20:56 using non-Emacs regexp syntax Paul Pogonyshev
2006-11-29 16:26 ` Richard Stallman
2006-11-29 16:38   ` Drew Adams
2006-11-29 17:23     ` David Kastrup
2006-11-29 19:13       ` Paul Pogonyshev
2006-11-29 20:53         ` Jari Aalto
2006-11-30  2:11         ` Drew Adams
2006-11-30 14:26           ` Stefan Monnier
2006-12-05  5:16             ` Drew Adams
2006-12-01 20:30       ` Stuart D. Herring
2006-11-29 19:06   ` Paul Pogonyshev
  -- strict thread matches above, loose matches on Subject: below --
2006-12-01 22:35 Stuart D. Herring
2006-12-01 22:54 ` Paul Pogonyshev
2006-12-03 20:22   ` Juri Linkov
2006-12-02  2:38 ` Stefan Monnier
