unofficial mirror of help-gnu-emacs@gnu.org
* regexp that matches newline characters?
@ 2008-05-09 16:53 Dmitri Minaev
  0 siblings, 0 replies; 9+ messages in thread
From: Dmitri Minaev @ 2008-05-09 16:53 UTC (permalink / raw)
  To: EMACS list

I tried to extract a tag from an xml file to parse it later, but I
can't find a regexp that would match an xml tag with its content,
including newlines. Dot doesn't match newlines. The elisp manual
mentions that "complemented character alternative" matches a newline,
so I used this funny template:\\(<author>[^±]*?</author>\\). Of
course, this is not the right thing to do. What would be the correct
regular expression?
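
A minimal sketch of the behaviour described above, assuming a made-up input string and a hypothetical helper name (demo-dot-vs-complement); neither comes from the original message. It only illustrates that dot stops at a newline while a complemented character alternative does not:

```elisp
;; Hypothetical helper, for illustration only.
(defun demo-dot-vs-complement (text)
  "Return match positions in TEXT for a dot-based and a [^±]-based regexp."
  (list
   ;; `.' never matches a newline, so this fails on multi-line content.
   (string-match "<author>.*?</author>" text)
   ;; A complemented alternative matches any character not listed,
   ;; including newline.
   (string-match "<author>[^±]*?</author>" text)))

(demo-dot-vs-complement "<author>Jane\nDoe</author>")
;; => (nil 0)
```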

-- 
With best regards,
Dmitri Minaev

Russian history blog: http://minaev.blogspot.com





* Re: regexp that matches newline characters?
       [not found] <mailman.11388.1210352002.18990.help-gnu-emacs@gnu.org>
@ 2008-05-09 20:07 ` Xah
  2008-05-09 22:18   ` Dmitri Minaev
       [not found]   ` <mailman.11398.1210371539.18990.help-gnu-emacs@gnu.org>
  2008-05-09 22:56 ` harven
  1 sibling, 2 replies; 9+ messages in thread
From: Xah @ 2008-05-09 20:07 UTC (permalink / raw)
  To: help-gnu-emacs

On May 9, 9:53 am, "Dmitri Minaev" <min...@gmail.com> wrote:
> I tried to extract a tag from an xml file to parse it later, but I
> can't find a regexp that would match an xml tag with its content,
> including newlines. Dot doesn't match newlines. The elisp manual
> mentions that "complemented character alternative" matches a newline,
> so I used this funny template:\\(<author>[^±]*?</author>\\). Of
> course, this is not the right thing to do. What would be the correct
> regular expression?

Line ending char can be matched by \n, but you'll need to double the
backslash.

However, this is prob what you want:

(some-regex-func "<author>\\([^<]+\\)</author>" ...)

which captures the content.
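
A runnable sketch of that pattern, with the backslashes doubled as a Lisp string requires. The function name demo-author-content and the sample inputs are assumptions for illustration, and string-match stands in for the unspecified some-regex-func:

```elisp
;; Hypothetical helper, for illustration only.
(defun demo-author-content (text)
  "Return the text between <author> and </author> in TEXT, or nil.
Only works when the element contains no nested tags."
  (when (string-match "<author>\\([^<]+\\)</author>" text)
    ;; Group 1 is the part matched by [^<]+.
    (match-string 1 text)))

(demo-author-content "<author>Jane\nDoe</author>")
;; => "Jane\nDoe"  ([^<] also matches the newline)
```

Because [^<] stops at the first "<", this returns nil as soon as the element has a child tag.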

See here for some explanation and frequently used patterns:
 http://xahlee.org/emacs/emacs_regex.html

  Xah
  xah@xahlee.org
∑ http://xahlee.org/

* Re: regexp that matches newline characters?
  2008-05-09 20:07 ` regexp that matches newline characters? Xah
@ 2008-05-09 22:18   ` Dmitri Minaev
  2008-05-09 22:28     ` Lennart Borgman (gmail)
       [not found]   ` <mailman.11398.1210371539.18990.help-gnu-emacs@gnu.org>
  1 sibling, 1 reply; 9+ messages in thread
From: Dmitri Minaev @ 2008-05-09 22:18 UTC (permalink / raw)
  To: Xah; +Cc: help-gnu-emacs

On Sat, May 10, 2008 at 1:07 AM, Xah <xahlee@gmail.com> wrote:
> Line ending char can be matched by \n, but you'll need to double the
> backslash.
>
> However, this is prob what you want:
>
> (some-regex-func "<author>\\([^<]+\\)</author>" ...)

Thanks, but it won't do the job -- there are embedded tags inside
<author>. That's why I preferred ± to < :)

The regexp should eat anything, like dot, but including all kinds of
whitespaces. Is it possible to do it with character classes? Something
like [[:alnum:][:space:]]* (this one didn't work for me) ?

>
> See here for some explanation and frequently used patterns:
>  http://xahlee.org/emacs/emacs_regex.html

Very good page, but too short :) Thanks!

-- 
With best regards,
Dmitri Minaev

Russian history blog: http://minaev.blogspot.com





* Re: regexp that matches newline characters?
  2008-05-09 22:18   ` Dmitri Minaev
@ 2008-05-09 22:28     ` Lennart Borgman (gmail)
  0 siblings, 0 replies; 9+ messages in thread
From: Lennart Borgman (gmail) @ 2008-05-09 22:28 UTC (permalink / raw)
  To: Dmitri Minaev; +Cc: help-gnu-emacs, Xah

Dmitri Minaev wrote:
> The regexp should eat anything, like dot, but including all kinds of
> whitespaces. Is it possible to do it with character classes? Something
> like [[:alnum:][:space:]]* (this one didn't work for me) ?

   \(.\|\n\)

but double the \.





* Re: regexp that matches newline characters?
       [not found] <mailman.11388.1210352002.18990.help-gnu-emacs@gnu.org>
  2008-05-09 20:07 ` regexp that matches newline characters? Xah
@ 2008-05-09 22:56 ` harven
  2008-05-11 17:45   ` Dmitri Minaev
  1 sibling, 1 reply; 9+ messages in thread
From: harven @ 2008-05-09 22:56 UTC (permalink / raw)
  To: help-gnu-emacs

On May 9, 6:53 pm, "Dmitri Minaev" <min...@gmail.com> wrote:
> I tried to extract a tag from an xml file to parse it later, but I
> can't find a regexp that would match an xml tag with its content,
> including newlines. Dot doesn't match newlines. The elisp manual
> mentions that "complemented character alternative" matches a newline,
> so I used this funny template:\\(<author>[^±]*?</author>\\). Of
> course, this is not the right thing to do. What would be the correct
> regular expression?
>
> --
> With best regards,
> Dmitri Minaev
>
> Russian history blog:http://minaev.blogspot.com

"\\(.\\|\n\\)"   matches everything.
It stands for: any character but a new-line, or a new-line.
Do not double-backslash the \n.
The  regexp must be entered as a string in an elisp expression.
In a string, \n stands as newline, \t as tab, \\ as backslash.
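
A small sketch of how this might be used on the <author> case, assuming a hypothetical helper name (demo-author-element) and invented sample input. It uses a shy group \\(?: ... \\) so that only the content is captured as group 1, and the non-greedy *? so the match stops at the first </author>:

```elisp
;; Hypothetical helper, for illustration only.
(defun demo-author-element (text)
  "Return the content of the first <author> element in TEXT, or nil.
Nested tags and newlines inside the element are allowed."
  (when (string-match "<author>\\(\\(?:.\\|\n\\)*?\\)</author>" text)
    (match-string 1 text)))

(demo-author-element "<author>\n<first>Jane</first>\n</author>")
;; => "\n<first>Jane</first>\n"

;; The string escapes mentioned above:
(length "\n")   ; => 1, a single newline character
(length "\\n")  ; => 2, a backslash followed by the letter n
```

One caveat, as far as I know: on long stretches of text a repeated group like this can make Emacs signal "Stack overflow in regexp matcher", so it is better suited to short elements.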



* Re: regexp that matches newline characters?
       [not found]   ` <mailman.11398.1210371539.18990.help-gnu-emacs@gnu.org>
@ 2008-05-10  2:51     ` Xah
  0 siblings, 0 replies; 9+ messages in thread
From: Xah @ 2008-05-10  2:51 UTC (permalink / raw)
  To: help-gnu-emacs

Sorry, I seem to have misunderstood your question.

The following also works, for what it's worth.

(search-forward-regexp "<pre class=\"mma\">\\([^•]*\\)</pre>")

<pre class="mma">
something  here
<p>some</p>
and there
</pre>
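
To try the search above without a file, a sketch along these lines should work. The helper name demo-search-pre is made up, and the sample text is inserted into a temporary buffer rather than read from a real document:

```elisp
;; Hypothetical helper, for illustration only.
(defun demo-search-pre ()
  "Search a temporary buffer holding the sample text; return group 1 or nil."
  (with-temp-buffer
    (insert "<pre class=\"mma\">\n"
            "something  here\n"
            "<p>some</p>\n"
            "and there\n"
            "</pre>\n")
    (goto-char (point-min))
    ;; [^•]* is greedy and crosses newlines; it backtracks to the last </pre>.
    (when (search-forward-regexp "<pre class=\"mma\">\\([^•]*\\)</pre>" nil t)
      (match-string 1))))

(demo-search-pre)
;; => "\nsomething  here\n<p>some</p>\nand there\n"
```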

  Xah
  xah@xahlee.org
∑ http://xahlee.org/

☄

On May 9, 3:18 pm, "Dmitri Minaev" <min...@gmail.com> wrote:
> On Sat, May 10, 2008 at 1:07 AM, Xah <xah...@gmail.com> wrote:
> > Line ending char can be matched by \n, but you'll need to double the
> > backslash.
>
> > However, this is prob what you want:
>
> > (some-regex-func "<author>\\([^<]+\\)</author>" ...)
>
> Thanks, but it won't do the job -- there are embedded tags inside
> <author>. That's why I preferred ± to < :)
>
> The regexp should eat anything, like dot, but including all kinds of
> whitespaces. Is it possible to do it with character classes? Something
> like [[:alnum:][:space:]]* (this one didn't work for me) ?
>
>
>
> > See here for some explanation and frequently used patterns:
> >  http://xahlee.org/emacs/emacs_regex.html
>
> Very good page, but too short :) Thanks!
>
> --
> With best regards,
> Dmitri Minaev
>
> Russian history blog:http://minaev.blogspot.com




* Re: regexp that matches newline characters?
  2008-05-09 22:56 ` harven
@ 2008-05-11 17:45   ` Dmitri Minaev
  2008-05-11 18:11     ` Peter Dyballa
  0 siblings, 1 reply; 9+ messages in thread
From: Dmitri Minaev @ 2008-05-11 17:45 UTC (permalink / raw)
  To: harven; +Cc: help-gnu-emacs

On Sat, May 10, 2008 at 3:56 AM, harven <harven@free.fr> wrote:
>  "\\(.\\|\n\\)"   matches everything.

Thanks to everyone. Parenthesized alternative works, but I found a
solution based on character classes:

\\(<author>[[:print:][:space:]]*?</author>\\)

So far, it works. It will help me to get rid of nested groups.
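
A sketch of the character-class approach, assuming a hypothetical helper name (demo-author-by-class) and invented sample XML. Note that [:space:] depends on the syntax table in effect, so the sketch pins it to the standard one, where newline counts as whitespace; in modes that give newline a different syntax the class may not match it:

```elisp
;; Hypothetical helper, for illustration only.
(defun demo-author-by-class (xml)
  "Return the first <author> element in XML, tags included, or nil."
  (with-temp-buffer
    ;; Assumption: use the standard syntax table so [:space:] covers newline.
    (with-syntax-table (standard-syntax-table)
      (insert xml)
      (goto-char (point-min))
      (when (re-search-forward
             "\\(<author>[[:print:][:space:]]*?</author>\\)" nil t)
        (match-string 1)))))

(demo-author-by-class
 "<book>\n<author>\n<first>Jane</first>\n</author>\n</book>")
;; => "<author>\n<first>Jane</first>\n</author>"
```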

> (search-forward-regexp "<pre class=\"mma\">\\([^•]*\\)</pre>")

Yes, but this is the same hack I wanted to avoid: taking a character
which is not supposed to be found inside the tag and matching anything
except for this character. What if this character appears in some
author's name? What if Prince changes his name again? :)

Is there a comparison of various regexp tools' efficiency: are
character classes fast enough? would parenthesized groups be faster?
or character alternatives (like that [^±])?
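
One way to get a rough answer for a particular Emacs build is to time the candidates directly. The sketch below assumes the benchmark library that ships with Emacs; the helper name demo-time-pattern and the sample text are invented, and the numbers it returns are only meaningful relative to each other on the same machine:

```elisp
(require 'benchmark)

;; Hypothetical helper, for illustration only.
(defun demo-time-pattern (regexp text)
  "Return the seconds taken to search TEXT for REGEXP 1000 times."
  (with-temp-buffer
    (with-syntax-table (standard-syntax-table)
      (insert text)
      ;; `benchmark-run' returns (ELAPSED GC-COUNT GC-ELAPSED).
      (car (benchmark-run 1000
             (goto-char (point-min))
             (re-search-forward regexp nil t))))))

;; Compare the three styles discussed in this thread on the same input.
(let ((text "<author>\n<first>Jane</first>\n<last>Doe</last>\n</author>\n"))
  (list
   (demo-time-pattern "<author>[[:print:][:space:]]*?</author>" text)
   (demo-time-pattern "<author>\\(?:.\\|\n\\)*?</author>" text)
   (demo-time-pattern "<author>[^±]*?</author>" text)))
```

The result is a list of three elapsed times; no particular ordering is claimed here.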

Thank you.

-- 
With best regards,
Dmitri Minaev

Russian history blog: http://minaev.blogspot.com





* Re: regexp that matches newline characters?
  2008-05-11 17:45   ` Dmitri Minaev
@ 2008-05-11 18:11     ` Peter Dyballa
  2008-05-11 18:32       ` Dmitri Minaev
  0 siblings, 1 reply; 9+ messages in thread
From: Peter Dyballa @ 2008-05-11 18:11 UTC (permalink / raw)
  To: Dmitri Minaev; +Cc: emacs list


Am 11.05.2008 um 19:45 schrieb Dmitri Minaev:

> Is there a comparison of various regexp tools' efficiency: are
> character classes fast enough? would parenthesized groups be faster?
> or character alternatives (like that [^±])?

Could be this helps: http://swtch.com/~rsc/regexp/regexp1.html

--
Greetings

   Pete

Bake pizza not war!








* Re: regexp that matches newline characters?
  2008-05-11 18:11     ` Peter Dyballa
@ 2008-05-11 18:32       ` Dmitri Minaev
  0 siblings, 0 replies; 9+ messages in thread
From: Dmitri Minaev @ 2008-05-11 18:32 UTC (permalink / raw)
  To: Peter Dyballa; +Cc: emacs list

On Sun, May 11, 2008 at 11:11 PM, Peter Dyballa <Peter_Dyballa@web.de> wrote:
>  Could be this helps: http://swtch.com/~rsc/regexp/regexp1.html
>

Not really, I'm afraid :). What inspired me to a certain degree was a
quotation from an old e-mail by Jamie Zawinski:

"The heavy use of regexps in Perl is due to them being far and away
the most obvious hammer in the box.

The heavy use of regexps in Emacs is due almost entirely to
performance issues: because of implementation details, Emacs code that
uses regexps will almost always run faster than code that uses more
traditional control structures." (from
http://regex.info/blog/2006-09-15/247)

Let's hope it still holds true...

-- 
With best regards,
Dmitri Minaev

Russian history blog: http://minaev.blogspot.com




