RE for any text, including white space

unofficial mirror of help-gnu-emacs@gnu.org

* RE for any text, including white space
@ 2011-03-16 13:23 ken
  2011-03-16 19:40 ` PJ Weisberg
  0 siblings, 1 reply; 7+ messages in thread
From: ken @ 2011-03-16 13:23 UTC
  To: GNU Emacs List

What's the RE for any text, white space included?  I also want to grab
(for match-string...) this text.  The text is bounded by known
characters.  E.g.,

<h3>Any Text-- <a name="thisname">
Hot Stuff</h3
>

In the above, how to grab the text of the title, i.e., everything
between <h3> and </h3>?  Conceivably this title text might contain
*anything* except "</[Hh][1-9]".

tnx.

-- 
Anything is easy if you know how to do it.


* Re: RE for any text, including white space
  2011-03-16 13:23 RE for any text, including white space ken
@ 2011-03-16 19:40 ` PJ Weisberg
  2011-03-16 21:53   ` ken
  0 siblings, 1 reply; 7+ messages in thread
From: PJ Weisberg @ 2011-03-16 19:40 UTC
  To: gebser; +Cc: GNU Emacs List

On 3/16/11, ken <gebser@mousecar.com> wrote:
> What's the RE for any text, white space included?  I also want to grab
> (for match-string...) this text.  The text is bounded by known
> characters.  E.g.,
>
> <h3>Any Text-- <a name="thisname">
> Hot Stuff</h3
>>
>
> In the above, how to grab the text of the title, i.e., everything
> between <h3> and </h3>?  Conceivably this title text might contain
> *anything* except "</[Hh][1-9]".
>

If A and B are your start and end points, then you want:

"A\\(.\\|\n\\)*?B"

You probably got thrown off by the fact that '.' matches anything
EXCEPT a newline.  Regexps are usually assumed to be line-based.

The '?' is there to make the '*' non-greedy, to prevent it from
matching everything between the first A and the last B in the whole
buffer.
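
A rough sketch of that difference, assuming a buffer with two headings
in it.  The sample text and the helper name my-first-match are made up
for illustration:

```elisp
;; Compare non-greedy `*?' with greedy `*' on text holding two headings.
(defun my-first-match (regexp)
  "Return the text matched by REGEXP in a sample two-heading buffer."
  (with-temp-buffer
    (insert "<h3>One</h3>\n<p>body</p>\n<h3>Two</h3>\n")
    (goto-char (point-min))
    (when (re-search-forward regexp nil t)
      (match-string 0))))

;; Non-greedy: stops at the first </h3>.
(my-first-match "<h3>\\(.\\|\n\\)*?</h3>")
;; => "<h3>One</h3>"

;; Greedy: runs on to the last </h3> in the buffer.
(my-first-match "<h3>\\(.\\|\n\\)*</h3>")
;; => "<h3>One</h3>\n<p>body</p>\n<h3>Two</h3>"
```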

The double '\'s are necessary in lisp code because it's interpreted as
a string before it's passed to the regexp engine.
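
To see what the regexp engine actually receives, one can inspect the
string after the Lisp reader is done with it.  A small sketch; the
variable name my-any-char-re is only for illustration:

```elisp
;; The Lisp reader turns each "\\" into one backslash and "\n" into a
;; newline character before the regexp engine ever sees the string.
(defvar my-any-char-re "\\(.\\|\n\\)*?"
  "Illustrative name for the pattern fragment discussed above.")

;; 14 characters were typed between the quotes, 10 reach the engine:
;; \ ( . \ | NEWLINE \ ) * ?
(length my-any-char-re)
;; => 10

;; "\\(" is two characters: a backslash (92) and an open paren (40).
(string-to-list "\\(")
;; => (92 40)
```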

-PJ


* Re: RE for any text, including white space
  2011-03-16 19:40 ` PJ Weisberg
@ 2011-03-16 21:53   ` ken
  2011-03-16 22:05     ` PJ Weisberg
  2011-03-17  4:50     ` Kevin Rodgers
  0 siblings, 2 replies; 7+ messages in thread
From: ken @ 2011-03-16 21:53 UTC
  To: PJ Weisberg; +Cc: GNU Emacs List

On 03/16/2011 03:40 PM PJ Weisberg wrote:
> On 3/16/11, ken <gebser@mousecar.com> wrote:
>> What's the RE for any text, white space included?  I also want to grab
>> (for match-string...) this text.  The text is bounded by known
>> characters.  E.g.,
>>
>> <h3>Any Text-- <a name="thisname">
>> Hot Stuff</h3
>> In the above, how to grab the text of the title, i.e., everything
>> between <h3> and </h3>?  Conceivably this title text might contain
>> *anything* except "</[Hh][1-9]".
>>
> 
> If A and B are your start and end points, then you want:
> 
> "A\\(.\\|\n\\)*?B"

That's almost it, but not quite.  It grabs only the last character
before the "B"; in my example above it grabs just "f".  I need to grab:

"Any Text-- <a name="thisname">
Hot Stuff"

-- without the quotes, of course.

> 
> You probably got thrown off by the fact that '.' matches anything
> EXCEPT a newline.  

Well, no, I discovered that a long time ago.  I'm thrown off by a lot of
things though... like why....  Well, I don't want to throw the thread
off in four other directions, so I won't say.

If what you gave me works to find just the "f" before "</h3", then
something like "<h3>\\(\\[.\n\t ]*\\)</h3" should work, right?  Nope.

> Regexps are usually assumed to be line-based.

Yeah.  That must be a throw-back to the mainframe days.  And that's
unfortunate.

> 
> The '?' is there to make the '*' non-greedy, to prevent it from
> matching everything between the first A and the last B in the whole
> buffer.

I've formulated a lot of other similar REs without using the '?' and
they work fine, so I didn't even try that.  Once I find something that
works, it would be interesting then to see the differential effect with
and without it.

> 
> The double '\'s are necessary in lisp code because it's interpreted as
> a string before it's passed to the regexp engine.

Yeah, I've seen and used a lot of that.  Most of the time my first guess
gets it right.

> 
> -PJ

Thanks much for the good attempt.

Ken


* Re: RE for any text, including white space
  2011-03-16 21:53   ` ken
@ 2011-03-16 22:05     ` PJ Weisberg
  2011-03-16 23:43       ` ken
  2011-03-17  4:50     ` Kevin Rodgers
  1 sibling, 1 reply; 7+ messages in thread
From: PJ Weisberg @ 2011-03-16 22:05 UTC
  To: gebser; +Cc: GNU Emacs List

On Wed, Mar 16, 2011 at 2:53 PM, ken <gebser@mousecar.com> wrote:
> On 03/16/2011 03:40 PM PJ Weisberg wrote:
>> On 3/16/11, ken <gebser@mousecar.com> wrote:
>>> What's the RE for any text, white space included?  I also want to grab
>>> (for match-string...) this text.  The text is bounded by known
>>> characters.  E.g.,
>>>
>>> <h3>Any Text-- <a name="thisname">
>>> Hot Stuff</h3
>>> In the above, how to grab the text of the title, i.e., everything
>>> between <h3> and </h3>?  Conceivably this title text might contain
>>> *anything* except "</[Hh][1-9]".
>>>
>>
>> If A and B are your start and end points, then you want:
>>
>> "A\\(.\\|\n\\)*?B"
>
> That's almost it, but not quite.  It grabs only the last character
> before the "B"; in my example above it grabs just "f".  I need to grab:
>
> "Any Text-- <a name="thisname">
> Hot Stuff"
>
> -- without the quotes, of course.

Well, it *matches* the whole thing; it's just that the parentheses
only grab the last character.  Put in another set of parentheses
around the part you want to capture, and you're golden.

"<h3>\\(\\(.\\|\n\\)*?\\)</h3"
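
Roughly how that looks with match-string, using the sample heading
from the original question.  The function name my-h3-groups is
hypothetical:

```elisp
;; Group 1 is the new outer pair of parentheses; group 2 is the inner,
;; repeated pair, which only keeps what it matched on its last pass.
(defun my-h3-groups (html)
  "Return a list of (GROUP-1 GROUP-2) for the first <h3> title in HTML."
  (with-temp-buffer
    (insert html)
    (goto-char (point-min))
    (when (re-search-forward "<h3>\\(\\(.\\|\n\\)*?\\)</h3" nil t)
      (list (match-string 1) (match-string 2)))))

(my-h3-groups "<h3>Any Text-- <a name=\"thisname\">\nHot Stuff</h3\n>")
;; => ("Any Text-- <a name=\"thisname\">\nHot Stuff" "f")
```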

-PJ




* Re: RE for any text, including white space
  2011-03-16 22:05     ` PJ Weisberg
@ 2011-03-16 23:43       ` ken
  0 siblings, 0 replies; 7+ messages in thread
From: ken @ 2011-03-16 23:43 UTC
  To: PJ Weisberg; +Cc: GNU Emacs List


On 03/16/2011 06:05 PM PJ Weisberg wrote:
> On Wed, Mar 16, 2011 at 2:53 PM, ken <gebser@mousecar.com> wrote:
>> On 03/16/2011 03:40 PM PJ Weisberg wrote:
>>> On 3/16/11, ken <gebser@mousecar.com> wrote:
>>>> What's the RE for any text, white space included?  I also want to grab
>>>> (for match-string...) this text.  The text is bounded by known
>>>> characters.  E.g.,
>>>>
>>>> <h3>Any Text-- <a name="thisname">
>>>> Hot Stuff</h3
>>>> In the above, how to grab the text of the title, i.e., everything
>>>> between <h3> and </h3>?  Conceivably this title text might contain
>>>> *anything* except "</[Hh][1-9]".
>>>>
>>> If A and B are your start and end points, then you want:
>>>
>>> "A\\(.\\|\n\\)*?B"
>> That's almost it, but not quite.  It grabs only the last character
>> before the "B"; in my example above it grabs just "f".  I need to grab:
>>
>> "Any Text-- <a name="thisname">
>> Hot Stuff"
>>
>> -- without the quotes, of course.
> 
> Well, it *matches* the whole thing; it's just that the parentheses
> only grab the last character.  Put in another set of parentheses
> around the part you want to capture, and you're golden.
> 
> "<h3>\\(\\(.\\|\n\\)*?\\)</h3"
> 
> -PJ

Cool.  That worked!!  PJ, you're /The Man/.

Somewhere in the many docs on REs I read it said that you couldn't nest
match syntax-- \\(...\\) so I never tried what you did.  Doing a lot of
different \\([...]*\\) kind of stuff didn't work (even with more '\'s)
at all.  So this was kind of a big learn.

Thanks much,
Ken




* Re: RE for any text, including white space
  2011-03-16 21:53   ` ken
  2011-03-16 22:05     ` PJ Weisberg
@ 2011-03-17  4:50     ` Kevin Rodgers
  2011-03-17 10:20       ` ken
  1 sibling, 1 reply; 7+ messages in thread
From: Kevin Rodgers @ 2011-03-17  4:50 UTC
  To: help-gnu-emacs

On 3/16/11 3:53 PM, ken wrote:
...
> If what you gave me works to find just the "f" before "</h3", then
> something like "<h3>\\(\\[.\n\t ]*\\)</h3" should work, right?  Nope.

. is not special within [], which is why PJ expressed the tag content as
\\(.\\|\n\\)*?

And \t and SPC do not have to be handled specially with respect to ., only \n.
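
Both points can be checked quickly with string-match-p.  The test
strings below are invented for illustration; the last form is the
attempted regexp quoted above, run against the start of the sample
heading:

```elisp
;; Inside brackets `.' is an ordinary character: it matches a literal
;; period and nothing else.
(string-match-p "[.]" "abc")    ; => nil
(string-match-p "[.]" "a.c")    ; => 1

;; Outside brackets `.' matches a tab or a space, but not a newline.
(string-match-p "\\`.\\'" "\t") ; => 0
(string-match-p "\\`.\\'" " ")  ; => 0
(string-match-p "\\`.\\'" "\n") ; => nil

;; In the attempted regexp, "\\[" asks for a literal `[' right after
;; <h3>, so it cannot match the sample heading.
(string-match-p "<h3>\\(\\[.\n\t ]*\\)</h3"
                "<h3>Any Text-- <a name=\"thisname\">\nHot Stuff</h3\n>")
;; => nil
```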

-- 
Kevin Rodgers
Denver, Colorado, USA





* Re: RE for any text, including white space
  2011-03-17  4:50     ` Kevin Rodgers
@ 2011-03-17 10:20       ` ken
  0 siblings, 0 replies; 7+ messages in thread
From: ken @ 2011-03-17 10:20 UTC
  To: Kevin Rodgers; +Cc: help-gnu-emacs

On 03/17/2011 12:50 AM Kevin Rodgers wrote:
> On 3/16/11 3:53 PM, ken wrote:
> ...
>> If what you gave me works to find just the "f" before "</h3", then
>> something like "<h3>\\(\\[.\n\t ]*\\)</h3" should work, right?  Nope.
> 
> . is not special within [], which is why PJ expressed the tag content as
> \\(.\\|\n\\)*?
> 
> And \t and SPC do not have to be handled specially with respect to .,
> only \n.
> 

I thought I read that on the web somewhere but I wasn't sure that I did,
and I don't always have 100% faith in what the web says.  So for the
sake of expediency, I thought it better to be redundant, get something
that works, then test for possible redundancies.  So thanks for the
confirmation... it saves me from having to test those.

On the first point: How elisp is to parse the period seems to have been
dreamt in Black Forest lore rather than in the hard, white light of
rationality.  That's to say: If it's a special character, why not let it
be so both within \\(.\\) and \\[.\\]??  This would seem the more
consistent, yes?  Consistency would also seem to dictate that in either
context, prepending a backslash would serve to substitute the literal
for its special meaning, as it does for so many other special
characters.  Yes, of course it's much too late to write the rules.  ACK
that.  I suppose I'm just venting a particular frustration that seems to
have me reverse-engineering frequent parts of this strange language in
order to write in it.
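
For what it's worth, the way the rules currently behave can be
sketched like this (the test strings are made up): outside brackets a
backslash does make the period literal, while inside brackets the
backslash is itself just another member of the set.

```elisp
;; Outside brackets, backslash-period matches only a literal period.
(string-match-p "\\." "a.c")     ; => 1
(string-match-p "\\." "abc")     ; => nil

;; Inside brackets the backslash is not an escape: "[\\.]" is the set
;; containing a backslash and a period.
(string-match-p "[\\.]" "a.c")   ; => 1
(string-match-p "[\\.]" "a\\c")  ; => 1, matching the backslash itself
```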

Thanks, Kevin, for unveiling that.  It's one less unpuzzling for me to do.
