all messages for Emacs-related lists mirrored at yhetil.org
* bug in elisp... or in  elisper???
@ 2011-03-22 23:37 ken
  2011-03-23  0:15 ` PJ Weisberg
  0 siblings, 1 reply; 10+ messages in thread
From: ken @ 2011-03-22 23:37 UTC (permalink / raw)
  To: GNU Emacs List

Fellow elispers,

Something seems to be amiss in the search syntax here:

 (setq aname-re-str
"<a\\([\s-\\|\n]+?\\)name=\"\\(.*?\\)\"\\([\s-\\|\n]*?\\)>\\(\\(.\\|\n\\)*?\\)</a\\(\\(
\\|\t\\|\n\\)*?\\)>" )

;;Here's a function to use the above RE and return diagnostics:

(defun test-aname-search ()
  (interactive)
  (re-search-forward aname-re-str)
  (message "1: \"%s\" 2: \"%s\" 3: \"%s\" 4: \"%s\" 5: \"%s\" 6: \"%s\"
7: \"%s\" 8: \"%s\""
	   (match-string 1)
	   (match-string 2)
	   (match-string 3)
	   (match-string 4)
	   (match-string 5)
	   (match-string 6)
	   (match-string 7)
	   (match-string 8)))


Here are some strings to search on:

<h3><a name="thisname">Any Text--
Hot Stuff</a></h3>

<h1
class="title"
><a
name="heres-a-name"
>
the</a
></h1
>

<h3><a name="duplicate">Any Text--
Hot Crud</a></h3>


The problem is that the 5th match-string should be either empty or
whitespace.  But it consistently contains the last character of the
4th match-string.  And these two matches are separated by the literal
character string, "</a"!!  What's up with this?


Wishing I hadn't quit beer,
ken

-- 
Anything is easy if you know how to do it.




* Re: bug in elisp... or in  elisper???
       [not found] <mailman.11.1300837050.13753.help-gnu-emacs@gnu.org>
@ 2011-03-22 23:50 ` David Kastrup
  2011-03-23 15:21   ` ken
  2011-03-23  7:01 ` Tim X
  1 sibling, 1 reply; 10+ messages in thread
From: David Kastrup @ 2011-03-22 23:50 UTC (permalink / raw)
  To: help-gnu-emacs

ken <gebser@mousecar.com> writes:

> Fellow elispers,
>
> Something seems to be amiss in the search syntax here:
>
>  (setq aname-re-str
> "<a\\([\s-\\|\n]+?\\)name=\"\\(.*?\\)\"\\([\s-\\|\n]*?\\)>\\(\\(.\\|\n\\)*?\\)</a\\(\\(
> \\|\t\\|\n\\)*?\\)>" )
>
> ;;Here's a function to use the above RE and return diagnostics:
>
> (defun test-aname-search ()
>   (interactive)
>   (re-search-forward aname-re-str)
>   (message "1: \"%s\" 2: \"%s\" 3: \"%s\" 4: \"%s\" 5: \"%s\" 6: \"%s\"
> 7: \"%s\" 8: \"%s\""
> 	   (match-string 1)
> 	   (match-string 2)
> 	   (match-string 3)
> 	   (match-string 4)
> 	   (match-string 5)
> 	   (match-string 6)
> 	   (match-string 7)
> 	   (match-string 8)))
>
>
> The problem is that the 5th match-string should be either empty or
> whitespace.

Uh what?

\\(.\\|\n\\)*?

Matches _any_ character.

> But it consistently contains the last character of the 4th
> match-string.

That is because it _is_ the last matched character of the 4th
match-string.

> And these two matches are separated by the literal
> character string, "</a"!!  What's up with this?

Your ability to count \\( strings?  They are assigned match numbers from
left to right, regardless of whether they are nested or not.
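
A minimal sketch of that numbering rule, using a cut-down regexp rather
than the one from the original post.  The helper `demo-groups' is a
hypothetical name introduced only for this illustration.  It also shows
the shy-group syntax \\(?: ... \\), which groups without taking a number:

```elisp
;; Hypothetical helper for illustration; not part of the original post.
(defun demo-groups (regexp text count)
  "Match REGEXP against TEXT; return match-strings 1..COUNT as a list."
  (when (string-match regexp text)
    (mapcar (lambda (n) (match-string n text))
            (number-sequence 1 count))))

;; Group 2 opens inside group 1, so it holds the last character matched
;; by the inner alternation.  The trailing-space group is number 3.
(demo-groups "<\\(\\(.\\|\n\\)*?\\)>\\( *\\)" "<ab> " 3)
;; => ("ab" "b" " ")

;; With a shy group the inner parentheses are not numbered, so the
;; trailing-space group becomes number 2.
(demo-groups "<\\(\\(?:.\\|\n\\)*?\\)>\\( *\\)" "<ab> " 2)
;; => ("ab" " ")
```

Counting the opening \\( from left to right gives the group number in
both cases; only the shy group is skipped.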

-- 
David Kastrup



* Re: bug in elisp... or in elisper???
  2011-03-22 23:37 bug in elisp... or in elisper??? ken
@ 2011-03-23  0:15 ` PJ Weisberg
  2011-03-23 14:18   ` ken
       [not found]   ` <mailman.3.1300889938.15160.help-gnu-emacs@gnu.org>
  0 siblings, 2 replies; 10+ messages in thread
From: PJ Weisberg @ 2011-03-23  0:15 UTC (permalink / raw)
  To: gebser; +Cc: GNU Emacs List

On 3/22/11, ken <gebser@mousecar.com> wrote:
> Fellow elispers,
>
> Something seems to be amiss in the search syntax here:
>
>  (setq aname-re-str
> "<a\\([\s-\\|\n]+?\\)name=\"\\(.*?\\)\"\\([\s-\\|\n]*?\\)>\\(\\(.\\|\n\\)*?\\)</a\\(\\(
> \\|\t\\|\n\\)*?\\)>" )
>
...
> The problem is that the 5th match-string should be either empty or
> whitespace.  But it consistently contains the last character of the
> 4th match-string.  And these two matches are separated by the literal
> character string, "</a"!!  What's up with this?

You miscounted your '('s.  The fifth group IS inside the fourth group,
matching . or \n.

-PJ




* Re: bug in elisp... or in  elisper???
       [not found] <mailman.11.1300837050.13753.help-gnu-emacs@gnu.org>
  2011-03-22 23:50 ` David Kastrup
@ 2011-03-23  7:01 ` Tim X
  2011-03-23 15:56   ` ken
  1 sibling, 1 reply; 10+ messages in thread
From: Tim X @ 2011-03-23  7:01 UTC (permalink / raw)
  To: help-gnu-emacs

ken <gebser@mousecar.com> writes:

> Fellow elispers,
>
> Something seems to be amiss in the search syntax here:
>
>  (setq aname-re-str
> "<a\\([\s-\\|\n]+?\\)name=\"\\(.*?\\)\"\\([\s-\\|\n]*?\\)>\\(\\(.\\|\n\\)*?\\)</a\\(\\(
> \\|\t\\|\n\\)*?\\)>" )
>
> ;;Here's a function to use the above RE and return diagnostics:
>
> (defun test-aname-search ()
>   (interactive)
>   (re-search-forward aname-re-str)
>   (message "1: \"%s\" 2: \"%s\" 3: \"%s\" 4: \"%s\" 5: \"%s\" 6: \"%s\"
> 7: \"%s\" 8: \"%s\""
> 	   (match-string 1)
> 	   (match-string 2)
> 	   (match-string 3)
> 	   (match-string 4)
> 	   (match-string 5)
> 	   (match-string 6)
> 	   (match-string 7)
> 	   (match-string 8)))
>
>
> Here are some strings to search on:
>
> <h3><a name="thisname">Any Text--
> Hot Stuff</a></h3>
>
> <h1
> class="title"
> ><a
> name="heres-a-name"
> >
> the</a
> ></h1
> >
>
> <h3><a name="duplicate">Any Text--
> Hot Crud</a></h3>
>
>
> The problem is that the 5th match-string should be either empty or
> whitespace.  But it consistently contains the last character of the
> 4th match-string.  And these two matches are separated by the literal
> character string, "</a"!!  What's up with this?
>
>
> Wishing I hadn't quit beer,
> ken

I don't think your re is matching what you think it is. I strongly recommend
you try using re-builder as this will give you a visual representation
of what your re is matching (with different colours representing the
various match groups).

Tim

-- 
tcross (at) rapttech dot com dot au



* Re: bug in elisp... or in elisper???
  2011-03-23  0:15 ` PJ Weisberg
@ 2011-03-23 14:18   ` ken
  2011-03-25  3:44     ` Kevin Rodgers
       [not found]   ` <mailman.3.1300889938.15160.help-gnu-emacs@gnu.org>
  1 sibling, 1 reply; 10+ messages in thread
From: ken @ 2011-03-23 14:18 UTC (permalink / raw)
  To: PJ Weisberg; +Cc: GNU Emacs List

On 03/22/2011 08:15 PM PJ Weisberg wrote:
> On 3/22/11, ken <gebser@mousecar.com> wrote:
>> Fellow elispers,
>>
>> Something seems to be amiss in the search syntax here:
>>
>>  (setq aname-re-str
>> "<a\\([\s-\\|\n]+?\\)name=\"\\(.*?\\)\"\\([\s-\\|\n]*?\\)>\\(\\(.\\|\n\\)*?\\)</a\\(\\(
>> \\|\t\\|\n\\)*?\\)>" )
>>
> ...
>> The problem is that the 5th match-string should be either empty or
>> whitespace.  But it consistently contains the last character of the
>> 4th match-string.  And these two matches are separated by the literal
>> character string, "</a"!!  What's up with this?
> 
> You miscounted your '('s.  The fifth group IS inside the fourth group,
> matching . or \n.
> 
> -PJ

It wasn't that I miscounted.  I read a doc which said that I couldn't
embed one potential match expression inside another.  (I mentioned this,
I believe, in a previous email.)  So I figured that, if this wasn't
allowed, I certainly couldn't count each expression inside a pair of
parens as another match.  But it seems that doc was wrong.

So this is actually good news: my RE works just as I want it to *and*
there's no bug in elisp to contend with.  I am, however, starting to
have trust issues with documentation I find on the web.  But I have you
guys here on this list as a reality check.

If one match expression *can* be embedded within another, this is good
news: it means I can write more comprehensive REs.  I.e., instead of
writing RE #1 to locate a section of text and then RE #2 to parse just
that section, REs #1 and #2 can be combined into one RE.  Radically cool.

So some further questions:

You might have noticed I use "\\([\s-\\|\n]+?\\)" to non-greedily match
one or more whitespace characters.  Can one "\\[...\\]" be nested inside
another...?  e.g., "[[\s-\\|\n]+?]" or some syntax like that?

The "specialness" of "." seems to be lost when inside brackets; that is,
in "[.\n]*?" it seems to represent a regular period (.) rather than "any
character except newline".  Is there some way to bring back that
specialness?  Or is there some other RE to represent "multiple instances
of any character, including a newline"?

Is it actually true (what the docs say) that there's a limit of nine
sub-expression match-strings per RE?  Or can I do, e.g., "(match-string
12)" and "(match-string 15)"?  What is the actual limit?  Whatever it
is, is this hard-coded into elisp... or can it be changed/configured to
something else?


Thanks for the illumination.





* Re: bug in elisp... or in  elisper???
  2011-03-22 23:50 ` David Kastrup
@ 2011-03-23 15:21   ` ken
  2011-03-23 15:38     ` David Kastrup
  0 siblings, 1 reply; 10+ messages in thread
From: ken @ 2011-03-23 15:21 UTC (permalink / raw)
  To: David Kastrup; +Cc: help-gnu-emacs

On 03/22/2011 07:50 PM David Kastrup wrote:
> ken <gebser@mousecar.com> writes:
> 
>> Fellow elispers,
>>
>> Something seems to be amiss in the search syntax here:
>>
>>  (setq aname-re-str
>> "<a\\([\s-\\|\n]+?\\)name=\"\\(.*?\\)\"\\([\s-\\|\n]*?\\)>\\(\\(.\\|\n\\)*?\\)</a\\(\\(
>> \\|\t\\|\n\\)*?\\)>" )
>>
>> ....
> 
> Uh what?
> 
> \\(.\\|\n\\)*?
> 
> Matches _any_ character.

Yes.  Why not?  Users' texts can and do contain any sort of character,
multiple instances of them in fact... and, moreover, in any languages'
character sets they might want.  They're allowed to do this.

Perhaps you're perplexed because you're not noting the RE immediately
following: "</a".  IOW, elisp should keep reading chars until the first
instance of "</a".  Seems to me to be a perfectly rational request.  In
the small bit of testing I've done, it seems also to work just fine.


> 
>> But it consistently contains the last character of the 4th
>> match-string.
> 
> That is because it _is_ the last matched character of the 4th
> match-string.
> 
>> And these two matches are separated by the literal
>> character string, "</a"!!  What's up with this?
> 
> Your ability to count \\( strings?  They are assigned match numbers from
> left to right, regardless of whether they are nested or not.

An inability to count would be the most derogatory interpretation.  But
the function I wrote (here elided) actually did the counting for me, so
that would not be a cogent interpretation.  A mere mortal, I wasn't born
knowing that REs could be nested (documentation I read in fact stated
they couldn't), of course then also not that in such cases both inner
and outer REs are counted separately by match-string.  So once again,
the more charitable interpretation is the more perspicacious... and vice
versa.


-- 
One is not superior merely because one
sees the world as odious.
                -- Chateaubriand (1768-1848)





* Re: bug in elisp... or in elisper???
       [not found]   ` <mailman.3.1300889938.15160.help-gnu-emacs@gnu.org>
@ 2011-03-23 15:27     ` Stefan Monnier
  0 siblings, 0 replies; 10+ messages in thread
From: Stefan Monnier @ 2011-03-23 15:27 UTC (permalink / raw)
  To: help-gnu-emacs

> I am, however, starting to have trust issues with documentation I find
> on the web.

Don't believe everything you read.

> Is it actually true (what the docs say) that there's a limit of nine
> sub-expression match-strings per RE?

No.

> Or can I do, e.g., "(match-string 12)" and "(match-string 15)"?

Yes.

> What is the actual limit?

The limit currently is around 255 sub-groups (or maybe 127), IIRC.
OTOH back-references can only refer to subgroups 1-9 (because we
haven't bothered to introduce a syntax for other cases).
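
A small sketch of both points, under the assumption that a regexp with
twelve single-character groups is a fair stand-in.  The names
`demo-twelve-groups-re' and `demo-twelfth-group' are hypothetical and
exist only for this illustration:

```elisp
;; Build "\\(a\\)\\(b\\)...\\(l\\)": twelve numbered groups.
(defconst demo-twelve-groups-re
  (mapconcat (lambda (c) (format "\\(%c\\)" c)) "abcdefghijkl" "")
  "Regexp with twelve groups, one per letter a..l (illustration only).")

(defun demo-twelfth-group ()
  "Return the text matched by group 12 of `demo-twelve-groups-re'."
  (let ((text "abcdefghijkl"))
    (when (string-match demo-twelve-groups-re text)
      (match-string 12 text))))

(demo-twelfth-group)
;; => "l"

;; A back-reference such as \\1 must match the same text that group 1
;; matched; only the digits 1 through 9 are available for this.
(string-match "\\(ab\\)\\1" "abab")   ; => 0
(string-match "\\(ab\\)\\1" "abcd")   ; => nil
```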

> Whatever it is, is this hard-coded into elisp... or can it be
> changed/configured to something else?

It's hardcoded in the C code of the regexp engine.

BTW, I recommend you use the "online" documentation distributed with
Emacs.  There are function and variable docstrings (C-h f, C-h v), plus
Info documents (Emacs manual, Elisp manual).  We work pretty hard to keep
those up-to-date and of good quality.  And if you find something to be
untrue in there, please report it via M-x report-emacs-bug.


        Stefan



* Re: bug in elisp... or in  elisper???
  2011-03-23 15:21   ` ken
@ 2011-03-23 15:38     ` David Kastrup
  0 siblings, 0 replies; 10+ messages in thread
From: David Kastrup @ 2011-03-23 15:38 UTC (permalink / raw)
  To: gebser; +Cc: help-gnu-emacs

ken <gebser@mousecar.com> writes:

> An inability to count would be the most derogatory interpretation.
> But the function I wrote (here elided) actually did the counting for
> me, so that would not be a cogent interpretation.  A mere mortal, I
> wasn't born knowing that REs could be nested (documentation I read in
> fact stated they couldn't),

Emacs comes with its own hyperlinked, up to date, maintained, indexed
fast documentation accessible via Help menu and keybindings.

There is no reason to promote random garbage found somewhere on the
internet to "documentation".  In particular not concerning software that
has a history of 30 years, where consequently most documentation in
existence that might at one point even have been accurate is no longer
so due to being prehistoric.

Still I have my doubts that the documentation you are alluding to even
was ever part of Emacs.

> of course then also not that in such cases both inner and outer REs
> are counted separately by match-string.  So once again, the more
> charitable interpretation is the more perspicacious... and vice versa.

Care to provide a pointer to the "documentation" you are referring to?
While I have my doubts it will lead to a much more charitable
interpretation, I certainly am willing to let myself be surprised.

-- 
David Kastrup




* Re: bug in elisp... or in  elisper???
  2011-03-23  7:01 ` Tim X
@ 2011-03-23 15:56   ` ken
  0 siblings, 0 replies; 10+ messages in thread
From: ken @ 2011-03-23 15:56 UTC (permalink / raw)
  To: tcross; +Cc: help-gnu-emacs


Anything is easy if you know how to do it.


On 03/23/2011 03:01 AM Tim X wrote:
> ken <gebser@mousecar.com> writes:
> 
>> Fellow elispers,
>>
>> Something seems to be amiss in the search syntax here:
>>
>>  (setq aname-re-str
>> "<a\\([\s-\\|\n]+?\\)name=\"\\(.*?\\)\"\\([\s-\\|\n]*?\\)>\\(\\(.\\|\n\\)*?\\)</a\\(\\(
>> \\|\t\\|\n\\)*?\\)>" )
>>
>> ;;Here's a function to use the above RE and return diagnostics:
>>
>> (defun test-aname-search ()
>>   (interactive)
>>   (re-search-forward aname-re-str)
>>   (message "1: \"%s\" 2: \"%s\" 3: \"%s\" 4: \"%s\" 5: \"%s\" 6: \"%s\"
>> 7: \"%s\" 8: \"%s\""
>> 	   (match-string 1)
>> 	   (match-string 2)
>> 	   (match-string 3)
>> 	   (match-string 4)
>> 	   (match-string 5)
>> 	   (match-string 6)
>> 	   (match-string 7)
>> 	   (match-string 8)))
>>
>>
>> Here are some strings to search on:
>>
>> <h3><a name="thisname">Any Text--
>> Hot Stuff</a></h3>
>>
>> <h1
>> class="title"
>> ><a
>> name="heres-a-name"
>> >
>> the</a
>> ></h1
>> >
>>
>> <h3><a name="duplicate">Any Text--
>> Hot Crud</a></h3>
>>
>>
>> The problem is that the 5th match-string should be either empty or
>> whitespace.  But it consistently contains the last character of the
>> 4th match-string.  And these two matches are separated by the literal
>> character string, "</a"!!  What's up with this?
>>
>>
>> Wishing I hadn't quit beer,
>> ken
> 
> I don't think your re is matching what you think it is. I strongly recommend
> you try using re-builder as this will give you a visual representation
> of what your re is matching (with different colours representing the
> various match groups).
> 
> Tim

Well, I was missing a crucial bit of knowledge about REs (explained in
two previous posts here) and that was causing me to misinterpret
results.  PJ's reply pointed me in the direction I needed to go to
figure out what the problem was.  And I think it was a mistake for me to
post such a complex example, but I couldn't think of how else to do it.

I read mention of re-builder, but must admit I haven't tried it yet.
With your recommendation, I'm sure I'll be giving it a try on some
future RE puzzle.  The mere fact that this tool exists is comforting...
tells me that I'm not the only one who's occasionally perplexed by REs.

Thanks for the suggestion.





* Re: bug in elisp... or in elisper???
  2011-03-23 14:18   ` ken
@ 2011-03-25  3:44     ` Kevin Rodgers
  0 siblings, 0 replies; 10+ messages in thread
From: Kevin Rodgers @ 2011-03-25  3:44 UTC (permalink / raw)
  To: help-gnu-emacs

On 3/23/11 8:18 AM, ken wrote:
...
> You might have noticed I use "\\([\s-\\|\n]+?\\)" to non-greedily match
> one or more whitespace characters.  Can one "\\[...\\]" be nested inside
> another...?  e.g., "[[\s-\\|\n]+?]" or some syntax like that?

No.

It is not clear whether you mean "\s" (space) followed by "-" (which is
special within "[]"), or you actually meant "\\s-" (i.e. any character
with whitespace syntax).  The problem with "\\s-" is that it depends on
the buffer's syntax table, as does "[[:space:]]" -- see section 34.3.1.2
(Character Classes) in the Emacs Lisp manual for an explanation of
"[[:space:]]" and other POSIX-inspired character classes:

http://en.wikipedia.org/wiki/Regular_expression#POSIX_character_classes

If you are going to add "\n" to "\\s-" or "[:space:]", within "[]" or
"\\(\\|\\)", because you can't be sure whether the buffer's syntax table
assigns whitespace syntax to newline, then how can you be sure that it
assigns whitespace syntax to space, tab, formfeed, return, and vertical
tab?

So you may as well be explicit about what you mean by whitespace e.g.
"[ \f\t\n\r\v]"
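
A minimal sketch of the difference, assuming the strings are read by the
Lisp reader as written.  `demo-ws' is a hypothetical name used only
here:

```elisp
;; Hypothetical constant for illustration only.
(defconst demo-ws "[ \f\t\n\r\v]"
  "Explicit whitespace: space, formfeed, tab, newline, return, vertical tab.")

;; The explicit class does not consult the buffer's syntax table.
(string-match (concat "<a" demo-ws "+name=") "<a\n\tname=\"x\"")
;; => 0

;; In a Lisp string "\s" reads as a plain space, so "[\s-\\|\n]" is the
;; range from space to backslash, plus "|" and newline.  That range
;; covers digits and uppercase letters, not just whitespace:
(string-match "[\s-\\|\n]" "A")
;; => 0
```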

> The "specialness" of "." seems to be lost when inside brackets; that is,
> in "[.\n]*?" it seems to represent a regular period (.) rather than "any
> character except newline".  Is there some way to bring back that
> specialness?  Or is there some other RE to represent "multiple instances
> of any character, including a newline"?

No, inside "[]", "." is not special.

The right way is: "\\(.\\|\n\\)*"
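
As a quick sketch of both statements (`demo-any-char' is a hypothetical
name used only for this illustration):

```elisp
;; Inside brackets "." is an ordinary period, so "[.\n]" matches only a
;; literal period or a newline.
(string-match "[.\n]" "x")       ; => nil
(string-match "[.\n]" "3.14")    ; => 1

;; The alternation matches any single character, newline included.
(defconst demo-any-char "\\(.\\|\n\\)"
  "Regexp for any one character, including newline (illustration only).")

;; Anchored at both ends of the string, it spans the embedded newline.
(string-match (concat "\\`" demo-any-char "*\\'") "one\ntwo")
;; => 0
```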

There may be other ways, but they will be longer and unnecessarily complex.

> Is it actually true (what the docs say) that there's a limit of nine
> sub-expression match-strings per RE?  Or can I do, e.g., "(match-string
> 12)" and "(match-string 15)"?  What is the actual limit?  Whatever it
> is, is this hard-coded into elisp... or can it be changed/configured to
> something else?

No, but a back-reference (\\1 through \\9) can only refer to the first 9
sub-expressions (actually, to the text matched by each of the first 9
preceding sub-expressions).

-- 
Kevin Rodgers
Denver, Colorado, USA




