unofficial mirror of help-gnu-emacs@gnu.org
* Help needed with regexps
From: D. D. Brierton @ 2004-02-13 19:17 UTC


Hi,

Could a regexp guru look over these regexps and tell me if they're correct
and if they could be improved/simplified?

I'm tweaking my multiple-major-mode setup of psgml / php-mode / css-mode /
javascript-generic-mode for (X)HTML editing. My previous regexps worked
only 75% of the time, and I was trying to improve them and have ended up
breaking things altogether. The current attempt seems to send emacs into
some kind of loop -- CPU hits 100% and I have to kill emacs:

;; Set up an mmm group for fancy html editing
(mmm-add-group
 'fancy-html
 '((html-php-embedded
    :submode php-mode
    :face mmm-code-submode-face
    :front "<[?]php"
    :back "[?]>")
   (html-css-embedded
    :submode css-mode
    :face mmm-code-submode-face
    :front "<style\\s-+\\(\\s-*.*\\s-+\\)*.*css\"?\\(\\s-*.*\\s-*\\)*\\s-*>"
    :back "</style>")
   (html-css-attribute
    :submode css-mode
    :face mmm-code-submode-face
    :front "\\bstyle=\"?"
    :back "\"")
   (html-javascript-embedded
    :submode javascript-generic-mode
    :face mmm-code-submode-face
    :front "<script\\s-+\\(\\s-*.*\\s-+\\)*.*javascript.*\\(\\s-*.*\\s-+\\)*\\s-*>"
    :back "</script>")
   (html-javascript-attribute
    :submode javascript-generic-mode
    :face mmm-code-submode-face
    :front "\\bon\\w+=\"?"
    :back "\"")))

I have to edit a lot of other people's HTML, and it is very often invalid.
Element and attribute names may be in a mix of upper and lower case,
attribute values may or may not be quoted, required attributes may be
omitted and nonexistent attributes included!

In particular, the regexps for html-css-embedded and
html-javascript-embedded are the ones I need someone to look over for me.

So, for CSS

"<style\\s-+\\(\\s-*.*\\s-+\\)*.*css\"?\\(\\s-*.*\\s-*\\)*\\s-*>"

should match a "style" element, regardless of how it's spaced out, that at
least contains the string "css" somewhere (and "style" and "css" may be
upper or lower case). For example,

<style
   attr1="val1"
   attr2="val2"
   type="text/css"
   attr3="val3"
   attr4="val4"
>

and

<style type="text/css">

For javascript

"<script\\s-+\\(\\s-*.*\\s-+\\)*.*javascript.*\\(\\s-*.*\\s-+\\)*\\s-*>"

should match a "script" element that contains the string "javascript" and
which may again be variably spaced and either upper case or lower case.

It's mainly the variable whitespacing, and the fact that it's so hard to
know what might come between "<style/<script", "css/javascript" and ">"
that is throwing me, and my attempts at just experimenting and seeing what
got highlighted correctly have been dampened somewhat by emacs being sent
into a tailspin by my last "experiment". I'd really appreciate some help.
Thanks in advance.

Best, Darren

-- 
======================================================================
D. D. Brierton            darren@dzr-web.com           www.dzr-web.com
       Trying is the first step towards failure (Homer Simpson)
======================================================================


* Re: Help needed with regexps
From: D. D. Brierton @ 2004-02-13 19:21 UTC


On Fri, 13 Feb 2004 19:17:44 +0000, D. D. Brierton wrote:

> Could a regexp guru look over these regexps and tell me if they're correct
> and if they could be improved/simplified?

Ooops! I started writing that post before I completely broke the regexps
altogether! Obviously I *know* they're not correct!


* Re: Help needed with regexps
From: Stefan Monnier @ 2004-02-13 19:37 UTC


> 		:front "<style\\s-+\\(\\s-*.*\\s-+\\)*.*css\"?\\(\\s-*.*\\s-*\\)*\\s-*>"

Regexps like A*.*B* are asking for trouble because there is an exponential
number of ways to match them and the regexp-engine uses backtracking.
So it will get stuck trying them all, potentially for hours or even years.
To you it just looks like "it's stuck".


        Stefan
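
A minimal sketch of the effect described above, not part of the original
message. It assumes a stock Emacs with the bundled `benchmark' library. The
helper names `my-regexp-time' and `my-backtracking-demo' and the pattern
"\\(a+\\)+b" are hypothetical stand-ins chosen for illustration; they are
not the regexps from this thread. Each input is a run of "a" with no "b",
so the match has to fail, and a backtracking matcher must try many ways of
splitting the run before giving up. How quickly the times grow, or whether
the matcher signals an error instead, may depend on the Emacs version.

```elisp
;; Illustration only: time an ambiguous regexp on inputs that cannot match.
(require 'benchmark)

(defun my-regexp-time (regexp string)
  "Return the seconds spent matching REGEXP against STRING.
If the matcher signals an error, return the error symbol instead."
  (condition-case err
      (car (benchmark-run 1 (string-match regexp string)))
    (error (car err))))

(defun my-backtracking-demo ()
  "Return a list of (LENGTH . RESULT) pairs for growing runs of \"a\"."
  (mapcar (lambda (n)
            ;; Nested repetition: \(a+\)+ can split a run of "a" in many ways.
            (cons n (my-regexp-time "\\(a+\\)+b" (make-string n ?a))))
          '(12 16 20)))

;; Evaluate to inspect the timings on your own Emacs.
(my-backtracking-demo)
```

The input lengths are kept small on purpose so that evaluating the form does
not hang; larger values are where the "it's stuck" behaviour would show up.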


* Re: Help needed with regexps
From: D. D. Brierton @ 2004-02-13 19:57 UTC


On Fri, 13 Feb 2004 19:37:32 +0000, Stefan Monnier wrote:

>> 		:front "<style\\s-+\\(\\s-*.*\\s-+\\)*.*css\"?\\(\\s-*.*\\s-*\\)*\\s-*>"
> 
> Regexps like A*.*B* are asking for trouble because there is an exponential
> number of ways to match them and the regexp-engine uses backtracking.
> So it will get stuck trying them all, potentially for hours or even years.
> To you it just looks like "it's stuck".

Ah of course. I'm such a nincompoop sometimes. Replacing the many .*s with
more specific expressions has solved the emacs tailspin problem. Thanks!

Best, Darren


* Re: Help needed with regexps
From: D. D. Brierton @ 2004-02-13 20:16 UTC


On Fri, 13 Feb 2004 19:17:44 +0000, D. D. Brierton wrote:

> In particular, the regexps for html-css-embedded and
> html-javascript-embedded are the ones I need someone to look over for me.
> 
> So, for CSS
> 
> "<style\\s-+\\(\\s-*.*\\s-+\\)*.*css\"?\\(\\s-*.*\\s-*\\)*\\s-*>"

My current version of this is:

"<style\\s-+\\(\\s-*\\w+=\"\\w+\"\\s-+\\)*type=\"\\(text/\\)?css\"\\(\\s-*\\w+=\"\\w+\"\\s-*\\)*\\s-*>"

This now looks for a "style" element that contains a "type" attribute
with either value "text/css" or the incorrect "css". Not ideal, and
doesn't work for the situation I just thought of where someone has just
used a "<style> ... </style>" element with no attributes. Hmmm. Perhaps
just this would be better?

"<style\\(\\s-+\\w+=\"?\\w+\"?\\)*\\s-*>"

(I'd originally wanted to keep the "css" string a match requirement in
case I ever came across some weird instance of someone attempting to use
something other than CSS to style an HTML page (I don't know what ... maybe
JSSS). But in all honesty, I guess that is never going to happen.)

> For javascript
> 
> "<script\\s-+\\(\\s-*.*\\s-+\\)*.*javascript.*\\(\\s-*.*\\s-+\\)*\\s-*>"
> 
> should match a "script" element that contains the string "javascript" and
> which may again be variably spaced and either upper case or lower case.

My current regexp for embedded javascript is:

"<script\\s-+\\(\\s-*\\w+=\"\\w+\"\\s-+\\)*\\(language\\|type\\)=\"\\(text/\\)?javascript[.0-9]*\"\\(\\s-*\\w+=\"\\w+\"\\s-+\\)*\\s-*>"

Unlike the CSS case, matching "javascript" is more of an issue, as people
do include VBscript on web pages. However, I probably want the case where
all there is is a "<script> ... </script>" element with no attributes to
default to javascript-mode as well. Besides, the above regexp looks way too
complicated to me. Any suggestions?
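
One possible simplification, sketched here as an assumption rather than
anything proposed in the original thread: match the opening tag with a
single negated character class, "[^>]*", instead of nested attribute
groups. There is then only one way to consume the attributes, and "[^>]"
also matches newlines, so the multi-line example is covered. The names
`my-style-open', `my-script-open' and `my-regexp-matches-p' are
hypothetical. The last probe shows that the shorter "<style" pattern quoted
above does not match type="text/css" under the standard syntax table,
because "\\w" does not match "/".

```elisp
;; Illustration only: hypothetical patterns, checked outside of mmm-mode.
(defconst my-style-open "<style\\b[^>]*>"
  "Hypothetical pattern: an opening style tag with any attributes.")

(defconst my-script-open "<script\\b[^>]*>"
  "Hypothetical pattern: an opening script tag with any attributes.")

(defun my-regexp-matches-p (regexp string)
  "Return t if REGEXP matches somewhere in STRING, ignoring case."
  (with-temp-buffer                     ; fresh buffer, standard syntax table
    (let ((case-fold-search t))
      (and (string-match regexp string) t))))

;; Probe strings based on the cases discussed in this thread.
(my-regexp-matches-p my-style-open "<style>")                             ; => t
(my-regexp-matches-p my-style-open "<STYLE\n   type=\"text/css\"\n>")     ; => t
(my-regexp-matches-p my-script-open "<script language=\"JavaScript1.2\">") ; => t
(my-regexp-matches-p my-style-open "<stylesheet>")                        ; => nil

;; The shorter pattern from the message: \w+ stops at the "/" in "text/css".
(my-regexp-matches-p "<style\\(\\s-+\\w+=\"?\\w+\"?\\)*\\s-*>"
                     "<style type=\"text/css\">")                         ; => nil
```

This sketch ignores the attribute contents entirely, so it would also treat
a VBScript "<script>" element as javascript, and "[^>]*" stops early if an
attribute value itself contains ">". The behaviour of "\\b" and "\\w" also
depends on the syntax table of the buffer the regexp is used in.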

