regexp and strings you don't want

unofficial mirror of help-gnu-emacs@gnu.org

* regexp and strings you don't want
From: Chaz @ 2003-08-25 19:45 UTC


Hi,

I know that ^ at the start of [ ] excludes individual characters (or
ranges of characters) from a regular expression search, but is there
an equivalent to eliminate strings?  That is, how can I search for a
regular expression that does not include a specified string?

For example, how can I search for a paragraph beginning with "The"
that does NOT include the word "top"?

Thanks

Chaz

* Re: regexp and strings you don't want
From: Barry Margolin @ 2003-08-25 20:17 UTC


In article <6c185cf3.0308251145.6af55ffc@posting.google.com>,
Chaz <chaz2@thedoghousemail.com> wrote:
>I know that ^ at the start of [ ] excludes individual characters (or
>ranges of characters) from a regular expression search, but is there
>an equivalent to eliminate strings?  That is, how can I search for a
>regular expression that does not include a specified string?
>
>For example, how can I search for a paragraph beginning with "The"
>that does NOT include the word "top"?

This is something that regexps by themselves are pretty bad at.  What you
should do is collect all the paragraphs that begin with "The", and then
search each of them for "top", and discard those from the list.
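A minimal sketch of this two-step approach, written in Python rather than Emacs Lisp. It assumes paragraphs are separated by blank lines; the function name `paragraphs_without` and the sample text are made up for illustration.

```python
import re

def paragraphs_without(text, start="The", unwanted="top"):
    """Return paragraphs that begin with `start` and lack the word `unwanted`."""
    # Step 1: split on blank lines and keep the paragraphs that begin with `start`.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    candidates = [p for p in paragraphs if p.startswith(start)]
    # Step 2: discard the candidates that contain `unwanted` as a whole word.
    word = re.compile(r"\b" + re.escape(unwanted) + r"\b")
    return [p for p in candidates if not word.search(p)]

sample = """The cat sat
on the mat.

The spinning top
fell over.

A top hat.

The stop sign."""

result = paragraphs_without(sample)
print(result)
```

Because the second step looks for "top" as a whole word, the paragraph containing "stop" is kept; drop the `\b` anchors to reject any occurrence of the letters.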

-- 
Barry Margolin, barry.margolin@level3.com
Level(3), Woburn, MA
*** DON'T SEND TECHNICAL QUESTIONS DIRECTLY TO ME, post them to newsgroups.
Please DON'T copy followups to me -- I'll assume it wasn't posted to the group.

* Re: regexp and strings you don't want
From: Chaz @ 2003-08-26 18:13 UTC


Barry Margolin <barry.margolin@level3.com> wrote in message news:<A9u2b.399$mD.8@news.level3.com>...

> >For example, how can I search for a paragraph beginning with "The"
> >that does NOT include the word "top"?
> 
> This is something that regexps by themselves are pretty bad at.  

That's too bad.  I wouldn't have thought it so complicated to have
some function like [^"string"], or to list strings as options, maybe
["dog""cat""turtle"].  These would be quite handy.

* Re: regexp and strings you don't want
From: Eric Pement @ 2003-08-26 22:19 UTC


chaz2@thedoghousemail.com (Chaz) wrote in message news:<6c185cf3.0308251145.6af55ffc@posting.google.com>...
> I know that ^ at the start of [ ] excludes individual characters (or
> ranges of characters) from a regular expression search, but is there
> an equivalent to eliminate strings?  That is, how can I search for a
> regular expression that does not include a specified string?
> 
> For example, how can I search for a paragraph beginning with "The"
> that does NOT include the word "top"?

A big problem is that paragraphs are multi-line objects and they
aren't amenable to simple grep operations or "M-x occur", which I'm
using much more frequently now.

But to answer your question more directly, here's how to locate the
matching paragraphs (separated by blank lines) in sed:

   sed '/./{H;$!d;};x;/^\nThe/!d;/\<top\>/d' file

And in Emacs, mark the region and then do

   M-| sed '/./{H;$!d;};x;/^\nThe/!d;/\<top\>/d' <RET>

   M-| runs shell-command-on-region, sending the results to a new
window, where you can see which paragraphs match your specification.
It's not like "M-x occur", which will take you directly to the
matching lines, but it lets you see the matching paragraphs without
leaving Emacs.

   I presume you have GNU sed; if not, omit the \< and \>  characters,
which are GNU options to match whole words. If you use WinNT Emacs,
wrap the sed script in "double quotes" instead of 'single quotes'. 
Hope this helps.

--
Eric Pement

* Re: regexp and strings you don't want
From: Kevin Rodgers @ 2003-08-27 15:13 UTC


Chaz wrote:

> Barry Margolin <barry.margolin@level3.com> wrote in message news:<A9u2b.399$mD.8@news.level3.com>...
>>>For example, how can I search for a paragraph beginning with "The"
>>>that does NOT include the word "top"?
>>>
>>This is something that regexps by themselves are pretty bad at.  
>>
> 
> That's too bad.  I wouldn't have thought it so complicated to have
> some function like [^"string"], or to list strings as options, maybe
> ["dog""cat""turtle"].  These would be quite handy.


I don't think you understand the theory behind regular expressions, which are
based on regular languages and the finite state automata that recognize them.

But as Barry said, it's easy to write a function that does what you want.


-- 
Kevin Rodgers

* Re: regexp and strings you don't want
From: Kai Großjohann @ 2003-08-27 20:26 UTC

chaz2@thedoghousemail.com (Chaz) writes:

> For example, how can I search for a paragraph beginning with "The"
> that does NOT include the word "top"?

It is possible to build a regexp that does this (disregarding the
paragraph problem at the moment), but it is not pretty.

Some regexp implementations have the feature you're looking for to
make it convenient, but the Emacs implementation doesn't.

Let me rephrase this in terms of lines instead of paragraphs.

The idea is this: search for a line that begins with The and then
does not have top after it, as follows: after The, we allow any
characters that aren't t.  We also allow a t followed by something
that's not o, and also a to that's followed by something that's not
p.  And so on:

"^The\\([^t]*\\($\\|t$\\|t[^o]\\|to$\\|to[^p]\\)\\)*$"

The above regexp is in Lisp syntax, with doubled backslashes.  Note
that I treat the newline that might follow a t, or to, specially.

Do you see the idea?  I hope I haven't made a mistake, but if you
understand the idea, you'll see what to do.
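To try the idea outside Emacs, here is a hedged Python translation of the regexp above (`\\(` becomes `(?:`, `\\|` becomes `|`), tested on single lines. The second pattern, `strict`, is not from this thread: it is a hand-derived variant, offered as an assumption, that tracks how much of "top" has just been seen so that a "t" which might start "top" is never swallowed by the previous alternative.

```python
import re

# The Emacs regexp from the message, rewritten in Python syntax.
kai = re.compile(r"^The(?:[^t]*(?:$|t$|t[^o]|to$|to[^p]))*$")

# Hand-derived alternative (an assumption, not from the thread). After a "t"
# it allows further "t" or "ot" loops, then leaves via a character that
# cannot continue "top"; the optional tail covers lines ending in "t" or "to".
strict = re.compile(
    r"^The(?:[^t]|t(?:t|ot)*(?:[^to]|o[^tp]))*(?:t(?:t|ot)*o?)?$"
)

for line in ["The cat sat", "The top", "The stop sign", "The ttop", "The totop"]:
    print(f"{line!r}: kai={bool(kai.match(line))} strict={bool(strict.match(line))}")
```

Both patterns accept "The cat sat" and reject "The top" and "The stop sign". The translated regexp also accepts "The ttop" and "The totop", because `t[^o]` and `to[^p]` consume the very "t" that begins the unwanted "top"; the hand-derived variant rejects both.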
-- 
Two cafe au lait please, but without milk.

* Re: regexp and strings you don't want
From: Stefan Monnier @ 2003-08-29 15:50 UTC


> That's too bad.  I wouldn't have thought it so complicated to have
> some function like [^"string"],

What would it match, exactly ?  There are many different possibilities with
very different behavior: match a string different from "string"; match a
string which does not include "string" as a submatch; match the empty string
if it is not followed by a string that matches "string"; ...
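The three readings listed above can be told apart with a small experiment. This is only a sketch in Python, whose `re` module has the negative lookahead `(?!...)` that Emacs regexps lack; the pattern names are invented for illustration.

```python
import re

# 1. Match a whole string that is different from "string".
different = re.compile(r"(?!string\Z).*", re.DOTALL)
# 2. Match a whole string that does not include "string" anywhere.
excludes = re.compile(r"(?:(?!string).)*", re.DOTALL)
# 3. Match the empty string if it is not followed by "string".
not_followed = re.compile(r"(?!string)")

for s in ["string", "strings", "no str here"]:
    print(repr(s),
          bool(different.fullmatch(s)),
          bool(excludes.fullmatch(s)),
          bool(not_followed.match(s)))
```

The input "strings" separates the readings: it is a different string from "string", so the first pattern accepts it, yet it contains "string" and begins with it, so the other two reject it.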

> or to list strings as options, maybe
> ["dog""cat""turtle"].  These would be quite handy.

I don't understand what you want here.  How is that different from
"dog\\|cat\\|turtle" ?


        Stefan

* Re: regexp and strings you don't want
From: Oliver Scholz @ 2003-08-29 16:14 UTC


kai.grossjohann@gmx.net (Kai Großjohann) writes:

> chaz2@thedoghousemail.com (Chaz) writes:
>
>> For example, how can I search for a paragraph beginning with "The"
>> that does NOT include the word "top"?
>
> It is possible to build a regexp that does this (disregarding the
> paragraph problem at the moment), but it is not pretty.
>
> Some regexp implementations have the feature you're looking for to
> make it convenient, but the Emacs implementation doesn't.
>
> Let me rephrase this in terms of lines instead of paragraphs.
>
> The idea is this: search for a line that begins with The and then
> does not have top after it, as follows: after The, we allow any
> characters that aren't t.  We also allow a t followed by something
> that's not o, and also a to that's followed by something that's not
> p.  And so on:
>
> "^The\\([^t]*\\($\\|t$\\|t[^o]\\|to$\\|to[^p]\\)\\)*$"

Hmm. This is not really human readable. Would it be hard and/or bad
to extend `rx' so that it allows for (not STRING)? À la:

(looking-at (rx (and line-start
		     "The "
		     (not "top"))))

Here `(not "top")' would compile to a normal regexp in the way you
described it. WDYT?

    Oliver
-- 
12 Fructidor an 211 de la Révolution
Liberté, Egalité, Fraternité!

* Re: regexp and strings you don't want
From: Oliver Scholz @ 2003-08-29 18:50 UTC


[Yet another follow-up to myself ...]
[Superseded because of a flaky patch]

Oliver Scholz <alkibiades@gmx.de> writes:

> kai.grossjohann@gmx.net (Kai Großjohann) writes:
>
>> chaz2@thedoghousemail.com (Chaz) writes:
>>
>>> For example, how can I search for a paragraph beginning with "The"
>>> that does NOT include the word "top"?
>>
>> It is possible to build a regexp that does this (disregarding the
>> paragraph problem at the moment), but it is not pretty.
>>
>> Some regexp implementations have the feature you're looking for to
>> make it convenient, but the Emacs implementation doesn't.
>>
>> Let me rephrase this in terms of lines instead of paragraphs.
>>
>> The idea is this: search for a line that begins with The and then
>> does not have top after it, as follows: after The, we allow any
>> characters that aren't t.  We also allow a t followed by something
>> that's not o, and also a to that's followed by something that's not
>> p.  And so on:
>>
>> "^The\\([^t]*\\($\\|t$\\|t[^o]\\|to$\\|to[^p]\\)\\)*$"
>
> Hmm. This is not really human readable. Would it be hard and/or bad
> to extend `rx' so that it allows for (not STRING)? À la:
>
> (looking-at (rx (and line-start
> 		     "The "
> 		     (not "top"))))
>
> Here `(not "top")' would compile to a normal regexp in the way you
> described it. WDYT?
[...]

I've played a bit with this (patch below). But I think I am a bit
puzzled. With my patch, `(rx (not "top"))' translates to:

"\\(?:[^t]*\\|t[^o]*\\|to[^p]*\\)"

Is this actually correct?

What does the concept of a regexp that matches a sequence of
characters that does _not_ contain a certain sequence of characters
actually mean?

Should it match any sequence of characters not identical to the
unwanted one (including the empty string) or should it match only
sequences of the same length? Or any non-empty sequence of characters
not identical with the unwanted one?

With my patch:

(string-match (rx (and line-start
		       "The "
		       (not "top")
 		       " lirum larum"))
	      "The top lirum larum")
 ==> nil

(string-match (rx (and line-start
		       "The "
		       (not "top")
 		       " lirum larum"))
	      "The to lirum larum")
 ==> 0

(string-match (rx (and line-start
		       "The "
		       (not "top")
 		       " lirum larum"))
	      "The lirum larum")

 ==> nil

Is this good or bad?
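One way to explore the question is to run the generated pattern against a few more inputs. The sketch below is a Python translation, not the patched `rx` itself; `match_start` is a made-up helper that mimics the return value of `string-match`, and the last two test strings are additions for illustration.

```python
import re

# Output of the patched `(rx (not "top"))', with Emacs `\\(?:' and `\\|'
# rewritten as Python `(?:' and `|'.
not_top = r"(?:[^t]*|t[^o]*|to[^p]*)"
pattern = re.compile(r"^The " + not_top + r" lirum larum")

def match_start(s):
    """Mimic string-match: return the position of the match, or None."""
    m = pattern.search(s)
    return m.start() if m else None

tests = ["The top lirum larum", "The to lirum larum", "The lirum larum",
         "The big lirum larum", "The old pot lirum larum"]
for s in tests:
    print(repr(s), "==>", match_start(s))
```

The first three inputs reproduce the results above (no match, 0, no match). "The big lirum larum" matches, but "The old pot lirum larum" does not, even though "old pot" contains no "top": each alternative only refuses a single character, so any text containing a "t" that does not start with "t" or "to" is rejected.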

    Oliver (puzzled)


Index: lisp/emacs-lisp/rx.el
===================================================================
RCS file: /cvsroot/emacs/emacs/lisp/emacs-lisp/rx.el,v
retrieving revision 1.3
diff -u -r1.3 rx.el
--- lisp/emacs-lisp/rx.el	23 Dec 2002 17:43:24 -0000	1.3
+++ lisp/emacs-lisp/rx.el	29 Aug 2003 18:46:18 -0000
@@ -334,6 +334,7 @@
 		    '(digit control hex-digit blank graphic printing
 			    alphanumeric letter ascii nonascii lower
 			    punctuation space upper word))
+	      (stringp form)
 	      (and (consp form)
 		   (memq (car form) '(not any in syntax category:))))
     (error "Rx `not' syntax error: %s" form))
@@ -343,27 +344,41 @@
 (defun rx-not (form)
   "Parse and produce code from FORM.  FORM is `(not ...)'."
   (rx-check form)
-  (let ((result (rx-to-string (cadr form) 'no-group)))
-    (cond ((string-match "\\`\\[^" result)
-	   (if (= (length result) 4)
-	       (substring result 2 3)
-	     (concat "[" (substring result 2))))
-	  ((string-match "\\`\\[" result)
-	   (concat "[^" (substring result 1)))
-	  ((string-match "\\`\\\\s." result)
-	   (concat "\\S" (substring result 2)))
-	  ((string-match "\\`\\\\S." result)
-	   (concat "\\s" (substring result 2)))
-	  ((string-match "\\`\\\\c." result)
-	   (concat "\\C" (substring result 2)))
-	  ((string-match "\\`\\\\C." result)
-	   (concat "\\c" (substring result 2)))
-	  ((string-match "\\`\\\\B" result)
-	   (concat "\\b" (substring result 2)))
-	  ((string-match "\\`\\\\b" result)
-	   (concat "\\B" (substring result 2)))
-	  (t
-	   (concat "[^" result "]")))))
+  (if (stringp (cadr form))
+      (rx-reverse-string (cadr form))
+    (let ((result (rx-to-string (cadr form) 'no-group)))
+      (cond ((string-match "\\`\\[^" result)
+	     (if (= (length result) 4)
+		 (substring result 2 3)
+	       (concat "[" (substring result 2))))
+	    ((string-match "\\`\\[" result)
+	     (concat "[^" (substring result 1)))
+	    ((string-match "\\`\\\\s." result)
+	     (concat "\\S" (substring result 2)))
+	    ((string-match "\\`\\\\S." result)
+	     (concat "\\s" (substring result 2)))
+	    ((string-match "\\`\\\\c." result)
+	     (concat "\\C" (substring result 2)))
+	    ((string-match "\\`\\\\C." result)
+	     (concat "\\c" (substring result 2)))
+	    ((string-match "\\`\\\\B" result)
+	     (concat "\\b" (substring result 2)))
+	    ((string-match "\\`\\\\b" result)
+	     (concat "\\B" (substring result 2)))
+	    (t
+	     (concat "[^" result "]"))))))
+
+(defun rx-reverse-string (string)
+  (let ((list nil))
+    (dotimes (i (length string))
+      (push (rx-reverse-string-1 i string) list))
+    (concat "\\(?:"
+	    (mapconcat 'identity (nreverse list) "\\|")
+	    "\\)")))
+
+(defun rx-reverse-string-1 (n string)
+  (concat (substring string 0 n)
+	  "[^" (string (aref string n)) "]*"))
 
 
 (defun rx-repeat (form)

-- 
12 Fructidor an 211 de la Révolution
Liberté, Egalité, Fraternité!

* Re: regexp and strings you don't want
From: Kai Großjohann @ 2003-08-29 19:58 UTC


Oliver Scholz <alkibiades@gmx.de> writes:

> I've played a bit with this (patch below). But I think I am a bit
> puzzled. With my patch, `(rx (not "top"))' translates to:
>
> "\\(?:[^t]*\\|t[^o]*\\|to[^p]*\\)"
>
> Is this actually correct?

Well, err, it depends.

I guess one meaningful meaning of the hypothetical x\\(!:top\\)y would
be to look whether the characters following x are t-o-p.  If so, then
fail.  If not, then do like xy would have done.

That's what Perl does, I believe.
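For comparison, Python's `re` module spells this negative lookahead `(?!top)`, the same syntax Perl uses. A small sketch of the behaviour described above; the `flexible` pattern is an extra illustration, not something from this thread.

```python
import re

# x(?!top)y: after "x", fail if "top" comes next; otherwise behave like "xy".
pattern = re.compile(r"x(?!top)y")
print(bool(pattern.search("xy")))
print(bool(pattern.search("xtopy")))

# The lookahead matters more when something flexible follows it.
flexible = re.compile(r"x(?!top)\w*y")
print(bool(flexible.search("xtopy")))  # "top" follows "x", so no match
print(bool(flexible.search("xtapy")))  # "tap" follows "x", so it matches
```

The lookahead consumes no characters: it only inspects what follows "x" and then lets the rest of the pattern continue from the same position.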
-- 
Two cafe au lait please, but without milk.

* Re: regexp and strings you don't want
From: Oliver Scholz @ 2003-08-29 20:38 UTC


kai.grossjohann@gmx.net (Kai Großjohann) writes:

> Oliver Scholz <alkibiades@gmx.de> writes:
>
>> I've played a bit with this (patch below). But I think I am a bit
>> puzzled. With my patch, `(rx (not "top"))' translates to:
>>
>> "\\(?:[^t]*\\|t[^o]*\\|to[^p]*\\)"
>>
>> Is this actually correct?
>
> Well, err, it depends.
>
> I guess one meaningful meaning of the hypothetical x\\(!:top\\)y would
> be to look whether the characters following x are t-o-p.  If so, then
> fail.  If not, then do like xy would have done.
[...]

Hmpf. I don't see a way to compile this to an ordinary Emacs regexp
programmatically. Is it actually possible, even with your trick? But
maybe I am just a little bit slow this evening. Any thoughts?

    Oliver (time to go to bed now)
-- 
12 Fructidor an 211 de la Révolution
Liberté, Egalité, Fraternité!

* Re: regexp and strings you don't want
From: Ilya Zakharevich @ 2003-08-30 14:50 UTC


[A complimentary Cc of this posting was sent to
Kai Großjohann
<kai.grossjohann@gmx.net>], who wrote in article <84isogutgt.fsf@slowfox.is.informatik.uni-duisburg.de>:
> I guess one meaningful meaning of the hypothetical x\\(!:top\\)y would
> be to look whether the characters following x are t-o-p.  If so, then
> fail.  If not, then do like xy would have done.
> 
> That's what Perl does, I believe.

Keep in mind that Perl has a very primitive REx engine.  E.g., "onion
rings" (google for me and this) are not supported.

The semantics of "onion rings" is that

  A & B & C &! D & E

will match if A can match, and the *substring* which A matched matches
B, C, E, and does not match D.  The syntax is not defined (until it is
supported).
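Since the syntax is undefined, the following is only a brute-force Python sketch of one possible reading of these semantics. It assumes that "matches" means the pattern covers the whole candidate substring; the function `onion_search` and its arguments are hypothetical.

```python
import re

def onion_search(text, required, forbidden=()):
    """Return the first (start, end) span whose substring fully matches every
    pattern in `required` and none in `forbidden`, or None if there is none."""
    required = [re.compile(p) for p in required]
    forbidden = [re.compile(p) for p in forbidden]
    n = len(text)
    for start in range(n + 1):
        # Try longer substrings first, loosely imitating greedy matching.
        for end in range(n, start - 1, -1):
            piece = text[start:end]
            if (all(p.fullmatch(piece) for p in required)
                    and not any(p.fullmatch(piece) for p in forbidden)):
                return start, end
    return None

# A: text starting with "The"; D (negated): text containing "top".
print(onion_search("The cat sat", [r"The.*"], [r".*top.*"]))
print(onion_search("The top shelf", [r"The.*"], [r".*top.*"]))
```

Under this reading the first call returns the whole string, while the second still succeeds on the shorter substring "The to": the conditions apply to whatever A matched, so A has to be forced to span the whole line or paragraph if the intent is to reject it outright.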

[But even with primitive state one can do negation without a problem:
(?!.*B) will match anything which does not contain B (if you put this
expression up front).]
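A short sketch of this up-front lookahead, using Python's `re` (which shares the `(?!...)` syntax with Perl) and the original "The"/"top" example. The `re.DOTALL` flag is an addition here so that `.` crosses newlines and a multi-line paragraph is covered.

```python
import re

# (?!.*top) up front: refuse to match if "top" occurs anywhere ahead.
pattern = re.compile(r"(?!.*top)The", re.DOTALL)

paragraphs = ["The cat sat\non the mat.", "The spinning\ntop fell over."]
for para in paragraphs:
    print(repr(para), bool(pattern.match(para)))
```

The first paragraph matches and the second does not. As written, this also rejects paragraphs containing words such as "stop"; writing `\btop\b` inside the lookahead would restrict it to the whole word.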
