* Regex to match lines with a specific number of words
@ 2022-04-23 19:44 Joost Kremers
  2022-04-23 20:58 ` Thibaut Verron
  2022-04-26 23:55 ` Nick Dokos
  0 siblings, 2 replies; 11+ messages in thread
From: Joost Kremers @ 2022-04-23 19:44 UTC
  To: help-gnu-emacs

Hi all,

I've been trying to come up with a regex that will match any line containing at
least 30 words in order to kill them from the buffer (preferably with
`kill-matching-lines`, because I need to move the lines to another buffer.)

Frustratingly enough, I have not been successful. Since "word" here can be
interpreted very broadly, I thought this would be easy. Any sequence of
non-whitespace characters surrounded by whitespace can be considered a "word"
(even if it's a number or some special character such as & or #.) So I did this:

\([^[:space:]]+[[:space:]]+\)

This seems to capture a word (in the above sense) plus any following white space
well enough.

But when I try to modify the regex to only match those lines that repeat this
pattern at least 30 times, it fails:

\([^[:space:]]+[[:space:]]+\)\{30,\}

Passing this to `flush-lines` simply deletes everything in the buffer starting
at point, telling me it "[d]eleted 1 matching line", even though (many) more
lines were deleted. Adding ^ and $ around the regex didn't have any effect.

So what am I doing wrong here? 


-- 
Joost Kremers
Life has its moments




* Re: Regex to match lines with a specific number of words
  2022-04-23 19:44 Regex to match lines with a specific number of words Joost Kremers
@ 2022-04-23 20:58 ` Thibaut Verron
  2022-04-23 21:20   ` Joost Kremers
  2022-04-23 22:46   ` Stefan Monnier via Users list for the GNU Emacs text editor
  2022-04-26 23:55 ` Nick Dokos
  1 sibling, 2 replies; 11+ messages in thread
From: Thibaut Verron @ 2022-04-23 20:58 UTC
  To: Joost Kremers; +Cc: help-gnu-emacs

Hi,

The group [:space:] also matches newline characters. So your search has
exactly one match, spanning many lines.
You can use [:blank:] instead to match spaces and tabs only, for the
separator.

It's probably better to keep [^[:space:]] for the first group, you wouldn't
want to start matching newlines there.
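
The effect is easy to reproduce outside Emacs. The following is only a
rough Python sketch: it assumes that Python's \s can stand in for
[[:space:]] (both can match a newline) and that [ \t] can stand in for
[[:blank:]]. The sample text and the repetition counts are made up for
the illustration.

```python
import re

# Assumption: \s plays the role of Emacs's [[:space:]] (it can match a
# newline), and [ \t] plays the role of [[:blank:]].
text = "one two\nthree four\nfive six seven eight\n"

# The separator may be a newline, so words are counted across lines.
spanning = re.compile(r"(?:\S+\s+){4,}")
# The separator is limited to space and tab, so a match stays on one line.
per_line = re.compile(r"(?:\S+[ \t]+){3,}")

spanning_match = spanning.search(text)
per_line_match = per_line.search(text)

print(repr(spanning_match.group(0)))  # the whole text, newlines included
print(repr(per_line_match.group(0)))  # 'five six seven '
```

No single line in the sample has four words followed by whitespace, yet
the first pattern matches the entire text as one match, which is the
"1 matching line" behaviour described above.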

Best wishes,
Thibaut




* Re: Regex to match lines with a specific number of words
  2022-04-23 20:58 ` Thibaut Verron
@ 2022-04-23 21:20   ` Joost Kremers
  2022-04-23 21:46     ` Thibaut Verron
  2022-04-23 22:46   ` Stefan Monnier via Users list for the GNU Emacs text editor
  1 sibling, 1 reply; 11+ messages in thread
From: Joost Kremers @ 2022-04-23 21:20 UTC
  To: thibaut.verron; +Cc: help-gnu-emacs


On Sat, Apr 23 2022, Thibaut Verron wrote:
> The group [:space:] also matches newline characters. So your search has
> exactly one match, spanning many lines.
> You can use [:blank:] instead to match spaces and tabs only, for the
> separator.

Thanks! I never would have thought of that. (Why isn't this mentioned explicitly
in the manual?)

Unfortunately, passing this regexp to `flush-lines` or `kill-matching-lines` in
a file of close to 65000 lines completely cripples Emacs... One CPU core runs up
to 100% and Emacs becomes unresponsive.

Lemme see if a function that goes through the buffer, splits every line on white
space and deletes those that are too long works better.

Thanks, though, for the quick reply!



-- 
Joost Kremers
Life has its moments




* Re: Regex to match lines with a specific number of words
  2022-04-23 21:20   ` Joost Kremers
@ 2022-04-23 21:46     ` Thibaut Verron
  2022-04-23 22:11       ` [External] : " Drew Adams
  2022-04-23 22:21       ` Joost Kremers
  0 siblings, 2 replies; 11+ messages in thread
From: Thibaut Verron @ 2022-04-23 21:46 UTC
  To: Joost Kremers; +Cc: help-gnu-emacs

No problem! The information is in the manual, but hidden behind several
layers of redirection.
I find the emacswiki page on regular expressions both more concise and
more informative.

Regarding performance, that's a bit strange.
Is it better if you add ^ and $ around the expression? Or if you add only ^
and search for exactly 30 repetitions (not 30 or more)?

Best wishes,
Thibaut




* RE: [External] : Re: Regex to match lines with a specific number of words
  2022-04-23 21:46     ` Thibaut Verron
@ 2022-04-23 22:11       ` Drew Adams
  2022-04-23 22:32         ` Thibaut Verron
  2022-04-23 22:21       ` Joost Kremers
  1 sibling, 1 reply; 11+ messages in thread
From: Drew Adams @ 2022-04-23 22:11 UTC
  To: thibaut.verron@gmail.com, Joost Kremers; +Cc: help-gnu-emacs

> The information is in the manual, but hidden
> behind several layers of redirection.
> I find the emacswiki page on regular expressions 
> both more concise and more informative.

https://www.emacswiki.org/emacs/CategoryRegexp


* Re: Regex to match lines with a specific number of words
  2022-04-23 21:46     ` Thibaut Verron
  2022-04-23 22:11       ` [External] : " Drew Adams
@ 2022-04-23 22:21       ` Joost Kremers
  1 sibling, 0 replies; 11+ messages in thread
From: Joost Kremers @ 2022-04-23 22:21 UTC
  To: thibaut.verron; +Cc: help-gnu-emacs


On Sat, Apr 23 2022, Thibaut Verron wrote:
> No problem! The information is in the manual, but hidden behind several
> layers of redirection.
> I find the emacswiki page on regular expressions both more concise and
> more informative.

Thanks, I'll check it out.

> Regarding performance, that's a bit strange.
> Is it better if you add ^ and $ around the expression? Or if you add only ^
> and search for exactly 30 repetitions (not 30 or more)?

Well, either version still runs up one CPU core to 100%. The only difference
seems to be that they are more easily interruptible with C-g: Emacs responds
immediately, whereas before it would take seconds to respond to C-g and in one
case it did not respond at all. (I ended up killing Emacs when GNOME popped up
a suggestion to do so...)

> On Sat, 23 Apr 2022 at 23:34, Joost Kremers <joostkremers@fastmail.fm>
> wrote:

>> Lemme see if a function that goes through the buffer, splits every line on
>> white space and deletes those that are too long works better.

That actually worked well. It still takes a few seconds to run, but really just
a few seconds. And that's including dumping all the extracted lines into a
separate buffer.
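
The function itself is not shown in this thread. As a minimal sketch of
the same idea (written in Python rather than Emacs Lisp, so it is not the
code that was actually run), the approach could look like this. The name
partition_by_word_count and its min_words parameter are hypothetical.

```python
def partition_by_word_count(lines, min_words=30):
    """Split lines into (kept, extracted) by whitespace-separated word count."""
    kept, extracted = [], []
    for line in lines:
        # str.split() with no argument splits on any run of whitespace
        # and ignores leading and trailing whitespace.
        if len(line.split()) >= min_words:
            extracted.append(line)
        else:
            kept.append(line)
    return kept, extracted


# Made-up sample: 2 words, exactly 30 words, and 31 "words" with symbols.
sample = ["short line", " ".join(["word"] * 30), "x " * 29 + "& #"]
kept, extracted = partition_by_word_count(sample)
print(len(kept), len(extracted))  # 1 2
```

Each line is visited once and split once, so the cost grows linearly with
the size of the input and there is no regexp backtracking involved.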



-- 
Joost Kremers
Life has its moments




* Re: [External] : Re: Regex to match lines with a specific number of words
  2022-04-23 22:11       ` [External] : " Drew Adams
@ 2022-04-23 22:32         ` Thibaut Verron
  0 siblings, 0 replies; 11+ messages in thread
From: Thibaut Verron @ 2022-04-23 22:32 UTC
  To: Drew Adams; +Cc: Joost Kremers, help-gnu-emacs

Right, I meant this one: https://www.emacswiki.org/emacs/RegularExpression




* Re: Regex to match lines with a specific number of words
  2022-04-23 20:58 ` Thibaut Verron
  2022-04-23 21:20   ` Joost Kremers
@ 2022-04-23 22:46   ` Stefan Monnier via Users list for the GNU Emacs text editor
  2022-04-24 14:31     ` Joost Kremers
  1 sibling, 1 reply; 11+ messages in thread
From: Stefan Monnier via Users list for the GNU Emacs text editor @ 2022-04-23 22:46 UTC
  To: help-gnu-emacs

>> \([^[:space:]]+[[:space:]]+\)\{30,\}

I don't think you want to match more than 30 times: as soon as you have
30 repetitions you know what to do.
And indeed you should likely anchor your regexp at beg of line (but not
end of line, since you want to ignore the rest of the line after matching
30 repetitions).

Note also that regardless of whether [:space:] matches line-feed or not,
one of [[:space:]] or [^[:space:]] will match line-feed.

Personally, I'd go with something like:

    ^\([^\n\t ]+[\t ]+\)\{30\}

[ Tho, IIUC `flush-lines` doesn't know \t and \n when it reads the regexp
  interactively, so you'll probably need C-q C-j and C-q TAB.  ]
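
For readers who want to try the shape of this regexp outside Emacs, here
is a small Python sketch. It assumes Python's re syntax as a stand-in for
Emacs regexps, uses a repetition count of 3 instead of 30 to keep the
samples short, and the helper name at_least_n_separated_words is made up.

```python
import re


def at_least_n_separated_words(n):
    # Python analogue of ^\([^\n\t ]+[\t ]+\)\{N\}: N runs of non-blank
    # characters, each followed by at least one space or tab.
    return re.compile(r"^(?:[^\n\t ]+[\t ]+){%d}" % n)


pattern = at_least_n_separated_words(3)
samples = ["a b c", "a b c ", "a b c d", "a b"]
matches = [bool(pattern.match(s)) for s in samples]
print(matches)  # [False, True, True, False]
```

Note the first sample: every repetition needs a trailing blank, so a line
with exactly N words and no trailing whitespace does not match. If such
lines should count as well, a repetition count of N-1 followed by one
more non-blank run would be one way to express it.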


        Stefan





* Re: Regex to match lines with a specific number of words
  2022-04-23 22:46   ` Stefan Monnier via Users list for the GNU Emacs text editor
@ 2022-04-24 14:31     ` Joost Kremers
  0 siblings, 0 replies; 11+ messages in thread
From: Joost Kremers @ 2022-04-24 14:31 UTC
  To: Stefan Monnier; +Cc: help-gnu-emacs


On Sat, Apr 23 2022, Stefan Monnier via Users list for the GNU Emacs text editor wrote:
>>> \([^[:space:]]+[[:space:]]+\)\{30,\}
>
> I don't think you want to match more than 30 times: as soon as you have
> 30 repetitions you know what to do.

True. I modelled it on a regex for SublimeText, which, however, was used in a
search and replace operation (where the replacement string was empty), so
matching more than 30 times was necessary. But for flush-lines it's not.

[...]
> Personally, I'd go with something like:
>
>     ^\([^\n\t ]+[\t ]+\)\{30\}

Thanks, I'll try that and see if it gives any speed advantages.


-- 
Joost Kremers
Life has its moments




* Re: Regex to match lines with a specific number of words
  2022-04-23 19:44 Regex to match lines with a specific number of words Joost Kremers
  2022-04-23 20:58 ` Thibaut Verron
@ 2022-04-26 23:55 ` Nick Dokos
  2022-04-27  7:23   ` Jean Louis
  1 sibling, 1 reply; 11+ messages in thread
From: Nick Dokos @ 2022-04-26 23:55 UTC
  To: help-gnu-emacs

Joost Kremers <joostkremers@fastmail.fm> writes:


> I've been trying to come up with a regex that will match any line containing at
> least 30 words in order to kill them from the buffer (preferably with
> `kill-matching-lines`, because I need to move the lines to another buffer.)
>
> ...
>
> Passing this to `flush-lines` simply deletes everything in the buffer starting
> at point, telling me it "[d]eleted 1 matching line", even though (many) more
> lines were deleted. Adding ^ and $ around the regex didn't have any effect.
>

Do you have to do it in emacs? Why not use `awk'? It could be as simple as

awk 'NF < 30  {print;}'

and saving the other lines to a different file is not much more difficult:

awk 'NF < 30 {print;}
NF >= 30 { print > "long-stuff.txt"}'

-- 
Nick

"There are only two hard problems in computer science: cache invalidation, naming things, and off-by-one errors." -Martin Fowler





* Re: Regex to match lines with a specific number of words
  2022-04-26 23:55 ` Nick Dokos
@ 2022-04-27  7:23   ` Jean Louis
  0 siblings, 0 replies; 11+ messages in thread
From: Jean Louis @ 2022-04-27  7:23 UTC
  To: Nick Dokos; +Cc: help-gnu-emacs

I would solve that problem first by making a list of lines with more
than 30 words, each line being designated, and then searching within
those lines.
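
One possible reading of this suggestion, sketched in Python under the
assumption that "designated" means recording each long line's number:
collect the numbers of the lines with more than 30 words first, then work
only on those. The function name long_line_numbers and the sample text
are hypothetical.

```python
def long_line_numbers(text, min_words=30):
    """Return 1-based numbers of lines with more than min_words words."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if len(line.split()) > min_words
    ]


# Made-up sample with a small threshold to keep it readable.
numbers = long_line_numbers("a b c\na\na b c d\n", min_words=2)
print(numbers)  # [1, 3]
```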

-- 
Jean

Take action in Free Software Foundation campaigns:
https://www.fsf.org/campaigns

In support of Richard M. Stallman
https://stallmansupport.org/


