From: Heime via Users list for the GNU Emacs text editor <help-gnu-emacs@gnu.org>
To: Heime via Users list for the GNU Emacs text editor
<help-gnu-emacs@gnu.org>
Subject: Locating repetitions of text sequences
Date: Sat, 22 Oct 2022 22:31:29 +0000
Message-ID: <zcivt9XZBb83tF-j06AVfn6gnL25_CcANzOC76q3tHKKMCFQxoHaVzDLXwajLwlNwflCMnN8S6n7kuYrRvUttde7-0ymwFoa8dqy033cp5Y=@protonmail.com>
https://emacs.stackexchange.com/posts/74219/timeline
I am currently implementing a function that finds repeated sequences of N words in a buffer.
Here is some sample text:
Joseph Rudyard Kipling (30 December 1865 - 18 January 1936)
was an English novelist, short-story writer, poet, and
journalist. He was born in British India, which inspired
much of his work. English novelist, short-story writer,
poet, and journalist.
Kipling's works of fiction include the Jungle Book duology
(The Jungle Book, 1894; The Second Jungle Book, 1895). His
poems include "Mandalay" (1890), "Gunga Din" (1890), "The Gods
of the Copybook Headings" (1919), and "The White Man's Burden"
(1899).
With N=5, the first "Search Sequence" with five components is
--------
Joseph Rudyard Kipling (30 December
--------
which I match against consecutive "Text Extracts" (each shifted by one component):
--------
Joseph Rudyard Kipling (30 December
Rudyard Kipling (30 December 1865
Kipling (30 December 1865 -
--------
and so on.
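The shifting of extracts described above can be sketched roughly as follows. This is only an illustration, assuming that a "component" is a whitespace-separated word; `my-word-windows' is a hypothetical helper name, not an existing Emacs function.

```elisp
(require 'seq)

;; Hypothetical helper: the name and the whitespace-based definition
;; of a "component" are assumptions made for this sketch.
(defun my-word-windows (text n)
  "Return the N-word extracts of TEXT, each shifted by one word."
  (let ((words (split-string text))   ; split on runs of whitespace
        (windows '()))
    (while (and (> n 0) (>= (length words) n))
      ;; Join the first N remaining words into one extract.
      (push (mapconcat #'identity (seq-take words n) " ") windows)
      ;; Shift by one component.
      (setq words (cdr words)))
    (nreverse windows)))

(my-word-windows "Joseph Rudyard Kipling (30 December 1865 -" 5)
;; => ("Joseph Rudyard Kipling (30 December"
;;     "Rudyard Kipling (30 December 1865"
;;     "Kipling (30 December 1865 -")
```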
Then repeat again with the "Search Sequence"
Joseph Rudyard Kipling (30 December
--------------------
Suppose I now reach the "Search Sequence"
---------
novelist, short-story writer, poet, and
---------
then use the following "Text Extracts"
--------
Kipling (30 December 1865 -
Joseph Rudyard Kipling (30 December
Rudyard Kipling (30 December 1865
(30 December 1865 - 18
December 1865 - 18 January
1865 - 18 January 1936)
--------
continued with
--------
English novelist, short-story writer, poet,
novelist, short-story writer, poet, and
short-story writer, poet, and journalist.
writer, poet, and journalist. Kipling's
--------
where a match is found in the second extract.
I then output the line number where the match was found, together with the
repeated text:
--------
4- novelist, short-story writer, poet, and
--------
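As a rough sketch of this reporting step, one possibility is to turn the search sequence into a regexp in which any run of whitespace separates the words, so that a repetition spanning a line break is still found. The helper name `my-phrase-repetition-lines' is hypothetical, and treating the first occurrence in the buffer as the search sequence itself is an assumption of this sketch.

```elisp
;; Hypothetical helper, not part of Emacs.
(defun my-phrase-repetition-lines (phrase)
  "Return line numbers where PHRASE is repeated in the current buffer.
The first occurrence is taken to be the search sequence itself and is
not reported."
  (let ((regexp (mapconcat #'regexp-quote (split-string phrase)
                           "[ \t\n]+"))
        (case-fold-search t)
        (lines '()))
    (save-excursion
      (goto-char (point-min))
      ;; Skip the first occurrence; only later ones are repetitions.
      (when (re-search-forward regexp nil t)
        (while (re-search-forward regexp nil t)
          (push (line-number-at-pos (match-beginning 0)) lines))))
    (nreverse lines)))

(with-temp-buffer
  (insert "Joseph Rudyard Kipling (30 December 1865 - 18 January 1936)\n"
          "was an English novelist, short-story writer, poet, and\n"
          "journalist. He was born in British India, which inspired\n"
          "much of his work. English novelist, short-story writer,\n"
          "poet, and journalist.\n")
  (my-phrase-repetition-lines "novelist, short-story writer, poet, and"))
;; => (4)
```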
and so on until the end of the buffer.
I have started with the following function:
---------
(defun wseqn (n)
  "Search the buffer for repeated phrases of N words.
Display the line number and text of each repetition found."
  (interactive (list (read-number "How many words to search?: " 5)))
  (let ((case-fold-search t)
        (ws " \t\n")
        hits)
    (save-excursion
      (goto-char (point-min))
      (skip-chars-forward ws)
      (while (not (eobp))
        (let* ((p1 (point))
               ;; Move over N whitespace-separated words; P2 ends up
               ;; just after the last character of the Nth word.
               (p2 (progn (dotimes (_ n)
                            (skip-chars-forward ws)
                            (skip-chars-forward (concat "^" ws)))
                          (point)))
               (words (split-string
                       (buffer-substring-no-properties p1 p2)))
               ;; Let any run of whitespace separate the words, so a
               ;; repetition may span a line break.
               (regex-search
                (mapconcat #'regexp-quote words "[ \t\n]+")))
          ;; Only a forward search is necessary.  If the phrase also
          ;; occurred earlier, this occurrence was already reported
          ;; by the search that started there.  The same repetition
          ;; can therefore be reported by several earlier searches.
          (when (and (> n 0) (= (length words) n))
            (while (re-search-forward regex-search nil t)
              (push (format "%d- %s"
                            (line-number-at-pos (match-beginning 0))
                            (mapconcat #'identity words " "))
                    hits)))
          ;; Shift the search sequence forward by one word.
          (goto-char p1)
          (skip-chars-forward (concat "^" ws))
          (skip-chars-forward ws))))
    (message "%s" (if hits
                      (mapconcat #'identity (nreverse hits) "\n")
                    "No repetitions found"))))