From: Eric Abrahamsen <eric@ericabrahamsen.net>
To: help-gnu-emacs@gnu.org
Subject: Re: Most used words in current buffer
Date: Sat, 21 Jul 2018 09:15:28 -0700
Message-ID: <87effwtvj3.fsf@ericabrahamsen.net>
In-Reply-To: piva82$prk$1@dont-email.me

Udyant Wig <udyantw@gmail.com> writes:

> On 07/20/2018 08:38 AM, Bob Newell wrote:
>> By the way on a 2 MB file the elisp version runs in a few seconds.
>> Hats off to the coder.
>
> I am still looking to improve it. For example, on a 4.5 MB text file,
> the original version takes over 5 seconds to run, as measured using the
> functions #'benchmark-run and #'benchmark-run-compiled.
>
> Is it feasible to read words from the buffer and hash them directly from
> there? Or, going further, is there a better way to do this -- counting
> words and producing the N most used -- using some other design, maybe
> with some other data structure?

Interesting... In general I think Emacs is highly optimized to use the
buffer as its textual data structure, more so than a string,
particularly when the code is compiled (many of the text-movement
commands have their own byte-code opcodes). I made the following two
commands to collect words from a novel in an Org file, and the one that
uses `forward-word' and `buffer-substring' runs around twice as fast as
the one that uses `split-string'. Of course, they don't collect the same
list of words! But even if you add more code for trimming, etc., the
buffer version will still likely be faster than operating on a string.

(defun test-string (&optional f)
  "Count the words in file F by splitting its contents as one string."
  (let ((file (or f "/home/eric/org/hollowmountain.org"))
        str lst)
    (with-temp-buffer
      (insert-file-contents file)
      (setq str (split-string (buffer-string)))
      (dolist (word str)
        (push word lst)))
    (length lst)))

(defun test-buffer (&optional f)
  "Count the words in file F by moving over the buffer with `forward-word'."
  (let ((file (or f "/home/eric/org/hollowmountain.org"))
        pnt lst)
    (with-temp-buffer
      (insert-file-contents file)
      (goto-char (point-min))
      (setq pnt (point))
      (while (forward-word)
        (push (buffer-substring pnt (point)) lst)
        (setq pnt (point))))
    (length lst)))
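
As for hashing words directly from the buffer, here is a minimal sketch
of how that might look. The names `my-word-counts-in-buffer' and
`my-most-used-words' are hypothetical and not from the original code in
this thread. It assumes that a "word" is whatever `forward-word' crosses
after skipping non-word characters, and that counts should ignore case.
I have not timed it, so treat it as a starting point rather than a
known improvement.

```elisp
(defun my-word-counts-in-buffer ()
  "Return a hash table mapping each word in the current buffer to its count.
Hypothetical helper: words are lowercased before counting."
  (let ((counts (make-hash-table :test #'equal)))
    (save-excursion
      (goto-char (point-min))
      ;; Skip anything that is not a word constituent; stop at end of buffer.
      (while (progn (skip-syntax-forward "^w") (not (eobp)))
        (let ((start (point)))
          ;; Point is at the start of a word, so this moves to its end.
          (forward-word)
          (let ((word (downcase
                       (buffer-substring-no-properties start (point)))))
            (puthash word (1+ (gethash word counts 0)) counts)))))
    counts))

(defun my-most-used-words (n)
  "Return the N most frequent words in the current buffer.
The result is a list of (WORD . COUNT) pairs, most frequent first."
  (let (pairs)
    (maphash (lambda (word count) (push (cons word count) pairs))
             (my-word-counts-in-buffer))
    (setq pairs (sort pairs (lambda (a b) (> (cdr a) (cdr b)))))
    ;; Drop everything after the first N pairs.
    (butlast pairs (- (length pairs) n))))
```

To try it on a file, something like
(with-temp-buffer (insert-file-contents file) (my-most-used-words 10))
should work, and wrapping that form in `benchmark-run' gives a timing
to compare against the other versions.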