From: Udyant Wig <udyantw@gmail.com>
To: help-gnu-emacs@gnu.org
Subject: Re: Most used words in current buffer
Date: Sun, 22 Jul 2018 23:49:01 +0530
Message-ID: <pj2hqo$ag3$1@dont-email.me>
In-Reply-To: <mailman.4007.1532231884.1292.help-gnu-emacs@gnu.org>

On 07/22/2018 09:27 AM, Eric Abrahamsen wrote:
> As Stefan said, going character by character is going to be
> slow... But my example with `forward-word' collects a lot of cruft. So
> I would suggest doing what `forward-word' does internally and move by
> syntax.  This also opens up the possibility of tweaking the behavior
> of your function (ie, what constitutes a word) by setting temporary
> syntax tables. Here's a word scanner that only picks up actual words
> (according to the default syntax table):
>
> (defun test-buffer (&optional f)
>   (let ((file (or f "/home/eric/org/hollowmountain.org"))
> 	pnt lst)
>     (with-temp-buffer
>       (insert-file-contents file)
>       (goto-char (point-min))
>       (skip-syntax-forward "^w")
>       (setq pnt (point))
>       (while (and (null (eobp)) (skip-syntax-forward "w"))
> 	(push (buffer-substring pnt (point)) lst)
> 	(skip-syntax-forward "^w")
> 	(setq pnt (point))))
>     (nreverse lst)))

Thank you for the idea!  It did wonders for the running time.  A sample
timing follows my adaptation of your idea to the code.

---
(require 'cl-lib)  ; provides cl-loop, cl-incf, cl-sort, cl-subseq

(defun buffer-most-used-words-4 (n)
  "Make a list of the N most used words in buffer."
  (let ((counts (make-hash-table :test #'equal))
        sorted-counts
        start
        end)
    (save-excursion
      (goto-char (point-min))
      ;; Skip any leading non-word characters.
      (skip-syntax-forward "^w")
      (setf start (point))
      (cl-loop until (eobp)
               do
               ;; Move over one word and count it.
               (skip-syntax-forward "w")
               (setf end (point))
               (cl-incf (gethash (buffer-substring start end) counts 0))
               ;; Move to the start of the next word.
               (skip-syntax-forward "^w")
               (setf start (point))))
    (cl-loop for word being the hash-keys of counts
             using (hash-values count)
             do
             (push (list word count) sorted-counts)
             finally (setf sorted-counts (cl-sort sorted-counts #'>
                                                  :key #'cl-second)))
    ;; Clamp N so a buffer with fewer than N distinct words does not
    ;; signal an error in `cl-subseq'.
    (mapcar #'cl-first
            (cl-subseq sorted-counts 0 (min n (length sorted-counts))))))
---

Compiled, this version takes about half the time the previous version --
going character by character -- took to process a 4.5 MB text file.

Average timing over ten runs on the above-mentioned file: 2.75 seconds.
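
For anyone wanting to reproduce such a measurement, here is a minimal
sketch of one way to average ten runs with `benchmark-run'.  The helper
name `example-average-seconds' is hypothetical and not part of the code
above; it only assumes the standard `benchmark' library.

```elisp
(require 'benchmark)

(defun example-average-seconds (fn)
  "Call FN ten times and return the average elapsed seconds per call.
Hypothetical helper, for illustration only."
  ;; `benchmark-run' returns a list (TOTAL-ELAPSED GC-COUNT GC-ELAPSED);
  ;; the total elapsed time includes time spent in garbage collection.
  (let ((timing (benchmark-run 10 (funcall fn))))
    (/ (car timing) 10.0)))
```

With the file to be measured as the current buffer, a call could look
like (example-average-seconds (lambda () (buffer-most-used-words-4 10))).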


On syntax tables: the ability to determine what counts as a word or
other construct in a buffer could be very handy indeed.  One application
beyond prose text that comes to mind is counting the most used
variables or functions in a file of source code.  There might be others,
of course.
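
As a rough sketch of the source-code idea, the same scanning loop can
run under a temporary syntax table in which `_' and `-' are word
constituents, so that identifiers such as foo_bar are counted whole.
The function name `example-most-used-symbols' and the choice of those
two characters are assumptions made for illustration; a real language
would need its own rules.

```elisp
(require 'cl-lib)

(defun example-most-used-symbols (n)
  "Return up to N of the most frequent identifiers in the current buffer.
Hypothetical helper: treats `_' and `-' as word constituents."
  (let ((table (make-syntax-table (syntax-table)))
        (counts (make-hash-table :test #'equal))
        pairs)
    ;; Assumption: identifiers may contain underscores and hyphens.
    ;; Note that a lone "-" or "_" then also counts as a word.
    (modify-syntax-entry ?_ "w" table)
    (modify-syntax-entry ?- "w" table)
    (save-excursion
      (with-syntax-table table
        (goto-char (point-min))
        (skip-syntax-forward "^w")
        (while (not (eobp))
          (let ((start (point)))
            (skip-syntax-forward "w")
            (cl-incf (gethash (buffer-substring-no-properties start (point))
                              counts 0)))
          (skip-syntax-forward "^w"))))
    ;; Collect (WORD . COUNT) pairs and sort by descending count.
    (maphash (lambda (word count) (push (cons word count) pairs)) counts)
    (setq pairs (sort pairs (lambda (a b) (> (cdr a) (cdr b)))))
    (mapcar #'car (cl-subseq pairs 0 (min n (length pairs))))))
```

The temporary table inherits from the buffer's current one, so only
the two modified characters behave differently, and the buffer's own
syntax table is restored when `with-syntax-table' exits.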

Udyant Wig
-- 
We make our discoveries through our mistakes: we watch one another's
success: and where there is freedom to experiment there is hope to
improve.
                                -- Arthur Quiller-Couch



