From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!.POSTED!not-for-mail
From: Udyant Wig <udyantw@gmail.com>
Newsgroups: gmane.emacs.help
Subject: Re: Most used words in current buffer
Date: Sun, 22 Jul 2018 23:49:01 +0530
Organization: A noiseless patient Spider
Message-ID: <pj2hqo$ag3$1@dont-email.me>
References: <pikcs5$6sm$1@dont-email.me> <861sc1iu1m.fsf@zoho.com>
	<pin1no$l8e$1@dont-email.me> <87pnzkcgna.fsf@bsb.me.uk>
	<mailman.3785.1531961144.1292.help-gnu-emacs@gnu.org>
	<pip7rt$v3m$1@dont-email.me>
	<mailman.3796.1531983885.1292.help-gnu-emacs@gnu.org>
	<piq3hm$sff$1@dont-email.me> <20180719140935156302029@bob.proulx.com>
	<mailman.3861.1532056120.1292.help-gnu-emacs@gnu.org>
	<piva82$prk$1@dont-email.me>
	<mailman.3982.1532189751.1292.help-gnu-emacs@gnu.org>
	<pj02jg$ksf$1@dont-email.me>
	<mailman.4007.1532231884.1292.help-gnu-emacs@gnu.org>
NNTP-Posting-Host: blaine.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
	Thunderbird/52.9.1
To: help-gnu-emacs@gnu.org
In-Reply-To: <mailman.4007.1532231884.1292.help-gnu-emacs@gnu.org>
Xref: news.gmane.org gmane.emacs.help:117558
Archived-At: <http://permalink.gmane.org/gmane.emacs.help/117558>

On 07/22/2018 09:27 AM, Eric Abrahamsen wrote:
> As Stefan said, going character by character is going to be
> slow... But my example with `forward-word' collects a lot of cruft. So
> I would suggest doing what `forward-word' does internally and move by
> syntax.  This also opens up the possibility of tweaking the behavior
> of your function (ie, what constitutes a word) by setting temporary
> syntax tables. Here's a word scanner that only picks up actual words
> (according to the default syntax table):
>
> (defun test-buffer (&optional f)
>   (let ((file (or f "/home/eric/org/hollowmountain.org"))
> 	pnt lst)
>     (with-temp-buffer
>       (insert-file-contents file)
>       (goto-char (point-min))
>       (skip-syntax-forward "^w")
>       (setq pnt (point))
>       (while (and (null (eobp)) (skip-syntax-forward "w"))
> 	(push (buffer-substring pnt (point)) lst)
> 	(skip-syntax-forward "^w")
> 	(setq pnt (point))))
>     (nreverse lst)))

Thank you for the idea!  It did wonders for the running time.  My
adaptation of your approach follows, and a sample timing comes after the
code.

---
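;; `cl-loop', `cl-incf', `cl-sort' and `cl-subseq' come from cl-lib.
(require 'cl-lib)

(defun buffer-most-used-words-4 (n)
  "Return a list of the N most used words in the current buffer."
  (let ((counts (make-hash-table :test #'equal))
        sorted-counts
        start
        end)
    (save-excursion
      (goto-char (point-min))
      ;; Skip any leading non-word characters.
      (skip-syntax-forward "^w")
      (setf start (point))
      (cl-loop until (eobp)
               do
               ;; Move over one word and count it.
               (skip-syntax-forward "w")
               (setf end (point))
               (cl-incf (gethash (buffer-substring-no-properties start end)
                                 counts 0))
               ;; Skip ahead to the start of the next word.
               (skip-syntax-forward "^w")
               (setf start (point))))
    ;; Collect (WORD COUNT) pairs, then sort by descending count.
    (cl-loop for word being the hash-keys of counts
             using (hash-values count)
             do
             (push (list word count) sorted-counts)
             finally (setf sorted-counts (cl-sort sorted-counts #'>
                                                  :key #'cl-second)))
    ;; Clamp N so buffers with fewer than N distinct words do not signal.
    (mapcar #'cl-first
            (cl-subseq sorted-counts 0 (min n (length sorted-counts))))))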
---

Compiled, this version takes about half the time the previous version --
going character by character -- took to process a 4.5 MB text file.

Average time over ten runs on the above-mentioned file: 2.75 seconds.
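
For anyone wanting to reproduce an average like this, here is a minimal
sketch of one way to take it with the `benchmark-run' macro from
benchmark.el.  It is not necessarily how the figure above was obtained.
The helper name `my-average-seconds-over-ten-runs' is made up for
illustration.

```elisp
(require 'benchmark)

(defun my-average-seconds-over-ten-runs (fn)
  "Call FN with no arguments ten times; return the mean seconds per call.
Hypothetical helper, not part of Emacs."
  ;; `benchmark-run' returns (ELAPSED GC-COUNT GC-ELAPSED), where ELAPSED
  ;; is the total wall-clock time in seconds for all repetitions.
  (/ (car (benchmark-run 10 (funcall fn))) 10.0))

;; Usage, assuming `buffer-most-used-words-4' has been evaluated and the
;; buffer visiting the text file is current:
;;
;;   (my-average-seconds-over-ten-runs
;;    (lambda () (buffer-most-used-words-4 10)))
```

The elapsed time includes garbage collection, so the third element of
the `benchmark-run' result is worth a look when comparing versions.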


As for syntax tables, the ability to define what counts as a word or
other construct in a buffer could be very handy indeed.  One application
beyond prose text that comes to mind is finding the most used variable
or function names in a file of source code.  There might be others, of
course.
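
A rough sketch of the source-code idea, under the assumption that a
"name" is any run of word or symbol constituents (syntax classes "w" and
"_") in whatever syntax table is in effect.  The function name
`my-most-used-symbols' is hypothetical, and this is only one possible
approach, not a tested tool.

```elisp
(require 'cl-lib)

(defun my-most-used-symbols (n)
  "Return up to N of the most used symbol-like tokens in the current buffer.
Hypothetical helper: a token is a run of word or symbol constituents
according to the current syntax table."
  (let ((counts (make-hash-table :test #'equal))
        pairs)
    (save-excursion
      (goto-char (point-min))
      ;; Skip anything that is neither a word nor a symbol constituent.
      (skip-syntax-forward "^w_")
      (while (not (eobp))
        (let ((start (point)))
          (skip-syntax-forward "w_")
          (cl-incf (gethash (buffer-substring-no-properties start (point))
                            counts 0)))
        (skip-syntax-forward "^w_")))
    ;; Turn the table into (TOKEN . COUNT) pairs, sorted by descending count.
    (maphash (lambda (token count) (push (cons token count) pairs)) counts)
    (setq pairs (sort pairs (lambda (a b) (> (cdr a) (cdr b)))))
    (mapcar #'car (cl-subseq pairs 0 (min n (length pairs))))))

;; Example: scan a small Lisp snippet using the Emacs Lisp syntax table,
;; in which "-" is a symbol constituent, so "foo-bar" is a single token.
(with-temp-buffer
  (insert "(foo-bar 1) (foo-bar 2) (baz)")
  (with-syntax-table emacs-lisp-mode-syntax-table
    (my-most-used-symbols 1)))
```

Numbers and keywords are counted too, so a real tool would probably want
to filter the tokens before ranking them.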

Udyant Wig
-- 
We make our discoveries through our mistakes: we watch one another's
success: and where there is freedom to experiment there is hope to
improve.
                                -- Arthur Quiller-Couch