From: Udyant Wig
Newsgroups: gmane.emacs.help
Subject: Re: Most used words in current buffer
Date: Sun, 22 Jul 2018 23:49:01 +0530
To: help-gnu-emacs@gnu.org

On 07/22/2018 09:27 AM, Eric Abrahamsen wrote:
> As Stefan said, going character by character is going to be
> slow... But my example with `forward-word' collects a lot of cruft. So
> I would suggest doing what `forward-word' does internally and move by
> syntax. This also opens up the possibility of tweaking the behavior
> of your function (ie, what constitutes a word) by setting temporary
> syntax tables. Here's a word scanner that only picks up actual words
> (according to the default syntax table):
>
> (defun test-buffer (&optional f)
>   (let ((file (or f "/home/eric/org/hollowmountain.org"))
>         pnt lst)
>     (with-temp-buffer
>       (insert-file-contents file)
>       (goto-char (point-min))
>       (skip-syntax-forward "^w")
>       (setq pnt (point))
>       (while (and (null (eobp)) (skip-syntax-forward "w"))
>         (push (buffer-substring pnt (point)) lst)
>         (skip-syntax-forward "^w")
>         (setq pnt (point))))
>     (nreverse lst)))

Thank you for the idea!  It did wonders for the running time.  A sample
timing follows the adaptation of your idea to my code below.

---
(require 'cl-lib)

(defun buffer-most-used-words-4 (n)
  "Make a list of the N most used words in buffer."
  (let ((counts (make-hash-table :test #'equal))
        sorted-counts
        start
        end)
    (save-excursion
      (goto-char (point-min))
      ;; Skip any leading non-word characters.
      (skip-syntax-forward "^w")
      (setf start (point))
      (cl-loop until (eobp)
               do
               ;; Move over one word and count it.
               (skip-syntax-forward "w")
               (setf end (point))
               (cl-incf (gethash (buffer-substring start end) counts 0))
               ;; Move to the start of the next word.
               (skip-syntax-forward "^w")
               (setf start (point))))
    ;; Collect (WORD COUNT) pairs and sort them by descending count.
    (cl-loop for word being the hash-keys of counts
             using (hash-values count)
             do
             (push (list word count) sorted-counts)
             finally
             (setf sorted-counts (cl-sort sorted-counts #'>
                                          :key #'cl-second)))
    ;; Return at most N words, even if fewer distinct words exist.
    (mapcar #'cl-first
            (cl-subseq sorted-counts 0 (min n (length sorted-counts))))))
---

Compiled, this version takes about half the time that the previous
version, which went character by character, took to process a 4.5 MB
text file.

Average timing over ten runs on that file: 2.75 seconds.

On syntax tables, the ability to determine what is a word or other
construct in a buffer could be very handy indeed.  One application
beyond prose text that comes to mind is finding the most used variable
or function names in a file of source code.  There might be others, of
course.

Udyant Wig
-- 
We make our discoveries through our mistakes: we watch one another's
success: and where there is freedom to experiment there is hope to
improve.
                -- Arthur Quiller-Couch