From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!.POSTED!not-for-mail From: Udyant Wig Newsgroups: gmane.emacs.help Subject: Re: Most used words in current buffer Date: Sun, 22 Jul 2018 01:09:25 +0530 Organization: A noiseless patient Spider Message-ID: References: <861sc1iu1m.fsf@zoho.com> <87pnzkcgna.fsf@bsb.me.uk> NNTP-Posting-Host: blaine.gmane.org Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Trace: blaine.gmane.org 1532201900 10569 195.159.176.226 (21 Jul 2018 19:38:20 GMT) X-Complaints-To: usenet@blaine.gmane.org NNTP-Posting-Date: Sat, 21 Jul 2018 19:38:20 +0000 (UTC) Injection-Date: Sat, 21 Jul 2018 19:39:29 -0000 (UTC) User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.9.1 To: help-gnu-emacs@gnu.org Original-X-From: help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane.org@gnu.org Sat Jul 21 21:38:16 2018 Return-path: Envelope-to: geh-help-gnu-emacs@m.gmane.org Original-Received: from lists.gnu.org ([208.118.235.17]) by blaine.gmane.org with esmtp (Exim 4.84_2) (envelope-from ) id 1fgxhn-0002fc-T4 for geh-help-gnu-emacs@m.gmane.org; Sat, 21 Jul 2018 21:38:16 +0200 Original-Received: from localhost ([::1]:53763 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1fgxju-0004s2-LW for geh-help-gnu-emacs@m.gmane.org; Sat, 21 Jul 2018 15:40:26 -0400 X-Received: by 2002:adf:e784:: with SMTP id n4-v6mr639671wrm.14.1532201969386; Sat, 21 Jul 2018 12:39:29 -0700 (PDT) Original-Path: usenet.stanford.edu!o2-v6no5294456wmc.0!news-out.google.com!s23-v6ni32724wmc.0!nntp.google.com!feeder1.cambriumusenet.nl!feed.tweak.nl!144.76.165.73.MISMATCH!news.unit0.net!eternal-september.org!feeder.eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail Original-Newsgroups: gnu.emacs.help Original-Lines: 78 Original-Injection-Info: reader02.eternal-september.org; posting-host="babc73e8d2b6d397ab9897937f465675"; logging-data="18552"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/dTSk+NR9BuRyStDbGe9zT" Cancel-Lock: sha1:MqDnuf2f18JU7oY2hlybXkKcfEY= In-Reply-To: Openpgp: preference=signencrypt Content-Language: en-US Original-Xref: usenet.stanford.edu gnu.emacs.help:223414 X-BeenThere: help-gnu-emacs@gnu.org X-Mailman-Version: 2.1.21 Precedence: list List-Id: Users list for the GNU Emacs text editor List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane.org@gnu.org Original-Sender: "help-gnu-emacs" Xref: news.gmane.org gmane.emacs.help:117539 Archived-At: On 07/21/2018 11:52 PM, Stefan Monnier wrote: >> (defun buffer-most-used-words-2 (n) >> "Make a list of the N most used words in buffer." >> (let ((counts (avl-tree-create (lambda (wc1 wc2) >> (string< (first wc1) (first wc2))))) >> (words (split-string (buffer-string))) > > If you want to go fast, don't use split-string+buffer-string. Scan > through the buffer and extract each word with buffer-substring > directly. > >> (let ((element (avl-tree-member counts (list (downcase word) >>0)))) > > I'd use a hash-table (implemented in C) rather than an avl-tree > (implemented in Elisp). After spending (too) many hours on this, I believe that I have a better solution. --- (require 'cl-lib) ;; Can this hack be made better? (defun whitespace-p (char) (or (eq char 9) (eq char 10) (eq char 13)) (eq char 32)) (defun buffer-most-used-words-3 (n) "Make a list of the N most used words in buffer." (let ((counts (make-hash-table :test #'equal)) sorted-counts) (save-excursion (goto-char (point-min)) (cl-loop with word = nil with start = 0 with end = 0 with state = 'space with char = nil until (eobp) do (setf char (char-after)) (cond ((eq state 'space) (when (not (whitespace-p char)) (setf start (point) state 'word))) ((eq state 'word) (when (whitespace-p char) (setf end (point) state 'space word (buffer-substring start end)) (incf (gethash word counts 0))))) (forward-char))) (cl-loop for word being the hash-keys of counts using (hash-values count) do (push (list word count) sorted-counts) finally (setf sorted-counts (cl-sort sorted-counts #'> :key #'second))) (mapcar #'first (cl-subseq sorted-counts 0 n)))) --- In regard to performance, it is slightly better than BUFFER-MOST-USED-WORDS-1, which used a combination of SPLIT-STRING on BUFFER-STRING along with a hash-table. Here are timings over ten runs each for a 4.5 MB text file: buffer-most-used-words-1: 4.7362510517 seconds buffer-most-used-words-3: 4.4849896529 seconds > Stefan Udyant Wig -- We make our discoveries through our mistakes: we watch one another's success: and where there is freedom to experiment there is hope to improve. -- Arthur Quiller-Couch