From: Udyant Wig
Newsgroups: gmane.emacs.help
Subject: Re: Most used words in current buffer
Date: Wed, 18 Jul 2018 15:06:56 +0530
To: help-gnu-emacs@gnu.org
In-Reply-To: <861sc1iu1m.fsf@zoho.com>

On 07/18/2018 12:11 AM, Emanuel Berg wrote:
> Do it!
>
> But if you can let go of the Elisp requirement here are some examples
> how to do it with everyday GNU/Unix tools:
>
> https://unix.stackexchange.com/questions/41479/find-n-most-frequent-words-in-a-file

I went ahead and did it.  I obtained many solutions, in fact.  Only
today did I check the link above.

First, of the solutions in Emacs Lisp, this one came out as the
quickest:

---
(require 'cl-lib)

(defun buffer-most-used-words-1 (n)
  "Make a list of the N most used words in buffer."
  (let ((counts (make-hash-table :test #'equal))
        (words (split-string (buffer-string)))
        sorted-counts)
    ;; Tally each whitespace-separated word, ignoring case.
    (dolist (word words)
      (let ((key (downcase word)))
        (puthash key (1+ (gethash key counts 0)) counts)))
    ;; Collect (WORD COUNT) pairs, then sort by descending count.
    (cl-loop for word being the hash-keys of counts
             using (hash-values count)
             do (push (list word count) sorted-counts)
             finally (setf sorted-counts
                           (cl-sort sorted-counts #'> :key #'cl-second)))
    ;; Clamp N so a buffer with fewer distinct words does not signal an error.
    (mapcar #'cl-first
            (cl-subseq sorted-counts 0 (min n (length sorted-counts))))))
---

Briefly, it splits the buffer text into a list of strings, counts each
downcased string in a hash table, collects the words and their counts
into a list, sorts that list by descending count, and returns the
first N words.

(I had also written solutions (1) using alists; (2) using the handy AVL
tree library I found among the Emacs Lisp files in the Emacs
distribution; and (3) reading the words directly and hashing them.
None beat the above.)

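For comparison, the same steps (split on whitespace, count downcased
words in a hash table, sort by count, take the first N) can be sketched
outside Emacs.  What follows is a minimal Python illustration only, not
code from this thread; the function name most_used_words is
hypothetical, and words with equal counts come out in first-seen order,
which the Emacs Lisp version does not promise.

```python
from collections import Counter


def most_used_words(text, n):
    """Return the n most frequent whitespace-separated words in text."""
    # str.split() with no arguments splits on runs of whitespace.
    counts = Counter(word.lower() for word in text.split())
    # most_common() orders by descending count and simply returns every
    # word when n exceeds the number of distinct words.
    return [word for word, _ in counts.most_common(n)]


if __name__ == "__main__":
    sample = "the cat and The dog and THE bird"
    print(most_used_words(sample, 2))
```

Counter is a dict subclass, so this is the same hash-table approach; no
explicit clamping of N is needed because most_common() already stops at
the end of the list.
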
The function is suffixed with '-1' because it is the core of another,
interactive function, which takes the list generated above and displays
it nicely in another buffer.

I was curious about possible solutions in other languages.  I wrote
programs in both Common Lisp and Python, based on the same essential
hash table approach.  While a lot faster than the Emacs Lisp solution
above, they were left behind by this old Awk solution (also using
hashing) that I found in the classic /The Unix Programming Environment/
by Kernighan and Pike:

---
#!/bin/sh
# Count every whitespace-separated field, then list the ten most frequent.
# "sort -k2,2nr" is the POSIX spelling of the book's obsolete "sort +1 -nr".
awk '
{ for (i = 1; i <= NF; i++) num[$i]++ }
END { for (word in num) print word, num[word] }
' "$@" | sort -k2,2nr | head -n 10 | awk '{ print $1 }'
---

I appended the final awk stage so that only the words are printed,
without the counts.

I wrapped it up in an Emacs command to display the words in another
buffer, just like my original Emacs Lisp solution above.

Udyant Wig
-- 
We make our discoveries through our mistakes: we watch one another's
success: and where there is freedom to experiment there is hope to
improve.
                -- Arthur Quiller-Couch