From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!.POSTED!not-for-mail
From: Udyant Wig <udyantw@gmail.com>
Newsgroups: gmane.emacs.help
Subject: Re: Most used words in current buffer
Date: Wed, 18 Jul 2018 15:06:56 +0530
Organization: A noiseless patient Spider
Message-ID: <pin1no$l8e$1@dont-email.me>
References: <pikcs5$6sm$1@dont-email.me> <861sc1iu1m.fsf@zoho.com>
NNTP-Posting-Host: blaine.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-Trace: blaine.gmane.org 1531906850 3441 195.159.176.226 (18 Jul 2018 09:40:50 GMT)
X-Complaints-To: usenet@blaine.gmane.org
NNTP-Posting-Date: Wed, 18 Jul 2018 09:40:50 +0000 (UTC)
Injection-Date: Wed, 18 Jul 2018 09:36:56 -0000 (UTC)
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
	Thunderbird/52.9.1
To: help-gnu-emacs@gnu.org
Original-X-From: help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane.org@gnu.org Wed Jul 18 11:40:46 2018
Return-path: <help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane.org@gnu.org>
Envelope-to: geh-help-gnu-emacs@m.gmane.org
Original-Received: from lists.gnu.org ([208.118.235.17])
	by blaine.gmane.org with esmtp (Exim 4.84_2)
	(envelope-from <help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane.org@gnu.org>)
	id 1ffiwt-0000lc-5p
	for geh-help-gnu-emacs@m.gmane.org; Wed, 18 Jul 2018 11:40:43 +0200
Original-Received: from localhost ([::1]:35614 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane.org@gnu.org>)
	id 1ffiyz-0006ww-Pf
	for geh-help-gnu-emacs@m.gmane.org; Wed, 18 Jul 2018 05:42:53 -0400
X-Received: by 2002:a5d:4847:: with SMTP id n7-v6mr496276wrs.21.1531906616651; 
	Wed, 18 Jul 2018 02:36:56 -0700 (PDT)
Original-Path: usenet.stanford.edu!o2-v6no2644011wmc.0!news-out.google.com!o12-v6ni8214wmc.0!nntp.google.com!proxad.net!feeder1-2.proxad.net!feeder.erje.net!2.eu.feeder.erje.net!eternal-september.org!feeder.eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
Original-Newsgroups: gnu.emacs.help
Original-Lines: 69
Original-Injection-Info: h2725194.stratoserver.net;
	posting-host="b9f8d0ec1dfc655117edae0c20eb30ad"; 
	logging-data="21774"; mail-complaints-to="abuse@eternal-september.org";
	posting-account="U2FsdGVkX19NzWCPx95enwNFuyl7GAnN"
Cancel-Lock: sha1:nGcuzLeAabXJuC78VesZw3foVpg=
In-Reply-To: <861sc1iu1m.fsf@zoho.com>
Openpgp: preference=signencrypt
Content-Language: en-US
Original-Xref: usenet.stanford.edu gnu.emacs.help:223354
X-BeenThere: help-gnu-emacs@gnu.org
X-Mailman-Version: 2.1.21
Precedence: list
List-Id: Users list for the GNU Emacs text editor <help-gnu-emacs.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/help-gnu-emacs>,
	<mailto:help-gnu-emacs-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/help-gnu-emacs/>
List-Post: <mailto:help-gnu-emacs@gnu.org>
List-Help: <mailto:help-gnu-emacs-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/help-gnu-emacs>,
	<mailto:help-gnu-emacs-request@gnu.org?subject=subscribe>
Errors-To: help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane.org@gnu.org
Original-Sender: "help-gnu-emacs"
	<help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane.org@gnu.org>
Xref: news.gmane.org gmane.emacs.help:117479
Archived-At: <http://permalink.gmane.org/gmane.emacs.help/117479>

On 07/18/2018 12:11 AM, Emanuel Berg wrote:
> Do it!
>
> But if you can let go of the Elisp requirement here are some examples
> how to do it with everyday GNU/Unix tools:
>
>
> https://unix.stackexchange.com/questions/41479/find-n-most-frequent-words-in-a-file

I went ahead and did it.  I obtained many solutions, in fact.  Only
today did I check the link above.

First, of the solutions in Emacs Lisp, this one came out as the
quickest:

---
(require 'cl-lib)

(defun buffer-most-used-words-1 (n)
  "Return a list of the N most used words in the current buffer."
  (let ((counts (make-hash-table :test #'equal))
        (words (split-string (buffer-string)))
        sorted-counts)
    ;; Count each word case-insensitively.
    (dolist (word words)
      (let ((key (downcase word)))
        (puthash key (1+ (gethash key counts 0)) counts)))
    ;; Collect (WORD COUNT) pairs, then sort them by descending count.
    (cl-loop for word being the hash-keys of counts
             using (hash-values count)
             do (push (list word count) sorted-counts)
             finally (setf sorted-counts
                           (cl-sort sorted-counts #'> :key #'cl-second)))
    ;; Clamp N so that a buffer with fewer distinct words is not an error.
    (mapcar #'cl-first
            (cl-subseq sorted-counts 0 (min n (length sorted-counts))))))
---

Briefly, it splits the buffer text on whitespace into a list of strings,
counts each downcased string in a hash table, collects the words and
their counts into a list, sorts that list by descending count, and
returns the first N words.  (I had also written solutions (1) using
alists; (2) using the handy AVL tree library I found among the Emacs
Lisp files in the Emacs distribution; and (3) reading the words directly
from the buffer and hashing them.  None beat the above.)
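
For illustration, variant (3), reading the words directly from the
buffer and hashing them, might look roughly like the sketch below.  It
is not the code that was timed; the function name is made up, and
`seq-take' assumes Emacs 25 or later.

```elisp
(require 'seq)

;; Hypothetical name; a sketch of the direct-reading idea only.
(defun my-most-used-words-direct (n)
  "Return the N most used words in the current buffer.
Words are read with `re-search-forward' instead of `split-string'."
  (let ((counts (make-hash-table :test #'equal))
        pairs)
    (save-excursion
      (goto-char (point-min))
      ;; A word is a maximal run of non-whitespace characters, which
      ;; matches what `split-string' yields with its default separators.
      (while (re-search-forward "[^ \f\t\n\r\v]+" nil t)
        (let ((key (downcase (match-string-no-properties 0))))
          (puthash key (1+ (gethash key counts 0)) counts))))
    ;; Turn the table into (WORD . COUNT) pairs, most frequent first.
    (maphash (lambda (word count) (push (cons word count) pairs)) counts)
    (setq pairs (sort pairs (lambda (a b) (> (cdr a) (cdr b)))))
    ;; `seq-take' returns the whole list when N exceeds its length.
    (mapcar #'car (seq-take pairs n))))
```

Called in a buffer, (my-most-used-words-direct 10) returns up to ten
downcased words; the order among words with equal counts is unspecified.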

The function is suffixed with '-1' because it is the core of another,
interactive function, which takes the list generated above and displays
it nicely in another buffer.
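
The interactive function itself is not shown here.  A minimal sketch of
such a wrapper, with made-up names and a made-up buffer name, could be:

```elisp
;; Hypothetical names throughout; only `buffer-most-used-words-1'
;; comes from the post.
(defun my-display-word-list (words)
  "Show WORDS, one per line, in the buffer *Most used words*."
  (with-output-to-temp-buffer "*Most used words*"
    (dolist (word words)
      (princ word)
      (terpri))))

(defun my-buffer-most-used-words (n)
  "Display the N most used words of the current buffer."
  ;; The \"n\" code reads a number from the minibuffer.
  (interactive "nNumber of words: ")
  (my-display-word-list (buffer-most-used-words-1 n)))
```

With both definitions loaded, M-x my-buffer-most-used-words prompts for
a number and pops up the list.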

I was curious about possible solutions in other languages.  I wrote
programs in both Common Lisp and Python, based on essentially the same
hash table approach.  While a lot faster than the Emacs Lisp solution
above, they were left behind by this old Awk solution (also using
hashing) I found in the classic /The Unix Programming Environment/ by
Kernighan and Pike:

---
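#!/bin/sh

# Count every whitespace-separated field, then print "word count" pairs,
# sort them numerically by count (descending), and keep the top ten words.
awk '    { for (i = 1; i <= NF; i++) num[$i]++ }
END      { for (word in num) print word, num[word] }
' "$@" | sort -k2,2nr | head -10 | awk '{ print $1 }'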
---

I appended the final awk stage to the pipeline so that it prints only
the words, without the counts.  I wrapped the script in an Emacs command
that displays the words in another buffer, just like my original Emacs
Lisp solution above.
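
A rough sketch of what such a command might look like follows.  The
script name, variable name and buffer name are made up, and the script
is assumed to be on the PATH and to read standard input when given no
file arguments:

```elisp
;; Hypothetical variable; set it to wherever the script is installed.
(defvar my-most-used-words-command "most-used-words.sh"
  "Shell command that reads text on stdin and prints one word per line.")

;; Hypothetical command name.
(defun my-buffer-most-used-words-awk ()
  "Pipe the current buffer through `my-most-used-words-command'.
The output is placed in the buffer *Most used words*."
  (interactive)
  (shell-command-on-region (point-min) (point-max)
                           my-most-used-words-command
                           "*Most used words*"))
```

`shell-command-on-region' sends the buffer text to the command's
standard input, so the buffer does not have to be visiting a file.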

Udyant Wig
-- 
We make our discoveries through our mistakes: we watch one another's
success: and where there is freedom to experiment there is hope to
improve.
                                -- Arthur Quiller-Couch