From: Bob Proulx
Newsgroups: gmane.emacs.help
Subject: Re: Most used words in current buffer
Date: Wed, 18 Jul 2018 18:45:36 -0600
Message-ID: <20180718183827948731893@bob.proulx.com>
References: <861sc1iu1m.fsf@zoho.com> <87pnzkcgna.fsf@bsb.me.uk>
In-Reply-To: <87pnzkcgna.fsf@bsb.me.uk>
To: help-gnu-emacs@gnu.org

Ben Bacarisse wrote:
> Udyant Wig writes:
> > they were left behind by this old Awk solution (also using hashing) I

Not wanting to be too annoying but I see no hashing in the awk
solution.  It is using an awk associative array to store the words.
Perl and Python call those "hashes" but they are just associative
arrays.

> > found in the classic /The Unix Programming Environment/ by Kernighan and
> > Pike:
> >...
> >   awk ' { for (i = 1; i <= NF; i++) num[$i]++ }
> >         END { for (word in num) print word, num[word] }
> >   ' $* | sort +1 -nr | head -10 | awk '{ print $1 }'
> >
> > I appended the last awk pipeline to only give the words without the
> > counts.
>
> The Unix command cut does this task.  Nothing wrong with using another
> awk, but I often feel sorry for poor old cut.  It's been around for
> decades, and yet is so very often overlooked!  Mind you, it uses TABs to
> delimit fields by default, so maybe it only has itself to blame.

I will continue to be contrary here and say that awk does a much
better job of cutting by whitespace-separated fields than does cut.
Both are standard and should be available everywhere.  And here,
because awk is already in use, I expect it to be somewhat more
efficient to use awk again in the pipeline than to use a different
program.

I also wish to improve the command line somewhat.  Using $* by itself
does not preserve program arguments that contain whitespace; the shell
splits them into separate words.  One should use "$@" for that
purpose.  Also the old forms of sort and head (sort +1, head -10) are
obsolete and would be better left behind in favor of the portable -k
and -n options.  Let me suggest ending the command with:

  ' "$@" | sort -k2,2nr | head -n10 | awk '{ print $1 }'

Bob
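
For reference, here is a minimal sketch of the whole command assembled with those changes and wrapped in a shell function so it can be tried on file names containing spaces.  The function name topwords is made up for illustration and does not appear in the thread or the book; the awk program is the one quoted above.

```shell
#!/bin/sh
# topwords: hypothetical wrapper name, not from the original thread.
# Prints up to ten of the most frequent whitespace-separated words
# found in the files given as arguments, one word per line.
topwords() {
    awk '
        # Count every field of every input line.
        { for (i = 1; i <= NF; i++) num[$i]++ }
        # Emit "word count" pairs; the order here is unspecified.
        END { for (word in num) print word, num[word] }
    ' "$@" |                  # "$@" keeps each file name as one argument
    sort -k2,2nr |            # numeric, descending, on the count field
    head -n 10 |              # keep the first ten lines
    awk '{ print $1 }'        # drop the counts
}
```

Called as topwords "my notes.txt", the file name reaches awk as a single argument.  With an unquoted $* it would be split into "my" and "notes.txt".  Words with equal counts may come out in any order.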