From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!.POSTED!not-for-mail From: Bob Proulx Newsgroups: gmane.emacs.help Subject: Re: Most used words in current buffer Date: Thu, 19 Jul 2018 01:04:38 -0600 Message-ID: <20180719000330488732477@bob.proulx.com> References: <861sc1iu1m.fsf@zoho.com> <87pnzkcgna.fsf@bsb.me.uk> NNTP-Posting-Host: blaine.gmane.org Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Trace: blaine.gmane.org 1531983798 1527 195.159.176.226 (19 Jul 2018 07:03:18 GMT) X-Complaints-To: usenet@blaine.gmane.org NNTP-Posting-Date: Thu, 19 Jul 2018 07:03:18 +0000 (UTC) User-Agent: Mutt/1.10.0 (2018-05-17) To: help-gnu-emacs@gnu.org Original-X-From: help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane.org@gnu.org Thu Jul 19 09:03:14 2018 Return-path: Envelope-to: geh-help-gnu-emacs@m.gmane.org Original-Received: from lists.gnu.org ([208.118.235.17]) by blaine.gmane.org with esmtp (Exim 4.84_2) (envelope-from ) id 1fg2xy-0000Cm-UH for geh-help-gnu-emacs@m.gmane.org; Thu, 19 Jul 2018 09:03:11 +0200 Original-Received: from localhost ([::1]:40129 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1fg304-0002tg-Br for geh-help-gnu-emacs@m.gmane.org; Thu, 19 Jul 2018 03:05:20 -0400 Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:44786) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1fg2zT-0002tU-5w for help-gnu-emacs@gnu.org; Thu, 19 Jul 2018 03:04:44 -0400 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1fg2zP-0007Ik-VH for help-gnu-emacs@gnu.org; Thu, 19 Jul 2018 03:04:43 -0400 Original-Received: from havoc.proulx.com ([96.88.95.61]:45904) by eggs.gnu.org with esmtps (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32) (Exim 4.71) (envelope-from ) id 1fg2zP-0007II-Mx for help-gnu-emacs@gnu.org; Thu, 19 Jul 2018 03:04:39 -0400 Original-Received: from joseki.proulx.com (localhost [127.0.0.1]) by havoc.proulx.com (Postfix) with ESMTP id B1FB62390 for ; Thu, 19 Jul 2018 01:04:38 -0600 (MDT) Original-Received: from hysteria.proulx.com (hysteria.proulx.com [192.168.230.119]) by joseki.proulx.com (Postfix) with ESMTP id 7CA6D217DF for ; Thu, 19 Jul 2018 01:04:38 -0600 (MDT) Original-Received: by hysteria.proulx.com (Postfix, from userid 1000) id 648D62DC71; Thu, 19 Jul 2018 01:04:38 -0600 (MDT) Mail-Followup-To: help-gnu-emacs@gnu.org Content-Disposition: inline In-Reply-To: X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic] [fuzzy] X-Received-From: 96.88.95.61 X-BeenThere: help-gnu-emacs@gnu.org X-Mailman-Version: 2.1.21 Precedence: list List-Id: Users list for the GNU Emacs text editor List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane.org@gnu.org Original-Sender: "help-gnu-emacs" Xref: news.gmane.org gmane.emacs.help:117493 Archived-At: Udyant Wig wrote: > Bob Proulx wrote: > > Not wanting to be too annoying but I see no hashing in the awk > > solution. It is using an awk associative array to store the words. > > Perl and Pything call those "hashes" but they are just associative > > arrays. > > I understand that associative arrays in awk are built upon hashing. > Kernighan and Pike say Hmm... I think looking behind the abstract data type at how it might or might not be implemented is a stretch to say the least. The entire reason it is an abstract data type is to hide those types of implementation details. :-) > However, on the previous page, in introducing the language construct, > they do take the name _associative array_. The naming of things usually says more about the person that named the thing than the thing itself. Associative arrays is a naming that reflects the concepts involved in what it does. This is the same as when someone else names it a map table, or a dictionary. Those are all the same thing. Just using different names because people took different paths to get there. > The implementation of associative memory uses a hashing scheme to > ensure that access to any element takes about the same time as to any > other, and that (at least for moderate array sizes) the time doesn't > depend on how many elements are in the array. Hash tables are attractive because in their perfect form they are O(1) order of the function being a fixed amount for any lookup. Being a fixed amount sounds very good and therefore authors often implement using hash tables. However achieving that perfect O(1) performance requires knowing the size of the needed hash table at creation time so that it can be sized correctly from the start and that is not the case here. Therefore when implemented as a hash table the table will need to be grown at various intervals as the number of entries contained is increased. That growth is lumped and consumes time and resources for the growth at the time the table is grown. For such things I generally prefer balanced tree structures because work is amortized instead of lumped. But the important point here is that for every algorithm + data structure there is a trade-off of some sort between one thing and another thing. It was in the Perl community that these became known as "hashes" due to the implementation. But I think it is better to keep the ADT (abstract data type) abstract. > > ' "$@" | sort -k2,2nr | head -n10 | awk '{ print $1 }' > > Thank you for the portable pipeline. It is interesting to compare it > with the pipeline given in the book: > > $ wordfreq ch4.* | sort +1 -nr | sed 20q | 4 > > where > > wordfreq is the awk script proper, > 4 is a shell script that prints its input in 4 columns, > and sed 20q does the equivalent of head -20. > > On the last point, they say that given the ease of typing a sed command, > they felt no need to write the program head. I am in total agreement over using sed instead of head if you want to do that. Seeing 'sed 20q' should roll off the keyboard as print lines until line 20 and then quit. Very simple and to the point. There is definitely no need for a separate head command. Other than for symmetry with tail which is not as simple in sed. I didn't go look up the reference in the book before posting my first reply. I probably should have refreshed my knowledge of it. Then I would have realized that we weren't talking about literal hashing of tokens (like what is done in SpamAssassin's Bayes engine for one example) but rather use of "hash" as a table lookup method. That caused me to say that first. Sorry. As to the old style command line options for head and sort I was just wanting to keep in people's minds that since a decade ago or so the new format of the command line options is now the preferred one. That old style is certainly how we wrote those things back in the day though. Bob