* Most used words in current buffer @ 2018-07-17 9:28 Udyant Wig 2018-07-17 18:41 ` Emanuel Berg 0 siblings, 1 reply; 47+ messages in thread From: Udyant Wig @ 2018-07-17 9:28 UTC (permalink / raw) To: help-gnu-emacs How might I obtain a list of the most frequently used words in the current buffer? I have an inkling of an Emacs Lisp solution, but I thought to ask here before writing it. Udyant Wig -- We make our discoveries through our mistakes: we watch one another's success: and where there is freedom to experiment there is hope to improve. -- Arthur Quiller-Couch ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Most used words in current buffer 2018-07-17 9:28 Most used words in current buffer Udyant Wig @ 2018-07-17 18:41 ` Emanuel Berg 2018-07-18 9:36 ` Udyant Wig 0 siblings, 1 reply; 47+ messages in thread From: Emanuel Berg @ 2018-07-17 18:41 UTC (permalink / raw) To: help-gnu-emacs Udyant Wig wrote: > How might I obtain a list of the most > frequently used words in the current buffer? > > I have an inkling of an Emacs Lisp solution, > but I thought to ask here before writing it. Do it! But if you can let go of the Elisp requirement here are some examples how to do it with everyday GNU/Unix tools: https://unix.stackexchange.com/questions/41479/find-n-most-frequent-words-in-a-file -- underground experts united http://user.it.uu.se/~embe8573 ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Most used words in current buffer 2018-07-17 18:41 ` Emanuel Berg @ 2018-07-18 9:36 ` Udyant Wig 2018-07-18 11:48 ` Emanuel Berg 2018-07-18 22:39 ` Ben Bacarisse 0 siblings, 2 replies; 47+ messages in thread From: Udyant Wig @ 2018-07-18 9:36 UTC (permalink / raw) To: help-gnu-emacs On 07/18/2018 12:11 AM, Emanuel Berg wrote: > Do it! > > But if you can let go of the Elisp requirement here are some examples > how to do it with everyday GNU/Unix tools: > > https://unix.stackexchange.com/questions/41479/find-n-most-frequent-words-in-a-file I went ahead and did it. I obtained many solutions, in fact. Only today did I check the link above. First, of the solutions in Emacs Lisp, this one came out as the quickest: --- (defun buffer-most-used-words-1 (n) "Make a list of the N most used words in buffer." (let ((counts (make-hash-table :test #'equal)) (words (split-string (buffer-string))) sorted-counts) (dolist (word words) (let ((count (gethash (downcase word) counts 0))) (puthash (downcase word) (1+ count) counts))) (loop for word being the hash-keys of counts using (hash-values count) do (push (list word count) sorted-counts) finally (setf sorted-counts (cl-sort sorted-counts #'> :key #'second))) (mapcar #'first (cl-subseq sorted-counts 0 n)))) --- Briefly, it obtains a list of the strings in the buffer, hashes them, puts the words and their counts in a list, sorts it, and lists the first N words. (I had also written solutions (1) using alists; (2) using the handy AVL tree library I found among the Emacs Lisp files in the Emacs distribution; and (3) reading the words directly and hashing them. None beat the above.) The function is suffixed with '-1' because it is the the core of another, interactive function, which takes the above generated list and displays it nicely in another buffer. I was curious about possible solutions in other languages. I wrote programs in both Common Lisp and Python, based on the essential hash table approach. While a lot faster than the Emacs Lisp solution above, they were left behind by this old Awk solution (also using hashing) I found in the classic /The Unix Programming Environment/ by Kernighan and Pike: --- #!/bin/sh awk ' { for (i = 1; i <= NF; i++) num[$i]++ } END { for (word in num) print word, num[word] } ' $* | sort +1 -nr | head -10 | awk '{ print $1 }' --- I appended the last awk pipeline to only give the words without the counts. I wrapped it up in an Emacs command to display the words in another buffer, just like my original Emacs Lisp solution above. Udyant Wig -- We make our discoveries through our mistakes: we watch one another's success: and where there is freedom to experiment there is hope to improve. -- Arthur Quiller-Couch ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Most used words in current buffer 2018-07-18 9:36 ` Udyant Wig @ 2018-07-18 11:48 ` Emanuel Berg 2018-07-18 14:50 ` Udyant Wig 2018-07-18 22:39 ` Ben Bacarisse 1 sibling, 1 reply; 47+ messages in thread From: Emanuel Berg @ 2018-07-18 11:48 UTC (permalink / raw) To: help-gnu-emacs Udyant Wig wrote: > (defun buffer-most-used-words-1 (n) > "Make a list of the N most used words in buffer." > (let ((counts (make-hash-table :test #'equal)) > (words (split-string (buffer-string))) > sorted-counts) > (dolist (word words) > (let ((count (gethash (downcase word) counts 0))) > (puthash (downcase word) (1+ count) counts))) > (loop for word being the hash-keys of counts > using (hash-values count) > do > (push (list word count) sorted-counts) > finally (setf sorted-counts (cl-sort sorted-counts #'> > :key #'second))) > (mapcar #'first (cl-subseq sorted-counts 0 n)))) Great! Only (require 'cl-lib) loop -> cl-loop first -> cl-first second -> cl-second Perhaps this should be published in M/ELPA or at the very least on gnu.emacs.sources :) -- underground experts united http://user.it.uu.se/~embe8573 ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Most used words in current buffer 2018-07-18 11:48 ` Emanuel Berg @ 2018-07-18 14:50 ` Udyant Wig 2018-07-18 16:32 ` Emanuel Berg 0 siblings, 1 reply; 47+ messages in thread From: Udyant Wig @ 2018-07-18 14:50 UTC (permalink / raw) To: help-gnu-emacs On 07/18/2018 05:18 PM, Emanuel Berg wrote: > Great! Only > > (require 'cl-lib) > loop -> cl-loop > first -> cl-first > second -> cl-second Ah, yes. Thank you. I have included the fixes in the following. --- (require 'cl-lib) (defun buffer-most-used-words-1 (n) "Make a list of the N most used words in buffer." (let ((counts (make-hash-table :test #'equal)) (words (split-string (buffer-string))) sorted-counts) (dolist (word words) (let ((count (gethash (downcase word) counts 0))) (puthash (downcase word) (1+ count) counts))) (cl-loop for word being the hash-keys of counts using (hash-values count) do (push (list word count) sorted-counts) finally (setf sorted-counts (cl-sort sorted-counts #'> :key #'cl-second))) (mapcar #'cl-first (cl-subseq sorted-counts 0 n)))) --- > Perhaps this should be published in M/ELPA or at the very least on > gnu.emacs.sources :) Maybe. I hope this is useful to others also. Udyant Wig -- We make our discoveries through our mistakes: we watch one another's success: and where there is freedom to experiment there is hope to improve. -- Arthur Quiller-Couch ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Most used words in current buffer 2018-07-18 14:50 ` Udyant Wig @ 2018-07-18 16:32 ` Emanuel Berg 0 siblings, 0 replies; 47+ messages in thread From: Emanuel Berg @ 2018-07-18 16:32 UTC (permalink / raw) To: help-gnu-emacs Udyant Wig wrote: > I hope this is useful to others also. Definitely! Be sure to login to the SX site and post the "Elisp solution" to the problem as well! Or if someone else who is frequenting that site feels like doing it for you/us. I don't see why Emacs is any less a GNU/Unix tool than say sed or (g)awk! Because actually it is a much better tool :) -- underground experts united http://user.it.uu.se/~embe8573 ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Most used words in current buffer 2018-07-18 9:36 ` Udyant Wig 2018-07-18 11:48 ` Emanuel Berg @ 2018-07-18 22:39 ` Ben Bacarisse 2018-07-19 0:45 ` Bob Proulx [not found] ` <mailman.3785.1531961144.1292.help-gnu-emacs@gnu.org> 1 sibling, 2 replies; 47+ messages in thread From: Ben Bacarisse @ 2018-07-18 22:39 UTC (permalink / raw) To: help-gnu-emacs Udyant Wig <udyantw@gmail.com> writes: <snip> > they were left behind by this old Awk solution (also using hashing) I > found in the classic /The Unix Programming Environment/ by Kernighan and > Pike: > > --- > #!/bin/sh > > awk ' { for (i = 1; i <= NF; i++) num[$i]++ } > END { for (word in num) print word, num[word] } > ' $* | sort +1 -nr | head -10 | awk '{ print $1 }' > --- > > I appended the last awk pipeline to only give the words without the > counts. The Unix command cut does this task. Nothing wrong with using another awk, but I often feel sorry for poor old cut. It's been around for decades, and yet is so very often overlooked! Mind you, it uses TABs to delimit fields by default, so maybe it only has itself to blame. -- Ben. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Most used words in current buffer 2018-07-18 22:39 ` Ben Bacarisse @ 2018-07-19 0:45 ` Bob Proulx [not found] ` <mailman.3785.1531961144.1292.help-gnu-emacs@gnu.org> 1 sibling, 0 replies; 47+ messages in thread From: Bob Proulx @ 2018-07-19 0:45 UTC (permalink / raw) To: help-gnu-emacs Ben Bacarisse wrote: > Udyant Wig writes: > > they were left behind by this old Awk solution (also using hashing) I Not wanting to be too annoying but I see no hashing in the awk solution. It is using an awk associative array to store the words. Perl and Pything call those "hashes" but they are just associative arrays. > > found in the classic /The Unix Programming Environment/ by Kernighan and > > Pike: > >... > > awk ' { for (i = 1; i <= NF; i++) num[$i]++ } > > END { for (word in num) print word, num[word] } > > ' $* | sort +1 -nr | head -10 | awk '{ print $1 }' > > > > I appended the last awk pipeline to only give the words without the > > counts. > > The Unix command cut does this task. Nothing wrong with using another > awk, but I often feel sorry for poor old cut. It's been around for > decades, and yet is so very often overlooked! Mind you, it uses TABs to > delimit fields by default, so maybe it only has itself to blame. I will continue to be contrary here and say that awk does a much better job of cutting by whitespace separated fields than does cut. Both are standard and should be available everywhere. And here because awk is already in use I expect it to be somewhat more efficient to use awk again in the pipeline than to use a different program. I also wish to improve the command line somewhat. Using $* by itself does not sufficiently quote program arguments with whitespace. One should use "$@" for that purpose. Also the old forms of sort and head would be better left behind and use the new portable option set for them instead. Let me suggest: ' "$@" | sort -k2,2nr | head -n10 | awk '{ print $1 }' Bob ^ permalink raw reply [flat|nested] 47+ messages in thread
[parent not found: <mailman.3785.1531961144.1292.help-gnu-emacs@gnu.org>]
* Re: Most used words in current buffer [not found] ` <mailman.3785.1531961144.1292.help-gnu-emacs@gnu.org> @ 2018-07-19 5:33 ` Udyant Wig 2018-07-19 7:04 ` Bob Proulx [not found] ` <mailman.3796.1531983885.1292.help-gnu-emacs@gnu.org> 0 siblings, 2 replies; 47+ messages in thread From: Udyant Wig @ 2018-07-19 5:33 UTC (permalink / raw) To: help-gnu-emacs On 07/19/2018 06:15 AM, Bob Proulx wrote: > Not wanting to be too annoying but I see no hashing in the awk > solution. It is using an awk associative array to store the words. > Perl and Pything call those "hashes" but they are just associative > arrays. I understand that associative arrays in awk are built upon hashing. Kernighan and Pike say The implementation of associative memory uses a hashing scheme to ensure that access to any element takes about the same time as to any other, and that (at least for moderate array sizes) the time doesn't depend on how many elements are in the array. However, on the previous page, in introducing the language construct, they do take the name _associative array_. > I will continue to be contrary here and say that awk does a much > better job of cutting by whitespace separated fields than does cut. > Both are standard and should be available everywhere. And here > because awk is already in use I expect it to be somewhat more > efficient to use awk again in the pipeline than to use a different > program. > > I also wish to improve the command line somewhat. Using $* by itself > does not sufficiently quote program arguments with whitespace. One > should use "$@" for that purpose. Also the old forms of sort and head > would be better left behind and use the new portable option set for > them instead. Let me suggest: > > ' "$@" | sort -k2,2nr | head -n10 | awk '{ print $1 }' > > Bob Thank you for the portable pipeline. It is interesting to compare it with the pipeline given in the book: $ wordfreq ch4.* | sort +1 -nr | sed 20q | 4 where wordfreq is the awk script proper, 4 is a shell script that prints its input in 4 columns, and sed 20q does the equivalent of head -20. On the last point, they say that given the ease of typing a sed command, they felt no need to write the program head. Udyant Wig -- We make our discoveries through our mistakes: we watch one another's success: and where there is freedom to experiment there is hope to improve. -- Arthur Quiller-Couch ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Most used words in current buffer 2018-07-19 5:33 ` Udyant Wig @ 2018-07-19 7:04 ` Bob Proulx 2018-07-19 7:25 ` tomas ` (2 more replies) [not found] ` <mailman.3796.1531983885.1292.help-gnu-emacs@gnu.org> 1 sibling, 3 replies; 47+ messages in thread From: Bob Proulx @ 2018-07-19 7:04 UTC (permalink / raw) To: help-gnu-emacs Udyant Wig wrote: > Bob Proulx wrote: > > Not wanting to be too annoying but I see no hashing in the awk > > solution. It is using an awk associative array to store the words. > > Perl and Pything call those "hashes" but they are just associative > > arrays. > > I understand that associative arrays in awk are built upon hashing. > Kernighan and Pike say Hmm... I think looking behind the abstract data type at how it might or might not be implemented is a stretch to say the least. The entire reason it is an abstract data type is to hide those types of implementation details. :-) > However, on the previous page, in introducing the language construct, > they do take the name _associative array_. The naming of things usually says more about the person that named the thing than the thing itself. Associative arrays is a naming that reflects the concepts involved in what it does. This is the same as when someone else names it a map table, or a dictionary. Those are all the same thing. Just using different names because people took different paths to get there. > The implementation of associative memory uses a hashing scheme to > ensure that access to any element takes about the same time as to any > other, and that (at least for moderate array sizes) the time doesn't > depend on how many elements are in the array. Hash tables are attractive because in their perfect form they are O(1) order of the function being a fixed amount for any lookup. Being a fixed amount sounds very good and therefore authors often implement using hash tables. However achieving that perfect O(1) performance requires knowing the size of the needed hash table at creation time so that it can be sized correctly from the start and that is not the case here. Therefore when implemented as a hash table the table will need to be grown at various intervals as the number of entries contained is increased. That growth is lumped and consumes time and resources for the growth at the time the table is grown. For such things I generally prefer balanced tree structures because work is amortized instead of lumped. But the important point here is that for every algorithm + data structure there is a trade-off of some sort between one thing and another thing. It was in the Perl community that these became known as "hashes" due to the implementation. But I think it is better to keep the ADT (abstract data type) abstract. > > ' "$@" | sort -k2,2nr | head -n10 | awk '{ print $1 }' > > Thank you for the portable pipeline. It is interesting to compare it > with the pipeline given in the book: > > $ wordfreq ch4.* | sort +1 -nr | sed 20q | 4 > > where > > wordfreq is the awk script proper, > 4 is a shell script that prints its input in 4 columns, > and sed 20q does the equivalent of head -20. > > On the last point, they say that given the ease of typing a sed command, > they felt no need to write the program head. I am in total agreement over using sed instead of head if you want to do that. Seeing 'sed 20q' should roll off the keyboard as print lines until line 20 and then quit. Very simple and to the point. There is definitely no need for a separate head command. Other than for symmetry with tail which is not as simple in sed. I didn't go look up the reference in the book before posting my first reply. I probably should have refreshed my knowledge of it. Then I would have realized that we weren't talking about literal hashing of tokens (like what is done in SpamAssassin's Bayes engine for one example) but rather use of "hash" as a table lookup method. That caused me to say that first. Sorry. As to the old style command line options for head and sort I was just wanting to keep in people's minds that since a decade ago or so the new format of the command line options is now the preferred one. That old style is certainly how we wrote those things back in the day though. Bob ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Most used words in current buffer 2018-07-19 7:04 ` Bob Proulx @ 2018-07-19 7:25 ` tomas 2018-07-19 17:19 ` Nick Dokos [not found] ` <mailman.3826.1532020800.1292.help-gnu-emacs@gnu.org> 2 siblings, 0 replies; 47+ messages in thread From: tomas @ 2018-07-19 7:25 UTC (permalink / raw) To: help-gnu-emacs -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Thu, Jul 19, 2018 at 01:04:38AM -0600, Bob Proulx wrote: [...] > For such things I generally prefer balanced tree structures because > work is amortized instead of lumped. But the important point here is > that for every algorithm + data structure there is a trade-off of some > sort between one thing and another thing. It all depends, of course... > It was in the Perl community that these became known as "hashes" due > to the implementation. But I think it is better to keep the ADT > (abstract data type) abstract. Ain't (human) language beautiful? Like "crane" (the lifting device), which is a metaphor by analogy to the bird (and which is preserved across other West European languages with quite different names for both!). Or like those brand-things (Scotch/Sello/Tesa for adhesive tape, for example), which would be the same pattern you decry above (implementation detail as a substitute for the principle, or pars pro toto). No wonder that Perl is a linguist's child :-) You are right: it *is* important to distinguish those layers, as a mathematician/technician. But then you are wrong [1] too: we are humans and our languages are exquisitely messy. Cheers [1] Sorry for the extremely shortened form: it is meant as an expression of subjective feeling. I don't think I am in the real position to decide whether you are right or wrong. That would be presumptuous indeed. - -- tomás -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (GNU/Linux) iEYEARECAAYFAltQPNwACgkQBcgs9XrR2kYfhQCePvfIuPJTJYaIkM+5964SZBGI jjwAnjtxsz1GLy1MjcrD8EetvqpTY4L/ =y+xj -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Most used words in current buffer 2018-07-19 7:04 ` Bob Proulx 2018-07-19 7:25 ` tomas @ 2018-07-19 17:19 ` Nick Dokos 2018-07-19 17:30 ` Eli Zaretskii ` (2 more replies) [not found] ` <mailman.3826.1532020800.1292.help-gnu-emacs@gnu.org> 2 siblings, 3 replies; 47+ messages in thread From: Nick Dokos @ 2018-07-19 17:19 UTC (permalink / raw) To: help-gnu-emacs Bob Proulx <bob@proulx.com> writes: > > I am in total agreement over using sed instead of head if you want to > do that. Seeing 'sed 20q' should roll off the keyboard as print lines > until line 20 and then quit. Very simple and to the point. There is > definitely no need for a separate head command. Other than for > symmetry with tail which is not as simple in sed. > IIRC, Kernighan & Pike say in the "Unix Programming Environment" that there *was* a `head' program, in addition to the `tail' program. It fell into disuse and disappeared almost immediately after sed became available. tail is still around, I guess both because the sed invocation for `tail -n' does not quite roll off the fingers the same way that 20q does -- see e.g https://unix.stackexchange.com/questions/107387/emulate-tail-with-sed#107388 -- but also because `tail -f' cannot be emulated by sed, at least not without a lot of effort and/or another program). -- Nick "There are only two hard problems in computer science: cache invalidation, naming things, and off-by-one errors." -Martin Fowler ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Most used words in current buffer 2018-07-19 17:19 ` Nick Dokos @ 2018-07-19 17:30 ` Eli Zaretskii 2018-07-19 20:08 ` Bob Proulx [not found] ` <mailman.3843.1532030947.1292.help-gnu-emacs@gnu.org> 2 siblings, 0 replies; 47+ messages in thread From: Eli Zaretskii @ 2018-07-19 17:30 UTC (permalink / raw) To: help-gnu-emacs > From: Nick Dokos <ndokos@gmail.com> > Date: Thu, 19 Jul 2018 13:19:39 -0400 > > IIRC, Kernighan & Pike say in the "Unix Programming Environment" that > there *was* a `head' program, in addition to the `tail' program. It > fell into disuse and disappeared almost immediately after sed became > available. 'head' is alive and well in GNU Coreutils. E.g., on a garden-variety GNU/Linux system: ~$ head --version head (GNU coreutils) 8.21 Copyright (C) 2013 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>. This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Written by David MacKenzie and Jim Meyering. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Most used words in current buffer 2018-07-19 17:19 ` Nick Dokos 2018-07-19 17:30 ` Eli Zaretskii @ 2018-07-19 20:08 ` Bob Proulx 2018-07-20 16:39 ` Nick Dokos [not found] ` <mailman.3909.1532104802.1292.help-gnu-emacs@gnu.org> [not found] ` <mailman.3843.1532030947.1292.help-gnu-emacs@gnu.org> 2 siblings, 2 replies; 47+ messages in thread From: Bob Proulx @ 2018-07-19 20:08 UTC (permalink / raw) To: help-gnu-emacs Nick Dokos wrote: > IIRC, Kernighan & Pike say in the "Unix Programming Environment" that > there *was* a `head' program, in addition to the `tail' program. It > fell into disuse and disappeared almost immediately after sed became > available. I think you may be thinking of the 'gres' program. Again here I am not going to look up the reference but instead just reply upon my feeble human memory. But I think you are thinking of the gres program, global regular expression substitute, which if that route were followed would require a lot of greX programs where X is replaced by many specific things and was completely subsumed by 'sed'. Also remember that at the time knowledge and daily use of ed (and qed, ex, and the others) made using sed very easy. Lots of shared knowledge. However today that sed may seem arcane to people is just that they are no longer familiar with ed. It no longer has that shared learning that made sed so familiar back in the day. Bob ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Most used words in current buffer 2018-07-19 20:08 ` Bob Proulx @ 2018-07-20 16:39 ` Nick Dokos [not found] ` <mailman.3909.1532104802.1292.help-gnu-emacs@gnu.org> 1 sibling, 0 replies; 47+ messages in thread From: Nick Dokos @ 2018-07-20 16:39 UTC (permalink / raw) To: help-gnu-emacs > Nick Dokos wrote: >> IIRC, Kernighan & Pike say in the "Unix Programming Environment" that >> there *was* a `head' program, in addition to the `tail' program. It >> fell into disuse and disappeared almost immediately after sed became >> available. > Eli Zaretskii writes: > 'head' is alive and well in GNU Coreutils. E.g., on a garden-variety > GNU/Linux system: Indeed - I knew that but I misspoke - I got buried shortly after I sent and never got the chance to correct it. What I meant was not that head is not available today, but that it was available on early Unix and that (I thought) it went away after sed was added in that same early Unix (maybe version 6 or possibly earlier). Which obviously would not explain why it is still around today, but never mind: I was misremembering things - it turns out that Bob Proulx's "feeble" memory was better than my apparently non-existent one - see below. Bob Proulx <bob@proulx.com> writes: > I think you may be thinking of the 'gres' program. Again here I am > not going to look up the reference but instead just reply upon my > feeble human memory. But I think you are thinking of the gres > program, global regular expression substitute, which if that route > were followed would require a lot of greX programs where X is replaced > by many specific things and was completely subsumed by 'sed'. > Yes, indeed: I had to search a bit in Kernighan and Pike to find it again, and you are right: they were talking about `gres' and `grep', not head and tail. > Also remember that at the time knowledge and daily use of ed (and qed, > ex, and the others) made using sed very easy. Lots of shared > knowledge. However today that sed may seem arcane to people is just > that they are no longer familiar with ed. It no longer has that > shared learning that made sed so familiar back in the day. > Agreed: I knew ed (I actually ported the Kernighan and Plauger ed clone to a Prime OS machine in the early 1980's, before Prime started selling its version of Emacs - the line editor they were distributing was driving me crazy), and could transfer the knowledge to simple sed invocations. At some point I even learnt the more "advanced" portions of sed programming, but I never used them enough to retain them. -- Nick "There are only two hard problems in computer science: cache invalidation, naming things, and off-by-one errors." -Martin Fowler ^ permalink raw reply [flat|nested] 47+ messages in thread
[parent not found: <mailman.3909.1532104802.1292.help-gnu-emacs@gnu.org>]
* Re: Most used words in current buffer [not found] ` <mailman.3909.1532104802.1292.help-gnu-emacs@gnu.org> @ 2018-07-20 18:13 ` Udyant Wig 2018-07-20 22:24 ` Bob Newell 2018-07-21 0:18 ` Nick Dokos 0 siblings, 2 replies; 47+ messages in thread From: Udyant Wig @ 2018-07-20 18:13 UTC (permalink / raw) To: help-gnu-emacs On 07/20/2018 10:09 PM, Nick Dokos wrote: > Agreed: I knew ed (I actually ported the Kernighan and Plauger ed > clone to a Prime OS machine in the early 1980's, before Prime started > selling its version of Emacs - the line editor they were distributing > was driving me crazy), and could transfer the knowledge to simple sed > invocations. At some point I even learnt the more "advanced" portions > of sed programming, but I never used them enough to retain them. How was it to use the Emacs on Prime OS? Was it very different to other Emacsen extant at the time? Udyant Wig -- We make our discoveries through our mistakes: we watch one another's success: and where there is freedom to experiment there is hope to improve. -- Arthur Quiller-Couch ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Most used words in current buffer 2018-07-20 18:13 ` Udyant Wig @ 2018-07-20 22:24 ` Bob Newell 2018-07-21 0:00 ` Nick Dokos 2018-07-21 0:18 ` Nick Dokos 1 sibling, 1 reply; 47+ messages in thread From: Bob Newell @ 2018-07-20 22:24 UTC (permalink / raw) To: help-gnu-emacs > How was it to use the Emacs on Prime OS? Was it very different to > other > Emacsen extant at the time? I also date back to these days, and in fact Emacs on Prime was my first experience with Emacs, all those years ago. I was fairly active in the Prime user community and sent the odd patch back to Prime once in a while. I was thoroughly disliked by the local Prime software support staff because the only time I ever contacted them was when something was seriously wrong and I couldn't fix it on my own. The initial release, as I recall, was extremely slow starting, taking at least five minutes before being ready for use (not exaggerating). This did get much, much better with subsequent releases. And, once started, it always ran very quickly. It was real Emacs in all its glory. Nick--- as you're the one that did the port--- take credit for brilliant work. There were a couple of peculiarities. Ctrl-p was Prime's interrupt, so previous-line became Ctrl-z (I don't believe there was the concept of 'suspend' at that time). We used the mnemoic 'zrevious'! The other odd thing was that underscore was used instead of dash, so you wouldn't have buffer-name, you'd have buffer_name. But it served me well for a number of years, and that's how I learned Emacs and Elisp. There wasn't a package library, but Emacs was still super-powerful. When the PC came along I looked for something as good and found little. Early Emacs ports to the PC weren't the best. There was something called Epsilon (which I think still exists) that came closest. But eventually there was the Real Thing and none of us ever looked back. By then Prime was in its last days. -- Bob Newell Honolulu, Hawai`i * Via Gnus/BBDB/Org/Emacs/Linux * ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Most used words in current buffer 2018-07-20 22:24 ` Bob Newell @ 2018-07-21 0:00 ` Nick Dokos 0 siblings, 0 replies; 47+ messages in thread From: Nick Dokos @ 2018-07-21 0:00 UTC (permalink / raw) To: help-gnu-emacs Bob Newell <bobnewell@bobnewell.net> writes: > It was real Emacs in all its glory. Nick--- as you're the one > that did the port--- take credit for brilliant work. > Uhh, I wish I could say that I ported emacs - alas I cannot take credit for that: what I ported was the ed clone from Kernighan and Plauger's "Software Tools". -- Nick "There are only two hard problems in computer science: cache invalidation, naming things, and off-by-one errors." -Martin Fowler ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Most used words in current buffer 2018-07-20 18:13 ` Udyant Wig 2018-07-20 22:24 ` Bob Newell @ 2018-07-21 0:18 ` Nick Dokos 1 sibling, 0 replies; 47+ messages in thread From: Nick Dokos @ 2018-07-21 0:18 UTC (permalink / raw) To: help-gnu-emacs Udyant Wig <udyantw@gmail.com> writes: > How was it to use the Emacs on Prime OS? Was it very different to other > Emacsen extant at the time? > It was pretty close I think. I don't remember the startup slowness that Bob Newell mentioned but it may be that we waited to get it until later and they may have ironed out the problems a bit. But once emacs became available to me, I threw away my ed port and never looked back. -- Nick "There are only two hard problems in computer science: cache invalidation, naming things, and off-by-one errors." -Martin Fowler ^ permalink raw reply [flat|nested] 47+ messages in thread
[parent not found: <mailman.3843.1532030947.1292.help-gnu-emacs@gnu.org>]
* Re: Most used words in current buffer [not found] ` <mailman.3843.1532030947.1292.help-gnu-emacs@gnu.org> @ 2018-07-20 6:19 ` Udyant Wig 2018-07-20 23:25 ` Bob Proulx [not found] ` <mailman.3934.1532129163.1292.help-gnu-emacs@gnu.org> 0 siblings, 2 replies; 47+ messages in thread From: Udyant Wig @ 2018-07-20 6:19 UTC (permalink / raw) To: help-gnu-emacs On 07/20/2018 01:38 AM, Bob Proulx wrote: > Also remember that at the time knowledge and daily use of ed (and qed, > ex, and the others) made using sed very easy. Lots of shared > knowledge. However today that sed may seem arcane to people is just > that they are no longer familiar with ed. It no longer has that > shared learning that made sed so familiar back in the day. That is true. I expect that learning the three of ed, sed, and grep will be mutually reinforcing; the functionality in ed forms the basis of those in sed and grep. One cannot however expect ed to be available by default on modern Unix systems. E.g., it is not there by default on Debian. While on the topic of editors, here's a fun quote: By the way, 'em' stands for 'editor for mortals' -- I christened it that after Ken Thompson visited our lab at QMC while I was developing it and said something like: "Yeah, I've seen editors like that, but I don't feel a need for them, I don't want to see the state of the file when I'm editing". -- George Coulouris > Bob Udyant Wig -- We make our discoveries through our mistakes: we watch one another's success: and where there is freedom to experiment there is hope to improve. -- Arthur Quiller-Couch ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Most used words in current buffer 2018-07-20 6:19 ` Udyant Wig @ 2018-07-20 23:25 ` Bob Proulx 2018-07-21 0:26 ` Nick Dokos [not found] ` <mailman.3934.1532129163.1292.help-gnu-emacs@gnu.org> 1 sibling, 1 reply; 47+ messages in thread From: Bob Proulx @ 2018-07-20 23:25 UTC (permalink / raw) To: help-gnu-emacs Udyant Wig wrote: > Bob Proulx wrote: > > Also remember that at the time knowledge and daily use of ed (and qed, > > That is true. I expect that learning the three of ed, sed, and grep > will be mutually reinforcing; the functionality in ed forms the basis of > those in sed and grep. Perhaps the other facet of this is regular expressions. When I was in my entry level university class the first two weeks solid were spent on learning to use the computer, use the editor, work with files, and using regular expressions. That last, regular expressions, seems to be less emphasized today. It isn't taught at all at the local university. AFAICT students here pick it up only in passing if at all. They are missing out on a very powerful feature. > One cannot however expect ed to be available by default on modern Unix > systems. E.g., it is not there by default on Debian. I have two stories that I find funny about ed. This one happened in the last week. A friend at a meetup had a problem with his Ubuntu laptop. His GUI was broken and needed a small file fix. This person uses the GUI for everything and had no terminal editors installed, as far as I could tell. Not vi, vim, nor emacs nor other. Of course this person's normal GUI editors were unavailable without X running. But strangely 'ed' *was* installed. I don't use Ubuntu but I guess it got installed by default there. Or something pulled it in. I have no idea. I think you can already tell where this is going. I used ed to edit and fix things. Being able to use it appeared like magic to this person who couldn't imagine you could edit something without a mouse. This happened in just this last week! Having ed there made the task easy. And I'll save the other funny story about ed for another time. :-) Bob ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Most used words in current buffer 2018-07-20 23:25 ` Bob Proulx @ 2018-07-21 0:26 ` Nick Dokos 2018-07-21 4:03 ` Bob Proulx 0 siblings, 1 reply; 47+ messages in thread From: Nick Dokos @ 2018-07-21 0:26 UTC (permalink / raw) To: help-gnu-emacs Bob Proulx <bob@proulx.com> writes: > I have two stories that I find funny about ed. This one happened in > the last week. A friend at a meetup had a problem with his Ubuntu > laptop. His GUI was broken and needed a small file fix. This person > uses the GUI for everything and had no terminal editors installed, as > far as I could tell. Not vi, vim, nor emacs nor other. Of course > this person's normal GUI editors were unavailable without X running. > But strangely 'ed' *was* installed. I don't use Ubuntu but I guess it > got installed by default there. Or something pulled it in. I have no > idea. > > I think you can already tell where this is going. I used ed to edit > and fix things. Being able to use it appeared like magic to this > person who couldn't imagine you could edit something without a mouse. > A friend (an avid vim user) was watching over my shoulder while I was trying to do something (in emacs of course) and had to edit /etc/fstab or something. I was in an emacs shell at the time, so I did 'sudo ed /etc/fstab' and did my edit (and yes, I know and use tramp most of the time). The idea of running ed from the shell from within emacs was too much for him: he was laughing for hours! And yes, I install ed whenever I set up a new machine: for just those occasions :-) -- Nick "There are only two hard problems in computer science: cache invalidation, naming things, and off-by-one errors." -Martin Fowler ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Most used words in current buffer 2018-07-21 0:26 ` Nick Dokos @ 2018-07-21 4:03 ` Bob Proulx 0 siblings, 0 replies; 47+ messages in thread From: Bob Proulx @ 2018-07-21 4:03 UTC (permalink / raw) To: help-gnu-emacs Nick Dokos wrote: > A friend (an avid vim user) was watching over my shoulder while I was > trying to do something (in emacs of course) and had to edit /etc/fstab > or something. I was in an emacs shell at the time, so I did 'sudo ed > /etc/fstab' and did my edit (and yes, I know and use tramp most of the > time). The idea of running ed from the shell from within emacs was too > much for him: he was laughing for hours! That seems just perfect! However I must ask if you have used M-x term in order to get a tty? You could have really tormented your friend: M-x term sudo emacs /etc/fstab Just remember to prefix every keystroke appropriately. As a hint it is easy to switch term mode into line mode with C-c C-j and back to character mode with C-c C-k. And then you can always do that again without a hard limit on the number of nestings. Until you run out of vertical terminal lines! > And yes, I install ed whenever I set up a new machine: for just those > occasions :-) Me too! :-) Bob ^ permalink raw reply [flat|nested] 47+ messages in thread
[parent not found: <mailman.3934.1532129163.1292.help-gnu-emacs@gnu.org>]
* Re: Most used words in current buffer [not found] ` <mailman.3934.1532129163.1292.help-gnu-emacs@gnu.org> @ 2018-07-21 13:39 ` Udyant Wig 0 siblings, 0 replies; 47+ messages in thread From: Udyant Wig @ 2018-07-21 13:39 UTC (permalink / raw) To: help-gnu-emacs On 07/21/2018 04:55 AM, Bob Proulx wrote: > I have two stories that I find funny about ed. This one happened in > the last week. A friend at a meetup had a problem with his Ubuntu > laptop. His GUI was broken and needed a small file fix. This person > uses the GUI for everything and had no terminal editors installed, as > far as I could tell. Not vi, vim, nor emacs nor other. Of course > this person's normal GUI editors were unavailable without X running. > But strangely 'ed' *was* installed. I don't use Ubuntu but I guess it > got installed by default there. Or something pulled it in. I have no > idea. > > I think you can already tell where this is going. I used ed to edit > and fix things. Being able to use it appeared like magic to this > person who couldn't imagine you could edit something without a mouse. > > This happened in just this last week! Having ed there made the task > easy. The face of the person must have been a picture. I take the moral of this to be that one ought to know the simpler tools, even if one never uses them. I don't know how general this moral is. > And I'll save the other funny story about ed for another time. :-) Here's hoping that occasion comes soon. > Bob Udyant Wig -- We make our discoveries through our mistakes: we watch one another's success: and where there is freedom to experiment there is hope to improve. -- Arthur Quiller-Couch ^ permalink raw reply [flat|nested] 47+ messages in thread
[parent not found: <mailman.3826.1532020800.1292.help-gnu-emacs@gnu.org>]
* Re: Most used words in current buffer [not found] ` <mailman.3826.1532020800.1292.help-gnu-emacs@gnu.org> @ 2018-07-20 5:52 ` Udyant Wig 0 siblings, 0 replies; 47+ messages in thread From: Udyant Wig @ 2018-07-20 5:52 UTC (permalink / raw) To: help-gnu-emacs On 07/19/2018 10:49 PM, Nick Dokos wrote: > IIRC, Kernighan & Pike say in the "Unix Programming Environment" that > there *was* a `head' program, in addition to the `tail' program. It > fell into disuse and disappeared almost immediately after sed became > available. tail is still around, I guess both because the sed > invocation for `tail -n' does not quite roll off the fingers the same > way that 20q does -- see e.g > > https://unix.stackexchange.com/questions/107387/emulate-tail-with-sed#107388 > > -- but also because `tail -f' cannot be emulated by sed, at least not > without a lot of effort and/or another program). Well, in the interests of dispelling doubts around what Kernighan and Pike say, I will quote the paragraph they wrote in regard to sed and utilities equivalent to sed commands: With these ideas, it might seem sensible to write a program, called _head_, to print the first few lines of each filename argument. But _sed 3q_ (or _10q_) is so easy to type that we've never felt the need. We do, however, have an _ind_, since its equivalent _sed_ command is harder to type. (In the process of writing this book we replaced the existing 20-line C program by version 2 of the one-line implementation shown earlier). There is no clear criterion for when it's worth making a separate command from a complicated command line; the best rule we've found is to put it in your _bin_ and see if you actually use it. On the point of 'tail -f', yes, it is a very useful command that would indeed be non-trivial to build using sed. Udyant Wig -- We make our discoveries through our mistakes: we watch one another's success: and where there is freedom to experiment there is hope to improve. -- Arthur Quiller-Couch ^ permalink raw reply [flat|nested] 47+ messages in thread
[parent not found: <mailman.3796.1531983885.1292.help-gnu-emacs@gnu.org>]
* Re: Most used words in current buffer [not found] ` <mailman.3796.1531983885.1292.help-gnu-emacs@gnu.org> @ 2018-07-19 13:26 ` Udyant Wig 2018-07-19 20:42 ` Bob Proulx ` (3 more replies) 0 siblings, 4 replies; 47+ messages in thread From: Udyant Wig @ 2018-07-19 13:26 UTC (permalink / raw) To: help-gnu-emacs On 07/19/2018 12:34 PM, Bob Proulx wrote: > Hmm... I think looking behind the abstract data type at how it might > or might not be implemented is a stretch to say the least. The entire > reason it is an abstract data type is to hide those types of > implementation details. :-) Indeed. The principle generalizes to abstract functionality as well, doesn't it? E.g. qsort in the C library may or may not be an implementation of quick sort; or closer to the topic of the newsgroup, one ought not to care about the actual algorithm of the SORT function in Emacs. Or, to illustrate the point in another way, I like that I can operate on sequences in Emacs Lisp, rather than have to commit to either lists or vectors right at the start. > The naming of things usually says more about the person that named the > thing than the thing itself. Associative arrays is a naming that > reflects the concepts involved in what it does. This is the same as > when someone else names it a map table, or a dictionary. Those are > all the same thing. Just using different names because people took > different paths to get there. Yes. Just as, arguably, vectors are a special case of the general concept of arrays, though the terms are commonly used to name the same thing. > Hash tables are attractive because in their perfect form they are O(1) > order of the function being a fixed amount for any lookup. Being a > fixed amount sounds very good and therefore authors often implement > using hash tables. However achieving that perfect O(1) performance > requires knowing the size of the needed hash table at creation time so > that it can be sized correctly from the start and that is not the case > here. Therefore when implemented as a hash table the table will need > to be grown at various intervals as the number of entries contained is > increased. That growth is lumped and consumes time and resources for > the growth at the time the table is grown. > > For such things I generally prefer balanced tree structures because > work is amortized instead of lumped. But the important point here is > that for every algorithm + data structure there is a trade-off of some > sort between one thing and another thing. Hmm. I had written a tree version of the word counter I had mentioned before. I had stumbled upon the AVL tree package in Emacs and thought I might try using it. This tree-based attempt turned out to be slower than my straightforward hashing solution. I have no doubts this code could be written better by someone more experienced than I. --- (require 'cl-lib) (require 'avl-tree) (defun buffer-most-used-words-2 (n) "Make a list of the N most used words in buffer." (let ((counts (avl-tree-create (lambda (wc1 wc2) (string< (first wc1) (first wc2))))) (words (split-string (buffer-string))) sorted-counts) (dolist (word words) (let ((element (avl-tree-member counts (list (downcase word) 0)))) (if element (progn (avl-tree-delete counts element) (avl-tree-enter counts (list (downcase word) (1+ (second element))))) (avl-tree-enter counts (list (downcase word) 1))))) (setf sorted-counts (cl-sort (avl-tree-flatten counts) #'> :key #'second)) (mapcar #'first (cl-subseq sorted-counts 0 n)))) --- > I am in total agreement over using sed instead of head if you want to > do that. Seeing 'sed 20q' should roll off the keyboard as print lines > until line 20 and then quit. Very simple and to the point. There is > definitely no need for a separate head command. Other than for > symmetry with tail which is not as simple in sed. I see that. You could implement head on top of sed if you wanted to. I myself have been using head for long enough for its stated purpose that grasping a sed equivalent was not immediately obvious. > I didn't go look up the reference in the book before posting my first > reply. I probably should have refreshed my knowledge of it. Then I > would have realized that we weren't talking about literal hashing of > tokens (like what is done in SpamAssassin's Bayes engine for one > example) but rather use of "hash" as a table lookup method. That > caused me to say that first. Sorry. No, I made the original mistake. I should have made it clear that I was speaking of hash tables. The general word 'hashing' was misleading. Mea culpa. > As to the old style command line options for head and sort I was just > wanting to keep in people's minds that since a decade ago or so the > new format of the command line options is now the preferred one. That > old style is certainly how we wrote those things back in the day > though. These things do take time to gain currency, don't they? Under Linux, for example, the ip set of commands has been named the successor to ifconfig, and it too is taking time to diffuse into general knowledge. (And, although there have been a number of revisions of Standard C since 1989/1990, a lot of projects still write to that now legacy standard. But there may be other issues to consider here.) > Bob Udyant Wig -- We make our discoveries through our mistakes: we watch one another's success: and where there is freedom to experiment there is hope to improve. -- Arthur Quiller-Couch ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Most used words in current buffer 2018-07-19 13:26 ` Udyant Wig @ 2018-07-19 20:42 ` Bob Proulx 2018-07-20 3:08 ` Bob Newell [not found] ` <mailman.3861.1532056120.1292.help-gnu-emacs@gnu.org> [not found] ` <mailman.3845.1532032966.1292.help-gnu-emacs@gnu.org> ` (2 subsequent siblings) 3 siblings, 2 replies; 47+ messages in thread From: Bob Proulx @ 2018-07-19 20:42 UTC (permalink / raw) To: help-gnu-emacs Udyant Wig wrote: > Bob Proulx wrote: > > Hmm... I think looking behind the abstract data type at how it might > > or might not be implemented is a stretch to say the least. The entire > > reason it is an abstract data type is to hide those types of > > implementation details. :-) > > Indeed. The principle generalizes to abstract functionality as well, > doesn't it? E.g. qsort in the C library may or may not be an > implementation of quick sort; or closer to the topic of the newsgroup, > one ought not to care about the actual algorithm of the SORT function in > Emacs. Yes! That is it exactly! > > The naming of things usually says more about the person that named the > > thing than the thing itself. Associative arrays is a naming that > > reflects the concepts involved in what it does. This is the same as > > when someone else names it a map table, or a dictionary. Those are > > all the same thing. Just using different names because people took > > different paths to get there. > > Yes. Just as, arguably, vectors are a special case of the general > concept of arrays, though the terms are commonly used to name the same > thing. Agree completely. > > For such things I generally prefer balanced tree structures because > > work is amortized instead of lumped. But the important point here is > > that for every algorithm + data structure there is a trade-off of some > > sort between one thing and another thing. > > Hmm. I had written a tree version of the word counter I had mentioned > before. I had stumbled upon the AVL tree package in Emacs and thought I > might try using it. This tree-based attempt turned out to be slower > than my straightforward hashing solution. > > I have no doubts this code could be written better by someone more > experienced than I. I don't know if the AVL package you used was implemented in elisp or in C or otherwise. And even though I am a long time user of emacs I have never acquired the elisp skill to the same level as other languages and therefore can't comment on that part. But I know that when people have implemented such data structures in Perl that the result has never been as fast and efficient as in a native C version. If so then that may easily account for performance differences. And also the native implementation of "hashes" in awk, perl, python, ruby is quite optimized and very fast. They have had years of eyes and tweaking upon them. > > I am in total agreement over using sed instead of head if you want to > > do that. Seeing 'sed 20q' should roll off the keyboard as print lines > > until line 20 and then quit. Very simple and to the point. There is > > definitely no need for a separate head command. Other than for > > symmetry with tail which is not as simple in sed. > > I see that. You could implement head on top of sed if you wanted to. I > myself have been using head for long enough for its stated purpose that > grasping a sed equivalent was not immediately obvious. Writing clear code that can be understood immediately by the entire range of programmer skill is important in my not so humble opinion. One shouldn't need to be a master experienced programmer to understand what has been written. Therefore I usually use 'head' specifically for the clarity of it to everyone. Seeing "head -n40" is not going to confuse anyone. Therefore I usually use it instead of "sed 40q" even though I could remove 'head' entirely from my system if I were to uniformly implement one in terms of the other. Clarity is more important. And before someone mentions performance let me remind that we are talking shell scripts. In a shell script clarity is more important than performance. Always. If the resulting shell script results in a performance problem than choosing a better algorithm will almost certainly be the better solution. And if not than then choosing a different language more efficient at the task is next. I do expect some skill to be learned with 'awk' however. It is so very useful that seeing "awk '{print$1}' should not be that confusing that it is printing the first field column. Or that '{print$NF}' is a common idiom for printing the last field. (NF is the Number of Fields in the line that was split by whitespace. $NF is therefore the last field. If NF is 5 then $NF is saying $5 and therefore always the last field of the line.) A little bit of awk learning pays back a large return on the investment. > These things do take time to gain currency, don't they? Under Linux, > for example, the ip set of commands has been named the successor to > ifconfig, and it too is taking time to diffuse into general knowledge. Yes. And 'ip' is an excellent example! Even I have converted to using ip and the iproute2 family instead of ifconfig. One thing to note about the iproute2 family is that it is reasonably well written. We are not forced to use it. Instead we are attracted to using it in order to get access to the entire set of new networking features available only through them. It is a carrot not a stick. > (And, although there have been a number of revisions of Standard C since > 1989/1990, a lot of projects still write to that now legacy standard. > But there may be other issues to consider here.) Another good example. :-) Bob ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Most used words in current buffer 2018-07-19 20:42 ` Bob Proulx @ 2018-07-20 3:08 ` Bob Newell [not found] ` <mailman.3861.1532056120.1292.help-gnu-emacs@gnu.org> 1 sibling, 0 replies; 47+ messages in thread From: Bob Newell @ 2018-07-20 3:08 UTC (permalink / raw) To: help-gnu-emacs As a writer I rather like this and I've started to adapt it for my Emacs writing environment. I plan on a simple change to allow optional restriction to a region rather than the whole buffer. But as a writing tool, (and I realize I'm straying from pure Emacs discussion) the issue is that words like "the" and all the common stuff are always going to come out on top in an English-language piece of any length. The next trick will be to eliminate such common words (again, optionally). By the way on a 2 MB file the elisp version runs in a few seconds. Hats off to the coder. Bob Newell Honolulu, Hawai`i ^ permalink raw reply [flat|nested] 47+ messages in thread
[parent not found: <mailman.3861.1532056120.1292.help-gnu-emacs@gnu.org>]
* Re: Most used words in current buffer [not found] ` <mailman.3861.1532056120.1292.help-gnu-emacs@gnu.org> @ 2018-07-21 12:51 ` Udyant Wig 2018-07-21 16:15 ` Eric Abrahamsen [not found] ` <mailman.3982.1532189751.1292.help-gnu-emacs@gnu.org> 0 siblings, 2 replies; 47+ messages in thread From: Udyant Wig @ 2018-07-21 12:51 UTC (permalink / raw) To: help-gnu-emacs On 07/20/2018 08:38 AM, Bob Newell wrote: > By the way on a 2 MB file the elisp version runs in a few seconds. > Hats off to the coder. I am still looking to improve it. For example, on a 4.5 MB text file, the original version takes over 5 seconds to run, as measured using the functions #'benchmark-run and #'benchmark-run-compiled. Is it feasible to read words from the buffer and hash them directly from there? Or, going further, is there a better way to do this -- counting words and producing the N most used -- using some other design, maybe with some other data structure? > Bob Newell > Honolulu, Hawai`i Udyant Wig -- We make our discoveries through our mistakes: we watch one another's success: and where there is freedom to experiment there is hope to improve. -- Arthur Quiller-Couch ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Most used words in current buffer 2018-07-21 12:51 ` Udyant Wig @ 2018-07-21 16:15 ` Eric Abrahamsen [not found] ` <mailman.3982.1532189751.1292.help-gnu-emacs@gnu.org> 1 sibling, 0 replies; 47+ messages in thread From: Eric Abrahamsen @ 2018-07-21 16:15 UTC (permalink / raw) To: help-gnu-emacs Udyant Wig <udyantw@gmail.com> writes: > On 07/20/2018 08:38 AM, Bob Newell wrote: >> By the way on a 2 MB file the elisp version runs in a few seconds. >> Hats off to the coder. > > I am still looking to improve it. For example, on a 4.5 MB text file, > the original version takes over 5 seconds to run, as measured using the > functions #'benchmark-run and #'benchmark-run-compiled. > > Is it feasible to read words from the buffer and hash them directly from > there? Or, going further, is there a better way to do this -- counting > words and producing the N most used -- using some other design, maybe > with some other data structure? Interesting... In general I think Emacs is highly optimized to use the buffer as its textual data structure, more so than a string. Particularly when the code is compiled (many of the text-movement commands have opcodes). I made the following two commands to collect words from a novel in an Org file, and the one that uses `forward-word' and `buffer-substring' runs around twice as fast as the `split-string'. Of course, they don't collect the same list of words! But even if you add more code for trimming, etc., it will still likely be faster than operating on a string. (defun test-string (&optional f) (let ((file (or f "/home/eric/org/hollowmountain.org")) str lst) (with-temp-buffer (insert-file-contents file) (setq str (split-string (buffer-string))) (dolist (word str) (push word lst))) (length lst))) (defun test-buffer (&optional f) (let ((file (or f "/home/eric/org/hollowmountain.org")) pnt lst) (with-temp-buffer (insert-file-contents file) (goto-char (point-min)) (setq pnt (point)) (while (forward-word) (push (buffer-substring pnt (point)) lst) (setq pnt (point)))) (length lst))) ^ permalink raw reply [flat|nested] 47+ messages in thread
[parent not found: <mailman.3982.1532189751.1292.help-gnu-emacs@gnu.org>]
* Re: Most used words in current buffer [not found] ` <mailman.3982.1532189751.1292.help-gnu-emacs@gnu.org> @ 2018-07-21 19:46 ` Udyant Wig 2018-07-22 3:57 ` Eric Abrahamsen [not found] ` <mailman.4007.1532231884.1292.help-gnu-emacs@gnu.org> 0 siblings, 2 replies; 47+ messages in thread From: Udyant Wig @ 2018-07-21 19:46 UTC (permalink / raw) To: help-gnu-emacs On 07/21/2018 09:45 PM, Eric Abrahamsen wrote: > Interesting... In general I think Emacs is highly optimized to use the > buffer as its textual data structure, more so than a string. > Particularly when the code is compiled (many of the text-movement > commands have opcodes). I made the following two commands to collect > words from a novel in an Org file, and the one that uses > `forward-word' and `buffer-substring' runs around twice as fast as the > `split-string'. > > Of course, they don't collect the same list of words! But even if you > add more code for trimming, etc., it will still likely be faster than > operating on a string. > [snip code] I have acted upon the advice (yours and Stefan Monnier's) to operate on the buffer directly using BUFFER-SUBSTRING. Please see my follow up to Stefan's message. BUFFER-SUBSTRING did gain me (somewhat) better performance. Udyant Wig -- We make our discoveries through our mistakes: we watch one another's success: and where there is freedom to experiment there is hope to improve. -- Arthur Quiller-Couch ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Most used words in current buffer 2018-07-21 19:46 ` Udyant Wig @ 2018-07-22 3:57 ` Eric Abrahamsen 2018-07-22 4:00 ` Eric Abrahamsen [not found] ` <mailman.4008.1532232144.1292.help-gnu-emacs@gnu.org> [not found] ` <mailman.4007.1532231884.1292.help-gnu-emacs@gnu.org> 1 sibling, 2 replies; 47+ messages in thread From: Eric Abrahamsen @ 2018-07-22 3:57 UTC (permalink / raw) To: help-gnu-emacs Udyant Wig <udyantw@gmail.com> writes: > On 07/21/2018 09:45 PM, Eric Abrahamsen wrote: >> Interesting... In general I think Emacs is highly optimized to use the >> buffer as its textual data structure, more so than a string. >> Particularly when the code is compiled (many of the text-movement >> commands have opcodes). I made the following two commands to collect >> words from a novel in an Org file, and the one that uses >> `forward-word' and `buffer-substring' runs around twice as fast as the >> `split-string'. >> >> Of course, they don't collect the same list of words! But even if you >> add more code for trimming, etc., it will still likely be faster than >> operating on a string. >> [snip code] > > I have acted upon the advice (yours and Stefan Monnier's) to operate on > the buffer directly using BUFFER-SUBSTRING. Please see my follow up to > Stefan's message. > > BUFFER-SUBSTRING did gain me (somewhat) better performance. As Stefan said, going character by character is going to be slow... But my example with `forward-word' collects a lot of cruft. So I would suggest doing what `forward-word' does internally and move by syntax. This also opens up the possibility of tweaking the behavior of your function (ie, what constitutes a word) by setting temporary syntax tables. Here's a word scanner that only picks up actual words (according to the default syntax table): (defun test-buffer (&optional f) (let ((file (or f "/home/eric/org/hollowmountain.org")) pnt lst) (with-temp-buffer (insert-file-contents file) (goto-char (point-min)) (skip-syntax-forward "^w") (setq pnt (point)) (while (and (null (eobp)) (skip-syntax-forward "w")) (push (buffer-substring pnt (point)) lst) (skip-syntax-forward "^w") (setq pnt (point)))) (nreverse lst))) ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Most used words in current buffer 2018-07-22 3:57 ` Eric Abrahamsen @ 2018-07-22 4:00 ` Eric Abrahamsen 2018-07-22 4:05 ` Eric Abrahamsen [not found] ` <mailman.4008.1532232144.1292.help-gnu-emacs@gnu.org> 1 sibling, 1 reply; 47+ messages in thread From: Eric Abrahamsen @ 2018-07-22 4:00 UTC (permalink / raw) To: help-gnu-emacs Eric Abrahamsen <eric@ericabrahamsen.net> writes: > Udyant Wig <udyantw@gmail.com> writes: > >> On 07/21/2018 09:45 PM, Eric Abrahamsen wrote: >>> Interesting... In general I think Emacs is highly optimized to use the >>> buffer as its textual data structure, more so than a string. >>> Particularly when the code is compiled (many of the text-movement >>> commands have opcodes). I made the following two commands to collect >>> words from a novel in an Org file, and the one that uses >>> `forward-word' and `buffer-substring' runs around twice as fast as the >>> `split-string'. >>> >>> Of course, they don't collect the same list of words! But even if you >>> add more code for trimming, etc., it will still likely be faster than >>> operating on a string. >>> [snip code] >> >> I have acted upon the advice (yours and Stefan Monnier's) to operate on >> the buffer directly using BUFFER-SUBSTRING. Please see my follow up to >> Stefan's message. >> >> BUFFER-SUBSTRING did gain me (somewhat) better performance. > > As Stefan said, going character by character is going to be slow... But > my example with `forward-word' collects a lot of cruft. So I would > suggest doing what `forward-word' does internally and move by syntax. Actually I think alternating `forward-word' with `forward-to-word' might do the exact same thing as alternating (skip-syntax-forward "w") with (skip-syntax-forward "^w"), and might get you some extra... stuff. Maybe worth benchmarking! ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Most used words in current buffer 2018-07-22 4:00 ` Eric Abrahamsen @ 2018-07-22 4:05 ` Eric Abrahamsen 0 siblings, 0 replies; 47+ messages in thread From: Eric Abrahamsen @ 2018-07-22 4:05 UTC (permalink / raw) To: help-gnu-emacs Eric Abrahamsen <eric@ericabrahamsen.net> writes: > Eric Abrahamsen <eric@ericabrahamsen.net> writes: > >> Udyant Wig <udyantw@gmail.com> writes: >> >>> On 07/21/2018 09:45 PM, Eric Abrahamsen wrote: >>>> Interesting... In general I think Emacs is highly optimized to use the >>>> buffer as its textual data structure, more so than a string. >>>> Particularly when the code is compiled (many of the text-movement >>>> commands have opcodes). I made the following two commands to collect >>>> words from a novel in an Org file, and the one that uses >>>> `forward-word' and `buffer-substring' runs around twice as fast as the >>>> `split-string'. >>>> >>>> Of course, they don't collect the same list of words! But even if you >>>> add more code for trimming, etc., it will still likely be faster than >>>> operating on a string. >>>> [snip code] >>> >>> I have acted upon the advice (yours and Stefan Monnier's) to operate on >>> the buffer directly using BUFFER-SUBSTRING. Please see my follow up to >>> Stefan's message. >>> >>> BUFFER-SUBSTRING did gain me (somewhat) better performance. >> >> As Stefan said, going character by character is going to be slow... But >> my example with `forward-word' collects a lot of cruft. So I would >> suggest doing what `forward-word' does internally and move by syntax. > > Actually I think alternating `forward-word' with `forward-to-word' might > do the exact same thing as alternating (skip-syntax-forward "w") with > (skip-syntax-forward "^w"), and might get you some extra... stuff. Maybe > worth benchmarking! And, because apparently my Saturday nights are slow: (defun test-buffer (f) (let ((counts (make-hash-table :test #'equal)) pnt) (with-temp-buffer (insert-file-contents f) (goto-char (point-min)) (forward-to-word 1) (setq pnt (point)) (while (and (null (eobp)) (forward-word)) (cl-incf (gethash (downcase (buffer-substring pnt (point))) counts 0)) (forward-to-word 1) (setq pnt (point)))) counts)) Seems to go pretty quick on my test file, though it's only 220K. ^ permalink raw reply [flat|nested] 47+ messages in thread
[parent not found: <mailman.4008.1532232144.1292.help-gnu-emacs@gnu.org>]
* Re: Most used words in current buffer [not found] ` <mailman.4008.1532232144.1292.help-gnu-emacs@gnu.org> @ 2018-07-22 18:28 ` Udyant Wig 2018-07-22 20:05 ` Eric Abrahamsen 0 siblings, 1 reply; 47+ messages in thread From: Udyant Wig @ 2018-07-22 18:28 UTC (permalink / raw) To: help-gnu-emacs On 07/22/2018 09:30 AM, Eric Abrahamsen wrote: > Actually I think alternating `forward-word' with `forward-to-word' > might do the exact same thing as alternating (skip-syntax-forward "w") > with (skip-syntax-forward "^w"), and might get you some > extra... stuff. Maybe worth benchmarking! I did. I had to load the `misc' package to use `forward-to-word'. A ten run timing session gave an average of 4.89 seconds on the 4.5 MB text file: more than twice as much time as the version of the code using `skip-syntax-forward'. Udyant Wig -- We make our discoveries through our mistakes: we watch one another's success: and where there is freedom to experiment there is hope to improve. -- Arthur Quiller-Couch ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Most used words in current buffer 2018-07-22 18:28 ` Udyant Wig @ 2018-07-22 20:05 ` Eric Abrahamsen 0 siblings, 0 replies; 47+ messages in thread From: Eric Abrahamsen @ 2018-07-22 20:05 UTC (permalink / raw) To: help-gnu-emacs Udyant Wig <udyantw@gmail.com> writes: > On 07/22/2018 09:30 AM, Eric Abrahamsen wrote: >> Actually I think alternating `forward-word' with `forward-to-word' >> might do the exact same thing as alternating (skip-syntax-forward "w") >> with (skip-syntax-forward "^w"), and might get you some >> extra... stuff. Maybe worth benchmarking! > > I did. I had to load the `misc' package to use `forward-to-word'. A > ten run timing session gave an average of 4.89 seconds on the 4.5 MB > text file: more than twice as much time as the version of the code using > `skip-syntax-forward'. Wow, that's way worse than I would have guessed. `forward-word' does much more than `skip-syntax-forward', (it lets you use `find-word-boundary-function-table', for one thing) but I wouldn't have thought it does *that* much more. I don't speak C though, so... ^ permalink raw reply [flat|nested] 47+ messages in thread
[parent not found: <mailman.4007.1532231884.1292.help-gnu-emacs@gnu.org>]
* Re: Most used words in current buffer [not found] ` <mailman.4007.1532231884.1292.help-gnu-emacs@gnu.org> @ 2018-07-22 18:19 ` Udyant Wig 0 siblings, 0 replies; 47+ messages in thread From: Udyant Wig @ 2018-07-22 18:19 UTC (permalink / raw) To: help-gnu-emacs On 07/22/2018 09:27 AM, Eric Abrahamsen wrote: > As Stefan said, going character by character is going to be > slow... But my example with `forward-word' collects a lot of cruft. So > I would suggest doing what `forward-word' does internally and move by > syntax. This also opens up the possibility of tweaking the behavior > of your function (ie, what constitutes a word) by setting temporary > syntax tables. Here's a word scanner that only picks up actual words > (according to the default syntax table): > > (defun test-buffer (&optional f) > (let ((file (or f "/home/eric/org/hollowmountain.org")) > pnt lst) > (with-temp-buffer > (insert-file-contents file) > (goto-char (point-min)) > (skip-syntax-forward "^w") > (setq pnt (point)) > (while (and (null (eobp)) (skip-syntax-forward "w")) > (push (buffer-substring pnt (point)) lst) > (skip-syntax-forward "^w") > (setq pnt (point)))) > (nreverse lst))) Thank you for the idea! It did wonders for the running time, a sample of which I have put after the following adaption of your idea to the code. --- (defun buffer-most-used-words-4 (n) "Make a list of the N most used words in buffer." (let ((counts (make-hash-table :test #'equal)) sorted-counts start end) (save-excursion (goto-char (point-min)) (skip-syntax-forward "^w") (setf start (point)) (cl-loop until (eobp) do (skip-syntax-forward "w") (setf end (point)) (incf (gethash (buffer-substring start end) counts 0)) (skip-syntax-forward "^w") (setf start (point)))) (cl-loop for word being the hash-keys of counts using (hash-values count) do (push (list word count) sorted-counts) finally (setf sorted-counts (cl-sort sorted-counts #'> :key #'second))) (mapcar #'first (cl-subseq sorted-counts 0 n)))) --- Compiled, this version takes about half the time the previous version -- going character by character -- took to process a 4.5 MB text file. Average timing after ten runs on the above mentioned file: 2.75 seconds. On syntax tables, the ability to determine what is a word or other construct in a buffer could be very handy indeed. One application beyond prose text that comes to mind could be to count the most used variable or function in a file of source code. There might be others of course. Udyant Wig -- We make our discoveries through our mistakes: we watch one another's success: and where there is freedom to experiment there is hope to improve. -- Arthur Quiller-Couch ^ permalink raw reply [flat|nested] 47+ messages in thread
[parent not found: <mailman.3845.1532032966.1292.help-gnu-emacs@gnu.org>]
* Re: Most used words in current buffer [not found] ` <mailman.3845.1532032966.1292.help-gnu-emacs@gnu.org> @ 2018-07-20 13:18 ` Udyant Wig 0 siblings, 0 replies; 47+ messages in thread From: Udyant Wig @ 2018-07-20 13:18 UTC (permalink / raw) To: help-gnu-emacs On 07/20/2018 02:12 AM, Bob Proulx wrote: > I don't know if the AVL package you used was implemented in elisp or > in C or otherwise. And even though I am a long time user of emacs I > have never acquired the elisp skill to the same level as other > languages and therefore can't comment on that part. But I know that > when people have implemented such data structures in Perl that the > result has never been as fast and efficient as in a native C version. > If so then that may easily account for performance differences. And > also the native implementation of "hashes" in awk, perl, python, ruby > is quite optimized and very fast. They have had years of eyes and > tweaking upon them. The AVL tree package comes with Emacs. It bears a creation date of 10 May 1991 and copyright dates of 1995, and 2007-2015 on my system. Now, it could be just that I have used trees in an ignorant way, or it could be that it is as fast as it can go. I hope it is the former as that is something I can feasibly affect. My issue in this general problem (of counting words) was this: I have to store and operate on one kind of data during insertion (strings), and another kind of data during retrieval (numbers); the keys are different in the two cases. I could not think of a way to do this nicely with a tree. > Writing clear code that can be understood immediately by the entire > range of programmer skill is important in my not so humble opinion. > One shouldn't need to be a master experienced programmer to understand > what has been written. Therefore I usually use 'head' specifically > for the clarity of it to everyone. Seeing "head -n40" is not going to > confuse anyone. Therefore I usually use it instead of "sed 40q" even > though I could remove 'head' entirely from my system if I were to > uniformly implement one in terms of the other. Clarity is more > important. > > And before someone mentions performance let me remind that we are > talking shell scripts. In a shell script clarity is more important > than performance. Always. If the resulting shell script results in a > performance problem than choosing a better algorithm will almost > certainly be the better solution. And if not than then choosing a > different language more efficient at the task is next. These are two paragraphs worth recollecting every so often. Thank you. > I do expect some skill to be learned with 'awk' however. It is so > very useful that seeing "awk '{print$1}' should not be that confusing > that it is printing the first field column. Or that '{print$NF}' is a > common idiom for printing the last field. (NF is the Number of Fields > in the line that was split by whitespace. $NF is therefore the last > field. If NF is 5 then $NF is saying $5 and therefore always the last > field of the line.) A little bit of awk learning pays back a large > return on the investment. Indeed. > Bob Udyant Wig -- We make our discoveries through our mistakes: we watch one another's success: and where there is freedom to experiment there is hope to improve. -- Arthur Quiller-Couch ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Most used words in current buffer 2018-07-19 13:26 ` Udyant Wig 2018-07-19 20:42 ` Bob Proulx [not found] ` <mailman.3845.1532032966.1292.help-gnu-emacs@gnu.org> @ 2018-07-21 18:22 ` Stefan Monnier 2018-07-22 9:02 ` tomas [not found] ` <mailman.4013.1532250176.1292.help-gnu-emacs@gnu.org> [not found] ` <mailman.3991.1532197378.1292.help-gnu-emacs@gnu.org> 3 siblings, 2 replies; 47+ messages in thread From: Stefan Monnier @ 2018-07-21 18:22 UTC (permalink / raw) To: help-gnu-emacs > (defun buffer-most-used-words-2 (n) > "Make a list of the N most used words in buffer." > (let ((counts (avl-tree-create (lambda (wc1 wc2) > (string< (first wc1) (first wc2))))) > (words (split-string (buffer-string))) If you want to go fast, don't use split-string+buffer-string. Scan through the buffer and extract each word with buffer-substring directly. > (let ((element (avl-tree-member counts (list (downcase word) 0)))) I'd use a hash-table (implemented in C) rather than an avl-tree (implemented in Elisp). Stefan ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Most used words in current buffer 2018-07-21 18:22 ` Stefan Monnier @ 2018-07-22 9:02 ` tomas 2018-07-23 6:09 ` Bob Proulx [not found] ` <mailman.4074.1532326162.1292.help-gnu-emacs@gnu.org> [not found] ` <mailman.4013.1532250176.1292.help-gnu-emacs@gnu.org> 1 sibling, 2 replies; 47+ messages in thread From: tomas @ 2018-07-22 9:02 UTC (permalink / raw) To: help-gnu-emacs -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Sat, Jul 21, 2018 at 02:22:44PM -0400, Stefan Monnier wrote: > > (defun buffer-most-used-words-2 (n) > > "Make a list of the N most used words in buffer." > > (let ((counts (avl-tree-create (lambda (wc1 wc2) > > (string< (first wc1) (first wc2))))) > > (words (split-string (buffer-string))) > > If you want to go fast, don't use split-string+buffer-string. > Scan through the buffer and extract each word with buffer-substring directly. > > > (let ((element (avl-tree-member counts (list (downcase word) 0)))) > > I'd use a hash-table (implemented in C) rather than an avl-tree > (implemented in Elisp). Plus, a (well-implemented) hash table will always be faster (for inserts and random lookups) than a balanced (AVL, red-black) binary tree. The latter affords you sorted lookup (find first greater than, output in order). You pay for that :-) Cheers - -- t -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (GNU/Linux) iEYEARECAAYFAltUSDIACgkQBcgs9XrR2kYypwCcC7wGis1R7N6HnC5Moq0yj1Hb a7AAniF3qrn9Tu60jo8qhkQRM73KMFDS =eGJK -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Most used words in current buffer 2018-07-22 9:02 ` tomas @ 2018-07-23 6:09 ` Bob Proulx 2018-07-23 7:34 ` tomas [not found] ` <mailman.4074.1532326162.1292.help-gnu-emacs@gnu.org> 1 sibling, 1 reply; 47+ messages in thread From: Bob Proulx @ 2018-07-23 6:09 UTC (permalink / raw) To: help-gnu-emacs tomas@tuxteam.de wrote: > Stefan Monnier wrote: > > I'd use a hash-table (implemented in C) rather than an avl-tree > > (implemented in Elisp). With exactly those choices above I would agree completely. > Plus, a (well-implemented) hash table will always be faster (for inserts > and random lookups) than a balanced (AVL, red-black) binary tree. The > latter affords you sorted lookup (find first greater than, output in order). > > You pay for that :-) Again the hash table is good until you need to grow the table. Then there is a pause while memory is shuffled around. But that never happens with a balanced tree because that cost is always amortized along as you go. Also if you need to extract the entries in sorted order then the balanced tree is already sorted. Whereas with the hash table an extra sort step is needed. One of looking at this is that it is global optimization versus local optimization. In some ways it is swings and roundabouts. But in other ways it depends upon what you need. Bob ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Most used words in current buffer 2018-07-23 6:09 ` Bob Proulx @ 2018-07-23 7:34 ` tomas 0 siblings, 0 replies; 47+ messages in thread From: tomas @ 2018-07-23 7:34 UTC (permalink / raw) To: help-gnu-emacs -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Mon, Jul 23, 2018 at 12:09:15AM -0600, Bob Proulx wrote: [...] > Again the hash table is good until you need to grow the table. Then > there is a pause while memory is shuffled around. Precisely. To get this "amortized", you usually grow by some constant factor, and there's where the log n comes in (which a tree has naturally built in). For really big hash tables (think all *dbm variations which are not some cousins of B-trees), you even have several "levels", at which point the hash tables start resembling B-trees somewhat (see below). > But that never happens with a balanced tree because that cost is > always amortized along as you go. But once you get big, you start "feeling" the structure of your storage (i.e. that the "storage as a big constant-time access array" is just a lie). Having two pointers per node is horribly wasteful, and if those pointers point to different continents in your RAM, your CPU ends up waiting on cache fills all the time. > Also if you need to extract the entries in sorted order then the > balanced tree is already sorted. Whereas with the hash table an extra > sort step is needed. That's right: once you need sorted walks or even lookup by sorted index (i.e. "find the first entry whose value is greater or equal to X" -- or "find all entries whose values are between X and Y"), ordered trees are (probably) the way to go. But I'd guess, once the tree becomes significant in size (yeah, handwaving here :) some kind of B-tree, where you manage several entries in one node, should be at an advantage. > One of looking at this is that it is global optimization versus local > optimization. And the ratio of lookups per insert. And whether you want a sorted index. > In some ways it is swings and roundabouts. Nicely put :-) > But in other ways it depends upon what you need. Absolutely Cheers - -- t -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (GNU/Linux) iEYEARECAAYFAltVhQQACgkQBcgs9XrR2kYaIACfdi/XvTyfdbJ9yk4my+eJziMt yyYAnjw7NadASHDsgDrCJpWdzvDpmxDm =5dmS -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 47+ messages in thread
[parent not found: <mailman.4074.1532326162.1292.help-gnu-emacs@gnu.org>]
* Re: Most used words in current buffer [not found] ` <mailman.4074.1532326162.1292.help-gnu-emacs@gnu.org> @ 2018-07-23 7:26 ` Udyant Wig 0 siblings, 0 replies; 47+ messages in thread From: Udyant Wig @ 2018-07-23 7:26 UTC (permalink / raw) To: help-gnu-emacs On 07/23/2018 11:39 AM, Bob Proulx wrote: > tomas@tuxteam.de wrote: >> Stefan Monnier wrote: >>> I'd use a hash-table (implemented in C) rather than an avl-tree >>> (implemented in Elisp). > > With exactly those choices above I would agree completely. Would you choose differently if Emacs had a native code compiler besides a byte code one? >> Plus, a (well-implemented) hash table will always be faster (for >> inserts and random lookups) than a balanced (AVL, red-black) binary >> tree. The latter affords you sorted lookup (find first greater than, >> output in order). >> >> You pay for that :-) > > Again the hash table is good until you need to grow the table. Then > there is a pause while memory is shuffled around. But that never > happens with a balanced tree because that cost is always amortized > along as you go. Yes. > Also if you need to extract the entries in sorted order then the > balanced tree is already sorted. Whereas with the hash table an extra > sort step is needed. It just occured to me that when I obtain a list of the elements of the AVL tree solution, I get a sorted list, though in ascending order. After this, I sort to get the list in descending order. I could have avoided this silly sorting and simply reversed the list. > One of looking at this is that it is global optimization versus local > optimization. > > In some ways it is swings and roundabouts. But in other ways it > depends upon what you need. Which might entail more swings and roundabouts. > Bob Udyant Wig -- We make our discoveries through our mistakes: we watch one another's success: and where there is freedom to experiment there is hope to improve. -- Arthur Quiller-Couch ^ permalink raw reply [flat|nested] 47+ messages in thread
[parent not found: <mailman.4013.1532250176.1292.help-gnu-emacs@gnu.org>]
* Re: Most used words in current buffer [not found] ` <mailman.4013.1532250176.1292.help-gnu-emacs@gnu.org> @ 2018-07-22 18:58 ` Udyant Wig 0 siblings, 0 replies; 47+ messages in thread From: Udyant Wig @ 2018-07-22 18:58 UTC (permalink / raw) To: help-gnu-emacs On 07/22/2018 02:32 PM, tomas@tuxteam.de wrote: > Plus, a (well-implemented) hash table will always be faster (for > inserts and random lookups) than a balanced (AVL, red-black) binary > tree. The latter affords you sorted lookup (find first greater than, > output in order). > > You pay for that :-) As I have found to be the case. I have been unable to put the AVL tree package to any heavy or serious use yet, but I did use it in a simple way to keep track of some books. I found `avl-tree-flatten' to be valuable. Also of note, I think, is `avl-tree-stack', which allows treatment as a sorted stack of the tree elements. > Cheers > -- t Udyant Wig -- We make our discoveries through our mistakes: we watch one another's success: and where there is freedom to experiment there is hope to improve. -- Arthur Quiller-Couch ^ permalink raw reply [flat|nested] 47+ messages in thread
[parent not found: <mailman.3991.1532197378.1292.help-gnu-emacs@gnu.org>]
* Re: Most used words in current buffer [not found] ` <mailman.3991.1532197378.1292.help-gnu-emacs@gnu.org> @ 2018-07-21 19:39 ` Udyant Wig 2018-07-21 20:54 ` Stefan Monnier [not found] ` <mailman.3995.1532206511.1292.help-gnu-emacs@gnu.org> 0 siblings, 2 replies; 47+ messages in thread From: Udyant Wig @ 2018-07-21 19:39 UTC (permalink / raw) To: help-gnu-emacs On 07/21/2018 11:52 PM, Stefan Monnier wrote: >> (defun buffer-most-used-words-2 (n) >> "Make a list of the N most used words in buffer." >> (let ((counts (avl-tree-create (lambda (wc1 wc2) >> (string< (first wc1) (first wc2))))) >> (words (split-string (buffer-string))) > > If you want to go fast, don't use split-string+buffer-string. Scan > through the buffer and extract each word with buffer-substring > directly. > >> (let ((element (avl-tree-member counts (list (downcase word) >>0)))) > > I'd use a hash-table (implemented in C) rather than an avl-tree > (implemented in Elisp). After spending (too) many hours on this, I believe that I have a better solution. --- (require 'cl-lib) ;; Can this hack be made better? (defun whitespace-p (char) (or (eq char 9) (eq char 10) (eq char 13)) (eq char 32)) (defun buffer-most-used-words-3 (n) "Make a list of the N most used words in buffer." (let ((counts (make-hash-table :test #'equal)) sorted-counts) (save-excursion (goto-char (point-min)) (cl-loop with word = nil with start = 0 with end = 0 with state = 'space with char = nil until (eobp) do (setf char (char-after)) (cond ((eq state 'space) (when (not (whitespace-p char)) (setf start (point) state 'word))) ((eq state 'word) (when (whitespace-p char) (setf end (point) state 'space word (buffer-substring start end)) (incf (gethash word counts 0))))) (forward-char))) (cl-loop for word being the hash-keys of counts using (hash-values count) do (push (list word count) sorted-counts) finally (setf sorted-counts (cl-sort sorted-counts #'> :key #'second))) (mapcar #'first (cl-subseq sorted-counts 0 n)))) --- In regard to performance, it is slightly better than BUFFER-MOST-USED-WORDS-1, which used a combination of SPLIT-STRING on BUFFER-STRING along with a hash-table. Here are timings over ten runs each for a 4.5 MB text file: buffer-most-used-words-1: 4.7362510517 seconds buffer-most-used-words-3: 4.4849896529 seconds > Stefan Udyant Wig -- We make our discoveries through our mistakes: we watch one another's success: and where there is freedom to experiment there is hope to improve. -- Arthur Quiller-Couch ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Most used words in current buffer 2018-07-21 19:39 ` Udyant Wig @ 2018-07-21 20:54 ` Stefan Monnier [not found] ` <mailman.3995.1532206511.1292.help-gnu-emacs@gnu.org> 1 sibling, 0 replies; 47+ messages in thread From: Stefan Monnier @ 2018-07-21 20:54 UTC (permalink / raw) To: help-gnu-emacs > (or (eq char 9) (eq char 10) (eq char 13)) (eq char 32)) Aka (memq char '(?\t ?\n ?\r ?\s)) > (cl-loop with word = nil [...] > (forward-char))) This time you do the word-parsing char-by-char in Elisp. I'd expect you'll get better performance with things like `forward-word` or (re-search-forward "\\<.*?\\>" nil t) so that the Elisp loop operates word-by-word rather than having to look at each and every character. Stefan ^ permalink raw reply [flat|nested] 47+ messages in thread
[parent not found: <mailman.3995.1532206511.1292.help-gnu-emacs@gnu.org>]
* Re: Most used words in current buffer [not found] ` <mailman.3995.1532206511.1292.help-gnu-emacs@gnu.org> @ 2018-07-22 18:43 ` Udyant Wig 0 siblings, 0 replies; 47+ messages in thread From: Udyant Wig @ 2018-07-22 18:43 UTC (permalink / raw) To: help-gnu-emacs On 07/22/2018 02:24 AM, Stefan Monnier wrote: >> (or (eq char 9) (eq char 10) (eq char 13)) (eq char 32)) > > Aka (memq char '(?\t ?\n ?\r ?\s)) Indeed. Not only does this form avoid the magic numbers, increase readability, and abstract the representation, it fixes the error I made in my code. It turns out that my code worked by fluke: checking spaces. I should try not to post late at night. >> (cl-loop with word = nil > [...] >> (forward-char))) > > This time you do the word-parsing char-by-char in Elisp. I'd expect > you'll get better performance with things like `forward-word` or > (re-search-forward "\\<.*?\\>" nil t) so that the Elisp loop operates > word-by-word rather than having to look at each and every character. I read your message after Eric Abrahamsen's, which echoed your point. Taking the advice paid off handsomely! Thank you. I posted `buffer-most-used-words-4' and an indication of its performance in a follow up to Eric Abrahamsen's message. > Stefan Udyant Wig -- We make our discoveries through our mistakes: we watch one another's success: and where there is freedom to experiment there is hope to improve. -- Arthur Quiller-Couch ^ permalink raw reply [flat|nested] 47+ messages in thread
end of thread, other threads:[~2018-07-23 7:34 UTC | newest] Thread overview: 47+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2018-07-17 9:28 Most used words in current buffer Udyant Wig 2018-07-17 18:41 ` Emanuel Berg 2018-07-18 9:36 ` Udyant Wig 2018-07-18 11:48 ` Emanuel Berg 2018-07-18 14:50 ` Udyant Wig 2018-07-18 16:32 ` Emanuel Berg 2018-07-18 22:39 ` Ben Bacarisse 2018-07-19 0:45 ` Bob Proulx [not found] ` <mailman.3785.1531961144.1292.help-gnu-emacs@gnu.org> 2018-07-19 5:33 ` Udyant Wig 2018-07-19 7:04 ` Bob Proulx 2018-07-19 7:25 ` tomas 2018-07-19 17:19 ` Nick Dokos 2018-07-19 17:30 ` Eli Zaretskii 2018-07-19 20:08 ` Bob Proulx 2018-07-20 16:39 ` Nick Dokos [not found] ` <mailman.3909.1532104802.1292.help-gnu-emacs@gnu.org> 2018-07-20 18:13 ` Udyant Wig 2018-07-20 22:24 ` Bob Newell 2018-07-21 0:00 ` Nick Dokos 2018-07-21 0:18 ` Nick Dokos [not found] ` <mailman.3843.1532030947.1292.help-gnu-emacs@gnu.org> 2018-07-20 6:19 ` Udyant Wig 2018-07-20 23:25 ` Bob Proulx 2018-07-21 0:26 ` Nick Dokos 2018-07-21 4:03 ` Bob Proulx [not found] ` <mailman.3934.1532129163.1292.help-gnu-emacs@gnu.org> 2018-07-21 13:39 ` Udyant Wig [not found] ` <mailman.3826.1532020800.1292.help-gnu-emacs@gnu.org> 2018-07-20 5:52 ` Udyant Wig [not found] ` <mailman.3796.1531983885.1292.help-gnu-emacs@gnu.org> 2018-07-19 13:26 ` Udyant Wig 2018-07-19 20:42 ` Bob Proulx 2018-07-20 3:08 ` Bob Newell [not found] ` <mailman.3861.1532056120.1292.help-gnu-emacs@gnu.org> 2018-07-21 12:51 ` Udyant Wig 2018-07-21 16:15 ` Eric Abrahamsen [not found] ` <mailman.3982.1532189751.1292.help-gnu-emacs@gnu.org> 2018-07-21 19:46 ` Udyant Wig 2018-07-22 3:57 ` Eric Abrahamsen 2018-07-22 4:00 ` Eric Abrahamsen 2018-07-22 4:05 ` Eric Abrahamsen [not found] ` <mailman.4008.1532232144.1292.help-gnu-emacs@gnu.org> 2018-07-22 18:28 ` Udyant Wig 2018-07-22 20:05 ` Eric Abrahamsen [not found] ` <mailman.4007.1532231884.1292.help-gnu-emacs@gnu.org> 2018-07-22 18:19 ` Udyant Wig [not found] ` <mailman.3845.1532032966.1292.help-gnu-emacs@gnu.org> 2018-07-20 13:18 ` Udyant Wig 2018-07-21 18:22 ` Stefan Monnier 2018-07-22 9:02 ` tomas 2018-07-23 6:09 ` Bob Proulx 2018-07-23 7:34 ` tomas [not found] ` <mailman.4074.1532326162.1292.help-gnu-emacs@gnu.org> 2018-07-23 7:26 ` Udyant Wig [not found] ` <mailman.4013.1532250176.1292.help-gnu-emacs@gnu.org> 2018-07-22 18:58 ` Udyant Wig [not found] ` <mailman.3991.1532197378.1292.help-gnu-emacs@gnu.org> 2018-07-21 19:39 ` Udyant Wig 2018-07-21 20:54 ` Stefan Monnier [not found] ` <mailman.3995.1532206511.1292.help-gnu-emacs@gnu.org> 2018-07-22 18:43 ` Udyant Wig
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).