Most used words in current buffer

unofficial mirror of help-gnu-emacs@gnu.org
 help / color / mirror / Atom feed

* Most used words in current buffer
@ 2018-07-17  9:28 Udyant Wig
  2018-07-17 18:41 ` Emanuel Berg
  0 siblings, 1 reply; 47+ messages in thread
From: Udyant Wig @ 2018-07-17  9:28 UTC (permalink / raw)
  To: help-gnu-emacs

How might I obtain a list of the most frequently used words in the
current buffer?

I have an inkling of an Emacs Lisp solution, but I thought to ask here
before writing it.

Udyant Wig
-- 
We make our discoveries through our mistakes: we watch one another's
success: and where there is freedom to experiment there is hope to
improve.
                                -- Arthur Quiller-Couch


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Most used words in current buffer
  2018-07-17  9:28 Most used words in current buffer Udyant Wig
@ 2018-07-17 18:41 ` Emanuel Berg
  2018-07-18  9:36   ` Udyant Wig
  0 siblings, 1 reply; 47+ messages in thread
From: Emanuel Berg @ 2018-07-17 18:41 UTC (permalink / raw)
  To: help-gnu-emacs

Udyant Wig wrote:

> How might I obtain a list of the most
> frequently used words in the current buffer?
>
> I have an inkling of an Emacs Lisp solution,
> but I thought to ask here before writing it.

Do it!

But if you can let go of the Elisp requirement
here are some examples how to do it with
everyday GNU/Unix tools:

    https://unix.stackexchange.com/questions/41479/find-n-most-frequent-words-in-a-file

-- 
underground experts united
http://user.it.uu.se/~embe8573


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Most used words in current buffer
  2018-07-17 18:41 ` Emanuel Berg
@ 2018-07-18  9:36   ` Udyant Wig
  2018-07-18 11:48     ` Emanuel Berg
  2018-07-18 22:39     ` Ben Bacarisse
  0 siblings, 2 replies; 47+ messages in thread
From: Udyant Wig @ 2018-07-18  9:36 UTC (permalink / raw)
  To: help-gnu-emacs

On 07/18/2018 12:11 AM, Emanuel Berg wrote:
> Do it!
>
> But if you can let go of the Elisp requirement here are some examples
> how to do it with everyday GNU/Unix tools:
>
>
https://unix.stackexchange.com/questions/41479/find-n-most-frequent-words-in-a-file

I went ahead and did it.  I obtained many solutions, in fact.  Only
today did I check the link above.

First, of the solutions in Emacs Lisp, this one came out as the
quickest:

---
(defun buffer-most-used-words-1 (n)
  "Make a list of the N most used words in buffer."
  (let ((counts (make-hash-table :test #'equal))
	(words (split-string (buffer-string)))
	sorted-counts)
    (dolist (word words)
      (let ((count (gethash (downcase word) counts 0)))
	(puthash (downcase word) (1+ count) counts)))
    (loop for word being the hash-keys of counts
       using (hash-values count)
       do
	 (push (list word count) sorted-counts)
       finally (setf sorted-counts (cl-sort sorted-counts #'>
					    :key #'second)))
    (mapcar #'first (cl-subseq sorted-counts 0 n))))
---

Briefly, it obtains a list of the strings in the buffer, hashes them,
puts the words and their counts in a list, sorts it, and lists the first
N words.  (I had also written solutions (1) using alists; (2) using the
handy AVL tree library I found among the Emacs Lisp files in the Emacs
distribution; and (3) reading the words directly and hashing them.  None
beat the above.)

The function is suffixed with '-1' because it is the the core of
another, interactive function, which takes the above generated list and
displays it nicely in another buffer.

I was curious about possible solutions in other languages.  I wrote
programs in both Common Lisp and Python, based on the essential hash
table approach.  While a lot faster than the Emacs Lisp solution above,
they were left behind by this old Awk solution (also using hashing) I
found in the classic /The Unix Programming Environment/ by Kernighan and
Pike:

---
#!/bin/sh

awk '    { for (i = 1; i <= NF; i++) num[$i]++ }
END      { for (word in num) print word, num[word] }
' $* | sort +1 -nr | head -10 | awk '{ print $1 }'
---

I appended the last awk pipeline to only give the words without the
counts.  I wrapped it up in an Emacs command to display the words in
another buffer, just like my original Emacs Lisp solution above.

Udyant Wig
-- 
We make our discoveries through our mistakes: we watch one another's
success: and where there is freedom to experiment there is hope to
improve.
                                -- Arthur Quiller-Couch

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Most used words in current buffer
  2018-07-18  9:36   ` Udyant Wig
@ 2018-07-18 11:48     ` Emanuel Berg
  2018-07-18 14:50       ` Udyant Wig
  2018-07-18 22:39     ` Ben Bacarisse
  1 sibling, 1 reply; 47+ messages in thread
From: Emanuel Berg @ 2018-07-18 11:48 UTC (permalink / raw)
  To: help-gnu-emacs

Udyant Wig wrote:

> (defun buffer-most-used-words-1 (n)
>   "Make a list of the N most used words in buffer."
>   (let ((counts (make-hash-table :test #'equal))
> 	(words (split-string (buffer-string)))
> 	sorted-counts)
>     (dolist (word words)
>       (let ((count (gethash (downcase word) counts 0)))
> 	(puthash (downcase word) (1+ count) counts)))
>     (loop for word being the hash-keys of counts
>        using (hash-values count)
>        do
> 	 (push (list word count) sorted-counts)
>        finally (setf sorted-counts (cl-sort sorted-counts #'>
> 					    :key #'second)))
>     (mapcar #'first (cl-subseq sorted-counts 0 n))))

Great! Only

    (require 'cl-lib)
    loop -> cl-loop
    first -> cl-first
    second -> cl-second

Perhaps this should be published in M/ELPA or
at the very least on gnu.emacs.sources :)

-- 
underground experts united
http://user.it.uu.se/~embe8573


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Most used words in current buffer
  2018-07-18 11:48     ` Emanuel Berg
@ 2018-07-18 14:50       ` Udyant Wig
  2018-07-18 16:32         ` Emanuel Berg
  0 siblings, 1 reply; 47+ messages in thread
From: Udyant Wig @ 2018-07-18 14:50 UTC (permalink / raw)
  To: help-gnu-emacs

On 07/18/2018 05:18 PM, Emanuel Berg wrote:
> Great! Only
>
>     (require 'cl-lib)
>     loop -> cl-loop
>     first -> cl-first
>     second -> cl-second

Ah, yes.  Thank you.  I have included the fixes in the following.

---
(require 'cl-lib)

(defun buffer-most-used-words-1 (n)
  "Make a list of the N most used words in buffer."
  (let ((counts (make-hash-table :test #'equal))
	(words (split-string (buffer-string)))
	sorted-counts)
    (dolist (word words)
      (let ((count (gethash (downcase word) counts 0)))
	(puthash (downcase word) (1+ count) counts)))
    (cl-loop for word being the hash-keys of counts
       using (hash-values count)
       do
	 (push (list word count) sorted-counts)
       finally (setf sorted-counts (cl-sort sorted-counts #'>
					    :key #'cl-second)))
    (mapcar #'cl-first (cl-subseq sorted-counts 0 n))))
---

> Perhaps this should be published in M/ELPA or at the very least on
> gnu.emacs.sources :)

Maybe.  I hope this is useful to others also.

Udyant Wig
-- 
We make our discoveries through our mistakes: we watch one another's
success: and where there is freedom to experiment there is hope to
improve.
                                -- Arthur Quiller-Couch



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Most used words in current buffer
  2018-07-18 14:50       ` Udyant Wig
@ 2018-07-18 16:32         ` Emanuel Berg
  0 siblings, 0 replies; 47+ messages in thread
From: Emanuel Berg @ 2018-07-18 16:32 UTC (permalink / raw)
  To: help-gnu-emacs

Udyant Wig wrote:

> I hope this is useful to others also.

Definitely!

Be sure to login to the SX site and post the
"Elisp solution" to the problem as well! Or if
someone else who is frequenting that site feels
like doing it for you/us.

I don't see why Emacs is any less a GNU/Unix
tool than say sed or (g)awk! Because actually
it is a much better tool :)

-- 
underground experts united
http://user.it.uu.se/~embe8573

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Most used words in current buffer
  2018-07-18  9:36   ` Udyant Wig
  2018-07-18 11:48     ` Emanuel Berg
@ 2018-07-18 22:39     ` Ben Bacarisse
  2018-07-19  0:45       ` Bob Proulx
       [not found]       ` <mailman.3785.1531961144.1292.help-gnu-emacs@gnu.org>
  1 sibling, 2 replies; 47+ messages in thread
From: Ben Bacarisse @ 2018-07-18 22:39 UTC (permalink / raw)
  To: help-gnu-emacs

Udyant Wig <udyantw@gmail.com> writes:
<snip>
> they were left behind by this old Awk solution (also using hashing) I
> found in the classic /The Unix Programming Environment/ by Kernighan and
> Pike:
>
> ---
> #!/bin/sh
>
> awk '    { for (i = 1; i <= NF; i++) num[$i]++ }
> END      { for (word in num) print word, num[word] }
> ' $* | sort +1 -nr | head -10 | awk '{ print $1 }'
> ---
>
> I appended the last awk pipeline to only give the words without the
> counts.

The Unix command cut does this task.  Nothing wrong with using another
awk, but I often feel sorry for poor old cut.  It's been around for
decades, and yet is so very often overlooked!  Mind you, it uses TABs to
delimit fields by default, so maybe it only has itself to blame.

-- 
Ben.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Most used words in current buffer
  2018-07-18 22:39     ` Ben Bacarisse
@ 2018-07-19  0:45       ` Bob Proulx
       [not found]       ` <mailman.3785.1531961144.1292.help-gnu-emacs@gnu.org>
  1 sibling, 0 replies; 47+ messages in thread
From: Bob Proulx @ 2018-07-19  0:45 UTC (permalink / raw)
  To: help-gnu-emacs

Ben Bacarisse wrote:
> Udyant Wig writes:
> > they were left behind by this old Awk solution (also using hashing) I

Not wanting to be too annoying but I see no hashing in the awk
solution.  It is using an awk associative array to store the words.
Perl and Pything call those "hashes" but they are just associative
arrays.

> > found in the classic /The Unix Programming Environment/ by Kernighan and
> > Pike:
> >...
> > awk '    { for (i = 1; i <= NF; i++) num[$i]++ }
> > END      { for (word in num) print word, num[word] }
> > ' $* | sort +1 -nr | head -10 | awk '{ print $1 }'
> >
> > I appended the last awk pipeline to only give the words without the
> > counts.
>
> The Unix command cut does this task.  Nothing wrong with using another
> awk, but I often feel sorry for poor old cut.  It's been around for
> decades, and yet is so very often overlooked!  Mind you, it uses TABs to
> delimit fields by default, so maybe it only has itself to blame.

I will continue to be contrary here and say that awk does a much
better job of cutting by whitespace separated fields than does cut.
Both are standard and should be available everywhere.  And here
because awk is already in use I expect it to be somewhat more
efficient to use awk again in the pipeline than to use a different
program.

I also wish to improve the command line somewhat.  Using $* by itself
does not sufficiently quote program arguments with whitespace.  One
should use "$@" for that purpose.  Also the old forms of sort and head
would be better left behind and use the new portable option set
for them instead.  Let me suggest:

  ' "$@" | sort -k2,2nr | head -n10 | awk '{ print $1 }'

Bob

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Most used words in current buffer
       [not found]       ` <mailman.3785.1531961144.1292.help-gnu-emacs@gnu.org>
@ 2018-07-19  5:33         ` Udyant Wig
  2018-07-19  7:04           ` Bob Proulx
       [not found]           ` <mailman.3796.1531983885.1292.help-gnu-emacs@gnu.org>
  0 siblings, 2 replies; 47+ messages in thread
From: Udyant Wig @ 2018-07-19  5:33 UTC (permalink / raw)
  To: help-gnu-emacs

On 07/19/2018 06:15 AM, Bob Proulx wrote:
> Not wanting to be too annoying but I see no hashing in the awk
> solution.  It is using an awk associative array to store the words.
> Perl and Pything call those "hashes" but they are just associative
> arrays.

I understand that associative arrays in awk are built upon hashing.
Kernighan and Pike say

  The implementation of associative memory uses a hashing scheme to
  ensure that access to any element takes about the same time as to any
  other, and that (at least for moderate array sizes) the time doesn't
  depend on how many elements are in the array.

However, on the previous page, in introducing the language construct,
they do take the name _associative array_.

> I will continue to be contrary here and say that awk does a much
> better job of cutting by whitespace separated fields than does cut.
> Both are standard and should be available everywhere.  And here
> because awk is already in use I expect it to be somewhat more
> efficient to use awk again in the pipeline than to use a different
> program.
>
> I also wish to improve the command line somewhat.  Using $* by itself
> does not sufficiently quote program arguments with whitespace.  One
> should use "$@" for that purpose.  Also the old forms of sort and head
> would be better left behind and use the new portable option set for
> them instead.  Let me suggest:
>
>   ' "$@" | sort -k2,2nr | head -n10 | awk '{ print $1 }'
>
> Bob

Thank you for the portable pipeline.  It is interesting to compare it
with the pipeline given in the book:

  $ wordfreq ch4.* | sort +1 -nr | sed 20q | 4

where

  wordfreq is the awk script proper,
  4 is a shell script that prints its input in 4 columns,
  and sed 20q does the equivalent of head -20.

On the last point, they say that given the ease of typing a sed command,
they felt no need to write the program head.

Udyant Wig
-- 
We make our discoveries through our mistakes: we watch one another's
success: and where there is freedom to experiment there is hope to
improve.
                                -- Arthur Quiller-Couch

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Most used words in current buffer
  2018-07-19  5:33         ` Udyant Wig
@ 2018-07-19  7:04           ` Bob Proulx
  2018-07-19  7:25             ` tomas
                               ` (2 more replies)
       [not found]           ` <mailman.3796.1531983885.1292.help-gnu-emacs@gnu.org>
  1 sibling, 3 replies; 47+ messages in thread
From: Bob Proulx @ 2018-07-19  7:04 UTC (permalink / raw)
  To: help-gnu-emacs

Udyant Wig wrote:
> Bob Proulx wrote:
> > Not wanting to be too annoying but I see no hashing in the awk
> > solution.  It is using an awk associative array to store the words.
> > Perl and Pything call those "hashes" but they are just associative
> > arrays.
> 
> I understand that associative arrays in awk are built upon hashing.
> Kernighan and Pike say

Hmm...  I think looking behind the abstract data type at how it might
or might not be implemented is a stretch to say the least.  The entire
reason it is an abstract data type is to hide those types of
implementation details.  :-)

> However, on the previous page, in introducing the language construct,
> they do take the name _associative array_.

The naming of things usually says more about the person that named
the thing than the thing itself.  Associative arrays is a naming that
reflects the concepts involved in what it does.  This is the same as
when someone else names it a map table, or a dictionary.  Those are
all the same thing.  Just using different names because people took
different paths to get there.

>   The implementation of associative memory uses a hashing scheme to
>   ensure that access to any element takes about the same time as to any
>   other, and that (at least for moderate array sizes) the time doesn't
>   depend on how many elements are in the array.

Hash tables are attractive because in their perfect form they are O(1)
order of the function being a fixed amount for any lookup.  Being a
fixed amount sounds very good and therefore authors often implement
using hash tables.  However achieving that perfect O(1) performance
requires knowing the size of the needed hash table at creation time so
that it can be sized correctly from the start and that is not the case
here.  Therefore when implemented as a hash table the table will need
to be grown at various intervals as the number of entries contained is
increased.  That growth is lumped and consumes time and resources for
the growth at the time the table is grown.

For such things I generally prefer balanced tree structures because
work is amortized instead of lumped.  But the important point here is
that for every algorithm + data structure there is a trade-off of some
sort between one thing and another thing.

It was in the Perl community that these became known as "hashes" due
to the implementation.  But I think it is better to keep the ADT
(abstract data type) abstract.

> >   ' "$@" | sort -k2,2nr | head -n10 | awk '{ print $1 }'
> 
> Thank you for the portable pipeline.  It is interesting to compare it
> with the pipeline given in the book:
> 
>   $ wordfreq ch4.* | sort +1 -nr | sed 20q | 4
> 
> where
> 
>   wordfreq is the awk script proper,
>   4 is a shell script that prints its input in 4 columns,
>   and sed 20q does the equivalent of head -20.
> 
> On the last point, they say that given the ease of typing a sed command,
> they felt no need to write the program head.

I am in total agreement over using sed instead of head if you want to
do that.  Seeing 'sed 20q' should roll off the keyboard as print lines
until line 20 and then quit.  Very simple and to the point.  There is
definitely no need for a separate head command.  Other than for
symmetry with tail which is not as simple in sed.

I didn't go look up the reference in the book before posting my
first reply.  I probably should have refreshed my knowledge of it.
Then I would have realized that we weren't talking about literal
hashing of tokens (like what is done in SpamAssassin's Bayes engine
for one example) but rather use of "hash" as a table lookup method.
That caused me to say that first.  Sorry.

As to the old style command line options for head and sort I was just
wanting to keep in people's minds that since a decade ago or so the
new format of the command line options is now the preferred one.  That
old style is certainly how we wrote those things back in the day
though.

Bob

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Most used words in current buffer
  2018-07-19  7:04           ` Bob Proulx
@ 2018-07-19  7:25             ` tomas
  2018-07-19 17:19             ` Nick Dokos
       [not found]             ` <mailman.3826.1532020800.1292.help-gnu-emacs@gnu.org>
  2 siblings, 0 replies; 47+ messages in thread
From: tomas @ 2018-07-19  7:25 UTC (permalink / raw)
  To: help-gnu-emacs

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Thu, Jul 19, 2018 at 01:04:38AM -0600, Bob Proulx wrote:

[...]

> For such things I generally prefer balanced tree structures because
> work is amortized instead of lumped.  But the important point here is
> that for every algorithm + data structure there is a trade-off of some
> sort between one thing and another thing.

It all depends, of course...

> It was in the Perl community that these became known as "hashes" due
> to the implementation.  But I think it is better to keep the ADT
> (abstract data type) abstract.

Ain't (human) language beautiful? Like "crane" (the lifting device),
which is a metaphor by analogy to the bird (and which is preserved
across other West European languages with quite different names for
both!). Or like those brand-things (Scotch/Sello/Tesa for adhesive
tape, for example), which would be the same pattern you decry above
(implementation detail as a substitute for the principle, or pars
pro toto).

No wonder that Perl is a linguist's child :-)

You are right: it *is* important to distinguish those layers, as a
mathematician/technician. But then you are wrong [1] too: we are
humans and our languages are exquisitely messy.

Cheers

[1] Sorry for the extremely shortened form: it is meant as an
   expression of subjective feeling. I don't think I am in the
   real position to decide whether you are right or wrong. That
   would be presumptuous indeed.

- -- tomás
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)

iEYEARECAAYFAltQPNwACgkQBcgs9XrR2kYfhQCePvfIuPJTJYaIkM+5964SZBGI
jjwAnjtxsz1GLy1MjcrD8EetvqpTY4L/
=y+xj
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Most used words in current buffer
       [not found]           ` <mailman.3796.1531983885.1292.help-gnu-emacs@gnu.org>
@ 2018-07-19 13:26             ` Udyant Wig
  2018-07-19 20:42               ` Bob Proulx
                                 ` (3 more replies)
  0 siblings, 4 replies; 47+ messages in thread
From: Udyant Wig @ 2018-07-19 13:26 UTC (permalink / raw)
  To: help-gnu-emacs

On 07/19/2018 12:34 PM, Bob Proulx wrote:
> Hmm...  I think looking behind the abstract data type at how it might
> or might not be implemented is a stretch to say the least.  The entire
> reason it is an abstract data type is to hide those types of
> implementation details.  :-)

Indeed.  The principle generalizes to abstract functionality as well,
doesn't it?  E.g. qsort in the C library may or may not be an
implementation of quick sort; or closer to the topic of the newsgroup,
one ought not to care about the actual algorithm of the SORT function in
Emacs.

Or, to illustrate the point in another way, I like that I can operate on
sequences in Emacs Lisp, rather than have to commit to either lists or
vectors right at the start.

> The naming of things usually says more about the person that named the
> thing than the thing itself.  Associative arrays is a naming that
> reflects the concepts involved in what it does.  This is the same as
> when someone else names it a map table, or a dictionary.  Those are
> all the same thing.  Just using different names because people took
> different paths to get there.

Yes.  Just as, arguably, vectors are a special case of the general
concept of arrays, though the terms are commonly used to name the same
thing.

> Hash tables are attractive because in their perfect form they are O(1)
> order of the function being a fixed amount for any lookup.  Being a
> fixed amount sounds very good and therefore authors often implement
> using hash tables.  However achieving that perfect O(1) performance
> requires knowing the size of the needed hash table at creation time so
> that it can be sized correctly from the start and that is not the case
> here.  Therefore when implemented as a hash table the table will need
> to be grown at various intervals as the number of entries contained is
> increased.  That growth is lumped and consumes time and resources for
> the growth at the time the table is grown.
>
> For such things I generally prefer balanced tree structures because
> work is amortized instead of lumped.  But the important point here is
> that for every algorithm + data structure there is a trade-off of some
> sort between one thing and another thing.

Hmm.  I had written a tree version of the word counter I had mentioned
before.  I had stumbled upon the AVL tree package in Emacs and thought I
might try using it.  This tree-based attempt turned out to be slower
than my straightforward hashing solution.

I have no doubts this code could be written better by someone more
experienced than I.

---
(require 'cl-lib)
(require 'avl-tree)

(defun buffer-most-used-words-2 (n)
  "Make a list of the N most used words in buffer."
  (let ((counts (avl-tree-create (lambda (wc1 wc2)
				   (string< (first wc1) (first wc2)))))
	(words (split-string (buffer-string)))
	sorted-counts)
    (dolist (word words)
      (let ((element (avl-tree-member counts (list (downcase word) 0))))
	(if element
	    (progn
	      (avl-tree-delete counts element)
	      (avl-tree-enter counts (list (downcase word)
					   (1+ (second element)))))
	    (avl-tree-enter counts (list (downcase word) 1)))))
    (setf sorted-counts (cl-sort (avl-tree-flatten counts) #'>
				 :key #'second))
    (mapcar #'first (cl-subseq sorted-counts 0 n))))
---

> I am in total agreement over using sed instead of head if you want to
> do that.  Seeing 'sed 20q' should roll off the keyboard as print lines
> until line 20 and then quit.  Very simple and to the point.  There is
> definitely no need for a separate head command.  Other than for
> symmetry with tail which is not as simple in sed.

I see that.  You could implement head on top of sed if you wanted to.  I
myself have been using head for long enough for its stated purpose that
grasping a sed equivalent was not immediately obvious.

> I didn't go look up the reference in the book before posting my first
> reply.  I probably should have refreshed my knowledge of it.  Then I
> would have realized that we weren't talking about literal hashing of
> tokens (like what is done in SpamAssassin's Bayes engine for one
> example) but rather use of "hash" as a table lookup method.  That
> caused me to say that first.  Sorry.

No, I made the original mistake.  I should have made it clear that I was
speaking of hash tables.  The general word 'hashing' was misleading.
Mea culpa.

> As to the old style command line options for head and sort I was just
> wanting to keep in people's minds that since a decade ago or so the
> new format of the command line options is now the preferred one.  That
> old style is certainly how we wrote those things back in the day
> though.

These things do take time to gain currency, don't they?  Under Linux,
for example, the ip set of commands has been named the successor to
ifconfig, and it too is taking time to diffuse into general knowledge.

(And, although there have been a number of revisions of Standard C since
1989/1990, a lot of projects still write to that now legacy standard.
But there may be other issues to consider here.)

> Bob

Udyant Wig
-- 
We make our discoveries through our mistakes: we watch one another's
success: and where there is freedom to experiment there is hope to
improve.
                                -- Arthur Quiller-Couch



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Most used words in current buffer
  2018-07-19  7:04           ` Bob Proulx
  2018-07-19  7:25             ` tomas
@ 2018-07-19 17:19             ` Nick Dokos
  2018-07-19 17:30               ` Eli Zaretskii
                                 ` (2 more replies)
       [not found]             ` <mailman.3826.1532020800.1292.help-gnu-emacs@gnu.org>
  2 siblings, 3 replies; 47+ messages in thread
From: Nick Dokos @ 2018-07-19 17:19 UTC (permalink / raw)
  To: help-gnu-emacs

Bob Proulx <bob@proulx.com> writes:

>
> I am in total agreement over using sed instead of head if you want to
> do that.  Seeing 'sed 20q' should roll off the keyboard as print lines
> until line 20 and then quit.  Very simple and to the point.  There is
> definitely no need for a separate head command.  Other than for
> symmetry with tail which is not as simple in sed.
>

IIRC, Kernighan & Pike say in the "Unix Programming Environment" that
there *was* a `head' program, in addition to the `tail' program. It
fell into disuse and disappeared almost immediately after sed became
available. tail is still around, I guess both because the sed
invocation for `tail -n' does not quite roll off the fingers the same
way that 20q does -- see e.g

   https://unix.stackexchange.com/questions/107387/emulate-tail-with-sed#107388
  
-- but also because `tail -f' cannot be emulated by sed, at least not
without a lot of effort and/or another program).

-- 
Nick

"There are only two hard problems in computer science: cache
invalidation, naming things, and off-by-one errors." -Martin Fowler




^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Most used words in current buffer
  2018-07-19 17:19             ` Nick Dokos
@ 2018-07-19 17:30               ` Eli Zaretskii
  2018-07-19 20:08               ` Bob Proulx
       [not found]               ` <mailman.3843.1532030947.1292.help-gnu-emacs@gnu.org>
  2 siblings, 0 replies; 47+ messages in thread
From: Eli Zaretskii @ 2018-07-19 17:30 UTC (permalink / raw)
  To: help-gnu-emacs

> From: Nick Dokos <ndokos@gmail.com>
> Date: Thu, 19 Jul 2018 13:19:39 -0400
> 
> IIRC, Kernighan & Pike say in the "Unix Programming Environment" that
> there *was* a `head' program, in addition to the `tail' program. It
> fell into disuse and disappeared almost immediately after sed became
> available.

'head' is alive and well in GNU Coreutils.  E.g., on a garden-variety
GNU/Linux system:

  ~$ head --version
  head (GNU coreutils) 8.21
  Copyright (C) 2013 Free Software Foundation, Inc.
  License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
  This is free software: you are free to change and redistribute it.
  There is NO WARRANTY, to the extent permitted by law.

  Written by David MacKenzie and Jim Meyering.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Most used words in current buffer
  2018-07-19 17:19             ` Nick Dokos
  2018-07-19 17:30               ` Eli Zaretskii
@ 2018-07-19 20:08               ` Bob Proulx
  2018-07-20 16:39                 ` Nick Dokos
       [not found]                 ` <mailman.3909.1532104802.1292.help-gnu-emacs@gnu.org>
       [not found]               ` <mailman.3843.1532030947.1292.help-gnu-emacs@gnu.org>
  2 siblings, 2 replies; 47+ messages in thread
From: Bob Proulx @ 2018-07-19 20:08 UTC (permalink / raw)
  To: help-gnu-emacs

Nick Dokos wrote:
> IIRC, Kernighan & Pike say in the "Unix Programming Environment" that
> there *was* a `head' program, in addition to the `tail' program. It
> fell into disuse and disappeared almost immediately after sed became
> available.

I think you may be thinking of the 'gres' program.  Again here I am
not going to look up the reference but instead just reply upon my
feeble human memory.  But I think you are thinking of the gres
program, global regular expression substitute, which if that route
were followed would require a lot of greX programs where X is replaced
by many specific things and was completely subsumed by 'sed'.

Also remember that at the time knowledge and daily use of ed (and qed,
ex, and the others) made using sed very easy.  Lots of shared
knowledge.  However today that sed may seem arcane to people is just
that they are no longer familiar with ed.  It no longer has that
shared learning that made sed so familiar back in the day.

Bob

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Most used words in current buffer
  2018-07-19 13:26             ` Udyant Wig
@ 2018-07-19 20:42               ` Bob Proulx
  2018-07-20  3:08                 ` Bob Newell
       [not found]                 ` <mailman.3861.1532056120.1292.help-gnu-emacs@gnu.org>
       [not found]               ` <mailman.3845.1532032966.1292.help-gnu-emacs@gnu.org>
                                 ` (2 subsequent siblings)
  3 siblings, 2 replies; 47+ messages in thread
From: Bob Proulx @ 2018-07-19 20:42 UTC (permalink / raw)
  To: help-gnu-emacs

Udyant Wig wrote:
> Bob Proulx wrote:
> > Hmm...  I think looking behind the abstract data type at how it might
> > or might not be implemented is a stretch to say the least.  The entire
> > reason it is an abstract data type is to hide those types of
> > implementation details.  :-)
> 
> Indeed.  The principle generalizes to abstract functionality as well,
> doesn't it?  E.g. qsort in the C library may or may not be an
> implementation of quick sort; or closer to the topic of the newsgroup,
> one ought not to care about the actual algorithm of the SORT function in
> Emacs.

Yes!  That is it exactly!

> > The naming of things usually says more about the person that named the
> > thing than the thing itself.  Associative arrays is a naming that
> > reflects the concepts involved in what it does.  This is the same as
> > when someone else names it a map table, or a dictionary.  Those are
> > all the same thing.  Just using different names because people took
> > different paths to get there.
> 
> Yes.  Just as, arguably, vectors are a special case of the general
> concept of arrays, though the terms are commonly used to name the same
> thing.

Agree completely.

> > For such things I generally prefer balanced tree structures because
> > work is amortized instead of lumped.  But the important point here is
> > that for every algorithm + data structure there is a trade-off of some
> > sort between one thing and another thing.
> 
> Hmm.  I had written a tree version of the word counter I had mentioned
> before.  I had stumbled upon the AVL tree package in Emacs and thought I
> might try using it.  This tree-based attempt turned out to be slower
> than my straightforward hashing solution.
> 
> I have no doubts this code could be written better by someone more
> experienced than I.

I don't know if the AVL package you used was implemented in elisp or
in C or otherwise.  And even though I am a long time user of emacs I
have never acquired the elisp skill to the same level as other
languages and therefore can't comment on that part.  But I know that
when people have implemented such data structures in Perl that the
result has never been as fast and efficient as in a native C version.
If so then that may easily account for performance differences.  And
also the native implementation of "hashes" in awk, perl, python, ruby
is quite optimized and very fast.  They have had years of eyes and
tweaking upon them.

> > I am in total agreement over using sed instead of head if you want to
> > do that.  Seeing 'sed 20q' should roll off the keyboard as print lines
> > until line 20 and then quit.  Very simple and to the point.  There is
> > definitely no need for a separate head command.  Other than for
> > symmetry with tail which is not as simple in sed.
> 
> I see that.  You could implement head on top of sed if you wanted to.  I
> myself have been using head for long enough for its stated purpose that
> grasping a sed equivalent was not immediately obvious.

Writing clear code that can be understood immediately by the entire
range of programmer skill is important in my not so humble opinion.
One shouldn't need to be a master experienced programmer to understand
what has been written.  Therefore I usually use 'head' specifically
for the clarity of it to everyone.  Seeing "head -n40" is not going to
confuse anyone.  Therefore I usually use it instead of "sed 40q" even
though I could remove 'head' entirely from my system if I were to
uniformly implement one in terms of the other.  Clarity is more
important.

And before someone mentions performance let me remind that we are
talking shell scripts.  In a shell script clarity is more important
than performance.  Always.  If the resulting shell script results in a
performance problem than choosing a better algorithm will almost
certainly be the better solution.  And if not than then choosing a
different language more efficient at the task is next.

I do expect some skill to be learned with 'awk' however.  It is so
very useful that seeing "awk '{print$1}' should not be that confusing
that it is printing the first field column.  Or that '{print$NF}' is
a common idiom for printing the last field.  (NF is the Number of
Fields in the line that was split by whitespace.  $NF is therefore the
last field.  If NF is 5 then $NF is saying $5 and therefore always the
last field of the line.)  A little bit of awk learning pays back a
large return on the investment.

> These things do take time to gain currency, don't they?  Under Linux,
> for example, the ip set of commands has been named the successor to
> ifconfig, and it too is taking time to diffuse into general knowledge.

Yes.  And 'ip' is an excellent example!  Even I have converted to
using ip and the iproute2 family instead of ifconfig.

One thing to note about the iproute2 family is that it is reasonably
well written.  We are not forced to use it.  Instead we are attracted
to using it in order to get access to the entire set of new networking
features available only through them.  It is a carrot not a stick.

> (And, although there have been a number of revisions of Standard C since
> 1989/1990, a lot of projects still write to that now legacy standard.
> But there may be other issues to consider here.)

Another good example. :-)

Bob

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Most used words in current buffer
  2018-07-19 20:42               ` Bob Proulx
@ 2018-07-20  3:08                 ` Bob Newell
       [not found]                 ` <mailman.3861.1532056120.1292.help-gnu-emacs@gnu.org>
  1 sibling, 0 replies; 47+ messages in thread
From: Bob Newell @ 2018-07-20  3:08 UTC (permalink / raw)
  To: help-gnu-emacs

As a writer I rather like this and I've started to adapt it for my
Emacs writing environment. I plan on a simple change to allow optional
restriction to a region rather than the whole buffer.

But as a writing tool, (and I realize I'm straying from pure Emacs
discussion) the issue is that words like "the" and all the common
stuff are always going to come out on top in an English-language piece
of any length. The next trick will be to eliminate such common words
(again, optionally).

By the way on a 2 MB file the elisp version runs in a few seconds.
Hats off to the coder.

Bob Newell
Honolulu, Hawai`i

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Most used words in current buffer
       [not found]             ` <mailman.3826.1532020800.1292.help-gnu-emacs@gnu.org>
@ 2018-07-20  5:52               ` Udyant Wig
  0 siblings, 0 replies; 47+ messages in thread
From: Udyant Wig @ 2018-07-20  5:52 UTC (permalink / raw)
  To: help-gnu-emacs

On 07/19/2018 10:49 PM, Nick Dokos wrote:
> IIRC, Kernighan & Pike say in the "Unix Programming Environment" that
> there *was* a `head' program, in addition to the `tail' program. It
> fell into disuse and disappeared almost immediately after sed became
> available. tail is still around, I guess both because the sed
> invocation for `tail -n' does not quite roll off the fingers the same
> way that 20q does -- see e.g
>
>
https://unix.stackexchange.com/questions/107387/emulate-tail-with-sed#107388
>
> -- but also because `tail -f' cannot be emulated by sed, at least not
> without a lot of effort and/or another program).

Well, in the interests of dispelling doubts around what Kernighan and
Pike say, I will quote the paragraph they wrote in regard to sed and
utilities equivalent to sed commands:

  With these ideas, it might seem sensible to write a program, called
  _head_, to print the first few lines of each filename argument.  But
  _sed 3q_ (or _10q_) is so easy to type that we've never felt the
  need.  We do, however, have an _ind_, since its equivalent _sed_
  command is harder to type.  (In the process of writing this book we
  replaced the existing 20-line C program by version 2 of the one-line
  implementation shown earlier).  There is no clear criterion for when
  it's worth making a separate command from a complicated command line;
  the best rule we've found is to put it in your _bin_ and see if you
  actually use it.

On the point of 'tail -f', yes, it is a very useful command that would
indeed be non-trivial to build using sed.

Udyant Wig
-- 
We make our discoveries through our mistakes: we watch one another's
success: and where there is freedom to experiment there is hope to
improve.
                                -- Arthur Quiller-Couch

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Most used words in current buffer
       [not found]               ` <mailman.3843.1532030947.1292.help-gnu-emacs@gnu.org>
@ 2018-07-20  6:19                 ` Udyant Wig
  2018-07-20 23:25                   ` Bob Proulx
       [not found]                   ` <mailman.3934.1532129163.1292.help-gnu-emacs@gnu.org>
  0 siblings, 2 replies; 47+ messages in thread
From: Udyant Wig @ 2018-07-20  6:19 UTC (permalink / raw)
  To: help-gnu-emacs

On 07/20/2018 01:38 AM, Bob Proulx wrote:
> Also remember that at the time knowledge and daily use of ed (and qed,
> ex, and the others) made using sed very easy.  Lots of shared
> knowledge.  However today that sed may seem arcane to people is just
> that they are no longer familiar with ed.  It no longer has that
> shared learning that made sed so familiar back in the day.

That is true.  I expect that learning the three of ed, sed, and grep
will be mutually reinforcing; the functionality in ed forms the basis of
those in sed and grep.

One cannot however expect ed to be available by default on modern Unix
systems.  E.g., it is not there by default on Debian.

While on the topic of editors, here's a fun quote:

  By the way, 'em' stands for 'editor for mortals' -- I christened it
  that after Ken Thompson visited our lab at QMC while I was developing
  it and said something like: "Yeah, I've seen editors like that, but I
  don't feel a need for them, I don't want to see the state of the file
  when I'm editing".
                                -- George Coulouris

> Bob

Udyant Wig
-- 
We make our discoveries through our mistakes: we watch one another's
success: and where there is freedom to experiment there is hope to
improve.
                                -- Arthur Quiller-Couch



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Most used words in current buffer
       [not found]               ` <mailman.3845.1532032966.1292.help-gnu-emacs@gnu.org>
@ 2018-07-20 13:18                 ` Udyant Wig
  0 siblings, 0 replies; 47+ messages in thread
From: Udyant Wig @ 2018-07-20 13:18 UTC (permalink / raw)
  To: help-gnu-emacs

On 07/20/2018 02:12 AM, Bob Proulx wrote:
> I don't know if the AVL package you used was implemented in elisp or
> in C or otherwise.  And even though I am a long time user of emacs I
> have never acquired the elisp skill to the same level as other
> languages and therefore can't comment on that part.  But I know that
> when people have implemented such data structures in Perl that the
> result has never been as fast and efficient as in a native C version.
> If so then that may easily account for performance differences.  And
> also the native implementation of "hashes" in awk, perl, python, ruby
> is quite optimized and very fast.  They have had years of eyes and
> tweaking upon them.

The AVL tree package comes with Emacs.  It bears a creation date of 10
May 1991 and copyright dates of 1995, and 2007-2015 on my system.

Now, it could be just that I have used trees in an ignorant way, or it
could be that it is as fast as it can go.  I hope it is the former as
that is something I can feasibly affect.

My issue in this general problem (of counting words) was this: I have to
store and operate on one kind of data during insertion (strings), and
another kind of data during retrieval (numbers); the keys are different
in the two cases.  I could not think of a way to do this nicely with a
tree.

> Writing clear code that can be understood immediately by the entire
> range of programmer skill is important in my not so humble opinion.
> One shouldn't need to be a master experienced programmer to understand
> what has been written.  Therefore I usually use 'head' specifically
> for the clarity of it to everyone.  Seeing "head -n40" is not going to
> confuse anyone.  Therefore I usually use it instead of "sed 40q" even
> though I could remove 'head' entirely from my system if I were to
> uniformly implement one in terms of the other.  Clarity is more
> important.
>
> And before someone mentions performance let me remind that we are
> talking shell scripts.  In a shell script clarity is more important
> than performance.  Always.  If the resulting shell script results in a
> performance problem than choosing a better algorithm will almost
> certainly be the better solution.  And if not than then choosing a
> different language more efficient at the task is next.

These are two paragraphs worth recollecting every so often.  Thank you.

> I do expect some skill to be learned with 'awk' however.  It is so
> very useful that seeing "awk '{print$1}' should not be that confusing
> that it is printing the first field column.  Or that '{print$NF}' is a
> common idiom for printing the last field.  (NF is the Number of Fields
> in the line that was split by whitespace.  $NF is therefore the last
> field.  If NF is 5 then $NF is saying $5 and therefore always the last
> field of the line.)  A little bit of awk learning pays back a large
> return on the investment.

Indeed.

> Bob

Udyant Wig
-- 
We make our discoveries through our mistakes: we watch one another's
success: and where there is freedom to experiment there is hope to
improve.
                                -- Arthur Quiller-Couch



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Most used words in current buffer
  2018-07-19 20:08               ` Bob Proulx
@ 2018-07-20 16:39                 ` Nick Dokos
       [not found]                 ` <mailman.3909.1532104802.1292.help-gnu-emacs@gnu.org>
  1 sibling, 0 replies; 47+ messages in thread
From: Nick Dokos @ 2018-07-20 16:39 UTC (permalink / raw)
  To: help-gnu-emacs

> Nick Dokos wrote:
>> IIRC, Kernighan & Pike say in the "Unix Programming Environment" that
>> there *was* a `head' program, in addition to the `tail' program. It
>> fell into disuse and disappeared almost immediately after sed became
>> available.
>

Eli Zaretskii writes:

> 'head' is alive and well in GNU Coreutils.  E.g., on a garden-variety
> GNU/Linux system:

Indeed - I knew that but I misspoke - I got buried shortly after I
sent and never got the chance to correct it. What I meant was not that
head is not available today, but that it was available on early Unix
and that (I thought) it went away after sed was added in that same
early Unix (maybe version 6 or possibly earlier). Which obviously
would not explain why it is still around today, but never mind: I was
misremembering things - it turns out that Bob Proulx's "feeble" memory
was better than my apparently non-existent one - see below.

Bob Proulx <bob@proulx.com> writes:
> I think you may be thinking of the 'gres' program.  Again here I am
> not going to look up the reference but instead just reply upon my
> feeble human memory.  But I think you are thinking of the gres
> program, global regular expression substitute, which if that route
> were followed would require a lot of greX programs where X is replaced
> by many specific things and was completely subsumed by 'sed'.
>

Yes, indeed: I had to search a bit in Kernighan and Pike to find it
again, and you are right: they were talking about `gres' and `grep',
not head and tail.

> Also remember that at the time knowledge and daily use of ed (and qed,
> ex, and the others) made using sed very easy.  Lots of shared
> knowledge.  However today that sed may seem arcane to people is just
> that they are no longer familiar with ed.  It no longer has that
> shared learning that made sed so familiar back in the day.
>

Agreed: I knew ed (I actually ported the Kernighan and Plauger ed
clone to a Prime OS machine in the early 1980's, before Prime started
selling its version of Emacs - the line editor they were distributing
was driving me crazy), and could transfer the knowledge to simple sed
invocations. At some point I even learnt the more "advanced" portions
of sed programming, but I never used them enough to retain them.

-- 
Nick

"There are only two hard problems in computer science: cache
invalidation, naming things, and off-by-one errors." -Martin Fowler

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Most used words in current buffer
       [not found]                 ` <mailman.3909.1532104802.1292.help-gnu-emacs@gnu.org>
@ 2018-07-20 18:13                   ` Udyant Wig
  2018-07-20 22:24                     ` Bob Newell
  2018-07-21  0:18                     ` Nick Dokos
  0 siblings, 2 replies; 47+ messages in thread
From: Udyant Wig @ 2018-07-20 18:13 UTC (permalink / raw)
  To: help-gnu-emacs

On 07/20/2018 10:09 PM, Nick Dokos wrote:
> Agreed: I knew ed (I actually ported the Kernighan and Plauger ed
> clone to a Prime OS machine in the early 1980's, before Prime started
> selling its version of Emacs - the line editor they were distributing
> was driving me crazy), and could transfer the knowledge to simple sed
> invocations. At some point I even learnt the more "advanced" portions
> of sed programming, but I never used them enough to retain them.

How was it to use the Emacs on Prime OS?  Was it very different to other
Emacsen extant at the time?

Udyant Wig
-- 
We make our discoveries through our mistakes: we watch one another's
success: and where there is freedom to experiment there is hope to
improve.
                                -- Arthur Quiller-Couch



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Most used words in current buffer
  2018-07-20 18:13                   ` Udyant Wig
@ 2018-07-20 22:24                     ` Bob Newell
  2018-07-21  0:00                       ` Nick Dokos
  2018-07-21  0:18                     ` Nick Dokos
  1 sibling, 1 reply; 47+ messages in thread
From: Bob Newell @ 2018-07-20 22:24 UTC (permalink / raw)
  To: help-gnu-emacs

> How was it to use the Emacs on Prime OS?  Was it very different to
> other
> Emacsen extant at the time?

I also date back to these days, and in fact Emacs on Prime was
my first experience with Emacs, all those years ago. I was
fairly active in the Prime user community and sent the odd
patch back to Prime once in a while. I was thoroughly disliked
by the local Prime software support staff because the only
time I ever contacted them was when something was seriously
wrong and I couldn't fix it on my own.

The initial release, as I recall, was extremely slow starting,
taking at least five minutes before being ready for use (not
exaggerating). This did get much, much better with subsequent
releases. And, once started, it always ran very quickly.

It was real Emacs in all its glory. Nick--- as you're the one
that did the port--- take credit for brilliant work.

There were a couple of peculiarities. Ctrl-p was Prime's
interrupt, so previous-line became Ctrl-z (I don't believe
there was the concept of 'suspend' at that time). We used the
mnemoic 'zrevious'! The other odd thing was that underscore
was used instead of dash, so you wouldn't have buffer-name,
you'd have buffer_name.

But it served me well for a number of years, and that's how I
learned Emacs and Elisp. There wasn't a package library, but
Emacs was still super-powerful. When the PC came along I
looked for something as good and found little. Early Emacs
ports to the PC weren't the best. There was something called
Epsilon (which I think still exists) that came closest. But
eventually there was the Real Thing and none of us ever looked
back. By then Prime was in its last days.

-- 
Bob Newell
Honolulu, Hawai`i
* Via Gnus/BBDB/Org/Emacs/Linux *

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Most used words in current buffer
  2018-07-20  6:19                 ` Udyant Wig
@ 2018-07-20 23:25                   ` Bob Proulx
  2018-07-21  0:26                     ` Nick Dokos
       [not found]                   ` <mailman.3934.1532129163.1292.help-gnu-emacs@gnu.org>
  1 sibling, 1 reply; 47+ messages in thread
From: Bob Proulx @ 2018-07-20 23:25 UTC (permalink / raw)
  To: help-gnu-emacs

Udyant Wig wrote:
> Bob Proulx wrote:
> > Also remember that at the time knowledge and daily use of ed (and qed,
> 
> That is true.  I expect that learning the three of ed, sed, and grep
> will be mutually reinforcing; the functionality in ed forms the basis of
> those in sed and grep.

Perhaps the other facet of this is regular expressions.  When I was in
my entry level university class the first two weeks solid were spent
on learning to use the computer, use the editor, work with files, and
using regular expressions.  That last, regular expressions, seems to
be less emphasized today.  It isn't taught at all at the local
university.  AFAICT students here pick it up only in passing if at
all.  They are missing out on a very powerful feature.

> One cannot however expect ed to be available by default on modern Unix
> systems.  E.g., it is not there by default on Debian.

I have two stories that I find funny about ed.  This one happened in
the last week.  A friend at a meetup had a problem with his Ubuntu
laptop.  His GUI was broken and needed a small file fix.  This person
uses the GUI for everything and had no terminal editors installed, as
far as I could tell.  Not vi, vim, nor emacs nor other.  Of course
this person's normal GUI editors were unavailable without X running.
But strangely 'ed' *was* installed.  I don't use Ubuntu but I guess it
got installed by default there.  Or something pulled it in.  I have no
idea.

I think you can already tell where this is going.  I used ed to edit
and fix things.  Being able to use it appeared like magic to this
person who couldn't imagine you could edit something without a mouse.

This happened in just this last week!  Having ed there made the task
easy.

And I'll save the other funny story about ed for another time. :-)

Bob

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Most used words in current buffer
  2018-07-20 22:24                     ` Bob Newell
@ 2018-07-21  0:00                       ` Nick Dokos
  0 siblings, 0 replies; 47+ messages in thread
From: Nick Dokos @ 2018-07-21  0:00 UTC (permalink / raw)
  To: help-gnu-emacs

Bob Newell <bobnewell@bobnewell.net> writes:

> It was real Emacs in all its glory. Nick--- as you're the one
> that did the port--- take credit for brilliant work.
>
Uhh, I wish I could say that I ported emacs - alas I cannot take
credit for that: what I ported was the ed clone from Kernighan and
Plauger's "Software Tools".

-- 
Nick

"There are only two hard problems in computer science: cache
invalidation, naming things, and off-by-one errors." -Martin Fowler




^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Most used words in current buffer
  2018-07-20 18:13                   ` Udyant Wig
  2018-07-20 22:24                     ` Bob Newell
@ 2018-07-21  0:18                     ` Nick Dokos
  1 sibling, 0 replies; 47+ messages in thread
From: Nick Dokos @ 2018-07-21  0:18 UTC (permalink / raw)
  To: help-gnu-emacs

Udyant Wig <udyantw@gmail.com> writes:

> How was it to use the Emacs on Prime OS?  Was it very different to other
> Emacsen extant at the time?
>

It was pretty close I think. I don't remember the startup slowness
that Bob Newell mentioned but it may be that we waited to get it until
later and they may have ironed out the problems a bit. But once emacs
became available to me, I threw away my ed port and never looked back.

-- 
Nick

"There are only two hard problems in computer science: cache
invalidation, naming things, and off-by-one errors." -Martin Fowler

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Most used words in current buffer
  2018-07-20 23:25                   ` Bob Proulx
@ 2018-07-21  0:26                     ` Nick Dokos
  2018-07-21  4:03                       ` Bob Proulx
  0 siblings, 1 reply; 47+ messages in thread
From: Nick Dokos @ 2018-07-21  0:26 UTC (permalink / raw)
  To: help-gnu-emacs

Bob Proulx <bob@proulx.com> writes:

> I have two stories that I find funny about ed.  This one happened in
> the last week.  A friend at a meetup had a problem with his Ubuntu
> laptop.  His GUI was broken and needed a small file fix.  This person
> uses the GUI for everything and had no terminal editors installed, as
> far as I could tell.  Not vi, vim, nor emacs nor other.  Of course
> this person's normal GUI editors were unavailable without X running.
> But strangely 'ed' *was* installed.  I don't use Ubuntu but I guess it
> got installed by default there.  Or something pulled it in.  I have no
> idea.
>
> I think you can already tell where this is going.  I used ed to edit
> and fix things.  Being able to use it appeared like magic to this
> person who couldn't imagine you could edit something without a mouse.
>

A friend (an avid vim user) was watching over my shoulder while I was
trying to do something (in emacs of course) and had to edit /etc/fstab
or something.  I was in an emacs shell at the time, so I did 'sudo ed
/etc/fstab' and did my edit (and yes, I know and use tramp most of the
time). The idea of running ed from the shell from within emacs was too
much for him: he was laughing for hours!

And yes, I install ed whenever I set up a new machine: for just those
occasions :-)

-- 
Nick

"There are only two hard problems in computer science: cache
invalidation, naming things, and off-by-one errors." -Martin Fowler




^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Most used words in current buffer
  2018-07-21  0:26                     ` Nick Dokos
@ 2018-07-21  4:03                       ` Bob Proulx
  0 siblings, 0 replies; 47+ messages in thread
From: Bob Proulx @ 2018-07-21  4:03 UTC (permalink / raw)
  To: help-gnu-emacs

Nick Dokos wrote:
> A friend (an avid vim user) was watching over my shoulder while I was
> trying to do something (in emacs of course) and had to edit /etc/fstab
> or something.  I was in an emacs shell at the time, so I did 'sudo ed
> /etc/fstab' and did my edit (and yes, I know and use tramp most of the
> time). The idea of running ed from the shell from within emacs was too
> much for him: he was laughing for hours!

That seems just perfect!  However I must ask if you have used M-x term
in order to get a tty?  You could have really tormented your friend:

  M-x term
  sudo emacs /etc/fstab

Just remember to prefix every keystroke appropriately.  As a hint it
is easy to switch term mode into line mode with C-c C-j and back to
character mode with C-c C-k.

And then you can always do that again without a hard limit on the
number of nestings.  Until you run out of vertical terminal lines!

> And yes, I install ed whenever I set up a new machine: for just those
> occasions :-)

Me too! :-)

Bob

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Most used words in current buffer
       [not found]                 ` <mailman.3861.1532056120.1292.help-gnu-emacs@gnu.org>
@ 2018-07-21 12:51                   ` Udyant Wig
  2018-07-21 16:15                     ` Eric Abrahamsen
       [not found]                     ` <mailman.3982.1532189751.1292.help-gnu-emacs@gnu.org>
  0 siblings, 2 replies; 47+ messages in thread
From: Udyant Wig @ 2018-07-21 12:51 UTC (permalink / raw)
  To: help-gnu-emacs

On 07/20/2018 08:38 AM, Bob Newell wrote:
> By the way on a 2 MB file the elisp version runs in a few seconds.
> Hats off to the coder.

I am still looking to improve it.  For example, on a 4.5 MB text file,
the original version takes over 5 seconds to run, as measured using the
functions #'benchmark-run and #'benchmark-run-compiled.

Is it feasible to read words from the buffer and hash them directly from
there?  Or, going further, is there a better way to do this -- counting
words and producing the N most used -- using some other design, maybe
with some other data structure?

> Bob Newell
> Honolulu, Hawai`i

Udyant Wig
-- 
We make our discoveries through our mistakes: we watch one another's
success: and where there is freedom to experiment there is hope to
improve.
                                -- Arthur Quiller-Couch

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Most used words in current buffer
       [not found]                   ` <mailman.3934.1532129163.1292.help-gnu-emacs@gnu.org>
@ 2018-07-21 13:39                     ` Udyant Wig
  0 siblings, 0 replies; 47+ messages in thread
From: Udyant Wig @ 2018-07-21 13:39 UTC (permalink / raw)
  To: help-gnu-emacs

On 07/21/2018 04:55 AM, Bob Proulx wrote:
> I have two stories that I find funny about ed.  This one happened in
> the last week.  A friend at a meetup had a problem with his Ubuntu
> laptop.  His GUI was broken and needed a small file fix.  This person
> uses the GUI for everything and had no terminal editors installed, as
> far as I could tell.  Not vi, vim, nor emacs nor other.  Of course
> this person's normal GUI editors were unavailable without X running.
> But strangely 'ed' *was* installed.  I don't use Ubuntu but I guess it
> got installed by default there.  Or something pulled it in.  I have no
> idea.
>
> I think you can already tell where this is going.  I used ed to edit
> and fix things.  Being able to use it appeared like magic to this
> person who couldn't imagine you could edit something without a mouse.
>
> This happened in just this last week!  Having ed there made the task
> easy.

The face of the person must have been a picture.

I take the moral of this to be that one ought to know the simpler tools,
even if one never uses them.  I don't know how general this moral is.

> And I'll save the other funny story about ed for another time. :-)

Here's hoping that occasion comes soon.

> Bob

Udyant Wig
-- 
We make our discoveries through our mistakes: we watch one another's
success: and where there is freedom to experiment there is hope to
improve.
                                -- Arthur Quiller-Couch



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Most used words in current buffer
  2018-07-21 12:51                   ` Udyant Wig
@ 2018-07-21 16:15                     ` Eric Abrahamsen
       [not found]                     ` <mailman.3982.1532189751.1292.help-gnu-emacs@gnu.org>
  1 sibling, 0 replies; 47+ messages in thread
From: Eric Abrahamsen @ 2018-07-21 16:15 UTC (permalink / raw)
  To: help-gnu-emacs

Udyant Wig <udyantw@gmail.com> writes:

> On 07/20/2018 08:38 AM, Bob Newell wrote:
>> By the way on a 2 MB file the elisp version runs in a few seconds.
>> Hats off to the coder.
>
> I am still looking to improve it.  For example, on a 4.5 MB text file,
> the original version takes over 5 seconds to run, as measured using the
> functions #'benchmark-run and #'benchmark-run-compiled.
>
> Is it feasible to read words from the buffer and hash them directly from
> there?  Or, going further, is there a better way to do this -- counting
> words and producing the N most used -- using some other design, maybe
> with some other data structure?

Interesting... In general I think Emacs is highly optimized to use the
buffer as its textual data structure, more so than a string.
Particularly when the code is compiled (many of the text-movement
commands have opcodes). I made the following two commands to collect
words from a novel in an Org file, and the one that uses `forward-word'
and `buffer-substring' runs around twice as fast as the `split-string'.

Of course, they don't collect the same list of words! But even if you
add more code for trimming, etc., it will still likely be faster than
operating on a string.


(defun test-string (&optional f)
  (let ((file (or f "/home/eric/org/hollowmountain.org"))
	str lst)
    (with-temp-buffer
      (insert-file-contents file)
      (setq str (split-string (buffer-string)))
      (dolist (word str)
	(push word lst)))
    (length lst)))


(defun test-buffer (&optional f)
  (let ((file (or f "/home/eric/org/hollowmountain.org"))
	pnt lst)
    (with-temp-buffer
      (insert-file-contents file)
      (goto-char (point-min))
      (setq pnt (point))
      (while (forward-word)
	(push (buffer-substring pnt (point)) lst)
	(setq pnt (point))))
    (length lst)))




^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Most used words in current buffer
  2018-07-19 13:26             ` Udyant Wig
  2018-07-19 20:42               ` Bob Proulx
       [not found]               ` <mailman.3845.1532032966.1292.help-gnu-emacs@gnu.org>
@ 2018-07-21 18:22               ` Stefan Monnier
  2018-07-22  9:02                 ` tomas
       [not found]                 ` <mailman.4013.1532250176.1292.help-gnu-emacs@gnu.org>
       [not found]               ` <mailman.3991.1532197378.1292.help-gnu-emacs@gnu.org>
  3 siblings, 2 replies; 47+ messages in thread
From: Stefan Monnier @ 2018-07-21 18:22 UTC (permalink / raw)
  To: help-gnu-emacs

> (defun buffer-most-used-words-2 (n)
>   "Make a list of the N most used words in buffer."
>   (let ((counts (avl-tree-create (lambda (wc1 wc2)
> 				   (string< (first wc1) (first wc2)))))
> 	(words (split-string (buffer-string)))

If you want to go fast, don't use split-string+buffer-string.
Scan through the buffer and extract each word with buffer-substring directly.

>       (let ((element (avl-tree-member counts (list (downcase word) 0))))

I'd use a hash-table (implemented in C) rather than an avl-tree
(implemented in Elisp).


        Stefan




^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Most used words in current buffer
       [not found]               ` <mailman.3991.1532197378.1292.help-gnu-emacs@gnu.org>
@ 2018-07-21 19:39                 ` Udyant Wig
  2018-07-21 20:54                   ` Stefan Monnier
       [not found]                   ` <mailman.3995.1532206511.1292.help-gnu-emacs@gnu.org>
  0 siblings, 2 replies; 47+ messages in thread
From: Udyant Wig @ 2018-07-21 19:39 UTC (permalink / raw)
  To: help-gnu-emacs

On 07/21/2018 11:52 PM, Stefan Monnier wrote:
>> (defun buffer-most-used-words-2 (n)
>>   "Make a list of the N most used words in buffer."
>>   (let ((counts (avl-tree-create (lambda (wc1 wc2)
>> 				   (string< (first wc1) (first wc2)))))
>> 	(words (split-string (buffer-string)))
>
> If you want to go fast, don't use split-string+buffer-string.  Scan
> through the buffer and extract each word with buffer-substring
> directly.
>
>>       (let ((element (avl-tree-member counts (list (downcase word)
>>0))))
>
> I'd use a hash-table (implemented in C) rather than an avl-tree
> (implemented in Elisp).

After spending (too) many hours on this, I believe that I have a better
solution.

---
(require 'cl-lib)

;; Can this hack be made better?
(defun whitespace-p (char)
  (or (eq char 9) (eq char 10) (eq char 13)) (eq char 32))

(defun buffer-most-used-words-3 (n)
  "Make a list of the N most used words in buffer."
  (let ((counts (make-hash-table :test #'equal))
	sorted-counts)
    (save-excursion
      (goto-char (point-min))
      (cl-loop with word = nil
	       with start = 0
	       with end = 0
	       with state = 'space
	       with char = nil
	       until (eobp)
	       do
	       (setf char (char-after))
	       (cond ((eq state 'space)
		      (when (not (whitespace-p char))
			(setf start (point)
			      state 'word)))
		     ((eq state 'word)
		      (when (whitespace-p char)
			(setf end (point)
			      state 'space
			      word (buffer-substring start end))
			(incf (gethash word counts 0)))))
	       (forward-char)))
    (cl-loop for word being the hash-keys of counts
	     using (hash-values count)
	     do
	     (push (list word count) sorted-counts)
	     finally (setf sorted-counts (cl-sort sorted-counts #'>
						  :key #'second)))
    (mapcar #'first (cl-subseq sorted-counts 0 n))))
---

In regard to performance, it is slightly better than
BUFFER-MOST-USED-WORDS-1, which used a combination of SPLIT-STRING on
BUFFER-STRING along with a hash-table.  Here are timings over ten runs
each for a 4.5 MB text file:

buffer-most-used-words-1:    4.7362510517 seconds
buffer-most-used-words-3:    4.4849896529 seconds

>         Stefan

Udyant Wig
-- 
We make our discoveries through our mistakes: we watch one another's
success: and where there is freedom to experiment there is hope to
improve.
                                -- Arthur Quiller-Couch



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Most used words in current buffer
       [not found]                     ` <mailman.3982.1532189751.1292.help-gnu-emacs@gnu.org>
@ 2018-07-21 19:46                       ` Udyant Wig
  2018-07-22  3:57                         ` Eric Abrahamsen
       [not found]                         ` <mailman.4007.1532231884.1292.help-gnu-emacs@gnu.org>
  0 siblings, 2 replies; 47+ messages in thread
From: Udyant Wig @ 2018-07-21 19:46 UTC (permalink / raw)
  To: help-gnu-emacs

On 07/21/2018 09:45 PM, Eric Abrahamsen wrote:
> Interesting... In general I think Emacs is highly optimized to use the
> buffer as its textual data structure, more so than a string.
> Particularly when the code is compiled (many of the text-movement
> commands have opcodes). I made the following two commands to collect
> words from a novel in an Org file, and the one that uses
> `forward-word' and `buffer-substring' runs around twice as fast as the
> `split-string'.
>
> Of course, they don't collect the same list of words! But even if you
> add more code for trimming, etc., it will still likely be faster than
> operating on a string.
> [snip code]

I have acted upon the advice (yours and Stefan Monnier's) to operate on
the buffer directly using BUFFER-SUBSTRING.  Please see my follow up to
Stefan's message.

BUFFER-SUBSTRING did gain me (somewhat) better performance.

Udyant Wig
-- 
We make our discoveries through our mistakes: we watch one another's
success: and where there is freedom to experiment there is hope to
improve.
                                -- Arthur Quiller-Couch



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Most used words in current buffer
  2018-07-21 19:39                 ` Udyant Wig
@ 2018-07-21 20:54                   ` Stefan Monnier
       [not found]                   ` <mailman.3995.1532206511.1292.help-gnu-emacs@gnu.org>
  1 sibling, 0 replies; 47+ messages in thread
From: Stefan Monnier @ 2018-07-21 20:54 UTC (permalink / raw)
  To: help-gnu-emacs

>  (or (eq char 9) (eq char 10) (eq char 13)) (eq char 32))

Aka (memq char '(?\t ?\n ?\r ?\s))

>       (cl-loop with word = nil
[...]
> 	       (forward-char)))

This time you do the word-parsing char-by-char in Elisp.  I'd expect
you'll get better performance with things like `forward-word` or
(re-search-forward "\\<.*?\\>" nil t) so that the Elisp loop operates
word-by-word rather than having to look at each and every character.

        Stefan

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Most used words in current buffer
  2018-07-21 19:46                       ` Udyant Wig
@ 2018-07-22  3:57                         ` Eric Abrahamsen
  2018-07-22  4:00                           ` Eric Abrahamsen
       [not found]                           ` <mailman.4008.1532232144.1292.help-gnu-emacs@gnu.org>
       [not found]                         ` <mailman.4007.1532231884.1292.help-gnu-emacs@gnu.org>
  1 sibling, 2 replies; 47+ messages in thread
From: Eric Abrahamsen @ 2018-07-22  3:57 UTC (permalink / raw)
  To: help-gnu-emacs

Udyant Wig <udyantw@gmail.com> writes:

> On 07/21/2018 09:45 PM, Eric Abrahamsen wrote:
>> Interesting... In general I think Emacs is highly optimized to use the
>> buffer as its textual data structure, more so than a string.
>> Particularly when the code is compiled (many of the text-movement
>> commands have opcodes). I made the following two commands to collect
>> words from a novel in an Org file, and the one that uses
>> `forward-word' and `buffer-substring' runs around twice as fast as the
>> `split-string'.
>>
>> Of course, they don't collect the same list of words! But even if you
>> add more code for trimming, etc., it will still likely be faster than
>> operating on a string.
>> [snip code]
>
> I have acted upon the advice (yours and Stefan Monnier's) to operate on
> the buffer directly using BUFFER-SUBSTRING.  Please see my follow up to
> Stefan's message.
>
> BUFFER-SUBSTRING did gain me (somewhat) better performance.

As Stefan said, going character by character is going to be slow... But
my example with `forward-word' collects a lot of cruft. So I would
suggest doing what `forward-word' does internally and move by syntax.
This also opens up the possibility of tweaking the behavior of your
function (ie, what constitutes a word) by setting temporary syntax
tables. Here's a word scanner that only picks up actual words (according
to the default syntax table):

(defun test-buffer (&optional f)
  (let ((file (or f "/home/eric/org/hollowmountain.org"))
	pnt lst)
    (with-temp-buffer
      (insert-file-contents file)
      (goto-char (point-min))
      (skip-syntax-forward "^w")
      (setq pnt (point))
      (while (and (null (eobp)) (skip-syntax-forward "w"))
	(push (buffer-substring pnt (point)) lst)
	(skip-syntax-forward "^w")
	(setq pnt (point))))
    (nreverse lst)))




^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Most used words in current buffer
  2018-07-22  3:57                         ` Eric Abrahamsen
@ 2018-07-22  4:00                           ` Eric Abrahamsen
  2018-07-22  4:05                             ` Eric Abrahamsen
       [not found]                           ` <mailman.4008.1532232144.1292.help-gnu-emacs@gnu.org>
  1 sibling, 1 reply; 47+ messages in thread
From: Eric Abrahamsen @ 2018-07-22  4:00 UTC (permalink / raw)
  To: help-gnu-emacs

Eric Abrahamsen <eric@ericabrahamsen.net> writes:

> Udyant Wig <udyantw@gmail.com> writes:
>
>> On 07/21/2018 09:45 PM, Eric Abrahamsen wrote:
>>> Interesting... In general I think Emacs is highly optimized to use the
>>> buffer as its textual data structure, more so than a string.
>>> Particularly when the code is compiled (many of the text-movement
>>> commands have opcodes). I made the following two commands to collect
>>> words from a novel in an Org file, and the one that uses
>>> `forward-word' and `buffer-substring' runs around twice as fast as the
>>> `split-string'.
>>>
>>> Of course, they don't collect the same list of words! But even if you
>>> add more code for trimming, etc., it will still likely be faster than
>>> operating on a string.
>>> [snip code]
>>
>> I have acted upon the advice (yours and Stefan Monnier's) to operate on
>> the buffer directly using BUFFER-SUBSTRING.  Please see my follow up to
>> Stefan's message.
>>
>> BUFFER-SUBSTRING did gain me (somewhat) better performance.
>
> As Stefan said, going character by character is going to be slow... But
> my example with `forward-word' collects a lot of cruft. So I would
> suggest doing what `forward-word' does internally and move by syntax.

Actually I think alternating `forward-word' with `forward-to-word' might
do the exact same thing as alternating (skip-syntax-forward "w") with
(skip-syntax-forward "^w"), and might get you some extra... stuff. Maybe
worth benchmarking!




^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Most used words in current buffer
  2018-07-22  4:00                           ` Eric Abrahamsen
@ 2018-07-22  4:05                             ` Eric Abrahamsen
  0 siblings, 0 replies; 47+ messages in thread
From: Eric Abrahamsen @ 2018-07-22  4:05 UTC (permalink / raw)
  To: help-gnu-emacs

Eric Abrahamsen <eric@ericabrahamsen.net> writes:

> Eric Abrahamsen <eric@ericabrahamsen.net> writes:
>
>> Udyant Wig <udyantw@gmail.com> writes:
>>
>>> On 07/21/2018 09:45 PM, Eric Abrahamsen wrote:
>>>> Interesting... In general I think Emacs is highly optimized to use the
>>>> buffer as its textual data structure, more so than a string.
>>>> Particularly when the code is compiled (many of the text-movement
>>>> commands have opcodes). I made the following two commands to collect
>>>> words from a novel in an Org file, and the one that uses
>>>> `forward-word' and `buffer-substring' runs around twice as fast as the
>>>> `split-string'.
>>>>
>>>> Of course, they don't collect the same list of words! But even if you
>>>> add more code for trimming, etc., it will still likely be faster than
>>>> operating on a string.
>>>> [snip code]
>>>
>>> I have acted upon the advice (yours and Stefan Monnier's) to operate on
>>> the buffer directly using BUFFER-SUBSTRING.  Please see my follow up to
>>> Stefan's message.
>>>
>>> BUFFER-SUBSTRING did gain me (somewhat) better performance.
>>
>> As Stefan said, going character by character is going to be slow... But
>> my example with `forward-word' collects a lot of cruft. So I would
>> suggest doing what `forward-word' does internally and move by syntax.
>
> Actually I think alternating `forward-word' with `forward-to-word' might
> do the exact same thing as alternating (skip-syntax-forward "w") with
> (skip-syntax-forward "^w"), and might get you some extra... stuff. Maybe
> worth benchmarking!

And, because apparently my Saturday nights are slow:

(defun test-buffer (f)
  (let ((counts (make-hash-table :test #'equal))
	pnt)
    (with-temp-buffer
      (insert-file-contents f)
      (goto-char (point-min))
      (forward-to-word 1)
      (setq pnt (point))
      (while (and (null (eobp)) (forward-word))
	(cl-incf (gethash (downcase (buffer-substring pnt (point))) counts 0))
	(forward-to-word 1)
	(setq pnt (point))))
        counts))

Seems to go pretty quick on my test file, though it's only 220K.




^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Most used words in current buffer
  2018-07-21 18:22               ` Stefan Monnier
@ 2018-07-22  9:02                 ` tomas
  2018-07-23  6:09                   ` Bob Proulx
       [not found]                   ` <mailman.4074.1532326162.1292.help-gnu-emacs@gnu.org>
       [not found]                 ` <mailman.4013.1532250176.1292.help-gnu-emacs@gnu.org>
  1 sibling, 2 replies; 47+ messages in thread
From: tomas @ 2018-07-22  9:02 UTC (permalink / raw)
  To: help-gnu-emacs

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Sat, Jul 21, 2018 at 02:22:44PM -0400, Stefan Monnier wrote:
> > (defun buffer-most-used-words-2 (n)
> >   "Make a list of the N most used words in buffer."
> >   (let ((counts (avl-tree-create (lambda (wc1 wc2)
> > 				   (string< (first wc1) (first wc2)))))
> > 	(words (split-string (buffer-string)))
> 
> If you want to go fast, don't use split-string+buffer-string.
> Scan through the buffer and extract each word with buffer-substring directly.
> 
> >       (let ((element (avl-tree-member counts (list (downcase word) 0))))
> 
> I'd use a hash-table (implemented in C) rather than an avl-tree
> (implemented in Elisp).

Plus, a (well-implemented) hash table will always be faster (for inserts
and random lookups) than a balanced (AVL, red-black) binary tree. The
latter affords you sorted lookup (find first greater than, output in order).

You pay for that :-)

Cheers
- -- t
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)

iEYEARECAAYFAltUSDIACgkQBcgs9XrR2kYypwCcC7wGis1R7N6HnC5Moq0yj1Hb
a7AAniF3qrn9Tu60jo8qhkQRM73KMFDS
=eGJK
-----END PGP SIGNATURE-----



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Most used words in current buffer
       [not found]                         ` <mailman.4007.1532231884.1292.help-gnu-emacs@gnu.org>
@ 2018-07-22 18:19                           ` Udyant Wig
  0 siblings, 0 replies; 47+ messages in thread
From: Udyant Wig @ 2018-07-22 18:19 UTC (permalink / raw)
  To: help-gnu-emacs

On 07/22/2018 09:27 AM, Eric Abrahamsen wrote:
> As Stefan said, going character by character is going to be
> slow... But my example with `forward-word' collects a lot of cruft. So
> I would suggest doing what `forward-word' does internally and move by
> syntax.  This also opens up the possibility of tweaking the behavior
> of your function (ie, what constitutes a word) by setting temporary
> syntax tables. Here's a word scanner that only picks up actual words
> (according to the default syntax table):
>
> (defun test-buffer (&optional f)
>   (let ((file (or f "/home/eric/org/hollowmountain.org"))
> 	pnt lst)
>     (with-temp-buffer
>       (insert-file-contents file)
>       (goto-char (point-min))
>       (skip-syntax-forward "^w")
>       (setq pnt (point))
>       (while (and (null (eobp)) (skip-syntax-forward "w"))
> 	(push (buffer-substring pnt (point)) lst)
> 	(skip-syntax-forward "^w")
> 	(setq pnt (point))))
>     (nreverse lst)))

Thank you for the idea!  It did wonders for the running time, a sample
of which I have put after the following adaption of your idea to the
code.

---
(defun buffer-most-used-words-4 (n)
  "Make a list of the N most used words in buffer."
  (let ((counts (make-hash-table :test #'equal))
	sorted-counts
	start
	end)
    (save-excursion
      (goto-char (point-min))
      (skip-syntax-forward "^w")
      (setf start (point))
      (cl-loop until (eobp)
	       do
	       (skip-syntax-forward "w")
	       (setf end (point))
	       (incf (gethash (buffer-substring start end) counts 0))
	       (skip-syntax-forward "^w")
	       (setf start (point))))
    (cl-loop for word being the hash-keys of counts
	     using (hash-values count)
	     do
	     (push (list word count) sorted-counts)
	     finally (setf sorted-counts (cl-sort sorted-counts #'>
						  :key #'second)))
    (mapcar #'first (cl-subseq sorted-counts 0 n))))
---

Compiled, this version takes about half the time the previous version --
going character by character -- took to process a 4.5 MB text file.

Average timing after ten runs on the above mentioned file: 2.75 seconds.


On syntax tables, the ability to determine what is a word or other
construct in a buffer could be very handy indeed.  One application
beyond prose text that comes to mind could be to count the most used
variable or function in a file of source code.  There might be others of
course.

Udyant Wig
-- 
We make our discoveries through our mistakes: we watch one another's
success: and where there is freedom to experiment there is hope to
improve.
                                -- Arthur Quiller-Couch



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Most used words in current buffer
       [not found]                           ` <mailman.4008.1532232144.1292.help-gnu-emacs@gnu.org>
@ 2018-07-22 18:28                             ` Udyant Wig
  2018-07-22 20:05                               ` Eric Abrahamsen
  0 siblings, 1 reply; 47+ messages in thread
From: Udyant Wig @ 2018-07-22 18:28 UTC (permalink / raw)
  To: help-gnu-emacs

On 07/22/2018 09:30 AM, Eric Abrahamsen wrote:
> Actually I think alternating `forward-word' with `forward-to-word'
> might do the exact same thing as alternating (skip-syntax-forward "w")
> with (skip-syntax-forward "^w"), and might get you some
> extra... stuff. Maybe worth benchmarking!

I did.  I had to load the `misc' package to use `forward-to-word'.  A
ten run timing session gave an average of 4.89 seconds on the 4.5 MB
text file: more than twice as much time as the version of the code using
`skip-syntax-forward'.

Udyant Wig
-- 
We make our discoveries through our mistakes: we watch one another's
success: and where there is freedom to experiment there is hope to
improve.
                                -- Arthur Quiller-Couch



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Most used words in current buffer
       [not found]                   ` <mailman.3995.1532206511.1292.help-gnu-emacs@gnu.org>
@ 2018-07-22 18:43                     ` Udyant Wig
  0 siblings, 0 replies; 47+ messages in thread
From: Udyant Wig @ 2018-07-22 18:43 UTC (permalink / raw)
  To: help-gnu-emacs

On 07/22/2018 02:24 AM, Stefan Monnier wrote:
>>  (or (eq char 9) (eq char 10) (eq char 13)) (eq char 32))
>
> Aka (memq char '(?\t ?\n ?\r ?\s))

Indeed.  Not only does this form avoid the magic numbers, increase
readability, and abstract the representation, it fixes the error I made
in my code.

It turns out that my code worked by fluke: checking spaces.  I should
try not to post late at night.

>>       (cl-loop with word = nil
> [...]
>> 	       (forward-char)))
>
> This time you do the word-parsing char-by-char in Elisp.  I'd expect
> you'll get better performance with things like `forward-word` or
> (re-search-forward "\\<.*?\\>" nil t) so that the Elisp loop operates
> word-by-word rather than having to look at each and every character.

I read your message after Eric Abrahamsen's, which echoed your point.
Taking the advice paid off handsomely!  Thank you.

I posted `buffer-most-used-words-4' and an indication of its performance
in a follow up to Eric Abrahamsen's message.

>         Stefan

Udyant Wig
-- 
We make our discoveries through our mistakes: we watch one another's
success: and where there is freedom to experiment there is hope to
improve.
                                -- Arthur Quiller-Couch

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Most used words in current buffer
       [not found]                 ` <mailman.4013.1532250176.1292.help-gnu-emacs@gnu.org>
@ 2018-07-22 18:58                   ` Udyant Wig
  0 siblings, 0 replies; 47+ messages in thread
From: Udyant Wig @ 2018-07-22 18:58 UTC (permalink / raw)
  To: help-gnu-emacs

On 07/22/2018 02:32 PM, tomas@tuxteam.de wrote:
> Plus, a (well-implemented) hash table will always be faster (for
> inserts and random lookups) than a balanced (AVL, red-black) binary
> tree. The latter affords you sorted lookup (find first greater than,
> output in order).
>
> You pay for that :-)

As I have found to be the case.

I have been unable to put the AVL tree package to any heavy or serious
use yet, but I did use it in a simple way to keep track of some books.
I found `avl-tree-flatten' to be valuable.  Also of note, I think, is
`avl-tree-stack', which allows treatment as a sorted stack of the tree
elements.

> Cheers
> -- t

Udyant Wig
-- 
We make our discoveries through our mistakes: we watch one another's
success: and where there is freedom to experiment there is hope to
improve.
                                -- Arthur Quiller-Couch



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Most used words in current buffer
  2018-07-22 18:28                             ` Udyant Wig
@ 2018-07-22 20:05                               ` Eric Abrahamsen
  0 siblings, 0 replies; 47+ messages in thread
From: Eric Abrahamsen @ 2018-07-22 20:05 UTC (permalink / raw)
  To: help-gnu-emacs

Udyant Wig <udyantw@gmail.com> writes:

> On 07/22/2018 09:30 AM, Eric Abrahamsen wrote:
>> Actually I think alternating `forward-word' with `forward-to-word'
>> might do the exact same thing as alternating (skip-syntax-forward "w")
>> with (skip-syntax-forward "^w"), and might get you some
>> extra... stuff. Maybe worth benchmarking!
>
> I did.  I had to load the `misc' package to use `forward-to-word'.  A
> ten run timing session gave an average of 4.89 seconds on the 4.5 MB
> text file: more than twice as much time as the version of the code using
> `skip-syntax-forward'.

Wow, that's way worse than I would have guessed. `forward-word' does
much more than `skip-syntax-forward', (it lets you use
`find-word-boundary-function-table', for one thing) but I wouldn't have
thought it does *that* much more. I don't speak C though, so...




^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Most used words in current buffer
  2018-07-22  9:02                 ` tomas
@ 2018-07-23  6:09                   ` Bob Proulx
  2018-07-23  7:34                     ` tomas
       [not found]                   ` <mailman.4074.1532326162.1292.help-gnu-emacs@gnu.org>
  1 sibling, 1 reply; 47+ messages in thread
From: Bob Proulx @ 2018-07-23  6:09 UTC (permalink / raw)
  To: help-gnu-emacs

tomas@tuxteam.de wrote:
> Stefan Monnier wrote:
> > I'd use a hash-table (implemented in C) rather than an avl-tree
> > (implemented in Elisp).

With exactly those choices above I would agree completely.

> Plus, a (well-implemented) hash table will always be faster (for inserts
> and random lookups) than a balanced (AVL, red-black) binary tree. The
> latter affords you sorted lookup (find first greater than, output in order).
>
> You pay for that :-)

Again the hash table is good until you need to grow the table.  Then
there is a pause while memory is shuffled around.  But that never
happens with a balanced tree because that cost is always amortized
along as you go.

Also if you need to extract the entries in sorted order then the
balanced tree is already sorted.  Whereas with the hash table an extra
sort step is needed.

One of looking at this is that it is global optimization versus local
optimization.

In some ways it is swings and roundabouts.  But in other ways it
depends upon what you need.

Bob

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Most used words in current buffer
       [not found]                   ` <mailman.4074.1532326162.1292.help-gnu-emacs@gnu.org>
@ 2018-07-23  7:26                     ` Udyant Wig
  0 siblings, 0 replies; 47+ messages in thread
From: Udyant Wig @ 2018-07-23  7:26 UTC (permalink / raw)
  To: help-gnu-emacs

On 07/23/2018 11:39 AM, Bob Proulx wrote:
> tomas@tuxteam.de wrote:
>> Stefan Monnier wrote:
>>> I'd use a hash-table (implemented in C) rather than an avl-tree
>>> (implemented in Elisp).
>
> With exactly those choices above I would agree completely.

Would you choose differently if Emacs had a native code compiler besides
a byte code one?

>> Plus, a (well-implemented) hash table will always be faster (for
>> inserts and random lookups) than a balanced (AVL, red-black) binary
>> tree. The latter affords you sorted lookup (find first greater than,
>> output in order).
>>
>> You pay for that :-)
>
> Again the hash table is good until you need to grow the table.  Then
> there is a pause while memory is shuffled around.  But that never
> happens with a balanced tree because that cost is always amortized
> along as you go.

Yes.

> Also if you need to extract the entries in sorted order then the
> balanced tree is already sorted.  Whereas with the hash table an extra
> sort step is needed.

It just occured to me that when I obtain a list of the elements of the
AVL tree solution, I get a sorted list, though in ascending order.
After this, I sort to get the list in descending order.  I could have
avoided this silly sorting and simply reversed the list.

> One of looking at this is that it is global optimization versus local
> optimization.
>
> In some ways it is swings and roundabouts.  But in other ways it
> depends upon what you need.

Which might entail more swings and roundabouts.

> Bob

Udyant Wig
-- 
We make our discoveries through our mistakes: we watch one another's
success: and where there is freedom to experiment there is hope to
improve.
                                -- Arthur Quiller-Couch



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Most used words in current buffer
  2018-07-23  6:09                   ` Bob Proulx
@ 2018-07-23  7:34                     ` tomas
  0 siblings, 0 replies; 47+ messages in thread
From: tomas @ 2018-07-23  7:34 UTC (permalink / raw)
  To: help-gnu-emacs

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Mon, Jul 23, 2018 at 12:09:15AM -0600, Bob Proulx wrote:

[...]

> Again the hash table is good until you need to grow the table.  Then
> there is a pause while memory is shuffled around.

Precisely. To get this "amortized", you usually grow by some constant
factor, and there's where the log n comes in (which a tree has naturally
built in).

For really big hash tables (think all *dbm variations which are not
some cousins of B-trees), you even have several "levels", at which
point the hash tables start resembling B-trees somewhat (see below).

> But that never happens with a balanced tree because that cost is
> always amortized along as you go.

But once you get big, you start "feeling" the structure of your
storage (i.e. that the "storage as a big constant-time access
array" is just a lie). Having two pointers per node is horribly
wasteful, and if those pointers point to different continents
in your RAM, your CPU ends up waiting on cache fills all the
time.

> Also if you need to extract the entries in sorted order then the
> balanced tree is already sorted.  Whereas with the hash table an extra
> sort step is needed.

That's right: once you need sorted walks or even lookup by sorted
index (i.e. "find the first entry whose value is greater or equal
to X" -- or "find all entries whose values are between X and Y"),
ordered trees are (probably) the way to go. But I'd guess, once
the tree becomes significant in size (yeah, handwaving here :)
some kind of B-tree, where you manage several entries in one node,
should be at an advantage.

> One of looking at this is that it is global optimization versus local
> optimization.

And the ratio of lookups per insert. And whether you want a sorted
index.

> In some ways it is swings and roundabouts.

Nicely put :-)

> But in other ways it depends upon what you need.

Absolutely

Cheers
- -- t
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)

iEYEARECAAYFAltVhQQACgkQBcgs9XrR2kYaIACfdi/XvTyfdbJ9yk4my+eJziMt
yyYAnjw7NadASHDsgDrCJpWdzvDpmxDm
=5dmS
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2018-07-23  7:34 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2018-07-17  9:28 Most used words in current buffer Udyant Wig
2018-07-17 18:41 ` Emanuel Berg
2018-07-18  9:36   ` Udyant Wig
2018-07-18 11:48     ` Emanuel Berg
2018-07-18 14:50       ` Udyant Wig
2018-07-18 16:32         ` Emanuel Berg
2018-07-18 22:39     ` Ben Bacarisse
2018-07-19  0:45       ` Bob Proulx
     [not found]       ` <mailman.3785.1531961144.1292.help-gnu-emacs@gnu.org>
2018-07-19  5:33         ` Udyant Wig
2018-07-19  7:04           ` Bob Proulx
2018-07-19  7:25             ` tomas
2018-07-19 17:19             ` Nick Dokos
2018-07-19 17:30               ` Eli Zaretskii
2018-07-19 20:08               ` Bob Proulx
2018-07-20 16:39                 ` Nick Dokos
     [not found]                 ` <mailman.3909.1532104802.1292.help-gnu-emacs@gnu.org>
2018-07-20 18:13                   ` Udyant Wig
2018-07-20 22:24                     ` Bob Newell
2018-07-21  0:00                       ` Nick Dokos
2018-07-21  0:18                     ` Nick Dokos
     [not found]               ` <mailman.3843.1532030947.1292.help-gnu-emacs@gnu.org>
2018-07-20  6:19                 ` Udyant Wig
2018-07-20 23:25                   ` Bob Proulx
2018-07-21  0:26                     ` Nick Dokos
2018-07-21  4:03                       ` Bob Proulx
     [not found]                   ` <mailman.3934.1532129163.1292.help-gnu-emacs@gnu.org>
2018-07-21 13:39                     ` Udyant Wig
     [not found]             ` <mailman.3826.1532020800.1292.help-gnu-emacs@gnu.org>
2018-07-20  5:52               ` Udyant Wig
     [not found]           ` <mailman.3796.1531983885.1292.help-gnu-emacs@gnu.org>
2018-07-19 13:26             ` Udyant Wig
2018-07-19 20:42               ` Bob Proulx
2018-07-20  3:08                 ` Bob Newell
     [not found]                 ` <mailman.3861.1532056120.1292.help-gnu-emacs@gnu.org>
2018-07-21 12:51                   ` Udyant Wig
2018-07-21 16:15                     ` Eric Abrahamsen
     [not found]                     ` <mailman.3982.1532189751.1292.help-gnu-emacs@gnu.org>
2018-07-21 19:46                       ` Udyant Wig
2018-07-22  3:57                         ` Eric Abrahamsen
2018-07-22  4:00                           ` Eric Abrahamsen
2018-07-22  4:05                             ` Eric Abrahamsen
     [not found]                           ` <mailman.4008.1532232144.1292.help-gnu-emacs@gnu.org>
2018-07-22 18:28                             ` Udyant Wig
2018-07-22 20:05                               ` Eric Abrahamsen
     [not found]                         ` <mailman.4007.1532231884.1292.help-gnu-emacs@gnu.org>
2018-07-22 18:19                           ` Udyant Wig
     [not found]               ` <mailman.3845.1532032966.1292.help-gnu-emacs@gnu.org>
2018-07-20 13:18                 ` Udyant Wig
2018-07-21 18:22               ` Stefan Monnier
2018-07-22  9:02                 ` tomas
2018-07-23  6:09                   ` Bob Proulx
2018-07-23  7:34                     ` tomas
     [not found]                   ` <mailman.4074.1532326162.1292.help-gnu-emacs@gnu.org>
2018-07-23  7:26                     ` Udyant Wig
     [not found]                 ` <mailman.4013.1532250176.1292.help-gnu-emacs@gnu.org>
2018-07-22 18:58                   ` Udyant Wig
     [not found]               ` <mailman.3991.1532197378.1292.help-gnu-emacs@gnu.org>
2018-07-21 19:39                 ` Udyant Wig
2018-07-21 20:54                   ` Stefan Monnier
     [not found]                   ` <mailman.3995.1532206511.1292.help-gnu-emacs@gnu.org>
2018-07-22 18:43                     ` Udyant Wig

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).