Any faster way to find frequency of words?

unofficial mirror of help-gnu-emacs@gnu.org

* Any faster way to find frequency of words?
@ 2021-05-09 14:38 Jean Louis
  2021-05-09 14:56 ` Eric Abrahamsen
  2021-05-09 15:02 ` Emanuel Berg via Users list for the GNU Emacs text editor
  0 siblings, 2 replies; 15+ messages in thread
From: Jean Louis @ 2021-05-09 14:38 UTC (permalink / raw)
  To: Help GNU Emacs

I am interested if there is some better way for Emacs Lisp to find
frequency of words.

Purpose is to create HTML clickable tag clouds similar to image tag
clouds. But I will invoke Perl from Emacs to generate it. For that, I
have to analyze the text first.

(setq text "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec a diam
lectus. Sed sit amet ipsum mauris. Maecenas congue ligula ac quam
viverra nec consectetur ante hendrerit. Maecenas congue ligula ac quam
viverra nec consectetur ante hendrerit..")

(defun text-alphabetic-only (text)
  "Return alphabetic characters from TEXT."
  (replace-regexp-in-string "[^[:alpha:]]" " " text))

(defun word-frequency (text &optional length)
  "Returns word frequency as hash from TEXT."
  (let* ((hash (make-hash-table :test 'equal))
	 (text (text-alphabetic-only text))
	 (words (split-string text " " t " ")))
    (mapc (lambda (word)
	    (when (> (length word) 2)
	      (let ((word (downcase word)))
		(if (numberp (gethash word hash))
		    (puthash word (1+ (gethash word hash)) hash)
		  (puthash word 1 hash)))))
	  words)
    hash))

(word-frequency text) ⇒ #s(hash-table size 65 test equal rehash-size 1.5 rehash-threshold 0.8125 data ("lorem" 1 "ipsum" 2 "dolor" 1 "sit" 2 "amet" 2 "consectetur" 3 "adipiscing" 1 "elit" 1 "donec" 1 "diam" 1 "lectus" 1 "sed" 1 "mauris" 1 "maecenas" 2 "congue" 2 "ligula" 2 "quam" 2 "viverra" 2 "nec" 2 "ante" 2 "hendrerit" 2))




* Re: Any faster way to find frequency of words?
  2021-05-09 14:38 Any faster way to find frequency of words? Jean Louis
@ 2021-05-09 14:56 ` Eric Abrahamsen
  2021-05-09 15:05   ` Emanuel Berg via Users list for the GNU Emacs text editor
  2021-05-09 17:16   ` Jean Louis
  2021-05-09 15:02 ` Emanuel Berg via Users list for the GNU Emacs text editor
  1 sibling, 2 replies; 15+ messages in thread
From: Eric Abrahamsen @ 2021-05-09 14:56 UTC (permalink / raw)
  To: Jean Louis; +Cc: Help GNU Emacs

Jean Louis <bugs@gnu.support> writes:

> I am interested if there is some better way for Emacs Lisp to find
> frequency of words.
>
> Purpose is to create HTML clickable tag clouds similar to image tag
> clouds. But I will invoke Perl from Emacs to generate it. For that, I
> have to analyze the text first.

Is there any particular improvement you're trying to make?

> (setq text "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec a diam
> lectus. Sed sit amet ipsum mauris. Maecenas congue ligula ac quam
> viverra nec consectetur ante hendrerit. Maecenas congue ligula ac quam
> viverra nec consectetur ante hendrerit..")
>
> (defun text-alphabetic-only (text)
>   "Return alphabetic characters from TEXT."
>   (replace-regexp-in-string "[^[:alpha:]]" " " text))
>
> (defun word-frequency (text &optional length)
>   "Returns word frequency as hash from TEXT."
>   (let* ((hash (make-hash-table :test 'equal))
> 	 (text (text-alphabetic-only text))
> 	 (words (split-string text " " t " ")))

I guess I'd suggest using Emacs syntax parsing functions, ie
`forward-word' and `buffer-substring'. Then you can fine tune the
definition of words using the local syntax table.

>     (mapc (lambda (word)
> 	    (when (> (length word) 2)
> 	      (let ((word (downcase word)))
> 		(if (numberp (gethash word hash))
> 		    (puthash word (1+ (gethash word hash)) hash)
> 		  (puthash word 1 hash)))))

While hash tables are probably best for very large texts, alists are
nice because you can use place-setting with a default, simplifying the
above to:

(cl-incf (alist-get word frequency-alist 0 nil #'equal))
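For illustration, a minimal sketch of a complete counting loop built on
that one-liner. The function name `my-word-frequency-alist' is
hypothetical and not from the original code, and the TESTFN argument of
`alist-get' assumes Emacs 26 or later:

```elisp
(require 'cl-lib)

(defun my-word-frequency-alist (words)
  "Return an alist of (WORD . COUNT) for the list of strings WORDS.
Hypothetical helper for illustration only."
  (let ((frequency-alist nil))
    (dolist (word words)
      ;; `alist-get' with a default of 0 is a valid `setf' place, so
      ;; `cl-incf' adds the entry on first sight and increments it
      ;; on every later occurrence.
      (cl-incf (alist-get word frequency-alist 0 nil #'equal)))
    frequency-alist))

(alist-get "amet"
           (my-word-frequency-alist '("amet" "ipsum" "amet"))
           nil nil #'equal)
;; ⇒ 2
```

Every lookup walks the alist, which is why this is probably only
attractive for short texts.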

Eric


* Re: Any faster way to find frequency of words?
  2021-05-09 14:56 ` Eric Abrahamsen
@ 2021-05-09 15:05   ` Emanuel Berg via Users list for the GNU Emacs text editor
  2021-05-09 17:16   ` Jean Louis
  1 sibling, 0 replies; 15+ messages in thread
From: Emanuel Berg via Users list for the GNU Emacs text editor @ 2021-05-09 15:05 UTC (permalink / raw)
  To: help-gnu-emacs

> While hash tables are probably best for very large texts,
> alists are nice because you can use place-setting with
> a default, simplifying the above to:
>
> (cl-incf (alist-get word frequency-alist 0 nil #'equal))

Here is one solution already:

  https://emacs.stackexchange.com/a/13518

-- 
underground experts united
https://dataswamp.org/~incal


* Re: Any faster way to find frequency of words?
  2021-05-09 14:56 ` Eric Abrahamsen
  2021-05-09 15:05   ` Emanuel Berg via Users list for the GNU Emacs text editor
@ 2021-05-09 17:16   ` Jean Louis
  2021-05-10  3:37     ` Eric Abrahamsen
  1 sibling, 1 reply; 15+ messages in thread
From: Jean Louis @ 2021-05-09 17:16 UTC (permalink / raw)
  To: Eric Abrahamsen; +Cc: Help GNU Emacs

* Eric Abrahamsen <eric@ericabrahamsen.net> [2021-05-09 17:57]:
> Jean Louis <bugs@gnu.support> writes:
> 
> > I am interested if there is some better way for Emacs Lisp to find
> > frequency of words.
> >
> > Purpose is to create HTML clickable tag clouds similar to image tag
> > clouds. But I will invoke Perl from Emacs to generate it. For that, I
> > have to analyze the text first.
> 
> Is there any particular improvement you're trying to make?

I am invoking Perl on the fly and producing clickable HTML tag
cloud. It would be boring and tiresome to re-write Perl's module into
Emacs Lisp, though useful. For now, I rather just do it on the fly.

As HTML tags are created from text, I need nothing but alphabetical
characters. Function is invoked rarely.

It is also useful to generate tags for particular text, that helps me
to curate WWW pages.

> I guess I'd suggest using Emacs syntax parsing functions, ie
> `forward-word' and `buffer-substring'. Then you can fine tune the
> definition of words using the local syntax table.

That is also interesting approach, it could just go over the words and
enter them into list.

> >     (mapc (lambda (word)
> > 	    (when (> (length word) 2)
> > 	      (let ((word (downcase word)))
> > 		(if (numberp (gethash word hash))
> > 		    (puthash word (1+ (gethash word hash)) hash)
> > 		  (puthash word 1 hash)))))
> 
> While hash tables are probably best for very large texts, alists are
> nice because you can use place-setting with a default, simplifying the
> above to:
> 
> (cl-incf (alist-get word frequency-alist 0 nil #'equal))

The idea gave me idea to use the defaults from hashes, so I have made
it now as below (puthash word (1+ (gethash word hash 0)) hash), that
is result of brain storming here...

(defun rcd-word-frequency (text &optional length)
  "Return word frequency as hash from TEXT.

Words of LENGTH characters or fewer (default 3) are discarded from counting."
  (let* ((hash (make-hash-table :test 'equal))
	 (text (text-alphabetic-only text))
	 (length (or length 3))
	 (words (split-string text " " t " "))
	 (words (mapcar 'downcase words))
	 (words (mapcar (lambda (word) (when (> (length word) length) word)) words))
	 (words (delq nil words)))
    (mapc (lambda (word)
	    (puthash word (1+ (gethash word hash 0)) hash))
	  words)
    hash))

I am not sure if I should rather collect it into alist. Maybe I could
collect it straight into by frequency ordered list like:

(("word" 9) ("another" 7) ("more" 3))

That is what I am doing here, to construct string of most frequent tags:

(defun rcd-word-frequency-string (text &optional length how-many-words)
  (let* ((words (rcd-word-frequency text length))
	 (words (hash-to-list words))
	 (number (or how-many-words 20))
	 (frequent (seq-sort (lambda (a b)
			       (> (cadr a) (cadr b)))
			     words)))
    (mapconcat (lambda (a) (car a)) (butlast frequent (- (length frequent) number)) " ")))


(rcd-word-frequency-string text nil 5) ⇒ "consectetur ipsum amet maecenas congue"


-- 
Jean

Take action in Free Software Foundation campaigns:
https://www.fsf.org/campaigns

Sign an open letter in support of Richard M. Stallman
https://stallmansupport.org/
https://rms-support-letter.github.io/





* Re: Any faster way to find frequency of words?
  2021-05-09 17:16   ` Jean Louis
@ 2021-05-10  3:37     ` Eric Abrahamsen
  2021-05-10  7:14       ` Jean Louis
  0 siblings, 1 reply; 15+ messages in thread
From: Eric Abrahamsen @ 2021-05-10  3:37 UTC (permalink / raw)
  To: Help GNU Emacs

Jean Louis <bugs@gnu.support> writes:

> * Eric Abrahamsen <eric@ericabrahamsen.net> [2021-05-09 17:57]:
>> Jean Louis <bugs@gnu.support> writes:
>> 
>> > I am interested if there is some better way for Emacs Lisp to find
>> > frequency of words.
>> >
>> > Purpose is to create HTML clickable tag clouds similar to image tag
>> > clouds. But I will invoke Perl from Emacs to generate it. For that, I
>> > have to analyze the text first.
>> 
>> Is there any particular improvement you're trying to make?
>
> I am invoking Perl on the fly and producing clickable HTML tag
> cloud. It would be boring and tiresome to re-write Perl's module into
> Emacs Lisp, though useful. For now, I rather just do it on the fly.
>
> As HTML tags are created from text, I need nothing but alphabetical
> characters. Function is invoked rarely.
>
> It is also useful to generate tags for particular text, that helps me
> to curate WWW pages.

Right, but what I meant was, is there anything wrong with the
implementation you posted?

>> I guess I'd suggest using Emacs syntax parsing functions, ie
>> `forward-word' and `buffer-substring'. Then you can fine tune the
>> definition of words using the local syntax table.
>
> That is also interesting approach, it could just go over the words and
> enter them into list.

Yes, and it can help you skip garbage characters that shouldn't count as
words. Things like `(skip-syntax-forward "^w")` (meaning "skip a run of
characters that aren't word constituents") can be very useful.
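A rough sketch of what such a syntax-driven pass could look like. It
uses `skip-syntax-forward' in both directions rather than
`forward-word', and the name `my-buffer-word-frequency' and its
MIN-LENGTH argument are assumptions made up for this example:

```elisp
(defun my-buffer-word-frequency (&optional min-length)
  "Count words in the current buffer using its syntax table.
Return a hash table mapping downcased words to counts.  Words
shorter than MIN-LENGTH (default 3) are ignored.  Hypothetical
helper for illustration only."
  (let ((hash (make-hash-table :test 'equal))
        (min-length (or min-length 3)))
    (save-excursion
      (goto-char (point-min))
      ;; Skip any leading run of non-word characters.
      (skip-syntax-forward "^w")
      (while (not (eobp))
        (let ((start (point)))
          ;; Move over one run of word-constituent characters.
          (skip-syntax-forward "w")
          (let ((word (downcase
                       (buffer-substring-no-properties start (point)))))
            (when (>= (length word) min-length)
              (puthash word (1+ (gethash word hash 0)) hash))))
        ;; Skip punctuation, whitespace, etc. up to the next word.
        (skip-syntax-forward "^w")))
    hash))

(with-temp-buffer
  (insert "Sed sit amet ipsum. Sit amet!")
  (gethash "amet" (my-buffer-word-frequency)))
;; ⇒ 2
```

What counts as a word then depends on the buffer's syntax table, so
the same function behaves differently in, say, a text buffer and a
programming-mode buffer.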

>> >     (mapc (lambda (word)
>> > 	    (when (> (length word) 2)
>> > 	      (let ((word (downcase word)))
>> > 		(if (numberp (gethash word hash))
>> > 		    (puthash word (1+ (gethash word hash)) hash)
>> > 		  (puthash word 1 hash)))))
>> 
>> While hash tables are probably best for very large texts, alists are
>> nice because you can use place-setting with a default, simplifying the
>> above to:
>> 
>> (cl-incf (alist-get word frequency-alist 0 nil #'equal))
>
> The idea gave me idea to use the defaults from hashes, so I have made
> it now as below (puthash word (1+ (gethash word hash 0)) hash), that
> is result of brain storming here...

> (defun rcd-word-frequency (text &optional length)
>   "Returns word frequency as hash from TEXT.
>
> Words smaller than LENGTH are discarded from counting."
>   (let* ((hash (make-hash-table :test 'equal))
> 	 (text (text-alphabetic-only text))
> 	 (length (or length 3))
> 	 (words (split-string text " " t " "))
> 	 (words (mapcar 'downcase words))
> 	 (words (mapcar (lambda (word) (when (> (length word) length) word)) words))
> 	 (words (delq nil words)))
>     (mapc (lambda (word)
> 	    (puthash word (1+ (gethash word hash 0)) hash))

I totally forgot that `gethash' has a default argument! So the line
above can just be:

(cl-incf (gethash word hash 0))

I don't know why, but I really enjoy that.
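A small self-contained sketch of that idiom, with a hypothetical
helper name (`my-count-words') chosen only for this example:

```elisp
(require 'cl-lib)

(defun my-count-words (words)
  "Return a hash table mapping each string in WORDS to its count.
Hypothetical helper for illustration only."
  (let ((hash (make-hash-table :test 'equal)))
    (dolist (word words)
      ;; With the default of 0, a missing key reads as 0 and is then
      ;; stored as 1; an existing key is simply incremented.
      (cl-incf (gethash word hash 0)))
    hash))

(let ((hash (my-count-words '("quam" "nec" "quam"))))
  (list (gethash "quam" hash) (gethash "nec" hash)))
;; ⇒ (2 1)
```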

> 	  words)
>     hash))
>
> I am not sure if I should rather collect it into alist. Maybe I could
> collect it straight into by frequency ordered list like:
>
> (("word" 9) ("another" 7) ("more" 3))
>
> That is what I am doing here, to construct string of most frequent tags:
>
> (defun rcd-word-frequency-string (text &optional length how-many-words)
>   (let* ((words (rcd-word-frequency text length))
> 	 (words (hash-to-list words))
> 	 (number (or how-many-words 20))
> 	 (frequent (seq-sort (lambda (a b)
> 			       (> (cadr a) (cadr b)))
> 			     words)))
>     (mapconcat (lambda (a) (car a)) (butlast frequent (- (length frequent) number)) " ")))

I don't have a `hash-to-list' function, but once you've built your table
it seems like the rest of it is fairly straightforward.




* Re: Any faster way to find frequency of words?
  2021-05-10  3:37     ` Eric Abrahamsen
@ 2021-05-10  7:14       ` Jean Louis
  2021-05-10 14:02         ` [External] : " Drew Adams
  0 siblings, 1 reply; 15+ messages in thread
From: Jean Louis @ 2021-05-10  7:14 UTC (permalink / raw)
  To: Eric Abrahamsen; +Cc: Help GNU Emacs

* Eric Abrahamsen <eric@ericabrahamsen.net> [2021-05-10 06:38]:
> > It is also useful to generate tags for particular text, that helps me
> > to curate WWW pages.
> 
> Right, but what I meant was, is there anything wrong with the
> implementation you posted?

Thank you. In practice it gives me the result I wanted, though I have
not tested it well enough to say whether something is technically
wrong. I use it on smaller chunks of text, where it appears pretty
fast; it would be very slow if I were to use it on a huge number of
documents.

On a document of 246000 bytes it takes a few seconds. But that is not
a problem: I do not have many such documents and I am not
iterating. It is for generation of tags.

I think this is full set of functions:

(defun hash-to-list (hash)
  "Convert hash HASH to list"
  (let (list)
    (maphash (lambda (key value) (setq list (append list (list (list key value))))) hash)
    list))

(defun text-alphabetic-only (text)
  "Return alphabetic characters from TEXT."
  (replace-regexp-in-string "[^[:alpha:]]" " " text))

(defun rcd-word-frequency (text &optional length)
  "Return word frequency as hash from TEXT.

Words of LENGTH characters or fewer (default 3) are discarded from counting."
  (let* ((hash (make-hash-table :test 'equal))
	 (text (text-alphabetic-only text))
	 (length (or length 3))
	 (words (split-string text " " t " "))
	 (words (mapcar 'downcase words))
	 (words (mapcar (lambda (word) (when (> (length word) length) word)) words))
	 (words (delq nil words)))
    (mapc (lambda (word)
	    (puthash word (1+ (gethash word hash 0)) hash))
	  words)
    hash))

(defun rcd-word-frequency-list (text &optional length)
  "Return the word frequency list of pairs, most frequent first.

First item of the pair is the word, second the word count.

It will analyze TEXT, with minimum word LENGTH."
  (let* ((words (rcd-word-frequency text length))
	 (words (hash-to-list words))
	 (frequent (seq-sort (lambda (a b)
			       (> (cadr a) (cadr b)))
			     words)))
    frequent))

(defun rcd-word-frequency-string (text &optional length how-many)
  "Return string with most frequent words in TEXT.

Use LENGTH to designate minimum length of words to analyze.

Return HOW-MANY words (default 20)."
  (let ((frequent (rcd-word-frequency-list text length))
	(how-many (or how-many 20)))
    (mapconcat (lambda (a) (car a)) (butlast frequent (- (length frequent) how-many)) " ")))

(defun rcd-word-frequency-buffer (&optional how-many)
  (interactive)
  (let* ((how-many (or how-many (read-number "How many most frequent words you wish to see? ")))
	 (text (buffer-string))
	 (frequent (rcd-word-frequency-list text))
	 (report (mapconcat (lambda (a) (format "%s:%s " (car a) (cadr a))) (butlast frequent (- (length frequent) how-many)) " ")))
    (prog1
	report
      (message "%s" report))))

(rcd-word-frequency-buffer 10) ⇒ "word:44  words:35  text:28  hash:28  length:25  list:17  frequency:16  frequent:14  many:11  lambda:11 "
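One thing that may contribute to the run time on larger texts:
`hash-to-list' above calls `append' once per key, which copies the
growing list each time, so the conversion is quadratic in the number
of distinct words. Below is a sketch of a linear variant that uses
`push' and reverses once, plus a top-N helper built on `seq-take'. The
names `my-hash-to-list' and `my-top-words' are hypothetical, and I have
not measured whether this is the actual bottleneck:

```elisp
(require 'seq)

(defun my-hash-to-list (hash)
  "Return a list of (KEY VALUE) lists from HASH, in `maphash' order."
  (let (result)
    (maphash (lambda (key value) (push (list key value) result)) hash)
    ;; `push' builds the list backwards; reverse it once at the end.
    (nreverse result)))

(defun my-top-words (hash how-many)
  "Return up to HOW-MANY (WORD COUNT) entries of HASH, most frequent first."
  (seq-take (seq-sort (lambda (a b) (> (cadr a) (cadr b)))
                      (my-hash-to-list hash))
            how-many))
```

`seq-take' also copes with HOW-MANY being larger than the list, so no
length arithmetic is needed.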

> >> I guess I'd suggest using Emacs syntax parsing functions, ie
> >> `forward-word' and `buffer-substring'. Then you can fine tune the
> >> definition of words using the local syntax table.
> >
> > That is also interesting approach, it could just go over the words and
> > enter them into list.
> 
> Yes, and it can help you skip garbage characters that shouldn't count as
> words. Things like `(skip-syntax-forward "^w")` (meaning "skip a run of
> characters that aren't word constituents") can be very useful.

For now I just skip words by their length and count only alphabetic
characters. Purpose is just to generate tags for HTML pages.

Once tags have been generated, I can use PostgreSQL database to find
documents with most frequent tags.

Generation of tags is human curated, not automatic. Thus such a
function is invoked only on specific documents. It suggests tags for
me to edit; it does not create tags without my supervision.

For example, "https" is not a useful tag if the article does not
discuss it, so I have to delete such tags.

> > Words smaller than LENGTH are discarded from counting."
> >   (let* ((hash (make-hash-table :test 'equal))
> > 	 (text (text-alphabetic-only text))
> > 	 (length (or length 3))
> > 	 (words (split-string text " " t " "))
> > 	 (words (mapcar 'downcase words))
> > 	 (words (mapcar (lambda (word) (when (> (length word) length) word)) words))
> > 	 (words (delq nil words)))
> >     (mapc (lambda (word)
> > 	    (puthash word (1+ (gethash word hash 0)) hash))
> 
> I totally forgot that `gethash' has a default argument! So the line
> above can just be:
> 
> (cl-incf (gethash word hash 0))

You like cl-incf and I use 1+, I am not sure if this macro would maybe
slow it down. That is why I tend to skip macros. And let us say I wish
to make package for word frequencies, it would not need to require
cl-lib library.

(defmacro cl-incf (place &optional x)
  "Increment PLACE by X (1 by default).
PLACE may be a symbol, or any generalized variable allowed by `setf'.
The return value is the incremented value of PLACE."
  (declare (debug (place &optional form)))
  (if (symbolp place)
      (list 'setq place (if x (list '+ place x) (list '1+ place)))
    (list 'cl-callf '+ place (or x 1))))

> > (defun rcd-word-frequency-string (text &optional length how-many-words)
> >   (let* ((words (rcd-word-frequency text length))
> > 	 (words (hash-to-list words))
> > 	 (number (or how-many-words 20))
> > 	 (frequent (seq-sort (lambda (a b)
> > 			       (> (cadr a) (cadr b)))
> > 			     words)))
> >     (mapconcat (lambda (a) (car a)) (butlast frequent (- (length frequent) number)) " ")))
> 
> I don't have a `hash-to-list' function, but once you've built your table
> it seems like the rest of it is fairly straightforward.

I use those functions below.

;;;; ━━━━━━━━━━━━━━━━━━
;;;;   HASH FUNCTIONS
;;;; ━━━━━━━━━━━━━━━━━━

(defun hash-to-plist (hash)
  "Convert hash HASH to plist."
  (let (plist)
    (maphash (lambda (key value) (push key plist) (push value plist)) hash)
    (reverse plist)))

(defun hash-to-alist (hash)
  "Convert hash HASH to alist"
  (let (alist)
    (maphash (lambda (key value) (push (cons key value) alist)) hash)
    alist))

(defun hash-to-list (hash)
  "Convert hash HASH to list"
  (let (list)
    (maphash (lambda (key value) (setq list (append list (list (list key value))))) hash)
    list))

(defun hash-append (h1 &rest hashes)
  "Return H1 hash appended with HASHES."
  (mapc 
   (lambda (hash)
     (maphash 
      (lambda (key value) (puthash key value h1)) hash))
   hashes)
  h1)



-- 
Jean


* RE: [External] : Re: Any faster way to find frequency of words?
  2021-05-10  7:14       ` Jean Louis
@ 2021-05-10 14:02         ` Drew Adams
  2021-05-10 16:26           ` Jean Louis
  0 siblings, 1 reply; 15+ messages in thread
From: Drew Adams @ 2021-05-10 14:02 UTC (permalink / raw)
  To: Jean Louis, Eric Abrahamsen; +Cc: Help GNU Emacs

> (defun hash-to-list (hash)
>   "Convert hash HASH to list"
>   (let (list)
>     (maphash (lambda (key value)
>                (setq list (append list
>                                   (list (list key value)))))
>                                   hash)
>     list))

I use this, FWIW:

(defun hash-table-to-alist (hash-table)
  "Create and return an alist created from HASH-TABLE.
The order of alist entries is undefined, but it seems to be the same
as the order of hash-table entries (which seems to be the order in
which the entries were added to the table)."
  (let ((al  ()))
    (maphash (lambda (key val) (push (cons key val) al))
             hash-table)
    (nreverse al)))



* Re: [External] : Re: Any faster way to find frequency of words?
  2021-05-10 14:02         ` [External] : " Drew Adams
@ 2021-05-10 16:26           ` Jean Louis
  2021-05-10 16:34             ` Drew Adams
  0 siblings, 1 reply; 15+ messages in thread
From: Jean Louis @ 2021-05-10 16:26 UTC (permalink / raw)
  To: Drew Adams; +Cc: Eric Abrahamsen, Help GNU Emacs

* Drew Adams <drew.adams@oracle.com> [2021-05-10 17:03]:
> I use this, FWIW:
> 
> (defun hash-table-to-alist (hash-table)
>   "Create and return an alist created from HASH-TABLE.
> The order of alist entries is undefined, but it seems to be the same
> as the order of hash-table entries (which seems to be the order in
> which the entries were added to the table)."
>   (let ((al  ()))
>     (maphash (lambda (key val) (push (cons key val) al))
>              hash-table)
>     (nreverse al)))

That may be better, nicer.

I wonder if nreverse is really needed as function just returns some
data, is that data anyway destroyed thereafter?

Then I was also using reverse, I will take it out, as I don't think
there is any order in the hash, if I reverse it or not, it does not
matter.

(setq hash (make-hash-table))
(puthash 'Name "Jimmy" hash)
(puthash "City" "New York" hash)
(puthash "Brigade" "II" hash)

hash ⇒ #s(hash-table size 65 test eql rehash-size 1.5 rehash-threshold 0.8125 data (Name "Jimmy" "City" "New York" "Brigade" "II"))

(puthash 'Name "Jimmy2" hash)

hash ⇒ #s(hash-table size 65 test eql rehash-size 1.5 rehash-threshold 0.8125 data (Name "Jimmy2" "City" "New York" "Brigade" "II"))

I can see that hash keeps order of entries, but I don't believe
that is guaranteed. Visually it gives us the same order:

(hash-table-to-alist hash) ⇒ ((Name . "Jimmy2") ("City" . "New York") ("Brigade" . "II"))

(setq alist (hash-table-to-alist hash)) ⇒ ((Name . "Jimmy2") ("City" . "New York") ("Brigade" . "II"))

(assoc 'Name alist) ⇒ (Name . "Jimmy2")

I just wonder if the order matters. It should not matter in hash,
alist, plist I guess.
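A side note on the table created with plain `(make-hash-table)' above:
its test is `eql', so a string key is only found again when the very
same string object is used for the lookup. A small sketch of the
difference, with a hypothetical helper name:

```elisp
(defun my-string-key-lookup (test)
  "Store \"City\" in a table using TEST, then look it up with a fresh string.
Hypothetical helper for illustration only."
  (let ((hash (make-hash-table :test test)))
    (puthash (copy-sequence "City") "New York" hash)
    ;; `copy-sequence' guarantees the lookup key is a distinct string
    ;; object with the same contents.
    (gethash (copy-sequence "City") hash)))

(my-string-key-lookup 'eql)    ; ⇒ nil
(my-string-key-lookup 'equal)  ; ⇒ "New York"
```

Symbols such as `Name' work with `eql' because a given symbol is always
the same object; for word counting with string keys, `:test 'equal' is
the one to use.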




-- 
Jean


* RE: [External] : Re: Any faster way to find frequency of words?
  2021-05-10 16:26           ` Jean Louis
@ 2021-05-10 16:34             ` Drew Adams
  2021-05-10 17:05               ` Jean Louis
  0 siblings, 1 reply; 15+ messages in thread
From: Drew Adams @ 2021-05-10 16:34 UTC (permalink / raw)
  To: Jean Louis; +Cc: Eric Abrahamsen, Help GNU Emacs

> >   (let ((al  ()))
> >     (maphash (lambda (key val) (push (cons key val) al))
> >              hash-table)
> >     (nreverse al)))
> 
> That may be better, nicer.
> 
> I wonder if nreverse is really needed as function just returns some
> data, is that data anyway destroyed thereafter?

It creates new list structure, for local var `al'.
So there's no problem with destructively modifying
that list structure - nothing else can be using it
in this context.

> I just wonder if the order matters. It should not matter in hash,
> alist, plist I guess.

Order can matter in an alist or plist.

"Can", because certainly some code can use such a
list without caring about the order.

Alists, in particular, are expressly designed to
allow for multiple elements with the same key.
For most purposes, only the first element with
the same key is accessed; it "shadows" subsequent
elements in the list.  There are various advantages
to being able to have multiple entries.
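
A tiny sketch of that shadowing behavior (the function name
`my-alist-shadowing-demo' is made up for this example):

```elisp
(defun my-alist-shadowing-demo ()
  "Return a list showing how a newer alist entry shadows an older one.
Hypothetical helper for illustration only."
  (let ((alist (list (cons 'name "Jimmy"))))
    ;; Pushing a second entry with the same key shadows the first.
    (push (cons 'name "Jimmy2") alist)
    (list (alist-get 'name alist)          ; newest entry wins
          (length alist)                   ; both entries are still there
          (alist-get 'name (cdr alist)))))  ; older value, past the new entry

(my-alist-shadowing-demo)
;; ⇒ ("Jimmy2" 2 "Jimmy")
```

So prepending is a cheap way to override a value temporarily, and
removing the first entry restores the old one.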


* Re: [External] : Re: Any faster way to find frequency of words?
  2021-05-10 16:34             ` Drew Adams
@ 2021-05-10 17:05               ` Jean Louis
  0 siblings, 0 replies; 15+ messages in thread
From: Jean Louis @ 2021-05-10 17:05 UTC (permalink / raw)
  To: Drew Adams; +Cc: Eric Abrahamsen, Help GNU Emacs

* Drew Adams <drew.adams@oracle.com> [2021-05-10 19:35]:
> > I just wonder if the order matters. It should not matter in hash,
> > alist, plist I guess.
> 
> Order can matter in an alist or plist.

It could matter if one accesses it with something other than those
specific Emacs Lisp functions. For example, I created an alist of the
most frequent words, but then realized it need not be an alist; it can
be a simple list of lists. But maybe something is faster internally.

-- 
Jean


* Re: Any faster way to find frequency of words?
  2021-05-09 14:38 Any faster way to find frequency of words? Jean Louis
  2021-05-09 14:56 ` Eric Abrahamsen
@ 2021-05-09 15:02 ` Emanuel Berg via Users list for the GNU Emacs text editor
  2021-05-09 17:19   ` Jean Louis
  1 sibling, 1 reply; 15+ messages in thread
From: Emanuel Berg via Users list for the GNU Emacs text editor @ 2021-05-09 15:02 UTC (permalink / raw)
  To: help-gnu-emacs

Jean Louis wrote:

> 	 (text (text-alphabetic-only text))
> 	 (words (split-string text " " t " ")))

Here is what I would try first

  1. `buffer-substring'
  2. `split-string'
  3. `delete-dups'
  4. loop and do `how-many'
  5. get a new list with '(occurrences word)
  6. sort WRT occurrences
  7. loop and (insert (format "%d %s\n" occ wrd))

Easy. Fast enough?
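
A rough sketch of steps 1 to 6 of the recipe above, under some
assumptions: the text is copied into a temporary buffer, non-alphabetic
characters are blanked out first (as `text-alphabetic-only' does), and
the function name `my-word-frequency-how-many' is hypothetical:

```elisp
(defun my-word-frequency-how-many (text)
  "Return a list of (COUNT WORD) for TEXT, most frequent first.
Hypothetical sketch; it rescans the text once per distinct word,
so it is simple rather than fast."
  (with-temp-buffer
    ;; Keep only letters, lowercased, separated by spaces.
    (insert (replace-regexp-in-string "[^[:alpha:]]" " " (downcase text)))
    (let* ((words (delete-dups (split-string (buffer-string) " " t)))
           (counts (mapcar
                    (lambda (word)
                      ;; \\< and \\> anchor the match at word boundaries.
                      (list (how-many (concat "\\<" (regexp-quote word) "\\>")
                                      (point-min) (point-max))
                            word))
                    words)))
      (sort counts (lambda (a b) (> (car a) (car b)))))))

(my-word-frequency-how-many "Sed sit amet. Sit amet ipsum")
```

Step 7 would then be a `dolist' over the result that inserts each pair
with `format'.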


* Re: Any faster way to find frequency of words?
  2021-05-09 15:02 ` Emanuel Berg via Users list for the GNU Emacs text editor
@ 2021-05-09 17:19   ` Jean Louis
  2021-05-09 18:00     ` Emanuel Berg via Users list for the GNU Emacs text editor
  0 siblings, 1 reply; 15+ messages in thread
From: Jean Louis @ 2021-05-09 17:19 UTC (permalink / raw)
  To: help-gnu-emacs

* Emanuel Berg via Users list for the GNU Emacs text editor <help-gnu-emacs@gnu.org> [2021-05-09 18:05]:
> Jean Louis wrote:
> 
> > 	 (text (text-alphabetic-only text))
> > 	 (words (split-string text " " t " ")))
> 
> Here is what I would try first
> 
>   1. `buffer-substring'
>   2. `split-string'
>   3. `delete-dups'
>   4. loop and do `how-many'
>   5. get a new list with '(occurrences word)

How do you get `occurrences'? Would you count the words each time? Is
it a function? I cannot find it.

I think that your (4) is not necessary, as counting is not
necessary. 

-- 
Jean


* Re: Any faster way to find frequency of words?
  2021-05-09 17:19   ` Jean Louis
@ 2021-05-09 18:00     ` Emanuel Berg via Users list for the GNU Emacs text editor
  2021-05-09 19:03       ` Jean Louis
  0 siblings, 1 reply; 15+ messages in thread
From: Emanuel Berg via Users list for the GNU Emacs text editor @ 2021-05-09 18:00 UTC (permalink / raw)
  To: help-gnu-emacs

Jean Louis wrote:

> I think that your (4) is not necessary, as counting is
> not necessary.

Some counting is if you are to learn the frequency.

How about `forward-word' the whole buffer and for every word
feed it to a data structure, which keeps a record and a digit
and increase that by 1?

Then the challenge would be to pick a data structure where
searching is fast and in particular where search time doesn't
_grow_ fast with respect to its overall size growing (size =
the number of unique words)

BTW the theoretical worst-case would be a buffer where all
words are unique. Buffer cost is almost 1, ultimately n.
With the theoretical worst-case, data structure would be, if
linear, like this

if we denote buffer cost : data structure cost

1: 0      <-- first word
1: 1
1: 2
1: 3
..
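1: n - 1  <-- last word

Each lookup is linear, so the whole pass is quadratic.

But probably data structure cost is less than linear, say
logarithmic, then we would have

linear(n) + n * logarithmic(n)

n * logarithmic(n) will grow the faster, so the total is
n * logarithmic(n), not linear. With a hash table, where a lookup
takes constant time on average, the whole pass stays linear.

With a sensible data structure, it'll be fast enough!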


* Re: Any faster way to find frequency of words?
  2021-05-09 18:00     ` Emanuel Berg via Users list for the GNU Emacs text editor
@ 2021-05-09 19:03       ` Jean Louis
  2021-05-09 23:33         ` Emanuel Berg via Users list for the GNU Emacs text editor
  0 siblings, 1 reply; 15+ messages in thread
From: Jean Louis @ 2021-05-09 19:03 UTC (permalink / raw)
  To: help-gnu-emacs

* Emanuel Berg via Users list for the GNU Emacs text editor <help-gnu-emacs@gnu.org> [2021-05-09 21:01]:
> Jean Louis wrote:
> 
> > I think that your (4) is not necessary, as counting is
> > not necessary.
> 
> Some counting is if you are to learn the frequency.

Iterating and increasing the value is not same as counting. That first
creates the frequency of words. 

Counting could be useful when finding the most frequent words. But
even in that case a programmatic comparison of which is greater seems
to be enough. Maybe the underlying C program is counting.

> BTW the theoretical worst-case would be a buffer where all
> words are unique. Buffer cost is almost 1, ultimately n.
> With the theoretical worst-case, data structure would be, if
> linear, like this

Thank heavens this is not the theoretical case; in practice it just
finds frequencies of words in some kilobytes. For speedy searching by
word frequencies I am using PostgreSQL with an Emacs interface.

-- 
Jean


* Re: Any faster way to find frequency of words?
  2021-05-09 19:03       ` Jean Louis
@ 2021-05-09 23:33         ` Emanuel Berg via Users list for the GNU Emacs text editor
  0 siblings, 0 replies; 15+ messages in thread
From: Emanuel Berg via Users list for the GNU Emacs text editor @ 2021-05-09 23:33 UTC (permalink / raw)
  To: help-gnu-emacs

Jean Louis wrote:

> Iterating and increasing the value is not same as counting.
> That first creates the frequency of words.

Iterating and increasing the value is a method to do counting,
and perhaps here, it is the best one...

