* Any faster way to find frequency of words?
From: Jean Louis @ 2021-05-09 14:38 UTC
To: Help GNU Emacs

I am interested if there is some better way for Emacs Lisp to find
frequency of words.

Purpose is to create HTML clickable tag clouds similar to image tag
clouds. But I will invoke Perl from Emacs to generate it. For that, I
have to analyze the text first.

(setq text "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec a diam
lectus. Sed sit amet ipsum mauris. Maecenas congue ligula ac quam
viverra nec consectetur ante hendrerit. Maecenas congue ligula ac quam
viverra nec consectetur ante hendrerit..")

(defun text-alphabetic-only (text)
  "Return alphabetic characters from TEXT."
  (replace-regexp-in-string "[^[:alpha:]]" " " text))

(defun word-frequency (text &optional length)
  "Returns word frequency as hash from TEXT."
  (let* ((hash (make-hash-table :test 'equal))
         (text (text-alphabetic-only text))
         (words (split-string text " " t " ")))
    (mapc (lambda (word)
            (when (> (length word) 2)
              (let ((word (downcase word)))
                (if (numberp (gethash word hash))
                    (puthash word (1+ (gethash word hash)) hash)
                  (puthash word 1 hash)))))
          words)
    hash))

(word-frequency text) ⇒ #s(hash-table size 65 test equal rehash-size 1.5 rehash-threshold 0.8125 data ("lorem" 1 "ipsum" 2 "dolor" 1 "sit" 2 "amet" 2 "consectetur" 3 "adipiscing" 1 "elit" 1 "donec" 1 "diam" 1 "lectus" 1 "sed" 1 "mauris" 1 "maecenas" 2 "congue" 2 "ligula" 2 "quam" 2 "viverra" 2 "nec" 2 "ante" 2 "hendrerit" 2))

* Re: Any faster way to find frequency of words?
From: Eric Abrahamsen @ 2021-05-09 14:56 UTC
To: Jean Louis; +Cc: Help GNU Emacs

Jean Louis <bugs@gnu.support> writes:

> I am interested if there is some better way for Emacs Lisp to find
> frequency of words.
>
> Purpose is to create HTML clickable tag clouds similar to image tag
> clouds. But I will invoke Perl from Emacs to generate it. For that, I
> have to analyze the text first.

Is there any particular improvement you're trying to make?

> (setq text "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec a diam
> lectus. Sed sit amet ipsum mauris. Maecenas congue ligula ac quam
> viverra nec consectetur ante hendrerit. Maecenas congue ligula ac quam
> viverra nec consectetur ante hendrerit..")
>
> (defun text-alphabetic-only (text)
>   "Return alphabetic characters from TEXT."
>   (replace-regexp-in-string "[^[:alpha:]]" " " text))
>
> (defun word-frequency (text &optional length)
>   "Returns word frequency as hash from TEXT."
>   (let* ((hash (make-hash-table :test 'equal))
>          (text (text-alphabetic-only text))
>          (words (split-string text " " t " ")))

I guess I'd suggest using Emacs syntax parsing functions, ie
`forward-word' and `buffer-substring'. Then you can fine tune the
definition of words using the local syntax table.

>     (mapc (lambda (word)
>             (when (> (length word) 2)
>               (let ((word (downcase word)))
>                 (if (numberp (gethash word hash))
>                     (puthash word (1+ (gethash word hash)) hash)
>                   (puthash word 1 hash)))))

While hash tables are probably best for very large texts, alists are
nice because you can use place-setting with a default, simplifying the
above to:

(cl-incf (alist-get word frequency-alist 0 nil #'equal))

Eric

* Re: Any faster way to find frequency of words?
From: Emanuel Berg via Users list for the GNU Emacs text editor @ 2021-05-09 15:05 UTC
To: help-gnu-emacs

> While hash tables are probably best for very large texts,
> alists are nice because you can use place-setting with
> a default, simplifying the above to:
>
>   (cl-incf (alist-get word frequency-alist 0 nil #'equal))

Here is one solution already:

  https://emacs.stackexchange.com/a/13518

-- 
underground experts united
https://dataswamp.org/~incal

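The alist suggestion quoted above can be sketched as a complete function. This is only an illustration: the name `my-word-frequency-alist' is hypothetical and not from the thread, and it assumes the words have already been split and downcased.

```elisp
(require 'cl-lib)

(defun my-word-frequency-alist (words)
  "Return an alist of (WORD . COUNT) for the list of strings WORDS.
Hypothetical helper for illustration; not part of the thread."
  (let ((frequency-alist nil))
    (dolist (word words)
      ;; `alist-get' with a default of 0 and `equal' as TESTFN is a
      ;; valid `setf' place, so `cl-incf' adds a missing word with
      ;; count 1 and increments an existing one.
      (cl-incf (alist-get word frequency-alist 0 nil #'equal)))
    frequency-alist))

(my-word-frequency-alist '("sit" "amet" "sit"))
```

Each lookup walks the alist, so this stays pleasant for short texts and is likely to lose to a hash table once there are many distinct words.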
* Re: Any faster way to find frequency of words?
From: Jean Louis @ 2021-05-09 17:16 UTC
To: Eric Abrahamsen; +Cc: Help GNU Emacs

* Eric Abrahamsen <eric@ericabrahamsen.net> [2021-05-09 17:57]:
> Jean Louis <bugs@gnu.support> writes:
>
> > I am interested if there is some better way for Emacs Lisp to find
> > frequency of words.
> >
> > Purpose is to create HTML clickable tag clouds similar to image tag
> > clouds. But I will invoke Perl from Emacs to generate it. For that, I
> > have to analyze the text first.
>
> Is there any particular improvement you're trying to make?

I am invoking Perl on the fly and producing clickable HTML tag
cloud. It would be boring and tiresome to re-write Perl's module into
Emacs Lisp, though useful. For now, I rather just do it on the fly.

As HTML tags are created from text, I need nothing but alphabetical
characters. Function is invoked rarely.

It is also useful to generate tags for particular text, that helps me
to curate WWW pages.

> I guess I'd suggest using Emacs syntax parsing functions, ie
> `forward-word' and `buffer-substring'. Then you can fine tune the
> definition of words using the local syntax table.

That is also interesting approach, it could just go over the words and
enter them into list.

> >     (mapc (lambda (word)
> >             (when (> (length word) 2)
> >               (let ((word (downcase word)))
> >                 (if (numberp (gethash word hash))
> >                     (puthash word (1+ (gethash word hash)) hash)
> >                   (puthash word 1 hash)))))
>
> While hash tables are probably best for very large texts, alists are
> nice because you can use place-setting with a default, simplifying the
> above to:
>
> (cl-incf (alist-get word frequency-alist 0 nil #'equal))

The idea gave me idea to use the defaults from hashes, so I have made
it now as below (puthash word (1+ (gethash word hash 0)) hash), that
is result of brain storming here...

(defun rcd-word-frequency (text &optional length)
  "Returns word frequency as hash from TEXT.

Words smaller than LENGTH are discarded from counting."
  (let* ((hash (make-hash-table :test 'equal))
         (text (text-alphabetic-only text))
         (length (or length 3))
         (words (split-string text " " t " "))
         (words (mapcar 'downcase words))
         (words (mapcar (lambda (word) (when (> (length word) length) word)) words))
         (words (delq nil words)))
    (mapc (lambda (word)
            (puthash word (1+ (gethash word hash 0)) hash))
          words)
    hash))

I am not sure if I should rather collect it into alist. Maybe I could
collect it straight into by frequency ordered list like:

(("word" 9) ("another" 7) ("more" 3))

That is what I am doing here, to construct string of most frequent tags:

(defun rcd-word-frequency-string (text &optional length how-many-words)
  (let* ((words (rcd-word-frequency text length))
         (words (hash-to-list words))
         (number (or how-many-words 20))
         (frequent (seq-sort (lambda (a b)
                               (> (cadr a) (cadr b)))
                             words)))
    (mapconcat (lambda (a) (car a)) (butlast frequent (- (length frequent) number)) " ")))

(rcd-word-frequency-string text nil 5) ⇒ "consectetur ipsum amet maecenas congue"

-- 
Jean

Take action in Free Software Foundation campaigns:
https://www.fsf.org/campaigns

Sign an open letter in support of Richard M. Stallman
https://stallmansupport.org/
https://rms-support-letter.github.io/

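The idea of going from the hash table straight to a frequency-ordered list such as (("word" 9) ("another" 7) ("more" 3)) could look roughly like the sketch below. The name `my-top-words' is hypothetical, and `seq-take' is used in place of the `butlast' arithmetic; it assumes a table that maps word strings to integer counts.

```elisp
(require 'seq)

(defun my-top-words (hash how-many)
  "Return the HOW-MANY most frequent (WORD COUNT) entries of HASH.
Hypothetical helper; HASH maps word strings to integer counts."
  (let (entries)
    ;; `push' is constant time per entry, and the order in which the
    ;; entries are collected does not matter before sorting.
    (maphash (lambda (word count) (push (list word count) entries)) hash)
    (seq-take (sort entries (lambda (a b) (> (cadr a) (cadr b))))
              how-many)))

(let ((hash (make-hash-table :test 'equal)))
  (puthash "word" 9 hash)
  (puthash "more" 3 hash)
  (puthash "another" 7 hash)
  (my-top-words hash 2))
;; ⇒ (("word" 9) ("another" 7))
```

`seq-take' simply returns the whole list when HOW-MANY is larger than the number of entries, so no length check is needed.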
* Re: Any faster way to find frequency of words?
From: Eric Abrahamsen @ 2021-05-10 3:37 UTC
To: Help GNU Emacs

Jean Louis <bugs@gnu.support> writes:

> * Eric Abrahamsen <eric@ericabrahamsen.net> [2021-05-09 17:57]:
>> Jean Louis <bugs@gnu.support> writes:
>>
>> > I am interested if there is some better way for Emacs Lisp to find
>> > frequency of words.
>> >
>> > Purpose is to create HTML clickable tag clouds similar to image tag
>> > clouds. But I will invoke Perl from Emacs to generate it. For that, I
>> > have to analyze the text first.
>>
>> Is there any particular improvement you're trying to make?
>
> I am invoking Perl on the fly and producing clickable HTML tag
> cloud. It would be boring and tiresome to re-write Perl's module into
> Emacs Lisp, though useful. For now, I rather just do it on the fly.
>
> As HTML tags are created from text, I need nothing but alphabetical
> characters. Function is invoked rarely.
>
> It is also useful to generate tags for particular text, that helps me
> to curate WWW pages.

Right, but what I meant was, is there anything wrong with the
implementation you posted?

>> I guess I'd suggest using Emacs syntax parsing functions, ie
>> `forward-word' and `buffer-substring'. Then you can fine tune the
>> definition of words using the local syntax table.
>
> That is also interesting approach, it could just go over the words and
> enter them into list.

Yes, and it can help you skip garbage characters that shouldn't count as
words. Things like `(skip-syntax-forward "^w")` (meaning "skip a run of
characters that aren't word constituents") can be very useful.

>> >     (mapc (lambda (word)
>> >             (when (> (length word) 2)
>> >               (let ((word (downcase word)))
>> >                 (if (numberp (gethash word hash))
>> >                     (puthash word (1+ (gethash word hash)) hash)
>> >                   (puthash word 1 hash)))))
>>
>> While hash tables are probably best for very large texts, alists are
>> nice because you can use place-setting with a default, simplifying the
>> above to:
>>
>> (cl-incf (alist-get word frequency-alist 0 nil #'equal))
>
> The idea gave me idea to use the defaults from hashes, so I have made
> it now as below (puthash word (1+ (gethash word hash 0)) hash), that
> is result of brain storming here...

> (defun rcd-word-frequency (text &optional length)
>   "Returns word frequency as hash from TEXT.
>
> Words smaller than LENGTH are discarded from counting."
>   (let* ((hash (make-hash-table :test 'equal))
>          (text (text-alphabetic-only text))
>          (length (or length 3))
>          (words (split-string text " " t " "))
>          (words (mapcar 'downcase words))
>          (words (mapcar (lambda (word) (when (> (length word) length) word)) words))
>          (words (delq nil words)))
>     (mapc (lambda (word)
>             (puthash word (1+ (gethash word hash 0)) hash))

I totally forgot that `gethash' has a default argument! So the line
above can just be:

(cl-incf (gethash word hash 0))

I don't know why, but I really enjoy that.

>           words)
>     hash))
>
> I am not sure if I should rather collect it into alist. Maybe I could
> collect it straight into by frequency ordered list like:
>
> (("word" 9) ("another" 7) ("more" 3))
>
> That is what I am doing here, to construct string of most frequent tags:
>
> (defun rcd-word-frequency-string (text &optional length how-many-words)
>   (let* ((words (rcd-word-frequency text length))
>          (words (hash-to-list words))
>          (number (or how-many-words 20))
>          (frequent (seq-sort (lambda (a b)
>                                (> (cadr a) (cadr b)))
>                              words)))
>     (mapconcat (lambda (a) (car a)) (butlast frequent (- (length frequent) number)) " ")))

I don't have a `hash-to-list' function, but once you've built your table
it seems like the rest of it is fairly straightforward.

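A minimal sketch of the syntax-based approach discussed above, combined with `(cl-incf (gethash word hash 0))'. The function name `my-buffer-word-frequency' is hypothetical, and it uses `skip-syntax-forward' in both directions rather than `forward-word'; what counts as a word depends on the syntax table of the current buffer.

```elisp
(require 'cl-lib)

(defun my-buffer-word-frequency ()
  "Return a hash table of downcased word counts for the current buffer.
Hypothetical helper; words are runs of word-constituent characters
according to the buffer's syntax table."
  (let ((hash (make-hash-table :test 'equal)))
    (save-excursion
      (goto-char (point-min))
      ;; Skip anything that is not a word constituent, then take the
      ;; run of word constituents that follows as one word.
      (while (progn (skip-syntax-forward "^w") (not (eobp)))
        (let ((start (point)))
          (skip-syntax-forward "w")
          (cl-incf (gethash (downcase
                             (buffer-substring-no-properties start (point)))
                            hash 0)))))
    hash))

(with-temp-buffer
  (insert "Sed sit amet, sit!")
  (gethash "sit" (my-buffer-word-frequency)))
;; ⇒ 2
```

No intermediate string or word list is built, which may matter for large buffers, and punctuation is skipped without a regexp pass over the text.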
* Re: Any faster way to find frequency of words?
From: Jean Louis @ 2021-05-10 7:14 UTC
To: Eric Abrahamsen; +Cc: Help GNU Emacs

* Eric Abrahamsen <eric@ericabrahamsen.net> [2021-05-10 06:38]:
> > It is also useful to generate tags for particular text, that helps me
> > to curate WWW pages.
>
> Right, but what I meant was, is there anything wrong with the
> implementation you posted?

Thank you. It gives me practically the wanted result, theoretically I
have not tested it well to say if maybe something technically is
wrong. And I use it on smaller chunks of text, it appears pretty fast
and it would be very slow if I would be using it on huge number of
documents. On a document of 246000 bytes it takes few seconds. But it
is not a problem, I have not got too many such documents and I am not
iterating. It is for generation of tags.

I think this is full set of functions:

(defun hash-to-list (hash)
  "Convert hash HASH to list"
  (let (list)
    (maphash (lambda (key value)
               (setq list (append list
                                  (list (list key value)))))
             hash)
    list))

(defun text-alphabetic-only (text)
  "Return alphabetic characters from TEXT."
  (replace-regexp-in-string "[^[:alpha:]]" " " text))

(defun rcd-word-frequency (text &optional length)
  "Returns word frequency as hash from TEXT.

Words smaller than LENGTH are discarded from counting."
  (let* ((hash (make-hash-table :test 'equal))
         (text (text-alphabetic-only text))
         (length (or length 3))
         (words (split-string text " " t " "))
         (words (mapcar 'downcase words))
         (words (mapcar (lambda (word) (when (> (length word) length) word)) words))
         (words (delq nil words)))
    (mapc (lambda (word)
            (puthash word (1+ (gethash word hash 0)) hash))
          words)
    hash))

(defun rcd-word-frequency-list (text &optional length)
  "Return the word frequency list of pairs, sorted by word count.

First item of the pair is the word, second the word count.

It will analyze TEXT, with minimum word LENGTH."
  (let* ((words (rcd-word-frequency text length))
         (words (hash-to-list words))
         (frequent (seq-sort (lambda (a b)
                               (> (cadr a) (cadr b)))
                             words)))
    frequent))

(defun rcd-word-frequency-string (text &optional length how-many)
  "Return string with most frequent words in TEXT.

Use LENGTH to designate minimum length of words to analyze.

Return HOW-MANY words, 20 by default."
  (let ((frequent (rcd-word-frequency-list text length))
        ;; Without a default, a nil HOW-MANY would signal an error in `-'.
        (how-many (or how-many 20)))
    (mapconcat (lambda (a) (car a)) (butlast frequent (- (length frequent) how-many)) " ")))

(defun rcd-word-frequency-buffer (&optional how-many)
  (interactive)
  (let* ((how-many (or how-many (read-number "How many most frequent words you wish to see? ")))
         (text (buffer-string))
         (frequent (rcd-word-frequency-list text))
         (report (mapconcat (lambda (a)
                              (format "%s:%s " (car a) (cadr a)))
                            (butlast frequent (- (length frequent) how-many)) " ")))
    (prog1 report (message report))))

(rcd-word-frequency-buffer 10) ⇒ "word:44 words:35 text:28 hash:28 length:25 list:17 frequency:16 frequent:14 many:11 lambda:11"

> >> I guess I'd suggest using Emacs syntax parsing functions, ie
> >> `forward-word' and `buffer-substring'. Then you can fine tune the
> >> definition of words using the local syntax table.
> >
> > That is also interesting approach, it could just go over the words and
> > enter them into list.
>
> Yes, and it can help you skip garbage characters that shouldn't count as
> words. Things like `(skip-syntax-forward "^w")` (meaning "skip a run of
> characters that aren't word constituents") can be very useful.

For now I just skip words by their length and count only the
alphabetic characters. Purpose is just to generate tags for HTML
pages. Once tags have been generated, I can use PostgreSQL database to
find documents with most frequent tags.

Generation of tags is human curated, not automatic. Thus such function
is invoked rather on specific documents. It suggests me the tags for
editing. Not that it creates tags without my attendance.

For example "https" does not seem quite useful tag if articles do not
speak of it, so I have to delete such tags.

> > Words smaller than LENGTH are discarded from counting."
> >   (let* ((hash (make-hash-table :test 'equal))
> >          (text (text-alphabetic-only text))
> >          (length (or length 3))
> >          (words (split-string text " " t " "))
> >          (words (mapcar 'downcase words))
> >          (words (mapcar (lambda (word) (when (> (length word) length) word)) words))
> >          (words (delq nil words)))
> >     (mapc (lambda (word)
> >             (puthash word (1+ (gethash word hash 0)) hash))
>
> I totally forgot that `gethash' has a default argument! So the line
> above can just be:
>
> (cl-incf (gethash word hash 0))

You like cl-incf and I use 1+, I am not sure if this macro would maybe
slow it down. That is why I tend to skip macros. And let us say I wish
to make package for word frequencies, it would not need to require
cl-lib library.

(defmacro cl-incf (place &optional x)
  "Increment PLACE by X (1 by default).
PLACE may be a symbol, or any generalized variable allowed by `setf'.
The return value is the incremented value of PLACE."
  (declare (debug (place &optional form)))
  (if (symbolp place)
      (list 'setq place (if x (list '+ place x) (list '1+ place)))
    (list 'cl-callf '+ place (or x 1))))

> > (defun rcd-word-frequency-string (text &optional length how-many-words)
> >   (let* ((words (rcd-word-frequency text length))
> >          (words (hash-to-list words))
> >          (number (or how-many-words 20))
> >          (frequent (seq-sort (lambda (a b)
> >                                (> (cadr a) (cadr b)))
> >                              words)))
> >     (mapconcat (lambda (a) (car a)) (butlast frequent (- (length frequent) number)) " ")))
>
> I don't have a `hash-to-list' function, but once you've built your table
> it seems like the rest of it is fairly straightforward.

I use those functions below.

;;;; ━━━━━━━━━━━━━━━━━━
;;;; HASH FUNCTIONS
;;;; ━━━━━━━━━━━━━━━━━━

(defun hash-to-plist (hash)
  "Convert hash HASH to plist."
  (let (plist)
    (maphash (lambda (key value)
               (push key plist)
               (push value plist))
             hash)
    (reverse plist)))

(defun hash-to-alist (hash)
  "Convert hash HASH to alist"
  (let (alist)
    (maphash (lambda (key value)
               (push (cons key value) alist))
             hash)
    alist))

(defun hash-to-list (hash)
  "Convert hash HASH to list"
  (let (list)
    (maphash (lambda (key value)
               (setq list (append list
                                  (list (list key value)))))
             hash)
    list))

(defun hash-append (h1 &rest hashes)
  "Return H1 hash appended with HASHES."
  (mapc (lambda (hash)
          (maphash (lambda (key value)
                     (puthash key value h1))
                   hash))
        hashes)
  h1)

-- 
Jean

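On the worry that `cl-incf' might slow things down: macros are expanded when the code is byte-compiled, so in compiled code the `cl-incf' form should cost no more than whatever it expands to, and cl-lib is distributed with Emacs itself. The sketch below only shows that the two spellings build the same counts; both function names are hypothetical.

```elisp
(require 'cl-lib)

(defun my-count-with-puthash (words)
  "Count WORDS with explicit `puthash' and `1+'.  Hypothetical helper."
  (let ((hash (make-hash-table :test 'equal)))
    (dolist (word words hash)
      (puthash word (1+ (gethash word hash 0)) hash))))

(defun my-count-with-incf (words)
  "Count WORDS with `cl-incf' on a `gethash' place.  Hypothetical helper."
  (let ((hash (make-hash-table :test 'equal)))
    (dolist (word words hash)
      (cl-incf (gethash word hash 0)))))

(let ((words '("hash" "list" "hash")))
  (list (gethash "hash" (my-count-with-puthash words))
        (gethash "hash" (my-count-with-incf words))))
;; ⇒ (2 2)
```

Whether the `cl-lib' dependency is acceptable in a package is a separate question of taste; the `puthash' form avoids it at the cost of naming the key and the table twice.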
* RE: [External] : Re: Any faster way to find frequency of words?
From: Drew Adams @ 2021-05-10 14:02 UTC
To: Jean Louis, Eric Abrahamsen; +Cc: Help GNU Emacs

> (defun hash-to-list (hash)
>   "Convert hash HASH to list"
>   (let (list)
>     (maphash (lambda (key value)
>                (setq list (append list
>                                   (list (list key value)))))
>              hash)
>     list))

I use this, FWIW:

(defun hash-table-to-alist (hash-table)
  "Create and return an alist created from HASH-TABLE.
The order of alist entries is undefined, but it seems to be the same
as the order of hash-table entries (which seems to be the order in
which the entries were added to the table)."
  (let ((al ()))
    (maphash (lambda (key val) (push (cons key val) al))
             hash-table)
    (nreverse al)))

* Re: [External] : Re: Any faster way to find frequency of words?
From: Jean Louis @ 2021-05-10 16:26 UTC
To: Drew Adams; +Cc: Eric Abrahamsen, Help GNU Emacs

* Drew Adams <drew.adams@oracle.com> [2021-05-10 17:03]:
> I use this, FWIW:
>
> (defun hash-table-to-alist (hash-table)
>   "Create and return an alist created from HASH-TABLE.
> The order of alist entries is undefined, but it seems to be the same
> as the order of hash-table entries (which seems to be the order in
> which the entries were added to the table)."
>   (let ((al ()))
>     (maphash (lambda (key val) (push (cons key val) al))
>              hash-table)
>     (nreverse al)))

That may be better, nicer.

I wonder if nreverse is really needed as function just returns some
data, is that data anyway destroyed thereafter?

Then I was also using reverse, I will take it out, as I don't think
there is any order in the hash, if I reverse it or not, it does not
matter.

(setq hash (make-hash-table))
(puthash 'Name "Jimmy" hash)
(puthash "City" "New York" hash)
(puthash "Brigade" "II" hash)

hash ⇒ #s(hash-table size 65 test eql rehash-size 1.5 rehash-threshold 0.8125 data (Name "Jimmy" "City" "New York" "Brigade" "II"))

(puthash 'Name "Jimmy2" hash)

hash ⇒ #s(hash-table size 65 test eql rehash-size 1.5 rehash-threshold 0.8125 data (Name "Jimmy2" "City" "New York" "Brigade" "II"))

I can see that hash keeps order of entries, but I don't believe that
is guaranteed. Visually it gives us the same order:

(hash-table-to-alist hash) ⇒ ((Name . "Jimmy2") ("City" . "New York") ("Brigade" . "II"))

(setq alist (hash-table-to-alist hash)) ⇒ ((Name . "Jimmy2") ("City" . "New York") ("Brigade" . "II"))

(assoc 'Name alist) ⇒ (Name . "Jimmy2")

I just wonder if the order matters. It should not matter in hash,
alist, plist I guess.

-- 
Jean

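The difference between `reverse' and `nreverse' that the question touches on can be seen in a small sketch; the function name `my-reverse-demo' is hypothetical. `reverse' returns a new list and leaves its argument alone, while `nreverse' may reuse the cons cells of its argument, which is only safe when nothing else refers to that list.

```elisp
(defun my-reverse-demo ()
  "Show that `reverse' copies while `nreverse' may reuse cons cells.
Hypothetical demo function."
  (let* ((original (list 1 2 3))
         (copy (reverse original))     ; ORIGINAL is left untouched
         (fresh (list "a" "b" "c")))   ; nothing else refers to this list
    (list original
          copy
          ;; Safe here: FRESH is local, so altering its structure
          ;; cannot surprise any other code.
          (nreverse fresh))))

(my-reverse-demo)
;; ⇒ ((1 2 3) (3 2 1) ("c" "b" "a"))
```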
* RE: [External] : Re: Any faster way to find frequency of words?
From: Drew Adams @ 2021-05-10 16:34 UTC
To: Jean Louis; +Cc: Eric Abrahamsen, Help GNU Emacs

> >   (let ((al ()))
> >     (maphash (lambda (key val) (push (cons key val) al))
> >              hash-table)
> >     (nreverse al)))
>
> That may be better, nicer.
>
> I wonder if nreverse is really needed as function just returns some
> data, is that data anyway destroyed thereafter?

It creates new list structure, for local var `al'.
So there's no problem with destructively modifying
that list structure - nothing else can be using it
in this context.

> I just wonder if the order matters. It should not matter in hash,
> alist, plist I guess.

Order can matter in an alist or plist. "Can", because
certainly some code can use such a list without caring
about the order.

Alists, in particular, are expressly designed to allow
for multiple elements with the same key. For most
purposes, only the first element with the same key is
accessed; it "shadows" subsequent elements in the list.
There are various advantages to being able to have
multiple entries.

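The shadowing described above can be illustrated with a short sketch; `my-shadow-demo' is a hypothetical name. Pushing a second entry with the same key hides the older one from `assoc' without removing it.

```elisp
(defun my-shadow-demo ()
  "Return the value found for \"word\" before and after shadowing.
Hypothetical demo function."
  (let* ((alist (list (cons "word" 9) (cons "more" 3)))
         (before (cdr (assoc "word" alist))))
    ;; Push a second entry with the same key in front of the old one.
    (push (cons "word" 10) alist)
    (list before
          (cdr (assoc "word" alist))           ; the new entry is found first
          (cdr (assoc "word" (cdr alist))))))  ; the old entry is still there

(my-shadow-demo)
;; ⇒ (9 10 9)
```

This is why order matters for an alist: reversing it would change which of the two "word" entries a lookup returns.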
* Re: [External] : Re: Any faster way to find frequency of words?
From: Jean Louis @ 2021-05-10 17:05 UTC
To: Drew Adams; +Cc: Eric Abrahamsen, Help GNU Emacs

* Drew Adams <drew.adams@oracle.com> [2021-05-10 19:35]:
> > I just wonder if the order matters. It should not matter in hash,
> > alist, plist I guess.
>
> Order can matter in an alist or plist.

It could matter if one accesses it beyond those specific Emacs Lisp
functions.

For example I could and have created alist of most frequent words, but
then realized, it need not be alist, it can be simple list with
lists. But maybe internally something is faster.

-- 
Jean

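For comparison, the two shapes mentioned above, an alist of dotted pairs and a simple list of lists, can both be searched with `assoc'; only the accessor for the count differs. The variable names in this sketch are hypothetical example data.

```elisp
(defvar my-frequency-pairs '(("word" . 9) ("more" . 3))
  "Hypothetical example data: an alist of dotted pairs.")

(defvar my-frequency-lists '(("word" 9) ("more" 3))
  "Hypothetical example data: a list of two-element lists.")

;; `assoc' compares the car of each element with `equal', so it finds
;; the entry in both shapes.  The count is the cdr of a dotted pair
;; but the cadr of a two-element list.
(list (cdr (assoc "word" my-frequency-pairs))
      (cadr (assoc "word" my-frequency-lists)))
;; ⇒ (9 9)
```

The list-of-lists form uses one more cons cell per entry, which is unlikely to matter for a handful of tags.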
* Re: Any faster way to find frequency of words?
From: Emanuel Berg via Users list for the GNU Emacs text editor @ 2021-05-09 15:02 UTC
To: help-gnu-emacs

Jean Louis wrote:

> (text (text-alphabetic-only text))
> (words (split-string text " " t " ")))

Here is what I would try first

1. `buffer-substring'
2. `split-string'
3. `delete-dups'
4. loop and do `how-many'
5. get a new list with '(occurrences word)
6. sort WRT occurrences
7. loop and (insert "%d %s\n" occ wrd)

Easy. Fast enough?

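The seven steps above might be sketched as follows. The function name `my-word-counts-via-how-many' is hypothetical, the text is passed as a string and put into a temporary buffer instead of being taken with `buffer-substring', and the "\\b" word boundaries are an added assumption so that a short word is not counted inside a longer one.

```elisp
(defun my-word-counts-via-how-many (text)
  "Return a list of (OCCURRENCES WORD) for TEXT, most frequent first.
Hypothetical helper following the numbered steps."
  (with-temp-buffer
    (insert (downcase text))
    (let* ((words (split-string (buffer-string) "[^[:alpha:]]+" t))
           ;; `delete-dups' is destructive, so work on a copy.
           (unique (delete-dups (copy-sequence words)))
           (pairs (mapcar
                   (lambda (word)
                     ;; `how-many' rescans the whole buffer once per
                     ;; distinct word.
                     (list (how-many (concat "\\b" (regexp-quote word) "\\b")
                                     (point-min) (point-max))
                           word))
                   unique)))
      (sort pairs (lambda (a b) (> (car a) (car b)))))))

;; Step 7, as a string rather than an insertion into a buffer:
(mapconcat (lambda (pair) (format "%d %s" (car pair) (cadr pair)))
           (my-word-counts-via-how-many "the cat and the dog and the bird")
           "\n")
```

Because the buffer is scanned again for every distinct word, the work grows with the text size times the number of distinct words, which is fine for a page of text but slower than a single pass with a hash table.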
* Re: Any faster way to find frequency of words?
From: Jean Louis @ 2021-05-09 17:19 UTC
To: help-gnu-emacs

* Emanuel Berg via Users list for the GNU Emacs text editor <help-gnu-emacs@gnu.org> [2021-05-09 18:05]:
> Jean Louis wrote:
>
> > (text (text-alphabetic-only text))
> > (words (split-string text " " t " ")))
>
> Here is what I would try first
>
> 1. `buffer-substring'
> 2. `split-string'
> 3. `delete-dups'
> 4. loop and do `how-many'
> 5. get a new list with '(occurrences word)

How do you get `occurrences'? You would count for words each time?

Is it function? I cannot find it.

I think that your (4) is not necessary, as counting is not necessary.

-- 
Jean

* Re: Any faster way to find frequency of words?
From: Emanuel Berg via Users list for the GNU Emacs text editor @ 2021-05-09 18:00 UTC
To: help-gnu-emacs

Jean Louis wrote:

> I think that your (4) is not necessary, as counting is
> not necessary.

Some counting is if you are to learn the frequency.

How about `forward-word' the whole buffer and for every word
feed it to a data structure, which keeps a record and a digit
and increase that by 1?

Then the challenge would be to pick a data structure where
searching is fast and in particular where search time doesn't
_grow_ fast with respect to its overall size growing (size =
the number of unique words)

BTW the theoretical worst-case would be a buffer where all
words are unique. Buffer cost is almost 1, ultimately n.
With the theoretical worst-case, data structure would be, if
linear, like this if we denote

  buffer cost : data structure cost

  1: 0       <-- first word
  1: 1
  1: 2
  1: 3
  ..
  1: n - 1   <-- last word

so every lookup is linear and the total grows as n^2. But
probably data structure cost is less than linear, say
logarithmic, then we would have

  linear(n) + n * logarithmic(n)

where n * logarithmic(n) will grow the faster, so n log n in
total, and with a hash table, where a lookup costs about the
same whatever the size, the whole thing is about linear!

For texts of this size, whatever you do with the data
structure, it'll be fast enough!

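The worst case described above, where every word is unique, can be timed with `benchmark-run' from the benchmark library that comes with Emacs. This is only a rough sketch: all function names are hypothetical, the word list is synthetic, and the timings depend on the machine and on whether the code is byte-compiled, so no numbers are shown.

```elisp
(require 'benchmark)
(require 'cl-lib)

(defun my-unique-words (n)
  "Return a list of N distinct strings (worst case: every word is new).
Hypothetical helper."
  (mapcar (lambda (i) (format "w%d" i)) (number-sequence 1 n)))

(defun my-count-into-hash (words)
  "Count WORDS in a hash table: about constant cost per lookup."
  (let ((hash (make-hash-table :test 'equal)))
    (dolist (word words hash)
      (puthash word (1+ (gethash word hash 0)) hash))))

(defun my-count-into-alist (words)
  "Count WORDS in an alist: each lookup walks the list built so far."
  (let (alist)
    (dolist (word words alist)
      (cl-incf (alist-get word alist 0 nil #'equal)))))

;; Elapsed seconds for one run of each; `benchmark-run' returns
;; (ELAPSED GC-COUNT GC-ELAPSED), of which only the first is kept.
(let ((words (my-unique-words 2000)))
  (list (car (benchmark-run 1 (my-count-into-hash words)))
        (car (benchmark-run 1 (my-count-into-alist words)))))
```

With only unique words the alist version performs on the order of n^2 comparisons while the hash table stays close to linear, so the gap should widen as the word list grows.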
* Re: Any faster way to find frequency of words?
From: Jean Louis @ 2021-05-09 19:03 UTC
To: help-gnu-emacs

* Emanuel Berg via Users list for the GNU Emacs text editor <help-gnu-emacs@gnu.org> [2021-05-09 21:01]:
> Jean Louis wrote:
>
> > I think that your (4) is not necessary, as counting is
> > not necessary.
>
> Some counting is if you are to learn the frequency.

Iterating and increasing the value is not same as counting. That first
creates the frequency of words. Counting could be useful when finding
the most frequent words. But even in that case programmatical
comparison of what is greater seems to be enough. Maybe the underlying
C program is counting.

> BTW the theoretical worst-case would be a buffer where all
> words are unique. Buffer cost is almost 1, ultimately n.
> With the theoretical worst-case, data structure would be, if
> linear, like this

Heaven thanks it is not theoretical case, in practice it just finds
frequencies of words in some kilobytes. For speedy searching by word
frequencies I am using PostgreSQL with Emacs interface.

-- 
Jean

* Re: Any faster way to find frequency of words?
From: Emanuel Berg via Users list for the GNU Emacs text editor @ 2021-05-09 23:33 UTC
To: help-gnu-emacs

Jean Louis wrote:

> Iterating and increasing the value is not same as counting.
> That first creates the frequency of words.

Iterating and increasing the value is a method to do counting,
and perhaps here, it is the best one...

Thread overview: 15+ messages

2021-05-09 14:38 Any faster way to find frequency of words? Jean Louis
2021-05-09 14:56 ` Eric Abrahamsen
2021-05-09 15:05   ` Emanuel Berg via Users list for the GNU Emacs text editor
2021-05-09 17:16   ` Jean Louis
2021-05-10  3:37     ` Eric Abrahamsen
2021-05-10  7:14       ` Jean Louis
2021-05-10 14:02         ` [External] : " Drew Adams
2021-05-10 16:26           ` Jean Louis
2021-05-10 16:34             ` Drew Adams
2021-05-10 17:05               ` Jean Louis
2021-05-09 15:02 ` Emanuel Berg via Users list for the GNU Emacs text editor
2021-05-09 17:19   ` Jean Louis
2021-05-09 18:00     ` Emanuel Berg via Users list for the GNU Emacs text editor
2021-05-09 19:03       ` Jean Louis
2021-05-09 23:33         ` Emanuel Berg via Users list for the GNU Emacs text editor