Re: Any faster way to find frequency of words?

all messages for Emacs-related lists mirrored at yhetil.org
 help / color / mirror / code / Atom feed

From: Jean Louis <bugs@gnu.support>
To: Eric Abrahamsen <eric@ericabrahamsen.net>
Cc: Help GNU Emacs <help-gnu-emacs@gnu.org>
Subject: Re: Any faster way to find frequency of words?
Date: Mon, 10 May 2021 10:14:04 +0300	[thread overview]
Message-ID: <YJjdPDq57Hup3DRS@protected.localdomain> (raw)
In-Reply-To: <87cztzqmxl.fsf@ericabrahamsen.net>

* Eric Abrahamsen <eric@ericabrahamsen.net> [2021-05-10 06:38]:
> > It is also useful to generate tags for particular text, that helps me
> > to curate WWW pages.
> 
> Right, but what I meant was, is there anything wrong with the
> implementation you posted?

Thank you. It gives me practically the wanted result, theoretically I
have not tested it well to say if maybe something technically is
wrong. And I use it on smaller chunks of text, it appears pretty fast
and it would be very slow if I would be using it on huge number of
documents. 

On a document of 246000 bytes it takes few seconds. But is not a
problem, I have not get too many such documents and I am not
iterating. It is for generation of tags.

I think this is full set of functions:

(defun hash-to-list (hash)
  "Convert hash HASH to list"
  (let (list)
    (maphash (lambda (key value) (setq list (append list (list (list key value))))) hash)
    list))

(defun text-alphabetic-only (text)
  "Return alphabetic characters from TEXT."
  (replace-regexp-in-string "[^[:alpha:]]" " " text))

(defun rcd-word-frequency (text &optional length)
  "Returns word frequency as hash from TEXT.

Words smaller than LENGTH are discarded from counting."
  (let* ((hash (make-hash-table :test 'equal))
	 (text (text-alphabetic-only text))
	 (length (or length 3))
	 (words (split-string text " " t " "))
	 (words (mapcar 'downcase words))
	 (words (mapcar (lambda (word) (when (> (length word) length) word)) words))
	 (words (delq nil words)))
    (mapc (lambda (word)
	    (puthash word (1+ (gethash word hash 0)) hash))
	  words)
    hash))

(defun rcd-word-frequency-list (text &optional length)
  "Return the unsorted word frequency list of pairs.

First item of the pair is the word, second the word count.

It will analyze TEXT, with minimum word LENGTH."
  (let* ((words (rcd-word-frequency text length))
	 (words (hash-to-list words))
	 (frequent (seq-sort (lambda (a b)
			       (> (cadr a) (cadr b)))
			     words)))
    frequent))

(defun rcd-word-frequency-string (text &optional length how-many)
  "Return string with most frequent words in TEXT.

Use LENGTH to designate minimum length of words to analyze.

Return HOW-MANY words"
  (let ((frequent (rcd-word-frequency-list text length)))
    (mapconcat (lambda (a) (car a)) (butlast frequent (- (length frequent) how-many)) " ")))

(defun rcd-word-frequency-buffer (&optional how-many)
  (interactive)
  (let* ((how-many (or how-many (read-number "How many most frequent words you wish to see? ")))
	 (text (buffer-string))
	 (frequent (rcd-word-frequency-list text))
	 (report (mapconcat (lambda (a) (format "%s:%s " (car a) (cadr a))) (butlast frequent (- (length frequent) how-many)) " ")))
    (prog1
	report
      (message report))))

(rcd-word-frequency-buffer 10) ⇒ "word:44  words:35  text:28  hash:28  length:25  list:17  frequency:16  frequent:14  many:11  lambda:word:44  words:35  text:28  hash:28  length:25  list:17  frequency:16  frequent:14  many:11  lambda:11 

> >> I guess I'd suggest using Emacs syntax parsing functions, ie
> >> `forward-word' and `buffer-substring'. Then you can fine tune the
> >> definition of words using the local syntax table.
> >
> > That is also interesting approach, it could just go over the words and
> > enter them into list.
> 
> Yes, and it can help you skip garbage characters that shouldn't count as
> words. Things like `(skip-syntax-forward "^w")` (meaning "skip a run of
> characters that aren't word constituents") can be very useful.

For now I just skip words by its length and count those alphabetic
characters. Purpose is just to generate tags for HTML pages. 

Once tags have been generated, I can use PostgreSQL database to find
documents with most frequent tags.

Generation of tags is human curated, not automatic. Thus such function
is invoked rather on specific documents. It suggests me the tags for
editing. Not that is creates tags without my attendance.

For example "https" does not seem quite useful tag if articles does
not speak of it, so I have to delete such tags.

> > Words smaller than LENGTH are discarded from counting."
> >   (let* ((hash (make-hash-table :test 'equal))
> > 	 (text (text-alphabetic-only text))
> > 	 (length (or length 3))
> > 	 (words (split-string text " " t " "))
> > 	 (words (mapcar 'downcase words))
> > 	 (words (mapcar (lambda (word) (when (> (length word) length) word)) words))
> > 	 (words (delq nil words)))
> >     (mapc (lambda (word)
> > 	    (puthash word (1+ (gethash word hash 0)) hash))
> 
> I totally forgot that `gethash' has a default argument! So the line
> above can just be:
> 
> (cl-incf (gethash word hash 0))

You like cl-incf and I use 1+, I am not sure if this macro would maybe
slow it down. That is why I tend to skip macros. And let us say I wish
to make package for word frequencies, it would not need to require
cl-lib library.

(defmacro cl-incf (place &optional x)
  "Increment PLACE by X (1 by default).
PLACE may be a symbol, or any generalized variable allowed by `setf'.
The return value is the incremented value of PLACE."
  (declare (debug (place &optional form)))
  (if (symbolp place)
      (list 'setq place (if x (list '+ place x) (list '1+ place)))
    (list 'cl-callf '+ place (or x 1))))

> > (defun rcd-word-frequency-string (text &optional length how-many-words)
> >   (let* ((words (rcd-word-frequency text length))
> > 	 (words (hash-to-list words))
> > 	 (number (or how-many-words 20))
> > 	 (frequent (seq-sort (lambda (a b)
> > 			       (> (cadr a) (cadr b)))
> > 			     words)))
> >     (mapconcat (lambda (a) (car a)) (butlast frequent (- (length frequent) number)) " ")))
> 
> I don't have a `hash-to-list' function, but once you've built your table
> it seems like the rest of it is fairly straightforward.

I use those functions below.

;;;; ━━━━━━━━━━━━━━━━━━
;;;;   HASH FUNCTIONS
;;;; ━━━━━━━━━━━━━━━━━━

(defun hash-to-plist (hash)
  "Convert hash HASH to plist."
  (let (plist)
    (maphash (lambda (key value) (push key plist) (push value plist)) hash)
    (reverse plist)))

(defun hash-to-alist (hash)
  "Convert hash HASH to alist"
  (let (alist)
    (maphash (lambda (key value) (push (cons key value) alist)) hash)
    alist))

(defun hash-to-list (hash)
  "Convert hash HASH to list"
  (let (list)
    (maphash (lambda (key value) (setq list (append list (list (list key value))))) hash)
    list))

(defun hash-append (h1 &rest hashes)
  "Return H1 hash appended with HASHES."
  (mapc 
   (lambda (hash)
     (maphash 
      (lambda (key value) (puthash key value h1)) hash))
   hashes)
  h1)

-- 
Jean

Take action in Free Software Foundation campaigns:
https://www.fsf.org/campaigns

Sign an open letter in support of Richard M. Stallman
https://stallmansupport.org/
https://rms-support-letter.github.io/

next prev parent reply	other threads:[~2021-05-10  7:14 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-05-09 14:38 Any faster way to find frequency of words? Jean Louis
2021-05-09 14:56 ` Eric Abrahamsen
2021-05-09 15:05   ` Emanuel Berg via Users list for the GNU Emacs text editor
2021-05-09 17:16   ` Jean Louis
2021-05-10  3:37     ` Eric Abrahamsen
2021-05-10  7:14       ` Jean Louis [this message]
2021-05-10 14:02         ` [External] : " Drew Adams
2021-05-10 16:26           ` Jean Louis
2021-05-10 16:34             ` Drew Adams
2021-05-10 17:05               ` Jean Louis
2021-05-09 15:02 ` Emanuel Berg via Users list for the GNU Emacs text editor
2021-05-09 17:19   ` Jean Louis
2021-05-09 18:00     ` Emanuel Berg via Users list for the GNU Emacs text editor
2021-05-09 19:03       ` Jean Louis
2021-05-09 23:33         ` Emanuel Berg via Users list for the GNU Emacs text editor

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YJjdPDq57Hup3DRS@protected.localdomain \
    --to=bugs@gnu.support \
    --cc=eric@ericabrahamsen.net \
    --cc=help-gnu-emacs@gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this external index

	https://git.savannah.gnu.org/cgit/emacs.git
	https://git.savannah.gnu.org/cgit/emacs/org-mode.git

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.