unofficial mirror of help-gnu-emacs@gnu.org
From: Jean Louis <bugs@gnu.support>
To: Eric Abrahamsen <eric@ericabrahamsen.net>
Cc: Help GNU Emacs <help-gnu-emacs@gnu.org>
Subject: Re: Any faster way to find frequency of words?
Date: Sun, 9 May 2021 20:16:42 +0300
Message-ID: <YJgY+u9mDiwNzV4r@protected.localdomain>
In-Reply-To: <87h7jcq7li.fsf@ericabrahamsen.net>

* Eric Abrahamsen <eric@ericabrahamsen.net> [2021-05-09 17:57]:
> Jean Louis <bugs@gnu.support> writes:
> 
> > I am interested if there is some better way for Emacs Lisp to find
> > frequency of words.
> >
> > Purpose is to create HTML clickable tag clouds similar to image tag
> > clouds. But I will invoke Perl from Emacs to generate it. For that, I
> > have to analyze the text first.
> 
> Is there any particular improvement you're trying to make?

I am invoking Perl on the fly and producing a clickable HTML tag
cloud. It would be tedious to rewrite the Perl module in Emacs Lisp,
though useful. For now, I would rather just do it on the fly.

As the tags are created from text, I need nothing but alphabetic
characters. The function is invoked rarely.

It is also useful to generate tags for a particular text; that helps
me to curate WWW pages.

> I guess I'd suggest using Emacs syntax parsing functions, ie
> `forward-word' and `buffer-substring'. Then you can fine tune the
> definition of words using the local syntax table.

That is also an interesting approach; it could just go over the words
and enter them into a list.
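
A minimal sketch of that approach, assuming the text is already in the
current buffer. The name `my-buffer-word-frequency' and its MIN-LENGTH
argument are made up for illustration; what counts as a word is decided
by the buffer's syntax table, and minor modes such as subword-mode
would change the result:

```elisp
(defun my-buffer-word-frequency (&optional min-length)
  "Return a hash table mapping each word in the current buffer to its count.
Words shorter than MIN-LENGTH (default 3) are skipped.  Illustrative
sketch only; the function name is hypothetical."
  (let ((hash (make-hash-table :test #'equal))
        (min-length (or min-length 3)))
    (save-excursion
      (goto-char (point-min))
      ;; `forward-word' returns nil once no further word is found.
      (while (forward-word 1)
        (let* ((end (point))
               ;; Step back to the start of the word just crossed.
               (start (save-excursion (forward-word -1) (point)))
               (word (downcase (buffer-substring-no-properties start end))))
          (when (>= (length word) min-length)
            (puthash word (1+ (gethash word hash 0)) hash)))))
    hash))
```

It can be tried on a string by inserting the string into a temporary
buffer with `with-temp-buffer' and calling the function there.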

> >     (mapc (lambda (word)
> > 	    (when (> (length word) 2)
> > 	      (let ((word (downcase word)))
> > 		(if (numberp (gethash word hash))
> > 		    (puthash word (1+ (gethash word hash)) hash)
> > 		  (puthash word 1 hash)))))
> 
> While hash tables are probably best for very large texts, alists are
> nice because you can use place-setting with a default, simplifying the
> above to:
> 
> (cl-incf (alist-get word frequency-alist 0 nil #'equal))

That gave me the idea to use the default argument of `gethash', so I
now have it as below, with (puthash word (1+ (gethash word hash 0)) hash).
That is the result of brainstorming here...
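
For comparison, the two counting idioms side by side, as a minimal
sketch. The function names are made up for illustration, and the
five-argument form of `alist-get' assumes Emacs 26 or later:

```elisp
(require 'cl-lib)

(defun my-count-words-alist (words)
  "Count WORDS, a list of strings, into an alist of (WORD . COUNT)."
  (let ((frequency-alist nil))
    (dolist (word words)
      ;; `alist-get' is a generalized place; 0 is the default for new keys.
      (cl-incf (alist-get word frequency-alist 0 nil #'equal)))
    frequency-alist))

(defun my-count-words-hash (words)
  "Count WORDS, a list of strings, into a hash table."
  (let ((hash (make-hash-table :test #'equal)))
    (dolist (word words)
      ;; The third argument of `gethash' plays the same role as the 0 above.
      (puthash word (1+ (gethash word hash 0)) hash))
    hash))
```

Both give the same counts; the alist lookup is linear in the number of
distinct words, which is why the hash table is likely the better fit
for long texts.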

(defun rcd-word-frequency (text &optional length)
  "Return word frequencies in TEXT as a hash table.

Words shorter than LENGTH (default 3) are discarded from counting."
  (let* ((hash (make-hash-table :test 'equal))
         (text (text-alphabetic-only text))
         (length (or length 3))
         (words (split-string text " " t " "))
         (words (mapcar 'downcase words))
         ;; Keep only words of at least LENGTH characters.
         (words (mapcar (lambda (word) (when (>= (length word) length) word)) words))
         (words (delq nil words)))
    (mapc (lambda (word)
            ;; The default of 0 covers words not seen before.
            (puthash word (1+ (gethash word hash 0)) hash))
          words)
    hash))
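
The helper `text-alphabetic-only' is not shown in this message. A
hypothetical stand-in, assuming it only has to replace everything that
is not a letter with a space so that `split-string' can work on the
result, might look like this:

```elisp
(defun my-text-alphabetic-only (text)
  "Return TEXT with each run of non-alphabetic characters replaced by a space.
Hypothetical stand-in; the real `text-alphabetic-only' may differ."
  (replace-regexp-in-string "[^[:alpha:]]+" " " text))
```

Leading or trailing spaces left behind by this are harmless, as
`split-string' is called with OMIT-NULLS.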

I am not sure whether I should rather collect it into an alist. Maybe
I could collect it straight into a list ordered by frequency, like:

(("word" 9) ("another" 7) ("more" 3))

That is what I am doing here, to construct a string of the most frequent tags:

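(defun rcd-word-frequency-string (text &optional length how-many-words)
  "Return the HOW-MANY-WORDS (default 20) most frequent words in TEXT.

The words are joined by spaces, most frequent first.  LENGTH is
passed on to `rcd-word-frequency'."
  (let* ((words (rcd-word-frequency text length))
         ;; `hash-to-list' is expected to return elements (WORD COUNT).
         (words (hash-to-list words))
         (number (or how-many-words 20))
         (frequent (seq-sort (lambda (a b)
                               (> (cadr a) (cadr b)))
                             words)))
    (mapconcat #'car (seq-take frequent number) " ")))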


(rcd-word-frequency-string text nil 5) ⇒ "consectetur ipsum amet maecenas congue"
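
The helper `hash-to-list' is not defined in this message either. Since
the sorting code reads the count with `cadr', it presumably returns
elements of the form (WORD COUNT). A hypothetical stand-in under that
assumption:

```elisp
(defun my-hash-to-list (hash)
  "Return the entries of HASH as a list of (KEY VALUE) lists.
Hypothetical stand-in; the real `hash-to-list' may differ."
  (let (result)
    (maphash (lambda (key value)
               (push (list key value) result))
             hash)
    result))
```

The order of the returned list is unspecified, so it still has to be
sorted by count before taking the most frequent words.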


-- 
Jean

Take action in Free Software Foundation campaigns:
https://www.fsf.org/campaigns

Sign an open letter in support of Richard M. Stallman
https://stallmansupport.org/
https://rms-support-letter.github.io/



