From: Eric Abrahamsen <eric@ericabrahamsen.net>
To: Jean Louis <bugs@gnu.support>
Cc: Help GNU Emacs <help-gnu-emacs@gnu.org>
Subject: Re: Any faster way to find frequency of words?
Date: Sun, 09 May 2021 07:56:09 -0700
Message-ID: <87h7jcq7li.fsf@ericabrahamsen.net>
In-Reply-To: <courier.000000006097F3E3.00004125@stw1.rcdrun.com> (Jean Louis's message of "Sun, 09 May 2021 17:38:05 +0300")

Jean Louis <bugs@gnu.support> writes:

> I am interested if there is some better way for Emacs Lisp to find
> frequency of words.
>
> Purpose is to create HTML clickable tag clouds similar to image tag
> clouds. But I will invoke Perl from Emacs to generate it. For that, I
> have to analyze the text first.

Is there any particular improvement you're trying to make?

> (setq text "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec a diam
> lectus. Sed sit amet ipsum mauris. Maecenas congue ligula ac quam
> viverra nec consectetur ante hendrerit. Maecenas congue ligula ac quam
> viverra nec consectetur ante hendrerit..")
>
> (defun text-alphabetic-only (text)
>   "Return alphabetic characters from TEXT."
>   (replace-regexp-in-string "[^[:alpha:]]" " " text))
>
> (defun word-frequency (text &optional length)
>   "Returns word frequency as hash from TEXT."
>   (let* ((hash (make-hash-table :test 'equal))
> 	 (text (text-alphabetic-only text))
> 	 (words (split-string text " " t " ")))

I guess I'd suggest using Emacs's syntax-parsing functions, i.e.
`forward-word' and `buffer-substring'. Then you can fine-tune the
definition of a word by adjusting the local syntax table.
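
A minimal sketch of that approach, assuming the text is first inserted
into a temporary buffer. The function name `my-collect-words' is
hypothetical and not from this thread; the temporary buffer uses the
standard syntax table, so what counts as a word would differ under
another mode's table.

```elisp
;; Sketch only: `my-collect-words' is a hypothetical helper name.
(defun my-collect-words (text)
  "Return the words in TEXT as a list of strings, in order.
Word boundaries come from the buffer's syntax table."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    (let (words)
      ;; `forward-word' returns nil once it hits the end of the buffer
      ;; without finding another word.
      (while (forward-word 1)
        (let ((end (point)))
          (save-excursion
            (forward-word -1)           ; back to the start of this word
            (push (buffer-substring-no-properties (point) end) words))))
      (nreverse words))))
```

Calling (my-collect-words "Lorem ipsum dolor, sit amet.") should yield
the five words without the punctuation, so no separate regexp cleanup
pass is needed.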

>     (mapc (lambda (word)
> 	    (when (> (length word) 2)
> 	      (let ((word (downcase word)))
> 		(if (numberp (gethash word hash))
> 		    (puthash word (1+ (gethash word hash)) hash)
> 		  (puthash word 1 hash)))))

While hash tables are probably best for very large texts, alists are
nice because you can use place-setting with a default, simplifying the
above to:

(cl-incf (alist-get word frequency-alist 0 nil #'equal))
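
For context, here is a sketch of how that one-liner might sit in a
complete function. The name `my-word-frequency-alist' is hypothetical;
the length filter and `downcase' mirror the quoted code, and the final
sort is an added assumption about what a tag cloud would want. Note
that `alist-get' accepts a test function only in Emacs 26 and later,
and each lookup is linear in the number of distinct words.

```elisp
(require 'cl-lib)                       ; provides `cl-incf'

;; Sketch only: `my-word-frequency-alist' is a hypothetical name.
(defun my-word-frequency-alist (words)
  "Return an alist of (WORD . COUNT) for the list of strings WORDS.
Words of two characters or fewer are skipped; case is ignored."
  (let (frequency-alist)
    (dolist (word words)
      (when (> (length word) 2)
        ;; A missing key reads as 0, so the first hit stores 1.
        (cl-incf (alist-get (downcase word) frequency-alist
                            0 nil #'equal))))
    ;; Most frequent words first.
    (sort frequency-alist (lambda (a b) (> (cdr a) (cdr b))))))
```

The word list can come from any tokenizer, for instance the
`split-string' call in the quoted code.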

Eric



