From: Jean Louis
Newsgroups: gmane.emacs.help
Subject: Any faster way to find frequency of words?
Date: Sun, 09 May 2021 17:38:05 +0300
To: Help GNU Emacs

I would like to know whether there is a better way in Emacs Lisp to
find the frequency of words.

The purpose is to create clickable HTML tag clouds, similar to image
tag clouds. I will invoke Perl from Emacs to generate them, but for
that I have to analyze the text first.

(setq text "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec a diam lectus. Sed sit amet ipsum mauris. Maecenas congue ligula ac quam viverra nec consectetur ante hendrerit. Maecenas congue ligula ac quam viverra nec consectetur ante hendrerit..")

(defun text-alphabetic-only (text)
  "Return TEXT with every non-alphabetic character replaced by a space."
  (replace-regexp-in-string "[^[:alpha:]]" " " text))

(defun word-frequency (text &optional length)
  "Return word frequencies in TEXT as a hash table.
Words are downcased; words shorter than three characters are ignored."
  ;; LENGTH is currently unused.
  (let* ((hash (make-hash-table :test 'equal))
         (text (text-alphabetic-only text))
         (words (split-string text " " t " ")))
    (mapc (lambda (word)
            (when (> (length word) 2)
              (let ((word (downcase word)))
                ;; The default of 0 covers words not yet in the table.
                (puthash word (1+ (gethash word hash 0)) hash))))
          words)
    hash))

(word-frequency text)
⇒ #s(hash-table size 65 test equal rehash-size 1.5 rehash-threshold 0.8125
     data ("lorem" 1 "ipsum" 2 "dolor" 1 "sit" 2 "amet" 2 "consectetur" 3
           "adipiscing" 1 "elit" 1 "donec" 1 "diam" 1 "lectus" 1 "sed" 1
           "mauris" 1 "maecenas" 2 "congue" 2 "ligula" 2 "quam" 2
           "viverra" 2 "nec" 2 "ante" 2 "hendrerit" 2))