From: Jean Louis
Newsgroups: gmane.emacs.help
Subject: Re: Any faster way to find frequency of words?
Date: Sun, 9 May 2021 20:16:42 +0300
To: Eric Abrahamsen
Cc: Help GNU Emacs
In-Reply-To: <87h7jcq7li.fsf@ericabrahamsen.net>
User-Agent: Mutt/2.0.6 (2021-03-06)

* Eric Abrahamsen [2021-05-09 17:57]:
> Jean Louis writes:
>
> > I am interested if there is some better way for Emacs Lisp to find
> > frequency of words.
> >
> > Purpose is to create HTML clickable tag clouds similar to image tag
> > clouds. But I will invoke Perl from Emacs to generate it. For that, I
> > have to analyze the text first.
>
> Is there any particular improvement you're trying to make?

I am invoking Perl on the fly to produce a clickable HTML tag cloud.
Rewriting the Perl module in Emacs Lisp would be useful, but boring and
tiresome, so for now I would rather just call it on the fly.

As the HTML tags are created from text, I need nothing but alphabetic
characters. The function is invoked rarely. It is also useful for
generating tags for a particular text, which helps me to curate WWW
pages.

> I guess I'd suggest using Emacs syntax parsing functions, ie
> `forward-word' and `buffer-substring'. Then you can fine tune the
> definition of words using the local syntax table.

That is also an interesting approach; it could just go over the words
and enter them into a list.
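The buffer-walking idea could look roughly like the sketch below. It
is only an illustration: the name `my-buffer-word-frequency' and the
MIN-LENGTH argument are made up for this example, and what counts as a
word depends on the syntax table of the current buffer.

```elisp
(defun my-buffer-word-frequency (&optional min-length)
  "Return a hash table mapping downcased words of the current buffer to counts.
Words shorter than MIN-LENGTH (default 3) are skipped.
Hypothetical helper, written only to illustrate the `forward-word' approach."
  (let ((hash (make-hash-table :test #'equal))
        (min-length (or min-length 3)))
    (save-excursion
      (goto-char (point-min))
      ;; `forward-word' returns nil once it reaches the end of the
      ;; buffer without having crossed a complete word.
      (while (forward-word 1)
        (let* ((end (point))
               (start (save-excursion (backward-word 1) (point)))
               (word (downcase (buffer-substring-no-properties start end))))
          (when (>= (length word) min-length)
            ;; The third argument of `gethash' is the default for new words.
            (puthash word (1+ (gethash word hash 0)) hash)))))
    hash))
```

For example, in a temporary buffer containing "Foo bar foo baz. Foo!",
(gethash "foo" (my-buffer-word-frequency)) should give 3. Adjusting
the syntax table of the buffer changes which characters belong to a
word, without touching the function.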
> > (mapc (lambda (word)
> >         (when (> (length word) 2)
> >           (let ((word (downcase word)))
> >             (if (numberp (gethash word hash))
> >                 (puthash word (1+ (gethash word hash)) hash)
> >               (puthash word 1 hash)))))
>
> While hash tables are probably best for very large texts, alists are
> nice because you can use place-setting with a default, simplifying the
> above to:
>
> (cl-incf (alist-get word frequency-alist 0 nil #'equal))

That gave me the idea to use the default argument of `gethash', so I
now have (puthash word (1+ (gethash word hash 0)) hash), as below.
That is the result of the brainstorming here...

(defun rcd-word-frequency (text &optional length)
  "Return word frequencies in TEXT as a hash table.
Words of LENGTH characters or fewer (default 3) are not counted."
  (let* ((hash (make-hash-table :test 'equal))
         (text (text-alphabetic-only text))
         (length (or length 3))
         (words (split-string text " " t " "))
         (words (mapcar #'downcase words))
         ;; Keep only words longer than LENGTH; the rest become nil.
         (words (mapcar (lambda (word) (when (> (length word) length) word)) words))
         (words (delq nil words)))
    (mapc (lambda (word)
            ;; The default of 0 covers words not yet in the table.
            (puthash word (1+ (gethash word hash 0)) hash))
          words)
    hash))

I am not sure if I should rather collect it into an alist. Maybe I
could collect it straight into a list ordered by frequency, like:

(("word" 9) ("another" 7) ("more" 3))

That is what I am doing here, to construct a string of the most
frequent tags:

(defun rcd-word-frequency-string (text &optional length how-many-words)
  "Return the most frequent words in TEXT as a space-separated string.
LENGTH is passed on to `rcd-word-frequency'.
HOW-MANY-WORDS (default 20) limits the number of words returned."
  (let* ((words (rcd-word-frequency text length))
         (words (hash-to-list words))
         (number (or how-many-words 20))
         ;; Sort the (WORD COUNT) entries by descending count.
         (frequent (seq-sort (lambda (a b) (> (cadr a) (cadr b))) words)))
    (mapconcat #'car (seq-take frequent number) " ")))

(rcd-word-frequency-string text nil 5)
  ⇒ "consectetur ipsum amet maecenas congue"

-- 
Jean

Take action in Free Software Foundation campaigns:
https://www.fsf.org/campaigns

Sign an open letter in support of Richard M. Stallman
https://stallmansupport.org/
https://rms-support-letter.github.io/
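P.S. The two functions in this message call `text-alphabetic-only' and
`hash-to-list', whose definitions are not shown here. The sketch below
is only a guess at what they might look like, assuming the first one
replaces every run of non-alphabetic characters with a space and the
second one returns (KEY VALUE) lists, which is what the `cadr' in the
sort predicate expects. The real helpers may well differ.

```elisp
;; Assumed stand-ins for helpers that are not shown in this message.

(defun text-alphabetic-only (text)
  "Return TEXT with each run of non-alphabetic characters replaced by a space.
Assumed behavior; the original helper is not shown."
  (replace-regexp-in-string "[^[:alpha:]]+" " " text))

(defun hash-to-list (hash)
  "Return the entries of HASH as a list of (KEY VALUE) lists.
Assumed behavior, chosen to match the use of `cadr' on each entry."
  (let (result)
    (maphash (lambda (key value)
               (push (list key value) result))
             hash)
    result))
```

The order of the list returned by this `hash-to-list' is unspecified,
which does not matter here because the entries are sorted by count
afterwards.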