From mboxrd@z Thu Jan 1 00:00:00 1970
From: Eric Abrahamsen
Newsgroups: gmane.emacs.help
Subject: Re: Any faster way to find frequency of words?
Date: Sun, 09 May 2021 20:37:10 -0700
Message-ID: <87cztzqmxl.fsf@ericabrahamsen.net>
References: <87h7jcq7li.fsf@ericabrahamsen.net>
In-Reply-To: (Jean Louis's message of "Sun, 9 May 2021 20:16:42 +0300")
To: Help GNU Emacs

Jean Louis writes:

> * Eric Abrahamsen [2021-05-09 17:57]:
>> Jean Louis writes:
>>
>> > I am interested if there is some better way for Emacs Lisp to find
>> > frequency of words.
>> >
>> > Purpose is to create HTML clickable tag clouds similar to image tag
>> > clouds. But I will invoke Perl from Emacs to generate it. For that, I
>> > have to analyze the text first.
>>
>> Is there any particular improvement you're trying to make?
>
> I am invoking Perl on the fly and producing clickable HTML tag
> cloud. It would be boring and tiresome to re-write Perl's module into
> Emacs Lisp, though useful. For now, I rather just do it on the fly.
>
> As HTML tags are created from text, I need nothing but alphabetical
> characters. Function is invoked rarely.
>
> It is also useful to generate tags for particular text, that helps me
> to curate WWW pages.

Right, but what I meant was, is there anything wrong with the
implementation you posted?

>> I guess I'd suggest using Emacs syntax parsing functions, ie
>> `forward-word' and `buffer-substring'. Then you can fine tune the
>> definition of words using the local syntax table.
>
> That is also interesting approach, it could just go over the words and
> enter them into list.

Yes, and it can help you skip garbage characters that shouldn't count as
words. Things like `(skip-syntax-forward "^w")` (meaning "skip a run of
characters that aren't word constituents") can be very useful.

>> > (mapc (lambda (word)
>> >         (when (> (length word) 2)
>> >           (let ((word (downcase word)))
>> >             (if (numberp (gethash word hash))
>> >                 (puthash word (1+ (gethash word hash)) hash)
>> >               (puthash word 1 hash)))))
>>
>> While hash tables are probably best for very large texts, alists are
>> nice because you can use place-setting with a default, simplifying the
>> above to:
>>
>> (cl-incf (alist-get word frequency-alist 0 nil #'equal))
>
> The idea gave me idea to use the defaults from hashes, so I have made
> it now as below (puthash word (1+ (gethash word hash 0)) hash), that
> is result of brain storming here...
> (defun rcd-word-frequency (text &optional length)
>   "Returns word frequency as hash from TEXT.
>
> Words smaller than LENGTH are discarded from counting."
>   (let* ((hash (make-hash-table :test 'equal))
>          (text (text-alphabetic-only text))
>          (length (or length 3))
>          (words (split-string text " " t " "))
>          (words (mapcar 'downcase words))
>          (words (mapcar (lambda (word) (when (> (length word) length) word)) words))
>          (words (delq nil words)))
>     (mapc (lambda (word)
>             (puthash word (1+ (gethash word hash 0)) hash))

I totally forgot that `gethash' has a default argument! So the line
above can just be:

(cl-incf (gethash word hash 0))

I don't know why, but I really enjoy that.
>           words)
>     hash))
>
> I am not sure if I should rather collect it into alist. Maybe I could
> collect it straight into by frequency ordered list like:
>
> (("word" 9) ("another" 7) ("more" 3))
>
> That is what I am doing here, to construct string of most frequent tags:
>
> (defun rcd-word-frequency-string (text &optional length how-many-words)
>   (let* ((words (rcd-word-frequency text length))
>          (words (hash-to-list words))
>          (number (or how-many-words 20))
>          (frequent (seq-sort (lambda (a b)
>                                (> (cadr a) (cadr b)))
>                              words)))
>     (mapconcat (lambda (a) (car a)) (butlast frequent (- (length frequent) number)) " ")))

I don't have a `hash-to-list' function, but once you've built your table
it seems like the rest of it is fairly straightforward.
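
For what it's worth, here is a rough sketch of how the pieces discussed
above might fit together: walking the text by syntax class, counting with
`cl-incf' on `gethash', and turning the table into a list sorted by
frequency. All the `my/' names are made up for illustration. In
particular, `my/hash-to-list' is only a guess at what `hash-to-list'
does (one (WORD COUNT) list per entry), since that function isn't shown
in this thread. It also assumes the standard syntax table, where digits
count as word constituents, so it is not a drop-in replacement for the
`text-alphabetic-only' approach.

```elisp
(require 'cl-lib)
(require 'seq)

;; Hypothetical helper: `hash-to-list' is never shown in the thread, so
;; this assumes it returns one (KEY VALUE) list per table entry.
(defun my/hash-to-list (table)
  "Return the entries of hash TABLE as a list of (KEY VALUE) lists."
  (let (result)
    (maphash (lambda (key value) (push (list key value) result)) table)
    result))

;; Hypothetical counter that scans a temporary buffer by syntax class
;; instead of splitting a string on spaces.
(defun my/word-frequency (text &optional min-length)
  "Count words in TEXT that have at least MIN-LENGTH characters (default 3).
Return a hash table mapping downcased words to their counts."
  (let ((table (make-hash-table :test #'equal))
        (min-length (or min-length 3)))
    (with-temp-buffer
      (insert text)
      (goto-char (point-min))
      ;; Skip a run of non-word characters, then take the run of word
      ;; constituents that follows, until the buffer is exhausted.
      (while (progn (skip-syntax-forward "^w") (not (eobp)))
        (let* ((start (point))
               (word (progn
                       (skip-syntax-forward "w")
                       (downcase
                        (buffer-substring-no-properties start (point))))))
          (when (>= (length word) min-length)
            ;; `gethash' with a default of 0 works as a place here.
            (cl-incf (gethash word table 0))))))
    table))

(defun my/most-frequent-words (text &optional min-length how-many)
  "Return the HOW-MANY (default 20) most frequent words in TEXT.
The result is a list of (WORD COUNT) lists, most frequent first."
  (seq-take (seq-sort (lambda (a b) (> (cadr a) (cadr b)))
                      (my/hash-to-list (my/word-frequency text min-length)))
            (or how-many 20)))
```

With that in place, something like
(mapconcat #'car (my/most-frequent-words text) " ") should give the
space-separated string of tags. Words with equal counts come out in no
particular order, so add a secondary comparison if that matters.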