From: Jean Louis
Newsgroups: gmane.emacs.help
Subject: Re: Any faster way to find frequency of words?
Date: Mon, 10 May 2021 10:14:04 +0300
To: Eric Abrahamsen
Cc: Help GNU Emacs

* Eric Abrahamsen [2021-05-10 06:38]:
> > It is also useful to generate tags for particular text, that helps me
> > to curate WWW pages.
>
> Right, but what I meant was, is there anything wrong with the
> implementation you posted?

Thank you. In practice it gives me the result I want, but I have not
tested it well enough to say whether something is technically wrong
with it.

I use it on smaller chunks of text, where it appears pretty fast. It
would be very slow on a huge number of documents: on a document of
246000 bytes it takes a few seconds. That is not a problem, as I do
not have many such documents and I am not iterating over them. It is
for generation of tags.

I think this is the full set of functions:

(defun hash-to-list (hash)
  "Convert hash HASH to list."
  (let (list)
    (maphash (lambda (key value)
               (setq list (append list (list (list key value)))))
             hash)
    list))

(defun text-alphabetic-only (text)
  "Return TEXT with every non-alphabetic character replaced by a space."
  (replace-regexp-in-string "[^[:alpha:]]" " " text))

(defun rcd-word-frequency (text &optional length)
  "Return word frequency as hash from TEXT.
Words of LENGTH characters or fewer (default 3) are discarded
from counting."
  (let* ((hash (make-hash-table :test 'equal))
         (text (text-alphabetic-only text))
         (length (or length 3))
         (words (split-string text " " t " "))
         (words (mapcar 'downcase words))
         (words (mapcar (lambda (word) (when (> (length word) length) word)) words))
         (words (delq nil words)))
    (mapc (lambda (word)
            (puthash word (1+ (gethash word hash 0)) hash))
          words)
    hash))

;; `seq-sort' comes from seq.el.
(require 'seq)

(defun rcd-word-frequency-list (text &optional length)
  "Return the word frequency list of pairs, most frequent first.
First item of the pair is the word, second the word count.
It will analyze TEXT, counting only words longer than LENGTH."
  (let* ((words (rcd-word-frequency text length))
         (words (hash-to-list words))
         (frequent (seq-sort (lambda (a b)
                               (> (cadr a) (cadr b)))
                             words)))
    frequent))

(defun rcd-word-frequency-string (text &optional length how-many)
  "Return string with most frequent words in TEXT.
Use LENGTH to designate minimum length of words to analyze.
Return HOW-MANY words (default 20)."
  (let ((frequent (rcd-word-frequency-list text length))
        ;; Without a default, a nil HOW-MANY would signal an error in `-'.
        (how-many (or how-many 20)))
    (mapconcat (lambda (a) (car a))
               (butlast frequent (- (length frequent) how-many))
               " ")))

(defun rcd-word-frequency-buffer (&optional how-many)
  "Show and return the HOW-MANY most frequent words of the current buffer."
  (interactive)
  (let* ((how-many (or how-many (read-number "How many most frequent words do you wish to see? ")))
         (text (buffer-string))
         (frequent (rcd-word-frequency-list text))
         (report (mapconcat (lambda (a)
                              (format "%s:%s " (car a) (cadr a)))
                            (butlast frequent (- (length frequent) how-many))
                            " ")))
    (prog1 report
      (message "%s" report))))

(rcd-word-frequency-buffer 10)
⇒ "word:44 words:35 text:28 hash:28 length:25 list:17 frequency:16 frequent:14 many:11 lambda:11"

> >> I guess I'd suggest using Emacs syntax parsing functions, ie
> >> `forward-word' and `buffer-substring'. Then you can fine tune the
> >> definition of words using the local syntax table.
> >
> > That is also interesting approach, it could just go over the words and
> > enter them into list.
>
> Yes, and it can help you skip garbage characters that shouldn't count as
> words. Things like `(skip-syntax-forward "^w")` (meaning "skip a run of
> characters that aren't word constituents") can be very useful.

For now I just skip words by their length and count only alphabetic
characters. The purpose is just to generate tags for HTML pages. Once
tags have been generated, I can use the PostgreSQL database to find
documents with the most frequent tags.

Generation of tags is human curated, not automatic, so the function
is invoked on specific documents. It suggests tags for me to edit; it
does not create tags without my attention. For example, "https" is
not a useful tag if the article does not discuss it, so I have to
delete such tags.

> > Words smaller than LENGTH are discarded from counting."
> >   (let* ((hash (make-hash-table :test 'equal))
> >          (text (text-alphabetic-only text))
> >          (length (or length 3))
> >          (words (split-string text " " t " "))
> >          (words (mapcar 'downcase words))
> >          (words (mapcar (lambda (word) (when (> (length word) length) word)) words))
> >          (words (delq nil words)))
> >     (mapc (lambda (word)
> >             (puthash word (1+ (gethash word hash 0)) hash))
>
> I totally forgot that `gethash' has a default argument! So the line
> above can just be:
>
> (cl-incf (gethash word hash 0))

You like cl-incf and I use 1+. I am not sure whether this macro might
slow it down, which is why I tend to skip macros. And let us say I
wish to make a package for word frequencies: it would then not need
to require the cl-lib library.

(defmacro cl-incf (place &optional x)
  "Increment PLACE by X (1 by default).
PLACE may be a symbol, or any generalized variable allowed by `setf'.
The return value is the incremented value of PLACE."
  (declare (debug (place &optional form)))
  (if (symbolp place)
      (list 'setq place (if x (list '+ place x) (list '1+ place)))
    (list 'cl-callf '+ place (or x 1))))

> > (defun rcd-word-frequency-string (text &optional length how-many-words)
> >   (let* ((words (rcd-word-frequency text length))
> >          (words (hash-to-list words))
> >          (number (or how-many-words 20))
> >          (frequent (seq-sort (lambda (a b)
> >                                (> (cadr a) (cadr b)))
> >                              words)))
> >     (mapconcat (lambda (a) (car a)) (butlast frequent (- (length frequent) number)) " ")))
>
> I don't have a `hash-to-list' function, but once you've built your table
> it seems like the rest of it is fairly straightforward.

I use the functions below.

;;;; ━━━━━━━━━━━━━━━━━━
;;;; HASH FUNCTIONS
;;;; ━━━━━━━━━━━━━━━━━━

(defun hash-to-plist (hash)
  "Convert hash HASH to plist."
  (let (plist)
    (maphash (lambda (key value)
               (push key plist)
               (push value plist))
             hash)
    (reverse plist)))

(defun hash-to-alist (hash)
  "Convert hash HASH to alist."
  (let (alist)
    (maphash (lambda (key value)
               (push (cons key value) alist))
             hash)
    alist))

(defun hash-to-list (hash)
  "Convert hash HASH to list."
  (let (list)
    (maphash (lambda (key value)
               (setq list (append list (list (list key value)))))
             hash)
    list))

(defun hash-append (h1 &rest hashes)
  "Return H1 hash appended with HASHES."
  (mapc (lambda (hash)
          (maphash (lambda (key value)
                     (puthash key value h1))
                   hash))
        hashes)
  h1)

--
Jean

Take action in Free Software Foundation campaigns:
https://www.fsf.org/campaigns

Sign an open letter in support of Richard M. Stallman
https://stallmansupport.org/
https://rms-support-letter.github.io/