From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.io!.POSTED.blaine.gmane.org!not-for-mail
From: Jean Louis <bugs@gnu.support>
Newsgroups: gmane.emacs.help
Subject: Re: Any faster way to find frequency of words?
Date: Mon, 10 May 2021 10:14:04 +0300
Message-ID: <YJjdPDq57Hup3DRS@protected.localdomain>
References: <courier.000000006097F3E3.00004125@stw1.rcdrun.com>
 <87h7jcq7li.fsf@ericabrahamsen.net>
 <YJgY+u9mDiwNzV4r@protected.localdomain>
 <87cztzqmxl.fsf@ericabrahamsen.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Injection-Info: ciao.gmane.io; posting-host="blaine.gmane.org:116.202.254.214";
	logging-data="13099"; mail-complaints-to="usenet@ciao.gmane.io"
User-Agent: Mutt/2.0.6 (2021-03-06)
Cc: Help GNU Emacs <help-gnu-emacs@gnu.org>
To: Eric Abrahamsen <eric@ericabrahamsen.net>
Original-X-From: help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane-mx.org@gnu.org Mon May 10 09:17:17 2021
Return-path: <help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane-mx.org@gnu.org>
Envelope-to: geh-help-gnu-emacs@m.gmane-mx.org
Original-Received: from lists.gnu.org ([209.51.188.17])
	by ciao.gmane.io with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
	(Exim 4.92)
	(envelope-from <help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane-mx.org@gnu.org>)
	id 1lg0AH-0003Db-8r
	for geh-help-gnu-emacs@m.gmane-mx.org; Mon, 10 May 2021 09:17:17 +0200
Original-Received: from localhost ([::1]:49270 helo=lists1p.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.90_1)
	(envelope-from <help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane-mx.org@gnu.org>)
	id 1lg0AF-0000R8-RD
	for geh-help-gnu-emacs@m.gmane-mx.org; Mon, 10 May 2021 03:17:15 -0400
Original-Received: from eggs.gnu.org ([2001:470:142:3::10]:36396)
 by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
 (Exim 4.90_1) (envelope-from <bugs@gnu.support>) id 1lg09r-0000Qx-V2
 for help-gnu-emacs@gnu.org; Mon, 10 May 2021 03:16:51 -0400
Original-Received: from stw1.rcdrun.com ([217.170.207.13]:60607)
 by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
 (Exim 4.90_1) (envelope-from <bugs@gnu.support>) id 1lg09l-0000vf-Th
 for help-gnu-emacs@gnu.org; Mon, 10 May 2021 03:16:51 -0400
Original-Received: from localhost ([::ffff:197.239.7.47])
 (AUTH: PLAIN securesender, TLS: TLS1.3,256bits,ECDHE_RSA_AES_256_GCM_SHA384)
 by stw1.rcdrun.com with ESMTPSA
 id 00000000000ABF27.000000006098DDDB.00001915; Mon, 10 May 2021 00:16:42 -0700
Mail-Followup-To: Eric Abrahamsen <eric@ericabrahamsen.net>,
 Help GNU Emacs <help-gnu-emacs@gnu.org>
Content-Disposition: inline
In-Reply-To: <87cztzqmxl.fsf@ericabrahamsen.net>
Received-SPF: pass client-ip=217.170.207.13; envelope-from=bugs@gnu.support;
 helo=stw1.rcdrun.com
X-Spam_score_int: -18
X-Spam_score: -1.9
X-Spam_bar: -
X-Spam_report: (-1.9 / 5.0 requ) BAYES_00=-1.9, SPF_HELO_PASS=-0.001,
 SPF_PASS=-0.001 autolearn=ham autolearn_force=no
X-Spam_action: no action
X-BeenThere: help-gnu-emacs@gnu.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: Users list for the GNU Emacs text editor <help-gnu-emacs.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/help-gnu-emacs>,
 <mailto:help-gnu-emacs-request@gnu.org?subject=unsubscribe>
List-Archive: <https://lists.gnu.org/archive/html/help-gnu-emacs>
List-Post: <mailto:help-gnu-emacs@gnu.org>
List-Help: <mailto:help-gnu-emacs-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/help-gnu-emacs>,
 <mailto:help-gnu-emacs-request@gnu.org?subject=subscribe>
Errors-To: help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane-mx.org@gnu.org
Original-Sender: "help-gnu-emacs"
 <help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane-mx.org@gnu.org>
Xref: news.gmane.io gmane.emacs.help:129651
Archived-At: <http://permalink.gmane.org/gmane.emacs.help/129651>

* Eric Abrahamsen <eric@ericabrahamsen.net> [2021-05-10 06:38]:
> > It is also useful to generate tags for particular text, that helps me
> > to curate WWW pages.
> 
> Right, but what I meant was, is there anything wrong with the
> implementation you posted?

Thank you. In practice it gives me the result I want, though I have
not tested it well enough to say whether something is technically
wrong. I use it on smaller chunks of text, where it appears pretty
fast, but it would be very slow if I were using it on a huge number of
documents.

On a document of 246000 bytes it takes a few seconds. But that is not
a problem: I do not have many such documents and I am not iterating
over them. It is for the generation of tags.

I think this is the full set of functions:

(defun hash-to-list (hash)
  "Convert hash HASH to a list of (KEY VALUE) lists."
  (let (list)
    (maphash (lambda (key value) (setq list (append list (list (list key value))))) hash)
    list))

(defun text-alphabetic-only (text)
  "Return TEXT with every non-alphabetic character replaced by a space."
  (replace-regexp-in-string "[^[:alpha:]]" " " text))

(defun rcd-word-frequency (text &optional length)
  "Return word frequency as a hash table from TEXT.

Words of LENGTH characters or fewer are discarded from counting.
LENGTH defaults to 3."
  (let* ((hash (make-hash-table :test 'equal))
	 (text (text-alphabetic-only text))
	 (length (or length 3))
	 (words (split-string text " " t " "))
	 (words (mapcar 'downcase words))
	 (words (mapcar (lambda (word) (when (> (length word) length) word)) words))
	 (words (delq nil words)))
    (mapc (lambda (word)
	    (puthash word (1+ (gethash word hash 0)) hash))
	  words)
    hash))

(defun rcd-word-frequency-list (text &optional length)
  "Return the word frequency list of pairs, most frequent first.

First item of the pair is the word, second the word count.

It will analyze TEXT, with minimum word LENGTH."
  (let* ((words (rcd-word-frequency text length))
	 (words (hash-to-list words))
	 (frequent (seq-sort (lambda (a b)
			       (> (cadr a) (cadr b)))
			     words)))
    frequent))

(defun rcd-word-frequency-string (text &optional length how-many)
  "Return string with most frequent words in TEXT.

Use LENGTH to designate minimum length of words to analyze.

Return HOW-MANY words, 20 by default."
  (let ((frequent (rcd-word-frequency-list text length))
	;; Without a default, a nil HOW-MANY would signal an error in `-'.
	(how-many (or how-many 20)))
    (mapconcat #'car (butlast frequent (- (length frequent) how-many)) " ")))

(defun rcd-word-frequency-buffer (&optional how-many)
  "Show and return the HOW-MANY most frequent words in the current buffer."
  (interactive)
  (let* ((how-many (or how-many (read-number "How many most frequent words do you wish to see? ")))
	 (text (buffer-string))
	 (frequent (rcd-word-frequency-list text))
	 (report (mapconcat (lambda (a) (format "%s:%s " (car a) (cadr a))) (butlast frequent (- (length frequent) how-many)) " ")))
    (prog1
	report
      ;; Pass REPORT as an argument, not as the format string itself.
      (message "%s" report))))

(rcd-word-frequency-buffer 10) ⇒ "word:44  words:35  text:28  hash:28  length:25  list:17  frequency:16  frequent:14  many:11  lambda:11 "
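
The `butlast' arithmetic in the two functions above only keeps the
first HOW-MANY pairs. A minimal sketch of the same step with
`seq-take' from seq.el, which should be equivalent here as far as I
can tell. The names `my-top-pairs' and `my-sample-pairs' are made up
for illustration:

```elisp
(require 'seq)

(defun my-top-pairs (pairs how-many)
  "Return the first HOW-MANY elements of PAIRS (hypothetical helper)."
  (seq-take pairs how-many))

(defvar my-sample-pairs '(("word" 44) ("words" 35) ("text" 28))
  "Made-up sample data, already sorted by count.")

;; Compare with the butlast form; this evaluates to t.
(equal (my-top-pairs my-sample-pairs 2)
       (butlast my-sample-pairs (- (length my-sample-pairs) 2)))
```

If HOW-MANY is larger than the list, both forms return the whole list.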

> >> I guess I'd suggest using Emacs syntax parsing functions, ie
> >> `forward-word' and `buffer-substring'. Then you can fine tune the
> >> definition of words using the local syntax table.
> >
> > That is also interesting approach, it could just go over the words and
> > enter them into list.
> 
> Yes, and it can help you skip garbage characters that shouldn't count as
> words. Things like `(skip-syntax-forward "^w")` (meaning "skip a run of
> characters that aren't word constituents") can be very useful.

For now I just skip words by their length and count only alphabetic
characters. The purpose is just to generate tags for HTML pages.
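
For comparison, the syntax-table approach suggested in the quote could
look roughly like the sketch below. It is only an illustration under
my own assumptions: the name `my-count-words-by-syntax' and the
MIN-LENGTH argument are hypothetical, and what counts as a word
depends on the syntax table of the buffer it runs in.

```elisp
(defun my-count-words-by-syntax (&optional min-length)
  "Count words in the current buffer using the buffer's syntax table.
Return a hash table mapping each downcased word to its count.
Words shorter than MIN-LENGTH (default 4) are skipped.
Hypothetical helper, not one of the rcd- functions."
  (let ((hash (make-hash-table :test 'equal))
        (min-length (or min-length 4)))
    (save-excursion
      (goto-char (point-min))
      ;; Skip a run of characters that are not word constituents.
      (skip-syntax-forward "^w")
      (while (not (eobp))
        (let ((start (point)))
          ;; Move over one run of word-constituent characters.
          (skip-syntax-forward "w")
          (let ((word (downcase
                       (buffer-substring-no-properties start (point)))))
            (when (>= (length word) min-length)
              (puthash word (1+ (gethash word hash 0)) hash))))
        (skip-syntax-forward "^w")))
    hash))

;; Usage on a temporary buffer with the standard syntax table:
(with-temp-buffer
  (insert "Words, words and more WORDS; text!")
  (gethash "words" (my-count-words-by-syntax)))
;; ⇒ 3
```

This avoids building the intermediate string and word list, at the
price of depending on the local syntax table.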

Once tags have been generated, I can use PostgreSQL database to find
documents with most frequent tags.

Generation of tags is human curated, not automatic. Thus such a
function is invoked only on specific documents. It suggests tags for
me to edit; it does not create tags without my supervision.

For example, "https" is not a useful tag if the article does not
speak of it, so I have to delete such tags.
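
A small pre-filter could remove the most obvious of these before the
manual editing. This is only a sketch: the variable
`my-tag-stop-words', its contents and the function
`my-remove-stop-words' are hypothetical, and it does not replace the
human curation.

```elisp
(require 'seq)

(defvar my-tag-stop-words '("https" "http" "html")
  "Hypothetical list of words that should never be suggested as tags.")

(defun my-remove-stop-words (pairs)
  "Return PAIRS without entries whose word is in `my-tag-stop-words'.
PAIRS is a list of (WORD COUNT) lists, the shape returned by
`rcd-word-frequency-list'."
  (seq-remove (lambda (pair) (member (car pair) my-tag-stop-words))
              pairs))

;; Usage with made-up data:
(my-remove-stop-words '(("emacs" 12) ("https" 9) ("lisp" 7)))
;; ⇒ (("emacs" 12) ("lisp" 7))
```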

> > Words smaller than LENGTH are discarded from counting."
> >   (let* ((hash (make-hash-table :test 'equal))
> > 	 (text (text-alphabetic-only text))
> > 	 (length (or length 3))
> > 	 (words (split-string text " " t " "))
> > 	 (words (mapcar 'downcase words))
> > 	 (words (mapcar (lambda (word) (when (> (length word) length) word)) words))
> > 	 (words (delq nil words)))
> >     (mapc (lambda (word)
> > 	    (puthash word (1+ (gethash word hash 0)) hash))
> 
> I totally forgot that `gethash' has a default argument! So the line
> above can just be:
> 
> (cl-incf (gethash word hash 0))

You like cl-incf and I use 1+. I am not sure whether this macro would
slow it down, which is why I tend to skip macros. And if, let us say,
I wished to make a package for word frequencies, it would not need to
require the cl-lib library.

(defmacro cl-incf (place &optional x)
  "Increment PLACE by X (1 by default).
PLACE may be a symbol, or any generalized variable allowed by `setf'.
The return value is the incremented value of PLACE."
  (declare (debug (place &optional form)))
  (if (symbolp place)
      (list 'setq place (if x (list '+ place x) (list '1+ place)))
    (list 'cl-callf '+ place (or x 1))))
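
To compare the two spellings, here is a minimal sketch with
hypothetical function names (`my-count-with-1+' and
`my-count-with-incf'). Macros are expanded when the code is
byte-compiled, so in compiled code the macro call itself should cost
nothing at run time; evaluating
(macroexpand-all '(cl-incf (gethash word hash 0))) shows what the
form turns into on a given Emacs.

```elisp
(require 'cl-lib)

(defun my-count-with-1+ (words)
  "Count WORDS into a new hash table using `puthash' and `1+'."
  (let ((hash (make-hash-table :test 'equal)))
    (dolist (word words hash)
      (puthash word (1+ (gethash word hash 0)) hash))))

(defun my-count-with-incf (words)
  "Count WORDS into a new hash table using `cl-incf'."
  (let ((hash (make-hash-table :test 'equal)))
    (dolist (word words hash)
      (cl-incf (gethash word hash 0)))))

;; Both give the same count for the same made-up input.
(let ((words '("tag" "text" "tag" "tag")))
  (list (gethash "tag" (my-count-with-1+ words))
        (gethash "tag" (my-count-with-incf words))))
;; ⇒ (3 3)
```

So the choice is mostly about style and about whether the package
should depend on cl-lib at all.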

> > (defun rcd-word-frequency-string (text &optional length how-many-words)
> >   (let* ((words (rcd-word-frequency text length))
> > 	 (words (hash-to-list words))
> > 	 (number (or how-many-words 20))
> > 	 (frequent (seq-sort (lambda (a b)
> > 			       (> (cadr a) (cadr b)))
> > 			     words)))
> >     (mapconcat (lambda (a) (car a)) (butlast frequent (- (length frequent) number)) " ")))
> 
> I don't have a `hash-to-list' function, but once you've built your table
> it seems like the rest of it is fairly straightforward.

I use the functions below.

;;;; ━━━━━━━━━━━━━━━━━━
;;;;   HASH FUNCTIONS
;;;; ━━━━━━━━━━━━━━━━━━

(defun hash-to-plist (hash)
  "Convert hash HASH to plist."
  (let (plist)
    (maphash (lambda (key value) (push key plist) (push value plist)) hash)
    (reverse plist)))

(defun hash-to-alist (hash)
  "Convert hash HASH to alist."
  (let (alist)
    (maphash (lambda (key value) (push (cons key value) alist)) hash)
    alist))

(defun hash-to-list (hash)
  "Convert hash HASH to a list of (KEY VALUE) lists."
  (let (list)
    (maphash (lambda (key value) (setq list (append list (list (list key value))))) hash)
    list))

(defun hash-append (h1 &rest hashes)
  "Copy all entries of HASHES into H1, modifying it, and return H1.
For equal keys, the value from the later hash wins."
  (mapc 
   (lambda (hash)
     (maphash 
      (lambda (key value) (puthash key value h1)) hash))
   hashes)
  h1)
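
One possible speed-up: `hash-to-list' calls `append' once per entry,
and `append' copies the growing list each time, so the cost grows
roughly with the square of the number of entries. Below is a sketch of
a single-pass variant using `push' and `nreverse'. The names
`my-hash-to-list' and `my-make-sample-hash' are hypothetical and the
sample data is made up; I have not measured it on real documents.

```elisp
(require 'benchmark)

(defun my-hash-to-list (hash)
  "Convert HASH to a list of (KEY VALUE) lists in one pass.
Hypothetical variant of `hash-to-list' using `push' and `nreverse'."
  (let (list)
    (maphash (lambda (key value) (push (list key value) list)) hash)
    (nreverse list)))

(defun my-make-sample-hash (n)
  "Return a hash table with N made-up entries, for timing only."
  (let ((hash (make-hash-table :test 'equal)))
    (dotimes (i n hash)
      (puthash (format "word%d" i) i hash))))

;; Time one conversion of a 5000-entry table.  The first element of
;; the returned list is the elapsed time in seconds.
(let ((hash (my-make-sample-hash 5000)))
  (benchmark-run 1 (my-hash-to-list hash)))
```

Running the same `benchmark-run' form with `hash-to-list' in place of
`my-hash-to-list' gives a direct comparison.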


-- 
Jean

Take action in Free Software Foundation campaigns:
https://www.fsf.org/campaigns

Sign an open letter in support of Richard M. Stallman
https://stallmansupport.org/
https://rms-support-letter.github.io/