From mboxrd@z Thu Jan 1 00:00:00 1970
From: Eric Abrahamsen
Newsgroups: gmane.emacs.help
Subject: Re: Any faster way to find frequency of words?
Date: Sun, 09 May 2021 07:56:09 -0700
Message-ID: <87h7jcq7li.fsf@ericabrahamsen.net>
In-Reply-To: (Jean Louis's message of "Sun, 09 May 2021 17:38:05 +0300")
To: Jean Louis
Cc: Help GNU Emacs

Jean Louis writes:

> I am interested if there is some better way for Emacs Lisp to find
> frequency of words.
>
> Purpose is to create HTML clickable tag clouds similar to image tag
> clouds. But I will invoke Perl from Emacs to generate it. For that, I
> have to analyze the text first.

Is there any particular improvement you're trying to make?

> (setq text "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec a diam
> lectus. Sed sit amet ipsum mauris. Maecenas congue ligula ac quam
> viverra nec consectetur ante hendrerit. Maecenas congue ligula ac quam
> viverra nec consectetur ante hendrerit..")
>
> (defun text-alphabetic-only (text)
>   "Return alphabetic characters from TEXT."
>   (replace-regexp-in-string "[^[:alpha:]]" " " text))
>
> (defun word-frequency (text &optional length)
>   "Returns word frequency as hash from TEXT."
>   (let* ((hash (make-hash-table :test 'equal))
>          (text (text-alphabetic-only text))
>          (words (split-string text " " t " ")))

I guess I'd suggest using Emacs syntax parsing functions, i.e.
`forward-word' and `buffer-substring'. Then you can fine-tune the
definition of words using the local syntax table.

>     (mapc (lambda (word)
>             (when (> (length word) 2)
>               (let ((word (downcase word)))
>                 (if (numberp (gethash word hash))
>                     (puthash word (1+ (gethash word hash)) hash)
>                   (puthash word 1 hash)))))

While hash tables are probably best for very large texts, alists are
nice because you can use place-setting with a default, simplifying the
above to:

(cl-incf (alist-get word frequency-alist 0 nil #'equal))

Eric
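
For illustration, the two suggestions above (walking the text with
`forward-word' and counting with `cl-incf' on an `alist-get' place)
might be combined roughly as follows. This is only a sketch, not code
from the thread: the function name `my-word-frequency' and its
MIN-LENGTH argument are made up for the example, it assumes Emacs 26 or
later (where `alist-get' accepts a TESTFN argument), and it uses the
default syntax table of a temporary buffer to decide what a word is.

```elisp
(require 'cl-lib)                       ; provides `cl-incf'

;; Hypothetical helper, named for this example only.
(defun my-word-frequency (text &optional min-length)
  "Return an alist of (WORD . COUNT) for the words in TEXT.
Word boundaries come from `forward-word', so they follow the syntax
table of the temporary buffer.  Words shorter than MIN-LENGTH
\(default 3) are ignored.  Words are downcased before counting."
  (let ((min-length (or min-length 3))
        (frequency-alist nil))
    (with-temp-buffer
      (insert text)
      (goto-char (point-min))
      ;; `forward-word' returns nil when it reaches the end of the
      ;; buffer without finding another word, which ends the loop.
      (while (forward-word 1)
        (let* ((end (point))
               ;; Step back over the word just crossed to find its start.
               (beg (save-excursion (forward-word -1) (point)))
               (word (downcase (buffer-substring-no-properties beg end))))
          (when (>= (length word) min-length)
            ;; Place-setting with a default of 0: a missing key is
            ;; added to the alist, an existing one is incremented.
            (cl-incf (alist-get word frequency-alist 0 nil #'equal))))))
    frequency-alist))

;; Usage: most frequent words first.
(sort (my-word-frequency "Sed sit amet ipsum mauris. Lorem ipsum dolor sit amet.")
      (lambda (a b) (> (cdr a) (cdr b))))
```

Each `alist-get' lookup scans the list, so the cost grows with the
number of distinct words; for very large texts the hash-table version
quoted above is likely the better fit. To change what counts as a word,
one option is to wrap the loop in `with-syntax-table' with a modified
table.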