From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!not-for-mail From: Andreas Politz Newsgroups: gmane.emacs.help Subject: Re: processing a large buffer contents to a hash table Date: Fri, 09 Jan 2009 18:06:06 +0100 Organization: FH-Trier Message-ID: <1231520831.329403@arno.fh-trier.de> References: NNTP-Posting-Host: lo.gmane.org Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-Trace: ger.gmane.org 1231522890 11627 80.91.229.12 (9 Jan 2009 17:41:30 GMT) X-Complaints-To: usenet@ger.gmane.org NNTP-Posting-Date: Fri, 9 Jan 2009 17:41:30 +0000 (UTC) To: help-gnu-emacs@gnu.org Original-X-From: help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane.org@gnu.org Fri Jan 09 18:42:41 2009 Return-path: Envelope-to: geh-help-gnu-emacs@m.gmane.org Original-Received: from lists.gnu.org ([199.232.76.165]) by lo.gmane.org with esmtp (Exim 4.50) id 1LLLNB-0001pb-Fd for geh-help-gnu-emacs@m.gmane.org; Fri, 09 Jan 2009 18:42:17 +0100 Original-Received: from localhost ([127.0.0.1]:46989 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.43) id 1LLLLv-0005TF-Ib for geh-help-gnu-emacs@m.gmane.org; Fri, 09 Jan 2009 12:40:59 -0500 Original-Path: news.stanford.edu!headwall.stanford.edu!news.glorb.com!news2.glorb.com!newsfeed.xs4all.nl!newsfeed6.news.xs4all.nl!xs4all!news2.euro.net!newsfeed.freenet.de!newsfeed.r-kom.de!news-nue1.dfn.de!news-stu1.dfn.de!news.uni-kl.de!not-for-mail Original-Newsgroups: gnu.emacs.help Original-Lines: 166 Original-NNTP-Posting-Host: 143-93-54-11.arno.fh-trier.de Original-X-Trace: news.uni-kl.de 1231520889 27925 143.93.54.11 (9 Jan 2009 17:08:09 GMT) Original-X-Complaints-To: usenet@news.uni-kl.de Original-NNTP-Posting-Date: Fri, 9 Jan 2009 17:08:09 +0000 (UTC) User-Agent: Mozilla-Thunderbird 2.0.0.17 (X11/20081018) In-Reply-To: Cache-Post-Path: arno.fh-trier.de!unknown@dslb-088-069-053-024.pools.arcor-ip.net X-Cache: nntpcache 3.0.1 (see http://www.nntpcache.org/) Original-Xref: news.stanford.edu gnu.emacs.help:165880 X-BeenThere: help-gnu-emacs@gnu.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Users list for the GNU Emacs text editor List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Original-Sender: help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane.org@gnu.org Errors-To: help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane.org@gnu.org Xref: news.gmane.org gmane.emacs.help:61215 Archived-At: Seweryn Kokot wrote: > Hello, > > I'm trying to write a function in elisp which returns word frequency of > a buffer content . It works but for a large file > (around 250 000 words) it takes 15 seconds, while a similar function in python > takes 4s. > > Here is the function which process a buffer word by word and write word > frequency to a hash table. > > (defun word-frequency-process-buffer () > (interactive) > (let ((buffer (current-buffer)) bounds beg end word) > (save-excursion > (goto-char (point-min)) > (while (re-search-forward "\\<[[:word:]]+\\>" nil t) (word-frequency-incr (downcase (match-string 0)))))) ...should do the same. (?) More specific, I think bounds-of-thing-at-point is your bottleneck. And try to compile everything. -ap > ;; (while (forward-word 1) > (setq bounds (bounds-of-thing-at-point 'word)) > (setq beg (car bounds)) > (setq end (cdr bounds)) > (setq word (downcase (buffer-substring-no-properties beg end))) > (word-frequency-incr word) > )))) > > The main function is word-frequency which operates on the current buffer > and gives *frequencies* with word statistics. > > Any idea how to optimize `word-frequency-process-buffer' function? > > Thanks in advance. > regards, > Seweryn > > P.S. I took some code from command-frequency.el library > > > ------ code start > (defvar word-frequency-table (make-hash-table :test 'equal :size 128)) > > (defvar word-frequency-buffer "*frequencies*" > "Buffer where frequencies are displayed.") > > (defun word-frequency-incr (word) > (puthash word (1+ (gethash word word-frequency-table 0)) word-frequency-table)) > > (defun word-frequency-list (&optional reverse limit) > "Returns a cons which car is sum of times any word was used > and cdr is a list of (word . count) pairs. If REVERSE is nil > sorts it starting from the most used word; if it is 'no-sort > the list is not sorted; if it is non-nil and not 'no-sort sorts > it from the least used words. If LIMIT is positive number > only words which were used more then LIMIT times will be > added. If it is negative number only words which were used > less then -LIMIT times will be added." > (let (l (sum 0)) > (maphash > (cond > ((or (not (numberp limit)) (= limit 0)) > (lambda (k v) (setq l (cons (cons k v) l) sum (+ sum v)))) > ((= limit -1) (lambda (k v) (setq sum (+ sum v)))) > ((< limit 0) > (setq limit (- limit)) > (lambda (k v) (setq sum (+ sum v)) > (if (< v limit) (setq l (cons (cons k v) l))))) > (t > (lambda (k v) (setq sum (+ sum v)) > (if (> v limit) (setq l (cons (cons k v) l)))))) > word-frequency-table) > (cons sum > (cond > ((equal reverse 'no-sort) l) > (reverse (sort l (lambda (a b) (< (cdr a) (cdr b))))) > (t (sort l (lambda (a b) (> (cdr a) (cdr b))))))))) > > (defun word-frequency-string (&optional reverse limit func) > "Returns formatted string with word usage statistics. > > If FUNC is nil each line contains number of times word was > called and the word; if it is t percentage usage is added in > the middle; if it is 'raw each line will contain number an > word separated by single line (with no formatting) otherwise > FUNC must be a function returning a string which will be called > for each entry with three arguments: number of times word was > called, percentage usage and the word. > > See `word-frequency-list' for description of REVERSE and LIMIT > arguments." > (let* ((list (word-frequency-list reverse)) (sum (car list))) > (mapconcat > (cond > ((not func) (lambda (e) (format "%7d %s\n" (cdr e) (car e)))) > ((equal func t) > (lambda (e) (format "%7d %6.2f%% %s\n" > (cdr e) (/ (* 1e2 (cdr e)) sum) (car e)))) > ((equal func 'raw) (lambda (e) (format "%d %s\n" (cdr e) (car e)))) > (t (lambda (e) (funcall func (cdr e) (/ (* 1e2 (cdr e)) sum) (car e))))) > (cdr list) ""))) > > (defun word-frequency (&optional where reverse limit func) > "Formats word usage statistics using > `word-frequency-string' function (see for description of > REVERSE, LIMIT and FUNC arguments) and: > - if WHERE is nil inserts it in th e > or displays it in echo area if possible; else > - if WHERE is t inserts it in the current buffer; else > - if WHERE is an empty string inserts it into > `word-frequency-buffer' buffer; else > - inserts it into buffer WHERE. > > When called interactively behaves as if WHERE and LIMIT were nil, > FUNC was t and: > - with no prefix argument - REVERSE was nil; > - with universal or positive prefix arument - REVERSE was t; > - with negative prefix argument - REVERSE was 'no-sort." > > (interactive (list nil > (cond > ((not current-prefix-arg) nil) > ((> (prefix-numeric-value current-prefix-arg) 0)) > (t 'no-sort)) > nil t)) > (clrhash word-frequency-table) > (word-frequency-process-buffer) > (cond > ((not where) > (display-message-or-buffer (word-frequency-string reverse limit func) > word-frequency-buffer)) > ((equal where t) > (insert (word-frequency-string reverse limit func))) > (t > (display-buffer > (if (and (stringp where) (string= where "")) > word-frequency-buffer where) > (word-frequency-string reverse limit func))))) > > (defun word-frequency-process-buffer () > (interactive) > (let ((buffer (current-buffer)) > bounds > beg > end > word) > (save-excursion > (goto-char (point-min)) > (while (re-search-forward "\\<[[:word:]]+\\>" nil t) > ;; (while (forward-word 1) > (setq bounds (bounds-of-thing-at-point 'word)) > (setq beg (car bounds)) > (setq end (cdr bounds)) > (setq word (downcase (buffer-substring-no-properties beg end))) > (word-frequency-incr word) > )))) > ----- code end > > > >