From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!not-for-mail From: Seweryn Kokot Newsgroups: gmane.emacs.help Subject: processing a large buffer contents to a hash table Date: Fri, 09 Jan 2009 17:56:30 +0100 Message-ID: <8763kobept.fsf@poczta.po.opole.pl> NNTP-Posting-Host: lo.gmane.org Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Trace: ger.gmane.org 1231519854 32437 80.91.229.12 (9 Jan 2009 16:50:54 GMT) X-Complaints-To: usenet@ger.gmane.org NNTP-Posting-Date: Fri, 9 Jan 2009 16:50:54 +0000 (UTC) To: help-gnu-emacs@gnu.org Original-X-From: help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane.org@gnu.org Fri Jan 09 17:52:05 2009 Return-path: Envelope-to: geh-help-gnu-emacs@m.gmane.org Original-Received: from lists.gnu.org ([199.232.76.165]) by lo.gmane.org with esmtp (Exim 4.50) id 1LLKaZ-0007Z8-W6 for geh-help-gnu-emacs@m.gmane.org; Fri, 09 Jan 2009 17:52:04 +0100 Original-Received: from localhost ([127.0.0.1]:56080 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.43) id 1LLKZJ-0005q0-QS for geh-help-gnu-emacs@m.gmane.org; Fri, 09 Jan 2009 11:50:45 -0500 Original-Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43) id 1LLKZ0-0005pv-0v for help-gnu-emacs@gnu.org; Fri, 09 Jan 2009 11:50:26 -0500 Original-Received: from exim by lists.gnu.org with spam-scanned (Exim 4.43) id 1LLKYy-0005pj-I2 for help-gnu-emacs@gnu.org; Fri, 09 Jan 2009 11:50:24 -0500 Original-Received: from [199.232.76.173] (port=47203 helo=monty-python.gnu.org) by lists.gnu.org with esmtp (Exim 4.43) id 1LLKYy-0005pg-Bs for help-gnu-emacs@gnu.org; Fri, 09 Jan 2009 11:50:24 -0500 Original-Received: from main.gmane.org ([80.91.229.2]:38499 helo=ciao.gmane.org) by monty-python.gnu.org with esmtps (TLS-1.0:RSA_AES_256_CBC_SHA1:32) (Exim 4.60) (envelope-from ) id 1LLKYx-00006D-NO for help-gnu-emacs@gnu.org; Fri, 09 Jan 2009 11:50:24 -0500 Original-Received: from list by ciao.gmane.org with local (Exim 4.43) id 1LLKYq-0007wu-1P for help-gnu-emacs@gnu.org; Fri, 09 Jan 2009 16:50:16 +0000 Original-Received: from nat-warynskiego.po.opole.pl ([217.173.199.132]) by main.gmane.org with esmtp (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for ; Fri, 09 Jan 2009 16:50:16 +0000 Original-Received: from sewkokot by nat-warynskiego.po.opole.pl with local (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for ; Fri, 09 Jan 2009 16:50:16 +0000 X-Injected-Via-Gmane: http://gmane.org/ Original-Lines: 154 Original-X-Complaints-To: usenet@ger.gmane.org X-Gmane-NNTP-Posting-Host: nat-warynskiego.po.opole.pl X-Operating-System: Debian GNU/Linux User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/22.2 (gnu/linux) Cancel-Lock: sha1:CovGzkjXgT2CMwTTQaWiUG/b8Xo= X-detected-operating-system: by monty-python.gnu.org: GNU/Linux 2.6, seldom 2.4 (older, 4) X-BeenThere: help-gnu-emacs@gnu.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Users list for the GNU Emacs text editor List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Original-Sender: help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane.org@gnu.org Errors-To: help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane.org@gnu.org Xref: news.gmane.org gmane.emacs.help:61212 Archived-At: Hello, I'm trying to write a function in elisp which returns word frequency of a buffer content . It works but for a large file (around 250 000 words) it takes 15 seconds, while a similar function in python takes 4s. Here is the function which process a buffer word by word and write word frequency to a hash table. (defun word-frequency-process-buffer () (interactive) (let ((buffer (current-buffer)) bounds beg end word) (save-excursion (goto-char (point-min)) (while (re-search-forward "\\<[[:word:]]+\\>" nil t) ;; (while (forward-word 1) (setq bounds (bounds-of-thing-at-point 'word)) (setq beg (car bounds)) (setq end (cdr bounds)) (setq word (downcase (buffer-substring-no-properties beg end))) (word-frequency-incr word) )))) The main function is word-frequency which operates on the current buffer and gives *frequencies* with word statistics. Any idea how to optimize `word-frequency-process-buffer' function? Thanks in advance. regards, Seweryn P.S. I took some code from command-frequency.el library ------ code start (defvar word-frequency-table (make-hash-table :test 'equal :size 128)) (defvar word-frequency-buffer "*frequencies*" "Buffer where frequencies are displayed.") (defun word-frequency-incr (word) (puthash word (1+ (gethash word word-frequency-table 0)) word-frequency-table)) (defun word-frequency-list (&optional reverse limit) "Returns a cons which car is sum of times any word was used and cdr is a list of (word . count) pairs. If REVERSE is nil sorts it starting from the most used word; if it is 'no-sort the list is not sorted; if it is non-nil and not 'no-sort sorts it from the least used words. If LIMIT is positive number only words which were used more then LIMIT times will be added. If it is negative number only words which were used less then -LIMIT times will be added." (let (l (sum 0)) (maphash (cond ((or (not (numberp limit)) (= limit 0)) (lambda (k v) (setq l (cons (cons k v) l) sum (+ sum v)))) ((= limit -1) (lambda (k v) (setq sum (+ sum v)))) ((< limit 0) (setq limit (- limit)) (lambda (k v) (setq sum (+ sum v)) (if (< v limit) (setq l (cons (cons k v) l))))) (t (lambda (k v) (setq sum (+ sum v)) (if (> v limit) (setq l (cons (cons k v) l)))))) word-frequency-table) (cons sum (cond ((equal reverse 'no-sort) l) (reverse (sort l (lambda (a b) (< (cdr a) (cdr b))))) (t (sort l (lambda (a b) (> (cdr a) (cdr b))))))))) (defun word-frequency-string (&optional reverse limit func) "Returns formatted string with word usage statistics. If FUNC is nil each line contains number of times word was called and the word; if it is t percentage usage is added in the middle; if it is 'raw each line will contain number an word separated by single line (with no formatting) otherwise FUNC must be a function returning a string which will be called for each entry with three arguments: number of times word was called, percentage usage and the word. See `word-frequency-list' for description of REVERSE and LIMIT arguments." (let* ((list (word-frequency-list reverse)) (sum (car list))) (mapconcat (cond ((not func) (lambda (e) (format "%7d %s\n" (cdr e) (car e)))) ((equal func t) (lambda (e) (format "%7d %6.2f%% %s\n" (cdr e) (/ (* 1e2 (cdr e)) sum) (car e)))) ((equal func 'raw) (lambda (e) (format "%d %s\n" (cdr e) (car e)))) (t (lambda (e) (funcall func (cdr e) (/ (* 1e2 (cdr e)) sum) (car e))))) (cdr list) ""))) (defun word-frequency (&optional where reverse limit func) "Formats word usage statistics using `word-frequency-string' function (see for description of REVERSE, LIMIT and FUNC arguments) and: - if WHERE is nil inserts it in th e or displays it in echo area if possible; else - if WHERE is t inserts it in the current buffer; else - if WHERE is an empty string inserts it into `word-frequency-buffer' buffer; else - inserts it into buffer WHERE. When called interactively behaves as if WHERE and LIMIT were nil, FUNC was t and: - with no prefix argument - REVERSE was nil; - with universal or positive prefix arument - REVERSE was t; - with negative prefix argument - REVERSE was 'no-sort." (interactive (list nil (cond ((not current-prefix-arg) nil) ((> (prefix-numeric-value current-prefix-arg) 0)) (t 'no-sort)) nil t)) (clrhash word-frequency-table) (word-frequency-process-buffer) (cond ((not where) (display-message-or-buffer (word-frequency-string reverse limit func) word-frequency-buffer)) ((equal where t) (insert (word-frequency-string reverse limit func))) (t (display-buffer (if (and (stringp where) (string= where "")) word-frequency-buffer where) (word-frequency-string reverse limit func))))) (defun word-frequency-process-buffer () (interactive) (let ((buffer (current-buffer)) bounds beg end word) (save-excursion (goto-char (point-min)) (while (re-search-forward "\\<[[:word:]]+\\>" nil t) ;; (while (forward-word 1) (setq bounds (bounds-of-thing-at-point 'word)) (setq beg (car bounds)) (setq end (cdr bounds)) (setq word (downcase (buffer-substring-no-properties beg end))) (word-frequency-incr word) )))) ----- code end