unofficial mirror of help-gnu-emacs@gnu.org
 help / color / mirror / Atom feed
* processing a large buffer contents to a hash table
@ 2009-01-09 16:56 Seweryn Kokot
  0 siblings, 0 replies; 4+ messages in thread
From: Seweryn Kokot @ 2009-01-09 16:56 UTC (permalink / raw)
  To: help-gnu-emacs

Hello,

I'm trying to write a function in elisp which returns word frequency of
a buffer content . It works but for a large file
(around 250 000 words) it takes 15 seconds, while a similar function in python
takes 4s. 

Here is the function which process a buffer word by word and write word
frequency to a hash table. 

(defun word-frequency-process-buffer ()
  (interactive)
  (let ((buffer (current-buffer)) bounds beg end word)
	(save-excursion
	  (goto-char (point-min))
	  (while (re-search-forward "\\<[[:word:]]+\\>" nil t)
;;	  (while (forward-word 1)
		(setq bounds (bounds-of-thing-at-point 'word))
		(setq beg (car bounds))
		(setq end (cdr bounds))
		(setq word (downcase (buffer-substring-no-properties beg end)))
		(word-frequency-incr word)
		))))

The main function is word-frequency which operates on the current buffer
and gives *frequencies* with word statistics.

Any idea how to optimize `word-frequency-process-buffer' function?

Thanks in advance.
regards,
Seweryn

P.S. I took some code from command-frequency.el library


------ code start
(defvar word-frequency-table (make-hash-table :test 'equal :size 128))

(defvar word-frequency-buffer "*frequencies*"
  "Buffer where frequencies are displayed.")

(defun word-frequency-incr (word)
  (puthash word (1+ (gethash word word-frequency-table 0)) word-frequency-table))

(defun word-frequency-list (&optional reverse limit)
  "Returns a cons which car is sum of times any word was used
and cdr is a list of (word . count) pairs.  If REVERSE is nil
sorts it starting from the most used word; if it is 'no-sort
the list is not sorted; if it is non-nil and not 'no-sort sorts
it from the least used words.  If LIMIT is positive number
only words which were used more then LIMIT times will be
added.  If it is negative number only words which were used
less then -LIMIT times will be added."
  (let (l (sum 0))
    (maphash
     (cond
      ((or (not (numberp limit)) (= limit 0))
       (lambda (k v) (setq l (cons (cons k v) l) sum (+ sum v))))
      ((= limit -1) (lambda (k v) (setq sum (+ sum v))))
      ((< limit 0)
       (setq limit (- limit))
       (lambda (k v) (setq sum (+ sum v))
         (if (< v limit) (setq l (cons (cons k v) l)))))
      (t
       (lambda (k v) (setq sum (+ sum v))
         (if (> v limit) (setq l (cons (cons k v) l))))))
     word-frequency-table)
    (cons sum
          (cond
           ((equal reverse 'no-sort) l)
           (reverse (sort l (lambda (a b) (< (cdr a) (cdr b)))))
           (t       (sort l (lambda (a b) (> (cdr a) (cdr b)))))))))

(defun word-frequency-string (&optional reverse limit func)
  "Returns formatted string with word usage statistics.

If FUNC is nil each line contains number of times word was
called and the word; if it is t percentage usage is added in
the middle; if it is 'raw each line will contain number an
word separated by single line (with no formatting) otherwise
FUNC must be a function returning a string which will be called
for each entry with three arguments: number of times word was
called, percentage usage and the word.

See `word-frequency-list' for description of REVERSE and LIMIT
arguments."
  (let* ((list (word-frequency-list reverse)) (sum (car list)))
    (mapconcat
     (cond
      ((not func) (lambda (e) (format "%7d  %s\n" (cdr e) (car e))))
      ((equal func t)
       (lambda (e) (format "%7d  %6.2f%%  %s\n"
                           (cdr e) (/ (* 1e2 (cdr e)) sum) (car e))))
      ((equal func 'raw) (lambda (e) (format "%d %s\n" (cdr e) (car e))))
      (t (lambda (e) (funcall func (cdr e) (/ (* 1e2 (cdr e)) sum) (car e)))))
     (cdr list) "")))

(defun word-frequency (&optional where reverse limit func)
  "Formats word usage statistics using
`word-frequency-string' function (see for description of
REVERSE, LIMIT and FUNC arguments) and:
- if WHERE is nil inserts it in th e
  or displays it in echo area if possible; else
- if WHERE is t inserts it in the current buffer; else
- if WHERE is an empty string inserts it into
  `word-frequency-buffer' buffer; else
- inserts it into buffer WHERE.

When called interactively behaves as if WHERE and LIMIT were nil,
FUNC was t and:
- with no prefix argument - REVERSE was nil;
- with universal or positive prefix arument - REVERSE was t;
- with negative prefix argument - REVERSE was 'no-sort."

  (interactive (list nil
                     (cond
                      ((not current-prefix-arg) nil)
                      ((> (prefix-numeric-value current-prefix-arg) 0))
                      (t 'no-sort))
                     nil t))
  (clrhash word-frequency-table)
  (word-frequency-process-buffer)
  (cond
   ((not where)
    (display-message-or-buffer (word-frequency-string reverse limit func)
                               word-frequency-buffer))
   ((equal where t)
    (insert (word-frequency-string reverse limit func)))
   (t
    (display-buffer
     (if (and (stringp where) (string= where ""))
         word-frequency-buffer where)
     (word-frequency-string reverse limit func)))))

(defun word-frequency-process-buffer ()
  (interactive)
  (let ((buffer (current-buffer))
		bounds
		beg
		end
		word)
	(save-excursion
	  (goto-char (point-min))
	  (while (re-search-forward "\\<[[:word:]]+\\>" nil t)
;;	  (while (forward-word 1)
		(setq bounds (bounds-of-thing-at-point 'word))
		(setq beg (car bounds))
		(setq end (cdr bounds))
		(setq word (downcase (buffer-substring-no-properties beg end)))
		(word-frequency-incr word)
		))))
----- code end






^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: processing a large buffer contents to a hash table
       [not found] <mailman.4415.1231519828.26697.help-gnu-emacs@gnu.org>
@ 2009-01-09 17:06 ` Andreas Politz
  2009-01-09 17:59   ` Seweryn Kokot
  2009-01-09 17:47 ` Xah Lee
  1 sibling, 1 reply; 4+ messages in thread
From: Andreas Politz @ 2009-01-09 17:06 UTC (permalink / raw)
  To: help-gnu-emacs

Seweryn Kokot wrote:
> Hello,
> 
> I'm trying to write a function in elisp which returns word frequency of
> a buffer content . It works but for a large file
> (around 250 000 words) it takes 15 seconds, while a similar function in python
> takes 4s. 
> 
> Here is the function which process a buffer word by word and write word
> frequency to a hash table. 
> 
> (defun word-frequency-process-buffer ()
>   (interactive)
>   (let ((buffer (current-buffer)) bounds beg end word)
> 	(save-excursion
> 	  (goto-char (point-min))
> 	  (while (re-search-forward "\\<[[:word:]]+\\>" nil t)
           (word-frequency-incr (downcase (match-string 0))))))

...should do the same. (?)
More specific, I think bounds-of-thing-at-point is
your bottleneck. And try to compile everything.

-ap

> ;;	  (while (forward-word 1)
> 		(setq bounds (bounds-of-thing-at-point 'word))
> 		(setq beg (car bounds))
> 		(setq end (cdr bounds))
> 		(setq word (downcase (buffer-substring-no-properties beg end)))
> 		(word-frequency-incr word)
> 		))))
> 
> The main function is word-frequency which operates on the current buffer
> and gives *frequencies* with word statistics.
> 
> Any idea how to optimize `word-frequency-process-buffer' function?
> 
> Thanks in advance.
> regards,
> Seweryn
> 
> P.S. I took some code from command-frequency.el library
> 
> 
> ------ code start
> (defvar word-frequency-table (make-hash-table :test 'equal :size 128))
> 
> (defvar word-frequency-buffer "*frequencies*"
>   "Buffer where frequencies are displayed.")
> 
> (defun word-frequency-incr (word)
>   (puthash word (1+ (gethash word word-frequency-table 0)) word-frequency-table))
> 
> (defun word-frequency-list (&optional reverse limit)
>   "Returns a cons which car is sum of times any word was used
> and cdr is a list of (word . count) pairs.  If REVERSE is nil
> sorts it starting from the most used word; if it is 'no-sort
> the list is not sorted; if it is non-nil and not 'no-sort sorts
> it from the least used words.  If LIMIT is positive number
> only words which were used more then LIMIT times will be
> added.  If it is negative number only words which were used
> less then -LIMIT times will be added."
>   (let (l (sum 0))
>     (maphash
>      (cond
>       ((or (not (numberp limit)) (= limit 0))
>        (lambda (k v) (setq l (cons (cons k v) l) sum (+ sum v))))
>       ((= limit -1) (lambda (k v) (setq sum (+ sum v))))
>       ((< limit 0)
>        (setq limit (- limit))
>        (lambda (k v) (setq sum (+ sum v))
>          (if (< v limit) (setq l (cons (cons k v) l)))))
>       (t
>        (lambda (k v) (setq sum (+ sum v))
>          (if (> v limit) (setq l (cons (cons k v) l))))))
>      word-frequency-table)
>     (cons sum
>           (cond
>            ((equal reverse 'no-sort) l)
>            (reverse (sort l (lambda (a b) (< (cdr a) (cdr b)))))
>            (t       (sort l (lambda (a b) (> (cdr a) (cdr b)))))))))
> 
> (defun word-frequency-string (&optional reverse limit func)
>   "Returns formatted string with word usage statistics.
> 
> If FUNC is nil each line contains number of times word was
> called and the word; if it is t percentage usage is added in
> the middle; if it is 'raw each line will contain number an
> word separated by single line (with no formatting) otherwise
> FUNC must be a function returning a string which will be called
> for each entry with three arguments: number of times word was
> called, percentage usage and the word.
> 
> See `word-frequency-list' for description of REVERSE and LIMIT
> arguments."
>   (let* ((list (word-frequency-list reverse)) (sum (car list)))
>     (mapconcat
>      (cond
>       ((not func) (lambda (e) (format "%7d  %s\n" (cdr e) (car e))))
>       ((equal func t)
>        (lambda (e) (format "%7d  %6.2f%%  %s\n"
>                            (cdr e) (/ (* 1e2 (cdr e)) sum) (car e))))
>       ((equal func 'raw) (lambda (e) (format "%d %s\n" (cdr e) (car e))))
>       (t (lambda (e) (funcall func (cdr e) (/ (* 1e2 (cdr e)) sum) (car e)))))
>      (cdr list) "")))
> 
> (defun word-frequency (&optional where reverse limit func)
>   "Formats word usage statistics using
> `word-frequency-string' function (see for description of
> REVERSE, LIMIT and FUNC arguments) and:
> - if WHERE is nil inserts it in th e
>   or displays it in echo area if possible; else
> - if WHERE is t inserts it in the current buffer; else
> - if WHERE is an empty string inserts it into
>   `word-frequency-buffer' buffer; else
> - inserts it into buffer WHERE.
> 
> When called interactively behaves as if WHERE and LIMIT were nil,
> FUNC was t and:
> - with no prefix argument - REVERSE was nil;
> - with universal or positive prefix arument - REVERSE was t;
> - with negative prefix argument - REVERSE was 'no-sort."
> 
>   (interactive (list nil
>                      (cond
>                       ((not current-prefix-arg) nil)
>                       ((> (prefix-numeric-value current-prefix-arg) 0))
>                       (t 'no-sort))
>                      nil t))
>   (clrhash word-frequency-table)
>   (word-frequency-process-buffer)
>   (cond
>    ((not where)
>     (display-message-or-buffer (word-frequency-string reverse limit func)
>                                word-frequency-buffer))
>    ((equal where t)
>     (insert (word-frequency-string reverse limit func)))
>    (t
>     (display-buffer
>      (if (and (stringp where) (string= where ""))
>          word-frequency-buffer where)
>      (word-frequency-string reverse limit func)))))
> 
> (defun word-frequency-process-buffer ()
>   (interactive)
>   (let ((buffer (current-buffer))
> 		bounds
> 		beg
> 		end
> 		word)
> 	(save-excursion
> 	  (goto-char (point-min))
> 	  (while (re-search-forward "\\<[[:word:]]+\\>" nil t)
> ;;	  (while (forward-word 1)
> 		(setq bounds (bounds-of-thing-at-point 'word))
> 		(setq beg (car bounds))
> 		(setq end (cdr bounds))
> 		(setq word (downcase (buffer-substring-no-properties beg end)))
> 		(word-frequency-incr word)
> 		))))
> ----- code end
> 
> 
> 
> 


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: processing a large buffer contents to a hash table
       [not found] <mailman.4415.1231519828.26697.help-gnu-emacs@gnu.org>
  2009-01-09 17:06 ` processing a large buffer contents to a hash table Andreas Politz
@ 2009-01-09 17:47 ` Xah Lee
  1 sibling, 0 replies; 4+ messages in thread
From: Xah Lee @ 2009-01-09 17:47 UTC (permalink / raw)
  To: help-gnu-emacs

On Jan 9, 8:56 am, Seweryn Kokot <sewko...@gmail.com> wrote:
> Hello,
>
> I'm trying to write a function in elisp which returns word frequency of
> a buffer content . It works but for a large file
> (around 250 000 words) it takes 15 seconds, while a similar function in python
> takes 4s.
>
> Here is the function which process a buffer word by word and write word
> frequency to a hash table.
>
> (defun word-frequency-process-buffer ()
>   (interactive)
>   (let ((buffer (current-buffer)) bounds beg end word)
>         (save-excursion
>           (goto-char (point-min))
>           (while (re-search-forward "\\<[[:word:]]+\\>" nil t)
> ;;        (while (forward-word 1)
>                 (setq bounds (bounds-of-thing-at-point 'word))
>                 (setq beg (car bounds))
>                 (setq end (cdr bounds))
>                 (setq word (downcase (buffer-substring-no-properties beg end)))
>                 (word-frequency-incr word)
>                 ))))
>
> The main function is word-frequency which operates on the current buffer
> and gives *frequencies* with word statistics.
>
> Any idea how to optimize `word-frequency-process-buffer' function?
>

you doing many unnecessary things.

you don't need save-excursion.
don't need setting the bunch of boundary.
you can use just match-string to capture the word and feed it to word-
frequency-incr.

...

  Xah
∑ http://xahlee.org/^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: processing a large buffer contents to a hash table
  2009-01-09 17:06 ` processing a large buffer contents to a hash table Andreas Politz
@ 2009-01-09 17:59   ` Seweryn Kokot
  0 siblings, 0 replies; 4+ messages in thread
From: Seweryn Kokot @ 2009-01-09 17:59 UTC (permalink / raw)
  To: help-gnu-emacs

Andreas Politz <politza@fh-trier.de> writes:

> Seweryn Kokot wrote:
>> Hello,
>>
>> I'm trying to write a function in elisp which returns word frequency of
>> a buffer content . It works but for a large file
>> (around 250 000 words) it takes 15 seconds, while a similar function in python
>> takes 4s. 
>>
>> Here is the function which process a buffer word by word and write word
>> frequency to a hash table. 
>>
>> (defun word-frequency-process-buffer ()
>>   (interactive)
>>   (let ((buffer (current-buffer)) bounds beg end word)
>> 	(save-excursion
>> 	  (goto-char (point-min))
>> 	  (while (re-search-forward "\\<[[:word:]]+\\>" nil t)
>           (word-frequency-incr (downcase (match-string 0))))))
>
> ...should do the same. (?)
> More specific, I think bounds-of-thing-at-point is
> your bottleneck. And try to compile everything.
>

Indeed it helps a lot. 
Now it also takes 4s to process this big file!

Thanks.
-- 
regards,
Seweryn





^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2009-01-09 17:59 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <mailman.4415.1231519828.26697.help-gnu-emacs@gnu.org>
2009-01-09 17:06 ` processing a large buffer contents to a hash table Andreas Politz
2009-01-09 17:59   ` Seweryn Kokot
2009-01-09 17:47 ` Xah Lee
2009-01-09 16:56 Seweryn Kokot

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).