From: Eric Abrahamsen <eric@ericabrahamsen.net>
To: help-gnu-emacs@gnu.org
Subject: Re: Most used words in current buffer
Date: Sat, 21 Jul 2018 21:05:48 -0700
Message-ID: <877elnrk2r.fsf@ericabrahamsen.net>
In-Reply-To: 87bmazrkbf.fsf@ericabrahamsen.net

Eric Abrahamsen <eric@ericabrahamsen.net> writes:

> Eric Abrahamsen <eric@ericabrahamsen.net> writes:
>
>> Udyant Wig <udyantw@gmail.com> writes:
>>
>>> On 07/21/2018 09:45 PM, Eric Abrahamsen wrote:
>>>> Interesting... In general I think Emacs is highly optimized to use the
>>>> buffer as its textual data structure, more so than a string.
>>>> Particularly when the code is compiled (many of the text-movement
>>>> commands have opcodes). I made the following two commands to collect
>>>> words from a novel in an Org file, and the one that uses
>>>> `forward-word' and `buffer-substring' runs around twice as fast as the
>>>> one that uses `split-string'.
>>>>
>>>> Of course, they don't collect the same list of words! But even if you
>>>> add more code for trimming, etc., it will still likely be faster than
>>>> operating on a string.
>>>> [snip code]
>>>
>>> I have acted upon the advice (yours and Stefan Monnier's) to operate on
>>> the buffer directly using BUFFER-SUBSTRING.  Please see my follow up to
>>> Stefan's message.
>>>
>>> BUFFER-SUBSTRING did gain me (somewhat) better performance.
>>
>> As Stefan said, going character by character is going to be slow... But
>> my example with `forward-word' collects a lot of cruft. So I would
>> suggest doing what `forward-word' does internally and move by syntax.
>
> Actually I think alternating `forward-word' with `forward-to-word' might
> do the exact same thing as alternating (skip-syntax-forward "w") with
> (skip-syntax-forward "^w"), and might get you some extra... stuff. Maybe
> worth benchmarking!
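
A minimal sketch of the `skip-syntax-forward' alternation mentioned above, in case it helps with benchmarking. The function name `my-count-words-by-syntax' is made up for illustration, and it takes a string rather than a file name so it is easy to try in *scratch*. It assumes the standard syntax table of a fresh temp buffer.

```elisp
(require 'cl-lib)                       ; for `cl-incf'

(defun my-count-words-by-syntax (text)
  "Return a hash table of downcased word counts for the string TEXT.
Hypothetical helper: alternates between skipping word-syntax and
non-word-syntax characters instead of calling `forward-word'."
  (let ((counts (make-hash-table :test #'equal))
        start)
    (with-temp-buffer
      (insert text)
      (goto-char (point-min))
      ;; Move to the first word-constituent character, if any.
      (skip-syntax-forward "^w")
      (while (not (eobp))
        (setq start (point))
        ;; Move over one run of word-constituent characters.
        (skip-syntax-forward "w")
        (cl-incf (gethash (downcase (buffer-substring start (point)))
                          counts 0))
        ;; Move over the separators up to the next word (or end of buffer).
        (skip-syntax-forward "^w")))
    counts))
```

Something like (gethash "foo" (my-count-words-by-syntax "Foo bar, foo!")) should then give the count for "foo". Whether this is any faster than the `forward-word' version is exactly what would need measuring.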

And, because apparently my Saturday nights are slow:

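(require 'cl-lib)			; for `cl-incf'

(defun test-buffer (f)
  "Return a hash table mapping each downcased word in file F to its count."
  (let ((counts (make-hash-table :test #'equal))
        pnt)
    (with-temp-buffer
      (insert-file-contents f)
      (goto-char (point-min))
      ;; Skip any leading non-word characters.  Calling `forward-to-word'
      ;; here would jump past the first word when the file starts with one.
      (skip-syntax-forward "^w")
      (setq pnt (point))
      (while (and (not (eobp)) (forward-word))
        ;; The text from PNT to point is one word; bump its count.
        (cl-incf (gethash (downcase (buffer-substring pnt (point))) counts 0))
        (forward-to-word 1)
        (setq pnt (point))))
    counts))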

Seems to go pretty quick on my test file, though it's only 220K.
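
Since the thread is about the most used words, here is a hedged sketch of how the resulting hash table could be turned into a sorted list. The name `my-top-words' is hypothetical, and it only assumes a hash table of word-to-count entries like the one built above.

```elisp
(require 'seq)                          ; for `seq-take'

(defun my-top-words (counts n)
  "Return the N most frequent (WORD . COUNT) pairs from hash table COUNTS.
Hypothetical helper; pairs are sorted by descending count."
  (let (pairs)
    ;; Collect every entry of the table as a cons cell.
    (maphash (lambda (word count) (push (cons word count) pairs)) counts)
    ;; Sort by count, largest first, and keep the first N.
    (seq-take (sort pairs (lambda (a b) (> (cdr a) (cdr b)))) n)))
```

For example, (my-top-words (test-buffer "/path/to/file") 10) would list the ten most common words, with "/path/to/file" standing in for a real file name.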



