all messages for Emacs-related lists mirrored at yhetil.org
* How to generate a wordlist for a document
@ 2011-08-15 21:20 Thorsten
  2011-08-16 13:33 ` Richard Fieldsend
  2011-08-16 16:45 ` Andreas Röhler
  0 siblings, 2 replies; 5+ messages in thread
From: Thorsten @ 2011-08-15 21:20 UTC (permalink / raw)
  To: help-gnu-emacs

Hi list,
how do I generate a word list for a document in Emacs (in my case a
multi-file LaTeX document)?
(By wordlist I mean a list of all unique words in the document.)
Thanks for any hints
Thorsten





* Re: How to generate a wordlist for a document
  2011-08-15 21:20 How to generate a wordlist for a document Thorsten
@ 2011-08-16 13:33 ` Richard Fieldsend
  2011-08-17 13:44   ` Arnaldo Mandel
  2011-08-16 16:45 ` Andreas Röhler
  1 sibling, 1 reply; 5+ messages in thread
From: Richard Fieldsend @ 2011-08-16 13:33 UTC (permalink / raw)
  To: help-gnu-emacs@gnu.org

Hi Thorsten,
you haven't mentioned which OS you are running, or whether you want to include LaTeX commands.  Assuming that you are only interested in the text of the document I would recommend the following steps:

1) For each of the files in your multi-file document run 'detex' to remove all of the TeX and LaTeX formatting.
2) Compile a single file containing the detex'd versions of the files using cat:

cat file1 >> completefile

3) You can then make the file one word per line, then sort it and make each term appear just once by doing the following:

grep -o -E '\w+' completefile | sort | uniq > output

If you need word frequency information, you can make uniq prepend the number of occurrences with its -c option.

For the record, this doesn't lowercase anything, so the same word may show up more than once in different capitalisations (e.g. "The" and "the").
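
A minimal sketch of those two refinements, assuming a grep that supports -o and \w (GNU grep does) and plain ASCII text. The function name word_freq is made up for illustration; it is not an existing command.

```shell
# Hypothetical helper: reads plain text on stdin and prints
# "count word" lines, most frequent first.
# 1. tr folds everything to lowercase so "The" and "the" merge
# 2. grep -o puts one word per line
# 3. sort groups identical words (uniq needs sorted input)
# 4. uniq -c prepends the number of occurrences
# 5. the final sort orders by count, ties alphabetically
word_freq() {
  tr '[:upper:]' '[:lower:]' |
    grep -o -E '\w+' |
    sort |
    uniq -c |
    sort -k1,1nr -k2
}
```

It could be used as `word_freq < completefile > output`, where completefile is the combined file from step 2.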

HTH

Richard



* Re: How to generate a wordlist for a document
  2011-08-15 21:20 How to generate a wordlist for a document Thorsten
  2011-08-16 13:33 ` Richard Fieldsend
@ 2011-08-16 16:45 ` Andreas Röhler
  2011-08-16 20:27   ` Arnaldo Mandel
  1 sibling, 1 reply; 5+ messages in thread
From: Andreas Röhler @ 2011-08-16 16:45 UTC (permalink / raw)
  To: help-gnu-emacs

On 15.08.2011 23:20, Thorsten wrote:
> Hi list,
> how do I generate a word list for a document in Emacs (in my case a
> multi-file LaTex document)?
> (With wordlist I mean a list with all unique words in the document)
> Thanks for any hints
> Thorsten
>
>
>

Hi,

I would first export everything to plain text and put it all into one file.

Then, inside Emacs, you could use something like this:

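(defun wordlist (&optional beg end)
  "Return a list of the unique words between BEG and END.
Without arguments, use the active region, else the whole buffer."
  (interactive)
  (let ((beg (cond (beg)
                   ((region-active-p)
                    (region-beginning))
                   (t (point-min))))
        (end (cond (end (copy-marker end))
                   ((region-active-p)
                    (copy-marker (region-end)))
                   (t (point-max))))
        (erg '()))
    (save-excursion
      ;; Start at BEG (not point-min) and stop searching at END.
      (goto-char beg)
      (while (re-search-forward "\\sw+" end t)
        ;; Collect each word only once.
        (let ((word (match-string-no-properties 0)))
          (unless (member word erg)
            (push word erg)))))
    ;; Restore first-occurrence order, then report and return the
    ;; result once, after the loop has finished.
    (setq erg (nreverse erg))
    (when (called-interactively-p 'interactive)
      (message "%s" erg))
    erg))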

HTH,


Andreas

--
https://launchpad.net/python-mode
https://launchpad.net/s-x-emacs-werkstatt/





* Re: How to generate a wordlist for a document
  2011-08-16 16:45 ` Andreas Röhler
@ 2011-08-16 20:27   ` Arnaldo Mandel
  0 siblings, 0 replies; 5+ messages in thread
From: Arnaldo Mandel @ 2011-08-16 20:27 UTC (permalink / raw)
  To: Andreas Röhler; +Cc: help-gnu-emacs


On Tue, Aug 16, 2011 at 1:45 PM, Andreas Röhler <
andreas.roehler@easy-emacs.de> wrote:

> On 15.08.2011 23:20, Thorsten wrote:
>
>> Hi list,
>> how do I generate a word list for a document in Emacs (in my case a
>> multi-file LaTex document)?
>> (With wordlist I mean a list with all unique words in the document)
>> Thanks for any hints
>> Thorsten
>>
>
> Hi,
>
> would export first all into plain texts.
>
> Put all into one file.
>
> than inside Emacs
>
> you could use something like that:
>
> (defun wordlist (&optional beg end)
>
[...]

This is a bit too simplistic.  For instance, it would list words inside
comments, macro parameters, environment names.  Things can get really
complicated.

There is a Perl script called texcount, which is part of many TeX
distributions.  It embodies a lot of LaTeX knowledge in deciding what is
and what is not a word, and its sheer size shows the difficulty of the
problem.  As the name says, the program counts words.  However, with option
-v1 it outputs a "cleaned-up" version of the text, tagged with ANSI color
codes.  That seems to be amenable to processing by code similar to what
you propose, with a more complex underlying automaton.
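
As a rough sketch of a first processing step, the color tagging could be stripped with sed before any word extraction.  This assumes the tags are standard SGR escape sequences (ESC [ ... m), which has not been checked against texcount's actual output; strip_ansi is a made-up name, not part of texcount.

```shell
# Hypothetical filter: remove SGR color sequences (ESC [ digits/semicolons m)
# from stdin and pass everything else through unchanged.
strip_ansi() {
  # printf '\033' produces the literal ESC character in a portable way.
  sed "s/$(printf '\033')\[[0-9;]*m//g"
}
```

One would pipe the verbose output through it, e.g. `texcount -v1 main.tex | strip_ansi`.  Whatever markers and summary lines texcount prints besides the text would still be there and would need their own handling.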

Still, it also depends on what Thorsten's concept of word is, in his
question.  For instance, texcount reports

\documentclass{article}
\begin{document}
\textsc{w}o\emph{r}\texttt{d}.
\end{document}

as containing 4 words; it can reasonably be construed as a one-word text.

Arnaldo


* Re: How to generate a wordlist for a document
  2011-08-16 13:33 ` Richard Fieldsend
@ 2011-08-17 13:44   ` Arnaldo Mandel
  0 siblings, 0 replies; 5+ messages in thread
From: Arnaldo Mandel @ 2011-08-17 13:44 UTC (permalink / raw)
  To: Richard Fieldsend; +Cc: help-gnu-emacs@gnu.org


On Tue, Aug 16, 2011 at 10:33 AM, Richard Fieldsend <
r.fieldsend@btopenworld.com> wrote:

> Hi Thorsten,
> you haven't mentioned which OS you are running, or whether you want to
> include LaTeX commands.  Assuming that you are only interested in the text
> of the document I would recommend the following steps:

I had forgotten detex!

> 1) For each of the files in your multi-file document run 'detex' to remove
> all of the TeX and LaTeX formatting.
> 2) Compile a single file containing the detex'd versions of the files using
> cat:

Actually, not needed.  detex follows \input and \include commands.

> cat file1 >> completefile
>
> 3) You can then make the file one word per line, then sort it and make each
> term appear just once by doing the following:
>
> grep -o -E '\w+' *sourcefile* | sort | uniq > output

Actually, detex can give a wordlist, so the pipeline reduces to

detex -w mainfile.tex | sort -u

Within Emacs, I would use it in Dired, pressing ! on the main file and typing

detex -w * | sort -u

(Dired substitutes the file name for the *.)

Of course, if one uses this a lot, one can always wrap it into an Emacs
function or a shell command.
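
A minimal sketch of such a shell wrapper, assuming detex is on the PATH.  The name texwords is made up for illustration.

```shell
# Hypothetical wrapper around the pipeline above; "texwords" is not a
# standard command.  Prints the unique words of the given TeX file(s),
# one per line, sorted.
texwords() {
  if [ "$#" -lt 1 ]; then
    echo "usage: texwords mainfile.tex" >&2
    return 2
  fi
  # detex -w emits one word per line; sort -u removes duplicates.
  detex -w "$@" | sort -u
}
```

Put into a shell startup file, it would be called as `texwords mainfile.tex > wordlist.txt`.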

Arnaldo

> If you need word frequency information then you can make uniq prepend the
> number of occurences.
>
> For the record, this doesn't lowercase anything so multiple occurences of
> the same word are likely.
>
> HTH
>
> Richard
>

