On Tue, Aug 16, 2011 at 10:33 AM, Richard Fieldsend
<r.fieldsend@btopenworld.com> wrote:
Hi Thorsten,
You haven't mentioned which OS you are running, or whether you want to include LaTeX commands. Assuming that you are only interested in the text of the document, I would recommend the following steps:
I had forgotten detex!
1) For each of the files in your multi-file document run 'detex' to remove all of the TeX and LaTeX formatting.
2) Compile a single file containing the detex'd versions of the files using cat:
Actually, not needed. detex follows \input and \include commands.
cat file1 >> completefile
3) You can then split the file into one word per line, sort it, and keep each word just once by doing the following:
grep -o -E '\w+' *sourcefile* | sort | uniq > output
Actually, detex can give a wordlist, so the pipeline reduces to
detex -w mainfile.tex | sort -u
Within Emacs, I would run it from Dired: press ! on the main file and type
detex -w * | sort -u
Of course, if one uses this a lot, one can always wrap it into an Emacs function or a shell command.
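As a rough sketch of the shell-command variant: the function name wordlist is made up for illustration, and it assumes detex is installed and on the PATH. It only wraps the pipeline given above.

```shell
# Hypothetical wrapper (the name "wordlist" is an assumption, not a standard tool).
# Prints the unique words of the given TeX file(s), one per line, sorted.
# Relies on detex being on PATH; detex -w emits one word per line.
wordlist() {
    detex -w "$@" | sort -u
}
```

Placed in a shell startup file, it could be called as: wordlist mainfile.tex > words.txt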
Arnaldo
If you need word frequency information then you can make uniq prepend the number of occurrences (uniq -c).
For the record, this doesn't lowercase anything, so the same word is likely to appear more than once with different capitalization.
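A minimal sketch of a frequency count that also folds case, assuming the detex'd text is already in a plain file. The function name word_freq is made up for illustration. It reuses the grep pipeline from step 3 and adds tr and uniq -c.

```shell
# Hypothetical helper (the name "word_freq" is an assumption).
# Prints "count word" lines for the given plain-text file, most frequent first.
word_freq() {
    grep -o -E '\w+' "$1" |          # one word per line
        tr '[:upper:]' '[:lower:]' | # fold case so "The" and "the" are merged
        sort |                       # uniq only collapses adjacent duplicates
        uniq -c |                    # prepend the number of occurrences
        sort -rn                     # highest counts first
}
```

For example: word_freq completefile > frequencies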
HTH
Richard
----- Original Message -----
From: Thorsten <quintfall@googlemail.com>
To: help-gnu-emacs@gnu.org
Cc:
Sent: Monday, 15 August 2011, 22:20
Subject: How to generate a wordlist for a document
Hi list,
how do I generate a word list for a document in Emacs (in my case a
multi-file LaTeX document)?
(By word list I mean a list of all unique words in the document)
Thanks for any hints
Thorsten