* How to generate a wordlist for a document
@ 2011-08-15 21:20 Thorsten
2011-08-16 13:33 ` Richard Fieldsend
2011-08-16 16:45 ` Andreas Röhler
0 siblings, 2 replies; 5+ messages in thread
From: Thorsten @ 2011-08-15 21:20 UTC (permalink / raw)
To: help-gnu-emacs
Hi list,
how do I generate a word list for a document in Emacs (in my case a
multi-file LaTeX document)?
(By wordlist I mean a list of all unique words in the document.)
Thanks for any hints
Thorsten
* Re: How to generate a wordlist for a document
From: Richard Fieldsend @ 2011-08-16 13:33 UTC (permalink / raw)
To: help-gnu-emacs@gnu.org
Hi Thorsten,
you haven't mentioned which OS you are running, or whether you want to include LaTeX commands. Assuming that you are only interested in the text of the document, I would recommend the following steps:
1) For each of the files in your multi-file document, run 'detex' to remove all of the TeX and LaTeX formatting.
2) Combine the detex'd versions of the files into a single file using cat, appending each one in turn:
cat file1 >> completefile
3) You can then put one word on each line, sort the result, and make each term appear just once by doing the following:
grep -o -E '\w+' completefile | sort | uniq > output
If you need word frequency information, you can make uniq prepend the number of occurrences with its -c option.
For the record, this doesn't lowercase anything, so the same word is likely to appear more than once with different capitalization (e.g. "The" and "the").
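The two caveats above (frequency counts and case) can be sketched as one pipeline. This is only an illustration: the function name wordfreq is made up here, and it assumes a grep that supports -o and \w, as the command in step 3 already does.

```shell
# wordfreq: read plain text on stdin and print "count word" lines,
# most frequent first. Case is folded so "The" and "the" are merged.
# (wordfreq is a hypothetical name, not something from detex or grep.)
wordfreq() {
  grep -o -E '\w+' \
    | tr '[:upper:]' '[:lower:]' \
    | sort \
    | uniq -c \
    | sort -rn
}
```

It could be used as `wordfreq < completefile > frequencies`, where completefile is the combined file from step 2.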
HTH
Richard
* Re: How to generate a wordlist for a document
From: Andreas Röhler @ 2011-08-16 16:45 UTC (permalink / raw)
To: help-gnu-emacs
On 15.08.2011 23:20, Thorsten wrote:
> Hi list,
> how do I generate a word list for a document in Emacs (in my case a
> multi-file LaTex document)?
> (With wordlist I mean a list with all unique words in the document)
> Thanks for any hints
> Thorsten
Hi,
I would first export everything to plain text
and put it all into one file.
Then, inside Emacs,
you could use something like this:
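(defun wordlist (&optional beg end)
  "Return a list of the unique words between BEG and END.
Without arguments, use the active region, or else the whole buffer."
  (interactive)
  (let ((beg (cond (beg)
                   ((region-active-p) (region-beginning))
                   (t (point-min))))
        (end (cond (end)
                   ((region-active-p) (region-end))
                   (t (point-max))))
        (erg '()))
    (save-excursion
      (goto-char beg)
      ;; Collect each run of word-constituent characters up to END,
      ;; skipping words that are already in the list.
      (while (re-search-forward "\\sw+" end t)
        (let ((word (match-string-no-properties 0)))
          (unless (member word erg)
            (push word erg)))))
    ;; Restore order of first occurrence.
    (setq erg (nreverse erg))
    (when (called-interactively-p 'interactive)
      (message "%s" erg))
    erg))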
HTH,
Andreas
--
https://launchpad.net/python-mode
https://launchpad.net/s-x-emacs-werkstatt/
* Re: How to generate a wordlist for a document
From: Arnaldo Mandel @ 2011-08-16 20:27 UTC (permalink / raw)
To: Andreas Röhler; +Cc: help-gnu-emacs
On Tue, Aug 16, 2011 at 1:45 PM, Andreas Röhler <andreas.roehler@easy-emacs.de> wrote:
> On 15.08.2011 23:20, Thorsten wrote:
>> Hi list,
>> how do I generate a word list for a document in Emacs (in my case a
>> multi-file LaTex document)?
>> (With wordlist I mean a list with all unique words in the document)
>> Thanks for any hints
>> Thorsten
>
> Hi,
> would export first all into plain texts.
> Put all into one file.
> than inside Emacs
> you could use something like that:
>
> (defun wordlist (&optional beg end)
> [...]
This is a bit too simplistic. For instance, it would list words inside
comments, macro parameters, and environment names. Things can get really
complicated.
There is a Perl script called texcount, which is part of many TeX
distributions. It builds in a lot of LaTeX knowledge to decide what is
and what is not a word, and its sheer size shows the difficulty of the
problem. As the name says, the program counts words. However, with option
-v1 it outputs a "cleaned-up" version of the text, tagged with ANSI color
codes. That seems amenable to processing by code similar to what
you propose, with a more complex underlying automaton.
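As a rough sketch of the first step of such post-processing, here is one way to remove ANSI color escape sequences from a stream before splitting it into words. The names strip_ansi and ansi_wordlist are made up for illustration, and the code only assumes the common "ESC [ ... m" form of color code; what texcount -v1 really prints is not shown in this thread, so its output may well need more careful filtering than this.

```shell
# strip_ansi: remove ANSI SGR (color) escape sequences from stdin.
# Hypothetical helper; it only handles sequences of the form ESC [ ... m.
strip_ansi() {
  esc=$(printf '\033')   # the literal ESC character
  sed "s/${esc}\[[0-9;]*m//g"
}

# ansi_wordlist: strip the color codes, then list the unique words.
# Also hypothetical; it reuses the grep/sort idea from earlier in the thread.
ansi_wordlist() {
  strip_ansi | grep -o -E '\w+' | sort -u
}
```

For example, `printf '\033[1;31mword\033[0m count\n' | strip_ansi` prints the line without its color codes. Deciding which of the remaining tokens are real words would still require using the information the colors carried, which this sketch simply throws away.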
Still, it also depends on what Thorsten means by "word" in his
question. For instance, texcount reports
\documentclass{article}
\begin{document}
\textsc{w}o\emph{r}\texttt{d}.
\end{document}
as containing 4 words; it can reasonably be construed as a one-word text.
Arnaldo
* Re: How to generate a wordlist for a document
From: Arnaldo Mandel @ 2011-08-17 13:44 UTC (permalink / raw)
To: Richard Fieldsend; +Cc: help-gnu-emacs@gnu.org
On Tue, Aug 16, 2011 at 10:33 AM, Richard Fieldsend <r.fieldsend@btopenworld.com> wrote:
> Hi Thorsten,
> you haven't mentioned which OS you are running, or whether you want to
> include LaTeX commands. Assuming that you are only interested in the text
> of the document I would recommend the following steps:
I had forgotten detex!
> 1) For each of the files in your multi-file document run 'detex' to remove
> all of the TeX and LaTeX formatting.
> 2) Compile a single file containing the detex'd versions of the files using
> cat:
Actually, not needed. detex follows \input and \include commands.
> cat file1 >> completefile
>
> 3) You can then make the file one word per line, then sort it and make each
> term appear just once by doing the following:
>
> grep -o -E '\w+' *sourcefile* | sort | uniq > output
Actually, detex can give a wordlist, so the pipeline reduces to
detex -w mainfile.tex | sort -u
Within Emacs, I would use it in dired, pressing ! on the main file and typing
detex -w * | sort -u
(dired replaces the * with the name of the file at point).
Of course, if one uses this a lot, one can always wrap it in an Emacs
function or a shell command.
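A minimal sketch of such a shell wrapper might look like the following. The name texwords is made up for illustration, and the function assumes detex is installed and on the PATH; it does nothing beyond the pipeline above plus an argument check.

```shell
# texwords: print the sorted, unique words of a LaTeX document.
# Usage: texwords mainfile.tex
# (texwords is a hypothetical name; detex must be installed.)
texwords() {
  if [ "$#" -ne 1 ]; then
    echo "usage: texwords MAINFILE.tex" >&2
    return 2
  fi
  # detex -w emits one word per line and follows \input and \include,
  # so only the main file needs to be named.
  detex -w "$1" | sort -u
}
```

Put in a shell startup file, it would be called as `texwords mainfile.tex > wordlist`.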
Arnaldo