* How to generate a wordlist for a document
From: Thorsten @ 2011-08-15 21:20 UTC
To: help-gnu-emacs

Hi list,

how do I generate a word list for a document in Emacs (in my case a multi-file LaTeX document)? By "word list" I mean a list of all unique words in the document.

Thanks for any hints,
Thorsten

* Re: How to generate a wordlist for a document
From: Richard Fieldsend @ 2011-08-16 13:33 UTC
To: help-gnu-emacs@gnu.org

Hi Thorsten,

you haven't mentioned which OS you are running, or whether you want to include LaTeX commands. Assuming that you are only interested in the text of the document, I would recommend the following steps:

1) For each of the files in your multi-file document, run 'detex' to remove all of the TeX and LaTeX formatting.

2) Compile a single file containing the detex'd versions of the files using cat:

    cat file1 >> completefile

3) Make the file one word per line, then sort it and make each term appear just once:

    grep -o -E '\w+' sourcefile | sort | uniq > output

   where sourcefile is the combined file from step 2.

If you need word frequency information, you can make uniq prepend the number of occurrences (uniq -c).

For the record, this doesn't lowercase anything, so a word that appears with different capitalization (e.g. "The" and "the") will be listed more than once.

HTH
Richard
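
Richard's two closing remarks (frequency counts and the lack of lowercasing) can be combined into one pipeline. The following is only a minimal sketch, assuming the input is plain text such as detex output and that grep supports -o and \w as in the command above; the function name word_freq is made up for illustration.

```shell
# word_freq: hypothetical helper name, not part of detex or Emacs.
# Reads plain text on stdin and prints "count word" lines,
# most frequent first, after folding everything to lowercase.
word_freq() {
    grep -o -E '\w+' |                  # one word per line
        tr '[:upper:]' '[:lower:]' |    # fold case so "The" and "the" merge
        sort |                          # uniq only collapses adjacent lines
        uniq -c |                       # prepend the number of occurrences
        sort -k1,1nr -k2                # highest count first, then by word
}
```

It could be used as `word_freq < completefile > output`. Folding case before sorting is a design choice: it also merges proper nouns with ordinary words, which may or may not be wanted.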

* Re: How to generate a wordlist for a document
From: Arnaldo Mandel @ 2011-08-17 13:44 UTC
To: Richard Fieldsend; +Cc: help-gnu-emacs@gnu.org

On Tue, Aug 16, 2011 at 10:33 AM, Richard Fieldsend <r.fieldsend@btopenworld.com> wrote:

> Hi Thorsten,
> you haven't mentioned which OS you are running, or whether you want to
> include LaTeX commands. Assuming that you are only interested in the text
> of the document I would recommend the following steps:

I had forgotten detex!

> 1) For each of the files in your multi-file document run 'detex' to remove
> all of the TeX and LaTeX formatting.
> 2) Compile a single file containing the detex'd versions of the files using
> cat:

Actually, this step is not needed: detex follows \input and \include commands.

> cat file1 >> completefile
>
> 3) You can then make the file one word per line, then sort it and make each
> term appear just once by doing the following:
>
> grep -o -E '\w+' *sourcefile* | sort | uniq > output

Actually, detex can produce a word list itself, so the pipeline reduces to

    detex -w mainfile.tex | sort -u

Within Emacs, I would use it in Dired, pressing ! on the main file and typing

    detex -w * | sort -u

Of course, if one uses this a lot, one can always wrap it in an Emacs function or a shell command.

Arnaldo

> If you need word frequency information then you can make uniq prepend the
> number of occurrences.
>
> For the record, this doesn't lowercase anything so multiple occurrences of
> the same word are likely.
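
The "wrap it in a shell command" suggestion from the message above might look like the following. It is a sketch rather than a tested tool: it assumes detex is installed and on PATH, and the function name texwords is invented here.

```shell
# texwords: hypothetical wrapper name; the pipeline itself is the one
# quoted in the message above. Assumes detex is available on PATH.
# Usage: texwords mainfile.tex > wordlist.txt
texwords() {
    if [ "$#" -eq 0 ]; then
        echo "usage: texwords FILE.tex..." >&2
        return 2
    fi
    # -w makes detex print one word per line; sort -u drops duplicates.
    detex -w "$@" | sort -u
}
```

Placed in a shell startup file, it can also be run from Emacs with M-! if that shell reads the file; whether it does depends on the local setup.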

* Re: How to generate a wordlist for a document
From: Andreas Röhler @ 2011-08-16 16:45 UTC
To: help-gnu-emacs

On 15.08.2011 23:20, Thorsten wrote:
> Hi list,
> how do I generate a word list for a document in Emacs (in my case a
> multi-file LaTex document)?
> (With wordlist I mean a list with all unique words in the document)
> Thanks for any hints
> Thorsten

Hi,

I would first export everything to plain text and put it all into one file.

Then, inside Emacs, you could use something like this:

(defun wordlist (&optional beg end)
  "Return a list of the unique words between BEG and END.
Default to the active region, or else to the whole buffer."
  (interactive)
  (let ((beg (cond (beg)
                   ((region-active-p) (region-beginning))
                   (t (point-min))))
        (end (cond (end (copy-marker end))
                   ((region-active-p) (copy-marker (region-end)))
                   (t (point-max))))
        (erg '()))
    (save-excursion
      (goto-char beg)
      ;; Collect each run of word-syntax characters, skipping repeats.
      (while (re-search-forward "\\sw+" end t)
        (let ((word (match-string-no-properties 0)))
          (unless (member word erg)
            (push word erg)))))
    ;; Restore the order of first appearance.
    (setq erg (nreverse erg))
    (when (called-interactively-p 'any)
      (message "%s" erg))
    erg))

HTH,
Andreas

--
https://launchpad.net/python-mode
https://launchpad.net/s-x-emacs-werkstatt/

* Re: How to generate a wordlist for a document
From: Arnaldo Mandel @ 2011-08-16 20:27 UTC
To: Andreas Röhler; +Cc: help-gnu-emacs

On Tue, Aug 16, 2011 at 1:45 PM, Andreas Röhler <andreas.roehler@easy-emacs.de> wrote:

> I would first export everything to plain text and put it all into one file.
>
> Then, inside Emacs, you could use something like this:
>
> (defun wordlist (&optional beg end)
> [...]

This is a bit too simplistic. For instance, it would list words inside comments, macro parameters, and environment names. Things can get really complicated.

There is a Perl script called texcount, which is part of many TeX distributions. It embodies a lot of LaTeX knowledge in deciding what is and what is not a word, and its sheer size shows the difficulty of the problem. As the name says, the program counts words. However, with option -v1 it outputs a "cleaned-up" version of the text, tagged with ANSI color codes. That seems amenable to processing by code similar to what you propose, with a more complex underlying automaton.

Still, it also depends on what Thorsten means by "word" in his question. For instance, texcount reports

    \documentclass{article}
    \begin{document}
    \textsc{w}o\emph{r}\texttt{d}.
    \end{document}

as containing 4 words, although it can reasonably be construed as a one-word text.

Arnaldo
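
As a first step towards processing such color-tagged output, the ANSI codes themselves can be removed with sed. This is a rough sketch: it assumes the codes are ordinary SGR sequences (ESC [ ... m), the helper name strip_ansi is invented, and the exact layout of texcount's -v1 output is not examined here.

```shell
# strip_ansi: hypothetical helper that deletes ANSI SGR color sequences
# (ESC [ digits-and-semicolons m) from stdin and passes the rest through.
strip_ansi() {
    esc=$(printf '\033')
    sed "s/${esc}\[[0-9;]*m//g"
}

# Assumed usage; texcount also prints summary lines, which would need
# to be filtered out separately before the result is a clean word list:
#   texcount -v1 mainfile.tex | strip_ansi | grep -o -E '\w+' | sort -u
```

Stripping the codes discards whatever distinction the colors carried, so a more careful approach would interpret them instead, which is closer to the "more complex automaton" mentioned above.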