From: Arnaldo Mandel <amandel@gmail.com>
To: Richard Fieldsend <r.fieldsend@btopenworld.com>
Cc: "help-gnu-emacs@gnu.org" <help-gnu-emacs@gnu.org>
Subject: Re: How to generate a wordlist for a document
Date: Wed, 17 Aug 2011 10:44:13 -0300
Message-ID: <CABHLsmcYEAtVB5wGVHZodnimTN=FASdzaX=fxUwJv74zP_iCMQ@mail.gmail.com>
In-Reply-To: <1313501636.15362.YahooMailNeo@web86004.mail.ird.yahoo.com>

On Tue, Aug 16, 2011 at 10:33 AM, Richard Fieldsend <r.fieldsend@btopenworld.com> wrote:

> Hi Thorsten,
> you haven't mentioned which OS you are running, or whether you want to
> include LaTeX commands.  Assuming that you are only interested in the text
> of the document I would recommend the following steps:
>
I had forgotten detex!


> 1) For each of the files in your multi-file document run 'detex' to remove
> all of the TeX and LaTeX formatting.
> 2) Compile a single file containing the detex'd versions of the files using
> cat:
>
Actually, this step is not needed: detex follows \input and \include commands.


> cat file1 >> completefile
>
> 3) You can then make the file one word per line, then sort it and make each
> term appear just once by doing the following:
>
> grep -o -E '\w+' *sourcefile* | sort | uniq > output
>
Actually, detex can produce a wordlist itself, so the pipeline reduces to

detex -w mainfile.tex | sort -u

Within Emacs, I would run it from Dired: put point on the main file, press !,
and type the following (Dired replaces the * with the file name):

detex -w * | sort -u

Of course, if one uses this a lot, one can always wrap it in an Emacs
function or a shell command.
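
As a rough sketch of the shell side, something like the following could work.
The name wordlist is made up for illustration, and it assumes detex is
installed and on the PATH:

```shell
# Hypothetical helper (the name "wordlist" is an assumption, not a standard tool).
# Prints the unique words of one or more TeX/LaTeX files, one per line, sorted.
wordlist() {
    # detex -w emits one word per line; sort -u sorts and drops duplicates.
    detex -w "$@" | sort -u
}
```

With that in a shell startup file, "wordlist mainfile.tex > words.txt" would
write the list to words.txt.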

Arnaldo

> If you need word frequency information then you can make uniq prepend the
> number of occurrences.
>
> For the record, this doesn't lowercase anything, so the same word is likely
> to appear more than once with different capitalization.
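
A minimal sketch of the frequency count with case folded first. The function
name word_freq is an assumption for illustration; it reads one word per line
on standard input, which is the format that detex -w or grep -o produces:

```shell
# Hypothetical helper (the name "word_freq" is made up for this sketch).
# Reads one word per line on stdin and prints "count word", most frequent first.
word_freq() {
    tr '[:upper:]' '[:lower:]' |  # fold case so "The" and "the" count together
        sort |                    # uniq only merges adjacent duplicate lines
        uniq -c |                 # prepend the number of occurrences
        sort -rn                  # highest count first
}
```

For example: detex -w mainfile.tex | word_freq > frequencies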
>
> HTH
>
> Richard
>
> ----- Original Message -----
> From: Thorsten <quintfall@googlemail.com>
> To: help-gnu-emacs@gnu.org
> Cc:
> Sent: Monday, 15 August 2011, 22:20
> Subject: How to generate a wordlist for a document
>
> Hi list,
> how do I generate a word list for a document in Emacs (in my case a
> multi-file LaTeX document)?
> (By wordlist I mean a list of all unique words in the document.)
> Thanks for any hints
> Thorsten
>
>
