From: Richard Fieldsend
Newsgroups: gmane.emacs.help
Subject: Re: How to generate a wordlist for a document
Date: Tue, 16 Aug 2011 14:33:56 +0100 (BST)
Message-ID: <1313501636.15362.YahooMailNeo@web86004.mail.ird.yahoo.com>
References: <86sjp2e5i3.fsf@googlemail.com>
To: "help-gnu-emacs@gnu.org"

Hi Thorsten,
you haven't mentioned which OS you are running, or whether you want to include LaTeX commands.  Assuming that you are only interested in the text of the document, I would recommend the following steps:

1) For each of the files in your multi-file document, run 'detex' to remove all of the TeX and LaTeX formatting.
2) Compile a single file containing the detex'd versions of the files using cat, once per file:

cat file1 >> completefile

3) You can then make the file one word per line, then sort it and make each term appear just once by doing the following (*sourcefile* stands for the file built in step 2):

grep -o -E '\w+' *sourcefile* | sort | uniq > output

If you need word frequency information then you can make uniq prepend the number of occurrences (uniq -c).

For the record, this doesn't lowercase anything, so the same word is likely to appear more than once if it occurs with different capitalization.

HTH

Richard

----- Original Message -----
From: Thorsten
To: help-gnu-emacs@gnu.org
Cc:
Sent: Monday, 15 August 2011, 22:20
Subject: How to generate a wordlist for a document

Hi list,
how do I generate a word list for a document in Emacs (in my case a
multi-file LaTeX document)?
(By wordlist I mean a list of all unique words in the document)
Thanks for any hints
Thorsten
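
For reference, the grep/sort/uniq pipeline from the reply can be sketched as a small script that also folds case and counts occurrences. This is only an illustration under stated assumptions: the sample text, the temporary directory and the file names wordlist and wordfreq are made up here, the lowercasing step with tr is an addition that the reply does not include, and the input is assumed to be plain text that has already been through detex and cat.

```shell
#!/bin/sh
# Sketch only: the sample text and the output file names are hypothetical.
# In real use, completefile would be the detex'd, concatenated document.

workdir=$(mktemp -d)
printf 'The cat saw the dog.\nThe dog ran.\n' > "$workdir/completefile"

# One word per line, folded to lowercase (an addition to the original
# pipeline), then sorted and reduced to unique entries.
grep -o -E '\w+' "$workdir/completefile" \
  | tr '[:upper:]' '[:lower:]' \
  | sort | uniq > "$workdir/wordlist"

# Same pipeline, but uniq -c prepends the number of occurrences and the
# final sort -rn puts the most frequent words first.
grep -o -E '\w+' "$workdir/completefile" \
  | tr '[:upper:]' '[:lower:]' \
  | sort | uniq -c | sort -rn > "$workdir/wordfreq"

cat "$workdir/wordlist"
```

Lowercasing before sort means "The" and "the" collapse into one entry, which addresses the caveat about capitalization at the cost of also lowercasing proper nouns.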