unofficial mirror of help-gnu-emacs@gnu.org
* How to convert .doc to plain text ascii in emacs.
@ 2004-04-28 17:32 Don Saklad
  2004-04-28 18:13 ` Yoni Rabkin Katzenell
  2004-05-02  8:57 ` Tim X
  0 siblings, 2 replies; 11+ messages in thread
From: Don Saklad @ 2004-04-28 17:32 UTC


What related emacs commands are there that might convert an rmail
attachment from .doc to plain text ascii ?...

It is an rmail message distributed from local government about an
upcoming public hearing.


* Re: How to convert .doc to plain text ascii in emacs.
@ 2004-04-28 18:02 Don Saklad
  2004-04-28 18:10 ` Kin Cho
  0 siblings, 1 reply; 11+ messages in thread
From: Don Saklad @ 2004-04-28 18:02 UTC


What related emacs commands are there that might convert an rmail
attachment from .doc to plain text ascii ?...

It is an rmail message distributed from local government about an
upcoming public hearing. For example, here are some parts of a
specimen message...




This message is in MIME format. Since your mail reader does not understand
this format, some or all of this message may not be legible.

------_=_NextPart_000_01C42C97.793C3810
Content-Type: text/plain;
	charset="iso-8859-1"

      
 Education Hearings
Please note some of the Education hearing will be held within the Ways and
Means Budget Hearings 
 <<HN-CNS Edu  Dockets #0378 May 10, 2004 discuss school dept district wide
student success plan.doc>>  <<HN-CNS Edu  Dockets #0251 May 10, 2004 hearing
regarding MCAS.doc>>  <<HN-CNS Edu  Dockets #0375, 0374, 0376 May 6, 2004
physcial edu prog, nutrition curriculum and staff training, after school
athletic programs.doc>>  <<HN-CNS Edu  Dockets #0266 and  0369 May 4, 2004
mildred cntr unified facilities plan.doc>>  <<HN-CNS Edu  Dockets #0489 May
6, 2004 discuss student civic life.doc>>  <<HN-CNS Arts, Film, Humanities
and Tourism  Docket #0486 May 4, 2004 on the Strand Theater.doc>> 

------_=_NextPart_000_01C42C97.793C3810
Content-Type: application/msword;
	name="HN-CNS Edu  Dockets #0378 May 10, 2004 discuss school dept district wide student success plan.doc"
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
	filename="HN-CNS Edu  Dockets #0378 May 10, 2004 discuss school dept district wide student success plan.doc"

0M8R4KGxGuEAAAAAAAAAAAAAAAAAAAAAPgADAP7/CQAGAAAAAAAAAAAAAAACAAAAiQAAAAAAAAAA
EAAAiwAAAAEAAAD+////AAAAAIcAAACIAAAA////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////




------_=_NextPart_000_01C42C97.793C3810
Content-Type: application/msword;
	name="HN-CNS Edu  Dockets #0251 May 10, 2004 hearing regarding MCAS.doc"
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
	filename="HN-CNS Edu  Dockets #0251 May 10, 2004 hearing regarding MCAS.doc"

0M8R4KGxGuEAAAAAAAAAAAAAAAAAAAAAPgADAP7/CQAGAAAAAAAAAAAAAAACAAAAiQAAAAAAAAAA
EAAAiwAAAAEAAAD+////AAAAAIcAAACIAAAA////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////


------_=_NextPart_000_01C42C97.793C3810--
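
The base64 payload in the specimen can be checked from within Emacs.
The sketch below is an illustration only and is not part of the
original message; the helper name `my-base64-prefix-hex' is made up.
It decodes the first twelve characters of the attachment body and
prints the bytes in hex. The result starts with D0 CF 11 E0 A1 B1 1A
E1, the signature of the OLE2 compound file container used by binary
Word documents, so the attachment is a binary .doc file and needs an
external converter before it can be read as text.

```elisp
(defun my-base64-prefix-hex (b64)
  "Decode base64 string B64 and return its bytes as a hex string.
Hypothetical helper, for illustration only."
  (mapconcat (lambda (byte) (format "%02X" byte))
             (base64-decode-string b64)
             " "))

;; First 12 characters of the attachment body shown above.
(my-base64-prefix-hex "0M8R4KGxGuEA")
;; => "D0 CF 11 E0 A1 B1 1A E1 00"
```

M-x base64-decode-region does the same for a marked region, which is
one way to recover an attachment by hand when the mail reader does not
understand MIME; the decoded bytes then have to be saved without any
coding conversion.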


* Re: How to convert .doc to plain text ascii in emacs.
  2004-04-28 18:02 How to convert .doc to plain text ascii in emacs Don Saklad
@ 2004-04-28 18:10 ` Kin Cho
  2004-04-28 18:17   ` Jay Belanger
  0 siblings, 1 reply; 11+ messages in thread
From: Kin Cho @ 2004-04-28 18:10 UTC


Save the attachment, then google.  The answer depends on your
platform.

Also ask the sender to provide a pdf version of the document.

-kin

Don Saklad <dsaklad@nestle.csail.mit.edu> writes:

> What related emacs commands are there that might convert an rmail
> attachment from .doc to plain text ascii ?...
> 
> It is an rmail message distributed from local government about an
> upcoming public hearing. For example, here are some parts of a
> specimen message...
> 
> 
> 
> 
> This message is in MIME format. Since your mail reader does not understand
> this format, some or all of this message may not be legible.
> 
> ------_=_NextPart_000_01C42C97.793C3810
> Content-Type: text/plain;
> 	charset="iso-8859-1"
> 
>       
>  Education Hearings
> Please note some of the Education hearing will be held within the Ways and
> Means Budget Hearings 
>  <<HN-CNS Edu  Dockets #0378 May 10, 2004 discuss school dept district wide
> student success plan.doc>>  <<HN-CNS Edu  Dockets #0251 May 10, 2004 hearing
> regarding MCAS.doc>>  <<HN-CNS Edu  Dockets #0375, 0374, 0376 May 6, 2004
> physcial edu prog, nutrition curriculum and staff training, after school
> athletic programs.doc>>  <<HN-CNS Edu  Dockets #0266 and  0369 May 4, 2004
> mildred cntr unified facilities plan.doc>>  <<HN-CNS Edu  Dockets #0489 May
> 6, 2004 discuss student civic life.doc>>  <<HN-CNS Arts, Film, Humanities
> and Tourism  Docket #0486 May 4, 2004 on the Strand Theater.doc>> 
> 
> ------_=_NextPart_000_01C42C97.793C3810
> Content-Type: application/msword;
> 	name="HN-CNS Edu  Dockets #0378 May 10, 2004 discuss school dept district wide student success plan.doc"
> Content-Transfer-Encoding: base64
> Content-Disposition: attachment;
> 	filename="HN-CNS Edu  Dockets #0378 May 10, 2004 discuss school dept district wide student success plan.doc"
> 
> 0M8R4KGxGuEAAAAAAAAAAAAAAAAAAAAAPgADAP7/CQAGAAAAAAAAAAAAAAACAAAAiQAAAAAAAAAA
> EAAAiwAAAAEAAAD+////AAAAAIcAAACIAAAA////////////////////////////////////////
> ////////////////////////////////////////////////////////////////////////////
> ////////////////////////////////////////////////////////////////////////////
> 
> 
> 
> 
> ------_=_NextPart_000_01C42C97.793C3810
> Content-Type: application/msword;
> 	name="HN-CNS Edu  Dockets #0251 May 10, 2004 hearing regarding MCAS.doc"
> Content-Transfer-Encoding: base64
> Content-Disposition: attachment;
> 	filename="HN-CNS Edu  Dockets #0251 May 10, 2004 hearing regarding MCAS.doc"
> 
> 0M8R4KGxGuEAAAAAAAAAAAAAAAAAAAAAPgADAP7/CQAGAAAAAAAAAAAAAAACAAAAiQAAAAAAAAAA
> EAAAiwAAAAEAAAD+////AAAAAIcAAACIAAAA////////////////////////////////////////
> ////////////////////////////////////////////////////////////////////////////
> ////////////////////////////////////////////////////////////////////////////
> 
> 
> ------_=_NextPart_000_01C42C97.793C3810--


* Re: How to convert .doc to plain text ascii in emacs.
  2004-04-28 17:32 Don Saklad
@ 2004-04-28 18:13 ` Yoni Rabkin Katzenell
  2004-05-01 19:02   ` Thomas Persson
  2004-05-02  8:57 ` Tim X
  1 sibling, 1 reply; 11+ messages in thread
From: Yoni Rabkin Katzenell @ 2004-04-28 18:13 UTC


Don Saklad <dsaklad@nestle.csail.mit.edu> writes:

> What related emacs commands are there that might convert an rmail
> attachment from .doc to plain text ascii ?...
>
> It is an rmail message distributed from local government about an
> upcoming public hearing.

I heard that Antiword [http://www.winfield.demon.nl/] can convert doc
files into plain text and postscript. Note though, that I have never
used the software myself.

It would also be good to inform the body in question that they are making
a grave mistake and effectively endorsing a commercial product by
forcing people to purchase Microsoft Windows in order to take part in
governmental activities.

More on word attachments at:
[http://www.gnu.org/philosophy/no-word-attachments.html]

-- 
"Cut your own wood and it will warm you twice"
	Regards, Yoni Rabkin Katzenell


* Re: How to convert .doc to plain text ascii in emacs.
  2004-04-28 18:10 ` Kin Cho
@ 2004-04-28 18:17   ` Jay Belanger
  2004-04-29 13:47     ` John Russell
  0 siblings, 1 reply; 11+ messages in thread
From: Jay Belanger @ 2004-04-28 18:17 UTC



Kin Cho <ignore-this-prefixkin@techie.com> writes:

> Save the attachment, then google.  The answer depends on your
> platform.

There's undoc.el (http://www.ccs.neu.edu/home/guttman/undoc.el),
which is written in elisp.

> Also ask the sender to provide a pdf version of the document.

The better solution, of course.

Jay


* Re: How to convert .doc to plain text ascii in emacs.
  2004-04-28 18:17   ` Jay Belanger
@ 2004-04-29 13:47     ` John Russell
  0 siblings, 0 replies; 11+ messages in thread
From: John Russell @ 2004-04-29 13:47 UTC


Jay Belanger <belanger@truman.edu> writes:

I find the strings command helps if you just need to know what is
in a doc file, but of course...
> 
> > Also ask the sender to provide a pdf version of the document.
> 
> The better solution, of course.
> 
> Jay
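
A rough Emacs Lisp approximation of what strings does, as a sketch
only: it reads the file as raw bytes and collects runs of printable
ASCII. The function name `my-doc-strings' and the default minimum
length of 4 are assumptions made for this example. Word documents may
store their text as 16-bit characters, in which case neither this
sketch nor a plain 8-bit scan will find it.

```elisp
(defun my-doc-strings (file &optional min-len)
  "Return a list of printable ASCII runs found in FILE.
Only runs of at least MIN-LEN characters (default 4) are kept.
Hypothetical helper, for illustration only."
  (let ((min-len (or min-len 4))
        (runs '()))
    (with-temp-buffer
      ;; Work on raw bytes so nothing is decoded or converted.
      (set-buffer-multibyte nil)
      (insert-file-contents-literally file)
      (goto-char (point-min))
      ;; [ -~] covers the printable ASCII range, space through tilde.
      (while (re-search-forward (format "[ -~]\\{%d,\\}" min-len) nil t)
        (push (match-string 0) runs)))
    (nreverse runs)))
```

Calling (my-doc-strings "hearing.doc") with a real file name returns
the runs in file order, so the document text usually appears as a
block somewhere among the font names and other metadata.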


* Re: How to convert .doc to plain text ascii in emacs.
  2004-04-28 18:13 ` Yoni Rabkin Katzenell
@ 2004-05-01 19:02   ` Thomas Persson
  2004-05-02 14:44     ` gebser
  0 siblings, 1 reply; 11+ messages in thread
From: Thomas Persson @ 2004-05-01 19:02 UTC


Yoni Rabkin Katzenell <yoni-r@actcom.com> writes:

> Don Saklad <dsaklad@nestle.csail.mit.edu> writes:
>
>> What related emacs commands are there that might convert an rmail
>> attachment from .doc to plain text ascii ?...
>>
>> It is an rmail message distributed from local government about an
>> upcoming public hearing.
>
> I heard that Antiword [http://www.winfield.demon.nl/] can convert doc
> files into plain text and postscript. Note though, that I have never
> used the software myself.

I use antiword and the following code to integrate it with emacs:

(defun antiword-buffer ()
  "Takes the current buffer as input to the external program antiword.

If the current buffer is an MS Word document, its contents are replaced
with the output from antiword and the extension `.doc' is replaced
with `.txt' in the buffer-file-name."
  (let ((txt-buffer-file-name (concat (substring (buffer-file-name) 0 -4)
				      ".txt")))
    (shell-command-on-region (point-min) (point-max)
			     "cat | antiword -" nil t nil)
    (undo-start)
    (if (equal (buffer-string) "- is not a Word Document.\n")
	(or (undo-more 1)
	    (message "%s - is not a Word Document." (current-buffer)))
      (set-visited-file-name txt-buffer-file-name)
      (not-modified))))

;; The following expression makes sure that antiword-buffer is run when a
;; file with the .doc extension is opened.
(setq auto-mode-alist
      (append '(("\\.doc\\'" . antiword-buffer))
	      auto-mode-alist))


* Re: How to convert .doc to plain text ascii in emacs.
  2004-04-28 17:32 Don Saklad
  2004-04-28 18:13 ` Yoni Rabkin Katzenell
@ 2004-05-02  8:57 ` Tim X
  1 sibling, 0 replies; 11+ messages in thread
From: Tim X @ 2004-05-02  8:57 UTC


>>>>> "Don" == Don Saklad <dsaklad@nestle.csail.mit.edu> writes:

 Don> What related emacs commands are there that might convert an
 Don> rmail attachment from .doc to plain text ascii ?...

 Don> It is an rmail message distributed from local government about
 Don> an upcoming public hearing.

There are two solutions I've used for this. The first is a set of
utilities called wvWare - I think they are related to abiword. At any
rate, if you're using Debian, just install wv.

The second product I've used is one called catdoc. It's not quite as
powerful, but works reasonably well. 

If you're using VM as your mail reader, it's trivial to configure it to
run either the wv utility or catdoc on the attachment and have it
display in the buffer as text. With wv, I think you also have the
option to have it rendered as HTML as well.

As a last resort, you could use "strings" on the document to get the
content, but you will probably have a fair amount of crap mixed in
with it.

Note that the only time I've found the wv utilities have failed is
when I've received attachments with the msword MIME type, but which are
actually M$ bloody RTF format. I haven't worked out a reliable way to
translate M$ RTF (which is not the rich text format we all knew a
decade ago!).

I would also contact the authority that sends Word documents and request
they use a less proprietary format - even PDF is better!

tim


-- 
Tim Cross
The e-mail address on this message is FALSE (obviously!). My real e-mail is
to a company in Australia called rapttech and my login is tcross - if you 
really need to send mail, you should be able to work it out!


* Re: How to convert .doc to plain text ascii in emacs.
  2004-05-01 19:02   ` Thomas Persson
@ 2004-05-02 14:44     ` gebser
  2004-05-02 19:04       ` Roodwriter
  2004-05-02 19:26       ` Thomas Persson
  0 siblings, 2 replies; 11+ messages in thread
From: gebser @ 2004-05-02 14:44 UTC



Thanks very much.  Your elisp works great.  There's one glitch (which I
realize is from antiword):

The three characters "\342\200\231" should be replaced by the single 
apostrophe character (').  To do this by hand, I did

M-x replace-regexp Return C-q 342 Return C-q 200 Return C-q 231 Return
Return ' Return

but this does not find the intended string.  The problem seems to be 
that C-q 342 is immediately (in the minibuffer) converted into an 'a' 
with a grave symbol over it.  Putting the point on the backslash (\) 
preceding the 342 in the antiword-converted buffer and doing "C-u C-x =" 
indeed shows this a-with-grave character to be (0342, 226, 0xe2).

To create a simple test case, do the following:

Open an empty *scratch* buffer.  Enter into it: C-q 342 Return C-q 200
Return C-q 231 Return.  The first character that appears is the 
a-with-grave; the second and third characters appear properly as 
\200\231.  

It is, I think, the failure of C-q 342 to be represented as \342 which 
is the problem.  What is the solution?


tia,
ken



[....]


* Re: How to convert .doc to plain text ascii in emacs.
  2004-05-02 14:44     ` gebser
@ 2004-05-02 19:04       ` Roodwriter
  2004-05-02 19:26       ` Thomas Persson
  1 sibling, 0 replies; 11+ messages in thread
From: Roodwriter @ 2004-05-02 19:04 UTC


gebser@speakeasy.net wrote:

> 
> Thanks very much.  Your elisp works great.  There's one glitch (which I
> realize is from antiword):
> 
> The three characters "\342\200\231" should be replaced by the single
> apostrophe character (').  To do this by hand, I did
> 
> M-x replace-regexp Return C-q 342 Return C-q 200 Return C-q 231 Return
> Return ' Return
> 
> but this does not find the intended string.  The problem seems to be
> that C-q 342 is immediately (in the minibuffer) converted into an 'a'
> with a grave symbol over it.  Putting the point on the backslash (\)
> preceding the 342 in the antiword-converted buffer and doing "C-u C-x ="
> indeed shows this a-with-grave character to be (0342, 226, 0xe2).
> 
> To create a simple test case, do the following:
> 
> Open an empty *scratch* buffer.  Enter into it: C-q 342 Return C-q 200
> Return C-q 231 Return.  The first character that appears is the
> a-with-grave; the second and third characters appear properly as
> \200\231.
> 
> It is, I think, the failure of C-q 342 to be represented as \342 which
> is the problem.  What is the solution?
> 
> 
> tia,
> ken
> 


Have you tried just copying and pasting the character into the minibuffer 
when doing the replace-regexp?

--Rod

__________

Author of "Linux for Non-Geeks--Clear-eyed Answers for Practical Consumers" 
and "Boring Stories from Uncle Rod." Both are available at 
http://www.rodwriterpublishing.com/index.html

To reply by e-mail, take the extra "o" out of the name.


* Re: How to convert .doc to plain text ascii in emacs.
  2004-05-02 14:44     ` gebser
  2004-05-02 19:04       ` Roodwriter
@ 2004-05-02 19:26       ` Thomas Persson
  1 sibling, 0 replies; 11+ messages in thread
From: Thomas Persson @ 2004-05-02 19:26 UTC


gebser@speakeasy.net writes:

> Thanks very much.  Your elisp works great.  There's one glitch (which I
> realize is from antiword):
>
> The three characters "\342\200\231" should be replaced by the single 
> apostrophe character (').

The fact that antiword and my code leave you with a buffer containing
numerical codes instead of the characters themselves is your first
problem. This doesn't happen for me at all. It's either a problem with
antiword or a problem with how emacs displays characters. Try running
antiword from the command line to figure out which.

> To do this by hand, I did M-x replace-regexp Return C-q 342 Return
> C-q 200 Return C-q 231 Return Return ' Return
>
> but this does not find the intended string.  The problem seems to be 
> that C-q 342 is immediately (in the minibuffer) converted into an 'a' 
> with a grave symbol over it.  Putting the point on the backslash (\) 
> preceding the 342 in the antiword-converted buffer and doing "C-u C-x =" 
> indeed shows this a-with-grave character to be (0342, 226, 0xe2).
>
> To create a simple test case, do the following:
>
> Open an empty *scratch* buffer.  Enter into it: C-q 342 Return C-q 200
> Return C-q 231 Return.  The first character that appears is the 
> a-with-grave; the second and third characters appear properly as 
> \200\231.  
>
> It is, I think, the failure of C-q 342 to be represented as \342 which 
> is the problem.  What is the solution?

The fact that you have a problem with replacing the numerical
character codes with the characters themselves is, however, definitely
an Emacs-related problem. As far as I can tell it would work to add the
replace-regexp business to the end of the antiword-buffer function
like this:


(defun antiword-buffer ()
  "Takes the current buffer as input to the external program antiword.

If the current buffer is an MS Word document, its contents are replaced
with the output from antiword and the extension `.doc' is replaced
with `.txt' in the buffer-file-name."
  (let ((txt-buffer-file-name (concat (substring (buffer-file-name) 0 -4)
				      ".txt")))
    (shell-command-on-region (point-min) (point-max)
			     "cat | antiword -" nil t nil)
    (undo-start)
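    (if (equal (buffer-string) "- is not a Word Document.\n")
	(or (undo-more 1)
	    (message "%s - is not a Word Document." (current-buffer)))
      ;; `replace-regexp' is meant for interactive use; in Lisp code,
      ;; search from the start of the buffer and replace each match.
      ;; Do this before the buffer is marked as unmodified.
      (goto-char (point-min))
      (while (search-forward "\342\200\231" nil t)
	(replace-match "'" t t))
      (set-visited-file-name txt-buffer-file-name)
      (not-modified))))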

;; The following expression makes sure that antiword-buffer is run when a
;; file with the .doc extension is opened.
(setq auto-mode-alist
      (append '(("\\.doc\\'" . antiword-buffer))
	      auto-mode-alist))


If that doesn't work then perhaps "wvWare" or "undoc.el", as previous
posters have suggested, might be better solutions for you.
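
One note on the byte sequence discussed above: \342\200\231 (hex E2 80
99) is the UTF-8 encoding of U+2019, RIGHT SINGLE QUOTATION MARK, the
curly apostrophe that Word inserts. Seeing three escaped bytes
therefore suggests that antiword wrote UTF-8 and Emacs read it without
decoding. Octal 342 read on its own as Latin-1 is the single character
â, which is what C-q 342 inserts. The sketch below is an illustration
under stated assumptions, not the code from this thread: the names
`my-decode-utf-8-bytes' and `my-antiword-to-string' are made up, and
the second function assumes antiword is installed and is emitting
UTF-8, which depends on its locale or mapping settings.

```elisp
(defun my-decode-utf-8-bytes (bytes)
  "Decode the unibyte string BYTES as UTF-8 and return the text.
Hypothetical helper, for illustration only."
  (decode-coding-string bytes 'utf-8))

;; The three bytes from the thread decode to one character, U+2019.
(format "U+%04X" (string-to-char (my-decode-utf-8-bytes "\342\200\231")))
;; => "U+2019"

(defun my-antiword-to-string (file)
  "Return the text antiword extracts from Word document FILE.
Hypothetical helper: assumes antiword is installed and writes UTF-8."
  (with-temp-buffer
    ;; Decode the process output as UTF-8 instead of keeping raw bytes.
    (let ((coding-system-for-read 'utf-8))
      (unless (eq 0 (call-process "antiword" nil t nil
                                  (expand-file-name file)))
        (error "antiword failed on %s" file)))
    (buffer-string)))
```

With the output decoded this way the apostrophe shows up as a single
character, and replacing it with a plain ' becomes an ordinary search
and replace.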

