all messages for Emacs-related lists mirrored at yhetil.org
* When is a text file not a text file?
@ 2004-01-02 17:24 sebyte
  2004-01-09  9:43 ` Oliver Scholz
  0 siblings, 1 reply; 4+ messages in thread
From: sebyte @ 2004-01-02 17:24 UTC (permalink / raw)


Hi all,

I often use the command 'html2text <htmlfile>' in a *shell* buffer to view HTML
files, and nine times out of ten the file is displayed beautifully (with all
the HTML tags removed, etc.).  However, the command 'html2text -o <newfilename>
<htmlfile>' writes a file to disk which, when opened in an Emacs buffer (or any
text editor, for that matter), displays more formatting tags than actual text!
In fact, it appears as if only some of the text is to be found buried amongst
all the tags!  Yet returning to the *shell* buffer and issuing the command 'cat
<newfilename>' displays the file as it is meant to be seen once more.

No doubt there is a simple explanation for this, but damned if I know where to
even start!  I don't believe it is an html2text issue, as I have observed similar
behaviour before with HTML files I've downloaded.

TIA for any explanations of what's actually going on here.

sebyte

P.S. html2text is available through Fink (for potentially interested OS X users
out there).


* Re: When is a text file not a text file?
  2004-01-02 17:24 When is a text file not a text file? sebyte
@ 2004-01-09  9:43 ` Oliver Scholz
  2004-01-09 18:17   ` sebyte
  0 siblings, 1 reply; 4+ messages in thread
From: Oliver Scholz @ 2004-01-09  9:43 UTC (permalink / raw)


sebyte <sdt133@netscape.net> writes:
[...]
> However, the command 'html2text -o <newfilename> <htmlfile>', writes
> a file to disk which when opened in an Emacs buffer, (or any text
> editor for that matter), displays more formatting tags than actual
> text!
[...]
> Yet returning to the *shell* buffer and issuing the command 'cat
> <newfilename>' displays the file as it is meant to be seen once more.
[...]

What "tags" are these? I don't know the actual program you are
using. But I seem to recall that I once had a program "html2text" or
"htmltotxt" or whatever that produced a text file as output
*containing ANSI escape sequences for colours*.

Could that be the case? It would explain why dumping such a file on
the tty---unlike visiting it with a text editor---would display it
correctly.
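
One way to test this hypothesis is to look at the raw bytes of the file rather than at how a terminal or an editor renders them. The following is only a minimal Python sketch: the function name count_control_bytes and the sample byte strings are made up for illustration. It counts a few control characters, so that ESC (which introduces ANSI escape sequences) can be told apart from other candidates such as backspace.

```python
from collections import Counter

# Control bytes worth reporting; the labels are for display only.
CONTROL_NAMES = {0x08: "BS (^H)", 0x0D: "CR (^M)", 0x1B: "ESC (^[)"}

def count_control_bytes(data: bytes) -> dict:
    """Count occurrences of the selected control bytes in raw file content."""
    counts = Counter(b for b in data if b in CONTROL_NAMES)
    return {CONTROL_NAMES[b]: n for b, n in counts.items()}

# Hypothetical samples: one with ANSI escapes, one with backspaces.
print(count_control_bytes(b"\x1b[1mCopyright\x1b[0m notice"))
print(count_control_bytes(b"C\x08Co\x08o"))
```

For a real file, the bytes could be read with open(path, "rb").read() and passed to the same function.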

If this *is* the case, then you probably could do something with
ansi-color.el, though I don't know offhand how exactly.

    Oliver
-- 
20 Nivôse an 212 de la Révolution
Liberté, Egalité, Fraternité!


* Re: When is a text file not a text file?
  2004-01-09  9:43 ` Oliver Scholz
@ 2004-01-09 18:17   ` sebyte
  2004-01-09 18:48     ` Oliver Scholz
  0 siblings, 1 reply; 4+ messages in thread
From: sebyte @ 2004-01-09 18:17 UTC (permalink / raw)



> What "tags" are these? I don't know the actual program you are
> using. But I seem to recall that I once had a program "html2text" or
> "htmltotxt" or whatever that produced a text file as output
> *containing ANSI escape sequences for colours*.
> 
> Could that be the case? It would explain why dumping such a file on
> the tty---unlike visiting it with a text editor---would display it
> correctly.
> 
> If this *is* the case, then you probably could do something with
> ansi-color.el, though I don't know offhand how exactly.
> 
>     Oliver

Hi Oliver,

Thanks for your time.  Here's an example of html2text's output, displayed in an 
Emacs buffer:


  C^HCo^Hop^Hpy^Hyr^Hri^Hig^Hgh^Hht^Ht n^Hno^Hot^Hti^Hic^Hce^He:^H: All reader-contributed material on freshmeat.net is the
  property and responsibility of its author; for reprint rights, please contact
  the author directly.
  -----------------------------------------------------------------------------
  Let me repeat that: OS X is not Unix.
  Consider the following: all of Apple.com's _^Hm_^Ha_^Hr_^Hk_^He_^Ht_^Hi_^Hn_^Hg_^H _^Hp_^Ha_^Hg_^He_^Hs on the subject of
  their darling new operating system are extremely careful to note that OS X is
  "_^HU_^HN_^HI_^HX_^H-_^Hb_^Ha_^Hs_^He_^Hd".


Here is how it looks on a tty or in an Emacs *shell* buffer:


  Copyright notice: All reader-contributed material on freshmeat.net is the
  property and responsibility of its author; for reprint rights, please contact
  the author directly.
  -----------------------------------------------------------------------------
  Let me repeat that: OS X is not Unix.
  Consider the following: all of Apple.com's marketing pages on the subject of
  their darling new operating system are extremely careful to note that OS X is
  "UNIX-based".


I had thought that they might be remnants of HTML tags (I must admit I didn't
look very closely), but I have since found out that they are actually ANSI 'backspace
control sequences', used to preserve things like underlining and boldface.  The
html2text option '-nobs' gets rid of them.
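
The sample above follows the backspace-overstrike convention familiar from nroff and man-page output: a character, a backspace (^H), and the same character again stands for bold, while an underscore, a backspace, and a character stands for underline. Unlike ANSI escape sequences, these contain no ESC character. A tty acts on the backspace, whereas an editor shows it literally. As a rough illustration of what removing such sequences amounts to (this is not html2text's code, and strip_overstrike is a hypothetical name), a Python sketch:

```python
import re

# Any character followed by a backspace gets overwritten by what comes next.
_OVERSTRIKE = re.compile(r".\x08")

def strip_overstrike(text: str) -> str:
    """Drop every 'character + backspace' pair, keeping the overwriting character."""
    return _OVERSTRIKE.sub("", text)

# "\x08" is the backspace that Emacs displays as ^H.
print(strip_overstrike("C\x08Co\x08op\x08py\x08y"))
print(strip_overstrike("_\x08U_\x08N_\x08I_\x08X"))
```

On the command line, the standard col filter with its -b option performs a similar clean-up.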

(After days spent looking for information, I discovered that html2text comes 
with a manpage and all was revealed.  DOH!)


sebyte


* Re: When is a text file not a text file?
  2004-01-09 18:17   ` sebyte
@ 2004-01-09 18:48     ` Oliver Scholz
  0 siblings, 0 replies; 4+ messages in thread
From: Oliver Scholz @ 2004-01-09 18:48 UTC (permalink / raw)


sebyte <sdt133@netscape.net> writes:

>> What "tags" are these? I don't know the actual program you are
>> using. But I seem to recall that I once had a program "html2text" or
>> "htmltotxt" or whatever that produced a text file as output
>> *containing ANSI escape sequences for colours*.

[...]
> Thanks for your time.  Here's an example of html2text's output,
> displayed in an Emacs buffer:
>
>
>   C^HCo^Hop^Hpy^Hyr^Hri^Hig^Hgh^Hht^Ht n^Hno^Hot^Hti^Hic^Hce^He:^H:
>   All reader-contributed material on freshmeat.net is the
[...]
> I had thought that they might be remnants of HTML tags, (I must admit
> I didn't look very closely), but I have found out since they are
> actually ANSI 'backspace control sequences', used to preserve things
> like underlining and boldface.  The html2text option '-nobs' gets rid
> of them.
[...]

Untested: You could also try to call

(ansi-color-apply-on-region (point-min) (point-max))

on a buffer containing such ANSI control sequences (after a `(require
'ansi-color)' that is).
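
As far as I know, ansi-color.el only interprets SGR escape sequences of the form ESC [ ... m, so it would not act on backspace overstrikes directly; they would first have to be translated. A hedged Python sketch of such a translation follows. overstrike_to_sgr is a hypothetical helper, and treating X^HX as bold and _^HX as underline is an assumption based on the sample quoted earlier in the thread.

```python
import re

BOLD, UNDERLINE, RESET = "\x1b[1m", "\x1b[4m", "\x1b[0m"

# One overstruck cell: a character, a backspace, then the character drawn on top.
_CELL = re.compile(r"(.)\x08(.)")

def overstrike_to_sgr(text: str) -> str:
    """Translate X^HX (assumed bold) and _^HX (assumed underline) into SGR codes."""
    out = []
    active = None  # SGR code currently switched on, or None
    pos = 0
    while pos < len(text):
        cell = _CELL.match(text, pos)
        if cell:
            under, over = cell.groups()
            if under == over:
                wanted = BOLD  # note: "_^H_" is ambiguous and lands here
            elif under == "_":
                wanted = UNDERLINE
            else:
                wanted = None  # unknown combination: keep only the top character
            char, pos = over, cell.end()
        else:
            wanted, char, pos = None, text[pos], pos + 1
        if wanted != active:
            # Style changes: close the old run, then open the new one.
            if active is not None:
                out.append(RESET)
            if wanted is not None:
                out.append(wanted)
            active = wanted
        out.append(char)
    if active is not None:
        out.append(RESET)
    return "".join(out)
```

Text converted this way carries SGR codes 1 (bold) and 4 (underline), which is the kind of input ansi-color-apply-on-region is meant for.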

    Oliver
-- 
20 Nivôse an 212 de la Révolution
Liberté, Egalité, Fraternité!

