unofficial mirror of emacs-devel@gnu.org 
* html2text
@ 2004-10-31 18:25 Alfred M. Szmidt
  2004-11-01 11:55 ` html2text Reiner Steib
       [not found] ` <mzxv9dyj.fsf@blue.sea.net>
  0 siblings, 2 replies; 11+ messages in thread
From: Alfred M. Szmidt @ 2004-10-31 18:25 UTC (permalink / raw)


html2text is quite nice, but it doesn't strip all HTML files into
something that is readable.  The following patch makes it strip some
"newer" tags that have cropped up.  It still doesn't make things as
nice as they could be, though: tables and comments are still left
intact.

I guess that a better way to do this is to convert all known tags to
something nice, and then just strip all remaining tags that are left.
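
To illustrate the "strip whatever is left" pass, here is a minimal
sketch.  It is not part of html2text.el: the function name
`my-html2text-strip-remaining-tags' is hypothetical, and the code
assumes it runs in the buffer after the known tags have already been
converted.  It also assumes that no attribute value contains a literal
">", which real-world HTML does not guarantee.

```elisp
;; Hypothetical helper, not part of html2text.el.
(defun my-html2text-strip-remaining-tags ()
  "Delete HTML comments and all remaining tags in the current buffer."
  (save-excursion
    ;; Remove comments first, so that a > inside a comment cannot be
    ;; mistaken for the end of a tag.
    (goto-char (point-min))
    (while (search-forward "<!--" nil t)
      (let ((start (match-beginning 0)))
        (if (search-forward "-->" nil t)
            (delete-region start (point))
          ;; Unterminated comment: leave it alone and stop looking.
          (goto-char (point-max)))))
    ;; Then remove anything that still looks like an opening or
    ;; closing tag.  A bare "<" followed by a space is left intact.
    (goto-char (point-min))
    (while (re-search-forward "</?[A-Za-z][^>]*>" nil t)
      (replace-match "" t t))))
```

Tables would still need a real converter; this pass only removes
their markup and leaves the cell text run together.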

diff -ur html2text.el html2text.el.new
--- html2text.el	2004-10-31 19:23:06.000000000 +0100
+++ html2text.el.new	2004-10-31 19:23:46.000000000 +0100
@@ -75,8 +75,10 @@
 
 (defvar html2text-format-tag-list
   '(("b" 	  . html2text-clean-bold)
+    ("strong"     . html2text-clean-bold)
     ("u" 	  . html2text-clean-underline)
     ("i" 	  . html2text-clean-italic)
+    ("em"         . html2text-clean-italic)
     ("blockquote" . html2text-clean-blockquote)
     ("a"          . html2text-clean-anchor)
     ("ul"         . html2text-clean-ul)

Diff finished.  Sun Oct 31 19:23:56 2004

* Re: html2text
  2004-10-31 18:25 html2text Alfred M. Szmidt
@ 2004-11-01 11:55 ` Reiner Steib
  2004-11-01 19:21   ` html2text Alfred M. Szmidt
  2004-11-02  4:46   ` html2text Katsumi Yamaoka
       [not found] ` <mzxv9dyj.fsf@blue.sea.net>
  1 sibling, 2 replies; 11+ messages in thread
From: Reiner Steib @ 2004-11-01 11:55 UTC (permalink / raw)
  Cc: emacs-devel

On Sun, Oct 31 2004, Alfred M. Szmidt wrote:

> html2text is quite nice, but it doesn't strip all HTML files into
> something that is readable.  The following patch makes it strip some
> "newer" tags that have cropped up.  It still doesn't make things as
> nice as they could be, though: tables and comments are still left
> intact.
>
> I guess that a better way to do this is to convert all known tags to
> something nice, and then just strip all remaining tags that are left.

Would you like to work on this?

> diff -ur html2text.el html2text.el.new
> --- html2text.el	2004-10-31 19:23:06.000000000 +0100
> +++ html2text.el.new	2004-10-31 19:23:46.000000000 +0100
> @@ -75,8 +75,10 @@
>
>  (defvar html2text-format-tag-list
>    '(("b" 	  . html2text-clean-bold)
> +    ("strong"     . html2text-clean-bold)
>      ("u" 	  . html2text-clean-underline)
>      ("i" 	  . html2text-clean-italic)
> +    ("em"         . html2text-clean-italic)
>      ("blockquote" . html2text-clean-blockquote)
>      ("a"          . html2text-clean-anchor)
>      ("ul"         . html2text-clean-ul)

Committed in Gnus repository (will be synced to Emacs within a couple
of days).

Bye, Reiner.
-- 
       ,,,
      (o o)
---ooO-(_)-Ooo---  |  PGP key available  |  http://rsteib.home.pages.de/

* Re: html2text
  2004-11-01 11:55 ` html2text Reiner Steib
@ 2004-11-01 19:21   ` Alfred M. Szmidt
  2004-11-02  4:46   ` html2text Katsumi Yamaoka
  1 sibling, 0 replies; 11+ messages in thread
From: Alfred M. Szmidt @ 2004-11-01 19:21 UTC (permalink / raw)
  Cc: emacs-devel

   > I guess that a better way to do this is to convert all known tags
   > to something nice, and then just strip all remaining tags that
   > are left.

   Would you like to work on this?

Yes, I would, but I have other priorities that are far more important
to me, so it won't happen any time soon.

   Committed in Gnus repository (will be synced to Emacs within a
   couple of days).

Thanks.

* Re: html2text
  2004-11-01 11:55 ` html2text Reiner Steib
  2004-11-01 19:21   ` html2text Alfred M. Szmidt
@ 2004-11-02  4:46   ` Katsumi Yamaoka
  2004-11-02  9:22     ` html2text Reiner Steib
  1 sibling, 1 reply; 11+ messages in thread
From: Katsumi Yamaoka @ 2004-11-02  4:46 UTC (permalink / raw)


Hi,

What's the `tag' argument used for?

(defun html2text-get-attr (p1 p2 tag)

[...]

(defun html2text-clean-anchor (p1 p2 p3 p4)
  (let* ((attr-list (html2text-get-attr p1 p2 "a"))

In relation to this, is it necessary to bind the following
variable?

(defun html2text-format-tags ()
[...]
        (let ((p1)
              (p2 (point))
              (p3) (p4)
              (attr (match-string 0)))
               ^^^^

* Re: html2text
  2004-11-02  4:46   ` html2text Katsumi Yamaoka
@ 2004-11-02  9:22     ` Reiner Steib
  2004-11-02 11:59       ` html2text Katsumi Yamaoka
  0 siblings, 1 reply; 11+ messages in thread
From: Reiner Steib @ 2004-11-02  9:22 UTC (permalink / raw)


On Tue, Nov 02 2004, Katsumi Yamaoka wrote:

> What's the `tag' argument used for?
> (defun html2text-get-attr (p1 p2 tag)

For nothing, AFAICS.

[ other odd code / unused variables ]

I only fixed _some_ odd looking places in the code.  I didn't go thru
the code systematically.  Please fix it if you find more.

Bye, Reiner.
-- 
       ,,,
      (o o)
---ooO-(_)-Ooo---  |  PGP key available  |  http://rsteib.home.pages.de/

* Re: html2text
  2004-11-02  9:22     ` html2text Reiner Steib
@ 2004-11-02 11:59       ` Katsumi Yamaoka
  2004-11-02 14:12         ` html2text Reiner Steib
  0 siblings, 1 reply; 11+ messages in thread
From: Katsumi Yamaoka @ 2004-11-02 11:59 UTC (permalink / raw)


>>>>> In <v9lldk4ncs.fsf@marauder.physik.uni-ulm.de> Reiner Steib wrote:

>> What's the `tag' argument used for?
>> (defun html2text-get-attr (p1 p2 tag)

> For nothing, AFAICS.

> [ other odd code / unused variables ]

> I only fixed _some_ odd looking places in the code.  I didn't go thru
> the code systematically.

Sorry, I had wrongly assumed that you rewrote that function, because I
couldn't find any log entries except the following:

	(html2text-get-attr, html2text-fix-paragraph): Simplify code.

> Please fix it if you find more.

I've done so.  Thanks.

* Re: html2text
  2004-11-02 11:59       ` html2text Katsumi Yamaoka
@ 2004-11-02 14:12         ` Reiner Steib
  0 siblings, 0 replies; 11+ messages in thread
From: Reiner Steib @ 2004-11-02 14:12 UTC (permalink / raw)


On Tue, Nov 02 2004, Katsumi Yamaoka wrote:

>>>>>> In <v9lldk4ncs.fsf@marauder.physik.uni-ulm.de> Reiner Steib wrote:
[...]
>> I only fixed _some_ odd looking places in the code.  I didn't go thru
>> the code systematically.
>
> Sorry, I had wrongly assumed that you rewrote that function, because I
> couldn't find any log entries except the following:
>
> 	(html2text-get-attr, html2text-fix-paragraph): Simplify code.

Ah, you probably looked at the version in Gnus trunk.  It seems
something went wrong when Miles merged the changes from v5-10 (where I
installed the changes), see ...

cvs diff -r7.7 -r7.8 lisp/html2text.el

or ...

http://quimby.gnus.org/cgi-bin/cvsweb.cgi/gnus/lisp/html2text.el.diff?r1=7.7&r2=7.8

Bye, Reiner.
-- 
       ,,,
      (o o)
---ooO-(_)-Ooo---  |  PGP key available  |  http://rsteib.home.pages.de/

* Re: html2text
       [not found] ` <mzxv9dyj.fsf@blue.sea.net>
@ 2004-11-08 15:51   ` Reiner Steib
  2004-11-08 18:02     ` html2text David Kastrup
                       ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Reiner Steib @ 2004-11-08 15:51 UTC (permalink / raw)
  Cc: Emacs development

On Sat, Nov 06 2004, Jari Aalto+mail.emacs wrote:

> This is your copy. Article has been posted to the newsgroup(s).

I didn't see your message on emacs-devel, see
<URL:http://thread.gmane.org/1099247139.071920.12084.nullmailer@Update.UU.SE>.

> * Sun 2004-10-31 Alfred Szmidt <ams AT kemisten.nu> gmane.emacs.devel
> * Message-Id: 1099247139.071920.12084.nullmailer AT Update.UU.SE
> | html2text is quite nice, but it doesn't strip all HTML files into
> | something that is readable.  The following patch makes it strip some
> | "newer" tags that have cropped up.
>
> There are more entities. This patch is against the Gnus CVS, but I
> assume it will work for Emacs as well. The entities are in
> alphabetical order.
>
> 2004-11-06 Sat  Jari Aalto  <jari dot aalto A T cante dot net>
>
>         * html2text.el (html2text-replace-list): Added more HTML 4.0
>         entities.

It seems you have signed papers for Emacs as you are listed in the
AUTHORS file.  But I can't check it myself.  Could you please confirm?

[ The suggested patch from Jari's original message was: ]

--8<---------------cut here---------------start------------->8---
--- html2text.el.7.10	2004-11-06 17:20:46.000000000 +0200
+++ html2text.el	2004-11-06 17:41:12.000000000 +0200
@@ -42,8 +42,42 @@
 (defvar html2text-format-single-element-list '(("hr" . html2text-clean-hr)))

 (defvar html2text-replace-list
-  '(("&nbsp;" . " ") ("&gt;" . ">") ("&lt;" . "<") ("&quot;" . "\"")
-    ("&amp;" . "&") ("&apos;" . "'"))
+  '(("&acute;" . "`")
+    ("&amp;" . "&")
+    ("&apos;" . "'")
+    ("&brvbar;" . "|")
+    ("&cent;" . "c")
+    ("&circ;" . "^")
+    ("&copy;" . "(C)")
+    ("&curren;" . "¤")
+    ("&deg;" . "degree")
+    ("&divide;" . "/")
+    ("&euro;" . "e")
+    ("&frac12;" . "½")
+    ("&gt;" . ">")
+    ("&iquest;" . "?")
+    ("&laquo;" . "<<")
+    ("&ldquo;" . "\"")
+    ("&lsaquo;" . "(")
+    ("&lsquo;" . "`")
+    ("&lt;" . "<")
+    ("&mdash;" . "--")
+    ("&nbsp;" . " ")
+    ("&ndash;" . "-")
+    ("&permil;" . "%%")
+    ("&plusmn;" . "+-")
+    ("&pound;" . "£")
+    ("&quot;" . "\"")
+    ("&raquo;" . ">>")
+    ("&rdquo;" . "\"")
+    ("&reg;" . "(R)")
+    ("&rsaquo;" . ")")
+    ("&rsquo;" . "'")
+    ("&sect;" . "§")
+    ("&sup1;" . "^1")
+    ("&sup2;" . "^2")
+    ("&sup3;" . "^3")
+    ("&tilde;" . "~"))
   "The map of entity to text.
--8<---------------cut here---------------end--------------->8---

Bye, Reiner.
-- 
       ,,,
      (o o)
---ooO-(_)-Ooo---  |  PGP key available  |  http://rsteib.home.pages.de/

* Re: html2text
  2004-11-08 15:51   ` html2text Reiner Steib
@ 2004-11-08 18:02     ` David Kastrup
  2004-11-09 22:44     ` html2text Reiner Steib
  2004-11-15  8:31     ` html2text Jari Aalto
  2 siblings, 0 replies; 11+ messages in thread
From: David Kastrup @ 2004-11-08 18:02 UTC (permalink / raw)


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Reiner Steib <reinersteib+gmane@imap.cc> writes:

> On Sat, Nov 06 2004, Jari Aalto+mail.emacs wrote:
>
> It seems you have signed papers for Emacs as you are listed in the
> AUTHORS file.  But I can't check it myself.  Could you please
> confirm?

copyright.list (privately accessible to GNU maintainers) lists Jari
with a general assignment of past and future changes to Emacs.

- -- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.6 (GNU/Linux)
Comment: Processed by Mailcrypt 3.5.8 <http://mailcrypt.sourceforge.net/>

iD8DBQFBj7S2Bo350SLJfmgRAq3jAJ9b01k5renyp+1lwFuxasAGSMuW+ACeLJq+
YD4G1OUD17RECupVKH3SkGI=
=SyiB
-----END PGP SIGNATURE-----

* Re: html2text
  2004-11-08 15:51   ` html2text Reiner Steib
  2004-11-08 18:02     ` html2text David Kastrup
@ 2004-11-09 22:44     ` Reiner Steib
  2004-11-15  8:31     ` html2text Jari Aalto
  2 siblings, 0 replies; 11+ messages in thread
From: Reiner Steib @ 2004-11-09 22:44 UTC (permalink / raw)
  Cc: Jari Aalto+mail.emacs

On Mon, Nov 08 2004, Reiner Steib wrote:

> [ The suggested patch from Jari's original message was: ]
>
> --8<---------------cut here---------------start------------->8---
> --- html2text.el.7.10	2004-11-06 17:20:46.000000000 +0200
> +++ html2text.el	2004-11-06 17:41:12.000000000 +0200
> @@ -42,8 +42,42 @@
>  (defvar html2text-format-single-element-list '(("hr" . html2text-clean-hr)))
>
>  (defvar html2text-replace-list
> -  '(("&nbsp;" . " ") ("&gt;" . ">") ("&lt;" . "<") ("&quot;" . "\"")
> -    ("&amp;" . "&") ("&apos;" . "'"))
> +  '(("&acute;" . "`")

This should be "´".

> +    ("&amp;" . "&")
> +    ("&apos;" . "'")
> +    ("&brvbar;" . "|")
> +    ("&cent;" . "c")
> +    ("&circ;" . "^")
> +    ("&copy;" . "(C)")
> +    ("&curren;" . "¤")
> +    ("&deg;" . "degree")
> +    ("&divide;" . "/")
> +    ("&euro;" . "e")
> +    ("&frac12;" . "½")
[...]

It seems strange to use Latin-1 characters for some entities, but not
for all of those that can be encoded in Latin-1.

On second thought, it looks like there are already more or less
complete lists[1] e.g. in `mm-url-html-entities' (from Gnus),
`sgml-char-names', `sgml-char-names-table', `iso-iso2sgml-trans-tab'
(Emacs) or `w3m-entity-alist' (emacs-w3m).

Probably one of these could be used.  Hm, maybe the function
`iso-sgml2iso' could be used in `html2text.el'?
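
As a rough sketch of the table-driven approach, the following decodes
named entities in a single regexp pass instead of one search per
entity.  All names here are hypothetical, and the small alist is only
a stand-in for one of the complete tables mentioned above; it is not
how html2text.el currently does it.

```elisp
;; Hypothetical names throughout; the table is a tiny stand-in for a
;; complete entity list.
(defvar my-html2text-entity-alist
  '(("amp" . "&") ("lt" . "<") ("gt" . ">") ("quot" . "\"")
    ("nbsp" . " ") ("acute" . "´") ("pound" . "£") ("sect" . "§"))
  "Alist mapping entity names (without \"&\" and \";\") to replacement text.")

(defun my-html2text-decode-entities ()
  "Replace known named entities in the current buffer; leave others alone."
  (save-excursion
    (goto-char (point-min))
    (while (re-search-forward "&\\([A-Za-z][A-Za-z0-9]*\\);" nil t)
      ;; `assoc' compares strings case-sensitively, so \"&Amp;\" is
      ;; not treated as \"&amp;\".
      (let ((text (cdr (assoc (match-string 1) my-html2text-entity-alist))))
        (when text
          ;; Point ends up after the replacement, so the result of
          ;; decoding \"&amp;lt;\" is not decoded a second time.
          (replace-match text t t))))))
```

Because unknown names are simply skipped, swapping in a larger table
changes the coverage without touching the function.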

Bye, Reiner.

[1] Might be checked with
    http://www.w3.org/TR/REC-html40/sgml/entities.html or other
    tables.
-- 
       ,,,
      (o o)
---ooO-(_)-Ooo---  |  PGP key available  |  http://rsteib.home.pages.de/

* Re: html2text
  2004-11-08 15:51   ` html2text Reiner Steib
  2004-11-08 18:02     ` html2text David Kastrup
  2004-11-09 22:44     ` html2text Reiner Steib
@ 2004-11-15  8:31     ` Jari Aalto
  2 siblings, 0 replies; 11+ messages in thread
From: Jari Aalto @ 2004-11-15  8:31 UTC (permalink / raw)
  Cc: reinersteib+gmane

Reiner Steib <reinersteib+gmane@imap.cc> writes:
| On Sat, Nov 06 2004, Jari Aalto+mail.emacs wrote:
| 
| > This is your copy. Article has been posted to the newsgroup(s).
| 
| I didn't see your message on emacs-devel, see
| <URL:http://thread.gmane.org/1099247139.071920.12084.nullmailer@Update.UU.SE>.
| >
| > 2004-11-06 Sat  Jari Aalto  <jari dot aalto A T cante dot net>
| >
| >         * html2text.el (html2text-replace-list): Added more HTML 4.0
| >         entities.
| 
| It seems you have signed papers for Emacs as you are listed in the
| AUTHORS file.  But I can't check it myself.  Could you please confirm?

Confirmed. I have signed the papers.

Jari
