unofficial mirror of guile-user@gnu.org 
From: Neil Van Dyke <neil@neilvandyke.org>
To: Kenan Toker <kenan@kdtsh.net>, guile-user@gnu.org
Subject: Re: Using shtml with htmlprag - output of shtml->html is different to some given HTML
Date: Thu, 5 Sep 2019 18:33:57 -0400
Message-ID: <e0637ce2-fa36-7ac9-a429-52568088b8c8@neilvandyke.org>
In-Reply-To: <5c940139-4544-07ed-b01f-1e154e3eb30e@kdtsh.net>

Kenan, could you please try the "one-line" change below, and let me know 
what you think?

(It's an attempt at a minimal fix for the problem you were seeing, and 
for some related problems with modern HTML.  However, it breaks 
backward-compatibility relative to the htmlprag currently in guile-lib.  
For example, someone scraping modern HTML may have code that only works 
with the previous, invalid parse.  I'm not yet familiar with guile-lib 
and how the htmlprag in it is being used, so I don't want to be too 
quick to suggest breaking changes to it.)

(Historical note: htmlprag was mostly written 18 years ago, when HTML 
was different in both standards and practice.  Today, I'd write the 
parser very differently, though I think there's a good chance that 
htmlprag will still work for one's purpose, with this change.)

Neil

--- htmlprag.scm.ORIG    2019-09-05 18:21:40.850220789 -0400
+++ htmlprag.scm    2019-09-05 18:21:40.850220789 -0400
@@ -1099,7 +1099,7 @@
                (meta     . (head))
                (noframes . (frameset))
                (option   . (select))
-              (p        . (body td th))
+              (p        . (div blockquote body footer header li td th))
                (param    . (applet))
                (tbody    . (table))
                (td       . (tr))
@@ -1989,6 +1989,13 @@
      (t1 "<script>xxx"  '((script "xxx")))
      (t1 "<script/>xxx" '((script) "xxx"))

+    (t1 "<div><p>x</p></div>" '((div        (p "x"))))
+    (t1 "<header><p>x</p></>" '((header     (p "x"))))
+    (t1 "<footer><p>x</p></>" '((footer     (p "x"))))
+    (t1 "<blockquote><p>x</p></blockquote>" '((blockquote (p "x"))))
+    (t1 "<ul><li><p>x</p></li></ul>" '((ul (li     (p "x")))))
+    (t1 "<ol><li><p>x</p></li></ol>" '((ol (li     (p "x")))))
+
      ;; TODO: Add verbatim-pair cases with attributes in the end tag.

      (t2 '(p)            "<p></p>")
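
The first hunk widens the list of elements that may directly contain a p. As a rough illustration of what that kind of table expresses, here is a minimal, self-contained sketch. It is not htmlprag's actual code: the names old-constraints, new-constraints, and allowed-parent? are hypothetical, and treating elements without an entry as unconstrained is an assumption made for the example.

```scheme
;; Hypothetical illustration only; htmlprag's real table and lookup differ.
;; Each entry maps an element to the parents it may appear directly under.
(define old-constraints
  '((p . (body td th))))

(define new-constraints
  '((p . (div blockquote body footer header li td th))))

;; Return #t if CHILD may appear directly under PARENT according to TABLE.
;; Assumption: an element with no entry in TABLE is unconstrained.
(define (allowed-parent? table child parent)
  (let ((entry (assq child table)))
    (or (not entry)
        (and (memq parent (cdr entry)) #t))))

;; Before the change, div is not a listed parent of p; after it, it is.
(display (allowed-parent? old-constraints 'p 'div)) (newline)
(display (allowed-parent? new-constraints 'p 'div)) (newline)
```

Under this sketch the first check yields #f and the second #t, which matches the intent of the added test cases that keep p nested inside div, header, footer, blockquote, and li.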



