From: Neil Van Dyke <neil@neilvandyke.org>
To: Kenan Toker <kenan@kdtsh.net>, guile-user@gnu.org
Subject: Re: Using shtml with htmlprag - output of shtml->html is different to some given HTML
Date: Thu, 5 Sep 2019 18:33:57 -0400
Message-ID: <e0637ce2-fa36-7ac9-a429-52568088b8c8@neilvandyke.org>
In-Reply-To: <5c940139-4544-07ed-b01f-1e154e3eb30e@kdtsh.net>

Kenan, could you please try the below "one-line" change, and let me know
what you think?

(It's an attempt at a minimal fix for the problem you were seeing, and
for some related problems with modern HTML. However, it breaks
backward-compatibility relative to the htmlprag currently in guile-lib.
For example, consider someone doing Web scraping of modern HTML, and
their scraping code only works with the previous, invalid parse. I'm
not yet familiar with guile-lib and how the htmlprag in it is being
used, so I don't want to be too quick to suggest breaking changes to it.)

(Historical note: htmlprag was mostly written 18 years ago, when HTML
was different in both standards and practice. Today, I'd write the
parser very differently, though I think there's a good chance that
htmlprag will still work for one's purpose, with this change.)

Neil

--- htmlprag.scm.ORIG 2019-09-05 18:21:40.850220789 -0400
+++ htmlprag.scm 2019-09-05 18:21:40.850220789 -0400
@@ -1099,7 +1099,7 @@
(meta . (head))
(noframes . (frameset))
(option . (select))
- (p . (body td th))
+ (p . (div blockquote body footer header li td th))
(param . (applet))
(tbody . (table))
(td . (tr))
@@ -1989,6 +1989,13 @@
(t1 "<script>xxx" '((script "xxx")))
(t1 "<script/>xxx" '((script) "xxx"))
+ (t1 "<div><p>x</p></div>" '((div (p "x"))))
+ (t1 "<header><p>x</p></>" '((header (p "x"))))
+ (t1 "<footer><p>x</p></>" '((footer (p "x"))))
+ (t1 "<blockquote><p>x</p></blockquote>" '((blockquote (p "x"))))
+ (t1 "<ul><li><p>x</p></li></ul>" '((ul (li (p "x")))))
+ (t1 "<ol><li><p>x</p></li></ol>" '((ol (li (p "x")))))
+
;; TODO: Add verbatim-pair cases with attributes in the end tag.
(t2 '(p) "<p></p>")
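
One way to try the change by hand is to parse a few of the inputs from the
new test cases and print both the SHTML and the re-serialized HTML. The
sketch below is only illustrative: it assumes the patched file is the one
guile-lib loads as the (htmlprag) module (the module name is an assumption
about your installation), and the `samples` list is a hypothetical name
introduced here. It uses only html->shtml and shtml->html.

```scheme
;; Sketch only: assumes guile-lib is installed and that the patched
;; parser is what loads as the (htmlprag) module. Adjust the module
;; name if your installation differs.
(use-modules (htmlprag))

;; Inputs copied from the well-formed test cases added in the patch.
;; `samples` is a hypothetical name used only for this illustration.
(define samples
  '("<div><p>x</p></div>"
    "<blockquote><p>x</p></blockquote>"
    "<ul><li><p>x</p></li></ul>"
    "<ol><li><p>x</p></li></ol>"))

;; For each input, print the parsed SHTML, then the HTML produced by
;; serializing that SHTML again, so the two can be compared by eye.
(for-each
 (lambda (html)
   (let ((shtml (html->shtml html)))
     (write shtml)
     (newline)
     (display (shtml->html shtml))
     (newline)))
 samples)
```

Running the same script against the unpatched htmlprag.scm and comparing the
two outputs should show whether the p elements stay nested inside their
div, blockquote, and li parents.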