unofficial mirror of guile-user@gnu.org 
* Using shtml with htmlprag - output of shtml->html is different to some given HTML
@ 2019-09-04 12:52 Kenan Toker
  2019-09-04 14:33 ` Neil Van Dyke
  0 siblings, 1 reply; 7+ messages in thread
From: Kenan Toker @ 2019-09-04 12:52 UTC (permalink / raw)
  To: guile-user


Hi guile-users,

Hope you're all very well! I have a question about using shtml with
htmlprag - as far as I know this module isn't actually part of Guile,
and it looks like it's quite old now and maybe no longer under active
development, but if anyone has any insights I'm keen to see if I can get
another set of eyes on this issue I'm having.

I'm new to Guile, and to learn the language I'm building a web crawler.
As part of this, I'm using htmlprag and sxpath to convert some HTML to
shtml and pull some interesting data out of the shtml.

I have the following HTML (I wrote this up for example's sake):

    <!DOCTYPE html>
    <html>
      <head>
        <title>Example</title>
      </head>
      <body>
        <header class="exampleHeader">
          <img id="bannerImage"
    src="https://www.gnu.org/software/guile/static/base/img/branding.png">
          <div>
            <p id="labelName">A label for the header.</p>
          </div>
          <p id="labelDescription">Some description of the header.</p>
        </header>
        <div id="exampleDiv">
          <hr>
          <div id="divMessage">An example message.</div>
        </div>
        <footer id="footer"></footer>
      </body>
    </html>

When I used html->shtml I got the following shtml:

    (*TOP* (*DECL* DOCTYPE html)
     (html
        (head
          (title Example)
       )
        (body
          (header (@ (class exampleHeader))
            (img (@ (id bannerImage) (src
    https://www.gnu.org/software/guile/static/base/img/branding.png)))
            (div
             )) (p (@ (id labelName)) A label for the header.)
           
            (p (@ (id labelDescription)) Some description of the header.)
         
          (div (@ (id exampleDiv))
            (hr)
            (div (@ (id divMessage)) An example message.)
         )
          (footer (@ (id footer)))
       )
    )
    )

However, I would have expected something like (div (p (@ (id labelName))
A label for the header.)) under the header[@class="exampleHeader"] tag
(I haven't tested this exact s-expression, though). Instead, the p tag
sits outside both the div tag and the header tag.

When I do shtml->html over this shtml, I get the following html:

    <!DOCTYPE html>
    <html>
      <head>
        <title>Example</title>
      </head>
      <body>
        <header class="exampleHeader">
          <img id="bannerImage"
    src="https://www.gnu.org/software/guile/static/base/img/branding.png" />
          <div>
            </div></header><p id="labelName">A label for the header.</p>
         
          <p id="labelDescription">Some description of the header.</p>
       
        <div id="exampleDiv">
          <hr />
          <div id="divMessage">An example message.</div>
        </div>
        <footer id="footer"></footer>
      </body>
    </html>

The p[@id="labelName"] tag no longer sits under the div tag. This means
that when I use an sxpath expression like '(// html body (header (@ (equal?
(class "exampleHeader"))))), I get the img tag and an empty div tag, but
no p tag - like so:

    ((header (@ (class exampleHeader))
            (img (@ (id bannerImage) (src
    https://www.gnu.org/software/guile/static/base/img/branding.png)))
            (div
             )))
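
A comparable query can be reproduced without htmlprag by feeding the
mis-parsed tree to sxpath directly. The following is only a sketch: the tree
is typed in by hand from the output above (whitespace strings dropped,
attribute values written as strings), the names mis-parsed and header-divs
are made up for illustration, and a trailing div step is added to the path
so that only the header's div children are returned.

```scheme
;; Import only sxpath, so the module's own `filter' does not shadow
;; the core binding.
(use-modules ((sxml xpath) #:select (sxpath)))

;; Hand-written copy of the mis-parsed tree (assumption: whitespace
;; strings removed, attribute values as strings).
(define mis-parsed
  '(*TOP*
    (html
     (head (title "Example"))
     (body
      (header (@ (class "exampleHeader"))
              (img (@ (id "bannerImage")
                      (src "https://www.gnu.org/software/guile/static/base/img/branding.png")))
              (div))
      (p (@ (id "labelName")) "A label for the header.")
      (p (@ (id "labelDescription")) "Some description of the header.")
      (div (@ (id "exampleDiv"))
           (hr)
           (div (@ (id "divMessage")) "An example message."))
      (footer (@ (id "footer")))))))

;; Select the div children of the header whose class attribute
;; equals "exampleHeader".
(define header-divs
  (sxpath '(// html body (header (@ (equal? (class "exampleHeader")))) div)))

(write (header-divs mis-parsed))
(newline)
```

On this tree the query finds the div but nothing inside it, because the p
element was attached to body instead.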

I'm wondering if I've missed something, or if others get this kind of
behaviour. The upshot of this is that, for the HTML above, it looks like
(equal? example-html (shtml->html (html->shtml example-html))) is false,
which isn't what I'd expect. Is there something funny that happens with
`p`?
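
For reference, a minimal sketch of that round-trip check. It assumes
guile-lib is installed so that the (htmlprag) module is on the load path;
the shortened example-html string and the name round-tripped are
illustrative only.

```scheme
;; (htmlprag) comes from guile-lib, not from Guile itself.
(use-modules (htmlprag))

;; A shortened fragment of the example page (assumption: any HTML
;; string can be used here).
(define example-html
  "<div id=\"exampleDiv\"><hr><div id=\"divMessage\">An example message.</div></div>")

;; Parse to SHTML, then serialise back to an HTML string.
(define round-tripped
  (shtml->html (html->shtml example-html)))

(display round-tripped)
(newline)

;; Textual equality is a strict test: the serialiser may legitimately
;; write empty elements differently from the input.
(display (equal? example-html round-tripped))
(newline)
```

Since the output above shows hr written back as <hr />, a string comparison
can fail even when the tree structure is preserved, so comparing the parsed
trees is probably the more useful check.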

Thanks a lot,
Kenan


NB. In the sxml examples above, none of the strings are surrounded by
double quotes, but I think this is an artefact of how I'm writing them
to files for testing purposes - see the extract of the sxml below, which
I get when I use ,pretty-print in Geiser:

    (div (@ (id "exampleDiv"))
                "\n"
                "      "
                (hr)
                "\n"
                "      "
                (div (@ (id "divMessage")) "An example message.")
                "\n"
                "    ")



* Re: Using shtml with htmlprag - output of shtml->html is different to some given HTML
  2019-09-04 12:52 Using shtml with htmlprag - output of shtml->html is different to some given HTML Kenan Toker
@ 2019-09-04 14:33 ` Neil Van Dyke
  2019-09-04 23:38   ` Kenan Toker
  0 siblings, 1 reply; 7+ messages in thread
From: Neil Van Dyke @ 2019-09-04 14:33 UTC (permalink / raw)
  To: Kenan Toker, guile-user

Hi, Kenan.

If you can tell me a URL for which `htmlprag` you're using, I'll try to 
fix the problem.

If I can't quickly figure out a simple fix for that `htmlprag`, I might 
end up trying to quickly do an unofficial port of the current (sadly 
Racket-specific) incarnation of the `htmlprag` code, since the newer 
code works correctly with your example, and I don't see an obvious fix 
in the version history: 
"https://www.neilvandyke.org/racket/html-parsing/#%28part._.History%29".

(Coincidentally, I recently decided to switch back from Racket, to 
building atop RnRS again, and hopefully advancing it.  `htmlprag` was 
the very first Scheme code I ever wrote, and was tested on many Scheme 
implementations, so porting `html-parsing` back from Racket first seems 
fitting, and probably very easy.)

Neil V.





* Re: Using shtml with htmlprag - output of shtml->html is different to some given HTML
  2019-09-04 14:33 ` Neil Van Dyke
@ 2019-09-04 23:38   ` Kenan Toker
  2019-09-05 22:33     ` Neil Van Dyke
  0 siblings, 1 reply; 7+ messages in thread
From: Kenan Toker @ 2019-09-04 23:38 UTC (permalink / raw)
  To: Neil Van Dyke, guile-user

Hi Neil,

Brilliant! I'm using the version bundled with guile-lib 0.2.6.1
(website: https://www.nongnu.org/guile-lib/ and downloads:
https://download.savannah.nongnu.org/releases/guile-lib/ ).

I haven't tried using a different version than this. I can also see from
the header that this version was `forked` (I'm not sure if that's the
best way to put it) and built into guile-lib in 2004. If trying a
different version of htmlprag is a good idea I can give that a go too.

Re: porting html-parsing and generally working with RnRS again, that
sounds like a good project. As it happens, when I was searching around
for more info about htmlprag I came across html-parsing, and I thought
that maybe it could drop easily into Guile but a glance at the code
confirmed this wasn't to be. It looks like a great library.

Cheers,
Kenan






* Re: Using shtml with htmlprag - output of shtml->html is different to some given HTML
  2019-09-04 23:38   ` Kenan Toker
@ 2019-09-05 22:33     ` Neil Van Dyke
  2019-09-06  4:09       ` Kenan Toker
  0 siblings, 1 reply; 7+ messages in thread
From: Neil Van Dyke @ 2019-09-05 22:33 UTC (permalink / raw)
  To: Kenan Toker, guile-user

Kenan, could you please try the below "one-line" change, and let me know 
what you think?

(It's an attempt at a minimal fix for the problem you were seeing, and 
for some related problems with modern HTML.  However, it breaks 
backward-compatibility relative to the htmlprag currently in guile-lib.  
For example, consider someone doing Web scraping of modern HTML, and 
their scraping code only works with the previous, invalid parse.  I'm 
not yet familiar with guile-lib and how the htmlprag in it is being 
used, so I don't want to be too quick to suggest breaking changes to it.)

(Historical note: htmlprag was mostly written 18 years ago, when HTML 
was different in both standards and practice.  Today, I'd write the 
parser very differently, though I think there's a good chance that 
htmlprag will still work for one's purpose, with this change.)

Neil

--- htmlprag.scm.ORIG    2019-09-05 18:21:40.850220789 -0400
+++ htmlprag.scm    2019-09-05 18:21:40.850220789 -0400
@@ -1099,7 +1099,7 @@
                (meta     . (head))
                (noframes . (frameset))
                (option   . (select))
-              (p        . (body td th))
+              (p        . (div blockquote body footer header li td th))
                (param    . (applet))
                (tbody    . (table))
                (td       . (tr))
@@ -1989,6 +1989,13 @@
      (t1 "<script>xxx"  '((script "xxx")))
      (t1 "<script/>xxx" '((script) "xxx"))

+    (t1 "<div><p>x</p></div>" '((div        (p "x"))))
+    (t1 "<header><p>x</p></>" '((header     (p "x"))))
+    (t1 "<footer><p>x</p></>" '((footer     (p "x"))))
+    (t1 "<blockquote><p>x</p></blockquote>" '((blockquote (p "x"))))
+    (t1 "<ul><li><p>x</p></li></ul>" '((ul (li     (p "x")))))
+    (t1 "<ol><li><p>x</p></li></ol>" '((ol (li     (p "x")))))
+
      ;; TODO: Add verbatim-pair cases with attributes in the end tag.

      (t2 '(p)            "<p></p>")
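
The effect of widening the allowed-parent list for p can be illustrated with
a simplified model. This is a sketch of the general idea only, not
htmlprag's actual code: the names old-constraints, new-constraints and
close-until-allowed are hypothetical, and the open elements are modelled as
a plain list with the innermost element first.

```scheme
;; Each entry maps an element name to the parents it may sit under.
(define old-constraints
  '((p . (body td th))))
(define new-constraints
  '((p . (div blockquote body footer header li td th))))

;; OPEN lists the currently open elements, innermost first.
;; Returns the open elements that remain after closing everything that
;; is not an allowed parent for NAME. Elements without a constraint,
;; or with no allowed parent open at all, leave OPEN unchanged.
(define (close-until-allowed name open constraints)
  (let ((slot (assq name constraints)))
    (if (not slot)
        open
        (let loop ((stack open))
          (cond ((null? stack) open)
                ((memq (car stack) (cdr slot)) stack)
                (else (loop (cdr stack))))))))

;; State when <p id="labelName"> is seen in the example page.
(define open-elements '(div header body html))

(write (close-until-allowed 'p open-elements old-constraints))
(newline)
(write (close-until-allowed 'p open-elements new-constraints))
(newline)
```

Under the old list the model closes div and header before attaching p to
body, which matches the parse reported at the start of the thread. Under the
new list div is already an allowed parent, so nothing is closed.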





* Re: Using shtml with htmlprag - output of shtml->html is different to some given HTML
  2019-09-05 22:33     ` Neil Van Dyke
@ 2019-09-06  4:09       ` Kenan Toker
  2019-09-06 20:55         ` Neil Van Dyke
  0 siblings, 1 reply; 7+ messages in thread
From: Kenan Toker @ 2019-09-06  4:09 UTC (permalink / raw)
  To: Neil Van Dyke, guile-user



Hi Neil,

Thanks heaps, I'll give this fix a go and let you know how it works ASAP.

That all makes sense re: avoiding breaking changes in guile-lib. If this
fix works and is all that's needed, I'll use it instead of the version
currently available in guile-lib.

With that in mind, if I were to choose one of the 'distributions' of
htmlprag, is there one you yourself would pick? Or are the versions
available in e.g. guile-lib and standalone the same for all intents and
purposes?

Cheers,
Kenan


* Re: Using shtml with htmlprag - output of shtml->html is different to some given HTML
  2019-09-06  4:09       ` Kenan Toker
@ 2019-09-06 20:55         ` Neil Van Dyke
  2019-09-07  4:40           ` Kenan Toker
  0 siblings, 1 reply; 7+ messages in thread
From: Neil Van Dyke @ 2019-09-06 20:55 UTC (permalink / raw)
  To: Kenan Toker, guile-user

Kenan Toker wrote on 9/6/19 12:09 AM:
> With that in mind, if I were to choose one of the 'distributions' of
> htmlprag, is there one you yourself would pick?

I suspect that the version in guile-lib (plus the patch I sent 
yesterday) is best.

(Realistically, I probably can't work on anything better anytime soon, 
unless a deep-pocketed dotcom wants to get into the Scheming business.)





* Re: Using shtml with htmlprag - output of shtml->html is different to some given HTML
  2019-09-06 20:55         ` Neil Van Dyke
@ 2019-09-07  4:40           ` Kenan Toker
  0 siblings, 0 replies; 7+ messages in thread
From: Kenan Toker @ 2019-09-07  4:40 UTC (permalink / raw)
  To: Neil Van Dyke, guile-user



Great, in that case I'll use this patched version of htmlprag with
guile-lib now.

After a little bit of testing it looks like this patch did the trick -
here's the shtml of the HTML file in my original email:

    (*TOP* (*DECL* DOCTYPE html)
     (html
        (head
          (title Example)
       )
        (body
          (header (@ (class exampleHeader))
            (img (@ (id bannerImage) (src
    https://www.gnu.org/software/guile/static/base/img/branding.png)))
            (div
              (p (@ (id labelName)) A label for the header.)
           )
            (p (@ (id labelDescription)) Some description of the header.)
         )
          (div (@ (id exampleDiv))
            (hr)
            (div (@ (id divMessage)) An example message.)
         )
          (footer (@ (id footer)))
       )
    )
    )

Which looks a lot better - the p tag is nested inside the div, and my
sxpath expression '(// html body (header (@ (equal? (class
"exampleHeader")))) div) gives me the following (I've taken the escaped
characters and whitespace strings out):

    ((div
          (p (@ (id "labelName"))
             "A label for the header.")))

Thanks a lot for your help with this, it's very much appreciated!

(Likewise if I ever find myself with some time, I might review the code
for html-parsing and see whether porting it to RnRS is something I could
realistically work on.)

Cheers,
Kenan


