unofficial mirror of guile-user@gnu.org 
* Using shtml with htmlprag - output of shtml->html is different to some given HTML
@ 2019-09-04 12:52 Kenan Toker
  2019-09-04 14:33 ` Neil Van Dyke
From: Kenan Toker @ 2019-09-04 12:52 UTC
  To: guile-user


Hi guile-users,

Hope you're all very well! I have a question about using shtml with
htmlprag. As far as I know this module isn't actually part of Guile,
and it looks quite old now and may no longer be under active
development, but if anyone has any insights I'd be glad to get
another set of eyes on the issue I'm having.

I'm new to Guile, and to learn the language I'm building a web crawler.
As part of this, I'm using htmlprag and sxpath to convert some HTML to
shtml and pull some interesting data out of the shtml.

I have the following HTML (I wrote this up for example's sake):

    <!DOCTYPE html>
    <html>
      <head>
        <title>Example</title>
      </head>
      <body>
        <header class="exampleHeader">
          <img id="bannerImage"
    src="https://www.gnu.org/software/guile/static/base/img/branding.png">
          <div>
            <p id="labelName">A label for the header.</p>
          </div>
          <p id="labelDescription">Some description of the header.</p>
        </header>
        <div id="exampleDiv">
          <hr>
          <div id="divMessage">An example message.</div>
        </div>
        <footer id="footer"></footer>
      </body>
    </html>

When I used html->shtml I got the following shtml:

    (*TOP* (*DECL* DOCTYPE html)
     (html
        (head
          (title Example)
       )
        (body
          (header (@ (class exampleHeader))
            (img (@ (id bannerImage) (src
    https://www.gnu.org/software/guile/static/base/img/branding.png)))
            (div
             )) (p (@ (id labelName)) A label for the header.)
           
            (p (@ (id labelDescription)) Some description of the header.)
         
          (div (@ (id exampleDiv))
            (hr)
            (div (@ (id divMessage)) An example message.)
         )
          (footer (@ (id footer)))
       )
    )
    )

However, I would have expected something like (div (p (@ (id labelName))
A label for the header.)) under the header[@class="exampleHeader"] tag
(I haven't tested this exact s-expression though). Instead, the div is
empty and the p tag sits outside both the div and the header.

When I do shtml->html over this shtml, I get the following html:

    <!DOCTYPE html>
    <html>
      <head>
        <title>Example</title>
      </head>
      <body>
        <header class="exampleHeader">
          <img id="bannerImage"
    src="https://www.gnu.org/software/guile/static/base/img/branding.png" />
          <div>
            </div></header><p id="labelName">A label for the header.</p>
         
          <p id="labelDescription">Some description of the header.</p>
       
        <div id="exampleDiv">
          <hr />
          <div id="divMessage">An example message.</div>
        </div>
        <footer id="footer"></footer>
      </body>
    </html>

The p[@id="labelName"] tag no longer sits under the div tag. This means
that when I use an sxpath expression like '(// html body (header (@ (eq?
"exampleHeader")))), I get the img tag and an empty div tag, but no p
tag - like so:

    ((header (@ (class exampleHeader))
            (img (@ (id bannerImage) (src
    https://www.gnu.org/software/guile/static/base/img/branding.png)))
            (div
             )))
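
To reproduce the effect on the query without htmlprag installed, here is
a minimal sketch that uses only the (sxml xpath) module bundled with
Guile. Both trees are hand-written, abbreviated stand-ins: one shaped
like the parser output above, one shaped like the nesting in the source
HTML. The names parsed-tree, expected-tree and label-ids-under-header are
made up for illustration, and the path is deliberately simpler than the
one quoted above (it does not filter on the class attribute).

```scheme
(use-modules (sxml xpath))

;; Abbreviated, hand-written stand-in for the html->shtml output above:
;; the p element ends up as a sibling of header, not inside its div.
(define parsed-tree
  '(*TOP*
    (html
     (body
      (header (@ (class "exampleHeader"))
              (img (@ (id "bannerImage")))
              (div))
      (p (@ (id "labelName")) "A label for the header.")))))

;; Hand-written stand-in for the nesting the source HTML describes.
(define expected-tree
  '(*TOP*
    (html
     (body
      (header (@ (class "exampleHeader"))
              (img (@ (id "bannerImage")))
              (div (p (@ (id "labelName")) "A label for the header.")))))))

;; Hypothetical helper: the id of every p that is a descendant of a header.
(define label-ids-under-header
  (sxpath '(// header // p @ id *text*)))

(display (label-ids-under-header parsed-tree))   ; no p is found under header
(newline)
(display (label-ids-under-header expected-tree)) ; the labelName id is found
(newline)
```

Since the query itself is the same in both cases, the difference in
results comes entirely from where the parser placed the p element.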

I'm wondering if I've missed something, or if others get this kind of
behaviour. The upshot of this is that, for the HTML above, it looks like
(equal? example-html (shtml->html (html->shtml example-html))) is false,
which isn't what I'd expect. Is there something funny that happens with
`p`?
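
Separately from the p problem, a parse/serialize round trip is probably
not a string identity in general: different source strings can parse to
the same tree, so a serializer cannot reproduce all of them (the <hr> to
<hr /> change in the output above is one instance). A minimal sketch of
that point, assuming xml->sxml and sxml->xml from Guile's bundled
(sxml simple) as stand-ins for html->shtml and shtml->html; htmlprag
itself is not used here, and src-a, src-b and serialize are illustrative
names.

```scheme
(use-modules (sxml simple))

;; Two different spellings of the same empty element.
(define src-a "<div><hr/></div>")
(define src-b "<div><hr></hr></div>")

(define tree-a (xml->sxml src-a))
(define tree-b (xml->sxml src-b))

;; Hypothetical helper: serialize an SXML tree to a string.
(define (serialize tree)
  (call-with-output-string
    (lambda (port) (sxml->xml tree port))))

;; Both sources parse to the same tree ...
(display (equal? tree-a tree-b))
(newline)
;; ... so the serialized form can match at most one of the two sources.
(display (serialize tree-a))
(newline)
```

Comparing the parsed trees, rather than the HTML strings, may therefore
be a more useful check of whether a round trip preserved the document.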

Thanks a lot,
Kenan


NB. In the sxml examples above none of the strings are surrounded by
double quotes, but I think this is an artefact of how I'm writing them
to files for testing purposes - see the extract below of the sxml I get
when I use ,pretty-print in Geiser:

    (div (@ (id "exampleDiv"))
                "\n"
                "      "
                (hr)
                "\n"
                "      "
                (div (@ (id "divMessage")) "An example message.")
                "\n"
                "    ")


