unofficial mirror of guile-user@gnu.org 
From: Kenan Toker <kenan@kdtsh.net>
To: guile-user@gnu.org
Subject: Using shtml with htmlprag - output of shtml->html is different to some given HTML
Date: Wed, 4 Sep 2019 22:52:54 +1000
Message-ID: <6db79fe7-36e9-edbf-4ac0-35a8fb8bbb03@kdtsh.net>



Hi guile-users,

Hope you're all very well! I have a question about using shtml with
htmlprag. As far as I know this module isn't actually part of Guile,
and it looks quite old and may no longer be under active development,
but if anyone has any insights I'd appreciate another set of eyes on
the issue I'm having.

I'm new to Guile, and to learn the language I'm building a web crawler.
As part of this, I'm using htmlprag and sxpath to convert some HTML to
shtml and pull some interesting data out of the shtml.

I have the following HTML (I wrote this up for example's sake):

    <!DOCTYPE html>
    <html>
      <head>
        <title>Example</title>
      </head>
      <body>
        <header class="exampleHeader">
          <img id="bannerImage"
    src="https://www.gnu.org/software/guile/static/base/img/branding.png">
          <div>
            <p id="labelName">A label for the header.</p>
          </div>
          <p id="labelDescription">Some description of the header.</p>
        </header>
        <div id="exampleDiv">
          <hr>
          <div id="divMessage">An example message.</div>
        </div>
        <footer id="footer"></footer>
      </body>
    </html>

When I used html->shtml I got the following shtml:

    (*TOP* (*DECL* DOCTYPE html)
     (html
        (head
          (title Example)
       )
        (body
          (header (@ (class exampleHeader))
            (img (@ (id bannerImage) (src
    https://www.gnu.org/software/guile/static/base/img/branding.png)))
            (div
             )) (p (@ (id labelName)) A label for the header.)
           
            (p (@ (id labelDescription)) Some description of the header.)
         
          (div (@ (id exampleDiv))
            (hr)
            (div (@ (id divMessage)) An example message.)
         )
          (footer (@ (id footer)))
       )
    )
    )

However, I would have expected something like (div (p (@ (id labelName))
A label for the header.)) under the header[@class="exampleHeader"] tag
(I haven't tested this exact s-expression, though). Instead, the p tag
sits outside the div tag, and in fact outside the header tag as well.
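
As a minimal sketch of the expected nesting, the tree below is written by
hand (it is not htmlprag output) and queried with Guile's bundled
(sxml xpath) module. The names expected-header and paragraphs-in-div are
made up for illustration, and the src attribute is left out for brevity.

```scheme
(use-modules (sxml xpath))

;; Hand-written SXML for the expected nesting; NOT produced by htmlprag.
(define expected-header
  '(header (@ (class "exampleHeader"))
     (img (@ (id "bannerImage")))
     (div (p (@ (id "labelName")) "A label for the header."))
     (p (@ (id "labelDescription")) "Some description of the header.")))

;; Select every p that is a direct child of a div directly under the header.
(define paragraphs-in-div
  ((sxpath '(div p)) expected-header))

(write paragraphs-in-div)
(newline)
```

With this nesting the query finds the labelName paragraph under the div,
which is what the html->shtml result above lacks.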

When I run shtml->html over this shtml, I get the following HTML:

    <!DOCTYPE html>
    <html>
      <head>
        <title>Example</title>
      </head>
      <body>
        <header class="exampleHeader">
          <img id="bannerImage"
    src="https://www.gnu.org/software/guile/static/base/img/branding.png" />
          <div>
            </div></header><p id="labelName">A label for the header.</p>
         
          <p id="labelDescription">Some description of the header.</p>
       
        <div id="exampleDiv">
          <hr />
          <div id="divMessage">An example message.</div>
        </div>
        <footer id="footer"></footer>
      </body>
    </html>

The p[@id="labelName"] tag no longer sits under the div tag. This means
that when I use an sxpath expression like '(// html body (header (@ (eq?
"exampleHeader")))), I get the img tag and an empty div tag, but no p
tag:

    ((header (@ (class exampleHeader))
            (img (@ (id bannerImage) (src
    https://www.gnu.org/software/guile/static/base/img/branding.png)))
            (div
             )))
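
To make the difference concrete, here is a small sketch over a hand-copied,
abridged version of the tree that html->shtml returned (whitespace strings
and the src attribute are omitted). It uses only Guile's bundled
(sxml xpath) module; parsed-body, under-header and under-body are
illustrative names, not anything from htmlprag.

```scheme
(use-modules (sxml xpath))

;; Abridged by hand from the html->shtml output shown earlier.
(define parsed-body
  '(body
     (header (@ (class "exampleHeader"))
       (img (@ (id "bannerImage")))
       (div))
     (p (@ (id "labelName")) "A label for the header.")
     (p (@ (id "labelDescription")) "Some description of the header.")
     (div (@ (id "exampleDiv"))
       (hr)
       (div (@ (id "divMessage")) "An example message."))
     (footer (@ (id "footer")))))

;; p elements inside a div under the header: nothing matches in this tree.
(define under-header
  ((sxpath '(header div p)) parsed-body))

;; p elements that are direct children of body: both paragraphs end up here.
(define under-body
  ((sxpath '(p)) parsed-body))

(write (list (length under-header) (length under-body)))
(newline)
```

In this tree both paragraphs are siblings of the header rather than
descendants of it, so any query rooted at the header cannot reach them.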

I'm wondering if I've missed something, or if others see this kind of
behaviour. The upshot is that, for the HTML above,
(equal? example-html (shtml->html (html->shtml example-html))) appears to
be false, which isn't what I'd expect. Is there something special about
how `p` is handled?

Thanks a lot,
Kenan


NB. In the sxml example above none of the strings are surrounded by
double quotes, but I think this is an artefact of how I'm writing them
to files for testing purposes. Here is an extract of the sxml as shown
by ,pretty-print in Geiser:

    (div (@ (id "exampleDiv"))
                "\n"
                "      "
                (hr)
                "\n"
                "      "
                (div (@ (id "divMessage")) "An example message.")
                "\n"
                "    ")



Thread overview: 7+ messages
2019-09-04 12:52 Kenan Toker [this message]
2019-09-04 14:33 ` Using shtml with htmlprag - output of shtml->html is different to some given HTML Neil Van Dyke
2019-09-04 23:38   ` Kenan Toker
2019-09-05 22:33     ` Neil Van Dyke
2019-09-06  4:09       ` Kenan Toker
2019-09-06 20:55         ` Neil Van Dyke
2019-09-07  4:40           ` Kenan Toker
