From: Catonano <catonano@gmail.com>
To: guile-user@gnu.org
Subject: Re: salutations and web scraping
Date: Mon, 16 Jan 2012 21:06:48 +0100
Message-ID: <CAJ98PDzWPYyLxTUsbjaGV0_tqgjy1XqUBB3bPFms2_S54-cMhQ@mail.gmail.com>
In-Reply-To: <871ur7fcyj.fsf@pobox.com>
Andy,
On 10 January 2012 at 22:46, Andy Wingo <wingo@pobox.com> wrote:
> Hi Catonano,
>
> On Fri 30 Dec 2011 23:58, Catonano <catonano@gmail.com> writes:
>
> > I'm a beginner, I never wrote a single line of LISP or Scheme in my life
> > and I'm here to ask for directions and suggestions.
>
> Welcome! :-)
>
Thank you so much for your reply. I had been eagerly waiting for a sign
from the list, and I missed it! I'm sorry.
Gmail's learning mechanism still hasn't learned enough about my interest
in this topic, so it didn't promptly notify me of your reply. I had to
dig through the folder structure I had laid out in order to find it.
For my part, I haven't learned enough about the woes of Gmail's learning
mechanism. I guess we're both learning now.
Well, I was attempting a joke ;-)
> > my boldness is such that I'd ask you to write for me an example
> > skeleton code.
>
>
> Hey, it's fair, I think; that is a new part of Guile, and there is not a
> lot of example code.
>
>
Thanks, Andy, I'm grateful for this. I had actually managed to set up Geiser,
load a file, and get to a prompt with that file loaded.
Cool ;-) But there were still some things I didn't know that your post made
clear.
> Generally, we figure out how to solve problems at the REPL, so fire up
> your Guile:
>
> $ guile
> ...
> scheme@(guile-user)>
>
> (Here I'm assuming you have guile 2.0.3.)
>
> Use the web modules. Let's assume we're grabbing http://www.gnu.org/,
> for simplicity:
>
> > (use-modules (web client) (web uri))
> > (http-get (string->uri "http://www.gnu.org/software/guile/"))
> [here the text of the web page gets printed out]
>
OK, I had managed to get this far (thanks to the help I received in the
Guile channel on IRC).
>
> Actually there are two return values: the response object, corresponding
> to the headers, and the body. If you scroll your terminal up, you'll
> see that they get labels like $1 and $2.
>
I didn't know there were two values, thanks.
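Outside the REPL there are no $1 and $2 labels, so a script has to bind the
two values itself. Below is a minimal sketch of one way to do that with
receive, assuming Guile 2.0.x, where http-get takes a URI object and
returns the response and the body as two values. The name fetch-page is a
hypothetical helper, not something Guile provides.

```scheme
;; Sketch only: assumes Guile 2.0.x behaviour as described in the thread.
(use-modules (web client)     ; http-get
             (web uri)        ; string->uri
             (web response)   ; response-code
             (ice-9 receive)) ; receive

;; fetch-page is a hypothetical helper name, not part of Guile.
(define (fetch-page url)
  ;; Bind the two values returned by http-get to local names.
  (receive (response body)
      (http-get (string->uri url))
    ;; Hand back the numeric status code together with the body.
    (values (response-code response) body)))
```

Calling (fetch-page "http://www.gnu.org/software/guile/") at the REPL should
again show two results, this time the status code and the page text.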
>
> Now you need to parse the HTML. The best way to do this is with the
> pragmatic HTML parser, htmlprag. It's part of guile-lib. So download
> and install guile-lib (it's at http://www.nongnu.org/guile-lib/), and
> then, assuming the html is in $2:
>
I had seen those $N things, but I hadn't understood that values were stored
"inside" them and that I could use them, so I was using a lot of
(define this that). That is probably why I missed the two values returned
by http-get.
Thanks!
> > (use-modules (htmlprag))
> > (define the-web-page (html->sxml $2))
>
And I didn't know about htmlprag, thanks.
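To try the parsing step without fetching anything, html->sxml can be fed a
literal string. This is only a sketch: it assumes guile-lib is installed so
that the (htmlprag) module loads, and sample-html and sample-page are
made-up names standing in for the body returned by http-get.

```scheme
;; Sketch only: assumes guile-lib is installed so that (htmlprag) loads.
(use-modules (htmlprag))

;; A small literal page stands in for the body returned by http-get.
(define sample-html
  "<html><head><title>GNU Guile</title></head><body><p>Hello</p></body></html>")

;; html->sxml turns the HTML text into an s-expression tree.
(define sample-page (html->sxml sample-html))

;; Print the tree so its shape can be inspected.
(write sample-page)
(newline)
```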
>
> That parses the web page to s-expressions. You can print the result
> nicely:
>
> > ,pretty-print the-web-page
>
Thanks, I didn't know this either.
>
> Now you need to get something out of the web page. The hackiest way to
> do it is just to match against the entire page. Maybe someone else can
> come up with an example, but I'm short on time, so I'll proceed to The
> Right Thing -- the problem is that whitespace is significant, and maybe
> all you want is the contents of "the <title> in the <head> in the
> <html>."
>
> So in XML you'd use XPATH. In SXML you'd use SXPATH. It's hard to use
> right now; we really need to steal
> http://www.neilvandyke.org/webscraperhelper/ from Neil van Dyke. But
> you can see from his docs that the thing would be
>
> > (use-modules (sxml xpath))
> > (define matcher (sxpath '(// html head title)))
> > (matcher the-web-page)
> $3 = ((title "GNU Guile (About Guile)"))
>
>
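The same matcher can be tried on a hand-written SXML tree, which makes it
easier to see what sxpath returns without downloading a page. A minimal
sketch, using only the (sxml xpath) module that ships with Guile; the tree
and the names sample-tree, title-matcher, titles and title-text are
invented for illustration.

```scheme
;; Sketch only: the tree below is hand-written, not fetched from the web.
(use-modules (sxml xpath))

(define sample-tree
  '(*TOP* (html (head (title "GNU Guile (About Guile)"))
                (body (p "Hello")))))

;; Same path as in the quoted example: the title in the head in the html.
(define title-matcher (sxpath '(// html head title)))

;; The matcher returns a list of matching elements.
(define titles (title-matcher sample-tree))

;; Each match is a whole (title ...) element; its text is the second item.
(define title-text (cadr (car titles)))

(display title-text)
(newline)
```

The result is a list of whole elements rather than bare strings, which is
why the text has to be picked out of the first match afterwards.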
I was going to attempt something along these lines:
(sxml-match (xml->sxml page) [(div (@ (id "real_player") (rel ,url))) (str
but I'm going to explore your approach too. I hadn't got that far yet: I had
stumbled on something I thought was a bug, but I also had other things
to do (this is a pet project), so this had to wait.
But I'll surely let you know.
Thanks again for your help.
Bye
Cato