Re: salutations and web scraping


From: Catonano <catonano@gmail.com>
To: guile-user@gnu.org
Subject: Re: salutations and web scraping
Date: Mon, 16 Jan 2012 21:06:48 +0100
Message-ID: <CAJ98PDzWPYyLxTUsbjaGV0_tqgjy1XqUBB3bPFms2_S54-cMhQ@mail.gmail.com>
In-Reply-To: <871ur7fcyj.fsf@pobox.com>


Andy,

On 10 January 2012 22:46, Andy Wingo <wingo@pobox.com> wrote:

> Hi Catonano,
>
> On Fri 30 Dec 2011 23:58, Catonano <catonano@gmail.com> writes:
>
> > I'm a beginner, I never wrote a single line of LISP or Scheme in my life
> > and I'm here for asking for directions and suggestions.
>
> Welcome! :-)
>

thank you so much for your reply. I had been eagerly waiting for a signal
from the list and I missed it! I'm sorry.

Gmail's learning mechanism still hasn't learned enough about my interest
in this topic, so it didn't promptly tell me about your reply. I had to
dig through the folder structure I had laid out in order to discover it.
As for me, I haven't learned enough about the woes of Gmail's learning
mechanism. I guess we're both learning now.

Well, I was attempting a joke ;-)



> > my boldness is such that I'd ask you to write for me an example
> > skeleton code.
>
>
> Hey, it's fair, I think; that is a new part of Guile, and there is not a
> lot of example code.
>
>
Thanks, Andy, I'm grateful for this. Actually I managed to set up Geiser,
load a file and get delivered to a prompt in which that file is loaded.
Cool ;-) But there were still some things I didn't know that your post made
clear.


> Generally, we figure out how to solve problems at the REPL, so fire up
> your Guile:
>
>  $ guile
>  ...
>  scheme@(guile-user)>
>
> (Here I'm assuming you have guile 2.0.3.)
>


> Use the web modules.  Let's assume we're grabbing http://www.gnu.org/,
> for simplicity:
>
>  > (use-modules (web client) (web uri))
>  > (http-get (string->uri "http://www.gnu.org/software/guile/"))
>  [here the text of the web page gets printed out]
>

OK, I had managed to get this far (thanks to the help I received in the
Guile channel on IRC)

>
> Actually there are two return values: the response object, corresponding
> to the headers, and the body.  If you scroll your terminal up, you'll
> see that they get labels like $1 and $2.
>

I didn't know there were two values, thanks
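
For anyone following along, a minimal sketch of binding both return values by name instead of relying on the $1 and $2 labels. It assumes the `receive` form from (ice-9 receive) and `response-code` from (web response); the names `response` and `body` are illustrative only, and the call needs network access.

```scheme
(use-modules (web client) (web uri) (web response) (ice-9 receive))

;; http-get returns two values: the response object (headers) and the body.
;; receive binds them to the names given in the first argument.
(receive (response body)
    (http-get (string->uri "http://www.gnu.org/software/guile/"))
  ;; Print the numeric HTTP status code taken from the response object.
  (display (response-code response))
  (newline)
  ;; Report whether the body arrived as a string; depending on the Guile
  ;; version and content type it may be a bytevector instead.
  (display (string? body))
  (newline))
```

Binding the values this way also works in a script file, where the REPL's $1 and $2 history labels are not available.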

>
> Now you need to parse the HTML.  The best way to do this is with the
> pragmatic HTML parser, htmlprag.  It's part of guile-lib.  So download
> and install guile-lib (it's at http://www.nongnu.org/guile-lib/), and
> then, assuming the html is in $2:
>

I had seen those $i things but I hadn't understood that stuff was "inside"
them and that I could use them, so I was using a lot of (define this that).
That is probably why I missed the two values returned by http-get.
Thanks!



>  > (use-modules (htmlprag))
>  > (define the-web-page (html->sxml $2))
>


And I didn't know about htmlprag, thanks


>
> That parses the web page to s-expressions.  You can print the result
> nicely:
>
>  > ,pretty-print the-web-page
>

thanks, I didn't know this, either


>
> Now you need to get something out of the web page.  The hackiest way to
> do it is just to match against the entire page.  Maybe someone else can
> come up with an example, but I'm short on time, so I'll proceed to The
> Right Thing -- the problem is that whitespace is significant, and maybe
> all you want is the contents of "the <title> in the <head> in the
> <html>."
>
> So in XML you'd use XPATH.  In SXML you'd use SXPATH.  It's hard to use
> right now; we really need to steal
> http://www.neilvandyke.org/webscraperhelper/ from Neil van Dyke.  But
> you can see from his docs that the thing would be
>
>  > (use-modules (sxml xpath))
>  > (define matcher (sxpath '(// html head title)))
>  > (matcher the-web-page)
>  $3 = ((title "GNU Guile (About Guile)"))
>
>
I was going to attempt something along these lines

(sxml-match (xml->sxml page) [(div (@ (id "real_player") (rel ,url))) (str

but I'm going to explore your lines too. I hadn't got that far yet: I had
stumbled on something I thought was a bug, but I also had other things
to do (this is a pet project), so this had to wait.
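
For reference, the SXPath lines quoted above can be tried without any network access or guile-lib. This is only a sketch: `sample-page` is a hand-written SXML tree standing in for the output of html->sxml (an assumption for illustration), and `title-matcher` and `title-text` are made-up names.

```scheme
(use-modules (sxml xpath))

;; Hypothetical stand-in for the parsed page; a real tree would come from
;; (html->sxml $2) as in the quoted example.
(define sample-page
  '(*TOP* (html (head (title "GNU Guile (About Guile)"))
                (body (p "Hello")))))

;; Same path as in the quoted message: any html element, then its head
;; child, then the title child of that.
(define title-matcher (sxpath '(// html head title)))

;; Appending *text* to the path selects the string children of the title
;; element rather than the element itself.
(define title-text (sxpath '(// html head title *text*)))

(display (title-matcher sample-page))
(newline)
(display (title-text sample-page))
(newline)
```

The first matcher should return a list of matching nodes in the same shape as the $3 result quoted above.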

But I'll surely let you know

Thanks again for your help
Bye
Cato
