From mboxrd@z Thu Jan 1 00:00:00 1970
From: Andy Wingo
Newsgroups: gmane.lisp.guile.user
Subject: Re: salutations and web scraping
Date: Tue, 10 Jan 2012 22:46:12 +0100
Message-ID: <871ur7fcyj.fsf@pobox.com>
To: Catonano
Cc: guile-user@gnu.org
In-Reply-To: (catonano@gmail.com's message of "Fri, 30 Dec 2011 23:58:47 +0100")
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.3 (gnu/linux)
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
List-Id: General Guile related discussions
Xref: news.gmane.org gmane.lisp.guile.user:9127

Hi Catonano,

On Fri 30 Dec 2011 23:58, Catonano writes:

> I'm a beginner, I never wrote a single line of LISP or Scheme in my life
> and I'm here for asking for directions and
> suggestions.

Welcome! :-)

> I'm mumbling about a pet project. I would like to scrape the web site of
> a communitarian radio station and grab the flash streamed content they
> publish. The license the material is published under is Creative Commons
> so what I'm planning is not illegal.

Sounds like fun.

> my boldness is such that I'd ask you to write for me an example
> skeleton code.

Hey, it's fair, I think; that is a new part of Guile, and there is not a
lot of example code.

Generally, we figure out how to solve problems at the REPL, so fire up
your Guile:

  $ guile
  ...
  scheme@(guile-user)> 

(Here I'm assuming you have Guile 2.0.3.)

Use the web modules.  Let's assume we're grabbing
http://www.gnu.org/software/guile/, for simplicity:

  > (use-modules (web client) (web uri))
  > (http-get (string->uri "http://www.gnu.org/software/guile/"))
  [here the text of the web page gets printed out]

Actually there are two return values: the response object, corresponding
to the headers, and the body.  If you scroll your terminal up, you'll
see that they get labels like $1 and $2.

Now you need to parse the HTML.  The best way to do this is with the
pragmatic HTML parser, htmlprag.  It's part of guile-lib.  So download
and install guile-lib (it's at http://www.nongnu.org/guile-lib/), and
then, assuming the HTML is in $2:

  > (use-modules (htmlprag))
  > (define the-web-page (html->sxml $2))

That parses the web page to s-expressions.  You can print the result
nicely:

  > ,pretty-print the-web-page

Now you need to get something out of the web page.  The hackiest way to
do it is just to match against the entire page.  Maybe someone else can
come up with an example of that; I'm short on time, so I'll proceed to
The Right Thing.  The problem with matching the whole page is that
whitespace is significant, and maybe all you want is the contents of
"the <title> in the <head> in the <html>."  So in XML you'd use XPath.
In SXML you'd use SXPath.
SXPath is hard to use right now; we really need to steal
http://www.neilvandyke.org/webscraperhelper/ from Neil van Dyke.  But
you can see from his docs that the thing would be

  > (use-modules (sxml xpath))
  > (define matcher (sxpath '(// html head title)))
  > (matcher the-web-page)
  $3 = ((title "GNU Guile (About Guile)"))

Et voilà.

I don't do much web scraping these days, but I know some others do.  So
if those others would chime in with better ways to do things, that is
very welcome.

Happy hacking,

Andy
-- 
http://wingolog.org/
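
The SXPath step in the message can also be tried without the network or
guile-lib, by running a matcher against a hand-written SXML tree.  What
follows is only a sketch under assumptions: sample-page, its stream URL
and its link text are made up for illustration, and only sxpath from
Guile's (sxml xpath) module is taken from the message.  The second
matcher pulls href attribute values, which is probably closer to what a
scraper hunting for stream links would want.

```scheme
(use-modules (sxml xpath))

;; Hypothetical page, shaped like what html->sxml returns:
;; a *TOP* node wrapping the <html> element.
(define sample-page
  '(*TOP*
    (html
     (head (title "Example Radio"))
     (body
      (p "Latest show: "
         (a (@ (href "http://radio.example.org/shows/latest.flv"))
            "listen"))))))

;; Same kind of matcher as in the message: the <title> element.
(define title-matcher (sxpath '(// html head title)))

;; The href attribute values of every <a> element, as strings.
;; In SXML, attributes live in an (@ ...) child of the element.
(define href-matcher (sxpath '(// a @ href *text*)))

(write (title-matcher sample-page))
(newline)
(write (href-matcher sample-page))
(newline)
```

Each matcher returns a list of matching nodes, so the first should give
a one-element list holding the title element and the second a list of
href strings.  On a real page you would pass the-web-page instead of
sample-page.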