unofficial mirror of guile-user@gnu.org 
From: Panicz Maciej Godek <godek.maciek@gmail.com>
To: swedebugia <swedebugia@riseup.net>
Cc: Ricardo Wurmus <rekado@elephly.net>, Guile User <guile-user@gnu.org>
Subject: Re: Permissive html parser for guile
Date: Wed, 23 Jan 2019 22:04:23 +0100
Message-ID: <CAMFYt2aWwVmDbyNXj-DHdvvEPexqh81Nxs88F_yXiTRPi8g0sA@mail.gmail.com>
In-Reply-To: <656912ae-c706-5a12-dee7-f0c0e581bdb1@riseup.net>

I believe that the canonical way of working with XML documents in Guile is
through the (sxml simple) module (and others):
https://www.gnu.org/software/guile/manual/html_node/SXML.html

It provides the xml->sxml procedure, which converts XML strings into a
more familiar s-expression-based format.
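
As a minimal sketch of what that looks like (the sample markup and the
names `doc` and `bold-text` are made up for illustration), the snippet
below parses a small well-formed fragment with xml->sxml and then
queries it with sxpath from the (sxml xpath) module:

```scheme
(use-modules (sxml simple)   ; xml->sxml
             (sxml xpath))   ; sxpath

;; Parse a well-formed XML string into an SXML tree.
;; The input string is an illustrative example, not from the thread.
(define doc
  (xml->sxml "<p class=\"intro\">Hello, <b>Guile</b>!</p>"))

;; doc is now:
;;   (*TOP* (p (@ (class "intro")) "Hello, " (b "Guile") "!"))

;; Select the text inside every <b> child of the top-level <p>.
(define bold-text
  ((sxpath '(p b *text*)) doc))

;; bold-text is now: ("Guile")
```

Note that xml->sxml is an XML parser, so it expects well-formed input;
typical real-world HTML (unclosed tags and the like) will most likely
make it signal a parse error rather than recover.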

On Wed, 23 Jan 2019 at 17:41, swedebugia <swedebugia@riseup.net> wrote:

> I just found this LGPL3 parser by Neil Van Dyke (see attachment)
>
> Do we have something similar in Guile?
>
> If not, is anybody interested in porting it? (I have no idea how much
> work it would be, but Racket seems quite close to Guile.)
>
> Here is the introduction:
> "The html-parsing library provides a permissive HTML parser. The parser
> is useful for software agent extraction of information from Web pages,
> for programmatically transforming HTML files, and for implementing
> interactive Web browsers. html-parsing emits SXML/xexp, so that
> conventional HTML may be processed with XML tools such as SXPath. Like
> Oleg Kiselyov’s SSAX-based HTML parser, html-parsing provides a
> permissive tokenizer, but html-parsing extends this by attempting to
> recover syntactic structure.
> The html-parsing parsing behavior is permissive in that it accepts
> erroneous HTML, handling several classes of HTML syntax errors
> gracefully, without yielding a parse error. This is crucial for parsing
> arbitrary real-world Web pages, since many pages actually contain syntax
> errors that would defeat a strict or validating parser. html-parsing’s
> handling of errors is intended to generally emulate popular Web
> browsers’ interpretation of the structure of erroneous HTML."
> https://docs.racket-lang.org/html-parsing/index.html
>
> --
> Cheers Swedebugia
>

