From: Panicz Maciej Godek
Newsgroups: gmane.lisp.guile.user
Subject: Re: Permissive html parser for guile
Date: Wed, 23 Jan 2019 22:04:23 +0100
To: swedebugia
Cc: Ricardo Wurmus, Guile User
In-Reply-To: <656912ae-c706-5a12-dee7-f0c0e581bdb1@riseup.net>

I believe that the canonical way of working with XML documents in Guile
is through the (sxml simple) module (and others):

https://www.gnu.org/software/guile/manual/html_node/SXML.html

It contains the xml->sxml function, which converts XML strings to a more
familiar s-expression-based format.

On Wed, 23 Jan 2019 at 17:41, swedebugia wrote:

> I just found this LGPL3 parser by Neil Van Dyke (see attachment)
>
> Do we have something similar in guile?
>
> If not is anybody interested in porting it? (I have no idea how much
> work it would be, but Racket seems quite close to guile)
>
> Here is the introduction:
> "The html-parsing library provides a permissive HTML parser. The parser
> is useful for software agent extraction of information from Web pages,
> for programmatically transforming HTML files, and for implementing
> interactive Web browsers. html-parsing emits SXML/xexp, so that
> conventional HTML may be processed with XML tools such as SXPath. Like
> Oleg Kiselyov's SSAX-based HTML parser, html-parsing provides a
> permissive tokenizer, but html-parsing extends this by attempting to
> recover syntactic structure.
> The html-parsing parsing behavior is permissive in that it accepts
> erroneous HTML, handling several classes of HTML syntax errors
> gracefully, without yielding a parse error. This is crucial for parsing
> arbitrary real-world Web pages, since many pages actually contain syntax
> errors that would defeat a strict or validating parser. html-parsing's
> handling of errors is intended to generally emulate popular Web
> browsers' interpretation of the structure of erroneous HTML."
> https://docs.racket-lang.org/html-parsing/index.html
>
> --
> Cheers
> Swedebugia
>
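
For illustration, here is a minimal sketch of the xml->sxml approach
mentioned above. It assumes a reasonably recent Guile (2.2 or later),
where xml->sxml accepts a string argument. The sample markup and the
try-parse helper are made up for this example. Since xml->sxml is an XML
parser, I would expect it to reject typical tag-soup HTML rather than
recover from it, so the second call only reports whether parsing
succeeded.

```scheme
(use-modules (sxml simple))

;; Well-formed input: parsed into an SXML tree rooted at *TOP*.
(define doc (xml->sxml "<p>Hello, <b>world</b>!</p>"))
(write doc)
(newline)

;; try-parse is a hypothetical helper (not part of Guile): it returns
;; the SXML tree, or #f if the parser signals any error.
(define (try-parse str)
  (catch #t
    (lambda () (xml->sxml str))
    (lambda (key . args) #f)))

;; An unclosed <br>, as commonly found in real-world HTML.
(write (try-parse "<p>line one<br>line two</p>"))
(newline)
```

If the second call prints #f, that is the gap a permissive parser such
as html-parsing is meant to fill.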