From: Ricardo Wurmus
Subject: Re: Help with sxml simple parser for the quicklisp importer
Date: Wed, 23 Jan 2019 17:41:32 +0100
Message-ID: <87tvhz5nsz.fsf@elephly.net>
References: <1b161633-c285-1401-d771-c965dae58149@riseup.net> <874l9z78sc.fsf@elephly.net> <87womv5psn.fsf@elephly.net> <42ab2c44-3e2f-d2ba-17de-3f73f78b148b@riseup.net>
In-reply-to: <42ab2c44-3e2f-d2ba-17de-3f73f78b148b@riseup.net>
To: swedebugia
Cc: guix-devel

swedebugia writes:

> On 2019-01-23 16:58, Ricardo Wurmus wrote:
>>
>> swedebugia writes:
>>
>>>> The second “link” tag opens but is never closed. This may be valid
>>>> HTML, but it is not valid XML, which is what xml->sxml expects.
>>>
>>> Thanks for the quick answer!
>>> I will try to remove this line before handing over to the parser.
>>
>> I would recommend looking for a better source of package information.
>> Parsing HTML is not fun and is often brittle.
>
> I understand. Hm. Will try asking the author.
>
> Got a little further.
> Added this:
>
> (define (sanitize-html html)
>   "Correct an offending invalid line from the html source"
>   (let* ((html1 (regexp-substitute #f (string-match "main.css\">" html)
>                                    'pre "main.css\" />" 'post))
>          (result (regexp-substitute #f (string-match "utf-8\">" html1)
>                                     'pre "utf-8\" />" 'post)))
>     result))

It’s generally a bad idea to use regular expressions on HTML or XML.
Be careful.

> sxml/simple.scm:143:4: In procedure loop:
> Throw to key `parser-error' with args `(#
> "[wf-entdeclared] broken for " copy)'.

I guess this is about the &copy; entity. You may have to tell xml->sxml
about these HTML entities.

--
Ricardo