From mboxrd@z Thu Jan  1 00:00:00 1970
From: Ricardo Wurmus <rekado@elephly.net>
Subject: Re: Help with sxml simple parser for the quicklisp importer
Date: Wed, 23 Jan 2019 17:41:32 +0100
Message-ID: <87tvhz5nsz.fsf@elephly.net>
References: <1b161633-c285-1401-d771-c965dae58149@riseup.net>
	<874l9z78sc.fsf@elephly.net>
	<a15124a8-1b06-77e7-8d35-4de4cee59afe@riseup.net>
	<87womv5psn.fsf@elephly.net>
	<42ab2c44-3e2f-d2ba-17de-3f73f78b148b@riseup.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
In-reply-to: <42ab2c44-3e2f-d2ba-17de-3f73f78b148b@riseup.net>
List-Id: "Development of GNU Guix and the GNU System distribution."
	<guix-devel.gnu.org>
To: swedebugia <swedebugia@riseup.net>
Cc: guix-devel <guix-devel@gnu.org>


swedebugia <swedebugia@riseup.net> writes:

> On 2019-01-23 16:58, Ricardo Wurmus wrote:
>>
>> swedebugia <swedebugia@riseup.net> writes:
>>
>>>> The second “link” tag opens but is never closed.  This may be valid
>>>> HTML, but it is not valid XML, which is what xml->sxml expects.
>>>
>>> Thanks for the quick answer!
>>> I will try to remove this line before handing it over to the parser.
>>
>> I would recommend looking for a better source of package information.
>> Parsing HTML is not fun and is often brittle.
>
> I understand. Hm. Will try asking the author.
>
> Got a little further. Added this:
>
> (define (sanitize-html html)
>   "Correct an offending invalid line from the html source"
>   (let* ((html1 (regexp-substitute #f (string-match "main.css\">" html)
>                                    'pre "main.css\" />" 'post))
>          (result (regexp-substitute #f (string-match "utf-8\">" html1)
>                                     'pre "utf-8\" />" 'post)))
>     result))

It’s generally a bad idea to use regular expressions on HTML or XML.  Be
careful.
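
To illustrate one way this can bite: string-match returns #f when the
pattern is absent, and regexp-substitute does not accept #f as a match.
Here is a minimal sketch that leaves the input untouched in that case.
The name replace-first is made up for this example, and the patterns
are the ones from your snippet.

```scheme
(use-modules (ice-9 regex))

;; Hypothetical helper: replace the first match of PATTERN in HTML with
;; REPLACEMENT, or return HTML unchanged when PATTERN does not match.
(define (replace-first html pattern replacement)
  (let ((m (string-match pattern html)))
    (if m
        (regexp-substitute #f m 'pre replacement 'post)
        html)))

(define (sanitize-html html)
  "Close the two unclosed tags, if they are present in HTML."
  (replace-first (replace-first html "main.css\">" "main.css\" />")
                 "utf-8\">" "utf-8\" />"))
```

This still rewrites only the first occurrence of each pattern and stops
working as soon as the markup changes slightly (single quotes, extra
attributes), so it is a stopgap at best.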

> sxml/simple.scm:143:4: In procedure loop:
> Throw to key `parser-error' with args `(#<input: string 24fdaf0>
> "[wf-entdeclared] broken for " copy)'.

I guess this is about the &copy; entity, which is defined by HTML but
not by XML.  You may have to tell xml->sxml about these HTML entities.
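
Something along these lines might do it.  This is only a sketch: it
assumes that the xml->sxml in your Guile version accepts the #:entities
keyword (an alist mapping entity names to replacement strings).  The
names %html-entities and html-string->sxml are made up, and the alist
covers just two entities; you would have to add whatever else the page
uses.

```scheme
(use-modules (sxml simple))

;; Hypothetical alist of HTML entities that XML does not predefine.
;; "\u00a9" is the copyright sign, "\u00a0" a no-break space.
(define %html-entities
  '((copy . "\u00a9")
    (nbsp . "\u00a0")))

;; Hypothetical wrapper: parse the string HTML, expanding the entities
;; listed above instead of throwing `parser-error'.
(define (html-string->sxml html)
  (xml->sxml html #:entities %html-entities))

;; Example call with an entity that would otherwise be undeclared.
(html-string->sxml "<p>&copy; 2019</p>")
```

Any entity that is missing from the alist will still trigger the same
parser error, so the list has to match what the page contains.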

--
Ricardo