From mboxrd@z Thu Jan 1 00:00:00 1970 From: swedebugia Subject: Re: Help with sxml simple parser for the quicklisp importer Date: Wed, 23 Jan 2019 17:32:20 +0100 Message-ID: <42ab2c44-3e2f-d2ba-17de-3f73f78b148b@riseup.net> References: <1b161633-c285-1401-d771-c965dae58149@riseup.net> <874l9z78sc.fsf@elephly.net> <87womv5psn.fsf@elephly.net> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="------------0F02628B6C34B44CB23C0B81" Return-path: Received: from eggs.gnu.org ([209.51.188.92]:50941) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1gmLLF-0001m9-5y for guix-devel@gnu.org; Wed, 23 Jan 2019 11:25:32 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1gmLL7-0005Un-FI for guix-devel@gnu.org; Wed, 23 Jan 2019 11:25:29 -0500 Received: from mx1.riseup.net ([198.252.153.129]:51668) by eggs.gnu.org with esmtps (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32) (Exim 4.71) (envelope-from ) id 1gmLL3-0005NA-Lv for guix-devel@gnu.org; Wed, 23 Jan 2019 11:25:19 -0500 In-Reply-To: <87womv5psn.fsf@elephly.net> Content-Language: en-US List-Id: "Development of GNU Guix and the GNU System distribution." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: guix-devel-bounces+gcggd-guix-devel=m.gmane.org@gnu.org Sender: "Guix-devel" To: Ricardo Wurmus Cc: guix-devel This is a multi-part message in MIME format. --------------0F02628B6C34B44CB23C0B81 Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: quoted-printable On 2019-01-23 16:58, Ricardo Wurmus wrote: >=20 > swedebugia writes: >=20 >>> The second =E2=80=9Clink=E2=80=9D tag opens but is never closed. Thi= s may be valid >>> HTML, but it is not valid XML, which is what xml->sxml expects. >> >> Thanks for the quick answer! >> I will try to remove this line before handling over to the parser. >=20 > I would recommend looking for a better source of package information. > Parsing HTML is not fun and is often brittle. I understand. Hm. Will try asking the author. Got a little further. Added this: (define (sanitize-html html) "Correct an offending invalid line from the html source" (let* ((html1 (regexp-substitute #f (string-match "main.css\">" html) 'pre "main.css\" />" 'post)) (result (regexp-substitute #f (string-match "utf-8\">" html1) 'pre "utf-8\" />" 'post))) result)) Which results in a new error: Starting download of /tmp/guix-file.uAoKMD From http://quickdocs.org/1am/... 1am/ 7KiB 2.0MiB/s 00:00=20 [##################] 100.0% Backtrace: 13 (apply-smob/1 #) In ice-9/boot-9.scm: 705:2 12 (call-with-prompt _ _ #) In ice-9/eval.scm: 619:8 11 (_ #(#(#))) In ice-9/boot-9.scm: 2312:4 10 (save-module-excursion _) 3831:12 9 (_) In guix/import/quicklisp.scm: 239:9 8 (_) In guix/utils.scm: 618:8 7 (call-with-temporary-output-file #) In sxml/simple.scm: 143:4 6 (xml->sxml _ #:namespaces _ #:declare-namespaces? _=20 #:trim-whitespace? _ #:entities _ #:default-entity-handler _ # _) 143:4 5 (loop # () #f _) 143:4 4 (loop # () #f _) 143:4 3 (loop # () #f _) 143:4 2 (loop # () #f _) 143:4 1 (loop # () #f _) 143:4 0 (loop # () #f _) sxml/simple.scm:143:4: In procedure loop: Throw to key `parser-error' with args `(#=20 "[wf-entdeclared] broken for " copy)'. Any ideas? --=20 Cheers Swedebugia --------------0F02628B6C34B44CB23C0B81 Content-Type: text/x-scheme; name="quicklisp.scm" Content-Disposition: attachment; filename="quicklisp.scm" Content-Transfer-Encoding: quoted-printable ;;; GNU Guix --- Functional package management for GNU ;;; Copyright =C2=A9 2018 Julien Lepiller ;;; ;;; This file is part of GNU Guix. ;;; ;;; GNU Guix is free software; you can redistribute it and/or modify it ;;; under the terms of the GNU General Public License as published by ;;; the Free Software Foundation; either version 3 of the License, or (at ;;; your option) any later version. ;;; ;;; GNU Guix is distributed in the hope that it will be useful, but ;;; WITHOUT ANY WARRANTY; without even the implied warranty of ;;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the ;;; GNU General Public License for more details. ;;; ;;; You should have received a copy of the GNU General Public License ;;; along with GNU Guix. If not, see . (define-module (guix import quicklisp) ; #:use-module (ice-9 ftw) #:use-module (ice-9 match) #:use-module (ice-9 peg) #:use-module (ice-9 regex) #:use-module ((ice-9 rdelim) #:select (read-line)) #:use-module (ice-9 textual-ports) #:use-module (sxml simple) #:use-module (sxml xpath) #:use-module (srfi srfi-1) #:use-module (srfi srfi-2) #:use-module (web uri) ; #:use-module (guix http-client) #:use-module ((guix build download) #:prefix build:) #:use-module (guix base32) ; #:use-module (guix ui) #:use-module ((guix utils) #:select (call-with-temporary-output-file)) ; #:use-module (guix import utils) ; #:use-module ((guix licenses) #:prefix license:) #:export (test)) ;; Define a PEG parser for the quicklisp format (define-peg-pattern SP none (or " " "\n")) (define-peg-pattern NL none "\n") (define-peg-pattern COLON none ":") (define-peg-pattern RELEASE none "release/") (define-peg-pattern HASH none "#") (define-peg-pattern IGNORE none (peg "(! SP .)*")) (define-peg-pattern IGNORE-UNTIL-NL none (peg "(! NL .)*")) (define-peg-pattern text all (+ (or (range #\a #\z) "-"))) (define-peg-pattern text-until-sp all (peg "(! SP .)*")) (define-peg-pattern text-until-colon all (peg "(! COLON .)*")) (define-peg-pattern text-until-nl all (peg "(! NL .)*")) ;; Meta and distinfo files use COLON as separator: (define-peg-pattern test all text-until-colon) (define-peg-pattern record all (and text COLON (* SP) text-until-nl)) (define-peg-pattern records body (* (and record (* SP)))) ;; Release index has no colons between the values: (define-peg-pattern record-index all (and text-until-sp (* SP) text-until= -nl)) ;; Field no. 2 is tarball url (define-peg-pattern field-2 all (and text-until-sp SP text-until-sp IGNORE-UNTIL-NL)) (define-peg-pattern 2s body (* (and (or header-hash field-2) (* SP)))) ;; Field no. 6 is version (define-peg-pattern version all (and text-until-sp SP IGNORE SP IGNORE SP IGNORE SP IGNORE SP text-until-sp IGNORE-UNTIL-NL)) (define-peg-pattern versions body (* (and (or header-hash version) (* SP)))) ;; Field no. 7-N is system files which we need. (define-peg-pattern system-files all (and text-until-sp SP IGNORE SP IGNORE SP IGNORE SP IGNORE SP IGNORE SP text-until-nl)) (define-peg-pattern system-files-list body (* (and (or header-hash system-files) (* SP)))) ;; Release hashfile has no colons between the values: ;; Names contain numbers, get everything exept space. (define-peg-pattern record-hash all (and ;; This matches the prefix "release/" RELEASE text-until-sp (* SP) text-until-nl)) (define-peg-pattern header-hash all (and (and HASH (* SP)) text-until-sp (* SP) text-until-nl)) (define-peg-pattern records-hash body (* (and (or header-hash record-hash= ) (* SP)))) ;; Systems.txt with dependencies ;; This parsing results in a lot of duplicates, but we ignore that becaus= e our ;; match just picks the first and returns happy and the overhead is negli= ble. ;; Field no. 3 is system-name (define-peg-pattern field-3 all (and text-until-sp SP IGNORE SP text-until-sp IGNORE-UNTIL-NL)) (define-peg-pattern 3s body (* (and (or header-hash field-3) (* SP)))) (define-peg-pattern record-sys all (and ;; We ignore the second and third field for now. text-until-sp (* SP) IGNORE SP IGNORE SP text-until-nl)) (define-peg-pattern records-sys body (* (and (or header-hash record-sys) = (* SP)))) ;;; QL=3D QuickLisp ;;; cl=3D Common Lisp (define (url-fetch url file-name) "Save the contents of URL to FILE-NAME. Return #f on failure." (parameterize ((current-output-port (current-error-port))) (build:url-fetch url file-name))) (define (ql-meta-file) "Get the latest meta release file. From the links in this we extract al= l other information we need." (call-with-temporary-output-file (lambda (temp port) (and (url-fetch "https://beta.quicklisp.org/dist/quicklisp.txt" temp= ) (peg:tree (match-pattern test (get-string-all port))))))) ;;(display (metadata-ref (ql-meta-file) "release-index-url")) (define (ql-latest-index-file) "Get the latest release index file content. This contains: name tarball= -url file-md5 content-sha1-hash version etc. Space separated." (let ((latest (metadata-ref (ql-meta-file) "release-index-url"))) (call-with-temporary-output-file (lambda (temp port) (and (url-fetch latest temp) (get-string-all port)))))) (define (ql-distinfo-file) "Get the latest distinfo file. Colon separated. From this we only get t= he link to system-index-url." (let ((latest (metadata-ref (ql-meta-file) "canonical-distinfo-url"))) (call-with-temporary-output-file (lambda (temp port) (and (url-fetch latest temp) (peg:tree (match-pattern records (get-string-all port)))))))) (define (ql-hash-file) "Get the latest hashfile." (let* ((uri (string->uri (metadata-ref (ql-meta-file) "release-index-ur= l"))) (host (uri-host uri)) (path (string-drop-right (uri-path uri) 12)) (url (string-append "https://" host path "digests.txt"))) (pk 'url url) (call-with-temporary-output-file (lambda (temp port) (and (url-fetch url temp) (peg:tree (match-pattern records-hash (get-string-all port)))= ))))) (define (ql-systems-file) "Get the latest file with dependency information for each package. Spac= eseparated list of dependencies." (let ((latest (metadata-ref (ql-distinfo-file) "system-index-url"))) (call-with-temporary-output-file (lambda (temp port) (and (url-fetch latest temp) (peg:tree (match-pattern records-sys (get-string-all port))))= )))) ;;(display (ql-systems-file)) works. (define (metadata-ref file lookup) "Lookup metadata for FILE and LOOKUP." ;; (pk 'file file 'lookup lookup) (fold (lambda (record acc) (match record ;; Output from parser looks like this: ((record (_ key) (_ val)) ;; Find our key (if (equal? key lookup) ;; return val val ;; else return acc acc)))) #f file)) ;;(metadata-ref (ql-latest-index-file) "1am") ; returns url of the tarball - ; works! ;;(metadata-ref (ql-hash-file) "1am") ; returns sha256 hash of the tarb= all - ; works! ;;(metadata-ref (ql-systems-file) "able") ; returns string with dependencie= s ; space separated - works! (define (ql-extract field name) "Helper to read the right field from (ql-latest-index-file). Field is o= ne of url, system-file, system-name, version. Name is the package name." (let* ((release (ql-latest-index-file)) ;; Not sure what to do with these yet: ;; (systems (ql-systems-file)) ;; (system-files (peg:tree (match-pattern system-files-list rele= ase))) ;; (system-name (peg:tree (match-pattern 3s systems))) (url (peg:tree (match-pattern 2s release))) (version (peg:tree (match-pattern versions release)))) (cond ;; FIXME these ONLY extract the first match ;; What do we need these for? ;; ((equal? field 'system-files) ;; (metadata-ref system-files name)) ;; ((equal? field 'system-name) ;; (metadata-ref system-name name)) ((equal? field 'url) (metadata-ref url name)) ((equal? field 'version) (let ((str (metadata-ref version name))) (if (string-prefix? name str) ;; Drop "name-" from version-string. (string-drop str (+ 1 (string-length name))) str)))))) #; (begin (display (ql-extract 'version "1am")) (display (ql-extract 'url "1am"))) ;; Guess the homepage of the package (define (homepage name) (string-append "http://quickdocs.org/" name "/")) (define (sanitize-html html) "Correct an offending invalid line from the html source" (let* ((html1 (regexp-substitute #f (string-match "main.css\">" html) 'pre "main.css\" />" 'post)) (result (regexp-substitute #f (string-match "utf-8\">" html1) 'pre "utf-8\" />" 'post))) result)) (define (get-homepage name) "Get the latest meta release file. From the links in this we extract al= l other information we need." (call-with-temporary-output-file (lambda (temp port) (and (url-fetch (homepage name) temp) (xml->sxml (sanitize-html (get-string-all port))))))) (display (get-homepage "1am")) ;; fetcher from texlive importer: #; (define (fetch-sxml name) "Return an sxml representation of the package information contained in = the XML description of the CTAN package or #f in case of failure." ;; This API always returns the latest release of the module. (let ((url (string-append "http://www.ctan.org/xml/1.2/pkg/" name))) (guard (c ((http-get-error? c) (format (current-error-port) "error: failed to retrieve package information \ from ~s: ~a (~s)~%" (uri->string (http-get-error-uri c)) (http-get-error-code c) (http-get-error-reason c)) #f)) (xml->sxml (http-fetch url) #:trim-whitespace? #t)))) (define (cl-name->guix-name name) (cond ;; TODO: Any special cases? ((string-prefix? "cl-" name) name) (else (string-append "cl-" name)))) ;; Native dependency information is not available in QL. ;; (define (dependency->input dependency) ;; (match dependency ;; )) (define (dependency-list->inputs lst) (map (lambda (dependency) (list dependency (list 'unquote (string->symbol dependency)))) (cl-name->guix-name lst))) (define (sxml->guix-package name) ;; (define (sxml-value path) ;; (match ((sxpath path) sxml) ;; (() #f) ;; ((val) val))) (and-let* ( ;;(sxml (get-homepage name)) (cl-version (ql-extract 'version name)) (hash (base32 (metadata-ref (ql-hash-file) name))) (source-url (ql-extract 'url name)) (inputs (dependency-list->inputs (string-split (metadata-ref (ql-systems-file) name) " ")))= ) (call-with-temporary-output-file (lambda (temp port) (and (url-fetch source-url temp) `(package (name ,(cl-name->guix-name name)) (version ,cl-version) (source (origin (method url-fetch) (uri ,source-url) ;; TODO chech hash (sha256 (base32 ,(guix-hash-url temp))))) (build-system asdf-build-system) ,@(if (null? inputs) '() `((inputs ,(list 'quasiquote inputs)))) ;; (home-page ,(metadata-ref opam-content "homepage")= ) ;; (synopsis ,(metadata-ref opam-content "synopsis")) ;; (description ,(metadata-ref opam-content "descript= ion")) (license #f))))))) --------------0F02628B6C34B44CB23C0B81--