* Help with sxml simple parser for the quicklisp importer
@ 2019-01-23 12:35 swedebugia
2019-01-23 14:22 ` Ricardo Wurmus
0 siblings, 1 reply; 9+ messages in thread
From: swedebugia @ 2019-01-23 12:35 UTC (permalink / raw)
To: Pierre Neidhardt, guix-devel
[-- Attachment #1: Type: text/plain, Size: 937 bytes --]
Hi
I am trying to extract information from the quickdocs website to fill
into the description.
I got this code:
;; Guess the homepage of the package
(define (homepage name)
(string-append "http://quickdocs.org/" name "/"))
(define (get-homepage name)
"Get the latest meta release file. From the links in this we extract all
other information we need."
(call-with-temporary-output-file
(lambda (temp port)
(and (url-fetch (homepage name) temp)
(xml->sxml (get-string-all port))))))
(display (get-homepage "1am"))
But it errors out with:
sxml/simple.scm:143:4: In procedure loop:
Throw to key `parser-error' with args `(#<input: string 23c45b0>
"[GIMatch] broken for " (END . head) " while expecting " END link)'.
The document has no link problems in the html according to
https://validator.w3.org/nu/?doc=http%3A%2F%2Fquickdocs.org%2F1am%2F
Any ideas how to make it parse?
--
Cheers Swedebugia
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: quicklisp.scm --]
[-- Type: text/x-scheme; name="quicklisp.scm", Size: 15494 bytes --]
;;; GNU Guix --- Functional package management for GNU
;;; Copyright © 2018 Julien Lepiller <julien@lepiller.eu>
;;;
;;; This file is part of GNU Guix.
;;;
;;; GNU Guix is free software; you can redistribute it and/or modify it
;;; under the terms of the GNU General Public License as published by
;;; the Free Software Foundation; either version 3 of the License, or (at
;;; your option) any later version.
;;;
;;; GNU Guix is distributed in the hope that it will be useful, but
;;; WITHOUT ANY WARRANTY; without even the implied warranty of
;;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
;;; GNU General Public License for more details.
;;;
;;; You should have received a copy of the GNU General Public License
;;; along with GNU Guix. If not, see <http://www.gnu.org/licenses/>.
(define-module (guix import quicklisp)
; #:use-module (ice-9 ftw)
#:use-module (ice-9 match)
#:use-module (ice-9 peg)
; #:use-module (ice-9 receive)
#:use-module ((ice-9 rdelim) #:select (read-line))
#:use-module (ice-9 textual-ports)
#:use-module (sxml simple)
#:use-module (sxml xpath)
#:use-module (srfi srfi-1)
#:use-module (srfi srfi-2)
#:use-module (web uri)
; #:use-module (guix http-client)
#:use-module ((guix build download) #:prefix build:)
; #:use-module (guix git)
; #:use-module (guix ui)
#:use-module ((guix utils) #:select (call-with-temporary-output-file))
; #:use-module (guix import utils)
; #:use-module ((guix licenses) #:prefix license:)
#:export (test))
;; Define a PEG parser for the quicklisp format
(define-peg-pattern SP none (or " " "\n"))
(define-peg-pattern NL none "\n")
(define-peg-pattern COLON none ":")
(define-peg-pattern RELEASE none "release/")
(define-peg-pattern HASH none "#")
(define-peg-pattern IGNORE none (peg "(! SP .)*"))
(define-peg-pattern IGNORE-UNTIL-NL none (peg "(! NL .)*"))
(define-peg-pattern text all (+ (or (range #\a #\z) "-")))
(define-peg-pattern text-until-sp all (peg "(! SP .)*"))
(define-peg-pattern text-until-colon all (peg "(! COLON .)*"))
(define-peg-pattern text-until-nl all (peg "(! NL .)*"))
;; Meta and distinfo files use COLON as separator:
(define-peg-pattern test all
text-until-colon)
(define-peg-pattern record all
(and text COLON (* SP) text-until-nl))
(define-peg-pattern records body (* (and record (* SP))))
;; Release index has no colons between the values:
(define-peg-pattern record-index all (and text-until-sp (* SP) text-until-nl))
;; Field no. 2 is tarball url
(define-peg-pattern field-2 all
(and text-until-sp SP text-until-sp IGNORE-UNTIL-NL))
(define-peg-pattern 2s body
(* (and (or header-hash field-2) (* SP))))
;; Field no. 6 is version
(define-peg-pattern version all
(and text-until-sp SP IGNORE SP IGNORE SP IGNORE SP
IGNORE SP text-until-sp IGNORE-UNTIL-NL))
(define-peg-pattern versions body
(* (and (or header-hash version) (* SP))))
;; Field no. 7-N is system files which we need.
(define-peg-pattern system-files all
(and text-until-sp SP IGNORE SP IGNORE SP IGNORE SP
IGNORE SP IGNORE SP text-until-nl))
(define-peg-pattern system-files-list body
(* (and (or header-hash system-files) (* SP))))
;; Release hashfile has no colons between the values:
;; Names contain numbers, get everything exept space.
(define-peg-pattern record-hash all
(and
;; This matches the prefix "release/"
RELEASE
text-until-sp (* SP) text-until-nl))
(define-peg-pattern header-hash all
(and
(and HASH (* SP))
text-until-sp (* SP) text-until-nl))
(define-peg-pattern records-hash body (* (and (or header-hash record-hash) (* SP))))
;; Systems.txt with dependencies
;; This parsing results in a lot of duplicates, but we ignore that because our
;; match just picks the first and returns happy and the overhead is neglible.
;; Field no. 3 is system-name
(define-peg-pattern field-3 all
(and text-until-sp SP IGNORE SP text-until-sp IGNORE-UNTIL-NL))
(define-peg-pattern 3s body
(* (and (or header-hash field-3) (* SP))))
(define-peg-pattern record-sys all
(and
;; We ignore the second and third field for now.
text-until-sp (* SP) IGNORE SP IGNORE SP text-until-nl))
(define-peg-pattern records-sys body (* (and (or header-hash record-sys) (* SP))))
;;; QL= QuickLisp
;;; cl= Common Lisp
(define (url-fetch url file-name)
"Save the contents of URL to FILE-NAME. Return #f on failure."
(parameterize ((current-output-port (current-error-port)))
(build:url-fetch url file-name)))
(define (ql-meta-file)
"Get the latest meta release file. From the links in this we extract all
other information we need."
(call-with-temporary-output-file
(lambda (temp port)
(and (url-fetch "https://beta.quicklisp.org/dist/quicklisp.txt" temp)
(peg:tree (match-pattern test (get-string-all port)))))))
;;(display (metadata-ref (ql-meta-file) "release-index-url"))
(define (ql-latest-index-file)
"Get the latest release index file content. This contains: name tarball-url
file-md5 content-sha1-hash version etc. Space separated."
(let ((latest (metadata-ref (ql-meta-file) "release-index-url")))
(call-with-temporary-output-file
(lambda (temp port)
(and (url-fetch latest temp)
(get-string-all port))))))
(define (ql-distinfo-file)
"Get the latest distinfo file. Colon separated. From this we only get the link to system-index-url."
(let ((latest (metadata-ref (ql-meta-file) "canonical-distinfo-url")))
(call-with-temporary-output-file
(lambda (temp port)
(and (url-fetch latest temp)
(peg:tree (match-pattern records (get-string-all port))))))))
(define (ql-hash-file)
"Get the latest hashfile."
(let* ((uri (string->uri (metadata-ref (ql-meta-file) "release-index-url")))
(host (uri-host uri))
(path (string-drop-right (uri-path uri) 12))
(url (string-append "https://" host path "digests.txt")))
(pk 'url url)
(call-with-temporary-output-file
(lambda (temp port)
(and (url-fetch url temp)
(peg:tree (match-pattern records-hash (get-string-all port))))))))
(define (ql-systems-file)
"Get the latest file with dependency information for each package. Spaceseparated list of dependencies."
(let ((latest (metadata-ref (ql-distinfo-file) "system-index-url")))
(call-with-temporary-output-file
(lambda (temp port)
(and (url-fetch latest temp)
(peg:tree (match-pattern records-sys (get-string-all port))))))))
;;(display (ql-systems-file)) works.
(define (metadata-ref file lookup)
"Lookup metadata for FILE and LOOKUP."
;; (pk 'file file 'lookup lookup)
(fold (lambda (record acc)
(match record
;; Output from parser looks like this:
((record (_ key) (_ val))
;; Find our key
(if (equal? key lookup)
;; return val
val
;; else return acc
acc))))
#f file))
;;(metadata-ref (ql-latest-index-file) "1am")
; returns url of the tarball -
; works!
;;(metadata-ref (ql-hash-file) "1am")
; returns sha256 hash of the tarball -
; works!
;;(metadata-ref (ql-systems-file) "able")
; returns string with dependencies
; space separated - works!
(define (ql-extract field name)
"Helper to read the right field from (ql-latest-index-file). Field is one of
url, system-file, system-name, version. Name is the package name."
(let* ((release (ql-latest-index-file))
(systems (ql-systems-file))
(system-files (peg:tree (match-pattern system-files-list release)))
(system-name (peg:tree (match-pattern 3s systems)))
(url (peg:tree (match-pattern 2s release)))
(version (peg:tree (match-pattern versions release))))
(cond
((equal? field 'version)
(let ((str (metadata-ref version name)))
(if (string-prefix? name str)
;; Drop "name-" from version-string.
(string-drop str (+ 1 (string-length name)))
str)))
;; FIXME these ONLY extract the first match
((equal? field 'system-files)
(metadata-ref system-files name))
((equal? field 'system-name)
(metadata-ref system-name name)))))
;;(display (peg:tree (match-pattern versions (ql-latest-index-file))))
#;
(display
(peg:tree (match-pattern versions (ql-latest-index-file))))
;;(display (ql-extract 'version "1am"))
;;(display (ql-extract 'system-files "1am"))
;;(display (ql-latest-index-file))
;; Guess the homepage of the package
(define (homepage name)
(string-append "http://quickdocs.org/" name "/"))
(define (get-homepage name)
"Get the latest meta release file. From the links in this we extract all
other information we need."
(call-with-temporary-output-file
(lambda (temp port)
(and (url-fetch (homepage name) temp)
(xml->sxml (get-string-all port))))))
(display (get-homepage "1am"))
#;
(define (synopsis name)
;;extract from homepage: section class="readme"
)
;; fetcher from texlive importer:
#;
(define (fetch-sxml name)
"Return an sxml representation of the package information contained in the
XML description of the CTAN package or #f in case of failure."
;; This API always returns the latest release of the module.
(let ((url (string-append "http://www.ctan.org/xml/1.2/pkg/" name)))
(guard (c ((http-get-error? c)
(format (current-error-port)
"error: failed to retrieve package information \
from ~s: ~a (~s)~%"
(uri->string (http-get-error-uri c))
(http-get-error-code c)
(http-get-error-reason c))
#f))
(xml->sxml (http-fetch url)
#:trim-whitespace? #t))))
(define (cl-name->guix-name name)
(cond
;; TODO: Any special cases?
((string-prefix? "cl-" name) name)
(else (string-append "cl-" name))))
;; Native dependency information is not available in QL.
;; (define (dependency->input dependency)
;; (match dependency
;; ))
(define (dependency-list->inputs lst)
(map
(lambda (dependency)
(list dependency (list 'unquote (string->symbol dependency))))
(cl-names->guix-names lst)))
(define (sxml->guix-package name)
(define (sxml-value path)
(match ((sxpath path) sxml)
(() #f)
((val) val)))
(and-let* (
(sxml (get-homepage name))
(version (find-latest-version name))
(file '())
;;(home-page '())
(hash (metadata-ref (ql-hash-file) name))
(source-url (metadata-ref url-dict "src"))
(inputs
(dependency-list->inputs
(string-split (metadata-ref (ql-systems-file) name) " "))))
(call-with-temporary-output-file
(lambda (temp port)
(and (url-fetch source-url temp)
`(package
(name ,(cl-name->guix-name name))
(version ,(metadata-ref opam-content "version"))
(source
(origin
(method url-fetch)
(uri ,source-url)
;; TODO chech hash
(sha256 (base32 ,(guix-hash-url temp)))))
(build-system ocaml-build-system)
,@(if (null? inputs)
'()
`((inputs ,(list 'quasiquote inputs))))
(home-page ,(metadata-ref opam-content "homepage"))
(synopsis ,(metadata-ref opam-content "synopsis"))
(description ,(metadata-ref opam-content "description"))
(license #f)))))))
;; opam stuff:
;; (define (find-latest-version package repository)
;; "Get the latest version of a package as described in the given repository."
;; (let* ((dir (string-append repository "/packages/" package))
;; (versions (scandir dir (lambda (name) (not (string-prefix? "." name))))))
;; (if versions
;; (let ((versions (map
;; (lambda (dir)
;; (string-join (cdr (string-split dir #\.)) "."))
;; versions)))
;; (latest-version versions))
;; (begin
;; (format #t (G_ "Package not found in opam repository: ~a~%") package)
;; #f))))
;; (define (get-metadata opam-file)
;; (with-input-from-file opam-file
;; (lambda _
;; (peg:tree (match-pattern records (get-string-all (current-input-port)))))))
;; (define (ocaml-name->guix-name name)
;; (cond
;; ((equal? name "ocamlfind") "ocaml-findlib")
;; ((string-prefix? "ocaml" name) name)
;; ((string-prefix? "conf-" name) (substring name 5))
;; (else (string-append "ocaml-" name))))
;; (define (metadata-ref file lookup)
;; (pk 'file file 'lookup lookup)
;; (fold (lambda (record acc)
;; (match record
;; ((record key val)
;; (if (equal? key lookup)
;; (match val
;; (('list-pat . stuff) stuff)
;; (('string-pat stuff) stuff)
;; (('multiline-string stuff) stuff)
;; (('dict records ...) records))
;; acc))))
;; #f file))
;; (define (native? condition)
;; (match condition
;; (('condition-var var)
;; (match var
;; ("with-test" #t)
;; ("test" #t)
;; ("build" #t)
;; (_ #f)))
;; ((or ('condition-or cond-left cond-right) ('condition-and cond-left cond-right))
;; (or (native? cond-left)
;; (native? cond-right)))
;; (_ #f)))
;; (define (dependency->input dependency)
;; (match dependency
;; (('string-pat str) str)
;; (('conditional-value val condition)
;; (if (native? condition) "" (dependency->input val)))))
;; (define (dependency->native-input dependency)
;; (match dependency
;; (('string-pat str) "")
;; (('conditional-value val condition)
;; (if (native? condition) (dependency->input val) ""))))
;; (define (ocaml-names->guix-names names)
;; (map ocaml-name->guix-name
;; (remove (lambda (name)
;; (or (equal? "" name))
;; (equal? "ocaml" name))
;; names)))
;; (define (depends->inputs depends)
;; (filter (lambda (name)
;; (and (not (equal? "" name))
;; (not (equal? "ocaml" name))
;; (not (equal? "ocamlfind" name))))
;; (map dependency->input depends)))
;; (define (depends->native-inputs depends)
;; (filter (lambda (name) (not (equal? "" name)))
;; (map dependency->native-input depends)))
;; (define (dependency-list->inputs lst)
;; (map
;; (lambda (dependency)
;; (list dependency (list 'unquote (string->symbol dependency))))
;; (ocaml-names->guix-names lst)))
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Help with sxml simple parser for the quicklisp importer
2019-01-23 12:35 Help with sxml simple parser for the quicklisp importer swedebugia
@ 2019-01-23 14:22 ` Ricardo Wurmus
2019-01-23 16:03 ` swedebugia
0 siblings, 1 reply; 9+ messages in thread
From: Ricardo Wurmus @ 2019-01-23 14:22 UTC (permalink / raw)
To: swedebugia; +Cc: guix-devel
Hi,
> (define (get-homepage name)
> "Get the latest meta release file. From the links in this we extract all
> other information we need."
> (call-with-temporary-output-file
> (lambda (temp port)
> (and (url-fetch (homepage name) temp)
> (xml->sxml (get-string-all port))))))
Aside: you don’t need to use “get-string-all”; “xml->sxml” can read
directly from a port.
> But it errors out with:
>
> sxml/simple.scm:143:4: In procedure loop:
> Throw to key `parser-error' with args `(#<input: string 23c45b0>
> "[GIMatch] broken for " (END . head) " while expecting " END link)'.
I fetched the document. Here’s the part that it barfs on:
--8<---------------cut here---------------start------------->8---
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>1am | Quickdocs</title>
<link rel="stylesheet" type="text/css" href="/css/LigatureSymbols/style.css" />
<link rel="stylesheet" type="text/css" media="screen" href="/css/main.css">
<script type="text/javascript" src="/js/jquery-1.9.1.min.js"></script>
<script type="text/javascript" src="/js/underscore-min.js"></script>
<script type="text/javascript" src="/js/quickdocs.js"></script>
</head>
…
--8<---------------cut here---------------end--------------->8---
The second “link” tag opens but is never closed. This may be valid
HTML, but it is not valid XML, which is what xml->sxml expects.
--
Ricardo
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Help with sxml simple parser for the quicklisp importer
2019-01-23 14:22 ` Ricardo Wurmus
@ 2019-01-23 16:03 ` swedebugia
2019-01-23 15:58 ` Ricardo Wurmus
0 siblings, 1 reply; 9+ messages in thread
From: swedebugia @ 2019-01-23 16:03 UTC (permalink / raw)
To: Ricardo Wurmus; +Cc: guix-devel
On 2019-01-23 15:22, Ricardo Wurmus wrote:
> Hi,
>
>> (define (get-homepage name)
>> "Get the latest meta release file. From the links in this we extract all
>> other information we need."
>> (call-with-temporary-output-file
>> (lambda (temp port)
>> (and (url-fetch (homepage name) temp)
>> (xml->sxml (get-string-all port))))))
>
> Aside: you don’t need to use “get-string-all”; “xml->sxml” can read
> directly from a port.
>
>> But it errors out with:
>>
>> sxml/simple.scm:143:4: In procedure loop:
>> Throw to key `parser-error' with args `(#<input: string 23c45b0>
>> "[GIMatch] broken for " (END . head) " while expecting " END link)'.
>
> I fetched the document. Here’s the part that it barfs on:
>
> --8<---------------cut here---------------start------------->8---
> <!DOCTYPE html>
> <html>
> <head>
> <meta charset="utf-8">
> <title>1am | Quickdocs</title>
> <link rel="stylesheet" type="text/css" href="/css/LigatureSymbols/style.css" />
>
> <link rel="stylesheet" type="text/css" media="screen" href="/css/main.css">
>
> <script type="text/javascript" src="/js/jquery-1.9.1.min.js"></script>
> <script type="text/javascript" src="/js/underscore-min.js"></script>
> <script type="text/javascript" src="/js/quickdocs.js"></script>
> </head>
> …
> --8<---------------cut here---------------end--------------->8---
>
> The second “link” tag opens but is never closed. This may be valid
> HTML, but it is not valid XML, which is what xml->sxml expects.
Thanks for the quick answer!
I will try to remove this line before handling over to the parser.
--
Cheers Swedebugia
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Help with sxml simple parser for the quicklisp importer
2019-01-23 16:03 ` swedebugia
@ 2019-01-23 15:58 ` Ricardo Wurmus
2019-01-23 16:32 ` swedebugia
2019-01-23 16:55 ` Katherine Cox-Buday
0 siblings, 2 replies; 9+ messages in thread
From: Ricardo Wurmus @ 2019-01-23 15:58 UTC (permalink / raw)
To: swedebugia; +Cc: guix-devel
swedebugia <swedebugia@riseup.net> writes:
>> The second “link” tag opens but is never closed. This may be valid
>> HTML, but it is not valid XML, which is what xml->sxml expects.
>
> Thanks for the quick answer!
> I will try to remove this line before handling over to the parser.
I would recommend looking for a better source of package information.
Parsing HTML is not fun and is often brittle.
--
Ricardo
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Help with sxml simple parser for the quicklisp importer
2019-01-23 15:58 ` Ricardo Wurmus
@ 2019-01-23 16:32 ` swedebugia
2019-01-23 16:41 ` Ricardo Wurmus
2019-01-23 16:55 ` Katherine Cox-Buday
1 sibling, 1 reply; 9+ messages in thread
From: swedebugia @ 2019-01-23 16:32 UTC (permalink / raw)
To: Ricardo Wurmus; +Cc: guix-devel
[-- Attachment #1: Type: text/plain, Size: 2352 bytes --]
On 2019-01-23 16:58, Ricardo Wurmus wrote:
>
> swedebugia <swedebugia@riseup.net> writes:
>
>>> The second “link” tag opens but is never closed. This may be valid
>>> HTML, but it is not valid XML, which is what xml->sxml expects.
>>
>> Thanks for the quick answer!
>> I will try to remove this line before handling over to the parser.
>
> I would recommend looking for a better source of package information.
> Parsing HTML is not fun and is often brittle.
I understand. Hm. Will try asking the author.
Got a little further. Added this:
(define (sanitize-html html)
"Correct an offending invalid line from the html source"
(let* ((html1 (regexp-substitute #f (string-match "main.css\">" html)
'pre "main.css\" />" 'post))
(result (regexp-substitute #f (string-match "utf-8\">" html1)
'pre "utf-8\" />" 'post)))
result))
Which results in a new error:
Starting download of /tmp/guix-file.uAoKMD
From http://quickdocs.org/1am/...
1am/ 7KiB 2.0MiB/s 00:00
[##################] 100.0%
Backtrace:
13 (apply-smob/1 #<catch-closure 17a84e0>)
In ice-9/boot-9.scm:
705:2 12 (call-with-prompt _ _ #<procedure default-prompt-handler
(k proc)>)
In ice-9/eval.scm:
619:8 11 (_ #(#(#<directory (guile-user) 18cc140>)))
In ice-9/boot-9.scm:
2312:4 10 (save-module-excursion _)
3831:12 9 (_)
In guix/import/quicklisp.scm:
239:9 8 (_)
In guix/utils.scm:
618:8 7 (call-with-temporary-output-file #<procedure 305f440 at
guix/import/quicklisp.scm:236:3 (temp port)>)
In sxml/simple.scm:
143:4 6 (xml->sxml _ #:namespaces _ #:declare-namespaces? _
#:trim-whitespace? _ #:entities _ #:default-entity-handler _ # _)
143:4 5 (loop #<input: string 24fdaf0> () #f _)
143:4 4 (loop #<input: string 24fdaf0> () #f _)
143:4 3 (loop #<input: string 24fdaf0> () #f _)
143:4 2 (loop #<input: string 24fdaf0> () #f _)
143:4 1 (loop #<input: string 24fdaf0> () #f _)
143:4 0 (loop #<input: string 24fdaf0> () #f _)
sxml/simple.scm:143:4: In procedure loop:
Throw to key `parser-error' with args `(#<input: string 24fdaf0>
"[wf-entdeclared] broken for " copy)'.
Any ideas?
--
Cheers Swedebugia
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: quicklisp.scm --]
[-- Type: text/x-scheme; name="quicklisp.scm", Size: 12474 bytes --]
;;; GNU Guix --- Functional package management for GNU
;;; Copyright © 2018 Julien Lepiller <julien@lepiller.eu>
;;;
;;; This file is part of GNU Guix.
;;;
;;; GNU Guix is free software; you can redistribute it and/or modify it
;;; under the terms of the GNU General Public License as published by
;;; the Free Software Foundation; either version 3 of the License, or (at
;;; your option) any later version.
;;;
;;; GNU Guix is distributed in the hope that it will be useful, but
;;; WITHOUT ANY WARRANTY; without even the implied warranty of
;;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
;;; GNU General Public License for more details.
;;;
;;; You should have received a copy of the GNU General Public License
;;; along with GNU Guix. If not, see <http://www.gnu.org/licenses/>.
(define-module (guix import quicklisp)
; #:use-module (ice-9 ftw)
#:use-module (ice-9 match)
#:use-module (ice-9 peg)
#:use-module (ice-9 regex)
#:use-module ((ice-9 rdelim) #:select (read-line))
#:use-module (ice-9 textual-ports)
#:use-module (sxml simple)
#:use-module (sxml xpath)
#:use-module (srfi srfi-1)
#:use-module (srfi srfi-2)
#:use-module (web uri)
; #:use-module (guix http-client)
#:use-module ((guix build download) #:prefix build:)
#:use-module (guix base32)
; #:use-module (guix ui)
#:use-module ((guix utils) #:select (call-with-temporary-output-file))
; #:use-module (guix import utils)
; #:use-module ((guix licenses) #:prefix license:)
#:export (test))
;; Define a PEG parser for the quicklisp format
(define-peg-pattern SP none (or " " "\n"))
(define-peg-pattern NL none "\n")
(define-peg-pattern COLON none ":")
(define-peg-pattern RELEASE none "release/")
(define-peg-pattern HASH none "#")
(define-peg-pattern IGNORE none (peg "(! SP .)*"))
(define-peg-pattern IGNORE-UNTIL-NL none (peg "(! NL .)*"))
(define-peg-pattern text all (+ (or (range #\a #\z) "-")))
(define-peg-pattern text-until-sp all (peg "(! SP .)*"))
(define-peg-pattern text-until-colon all (peg "(! COLON .)*"))
(define-peg-pattern text-until-nl all (peg "(! NL .)*"))
;; Meta and distinfo files use COLON as separator:
(define-peg-pattern test all
text-until-colon)
(define-peg-pattern record all
(and text COLON (* SP) text-until-nl))
(define-peg-pattern records body (* (and record (* SP))))
;; Release index has no colons between the values:
(define-peg-pattern record-index all (and text-until-sp (* SP) text-until-nl))
;; Field no. 2 is tarball url
(define-peg-pattern field-2 all
(and text-until-sp SP text-until-sp IGNORE-UNTIL-NL))
(define-peg-pattern 2s body
(* (and (or header-hash field-2) (* SP))))
;; Field no. 6 is version
(define-peg-pattern version all
(and text-until-sp SP IGNORE SP IGNORE SP IGNORE SP
IGNORE SP text-until-sp IGNORE-UNTIL-NL))
(define-peg-pattern versions body
(* (and (or header-hash version) (* SP))))
;; Field no. 7-N is system files which we need.
(define-peg-pattern system-files all
(and text-until-sp SP IGNORE SP IGNORE SP IGNORE SP
IGNORE SP IGNORE SP text-until-nl))
(define-peg-pattern system-files-list body
(* (and (or header-hash system-files) (* SP))))
;; Release hashfile has no colons between the values:
;; Names contain numbers, get everything exept space.
(define-peg-pattern record-hash all
(and
;; This matches the prefix "release/"
RELEASE
text-until-sp (* SP) text-until-nl))
(define-peg-pattern header-hash all
(and
(and HASH (* SP))
text-until-sp (* SP) text-until-nl))
(define-peg-pattern records-hash body (* (and (or header-hash record-hash) (* SP))))
;; Systems.txt with dependencies
;; This parsing results in a lot of duplicates, but we ignore that because our
;; match just picks the first and returns happy and the overhead is neglible.
;; Field no. 3 is system-name
(define-peg-pattern field-3 all
(and text-until-sp SP IGNORE SP text-until-sp IGNORE-UNTIL-NL))
(define-peg-pattern 3s body
(* (and (or header-hash field-3) (* SP))))
(define-peg-pattern record-sys all
(and
;; We ignore the second and third field for now.
text-until-sp (* SP) IGNORE SP IGNORE SP text-until-nl))
(define-peg-pattern records-sys body (* (and (or header-hash record-sys) (* SP))))
;;; QL= QuickLisp
;;; cl= Common Lisp
(define (url-fetch url file-name)
"Save the contents of URL to FILE-NAME. Return #f on failure."
(parameterize ((current-output-port (current-error-port)))
(build:url-fetch url file-name)))
(define (ql-meta-file)
"Get the latest meta release file. From the links in this we extract all
other information we need."
(call-with-temporary-output-file
(lambda (temp port)
(and (url-fetch "https://beta.quicklisp.org/dist/quicklisp.txt" temp)
(peg:tree (match-pattern test (get-string-all port)))))))
;;(display (metadata-ref (ql-meta-file) "release-index-url"))
(define (ql-latest-index-file)
"Get the latest release index file content. This contains: name tarball-url
file-md5 content-sha1-hash version etc. Space separated."
(let ((latest (metadata-ref (ql-meta-file) "release-index-url")))
(call-with-temporary-output-file
(lambda (temp port)
(and (url-fetch latest temp)
(get-string-all port))))))
(define (ql-distinfo-file)
"Get the latest distinfo file. Colon separated. From this we only get the link to system-index-url."
(let ((latest (metadata-ref (ql-meta-file) "canonical-distinfo-url")))
(call-with-temporary-output-file
(lambda (temp port)
(and (url-fetch latest temp)
(peg:tree (match-pattern records (get-string-all port))))))))
(define (ql-hash-file)
"Get the latest hashfile."
(let* ((uri (string->uri (metadata-ref (ql-meta-file) "release-index-url")))
(host (uri-host uri))
(path (string-drop-right (uri-path uri) 12))
(url (string-append "https://" host path "digests.txt")))
(pk 'url url)
(call-with-temporary-output-file
(lambda (temp port)
(and (url-fetch url temp)
(peg:tree (match-pattern records-hash (get-string-all port))))))))
(define (ql-systems-file)
"Get the latest file with dependency information for each package. Spaceseparated list of dependencies."
(let ((latest (metadata-ref (ql-distinfo-file) "system-index-url")))
(call-with-temporary-output-file
(lambda (temp port)
(and (url-fetch latest temp)
(peg:tree (match-pattern records-sys (get-string-all port))))))))
;;(display (ql-systems-file)) works.
(define (metadata-ref file lookup)
"Lookup metadata for FILE and LOOKUP."
;; (pk 'file file 'lookup lookup)
(fold (lambda (record acc)
(match record
;; Output from parser looks like this:
((record (_ key) (_ val))
;; Find our key
(if (equal? key lookup)
;; return val
val
;; else return acc
acc))))
#f file))
;;(metadata-ref (ql-latest-index-file) "1am")
; returns url of the tarball -
; works!
;;(metadata-ref (ql-hash-file) "1am")
; returns sha256 hash of the tarball -
; works!
;;(metadata-ref (ql-systems-file) "able")
; returns string with dependencies
; space separated - works!
(define (ql-extract field name)
"Helper to read the right field from (ql-latest-index-file). Field is one of
url, system-file, system-name, version. Name is the package name."
(let* ((release (ql-latest-index-file))
;; Not sure what to do with these yet:
;; (systems (ql-systems-file))
;; (system-files (peg:tree (match-pattern system-files-list release)))
;; (system-name (peg:tree (match-pattern 3s systems)))
(url (peg:tree (match-pattern 2s release)))
(version (peg:tree (match-pattern versions release))))
(cond
;; FIXME these ONLY extract the first match
;; What do we need these for?
;; ((equal? field 'system-files)
;; (metadata-ref system-files name))
;; ((equal? field 'system-name)
;; (metadata-ref system-name name))
((equal? field 'url)
(metadata-ref url name))
((equal? field 'version)
(let ((str (metadata-ref version name)))
(if (string-prefix? name str)
;; Drop "name-" from version-string.
(string-drop str (+ 1 (string-length name)))
str))))))
#;
(begin
(display (ql-extract 'version "1am"))
(display (ql-extract 'url "1am")))
;; Guess the homepage of the package
(define (homepage name)
(string-append "http://quickdocs.org/" name "/"))
(define (sanitize-html html)
"Correct an offending invalid line from the html source"
(let* ((html1 (regexp-substitute #f (string-match "main.css\">" html)
'pre "main.css\" />" 'post))
(result (regexp-substitute #f (string-match "utf-8\">" html1)
'pre "utf-8\" />" 'post)))
result))
(define (get-homepage name)
"Get the latest meta release file. From the links in this we extract all
other information we need."
(call-with-temporary-output-file
(lambda (temp port)
(and (url-fetch (homepage name) temp)
(xml->sxml (sanitize-html (get-string-all port)))))))
(display (get-homepage "1am"))
;; fetcher from texlive importer:
#;
(define (fetch-sxml name)
"Return an sxml representation of the package information contained in the
XML description of the CTAN package or #f in case of failure."
;; This API always returns the latest release of the module.
(let ((url (string-append "http://www.ctan.org/xml/1.2/pkg/" name)))
(guard (c ((http-get-error? c)
(format (current-error-port)
"error: failed to retrieve package information \
from ~s: ~a (~s)~%"
(uri->string (http-get-error-uri c))
(http-get-error-code c)
(http-get-error-reason c))
#f))
(xml->sxml (http-fetch url)
#:trim-whitespace? #t))))
(define (cl-name->guix-name name)
(cond
;; TODO: Any special cases?
((string-prefix? "cl-" name) name)
(else (string-append "cl-" name))))
;; Native dependency information is not available in QL.
;; (define (dependency->input dependency)
;; (match dependency
;; ))
(define (dependency-list->inputs lst)
(map
(lambda (dependency)
(list dependency (list 'unquote (string->symbol dependency))))
(cl-name->guix-name lst)))
(define (sxml->guix-package name)
;; (define (sxml-value path)
;; (match ((sxpath path) sxml)
;; (() #f)
;; ((val) val)))
(and-let* (
;;(sxml (get-homepage name))
(cl-version (ql-extract 'version name))
(hash (base32 (metadata-ref (ql-hash-file) name)))
(source-url (ql-extract 'url name))
(inputs
(dependency-list->inputs
(string-split (metadata-ref (ql-systems-file) name) " "))))
(call-with-temporary-output-file
(lambda (temp port)
(and (url-fetch source-url temp)
`(package
(name ,(cl-name->guix-name name))
(version ,cl-version)
(source
(origin
(method url-fetch)
(uri ,source-url)
;; TODO chech hash
(sha256 (base32 ,(guix-hash-url temp)))))
(build-system asdf-build-system)
,@(if (null? inputs)
'()
`((inputs ,(list 'quasiquote inputs))))
;; (home-page ,(metadata-ref opam-content "homepage"))
;; (synopsis ,(metadata-ref opam-content "synopsis"))
;; (description ,(metadata-ref opam-content "description"))
(license #f)))))))
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Help with sxml simple parser for the quicklisp importer
2019-01-23 16:32 ` swedebugia
@ 2019-01-23 16:41 ` Ricardo Wurmus
2019-01-23 18:58 ` Pierre Neidhardt
0 siblings, 1 reply; 9+ messages in thread
From: Ricardo Wurmus @ 2019-01-23 16:41 UTC (permalink / raw)
To: swedebugia; +Cc: guix-devel
swedebugia <swedebugia@riseup.net> writes:
> On 2019-01-23 16:58, Ricardo Wurmus wrote:
>>
>> swedebugia <swedebugia@riseup.net> writes:
>>
>>>> The second “link” tag opens but is never closed. This may be valid
>>>> HTML, but it is not valid XML, which is what xml->sxml expects.
>>>
>>> Thanks for the quick answer!
>>> I will try to remove this line before handling over to the parser.
>>
>> I would recommend looking for a better source of package information.
>> Parsing HTML is not fun and is often brittle.
>
> I understand. Hm. Will try asking the author.
>
> Got a little further. Added this:
>
> (define (sanitize-html html)
> "Correct an offending invalid line from the html source"
> (let* ((html1 (regexp-substitute #f (string-match "main.css\">" html)
> 'pre "main.css\" />" 'post))
> (result (regexp-substitute #f (string-match "utf-8\">" html1)
> 'pre "utf-8\" />" 'post)))
> result))
It’s generally a bad idea to use regular expressions on HTML or XML. Be
careful.
> sxml/simple.scm:143:4: In procedure loop:
> Throw to key `parser-error' with args `(#<input: string 24fdaf0>
> "[wf-entdeclared] broken for " copy)'.
I guess this is about the © entity. You may have to tell xml->sxml
about these HTML entities.
--
Ricardo
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Help with sxml simple parser for the quicklisp importer
2019-01-23 16:41 ` Ricardo Wurmus
@ 2019-01-23 18:58 ` Pierre Neidhardt
2019-01-23 21:21 ` swedebugia
0 siblings, 1 reply; 9+ messages in thread
From: Pierre Neidhardt @ 2019-01-23 18:58 UTC (permalink / raw)
To: Ricardo Wurmus; +Cc: guix-devel
I'm not sure you need to extract Quickdocs data at all. What Quickdocs does is
all automated I believe, so you could probably do the same it's already doing:
extract the metadata from the project themselves.
Most of the data is contained in the .asd files. I think looking at Quickdocs
code could be most informative, especially considering it's written in Common
Lisp ;)
--
Pierre Neidhardt
https://ambrevar.xyz/
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Help with sxml simple parser for the quicklisp importer
2019-01-23 18:58 ` Pierre Neidhardt
@ 2019-01-23 21:21 ` swedebugia
0 siblings, 0 replies; 9+ messages in thread
From: swedebugia @ 2019-01-23 21:21 UTC (permalink / raw)
To: Pierre Neidhardt, Ricardo Wurmus; +Cc: guix-devel
On 2019-01-23 19:58, Pierre Neidhardt wrote:
>
> I'm not sure you need to extract Quickdocs data at all. What Quickdocs does is
> all automated I believe, so you could probably do the same it's already doing:
> extract the metadata from the project themselves.
>
> Most of the data is contained in the .asd files. I think looking at Quickdocs
> code could be most informative, especially considering it's written in Common
> Lisp ;)
>
You are right. And Ricardo. Quickdocs seems like a dirty hack. The
information we want is in the tgz that we download so I just found out
that we can do the same as with the pypi importer and extract it from
there :)
The .asd files seem uniformly formatted.
--
Cheers
Swedebugia
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Help with sxml simple parser for the quicklisp importer
2019-01-23 15:58 ` Ricardo Wurmus
2019-01-23 16:32 ` swedebugia
@ 2019-01-23 16:55 ` Katherine Cox-Buday
1 sibling, 0 replies; 9+ messages in thread
From: Katherine Cox-Buday @ 2019-01-23 16:55 UTC (permalink / raw)
To: Ricardo Wurmus; +Cc: guix-devel
Ricardo Wurmus <rekado@elephly.net> writes:
> swedebugia <swedebugia@riseup.net> writes:
>
>>> The second “link” tag opens but is never closed. This may be valid
>>> HTML, but it is not valid XML, which is what xml->sxml expects.
>>
>> Thanks for the quick answer!
>> I will try to remove this line before handling over to the parser.
>
> I would recommend looking for a better source of package information.
> Parsing HTML is not fun and is often brittle.
The package information in quickdocs is accessed[1] via the API of
whatever is hosting the sourcecode. We could try doing the same.
Alternatively, it is good practice for CL systems defined in .asd files
to contain a `:description`, and even a `:long-description` field. We could
take the stance that package information simply comes from there as
technically this is the actual package's (i.e. system's) description.
And as CL is a lisp, it should be relatively easy to parse this out. The
only caveat is that I think it's possible for these fields to contain
sexps which read in other files, in which case we should do the same.
I hope this helps.
[1] - https://github.com/quickdocs/quickdocs-updater/blob/a64a41df9e5f1a3721ab68f9f02189ecbb54513b/src/repos.lisp
--
Katherine
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2019-01-23 21:21 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2019-01-23 12:35 Help with sxml simple parser for the quicklisp importer swedebugia
2019-01-23 14:22 ` Ricardo Wurmus
2019-01-23 16:03 ` swedebugia
2019-01-23 15:58 ` Ricardo Wurmus
2019-01-23 16:32 ` swedebugia
2019-01-23 16:41 ` Ricardo Wurmus
2019-01-23 18:58 ` Pierre Neidhardt
2019-01-23 21:21 ` swedebugia
2019-01-23 16:55 ` Katherine Cox-Buday
Code repositories for project(s) associated with this public inbox
https://git.savannah.gnu.org/cgit/guix.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).