* Re: Accuracy of importers?
2021-10-28 7:02 Accuracy of importers? Ludovic Courtès
@ 2021-10-28 8:17 ` Lars-Dominik Braun
2021-10-28 8:54 ` Ludovic Courtès
2021-10-29 21:57 ` Ludovic Courtès
2021-10-28 9:06 ` zimoun
` (5 subsequent siblings)
6 siblings, 2 replies; 16+ messages in thread
From: Lars-Dominik Braun @ 2021-10-28 8:17 UTC (permalink / raw)
To: Ludovic Courtès; +Cc: Guix Devel
Hi Ludo’,
> My understanding is that most of them require manual intervention—i.e.,
> one has to tweak what ‘guix import’ produces, even if we ignore
> synopsis/description/license, to set the right inputs, etc. If we were
> to estimate the fraction of imported packages for which manual changes
> are needed, what would it look like?
>
> importer fraction of imported packages needing changes
> pypi 50% (some miss source distro, “sdist”; some have
> non-Python deps)
that seems right, although the most common modification I do nowadays
is replacing 'check with a pytest phase.
> hackage ?
> stackage (Lars?)
I’ve mostly used the updater, not the importer, so I can’t say a
number unfortunately.
> cran 5% (Ricardo? Simon? seems to almost always work?)
In my experience the number of interventions here goes towards zero
actually, except for description. It’s pretty good :)
> npm (WIP) (Jelle? Timothy?)
Maybe 5%? But the imported packages do not build anything and don’t
run tests either, so chances for failure are pretty low.
Would it be possible to just run the importer again for existing packages
and compare the result (minus synopsis/description) with what’s
available in Guix? That should give you much more accurate numbers than
our guesswork.
Cheers,
Lars
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Accuracy of importers?
2021-10-28 8:17 ` Lars-Dominik Braun
@ 2021-10-28 8:54 ` Ludovic Courtès
2021-10-28 10:06 ` Lars-Dominik Braun
2021-10-29 21:57 ` Ludovic Courtès
1 sibling, 1 reply; 16+ messages in thread
From: Ludovic Courtès @ 2021-10-28 8:54 UTC (permalink / raw)
To: Lars-Dominik Braun; +Cc: Guix Devel
Hi!
Lars-Dominik Braun <lars@6xq.net> wrote:
>> My understanding is that most of them require manual intervention—i.e.,
>> one has to tweak what ‘guix import’ produces, even if we ignore
>> synopsis/description/license, to set the right inputs, etc. If we were
>> to estimate the fraction of imported packages for which manual changes
>> are needed, what would it look like?
>>
>> importer fraction of imported packages needing changes
>> pypi 50% (some miss source distro, “sdist”; some have
>> non-Python deps)
> that seems right, although the most common modification I do nowadays
> is replacing 'check with a pytest phase.
Right. PyPI/setup.py/.whl doesn’t contain info as to how to run tests,
right?
>> hackage ?
>> stackage (Lars?)
> I’ve mostly used the updater, not the importer, so I can’t say a
> number unfortunately.
Did the updater suggest input changes?
>> cran 5% (Ricardo? Simon? seems to almost always work?)
> In my experience the number of interventions here goes towards zero
> actually, except for description. It’s pretty good :)
Yay!
>> npm (WIP) (Jelle? Timothy?)
> Maybe 5%? But the imported packages do not build anything and don’t
> run tests either, so chances for failure are pretty low.
Yeah.
> Would it be possible to just run the importer again for existing packages
> and compare the result (minus synopsis/description) with what’s
> available in Guix? That should give you much more accurate numbers than
> our guesswork.
That’s a good idea. I can try and do that on a sample of packages.
Thanks!
Ludo’.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Accuracy of importers?
2021-10-28 8:54 ` Ludovic Courtès
@ 2021-10-28 10:06 ` Lars-Dominik Braun
0 siblings, 0 replies; 16+ messages in thread
From: Lars-Dominik Braun @ 2021-10-28 10:06 UTC (permalink / raw)
To: Ludovic Courtès; +Cc: Guix Devel
Hi Ludo’,
> Right. PyPI/setup.py/.whl doesn’t contain info as to how to run tests,
> right?
technically setup.py has a standard test target, but it’s been
deprecated for years and it must be enabled manually by the project. I’m
not aware of any standard pyproject.toml approach to this. It might be
possible to parse tox.ini.
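To illustrate the tox.ini idea, here is a minimal sketch in plain Guile.
It is not part of Guix: ‘section-header’ and ‘tox-test-commands’ are
made-up names, and it assumes the simple case of a ‘commands’ key in the
[testenv] section, with no variable substitution or per-environment
sections:

```scheme
;; Hypothetical helper, not part of Guix: extract the test commands listed
;; under 'commands' in the [testenv] section of a tox.ini given as a string.
(use-modules (ice-9 match))

(define (section-header line)
  "Return the section name if LINE looks like \"[name]\", else #f."
  (and (string-prefix? "[" line)
       (string-suffix? "]" line)
       (substring line 1 (- (string-length line) 1))))

(define (tox-test-commands text)
  "Return the list of command strings found under 'commands' in [testenv]."
  (let loop ((lines (string-split text #\newline))
             (section #f)          ;name of the current section
             (in-commands? #f)     ;inside a 'commands =' value?
             (result '()))
    (match lines
      (() (reverse result))
      ((line . rest)
       (let ((trimmed (string-trim-both line)))
         (cond ((section-header trimmed)
                => (lambda (name) (loop rest name #f result)))
               ((or (not (equal? section "testenv"))
                    (string-null? trimmed)
                    (string-prefix? "#" trimmed))
                ;; Other section, blank line, or comment: skip.
                (loop rest section in-commands? result))
               ((and in-commands? (char-whitespace? (string-ref line 0)))
                ;; Indented lines continue the 'commands' value.
                (loop rest section #t (cons trimmed result)))
               ((string-index trimmed #\=)
                => (lambda (index)
                     (let ((key (string-trim-both
                                 (string-take trimmed index)))
                           (value (string-trim-both
                                   (string-drop trimmed (+ index 1)))))
                       (if (string=? key "commands")
                           (loop rest section #t
                                 (if (string-null? value)
                                     result
                                     (cons value result)))
                           (loop rest section #f result)))))
               (else (loop rest section #f result))))))))

;; Made-up tox.ini used only for this example.
(define sample
  "[tox]\nenvlist = py39\n\n[testenv]\ndeps =\n    pytest>=6\ncommands =\n    pytest -vv tests\n    flake8 src\n")

(tox-test-commands sample)
```

An importer could turn such a list into a replacement 'check phase, but
whether the commands are usable as-is would still need a human look.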
> >> hackage ?
> >> stackage (Lars?)
> > I’ve mostly used the updater, not the importer, so I can’t say a
> > number unfortunately.
> Did the updater suggest input changes?
yes, I added it in 127828ddd74fc950c0403ca58a6f650355e3d67d, but it
cannot update #:cabal-revision, which is a common source of errors. It
would be nice if updaters could just return an entirely new package and
the generic updater code would modify/merge the existing package
definition as needed.
Cheers,
Lars
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Accuracy of importers?
2021-10-28 8:17 ` Lars-Dominik Braun
2021-10-28 8:54 ` Ludovic Courtès
@ 2021-10-29 21:57 ` Ludovic Courtès
2021-10-30 15:49 ` zimoun
1 sibling, 1 reply; 16+ messages in thread
From: Ludovic Courtès @ 2021-10-29 21:57 UTC (permalink / raw)
To: Guix Devel
[-- Attachment #1: Type: text/plain, Size: 1742 bytes --]
Hello!
Thanks everyone for your feedback!
Lars-Dominik Braun <lars@6xq.net> wrote:
> Would it be possible to just run the importer again for existing packages
> and compare the result (minus synopsis/description) with what’s
> available in Guix? That should give you much more accurate numbers than
> our guesswork.
Turned out to be trickier than we could hope, primarily because the
relevant importers did not support imports of a specific version. I
fixed it for CRAN and PyPI here:
https://issues.guix.gnu.org/51493
With the attached script plus the changes above, I can already get some
insight. Here’s what I get for a sample of 200 PyPI packages and 200
CRAN packages:
--8<---------------cut here---------------start------------->8---
$ SAMPLE_SIZE=200 ./pre-inst-env guile ~/src/guix-debugging/importer-accuracy.scm
[…]
Accuracy for 'pypi' (200 packages):
accurate: 58 (29%)
different inputs: 142 (71%)
different source: 0 (0%)
inconclusive: 0 (0%)
Accuracy for 'cran' (200 packages):
accurate: 176 (88%)
different inputs: 23 (12%)
different source: 1 (0%)
inconclusive: 0 (0%)
--8<---------------cut here---------------end--------------->8---
(It’s quite expensive to run because it downloads a whole bunch of
things and tries many 404 URLs in the case of CRAN before finding the
right one.)
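A side note on the “1 (0%)” line above: it is a rounding effect, not a
zero count. The script rounds each share to a whole percent, and Guile’s
‘round’ rounds ties to the nearest even integer. A minimal sketch with
exact rationals (the ‘percent’ helper is only for illustration; the script
itself goes through inexact numbers):

```scheme
(define (percent count total)
  ;; Exact rationals avoid floating-point noise; 'round' rounds ties to even.
  (round (* 100 (/ count total))))

(percent 1 200)     ;1/2 rounds down to 0
(percent 23 200)    ;23/2 rounds up to 12
(percent 176 200)   ;exactly 88
```

This is also why the three non-zero CRAN lines appear to add up to 100%
even though one more package sits in the “different source” row.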
The script doesn’t do anything useful for crates because they have their
own way of representing inputs. It doesn’t account for changes in
‘arguments’ like zimoun suggested, meaning it’s overestimating accuracy.
It’d be nice to run it on gems but that importer doesn’t support
versioning either.
To be continued…
Thanks,
Ludo’.
[-- Attachment #2: importer-accuracy.scm --]
[-- Type: text/plain, Size: 9260 bytes --]
;;; Released under the GNU GPLv3 or any later version.
;;; Copyright © 2021 Ludovic Courtès <ludo@gnu.org>
(use-modules (guix)
(gnu packages)
(guix import cran)
(guix import crate)
(guix import pypi)
((guix import print) #:select (package->code))
((guix upstream) #:select (url-predicate))
(guix diagnostics)
(guix i18n)
(srfi srfi-1)
(srfi srfi-9)
(srfi srfi-9 gnu)
(ice-9 match))
(define-record-type <reimporter>
(reimporter name pred import)
reimporter?
(name reimporter-name)
(pred reimporter-predicate)
(import reimporter-import))
(define (find-reimporter package)
(find (lambda (reimporter)
((reimporter-predicate reimporter) package))
%reimporters))
(define (accurate-import? package)
(define (sexp-field sexp field)
(match sexp
(((or 'package 'origin) fields ...)
(match (assoc field fields)
((key value) value)
(_ #f)))
(('define-public _ exp)
(sexp-field exp field))))
(define (same-source? sexp1 sexp2)
(equal? (sexp-field (sexp-field sexp1 'source) 'sha256)
(sexp-field (sexp-field sexp2 'source) 'sha256)))
(define canonicalize-input
;; 'package->code' creates '@' references but importers don't. Remove
;; the '@' to allow comparison.
(match-lambda
(("gfortran" _)
;; 'package->code' emits nonsense for the value associated with this
;; one, so trust the label.
`("gfortran" ,(list 'unquote 'gfortran)))
((label ('unquote ('@ _ variable)) . rest)
`(,label ,(list 'unquote variable) ,@rest))
(x x)))
(define (equivalent-inputs? inputs1 inputs2)
(if (and inputs1 inputs2)
(lset= equal?
(match inputs1
(('quasiquote inputs)
(map canonicalize-input inputs)))
(match inputs2
(('quasiquote inputs)
(map canonicalize-input inputs))))
(equal? inputs1 inputs2)))
(let* ((reimporter (find-reimporter package))
(imported ((reimporter-import reimporter) package))
(actual (package->code package)))
(define (same-inputs? field)
(equivalent-inputs? (sexp-field imported field)
(sexp-field actual field)))
(if imported
(if (and (same-inputs? 'inputs)
(same-inputs? 'native-inputs)
(same-inputs? 'propagated-inputs))
(if (same-source? actual imported)
'accurate
(begin
(warning (package-location package)
(G_ "~a: source differs from upstream~%")
(package-full-name package))
'different-source))
(begin
(warning (package-location package)
(G_ "~a: inputs differ from upstream~%")
(package-full-name package))
'different-inputs))
'inconclusive)))
;; Stats.
(define-record-type <accuracy>
(accuracy accurate different-inputs different-source inconclusive)
accuracy?
(accurate accuracy-accurate)
(different-inputs accuracy-different-inputs)
(different-source accuracy-different-source)
(inconclusive accuracy-inconclusive))
(define (display-accuracy reimporter accuracy port)
(define total
(letrec-syntax ((sum (syntax-rules ()
((_) 0)
((_ get rest ...)
(+ (get accuracy) (sum rest ...))))))
(sum accuracy-accurate
accuracy-different-inputs
accuracy-different-source
accuracy-inconclusive)))
(define (% fraction)
(inexact->exact (round (* 100. fraction))))
(format port (G_ "Accuracy for '~a' (~a packages):~%")
(reimporter-name reimporter) total)
(format port (G_ " accurate: ~a (~d%)~%")
(accuracy-accurate accuracy)
(% (/ (accuracy-accurate accuracy) total)))
(format port (G_ " different inputs: ~a (~d%)~%")
(accuracy-different-inputs accuracy)
(% (/ (accuracy-different-inputs accuracy) total)))
(format port (G_ " different source: ~a (~d%)~%")
(accuracy-different-source accuracy)
(% (/ (accuracy-different-source accuracy) total)))
(format port (G_ " inconclusive: ~a (~d%)~%")
(accuracy-inconclusive accuracy)
(% (/ (accuracy-inconclusive accuracy) total))))
(define (random-seed)
(logxor (getpid) (car (gettimeofday))))
(define shuffle ;copied from (guix scripts offload)
(let ((state (seed->random-state (random-seed))))
(lambda (lst)
"Return LST shuffled (using the Fisher-Yates algorithm.)"
(define vec (list->vector lst))
(let loop ((result '())
(i (vector-length vec)))
(if (zero? i)
result
(let* ((j (random i state))
(val (vector-ref vec j)))
(vector-set! vec j (vector-ref vec (- i 1)))
(loop (cons val result) (- i 1))))))))
;;;
;;; Reimporters.
;;;
(define pypi-package? ;copied from (guix import pypi)
(url-predicate
(lambda (url)
(or (string-prefix? "https://pypi.org/" url)
(string-prefix? "https://pypi.python.org/" url)
(string-prefix? "https://pypi.org/packages" url)
(string-prefix? "https://files.pythonhosted.org/packages" url)))))
(define guix-package->pypi-name
(@@ (guix import pypi) guix-package->pypi-name))
(define* (package-sample reimporter
#:optional (size (or (and=> (getenv "SAMPLE_SIZE")
string->number)
20)))
(let ((pred (reimporter-predicate reimporter)))
(take (shuffle
(fold-packages (lambda (package lst)
(if (and (pred package)
(not (package-superseded package))
(not (string-prefix? "python2-"
(package-name package))))
(cons package lst)
lst))
'()))
size)))
(define-syntax-rule (increment record field)
(set-field record (field) (+ 1 (field record))))
(define (import-accuracy packages)
(fold (lambda (package accuracy)
(match (accurate-import? package)
('accurate (increment accuracy accuracy-accurate))
('different-inputs (increment accuracy accuracy-different-inputs))
('different-source (increment accuracy accuracy-different-source))
('inconclusive (increment accuracy accuracy-inconclusive))))
(accuracy 0 0 0 0)
packages))
(define (package->cran-name package) ;copied from (guix import cran)
"Return the upstream name of the PACKAGE."
(let ((upstream-name (assoc-ref (package-properties package) 'upstream-name)))
(if upstream-name
upstream-name
(match (package-source package)
((? origin? origin)
(match (origin-uri origin)
((or (? string? url) (url _ ...))
(let ((end (string-rindex url #\_))
(start (string-rindex url #\/)))
;; The URL ends on
;; (string-append "/" name "_" version ".tar.gz")
(and start end (substring url (+ start 1) end))))
(_ #f)))
(_ #f)))))
(define %pypi-reimporter
(reimporter 'pypi pypi-package?
(lambda (package)
(pypi->guix-package
(guix-package->pypi-name package)
#:version (package-version package)))))
(define %cran-reimporter
(reimporter 'cran cran-package?
(lambda (package)
(cran->guix-package
(package->cran-name package)
#:version (package-version package)))))
(define crate-package?
(url-predicate (@@ (guix import crate) crate-url?)))
(define %crate-reimporter
(reimporter 'crate crate-package?
(lambda (package)
(crate->guix-package
(guix-package->crate-name package)
#:version (package-version package)))))
(define %reimporters
(list %pypi-reimporter
%cran-reimporter
;; XXX: Useless since Rust packages don't use the normal inputs
;; fields.
;; %crate-reimporter
))
(let ((results (map (compose import-accuracy package-sample) %reimporters)))
(for-each (lambda (reimporter result)
(display-accuracy reimporter result
(current-output-port)))
%reimporters
results))
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Accuracy of importers?
2021-10-29 21:57 ` Ludovic Courtès
@ 2021-10-30 15:49 ` zimoun
2021-11-09 16:48 ` Ludovic Courtès
0 siblings, 1 reply; 16+ messages in thread
From: zimoun @ 2021-10-30 15:49 UTC (permalink / raw)
To: Ludovic Courtès, Guix Devel
Hi Ludo,
On Fri, 29 Oct 2021 at 23:57, Ludovic Courtès <ludo@gnu.org> wrote:
> (It’s quite expensive to run because it downloads a whole bunch of
> things and tries many 404 URLs in the case of CRAN before finding the
> right one.)
Ah… so it requires investigation.
> --8<---------------cut here---------------start------------->8---
> $ SAMPLE_SIZE=200 ./pre-inst-env guile ~/src/guix-debugging/importer-accuracy.scm
> […]
> Accuracy for 'pypi' (200 packages):
> accurate: 58 (29%)
> different inputs: 142 (71%)
> different source: 0 (0%)
> inconclusive: 0 (0%)
> Accuracy for 'cran' (200 packages):
> accurate: 176 (88%)
> different inputs: 23 (12%)
> different source: 1 (0%)
> inconclusive: 0 (0%)
> --8<---------------cut here---------------end--------------->8---
[...]
> The script doesn’t do anything useful for crates because they have their
> own way of representing inputs. It doesn’t account for changes in
> ‘arguments’ like zimoun suggested, meaning it’s overestimating
> accuracy.
These are already quite interesting results, because they show upstream
stability, IIUC. Well, it means that running “guix import pypi” one
month ago and running the same now, 71% of packages have different
inputs. Right? It is because some metadata from PyPI changed, right?
Not because “guix import pypi” was doing wrong and now it does better,
right?
IMHO, it shows how PyPI allows bad packaging practices, doesn’t it?
As I understand it, this experiment is about upstream “quality”, not
about importer “accuracy”. Do I understand incorrectly?
Cheers,
simon
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Accuracy of importers?
2021-10-30 15:49 ` zimoun
@ 2021-11-09 16:48 ` Ludovic Courtès
2021-11-09 18:36 ` zimoun
0 siblings, 1 reply; 16+ messages in thread
From: Ludovic Courtès @ 2021-11-09 16:48 UTC (permalink / raw)
To: zimoun; +Cc: Guix Devel
Hi,
zimoun <zimon.toutoune@gmail.com> wrote:
> On Fri, 29 Oct 2021 at 23:57, Ludovic Courtès <ludo@gnu.org> wrote:
[...]
>> --8<---------------cut here---------------start------------->8---
>> $ SAMPLE_SIZE=200 ./pre-inst-env guile ~/src/guix-debugging/importer-accuracy.scm
>> […]
>> Accuracy for 'pypi' (200 packages):
>> accurate: 58 (29%)
>> different inputs: 142 (71%)
>> different source: 0 (0%)
>> inconclusive: 0 (0%)
>> Accuracy for 'cran' (200 packages):
>> accurate: 176 (88%)
>> different inputs: 23 (12%)
>> different source: 1 (0%)
>> inconclusive: 0 (0%)
>> --8<---------------cut here---------------end--------------->8---
>
> [...]
>
>> The script doesn’t do anything useful for crates because they have their
>> own way of representing inputs. It doesn’t account for changes in
>> ‘arguments’ like zimoun suggested, meaning it’s overestimating
>> accuracy.
>
> These are already quite interesting results, because they show upstream
> stability, IIUC. Well, it means that running “guix import pypi” one
> month ago and running the same now, 71% of packages have different
> inputs. Right? It is because some metadata from PyPI changed, right?
No no; I’m assuming PyPI, CRAN, etc. provide the same info as they did
back when the package was imported (which is probably the case).
> Not because “guix import pypi” was doing wrong and now it does better,
> right?
I’m also assuming that the importer didn’t change significantly in the
meantime, which is probably a good approximation.
What I think those figures show is the amount of manual tweaks necessary
to get a proper package “à la Guix”, with tests running etc. For PyPI
we often need to add things under ‘native-inputs’, hence the 71%
“different inputs” line. For CRAN that’s sometimes necessary, but much
less frequently. There are also cases with non-R/non-Python
dependencies.
> IMHO, it shows how PyPI allows bad packaging practices, doesn’t it?
>
> As I understand it, this experiment is about upstream “quality”, not
> about importer “accuracy”. Do I understand incorrectly?
Yes, in a way, assuming our importers are not lossy, this tells us
whether the upstream repo contains enough information and/or whether
that information is accurate.
Thanks,
Ludo’.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Accuracy of importers?
2021-11-09 16:48 ` Ludovic Courtès
@ 2021-11-09 18:36 ` zimoun
0 siblings, 0 replies; 16+ messages in thread
From: zimoun @ 2021-11-09 18:36 UTC (permalink / raw)
To: Ludovic Courtès; +Cc: Guix Devel
Hi,
On Tue, 09 Nov 2021 at 17:48, Ludovic Courtès <ludo@gnu.org> wrote:
> What I think those figures show is the amount of manual tweaks necessary
> to get a proper package “à la Guix”, with tests running etc. For PyPI
> we often need to add things under ‘native-inputs’, hence the 71%
> “different inputs” line. For CRAN that’s sometimes necessary, but much
> less frequently. There are also cases with non-R/non-Python
> dependencies.
The numbers are based on “dependencies” mismatch. But this mismatch is
sometimes artificial. For instance, I am not convinced that upstream
distinguishes between build-time (or test-time) dependencies and run-time
dependencies. I mean many packages would work with all dependencies
directly inside ’propagated-inputs’ or ’inputs’ (probably what importers
return), whereas “à la Guix” packaging moves some to ’native-inputs’.
Well, I do not know what we can conclude in the end.
For instance, the numbers are:
Accuracy for 'pypi' (200 packages):
accurate: 58 (29%)
different inputs: 142 (71%)
different source: 0 (0%)
inconclusive: 0 (0%)
Accuracy for 'cran' (200 packages):
accurate: 176 (88%)
different inputs: 23 (12%)
different source: 1 (0%)
inconclusive: 0 (0%)
but on these numbers, how many CRAN packages have other dependencies
than the ones listed in ’propagated-inputs’? I guess 24.
My point is that there is a strong bias about the “complexity” of
packages. If CRAN packages are “simpler”, then indeed their imports are
more accurate.
In other words, when picking 200 samples for each importer, each of these
batches of 200 should have the same distribution of inputs:
- X ’propagated-inputs’ only
- Y ’propagated-inputs’ and ’inputs’
- Z ’propagated-inputs’ and ’inputs’ and ’native-inputs’
where X+Y+Z=100%. Then, the numbers for the two importers become
“comparable”.
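To make the X/Y/Z split concrete, here is a rough sketch in plain Guile.
It is only an illustration under assumptions: packages are modelled as
alists instead of real <package> records, ‘stratum’ and ‘stratum-shares’
are made-up names, the package names are invented, and a package with
’native-inputs’ but no ’inputs’ is simply counted in the Z class:

```scheme
(use-modules (srfi srfi-1))

;; Toy stand-in for a package: an alist mapping input field names to lists
;; of input names.  Real code would use 'package-inputs' and friends.
(define (stratum package)
  "Return a symbol naming which input fields PACKAGE uses."
  (define (used? field)
    (pair? (or (assq-ref package field) '())))
  (cond ((used? 'native-inputs) 'propagated+inputs+native)   ;Z
        ((used? 'inputs)        'propagated+inputs)          ;Y
        (else                   'propagated-only)))          ;X

(define (stratum-shares packages)
  "Return an alist of (STRATUM . SHARE); PACKAGES must be non-empty."
  (let ((total (length packages)))
    (map (lambda (name)
           (cons name
                 (/ (count (lambda (p) (eq? (stratum p) name)) packages)
                    total)))
         '(propagated-only propagated+inputs propagated+inputs+native))))

;; Invented sample of four packages.
(define sample
  '(((propagated-inputs "r-rcpp"))
    ((propagated-inputs "python-six") (native-inputs "python-pytest"))
    ((propagated-inputs "r-mass") (inputs "zlib"))
    ((propagated-inputs "r-ggplot2"))))

(stratum-shares sample)
```

Comparing these shares between the PyPI batch and the CRAN batch would
show whether the two samples are similar enough for the accuracy figures
to be compared directly.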
>> As I understand it, this experiment is about upstream “quality”, not
>> about importer “accuracy”. Do I understand incorrectly?
>
> Yes, in a way, assuming our importers are not lossy, this tells us
> whether the upstream repo contains enough information and/or whether
> that information is accurate.
Thanks for explaining.
Cheers,
simon
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Accuracy of importers?
2021-10-28 7:02 Accuracy of importers? Ludovic Courtès
2021-10-28 8:17 ` Lars-Dominik Braun
@ 2021-10-28 9:06 ` zimoun
2021-10-28 9:30 ` zimoun
2021-10-28 11:38 ` Julien Lepiller
` (4 subsequent siblings)
6 siblings, 1 reply; 16+ messages in thread
From: zimoun @ 2021-10-28 9:06 UTC (permalink / raw)
To: Ludovic Courtès, Guix Devel
Hi,
On Thu, 28 Oct 2021 at 09:02, Ludovic Courtès <ludovic.courtes@inria.fr> wrote:
> My understanding is that most of them require manual intervention—i.e.,
> one has to tweak what ‘guix import’ produces, even if we ignore
> synopsis/description/license, to set the right inputs, etc. If we were
> to estimate the fraction of imported packages for which manual changes
> are needed, what would it look like?
Manual intervention depends on how it is packaged upstream, i.e., the
availability of metadata. Therefore, it depends on the upstream
archive. PyPI is messier than CRAN for instance but I find it hard to back
this claim with numbers – just intuition. :-)
> importer fraction of imported packages needing changes
>
> gnu 90% (doesn’t know about dependencies)
> pypi 50% (some miss source distro, “sdist”; some have
> non-Python deps)
> cpan ?
> hackage ?
> stackage (Lars?)
> egg (Xinglu?)
> elpa (Nicolas?)
> gem ?
> go (Sarah? Leo? Raghav?)
> cran 5% (Ricardo? Simon? seems to almost always work?)
> crate 10% (Efraim?)
> texlive (Ricardo? Thiago? Marius?)
> opam (Julien?)
> minetest (Maxime? Vivien?)
> julia (WIP) (Simon?)
> npm (WIP) (Jelle? Timothy?)
For the ones I use, “cran” and “cran -a bioconductor”, and from the
feedback I get from users in my lab, one regular complaint is the
missing prefix ’license:’ – if that’s the issue, it means the importer
works pretty well. :-)
About Julia, it is often not clear how to extract “dependencies”, which
means distinguishing the run-time ones from the test-time ones.
> (Lower is better.) What would be your estimate?
For all cases, to have a good estimation, I would examine how many
packages already in Guix have a non-default ’arguments’ field and modified
phases. It means that these packages required a manual fix.
Missing or incorrect dependencies happen. But they are impossible to
evaluate. However, special ’arguments’ are something we can evaluate and,
for now, no importer tweaks that, IIUC, thus it would sketch the picture
«how well our importers are doing».
For instance, filtered on build-system. For sure, not all
python-build-system packages come from PyPI, nor all r-build-system ones
from CRAN/Bioconductor, etc. but, IMHO, such stats would provide a good
estimation for how upstream archives ELPA, PyPI, CRAN/Bioconductor,
Hackage/Stackage, TexLive, etc. are ready for Guix without manual
intervention.
Cheers,
simon
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Accuracy of importers?
2021-10-28 9:06 ` zimoun
@ 2021-10-28 9:30 ` zimoun
0 siblings, 0 replies; 16+ messages in thread
From: zimoun @ 2021-10-28 9:30 UTC (permalink / raw)
To: Ludovic Courtès, Guix Devel
Re,
On Thu, 28 Oct 2021 at 11:07, zimoun <zimon.toutoune@gmail.com> wrote:
> For instance, filtered on build-system. For sure, not all
> python-build-system packages come from PyPI, nor all r-build-system ones
> from CRAN/Bioconductor, etc. but, IMHO, such stats would provide a good
> estimation for how upstream archives ELPA, PyPI, CRAN/Bioconductor,
> Hackage/Stackage, TexLive, etc. are ready for Guix without manual
> intervention.
Ah, better: filtering on URIs, e.g., pypi-uri, cran-uri,
bioconductor-uri, crate-uri, and so on, i.e., where each importer
looks. Then count how many packages have an 'arguments' field that is
not the default one. This should give an estimation of the accuracy of
importers, IMHO.
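Roughly, the counting could look like the sketch below. It is only an
illustration in plain Guile with toy data: packages are alists rather
than real <package> records, ‘%archives’, ‘archive-of’ and
‘tweaked-share’ are made-up names, and the URL prefixes and package
names are assumptions for the example:

```scheme
(use-modules (srfi srfi-1))

(define %archives
  ;; Archive name and URL prefix; the prefixes are illustrative.
  '((pypi . "https://files.pythonhosted.org/packages")
    (cran . "mirror://cran/")))

(define (archive-of package)
  "Return the archive symbol whose URL prefix matches PACKAGE's URI, or #f."
  (let ((uri (assq-ref package 'uri)))
    (any (lambda (entry)
           (and (string-prefix? (cdr entry) uri)
                (car entry)))
         %archives)))

(define (tweaked-share archive packages)
  "Return the fraction of PACKAGES from ARCHIVE that have non-default
arguments, or #f if no package comes from ARCHIVE."
  (let* ((mine    (filter (lambda (p) (eq? (archive-of p) archive))
                          packages))
         (tweaked (filter (lambda (p) (pair? (assq-ref p 'arguments)))
                          mine)))
    (and (pair? mine)
         (/ (length tweaked) (length mine)))))

;; Invented sample: one PyPI package with tweaked arguments, one without,
;; and one CRAN package with default arguments.
(define sample
  '(((uri . "https://files.pythonhosted.org/packages/source/f/foo/foo-1.0.tar.gz")
     (arguments . (#:tests? #f)))
    ((uri . "https://files.pythonhosted.org/packages/source/b/bar/bar-2.0.tar.gz")
     (arguments . ()))
    ((uri . "mirror://cran/src/contrib/baz_0.1.tar.gz")
     (arguments . ()))))

(tweaked-share 'pypi sample)
(tweaked-share 'cran sample)
```

On real packages one would walk the package set (e.g. with
‘fold-packages’) and look at ‘package-arguments’ instead of the alists
used here.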
Cheers,
simon
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Accuracy of importers?
2021-10-28 7:02 Accuracy of importers? Ludovic Courtès
2021-10-28 8:17 ` Lars-Dominik Braun
2021-10-28 9:06 ` zimoun
@ 2021-10-28 11:38 ` Julien Lepiller
2021-10-28 12:25 ` Ricardo Wurmus
` (3 subsequent siblings)
6 siblings, 0 replies; 16+ messages in thread
From: Julien Lepiller @ 2021-10-28 11:38 UTC (permalink / raw)
To: guix-devel, Ludovic Courtès, Guix Devel
On 28 October 2021 03:02:27 GMT-04:00, "Ludovic Courtès" <ludovic.courtes@inria.fr> wrote:
>Hello Guix!
>
>As I’m preparing my PackagingCon talk and wondering how language package
>managers could make our lives easier, I thought it’d be interesting to
>know how well our importers are doing.
>
>My understanding is that most of them require manual intervention—i.e.,
>one has to tweak what ‘guix import’ produces, even if we ignore
>synopsis/description/license, to set the right inputs, etc. If we were
>to estimate the fraction of imported packages for which manual changes
>are needed, what would it look like?
>
> importer fraction of imported packages needing changes
>
> gnu 90% (doesn’t know about dependencies)
> pypi 50% (some miss source distro, “sdist”; some have
> non-Python deps)
> cpan ?
> hackage ?
> stackage (Lars?)
> egg (Xinglu?)
> elpa (Nicolas?)
> gem ?
> go (Sarah? Leo? Raghav?)
> cran 5% (Ricardo? Simon? seems to almost always work?)
> crate 10% (Efraim?)
> texlive (Ricardo? Thiago? Marius?)
> opam (Julien?)
I find it pretty good: when importing huge numbers of packages recently, I
was able to build all of them without modification. However, lots rely on a
GitHub tarball, so I would change the source in these cases before sending
them to Guix.
> minetest (Maxime? Vivien?)
> julia (WIP) (Simon?)
> npm (WIP) (Jelle? Timothy?)
>
>(Lower is better.) What would be your estimate?
>
>Among those, which importers provide source that differs from what you’d
>get from upstream’s checkout or release tarballs? My guess:
>
> pypi (see LastPyMile paper)
> elpa (gives hosted tarballs that can differ from upstream repo)
> gem (similar to PyPI)
> npm (ditto)
>
>What about licensing info: which ones provide accurate licensing info?
>My guess:
>
> gnu
> pypi
> cpan
> cran
> elpa
> go (?)
> cran
> crate (?)
> texlive
> opam (?)
> minetest (?)
>
>TIA! :-)
>
>Ludo’.
>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Accuracy of importers?
2021-10-28 7:02 Accuracy of importers? Ludovic Courtès
` (2 preceding siblings ...)
2021-10-28 11:38 ` Julien Lepiller
@ 2021-10-28 12:25 ` Ricardo Wurmus
2021-10-28 14:47 ` Katherine Cox-Buday
` (2 subsequent siblings)
6 siblings, 0 replies; 16+ messages in thread
From: Ricardo Wurmus @ 2021-10-28 12:25 UTC (permalink / raw)
To: Ludovic Courtès; +Cc: guix-devel
[-- Attachment #1: Type: text/plain, Size: 3001 bytes --]
Ludovic Courtès <ludovic.courtes@inria.fr> writes:
> Hello Guix!
>
> As I’m preparing my PackagingCon talk and wondering how language
> package
> managers could make our lives easier, I thought it’d be
> interesting to
> know how well our importers are doing.
>
> My understanding is that most of them require manual
> intervention—i.e.,
> one has to tweak what ‘guix import’ produces, even if we ignore
> synopsis/description/license, to set the right inputs, etc. If
> we were
> to estimate the fraction of imported packages for which manual
> changes
> are needed, what would it look like?
>
> importer fraction of imported packages needing changes
[…]
> cran 5% (Ricardo? Simon? seems to almost always
> work?)
Like Lars and Simon wrote: the importers work *really* well for
both CRAN and Bioconductor, so much so that I’m using them in the
background here:
https://git.elephly.net/gitweb.cgi?p=software/r-guix-install.git;a=blob;f=guix-install.R;h=2766aa1f2d248a8ed2a4eb4c3244b85574d326e2;hb=HEAD
The biggest annoyance is the missing “license:” prefix when
packaging things for gnu/packages/cran.scm or
gnu/packages/bioconductor.scm. Descriptions need regular clean-up
work (e.g. to complete sentences), even though we’re using some
heuristics to fix the most common stylistic problems. It’s really
not a big deal, though.
The biggest missing feature is recursive import of dependencies
hosted on Github or Mercurial (with “-r -a git” or “-r -a hg”).
I.e. a package on Github that declares a dependency on another
package that’s also only hosted on Github will fail to import that
dependency. This is pretty rare, but it happens with experimental
bioinfo software.
> texlive (Ricardo? Thiago? Marius?)
This one is not usable. I’d even add “at all”. I keep announcing
that one day I’ll replace it with a new importer, but that new
importer just isn’t ready yet.
> What about licensing info: which ones provide accurate licensing
> info?
> My guess:
>
> gnu
> pypi
> cpan
> cran
The CRAN importer is as accurate as upstream allows. CRAN
requires a free license; Bioconductor requires a license
declaration (there have been very few cases where the license was
not correct, but a number of cases where the license was non-free,
such as the Artistic 1.0 license). Bioconductor sometimes is
sneaky and the R code is free but a necessary library is not.
> texlive
Pretty terrible. The license declaration is generally too vague.
Licenses are often declared without version number, and sometimes
it’s just some generic “free” license. A new importer based on
texlive.tlpdb would not improve this by much, because the upstream
declarations are just spotty and unreliable.
--
Ricardo
PS: attached is a rough WIP patch of what I had been using to
import new texlive stuff.
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: texlive-import.diff --]
[-- Type: text/x-patch, Size: 6523 bytes --]
diff --git a/guix/import/texlive.scm b/guix/import/texlive.scm
index 18d8b95ee0..b94aa1cf40 100644
--- a/guix/import/texlive.scm
+++ b/guix/import/texlive.scm
@@ -19,10 +19,12 @@
(define-module (guix import texlive)
#:use-module (ice-9 match)
+ #:use-module (ice-9 rdelim)
#:use-module (sxml simple)
#:use-module (sxml xpath)
#:use-module (srfi srfi-11)
#:use-module (srfi srfi-1)
+ #:use-module (srfi srfi-2)
#:use-module (srfi srfi-26)
#:use-module (srfi srfi-34)
#:use-module (web uri)
@@ -125,9 +127,9 @@ (define (fetch-sxml name)
(xml->sxml (http-fetch url)
#:trim-whitespace? #t))))
-(define (guix-name component name)
+(define (guix-name name)
"Return a Guix package name for a given Texlive package NAME."
- (string-append "texlive-" component "-"
+ (string-append "texlive-"
(string-map (match-lambda
(#\_ #\-)
(#\. #\-)
@@ -186,12 +188,123 @@ (define (sxml-value path)
((lst ...) `(list ,@lst))
(license license)))))))
+(define tlpdb
+ (memoize
+ (lambda ()
+ (let ((file "/home/rekado/dev/gx/branches/master/texlive.tlpdb")
+ (fields
+ '((name . string)
+ (shortdesc . string)
+ (longdesc . string)
+ (catalogue-license . string)
+ (catalogue-ctan . string)
+ (srcfiles . list)
+ (runfiles . list)
+ (docfiles . list)
+ (depend . list)))
+ (record
+ (lambda* (key value alist #:optional (type 'string))
+ (let ((new
+ (or (and=> (assoc-ref alist key)
+ (lambda (existing)
+ (cond
+ ((eq? type 'string)
+ (string-append existing " " value))
+ ((eq? type 'list)
+ (cons value existing)))))
+ (cond
+ ((eq? type 'string)
+ value)
+ ((eq? type 'list)
+ (list value))))))
+ (acons key new (alist-delete key alist))))))
+ (call-with-input-file file
+ (lambda (port)
+ (let loop ((all (list))
+ (current (list))
+ (last-property #false))
+ (let ((line (read-line port)))
+ (cond
+ ((eof-object? line) all)
+
+ ;; End of record.
+ ((string-null? line)
+ (loop (cons (cons (assoc-ref current 'name) current)
+ all)
+ (list) #false))
+
+ ;; Continuation of a list
+ ((and (zero? (string-index line #\space)) last-property)
+ ;; Erase optional second part of list values like
+ ;; "details=Readme" for files
+ (let ((plain-value (first
+ (string-split
+ (string-trim-both line) #\space))))
+ (loop all (record last-property
+ plain-value
+ current
+ 'list)
+ last-property)))
+ (else
+ (or (and-let* ((space (string-index line #\space))
+ (key (string->symbol (string-take line space)))
+ (value (string-drop line (1+ space)))
+ (field-type (assoc-ref fields key)))
+ ;; Erase second part of list keys like "size=29"
+ (if (eq? field-type 'list)
+ (loop all current key)
+ (loop all (record key value current field-type) key)))
+ (loop all current #false))))))))))))
+
+(define (files->directories files)
+ (map (cut string-join <> "/" 'suffix)
+ (delete-duplicates (map (lambda (file)
+ (drop-right (string-split file #\/) 1))
+ files)
+ equal?)))
+
+(define (tlpdb->package name)
+ (and-let* ((data (assoc-ref (tlpdb) name))
+ (dirs (files->directories
+ (append (or (assoc-ref data 'docfiles) (list))
+ (or (assoc-ref data 'runfiles) (list))
+ (or (assoc-ref data 'srcfiles) (list))))))
+ (pk data)
+ ;; TODO
+ `(package
+ (name ,(guix-name name))
+ (version (number->string %texlive-revision))
+ (source (texlive-origin name version
+ ',dirs
+ (base32
+ "TODO"
+ #;
+ ,(bytevector->nix-base32-string
+ (let-values (((port get-hash) (open-sha256-port)))
+ (write-file checkout port)
+ (force-output port)
+ (get-hash))))))
+ (build-system texlive-build-system)
+ (arguments ,`(,'quote (#:tex-directory "TODO")))
+ ,@(or (and=> (assoc-ref data 'depend)
+ (lambda (inputs)
+ `((propagated-inputs ,inputs))))
+ '())
+ ,@(or (and=> (assoc-ref data 'catalogue-ctan)
+ (lambda (url)
+ `((home-page ,(string-append "https://ctan.org" url)))))
+ '((home-page "https://www.tug.org/texlive/")))
+ (synopsis ,(assoc-ref data 'shortdesc))
+ (description ,(beautify-description
+ (assoc-ref data 'longdesc)))
+ (license ,(string->license
+ (assoc-ref data 'catalogue-license))))))
+
(define texlive->guix-package
(memoize
(lambda* (package-name #:optional (component "latex"))
"Fetch the metadata for PACKAGE-NAME from REPO and return the `package'
s-expression corresponding to that package, or #f on failure."
- (and=> (fetch-sxml package-name)
- (cut sxml->package <> component)))))
+ (tlpdb->package package-name))))
;;; ctan.scm ends here
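
To illustrate what the `files->directories` helper in the patch above computes, here is a small self-contained sketch that can be pasted into a Guile REPL. The helper body is copied from the patch; the file names are made up for illustration and are not taken from a real tlpdb record.

```scheme
(use-modules (srfi srfi-1)    ; drop-right, delete-duplicates
             (srfi srfi-26))  ; cut

;; Same definition as in the patch: drop the last path component of
;; each file name, keep each resulting directory once, and render it
;; with a trailing "/".
(define (files->directories files)
  (map (cut string-join <> "/" 'suffix)
       (delete-duplicates (map (lambda (file)
                                 (drop-right (string-split file #\/) 1))
                               files)
                          equal?)))

;; Hypothetical file list, for illustration only.
(define example-files
  '("texmf-dist/tex/latex/foo/foo.sty"
    "texmf-dist/tex/latex/foo/foo.cls"
    "texmf-dist/doc/latex/foo/README"))

(files->directories example-files)
;; => ("texmf-dist/tex/latex/foo/" "texmf-dist/doc/latex/foo/")
```

The two files under `tex/latex/foo` collapse into one directory entry, which is what the generated `texlive-origin` form receives as its list of locations.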
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: Accuracy of importers?
2021-10-28 7:02 Accuracy of importers? Ludovic Courtès
` (3 preceding siblings ...)
2021-10-28 12:25 ` Ricardo Wurmus
@ 2021-10-28 14:47 ` Katherine Cox-Buday
2021-10-29 19:29 ` Nicolas Goaziou
2021-10-30 10:55 ` Xinglu Chen
6 siblings, 0 replies; 16+ messages in thread
From: Katherine Cox-Buday @ 2021-10-28 14:47 UTC (permalink / raw)
To: Ludovic Courtès; +Cc: Guix Devel
Ludovic Courtès <ludovic.courtes@inria.fr> writes:
> go (Sarah? Leo? Raghav?)
I have only used this a few times so far, but the quality seems to have gotten a lot better. My impression, though, is that we stumble when dealing with the heterogeneity of the internet, because of how we have to generate packages so as not to rely on a centralized GOPROXY server (namely one controlled by Google). A few things could make this situation better:
There is an open issue[1] for a better API to https://pkg.go.dev which may eventually allow us to query for things like license, VCS path, etc. This could obviate Guix's need to crawl the internet.
I was also discussing[2] the pros/cons of relying on the Go tool-chain to do most of the work for us. I think doing so might make the right trade-offs, but it sounds like[3] we are blocked by cgit's ability to work with shallow checkouts. Since Guix has a build environment, maybe we could just use the Git CLI instead of a Scheme library when necessary.
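
As a rough sketch of the "use the Git CLI" idea, and not a description of how Guix fetches sources today, a shallow checkout could be driven from Guile roughly as follows. The procedure names `shallow-clone-command` and `shallow-clone` are hypothetical; the only external pieces assumed are Guile's `system*` and the standard `git clone --depth`/`--branch` options.

```scheme
;; Hypothetical helper: build the argument list for a shallow clone of
;; REF (a branch or tag name) from URL into DIRECTORY.
(define (shallow-clone-command url ref directory)
  (list "git" "clone" "--depth" "1" "--branch" ref url directory))

;; Hypothetical helper: run the command and return #t when git exits
;; successfully.  This needs a `git' executable in PATH and, for remote
;; URLs, network access.
(define (shallow-clone url ref directory)
  (zero? (apply system* (shallow-clone-command url ref directory))))
```

Note that `--branch` accepts branch and tag names but not arbitrary commit hashes, so a pinned commit would need a different sequence of commands.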
I hope this helps, and good luck with your talk!
[1] - https://github.com/golang/go/issues/36785
[2] - https://lists.gnu.org/archive/html/guix-devel/2021-09/msg00344.html
[3] - https://lists.gnu.org/archive/html/guix-devel/2021-10/msg00020.html
--
Katherine
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Accuracy of importers?
2021-10-28 7:02 Accuracy of importers? Ludovic Courtès
` (4 preceding siblings ...)
2021-10-28 14:47 ` Katherine Cox-Buday
@ 2021-10-29 19:29 ` Nicolas Goaziou
2021-10-29 23:08 ` Carlo Zancanaro
2021-10-30 10:55 ` Xinglu Chen
6 siblings, 1 reply; 16+ messages in thread
From: Nicolas Goaziou @ 2021-10-29 19:29 UTC (permalink / raw)
To: Ludovic Courtès; +Cc: Guix Devel
Hello,
Ludovic Courtès <ludovic.courtes@inria.fr> writes:
> My understanding is that most of them require manual intervention—i.e.,
> one has to tweak what ‘guix import’ produces, even if we ignore
> synopsis/description/license, to set the right inputs, etc. If we were
> to estimate the fraction of imported packages for which manual changes
> are needed, what would it look like?
>
> importer fraction of imported packages needing changes
>
> gnu 90% (doesn’t know about dependencies)
> pypi 50% (some miss source distro, “sdist”; some have
> non-Python deps)
> cpan ?
> hackage ?
> stackage (Lars?)
> egg (Xinglu?)
> elpa (Nicolas?)
The elpa importer is accurate. Manual changes are often (I would say
around 75% of the time) required for the description field, though.
However, the generated source URI is not reliable (see bug #46849),
which means the importer is not practical. Using it means the imported
package will need to be updated quickly.
> Among those, which importers provide source that differs from what you’d
> get from upstream’s checkout or release tarballs? My guess:
>
> pypi (see LastPyMile paper)
> elpa (gives hosted tarballs that can differ from upstream repo)
Indeed.
> gem (similar to PyPI)
> npm (ditto)
>
> What about licensing info: which ones provide accurate licensing info?
> My guess:
>
> gnu
> pypi
> cpan
> cran
> elpa
Correct
Regards,
--
Nicolas Goaziou
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Accuracy of importers?
2021-10-28 7:02 Accuracy of importers? Ludovic Courtès
` (5 preceding siblings ...)
2021-10-29 19:29 ` Nicolas Goaziou
@ 2021-10-30 10:55 ` Xinglu Chen
6 siblings, 0 replies; 16+ messages in thread
From: Xinglu Chen @ 2021-10-30 10:55 UTC (permalink / raw)
To: Ludovic Courtès, Guix Devel
On Thu, Oct 28 2021, Ludovic Courtès wrote:
> Hello Guix!
>
> As I’m preparing my PackagingCon talk and wondering how language package
> managers could make our lives easier, I thought it’d be interesting to
> know how well our importers are doing.
>
> My understanding is that most of them require manual intervention—i.e.,
> one has to tweak what ‘guix import’ produces, even if we ignore
> synopsis/description/license, to set the right inputs, etc. If we were
> to estimate the fraction of imported packages for which manual changes
> are needed, what would it look like?
>
> importer fraction of imported packages needing changes
>
> gnu 90% (doesn’t know about dependencies)
> pypi 50% (some miss source distro, “sdist”; some have
> non-Python deps)
> cpan ?
> hackage ?
> stackage (Lars?)
The Stackage importer is mostly based on the Hackage importer, and both
are unable to parse certain things in .cabal files.[1][2] I would say that
this happens in maybe 1/20 to 1/15 of cases.
[1]: <https://issues.guix.gnu.org/36690>
[2]: <https://issues.guix.gnu.org/35743>
> egg (Xinglu?)
I haven’t used it that much, but I would say it works ~80% of the time.
Some egg packages specify system dependencies (e.g., OpenSSL), but the
importer doesn’t know what the name of that package is in Guix, so the
generated inputs are not always correct.
> What about licensing info: which ones provide accurate licensing info?
> My guess:
>
> gnu
> pypi
> cpan
> cran
> elpa
> go (?)
> cran
> crate (?)
> texlive
> opam (?)
> minetest (?)
For the egg importer, many packages specify the wrong license in their
.egg file, and there is no convention for what naming scheme to use, so
sometimes it is ‘GPL3’, other times it is ‘GPL-3.0’.
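
One way to cope with inconsistent spellings like these is a small alias table that an importer consults before giving up. The sketch below is only an illustration of that idea, not the egg importer's actual code: `%license-aliases` and `normalize-license` are hypothetical names, and the choice of "GPL-3.0" as the canonical spelling is arbitrary.

```scheme
;; Hypothetical alias table: lower-cased spellings mapped to a single
;; canonical spelling.  Only the two spellings mentioned above are
;; listed here.
(define %license-aliases
  '(("gpl3"    . "GPL-3.0")
    ("gpl-3.0" . "GPL-3.0")))

;; Return the canonical spelling for STR, or #f when STR is not in the
;; table, so that the caller can fall back to manual review.
(define (normalize-license str)
  (assoc-ref %license-aliases
             (string-downcase (string-trim-both str))))

(normalize-license "GPL3")      ; => "GPL-3.0"
(normalize-license " gpl-3.0 ") ; => "GPL-3.0"
(normalize-license "Artistic")  ; => #f
```

Returning #f for unknown spellings keeps the table from guessing; it does nothing about packages that declare the wrong license in the first place.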
The Hackage/Stackage importer generally results in correct licenses, so
I would also put it on this list. :-)
^ permalink raw reply [flat|nested] 16+ messages in thread