unofficial mirror of guix-devel@gnu.org 
* Accuracy of importers?
@ 2021-10-28  7:02 Ludovic Courtès
  2021-10-28  8:17 ` Lars-Dominik Braun
                   ` (6 more replies)
  0 siblings, 7 replies; 16+ messages in thread
From: Ludovic Courtès @ 2021-10-28  7:02 UTC
  To: Guix Devel

Hello Guix!

As I’m preparing my PackagingCon talk and wondering how language package
managers could make our lives easier, I thought it’d be interesting to
know how well our importers are doing.

My understanding is that most of them require manual intervention—i.e.,
one has to tweak what ‘guix import’ produces, even if we ignore
synopsis/description/license, to set the right inputs, etc.  If we were
to estimate the fraction of imported packages for which manual changes
are needed, what would it look like?

   importer     fraction of imported packages needing changes

   gnu          90% (doesn’t know about dependencies)
   pypi         50% (some miss source distro, “sdist”; some have
                     non-Python deps)
   cpan         ?
   hackage      ?
   stackage     (Lars?)
   egg          (Xinglu?)
   elpa         (Nicolas?)
   gem          ?
   go           (Sarah? Leo? Raghav?)
   cran         5% (Ricardo? Simon? seems to almost always work?)
   crate        10% (Efraim?)
   texlive      (Ricardo? Thiago? Marius?)
   opam         (Julien?)
   minetest     (Maxime? Vivien?)
   julia (WIP)  (Simon?)
   npm (WIP)    (Jelle? Timothy?)

(Lower is better.)  What would be your estimate?  

Among those, which importers provide source that differs from what you’d
get from upstream’s checkout or release tarballs?  My guess:

   pypi (see LastPyMile paper)
   elpa (gives hosted tarballs that can differ from upstream repo)
   gem (similar to PyPI)
   npm (ditto)

What about licensing info: which ones provide accurate licensing info?
My guess:

   gnu
   pypi
   cpan
   cran
   elpa
   go (?)
   cran
   crate (?)
   texlive
   opam (?)
   minetest (?)

TIA! :-)

Ludo’.



* Re: Accuracy of importers?
  2021-10-28  7:02 Accuracy of importers? Ludovic Courtès
@ 2021-10-28  8:17 ` Lars-Dominik Braun
  2021-10-28  8:54   ` Ludovic Courtès
  2021-10-29 21:57   ` Ludovic Courtès
  2021-10-28  9:06 ` zimoun
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 16+ messages in thread
From: Lars-Dominik Braun @ 2021-10-28  8:17 UTC
  To: Ludovic Courtès; +Cc: Guix Devel

Hi Ludo’,

> My understanding is that most of them require manual intervention—i.e.,
> one has to tweak what ‘guix import’ produces, even if we ignore
> synopsis/description/license, to set the right inputs, etc.  If we were
> to estimate the fraction of imported packages for which manual changes
> are needed, what would it look like?
> 
>    importer     fraction of imported packages needing changes
>    pypi         50% (some miss source distro, “sdist”; some have
>                      non-Python deps)
that seems right, although the most common modification I do nowadays
is replacing 'check with a pytest phase.
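
For illustration, such a replacement usually looks something like the
fragment below.  This is only a sketch, not taken from a specific
package: the "-vv" flag is an arbitrary choice, and a given package may
need extra pytest options or environment setup.

```scheme
(arguments
 '(#:phases
   (modify-phases %standard-phases
     ;; Replace the default 'check phase with a direct pytest call.
     (replace 'check
       (lambda* (#:key tests? #:allow-other-keys)
         ;; Honor #:tests? so the test suite can still be disabled.
         (when tests?
           (invoke "pytest" "-vv")))))))
```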

>    hackage      ?
>    stackage     (Lars?)
I’ve mostly used the updater, not the importer, so I can’t say a
number unfortunately.

>    cran         5% (Ricardo? Simon? seems to almost always work?)
In my experience the number of interventions here goes towards zero
actually, except for description. It’s pretty good :)

>    npm (WIP)    (Jelle? Timothy?)
Maybe 5%? But the imported packages do not build anything and don’t
run tests either, so chances for failure are pretty low.

Would it be possible to just run the importer again for existing packages
and compare the result (minus synopsis/description) with what’s
available in Guix? That should give you much more accurate numbers than
our guesswork.

Cheers,
Lars




* Re: Accuracy of importers?
  2021-10-28  8:17 ` Lars-Dominik Braun
@ 2021-10-28  8:54   ` Ludovic Courtès
  2021-10-28 10:06     ` Lars-Dominik Braun
  2021-10-29 21:57   ` Ludovic Courtès
  1 sibling, 1 reply; 16+ messages in thread
From: Ludovic Courtès @ 2021-10-28  8:54 UTC
  To: Lars-Dominik Braun; +Cc: Guix Devel

Hi!

Lars-Dominik Braun <lars@6xq.net> wrote:

>> My understanding is that most of them require manual intervention—i.e.,
>> one has to tweak what ‘guix import’ produces, even if we ignore
>> synopsis/description/license, to set the right inputs, etc.  If we were
>> to estimate the fraction of imported packages for which manual changes
>> are needed, what would it look like?
>> 
>>    importer     fraction of imported packages needing changes
>>    pypi         50% (some miss source distro, “sdist”; some have
>>                      non-Python deps)
> that seems right, although the most common modification I do nowadays
> is replacing 'check with a pytest phase.

Right.  PyPI/setup.py/.whl doesn’t contain info as to how to run tests,
right?

>>    hackage      ?
>>    stackage     (Lars?)
> I’ve mostly used the updater, not the importer, so I can’t say a
> number unfortunately.

Did the updater suggest input changes?

>>    cran         5% (Ricardo? Simon? seems to almost always work?)
> In my experience the number of interventions here goes towards zero
> actually, except for description. It’s pretty good :)

Yay!

>>    npm (WIP)    (Jelle? Timothy?)
> Maybe 5%? But the imported packages do not build anything and don’t
> run tests either, so chances for failure are pretty low.

Yeah.

> Would it be possible to just run the importer again for existing packages
> and compare the result (minus synopsis/description) with what’s
> available in Guix? That should give you much more accurate numbers than
> our guesswork.

That’s a good idea.  I can try and do that on a sample of packages.

Thanks!

Ludo’.



* Re: Accuracy of importers?
  2021-10-28  7:02 Accuracy of importers? Ludovic Courtès
  2021-10-28  8:17 ` Lars-Dominik Braun
@ 2021-10-28  9:06 ` zimoun
  2021-10-28  9:30   ` zimoun
  2021-10-28 11:38 ` Julien Lepiller
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 16+ messages in thread
From: zimoun @ 2021-10-28  9:06 UTC
  To: Ludovic Courtès, Guix Devel

Hi,

On Thu, 28 Oct 2021 at 09:02, Ludovic Courtès <ludovic.courtes@inria.fr> wrote:

> My understanding is that most of them require manual intervention—i.e.,
> one has to tweak what ‘guix import’ produces, even if we ignore
> synopsis/description/license, to set the right inputs, etc.  If we were
> to estimate the fraction of imported packages for which manual changes
> are needed, what would it look like?

Manual intervention depends on how a package is packaged upstream, i.e.,
on the availability of metadata.  Therefore, it depends on the upstream
archive.  PyPI is messier than CRAN, for instance, but I find it hard to
back this claim with numbers – just intuition. :-)


>    importer     fraction of imported packages needing changes
>
>    gnu          90% (doesn’t know about dependencies)
>    pypi         50% (some miss source distro, “sdist”; some have
>                      non-Python deps)
>    cpan         ?
>    hackage      ?
>    stackage     (Lars?)
>    egg          (Xinglu?)
>    elpa         (Nicolas?)
>    gem          ?
>    go           (Sarah? Leo? Raghav?)
>    cran         5% (Ricardo? Simon? seems to almost always work?)
>    crate        10% (Efraim?)
>    texlive      (Ricardo? Thiago? Marius?)
>    opam         (Julien?)
>    minetest     (Maxime? Vivien?)
>    julia (WIP)  (Simon?)
>    npm (WIP)    (Jelle? Timothy?)

Of these, I use “cran” and “cran -a bioconductor”.  From the feedback I
get from users in my lab, one regular complaint is the missing
‘license:’ prefix.  If that is the main issue, it means the importer
works pretty well. :-)

About Julia, it is often not clear how to extract the dependencies,
i.e., how to tell the run-time ones from the test-time ones.


> (Lower is better.)  What would be your estimate?

For all cases, to get a good estimate, I would examine how many
packages already in Guix have non-default ‘arguments’ and modified
phases.  Such packages required a manual fix.

Missing or incorrect dependencies happen, but they are impossible to
evaluate.  However, special ‘arguments’ are something we can evaluate
and, for now, no importer tweaks them, IIUC.  Thus it would sketch the
picture of «how well our importers are doing».

For instance, one could filter on the build system.  Certainly, not all
python-build-system packages come from PyPI, nor all r-build-system ones
from CRAN/Bioconductor, etc.  But, IMHO, such stats would provide a good
estimate of how ready the upstream archives (ELPA, PyPI,
CRAN/Bioconductor, Hackage/Stackage, TeX Live, etc.) are for Guix
without manual intervention.


Cheers,
simon



* Re: Accuracy of importers?
  2021-10-28  9:06 ` zimoun
@ 2021-10-28  9:30   ` zimoun
  0 siblings, 0 replies; 16+ messages in thread
From: zimoun @ 2021-10-28  9:30 UTC
  To: Ludovic Courtès, Guix Devel

Re,

On Thu, 28 Oct 2021 at 11:07, zimoun <zimon.toutoune@gmail.com> wrote:

> For instance, filtered on build-system.  For sure, all
> python-build-system packages do not come from PyPI, r-build-system from
> CRAN/Bioconductor, etc. but, IMHO, such stats would provide a good
> estimation for how upstream archives ELPA, PyPI, CRAN/Bioconductor,
> Hackage/Stackage, TexLive, etc. are ready for Guix without manual
> intervention.

Ah, better: filter on the URIs, e.g., pypi-uri, cran-uri,
bioconductor-uri, crate-uri, and so on, i.e., on where each importer
looks.  Then count how many of these packages have non-default
'arguments'.  This should give an estimate of the accuracy of
importers, IMHO.
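
One way to sketch this count from ‘guix repl’ is shown below.  It is a
rough, untested illustration rather than a finished tool: helpers like
pypi-uri and cran-uri expand to plain URL strings, so the sketch matches
on URL prefixes instead.  The names ‘source-urls’, ‘from-archive?’ and
‘tally’ are hypothetical, and the prefixes in the usage note are
assumptions about what those helpers produce.

```scheme
(use-modules (gnu packages)
             (guix packages)
             (srfi srfi-1)
             (ice-9 match))

(define (source-urls package)
  ;; Return the list of URL strings of PACKAGE's source, or '() when the
  ;; source is not a plain URL download (e.g., a Git checkout).
  (match (package-source package)
    ((? origin? origin)
     (match (origin-uri origin)
       ((? string? url) (list url))
       (((? string? urls) ...) urls)
       (_ '())))
    (_ '())))

(define (from-archive? package prefix)
  ;; True if one of PACKAGE's source URLs starts with PREFIX.
  (any (lambda (url) (string-prefix? prefix url))
       (source-urls package)))

(define (tally prefix)
  ;; Return a pair (TOTAL . CUSTOMIZED): the number of packages fetched
  ;; from PREFIX, and how many of them have non-default 'arguments'.
  (fold-packages (lambda (package result)
                   (match result
                     ((total . customized)
                      (if (from-archive? package prefix)
                          (cons (+ 1 total)
                                (if (null? (package-arguments package))
                                    customized
                                    (+ 1 customized)))
                          result))))
                 '(0 . 0)))
```

For example, (tally "mirror://cran/") or
(tally "https://files.pythonhosted.org/packages/source/") returns such a
pair; CUSTOMIZED divided by TOTAL approximates the fraction of packages
whose ‘arguments’ needed manual changes.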

Cheers,
simon



* Re: Accuracy of importers?
  2021-10-28  8:54   ` Ludovic Courtès
@ 2021-10-28 10:06     ` Lars-Dominik Braun
  0 siblings, 0 replies; 16+ messages in thread
From: Lars-Dominik Braun @ 2021-10-28 10:06 UTC
  To: Ludovic Courtès; +Cc: Guix Devel

Hi Ludo’,

> Right.  PyPI/setup.py/.whl doesn’t contain info as to how to run tests,
> right?
technically setup.py has a standard test target, but it’s been
deprecated for years and it must be enabled manually by the project. I’m
not aware of any standard pyproject.toml approach to this. It might be
possible to parse tox.ini.

> >>    hackage      ?
> >>    stackage     (Lars?)
> > I’ve mostly used the updater, not the importer, so I can’t say a
> > number unfortunately.
> Did the updater suggest input changes?
Yes, I added it in 127828ddd74fc950c0403ca58a6f650355e3d67d, but it
cannot update #:cabal-revision, which is a common source of errors.  It
would be nice if updaters could just return an entirely new package and
the generic updater code would modify/merge the existing package
definition as needed.

Cheers,
Lars




* Re: Accuracy of importers?
  2021-10-28  7:02 Accuracy of importers? Ludovic Courtès
  2021-10-28  8:17 ` Lars-Dominik Braun
  2021-10-28  9:06 ` zimoun
@ 2021-10-28 11:38 ` Julien Lepiller
  2021-10-28 12:25 ` Ricardo Wurmus
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Julien Lepiller @ 2021-10-28 11:38 UTC
  To: guix-devel, Ludovic Courtès, Guix Devel

On 28 October 2021 03:02:27 GMT-04:00, "Ludovic Courtès" <ludovic.courtes@inria.fr> wrote:
>Hello Guix!
>
>As I’m preparing my PackagingCon talk and wondering how language package
>managers could make our lives easier, I thought it’d be interesting to
>know how well our importers are doing.
>
>My understanding is that most of them require manual intervention—i.e.,
>one has to tweak what ‘guix import’ produces, even if we ignore
>synopsis/description/license, to set the right inputs, etc.  If we were
>to estimate the fraction of imported packages for which manual changes
>are needed, what would it look like?
>
>   importer     fraction of imported packages needing changes
>
>   gnu          90% (doesn’t know about dependencies)
>   pypi         50% (some miss source distro, “sdist”; some have
>                     non-Python deps)
>   cpan         ?
>   hackage      ?
>   stackage     (Lars?)
>   egg          (Xinglu?)
>   elpa         (Nicolas?)
>   gem          ?
>   go           (Sarah? Leo? Raghav?)
>   cran         5% (Ricardo? Simon? seems to almost always work?)
>   crate        10% (Efraim?)
>   texlive      (Ricardo? Thiago? Marius?)
>   opam         (Julien?)

I find it pretty good: when I recently imported a large number of packages, I was able to build all of them without modification. However, many rely on a GitHub tarball, so I would change the source in those cases before sending them to Guix.

>   minetest     (Maxime? Vivien?)
>   julia (WIP)  (Simon?)
>   npm (WIP)    (Jelle? Timothy?)
>
>(Lower is better.)  What would be your estimate?  
>
>Among those, which importers provide source that differs from what you’d
>get from upstream’s checkout or release tarballs?  My guess:
>
>   pypi (see LastPyMile paper)
>   elpa (gives hosted tarballs that can differ from upstream repo)
>   gem (similar to PyPI)
>   npm (ditto)
>
>What about licensing info: which ones provide accurate licensing info?
>My guess:
>
>   gnu
>   pypi
>   cpan
>   cran
>   elpa
>   go (?)
>   cran
>   crate (?)
>   texlive
>   opam (?)
>   minetest (?)
>
>TIA! :-)
>
>Ludo’.
>




* Re: Accuracy of importers?
  2021-10-28  7:02 Accuracy of importers? Ludovic Courtès
                   ` (2 preceding siblings ...)
  2021-10-28 11:38 ` Julien Lepiller
@ 2021-10-28 12:25 ` Ricardo Wurmus
  2021-10-28 14:47 ` Katherine Cox-Buday
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Ricardo Wurmus @ 2021-10-28 12:25 UTC
  To: Ludovic Courtès; +Cc: guix-devel

[-- Attachment #1: Type: text/plain, Size: 3001 bytes --]


Ludovic Courtès <ludovic.courtes@inria.fr> writes:

> Hello Guix!
>
> As I’m preparing my PackagingCon talk and wondering how language package
> managers could make our lives easier, I thought it’d be interesting to
> know how well our importers are doing.
>
> My understanding is that most of them require manual 
> intervention—i.e.,
> one has to tweak what ‘guix import’ produces, even if we ignore
> synopsis/description/license, to set the right inputs, etc.  If we were
> to estimate the fraction of imported packages for which manual changes
> are needed, what would it look like?
>
>    importer     fraction of imported packages needing changes
[…]
>    cran         5% (Ricardo? Simon? seems to almost always work?)

Like Lars and Simon wrote: the importers work *really* well for 
both CRAN and Bioconductor, so much so that I’m using them in the 
background here:

https://git.elephly.net/gitweb.cgi?p=software/r-guix-install.git;a=blob;f=guix-install.R;h=2766aa1f2d248a8ed2a4eb4c3244b85574d326e2;hb=HEAD

The biggest annoyance is the missing “license:” prefix when
packaging things for gnu/packages/cran.scm or
gnu/packages/bioconductor.scm.  Descriptions need regular clean-up
work (e.g. to complete sentences), even though we’re using some
heuristics to fix the most common stylistic problems.  It’s really
not a big deal, though.

The biggest missing feature is recursive import of dependencies
hosted on GitHub or in Mercurial repositories (with “-r -a git” or
“-r -a hg”).  I.e., a package on GitHub that declares a dependency
on another package that’s also only hosted on GitHub will fail to
import that dependency.  This is pretty rare, but it happens with
experimental bioinfo software.

>    texlive      (Ricardo? Thiago? Marius?)

This one is not usable.  I’d even add “at all”.  I keep announcing 
that one day I’ll replace it with a new importer, but that new 
importer just isn’t ready yet.

> What about licensing info: which ones provide accurate licensing info?
> My guess:
>
>    gnu
>    pypi
>    cpan
>    cran

The CRAN importer is as accurate as upstream allows.  CRAN
requires a free license; Bioconductor requires a license
declaration.  There have been very few cases where the license was
not correct, but a number of cases where the license was non-free,
such as the Artistic 1.0 license.  Bioconductor is sometimes
sneaky: the R code is free but a necessary library is not.

>    texlive

Pretty terrible.  The license declaration is generally too vague. 
Licenses are often declared without version number, and sometimes 
it’s just some generic “free” license.  A new importer based on 
texlive.tlpdb would not improve this by much, because the upstream 
declarations are just spotty and unreliable.

-- 
Ricardo


PS: attached is a rough WIP patch of what I had been using to 
import new texlive stuff.


[-- Attachment #2: texlive-import.diff --]
[-- Type: text/x-patch, Size: 6523 bytes --]

diff --git a/guix/import/texlive.scm b/guix/import/texlive.scm
index 18d8b95ee0..b94aa1cf40 100644
--- a/guix/import/texlive.scm
+++ b/guix/import/texlive.scm
@@ -19,10 +19,12 @@
 
 (define-module (guix import texlive)
   #:use-module (ice-9 match)
+  #:use-module (ice-9 rdelim)
   #:use-module (sxml simple)
   #:use-module (sxml xpath)
   #:use-module (srfi srfi-11)
   #:use-module (srfi srfi-1)
+  #:use-module (srfi srfi-2)
   #:use-module (srfi srfi-26)
   #:use-module (srfi srfi-34)
   #:use-module (web uri)
@@ -125,9 +127,9 @@ (define (fetch-sxml name)
       (xml->sxml (http-fetch url)
                  #:trim-whitespace? #t))))
 
-(define (guix-name component name)
+(define (guix-name name)
   "Return a Guix package name for a given Texlive package NAME."
-  (string-append "texlive-" component "-"
+  (string-append "texlive-"
                  (string-map (match-lambda
                                (#\_ #\-)
                                (#\. #\-)
@@ -186,12 +188,123 @@ (define (sxml-value path)
                      ((lst ...) `(list ,@lst))
                      (license license)))))))
 
+(define tlpdb
+  (memoize
+   (lambda ()
+     (let ((file "/home/rekado/dev/gx/branches/master/texlive.tlpdb")
+           (fields
+            '((name     . string)
+              (shortdesc . string)
+              (longdesc . string)
+              (catalogue-license . string)
+              (catalogue-ctan . string)
+              (srcfiles . list)
+              (runfiles . list)
+              (docfiles . list)
+              (depend   . list)))
+           (record
+            (lambda* (key value alist #:optional (type 'string))
+              (let ((new
+                     (or (and=> (assoc-ref alist key)
+                                (lambda (existing)
+                                  (cond
+                                   ((eq? type 'string)
+                                    (string-append existing " " value))
+                                   ((eq? type 'list)
+                                    (cons value existing)))))
+                         (cond
+                          ((eq? type 'string)
+                           value)
+                          ((eq? type 'list)
+                           (list value))))))
+                (acons key new (alist-delete key alist))))))
+       (call-with-input-file file
+         (lambda (port)
+           (let loop ((all (list))
+                      (current (list))
+                      (last-property #false))
+             (let ((line (read-line port)))
+               (cond
+                ((eof-object? line) all)
+
+                ;; End of record.
+                ((string-null? line)
+                 (loop (cons (cons (assoc-ref current 'name) current)
+                             all)
+                       (list) #false))
+
+                ;; Continuation of a list
+                ((and (eqv? 0 (string-index line #\space)) last-property)
+                 ;; Erase optional second part of list values like
+                 ;; "details=Readme" for files
+                 (let ((plain-value (first
+                                     (string-split
+                                      (string-trim-both line) #\space))))
+                   (loop all (record last-property
+                                     plain-value
+                                     current
+                                     'list)
+                         last-property)))
+                (else
+                 (or (and-let* ((space (string-index line #\space))
+                                (key   (string->symbol (string-take line space)))
+                                (value (string-drop line (1+ space)))
+                                (field-type (assoc-ref fields key)))
+                       ;; Erase second part of list keys like "size=29"
+                       (if (eq? field-type 'list)
+                           (loop all current key)
+                           (loop all (record key value current field-type) key)))
+                     (loop all current #false))))))))))))
+
+(define (files->directories files)
+  (map (cut string-join <> "/" 'suffix)
+       (delete-duplicates (map (lambda (file)
+                                 (drop-right (string-split file #\/) 1))
+                               files)
+                          equal?)))
+
+(define (tlpdb->package name)
+  (and-let* ((data (assoc-ref (tlpdb) name))
+             (dirs (files->directories
+                    (append (or (assoc-ref data 'docfiles) (list))
+                            (or (assoc-ref data 'runfiles) (list))
+                            (or (assoc-ref data 'srcfiles) (list))))))
+    (pk data)
+    ;; TODO
+    `(package
+       (name ,(guix-name name))
+       (version (number->string %texlive-revision))
+       (source (texlive-origin name version
+                               ',dirs
+                               (base32
+                                "TODO"
+                                #;
+                                ,(bytevector->nix-base32-string
+                                  (let-values (((port get-hash) (open-sha256-port)))
+                                    (write-file checkout port)
+                                    (force-output port)
+                                    (get-hash))))))
+       (build-system texlive-build-system)
+       (arguments ,`(,'quote (#:tex-directory "TODO")))
+       ,@(or (and=> (assoc-ref data 'depend)
+                    (lambda (inputs)
+                      `((propagated-inputs ,inputs))))
+             '())
+       ,@(or (and=> (assoc-ref data 'catalogue-ctan)
+                    (lambda (url)
+                      `((home-page ,(string-append "https://ctan.org" url)))))
+             '((home-page "https://www.tug.org/texlive/")))
+       (synopsis ,(assoc-ref data 'shortdesc))
+       (description ,(beautify-description
+                      (assoc-ref data 'longdesc)))
+       (license ,(string->license
+                  (assoc-ref data 'catalogue-license))))))
+
 (define texlive->guix-package
   (memoize
    (lambda* (package-name #:optional (component "latex"))
      "Fetch the metadata for PACKAGE-NAME from REPO and return the `package'
 s-expression corresponding to that package, or #f on failure."
-     (and=> (fetch-sxml package-name)
-            (cut sxml->package <> component)))))
+     (tlpdb->package package-name))))
 
 ;;; ctan.scm ends here


* Re: Accuracy of importers?
  2021-10-28  7:02 Accuracy of importers? Ludovic Courtès
                   ` (3 preceding siblings ...)
  2021-10-28 12:25 ` Ricardo Wurmus
@ 2021-10-28 14:47 ` Katherine Cox-Buday
  2021-10-29 19:29 ` Nicolas Goaziou
  2021-10-30 10:55 ` Xinglu Chen
  6 siblings, 0 replies; 16+ messages in thread
From: Katherine Cox-Buday @ 2021-10-28 14:47 UTC
  To: Ludovic Courtès; +Cc: Guix Devel

Ludovic Courtès <ludovic.courtes@inria.fr> writes:

>    go           (Sarah? Leo? Raghav?)

I have only used this a few times so far, but the quality seems to have gotten a lot better. My impression, though, is that because we have to generate packages without relying on a centralized GOPROXY server (namely one controlled by Google), we stumble when dealing with the heterogeneity of the internet. There are a few things which could make this situation better:

There is an open issue[1] for a better API to https://pkg.go.dev which may eventually allow us to query for things like license, VCS path, etc. This could obviate Guix's need to crawl the internet.

I was also discussing[2] the pros/cons of relying on the Go tool-chain to do most of the work for us. I think doing so might be making the right trade-offs, but it sounds like[3] we are blocked on cgit's ability to work with shallow checkouts. Since Guix has a build environment, maybe we could just use the Git CLI instead of a Scheme library when necessary.

I hope this helps, and good luck with your talk!

[1] - https://github.com/golang/go/issues/36785
[2] - https://lists.gnu.org/archive/html/guix-devel/2021-09/msg00344.html
[3] - https://lists.gnu.org/archive/html/guix-devel/2021-10/msg00020.html

-- 
Katherine



* Re: Accuracy of importers?
  2021-10-28  7:02 Accuracy of importers? Ludovic Courtès
                   ` (4 preceding siblings ...)
  2021-10-28 14:47 ` Katherine Cox-Buday
@ 2021-10-29 19:29 ` Nicolas Goaziou
  2021-10-29 23:08   ` Carlo Zancanaro
  2021-10-30 10:55 ` Xinglu Chen
  6 siblings, 1 reply; 16+ messages in thread
From: Nicolas Goaziou @ 2021-10-29 19:29 UTC
  To: Ludovic Courtès; +Cc: Guix Devel

Hello,

Ludovic Courtès <ludovic.courtes@inria.fr> writes:

> My understanding is that most of them require manual intervention—i.e.,
> one has to tweak what ‘guix import’ produces, even if we ignore
> synopsis/description/license, to set the right inputs, etc.  If we were
> to estimate the fraction of imported packages for which manual changes
> are needed, what would it look like?
>
>    importer     fraction of imported packages needing changes
>
>    gnu          90% (doesn’t know about dependencies)
>    pypi         50% (some miss source distro, “sdist”; some have
>                      non-Python deps)
>    cpan         ?
>    hackage      ?
>    stackage     (Lars?)
>    egg          (Xinglu?)
>    elpa         (Nicolas?)

The elpa importer is accurate. Manual changes are often (I would say
around 75%) required for the description field, though.

However, the generated source URI is not reliable (see bug #46849),
which means the importer is not practical. Using it means the imported
package will need to be updated quickly.

> Among those, which importers provide source that differs from what you’d
> get from upstream’s checkout or release tarballs?  My guess:
>
>    pypi (see LastPyMile paper)
>    elpa (gives hosted tarballs that can differ from upstream repo)

Indeed.

>    gem (similar to PyPI)
>    npm (ditto)
>
> What about licensing info: which ones provide accurate licensing info?
> My guess:
>
>    gnu
>    pypi
>    cpan
>    cran
>    elpa

Correct

Regards,
-- 
Nicolas Goaziou



* Re: Accuracy of importers?
  2021-10-28  8:17 ` Lars-Dominik Braun
  2021-10-28  8:54   ` Ludovic Courtès
@ 2021-10-29 21:57   ` Ludovic Courtès
  2021-10-30 15:49     ` zimoun
  1 sibling, 1 reply; 16+ messages in thread
From: Ludovic Courtès @ 2021-10-29 21:57 UTC
  To: Guix Devel

[-- Attachment #1: Type: text/plain, Size: 1742 bytes --]

Hello!

Thanks everyone for your feedback!

Lars-Dominik Braun <lars@6xq.net> wrote:

> Would it be possible to just run the importer again for existing packages
> and compare the result (minus synopsis/description) with what’s
> available in Guix? That should give you much more accurate numbers than
> our guesswork.

Turned out to be trickier than we could hope, primarily because the
relevant importers did not support imports of a specific version.  I
fixed it for CRAN and PyPI here:

  https://issues.guix.gnu.org/51493

With the attached script plus the changes above, I can already get some
insight.  Here’s what I get for a sample of 200 PyPI packages and 200
CRAN packages:

--8<---------------cut here---------------start------------->8---
$ SAMPLE_SIZE=200 ./pre-inst-env guile ~/src/guix-debugging/importer-accuracy.scm
[…]
Accuracy for 'pypi' (200 packages):
  accurate: 58 (29%)
  different inputs: 142 (71%)
  different source: 0 (0%)
  inconclusive: 0 (0%)
Accuracy for 'cran' (200 packages):
  accurate: 176 (88%)
  different inputs: 23 (12%)
  different source: 1 (0%)
  inconclusive: 0 (0%)
--8<---------------cut here---------------end--------------->8---

(It’s quite expensive to run because it downloads a whole bunch of
things and tries many 404 URLs in the case of CRAN before finding the
right one.)
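
A note on reading the percentages above: they are rounded to whole
numbers, and Scheme's ‘round’ sends ties to the nearest even integer.
That is why 1 package out of 200 (0.5%) shows up as 0% while 23 out of
200 (11.5%) shows up as 12%.  Here is a minimal sketch of that
arithmetic.  It uses exact rationals rather than the script's inexact
multiplication (an assumption made to keep the example deterministic),
and ‘percentage’ is a hypothetical helper name.

```scheme
(define (percentage count total)
  ;; Exact arithmetic: (/ 23 200) is the rational 23/200, and 'round'
  ;; rounds ties such as 23/2 or 1/2 to the nearest even integer.
  (round (* 100 (/ count total))))

;; Counts taken from the PyPI and CRAN results above.
(for-each (lambda (count)
            (format #t "~a/200 -> ~a%~%" count (percentage count 200)))
          '(58 142 176 23 1))
;; Prints 29%, 71%, 88%, 12% and 0%, respectively.
```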

The script doesn’t do anything useful for crates because they have their
own way of representing inputs.  It doesn’t account for changes in
‘arguments’ like zimoun suggested, meaning it’s overestimating accuracy.

It’d be nice to run it on gems but that importer doesn’t support
versioning either.

To be continued…

Thanks,
Ludo’.


[-- Attachment #2: importer-accuracy.scm --]
[-- Type: text/plain, Size: 9260 bytes --]

;;; Released under the GNU GPLv3 or any later version.
;;; Copyright © 2021 Ludovic Courtès <ludo@gnu.org>

(use-modules (guix)
             (gnu packages)
             (guix import cran)
             (guix import crate)
             (guix import pypi)
             ((guix import print) #:select (package->code))
             ((guix upstream) #:select (url-predicate))
             (guix diagnostics)
             (guix i18n)
             (srfi srfi-1)
             (srfi srfi-9)
             (srfi srfi-9 gnu)
             (ice-9 match))

(define-record-type <reimporter>
  (reimporter name pred import)
  reimporter?
  (name   reimporter-name)
  (pred   reimporter-predicate)
  (import reimporter-import))

(define (find-reimporter package)
  (find (lambda (reimporter)
          ((reimporter-predicate reimporter) package))
        %reimporters))

(define (accurate-import? package)
  (define (sexp-field sexp field)
    (match sexp
      (((or 'package 'origin) fields ...)
       (match (assoc field fields)
         ((key value) value)
         (_ #f)))
      (('define-public _ exp)
       (sexp-field exp field))))

  (define (same-source? sexp1 sexp2)
    (equal? (sexp-field (sexp-field sexp1 'source) 'sha256)
            (sexp-field (sexp-field sexp2 'source) 'sha256)))

  (define canonicalize-input
    ;; 'package->code' creates '@' references but importers don't.  Remove
    ;; the '@' to allow comparison.
    (match-lambda
      (("gfortran" _)
       ;; 'package->code' emits nonsense for the value associated with this
       ;; one, so trust the label.
       `("gfortran" ,(list 'unquote 'gfortran)))
      ((label ('unquote ('@ _ variable)) . rest)
       `(,label ,(list 'unquote variable) ,@rest))
      (x x)))

  (define (equivalent-inputs? inputs1 inputs2)
    (if (and inputs1 inputs2)
        (lset= equal?
               (match inputs1
                 (('quasiquote inputs)
                  (map canonicalize-input inputs)))
               (match inputs2
                 (('quasiquote inputs)
                  (map canonicalize-input inputs))))
        (equal? inputs1 inputs2)))

  (let* ((reimporter (find-reimporter package))
         (imported   ((reimporter-import reimporter) package))
         (actual     (package->code package)))
    (define (same-inputs? field)
      (equivalent-inputs? (sexp-field imported field)
                          (sexp-field actual field)))

    (if imported
        (if (and (same-inputs? 'inputs)
                 (same-inputs? 'native-inputs)
                 (same-inputs? 'propagated-inputs))
            (if (same-source? actual imported)
                'accurate
                (begin
                  (warning (package-location package)
                           (G_ "~a: source differs from upstream~%")
                           (package-full-name package))
                  'different-source))
            (begin
              (warning (package-location package)
                       (G_ "~a: inputs differ from upstream~%")
                       (package-full-name package))
              'different-inputs))
        'inconclusive)))

;; Stats.
(define-record-type <accuracy>
  (accuracy accurate different-inputs different-source inconclusive)
  accuracy?
  (accurate         accuracy-accurate)
  (different-inputs accuracy-different-inputs)
  (different-source accuracy-different-source)
  (inconclusive     accuracy-inconclusive))

(define (display-accuracy reimporter accuracy port)
  "Display on PORT the counts and percentages recorded in ACCURACY for
REIMPORTER."
  (define total
    (letrec-syntax ((sum (syntax-rules ()
                           ((_) 0)
                           ((_ get rest ...)
                            (+ (get accuracy) (sum rest ...))))))
      (sum accuracy-accurate
           accuracy-different-inputs
           accuracy-different-source
           accuracy-inconclusive)))

  (define (% fraction)
    (inexact->exact (round (* 100. fraction))))

  (format port (G_ "Accuracy for '~a' (~a packages):~%")
          (reimporter-name reimporter) total)
  (format port (G_ "  accurate: ~a (~d%)~%")
          (accuracy-accurate accuracy)
          (% (/ (accuracy-accurate accuracy) total)))
  (format port (G_ "  different inputs: ~a (~d%)~%")
          (accuracy-different-inputs accuracy)
          (% (/ (accuracy-different-inputs accuracy) total)))
  (format port (G_ "  different source: ~a (~d%)~%")
          (accuracy-different-source accuracy)
          (% (/ (accuracy-different-source accuracy) total)))
  (format port (G_ "  inconclusive: ~a (~d%)~%")
          (accuracy-inconclusive accuracy)
          (% (/ (accuracy-inconclusive accuracy) total))))

(define (random-seed)
  (logxor (getpid) (car (gettimeofday))))

(define shuffle                           ;copied from (guix scripts offload)
  (let ((state (seed->random-state (random-seed))))
    (lambda (lst)
      "Return LST shuffled (using the Fisher-Yates algorithm.)"
      (define vec (list->vector lst))
      (let loop ((result '())
                 (i (vector-length vec)))
        (if (zero? i)
            result
            (let* ((j (random i state))
                   (val (vector-ref vec j)))
              (vector-set! vec j (vector-ref vec (- i 1)))
              (loop (cons val result) (- i 1))))))))

;;;
;;; Reimporters.
;;;

(define pypi-package?                         ;copied from (guix import pypi)
  (url-predicate
   (lambda (url)
     (or (string-prefix? "https://pypi.org/" url)
         (string-prefix? "https://pypi.python.org/" url)
         (string-prefix? "https://pypi.org/packages" url)
         (string-prefix? "https://files.pythonhosted.org/packages" url)))))

(define guix-package->pypi-name
  (@@ (guix import pypi) guix-package->pypi-name))

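(define* (package-sample reimporter
                         #:optional (size (or (and=> (getenv "SAMPLE_SIZE")
                                                     string->number)
                                              20)))
  "Return a random sample of at most SIZE packages matching REIMPORTER's
predicate, skipping superseded packages and 'python2-' variants.  SIZE
defaults to the SAMPLE_SIZE environment variable, or 20."
  (let* ((pred       (reimporter-predicate reimporter))
         (candidates (shuffle
                      (fold-packages (lambda (package lst)
                                       (if (and (pred package)
                                                (not (package-superseded package))
                                                (not (string-prefix? "python2-"
                                                                     (package-name package))))
                                           (cons package lst)
                                           lst))
                                     '()))))
    ;; 'take' raises an error when asked for more elements than the list
    ;; holds, so cap the sample size at the number of candidates.
    (take candidates (min size (length candidates)))))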

(define-syntax-rule (increment record field)
  (set-field record (field) (+ 1 (field record))))

(define (import-accuracy packages)
  "Return an <accuracy> record counting the outcome of 'accurate-import?' for
each of PACKAGES."
  (fold (lambda (package accuracy)
          (match (accurate-import? package)
            ('accurate (increment accuracy accuracy-accurate))
            ('different-inputs (increment accuracy accuracy-different-inputs))
            ('different-source (increment accuracy accuracy-different-source))
            ('inconclusive (increment accuracy accuracy-inconclusive))))
        (accuracy 0 0 0 0)
        packages))

(define (package->cran-name package)          ;copied from (guix import cran)
  "Return the upstream name of the PACKAGE."
  (let ((upstream-name (assoc-ref (package-properties package) 'upstream-name)))
    (if upstream-name
        upstream-name
        (match (package-source package)
          ((? origin? origin)
           (match (origin-uri origin)
             ((or (? string? url) (url _ ...))
              (let ((end   (string-rindex url #\_))
                    (start (string-rindex url #\/)))
                ;; The URL ends on
                ;; (string-append "/" name "_" version ".tar.gz")
                (and start end (substring url (+ start 1) end))))
             (_ #f)))
          (_ #f)))))


(define %pypi-reimporter
  (reimporter 'pypi pypi-package?
              (lambda (package)
                (pypi->guix-package
                 (guix-package->pypi-name package)
                 #:version (package-version package)))))

(define %cran-reimporter
  (reimporter 'cran cran-package?
              (lambda (package)
                (cran->guix-package
                 (package->cran-name package)
                 #:version (package-version package)))))

(define crate-package?
  (url-predicate (@@ (guix import crate) crate-url?)))

(define %crate-reimporter
  (reimporter 'crate crate-package?
              (lambda (package)
                (crate->guix-package
                 (guix-package->crate-name package)
                 #:version (package-version package)))))

(define %reimporters
  (list %pypi-reimporter
        %cran-reimporter

        ;; XXX: Useless since Rust packages don't use the normal inputs
        ;; fields.
        ;; %crate-reimporter
        ))

(let ((results (map (compose import-accuracy package-sample) %reimporters)))
  (for-each (lambda (reimporter result)
              (display-accuracy reimporter result
                                (current-output-port)))
            %reimporters
            results))



* Re: Accuracy of importers?
  2021-10-29 19:29 ` Nicolas Goaziou
@ 2021-10-29 23:08   ` Carlo Zancanaro
  0 siblings, 0 replies; 16+ messages in thread
From: Carlo Zancanaro @ 2021-10-29 23:08 UTC (permalink / raw)
  To: Nicolas Goaziou; +Cc: guix-devel, Ludovic Courtès

Hi Nicolas/Ludo,

On Fri, Oct 29 2021, Nicolas Goaziou wrote:
>> Among those, which importers provide source that differs from 
>> what you’d get from upstream’s checkout or release tarballs? 
>> My guess:
>>
>>    elpa (gives hosted tarballs that can differ from upstream 
>>    repo)
>
> Indeed.

For MELPA specifically there is code to grab the upstream Git 
repository details, but it doesn't seem to work when the module is 
compiled. See https://issues.guix.gnu.org/49006 for some details.

Carlo



* Re: Accuracy of importers?
  2021-10-28  7:02 Accuracy of importers? Ludovic Courtès
                   ` (5 preceding siblings ...)
  2021-10-29 19:29 ` Nicolas Goaziou
@ 2021-10-30 10:55 ` Xinglu Chen
  6 siblings, 0 replies; 16+ messages in thread
From: Xinglu Chen @ 2021-10-30 10:55 UTC (permalink / raw)
  To: Ludovic Courtès, Guix Devel

On Thu, Oct 28 2021, Ludovic Courtès wrote:

> Hello Guix!
>
> As I’m preparing my PackagingCon talk and wondering how language package
> managers could make our lives easier, I thought it’d be interesting to
> know how well our importers are doing.
>
> My understanding is that most of them require manual intervention—i.e.,
> one has to tweak what ‘guix import’ produces, even if we ignore
> synopsis/description/license, to set the right inputs, etc.  If we were
> to estimate the fraction of imported packages for which manual changes
> are needed, what would it look like?
>
>    importer     fraction of imported packages needing changes
>
>    gnu          90% (doesn’t know about dependencies)
>    pypi         50% (some miss source distro, “sdist”; some have
>                      non-Python deps)
>    cpan         ?
>    hackage      ?
>    stackage     (Lars?)

The Stackage importer is mostly based on the Hackage importer, and both are
unable to parse certain things in the .cabal files.[1][2]  I would say that
this happens in maybe 1/15 to 1/20 of cases.

[1]: <https://issues.guix.gnu.org/36690>
[2]: <https://issues.guix.gnu.org/35743>

>    egg          (Xinglu?)

I haven’t used it that much, but I would say it works ~80% of the time.  Some
egg packages specify system dependencies (e.g., OpenSSL), but the
importer doesn’t know what the name of that package is in Guix, so the
result is not always correct.

> What about licensing info: which ones provide accurate licensing info?
> My guess:
>
>    gnu
>    pypi
>    cpan
>    cran
>    elpa
>    go (?)
>    cran
>    crate (?)
>    texlive
>    opam (?)
>    minetest (?)

For the egg importer, many packages specify the wrong license in their
.egg file, and there is no convention for what naming scheme to use, so
sometimes it is ‘GPL3’, other times it is ‘GPL-3.0’.
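
To illustrate the naming problem, here is a minimal sketch of how such
spellings could be folded into one canonical name.  The alias table and the
'normalize-license' procedure are hypothetical, written for this example
only; they are not part of the egg importer.

```scheme
;; Hypothetical alias table: lowercase upstream spelling -> canonical symbol.
;; The entries are illustrative, not taken from any importer.
(define %license-aliases
  '(("gpl3"    . gpl3)
    ("gpl-3.0" . gpl3)
    ("gplv3"   . gpl3)))

(define (normalize-license str)
  "Return the canonical symbol for the license string STR, or #f if STR is
not in %LICENSE-ALIASES and needs manual review."
  (assoc-ref %license-aliases
             (string-downcase (string-trim-both str))))

(normalize-license "GPL3")       ;both spellings map to the same symbol
(normalize-license " GPL-3.0 ")
(normalize-license "Artistic")   ;unknown spelling, flagged with #f
```

Returning #f for unknown spellings, rather than guessing, keeps a human in
the loop for the cases where the .egg file is simply wrong.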

The Hackage/Stackage importer generally results in correct licenses, so
I would also put it on this list.  :-)


* Re: Accuracy of importers?
  2021-10-29 21:57   ` Ludovic Courtès
@ 2021-10-30 15:49     ` zimoun
  2021-11-09 16:48       ` Ludovic Courtès
  0 siblings, 1 reply; 16+ messages in thread
From: zimoun @ 2021-10-30 15:49 UTC (permalink / raw)
  To: Ludovic Courtès, Guix Devel

Hi Ludo,

On Fri, 29 Oct 2021 at 23:57, Ludovic Courtès <ludo@gnu.org> wrote:

> (It’s quite expensive to run because it downloads a whole bunch of
> things and tries many 404 URLs in the case of CRAN before finding the
> right one.)

Ah… so that requires investigation.


> --8<---------------cut here---------------start------------->8---
> $ SAMPLE_SIZE=200 ./pre-inst-env guile ~/src/guix-debugging/importer-accuracy.scm
> […]
> Accuracy for 'pypi' (200 packages):
>   accurate: 58 (29%)
>   different inputs: 142 (71%)
>   different source: 0 (0%)
>   inconclusive: 0 (0%)
> Accuracy for 'cran' (200 packages):
>   accurate: 176 (88%)
>   different inputs: 23 (12%)
>   different source: 1 (0%)
>   inconclusive: 0 (0%)
> --8<---------------cut here---------------end--------------->8---

[...]

> The script doesn’t do anything useful for crates because they have their
> own way of representing inputs.  It doesn’t account for changes in
> ‘arguments’ like zimoun suggested, meaning it’s overestimating
> accuracy.

These are already quite interesting results, because they show upstream
stability, IIUC.  Well, it means that if one ran “guix import pypi” one
month ago and runs the same now, 71% of packages have different
inputs.  Right?  It is because some metadata from PyPI changed, right?
Not because “guix import pypi” was doing wrong and now it does better,
right?

IMHO, it shows how PyPI allows bad packaging practices, doesn’t it?

My understanding of this experiment is about upstream “quality”, not
about importer “accuracy”.  Do I incorrectly understand?


Cheers,
simon



* Re: Accuracy of importers?
  2021-10-30 15:49     ` zimoun
@ 2021-11-09 16:48       ` Ludovic Courtès
  2021-11-09 18:36         ` zimoun
  0 siblings, 1 reply; 16+ messages in thread
From: Ludovic Courtès @ 2021-11-09 16:48 UTC (permalink / raw)
  To: zimoun; +Cc: Guix Devel

Hi,

zimoun <zimon.toutoune@gmail.com> skribis:

> On Fri, 29 Oct 2021 at 23:57, Ludovic Courtès <ludo@gnu.org> wrote:

[...]

>> --8<---------------cut here---------------start------------->8---
>> $ SAMPLE_SIZE=200 ./pre-inst-env guile ~/src/guix-debugging/importer-accuracy.scm
>> […]
>> Accuracy for 'pypi' (200 packages):
>>   accurate: 58 (29%)
>>   different inputs: 142 (71%)
>>   different source: 0 (0%)
>>   inconclusive: 0 (0%)
>> Accuracy for 'cran' (200 packages):
>>   accurate: 176 (88%)
>>   different inputs: 23 (12%)
>>   different source: 1 (0%)
>>   inconclusive: 0 (0%)
>> --8<---------------cut here---------------end--------------->8---
>
> [...]
>
>> The script doesn’t do anything useful for crates because they have their
>> own way of representing inputs.  It doesn’t account for changes in
>> ‘arguments’ like zimoun suggested, meaning it’s overestimating
>> accuracy.
>
> These are already quite interesting results, because they show upstream
> stability, IIUC.  Well, it means that if one ran “guix import pypi” one
> month ago and runs the same now, 71% of packages have different
> inputs.  Right?  It is because some metadata from PyPI changed, right?

No no; I’m assuming PyPI, CRAN, etc. provide the same info as they did
back when the package was imported (which is probably the case).

> Not because “guix import pypi” was doing wrong and now it does better,
> right?

I’m also assuming that the importer didn’t change significantly in the
meantime, which is probably a good approximation.

What I think those figures show is the amount of manual tweaks necessary
to get a proper package “à la Guix”, with tests running etc.  For PyPI
we often need to add things under ‘native-inputs’, hence the 71%
“different inputs” line.  For CRAN that’s sometimes necessary, but much
less frequently.  There are also cases with non-R/non-Python
dependencies.

> IMHO, it shows how PyPI allows bad packaging practices, doesn’t it?
>
> My understanding of this experiment is about upstream “quality”, not
> about importer “accuracy”.  Do I incorrectly understand?

Yes, in a way, assuming our importers are not lossy, this tells us
whether the upstream repo contains enough information and/or whether
that information is accurate.

Thanks,
Ludo’.



* Re: Accuracy of importers?
  2021-11-09 16:48       ` Ludovic Courtès
@ 2021-11-09 18:36         ` zimoun
  0 siblings, 0 replies; 16+ messages in thread
From: zimoun @ 2021-11-09 18:36 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: Guix Devel

Hi,

On Tue, 09 Nov 2021 at 17:48, Ludovic Courtès <ludo@gnu.org> wrote:

> What I think those figures show is the amount of manual tweaks necessary
> to get a proper package “à la Guix”, with tests running etc.  For PyPI
> we often need to add things under ‘native-inputs’, hence the 71%
> “different inputs” line.  For CRAN that’s sometimes necessary, but much
> less frequently.  There are also cases with non-R/non-Python
> dependencies.

The numbers are based on “dependencies” mismatch.  But this mismatch is
sometimes artificial.  For instance, I am not convinced that upstream
distinguishes between build-time (or test-time) dependencies and run-time
dependencies.  I mean many packages would work with all dependencies
directly inside ’propagated-inputs’ or ’inputs’ (probably what importers
return), whereas “à la Guix” moves some to ’native-inputs’.  Well, I do not
know what we can conclude in the end.
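
One way to factor out that artificial mismatch would be to compare the union
of all input fields rather than each field separately.  Below is a minimal
sketch of the idea, assuming packages are represented as plain alists from
field name to a list of input names; 'all-inputs' and 'same-dependencies?'
are hypothetical helpers, not part of the script, and the package names are
made up for the example.

```scheme
(use-modules (srfi srfi-1))

(define %input-fields '(inputs native-inputs propagated-inputs))

(define (all-inputs package)
  "Return the input names of PACKAGE, an alist mapping field names to lists
of names, regardless of which field lists them."
  (append-map (lambda (field)
                (or (assq-ref package field) '()))
              %input-fields))

(define (same-dependencies? imported actual)
  "Return true if IMPORTED and ACTUAL depend on the same set of names."
  (lset= string=? (all-inputs imported) (all-inputs actual)))

;; Hypothetical example: the importer puts everything in
;; 'propagated-inputs'; the hand-tuned package moves the test tool to
;; 'native-inputs'.
(define imported
  '((propagated-inputs "python-requests" "python-pytest")))
(define actual
  '((propagated-inputs "python-requests")
    (native-inputs "python-pytest")))

;; True here, whereas a field-by-field comparison reports a difference.
(same-dependencies? imported actual)
```

Such a check would only say whether the set of dependencies is right; it
deliberately ignores whether each one sits in the right field.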

For instance, the numbers are:

  Accuracy for 'pypi' (200 packages):
    accurate: 58 (29%)
    different inputs: 142 (71%)
    different source: 0 (0%)
    inconclusive: 0 (0%)
  Accuracy for 'cran' (200 packages):
    accurate: 176 (88%)
    different inputs: 23 (12%)
    different source: 1 (0%)
    inconclusive: 0 (0%)

but from these numbers, how many CRAN packages have dependencies other
than the ones listed in ’propagated-inputs’?  I guess 24.

My point is that there is a strong bias due to the “complexity” of
packages.  If CRAN packages are “simpler”, then indeed they are more
accurate.

Put differently, when picking 200 samples for each importer, each of these
batches of 200 should have the same distribution of inputs:

 - X ’propagated-inputs’ only
 - Y ’propagated-inputs’ and ’inputs’
 - Z ’propagated-inputs’ and ’inputs’ and ’native-inputs’

where X+Y+Z=100%.  Then, the numbers for the two importers become
“comparable”.
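
A minimal sketch of such a stratified sample, again assuming packages are
plain alists from field name to a list of input names.  'input-class' and
'stratified-sample' are hypothetical names, and the quotas are chosen by the
caller; the list is assumed to be shuffled beforehand.

```scheme
(use-modules (srfi srfi-1)
             (ice-9 match))

(define %input-fields '(propagated-inputs inputs native-inputs))

(define (input-class package)
  "Return the list of input fields of PACKAGE that are non-empty."
  (filter (lambda (field)
            (pair? (or (assq-ref package field) '())))
          %input-fields))

(define (stratified-sample packages quotas)
  "Return a sample of PACKAGES holding, for each (CLASS . COUNT) pair in
QUOTAS, at most COUNT packages whose input class is CLASS."
  (append-map (match-lambda
                ((class . count)
                 (let ((members (filter (lambda (package)
                                          (equal? (input-class package) class))
                                        packages)))
                   (take members (min count (length members))))))
              quotas))

;; Hypothetical packages, for illustration only.
(define packages
  '(((name . "a") (propagated-inputs "r-x"))
    ((name . "b") (propagated-inputs "r-x") (inputs "zlib"))
    ((name . "c") (propagated-inputs "r-y"))))

;; One package with only 'propagated-inputs', one with 'inputs' as well.
(stratified-sample packages
                   '(((propagated-inputs) . 1)
                     ((propagated-inputs inputs) . 1)))
```

Using the same quotas for the PyPI and CRAN samples would give both batches
the same X/Y/Z distribution.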
 

>> My understanding of this experiment is about upstream “quality”, not
>> about importer “accuracy”.  Do I incorrectly understand?
>
> Yes, in a way, assuming our importers are not lossy, this tells us
> whether the upstream repo contains enough information and/or whether
> that information is accurate.

Thanks for explaining.


Cheers,
simon



end of thread, other threads:[~2021-11-09 18:43 UTC | newest]

Thread overview: 16+ messages
2021-10-28  7:02 Accuracy of importers? Ludovic Courtès
2021-10-28  8:17 ` Lars-Dominik Braun
2021-10-28  8:54   ` Ludovic Courtès
2021-10-28 10:06     ` Lars-Dominik Braun
2021-10-29 21:57   ` Ludovic Courtès
2021-10-30 15:49     ` zimoun
2021-11-09 16:48       ` Ludovic Courtès
2021-11-09 18:36         ` zimoun
2021-10-28  9:06 ` zimoun
2021-10-28  9:30   ` zimoun
2021-10-28 11:38 ` Julien Lepiller
2021-10-28 12:25 ` Ricardo Wurmus
2021-10-28 14:47 ` Katherine Cox-Buday
2021-10-29 19:29 ` Nicolas Goaziou
2021-10-29 23:08   ` Carlo Zancanaro
2021-10-30 10:55 ` Xinglu Chen
