From: Ludovic Courtès
To: Guix Devel
Subject: Re: Accuracy of importers?
Date: Fri, 29 Oct 2021 23:57:33 +0200
References: <878ryd8we4.fsf@inria.fr>
In-Reply-To: (Lars-Dominik Braun's message of "Thu, 28 Oct 2021 10:17:30 +0200")
Message-ID: <87ilxfwl2q.fsf@gnu.org>
List-Id: "Development of GNU Guix and the GNU System distribution."
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="=-=-="

--=-=-=
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit

Hello!

Thanks everyone for your feedback!

Lars-Dominik Braun wrote:

> Would it be possible to just run the importer again for existing packages
> and compare the result (minus synopsis/description) with what’s
> available in Guix? That should give you much more accurate numbers than
> our guesswork.

This turned out to be trickier than we had hoped, primarily because the
relevant importers did not support importing a specific version.  I
fixed it for CRAN and PyPI here:

  https://issues.guix.gnu.org/51493

With the attached script plus the changes above, I can already get some
insight.  Here’s what I get for a sample of 200 PyPI packages and 200
CRAN packages:

--8<---------------cut here---------------start------------->8---
$ SAMPLE_SIZE=200 ./pre-inst-env guile ~/src/guix-debugging/importer-accuracy.scm
[…]
Accuracy for 'pypi' (200 packages):
 accurate: 58 (29%)
 different inputs: 142 (71%)
 different source: 0 (0%)
 inconclusive: 0 (0%)
Accuracy for 'cran' (200 packages):
 accurate: 176 (88%)
 different inputs: 23 (12%)
 different source: 1 (0%)
 inconclusive: 0 (0%)
--8<---------------cut here---------------end--------------->8---

(It’s quite expensive to run because it downloads a whole bunch of
things and, in the case of CRAN, tries many 404 URLs before finding the
right one.)

The script doesn’t do anything useful for crates because they have their
own way of representing inputs.  It doesn’t account for changes in
‘arguments’ like zimoun suggested, meaning it’s overestimating accuracy.

It’d be nice to run it on gems but that importer doesn’t support
versioning either.

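As a minimal sketch of how the percentages in the output above come
about: the script divides each count by the sample total and rounds to a
whole number.  The snippet below reuses the script's '%' helper on the
reported counts.  The 'shares' helper and the variable names are
hypothetical, added only for illustration, and the snippet needs only
core Guile, not Guix.

```scheme
;; Standalone sketch: core Guile only, no Guix modules required.

;; Same rounding helper as in the attached script.
(define (% fraction)
  (inexact->exact (round (* 100. fraction))))

;; Hypothetical helper (not in the original script): pair each count
;; with its rounded share of TOTAL.
(define (shares counts total)
  (map (lambda (count)
         (list count (% (/ count total))))
       counts))

;; Counts reported above, in the order: accurate, different inputs,
;; different source, inconclusive.
(define pypi-shares (shares '(58 142 0 0) 200))
(define cran-shares (shares '(176 23 1 0) 200))

;; Print one (count percentage) pair per line.
(for-each (lambda (entry)
            (display entry)
            (newline))
          (append pypi-shares cran-shares))
```

Note that 'round' rounds halfway cases to the nearest even integer,
which is consistent with the output above: 1 package out of 200 is 0.5%
and is displayed as 0%, while 23 out of 200 is 11.5% and is displayed
as 12%.
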
To be continued…

Thanks,
Ludo’.

--=-=-=
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline; filename=importer-accuracy.scm
Content-Transfer-Encoding: 8bit

;;; Released under the GNU GPLv3 or any later version.
;;; Copyright © 2021 Ludovic Courtès

(use-modules (guix)
             (gnu packages)
             (guix import cran)
             (guix import crate)
             (guix import pypi)
             ((guix import print) #:select (package->code))
             ((guix upstream) #:select (url-predicate))
             (guix diagnostics)
             (guix i18n)
             (srfi srfi-1)
             (srfi srfi-9)
             (srfi srfi-9 gnu)
             (ice-9 match))

;; A "reimporter" pairs a predicate selecting packages with a procedure
;; that re-runs the corresponding importer on such a package.
(define-record-type <reimporter>
  (reimporter name pred import)
  reimporter?
  (name    reimporter-name)
  (pred    reimporter-predicate)
  (import  reimporter-import))

(define (find-reimporter package)
  (find (lambda (reimporter)
          ((reimporter-predicate reimporter) package))
        %reimporters))

(define (accurate-import? package)
  ;; Re-import PACKAGE and compare the result with its current definition.
  ;; Return 'accurate, 'different-inputs, 'different-source, or
  ;; 'inconclusive.
  (define (sexp-field sexp field)
    (match sexp
      (((or 'package 'origin) fields ...)
       (match (assoc field fields)
         ((key value) value)
         (_ #f)))
      (('define-public _ exp)
       (sexp-field exp field))))

  (define (same-source? sexp1 sexp2)
    (equal? (sexp-field (sexp-field sexp1 'source) 'sha256)
            (sexp-field (sexp-field sexp2 'source) 'sha256)))

  (define canonicalize-input
    ;; 'package->code' creates '@' references but importers don't.  Remove
    ;; the '@' to allow comparison.
    (match-lambda
      (("gfortran" _)
       ;; 'package->code' emits nonsense for the value associated with this
       ;; one, so trust the label.
       `("gfortran" ,(list 'unquote 'gfortran)))
      ((label ('unquote ('@ _ variable)) . rest)
       `(,label ,(list 'unquote variable) ,@rest))
      (x x)))

  (define (equivalent-inputs? inputs1 inputs2)
    ;; Compare inputs as sets, so ordering does not matter.
    (if (and inputs1 inputs2)
        (lset= equal?
               (match inputs1
                 (('quasiquote inputs)
                  (map canonicalize-input inputs)))
               (match inputs2
                 (('quasiquote inputs)
                  (map canonicalize-input inputs))))
        (equal? inputs1 inputs2)))

  (let* ((reimporter (find-reimporter package))
         (imported   ((reimporter-import reimporter) package))
         (actual     (package->code package)))
    (define (same-inputs? field)
      (equivalent-inputs? (sexp-field imported field)
                          (sexp-field actual field)))

    (if imported
        (if (and (same-inputs? 'inputs)
                 (same-inputs? 'native-inputs)
                 (same-inputs? 'propagated-inputs))
            (if (same-source? actual imported)
                'accurate
                (begin
                  (warning (package-location package)
                           (G_ "~a: source differs from upstream~%")
                           (package-full-name package))
                  'different-source))
            (begin
              (warning (package-location package)
                       (G_ "~a: inputs differ from upstream~%")
                       (package-full-name package))
              'different-inputs))
        'inconclusive)))

;; Stats.
(define-record-type <accuracy>
  (accuracy accurate different-inputs different-source inconclusive)
  accuracy?
  (accurate          accuracy-accurate)
  (different-inputs  accuracy-different-inputs)
  (different-source  accuracy-different-source)
  (inconclusive      accuracy-inconclusive))

(define (display-accuracy reimporter accuracy port)
  (define total
    (letrec-syntax ((sum (syntax-rules ()
                           ((_) 0)
                           ((_ get rest ...)
                            (+ (get accuracy) (sum rest ...))))))
      (sum accuracy-accurate accuracy-different-inputs
           accuracy-different-source accuracy-inconclusive)))

  (define (% fraction)
    (inexact->exact (round (* 100. fraction))))

  (format port (G_ "Accuracy for '~a' (~a packages):~%")
          (reimporter-name reimporter) total)
  (format port (G_ " accurate: ~a (~d%)~%")
          (accuracy-accurate accuracy)
          (% (/ (accuracy-accurate accuracy) total)))
  (format port (G_ " different inputs: ~a (~d%)~%")
          (accuracy-different-inputs accuracy)
          (% (/ (accuracy-different-inputs accuracy) total)))
  (format port (G_ " different source: ~a (~d%)~%")
          (accuracy-different-source accuracy)
          (% (/ (accuracy-different-source accuracy) total)))
  (format port (G_ " inconclusive: ~a (~d%)~%")
          (accuracy-inconclusive accuracy)
          (% (/ (accuracy-inconclusive accuracy) total))))

(define (random-seed)
  (logxor (getpid) (car (gettimeofday))))

(define shuffle                        ;copied from (guix scripts offload)
  (let ((state (seed->random-state (random-seed))))
    (lambda (lst)
      "Return LST shuffled (using the Fisher-Yates algorithm.)"
      (define vec (list->vector lst))
      (let loop ((result '())
                 (i (vector-length vec)))
        (if (zero? i)
            result
            (let* ((j (random i state))
                   (val (vector-ref vec j)))
              (vector-set! vec j (vector-ref vec (- i 1)))
              (loop (cons val result) (- i 1))))))))

;;;
;;; Reimporters.
;;;

(define pypi-package?                  ;copied from (guix import pypi)
  (url-predicate
   (lambda (url)
     (or (string-prefix? "https://pypi.org/" url)
         (string-prefix? "https://pypi.python.org/" url)
         (string-prefix? "https://pypi.org/packages" url)
         (string-prefix? "https://files.pythonhosted.org/packages" url)))))

(define guix-package->pypi-name
  (@@ (guix import pypi) guix-package->pypi-name))

(define* (package-sample reimporter
                         #:optional
                         (size (or (and=> (getenv "SAMPLE_SIZE")
                                          string->number)
                                   20)))
  ;; Return SIZE randomly chosen packages that REIMPORTER applies to.
  (let ((pred (reimporter-predicate reimporter)))
    (take (shuffle (fold-packages
                    (lambda (package lst)
                      (if (and (pred package)
                               (not (package-superseded package))
                               (not (string-prefix? "python2-"
                                                    (package-name package))))
                          (cons package lst)
                          lst))
                    '()))
          size)))

(define-syntax-rule (increment record field)
  (set-field record (field) (+ 1 (field record))))

(define (import-accuracy packages)
  (fold (lambda (package accuracy)
          (match (accurate-import? package)
            ('accurate
             (increment accuracy accuracy-accurate))
            ('different-inputs
             (increment accuracy accuracy-different-inputs))
            ('different-source
             (increment accuracy accuracy-different-source))
            ('inconclusive
             (increment accuracy accuracy-inconclusive))))
        (accuracy 0 0 0 0)
        packages))

(define (package->cran-name package)   ;copied from (guix import cran)
  "Return the upstream name of the PACKAGE."
  (let ((upstream-name (assoc-ref (package-properties package)
                                  'upstream-name)))
    (if upstream-name
        upstream-name
        (match (package-source package)
          ((? origin? origin)
           (match (origin-uri origin)
             ((or (? string? url) (url _ ...))
              (let ((end   (string-rindex url #\_))
                    (start (string-rindex url #\/)))
                ;; The URL ends on
                ;; (string-append "/" name "_" version ".tar.gz")
                (and start end (substring url (+ start 1) end))))
             (_ #f)))
          (_ #f)))))

(define %pypi-reimporter
  (reimporter 'pypi
              pypi-package?
              (lambda (package)
                (pypi->guix-package (guix-package->pypi-name package)
                                    #:version (package-version package)))))

(define %cran-reimporter
  (reimporter 'cran
              cran-package?
              (lambda (package)
                (cran->guix-package (package->cran-name package)
                                    #:version (package-version package)))))

(define crate-package?
  (url-predicate (@@ (guix import crate) crate-url?)))

(define %crate-reimporter
  (reimporter 'crate
              crate-package?
              (lambda (package)
                (crate->guix-package (guix-package->crate-name package)
                                     #:version (package-version package)))))

(define %reimporters
  (list %pypi-reimporter
        %cran-reimporter

        ;; XXX: Useless since Rust packages don't use the normal inputs
        ;; fields.
        ;; %crate-reimporter
        ))

(let ((results (map (compose import-accuracy package-sample)
                    %reimporters)))
  (for-each (lambda (reimporter result)
              (display-accuracy reimporter result (current-output-port)))
            %reimporters results))
--=-=-=--
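
The input comparison in the script above can be tried in isolation.
The following is a minimal sketch, assuming inputs shaped like the
label/value pairs the script matches on.  It lifts 'canonicalize-input'
to top level (without the "gfortran" special case) and compares two
input lists as sets.  The sample inputs, the module names inside them,
and the variables 'actual', 'imported', 'same?' and 'extra?' are made up
for illustration.

```scheme
(use-modules (srfi srfi-1)              ;lset=
             (ice-9 match))

;; Same normalization as in the script: drop the '@' module reference
;; that 'package->code' emits, keeping only the variable name.
(define canonicalize-input
  (match-lambda
    ((label ('unquote ('@ _ variable)) . rest)
     `(,label ,(list 'unquote variable) ,@rest))
    (x x)))

;; Inputs as 'package->code' might print them (illustrative data).
(define actual
  '(("python-six" (unquote (@ (gnu packages python-xyz) python-six)))
    ("python-pytz" (unquote (@ (gnu packages time) python-pytz)))))

;; Inputs as an importer might emit them, listed in a different order.
(define imported
  '(("python-pytz" (unquote python-pytz))
    ("python-six" (unquote python-six))))

;; Set comparison: order is ignored, every element must have a match.
(define same?
  (lset= equal?
         (map canonicalize-input actual)
         (map canonicalize-input imported)))

;; Adding one input on either side makes the comparison fail.
(define extra?
  (lset= equal?
         (map canonicalize-input actual)
         (cons '("python-nose" (unquote python-nose))
               (map canonicalize-input imported))))

(display (list same? extra?))
(newline)
```

Because the comparison is set-based, reordering inputs does not count as
a difference, whereas any added or removed input does.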