From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mp1 ([2001:41d0:2:4a6f::]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)) by ms11 with LMTPS id iIWLObEb418WKAAA0tVLHw (envelope-from ) for ; Wed, 23 Dec 2020 10:28:01 +0000 Received: from aspmx1.migadu.com ([2001:41d0:2:4a6f::]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)) by mp1 with LMTPS id 2ExQNbEb41+XNAAAbx9fmQ (envelope-from ) for ; Wed, 23 Dec 2020 10:28:01 +0000 Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by aspmx1.migadu.com (Postfix) with ESMTPS id 5604E940149 for ; Wed, 23 Dec 2020 10:28:01 +0000 (UTC) Received: from localhost ([::1]:55814 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1ks1NA-0003kn-7A for larch@yhetil.org; Wed, 23 Dec 2020 05:28:00 -0500 Received: from eggs.gnu.org ([2001:470:142:3::10]:44628) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1ks1N1-0003jQ-Ek for guix-devel@gnu.org; Wed, 23 Dec 2020 05:27:51 -0500 Received: from mail-wm1-x330.google.com ([2a00:1450:4864:20::330]:55681) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.90_1) (envelope-from ) id 1ks1Mz-0005Xd-03; Wed, 23 Dec 2020 05:27:51 -0500 Received: by mail-wm1-x330.google.com with SMTP id c124so1342170wma.5; Wed, 23 Dec 2020 02:27:48 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:subject:in-reply-to:references:date:message-id:mime-version :content-transfer-encoding; bh=pxfJdUOWegFt7WTR1wwlXhOG445aqz529bBBiAj2i/E=; b=M1+lHrPsBBBWuttJSSScVHU40K8zm9syu/DSuDbcCeRYVq4itZCZvU9rJnlxoQeeNt s8nbYtZ/I5HaEZWZipBdvuKj98bp0G7Xb8RG+x489J4Ct/Oqoi+LWrV+iOPLVuj7v1DS M4dXQLGeYo7Jgf1/13zn7AbX7SbyUZelYtY9sw6oD7nVt7O5bns2KNTmxuhgscaLCQmv D+RgH2PTyulnDJJrOWLUPwdKiRkhmPO2FH7c/RF0dp0/q1LzDGFaHdBz+4CDEZ80pY2s WPK5cb7tQr8a/sz9bTh5VibPVCkjHHLu08qXzLeaHphBmPLXGROTKrefp7cEMHHhi7mM E5tA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:subject:in-reply-to:references:date :message-id:mime-version:content-transfer-encoding; bh=pxfJdUOWegFt7WTR1wwlXhOG445aqz529bBBiAj2i/E=; b=THsLNXkCMMGxSBUHMJhhbBzfh6DMNhdki8qVOVwDv27oHP2k0/r0qOHGo53IOABvhu Lv21ceDrkcjkw9OuxRUp+RjzgKl/MFCVxUNEx5k9r02EfV5AtW6gRT6K4QPK3C1y0dX6 3QD2U3xXMJeApwTEKfH4dBHz1Ly6RCapGzeEaghpD3ARHGk1XEz0SeNg8KgJvqKz9Y4G pqIZ+PQwbTYg5gGKF1k6J9tAjgS+CRtLLOkDA9oGOiwtVMzD/jLD3YR8ngnUVLxIlDyK +Ha4WoPzx1oE4XVFkA/n8A0cIX6Uxi4Iz1Tt7XVD92v2+4GUo3irfMWXLjAJKoLlOvsP D2sA== X-Gm-Message-State: AOAM5308q2jDxGBieoJ9GL0G6YSAdH9w7SF55NVJmvJiHdtmtQy6/JSM yQxRbKkzE7GzRTC/S/ykf4qFeatKx+8= X-Google-Smtp-Source: ABdhPJxbbtMyX6siCPsKnLgSUtgsiWdrMFJ64sVfvlvUjasYZQX+BoaN5+zcWBfVInrUthDy2rCohQ== X-Received: by 2002:a1c:7217:: with SMTP id n23mr25371628wmc.167.1608719266898; Wed, 23 Dec 2020 02:27:46 -0800 (PST) Received: from lili ([2a01:e35:2e80:2030:3231:7ae:dc70:aacf]) by smtp.gmail.com with ESMTPSA id a62sm36445532wmh.40.2020.12.23.02.27.45 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 23 Dec 2020 02:27:46 -0800 (PST) From: zimoun To: Ludovic =?utf-8?Q?Court=C3=A8s?= , Guix Devel Subject: Re: Identical files across subsequent package revisions In-Reply-To: <87wnx9wlea.fsf@gnu.org> References: <87wnx9wlea.fsf@gnu.org> Date: Wed, 23 Dec 2020 11:19:33 +0100 Message-ID: <86eejgsu2y.fsf@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Received-SPF: pass client-ip=2a00:1450:4864:20::330; envelope-from=zimon.toutoune@gmail.com; helo=mail-wm1-x330.google.com X-Spam_score_int: -20 X-Spam_score: -2.1 X-Spam_bar: -- X-Spam_report: (-2.1 / 5.0 requ) BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, FREEMAIL_FROM=0.001, RCVD_IN_DNSWL_NONE=-0.0001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: guix-devel@gnu.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: "Development of GNU Guix and the GNU System distribution." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: guix-devel-bounces+larch=yhetil.org@gnu.org Sender: "Guix-devel" X-Migadu-Flow: FLOW_IN X-Migadu-Spam-Score: -1.22 Authentication-Results: aspmx1.migadu.com; dkim=fail (body hash did not verify) header.d=gmail.com header.s=20161025 header.b=M1+lHrPs; dmarc=fail reason="SPF not aligned (relaxed)" header.from=gmail.com (policy=none); spf=pass (aspmx1.migadu.com: domain of guix-devel-bounces@gnu.org designates 209.51.188.17 as permitted sender) smtp.mailfrom=guix-devel-bounces@gnu.org X-Migadu-Queue-Id: 5604E940149 X-Spam-Score: -1.22 X-Migadu-Scanner: scn0.migadu.com X-TUID: H2S6Quh1g5b9 Hi Ludo, On Tue, 22 Dec 2020 at 23:01, Ludovic Court=C3=A8s wrote: > I wanted to evaluate that by looking at store items corresponding to > subsequent revisions of a package (be it different versions or rebuilds > induced by dependencies), and this is what the program below does. > > Here are preliminary results for a few packages: [...] > The reason I=E2=80=99m looking at this is to understand how much would be= gained > in terms of bandwidth usage if we were able to avoid downloading > individual files already in the store. It would seem to be rather > encouraging. This is really interesting. Especially through normal distribution and co. lens. The mean needs to be weighed: 10% of 180MB is not the same than 55% of 1MB (icecat vs diffoscope, from =E2=80=9Cguix size=E2=80=9D and= I could misread the numbers). The variance between revisions is highly variable (from 0 to significant). The one between packages says how fat or fit the normal distribution is. A smarter distribution could lead to some bandwith saving. The question is: predict how many? Is it worth to add a bit of complexity to save corner cases or for all. By corner cases, I mean that for example I am not updating my Emacs at each revisions. > Below is the program that does that. It grabs revision history from > data.guix.gnu.org, fetches nars from ci.guix.gnu.org, computes a > =E2=80=9Cdigest=E2=80=9D (list of files along with their hash and size), = compares > package digests pairwise, and plots the result with Guile-Charting. > Example REPL session: > > --8<---------------cut here---------------start------------->8--- > scheme@(similarities)> (pairwise-similarities (package-archive-contents = "icecat" #:max 4)) > updating substitutes from 'https://ci.guix.gnu.org'... 100.0% > $86 =3D (17363915/196387883 11380615/98152193) > scheme@(similarities)> (map exact->inexact $86) > $87 =3D (0.08841642740249916 0.11594865740799087) > > [=E2=80=A6] > > scheme@(similarities)> ,pp (at-most 10 (package-instances "r-minimal") ) > $100 =3D (#< version: "4.0.3" output: "/gnu/store/vv3ca= 1r5zw5y35xgkix4r80hdnncx52b-r-minimal-4.0.3"> > #< version: "4.0.3" output: "/gnu/store/5dzad7nhhv3dvm= ap60d6gsj9ppflgzrd-r-minimal-4.0.3"> > #< version: "4.0.3" output: "/gnu/store/01xi3sig314wgw= a1j9sxk37vl816mj74-r-minimal-4.0.3"> > #< version: "4.0.2" output: "/gnu/store/nv7lqhnm0mncqw= dpkkhnlsgb562lcwff-r-minimal-4.0.2"> > #< version: "4.0.2" output: "/gnu/store/w0izbm8q26dmyn= dhv46xr7dgz1irai1z-r-minimal-4.0.2"> > #< version: "4.0.2" output: "/gnu/store/yd83ibzxjrb7cg= cc6d4smx4pqcdl8r3p-r-minimal-4.0.2"> > #< version: "4.0.1" output: "/gnu/store/kpdh0lwlwcwfmm= fzqhwbi6j7m4zzxlmn-r-minimal-4.0.1"> > #< version: "4.0.1" output: "/gnu/store/9x9nzzsiyn1q7g= 5myhgwjh0yx9js3nrj-r-minimal-4.0.1"> > #< version: "4.0.0" output: "/gnu/store/ahbm2gsqc3420a= 23pcwrxd4pdhl7rdpp-r-minimal-4.0.0"> > #< version: "4.0.0" output: "/gnu/store/0sgqhj2628js41= 9wvw1vc1cw07wil7gr-r-minimal-4.0.0">) > $101 =3D (#< version: "3.6.3" output: "/gnu/store/gmx6p= 0wk3xbc9ylv83zfj855azgjxr0p-r-minimal-3.6.3"> > #< version: "3.6.2" output: "/gnu/store/dnb6fzp5295fcd= a66dnjk2y51mcva20f-r-minimal-3.6.2"> > #< version: "3.6.1" output: "/gnu/store/gd6sm42b6fr1qg= yp6p1zp3z4aavxwyk2-r-minimal-3.6.1"> > #< version: "3.6.1" output: "/gnu/store/lpmfhxys3vsaqm= qvj85r344ygfmlmlbg-r-minimal-3.6.1"> > #< version: "3.6.1" output: "/gnu/store/4413h13v7zrb7r= p9lvyp2gfx2laj60wm-r-minimal-3.6.1"> > #< version: "3.6.1" output: "/gnu/store/zm5pl1y0wmh3c8= 45498pbjvrzrm6sb07-r-minimal-3.6.1"> > #< version: "3.6.1" output: "/gnu/store/xrv7y4xgrdrsx5= qba7144cpw69qc5f0j-r-minimal-3.6.1"> > #< version: "3.6.0" output: "/gnu/store/cbwhhqv69xi3k3= g25dcfhwjkkf2427rp-r-minimal-3.6.0"> > #< version: "3.6.0" output: "/gnu/store/69k46yr70zkzzz= 9v18wl7sxasha5m0a5-r-minimal-3.6.0"> > #< version: "3.6.0" output: "/gnu/store/71w0383x0hay6n= g1zaddz59by3g0gxaf-r-minimal-3.6.0"> > #< version: "3.6.0" output: "/gnu/store/m317mg8faxp9qn= 949dnv1xgsxyw57s3x-r-minimal-3.6.0"> > #< version: "3.5.3" output: "/gnu/store/33qdfplk4riig7= 7vqg758m1zll16n6f0-r-minimal-3.5.3"> > #< version: "3.5.3" output: "/gnu/store/g8gkrcxn49yj8z= jr59l7y4k7wgar5brq-r-minimal-3.5.3"> > #< version: "3.5.1" output: "/gnu/store/vrgbyvnx7b1sva= 2ij5a1gwrkbfwmg3lm-r-minimal-3.5.1"> > #< version: "3.5.1" output: "/gnu/store/4fzw7s0cv2zbix= w4wb58zq6lkd97ghnf-r-minimal-3.5.1"> > #< version: "3.5.1" output: "/gnu/store/yb5048dr40nbmy= fa1ar4hfiy3kd06v4c-r-minimal-3.5.1">) > scheme@(similarities)> (similarity-chart '("icecat" "gimp" "openmpi" "ema= cs" "diffoscope" "r-minimal") "/tmp/t.png" #:max 8) > updating substitutes from 'https://ci.guix.gnu.org'... 100.0% > $102 =3D # > --8<---------------cut here---------------end--------------->8--- > > Thoughts? :-) > > Ludo=E2=80=99. > > ;;; Copyright =C2=A9 2020 Ludovic Court=C3=A8s > ;;; Hereby placed under the GNU General Public License, version 3 or late= r. > > (define-module (similarities) > #:use-module (json) > #:use-module ((gcrypt hash) #:select (open-sha256-port)) > #:use-module ((guix store) #:select (%default-substitute-urls)) > #:use-module ((guix progress) #:hide (dump-port*)) > #:use-module (guix http-client) > #:use-module (guix serialization) > #:use-module (guix scripts substitute) > #:use-module (srfi srfi-1) > #:use-module (srfi srfi-9) > #:use-module (srfi srfi-11) > #:use-module (rnrs bytevectors) > #:use-module (charting) > #:use-module (ice-9 match)) > > > ;;; > ;;; Data Service client. > ;;; > > (define-json-mapping make-package-instance > package-instance? > json->package-instance > (version package-instance-version) > (output package-instance-output "derivation")) > > (define %data-service-base-url > (make-parameter "https://data.guix.gnu.org")) > > (define* (package-instances package #:key (branch "master")) > "Return a list of representing instances of PACKAGE = over > time known to the Data Service." > (match (assoc-ref (json->scm > (http-fetch (string-append (%data-service-base-url) > "/repository/1/branch/" > branch "/package/" package > "/output-history.json"))) > "derivations") > (#(lst ...) > (map json->package-instance lst)) > (#f > #f))) > > > ;;; > ;;; Similarity measurement. > ;;; > > (define (port-sha256* port size) ;from (guix scripts challe= nge) > ;; Like 'port-sha256', but limited to SIZE bytes. > (let-values (((out get) (open-sha256-port))) > (dump-port* port out size) > (close-port out) > (get))) > > (define-record-type > (file-info name type size sha256) > file-info? > (name file-info-name) > (type file-info-type) > (size file-info-size) > (sha256 file-info-sha256)) > > (define (archive-contents port) > "Return a list of records from the nar read from PORT." > ;; As in (guix scripts challenge), but return records that > ;; include file size and ignore symlinks. > (fold-archive (lambda (file type contents result) > (match type > ((or 'regular 'executable) > (match contents > ((port . size) > (cons (file-info file type size > (port-sha256* port size)) > result)))) > ('directory result) > ('directory-complete result) > ('symlink result))) > '() > port > "")) > > (define (narinfo-contents narinfo) ;from (guix scripts challe= nge) > "Fetch the nar described by NARINFO and return a list representing the = file > it contains." > ((@@ (guix scripts challenge) call-with-nar) narinfo archive-contents)) > > (define (at-most max-length lst) ;from (guix scripts substit= ute) > "If LST is shorter than MAX-LENGTH, return it and the empty list; other= wise > return its MAX-LENGTH first elements and its tail." > (let loop ((len 0) > (lst lst) > (result '())) > (match lst > (() > (values (reverse result) '())) > ((head . tail) > (if (>=3D len max-length) > (values (reverse result) lst) > (loop (+ 1 len) tail (cons head result))))))) > > (define* (package-archive-contents package > #:key (max 10) > (substitute-urls %default-substitute-u= rls)) > "Look at the MAX latest instances of PACKAGE, fetch them, and return a > summary of their contents as returned by 'narinfo-contents'." > (let ((instances (at-most max (package-instances package)))) > (map narinfo-contents > (lookup-narinfos (first substitute-urls) > (map package-instance-output instances))))) > > (define (similarity contents1 contents2) > "Return two values: the ratio of identical bytes between CONTENTS2 and > CONTENTS2, and the ratio of identical files." > (define (matches name) > (lambda (info) > (string=3D? (file-info-name info) name))) > > (let ((files (delete-duplicates > (append (map file-info-name contents1) > (map file-info-name contents2))))) > (let loop ((files files) > (seen 0) > (identical 0) > (seen-bytes 0) > (identical-bytes 0)) > (match files > (() > (values (/ identical-bytes seen-bytes) > (/ identical seen))) > ((head . tail) > (let ((file1 (find (matches head) contents1)) > (file2 (find (matches head) contents2))) > (cond ((not file1) > (loop tail > (+ seen 1) identical > (+ seen-bytes (file-info-size file2)) > identical-bytes)) > ((not file2) > (loop tail > (+ seen 1) identical > (+ seen-bytes (file-info-size file1)) > identical-bytes)) > (else > (let ((identical? > (and (=3D (file-info-size file1) > (file-info-size file2)) > (bytevector=3D? (file-info-sha256 file1) > (file-info-sha256 file2))))) > (loop tail > (+ seen 1) > (if identical? > (+ identical 1) > identical) > (+ seen-bytes > (max (file-info-size file1) > (file-info-size file2))) > (if identical? > (+ identical-bytes (file-info-size file1)) > identical-bytes))))))))))) > > (define (pairwise-similarities contents) > (let loop ((contents contents) > (similarities '())) > (match contents > ((or () (_)) > (reverse similarities)) > ((a b . tail) > (loop (cons b tail) > (cons (similarity a b) similarities)))))) > > (define* (similarity-chart packages file > #:key (max 20)) > (make-bar-chart > "Similarity across subsequent package revisions" > (map (lambda (package) > (let* ((contents (package-archive-contents package > #:max max)) > (similarities (pairwise-similarities contents)) > (labels (iota (length similarities)))) > `(,package > ,@(map (lambda (label ratio) > `(,(* ratio 100.) ,(number->string label))) > labels similarities)))) > packages) > #:chart-params '(#:x-axis-label "package" > #:y-axis-label "similarity ratio (%)") > #:write-to-png file))