From: zimoun <zimon.toutoune@gmail.com>
To: "Ludovic Courtès" <ludo@gnu.org>, "Guix Devel" <guix-devel@gnu.org>
Subject: Re: Identical files across subsequent package revisions
Date: Wed, 23 Dec 2020 11:19:33 +0100 [thread overview]
Message-ID: <86eejgsu2y.fsf@gmail.com> (raw)
In-Reply-To: <87wnx9wlea.fsf@gnu.org>
Hi Ludo,
On Tue, 22 Dec 2020 at 23:01, Ludovic Courtès <ludo@gnu.org> wrote:
> I wanted to evaluate that by looking at store items corresponding to
> subsequent revisions of a package (be it different versions or rebuilds
> induced by dependencies), and this is what the program below does.
>
> Here are preliminary results for a few packages:
[...]
> The reason I’m looking at this is to understand how much would be gained
> in terms of bandwidth usage if we were able to avoid downloading
> individual files already in the store. It would seem to be rather
> encouraging.
This is really interesting. Especially through normal distribution and
co. lens. The mean needs to be weighed: 10% of 180MB is not the same
than 55% of 1MB (icecat vs diffoscope, from “guix size” and I could
misread the numbers). The variance between revisions is highly variable
(from 0 to significant). The one between packages says how fat or fit
the normal distribution is.
A smarter distribution could lead to some bandwith saving. The question
is: predict how many? Is it worth to add a bit of complexity to save
corner cases or for all.
By corner cases, I mean that for example I am not updating my Emacs at
each revisions.
<https://data.guix.gnu.org/repository/1/branch/master/package/emacs/output-history>
> Below is the program that does that. It grabs revision history from
> data.guix.gnu.org, fetches nars from ci.guix.gnu.org, computes a
> “digest” (list of files along with their hash and size), compares
> package digests pairwise, and plots the result with Guile-Charting.
> Example REPL session:
>
> --8<---------------cut here---------------start------------->8---
> scheme@(similarities)> (pairwise-similarities (package-archive-contents "icecat" #:max 4))
> updating substitutes from 'https://ci.guix.gnu.org'... 100.0%
> $86 = (17363915/196387883 11380615/98152193)
> scheme@(similarities)> (map exact->inexact $86)
> $87 = (0.08841642740249916 0.11594865740799087)
>
> […]
>
> scheme@(similarities)> ,pp (at-most 10 (package-instances "r-minimal") )
> $100 = (#<<package-instance> version: "4.0.3" output: "/gnu/store/vv3ca1r5zw5y35xgkix4r80hdnncx52b-r-minimal-4.0.3">
> #<<package-instance> version: "4.0.3" output: "/gnu/store/5dzad7nhhv3dvmap60d6gsj9ppflgzrd-r-minimal-4.0.3">
> #<<package-instance> version: "4.0.3" output: "/gnu/store/01xi3sig314wgwa1j9sxk37vl816mj74-r-minimal-4.0.3">
> #<<package-instance> version: "4.0.2" output: "/gnu/store/nv7lqhnm0mncqwdpkkhnlsgb562lcwff-r-minimal-4.0.2">
> #<<package-instance> version: "4.0.2" output: "/gnu/store/w0izbm8q26dmyndhv46xr7dgz1irai1z-r-minimal-4.0.2">
> #<<package-instance> version: "4.0.2" output: "/gnu/store/yd83ibzxjrb7cgcc6d4smx4pqcdl8r3p-r-minimal-4.0.2">
> #<<package-instance> version: "4.0.1" output: "/gnu/store/kpdh0lwlwcwfmmfzqhwbi6j7m4zzxlmn-r-minimal-4.0.1">
> #<<package-instance> version: "4.0.1" output: "/gnu/store/9x9nzzsiyn1q7g5myhgwjh0yx9js3nrj-r-minimal-4.0.1">
> #<<package-instance> version: "4.0.0" output: "/gnu/store/ahbm2gsqc3420a23pcwrxd4pdhl7rdpp-r-minimal-4.0.0">
> #<<package-instance> version: "4.0.0" output: "/gnu/store/0sgqhj2628js419wvw1vc1cw07wil7gr-r-minimal-4.0.0">)
> $101 = (#<<package-instance> version: "3.6.3" output: "/gnu/store/gmx6p0wk3xbc9ylv83zfj855azgjxr0p-r-minimal-3.6.3">
> #<<package-instance> version: "3.6.2" output: "/gnu/store/dnb6fzp5295fcda66dnjk2y51mcva20f-r-minimal-3.6.2">
> #<<package-instance> version: "3.6.1" output: "/gnu/store/gd6sm42b6fr1qgyp6p1zp3z4aavxwyk2-r-minimal-3.6.1">
> #<<package-instance> version: "3.6.1" output: "/gnu/store/lpmfhxys3vsaqmqvj85r344ygfmlmlbg-r-minimal-3.6.1">
> #<<package-instance> version: "3.6.1" output: "/gnu/store/4413h13v7zrb7rp9lvyp2gfx2laj60wm-r-minimal-3.6.1">
> #<<package-instance> version: "3.6.1" output: "/gnu/store/zm5pl1y0wmh3c845498pbjvrzrm6sb07-r-minimal-3.6.1">
> #<<package-instance> version: "3.6.1" output: "/gnu/store/xrv7y4xgrdrsx5qba7144cpw69qc5f0j-r-minimal-3.6.1">
> #<<package-instance> version: "3.6.0" output: "/gnu/store/cbwhhqv69xi3k3g25dcfhwjkkf2427rp-r-minimal-3.6.0">
> #<<package-instance> version: "3.6.0" output: "/gnu/store/69k46yr70zkzzz9v18wl7sxasha5m0a5-r-minimal-3.6.0">
> #<<package-instance> version: "3.6.0" output: "/gnu/store/71w0383x0hay6ng1zaddz59by3g0gxaf-r-minimal-3.6.0">
> #<<package-instance> version: "3.6.0" output: "/gnu/store/m317mg8faxp9qn949dnv1xgsxyw57s3x-r-minimal-3.6.0">
> #<<package-instance> version: "3.5.3" output: "/gnu/store/33qdfplk4riig77vqg758m1zll16n6f0-r-minimal-3.5.3">
> #<<package-instance> version: "3.5.3" output: "/gnu/store/g8gkrcxn49yj8zjr59l7y4k7wgar5brq-r-minimal-3.5.3">
> #<<package-instance> version: "3.5.1" output: "/gnu/store/vrgbyvnx7b1sva2ij5a1gwrkbfwmg3lm-r-minimal-3.5.1">
> #<<package-instance> version: "3.5.1" output: "/gnu/store/4fzw7s0cv2zbixw4wb58zq6lkd97ghnf-r-minimal-3.5.1">
> #<<package-instance> version: "3.5.1" output: "/gnu/store/yb5048dr40nbmyfa1ar4hfiy3kd06v4c-r-minimal-3.5.1">)
> scheme@(similarities)> (similarity-chart '("icecat" "gimp" "openmpi" "emacs" "diffoscope" "r-minimal") "/tmp/t.png" #:max 8)
> updating substitutes from 'https://ci.guix.gnu.org'... 100.0%
> $102 = #<cairo-surface 7f502c7adb10>
> --8<---------------cut here---------------end--------------->8---
>
> Thoughts? :-)
>
> Ludo’.
>
> ;;; Copyright © 2020 Ludovic Courtès <ludo@gnu.org>
> ;;; Hereby placed under the GNU General Public License, version 3 or later.
>
> (define-module (similarities)
> #:use-module (json)
> #:use-module ((gcrypt hash) #:select (open-sha256-port))
> #:use-module ((guix store) #:select (%default-substitute-urls))
> #:use-module ((guix progress) #:hide (dump-port*))
> #:use-module (guix http-client)
> #:use-module (guix serialization)
> #:use-module (guix scripts substitute)
> #:use-module (srfi srfi-1)
> #:use-module (srfi srfi-9)
> #:use-module (srfi srfi-11)
> #:use-module (rnrs bytevectors)
> #:use-module (charting)
> #:use-module (ice-9 match))
>
> \f
> ;;;
> ;;; Data Service client.
> ;;;
>
> (define-json-mapping <package-instance> make-package-instance
> package-instance?
> json->package-instance
> (version package-instance-version)
> (output package-instance-output "derivation"))
>
> (define %data-service-base-url
> (make-parameter "https://data.guix.gnu.org"))
>
> (define* (package-instances package #:key (branch "master"))
> "Return a list of <package-instance> representing instances of PACKAGE over
> time known to the Data Service."
> (match (assoc-ref (json->scm
> (http-fetch (string-append (%data-service-base-url)
> "/repository/1/branch/"
> branch "/package/" package
> "/output-history.json")))
> "derivations")
> (#(lst ...)
> (map json->package-instance lst))
> (#f
> #f)))
>
> \f
> ;;;
> ;;; Similarity measurement.
> ;;;
>
> (define (port-sha256* port size) ;from (guix scripts challenge)
> ;; Like 'port-sha256', but limited to SIZE bytes.
> (let-values (((out get) (open-sha256-port)))
> (dump-port* port out size)
> (close-port out)
> (get)))
>
> (define-record-type <file-info>
> (file-info name type size sha256)
> file-info?
> (name file-info-name)
> (type file-info-type)
> (size file-info-size)
> (sha256 file-info-sha256))
>
> (define (archive-contents port)
> "Return a list of <file-info> records from the nar read from PORT."
> ;; As in (guix scripts challenge), but return <file-info> records that
> ;; include file size and ignore symlinks.
> (fold-archive (lambda (file type contents result)
> (match type
> ((or 'regular 'executable)
> (match contents
> ((port . size)
> (cons (file-info file type size
> (port-sha256* port size))
> result))))
> ('directory result)
> ('directory-complete result)
> ('symlink result)))
> '()
> port
> ""))
>
> (define (narinfo-contents narinfo) ;from (guix scripts challenge)
> "Fetch the nar described by NARINFO and return a list representing the file
> it contains."
> ((@@ (guix scripts challenge) call-with-nar) narinfo archive-contents))
>
> (define (at-most max-length lst) ;from (guix scripts substitute)
> "If LST is shorter than MAX-LENGTH, return it and the empty list; otherwise
> return its MAX-LENGTH first elements and its tail."
> (let loop ((len 0)
> (lst lst)
> (result '()))
> (match lst
> (()
> (values (reverse result) '()))
> ((head . tail)
> (if (>= len max-length)
> (values (reverse result) lst)
> (loop (+ 1 len) tail (cons head result)))))))
>
> (define* (package-archive-contents package
> #:key (max 10)
> (substitute-urls %default-substitute-urls))
> "Look at the MAX latest instances of PACKAGE, fetch them, and return a
> summary of their contents as returned by 'narinfo-contents'."
> (let ((instances (at-most max (package-instances package))))
> (map narinfo-contents
> (lookup-narinfos (first substitute-urls)
> (map package-instance-output instances)))))
>
> (define (similarity contents1 contents2)
> "Return two values: the ratio of identical bytes between CONTENTS2 and
> CONTENTS2, and the ratio of identical files."
> (define (matches name)
> (lambda (info)
> (string=? (file-info-name info) name)))
>
> (let ((files (delete-duplicates
> (append (map file-info-name contents1)
> (map file-info-name contents2)))))
> (let loop ((files files)
> (seen 0)
> (identical 0)
> (seen-bytes 0)
> (identical-bytes 0))
> (match files
> (()
> (values (/ identical-bytes seen-bytes)
> (/ identical seen)))
> ((head . tail)
> (let ((file1 (find (matches head) contents1))
> (file2 (find (matches head) contents2)))
> (cond ((not file1)
> (loop tail
> (+ seen 1) identical
> (+ seen-bytes (file-info-size file2))
> identical-bytes))
> ((not file2)
> (loop tail
> (+ seen 1) identical
> (+ seen-bytes (file-info-size file1))
> identical-bytes))
> (else
> (let ((identical?
> (and (= (file-info-size file1)
> (file-info-size file2))
> (bytevector=? (file-info-sha256 file1)
> (file-info-sha256 file2)))))
> (loop tail
> (+ seen 1)
> (if identical?
> (+ identical 1)
> identical)
> (+ seen-bytes
> (max (file-info-size file1)
> (file-info-size file2)))
> (if identical?
> (+ identical-bytes (file-info-size file1))
> identical-bytes)))))))))))
>
> (define (pairwise-similarities contents)
> (let loop ((contents contents)
> (similarities '()))
> (match contents
> ((or () (_))
> (reverse similarities))
> ((a b . tail)
> (loop (cons b tail)
> (cons (similarity a b) similarities))))))
>
> (define* (similarity-chart packages file
> #:key (max 20))
> (make-bar-chart
> "Similarity across subsequent package revisions"
> (map (lambda (package)
> (let* ((contents (package-archive-contents package
> #:max max))
> (similarities (pairwise-similarities contents))
> (labels (iota (length similarities))))
> `(,package
> ,@(map (lambda (label ratio)
> `(,(* ratio 100.) ,(number->string label)))
> labels similarities))))
> packages)
> #:chart-params '(#:x-axis-label "package"
> #:y-axis-label "similarity ratio (%)")
> #:write-to-png file))
next prev parent reply other threads:[~2020-12-23 10:28 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-12-22 22:01 Identical files across subsequent package revisions Ludovic Courtès
2020-12-23 9:08 ` Taylan Kammer
2020-12-23 10:50 ` Pierre Neidhardt
2020-12-23 22:06 ` Mark H Weaver
2020-12-23 22:47 ` Jonathan Brielmaier
2020-12-23 23:42 ` Mark H Weaver
2020-12-23 10:19 ` zimoun [this message]
2020-12-23 10:41 ` zimoun
2020-12-27 22:22 ` Ludovic Courtès
2020-12-23 11:48 ` Christopher Baines
2020-12-23 13:10 ` Miguel Ángel Arruga Vivas
2020-12-23 14:07 ` zimoun
2020-12-23 15:40 ` Julien Lepiller
2020-12-23 19:07 ` Miguel Ángel Arruga Vivas
2020-12-27 22:29 ` Ludovic Courtès
2021-01-06 9:39 ` Ludovic Courtès
2020-12-29 20:01 ` pukkamustard
2021-01-06 9:27 ` Ludovic Courtès
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: https://guix.gnu.org/
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=86eejgsu2y.fsf@gmail.com \
--to=zimon.toutoune@gmail.com \
--cc=guix-devel@gnu.org \
--cc=ludo@gnu.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://git.savannah.gnu.org/cgit/guix.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).