unofficial mirror of guix-devel@gnu.org 
 help / color / mirror / code / Atom feed
From: zimoun <zimon.toutoune@gmail.com>
To: "Ludovic Courtès" <ludo@gnu.org>, "Guix Devel" <guix-devel@gnu.org>
Subject: Re: Identical files across subsequent package revisions
Date: Wed, 23 Dec 2020 11:19:33 +0100	[thread overview]
Message-ID: <86eejgsu2y.fsf@gmail.com> (raw)
In-Reply-To: <87wnx9wlea.fsf@gnu.org>

Hi Ludo,

On Tue, 22 Dec 2020 at 23:01, Ludovic Courtès <ludo@gnu.org> wrote:

> I wanted to evaluate that by looking at store items corresponding to
> subsequent revisions of a package (be it different versions or rebuilds
> induced by dependencies), and this is what the program below does.
>
> Here are preliminary results for a few packages:

[...]

> The reason I’m looking at this is to understand how much would be gained
> in terms of bandwidth usage if we were able to avoid downloading
> individual files already in the store.  It would seem to be rather
> encouraging.

This is really interesting.  Especially through normal distribution and
co. lens.  The mean needs to be weighed: 10% of 180MB is not the same
than 55% of 1MB (icecat vs diffoscope, from “guix size” and I could
misread the numbers).  The variance between revisions is highly variable
(from 0 to significant).  The one between packages says how fat or fit
the normal distribution is.

A smarter distribution could lead to some bandwith saving.  The question
is: predict how many?  Is it worth to add a bit of complexity to save
corner cases or for all.

By corner cases, I mean that for example I am not updating my Emacs at
each revisions.

<https://data.guix.gnu.org/repository/1/branch/master/package/emacs/output-history>




> Below is the program that does that.  It grabs revision history from
> data.guix.gnu.org, fetches nars from ci.guix.gnu.org, computes a
> “digest” (list of files along with their hash and size), compares
> package digests pairwise, and plots the result with Guile-Charting.
> Example REPL session:
>
> --8<---------------cut here---------------start------------->8---
> scheme@(similarities)>  (pairwise-similarities (package-archive-contents "icecat" #:max 4))
> updating substitutes from 'https://ci.guix.gnu.org'... 100.0%
> $86 = (17363915/196387883 11380615/98152193)
> scheme@(similarities)> (map exact->inexact $86)
> $87 = (0.08841642740249916 0.11594865740799087)
>
> […]
>
> scheme@(similarities)> ,pp (at-most 10 (package-instances "r-minimal") )
> $100 = (#<<package-instance> version: "4.0.3" output: "/gnu/store/vv3ca1r5zw5y35xgkix4r80hdnncx52b-r-minimal-4.0.3">
>  #<<package-instance> version: "4.0.3" output: "/gnu/store/5dzad7nhhv3dvmap60d6gsj9ppflgzrd-r-minimal-4.0.3">
>  #<<package-instance> version: "4.0.3" output: "/gnu/store/01xi3sig314wgwa1j9sxk37vl816mj74-r-minimal-4.0.3">
>  #<<package-instance> version: "4.0.2" output: "/gnu/store/nv7lqhnm0mncqwdpkkhnlsgb562lcwff-r-minimal-4.0.2">
>  #<<package-instance> version: "4.0.2" output: "/gnu/store/w0izbm8q26dmyndhv46xr7dgz1irai1z-r-minimal-4.0.2">
>  #<<package-instance> version: "4.0.2" output: "/gnu/store/yd83ibzxjrb7cgcc6d4smx4pqcdl8r3p-r-minimal-4.0.2">
>  #<<package-instance> version: "4.0.1" output: "/gnu/store/kpdh0lwlwcwfmmfzqhwbi6j7m4zzxlmn-r-minimal-4.0.1">
>  #<<package-instance> version: "4.0.1" output: "/gnu/store/9x9nzzsiyn1q7g5myhgwjh0yx9js3nrj-r-minimal-4.0.1">
>  #<<package-instance> version: "4.0.0" output: "/gnu/store/ahbm2gsqc3420a23pcwrxd4pdhl7rdpp-r-minimal-4.0.0">
>  #<<package-instance> version: "4.0.0" output: "/gnu/store/0sgqhj2628js419wvw1vc1cw07wil7gr-r-minimal-4.0.0">)
> $101 = (#<<package-instance> version: "3.6.3" output: "/gnu/store/gmx6p0wk3xbc9ylv83zfj855azgjxr0p-r-minimal-3.6.3">
>  #<<package-instance> version: "3.6.2" output: "/gnu/store/dnb6fzp5295fcda66dnjk2y51mcva20f-r-minimal-3.6.2">
>  #<<package-instance> version: "3.6.1" output: "/gnu/store/gd6sm42b6fr1qgyp6p1zp3z4aavxwyk2-r-minimal-3.6.1">
>  #<<package-instance> version: "3.6.1" output: "/gnu/store/lpmfhxys3vsaqmqvj85r344ygfmlmlbg-r-minimal-3.6.1">
>  #<<package-instance> version: "3.6.1" output: "/gnu/store/4413h13v7zrb7rp9lvyp2gfx2laj60wm-r-minimal-3.6.1">
>  #<<package-instance> version: "3.6.1" output: "/gnu/store/zm5pl1y0wmh3c845498pbjvrzrm6sb07-r-minimal-3.6.1">
>  #<<package-instance> version: "3.6.1" output: "/gnu/store/xrv7y4xgrdrsx5qba7144cpw69qc5f0j-r-minimal-3.6.1">
>  #<<package-instance> version: "3.6.0" output: "/gnu/store/cbwhhqv69xi3k3g25dcfhwjkkf2427rp-r-minimal-3.6.0">
>  #<<package-instance> version: "3.6.0" output: "/gnu/store/69k46yr70zkzzz9v18wl7sxasha5m0a5-r-minimal-3.6.0">
>  #<<package-instance> version: "3.6.0" output: "/gnu/store/71w0383x0hay6ng1zaddz59by3g0gxaf-r-minimal-3.6.0">
>  #<<package-instance> version: "3.6.0" output: "/gnu/store/m317mg8faxp9qn949dnv1xgsxyw57s3x-r-minimal-3.6.0">
>  #<<package-instance> version: "3.5.3" output: "/gnu/store/33qdfplk4riig77vqg758m1zll16n6f0-r-minimal-3.5.3">
>  #<<package-instance> version: "3.5.3" output: "/gnu/store/g8gkrcxn49yj8zjr59l7y4k7wgar5brq-r-minimal-3.5.3">
>  #<<package-instance> version: "3.5.1" output: "/gnu/store/vrgbyvnx7b1sva2ij5a1gwrkbfwmg3lm-r-minimal-3.5.1">
>  #<<package-instance> version: "3.5.1" output: "/gnu/store/4fzw7s0cv2zbixw4wb58zq6lkd97ghnf-r-minimal-3.5.1">
>  #<<package-instance> version: "3.5.1" output: "/gnu/store/yb5048dr40nbmyfa1ar4hfiy3kd06v4c-r-minimal-3.5.1">)
> scheme@(similarities)> (similarity-chart '("icecat" "gimp" "openmpi" "emacs" "diffoscope" "r-minimal") "/tmp/t.png" #:max 8)
> updating substitutes from 'https://ci.guix.gnu.org'... 100.0%
> $102 = #<cairo-surface 7f502c7adb10>
> --8<---------------cut here---------------end--------------->8---
>
> Thoughts?  :-)
>
> Ludo’.
>
> ;;; Copyright © 2020 Ludovic Courtès <ludo@gnu.org>
> ;;; Hereby placed under the GNU General Public License, version 3 or later.
>
> (define-module (similarities)
>   #:use-module (json)
>   #:use-module ((gcrypt hash) #:select (open-sha256-port))
>   #:use-module ((guix store) #:select (%default-substitute-urls))
>   #:use-module ((guix progress) #:hide (dump-port*))
>   #:use-module (guix http-client)
>   #:use-module (guix serialization)
>   #:use-module (guix scripts substitute)
>   #:use-module (srfi srfi-1)
>   #:use-module (srfi srfi-9)
>   #:use-module (srfi srfi-11)
>   #:use-module (rnrs bytevectors)
>   #:use-module (charting)
>   #:use-module (ice-9 match))
>
> \f
> ;;;
> ;;; Data Service client.
> ;;;
>
> (define-json-mapping <package-instance> make-package-instance
>   package-instance?
>   json->package-instance
>   (version     package-instance-version)
>   (output      package-instance-output "derivation"))
>
> (define %data-service-base-url
>   (make-parameter "https://data.guix.gnu.org"))
>
> (define* (package-instances package #:key (branch "master"))
>   "Return a list of <package-instance> representing instances of PACKAGE over
> time known to the Data Service."
>   (match (assoc-ref (json->scm
>                      (http-fetch (string-append (%data-service-base-url)
>                                                 "/repository/1/branch/"
>                                                 branch "/package/" package
>                                                 "/output-history.json")))
>                     "derivations")
>     (#(lst ...)
>      (map json->package-instance lst))
>     (#f
>      #f)))
>
> \f
> ;;;
> ;;; Similarity measurement.
> ;;;
>
> (define (port-sha256* port size)               ;from (guix scripts challenge)
>   ;; Like 'port-sha256', but limited to SIZE bytes.
>   (let-values (((out get) (open-sha256-port)))
>     (dump-port* port out size)
>     (close-port out)
>     (get)))
>
> (define-record-type <file-info>
>   (file-info name type size sha256)
>   file-info?
>   (name   file-info-name)
>   (type   file-info-type)
>   (size   file-info-size)
>   (sha256 file-info-sha256))
>
> (define (archive-contents port)
>   "Return a list of <file-info> records from the nar read from PORT."
>   ;; As in (guix scripts challenge), but return <file-info> records that
>   ;; include file size and ignore symlinks.
>   (fold-archive (lambda (file type contents result)
>                   (match type
>                     ((or 'regular 'executable)
>                      (match contents
>                        ((port . size)
>                         (cons (file-info file type size
>                                          (port-sha256* port size))
>                               result))))
>                     ('directory result)
>                     ('directory-complete result)
>                     ('symlink result)))
>                 '()
>                 port
>                 ""))
>
> (define (narinfo-contents narinfo)             ;from (guix scripts challenge)
>   "Fetch the nar described by NARINFO and return a list representing the file
> it contains."
>   ((@@ (guix scripts challenge) call-with-nar) narinfo archive-contents))
>
> (define (at-most max-length lst)              ;from (guix scripts substitute)
>   "If LST is shorter than MAX-LENGTH, return it and the empty list; otherwise
> return its MAX-LENGTH first elements and its tail."
>   (let loop ((len 0)
>              (lst lst)
>              (result '()))
>     (match lst
>       (()
>        (values (reverse result) '()))
>       ((head . tail)
>        (if (>= len max-length)
>            (values (reverse result) lst)
>            (loop (+ 1 len) tail (cons head result)))))))
>
> (define* (package-archive-contents package
>                                    #:key (max 10)
>                                    (substitute-urls %default-substitute-urls))
>   "Look at the MAX latest instances of PACKAGE, fetch them, and return a
> summary of their contents as returned by 'narinfo-contents'."
>   (let ((instances (at-most max (package-instances package))))
>     (map narinfo-contents
>          (lookup-narinfos (first substitute-urls)
>                           (map package-instance-output instances)))))
>
> (define (similarity contents1 contents2)
>   "Return two values: the ratio of identical bytes between CONTENTS2 and
> CONTENTS2, and the ratio of identical files."
>   (define (matches name)
>     (lambda (info)
>       (string=? (file-info-name info) name)))
>
>   (let ((files (delete-duplicates
>                 (append (map file-info-name contents1)
>                         (map file-info-name contents2)))))
>     (let loop ((files files)
>                (seen 0)
>                (identical 0)
>                (seen-bytes 0)
>                (identical-bytes 0))
>       (match files
>         (()
>          (values (/ identical-bytes seen-bytes)
>                  (/ identical seen)))
>         ((head . tail)
>          (let ((file1 (find (matches head) contents1))
>                (file2 (find (matches head) contents2)))
>            (cond ((not file1)
>                   (loop tail
>                         (+ seen 1) identical
>                         (+ seen-bytes (file-info-size file2))
>                         identical-bytes))
>                  ((not file2)
>                   (loop tail
>                         (+ seen 1) identical
>                         (+ seen-bytes (file-info-size file1))
>                         identical-bytes))
>                  (else
>                   (let ((identical?
>                          (and (= (file-info-size file1)
>                                  (file-info-size file2))
>                               (bytevector=? (file-info-sha256 file1)
>                                             (file-info-sha256 file2)))))
>                     (loop tail
>                           (+ seen 1)
>                           (if identical?
>                               (+ identical 1)
>                               identical)
>                           (+ seen-bytes
>                              (max (file-info-size file1)
>                                   (file-info-size file2)))
>                           (if identical?
>                               (+ identical-bytes (file-info-size file1))
>                               identical-bytes)))))))))))
>
> (define (pairwise-similarities contents)
>   (let loop ((contents contents)
>              (similarities '()))
>     (match contents
>       ((or () (_))
>        (reverse similarities))
>       ((a b . tail)
>        (loop (cons b tail)
>              (cons (similarity a b) similarities))))))
>
> (define* (similarity-chart packages file
>                            #:key (max 20))
>   (make-bar-chart
>    "Similarity across subsequent package revisions"
>    (map (lambda (package)
>           (let* ((contents     (package-archive-contents package
>                                                          #:max max))
>                  (similarities (pairwise-similarities contents))
>                  (labels       (iota (length similarities))))
>             `(,package
>                ,@(map (lambda (label ratio)
>                         `(,(* ratio 100.) ,(number->string label)))
>                       labels similarities))))
>         packages)
>    #:chart-params '(#:x-axis-label "package"
>                     #:y-axis-label "similarity ratio (%)")
>    #:write-to-png file))


  parent reply	other threads:[~2020-12-23 10:28 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-12-22 22:01 Identical files across subsequent package revisions Ludovic Courtès
2020-12-23  9:08 ` Taylan Kammer
2020-12-23 10:50   ` Pierre Neidhardt
2020-12-23 22:06   ` Mark H Weaver
2020-12-23 22:47     ` Jonathan Brielmaier
2020-12-23 23:42     ` Mark H Weaver
2020-12-23 10:19 ` zimoun [this message]
2020-12-23 10:41   ` zimoun
2020-12-27 22:22     ` Ludovic Courtès
2020-12-23 11:48 ` Christopher Baines
2020-12-23 13:10 ` Miguel Ángel Arruga Vivas
2020-12-23 14:07   ` zimoun
2020-12-23 15:40     ` Julien Lepiller
2020-12-23 19:07       ` Miguel Ángel Arruga Vivas
2020-12-27 22:29   ` Ludovic Courtès
2021-01-06  9:39     ` Ludovic Courtès
2020-12-29 20:01 ` pukkamustard
2021-01-06  9:27   ` Ludovic Courtès

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://guix.gnu.org/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=86eejgsu2y.fsf@gmail.com \
    --to=zimon.toutoune@gmail.com \
    --cc=guix-devel@gnu.org \
    --cc=ludo@gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/guix.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).