unofficial mirror of guix-devel@gnu.org 
 help / color / mirror / code / Atom feed
* Identical files across subsequent package revisions
@ 2020-12-22 22:01 Ludovic Courtès
  2020-12-23  9:08 ` Taylan Kammer
                   ` (4 more replies)
  0 siblings, 5 replies; 18+ messages in thread
From: Ludovic Courtès @ 2020-12-22 22:01 UTC (permalink / raw)
  To: Guix Devel

[-- Attachment #1: Type: text/plain, Size: 490 bytes --]

Hello Guix!

Every time a package changes, we end up downloading complete substitutes
for itself and for all its dependents, even though we have the intuition
that a large fraction of the files in those store items are unchanged.

I wanted to evaluate that by looking at store items corresponding to
subsequent revisions of a package (be it different versions or rebuilds
induced by dependencies), and this is what the program below does.

Here are preliminary results for a few packages:


[-- Attachment #2: the graph --]
[-- Type: image/png, Size: 21168 bytes --]

[-- Attachment #3: Type: text/plain, Size: 5371 bytes --]


It reads as follows:

  • Pairwise similarities of subsequent revisions of IceCat contain
    identical files representing ~10% of its total size (it’s the ratio
    of the size of the identical files to the total file size.)

    The first bar represents similarity between the first and the second
    revision; the second bar shows the similarity between the second one
    and the third one, etc.

  • For Emacs, identical files between subsequent revisions represent
    ~85% of its total size.  Intuitively, this is because the package
    contains mostly source code (.el.gz files) and bytecode, which
    remains unchanged.

  • Diffoscope is written in Python so it’s similar to Emacs: its .py
    file remain unchanged across revisions, and they represent ~60% of
    its total size.

  • GIMP has lots of headers, locale data, and icons that don’t change.

  • For R, we see the effect of the upgrades to 4.0.2 and then 4.0.3,
    where similarity drops to ~25% instead of ~80% when changes are in
    dependencies.

  • For Open MPI, which is compiled C + headers, ~25% is shared across
    revisions.

The reason I’m looking at this is to understand how much would be gained
in terms of bandwidth usage if we were able to avoid downloading
individual files already in the store.  It would seem to be rather
encouraging.

Below is the program that does that.  It grabs revision history from
data.guix.gnu.org, fetches nars from ci.guix.gnu.org, computes a
“digest” (list of files along with their hash and size), compares
package digests pairwise, and plots the result with Guile-Charting.
Example REPL session:

--8<---------------cut here---------------start------------->8---
scheme@(similarities)>  (pairwise-similarities (package-archive-contents "icecat" #:max 4))
updating substitutes from 'https://ci.guix.gnu.org'... 100.0%
$86 = (17363915/196387883 11380615/98152193)
scheme@(similarities)> (map exact->inexact $86)
$87 = (0.08841642740249916 0.11594865740799087)

[…]

scheme@(similarities)> ,pp (at-most 10 (package-instances "r-minimal") )
$100 = (#<<package-instance> version: "4.0.3" output: "/gnu/store/vv3ca1r5zw5y35xgkix4r80hdnncx52b-r-minimal-4.0.3">
 #<<package-instance> version: "4.0.3" output: "/gnu/store/5dzad7nhhv3dvmap60d6gsj9ppflgzrd-r-minimal-4.0.3">
 #<<package-instance> version: "4.0.3" output: "/gnu/store/01xi3sig314wgwa1j9sxk37vl816mj74-r-minimal-4.0.3">
 #<<package-instance> version: "4.0.2" output: "/gnu/store/nv7lqhnm0mncqwdpkkhnlsgb562lcwff-r-minimal-4.0.2">
 #<<package-instance> version: "4.0.2" output: "/gnu/store/w0izbm8q26dmyndhv46xr7dgz1irai1z-r-minimal-4.0.2">
 #<<package-instance> version: "4.0.2" output: "/gnu/store/yd83ibzxjrb7cgcc6d4smx4pqcdl8r3p-r-minimal-4.0.2">
 #<<package-instance> version: "4.0.1" output: "/gnu/store/kpdh0lwlwcwfmmfzqhwbi6j7m4zzxlmn-r-minimal-4.0.1">
 #<<package-instance> version: "4.0.1" output: "/gnu/store/9x9nzzsiyn1q7g5myhgwjh0yx9js3nrj-r-minimal-4.0.1">
 #<<package-instance> version: "4.0.0" output: "/gnu/store/ahbm2gsqc3420a23pcwrxd4pdhl7rdpp-r-minimal-4.0.0">
 #<<package-instance> version: "4.0.0" output: "/gnu/store/0sgqhj2628js419wvw1vc1cw07wil7gr-r-minimal-4.0.0">)
$101 = (#<<package-instance> version: "3.6.3" output: "/gnu/store/gmx6p0wk3xbc9ylv83zfj855azgjxr0p-r-minimal-3.6.3">
 #<<package-instance> version: "3.6.2" output: "/gnu/store/dnb6fzp5295fcda66dnjk2y51mcva20f-r-minimal-3.6.2">
 #<<package-instance> version: "3.6.1" output: "/gnu/store/gd6sm42b6fr1qgyp6p1zp3z4aavxwyk2-r-minimal-3.6.1">
 #<<package-instance> version: "3.6.1" output: "/gnu/store/lpmfhxys3vsaqmqvj85r344ygfmlmlbg-r-minimal-3.6.1">
 #<<package-instance> version: "3.6.1" output: "/gnu/store/4413h13v7zrb7rp9lvyp2gfx2laj60wm-r-minimal-3.6.1">
 #<<package-instance> version: "3.6.1" output: "/gnu/store/zm5pl1y0wmh3c845498pbjvrzrm6sb07-r-minimal-3.6.1">
 #<<package-instance> version: "3.6.1" output: "/gnu/store/xrv7y4xgrdrsx5qba7144cpw69qc5f0j-r-minimal-3.6.1">
 #<<package-instance> version: "3.6.0" output: "/gnu/store/cbwhhqv69xi3k3g25dcfhwjkkf2427rp-r-minimal-3.6.0">
 #<<package-instance> version: "3.6.0" output: "/gnu/store/69k46yr70zkzzz9v18wl7sxasha5m0a5-r-minimal-3.6.0">
 #<<package-instance> version: "3.6.0" output: "/gnu/store/71w0383x0hay6ng1zaddz59by3g0gxaf-r-minimal-3.6.0">
 #<<package-instance> version: "3.6.0" output: "/gnu/store/m317mg8faxp9qn949dnv1xgsxyw57s3x-r-minimal-3.6.0">
 #<<package-instance> version: "3.5.3" output: "/gnu/store/33qdfplk4riig77vqg758m1zll16n6f0-r-minimal-3.5.3">
 #<<package-instance> version: "3.5.3" output: "/gnu/store/g8gkrcxn49yj8zjr59l7y4k7wgar5brq-r-minimal-3.5.3">
 #<<package-instance> version: "3.5.1" output: "/gnu/store/vrgbyvnx7b1sva2ij5a1gwrkbfwmg3lm-r-minimal-3.5.1">
 #<<package-instance> version: "3.5.1" output: "/gnu/store/4fzw7s0cv2zbixw4wb58zq6lkd97ghnf-r-minimal-3.5.1">
 #<<package-instance> version: "3.5.1" output: "/gnu/store/yb5048dr40nbmyfa1ar4hfiy3kd06v4c-r-minimal-3.5.1">)
scheme@(similarities)> (similarity-chart '("icecat" "gimp" "openmpi" "emacs" "diffoscope" "r-minimal") "/tmp/t.png" #:max 8)
updating substitutes from 'https://ci.guix.gnu.org'... 100.0%
$102 = #<cairo-surface 7f502c7adb10>
--8<---------------cut here---------------end--------------->8---

Thoughts?  :-)

Ludo’.


[-- Attachment #4: the code --]
[-- Type: text/plain, Size: 7373 bytes --]

;;; Copyright © 2020 Ludovic Courtès <ludo@gnu.org>
;;; Hereby placed under the GNU General Public License, version 3 or later.

(define-module (similarities)
  #:use-module (json)
  #:use-module ((gcrypt hash) #:select (open-sha256-port))
  #:use-module ((guix store) #:select (%default-substitute-urls))
  #:use-module ((guix progress) #:hide (dump-port*))
  #:use-module (guix http-client)
  #:use-module (guix serialization)
  #:use-module (guix scripts substitute)
  #:use-module (srfi srfi-1)
  #:use-module (srfi srfi-9)
  #:use-module (srfi srfi-11)
  #:use-module (rnrs bytevectors)
  #:use-module (charting)
  #:use-module (ice-9 match))

\f
;;;
;;; Data Service client.
;;;

(define-json-mapping <package-instance> make-package-instance
  package-instance?
  json->package-instance
  (version     package-instance-version)
  (output      package-instance-output "derivation"))

(define %data-service-base-url
  (make-parameter "https://data.guix.gnu.org"))

(define* (package-instances package #:key (branch "master"))
  "Return a list of <package-instance> representing instances of PACKAGE over
time known to the Data Service."
  (match (assoc-ref (json->scm
                     (http-fetch (string-append (%data-service-base-url)
                                                "/repository/1/branch/"
                                                branch "/package/" package
                                                "/output-history.json")))
                    "derivations")
    (#(lst ...)
     (map json->package-instance lst))
    (#f
     #f)))

\f
;;;
;;; Similarity measurement.
;;;

(define (port-sha256* port size)               ;from (guix scripts challenge)
  ;; Like 'port-sha256', but limited to SIZE bytes.
  (let-values (((out get) (open-sha256-port)))
    (dump-port* port out size)
    (close-port out)
    (get)))

(define-record-type <file-info>
  (file-info name type size sha256)
  file-info?
  (name   file-info-name)
  (type   file-info-type)
  (size   file-info-size)
  (sha256 file-info-sha256))

(define (archive-contents port)
  "Return a list of <file-info> records from the nar read from PORT."
  ;; As in (guix scripts challenge), but return <file-info> records that
  ;; include file size and ignore symlinks.
  (fold-archive (lambda (file type contents result)
                  (match type
                    ((or 'regular 'executable)
                     (match contents
                       ((port . size)
                        (cons (file-info file type size
                                         (port-sha256* port size))
                              result))))
                    ('directory result)
                    ('directory-complete result)
                    ('symlink result)))
                '()
                port
                ""))

(define (narinfo-contents narinfo)             ;from (guix scripts challenge)
  "Fetch the nar described by NARINFO and return a list representing the file
it contains."
  ((@@ (guix scripts challenge) call-with-nar) narinfo archive-contents))

(define (at-most max-length lst)              ;from (guix scripts substitute)
  "If LST is shorter than MAX-LENGTH, return it and the empty list; otherwise
return its MAX-LENGTH first elements and its tail."
  (let loop ((len 0)
             (lst lst)
             (result '()))
    (match lst
      (()
       (values (reverse result) '()))
      ((head . tail)
       (if (>= len max-length)
           (values (reverse result) lst)
           (loop (+ 1 len) tail (cons head result)))))))

(define* (package-archive-contents package
                                   #:key (max 10)
                                   (substitute-urls %default-substitute-urls))
  "Look at the MAX latest instances of PACKAGE, fetch them, and return a
summary of their contents as returned by 'narinfo-contents'."
  (let ((instances (at-most max (package-instances package))))
    (map narinfo-contents
         (lookup-narinfos (first substitute-urls)
                          (map package-instance-output instances)))))

(define (similarity contents1 contents2)
  "Return two values: the ratio of identical bytes between CONTENTS2 and
CONTENTS2, and the ratio of identical files."
  (define (matches name)
    (lambda (info)
      (string=? (file-info-name info) name)))

  (let ((files (delete-duplicates
                (append (map file-info-name contents1)
                        (map file-info-name contents2)))))
    (let loop ((files files)
               (seen 0)
               (identical 0)
               (seen-bytes 0)
               (identical-bytes 0))
      (match files
        (()
         (values (/ identical-bytes seen-bytes)
                 (/ identical seen)))
        ((head . tail)
         (let ((file1 (find (matches head) contents1))
               (file2 (find (matches head) contents2)))
           (cond ((not file1)
                  (loop tail
                        (+ seen 1) identical
                        (+ seen-bytes (file-info-size file2))
                        identical-bytes))
                 ((not file2)
                  (loop tail
                        (+ seen 1) identical
                        (+ seen-bytes (file-info-size file1))
                        identical-bytes))
                 (else
                  (let ((identical?
                         (and (= (file-info-size file1)
                                 (file-info-size file2))
                              (bytevector=? (file-info-sha256 file1)
                                            (file-info-sha256 file2)))))
                    (loop tail
                          (+ seen 1)
                          (if identical?
                              (+ identical 1)
                              identical)
                          (+ seen-bytes
                             (max (file-info-size file1)
                                  (file-info-size file2)))
                          (if identical?
                              (+ identical-bytes (file-info-size file1))
                              identical-bytes)))))))))))

(define (pairwise-similarities contents)
  (let loop ((contents contents)
             (similarities '()))
    (match contents
      ((or () (_))
       (reverse similarities))
      ((a b . tail)
       (loop (cons b tail)
             (cons (similarity a b) similarities))))))

(define* (similarity-chart packages file
                           #:key (max 20))
  (make-bar-chart
   "Similarity across subsequent package revisions"
   (map (lambda (package)
          (let* ((contents     (package-archive-contents package
                                                         #:max max))
                 (similarities (pairwise-similarities contents))
                 (labels       (iota (length similarities))))
            `(,package
               ,@(map (lambda (label ratio)
                        `(,(* ratio 100.) ,(number->string label)))
                      labels similarities))))
        packages)
   #:chart-params '(#:x-axis-label "package"
                    #:y-axis-label "similarity ratio (%)")
   #:write-to-png file))

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2021-01-06  9:40 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2020-12-22 22:01 Identical files across subsequent package revisions Ludovic Courtès
2020-12-23  9:08 ` Taylan Kammer
2020-12-23 10:50   ` Pierre Neidhardt
2020-12-23 22:06   ` Mark H Weaver
2020-12-23 22:47     ` Jonathan Brielmaier
2020-12-23 23:42     ` Mark H Weaver
2020-12-23 10:19 ` zimoun
2020-12-23 10:41   ` zimoun
2020-12-27 22:22     ` Ludovic Courtès
2020-12-23 11:48 ` Christopher Baines
2020-12-23 13:10 ` Miguel Ángel Arruga Vivas
2020-12-23 14:07   ` zimoun
2020-12-23 15:40     ` Julien Lepiller
2020-12-23 19:07       ` Miguel Ángel Arruga Vivas
2020-12-27 22:29   ` Ludovic Courtès
2021-01-06  9:39     ` Ludovic Courtès
2020-12-29 20:01 ` pukkamustard
2021-01-06  9:27   ` Ludovic Courtès

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/guix.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).