unofficial mirror of guix-devel@gnu.org 
 help / color / mirror / code / Atom feed
* Identical files across subsequent package revisions
@ 2020-12-22 22:01 Ludovic Courtès
  2020-12-23  9:08 ` Taylan Kammer
                   ` (4 more replies)
  0 siblings, 5 replies; 18+ messages in thread
From: Ludovic Courtès @ 2020-12-22 22:01 UTC (permalink / raw)
  To: Guix Devel

[-- Attachment #1: Type: text/plain, Size: 490 bytes --]

Hello Guix!

Every time a package changes, we end up downloading complete substitutes
for itself and for all its dependents, even though we have the intuition
that a large fraction of the files in those store items are unchanged.

I wanted to evaluate that by looking at store items corresponding to
subsequent revisions of a package (be it different versions or rebuilds
induced by dependencies), and this is what the program below does.

Here are preliminary results for a few packages:


[-- Attachment #2: the graph --]
[-- Type: image/png, Size: 21168 bytes --]

[-- Attachment #3: Type: text/plain, Size: 5371 bytes --]


It reads as follows:

  • Pairwise similarities of subsequent revisions of IceCat contain
    identical files representing ~10% of its total size (it’s the ratio
    of the size of the identical files to the total file size.)

    The first bar represents similarity between the first and the second
    revision; the second bar shows the similarity between the second one
    and the third one, etc.

  • For Emacs, identical files between subsequent revisions represent
    ~85% of its total size.  Intuitively, this is because the package
    contains mostly source code (.el.gz files) and bytecode, which
    remains unchanged.

  • Diffoscope is written in Python so it’s similar to Emacs: its .py
    file remain unchanged across revisions, and they represent ~60% of
    its total size.

  • GIMP has lots of headers, locale data, and icons that don’t change.

  • For R, we see the effect of the upgrades to 4.0.2 and then 4.0.3,
    where similarity drops to ~25% instead of ~80% when changes are in
    dependencies.

  • For Open MPI, which is compiled C + headers, ~25% is shared across
    revisions.

The reason I’m looking at this is to understand how much would be gained
in terms of bandwidth usage if we were able to avoid downloading
individual files already in the store.  It would seem to be rather
encouraging.

Below is the program that does that.  It grabs revision history from
data.guix.gnu.org, fetches nars from ci.guix.gnu.org, computes a
“digest” (list of files along with their hash and size), compares
package digests pairwise, and plots the result with Guile-Charting.
Example REPL session:

--8<---------------cut here---------------start------------->8---
scheme@(similarities)>  (pairwise-similarities (package-archive-contents "icecat" #:max 4))
updating substitutes from 'https://ci.guix.gnu.org'... 100.0%
$86 = (17363915/196387883 11380615/98152193)
scheme@(similarities)> (map exact->inexact $86)
$87 = (0.08841642740249916 0.11594865740799087)

[…]

scheme@(similarities)> ,pp (at-most 10 (package-instances "r-minimal") )
$100 = (#<<package-instance> version: "4.0.3" output: "/gnu/store/vv3ca1r5zw5y35xgkix4r80hdnncx52b-r-minimal-4.0.3">
 #<<package-instance> version: "4.0.3" output: "/gnu/store/5dzad7nhhv3dvmap60d6gsj9ppflgzrd-r-minimal-4.0.3">
 #<<package-instance> version: "4.0.3" output: "/gnu/store/01xi3sig314wgwa1j9sxk37vl816mj74-r-minimal-4.0.3">
 #<<package-instance> version: "4.0.2" output: "/gnu/store/nv7lqhnm0mncqwdpkkhnlsgb562lcwff-r-minimal-4.0.2">
 #<<package-instance> version: "4.0.2" output: "/gnu/store/w0izbm8q26dmyndhv46xr7dgz1irai1z-r-minimal-4.0.2">
 #<<package-instance> version: "4.0.2" output: "/gnu/store/yd83ibzxjrb7cgcc6d4smx4pqcdl8r3p-r-minimal-4.0.2">
 #<<package-instance> version: "4.0.1" output: "/gnu/store/kpdh0lwlwcwfmmfzqhwbi6j7m4zzxlmn-r-minimal-4.0.1">
 #<<package-instance> version: "4.0.1" output: "/gnu/store/9x9nzzsiyn1q7g5myhgwjh0yx9js3nrj-r-minimal-4.0.1">
 #<<package-instance> version: "4.0.0" output: "/gnu/store/ahbm2gsqc3420a23pcwrxd4pdhl7rdpp-r-minimal-4.0.0">
 #<<package-instance> version: "4.0.0" output: "/gnu/store/0sgqhj2628js419wvw1vc1cw07wil7gr-r-minimal-4.0.0">)
$101 = (#<<package-instance> version: "3.6.3" output: "/gnu/store/gmx6p0wk3xbc9ylv83zfj855azgjxr0p-r-minimal-3.6.3">
 #<<package-instance> version: "3.6.2" output: "/gnu/store/dnb6fzp5295fcda66dnjk2y51mcva20f-r-minimal-3.6.2">
 #<<package-instance> version: "3.6.1" output: "/gnu/store/gd6sm42b6fr1qgyp6p1zp3z4aavxwyk2-r-minimal-3.6.1">
 #<<package-instance> version: "3.6.1" output: "/gnu/store/lpmfhxys3vsaqmqvj85r344ygfmlmlbg-r-minimal-3.6.1">
 #<<package-instance> version: "3.6.1" output: "/gnu/store/4413h13v7zrb7rp9lvyp2gfx2laj60wm-r-minimal-3.6.1">
 #<<package-instance> version: "3.6.1" output: "/gnu/store/zm5pl1y0wmh3c845498pbjvrzrm6sb07-r-minimal-3.6.1">
 #<<package-instance> version: "3.6.1" output: "/gnu/store/xrv7y4xgrdrsx5qba7144cpw69qc5f0j-r-minimal-3.6.1">
 #<<package-instance> version: "3.6.0" output: "/gnu/store/cbwhhqv69xi3k3g25dcfhwjkkf2427rp-r-minimal-3.6.0">
 #<<package-instance> version: "3.6.0" output: "/gnu/store/69k46yr70zkzzz9v18wl7sxasha5m0a5-r-minimal-3.6.0">
 #<<package-instance> version: "3.6.0" output: "/gnu/store/71w0383x0hay6ng1zaddz59by3g0gxaf-r-minimal-3.6.0">
 #<<package-instance> version: "3.6.0" output: "/gnu/store/m317mg8faxp9qn949dnv1xgsxyw57s3x-r-minimal-3.6.0">
 #<<package-instance> version: "3.5.3" output: "/gnu/store/33qdfplk4riig77vqg758m1zll16n6f0-r-minimal-3.5.3">
 #<<package-instance> version: "3.5.3" output: "/gnu/store/g8gkrcxn49yj8zjr59l7y4k7wgar5brq-r-minimal-3.5.3">
 #<<package-instance> version: "3.5.1" output: "/gnu/store/vrgbyvnx7b1sva2ij5a1gwrkbfwmg3lm-r-minimal-3.5.1">
 #<<package-instance> version: "3.5.1" output: "/gnu/store/4fzw7s0cv2zbixw4wb58zq6lkd97ghnf-r-minimal-3.5.1">
 #<<package-instance> version: "3.5.1" output: "/gnu/store/yb5048dr40nbmyfa1ar4hfiy3kd06v4c-r-minimal-3.5.1">)
scheme@(similarities)> (similarity-chart '("icecat" "gimp" "openmpi" "emacs" "diffoscope" "r-minimal") "/tmp/t.png" #:max 8)
updating substitutes from 'https://ci.guix.gnu.org'... 100.0%
$102 = #<cairo-surface 7f502c7adb10>
--8<---------------cut here---------------end--------------->8---

Thoughts?  :-)

Ludo’.


[-- Attachment #4: the code --]
[-- Type: text/plain, Size: 7373 bytes --]

;;; Copyright © 2020 Ludovic Courtès <ludo@gnu.org>
;;; Hereby placed under the GNU General Public License, version 3 or later.

(define-module (similarities)
  #:use-module (json)
  #:use-module ((gcrypt hash) #:select (open-sha256-port))
  #:use-module ((guix store) #:select (%default-substitute-urls))
  #:use-module ((guix progress) #:hide (dump-port*))
  #:use-module (guix http-client)
  #:use-module (guix serialization)
  #:use-module (guix scripts substitute)
  #:use-module (srfi srfi-1)
  #:use-module (srfi srfi-9)
  #:use-module (srfi srfi-11)
  #:use-module (rnrs bytevectors)
  #:use-module (charting)
  #:use-module (ice-9 match))

\f
;;;
;;; Data Service client.
;;;

(define-json-mapping <package-instance> make-package-instance
  package-instance?
  json->package-instance
  (version     package-instance-version)
  (output      package-instance-output "derivation"))

(define %data-service-base-url
  (make-parameter "https://data.guix.gnu.org"))

(define* (package-instances package #:key (branch "master"))
  "Return a list of <package-instance> representing instances of PACKAGE over
time known to the Data Service."
  (match (assoc-ref (json->scm
                     (http-fetch (string-append (%data-service-base-url)
                                                "/repository/1/branch/"
                                                branch "/package/" package
                                                "/output-history.json")))
                    "derivations")
    (#(lst ...)
     (map json->package-instance lst))
    (#f
     #f)))

\f
;;;
;;; Similarity measurement.
;;;

(define (port-sha256* port size)               ;from (guix scripts challenge)
  ;; Like 'port-sha256', but limited to SIZE bytes.
  (let-values (((out get) (open-sha256-port)))
    (dump-port* port out size)
    (close-port out)
    (get)))

(define-record-type <file-info>
  (file-info name type size sha256)
  file-info?
  (name   file-info-name)
  (type   file-info-type)
  (size   file-info-size)
  (sha256 file-info-sha256))

(define (archive-contents port)
  "Return a list of <file-info> records from the nar read from PORT."
  ;; As in (guix scripts challenge), but return <file-info> records that
  ;; include file size and ignore symlinks.
  (fold-archive (lambda (file type contents result)
                  (match type
                    ((or 'regular 'executable)
                     (match contents
                       ((port . size)
                        (cons (file-info file type size
                                         (port-sha256* port size))
                              result))))
                    ('directory result)
                    ('directory-complete result)
                    ('symlink result)))
                '()
                port
                ""))

(define (narinfo-contents narinfo)             ;from (guix scripts challenge)
  "Fetch the nar described by NARINFO and return a list representing the file
it contains."
  ((@@ (guix scripts challenge) call-with-nar) narinfo archive-contents))

(define (at-most max-length lst)              ;from (guix scripts substitute)
  "If LST is shorter than MAX-LENGTH, return it and the empty list; otherwise
return its MAX-LENGTH first elements and its tail."
  (let loop ((len 0)
             (lst lst)
             (result '()))
    (match lst
      (()
       (values (reverse result) '()))
      ((head . tail)
       (if (>= len max-length)
           (values (reverse result) lst)
           (loop (+ 1 len) tail (cons head result)))))))

(define* (package-archive-contents package
                                   #:key (max 10)
                                   (substitute-urls %default-substitute-urls))
  "Look at the MAX latest instances of PACKAGE, fetch them, and return a
summary of their contents as returned by 'narinfo-contents'."
  (let ((instances (at-most max (package-instances package))))
    (map narinfo-contents
         (lookup-narinfos (first substitute-urls)
                          (map package-instance-output instances)))))

(define (similarity contents1 contents2)
  "Return two values: the ratio of identical bytes between CONTENTS2 and
CONTENTS2, and the ratio of identical files."
  (define (matches name)
    (lambda (info)
      (string=? (file-info-name info) name)))

  (let ((files (delete-duplicates
                (append (map file-info-name contents1)
                        (map file-info-name contents2)))))
    (let loop ((files files)
               (seen 0)
               (identical 0)
               (seen-bytes 0)
               (identical-bytes 0))
      (match files
        (()
         (values (/ identical-bytes seen-bytes)
                 (/ identical seen)))
        ((head . tail)
         (let ((file1 (find (matches head) contents1))
               (file2 (find (matches head) contents2)))
           (cond ((not file1)
                  (loop tail
                        (+ seen 1) identical
                        (+ seen-bytes (file-info-size file2))
                        identical-bytes))
                 ((not file2)
                  (loop tail
                        (+ seen 1) identical
                        (+ seen-bytes (file-info-size file1))
                        identical-bytes))
                 (else
                  (let ((identical?
                         (and (= (file-info-size file1)
                                 (file-info-size file2))
                              (bytevector=? (file-info-sha256 file1)
                                            (file-info-sha256 file2)))))
                    (loop tail
                          (+ seen 1)
                          (if identical?
                              (+ identical 1)
                              identical)
                          (+ seen-bytes
                             (max (file-info-size file1)
                                  (file-info-size file2)))
                          (if identical?
                              (+ identical-bytes (file-info-size file1))
                              identical-bytes)))))))))))

(define (pairwise-similarities contents)
  (let loop ((contents contents)
             (similarities '()))
    (match contents
      ((or () (_))
       (reverse similarities))
      ((a b . tail)
       (loop (cons b tail)
             (cons (similarity a b) similarities))))))

(define* (similarity-chart packages file
                           #:key (max 20))
  (make-bar-chart
   "Similarity across subsequent package revisions"
   (map (lambda (package)
          (let* ((contents     (package-archive-contents package
                                                         #:max max))
                 (similarities (pairwise-similarities contents))
                 (labels       (iota (length similarities))))
            `(,package
               ,@(map (lambda (label ratio)
                        `(,(* ratio 100.) ,(number->string label)))
                      labels similarities))))
        packages)
   #:chart-params '(#:x-axis-label "package"
                    #:y-axis-label "similarity ratio (%)")
   #:write-to-png file))

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Identical files across subsequent package revisions
  2020-12-22 22:01 Identical files across subsequent package revisions Ludovic Courtès
@ 2020-12-23  9:08 ` Taylan Kammer
  2020-12-23 10:50   ` Pierre Neidhardt
  2020-12-23 22:06   ` Mark H Weaver
  2020-12-23 10:19 ` zimoun
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 18+ messages in thread
From: Taylan Kammer @ 2020-12-23  9:08 UTC (permalink / raw)
  To: Ludovic Courtès, Guix Devel

On 22.12.2020 23:01, Ludovic Courtès wrote:
> 
> Thoughts?  :-)
> 

My first thought: Neat, would love to see this implemented! :D

My second thought: it's surprising that IceCat supposedly changes so 
much between releases.  I suppose the reason is that this analysis is on 
a per-file basis, and IceCat is mostly a massive binary.  That leads me 
to wonder: what about binary diffs for large files?

Perhaps for all files that are bigger than N bytes, we could check if 
the binary diff is X% or smaller of the total size, and if so, the build 
servers could host the diff file alongside the full file.  (Substitute N 
and X for sensible values, like maybe 5 MB and 50%.)

But that could be a second, separate step I suppose.

- Taylan


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Identical files across subsequent package revisions
  2020-12-22 22:01 Identical files across subsequent package revisions Ludovic Courtès
  2020-12-23  9:08 ` Taylan Kammer
@ 2020-12-23 10:19 ` zimoun
  2020-12-23 10:41   ` zimoun
  2020-12-23 11:48 ` Christopher Baines
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 18+ messages in thread
From: zimoun @ 2020-12-23 10:19 UTC (permalink / raw)
  To: Ludovic Courtès, Guix Devel

Hi Ludo,

On Tue, 22 Dec 2020 at 23:01, Ludovic Courtès <ludo@gnu.org> wrote:

> I wanted to evaluate that by looking at store items corresponding to
> subsequent revisions of a package (be it different versions or rebuilds
> induced by dependencies), and this is what the program below does.
>
> Here are preliminary results for a few packages:

[...]

> The reason I’m looking at this is to understand how much would be gained
> in terms of bandwidth usage if we were able to avoid downloading
> individual files already in the store.  It would seem to be rather
> encouraging.

This is really interesting.  Especially through normal distribution and
co. lens.  The mean needs to be weighed: 10% of 180MB is not the same
than 55% of 1MB (icecat vs diffoscope, from “guix size” and I could
misread the numbers).  The variance between revisions is highly variable
(from 0 to significant).  The one between packages says how fat or fit
the normal distribution is.

A smarter distribution could lead to some bandwith saving.  The question
is: predict how many?  Is it worth to add a bit of complexity to save
corner cases or for all.

By corner cases, I mean that for example I am not updating my Emacs at
each revisions.

<https://data.guix.gnu.org/repository/1/branch/master/package/emacs/output-history>




> Below is the program that does that.  It grabs revision history from
> data.guix.gnu.org, fetches nars from ci.guix.gnu.org, computes a
> “digest” (list of files along with their hash and size), compares
> package digests pairwise, and plots the result with Guile-Charting.
> Example REPL session:
>
> --8<---------------cut here---------------start------------->8---
> scheme@(similarities)>  (pairwise-similarities (package-archive-contents "icecat" #:max 4))
> updating substitutes from 'https://ci.guix.gnu.org'... 100.0%
> $86 = (17363915/196387883 11380615/98152193)
> scheme@(similarities)> (map exact->inexact $86)
> $87 = (0.08841642740249916 0.11594865740799087)
>
> […]
>
> scheme@(similarities)> ,pp (at-most 10 (package-instances "r-minimal") )
> $100 = (#<<package-instance> version: "4.0.3" output: "/gnu/store/vv3ca1r5zw5y35xgkix4r80hdnncx52b-r-minimal-4.0.3">
>  #<<package-instance> version: "4.0.3" output: "/gnu/store/5dzad7nhhv3dvmap60d6gsj9ppflgzrd-r-minimal-4.0.3">
>  #<<package-instance> version: "4.0.3" output: "/gnu/store/01xi3sig314wgwa1j9sxk37vl816mj74-r-minimal-4.0.3">
>  #<<package-instance> version: "4.0.2" output: "/gnu/store/nv7lqhnm0mncqwdpkkhnlsgb562lcwff-r-minimal-4.0.2">
>  #<<package-instance> version: "4.0.2" output: "/gnu/store/w0izbm8q26dmyndhv46xr7dgz1irai1z-r-minimal-4.0.2">
>  #<<package-instance> version: "4.0.2" output: "/gnu/store/yd83ibzxjrb7cgcc6d4smx4pqcdl8r3p-r-minimal-4.0.2">
>  #<<package-instance> version: "4.0.1" output: "/gnu/store/kpdh0lwlwcwfmmfzqhwbi6j7m4zzxlmn-r-minimal-4.0.1">
>  #<<package-instance> version: "4.0.1" output: "/gnu/store/9x9nzzsiyn1q7g5myhgwjh0yx9js3nrj-r-minimal-4.0.1">
>  #<<package-instance> version: "4.0.0" output: "/gnu/store/ahbm2gsqc3420a23pcwrxd4pdhl7rdpp-r-minimal-4.0.0">
>  #<<package-instance> version: "4.0.0" output: "/gnu/store/0sgqhj2628js419wvw1vc1cw07wil7gr-r-minimal-4.0.0">)
> $101 = (#<<package-instance> version: "3.6.3" output: "/gnu/store/gmx6p0wk3xbc9ylv83zfj855azgjxr0p-r-minimal-3.6.3">
>  #<<package-instance> version: "3.6.2" output: "/gnu/store/dnb6fzp5295fcda66dnjk2y51mcva20f-r-minimal-3.6.2">
>  #<<package-instance> version: "3.6.1" output: "/gnu/store/gd6sm42b6fr1qgyp6p1zp3z4aavxwyk2-r-minimal-3.6.1">
>  #<<package-instance> version: "3.6.1" output: "/gnu/store/lpmfhxys3vsaqmqvj85r344ygfmlmlbg-r-minimal-3.6.1">
>  #<<package-instance> version: "3.6.1" output: "/gnu/store/4413h13v7zrb7rp9lvyp2gfx2laj60wm-r-minimal-3.6.1">
>  #<<package-instance> version: "3.6.1" output: "/gnu/store/zm5pl1y0wmh3c845498pbjvrzrm6sb07-r-minimal-3.6.1">
>  #<<package-instance> version: "3.6.1" output: "/gnu/store/xrv7y4xgrdrsx5qba7144cpw69qc5f0j-r-minimal-3.6.1">
>  #<<package-instance> version: "3.6.0" output: "/gnu/store/cbwhhqv69xi3k3g25dcfhwjkkf2427rp-r-minimal-3.6.0">
>  #<<package-instance> version: "3.6.0" output: "/gnu/store/69k46yr70zkzzz9v18wl7sxasha5m0a5-r-minimal-3.6.0">
>  #<<package-instance> version: "3.6.0" output: "/gnu/store/71w0383x0hay6ng1zaddz59by3g0gxaf-r-minimal-3.6.0">
>  #<<package-instance> version: "3.6.0" output: "/gnu/store/m317mg8faxp9qn949dnv1xgsxyw57s3x-r-minimal-3.6.0">
>  #<<package-instance> version: "3.5.3" output: "/gnu/store/33qdfplk4riig77vqg758m1zll16n6f0-r-minimal-3.5.3">
>  #<<package-instance> version: "3.5.3" output: "/gnu/store/g8gkrcxn49yj8zjr59l7y4k7wgar5brq-r-minimal-3.5.3">
>  #<<package-instance> version: "3.5.1" output: "/gnu/store/vrgbyvnx7b1sva2ij5a1gwrkbfwmg3lm-r-minimal-3.5.1">
>  #<<package-instance> version: "3.5.1" output: "/gnu/store/4fzw7s0cv2zbixw4wb58zq6lkd97ghnf-r-minimal-3.5.1">
>  #<<package-instance> version: "3.5.1" output: "/gnu/store/yb5048dr40nbmyfa1ar4hfiy3kd06v4c-r-minimal-3.5.1">)
> scheme@(similarities)> (similarity-chart '("icecat" "gimp" "openmpi" "emacs" "diffoscope" "r-minimal") "/tmp/t.png" #:max 8)
> updating substitutes from 'https://ci.guix.gnu.org'... 100.0%
> $102 = #<cairo-surface 7f502c7adb10>
> --8<---------------cut here---------------end--------------->8---
>
> Thoughts?  :-)
>
> Ludo’.
>
> ;;; Copyright © 2020 Ludovic Courtès <ludo@gnu.org>
> ;;; Hereby placed under the GNU General Public License, version 3 or later.
>
> (define-module (similarities)
>   #:use-module (json)
>   #:use-module ((gcrypt hash) #:select (open-sha256-port))
>   #:use-module ((guix store) #:select (%default-substitute-urls))
>   #:use-module ((guix progress) #:hide (dump-port*))
>   #:use-module (guix http-client)
>   #:use-module (guix serialization)
>   #:use-module (guix scripts substitute)
>   #:use-module (srfi srfi-1)
>   #:use-module (srfi srfi-9)
>   #:use-module (srfi srfi-11)
>   #:use-module (rnrs bytevectors)
>   #:use-module (charting)
>   #:use-module (ice-9 match))
>
> \f
> ;;;
> ;;; Data Service client.
> ;;;
>
> (define-json-mapping <package-instance> make-package-instance
>   package-instance?
>   json->package-instance
>   (version     package-instance-version)
>   (output      package-instance-output "derivation"))
>
> (define %data-service-base-url
>   (make-parameter "https://data.guix.gnu.org"))
>
> (define* (package-instances package #:key (branch "master"))
>   "Return a list of <package-instance> representing instances of PACKAGE over
> time known to the Data Service."
>   (match (assoc-ref (json->scm
>                      (http-fetch (string-append (%data-service-base-url)
>                                                 "/repository/1/branch/"
>                                                 branch "/package/" package
>                                                 "/output-history.json")))
>                     "derivations")
>     (#(lst ...)
>      (map json->package-instance lst))
>     (#f
>      #f)))
>
> \f
> ;;;
> ;;; Similarity measurement.
> ;;;
>
> (define (port-sha256* port size)               ;from (guix scripts challenge)
>   ;; Like 'port-sha256', but limited to SIZE bytes.
>   (let-values (((out get) (open-sha256-port)))
>     (dump-port* port out size)
>     (close-port out)
>     (get)))
>
> (define-record-type <file-info>
>   (file-info name type size sha256)
>   file-info?
>   (name   file-info-name)
>   (type   file-info-type)
>   (size   file-info-size)
>   (sha256 file-info-sha256))
>
> (define (archive-contents port)
>   "Return a list of <file-info> records from the nar read from PORT."
>   ;; As in (guix scripts challenge), but return <file-info> records that
>   ;; include file size and ignore symlinks.
>   (fold-archive (lambda (file type contents result)
>                   (match type
>                     ((or 'regular 'executable)
>                      (match contents
>                        ((port . size)
>                         (cons (file-info file type size
>                                          (port-sha256* port size))
>                               result))))
>                     ('directory result)
>                     ('directory-complete result)
>                     ('symlink result)))
>                 '()
>                 port
>                 ""))
>
> (define (narinfo-contents narinfo)             ;from (guix scripts challenge)
>   "Fetch the nar described by NARINFO and return a list representing the file
> it contains."
>   ((@@ (guix scripts challenge) call-with-nar) narinfo archive-contents))
>
> (define (at-most max-length lst)              ;from (guix scripts substitute)
>   "If LST is shorter than MAX-LENGTH, return it and the empty list; otherwise
> return its MAX-LENGTH first elements and its tail."
>   (let loop ((len 0)
>              (lst lst)
>              (result '()))
>     (match lst
>       (()
>        (values (reverse result) '()))
>       ((head . tail)
>        (if (>= len max-length)
>            (values (reverse result) lst)
>            (loop (+ 1 len) tail (cons head result)))))))
>
> (define* (package-archive-contents package
>                                    #:key (max 10)
>                                    (substitute-urls %default-substitute-urls))
>   "Look at the MAX latest instances of PACKAGE, fetch them, and return a
> summary of their contents as returned by 'narinfo-contents'."
>   (let ((instances (at-most max (package-instances package))))
>     (map narinfo-contents
>          (lookup-narinfos (first substitute-urls)
>                           (map package-instance-output instances)))))
>
> (define (similarity contents1 contents2)
>   "Return two values: the ratio of identical bytes between CONTENTS2 and
> CONTENTS2, and the ratio of identical files."
>   (define (matches name)
>     (lambda (info)
>       (string=? (file-info-name info) name)))
>
>   (let ((files (delete-duplicates
>                 (append (map file-info-name contents1)
>                         (map file-info-name contents2)))))
>     (let loop ((files files)
>                (seen 0)
>                (identical 0)
>                (seen-bytes 0)
>                (identical-bytes 0))
>       (match files
>         (()
>          (values (/ identical-bytes seen-bytes)
>                  (/ identical seen)))
>         ((head . tail)
>          (let ((file1 (find (matches head) contents1))
>                (file2 (find (matches head) contents2)))
>            (cond ((not file1)
>                   (loop tail
>                         (+ seen 1) identical
>                         (+ seen-bytes (file-info-size file2))
>                         identical-bytes))
>                  ((not file2)
>                   (loop tail
>                         (+ seen 1) identical
>                         (+ seen-bytes (file-info-size file1))
>                         identical-bytes))
>                  (else
>                   (let ((identical?
>                          (and (= (file-info-size file1)
>                                  (file-info-size file2))
>                               (bytevector=? (file-info-sha256 file1)
>                                             (file-info-sha256 file2)))))
>                     (loop tail
>                           (+ seen 1)
>                           (if identical?
>                               (+ identical 1)
>                               identical)
>                           (+ seen-bytes
>                              (max (file-info-size file1)
>                                   (file-info-size file2)))
>                           (if identical?
>                               (+ identical-bytes (file-info-size file1))
>                               identical-bytes)))))))))))
>
> (define (pairwise-similarities contents)
>   (let loop ((contents contents)
>              (similarities '()))
>     (match contents
>       ((or () (_))
>        (reverse similarities))
>       ((a b . tail)
>        (loop (cons b tail)
>              (cons (similarity a b) similarities))))))
>
> (define* (similarity-chart packages file
>                            #:key (max 20))
>   (make-bar-chart
>    "Similarity across subsequent package revisions"
>    (map (lambda (package)
>           (let* ((contents     (package-archive-contents package
>                                                          #:max max))
>                  (similarities (pairwise-similarities contents))
>                  (labels       (iota (length similarities))))
>             `(,package
>                ,@(map (lambda (label ratio)
>                         `(,(* ratio 100.) ,(number->string label)))
>                       labels similarities))))
>         packages)
>    #:chart-params '(#:x-axis-label "package"
>                     #:y-axis-label "similarity ratio (%)")
>    #:write-to-png file))


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Identical files across subsequent package revisions
  2020-12-23 10:19 ` zimoun
@ 2020-12-23 10:41   ` zimoun
  2020-12-27 22:22     ` Ludovic Courtès
  0 siblings, 1 reply; 18+ messages in thread
From: zimoun @ 2020-12-23 10:41 UTC (permalink / raw)
  To: Ludovic Courtès, Guix Devel

Hi again,

Sorry, I have not finished my previous email.  Wrong keystroke in the
wrong buffer because misuse of the Ludo’s code. :-)


On Wed, 23 Dec 2020 at 11:19, zimoun <zimon.toutoune@gmail.com> wrote:

>> I wanted to evaluate that by looking at store items corresponding to
>> subsequent revisions of a package (be it different versions or rebuilds
>> induced by dependencies), and this is what the program below does.

What I wanted to illustrate is “revision” does not mean “new version“
and I do not know what is the typical usage by people; aside Guix
dev. :-)

How many times per week or month people are doing “guix pull && guix
upgrade” almost blindly without reviewing what is new?


Last, I was trying to understand what is the “revision” number
corresponding to which outputs.  For example r-minimal:

<https://data.guix.gnu.org/repository/1/branch/master/package/r-minimal/output-history>

There are 3 outputs with 4.0.3, so I am expecting 3 similar revisions in
the chart.  Then 3 others with 4.0.2, then 2 others with 4.0.1.

In the 6 revisions in the chart, I was trying to figure out what does
mean the ’0’: r-minimal against which other one, i.e., which Guix commit
against which other one.  Idem for 1,2,3,4,5. :-)


BTW, really interesting topic!  Thanks Ludo. :-)

Cheers,
simon


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Identical files across subsequent package revisions
  2020-12-23  9:08 ` Taylan Kammer
@ 2020-12-23 10:50   ` Pierre Neidhardt
  2020-12-23 22:06   ` Mark H Weaver
  1 sibling, 0 replies; 18+ messages in thread
From: Pierre Neidhardt @ 2020-12-23 10:50 UTC (permalink / raw)
  To: Taylan Kammer, Ludovic Courtès, Guix Devel

[-- Attachment #1: Type: text/plain, Size: 266 bytes --]

Taylan Kammer <taylan.kammer@gmail.com> writes:

> On 22.12.2020 23:01, Ludovic Courtès wrote:
>> 
>> Thoughts?  :-)
>> 
>
> My first thought: Neat, would love to see this implemented! :D

I second that! :)

-- 
Pierre Neidhardt
https://ambrevar.xyz/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 511 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Identical files across subsequent package revisions
  2020-12-22 22:01 Identical files across subsequent package revisions Ludovic Courtès
  2020-12-23  9:08 ` Taylan Kammer
  2020-12-23 10:19 ` zimoun
@ 2020-12-23 11:48 ` Christopher Baines
  2020-12-23 13:10 ` Miguel Ángel Arruga Vivas
  2020-12-29 20:01 ` pukkamustard
  4 siblings, 0 replies; 18+ messages in thread
From: Christopher Baines @ 2020-12-23 11:48 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: guix-devel

[-- Attachment #1: Type: text/plain, Size: 376 bytes --]


Ludovic Courtès <ludo@gnu.org> writes:

> The reason I’m looking at this is to understand how much would be gained
> in terms of bandwidth usage if we were able to avoid downloading
> individual files already in the store.  It would seem to be rather
> encouraging.

Very cool! This might help guide the implementation of using things like
IPFS for substitutes.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 987 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Identical files across subsequent package revisions
  2020-12-22 22:01 Identical files across subsequent package revisions Ludovic Courtès
                   ` (2 preceding siblings ...)
  2020-12-23 11:48 ` Christopher Baines
@ 2020-12-23 13:10 ` Miguel Ángel Arruga Vivas
  2020-12-23 14:07   ` zimoun
  2020-12-27 22:29   ` Ludovic Courtès
  2020-12-29 20:01 ` pukkamustard
  4 siblings, 2 replies; 18+ messages in thread
From: Miguel Ángel Arruga Vivas @ 2020-12-23 13:10 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: Guix Devel

Hi Ludo,

Just one interjection: wow! :-)

Ludovic Courtès <ludo@gnu.org> writes:

> Hello Guix!
>
> Every time a package changes, we end up downloading complete substitutes
> for itself and for all its dependents, even though we have the intuition
> that a large fraction of the files in those store items are unchanged.

It's great you're taking a look into these kind of optimizations, as
they also close the gap between only-binary distribution and the
substitutes system.

> [Awesome data collection omitted for brevity]
>
> Thoughts?  :-)

Probably you're already aware of it, but I want to mention that
Tridgell's thesis[1] contains a very neat approach to this problem.

A naive prototype would be copying of the latest available nar of the
package on the client side and using it as the destination for a copy
using rsync.  Either the protocol used by the rsync application, or a
protocol based on those ideas, could be implemented over the HTTP layer;
client and server implementation and cooperation would be needed though.

Another idea that might fit well into that kind of protocol---with
harder impact on the design, and probably with a high cost on the
runtime---would be the "upgrade" of the deduplication process towards a
content-based file system as git does[2].  This way the a description of
the nar contents (size, hash) could trigger the retrieval only of the
needed files not found in the current store.

Nonetheless, these are only thoughts, I'll ping back if and when I have
something more tangible. ;-)

Happy hacking!
Miguel

[1] https://rsync.samba.org/~tridge/phd_thesis.pdf
[2] https://git-scm.com/book/en/v2/Git-Internals-Git-Objects


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Identical files across subsequent package revisions
  2020-12-23 13:10 ` Miguel Ángel Arruga Vivas
@ 2020-12-23 14:07   ` zimoun
  2020-12-23 15:40     ` Julien Lepiller
  2020-12-27 22:29   ` Ludovic Courtès
  1 sibling, 1 reply; 18+ messages in thread
From: zimoun @ 2020-12-23 14:07 UTC (permalink / raw)
  To: Miguel Ángel Arruga Vivas, Ludovic Courtès; +Cc: Guix Devel

Hi,

On Wed, 23 Dec 2020 at 14:10, Miguel Ángel Arruga Vivas <rosen644835@gmail.com> wrote:

> Probably you're already aware of it, but I want to mention that
> Tridgell's thesis[1] contains a very neat approach to this problem.

This thesis is a must to read! :-)


> A naive prototype would be copying of the latest available nar of the
> package on the client side and using it as the destination for a copy
> using rsync.  Either the protocol used by the rsync application, or a
> protocol based on those ideas, could be implemented over the HTTP layer;
> client and server implementation and cooperation would be needed
> though.

I could misunderstand and miss something, one part of the problem is how
to detect “latest”; other said how to know it is different.  From my
memories, and I have drunk couple of beers since I read the thesis :-),
the ’rsync’ approach uses timestamp and size.  And if you switch to
checksum instead, the performances are poor, because of IO.  Well, it
depends on the number of files and their size, if this checksum are
computed ahead, etc.


> Another idea that might fit well into that kind of protocol---with
> harder impact on the design, and probably with a high cost on the
> runtime---would be the "upgrade" of the deduplication process towards a
> content-based file system as git does[2].  This way the a description of
> the nar contents (size, hash) could trigger the retrieval only of the
> needed files not found in the current store.

Is it not related to Content-Addressed Store?  i.e, «intensional model»?

Chap. 6: <https://edolstra.github.io/pubs/phd-thesis.pdf>
Nix FRC: <https://github.com/tweag/rfcs/blob/cas-rfc/rfcs/0062-content-addressed-paths.md>


Cheers,
simon


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Identical files across subsequent package revisions
  2020-12-23 14:07   ` zimoun
@ 2020-12-23 15:40     ` Julien Lepiller
  2020-12-23 19:07       ` Miguel Ángel Arruga Vivas
  0 siblings, 1 reply; 18+ messages in thread
From: Julien Lepiller @ 2020-12-23 15:40 UTC (permalink / raw)
  To: guix-devel, zimoun, Miguel Ángel Arruga Vivas,
	Ludovic Courtès
  Cc: Guix Devel



Le 23 décembre 2020 09:07:23 GMT-05:00, zimoun <zimon.toutoune@gmail.com> a écrit :
>Hi,
>
>On Wed, 23 Dec 2020 at 14:10, Miguel Ángel Arruga Vivas
><rosen644835@gmail.com> wrote:
>
>> Probably you're already aware of it, but I want to mention that
>> Tridgell's thesis[1] contains a very neat approach to this problem.
>
>This thesis is a must to read! :-)
>
>
>> A naive prototype would be copying of the latest available nar of the
>> package on the client side and using it as the destination for a copy
>> using rsync.  Either the protocol used by the rsync application, or a
>> protocol based on those ideas, could be implemented over the HTTP
>layer;
>> client and server implementation and cooperation would be needed
>> though.
>
>I could misunderstand and miss something, one part of the problem is
>how
>to detect “latest”; other said how to know it is different.  From my
>memories, and I have drunk couple of beers since I read the thesis :-),
>the ’rsync’ approach uses timestamp and size.  And if you switch to
>checksum instead, the performances are poor, because of IO.  Well, it
>depends on the number of files and their size, if this checksum are
>computed ahead, etc.
>
>
>> Another idea that might fit well into that kind of protocol---with
>> harder impact on the design, and probably with a high cost on the
>> runtime---would be the "upgrade" of the deduplication process towards
>a
>> content-based file system as git does[2].  This way the a description
>of
>> the nar contents (size, hash) could trigger the retrieval only of the
>> needed files not found in the current store.
>
>Is it not related to Content-Addressed Store?  i.e, «intensional
>model»?
>
>Chap. 6: <https://edolstra.github.io/pubs/phd-thesis.pdf>
>Nix FRC:
><https://github.com/tweag/rfcs/blob/cas-rfc/rfcs/0062-content-addressed-paths.md>

I think this is different, because we're talking about sub-element content-addressing. The intensional model is about content-addressing whole store elements. I think the idea would be to save individual files in, say, /gnu/store/.links, and let nar or narinfo files describe the files to retrieve. If we are missing some, we'd download them, then create hardlinks. This could even help our deduplication I think :)

>
>
>Cheers,
>simon


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Identical files across subsequent package revisions
  2020-12-23 15:40     ` Julien Lepiller
@ 2020-12-23 19:07       ` Miguel Ángel Arruga Vivas
  0 siblings, 0 replies; 18+ messages in thread
From: Miguel Ángel Arruga Vivas @ 2020-12-23 19:07 UTC (permalink / raw)
  To: Julien Lepiller; +Cc: guix-devel

Hi Julien and Simon,

Julien Lepiller <julien@lepiller.eu> writes:

> Le 23 décembre 2020 09:07:23 GMT-05:00, zimoun <zimon.toutoune@gmail.com> a écrit :
>>Hi,
>>
>>On Wed, 23 Dec 2020 at 14:10, Miguel Ángel Arruga Vivas
>><rosen644835@gmail.com> wrote:
>>> Another idea that might fit well into that kind of protocol---with
>>> harder impact on the design, and probably with a high cost on the
>>> runtime---would be the "upgrade" of the deduplication process towards
>>a
>>> content-based file system as git does[2].  This way the a description
>>of
>>> the nar contents (size, hash) could trigger the retrieval only of the
>>> needed files not found in the current store.
>>
>>Is it not related to Content-Addressed Store?  i.e, «intensional
>>model»?
>>
>>Chap. 6: <https://edolstra.github.io/pubs/phd-thesis.pdf>
>>Nix FRC:
>><https://github.com/tweag/rfcs/blob/cas-rfc/rfcs/0062-content-addressed-paths.md>
>
> I think this is different, because we're talking about sub-element
> content-addressing. The intensional model is about content-addressing
> whole store elements. I think the idea would be to save individual
> files in, say, /gnu/store/.links, and let nar or narinfo files
> describe the files to retrieve. If we are missing some, we'd download
> them, then create hardlinks. This could even help our deduplication I
> think :)

Exactly.  My first approach would be a tree %links-prefix/hash/size, to
where all the internal contents of each store item would be hard linked:
mainly Git's approach with some touch here and there---some of them
probably have too much good will, the first approach isn't usually the
best. :-)

- The garbage collection process could check if there is any hard link
  to those files or remove them otherwise, deleting the hash folder when
  needed.

- The deduplication process could be in charge of moving the files and
  placing hard links instead, but hash collisions with the same size are
  always a possibility, therefore some mechanism is needed to treat
  these cases[1] and the vulnerabilities associated to them.

- The substitution process could retrieve from the server the
  information about the needed files, check which contents are already
  available and which ones must be retrieved, and ensure that the
  "generated nar" is the same as the one from the server.  This is quite
  related to the deduplication process and the mechanism used there[2].

Happy hacking!
Miguel

[1] Perhaps the usage of two different hash algorithms instead of one,
or different salts, could be enough for the "common" error case as a
collision on both cases are quite improbable.  They are possible anyway
with a size bigger than the hash size, therefore a final fallback to
actual bytes is probably needed.

[2] The folder could be hashed, even with a temporary salt agreed with
the server, to perform an independent/real-time check, but any issue
here has bigger consequences too, as no byte to byte comparison is
possible before the actual transmission.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Identical files across subsequent package revisions
  2020-12-23  9:08 ` Taylan Kammer
  2020-12-23 10:50   ` Pierre Neidhardt
@ 2020-12-23 22:06   ` Mark H Weaver
  2020-12-23 22:47     ` Jonathan Brielmaier
  2020-12-23 23:42     ` Mark H Weaver
  1 sibling, 2 replies; 18+ messages in thread
From: Mark H Weaver @ 2020-12-23 22:06 UTC (permalink / raw)
  To: Taylan Kammer, Ludovic Courtès, Guix Devel

Hi Taylan,

Taylan Kammer <taylan.kammer@gmail.com> writes:
> My second thought: it's surprising that IceCat supposedly changes so 
> much between releases.  I suppose the reason is that this analysis is on 
> a per-file basis, and IceCat is mostly a massive binary.

FYI, here's the reason that IceCat is unusual among the projects sampled
in Ludovic's analysis: the large collection of JavaScript source files
and other auxiliary files, which are mostly unchanged from one release
to the next, are bundled into a single zip file in lib/icecat/omni.ja.

      Mark


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Identical files across subsequent package revisions
  2020-12-23 22:06   ` Mark H Weaver
@ 2020-12-23 22:47     ` Jonathan Brielmaier
  2020-12-23 23:42     ` Mark H Weaver
  1 sibling, 0 replies; 18+ messages in thread
From: Jonathan Brielmaier @ 2020-12-23 22:47 UTC (permalink / raw)
  To: guix-devel

On 23.12.20 23:06, Mark H Weaver wrote:
> Hi Taylan,
>
> Taylan Kammer <taylan.kammer@gmail.com> writes:
>> My second thought: it's surprising that IceCat supposedly changes so
>> much between releases.  I suppose the reason is that this analysis is on
>> a per-file basis, and IceCat is mostly a massive binary.
>
> FYI, here's the reason that IceCat is unusual among the projects sampled
> in Ludovic's analysis: the large collection of JavaScript source files
> and other auxiliary files, which are mostly unchanged from one release
> to the next, are bundled into a single zip file in lib/icecat/omni.ja.

And even more most of the changed files differ only via the different
hash in the output path. So the actual diff is a few hundred lines of
almost the same -> should get compressed very good...


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Identical files across subsequent package revisions
  2020-12-23 22:06   ` Mark H Weaver
  2020-12-23 22:47     ` Jonathan Brielmaier
@ 2020-12-23 23:42     ` Mark H Weaver
  1 sibling, 0 replies; 18+ messages in thread
From: Mark H Weaver @ 2020-12-23 23:42 UTC (permalink / raw)
  To: Taylan Kammer, Ludovic Courtès, Guix Devel

I wrote:
> FYI, here's the reason that IceCat is unusual among the projects sampled
> in Ludovic's analysis: the large collection of JavaScript source files
> and other auxiliary files, which are mostly unchanged from one release
> to the next, are bundled into a single zip file in lib/icecat/omni.ja.

Correction: there are actually two omni.ja files.  The second one is in
lib/icecat/browser/omni.ja.

      Mark


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Identical files across subsequent package revisions
  2020-12-23 10:41   ` zimoun
@ 2020-12-27 22:22     ` Ludovic Courtès
  0 siblings, 0 replies; 18+ messages in thread
From: Ludovic Courtès @ 2020-12-27 22:22 UTC (permalink / raw)
  To: zimoun; +Cc: Guix Devel

Hi!

zimoun <zimon.toutoune@gmail.com> skribis:

> What I wanted to illustrate is “revision” does not mean “new version“
> and I do not know what is the typical usage by people; aside Guix
> dev. :-)
>
> How many times per week or month people are doing “guix pull && guix
> upgrade” almost blindly without reviewing what is new?

Good question.  We should look at differences over time, not just
differences compared to the previous revision, if you see what I mean.

> Last, I was trying to understand what is the “revision” number
> corresponding to which outputs.  For example r-minimal:
>
> <https://data.guix.gnu.org/repository/1/branch/master/package/r-minimal/output-history>
>
> There are 3 outputs with 4.0.3, so I am expecting 3 similar revisions in
> the chart.  Then 3 others with 4.0.2, then 2 others with 4.0.1.
>
> In the 6 revisions in the chart, I was trying to figure out what does
> mean the ’0’: r-minimal against which other one, i.e., which Guix commit
> against which other one.  Idem for 1,2,3,4,5. :-)

In the chart I sent, probably the drop in r-minimal has to do with the
version change.

Thanks,
Ludo’.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Identical files across subsequent package revisions
  2020-12-23 13:10 ` Miguel Ángel Arruga Vivas
  2020-12-23 14:07   ` zimoun
@ 2020-12-27 22:29   ` Ludovic Courtès
  2021-01-06  9:39     ` Ludovic Courtès
  1 sibling, 1 reply; 18+ messages in thread
From: Ludovic Courtès @ 2020-12-27 22:29 UTC (permalink / raw)
  To: Miguel Ángel Arruga Vivas; +Cc: Guix Devel

Hi!

Miguel Ángel Arruga Vivas <rosen644835@gmail.com> skribis:

> Another idea that might fit well into that kind of protocol---with
> harder impact on the design, and probably with a high cost on the
> runtime---would be the "upgrade" of the deduplication process towards a
> content-based file system as git does[2].  This way the a description of
> the nar contents (size, hash) could trigger the retrieval only of the
> needed files not found in the current store.

Yes, that’s what I had in mind.  :-)

It could go along these lines:

  1. GET /digest/xyz-emacs; the digest contains a list of file/hash
     pairs essentially;

  2. traverse digest and hardlink the files already in /gnu/store/.links
     to the target directory;

  3. pipeline-GET the remaining files and install them.

Like you write, since we already have deduplication, part of it comes
for free.

Ludo’.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Identical files across subsequent package revisions
  2020-12-22 22:01 Identical files across subsequent package revisions Ludovic Courtès
                   ` (3 preceding siblings ...)
  2020-12-23 13:10 ` Miguel Ángel Arruga Vivas
@ 2020-12-29 20:01 ` pukkamustard
  2021-01-06  9:27   ` Ludovic Courtès
  4 siblings, 1 reply; 18+ messages in thread
From: pukkamustard @ 2020-12-29 20:01 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: guix-devel


Hi Ludo,

>
> Thoughts?  :-)
>

Super cool! :)

Your research inspired me to do conduct some experiments towards
de-duplication.

For two similar packages (emacs-27.1 and emacs-26.3) I was able to
de-duplicate ~12% using EROFS and ERIS. Still far from the ~85%
similarity, but an attempt I'd like to share.

The two main ingredients:

- EROFS (Enhanced Read-Only File-System) is a read-only, 
  compressed
  file-system comparable to SquashFS. It has some properties that 
  make
  it more suitable than SquashFS (it aligns content to fixed block
  size). EROFS is in mainline Linux Kernel since v5.4.

- ERIS (Encoding for Robust Immutable Storage) is an encoding of 
  content
  into uniformly sized blocks that I've been working on. It 
  de-couples
  encoding of content from storage and transport layer. Transport 
  layers
  can be things like IPFS, GNUNet, Named Data Network or just a 
  plain
  old HTTP service.

I make EROFS images of the packages and encode them with ERIS, 
which
de-duplicates blocks as part of the encoding process.

With this I manage to de-duplicate between 12-17% (depending on 
some
parameters).

This could allow:

- Directly mounting packages instead of unarchiving (a la distri)
- Peer-to-peer distribution of packages (that's what ERIS is for)
- De-duplicating common content in packages to a certain extent 
  (topic
  of this thread)

A more in-depth write-up:
https://gitlab.com/openengiadina/eris/-/tree/main/examples/dedup-fs

Happy Hacking!
-pukkamustard



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Identical files across subsequent package revisions
  2020-12-29 20:01 ` pukkamustard
@ 2021-01-06  9:27   ` Ludovic Courtès
  0 siblings, 0 replies; 18+ messages in thread
From: Ludovic Courtès @ 2021-01-06  9:27 UTC (permalink / raw)
  To: pukkamustard; +Cc: guix-devel

Hi!

pukkamustard <pukkamustard@posteo.net> skribis:

> Your research inspired me to do conduct some experiments towards
> de-duplication.
>
> For two similar packages (emacs-27.1 and emacs-26.3) I was able to
> de-duplicate ~12% using EROFS and ERIS. Still far from the ~85%
> similarity, but an attempt I'd like to share.
>
> The two main ingredients:
>
> - EROFS (Enhanced Read-Only File-System) is a read-only,   compressed
>  file-system comparable to SquashFS. It has some properties that
>  make
>  it more suitable than SquashFS (it aligns content to fixed block
>  size). EROFS is in mainline Linux Kernel since v5.4.
>
> - ERIS (Encoding for Robust Immutable Storage) is an encoding of
>   content
>  into uniformly sized blocks that I've been working on. It
>  de-couples
>  encoding of content from storage and transport layer. Transport
>  layers
>  can be things like IPFS, GNUNet, Named Data Network or just a   plain
>  old HTTP service.
>
> I make EROFS images of the packages and encode them with ERIS, which
> de-duplicates blocks as part of the encoding process.
>
> With this I manage to de-duplicate between 12-17% (depending on some
> parameters).

Very nice!  I wonder what the file-level similarity is compared to the
block-level similarity.

> This could allow:
>
> - Directly mounting packages instead of unarchiving (a la distri)

Yeah, I’m not sure about this part.  It would be a radical change for
Guix in terms of code, and I also wonder about the efficiency: sure
mounting the package would be instantaneous, but subsequent reads would
be slowed down compared to the current approach.  Maybe the slowdown is
only on the first hit though, and maybe it’s hardly measurable, dunno.

> - Peer-to-peer distribution of packages (that's what ERIS is for)

Yup, looking forward to that.

> - De-duplicating common content in packages to a certain extent
>   (topic
>  of this thread)
>
> A more in-depth write-up:
> https://gitlab.com/openengiadina/eris/-/tree/main/examples/dedup-fs

Great writeup and nice tooling that you have here!

Thanks,
Ludo’.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Identical files across subsequent package revisions
  2020-12-27 22:29   ` Ludovic Courtès
@ 2021-01-06  9:39     ` Ludovic Courtès
  0 siblings, 0 replies; 18+ messages in thread
From: Ludovic Courtès @ 2021-01-06  9:39 UTC (permalink / raw)
  To: Miguel Ángel Arruga Vivas; +Cc: Guix Devel

Hi,

Ludovic Courtès <ludo@gnu.org> skribis:

> It could go along these lines:
>
>   1. GET /digest/xyz-emacs; the digest contains a list of file/hash
>      pairs essentially;
>
>   2. traverse digest and hardlink the files already in /gnu/store/.links
>      to the target directory;
>
>   3. pipeline-GET the remaining files and install them.
>
> Like you write, since we already have deduplication, part of it comes
> for free.

I prototyped that during the holidays and pushed the result as
‘wip-digests’ (rough on the edges!).

It works essentially as written above.  ‘guix substitute’ fetches
digests at the same time as it fetches narinfos so it can keep them in
cache, which in turn means that when you install things it already has
info locally to determine which files it already has and which ones it
needs to fetch (in the best case, the substitute phase simply hard-links
files locally without connecting to the server!).

One of the open issues is compression.  ‘guix publish’ now serves
individual files, but obviously they should be compressed somehow.
For nars, we have /lzip, /gzip, etc. URLs, which gives clients more
freedom on the choice of compression but is also quite heavyweight.  For
individual files, I’m thinking we could use ‘Accept-Encoding’ to let the
client declare which compression formats it supports and then let ‘guix
publish’ choose the one it prefers.

In the worst case, when none of the files are available locally, we’d be
downloading all the individual files, which is probably more data than a
single compressed nar.  I guess ‘guix substitute’ could fall back to nar
download when too little is available locally.

The downside compared to using ERIS or IPFS is that it’s really a hack.
The upside is that it’s little work and it’s efficient since we already
have that content-addressed file store in /gnu/store/.links.

Thoughts?

Ludo’.


^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2021-01-06  9:40 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-12-22 22:01 Identical files across subsequent package revisions Ludovic Courtès
2020-12-23  9:08 ` Taylan Kammer
2020-12-23 10:50   ` Pierre Neidhardt
2020-12-23 22:06   ` Mark H Weaver
2020-12-23 22:47     ` Jonathan Brielmaier
2020-12-23 23:42     ` Mark H Weaver
2020-12-23 10:19 ` zimoun
2020-12-23 10:41   ` zimoun
2020-12-27 22:22     ` Ludovic Courtès
2020-12-23 11:48 ` Christopher Baines
2020-12-23 13:10 ` Miguel Ángel Arruga Vivas
2020-12-23 14:07   ` zimoun
2020-12-23 15:40     ` Julien Lepiller
2020-12-23 19:07       ` Miguel Ángel Arruga Vivas
2020-12-27 22:29   ` Ludovic Courtès
2021-01-06  9:39     ` Ludovic Courtès
2020-12-29 20:01 ` pukkamustard
2021-01-06  9:27   ` Ludovic Courtès

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/guix.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).