unofficial mirror of bug-guix@gnu.org 
 help / color / mirror / code / Atom feed
From: "Ludovic Courtès" <ludo@gnu.org>
To: 24937@debbugs.gnu.org
Cc: Maxim Cournoyer <maxim.cournoyer@gmail.com>
Subject: bug#24937: "deleting unused links" GC phase is too slow
Date: Tue, 09 Nov 2021 15:44:49 +0100	[thread overview]
Message-ID: <87pmr9l76m.fsf@gnu.org> (raw)
In-Reply-To: <87wpg7ffbm.fsf@gnu.org> ("Ludovic Courtès"'s message of "Sun, 13 Nov 2016 18:41:01 +0100")

[-- Attachment #1: Type: text/plain, Size: 1112 bytes --]

Hi!

ludo@gnu.org (Ludovic Courtès) skribis:

> ‘LocalStore::removeUnusedLinks’ traverses all the entries in
> /gnu/store/.links and calls lstat(2) on each one of them and checks
> ‘st_nlink’ to determine whether they can be deleted.
>
> There are two problems: lstat(2) can be slow on spinning disks as found
> on hydra.gnu.org, and the algorithm is proportional in the number of
> entries in /gnu/store/.links, which is a lot on hydra.gnu.org.

Taking a step back, we could perhaps mitigate this with heuristics to
reduce the number of entries in .links:

  1. Do not deduplicate files with a size lower than some threshold;

  2. Delete links with st_nlink <= 3 (instead of <= 2); that would
     prevent *further* deduplication of those files, but they’d already
     have two instances sharing the same inode;

  3. Stop deduplicating once the number of entries in .links has reached
     a certain threshold.

For #1, a key insight is that about 30% of the files actually
deduplicated (in my store, where /gnu/store/.links has 2.2M entries) are
smaller than 1 KiB:


[-- Attachment #2: sizes --]
[-- Type: image/png, Size: 22211 bytes --]

[-- Attachment #3: Type: text/plain, Size: 93 bytes --]


… but 85% of them have at most 4 links (thus, saving up to 2 KiB by
deduplicating):


[-- Attachment #4: link counts for files < 1KiB --]
[-- Type: image/png, Size: 20456 bytes --]

[-- Attachment #5: Type: text/plain, Size: 440 bytes --]


On my laptop, we’re talking about space savings of 325 MiB, a tiny
fraction of my store:

--8<---------------cut here---------------start------------->8---
scheme@(guile-user)> (saved-space (filter (lambda (file)
					    (< (deduplicated-file-size file) 1024))
					  l))
$40 = 325914739
--8<---------------cut here---------------end--------------->8---

Files smaller than 1 KiB represent 35% of the entries in .links:


[-- Attachment #6: distribution by file size --]
[-- Type: image/png, Size: 21654 bytes --]

[-- Attachment #7: Type: text/plain, Size: 728 bytes --]


By not deduplicating files smaller than 1 KiB, we’d reduce the number of
entries by 35%, which should already have a tangible impact on
performance.  It’d be a “mitigation” more than a “fix”, but it has a
good work/reward ratio.

We could conduct a similar analysis for #2.

#3 is more difficult to implement because you cannot know the number of
entries in .links until you’ve traversed it (note that currently
deduplication stops when link(2) returns ENOSPC in .links).

I’m attaching the script I’ve used for that, derived from an earlier
experiment¹.  You’re welcome to give it a spin!

Thoughts?

Ludo’.

¹ https://lists.gnu.org/archive/html/guix-devel/2014-09/msg00422.html


[-- Attachment #8: the script --]
[-- Type: text/plain, Size: 5803 bytes --]

(use-modules (charting)
             ((guix store) #:select (%store-prefix))
             (ice-9 ftw)
             (ice-9 match)
             (srfi srfi-1)
             (srfi srfi-9))

(define-record-type <deduplicated-file>
  (deduplicated-file name size links)
  deduplicated-file?
  (name    deduplicated-file-name)
  (size    deduplicated-file-size)
  (links   deduplicated-file-link-count))

(define %links-directory
  (string-append (%store-prefix) "/.links"))

(define (links)
  "Return a list of <deduplicated-file>."
  (file-system-fold (const #t)
                    (lambda (file stat result)      ;leaf
                      (cons (deduplicated-file file
                                               (stat:size stat)
                                               (stat:nlink stat))
                            result))
                    (lambda (directory stat result) ;down
                      result)
                    (lambda (directory stat result) ;up
                      result)
                    (const #f)                      ;skip
                    (lambda (file stat errno result)
                      (error "i/o error" file errno))
                    '()
                    %links-directory
                    lstat))

(define KiB (expt 2 10))
(define MiB (* KiB KiB))
(define GiB (* KiB MiB))

(define (saved-space files)
  "Return the total amount of saved space given FILES, a list of
<deduplicated-file>."
  (fold (lambda (df result)
          (match df
            (($ <deduplicated-file> name size links)
             (when (< links 2)
               (error "too few links" name links))
             (+ result (* size (- links 2))))))
        0
        files))

(define (cumulative-distribution files property)
  "Return a list of (VALUE . COUNT) pairs representing the number of FILES
whose PROPERTY is VALUE or less."
  (define (file<? df1 df2)
    (< (property df1) (property df2)))

  (fold (lambda (df result)
          (match result
            (((value . count) . rest)
             (let ((current (property df)))
               (if (= value current)
                   (alist-cons value (+ count 1) rest)
                   (alist-cons current (+ count 1) result))))
            (_
             (alist-cons (property df) 1 result))))
        '()
        (sort files file<?)))

(define* (plot-distribution distribution output
                            #:key (subtitle "SUBTITLE") max-x
                            (group-name "GROUP") x-axis-label)
  "Plot DISTRIBUTION, and produce file OUTPUT."
  (define (log2 n)
    (let loop ((result 1))
      (if (zero? (ash n (- result)))
          (- result 1)
          (loop (+ 1 result)))))

  (define (format-log2-tick tick)
    ;; (string-append "2^"
    ;;                (number->string (log2 (inexact->exact tick))))
    (number->string (inexact->exact tick)))

  (define (adjust-items total)
    (lambda (x)
      (match x
        ;; XXX: Filter out the two cases that would give us a numerical
        ;; overflow.
        ((0 . _) #f)
        ((1 . _) #f)
        ((value . count)
         (and (or (not max-x) (< value max-x))
              (cons value (* 100. (/ count total))))))))

  (match distribution
    (((_ . total) . rest)
     (let ((percent (filter-map (adjust-items total) distribution)))
       (make-scatter-plot #:title (string-append "Cumulative distribution by "
                                                 subtitle)
                          #:data `((,group-name ,@percent))
                          #:x-axis-label x-axis-label
                          #:y-axis-label "%"
                          #:tick-label-formatter format-log2-tick
                          #:log-x-base 2
                          #:min-x 1
                          #:max-y 101
                          #:write-to-png output)))))

#! Examples
(define l (links))  ;this is the expensive part
(plot-distribution (cumulative-distribution l deduplicated-file-link-count)
                   "/tmp/nlink.png" #:x-axis-label "number of hard links"
                   #:subtitle "hard link count" #:max-x 2048
                   #:group-name "nlinks")

(plot-distribution (cumulative-distribution
                    (filter (lambda (file)
                              (< (deduplicated-file-size file) 1024))
                            l)
                    deduplicated-file-link-count)
                   "/tmp/nlink-small.png" #:x-axis-label "number of hard links"
                   #:subtitle "hard link count for files < 1KiB" #:max-x 2048
                   #:group-name "nlinks")


(plot-distribution (cumulative-distribution l deduplicated-file-size)
                   "/tmp/size.png" #:x-axis-label "file size"
                   #:subtitle "file size" #:max-x 32768
                   #:group-name "size (B)")

(plot-distribution (cumulative-distribution
                    (filter (lambda (f)
                              (> (deduplicated-file-link-count f) 2))
                            l)
                    deduplicated-file-size)
                   "/tmp/size-deduplicated.png" #:x-axis-label "file size" #:subtitle
                   "size for files actually deduplicated" #:max-x 32768
                   #:group-name "size (B)")

(plot-distribution (cumulative-distribution
                    (filter (lambda (file)
                              (< (deduplicated-file-size file) 1024))
                            l)
                    (lambda (file)
                      (* (deduplicated-file-size file)
                         (- (deduplicated-file-link-count file) 2))))
                   "/tmp/size-savings.png" #:x-axis-label "savings" #:subtitle
                   "savings for files < 1KiB" #:max-x 32768
                   #:group-name "savings (B)")
!#

  parent reply	other threads:[~2021-11-09 14:46 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-11-13 17:41 bug#24937: "deleting unused links" GC phase is too slow Ludovic Courtès
2016-12-09 22:43 ` Ludovic Courtès
2016-12-11 13:46 ` Ludovic Courtès
2016-12-11 14:23   ` Mark H Weaver
2016-12-11 18:02     ` Ludovic Courtès
2016-12-11 19:27       ` Mark H Weaver
2016-12-13  0:00         ` Ludovic Courtès
2016-12-13 12:48           ` Mark H Weaver
2016-12-13 17:02             ` Ludovic Courtès
2016-12-13 17:18               ` Ricardo Wurmus
2020-04-16 13:26                 ` Ricardo Wurmus
2020-04-16 14:27                   ` Ricardo Wurmus
2020-04-17  8:16                     ` Ludovic Courtès
2020-04-17  8:28                       ` Ricardo Wurmus
2016-12-13  4:09         ` Mark H Weaver
2016-12-15  1:19           ` Mark H Weaver
2021-11-09 14:44 ` Ludovic Courtès [this message]
2021-11-09 15:00   ` Ludovic Courtès
2021-11-11 20:59   ` Maxim Cournoyer
2021-11-13 16:56     ` Ludovic Courtès
2021-11-13 21:37       ` bug#24937: [PATCH 1/2] tests: Factorize 'file=?' Ludovic Courtès
2021-11-13 21:37         ` bug#24937: [PATCH 2/2] daemon: Do not deduplicate files smaller than 4 KiB Ludovic Courtès
2021-11-16 13:54           ` bug#24937: "deleting unused links" GC phase is too slow Ludovic Courtès
2021-11-13 21:45       ` Ludovic Courtès
2021-11-22  2:30 ` John Kehayias via Bug reports for GNU Guix

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://guix.gnu.org/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87pmr9l76m.fsf@gnu.org \
    --to=ludo@gnu.org \
    --cc=24937@debbugs.gnu.org \
    --cc=maxim.cournoyer@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/guix.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).