all messages for Guix-related lists mirrored at yhetil.org
 help / color / mirror / code / Atom feed
From: zimoun <zimon.toutoune@gmail.com>
To: Guix Devel <guix-devel@gnu.org>
Subject: Some stats about the graph of dependencies
Date: Fri, 09 Dec 2022 18:29:43 +0100	[thread overview]
Message-ID: <874ju4qyd4.fsf@gmail.com> (raw)

[-- Attachment #1: Type: text/plain, Size: 4734 bytes --]

Hi,

Preparing some Python stuff, I was toying with the package
python-networkx.  And Guix is awesome because it is easy to extract the
graph of dependencies.

Here dependencies are just inputs, native-inputs and propagated-inputs.
It could be interesting to also include build-system dependencies, I
have been lazy. :-)

My initial question is to know what are the “essentials”?  By essential,
I mean the “important“ ones, the “hot” ones, etc.  The ones which are
“influencers” – yeah the world is a social network. :-)

First, let extract the graph with a tiny Scheme script:

    $ guix repl -- packages-to-dict.scm > dod.py

Then, let import that into an IPython session:

    $ guix shell python python-ipython \
           python-scipy python-matplotlib python-networkx  -- ipython

and run another tiny Python script for plotting.  See Figure attached.

We can compare a link analysis metrics [1] and a centrality measure
[2]; say PageRank [3] and Eigenvector [4].  More the value is large and
higher the package is “important“ (for this metrics).

And the Directed and Undirected graphs can be compared, using Networkx
[5,6].  Well, Eigenvector centrality (or Katz centrality [7]) is failing
because the power iteration does not converge but other metrics could be
also considered.  Here is just a first rough toy. :-)

According to PageRank applied to the Directed Graph, the 10 most
“important” packages are:

--8<---------------cut here---------------start------------->8---
[('pkg-config-0.29.2', 0.02418335991713879),
 ('perl-5.34.0', 0.015404032767249512),
 ('coreutils-minimal-8.32', 0.013240458675517012),
 ('zlib-1.2.11', 0.009107245584307803),
 ('python-pytest-6.2.5', 0.008413060648307678),
 ('ncurses-6.2.20210619', 0.007598925467605917),
 ('r-knitr-1.41', 0.00554772892485958),
 ('sbcl-rt-1990.12.19-1.a6a7503', 0.004884721933452539),
 ('bzip2-1.0.8', 0.004800877844001881),
 ('python-3.9.9', 0.00415536078558266)]
--8<---------------cut here---------------end--------------->8---

And if we compare the 3 results (Undirected with PageRank and
Eigenvector, and Directed with PageRank only, then 10 most “important”
packages are:

--8<---------------cut here---------------start------------->8---
['pkg-config-0.29.2',
 'glib-2.70.2',
 'zlib-1.2.11',
 'gtk+-3.24.30',
 'perl-5.34.0',
 'gettext-minimal-0.21',
 'qtbase-5.15.5',
 'libxml2-2.9.12',
 'python-3.9.9',
 'autoconf-2.69']
--8<---------------cut here---------------end--------------->8---

Somehow, it means that these packages have an high influence on all the
others.  Now, we can roughly compare with the release-manifest.scm [8],

--8<---------------cut here---------------start------------->8---
       '("bootstrap-tarballs" "gcc-toolchain" "nss-certs"
         "openssh" "emacs" "vim" "python" "guile" "guix")))
       '("coreutils" "grep" "findutils" "gawk" "make"
         #;"gcc-toolchain" "tar" "xz")))
               '("xorg-server" "xfce" "gnome" "mate" "enlightenment"
                 "openbox" "awesome" "i3-wm" "ratpoison"
                 "emacs" "emacs-exwm" "emacs-desktop-environment"
                 "xlockmore" "slock" "libreoffice"
                 "connman" "network-manager" "network-manager-applet"
                 "openssh" "ntp" "tor"
                 "linux-libre" "grub-hybrid"
               '("coreutils" "grep" "sed" "findutils" "diffutils" "patch"
                 "gawk" "gettext" "gzip" "xz"
                 "hello" "zlib"))))
--8<---------------cut here---------------end--------------->8---

Well, we could investigate more and play more with some graphs tools.
For instance, include all the build-system dependencies and so on.

Some list about “statistically important” packages could help for
improving the list of “essential” packages.

Although Python is great, I would like to run Guile.  Any Guile library
for manipulating graph is around?

All that to say, Guix is great! :-) And perhaps some of you have already
some Guile code for analysing graphs.  Maybe.

Well, comment or idea is welcome. :-)

1: <https://en.wikipedia.org/wiki/Network_theory#Link_analysis>
2: <https://en.wikipedia.org/wiki/Network_theory#Centrality_measures>
3: <https://en.wikipedia.org/wiki/PageRank>
4: <https://en.wikipedia.org/wiki/Eigenvector_centrality>
5: <https://networkx.org/documentation/stable/reference/algorithms/link_analysis.html>
6: <https://networkx.org/documentation/stable/reference/algorithms/centrality.html>
7: <https://en.wikipedia.org/wiki/Katz_centrality>
8: <https://git.savannah.gnu.org/cgit/guix.git/tree/etc/release-manifest.scm#n47>

Cheers,
simon


[-- Attachment #2: figure --]
[-- Type: image/png, Size: 76269 bytes --]

[-- Attachment #3: packages-to-dict.scm --]
[-- Type: application/octet-stream, Size: 1285 bytes --]

(use-modules (guix)
             (gnu)
             (srfi srfi-1)
             (ice-9 match)
             )

(define (all-packages)
  (fold-packages cons '()))

;(define %packages (take (all-packages) 10))
(define %packages (all-packages))

(define name-version
  (match-lambda
    ((? package? package)
     (string-append
      (package-name package)
      "-"
      (package-version package)))
    (other other)))

(define (dictify lst)
  (for-each (lambda (package)
              (format #t " ~s:{}, " (name-version package)))
            lst))

(define (list-inputs package)
  (append-map
   (lambda (inputs)
     (map (match-lambda
            ((_ p out ...) (match p
                             ((? package? pkg) pkg)
                             ((? origin? orig)
                              (symbol->string (gensym "orig-")))
                             (_
                              (symbol->string (gensym "else-"))))))
          (inputs package)))
   (list package-inputs
         package-native-inputs
         package-propagated-inputs)))

(format #t "dod = {~%")
(for-each (lambda (package)
            (format #t "~s: {" (name-version package))
            (dictify (list-inputs package))
            (format #t "},~%"))
          %packages)
(format #t "~%}~%")

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #4: stats.py --]
[-- Type: text/x-python, Size: 1204 bytes --]

import networkx as nx
import matplotlib
matplotlib.use('TkAgg')
import matplotlib.pyplot as plt
import numpy as np

exec(open("dod.py").read())
UG = nx.Graph(dod)
DG = nx.DiGraph(dod)

nnodes = DG.number_of_nodes()

# nx.draw(G, with_labels=True);plt.show()

metrics = [
    (UG, nx.eigenvector_centrality),
    (UG, nx.pagerank),
    (DG, nx.pagerank),
]

results = [sorted(metric(G).items(), key=lambda k: k[1], reverse=True)
           for G, metric in metrics]

x = [i+1 for i in range(nnodes)]
y = lambda rs: [v for _, v in rs]

plt.loglog(x, y(results[0]), label="eigen Undirected", color='red', linestyle='-.')
plt.loglog(x, y(results[1]), label="pagerank Undirected", color='blue', linestyle='--')

plt.loglog(x, y(results[2]), label="pagerank Directed", color='blue', linestyle='-')

plt.legend()
plt.xlabel("packages")
plt.ylabel("metric value")
plt.grid()
plt.xlim([1, nnodes+2])
plt.ylim([1e-5, 1])
plt.show()

def common(results, n):
    packages = results[0][:2*n].copy()
    for result in results[1:]:
        for name, value in packages:
            if not name in [name for name, _ in result]:
                packages.remove((name, value))
    return [name for name, _ in packages[:n]]

             reply	other threads:[~2022-12-09 17:31 UTC|newest]

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-12-09 17:29 zimoun [this message]
2022-12-10  0:19 ` Some stats about the graph of dependencies jbranso
2022-12-10 13:30   ` zimoun

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=874ju4qyd4.fsf@gmail.com \
    --to=zimon.toutoune@gmail.com \
    --cc=guix-devel@gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this external index

	https://git.savannah.gnu.org/cgit/guix.git

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.