From: zimoun
Subject: Re: Proposal for a blog contribution on reproducible computations
Date: Thu, 9 Jan 2020 21:40:29 +0100
To: Konrad Hinsen
Cc: Guix Devel

Hi Konrad,

Thank you! It is very interesting!!

Questions are below, along with suggestions that I can send as a pull
request on GitHub. :-)

I hope it is readable: indented text is your text; non-indented text is my
questions.

Cheers,
simon

--

#+TITLE: Reproducible computations with Guix
#+STARTUP: inlineimages

* Dependencies: what it takes to run a program

Move this section title below.

    This post is about reproducible computations, so let's start with a
    computation. A short, though rather uninteresting, C program is a good
    starting point.
    It computes π in three different ways:

    #+begin_src c :tangle pi.c :eval no
    #include <math.h>
    #include <stdio.h>

    int main()
    {
        printf( "M_PI                         : %.10lf\n", M_PI);
        printf( "4 * atan(1.)                 : %.10lf\n", 4.*atan(1.));
        printf( "Leibniz' formula (four terms): %.10lf\n", 4.*(1.-1./3.+1./5.-1./7.));
        return 0;
    }
    #+end_src

Align the ':' for easier reading.

    This program uses no random element, such as a random number generator
    or parallelism. It's strictly deterministic. It is reasonable to expect
    it to produce exactly the same output, on any computer and at any point
    in time. And yet, many programs whose results /should/ be perfectly
    reproducible are in fact not. Programs using floating-point arithmetic,
    such as this short example, are particularly prone to seemingly
    inexplicable variations.

    My goal is to explain why deterministic programs often fail to be
    reproducible, and what it takes to fix this. The short answer to that
    question is "use Guix", but even though Guix provides excellent support
    for reproducibility, you still have to use it correctly, and that
    requires some understanding of what's going on. The explanation I will
    give is rather detailed, to the point of discussing parts of the Guile
    API of Guix. You should be able to follow the reasoning without knowing
    Guile, though; you will just have to believe me that the scripts I will
    show do what I claim they do. And in the end, I will provide a
    ready-to-run Guile script that will let you explore package dependencies
    right from the shell.

* Dependencies: what it takes to run a program

    One keyword in discussions of reproducibility is "dependencies". I will
    revisit the exact meaning of this term later, but to get started, I will
    define it loosely as "any software package required to run a program".
    Running the π computation shown above is normally done using something
    like

    #+begin_src sh :exports code :eval no
    gcc pi.c -o pi && ./pi
    #+end_src

The '&&' was missing. It does not work without it on my machine.
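A possible addition, to make the remark about floating-point variations
concrete. This is only a sketch and not taken from your text: the helper
names =sum_left= and =sum_right= are made up for the illustration, and it
assumes IEEE 754 double precision with the default rounding mode. It adds the
same three numbers in two different groupings, which is the kind of
regrouping a different compiler version or optimization level may perform.

#+begin_src c :eval no
#include <stdio.h>

/* Hypothetical helper: sum grouped as (a + b) + c.
   The volatile temporary keeps the compiler from regrouping the sum. */
double sum_left(double a, double b, double c)
{
    volatile double t = a + b;
    return t + c;
}

/* Hypothetical helper: the same three numbers, grouped as a + (b + c). */
double sum_right(double a, double b, double c)
{
    volatile double t = b + c;
    return a + t;
}

int main(void)
{
    double l = sum_left(0.1, 0.2, 0.3);
    double r = sum_right(0.1, 0.2, 0.3);

    /* 17 significant digits are enough to tell two doubles apart. */
    printf("(0.1 + 0.2) + 0.3 = %.17g\n", l);
    printf("0.1 + (0.2 + 0.3) = %.17g\n", r);
    printf("identical: %s\n", l == r ? "yes" : "no");
    return 0;
}
#+end_src

Under the stated assumptions the two sums differ in the last bit, although
both are "the same" sum on paper. Anything that changes the order of
operations can therefore change the printed digits.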
    C programmers know that =gcc= is a C compiler, so that's one obvious
    dependency for running our little program. But is a C compiler enough?
    That question is surprisingly difficult to answer in practice. Your
    computer is loaded with tons of software (otherwise it wouldn't be very
    useful), and you don't really know what happens behind the scenes when
    you run =gcc= or =pi=.

** Container is good

    A major element of reproducibility support in Guix is the possibility to
    run programs in well-defined environments that contain exactly the
    software packages you request, and no more. So if your program runs in
    an environment that contains only a C compiler, you can be sure it has
    no other dependencies. Let's create such an environment:

    #+begin_src sh :session C-compiler :results output :exports both
    guix environment --container --ad-hoc gcc-toolchain
    #+end_src

    #+RESULTS:

    The option =--container= ensures the best possible isolation from the
    standard environment that your system installation and user account
    provide for day-to-day work. This environment contains nothing but a C
    compiler and a shell (which you need to type in commands), and has
    access to no other files than those in the current directory.

    Side note: the option =--container= requires support from the Linux
    kernel that is not available on all systems. If it doesn't work for you,
    use =--pure= instead. It provides a less isolated environment, but it
    is usually more than good enough.

By default, I get:

--8<---------------cut here---------------start------------->8---
guix environment: error: cannot create container: unprivileged user cannot create user namespaces
guix environment: error: please set /proc/sys/kernel/unprivileged_userns_clone to "1"
--8<---------------cut here---------------end--------------->8---

Or a sentence explaining what to do.
For example, "The =--container= option requires the kernel to allow
unprivileged users to create user namespaces, i.e., as =root= just run the
command =echo 1 > /proc/sys/kernel/unprivileged_userns_clone=."

    The above command leaves me in a shell inside my environment, where I
    can now compile and run my little program:

    #+begin_src sh :session C-compiler :results output :exports both
    gcc pi.c -o pi && ./pi
    #+end_src

The '&&' is missing again. Sorry if it is just me.

    #+RESULTS:
    : M_PI                         : 3.14159265358979311600
    : 4 * atan(1.)                 : 3.14159265358979311600
    : Leibniz' formula (four terms): 2.89523809523809561028

    It works! So now I can be sure that my program has a single dependency:
    the Guix package =gcc-toolchain=. Perfectionists who want to exclude
    the possibility that my program requires a shell could run each step in
    a separate container:

    #+begin_src sh :results output :exports both
    guix environment --container --ad-hoc gcc-toolchain -- gcc pi.c -o pi
    guix environment --container --ad-hoc gcc-toolchain -- ./pi
    #+end_src

    #+RESULTS:
    : M_PI                         : 3.14159265358979311600
    : 4 * atan(1.)                 : 3.14159265358979311600
    : Leibniz' formula (four terms): 2.89523809523809561028

** Let's open the dependency hell

    Now that we know that our only dependency is =gcc-toolchain=, let's look
    at it in more detail:

    #+begin_src sh :results output :exports both
    guix show gcc-toolchain
    #+end_src

    #+RESULTS:
    #+begin_example
    name: gcc-toolchain
    version: 9.2.0
    outputs: out debug static
    systems: x86_64-linux i686-linux
    dependencies: binutils@2.32 gcc@9.2.0 glibc@2.29 ld-wrapper@0
    location: gnu/packages/commencement.scm:2532:4
    homepage: https://gcc.gnu.org/
    license: GPL 3+
    synopsis: Complete GCC tool chain for C/C++ development
    description: This package provides a complete GCC tool chain for C/C++
    + development to be installed in user profiles.  This includes GCC, as well as
    + libc (headers and binaries, plus debugging symbols in the `debug' output),
    + and Binutils.

    name: gcc-toolchain
    version: 8.3.0
    outputs: out debug static
    systems: x86_64-linux i686-linux
    dependencies: binutils@2.32 gcc@8.3.0 glibc@2.29 ld-wrapper@0
    location: gnu/packages/commencement.scm:2532:4
    homepage: https://gcc.gnu.org/
    license: GPL 3+
    synopsis: Complete GCC tool chain for C/C++ development
    description: This package provides a complete GCC tool chain for C/C++
    + development to be installed in user profiles.  This includes GCC, as well as
    + libc (headers and binaries, plus debugging symbols in the `debug' output),
    + and Binutils.

    [...]
    #+end_example

    Guix actually knows about several versions of this toolchain. We didn't
    ask for a specific one, so what we got is the first one in this list,
    which is the one with the highest version number. Let's check that this
    is true:

    #+begin_src sh :results output :exports both
    guix environment --container --ad-hoc gcc-toolchain -- gcc --version
    #+end_src

    #+RESULTS:
    : gcc (GCC) 9.2.0
    : Copyright (C) 2019 Free Software Foundation, Inc.
    : This is free software; see the source for copying conditions.  There is NO
    : warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
    :

    The output of =guix show= contains a line about dependencies. These are
    the dependencies of our dependency, and you may already have guessed
    that they will have dependencies as well. That's why reproducibility is
    such a difficult job in practice! The dependencies of
    =gcc-toolchain@9.2.0= are:

Let's use =recsel= and also teach how to filter the package output.
:-)

#+begin_src sh :results output :exports both
guix show gcc-toolchain@9.2.0 | recsel -P dependencies
#+end_src

#+RESULTS:
: binutils@2.32 gcc@9.2.0 glibc@2.29 ld-wrapper@0

    #+begin_example
    binutils@2.32 gcc@9.2.0 glibc@2.29 ld-wrapper@0
    #+end_example

    To dig deeper, we can try feeding these dependencies to =guix show=,
    one by one, in order to learn more about them:

    #+begin_src sh :results output :exports both
    guix show binutils@2.32
    #+end_src

    #+RESULTS:
    #+begin_example
    name: binutils
    version: 2.32
    outputs: out
    systems: x86_64-linux i686-linux
    dependencies:
    location: gnu/packages/base.scm:415:2
    homepage: https://www.gnu.org/software/binutils/
    license: GPL 3+
    synopsis: Binary utilities: bfd gas gprof ld
    description: GNU Binutils is a collection of tools for working with binary
    + files.  Perhaps the most notable are "ld", a linker, and "as", an assembler.
    + Other tools include programs to display binary profiling information, list the
    + strings in a binary file, and utilities for working with archives.  The "bfd"
    + library for working with executable and object formats is also included.
    #+end_example

    #+begin_src sh :results output :exports both
    exec 2>&1
    guix show gcc@9.2.0
    :
    #+end_src

    #+RESULTS:
    : guix show: error: gcc@9.2.0: package not found

    This looks a bit surprising. What's happening here is that =gcc= is
    defined as a /hidden package/ in Guix. The package is there, but it is
    hidden from package queries. There is a good reason for this: =gcc= on
    its own is rather useless; you need =gcc-toolchain= to actually use the
    compiler. But if both =gcc= and =gcc-toolchain= showed up in a search,
    that would be more confusing than helpful for most users. Hiding the
    package is a way of saying "for experts only".

    Let's take this as a sign that it's time to move on to the next level of
    Guix hacking: Guile scripts.
    Guile, an implementation of the Scheme language, is Guix' native
    language, so using Guile scripts, you get access to everything there is
    to know about Guix and its packages.

    A note in passing: the
    [[https://emacs-guix.gitlab.io/website/][emacs-guix]] package provides
    an intermediate level of Guix exploration for Emacs users. It lets you
    look at hidden packages, for example. But much of what I will show in
    the following really requires Guile scripts.

* Anatomy of a Guix package

    From the user's point of view, a package is a piece of software with a
    name and a version number that can be installed using =guix install=.
    The packager's point of view is quite a bit different. In fact, what
    users consider a package is more precisely called the package's /output/
    in Guix jargon. The package is a recipe for creating this output.

    To see how all these concepts fit together, let's look at an example of
    a package definition: =xmag=. I have chosen this package not because I
    care much about it, but because its definition is short while showcasing
    all the features I want to explain. You can access it most easily by
    typing =guix edit xmag=.
    Here is what you will see:

    #+begin_src scheme :eval no
    (package
      (name "xmag")
      (version "1.0.6")
      (source
       (origin
         (method url-fetch)
         (uri (string-append
               "mirror://xorg/individual/app/" name "-" version ".tar.gz"))
         (sha256
          (base32
           "19bsg5ykal458d52v0rvdx49v54vwxwqg8q36fdcsv9p2j8yri87"))))
      (build-system gnu-build-system)
      (arguments
       `(#:configure-flags
         (list (string-append "--with-appdefaultdir="
                              %output ,%app-defaults-dir))))
      (inputs
       `(("libxaw" ,libxaw)))
      (native-inputs
       `(("pkg-config" ,pkg-config)))
      (home-page "https://www.x.org/wiki/")
      (synopsis "Display or capture a magnified part of a X11 screen")
      (description "Xmag displays and captures a magnified snapshot of a portion
    of an X11 screen.")
      (license license:x11))
    #+end_src

Later, a package (=glibc=) is used to show that the same package can produce
different outputs, but the example above does not have an =outputs= field.

    The package definition starts with the name and version information you
    expected. Next comes =source=, which says how to obtain the source code
    and from where. It also provides a hash that makes it possible to check
    the integrity of the downloaded files. The next four items,
    =build-system=, =arguments=, =inputs=, and =native-inputs= supply the
    information required for /building/ the package, which is what creates
    its outputs. The remaining items are documentation for human
    consumption, important for other reasons but not for reproducibility, so
    I won't say any more about them.

Link to the documentation and/or the cookbook entry about Packaging.
http://guix.gnu.org/manual/devel/en/html_node/Defining-Packages.html#Defining-Packages
http://guix.gnu.org/cookbook/en/html_node/Packaging.html#Packaging

    The example package definition has =native-inputs= in addition to
    "plain" =inputs=. There's a third variant, =propagated-inputs=, but
    =xmag= doesn't have any.
    The differences between these variants don't matter for my topic, so I
    will just refer to "inputs" from now on. Another omission I will make is
    the possibility to define several outputs for a package. This is done
    for particularly big packages, in order to reduce the footprint of
    installations, but for the purposes of reproducibility, it's OK to treat
    all outputs of a package as a single unit.

    The following figure illustrates how the various pieces of information
    from a package are used in the build process (done explicitly by =guix
    build=, or implicitly when installing or otherwise using a package):

    [[file:guix-package.svg]]

    It may help to translate the Guix jargon to the vocabulary of C
    programming:

    | Guix package | C program        |
    |--------------+------------------|
    | source code  | source code      |
    | inputs       | libraries        |
    | arguments    | compiler options |
    | build system | compiler         |
    | output       | executable       |

    Building a package can be considered a generalization of compiling a
    program. We could in fact create a "GCC build system" for Guix that
    would simply run =gcc=. However, such a build system would be of little
    practical use, since most real-life software consists of more than just
    one C source code file, and requires additional pre- or post-processing
    steps. The =gnu-build-system= used in the example is based on tools
    such as =make= and =autoconf=, in addition to =gcc=.

* Package exploration in Guile

    Guile uses a record type called =<package>= to represent packages, which is

[[https://git.savannah.gnu.org/cgit/guix.git/tree/guix/packages.scm#n249][=<package>=]]
(hyperlink). Let's spread Scheme. :-) Is syntax highlighting available on
Savannah?

    defined in module =(guix packages)=. There is also a module =(gnu
    packages)=, which contains the actual package definitions - be careful not

[[https://git.savannah.gnu.org/cgit/guix.git/tree/gnu/packages][=(gnu packages)=]]
(hyperlink).
    to confuse the two (as I always do). Here is a simple Guile script that
    shows some package information, much like the =guix show= command that I
    used earlier:

    #+begin_src scheme :results output
    (use-modules (guix packages)
                 (gnu packages))

    (define gcc-toolchain
      (specification->package "gcc-toolchain"))

    (format #t "Name   : ~a\n" (package-name gcc-toolchain))
    (format #t "Version: ~a\n" (package-version gcc-toolchain))
    (format #t "Inputs : ~a\n" (package-direct-inputs gcc-toolchain))
    #+end_src

    #+RESULTS:
    : Name   : gcc-toolchain
    : Version: 8.3.0
    : Inputs : ((gcc #) (ld-wrapper #) (binutils #) (libc #) (libc-debug # debug) (libc-static # static))

I would add something about =guix repl=. For example, "You can launch an
interactive REPL with =guix repl= and type these lines directly into it."
Also add a footnote suggesting to put

#+begin_src scheme
(use-modules (ice-9 readline)
             (ice-9 format)
             (ice-9 pretty-print))
(activate-readline)
#+end_src

in =~/.guile= to ease the REPL experience.

    This script first calls =specification->package= to look up the package
    using the same rules as the =guix= command line interface: pick the
    latest available version if none is explicitly requested. Then it
    extracts various information about the package. Note that
    =package-direct-inputs= returns the combination of =package-inputs=,
    =package-native-inputs=, and =package-propagated-inputs=. As I said
    above, I don't care about the distinction here.

    The inputs are not shown in a particularly nice form, so let's write two
    Guile functions to improve it:

    #+begin_src scheme :results output
    (use-modules (guix packages)
                 (gnu packages)
                 (ice-9 match))

    (define (package->specification package)
      (format #f "~a@~a"
              (package-name package)
              (package-version package)))

    (define (input->specification input)
      (match input
        ((label (? package? package) .
                _)
         (package->specification package))
        (other-item
         (format #f "~a" other-item))))

    (define gcc-toolchain
      (specification->package "gcc-toolchain"))

    (format #t "Package: ~a\n" (package->specification gcc-toolchain))
    (format #t "Inputs : ~a\n" (map input->specification
                                    (package-direct-inputs gcc-toolchain)))
    #+end_src

    #+RESULTS:
    : Package: gcc-toolchain@8.3.0
    : Inputs : (gcc@8.3.0 ld-wrapper@0 binutils@2.31.1 glibc@2.28 glibc@2.28 glibc@2.28)

    That looks much better. As you can see from the code, a list of inputs
    is a bit more than a list of packages. It is in fact a list of labelled
    /package outputs/. That also explains why we see =glibc= three times in
    the input list: =glibc= defines three distinct outputs, all of which are
    used in =gcc-toolchain=.

It is not clear to me why =glibc= appears three times. Instead, I propose
this.

#+begin_src scheme :results output
(use-modules (guix packages)
             (gnu packages)
             (ice-9 match))

(define (package->specification package)
  (format #f "~a@~a"
          (package-name package)
          (package-version package)))

(define (input->specification input)
  (match input
    ((label (? package? package) .
            _)
     (package->specification package))
    (other-item
     (format #f "~a" other-item))))

(define gcc-toolchain
  (specification->package "gcc-toolchain"))

(format #t "Package  : ~a\n" (package->specification gcc-toolchain))
(format #t "Inputs   : ~a\n" (map input->specification
                                  (package-direct-inputs gcc-toolchain)))
(format #t "Internals: ~a\n" (map car (package-direct-inputs gcc-toolchain)))
(display "\n")

(define glibc
  (specification->package "glibc"))

(format #t "Name     : ~a\n" (package-name glibc))
(format #t "Outputs  : ~a\n" (package-outputs glibc))
#+end_src

#+RESULTS:
: Package  : gcc-toolchain@8.3.0
: Inputs   : (gcc@8.3.0 ld-wrapper@0 binutils@2.31.1 glibc@2.28 glibc@2.28 glibc@2.28)
: Internals: (gcc ld-wrapper binutils libc libc-debug libc-static)
:
: Name     : glibc
: Outputs  : (out debug static)

The =car= is not so nice, but the =Internals= line mitigates that, IMHO. The
addition does not add complexity and I hope it clarifies things, at least
for me. ;-)

    For reproducibility, all we care about is the package references. Later
    on, we will deal with much longer input lists, so as a final cleanup
    step, let's show only unique package references from the list of inputs:

    #+begin_src scheme :results output
    (use-modules (guix packages)
                 (gnu packages)
                 (srfi srfi-1)
                 (ice-9 match))

    (define (package->specification package)
      (format #f "~a@~a"
              (package-name package)
              (package-version package)))

    (define (input->specification input)
      (match input
        ((label (? package? package) .
                _)
         (package->specification package))
        (other-item
         (format #f "~a" other-item))))

    (define (unique-inputs inputs)
      (delete-duplicates
       (map input->specification inputs)))

    (define gcc-toolchain
      (specification->package "gcc-toolchain"))

    (format #t "Package: ~a\n" (package->specification gcc-toolchain))
    (format #t "Inputs : ~a\n" (unique-inputs (package-direct-inputs gcc-toolchain)))
    #+end_src

    #+RESULTS:
    : Package: gcc-toolchain@8.3.0
    : Inputs : (gcc@8.3.0 ld-wrapper@0 binutils@2.31.1 glibc@2.28)

* Dependencies

    You may have noticed the absence of the term "dependency" from the last
    two sections. There is a good reason for that: the term is used in
    somewhat different meanings, and that can create confusion. Guix jargon
    therefore avoids it.

    The figure above shows three kinds of input to the build system: source,
    inputs, and arguments. These categories reflect the packagers' point of
    view: =source= is what the authors of the software supply, =inputs= are
    other packages, and =arguments= is what the packagers themselves add to
    the build procedure. It is important to understand that from a purely
    technical point of view, there is no fundamental difference between the
    three categories. You could, for example, define a package that contains
    C source code in the build system =arguments=, but leaves =source=
    empty. This would be inconvenient, and confusing for others, so I don't
    recommend you actually do this. The three categories are important, but
    for humans, not for computers. In fact, even the build system is not
    fundamentally distinct from its inputs. You could define a
    special-purpose build system for one package, and put all the source
    code in there.
    At the level of the CPU and the computer's memory, a build process (as
    in fact /any/ computation) looks like

    [[file:computation.png]]

    It is human interpretation that decomposes this into

    [[file:data-code.png]]

    and in a next step into

    [[file:data-program-environment.png]]

    We can go on and divide the environment into operating system,
    development tools, and application software, for example, but the
    further we go in decomposing the input to a computation, the more
    arbitrary it gets.

    From this point of view, a software's dependencies consist of everything
    required to run it in addition to its source code. For a Guix package,
    the dependencies are thus,

Adding ',' after 'thus'.

    - its inputs
    - the build system arguments
    - the build system itself
    - Guix (commit)
    - the GNU/Linux operating system (kernel).

Adding (commit) and (kernel).

    In the following, I will not mention the last two items any more,
    because they are a common dependency of all Guix packages, but it's
    important not to forget about them. A change in Guix or in GNU/Linux can
    actually make a computation non-reproducible, although in practice that
    happens very rarely. Moreover, Guix is actually designed to run older
    versions of itself, as we will see later.

Hmm? The assumption is that the "GNU/Linux operating system" on which Guix
(the package manager) is running does not affect the reproducibility of the
computations. Right? In practice, the results should be the same using the
same Guix (commit) on different GNU/Linux operating systems, and as far as I
understand we are missing data (experience) to report whether it happens or
not.

However, a change in Guix can lead to completely different packages, and so
to non-reproducible computations. And in practice it happens often; e.g.,
see how many grafts Guix is doing. :-)

Well, I am not sure I correctly understand the meaning of this paragraph.
* Build systems are packages as well

    I hope that by now you have a good idea of what a package is: a recipe
    for building outputs from source and inputs, with inputs being the
    outputs of other packages. The recipe involves a build system and
    arguments supplied to it. So... what exactly is a build system? I have
    introduced it as a generalization of a compiler, which describes its
    role. But where does a build system come from in Guix?

    The ultimate answer is of course the
    [[https://git.savannah.gnu.org/cgit/guix.git/tree/guix/build-system][source code]].
    Build systems are pieces of Guile code that are part of Guix. But this
    Guile code is only a shallow layer orchestrating invocations of other
    software, such as =gcc= or =make=. And that software is defined by
    packages. So in the end, from a reproducibility point of view, we can
    replace the "build system" item in our list of dependencies by "a bundle
    of packages". In other words: more inputs.

    Before Guix can build a package, it must gather all the required
    ingredients, and that includes replacing the build system by the
    packages it represents. The resulting list of ingredients is called a
    =bag=, and we can access it using a Guile script:

    #+begin_src scheme :results output
    (use-modules (guix packages)
                 (gnu packages)
                 (srfi srfi-1)
                 (ice-9 match))

    (define (package->specification package)
      (format #f "~a@~a"
              (package-name package)
              (package-version package)))

    (define (input->specification input)
      (match input
        ((label (? package? package) . _)
         (package->specification package))
        ((label (? origin?
                origin))
         (format #f "[source code from ~a]"
                 (origin-uri origin)))
        (other-input
         (format #f "~a" other-input))))

    (define (unique-inputs inputs)
      (delete-duplicates
       (map input->specification inputs)))

    (define hello
      (specification->package "hello"))

    (format #t "Package       : ~a\n" (package->specification hello))
    (format #t "Package inputs: ~a\n" (unique-inputs (package-direct-inputs hello)))
    (format #t "Build inputs  : ~a\n"
            (unique-inputs
             (bag-direct-inputs
              (package->bag hello))))
    #+end_src

    #+RESULTS:
    : Package       : hello@2.10
    : Package inputs: ()
    : Build inputs  : ([source code from mirror://gnu/hello/hello-2.10.tar.gz] tar@1.30 gzip@1.9 bzip2@1.0.6 xz@5.2.4 file@5.33 diffutils@3.6 patch@2.7.6 findutils@4.6.0 gawk@4.2.1 sed@4.5 grep@3.1 coreutils@8.30 make@4.2.1 bash-minimal@4.4.23 ld-wrapper@0 binutils@2.31.1 gcc@5.5.0 glibc@2.28 glibc-utf8-locales@2.28)

    I have used a different example, =hello=,

[[https://git.savannah.gnu.org/cgit/guix.git/tree/gnu/packages/base.scm#n72][=hello=]]
(link to package definition)

    because for =gcc-toolchain=, there is no difference between package
    inputs and build inputs (check for yourself if you want!) My new
    example, =hello= (a short demo

[[https://hpc.guix.info/package/hello][=hello=]] (link to browser)

    program printing "Hello, world" in the language of the system
    installation), is interesting because it has no package inputs at all.
    All the build inputs except for the source code have thus been
    contributed by the build system.

    If you compare this script to the previous one that printed only the
    package inputs, you will notice two major new features. In
    =input->specification=, there is an additional case for the source code
    reference. And in the last statement, =package->bag= constructs a bag
    from the package, before =bag-direct-inputs= is called to get that bag's
    input list.
* Inputs are outputs

    I have mentioned before that one package's inputs are other packages'
    outputs, but that fact deserves a more in-depth discussion because of
    its crucial importance for reproducibility. A package is a recipe for
    building outputs from source and inputs. Since these inputs are outputs,
    they must have been built as well. Package building is therefore a
    process consisting of multiple steps. An immediate consequence is that
    any computation making use of packaged software is a multi-step
    computation as well.

    Remember the short C program computing π from the beginning of this
    post? Running that program is only the last step in a long series of
    computations. Before you can run =pi=, you must compile =pi.c=. That
    requires the package =gcc-toolchain=, which must first be built. And
    before it can be built, its inputs must be built. And so on. If you
    want the output of =pi= to be reproducible, *the whole chain of
    computations must be reproducible*, because each step can have an impact
    on the results produced by =pi=.

    So... where does this chain start? Few people write machine code these
    days, so almost all software requires some compiler or interpreter. And
    that means that for every package, there are other packages that must be
    built first. The question of how to get this chain started is known as
    the bootstrapping problem. A rough summary of the solution is that the
    chain starts on somebody else's computer, which creates a bootstrap
    seed, an ideally small package that is downloaded in precompiled form.
    See
    [[https://guix.gnu.org/blog/2019/guix-reduces-bootstrap-seed-by-50/][this post by Jan Nieuwenhuizen]]
    for details of this procedure. The bootstrap seed is not the real start
    of the chain, but as long as we can retrieve an identical copy at a
    later time, that's good enough for reproducibility.
    In fact, the reason for requiring the bootstrap seed to be small is not
    reproducibility, but inspectability: it should be possible to audit the
    seed for bugs and malware, even in the absence of source code.

** Closure of bag

    Now we are finally ready for the ultimate step in dependency analysis:
    identifying all packages on which a computation depends, right up to the
    bootstrap seed. The starting point is the list of direct inputs of the
    bag derived from a package, which we looked at in the previous script.
    For each package in that list, we must apply this same procedure,
    recursively. We don't have to write this code ourselves, because the
    function =package-closure= in Guix does that job. If you have a basic
    knowledge of Scheme, you should be able to understand its
    [[https://git.savannah.gnu.org/cgit/guix.git/tree/guix/packages.scm#n817][implementation]]
    now. Let's add it to our dependency analysis code:

    #+begin_src scheme :results output
    (use-modules (guix packages)
                 (gnu packages)
                 (srfi srfi-1)
                 (ice-9 match))

    (define (package->specification package)
      (format #f "~a@~a"
              (package-name package)
              (package-version package)))

    (define (input->specification input)
      (match input
        ((label (? package? package) . _)
         (package->specification package))
        ((label (? origin?
origin)) (format #f "[source code from ~a]" (origin-uri origin))) (other-input (format #f "~a" other-input)))) (define (unique-inputs inputs) (delete-duplicates (map input->specification inputs))) (define (length-and-list lists) (list (length lists) lists)) (define hello (specification->package "hello")) (format #t "Package : ~a\n" (package->specification hello)) (format #t "Package inputs : ~a\n" (length-and-list (unique-inputs (package-direct-inputs hello))= )) (format #t "Build inputs : ~a\n" (length-and-list (unique-inputs (bag-direct-inputs (package->bag hello))))) (format #t "Package closure: ~a\n" (length-and-list (delete-duplicates (map package->specification (package-closure (list hello)))))) #+end_src #+RESULTS: : Package : hello@2.10 : Package inputs : (0 ()) : Build inputs : (20 ([source code from mirror://gnu/hello/hello-2.10.tar.gz] tar@1.30 gzip@1.9 bzip2@1.0.6 xz@5.2.4 file@5.33 diffutils@3.6 patch@2.7.6 findutils@4.6.0 gawk@4.2.1 sed@4.5 grep@3.1 coreutils@8.30 make@4.2.1 bash-minimal@4.4.23 ld-wrapper@0 binutils@2.31.1 gcc@5.5.0 glibc@2.28 glibc-utf8-locales@2.28)) : Package closure: (62 (gzip@1.9 libstdc++-boot0@4.9.4 gcc-cross-boot0@5.5.0 m4@1.4.18 linux-libre-headers@4.14.67 gettext-boot0@0.19.8.1 bison@3.0.5 guile-bootstrap@2.0 glibc-intermediate@2.28 gcc-cross-boot0-wrapped@5.5.0 perl-boot0@5.28.0 bootstrap-binaries@0 file-boot0@5.33 findutils-boot0@4.6.0 diffutils-boot0@3.6 make-boot0@4.2.1 binutils-cross-boot0@2.31.1 ld-wrapper-boot0@0 zlib@1.2.11 libstdc++@5.5.0 ld-wrapper-boot3@0 bash-static@4.4.23 texinfo@6.5 libatomic-ops@7.6.6 pkg-config@0.29.2 gmp@6.1.2 libgc@7.6.6 libltdl@2.4.6 libunistring@0.9.10 libffi@3.2.1 guile@2.2.4 expat@2.2.6 perl@5.28.0 gettext-minimal@0.19.8.1 attr@2.4.47 libcap@2.25 acl@2.2.52 binutils-bootstrap@0 gcc-bootstrap@0 glibc-bootstrap@0 libsigsegv@2.12 lzip@1.20 ed@1.14.2 binutils@2.31.1 glibc@2.28 gcc@5.5.0 bash-minimal@4.4.23 glibc-utf8-locales@2.28 grep@3.1 coreutils@8.30 ld-wrapper@0 make@4.2.1 sed@4.5 
    : gawk@4.2.1 findutils@4.6.0 patch@2.7.6 diffutils@3.6 file@5.33
    : xz@5.2.4 bzip2@1.0.6 tar@1.30 hello@2.10))

    That's 84 packages, just for printing "Hello, world!". As promised, it

How do you obtain these 84 packages?

    includes the bootstrap seed, called =bootstrap-binaries=. It may be
    more surprising to see Perl and Python in the dependency list of
    what is a pure C program. The explanation is that the build process
    of =gcc= and =glibc= contains Perl and Python code. Considering that
    both Perl and Python are written in C and use =glibc=, this hints at
    why bootstrapping is a hard problem!

    ** Ready to analyse yourself

    As promised, here is a [[file:show-dependencies.scm][Guile script]]
    that you can download and run from the command line to do dependency
    analyses much like the ones I have shown. Just give the packages
    whose combined list of dependencies you want to analyze. For example:

    #+begin_src sh :results output :exports both
    ./show-dependencies.scm hello
    #+end_src

    #+RESULTS:
    : Packages: 1
    : hello@2.10
    : Package inputs: 0 packages
    : 
    : Build inputs: 20 packages
    : [source code from mirror://gnu/hello/hello-2.10.tar.gz]
    : bash-minimal@5.0.7 binutils@2.32 bzip2@1.0.6 coreutils@8.31
    : diffutils@3.7 file@5.33 findutils@4.6.0 gawk@5.0.1 gcc@7.4.0
    : glibc-utf8-locales@2.29 glibc@2.29 grep@3.3 gzip@1.10 ld-wrapper@0
    : make@4.2.1 patch@2.7.6 sed@4.7 tar@1.32 xz@5.2.4
    : Package closure: 84 packages
    : acl@2.2.53 attr@2.4.48 bash-minimal@5.0.7 bash-static@5.0.7
    : binutils-cross-boot0@2.32 binutils-mesboot0@2.20.1a
    : binutils-mesboot@2.20.1a binutils@2.32 bison@3.4.1
    : bootstrap-binaries@0 bootstrap-mes@0 bootstrap-mescc-tools@0.5.2
    : bzip2@1.0.6 coreutils@8.31 diffutils-boot0@3.7 diffutils-mesboot@2.7
    : diffutils@3.7 ed@1.15 expat@2.2.7 file-boot0@5.33 file@5.33
    : findutils-boot0@4.6.0 findutils@4.6.0 flex@2.6.4 gawk@5.0.1
    : gcc-core-mesboot@2.95.3 gcc-cross-boot0-wrapped@7.4.0
    : gcc-cross-boot0@7.4.0 gcc-mesboot-wrapper@4.9.4 gcc-mesboot0@2.95.3
    : gcc-mesboot1-wrapper@4.7.4
    : gcc-mesboot1@4.7.4 gcc-mesboot@4.9.4 gcc@7.4.0 gettext-boot0@0.19.8.1
    : gettext-minimal@0.20.1 glibc-headers-mesboot@2.16.0
    : glibc-intermediate@2.29 glibc-mesboot0@2.2.5 glibc-mesboot@2.16.0
    : glibc-utf8-locales@2.29 glibc@2.29 gmp@6.1.2 grep@3.3
    : guile-bootstrap@2.0 guile@2.2.6 gzip@1.10 hello@2.10
    : ld-wrapper-boot0@0 ld-wrapper-boot3@0 ld-wrapper@0
    : libatomic-ops@7.6.10 libcap@2.27 libffi@3.2.1 libgc@7.6.12
    : libltdl@2.4.6 libsigsegv@2.12 libstdc++-boot0@4.9.4 libstdc++@7.4.0
    : libunistring@0.9.10 libxml2@2.9.9 linux-libre-headers-bootstrap@0
    : linux-libre-headers@4.19.56 lzip@1.21 m4@1.4.18 make-boot0@4.2.1
    : make-mesboot0@3.80 make-mesboot@3.82 make@4.2.1 mes-boot@0.19
    : mesboot-headers@0.19 ncurses@6.1-20190609 patch@2.7.6
    : perl-boot0@5.30.0 perl@5.30.0 pkg-config@0.29.2
    : python-minimal@3.5.7 sed@4.7 tar@1.32 tcc-boot0@0.9.26-6.c004e9a
    : tcc-boot@0.9.27 texinfo@6.6 xz@5.2.4 zlib@1.2.11

    You can now easily experiment yourself, even if you are not at ease
    with Guile. For example, suppose you have a small Python script that
    plots some data using matplotlib. What are its dependencies? First
    you should check that it runs in a minimal environment:

    #+begin_src sh :results output :exports both :eval no
    guix environment --container --ad-hoc python python-matplotlib -- python my-script.py
    #+end_src

    Next, find its dependencies:

    #+begin_src sh :results output :exports both :eval no
    ./show-dependencies.scm python python-matplotlib
    #+end_src

    I won't show the output here because it is rather long - the package
    closure contains 499 packages!

    * OK, but... what are the /real/ dependencies?

    I have explained dependencies along these lines in a few seminars.
    There's one question that someone in the audience is bound to ask:
    What do the results of a computation /really/ depend on? The output
    of =hello= is ="Hello, world!"=, no matter which version of =gcc= I
    use to compile it, and no matter which version of =python= was used
    in building =glibc=.
    The package closure is a worst-case estimate: it contains everything
    that can /potentially/ influence the results, though most of it
    doesn't in practice. Unfortunately, there is no way to identify the
    dependencies that matter automatically, because answering that
    question in general (i.e. for arbitrary software) is equivalent to
    solving the [[https://en.wikipedia.org/wiki/Halting_problem][halting problem]].

    Most package managers, such as Debian's =apt= or the multi-platform
    =conda=, take a different point of view. They define the
    dependencies of a program as all packages that need to be loaded
    into memory in order to run it. They thus exclude the software that
    is required to /build/ the program and its run-time dependencies,
    but can then be discarded. Whereas Guix' definition errs on the safe
    side (its dependency list is often longer than necessary but never
    too short), the run-time-only definition is both too vast and too
    restrictive. Many run-time dependencies don't have an impact on most
    programs' results, but some build-time dependencies do.

From my point of view, an essential benefit of this "worst-case
estimate" is time travel. Because the closure is well-defined, it is
possible to restore the complete set of dependencies. That is not
possible with the other point of view, if I understand correctly.

    One important case where build-time dependencies matter is
    floating-point computations. For historical reasons, they are
    surrounded by an aura of vagueness and imprecision, which goes back
    to their early days, when many details were poorly understood and
    implementations varied a lot. Today, all computers used for
    scientific computing respect the
    [[https://en.wikipedia.org/wiki/IEEE_754][IEEE 754 standard]] that
    precisely defines how floating-point numbers are represented in
    memory and what the result of each arithmetic operation must be.
    Floating-point arithmetic is thus perfectly deterministic and even
    perfectly portable between machines, if expressed in terms of the
    operations defined by the standard. However, high-level languages
    such as C or Fortran do not allow programmers to do that. Their
    designers assume (probably correctly) that most programmers do not
    want to deal with the intricate details of rounding. Therefore they
    provide only a simplified interface to the arithmetic operations of
    IEE 754,

Missing E at IEEE.

    which incidentally also provides more liberty for code optimization
    to compiler writers. The net result is that the complete
    specification of a program's results is its source code /plus the
    compiler and the compilation options/. You thus /can/ get
    reproducible floating-point results if you include all compilation
    steps into the perimeter of your computation, at least for code
    running on a single processor. Parallel computing is a different
    story: it involves voluntarily giving up reproducibility in exchange
    for speed. Reproducibility then becomes a best-effort approach of
    limiting the collateral damage done by optimization through the
    clever design of algorithms.

This may be out of scope, and I have never read the IEEE 754 standard,
so I do not know whether this simple propagation of errors depends on
the compiler suite and/or the machine.

#+begin_src C
#include <stdio.h>

int main()
{
    double x = 0.;

    /* Add 0.1 nine times and print each partial sum with 20 decimals. */
    for (int i = 1; i < 10; i++) {
        x = x + 0.1;
        printf("(%d) x=%0.20f\n", i, x);
    }
    return 0;
}
#+end_src

Nor do I know whether the standard fixes associativity rules when no
parentheses are given, or whether that is up to the compiler.
#+begin_src C
#include <stdio.h>

int main()
{
    float x;
    float r1, r2, r3, r4;

    x = 1.0e21;

    /* The same terms, grouped in four different ways. */
    r1 = x + 1 - x + 1;
    r2 = (x + 1) - (x - 1);
    r3 = x + (1 - x) + 1;
    r4 = x + (1 - (x - 1));

    printf("x + 1 - x + 1     = %f\n", r1);
    printf("(x + 1) - (x - 1) = %f\n", r2);
    printf("x + (1 - x) + 1   = %f\n", r3);
    printf("x + (1 - (x - 1)) = %f\n", r4);
    return 0;
}
#+end_src

    * Reproducing a reproducible computation

    So far, I have explained the theory behind reproducible
    computations. The take-home message is that to be sure to get
    exactly the same results in the future, you have to use the exact
    same versions of all packages in the package closure of your
    immediate dependencies. I have also shown you how you can access
    that package closure. There is one missing piece: how do you
    actually run your program in the future, using the same environment?

    The good news is that doing this is a lot simpler than understanding
    my lengthy explanations (which is why I leave this for the end!).
    The complex dependency graphs that I have analyzed up to here are
    encoded in the Guix source code, so all you need to re-create your
    environment is the exact same version of Guix! You get that version
    using

    #+begin_src sh :results output :exports both
    guix describe
    #+end_src

    #+RESULTS:
    : Generation 15 Jan 06 2020 13:30:45 (current)
    :   guix 769b96b
    :     repository URL: https://git.savannah.gnu.org/git/guix.git
    :     branch: master
    :     commit: 769b96b62e8c09b078f73adc09fb860505920f8f

    The critical information here is the unpleasant-looking string of
    hexadecimal digits after "commit". This is all it takes to uniquely
    identify a version of Guix.
    And to re-use it in the future, all you need is Guix' time machine:

    #+begin_src sh :session reproduce-C-compiler :results output :exports both
    guix time-machine --commit=769b96b62e8c09b078f73adc09fb860505920f8f -- environment --ad-hoc gcc-toolchain
    #+end_src

    #+RESULTS:
    : 
    : Updating channel 'guix' from Git repository at 'https://git.savannah.gnu.org/git/guix.git'...

    #+begin_src sh :session reproduce-C-compiler :results output :exports both
    gcc pi.c -o pi
    ./pi
    #+end_src

    #+RESULTS:
    : 
    : pi = 3.1415926536
    : 4 * atan(1.): 3.1415926536
    : Leibniz' formula (four terms): 2.8952380952

    The time machine actually downloads the specified version of Guix
    and passes it the rest of the command line. You are running the same
    code again. Even bugs in Guix will be reproduced faithfully!

    For many practical use cases, this technique is sufficient. But
    there are two variants you should know about for more complicated
    situations:

    - If you need an environment with many packages, you should use a
      manifest rather than list the packages on the command line. See
      [[https://guix.gnu.org/manual/en/html_node/Invoking-guix-environment.html][the manual]]
      for details.

    - If you need packages from additional channels, i.e. packages that
      are not part of the official Guix distribution, you should store a
      complete channel description in a file using

      #+begin_src sh :results none :exports code
      guix describe -f channels > guix-version-for-reproduction.txt
      #+end_src

      and feed that file to the time machine:

      #+begin_src sh :session reproduce-C-compiler-2 :results output :exports both
      guix time-machine --channels=guix-version-for-reproduction.txt -- environment --ad-hoc gcc-toolchain
      #+end_src

      #+RESULTS:
      : 
      : Updating channel 'guix' from Git repository at 'https://git.savannah.gnu.org/git/guix.git'...
      #+begin_src sh :session reproduce-C-compiler-2 :results output :exports both
      gcc pi.c -o pi
      ./pi
      #+end_src

      #+RESULTS:
      : 
      : pi = 3.1415926536
      : 4 * atan(1.): 3.1415926536
      : Leibniz' formula (four terms): 2.8952380952

Last, if your colleague does not yet use Guix, then use =guix pack=
(plain tarball, Docker or Singularity image) and provide the image. For
example,

#+begin_src sh :results none :exports code
guix pack \
     -f docker \
     -C none \
     -S /bin=bin \
     -S /lib=lib \
     -S /share=share \
     -S /etc=etc \
     gcc-toolchain
#+end_src

and, knowing the Guix commit (channel), you will be able to reproduce
this container bit-for-bit in the future using =guix time-machine=.

    And now... congratulations for having survived to the end of this
    long journey! May all your computations be reproducible, with Guix.
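
To make the pack suggestion concrete, here is a minimal sketch of how the two commands could be combined. The function name rebuild_pack and the DRY_RUN switch are made up for illustration and are not part of Guix; the commit is the one reported by "guix describe" above, and the pack options are a subset of those in the example. By default the sketch only prints the command, so it can be tried on a machine without Guix.

```shell
#!/bin/sh
# Sketch only: rebuild_pack and DRY_RUN are illustrative names, not Guix features.

# Rebuild the Docker pack with the Guix revision given as first argument.
rebuild_pack() {
    commit="$1"
    # Replace the positional parameters with the full command line.
    set -- guix time-machine --commit="$commit" -- \
        pack -f docker -C none -S /bin=bin gcc-toolchain
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s\n' "$*"   # dry run: only print the command
    else
        "$@"                 # real run: requires a working Guix installation
    fi
}

# Commit taken from the "guix describe" output shown earlier.
rebuild_pack 769b96b62e8c09b078f73adc09fb860505920f8f
```

Setting DRY_RUN=0 in the environment should run the command for real. Whether the resulting image is identical bit for bit is exactly what one would then want to check, for instance by comparing checksums of two independent builds.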