unofficial mirror of guix-science@gnu.org 
* Conda environments and reproducibility
@ 2022-11-28 17:28 Thibault Lestang
  2022-11-28 19:45 ` Konrad Hinsen
  2022-11-28 20:46 ` Simon Tournier
  0 siblings, 2 replies; 25+ messages in thread
From: Thibault Lestang @ 2022-11-28 17:28 UTC (permalink / raw)
  To: guix-science

Hi there,

I'm new to the list so apologies if this has already been discussed
before.

I've been ruminating on reproducible builds ever since I attended the
10-year birthday event a few months ago. For me, putting /python/ and
/reproducibility/ together in the same sentence used to invariably lead
to /virtualenvs/ or /conda/ being featured in the next - suffice it to
say that part of my world view was shattered a bit (that's okay).

Things are progressively starting to make sense, but when talking about
Guix with a colleague earlier today it became apparent that my
understanding isn't exactly rock solid yet. Particularly, looking at this tweet

https://twitter.com/luispedrocoelho/status/1087685131144495104

referred to in Ludovic's article "Towards reproducible Jupyter notebooks"
(https://hpc.guix.info/blog/2019/10/towards-reproducible-jupyter-notebooks/).

The tweet says (22 Jan 2019)

-----
@luispedrocoelho
Me, 6 months ago: I am going to save this conda
environment with all the versions of all the packages so it can be
recreated later; this is Reproducible Science!

conda, today: these versions don't work together, lol.
-----

I simply can't explain how such a behavior can happen.

I understand that conda ships pre-compiled binaries. I see how that's
bad for reproducibility and provenance tracking since it's not
straightforward to know how these binaries and dependencies were
compiled. I'm assuming that, when conda saves an environment, it records
version tags and "everything else required" to pull the same binaries
later. Okay - I see how binaries could /technically/ be modified at a
later stage whilst maintaining the same version tag (provenance tracking
issue).

Is it the case that someone at Anaconda would modify some package,
keeping the same version tag and other identifiers used by conda, whilst
at the same time marking this package as incompatible with packages it
was previously compatible with?

Thanks for reading!

Thibault

-- 
Dr Thibault Lestang
Senior Research Software Engineer
Department of Aeronautics, Imperial College London



* Re: Conda environments and reproducibility
  2022-11-28 17:28 Conda environments and reproducibility Thibault Lestang
@ 2022-11-28 19:45 ` Konrad Hinsen
  2022-11-29 10:32   ` Thibault Lestang
  2022-11-28 20:46 ` Simon Tournier
  1 sibling, 1 reply; 25+ messages in thread
From: Konrad Hinsen @ 2022-11-28 19:45 UTC (permalink / raw)
  To: Thibault Lestang, guix-science

Hi Thibault,

> -----
> @luispedrocoelho
> Me, 6 months ago: I am going to save this conda
> environment with all the versions of all the packages so it can be
> recreated later; this is Reproducible Science!
>
> conda, today: these versions don't work together, lol.
> -----
>
> I simply can't explain how such a behavior can happen.

The error message is not exactly as cited. Conda doesn't claim that
these versions don't work together, it claims that it cannot find a
combination of packages known to work together and available in the
archive.

One possible reason for this is an update in conda's build
infrastructure. That's what happened to the software environment for the
reproducible research MOOC on Fun (of which I am an author). We
published the environment file, but a few months later conda could not
reconstruct it any more. They had updated the compiler infrastructure,
which requires a rebuild of all packages. But they didn't rebuild all
the versions from the past, so most older environment files became
unusable.
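
A minimal Python sketch of that failure mode, under loud assumptions: the
archive is modelled as a plain dictionary, and the package versions and
build labels are made up for illustration. It only shows why a file pinned
on versions stops resolving once the old builds are no longer offered; it
is not how conda itself is implemented.

```python
# Hypothetical, simplified archive: name -> set of (version, build) pairs
# that can still be downloaded.
archive_2018 = {
    "numpy": {("1.15.0", "old_toolchain_0")},
    "scipy": {("1.1.0", "old_toolchain_0")},
}
# Assumed state after the toolchain update: only newly built packages remain.
archive_2019 = {
    "numpy": {("1.16.0", "new_toolchain_0")},
    "scipy": {("1.2.0", "new_toolchain_0")},
}

# Environment file pinned on name and version only (made-up versions).
pinned = {"numpy": "1.15.0", "scipy": "1.1.0"}


def missing_pins(pinned, archive):
    """Return the pinned packages for which the archive offers no build."""
    missing = []
    for name, version in sorted(pinned.items()):
        versions = {v for v, _build in archive.get(name, set())}
        if version not in versions:
            missing.append(f"{name}={version}")
    return missing


print(missing_pins(pinned, archive_2018))  # []
print(missing_pins(pinned, archive_2019))  # ['numpy=1.15.0', 'scipy=1.1.0']
```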

The lesson is that packages are reproducible only if you can re-run the
construction of the entire environment, from source code. Which is what
Guix can do (though if you actually have to do this, it will be a very
long process).

There may be other causes for the conda problem cited, I don't claim to
be an authority on conda! After the MOOC experience, I have never used
conda again.

> Is it the case that someone at Anaconda would modify some package,
> keeping the same version tag and other identifiers used by conda, whilst
> at the same time marking this package as incompatible with packages it
> was previously compatible with?

That's in a way what happened in my scenario: rebuilding with a new
compilation infrastructure produces different packages that share
version numbers and tags with the prior ones.

Cheers,
  Konrad.
-- 
---------------------------------------------------------------------
Konrad Hinsen
Centre de Biophysique Moléculaire, CNRS Orléans
Synchrotron Soleil - Division Expériences
Saint Aubin - BP 48
91192 Gif sur Yvette Cedex, France
Tel. +33-1 69 35 97 15
E-Mail: konrad DOT hinsen AT cnrs DOT fr
http://dirac.cnrs-orleans.fr/~hinsen/
ORCID: https://orcid.org/0000-0003-0330-9428
Twitter: @khinsen
---------------------------------------------------------------------



* Re: Conda environments and reproducibility
  2022-11-28 17:28 Conda environments and reproducibility Thibault Lestang
  2022-11-28 19:45 ` Konrad Hinsen
@ 2022-11-28 20:46 ` Simon Tournier
  2022-11-29 10:41   ` Thibault Lestang
  1 sibling, 1 reply; 25+ messages in thread
From: Simon Tournier @ 2022-11-28 20:46 UTC (permalink / raw)
  To: Thibault Lestang, guix-science

Hi,

On Mon, 28 Nov 2022 at 17:28, Thibault Lestang <t.lestang@imperial.ac.uk> wrote:
> -----
> @luispedrocoelho
> Me, 6 months ago: I am going to save this conda
> environment with all the versions of all the packages so it can be
> recreated later; this is Reproducible Science!
>
> conda, today: these versions don't work together, lol.
> -----
>
> I simply can't explain how such a behavior can happen.

One thing is the link rot.  I do not know if it is currently estimated,
but for sure, we always underestimate it.

> I understand that conda ships pre-compiled binaries. I see how that's
> bad for reproducibility and provenance tracking since it's not
> straightforward to know how these binaries and dependencies were
> compiled. I'm assuming that, when conda saves an environment, it records
> version tags and "everything else required" to pull the same binaries
> later. Okay - I see how binaries could /technically/ be modified at a
> later stage whilst maintaining the same version tag (provenance tracking
> issue).

Aside, you are assuming the availability of such binaries. :-)

Another thing, from the old days when I used Conda, and I may be wrong,
is, I guess, the SAT solver [1].  Well, 6 months ago, you described
your environment, for instance saying:

    1.0 <= foo
    2.0 <= bar <= 3.0
    baz <= 4.0

then foo@1.1, foo@1.2 and foo@2.0 had been released in these past 6
months.  But baz <= 4.0 only works with 0.9 <= foo <= 1.2 and the
constraint on bar implies other constraints on foo and/or baz.

The worst-case complexity of SAT solving is exponential, IIRC, so really
bad for sure.  I do not know the state of the art, but I guess the
problem to solve gets worse and worse as time goes by.
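
To make the toy example above concrete, here is a minimal brute-force
sketch in Python. It is not how Conda's solver works internally; the
version lists and the compatibility rule for baz are assumptions made up
for illustration. The point is only that the number of candidate
combinations is the product of the number of versions of each package, so
it grows with every release.

```python
from itertools import product


def solve(available, user_ok, compatible):
    """Brute force: return every combination satisfying both the user's
    constraints and the inter-package compatibility rule."""
    names = sorted(available)
    solutions = []
    for combo in product(*(available[n] for n in names)):
        env = dict(zip(names, combo))
        if user_ok(env) and compatible(env):
            solutions.append(env)
    return solutions


def user_ok(env):
    # The constraints written down six months ago (versions as tuples).
    return (env["foo"] >= (1, 0)
            and (2, 0) <= env["bar"] <= (3, 0)
            and env["baz"] <= (4, 0))


def compatible(env):
    # Hypothetical rule: baz <= 4.0 only works with 0.9 <= foo <= 1.2.
    if env["baz"] <= (4, 0):
        return (0, 9) <= env["foo"] <= (1, 2)
    return True


def search_space(available):
    """Number of combinations a naive search has to look at."""
    size = 1
    for versions in available.values():
        size *= len(versions)
    return size


# Made-up repository states: six months ago, and today.
before = {"foo": [(1, 0)], "bar": [(2, 0), (2, 5)], "baz": [(3, 9), (4, 0)]}
after = {**before, "foo": [(1, 0), (1, 1), (1, 2), (2, 0)]}

print(search_space(before), len(solve(before, user_ok, compatible)))  # 4 4
print(search_space(after), len(solve(after, user_ok, compatible)))    # 16 12
```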

From my experience, you have only one solution to fight against the
time: freeze.  The question is then how or what to freeze. :-)

One way for freezing is the binary container.  Another way for freezing
is to have a “summary” capturing the whole (fixed) graph of
dependencies.  This is (usually named) the channels.scm file (guix
describe).  Then, the assumptions become:

 1. solve the link rot; tackled by Software Heritage,
 2. Linux kernel API backward compatibility,
 3. hardware compatibility,

to be able to rebuild.  If I may, here is some stuff: :-)

https://www.nature.com/articles/s41597-022-01720-9
https://simon.tournier.info/posts/2022-11-08-bluehats.html
https://simon.tournier.info/posts/2022-04-15-cafe-guix-long-term.html


Cheers,
simon

1: https://en.wikipedia.org/wiki/SAT_solver



* Re: Conda environments and reproducibility
  2022-11-28 19:45 ` Konrad Hinsen
@ 2022-11-29 10:32   ` Thibault Lestang
  2022-11-29 13:12     ` Hugo Buddelmeijer
  0 siblings, 1 reply; 25+ messages in thread
From: Thibault Lestang @ 2022-11-29 10:32 UTC (permalink / raw)
  To: Konrad Hinsen; +Cc: guix-science


Thanks for your answer Konrad.

Konrad Hinsen <konrad.hinsen@cnrs.fr> writes:

> There may be other causes for the conda problem cited, I don't claim
> to be an authority on conda! After the MOOC experience, I have never
> used conda again.

That's fair enough. Conda & pip are everywhere around me, and I'd like
to form an accurate picture of their shortcomings before mentioning
alternative approaches to people who use these tools every day!

>> Is it the case that someone at Anaconda would modify some package,
>> keeping the same version tag and other identifiers used by conda, whilst
>> at the same time marking this package as incompatible with packages it
>> was previously compatible with?
>
> That's in a way what happened in my scenario: rebuilding with a new
> compilation infrastructure produces different packages that share
> version numbers and tags with the prior ones.

Okay - this is an explanation I can understand. A better approach
would have been /not/ to overwrite existing package binaries with new
ones produced from the new infrastructure.

In other words, include whatever information is needed to fully describe
the compilation infrastructure in the conda package metadata -- and
therefore make sure that a new infrastructure produces /new/ packages.
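
A minimal sketch of that idea in Python, assuming a hypothetical
build_id helper and a made-up description of the toolchain (neither is
part of conda): the identifier is derived from everything that describes
the build, so a new infrastructure cannot produce a package with an old
identifier.

```python
import hashlib
import json


def build_id(name, version, toolchain):
    """Derive a package identifier from the version *and* the toolchain.

    `toolchain` is a hypothetical dict describing the compilation
    infrastructure (compiler, flags, ...).
    """
    payload = json.dumps(
        {"name": name, "version": version, "toolchain": toolchain},
        sort_keys=True,  # stable serialisation, independent of dict order
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:7]
    return f"{name}-{version}-h{digest}"


old = build_id("numpy", "1.15.0", {"compiler": "gcc", "compiler_version": "4.8"})
new = build_id("numpy", "1.15.0", {"compiler": "gcc", "compiler_version": "7.3"})
print(old != new)  # same source version, different toolchain, different id
```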

Best,
Thibault



* Re: Conda environments and reproducibility
  2022-11-28 20:46 ` Simon Tournier
@ 2022-11-29 10:41   ` Thibault Lestang
  2022-11-29 14:25     ` Simon Tournier
  0 siblings, 1 reply; 25+ messages in thread
From: Thibault Lestang @ 2022-11-29 10:41 UTC (permalink / raw)
  To: Simon Tournier; +Cc: guix-science


Simon Tournier <zimon.toutoune@gmail.com> writes:

> On Mon, 28 Nov 2022 at 17:28, Thibault Lestang <t.lestang@imperial.ac.uk> wrote:
>> -----
>> @luispedrocoelho
>> Me, 6 months ago: I am going to save this conda
>> environment with all the versions of all the packages so it can be
>> recreated later; this is Reproducible Science!
>>
>> conda, today: these versions don't work together, lol.
>> -----
>>
>> I simply can't explain how such a behavior can happen.
>
> One thing is the link rot.  I do not know if it is currently estimated,
> but for sure, we always underestimate it.

How far back do package versions go in Anaconda's archives? Are there
any guarantees? Good question.

>> I understand that conda ships pre-compiled binaries. I see how that's
>> bad for reproducibility and provenance tracking since it's not
>> straightforward to know how these binaries and dependencies were
>> compiled. I'm assuming that, when conda saves an environment, it records
>> version tags and "everything else required" to pull the same binaries
>> later. Okay - I see how binaries could /technically/ be modified at a
>> later stage whilst maintaining the same version tag (provenance tracking
>> issue).
>
> Aside, you are assuming the availability of such binaries. :-)

Yes I am - I guess that's linked to your point about link rot?
>
> Another thing, from the old days when I used Conda, and I may be wrong,
> is, I guess, the SAT solver [1].  Well, 6 months ago, you described
> your environment, for instance saying:
>
>     1.0 <= foo
>     2.0 <= bar <= 3.0
>     baz <= 4.0
>
> then foo@1.1, foo@1.2 and foo@2.0 had been released in these past 6
> months.  But baz <= 4.0 only works with 0.9 <= foo <= 1.2 and the
> constraint on bar implies other constraints on foo and/or baz.
>
> The worst-case complexity of SAT solving is exponential, IIRC, so really
> bad for sure.  I do not know the state of the art, but I guess the
> problem to solve gets worse and worse as time goes by.
>
> From my experience, you have only one solution to fight against the
> time: freeze.  The question is then how or what to freeze. :-)
>
> One way for freezing is the binary container.  Another way for freezing
> is to have a “summary” capturing the whole (fixed) graph of
> dependencies.  This is (usually named) the channels.scm file (guix
> describe).  Then, the assumptions become:
>
>  1. solve the link rot; tackled by Software Heritage,
>  2. Linux kernel API backward compatibility,
>  3. hardware compatibility,

I think the tweet above is about reproducing an environment after
effectively freezing constitutive packages and their dependencies as you
describe. They probably used something like

conda env export

Which outputs something similar to (trimmed)

name: justnumpy
channels:
  - defaults
dependencies:
  - _libgcc_mutex=0.1=main
  - _openmp_mutex=5.1=1_gnu
  - blas=1.0=mkl
  - libuuid=1.41.5=h5eee18b_0
  - mkl=2021.4.0=h06a4308_640
  - mkl-service=2.4.0=py310h7f8727e_0
  - mkl_fft=1.3.1=py310hd6ae3a3_0
  - mkl_random=1.2.2=py310h00e6091_0
  - ncurses=6.3=h5eee18b_3
  - numpy=1.23.4=py310hd5efca6_0
  - numpy-base=1.23.4=py310h8e6c178_0
  - ...
  - ...
prefix: /home/thibault/miniconda3/envs/justnumpy

> If I may, here is some stuff: :-)
>
> https://www.nature.com/articles/s41597-022-01720-9
> https://simon.tournier.info/posts/2022-11-08-bluehats.html
> https://simon.tournier.info/posts/2022-04-15-cafe-guix-long-term.html

Great stuff - thank you. Congratulations on the paper!

-- Thibault



* Re: Conda environments and reproducibility
  2022-11-29 10:32   ` Thibault Lestang
@ 2022-11-29 13:12     ` Hugo Buddelmeijer
  2022-11-29 13:39       ` Konrad Hinsen
                         ` (3 more replies)
  0 siblings, 4 replies; 25+ messages in thread
From: Hugo Buddelmeijer @ 2022-11-29 13:12 UTC (permalink / raw)
  To: Thibault Lestang; +Cc: Konrad Hinsen, guix-science


Hi Konrad, Thibault and others,

Konrad, is it perhaps possible for you to dig up this broken conda
environment file?

First, just like you all, my conclusion is that guix is the answer. The
last two paragraphs by Simon capture it succinctly. However, conda seems
to work fine for most people. It would therefore be instructive to have
concrete 'failure stories' in order to show people that conda is not enough.


On Tue, 29 Nov 2022 at 11:32, Thibault Lestang <t.lestang@imperial.ac.uk>
wrote:

> That's fair enough. Conda & pip are everywhere around me, and I'd like
> to form an accurate picture of their shortcomings before mentioning
> alternative approaches to people who use these tools every day!


I agree, let me share my perspective.

Konrad Hinsen <konrad.hinsen@cnrs.fr> writes:
> > That's in a way what happened in my scenario: rebuilding with a new
> > compilation infrastructure produces different packages that share
> > version numbers and tags with the prior ones.
>
> Okay - this is an explanation I can understand. A better approach
> would have been /not/ to overwrite existing package binaries with new
> ones produced from the new infrastructure.
>

It doesn't seem common to overwrite conda binaries. Conda takes some (not
enough?) measures to prevent the scenario Konrad describes. In particular,
the filenames include a 'hash' since conda-build 3 (~2017) [1]:

> in the past, we have had things like py27np111 in filenames. This is the
> same idea, just generalized. Since we can't readily put every possible
> constraint into the filename, we have kept the old ones, but added the hash
> as a general solution.

This hash includes information about the compiler used (~2017) [2, 3]:

> The build hash will be added to the build string if these are true for any
> dependency: [...] package uses {{ compiler() }} jinja2 function

That is, "conda env export" should contain entries like
"scipy=1.8.0=py39hee8e79c_1", where the hee8e79c should uniquely define the
dependencies 'that matter', like which compiler is used. What goes into the
hash seems rather complicated, and grows over time.
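
As a small illustration, such an entry can be split into its parts with a
few lines of Python. This is a sketch only: the assumed layout of the build
string (an optional prefix such as py39, then 'h' plus 7 hex digits, then
'_' and a build number) is inferred from the examples in this thread, and
parse_pin is a hypothetical helper, not a conda API.

```python
import re

# Assumed build-string layout: <prefix>h<7 hex digits>_<build number>
BUILD_RE = re.compile(r"(?P<prefix>.*?)h(?P<hash>[0-9a-f]{7})_(?P<number>\d+)")


def parse_pin(entry):
    """Split a 'name=version[=build]' entry into a dict of its parts."""
    parts = entry.split("=")
    name, version = parts[0], parts[1]
    build = parts[2] if len(parts) > 2 else None
    match = BUILD_RE.fullmatch(build) if build else None
    return {
        "name": name,
        "version": version,
        "build": build,
        # None when the build string does not follow the assumed layout
        "hash": match.group("hash") if match else None,
        "number": int(match.group("number")) if match else None,
    }


print(parse_pin("scipy=1.8.0=py39hee8e79c_1")["hash"])  # ee8e79c
print(parse_pin("blas=1.0=mkl")["hash"])                # None
```

Entries without a build string, or with one that does not carry a hash,
simply come back with "hash" set to None.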


This hash is a great step forward in reproducibility. But it is too
fragile. I can't directly see how, but I can easily assume that this
dependency-hash mechanism leads to the problem that Konrad faced even when
no files are overwritten. Maybe because a new dependency resolver in conda
would have stricter rules on interoperability. (It is still possible that
files indeed were overwritten though; it was probably an incident like this
that made them change the hashes.)

My realization was that improving these hashes is a goose chase and will
ultimately lead to horrific things like "turing-complete yaml files". And
at that point it is clear, at least to me, that guix is the answer.


One thing that conda (or actually conda-forge) does well is its bots.
I'm a maintainer of some conda packages and once a month or so I get a
fully automated pull request to update my package [4], e.g. when the
upstream package is updated, or when a dependency is updated. They even
have a tracking system for migrating dependencies that are used by many
packages, such as compilers. This makes maintaining conda-forge packages a
breeze. Having such bots also within the guix-ecosystem would probably help
attract developers.

By the way, it is quite hard to use conda in guix, primarily because "conda
activate myenvironment" will try to set PS1 by calling a bash function
called 'conda'. This bash function calls the 'conda' executable, which
takes PS1, modifies it, and returns it to the bash function. The bash
function subsequently sets PS1 (and makes a backup for deactivating the
environment again). However, the conda executable is replaced by a bash
script that calls conda_real. And bash scripts eat PS1 (because it is in
non-interactive mode), so conda_real gets an empty PS1, fails to modify it,
and then the bash function sets PS1 to nothing. I've got it working
properly on my machine, but don't feel comfortable enough yet with Scheme /
guix to provide a proper patch. The simplest might be to use another shell
for the conda package (because I believe only bash eats PS1); not sure
whether that is possible in guix. And I would rather make guix packages of
everything and ditch conda altogether. But supporting conda properly would
help more people transition.

(Oh, this reminds me of the problems of activation and deactivation scripts
in conda. For another time.)

Greetings,
Hugo


[1] https://www.anaconda.com/blog/package-better-conda-build-3
[2]
https://docs.conda.io/projects/conda-build/en/stable/resources/define-metadata.html
[3]
https://github.com/conda/conda-build/blob/e4d9b3bd255565d47b6ab6b93380ef246b2a1ddf/conda_build/metadata.py#L1294
[4]
https://github.com/conda-forge/python-cpl-feedstock/pulls?q=is%3Apr+is%3Aclosed



* Re: Conda environments and reproducibility
  2022-11-29 13:12     ` Hugo Buddelmeijer
@ 2022-11-29 13:39       ` Konrad Hinsen
  2022-12-01 14:01         ` Hugo Buddelmeijer
  2022-11-29 20:10       ` Simon Tournier
                         ` (2 subsequent siblings)
  3 siblings, 1 reply; 25+ messages in thread
From: Konrad Hinsen @ 2022-11-29 13:39 UTC (permalink / raw)
  To: Hugo Buddelmeijer, Thibault Lestang; +Cc: guix-science

Hi Hugo,

Hugo Buddelmeijer <hugo@buddelmeijer.nl> writes:

> Hi Konrad, Thibault and others,
>
> Konrad, is it perhaps possible for you to dig up this broken conda
> environment file?

Yes:

   https://gist.github.com/brospars/4671d9013f0d99e1c961482dab533c57

That environment was set up in 2018 on a Linux machine, and then tested
under macOS and Windows as well. It broke in early 2019.

> First, just like you all, my conclusion is that guix is the answer. The
> last two paragraphs by Simon capture it succinctly. However, conda seems
> to work fine for most people. It would therefore be instructive to have
> concrete 'failure stories' in order to show people that conda is not enough.

I have heard many stories of conda failing long-term, i.e. environments
not being reproducible after a year or two. Most use cases are probably
more short-term.

> It doesn't seem common to overwrite conda binaries. Conda takes some (not
> enough?) measures to prevent the scenario Konrad describes. In particular,
> the filenames include a 'hash' since conda-build 3 (~2017) [1]:

Weird. We worked with official Miniconda downloads from early 2018, and
our environment files contain no hashes.

> My realization was that improving these hashes is a goose chase and will
> ultimately lead to horrific things like "turing-complete yaml files". And
> at that point it is clear, at least to me, that guix is the answer.

Indeed. Turing-complete Scheme files :-)

My conclusion so far is that conda can never attain long-term
reproducibility, because it wants to be multi-platform. And that means
that it doesn't control the foundations on which it has to build.

From a user's point of view, a big problem with conda is the opacity of
the machinery, which in addition changes all the time as you say. With
Guix, I can understand how everything is built, and thus understand the
potential obstacles to a rebuild many years later. With conda, I don't
really know and my understanding is that the build machinery is not
even completely public (for Anaconda at least).

> One thing that conda (or actually conda-forge) does well is its bots.
> I'm a maintainer of some conda packages and once a month or so I get a
> fully automated pull request to update my package [4], e.g. when the
> upstream package is updated, or when a dependency is updated. They even

That's nice!

> packages, such as compilers. This makes maintaining conda-forge packages a
> breeze. Having such bots also within the guix-ecosystem would probably help
> attract developers.

Indeed. More generally, I think package managers should do a better job
in reaching out to upstream maintainers. They are our allies in
providing a better UX.

Cheers,
  Konrad



* Re: Conda environments and reproducibility
  2022-11-29 10:41   ` Thibault Lestang
@ 2022-11-29 14:25     ` Simon Tournier
  0 siblings, 0 replies; 25+ messages in thread
From: Simon Tournier @ 2022-11-29 14:25 UTC (permalink / raw)
  To: Thibault Lestang; +Cc: guix-science

Hi Thibault,

On Tue, 29 Nov 2022 at 10:41, Thibault Lestang <t.lestang@imperial.ac.uk> wrote:

> I think the tweet above is about reproducing an environment after
> effectively freezing constitutive packages and their dependencies as you
> describe. They probably used something like
>
> conda env export
>
> Which outputs something similar to (trimmed)
>
> name: justnumpy
> channels:
>   - defaults
> dependencies:

[...]

>   - ncurses=6.3=h5eee18b_3
>   - numpy=1.23.4=py310hd5efca6_0
>   - numpy-base=1.23.4=py310h8e6c178_0
>   - ...

Do you list all the dependencies?  In other words, dependencies of
dependencies?  Is it only run-time dependencies?

Konrad pointed, (it = Conda)

                                         it claims that it cannot find a
     combination of package known to work together and available in the
     archive.

and from my understanding, I think it is because of the solver (SAT or
otherwise).  Well, for instance,

        Theorem 1 Checking whether a single package P can be installed, given a
        repository R, is NP-complete.

        https://www.mancoosi.org/edos/algorithmic/#toc15

Here (conda env export), you generated the Conda requirements using the
repository in the state R.  Then, later the repository becomes R’
(somehow it increases the number of combinations) and it does not matter
if the constraints are foo <= 1.23 or are foo=1.2.3 or are
foo=1.2.3=abcd456.

Maybe I am wrong, but from my understanding, Conda builds the graph of
dependencies by solving a combinatorial problem.  When you run,

    conda env create -f environment.yml

then Conda relies on a “dependency” solver documented here [1].  And,
IMHO, it is where it fails.  Well, if instead of ’conda env export’ you
run,

    conda list --explicit > spec-file.txt

then later and elsewhere,

    conda create --name myenv --file spec-file.txt

it should bypass the solver.  But the documentation [1] reads,

        Since the solver is not involved, the dependencies of the
        explicit package(s) are not processed at all. This can leave the
        environment in an inconsistent state, which can be fixed by
        running conda update --all, for example.

Done. :-)  Conda environments are hard, if not impossible, to reproduce
when time is flying.  It is by design, IMHO.


1: <https://conda.io/projects/conda/en/latest/dev-guide/deep-dives/solvers.html>
2: <https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html>

Cheers,
simon



* Re: Conda environments and reproducibility
  2022-11-29 13:12     ` Hugo Buddelmeijer
  2022-11-29 13:39       ` Konrad Hinsen
@ 2022-11-29 20:10       ` Simon Tournier
  2022-12-16 10:16         ` Thibault Lestang
  2022-12-02 10:52       ` Ludovic Courtès
  2022-12-02 11:05       ` Ludovic Courtès
  3 siblings, 1 reply; 25+ messages in thread
From: Simon Tournier @ 2022-11-29 20:10 UTC (permalink / raw)
  To: Hugo Buddelmeijer, Thibault Lestang; +Cc: Konrad Hinsen, guix-science

Hi Hugo, all,

On Tue, 29 Nov 2022 at 14:12, Hugo Buddelmeijer <hugo@buddelmeijer.nl> wrote:

>                                                      However, conda seems
> to work fine for most people. It would therefore be instructive to have
> concrete 'failure stories' in order to show people that conda is not enough.

Here is what I would do if I were trying to convince my colleagues that
Conda is not enough.

1. Target one or two common environments; for example,
   (Python+Numpy+Scipy+Matplotlib) for one, and (R+Seurat) for two.

2. Generate both environments following the Conda documentation.

Up to here, all should work smoothly. :-)

3. Commit the Conda files in a Git repository; for instance,

       for e in py rseurat
       do
         conda activate $e
         conda env export > environment-$e.yml
         conda list --explicit > explicit-spec-$e.txt       
         conda deactivate
       done

4.
   a) on the same machine, try to recreate the 2 environments.
   b) on another machine, idem.
   c) Commit to the Git repository how it goes.
   d) Remove the two environments and more on both machines.

5. Every new month, do #4.


Maybe it can be automated with a Cron task.  And maybe we could
collectively do this experiment.  And we could do the same with
Guix. :-)
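
A minimal Python sketch of step 4, which a Cron task could call once a
month. The helper names and the log format are hypothetical; the sketch
assumes the conda executable is on the PATH and that the YAML files come
from step 3. The `run` parameter exists only so the logic can be tried
without conda installed.

```python
import datetime
import subprocess


def recreate_commands(env_name, yaml_file):
    """Commands for one attempt: drop any stale copy, then recreate."""
    return [
        ["conda", "env", "remove", "--yes", "--name", env_name],
        ["conda", "env", "create", "--name", env_name, "--file", yaml_file],
    ]


def try_recreate(env_name, yaml_file, run=subprocess.run):
    """Attempt a rebuild and return a one-line log entry."""
    remove, create = recreate_commands(env_name, yaml_file)
    run(remove, capture_output=True, text=True)  # outcome ignored on purpose
    result = run(create, capture_output=True, text=True)
    status = "ok" if result.returncode == 0 else "FAILED"
    return f"{datetime.date.today().isoformat()} {env_name} {status}"
```

Calling try_recreate("py", "environment-py.yml") and appending the
returned line to a file tracked in the Git repository would cover steps
4a and 4c; the same call on the second machine covers 4b.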

Well, we have not spoken about running something.  We could also write a
small Python script plotting something using Numpy and/or Scipy and try
to run the Seurat vignette.

From my experience, after some months (from 2-3 to 6), Conda will fail.
Especially after an update of the system (apt upgrade), and it can be
worse with a ’dist-upgrade’. :-)
    

> On Tue, 29 Nov 2022 at 11:32, Thibault Lestang <t.lestang@imperial.ac.uk>
> wrote:
>
>> That's fair enough. Conda & pip are everywhere around me, and I'd like
>> to form an accurate picture of their shortcomings before mentioning
>> alternative approaches to people who use these tools every day!
>
> I agree, let me share my perspective.

Conda and pip work very well when we have in mind a forward view of
history.  By design, they fail when looking backward.  For engineering, they are
very efficient and personally I would rely on them **if** I had some
systems to maintain only caring about upgrading them.  Well, Conda, pip
or some other distro package manager.

The troubles are when you try to restore the past.  The 10 Years
Challenge [1] provides very good examples.  This report [2] (in French,
but an English version is probably around) provides very good insights,
IMHO, about the limitations of classical package managers (such as Debian,
Conda, pip, etc.)

For what my biased opinion is worth, many shortcomings are around. :-)
For instance, this paper [3] points out that the reproduction was «so
time-consuming and resulted in only 11 out of 28 (39%) figure panels
conveying the same information».  Well, for sure it is hard to know if
the students tried hard or not–and the paper does not speak much about
the computational environment.

(Well, aside the transparency of the computational stack that Conda
barely provides, but that’s another story. :-))

1: <https://www.nature.com/articles/d41586-020-02462-7>
2: <https://hpc.guix.info/static/videos/atelier-reproductibilit%C3%A9-2021/arnaud-legrand.webm>
3: <https://doi.org/10.1371/journal.pcbi.1010615>


> That is, "conda env export" should contain entries like
> "scipy=1.8.0=py39hee8e79c_1", where the hee8e79c should uniquely define the
> dependencies 'that matter', like which compiler is used. What goes into the
> hash seems rather complicated, and grows over time.
>
> This hash is a great step forward in reproducibility. But it is too
> fragile. I can't directly see how, but I can easily assume that this
> dependency-hash mechanism leads to the problem that Konrad faced even when
> no files are overwritten. Maybe because a new dependency resolver in conda
> would have stricter rules on interoperability. (It is still possible that
> files indeed were overwritten though; it was probably an incident like this
> that made them change the hashes.)

Well, I think the Conda documentation [4] about the dependency solver
puts some warnings around this explicit mechanism.  It has been a long
time since I looked at Conda, but from my understanding of the
solver documentation, this “failure” reported by Konrad appears to me
expected, by design of Conda. ;-)

If the solver tries to satisfy many constraints, then the problem gets
more complex as time goes on.  So, Conda probably fails to find a
working combination.

If the solver is bypassed, then there is no guarantee that the generated
state is a working computational environment.  Conda recommends
updating in order to fix the potential issues.

4: <https://conda.io/projects/conda/en/latest/dev-guide/deep-dives/solvers.html>


> One thing that conda (or actually conda-forge) does well is its bots.
> I'm a maintainer of some conda packages and once a month or so I get a
> fully automated pull request to update my package [4], e.g. when the
> upstream package is updated, or when a dependency is updated. They even
> have a tracking system for migrating dependencies that are used by many
> packages, such as compilers. This makes maintaining conda-forge packages a
> breeze. Having such bots also within the guix-ecosystem would probably help
> attract developers.

Cool!  Do you know if the code of these bots is available?


> By the way, it is quite hard to use conda in guix,

Maybe you could open bugs and/or report on help-guix or guix-devel the
annoyance you are observing.  For instance, I fully removed Conda from
my toolbox so I never hit annoyance. ;-)


Cheers,
simon



* Re: Conda environments and reproducibility
  2022-11-29 13:39       ` Konrad Hinsen
@ 2022-12-01 14:01         ` Hugo Buddelmeijer
  2022-12-02 13:01           ` Konrad Hinsen
  0 siblings, 1 reply; 25+ messages in thread
From: Hugo Buddelmeijer @ 2022-12-01 14:01 UTC (permalink / raw)
  To: Konrad Hinsen; +Cc: Thibault Lestang, guix-science

Thanks Konrad,

On Tue, 29 Nov 2022 at 14:39, Konrad Hinsen <konrad.hinsen@cnrs.fr> wrote:

>
>  Buddelmeijer <hugo@buddelmeijer.nl> writes:
>
> > Hi Konrad, Thibault and others,
> >
> > Konrad, is it perhaps possible for you to dig up this broken conda
> > environment file?
>
> Yes:
>
>    https://gist.github.com/brospars/4671d9013f0d99e1c961482dab533c57
>
> That environment was set up in 2018 on a Linux machine, and then tested
> under macOS and Windows as well. It broke in early 2019.
>

Thanks. Those dependencies indeed do not contain the hashes, so it was
probably created with "conda env export --no-builds".

I think such a file without build hashes would probably be what you want
when you are giving a course, because it would allow students to install
these exact versions of the packages, but built for their specific
environment (e.g. Linux / macOS / Windows). It would provide limited
reproducibility in the future, as you noticed. I guess you'd want three
sets of environment files for a conda environment for a course:

1. With unpinned dependencies, so just "scipy", whenever possible. That
way, you'd get the latest versions when rerunning the course. This requires
frequent updates to the files to restrict/pin dependencies when necessary,
e.g. "scipy<=1.8.0". This would be equivalent to a guix manifest file
without any channel information.
2. With dependencies pinned just on version, "scipy=1.8.0", like the one
you shared. This should allow you to get equivalent stacks on different
environments. Guix does not really have an equivalent, by design, since it
is not multi-platform. Although I suppose one could create a channel with
many different versions of packages; then the manifest should specify the
ones used.
3. With dependencies pinned on build hash, "scipy=1.8.0=py39hee8e79c_1".
This should give you the exact same binaries every time. Roughly equivalent
to a guix manifest with a channel file. But guix is still better, because
its dependency graph is based on source code, which is easier to archive,
so less chance of missing binaries (and more determinism).

Guix differentiates between scenarios 1 and 3 more cleanly, by having a
clean separation between the manifest and the channels.
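
The three pinning levels above can be told apart mechanically. Here is a
minimal Python sketch, assuming only the "name", "name=version" and
"name=version=build" forms shown in the examples; `pin_level` is a
hypothetical helper for illustration, not part of conda:

```python
def pin_level(spec: str) -> str:
    """Classify a conda dependency string by how tightly it is pinned.

    Assumption: only the forms used in this thread are handled, i.e.
    "scipy", "scipy<=1.8.0", "scipy=1.8.0" and "scipy=1.8.0=py39hee8e79c_1".
    """
    # Range constraints restrict the version but do not pin it.
    if any(operator in spec for operator in ("<", ">", "!")):
        return "unpinned (restricted)"
    # Drop empty fields so that "scipy==1.8.0" counts as a version pin.
    fields = [field for field in spec.strip().split("=") if field]
    if len(fields) == 1:
        return "unpinned"
    if len(fields) == 2:
        return "version"
    return "version+build"


for example in ("scipy", "scipy<=1.8.0", "scipy=1.8.0",
                "scipy=1.8.0=py39hee8e79c_1"):
    print(f"{example:30} -> {pin_level(example)}")
```

Only the last level names a specific binary; the first two leave the
choice of build (and, for the first, of version) to whatever the index
offers at install time.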

(Let's ignore the pip packages in the conda environment file for now.)

> > It doesn't seem common to overwrite conda binaries. Conda takes some (not
> > enough?) measures to prevent the scenario Konrad describes. In particular,
> > the filenames include a 'hash' since conda 3 (~2014) [1]:
>
> Weird. We worked with official Miniconda downloads from early 2018, and
> our environment files contain no hashes.
>

Probably due to "--no-builds" in "conda env export", or maybe the default
was different back then.


> My conclusion so far is that conda can never attain long-term
> reproducibility, because it wants to be multi-platform. And that means
> that it doesn't control the foundations on which it has to build.
>

Perhaps this is the right moment. I started using conda when my colleagues
and I used many different environments: Linux, Windows, Mac, and
different versions thereof. Back then, Anaconda was great, because it was
very hard to install everything otherwise.

However, nowadays everyone can run Linux, either directly, or through WSL
(Windows Subsystem for Linux), or through containers. And everyone knows
how to do this, and it is integrated in IDEs and such. So conda isn't
really necessary anymore.

> From a user's point of view, a big problem with conda is the opacity of
> the machinery, which in addition changes all the time as you say. With
> Guix, I can understand how everything is built, and thus understand the
> potential obstacles to a rebuild many years later. With conda, I don't
> really know and my understanding is that the build machinery is not
> even completely public (for Anaconda at least).
>

I agree with you on a philosophical level; ultimately understanding
everything would be easier with guix. But we aren't there yet, I don't
understand most of the guix packages I've looked at. That is probably
because my guile/scheme skills are lacking.

Cheers,
Hugo

* Re: Conda environments and reproducibility
  2022-11-29 13:12     ` Hugo Buddelmeijer
  2022-11-29 13:39       ` Konrad Hinsen
  2022-11-29 20:10       ` Simon Tournier
@ 2022-12-02 10:52       ` Ludovic Courtès
  2022-12-02 11:05       ` Ludovic Courtès
  3 siblings, 0 replies; 25+ messages in thread
From: Ludovic Courtès @ 2022-12-02 10:52 UTC (permalink / raw)
  To: Hugo Buddelmeijer; +Cc: Thibault Lestang, Konrad Hinsen, guix-science

Hi Hugo,

Hugo Buddelmeijer <hugo@buddelmeijer.nl> skribis:

> By the way, it is quite hard to use conda in guix, primarily because "conda
> activate myenvironment" will try to set PS1 by calling a bash function
> called 'conda'. This bash function calls the 'conda' executable, which
> takes PS1, modifies it, and returns it to the bash function. The bash
> function subsequently sets PS1 (and makes a backup for deactivating the
> environment again). However, the conda executable is replaced by a bash
> script that calls conda_real. And bash scripts eat PS1 (because it is in
> non-interactive mode), so conda_real gets an empty PS1, fails to modify it,
> and then the bash function sets PS1 to nothing. I've got it working
> properly on my machine, but don't feel comfortable enough yet with Scheme /
> guix to provide a proper patch. The simplest might be to use another shell
> for the conda package (because I believe only bash eats PS1); not sure
> whether that is possible in guix. And I would rather make guix packages of
> everything and ditch conda altogether. But supporting conda properly would
> help more people transition.

Could you email it to bug-guix@gnu.org so we keep track of it?

Even if you cannot provide a patch, that will be a first step towards
fixing it.

Thanks,
Ludo’.


* Re: Conda environments and reproducibility
  2022-11-29 13:12     ` Hugo Buddelmeijer
                         ` (2 preceding siblings ...)
  2022-12-02 10:52       ` Ludovic Courtès
@ 2022-12-02 11:05       ` Ludovic Courtès
  2022-12-02 13:59         ` Simon Tournier
  2022-12-02 14:06         ` Hugo Buddelmeijer
  3 siblings, 2 replies; 25+ messages in thread
From: Ludovic Courtès @ 2022-12-02 11:05 UTC (permalink / raw)
  To: Hugo Buddelmeijer; +Cc: Thibault Lestang, Konrad Hinsen, guix-science

Hi,

I read this thread with interest—great to have first-hand feedback from
Conda users and packagers who also understand Guix!

Hugo Buddelmeijer <hugo@buddelmeijer.nl> skribis:

> That is, "conda env export" should contain entries like
> "scipy=1.8.0=py39hee8e79c_1", where the hee8e79c should uniquely define the
> dependencies 'that matter', like which compiler is used. What goes into the
> hash seems rather complicated, and grows over time.

I think one source of many problems here is to think that there are
dependencies that do not matter.  Another one, which those hashes appear
to address, is to think that a name/version pair is enough to
unambiguously designate a software artifact.

This hash is a hash of the build result, not a hash of the input, is
that correct?

I think it would be great to have a blog post that walks through
shortcomings and concrete issues one may encounter when trying to
reproduce a software environment with Conda, contrasting it with how
Guix does things.  This would probably make more sense for people who use
Conda every day than a high-level overview of Guix.

Thanks,
Ludo’.


* Re: Conda environments and reproducibility
  2022-12-01 14:01         ` Hugo Buddelmeijer
@ 2022-12-02 13:01           ` Konrad Hinsen
  0 siblings, 0 replies; 25+ messages in thread
From: Konrad Hinsen @ 2022-12-02 13:01 UTC (permalink / raw)
  To: Hugo Buddelmeijer; +Cc: guix-science

Hi Hugo,

> Thanks. Those dependencies indeed do not contain the hashes, so it was
> probably created with "conda env export --no-builds".

I can't say, I didn't set up this environment.

> I think such a file without build hashes would probably be what you want
> when you are giving a course, because it would allow students to install
> these exact versions of the packages, but built for their specific
> environment (e.g. Linux / macOS / Windows). It would provide limited

That was exactly our objective. And we knew that in theory,
reproducibility and multi-platform are incompatible. We just hoped that
the conda approach would work long enough for the purposes of our
course. It didn't.

> reproducibility in the future, as you noticed. I guess you'd want three
> sets of environment files for a conda environment for a course:

Sounds good... in theory. In practice, we'd have to explain the reasons
for these three environment files. Which we could do at best at the end
of the course. And even then, it would have been a difficult task, as we
couldn't go into all the details in a course aimed at junior researchers
with little technical background in computing.

> However, nowadays everyone can run Linux, either directly, or through WSL
> (Windows Subsystem for Linux), or through containers. And everyone knows
> how to do this, and it is integrated in IDEs and such. So conda isn't
> really necessary anymore.

Indeed.

We are currently working on a follow-up for dealing with reproducibility
at scale (big data, complex code, long computations). We decided to give
up multi-platform, and concentrate on Linux (explaining why). The two
approaches to reproducible environment we plan to cover are

 - Docker containers from reproducible Dockerfiles, based on Debian snapshots
 - Guix

The point is that, once you accept that Docker images are acceptable
only when reproducible, Guix appears as a simplification.

> I agree with you on a philosophical level; ultimately understanding
> everything would be easier with guix. But we aren't there yet, I don't
> understand most of the guix packages I've looked at. That is probably
> because my guile/scheme skills are lacking.

Maybe not. A big part of the complexity of Guix packaging is the need to
patch most software, in order to make its build reproducible and in
order to remove tacit dependencies in the build process on FHS
conventions.

Once Guix becomes the norm in the Linux world, the next step is to
encourage software developers to develop with Guix in mind. Produce
software that doesn't require any patches to compile under Guix.
World dominance is in sight!  ;-)

Cheers,
  Konrad.
-- 
---------------------------------------------------------------------
Konrad Hinsen
Centre de Biophysique Moléculaire, CNRS Orléans
Synchrotron Soleil - Division Expériences
Saint Aubin - BP 48
91192 Gif sur Yvette Cedex, France
Tel. +33-1 69 35 97 15
E-Mail: konrad DOT hinsen AT cnrs DOT fr
http://dirac.cnrs-orleans.fr/~hinsen/
ORCID: https://orcid.org/0000-0003-0330-9428
Twitter: @khinsen
---------------------------------------------------------------------


* Re: Conda environments and reproducibility
  2022-12-02 11:05       ` Ludovic Courtès
@ 2022-12-02 13:59         ` Simon Tournier
  2022-12-02 14:06         ` Hugo Buddelmeijer
  1 sibling, 0 replies; 25+ messages in thread
From: Simon Tournier @ 2022-12-02 13:59 UTC (permalink / raw)
  To: Ludovic Courtès, Hugo Buddelmeijer
  Cc: Thibault Lestang, Konrad Hinsen, guix-science

Hi,

On Fri, 02 Dec 2022 at 12:05, Ludovic Courtès <ludo@gnu.org> wrote:
> Hugo Buddelmeijer <hugo@buddelmeijer.nl> skribis:
>
>> That is, "conda env export" should contain entries like
>> "scipy=1.8.0=py39hee8e79c_1", where the hee8e79c should uniquely define the
>> dependencies 'that matter', like which compiler is used. What goes into the
>> hash seems rather complicated, and grows over time.
>
> I think one source of many problems here is to think that there are
> dependencies that do not matter.  Another one, which those hashes appear
> to address, is to think that a name/version pair is enough to
> unambiguously designate a software artifact.
>
> This hash is a hash of the build result, not a hash of the input, is
> that correct?

Well, the official Conda documentation seems explanatory, IMHO.  For
instance,

https://conda.io/projects/conda/en/latest/dev-guide/deep-dives/solvers.html#matchspec-vs-packagerecord

From my understanding, if you go via MatchSpec then the SAT solver is
invoked.  The SAT solver tries to satisfy all the constraints and the
solution depends on the state of the index (the upstream repository).

Aside from the fact that the SAT solver can take a very long time, and
even fail if the constraints are too hard, there is no guarantee that it
will find the exact same combination of packages to install.  Having an
equality (numpy=1.23) or something else does not really change this point.
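
To make the dependence on the index concrete, here is a toy resolver in
Python.  It is only a sketch: it enumerates combinations by brute force
instead of using a SAT solver, the function names are hypothetical, and
the index snapshots and version constraints are made up for illustration
(they are not real conda-forge metadata).  The same request is solved
against three states of the index:

```python
from itertools import product


def version_key(version):
    """Turn '1.23.0' into (1, 23, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def matches(version, prefix):
    """Fuzzy match in the spirit of 'numpy=1.23'; None matches anything."""
    return prefix is None or version == prefix or version.startswith(prefix + ".")


def solve(index, request):
    """Return the newest consistent set of versions, or None if there is none.

    index:   {package: {version: {dependency: exclusive upper bound}}}
    request: {package: version prefix or None}
    """
    names = sorted(index)
    best = None
    for combo in product(*(sorted(index[name], key=version_key) for name in names)):
        chosen = dict(zip(names, combo))
        if not all(matches(chosen[name], prefix) for name, prefix in request.items()):
            continue
        consistent = all(
            version_key(chosen[dep]) < version_key(limit)
            for name in names
            for dep, limit in index[name][chosen[name]].items()
        )
        if not consistent:
            continue
        score = tuple(version_key(chosen[name]) for name in names)
        if best is None or score > best[0]:
            best = (score, chosen)
    return None if best is None else best[1]


request = {"scipy": "1.8", "numpy": None}

# Made-up snapshots of the same channel at three points in time.
index_then = {
    "numpy": {"1.22.4": {}, "1.23.0": {}},
    "scipy": {"1.8.0": {"numpy": "1.25"}},
}
index_later = {
    "numpy": {"1.22.4": {}, "1.23.0": {}, "1.24.0": {}},
    "scipy": {"1.8.0": {"numpy": "1.25"}, "1.8.1": {"numpy": "1.25"}},
}
index_much_later = {
    "numpy": {"1.24.0": {}},                  # older builds no longer listed
    "scipy": {"1.8.0": {"numpy": "1.23"}},    # constraint tightened afterwards
}

for label, index in [("then", index_then), ("later", index_later),
                     ("much later", index_much_later)]:
    print(label, "->", solve(index, request))
```

The request never changes, yet the result does, and in the last snapshot
there is no solution at all.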

Conda offers the option to be “explicit”.  And in that case, the solver
is not invoked.  Somehow, it is a way to directly deal with
PackageRecord.  Then, the Conda documentation has these warnings:

        * Explicit package installs

        Since  the  solver is  not  involved,  the dependencies  of  the
        explicit package(s) are not processed at all. This can leave the
        environment  in an  inconsistent state,  which can  be fixed  by
        running conda update --all, for example.

        * Cloning an environment

        It essentially takes the  source environment, generates the URLs
        for  each installed  packages  (filtering  conda, conda-env  and
        their   dependencies)   and  passes   the   list   of  URLs   to
        explicit(). If the source tarballs are not in the cache anymore,
        it will  query the  index for  the best  possible match  for the
        current channels. As  such, there’s a slim chance  that the copy
        is not exactly a clone of the original environment.

        https://conda.io/projects/conda/en/latest/dev-guide/deep-dives/solvers.html#early-exit-tasks


Therefore, the official Conda documentation itself explains that there
is no guarantee of reproducing an environment.
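
For readers who have never looked at an “explicit” specification, here
is a small Python sketch that reads one.  It assumes the layout produced
by "conda list --explicit" (comment lines, an @EXPLICIT marker, then one
package URL per line, optionally followed by "#" and a checksum); the two
URLs are illustrative and may not exist, and `parse_explicit` is a
hypothetical helper, not a conda API:

```python
import posixpath
from urllib.parse import urlparse

# Illustrative content only: these URLs are examples, not verified files.
EXPLICIT_FILE = """\
# platform: linux-64
@EXPLICIT
https://conda.anaconda.org/conda-forge/linux-64/scipy-1.8.0-py39hee8e79c_1.tar.bz2
https://conda.anaconda.org/conda-forge/noarch/pygments-2.11.2-pyhd8ed1ab_0.tar.bz2
"""


def parse_explicit(text):
    """Return one record per package URL listed after the @EXPLICIT marker."""
    lines = [line.strip() for line in text.splitlines()]
    lines = [line for line in lines if line and not line.startswith("#")]
    if "@EXPLICIT" not in lines:
        raise ValueError("not an explicit spec file: missing @EXPLICIT marker")
    records = []
    for line in lines[lines.index("@EXPLICIT") + 1:]:
        url = line.split("#", 1)[0]  # drop an optional trailing checksum
        filename = posixpath.basename(urlparse(url).path)
        for extension in (".tar.bz2", ".conda"):
            if filename.endswith(extension):
                filename = filename[: -len(extension)]
                break
        # File names follow <name>-<version>-<build>; the name may contain "-".
        name, version, build = filename.rsplit("-", 2)
        records.append({"name": name, "version": version,
                        "build": build, "url": url})
    return records


for record in parse_explicit(EXPLICIT_FILE):
    print(record["name"], record["version"], record["build"])
```

Nothing in such a file describes dependencies: each line is just a
tarball to fetch, which is why the solver has nothing to check.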


> I think it would be great to have a blog post that walks through
> shortcomings and concrete issues one may encounter when trying to
> reproduce a software environment with Conda, contrasting it with how
> Guix does things.  This would probably make more sense for people who use
> Conda every day than a high-level overview of Guix.

From my understanding, the main issue is that Conda works perfectly well
within a short time window (2-3 months, say!).  Within this window,
people can often reproduce.  It becomes more complicated outside it,
so it is hard to put together a demo that explains the problem. :-)

For sure, a blog post by people fluent in both Conda and Guix
would be very welcome.  Aside from the discussion about reproducibility,
even just a Rosetta Stone comparing how to do things using Conda vs Guix
would smooth the migration and encourage people to at least give Guix a
try. :-)


Cheers,
simon


* Re: Conda environments and reproducibility
  2022-12-02 11:05       ` Ludovic Courtès
  2022-12-02 13:59         ` Simon Tournier
@ 2022-12-02 14:06         ` Hugo Buddelmeijer
  1 sibling, 0 replies; 25+ messages in thread
From: Hugo Buddelmeijer @ 2022-12-02 14:06 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: Thibault Lestang, Konrad Hinsen, guix-science

Hi Ludovic,

On Fri, 2 Dec 2022 at 12:05, Ludovic Courtès <ludo@gnu.org> wrote:

> Hi,
>
> I read this thread with interest—great to have first-hand feedback from
> Conda users and packagers who also understand Guix!
>
> Hugo Buddelmeijer <hugo@buddelmeijer.nl> skribis:
>
> > That is, "conda env export" should contain entries like
> > "scipy=1.8.0=py39hee8e79c_1", where the hee8e79c should uniquely define
> the
> > dependencies 'that matter', like which compiler is used. What goes into
> the
> > hash seems rather complicated, and grows over time.
>
> I think one source of many problems here is to think that there are
> dependencies that do not matter.


In the Python world, most dependencies are runtime dependencies. Those do
not actually affect the build, or the build result, and therefore arguably
'do not matter'. (I disagree, because what matters is whether the software
runs and creates the right results.)


> Another one, which those hashes appear
> to address, is to think that a name/version pair is enough to
> unambiguously designate a software artifact.
>
> This hash is a hash of the build result, not a hash of the input, is
> that correct?
>

No, this conda build hash is used to identify the build environment, not to
identify a particular package build.

The easiest way to explain is to show an example. Here is a small part of a
"conda env export" of one of my environments:
  - pybind11-abi=4=hd8ed1ab_3
  - pycodestyle=2.8.0=pyhd8ed1ab_0
  - pycosat=0.6.3=py39h3811e60_1009
  - pycparser=2.21=pyhd8ed1ab_0
  - pydocstyle=6.1.1=pyhd8ed1ab_0
  - pyerfa=2.0.0.1=py39hce5d2b2_1
  - pyflakes=2.4.0=pyhd8ed1ab_0
  - pygments=2.11.2=pyhd8ed1ab_0
  - pyopenssl=22.0.0=pyhd8ed1ab_0
  - pyqt=5.12.3=py39hf3d152e_8
  - pyqt-impl=5.12.3=py39hde8b62d_8
  - pyqt5-sip=4.19.18=py39he80948d_8
  - pyqtchart=5.12=py39h0fcd23e_8
  - pyqtwebengine=5.12.1=py39h0fcd23e_8

As you see, many packages share the "hd8ed1ab" build hash, two qt-related
packages have h0fcd23e, and some others have their own. The "hd8ed1ab" hash
is by far the most common in this environment. These "hd8ed1ab" packages
are mostly independent (with separate maintainers, etc), but are probably
all in conda-forge and probably all use the 'default' conda environment.

(The last digit/number is the build number. The "8" suggests that all
qt-packages are actually built together, even though their build hash
differs.)
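
The pieces of a build string can be pulled apart with a few lines of
Python.  This is a sketch based only on the entries listed above: it
assumes the layout "<tag>h<7 hex digits>_<build number>", and the names
`BUILD_RE` and `parse_entry` are hypothetical helpers, not conda
functions:

```python
import re
from collections import defaultdict

# Assumed layout: optional tag such as "py39", then "h" + 7 hex digits,
# then "_" and the build number.
BUILD_RE = re.compile(r"^(?P<tag>.*?)(?P<hash>h[0-9a-f]{7})_(?P<number>\d+)$")

EXPORT = """\
pybind11-abi=4=hd8ed1ab_3
pycodestyle=2.8.0=pyhd8ed1ab_0
pycosat=0.6.3=py39h3811e60_1009
pycparser=2.21=pyhd8ed1ab_0
pydocstyle=6.1.1=pyhd8ed1ab_0
pyerfa=2.0.0.1=py39hce5d2b2_1
pyflakes=2.4.0=pyhd8ed1ab_0
pygments=2.11.2=pyhd8ed1ab_0
pyopenssl=22.0.0=pyhd8ed1ab_0
pyqt=5.12.3=py39hf3d152e_8
pyqt-impl=5.12.3=py39hde8b62d_8
pyqt5-sip=4.19.18=py39he80948d_8
pyqtchart=5.12=py39h0fcd23e_8
pyqtwebengine=5.12.1=py39h0fcd23e_8
"""


def parse_entry(entry):
    """Split 'name=version=build' and decompose the build string."""
    name, version, build = entry.split("=")
    match = BUILD_RE.match(build)
    if match is None:
        raise ValueError(f"unexpected build string: {build!r}")
    return name, version, match["tag"], match["hash"], int(match["number"])


# Group package names by build hash to see which builds share one.
by_hash = defaultdict(list)
for line in EXPORT.splitlines():
    name, _version, _tag, build_hash, _number = parse_entry(line)
    by_hash[build_hash].append(name)

for build_hash, names in sorted(by_hash.items(), key=lambda item: -len(item[1])):
    print(build_hash, len(names), ", ".join(names))
```

On this excerpt, seven packages share "hd8ed1ab", two share "h0fcd23e",
and the remaining five each have their own hash.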

I don't really understand what goes into the hash. It is described on
https://docs.conda.io/projects/conda-build/en/stable/resources/define-metadata.html#build-number-and-string

The goal of these hashes is to capture which package builds will work
together. So two package builds with the same build-hash should have been
made with the same environment and thus work together.

I'm not sure how it works if the hashes are different. Maybe they are
merkle trees? So it is possible to determine whether one hash is a
'superset' of another hash. Probably not.


>
> I think it would be great to have a blog post that walks through
> shortcomings and concrete issues one may encounter when trying to
> reproduce a software environment with Conda, contrasting it with how
> Guix does things.  This would probably make more sense for people who use
> Conda every day than a high-level overview of Guix.
>

A key difference might be how to handle different combinations of versions.

E.g. you might want to use numpy 3.0 and scipy 18.0, while I want to use
numpy 6.0 and scipy 15.0 (made up numbers, but on purpose with one lower
and one greater between us). Conda and Guix solve this in fundamentally
different ways.

Conda-forge (as a project) is kinda in between conda alone and Guix, and
can kinda be seen as a linux distribution itself (sans kernel). Conda forge
is moving closer to Guix every year, including more and more dependencies,
and more shared recreate-everything moments.

Greetings,
Hugo

* Re: Conda environments and reproducibility
  2022-11-29 20:10       ` Simon Tournier
@ 2022-12-16 10:16         ` Thibault Lestang
  2023-03-11 11:05           ` Ludovic Courtès
  0 siblings, 1 reply; 25+ messages in thread
From: Thibault Lestang @ 2022-12-16 10:16 UTC (permalink / raw)
  To: guix-science; +Cc: Simon Tournier


Simon Tournier <zimon.toutoune@gmail.com> writes:

> Well, we have not spoken about running something.  We could also write a
> small Python script plotting something using Numpy and/or Scipy and try
> to run the Seurat vignette.
>
> From my experience, after some months (from 2-3 to 6), Conda will fail.
> Especially after an update of the system (apt upgrade), and it can be
> worse with a ’dist-upgrade’. :-)

Well let's see. I just set up a Gitlab repo with a weekly pipeline (re)creating
a conda env from an environment.yml spec I generated earlier this morning.

https://framagit.org/tlestang/conda-python-example

Just Python for now but anyone feel free to contribute an R env as well.

-- Thibault


* Re: Conda environments and reproducibility
  2022-12-16 10:16         ` Thibault Lestang
@ 2023-03-11 11:05           ` Ludovic Courtès
  2023-03-11 11:43             ` Simon Tournier
  0 siblings, 1 reply; 25+ messages in thread
From: Ludovic Courtès @ 2023-03-11 11:05 UTC (permalink / raw)
  To: Thibault Lestang; +Cc: guix-science, Simon Tournier

Hi Thibault,

Thibault Lestang <t.lestang@imperial.ac.uk> skribis:

> Well let's see. I just set up a Gitlab repo with a weekly pipeline (re)creating
> a conda env from an environment.yml spec I generated earlier this morning.
>
> https://framagit.org/tlestang/conda-python-example
>
> Just Python for now but anyone feel free to contribute an R env as well.

Any findings so far?  Looking at the pipelines, it seems to be all
green, right?

It’s an interesting experiment, great that you set it up!

Ludo’.


* Re: Conda environments and reproducibility
  2023-03-11 11:05           ` Ludovic Courtès
@ 2023-03-11 11:43             ` Simon Tournier
  2023-03-13 10:26               ` Lestang, Thibault
  0 siblings, 1 reply; 25+ messages in thread
From: Simon Tournier @ 2023-03-11 11:43 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: Thibault Lestang, guix-science

Hi,

On Sat, 11 Mar 2023 at 12:05, Ludovic Courtès <ludovic.courtes@inria.fr> wrote:

> It’s an interesting experiment, great that you set it up!

That's cool, isn't it? :-)

Well, the current pipeline tests exactly what we discussed: the effect
of the Conda resolver.  However, I would add two others:

 1. also use the image continuumio/miniconda3:latest
 2. install Miniconda on top of the Docker image of Debian
    unstable and run "apt update && apt upgrade"

And I expect that #2 will break first, then #1, and the current one
last.  My crystal ball tells me that we will have to wait 2 years before
the current one breaks... wait and see.

Cheers,
simon


* Re: Conda environments and reproducibility
  2023-03-11 11:43             ` Simon Tournier
@ 2023-03-13 10:26               ` Lestang, Thibault
  2023-03-13 11:00                 ` Ricardo Wurmus
  0 siblings, 1 reply; 25+ messages in thread
From: Lestang, Thibault @ 2023-03-13 10:26 UTC (permalink / raw)
  To: Simon Tournier, Ludovic Courtès; +Cc: guix-science

Ludovic Courtès <ludovic.courtes@inria.fr> writes:

> Any findings so far?  Looking at the pipelines, it seems to be all
> green, right?

Timely reply - all green until 3 days ago when the job timed out after 70min. 
However, I re-ran the job manually this morning and it succeeded within a couple of minutes. Not 
quite sure what happened but probably not related to conda. No logs available unfortunately.

If the process of reproducing the environment is going to fail at some point, I 
wonder if we could accelerate this process by defining a more complex environment. 
Any ideas?

Simon Tournier <zimon.toutoune@gmail.com> writes:

> 1. also use the image continuumio/miniconda3:latest
> 2. install Miniconda on the top of the Docker image of Debian
>   unstable and run "apt update && apt upgrade"
> 
> And I expect that #2 will break first, then #1 and last the current
> one.

Could you elaborate on this? For context the current pipeline 
pulls a pinned miniconda image then updates conda (=conda update conda=).  
Do you expect system libraries (I mean software installed through apt, not 
managed by conda) to influence the conda environment creation?  My current 
understanding is that conda brings its  own copies of these libraries without relying 
on whatever was/will be installed through other ways (e.g. apt).

Anyways very happy to set these two cases up as well.

Thibault


* Re: Conda environments and reproducibility
  2023-03-13 10:26               ` Lestang, Thibault
@ 2023-03-13 11:00                 ` Ricardo Wurmus
  2023-03-13 12:38                   ` Simon Tournier
  0 siblings, 1 reply; 25+ messages in thread
From: Ricardo Wurmus @ 2023-03-13 11:00 UTC (permalink / raw)
  To: Lestang, Thibault; +Cc: Simon Tournier, Ludovic Courtès, guix-science


"Lestang, Thibault" <t.lestang@imperial.ac.uk> writes:

> Ludovic Courtès <ludovic.courtes@inria.fr> writes:
>
>> Any findings so far?  Looking at the pipelines, it seems to be all
>> green, right?
>
> Timely reply - all green until 3 days ago when the job timed out after 70min. 
> However, I re-ran the job manually this morning and it succeeded within a couple of minutes. Not 
> quite sure what happened but probably not related to conda. No logs available unfortunately.
>
> If the process of reproducing the environment is going to fail at some point, I 
> wonder if we could accelerate this process by defining a more complex environment. 
> Any ideas?

A more complex environment would increase the chance of failure because
it increases the complexity of the challenge to the resolver.  While it
would be a useful demonstration to see the resolver fail I think it is
the least damning kind of failure.

As Simon suggests, changing the underlying system that *currently*
satisfies all the implicit assumptions that Conda artefacts contain
would likely yield a more realistic and interesting kind of failure.

>
> Simon Tournier <zimon.toutoune@gmail.com> writes:
>
>> 1. also use the image continuumio/miniconda3:latest
>> 2. install Miniconda on the top of the Docker image of Debian
>>   unstable and run "apt update && apt upgrade"
>> 
>> And I expect that #2 will break first, then #1 and last the current
>> one.
>
> Could you elaborate on this? For context the current pipeline 
> pulls a pinned miniconda image then updates conda (=conda update conda=).  
> Do you expect system libraries (I mean software installed through apt, not 
> managed by conda) to influence the conda environment creation?  My current 
> understanding is that conda brings its  own copies of these libraries without relying 
> on whatever was/will be installed through other ways (e.g. apt).

This depends on the packages.  There are packages that do link with
system libraries, and these are provided by a base image in which the
binary artefacts are built.

-- 
Ricardo


* Re: Conda environments and reproducibility
  2023-03-13 11:00                 ` Ricardo Wurmus
@ 2023-03-13 12:38                   ` Simon Tournier
  2023-03-16 10:26                     ` Ludovic Courtès
  0 siblings, 1 reply; 25+ messages in thread
From: Simon Tournier @ 2023-03-13 12:38 UTC (permalink / raw)
  To: Ricardo Wurmus, Lestang, Thibault; +Cc: Ludovic Courtès, guix-science

Hi,

On lun., 13 mars 2023 at 12:00, Ricardo Wurmus <rekado@elephly.net> wrote:

>> If the process of reproducing the environment is going to fail at some point, I 
>> wonder if we could accelerate this process by defining a more complex environment. 
>> Any ideas?

Maybe something using PyTorch or some other ML framework.


> A more complex environment would increase the chance of failure because
> it increases the complexity of the challenge to the resolver.  While it
> would be a useful demonstration to see the resolver fail I think it is
> the least damning kind of failure.

Yes, I agree the solver will be the last thing to break.  Well, from my
understanding of [1], the breakage of the Conda solver depends on the
state of their index.  Quoting [1]:

        This is where the SAT solver will act. It will use the list of MatchSpec
        objects to pick a number of PackageRecord entries from the index, thus
        building the “final state of the solved environment”. This is detailed
        later in this deep dive guide, if you need more info. 

so the more complex the environment, the harder the SAT problem
will be.  And finding the solution can be slow.  That’s why they
implemented various solvers [2].  And it is not clear to me whether [1] and
[2] always lead to the same environment.
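
As a back-of-the-envelope illustration of why a larger environment is
harder, the number of version combinations a naive enumeration would
have to consider is the product of the candidates per package.  The
counts below are made up (10 candidate builds per package), and real
solvers prune far more cleverly than this; the sketch only shows how the
raw search space grows:

```python
from math import prod


def search_space(candidates_per_package):
    """Number of combinations a naive one-version-per-package search faces."""
    return prod(candidates_per_package.values())


# Hypothetical counts: 10 candidate builds for every package.
small = {"numpy": 10, "scipy": 10}
large = {**small, "pandas": 10, "matplotlib": 10,
         "pytorch": 10, "scikit-learn": 10}

print(search_space(small))   # 2 packages
print(search_space(large))   # 6 packages
```

Going from two to six packages multiplies the number of combinations by
ten thousand in this made-up setting.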

To my knowledge, the issue is well-identified, for instance by the
Mancoosi project [3]; in short, it reads:

        4.2 Package installation is NP-Complete
        
        Theorem 1: Checking whether a single package P can be installed,
                   given a repository R, is NP-complete.

        4.2.4 Conclusions

        Despite the apparent differences, the constraint languages in DEB and
        RPM are sensibly equivalent in expressiveness, and the associated
        installation problems are both NP-complete. 

        This means that automatic package installation tools like APT, URPMI or
        SMART live dangerously on the edge of intractability, and must carefully
        apply heuristics that may be either safe (the approach advocated by
        SMART), and hence still not guaranteed to avoid intractability, or
        unsafe, thus accepting the risk of not always finding a solution when it
        exists. 

Therefore, I do not see where Conda would be different.  However, indeed
it could be hard to construct a concrete example of a failure for the
SAT solver part.  Moreover, Conda documentation reads [1],

        Explicit package installs

        These commands do not need a solver because the requested packages are
        expressed with a direct URL or path to a specific tarball. Instead of a
        MatchSpec, we already have a PackageRecord-like entity! For this to
        work, all the requested packages need to be URLs or paths. They can be
        typed in the command line or in a text file including a @EXPLICIT line. 

        Since the solver is not involved, the dependencies of the explicit
        package(s) are not processed at all. This can leave the environment in
        an inconsistent state, which can be fixed by running conda update --all,
        for example. 

        Explicit installs are taken care of by the explicit function.

For sure, the failure of Conda is by design.  And as with many things in
life, people only believe what they see with their own eyes. :-)

1: https://docs.conda.io/projects/conda/en/latest/dev-guide/deep-dives/solvers.html
2: https://conda.github.io/conda-libmamba-solver/libmamba-vs-classic/
3: https://www.mancoosi.org/edos/algorithmic/


>> Simon Tournier <zimon.toutoune@gmail.com> writes:
>>
>>> 1. also use the image continuumio/miniconda3:latest
>>> 2. install Miniconda on the top of the Docker image of Debian
>>>   unstable and run "apt update && apt upgrade"
>>> 
>>> And I expect that #2 will break first, then #1 and last the current
>>> one.
>>
>> Could you elaborate on this? For context the current pipeline 
>> pulls a pinned miniconda image then updates conda (=conda update conda=).  
>> Do you expect system libraries (I mean software installed through apt, not 
>> managed by conda) to influence the conda environment creation?  My current 
>> understanding is that conda brings its  own copies of these libraries without relying 
>> on whatever was/will be installed through other ways (e.g. apt).
>
> This depends on the packages.  There are packages that do link with
> system libraries, and these are provided by a base image in which the
> binary artefacts are built.

As Ricardo explained, sometimes Conda relies on system libraries.  Guix
makes the assumption of a compatible Linux kernel.  Conda also makes
assumptions and, to my knowledge, they are less strict about isolated
environments.

That’s why replacing the base image could also help to expose examples
where it breaks.
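
One way to see which libraries come from the base image rather than from
Conda is to look at where the dynamic linker resolves them.  Below is a
small Python sketch that parses the usual text output of "ldd" and lists
the libraries resolved outside a given environment prefix.  The sample
output, the prefix /opt/conda/envs/demo and the function name are made
up for illustration; I have not run this against a real Conda
environment.

```python
def libs_outside_prefix(ldd_output: str, env_prefix: str) -> list[str]:
    """List shared libraries that resolve to a path outside `env_prefix`."""
    prefix = env_prefix.rstrip("/") + "/"
    outside = []
    for line in ldd_output.splitlines():
        if "=>" not in line:
            continue  # e.g. the vDSO or the dynamic loader itself
        name, _, rest = line.partition("=>")
        fields = rest.split()
        # Skip "not found" entries and lines without a real path.
        if not fields or not fields[0].startswith("/"):
            continue
        path = fields[0]
        if not path.startswith(prefix):
            outside.append(f"{name.strip()} -> {path}")
    return outside


# Made-up ldd output for a hypothetical extension module.
SAMPLE = """\
    linux-vdso.so.1 (0x00007ffd1a5f2000)
    libopenblas.so.0 => /opt/conda/envs/demo/lib/libopenblas.so.0 (0x00007f0a1c000000)
    libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f0a1be00000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f0a1bc00000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f0a1c400000)
"""

if __name__ == "__main__":
    for entry in libs_outside_prefix(SAMPLE, "/opt/conda/envs/demo"):
        print(entry)
```

Anything listed by such a check comes from the base image, so it can
change when the image changes.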

As an aside, I attended a workshop on Reproducible Research last week
and talked with the developer of BenchOpt [4].  Their aim is to keep the
computational stack for some ML framework working as time passes, by
making their benchmarks evolve.  In other words, they take the opposite
direction to Guix.  If they do that, it is because it is not possible to
run the old stack again. :-)

4: https://benchopt.github.io/


It is hard to predict beforehand where Conda will break. :-)  From my
point of view, from most to least probable:

 1. because of the underlying Linux distribution base
 2. because of the SAT solver

Well, for testing #1, I propose:

 a) to also run the pipeline using continuumio/miniconda3:latest
 b) to install Conda
     i) on top of Debian
     ii) on top of Ubuntu
    and then run the script

As a corollary, it will also test #2. ;-)
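
As a rough sketch of that plan, the variants could be generated like
this in Python.  The "image" and "script" keys mirror what a GitLab CI
job uses, but the job names, the debian:unstable and ubuntu:latest tags
and the placeholder strings are my assumptions; the pinned image and the
actual script are left as placeholders on purpose.

```python
# Placeholders: the real commands live in the repository, not here.
INSTALL_CONDA = "<install Miniconda on this image>"
RUN_SCRIPT = "<run the environment-creation script>"

# (job name, base image, does the image already ship Conda?)
VARIANTS = [
    ("pinned", "<pinned miniconda image>", True),
    ("latest", "continuumio/miniconda3:latest", True),
    ("debian", "debian:unstable", False),
    ("ubuntu", "ubuntu:latest", False),
]


def build_jobs(variants):
    """Return one job description per base image."""
    jobs = {}
    for name, image, has_conda in variants:
        script = [] if has_conda else [INSTALL_CONDA]
        # Same steps everywhere, so only the base image varies.
        script += ["conda update --yes conda", RUN_SCRIPT]
        jobs[name] = {"image": image, "script": script}
    return jobs


if __name__ == "__main__":
    for name, job in build_jobs(VARIANTS).items():
        print(name, job["image"], len(job["script"]))
```

Keeping the steps identical across jobs means that a difference between
two jobs can only come from the base image.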

The current script is about Numpy; maybe using PyTorch instead would
accelerate the process.


Thanks for the discussion about that topic.  If no one beats me to it, I
will adapt .gitlab-ci.yml.  Well, do not hold your breath… holidays
first! ;-)


Cheers,
simon



* Re: Conda environments and reproducibility
  2023-03-13 12:38                   ` Simon Tournier
@ 2023-03-16 10:26                     ` Ludovic Courtès
  2023-03-16 13:40                       ` Thibault Lestang
  0 siblings, 1 reply; 25+ messages in thread
From: Ludovic Courtès @ 2023-03-16 10:26 UTC (permalink / raw)
  To: Simon Tournier; +Cc: Ricardo Wurmus, Lestang, Thibault, guix-science

Hello comrades,

“Seeing is believing” so I think we should build upon Thibault’s
experiments and on what Simon and Ricardo pointed out to write an
article showing in concrete ways in which Conda would fail to reproduce
a software environment.

That would make a good blog post on hpc.guix.info; some conferences such
as <https://acm-rep.github.io/> (too late for this edition) may also be
good opportunities for such a study.

Just sayin’!  :-)

Ludo’.



* Re: Conda environments and reproducibility
  2023-03-16 10:26                     ` Ludovic Courtès
@ 2023-03-16 13:40                       ` Thibault Lestang
  2023-04-03 15:22                         ` Simon Tournier
  0 siblings, 1 reply; 25+ messages in thread
From: Thibault Lestang @ 2023-03-16 13:40 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: Simon Tournier, Ricardo Wurmus, guix-science


Ludovic Courtès <ludovic.courtes@inria.fr> writes:

> “Seeing is believing” so I think we should build upon Thibault’s
> experiments and on what Simon and Ricardo pointed out to write an
> article showing in concrete ways in which Conda would fail to reproduce
> a software environment.

I'm all for it. 

As Ricardo and Simon made clear in previous messages, the current
experiment is only exercising the conda resolver -- whilst maintaining
the underlying system and libraries constant in time.  Which at the time
I thought was actually the interesting issue.  I'll try to find some
time in the next few days to add a couple of cases where they are
allowed to vary.
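
To spell out what "only exercising the resolver" means, here is a toy
Python sketch: the same request is resolved against two snapshots of a
package index, and the result differs even though nothing else has
changed.  The package names, versions and the resolver itself are
invented for the example; conda's real solver is far more involved.

```python
def pick(index, name, op, version):
    """Return the newest available version of `name` matching the constraint."""
    available = index.get(name, [])
    if op == "==":
        matches = [v for v in available if v == version]
    elif op == ">=":
        matches = [v for v in available if v >= version]
    else:
        raise ValueError(f"unsupported operator: {op}")
    if not matches:
        raise LookupError(f"no version of {name} satisfies {op} {version}")
    return max(matches)


def solve(index, request):
    """Resolve every requested package independently (toy model)."""
    return {name: pick(index, name, op, v) for name, (op, v) in request.items()}


# Hypothetical request: "app" is pinned, its dependency "libfoo" is not.
REQUEST = {"app": ("==", (1, 0)), "libfoo": (">=", (1, 0))}

# Two snapshots of the same hypothetical index, some months apart.
INDEX_THEN = {"app": [(1, 0)], "libfoo": [(1, 0), (1, 1)]}
INDEX_NOW = {"app": [(1, 0), (1, 1)], "libfoo": [(1, 0), (1, 1), (2, 0)]}

if __name__ == "__main__":
    print(solve(INDEX_THEN, REQUEST))
    print(solve(INDEX_NOW, REQUEST))
```

In this toy model the pinned package stays put while the unpinned one
moves with the index, which is the kind of drift the pipeline is meant
to catch.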

@Simon: Whether I actually do it or not, feel free to add/tweak the
pipelines when you're back.  You should be able to open a merge request?

As a reminder the repo currently lives at

https://framagit.org/tlestang/conda-python-example

Happy for it to be moved if there is somewhere else that is more
suitable.

-- Thibault



* Re: Conda environments and reproducibility
  2023-03-16 13:40                       ` Thibault Lestang
@ 2023-04-03 15:22                         ` Simon Tournier
  2023-04-04 12:19                           ` Thibault Lestang
  0 siblings, 1 reply; 25+ messages in thread
From: Simon Tournier @ 2023-04-03 15:22 UTC (permalink / raw)
  To: Thibault Lestang, Ludovic Courtès; +Cc: Ricardo Wurmus, guix-science

Hi Thibault,

On jeu., 16 mars 2023 at 13:40, Thibault Lestang <t.lestang@imperial.ac.uk> wrote:

> @Simon: Whether I actually do it or not, feel free to add/tweak the
> pipelines when you're back.  You should be able to open a merge request?

I am back. :-)

Nice, you already did it.  Well, I will see if, instead of the simple
Numpy example, we can add something using PyTorch – a more complicated
stack.


Cheers,
simon



* Re: Conda environments and reproducibility
  2023-04-03 15:22                         ` Simon Tournier
@ 2023-04-04 12:19                           ` Thibault Lestang
  0 siblings, 0 replies; 25+ messages in thread
From: Thibault Lestang @ 2023-04-04 12:19 UTC (permalink / raw)
  To: Simon Tournier; +Cc: guix-science


Simon Tournier <zimon.toutoune@gmail.com> writes:

> On jeu., 16 mars 2023 at 13:40, Thibault Lestang <t.lestang@imperial.ac.uk> wrote:
>
>> @Simon: Whether I actually do it or not, feel free to add/tweak the
>> pipelines when you're back.  You should be able to open a merge request?
>
> I am back. :-)
>
> Nice, you already did it.  Well, I will see if, instead of the simple
> Numpy example, we can add something using PyTorch – a more complicated
> stack.

That would be helpful, thanks.  I'm not very familiar with PyTorch or
TensorFlow and am lacking inspiration on what to add to the environment.

Hope you had a good break :)

-- Thibault




Thread overview: 25+ messages
2022-11-28 17:28 Conda environments and reproducibility Thibault Lestang
2022-11-28 19:45 ` Konrad Hinsen
2022-11-29 10:32   ` Thibault Lestang
2022-11-29 13:12     ` Hugo Buddelmeijer
2022-11-29 13:39       ` Konrad Hinsen
2022-12-01 14:01         ` Hugo Buddelmeijer
2022-12-02 13:01           ` Konrad Hinsen
2022-11-29 20:10       ` Simon Tournier
2022-12-16 10:16         ` Thibault Lestang
2023-03-11 11:05           ` Ludovic Courtès
2023-03-11 11:43             ` Simon Tournier
2023-03-13 10:26               ` Lestang, Thibault
2023-03-13 11:00                 ` Ricardo Wurmus
2023-03-13 12:38                   ` Simon Tournier
2023-03-16 10:26                     ` Ludovic Courtès
2023-03-16 13:40                       ` Thibault Lestang
2023-04-03 15:22                         ` Simon Tournier
2023-04-04 12:19                           ` Thibault Lestang
2022-12-02 10:52       ` Ludovic Courtès
2022-12-02 11:05       ` Ludovic Courtès
2022-12-02 13:59         ` Simon Tournier
2022-12-02 14:06         ` Hugo Buddelmeijer
2022-11-28 20:46 ` Simon Tournier
2022-11-29 10:41   ` Thibault Lestang
2022-11-29 14:25     ` Simon Tournier
