* Re: Conda environments and reproducibility
2022-11-29 13:12 ` Hugo Buddelmeijer
@ 2022-11-29 13:39 ` Konrad Hinsen
2022-12-01 14:01 ` Hugo Buddelmeijer
2022-11-29 20:10 ` Simon Tournier
` (2 subsequent siblings)
3 siblings, 1 reply; 25+ messages in thread
From: Konrad Hinsen @ 2022-11-29 13:39 UTC (permalink / raw)
To: Hugo Buddelmeijer, Thibault Lestang; +Cc: guix-science
Hi Hugo,
Buddelmeijer <hugo@buddelmeijer.nl> writes:
> Hi Konrad, Thibault and others,
>
> Konrad, is it perhaps possible for you to dig up this broken conda
> environment file?
Yes:
https://gist.github.com/brospars/4671d9013f0d99e1c961482dab533c57
That environment was set up in 2018 on a Linux machine, and then tested
under macOS and Windows as well. It broke in early 2019.
> First, just like you all, my conclusion is that guix is the answer. The
> last two paragraphs by Simon captures it succinctly. However, conda seems
> to work fine for most people. It would therefore be instructive to have
> concrete 'failure stories' in order to show people that conda is not enough.
I have heard many stories of conda failing long-term, i.e. environments
not being reproducible after a year or two. Most use cases are probably
more short-term.
> It doesn't seem common to overwrite conda binaries. Conda takes some (not
> enough?) measures to prevent the scenario Konrad describes. In particular,
> the filenames include a 'hash' since conda 3 (~2014) [1]:
Weird. We worked with official Miniconda downloads from early 2018, and
our environment files contain no hashes.
> My realization was that improving these hashes is a goose chase and will
> ultimately lead to horrific things like "turing-complete yaml files". And
> at that point it is clear, at least to me, that guix is the answer.
Indeed. Turing-complete Scheme files :-)
My conclusion so far is that conda can never attain long-term
reproducibility, because it wants to be multi-platform. And that means
that it doesn't control the foundations on which it has to build.
From a user's point of view, a big problem with conda is the opacity of
the machinery, which in addition changes all the time as you say. With
Guix, I can understand how everything is built, and thus understand the
potential obstacles to a rebuild many years later. With conda, I don't
really know and my understanding is that the build machinery is not
even completely public (for Anaconda at least).
> One thing that conda (or actually conda-forge) does well is their bots.
> I'm a maintainer of some conda packages and once a month or so I get a
> fully automated pull request to update my package [4], e.g. when the
> upstream package is updated, or when a dependency is updated. They even
That's nice!
> have a tracking system for migrating dependencies that are used by many
> packages, such as compilers. This makes maintaining conda-forge packages a
> breeze. Having such bots also within the guix-ecosystem would probably help
> attract developers.
Indeed. More generally, I think package managers should do a better job
in reaching out to upstream maintainers. They are our allies in
providing a better UX.
Cheers,
Konrad
--
---------------------------------------------------------------------
Konrad Hinsen
Centre de Biophysique Moléculaire, CNRS Orléans
Synchrotron Soleil - Division Expériences
Saint Aubin - BP 48
91192 Gif sur Yvette Cedex, France
Tel. +33-1 69 35 97 15
E-Mail: konrad DOT hinsen AT cnrs DOT fr
http://dirac.cnrs-orleans.fr/~hinsen/
ORCID: https://orcid.org/0000-0003-0330-9428
Twitter: @khinsen
---------------------------------------------------------------------
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Conda environments and reproducibility
2022-11-29 13:39 ` Konrad Hinsen
@ 2022-12-01 14:01 ` Hugo Buddelmeijer
2022-12-02 13:01 ` Konrad Hinsen
0 siblings, 1 reply; 25+ messages in thread
From: Hugo Buddelmeijer @ 2022-12-01 14:01 UTC (permalink / raw)
To: Konrad Hinsen; +Cc: Thibault Lestang, guix-science
Thanks Konrad,
On Tue, 29 Nov 2022 at 14:39, Konrad Hinsen <konrad.hinsen@cnrs.fr> wrote:
>
> Buddelmeijer <hugo@buddelmeijer.nl> writes:
>
> > Hi Konrad, Thibault and others,
> >
> > Konrad, is it perhaps possible for you to dig up this broken conda
> > environment file?
>
> Yes:
>
> https://gist.github.com/brospars/4671d9013f0d99e1c961482dab533c57
>
> That environment was set up in 2018 on a Linux machine, and then tested
> under macOS and Windows as well. It broke in early 2019.
>
Thanks. Those dependencies indeed do not contain the build hashes, so the
file was probably created with "conda env export --no-builds".
I think such a file without build hashes is probably what you want
when you are giving a course, because it allows students to install
these exact versions of the packages, but built for their specific
environment (e.g. Linux / macOS / Windows). It provides only limited
reproducibility in the future, as you noticed. I guess you'd want three
sets of environment files for a conda environment for a course:
1. With unpinned dependencies, so just "scipy", whenever possible. That
way, you'd get the latest versions when rerunning the course. This requires
frequent updates to the files to restrict/pin dependencies when necessary,
e.g. "scipy<=1.8.0". This would be equivalent to a guix manifest file
without any channel information.
2. With dependencies pinned just on version, "scipy=1.8.0", like the one
you shared. This should allow you to get equivalent stacks on different
environments. Guix does not really have an equivalent, by design, since it
is not multi-platform. Although I suppose one could create a channel with
many different versions of packages; then the manifest should specify the
ones used.
3. With dependencies pinned on build hash, "scipy=1.8.0=py39hee8e79c_1".
This should give you the exact same binaries every time. Roughly equivalent
to a guix manifest with a channel file. But guix is still better, because
its dependency graph is based on source code, which is easier to archive,
so less chance of missing binaries (and more determinism).
Guix differentiates between scenarios 1 and 3 more cleanly, by having a
clean separation between the manifest and the channels.
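The three pinning levels above can be told apart mechanically. What follows is a minimal Python sketch, assuming dependency strings in the single-"=" form shown above ("name", "name=version", "name=version=build"). The names CondaSpec, parse_spec and pin_level are made up for illustration and are not part of conda. Range constraints such as "scipy<=1.8.0" are treated as scenario 1, as in the list above.

```python
import re
from typing import NamedTuple, Optional


class CondaSpec(NamedTuple):
    name: str
    version: Optional[str]  # None when the version is not pinned exactly
    build: Optional[str]    # None when no build string is given


# Hypothetical helper: matches "name", "name=version" or "name=version=build".
_EXACT = re.compile(r"([A-Za-z0-9_.\-]+)(?:=([^=]+)(?:=(.+))?)?")


def parse_spec(line: str) -> CondaSpec:
    """Parse one dependency string from an environment file."""
    text = line.strip()
    match = _EXACT.fullmatch(text)
    if match is None:
        # Anything else (e.g. "scipy<=1.8.0") is kept as a bare name here.
        name = re.match(r"[A-Za-z0-9_.\-]+", text).group(0)
        return CondaSpec(name, None, None)
    return CondaSpec(match.group(1), match.group(2), match.group(3))


def pin_level(spec: CondaSpec) -> int:
    """Return 1 (unpinned or range), 2 (version only) or 3 (version and build)."""
    if spec.build is not None:
        return 3
    if spec.version is not None:
        return 2
    return 1
```

Running pin_level over every dependency of an environment file shows at a glance which of the three scenarios the file belongs to.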
(Let's ignore the pip packages in the conda environment file for now.)
> > It doesn't seem common to overwrite conda binaries. Conda takes some (not
> > enough?) measures to prevent the scenario Konrad describes. In particular,
> > the filenames include a 'hash' since conda 3 (~2014) [1]:
>
> Weird. We worked with official Miniconda downloads from early 2018, and
> our environment files contain no hashes.
>
Probably due to "--no-builds" in "conda env export", or maybe the default
was different back then.
> My conclusion so far is that conda can never attain long-term
> reproducibility, because it wants to be multi-platform. And that means
> that it doesn't control the foundations on which it has to build.
>
Perhaps the timing is right. I started using conda when my colleagues and I
used many different environments: Linux, Windows, macOS, and
different versions thereof. Back then, Anaconda was great, because it was
very hard to install everything otherwise.
However, nowadays everyone can run Linux, either directly, through WSL
(Windows Subsystem for Linux), or through containers. And everyone knows
how to do this, and it is integrated in IDEs and such. So conda isn't
really necessary anymore.
> From a user's point of view, a big problem with conda is the opacity of
> the machinery, which in addition changes all the time as you say. With
> Guix, I can understand how everything is built, and thus understand the
> potential obstacles to a rebuild many years later. With conda, I don't
> really know and my understanding is that the build machinery is not
> even completely public (for Anaconda at least).
>
I agree with you on a philosophical level; ultimately, understanding
everything would be easier with guix. But we aren't there yet: I don't
understand most of the guix packages I've looked at. That is probably
because my guile/scheme skills are lacking.
Cheers,
Hugo
* Re: Conda environments and reproducibility
2022-12-01 14:01 ` Hugo Buddelmeijer
@ 2022-12-02 13:01 ` Konrad Hinsen
0 siblings, 0 replies; 25+ messages in thread
From: Konrad Hinsen @ 2022-12-02 13:01 UTC (permalink / raw)
To: Hugo Buddelmeijer; +Cc: guix-science
Hi Hugo,
> Thanks. Those dependencies indeed do not contain the hashes, so it is
> probably created with "conda env export --no-builds".
I can't say, I didn't set up this environment.
> I think such a file without build hashes would probably be what you want
> when you are giving a course, because it would allow students to install
> these exact versions of the packages, but build for their specific
> environment (e.g. Linux / macOS / Windows). It would provide limited
That was exactly our objective. And we knew that in theory,
reproducibility and multi-platform are incompatible. We just hoped that
the conda approach would work long enough for the purposes of our
course. It didn't.
> reproducibility in the future, as you noticed. I guess you'd want three
> sets of environment files for a conda environment for a course:
Sounds good... in theory. In practice, we'd have to explain the reasons
for these three environment files. Which we could do at best at the end
of the course. And even then, it would have been a difficult task, as we
couldn't go into all the details in a course aimed at junior researchers
with little technical background in computing.
> However, nowadays everyone can run linux, either directly, or through WSL
> (windows subsystem for linux), or through containers. And everyone knows
> how to do this, and it is integrated in IDE's and such. So conda isn't
> really necessary anymore.
Indeed.
We are currently working on a follow-up for dealing with reproducibility
at scale (big data, complex code, long computations). We decided to give
up multi-platform, and concentrate on Linux (explaining why). The two
approaches to reproducible environments we plan to cover are
- Docker containers from reproducible Dockerfiles, based on Debian snapshots
- Guix
The point is that, once you accept that Docker images are acceptable
only when reproducible, Guix appears as a simplification.
> I agree with you on a philosophical level; ultimately understanding
> everything would be easier with guix. But we aren't there yet, I don't
> understand most of the guix packages I've looked at. That is probably
> because my guile/scheme skills are lacking.
Maybe not. A big part of the complexity of Guix packaging is the need to
patch most software, in order to make its build reproducible and in
order to remove tacit dependencies in the build process on FHS
conventions.
Once Guix becomes the norm in the Linux world, the next step is to
encourage software developers to develop with Guix in mind. Produce
software that doesn't require any patches to compile under Guix.
World dominance is in sight! ;-)
Cheers,
Konrad.
* Re: Conda environments and reproducibility
2022-11-29 13:12 ` Hugo Buddelmeijer
2022-11-29 13:39 ` Konrad Hinsen
@ 2022-11-29 20:10 ` Simon Tournier
2022-12-16 10:16 ` Thibault Lestang
2022-12-02 10:52 ` Ludovic Courtès
2022-12-02 11:05 ` Ludovic Courtès
3 siblings, 1 reply; 25+ messages in thread
From: Simon Tournier @ 2022-11-29 20:10 UTC (permalink / raw)
To: Hugo Buddelmeijer, Thibault Lestang; +Cc: Konrad Hinsen, guix-science
Hi Hugo, all,
On Tue, 29 Nov 2022 at 14:12, Hugo Buddelmeijer <hugo@buddelmeijer.nl> wrote:
> However, conda seems
> to work fine for most people. It would therefore be instructive to have
> concrete 'failure stories' in order to show people that conda is not enough.
Here is what I would do if I were trying to convince my colleagues that
Conda is not enough:
1. Target one or two common environments; for example,
(Python+Numpy+Scipy+Matplotlib) for one, and (R+Seurat) for two.
2. Generate both environments following the Conda documentation.
Up to this point, everything should work smoothly. :-)
3. Commit the Conda files in a Git repository; for instance,
   # Export both environments; assumes "conda activate" works in this shell.
   for e in py rseurat
   do
       conda activate "$e"
       conda env export > "environment-$e.yml"
       conda list --explicit > "explicit-spec-$e.txt"
       conda deactivate
   done
4.
   a) On the same machine, try to recreate the 2 environments.
   b) On another machine, idem.
   c) Commit to the Git repository how it goes.
   d) Remove the two environments (and more) on both machines.
5. Every month, repeat step 4.
Maybe it can be automated with a Cron task. And maybe we could
collectively run this experiment. And we could do the same with
Guix. :-)
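To record "how it goes" in step 4c, one option is to compare the explicit spec file committed earlier with the one exported after recreating the environment. Below is a minimal Python sketch, assuming that the files written by "conda list --explicit" contain comment lines starting with "#", an "@EXPLICIT" marker line, and one package URL per line. The function names are made up for illustration.

```python
from typing import List, Set, Tuple


def explicit_urls(text: str) -> Set[str]:
    """Return the package URLs listed in the text of an explicit spec file."""
    urls = set()
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments and the @EXPLICIT marker.
        if not line or line.startswith("#") or line.startswith("@"):
            continue
        urls.add(line)
    return urls


def diff_explicit(old: str, new: str) -> Tuple[List[str], List[str]]:
    """Return (removed, added) package URLs between two explicit spec files."""
    before, after = explicit_urls(old), explicit_urls(new)
    return sorted(before - after), sorted(after - before)
```

Two empty lists mean the recreated environment points at exactly the same package files. Anything else is worth committing to the Git repository alongside the date.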
Well, we have not spoken about running something. We could also write a
small Python script plotting something using Numpy and/or Scipy and try
to run the Seurat vignette.
In my experience, after some months (somewhere between 2-3 and 6), Conda will
fail. Especially after an update of the system (apt upgrade), and it can get
worse with a ’dist-upgrade’. :-)
> On Tue, 29 Nov 2022 at 11:32, Thibault Lestang <t.lestang@imperial.ac.uk>
> wrote:
>
>> That's fair enough. Conda & pip are everywhere around me, and I'd like
>> to form an accurate picture of their shortcomings before mentioning
>> alternative approaches to people who use these tools everyday!
>
> I agree, let me share my perspective.
Conda and pip work very well when we look forward in time. By design,
they fail when we look backward. For engineering, they are
very efficient, and personally I would rely on them **if** I had some
systems to maintain and only cared about upgrading them. Well, Conda, pip
or some other distro package manager.
The trouble starts when you try to restore the past. The 10 Years
Challenge [1] provides very good examples. This report [2] (in French,
but an English version is probably around) provides very good insights,
IMHO, about the limitations of classical package managers (such as Debian,
Conda, pip, etc.)
For what my biased opinion is worth, many shortcomings are around. :-)
For instance, this paper [3] points out that the reproduction was «so
time-consuming and resulted in only 11 out of 28 (39%) figure panels
conveying the same information». Well, for sure it is hard to know if
the students tried hard or not, and the paper does not say much about
the computational environment.
(Well, aside from the transparency of the computational stack that Conda
barely provides, but that’s another story. :-))
1: <https://www.nature.com/articles/d41586-020-02462-7>
2: <https://hpc.guix.info/static/videos/atelier-reproductibilit%C3%A9-2021/arnaud-legrand.webm>
3: <https://doi.org/10.1371/journal.pcbi.1010615>
> That is, "conda env export" should contain entries like
> "scipy=1.8.0=py39hee8e79c_1", where the hee8e79c should uniquely define the
> dependencies 'that matter', like which compiler is used. What goes into the
> hash seems rather complicated, and grows over time.
>
> This hash is a great step forward in reproducibility. But it is too
> fragile. I can't directly see how, but I can easily assume that this
> dependency-hash mechanism leads to the problem that Konrad faced even when
> no files are overwritten. Maybe because a new dependency resolver in conda
> would have stricter rules on interoperability. (It is still possible that
> files indeed were overwritten though; it was probably an incident like this
> that made them change the hashes.)
Well, I think the Conda documentation [4] about the dependency solver
puts some warnings around this explicit mechanism. I have not looked at
Conda for a long time, but from my understanding of the
solver documentation, this “failure” reported by Konrad appears to me
expected, by design of Conda. ;-)
If the solver tries to satisfy many constraints, then the problem
becomes more complex as time goes by. So, Conda probably fails to find a
working combination.
If the solver is bypassed, then there is no guarantee that the generated
state is a working computational environment. Conda recommends
updating in order to fix the potential issues.
4: <https://conda.io/projects/conda/en/latest/dev-guide/deep-dives/solvers.html>
> One thing that conda (or actually conda-forge) does well is their bots.
> I'm a maintainer of some conda packages and once a month or so I get a
> fully automated pull request to update my package [4], e.g. when the
> upstream package is updated, or when a dependency is updated. They even
> have a tracking system for migrating dependencies that are used by many
> packages, such as compilers. This makes maintaining conda-forge packages a
> breeze. Having such bots also within the guix-ecosystem would probably help
> attract developers.
Cool! Do you know if the code of these bots is available?
> By the way, it is quite hard to use conda in guix,
Maybe you could open bugs and/or report on help-guix or guix-devel the
annoyance you are observing. For instance, I fully removed Conda from
my toolbox so I never hit annoyance. ;-)
Cheers,
simon
* Re: Conda environments and reproducibility
2022-11-29 20:10 ` Simon Tournier
@ 2022-12-16 10:16 ` Thibault Lestang
2023-03-11 11:05 ` Ludovic Courtès
0 siblings, 1 reply; 25+ messages in thread
From: Thibault Lestang @ 2022-12-16 10:16 UTC (permalink / raw)
To: guix-science; +Cc: Simon Tournier
Simon Tournier <zimon.toutoune@gmail.com> writes:
> Well, we have not spoken about running something. We could also write a
> small Python script plotting something using Numpy and/or Scipy and try
> to run the Seurat vignette.
>
> From my experience, after some months (from 2-3 to 6), Conda will fail.
> Especially after an update of the system (apt upgrade), and it can get worse
> with a ’dist-upgrade’. :-)
Well, let's see. I just set up a GitLab repo with a weekly pipeline
(re)creating a conda env from an environment.yml spec I generated earlier
this morning.
https://framagit.org/tlestang/conda-python-example
Just Python for now but anyone feel free to contribute an R env as well.
-- Thibault
* Re: Conda environments and reproducibility
2022-12-16 10:16 ` Thibault Lestang
@ 2023-03-11 11:05 ` Ludovic Courtès
2023-03-11 11:43 ` Simon Tournier
0 siblings, 1 reply; 25+ messages in thread
From: Ludovic Courtès @ 2023-03-11 11:05 UTC (permalink / raw)
To: Thibault Lestang; +Cc: guix-science, Simon Tournier
Hi Thibault,
Thibault Lestang <t.lestang@imperial.ac.uk> writes:
> Well let's see. I just set up a Gitlab repo with a weekly pipeline (re)creating
> a conda env from an environment.yml spec I generated earlier this morning.
>
> https://framagit.org/tlestang/conda-python-example
>
> Just Python for now but anyone feel free to contribute an R env as well.
Any findings so far? Looking at the pipelines, it seems to be all
green, right?
It’s an interesting experiment, great that you set it up!
Ludo’.
* Re: Conda environments and reproducibility
2023-03-11 11:05 ` Ludovic Courtès
@ 2023-03-11 11:43 ` Simon Tournier
2023-03-13 10:26 ` Lestang, Thibault
0 siblings, 1 reply; 25+ messages in thread
From: Simon Tournier @ 2023-03-11 11:43 UTC (permalink / raw)
To: Ludovic Courtès; +Cc: Thibault Lestang, guix-science
Hi,
On Sat, 11 Mar 2023 at 12:05, Ludovic Courtès <ludovic.courtes@inria.fr> wrote:
> It’s an interesting experiment, great that you set it up!
That's cool, isn't it? :-)
Well, the current pipeline tests exactly what we discussed: the effect
of the Conda resolver. However, I would add two other cases:
1. also use the image continuumio/miniconda3:latest
2. install Miniconda on top of the Docker image of Debian
unstable and run "apt update && apt upgrade"
And I expect that #2 will break first, then #1, and last the current
one. My crystal ball says that we will have to wait 2 years before the
current one breaks... wait and see.
Cheers,
simon
* Re: Conda environments and reproducibility
2023-03-11 11:43 ` Simon Tournier
@ 2023-03-13 10:26 ` Lestang, Thibault
2023-03-13 11:00 ` Ricardo Wurmus
0 siblings, 1 reply; 25+ messages in thread
From: Lestang, Thibault @ 2023-03-13 10:26 UTC (permalink / raw)
To: Simon Tournier, Ludovic Courtès; +Cc: guix-science
Ludovic Courtès <ludovic.courtes@inria.fr> writes:
> Any findings so far? Looking at the pipelines, it seems to be all
> green, right?
Timely reply - all green until 3 days ago, when the job timed out after 70min.
However, I re-ran the job manually this morning and it succeeded within a
couple of minutes. Not quite sure what happened, but it is probably not
related to conda. No logs available, unfortunately.
If the process of reproducing the environment is going to fail at some point, I
wonder if we could accelerate this process by defining a more complex environment.
Any ideas?
Simon Tournier <zimon.toutoune@gmail.com> writes:
> 1. also use the image continuumio/miniconda3:latest
> 2. install Miniconda on the top of the Docker image of Debian
> unstable and run "apt update && apt upgrade"
>
> And I expect that #2 will break first, then #1 and last the current
> one.
Could you elaborate on this? For context the current pipeline
pulls a pinned miniconda image then updates conda (=conda update conda=).
Do you expect system libraries (I mean software installed through apt, not
managed by conda) to influence the conda environment creation? My current
understanding is that conda brings its own copies of these libraries without relying
on whatever was/will be installed through other ways (e.g. apt).
Anyways very happy to set these two cases up as well.
Thibault
* Re: Conda environments and reproducibility
2023-03-13 10:26 ` Lestang, Thibault
@ 2023-03-13 11:00 ` Ricardo Wurmus
2023-03-13 12:38 ` Simon Tournier
0 siblings, 1 reply; 25+ messages in thread
From: Ricardo Wurmus @ 2023-03-13 11:00 UTC (permalink / raw)
To: Lestang, Thibault; +Cc: Simon Tournier, Ludovic Courtès, guix-science
"Lestang, Thibault" <t.lestang@imperial.ac.uk> writes:
> Ludovic Courtès <ludovic.courtes@inria.fr> writes:
>
>> Any findings so far? Looking at the pipelines, it seems to be all
>> green, right?
>
> Timely reply - all green until 3 days ago when the job timed out after 70min.
> However, I re-ran the job manually this morning and it succeeded within a couple of minutes. Not
> quite sure what happened but probably not related to conda. No logs available unfortunately.
>
> If the process of reproducing the environment is going to fail at some point, I
> wonder if we could accelerate this process by defining a more complex environment.
> Any ideas?
A more complex environment would increase the chance of failure because
it increases the complexity of the challenge to the resolver. While it
would be a useful demonstration to see the resolver fail I think it is
the least damning kind of failure.
As Simon suggests, changing the underlying system that *currently*
satisfies all the implicit assumptions that Conda artefacts contain
would likely yield a more realistic and interesting kind of failure.
>
> Simon Tournier <zimon.toutoune@gmail.com> writes:
>
>> 1. also use the image continuumio/miniconda3:latest
>> 2. install Miniconda on the top of the Docker image of Debian
>> unstable and run "apt update && apt upgrade"
>>
>> And I expect that #2 will break first, then #1 and last the current
>> one.
>
> Could you elaborate on this? For context the current pipeline
> pulls a pinned miniconda image then updates conda (=conda update conda=).
> Do you expect system libraries (I mean software installed through apt, not
> managed by conda) to influence the conda environment creation? My current
> understanding is that conda brings its own copies of these libraries without relying
> on whatever was/will be installed through other ways (e.g. apt).
This depends on the packages. There are packages that do link with
system libraries, and these are provided by a base image in which the
binary artefacts are built.
--
Ricardo
* Re: Conda environments and reproducibility
2023-03-13 11:00 ` Ricardo Wurmus
@ 2023-03-13 12:38 ` Simon Tournier
2023-03-16 10:26 ` Ludovic Courtès
0 siblings, 1 reply; 25+ messages in thread
From: Simon Tournier @ 2023-03-13 12:38 UTC (permalink / raw)
To: Ricardo Wurmus, Lestang, Thibault; +Cc: Ludovic Courtès, guix-science
Hi,
On Mon, 13 Mar 2023 at 12:00, Ricardo Wurmus <rekado@elephly.net> wrote:
>> If the process of reproducing the environment is going to fail at some point, I
>> wonder if we could accelerate this process by defining a more complex environment.
>> Any ideas?
Maybe something using PyTorch or some other ML framework.
> A more complex environment would increase the chance of failure because
> it increases the complexity of the challenge to the resolver. While it
> would be a useful demonstration to see the resolver fail I think it is
> the least damning kind of failure.
Yes, I agree the solver will be the last thing to break. Well, from my
understanding of [1], the breakage of the Conda solver depends on the
state of their index. Quoting [1]:
    This is where the SAT solver will act. It will use the list of MatchSpec
    objects to pick a number of PackageRecord entries from the index, thus
    building the “final state of the solved environment”. This is detailed
    later in this deep dive guide, if you need more info.
So the more complex the environment, the harder the SAT problem becomes,
and finding a solution can be slow. That’s why they
implemented various solvers [2]. And it is not clear to me whether [1] and
[2] always lead to the same environment.
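To make the growth of the search space concrete, here is a toy brute-force resolver in Python. It is only a sketch and is not how Conda's solver works: the index layout, the package names and the function solve are all invented for illustration. It tries every combination of one version per package, so the number of candidates is the product of the version counts.

```python
from itertools import product
from typing import Dict, Optional, Set

# Toy index: package -> version -> {dependency: set of acceptable versions}
Index = Dict[str, Dict[str, Dict[str, Set[str]]]]


def solve(index: Index) -> Optional[Dict[str, str]]:
    """Pick one version per package so that all dependencies are satisfied."""
    names = sorted(index)
    # Enumerate every combination, trying lexicographically larger versions first.
    candidates = (sorted(index[name], reverse=True) for name in names)
    for combo in product(*candidates):
        chosen = dict(zip(names, combo))
        consistent = all(
            chosen.get(dep) in allowed
            for name in names
            for dep, allowed in index[name][chosen[name]].items()
        )
        if consistent:
            return chosen
    return None  # no combination satisfies every constraint


toy_index: Index = {
    "app": {"1": {"lib": {"1"}}, "2": {"lib": {"2"}}},
    "lib": {"1": {}, "2": {}},
    "plot": {"1": {"lib": {"1"}}},
}
print(solve(toy_index))  # {'app': '1', 'lib': '1', 'plot': '1'}
```

In this toy index, removing version "1" of "app" leaves no consistent combination, which mimics an environment that can no longer be solved once the index has changed.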
To my knowledge, the issue is well-identified, for instance by the
Mancoosi project [3]; in short, it reads:
    4.2 Package installation is NP-Complete
    Theorem 1: Checking whether a single package P can be installed,
    given a repository R, is NP-complete.
    4.2.4 Conclusions
    Despite the apparent differences, the constraint languages in DEB and
    RPM are sensibly equivalent in expressiveness, and the associated
    installation problems are both NP-complete.
    This means that automatic package installation tools like APT, URPMI or
    SMART live dangerously on the edge of intractability, and must carefully
    apply heuristics that may be either safe (the approach advocated by
    SMART), and hence still not guaranteed to avoid intractability, or
    unsafe, thus accepting the risk of not always finding a solution when it
    exists.
Therefore, I do not see why Conda would be different. However, it could
indeed be hard to construct a concrete example of a failure for the
SAT solver part. Moreover, the Conda documentation reads [1]:
    Explicit package installs
    These commands do not need a solver because the requested packages are
    expressed with a direct URL or path to a specific tarball. Instead of a
    MatchSpec, we already have a PackageRecord-like entity! For this to
    work, all the requested packages need to be URLs or paths. They can be
    typed in the command line or in a text file including a @EXPLICIT line.
    Since the solver is not involved, the dependencies of the explicit
    package(s) are not processed at all. This can leave the environment in
    an inconsistent state, which can be fixed by running conda update --all,
    for example.
    Explicit installs are taken care of by the explicit function.
For sure, the failure of Conda is by design. And as with many things in
life, people only believe what they see with their own eyes. :-)
1: https://docs.conda.io/projects/conda/en/latest/dev-guide/deep-dives/solvers.html
2: https://conda.github.io/conda-libmamba-solver/libmamba-vs-classic/
3: https://www.mancoosi.org/edos/algorithmic/
>> Simon Tournier <zimon.toutoune@gmail.com> writes:
>>
>>> 1. also use the image continuumio/miniconda3:latest
>>> 2. install Miniconda on the top of the Docker image of Debian
>>> unstable and run "apt update && apt upgrade"
>>>
>>> And I expect that #2 will break first, then #1 and last the current
>>> one.
>>
>> Could you elaborate on this? For context the current pipeline
>> pulls a pinned miniconda image then updates conda (=conda update conda=).
>> Do you expect system libraries (I mean software installed through apt, not
>> managed by conda) to influence the conda environment creation? My current
>> understanding is that conda brings its own copies of these libraries without relying
>> on whatever was/will be installed through other ways (e.g. apt).
>
> This depends on the packages. There are packages that do link with
> system libraries, and these are provided by a base image in which the
> binary artefacts are built.
As Ricardo explained, sometimes Conda relies on system libraries. Guix
makes the assumption of a compatible Linux kernel. Conda also makes
assumptions and, to my knowledge, it is less strict about isolated
environments.
That’s why replacing the base image could also help to expose examples
where it breaks.
Just to point out that I was at a workshop on Reproducible Research last
week and I talked with the developer of BenchOpt [4]. Their aim is
to keep the computational stack for some ML frameworks working as time
passes by letting their benchmarks evolve. In other words, they
take the opposite direction from Guix. If they do that, it is because it is
not possible to run the old versions again. :-)
4: https://benchopt.github.io/
It is hard to predict beforehand where Conda will break. :-) From my
point of view, from most to least probable:
1. because of the underlying Linux distribution base
2. because of the SAT solver
Well, for testing #1, I propose:
a) to also run the pipeline using continuumio/miniconda3:latest
b) to run an installation of Conda
i) on top of Debian
ii) on top of Ubuntu
and then run the script
As a corollary, it will also test #2. ;-)
The current script is about Numpy; maybe it would accelerate the process
if it used PyTorch instead.
Thanks for the discussion about that topic. If no one beats me to it, I will
adapt .gitlab-ci.yml. Well, do not hold your breath… holidays first! ;-)
Cheers,
simon
* Re: Conda environments and reproducibility
2023-03-13 12:38 ` Simon Tournier
@ 2023-03-16 10:26 ` Ludovic Courtès
2023-03-16 13:40 ` Thibault Lestang
0 siblings, 1 reply; 25+ messages in thread
From: Ludovic Courtès @ 2023-03-16 10:26 UTC (permalink / raw)
To: Simon Tournier; +Cc: Ricardo Wurmus, Lestang, Thibault, guix-science
Hello comrades,
“Seeing is believing” so I think we should build upon Thibault’s
experiments and on what Simon and Ricardo pointed out to write an
article showing in concrete ways in which Conda would fail to reproduce
a software environment.
That would make a good blog post on hpc.guix.info; some conferences such
as <https://acm-rep.github.io/> (too late for this edition) may also be
good opportunities for such a study.
Just sayin’! :-)
Ludo’.
* Re: Conda environments and reproducibility
2023-03-16 10:26 ` Ludovic Courtès
@ 2023-03-16 13:40 ` Thibault Lestang
2023-04-03 15:22 ` Simon Tournier
0 siblings, 1 reply; 25+ messages in thread
From: Thibault Lestang @ 2023-03-16 13:40 UTC (permalink / raw)
To: Ludovic Courtès; +Cc: Simon Tournier, Ricardo Wurmus, guix-science
Ludovic Courtès <ludovic.courtes@inria.fr> writes:
> “Seeing is believing” so I think we should build upon Thibault’s
> experiments and on what Simon and Ricardo pointed out to write an
> article showing concrete ways in which Conda would fail to reproduce
> a software environment.
I'm all for it.
As Ricardo and Simon made clear in previous messages, the current
experiment only exercises the conda resolver, while holding the
underlying system and libraries constant over time. At the time I
thought that was actually the interesting issue. I'll try to find some
time in the next few days to add a couple of cases where they are
allowed to vary.
@Simon: Whether I actually do it or not, feel free to add/tweak the
pipelines when you're back. You should be able to open a merge request?
As a reminder, the repo currently lives at
https://framagit.org/tlestang/conda-python-example
Happy for it to be moved if there is somewhere else that is more
suitable.
-- Thibault
* Re: Conda environments and reproducibility
2023-03-16 13:40 ` Thibault Lestang
@ 2023-04-03 15:22 ` Simon Tournier
2023-04-04 12:19 ` Thibault Lestang
0 siblings, 1 reply; 25+ messages in thread
From: Simon Tournier @ 2023-04-03 15:22 UTC (permalink / raw)
To: Thibault Lestang, Ludovic Courtès; +Cc: Ricardo Wurmus, guix-science
Hi Thibault,
On Thu, 16 Mar 2023 at 13:40, Thibault Lestang <t.lestang@imperial.ac.uk> wrote:
> @Simon: Whether I actually do it or not, feel free to add/tweak the
> pipelines when you're back. You should be able to open a merge request?
I am back. :-)
Nice, you already did it. Well, I will see if, instead of the simple NumPy
example, we can also add something using PyTorch, some more complicated
stack.
Cheers,
simon
* Re: Conda environments and reproducibility
2023-04-03 15:22 ` Simon Tournier
@ 2023-04-04 12:19 ` Thibault Lestang
0 siblings, 0 replies; 25+ messages in thread
From: Thibault Lestang @ 2023-04-04 12:19 UTC (permalink / raw)
To: Simon Tournier; +Cc: guix-science
Simon Tournier <zimon.toutoune@gmail.com> writes:
> On Thu, 16 Mar 2023 at 13:40, Thibault Lestang <t.lestang@imperial.ac.uk> wrote:
>
>> @Simon: Whether I actually do it or not, feel free to add/tweak the
>> pipelines when you're back. You should be able to open a merge request?
>
> I am back. :-)
>
> Nice, you already did it. Well, I will see if, instead of the simple NumPy
> example, we can also add something using PyTorch, some more complicated
> stack.
That would be helpful, thanks. I'm not very familiar with PyTorch or
TensorFlow and am lacking inspiration on what to add to the environment.
Hope you had a good break :)
-- Thibault
* Re: Conda environments and reproducibility
2022-11-29 13:12 ` Hugo Buddelmeijer
2022-11-29 13:39 ` Konrad Hinsen
2022-11-29 20:10 ` Simon Tournier
@ 2022-12-02 10:52 ` Ludovic Courtès
2022-12-02 11:05 ` Ludovic Courtès
3 siblings, 0 replies; 25+ messages in thread
From: Ludovic Courtès @ 2022-12-02 10:52 UTC (permalink / raw)
To: Hugo Buddelmeijer; +Cc: Thibault Lestang, Konrad Hinsen, guix-science
Hi Hugo,
Hugo Buddelmeijer <hugo@buddelmeijer.nl> writes:
> By the way, it is quite hard to use conda in guix, primarily because "conda
> activate myenvironment" will try to set PS1 by calling a bash function
> called 'conda'. This bash function calls the 'conda' executable, which
> takes PS1, modifies it, and returns it to the bash function. The bash
> function subsequently sets PS1 (and makes a backup for deactivating the
> environment again). However, the conda executable is replaced by a bash
> script that calls conda_real. And bash scripts eat PS1 (because it is in
> non-interactive mode), so conda_real gets an empty PS1, fails to modify it,
> and then the bash function sets PS1 to nothing. I've got it working
> properly on my machine, but don't feel comfortable enough yet with Scheme /
> guix to provide a proper patch. The simplest might be to use another shell
> for the conda package (because I believe only bash eats PS1); not sure
> whether that is possible in guix. And I would rather make guix packages of
> everything and ditch conda altogether. But supporting conda properly would
> help more people transition.
Could you email it to bug-guix@gnu.org so we keep track of it?
Even if you cannot provide a patch, that will be a first step towards
fixing it.
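For such a report, the PS1 behaviour described in the quoted paragraph can be
probed with a small script. This is only a sketch under assumptions: the helper
name ps1_seen_by_child and the sample prompt are made up, and it merely
reports what a non-interactive child shell sees when PS1 is passed through the
environment; it does not involve conda or the Guix wrapper at all.

```python
import os
import shutil
import subprocess


def ps1_seen_by_child(shell, ps1="(myenv) $ "):
    """Return the PS1 value a non-interactive child shell reports.

    Returns None when the requested shell is not installed.
    """
    path = shutil.which(shell)
    if path is None:
        return None
    # Export PS1 explicitly so the child shell receives it in its environment.
    env = dict(os.environ, PS1=ps1)
    result = subprocess.run(
        [path, "-c", 'printf "%s" "${PS1-}"'],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


for shell in ("bash", "sh"):
    print(shell, repr(ps1_seen_by_child(shell)))
```

If the bash line prints an empty string while another shell echoes the prompt
back, that would match the explanation that only bash drops PS1 in
non-interactive mode.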
Thanks,
Ludo’.
* Re: Conda environments and reproducibility
2022-11-29 13:12 ` Hugo Buddelmeijer
` (2 preceding siblings ...)
2022-12-02 10:52 ` Ludovic Courtès
@ 2022-12-02 11:05 ` Ludovic Courtès
2022-12-02 13:59 ` Simon Tournier
2022-12-02 14:06 ` Hugo Buddelmeijer
3 siblings, 2 replies; 25+ messages in thread
From: Ludovic Courtès @ 2022-12-02 11:05 UTC (permalink / raw)
To: Hugo Buddelmeijer; +Cc: Thibault Lestang, Konrad Hinsen, guix-science
Hi,
I read this thread with interest—great to have first-hand feedback from
Conda users and packagers who also understand Guix!
Hugo Buddelmeijer <hugo@buddelmeijer.nl> writes:
> That is, "conda env export" should contain entries like
> "scipy=1.8.0=py39hee8e79c_1", where the hee8e79c should uniquely define the
> dependencies 'that matter', like which compiler is used. What goes into the
> hash seems rather complicated, and grows over time.
I think one source of many problems here is to think that there are
dependencies that do not matter. Another one, which those hashes appear
to address, is to think that a name/version pair is enough to
unambiguously designate a software artifact.
This hash is a hash of the build result, not a hash of the input, is
that correct?
I think it would be great to have a blog post that walks through
shortcomings and concrete issues one may encounter when trying to
reproduce a software environment with Conda, contrasting it with how
Guix does things. This would probably make more sense for people who use
Conda every day than a high-level overview of Guix.
Thanks,
Ludo’.
* Re: Conda environments and reproducibility
2022-12-02 11:05 ` Ludovic Courtès
@ 2022-12-02 13:59 ` Simon Tournier
2022-12-02 14:06 ` Hugo Buddelmeijer
1 sibling, 0 replies; 25+ messages in thread
From: Simon Tournier @ 2022-12-02 13:59 UTC (permalink / raw)
To: Ludovic Courtès, Hugo Buddelmeijer
Cc: Thibault Lestang, Konrad Hinsen, guix-science
Hi,
On Fri, 02 Dec 2022 at 12:05, Ludovic Courtès <ludo@gnu.org> wrote:
> Hugo Buddelmeijer <hugo@buddelmeijer.nl> writes:
>
>> That is, "conda env export" should contain entries like
>> "scipy=1.8.0=py39hee8e79c_1", where the hee8e79c should uniquely define the
>> dependencies 'that matter', like which compiler is used. What goes into the
>> hash seems rather complicated, and grows over time.
>
> I think one source of many problems here is to think that there are
> dependencies that do not matter. Another one, which those hashes appear
> to address, is to think that a name/version pair is enough to
> unambiguously designate a software artifact.
>
> This hash is a hash of the build result, not a hash of the input, is
> that correct?
Well, the official Conda documentation seems explanatory, IMHO. For
instance,
https://conda.io/projects/conda/en/latest/dev-guide/deep-dives/solvers.html#matchspec-vs-packagerecord
From my understanding, if you go via MatchSpec then the SAT solver is
invoked. The SAT solver tries to satisfy all the constraints and the
solution depends on the state of the index (the upstream repository).
Aside from the fact that the SAT solver can take a very long time and can
even fail if the constraints are too hard, there is no guarantee that it
will find the exact same combination of packages to install. Using an
equality (numpy=1.23) or something else does not really change this point.
Conda offers the option to be “explicit”. And in that case, the solver
is not invoked. Somehow, it is a way to directly deal with
PackageRecord. Then, the Conda documentation has these warnings:
* Explicit package installs
Since the solver is not involved, the dependencies of the
explicit package(s) are not processed at all. This can leave the
environment in an inconsistent state, which can be fixed by
running conda update --all, for example.
* Cloning an environment
It essentially takes the source environment, generates the URLs
for each installed packages (filtering conda, conda-env and
their dependencies) and passes the list of URLs to
explicit(). If the source tarballs are not in the cache anymore,
it will query the index for the best possible match for the
current channels. As such, there’s a slim chance that the copy
is not exactly a clone of the original environment.
https://conda.io/projects/conda/en/latest/dev-guide/deep-dives/solvers.html#early-exit-tasks
Therefore, the official Conda documentation itself explains that there is
no guarantee of reproducing an environment.
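The dependence on the index state can be illustrated with a toy model. This
is not Conda's SAT solver: it is a deliberately naive "pick the newest
matching version" rule, and the two index snapshots are made up. It only
shows why the same spec (numpy=1.23) can resolve differently once the
upstream index has changed.

```python
def version_key(version):
    """Turn '1.23.5' into (1, 23, 5) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def resolve(index, name, pin):
    """Pick the newest version of `name` in `index` that matches `pin`.

    A pin such as '1.23' matches '1.23' itself and any '1.23.x'.
    """
    matches = [
        v for v in index.get(name, [])
        if v == pin or v.startswith(pin + ".")
    ]
    if not matches:
        raise LookupError(f"no {name} matching {pin} in index")
    return max(matches, key=version_key)


# Hypothetical snapshots of the same channel at two points in time.
index_at_install = {"numpy": ["1.22.4", "1.23.0", "1.23.1"]}
index_a_year_later = {"numpy": ["1.22.4", "1.23.0", "1.23.1", "1.23.5"]}

first = resolve(index_at_install, "numpy", "1.23")
later = resolve(index_a_year_later, "numpy", "1.23")
print(first, later, first == later)  # prints: 1.23.1 1.23.5 False
```

A real solver juggles many packages and constraints at once, so the effect of
an index change is harder to predict than in this single-package sketch.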
> I think it would be great to have a blog post that walks through
> shortcomings and concrete issues one may encounter when trying to
> reproduce a software environment with Conda, contrasting it with how
> Guix does things. This would probably make more sense for people who use
> Conda every day than a high-level overview of Guix.
From my understanding, the main issue is that Conda works perfectly well
within a short time window (2-3 months, say!). Within this window,
people can often reproduce an environment. It becomes more complicated
outside it, so it is hard to build a convincing demo. :-)
For sure, a blog post by people fluent in both Conda and Guix
would be very welcome. Aside from the discussion about reproducibility,
even just a Rosetta Stone comparing how to do the same thing using Conda
vs Guix would smooth the migration and encourage people to at least give
Guix a try. :-)
Cheers,
simon
* Re: Conda environments and reproducibility
2022-12-02 11:05 ` Ludovic Courtès
2022-12-02 13:59 ` Simon Tournier
@ 2022-12-02 14:06 ` Hugo Buddelmeijer
1 sibling, 0 replies; 25+ messages in thread
From: Hugo Buddelmeijer @ 2022-12-02 14:06 UTC (permalink / raw)
To: Ludovic Courtès; +Cc: Thibault Lestang, Konrad Hinsen, guix-science
Hi Ludovic,
On Fri, 2 Dec 2022 at 12:05, Ludovic Courtès <ludo@gnu.org> wrote:
> Hi,
>
> I read this thread with interest—great to have first-hand feedback from
> Conda users and packagers who also understand Guix!
>
> Hugo Buddelmeijer <hugo@buddelmeijer.nl> writes:
>
> > That is, "conda env export" should contain entries like
> > "scipy=1.8.0=py39hee8e79c_1", where the hee8e79c should uniquely define the
> > dependencies 'that matter', like which compiler is used. What goes into the
> > hash seems rather complicated, and grows over time.
> I think one source of many problems here is to think that there are
> dependencies that do not matter.
In the Python world, most dependencies are runtime dependencies. Those do
not actually affect the build, or the build result, and therefore arguably
'do not matter'. (I disagree, because what matters is whether the software
runs and creates the right results.)
> Another one, which those hashes appear
> to address, is to think that a name/version pair is enough to
> unambiguously designate a software artifact.
>
> This hash is a hash of the build result, not a hash of the input, is
> that correct?
>
No, this conda build hash is used to identify the build environment, not to
identify a particular package build.
The easiest way to explain is to show an example. Here is a small part of a
"conda env export" of one of my environments:
- pybind11-abi=4=hd8ed1ab_3
- pycodestyle=2.8.0=pyhd8ed1ab_0
- pycosat=0.6.3=py39h3811e60_1009
- pycparser=2.21=pyhd8ed1ab_0
- pydocstyle=6.1.1=pyhd8ed1ab_0
- pyerfa=2.0.0.1=py39hce5d2b2_1
- pyflakes=2.4.0=pyhd8ed1ab_0
- pygments=2.11.2=pyhd8ed1ab_0
- pyopenssl=22.0.0=pyhd8ed1ab_0
- pyqt=5.12.3=py39hf3d152e_8
- pyqt-impl=5.12.3=py39hde8b62d_8
- pyqt5-sip=4.19.18=py39he80948d_8
- pyqtchart=5.12=py39h0fcd23e_8
- pyqtwebengine=5.12.1=py39h0fcd23e_8
As you can see, many packages share the "hd8ed1ab" build hash, two qt-related
packages have h0fcd23e, and some others have their own. The "hd8ed1ab" hash
is by far the most common in this environment. These "hd8ed1ab" packages
are mostly independent (with separate maintainers, etc), but are probably
all in conda-forge and probably all use the 'default' conda environment.
(The last number is the build number. The "8" suggests that all
qt-packages are actually built together, even though their build hash
differs.)
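The grouping described above can be checked mechanically. The sketch below
splits each entry into name, version and build string, and then groups the
packages by build hash. It assumes the build string has the layout
"<prefix>h<7 hex characters>_<build number>", which fits the entries shown
here but is only an assumption about the general format; the function names
are made up.

```python
import re
from collections import defaultdict

# Assumed layout of a build string: optional prefix, "h" + 7 hex chars, "_" + number.
BUILD_RE = re.compile(r"^(?P<prefix>.*?)(?P<hash>h[0-9a-f]{7})_(?P<number>\d+)$")

EXPORT = """\
- pybind11-abi=4=hd8ed1ab_3
- pycodestyle=2.8.0=pyhd8ed1ab_0
- pycosat=0.6.3=py39h3811e60_1009
- pycparser=2.21=pyhd8ed1ab_0
- pydocstyle=6.1.1=pyhd8ed1ab_0
- pyerfa=2.0.0.1=py39hce5d2b2_1
- pyflakes=2.4.0=pyhd8ed1ab_0
- pygments=2.11.2=pyhd8ed1ab_0
- pyopenssl=22.0.0=pyhd8ed1ab_0
- pyqt=5.12.3=py39hf3d152e_8
- pyqt-impl=5.12.3=py39hde8b62d_8
- pyqt5-sip=4.19.18=py39he80948d_8
- pyqtchart=5.12=py39h0fcd23e_8
- pyqtwebengine=5.12.1=py39h0fcd23e_8
"""


def parse_spec(line):
    """Split '- name=version=build' into its parts, including the build hash."""
    name, version, build = line.strip().lstrip("-").strip().split("=")
    match = BUILD_RE.match(build)
    if match is None:
        raise ValueError(f"unexpected build string: {build!r}")
    return {
        "name": name,
        "version": version,
        "prefix": match["prefix"],
        "hash": match["hash"],
        "number": int(match["number"]),
    }


def group_by_hash(lines):
    """Map each build hash to the list of package names carrying it."""
    groups = defaultdict(list)
    for line in lines:
        if line.strip():
            spec = parse_spec(line)
            groups[spec["hash"]].append(spec["name"])
    return dict(groups)


groups = group_by_hash(EXPORT.splitlines())
# Largest groups first.
for build_hash, names in sorted(groups.items(), key=lambda kv: -len(kv[1])):
    print(build_hash, len(names), ", ".join(names))
```

On this excerpt, seven packages share hd8ed1ab, two share h0fcd23e, and the
remaining five each have a hash of their own.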
I don't really understand what goes into the hash. It is described on
https://docs.conda.io/projects/conda-build/en/stable/resources/define-metadata.html#build-number-and-string
The goal of these hashes is to capture which package builds will work
together. So two package builds with the same build-hash should have been
made with the same environment and thus work together.
I'm not sure how it works if the hashes are different. Maybe they are
Merkle trees, so that it is possible to determine whether one hash is a
'superset' of another hash? Probably not.
>
> I think it would be great to have a blog post that walks through
> shortcomings and concrete issues one may encounter when trying to
> reproduce a software environment with Conda, contrasting it with how
> Guix does things. This would probably make more sense for people who use
> Conda every day than a high-level overview of Guix.
>
A key difference might be how to handle different combinations of versions.
E.g. you might want to use numpy 3.0 and scipy 18.0, while I want to use
numpy 6.0 and scipy 15.0 (made-up numbers, but chosen on purpose so that
each of us has one lower and one higher version than the other). Conda and
Guix solve this in fundamentally different ways.
Conda-forge (as a project) is kinda in between conda alone and Guix, and
can kinda be seen as a Linux distribution itself (sans kernel). Conda-forge
is moving closer to Guix every year, including more and more dependencies,
and more shared recreate-everything moments.
Greetings,
Hugo