unofficial mirror of guix-science@gnu.org 
* GC strategy on clusters
@ 2021-04-01 10:17 Ludovic Courtès
  2021-04-01 10:45 ` Efraim Flashner
  0 siblings, 1 reply; 5+ messages in thread
From: Ludovic Courtès @ 2021-04-01 10:17 UTC (permalink / raw)
  To: guix-science

Hi there!

Recently the Guix head node of our cluster at Inria was getting short on
disk space, despite running ‘guix gc -F20G’ (or similar) twice a day.

Turns out that some users had accumulated many profile generations and
that was getting in the way.  So we kindly asked them to run:

  guix package --delete-generations=4m

or similar, which was enough to free more space.

We’re now considering setting up automatic user notification by email,
as is commonly done for disk quotas, asking them to remove old
generations.  That way, users remain in control and choose what GC roots
or generations they want to remove.
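
A minimal sketch of what such a reminder could look like, using only
Python's standard library.  The function name, the "example.org" domain
and the wording of the message are all hypothetical; the function only
builds the message and leaves sending (for instance through the local
MTA) to the site's own tooling.

```python
from email.message import EmailMessage


def generation_reminder(user, count, domain="example.org"):
    """Build (but do not send) a reminder for USER, who has COUNT generations.

    DOMAIN and the sender address are placeholders, not real cluster settings.
    """
    msg = EmailMessage()
    msg["To"] = f"{user}@{domain}"
    msg["From"] = f"root@{domain}"
    msg["Subject"] = f"Guix: you have {count} profile generations"
    msg.set_content(
        f"Hello {user},\n\n"
        f"You currently have {count} profile generations on the cluster.\n"
        "To free disk space, please consider removing old ones, e.g.:\n\n"
        "  guix package --delete-generations=4m\n\n"
        "You remain free to keep any generation you still need.\n"
    )
    return msg
```

The message only suggests a command, so the choice of which generations
to drop stays with the user.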

How do people on this list deal with that?

Longer term, I think Guix should automatically delete old generations
and instead store the channels + manifest to reproduce them, when
possible.

Ludo’.



* Re: GC strategy on clusters
  2021-04-01 10:17 GC strategy on clusters Ludovic Courtès
@ 2021-04-01 10:45 ` Efraim Flashner
  2021-04-01 12:36   ` Ludovic Courtès
  0 siblings, 1 reply; 5+ messages in thread
From: Efraim Flashner @ 2021-04-01 10:45 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: guix-science


On Thu, Apr 01, 2021 at 12:17:22PM +0200, Ludovic Courtès wrote:
> Hi there!
> 
> Recently the Guix head node of our cluster at Inria was getting short on
> disk space, despite running ‘guix gc -F20G’ (or similar) twice a day.
> 
> Turns out that some users had accumulated many profile generations and
> that was getting in the way.  So we kindly asked them to run:
> 
>   guix package --delete-generations=4m
> 
> or similar, which was enough to free more space.

I feel like 4-6 months should be plenty for anything active. Even if it
were run automatically for them it wouldn't remove the last generation.
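
As a rough illustration of which generations an age-based cleanup would
target, here is a minimal dry-run sketch.  It only lists candidates and
deletes nothing.  It rests on several assumptions that are not taken
from Guix itself: a "month" is counted as 30 days, the age of a
generation is approximated by the modification time of its
"<profile>-<N>-link" symlink, and the function name is hypothetical.
The generation a profile currently points to is skipped, matching the
remark above that the last generation is kept.

```python
import os
import re
import time

# Generation links are named "<profile>-<N>-link", e.g. "guix-profile-42-link".
GENERATION_LINK = re.compile(r"^(?P<profile>.+)-(?P<number>\d+)-link$")

# Assumption: one "month" is treated as 30 days.
SECONDS_PER_MONTH = 30 * 24 * 3600


def stale_generations(profile_dir, months, now=None):
    """Return sorted (profile, number) pairs older than MONTHS months.

    Generations that a profile symlink currently points to are never listed.
    """
    now = time.time() if now is None else now
    cutoff = now - months * SECONDS_PER_MONTH
    entries = list(os.scandir(profile_dir))

    # A profile such as "guix-profile" is a symlink to its current generation.
    current = set()
    for entry in entries:
        if entry.is_symlink() and not GENERATION_LINK.match(entry.name):
            current.add(os.path.basename(os.readlink(entry.path)))

    stale = []
    for entry in entries:
        match = GENERATION_LINK.match(entry.name)
        if match is None or not entry.is_symlink() or entry.name in current:
            continue
        # Assumption: the symlink's own mtime stands in for the generation date.
        if entry.stat(follow_symlinks=False).st_mtime < cutoff:
            stale.append((match["profile"], int(match["number"])))
    return sorted(stale)
```

For instance, calling stale_generations on a user's directory under
/var/guix/profiles/per-user with months=4 would give the list one could
show that user before they decide what to delete.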

> We’re now considering setting up automatic user notification by email,
> as is commonly done for disk quotas, asking them to remove old
> generations.  That way, users remain in control and choose what GC roots
> or generations they want to remove.
> 
> How do people on this list deal with that?

I like the idea of asking people to remove old generations. It's not
something that we've come up against yet. It doesn't feel that different
than reminding them that their $HOME is for code and smaller things and
the storage space is for their large data collections.

> Longer term, I think Guix should automatically delete old generations
> and instead store the channels + manifest to reproduce them, when
> possible.
> 

This seems to help a bit less when we run into issues about dates being
wrong on SSL tests, or when sources go missing.

How much storage and how many people are you working with? Our initial
multiuser system has 188GB for /gnu and, I think, 30-40 people, and some
people have profiles going back almost 3 years. Not many people have
multiple profiles, and the shared profiles we experimented with in
/usr/local/guix-profiles don't see a lot of use and aren't updated
frequently.

I guess I'm not really sure if it's a technology problem or a people
problem. Figuring out if someone is the only one pulling in a copy of
glibc-2.25 is doable, but how many copies of diffoscope is too many?

On a practical note, 'guix package --list-profiles' as root currently
lists everyone's profiles so it can be easier to see who has older
profiles hanging around.

-- 
Efraim Flashner   <efraim@flashner.co.il>   אפרים פלשנר
GPG key = A28B F40C 3E55 1372 662D  14F7 41AA E7DC CA3D 8351
Confidentiality cannot be guaranteed on emails sent or received unencrypted



* Re: GC strategy on clusters
  2021-04-01 10:45 ` Efraim Flashner
@ 2021-04-01 12:36   ` Ludovic Courtès
  2021-04-01 13:29     ` zimoun
  0 siblings, 1 reply; 5+ messages in thread
From: Ludovic Courtès @ 2021-04-01 12:36 UTC (permalink / raw)
  To: Efraim Flashner; +Cc: guix-science

Hello!

Efraim Flashner <efraim@flashner.co.il> skribis:

> On Thu, Apr 01, 2021 at 12:17:22PM +0200, Ludovic Courtès wrote:

[...]

>>   guix package --delete-generations=4m
>> 
>> or similar, which was enough to free more space.
>
> I feel like 4-6 months should be plenty for anything active. Even if it
> were run automatically for them it wouldn't remove the last generation.

It depends.  A practical use case I have in mind: you run experiments,
you submit a paper including its results, you get initial reviews months
later, and even later it’s published and you get to present it.  At that
point, you want to answer questions and to reproduce it.  4–6 months is
not a lot in that context.

(Though of course, ideally you’d save channels.scm + manifest.scm and
share it with reviewers and readers in the first place…)

Besides, I think the whole point of Guix is that users on the cluster
can remain in control, unlike what happens with “environment modules”.

>> Longer term, I think Guix should automatically delete old generations
>> and instead store the channels + manifest to reproduce them, when
>> possible.
>> 
>
> This seems to help a bit less when we run into issues about dates being
> wrong on SSL tests, or when sources go missing.

Good points.  Hopefully “sources go missing” can soon be considered
addressed.  Really, failing TLS tests is the most worrisome issue to me
because we don’t have any idea on how to address it systematically.

> How much storage and people are you working with? Our initial multiuser
> system has 188GB for /gnu and I think 30-40 people and some people have
> profiles going back almost 3 years. Not many people have multiple
> profiles and the experiments we tried with shared profiles in
> /usr/local/guix-profiles don't see a lot of use or get updated
> frequently.

I’m not sure how much storage the Guix head node has (I’m not an admin),
but the number of users and duration is in the same ballpark.

> I guess I'm not really sure if its a technology problem or a people
> problem. Figuring out if someone is the only one pulling in a copy of
> glibc-2.25 is doable but how many copies of diffoscope is too many?
>
> On a practical note, 'guix package --list-profiles' as root currently
> lists everyone's profiles so it can be easier to see who has older
> profiles hanging around.

Actually, as non-root, I walked /var/guix/profiles/per-user on the
cluster to see the number of generations per user, which allowed us to
target those with a lot of generations.  :-)
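
A minimal sketch of such a walk, assuming the usual layout where each
user's directory under /var/guix/profiles/per-user holds generation
symlinks named "<profile>-<N>-link".  The function name is hypothetical,
and directories the caller cannot read are simply skipped.

```python
import os
import re
from collections import Counter

# Generation links look like "guix-profile-42-link" or "current-guix-7-link".
GENERATION_LINK = re.compile(r"^.+-\d+-link$")


def generations_per_user(root="/var/guix/profiles/per-user"):
    """Count generation links in each user's directory under ROOT."""
    counts = Counter()
    for user in os.scandir(root):
        if not user.is_dir(follow_symlinks=False):
            continue
        try:
            names = os.listdir(user.path)
        except PermissionError:
            continue  # Directory not readable by the current user.
        counts[user.name] = sum(1 for n in names if GENERATION_LINK.match(n))
    return counts
```

On a machine with Guix installed, generations_per_user().most_common(10)
would list the ten users with the most generations.  The count covers
all profiles in a user's directory, including 'guix pull' generations.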

It would be nice to provide a documented approach sysadmins could
follow!

Ludo’.



* Re: GC strategy on clusters
  2021-04-01 12:36   ` Ludovic Courtès
@ 2021-04-01 13:29     ` zimoun
  2021-04-01 13:43       ` Ludovic Courtès
  0 siblings, 1 reply; 5+ messages in thread
From: zimoun @ 2021-04-01 13:29 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: Efraim Flashner, guix-science

Hi,

On Thu, 1 Apr 2021 at 14:36, Ludovic Courtès <ludovic.courtes@inria.fr> wrote:
> Efraim Flashner <efraim@flashner.co.il> skribis:
>
> > On Thu, Apr 01, 2021 at 12:17:22PM +0200, Ludovic Courtès wrote:
>
> [...]
>
> >>   guix package --delete-generations=4m
> >>
> >> or similar, which was enough to free more space.

For now, I am waiting until we reach full capacity before asking how to
deal with it. ;-)

> > I feel like 4-6 months should be plenty for anything active. Even if it
> > were run automatically for them it wouldn't remove the last generation.
>
> It depends.  A practical use case I have in mind: you run experiments,
> you submit a paper including its results, you get initial reviews months
> later, and even later it’s published and you get to present it.  At that
> point, you want to answer questions and to reproduce it.  4–6 months is
> not a lot in that context.

To add another data point: even if it is hard to have a good overview, I
would say the average is 2-4 years for a typical project.  It is hard
because currently not everything is done with the same tools for the
same task.  For instance, some data is cleaned with some tools, then a
partial analysis is done; months later more data is added, so another
partial analysis is done, probably with different tools; then months
later a full tool suite such as Bioconductor (or whatever) is updated
and some analyses are redone.  The final publication is a mix of
everything over the 2-4 year project, with details at various levels.

In other words, it depends on the level we are looking at.

> (Though of course, ideally you’d save channels.scm + manifest.scm and
> share it with reviewers and readers in the first place…)

It is what I am tirelessly explaining in my lab. ;-)

> >> Longer term, I think Guix should automatically delete old generations
> >> and instead store the channels + manifest to reproduce them, when
> >> possible.
> >>
> >
> > This seems to help a bit less when we run into issues about dates being
> > wrong on SSL tests, or when sources go missing.
>
> Good points.  Hopefully “sources go missing” can soon be considered
> addressed.  Really, failing TLS tests is the most worrisome issue to me
> because we don’t have any idea on how to address it systematically.

By "soon", you mean the bricks are there and what is missing is the
glue between them.  In my opinion, some details need to be addressed to
have a full end-to-end source fallback, since the devil is in the
details. ;-)

About TLS, you proposed setting up a machine with its clock set ahead.
Maybe it is worth a try.

> > I guess I'm not really sure if its a technology problem or a people
> > problem. Figuring out if someone is the only one pulling in a copy of
> > glibc-2.25 is doable but how many copies of diffoscope is too many?
> >
> > On a practical note, 'guix package --list-profiles' as root currently
> > lists everyone's profiles so it can be easier to see who has older
> > profiles hanging around.
>
> Actually, as non-root, I walked /var/guix/profiles/per-user on the
> cluster to see the number of generations per user, which allowed us to
> target those with a lot of generations.  :-)

Well, when we discussed '--list-profiles', it was initially for my
personal purposes.  Then I tried to use it to monitor the few users
that I have.  In biology they have the concept of the "-80 fridge".  It
is a big and very cold fridge where you keep samples, potentially for a
long time.  Everybody puts things in until it is full, and once it is
full, there are endless discussions about what to throw away... until
the fridge breaks down because of a shutdown or an unexpected failure,
and then it is obvious to everybody what needs to be saved or thrown
away.  I use this analogy to explain the hygiene needed on shared
machines.  Hmm, now that I have written it, I do not know if it is
relevant. ;-)

From my point of view, having a channel+manifest "backup" for old
profiles seems worth trying.  It costs nothing with
"--export-manifest" and "--export-channels".

Cheers,
simon



* Re: GC strategy on clusters
  2021-04-01 13:29     ` zimoun
@ 2021-04-01 13:43       ` Ludovic Courtès
  0 siblings, 0 replies; 5+ messages in thread
From: Ludovic Courtès @ 2021-04-01 13:43 UTC (permalink / raw)
  To: zimoun; +Cc: Efraim Flashner, guix-science

Hi,

zimoun <zimon.toutoune@gmail.com> skribis:

>> It depends.  A practical use case I have in mind: you run experiments,
>> you submit a paper including its results, you get initial reviews months
>> later, and even later it’s published and you get to present it.  At that
>> point, you want to answer questions and to reproduce it.  4–6 months is
>> not a lot in that context.
>
> To add another data point.  Even it is hard to have a good overview, I
> would say the average is 2-4 years for a typical project.  It is hard
> because currently all is not done with the same tools for the same
> task.  For instance, some data is cleaned with some tools, then an
> partial analysis is done, months later another data is added so
> another partial analysis with probably different tools, then months
> later a full toolsuite as Bioconductor (or whatever) is updated and
> some analysis are re-done.  The final publication is a mix of all over
> the 2-4 years project with details at various level.

Right.

>> Good points.  Hopefully “sources go missing” can soon be considered
>> addressed.  Really, failing TLS tests is the most worrisome issue to me
>> because we don’t have any idea on how to address it systematically.
>
> By "soon", you mean the bricks are there and it is missing to glue
> them together.  From my opinion, some details need to be addressed to
> have a full end-to-end sources fallback, since evil is hidden inside
> the details. ;-)

True!  I’m really hopeful about the Disarchive/SWH combination, together
with <https://guix.gnu.org/sources.json>.  Of course we’ll have to
monitor that, and we can expect bumps on the road :-), but at least we
have a plan.

> About the TLS, you proposed to setup a machine ahead of clock.  Maybe
> it is worth to try.

Oh sure, that trick definitely works.  But it’s a terrible hack, and it
means that, by default, people will just fail to build the package.

> Well, when we discussed the '--list-profiles', it was initially for my
> personal purposes.  Then, I have tried to use it to monitor the few
> users that I have.  Well, in Biology they have the concept of "-80
> fridge".  It is a big and very cold fridge where you keep samples,
> potentially for a long time.  Everybody put in until it is full and
> once it is full, there is endless discussion on what to throw... until
> the fridge is broken because shutdown or unexpected failures and then
> it is obvious to everybody what needs to be taken or thrown.  I am
> using such analogy to explain the hygiene to have on shared machines
> Hum, once written I do not know if it relevant. ;-)

It surely is.  :-)  I mean, I didn’t even think about it until GC
couldn’t make any progress.

> From my point of view, having a channel+manifest "backup" for old
> profiles seems something to try.  It cost nothing with
> ''--export-manifest" and "--export-channels".

Yes, though it’s an approximation; so it should only be used when we
know that it’s 100% faithful, as is the case for ‘guix pull’ profiles.
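
A minimal sketch of such a "backup", wrapping the two 'guix package'
options mentioned above.  The function names, the output file names and
the injectable "run" parameter are illustrative assumptions.  As noted,
the exported manifest is an approximation of the profile rather than a
guaranteed faithful description.

```python
import subprocess
from pathlib import Path


def export_commands(profile):
    """Return the two guix invocations that describe PROFILE, keyed by file."""
    base = ["guix", "package", "-p", str(profile)]
    return {
        "manifest.scm": base + ["--export-manifest"],
        "channels.scm": base + ["--export-channels"],
    }


def backup_profile(profile, out_dir, run=subprocess.run):
    """Write manifest.scm and channels.scm for PROFILE into OUT_DIR.

    RUN defaults to subprocess.run; it is a parameter only so that the
    function can be exercised without Guix installed.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for file_name, argv in export_commands(profile).items():
        result = run(argv, check=True, capture_output=True, text=True)
        (out_dir / file_name).write_text(result.stdout)
```

Keeping the two files next to each other means they can later be passed
together to 'guix time-machine -C channels.scm -- shell -m manifest.scm'
or a similar invocation, subject to the source-availability and TLS
caveats discussed earlier in the thread.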

Ludo’.


