unofficial mirror of guix-devel@gnu.org 
* Improving cgroups for fun and Kubernetes
@ 2023-09-24 16:39 Tao Hansen
  2023-10-04 15:21 ` Ludovic Courtès
From: Tao Hansen @ 2023-09-24 16:39 UTC (permalink / raw)
  To: guix-devel


Hello, Guix!

This is my second posting to the mailing list but the first using Gnus
and smtpmail. If I've formatted anything poorly, don't hesitate to let
me know.

I've been spending a silly amount of time trying to get a local flavor
of Kubernetes running on Guix System. I wanted to share my experience
and also solicit feedback from Guix's developers on how to improve the
cgroups implementation such that those who follow me will have an easier
time of it.

I wish to start by stating that I am largely a Linux enthusiast. Most of
my knowledge of cgroups I owe to reading over the last two weeks.
If I state something as true and I've gotten it wrong, please don't
hesitate to correct me (kindly). With that, here come the statements as
I understand them to be true.

Most flavors of local Kubernetes expect systemd, which presents
some unusual challenges for Guix System users, especially when using
Podman rootlessly to run a local Kubernetes cluster, which is my use case.

As I understand it, systemd creates user "slices", which kind and
minikube then map cgroups to. Patch 64260 added support for cgroups v2,
a necessary requirement for Podman to run rootless containers and
rootless Kubernetes clusters. However, because we don't use systemd and
therefore have no assigned user slices, our /sys/fs/cgroup looks
like this:

ls -lah /sys/fs/cgroup/
total 0
dr-xr-xr-x 7 root root 0 Sep 24 13:09 .
drwxr-xr-x 8 root root 0 Sep 24 13:09 ..
drwxr-xr-x 2 root root 0 Sep 24 13:09 c1
drwxr-xr-x 2 root root 0 Sep 24 13:09 c2
drwxr-xr-x 2 root root 0 Sep 24 16:26 c3
drwxr-xr-x 2 root root 0 Sep 24 16:26 c4
-r--r--r-- 1 root root 0 Sep 24 13:09 cgroup.controllers
-rw-r--r-- 1 root root 0 Sep 24 18:07 cgroup.max.depth
-rw-r--r-- 1 root root 0 Sep 24 18:07 cgroup.max.descendants
-rw-r--r-- 1 root root 0 Sep 24 18:07 cgroup.pressure
-rw-r--r-- 1 root root 0 Sep 24 13:09 cgroup.procs
-r--r--r-- 1 root root 0 Sep 24 18:07 cgroup.stat
-rw-r--r-- 1 root root 0 Sep 24 18:06 cgroup.subtree_control
-rw-r--r-- 1 root root 0 Sep 24 18:07 cgroup.threads
-rw-r--r-- 1 root root 0 Sep 24 18:07 cpu.pressure
-r--r--r-- 1 root root 0 Sep 24 18:07 cpuset.cpus.effective
-r--r--r-- 1 root root 0 Sep 24 18:07 cpuset.mems.effective
-r--r--r-- 1 root root 0 Sep 24 18:07 cpu.stat
dr-xr-xr-x 2 root root 0 Sep 24 13:09 elogind
-rw-r--r-- 1 root root 0 Sep 24 18:07 io.cost.model
-rw-r--r-- 1 root root 0 Sep 24 18:07 io.cost.qos
-rw-r--r-- 1 root root 0 Sep 24 18:07 io.pressure
-rw-r--r-- 1 root root 0 Sep 24 18:07 io.prio.class
-r--r--r-- 1 root root 0 Sep 24 18:07 io.stat
-r--r--r-- 1 root root 0 Sep 24 18:07 memory.numa_stat
-rw-r--r-- 1 root root 0 Sep 24 18:07 memory.pressure
--w------- 1 root root 0 Sep 24 18:07 memory.reclaim
-r--r--r-- 1 root root 0 Sep 24 18:07 memory.stat
-r--r--r-- 1 root root 0 Sep 24 18:07 misc.capacity

You may notice the first problem, which is that the entire tree is owned
by root. kind and minikube don't like this:

2023-09-23T23:33:41.974998799+02:00 Failed to create /init.scope control group: Permission denied
2023-09-23T23:33:41.974998799+02:00 Failed to allocate manager object: Permission denied
2023-09-23T23:33:41.974998799+02:00 [!!!!!!] Failed to allocate manager object.
2023-09-23T23:33:41.974998799+02:00 Exiting PID 1...: container exited unexpectedly

The second problem is that kind and minikube both expect Delegate=yes
to be set, a systemd unit setting that lets these tools manage their own
cgroup limits. The controllers they expect to manage are cpu, cpuset,
memory and pids. We can force these as root like so:

echo "+cpu +cpuset +memory +pids" >> /sys/fs/cgroup/cgroup.subtree_control
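
As a minimal sketch of that step (not something Guix ships): the helper
below only requests controllers that the kernel lists in
cgroup.controllers, since asking for an unavailable one makes the write
fail. The enable_controllers name and its arguments are made up for
illustration, and on the real hierarchy it has to run as root.

```shell
#!/bin/sh
# Hypothetical helper: enable the requested cgroup v2 controllers for the
# children of the cgroup at $1, skipping any the kernel does not offer there.
enable_controllers() {
    root=$1; shift
    # Pad with spaces so each controller name can be matched as a whole word.
    available=" $(cat "$root/cgroup.controllers") "
    request=""
    for c in "$@"; do
        case "$available" in
            *" $c "*) request="$request +$c" ;;
            *) echo "skipping unavailable controller: $c" >&2 ;;
        esac
    done
    # A single write, so the kernel applies the whole request or none of it.
    if [ -n "$request" ]; then
        printf '%s\n' "${request# }" > "$root/cgroup.subtree_control"
    fi
}

# As root on the real hierarchy:
#   enable_controllers /sys/fs/cgroup cpu cpuset memory pids
```

Taking the cgroup directory as an argument keeps the helper usable on a
sub-tree as well as on the root of the hierarchy.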

To fix the first problem we can run

g=users && sudo chgrp -R ${g} /sys/fs/cgroup/
u=$USER && sudo chown -R ${u}: /sys/fs/cgroup

These aren't harmful actions since all we're doing is changing the
cgroups file tree to be owned by our user and the users group.

Once we've addressed the first and second problems, the rest is
relatively easy. We need to make iptables and its modules available, so
the package alone isn't enough: we need Guix's iptables service. We need
to set a range of user IDs and group IDs for Podman to use rootlessly.
Finally, we need to set a container policy, otherwise Podman can't pull
any image from anywhere. All of these can be done from inside
our Guix System configuration file.
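
After reconfiguring, one quick way to confirm that the ID ranges are in
place is to look for the user's entry in /etc/subuid and /etc/subgid.
The sketch below assumes the usual name:start:count line format; the
has_subid_range helper is hypothetical, not a Podman or Guix command.

```shell
#!/bin/sh
# Hypothetical check: succeed if FILE has a "USER:START:COUNT" entry for
# USER with a positive COUNT.
has_subid_range() {
    file=$1
    user=$2
    awk -F: -v u="$user" \
        '$1 == u && $3 > 0 { found = 1 } END { exit !found }' "$file"
}

# Report on both files for the current user.
for f in /etc/subuid /etc/subgid; do
    if [ -r "$f" ] && has_subid_range "$f" "$(id -un)"; then
        echo "$f: range found for $(id -un)"
    else
        echo "$f: no range for $(id -un)"
    fi
done
```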

What I'd really like to see is some method for declaratively changing
the cgroups file-tree and setting limit delegation, since otherwise
these actions need to be done on every boot. I don't have the Guile
skills to pull this off but if someone fancied mentoring me, I'd be
happy to give it a shot. I have just enough ability to cobble together
a kind package from a binary (for shame, I know) and to edit the EXWM
upstream package to be based on a newer Emacs release version.
Otherwise, if there's a method of declaring these already available or
someone else can take a crack at this, please let me know!

Here's what that Guix System configuration looks like:

;; Rootless Podman requires the services below.
;; We're using the iptables service purely to make its resources
;; available to minikube and kind.
;; `username' is assumed to be a string defined elsewhere in the
;; configuration.

(service iptables-service-type
         (iptables-configuration
          (ipv4-rules (plain-file "iptables.rules" "*filter
:INPUT ACCEPT
:FORWARD ACCEPT
:OUTPUT ACCEPT
COMMIT
"))
          (ipv6-rules (plain-file "ip6tables.rules" "*filter
:INPUT ACCEPT
:FORWARD ACCEPT
:OUTPUT ACCEPT
COMMIT
"))))
(simple-service 'etc-subuid etc-service-type
                (list `("subuid"
                        ,(plain-file "subuid"
                                     (string-append "root:0:65536\n"
                                                    username ":100000:65536\n")))))
(simple-service 'etc-subgid etc-service-type
                (list `("subgid"
                        ,(plain-file "subgid"
                                     (string-append "root:0:65536\n"
                                                    username ":100000:65536\n")))))
(service pam-limits-service-type
         (list
          (pam-limits-entry "*" 'both 'nofile 100000)))
(simple-service 'etc-container-policy etc-service-type
                (list `("containers/policy.json"
                        ,(plain-file "policy.json"
                                     "{\"default\": [{\"type\": \"insecureAcceptAnything\"}]}"))))
%my-services




* Re: Improving cgroups for fun and Kubernetes
  2023-09-24 16:39 Improving cgroups for fun and Kubernetes Tao Hansen
@ 2023-10-04 15:21 ` Ludovic Courtès
From: Ludovic Courtès @ 2023-10-04 15:21 UTC (permalink / raw)
  To: Tao Hansen; +Cc: guix-devel

Hi Tao,

Tao Hansen <worldofgeese@riseup.net> skribis:

> This is my second posting to the mailing list but the first using Gnus
> and smtpmail. If I've formatted anything poorly, don't hesitate to let
> me know.

Looks perfect to me.  :-)

> I've been spending a silly amount of time trying to get a local flavor
> of Kubernetes running on Guix System. I wanted to share my experience
> and also solicit feedback from Guix's developers on how to improve the
> cgroups implementation such that those who follow me will have an easier
> time of it.

I’ve never used Kubernetes, but I’m confident you’re not the only one
interested in using it on Guix System!

[...]

> The second problem is kind and minikube are both expecting Delegate=yes
> to be set, which is a systemd function that allows these tools to set
> cgroups limits. The limits it's expecting to control are cpu, cpuset,
> memory and pids. We can force these privileges like so, echo "+cpu
> +cpuset +memory +pids" >> /sys/fs/cgroup/cgroup.subtree_control

How about having a Shepherd service that writes to that
‘cgroup.subtree_control’ file as you describe above?

> To fix the first problem we can run
>
> g=users && sudo chgrp -R ${g} /sys/fs/cgroup/
> u=$USER && sudo chown -R ${u}: /sys/fs/cgroup

What does Debian do?  Perhaps there’s a “cgroup” group (in /etc/group)
that users who want to use podman need to belong to, similar to the
‘kvm’ group for those who want to access /dev/kvm?

Or maybe we should create a sub-tree specifically for podman usage?

At any rate, we could again have a Shepherd service that sets ownership
on the relevant file tree.
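
A rough sketch of what such a service might run at boot, combining the
sub-tree idea with the ownership change: create one cgroup for the user
and hand over only the directory plus its cgroup.procs, cgroup.threads
and cgroup.subtree_control files, which is the set the kernel's cgroup
v2 documentation names for delegation. The delegate_subtree name, the
"podman" directory and the "tao:users" owner in the usage line are
placeholders, and whether kind and minikube are satisfied with this
layout is untested.

```shell
#!/bin/sh
# Hypothetical helper: create cgroup NAME under PARENT and give OWNER write
# access to it, leaving the rest of the hierarchy owned by root.
delegate_subtree() {
    parent=$1
    name=$2
    owner=$3
    dir=$parent/$name
    mkdir -p "$dir" || return 1
    chown "$owner" "$dir" || return 1
    # Interface files the delegatee needs in order to move processes and
    # enable controllers inside its own sub-tree.
    for f in cgroup.procs cgroup.threads cgroup.subtree_control; do
        if [ -e "$dir/$f" ]; then
            chown "$owner" "$dir/$f" || return 1
        fi
    done
}

# As root, after enabling controllers in the parent:
#   delegate_subtree /sys/fs/cgroup podman tao:users
```

Compared with a recursive chown of /sys/fs/cgroup, this leaves the other
cgroups (elogind, c1, c2, ...) untouched.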

> Once we've addressed the first and second problem, the rest is
> relatively easy: we need to make iptables (and iptables' modules so just
> the package isn't enough: we need Guix's service) available. We need to
> set a range of user IDs and group IDs for Podman to make use of
> rootlessly, and finally we need to set a container policy otherwise Podman
> can't pull any image from anywhere. All of those can be done from inside
> our Guix System configuration file.

Right, we should populate /etc/subuid by default (I tried to use
subordinate UIDs in the past, by invoking ‘newuidmap’, but never managed
to get it to work.)

> Here's what that Guix System configuration looks like:
>
> ;; Rootless Podman requires the next 4 services
> ;; we're using the iptables service purely to make its resources
> ;; available to minikube and kind
>
> (service iptables-service-type
>          (iptables-configuration
>           (ipv4-rules (plain-file "iptables.rules" "*filter
> :INPUT ACCEPT
> :FORWARD ACCEPT
> :OUTPUT ACCEPT
> COMMIT
> "))
>               (ipv6-rules (plain-file "ip6tables.rules" "*filter
> :INPUT ACCEPT
> :FORWARD ACCEPT
> :OUTPUT ACCEPT
> COMMIT
> "))))
> 	(simple-service 'etc-subuid etc-service-type
> 	     	        (list `("subuid" ,(plain-file "subuid"
>          (string-append "root:0:65536\n" username ":100000:65536\n")))))
> 	(simple-service 'etc-subgid etc-service-type
> 	     	        (list `("subgid" ,(plain-file "subgid"
>          (string-append "root:0:65536\n" username ":100000:65536\n")))))
>     (service pam-limits-service-type
>              (list
>               (pam-limits-entry "*" 'both 'nofile 100000)))
>     (simple-service 'etc-container-policy etc-service-type
> 	     	        (list `("containers/policy.json", (plain-file
>          "policy.json" "{\"default\": [{\"type\":
>          \"insecureAcceptAnything\"}]}"))))
>     %my-services

Looks great!  We should probably consider /etc/{subuid,subgid} support
separately, but otherwise it looks like you already have the start of a
‘rootless-podman-service-type’ (or similar).

Thanks,
Ludo’.

