From: Tao Hansen
To: guix-devel@gnu.org
Subject: Improving cgroups for fun and Kubernetes
Date: Sun, 24 Sep 2023 18:39:15 +0200
Message-ID: <87zg1bplu4.fsf@riseup.net>

Hello, Guix!

This is my second posting to the mailing list but the first using Gnus
and smtpmail. If I've formatted anything poorly, don't hesitate to let
me know.

I've been spending a silly amount of time trying to get a local flavor
of Kubernetes running on Guix System. I wanted to share my experience
and also solicit feedback from Guix's developers on how to improve the
cgroups implementation so that those who follow me will have an easier
time of it.

I wish to start by stating that I am largely a Linux enthusiast. Most of
my knowledge of cgroups I owe to reading over the last two weeks. If I
state something as true and I've gotten it wrong, please don't hesitate
to correct me (kindly). With that, here come the statements as I
understand them to be true.

Most flavors of local Kubernetes expect systemd, which presents some
unusual challenges for Guix System users, especially when using Podman
rootlessly to run a local Kubernetes cluster, which is my use-case. As I
understand it, systemd creates user "slices", which kind and minikube
then map cgroups to.

Patch 64260 added support for cgroups v2, a necessary requirement for
Podman to run rootless containers and rootless Kubernetes clusters.
However, because we don't use systemd, and therefore have no assigned
user slices, our /sys/fs/cgroup looks like this:

ls -lah /sys/fs/cgroup/
total 0
dr-xr-xr-x 7 root root 0 Sep 24 13:09 .
drwxr-xr-x 8 root root 0 Sep 24 13:09 ..
drwxr-xr-x 2 root root 0 Sep 24 13:09 c1
drwxr-xr-x 2 root root 0 Sep 24 13:09 c2
drwxr-xr-x 2 root root 0 Sep 24 16:26 c3
drwxr-xr-x 2 root root 0 Sep 24 16:26 c4
-r--r--r-- 1 root root 0 Sep 24 13:09 cgroup.controllers
-rw-r--r-- 1 root root 0 Sep 24 18:07 cgroup.max.depth
-rw-r--r-- 1 root root 0 Sep 24 18:07 cgroup.max.descendants
-rw-r--r-- 1 root root 0 Sep 24 18:07 cgroup.pressure
-rw-r--r-- 1 root root 0 Sep 24 13:09 cgroup.procs
-r--r--r-- 1 root root 0 Sep 24 18:07 cgroup.stat
-rw-r--r-- 1 root root 0 Sep 24 18:06 cgroup.subtree_control
-rw-r--r-- 1 root root 0 Sep 24 18:07 cgroup.threads
-rw-r--r-- 1 root root 0 Sep 24 18:07 cpu.pressure
-r--r--r-- 1 root root 0 Sep 24 18:07 cpuset.cpus.effective
-r--r--r-- 1 root root 0 Sep 24 18:07 cpuset.mems.effective
-r--r--r-- 1 root root 0 Sep 24 18:07 cpu.stat
dr-xr-xr-x 2 root root 0 Sep 24 13:09 elogind
-rw-r--r-- 1 root root 0 Sep 24 18:07 io.cost.model
-rw-r--r-- 1 root root 0 Sep 24 18:07 io.cost.qos
-rw-r--r-- 1 root root 0 Sep 24 18:07 io.pressure
-rw-r--r-- 1 root root 0 Sep 24 18:07 io.prio.class
-r--r--r-- 1 root root 0 Sep 24 18:07 io.stat
-r--r--r-- 1 root root 0 Sep 24 18:07 memory.numa_stat
-rw-r--r-- 1 root root 0 Sep 24 18:07 memory.pressure
--w------- 1 root root 0 Sep 24 18:07 memory.reclaim
-r--r--r-- 1 root root 0 Sep 24 18:07 memory.stat
-r--r--r-- 1 root root 0 Sep 24 18:07 misc.capacity

You may notice the first problem, which is that the entire tree is owned
by root. kind and minikube don't like this:

2023-09-23T23:33:41.974998799+02:00 Failed to create /init.scope control group: Permission denied
2023-09-23T23:33:41.974998799+02:00 Failed to allocate manager object: Permission denied
2023-09-23T23:33:41.974998799+02:00 [!!!!!!] Failed to allocate manager object.
2023-09-23T23:33:41.974998799+02:00 Exiting PID 1...: container exited unexpectedly

The second problem is that kind and minikube both expect Delegate=yes to
be set. This is a systemd unit setting that lets these tools set cgroup
limits themselves. The controllers they expect to manage are cpu,
cpuset, memory and pids. We can force these privileges like so (as root,
since the file is writable only by root):

echo "+cpu +cpuset +memory +pids" >> /sys/fs/cgroup/cgroup.subtree_control

To fix the first problem we can run:

g=users && sudo chgrp -R ${g} /sys/fs/cgroup/
u=$USER && sudo chown -R ${u}: /sys/fs/cgroup

These aren't harmful actions, since all we're doing is changing the
cgroups file tree to be owned by our user and the users group.

Once we've addressed the first and second problems, the rest is
relatively easy. We need to make iptables available (and iptables'
modules too, so the package alone isn't enough: we need Guix's service).
We need to set a range of user IDs and group IDs for Podman to use
rootlessly. Finally, we need to set a container policy, otherwise Podman
can't pull any image from anywhere. All of those can be done from inside
our Guix System configuration file.

What I'd really like to see is some method for declaratively changing
the cgroups file tree and setting limit delegation, since otherwise
these actions need to be done on every boot. I don't have the Guile
skills to pull this off, but if someone fancied mentoring me, I'd be
happy to give it a shot. I have just enough ability to cobble together a
kind package from a binary (for shame, I know) and to edit the upstream
EXWM package to be based on a newer Emacs release. Otherwise, if there's
a method of declaring these already available, or someone else can take
a crack at this, please let me know!

Here's what that Guix System configuration looks like:

;; Rootless Podman requires the next 4 services
;; we're using the iptables service purely to make its resources
;; available to minikube and kind
(service iptables-service-type
         (iptables-configuration
          (ipv4-rules (plain-file "iptables.rules" "*filter
:INPUT ACCEPT
:FORWARD ACCEPT
:OUTPUT ACCEPT
COMMIT
"))
          (ipv6-rules (plain-file "ip6tables.rules" "*filter
:INPUT ACCEPT
:FORWARD ACCEPT
:OUTPUT ACCEPT
COMMIT
"))))
(simple-service 'etc-subuid etc-service-type
                (list `("subuid"
                        ,(plain-file "subuid"
                                     (string-append "root:0:65536\n"
                                                    username ":100000:65536\n")))))
(simple-service 'etc-subgid etc-service-type
                (list `("subgid"
                        ,(plain-file "subgid"
                                     (string-append "root:0:65536\n"
                                                    username ":100000:65536\n")))))
(service pam-limits-service-type
         (list (pam-limits-entry "*" 'both 'nofile 100000)))
(simple-service 'etc-container-policy etc-service-type
                (list `("containers/policy.json"
                        ,(plain-file "policy.json"
                                     "{\"default\": [{\"type\": \"insecureAcceptAnything\"}]}"))))
%my-services
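
Until something declarative exists, the two manual steps from earlier
(enabling the controllers in cgroup.subtree_control and handing the tree
to an unprivileged user) could be gathered into one helper. This is only
a sketch under assumptions: delegate_cgroup is a hypothetical name, not
something Guix or Podman provides, "alice" is a placeholder user, and it
would still have to be run as root on every boot.

```shell
#!/bin/sh
# Sketch only: delegate a cgroup v2 tree to an unprivileged user.
# delegate_cgroup is a hypothetical helper, not part of Guix or Podman.
# Usage: delegate_cgroup ROOT OWNER GROUP
delegate_cgroup() {
    root=$1
    owner=$2
    group=$3

    # Enable the controllers kind and minikube expect for child cgroups.
    printf '%s\n' '+cpu +cpuset +memory +pids' \
        > "$root/cgroup.subtree_control" || return 1

    # Hand the whole tree to the given user and group.
    chown -R "$owner:$group" "$root"
}

# On a real system, as root ("alice" is a placeholder user name):
#   delegate_cgroup /sys/fs/cgroup alice users
```

Taking the root directory as an argument keeps the helper easy to try
against a scratch directory before pointing it at /sys/fs/cgroup.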