* networking service not starting with netlink-response-error errno:17
@ 2024-06-14 11:04 Giovanni Biscuolo
  2024-06-14 13:07 ` networking service not starting for a network-route setting (was for network with netlink-response-error errno:17) Giovanni Biscuolo
  2024-06-17 13:23 ` networking service not starting with netlink-response-error errno:17 Ludovic Courtès
From: Giovanni Biscuolo @ 2024-06-14 11:04 UTC
To: guix-devel

Hello,

after a reboot on a running remote host (it had been running for several
Guix System generations, with no reboots in between) I get a failing
networking service, and consequently the ssh service (et al.) refuses to
start :-(

Sorry, I have no text to show you, only a screenshot (see attachment
below), because I'm connecting with a remote KVM console appliance.

The networking service is failing with this message (manually copied
here, please forgive mistakes):

--8<---------------cut here---------------start------------->8---
[...] 11:28 vmunix [...] shepherd [1]: Exception caught while starting networking: (no-such-device "swws-bridge")

shepherd [1]: Exception caught while starting networking. (%exception #<&netlink-response-error errno: 17>)
--8<---------------cut here---------------end--------------->8---

The strange thing is that all the configured interfaces: eno1

Please find below the relevant parts of the configuration of my host.

As you can see I've installed a libvirt daemon service (it is working)
with an autostarted (by libvirt) bridge interface named "swws-bridge":
I've tried stopping that bridge (virsh net-destroy...) but the
networking service keeps failing after a "herd restart networking"

--8<---------------cut here---------------start------------->8---
;; ------------------------------------
;; operating-system
(operating-system
  (locale "en_US.utf8")
  (timezone "Europe/Rome")
  (keyboard-layout (keyboard-layout "us"))
  (host-name "ane")

  [...]

  (services
   (append
    (modify-services %base-services ;; base-services with modifications
      (sysctl-service-type
       config =>
       (sysctl-configuration
        (settings (append '(("net.ipv4.ip_forward" . "1"))
                          %default-sysctl-settings)))))

    (list
     (service static-networking-service-type
              (list (static-networking
                     (addresses (list (network-address
                                       (device ane-wan-device)
                                       (value (string-append ane-wan-ip4 "/24")))))
                     (routes (list (network-route
                                    (destination "default")
                                    (gateway ane-wan-gateway))
                                   ;; ip route add 10.1.2.0/24 dev swws-bridge via 192.168.133.12
                                   (network-route
                                    (destination "10.1.2.0/24") ;; lxcbr0 net
                                    (device swws-bridge-name)
                                    (gateway "192.168.133.12")))) ;; on node002
                     (name-servers '("185.12.64.1" "185.12.64.1")))))

     (service ntp-service-type)

     [...]

     (service libvirt-service-type
              (libvirt-configuration
               (unix-sock-group "libvirt")
               (tls-port "16555")))

     (service virtlog-service-type
              (virtlog-configuration
               (max-clients 1000)
               (max-size 5)
               (max-backups 9)))

     (service openssh-service-type
              (openssh-configuration
               (port-number 22)
               (password-authentication? #f)
               (permit-root-login 'prohibit-password)

     [...]
--8<---------------cut here---------------end--------------->8---

Please, how can I debug this error?

Thanks, Gio'.

[-- Attachment #1.2: 20240614-ane-screenshot_1718359609964.png --]
[-- Type: image/png, Size: 1574555 bytes --]

-- 
Giovanni Biscuolo

Xelera IT Infrastructures

* Re: networking service not starting for a network-route setting (was for network with netlink-response-error errno:17)
From: Giovanni Biscuolo @ 2024-06-14 13:07 UTC
To: guix-devel

Hello,

OK, I've managed to fix my networking problem, here is how I did it...

Giovanni Biscuolo <g@xelera.eu> writes:

[...]

> The networking service is failing with this message (manually copied
> here, please forgive mistakes):

now that I can connect via SSH, I can copy the actual messages:

--8<---------------cut here---------------start------------->8---
Jun 14 11:28:32 localhost vmunix: [ 6.258520] shepherd[1]: Starting service networking...
Jun 14 11:28:32 localhost vmunix: [ 6.472949] shepherd[1]: Service networking failed to start.
Jun 14 11:28:32 localhost vmunix: [ 6.474842] shepherd[1]: Exception caught while starting networking: (no-such-device "swws-bridge")
Jun 14 11:28:32 localhost vmunix: [ 6.492344] shepherd[1]: Starting service networking...
Jun 14 11:28:32 localhost vmunix: [ 6.509652] shepherd[1]: Exception caught while starting networking: (%exception #<&netlink-response-error errno: 17>)
Jun 14 11:28:32 localhost vmunix: [ 6.510034] shepherd[1]: Service networking failed to start.
--8<---------------cut here---------------end--------------->8---

> The strange thing is that all the configured interfaces: eno1

I truncated the list, the actual list of interfaces was (and is):

--8<---------------cut here---------------start------------->8---
g@ane ~$ ip addre ls
1: lo: <LOOPBACK,MULTICAST,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope global lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether b4:2e:99:c5:cc:1c brd ff:ff:ff:ff:ff:ff
    inet 162.55.88.253/24 scope global eno1
       valid_lft forever preferred_lft forever
    inet6 fe80::b62e:99ff:fec5:cc1c/64 scope link
       valid_lft forever preferred_lft forever
3: swws-bridge: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 52:54:00:9b:c6:63 brd ff:ff:ff:ff:ff:ff
    inet 192.168.133.1/24 brd 192.168.133.255 scope global swws-bridge
       valid_lft forever preferred_lft forever
4: vnet0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master swws-bridge state UNKNOWN group default qlen 1000
    link/ether fe:54:00:ff:e2:fd brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fc54:ff:feff:e2fd/64 scope link
       valid_lft forever preferred_lft forever
5: vnet1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master swws-bridge state UNKNOWN group default qlen 1000
    link/ether fe:54:00:41:53:1e brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fc54:ff:fe41:531e/64 scope link
       valid_lft forever preferred_lft forever
6: vnet2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master swws-bridge state UNKNOWN group default qlen 1000
    link/ether fe:54:00:3d:17:90 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fc54:ff:fe3d:1790/64 scope link
       valid_lft forever preferred_lft forever
7: vnet3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master swws-bridge state UNKNOWN group default qlen 1000
    link/ether fe:54:00:64:81:8f brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fc54:ff:fe64:818f/64 scope link
       valid_lft forever preferred_lft forever
--8<---------------cut here---------------end--------------->8---

> Please find below the relevant parts of the configuration of my host.
>
> As you can see I've installed a libvirt daemon service (it is working)
> with an autostarted (by libvirt) bridge interface named "swws-bridge"

[...]

> --8<---------------cut here---------------start------------->8---

[...]

sorry, I forgot to add some relevant definitions I have at the start of
my config.scm file:

(define ane-wan-device "eno1")
(define ane-wan-ip4 "162.55.88.253")
(define ane-wan-gateway "162.55.88.193")
(define swws-bridge-name "swws-bridge")

> (list
>  (service static-networking-service-type
>           (list (static-networking
>                  (addresses (list (network-address
>                                    (device ane-wan-device)
>                                    (value (string-append ane-wan-ip4 "/24")))))
>                  (routes (list (network-route
>                                 (destination "default")
>                                 (gateway ane-wan-gateway))

the next one is the problematic part of my static-networking
configuration:

>                                ;; ip route add 10.1.2.0/24 dev swws-bridge via 192.168.133.12
>                                (network-route
>                                 (destination "10.1.2.0/24") ;; lxcbr0 net
>                                 (device swws-bridge-name)
>                                 (gateway "192.168.133.12")))) ;; on node002

I've commented out this network-route part and now the networking
service is running fine at boot (and after a restart, obviously).

I think that the "swws-bridge" interface being missing when the static
networking is activated blocks all further startup of the networking
service, including restarts after "swws-bridge" has been created by the
libvirtd service.

After the "swws-bridge" interface has been created this is the routing
table:

--8<---------------cut here---------------start------------->8---
g@ane ~$ ip route ls
default via 162.55.88.193 dev eno1
162.55.88.0/24 dev eno1 proto kernel scope link src 162.55.88.253
192.168.133.0/24 dev swws-bridge proto kernel scope link src 192.168.133.1
--8<---------------cut here---------------end--------------->8---

Obviously, if I "manually" add the route I'm able to ping hosts on the
10.1.2.0/24 network:

--8<---------------cut here---------------start------------->8---
g@ane ~$ sudo ip route add 10.1.2.0/24 dev swws-bridge via 192.168.133.12
g@ane ~$ ip route ls
default via 162.55.88.193 dev eno1
10.1.2.0/24 via 192.168.133.12 dev swws-bridge
162.55.88.0/24 dev eno1 proto kernel scope link src 162.55.88.253
192.168.133.0/24 dev swws-bridge proto kernel scope link src 192.168.133.1
g@ane ~$ ping 10.1.2.1
PING 10.1.2.1 (10.1.2.1): 56 data bytes
64 bytes from 10.1.2.1: icmp_seq=0 ttl=64 time=0.341 ms
64 bytes from 10.1.2.1: icmp_seq=1 ttl=64 time=0.232 ms
64 bytes from 10.1.2.1: icmp_seq=2 ttl=64 time=0.544 ms
^C--- 10.1.2.1 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.232/0.372/0.544/0.129 ms
--8<---------------cut here---------------end--------------->8---

...but I would like the route to be added automatically at boot time, so
that I do not have to remember to add it "manually" after a reboot.

Please, how can I specify that "swws-bridge" is a dependency of the
networking service and make that service wait for that interface to come
up?

I know there is a (requirement ) field in static-networking, but
"swws-bridge" is not a Shepherd service: do I have to use "libvirtd" as
my static-networking requirement?

[...]

Happy hacking! Gio'

-- 
Giovanni Biscuolo

Xelera IT Infrastructures
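
One possible answer to the question above, as a minimal and untested sketch: declare the bridge route in its own static-networking record, with its own provision name and a requirement on "libvirtd". That way a failure of the bridge route would not prevent the "networking" service (and therefore ssh) from starting. The name swws-bridge-route is made up for illustration, and the sketch assumes that libvirt has already created "swws-bridge" by the time the libvirtd Shepherd service counts as started, which is not guaranteed.

```scheme
;; Sketch only, not a tested configuration.  It reuses the definitions
;; from config.scm quoted above (ane-wan-device, ane-wan-ip4,
;; ane-wan-gateway, swws-bridge-name).
(service static-networking-service-type
         (list
          ;; WAN interface: provides the default 'networking service.
          (static-networking
           (addresses (list (network-address
                             (device ane-wan-device)
                             (value (string-append ane-wan-ip4 "/24")))))
           (routes (list (network-route
                          (destination "default")
                          (gateway ane-wan-gateway))))
           (name-servers '("185.12.64.1" "185.12.64.1")))
          ;; Route through the libvirt-managed bridge, kept separate.
          (static-networking
           ;; Hypothetical service name, chosen for this example.
           (provision '(swws-bridge-route))
           ;; Assumption: the bridge exists once libvirtd has started.
           (requirement '(networking libvirtd))
           (addresses '())
           (routes (list (network-route
                          (destination "10.1.2.0/24")
                          (device swws-bridge-name)
                          (gateway "192.168.133.12")))))))
```

If the bridge is created too late for this to work, the route still has to be added by whatever manages the bridge.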

* Re: networking service not starting with netlink-response-error errno:17
From: Ludovic Courtès @ 2024-06-17 13:23 UTC
To: Giovanni Biscuolo; +Cc: guix-devel

Hi Giovanni,

Giovanni Biscuolo <g@xelera.eu> writes:

> after a reboot on a running remote host (it was running since several
> guix system generations ago... but with no reboots meanwhile) I get a
> failing networking service and consequently the ssh service (et al)
> refuses to start :-(
>
> Sorry I've no text to show you but a screenshot (see attachment below)
> because I'm connecting with a remote KVM console appliance.
>
> The networking service is failing with this message (manually copied
> here, please forgive mistakes):
>
>
> [...] 11:28 vmunix [...] shepherd [1]: Exception caught while starting
> networking: (no-such-device "swws-bridge")
>
>
> shepherd [1]: Exception caught while starting networking. (%exception
> #<&netlink-response-error errno: 17>)

17 = EEXIST, which is netlink’s way of saying that the device/route/link
it’s trying to add already exists.

The problem here is that static networking adds devices, routes, and
links (see ‘network-set-up/linux’ in the code). If it fails in the
middle, then it may have added devices without adding routes, so you end
up with half-configured networking. Ideally this would be
transactional.

When that happens, you need to check the logs and use the ‘ip’ command
to figure out which part failed exactly. In your case, the root problem
seems to be that “swws-bridge” did not exist.

Then you can (1) manually fix it with ‘ip’, and (2) adjust your Guix
System config to fix the problems you found.

This is inconvenient at best. I would be interested in hearing
suggestions on how to improve on this.

HTH,
Ludo’.
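
The errno lookup described above can be reproduced from a Guile REPL. This is a small sketch that uses only Guile's built-in EEXIST variable and strerror procedure; the exact message text comes from the C library and depends on the locale.

```scheme
;; Guile exposes errno values as variables; on Linux, EEXIST is 17.
(display EEXIST)
(newline)

;; strerror returns the C library's description of an errno value,
;; the same text that 'ip' prints after "RTNETLINK answers:".
(display (strerror 17))
(newline)
```

On a GNU/Linux system with an English locale the second line should read "File exists".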

* Re: networking service not starting with netlink-response-error errno:17
From: Giovanni Biscuolo @ 2024-06-17 15:12 UTC
To: Ludovic Courtès; +Cc: guix-devel, Julien Lepiller

Hi Ludovic,

executive summary: it is (was) a "network architecture" mistake on my
side, since I was mixing a device with static networking defined via
Guix with a bridge defined via libvirt... and this is not good.

The more I think about it, the more I'm convinced that trying to add a
route for device "swws-bridge" (see below) in the "eno1" [1]
static-networking declaration is simply a... mistake.

Julien, I'm adding you in Cc: only because you develop guile-netlink and
maybe you could see if it's possible to improve netlink-related error
messages.

Ludovic Courtès <ludo@gnu.org> writes:

> Giovanni Biscuolo <g@xelera.eu> writes:
>
>> after a reboot on a running remote host (it was running since several
>> guix system generations ago... but with no reboots meanwhile) I get a
>> failing networking service and consequently the ssh service (et al)
>> refuses to start :-(
>>
>> Sorry I've no text to show you but a screenshot (see attachment below)
>> because I'm connecting with a remote KVM console appliance.

In a follow-up message I was then able to copy the actual error message:

--8<---------------cut here---------------start------------->8---
Jun 14 11:28:32 localhost vmunix: [ 6.258520] shepherd[1]: Starting service networking...
Jun 14 11:28:32 localhost vmunix: [ 6.472949] shepherd[1]: Service networking failed to start.
Jun 14 11:28:32 localhost vmunix: [ 6.474842] shepherd[1]: Exception caught while starting networking: (no-such-device "swws-bridge")
Jun 14 11:28:32 localhost vmunix: [ 6.492344] shepherd[1]: Starting service networking...
Jun 14 11:28:32 localhost vmunix: [ 6.509652] shepherd[1]: Exception caught while starting networking: (%exception #<&netlink-response-error errno: 17>)
Jun 14 11:28:32 localhost vmunix: [ 6.510034] shepherd[1]: Service networking failed to start.
--8<---------------cut here---------------end--------------->8---

Then (in the same message) I described how I was able to solve my issue;
this is the "core" of my configuration _mistake_:

--8<---------------cut here---------------start------------->8---
(service static-networking-service-type
         (list (static-networking
                (addresses (list (network-address
                                  (device ane-wan-device)
                                  (value (string-append ane-wan-ip4 "/24")))))
                (routes (list (network-route
                               (destination "default")
                               (gateway ane-wan-gateway))))
                ;; ip route add 10.1.2.0/24 dev swws-bridge via 192.168.133.12
                ;; (network-route
                ;;  (destination "10.1.2.0/24") ;; lxcbr0 net
                ;;  (device swws-bridge-name)
                ;;  (gateway "192.168.133.12")))) ;; on node002
                (name-servers '("185.12.64.1" "185.12.64.1")))))
--8<---------------cut here---------------end--------------->8---

I commented out the second network-route definition, the one using
"swws-bridge" [1] as device to route to 10.1.2.0/24 via 192.168.133.12.

When I used that code, AFAIU the first time shepherd tried to start the
networking service it failed because "swws-bridge" was missing and
(guile-)netlink failed with "no-such-device"; then it tried again but
failed because the very same route was already defined (but not
functional).

A failing networking service (although the interface is up and running)
means that ssh (et al.) fails to start, because networking is an ssh
requirement.

> 17 = EEXIST, which is netlink’s way of saying that the device/route/link
> it’s trying to add already exists.

Ah thanks! I was not able to find that error code.

When run on the command line I get:

--8<---------------cut here---------------start------------->8---
g@ane ~$ sudo ip route add 10.1.2.0/24 dev swws-bridge via 192.168.133.12
RTNETLINK answers: File exists
--8<---------------cut here---------------end--------------->8---

Is it possible to have the same error and/or a little bit of context in
syslog when this happens with 'network-set-up/linux'?

Anyway, I think that "ip route" should just be idempotent... but maybe
I'm missing something. (And this is obviously not a downstream issue.)

> The problem here is that static networking adds devices, routes, and
> links (see ‘network-set-up/linux’ in the code). If it fails in the
> middle, then it may have added devices without adding routes, so you end
> up with half-configured networking. Ideally this would be
> transactional.

Well, actually it would be a pity to fail a whole static-networking
"just" for a failing /secondary/ route, no?

But as I said in the "executive summary", how could I /dare/ to
declaratively add (with Guix System) a similar route for "swws-bridge"
when "swws-bridge" is managed by libvirt? I should simply use libvirt
to add that! :-)

https://libvirt.org/formatnetwork.html#static-routes

> When that happens, you need to check the logs and use the ‘ip’ command
> to figure out which part failed exactly. In your case, the root problem
> seems to be that “swws-bridge” did not exist.

Yes, I can confirm this.

> Then you can (1) manually fix it with ‘ip’, and (2) adjust your Guix
> System config to fix the problems you found.
>
> This is inconvenient at best. I would be interested in hearing
> suggestions on how to improve on this.

Oh well, for my use case I don't think there is anything to improve: I
just have to keep the "eno1" device configuration _separate_ from the
"swws-bridge" one (even if "swws-bridge" was defined via
static-networking and not libvirt).

The only suggestion I have is to add more "user friendly" error
messages in syslog for netlink-related errors; it would have helped me
more to read "adding route, RTNETLINK answers: File exists" than
"netlink-response-error errno: 17".

Thank you and... happy hacking! Gio'

[1] swws-bridge-name is defined as "swws-bridge"
    ane-wan-device is defined as "eno1"

-- 
Giovanni Biscuolo

Xelera IT Infrastructures
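
To illustrate the kind of message suggested above, here is a minimal sketch in Guile. The procedure describe-netlink-failure is a hypothetical helper written for this example; it is not part of Guix or guile-netlink, and it only shows how an errno could be turned into text with Guile's built-in strerror.

```scheme
;; Hypothetical helper, not part of Guix or guile-netlink: turn a raw
;; errno into a message similar to what the 'ip' command prints.
;; ACTION is a short description such as "adding route".
(define (describe-netlink-failure action errno)
  (format #f "~a, RTNETLINK answers: ~a (errno ~a)"
          action (strerror errno) errno))

;; Example with the errno from the log above.
(display (describe-netlink-failure "adding route" 17))
(newline)
```

The message text after "RTNETLINK answers:" comes from the C library, so it depends on the system locale.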