all messages for Guix-related lists mirrored at yhetil.org
* networking service not starting with netlink-response-error errno:17
@ 2024-06-14 11:04 Giovanni Biscuolo
  2024-06-14 13:07 ` networking service not starting for a network-route setting (was for network with netlink-response-error errno:17) Giovanni Biscuolo
  2024-06-17 13:23 ` networking service not starting with netlink-response-error errno:17 Ludovic Courtès
  0 siblings, 2 replies; 4+ messages in thread
From: Giovanni Biscuolo @ 2024-06-14 11:04 UTC (permalink / raw)
  To: guix-devel


[-- Attachment #1.1: Type: text/plain, Size: 3125 bytes --]

Hello,

after a reboot of a remote host (it had been running for several Guix
System generations, but without any reboot in between) the networking
service fails, and consequently the ssh service (et al.) refuses to
start :-(

Sorry I've no text to show you but a screenshot (see attachment below)
because I'm connecting with a remote KVM console appliance.

The networking service is failing with this message (manually copied
here, please forgive mistakes):

--8<---------------cut here---------------start------------->8---

[...] 11:28 vmunix [...] shepherd [1]: Exception caught while starting
networking: (no-such-device "swws-bridge")


shepherd [1]: Exception caught while starting networking: (%exception
#<&netlink-response-error errno: 17>)

--8<---------------cut here---------------end--------------->8---

The strange thing is that all the configured interfaces: eno1


Please find below the relevant parts of the configuration of my host.

As you can see I've installed a libvirt daemon service (it is working)
with an autostarted (by libvirt) bridge interface named "swws-bridge":
I've tried stopping that bridge (virsh net-destroy...) but the
networking service keeps failing after a "herd restart networking"

--8<---------------cut here---------------start------------->8---

;; ------------------------------------
;; operating-system
(operating-system
  (locale "en_US.utf8")
  (timezone "Europe/Rome")
  (keyboard-layout (keyboard-layout "us"))
  (host-name "ane")


[...]


  (services
   (append (modify-services
	    %base-services
	    ;; base-services with modifications
	    (sysctl-service-type config =>
				 (sysctl-configuration
				  (settings (append '(("net.ipv4.ip_forward" . "1"))
						    %default-sysctl-settings)))))
           (list
            (service static-networking-service-type
        	     (list (static-networking
        		    (addresses (list (network-address
        				      (device ane-wan-device)
        				      (value (string-append ane-wan-ip4 "/24")))))
        		    (routes (list (network-route
        				   (destination "default")
        				   (gateway ane-wan-gateway))
					  ;; ip route add 10.1.2.0/24 dev swws-bridge via 192.168.133.12
					  (network-route
					   (destination "10.1.2.0/24")   ;; lxcbr0 net
					   (device swws-bridge-name)
					   (gateway "192.168.133.12")))) ;; on node002
        		    (name-servers '("185.12.64.1"
        				    "185.12.64.1")))))

	    (service ntp-service-type)

[...]

	    (service libvirt-service-type
		     (libvirt-configuration
		      (unix-sock-group "libvirt")
		      (tls-port "16555")))

	    (service virtlog-service-type
		     (virtlog-configuration
		      (max-clients 1000)
		      (max-size 5)
		      (max-backups 9)))

            (service openssh-service-type
        	     (openssh-configuration
        	      (port-number 22)
                      (password-authentication? #f)
                      (permit-root-login 'prohibit-password)

[...]

--8<---------------cut here---------------end--------------->8---

How can I debug this error?


Thanks, Gio'.



[-- Attachment #1.2: 20240614-ane-screenshot_1718359609964.png --]
[-- Type: image/png, Size: 1574555 bytes --]

[-- Attachment #1.3: Type: text/plain, Size: 55 bytes --]


-- 
Giovanni Biscuolo

Xelera IT Infrastructures



* Re: networking service not starting for a network-route setting (was for network with netlink-response-error errno:17)
  2024-06-14 11:04 networking service not starting with netlink-response-error errno:17 Giovanni Biscuolo
@ 2024-06-14 13:07 ` Giovanni Biscuolo
  2024-06-17 13:23 ` networking service not starting with netlink-response-error errno:17 Ludovic Courtès
  1 sibling, 0 replies; 4+ messages in thread
From: Giovanni Biscuolo @ 2024-06-14 13:07 UTC (permalink / raw)
  To: guix-devel

[-- Attachment #1: Type: text/plain, Size: 7107 bytes --]

Hello,

OK I've managed to fix my networking problem, here is how I did it...

Giovanni Biscuolo <g@xelera.eu> writes:

[...]

> The networking service is failing with this message (manually copied
> here, please forgive mistakes):

now that I can connect via SSH, I can copy the actual messages:

--8<---------------cut here---------------start------------->8---

Jun 14 11:28:32 localhost vmunix: [    6.258520] shepherd[1]: Starting service networking...
Jun 14 11:28:32 localhost vmunix: [    6.472949] shepherd[1]: Service networking failed to start.
Jun 14 11:28:32 localhost vmunix: [    6.474842] shepherd[1]: Exception caught while starting networking: (no-such-device "swws-bridge")
Jun 14 11:28:32 localhost vmunix: [    6.492344] shepherd[1]: Starting service networking...
Jun 14 11:28:32 localhost vmunix: [    6.509652] shepherd[1]: Exception caught while starting networking: (%exception #<&netlink-response-error errno: 17>)
Jun 14 11:28:32 localhost vmunix: [    6.510034] shepherd[1]: Service networking failed to start.

--8<---------------cut here---------------end--------------->8---

> The strange thing is that all the configured interfaces: eno1

I truncated the list, the actual list of interfaces was (and is):

--8<---------------cut here---------------start------------->8---

g@ane ~$ ip addre ls
1: lo: <LOOPBACK,MULTICAST,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope global lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether b4:2e:99:c5:cc:1c brd ff:ff:ff:ff:ff:ff
    inet 162.55.88.253/24 scope global eno1
       valid_lft forever preferred_lft forever
    inet6 fe80::b62e:99ff:fec5:cc1c/64 scope link 
       valid_lft forever preferred_lft forever
3: swws-bridge: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 52:54:00:9b:c6:63 brd ff:ff:ff:ff:ff:ff
    inet 192.168.133.1/24 brd 192.168.133.255 scope global swws-bridge
       valid_lft forever preferred_lft forever
4: vnet0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master swws-bridge state UNKNOWN group default qlen 1000
    link/ether fe:54:00:ff:e2:fd brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fc54:ff:feff:e2fd/64 scope link 
       valid_lft forever preferred_lft forever
5: vnet1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master swws-bridge state UNKNOWN group default qlen 1000
    link/ether fe:54:00:41:53:1e brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fc54:ff:fe41:531e/64 scope link 
       valid_lft forever preferred_lft forever
6: vnet2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master swws-bridge state UNKNOWN group default qlen 1000
    link/ether fe:54:00:3d:17:90 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fc54:ff:fe3d:1790/64 scope link 
       valid_lft forever preferred_lft forever
7: vnet3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master swws-bridge state UNKNOWN group default qlen 1000
    link/ether fe:54:00:64:81:8f brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fc54:ff:fe64:818f/64 scope link 
       valid_lft forever preferred_lft forever

--8<---------------cut here---------------end--------------->8---

> Please find below the relevant parts of the configuration of my host.
>
> As you can see I've installed a libvirt daemon service (it is working)
> with an autostarted (by libvirt) bridge interface named "swws-bridge"

[...]

> --8<---------------cut here---------------start------------->8---

[...]

sorry, I forgot to include some relevant definitions from the start of
my config.scm file:

(define ane-wan-device "eno1")
(define ane-wan-ip4 "162.55.88.253")
(define ane-wan-gateway "162.55.88.193")
(define swws-bridge-name "swws-bridge")

>            (list
>             (service static-networking-service-type
>         	     (list (static-networking
>         		    (addresses (list (network-address
>         				      (device ane-wan-device)
>         				      (value (string-append ane-wan-ip4 "/24")))))
>         		    (routes (list (network-route
>         				   (destination "default")
>         				   (gateway ane-wan-gateway))


the next one is the problematic part of my static-networking configuration:

> 					  ;; ip route add 10.1.2.0/24 dev swws-bridge via 192.168.133.12
> 					  (network-route
> 					   (destination "10.1.2.0/24")   ;; lxcbr0 net
> 					   (device swws-bridge-name)
> 					   (gateway "192.168.133.12"))))
>             ;; on node002

I've commented out this network-route part and now the networking
service is running fine at boot (and after a restart obviously)

I think that the "swws-bridge" interface being missing when
static-networking is activated blocks all further starts of the
networking service, including restarts after "swws-bridge" has been
created by the libvirtd service.

After the "swws-bridge" interface has been created this is the routing
table:

--8<---------------cut here---------------start------------->8---

g@ane ~$ ip route ls
default via 162.55.88.193 dev eno1 
162.55.88.0/24 dev eno1 proto kernel scope link src 162.55.88.253 
192.168.133.0/24 dev swws-bridge proto kernel scope link src 192.168.133.1 

--8<---------------cut here---------------end--------------->8---

Obviously if I "manually" add the route I'm able to ping hosts on the
10.1.2.0/24 network:

--8<---------------cut here---------------start------------->8---

g@ane ~$ sudo ip route add 10.1.2.0/24 dev swws-bridge via 192.168.133.12
g@ane ~$ ip route ls
default via 162.55.88.193 dev eno1 
10.1.2.0/24 via 192.168.133.12 dev swws-bridge 
162.55.88.0/24 dev eno1 proto kernel scope link src 162.55.88.253 
192.168.133.0/24 dev swws-bridge proto kernel scope link src 192.168.133.1 
g@ane ~$ ping 10.1.2.1
PING 10.1.2.1 (10.1.2.1): 56 data bytes
64 bytes from 10.1.2.1: icmp_seq=0 ttl=64 time=0.341 ms
64 bytes from 10.1.2.1: icmp_seq=1 ttl=64 time=0.232 ms
64 bytes from 10.1.2.1: icmp_seq=2 ttl=64 time=0.544 ms
^C--- 10.1.2.1 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.232/0.372/0.544/0.129 ms

--8<---------------cut here---------------end--------------->8---

...but I would like the route to be added automatically at boot time,
so that I do not have to remember to add it "manually" after a reboot.

How can I declare that "swws-bridge" is a dependency of the networking
service, and make that service wait for that interface to come up?

I know there is a (requirement ) field in static-networking but
"swws-bridge" is not a Shepherd service: do I have to use "libvirtd" as
my static-networking requirement?
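
The "wait for that interface to come up" idea can be sketched in plain
shell.  This is only an illustration under stated assumptions: the
function name wait_for_link and its timeout argument are hypothetical
(they are not part of Guix, the Shepherd or libvirt); the only external
command used is "ip link show dev NAME", which exits non-zero while the
device does not exist.

```shell
#!/bin/sh
# Hypothetical helper (not part of Guix or libvirt): wait until a
# network interface exists, polling once per second.
# Usage: wait_for_link NAME [TIMEOUT_SECONDS]   (default timeout: 30)
wait_for_link() {
    name=$1
    timeout=${2:-30}
    elapsed=0
    # "ip link show dev NAME" fails as long as the device is missing.
    until ip link show dev "$name" >/dev/null 2>&1; do
        if [ "$elapsed" -ge "$timeout" ]; then
            echo "timed out waiting for $name" >&2
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}

# Example (as root), with the values used in this thread:
#   wait_for_link swws-bridge 60 &&
#       ip route add 10.1.2.0/24 dev swws-bridge via 192.168.133.12
```

Such a script would have to run after libvirtd has started its
autostarted networks; how to hook it into the boot sequence is left
open here.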

[...]

Happy hacking! Gio'

-- 
Giovanni Biscuolo

Xelera IT Infrastructures



* Re: networking service not starting with netlink-response-error errno:17
  2024-06-14 11:04 networking service not starting with netlink-response-error errno:17 Giovanni Biscuolo
  2024-06-14 13:07 ` networking service not starting for a network-route setting (was for network with netlink-response-error errno:17) Giovanni Biscuolo
@ 2024-06-17 13:23 ` Ludovic Courtès
  2024-06-17 15:12   ` Giovanni Biscuolo
  1 sibling, 1 reply; 4+ messages in thread
From: Ludovic Courtès @ 2024-06-17 13:23 UTC (permalink / raw)
  To: Giovanni Biscuolo; +Cc: guix-devel

Hi Giovanni,

Giovanni Biscuolo <g@xelera.eu> skribis:

> after a reboot on a running remote host (it was running since several
> guix system generations ago... but with no reboots meanwhile) I get a
> failing networking service and consequently the ssh service (et al)
> refuses to start :-(
>
> Sorry I've no text to show you but a screenshot (see attachment below)
> because I'm connecting with a remote KVM console appliance.
>
> The networking service is failing with this message (manually copied
> here, please forgive mistakes):
>
>
> [...] 11:28 vmunix [...] shepherd [1]: Exception caught while starting
> networking: (no-such-device "swws-bridge")
>
>
> shepherd [1]: Exception caught while starting networking. (%exception
> #<&netlink-response-error errno: 17>)

17 = EEXIST, which is netlink’s way of saying that the device/route/link
it’s trying to add already exists.
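
For reference, a minimal sketch of how such numbers can be decoded by
hand.  The function errno_message is hypothetical and covers only a few
values from the Linux errno-base table; the messages are the usual
strerror() texts, the same text "ip" prints after "RTNETLINK answers:".

```shell
#!/bin/sh
# Hypothetical lookup table for a few Linux errno values
# (errno-base numbering); not an exhaustive list.
errno_message() {
    case "$1" in
        1)  echo "EPERM: Operation not permitted" ;;
        2)  echo "ENOENT: No such file or directory" ;;
        17) echo "EEXIST: File exists" ;;
        19) echo "ENODEV: No such device" ;;
        22) echo "EINVAL: Invalid argument" ;;
        *)  echo "errno $1: not in this table" ;;
    esac
}

# Example: decode the number from "netlink-response-error errno: 17"
#   errno_message 17
```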

The problem here is that static networking adds devices, routes, and
links (see ‘network-set-up/linux’ in the code).  If it fails in the
middle, then it may have added devices without adding routes, so you end
up with half-configured networking.  Ideally this would be
transactional.

When that happens, you need to check the logs and use the ‘ip’ command
to figure out which part failed exactly.  In your case, the root problem
seems to be that “swws-bridge” did not exist.

Then you can (1) manually fix it with ‘ip’, and (2) adjust your Guix
System config to fix the problems you found.
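
The "figure out which part failed" step can be sketched as a comparison
between the routes the configuration declares and what "ip route show"
reports.  This is an illustration only: missing_routes is a
hypothetical helper, and it matches on the first field of each route
line, which is how "ip route show" prints the destination.

```shell
#!/bin/sh
# Hypothetical helper: read expected route destinations on stdin, one
# per line, and print those that are absent from the routing table.
missing_routes() {
    table=$(ip route show)
    while read -r dest; do
        [ -n "$dest" ] || continue
        # The destination is the first field of each "ip route" line.
        printf '%s\n' "$table" |
            awk -v d="$dest" '$1 == d { found = 1 } END { exit !found }' ||
            echo "$dest"
    done
}

# Example, with the two destinations declared in this thread:
#   printf '%s\n' default 10.1.2.0/24 | missing_routes
```

Any destination it prints was declared but never made it into the
kernel routing table, which narrows down where the set-up stopped.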

This is inconvenient at best.  I would be interested in hearing
suggestions on how to improve on this.

HTH,
Ludo’.



* Re: networking service not starting with netlink-response-error errno:17
  2024-06-17 13:23 ` networking service not starting with netlink-response-error errno:17 Ludovic Courtès
@ 2024-06-17 15:12   ` Giovanni Biscuolo
  0 siblings, 0 replies; 4+ messages in thread
From: Giovanni Biscuolo @ 2024-06-17 15:12 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: guix-devel, Julien Lepiller

[-- Attachment #1: Type: text/plain, Size: 6096 bytes --]

Hi Ludovic,

executive summary: it is (was) a "network architecture" mistake on my
side, since I was mixing a device with static networking defined via
Guix with a bridge defined via libvirt... and this is not good.  The
more I think about it, the more I'm convinced that trying to add a route
for device "swws-bridge" (see below) in the "eno1" [1] static-networking
declaration is simply a... mistake.

Julien, I'm adding you in Cc: only because you develop guile-netlink and
maybe you could see whether it's possible to improve netlink-related
error messages.

Ludovic Courtès <ludo@gnu.org> writes:

> Giovanni Biscuolo <g@xelera.eu> skribis:
>
>> after a reboot on a running remote host (it was running since several
>> guix system generations ago... but with no reboots meanwhile) I get a
>> failing networking service and consequently the ssh service (et al)
>> refuses to start :-(
>>
>> Sorry I've no text to show you but a screenshot (see attachment below)
>> because I'm connecting with a remote KVM console appliance.

In a follow-up message I was then able to copy the actual error message:

--8<---------------cut here---------------start------------->8---

Jun 14 11:28:32 localhost vmunix: [    6.258520] shepherd[1]: Starting service
networking...
Jun 14 11:28:32 localhost vmunix: [    6.472949] shepherd[1]: Service networking failed to
start.
Jun 14 11:28:32 localhost vmunix: [    6.474842] shepherd[1]: Exception caught while
starting networking: (no-such-device "swws-bridge")
Jun 14 11:28:32 localhost vmunix: [    6.492344] shepherd[1]: Starting service
networking...
Jun 14 11:28:32 localhost vmunix: [    6.509652] shepherd[1]: Exception caught while
starting networking: (%exception #<&netlink-response-error errno: 17>)
Jun 14 11:28:32 localhost vmunix: [    6.510034] shepherd[1]: Service networking failed to
start.

--8<---------------cut here---------------end--------------->8---

Then (in the same message) I described how I was able to solve my issue,
this is the "core" of my configuration _mistake:_

--8<---------------cut here---------------start------------->8---

            (service static-networking-service-type
        	     (list (static-networking
        		    (addresses (list (network-address
        				      (device ane-wan-device)
        				      (value (string-append ane-wan-ip4 "/24")))))
        		    (routes (list (network-route
        				   (destination "default")
        				   (gateway ane-wan-gateway))))
					  ;; ip route add 10.1.2.0/24 dev swws-bridge via 192.168.133.12
					  ;; (network-route
					  ;;  (destination "10.1.2.0/24")   ;; lxcbr0 net
					  ;;  (device swws-bridge-name)
					  ;;  (gateway "192.168.133.12")))) ;; on node002
        		    (name-servers '("185.12.64.1"
        				    "185.12.64.1")))))

--8<---------------cut here---------------end--------------->8---

I commented out the second network-route definition, the one using
"swws-bridge" [1] as device to route to 10.1.2.0/24 via 192.168.133.12.

With that code, AFAIU, the first time shepherd tried to start the
networking service it failed because "swws-bridge" was missing and
(guile-)netlink failed with "no-such-device"; then it tried again and
failed because the very same route was already defined (but not
functional).

A failing networking service (although the interface is up and running)
means that ssh (et al.) fails to start, because networking is a
requirement of ssh.

> 17 = EEXIST, which is netlink’s way of saying that the device/route/link
> it’s trying to add already exists.

Ah thanks!  I was not able to find that error code.

When run on the command line I get:

--8<---------------cut here---------------start------------->8---

g@ane ~$ sudo ip route add 10.1.2.0/24 dev swws-bridge via 192.168.133.12
RTNETLINK answers: File exists

--8<---------------cut here---------------end--------------->8---

Is it possible to have the same error and/or a little bit of context in
syslog when this happens with 'network-set-up/linux'?

Anyway, I think that "ip route" should just be idempotent... but maybe
I'm missing something. (and this is obviously not a downstream issue)
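
On the command line, "ip route replace" already behaves this way: it
adds the route if it is absent and overwrites it otherwise, so running
it twice does not end in "File exists".  A minimal sketch, where
ensure_route is a hypothetical wrapper name and the arguments are the
ones from this thread:

```shell
#!/bin/sh
# Hypothetical wrapper around "ip route replace" (run as root).
# Usage: ensure_route DESTINATION GATEWAY DEVICE
ensure_route() {
    dest=$1
    gateway=$2
    dev=$3
    # "replace" creates the route or changes the existing one, so
    # repeated calls succeed instead of returning EEXIST.
    ip route replace "$dest" via "$gateway" dev "$dev"
}

# Example:
#   ensure_route 10.1.2.0/24 192.168.133.12 swws-bridge
```

Note that this only covers the EEXIST case: the command still fails
when the device itself does not exist.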

> The problem here is that static networking adds devices, routes, and
> links (see ‘network-set-up/linux’ in the code).  If it fails in the
> middle, then it may have added devices without adding routes, so you end
> up with half-configured networking.  Ideally this would be
> transactional.

Well, actually it would be a pity to fail a whole static-networking
"just" for a failing /secondary/ route, no?

But as I said in the "executive summary", how could I /dare/ to
declaratively add (with Guix System) such a route for "swws-bridge"
when "swws-bridge" is managed by libvirt?

I should simply use libvirt to add that! :-)
https://libvirt.org/formatnetwork.html#static-routes

> When that happens, you need to check the logs and use the ‘ip’ command
> to figure out which part failed exactly.  In your case, the root problem
> seems to be that “swws-bridge” did not exist.

Yes, I can confirm this

> Then you can (1) manually fix it with ‘ip’, and (2) adjust your Guix
> System config to fix the problems you found.
>
> This is inconvenient at best.  I would be interested in hearing
> suggestions on how to improve on this.

Oh well, for my use-case I don't think there is anything to improve:
I just have to keep the "eno1" device configuration _separate_ from the
"swws-bridge" one (even if "swws-bridge" was defined via static-network
and not libvirt).

The only suggestion I have is to add more "user friendly" error
messages to syslog for netlink-related errors: it would have helped me
more to read "adding route, RTNETLINK answers: File exists" than
"netlink-response-error errno: 17".

Thank you and... happy hacking! Gio'


[1] swws-bridge-name is defined as "swws-bridge"
    ane-wan-device is defined as "eno1"    

-- 
Giovanni Biscuolo

Xelera IT Infrastructures


