Related to this other thread, tagging @fabian as requested.
I currently have a cluster with 13 nodes running. Everything is updated to the latest versions, except for the kernel, which is pinned to 5.13.19-6-pve on all nodes because of some issues with live migration between different CPUs. All the nodes are HPE DL360gen9 or DL360gen10.
Yesterday I added the 14th node to the cluster, and as soon as I clicked "Join cluster" every other node rebooted, bringing down hundreds of VMs.
Of course I shut down the new node, and after the reboot all the nodes resumed working as if nothing had happened. Even Ceph realigned within seconds (phew!).
Today, as soon as I reconnected the management interface of the new node (vmbr0 on eno1), all the other nodes rebooted again.
I'm using dedicated network switches for the management interfaces and the average ping between nodes is < 1 ms. Also, corosync itself is configured with redundant links, so even if all the management interfaces were down, the nodes should still be able to see each other through the two other interfaces. Shouldn't they?
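For clarity, a redundant-link node entry in corosync.conf looks roughly like this (node name and addresses below are illustrative placeholders, not my real values; the actual file is attached as corosync.conf.txt):

node {
  name: proxnode01
  nodeid: 1
  quorum_votes: 1
  # link 0: management network (placeholder address)
  ring0_addr: 192.0.2.11
  # link 1: second, independent network (placeholder address)
  ring1_addr: 198.51.100.11
  # link 2: third, independent network (placeholder address)
  ring2_addr: 203.0.113.11
}

With the knet transport (the default on corosync 3.x), the cluster should fail over between these links automatically if one of them goes down.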
So I can't really figure out what's happening. To me, it looks like something completely broke the network stack on the nodes, but why?
Someone with the exact same problem suggested [reddit.com] removing the management IP from the default bridge and setting it directly on the NIC. I can do that, but does it really make sense?
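To make that suggestion concrete, this is roughly what the change would look like in /etc/network/interfaces (addresses here are placeholders; my current configs are attached as interfaces_proxnode01/02/17.txt):

Current (management IP on the bridge):

auto eno1
iface eno1 inet manual

auto vmbr0
iface vmbr0 inet static
    address 192.0.2.11/24
    gateway 192.0.2.254
    bridge-ports eno1
    bridge-stp off
    bridge-fd 0

Suggested (management IP directly on the NIC, no bridge):

auto eno1
iface eno1 inet static
    address 192.0.2.11/24
    gateway 192.0.2.254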
Consider that:
- nodes are numbered proxnode01 to proxnode11 and proxnode16 to proxnode18 (because they are physically installed in two distinct racks, each sized for 15 nodes), so corosync node #14 is actually proxnode18
- I'm confident that there are no hardware issues with the new node, because I repurposed it from another rack where it had been doing virtualization for two years, up until two days ago
- the Ceph networks (172.27.0.x "ceph-public" and 172.28.0.x "ceph-cluster") are handled by two 10G switches using LACP with dedicated stacking on a 100G link, so they can be considered one large redundant switch; for this reason I'm using VLANs on a LACP bond (a rough sketch of this layout follows the list)
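For reference, the bond/VLAN layout on each node looks roughly like this (slave interface names, VLAN IDs and host addresses are illustrative, not my real values; the real files are attached as interfaces_proxnode01/02/17.txt):

auto bond0
iface bond0 inet manual
    bond-slaves enp3s0f0 enp3s0f1
    bond-mode 802.3ad
    bond-miimon 100
    bond-xmit-hash-policy layer3+4

# ceph-public
auto bond0.27
iface bond0.27 inet static
    address 172.27.0.11/24

# ceph-cluster
auto bond0.28
iface bond0.28 inet static
    address 172.28.0.11/24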
As you can see from the attached syslogs, all the nodes lost connectivity with the cluster at 09:19:45 and rebooted two minutes later.
Edit: please ignore the "500 Can't connect to 172.29.0.193:8007 (No route to host)" errors in the logs, as I purposely detached the switch for the backup network to avoid any risk of network loops.
Thank you!
Attachments:
- interfaces_proxnode01.txt (1.6 KB)
- network_proxnode03.png (308.5 KB)
- uname_pveversion_proxnode18.png (132.1 KB)
- uname_pveversion_all_nodes.txt (21.2 KB)
- syslog_proxnode02.txt (6.2 KB)
- syslog_proxnode01.txt (6.1 KB)
- interfaces_proxnode17.txt (1.4 KB)
- interfaces_proxnode02.txt (1.5 KB)
- corosync.conf.txt (2.4 KB)
- syslog_proxnode17.txt (6.9 KB)