Sudden reboot of multiple nodes while adding a new node

alex_ca

New Member
May 20, 2022
Hi All,

For the last year I've been running a Proxmox cluster with 7 nodes, and except for some minor issues, everything went fine until yesterday.

Yesterday, while adding a new node to the cluster (something I have obviously done several times before), three of the 7 existing nodes suddenly rebooted for no apparent reason. This led to a temporary disaster in which all our VMs were down for several minutes.

At this point I'm still struggling to understand what happened, and I'm a bit stuck because I don't feel comfortable attempting to add the new node to the cluster again.

I'm attaching the syslogs from the joining node and from one of the nodes that suffered the sudden reboot.

Can anyone help me understand what happened?
 

Attachments

could you post
- the logs of *all* nodes, starting before the attempted join up to the outage
- pveversion -v of *all* nodes (or /var/log/apt/history.log, if you've since installed updates)
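
(something like the following on each node should capture it - a rough sketch, the time window is just a placeholder:)

Code:
# run on each node; adjust the window so it covers the join attempt and the outage
journalctl --since "2022-10-16 18:30" --until "2022-10-16 20:00" > "/tmp/$(hostname)-journal.log"
pveversion -v > "/tmp/$(hostname)-pveversion.txt"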
 
Hello Fabian,

You already have the syslog portion for the Joining node and the first Existing node.

I'm now attaching:
1) History log for the joining node
2) History log for the "existing node"
3) For the other two nodes the history.log is empty, but they are all running pve-manager/7.2-11/b76d3178, Linux 5.15.60-2-pve #1 SMP PVE 5.15.60-2 (Tue, 04 Oct 2022 16:52:28 +0200) and Ceph 16.2.9
4) Syslog for existing nodes no. 2 and no. 3 for the period in which the problem happened.

pveversion -v for the nodes is currently:
Existing node 1
# pveversion -v
proxmox-ve: 7.2-1 (running kernel: 5.15.60-2-pve)
pve-manager: 7.2-11 (running version: 7.2-11/b76d3178)
pve-kernel-helper: 7.2-13
pve-kernel-5.15: 7.2-12
pve-kernel-5.13: 7.1-9
pve-kernel-5.15.60-2-pve: 5.15.60-2
pve-kernel-5.15.35-1-pve: 5.15.35-3
pve-kernel-5.13.19-6-pve: 5.13.19-15
pve-kernel-5.13.19-2-pve: 5.13.19-4
ceph: 16.2.9-pve1
ceph-fuse: 16.2.9-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-3
libpve-guest-common-perl: 4.1-3
libpve-http-server-perl: 4.1-4
libpve-network-perl: 0.7.1
libpve-storage-perl: 7.2-10
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
openvswitch-switch: 2.15.0+ds1-2+deb11u1
proxmox-backup-client: 2.2.7-1
proxmox-backup-file-restore: 2.2.7-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-2
pve-container: 4.2-2
pve-docs: 7.2-2
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-6
pve-firmware: 3.5-4
pve-ha-manager: 3.4.0
pve-i18n: 2.7-2
pve-qemu-kvm: 7.0.0-3
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.6-pve1



Existing node 2
# pveversion -v
proxmox-ve: 7.2-1 (running kernel: 5.15.60-2-pve)
pve-manager: 7.2-11 (running version: 7.2-11/b76d3178)
pve-kernel-helper: 7.2-13
pve-kernel-5.15: 7.2-12
pve-kernel-5.13: 7.1-9
pve-kernel-5.15.60-2-pve: 5.15.60-2
pve-kernel-5.15.35-1-pve: 5.15.35-3
pve-kernel-5.13.19-6-pve: 5.13.19-15
pve-kernel-5.13.19-2-pve: 5.13.19-4
ceph: 16.2.9-pve1
ceph-fuse: 16.2.9-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-3
libpve-guest-common-perl: 4.1-3
libpve-http-server-perl: 4.1-4
libpve-network-perl: 0.7.1
libpve-storage-perl: 7.2-10
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
openvswitch-switch: 2.15.0+ds1-2+deb11u1
proxmox-backup-client: 2.2.7-1
proxmox-backup-file-restore: 2.2.7-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-2
pve-container: 4.2-2
pve-docs: 7.2-2
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-6
pve-firmware: 3.5-4
pve-ha-manager: 3.4.0
pve-i18n: 2.7-2
pve-qemu-kvm: 7.0.0-3
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.6-pve1


Existing node 3:
# pveversion -v
proxmox-ve: 7.2-1 (running kernel: 5.15.60-2-pve)
pve-manager: 7.2-11 (running version: 7.2-11/b76d3178)
pve-kernel-helper: 7.2-13
pve-kernel-5.15: 7.2-12
pve-kernel-5.13: 7.1-9
pve-kernel-5.15.60-2-pve: 5.15.60-2
pve-kernel-5.15.35-1-pve: 5.15.35-3
pve-kernel-5.13.19-6-pve: 5.13.19-15
pve-kernel-5.13.19-2-pve: 5.13.19-4
ceph: 16.2.9-pve1
ceph-fuse: 16.2.9-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-3
libpve-guest-common-perl: 4.1-3
libpve-http-server-perl: 4.1-4
libpve-network-perl: 0.7.1
libpve-storage-perl: 7.2-10
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
openvswitch-switch: 2.15.0+ds1-2+deb11u1
proxmox-backup-client: 2.2.7-1
proxmox-backup-file-restore: 2.2.7-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-2
pve-container: 4.2-2
pve-docs: 7.2-2
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-6
pve-firmware: 3.5-4
pve-ha-manager: 3.4.0
pve-i18n: 2.7-2
pve-qemu-kvm: 7.0.0-3
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.6-pve1


Joining node:
# pveversion -v
proxmox-ve: 7.2-1 (running kernel: 5.15.60-2-pve)
pve-manager: 7.2-11 (running version: 7.2-11/b76d3178)
pve-kernel-helper: 7.2-13
pve-kernel-5.15: 7.2-12
pve-kernel-5.15.60-2-pve: 5.15.60-2
pve-kernel-5.15.30-2-pve: 5.15.30-3
ceph-fuse: 15.2.16-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-3
libpve-guest-common-perl: 4.1-3
libpve-http-server-perl: 4.1-4
libpve-storage-perl: 7.2-10
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.7-1
proxmox-backup-file-restore: 2.2.7-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-2
pve-container: 4.2-2
pve-docs: 7.2-2
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-6
pve-firmware: 3.5-4
pve-ha-manager: 3.4.0
pve-i18n: 2.7-2
pve-qemu-kvm: 7.0.0-3
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.6-pve1
 

Attachments

I really meant "all nodes" when I wrote "all nodes" ;) anything particular/non-default about your corosync.conf?
 
Here are the logs for the remaining 4 nodes (the ones that didn't reboot). I had to zip them because they were too big.

I cannot see anything strange in corosync.conf; here is its current content (I've already removed the node that caused the failure this morning):

Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve41-cluster01
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 172.16.7.41
    ring1_addr: 172.16.8.41
  }
  node {
    name: pve42-cluster01
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 172.16.7.42
    ring1_addr: 172.16.8.42
  }
  node {
    name: pve43-cluster01
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 172.16.7.43
    ring1_addr: 172.16.8.43
  }
  node {
    name: stor71-cluster01
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 172.16.7.71
    ring1_addr: 172.16.8.71
  }
  node {
    name: stor72-cluster01
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 172.16.7.72
    ring1_addr: 172.16.8.72
  }
  node {
    name: stor73-cluster01
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 172.16.7.73
    ring1_addr: 172.16.8.73
  }
  node {
    name: stor74-cluster01
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 172.16.7.74
    ring1_addr: 172.16.8.74
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: sax-pve-cl01
  config_version: 11
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

pveversion -v for all of them reports the same:

proxmox-ve: 7.2-1 (running kernel: 5.15.60-2-pve)
pve-manager: 7.2-11 (running version: 7.2-11/b76d3178)
pve-kernel-helper: 7.2-13
pve-kernel-5.15: 7.2-12
pve-kernel-5.13: 7.1-9
pve-kernel-5.15.60-2-pve: 5.15.60-2
pve-kernel-5.15.35-1-pve: 5.15.35-3
pve-kernel-5.13.19-6-pve: 5.13.19-15
pve-kernel-5.13.19-2-pve: 5.13.19-4
ceph: 16.2.9-pve1
ceph-fuse: 16.2.9-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-3
libpve-guest-common-perl: 4.1-3
libpve-http-server-perl: 4.1-4
libpve-network-perl: 0.7.1
libpve-storage-perl: 7.2-10
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
openvswitch-switch: 2.15.0+ds1-2+deb11u1
proxmox-backup-client: 2.2.7-1
proxmox-backup-file-restore: 2.2.7-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-2
pve-container: 4.2-2
pve-docs: 7.2-2
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-6
pve-firmware: 3.5-4
pve-ha-manager: 3.4.0
pve-i18n: 2.7-2
pve-qemu-kvm: 7.0.0-3
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.6-pve1
 

Attachments

I see the following in the joining node's log, which seems to be what caused corosync communication to return to normal:

Code:
Oct 16 19:06:47 pvemgmt-cluster01 kernel: [20925.802411] bnx2 0000:01:00.0 eno1: NIC Copper Link is Down
Oct 16 19:06:47 pvemgmt-cluster01 kernel: [20925.803427] bond0: (slave eno1): speed changed to 0 on port 1
Oct 16 19:06:47 pvemgmt-cluster01 kernel: [20925.844523] bond0: (slave eno1): link status definitely down, disabling slave
Oct 16 19:06:47 pvemgmt-cluster01 kernel: [20925.844537] bond0: active interface up!
Oct 16 19:06:47 pvemgmt-cluster01 kernel: [20925.874983] bnx2 0000:01:00.1 eno2: NIC Copper Link is Down
Oct 16 19:06:47 pvemgmt-cluster01 kernel: [20925.875956] bond0: (slave eno2): speed changed to 0 on port 2
Oct 16 19:06:47 pvemgmt-cluster01 kernel: [20925.952424] bond0: (slave eno2): link status definitely down, disabling slave
Oct 16 19:06:47 pvemgmt-cluster01 kernel: [20925.952438] bond0: now running without any active interface!
Oct 16 19:06:47 pvemgmt-cluster01 kernel: [20925.952743] vmbr0: port 1(bond0) entered disabled state

I guess this was the point when you disabled the switch ports or pulled the cables for the joining node? and once that node was "isolated" the other 7 recovered almost immediately (at least as far as corosync/pmxcfs were concerned)?

up until then, corosync only complained about nodes going down which were fenced (and single tokens being lost as a result), which would indicate that
- UDP traffic and the TOTEM part of corosync worked fine
- the CPG part of corosync was non-functional (which in turn meant that pmxcfs was non functional, which in turn caused the HA stack to trigger a fence)
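
(as a side note, both layers can be inspected on a running node - a rough sketch using the standard corosync CLI tools:)

Code:
# knet link and membership status as corosync sees it (TOTEM layer)
corosync-cfgtool -s
# current quorum state
corosync-quorumtool -s
# CPG groups and their members (pmxcfs registers its own groups here)
corosync-cpgtool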

could you give some more details about your network configuration (at least as far as the corosync links are concerned)?
 
Actually, I haven't disabled or disconnected any port.

The joining node, before starting the cluster join, was already connected via LACP to the switch, properly reachable on all the needed VLANs (6, 7, 8), and able to ping all the other nodes even before I tried to let it join.

All 7 working nodes are connected to a pair of Cisco Nexus 5548 switches configured in vPC with each other.
The first port of each node is connected to switch 1, the second port of each node is connected to switch 2.
Both ports form an LACP bond that on the Cisco side is brought up through vPC (some sort of MLAG/MC-LAG).

This is the configuration on both Cisco ports:

Code:
interface port-channel41
  description "INFRA: PVE41-CLUSTER01"
  switchport mode trunk
  switchport trunk native vlan 1018
  spanning-tree port type edge trunk
  speed 10000
  vpc 41

The "joining node" that failed, is connected to a Extreme EX4200 switch, and the ports are configured as:

description "INFRA: LACP to PVE08-POOL01";
aggregated-ether-options {
lacp {
active;
}
}
unit 0 {
family ethernet-switching {
port-mode trunk;
vlan {
members [ vlan-1008 vlan-900 bd-20 vlan-4 vlan-7 vlan-8 vlan-1019 vlan-910 vlan-6 vlan-337 vlan-11 vlan-301 ];
}
native-vlan-id vlan-1018;
}
}

And of course the EX4200 switch is connected to the Cisco Nexus pair using a 2x10Gb LACP bond in trunk mode.

A few things to consider:

a) When I tried to join the new node, it started the join but never became "green". The other nodes rebooted even before the node was able to complete the join.

b) What I find really odd is that, instead of perhaps isolating the joining node due to some fault, the cluster decided to reboot three machines that were working fine before. This behaviour is not just odd, but quite dangerous. If a new node is joining a working cluster, the corosync configuration should absolutely avoid a situation like this.

c) In the cluster there is only one HA group configured, and all the existing nodes are part of it.

d) There are two SDN zones: one is a VLAN zone on a bridge, one is a VXLAN zone... maybe this could be a factor?

e) The joining node was supposed to be used only with local storage and not even take part in the Ceph cluster, but I don't think this matters. In any case, it wasn't joined yet.
 

thanks for the details!

unfortunately that might mean it could boil down to some interaction between corosync and your network hardware that I can't reproduce in our lab :-/ it would still be interesting why node 8 detected the bond ports going down - maybe the switches logged something (storm control? STP?)

A few things to consider:

a) When I tried to join the new node, it started the join but never became "green". The other nodes rebooted even before the node was able to complete the join.

yes - because pmxcfs didn't become operational post-join, pvestatd didn't have a chance to collect and broadcast status information on that node. that is to be expected and just a symptom, not the cause.

b) What I find really odd is that, instead of perhaps isolating the joining node due to some fault, the cluster decided to reboot three machines that were working fine before. This behaviour is not just odd, but quite dangerous. If a new node is joining a working cluster, the corosync configuration should absolutely avoid a situation like this.

obviously, this is not the intended behaviour. the part of corosync that we rely on for HA and regular operations stopped working, possibly due to a bug in corosync (I'll definitely take a deeper look there together with upstream!) or our stack.

that only those three nodes rebooted was likely because they had HA active and armed (which means a watchdog that expires if pmxcfs becomes non-quorate/non-functional).
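
(whether HA is currently armed on a node can be checked quickly - a sketch using the standard tools:)

Code:
# overall HA view: per-node LRM state (idle/active) and the managed services
ha-manager status
# the services involved in watchdog-based fencing
systemctl status watchdog-mux pve-ha-lrm pve-ha-crm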

c) In the cluster there is only one HA group configured, and all the existing nodes are part of it.

d) There are two SDN zones: one is a VLAN zone on a bridge, one is a VXLAN zone... maybe this could be a factor?

I don't think that any of that had any meaningful impact. the only thing that would have prevented the fence event would have been disarming HA on all nodes before the join operation.

e) The joining node was supposed to be used only with local storage and not even take part in the Ceph cluster, but I don't think this matters. In any case, it wasn't joined yet.

if you don't limit your ceph storages to the other nodes before joining, the new node will at least also query the ceph cluster as a client. but obviously any ceph services such as additional mon/mgr/mds/osd instances on the new node would need to be set up manually - there is nothing automatic on that level when joining a cluster. and in any case, it should have no effect on cluster joining/HA/corosync ;)
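
(restricting a storage to specific nodes can be done in the GUI or on the CLI - a sketch, "cephvm" is just a placeholder storage ID, the node names are taken from your corosync.conf:)

Code:
# limit the storage definition to the existing nodes so the new node doesn't act as a ceph client
pvesm set cephvm --nodes pve41-cluster01,pve42-cluster01,pve43-cluster01,stor71-cluster01,stor72-cluster01,stor73-cluster01,stor74-cluster01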
 
I don't think that any of that had any meaningful impact. the only thing that would have prevented the fence event would have been disarming HA on all nodes before the join operation.
Do you mean that if I temporarily remove all nodes from the HA group, I should be able to join the server without risking the existing nodes rebooting like they already did?

unfortunately that might mean it could boil down to some interaction between corosync and your network hardware that I can't reproduce in our lab :-/ it would still be interesting why node 8 detected the bond ports going down - maybe the switches logged something (storm control? STP?)
I can see a spanning-tree topology change around the time the node tried to join the cluster.
At this point my suspicion is that the joining node itself caused an STP loop.
As I said before, the node was already connected and working fine, just not joined yet. Is it possible that something happened on the joining node during the join process that caused a loop? And why? I believe this should be checked too, and if there is any chance the node caused the loop, it's quite important to avoid it.
Maybe this could be related to some bug in the SDN implementation?
In my particular scenario I have a Linux bond between the two Ethernet cards, and the bond itself is configured as LACP.
On top of the bond there is a "vmbr0" interface, no IP set, VLAN aware.
On top of vmbr0 there are three VLANs (6, 7, 8): VLAN 6 is for management, VLAN 7 for clustering, VLAN 8 to access some external storage.
Then, on the SDN side, there is a cluster-wide VLAN zone called BRIDGE0 that uses vmbr0,
and on top of this there is an SDN VNet that uses the BRIDGE0 VLAN zone, with tag 11, which I use to reach some internal servers
(see the attached pictures).
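
(For what it's worth, the STP and VLAN state of the bridge on the joining node can be checked with the standard iproute2 tools, e.g.:)

Code:
# is kernel STP enabled on the bridge? (0 = off, as set by bridge-stp off)
cat /sys/class/net/vmbr0/bridge/stp_state
# per-port bridge state and flags (flooding, learning, ...)
bridge -d link show
# VLANs currently configured on the bridge ports
bridge vlan show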

that only those three nodes rebooted was likely because they had HA active and armed (which means a watchdog that expires if pmxcfs becomes non-quorate/non-functional).

All the existing nodes have HA active, but I've just noticed that the three nodes that rebooted have a higher priority than the nodes that didn't reboot.

I don't remember configuring those priorities; are those values set automatically by Proxmox by any chance?
 

Attachments

  • CorpServ.png (6.6 KB)
  • HA group conf.png (22 KB)
  • SDN Vlan.png (9.9 KB)
I guess this was the point when you disabled the switch ports or pulled the cables for the joining node? and once that node was "isolated" the other 7 recovered almost immediately (at least as far as corosync/pmxcfs were concerned)?
Actually, now I recall that when I saw all the VMs running on the existing nodes disappear, I immediately assumed a spanning-tree loop and decided to shut down the port-channel to the joining node. But at that point the 3 nodes were already rebooting.

This really points to the possibility that the join process running on the joining node caused a huge spanning-tree loop.

But why?
 
I've just noticed this in the joining node's logs:

Code:
Oct 16 19:02:35 pvemgmt-cluster01 corosync[259242]:   [KNET  ] pmtud: PMTUD link change for host: 7 link: 1 from 469 to 1397
Oct 16 19:02:35 pvemgmt-cluster01 corosync[259242]:   [KNET  ] pmtud: PMTUD link change for host: 6 link: 0 from 469 to 1397
Oct 16 19:02:35 pvemgmt-cluster01 corosync[259242]:   [KNET  ] pmtud: PMTUD link change for host: 6 link: 1 from 469 to 1397
Oct 16 19:02:35 pvemgmt-cluster01 corosync[259242]:   [KNET  ] pmtud: PMTUD link change for host: 5 link: 0 from 469 to 1397
Oct 16 19:02:35 pvemgmt-cluster01 corosync[259242]:   [KNET  ] pmtud: PMTUD link change for host: 5 link: 1 from 469 to 1397
Oct 16 19:02:35 pvemgmt-cluster01 corosync[259242]:   [KNET  ] pmtud: PMTUD link change for host: 4 link: 0 from 469 to 1397
Oct 16 19:02:35 pvemgmt-cluster01 corosync[259242]:   [KNET  ] pmtud: PMTUD link change for host: 4 link: 1 from 469 to 1397
Oct 16 19:02:35 pvemgmt-cluster01 corosync[259242]:   [KNET  ] pmtud: PMTUD link change for host: 3 link: 0 from 469 to 1397
Oct 16 19:02:35 pvemgmt-cluster01 corosync[259242]:   [KNET  ] pmtud: PMTUD link change for host: 3 link: 1 from 469 to 1397
Oct 16 19:02:35 pvemgmt-cluster01 corosync[259242]:   [KNET  ] pmtud: PMTUD link change for host: 2 link: 0 from 469 to 1397
Oct 16 19:02:35 pvemgmt-cluster01 corosync[259242]:   [KNET  ] pmtud: PMTUD link change for host: 2 link: 1 from 469 to 1397
Oct 16 19:02:35 pvemgmt-cluster01 corosync[259242]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 0 from 469 to 1397
Oct 16 19:02:35 pvemgmt-cluster01 corosync[259242]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 1 from 469 to 1397
Oct 16 19:02:35 pvemgmt-cluster01 corosync[259242]:   [KNET  ] pmtud: Global data MTU changed to: 1397

This is quite strange.... All the nodes are running with MTU 9216, and the MTU size is honored all along the path.
Can this be related to the issue?
 
it could possibly be (for example, if the regular heartbeat and totem packets are smaller, but the sync/CPG packets are bigger and get sent via jumbo frames).

I'll try to see if I can trigger similar behaviour with setups modelled after your usage (without the cisco part, but with SDN). could you maybe provide /etc/network/interfaces and the SDN config files in text format so that I can try to stay as close to your setup as possible?
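
(assuming the default locations, something like this should collect everything - the SDN definitions normally live under /etc/pve/sdn/:)

Code:
cat /etc/network/interfaces
# SDN zone/vnet definitions (only present if SDN is configured)
cat /etc/pve/sdn/*.cfg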

regarding

Do you mean that if I temporarily remove all nodes from the HA group, I should be able to join the server without risking the existing nodes rebooting like they already did?

no - the safe option is to stop the HA services (in the right order) on all nodes (this is the same thing that gets done on node reboot/package upgrades). that disarms the watchdog, so even if /etc/pve becomes unwritable or corosync loses quorum, existing guests should continue running, just like in a non-HA cluster.
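
(a rough sketch of the usual order - LRM first on every node, then CRM:)

Code:
# on every node: stop the local resource manager first (this disarms the watchdog once it has shut down cleanly)
systemctl stop pve-ha-lrm
# only after pve-ha-lrm is stopped on *all* nodes, stop the cluster resource manager everywhere
systemctl stop pve-ha-crm
# after the join has gone through, start both again on every node
systemctl start pve-ha-lrm pve-ha-crm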
 
thinking this over for a bit - the first order of business would definitely be verifying that the joining node (with the extra switch in-between) is able to communicate with the other nodes using the large MTU. it doesn't have to join the cluster for that test, so there shouldn't be any risk of further corosync interference.
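
(a quick sketch of such a test - the ICMP payload is the MTU minus 28 bytes of IPv4/ICMP headers, so 9188 bytes for MTU 9216; the addresses are the corosync link addresses of node 1 from your corosync.conf:)

Code:
# from the joining node: send non-fragmentable jumbo frames over both corosync links
ping -M do -s 9188 -c 4 172.16.7.41
ping -M do -s 9188 -c 4 172.16.8.41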
 
Code:
auto lo
iface lo inet loopback

auto eno1
iface eno1 inet manual

auto eno2
iface eno2 inet manual

iface eno3 inet manual

iface eno4 inet manual

auto bond0
iface bond0 inet manual
    bond-slaves eno1 eno2
    bond-miimon 100
    bond-mode 802.3ad
    bond-xmit-hash-policy layer3+4
    mtu 9216

auto vmbr0
iface vmbr0 inet manual
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094
    mtu 9216

auto MGMT
iface MGMT inet static
    address 172.16.6.41/24
    gateway 172.16.6.1
    mtu 1500
    vlan-id 6
    vlan-raw-device vmbr0

auto PVE_CLUSTER
iface PVE_CLUSTER inet static
    address 172.16.7.41/24
    vlan-id 7
    vlan-raw-device vmbr0

auto PVE_STORAGE
iface PVE_STORAGE inet static
    address 172.16.8.41/24
    vlan-id 8
    vlan-raw-device vmbr0
 
UPDATE

I've just discovered that the NICs in the joining node don't support an MTU higher than 9000... but this shouldn't be a valid reason for the node to trigger a spanning-tree loop while joining the remaining nodes :(
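
(For reference, the range a NIC driver supports shows up in the minmtu/maxmtu fields of recent iproute2, e.g.:)

Code:
# check the supported MTU range of the bond members (look for "minmtu" / "maxmtu")
ip -d link show eno1
ip -d link show eno2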

I dug through our network logs and can confirm it triggered a loop. This made communication between the nodes unavailable, and that triggered them to reboot.

My gut is pointing more and more toward some bug in SDN...
 
The same thing has just happened to me: everything rebooted as soon as I added the new node to the cluster.
Now I'm afraid to restart the new node, and I don't know how to remove it from the cluster configuration.

Edit: I should specify that I'm currently running a cluster with 13 nodes, everything updated to the latest versions and everything set up as usual (I have a written procedure that I follow extremely carefully).

Any help?
Thank you
 
please open a new thread and mention me (it's easier to keep the configs and logs separate that way!)

the following would be helpful:

- corosync.conf
- journal starting before the join operation, at least until the outage and preferably a bit afterwards as well (from all nodes)
- pveversion -v (from all nodes)
- anything special about your network configuration
 
UPDATE

I've just discovered that the NICs in the joining node don't support an MTU higher than 9000... but this shouldn't be a valid reason for the node to trigger a spanning-tree loop while joining the remaining nodes :(

I dug through our network logs and can confirm it triggered a loop. This made communication between the nodes unavailable, and that triggered them to reboot.

My gut is pointing more and more toward some bug in SDN...

it doesn't look like communication in general was unavailable (in that case, knet would have detected the error and marked the links as down), just certain parts (which happen to be the ones that HA relies on, which caused the fencing) - although you are in a better position to judge what actions your network hardware took. note that at this point SDN on the joining node was not yet available, as it hadn't had a chance to sync the state of /etc/pve yet (that happens via the same mechanism that HA also uses, and the joining node wasn't even able to do the first part of joining the other nodes on that level), and thus no config files would have been there to cause any SDN-related actions.

my gut points towards either the MTU or some part of your bond config/switch settings being the culprit, or those just being red herrings and some yet-to-be-discovered corosync or pmxcfs bug being at fault.
 
so I've been in contact with upstream and we have a lead on a bad MTU interaction that might cause this. I'll try to reproduce the issue locally!
 
