networks used to determine fencing

Mar 2, 2024
Hi all,
Newbie here...
I set up a cluster with 4 nodes and 3 switches: 1 for management traffic (1 Gbit) and 2 for production traffic (2x 10G).
For now, because of the size of the cluster, the corosync network is on a VLAN on the 2 production switches.
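(For reference, such a setup in /etc/network/interfaces usually looks roughly like the sketch below; the interface names and addresses are only illustrative, not the actual config.)

Code:
# illustrative sketch of /etc/network/interfaces, not the real config
auto bond0
iface bond0 inet manual
    bond-slaves eno5 eno6
    bond-mode 802.3ad
    bond-miimon 100

# VLAN-aware bridge on the production bond
auto vmbr1
iface vmbr1 inet manual
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094

# dedicated corosync VLAN on top of the production bridge
auto vmbr1.205
iface vmbr1.205 inet static
    address 172.16.8.49/24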

Unexpectedly, the management switch rebooted today, causing all nodes to reboot simultaneously.
Which led to the question: why did this cluster reboot? Corosync was up, the VMs were all running, and shared storage was available.
Everything except that management switch was up & running.

Thanks for any insights into this!

Phew01
 
Hi,
Thanks for the fast reply. The setup is in a datacenter on dual-feed power. There was no power loss; the management switch crashed and restarted.
The result was that the 4 nodes rebooted.

The nodes themselves had power the whole time, and their management environment (iLO, on a 4th switch) stayed on.
 
Sorry for not including that earlier...

The only thing I saw in there was that it lost connection with its peers, but that does not make sense, because the corosync network did not go down.


corosync[3514]: [KNET ] pmtud: Global data MTU changed to: 1397

The timestamp here correlates with the management switch rebooting.

It also gives:

link: host: 4 link: 0 is down
which seems related to the MTU message (going by the time of the log entry). Link 0 should be the corosync network, but that did not go down or lose connectivity?! So is this pmtud playing up, is corosync overreacting, or what? I don't know where to look for what caused all this.
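To narrow this down, it may help to pull the logs of the cluster and HA services from each node for the window around the switch reboot; something along these lines (the timestamps are placeholders):

Code:
# run on every node; adjust the window to the time the management switch rebooted
journalctl --since "2024-03-02 10:00" --until "2024-03-02 10:30" \
    -u corosync -u pve-cluster -u pve-ha-lrm -u pve-ha-crm -u watchdog-mux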
 
My understanding is that a PVE-issued reboot will try to shut down VMs cleanly; that should work OK assuming the guest agent was running.
Maybe an issue with some other (non-Proxmox) watchdog service?
Any watchdog service in BIOS/iLO?
Did you install any HPE or external management tools?
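A few quick checks along those lines (generic suggestions; the hpwdt module is only relevant on HPE boxes using the iLO watchdog driver):

Code:
# which watchdog driver is loaded? Proxmox HA uses softdog unless a hardware watchdog is configured
lsmod | grep -e softdog -e ipmi_watchdog -e hpwdt
# is the Proxmox watchdog multiplexer running?
systemctl status watchdog-mux
# are HA resources defined/active? (HA fencing only applies to nodes with an active LRM)
ha-manager status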
 
Hi,
No external management tools in play here.
My main point of concern is that a change to an interface carrying an IP address that is "only" used for management traffic seems to have caused this, while all production and cluster networking is separated off onto different VLANs precisely to keep every data stream clean.

I'm not sure whether we can replicate the issue (we may have to set up an additional lab for that); repairing all those databases was no fun :(
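If we do end up testing this, the idea would probably be to take HA out of the equation first; roughly something like this (the resource ID and node name are placeholders):

Code:
# see which resources HA currently manages
ha-manager status
# temporarily remove a resource from HA before testing (vm:100 is a placeholder)
ha-manager remove vm:100
# on recent PVE versions a whole node can be put into maintenance mode instead
ha-manager crm-command node-maintenance enable NODENAME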

Thanks for your info
 
Hi,
Thanks. But why was that expected? All services (corosync, shared storage, etc.) were up & running. Does HA also evaluate the management IP addresses/interfaces when deciding whether to reboot or not?
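For reference, two quick ways to double-check which addresses the cluster is actually talking on:

Code:
# quorum/membership state and the member addresses as Proxmox sees them
pvecm status
# local corosync link status per configured link
corosync-cfgtool -s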
 
Post your /etc/pve/corosync.conf and /etc/network/interfaces of the nodes, plus the current output of corosync-cfgtool -n on each node. Full logs of each node around the time of the reboot would be useful too. We can't really guess anything without that information.
 
Yes, and I think that your corosync cluster talks over the management network. That's why I said it was expected. To avoid this, you should add a second corosync ring on the other switch/network.
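A minimal sketch of what that second link could look like in /etc/pve/corosync.conf (all names and addresses below are examples, not your actual values; config_version must be increased whenever the file is edited):

Code:
nodelist {
  node {
    name: nodeA
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.0.2.11       # existing corosync link (example address)
    ring1_addr: 198.51.100.11    # new second link on the other network (example address)
  }
  # ... add a ring1_addr to every other node entry as well ...
}

totem {
  # keep the existing settings; only bump config_version and add a second interface block
  config_version: 12   # must be one higher than the current value
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
}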
 
Either corosync isn't configured the way you think it is, or your network/routing works differently than you think. The corosync config, network config, and logs would probably help shed some light.
 
Hi,
Below are the ip -br a and corosync-cfgtool -n outputs from 3 of the nodes in the cluster. They are configured with one Ansible playbook, so they are identical configuration-wise.

Code:
root@A:~# ip -br a | grep -v UNKNOWN
eno1             UP
eno5             UP             fe80::9eb6:54ff:fe9b:35b8/64
eno6             UP
vmbr0            UP             fe80::af1:eaff:fe76:b26c/64
vmbr0.11@vmbr0   UP             172.16.23.207/23 fe80::af1:eaff:fe76:b26c/64
bond0            UP
vmbr1            UP             fe80::9eb6:54ff:fe9b:35bc/64
vmbr1.171@vmbr1  UP             172.16.3.49/24 fe80::9eb6:54ff:fe9b:35bc/64
vmbr1.172@vmbr1  UP             172.16.4.49/24 fe80::9eb6:54ff:fe9b:35bc/64
vmbr1.205@vmbr1  UP             172.16.8.49/24 fe80::9eb6:54ff:fe9b:35bc/64
vmbr1.1@vmbr1    UP             fe80::9eb6:54ff:fe9b:35bc/64
root@A:~# corosync-cfgtool -n
Local node ID 3, transport knet
nodeid: 1 reachable
   LINK: 0 udp (172.16.8.49->172.16.8.50) enabled connected mtu: 1397

nodeid: 2 reachable
   LINK: 0 udp (172.16.8.49->172.16.8.70) enabled connected mtu: 1397

nodeid: 4 reachable
   LINK: 0 udp (172.16.8.49->172.16.8.47) enabled connected mtu: 1397

nodeid: 5 reachable
   LINK: 0 udp (172.16.8.49->172.16.8.52) enabled connected mtu: 1397

====================================================================================

root@B:~# corosync-cfgtool -n
Local node ID 1, transport knet
nodeid: 2 reachable
   LINK: 0 udp (172.16.8.50->172.16.8.70) enabled connected mtu: 1397

nodeid: 3 reachable
   LINK: 0 udp (172.16.8.50->172.16.8.49) enabled connected mtu: 1397

nodeid: 4 reachable
   LINK: 0 udp (172.16.8.50->172.16.8.47) enabled connected mtu: 1397

nodeid: 5 reachable
   LINK: 0 udp (172.16.8.50->172.16.8.52) enabled connected mtu: 1397

root@B:~# ip -br a
lo               UNKNOWN        127.0.0.1/8 ::1/128
eno1             UP
eno5             UP             fe80::9af2:b3ff:fe20:1338/64
eno6             UP
vmbr0            UP             fe80::af1:eaff:fe7b:2658/64
vmbr0.11@vmbr0   UP             172.16.22.50/23 fe80::af1:eaff:fe7b:2658/64
vmbr1            UP             fe80::9af2:b3ff:fe20:133c/64
vmbr1.171@vmbr1  UP             172.16.3.50/24 fe80::9af2:b3ff:fe20:133c/64
vmbr1.172@vmbr1  UP             172.16.4.50/24 fe80::9af2:b3ff:fe20:133c/64
vmbr1.205@vmbr1  UP             172.16.8.50/24 fe80::9af2:b3ff:fe20:133c/64

====================================================================================

root@C:~# corosync-cfgtool -n
Local node ID 5, transport knet
nodeid: 1 reachable
   LINK: 0 udp (172.16.8.52->172.16.8.50) enabled connected mtu: 1397

nodeid: 2 reachable
   LINK: 0 udp (172.16.8.52->172.16.8.70) enabled connected mtu: 1397

nodeid: 3 reachable
   LINK: 0 udp (172.16.8.52->172.16.8.49) enabled connected mtu: 1397

nodeid: 4 reachable
   LINK: 0 udp (172.16.8.52->172.16.8.47) enabled connected mtu: 1397

root@C:~# ip -br a
lo               UNKNOWN        127.0.0.1/8 ::1/128
ens1f0           UP
eno5             UP
eno6             UP
bond0            UP
vmbr0            UP             172.16.22.41/23 fe80::ae16:2dff:fe80:68b4/64
vmbr1            UP             fe80::1602:ecff:fe3c:eff8/64
vmbr1.171@vmbr1  UP             172.16.3.52/24 fe80::1602:ecff:fe3c:eff8/64
vmbr1.172@vmbr1  UP             172.16.4.52/24 fe80::1602:ecff:fe3c:eff8/64
vmbr1.205@vmbr1  UP             172.16.8.52/24 fe80::1602:ecff:fe3c:eff8/64
vmbr1.1@vmbr1    UP             fe80::1602:ecff:fe3c:eff8/64


========================================================
corosync.conf:

Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: A
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 172.16.8.49
  }
  node {
    name: B
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 172.16.8.50
  }
  node {
    name: C
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 172.16.8.52
  }
  node {
    name: D
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 172.16.8.70
  }
  node {
    name: E
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 172.16.8.47
  }
}

quorum {
  auto_tie_breaker: 1
  auto_tie_breaker_node: lowest
  provider: corosync_votequorum
}

totem {
  cluster_name: C1
  config_version: 11
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
 
1. Your cluster has 5 nodes, not 4.
2. The auto_tie_breaker settings are nonstandard and actually not correct for a 5-node cluster (there cannot be a tie); see the sketch after this list.
3. You really need to provide the network setup ("ip a" from all nodes, at least for the physical devices, bridges and bonds) and logs covering the problematic time period (from all nodes).
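Regarding point 2: the quorum section Proxmox generates by default contains only the votequorum provider, so a cleaned-up version would look roughly like this (when editing /etc/pve/corosync.conf, config_version in the totem section has to be bumped as well):

Code:
quorum {
  provider: corosync_votequorum
}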