networks used to determine fencing

Mar 2, 2024
Hi all,
Newbie here...
I set up a cluster with 4 nodes and 3 switches: 1 for management traffic (1 Gbit) and 2 for production traffic (2x 10G).
For now, because of the size of the cluster, the corosync network is on a VLAN on the 2 production switches.
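(For reference, such a setup in /etc/network/interfaces usually looks roughly like the sketch below; the interface names and addresses are only illustrative, not the actual config.)

Code:
# illustrative sketch of /etc/network/interfaces, not the real config
auto bond0
iface bond0 inet manual
    bond-slaves eno5 eno6
    bond-mode 802.3ad
    bond-miimon 100

# VLAN-aware bridge on the production bond
auto vmbr1
iface vmbr1 inet manual
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094

# dedicated corosync VLAN on top of the production bridge
auto vmbr1.205
iface vmbr1.205 inet static
    address 172.16.8.49/24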

Unexpectedly, the management switch rebooted today, causing all nodes to reboot simultaneously.
Which led to the question: why did this cluster reboot? Corosync was up, the VMs were all running, and shared storage was available.
Everything except that management switch was up & running.

Thanks for any insights into this!

Phew01
 
Hi,
Thanks for the fast reply. The setup is in a datacenter on dual-feed power. There was no power loss; the management switch crashed and restarted.
The result was that the 4 nodes rebooted.

The nodes themselves had power the whole time, and their management environment (iLO, on a 4th switch) stayed on.
 
Sorry for not including that earlier...

The only thing I saw in there was that it lost connection with its peers, but that does not make sense, because the corosync network did not go down.


corosync[3514]: [KNET ] pmtud: Global data MTU changed to: 1397

The timestamp here correlates with the management switch rebooting.

It also gives:

link: host: 4 link: 0 is down
which seems related to the MTU message (going by the time of the log entry). Link 0 should be the corosync network, but that did not go down or lose connectivity?! So is this pmtud playing up, is corosync overreacting, or what? I don't know where to look for what caused all this.
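To narrow this down, it may help to pull the logs of the cluster and HA services from each node for the window around the switch reboot; something along these lines (the timestamps are placeholders):

Code:
# run on every node; adjust the window to the time the management switch rebooted
journalctl --since "2024-03-02 10:00" --until "2024-03-02 10:30" \
    -u corosync -u pve-cluster -u pve-ha-lrm -u pve-ha-crm -u watchdog-mux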
 
My understanding is that a PVE-issued reboot will try to shut down VMs cleanly; that should work OK assuming the guest agent was running.
Maybe an issue with some other (non-Proxmox) watchdog service?
Any watchdog service in BIOS/iLO?
Did you install any HPE or external management tools?
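A few quick checks along those lines (generic suggestions; the hpwdt module is only relevant on HPE boxes using the iLO watchdog driver):

Code:
# which watchdog driver is loaded? Proxmox HA uses softdog unless a hardware watchdog is configured
lsmod | grep -e softdog -e ipmi_watchdog -e hpwdt
# is the Proxmox watchdog multiplexer running?
systemctl status watchdog-mux
# are HA resources defined/active? (HA fencing only applies to nodes with an active LRM)
ha-manager status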
 
Hi,
No external management tools in play here.
My main point of concern is that a change to an interface carrying an IP address that is "only" used for management traffic seems to have caused this, while all production and cluster networking is separated off onto different VLANs precisely to keep every data stream clean.

I'm not sure whether we can replicate the issue (we may have to set up an additional lab for that); repairing all those databases was no fun :(
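If we do end up testing this, the idea would probably be to take HA out of the equation first; roughly something like this (the resource ID and node name are placeholders):

Code:
# see which resources HA currently manages
ha-manager status
# temporarily remove a resource from HA before testing (vm:100 is a placeholder)
ha-manager remove vm:100
# on recent PVE versions a whole node can be put into maintenance mode instead
ha-manager crm-command node-maintenance enable NODENAME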

Thanks for your info
 
Hi,
Thanks. But why was that expected? All services (corosync, shared storage, etc.) were up & running. Does HA also evaluate the management IP addresses/interfaces when deciding whether to reboot or not?
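For reference, two quick ways to double-check which addresses the cluster is actually talking on:

Code:
# quorum/membership state and the member addresses as Proxmox sees them
pvecm status
# local corosync link status per configured link
corosync-cfgtool -s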
 
Post your /etc/pve/corosync.conf and /etc/network/interfaces of the nodes, plus the current output of corosync-cfgtool -n on each node. Full logs of each node around the time of the reboot would be useful too. We can't really guess anything without that information.
 
Yes, and I think that your corosync cluster talks over the management network. That's why I said it was expected. To avoid this, you should add a second corosync ring on the other switch/network.
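A minimal sketch of what that second link could look like in /etc/pve/corosync.conf (all names and addresses below are examples, not your actual values; config_version must be increased whenever the file is edited):

Code:
nodelist {
  node {
    name: nodeA
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.0.2.11       # existing corosync link (example address)
    ring1_addr: 198.51.100.11    # new second link on the other network (example address)
  }
  # ... add a ring1_addr to every other node entry as well ...
}

totem {
  # keep the existing settings; only bump config_version and add a second interface block
  config_version: 12   # must be one higher than the current value
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
}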
 
Either corosync isn't configured the way you think it is, or your network/routing works differently than you think. The corosync config, network config, and logs would probably help shed some light.
 
Hi,
Below are the ip -br a and corosync-cfgtool -n outputs from 3 of the nodes in the cluster. They are configured with one Ansible playbook, so they are identical configuration-wise.

Code:
root@A:~# ip -br a | grep -v UNKNOWN
eno1             UP
eno5             UP             fe80::9eb6:54ff:fe9b:35b8/64
eno6             UP
vmbr0            UP             fe80::af1:eaff:fe76:b26c/64
vmbr0.11@vmbr0   UP             172.16.23.207/23 fe80::af1:eaff:fe76:b26c/64
bond0            UP
vmbr1            UP             fe80::9eb6:54ff:fe9b:35bc/64
vmbr1.171@vmbr1  UP             172.16.3.49/24 fe80::9eb6:54ff:fe9b:35bc/64
vmbr1.172@vmbr1  UP             172.16.4.49/24 fe80::9eb6:54ff:fe9b:35bc/64
vmbr1.205@vmbr1  UP             172.16.8.49/24 fe80::9eb6:54ff:fe9b:35bc/64
vmbr1.1@vmbr1    UP             fe80::9eb6:54ff:fe9b:35bc/64
root@A:~# corosync-cfgtool -n
Local node ID 3, transport knet
nodeid: 1 reachable
   LINK: 0 udp (172.16.8.49->172.16.8.50) enabled connected mtu: 1397

nodeid: 2 reachable
   LINK: 0 udp (172.16.8.49->172.16.8.70) enabled connected mtu: 1397

nodeid: 4 reachable
   LINK: 0 udp (172.16.8.49->172.16.8.47) enabled connected mtu: 1397

nodeid: 5 reachable
   LINK: 0 udp (172.16.8.49->172.16.8.52) enabled connected mtu: 1397

====================================================================================

root@B:~# corosync-cfgtool -n
Local node ID 1, transport knet
nodeid: 2 reachable
   LINK: 0 udp (172.16.8.50->172.16.8.70) enabled connected mtu: 1397

nodeid: 3 reachable
   LINK: 0 udp (172.16.8.50->172.16.8.49) enabled connected mtu: 1397

nodeid: 4 reachable
   LINK: 0 udp (172.16.8.50->172.16.8.47) enabled connected mtu: 1397

nodeid: 5 reachable
   LINK: 0 udp (172.16.8.50->172.16.8.52) enabled connected mtu: 1397

root@B:~# ip -br a
lo               UNKNOWN        127.0.0.1/8 ::1/128
eno1             UP
eno5             UP             fe80::9af2:b3ff:fe20:1338/64
eno6             UP
vmbr0            UP             fe80::af1:eaff:fe7b:2658/64
vmbr0.11@vmbr0   UP             172.16.22.50/23 fe80::af1:eaff:fe7b:2658/64
vmbr1            UP             fe80::9af2:b3ff:fe20:133c/64
vmbr1.171@vmbr1  UP             172.16.3.50/24 fe80::9af2:b3ff:fe20:133c/64
vmbr1.172@vmbr1  UP             172.16.4.50/24 fe80::9af2:b3ff:fe20:133c/64
vmbr1.205@vmbr1  UP             172.16.8.50/24 fe80::9af2:b3ff:fe20:133c/64

====================================================================================

root@C:~# corosync-cfgtool -n
Local node ID 5, transport knet
nodeid: 1 reachable
   LINK: 0 udp (172.16.8.52->172.16.8.50) enabled connected mtu: 1397

nodeid: 2 reachable
   LINK: 0 udp (172.16.8.52->172.16.8.70) enabled connected mtu: 1397

nodeid: 3 reachable
   LINK: 0 udp (172.16.8.52->172.16.8.49) enabled connected mtu: 1397

nodeid: 4 reachable
   LINK: 0 udp (172.16.8.52->172.16.8.47) enabled connected mtu: 1397

root@C:~# ip -br a
lo               UNKNOWN        127.0.0.1/8 ::1/128
ens1f0           UP
eno5             UP
eno6             UP
bond0            UP
vmbr0            UP             172.16.22.41/23 fe80::ae16:2dff:fe80:68b4/64
vmbr1            UP             fe80::1602:ecff:fe3c:eff8/64
vmbr1.171@vmbr1  UP             172.16.3.52/24 fe80::1602:ecff:fe3c:eff8/64
vmbr1.172@vmbr1  UP             172.16.4.52/24 fe80::1602:ecff:fe3c:eff8/64
vmbr1.205@vmbr1  UP             172.16.8.52/24 fe80::1602:ecff:fe3c:eff8/64
vmbr1.1@vmbr1    UP             fe80::1602:ecff:fe3c:eff8/64


========================================================
corosync.conf:

Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: A
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 172.16.8.49
  }
  node {
    name: B
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 172.16.8.50
  }
  node {
    name: C
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 172.16.8.52
  }
  node {
    name: D
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 172.16.8.70
  }
  node {
    name: E
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 172.16.8.47
  }
}

quorum {
  auto_tie_breaker: 1
  auto_tie_breaker_node: lowest
  provider: corosync_votequorum
}

totem {
  cluster_name: C1
  config_version: 11
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
 
1. Your cluster has 5 nodes, not 4.
2. The auto_tie_breaker settings are nonstandard and actually not correct for a 5-node cluster (there cannot be a tie); see the sketch after this list.
3. You really need to provide the network setup ("ip a" from all nodes, at least for the physical devices, bridges and bonds) and logs covering the problematic time period (from all nodes).
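Regarding point 2: the quorum section Proxmox generates by default contains only the votequorum provider, so a cleaned-up version would look roughly like this (when editing /etc/pve/corosync.conf, config_version in the totem section has to be bumped as well):

Code:
quorum {
  provider: corosync_votequorum
}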