unexpected restart of all cluster nodes

fabian · Dec 6, 2021

mmenaz said:
Reading this thread, but not being experienced in clusters, I'm really worried about a couple of points:
a) fencing should work differently, i.e. a Proxmox node finds itself isolated, understands that it has to "suicide", then stops/kills all KVM processes (or LXC or whatever), logs the fact, syncs the local storage (where the logs are located), then does a clean "reboot" or, if you think that is risky, a "reset".

that's how fencing works - each node has a watchdog controlled by the HA services, if the watchdog expires, the node kills itself

as long as the node is part of the quorum, it will prevent the watchdog from expiring.
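To make the mechanism concrete, here is a minimal simulation of the idea, not the actual Proxmox HA code: the node only resets its watchdog while it is quorate, and it fences itself once the watchdog has gone unreset for longer than the timeout. The class and function names, the 60 second timeout and the 10 second tick are all assumptions chosen for illustration.

```python
from typing import List, Optional


class SimulatedWatchdog:
    """Toy watchdog: counts as expired if not reset within `timeout` seconds."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout
        self.last_reset = 0.0

    def reset(self, now: float) -> None:
        self.last_reset = now

    def expired(self, now: float) -> bool:
        return now - self.last_reset > self.timeout


def run_node(quorate_at: List[bool], timeout: float = 60.0,
             tick: float = 10.0) -> Optional[float]:
    """Return the simulated time at which the node fences itself, or None.

    `quorate_at[i]` says whether the node is part of the quorate
    partition at tick i (hypothetical input for this sketch).
    """
    wd = SimulatedWatchdog(timeout)
    now = 0.0
    for quorate in quorate_at:
        now += tick
        if wd.expired(now):
            return now          # watchdog fired: the node kills itself
        if quorate:
            wd.reset(now)       # only a quorate node keeps the watchdog alive
    return None


# Quorate for 3 ticks, then isolated: the node fences itself later on.
print(run_node([True] * 3 + [False] * 10))
# Brief loss of quorum that recovers in time: no fencing.
print(run_node([True, False, False, True, False]))
```

The point of the sketch is that fencing needs no cooperation from the rest of the cluster: losing quorum for longer than the timeout is enough.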

mmenaz said:
b) if corosync is separated from the other networks, it can be that all the other networks (storage and VM) are working, but a corosync network problem alone can provoke a cluster suicide... that's bad

you have to define certain criteria for "part of the cluster". corosync already does the heavy lifting here, and we require corosync to say "this node is part of the quorate partition of the cluster" anyway for any tasks that modify state to work, so it's a good fit.
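As a rough illustration of what "quorate partition" means, the sketch below assumes the default majority rule with one vote per node and no special options (no QDevice, no two-node mode): a partition is quorate only if it holds a strict majority of all votes. The function name is made up for this example.

```python
def is_quorate(votes_in_partition: int, total_votes: int) -> bool:
    """True if the partition holds a strict majority of all votes."""
    return 2 * votes_in_partition > total_votes


# 4-node cluster, one vote each: 3 nodes are needed for quorum.
print(is_quorate(3, 4))  # True
print(is_quorate(2, 4))  # False: in a 2/2 split neither side is quorate
```

Under this rule an even split leaves no quorate side at all, which is one reason the definition of "part of the cluster" has to be strict.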

mmenaz said:
c) why not just have an option for not really critical setups (i.e. max 10 nodes, which can work with the described setup) to consider a node OK as long as it can communicate with its shared cluster storage? Just reserve a "cluster_disk" in that storage with a FS that supports concurrent writes, and each node rewrites a file with its node name. If a node can't write there, it has to "commit suicide" (but as in point a)); if it can write, it just has to read all the other nodes' timestamps, and if it finds ones that are older than "n" minutes, it can conclude that that node is out of the cluster and, e.g., start the HA VMs. I'm in a hurry, and maybe something more sophisticated has to be thought out, like node_vmid.txt or a sort of "cluster db" like Proxmox already has, or something else that is good enough? Corosync is really overcomplicated and for small setups introduces more problems than it solves, IMHO

that's exactly what we are doing, just replace "shared cluster storage" with "pmxcfs", our FUSE-mapped shared DB backed by corosync, and the rest (writing timestamps, checking which nodes haven't updated theirs and must have already fenced themselves via their watchdog expiring, etc.) is what our HA stack is doing

establishing a consistent view of the world/cluster is not trivial, the 'write and check timestamp' part is just the last piece of the puzzle and not sufficient on its own.
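A small sketch of why the timestamp check is not sufficient on its own (hypothetical names and values, not the HA stack's real data structures): a node that is itself isolated would see every other node's timestamp as stale, so the check only makes sense when it is gated on the local node being quorate.

```python
from typing import Dict, List


def stale_nodes(timestamps: Dict[str, float], now: float,
                max_age: float) -> List[str]:
    """Nodes whose last timestamp is older than `max_age` seconds."""
    return sorted(n for n, ts in timestamps.items() if now - ts > max_age)


def nodes_to_recover(local_quorate: bool, timestamps: Dict[str, float],
                     now: float, max_age: float) -> List[str]:
    """Only a quorate node may act on stale timestamps.

    Without this gate, an isolated node would consider everyone else
    dead and start their services a second time.
    """
    if not local_quorate:
        return []
    return stale_nodes(timestamps, now, max_age)


# Illustrative values only.
ts = {"node01": 100.0, "node02": 190.0, "node03": 195.0}
print(nodes_to_recover(True, ts, now=200.0, max_age=60.0))   # ['node01']
print(nodes_to_recover(False, ts, now=200.0, max_age=60.0))  # []
```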

simba30 · Jan 25, 2023

I'm digging up this old thread.

I have a 4-node cluster (v7.2-3) with HA configured on some VMs.
Each node is connected to the network through a redundant active/passive setup; quorum has its own VLAN on this network bond.

Over the past couple of days, all nodes did two cold restarts (watchdog fencing) because the quorum was completely stuck/lost/AWOL.

After reading and analysing all the logs, I identified an issue with one of the network interfaces on one of the nodes.
It seems that one of the SFP+ modules on node01 had a lot of interface errors (the SFP+ was starting to die), and after it died completely, the active-backup config switched to the second SFP+.

The problem we faced was that while the SFP+ was dying:
- node01 rebooted (fencing? this would make sense, as it had network issues)
- then node02, node03 and node04 also rebooted

In the logs of all nodes we see node01 joining / leaving quorum several times.

So, questions:
- Is it possible that network issues on one interface of a bond completely crashed the quorum for the whole cluster?
- Is it expected to have all nodes reboot in such an event?

Thank you in advance.

fabian · Jan 26, 2023

without logs it's really hard to tell what happened..
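For anyone gathering logs for a case like this, the following sketch builds a `journalctl` command covering the services that are usually relevant to quorum and fencing. The unit list and the time window are assumptions for illustration; adjust both to the node and the incident in question.

```python
from typing import List, Sequence

# Units assumed to be relevant on a Proxmox VE node (adjust as needed).
UNITS = ["corosync", "pve-cluster", "pve-ha-crm", "pve-ha-lrm", "watchdog-mux"]


def journalctl_command(since: str, until: str,
                       units: Sequence[str] = UNITS) -> List[str]:
    """Build a journalctl call limited to a time window and a set of units."""
    cmd = ["journalctl", "--no-pager", "--since", since, "--until", until]
    for unit in units:
        cmd += ["-u", unit]
    return cmd


# Placeholder time window around the incident.
cmd = journalctl_command("2023-01-20 00:00", "2023-01-20 06:00")
print(" ".join(cmd))
# To actually run it on a node:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Running this on every node for the same window makes it easier to line up membership changes with the watchdog resets.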


