unexpected restart of all cluster nodes

Reading this thread, but not being experienced in clusters, I'm really worried about a couple of points:
a) fencing should be different, i.e. a Proxmox node finds itself isolated, understands that it has to "suicide", then stops/kills all KVM processes (or LXC or whatever), logs the fact, syncs the local storage (where the logs are located), then does a clean "reboot" or, if you think that is risky, a "reset".

that's how fencing works - each node has a watchdog controlled by the HA services, if the watchdog expires, the node kills itself ;) as long as the node is part of the quorum, it will prevent the watchdog from expiring.
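
To make the mechanism described above concrete, here is a minimal toy model in Python. It is only a sketch, not the actual Proxmox HA code: the class `SoftWatchdog`, the function `ha_tick`, and the 60-second timeout in the usage note are hypothetical names and values chosen for illustration. The point it shows is that the watchdog is only reset while the node is quorate, so losing quorum for longer than the timeout leads to a self-fence.

```python
import time


class SoftWatchdog:
    """Toy model of a watchdog timer (illustrative, not the Proxmox implementation)."""

    def __init__(self, timeout_s, now=time.monotonic):
        self.timeout_s = timeout_s
        self._now = now              # injectable clock, handy for testing
        self._last_reset = now()

    def reset(self):
        # Restart the countdown; in this model only a quorate node does this.
        self._last_reset = self._now()

    def expired(self):
        return self._now() - self._last_reset > self.timeout_s


def ha_tick(watchdog, node_is_quorate):
    """One iteration of a hypothetical HA service loop.

    Returns "ok" while the watchdog is still running and "self-fence"
    once it has expired because the node was not quorate for too long.
    """
    if node_is_quorate:
        watchdog.reset()
    return "self-fence" if watchdog.expired() else "ok"
```

With an assumed timeout of 60 seconds, a node that loses quorum keeps returning "ok" from `ha_tick` until more than 60 seconds have passed since the last reset. A real hardware or software watchdog would reset the machine at that point instead of returning a string.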

b) if corosync is separated from other networks, it can be that all the other networks are working (storage and VM) but just a corosync network problem can provoke a cluster suicide... that's bad

you have to define certain criteria for "part of the cluster". corosync already does the heavy lifting here, and we require corosync to say "this node is part of the quorate partition of the cluster" anyway for any tasks that modify state to work, so it's a good fit.
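
As a rough illustration of what "part of the quorate partition" means, the sketch below assumes the default majority rule with one vote per node and no special options (no two-node mode, no external vote device). The function name `is_quorate` is made up for this example and is not a corosync API.

```python
def is_quorate(votes_present, total_votes):
    """Return True if a partition holds a strict majority of all votes.

    Assumption: plain majority voting, one vote per node, no special
    quorum options. This is an illustration, not a corosync call.
    """
    return votes_present > total_votes // 2


# In a 4-node cluster a partition needs 3 votes; a 2/2 split leaves
# neither side quorate, so neither side keeps resetting its watchdog.
for present in range(5):
    print(present, "of 4 votes ->", is_quorate(present, 4))
```

Under this rule a node that is healthy on its storage and VM networks but cut off from the cluster network ends up in a partition of one, which is never quorate in a cluster of three or more nodes.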

c) why not just have an option for not really critical setups (i.e. max 10 nodes, which can work with the described setup) to consider a node OK as long as it can communicate with its shared cluster storage? Just reserve a "cluster_disk" in that storage with a FS that supports concurrent writes, and each node rewrites a file with its nodename. If a node can't write there, it has to "commit suicide" (but as in point a)); if it can write, it just has to read all other nodes' timestamps, and if it finds ones that are older than "n" minutes, it can conclude that that node is out of the cluster and, e.g., start HA VMs. I'm in a hurry, and maybe something more sophisticated has to be thought out, like node_vmid.txt or a sort of "cluster db" like Proxmox already has, or something good enough? Corosync is really overcomplicated and for small setups introduces more problems than it solves, IMHO

that's exactly what we are doing, just replace "shared cluster storage" with "pmxcfs", our fuse-mapped shared DB backed by corosync, and the rest (writing timestamps, checking which nodes haven't updated theirs and must have already fenced themselves via their watchdog expiring, etc.pp.) is what our HA stack is doing ;)

establishing a consistent view of the world/cluster is not trivial, the 'write and check timestamp' part is just the last piece of the puzzle and not sufficient on its own.
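
The "write and check timestamp" step on its own could look roughly like the following sketch. It is not the HA stack's real code: `stale_nodes`, its parameters, and the node names are hypothetical. It illustrates why the timestamp check is only the last piece: the result is only meaningful if the checking node already knows it is in the quorate partition.

```python
def stale_nodes(timestamps, now, max_age_s, local_is_quorate):
    """Return the nodes whose last timestamp is older than max_age_s.

    timestamps: dict mapping node name -> last update time in seconds.
    Returns None when the local node is not quorate, because then its
    view of the shared state cannot be trusted and no conclusion about
    other nodes may be drawn.
    """
    if not local_is_quorate:
        return None
    return sorted(
        node for node, ts in timestamps.items() if now - ts > max_age_s
    )


# Hypothetical example: node01 stopped updating its timestamp.
last_seen = {"node01": 100.0, "node02": 290.0, "node03": 295.0}
print(stale_nodes(last_seen, now=300.0, max_age_s=120.0, local_is_quorate=True))
print(stale_nodes(last_seen, now=300.0, max_age_s=120.0, local_is_quorate=False))
```

The quorate call reports node01 as stale, while the non-quorate call returns None instead of guessing. Deciding `local_is_quorate` consistently across all nodes is the hard part that the sketch leaves out.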
 
I'm digging up this old thread.

I have a 4-node cluster (v 7.2-3) with HA configured on some VMs.
Each node is connected to the network through a redundant active/passive bond; quorum has its own VLAN on this network bond.

Over the last couple of days, all nodes did a cold restart twice (watchdog fencing) because the quorum was completely stuck/lost/AWOL.

After reading and analysing all logs, I have identified an issue with one of the network interfaces on one of the nodes.
It seems that one of the SFP+ modules on node01 had a lot of interface errors (the SFP+ was starting to die), and after it completely died, the active-backup config switched to the second SFP+.

The problem we faced was that while the SFP+ was dying:
- node01 rebooted (fencing? this would make sense as it had network issues)
- then node02, node03 and node04 also rebooted

In the logs of all nodes we see node01 joining / leaving quorum several times.

So questions :
- Is it possible that network issues on one interface of a bond completely crashed the quorum for the whole cluster?
- Is it expected to have all nodes reboot in such an event ?

In advance thank you.
 
without logs it's really hard to tell what happened..
 
