Proxmox Staff Member
- Jan 7, 2016
Reading this thread, but not being experienced in clusters, I'm really worried about a couple of points:
a) fencing should be different, i.e. a Proxmox node finds itself isolated, understands that it has to "suicide", then stops/kills all KVM processes (or LXC or whatever), logs the fact, syncs the local storage (where the logs are located), then does a clean "reboot" or, if you think that is risky, a "reset".
that's how fencing works - each node has a watchdog controlled by the HA services. if the watchdog expires, the node kills itself. as long as the node is part of the quorum, it will prevent the watchdog from expiring.
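To make the watchdog behaviour concrete, here is a minimal simulation, not the actual HA code. The class and function names, the 60 second timeout and the 10 second tick are all assumptions chosen for illustration; the only point is that the timer gets reset while the node is quorate, and the node fences itself once the timer runs out.

```python
class SimulatedWatchdog:
    """Toy stand-in for a watchdog timer (hypothetical, for illustration only)."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_reset = 0.0

    def reset(self, now):
        # "Feeding" the watchdog pushes the deadline forward.
        self.last_reset = now

    def expired(self, now):
        return now - self.last_reset >= self.timeout_s


def run_node(quorate_by_tick, timeout_s=60.0, tick_s=10.0):
    """Return the simulated time at which the node self-fences, or None.

    quorate_by_tick: one boolean per tick, True while the node is part
    of the quorate partition. Timeout and tick length are assumed values.
    """
    wd = SimulatedWatchdog(timeout_s)
    now = 0.0
    for quorate in quorate_by_tick:
        if wd.expired(now):
            return now  # watchdog fired: the node resets itself
        if quorate:
            wd.reset(now)  # only a quorate node keeps the watchdog alive
        now += tick_s
    return now if wd.expired(now) else None


# A node that stays quorate never fences itself.
print(run_node([True] * 20))
# A node that loses quorum after two ticks fences itself once the timeout passes.
print(run_node([True, True] + [False] * 10))
```

Note that nothing in this loop needs the node to "understand" that it is isolated: it simply stops resetting the timer, and the watchdog does the rest.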
b) if corosync is separated from other networks, it can be that all the other networks are working (storage and VM) but just a corosync network problem can provoke a cluster suicide... that's bad
you have to define certain criteria for "part of the cluster". corosync already does the heavy lifting here, and we require corosync to say "this node is part of the quorate partition of the cluster" anyway for any tasks that modify state to work, so it's a good fit.
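As a rough illustration of what "quorate partition" means, the sketch below uses a plain majority rule with one vote per node. This is a simplification: the function names are made up for this example, and special corosync options (two-node mode, an external quorum device, non-default vote counts) are ignored.

```python
def quorum_threshold(expected_votes):
    """Smallest number of votes that forms a strict majority."""
    return expected_votes // 2 + 1


def is_quorate(votes_in_partition, expected_votes):
    """True if this partition holds a majority of the expected votes."""
    return votes_in_partition >= quorum_threshold(expected_votes)


# 3-node cluster: a partition of 2 is quorate, a single isolated node is not.
print(is_quorate(2, 3), is_quorate(1, 3))
# 4-node cluster split 2/2: neither half is quorate.
print(is_quorate(2, 4))
```

Because at most one partition can hold a majority, at most one side of a split is ever allowed to modify state.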
c) why not just have an option for not really critical setups (i.e. max 10 nodes, which can work with the described setup) to consider a node to be OK as long as it can communicate with its shared cluster storage? Just reserve a "cluster_disk" in that storage with a FS that supports concurrent writes, and each node rewrites a file with its nodename. If a node can't write there, it has to "commit suicide" (but as in point a)); if it can write, it just has to read all other nodes' timestamps, and if it finds ones that are older than "n" minutes, it can understand that that node is out of the cluster and, i.e., start HA VMs. I'm in a hurry, and maybe something more sophisticated must be thought out, like node_vmid.txt or a sort of "cluster db" like Proxmox already has, or something good enough? Corosync is really overcomplicated and for small setups introduces more problems than it solves, IMHO
that's exactly what we are doing, just replace "shared cluster storage" with "pmxcfs", our fuse-mapped shared DB backed by corosync. the rest (writing timestamps, checking which nodes haven't updated theirs and must have already fenced themselves via their watchdog expiring, etc.) is what our HA stack is doing.
establishing a consistent view of the world/cluster is not trivial, the 'write and check timestamp' part is just the last piece of the puzzle and not sufficient on its own.
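A small sketch of that last piece, and of why it needs the quorum check in front of it. Everything here is hypothetical (function name, parameters, the 60 second staleness limit); it is not the HA stack's code. The stale-timestamp check is only trusted when the checking node itself is quorate, otherwise an isolated node would conclude that everyone else is dead and start their VMs a second time.

```python
def nodes_to_recover(my_name, quorate, timestamps, now, stale_after_s):
    """Return the nodes whose HA services may be recovered elsewhere.

    timestamps: mapping of node name -> last update time in seconds.
    All names and values are assumptions for illustration.
    """
    if not quorate:
        # Without a consistent cluster view, stale timestamps prove nothing:
        # this node may be the one that is cut off.
        return []
    return sorted(
        name
        for name, ts in timestamps.items()
        if name != my_name and now - ts > stale_after_s
    )


stamps = {"node1": 100, "node2": 95, "node3": 10}

# Quorate node: node3 stopped updating long ago and is treated as fenced.
print(nodes_to_recover("node1", True, stamps, now=100, stale_after_s=60))
# Non-quorate node: the same data must not lead to any recovery action.
print(nodes_to_recover("node1", False, stamps, now=100, stale_after_s=60))
```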