False Fencing Issue

Andrew Holybee

Mar 27, 2017
I came from VMware and have been happy with Proxmox. The one dark spot is that my cluster seems to be doing false fences: it just takes down a node at random and fences the VMs. Also, if I take down one node manually, it seems to restart all of my nodes, which doesn't make any sense. I check the logs on the node, but they only go back as far as the restart. How can I look at the logs leading up to the fence?
 

Fencing usually happens because of a loss of quorum, which is often caused by unreliable multicast in the cluster network. See https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_cluster_network for details.

You can enable persistent journal logging with "mkdir /var/log/journal; systemctl restart systemd-journald". After that, you should be able to list boot intervals using "journalctl --list-boots".
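Once the journal is persistent and the node has been fenced again, the entries from the boot before the current one can be read back. Below is a minimal sketch of how that could look. The function names are made up for illustration, and the unit list is an assumption about which services are worth checking on a Proxmox node, so adjust it to what is actually installed.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of Proxmox or systemd.
# Assumes persistent journaling was enabled BEFORE the fence happened,
# otherwise the journal of the previous boot was never kept.

# Show the last N lines that were logged during the previous boot.
# $1 = number of lines (defaults to 200)
previous_boot_tail() {
    lines="${1:-200}"
    journalctl -b -1 -n "$lines" --no-pager
}

# Same boot, restricted to cluster-related units.
# The unit names are an assumption; check them with "systemctl list-units".
previous_boot_cluster_units() {
    journalctl -b -1 --no-pager \
        -u corosync -u pve-cluster -u pve-ha-lrm -u pve-ha-crm -u watchdog-mux
}
```

After sourcing the snippet on the fenced node, "previous_boot_tail 500" prints the last 500 lines before the reset. "-b -1" selects the boot before the current one; the offsets shown by "journalctl --list-boots" can be used to go further back.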
 
We had issues with everything being on the same power connect switch, so we moved Ceph storage to a dedicated switch, a Netgear JGS524Ev2-200NAS. I followed the white paper Netgear has, so I set "IGMP snooping status" and "block unknown multicast address" to disabled and set the broadcast forwarding method to hardware. The fencing is seemingly random, so I am hoping the logs will tell me what's going on. Will the persistent logging affect performance?
 
I am going to run omping and just want to make sure I get the syntax right.
So I click on node 1, then Shell,
then run
omping -c 10000 -i 0.001 -F -q 10.10.10.1 10.10.10.2 ...
Or do I need to have the names of the nodes in there as well, as in
omping -c 10000 -i 0.001 -F -q PM01-10.10.10.1 pm02-IP 10.10.10.2 etc?

This is for Ceph. Do I need to run omping for both the normal LAN and Ceph?
 
You have to run omping on the cluster network
(i.e. ping the nodes' IPs as they appear in /etc/pve/corosync.conf after each "name" entry).
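To avoid typing the addresses by hand, they could be pulled out of the corosync configuration. The sketch below assumes each node entry carries its address on a "ring0_addr:" line right after the "name" line, which is worth checking against your own file. The function names are hypothetical, and the script only prints the omping command with the options from the question; it does not run it.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical.
# Assumes node addresses appear as "ring0_addr: <address>" lines.

# Print the ring0_addr value of every node entry, one per line.
# $1 = path to the config file (defaults to /etc/pve/corosync.conf)
cluster_addrs() {
    awk '$1 == "ring0_addr:" { print $2 }' "${1:-/etc/pve/corosync.conf}"
}

# Print (do not execute) the omping command line for those addresses.
# $1 = path to the config file, passed through to cluster_addrs
omping_command() {
    addrs="$(cluster_addrs "$1" | tr '\n' ' ')"
    # Strip the single trailing space left over from the newline conversion.
    echo "omping -c 10000 -i 0.001 -F -q ${addrs% }"
}
```

Calling "omping_command" with no argument reads /etc/pve/corosync.conf. The printed command then has to be started on all listed nodes at roughly the same time, since each omping instance waits for the others to answer.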