We've been fighting with this for a few months now on a non-production (pre-production) cluster of four (4) nodes running PVE 4.4 (latest-ish). We haven't made it past 30 days without at least one node falling out of the cluster. Which node falls out appears to be random, i.e. every node has fallen out roughly an equal number of times over the past six months.
We can't buy more nodes and move into production until this is resolved. =(
We have four PVE nodes (two Supermicro SBI-7228R-T2X TwinBlades). To help troubleshoot, we've since moved all corosync traffic to a separate air-gapped gigabit switch and verified multicast functionality with the omping commands prescribed in the PVE documentation (commands below). There is absolutely nothing else on this network except corosync traffic.
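For reference, the omping checks we ran were essentially the two from the PVE docs, roughly as follows (hostnames here are placeholders for our four blades): the short high-frequency test and the longer ~10 minute one.

omping -c 10000 -i 0.001 -F -q obfuscated-pve-blade-1 obfuscated-pve-blade-2 obfuscated-pve-blade-3 obfuscated-pve-blade-4
omping -c 600 -i 1 -q obfuscated-pve-blade-1 obfuscated-pve-blade-2 obfuscated-pve-blade-3 obfuscated-pve-blade-4

Both passed on all four nodes.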
Our primary storage backend is an independent (hardware separate from PVE) Ceph Jewel (10.2.6) cluster.
We have corosync debugging turned on, sending to a remote syslog. I've attached the logs from our most recent crash (ignore the API stuff in the logs; it's from our custom dashboard). Node obfuscated-pve-blade-1 drops the token and falls out of the cluster at 07:28:17. The node is then appropriately fenced and the watchdog performs an IPMI action (we've configured it to shut the node down) to avoid split-brain. There is absolutely nothing interesting in /var/log on the node that falls out. It's not out of memory or under heavy CPU load (our cluster is generally pretty bored since we're not running any production workloads). It just drops the token and the cluster reacts appropriately.
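If the effective totem timers (token, consensus, etc.) would help, I can dump the runtime values on each node with something like:

corosync-cmapctl | grep runtime.config.totem

and post the output here.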
I've also attached our corosync.conf.
One time I thought that _maaaaybe_ clock skew could have been a problem, so I switched from systemd-timesyncd to ntpd and closely monitored time offset. Every node's clock offset has stayed under 0.5 ms and the problem still persists. I once even tried forcing one node's clock waaaay out of sync, and the corosync logs were significantly different ("zOMG that node's clock is way wrong. KILL IT.").
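We've been watching the offsets with plain ntpq on each node, roughly:

ntpq -pn

and eyeballing the offset column (reported in milliseconds) against our NTP sources.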
Do y'all see anything obvious that we've missed? What else can I look at to get more information on why this is happening?
Thanks!