We've been fighting with this for a few months now on a non-production (pre-production) cluster of four (4) nodes running PVE 4.4 (latest-ish). We haven't made it past 30 days without at least one node falling out of the cluster. Which node falls out appears to be random, i.e. every node has fallen out roughly an equal number of times over the past six months.
We can't buy more nodes and move into production until this is resolved. =(
We have four PVE nodes (two Supermicro SBI-7228R-T2X TwinBlades). To help troubleshoot, we've since moved all corosync traffic to a separate air-gapped gigabit switch and verified multicast functionality with the omping commands prescribed in the PVE documentation (commands below). There is absolutely nothing else on this network except corosync traffic.
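For reference, the omping checks we ran were essentially the two from the PVE docs, roughly as follows (hostnames here are placeholders for our four blades): the short high-frequency test and the longer ~10 minute one.

omping -c 10000 -i 0.001 -F -q obfuscated-pve-blade-1 obfuscated-pve-blade-2 obfuscated-pve-blade-3 obfuscated-pve-blade-4
omping -c 600 -i 1 -q obfuscated-pve-blade-1 obfuscated-pve-blade-2 obfuscated-pve-blade-3 obfuscated-pve-blade-4

Both passed on all four nodes.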
Our primary storage backend is an independent (hardware separate from PVE) Ceph Jewel (10.2.6) cluster.
We have corosync debugging turned on, sending to a remote syslog. I've attached the logs from our most recent crash (ignore the API stuff in the logs; it's from our custom dashboard). Node obfuscated-pve-blade-1 drops the token and falls out of the cluster at 07:28:17. The node is then appropriately fenced and the watchdog performs an IPMI action (we've configured it to shut the node down) to avoid split-brain. There is absolutely nothing interesting in /var/log on the node that falls out. It's not out of memory or under heavy CPU load (our cluster is generally pretty bored since we're not running any production workloads). It just drops the token and the cluster reacts appropriately.
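If the effective totem timers (token, consensus, etc.) would help, I can dump the runtime values on each node with something like:

corosync-cmapctl | grep runtime.config.totem

and post the output here.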
I've also attached our corosync.conf.
One time I thought that _maaaaybe_ clock skew could have been a problem, so I switched from systemd-timesyncd to ntpd and closely monitored time offset. Every node's clock offset has stayed under 0.5 ms and the problem still persists. I once even tried forcing one node's clock waaaay out of sync, and the corosync logs were significantly different ("zOMG that node's clock is way wrong. KILL IT.").
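We've been watching the offsets with plain ntpq on each node, roughly:

ntpq -pn

and eyeballing the offset column (reported in milliseconds) against our NTP sources.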
Do y'all see anything obvious that we've missed? What else can I look at to get more information on why this is happening?
Thanks!