corosync problems, token lost

Jay Sullivan

Member
Mar 27, 2017
We've been fighting with this for a few months now with a non/pre-production cluster of four (4) nodes running PVE 4.4.latest-ish. We haven't made it past 30 days without at least one node falling out of the cluster. The node that falls out of the cluster is seemingly random, i.e. every node has fallen out roughly an equal number of times over the past six months.

We can't buy more nodes and move into production until this is resolved. =(

We have four PVE nodes (two Supermicro SBI‑7228R‑T2X TwinBlades). To help troubleshoot, we've since moved all corosync traffic to a separate air-gapped gigabit switch and have verified multicast functionality with omping commands prescribed in PVE documentation. There is absolutely nothing else on this network except corosync traffic.
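For reference, the multicast checks we ran are the ones from the Proxmox multicast notes; roughly the following (node names here are placeholders for our four blades):

```
# long burst at high rate; a healthy switch should show ~0% loss
omping -c 10000 -i 0.001 -F -q pve1 pve2 pve3 pve4

# ~10 minutes at a 1s interval; catches IGMP snooping querier timeouts
omping -c 600 -i 1 -q pve1 pve2 pve3 pve4
```

Run simultaneously on all four nodes; each instance reports unicast and multicast loss to every peer.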

Our primary storage backend is an independent (hardware separate from PVE) Ceph Jewel (10.2.6) cluster.

We have corosync debugging turned on, sending to a remote syslog. I've attached the logs from our cluster from our most recent crash (ignore the api stuff in the logs; it's from our custom dashboard). Node obfuscated-pve-blade-1 drops the token and falls out of the cluster at 07:28:17. The node is then appropriately fenced and the watchdog performs an IPMI action (we've configured it to shut down) to avoid split-brain. There is absolutely nothing interesting in /var/log on the node that falls out. It's not out of memory or under heavy CPU load (our cluster is generally pretty bored since we're not running any production workloads). It just drops the token and the cluster reacts appropriately.

I've also attached our corosync.conf.
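For anyone who doesn't want to open the attachment: a stock PVE 4.x corosync.conf looks roughly like the sketch below. Everything here is a placeholder (obfuscated names, made-up network, only one of the four node stanzas shown), not our actual file:

```
totem {
  version: 2
  secauth: on
  cluster_name: obfuscated-cluster
  config_version: 4
  interface {
    ringnumber: 0
    bindnetaddr: 10.10.10.0
  }
}

nodelist {
  node {
    name: obfuscated-pve-blade-1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: obfuscated-pve-blade-1
  }
  # ...three more node stanzas...
}

quorum {
  provider: corosync_votequorum
}
```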

One time I thought that _maaaaybe_ clock skew could have been a problem, so I switched from systemd-timesyncd to ntpd and closely monitored time offset. Every node's clock offset has stayed under 0.5 ms, and the problem still persists. I once even tried forcing one node's clock waaaay out of sync, and the corosync logs were significantly different ("zOMG that node's clock is way wrong. KILL IT.").
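(For anyone wanting to check the same thing: with ntpd running, the per-peer offset is easy to watch from the local daemon.)

```
# poll the local ntpd for peer status; 'offset' and 'jitter' are in ms
ntpq -pn
```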

Do y'all see anything obvious that we've missed? What else can I look at to get more information on why this is happening?

Thanks!
 


Hi,

try using the IP addresses as ring0_addr instead of the names.
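That is, in each nodelist stanza, something along these lines (the IP here is a placeholder for the node's address on the corosync network):

```
node {
  name: obfuscated-pve-blade-1
  nodeid: 1
  quorum_votes: 1
  # was: ring0_addr: obfuscated-pve-blade-1
  ring0_addr: 10.10.10.1
}
```

Remember to bump config_version in the totem section when editing corosync.conf on a running cluster.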
 
Sweet. Well, I just had a new crash within a few hours of changing the names to IPs. I'll post up logs shortly.
 
