Intermittent cluster node failure

alitvak69

Renowned Member
Oct 2, 2015
105
5
83
Hello everyone,

On my production pve-3.4-11 cluster (qdisk + 2 nodes), one node occasionally gets evicted in the middle of the night.

The only clues in the other node's logs are:

corosync.log
Code:
Dec 19 01:02:28 corosync [TOTEM ] A processor failed, forming new configuration.
Dec 19 01:02:30 corosync [CLM  ] CLM CONFIGURATION CHANGE

qdiskd.log
Code:
Dec 19 01:01:19 qdiskd Node 1 missed an update (2/10)
Dec 19 01:01:20 qdiskd Node 1 missed an update (3/10)
Dec 19 01:01:21 qdiskd Node 1 missed an update (4/10)
Dec 19 01:01:22 qdiskd Node 1 missed an update (5/10)
Dec 19 01:01:23 qdiskd Node 1 missed an update (6/10)
Dec 19 01:01:24 qdiskd Node 1 missed an update (7/10)
Dec 19 01:01:25 qdiskd Node 1 missed an update (8/10)
Dec 19 01:01:26 qdiskd Node 1 missed an update (9/10)
Dec 19 01:01:27 qdiskd Node 1 missed an update (10/10)
Dec 19 01:01:28 qdiskd Node 1 missed an update (11/10)
Dec 19 01:01:28 qdiskd Node 1 DOWN
Dec 19 01:01:28 qdiskd Making bid for master
Dec 19 01:01:29 qdiskd Node 1 missed an update (12/10)
Dec 19 01:01:30 qdiskd Node 1 missed an update (13/10)
Dec 19 01:01:31 qdiskd Node 1 missed an update (14/10)
Dec 19 01:01:32 qdiskd Node 1 missed an update (15/10)
Dec 19 01:01:32 qdiskd Assuming master role
Dec 19 01:01:33 qdiskd Node 1 is undead.
Dec 19 01:01:33 qdiskd Writing eviction notice (again) for node 1
Dec 19 01:01:34 qdiskd Node 1 evicted

When it happens, the VMs do migrate, but needless to say this is not a good situation for us.
How can I debug this problem further?

One thought I had is that maybe the node misses a heartbeat during the backup. Does it make sense to adjust some cluster settings (and if so, which ones?) so that fencing is not initiated?
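In case it helps the discussion: on PVE 3.x (cman-based), the relevant timeouts live in /etc/pve/cluster.conf. A hedged sketch of the kind of change I am wondering about (the values are illustrative, not a confirmed fix, and config_version must be bumped for the change to propagate):

Code:
<cluster name="mycluster" config_version="2">
  <!-- raise the totem token timeout so brief stalls during backups
       do not trigger a membership change (value is illustrative) -->
  <totem token="54000"/>
  <!-- qdiskd: interval is seconds between updates, tko is how many
       missed updates before a node is declared dead (illustrative) -->
  <quorumd interval="1" tko="20" label="pve-qdisk"/>
</cluster>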

Thank you in advance,
 
Maybe there is too much load on the qdisk storage (hence the missed updates). I would try to reduce backup traffic (rate limit), or avoid running backups from several nodes in parallel.
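To sketch the rate-limit suggestion: vzdump reads its global defaults from /etc/vzdump.conf, where a bandwidth limit can be set. Something like the following (the limit value is illustrative; bwlimit is in KiB/s):

Code:
# /etc/vzdump.conf -- global vzdump defaults
# limit backup I/O to roughly 40 MiB/s so backups do not
# starve the qdisk/heartbeat traffic (value is illustrative)
bwlimit: 40960

Staggering the backup schedules so that only one node runs vzdump at a time would address the parallelism point.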