Hello everyone,
On my production pve-3.4-11 cluster (qdisk + 2 nodes), one of the nodes gets evicted in the middle of the night once in a while.
The only clues in the other node's logs are:
corosync.log
Code:
Dec 19 01:02:28 corosync [TOTEM ] A processor failed, forming new configuration.
Dec 19 01:02:30 corosync [CLM ] CLM CONFIGURATION CHANGE
qdiskd.log
Code:
Dec 19 01:01:19 qdiskd Node 1 missed an update (2/10)
Dec 19 01:01:20 qdiskd Node 1 missed an update (3/10)
Dec 19 01:01:21 qdiskd Node 1 missed an update (4/10)
Dec 19 01:01:22 qdiskd Node 1 missed an update (5/10)
Dec 19 01:01:23 qdiskd Node 1 missed an update (6/10)
Dec 19 01:01:24 qdiskd Node 1 missed an update (7/10)
Dec 19 01:01:25 qdiskd Node 1 missed an update (8/10)
Dec 19 01:01:26 qdiskd Node 1 missed an update (9/10)
Dec 19 01:01:27 qdiskd Node 1 missed an update (10/10)
Dec 19 01:01:28 qdiskd Node 1 missed an update (11/10)
Dec 19 01:01:28 qdiskd Node 1 DOWN
Dec 19 01:01:28 qdiskd Making bid for master
Dec 19 01:01:29 qdiskd Node 1 missed an update (12/10)
Dec 19 01:01:30 qdiskd Node 1 missed an update (13/10)
Dec 19 01:01:31 qdiskd Node 1 missed an update (14/10)
Dec 19 01:01:32 qdiskd Node 1 missed an update (15/10)
Dec 19 01:01:32 qdiskd Assuming master role
Dec 19 01:01:33 qdiskd Node 1 is undead.
Dec 19 01:01:33 qdiskd Writing eviction notice (again) for node 1
Dec 19 01:01:34 qdiskd Node 1 evicted
When it happens the VMs do migrate, but needless to say this is not a good situation for us.
How can I debug the problem further?
One thought I had is that maybe the node misses a heartbeat during the nightly backup. Does it make sense to adjust some settings on the cluster (which ones?) so that fencing is not initiated?
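For reference, the qdiskd log above implies an interval of 1 second and tko of 10 (the node is declared DOWN after 10 missed updates, i.e. a ~10 second window). If loosening those timeouts is the right approach, I assume the relevant part of /etc/pve/cluster.conf would look roughly like this (the values and the label are illustrative, not copied from my actual config):

Code:
<!-- quorumd: interval = seconds between qdisk heartbeats, tko = missed
     updates before a node is declared dead. The log above suggests
     interval="1" tko="10"; raising tko (e.g. to 60) would tolerate
     longer I/O stalls during backup. label is a placeholder here. -->
<quorumd interval="1" tko="10" votes="1" label="my_qdisk"/>
<!-- My understanding is that the corosync totem token timeout (ms)
     should be at least interval * tko * 1000, so that qdisk detects
     the failure before corosync does -- please correct me if wrong. -->
<totem token="54000"/>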
Thank you in advance,