Hey guys!
We're having two issues on our cluster, which are likely related.
For the past few hours, most of our HA tasks have been failing with the error 'update resource failed: error with cfs lock 'domain-ha': got lock request timeout', and ha-manager reports the master as 'old timestamp - dead?' (even though the node in question is running fine and keeps updating its LRM lock).
We have already tried restarting pve-cluster and pve-ha-crm across all nodes. We also tried to force a new master by stopping the ha-crm on the "old" master. That works (a new master acquires the manager lock), but as soon as the new master is selected, the same problem reappears (old timestamp and cfs lock timeout on domain-ha).
Manually deleting the directory /etc/pve/priv/lock/domain-ha seems to fix the issue for a couple of hours, but then it comes back. Is there a known fix for this issue?
Here are the pve-ha-crm logs from when we forced a new master:
Code:
Oct 28 15:17:56 compute008 pve-ha-crm[17544]: status change wait_for_quorum => slave
Oct 28 19:43:48 compute008 pve-ha-crm[17544]: successfully acquired lock 'ha_manager_lock'
Oct 28 19:43:48 compute008 pve-ha-crm[17544]: watchdog active
Oct 28 19:43:48 compute008 pve-ha-crm[17544]: status change slave => master
Oct 28 19:44:02 compute008 pve-ha-crm[17544]: got unexpected error - error with cfs lock 'domain-ha': got lock request timeout
Oct 28 19:44:13 compute008 pve-ha-crm[17544]: got unexpected error - error with cfs lock 'domain-ha': got lock request timeout
Oct 28 19:44:26 compute008 pve-ha-crm[17544]: got unexpected error - error with cfs lock 'domain-ha': got lock request timeout
Oct 28 19:44:39 compute008 pve-ha-crm[17544]: got unexpected error - error with cfs lock 'domain-ha': got lock request timeout
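To make the timing in the log easier to read, here is a minimal Python sketch that extracts the 'domain-ha' timeout entries and prints the gap between consecutive ones. The helper names (`parse`, `timeout_gaps`) and the regular expression are made up for illustration and are not part of any Proxmox tool. The sample lines are copied from the log above. Syslog timestamps carry no year, so the sketch pins an arbitrary one.

```python
import re
from datetime import datetime

# Sample lines copied from the pve-ha-crm log above.
LOG = """\
Oct 28 19:43:48 compute008 pve-ha-crm[17544]: status change slave => master
Oct 28 19:44:02 compute008 pve-ha-crm[17544]: got unexpected error - error with cfs lock 'domain-ha': got lock request timeout
Oct 28 19:44:13 compute008 pve-ha-crm[17544]: got unexpected error - error with cfs lock 'domain-ha': got lock request timeout
Oct 28 19:44:26 compute008 pve-ha-crm[17544]: got unexpected error - error with cfs lock 'domain-ha': got lock request timeout
Oct 28 19:44:39 compute008 pve-ha-crm[17544]: got unexpected error - error with cfs lock 'domain-ha': got lock request timeout
"""

# Hypothetical pattern for the syslog-style lines shown above:
# "<Mon> <day> <HH:MM:SS> <host> pve-ha-crm[<pid>]: <message>"
LINE_RE = re.compile(
    r"^(\w{3} +\d+ \d{2}:\d{2}:\d{2}) (\S+) pve-ha-crm\[\d+\]: (.*)$"
)


def parse(log):
    """Return a list of (timestamp, host, message) tuples."""
    events = []
    for line in log.splitlines():
        m = LINE_RE.match(line)
        if m:
            # Assumption: the year is irrelevant here, so pin one.
            ts = datetime.strptime("2000 " + m.group(1), "%Y %b %d %H:%M:%S")
            events.append((ts, m.group(2), m.group(3)))
    return events


def timeout_gaps(events, lock="domain-ha"):
    """Seconds between consecutive lock-timeout errors for one cfs lock."""
    needle = f"error with cfs lock '{lock}': got lock request timeout"
    times = [ts for ts, _, msg in events if needle in msg]
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


if __name__ == "__main__":
    print(timeout_gaps(parse(LOG)))
```

On the excerpt above, the timeouts recur roughly every 11 to 13 seconds, starting 14 seconds after the node became master.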