[SOLVED] cfs lock 'domain-ha' timeout & master old timestamp

Ch@rlus

Hey guys!

We're having two issues on our cluster, which are likely related.

For the past few hours, most of our HA tasks have been failing with the error 'update resource failed: error with cfs lock 'domain-ha': got lock request timeout', and our ha-manager reports the master as "old timestamp - dead?" (but the node concerned is running fine and updates its lrm lock).

We already tried restarting pve-cluster and pve-ha-crm across all nodes. We also tried to force a new master by stopping the ha-crm on the "old" master. That works (a new master acquires the manager lock), but as soon as the new master is selected, the same problem appears (old timestamp and cfs lock timeout on domain-ha).

Manually deleting the directory /etc/pve/priv/lock/domain-ha seems to fix the issue for a couple of hours, but then it appears again. Is there any known fix for this issue?

Here are the pve-ha-crm logs from when we force a new master:
Code:
Oct 28 15:17:56 compute008 pve-ha-crm[17544]: status change wait_for_quorum => slave
Oct 28 19:43:48 compute008 pve-ha-crm[17544]: successfully acquired lock 'ha_manager_lock'
Oct 28 19:43:48 compute008 pve-ha-crm[17544]: watchdog active
Oct 28 19:43:48 compute008 pve-ha-crm[17544]: status change slave => master
Oct 28 19:44:02 compute008 pve-ha-crm[17544]: got unexpected error - error with cfs lock 'domain-ha': got lock request timeout
Oct 28 19:44:13 compute008 pve-ha-crm[17544]: got unexpected error - error with cfs lock 'domain-ha': got lock request timeout
Oct 28 19:44:26 compute008 pve-ha-crm[17544]: got unexpected error - error with cfs lock 'domain-ha': got lock request timeout
Oct 28 19:44:39 compute008 pve-ha-crm[17544]: got unexpected error - error with cfs lock 'domain-ha': got lock request timeout
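
To see how often the timeout repeats, the log lines above can be counted with a small helper. This is only a sketch: the function name count_lock_timeouts is made up for illustration, and it simply matches the error text exactly as it appears in the log excerpt.

```shell
#!/bin/sh
# Hypothetical helper: count "lock request timeout" errors for one cfs lock.
# Reads log lines on stdin; $1 is the lock name (e.g. domain-ha).
count_lock_timeouts() {
    # -F: match the string literally, -c: print the number of matching lines
    grep -cF "error with cfs lock '$1': got lock request timeout"
}
```

Possible usage on a node, assuming the CRM logs go to the journal under the pve-ha-crm unit: `journalctl -u pve-ha-crm --since today | count_lock_timeouts domain-ha`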
 
In addition, here's the listing of the /etc/pve/priv/lock directory. Except for compute004, which is powered off, all the locks seem to be active, with current date and time values.

Code:
drwx------ 2 root www-data 0 Oct 28 19:48 domain-ha
drwx------ 2 root www-data 0 Oct 28 19:48 ha_agent_compute001_lock
drwx------ 2 root www-data 0 Oct 28 19:48 ha_agent_compute002_lock
drwx------ 2 root www-data 0 Oct 28 19:48 ha_agent_compute003_lock
drwx------ 2 root www-data 0 Sep 24 21:00 ha_agent_compute004_lock
drwx------ 2 root www-data 0 Oct 28 19:48 ha_agent_compute005_lock
drwx------ 2 root www-data 0 Oct 28 19:48 ha_agent_compute006_lock
drwx------ 2 root www-data 0 Oct 28 19:48 ha_agent_compute007_lock
drwx------ 2 root www-data 0 Oct 28 19:48 ha_agent_compute008_lock
drwx------ 2 root www-data 0 Oct 28 19:48 ha_manager_lock
 
Hmm, after more investigation, I found something, and I'm not sure if it's "normal" behaviour.

It seems that I'm not able to "touch" the directory /etc/pve/priv/lock/domain-ha/ from the "master" node compute07. But I'm able to touch this directory from one of the other nodes (and which node that is seems to change over time).

I assume that the node that is able to touch the directory is the one holding the "lock" (which would explain why the master can't acquire it).
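
A quick way to watch this is to print how old each lock directory's mtime is. This is a minimal sketch, not an official tool: the function names and the LOCK_ROOT and STALE_AFTER variables are made up, and the 120-second threshold is my assumption about when pmxcfs considers a lock expired, so adjust it if that is wrong for your version. It relies on GNU stat.

```shell
#!/bin/sh
# Assumptions: lock root path taken from the listing above; 120 s staleness
# threshold is an assumed value, not confirmed in this thread.
LOCK_ROOT="${LOCK_ROOT:-/etc/pve/priv/lock}"
STALE_AFTER="${STALE_AFTER:-120}"

# Print the number of seconds since the mtime of $1 was last updated.
lock_age() {
    mtime=$(stat -c %Y "$1") || return 1
    now=$(date +%s)
    echo $((now - mtime))
}

# Print "<lock name> <fresh|stale> <age>s" for every directory under $1.
report_locks() {
    for d in "$1"/*; do
        [ -d "$d" ] || continue
        age=$(lock_age "$d") || continue
        if [ "$age" -gt "$STALE_AFTER" ]; then state=stale; else state=fresh; fi
        printf '%s %s %ss\n' "$(basename "$d")" "$state" "$age"
    done
}

# Only report when the lock root actually exists on this machine.
if [ -d "$LOCK_ROOT" ]; then
    report_locks "$LOCK_ROOT"
fi
```

Running it in a loop (for example with `watch`) on several nodes should show whether domain-ha keeps being refreshed by something other than the current master.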
 
Okay, I'll mark this topic as "resolved". After restarting these services, everything seems to be running smoothly again.

Code:
systemctl restart pve-cluster
systemctl restart pvedaemon
systemctl restart pveproxy
systemctl restart pvestatd
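
If this has to be repeated on several nodes, the same four restarts can be wrapped in a small function. This is just a sketch: restart_all and the DRY_RUN switch are names I made up, and by default it only prints the commands instead of running them.

```shell
#!/bin/sh
# Services in the same order as the restart sequence above.
SERVICES="pve-cluster pvedaemon pveproxy pvestatd"

# Hypothetical wrapper: with DRY_RUN=1 (the default) only print each command;
# with DRY_RUN=0 actually restart, stopping at the first failure.
restart_all() {
    for svc in $SERVICES; do
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "systemctl restart $svc"
        else
            systemctl restart "$svc" || return 1
        fi
    done
}
```

Call `restart_all` to preview, then set `DRY_RUN=0` and call it again as root to perform the restarts.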
 