Weird ha-manager state

casparsmit

Hi all,

We have a 4-node Proxmox VE 4.2 cluster and cannot get the HA manager into a healthy state.

The HA config looks empty: resources.cfg is empty and ha-manager config shows nothing.

But the status says otherwise (same output on all 4 nodes):

# ha-manager status -verbose
quorum OK
master blade02 (old timestamp - dead?, Tue May 24 13:26:16 2016)
lrm blade01 (idle, Tue Jul 5 11:12:02 2016)
lrm blade02 (old timestamp - dead?, Mon Jul 4 19:07:26 2016)
lrm blade03 (idle, Tue Jul 5 11:12:04 2016)
lrm blade04 (idle, Tue Jul 5 11:12:03 2016)
service vm:100 (blade02, stopped)
service vm:101 (blade02, started)
service vm:104 (blade03, stopped)
full cluster state:
{
   "lrm_status" : {
      "blade01" : {
         "mode" : "active",
         "results" : {},
         "state" : "wait_for_agent_lock",
         "timestamp" : 1467709922
      },
      "blade02" : {
         "mode" : "restart",
         "results" : {},
         "state" : "wait_for_agent_lock",
         "timestamp" : 1467652046
      },
      "blade03" : {
         "mode" : "active",
         "results" : {},
         "state" : "wait_for_agent_lock",
         "timestamp" : 1467709924
      },
      "blade04" : {
         "mode" : "active",
         "results" : {},
         "state" : "wait_for_agent_lock",
         "timestamp" : 1467709923
      }
   },
   "manager_status" : {
      "master_node" : "blade02",
      "node_status" : {
         "blade01" : "online",
         "blade02" : "online",
         "blade03" : "online",
         "blade04" : "unknown"
      },
      "relocate_trial" : {
         "vm:100" : 0,
         "vm:101" : 0,
         "vm:104" : 0
      },
      "service_status" : {
         "vm:100" : {
            "node" : "blade02",
            "state" : "stopped",
            "uid" : "uLZmWgXNwkC1pGMqdy8TVw"
         },
         "vm:101" : {
            "node" : "blade02",
            "state" : "started",
            "uid" : "pgut4BCpjF7Jgutk2vLXHQ"
         },
         "vm:104" : {
            "node" : "blade03",
            "state" : "stopped",
            "uid" : "sr4qs8BHTAAKqgkusM+WIg"
         }
      },
      "timestamp" : 1464089176
   },
   "quorum" : {
      "node" : "blade04",
      "quorate" : "1"
   }
}

The status lists three services (vm:100, vm:101, vm:104) even though none are configured. In addition, the LRM mode of blade02 is 'restart', and the manager's node_status for blade04 is "unknown".
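To make the inconsistencies easier to spot, here is a minimal Python sketch that walks a trimmed copy of the cluster state above and flags stale timestamps, non-active LRM modes, nodes that are not "online", and services that appear in the status but not in the config. The function name find_anomalies, the 120-second staleness threshold, and the choice of the newest LRM timestamp as "now" are my own assumptions for illustration; they are not part of ha-manager.

```python
import json

# Trimmed copy of the "full cluster state" from the status output
# (timestamps are Unix epoch seconds).
STATE = json.loads("""
{
  "lrm_status": {
    "blade01": {"mode": "active",  "timestamp": 1467709922},
    "blade02": {"mode": "restart", "timestamp": 1467652046},
    "blade03": {"mode": "active",  "timestamp": 1467709924},
    "blade04": {"mode": "active",  "timestamp": 1467709923}
  },
  "manager_status": {
    "master_node": "blade02",
    "node_status": {
      "blade01": "online", "blade02": "online",
      "blade03": "online", "blade04": "unknown"
    },
    "service_status": {
      "vm:100": {"node": "blade02", "state": "stopped"},
      "vm:101": {"node": "blade02", "state": "started"},
      "vm:104": {"node": "blade03", "state": "stopped"}
    },
    "timestamp": 1464089176
  }
}
""")

MAX_AGE = 120  # seconds; hypothetical threshold, not a Proxmox setting


def find_anomalies(state, now, configured=(), max_age=MAX_AGE):
    """Return (kind, subject) tuples for everything that looks inconsistent."""
    issues = []
    mgr = state["manager_status"]

    # The master's own status timestamp should be recent.
    if now - mgr["timestamp"] > max_age:
        issues.append(("stale_master", mgr["master_node"]))

    # Each LRM should report recently and be in mode "active".
    for node, lrm in sorted(state["lrm_status"].items()):
        if now - lrm["timestamp"] > max_age:
            issues.append(("stale_lrm", node))
        if lrm["mode"] != "active":
            issues.append(("lrm_not_active", node))

    # The manager's view of each node.
    for node, status in sorted(mgr["node_status"].items()):
        if status != "online":
            issues.append(("node_not_online", node))

    # Services the manager still tracks although they are not configured.
    for sid in sorted(mgr["service_status"]):
        if sid not in configured:
            issues.append(("orphan_service", sid))

    return issues


# Use the newest LRM timestamp as the reference time (assumption).
now = max(lrm["timestamp"] for lrm in STATE["lrm_status"].values())
issues = find_anomalies(STATE, now, configured=())  # resources.cfg is empty

for kind, subject in issues:
    print(f"{kind}: {subject}")
```

Passing an empty configured set mirrors the empty resources.cfg, so every service the manager still tracks is reported as an orphan.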

Corosync/pve-cluster works fine; only the HA part doesn't.

What is going on here, and how can we fix it in the least disruptive way? (There are VMs running on all nodes.)

Furthermore, the pve-ha-lrm service on blade02 cannot be started and logs:

daemon.log:Jul 4 17:40:58 blade02 pve-ha-lrm[40064]: starting server
daemon.log:Jul 4 17:40:58 blade02 pve-ha-lrm[40064]: status change startup => wait_for_agent_lock
daemon.log:Jul 4 17:40:59 blade02 pve-ha-lrm[40064]: successfully acquired lock 'ha_agent_blade02_lock'
daemon.log:Jul 4 17:40:59 blade02 pve-ha-lrm[40064]: ERROR: unable to open watchdog socket - No such file or directory
daemon.log:Jul 4 17:40:59 blade02 pve-ha-lrm[40064]: restart LRM, freeze all services
daemon.log:Jul 4 17:40:59 blade02 pve-ha-lrm[40064]: server stopped
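The order of events in that log matters: the agent lock is acquired first, and only then does opening the watchdog socket fail, which triggers the restart. As a rough sketch, the sequence can be pulled out of the grep output with a small Python parser. The regular expression and the helper name parse_log are my own and only assume the line layout shown above.

```python
import re

# The grep output from daemon.log, copied from the post.
LOG = """\
daemon.log:Jul 4 17:40:58 blade02 pve-ha-lrm[40064]: starting server
daemon.log:Jul 4 17:40:58 blade02 pve-ha-lrm[40064]: status change startup => wait_for_agent_lock
daemon.log:Jul 4 17:40:59 blade02 pve-ha-lrm[40064]: successfully acquired lock 'ha_agent_blade02_lock'
daemon.log:Jul 4 17:40:59 blade02 pve-ha-lrm[40064]: ERROR: unable to open watchdog socket - No such file or directory
daemon.log:Jul 4 17:40:59 blade02 pve-ha-lrm[40064]: restart LRM, freeze all services
daemon.log:Jul 4 17:40:59 blade02 pve-ha-lrm[40064]: server stopped
"""

# Optional "file:" prefix from grep, then a syslog-style line.
LINE_RE = re.compile(
    r"^(?:[\w.]+:)?"
    r"(?P<ts>\w{3}\s+\d+ [\d:]{8}) "
    r"(?P<host>\S+) "
    r"(?P<proc>[\w-]+)\[(?P<pid>\d+)\]: "
    r"(?P<msg>.*)$"
)


def parse_log(text):
    """Return one dict per line that matches the expected layout."""
    events = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            events.append(match.groupdict())
    return events


events = parse_log(LOG)

# Show each error together with the message that preceded it.
for i, event in enumerate(events):
    if event["msg"].startswith("ERROR"):
        previous = events[i - 1]["msg"] if i > 0 else "(none)"
        print(f"{event['ts']} {event['host']}: {event['msg']}")
        print(f"  preceded by: {previous}")
```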
 
