ha-crm unable to read file

Wout Van den Ende

Jun 5, 2019
We've been working on upgrading our Proxmox/Ceph cluster to new machines, with 10G networking for Ceph. Another system engineer was in the process of removing two nodes from the cluster. He followed a tutorial that suggested manually removing the old nodes from /etc/pve/nodes. That turned out to be bad advice; the tutorial's source was a forum thread here from 2014.

Now the ha-manager and the HA tab in the GUI won't work, throwing the error "got unexpected error - unable to read file '/etc/pve/nodes/stor1/lrm_status'". I've been browsing the forums, and it seems that removing the /etc/pve/ha/manager_status file might help, but I'm quite hesitant since the last thing I tried fenced the whole cluster.

Any guidance on what to do?
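For context, the recovery floated in those older threads is usually a stop-delete-restart cycle around the HA state file. A hedged sketch of that procedure (not verified on this cluster; stopping the services first and keeping a backup are precautions, not part of the original suggestion):

```shell
# Sketch only -- the manager_status removal suggested in older forum threads.
# Stop the HA stack first so the CRM does not rewrite state mid-edit.
systemctl stop pve-ha-crm pve-ha-lrm

# Keep a backup before deleting the shared HA manager state.
cp /etc/pve/ha/manager_status /root/manager_status.bak
rm /etc/pve/ha/manager_status

# On restart, a newly elected CRM master regenerates manager_status.
systemctl start pve-ha-lrm pve-ha-crm
```

This would need to be weighed against the fencing risk the poster mentions; as the replies below show, it turned out not to be necessary here.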
 
He followed a tutorial that suggested manually removing the old nodes from /etc/pve/nodes.
(...)
Now the ha-manager and the HA tab in the GUI won't work, throwing the error "got unexpected error - unable to read file '/etc/pve/nodes/stor1/lrm_status'".
I could not reproduce a problem with HA. The lrm_status files get restored automatically on all nodes. The cluster is broken nonetheless.

Please execute
Code:
pvecm status
ha-manager config
tree /etc/pve/nodes
and post the results formatted as code.
 
It looks like the HA fixed itself. We added another VM to HA, using the HA option on the VM itself rather than the HA tab in Datacenter. Checked with
Code:
journalctl -u pve-ha-crm -u pve-ha-lrm -u corosync
(some details omitted)
Code:
pve-ha-crm[7197]: got unexpected error - unable to read file '/etc/pve/nodes/stor1/lrm_status'
pve-ha-crm[7197]: got unexpected error - unable to read file '/etc/pve/nodes/stor1/lrm_status'
pve-ha-crm[7197]: deleting gone node 'stor1', not a cluster member anymore.
pve-ha-crm[7197]: deleting gone node 'stor1', not a cluster member anymore.
pve-ha-crm[7197]: adding new service 'vm:xxx' on node 'xxx'
I could not reproduce a problem with HA. The lrm_status files get restored automatically on all nodes. The cluster is broken nonetheless.
He did remove the nodes first with pvecm delnode, so the cluster was fine.
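For anyone landing on this thread later: the supported removal path goes through pvecm, not manual edits under /etc/pve/nodes. A minimal sketch of that sequence (node name 'stor1' taken from the error above as an example):

```shell
# Run on a node that remains in the cluster, AFTER the departing node
# has been shut down and will not rejoin with the same identity.
pvecm delnode stor1

# Verify the remaining membership and quorum.
pvecm status
```

Once the node is removed this way, the HA stack cleans up its state on its own, as the "deleting gone node" log lines above show.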