[SOLVED] lrm (unable to read lrm status)

dthompson

Well-Known Member
I had a server die this past week and removed it from the infrastructure. I can confirm that the server is out of commission and has been pulled from the rack. I removed the node and all was well, or so I thought. The cluster is now giving me an issue that seems related to it not being able to find the removed node.

I had 5 nodes, now down to 4.
All 4 remaining nodes (vdc2-vdc5) are showing as healthy.

When I run pvecm status, I get the following:

Code:
Cluster information
-------------------
Name: vdc-cluster
Config Version: 6
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Fri Feb 14 23:47:51 2020
Quorum provider: corosync_votequorum
Nodes: 4
Node ID: 0x00000002
Ring ID: 2.dc
Quorate: Yes

Votequorum information
----------------------
Expected votes: 4
Highest expected: 4
Total votes: 4
Quorum: 3
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000002 1 172.16.1.3 (local)
0x00000003 1 172.16.1.4
0x00000004 1 172.16.1.5
0x00000005 1 172.16.1.6

So all looks right to me. I see that the node is gone and it has disappeared from the GUI as a server. I don't see any reference to it anywhere except in the /etc/pve/corosync.conf file, where at the bottom the bindnetaddr entry still contains the removed node's address:

Code:
totem {
  cluster_name: vdc-cluster
  config_version: 6
  interface {
    bindnetaddr: 172.16.1.2
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}

If I go into /etc/pve/nodes I also see an old folder for vdc1, the node which has been removed.

I'd like to know how to best resolve this issue so that the error goes away.
Can I simply delete /etc/pve/nodes/vdc1 and change the bindnetaddr IP address to one of the other nodes that are still alive?

Thank you very much for any help.
 
Can I simply delete /etc/pve/nodes/vdc1 and change the bindnetaddr IP address to one of the other nodes that are still alive?

The bindnetaddr is rather a way for corosync 2 to find the local address for this ring; normally it is the address of the node where you created the initial cluster. In your case that was probably the now-gone node.
You could change it, but that won't really do anything. Note also that this slightly strange way of specifying a network with a single address instead of, e.g., a CIDR is gone completely with Proxmox VE 6.x, which uses corosync 3.

I'd like to know how to best resolve this issue so that the error goes away.

The issue is the one from your title? I.e. "lrm (unable to read lrm status)"?

Normally the HA stack removes decommissioned nodes from internal tracking automatically once they're gone for over an hour.

Can you please post the output of:
Code:
cat /etc/pve/.members

ha-manager status
 
Thank you very much for the help.

The bindnetaddr is rather a way for corosync 2 to find the local address for this ring; normally it is the address of the node where you created the initial cluster. In your case that was probably the now-gone node.

That's correct. That's the node that is now gone.

You could change it, but that won't really do anything. Note also that this slightly strange way of specifying a network with a single address instead of, e.g., a CIDR is gone completely with Proxmox VE 6.x, which uses corosync 3.

This was an upgrade from version 5 to 6. The configuration was created automatically when I initially set up the cluster with Proxmox 5, and it carried over in the subsequent update to version 6.

If this is no longer the way to do this in Proxmox 6, how is it now done, and how do I change this configuration so that it works and is supported with the current version of corosync? Should this not have been modified during the upgrade from version 5 to 6?

The issue is the one from your title? I.e. "lrm (unable to read lrm status)"?

Normally the HA stack removes decommissioned nodes from internal tracking automatically once they're gone for over an hour.

Can you please post the output of: cat /etc/pve/.members
ha-manager status

Here is the output from .members:
Code:
{
"nodename": "vdc2",
"version": 10,
"cluster": { "name": "vdc-cluster", "version": 6, "nodes": 4, "quorate": 1 },
"nodelist": {
  "vdc2": { "id": 2, "online": 1, "ip": "172.16.1.3"},
  "vdc3": { "id": 3, "online": 1, "ip": "172.16.1.4"},
  "vdc4": { "id": 4, "online": 1, "ip": "172.16.1.5"},
  "vdc5": { "id": 5, "online": 1, "ip": "172.16.1.6"}
  }
}

ha-manager status

Code:
quorum OK
master vdc3 (active, Sat Feb 15 07:37:33 2020)
lrm  (unable to read lrm status)
lrm vdc2 (active, Sat Feb 15 07:37:33 2020)
lrm vdc3 (active, Sat Feb 15 07:37:33 2020)
lrm vdc4 (active, Sat Feb 15 07:37:38 2020)
lrm vdc5 (active, Sat Feb 15 07:37:36 2020)
service ct:102 (vdc5, started)
service ct:107 (vdc4, started)
service ct:109 (vdc2, started)
service ct:113 (vdc2, started)
service ct:117 (vdc4, started)
service ct:118 (vdc5, started)
service ct:122 (vdc3, started)
service ct:500 (vdc2, started)
service ct:501 (vdc3, started)
service ct:502 (vdc4, started)
service ct:503 (vdc3, started)
service vm:100 (vdc4, started)
service vm:103 (vdc3, started)
service vm:104 (vdc4, started)
service vm:105 (vdc2, started)
service vm:111 (vdc3, started)
service vm:114 (vdc3, started)
service vm:120 (vdc3, started)
service vm:125 (vdc4, started)
service vm:216 (vdc5, started)

It seems to me like it's all good in terms of the dead node being gone, yet it's still complaining about the dead node.
 
If this is no longer the way to do this in Proxmox 6, how is it now done, and how do I change this configuration so that it works and is supported with the current version of corosync?

It is still supported for backward compatibility, but you can simply drop the "bindnetaddr" line and increase the config_version.
The information is now taken solely from the "ringX_addr" properties in the node entries, and those should already be fine in your case, as otherwise corosync wouldn't be so happy :)
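For reference, the node entries in the nodelist section of /etc/pve/corosync.conf look roughly like the sketch below; the values are taken from your pvecm output purely for illustration, your actual file may differ slightly:

Code:
nodelist {
  node {
    name: vdc2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 172.16.1.3
  }
  # ... one entry per remaining node (vdc3-vdc5)
}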

It seems to me like it's all good in terms of the dead node being gone, yet it's still complaining about the dead node.

This is mostly a "visual glitch", but it is indeed a bit weird why it's happening and why the offending line has no nodename at all, just a empty string...

It'd be great if you could get me one additional output: cat /etc/pve/ha/manager_status
Maybe that one has some better hints about the ghost node.
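As a side note, that file is a single line of JSON, so pretty-printing it can make it easier to read, for example with Python's json.tool module (assuming python3 is available on the node):

Code:
python3 -m json.tool /etc/pve/ha/manager_status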
 
A workaround, which should do it for sure, would be to stop all CRM services and reset the manager_status file by doing:
Code:
rm /etc/pve/ha/manager_status

Doing the above won't help as long as there's a master active, as the in-memory status will just get flushed out to disk again.

Theoretically it can be enough to stop the CRM on the current master, immediately delete the file, and start the service again: while the other nodes' HA Cluster Resource Managers try to take over as the new master, your deletion is hopefully faster, so they start with a clean slate and rescan the nodes and services freshly.
Code:
systemctl stop pve-ha-crm.service && rm -f /etc/pve/ha/manager_status && systemctl start pve-ha-crm.service

I'm just mentioning that method as it can be a bit more convenient than stopping and starting the CRM on all nodes rather than just one.
 
It is still supported for backward compatibility, but you can simply drop the "bindnetaddr" line and increase the config_version.

So, I can take it from this:
Code:
totem {
  cluster_name: vdc-cluster
  config_version: 6
  interface {
    bindnetaddr: 172.16.1.2
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}

and change it to this:

Code:
totem {
  cluster_name: vdc-cluster
  config_version: 7
  interface {
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}

The information is now taken solely from the "ringX_addr" properties in the node entries, and those should already be fine in your case, as otherwise corosync wouldn't be so happy :)

This is mostly a "visual glitch", but it is indeed a bit weird why it's happening and why the offending line has no nodename at all, just a empty string...

It'd be great if you could get me one additional output: cat /etc/pve/ha/manager_status
Maybe that one has some better hints about the ghost node.

Absolutely, here you go:

Code:
{"service_status":{"vm:104":{"uid":"LrhcZnNrXkJhY75J2u5JOA","running":1,"state":"started","node":"vdc4"},"vm:111":{"state":"started","node":"vdc2","running":1,"uid":"d2kq0+35h1gQEqWNTIN1Xg"},"ct:502":{"running":1,"uid":"365fPtHxFj3WwQBuRRLzNg","node":"vdc4","state":"started"},"ct:122":{"state":"started","node":"vdc5","running":1,"uid":"PW+xVsCvNf8zXwzrfzsA9Q"},"ct:501":{"uid":"aCIHPWh8+80465K9MAvvAQ","running":1,"node":"vdc2","state":"started"},"ct:107":{"state":"started","node":"vdc4","running":1,"uid":"Hj8DZJCSmp0qWYjACO/7Pw"},"vm:105":{"uid":"JmMdH/aleg5hn0412+wpzA","running":1,"node":"vdc2","state":"started"},"vm:103":{"state":"started","node":"vdc4","uid":"tI++wWkMKrlTcvVfcpyR5g","running":1},"vm:100":{"uid":"KbQAmyCMFQVj6MFZaLheSA","running":1,"state":"started","node":"vdc4"},"vm:216":{"running":1,"uid":"7ObYPbveS4/BmsUzI7shSA","state":"started","node":"vdc5"},"vm:114":{"state":"started","node":"vdc4","uid":"34gcVYnJrVxkoQHoVZl2Vg","running":1},"ct:102":{"running":1,"uid":"4y+BdNZblW9JAp8X9GCZ5Q","state":"started","node":"vdc5"},"ct:118":{"node":"vdc5","state":"started","running":1,"uid":"+/xbVXr9k0PQcIAXuL+b5Q"},"ct:117":{"uid":"JUWVotLditqD+aAuGS6gSg","running":1,"node":"vdc4","state":"started"},"ct:113":{"node":"vdc4","state":"started","uid":"HRsPZTo6ZtqkXSXCoklB+g","running":1},"ct:500":{"node":"vdc2","state":"started","running":1,"uid":"JsqGKja0OpxUpwE4vM62cg"},"vm:120":{"running":1,"uid":"hBlQ1ZCHZTv6uhNe78MVvQ","state":"started","node":"vdc5"},"ct:109":{"uid":"85iuNjo9rGha5sOqjAL7fQ","running":1,"state":"started","node":"vdc2"},"ct:503":{"uid":"qUgk69llgPjiayKynViQjA","running":1,"state":"started","node":"vdc5"},"vm:125":{"running":1,"uid":"paHPEqz5HLvhBLWPQQiCqA","state":"started","node":"vdc4"}},"master_node":"vdc4","node_status":{"vdc2":"online","vdc3":"online","":"fence","vdc5":"online","vdc4":"online"},"timestamp":1581776756}
 
A workaround, which should do it for sure, would be to stop all CRM services and reset the manager_status file by doing:
Code:
rm /etc/pve/ha/manager_status

Doing the above won't help as long as there's a master active, as the in-memory status will just get flushed out to disk again.

Theoretically it can be enough to stop the CRM on the current master, immediately delete the file, and start the service again: while the other nodes' HA Cluster Resource Managers try to take over as the new master, your deletion is hopefully faster, so they start with a clean slate and rescan the nodes and services freshly.
Code:
systemctl stop pve-ha-crm.service && rm -f /etc/pve/ha/manager_status && systemctl start pve-ha-crm.service

I'm just mentioning that method as it can be a bit more convenient than stopping and starting the CRM on all nodes rather than just one.

Thanks, I've tried both of those options, except for stopping it on all nodes.

So I just run this on each node then:
Code:
systemctl stop pve-ha-crm.service
rm -f /etc/pve/ha/manager_status
systemctl start pve-ha-crm.service

What are the implications of doing this on my nodes while the cluster is live? Will any of the VMs go down, or will the nodes spontaneously reboot themselves?
 
"node_status":{ ... "":"fence"

Yes, there's a ghost node with an empty string as its name, an edge case where it would be interesting to know how it happened...

What are the implications of doing this on my nodes while the cluster is live? Will any of the VMs go down, or will the nodes spontaneously reboot themselves?

No, no VM or CT will be touched. The implication is that there's no active manager, so no migration or service state change (started to stopped, ...) will propagate. But as this is only temporary, and normally doable in roughly a minute, this shouldn't be an issue.
After a master comes up again, the queued-up work, if any at all, will get done. So this procedure is normally not dangerous at all.

It is just important that the following order is followed, as otherwise it will have no effect:
  1. Stop the CRM service on all nodes, only then continue
  2. On a single node remove the manager status file
  3. Start the CRM service again on all nodes
So as long as you only continue to step 2 once step 1 has been done on all nodes, you really should be fine.
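If it helps, here is a rough sketch of that order run from one node over SSH, assuming root SSH access between the nodes is set up and using your node names from above:

Code:
# 1. stop the CRM on every node first
for n in vdc2 vdc3 vdc4 vdc5; do ssh root@$n "systemctl stop pve-ha-crm.service"; done

# 2. remove the manager status file on a single node (/etc/pve is shared)
rm -f /etc/pve/ha/manager_status

# 3. start the CRM again on every node
for n in vdc2 vdc3 vdc4 vdc5; do ssh root@$n "systemctl start pve-ha-crm.service"; done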
 
Yes, there's a ghost node with an empty string as its name, an edge case where it would be interesting to know how it happened...

I'm not sure what happened. The node died, I pulled it out of the rack, and then I removed it with
Code:
pvecm delnode
and it never gave an error.

Am I able to edit that file directly, or do I have to run the three commands above to clear this issue up? That is, can I take this:

Code:
"node_status":{"vdc5":"online","":"fence","vdc2":"online","vdc3":

and pull just the empty node name so it looks like this:

Code:
"node_status":{"vdc5":"online",:"fence","vdc2":"online","vdc3":

or would I pull the entire entry, such as:

Code:
"node_status":{"vdc5":"online","vdc2":"online","vdc3":


No, no VM or CT will be touched. The implication is that there's no active manager, so no migration or service state change (started to stopped, ...) will propagate. But as this is only temporary, and normally doable in roughly a minute, this shouldn't be an issue.
After a master comes up again, the queued-up work, if any at all, will get done. So this procedure is normally not dangerous at all.

It is just important that the following order is followed, as otherwise it will have no effect:
  1. Stop the CRM service on all nodes, only then continue
  2. On a single node remove the manager status file
  3. Start the CRM service again on all nodes
So as long as you only continue to step 2 once step 1 has been done on all nodes, you really should be fine.

On the single node where I remove the manager status file, does it need to be the master, or does it not matter at this point once the CRM service is shut down?

Also, thank you very much for the help!
 
Doing this on each node has done the trick for me. It seems my ghost node is now gone. So that's amazing, thank you for the awesome help.

Code:
systemctl stop pve-ha-crm.service
rm -f /etc/pve/ha/manager_status
systemctl start pve-ha-crm.service

I just have one last question now about the bindnetaddr address in my config file:

Code:
totem {
  cluster_name: vdc-cluster
  config_version: 7
  interface {
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}

Is the above configuration, with the IP address pulled out of the corosync.conf file, correct for getting the cluster up to the proper syntax for the current version, or is there anything else I need to do?
 
On the single node where I remove the manager status file, does it need to be the master, or does it not matter at this point once the CRM service is shut down?

Just FYI, one node should be enough, as /etc/pve is clustered and thus shared between all nodes.


Is the above configuration, with the IP address pulled out of the corosync.conf file, correct for getting the cluster up to the proper syntax for the current version, or is there anything else I need to do?

Actually you could do slightly more, but as said, it's fine if you leave it that way.
If the time comes when we just cannot stay backward compatible anymore, which really only happens on major releases (so 7.0 at the earliest), we will either transform it automatically or document how to do it.

If you still want to adapt it now, then do the following (the result is sketched below):
1. Delete the "bindnetaddr" line completely.
2. Rename "ringnumber" to "linknumber", e.g., resulting in: linknumber: 0
3. Don't forget to bump (increment) the "config_version" value before saving.
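Assuming your config_version is currently 7 as posted above, the totem section would then end up looking something like this (just a sketch of the result of those three steps):

Code:
totem {
  cluster_name: vdc-cluster
  config_version: 8
  interface {
    linknumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}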

Oh, and you only need to do this on one node, as /etc/pve/corosync.conf is also shared.

Honestly, I would just keep it as is for now - in the "don't touch a running system if not required" sense.

Glad I could help :)
 
