Proxmox host/node failure behavior questions

hlbot001

New Member
Oct 1, 2024
Hi,

I have a four node Proxmox 9.1.2 cluster with fibre attached shared storage.

One of the nodes encountered a system board failure, which caused it to go down and become inaccessible. While researching how to recover from a dead node, I saw the recommendation to evict the dead host from the cluster and rebuild it once the server was repaired.

Since the host had already been down for over two days, I went ahead and evicted it from the cluster by running the "pvecm delnode" command from one of the running nodes. This resulted in an error stating that the dead node did not exist. A "pvecm nodes" command from a running node did not list the dead node either. However, the dead node was still displayed in the web GUI. I was able to clean up the GUI by removing the entry for the dead node from the corosync.conf file.

I have questions regarding this and how Proxmox determines if a node is not going to be coming back up.

1. How long does a node need to be down before the other nodes in the cluster remove it from the quorum? For example, the result of the pvecm nodes command did not show the dead node.

2. I "cleaned up" the dead node from the GUI by editing the corosync.conf file. Is this the correct way to remove the node from the GUI? If not, what else needs to be done?

3. What is the correct approach to bring a node up after it has been down for an extended period of time (days)? Manual cleanup? Rebuild?

Thanks,
Henry
 
Hi Henry,

I think the main point of confusion here is that `pvecm nodes` does not necessarily show every configured cluster node. It only shows the currently active membership, so an offline/dead node not appearing there does not mean it was already cleanly removed from the cluster.

(Screenshot: `pvecm nodes` output from the cluster.)

So in your case, seeing the failed node disappear from pvecm nodes after being down for a long time is expected behavior, but that is different from actually removing it from the cluster configuration.
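To make the distinction concrete, here is a rough sketch of where each view comes from (commands must run on a cluster node; file paths are the standard Proxmox locations):

```shell
# Active membership only -- a dead node simply drops out of this list:
pvecm nodes

# Quorum view: "Expected votes" still counts the configured node
# until it is actually removed from the cluster configuration:
pvecm status

# The configured node list (what the GUI displays) lives here:
grep -A2 'node {' /etc/pve/corosync.conf
```

So a node can be absent from the first command while still present in the last, which is exactly the state you observed.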

For removing a dead node, the usual approach is:
  1. Make sure the failed node is fully powered off and cannot come back with stale cluster state.
  2. From another healthy cluster node, run `pvecm delnode <nodename>`.
The official procedure is documented here:
https://pve.proxmox.com/wiki/Cluster_Manager#_remove_a_cluster_node
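The steps above, sketched as commands (the node name `pve4` is a placeholder for your dead node; run these on a healthy member):

```shell
# 1. The failed node must stay powered off so it cannot rejoin
#    with stale cluster state.

# 2. Confirm the remaining nodes are quorate, then remove the dead node:
pvecm status
pvecm delnode pve4

# 3. Leftover per-node config may still exist and can be cleaned up
#    afterwards -- only after verifying it really is leftover state:
# rm -r /etc/pve/nodes/pve4
```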

I would be cautious about manually editing corosync.conf just to make the GUI look clean, unless it was part of the documented removal workflow and you were sure the cluster config was still consistent.

For bringing that server back after hardware repair, I would generally avoid just booting it back into its old cluster state after several days offline. The safer approach is usually to reinstall it cleanly (or at least remove the old cluster state on that node) and then rejoin it properly.
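The rejoin after a clean reinstall is a single command on the new node; as a sketch (the IP of an existing cluster member is a placeholder):

```shell
# On the freshly reinstalled node, join the existing cluster:
pvecm add 192.0.2.10
```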

So the short answer to your questions is:
  • The node is removed from active quorum/membership quickly once it is offline, but that is not the same as being removed from cluster config.
  • `pvecm delnode <nodename>` is the correct removal path.
  • For a node that has been down for days, rebuild/rejoin is usually the cleaner and safer approach than trying to revive it as-is.
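On the "how quickly" part: corosync declares a member dead once the totem token times out, which is on the order of seconds, not hours or days. If you want to see the effective value on your own cluster, the runtime key can be read with `corosync-cmapctl` (assuming corosync 3.x):

```shell
# Effective token timeout in milliseconds (runtime value,
# after corosync's per-node scaling is applied):
corosync-cmapctl -g runtime.config.totem.token
```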
 
Make sure the failed node is fully powered off and cannot come back with stale cluster state.
  1. From another healthy cluster node, run pvecm delnode <nodename>.
The official procedure is documented here:
https://pve.proxmox.com/wiki/Cluster_Manager#_remove_a_cluster_node
This is what got me started down the rabbit hole that eventually led to me manually editing corosync.conf. When I ran the pvecm delnode <nodename> command, it returned an error stating, in effect, that it was an unknown node.

  • The node is removed from active quorum/membership quickly once it is offline, but that is not the same as being removed from cluster config.
How quickly does this happen? For example, if a node goes down because of a failed DIMM and it takes me a few hours to replace the DIMM. Would the node being down for 2 hours cause the cluster config on the problem node to go stale enough to warrant a rebuild?

My confusion is mainly because I don't fully understand the corosync mechanism and function. If you can suggest any good documentation on corosync, I would appreciate it.

Thanks,
Henry
 
How quickly does this happen? For example, if a node goes down because of a failed DIMM and it takes me a few hours to replace the DIMM. Would the node being down for 2 hours cause the cluster config on the problem node to go stale enough to warrant a rebuild?

Hi Henry,

the cluster configuration only becomes outdated if you change "cluster stuff" while that node is offline, such as adding or removing nodes, or changing network configuration/addresses. It is not a question of elapsed time itself. As long as nothing has changed, I would just start the repaired machine.
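One concrete way to check this, assuming the standard Proxmox paths: compare the `config_version` field in `/etc/pve/corosync.conf` on a healthy node with `/etc/corosync/corosync.conf` on the repaired node before letting it rejoin. A minimal sketch using an illustrative sample file:

```shell
# Extract config_version from a corosync.conf-style file. On a real
# cluster you would point this at /etc/pve/corosync.conf (healthy node)
# and /etc/corosync/corosync.conf (repaired node). The sample below is
# illustrative only.
cat > /tmp/corosync.conf.sample <<'EOF'
totem {
  version: 2
  config_version: 7
  cluster_name: homelab
}
EOF

awk '/config_version:/ { print $2 }' /tmp/corosync.conf.sample
```

If the versions match, nothing cluster-wide changed while the node was down, and simply booting it back up is generally fine.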

I also didn't find any detailed documentation. The best option so far is `man 8 corosync` on the console, or a web search. The "SEE ALSO" section has links to corosync.conf and corosync-overview.