[SOLVED] Remove node from cluster/datacenter after physically removing the device

mati

New Member
Nov 14, 2022
Hi folks!

I have a small homelab and I had 4 PVE nodes: pve0, pve1, pve2 and pve3 in 1 datacenter/cluster. I was looking to turn pve3 into PBS, so I simply disconnected the hardware and installed PBS there.

When I login to GUI on pve0/pve1/pve2 I still can see pve3 and I'd like to get rid of it.
When I run:
Code:
root@pve0:~# pvecm nodes

Membership information
----------------------
    Nodeid      Votes Name
         1          1 pve1
         3          1 pve2
         4          1 pve0 (local)
pve3 is not visible, but I can still see it in corosync.conf:
Code:
root@pve0:~# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve0
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 192.168.0.190
  }
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.0.191
  }
  node {
    name: pve2
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 192.168.0.192
  }
  node {
    name: pve3
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.0.193
  }
}
on each node.

Looks like I should have looked up the documentation first and removed the node while it was still active :/

How can I fix that and get rid of pve3 references from all the other nodes?

Thanks!
 
You did nothing wrong; the same happens when a piece of hardware dies.

Code:
pvecm delnode pve3
rm -rf /etc/pve/nodes/pve3

You may need to reload the web UI afterwards.
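If it helps anyone, a quick sanity check afterwards (not an official step, just a suggestion) is to confirm the node is gone from both the membership list and corosync.conf:
Code:
pvecm nodes
grep -A 3 'name: pve3' /etc/pve/corosync.conf   # should print nothing
ls /etc/pve/nodes                                # the pve3 directory should be gone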
 
Amazing, thank you! This worked like a charm.

In case other folks find this topic. After running pvecm delnode pve3 I got the following error:
Code:
root@pve0:~# pvecm delnode pve3
Could not kill node (error = CS_ERR_NOT_EXIST)
Killing node 2
but the pve3 node was properly deleted from corosync.conf
 
Amazing, thank you! This worked like a charm.

In case other folks find this topic. After running pvecm delnode pve3 I got the following error:
Code:
root@pve0:~# pvecm delnode pve3
Could not kill node (error = CS_ERR_NOT_EXIST)
Killing node 2

I suppose the patch did not fix it then: https://bugzilla.proxmox.com/show_bug.cgi?id=3596
I will update the bug report to see if there's an explanation.

but the pve3 node was properly deleted from corosync.conf

The other thing I wonder about is why PVE doesn't also clean up the respective node directory. It never does; it's not causing cluster issues per se, but it keeps showing the zombie node in the GUI.
 
Hi,
@mati what version of Proxmox VE was installed at the time the issue happened?
 
@fiona It's still in the docs [1] that CS_ERR_NOT_EXIST can happen after issuing delnode, but I suppose (also based on the commentary) that it should have been resolved by the fix of bug #3596.

FWIW I had this happen to me on 8.0 before (with the said node being dead at the time of issuing the command); it's just that I was not aware of the fix at that time, and later I forgot about it.

[1] https://pve.proxmox.com/wiki/Cluster_Manager#_remove_a_cluster_node
 
@fiona It's still in the docs [1] that CS_ERR_NOT_EXIST can happen after issuing delnode, but I suppose (also based on the commentary) that it should have been resolved by the fix of bug #3596.

FWIW I had this happen to me on 8.0 before (with the said node being dead at the time of issuing the command); it's just that I was not aware of the fix at that time, and later I forgot about it.

[1] https://pve.proxmox.com/wiki/Cluster_Manager#_remove_a_cluster_node
Oh, right. The message itself is fine. The bug was about exiting with an error code at that point. Now the code continues even if the corosync-cfgtool fails. What could potentially be improved is catching the error explicitly and hinting that it's fine if the node is offline also on the CLI, but the docs already mention it.
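To illustrate (assuming a version that already includes the fix for #3596), the corosync message still shows up on stderr, but the command itself now finishes and should exit with status 0:
Code:
root@pve0:~# pvecm delnode pve3
Could not kill node (error = CS_ERR_NOT_EXIST)
Killing node 2
root@pve0:~# echo $?
0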
 
Oh, right. The message itself is fine. The bug was about exiting with an error code at that point. Now the code continues even if the corosync-cfgtool fails. What could potentially be improved is catching the error explicitly and hinting that it's fine if the node is offline also on the CLI, but the docs already mention it.
Thanks for the reply. To be honest, I had not looked before, but now that I have - and I know some might think I am nitpicking here - wouldn't it be much nicer to have PVE::Tools::run_command account for these situations (eval, ignore the error, and ditch stderr except when debug is on)? There are already all sorts of $errmsg, $errfunc, ... flags at one's disposal ... this is a pvecm command which, according to the docs, will virtually always cause corosync to complain that the node is offline when deleting. As it stands now, it's not even clear it's a corosync-cfgtool error.

The other question I had was why not also clean up (or move out of the way) the /etc/pve/nodes/$nodename directory; everyone doing these operations for the first time wonders why they have a zombie GUI entry.
 
Hi,
@mati what version of Proxmox VE was installed at the time the issue happened?
pve0: 8.X (I don't remember if I updated to 8.1 after removing the node or before)
pve1: 7.4-16
pve2: 8.X (I don't remember if I updated to 8.1 after removing the node or before)

and I was using pve0 to perform these operations.
 
Thanks for the reply. To be honest, I had not looked before, but now that I have - and I know some might think I am nitpicking here - wouldn't it be much nicer to have PVE::Tools::run_command account for these situations (eval, ignore the error, and ditch stderr except when debug is on)?
In general, you do want to see all error output, to see (potential) issues. That's why it's present by default with PVE::Tools::run_command. Even for the specific command here, there could be other error output and it'd be bad to just silence that. As I wrote, the potential improvement would be to explicitly catch the specific error and hint that it's fine if the node is offline.
There are already all sorts of $errmsg, $errfunc, ... flags at one's disposal ... this is a pvecm command which, according to the docs, will virtually always cause corosync to complain that the node is offline when deleting. As it stands now, it's not even clear it's a corosync-cfgtool error.
Yes, the output could be prefixed to make it clear where the message comes from (but it still won't help users much if they didn't look at the docs).
The other question I had was why not also clean up (or move out of the way) the /etc/pve/nodes/$nodename directory; everyone doing these operations for the first time wonders why they have a zombie GUI entry.
I think it's not done automatically because the configuration might still be useful for the admin. But the docs can be improved to mention it of course: https://lists.proxmox.com/pipermail/pve-devel/2024-January/061236.html
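If someone wants the stale GUI entry gone but would rather keep the old node's configuration around, one option (just a sketch; the backup path is an arbitrary example) is to archive the directory before removing it:
Code:
tar czf /root/pve3-node-config.tar.gz -C /etc/pve/nodes pve3
rm -rf /etc/pve/nodes/pve3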
 
In general, you do want to see all error output, to see (potential) issues. That's why it's present by default with PVE::Tools::run_command. Even for the specific command here, there could be other error output and it'd be bad to just silence that. As I wrote, the potential improvement would be to explicitly catch the specific error and hint that it's fine if the node is offline.

I agree that the stderr output is useful; it's just that before the patch it was clear the exit code came from corosync-cfgtool. That is now hidden, and basically only the format of the error message lets the (experienced) reader guess where and why it's coming up. It would be good to have stderr prefixed or prepended (generally) when it's not PVE's own message. I did not want to go tell you HOW to do it in the bug report, but the report was concerned about the error message coming up, not just the exit code.

Yes, the output could be prefixed to make it clear where the message comes from (but it still won't help users much if they didn't look at the docs).

I would just regexp match CS_ERR_NOT_EXIST and make it look like a warning in that stderr line.

As for the docs, they say "it is possible that you will receive" ... this is in the procedure where it is advised that "it is critical to power off the node before removal." Hence the "error" will always show up on stderr. I would at least update the docs to show it in the sample output and say it WILL show up and why it can be ignored.

I understand why it all happens and that it's all safe and, for a developer, merely cosmetic, but it's nerve-wracking for an admin who is not a developer, especially since it appears to concern cluster quorum (a failure to "kill" a node).

I think it's not done automatically because the configuration might still be useful for the admin. But the docs can be improved to mention it of course: https://lists.proxmox.com/pipermail/pve-devel/2024-January/061236.html

Thank you very much for this!
 
I agree that the stderr output is useful; it's just that before the patch it was clear the exit code came from corosync-cfgtool. That is now hidden, and basically only the format of the error message lets the (experienced) reader guess where and why it's coming up. It would be good to have stderr prefixed or prepended (generally) when it's not PVE's own message. I did not want to go tell you HOW to do it in the bug report, but the report was concerned about the error message coming up, not just the exit code.



I would just regexp match CS_ERR_NOT_EXIST and make it look like a warning in that stderr line.

As for the docs, they say "it is possible that you will receive" ... this is in the procedure where it is advised that "it is critical to power off the node before removal." Hence the "error" will always show up on stderr. I would at least update the docs to show it in the sample output and say it WILL show up and why it can be ignored.
If you want to propose specific ways to do things, it's best to send patches: https://pve.proxmox.com/wiki/Developer_Documentation

If other developers find them useful, we are happy to include them.
 
