[SOLVED] Remove node from cluster/datacenter after physically removing the device

mati

Hi folks!

I have a small homelab where I had 4 PVE nodes: pve0, pve1, pve2 and pve3 in one datacenter/cluster. I wanted to turn pve3 into PBS, so I simply disconnected the hardware and installed PBS there.

When I log in to the GUI on pve0/pve1/pve2 I can still see pve3, and I'd like to get rid of it.
When I run:
Code:
root@pve0:~# pvecm nodes

Membership information
----------------------
    Nodeid      Votes Name
         1          1 pve1
         3          1 pve2
         4          1 pve0 (local)
pve3 is not visible, but I can still see it in corosync.conf:
Code:
root@pve0:~# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve0
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 192.168.0.190
  }
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.0.191
  }
  node {
    name: pve2
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 192.168.0.192
  }
  node {
    name: pve3
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.0.193
  }
}
on each node.

Looks like I should have looked up the documentation first and removed the node while it was still active :/

How can I fix this and get rid of the pve3 references on all the other nodes?

Thanks!
 
You did nothing wrong; the same happens when a piece of hardware dies.

Code:
pvecm delnode pve3
rm -rf /etc/pve/nodes/pve3

You may need to reload the web UI afterwards.
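If you want to double-check that the node is really gone, something like this should do (just a sanity check, not strictly required):
Code:
# the node should no longer show up in the membership or config
pvecm nodes
grep 'name: pve3' /etc/pve/corosync.conf   # should print nothing
# its node directory should be gone too, so the GUI entry disappears after a reload
ls /etc/pve/nodes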
 
Amazing, thank you! This worked like a charm.

In case other folks find this topic: after running pvecm delnode pve3 I got the following error:
Code:
root@pve0:~# pvecm delnode pve3
Could not kill node (error = CS_ERR_NOT_EXIST)
Killing node 2
but the pve3 node was properly deleted from corosync.conf
 
Amazing, thank you! This worked like a charm.

In case other folks find this topic: after running pvecm delnode pve3 I got the following error:
Code:
root@pve0:~# pvecm delnode pve3
Could not kill node (error = CS_ERR_NOT_EXIST)
Killing node 2

I suppose the patch did not fix it then: https://bugzilla.proxmox.com/show_bug.cgi?id=3596
I will update the bug report to see if there's an explanation.

but the pve3 node was properly deleted from corosync.conf

The other thing I wonder about is why PVE doesn't also clean up the respective node directory. It never does; it's not causing cluster issues per se, but it keeps the zombie node showing in the GUI.
 
Hi,
@mati what version of Proxmox VE was installed at the time the issue happened?
 
@fiona It's still in the docs [1] that CS_ERR_NOT_EXIST can happen after issuing delnode, but I suppose (also based on the commentary) that it should have been resolved by the fix of bug #3596.

FWIW, I had this happen to me on 8.0 before (with the node in question being dead at the time of issuing the command); I just was not aware of the fix at that time, and later I forgot about it.

[1] https://pve.proxmox.com/wiki/Cluster_Manager#_remove_a_cluster_node
 
@fiona It's still in the docs [1] that CS_ERR_NOT_EXIST can happen after issuing delnode, but I suppose (also based on the commentary) that it should have been resolved by the fix of bug #3596.

FWIW, I had this happen to me on 8.0 before (with the node in question being dead at the time of issuing the command); I just was not aware of the fix at that time, and later I forgot about it.

[1] https://pve.proxmox.com/wiki/Cluster_Manager#_remove_a_cluster_node
Oh, right. The message itself is fine. The bug was about exiting with an error code at that point. Now the code continues even if corosync-cfgtool fails. What could potentially be improved is catching the error explicitly and hinting on the CLI as well that it's fine if the node is offline, but the docs already mention it.
 
Oh, right. The message itself is fine. The bug was about exiting with an error code at that point. Now the code continues even if corosync-cfgtool fails. What could potentially be improved is catching the error explicitly and hinting on the CLI as well that it's fine if the node is offline, but the docs already mention it.
Thanks for the reply. To be honest I had not looked before, but now that I did - and I know some might think I am nitpicking here - wouldn't it be much nicer to have PVE::Tools::run_command account for these situations (eval, ignore the error and ditch stderr, except when debug is on)? There are already all sorts of $errmsg, $errfunc, ... flags at one's disposal ... this is a pvecm command which, according to the docs, will virtually always cause corosync to complain that the node is offline when deleting. As it stands now, it's not even clear it's a corosync-cfgtool error.

The other question I had was why not also clean up (or move out of the way) the /etc/pve/nodes/$nodename directory; everyone doing these operations for the first time wonders why they have a zombie GUI entry.
 
Hi,
@mati what version of Proxmox VE was installed at the time the issue happened?
pve0: 8.X (I don't remember if I updated to 8.1 after removing the node or before)
pve1: 7.4-16
pve2: 8.X (I don't remember if I updated to 8.1 after removing the node or before)

and I was using pve0 to perform these operations.
 
Thanks for the reply. To be honest I had not looked before, but now that I did - and I know some might think I am nitpicking here - wouldn't it be much nicer to have PVE::Tools::run_command account for these situations (eval, ignore the error and ditch stderr, except when debug is on)?
In general, you do want to see all error output, to see (potential) issues. That's why it's present by default with PVE::Tools::run_command. Even for the specific command here, there could be other error output and it'd be bad to just silence that. As I wrote, the potential improvement would be to explicitly catch the specific error and hint that it's fine if the node is offline.
There are already all sorts of $errmsg, $errfunc, ... flags at one's disposal ... this is a pvecm command which, according to the docs, will virtually always cause corosync to complain that the node is offline when deleting. As it stands now, it's not even clear it's a corosync-cfgtool error.
Yes, the output could be prefixed to make it clear where the message comes from (but it still won't help users much if they didn't look at the docs).
The other question I had was why not also clean up (or move out of the way) the /etc/pve/nodes/$nodename directory; everyone doing these operations for the first time wonders why they have a zombie GUI entry.
I think it's not done automatically because the configuration might still be useful for the admin. But the docs can be improved to mention it of course: https://lists.proxmox.com/pipermail/pve-devel/2024-January/061236.html
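If you do want the zombie GUI entry gone but would rather keep the old node's configuration around, one option is to copy it somewhere outside /etc/pve before removing it (the paths here are just an example):
Code:
mkdir -p /root/removed-nodes
cp -r /etc/pve/nodes/pve3 /root/removed-nodes/
rm -rf /etc/pve/nodes/pve3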
 
In general, you do want to see all error output, to see (potential) issues. That's why it's present by default with PVE::Tools::run_command. Even for the specific command here, there could be other error output and it'd be bad to just silence that. As I wrote, the potential improvement would be to explicitly catch the specific error and hint that it's fine if the node is offline.

I agree that the stderr output is useful; it's just that before (the patch) it was clear the exit code came from corosync-cfgtool. That is now hidden, and basically only the format of the error message lets one (just the experienced one) guess where and why it's coming up. It would be good to have stderr prefixed or prepended when it's not PVE's own message (generally). I did not want to go tell you HOW to do it in the bug report, but the report was concerned about the error message coming up, not just the exit code.

Yes, the output could be prefixed to make it clear where the message comes from (but it still won't help users much if they didn't look at the docs).

I would just regexp match CS_ERR_NOT_EXIST and make it look like a warning in that stderr line.
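Until something like that exists, one can get a similar effect on the admin side by filtering pvecm's stderr; a rough bash sketch (purely cosmetic, the exit behaviour is unchanged):
Code:
pvecm delnode pve3 2> >(sed 's/.*CS_ERR_NOT_EXIST.*/WARNING: node already offline (CS_ERR_NOT_EXIST), safe to ignore/' >&2)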

As for the docs, they say "it is possible that you will receive" ... this is in the procedure where it is advised that "it is critical to power off the node before removal." Hence the "error" will always show up on stderr. I would at least update the docs to show it in the sample output and say it WILL show up and why it can be ignored.

I understand why it all happens, that it's all safe, and that for a developer it's just cosmetic, but it's nerve-wracking for an admin who is not a developer, especially since it concerns cluster quorum (a failure to "kill" a node).

I think it's not done automatically because the configuration might still be useful for the admin. But the docs can be improved to mention it of course: https://lists.proxmox.com/pipermail/pve-devel/2024-January/061236.html

Thank you very much for this!
 
I agree that the stderr output is useful; it's just that before (the patch) it was clear the exit code came from corosync-cfgtool. That is now hidden, and basically only the format of the error message lets one (just the experienced one) guess where and why it's coming up. It would be good to have stderr prefixed or prepended when it's not PVE's own message (generally). I did not want to go tell you HOW to do it in the bug report, but the report was concerned about the error message coming up, not just the exit code.



I would just regexp match CS_ERR_NOT_EXIST and make it look like a warning in that stderr line.

As for the docs, they say "it is possible that you will receive" ... this is in the procedure where it is advised that "it is critical to power off the node before removal." Hence the "error" will always show up on stderr. I would at least update the docs to show it in the sample output and say it WILL show up and why it can be ignored.
If you want to propose specific ways to do things, it's best to send patches: https://pve.proxmox.com/wiki/Developer_Documentation

If other developers find them useful, we are happy to include them.
 
