[SOLVED] Cluster Node in offline state but it is online

rabban · Sep 11, 2023

Hi,

We have a cluster with 6 nodes. I logged in to one of our nodes to delete two VMs on another node in the same cluster. After that, the node became unresponsive in the GUI and is now shown as offline. I can reach the node via tty, and all VMs running on it are still up.

The VMs I wanted to delete were IDs 1602 and 1603; both had two disks. VM 1603 was a restore of 1602, and one disk of 1602 had more recent data than the backup. So I detached disk 1 from 1602 and added it manually to VM 1603 by editing the VM 1603 config file. The disks of both VMs were stored on the same shared storage.

I think the problem was that I deleted VM 1603 first without removing its reference to the disk from VM 1602. Now VM 1603 doesn't exist, VM 1602 still exists, and we have a node shown as "offline".

I'm able to run commands on it. Reading through other posts, I found that I can restart the GUI with the service pveproxy restart command. However, the command is not able to kill some pveproxy processes and is stuck in a loop trying to start the proxy server.

All nodes in the cluster run PVE 6.0-4.

Regards

Philipp Hufnagl · Sep 11, 2023

Hello
Can you give me the output of journalctl --since "2023-09-10" > $(hostname)-journal.txt

rabban · Sep 11, 2023

Hi Philipp,

Attached is the requested output.

Philipp Hufnagl · Sep 11, 2023

Can you give me the output of

pvecm status

and

ps aux > $(hostname)-ps.txt

rabban · Sep 11, 2023

Both are attached. It is important to mention that the affected server is 192.168.15.10. The pvecm status command hasn't produced any output there yet, so I ran it on another node in the same cluster (192.168.15.3).

rabban · Sep 11, 2023

Adding as an update.

I'm not able to SSH from or to the affected server because it doesn't respond. It's as if the process cannot start correctly, although the server is up and responds to ping requests. I tried to run df -h without success because it hangs. If I try to kill the process, it does not die, not even with the -9 option.

In the Proxmox GUI, the node and all of its VMs are shown with a grey "?" icon.

Philipp Hufnagl · Sep 12, 2023

A lot of your processes have the status "Ds". The "D" means uninterruptible sleep, usually waiting on I/O, which is why they cannot be killed, even with -9.

You can try systemctl restart pve-cluster, but it might be simpler to recover by restarting the server.
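
For anyone checking the same thing on their own node, here is a minimal sketch of how the processes in uninterruptible sleep could be listed. The function names filter_d_state and list_d_state are made up for illustration; the ps and awk options are standard procps/awk, and the WCHAN column only hints at what the kernel is waiting on.

```shell
# Keep the header line plus every row whose STAT column starts with "D"
# (uninterruptible sleep). Reads ps-style output on stdin.
filter_d_state() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# List PID, state, kernel wait channel and command for all processes,
# then keep only the ones in D state.
list_d_state() {
    ps -eo pid,stat,wchan:32,cmd | filter_d_state
}
```

Running list_d_state on the affected node would show which commands are stuck. If most of them are waiting on storage, that points at the storage layer rather than at pveproxy itself.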

rabban · Sep 13, 2023

Hello Philipp,

Indeed, we restarted the server and it is fixed now. This issue was probably caused by the VM deletion. As I mentioned, a virtual disk from VM 1602 was attached to VM 1603, and the server was probably waiting for the deletion of that disk from 1602.

One final question: should that command be run only on the affected server or on every node in the cluster?

Thanks for your help!
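
For future readers: before deleting a VM, it may help to check whether any other VM config still references its disks. The sketch below is only an illustration and not something from the Proxmox tooling. The function name find_foreign_disk_refs is hypothetical, and it assumes the usual vm-<vmid>-disk-<n> volume naming and the per-node config directory /etc/pve/qemu-server.

```shell
# Print every VM config file (other than <vmid>.conf itself) that mentions
# a volume named vm-<vmid>-disk-*.
# Usage: find_foreign_disk_refs <vmid> [config_dir]
find_foreign_disk_refs() {
    vmid="$1"
    conf_dir="${2:-/etc/pve/qemu-server}"   # assumed default location
    for conf in "$conf_dir"/*.conf; do
        [ -e "$conf" ] || continue                                # no configs at all
        [ "$(basename "$conf")" = "${vmid}.conf" ] && continue    # skip the owner
        if grep -q -- "vm-${vmid}-disk-" "$conf"; then
            printf '%s\n' "$conf"
        fi
    done
}
```

In the situation described above, find_foreign_disk_refs 1602 would have listed the 1603 config, which is a hint to detach the disk there before deleting either VM. To cover every node in the cluster, the directory argument would need to point at each node's config directory in turn.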

Philipp Hufnagl · Sep 15, 2023

You would have to do this on all nodes. However, if everything works after a restart, this is not needed.
