[SOLVED] Cluster Node in offline state but it is online

rabban

New Member
Sep 11, 2023
Hi,

We have a cluster with 6 nodes. I logged in to one of our nodes to delete two VMs on another node in the same cluster. After that, the node became unresponsive in the GUI and is now shown as offline. I can still reach the node via tty, and all VMs running on it are still up.

The VMs I wanted to delete were IDs 1602 and 1603; both had two disks. VM 1603 was a restore of 1602 and also used a disk from 1602 that held more recent data than the backup. To set that up, I had detached disk 1 from VM 1602 and added it manually to VM 1603 by editing the VM 1603 config file. The disks of both VMs were stored on the same shared storage.

I think the problem was that I deleted VM 1603 first without removing the reference to VM 1602's disk from it. Now VM 1603 no longer exists, VM 1602 still exists, and we have a node shown as "offline".
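A check along the following lines, run before deleting either VM, could reveal such a cross-reference. This is only a sketch: `find_disk_refs` is a hypothetical helper name, and it assumes the usual Proxmox volume naming `vm-<vmid>-disk-<n>` and VM config files under `/etc/pve/nodes/<node>/qemu-server/`.

```shell
# List every VM config file that references a volume owned by the given VMID.
# find_disk_refs is an illustrative name, not a Proxmox command.
find_disk_refs() {
    vmid="$1"
    # Second argument is optional; /etc/pve/nodes is assumed as the default root.
    conf_root="${2:-/etc/pve/nodes}"
    # -r: recurse, -l: print file names only, -E: extended regular expressions
    grep -rlE --include='*.conf' "vm-${vmid}-disk-[0-9]+" "$conf_root" 2>/dev/null
}

# Example: which configs still point at disks of VM 1602?
# find_disk_refs 1602
```

If the output lists a config other than `1602.conf` (here it would have been `1603.conf`), that reference should be removed from the other VM before anything is deleted.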

I'm able to run commands on it. Reading through other posts, I found that I can restart the GUI with the service pveproxy restart command. However, the command is not able to kill some of the pveproxy processes and is stuck in a loop trying to start the proxy server.

All nodes in the cluster run PVE 6.0-4.

Regards
 
Hello
Can you run journalctl --since "2023-09-10" > $(hostname)-journal.txt and attach the resulting file?
 
Hi Philipp,

Attached is the requested output.
 

Attachments

  • node1-journal.txt
    451.4 KB · Views: 5
Can you give me the output of

pvecm status

and

ps aux > $(hostname)-ps.txt
 
Both are attached. Note that the affected server is 192.168.15.10. The pvecm status command has not returned any output there yet, so I ran it on another node in the same cluster (192.168.15.3).
 

Attachments

  • node1-ps.txt
    69.9 KB · Views: 7
  • pvecm-status.txt
    663 bytes · Views: 6
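When pvecm status is run on a healthy node, as here, the line of most interest is usually the quorum state. A minimal sketch for pulling it out is shown below; `quorate_value` is a hypothetical helper name, and it assumes the output contains a `Quorate:` line as printed by current pvecm versions.

```shell
# Print the value of the "Quorate:" line from `pvecm status` output on stdin.
# quorate_value is an illustrative name, not part of Proxmox.
quorate_value() {
    # Split each line on the colon and any spaces that follow it.
    awk -F': *' '$1 == "Quorate" { print $2; exit }'
}

# Example on a responsive node:
# pvecm status | quorate_value
```

The same function can be pointed at a saved file, for example `quorate_value < pvecm-status.txt`.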
Adding as an update.

I'm not able to SSH from or to the affected server because it doesn't respond. It looks like the process cannot start correctly, although the server is up and responds to ping requests. I tried to run df -h without success because it hangs. If I try to kill the process, it is not killed, not even with the -9 option.

In the Proxmox GUI, the node and all of its VMs are shown with a grey "?" icon.
 
Many of your processes have the status "Ds", which means uninterruptible sleep.

You can try systemctl restart pve-cluster, but it might be simpler to recover by restarting the server.
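To see which processes are stuck, the ps output can be filtered on the STAT column. The following is a small sketch rather than an official procedure; `filter_d_state` is a hypothetical helper name, and it assumes the standard `ps aux` layout in which STAT is the eighth column.

```shell
# Keep the header line plus every process whose STAT starts with "D"
# (uninterruptible sleep). Reads `ps aux` output on stdin.
# filter_d_state is an illustrative name.
filter_d_state() {
    awk 'NR == 1 || $8 ~ /^D/'
}

# Live on the affected node:
# ps aux | filter_d_state
# Against a saved dump:
# filter_d_state < node1-ps.txt
```

Processes in this state are waiting inside the kernel, typically on I/O, and do not act on signals until that wait ends, which is why kill -9 had no effect.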
 
Hello Philipp,

Indeed, we restarted the server and it is fixed now. The issue was probably caused by the VM deletion: as I mentioned, a virtual disk from VM 1602 was attached to VM 1603, and the server was probably waiting for the deletion of that disk.

Final question: should that command be run on the affected server only, or on every node in the cluster?

Thanks for your help!
 
You have to do this on all nodes. However, if everything works after a restart, this is not needed.
 
