[SOLVED] Node/guests status unknown, guests offline, 'pvecm status' ok

Daedalus

Hi all,

Weird one here. Or at least, something I haven't encountered before.
3-node cluster. Node 1 shows as unknown, and all its guests as well:

[Screenshot: prox1 and all of its guests showing a grey "unknown" status]

Code:
pvecm status
shows all nodes are there, and I can interact with node 1 from the GUI fine, but none of the guests are actually running.
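
For reference, the cluster-side checks look something like this from the node's shell (nothing exotic, just the usual suspects):
Bash:
# corosync membership and quorum
pvecm status
pvecm nodes
# pmxcfs (/etc/pve) and the stats daemon both need to be healthy
systemctl status pve-cluster pvestatd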
Something that may or may not be related is that node 1 doesn't seem to register the custom colours for the guest tags, suggesting it's not reading datacenter.cfg correctly?
They should look like this (from node 2):

[Screenshot: guest list on node 2, with the custom-coloured tags shown on each guest]
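
For context, those colours come from the colour map in /etc/pve/datacenter.cfg; the entry looks roughly like this (the tag names and hex values here are made up):
Code:
# /etc/pve/datacenter.cfg
# format: tag-style: color-map=<tag>:<background>[:<text>][;...]
tag-style: color-map=prod:CC0000:FFFFFF;lab:00AA00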

I notice there that node 2 doesn't have a status indicator on the node itself. I'm not sure what this signifies.

Any help here would be great!
 
Does it look the same when you're directly connected to prox1's GUI?
Just as a quick test, can you check the status of pvestatd and if restarting it helps?
Bash:
systemctl status pvestatd
systemctl restart pvestatd
Do this on prox1.
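
If a restart doesn't stick, the journal usually says why; something like:
Bash:
# recent entries from the stats daemon
journalctl -u pvestatd --since "2 hours ago"
# and from the cluster filesystem, in case /etc/pve is unhappy
journalctl -u pve-cluster --since "2 hours ago"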
 
The first screenshot above is from prox1's GUI. Like I say, I can interact with the node itself fine, and it seems like the cluster sees it, but the guests aren't happy.

pvestatd was running; the only log messages were about a timed-out connection to the "unRAID" storage, from yesterday.
I restarted it, and the node status icon showed green, but all the VMs were still offline and unknown. After a couple of minutes it went back to the same state as above.

Further info:
From prox1, I can open a guest's console and see terminal output, but usually I can't log in. Sometimes no keyboard input is passed to the username field; sometimes it's accepted and I'm asked for a password, but nothing happens beyond that.

From another node, I can access prox1's shell fine, and run any commands I like, but the guests are unreachable, acting like a network connectivity failure.
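
For what it's worth, the guest state as QEMU sees it can be queried from prox1's shell like this (VMID 100 is just a stand-in for one of the affected guests):
Bash:
# all VMs on this node with their reported state
qm list
# detailed state for a single guest (100 is a placeholder VMID)
qm status 100 --verbose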

This massive IO delay is interesting:
[Screenshot: prox1 summary graph showing massive IO delay]
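
If anyone wants to dig into the IO side, something like this should point at a blocked device or stuck processes (iostat assumes the sysstat package is installed):
Bash:
# per-device utilisation and wait times, five 1-second samples
iostat -x 1 5
# anything stuck in uninterruptible sleep (D state)?
ps -eo state,pid,wchan:32,cmd | awk '$1 == "D"'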

But all the storage is there, and I can browse to "unRAID" and read files fine.
Code:
root@prox1:/mnt/pve/unRAID# pvesm status
Name              Type     Status           Total            Used       Available        %
backup-pbs         pbs     active      4714135552      3474581376      1239554176   73.71%
local              dir     active         8515456          263552         8251904    3.09%
local-nvme     zfspool     active      1885865868      1313838452       572027416   69.67%
unRAID            cifs     active    124990845616     90941736356     34049109260   72.76%
 
It was the VM disk on prox1 that got stuck during a scrub. I think it might be hardware; still investigating, but it looks like that's what was causing the IO delay, which I guess made PVE unable to query the guests.
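
For anyone landing here later: scrub progress can be watched, and a stuck scrub cancelled, like this (local-nvme is the ZFS pool from the pvesm output above; the smartctl device path is a placeholder):
Bash:
# scrub progress plus per-device read/write/checksum errors
zpool status -v local-nvme
# cancel an in-progress scrub if it's wedging IO
zpool scrub -s local-nvme
# check the underlying disk's health (device path is a placeholder)
smartctl -a /dev/nvme0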