Timeouts in multiple parts of PVE 8.1.4

CelticWebs

Mar 14, 2023
I recently started having an issue where a single node won't connect to the other node. There were no obvious errors in any of the config files. In another thread we found that I can usually force it to connect by running the corosync -f command. I've since noticed some other things that I'm fairly sure are related, though what exactly is causing them isn't obvious. The things I've noted are:

PVE won't achieve quorum without help (and even then it sometimes loses the connection)
SSH is VERY slow to connect from a terminal on my machine, even when using the direct IP
Opening a console in PVE for a VM or for the PVE host often fails, showing "failed to connect to server"
A VM told to reboot will often shut down and then not start back up; the logs show TASK ERROR: timeout waiting on systemd
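
One way to narrow down the slow SSH symptom is to time the separate phases of a connection. The following is only a rough sketch, not a definitive tool: the function name time_ssh_banner is made up for illustration, and it assumes sshd is listening on port 22 and sends its identification line as soon as the TCP connection is accepted. If both numbers come back small, the delay is more likely in the later login steps (authentication, session setup) than in the network path.

```python
import socket
import time


def time_ssh_banner(host, port=22, timeout=30.0):
    """Return (connect_seconds, banner_seconds, banner) for one connection.

    Hypothetical helper: it only measures the TCP connect and the wait for
    the server's first line, it does not attempt to log in.
    """
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        connected = time.monotonic()
        sock.settimeout(timeout)
        # An SSH server sends its "SSH-2.0-..." identification line first.
        banner = sock.recv(256)
        got_banner = time.monotonic()
    text = banner.decode(errors="replace").strip()
    return connected - start, got_banner - connected, text


# Example use (replace the address with the node's IP):
# connect_s, banner_s, banner = time_ssh_banner("192.0.2.10")
# print(f"connect {connect_s:.3f}s, banner {banner_s:.3f}s, {banner}")
```

Running it once against the affected node and once against the healthy node gives a simple side-by-side comparison.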

I'm sure there are other things, but these are the most obvious. Inside a VM everything seems totally normal: I can connect remotely to its IP via terminal, and the applications run as expected from inside.

It's been suggested that I enable verbose logging on corosync, though I'm told it's VERY verbose and I'm not sure that's where I need to start.

It's as if something is slowing response times enough to cause timeouts, even though CPU load and available memory both look healthy.
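
CPU load and free memory can look fine while the host is still stalled waiting on disk, so I/O wait is one more thing that could be ruled out. This is a minimal sketch under assumptions: the helper names are invented for illustration, and it relies on the first line of /proc/stat being the aggregate "cpu" line with the usual Linux field order (user, nice, system, idle, iowait, irq, softirq, steal). It is only one possible check, not a diagnosis.

```python
import time


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of tick counts."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(x) for x in fields[1:]]


def iowait_fraction(before, after):
    """Fraction of CPU time spent in iowait between two samples."""
    deltas = [b - a for a, b in zip(before, after)]
    # Sum only the first eight fields; the guest fields that follow are
    # already included in user/nice and would be counted twice.
    total = sum(deltas[:8])
    return deltas[4] / total if total else 0.0


def sample_iowait(interval=5.0, path="/proc/stat"):
    """Take two samples `interval` seconds apart and return the iowait fraction."""
    def read():
        with open(path) as f:
            return parse_cpu_line(f.readline())

    first = read()
    time.sleep(interval)
    return iowait_fraction(first, read())
```

Calling sample_iowait() on the node while SSH is being slow would show whether a noticeable share of the interval was spent waiting on I/O.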

I've considered just reinstalling, but having done that before, I know it can cause real issues with corosync, with devices failing to connect because the keys are wrong, etc.

Does anyone know a good way of debugging these kinds of issues?
 
Update: I've noticed today that if I set the system to reboot, then while it's shutting down I can log in over SSH remotely quickly, without the wait. Working out which of the things that had been shut down was previously hogging the system, on the other hand, probably isn't going to be that easy. I guess I could find out which services are running and shut them down one at a time?
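
As a starting point for the one-at-a-time idea, the list of running services can be collected in a script. This is a sketch only: the function names are made up for illustration, and it assumes systemctl is on the PATH and prints the unit name as the first column when called with --plain and --no-legend.

```python
import subprocess


def parse_running_services(text):
    """Extract unit names from 'systemctl list-units --plain --no-legend' output."""
    units = []
    for line in text.splitlines():
        fields = line.split()
        # The first column is the unit name, e.g. "corosync.service".
        if fields and fields[0].endswith(".service"):
            units.append(fields[0])
    return units


def running_services():
    """Ask systemd for the services currently in the 'running' state."""
    out = subprocess.run(
        ["systemctl", "list-units", "--type=service", "--state=running",
         "--plain", "--no-legend"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_running_services(out)
```

Stopping the listed services one by one and retrying an SSH login after each would then show which stop makes the delay go away. Services such as the SSH daemon itself, corosync and pve-cluster should be handled with care, since stopping them can cut off access or affect quorum.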