One host in cluster stops allowing authentication

ypxm1

May 18, 2024
Hello,

I have a long-running cluster of 3 hosts. Usually everything points at the first host (remote API monitoring, etc.). Recently that specific host stopped allowing logins. When I open the web GUI the login page comes up, but when I enter my username and password it just times out. I have been looking at the logs and can't find anything except a whole bunch of "pveproxy[4090618]: proxy detected vanished client connection" messages. I noticed that restarting pvedaemon lets me log in, but only temporarily. Logging in to the other hosts works fine, and SSH to the affected host works as well. If I reboot the host, it works for a few days but eventually fails again. I'm still on 8.4.14, and before I move to 9.x I want to fix this issue. This cluster has been stable for over 2 years, and all of a sudden this issue came up. I tried some troubleshooting and searching of my own but came up empty. Any help is appreciated.
 
When the problem occurs, is that node still reachable through the other nodes?
Is the cluster status OK?
Code:
pvecm status

Do you see any other messages in the logs? For example, try to log in and watch the journal while you do:
Code:
journalctl -f

Or search the log in reverse for warnings and errors:
Code:
journalctl -r -p4

Has anything changed that could have led to this behavior?
 
This seems to happen under load. I now see it on the other hosts as well when all (or most) of the VMs are running on them. Nothing has changed, apart from upgrading everything to the latest version.
 
Are the nodes under heavy load? CPU/MEM/IO/Pressure?
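One quick way to look at this, as a rough sketch: print the load average together with the kernel's pressure stall information (PSI) for CPU, memory, and I/O. This assumes the kernel exposes /proc/pressure; the function name show_pressure is just an illustrative helper, not a Proxmox tool.

```shell
#!/bin/sh
# Hypothetical helper: print load average and PSI figures, if available.
show_pressure() {
    echo "loadavg: $(cat /proc/loadavg 2>/dev/null || echo unavailable)"
    for res in cpu memory io; do
        f="/proc/pressure/$res"
        if [ -r "$f" ]; then
            echo "== $res =="
            # "some"/"full" lines show the share of time tasks were stalled
            cat "$f" 2>/dev/null
        else
            echo "== $res == (PSI not available)"
        fi
    done
}

show_pressure
```

Running it while the login hangs, and again when the node is idle, should make it easier to tell which resource is saturated.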
 
All right. Maybe check the time too on all nodes. There might be a time drift.
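A minimal sketch for comparing clocks, to be run on every node: it prints the UTC time and, where the tools are installed, the sync state from timedatectl and chronyc. Which time daemon is in use on your nodes is an assumption here, and check_time is only an illustrative name.

```shell
#!/bin/sh
# Hypothetical helper: show current UTC time and clock sync state.
check_time() {
    date -u '+%Y-%m-%dT%H:%M:%SZ'
    if command -v timedatectl >/dev/null 2>&1; then
        # Reports whether systemd considers the clock synchronized
        timedatectl status 2>/dev/null | grep -i 'synchronized' || true
    fi
    if command -v chronyc >/dev/null 2>&1; then
        # Only meaningful if chrony is the active time daemon
        chronyc tracking 2>/dev/null | grep -E 'System time|Leap status' || true
    fi
}

check_time
```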
The time seems to be correct. It has something to do with load. When a node is under high load, pvedaemon seems to get stuck. Restarting the service brings it back for a little while, but then it stops again. I have never observed this before in all the time I have been using this cluster. It is getting very frustrating, as other API-dependent things stop working.
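If the daemon really is hanging under load, one thing that might be worth checking the next time it happens is whether any processes sit in uninterruptible sleep (state D), which usually points at blocked I/O. The sketch below reads /proc directly; list_d_state is a made-up helper name, and whether pvedaemon workers show up in its output is only a guess, not something the logs above confirm.

```shell
#!/bin/sh
# Hypothetical helper: list PID and name of processes in state D.
# The optional argument is the proc root (defaults to /proc).
list_d_state() {
    root="${1:-/proc}"
    for d in "$root"/[0-9]*; do
        # Processes can exit while we iterate, so ignore read errors
        state="$(awk '/^State:/ {print $2; exit}' "$d/status" 2>/dev/null)"
        if [ "$state" = "D" ]; then
            # Prints the first word of the process name
            name="$(awk '/^Name:/ {print $2; exit}' "$d/status" 2>/dev/null)"
            echo "${d##*/} $name"
        fi
    done
}

list_d_state
```

An empty result while the login is timing out would suggest the hang lies somewhere other than blocked I/O.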