Hi Most Excellent Proxmox Forum Users,
We have a 3-server cluster, doing all the fun stuff like Ceph on the back end.
Here's what I did to break this:
- created an LDAP Server entry (we're using JumpCloud)
- tested a user login against that realm
Here's what I've done so far:
- I still have SSH access, and the 3rd server in the cluster has full access to the other servers
- pveproxy is active, and restarting it does not restore the GUI
- pvedaemon is active, and restarting it does not restore the GUI
- From an SSH session on the affected server, "telnet 192.168.this.server 8006" connects
- We get no response from a web browser on the two servers that have a dead GUI
- pveproxy access.log continues to increment API requests from this server, but does not show my attempts to reach the GUI
- "ss -an | grep LISTEN | grep 8006" reports a LISTEN on port 8006
- this matches the fact I can telnet to this port directly from the server
- when I restarted both pveproxy and pvedaemon, I saw that both the parent and the worker PIDs changed
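
In case it helps anyone reproduce the checks above: a successful telnet only shows that the kernel completed the TCP handshake on the listening socket, not that a pveproxy worker is actually answering requests. Below is a minimal sketch, assuming Python 3 is available on the machine you test from, that separates the two cases. The function name `probe` and its return shape are just my own illustration, not anything Proxmox provides, and certificate verification is switched off only because this is a diagnostic against a host whose certificate the client may not trust.

```python
import socket
import ssl


def probe(host, port=8006, timeout=5.0):
    """Return (tcp_ok, tls_ok, status_line) for an HTTPS service.

    tcp_ok      -> the TCP connect succeeded (what telnet shows)
    tls_ok      -> the TLS handshake completed (the service is responding)
    status_line -> first line of the HTTP response, or None
    """
    tcp_ok = tls_ok = False
    status_line = None

    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Refused, unreachable, or timed out at the TCP level.
        return tcp_ok, tls_ok, status_line
    tcp_ok = True

    # Diagnostic only: skip certificate checks (assumption: the server's
    # certificate is not trusted by this client).
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE

    try:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            tls_ok = True
            request = "GET / HTTP/1.0\r\nHost: {}\r\n\r\n".format(host)
            tls.sendall(request.encode("ascii"))
            data = tls.recv(1024)
            if data:
                status_line = data.split(b"\r\n", 1)[0].decode("latin-1")
    except OSError:
        # Covers handshake timeouts, TLS errors, and resets.
        pass
    finally:
        sock.close()

    return tcp_ok, tls_ok, status_line
```

Calling `probe("192.168.this.server")` with the real address filled in and getting back `(True, False, None)` would match what I'm seeing: the port accepts connections, but nothing behind it completes the handshake within the timeout.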
I don't want to reboot the server, because we have workloads that have to stay running. Can anyone recommend a good next troubleshooting step?
Okay, so in the time it took me to write this question, the GUI for our first server came back. I now have a second and third question:
- what logs should I look into to figure out what Proxmox was doing during this interim period?
- what the heck happened?