I have a 7-node Proxmox/Ceph cluster that has been running fine for several months.
This morning I woke up and one node (Chen1) was showing as down, with a red X in the node list, but it was still powered up. I couldn’t log into the web UI through it, I couldn’t SSH into it, and it wouldn’t respond to pings. I rebooted it and it came back up and seemed to work fine for about 5 minutes, then went down again.
After I rebooted that node, two others (Dak1, Dak2) started showing “unknown” as their status. On those two I could load the web UI, I could SSH in, and they DID respond to pings. The LXCs assigned to those nodes were running, but also showed “unknown” as their state. The HA state for those LXCs was “error”.
I had to go to work, so I powered down the whole cluster for the day. Where do I start with troubleshooting this mess when I get home this afternoon?
Edit: All nodes are running version 6 and were last updated sometime last week.