Hi, I have a 3-node cluster that had been running well for several months. Suddenly I found that some VMs were fenced (probably using that term wrong), and while troubleshooting I discovered that Ceph was not mounting. Running ceph -s doesn't work, the GUI throws 500 errors if I try viewing anything Ceph-related, and most of the errors I can find either turn up nothing or point to a different problem. The monitors seem to be running but are spamming with:
[ 3725.264440] ceph: No mds server is up or the cluster is laggy
[ 3734.447225] libceph: mon1 (1)192.168.6.21:6789 socket closed (con state V1_BANNER)
[ 3734.704469] libceph: mon1 (1)192.168.6.21:6789 socket closed (con state V1_BANNER)
[ 3735.208468] libceph: mon1 (1)192.168.6.21:6789 socket closed (con state V1_BANNER)
[ 3736.656702] libceph: mon1 (1)192.168.6.21:6789 socket closed (con state V1_BANNER)
[ 3740.304488] libceph: mon0 (1)192.168.6.20:6789 socket closed (con state OPEN)
So I checked whether the MDS is running, and it seems to be restarting over and over with this in the logs:
ceph-mds[9702]: failed to fetch mon config (--no-mon-config to skip)
Some other troubleshooting I tried was checking whether the nodes can communicate over the storage network. Pings work, and each node can connect to port 6789 on itself, but connections to 6789 between nodes fail, even with pve-firewall disabled. I have no idea if that is normal; I normally work on ESXi and am evaluating this as a replacement, so I am a total noob here.
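For reference, this is roughly how I tested the port from each node. It's just an ad-hoc check using bash's /dev/tcp redirection, and the three IPs are assumed to be my storage-network addresses (only .20 and .21 appear in the logs above; .22 is a guess for the third node):

```shell
#!/usr/bin/env bash
# Ad-hoc check: can this node reach the Ceph mon port on each storage IP?
# 192.168.6.20-22 are assumed to be the three nodes' storage-network IPs.
for host in 192.168.6.20 192.168.6.21 192.168.6.22; do
    # /dev/tcp/<host>/<port> is a bash redirection that opens a TCP
    # connection; timeout keeps it from hanging on a blackholed route
    if timeout 3 bash -c "</dev/tcp/${host}/6789" 2>/dev/null; then
        echo "${host}:6789 reachable"
    else
        echo "${host}:6789 FAILED"
    fi
done
```

Run locally on a node, only that node's own IP reports reachable; the other two fail.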
I am on PVE 8.2.2, upgraded a few weeks ago, and at one point I also seemed to upgrade Ceph to the newer version successfully. The storage network is separate. I'm not sure what other details might be needed, so let me know if I didn't provide enough info.