CEPH has become completely unresponsive - Prod Env

QuantumSchema

Member
Mar 25, 2019
Hey everyone,

I'm in dire need of assistance.

Our production CEPH environment has become completely unresponsive.

It's a 38-node Proxmox environment (a blend of dedicated storage & compute nodes and a few hyper-converged nodes).

What appears to have started this is a clock skew: we saw a slew of "monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early" messages in the logs.
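For anyone following along, this is roughly how we checked the clock state on each node (standard systemd tooling, nothing Ceph-specific):

    # is the local clock NTP-synchronised?
    timedatectl status
    # is the time sync service itself alive?
    systemctl status systemd-timesyncd.service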

After digging in, we found that 7 of the 38 nodes didn't have systemd-timesyncd.service running. I tried starting the service on those nodes, but it wouldn't come up. We eventually went through rebooting each node, one by one. On shutdown, the nodes hung while attempting to stop the watchdog, so I manually power cycled them at that point; when they came back up, systemd-timesyncd.service had started. So now systemd-timesyncd.service is running on all nodes.
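To confirm that all 38 nodes are actually syncing now, we're spot-checking from one host with something along these lines (the node names are placeholders for our hosts):

    for n in node01 node02 node03; do
        echo "== $n =="
        ssh "$n" "timedatectl | grep -i synchron"
    done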

Currently, CEPH is completely unresponsive. Running "ceph status", "ceph osd", etc. times out, so I'm unable to determine the status of the environment.
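One avenue we haven't exhausted yet: the monitors' local admin sockets should answer even when the cluster-wide commands hang, since they don't need quorum. Something like this on a monitor node (the mon id and socket path below are the defaults; adjust to your setup):

    # ask the local monitor for its view of the world
    ceph daemon mon.<id> mon_status
    # or with the socket path spelled out
    ceph --admin-daemon /var/run/ceph/ceph-mon.<id>.asok quorum_status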

Now I see "osd.AAA 535503 heartbeat_check: no reply from xx.xxx.x.xx:6814 osd.BBB ever on either front or back, first ping sent" on all of the nodes. The OSD numbers differ on each node, but the "no reply from xx.xxx.x.xx" portion points to the same set of nodes: 27, 26, 25, 22, 24, 22, 21. Interestingly enough, these are the nodes that had the time service stopped earlier.
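In case it's useful, these are the basic reachability checks we've been running from a healthy node toward the suspect ones (the address and port are taken from the log line above):

    # plain reachability on the ceph public network
    ping -c 3 xx.xxx.x.xx
    # does the OSD heartbeat/messenger port accept connections?
    nc -zv xx.xxx.x.xx 6814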

We still see "monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early" but not as frequently.

We are also seeing "osd.AAA 535503 get_health_metrics reporting 262 slow ops, oldest is osd_op(client.1529603251.0:45966540 3.1e9b 3:d97b4254:::rbd_header.c0aab834792ffb:head [call rbd.set_access_timestamp] snapc 0=[] ondisk+write+known_if_redirected e535503)" in the logs, but since I can't run any of the ceph commands, I can't dig deeper to determine whether the cluster is recovering or what state it's in.
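If the OSD admin sockets still respond locally, the next step on our list is dumping the stuck ops on one of the complaining OSDs (run on the node hosting it; osd.AAA as in the log line):

    # show the ops currently in flight, including how long they've been blocked
    ceph daemon osd.AAA dump_ops_in_flight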

The Proxmox UI just shows a "500 Timeout" error.
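As far as I can tell, the 500 is just pvestatd/pveproxy timing out on the same hung Ceph calls rather than a separate problem; checking the Proxmox services directly should confirm that:

    systemctl status pvestatd pveproxy pvedaemon
    journalctl -u pvestatd --since "1 hour ago"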

We do have two tickets in with Proxmox support, but responses have been slow, so any help from the community would be greatly appreciated!

Thanks everyone & cheers!
 
"heartbeat_check: no reply from xx.xxx.x.xx:6814 osd.BBB ever on either front or back, first ping sent"
This message usually points to a networking problem between the nodes (most likely on the ceph public network) - do you see any problems with the network configuration?
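One common cause worth ruling out is an MTU mismatch on the Ceph public/cluster network - heartbeats can fail while small pings still succeed. A quick test between two nodes (the 8972-byte payload assumes a 9000-byte MTU; use your MTU minus 28 bytes for the IP/ICMP headers):

    # send a full-size, non-fragmentable packet; failures point to an MTU mismatch
    ping -M do -s 8972 -c 3 <peer-ip-on-ceph-network>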

As my colleague pointed out in the enterprise support portal - we'd need the current logs and reports to get a better picture though.
Also, I'd suggest sticking to one support/communication channel (more channels usually cause confusion and duplicate work).

I hope this helps!