CEPH has become completely unresponsive - Prod Env

QuantumSchema

Member
Mar 25, 2019
Hey everyone,

I'm in dire need of assistance.

Our production CEPH environment has become completely unresponsive.

It's a 38 node Proxmox environment (blend of dedicated storage & compute nodes and a few hyper-converged nodes).

What appeared to start this was a time skew? We saw a slew of "monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early" in the logs.

After digging in, 7 of the 38 nodes didn't have systemd-timesyncd.service running. I tried starting the service on those nodes but they wouldn't come up. We eventually went through rebooting each node, one by one. On power down, the nodes hung on attempting to stop watchdog. I manually power cycled them at that point and when they came back up, the systemd-timesyncd.service had started. So now systemd-timesyncd.service is running on all nodes.
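For anyone wanting to re-check the time service across many nodes without logging in to each one, here is a rough sketch. It assumes passwordless SSH from an admin host to every node; the node names used in the usage note are hypothetical placeholders. It only runs `systemctl is-active` remotely and reports the nodes where the unit is not active.

```python
import subprocess

UNIT = "systemd-timesyncd.service"


def unit_state(node, unit=UNIT, run=subprocess.run):
    """Return the `systemctl is-active` state of `unit` on `node` via SSH.

    `run` is injectable so the function can be exercised without SSH.
    """
    cmd = [
        "ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5",
        node, "systemctl", "is-active", unit,
    ]
    try:
        result = run(cmd, capture_output=True, text=True, timeout=15)
    except (subprocess.TimeoutExpired, OSError):
        return "unreachable"
    # systemctl prints e.g. "active", "inactive" or "failed" on stdout;
    # an empty stdout usually means the SSH connection itself failed.
    return result.stdout.strip() or "unknown"


def nodes_not_active(nodes, unit=UNIT, run=subprocess.run):
    """Map each node whose unit is not 'active' to the state it reported."""
    problems = {}
    for node in nodes:
        state = unit_state(node, unit=unit, run=run)
        if state != "active":
            problems[node] = state
    return problems
```

Calling `nodes_not_active(["node21", "node22"])` (names assumed) returns a dict of the nodes that need attention. A unit reported as active does not by itself prove the clocks agree, so this is only a first pass.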

Currently, CEPH is completely unresponsive. Trying to run "ceph status" or "ceph osd" etc. results in a timeout, so I'm unable to determine the status of the environment.

Now I see "osd.AAA 535503 heartbeat_check: no reply from xx.xxx.x.xx:6814 osd.BBB ever on either front or back, first ping sent" on all of the nodes. The OSDs are of course a bit different on each. The "no reply from xx.xxx.x.xx" portion points to the same nodes though, nodes: 27, 26, 25, 22, 24, 22, 21. Interestingly enough, these were the nodes that had the time service stopped earlier.
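To see at a glance which peers the "no reply" messages point to, the OSD logs can be tallied by peer address. The following is a minimal sketch that assumes the log lines look like the one quoted above; the sample lines and the 192.0.2.x addresses are made up for illustration.

```python
import re
from collections import Counter

# Matches e.g. "heartbeat_check: no reply from 192.0.2.21:6814 osd.57 ..."
HEARTBEAT_RE = re.compile(
    r"heartbeat_check: no reply from (?P<addr>\S+):(?P<port>\d+) osd\.(?P<osd>\d+)"
)


def count_silent_peers(lines):
    """Count 'no reply' messages per peer address and collect the peer OSD ids."""
    by_addr = Counter()
    osds_by_addr = {}
    for line in lines:
        match = HEARTBEAT_RE.search(line)
        if match:
            addr = match["addr"]
            by_addr[addr] += 1
            osds_by_addr.setdefault(addr, set()).add(int(match["osd"]))
    return by_addr, osds_by_addr


# Hypothetical sample lines in the format quoted in this post.
SAMPLE = [
    "osd.12 535503 heartbeat_check: no reply from 192.0.2.21:6814 osd.57 ever on either front or back, first ping sent",
    "osd.12 535503 heartbeat_check: no reply from 192.0.2.21:6802 osd.58 ever on either front or back, first ping sent",
    "osd.12 535503 heartbeat_check: no reply from 192.0.2.22:6814 osd.60 ever on either front or back, first ping sent",
    "osd.12 535503 get_health_metrics reporting 262 slow ops",
]

counts, osds = count_silent_peers(SAMPLE)
for addr, n in counts.most_common():
    print(addr, n, sorted(osds[addr]))
```

On a real node the same function could be fed an open log file, for example one under /var/log/ceph/ if the default log location is in use.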

We still see "monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early" but not as frequently.

We are also seeing "osd.AAA 535503 get_health_metrics reporting 262 slow ops, oldest is osd_op(client.1529603251.0:45966540 3.1e9b 3:d97b4254:::rbd_header.c0aab834792ffb:head [call rbd.set_access_timestamp] snapc 0=[] ondisk+write+known_if_redirected e535503)" in the logs but since I can't run any of the ceph commands, I can't dig much deeper into whether it's rebuilding or the status of the cluster.

The Proxmox UI just shows a "500 Timeout" error.

We do have 2 tickets in w/ Proxmox support but responses have been slow, so any help from the community would be greatly appreciated!

Thanks everyone & cheers!
 
"heartbeat_check: no reply from xx.xxx.x.xx:6814 osd.BBB ever on either front or back, first ping sent"
This message usually points to a networking problem between the nodes (most likely on the ceph public network) - do you see any problems with the network configuration?
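As a quick first check of basic reachability, one could try a plain TCP connect from an affected node to the address and port named in the log message. This is only a sketch: the host and port passed in are whatever appears in your own logs, and a successful connect does not rule out other network problems (MTU mismatches, packet loss, a broken cluster network).

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False
```

For example, `can_connect("192.0.2.21", 6814)` (address assumed for illustration) returning False from several nodes would support the suspicion of a network problem toward that peer.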

As my colleague pointed out in the enterprise support portal - we'd need the current logs and reports to get a better picture though.
Also, I'd suggest sticking to one support/communication channel (more channels usually cause confusion and duplicate work).

I hope this helps!
 
