I've got a small problem: I wanted to swap out servers in my Proxmox 7.1 cluster. Removing node pve002 and adding the new pve005 went fine, and Ceph was healthy.
But now, when I try to shut down pve004 and set the last NVMe OSD there to out, I get 19 PGs in inactive status because the new osd.5 on pve005 is reporting slow ops - currently delayed:
Code:
2022-05-18T04:58:08.324+0200 7fd000fe1700  0 log_channel(cluster) log [WRN] : slow request osd_op(client.138713884.0:4843782 8.28a 8.70bbb28a (undecoded) ondisk+write+known_if_redirected e50558) initiated 2022-05-18T04:56:23.881471+0200 currently delayed
2022-05-18T04:58:08.324+0200 7fd000fe1700  0 log_channel(cluster) log [WRN] : slow request osd_op(client.138713884.0:4843737 8.283 8.e367c683 (undecoded) ondisk+write+known_if_redirected e50556) initiated 2022-05-18T04:56:19.866903+0200 currently delayed
2022-05-18T04:58:08.324+0200 7fd000fe1700 -1 osd.5 50588 get_health_metrics reporting 2 slow ops, oldest is osd_op(client.138713884.0:4843737 8.283 8.e367c683 (undecoded) ondisk+write+known_if_redirected e50556)
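For context, this is roughly how I'm checking which PGs are inactive and what osd.5 is blocked on (standard Ceph commands; osd.5 is the new NVMe on pve005):

Code:
# overall health with the slow-ops / inactive-PG warnings and affected PG IDs
ceph health detail

# list only the PGs that are currently stuck inactive
ceph pg dump_stuck inactive

# run on pve005: ask osd.5 via its admin socket which ops are in flight/delayed
ceph daemon osd.5 dump_ops_in_flight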
Right now I have set osd.5 to out, and everything is being remapped to the other NVMes in the cluster - without any slow ops.
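For reference, this is roughly how I took the OSD out and am watching the remap (a sketch; adjust the OSD ID to your setup):

Code:
# mark osd.5 out so Ceph starts remapping its PGs onto the other NVMes
ceph osd out osd.5

# watch recovery/backfill progress and PG states while the data moves
watch -n 5 'ceph -s'

# check per-OSD utilization to confirm the data is spreading as expected
ceph osd df tree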