Hi,
We have a 4-node cluster with Ceph installed on 4 disks per node, so 16 OSDs, and only about 50% of the capacity is used.
All disks are NVMe, and the I/O scheduler is set to mq-deadline in /sys/block/nvmeXXX/queue/scheduler.
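In case it's relevant, this is roughly how the scheduler is checked and set on the nodes (nvme0n1 is just an example device name, we do the same for every NVMe disk):
Code:
# show the schedulers available for one NVMe device; the active one appears in brackets
cat /sys/block/nvme0n1/queue/scheduler
# select mq-deadline for that device (as root)
echo mq-deadline > /sys/block/nvme0n1/queue/scheduler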
Our nodes have enough RAM and CPU.
Today one node was shut down, and although we expected the other 3 nodes to keep working "normally", the 3 remaining nodes are overloaded, with very high server load and IO delay.
An iostat -x 1 shows that RBD is consuming a lot.
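For reference, these are the standard Ceph status commands I can run to check whether the load is coming from recovery/backfill rather than client traffic:
Code:
# cluster health, plus recovery/backfill progress and throughput
ceph -s
# per-OSD utilisation and PG count on the remaining OSDs
ceph osd df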

That looks strange to me, so I tried disabling some options:

Yet the IO delay and server load are still very high.
I even tried to lower the priority of the recovery operations with something like:
Code:
ceph tell 'osd.*' injectargs --osd-max-backfills=1 --osd-recovery-max-active=3 --osd_recovery_op_priority=30
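(I assume the injected values can be verified on a running OSD with something like the following, osd.0 being just an example, run on the node that hosts it:)
Code:
# read the current values back from the OSD's admin socket
ceph daemon osd.0 config get osd_max_backfills
ceph daemon osd.0 config get osd_recovery_op_priority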
Any tips or advice would be appreciated.
Regards