High IO delay after losing a node

flotho

Hi,
We have a 4-node cluster with Ceph installed on 4 disks per node, so 16 OSDs, and only 50% of the capacity is used.
All disks are NVMe; the scheduler is set to mq-deadline in /sys/block/nvmeXXX/queue/scheduler.
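(For reference, the scheduler can be checked and changed per device along these lines; nvme0n1 is just an example device name:)

Code:
# the scheduler shown in brackets is the active one
cat /sys/block/nvme0n1/queue/scheduler
# switch that device to mq-deadline
echo mq-deadline > /sys/block/nvme0n1/queue/scheduler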

Our nodes have enough RAM and CPU.
Today a node was shut down and, although we expected the 3 remaining nodes to keep working "normally", they are overloaded with a very high server load and IO delay.
An iostat -x 1 shows that RBD is consuming a lot:
[screenshot: iostat -x 1 output]

That looks strange to me, so I tried to disable some options:
[screenshot: Ceph options disabled]
Yet the IO delay and server load are still very high.
I even tried to lower the priority of recovery operations with something like:

Code:
ceph tell 'osd.*' injectargs --osd-max-backfills=1 --osd-recovery-max-active=3 --osd_recovery_op_priority=30
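(Side note: if I read the docs right, a higher value of osd_recovery_op_priority means a higher priority (client ops default to 63, recovery to 3), so 30 would raise recovery priority rather than lower it. To confirm what an OSD is actually running with, assuming a release with the central config store, something like this should work; osd.0 is just an example:)

Code:
# show the runtime values of the injected options on one OSD
ceph config show osd.0 | grep -E 'osd_max_backfills|osd_recovery_max_active|osd_recovery_op_priority'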

Any tips or advice would be appreciated.
Regards
 
Additional information: we haven't activated HA.
When the server comes back up, everything runs smoothly again.
We're looking for the reason why the other nodes are overreacting to this.
 
Hum,
I think this is a priority issue in the operations.
When the recovery ended, the IO delay decreased and everything is running smoothly on 3 nodes.
What can I set up to give priority to client usage when there is no "emergency"?
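(Not a definitive recipe, but assuming a recent Ceph with the central config store, the knobs that usually matter are the op priorities and the recovery/backfill throttles; a sketch using the usual defaults:)

Code:
# client ops keep the highest weighted priority (63 is the default and maximum)
ceph config set osd osd_client_op_priority 63
# recovery ops stay at a low weighted priority (3 is the default)
ceph config set osd osd_recovery_op_priority 3
# limit concurrent backfills and recovery ops per OSD
ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 1

(If I remember right, on newer releases that use the mClock scheduler the backfill/recovery limits are ignored unless osd_mclock_override_recovery_settings is enabled, so treat this as a starting point.)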

Regards
 
Hum,

I think I understood what happened.
The ceph conf is as below:
[screenshot: ceph.conf]

AFAICT, if an OSD is down, the missing OSD is treated as an emergency, so Ceph reacts by rebalancing/recovering the missing PGs.

Am I correct ?
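(For what it's worth, the timing of that reaction is controlled by when down OSDs get marked "out"; a sketch of the options involved, assuming access to the central config store:)

Code:
# how long a down OSD stays "in" before being marked out and rebalancing starts (default 600 s)
ceph config get mon mon_osd_down_out_interval
# optionally: never auto-mark-out a whole host at once (the default limit is "rack")
ceph config set mon mon_osd_down_out_subtree_limit host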
 

If there are fewer than min_size copies then I/O will pause until min_size is available.

Code:
osd_pool_default_size = 3      # Write an object three times.
osd_pool_default_min_size = 2  # Accept an I/O operation to a PG that has two copies of an object.
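To see what a given pool is actually using, something like this should work ("rbd" is just an example pool name):

Code:
ceph osd pool get rbd size
ceph osd pool get rbd min_size
# or list every pool with its replication settings
ceph osd pool ls detail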
 
Thanks @SteveITS,
If I change this parameter, will Ceph create all the "missing" PGs and use 50% additional storage?
Regards
 
If you leave it at 2/2 and let it recover (norecover=off) then Ceph should eventually catch up/recover, and allow VMs to function, once all PGs for the VM have 2 valid copies.

If you set it at 3/2 then yes it will use more storage but you will have a cushion, so one OSD going down does not lock any PGs that only have 1 active copy, because there will already be 2 other copies of the PG. The 3 copies are by default on different nodes so two nodes would have to go down to block I/O.
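For an existing pool that would be roughly the following ("rbd" again being an example pool name; the defaults only apply to pools created afterwards, and expect a backfill while the third copies are created):

Code:
# switch an existing pool to 3 copies, allow I/O with 2
ceph osd pool set rbd size 3
ceph osd pool set rbd min_size 2
# make new pools default to 3/2
ceph config set global osd_pool_default_size 3
ceph config set global osd_pool_default_min_size 2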
 
The ceph conf is as below:
[screenshot: ceph.conf]
This effectively renders the cluster useless: once you lose any OSD, there will be no I/O on the PGs stored in that OSD until they get recovered from the single copy still in the cluster. You should always use at least size=3, min_size=2 unless you can tolerate such downtime. Not to mention the risk that another OSD holding the other copy of the same PGs fails at the same time...

The I/O you see is from the VMs' disks, not your hosts' disks, and it is caused by Ceph blocking I/O because there are fewer than min_size copies of some PGs, as SteveITS explained above. Your VMs try to do I/O, Ceph doesn't allow it until recovery completes, the VMs accumulate more and more processes waiting for I/O, and so on.
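(You can usually see this directly while it happens; something along these lines should list the blocked/undersized PGs:)

Code:
# overall health, including inactive PGs and slow/blocked requests
ceph health detail
# PGs stuck inactive or undersized
ceph pg dump_stuck inactive
ceph pg dump_stuck undersized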