High IO delay after losing a node

flotho

Hi,
We have a 4-node cluster with Ceph installed on 4 disks per node, so 16 OSDs, and only 50% of the capacity is used.
All disks are NVMe; the scheduler is set to mq-deadline in /sys/block/nvmeXXX/queue/scheduler.
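(For reference, the scheduler can be checked and changed per device along these lines; nvme0n1 is just an example device name:)

Code:
# the scheduler shown in brackets is the active one
cat /sys/block/nvme0n1/queue/scheduler
# switch that device to mq-deadline
echo mq-deadline > /sys/block/nvme0n1/queue/scheduler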

Our nodes have enough RAM and CPU.
Today a node was shut down and, although we expected the 3 remaining nodes to keep working "normally", they are overloaded with a very high server load and IO delay.
An iostat -x 1 shows that RBD is consuming a lot:
[screenshot: iostat -x 1 output]

That looks strange to me, so I tried to disable some options:
[screenshot: Ceph options disabled]
Yet the IO delay and server load are still very high.
I even tried to lower the priority of recovery operations with something like:

Code:
ceph tell 'osd.*' injectargs --osd-max-backfills=1 --osd-recovery-max-active=3 --osd_recovery_op_priority=30
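(Side note: if I read the docs right, a higher value of osd_recovery_op_priority means a higher priority (client ops default to 63, recovery to 3), so 30 would raise recovery priority rather than lower it. To confirm what an OSD is actually running with, assuming a release with the central config store, something like this should work; osd.0 is just an example:)

Code:
# show the runtime values of the injected options on one OSD
ceph config show osd.0 | grep -E 'osd_max_backfills|osd_recovery_max_active|osd_recovery_op_priority'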

Any tips or advice would be appreciated.
Regards
 
Additional information: we haven't activated HA.
When the server comes back up, everything runs smoothly again.
We're looking for the reason why the other nodes are overreacting to this.
 
Hum,
I think this is a priority issue in the operations.
When the recovery ended, the IO delay decreased and everything is running smoothly on 3 nodes.
What can I set up to give priority to client usage when there is no "emergency"?
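(Not a definitive recipe, but assuming a recent Ceph with the central config store, the knobs that usually matter are the op priorities and the recovery/backfill throttles; a sketch using the usual defaults:)

Code:
# client ops keep the highest weighted priority (63 is the default and maximum)
ceph config set osd osd_client_op_priority 63
# recovery ops stay at a low weighted priority (3 is the default)
ceph config set osd osd_recovery_op_priority 3
# limit concurrent backfills and recovery ops per OSD
ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 1

(If I remember right, on newer releases that use the mClock scheduler the backfill/recovery limits are ignored unless osd_mclock_override_recovery_settings is enabled, so treat this as a starting point.)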

Regards
 
Hum,

I think I understood what happened.
The ceph conf is as below:
[screenshot: ceph.conf]

AFAICT, if an OSD is down, the missing OSD is treated as an emergency, so Ceph reacts by rebalancing/recovering the missing PGs.

Am I correct ?
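(For what it's worth, the timing of that reaction is controlled by when down OSDs get marked "out"; a sketch of the options involved, assuming access to the central config store:)

Code:
# how long a down OSD stays "in" before being marked out and rebalancing starts (default 600 s)
ceph config get mon mon_osd_down_out_interval
# optionally: never auto-mark-out a whole host at once (the default limit is "rack")
ceph config set mon mon_osd_down_out_subtree_limit host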
 

If there are fewer than min_size copies then I/O will pause until min_size is available.

Code:
osd_pool_default_size = 3      # Write an object three times.
osd_pool_default_min_size = 2  # Accept an I/O operation to a PG that has two copies of an object.
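To see what a given pool is actually using, something like this should work ("rbd" is just an example pool name):

Code:
ceph osd pool get rbd size
ceph osd pool get rbd min_size
# or list every pool with its replication settings
ceph osd pool ls detail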
 
Thanks @SteveITS,
If I change this parameter, will Ceph create all the "missing" PGs and use 50% additional storage?
Regards
 
If you leave it at 2/2 and let it recover (norecover=off) then Ceph should eventually catch up/recover, and allow VMs to function, once all PGs for the VM have 2 valid copies.

If you set it at 3/2 then yes it will use more storage but you will have a cushion, so one OSD going down does not lock any PGs that only have 1 active copy, because there will already be 2 other copies of the PG. The 3 copies are by default on different nodes so two nodes would have to go down to block I/O.
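For an existing pool that would be roughly the following ("rbd" again being an example pool name; the defaults only apply to pools created afterwards, and expect a backfill while the third copies are created):

Code:
# switch an existing pool to 3 copies, allow I/O with 2
ceph osd pool set rbd size 3
ceph osd pool set rbd min_size 2
# make new pools default to 3/2
ceph config set global osd_pool_default_size 3
ceph config set global osd_pool_default_min_size 2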
 
The ceph conf is as below:
[screenshot: ceph.conf]
This effectively renders the cluster useless: once you lose any OSD, there will be no I/O on the PGs stored in that OSD until they get recovered from the single copy still in the cluster. You should always use at least size=3, min_size=2 unless you can tolerate such downtime. Not to mention the risk that another OSD holding the other copy of the same PGs fails at the same time...

The I/O you see is from the VMs' disks, not your hosts' disks, and it is caused by Ceph blocking I/O because there are fewer than min_size copies of some PGs, as SteveITS explained above. Your VMs try to do I/O, Ceph doesn't allow it until recovery completes, the VMs accumulate more and more processes waiting for I/O, and so on.
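(You can usually see this directly while it happens; something along these lines should list the blocked/undersized PGs:)

Code:
# overall health, including inactive PGs and slow/blocked requests
ceph health detail
# PGs stuck inactive or undersized
ceph pg dump_stuck inactive
ceph pg dump_stuck undersized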