Hi
We are running a Ceph cluster with 3 nodes and 2 OSDs per node on SSD drives.
One of the nodes has recently been causing problems: both OSDs on this node sporadically report "log_latency_fn slow operation observed for _txc_committed_kv" in their logs, and write performance is poor in some cases. I can reproduce this by repeatedly creating 100 MB files via dd, for example. This is usually relatively fast at 200-300 MB/s, but sometimes I get under 10 MB/s, and exactly at those moments the mentioned message is logged.
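For reference, the reproduction looks roughly like this (the target path is just a placeholder for a file on the affected pool; oflag=direct is there so the page cache does not mask the OSD latency):

  # write a 100 MB test file with direct I/O and time the throughput
  dd if=/dev/zero of=/mnt/ceph/ddtest bs=1M count=100 oflag=direct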
I would now like to replace the two affected OSDs, i.e. their SSDs.
What is the best way to avoid rebalancing the data multiple times (due to the CRUSH map changes when adding/removing OSDs)?
I had imagined the following procedure (rough command sketch after the list):
1. set the norebalance flag
2. stop the OSDs on the faulty disks and mark them out
3. install the new disks and create new OSDs
4. unset the norebalance flag
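Roughly, I was thinking of something like the following (OSD IDs and the device path are placeholders for our setup, and the OSD-creation step depends on whether you deploy with ceph-volume directly or via an orchestrator):

  # 1. prevent rebalancing while the OSDs are swapped
  ceph osd set norebalance

  # 2. stop the two affected OSDs (assumed here to be osd.4 and osd.5) and mark them out
  systemctl stop ceph-osd@4 ceph-osd@5
  ceph osd out 4 5

  # 3. after swapping the SSDs, create the new OSDs (example with ceph-volume;
  #    the old OSD entries would still have to be removed or their IDs reused)
  ceph-volume lvm create --data /dev/sdX

  # 4. allow rebalancing again
  ceph osd unset norebalance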
Does this make sense or does anyone have a better suggestion?