Bulk OSD replace in Ceph

Pravednik · Oct 20, 2021

Hello,

This isn't a problem report, but I'd really appreciate your answers.

We have a 12-node PVE cluster with 54 Ceph OSDs (all SSD).
PVE 6.1.5
Ceph 14.2.5
OSD count 54
PG count 2048
Replica 3
OSDs on RAID0 (we know this is not a supported configuration, but we have to do it this way. BTW, in our infrastructure OSDs on RAID0 work with lower latency than OSDs behind the recommended HBA).

Now we need to replace 1 or 2 SSDs in each node (old SSD 500GB, new SSD 2TB).
We do the following:
1. OSD down
2. OSD out
3. Wait for PG rebalance and Ceph HEALTH_OK status
4. OSD destroy
5. Insert new 2TB SSD (create the RAID0, start PVE)
6. Create new OSD
7. OSD up with proper device type and crush-rule, OSD in
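
For readers who want the steps above as concrete commands, here is a minimal sketch that only builds a dry-run checklist and executes nothing. It is not the poster's actual procedure. The function name `replacement_plan`, the 60-second polling interval, the use of `pveceph` for destroy and create, and the idea that the new OSD gets the same ID back are all assumptions for illustration. Step 3 is approximated with `ceph osd safe-to-destroy`, which may be stricter or looser than waiting for HEALTH_OK.

```python
from typing import List


def replacement_plan(osd_id: int, device: str, device_class: str = "ssd") -> List[str]:
    """Return shell commands for replacing one OSD, in the order of the list above.

    Nothing is executed; the strings are meant to be reviewed by a human first.
    Assumption: the new OSD is created with the same ID as the destroyed one.
    """
    osd = f"osd.{osd_id}"
    return [
        f"systemctl stop ceph-osd@{osd_id}.service",                    # 1. OSD down
        f"ceph osd out {osd_id}",                                       # 2. OSD out
        f"while ! ceph osd safe-to-destroy {osd}; do sleep 60; done",   # 3. wait
        f"pveceph osd destroy {osd_id}",                                # 4. OSD destroy
        "# 5. swap the SSD, create the RAID0 volume, start PVE (manual step)",
        f"pveceph osd create {device}",                                 # 6. create new OSD
        f"ceph osd crush rm-device-class {osd}",                        # 7. clear detected class
        f"ceph osd crush set-device-class {device_class} {osd}",        # 7. set wanted class
    ]


for line in replacement_plan(12, "/dev/sdb"):
    print(line)
```

The OSD ID 12 and the device `/dev/sdb` are placeholders. Keeping the plan as plain strings makes it easy to compare against what the cluster really needs before anything is run.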

After the remap has finished we repeat all the steps above on another node. It takes more than 7 hours per node. We can't wait that long because the equipment is in another country and our stay is limited.

The main question: is it safe to down\out\destroy another OSD before the whole remap has finished (PGs still rebalancing, but none degraded)? That way we could replace all OSDs in 1-2 days and then wait for the long remap process (that is acceptable).

We can't increase the remap speed by adding more threads because this is a production system and our customers would be affected by high latency.

Thanks in advance for answers.

aaron · Oct 20, 2021

Setting more OSDs as out and stopping (down) them right away is risky!
If you have the available space in the cluster, you could mark them all as out way ahead of time. Give Ceph the time to rebalance to the remaining OSDs. Then, once you are there and Ceph is healthy, stop them, destroy them, .....

As long as the OSDs are marked out but are still up and running, Ceph can still access the data. Once you stop the OSDs, well... And with 12 nodes and 1 or 2 OSDs affected on each node, you will end up with some placement groups that have all 3 of their replicas on exactly those OSDs, which means data loss if you stop them before Ceph has been able to rebalance.
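
To make the point about placement groups concrete, the following is a small sketch that counts how many replicas of each PG would survive if a given set of OSDs were stopped. The PG IDs and acting sets are hypothetical. On a real cluster they would have to be taken from the current PG map (for example the output of `ceph pg dump`), and the helper name `pgs_at_risk` is made up for this illustration.

```python
from typing import Dict, List, Set


def pgs_at_risk(acting: Dict[str, List[int]], stopping: Set[int]) -> Dict[str, int]:
    """Map each affected PG to the number of replicas left after stopping OSDs.

    Only PGs that lose at least one replica are returned; a value of 0 means
    every copy of that PG lives on an OSD that is about to be stopped.
    """
    result = {}
    for pgid, osds in acting.items():
        surviving = sum(1 for osd in osds if osd not in stopping)
        if surviving < len(osds):
            result[pgid] = surviving
    return result


# Hypothetical acting sets (PG id -> OSDs holding its 3 replicas).
acting = {
    "1.0": [3, 17, 42],
    "1.1": [3, 20, 41],
    "1.2": [5, 18, 40],
}
stopping = {3, 17, 42}  # one OSD on each of three different nodes

at_risk = pgs_at_risk(acting, stopping)
lost = [pg for pg, left in at_risk.items() if left == 0]
print(at_risk)  # {'1.0': 0, '1.1': 2}
print(lost)     # ['1.0']
```

In this made-up example PG 1.0 would be unavailable as soon as the three OSDs are stopped, even though each node only lost a single disk.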

Pravednik · Oct 20, 2021

Thanks a lot for the reply, @aaron.
We fully understand the risk of taking OSDs down\out, and under our "per node" redundancy policy we never remove 2 OSDs on separate nodes at the same time.

I was only asking whether it is OK to remove an OSD when Ceph is healthy and no PG is in a degraded state, but the remap\backfill process is still running (no redundancy degradation).

Unfortunately we don't have time to wait for the backfills each time we replace an OSD, which is why I'm asking whether it is OK to destroy OSDs one by one (of course with a PG rebalance each time), insert the new OSD and leave the backfill process to run.

RokaKen · Oct 20, 2021

IFF you have sufficient space on the remaining OSDs per node (none would reach near full ratio) and PGs per OSD (mon_max_pg_per_osd), I would drain the OSD(s) to be replaced with ceph osd reweight {ID} 0 and then replace them per node. After OSD replacement, the PGs could rebalance at their leisure.

It should be faster for Ceph to simply move a healthy PG to another local OSD than to recreate a PG across the cluster as your method would require and there is no risk of data loss. However, I don't know that this will be sufficient to meet your time constraint.
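
The two preconditions above (space on the remaining OSDs, PGs per OSD) can be estimated with simple arithmetic before draining anything. The sketch below is a rough average-case estimate. It assumes the 2048 PGs from the first post are the total across all pools with 3 replicas, that drained data lands evenly on the same node's remaining OSDs, and that the cluster uses Ceph's default nearfull ratio of 0.85. The node contents are hypothetical, and the results should be compared with the `mon_max_pg_per_osd` value actually configured on the cluster.

```python
from typing import Dict, Set, Tuple

NEARFULL_RATIO = 0.85  # Ceph's default nearfull ratio; assumption: not changed here


def pgs_per_osd(pg_count: int, replicas: int, osd_count: int) -> float:
    """Average number of PG replicas carried by each OSD that still takes data."""
    return pg_count * replicas / osd_count


def utilization_after_drain(osds: Dict[int, Tuple[float, float]], drained: Set[int]) -> float:
    """Average fill ratio of one node's remaining OSDs once `drained` hold no data.

    osds maps OSD id -> (capacity_gb, used_gb) for a single node.
    """
    total_used = sum(used for _, used in osds.values())
    remaining = sum(cap for osd_id, (cap, _) in osds.items() if osd_id not in drained)
    if remaining == 0:
        raise ValueError("no OSD left on this node to take the data")
    return total_used / remaining


# Figures from the thread: 2048 PGs, 3 replicas, 54 OSDs on 12 nodes.
print(round(pgs_per_osd(2048, 3, 54), 1))       # 113.8 today
print(round(pgs_per_osd(2048, 3, 54 - 12), 1))  # 146.3 with one OSD drained per node
print(round(pgs_per_osd(2048, 3, 54 - 24), 1))  # 204.8 with two OSDs drained per node

# Hypothetical node: four 500 GB OSDs, each half full.
node = {0: (500, 250), 1: (500, 250), 2: (500, 250), 3: (500, 250)}
print(utilization_after_drain(node, {0}) < NEARFULL_RATIO)     # True  (about 0.67)
print(utilization_after_drain(node, {0, 1}) < NEARFULL_RATIO)  # False (1.0)
```

Real OSDs are never filled perfectly evenly, so these averages are optimistic and some margin below the thresholds is advisable.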

Pravednik · Oct 20, 2021

@RokaKen hmm... this is an idea. We hadn't thought about reweight 0. We'll have to do the calculation. Thanks.
