Hello,
This is not an issue report, but I would really appreciate your answers.
We have a 12-node PVE cluster with 54 Ceph OSDs (all SSD).
PVE 6.1.5
Ceph 14.2.5
OSD count 54
PG count 2048
Replica 3
OSDs are on RAID0 (we know this is not a supported configuration, but we have to do it this way. And BTW, in our infrastructure OSDs on RAID0 show lower latency than OSDs behind the recommended HBA).
Now we need to replace 1 or 2 SSDs in each node (old SSDs are 500 GB, new SSDs are 2 TB).
For each node we do the following:
1. OSD down
2. OSD out
3. Wait for PG rebalancing to finish and for Ceph health to report OK
4. OSD destroy
5. Insert the new 2 TB SSD (create the RAID0 volume, start PVE)
6. Create the new OSD
7. OSD up with the proper device class and CRUSH rule, OSD in
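To make the steps above concrete, here is a rough sketch of the command sequence for one OSD, written as a small Python helper that only builds the list of commands and does not run anything. The function name `replacement_commands`, the OSD id and the device path are placeholders for illustration. It also assumes that the new OSD gets the same id as the destroyed one, which may not hold on every cluster, and it adds a `ceph osd safe-to-destroy` check that is not part of our current procedure.

```python
from typing import List


def replacement_commands(osd_id: int, new_device: str) -> List[str]:
    """Build the per-OSD command sequence, in order. Nothing is executed here."""
    return [
        f"systemctl stop ceph-osd@{osd_id}",    # step 1: OSD down
        f"ceph osd out {osd_id}",               # step 2: OSD out
        "ceph health",                          # step 3: repeat until HEALTH_OK
        f"ceph osd safe-to-destroy {osd_id}",   # extra safety check (assumption)
        f"pveceph osd destroy {osd_id}",        # step 4: OSD destroy
        # step 5 is manual: swap the SSD, create the RAID0 volume, boot PVE
        f"pveceph osd create {new_device}",     # step 6: create the new OSD
        f"ceph osd in {osd_id}",                # step 7: assumes the id is reused
    ]


if __name__ == "__main__":
    # Placeholder values: OSD id 12 on device /dev/sdb.
    for cmd in replacement_commands(12, "/dev/sdb"):
        print(cmd)
```

Setting the device class and CRUSH rule in step 7 is left out of the sketch because it depends on the pool layout.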
After the remap process has finished, we repeat all the steps above on the next node. It takes more than 7 hours per node. We can't wait that long because the equipment is in another country and our stay there is limited.
The main question: is it safe to down/out/destroy another OSD before the whole remap process has finished (PGs are still rebalancing, but none are degraded)? That way we could replace all the OSDs in 1-2 days and then wait for the long remap process, which is acceptable.
We can't speed up the remap by adding more threads because this is a production system and our customers would be affected by high latency.
Thanks in advance for your answers.
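To show why the sequential approach does not fit our schedule, here is a back-of-the-envelope calculation. It assumes exactly 7 hours per node, which is a lower bound since the real figure is higher, and it ignores the time for the physical swap itself.

```python
NODES = 12            # nodes in the cluster
HOURS_PER_NODE = 7    # lower bound; we actually see more than 7 hours

total_hours = NODES * HOURS_PER_NODE
total_days = total_hours / 24

print(f"at least {total_hours} hours (~{total_days:.1f} days)")
```

So even in the best case the one-node-at-a-time procedure needs about 3.5 days of continuous rebalancing, compared with the 1-2 days we have on site.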