Hi,
I get the behavior described above whenever I add a new OSD to the Ceph storage.
After adding a new OSD, the whole storage becomes unstable for the VMs: many VMs crash with "hung_task_timeout_secs" errors (see the attached image). The rebalance takes about 20 minutes, and during that time the storage is practically inaccessible for any other operation.
It is possible to add an OSD with weight 0 and then increase the weight a little at a time, as the "Argonaut (v0.48) Best Practices" note suggests at docs.ceph.com/docs/giant/rados/operations/add-or-rm-osds/ . What else could I do?
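For reference, this is roughly what I mean by the gradual-weight approach, as a minimal sketch for one new disk on one node (the device /dev/sdX, the ID osd.20, the weight steps and the osd_crush_initial_weight setting are my own placeholders/assumptions, not taken verbatim from the linked page):
Code:
# If the option is available in your Ceph version, have new OSDs start
# with CRUSH weight 0 (in /etc/pve/ceph.conf, section [osd]):
#   osd crush initial weight = 0

# Create the OSD on the new disk (PVE 4.x syntax; /dev/sdX is a placeholder):
pveceph createosd /dev/sdX

# The new OSD (osd.20 here is hypothetical) now sits in the CRUSH map with weight 0.
# Raise its weight in small steps and let recovery settle between steps:
ceph osd crush reweight osd.20 0.2
ceph -s                              # wait until the cluster is healthy again
ceph osd crush reweight osd.20 0.5
ceph -s
ceph osd crush reweight osd.20 1.0   # final weight, roughly the disk size in TB

# Verify the resulting weights and data distribution:
ceph osd df tree
The step sizes are arbitrary; the point is that each reweight only moves a small fraction of the PGs at a time.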
PVE Manager Version: pve-manager/4.4-5/c43015a5
The Ceph cluster network runs over InfiniBand.
Right now I have only 20 OSDs on 5 nodes, but with the process above we want to add about 30 more OSDs, and that makes this a really big problem. All the new OSDs are SSDs, but 12 of the old ones are SATA disks.
Code:
root@pve01:~# ceph status
    cluster bc865d82-2de0-439f-ae34-f14a565c023d
     health HEALTH_WARN
            55 pgs backfill
            22 pgs backfilling
            8 pgs peering
            11 pgs stuck inactive
            92 pgs stuck unclean
            33 requests are blocked > 32 sec
            recovery 96309/1702601 objects misplaced (5.657%)
     monmap e3: 3 mons at {0=172.16.0.1:6789/0,1=172.16.0.2:6789/0,2=172.16.0.3:6789/0}
            election epoch 152, quorum 0,1,2 0,1,2
     osdmap e2573: 19 osds: 19 up, 19 in; 77 remapped pgs
      pgmap v12000981: 1088 pgs, 2 pools, 2076 GB data, 537 kobjects
            6460 GB used, 28362 GB / 34823 GB avail
            96309/1702601 objects misplaced (5.657%)
                 996 active+clean
                  55 active+remapped+wait_backfill
                  22 active+remapped+backfilling
                   8 peering
                   7 activating
recovery io 357 MB/s, 90 objects/s
  client io 2691 kB/s rd, 5328 kB/s wr, 157 op/s
root@pve01:~#
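One more thing I am considering on my own (this is my assumption, not something from the docs above): the status shows recovery running at ~357 MB/s while 33 client requests are blocked, so throttling backfill/recovery might keep client I/O responsive during the rebalance. A rough sketch with illustrative values (osd.0 is just an example OSD):
Code:
# Lower backfill/recovery concurrency and priority at runtime on all OSDs
# (the values here are only examples, not tested recommendations):
ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'

# To make it persistent, the same settings in /etc/pve/ceph.conf, section [osd]:
#   osd max backfills = 1
#   osd recovery max active = 1
#   osd recovery op priority = 1

# Check the values actually in effect (run on the node that hosts osd.0):
ceph daemon osd.0 config show | grep -E 'osd_max_backfills|osd_recovery_max_active|osd_recovery_op_priority'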