Hi,
I get the behavior described below every time I add a new OSD to my Ceph storage.
After adding a new OSD, the whole storage becomes unstable for the connected VMs. A lot of VMs crash with the error "hung_task_timeout_secs" (see the attached image). The rebalance takes about 20 minutes, and during that time the storage is practically inaccessible for other operations.
I know it is possible to add an OSD with weight 0 and increase the weight a little bit at a time, as the "Argonaut (v0.48) Best Practices" note at docs.ceph.com/docs/giant/rados/operations/add-or-rm-osds/ suggests. What else could I do?
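Just to show what I mean, here is a rough sketch of that incremental approach (the OSD id osd.20, the host pve01 and the weight steps are only examples, not taken from my actual setup):

Code:
# add the new OSD to the CRUSH map with weight 0 so no data moves yet
ceph osd crush add osd.20 0 host=pve01
# then raise the CRUSH weight in small steps, letting the cluster settle in between
ceph osd crush reweight osd.20 0.2
ceph osd crush reweight osd.20 0.4
# ... continue in small steps until the OSD reaches its final weight (roughly its size in TB)
ceph osd crush reweight osd.20 1.0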
PVE Manager Version: pve-manager/4.4-5/c43015a5
The Ceph cluster network runs over InfiniBand.
Right now I have only 20 OSDs on 5 nodes, but with the process above we want to add 30 more OSDs, and that is a really huge problem. All the new OSDs are SSDs, but the 12 old ones are SATA.
		Code:
	root@pve01:~# ceph status
    cluster bc865d82-2de0-439f-ae34-f14a565c023d
     health HEALTH_WARN
            55 pgs backfill
            22 pgs backfilling
            8 pgs peering
            11 pgs stuck inactive
            92 pgs stuck unclean
            33 requests are blocked > 32 sec
            recovery 96309/1702601 objects misplaced (5.657%)
     monmap e3: 3 mons at {0=172.16.0.1:6789/0,1=172.16.0.2:6789/0,2=172.16.0.3:6789/0}
            election epoch 152, quorum 0,1,2 0,1,2
     osdmap e2573: 19 osds: 19 up, 19 in; 77 remapped pgs
      pgmap v12000981: 1088 pgs, 2 pools, 2076 GB data, 537 kobjects
            6460 GB used, 28362 GB / 34823 GB avail
            96309/1702601 objects misplaced (5.657%)
                 996 active+clean
                  55 active+remapped+wait_backfill
                  22 active+remapped+backfilling
                   8 peering
                   7 activating
recovery io 357 MB/s, 90 objects/s
  client io 2691 kB/s rd, 5328 kB/s wr, 157 op/s
root@pve01:~#