3.4: All VMs stopped under heavy Ceph reorg

stefws

Renowned Member
Jan 29, 2015
Denmark
siimnet.dk
I had two Ceph pools for RBD virtual disks: vm_images (boot HDD images) and rbd_data (extra HDD images).


Then, while I was adding pools for a RADOS Gateway (.rgw.*), ceph health suddenly reported that my vm_images pool had too few PGs, so I ran:


ceph osd pool set vm_images pg_num <larger_number>
ceph osd pool set vm_images pgp_num <larger_number>
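

For reference, this is roughly what I will probably do next time before touching pg_num, assuming the usual osd_max_backfills / osd_recovery_max_active options apply on this release, to soften the I/O hit of the reshuffle:

# Check the current placement group counts first
ceph osd pool get vm_images pg_num
ceph osd pool get vm_images pgp_num

# Throttle backfill/recovery at runtime so the reshuffle competes less with client I/O
ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'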


The pg_num/pgp_num bump kicked off about 20 minutes of rebalancing with a lot of I/O in the Ceph cluster. Eventually the cluster was healthy again, but almost all my PVE VMs had ended up in the stopped state. Wondering why, a watchdog thingy maybe...
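
In case it helps with diagnosing, this is roughly what I plan to check on the nodes (the log location is just where I would expect it on PVE 3.4, so treat that as an assumption):

# List VM states on this node
qm list

# Look for clues around the rebalance window: killed KVM processes,
# storage/QMP timeouts, or watchdog activity
grep -iE 'kvm|qmp|watchdog' /var/log/syslog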

/Steffen

PS! I admit my Ceph public and cluster networks share the same physical 2-3 Gbps LACP load-balanced network (some nodes with 2x1 Gbps NICs, some with 3x1 Gbps NICs), since my only other physical network is a slow 100 Mbps public network.
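
For completeness: once I get a second fast network, splitting the two should just be a matter of separate subnets in ceph.conf; the subnets below are made up for illustration:

[global]
    # client/monitor traffic
    public network = 192.168.10.0/24
    # OSD replication/backfill traffic on its own network
    cluster network = 192.168.20.0/24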