I have upgraded my 6-node cluster (3 Ceph-only plus 3 compute-only nodes) from 5.4 to 6. The Ceph cluster was originally created on the Luminous release and I am following the upgrade instructions at https://pve.proxmox.com/wiki/Ceph_Luminous_to_Nautilus. During the upgrade the OSDs were "adapted" from ceph-disk to ceph-volume, and the old SSD DB partitions (1 GB each) turned out to be too small. As outlined in the thread https://forum.proxmox.com/threads/bluefs-spillover-detected-on-30-osd-s.56230/, I figured the only way to fix the problem was to remove and re-add the OSDs one server at a time, adopting the new LVM method for Ceph with much larger (around 150 GB) DB partitions on the SSD.
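For reference, this is roughly the per-OSD sequence I've been using on each storage node. It is only a sketch of my own procedure, not a recommendation: the device names (/dev/sdX for the data disk, /dev/sdY for the shared DB SSD), the OSD ID and the DB size are placeholders from my setup, and I'm quoting the pveceph option names from memory, so check them against the man page before copying anything.

```
# take the OSD out and stop it (repeated for each OSD on the node being converted)
ceph osd out 12
systemctl stop ceph-osd@12

# destroy the OSD and wipe the old ceph-disk layout from the data disk
pveceph osd destroy 12 --cleanup
ceph-volume lvm zap /dev/sdX --destroy

# re-create it with the LVM layout and a much larger DB partition on the SSD (size in GiB)
pveceph osd create /dev/sdX --db_dev /dev/sdY --db_size 150
```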
I've done two of the three Ceph servers (each has 10 OSDs), but both times, while the cluster was recovering, all VMs lost access to their disk images. The compute nodes where the VMs run all had empty /etc/ceph directories, and needless to say the VMs all hung. Oddly, the fix was to reboot the storage server on which the OSDs had just been re-added: as soon as it went offline for the reboot, all disk images on Ceph storage became available again. Everything kept working after that Ceph node came back online and the recovery continued. Unfortunately, all the VMs had to be restarted to regain access to their disk images.
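In case it helps anyone reproduce or rule this out, this is the kind of check I'm now running on the compute nodes before and during the recovery. It assumes the storage is the PVE-managed (hyperconverged) RBD type whose config lives on pmxcfs at /etc/pve/ceph.conf; if your clients use an external cluster config the paths will differ.

```
# is the cluster-wide config still visible, and does /etc/ceph point at it?
ls -l /etc/pve/ceph.conf /etc/ceph/

# if the /etc/ceph/ceph.conf symlink has gone missing, point it back at the pmxcfs copy
ln -sf /etc/pve/ceph.conf /etc/ceph/ceph.conf

# confirm the RBD storage is reported as active again from the Proxmox side
pvesm status
```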
I'm wondering if anyone else has seen this. It may not be related to the upgrade itself but rather to the follow-up work of deleting and re-adding OSDs on a per-server basis. I still have one more storage node in this cluster to upgrade, and then another identical cluster that will need the same upgrade and disk conversion, so I want to avoid this happening again.