Hi,
I just upgraded a hyperconverged PVE cluster with Ceph from Proxmox VE 8.1 to 8.2. I noticed that Ceph complained about crashes in its health status. I rebooted the nodes after evacuating the VMs, and upgraded 2 of the 3 cluster nodes that way. I also forgot to set noout before rebooting, and I got stuck with changed predictable network interface names, which took me some time to fix.
Then, when I was about to upgrade the last cluster node, I realized that Ceph was not healthy.
ceph status showed something like this ...
Code:
data:
pools: 2 pools, 160 pgs
objects: 408.34k objects, 1.4 TiB
usage: 4.4 TiB used, 17 TiB / 21 TiB avail
pgs: 1480/1225032 objects misplaced (0.121%)
155 active+clean
3 active+clean+scrubbing+deep
1 active+clean+scrubbing
1 active+remapped+backfilling
... but the number of misplaced objects was at ~80,000. I waited for two hours hoping Ceph would repair itself, but it didn't: the number of misplaced objects dropped to ~60,000 and climbed back up to ~80,000, oscillating back and forth the whole time.
I wondered what was going on. Looking at the output of ceph versions, I realized that the Ceph versions differed across the nodes. Of course they did: one of the 3 nodes was still on Ceph 18.2.1 (not 100% sure of the exact version, it might also have been 18.1.x), while the others had already been upgraded to 18.2.2.
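For anyone hitting the same thing: the mixed-version state is easy to confirm from the CLI. A small sketch (guarded so it does nothing on a machine without the ceph CLI; ceph versions summarizes the running daemon versions grouped by daemon type):

```shell
# During a rolling upgrade, "ceph versions" lists more than one version;
# the upgrade is only finished once all daemons report the same one.
if command -v ceph >/dev/null 2>&1; then
    ceph versions        # JSON summary grouped by mon/mgr/osd/overall
    ceph health detail   # shows crash and other health warnings
else
    echo "ceph CLI not found; commands shown for reference only"
fi
```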
I suspected that this was not an ideal situation, so I set noout (ceph osd set noout), upgraded the packages on the remaining node, took down the OSDs on that last host, and rebooted the machine.
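For the record, what I did roughly corresponds to this sequence (a sketch, not the official procedure; noout and norebalance are the standard Ceph OSD flags, and the block is guarded so it only issues commands where the ceph CLI is available):

```shell
# Set the flags that keep Ceph from rebalancing while a node is down,
# then clear them again once the node is back and its OSDs have rejoined.
enter_maintenance() {
    ceph osd set noout        # don't mark stopped OSDs "out"
    ceph osd set norebalance  # don't shuffle PGs in the meantime
}
leave_maintenance() {
    ceph osd unset norebalance
    ceph osd unset noout
}

if command -v ceph >/dev/null 2>&1; then
    enter_maintenance
    # ... upgrade packages, stop the OSDs, reboot, wait for HEALTH_OK ...
    leave_maintenance
else
    echo "ceph CLI not found; commands shown for reference only"
fi
```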
Now the Ceph recovery progressed. There are no more misplaced objects, and all PGs were back to active+clean about 20 minutes after the reboot.
Questions:
- Is it wrong, from Ceph's point of view, to simply reboot a server that has no running services (VMs, containers) on it?
- Was this a bad state that Ceph could not fix by itself because of the version mismatch?