Ceph Recovery not progressing after PVE-Update

hellfire

Hi,

I just upgraded a hyperconverged PVE cluster with Ceph from Proxmox VE 8.1 to 8.2. I noticed that Ceph complained about crashes in its health status, but I simply rebooted the nodes after evacuating the VMs. That way I upgraded 2 of the 3 cluster nodes. I also forgot to set noout before the reboots, and I got stuck with changed predictable network interface names, which took me some time to fix.

Then, when I was about to upgrade the last cluster node, I realized that Ceph was not healthy.

ceph status showed something like this ...

Code:
  data:
    pools:   2 pools, 160 pgs
    objects: 408.34k objects, 1.4 TiB
    usage:   4.4 TiB used, 17 TiB / 21 TiB avail
    pgs:     1480/1225032 objects misplaced (0.121%)
             155 active+clean
             3   active+clean+scrubbing+deep
             1   active+clean+scrubbing
             1   active+remapped+backfilling

... but the number of misplaced objects was at ~80000. I waited for 2 hours hoping that Ceph would repair itself, but it didn't. The number of misplaced objects decreased to ~60000 and then increased again to ~80000, back and forth the whole time.

I wondered what was going on. Looking at the output of ceph versions, I then realized that the Ceph versions differed: one of the 3 nodes was still on Ceph 18.2.1 (not 100% sure this was the exact version, it might also have been 18.1.x), while the others had already been upgraded to 18.2.2.
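To spot such a mismatch quickly, the JSON printed by ceph versions can be checked per daemon type. The following is only a minimal sketch in Python: the sample data, the daemon counts, the shortened version strings and the helper name mixed_versions are made up for illustration and are not taken from my cluster.

```python
import json

# Illustrative sample shaped like the JSON printed by `ceph versions`.
# Counts and (shortened) version strings are assumptions for this sketch.
SAMPLE = """
{
  "mon": {"ceph version 18.2.1 reef (stable)": 1,
          "ceph version 18.2.2 reef (stable)": 2},
  "mgr": {"ceph version 18.2.2 reef (stable)": 2},
  "osd": {"ceph version 18.2.1 reef (stable)": 4,
          "ceph version 18.2.2 reef (stable)": 8},
  "overall": {"ceph version 18.2.1 reef (stable)": 5,
              "ceph version 18.2.2 reef (stable)": 12}
}
"""


def mixed_versions(versions):
    """Return {daemon_type: [version, ...]} for daemon types running more than one version."""
    mixed = {}
    for daemon_type, counts in versions.items():
        if daemon_type == "overall":
            continue  # summary entry, not a daemon type
        if len(counts) > 1:
            mixed[daemon_type] = sorted(counts)
    return mixed


for daemon_type, found in mixed_versions(json.loads(SAMPLE)).items():
    print(f"{daemon_type}: mixed versions -> {', '.join(found)}")
```

On a real cluster the input could come from the output of ceph versions instead of the hard-coded sample.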

I suspected that this was not an ideal situation, so I set the OSD noout flag, upgraded the software on the remaining node, took down the OSDs on that last host and rebooted the machine.
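The sequence above can be written down roughly as follows. This is a hedged sketch, not an official procedure: the exact commands (apt dist-upgrade, stopping ceph-osd.target) and the helper run_steps are my assumptions about how these steps are typically done, and clearing noout after the reboot is an assumed follow-up step. The helper only prints the commands unless dry_run is set to False.

```python
import shlex
import subprocess

# Steps on the node that is about to be rebooted (assumed commands).
BEFORE_REBOOT = [
    "ceph osd set noout",              # stopped OSDs are not marked out
    "apt update",
    "apt dist-upgrade",                # upgrade the packages on this node
    "systemctl stop ceph-osd.target",  # stop all OSD daemons on this host
    "systemctl reboot",
]

# Assumed follow-up once the node is back and its OSDs have rejoined.
AFTER_REBOOT = [
    "ceph osd unset noout",
    "ceph status",
]


def run_steps(commands, dry_run=True):
    """Print each command and execute it only if dry_run is False.

    Returns the list of argument vectors that were (or would be) run.
    """
    handled = []
    for command in commands:
        argv = shlex.split(command)
        print("+", command)
        if not dry_run:
            subprocess.run(argv, check=True)  # abort on the first failure
        handled.append(argv)
    return handled


# Dry run: only shows what would be executed before the reboot.
run_steps(BEFORE_REBOOT)
```

Running run_steps(AFTER_REBOOT, dry_run=False) would then clear the flag again once the node is back up.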

Now the Ceph recovery progressed. There are no more misplaced objects, and all PGs were back to active+clean about 20 minutes after the reboot.

Questions:
  • Is it wrong, from Ceph's point of view, to just reboot a server that has no running services (VMs, containers)?
  • Was this some kind of bad situation that Ceph could not fix by itself because of the version mismatch?
 