Ceph PG stuck unknown / inactive after upgrade?

petwri

New Member
Jan 27, 2025
I just did the PVE upgrade and upgraded Ceph to 19.2.1. I did one node after the other, and in the beginning everything seemed fine. But then, when the last of my 3 nodes was scheduled for reboot, I had an accidental power outage and had to do a hard restart of the cluster.

Everything came back up as expected. I immediately shut down all my VMs to make sure Ceph was OK. All OSDs were up and everything seemed fine. But then one PG became unknown and others went stale and/or peering:

Code:
    pg 2.1b is stuck stale for 9h, current state stale+peering, last acting [9,3]
    pg 4.a is stuck peering for 19h, current state peering, last acting [3,9]
    pg 4.4e is stuck inactive for 19h, current state unknown, last acting []
    pg 4.78 is stuck inactive for 19h, current state activating+undersized, last acting [3]

Also, more and more OSDs started reporting slow operations:

Code:
     osd.6 observed slow operation indications in BlueStore
     osd.7 observed slow operation indications in BlueStore
     osd.9 observed slow operation indications in BlueStore
     osd.10 observed slow operation indications in BlueStore
     osd.11 observed slow operation indications in BlueStore

SMART status of all OSDs is PASSED. What could be causing this? I started to scale down the size of the replicated pools to temporarily free up some space and trigger a rebalance. Now I don't know if that was the right approach, or what else I could do to get all PGs active again. Any ideas?
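For anyone landing here in the same state: stuck PGs like the ones above can usually be inspected with the standard Ceph CLI before touching anything. A sketch, using the PG IDs from the output above (adjust to your cluster):

```shell
# List all PGs that are stuck inactive or stale
ceph pg dump_stuck inactive
ceph pg dump_stuck stale

# Query one affected PG for its peering state and any blocking OSDs
# (note: this may hang for a PG in state "unknown", since no OSD
# currently reports it)
ceph pg 4.78 query

# Show which OSDs CRUSH currently maps the PG to
ceph pg 4.78 map
```

The `query` output's `recovery_state` section typically names the OSDs the PG is waiting on, which narrows down where peering is blocked.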
 
Re slow, that’s a new message in 19.2.1.

Did you create these OSDs in 19.x?
 
Understood, I already read this was a new message in 19.2.1. But still, PGs are not active.
The whole cluster (and every single OSD) was set up in 19.2.
 
Yes…happy Friday. :-/

One at a time. Just be patient and wait for the rebalance to finish after marking it out, before destroying it. The GUI shows an ETA, but it's a guess.
 
Ok, good news: this is "just" my homelab. So the only person that is going to kill me over this is myself.
So to understand you correctly: I set an OSD to out, wait for the rebalancing to finish, wipe the disk, set up the OSD from scratch, and repeat that one by one? That should bring my unknown PGs back without any data loss?
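The steps described above map onto the CLI roughly like this. A sketch of one iteration, assuming OSD 9 on disk /dev/sdX (placeholders; pick your own OSD ID and device), and only while the cluster otherwise reports all data active+clean:

```shell
# 1. Mark the OSD out so Ceph migrates its data elsewhere
ceph osd out osd.9

# 2. Wait until rebalancing is done: no degraded or misplaced objects
watch ceph -s

# 3. Stop the OSD daemon and destroy it, wiping the disk (Proxmox CLI)
systemctl stop ceph-osd@9
pveceph osd destroy 9 --cleanup

# 4. Recreate the OSD on the now-empty disk
pveceph osd create /dev/sdX
```

Only move on to the next OSD once the cluster is back to HEALTH_OK; with size 3 pools, draining one OSD at a time keeps two copies of everything available throughout.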
 
I can't answer about the stuck OSDs. I haven't seen that. I'm just saying they say it's required to recreate the OSDs to prevent corruption, per the release notes.

I might try restarting one? Or a server? But recreating them would restart it anyway.
 
I already restarted things multiple times. Tried rebooting several nodes, several mgr services, several OSDs, even my switch - nothing brought back the PGs so far. But up to now, I haven't done anything that would overwrite data. So fingers crossed I'll get it back up and running.
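Before anything destructive, one more non-destructive thing worth trying is forcing the stuck PGs to re-run peering, or restarting only the OSDs in a PG's acting set rather than whole nodes. A sketch, using PG 2.1b and its acting set [9,3] from the earlier output:

```shell
# Ask the primary OSD of a stuck PG to restart the peering process
ceph pg repeer 2.1b

# Restart just the OSD daemons in that PG's acting set
systemctl restart ceph-osd@9 ceph-osd@3
```

Neither command touches stored data; if the PGs still don't become active afterwards, the `ceph pg <pgid> query` output is the next place to look.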