Ceph PG stuck unknown / inactive after upgrade?

petwri

Jan 27, 2025
I just did the PVE upgrade and updated Ceph to 19.2.1. I upgraded one node after the other, and at first everything seemed fine. But when the last of the 3 nodes was scheduled for its reboot, I had an unexpected power outage and had to do a hard restart of the cluster.

Everything came back up as expected. I immediately shut down all my VMs to make sure Ceph was OK. All OSDs were up and everything seemed fine. But then one PG became unknown and others went stale and/or peering:

Code:
    pg 2.1b is stuck stale for 9h, current state stale+peering, last acting [9,3]
    pg 4.a is stuck peering for 19h, current state peering, last acting [3,9]
    pg 4.4e is stuck inactive for 19h, current state unknown, last acting []
    pg 4.78 is stuck inactive for 19h, current state activating+undersized, last acting [3]
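
For reference, these are the kinds of commands I have been using to look at the stuck PGs (the PG ID is taken from the output above; run from a node with the admin keyring):

Code:
    # list PGs stuck in inactive or stale states
    ceph pg dump_stuck inactive
    ceph pg dump_stuck stale

    # query one of the problem PGs for its peering state and any blocking OSDs
    ceph pg 2.1b query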

Also, more and more OSDs started to become slow:

Code:
     osd.6 observed slow operation indications in BlueStore
     osd.7 observed slow operation indications in BlueStore
     osd.9 observed slow operation indications in BlueStore
     osd.10 observed slow operation indications in BlueStore
     osd.11 observed slow operation indications in BlueStore

SMART status of all OSDs is PASSED. What could be causing this? I started to scale down the size (replica count) of the replicated pools to temporarily free up some space and trigger a rebalance. Now I don't know whether this was the right approach, or what else I could do to get all PGs active again. Any ideas?
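
What I ran for the pool change was something along these lines (pool name is just a placeholder):

Code:
    # drop the replica count of a replicated pool to free space and trigger data movement
    ceph osd pool set <poolname> size 2

    # watch whether recovery/rebalancing actually makes progress
    ceph -s
    ceph health detail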
 
Re slow, that’s a new message in 19.2.1.

Did you create these OSDs in 19.x?
 
Understood, I already read this was a new message in 19.2.1. But still, PGs are not active.
The whole cluster (and every single OSD) was set up in 19.2.
 
Ok, good news: this is "just" my homelab. So the only person that is going to kill me over this is myself.
So, to understand you correctly: I'll set an OSD to down, wait for rebalancing to finish, wipe the disk, set up the OSD from scratch, and repeat that one by one? That should bring my unknown PGs back without any data loss?
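
So per OSD, something like this is what I have in mind (OSD ID and device path are just examples):

Code:
    # 1) take the OSD out and wait until its data has been moved off
    ceph osd out 9
    while ! ceph osd safe-to-destroy osd.9; do sleep 60; done

    # 2) stop the daemon and destroy the OSD, wiping the disk
    systemctl stop ceph-osd@9.service
    pveceph osd destroy 9 --cleanup

    # 3) recreate the OSD on the wiped disk, wait for backfill, then do the next one
    pveceph osd create /dev/nvme0n1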
 
I can't answer about the stuck PGs. I haven't seen that. I'm just saying they say it's required to recreate the OSDs to prevent corruption, per the release notes.

I might try restarting one? Or a server? But recreating them would restart it anyway.
 
I already restarted things multiple times. Tried rebooting several nodes, several mgr services, several OSDs, even my switch - nothing has brought the PGs back so far. But up to now I haven't done anything that would overwrite data, so fingers crossed I'll get it back up and running.
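
For reference, the restarts and re-peer nudges in question look roughly like this (service, OSD and PG IDs are examples):

Code:
    # restart the manager and individual OSD daemons on a node
    systemctl restart ceph-mgr.target
    systemctl restart ceph-osd@9.service

    # ask a stuck PG to re-peer, or briefly mark its acting primary down to force it
    ceph pg repeer 2.1b
    ceph osd down 9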
 
Were you or anyone else ever able to figure this out? I just bought three MS-A2 9955HX nodes specifically to build a Ceph cluster in my homelab and have tried two fresh Proxmox 9.0.11 installs with Squid 19.2.3, and it still keeps saying "Reduced availability: 1 pg inactive, 1 pg peering", "stuck peering for {insert several hours here}, current state creating+peering" and "2 slow ops, oldest one blocked for {insert several hours here}, osd.1 has slow ops".


If I add a Ceph pool, it just does the same thing but for all of the PGs, and sometimes more than one of the 3 total OSDs has reported slow ops.

All drives are brand new 990 Pro 2TB drives (I should have gone for PLP, yes, I figured that out after I had already built these nodes), and the networking is all set up correctly and working. I tried setting Ceph up through the CLI, then did fresh installs on all nodes and retried with the Ceph install wizard in the web GUI, but both end with this same result. Completely wiping and recreating the OSDs does nothing. No VMs or anything else have even been set up on these systems yet; they are all fresh installs on brand new hardware. This is my first time with Ceph, but I have been running Proxmox VE 8 on my other two servers for a couple of years now with no issues...
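
In case it helps anyone spot something, this is the sort of information I can pull (osd.1 matches the slow-ops warning; the daemon commands have to be run on the node hosting that OSD):

Code:
    ceph -s
    ceph health detail
    ceph osd df tree                        # per-OSD usage, weights and placement
    ceph daemon osd.1 dump_ops_in_flight    # what the slow ops are currently waiting on
    ceph daemon osd.1 dump_historic_ops     # recently completed ops, including slow ones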