I just did the PVE upgrade and updated Ceph to 19.2.1. I went one node after the other, and at first everything seemed fine. But then, when the last of the three nodes was scheduled for its reboot, there was an accidental power outage and I had to do a hard restart of the whole cluster.
Everything came back up as expected. I immediately turned off all my VMs to make sure Ceph was OK. All OSDs were up and everything seemed fine. But then, 1 PG became unknown and others went stale and/or peering.
Code:
pg 2.1b is stuck stale for 9h, current state stale+peering, last acting [9,3]
pg 4.a is stuck peering for 19h, current state peering, last acting [3,9]
pg 4.4e is stuck inactive for 19h, current state unknown, last acting []
pg 4.78 is stuck inactive for 19h, current state activating+undersized, last acting [3]
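For reference, those PG lines are from ceph health detail. If per-PG detail would help, I can also pull something like the following (2.1b is just one of the stuck PGs from the output above):

Code:
# full health output the lines above came from
ceph health detail
# detailed state of a single stuck PG
ceph pg 2.1b query
# list PGs stuck in a given state
ceph pg dump_stuck inactive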
Also, more and more OSDs started reporting slow operations in BlueStore:
Code:
osd.6 observed slow operation indications in BlueStore
osd.7 observed slow operation indications in BlueStore
osd.9 observed slow operation indications in BlueStore
osd.10 observed slow operation indications in BlueStore
osd.11 observed slow operation indications in BlueStore
The SMART status of all OSDs is PASSED, so what could be causing this? To temporarily free up some space and trigger a rebalance, I started to scale down the size (replica count) of the replicated pools, roughly as sketched below. Now I don't know if that was the right approach, or what else I could do to get all PGs active again. Any ideas?
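To be clear about what I mean by scaling down, per pool it was basically the command below (the pool name and target size here are placeholders, not my exact values):

Code:
# example only: lower the replica count of one replicated pool
ceph osd pool set <pool-name> size 2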