This is more of a Ceph issue than a Proxmox issue, but I'm asking here in case anyone knows a quick answer or has seen this behavior before.
We operate a fair number of independent PVE clusters (about 100 hypervisors split into clusters of around 12 each). The clusters are more or less identical, with dedicated dual-homed 10G backends for Ceph, dedicated independent quorum networks, and so on. Each cluster also has Ceph configured entirely with defaults, following the Proxmox install guide. We're currently running PVE 6.4 with Ceph 14.2.22. Every node is a storage node and has OSDs, typically around 4 each (so each cluster has around 48 OSDs).
The problem: two weeks ago we had a near-total outage on one cluster. A second node went offline while one was already out for repairs, and Ceph marked a number of PGs inactive and took about an hour to recover them. This was a shock to us, because our understanding is that Ceph should keep at least 3 copies of each PG and should by default use a host bucket in the CRUSH map. And indeed, Ceph DID eventually recover on its own, but it took an hour. Needless to say, this caused many hundreds of VMs to stop responding altogether, as they had blocks on the PGs that were not responding.
All of the inactive PGs were in one of these 2 states: backfilling+peered or backfill_wait+peered. Ceph was in HEALTH_WARN, which was completely expected. What was completely unexpected was that it took an hour to mark the 8 or so PGs active again and start using their backup copies. The outage was very frustrating for us, as there was no indication of why Ceph took so long to fix those inactive PGs.
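To make our own reasoning concrete, here is a rough sketch of the arithmetic as we understand it. It assumes a replicated pool with size 3 and min_size 2 (the values we would expect from a default setup, not something verified above), one replica per host, and PGs spread uniformly across hosts, which CRUSH only approximates. The function name and parameters are made up for illustration. It estimates what fraction of PGs would be left with fewer than min_size copies when 2 of 12 hosts are down at once; it says nothing about why recovery took an hour.

```python
from fractions import Fraction
from itertools import combinations


def fraction_below_min_size(hosts, size, min_size, down):
    """Fraction of equally likely host placements left with fewer
    than min_size surviving replicas when `down` hosts are offline.

    Assumption: each PG places one replica on each of `size` distinct
    hosts, and every such host set is equally likely.
    """
    down_hosts = set(range(down))  # which hosts fail does not matter under uniform placement
    placements = list(combinations(range(hosts), size))
    affected = sum(
        1 for placement in placements
        if len(set(placement) - down_hosts) < min_size
    )
    return Fraction(affected, len(placements))


frac = fraction_below_min_size(hosts=12, size=3, min_size=2, down=2)
print(frac, float(frac))  # 1/22, i.e. roughly 4.5% of PGs
```

Under these assumptions, about 1 in 22 PGs would have two of their three copies on the two downed hosts. Multiplying that fraction by a pool's pg_num gives a rough count of PGs that would be expected to fall below min_size.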
TL;DR: a completely by-the-book install of Proxmox 6.4 + Ceph 14.2.22 with 12 Ceph nodes of 4 OSDs each (48 OSDs total) left PGs inactive for an hour when 2 of 12 nodes were down, which stalled every VM. We'd like to understand whether we are missing something about Ceph, or whether there is something we could have done differently.
				