Yesterday I was greeted by numerous unreachable services caused by a Ceph health error on our VM cluster: "1 full osd(s)" and "1 backfillfull osd(s)", resulting in "4 pool(s) full". I managed to solve it. This was the Ceph status panel: [screenshot not included]

As a result, our Ceph cluster ended up in a read-only state. The full OSDs were shown directly in the details of the status summary. After some reading and a look at our configuration, I got it fixed by doing the following:
- Increase the full and backfillfull ratios a little so that Ceph has some space to work with and switches from a health error to a health warning:
  Bash:
      sudo ceph osd set-backfillfull-ratio 0.93
      sudo ceph osd set-full-ratio 0.97
- The balancer was running in mode upmap, but sudo ceph osd df showed a great imbalance in utilization of our ~50 OSDs. The solution was to set require-min-compat-client to luminous so that the balancer is able to perform the upmap mode properly, as noted in the corresponding Ceph docs. Here's more info about the balancer itself.
  Bash:
      sudo ceph osd set-require-min-compat-client luminous
- After an hour, everything was rebalanced and Ceph was in a healthy state again \o/. I could revert the backfillfull and full ratios to their previous values:
  Bash:
      sudo ceph osd set-backfillfull-ratio 0.90
      sudo ceph osd set-full-ratio 0.95
- Some VMs ended up in a strange state but luckily worked normally after a reboot.
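For anyone repeating the steps above, it may help to look at the current values before raising the ratios and to confirm afterwards that they really are back at the old ones. The following is only a rough sketch: it assumes the plain-text output of `ceph osd dump` contains lines such as `full_ratio 0.95` and `require_min_compat_client jewel` (as it does on Luminous and newer, as far as I know), and `cluster_settings` and `ratios_match` are helper names made up for this example, not Ceph commands.

```shell
#!/bin/sh
# Sketch only. Both helpers read the plain-text output of `ceph osd dump`
# from stdin; they are hypothetical helpers, not part of Ceph.

# Print only the settings touched in the steps above.
# Assumption: the dump has lines like "full_ratio 0.95".
cluster_settings() {
    grep -E '^(full_ratio|backfillfull_ratio|nearfull_ratio|require_min_compat_client) '
}

# Exit 0 if the backfillfull and full ratios in the dump equal the two
# values passed as arguments (compared numerically with a small tolerance,
# so "0.9" and "0.90" count as equal).
ratios_match() {
    awk -v want_bf="$1" -v want_full="$2" '
        function near(a, b,    d) { d = a - b; if (d < 0) d = -d; return d < 0.001 }
        $1 == "backfillfull_ratio" { bf = $2; have_bf = 1 }
        $1 == "full_ratio"         { full = $2; have_full = 1 }
        END { exit !(have_bf && have_full && near(bf, want_bf) && near(full, want_full)) }
    '
}
```

Possible usage on a cluster node: `sudo ceph osd dump | cluster_settings` before touching anything, and `sudo ceph osd dump | ratios_match 0.90 0.95 && echo "ratios restored"` after reverting.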
I still have a few questions though, maybe you can help:
- The balancer seemed to work at least partially, otherwise this problem would have happened sooner I guess. Is it intended that it is able to run in upmap mode together with require-min-compat-client set to something pre-luminous (was "jewel" in our case) although the docs say: "supporting only Luminous (and newer) clients"?
- In the PVE upgrade instructions from Ceph Jewel to Luminous this setting is explicitly mentioned. However, there it is advised to set this value to jewel instead of luminous - is that intended or a typo? I didn't find set-require-min-compat-client anywhere else in the PVE wiki.
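Related to the require-min-compat-client questions above: as far as I understand the Ceph docs, setting the value to luminous is refused while pre-Luminous clients are still connected, and `ceph features` shows which releases the connected clients report. A minimal sketch of such a pre-check follows. It assumes each feature group in that JSON carries a field like `"release": "jewel"`; since the exact JSON layout differs between Ceph versions, it only greps for that field instead of parsing the structure. `has_pre_luminous` is a made-up helper name.

```shell
#!/bin/sh
# Sketch only: reads the JSON printed by `ceph features` from stdin and
# exits 0 if any connected entity reports a release older than Luminous.
# Assumption: releases appear as fields like "release": "jewel".
# `has_pre_luminous` is a hypothetical helper, not a Ceph command.
has_pre_luminous() {
    grep -Eq '"release": *"(argonaut|bobtail|cuttlefish|dumpling|emperor|firefly|giant|hammer|infernalis|jewel|kraken)"'
}
```

Possible usage: `sudo ceph features | has_pre_luminous && echo "pre-Luminous clients still connected"`.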
If my solution wasn't quite textbook, feel free to jump in and suggest better ones =)
Thanks in advance