Help with Ceph

deepcloud
Hello folks,

We have a Proxmox cluster running version 7.4 (not yet upgraded to 8.1; we will do that shortly after we resolve this Ceph issue).

It's a 4-node cluster with Ceph on 3 of the nodes, with 2 x 6.4 TB NVMe drives per host as Ceph OSDs, i.e. 6 x 6.4 TB drives in the pool in total.
The pool is 3-way replicated with min_size 2, so total usable capacity is approximately 12 TB at 100% utilization.

Now, our folks had not been monitoring the usage; it was at around 95%, I think, when the failure happened.

Unfortunately both boot drives on one of the nodes failed and its 2 OSDs could not be imported, and another OSD on a different node also failed, which left us with only 3 of the 6 OSDs holding data.
Anyway, we replaced these and the cluster started rebuilding, but strangely we got OSD full errors. We had 2 spare 6.4 TB NVMe drives available and added one to each of 2 different hosts, so we now have 8 OSDs in total. It started rebuilding and rebalancing again, only to get stuck at 97% and complain about low disk space to backfill PGs.

Now, if you look at the attached screenshots, you will see that one node has OSD 1 and OSD 5; OSD 1 kept writing data until it reached 95%, but the other one does not go beyond 66.98%. My question is: why?

My second question is: why is there less space? The same amount of data was there before the failure and nothing has been added, so how can it be short of space to accommodate the data, even after adding 2 additional SSDs?

Can somebody help, please?
 

Attachments

  • rebuid-stuck.png
  • while-rebuilding-stalled.png
  • while-rebuilding.png
Now, if you look at the attached screenshots, you will see that one node has OSD 1 and OSD 5; OSD 1 kept writing data until it reached 95%, but the other one does not go beyond 66.98%. My question is: why?
Quite simply, this is the CRUSH algorithm. This is also why you should never wait until the nearfull or full ratio is reached: the already full OSDs can still receive data, which makes the situation worse. Only afterwards does Ceph redistribute the data, and over the course of the backfill you can watch the full OSD slowly empty out; at the beginning, however, its usage often only increases.
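
If it helps while the backfill is running, a few standard Ceph commands (assuming a stock Proxmox/Ceph install, run on any node with a working monitor) to keep an eye on this:

ceph osd df tree                # per-OSD usage, weight and PG count, grouped by host
ceph health detail              # which OSDs are nearfull/backfillfull/full and which PGs are stuck
ceph osd dump | grep -i ratio   # the currently configured nearfull/backfillfull/full ratios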

My second question is: why is there less space? The same amount of data was there before the failure and nothing has been added, so how can it be short of space to accommodate the data, even after adding 2 additional SSDs?
That comes from a mistake in your reasoning; you have to think a little more abstractly here. You have 3 nodes with 2 OSDs each, and Ceph distributes the data across exactly these three nodes with 2 OSDs each. So you have to consider one node as a unit, and you can leave the other two out of it. If both OSDs on a node are at 50%, that is fine. If one of them fails, its 50% of the data has to be moved to the remaining OSD on exactly the same host, which for that single OSD means 100% utilization.
If I now take the fill levels on cpx03, I get 169.53% total usage; dividing that by 2 OSDs (as on cpx05) gives 84.765%, so you are close to nearfull again. Between 96.22% and 84.765% there is only 11.455%, and that can easily be a deviation that is within reason.

You should try to slowly reduce the weight of OSD 1, but you have to make sure you don't push OSD 5 straight to its limit, and keep in mind that OSD 1 will also receive data again through replication. So you have to stay on top of it the whole time to solve the problem. Or, if you have a third NVMe available, you should add it to that node now.

With this command you reduce the weight of OSD 1 from 1.00 to 0.95: ceph osd reweight 1 0.95
Syntax: ceph osd reweight <OSD> <WEIGHT>
You'll probably have to repeat this in small increments until the cluster is okay again. But please don't overdo it, otherwise you'll just shift the problem and not solve it.
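
A minimal sketch of one such round (the 0.05 step and the OSD IDs are only examples; adjust them to what your cluster actually shows):

ceph osd reweight 1 0.95   # nudge OSD 1 down a little
ceph osd df tree           # check that OSD 5 and the new OSDs are not filling up instead
ceph -s                    # wait for the backfill to settle before the next step
ceph osd reweight 1 0.90   # then take the next small step if everything still looks sane

There is also ceph osd reweight-by-utilization, which does something similar automatically, but with a cluster this full I would rather do it by hand in small steps.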

In that respect, you are better off with smaller NVMe drives but simply a few more per node; then the data is distributed much better and you can lose an OSD without immediately hitting the limit.

Example: all OSDs are at a 50% fill level; if those 50% have to be moved to a single other OSD, it is immediately at 100%. If you have two more OSDs, each of them only takes on about 25%, and they end up at roughly 75% utilization, and so on.
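
As a rough rule of thumb (idealised, assuming a perfectly even distribution within one node): with n equally sized OSDs per node at fill level f, losing one pushes the survivors to roughly f * n / (n - 1). With only 2 OSDs per node that means doubling (50% -> 100%); with more, smaller OSDs per node the jump from a single failure gets much smaller.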
 
