Do I need to keep free disk space in the Ceph pool in case of an OSD or server failure?

Thank you, Aaron!
We decided not to install another network card. Instead, we will allocate a separate VLAN on the switches and configure CoS to guarantee bandwidth for the cluster network. Wouldn't that work too?

Can anyone say something about the practice of disabling backfill?
Can anyone help?
 
If you keep it disabled and only enable it manually, you rob Ceph of one of the automatic healing options. How fast will you be able to activate it on a weekend or in the middle of the night?
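
For reference, backfill is usually paused with the cluster-wide `nobackfill` flag (`ceph osd set nobackfill`, reverted with `ceph osd unset nobackfill`). The sketch below is not from the thread: it assumes the `flags` line that `ceph osd dump` prints, and `healing_flags` is a made-up helper name. It lists the flags that currently delay or block automatic healing, so a forgotten flag can be spotted before the weekend.

```shell
#!/bin/sh
# healing_flags (hypothetical helper): reads `ceph osd dump` output on stdin
# and prints, one per line, every set flag that delays or blocks self-healing.
healing_flags() {
    awk '$1 == "flags" {
        n = split($2, f, ",")
        for (i = 1; i <= n; i++)
            if (f[i] == "noout" || f[i] == "nobackfill" ||
                f[i] == "norecover" || f[i] == "norebalance")
                print f[i]
    }'
}
```

On a cluster node it would be used as `ceph osd dump | healing_flags`; no output means none of these flags is set.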
 
Hi all,

Really interesting thread.

Shouldn't the default values of full_ratio (0.95) and nearfull_ratio (0.85), which are very aggressive numbers suited to a test cluster (https://docs.ceph.com/en/quincy/rados/configuration/mon-config-ref/#storage-capacity), be set to more realistic values at installation time and whenever a node is added or removed, depending on the number of nodes and OSDs?

Christophe.
Many clusters that I have seen are rather inhomogeneous, and their failure domains can differ quite a bit (multiple rooms, ...). Setting these ratios automatically by some formula could very quickly interfere with setups that are a bit more unusual.

I would prefer giving more hints about them, perhaps showing the currently defined full and nearfull parameters and making them easily configurable. In the end, it is the responsibility of the admin to decide what they want :)
 
If you keep it disabled and only enable it manually, you rob Ceph of one of the automatic healing options. How fast will you be able to activate it on a weekend or in the middle of the night?
I agree that we lose automation.
But considering that we have 4 nodes, if two fail, do we still have a workable cluster? Or am I wrong?
 
Depends on the size parameter of the pools used.

If the pools use a size/min_size of 4/2, you should be fine, since two copies are still available. But with size=3, you will most likely have some PGs that had 2 of their 3 replicas on the two failed nodes. Those PGs are left with only one replica, which is below min_size=2, so IO to the pool is blocked until at least 2 replicas are available again.
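
The arithmetic behind this answer can be sketched as follows. The function names are made up for illustration, and the sketch assumes the common setup with the host as failure domain, so that each node holds at most one replica of any PG.

```shell
#!/bin/sh
# worst_case_replicas SIZE FAILED_NODES
# Replicas left for the unluckiest PG, i.e. one that had a copy on every failed node.
worst_case_replicas() {
    left=$(( $1 - $2 ))
    if [ "$left" -lt 0 ]; then
        left=0
    fi
    echo "$left"
}

# pool_io_state SIZE MIN_SIZE FAILED_NODES
# Prints "active" if even the unluckiest PG still has min_size replicas, else "blocked".
pool_io_state() {
    if [ "$(worst_case_replicas "$1" "$3")" -ge "$2" ]; then
        echo active
    else
        echo blocked
    fi
}

pool_io_state 4 2 2   # 4/2 pool, two nodes down: prints "active"
pool_io_state 3 2 2   # 3/2 pool, two nodes down: prints "blocked"
```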

Additionally, you need to look at how many MONs you have lost. They always need a quorum (majority). So if 2 out of 3 MONs are gone, or 2 out of 4, your cluster will not work until you get these MONs back up or remove them from the Ceph config and the monmap. The latter is quite the procedure.
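
The majority rule is plain integer arithmetic. A minimal sketch, with helper names invented for this example:

```shell
#!/bin/sh
# mon_quorum_needed TOTAL
# Smallest strict majority of TOTAL monitors.
mon_quorum_needed() {
    echo $(( $1 / 2 + 1 ))
}

# mon_quorum_state TOTAL LOST
# Prints "quorum" if the surviving MONs still form a majority, else "no quorum".
mon_quorum_state() {
    if [ $(( $1 - $2 )) -ge "$(mon_quorum_needed "$1")" ]; then
        echo "quorum"
    else
        echo "no quorum"
    fi
}

mon_quorum_state 3 2   # prints "no quorum"
mon_quorum_state 4 2   # prints "no quorum"
mon_quorum_state 5 2   # prints "quorum"
```

Note that a fourth MON does not buy any extra tolerance over three: both setups survive the loss of only one monitor.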
 
Thanks, aaron!
 
And what will happen when one node is rebooted?
The cluster will begin to redistribute the lost PGs, but within a few minutes the node will be back up, with the same PGs still on it.
What will happen in this case?
 
Ceph waits for a few minutes before it marks an OSD that is down as out. If you know that the node will be down for longer and you don't want it to recover, you can set the "noout" flag.
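
For planned maintenance, the flag is set with `ceph osd set noout` and cleared with `ceph osd unset noout`. The wrapper below is only a sketch with a made-up name (`with_noout`): it sets the flag, runs whatever maintenance command you pass it, and clears the flag afterwards even if that command fails.

```shell
#!/bin/sh
# with_noout COMMAND [ARGS...]   (hypothetical helper)
# Runs COMMAND while the cluster-wide noout flag is set.
with_noout() {
    # While noout is set, down OSDs are not marked out, so no recovery starts.
    ceph osd set noout || return 1
    if "$@"; then
        status=0
    else
        status=$?
    fi
    # Always clear the flag again so automatic healing is not left disabled.
    ceph osd unset noout
    return "$status"
}
```

The wrapped command should only return once the node is back up. This helps with planned work only; it does nothing for unplanned reboots in the middle of the night.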
 
Yes, I read about noout.
But I mean a situation where something happens to the server at night and it restarts, maybe several times. Won't that lead to disaster?
 
If it happens to one out of the 4, what you will probably see is reduced redundancy, which is restored if the node is back up within a few minutes. If it takes longer, Ceph will start to recover to other nodes. If the node comes back, Ceph will rebalance the data in the cluster.

Depending on your failure scenario, that might repeat. The worst case is a higher load on your cluster, which might slow down the guests.

That is of course a thought experiment. In real life, situations are often more complicated: you might have another issue as well, or whatever is triggering these reboots might have other side effects.
 
Can anyone tell me what the difference is?
When we install Ceph in Proxmox, we select the number of replicas and the minimum size.
When we create a pool on Ceph, we also choose the size and minimum size of the pool, which corresponds to the number of replicas.
What is the difference?
Why do we specify what looks to me like the same settings twice?
 
In the initial config, you select the defaults. When creating a new pool, you could choose different sizes/min_sizes if the pool has different redundancy needs.

Please be aware that if you chose defaults other than 3/2, the GUI does not yet use them. There are patches on the way, so hopefully one of the next versions will show your specific defaults when you create a new pool via the GUI.
 
Thanks, Aaron!
So default values other than 3/2 are not picked up at the moment, and you have to set them manually when creating a pool, for example 4/2?
But if I use the default configuration, is everything OK?
 
If you set 4/2 during the initial configuration of Ceph and create a new pool via the Ceph CLI tooling, it will get that size/min_size. But when creating a new pool via the GUI, the dialog currently has the default 3/2 values hardcoded, so you will have to change them to 4/2 yourself. The linked bug report and the patches linked in there change the pool create window so that it queries the configured defaults and pre-populates the fields correctly.
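
If a pool was created through the GUI with 3/2 by accident, the values can be corrected afterwards with `ceph osd pool set`. A minimal sketch; the helper name `set_pool_replication` and the pool name in the usage note are placeholders, not something from this thread:

```shell
#!/bin/sh
# set_pool_replication POOL SIZE MIN_SIZE   (hypothetical helper)
# Applies size and min_size to an existing pool.
set_pool_replication() {
    pool=$1
    size=$2
    min_size=$3
    # A min_size larger than size can never be satisfied, so refuse it.
    if [ "$min_size" -gt "$size" ]; then
        echo "min_size ($min_size) must not exceed size ($size)" >&2
        return 1
    fi
    ceph osd pool set "$pool" size "$size" &&
        ceph osd pool set "$pool" min_size "$min_size"
}
```

For example, `set_pool_replication vm-pool 4 2` would bring a pool named vm-pool to 4/2. The result can be checked with `ceph osd pool get vm-pool size`.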
 
Thanks for the clarification! Now it's clear.
 
Hello everyone.
Can someone tell me about these Ceph parameters:
full_ratio
backfillfull_ratio
?
The nearfull_ratio parameter is needed for alerts when the mark is reached.
And what is backfillfull_ratio for?
And what happens when the OSD fills up to the full_ratio mark? Does data stop being written to it?
And what is the reserved space used for?

For example, full_ratio = 0.75 (only 75% of the disk space is used). Do I understand correctly that when an OSD or node fails, the missing replicas will be written to the remaining 25%? And will the Proxmox+Ceph cluster continue to work?
 
Some ratios will only trigger a warning. Some will cause certain actions to stop.

And what is backfillfull_ratio for?
Once an OSD has reached it, Ceph will no longer backfill data to that OSD.

And what happens when the OSD fills up to the full_ratio mark? Does data stop being written to it?
The pools using that OSD will halt IO in my experience.

The Ceph docs have more details:
https://docs.ceph.com/en/latest/rad...ighlight=mon+initial+members#storage-capacity
https://docs.ceph.com/en/quincy/rados/operations/pg-states/#placement-group-states


You can get the current settings for an OSD, for example with:
Code:
ceph config show-with-defaults osd.0 | grep full_ratio
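
Since backfill to an OSD stops at backfillfull_ratio, the data of a failed node has to fit on the surviving nodes below that mark, not in the space above it. A rough back-of-the-envelope sketch follows. The function name is made up, and it assumes equally sized nodes, evenly distributed data, and that all data of one failed node is re-replicated onto the remaining nodes.

```shell
#!/bin/sh
# max_safe_utilization NODES RATIO   (hypothetical helper)
# Highest cluster-wide fill level before a failure that still lets the
# remaining NODES-1 nodes absorb the lost node's data without exceeding RATIO.
max_safe_utilization() {
    awk -v n="$1" -v r="$2" 'BEGIN {
        if (n < 2) exit 1              # with one node there is nowhere to recover to
        printf "%.4f\n", r * (n - 1) / n
    }'
}

max_safe_utilization 4 0.90   # default backfillfull_ratio, 4 nodes: prints 0.6750
max_safe_utilization 4 0.75   # backfillfull_ratio lowered to 0.75:  prints 0.5625
```

Real clusters are rarely this uniform, so treat the result as an upper bound rather than a target.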
 
Why do the values shown by ceph config show-with-defaults osd.0 differ from those shown by ceph osd dump?
I used these commands to set the values I needed:
ceph osd set-nearfull-ratio
ceph osd set-full-ratio


root@tvr-pve-04:~# ceph config show-with-defaults osd.0 | grep full_ratio
mon_osd_backfillfull_ratio 0.900000 default
mon_osd_full_ratio 0.950000 default
mon_osd_nearfull_ratio 0.850000 default
osd_failsafe_full_ratio 0.970000 default
osd_pool_default_cache_target_full_ratio 0.800000 default
root@tvr-pve-04:~# ceph config show-with-defaults osd.6 | grep full_ratio
mon_osd_backfillfull_ratio 0.900000 default
mon_osd_full_ratio 0.950000 default
mon_osd_nearfull_ratio 0.850000 default
osd_failsafe_full_ratio 0.970000 default
osd_pool_default_cache_target_full_ratio 0.800000 default
root@tvr-pve-04:~# ceph osd dump | grep full_ratio
full_ratio 0.8
backfillfull_ratio 0.75
nearfull_ratio 0.7
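
A likely explanation, going by the Ceph documentation for the mon_osd_*_ratio options: those options are only applied when the cluster is created, whereas ceph osd set-nearfull-ratio, ceph osd set-backfillfull-ratio and ceph osd set-full-ratio change the values stored in the OSDMap, which is what ceph osd dump prints. If that holds, the ceph osd dump values are the ones in effect. A small sketch for reading one of them in a script; `osdmap_ratio` is a made-up helper name:

```shell
#!/bin/sh
# osdmap_ratio NAME   (hypothetical helper)
# Reads `ceph osd dump` output on stdin and prints the value of the ratio NAME,
# e.g. full_ratio, backfillfull_ratio or nearfull_ratio.
# The exit status is non-zero if no such line is found.
osdmap_ratio() {
    awk -v key="$1" '$1 == key { print $2; found = 1 } END { exit !found }'
}
```

On a cluster node it would be used as `ceph osd dump | osdmap_ratio backfillfull_ratio`.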