Do I need to keep free disk space in the Ceph pool in case of an OSD or server failure?

emptness

Hello!
I'm just starting to study Ceph and have run into a problem: I can't find exact information about this anywhere.

When planning disk space, should I always keep free space in the pool so that, if one server fails, its PGs can be replicated to the remaining servers? Or does Ceph already reserve space for the whole pool based on the replication factor?

For example, say I have 4 servers with 4 OSDs of 10 TB and a replication factor of 3 (so the pool will be 40 TB).
If I use all of the disk space (all 40 TB), will the pool completely degrade when one or even two servers fail, because there is nowhere left to redistribute the PGs?
Or is space for this case already reserved by the system, so that I can use all 40 TB?
 
Ceph will give you a warning as soon as one OSD reaches 85% used capacity. It will stop writing to an OSD at 95% used capacity.

In very small clusters this 15% of headroom may not be enough.

E.g. if you only have three nodes with two OSDs each, you would have to keep at least 50% spare capacity on the OSDs, because the data from one failed OSD would be backfilled onto the other one.

If you have four nodes, every node would have to keep 25% of its capacity unused to be able to backfill the data from one lost node.

The formula is basically: spare_capacity_per_node = 100% / (number of nodes).

The 15% warning threshold is only adequate for clusters with 7 nodes or more.
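As a rough sketch (not an official Ceph formula; it assumes equally sized and equally filled nodes and OSDs), the rule of thumb above can be reproduced like this:

```python
def spare_fraction(n: int) -> float:
    """Fraction of capacity to keep free so that one failed member's
    data fits on the n - 1 remaining members. Applied per node for a
    node failure, or per OSD within a node when a failed OSD can only
    be recovered onto the other OSDs of the same node."""
    return 1.0 / n

# Node level: how much of each node to keep free for a node failure.
for nodes in (3, 4, 7):
    print(f"{nodes} nodes -> keep {spare_fraction(nodes):.0%} free per node")

# OSD level inside one node (e.g. 3 nodes with 2 OSDs each, size = 3):
print(f"2 OSDs per node -> keep {spare_fraction(2):.0%} free per OSD")
```

With 7 or more nodes the result drops to roughly 14%, i.e. below the 15% of headroom that the default 85% nearfull warning already leaves.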
 
Gurubert, so instead of the 40 TB pool configured as I described above, I can only use 30 TB, and 10 TB must always stay free?
That is a lot of overhead for Ceph!?
Can I somehow limit the space used on the OSDs for my configuration, to prevent a situation where there is nowhere left to replicate data after a failure?
 
I have 4 servers with 4 OSDs of 10 TB and a replication factor of 3 (so the pool will be 40 TB).
Is this one 10TB OSD in each server or 4 10TB OSDs in each?

With replication factor 3 you only get 33% of the raw capacity, minus the spare capacity for disaster recovery.

E.g. 4 10 TB OSDs give you 40 TB raw capacity. Subtracting 25% spare capacity for recovery leaves 30 TB. Dividing that by 3 copies gives you 10 TB usable capacity.
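A small sketch of this arithmetic, assuming equally sized OSDs and nodes and that 1/(number of nodes) of the raw capacity is kept free for recovery, as described above:

```python
def usable_tb(num_nodes: int, osds_per_node: int, osd_tb: float,
              replicas: int = 3) -> float:
    """Rough usable capacity: raw space, minus the spare kept free for
    recovering one failed node, divided by the number of replicas."""
    raw = num_nodes * osds_per_node * osd_tb
    spare = raw / num_nodes   # keep 1/N of the raw capacity free
    return (raw - spare) / replicas

print(usable_tb(4, 1, 10))   # one 10 TB OSD per node   -> 10.0 TB
print(usable_tb(4, 4, 10))   # four 10 TB OSDs per node -> 40.0 TB
```

The second call corresponds to the setup clarified later in the thread (4 servers with four 10 TB disks each), which is where the 40 TB figure further down comes from.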
 
4 servers, each with 4 disks installed; each disk is 10 TB.

Sorry, I still don't quite understand.
When a pool is created, is its space allocated to the maximum, and do I then have to make sure that 25% of it always stays free?

And what replication factor would you recommend for such a configuration?
So that the system remains operational even if 2 servers fail, if that is possible.
 
If you lose one host, the object copies from that host have to be recovered on the other hosts.
That means you need 25% free capacity on all hosts.
The failed host is filled to at most 75%; spread across the 3 remaining hosts, that is 25% extra on each of them.

With 4 OSDs in each host you get 40TB usable capacity in the end.

With three replicas you may sustain the failure of two hosts, but you would need to reduce min_size of the pool to 1 in the worst case.
 

Is the number of replicas set by the pool size setting (as in the attached screenshot)?

[screenshot attached]

But I still can't understand one thing.
If we have three copies of data stored on FOUR servers (4x40TB), why will there be only 40TB of available space in such a pool?
I'm very sorry, I'm completely confused.
 
but you would need to reduce min_size of the pool to 1 in the worst case.
Please do NOT do that! This increases the chances of data loss significantly!

The other thing you need to consider in small clusters where the number of nodes equals the pool size (in most situations that would be 3-node clusters) is the following:

How much space do you need available in each OSD in a single node, in case one OSD / disk fails?

Because Ceph can still recover those replicas on the remaining OSDs in the same node. Therefore, having more but smaller OSDs is a good idea.
For example: if you only have 2 OSDs per node and one of them fails, Ceph cannot recover those replicas on other nodes, as the other 2 nodes already hold the other 2 replicas. But it can recover them on the remaining OSD in the same node. This can easily lead to an OSD that is too full, unless the OSDs were rather empty to begin with.
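A back-of-the-envelope sketch of this effect, assuming equally sized and equally filled OSDs and using the default 85% nearfull / 95% full ratios mentioned earlier (the numbers are only illustrative):

```python
def fill_after_osd_loss(osds_per_node: int, fill: float) -> float:
    """Approximate per-OSD usage after one OSD in a node fails and its
    data is recovered onto the node's remaining OSDs."""
    return fill * osds_per_node / (osds_per_node - 1)

for osds in (2, 4):
    for fill in (0.5, 0.7):
        after = fill_after_osd_loss(osds, fill)
        state = "full!" if after >= 0.95 else "nearfull" if after >= 0.85 else "ok"
        print(f"{osds} OSDs/node at {fill:.0%} used -> {after:.0%} ({state})")
```

With only 2 OSDs per node, even 50% utilisation leaves the surviving OSD completely full after recovery; with 4 smaller OSDs the same data spreads over three survivors instead.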
 
4 x 40TB is 160TB raw capacity.
Subtract 25% from that for recovery space and you get 120TB.
Divided by 3 copies you get 40TB usable space for your data.
Gurubert, thank you so much for the detailed explanation!
Finally, I have a complete picture of the distribution of disk space.

Please tell me, how will Ceph behave with this configuration when two servers fail?
There will be no free space left on the remaining two servers to restore three replicas.
Will the VMs on the cluster remain available?
 
If two out of four nodes have failed and a new node has been added, some placement groups may exist with only one copy.
For Ceph to be able to recover, you have to set min_size = 1 in this situation. After all copies have been recovered, set min_size = 2 again.
In such a situation, will I have to set min_size = 1 before adding the third node back into the cluster, then wait for Ceph to rebalance and set the value back to 2?
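The sequence gurubert describes could be scripted roughly like this. It is only a sketch: the pool name is a placeholder, it assumes the `ceph` CLI with admin credentials on one of the nodes, and (per aaron's warning above) min_size = 1 should only ever be a temporary, last-resort setting while the lost copies are being recovered.

```python
import subprocess
import time

POOL = "vm-pool"        # placeholder pool name
CHECK_EVERY = 60        # seconds between health checks

def ceph(*args: str) -> str:
    """Run a ceph CLI command and return its output."""
    return subprocess.run(["ceph", *args], check=True,
                          capture_output=True, text=True).stdout.strip()

# Temporarily allow I/O and recovery for PGs with a single remaining copy.
ceph("osd", "pool", "set", POOL, "min_size", "1")

# Wait until the cluster reports itself healthy again, i.e. the missing
# copies have been recovered onto the remaining/new nodes.
while not ceph("health").startswith("HEALTH_OK"):
    time.sleep(CHECK_EVERY)

# Restore the safer default: writes require at least two copies again.
ceph("osd", "pool", "set", POOL, "min_size", "2")
```

In practice you would watch `ceph status` yourself rather than rely on a blanket HEALTH_OK check, since unrelated warnings would keep the loop waiting.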
 
There will be no free space left on the remaining two servers to restore three replicas.
Will the VMs on the cluster remain available?
This is not a question of free capacity but one of placement. With only two nodes left there is no location for the third copy on another node.

The VMs will suffer as objects with only one copy left go into read-only mode. That translates into write errors on VM disk images.
 
This is part of why you should have more CEPH nodes than the total number of replicas.

The sane minimum PVE configuration, regardless of storage, is 4 nodes. With four nodes, you can take one down for maintenance, lose another to a failure, and still run 50% of your workload. The non-production or non-HA (non-critical) minimum should be 3 nodes, which is why PVE lets you do that by default.

With a 3-node CEPH cluster, the only way to avoid write failures during a node loss is to have 2 replicas, which is NOT recommended.

CEPH is not a toy: it needs lots of disks, lots of CPU and RAM, and many nodes or racks for fault tolerance. I learned this the hard way maintaining two 4-node clusters a couple of years ago. I don't hyperconverge without 6 to 8 nodes now for this reason. It's risky!

Cheers, and read what Aaron said carefully!!


Tmanok
 
You are telling me sad things. :(
Our configuration is:
2x Intel Xeon Gold 6336Y,
512 GB RAM,
dual 100 Gbit/s NIC,
2x SSD 480 GB (OS),
4x HDD 10 TB (pool1),
3x Samsung SSD 7.68 TB SAS3 12 Gbit/s, TLC, 2100/2000 MB/s, 400k/90k IOPS, 1 DWPD, MZILT7T6HALA (pool2),
2x Intel NVMe SSD 3.2 TB, TLC, 3200/3050 MB/s, 638k/222k IOPS, 21.85 PBW, SSDPE2KE032T801 (cache).
4 servers.
We plan to deploy the Proxmox VE + Ceph cluster on them.
Is this configuration really that bad?
 
Ceph works better the more resources you give it. Therefore, for smaller clusters, you need to look more closely at the edge cases and plan accordingly to reduce potential problems, mainly around the number of OSDs and how Ceph will (try to) recover in different failure scenarios.

The hardware looks decent and with a cluster of 4 nodes it should work well. Just keep in mind not to fill the cluster too much. Running out of space is one of the few things that are hard/painful to recover from.
The larger the cluster, the less a single node/OSD feels the impact of a node or OSD going down, as the recovered data can be spread more widely across the remaining cluster.

That is my experience from what I have seen over the past years here in the forums and in our paid support.

If you do run Ceph on Proxmox VE and also use HA, make sure that Corosync has at least one dedicated network for itself (1 Gbit should be fine) and more networks configured as fallback.
Otherwise, if it shares the physical network link with other services, like Ceph, backups, etc., it can happen that the other services take up all the bandwidth, causing higher latency for the Corosync packets. If there is no better fallback, it can happen that the Corosync connection between the Proxmox VE nodes falls apart. That is not too much of an issue without HA enabled, but with HA enabled, if the connection stays problematic for longer than a minute, your nodes will fence themselves, hoping that there is still a quorate part of the cluster that will start the HA guests.
 
I used this calculator to calculate the usable disk space.
https://florian.ca/ceph-calculator/
For my configuration, it recommends keeping 25% of the disk space free in case of a failure.

Some people advise disabling backfill and enabling it only after a failure, and then turning it off again.
What do you think about this? Is it a bad idea?

We have only 2 network cards.
The 10 Gbit network is the working LAN for user connections to the VMs and to the cluster's web interface.
The 100 Gbit network is for Ceph.
Is it better to run the cluster network over the network through which users will connect?
Or do we need a third network card with this configuration?
 
I would use the 100 Gbit link for all Ceph traffic. Ceph knows two networks: the mandatory public network and the optional cluster network.
The VMs will use the public network to access their disk images. The cluster network can be useful if your public network is saturated. You can then place the optional cluster network on a different physical network to take load away from the public network. Both should be fast!

For Corosync, ideally add another network card for it; 1 Gbit is enough. Corosync can be configured with up to 8 networks. If one network has a problem, it can switch to the other configured networks. Therefore, even if you have a dedicated Corosync network card, also configure the other networks for Corosync. This can save you should there ever be a problem with the main Corosync network. :)
 
Thank you, Aaron!
We decided not to install another network card; we will allocate a separate VLAN on the switches and configure CoS to guarantee bandwidth for the cluster network. Wouldn't that work too?

Can anyone say something about the practice of disabling backfill?
 
