How to restore services on a Proxmox + Ceph cluster when 2 servers fail?

emptness

Greetings!
I have a cluster of 4 Proxmox servers + Ceph with replication factor 3.
If 2 servers fail, the cluster will continue to work on the remaining 2. That part is understandable.
But how will Ceph behave? There is a high probability that some PGs will be left with only a single copy.
Will writes then be blocked and the service (the virtual machine) become unavailable?
How can I restore the availability of the service in such a failure scenario?
 
The VM will get I/O errors on its virtual disk because some blocks are not writable any more.

If 2 of the 3 MONs of the Ceph cluster are also down, because they were running on the failed nodes, the Ceph cluster will stop working as well, since the single remaining MON has no quorum.

If you provisioned a MON on each of the 4 nodes, the situation is the same: 2 out of 4 MONs do not have quorum.
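
A quick way to check the MON quorum state from any node (these are the standard Ceph commands, nothing specific to this scenario):

# show the current MONs and which of them form the quorum
ceph mon stat

# more detail: quorum members, leader, election epoch
ceph quorum_status --format json-pretty

# note: if the MONs have already lost quorum, these commands will hang
# until a majority of MONs is reachable again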
 
With a 4-node cluster, when 2 nodes fail you will lose quorum! So, the cluster will not continue to work normally. VMs will keep running, but any action that requires quorum will be blocked (starting / stopping VMs for instance).

If you have HA activated, then those nodes will additionally fence [1] themselves - since they are not in a quorate part of the cluster - this is something you have to be wary of.

Ceph itself will block I/O on all objects that have fewer replicas than the min_size (usually 2). That means I/O will be blocked and VMs will become unavailable. After a while though, Ceph will start to rebalance and create new replicas of the data on the other, still available, nodes. Something you also have to consider in this case is the amount of free space, since it will need to accommodate the new data. If you do not have enough space left on the remaining OSDs, then this can also lead to problems in that case.

edit: @gurubert also raised a very valid point that the Ceph cluster will stop working when 2 monitors are down - this will of course inhibit the rebalancing I described above and the situation will need manual intervention.

[1] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#ha_manager_fencing
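
For the quorum and free-space points above, the standard checks look roughly like this (run on any surviving node):

# Proxmox VE cluster membership and vote count
pvecm status

# overall Ceph health and per-pool usage
ceph -s
ceph df

# per-OSD fill level, to judge whether the re-created replicas will still fit
ceph osd df tree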
 
Just to add: if you want to be able to lose 2 servers at the same time, you could set size=4 and min_size=2, as sketched below.
You still need 5 votes for quorum though, so a fifth server would be nice (it does not need OSDs).
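
A minimal sketch, assuming the pool is called vm-pool (a placeholder, use your actual pool name):

# check the current replication settings of the pool
ceph osd pool get vm-pool size
ceph osd pool get vm-pool min_size

# raise the replica count to 4 while keeping min_size at 2
ceph osd pool set vm-pool size 4
ceph osd pool set vm-pool min_size 2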
 
Thanks a lot to everyone for the answers!
Is it necessary to add a 5th server for quorum? In previous versions of Proxmox it was possible to use a "file" witness; can that not be done any more?
Or will it just work much worse?
Or, when using Ceph, do I need to deploy a 5th monitor somewhere anyway?

shanreich

You write that the services will not be available until Ceph restores the number of replicas.
Do you mean all 3 replicas? Or will everything work again once the 2nd copy of each PG has been created?
 
The additional server is for Ceph, not Proxmox VE. You will need 5 MONs in order to survive the loss of two. -> small Proxmox VE node with a Ceph MON on it.

If you run the pools with size/min_size 3/2 and lose two nodes, chances are high that some PGs will have lost two replicas. Until Ceph is able to restore at least a second replica on the remaining nodes, the affected pools will be IO blocked.
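
When PGs drop below min_size you will see it in the health output, for example:

# list the current problems, including undersized/inactive PGs
ceph health detail

# show only the PGs that are stuck in the undersized state
ceph pg dump_stuck undersized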
 
Thanks!
And tell me, is there a chance of losing data in this scenario (if only 2 servers remain)?
Or, as long as there is at least 1 replica, will the 2nd and 3rd be recovered from it?
 
So, 4 nodes with OSDs, and the pool is using a size/min_size of 3/2.

2 nodes die. Some PGs will only have one replica left. So far, so good. Make sure there is enough free space on the remaining OSDs so that Ceph can restore the second replica on either node.

While the pool is IO blocked, the VMs won't be able to access their disks, so they might crash or switch to read-only.

The data is still there in the one replica. The only way to lose data is if Murphy's Law hits you and the OSD/disk that holds the single replica dies before it could be recreated on the other node.

But due to min_size 2 (the pool only accepts IO with at least two replicas present), the data won't be touched, and once the dead nodes are back online you should be fine again.
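
If you want to see which OSDs currently hold the copies of a particular PG, you can query the mapping directly (2.1f below is just a placeholder PG id):

# show the up/acting OSD set for one PG
ceph pg map 2.1f

# follow the cluster log and recovery progress while the missing replicas are recreated
ceph -w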
 
Aaron, thanks again so much for your answers! Everything is clear to me now.
Do I need to take any action on my part in such a situation?
Will Ceph understand that there are only 2 hosts left and that the 3rd replica does not need to be restored? Or will Ceph still try to create a 3rd replica on the remaining 2 servers?
 

No, Ceph won't do anything: it makes no sense to create another copy on one of the remaining servers, as each of them already has a copy. If you then lose a disk on one of the remaining two servers, Ceph may recreate the lost data on the remaining disks of that node (the one with the failed disk). But you will have storage IO blocked, because losing a disk in a size=3 setup with only 2 servers remaining means min_size is no longer reached for all objects that were stored on the failed disk.
 
So, if 2 servers fail, Ceph will create the necessary second replicas, and the services (VMs) will become available again?
 

No, my point even applies when only one node fails plus one disk on one of the remaining servers. Ceph also recovers data when all nodes are up and a disk fails, because the data that was on that disk needs to be recreated on the same host. That applies when one node is down, but also when all nodes are up: if an OSD fails, the data on it is simply gone, so Ceph recreates it no matter what.

When all nodes are up, there is no downtime when you lose a disk on one of the nodes. But there too, Ceph will make sure that the host with the failed disk gets all its PGs healthy again, i.e. it recreates the lost data on the same server so that 3 replicas are available again.

Edit: this can also become very dangerous if you lose too many disks over time and don't notice it! Because of that recreation of objects you can potentially lose 3 of 4 disks one after another, since the data always gets recreated. This means that after each OSD failure the remaining OSDs have a higher %usage... BUT: if one disk reaches 90-95% usage, your cluster will go down, because Ceph blocks storage access when an OSD gets full.
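
The 90-95% thresholds mentioned above are the standard nearfull/backfillfull/full ratios; you can see the values your cluster actually uses in the OSD map:

# defaults are nearfull 0.85, backfillfull 0.90, full 0.95
ceph osd dump | grep -i ratio

# keep an eye on per-OSD usage while recovery is running
ceph osd df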
 
I meant that if only 2 servers remain, no more than two replicas will be created on them.
And at the same time, once there are 2 copies of all PGs, the virtual machines will become available again.
Did I understand you correctly?
 

Yes, when there are two servers remaining, Ceph does NOT create the third copy until the THIRD server is online again.
If there are two copies of the data in your PGs, then VMs can be started and are available again (min_size 2 is reached).

@aaron will VMs automatically start when a PG was undersized before (only one replica available) and the objects then get their second copy back, so that only the third replica is missing? I'm not sure, as the VM might still be online if the host was up but the PG was undersized (a disk failure, for example).
 
OK!
Thank you, jsterr!

I apologize for going off-topic. Can someone tell me whether the Ceph settings in Proxmox are optimal in the general case?
There are many articles on Ceph optimization (and many old ones); is it worth changing something, or is it better to leave everything at the defaults?
 
I would personally leave everything at the defaults; I have tried lots of things and it is very rare that a parameter gives you a real benefit. The key to good Ceph performance is a low-latency network (25 Gbit+), direct cabling if you don't want to spend money on switches (https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server), and low-latency disks (NVMes).

Make sure your network bandwidth can actually be fully used (run iperf between the servers) and you are good to go.

Edit: a meshed setup with 4 nodes will be hard, as you need a minimum of 3 Ceph ports per node. You can also go with 10 Gbit/s and SATA SSDs.
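
For the iperf check, something like this between two nodes is enough (10.10.10.x are placeholder addresses on the Ceph network):

# on the first node: start an iperf3 server
iperf3 -s

# on the second node: run a 30 second test with 4 parallel streams against it
iperf3 -c 10.10.10.1 -t 30 -P 4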
 
