Ceph is not configured to be really HA

mgiammarco

Hello,
I have a three-node Proxmox/Ceph cluster. Each node has two NVMe drives as Ceph OSDs. Replication is 3. The network is 40 Gb.
The OSDs are 60% full.
Now one OSD breaks. Considering I have followed all the guidelines (replica 3, min 2, fast network and so on), I expected to have no problems.
But on the server with one disk broken and only one still working, Ceph starts filling the remaining disk with the third replica.
Now that OSD is full and ALL of Ceph is down (it does not accept writes, VMs are blocked and so on).
I think this is unacceptable behaviour: in a fully redundant cluster, one broken disk brings the whole cluster to its knees.
What can I do?
Thanks,
Mario
 
Give more resources to Ceph -> smaller but more disks per node. That way, if you lose a single OSD, Ceph can distribute that data better over the remaining OSDs in the node, and there is a much smaller chance of the remaining OSDs becoming too full.
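
To put rough numbers on it (taking the ~60% utilization from your post and Ceph's default full ratio of 95% as assumptions): with only two OSDs per node and a host-level failure domain, the third replica of everything that was on the failed OSD can only be recreated on the other OSD of the same node. That OSD would then have to hold roughly 60% + 60% = 120% of its capacity, so it hits the full ratio and blocks writes long before recovery finishes. With, say, four equally sized OSDs per node at 60%, the lost OSD's data spreads over the three survivors: 60% + 60%/3 = 80%, which stays below the full ratio.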

A 3-node cluster is a bit of a special case: it is just enough to maintain 3 replicas, but those 3 nodes are the only way to get up to 3 replicas while following the CRUSH rule that distributes the replicas at the node level. If you had more nodes, there would be more options for distributing the data over the nodes.
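
If you want to verify which failure domain your pool's rule uses, something like this should show it (the rule name "replicated_rule" is the default, yours may differ):

ceph osd crush rule dump replicated_rule

Look for the "chooseleaf ... type host" step: that is what forces one replica per node and, with only 3 nodes, leaves the node with the failed OSD as the only possible home for the third copy.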
 
Sorry, but the reply cannot be "buy more disks", because:
1) I have already followed the "best practices" (replica 3, fast networks, disks only 60% full); you cannot change the requirements each time I find a problem
2) with replica 3 the cost of the data is already 3x; you cannot tell me to increase it to 6x or more "just in case"
3) if I have the same situation with 3 servers that have only 3 disks in total (one per server) and one disk breaks, in that case everything goes right: Ceph has two disks left to keep the data, it has no spare disks to redistribute the replicas to, so it does nothing and HA works
Ceph should understand that if you have a minimum replica of 2, it is better for HA to keep 2 replicas than to try to reach 3 replicas, filling the remaining disks and stopping the whole system.
Otherwise I cannot suggest that my customers use a three-node cluster with replica 3, because it is not HA at all.
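
For reference, these are the settings I am talking about (the pool name is just a placeholder, yours will differ):

ceph osd pool get <pool> size       # 3 on my pools
ceph osd pool get <pool> min_size   # 2 on my pools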
 
In order to remain HA, Ceph requires you to supply enough “spare” resources to absorb a failure. You need enough free disk space on each host to absorb the loss of your largest OSD on that host. Further, in a cluster with replica 3, you really should have at least 4 hosts in order to absorb the failure of an entire host (better to have 5, but that is a different story).
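
To check whether you actually have that headroom, something along these lines shows per-OSD and per-host utilization (output layout varies a bit between Ceph releases):

ceph osd df tree    # %USE per OSD, grouped by host
ceph df             # overall raw and per-pool usage

Each host needs enough free space on its remaining OSDs to take over the data of its largest OSD without any of them crossing the full ratio.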

It’s not that Ceph isn’t HA - it’s that the configuration you are running does not meet the requirements to provide HA services using Ceph.
 
I strongly disagree; you do not get the point. With three servers and replica 3, it should be obvious that the failure of an entire server leaves me with two servers and two copies of the data, which is more than enough to maintain HA status. And that is the purpose of HA: it allows for one failure.
So an ENTIRE server failure does not stop HA: both disks of a server can fail and the system still runs fine. Yet one disk fails and the system with replica 3 crashes.
You don't even try to understand the problem; you are only able to say that I have not followed requirements or guidelines. Where are these guidelines/requirements/best practices? Please note that I have read all of them and, btw, 4 hosts is a wrong suggestion because the number of servers should be odd.
Nobody has tried to give me a real suggestion, like disabling reconstruction, or at least to admit the problem.
BTW: it is not good marketing practice to say "This sounds like a fundamental misunderstanding of Ceph, which is not a Proxmox product" when Ceph is installable with one click under Proxmox and completely integrated. Next time will I learn that QEMU is not part of Proxmox?
 
but Ceph is not so simple that you can just count the number of hosts or so.
and I doubt that there is any source that says "odd number of servers" for Ceph.

but that's all IIRC.
 
If you want to manually handle situations where a single OSD in a node fails, you could set the OSD flag "norecover". This way, Ceph will not start to recover lost PGs when it would otherwise be possible, for example when only one of the OSDs in a node has failed and there are other OSDs still available where the 3rd replicas could be recreated.
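
On the command line that would be roughly:

ceph osd set norecover      # stop automatic recovery of degraded PGs
ceph osd unset norecover    # allow recovery again, e.g. after the failed OSD has been replaced

You can see which flags are currently set in the output of ceph status, or with ceph osd dump | grep flags.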

Be careful with that though, because you are disabling Ceph's self-healing capabilities! If you do lose an OSD and replace it, you will of course need to unset that flag again in order for Ceph to recover the PGs.

This is not something I recommend, and I cannot speak to possible other side effects, since I have not tested this in a production environment for any length of time.

Again, one thing everyone needs to be aware of is that the more resources you give Ceph (nodes, OSDs, ...), the easier it is for Ceph to act upon failures. Therefore, instead of a few large entities (nodes, OSDs, ...), I'd rather go with more but smaller ones.
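
Just as an illustration of what "more but smaller" looks like in practice on Proxmox VE (device name is an example only): adding each disk as its own OSD is a one-liner per device,

pveceph osd create /dev/nvme2n1

so going from 2 large NVMe drives per node to, say, 4 smaller ones does not add much operational overhead.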
 