Ceph: Behavior When a Node Fails

Nov 10, 2023
Good day,

I have a question about how Ceph behaves when a node fails.

Scenario:
  • 3+ nodes
  • Ceph in a 3/2 configuration
  • the Ceph storage, including CephFS, is more than 75% full
When a node suddenly fails, Ceph starts redistributing or restoring the PGs
in order to maintain the desired number of replicas.

What happens if there is not enough space on the OSDs to accommodate all the PGs and the
remaining OSDs fill up completely?
Does Ceph simply remain in a degraded state but stay available, or is it paused until more
space becomes available?

Thanks and regards
Björn
 
In the event of a sudden failure of a node, Ceph starts to redistribute or restore the PGs
in order to maintain the desired number of replicas.
Not with only three nodes. There is nowhere for the data to be redistributed to.

What happens if there is not enough space on the OSDs to accommodate all PGs and the
remaining OSDs are completely filled?
This would be a problem with all three nodes as well. Each OSD is monitored against a high-water mark. OSDs that reach the high-water mark become read-only. By the time this happens, the pool high-water mark will already have been reached and the whole pool goes read-only. You will receive warnings long before this happens.
Will Ceph simply remain in a degraded state, but still available, or will it be paused until more space is available again?
Two nodes can continue to provide full service, but a full pool becomes read-only until more space is available AND rebalancing leaves enough OSDs with free space.
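
To make those thresholds concrete, here is a rough Python sketch (0.85 nearfull and 0.95 full are Ceph's default ratios; the per-OSD utilisation figures are just made-up examples) of how one OSD crossing the full ratio is enough to stop writes to the pool:

Code:
# Sketch of how nearfull/full ratios gate writes.
# 0.85 and 0.95 are Ceph's default nearfull/full ratios;
# the utilisation numbers below are invented for illustration.
NEARFULL_RATIO = 0.85   # health warning
FULL_RATIO = 0.95       # writes to the pool are blocked

osd_utilisation = {     # used fraction of each OSD's capacity
    "osd.0": 0.78,
    "osd.1": 0.88,
    "osd.2": 0.96,
}

for osd, used in sorted(osd_utilisation.items()):
    if used >= FULL_RATIO:
        print(f"{osd}: FULL at {used:.0%} - pool stops accepting writes")
    elif used >= NEARFULL_RATIO:
        print(f"{osd}: nearfull at {used:.0%} - HEALTH_WARN")
    else:
        print(f"{osd}: ok at {used:.0%}")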
 
When a node suddenly fails...
Then bad things happen (sooner or later). In any case, the Ceph pool is immediately degraded - and stays that way permanently. There is no spare node that could take the data (size=3) --> no automatic repair/self-healing is possible. That is why you would rather have something like five nodes, each with several (four or more!) OSDs...

This is the English area; some other pitfalls I found during my "year with Ceph": https://forum.proxmox.com/threads/fabu-can-i-use-ceph-in-a-_very_-small-cluster.159671/

Tl;dr: Ceph needs a lot more resources than the absolute minimum to fulfill its promises.

Good luck :-)
 

Hi.
Sorry, I posted my question in the wrong (English) area. Again.

What happens if there are more than three nodes? Will Ceph try to rebalance the PGs until the OSDs are full and become read-only?
Or does Ceph stop rebalancing at some point and leave the pool degraded but writable, like during maintenance when the "noout" flag is set?

Or do we have to reserve extra space for such a scenario, in addition to the space we need for the three replicas?
 
It's worth revisiting what Ceph is and how it works.

Ceph is software-defined storage, which is to say there is an algorithm and rules. In a normal virtualization workload the pool rules look like this:

Replicated size 3 shards (members) per PG (placement group), with a minimum of 2 required for operation. Each shard MUST be placed on a different host from the others to satisfy the rule. Given these parameters, you can see that you need AT LEAST 4 nodes to sustain a node fault AND self-healing. As for capacity, the way to look at it is that you need to define your "FULL" ratio in such a way that a node fault does not take the storage out, or

(N-1)/N * 0.8

where N is the number of nodes, assuming all nodes have the same total OSD capacity. You multiply by 0.8 to leave headroom below the OSD full ratio.

But wait, I hear you say, that means I can only use 60% of my capacity! It's worse than that, my friend: remember each write is triplicated, so in practice that means 20% of your raw capacity, not 60%. And that's just the price of doing business; all that disk capacity is there so you have business continuity in every fault condition up to a full node out.
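
To put numbers on that, here is a quick Python sketch of the same arithmetic (node count and per-node capacity are example values only):

Code:
# Rough capacity estimate for a replicated size=3 pool, using the
# (N-1)/N * 0.8 rule of thumb from above. Example values, not advice.
nodes = 4                  # N: number of equal-sized nodes
raw_per_node_tb = 10.0     # total OSD capacity per node, in TB
replicas = 3               # replicated size 3
headroom = 0.8             # stay well below the OSD full ratio

raw_total_tb = nodes * raw_per_node_tb
# Fraction of raw space you may fill and still absorb one failed node:
fill_limit = (nodes - 1) / nodes * headroom
# Divide by the replica count to get actual payload data:
usable_data_tb = raw_total_tb * fill_limit / replicas

print(f"raw capacity:   {raw_total_tb:.1f} TB")
print(f"fill limit:     {fill_limit:.0%} of raw")
print(f"usable payload: {usable_data_tb:.1f} TB "
      f"({usable_data_tb / raw_total_tb:.0%} of raw)")
# With 4 nodes this prints a 60% fill limit and 20% usable payload,
# matching the figures above.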
 