Three-Cluster-Node + Ceph

May 4, 2021
Hello,

I have a small test setup consisting of three physical nodes, each equipped with four SSDs and 10 Gbit Ethernet. I would like to test the viability of Ceph as a storage backend for the HA clusters we are running for several customers. Currently we are using a SAN backend with a shared LVM exposed over iSCSI. I want to test two failure scenarios: 1) a node goes down, 2) a single SSD goes down. Scenario 2) does not seem to be a problem: Ceph recognizes the loss and starts rebalancing. I had to remove the failed OSD from the CRUSH map and add the replacement SSD as a new OSD, but in general everything worked as expected.
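
For reference, the replacement roughly boiled down to the following commands; the OSD ID (7) and the device path (/dev/sdX) are just placeholders for my actual values:

# mark the failed OSD out and stop its daemon (if it is still running)
ceph osd out 7
systemctl stop ceph-osd@7

# remove it from the CRUSH map, delete its auth key and the OSD entry
ceph osd crush remove osd.7
ceph auth del osd.7
ceph osd rm 7

# wipe the replacement SSD and create a new OSD on it with the Proxmox VE tooling
ceph-volume lvm zap /dev/sdX --destroy
pveceph osd create /dev/sdX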

Scenario 1) is a bit more difficult. I powered off the server abruptly - no regular shutdown procedure - and after a few minutes the HA VMs from that server started on the other nodes in the cluster. Now Ceph is all red in the dashboard, but it seems I can still write to it - I tested writing to the block devices with dd inside the VMs and I downloaded a new ISO into CephFS. Still, there is no rebalancing in progress, presumably because Ceph cannot place three replicas on different cluster nodes, as initially configured, with only two nodes left. If this assumption is correct, I do not understand why Ceph is still writable - shouldn't it go into read-only mode in this condition?
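
This is roughly how I tested that writes still go through; the file path inside the VM is just an example:

# inside a test VM whose disk lives on the Ceph RBD pool
dd if=/dev/urandom of=/root/ceph-write-test.img bs=4M count=256 oflag=direct status=progress

# on one of the surviving nodes, watch the client IO and PG states while writing
watch ceph -s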

Output of ceph status, anonymized:

ceph status
  cluster:
    id:     7f3e6609-43f4-4e6d-a6aa-e4badcbccb9a
    health: HEALTH_WARN
            1/3 mons down, quorum freya,odin
            Degraded data redundancy: 45475/136425 objects degraded (33.333%), 90 pgs degraded, 97 pgs undersized

  services:
    mon: 3 daemons, quorum freya,odin (age 23m), out of quorum: thor
    mgr: freya(active, since 23m), standbys: odin
    mds: 1/1 daemons up, 1 standby
    osd: 12 osds: 8 up (since 23m), 8 in (since 13m)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 97 pgs
    objects: 45.48k objects, 172 GiB
    usage:   329 GiB used, 6.7 TiB / 7.0 TiB avail
    pgs:     45475/136425 objects degraded (33.333%)
             90 active+undersized+degraded
             7 active+undersized

  io:
    client: 19 KiB/s wr, 0 op/s rd, 0 op/s wr

Yours sincerely
Stefan Malte Schumacher
 
Still, there is no rebalancing in progress, presumably because Ceph cannot place three replicas on different cluster nodes, as initially configured, with only two nodes left. If this assumption is correct, I do not understand why Ceph is still writable - shouldn't it go into read-only mode in this condition?
If you created the pool with the default size/min_size of 3/2, then the pool still has 2 replicas available and is therefore fully functional, but it has no spare redundancy left at this point. If you had a larger cluster with more nodes, Ceph could recover the lost replicas onto the nodes that don't hold a replica of those PGs yet.

You can see that all placement groups (PGs) are still active, though undersized.
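
The numbers in your output fit that picture: 45.48k objects × 3 replicas = 136,425 copies in total, and with one of three nodes down exactly one copy of each affected object is missing, hence the 33.333% degraded. If you want to verify the pool settings, here is a quick sketch; 'vm-pool' is just a placeholder for your actual pool name:

# replica settings of all pools
ceph osd pool ls detail

# or for a single pool
ceph osd pool get vm-pool size
ceph osd pool get vm-pool min_size

# details on the undersized / degraded PGs
ceph health detail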
 
If you created the pool with the default size/min_size of 3/2, then the pool still has 2 replicas available and is therefore fully functional, but it has no spare redundancy left at this point. If you had a larger cluster with more nodes, Ceph could recover the lost replicas onto the nodes that don't hold a replica of those PGs yet.

You can see that all placement groups (PGs) are still active, though undersized.

Correct me if I am wrong, as I am still new to Ceph, but if they were to add a fourth node to their cluster, it would allow Ceph to rebalance onto that node and go back to a healthy status. Would it then still be able to stay writable and running with the loss of another node, or would you need a fifth node to tolerate a dual node failure? Though I realise that if 2 nodes are split from the other 2 nodes, you can get a split-brain problem with an even number of nodes in the cluster.
 
If you have additional nodes, then Ceph can recover the lost replicas onto the remaining nodes. Once full redundancy is back, another node can be lost. If you lose 2 nodes at the same time, you will surely end up with only one replica for some PGs -> the affected pools will block IO until Ceph is able to get every PG back to at least two replicas (min_size).

DO NOT set min_size to 1. That is a sure way to have data loss / corruption.

The other thing you need to keep in mind is how many MONs can fail. They work by forming a quorum (majority), similar to the Proxmox VE cluster itself, so that puts another limit on how many (MON) nodes you can lose.
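
A quick way to check both from any node; 'vm-pool' is just a placeholder:

# replica settings - keep min_size at 2 for a replicated size 3 pool
ceph osd pool get vm-pool min_size

# Ceph monitor quorum
ceph quorum_status --format json-pretty
ceph mon stat

# Proxmox VE (corosync) cluster quorum, which HA depends on
pvecm status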
 
