Three-Cluster-Node + Ceph

May 4, 2021
Hello,

I have a small test setup consisting of three physical nodes, each equipped with four SSDs and 10 Gbit Ethernet. I would like to test the viability of Ceph as a storage backend for the HA clusters we are running for several customers. Currently we are using a SAN backend with a shared LVM exposed over iSCSI. I want to test two failure scenarios: 1) a node goes down, 2) a single SSD goes down. Scenario 2) does not seem to be a problem: Ceph recognizes the loss and starts rebalancing. I had to remove the failed OSD from the CRUSH map and add the replacement SSD as a new OSD, but in general everything worked as expected.
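
For reference, the replacement roughly boiled down to the following commands; the OSD ID (7) and the device path (/dev/sdX) are just placeholders for my actual values:

# mark the failed OSD out and stop its daemon (if it is still running)
ceph osd out 7
systemctl stop ceph-osd@7

# remove it from the CRUSH map, delete its auth key and the OSD entry
ceph osd crush remove osd.7
ceph auth del osd.7
ceph osd rm 7

# wipe the replacement SSD and create a new OSD on it with the Proxmox VE tooling
ceph-volume lvm zap /dev/sdX --destroy
pveceph osd create /dev/sdX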

Scenario 1) is a bit more difficult. I powered off the server abruptly - no regular shutdown procedure - and after a few minutes the HA VMs from that server started on the other nodes in the cluster. Now Ceph is all red in the dashboard, but it seems I can still write to it - I tested writing to the block devices with dd inside the VMs and I downloaded a new ISO into CephFS. Still, there is no rebalancing in progress, presumably because Ceph cannot place three replicas on different cluster nodes, as initially configured, with only two nodes left. If this assumption is correct, I do not understand why Ceph is still writable - shouldn't it go into read-only mode in this condition?
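
This is roughly how I tested that writes still go through; the file path inside the VM is just an example:

# inside a test VM whose disk lives on the Ceph RBD pool
dd if=/dev/urandom of=/root/ceph-write-test.img bs=4M count=256 oflag=direct status=progress

# on one of the surviving nodes, watch the client IO and PG states while writing
watch ceph -s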

Output of ceph status, anonymized:

ceph status
  cluster:
    id:     7f3e6609-43f4-4e6d-a6aa-e4badcbccb9a
    health: HEALTH_WARN
            1/3 mons down, quorum freya,odin
            Degraded data redundancy: 45475/136425 objects degraded (33.333%), 90 pgs degraded, 97 pgs undersized

  services:
    mon: 3 daemons, quorum freya,odin (age 23m), out of quorum: thor
    mgr: freya(active, since 23m), standbys: odin
    mds: 1/1 daemons up, 1 standby
    osd: 12 osds: 8 up (since 23m), 8 in (since 13m)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 97 pgs
    objects: 45.48k objects, 172 GiB
    usage:   329 GiB used, 6.7 TiB / 7.0 TiB avail
    pgs:     45475/136425 objects degraded (33.333%)
             90 active+undersized+degraded
             7 active+undersized

  io:
    client: 19 KiB/s wr, 0 op/s rd, 0 op/s wr

Yours sincerely
Stefan Malte Schumacher
 
Still, there is no rebalancing in progress, presumably because Ceph cannot place three replicas on different cluster nodes, as initially configured, with only two nodes left. If this assumption is correct, I do not understand why Ceph is still writable - shouldn't it go into read-only mode in this condition?
If you created the pool with the default size/min_size of 3/2, then the pool still has 2 replicas available and is therefore fully functional, but it has no spare redundancy left at this point. If you had a larger cluster with more nodes, Ceph could recover the lost replicas onto the nodes that don't hold a replica of those PGs yet.

You can see that all placement groups (PGs) are still active, though undersized.
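
The numbers in your output fit that picture: 45.48k objects × 3 replicas = 136,425 copies in total, and with one of three nodes down exactly one copy of each affected object is missing, hence the 33.333% degraded. If you want to verify the pool settings, here is a quick sketch; 'vm-pool' is just a placeholder for your actual pool name:

# replica settings of all pools
ceph osd pool ls detail

# or for a single pool
ceph osd pool get vm-pool size
ceph osd pool get vm-pool min_size

# details on the undersized / degraded PGs
ceph health detail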
 
If you created the pool with the default size/min_size of 3/2, then the pool still has 2 replicas available and is therefore fully functional, but it has no spare redundancy left at this point. If you had a larger cluster with more nodes, Ceph could recover the lost replicas onto the nodes that don't hold a replica of those PGs yet.

You can see that all placement groups (PGs) are still active, though undersized.

Correct me if I am wrong, as I am still new to Ceph, but if they were to add a fourth node to their cluster, it would allow Ceph to rebalance onto that node and go back to a healthy status. Would it then still be able to stay writable and running with the loss of another node, or would you need a fifth node to tolerate a dual node failure? Though I realise that if 2 nodes are split from the other 2 nodes, you can get a split-brain problem with an even number of nodes in the cluster.
 
If you have additional nodes, then Ceph can recover the lost replicas onto the remaining nodes. Once full redundancy is back, another node can be lost. If you lose 2 nodes at the same time, you will surely end up with only one replica for some PGs -> the affected pools will block IO until Ceph is able to get every PG back to at least two replicas (min_size).

DO NOT set min_size to 1. That is a sure way to have data loss / corruption.

The other thing you need to keep in mind is how many MONs can fail. They work by forming a quorum (majority), similar to the Proxmox VE cluster itself, so that puts another limit on how many (MON) nodes you can lose.
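
A quick way to check both from any node; 'vm-pool' is just a placeholder:

# replica settings - keep min_size at 2 for a replicated size 3 pool
ceph osd pool get vm-pool min_size

# Ceph monitor quorum
ceph quorum_status --format json-pretty
ceph mon stat

# Proxmox VE (corosync) cluster quorum, which HA depends on
pvecm status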
 
