Ceph Node failure

Dave Wood

Well-Known Member
Jan 9, 2017
Hi,

I have a 3-node Proxmox cluster with Ceph. I'm happy with the current setup and performance.

Now we are planning disaster recovery. We are using separate NFS storage for VM backups.

I have a few questions and I need expert advice.
  1. Our Ceph pools are set up with 2 replicas. We have 4 OSDs in each node. Does Ceph store the replica on a different OSD in the same node, or on an OSD in a different node?
  2. What will happen if one node fails, or if a node's journal disk fails?
  3. What will happen if the rack's power fails? And what is the process to recover?
Thank you in advance.
 
1. this depends on the ceph crush map that you have.

By default this is set up so that Ceph will always try to write each replica to an OSD on a different host, and will indicate a failure if it cannot.
Just check that "step chooseleaf firstn 0 type" is set to "host" and you should be OK here.
You can get the crush map with
pvesh get /nodes/my_hostname/ceph/crush
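
Alternatively you can decompile the CRUSH map with the standard Ceph tools and check the rule directly. A rough sketch (assumes the crushtool binary is available on the node; the file names are just placeholders):

ceph osd getcrushmap -o crushmap.bin        # dump the compiled CRUSH map
crushtool -d crushmap.bin -o crushmap.txt   # decompile it to plain text
grep "step chooseleaf" crushmap.txt         # should show: step chooseleaf firstn 0 type host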

2.a If one node fails:
this depends on the parameters you have set for size and min_size.
size (i.e. the number of replicas) is the number of copies Ceph will try to keep
min_size is the minimum number of replicas that have to be present so that I/O is allowed

for instance in my test cluster I have the following parameters set:
(rbd is the name of the pool in my test cluster)

ceph osd pool get rbd size
size: 2

ceph osd pool get rbd min_size
min_size: 1

which means that in theory Ceph will duplicate my data on two OSDs located on two different nodes

however a production setup should rather use

size:3
and
min_size:2

Why?
As disks are getting bigger, the chance that two OSDs fail at the same time becomes too high.
Suppose you lose the disk of an OSD: in that case Ceph will still work, and as soon as you add a new disk it will start rebalancing the blocks onto the new disk. But this rebalancing takes time because of the heavy I/O involved (think: copying 4 TB over the network). What happens if the source disk fails during this time?
see: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-October/005672.html for details about that
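
If you want to keep an eye on such a rebalance while it runs, the standard status commands are enough (no special setup assumed):

ceph -s           # overall health, shows degraded/misplaced objects during recovery
ceph -w           # follow the recovery progress live
ceph osd df tree  # per-OSD and per-host utilisation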


2.b If a node's journal disk fails, the data on the OSDs that this journal disk serves is considered lost.
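
The usual recovery path there is to remove the affected OSDs, replace the journal disk, and recreate the OSDs so Ceph can backfill them from the surviving replicas. A rough sketch (osd.2 is a placeholder for one of the OSDs that used the failed journal; repeat for each affected OSD):

ceph osd out osd.2           # mark the OSD out so its data is remapped
ceph osd crush remove osd.2  # remove it from the CRUSH map
ceph auth del osd.2          # delete its authentication key
ceph osd rm osd.2            # remove the OSD entry itself
# then recreate the OSDs on the replacement journal disk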

3. In case of a whole power outage, on boot Ceph will recover by applying the changes found in the journal.
Since a write is only acknowledged once it has been committed to the journal disk, you should not see any kind of corruption.

If you want to be on the safe side, you should either use SSDs with power-loss protection (capacitors), or disable the volatile write cache on the SSD. Disabling the write cache has a big performance downside, especially for consumer-grade SSDs.
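
For reference, the volatile write cache can usually be checked and disabled with hdparm (replace /dev/sdX with your journal SSD; whether the setting survives a reboot depends on the drive):

hdparm -W /dev/sdX    # show the current write-cache setting
hdparm -W 0 /dev/sdX  # disable the volatile write cache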
 
Thank you so much Manu for your time and for answering all my questions in detail.
I have confirmed that we are using chooseleaf firstn 0 type host.
Currently we are using rbd size 2 and min_size 1.

Can we change the rbd size and min_size without migrating or losing data?
 
Hi,
yes, of course - but you can see an impact during the rebuild due to the heavy I/O (when changing size, not min_size).
This can be limited with the parameters "osd_max_backfills = 1" and "osd_recovery_max_active = 1", which must be activated beforehand (injectargs).
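
A rough sketch of the whole sequence, assuming the pool is still called rbd:

ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1'  # throttle recovery first
ceph osd pool set rbd size 3      # raise the replica count, triggers backfill
ceph osd pool set rbd min_size 2  # raise the minimum replicas required for I/O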

Udo
 