Add another Ceph node

Jun 3, 2020
Hello everyone,

I currently have 3 Proxmox Ceph nodes with 12 NVMe drives of 2 TB each, using the default recommended configuration: CRUSH rule Replicated_Rule, 1 pool, and all OSDs with device class NVMe.

The pool is using about 70% of the space, and my idea is to buy another node and add it to the cluster to increase the available space.

My question is to understand how the cluster will function with this 4th node, which will have the same characteristics, meaning the same 12 NVMe disks.

How will it distribute the data?

Another question: on this 4th node, will I also need to configure the Monitor and Manager? Is there any additional procedure I should be aware of or any recommendations?

Thanks.
 
How will it distribute the data?
Some of the replicas currently stored on the drives of nodes 1, 2 and 3 will be moved to node 4, freeing space on the source OSDs while filling the new ones. This way you are effectively adding 24 TB of gross (raw) capacity to your available space, roughly 8 TB usable with the default 3/2 replica configuration.
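
Once the new OSDs are up, you can watch the backfill progress and the per-host usage with the usual status commands, for example:

  ceph -s           # overall health plus recovery/backfill progress
  ceph osd df tree  # per-OSD and per-host usage, grouped by the CRUSH tree
  ceph df           # pool-level usage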

will I also need to configure the Monitor and Manager?
No. Ceph needs its own quorum with an odd number of MONs, and 3 is the right amount for most PVE setups, which do not have many nodes.
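
You can check the current MON quorum and active MGR from any node, for example:

  ceph mon stat    # lists the monitors and which of them are in quorum
  ceph mgr stat    # shows the active manager and the number of standbys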

Is there any additional procedure I should be aware of or any recommendations?
Yes. A few important ones to take care of:

- Read about the OSD flags, specifically norebalance, nobackfill and norecover. Set them before creating the new OSDs to avoid data being moved around while the OSDs are still being created, and unset them once all OSDs have been created (see the first example after this list).

- Be careful with mon_osd_down_out_interval [1], which at its default of 600 seconds (10 minutes) will mark a down OSD as out, causing its replicas to be recreated on other OSDs of the cluster. If a whole host is down for more than 10 minutes, all of its replicas will be recreated and will potentially fill most of the OSDs in the other 3 nodes (you are adding a fourth node because your current data doesn't fit in 3 nodes, right? ;) ). For planned downtime, use the noout flag. For unplanned downtime, increase mon_osd_down_out_interval to a value that gives an administrator enough time to act and decide how to proceed (see the second example after this list).

- Also check mon_osd_min_in_ratio [2]: with the right value for your cluster, Ceph will still recreate replicas when a few OSDs fail, but will stop marking further OSDs out once too many of them are already down. A value of 1.01 disables that automatic marking-out entirely, so failed/down OSDs will never become out and no recovery will be triggered.

[1] https://docs.ceph.com/en/latest/rad...nteraction/#confval-mon_osd_down_out_interval
[2] https://docs.ceph.com/en/latest/rad...osd-interaction/#confval-mon_osd_min_in_ratio

- If you want to allow a full recovery in the event of a whole host failure and still increase the available space by 24 TB, you will also need a 5th node, due to the behaviour explained above.
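
For the OSD-flag step, a possible sequence is sketched below; the /dev/nvme0n1 path is just a placeholder for your actual drives:

  # pause data movement before adding the new OSDs
  ceph osd set norebalance
  ceph osd set nobackfill
  ceph osd set norecover

  # create the 12 OSDs on the new node, via the GUI or e.g.:
  pveceph osd create /dev/nvme0n1

  # once all OSDs have been created, let Ceph start moving data
  ceph osd unset norecover
  ceph osd unset nobackfill
  ceph osd unset norebalance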
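
For the two monitor options, roughly like this; the 3600 seconds value is only an example, pick whatever gives your administrators enough time to react:

  # planned downtime: prevent down OSDs from being marked out at all
  ceph osd set noout
  # ... do the maintenance, then:
  ceph osd unset noout

  # unplanned downtime: raise the time before a down OSD is marked out
  ceph config set mon mon_osd_down_out_interval 3600

  # don't auto-mark OSDs out if the 'in' ratio would drop below this (default 0.75)
  ceph config set mon mon_osd_min_in_ratio 0.75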

If anything is unclear, build a test cluster using VMs on your cluster and practice there to see how Ceph behaves.
 
