Add another Ceph node

Jun 3, 2020
Hello everyone,

I currently have 3 Proxmox Ceph nodes with 12 NVMe drives of 2 TB each, using the default recommended configuration: CRUSH rule Replicated_Rule, 1 pool, and all OSDs with device class NVMe.

The pool is using about 70% of the space, and my idea is to buy another node and add it to the cluster to increase the available space.

My question is to understand how the cluster will function with this 4th node, which will have the same characteristics, meaning the same 12 NVMe disks.

How will it distribute the data?

Another question: on this 4th node, will I also need to configure the Monitor and Manager? Is there any additional procedure I should be aware of or any recommendations?

Thanks.
 
How will it distribute the data?
Some of the replicas currently stored on the drives of nodes 1, 2 and 3 will be moved to node 4, freeing space on the source OSDs while filling the new ones. This way you are effectively adding 24 TB of gross (raw) capacity to your available space, roughly 8 TB usable with the default 3/2 replica configuration.
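
Once the new OSDs are up, you can watch the backfill progress and the per-host usage with the usual status commands, for example:

  ceph -s           # overall health plus recovery/backfill progress
  ceph osd df tree  # per-OSD and per-host usage, grouped by the CRUSH tree
  ceph df           # pool-level usage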

will I also need to configure the Monitor and Manager?
No. Ceph needs its own quorum with an odd number of MONs, and 3 is the right amount for most PVE setups, which do not have many nodes.
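
You can check the current MON quorum and active MGR from any node, for example:

  ceph mon stat    # lists the monitors and which of them are in quorum
  ceph mgr stat    # shows the active manager and the number of standbys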

Is there any additional procedure I should be aware of or any recommendations?
Yes. A few important ones to take care of:

- Read about the OSD flags, specifically norebalance, nobackfill and norecover. Set them before creating the new OSDs to avoid data being moved around while the OSDs are still being created, and unset them once all OSDs have been created (see the first example after this list).

- Be careful with mon_osd_down_out_interval [1], which at its default of 600 seconds (10 minutes) will mark a down OSD as out, causing its replicas to be recreated on other OSDs of the cluster. If a whole host is down for more than 10 minutes, all of its replicas will be recreated and will potentially fill most of the OSDs in the other 3 nodes (you are adding a fourth node because your current data doesn't fit in 3 nodes, right? ;) ). For planned downtime, use the noout flag. For unplanned downtime, increase mon_osd_down_out_interval to a value that gives an administrator enough time to act and decide how to proceed (see the second example after this list).

- Also check mon_osd_min_in_ratio [2]: with the right value for your cluster, Ceph will still recreate replicas when a few OSDs fail, but will stop marking further OSDs out once too many of them are already down. A value of 1.01 disables that automatic marking-out entirely, so failed/down OSDs will never become out and no recovery will be triggered.

[1] https://docs.ceph.com/en/latest/rad...nteraction/#confval-mon_osd_down_out_interval
[2] https://docs.ceph.com/en/latest/rad...osd-interaction/#confval-mon_osd_min_in_ratio

- If you want to allow a full recovery in the event of a whole host failure and still increase the available space by 24 TB, you will also need a 5th node, due to the behaviour explained above.
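
For the OSD-flag step, a possible sequence is sketched below; the /dev/nvme0n1 path is just a placeholder for your actual drives:

  # pause data movement before adding the new OSDs
  ceph osd set norebalance
  ceph osd set nobackfill
  ceph osd set norecover

  # create the 12 OSDs on the new node, via the GUI or e.g.:
  pveceph osd create /dev/nvme0n1

  # once all OSDs have been created, let Ceph start moving data
  ceph osd unset norecover
  ceph osd unset nobackfill
  ceph osd unset norebalance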
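
For the two monitor options, roughly like this; the 3600 seconds value is only an example, pick whatever gives your administrators enough time to react:

  # planned downtime: prevent down OSDs from being marked out at all
  ceph osd set noout
  # ... do the maintenance, then:
  ceph osd unset noout

  # unplanned downtime: raise the time before a down OSD is marked out
  ceph config set mon mon_osd_down_out_interval 3600

  # don't auto-mark OSDs out if the 'in' ratio would drop below this (default 0.75)
  ceph config set mon mon_osd_min_in_ratio 0.75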

If anything is unclear, build a test cluster using VMs on your cluster and practice there to see how Ceph behaves.
 
