Expanding a Proxmox/Ceph cluster. More nodes, more networking.

John.N

Hello everyone,

I recently came to manage a Proxmox/Ceph cluster built a year ago, and I've been assigned the task of expanding it.
Currently, the cluster is made up of 3 nodes with 2/1 replication, each running 4x 480GB SSD Bluestore OSDs (12 in total).
They have 10G+1G networking in an active-backup bond.

I'm thinking of doing the following:
  • Add 1x 960GB SSD OSD to each existing node, taking them to ~3TB raw each.
  • Change the networking to 10+10Gb LACP with Cisco vPC (with a 2x40Gb LACP link between the two switches); a rough bond sketch for the node side follows after this list.
  • Add 2 more nodes and join them to the cluster, each with 3x 960GB OSDs (so every server ends up with ~3TB raw). The only caveat is that nodes 1-3 will have 5 OSDs each while nodes 4-5 will have only 3.
  • Increase the pg_num from 512 to 1024
  • Increase replication from 2/1 to 3/2 for redundancy
  • Only the first 3 nodes will continue to be monitors
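
This is roughly what I have in mind for the bond on each node, purely as an illustration (the NIC names are placeholders for our actual interfaces, and the switch side would be the vPC port-channel):

    # /etc/network/interfaces (excerpt) - 2x10Gb LACP bond, NIC names are placeholders
    auto bond0
    iface bond0 inet manual
        bond-slaves enp65s0f0 enp65s0f1
        bond-mode 802.3ad
        bond-xmit-hash-policy layer3+4
        bond-miimon 100
    # vmbr0 and the Ceph network would then sit on top of bond0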


Do you think this would be a good plan? Is there anything you find dodgy or potentially problematic, or any general comments?
Thank you in advance! :)
 
Hi,

Increase replication from 2/1 to 3/2 for redundancy

Do the above sooner rather than later. 2/1 is really not recommended, as it can easily lead to split-brain situations and total data loss.
If you really cannot spare the third object copy, use at least 2/2; while that makes the pool read-only on a node failure, at least you won't get into split-brain territory. But in 99.999% of cases it's just worth going with 3/2.
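
If you want to double check what a pool is currently set to, something along these lines shows it (the pool name "vm-pool" is just a placeholder here):

    ceph osd pool ls detail             # size, min_size, pg_num of all pools
    ceph osd pool get vm-pool size
    ceph osd pool get vm-pool min_size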

Do you think this would be a good plan? Is there anything you find dodgy or potentially problematic, or any general comments?

Sounds OK, IMO. The OSD count difference is no real issue for Ceph; in general, big differences in per-node raw storage and OSD count should be avoided, but yours should be really fine.
 
Thanks for the answer.

So you think this has to be done first? The thing is, since the cluster is in production, I think I should add the 960s to the 3 nodes, take it to 3/1 first, and only after replication is complete set it to 3/2, so I don't interrupt any services. Correct?

Edit: When do you think it's a good idea to change the pg_num from 512 to 1024? At the beginning or at the end?
 
So you think this has to be done first? The thing is, since the cluster is in production, I think I should add the 960s to the 3 nodes, take it to 3/1 first, and only after replication is complete set it to 3/2, so I don't interrupt any services. Correct?

No, you can go straight to 3 target copies / 2 minimum copies. With 2/1 in a healthy cluster you already have 2 copies of every object (at least of all those not currently in flight), so a minimum of 2 is fine for existing objects. New writes naturally take slightly longer, as the write operation (from the VM) only returns once two copies of the data object have been written out to the cluster, but that will always be the case for a 3/2 setup anyway - so the transition can be made directly to 3/2.

But, as that change means an additional copy of each object needs to be written out, you'll generate quite a bit of load (depending on current data usage), so I'd always do this in off-peak hours, if possible.
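
To keep an eye on the load while the additional copies are written out, the usual status commands are enough, roughly (the pool name is a placeholder again):

    watch ceph -s                 # overall health plus recovery/backfill progress
    ceph osd pool stats vm-pool   # per-pool recovery and client I/O rates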

So, IMO, a big aspect you probably want to look after for your change is reducing the amount of re-balancing caused by all those operations, as they produce load and shuffle data around, possibly only to shuffle it back again for the next change.

The changes which trigger re-balancing for your case are:
  • OSD addition
  • object replica/min-replica change
  • placement group count change
So I'd ensure that only a single re-balance happens: enable noout (you can do so over the web interface), then do all the changes - add the OSDs, change replica/min-replica, change pg_num (do not forget to change pgp_num too, see here) - and then unset the noout flag once all operations have succeeded. This will kick off quite some re-balancing, but IMO it's much better to do only a single one, as multiple re-balances would produce a bigger load in sum.
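
As a rough CLI sketch of that sequence, just to illustrate the order (the pool name and device path are placeholders, and the OSD creation and the flag can be handled over the web interface just as well):

    # set the flag so the following changes do not each trigger their own re-balance
    ceph osd set noout

    # add the new OSDs on every node (also possible via the GUI); /dev/sdX is a placeholder
    pveceph osd create /dev/sdX

    # go from 2/1 to 3/2 ("vm-pool" is a placeholder pool name)
    ceph osd pool set vm-pool size 3
    ceph osd pool set vm-pool min_size 2

    # raise the placement groups - pgp_num has to follow pg_num
    ceph osd pool set vm-pool pg_num 1024
    ceph osd pool set vm-pool pgp_num 1024

    # once all of the above succeeded, let the single big re-balance run
    ceph osd unset noout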
 
Oh, and as this is production and (I guess) the first time you expand a Proxmox VE cluster with Ceph, I'd recommend testing as much as possible. While I do not think that doing this is really hard, it's good to have some practice, and that way additional questions/potential issues can be tackled before you make the production change.

An old (test) setup could be good for this; as an alternative, you could set up a small, reduced PVE cluster in ordinary VMs and test it there virtually.
 
As far as networking goes, depending on how big your setup is, I'd just put 2-3 dual-port 40Gbps VPI IB cards in your main node and get 4x10Gbps splitter cables, bringing your total capacity to 16x10Gbps or 24x10Gbps. This removes the need for a switch. You can also still have a public + private network with this setup. And since all of this flows through one server, if you mount CephFS on that server and export it via Samba/NFS you get the total aggregated connection speed.
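
Mounting CephFS on that node is a one-liner with the kernel client, roughly like this (monitor address and secret file path are placeholders); the Samba/NFS export then simply points at that mount:

    mount -t ceph 10.0.0.1:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret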

You can configure the Ceph monitors to use the public network only, so if you do lose the central server your cluster isn't FUBAR.
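
In ceph.conf that boils down to something like the following (the subnets are example values only); the monitors bind to the public network:

    [global]
        public_network  = 10.10.10.0/24    # monitor and client traffic
        cluster_network = 10.10.20.0/24    # OSD replication/backfill traffic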

YMMV on what the server can handle, but I'm doing this for my current setup (on a smaller scale) and don't notice any issues with bandwidth or CPU load.
 
Just reporting in to say that everything went as planned; thank you very much, @t.lamprecht, for your input. :)

@paradox55 That sounds like an interesting idea, but while compact, it seems extremely fragile. Stacked switches are the way to go IMO, whether you go for something like StackWise or MC-LAG (I don't really like active-backup bonding). Plus, I'm pretty sure that IPoIB results in higher latency even at 40Gb, and at the end of the day that's exactly what I'm after - lower latency -> higher IOPS.
 
