PVE-Ceph: Adding multiple disks to an existing pool

May 6, 2021
Hi there

We have a six-server cluster with existing Ceph pools.
Now we need to add more disks to one pool, and I am unsure which scenario takes more time and/or causes more «turbulence».
The pool currently consists of 6 x 2 SAS SSDs (3.2 TB and 6.4 TB). We would add another 6 x 2 SAS SSDs (6.4 TB).
Variants:
  1. Add all disks at the same time to the pool
  2. Add the disks one by one and wait until rebalanced
  3. Add the disks at the same time per server (2 at a time) and wait until rebalanced
My gut says option 1 (all together), but I kindly appreciate any input.
(the servers are equipped with dedicated storage interfaces on 2 x 25 GbE with LACP)

Regards, Urs
 
You can add all the disks at once, but I would only create 1 new OSD at a time. Wait for the recovery to complete before creating the next.
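Scripted, that "create one, wait, repeat" loop might look roughly like this; a minimal sketch assuming the ceph and pveceph CLIs on the node, with placeholder device paths (the JSON field names can vary a little between Ceph releases):

```python
# Rough sketch: create one OSD at a time and wait until every PG is
# active+clean again before creating the next one.
import json
import subprocess
import time

NEW_DEVICES = ["/dev/sdc", "/dev/sdd"]   # placeholders -- use your real devices

def pgs_all_active_clean() -> bool:
    """True once every PG reports active+clean, i.e. recovery has finished."""
    out = subprocess.check_output(["ceph", "-s", "--format", "json"])
    pgmap = json.loads(out)["pgmap"]
    clean = sum(s["count"] for s in pgmap.get("pgs_by_state", [])
                if s["state_name"] == "active+clean")
    return clean == pgmap["num_pgs"]

for dev in NEW_DEVICES:
    # pveceph wraps ceph-volume; run this on the node that owns the disk.
    subprocess.check_call(["pveceph", "osd", "create", dev])
    time.sleep(60)                        # give the OSD time to come up and peer
    while not pgs_all_active_clean():
        time.sleep(60)
    print(f"{dev}: cluster back to active+clean, moving on")
```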
 
Where do people even get the term "best practice" from? This isn't a cartel like Microsoft or Cisco where they give exams and awards to their little disciples. It reeks of escapism, like nobody ever got fired for following a best practice, regardless of outcome.

There are currently only three authors publishing on Ceph; maybe they have a blurb on it.

In the FOSS world you are your own best advocate and you must know and do what is best for you. The code, tools, information, and guidance found on the internet are given freely and without warranty. The buck only ever stops with you. There is never even a guarantee of a correct answer. Still not sure? Test it yourself and find out.

Ceph has no strict hardware requirements. You can put it on anything and it is super stable. It may not be fast if you don't build it right, but it will be stable. Since all the possible combinations of hardware and config are truly infinite, there is no way for anyone to tell you the best way to do something on your system or for any outcome to be assured.

If you want to expand your cluster at the fastest rate, you would create all your new OSDs simultaneously, so that your ultimately intended CRUSH map is rendered a single time. The resulting recovery workload will proceed at the highest possible rate, at your peril.

Not knowing your CPU, RAM, or VM allocations, and seeing only your small cluster and paltry 25 GbE network, I can say that expanding the cluster in a single giant gulp in this careless manner also gives you the best chance of slow, blocking OSD ops and sluggish or even crashed VMs, and may leave you scrambling to play with the recovery priority flags to see if you can find a balance. If you do, great: you will have discovered something that works for you but may not be applicable to the next guy with a completely different system.
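(For reference, the knobs being alluded to are along these lines; a sketch only, assuming the mClock scheduler on Quincy/Reef, with the older wpq-era throttles shown commented out. The values are illustrative, not a recommendation.)

```python
import subprocess

# Bias the scheduler toward client I/O while backfill runs (mClock, Quincy/Reef).
subprocess.check_call(["ceph", "config", "set", "osd",
                       "osd_mclock_profile", "high_client_ops"])

# On older releases using the wpq scheduler, the classic throttles apply instead:
# subprocess.check_call(["ceph", "config", "set", "osd", "osd_max_backfills", "1"])
# subprocess.check_call(["ceph", "config", "set", "osd", "osd_recovery_max_active", "1"])
```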

If you are only using CephFS or RGW workloads on this cluster, the impact of slow OSD ops will be less apparent and you can consider that. If your workload is principally RBD, then let's all agree that the survival of the RBD users is the top priority, and not expansion performance.

Want to go easy on the system, and yourself? Expand the cluster slow and steady.
 
Where do people even get the term "best practice" from?
From the preponderance of recognized experts in the field. This is true for anything, not just FOSS.

In the FOSS world you are your own best advocate and you must know and do what is best for you.
True everywhere else, too.

Ceph has no strict hardware requirements. You can put it on anything and it is super stable. It may not be fast if you don't build it right, but it will be stable. Since all the possible combinations of hardware and config are truly infinite, there is no way for anyone to tell you the best way to do something on your system or for any outcome to be assured.
There is. It's called documentation.

Want to go easy on the system, and yourself? Expand the cluster slow and steady.
Undoubtedly. You could probably have skipped the rest of the rant ;)
 
Look at my first response. Then he comes back with "what is this, your best guess?" Too funny.
 
The documentation is incomplete relative to the number of scenarios that can be encountered, especially with respect to split or shared OSD nets, where the implications can be tremendous.
The fact that you, the end user, choose to operate in environments outside the documentation doesn't mean the documentation is incomplete. Don't do it if it's not documented, and you'll be fine.
 
The doc says to split the nets! And I do, but most people don't, or at least the majority of the PVE setups I've seen don't.

Wherever they do discuss recovery loads, it's not getting through to the users.

The 100 GbE examples given in the 2023/12 PVE document were also not split, I imagine because the study was focused on small cluster RBD client performance and not recovery performance.
 
  1. Add all disks at the same time to the pool
  2. Add the disks one by one and wait until rebalanced
  3. Add the disks at the same time per server (2 at a time) and wait until rebalanced

In my experience all are valid options, each with its pros and cons.

1.- Usually my choice. Set the nobackfill, norebalance and norecover flags. Add all OSDs, set the device class, and create a new pool if needed. Remove the flags and wait. (A rough sketch of this procedure follows at the end of this post.)
Pros: set and forget, and the least data movement of the 3 options, as the CRUSH map in place when the rebalance starts is already the definitive one for the final number of OSDs.
Cons: all OSDs will be syncing data, so service might be impacted. If several OSDs break or misbehave, it might get clumsy to bring the cluster back to a healthy state.

2.- Least preferred option.
Pros: limited failure domain, so it is easier to diagnose if something goes wrong. Potentially fewer OSDs will sync data to the new one.
Cons: the highest data movement of the 3 options, as every OSD you add produces a new CRUSH map, which may even remove PGs from recently added OSDs and place other PGs there instead.

3.- A mix of options 1 and 2, with "half" the pros and cons of each.

Anyway, on Quincy and Reef all recovery operations have low priority by default, so there is little chance of them affecting VM services or performance. As you have 2 OSDs to add to each server, I suggest you try option 2 on some of the servers and, once you feel confident enough, try option 1.
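A rough sketch of option 1 as described above, assuming the ceph/pveceph CLIs; the device paths and the OSD ID in the device-class step are placeholders:

```python
import subprocess

FLAGS = ["nobackfill", "norebalance", "norecover"]
NEW_DEVICES = ["/dev/sdc", "/dev/sdd"]    # per node -- placeholders

def ceph(*args: str) -> None:
    subprocess.check_call(["ceph", *args])

# 1. Pause data movement while the new OSDs are created.
for flag in FLAGS:
    ceph("osd", "set", flag)

# 2. Create the OSDs (run on each node for its own disks).
for dev in NEW_DEVICES:
    subprocess.check_call(["pveceph", "osd", "create", dev])

# 3. Optionally adjust the device class (placeholder OSD ID shown),
#    and create the new pool here if one is needed.
# ceph("osd", "crush", "rm-device-class", "osd.12")
# ceph("osd", "crush", "set-device-class", "ssd", "osd.12")

# 4. Unset the flags so the single, final rebalance can start.
for flag in FLAGS:
    ceph("osd", "unset", flag)
```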
 
The doc says to split the nets! And I do, but most people don't, or at least the majority of the PVE setups I've seen don't.
Oh, I see what you mean; I had taken your comment to mean you wanted OSDs on different subnets from each other, not separation of public and private Ceph traffic.

There is no harm in keeping both public and private traffic on the same link(s) or even the same VLAN; it just means your throughput will be halved, since the same physical layer carries both loads, plus the potential security risk that Ceph guests could reach the Ceph private traffic. "Best practices" suggest keeping those separate, but it is possible to operate reliably that way. It's not always wrong to ignore best practices; just be aware of and prepared for the consequences.
-edit- the "half" comment applies to OSD-hosting nodes; pure guest nodes aren't impacted.
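If you are not sure which situation a given cluster is in, something like this can tell you; a sketch that assumes the usual /etc/pve/ceph.conf location and the underscore-style public_network / cluster_network keys (older configs may spell them with spaces):

```python
# Quick check: is Ceph cluster (private) traffic split from the public network?
import configparser

cfg = configparser.ConfigParser(strict=False, interpolation=None)
cfg.read("/etc/pve/ceph.conf")

public = cfg.get("global", "public_network", fallback=None)
cluster = cfg.get("global", "cluster_network", fallback=None)

if cluster and cluster != public:
    print(f"split networks: public={public}, cluster={cluster}")
else:
    print(f"shared network: public={public} (cluster traffic rides along)")
```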

Wherever they do discuss recovery loads, it's not getting through to the users.
What they don't learn through the head they'll learn on their flesh ;) That's not on you; you're trying to help.

The 100 GbE examples given in the 2023/12 PVE document were also not split, I imagine because the study was focused on small cluster RBD client performance and not recovery performance.
See above. The overall link bandwidth isn't important in and of itself. Many smaller environments won't ever generate anywhere NEAR the bandwidth allowed by their physical layer, and consequently have perfectly adequate performance with much slower links. The reason you don't want too much simultaneous rebalance is not lack of bandwidth; it's that access latency shoots through the roof, which of course it does: the file system is busy.
 
You can absolutely operate on a shared net; that is why it is still permitted, merely not recommended. But the impact of recovery on client operations is likely to be higher.

And even on dedicated dual 40 GbE OSD network with recovery at the lowest priority I have seen slow ops during a 2x enlargement. There is just no reason to leave yourself open to that possibility, no matter how remote.
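If anyone wants to keep an eye on that while an expansion runs, a small watcher along these lines works; a sketch, where SLOW_OPS is the standard health check ID but the exact JSON layout may differ between releases:

```python
# Poll cluster health and report slow-op warnings while backfill is running.
import json
import subprocess
import time

while True:
    out = subprocess.check_output(["ceph", "health", "detail", "--format", "json"])
    checks = json.loads(out).get("checks", {})
    if "SLOW_OPS" in checks:
        print("slow ops:", checks["SLOW_OPS"]["summary"]["message"])
    time.sleep(30)
```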
 
my guts are going to 1. (all together) but I kindly appreciate any input.
You can do it, but I would urge you to set aside a window, maybe overnight, where you wouldn't be dependent on accessing the file system. You have a small pool without a lot of data; it should be most of the way done by morning. Access performance would be curtailed in the interim.
 
And even on dedicated dual 40 GbE OSD network with recovery at the lowest priority I have seen slow ops during a 2x enlargement.
Pretty sure there isn't much you can do about a 2x enlargement if you do it all at once ;)

On Quincy and Reef with the default mClock scheduler, backfill/recovery ops have low priority by default. So low, in fact, that I find it difficult to make them run fast enough to make full use of the network and drive capacity.
Even with the previous wpq scheduler it was quite easy to regulate those operations so they would not disturb client traffic.
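For completeness, switching the mClock profile is how you would let recovery take more of the cluster when you actually want it to go faster; a sketch using the standard profile names:

```python
import subprocess

# Let backfill/recovery take a bigger share of each OSD's capacity...
subprocess.check_call(["ceph", "config", "set", "osd",
                       "osd_mclock_profile", "high_recovery_ops"])

# ...and return to the default once the expansion is done.
# subprocess.check_call(["ceph", "config", "set", "osd",
#                        "osd_mclock_profile", "balanced"])
```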
 
All I'm saying is that I have seen it on Quincy, with sufficient network and NVMe drives, and a modest 3k steady IOPS from fewer than 150 VMs, and I told myself I probably won't do it again, purely in terms of risk/reward.

For what appears to be a relatively green user to undertake a 2x expansion on their first try, on 25 GbE that he has not made clear is shared or not, in one move? I guess it would make a good exercise for him...
 
Thank you guys, I just wanted to understand and perhaps get a pointer to some documentation or experiences.
But feel free to rate my network. It won't feel hurt, and neither do I.
The servers are HPE, equipped with 512 GB RAM and 32-core EPYC CPUs. Data and storage networks are separated, of course.
This is PVE only, and our VMs have varying loads, also during the night. No CephFS. And yes, we're on Reef.
We changed our migration approach and will create a new pool with the new disks, using the chance to encrypt them.
We will move over the VMs and later destroy the old pool/OSDs and re-add them encrypted to the new pool.
We understood that it might be better to do this carefully, one by one, and let the pool redistribute the data several times.
This is enterprise hardware; we have not had a single disk failure in more than two years of operating the cluster.
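For what it's worth, the plan described above might look roughly like this on the CLI; a sketch only, with placeholder device paths, OSD IDs and class/rule/pool names, so double-check the pveceph options against your PVE version first:

```python
import subprocess

def run(*args: str) -> None:
    subprocess.check_call(list(args))

NEW_DEVICES = ["/dev/sdc", "/dev/sdd"]   # per node -- placeholders
NEW_OSD_IDS = ["osd.12", "osd.13"]       # IDs assigned at creation time -- placeholders

# 1. Create the new OSDs with at-rest encryption (dm-crypt).
for dev in NEW_DEVICES:
    run("pveceph", "osd", "create", dev, "--encrypted", "1")

# 2. Give the new OSDs their own device class so a CRUSH rule can target them.
for osd in NEW_OSD_IDS:
    run("ceph", "osd", "crush", "rm-device-class", osd)
    run("ceph", "osd", "crush", "set-device-class", "ssd-enc", osd)

# 3. A replicated rule that only selects that class, and a new pool using it.
run("ceph", "osd", "crush", "rule", "create-replicated",
    "replicated_ssd_enc", "default", "host", "ssd-enc")
run("pveceph", "pool", "create", "ceph-enc", "--crush_rule", "replicated_ssd_enc")

# 4. Move the VM disks to the new pool (GUI "Move disk" or qm move-disk),
#    then destroy the old OSDs one by one and re-create them encrypted
#    into the same device class.
```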
 
There's nothing "wrong" with your configuration. If it works for you, it works.

My only comment is that you have very few OSDs. Since I assume your guest count is small, this is probably OK; more OSDs = more performance and resilience. Going from 2 OSDs/node to 4 is a great start.
 