PVE-Ceph: Adding multiple disks to an existing pool

May 6, 2021
Hi there

We have a six-server cluster with existing Ceph pools.
Now we need to add more disks to one pool, and I am unsure which scenario takes more time and/or causes more «turbulence».
The pool currently consists of 6 x 2 SAS SSDs (3.2 TB and 6.4 TB). We would add another 6 x 2 SAS SSDs (6.4 TB).
Variants:
  1. Add all disks at the same time to the pool
  2. Add the disks one by one and wait until rebalanced
  3. Add the disks at the same time per server (2 at a time) and wait until rebalanced
My gut says option 1 (all together), but I kindly appreciate any input.
(the servers are equipped with dedicated storage interfaces on 2 x 25 GbE with LACP)

Regards, Urs
 
You can add all the disks at once, but I would only create 1 new OSD at a time. Wait for the recovery to complete before creating the next.
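Scripted, that "create one, wait, repeat" loop might look roughly like this; a minimal sketch assuming the ceph and pveceph CLIs on the node, with placeholder device paths (the JSON field names can vary a little between Ceph releases):

```python
# Rough sketch: create one OSD at a time and wait until every PG is
# active+clean again before creating the next one.
import json
import subprocess
import time

NEW_DEVICES = ["/dev/sdc", "/dev/sdd"]   # placeholders -- use your real devices

def pgs_all_active_clean() -> bool:
    """True once every PG reports active+clean, i.e. recovery has finished."""
    out = subprocess.check_output(["ceph", "-s", "--format", "json"])
    pgmap = json.loads(out)["pgmap"]
    clean = sum(s["count"] for s in pgmap.get("pgs_by_state", [])
                if s["state_name"] == "active+clean")
    return clean == pgmap["num_pgs"]

for dev in NEW_DEVICES:
    # pveceph wraps ceph-volume; run this on the node that owns the disk.
    subprocess.check_call(["pveceph", "osd", "create", dev])
    time.sleep(60)                        # give the OSD time to come up and peer
    while not pgs_all_active_clean():
        time.sleep(60)
    print(f"{dev}: cluster back to active+clean, moving on")
```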
 
Where do people even get the term "best practice" from? This isn't a cartel like Microsoft or Cisco where they give exams and awards to their little disciples. It reeks of escapism, like nobody ever got fired for following a best practice, regardless of outcome.

There are currently only three authors publishing on Ceph; maybe they have a blurb on it.

In the FOSS world you are your own best advocate and you must know and do what is best for you. The code, tools, information, and guidance found on the internet are given freely and without warranty. The buck only ever stops with you. There is never even a guarantee of a correct answer. Still not sure? Test it yourself and find out.

Ceph has no strict hardware requirements. You can put it on anything and it is super stable. It may not be fast if you don't build it right, but it will be stable. Since all the possible combinations of hardware and config are truly infinite, there is no way for anyone to tell you the best way to do something on your system or for any outcome to be assured.

If you want to expand your cluster at the fastest rate, you would create all your new OSDs simultaneously, so that your ultimately intended CRUSH map is rendered a single time. The resulting recovery workload will proceed at the highest possible rate, at your peril.

Not knowing your CPU, RAM, or VM allocations, and seeing only your small cluster and paltry 25 GbE network, I can say that expanding the cluster in a single giant gulp in this careless manner also gives you the best chance of slow, blocking OSD ops and sluggish or even crashed VMs, and may leave you scrambling to play with the recovery priority flags to see if you can find a balance. If you do, great: you will have discovered something that works for you but may not be applicable to the next guy with a completely different system.
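(For reference, the knobs being alluded to are along these lines; a sketch only, assuming the mClock scheduler on Quincy/Reef, with the older wpq-era throttles shown commented out. The values are illustrative, not a recommendation.)

```python
import subprocess

# Bias the scheduler toward client I/O while backfill runs (mClock, Quincy/Reef).
subprocess.check_call(["ceph", "config", "set", "osd",
                       "osd_mclock_profile", "high_client_ops"])

# On older releases using the wpq scheduler, the classic throttles apply instead:
# subprocess.check_call(["ceph", "config", "set", "osd", "osd_max_backfills", "1"])
# subprocess.check_call(["ceph", "config", "set", "osd", "osd_recovery_max_active", "1"])
```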

If you are only using CephFS or RGW workloads on this cluster, the impact of slow OSD ops will be less apparent and you can consider that. If your workload is principally RBD, then let's all agree that the survival of the RBD users is the top priority, and not expansion performance.

Want to go easy on the system, and yourself? Expand the cluster slow and steady.
 
Where do people even get the term "best practice" from?
From the preponderance of recognized experts in the field. This is true for anything, not just FOSS.

In the FOSS world you are your own best advocate and you must know and do what is best for you.
True everywhere else, too.

Ceph has no strict hardware requirements. You can put it on anything and it is super stable. It may not be fast if you don't build it right, but it will be stable. Since all the possible combinations of hardware and config are truly infinite, there is no way for anyone to tell you the best way to do something on your system or for any outcome to be assured.
There is. It's called documentation.

Want to go easy on the system, and yourself? Expand the cluster slow and steady.
Undoubtedly. You could probably have skipped the rest of the rant ;)
 
Look at my first response. Then he comes back with "what is this, your best guess?" Too funny.
 
The documentation is incomplete relative to the number of scenarios that can be encountered, especially with respect to split or shared OSD nets, where the implications can be tremendous.
The fact that you, the end user, choose to operate in environments outside the documentation doesn't mean the documentation is incomplete. Don't do it if it's not documented, and you'll be fine.
 
The doc says to split the nets! And I do, but most people don't, or at least the majority of the PVE setups I've seen don't.

Wherever they do discuss recovery loads, it's not getting through to the users.

The 100 GbE examples given in the 2023/12 PVE document were also not split, I imagine because the study was focused on small cluster RBD client performance and not recovery performance.
 
  1. Add all disks at the same time to the pool
  2. Add the disks one by one and wait until rebalanced
  3. Add the disks at the same time per server (2 at a time) and wait until rebalanced

In my experience all are valid options, each with its pros and cons.

1.- Usually my choice. Set the nobackfill, norebalance and norecover flags. Add all OSDs, set the device class, and create a new pool if needed. Remove the flags and wait. (A rough sketch of this procedure follows at the end of this post.)
Pros: set and forget, and the least data movement of the 3 options, as the CRUSH map in place when the rebalance starts is already the definitive one for the final number of OSDs.
Cons: all OSDs will be syncing data, so service might be impacted. If several OSDs break or misbehave, it might get clumsy to bring the cluster back to a healthy state.

2.- Least preferred option.
Pros: limited failure domain, so it is easier to diagnose if something goes wrong. Potentially fewer OSDs will sync data to the new one.
Cons: the highest data movement of the 3 options, as every OSD you add produces a new CRUSH map, which may even remove PGs from recently added OSDs and place other PGs there instead.

3.- A mix of options 1 and 2, with "half" the pros and cons of each.

Anyway, on Quincy and Reef all recovery operations have low priority by default, so there is little chance of them affecting VM services or performance. As you have 2 OSDs to add to each server, I suggest you try option 2 on some of the servers and, once you feel confident enough, try option 1.
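A rough sketch of option 1 as described above, assuming the ceph/pveceph CLIs; the device paths and the OSD ID in the device-class step are placeholders:

```python
import subprocess

FLAGS = ["nobackfill", "norebalance", "norecover"]
NEW_DEVICES = ["/dev/sdc", "/dev/sdd"]    # per node -- placeholders

def ceph(*args: str) -> None:
    subprocess.check_call(["ceph", *args])

# 1. Pause data movement while the new OSDs are created.
for flag in FLAGS:
    ceph("osd", "set", flag)

# 2. Create the OSDs (run on each node for its own disks).
for dev in NEW_DEVICES:
    subprocess.check_call(["pveceph", "osd", "create", dev])

# 3. Optionally adjust the device class (placeholder OSD ID shown),
#    and create the new pool here if one is needed.
# ceph("osd", "crush", "rm-device-class", "osd.12")
# ceph("osd", "crush", "set-device-class", "ssd", "osd.12")

# 4. Unset the flags so the single, final rebalance can start.
for flag in FLAGS:
    ceph("osd", "unset", flag)
```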
 
The doc says to split the nets! And I do, but most people don't, or at least the majority of the PVE setups I've seen don't.
Oh, I see what you mean; I had taken your comment to mean you wanted OSDs on different subnets from each other, not separation of public and private Ceph traffic.

There is no harm in keeping both public and private traffic on the same link(s) or even the same VLAN; it just means your throughput will be halved, since the same physical layer carries both loads, plus the potential security risk that Ceph guests could reach the Ceph private traffic. "Best practices" suggest keeping those separate, but it is possible to operate reliably that way. It's not always wrong to ignore best practices; just be aware of and prepared for the consequences.
-edit- the "half" comment applies to OSD-hosting nodes; pure guest nodes aren't impacted.
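If you are not sure which situation a given cluster is in, something like this can tell you; a sketch that assumes the usual /etc/pve/ceph.conf location and the underscore-style public_network / cluster_network keys (older configs may spell them with spaces):

```python
# Quick check: is Ceph cluster (private) traffic split from the public network?
import configparser

cfg = configparser.ConfigParser(strict=False, interpolation=None)
cfg.read("/etc/pve/ceph.conf")

public = cfg.get("global", "public_network", fallback=None)
cluster = cfg.get("global", "cluster_network", fallback=None)

if cluster and cluster != public:
    print(f"split networks: public={public}, cluster={cluster}")
else:
    print(f"shared network: public={public} (cluster traffic rides along)")
```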

Wherever they do discuss recovery loads, it's not getting through to the users.
What they don't learn through the head they'll learn on their flesh ;) That's not on you; you're trying to help.

The 100 GbE examples given in the 2023/12 PVE document were also not split, I imagine because the study was focused on small cluster RBD client performance and not recovery performance.
See above. The overall link bandwidth isn't important in and of itself. Many smaller environments won't ever generate anywhere NEAR the bandwidth allowed by their physical layer, and consequently have perfectly adequate performance with much slower links. The reason you don't want too much simultaneous rebalance is not lack of bandwidth; it's that access latency shoots through the roof, which of course it does: the file system is busy.
 
You can absolutely operate on a shared net; that is why it is still permitted, merely not recommended. But the impact of recovery on client operations is likely to be higher.

And even on dedicated dual 40 GbE OSD network with recovery at the lowest priority I have seen slow ops during a 2x enlargement. There is just no reason to leave yourself open to that possibility, no matter how remote.
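If anyone wants to keep an eye on that while an expansion runs, a small watcher along these lines works; a sketch, where SLOW_OPS is the standard health check ID but the exact JSON layout may differ between releases:

```python
# Poll cluster health and report slow-op warnings while backfill is running.
import json
import subprocess
import time

while True:
    out = subprocess.check_output(["ceph", "health", "detail", "--format", "json"])
    checks = json.loads(out).get("checks", {})
    if "SLOW_OPS" in checks:
        print("slow ops:", checks["SLOW_OPS"]["summary"]["message"])
    time.sleep(30)
```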
 
my guts are going to 1. (all together) but I kindly appreciate any input.
You can do it, but I would urge you to set aside a window, maybe overnight, where you wouldn't be dependent on accessing the file system. You have a small pool without a lot of data; it should be most of the way done by morning. Access performance would be curtailed in the interim.
 
And even on dedicated dual 40 GbE OSD network with recovery at the lowest priority I have seen slow ops during a 2x enlargement.
Pretty sure there isn't much you can do about a 2x enlargement if you do it all at once ;)

On Quincy and Reef with the default mClock scheduler, backfill/recovery ops have low priority by default. So low, in fact, that I find it difficult to make them run fast enough to make full use of the network and drive capacity.
Even with the previous wpq scheduler it was quite easy to regulate those operations so they would not disturb client traffic.
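For completeness, switching the mClock profile is how you would let recovery take more of the cluster when you actually want it to go faster; a sketch using the standard profile names:

```python
import subprocess

# Let backfill/recovery take a bigger share of each OSD's capacity...
subprocess.check_call(["ceph", "config", "set", "osd",
                       "osd_mclock_profile", "high_recovery_ops"])

# ...and return to the default once the expansion is done.
# subprocess.check_call(["ceph", "config", "set", "osd",
#                        "osd_mclock_profile", "balanced"])
```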
 
All I'm saying is that I have seen it on Quincy, with sufficient network and NVMe drives, and a modest 3k steady IOPS from fewer than 150 VMs, and I told myself I probably won't do it again, purely in terms of risk/reward.

For what appears to be a relatively green user to undertake a 2x expansion on their first try, on 25 GbE that he has not made clear is shared or not, in one move? I guess it would make a good exercise for him...
 
Thank you guys, I just wanted to understand and perhaps get a pointer to some documentation or experiences.
But feel free to rate my network. It won't feel hurt, and neither do I.
The servers are HPE, equipped with 512 GB RAM and 32-core EPYC CPUs. Data and storage networks are separated, of course.
This is PVE only, and our VMs have varying loads, also during the night. No CephFS. And yes, we're on Reef.
We changed our migration approach and will create a new pool with the new disks, using the chance to encrypt them.
We will move over the VMs and later destroy the old pool/OSDs and re-add them encrypted to the new pool.
We understood that it might be better to do this carefully, one by one, and let the pool redistribute the data several times.
This is enterprise hardware; we have not had a single disk failure in more than two years of operating the cluster.
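For what it's worth, the plan described above might look roughly like this on the CLI; a sketch only, with placeholder device paths, OSD IDs and class/rule/pool names, so double-check the pveceph options against your PVE version first:

```python
import subprocess

def run(*args: str) -> None:
    subprocess.check_call(list(args))

NEW_DEVICES = ["/dev/sdc", "/dev/sdd"]   # per node -- placeholders
NEW_OSD_IDS = ["osd.12", "osd.13"]       # IDs assigned at creation time -- placeholders

# 1. Create the new OSDs with at-rest encryption (dm-crypt).
for dev in NEW_DEVICES:
    run("pveceph", "osd", "create", dev, "--encrypted", "1")

# 2. Give the new OSDs their own device class so a CRUSH rule can target them.
for osd in NEW_OSD_IDS:
    run("ceph", "osd", "crush", "rm-device-class", osd)
    run("ceph", "osd", "crush", "set-device-class", "ssd-enc", osd)

# 3. A replicated rule that only selects that class, and a new pool using it.
run("ceph", "osd", "crush", "rule", "create-replicated",
    "replicated_ssd_enc", "default", "host", "ssd-enc")
run("pveceph", "pool", "create", "ceph-enc", "--crush_rule", "replicated_ssd_enc")

# 4. Move the VM disks to the new pool (GUI "Move disk" or qm move-disk),
#    then destroy the old OSDs one by one and re-create them encrypted
#    into the same device class.
```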
 
There's nothing "wrong" with your configuration. If it works for you, it works.

My only comment is that you have very few OSDs. Since I assume your guest count is small, this is probably OK; more OSDs = more performance and resilience. Going from 2 OSDs/node to 4 is a great start.
 