new three node PVE+Ceph cluster

Hi everybody,

We just decided to build a new virtualization environment based on three PVE+Ceph nodes.
At the moment we're running about 50 VMs (Windows and Linux servers) with 192 vCPU cores, 377 GB of RAM and 12 TB of allocated storage assigned to them, of which 8.3 TB is actually in use.

We would like to set up an as-close-to-standard-as-possible installation of PVE and Ceph. The first two steps I have in mind are 1) planning the server hardware and 2) planning the networking around these servers.

1) Server Hardware
Because we have been very happy with Thomas Krenn for years, we would like to buy these servers there. Here's my first shot (per machine):

** 2U AMD Dual-CPU RA2224 Server **
CPUs: 2x AMD EPYC 7351
RAM: 256 GB (4x 64 GB) ECC Reg DDR4 2666
SSDs (OS and software): 2x 240 GB Samsung SM883
SSDs (Ceph): 7x 1.92 TB Samsung SM883 (each of them an OSD+journal for Ceph storage)
1x Broadcom HBA 9300-8i
2x Intel 10 Gigabit X710-DA2 SFP+
Proxmox standard subscriptions
5 years essential hardware support

Notes:
a) Why AMD? Because when Goliath fights against David, we're on David's side. :) Are there great reasons to use Intel anyway?
b) SSDs: What about SM883 vs. PM883? Or something completely different?

2) Networking
The three nodes will live in three different rooms which are connected via OM3 fibre (R1 <-107 meters-> R2 <-145 meters-> R3).
In every room there is a Virtual Chassis based on Juniper EX3300 switches (also connected to each other via 10 GBit/s using the connections mentioned above). Everywhere there are at least four free 10 GBit/s ports.
So in my opinion the "obvious" way is:
1x 10 GBit/s for the "VM network"
1-2x 10 GBit/s for Ceph
1x 10 GBit/s for live migration
Another wild idea would be to build a direct mesh for Ceph with more than 10 GBit/s. The question here is: What is possible over 145 meters of OM3: 25 GBit/s? 40 GBit/s? Even more?
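
To make the plan a bit more concrete, here is a rough sketch of how such a three-NIC split could look in /etc/network/interfaces on one node. Interface names, addresses and subnets are just placeholders for illustration:

    # /etc/network/interfaces (sketch - names and addresses are invented)
    auto lo
    iface lo inet loopback

    # VM network: bridge on one 10 GBit/s port, guests attach here
    auto vmbr0
    iface vmbr0 inet static
        address 192.0.2.11/24
        gateway 192.0.2.1
        bridge-ports ens1f0
        bridge-stp off
        bridge-fd 0

    # Ceph network on a dedicated 10 GBit/s port
    auto ens1f1
    iface ens1f1 inet static
        address 10.10.10.11/24

    # Live migration network on a dedicated 10 GBit/s port
    auto ens2f0
    iface ens2f0 inet static
        address 10.10.20.11/24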

That's it for the moment. I would be very glad to hear your comments on this project, and I promise to run any performance test you would like to see in this environment once it's built! :)

Thanks and many greets from Germany
Stephan
 
No answers? Such a boring project? :D
So let's make it more concrete:

1) Ceph network (performance)
I guess this is more about latency than about bandwidth. Would a 2x 10 GBit/s bond be significantly better than a single 10 GBit/s link? (Or even worse because of the bonding overhead?)

2) Ceph OSDs and journals
In an SSD-only environment, would it be ok to give every OSD its own journal, or are there better designs?

Thanks and greets
Stephan
 

1)

Speaking from experience, latency has a bigger effect than pure bandwidth. If you can go for higher-speed, lower-latency NICs instead of multiple 10 Gbps links, I would say go that way.
If not, and your 2x 10 GBit/s links go to separate switches, that also gives you some redundancy against a single switch failing. It also really depends on how much performance you need out of the setup versus the extra hardware cost. You're at least giving yourself some extra future-proofing with 20 Gbps vs. 10 Gbps.
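
For the 2x 10 GBit/s option, an LACP bond across the two switch members is the usual approach. A minimal sketch for /etc/network/interfaces, assuming the EX3300 Virtual Chassis is set up with a matching LAG and with made-up interface names and addresses:

    # 2x 10 GBit/s LACP bond for the Ceph network (sketch, names/addresses assumed)
    auto bond0
    iface bond0 inet static
        address 10.10.10.11/24
        bond-slaves ens1f0 ens1f1
        bond-mode 802.3ad
        bond-xmit-hash-policy layer3+4
        bond-miimon 100

Keep in mind that a single TCP connection will still only use one 10 GBit/s member; the bond mainly helps with aggregate throughput and redundancy, not with the latency of an individual I/O.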


2)

When you say journals, are you still looking to use Filestore? I would highly suggest Bluestore.

Either way, if you're using decent SSDs then, unless you can get a few NVMe drives into the setup, you're better off using a single disk per OSD + journal/(DB+WAL). You'll get more predictable performance than hitting a single SSD with all the DB/journal I/O for multiple OSDs, and you again remove a single point of failure, with a drive failure only affecting a single OSD.
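
In practice that just means creating one Bluestore OSD per SSD and letting the DB/WAL live on the same device, which is the default when you don't specify a separate DB/WAL device. A sketch of the commands on one node (device names are placeholders; depending on the PVE release the subcommand is pveceph createosd or pveceph osd create):

    # One Bluestore OSD per data SSD, DB/WAL co-located on the same device
    pveceph osd create /dev/sdc
    pveceph osd create /dev/sdd
    # ...repeat for the remaining Ceph SSDs on this node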
 
Hey sg90,

thanks for your answer!

1) Ceph networking
Our options are:
a) 1-2x 10 GBit/s with hardware switching (Juniper EX3300)
b) mesh network with whatever I can get with 145 meters of OM3 fibre.

There are no plans (and no budget ;)) to buy hardware switching faster than 10 GBit/s.
But buying dual 25/40/more GBit/s NICs for a mesh network would be fine.

Basically, our storage performance needs are mainly "4k random read/write IOPS" for our databases. Bandwidth would be nice for fast backups, but that's not critical in terms of performance.

2) Ceph OSDs etc.
Oh, thanks for that hint! Of course we want to use Bluestore - I didn't know that journaling is no longer a topic there.

By the way: I know benchmarking is hard :D, but how many 4k random read/write IOPS would you expect from this environment?
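
For reference, when I run the tests later I would measure that with something like the following fio job inside a VM (file name, size and read/write mix are just examples):

    fio --name=randrw4k --filename=/root/fio-testfile --size=8G \
        --ioengine=libaio --direct=1 --rw=randrw --rwmixread=75 \
        --bs=4k --iodepth=32 --numjobs=4 --runtime=60 --time_based \
        --group_reporting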

Greets
Stephan
 
1)

If your cluster never needs to grow past servers that can be linked directly to each other in pairs with dual 25/40 GBit/s NICs (i.e. a full mesh), then I would go with that option; you can generally get up to 50% lower latency vs. 10 Gbps, let alone any small overhead from bonding.

2)

Very hard to say, but you can look at the 4K random IOPS rating of the SM883 and take a Ceph replication factor of 3 into consideration:

Read:  (97,000 × 14) / 3 ≈ 452,666 4K IOPS
Write: (29,000 × 14) / 3 ≈ 135,333 4K IOPS

But remember, this is roughly the most you could ever expect from the hardware, before any overhead from Ceph or the rest of the stack. I'd expect much less than the above numbers in real life.
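
Once the cluster is up you can sanity-check those numbers directly on the Ceph layer, for example with rados bench against a throwaway pool (pool name and PG count are just examples):

    ceph osd pool create testpool 128 128
    rados bench -p testpool 60 write -b 4096 -t 16 --no-cleanup
    rados bench -p testpool 60 rand -t 16
    rados -p testpool cleanup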
 
Ceph network: Switching vs. mesh
I just had an exciting conversation with a TK tech; we decided that a 40 GBit/s mesh would be a little bit too... experimental! :D So we will go with 1-2x 10 GBit/s switching for Ceph.

SSDs: SM883 vs. PM883
It seems that the SM is much more durable than the PM without losing (much) performance. So the SM883 is our choice!

OK, so I think the server hardware and network design are settled! :cool:
Next I will go shopping and build the new hardware into our datacenters.

Nonetheless, any comments, suggestions, hints, etc. are welcome!

Thanks a lot and greets
Stephan
 
Wondering why they would class 40 Gbps as experimental..?
 
Not really experimental in technical terms, but:
a) Almost no other customer goes with 40 GBit/s or more yet.
b) A meshed network for Ceph doesn't seem to be very common in production use.
c) Our fibre runs are longer than the official standard allows for 40 GBit/s over OM3 (100 meters).
d) QSFP+ would be new to us, while we already know SFP+.

So for us it would feel experimental. :D
 
Hi again,

things are becoming real - feels like Christmas! :)
We bought the server hardware as listed above, with one exception: instead of the X710 network cards we bought X520s.
The servers are now mounted in our datacenters, the network integration looks good, and I have just installed PVE on all three nodes - not yet clustered and without Ceph so far.
Do you have any suggestions or wishes for what I should test before creating the cluster and bringing up Ceph?
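
What I have on my own list so far are raw network and disk tests, roughly like this (addresses and device names are placeholders, and the fio run wipes the target disk, so only use it on an empty Ceph SSD):

    # Throughput between node pairs (iperf3 assumed to be installed)
    iperf3 -s                       # on the receiving node
    iperf3 -c 10.10.10.12 -t 30     # on the sending node

    # Round-trip latency on the Ceph links
    ping -c 1000 -i 0.01 10.10.10.12

    # Sync 4k write behaviour of one empty Ceph SSD (WARNING: destroys its data)
    fio --name=ssdtest --filename=/dev/sdc --direct=1 --sync=1 \
        --rw=write --bs=4k --iodepth=1 --numjobs=1 --runtime=60 --time_based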

Thanks and greets
Stephan
 
@sherminator : 1x Broadcom HBA 9300-8i
For what :)?
We have two system disks (RAIDZ1) and we start with seven 1.92 TB SSDs per node for Ceph. The maximum per node is 24.
...and because the Thomas Krenn web configurator only offers me the 8i. :)

Some benchmarks (4k IOPS) would be nice to see :) Thanks

For a single 1.92 TB SSD I posted a benchmark result here:
https://forum.proxmox.com/threads/proxmox-ve-ceph-benchmark-2018-02.41761/page-7#post-271062
Is this what you're looking for, or should I do something else?

Very nice! But I was thinking of something a bit bigger. Good luck!
 
@sherminator when your Ceph is ready, some benchmarks would be nice, not just the pure single drive ;-)

However, now I understand the need for the controller, because you have a 24-bay case ;-) lol :) Great stuff :)
 
