New cluster. Would like someone to look over the plan for issues/problems. Few questions.

AllanM
Oct 17, 2019
Hello Proxmox community and developers,

I'm new here! Very excited about proxmox!

I'm the IT and information security manager for a small-mid business (~80 people, expect to double that over the next 5-8 years). Need to spin up a cluster to host services and security systems as we work towards new security goals. Been simulating a proxmox cluster w/ceph in a virtual environment and think it is a good fit for our needs based on impressions thus far. Looking forward to our build-out. I intend to buy basic support or contribute donations to all of the open source projects we intend to take advantage of for this system.

Will start with 3 nodes this fall to get our feet wet on real hardware (learn how to break it, fix it, recover it, etc.), then expand to 5-6 nodes next year when we're ready to bring user computers into the new domain environment and begin migrating data from cloud to on-prem.

Anticipated Workloads:
WinServer 2019 (instance1): Domain Controller (Internal TrustedLAN DNS/DHCP/GP/AD), RADIUS server
WinServer 2019 (instance2): File Server
Win10Pro: Video Surveillance System w/~16 cams. Untrusted network Wifi controller, temporary print server "back door" to trusted network.
Pfsense: OpenVPN (full gateway redirect, full tunnel tap for all remote and local users), RADIUS client, IPS, pfBlocker, UntrustedLAN DHCP, etc.
Security Onion: full packet capture, NIDS, DLP, Log aggregation for server systems.
Wazuh Server: HIDS, config compliance scans, Log aggregation for client systems.
Placeholder: (possible GIT server or something similar for our software devs?)... Will likely add more in future but not sure what that looks like yet.


I'm leaning towards single socket EPYC Rome 2U WIO servers from SuperMicro (2113S-WTRT).
Config per node: 7402P / 128-256GB RAM / 250GB NVME Boot / 500GB NVME DB WAL / Intel X700 or X500 for 10G (baseT and SFP+ ports).

"Fast" pool of 2.5" SATA SSD's (likely 2TB each, minimum 4 per node, up to ~8-12 per node is likely long term) for VM OS's, security logs, and eventual in-house file server (~2M files and growing). DB/WAL for each OSD on respective OSD's.

"Slow" pool of 3.5" SATA spinners (likely 6-12TB each, 4 per node) for packet capture, security cam footage, and creative services media archive installed in an external DAS enclosure. (R2424RM from raidmachine looks interesting) connected to JBOD/IT mode SAS controllers. DB/WAL for all "slow" OSD on a single NVME M.2 SSD per node installed on MOBO.

1 X SG350XG-24T 10G switch for Coro1/CephP/TrustedLAN (separate VLAN's)
1 X SG350XG-24T 10G switch for Coro2/CephC/OOBLAN (separate VLAN's)

Various other switches for Untrusted Building Network, IPMI, and WAN.... not important.

---------------------------

Questions:

1. Assuming each node would potentially have up to 4 X ~10TB direct attached as part of its "slow" pool, how big should the NVME DB/WAL drive be to support the 4 drives?

2. Does Proxmox work with the 10G Broadcom NIC's built-into many SuperMicro Servers?

3. Should I substitute the cephC network on the second switch with a second cephP in a LAG failover instead? (any development plans in Ceph to make CephC capable of becoming cephP automagically in the case of a cephP fail?)

4. X700 vs X500 series Intel NIC's? Newer vs older models? Best practice here for Proxmox 6?

5. Any major concerns with the plan above in terms of hardware selection/config? Overkill/Underkill? I feel like this is a good starting point for the intended use.

------------------

Thank you!

-Eric
 
Hi,

I'm new here! Very excited about proxmox!
Glad to hear that.

1. Assuming each node would potentially have up to 4 X ~10TB direct attached as part of its "slow" pool, how big should the NVME DB/WAL drive be to support the 4 drives?
The block.db should be about 4% of the raw OSD size.
In your case that is 4 x 10 TB x 4% ≈ 1.6 TB per node.
For more information see [1].
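To make the arithmetic concrete, here is a rough per-node sketch (the 4% is only the guideline from the docs, not a hard rule; drive size and count are taken from your plan):

Code:
# Rough BlueStore block.db sizing per node: ~4% of raw OSD size (guideline only).
# 4 spinners x ~10 TB each, as in the planned "slow" pool - adjust to your actual drives.
OSD_TB=10; N_OSD=4
echo "block.db per OSD: $(( OSD_TB * 1000 * 4 / 100 )) GB"
echo "total per node:   $(( OSD_TB * 1000 * 4 / 100 * N_OSD )) GB"

That gives roughly 400 GB per OSD, so about 1.6 TB of block.db space on the shared NVMe per node.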

2. Does Proxmox work with the 10G Broadcom NIC's built-into many SuperMicro Servers?
If the NIC has a Linux driver, it will work with Proxmox VE.

3. Should I substitute the cephC network on the second switch with a second cephP in a LAG failover instead?
Yes, because a single switch carrying the Ceph public network is a single point of failure, and I guess you are using Ceph exactly to avoid that.
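If you go the failover route, a minimal sketch of an active-backup bond in /etc/network/interfaces could look like this (NIC names and the subnet are assumptions, adjust to your hardware):

Code:
# Active-backup bond for the Ceph public network, one port on each switch.
# No LACP/MLAG needed - the standby port simply takes over if a link or switch dies.
auto bond1
iface bond1 inet static
        address 10.10.10.11/24
        bond-slaves enp65s0f0 enp66s0f0
        bond-mode active-backup
        bond-miimon 100
# Point the Ceph public_network in /etc/pve/ceph.conf at this subnet.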

4. X700 vs X500 series Intel NIC's? Newer vs older models? Best practice here for Proxmox 6?
I would use the X500 series because they are more common and cheaper, and there is no real benefit to the X700 series in this case.
But honestly, I would use Mellanox 25 GBit NICs with the SFP28 interface.
They provide lower latency than SFP+, and your fast pool will benefit from that.
Either way, 10 GBit can be a bit bare.

Any major concerns with the plan above in terms of hardware selection/config? Overkill/Underkill? I feel like this is a good starting point for the intended use.
No, only the network, as noted above.
For the NVMe block.db/WAL drive, be sure to use an enterprise drive with good 4k sync write performance.
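If you want to verify a candidate drive before buying a pile of them, a quick 4k sync write test with fio gives a good indication (the device path is only an example, and the test writes to the raw disk, so use an empty test drive):

Code:
# Destructive: writes directly to the raw device - only run against an empty test disk.
fio --name=db-test --filename=/dev/nvme1n1 \
    --ioengine=libaio --direct=1 --sync=1 \
    --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 \
    --runtime=60 --time_based --group_reporting

Enterprise drives with power-loss protection usually hold their IOPS in this test, while many consumer drives collapse once every write has to be synchronous.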

[1] https://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#sizing
 
Thanks WolfGang!!

Excellent information. Thanks for the link. I probably skimmed over that ~4% of block size detail a few times reading about this but missed it every time.

I see the SAMSUNG 983 DCT and Micron 7300 Pro as possibilities bearing the "enterprise" marketing wank, but notice the "consumer" grade corsair MP510 achieves about 10X better QD1 4K random write performance at nearly half the price. Tough sell for me to go "enterprise" unless I can find something else.



The Mellanox SFP28 approach is something I had actually included in several of my hardware list options, but eventually disqualified on price. Seemed like I'd be into nearly $20,000 worth of switches to go that route... Not sure the performance advantage would ever actually have an impact on our users. (Many will be on remote access VPN and throttled by their hotel/airport so much that internal performance here won't matter much.)

Looking again today at options for going SFP28... found some Dell S5048F-ON switches in refurb condition being sold on Newegg Marketplace for about the price of a new SG350XG-24T. Might have to have a closer look at this route. With that said, is there a functionality problem using 10Gb for this? You mention 10 GBit being a bit "bare." Wondering if that's an issue of convenience (replication/rebuild speed) or something more serious (like latency causing actual problems).

Regards,
-Eric
 
I see the SAMSUNG 983 DCT and Micron 7300 Pro as possibilities bearing the "enterprise" marketing wank, but notice the "consumer" grade corsair MP510 achieves about 10X better QD1 4K random write performance at nearly half the price. Tough sell for me to go "enterprise" unless I can find something else.
I don't mean marketing "enterprise", I mean technical enterprise:
high endurance and high IOPS at 4k with sync and direct writes.

It does not matter whether it is Mellanox or another vendor.
But SFP28 vs SFP+ you will notice as higher speed in Ceph.
Might have to have a closer look at this route. With that said, is there a functionality problem using 10Gb for this?
No, it works fine. It is just that latency is better with SFP28,
and Ceph likes low latency.
Wondering if that's an issue of convenience (replication/rebuild speed) or something more serious (like latency causing actual problems).
If a node fails and drops out of the cluster, Ceph will try to rebalance all of its data onto the remaining nodes.
During that time you have extra traffic on the network,
and in your case this could be several TB to rebalance.
You can configure Ceph so that it does not rebalance too much at the same time,
but it is a scenario you have to be aware of.
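The relevant knobs are the recovery/backfill throttles, for example (values are only an illustration, check the current defaults and tune to your cluster):

Code:
# Throttle recovery/backfill so a rebalance does not saturate the network.
ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 1
ceph config set osd osd_recovery_sleep_hdd 0.1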

Also, it depends on what storage access pattern you have.
4K sync writes benefit enormously from a low-latency network;
streaming-like data does not.
 
I don't mean marketing "enterprise", I mean technical enterprise:
high endurance and high IOPS at 4k with sync and direct writes.

I think we're on the same page here ;) Though, I have a hunch almost any modern non-QLC 2TB drive would still be far better than putting the DB on the spinners. We're not going to have small files on the spinners anyway, so might not matter much. The spinning pool is for archive of large files.

It does not matter whether it is Mellanox or another vendor.
But SFP28 vs SFP+ you will notice as higher speed in Ceph.

I'll expand my searching for sure.

Would be nice to be able to match the NIC brand to Switch brand so that DACs are more feasible (non-custom). I'm also under the impression that DAC is slightly lower latency (not enough to matter but academically interesting).

Also, it depends on what storage access pattern you have.
4K sync writes benefit enormously from a low-latency network;
streaming-like data does not.

It will be a mix. The SSD pool is likely to have several million small files on it when all is said and done, with the various VMs and data storage we'll put there. The spinning pool will have all large files (mostly 100MB or larger).

--------

Thanks again for all of the input on this. Helps a ton!
 
