Private server cluster configuration suggestions for Proxmox VE with SDS and NAS

electrolibrium

New Member
Mar 2, 2021
Hello, I'd like to ask for your suggestions on how to configure a new private server cluster for a start-up. I'm comfortable doing things in the cloud, but everything has to be private and self-hosted, including code repositories, wiki, CI/CD pipeline, chat, and collaboration tools. So I'm really a noob when it comes to bare-metal servers, and I hope you can help me find the right answers.

3 identical servers with this configuration: 2 x Intel Xeon Silver 4210 CPU (10 Core, 20 Thread, 2.20 GHz Base, 3.20 GHz Turbo), 64 GB DDR4 ECC RAM, 2 x 600 GB SAS 15K RPM HDDs.

We plan to form a cluster from these 3 servers with Proxmox VE, and here is the list of items we'd like to run on it using VMs:
  • Kubernetes cluster for running applications in multiple environments (dev, test, prod) via K8s namespaces.
  • MongoDB database instances in multiple environments
  • GitLab for Git repositories, wiki, issue tracking, and the continuous integration and deployment (CI/CD) pipeline
  • Element from element.io for conversations like Slack or MS Teams
  • Seafile for on-premises file sharing like Google Drive
  1. Would we face any problems configuring K8s on top of the VMs? Losing some performance is not a concern here, but we're more concerned about K8s not running reliably.
  2. We have about 5 TB of expected data per year for MongoDB, and redundancy is important in case of a disk failure. Instead of relying on the servers' HDDs for this, we'd like to find a shared storage option that provides redundancy, high availability, and performance. What are our options here? I read about NAS, but it might not handle the load of transaction-intensive databases. Software-defined storage (SDS) is suggested in an article, but I'm not really sure what it means in terms of the hardware we need to buy; are we talking about something like having 2 servers with 1 CPU and less memory but lots of HDDs, then just using these servers for their data?
  3. For GitLab, Element, and Seafile, I'm thinking that a NAS solution like the units Synology offers would be sufficient. Again, redundancy is very important here since we don't want to lose any data in case of a disk failure.
  4. Do we need to buy a physical firewall for this setup, or would the Proxmox firewall be sufficient?
  5. Please feel free to suggest or criticize the configuration we plan to use if you think it doesn't make sense.
Thank you very much for spending your precious time reading my post!
 
Hi,
Would we face any problems configuring K8s on top of the VMs? Losing some performance is not a concern here, but we're more concerned about K8s not running reliably.
Personally, I'm not too experienced with running K8s on top of VMs, but there are some users here doing just that and, tbh, I do not see why that should be an issue for K8s.

We have about 5 TB of expected data per year for MongoDB, and redundancy is important in case of a disk failure. Instead of relying on the servers' HDDs for this, we'd like to find a shared storage option that provides redundancy, high availability, and performance. What are our options here? I read about NAS, but it might not handle the load of transaction-intensive databases. Software-defined storage (SDS) is suggested in an article, but I'm not really sure what it means in terms of the hardware we need to buy; are we talking about something like having 2 servers with 1 CPU and less memory but lots of HDDs, then just using these servers for their data?
I'd go for Ceph here; it's very scalable and fault-tolerant and avoids proprietary black-box NAS solutions (IMO almost its best feature). You may want to read at least the introduction and precondition sections in https://pve.proxmox.com/pve-docs/chapter-pveceph.html to get a feeling for whether it fits your use case. Note that I currently envision the hyper-converged use case, i.e., Ceph storage and compute on those three nodes. But if budget allows, you could have some nodes for mostly compute and some for mostly storage: for example, add two more nodes to make it five in total, cluster them, add in some more disks, and make them Ceph monitors so that they have a more central role regarding Ceph, while the others use fewer disks and have more CPU and memory left over for your VM compute workloads.
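Just to give a feel for the hyper-converged case, the Ceph bring-up on a PVE node looks roughly like this (the cluster network and disk names below are only placeholders; the pveceph chapter linked above has the details):

Code:
# on every node: install the Ceph packages
pveceph install

# on the first node: initialize Ceph with a dedicated cluster network
pveceph init --network 10.10.10.0/24

# on each node: create a monitor and a manager
pveceph mon create
pveceph mgr create

# on each node: turn the spare disks into OSDs
pveceph osd create /dev/sdb
pveceph osd create /dev/sdc

# create a pool to place VM disks on
pveceph pool create vm-pool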

We released a Ceph benchmark paper a few months ago, maybe that is interesting too (in terms of doable workloads):
https://forum.proxmox.com/threads/proxmox-ve-ceph-benchmark-2020-09-hyper-converged-with-nvme.76516/

With three or even five nodes you'd still have the benefit of being able to add fast extra network cards (minimum 10G; 25G+ would be better) with two or four ports, respectively, and set up a full mesh on them for the Ceph private network (the latency- and bandwidth-hungry "backbone network" of Ceph). That avoids an extra switch, or the switch in use having to handle all that extra traffic, which can interfere with VM and cluster communication.
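For reference, the routed variant from the "Full Mesh Network for Ceph Server" wiki article looks roughly like this on node 1 of 3 (NIC names and addresses are just examples; each port is cabled directly to one of the other nodes):

Code:
# /etc/network/interfaces (excerpt): eno1 goes to node 2, eno2 goes to node 3
auto eno1
iface eno1 inet static
        address 10.15.15.50/24
        up ip route add 10.15.15.51/32 dev eno1
        down ip route del 10.15.15.51/32

auto eno2
iface eno2 inet static
        address 10.15.15.50/24
        up ip route add 10.15.15.52/32 dev eno2
        down ip route del 10.15.15.52/32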

For GitLab, Element, and Seafile, I'm thinking that a NAS solution like the units Synology offers would be sufficient. Again, redundancy is very important here since we don't want to lose any data in case of a disk failure.
For those, Ceph would work like a charm IMO, and unifying the data onto a single storage technology to manage has quite some benefits and removes extra cogs that can fail from the system.

Databases may be a bit more hungry for IOPS; with Ceph you could also have faster and slower storage pools in the same cluster, where one is assigned faster SSDs and the other uses slower but cheaper HDDs.
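A rough sketch of how such a split could look with plain Ceph CLI commands (rule and pool names are made up; Ceph assigns the hdd/ssd device class to each OSD automatically):

Code:
# one CRUSH rule per device class
ceph osd crush rule create-replicated replicated-ssd default host ssd
ceph osd crush rule create-replicated replicated-hdd default host hdd

# one pool per rule, e.g. a fast one for the databases, a slow one for bulk data
ceph osd pool create fast-pool 64 64 replicated replicated-ssd
ceph osd pool create slow-pool 64 64 replicated replicated-hdd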

Do we need to buy a physical firewall for this setup, or would the Proxmox firewall be sufficient?
Depends a bit on how your setup and its environment will look (other servers, is there already some routing/firewall host?) and what your needs are. There's always the option of using pfSense, either on a bare-metal host or in a VM; both have their advantages and disadvantages. If we're mostly talking about shielding off traffic, the PVE firewall is definitely fine. Natively integrated, full-fledged SDN management is currently under development, so for that you may need either a more hands-on approach or some other tooling, if that's a requirement.
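If you go with the PVE firewall, the datacenter-wide config lives in /etc/pve/firewall/cluster.fw and could start out roughly like this (the management subnet is just an example; make sure the SSH and GUI rules are in place before setting enable to 1):

Code:
[OPTIONS]
enable: 1

[RULES]
IN SSH(ACCEPT) -source 192.168.1.0/24 # SSH from the management subnet only
IN ACCEPT -p tcp -dport 8006 -source 192.168.1.0/24 # PVE web GUI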


Just some things that came to mind; feel free to ask if anything is unclear or if you have more specific questions.
 
_if_ you are going with NAS, I'd prefer a simple linux host with NFS and md RAID. this way you have total control (e.g. you can just take a disk and access the data anywhere in case of failure, no proprietary controllers) and get a better performance:price ratio.

another advantage is simplicity, which makes it reliable in most cases.
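for example, something in this direction (disk names, export path and subnet are only placeholders):

Code:
# mirror two disks with md, put a filesystem on top
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
mkfs.ext4 /dev/md0
mkdir -p /srv/nfs
mount /dev/md0 /srv/nfs

# export it to the proxmox nodes via nfs
apt install nfs-kernel-server
echo '/srv/nfs 192.168.1.0/24(rw,sync,no_subtree_check)' >> /etc/exports
exportfs -ra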

but ceph is also very reliable, one must say. we have had ceph in production for years too, one of the clusters SSD-only with quite high I/O (a mail cluster).
 
Please avoid MD-RAID; it has zero checks itself, and the classic combination of md-raid + LVM + ext4/xfs can neither detect nor repair any bitrot whatsoever. Rather, use a filesystem which can do so: that'd be ZFS, BTRFS, or CephFS (or anything else on top of Ceph RADOS block devices using BlueStore, which has data and metadata checksumming).
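For example, a simple ZFS mirror already gives you end-to-end checksums plus self-healing on scrub (disk names are placeholders):

Code:
# mirrored pool, checksumming is on by default
zpool create tank mirror /dev/sdb /dev/sdc

# verify all data against its checksums and repair from the intact mirror side
zpool scrub tank
zpool status -v tank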
 
Thank you very much for your reply, Thomas.

I read about Ceph and it looks like it's what I'm looking for. I have some questions related to the preconditions section.

Facts I've gathered:
There are 3 services: monitor, manager, and object storage daemon (OSD). A CPU core or thread is needed for each Ceph service, which means 3 cores, or 1.5 cores if counting threads (since the number of threads is double the number of cores), are required for the 3 services on each node in the cluster.
For memory usage, it's mentioned that the OSD service uses 1 GiB of memory to handle 1 TiB of data, in addition to the adjustable 3-5 GiB of memory.

Questions:
- Do we have any information regarding the memory usage of the manager and monitor services as well?
- Should I base my requirement calculations on 3 CPU cores or 1.5 CPU cores (using threads instead of cores) for the 3 services in total?
- The requirement of "1 GiB of memory to handle 1 TiB" for an OSD would only apply to data in motion, right, not to data that is peacefully sitting on the disks?
- Regarding your points about the networking requirements: we might need to grow gradually from 3 nodes to 10 within 2 years. In that case, what networking setup would you suggest so that we can build the infrastructure now and expand on it easily later? I'd assume starting with a separate switch would provide that capability. And for the network cards, I'd assume your suggestion of "minimum 10G, 25G+ would be better" still holds with a switch involved?
- From the benchmark report: "Can I mix various disk types? It is possible, but the cluster performance will drop to the performance of the slowest disk." Wouldn't that be a problem in our case, where we mix some regular HDDs with SSDs?

About the firewall: the server room is empty at the moment :) So there won't be any other servers outside of the cluster we're forming, and our only concern is securing ourselves from the outside world. From your reply, I understand that the PVE firewall would be sufficient for our use case. It's also good to know that we'll get SDN management in the future as well.

Thomas, I couldn't quite work out how the suggestion of adding 2 separate nodes would play out. So, let's assume I have the following configuration:

3 identical servers for computing: 2 x Intel Xeon Silver 4210 CPU (10 Core, 20 Thread, 2.20 GHz Base, 3.20 GHz Turbo), 64 GB DDR4 ECC RAM, 2 x 600 GB SAS 15K RPM HDDs.

2 identical servers for Ceph: 1 x Intel Xeon Silver 4210 CPU (10 Core, 20 Thread, 2.20 GHz Base, 3.20 GHz Turbo), 32 GB DDR4 ECC RAM, 4 x 600 GB SAS 15K RPM HDDs, 8 TB SSD.

In this setup, don't I still have to install the Ceph monitor, manager, and object storage daemon services on all of the nodes? Is the advantage maybe that the traffic required to maintain redundancy is reduced, since it would flow between the dedicated Ceph servers? Don't we have to have 3 identical servers for this setup?

Overall, I'm leaning towards increasing the core count and memory size of the cluster based on the Ceph requirements.

Self Notes
- To build a hyper-converged Proxmox + Ceph Cluster there should be at least three (preferably) identical servers for the setup.
- Assign a CPU core (or thread) to each Ceph service to provide enough resources for stable and durable Ceph performance.
- The OSD service will require 3-5 GiB of memory (adjustable) in addition to the suggested 1 GiB of memory per 1 TiB of data.
- Ceph performs best with an evenly sized and distributed amount of disks per node.
- Avoid hardware RAID, as Ceph handles redundancy itself.
- Natively integrated, full-fledged SDN management for PVE is under development and will arrive in the future.
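A rough back-of-the-envelope sizing for one hyper-converged node, assuming 4 OSDs and about 4 TiB of stored data per node (both numbers are just my own guesses):
- CPU: 1 monitor + 1 manager + 4 OSDs ≈ 6 cores/threads reserved for Ceph out of the 20 cores / 40 threads per node.
- Memory: 4 OSDs x ~4 GiB (the adjustable 3-5 GiB each) ≈ 16 GiB, plus ~4 GiB for the ~4 TiB of data ≈ 20 GiB for Ceph, leaving roughly 44 GiB of the 64 GiB for VMs.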
 
Please avoid MD-RAID; it has zero checks itself, and the classic combination of md-raid + LVM + ext4/xfs can neither detect nor repair any bitrot whatsoever. Rather, use a filesystem which can do so: that'd be ZFS, BTRFS, or CephFS (or anything else on top of Ceph RADOS block devices using BlueStore, which has data and metadata checksumming).
even if it's true technically speaking, I feel the urge to say that md-raid and ext4/xfs etc. have served very well and reliably over the last 20 years.
the point is that ceph and ZFS are anything but slim. so, depending on the use case and the available power/budget, we use all of these technologies.

and it is kind of a learning process for a small company like ours that the "cheap" solutions (cheap only compared to the big-business SANs) are in fact quite expensive: at least 3 boxes, roughly a 1:3 usable-to-raw disk ratio, only ~75% fillable, much, much RAM, redundant 10G+ networking, and so on.
but the amounts of data are getting bigger and more precious every day, so your point is legitimate.
 
