Server hardware recommendations for new 3–4 node Proxmox VE / Ceph cluster

marc.charbonneau

Dec 4, 2025
Hi,

We are currently running an aging VMware vSAN environment on Dell VxRail and are evaluating Proxmox VE as a migration path.

We have already completed a nested PVE test deployment on our existing vSAN environment. This allowed us to validate the VMware-to-PVE migration process using our existing Veeam license, as well as validate use of the PVE API for VM automation. This was important for us because our Dev/QA team currently uses VMware PowerCLI with Jenkins automation.

So far, the testing has gone well and we believe Proxmox VE could be a strong replacement path for our VxRail environment.

Our main dilemma now is selecting appropriate new server hardware for a 3-node or 4-node PVE cluster using Ceph for hyperconverged storage.

We are currently considering Dell, Lenovo, and Supermicro, including models such as the following, but we are certainly open to other suggestions that may be more cost-effective:
  • Dell PowerEdge R770
  • Lenovo ThinkSystem SR650 V4
  • Supermicro SYS-222H-TN
The concern is that these platforms are new, and we would like to avoid being on the bleeding edge if there are known issues with Linux, Proxmox VE, NVMe, firmware, NICs, BIOS options, or Ceph.

The tentative design is:
  • 3 or 4 nodes
  • NVMe storage for Ceph OSDs, likely in the range of 5–8 enterprise NVMe drives per node
  • Redundant 25G network for Ceph
  • Redundant 10G network for VM/public traffic
  • Separate redundant network for Corosync/cluster communication
  • Intel CPUs preferred, since the existing VxRail is on Intel
  • Intel NICs preferred unless there is strong reason to choose otherwise
I would be interested in hearing from anyone running similar current-generation Dell, Lenovo, or Supermicro hardware with Proxmox VE and Ceph.
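
For anyone weighing 3 versus 4 nodes and 5–8 drives per node, a rough sizing sketch in Python may help. It is back-of-the-envelope only. The 3.84 TB drive size and the 80% fill target are assumptions for illustration, not values from this post, and the function names are hypothetical. The replica count of 3 matches Ceph's default for replicated pools, and the sketch assumes one replica per host.

```python
def usable_tb(nodes, drives_per_node, drive_tb, replicas=3, fill_target=0.8):
    """Rough usable capacity (TB) of a replicated pool across all nodes."""
    raw_tb = nodes * drives_per_node * drive_tb
    return raw_tb / replicas * fill_target


def usable_tb_after_node_loss(nodes, drives_per_node, drive_tb,
                              replicas=3, fill_target=0.8):
    """Usable capacity (TB) that still fits, fully replicated, on the
    surviving nodes after one node fails.

    Returns None when fewer than `replicas` nodes survive, because with one
    replica per host there is nowhere to rebuild the missing copy.
    """
    survivors = nodes - 1
    if survivors < replicas:
        return None
    return usable_tb(survivors, drives_per_node, drive_tb, replicas, fill_target)


if __name__ == "__main__":
    # Assumed example: 6 drives of 3.84 TB per node (hypothetical sizes).
    for nodes in (3, 4):
        total = usable_tb(nodes, 6, 3.84)
        healed = usable_tb_after_node_loss(nodes, 6, 3.84)
        print(f"{nodes} nodes: ~{total:.1f} TB usable, after node loss: {healed}")
```

Under these assumptions, a 3-node cluster keeps running after a node failure but cannot restore the third replica until the node returns, while a 4-node cluster can rebuild if the data fits on the remaining three nodes.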

Specific questions:
  1. Are there particular server models that have proven reliable for PVE/Ceph deployments?
  2. Are the newer Dell R770, Lenovo SR650 V4, or Supermicro SYS-222H-TN reasonable choices, or would you prefer the previous generation for maturity?
  3. Are there any specific NICs, HBAs, NVMe backplanes, BIOS settings, or firmware considerations we should watch for?
  4. Do any Proxmox partners or experienced users maintain informal lists of commonly deployed server models for PVE/Ceph clusters?
  5. Are we overdesigning this? If so, what specs would you consider more than sufficient for a reliable PVE/Ceph production cluster?
Any real-world experience or guidance would be appreciated.

Thanks,
Marc
 
Hi Marc, here is my experience

Are there particular server models that have proven reliable for PVE/Ceph deployments?
All of the servers have the same stuff going on under the covers, so pick the vendor you like best. In my experience, Supermicro has the best pricing but is a bit barebones on IPMI.

Are the newer Dell R770, Lenovo SR650 V4, or Supermicro SYS-222H-TN reasonable choices, or would you prefer the previous generation for maturity?
I would go with the newest-gen Xeon 6 series. I run some of them and have had no issues, and the previous gen is already 2-3 years old.


Are there any specific NICs, HBAs, NVMe backplanes, BIOS settings, or firmware considerations we should watch for?
Do not use RAID cards for Ceph, only HBAs; a BOSS card or similar is fine for the OS only. I personally avoid Intel NICs because the drivers have often been buggy; Broadcom and Mellanox are good. For Ceph, there is some performance tuning available in the BIOS related to power efficiency; mostly, just set everything to high performance.


Do any Proxmox partners or experienced users maintain informal lists of commonly deployed server models for PVE/Ceph clusters?
I used a Proxmox partner, and the one I reached out to did not have a "rubber stamp" build or a list of approved servers. Because Proxmox is Linux, basically everything is supported. Partners will have good recommendations for server specs, disk sizing, networking, etc.

Are we overdesigning this? If so, what specs would you consider more than sufficient for a reliable PVE/Ceph production cluster?
It looks like you have covered all your bases. It doesn't hurt to have a partner double-check before you purchase hardware, and again after you build your cluster to check that things are configured optimally. For your disks, get high-write-endurance NVMe drives. Each host needs 6 network interfaces (2 mgmt/Corosync, 2 VM, 2 Ceph). Intel CPUs are a good choice.

Do not hesitate to spend some money on consulting. You may spend 5k to double-check all your work, but it's worth it to ensure things go smoothly and you do not make a mistake when purchasing new hardware.
 
Corosync doesn’t need a redundant network. You can configure multiple networks for backup.

read through https://forum.proxmox.com/threads/fabu-can-i-use-ceph-in-a-_very_-small-cluster.159671/ and consider at least 4 nodes.
I could be incorrect, but the Proxmox documentation basically states you need two networks, and you really do need two if you ever plan to update one of your switches.

https://pve.proxmox.com/wiki/Cluster_Manager#pvecm_cluster_requirements

We recommend a dedicated physical NIC for the cluster traffic.
To ensure reliable Corosync redundancy, it is essential to have at least another link on a different physical network. This enables Corosync to keep the cluster communication alive should the dedicated network be down.
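
Both the node-count question and the second Corosync link come down to keeping quorum, so here is a minimal Python sketch of the majority arithmetic. It assumes one vote per node and no QDevice. The function names are hypothetical and are not part of any Proxmox or Corosync API.

```python
def quorum_votes(total_votes: int) -> int:
    """Votes needed for a strict majority of the cluster."""
    return total_votes // 2 + 1


def tolerated_failures(total_votes: int) -> int:
    """How many single-vote nodes can drop out before quorum is lost."""
    return total_votes - quorum_votes(total_votes)


if __name__ == "__main__":
    for nodes in (3, 4, 5):
        print(f"{nodes} nodes: quorum={quorum_votes(nodes)}, "
              f"tolerated failures={tolerated_failures(nodes)}")
```

By this arithmetic a fourth node does not let the cluster survive more simultaneous node failures than three nodes do. Its benefit is on the Ceph side, where it leaves room to re-replicate after a node is lost. A node cut off from every Corosync link counts as a failed vote, which is why the second link on a separate physical network matters.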
 
Take a look at the Dell R6715. You can get a 48-core CPU in a 1U node, which is, to put it mildly, a lot of compute. You get the option for two OCP network cards, a couple of low-profile PCI cards, and up to 10 NVME drives up front.

I don't think there is any reason to limit yourself to Intel CPUs. You are going to be standing up a new cluster and migrating the VMs off the existing system to Proxmox. It is unlikely that the existing VMs are configured for a CPU model that a Zen 5 CPU cannot handle. Certainly not an issue for us.

Personally, I would be concerned about a Ceph network limited to 25 Gbps. As for network cards, we wanted Mellanox but ended up holding our noses and getting Broadcom cards, otherwise we would still be waiting for the servers to be delivered.

Our new R6715 system has a double-link 100 Gbps mesh network, courtesy of some daftness in the Dell configurator. We ordered a bunch of new compute nodes for the HPC last year, and they were cheaper if you got an extra ConnectX-6 OCP PCIe card!!! I still have a whole pile I removed from them (because they will never ever be used, so would just be wasting electricity), despite loading up the R6715s with two each :D

To be frank, the old Lenovo SR530 system with Xeon 6130s remains more than enough for our needs, but it is now eight years old, so we are moving to new hardware. Although we are now on a single CPU, memory bandwidth has increased significantly. The best bit for us is being a single-CPU system; at a stroke, it halved our licensing costs. For us, capex is easier to come by than opex.

Until last September, the SR530s were a vSphere Essentials Plus system with a DS4200 for shared storage. However, trying to get a price from Broadcom was like trying to get blood out of a stone, so we jumped ship with a crazy in-place swapover scheme involving some upgrades to the SR530s and moving the drives from the DS4200 into the front of the SR530s. Our vSphere licensing runs out later this month :) . Our central IT have jumped ship to Proxmox, too, as I understand it. Their licensing was going to increase from 75k to ~500k per year, at which point payback, even factoring in new hardware, would be less than 12 months.

Our config is PowerEdge R6715 with AMD EPYC 9455 CPUs and 768GB of RAM. Up the front, we have a bunch of NVMe drives. Primary networking is BCM57414 with dual 25Gbps links, then two ConnectX6-DX cards forming a mesh network for Ceph, with two links between nodes for redundancy, and finally a quad-port BCM57454 OCP card for Corosync. They are in another mesh network and are the result of a previous procurement "mistake". Despite 100s of 10Gbps ports, we don't have anything to plug 10GBase-T into, and the only 1Gbps ports available are on the management switches, which are entirely unsuitable for a production Corosync network. Finally, they have a 512GB USB key in them for ISO storage (unless Dell has updated the manual; you need micro-fit style drives less than 24.19mm long instead of the 57.15mm it claims).

The reason for the Ceph mesh network was to avoid burning our 200Gbps Ethernet ports, and we also had enough ConnectX6-DX cards to run a dual-link mesh network.

It's all looking good in testing and is ready to go. The holdup for the migration is that the money we got for them included upgrading the power to the racks (22kW doesn't cut it anymore), and the electricians only finished the 3-phase outlets last week.
 
Are there particular server models that have proven reliable for PVE/Ceph deployments?
All of the servers have the same stuff going on under the covers, so pick the vendor you like best. In my experience, Supermicro has the best pricing but is a bit barebones on IPMI.

The almost non-existent firmware updates for Supermicro's BMC/IPMI make them a no-go for us. There is no way there aren't a slew of security issues in them, and given the current cybersecurity landscape, even with a dedicated management network, it is a no-go.
 
I could be incorrect, but the Proxmox documentation basically states you need two networks, and you really do need two if you ever plan to update one of your switches.

https://pve.proxmox.com/wiki/Cluster_Manager#pvecm_cluster_requirements

We recommend a dedicated physical NIC for the cluster traffic.
To ensure reliable Corosync redundancy, it is essential to have at least another link on a different physical network. This enables Corosync to keep the cluster communication alive should the dedicated network be down.
That's what I was describing. Perhaps I just misunderstood what OP meant by "redundant network."
 
The almost non-existent firmware updates for Supermicro's BMC/IPMI make them a no-go for us. There is no way there aren't a slew of security issues in them, and given the current cybersecurity landscape, even with a dedicated management network, it is a no-go.
Yep, it's pretty barebones. I quoted both Dell and Supermicro, and Dell came in about 50% more for the same hardware; a better IPMI was not worth that amount of money.
 
Thanks for the detailed reply. This definitely raises my confidence around using newer-generation servers for a PVE/Ceph deployment. I had been debating whether the previous generation might be the safer/more mature choice, but you make a good point that those platforms are already a few years old, and we want this refresh to last at least 5 years.

Your NIC comment is also useful. Based on various things I had read, I was initially leaning toward Intel networking, but I’ll take another look at Broadcom or NVIDIA/Mellanox options for the OCP NICs if those have proven more stable in practice.

I completely agree on using a partner for validation. I actually engaged a Proxmox Gold Partner with a clear request for help with hardware selection, final server review, and post-deployment validation before going live. Unfortunately, there was a disconnect between sales and technical staff: sales indicated they could assist with all three, but the technical discussion later made it clear they could not help with those specific areas. I ended up getting a full refund, which was disappointing, as I’m the only IT person for a small company of around 20 staff and was really hoping for experienced validation before committing to hardware.

Based on other feedback below, I’m also going to adjust my quoting strategy and look more seriously at 1U single-socket AMD options. My initial quotes around Dell R770 and Lenovo SR650 V4 were coming in over budget, so hopefully a simpler 1U single-CPU AMD design can bring the cost down while still meeting our requirements.

Thanks again — this was very helpful.
 
Thanks for the info.

I’ll definitely be looking into the R6715 option. I may have been a bit too strict about wanting to stay with Intel on the destination cluster simply because the source cluster is Intel. Thinking it through more, the available CPU compatibility modes should be fine for the migrated VMs, and I doubt there is any real need for "host" CPU mode for most of our existing workloads. New deployments, or VMs we later validate on the new cluster, could then use the optimal CPU mode for the new hardware.

Our existing Dell TOR rack switches only have 4 QSFP28 ports (2 per switch), and those will likely need to be used in 25G breakout mode, so 25G is probably my practical limit today. The plan would be to use two 25G links for Ceph, ideally with MLAG/bonding. I realize that does not necessarily mean a single Ceph flow gets 50G, but across multiple OSD/client flows it should still provide more aggregate bandwidth and redundancy than a single 25G link.
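
To illustrate the point about a bonded pair of 25G links, here is a small Python sketch of hash-based flow distribution. It is an illustration only: `zlib.crc32` stands in for whatever transmit hash the bond and switches actually use, the addresses and ports are made up, and the function names are hypothetical.

```python
import zlib

LINK_GBPS = 25  # per-link speed from the post


def pick_link(flow, n_links=2):
    """Map a flow (src_ip, src_port, dst_ip, dst_port) to one bond member.

    Every packet of a given flow hashes to the same link, so a single flow
    can never exceed one link's speed.
    """
    key = "|".join(str(part) for part in flow).encode()
    return zlib.crc32(key) % n_links


def aggregate_ceiling_gbps(flows, n_links=2):
    """Upper bound on total throughput: one link's worth per link in use."""
    links_in_use = {pick_link(flow, n_links) for flow in flows}
    return len(links_in_use) * LINK_GBPS


if __name__ == "__main__":
    # Hypothetical OSD-to-OSD flows between two nodes.
    flows = [("10.0.0.1", 40000 + i, "10.0.0.2", 6800 + i) for i in range(8)]
    for flow in flows:
        print(flow, "-> link", pick_link(flow))
    print("aggregate ceiling:", aggregate_ceiling_gbps(flows), "Gbps")
```

A single flow is always capped at 25 Gbps. The aggregate only approaches 50 Gbps when enough distinct flows hash onto both links, which is usually the case with many OSDs and clients talking at once.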

We’re also getting new power added for this project, although that work has not started yet because we are still finalizing the server direction.

Thanks again — this is very helpful.
 
Certainly. I also used a Gold Partner in North America, and they were able to assist with hardware selection, so you might try reaching out to another one. Other comments in the thread also have good advice. On Intel vs. AMD, the Xeon 6 CPUs support a lot more cores than the previous generation, so they seem competitive with AMD now; either option is fine IMO.