Proxmox + Ceph Cluster – Architecture & Technical Validation

prxpb

May 24, 2025
Hi, I'm a Microsoft solutions engineer currently evaluating a migration path for several of our Windows Server clusters (based on Hyper-V and S2D) to Proxmox VE. While I'm experienced in the Microsoft ecosystem, Proxmox and Ceph are not my strong suit, so I'm seeking guidance and validation from those of you with deeper expertise in this area.

Hardware (4 nodes):

  • Dell R750 servers
  • 2 × Intel Xeon Gold 16c/32t CPUs per node
  • 1024 GB RAM per node
  • 16 × 7.68 TB NVMe SSDs (Dell P5500 RI), 100% SSD-based
  • 4 × 25 GbE NICs per node (Mellanox ConnectX-5)
  • Switching: 2 × Dell S5048F-ON
    • 100 GbE interconnect between switches
    • 25 GbE ports to nodes
    • VLT (Virtual Link Trunking) enabled
    • Cross-switch LACP bonding per node


Network & QoS:
  • Bonding: 4 × 25 GbE on each node (bond0)
  • Mode: balance-xor with xmit_hash_policy=layer3+4
  • MTU 8000 across hosts, switches, bridges
  • VLAN-aware configuration
  • QoS via tc + fq_codel, with class-based shaping:
    • Ceph RBD I/O: 50–60% (constant)
    • Ceph Replication: 10–20% (post-backup/snapshot)
    • VM LAN: 20–40% (application traffic)
    • VM Migration: 1–5% (planned)
    • Management: 1–2% (critical, low jitter)
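Roughly what I have in mind for the shaping (a sketch only; the interface name, exact rates and class layout are illustrative, not a final config):

    tc qdisc add dev bond0 root handle 1: htb default 30
    tc class add dev bond0 parent 1:  classid 1:1  htb rate 100gbit
    tc class add dev bond0 parent 1:1 classid 1:10 htb rate 50gbit ceil 60gbit   # Ceph RBD I/O
    tc class add dev bond0 parent 1:1 classid 1:20 htb rate 10gbit ceil 20gbit   # Ceph replication
    tc class add dev bond0 parent 1:1 classid 1:30 htb rate 20gbit ceil 40gbit   # VM LAN (default class)
    tc class add dev bond0 parent 1:1 classid 1:40 htb rate 1gbit  ceil 5gbit    # VM migration
    tc class add dev bond0 parent 1:1 classid 1:50 htb rate 1gbit  ceil 2gbit    # management
    tc qdisc add dev bond0 parent 1:10 fq_codel
    tc qdisc add dev bond0 parent 1:30 fq_codel
    # (fq_codel repeated for the remaining leaf classes; traffic is steered into the
    #  classes with tc filters, e.g. the fw classifier plus firewall marks)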


Ceph Configuration:
  • Version: Ceph Reef
  • 16 OSDs per node = 64 total
  • Pool: Single 3× replicated data pool
  • PG count: 2048 (manually set, autoscaler off)
  • CRUSH: simplified with a single rule
  • Placement algorithm: straw2
  • Object size: 4 MB (considering 16 MB)
  • MTU: 8000
  • Ceph traffic over bond0, separated with QoS
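For reference, the pool is created roughly like this (the pool name is a placeholder):

    ceph osd pool create vmpool 2048 2048 replicated
    ceph osd pool set vmpool size 3
    ceph osd pool set vmpool pg_autoscale_mode off
    ceph osd pool application enable vmpool rbd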


VM Configuration (Proxmox VE 8.4):
  • VirtIO SCSI single
  • iothread=1, discard=on
  • cache=none
  • AIO: io_uring
  • CPU type: host
  • QEMU guest agent: installed on all VMs
  • Guest OS: Windows Server (SQL, RDS, AD, File)
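In qm terms the options above boil down to something like this (VMID, storage and disk name are placeholders):

    qm set 101 --scsihw virtio-scsi-single \
      --scsi0 cephvm:vm-101-disk-0,iothread=1,discard=on,cache=none,aio=io_uring \
      --cpu host \
      --agent enabled=1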

Backup:
  • Handled via Veeam Backup & Replication (VBR)
  • VMs use qemu-agent and VSS for application-consistent snapshots
  • No PBS is planned at this stage

Cluster Management:
  • CPU mitigations disabled (mitigations=off)
  • Scheduler: none (for NVMe)
  • All traffic flows through bond0 with VLAN and QoS class-based separation
  • External qdevice configured for quorum and split-brain protection in 4-node corosync cluster
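For completeness, a sketch of how I plan to wire that up (the quorum host's address is a placeholder):

    # external quorum host (Debian-based):
    apt install corosync-qnetd
    # every PVE node:
    apt install corosync-qdevice
    # then, from one PVE node:
    pvecm qdevice setup 192.0.2.10
    # mitigations=off goes on the kernel command line, e.g. in /etc/default/grub:
    # GRUB_CMDLINE_LINUX_DEFAULT="quiet mitigations=off"   (then update-grub and reboot)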

Questions:
  1. Is a single pool (64 OSDs, 2048 PG) sufficient for mixed workloads, or should I split them?
  2. Would 16 MB object size improve performance for large VMs (esp. SQL/Windows)?
  3. How does Ceph Reef compare to S2D with RDMA (e.g., 4 SSDs per CSV) in latency?
  4. Which Ceph or RBD parameters should I tune for fsync-heavy workloads (SQL, AD)?
  5. Are RBD snapshots with qemu-agent + VSS safe for SQL Server consistency?
  6. Do xmit_hash_policy=layer3+4 and MTU 8000 help with replica distribution and avoiding bottlenecks?
  7. And most importantly: will my Windows Server VMs perform well enough in this setup? I'm genuinely concerned about the performance of my workloads – especially domain controllers, file servers, RDS and SQL – under Ceph compared to what I'm used to on S2D.
Thanks in advance for your time – especially if you're running larger Windows deployments on Proxmox + Ceph.

Best regards,
P
 
You run a setup very similar to mine. I would suggest an odd number of nodes to prevent split-brain scenarios.

We currently run ~250 high-end server and workstation VMs on 7 nodes with 42 NVMe OSDs and have plenty of capacity to spare. If possible, I would go with 2 × 40/100G links to the Ceph nodes rather than 4 × 25G links; LACP and the like is great but adds complexity.

I wouldn’t do traffic shaping; we use multiple VLANs instead, and if necessary you can prioritize per VLAN on the switch, though I haven’t seen the need for that yet. Shaping on the client end adds latency: you want to pump out packets as fast as possible, not run them through complex code paths, so we see better results using the simplest schedulers across the board (CPU, disk, net), disabling anything beyond S1 in the CPU, and so on.

Use as large an MTU as your switch permits and as is reasonable for your workload; 9000 is the standard, and even 16k is possible. For a small system like this I haven’t seen any reason to tune Ceph: your network links are incapable of stressing the CPU, and again, you’re adding complexity. With tuning you often just end up trading numbers at the edges of the performance statistics (will you truly have workloads near 100G throughput per device?), which is counterproductive if you have a varied workload.

For modern loads you can set the disk cache to write-back and let VirtIO handle the rest. As for RDMA on low-speed links: unless you’re doing GPUDirect or similar integrations (there are vendors out there), I’m not seeing any benefit. According to a few engineers I’ve spoken to, RDMA becomes useful when you are pushing at the boundaries of 40-100G and can optimize end to end; for low-end database loads like SQL and AD you won’t notice it.

For the switches, we used 4 × 100G between them, of which Ceph can push about 60% in my case when benchmarking. I’m assuming you’re doing redundant LACP to each switch as well. xmit_hash_policy layer3+4 is not ‘standards approved’; I believe the Dells do L2+3 by default and we leave it at that, although nothing seems to break at layer3+4. There is some overhead, whereas L2+3 will be offloaded to the NIC.
 
Is a single pool (64 OSDs, 2048 PG) sufficient for mixed workloads, or should I split them?
Completely up to you. You can have multiple pools with multiple CRUSH setups using the same disks, but generally speaking, unless you have different OSD classes, a single pool is likely what you want.
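For what it's worth, 2048 also matches the usual rule-of-thumb target of roughly 100 PGs per OSD: (64 OSDs × 100) / 3 replicas ≈ 2133, rounded down to the nearest power of two gives 2048.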

Would 16 MB object size improve performance for large VMs (esp. SQL/Windows)?
SQL wants smaller object sizes (like 16k), but trial and error will show you what yields the best results. The good news is that you can set that per RBD image, so you won't need to mess with the pool while testing. You can (and should) deploy a separate RBD image for your SQL payload.
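If you want to experiment, something along these lines should work (pool, image name and size are placeholders; note that PVE creates its RBD images with the default object size, so for testing you would create the image manually):

    rbd create vmpool/sql-data --size 500G --object-size 16K
    rbd info vmpool/sql-data    # should report the chosen object size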

How does Ceph Reef compare to S2D with RDMA (e.g., 4 SSDs per CSV) in latency?
In my experience, not well. A properly configured S2D setup will outperform Ceph. There is also the matter that you could be running your SQL natively.

Are RBD snapshots with qemu-agent + VSS safe for SQL Server consistency?
That is a fantastic question, although not as asked: RBD snapshots don't care about the guests at all; QEMU quiescence is the key function here. I'm not sure whether the qemu-agent freeze/quiesce makes MSSQL commit to disk or not. I'd look at the qemu-guest-agent documentation (sorry, I'm a terrible cleaning lady; I don't typically do Windows).

Do xmit_hash_policy=layer3+4 and MTU 8000 help with replica distribution and avoiding bottlenecks?
So, a couple of things about your network assumptions. While you COULD make a single LAG with all 4 of your interfaces, I wouldn't advocate for it. I would suggest 2x2 or 2+1+1 depending on your switch arrangement. This allows greater separation of traffic types and lets you use each interface's queues independently (LAGs are great in aggregate, but they have the latency profile of a single interface). Having multiple interfaces also lets you assign MTUs more advantageously: MTU 9000 for your Ceph interfaces, 1500 for everything else.
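Very roughly, a 2x2 split in /etc/network/interfaces could look like this (NIC names and the address are assumptions, not your actual hardware):

    auto bond0
    iface bond0 inet static
        address 10.10.10.11/24            # Ceph public/cluster
        bond-slaves ens1f0 ens1f1
        bond-mode 802.3ad
        bond-xmit-hash-policy layer2+3
        bond-miimon 100
        mtu 9000

    auto bond1
    iface bond1 inet manual               # VM bridge / management uplink
        bond-slaves ens2f0 ens2f1
        bond-mode 802.3ad
        bond-xmit-hash-policy layer2+3
        bond-miimon 100
        mtu 1500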

will my Windows Server VMs perform well enough in this setup?
Yes. No. Maybe. What's 'well enough'?
 
Thanks for your reply — I really appreciate the insights. I have a few follow-up questions and comments:

  1. I’m planning to use a qdevice, so I believe the risk of split-brain in a 4-node setup is effectively mitigated.
  2. At the moment, I can't change my switches or NICs. I have two ToR switches interconnected with 4×100 GbE links, and each server connects via 4×25 GbE.
  3. Isn’t combining cache=writeback with Ceph a serious risk of data loss in case of a node failure? Since Ceph isn't aware of the buffered writes in QEMU RAM, wouldn't that create potential inconsistency for the VMs?
  4. My switches fully support xmit_hash_policy=layer3+4. Wouldn't that provide better distribution of traffic across the bonded interfaces compared to L2+3?
  5. I'm absolutely able to configure MTU 9000, but I initially chose a more conservative value (8000) due to header overhead and fragmentation safety.
  6. Thanks for your comments regarding CPU tuning — I agree it makes little sense in this context, and I'm skipping it.
  7. Are you running similar workloads on Windows (e.g. SQL Server, AD, SMB, RDS) in your setup? Just curious if your conclusions are based on Linux-only workloads or also enterprise Windows services.
  8. Regarding RDMA — I was referring specifically to S2D, where SMB Direct is leveraged. To my knowledge, Proxmox + Ceph doesn’t support RDMA for now (even in Ceph 19 it's still not officially functional).
 
As for "well enough": I mean performance comparable to what we achieve with S2D on identical hardware and under the same types of workloads that I'm running.
 
cache=writeback is only a problem if the OS issues async writes where it should issue sync writes. Sync writes stay sync, async writes stay async. You are confusing it with cache=writeback (unsafe), which always treats requests as async. Ceph provides a raw device, so Ceph is aware of whatever the host asks it to do. You should test with your workload, because there are some benefits to ignoring the page cache, especially if your guest OS also does read caching; for that reason we disable the cache in Windows (this gives a better idea of what your VM actually needs in memory). We have had hard crashes and neither Windows nor Linux ended up inconsistent (not even SQL).
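If you want numbers for your own case, a simple fsync-heavy fio run inside a test VM on an RBD-backed disk is the quickest check (file name and size are placeholders):

    fio --name=fsync-test --filename=testfile --size=4G \
        --rw=randwrite --bs=8k --direct=1 --fsync=1 --iodepth=1 --numjobs=1 \
        --runtime=60 --time_based --group_reporting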

We do run a variety of loads, including MSSQL, Postgres, MySQL, IIS, a ton of Windows VDI and some Linux loads.

I'm not sure about your switch supporting L3+4; check the docs on how to do that, because I have a dual-ToR switch from the exact same family (36 × 100G). They do support an 'even larger' MTU (12000), so 9000 is definitely doable and is the 'standard' for jumbo frames. For Ceph you already need to distribute traffic to 3+ other NICs, so L2+3 LACP is probably optimal; L3+4 is only useful for one-to-one host communication, requires 'more hashing' that won't be offloaded to the NIC, and you then run the risk of out-of-order packets, which are much more expensive to resolve.

Ceph RDMA: people have tried, there is some basic support and NVIDIA has patches, but yeah, it's definitely not for the faint of heart. And as I said, for current loads it's not really necessary.

I’ve heard really bad things about Storage Spaces; I know people who have run it, and every single instance has so far managed to crash and burn, taking all the data with it. You get really good performance from /dev/null too, but I wouldn’t recommend it as a storage solution.
 
At the moment, I can't change my switches or NICs. I have two ToR switches interconnected with 4×100 GbE links, and each server connects via 4×25 GbE.
That's sad; your bottleneck will be the network. Theoretically, one PCIe 5.0 NVMe will outperform a single 100 GbE link, so two of your PCIe 4.0 NVMes will already outperform your bonded 4×25 GbE network even when given the whole bandwidth, which you haven't assigned them.
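Rough numbers behind that (ballpark figures): a PCIe 4.0 x4 NVMe tops out around 7-8 GB/s, i.e. roughly 56-64 Gbit/s, so two of them already exceed the ~100 Gbit/s (≈12.5 GB/s) of the 4×25 GbE bond, and that bond is shared by Ceph, VM and backup traffic.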

I’m planning to use a qdevice, so I believe the risk of split-brain in a 4-node setup is effectively mitigated.
This only applies to PVE, not Ceph. An even number of Ceph nodes is not recommended: 3 Ceph nodes is the absolute minimum, 5 the recommended minimum. Please compare with Udo's Ceph guide:

 
Thanks for your perspective. This is the second time I’m hearing about L2+3, so I’ll thoroughly test these settings and follow through with them. We’re also experiencing some instability on the S2D side, which is why we’re considering a change.
 
I use PCIe 4.0 U.2 NVMe drives, so they’re slower than PCIe 5.0 NVMe drives. Additionally, since Ceph distributes data across nodes, in practice the network traffic between nodes won’t utilize the full I/O bandwidth of each individual drive simultaneously.

My Ceph quorum will have only 3 monitors (MON), which is the recommended setup, so I don't see any issue with that.
 
The bottleneck will still be the network, and you don't even need NVMe for that: SATA SSDs would be sufficient, and the network might still be the bottleneck. Keep in mind that Ceph will not automatically read data locally even if a copy is stored locally; it always distributes the reads across all copies, which is not what you want when the local storage has 8 times more bandwidth.
 