Hi everyone,
I’d like to ask for advice on improving user experience in two Windows Terminal Servers (around 15 users each, RDP/UDP).
After migrating from two standalone VMware hosts (EPYC 9654, local SSDs) to a Proxmox + Ceph cluster, users feel sessions are slightly slower or less responsive.
Current Infrastructure
Cluster: Proxmox VE 8.4.1 + Ceph Squid 19.2.1
Backup: Separate PBS node (HDD SAS pool).
Storage: 3 Ceph storage nodes (each 4× NVMe U.2 3.84 TB) + HDD nodes for PBS.
Main compute nodes:
- main1 and main2 (Dell R750 / Oracle X8-2L)
- CPUs: dual Xeon Platinum 8368 / 8270CL
- RAM: 512 GB ECC DDR4 (not all memory channels populated yet; NUMA layout check sketched after this list)
- Boot: local SSD
- Data: Ceph NVMe pool (MTU 9000, dedicated VLAN)
- Optional local RAID10 array (12 Gb/s SATA SSDs)
- Tesla M10 GPU available (not installed yet)
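Since both hosts are dual-socket and memory isn't fully populated yet, one thing I want to rule out is a Terminal Server VM having its vCPUs and RAM split across sockets. A minimal sketch to dump the host NUMA layout (plain sysfs reads, nothing Proxmox-specific), to compare against the VM CPU/NUMA settings:

```python
#!/usr/bin/env python3
"""Rough sketch: print each NUMA node's CPUs and memory on a Proxmox host.

Plain sysfs reads only; run directly on main1 / main2.
"""
import glob
import os
import re

for node_dir in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
    node = os.path.basename(node_dir)
    with open(os.path.join(node_dir, "cpulist")) as f:
        cpus = f.read().strip()
    mem_kb = 0
    with open(os.path.join(node_dir, "meminfo")) as f:
        for line in f:
            m = re.search(r"MemTotal:\s+(\d+)\s+kB", line)
            if m:
                mem_kb = int(m.group(1))
                break
    print(f"{node}: CPUs {cpus}, {mem_kb / 1024 / 1024:.1f} GiB RAM")
```

If a TS VM turns out to span both nodes, enabling NUMA in the VM config and keeping its vCPUs and memory within one socket seems worth testing.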
Networking (current state):
- Full 2× 10/40 Gb fabric with LAGG
- Core: dual Mellanox SX6036 in MLAG (LACP L3+L4, MTU 9000)
- Routers: two UDM Pro in HA, each connected via SFP+ 10 Gb breakout (active to SX6036-A, passive to SX6036-B)
- Networks:
- VLAN 1 → Ceph (private, MTU 9000; jumbo-frame check sketched after this section)
- VLAN 120 → Cluster network / Corosync, backups
- VLAN 1020 → VM traffic (10.20.0.0/24, MTU 9000)
- VLAN 10 → Management (ILOM / iDRAC)
This setup replaced a previous Cisco SG350XG / SG550XG fabric with LAGs and an ER8411 router (4× 10 Gb LAGG to all nodes).
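Since Ceph and VM traffic both rely on MTU 9000 across the new MLAG fabric, one thing worth ruling out is a jumbo-frame mismatch on some hop (it tends to show up as sluggish interactive traffic rather than failed throughput tests). A minimal check sketch; the peer list is a placeholder, not my real addressing:

```python
#!/usr/bin/env python3
"""Sketch: confirm jumbo frames (MTU 9000) pass unfragmented to each peer.

PEERS below are placeholder addresses -- substitute the Ceph / VM VLAN hosts.
8972 = 9000 MTU - 20 (IPv4 header) - 8 (ICMP header).
"""
import subprocess

PEERS = ["10.20.0.1", "10.20.0.2"]  # placeholders

for ip in PEERS:
    # -M do forbids fragmentation; -s 8972 sizes the ICMP payload for MTU 9000
    result = subprocess.run(
        ["ping", "-M", "do", "-c", "3", "-s", "8972", ip],
        capture_output=True, text=True,
    )
    status = "OK" if result.returncode == 0 else "FAIL (MTU mismatch or drop?)"
    print(f"{ip}: {status}")
```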
Workloads
- 2 × Windows Terminal Servers (≈15 users each, RDP over UDP)
- Several Linux database VMs
The environment is stable and throughput tests are strong, but RDP sessions feel less responsive than before.
Questions
- Memory channels: Will fully populating all RAM channels (on both main hosts) noticeably improve responsiveness or latency for RDP sessions? (DIMM-slot check sketched after this list.)
- GPU: If I install the Tesla M10 (for basic VDI / RDP graphics, no vGPU GRID), how much real-world improvement can I expect?
- Storage: Would moving the Terminal Server VMs from Ceph NVMe to a local SSD RAID10 array improve user experience, or is latency similar?
- Other tunings: Any Ceph / VirtIO / network settings that had the biggest impact for you on RDP smoothness or UI responsiveness?
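On the memory-channel point, a quick way to confirm which DIMM slots are actually populated today, before ordering modules, is to parse dmidecode on each host. A rough sketch, assuming dmidecode is installed and it is run as root:

```python
#!/usr/bin/env python3
"""Sketch: count populated vs. empty DIMM slots via `dmidecode -t memory` (run as root)."""
import subprocess

out = subprocess.run(
    ["dmidecode", "-t", "memory"], capture_output=True, text=True, check=True
).stdout

populated, empty = [], []
slot, size = None, None
for raw in out.splitlines():
    line = raw.strip()
    if line.startswith("Locator:"):
        slot = line.split(":", 1)[1].strip()
    elif line.startswith("Size:"):
        size = line.split(":", 1)[1].strip()
    elif line == "" and slot is not None:
        # End of a "Memory Device" block: file the slot as populated or empty.
        target = empty if size in ("No Module Installed", "Unknown") else populated
        target.append((slot, size))
        slot, size = None, None

print(f"Populated slots: {len(populated)}   Empty slots: {len(empty)}")
for slot, size in populated:
    print(f"  {slot}: {size}")
```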
Ceph benchmarks, CPU stress tests, and FIO runs all return solid and expected results. All NVMe drives (Oracle U.2 models) show 0% wear.
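Since the benchmarks look solid while sessions still feel sluggish, the number I plan to compare next is small synchronous-write latency (p99, not bandwidth) on the Ceph-backed disk versus the local RAID10. A minimal sketch; the file paths passed on the command line are placeholders for a scratch file on each storage type:

```python
#!/usr/bin/env python3
"""Sketch: 4 KiB synchronous-write latency percentiles for one or more paths.

Usage (paths are placeholders for a scratch file on each storage type):
    python3 sync_lat.py /mnt/ceph-disk/probe.tmp /mnt/local-raid10/probe.tmp
"""
import os
import statistics
import sys
import time

def probe(path, iterations=500):
    buf = os.urandom(4096)
    lat_ms = []
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)
    try:
        for _ in range(iterations):
            t0 = time.perf_counter()
            os.pwrite(fd, buf, 0)   # rewrite the same 4 KiB block
            os.fsync(fd)            # wait for it to hit stable storage
            lat_ms.append((time.perf_counter() - t0) * 1000)
    finally:
        os.close(fd)
        os.unlink(path)
    cuts = statistics.quantiles(lat_ms, n=100)
    return statistics.median(lat_ms), cuts[94], cuts[98]  # p50, p95, p99

for path in sys.argv[1:]:
    p50, p95, p99 = probe(path)
    print(f"{path}: p50={p50:.2f} ms  p95={p95:.2f} ms  p99={p99:.2f} ms")
```

fio with --sync=1 --rw=randwrite --bs=4k --iodepth=1 gives the same signal with more control; either way, p99 is the number to compare.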
This weekend I plan to:
- Upgrade the cluster to Proxmox 9.
- Fully populate memory with 8× 64 GB DDR4-3200 modules on both main1 and main2.
- Migrate the Terminal Server VMs to local RAID10 SSD storage on main1.
- Dedicate main1 exclusively to the Windows Server ecosystem (DC, TS, UPD, DFS, etc.), splitting the services across several VMs.
- Use main2 for database workloads.
- Install the Tesla M10 GPU on main1 and configure GPU passthrough for the Terminal Server VMs (IOMMU check sketched below).
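Before the passthrough step it seems cheap to confirm that the IOMMU is active on main1 and that the M10 (a quad-GPU card, so expect four functions) sits in clean IOMMU groups. A minimal sketch that just walks sysfs on the host:

```python
#!/usr/bin/env python3
"""Sketch: list IOMMU groups and the PCI devices in each (run on the Proxmox host).

An empty /sys/kernel/iommu_groups means the IOMMU is not enabled yet
(intel_iommu=on on these Intel hosts).
"""
import glob
import os

groups = sorted(glob.glob("/sys/kernel/iommu_groups/*"),
                key=lambda p: int(os.path.basename(p)))
if not groups:
    print("No IOMMU groups found -- enable the IOMMU in the kernel cmdline first.")

for group in groups:
    entries = []
    for dev in sorted(os.listdir(os.path.join(group, "devices"))):
        dev_path = f"/sys/bus/pci/devices/{dev}"
        with open(os.path.join(dev_path, "vendor")) as f:
            vendor = f.read().strip()
        with open(os.path.join(dev_path, "device")) as f:
            device = f.read().strip()
        entries.append(f"{dev} [{vendor}:{device}]")
    print(f"group {os.path.basename(group)}: " + ", ".join(entries))
```

NVIDIA devices show vendor ID 0x10de, which makes the M10's functions easy to spot in the output.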
Thanks a lot for any insights or tuning recommendations!