Hardware advice (or "questionable Proxmox performance on nice box")

ronejamesdo

Hi.

We have a PowerEdge that we installed in the hopes of abandoning ESXi for Proxmox, but it seems to be having performance issues under relatively light loads. Can anyone recommend different hardware going forward (that's working better for them)? Or are there Proxmox configuration issues that I can investigate?

Dell PowerEdge R740XD
Processors: 2 x 18-core Xeon Gold 6140 2.3GHz
Memory: 12 x 32GB PC4-21300-R ECC (DDR4-2666) (384GB total)
Drives: 2 x 500GB, 8 x 4TB
RAID Controller: PERC H740P, 8GB cache

(All drives are SSD, the 500GB are the OS and the 4TBs are for the guests.)

Initially we installed this with ZFS, but as we moved forward we found that performance during backup restoration was untenable (we do not have Proxmox Backup Server). Restoring a backup of a virtual machine would freeze some or all of the other VMs for 40-45 minutes; load ran away on Proxmox, and on a couple of occasions we had to reboot the host. The backups are made to an external box, but the freeze did not happen during the 1%-2%-3% part of the file transfer; it happened in the long hang step after 100%. Messing with "bandwidth" only seemed to slow down the 1%-2%-3% part, and there were lesser stalls while taking backups and restoring snapshots as well. (And frankly, ZFS is confusing.)
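
(For reference, the "bandwidth" setting I was fiddling with is, as far as I can tell, the bwlimit that can be set globally in /etc/vzdump.conf or per storage; the numbers below are just example values in KiB/s:)

# global backup/restore bandwidth cap (example value, KiB/s)
echo "bwlimit: 100000" >> /etc/vzdump.conf

# or per storage, limiting restores specifically ("vmstore" is a made-up storage name)
pvesm set vmstore --bwlimit restore=100000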

We wiped it and started over, just relying on the PERC RAID (as I recall, to make ZFS work we had to set the controller to "Enhanced HBA" mode to bypass the Dell PERC card's RAID).

Now (using .qcow2) I have begun migrating Windows 10 remote-desktop boxes by copying them from ESXi and upgrading them to Windows 11 (which our ESXi does not natively support). Backup restoration now seems to freeze my Windows 11 test user for 30-40 seconds, which is not great for business hours but is definitely more workable than 40 minutes and a possible reboot. (Snapshots also work much more nicely with .qcow2 in that they allow hopping around between them.) Everything else works fine procedure-wise, but now I am hearing rumblings about the performance of the VMs.

Currently there are 16 VMs on this Proxmox.

4 - Windows 11, regular use
1 - Windows 10, regular use (starting today, though)
4 - Windows 11, little or no use (test installs/migrations of boxes that are no longer or only occasionally used, done as practice for the Windows 10 to 11 upgrades)
1 - Windows Server 2019 SQL server (unused, no databases; also a migration test)
1 - Rocky Linux server (our virus scanner's network controller; it sees very little use but presumably does some work in the background)
4 - Ubuntu server tests (minimal, but some, use)
1 - Ubuntu production server (our least-used one, chosen for that reason as a test)

The Windows 10 and 11 users are now complaining about the performance of their boxes, with what look to them like system-resource issues (long times to open windows, Task Manager never comes up, etc.). These complaints particularly align with migrations of new boxes (which I now do after hours), but their new VMs are generally performing worse than their old Windows 10 VMs did on ESXi 6, which ran on a fairly similar box:

PowerEdge R730xd
Processors: 2 x 14-core Xeon E5-2695 v3 2.20GHz
Memory: 16 x 16GB PC4-17000 (DDR4-2133) (256GB total)
Drives: 12 x 2TB SATA SSD 2.5" 6Gb/s
RAID Controller: PERC H730P Mini, 2GB cache

(And I was hoping to be able to empty that ESXi box on to this Proxmox machine, then wipe it, and put Proxmox on it too.)

With no one in yet this morning, I see load averages around 7 in top. When that gets to 12-16 or so, I will start getting complaints.

While writing this I noticed that one of those Ubuntu server tests (which I haven't looked at in months) was in a kernel panic, using 13% of its CPU and, more importantly, 6GB of the host's 8GB of swap (not the guest's own swap, but Proxmox's). Shutting that VM down brought the host's swap usage, and apparently the load, down. It was PID 3068914 here:

for f in /proc/*/status; do awk '/^Name:/ {name=$2} /^Pid:/ {pid=$2} /^VmSwap:/ {swap=$2} END{ if(!swap) swap=0; print pid, name, swap }' "$f"; done | sort -k3 -n -r | awk '$3>0 {printf "%s %s %s kB\n",$1,$2,$3}'

3068914 kvm 6383808 kB
4084100 kvm 730368 kB
247463 kvm 628956 kB
2784890 kvm 122688 kB
1804 pve-ha-crm 88128 kB
1989 pvescheduler 72000 kB
1814 pve-ha-lrm 57600 kB
2249270 pvedaemon 49968 kB
2195454 pvedaemon 49968 kB
3614976 pvedaemon 49536 kB
2092611 pvedaemon 47692 kB
1772 pvestatd 44928 kB
1768 pve-firewall 41472 kB
247554 kvm 20736 kB
158285 kvm 17044 kB
250795 kvm 16128 kB
248071 kvm 13824 kB
812854 kvm 10944 kB
1955101 kvm 5760 kB
1536755 kvm 4032 kB
1636617 kvm 2304 kB
1622629 (sd-pam) 704 kB
2026 esxi-folder-fus 576 kB
1652 pmxcfs 576 kB
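
(In case it helps anyone reading along: as far as I can tell each running VM's kvm PID is recorded in a VMID-named pidfile under /run/qemu-server, so a big swap user like that can be mapped back to its VM like this, using the PID from the list above:)

# prints the matching pidfile, e.g. /run/qemu-server/<VMID>.pid, i.e. the VM that owns that kvm process
grep -lx 3068914 /run/qemu-server/*.pid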

And now my load seems more reasonable for basically no users (so maybe that was a big part of it?).

Still, I am concerned that I have chosen the wrong server hardware or Proxmox configuration for this. I was planning to move another 17 or so VMs onto this box (to clear out the original ESXi machine), and though only 3 of them see real traffic, almost all of them are production for someone (even if their use is low).

Can anyone recommend different hardware going forward (that's working better for them)? Or are there Proxmox configuration issues that I can investigate? As it stands I can only migrate anything in the middle of the night, and I am now scared to take a snapshot, much less attempt a backup restore, during business hours. Even though this was a much bigger problem with ZFS before, disrupting desktop users is immediate trouble for everyone, and if I am disrupting the desktops then I am presumably also causing server slowdowns that I won't hear about right away but that will come back later as "it's been slower the last few months".
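
(In case it is useful to anyone answering: while one of these events is happening I can watch the disk side with plain iostat from the sysstat package, e.g.:)

apt install sysstat
# watch utilization and the await columns on the PERC-backed device while a snapshot or restore runs
iostat -xm 2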

Thanks.

RJD
 

Attachments

  • Screenshot 2025-10-01 at 9.10.55 AM.png (148.8 KB)
  • Screenshot 2025-10-01 at 9.14.47 AM.png (104.6 KB)
  • Screenshot 2025-10-01 at 9.19.49 AM.png (111.1 KB)
  • Screenshot 2025-10-01 at 9.40.04 AM.png (198.2 KB)
Disabled power saving states on the Dell? Firmware updated?
SSDs are enterprise or desktop versions?
I have a feeling you have a disk problem, missing virtio drivers in the Windows VMs, etc., but nothing concrete; no VM config or pve versions were posted.
Maybe install netdata and check the problematic time window.
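
Something like the following would give people enough to go on (VMID 101 is just an example, and the netdata line is their standard kickstart script, if I remember it right):

pveversion -v                 # Proxmox package versions
qm config 101                 # config of one of the slow Windows VMs
cat /etc/pve/storage.cfg      # storage layout

wget -O /tmp/netdata-kickstart.sh https://get.netdata.cloud/kickstart.sh && sh /tmp/netdata-kickstart.sh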
 
> The Windows 10 and 11 users are now complaining about the performance of their boxes, with what look to them like system-resource issues (long times to open windows, Task Manager never comes up, etc.). These complaints particularly align with migrations of new boxes (which I now do after hours), but their new VMs are generally performing worse than their old Windows 10 VMs did on ESXi 6

Host interactive VMs on ZFS mirrors with enterprise-class SSD backing storage.

https://github.com/kneutron/ansitest/blob/master/winstuff/noatime.cmd

Get the largest SSDs you can afford, and you may need to set them back up as (I would say) a pool of no more than 3 mirror vdevs (6 disks) if you're doing one large pool for VMs. If users still complain about response times, split them into 3 separate pools of 2 mirrored disks each:

Usergroup1: zfs mirror e.g. 2x4TB SSD
Usergroup2: same
Usergroup3: same
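
A single pool laid out as striped mirrors, as described above, would look roughly like this (pool and device names are placeholders; with the PERC in HBA mode, use the /dev/disk/by-id/ names for the real thing):

# 3 mirror vdevs x 2 disks = one 6-disk pool of striped mirrors
zpool create -o ashift=12 vmpool mirror /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd mirror /dev/sde /dev/sdf

The 3-separate-pools variant is the same command run three times with two disks each.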

It may also be worthwhile to run a consistency check in Windows to see if there are any corrupted files ( sfc /scannow as admin + dism )

https://github.com/kneutron/ansitest/blob/master/winstuff/sfc-scannow-dism.cmd
 
> Disabled power saving states on the Dell? Firmware updated?
> SSDs are enterprise or desktop versions?
> I have a feeling you have a disk problem, missing virtio drivers in the Windows VMs, etc., but nothing concrete; no VM config or pve versions were posted.
> Maybe install netdata and check the problematic time window.

Thank you for your reply. So that I am clear:

I should disable power-saving states in the Dell BIOS (i.e., leaving them enabled might be the problem)?

I should update all Dell firmware to the newest (or are there known version issues with Proxmox)?

The big drives are desktop models, I think: Samsung 870 QVO (560/530 MB/s). (The two 500GB OS drives are Samsung 870 EVO with similar read/write specs. My ESXi box uses the same Samsung 870 QVO drives in its PERC RAID and has no such problems.)

My occasional Windows 11 user tried the virtio drivers during the ZFS period, but I don't think they noticed much change (I will try that on the Windows guests for sure, though).

What does "no VM config/pve versions written" mean?

I am looking into netdata now.

Thank you.
 
> > The Windows 10 and 11 users are now complaining about the performance of their boxes, with what look to them like system-resource issues (long times to open windows, Task Manager never comes up, etc.). These complaints particularly align with migrations of new boxes (which I now do after hours), but their new VMs are generally performing worse than their old Windows 10 VMs did on ESXi 6
>
> Host interactive VMs on ZFS mirrors with enterprise-class SSD backing storage.
>
> https://github.com/kneutron/ansitest/blob/master/winstuff/noatime.cmd
>
> Get the largest SSDs you can afford, and you may need to set them back up as (I would say) a pool of no more than 3 mirror vdevs (6 disks) if you're doing one large pool for VMs. If users still complain about response times, split them into 3 separate pools of 2 mirrored disks each:
>
> Usergroup1: zfs mirror e.g. 2x4TB SSD
> Usergroup2: same
> Usergroup3: same
>
> It may also be worthwhile to run a consistency check in Windows to see if there are any corrupted files (sfc /scannow as admin, plus DISM)
>
> https://github.com/kneutron/ansitest/blob/master/winstuff/sfc-scannow-dism.cmd

We got away from ZFS because this box was working so poorly with it (and it has been better since). If I started all over again (which I think your solution implies) and made separate ZFS pools, wouldn't that just spare the users in groups 2 and 3 from the disruption when I restore something in group 1?

The drives are Samsung 870 QVO (which are desktop drives), but these boxes (and several more, including servers) were on ESXi 6 until a week ago and worked fine. Why would Proxmox (or ZFS, for that matter) perform so differently on similar hardware?

(One of the reasons I am anxious to get everything moved off the ESXi 6 machine is that if I recycle its hardware for Proxmox and it works fine, then I have demonstrated that something is actually wrong with the current Proxmox hardware that I have yet to diagnose. If it also performs poorly, then something must be wrong with my configuration of Proxmox or its compatibility with this hardware.)

Thanks for the reply.
 
> If I started all over again (which I think your solution implies) and made separate ZFS pools, wouldn't that just spare the users in groups 2 and 3 from the disruption when I restore something in group 1?

That would be another net benefit, yes. Isolating different groups of users to different pools should speed things up in general and minimize I/O contention.

> The drives are Samsung 870 QVO (which are desktop drives)

Search the forums for QLC / quad-level cell; there are horror stories and multiple recommendations to stay away from them.

ESXi and Proxmox are two different things, and the paradigms differ. Desktop-grade drives are NOT rated to run 24/7 under a hypervisor.
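
You can see the difference yourself: consumer QLC drives collapse on sustained small sync writes, which is exactly what a hypervisor generates. A quick file-based fio run along these lines shows it (the path and sizes are just examples, and the test file can be deleted afterwards):

fio --name=synctest --filename=/var/lib/vz/fio-test.bin --size=4G --rw=randwrite --bs=4k --ioengine=psync --fsync=1 --runtime=60 --time_based --group_reporting

An enterprise SSD with power-loss protection will typically show far higher IOPS on that test than an 870 QVO once the QVO's SLC cache is exhausted.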


https://search.brave.com/search?q=w...summary=1&conversation=eb5b49dfa1d48d49df9ce1
 
Also, to confirm: taking a snapshot jumped the load from 7 to as high as 17 (around 12 on average, I would say). I tried logging into a system at that time and sat at a blank black remote-desktop screen for at least a full minute (and that was one with virtio installed), all during the disk part of the operation. The snapshot took about 4 minutes, and I would say the Windows VMs at least were pretty useless for 3 of those.

It definitely feels "disky", but neither ZFS nor PERC RAID with qcow2 really fixes it, and the same kind of drives are fine and fast in ESXi.

I mean, I guess it's possible those drives don't really work that well in ESXi either and only appear to for this comparison. ESXi does snapshots in a different way.

I see that my Proxmox snapshot (of a machine with 16GB and 60GB virtual drives and 32GB RAM) wrote a 6.5GB .raw file. Maybe if I copied or wrote 6.5GB on my ESXi box all at once the same thing would happen.
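
(If that 6.5GB .raw file is the saved RAM state, which is what it sounds like to me, then a snapshot without RAM should skip that write entirely; something like this, where 101 is just a placeholder VMID:)

# snapshot disk state only, without dumping the guest's RAM to a vmstate volume
qm snapshot 101 before-upgrade --vmstate 0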

ESXi writes small -delta.vmdk files when a snapshot is taken, which are then populated with disk changes (I assume). It is notoriously draggy to leave a snapshot in place indefinitely (and slow to merge that data when a snapshot is removed after a long time / lots of changes). Still, I have only really observed that affecting the client the snapshot was taken of, not every other box (or at least not the other Windows boxes).

Other people can take snapshots or restore backups on the same ZFS pools or RAID arrays, writing similarly sized files on Proxmox, and do not simultaneously see this in their Windows clients... right? (I haven't actually established that this isn't normal for the product, since I'm relatively new to it.)

Thanks.