Proxmox vs ESXi Storage Performance - Tuning iSCSI?

PwrBank

Nov 12, 2024
Hello,

I'm trying to evaluate the storage performance differences between ESXi and Proxmox, and I'm having some trouble identifying where the performance issues are. Using the same hardware between tests, I'm getting drastically different results when comparing the two hypervisors. What can I look into to increase performance on NFS or iSCSI? I'm kind of disappointed with having no ability to create snapshots on iSCSI, and with its current performance, I don't think NFS is a viable option.

Hardware:
Dedicated 25GbE cards are installed in the test machine and the SAN. When using iSCSI, the connection is fully multipathed and verified on both ends to be using all paths. When using NFS, the 25GbE connections are bonded with LACP using balance-rr.
All-flash network appliance with dual controllers, two 25GbE ports each; all four NICs are used for multipath.

All tests were run using the same fio scripts on the same VM, which was transferred between Proxmox and ESXi using Veeam.



Here are the test results:
(The VMware NFS result is not directly comparable, as that link was only 10GbE at the time, but based on the IOPS I suspect it is much faster than Proxmox.)


Setup                | Write IOPS | Read IOPS | Write Throughput (MB/s) | Read Throughput (MB/s)
Proxmox iSCSI LVM    | 60000      | 54000     | 2300                    | 2000
Proxmox iSCSI Direct | 84200      | 85600     | 2300                    | 1000
Proxmox NFS          | 48400      | 17600     | 1870                    | 181
VMware iSCSI         | 54700      | 107000    | 2800                    | 5530
VMware NFS           | 46300      | 53400     | 1160                    | 1160




Is there any tuning that can be done to increase performance on either NFS or iSCSI?

I've already changed the iSCSI tuning settings to reflect this post, with no change in performance:
https://forum.proxmox.com/threads/s...th-dell-equallogic-storage.43018/#post-323461
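(For context, the kind of iSCSI tuning usually meant here lives in /etc/iscsi/iscsid.conf on the PVE host; the values below are only an illustrative sketch, not necessarily what that thread recommends.)

Code:
# /etc/iscsi/iscsid.conf -- illustrative session tunables, not a recommendation
node.session.cmds_max = 1024
node.session.queue_depth = 128
node.conn[0].iscsi.MaxRecvDataSegmentLength = 262144

Changes like these only take effect after the sessions are logged out and back in (iscsiadm -m node --logoutall=all followed by iscsiadm -m node --loginall=all).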

Any help would be appreciated.
 
Hi @PwrBank , welcome to the forum.

You may want to take a look at these knowledgebase articles we wrote, covering block storage and performance, for some tips:

Low Latency Storage Optimizations for Proxmox, KVM, & QEMU:
https://kb.blockbridge.com/technote/proxmox-tuning-low-latency-storage/index.html

Optimizing Proxmox: iothreads, aio, & io_uring
https://kb.blockbridge.com/technote/proxmox-aio-vs-iouring/index.html

Proxmox vs. VMware ESXi: A Performance Comparison Using NVMe/TCP:
https://kb.blockbridge.com/technote/proxmox-vs-vmware-nvmetcp/


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
I am in the very long process of migrating myself, my company and my customers away from VMware. I was a VMware aficionado for well over 20 years.

VMware nailed it when they created VMFS and their clustering stuff back when it was still Linux under the hood. MS created an almighty bodge with their clustering effort but got away with it. Proxmox and the like don't bother with the shared monolithic block device thing as such, which is why you find yourself pissed off with iSCSI SANs not being able to do snapshots.

I've reconciled myself (and quite a few customer budgets) with going "hyper-converged". If the cluster hosts have enough slots, then populate them with SAS SSDs, do Ceph, and dump the SAN or use it for backups or whatever. You need at least three nodes and 10Gb+ networking for Ceph. With three nodes, two NICs per host and some careful networking are enough to go without a switch.

Once you get away from the SAN thing (call it legacy if it makes you feel better) you really release yourself! A Proxmox cluster with Ceph on local flash storage compares rather well with a SAN. The VMs are always on local storage; Ceph ensures that they are replicated, and the Proxmox clustering ensures availability.

Note that even if you don't do shared storage, you still have the equivalent of "Storage vMotion" available out of the box. Whilst I'm writing this post, I am continuing to migrate a small business customer from VMware to Proxmox - a single ESXi host.

I delivered a migration box to site a couple of days ago with PVE on it. I fixed up the networking, iDRACs etc and mounted the VMFS on the PVE box. My colleagues migrated the VMs over last night and sorted out virtio drivers etc. Backups (Veeam) were fixed up and verified. I updated the BIOS etc, enabled the TPM, EFI and Secure Boot (yes ESXi can but it wasn't). I mounted the PVE install .iso from my laptop at home via the iDRAC and installed PVE on the box.

I created a temporary cluster on the migration box and joined the customer's new PVE to it. I am now live migrating the VMs over. When that's done, I will destroy the clustering (carefully and after checking backups), fix up the backups and move on to the next customer.
 

Hey bbgeek17! I've read a LOT of your posts on this forum and am very interested in Blockbridge as a whole as well. I reached out to the sales team to discuss whether Blockbridge could be combined with the SAN that's in this environment, and while they said it could, it probably wouldn't be cost effective. Which I understand. :)

I'll take a look at these articles and report back my findings.
 
I am in the very long process of migrating myself, my company and my customers away from VMware. I was a VMware aficionado for well over 20 years.

VMware nailed it when they created VMFS and their clustering stuff back when it was still Linux under the hood. MS created an almighty bodge with their clustering effort but got away with it. Proxmox and the like don't bother with the shared monolithic block device thing as such, which is why you find yourself pissed off with iSCSI SANs not being able to do snapshots.

I utilize Ceph in a non-production cluster with pretty good results, using 40GbE and 3 nodes with only 3 CBDs. However, a very large investment was put into these storage devices, so I'm kind of stuck with them in a way. My only gripe with Ceph is the overhead of the system; it's not insanely resource-intensive, but you do need to take into account many more things than just a simple shared block device. One of these days I need to try out Linstor to see if there's much of a difference in performance and upkeep.
 
I utilize Ceph in a non-production cluster with pretty good results, using 40GbE and 3 nodes with only 3 CBDs. However, a very large investment was put into these storage devices, so I'm kind of stuck with them in a way. My only gripe with Ceph is the overhead of the system; it's not insanely resource-intensive, but you do need to take into account many more things than just a simple shared block device. One of these days I need to try out Linstor to see if there's much of a difference in performance and upkeep.
Do note that with the hyper-converged model and something like Ceph, the virtual hard discs are nearly always local and not at the end of a network block device.

I think that Ceph is rather light on resources - it "only" has to ensure data integrity and that's a latency thing these days. Blocks need to be replicated and considered "correct" at the point of access.

You mention three CBDs (did you mean OSDs?) on three nodes, and that sounds to me like three nodes with three RAID or other block devices. Ceph isn't designed to work like that: ideally Ceph gets to see each disc as a separate OSD. Your fancy RAID controller can still use its battery to protect writes in the event of a power outage, and its cache for each disc individually, but you need to set it to JBOD mode for the discs that will do Ceph.
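As a rough sketch (device names here are hypothetical, and this assumes the controller is already in JBOD/pass-through mode so PVE sees the raw discs), creating one OSD per disc on a node is just:

Bash:
# one Ceph OSD per raw disc; repeat per disc and per node (example device names)
pveceph osd create /dev/sdb
pveceph osd create /dev/sdc
pveceph osd create /dev/sdd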

Your 40GbE network is a very nice touch.

I recently put in a three-node cluster with mostly 10GbE and SAS SSDs (and a budget). Each node has six 1.5 TB SSDs. I devoted two NICs per box in a ring with Open vSwitch and STP for Ceph. With three nodes this works, but for more you'll need to use a switch. That's 18 OSDs.

It all works rather rapidly. I have several other systems to compare it with, including VMware with an all-SAS-flash SAN over iSCSI, and others.

I am now sold on the Ceph approach for smaller clusters where compute nodes == storage nodes. That's most of my customers.

For larger systems, I'll get my slide rule out!
 
I haven't made the changes listed in bbgeek's post yet, but I did discover two threads that may help some people in the future regarding the downfalls of iSCSI on Proxmox.

LVM over iSCSI Snapshot script:
https://forum.proxmox.com/threads/lvm-snapshot-script-for-shared-storage-lvm-over-iscsi-only.150068/

LVM using qcow2:
https://forum.proxmox.com/threads/what-about-lvm-qcow2.138969/

User @spirit has submitted a bug report for adding the features needed for iSCSI
https://bugzilla.proxmox.com/show_bug.cgi?id=4160

As well as submitted some downstream patches
https://lists.proxmox.com/pipermail/pve-devel/2024-August/065201.html
 
Hi @PwrBank , to be technically correct: the threads listed above show implementation limitations of the Proxmox storage pool type "iSCSI", not of iSCSI as a protocol.

For example, the Blockbridge PVE storage plugin implements access over iSCSI (as well as NVMe/TCP) but provides snapshots, thin provisioning, and other advanced features.


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Right, @bbgeek17 , I should have clarified.

Is there a possibility of Blockbridge licensing out the plugin to be used with more general iSCSI implementations? Or is it more specific to Blockbridge like using API calls and such to actually directly interact with Blockbridge?
 
What is the hardware/disk setup being measured here?
And what fio command(s) were used?

I'm not sure what the first question is regarding, but the commands are as follows:

Read IOPS
Bash:
fio --name=read_iops --directory=./read_iops/ --numjobs=8 --size=1G --time_based --runtime=60s --ramp_time=2s --ioengine=libaio --direct=1 --verify=0 --bs=4K --iodepth=64 --rw=randread --group_reporting=1
rm ./read_iops/*

Read Throughput
Bash:
fio --name=read_throughput --directory=./read_throughput/ --numjobs=8 --size=1G --time_based --runtime=60s --ramp_time=2s --ioengine=libaio --direct=1 --verify=0 --bs=1M --iodepth=64 --rw=read --group_reporting=1
rm ./read_throughput/*

Write IOPS
Bash:
fio --name=write_iops --directory=./write_iops/ --size=1G --time_based --runtime=60s --ramp_time=2s --ioengine=libaio --direct=1 --verify=0 --bs=4K --iodepth=64 --rw=randwrite --group_reporting=1
rm ./write_iops/*

Write Throughput
Bash:
fio --name=write_throughput --directory=./write_throughput --numjobs=8 --size=2G --time_based --runtime=60s --ramp_time=2s --ioengine=libaio --direct=1 --verify=0 --bs=1M --iodepth=64 --rw=write --group_reporting=1
rm ./write_throughput/*
 
Which disk configuration do you have that can serve >5500 MB/s reads? But note your test files are just 8x 2 GB, or 64 threads on 1 GB - that's not I/O, that's RAM; direct=1 on reads isn't useful there. With an IB100 client I get 4000 MB/s disk reads over NFSv4 (see iostat on the server as well), or 10500 MB/s when the data is in the file server's cache, and 4500 MB/s writes, which is the HDD RAID6 write limit. So there is no real limit in the NFS protocol; it's the setup.
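As a sketch of what I mean (the sizes here are only examples, picked so that the working set comfortably exceeds the RAM of both the VM and the storage controllers), something like:

Bash:
# read throughput with a working set far larger than any cache (sizes are illustrative)
fio --name=read_throughput --directory=./read_throughput/ --numjobs=8 --size=32G --time_based --runtime=120s --ramp_time=2s --ioengine=libaio --direct=1 --verify=0 --bs=1M --iodepth=64 --rw=read --group_reporting=1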
 
Btw, watch "sar -n DEV 1" while benchmarking to see the real data transfer between the hosts, and don't trust results inflated by caches.
 
Which disk configuration do you have that can serve >5500 MB/s reads? But note your test files are just 8x 2 GB, or 64 threads on 1 GB - that's not I/O, that's RAM; direct=1 on reads isn't useful there. With an IB100 client I get 4000 MB/s disk reads over NFSv4 (see iostat on the server as well), or 10500 MB/s when the data is in the file server's cache, and 4500 MB/s writes, which is the HDD RAID6 write limit. So there is no real limit in the NFS protocol; it's the setup.
The storage device is a Pure X20 with 4x25GbE networking.

Do you have a recommended fio command I can use to run the comparison again?
 
Pure X20 - nice :) So first you should check throughput (and IOPS) from the host to the storage, to understand what a "remote client" could be expected to measure.
On the client use NFS 4.2; rsize=1048576,wsize=1048576 should be auto-selected (check with mount), and additionally set nconnect=2 (with multiple clients, or between servers for sync, up to =8). For any benchmarking, use an amount of test data that exceeds 10x the RAM of the bigger host (server or client).
Verify the results with "sar -n DEV 1" for any remote test and with "iostat -xm 1" for any local I/O, but the latter isn't useful for ZFS, as the data is mostly compressed, so there you always need to calculate the amount of data against the measured time!
After the NFS share is mounted, set the NFS readahead: echo 8192 > /sys/class/bdi/$(mountpoint -d /<my-nfs-mountpoint>)/read_ahead_kb
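Put together as a sketch on the PVE host (the server address, export path and mountpoint are made up for illustration):

Bash:
# NFS 4.2 with large transfer sizes and two TCP connections for the mount
mount -t nfs4 -o vers=4.2,nconnect=2,rsize=1048576,wsize=1048576 192.168.10.50:/datastore /mnt/pve/nfs-test

# confirm the negotiated options
mount | grep /mnt/pve/nfs-test

# then raise the readahead for that mount (value in KB)
echo 8192 > /sys/class/bdi/$(mountpoint -d /mnt/pve/nfs-test)/read_ahead_kb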
 
Is there a possibility of Blockbridge licensing out the plugin to be used with more general iSCSI implementations? Or is it more specific to Blockbridge like using API calls and such to actually directly interact with Blockbridge?

@PwrBank Our plugin is effectively an API bridge. So, the plugin isn't going to help with general iSCSI implementations. And, we added important Proxmox-inspired features to our API, iSCSI, and NVMe virtualization to make everything rock solid based on years of experience supporting PVE in production. So, even if the plugin/orchestration were repurposed or emulated, it would not have the same reliability at scale.


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Pure X20 - nice :) So first you should check throughput (and IOPS) from the host to the storage, to understand what a "remote client" could be expected to measure.
On the client use NFS 4.2; rsize=1048576,wsize=1048576 should be auto-selected (check with mount), and additionally set nconnect=2 (with multiple clients, or between servers for sync, up to =8). For any benchmarking, use an amount of test data that exceeds 10x the RAM of the bigger host (server or client).
Verify the results with "sar -n DEV 1" for any remote test and with "iostat -xm 1" for any local I/O, but the latter isn't useful for ZFS, as the data is mostly compressed, so there you always need to calculate the amount of data against the measured time!
After the NFS share is mounted, set the NFS readahead: echo 8192 > /sys/class/bdi/$(mountpoint -d /<my-nfs-mountpoint>)/read_ahead_kb

Okay, I made the changes as you suggested and verified that the Pure is seeing the same throughput and IOPS as the client.


New Results using NFS:

Read IOPS: 50.6k (~3x increase)
Read Throughput: 2841 MB/s (~15.6x increase)

Write IOPS: 42k (about the same)
Write Throughput: 2160 MB/s (~1.15x increase)
I ran the same tests again after setting read_ahead_kb to 8192 and got the same results as above.

Weird behavior though: on the write IOPS test, it takes time to actually ramp up.

But once it does, it stays pretty steady.

Overall a very good increase, but still not on the same level as iSCSI. Probably about the same as ESXi though; I'll try to confirm that.
 
I'm still working on qcow2 external snapshots && lvm+qcow2. I'm hoping to have it ready for next year.
 
Overall a very good increase, but still not on the same level as iSCSI. Probably about the same as ESXi though; I'll try to confirm that.
You should increase the benchmark size as I wrote, and you will see your iSCSI peak advantages go away.
Edit: It's not uncommon for a PVE cluster to have >100 machines (VM or LXC) and multiples of 10 TB of data, and you cannot extrapolate benchmark results from 8 to 16 GB of test data on your preferred iSCSI protocol to, e.g., 16 TB in reality.
 