Proxmox vs ESXi Storage Performance - Tuning iSCSI?

PwrBank

Nov 12, 2024
Hello,

I'm trying to evaluate the storage performance differences between ESXi and Proxmox, and I'm having some trouble identifying where the performance issues are. Using the same hardware between tests, I'm getting drastically different results when comparing the two hypervisors. What can I look into to increase performance on NFS or iSCSI? I'm kind of disappointed with having no ability to create snapshots on iSCSI, and with its current performance, I don't think NFS is a viable option.

Hardware:
Dedicated 25GbE cards are installed in the test machine and the SAN. When using iSCSI, the connection is fully multipathed and verified on both ends to be using all paths. When using NFS, the 25GbE connections are bonded with LACP using balance-rr.
All-flash network appliance with dual controllers, two 25GbE ports each; all four NICs are used for multipath.

All tests were run using the same fio scripts on the same VM, which was transferred between Proxmox and ESXi using Veeam.



Here are the test results:
(The VMware NFS result is not directly comparable, as that link was only 10GbE at the time, but based on the IOPS I suspect it is much faster than Proxmox.)


Setup                | Write IOPS | Read IOPS | Write Throughput (MB/s) | Read Throughput (MB/s)
Proxmox iSCSI LVM    | 60000      | 54000     | 2300                    | 2000
Proxmox iSCSI Direct | 84200      | 85600     | 2300                    | 1000
Proxmox NFS          | 48400      | 17600     | 1870                    | 181
VMware iSCSI         | 54700      | 107000    | 2800                    | 5530
VMware NFS           | 46300      | 53400     | 1160                    | 1160




Is there any tuning that can be done to increase performance on either NFS or iSCSI?

I've already changed the iSCSI tuning settings to reflect this post, with no change in performance:
https://forum.proxmox.com/threads/s...th-dell-equallogic-storage.43018/#post-323461
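(For context, the kind of iSCSI tuning usually meant here lives in /etc/iscsi/iscsid.conf on the PVE host; the values below are only an illustrative sketch, not necessarily what that thread recommends.)

Code:
# /etc/iscsi/iscsid.conf -- illustrative session tunables, not a recommendation
node.session.cmds_max = 1024
node.session.queue_depth = 128
node.conn[0].iscsi.MaxRecvDataSegmentLength = 262144

Changes like these only take effect after the sessions are logged out and back in (iscsiadm -m node --logoutall=all followed by iscsiadm -m node --loginall=all).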

Any help would be appreciated.
 
Hi @PwrBank , welcome to the forum.

You may want to take a look at these knowledgebase articles we wrote, covering block storage and performance, for some tips:

Low Latency Storage Optimizations for Proxmox, KVM, & QEMU:
https://kb.blockbridge.com/technote/proxmox-tuning-low-latency-storage/index.html

Optimizing Proxmox: iothreads, aio, & io_uring
https://kb.blockbridge.com/technote/proxmox-aio-vs-iouring/index.html

Proxmox vs. VMware ESXi: A Performance Comparison Using NVMe/TCP:
https://kb.blockbridge.com/technote/proxmox-vs-vmware-nvmetcp/


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
I am in the very long process of migrating myself, my company and my customers away from VMware. I was a VMware aficionado for well over 20 years.

VMware nailed it when they created VMFS and their clustering stuff back when it was still Linux under the hood. MS created an almighty bodge with their clustering effort but got away with it. Proxmox and the like don't bother with the shared monolithic block device thing as such, which is why you find yourself pissed off with iSCSI SANs not being able to do snapshots.

I've reconciled myself (and quite a few customer budgets) with going "hyper-converged". If the cluster hosts have enough slots, then populate them with SAS SSDs, do Ceph, and dump the SAN or use it for backups or whatever. You need at least three nodes and 10Gb+ networking for Ceph. With three nodes, two NICs per host and some careful networking are enough to go without a switch.

Once you get away from the SAN thing (call it legacy if it makes you feel better) you really release yourself! A Proxmox cluster with Ceph on local flash storage compares rather well with a SAN. The VMs are always on local storage; Ceph ensures that they are replicated, and the Proxmox clustering ensures availability.

Note that even if you don't do shared storage, you still have the equivalent of "Storage vMotion" available out of the box. Whilst I'm writing this post, I am continuing to migrate a small business customer from VMware to Proxmox - a single ESXi host.

I delivered a migration box to site a couple of days ago with PVE on it. I fixed up the networking, iDRACs etc and mounted the VMFS on the PVE box. My colleagues migrated the VMs over last night and sorted out virtio drivers etc. Backups (Veeam) were fixed up and verified. I updated the BIOS etc, enabled the TPM, EFI and Secure Boot (yes ESXi can but it wasn't). I mounted the PVE install .iso from my laptop at home via the iDRAC and installed PVE on the box.

I created a temporary cluster on the migration box and joined the customer's new PVE to it. I am now live migrating the VMs over. When that's done, I will destroy the clustering (carefully and after checking backups), fix up the backups and move on to the next customer.
 

Hey bbgeek17! I've read a LOT of your posts on this forum and am very interested in Blockbridge as a whole as well. I reached out to the sales team to discuss whether Blockbridge could be combined with the SAN that's in this environment, and while they said it could, it probably wouldn't be cost effective. Which I understand. :)

I'll take a look at these articles and report back my findings.
 
I am in the very long process of migrating myself, my company and my customers away from VMware. I was a VMware aficionado for well over 20 years.

VMware nailed it when they created VMFS and their clustering stuff back when it was still Linux under the hood. MS created an almighty bodge with their clustering effort but got away with it. Proxmox and the like don't bother with the shared monolithic block device thing as such, which is why you find yourself pissed off with iSCSI SANs not being able to do snapshots.

I utilize Ceph in a non-production cluster with pretty good results, using 40GbE and 3 nodes with only 3 CBDs. However, a very large investment was put into these storage devices, so I'm kind of stuck with them in a way. My only gripe with Ceph is the overhead of the system; it's not insanely resource-intensive, but you do need to take into account many more things than just a simple shared block device. One of these days I need to try out Linstor to see if there's much of a difference in performance and upkeep.
 
I utilize Ceph in a non-production cluster with pretty good results, using 40GbE and 3 nodes with only 3 CBDs. However, a very large investment was put into these storage devices, so I'm kind of stuck with them in a way. My only gripe with Ceph is the overhead of the system; it's not insanely resource-intensive, but you do need to take into account many more things than just a simple shared block device. One of these days I need to try out Linstor to see if there's much of a difference in performance and upkeep.
Do note that with the hyper-converged model and something like Ceph, the virtual hard discs are nearly always local and not at the end of a network block device.

I think that Ceph is rather light on resources - it "only" has to ensure data integrity and that's a latency thing these days. Blocks need to be replicated and considered "correct" at the point of access.

You mention three CBDs (did you mean OSDs?) on three nodes, and that sounds to me like three nodes with three RAID or other block devices. Ceph isn't designed to work like that: ideally Ceph gets to see each disc as a separate OSD. Your fancy RAID controller can still use its battery to protect writes in the event of a power outage, and its cache for each disc individually, but you need to set it to JBOD mode for the discs that will do Ceph.
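As a rough sketch (device names here are hypothetical, and this assumes the controller is already in JBOD/pass-through mode so PVE sees the raw discs), creating one OSD per disc on a node is just:

Bash:
# one Ceph OSD per raw disc; repeat per disc and per node (example device names)
pveceph osd create /dev/sdb
pveceph osd create /dev/sdc
pveceph osd create /dev/sdd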

Your 40GbE network is a very nice touch.

I recently put in a three-node cluster with mostly 10GbE and SAS SSDs (and a budget). Each node has six 1.5 TB SSDs. I devoted two NICs per box in a ring with Open vSwitch and STP for Ceph. With three nodes this works, but for more you'll need to use a switch. That's 18 OSDs.

It all works rather rapidly. I have several other systems to compare it with, including VMware with an all-SAS-flash SAN over iSCSI, and others.

I am now sold on the Ceph approach for smaller clusters where compute nodes == storage nodes. That's most of my customers.

For larger systems, I'll get my slide rule out!
 
I haven't made the changes listed in bbgeek's post yet, but I did discover two threads that may help some people in the future regarding the downfalls of iSCSI on Proxmox.

LVM over iSCSI Snapshot script:
https://forum.proxmox.com/threads/lvm-snapshot-script-for-shared-storage-lvm-over-iscsi-only.150068/

LVM using qcow2:
https://forum.proxmox.com/threads/what-about-lvm-qcow2.138969/

User @spirit has submitted a bug report for adding the features needed for iSCSI
https://bugzilla.proxmox.com/show_bug.cgi?id=4160

As well as submitted some downstream patches
https://lists.proxmox.com/pipermail/pve-devel/2024-August/065201.html
 
Hi @PwrBank , to be technically correct: the threads listed above show implementation limitations of the Proxmox storage pool type "iSCSI", not of iSCSI as a protocol.

For example, the Blockbridge PVE storage plugin implements access over iSCSI (as well as NVMe/TCP) but provides snapshots, thin provisioning, and other advanced features.


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Right, @bbgeek17 , I should have clarified.

Is there a possibility of Blockbridge licensing out the plugin to be used with more general iSCSI implementations? Or is it more specific to Blockbridge like using API calls and such to actually directly interact with Blockbridge?
 
What is the hardware/disk setup being measured here?
And what fio command(s) were used?

I'm not sure what the first question is regarding, but the commands are as follows:

Read IOPS
Bash:
fio --name=read_iops --directory=./read_iops/ --numjobs=8 --size=1G --time_based --runtime=60s --ramp_time=2s --ioengine=libaio --direct=1 --verify=0 --bs=4K --iodepth=64 --rw=randread --group_reporting=1
rm ./read_iops/*

Read Throughput
Bash:
fio --name=read_throughput --directory=./read_throughput/ --numjobs=8 --size=1G --time_based --runtime=60s --ramp_time=2s --ioengine=libaio --direct=1 --verify=0 --bs=1M --iodepth=64 --rw=read --group_reporting=1
rm ./read_throughput/*

Write IOPS
Bash:
fio --name=write_iops --directory=./write_iops/ --size=1G --time_based --runtime=60s --ramp_time=2s --ioengine=libaio --direct=1 --verify=0 --bs=4K --iodepth=64 --rw=randwrite --group_reporting=1
rm ./write_iops/*

Write Throughput
Bash:
fio --name=write_throughput --directory=./write_throughput --numjobs=8 --size=2G --time_based --runtime=60s --ramp_time=2s --ioengine=libaio --direct=1 --verify=0 --bs=1M --iodepth=64 --rw=write --group_reporting=1
rm ./write_throughput/*
 
Which disk configuration do you have that can serve >5500 MB/s reads? But note your test files are just 8x 2 GB, or 64 threads on 1 GB - that's not I/O, that's RAM; direct=1 on reads isn't useful there. With an IB100 client I get 4000 MB/s disk reads over NFSv4 (see iostat on the server as well), or 10500 MB/s when the data is in the file server's cache, and 4500 MB/s writes, which is the HDD RAID6 write limit. So there is no real limit in the NFS protocol; it's the setup.
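As a sketch of what I mean (the sizes here are only examples, picked so that the working set comfortably exceeds the RAM of both the VM and the storage controllers), something like:

Bash:
# read throughput with a working set far larger than any cache (sizes are illustrative)
fio --name=read_throughput --directory=./read_throughput/ --numjobs=8 --size=32G --time_based --runtime=120s --ramp_time=2s --ioengine=libaio --direct=1 --verify=0 --bs=1M --iodepth=64 --rw=read --group_reporting=1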
 
Btw, watch "sar -n DEV 1" while benchmarking to see the real data transfer between the hosts, and don't trust results inflated by caches.
 
Which disk configuration do you have that can serve >5500 MB/s reads? But note your test files are just 8x 2 GB, or 64 threads on 1 GB - that's not I/O, that's RAM; direct=1 on reads isn't useful there. With an IB100 client I get 4000 MB/s disk reads over NFSv4 (see iostat on the server as well), or 10500 MB/s when the data is in the file server's cache, and 4500 MB/s writes, which is the HDD RAID6 write limit. So there is no real limit in the NFS protocol; it's the setup.
The storage device is a Pure X20 with 4x25GbE networking.

Do you have a recommended fio command I can use to run the comparison again?
 
Pure X20 - nice :) So first you should check throughput (and IOPS) from the host to the storage, to understand what a "remote client" could be expected to measure.
On the client use NFS 4.2; rsize=1048576,wsize=1048576 should be auto-selected (check with mount), and additionally set nconnect=2 (with multiple clients, or between servers for sync, up to =8). For any benchmarking, use an amount of test data that exceeds 10x the RAM of the bigger host (server or client).
Verify the results with "sar -n DEV 1" for any remote test and with "iostat -xm 1" for any local I/O, but the latter isn't useful for ZFS, as the data is mostly compressed, so there you always need to calculate the amount of data against the measured time!
After the NFS share is mounted, set the NFS readahead: echo 8192 > /sys/class/bdi/$(mountpoint -d /<my-nfs-mountpoint>)/read_ahead_kb
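Put together as a sketch on the PVE host (the server address, export path and mountpoint are made up for illustration):

Bash:
# NFS 4.2 with large transfer sizes and two TCP connections for the mount
mount -t nfs4 -o vers=4.2,nconnect=2,rsize=1048576,wsize=1048576 192.168.10.50:/datastore /mnt/pve/nfs-test

# confirm the negotiated options
mount | grep /mnt/pve/nfs-test

# then raise the readahead for that mount (value in KB)
echo 8192 > /sys/class/bdi/$(mountpoint -d /mnt/pve/nfs-test)/read_ahead_kb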
 
Is there a possibility of Blockbridge licensing out the plugin to be used with more general iSCSI implementations? Or is it more specific to Blockbridge like using API calls and such to actually directly interact with Blockbridge?

@PwrBank Our plugin is effectively an API bridge. So, the plugin isn't going to help with general iSCSI implementations. And, we added important Proxmox-inspired features to our API, iSCSI, and NVMe virtualization to make everything rock solid based on years of experience supporting PVE in production. So, even if the plugin/orchestration were repurposed or emulated, it would not have the same reliability at scale.


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Pure X20 - nice :) So first you should check throughput (and IOPS) from the host to the storage, to understand what a "remote client" could be expected to measure.
On the client use NFS 4.2; rsize=1048576,wsize=1048576 should be auto-selected (check with mount), and additionally set nconnect=2 (with multiple clients, or between servers for sync, up to =8). For any benchmarking, use an amount of test data that exceeds 10x the RAM of the bigger host (server or client).
Verify the results with "sar -n DEV 1" for any remote test and with "iostat -xm 1" for any local I/O, but the latter isn't useful for ZFS, as the data is mostly compressed, so there you always need to calculate the amount of data against the measured time!
After the NFS share is mounted, set the NFS readahead: echo 8192 > /sys/class/bdi/$(mountpoint -d /<my-nfs-mountpoint>)/read_ahead_kb

Okay, I made the changes as you suggested and verified that the Pure is seeing the same throughput and IOPS as the client.


New Results using NFS:

Read IOPS: 50.6k (~3x increase)
Read Throughput: 2841 MB/s (~15.6x increase)

Write IOPS: 42k (about the same)
Write Throughput: 2160 MB/s (~1.15x increase)
I ran the same tests again after setting read_ahead_kb to 8192 and got the same results as above.

Weird behavior though: on the write IOPS test, it takes time to actually ramp up.

But once it does, it stays pretty steady.

Overall a very good increase, but still not on the same level as iSCSI. Probably about the same as ESXi though; I'll try to confirm that.
 
I'm still working on qcow2 external snapshots && lvm+qcow2. I'm hoping to have it ready for next year.
 
Overall a very good increase, but still not on the same level as iSCSI. Probably about the same as ESXi though; I'll try to confirm that.
You should increase the benchmark size as I wrote, and you will see your iSCSI peak advantages go away.
Edit: It's not uncommon for a PVE cluster to have >100 machines (VM or LXC) and multiples of 10 TB of data, and you cannot extrapolate benchmark results from 8 to 16 GB of test data on your preferred iSCSI protocol to, e.g., 16 TB in reality.
 