iSCSI underperformance

pmvemf

Proxmox iSCSI Performance Issue - Support Report

Executive Summary
For more than a year I've been using this great product, and it has worked flawlessly.
Now that I need to add shared storage, I've seen significant performance degradation with the Proxmox iSCSI initiator on my 5-node cluster. The Proxmox host achieves only ~5GbE speeds (498MiB/s read, 408MiB/s write) despite a 25GbE connection, while VMs on the same hardware reach 1872-2173MiB/s.

Environment
- Proxmox VE: 9.0.6, PVE-No Subscription
- Cluster: 5 nodes (Node 1: 25GbE, Nodes 2-5: 10GbE)
- Storage: HPE MSA 2060 with 25GbE connection
- iSCSI: open-iscsi 2.1.11-1, multipath 0.11.1-2

Problem Description
The native Proxmox iSCSI initiator underperforms the expected 25GbE throughput by roughly 80%, while guest OS iSCSI implementations achieve the expected performance on identical hardware.

Test Results Summary

Proxmox Host (25GbE) - UNDERPERFORMING ❌
Code:
Sequential Read:  498MiB/s (522MB/s) - 17.4% of expected
Sequential Write: 408MiB/s (428MB/s) - 15.3% of expected  
Mixed R/W:        229+229MiB/s      - 16.6% of expected
Latency:          64-78ms (high)

Windows Server 2025 VM - PARTIAL PERFORMANCE ⚠️
Code:
Sequential Read:  1451MiB/s (1452MB/s) - 48.4% of expected
Sequential Write: 808MiB/s (808MB/s)   - 29.0% of expected
Mixed R/W:        425+427MiB/s        - 30.6% of expected

Ubuntu 24.04 LTS VM - GOOD PERFORMANCE ✅
Code:
Sequential Read:  1872MiB/s (1963MB/s) - 65.4% of expected
Sequential Write: 1738MiB/s (1823MB/s) - 65.1% of expected
Mixed R/W:        987+984MiB/s        - 71.3% of expected
Latency:          12-19ms (normal)

Debian Trixie VM - EXCELLENT PERFORMANCE ✅
Code:
Sequential Read:  2173MiB/s (2279MB/s) - 76.0% of expected
Sequential Write: 1824MiB/s (1913MB/s) - 68.3% of expected
Mixed R/W:        1096+1090MiB/s      - 79.1% of expected
Latency:          11-17ms (normal)

Key Technical Details

Test Configuration
- Test Tool: fio 3.39 with libaio engine
- Block Size: 1MB sequential I/O
- Queue Depth: 32
- Runtime: 60 seconds per test
- Direct I/O: Enabled (bypass cache)
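
For reference, the parameters above correspond to an fio invocation of the following shape (the exact commands used against each device are listed later in this thread; the filename is a placeholder for the tested block device):
Code:
fio --filename=/dev/mapper/<device> --direct=1 --rw=read --bs=1M --ioengine=libaio --iodepth=32 --runtime=60 --time_based --group_reporting --name=seq_read_test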

Network Baseline
- 25GbE Theoretical: 3.125 GB/s
- Expected Practical: 2.8-3.0 GB/s
- Proxmox Actual: 0.52 GB/s (83% performance loss)

Root Cause Analysis

Hardware/Network: ✅ CONFIRMED WORKING
- Network infrastructure functional (VMs achieve high performance)
- HPE MSA 2060 storage performing correctly
- 25GbE connection established and stable

Proxmox iSCSI Stack: ❌ IDENTIFIED BOTTLENECK
- Native open-iscsi implementation severely underperforming
- Same open-iscsi version in Debian VM performs excellently
- Issue appears specific to Proxmox integration/configuration

Tested Configurations
- Multiple multipath configurations
- Various iSCSI session parameters
- Different queue depths and block sizes
- Results consistent across all variations
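
For anyone reproducing this, a minimal way to capture what was actually negotiated across those variations (standard open-iscsi / multipath-tools commands; sdX is a placeholder for one of the iSCSI path devices):
Code:
# Negotiated iSCSI session parameters (MaxRecvDataSegmentLength, FirstBurstLength, etc.)
iscsiadm -m session -P 3
# Multipath topology, path selector and path groups
multipath -ll
# Block-layer limits that affect large sequential I/O on each path device
cat /sys/block/sdX/queue/max_sectors_kb /sys/block/sdX/device/queue_depth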

Business Impact
- Severity: HIGH - Direct impact on VM storage performance
- Scope: Cluster-wide shared storage affected
- Production Risk: Potential performance degradation for workloads

Requested Actions
1. Root Cause Investigation: Why does Proxmox iSCSI underperform vs guest OS?
2. Performance Tuning: Optimize Proxmox iSCSI configuration
3. Best Practices: Updated guidance for high-performance iSCSI setups
4. Configuration Review: Analyze current LVM/iSCSI settings

Environment Details
- Kernel: Linux 6.14.11-1-pve
- NIC: Broadcom BCM57414 NetXtreme-E 25Gb
- Switch: Ubiquiti USW Pro Aggregation
- Config: iSCSI LVM shared volume, cluster-level
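
A quick host-side sanity check of the link itself (interface name and portal IP are placeholders; standard iproute2/ethtool commands):
Code:
ethtool <iscsi-nic>                                  # negotiated speed should report 25000Mb/s
ip -s link show dev <iscsi-nic>                      # MTU plus RX/TX error and drop counters
ethtool -S <iscsi-nic> | grep -iE 'err|drop|pause'   # NIC statistics, including pause frames
ping -M do -s 8972 <msa-portal-ip>                   # jumbo-frame path test, only if MTU 9000 is configured end-to-end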

Expected Resolution
Performance optimization to achieve 80%+ of theoretical 25GbE throughput (~2.4GB/s) matching guest OS performance levels.
 
Hi @pmvemf , welcome to the forum.

A few questions:
  1. what's the hardware configuration of the host?
  2. how are you performing the benchmark? (exact commands and target locations)
  3. please provide the raw benchmark information from the host
  4. is the problem present in PVE8?
  5. are you comparing running bare metal or virtualized?
while guest OS iSCSI implementations achieve expected performance on identical hardware.
For number 5, this ^ is the source of confusion. What does it mean "guest on identical hardware"? Are you simply saying guest VM running on the same hypervisor that shows bad performance? I.e. are you comparing iSCSI connection from PVE and iSCSI connection from VM running on that PVE?

Cheers


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Hi @bbgeek17 ,

I will give you the answers inline:

  1. what's the hardware configuration of the host?
    HPE DL380 - 2 CPU sockets (128 cores total), 256GB RAM, 2 x Broadcom BCM57414 NetXtreme-E 25Gb NICs
  2. how are you performing the benchmark? (exact commands and target locations)
    Proxmox host
    Read
    Code:
    fio --filename=/dev/mapper/mpatha --direct=1 --rw=read --bs=1M --ioengine=libaio --iodepth=32 --runtime=60 --time_based --group_reporting --name=seq_read_test
    Write
    Code:
    fio --filename=/dev/mapper/mpatha --direct=1 --rw=write --bs=1M --ioengine=libaio --iodepth=32 --runtime=60 --time_based --group_reporting --name=seq_write_test
    Read-Write
    Code:
    fio --filename=/dev/mapper/mpatha --direct=1 --rw=rw --bs=1M --ioengine=libaio --iodepth=32 --runtime=60 --time_based --group_reporting --name=seq_read_write_test

    Ubuntu 24.04 LTS VM in the same Proxmox host
    Read
    Code:
    sudo fio --filename=/dev/mapper/mpatha --direct=1 --rw=read --bs=1M --ioengine=libaio --iodepth=32 --runtime=60 --time_based --group_reporting --name=seq_read_test
    Write
    Code:
    fio --filename=/dev/mapper/mpatha --direct=1 --rw=write --bs=1M --ioengine=libaio --iodepth=32 --runtime=60 --time_based --group_reporting --name=seq_write_test
    Read-Write
    Code:
    fio --filename=/dev/mapper/mpatha --direct=1 --rw=rw --bs=1M --ioengine=libaio --iodepth=32 --runtime=60 --time_based --group_reporting --name=seq_read_write_test

    Debian Trixie VM in the same Proxmox host
    Read
    Code:
    sudo fio --filename=/dev/mapper/3600c0ff000ff3d07bf64b76801000000 --direct=1 --rw=read --bs=1M --ioengine=libaio --iodepth=32 --runtime=60 --time_based --group_reporting --name=seq_read_test
    Write
    Code:
    sudo fio --filename=/dev/mapper/3600c0ff000ff3d07bf64b76801000000 --direct=1 --rw=write --bs=1M --ioengine=libaio --iodepth=32 --runtime=60 --time_based --group_reporting --name=seq_write_test
    Read-Write
    Code:
    sudo fio --filename=/dev/mapper/3600c0ff000ff3d07bf64b76801000000 --direct=1 --rw=rw --bs=1M --ioengine=libaio --iodepth=32 --runtime=60 --time_based --group_reporting --name=seq_read_write_test
  3. please provide the raw benchmark information from the host
    Proxmox host
    Read
    Code:
    fio --filename=/dev/mapper/mpatha --direct=1 --rw=read --bs=1M --ioengine=libaio --iodepth=32 --runtime=60 --time_based --group_reporting --name=seq_read_test
    seq_read_test: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=32
    fio-3.39
    Starting 1 process
    Jobs: 1 (f=1): [R(1)][100.0%][r=509MiB/s][r=509 IOPS][eta 00m:00s]
    seq_read_test: (groupid=0, jobs=1): err= 0: pid=970430: Thu Sep  4 17:25:17 2025
      read: IOPS=497, BW=498MiB/s (522MB/s)(29.2GiB/60084msec)
        slat (usec): min=18, max=615, avg=36.85, stdev=12.32
        clat (msec): min=3, max=602, avg=64.26, stdev=104.92
         lat (msec): min=3, max=602, avg=64.30, stdev=104.92
        clat percentiles (msec):
         |  1.00th=[    5],  5.00th=[    6], 10.00th=[    7], 20.00th=[    8],
         | 30.00th=[   10], 40.00th=[   12], 50.00th=[   15], 60.00th=[   21],
         | 70.00th=[   40], 80.00th=[   92], 90.00th=[  239], 95.00th=[  334],
         | 99.00th=[  439], 99.50th=[  464], 99.90th=[  523], 99.95th=[  535],
         | 99.99th=[  592]
       bw (  KiB/s): min=303104, max=612352, per=100.00%, avg=509747.20, stdev=40081.82, samples=120
       iops        : min=  296, max=  598, avg=497.80, stdev=39.14, samples=120
      lat (msec)   : 4=0.39%, 10=32.63%, 20=26.31%, 50=13.63%, 100=8.00%
      lat (msec)   : 250=9.57%, 500=9.26%, 750=0.20%
      cpu          : usr=0.16%, sys=2.10%, ctx=27781, majf=0, minf=8225
      IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=99.9%, >=64=0.0%
         submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
         complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
         issued rwts: total=29899,0,0,0 short=0,0,0,0 dropped=0,0,0,0
         latency   : target=0, window=0, percentile=100.00%, depth=32
    
    Run status group 0 (all jobs):
       READ: bw=498MiB/s (522MB/s), 498MiB/s-498MiB/s (522MB/s-522MB/s), io=29.2GiB (31.4GB), run=60084-60084msec

    Write
    Code:
    fio --filename=/dev/mapper/mpatha --direct=1 --rw=write --bs=1M --ioengine=libaio --iodepth=32 --runtime=60 --time_based --group_reporting --name=seq_write_test
    seq_write_test: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=32
    fio-3.39
    Starting 1 process
    Jobs: 1 (f=1): [W(1)][100.0%][w=408MiB/s][w=408 IOPS][eta 00m:00s]
    seq_write_test: (groupid=0, jobs=1): err= 0: pid=992335: Thu Sep  4 17:35:15 2025
      write: IOPS=408, BW=408MiB/s (428MB/s)(24.0GiB/60121msec); 0 zone resets
        slat (usec): min=22, max=500, avg=60.42, stdev=18.34
        clat (msec): min=4, max=635, avg=78.29, stdev=131.28
         lat (msec): min=4, max=635, avg=78.35, stdev=131.28
        clat percentiles (msec):
         |  1.00th=[    8],  5.00th=[   10], 10.00th=[   11], 20.00th=[   14],
         | 30.00th=[   16], 40.00th=[   18], 50.00th=[   21], 60.00th=[   26],
         | 70.00th=[   43], 80.00th=[   92], 90.00th=[  264], 95.00th=[  443],
         | 99.00th=[  567], 99.50th=[  584], 99.90th=[  617], 99.95th=[  625],
         | 99.99th=[  634]
       bw (  KiB/s): min=356352, max=475136, per=100.00%, avg=418467.41, stdev=27576.34, samples=120
       iops        : min=  348, max=  464, avg=408.66, stdev=26.93, samples=120
      lat (msec)   : 10=7.49%, 20=41.57%, 50=23.27%, 100=8.72%, 250=8.39%
      lat (msec)   : 500=7.58%, 750=2.98%
      cpu          : usr=1.23%, sys=1.58%, ctx=22948, majf=0, minf=14
      IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=99.9%, >=64=0.0%
         submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
         complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
         issued rwts: total=0,24551,0,0 short=0,0,0,0 dropped=0,0,0,0
         latency   : target=0, window=0, percentile=100.00%, depth=32
    
    Run status group 0 (all jobs):
      WRITE: bw=408MiB/s (428MB/s), 408MiB/s-408MiB/s (428MB/s-428MB/s), io=24.0GiB (25.7GB), run=60121-60121msec

    Read-Write
    Code:
    fio --filename=/dev/mapper/mpatha --direct=1 --rw=rw --bs=1M --ioengine=libaio --iodepth=32 --runtime=60 --time_based --group_reporting --name=seq_read_write_test
    seq_read_write_test: (g=0): rw=rw, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=32
    fio-3.39
    Starting 1 process
    Jobs: 1 (f=1): [M(1)][100.0%][r=218MiB/s,w=222MiB/s][r=218,w=222 IOPS][eta 00m:00s]
    seq_read_write_test: (groupid=0, jobs=1): err= 0: pid=997578: Thu Sep  4 17:38:02 2025
      read: IOPS=228, BW=229MiB/s (240MB/s)(13.4GiB/60080msec)
        slat (usec): min=18, max=532, avg=35.99, stdev=11.92
        clat (msec): min=3, max=542, avg=56.12, stdev=67.93
         lat (msec): min=3, max=542, avg=56.15, stdev=67.93
        clat percentiles (msec):
         |  1.00th=[    5],  5.00th=[    7], 10.00th=[    9], 20.00th=[   12],
         | 30.00th=[   16], 40.00th=[   21], 50.00th=[   29], 60.00th=[   40],
         | 70.00th=[   57], 80.00th=[   88], 90.00th=[  150], 95.00th=[  205],
         | 99.00th=[  321], 99.50th=[  359], 99.90th=[  464], 99.95th=[  502],
         | 99.99th=[  531]
       bw (  KiB/s): min=167936, max=286720, per=100.00%, avg=234359.47, stdev=23058.75, samples=120
       iops        : min=  164, max=  280, avg=228.87, stdev=22.52, samples=120
      write: IOPS=229, BW=229MiB/s (240MB/s)(13.4GiB/60080msec); 0 zone resets
        slat (usec): min=25, max=339, avg=86.40, stdev=34.23
        clat (msec): min=4, max=811, avg=83.52, stdev=95.73
         lat (msec): min=4, max=811, avg=83.61, stdev=95.73
        clat percentiles (msec):
         |  1.00th=[    8],  5.00th=[   12], 10.00th=[   14], 20.00th=[   20],
         | 30.00th=[   26], 40.00th=[   35], 50.00th=[   45], 60.00th=[   61],
         | 70.00th=[   84], 80.00th=[  130], 90.00th=[  220], 95.00th=[  292],
         | 99.00th=[  443], 99.50th=[  498], 99.90th=[  600], 99.95th=[  651],
         | 99.99th=[  810]
       bw (  KiB/s): min=172032, max=288768, per=100.00%, avg=234547.20, stdev=22169.15, samples=120
       iops        : min=  168, max=  282, avg=229.05, stdev=21.65, samples=120
      lat (msec)   : 4=0.09%, 10=9.75%, 20=20.20%, 50=30.27%, 100=18.06%
      lat (msec)   : 250=16.59%, 500=4.78%, 750=0.25%, 1000=0.01%
      cpu          : usr=1.41%, sys=1.75%, ctx=20951, majf=0, minf=541
      IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=99.9%, >=64=0.0%
         submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
         complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
         issued rwts: total=13745,13761,0,0 short=0,0,0,0 dropped=0,0,0,0
         latency   : target=0, window=0, percentile=100.00%, depth=32
    
    Run status group 0 (all jobs):
       READ: bw=229MiB/s (240MB/s), 229MiB/s-229MiB/s (240MB/s-240MB/s), io=13.4GiB (14.4GB), run=60080-60080msec
      WRITE: bw=229MiB/s (240MB/s), 229MiB/s-229MiB/s (240MB/s-240MB/s), io=13.4GiB (14.4GB), run=60080-60080msec
  4. is the problem present in PVE8?
    I didn't test on PVE8, since it didn't have the snapshot feature for iSCSI LVM shared volumes.
  5. are you comparing running bare metal or virtualized?
    The comparison was performed on the same host: the Proxmox OS itself with a direct iSCSI connection to the HPE MSA 2060 storage, versus VMs running on that same host with their own direct iSCSI connections to the HPE MSA 2060 storage.


For number 5, this ^ is the source of confusion. What does it mean "guest on identical hardware"? Are you simply saying guest VM running on the same hypervisor that shows bad performance? I.e. are you comparing iSCSI connection from PVE and iSCSI connection from VM running on that PVE?

I'm comparing a direct iSCSI connection from PVE itself with direct iSCSI connections from the VMs running on that PVE.



If any more information is needed, I'll be glad to provide it!
Thank you for your time!
 
Thank you @pmvemf

And just for completeness: What's the configuration of the virtual machine? (i.e., how many cores, etc).


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox


Hi @bbgeek17,

The VMs have the following configuration:


Ubuntu 24.04 LTS VM

open-iscsi
2.1.9-3ubuntu5.4

multipath-tools
0.9.4-5ubuntu8


Code:
balloon: 0
boot: c
bootdisk: scsi0
cipassword: xxx
ciuser: ubuntu
cores: 2
cpu: host
ide2: local-lvm:vm-108-cloudinit,media=cdrom,size=4M
ipconfig0: ip=192.168.xxx.xxx/xx,gw=192.168.xxx.xxx
ipconfig1: ip=192.168.xxx.xxx/xx,gw=192.168.xxx.xxx
memory: 4096
meta: creation-qemu=8.1.5,ctime=1721661503
name: testmsa
net0: virtio=BC:24:11:57:8E:81,bridge=vmbr0,firewall=1
numa: 1
scsi0: local-lvm:vm-108-disk-0,size=13824M,ssd=1
scsihw: virtio-scsi-pci
serial0: socket
smbios1: uuid=69575474-5229-4184-a1f3-a500865e78f2
sockets: 1
sshkeys: xxx
vga: serial0
vmgenid: f539851d-2a7b-4cb2-bc3e-b32a74385c2a

Debian Trixie VM

multipath-tools
0.11.1-2

open-iscsi
2.1.11-1

Code:
agent: enabled=1
boot: c
bootdisk: scsi0
cipassword: xxxxx
cores: 2
ide2: local-lvm:vm-111-cloudinit,media=cdrom,size=4M
ipconfig0: ip=192.168.xxx.xxx/xx,gw=192.168.xxx.xxx
memory: 2048
meta: creation-qemu=10.0.2,ctime=1757063274
name: debiantest
nameserver: 192.168.xxx.xxx
net0: virtio=BC:24:11:CA:5E:01,bridge=vmbr0,firewall=1
ostype: l26
scsi0: local-lvm:vm-111-disk-0,size=13G
scsihw: virtio-scsi-pci
smbios1: uuid=2a0a24d0-49ae-4d4b-9711-57e8b484ed40
sshkeys: XXX
vmgenid: d6b6db27-3479-4259-8278-781e9c11eb80

If any more information is needed, I'll be glad to provide it!
Thank you for your time!
 
Hi @pmvemf,

Following up on this, I asked our performance team to review the kernel iSCSI stack (what you referred to as the "Proxmox native initiator," which is in fact the standard Linux initiator).

Our testing with PVE9 showed no functional issues with the Linux kernel iSCSI initiator. At 100 GbE speeds, we observed some minor performance artifacts. Profiling under load indicated these could be caused by unnecessary CPU cycles spent on tracepoints in the initiator code path. We may reach out to the upstream Linux kernel community to explore optimizations that could reduce this overhead.
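
(For anyone who wants to do similar profiling themselves, a basic starting point with standard perf tooling, assuming the linux-perf package is installed, is something like:)

# Sample all CPUs with call graphs for 30 seconds while the fio job is running
perf record -a -g -- sleep 30
# Then inspect where CPU time is going, sorted by symbol
perf report --sort symbol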

For reference, below is a table of our measurements for iSCSI Multipath, alongside NVMe/TCP results for comparison. Overall performance is excellent, with only minor degradation at high bandwidth.

## PROTOCOL: iSCSI
## TRANSPORT: TCP
## MULTIPATH: YES
## MTU: 9000
## AIO: NATIVE
## PVE VERSION: pve-manager/9.0.6/49c767b70aeb6648 (running kernel: 6.14.11-1-pve)
## ATTACHMENT: NATIVE / BARE METAL
## STORAGE: BLOCKBRIDGE NVME20P 100G

Ethernet Speed [Gbit/s] | Theoretical BW (Link) [MiB/s] | Measured BW (FIO) [MiB/s] | BW Efficiency
10                      | 1192                          | 1179                      | 98.9%
25                      | 2980                          | 2861                      | 96.0%
40                      | 4768                          | 4634                      | 97.2%
50                      | 5960                          | 5728                      | 96.1%
100                     | 11921                         | 9492                      | 79.6%



## PROTOCOL: NVME
## TRANSPORT: TCP
## MULTIPATH: NO
## AIO: IO_URING
## PVE VERSION: pve-manager/9.0.6/49c767b70aeb6648 (running kernel: 6.14.11-1-pve)
## ATTACHMENT: NATIVE / BARE METAL
## STORAGE: BLOCKBRIDGE NVME20P 100G

Ethernet Speed [Gbit/s] | Theoretical BW (Link) [MiB/s] | Measured BW (FIO) [MiB/s] | BW Efficiency
10                      | 1192                          | 1179                      | 98.9%
25                      | 2980                          | 2885                      | 96.8%
40                      | 4768                          | 4716                      | 98.9%
50                      | 5960                          | 5859                      | 98.3%
100                     | 11921                         | 11752                     | 98.6%


Regarding your reported results, both the native and virtual measurements point to configuration or environmental issues. I recommend engaging with your storage vendor's support team to review best practices for configuring Linux in your setup, as the iSCSI initiator functionality in Proxmox is simply the standard Linux implementation.
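
As a starting point for that review, the knobs most storage vendors ask about are the standard open-iscsi session settings in /etc/iscsi/iscsid.conf (plus the MTU on the storage network). The values below are the stock open-iscsi defaults, shown for reference only, not as tuning advice:

# /etc/iscsi/iscsid.conf (defaults shown)
node.session.cmds_max = 128                             # outstanding commands per session
node.session.queue_depth = 32                           # per-LUN queue depth
node.conn[0].iscsi.MaxRecvDataSegmentLength = 262144    # largest data segment the initiator accepts
node.session.nr_sessions = 1                            # sessions created per target portal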

## PROTOCOL: iSCSI
## TRANSPORT: TCP
## MULTIPATH: YES
## MTU: 9000
## AIO: NATIVE
## PVE VERSION: pve-manager/9.0.6/49c767b70aeb6648 (running kernel: 6.14.11-1-pve)
## ATTACHMENT: NATIVE / BARE METAL
## STORAGE: BLOCKBRIDGE NVME20P 100G

root@pve9:~# bb host attach --disk iscsi/disk-1 --multipath
=============================================================================================================
iscsi/disk-1 attached (read-write) to pve9 (via 2 paths) as /dev/mapper/360a010a014fc86d91962194c406264b4
=============================================================================================================
## 10G Ethernet
root@pve9:~# sudo fio --filename=/dev/mapper/360a010a014fc86d91962194c406264b4 --direct=1 --ioengine=libaio \
--rw=read --bs=1M --iodepth=32 --runtime=60 --time_based --group_reporting --name=seq_read_test

Run status group 0 (all jobs):
READ: bw=1179MiB/s (1237MB/s), 1179MiB/s-1179MiB/s (1237MB/s-1237MB/s), io=69.1GiB (74.2GB), run=60027-60027msec

## 25G Ethernet
root@pve9:~# sudo fio --filename=/dev/mapper/360a010a014fc86d91962194c406264b4 --direct=1 --ioengine=libaio \
--rw=read --bs=1M --iodepth=32 --runtime=60 --time_based --group_reporting --name=seq_read_test

Run status group 0 (all jobs):
READ: bw=2861MiB/s (3000MB/s), 2861MiB/s-2861MiB/s (3000MB/s-3000MB/s), io=168GiB (180GB), run=60011-60011msec

## 40G Ethernet
root@pve9:~# sudo fio --filename=/dev/mapper/360a010a014fc86d91962194c406264b4 --direct=1 --ioengine=libaio \
--rw=read --bs=1M --iodepth=32 --runtime=60 --time_based --group_reporting --name=seq_read_test

Run status group 0 (all jobs):
READ: bw=4634MiB/s (4859MB/s), 4634MiB/s-4634MiB/s (4859MB/s-4859MB/s), io=272GiB (292GB), run=60007-60007msec

## 50G Ethernet
root@pve9:~# sudo fio --filename=/dev/mapper/360a010a014fc86d91962194c406264b4 --direct=1 --ioengine=libaio \
--rw=read --bs=1M --iodepth=32 --runtime=60 --time_based --group_reporting --name=seq_read_test

Run status group 0 (all jobs):
READ: bw=5728MiB/s (6006MB/s), 5728MiB/s-5728MiB/s (6006MB/s-6006MB/s), io=336GiB (360GB), run=60005-60005msec


## 100G Ethernet
root@pve9:~# sudo fio --filename=/dev/mapper/360a010a014fc86d91962194c406264b4 --direct=1 --ioengine=libaio \
--rw=read --bs=1M --iodepth=32 --runtime=60 --time_based --group_reporting --name=seq_read_test

Run status group 0 (all jobs):
READ: bw=9492MiB/s (9953MB/s), 9492MiB/s-9492MiB/s (9953MB/s-9953MB/s), io=556GiB (597GB), run=60004-60004msec

## PROTOCOL: NVME
## TRANSPORT: TCP
## MULTIPATH: NO
## AIO: IO_URING
## PVE VERSION: pve-manager/9.0.6/49c767b70aeb6648 (running kernel: 6.14.11-1-pve)
## ATTACHMENT: NATIVE / BARE METAL
## STORAGE: BLOCKBRIDGE NVME20P 100G

root@pve9:~# bb host attach --disk nvme/disk-1
=============================================================
nvme/disk-1 attached (read-write) to pve9 as /dev/nvme1n1
=============================================================

## 10G Ethernet
root@pve9:~# sudo fio --filename=/dev/nvme1n1 --direct=1 --ioengine=io_uring \
--rw=read --bs=1M --iodepth=32 --runtime=60 --time_based --group_reporting --name=seq_read_test

Run status group 0 (all jobs):
READ: bw=1179MiB/s (1236MB/s), 1179MiB/s-1179MiB/s (1236MB/s-1236MB/s), io=69.1GiB (74.2GB), run=60023-60023msec

## 25G Ethernet
root@pve9:~# sudo fio --filename=/dev/nvme1n1 --direct=1 --ioengine=io_uring \
--rw=read --bs=1M --iodepth=32 --runtime=60 --time_based --group_reporting --name=seq_read_test

Run status group 0 (all jobs):
READ: bw=2885MiB/s (3025MB/s), 2885MiB/s-2885MiB/s (3025MB/s-3025MB/s), io=169GiB (182GB), run=60010-60010msec

## 40G Ethernet
root@pve9:~# sudo fio --filename=/dev/nvme1n1 --direct=1 --ioengine=io_uring \
--rw=read --bs=1M --iodepth=32 --runtime=60 --time_based --group_reporting --name=seq_read_test

Run status group 0 (all jobs):
READ: bw=4716MiB/s (4945MB/s), 4716MiB/s-4716MiB/s (4945MB/s-4945MB/s), io=276GiB (297GB), run=60006-60006msec

## 50G Ethernet
root@pve9:~# sudo fio --filename=/dev/nvme1n1 --direct=1 --ioengine=io_uring \
--rw=read --bs=1M --iodepth=32 --runtime=60 --time_based --group_reporting --name=seq_read_test

Run status group 0 (all jobs):
READ: bw=5859MiB/s (6144MB/s), 5859MiB/s-5859MiB/s (6144MB/s-6144MB/s), io=343GiB (369GB), run=60005-60005msec

## 100G Ethernet
root@pve9:~# sudo fio --filename=/dev/nvme1n1 --direct=1 --ioengine=io_uring \
--rw=read --bs=1M --iodepth=32 --runtime=60 --time_based --group_reporting --name=seq_read_test

Run status group 0 (all jobs):
READ: bw=11.2GiB/s (12.0GB/s), 11.2GiB/s-11.2GiB/s (12.0GB/s-12.0GB/s), io=671GiB (721GB), run=60003-60003msec



Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
For reference, below is a table
I really appreciate your posts about testing storages, even though it is one (or two) levels above my world.

But please, a bare numerical value without knowing the unit is problematic, or at the very least incomplete.
 
Hi @bbgeek17,

Thank you for your feedback and testing!

If I understand correctly, the tests you performed were done on a single Proxmox host that didn't belong to a Proxmox cluster, is that right?

I suspect something like a QoS behavior related to the Proxmox cluster (for example, dividing the available network bandwidth by cluster node); otherwise, if the problem were on the HPE MSA side, it would also have been visible in the tests performed from the VMs.

Despite my suspicions, I will open a support ticket with the storage vendor as you suggested, and when I have feedback I will post it here.


Thank you for your time!

Best regards.
 
Hi @pmvemf,

I realized that I did not answer some of your questions:

Our test WAS performed in a cluster. There are no Proxmox-specific cluster-aware QoS features that would limit your storage performance.

That said, there are some behaviors in the Linux SCSI stack that can clamp down on queue depth if the storage system can't keep pace. But I don't suspect that's what's happening here.
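
(A quick way to see the per-LUN queue depth the SCSI stack is actually using, where sdX stands for one of your iSCSI path devices:)

cat /sys/block/sdX/device/queue_depth    # effective queue depth for that path device
dmesg | grep -i 'queue' | tail           # any SCSI-layer messages about queue-depth changes, if logged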

Also, it's essential to keep in mind that your virtual machine tests are apples to oranges. The virtual machine performance allows you to conclude that your storage system is capable of delivering a specific level of performance. However, the performance issues on bare metal do not allow you to conclude that the iSCSI initiator in Linux is somehow deficient. After all, the code is effectively identical to what's running in your Linux VMs.

My best guess is that you have a network-related issue.
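
A couple of generic host-side checks that usually narrow that down (run them while a host-side fio job is active; standard iproute2 tools):

# System-wide TCP retransmission counters
nstat -az TcpRetransSegs TcpExtTCPLostRetransmit
# Per-connection detail for the iSCSI TCP sessions (port 3260): retransmits, cwnd, rtt
ss -tin 'dport = :3260'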


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Hi @bbgeek17,

As promised, here is the feedback from the vendor:

"On the MSA side, there are no indications of any problems.
Regarding issues with the Proxmox operating system, it is not supported in our matrices, and based on the tests you have already performed, the problem points to the operating system since the tests you conducted with other operating systems do not present the problem.
The compatibility matrices can be consulted on our SPOCK website - HPE Storage Single Point of Connectivity Knowledge - SPOCK
Direct link for supported operating systems on the MSA2060 - SPOCK"

hpw_msa.png
 
Ugh, sorry to hear that. When your support contact calls Proxmox "the operating system", you know things are not going well. That said, for people who invested in storage systems that do not officially support PVE, we recommend hiding the PVE "badge" and presenting the client as Debian. In your case, there is a double hit since even Debian is not on their supported list. Next time you need HPE support, consider framing it as an Ubuntu-derived kernel. In essence, you may need to conceal what you are doing to obtain basic support. At the end of the day, this is all "just Linux."


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
I have the same problem on Linux 6.14.11-2-pve, Proxmox 9.0.10.
From the guest OS on the node, iSCSI performance is excellent. But operations from the host (Disk Move, VM Clone) are extremely slow.
The network is 10 Gb/s and we've tested it, so there's no network issue.
Storage: ME5024 RAID ADAPT with 12 SAS SSDs.


I get 60k/25k IOPS in fio with a 70/30 read/write mix from the VM, yet cloning a VM doesn't exceed 20 MB/s.
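
Keep in mind that Disk Move / VM Clone is effectively a single sequential copy stream, which behaves nothing like a deep-queue mixed fio job. A rough, read-only stand-in for that kind of access pattern, run from the host against the mapper device shown in the multipath output below:
Code:
# One sequential reader with a shallow queue, closer to what a clone/copy does
fio --filename=/dev/mapper/mpatha --direct=1 --rw=read --bs=128k --iodepth=1 --runtime=30 --time_based --group_reporting --name=clone_like_read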

Here is my multipath.conf:
defaults {
user_friendly_names yes
find_multipaths yes
max_fds 8192
polling_interval 5
queue_without_daemon no
}

## Universal Blacklist (recommend white-listing)
blacklist {
devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st|sda)[0-9]*"
}

blacklist_exceptions {
device {
vendor "DellEMC"
product "ME5"
}
}

## Dell Device Configuration
devices {
device {
vendor "DellEMC"
product "ME5"
path_grouping_policy group_by_prio
path_checker "tur"
hardware_handler "1 alua"
prio "alua"
failback immediate
rr_weight "uniform"
path_selector "round-robin 0"
}
}


and multipath -ll output

mpatha (3600c0ff000fd6c2a872ed56801000000) dm-5 DellEMC,ME5
size=14T features='0' hwhandler='1 alua' wp=rw
|-+- policy='round-robin 0' prio=50 status=active
| |- 18:0:0:1 sdc 8:32 active ready running
| `- 20:0:0:1 sdd 8:48 active ready running
`-+- policy='round-robin 0' prio=10 status=enabled
|- 19:0:0:1 sde 8:64 active ready running
`- 21:0:0:1 sdi 8:128 active ready running
mpathb (3600c0ff000fd6c2a882ed56801000000) dm-6 DellEMC,ME5
size=14T features='0' hwhandler='1 alua' wp=rw
|-+- policy='round-robin 0' prio=50 status=active
| |- 18:0:0:2 sdf 8:80 active ready running
| `- 20:0:0:2 sdg 8:96 active ready running
`-+- policy='round-robin 0' prio=10 status=enabled
|- 19:0:0:2 sdh 8:112 active ready running
`- 21:0:0:2 sdj 8:144 active ready running
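
To see whether all the paths are actually carrying I/O during a clone (and at what request sizes), watching the underlying sd devices is usually enough (assumes the sysstat package for iostat):
Code:
# Extended per-device stats every 2 seconds for the path devices behind mpatha/mpathb
iostat -xm 2 sdc sdd sde sdf sdg sdh sdi sdj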

That's what I have:
1759247275193.png

And the fio results:

Code:
Starting 8 processes
Jobs: 8 (f=8): [m(8)][100.0%][r=264MiB/s,w=113MiB/s][r=67.5k,w=28.9k IOPS][eta 00m:00s]
mixed: (groupid=0, jobs=8): err= 0: pid=1442: Mon Sep 29 23:52:02 2025
  read: IOPS=64.8k, BW=253MiB/s (265MB/s)(29.6GiB/120011msec)
    slat (usec): min=2, max=151455, avg=11.77, stdev=132.50
    clat (nsec): min=1293, max=207973k, avg=2997404.38, stdev=5728018.10
     lat (usec): min=3, max=207990, avg=3009.17, stdev=5731.02
    clat percentiles (nsec):
     |  1.00th=[     1384],  5.00th=[     1528], 10.00th=[     1976],
     | 20.00th=[    23936], 30.00th=[   107008], 40.00th=[   448512],
     | 50.00th=[   987136], 60.00th=[  1564672], 70.00th=[  2473984],
     | 80.00th=[  4685824], 90.00th=[  8716288], 95.00th=[ 12910592],
     | 99.00th=[ 25034752], 99.50th=[ 31064064], 99.90th=[ 52166656],
     | 99.95th=[ 63700992], 99.99th=[147849216]
   bw (  KiB/s): min=158848, max=364160, per=100.00%, avg=259254.10, stdev=5020.46, samples=1912
   iops        : min=39712, max=91040, avg=64813.07, stdev=1255.12, samples=1912
  write: IOPS=27.8k, BW=108MiB/s (114MB/s)(12.7GiB/120011msec); 0 zone resets
    slat (usec): min=3, max=155676, avg=177.05, stdev=1153.51
    clat (usec): min=56, max=224236, avg=11242.78, stdev=9816.38
     lat (usec): min=217, max=224283, avg=11419.83, stdev=9884.04
    clat percentiles (usec):
     |  1.00th=[  1909],  5.00th=[  2868], 10.00th=[  3621], 20.00th=[  4817],
     | 30.00th=[  5997], 40.00th=[  7308], 50.00th=[  8717], 60.00th=[ 10552],
     | 70.00th=[ 12649], 80.00th=[ 15795], 90.00th=[ 21365], 95.00th=[ 27395],
     | 99.00th=[ 43779], 99.50th=[ 53216], 99.90th=[137364], 99.95th=[156238],
     | 99.99th=[173016]
   bw (  KiB/s): min=66944, max=153016, per=100.00%, avg=111113.72, stdev=2113.74, samples=1912
   iops        : min=16736, max=38254, avg=27777.99, stdev=528.44, samples=1912
  lat (usec)   : 2=7.13%, 4=1.81%, 10=1.81%, 20=2.54%, 50=4.12%
  lat (usec)   : 100=3.29%, 250=4.17%, 500=3.83%, 750=3.27%, 1000=3.23%
  lat (msec)   : 2=11.18%, 4=11.76%, 10=23.49%, 20=13.55%, 50=4.55%
  lat (msec)   : 100=0.21%, 250=0.06%
  cpu          : usr=5.72%, sys=21.71%, ctx=2816183, majf=0, minf=105
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=7770737,3330947,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=253MiB/s (265MB/s), 253MiB/s-253MiB/s (265MB/s-265MB/s), io=29.6GiB (31.8GB), run=120011-120011msec
  WRITE: bw=108MiB/s (114MB/s), 108MiB/s-108MiB/s (114MB/s-114MB/s), io=12.7GiB (13.6GB), run=120011-120011msec

Disk stats (read/write):
  sdb: ios=996546/3332695, sectors=7972368/29193200, merge=0/316473, ticks=7488464/22143280, in_queue=29631809, util=97.02%
 