Very slow performance in VMs

Dec 30, 2023
Hello everyone!

I'm no storage expert, but I'm observing subpar disk performance on Windows VMs (I have some Linux VMs I could test as well, but I haven't noticed anything while using them).

We use ZFS, and we disabled ballooning and the tablet device following https://pve.proxmox.com/wiki/Performance_Tweaks

Here is the config for one of our Windows VMs:

Code:
agent: 1
balloon: 0
bios: ovmf
boot: order=scsi0;net0
cores: 4
cpu: host
efidisk0: local-zfs:vm-107-disk-0,efitype=4m,pre-enrolled-keys=1,size=1M
machine: pc-q35-7.2
memory: 8192
meta: creation-qemu=7.2.0,ctime=1702106358
name: af-rds
net0: virtio=D2:9E:C5:B4:C6:39,bridge=vmbr2
numa: 0
ostype: win11
scsi0: local-zfs:vm-107-disk-1,discard=on,iothread=1,size=32G
scsihw: virtio-scsi-single
smbios1: uuid=efe0e43f-d258-4636-b81c-5e6335f94d79
sockets: 1
tablet: 0
tpmstate0: local-zfs:vm-107-disk-2,size=4M,version=v2.0
vmgenid: c2751023-6d8b-4ef1-95e8-108733dca380

Disk performance is very slow, meaning that we get a few KB/s while copying files from one directory to another.

We tried to benchmark using CrystalDiskMark (see attached screenshot): we see very poor performance in random R/W and fairly decent sequential numbers, but we suspect the sequential results are just reads and writes hitting the cache.

Here is the hardware we use (it's a Hetzner server, BTW):
CPU: 32 x AMD Ryzen Threadripper 2950X 16-Core Processor (1 socket)
RAM: 125 GB
Disks:
local-zfs -> SSDs: 2x SAMSUNG MZ7LM1T9 (1.75 TiB each)
slowpool -> HDDs: 2x ST10000NM0568-2H (9.1 TiB each)

Autotrim is enabled on the local-zfs pool, and discard is enabled on the VM disks as well.
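For reference, checking this on the host looks something like this (assuming the pool backing local-zfs is named rpool, as on a default PVE install):

Code:
# Verify autotrim on the pool (pool name assumed to be rpool)
zpool get autotrim rpool
# Enable it if it is off
zpool set autotrim=on rpool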

Could anybody advise on where and how to search for the underlying issue?

Thanks a lot in advance,

Best,

François
 

Attachments

  • Screenshot 2023-12-30 at 10.50.17.png
Benchmarking from within the VM is not recommended and pretty much useless. You should use fio on the host for real results.
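A minimal 4K random read test on the host could look something like this (the directory is just an example, point it at a dataset on the pool you want to test):

Code:
# Example only: 4K random reads against a test file on the pool
fio --name=randread_test --directory=/rpool/data/fio-test --size=4G \
    --bs=4K --rw=randread --iodepth=32 --ioengine=libaio --direct=1 \
    --runtime=60s --time_based --group_reporting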

local-zfs is your PVE system "drive" and is also used for VMs, which means that the pool (with roughly the speed of a single SSD, since you're using a mirror) has to handle concurrent r/w operations from multiple sources. You shouldn't expect much performance from such a setup. Besides that: how many VMs are running on local-zfs?
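To get a feel for how busy that pool actually is while the VMs run, you can watch it on the host, something like this (the pool backing local-zfs is typically rpool on a default install):

Code:
# Show per-vdev bandwidth and IOPS every second (pool name assumed to be rpool)
zpool iostat -v rpool 1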
 
Thanks for your answer, I'll try fio (I'll read the docs first so I actually test it correctly :) )

Yes indeed, you are correct: we are running 5 other VMs on this mirror, on top of Proxmox itself. All those VMs are pretty much idle, as they are not in use right now.

That being said, we are migrating from an old host that has 4 cores, 32 GB of RAM and only HDDs (running ESXi), and performance on the new host is way (way!) worse than on that old machine.

Are you saying that we should provision an SSD mirror per VM?

Thanks a lot for your help :)
 
According to the datasheet, that SSD does at most 24K IOPS when writing random data. Your benchmark shows about 5.8K IOPS (23 MB/s divided by 4K random writes). With all the overhead of ZFS and NTFS on top, that isn't too bad. But random reads are indeed not great, and they should be faster than writes.
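As a quick back-of-the-envelope check (the exact number depends on whether CrystalDiskMark reports MB/s or MiB/s):

Code:
# Throughput divided by block size gives IOPS
echo $(( 23 * 1000 * 1000 / 4096 ))   # ~5600 IOPS if 23 MB/s means decimal megabytes
echo $(( 23 * 1024 * 1024 / 4096 ))   # ~5900 IOPS if it means MiB/s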
 
An approach would be: install PVE itself on two smaller SSDs in a mirror (240 GB models from the PM series are more than sufficient). After that you could extend the two larger SSDs with another set of two SSDs (same models) and create a striped mirror with ZFS (known from other RAID systems as RAID 10) and use this for your VMs. This will give a good performance increase. Another thing to keep in mind is the dataset you use on ZFS and its blocksize. You should read up on this in the FAQ to get the most out of it.
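A rough sketch of creating such a striped mirror with four SSDs (the by-id device paths here are placeholders, replace them with your actual disks):

Code:
# Hypothetical example: two mirrored vdevs striped together = RAID 10 in ZFS terms
zpool create -o ashift=12 vmpool \
    mirror /dev/disk/by-id/ata-SSD-A /dev/disk/by-id/ata-SSD-B \
    mirror /dev/disk/by-id/ata-SSD-C /dev/disk/by-id/ata-SSD-D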

ZFS is not just a simple filesystem or RAID layer like ext4/btrfs or mdadm. Roughly speaking, it's a complete package with a different approach.
 
Ok I think I found the issue.

The issue was... the Windows Server 2022 Core Isolation "Memory integrity" (hypervisor-based) feature was enabled. (Apparently running an HV on top of an HV is not that performant, who knew :p)

Now I have systems that are on par with what I expected performance-wise from SSDs.

For the record, I re-ran CrystalDiskMark in the VMs (I know, not a good test, but comparatively we can maybe see some patterns): results are 6x-10x better on random, and also better on sequential (see screenshot).

Also, I benchmarked my internal disks this way (I found the commands in Google's GCE benchmarking docs, feel free to comment if you have input):
Code:
TEST_DIR=/rpool/data/testdir
# Test write throughput by performing sequential writes with multiple parallel streams (16+), using an I/O block size of 1 MB and an I/O depth of at least 64:
    root@compute:/rpool/data# fio --name=write_throughput --directory=/rpool/data/testdir --numjobs=16 \
    --size=10G --time_based --runtime=60s --ramp_time=2s --ioengine=libaio \
    --direct=1 --verify=0 --bs=1M --iodepth=64 --rw=write \
    --group_reporting=1 --iodepth_batch_submit=64 \
    --iodepth_batch_complete_max=64
    write_throughput: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=64
    ...
    fio-3.25
    Starting 16 processes
    write_throughput: Laying out IO file (1 file / 10240MiB)
    write_throughput: Laying out IO file (1 file / 10240MiB)
    write_throughput: Laying out IO file (1 file / 10240MiB)
    write_throughput: Laying out IO file (1 file / 10240MiB)
    write_throughput: Laying out IO file (1 file / 10240MiB)
    write_throughput: Laying out IO file (1 file / 10240MiB)
    write_throughput: Laying out IO file (1 file / 10240MiB)
    write_throughput: Laying out IO file (1 file / 10240MiB)
    write_throughput: Laying out IO file (1 file / 10240MiB)
    write_throughput: Laying out IO file (1 file / 10240MiB)
    write_throughput: Laying out IO file (1 file / 10240MiB)
    write_throughput: Laying out IO file (1 file / 10240MiB)
    write_throughput: Laying out IO file (1 file / 10240MiB)
    write_throughput: Laying out IO file (1 file / 10240MiB)
    write_throughput: Laying out IO file (1 file / 10240MiB)
    write_throughput: Laying out IO file (1 file / 10240MiB)
    Jobs: 14 (f=13): [_(1),W(1),f(1),_(1),W(12)][34.6%][w=448MiB/s][w=447 IOPS][eta 02m:03s]                
    write_throughput: (groupid=0, jobs=16): err= 0: pid=13006: Sat Dec 30 16:52:15 2023
    write: IOPS=445, BW=462MiB/s (484MB/s)(28.0GiB/62088msec); 0 zone resets
        slat (msec): min=973, max=2430, avg=2269.91, stdev=156.12
        clat (nsec): min=3160, max=17820, avg=9812.44, stdev=1311.93
        lat (msec): min=973, max=2430, avg=2287.45, stdev=121.16
        clat percentiles (nsec):
        |  1.00th=[ 5728],  5.00th=[ 8256], 10.00th=[ 8640], 20.00th=[ 9024],
        | 30.00th=[ 9280], 40.00th=[ 9536], 50.00th=[ 9792], 60.00th=[ 9920],
        | 70.00th=[10176], 80.00th=[10432], 90.00th=[10944], 95.00th=[11584],
        | 99.00th=[15936], 99.50th=[16512], 99.90th=[17792], 99.95th=[17792],
        | 99.99th=[17792]
    bw (  MiB/s): min= 1947, max= 2049, per=100.00%, avg=2044.61, stdev= 1.84, samples=432
    iops        : min= 1943, max= 2048, avg=2044.07, stdev= 1.88, samples=432
    lat (usec)   : 4=0.01%, 10=64.19%, 20=35.79%
    cpu          : usr=0.10%, sys=0.84%, ctx=222148, majf=1, minf=945
    IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
        submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=100.0%, >=64=0.0%
        complete  : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=100.0%, >=64=0.0%
        issued rwts: total=0,27648,0,0 short=0,0,0,0 dropped=0,0,0,0
        latency   : target=0, window=0, percentile=100.00%, depth=64

    Run status group 0 (all jobs):
    WRITE: bw=462MiB/s (484MB/s), 462MiB/s-462MiB/s (484MB/s-484MB/s), io=28.0GiB (30.1GB), run=62088-62088msec


# Test write IOPS by performing random writes, using an I/O block size of 4 KB and an I/O depth of at least 256:
    root@compute:/rpool/data# fio --name=write_iops --directory=$TEST_DIR --size=10G \
    --time_based --runtime=60s --ramp_time=2s --ioengine=libaio --direct=1 \
    --verify=0 --bs=4K --iodepth=256 --rw=randwrite --group_reporting=1  \
    --iodepth_batch_submit=256  --iodepth_batch_complete_max=256
    write_iops: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=256
    fio-3.25
    Starting 1 process
    write_iops: Laying out IO file (1 file / 10240MiB)
    Jobs: 1 (f=1): [w(1)][100.0%][w=36.0MiB/s][w=9216 IOPS][eta 00m:00s]
    write_iops: (groupid=0, jobs=1): err= 0: pid=46778: Sat Dec 30 16:54:58 2023
    write: IOPS=15.9k, BW=61.0MiB/s (64.0MB/s)(3718MiB/60011msec); 0 zone resets
        slat (msec): min=7, max=168, avg=16.02, stdev= 8.15
        clat (nsec): min=5400, max=51751, avg=18021.34, stdev=3126.49
        lat (msec): min=7, max=168, avg=16.04, stdev= 8.15
        clat percentiles (nsec):
        |  1.00th=[14784],  5.00th=[15424], 10.00th=[15808], 20.00th=[16512],
        | 30.00th=[16768], 40.00th=[17024], 50.00th=[17280], 60.00th=[17536],
        | 70.00th=[18048], 80.00th=[18560], 90.00th=[20352], 95.00th=[25216],
        | 99.00th=[30336], 99.50th=[32640], 99.90th=[46336], 99.95th=[49408],
        | 99.99th=[51968]
    bw (  KiB/s): min=34816, max=116969, per=100.00%, avg=63448.62, stdev=22226.07, samples=120
    iops        : min= 8704, max=29242, avg=15862.11, stdev=5556.53, samples=120
    lat (usec)   : 10=0.37%, 20=88.72%, 50=10.88%, 100=0.03%
    cpu          : usr=1.14%, sys=58.07%, ctx=203468, majf=0, minf=58
    IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
        submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=100.0%
        complete  : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=100.0%
        issued rwts: total=0,951552,0,0 short=0,0,0,0 dropped=0,0,0,0
        latency   : target=0, window=0, percentile=100.00%, depth=256

    Run status group 0 (all jobs):
    WRITE: bw=61.0MiB/s (64.0MB/s), 61.0MiB/s-61.0MiB/s (64.0MB/s-64.0MB/s), io=3718MiB (3899MB), run=60011-60011msec

# Test read throughput by performing sequential reads with multiple parallel streams (16+), using an I/O block size of 1 MB and an I/O depth of at least 64:
    root@compute:/rpool/data# fio --name=read_throughput --directory=$TEST_DIR --numjobs=16 \
    --size=10G --time_based --runtime=60s --ramp_time=2s --ioengine=libaio \
    --direct=1 --verify=0 --bs=1M --iodepth=64 --rw=read \
    --group_reporting=1 \
    --iodepth_batch_submit=64 --iodepth_batch_complete_max=64
    read_throughput: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=64
    ...
    fio-3.25
    Starting 16 processes
    read_throughput: Laying out IO file (1 file / 10240MiB)
    read_throughput: Laying out IO file (1 file / 10240MiB)
    read_throughput: Laying out IO file (1 file / 10240MiB)
    read_throughput: Laying out IO file (1 file / 10240MiB)
    read_throughput: Laying out IO file (1 file / 10240MiB)
    read_throughput: Laying out IO file (1 file / 10240MiB)
    read_throughput: Laying out IO file (1 file / 10240MiB)
    read_throughput: Laying out IO file (1 file / 10240MiB)
    read_throughput: Laying out IO file (1 file / 10240MiB)
    read_throughput: Laying out IO file (1 file / 10240MiB)
    read_throughput: Laying out IO file (1 file / 10240MiB)
    read_throughput: Laying out IO file (1 file / 10240MiB)
    read_throughput: Laying out IO file (1 file / 10240MiB)
    read_throughput: Laying out IO file (1 file / 10240MiB)
    read_throughput: Laying out IO file (1 file / 10240MiB)
    read_throughput: Laying out IO file (1 file / 10240MiB)
    Jobs: 15 (f=13): [R(4),f(1),R(2),f(1),R(7),_(1)][51.6%][r=3937MiB/s][r=3936 IOPS][eta 01m:00s]
    read_throughput: (groupid=0, jobs=16): err= 0: pid=707921: Sat Dec 30 17:04:49 2023
    read: IOPS=5477, BW=5495MiB/s (5762MB/s)(328GiB/61151msec)
        slat (msec): min=9, max=2046, avg=188.24, stdev=394.41
        clat (usec): min=3, max=109, avg=12.78, stdev= 4.77
        lat (msec): min=9, max=1849, avg=184.87, stdev=389.53
        clat percentiles (usec):
        |  1.00th=[    8],  5.00th=[   10], 10.00th=[   10], 20.00th=[   11],
        | 30.00th=[   11], 40.00th=[   12], 50.00th=[   12], 60.00th=[   13],
        | 70.00th=[   14], 80.00th=[   15], 90.00th=[   17], 95.00th=[   20],
        | 99.00th=[   33], 99.50th=[   39], 99.90th=[   64], 99.95th=[   74],
        | 99.99th=[  111]
    bw (  MiB/s): min= 1884, max=12922, per=100.00%, avg=7079.36, stdev=336.74, samples=1032
    iops        : min= 1879, max=12919, avg=7078.50, stdev=336.73, samples=1032
    lat (usec)   : 4=0.01%, 10=16.76%, 20=78.78%, 50=4.24%, 100=0.19%
    lat (usec)   : 250=0.02%
    cpu          : usr=0.02%, sys=15.03%, ctx=40945, majf=0, minf=954
    IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
        submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=100.0%, >=64=0.0%
        complete  : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=100.0%, >=64=0.0%
        issued rwts: total=334976,0,0,0 short=0,0,0,0 dropped=0,0,0,0
        latency   : target=0, window=0, percentile=100.00%, depth=64

    Run status group 0 (all jobs):
    READ: bw=5495MiB/s (5762MB/s), 5495MiB/s-5495MiB/s (5762MB/s-5762MB/s), io=328GiB (352GB), run=61151-61151msec

# Test read IOPS by performing random reads, using an I/O block size of 4 KB and an I/O depth of at least 256:
    root@compute:/rpool/data# fio --name=read_iops --directory=$TEST_DIR --size=10G \
    --time_based --runtime=60s --ramp_time=2s --ioengine=libaio --direct=1 \
    --verify=0 --bs=4K --iodepth=256 --rw=randread --group_reporting=1 \
    --iodepth_batch_submit=256  --iodepth_batch_complete_max=256
    read_iops: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=256
    fio-3.25
    Starting 1 process
    read_iops: Laying out IO file (1 file / 10240MiB)
    Jobs: 1 (f=1): [r(1)][100.0%][r=198MiB/s][r=50.7k IOPS][eta 00m:00s]
    read_iops: (groupid=0, jobs=1): err= 0: pid=799909: Sat Dec 30 17:07:09 2023
    read: IOPS=51.2k, BW=200MiB/s (210MB/s)(11.7GiB/60004msec)
        slat (usec): min=1013, max=6544, avg=4920.36, stdev=738.62
        clat (nsec): min=1310, max=83081, avg=15251.98, stdev=2096.12
        lat (usec): min=1019, max=6562, avg=4935.64, stdev=739.61
        clat percentiles (nsec):
        |  1.00th=[ 6816],  5.00th=[13504], 10.00th=[13888], 20.00th=[14144],
        | 30.00th=[14528], 40.00th=[14912], 50.00th=[15168], 60.00th=[15552],
        | 70.00th=[16064], 80.00th=[16512], 90.00th=[17024], 95.00th=[17280],
        | 99.00th=[18816], 99.50th=[20864], 99.90th=[32128], 99.95th=[33536],
        | 99.99th=[70144]
    bw (  KiB/s): min=180224, max=549963, per=100.00%, avg=205051.91, stdev=40550.88, samples=120
    iops        : min=45056, max=137490, avg=51262.87, stdev=10137.68, samples=120
    lat (usec)   : 2=0.01%, 4=0.01%, 10=1.92%, 20=97.49%, 50=0.55%
    lat (usec)   : 100=0.02%
    cpu          : usr=2.03%, sys=97.96%, ctx=100, majf=0, minf=58
    IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
        submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=100.0%
        complete  : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=100.0%
        issued rwts: total=3074304,0,0,0 short=0,0,0,0 dropped=0,0,0,0
        latency   : target=0, window=0, percentile=100.00%, depth=256

    Run status group 0 (all jobs):
    READ: bw=200MiB/s (210MB/s), 200MiB/s-200MiB/s (210MB/s-210MB/s), io=11.7GiB (12.6GB), run=60004-60004msec
 

Attachments

  • Screenshot 2023-12-30 at 17.35.45.png
Looks like you solved it. I've heard a lot of people mention that ZFS kneecaps NVMe performance too; it might be worth trying a plain mdadm RAID 1 as well.
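If you ever want to compare, a bare-bones sketch of that would be something like this (device names are placeholders; note you'd lose ZFS features like snapshots and checksumming):

Code:
# Hypothetical comparison setup: plain mdadm RAID 1 over two SSDs
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdX /dev/sdY
mkfs.ext4 /dev/md0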

Are you running Hyper-V in Proxmox? lol
 
