Disk performance issues

Hi all,

I am having some pretty severe problems with disk performance. Basically, the speeds that PVE reports when running tests are nothing like the speeds shown inside a VM running the exact same test. I was about to buy a new NVMe, thinking the disk itself was the problem, but I can see that if I went down this path I would only be fighting the symptom and not the cause.

For example, these are the FIO results I get when running the test from Proxmox.

Code:
test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.25
Starting 1 process
Jobs: 1 (f=1): [m(1)][100.0%][r=514MiB/s,w=170MiB/s][r=132k,w=43.6k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=1796551: Sun Dec 17 13:00:50 2023
  read: IOPS=129k, BW=506MiB/s (530MB/s)(3070MiB/6069msec)
   bw (  KiB/s): min=344256, max=585616, per=99.93%, avg=517602.00, stdev=64283.77, samples=12
   iops        : min=86064, max=146404, avg=129400.50, stdev=16070.94, samples=12
  write: IOPS=43.3k, BW=169MiB/s (177MB/s)(1026MiB/6069msec); 0 zone resets
   bw (  KiB/s): min=115072, max=196960, per=99.92%, avg=172975.33, stdev=21894.97, samples=12
   iops        : min=28768, max=49240, avg=43243.83, stdev=5473.74, samples=12
  cpu          : usr=15.18%, sys=72.48%, ctx=30790, majf=0, minf=7
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=785920,262656,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=506MiB/s (530MB/s), 506MiB/s-506MiB/s (530MB/s-530MB/s), io=3070MiB (3219MB), run=6069-6069msec
  WRITE: bw=169MiB/s (177MB/s), 169MiB/s-169MiB/s (177MB/s-177MB/s), io=1026MiB (1076MB), run=6069-6069msec
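
For reference, the parameters visible in the output above (4k random read/write via libaio, iodepth 64, one job, 4 GiB total, roughly 75% reads) correspond to an invocation along these lines; the size and read/write mix are inferred from the totals, and the target is whatever --filename points at for each test:

Code:
fio --name=test --ioengine=libaio --rw=randrw --rwmixread=75 \
    --bs=4k --iodepth=64 --numjobs=1 --size=4G \
    --filename=<block device or test file>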

When I SSH into the VM and run the exact same test, the results are really bad.

Code:
test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.28
Starting 1 process
test: Laying out IO file (1 file / 4096MiB)
Jobs: 1 (f=1): [m(1)][100.0%][r=228MiB/s,w=75.2MiB/s][r=58.5k,w=19.3k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=2853: Sun Dec 17 12:52:20 2023
  read: IOPS=55.5k, BW=217MiB/s (228MB/s)(3070MiB/14148msec)
   bw (  KiB/s): min=142842, max=255696, per=100.00%, avg=222202.07, stdev=31067.74, samples=28
   iops        : min=35710, max=63924, avg=55550.50, stdev=7766.98, samples=28
  write: IOPS=18.6k, BW=72.5MiB/s (76.0MB/s)(1026MiB/14148msec); 0 zone resets
   bw (  KiB/s): min=47337, max=85096, per=99.99%, avg=74252.32, stdev=10239.74, samples=28
   iops        : min=11834, max=21274, avg=18563.07, stdev=2559.96, samples=28
  cpu          : usr=8.59%, sys=78.74%, ctx=235996, majf=0, minf=7
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=785920,262656,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=217MiB/s (228MB/s), 217MiB/s-217MiB/s (228MB/s-228MB/s), io=3070MiB (3219MB), run=14148-14148msec
  WRITE: bw=72.5MiB/s (76.0MB/s), 72.5MiB/s-72.5MiB/s (76.0MB/s-76.0MB/s), io=1026MiB (1076MB), run=14148-14148msec

Disk stats (read/write):
    dm-0: ios=778690/260239, merge=0/0, ticks=91932/9152, in_queue=101084, util=99.40%, aggrios=785920/262691, aggrmerge=0/8, aggrticks=96697/10822, aggrin_queue=107535, aggrutil=99.08%
  sda: ios=785920/262691, merge=0/8, ticks=96697/10822, in_queue=107535, util=99.08%

This translates to:

Read reduction of 56.98%
Write reduction of 57.04%

I know virtualisation comes with extra resource overhead, but surely it isn't this bad?

Proxmox information:
Linux 6.2.16-11-bpo11-pve #1 SMP PREEMPT_DYNAMIC PVE 6.2.16-11~bpo11+2 (2023-09-04T14:49Z)
pve-manager/7.4-17/513c62be

VM information:
Ubuntu Server 22.04.3
Linux 6.2.0-39

Disk type is LVM-Thin
SCSI Controller: VirtIO SCSI single
Hard Disk: aio=io_uring,discard=on,iothread=1,ssd=1

I've also just installed the QEMU agent but that had no impact on the test results.
 
I know virtualisation comes with extra resource overhead, but surely it isn't this bad?
fio is pushing 74 thousand I/O operations per second through the complete virtualization stack for you. In my small world, that is fast.

It would really be interesting to compare these PVE/KVM specific "costs" with the same on alternative hypervisors...

Best regards
 

If you are saying that there is nothing that can be done here and that you are impressed that I am "only" getting a 55% performance decrease, well then I may need to find another solution. Unfortunately the software I am running inside my VM is facing many problems, all linked back to poor performance from disk IO. After many weeks of config changes, troubleshooting with developers and even expensive disk upgrades, I never managed to get anywhere. It looks like I have identified the root cause and I was indeed fighting the symptom all along.

Could you please post the VM config: qm config VMID.

Did you try the other options?

Actually I did; I changed this setting to `threads` and also changed the cache to `Write back (unsafe)`. I ran fio again and the results were not any better. So I powered down the machine, changed the settings back to the defaults (io_uring, No cache), and started the machine up again, and now it is stuck in a boot loop. It's not even getting to the GRUB bootloader. The terminal gets to `Booting from Hard Disk...` and then PVE reports `Status: internal-error` and it doesn't progress beyond that. I then powered off, disabled the QEMU Guest Agent, and started up again, but still the same thing. All the settings I changed were reverted, but something is horribly broken somewhere.

Here are the results from that command

Code:
agent: 0
boot: order=ide2;scsi0
cores: 16
ide2: none,media=cdrom
memory: 28672
meta: creation-qemu=6.2.0,ctime=1651926219
name: -
net0: virtio=EE:35:D2:8D:9B:87,bridge=vmbr0,firewall=1
numa: 0
ostype: l26
scsi0: nvme002:vm-201-disk-0,discard=on,iothread=1,size=3700G,ssd=1
scsihw: virtio-scsi-single
smbios1: -
sockets: 1
vmgenid: -
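
For the record, the aio/cache experiments described earlier map to the scsi0 option string in this config. On the CLI the change and the revert would look roughly like this (a sketch only; cache=unsafe is my assumption for the GUI's `Write back (unsafe)` option, and the VM needs a full stop/start to pick the settings up):

Code:
# switch scsi0 to the threads AIO backend with unsafe writeback cache
qm set 201 --scsi0 nvme002:vm-201-disk-0,aio=threads,cache=unsafe,discard=on,iothread=1,ssd=1
# revert to the defaults (io_uring, no cache)
qm set 201 --scsi0 nvme002:vm-201-disk-0,discard=on,iothread=1,ssd=1
qm shutdown 201 && qm start 201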
 
If you are saying that there is nothing that can be done here and that you are impressed that I am "only" getting a 55% performance decrease, well then I may need to find another solution.
Well..., for best performance nothing beats bare metal - which is not surprising :-)

Even while computer capabilities rise every year there will always be a demand for faster ones.

Best regards
 
I understood it wouldn't be the same as bare metal; I guess what I am trying to say is that I am surprised my disk performance more than halved.
 
How did you set up the NVMe as storage? CEPH, ZFS, Directory, etc?

Somehow I don't see what CPU type you gave the VM. What hardware does your node have? Maybe your CPU has NUMA? Then it might be a good idea to set this up in the VM too.

After each change to the disk settings, did you shut down the VM and then start it again? This is the only way the new KVM process can adopt the new parameters. Some changes are applied online, but some are not. For comparable results, you should always run the test after a fresh boot of the VM.
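
If you want to check the host topology before changing anything, something along these lines on the PVE host will show it (numactl is not installed by default, so treat that part as optional):

Code:
# sockets, cores, threads and NUMA nodes as the host sees them
lscpu | grep -Ei 'socket|core|thread|numa'
# detailed NUMA layout (apt install numactl first if needed)
numactl --hardware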
 

The storage type is LVM-Thin

CPU is an AMD Ryzen 7 5700G. And I have just the one socket, so I didn't enable NUMA.

And yes, all my changes were done exactly like that. Send powerdown command, wait until it's off, and then make the changes, and then power it back on.
 
And how did you benchmark the host using fio? I guess not by creating a new thin volume on the same thin pool, formatting it with the same filesystem that is used inside the VM, mounting it on the PVE host and then pointing fio at that mountpoint?
Anything else wouldn't be a fair comparison.
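
A rough sketch of that kind of comparable host-side benchmark (the VG/thin pool names, the size and ext4 are all assumptions; adjust them to your setup):

Code:
# create a small thin volume on the same pool the VM disk lives on
lvcreate -V 20G -T nvme002/data -n fiotest
# format it like the guest filesystem and mount it
mkfs.ext4 /dev/nvme002/fiotest
mkdir -p /mnt/fiotest && mount /dev/nvme002/fiotest /mnt/fiotest
# run the same fio job against a file on that mountpoint
fio --name=test --ioengine=libaio --rw=randrw --rwmixread=75 \
    --bs=4k --iodepth=64 --size=4G --filename=/mnt/fiotest/test.file
# clean up afterwards
umount /mnt/fiotest && lvremove nvme002/fiotest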

And you are already using a proper Enterprise NVMe SSD?
 

So in fio you can specify, as an option, where you want the test to run. I SSHed into the Proxmox host, ran this command pointed at the NVMe and gathered the results. Then in my VM, I switched off the applications I was running so it was just Ubuntu Server by itself, meaning nothing else was using resources except the OS, which when idle is extremely low, so any impact on the results would be minimal. Then I executed that exact same fio command as before, except with the filename option changed so that it pointed at the local disk.

I do have to say this though. I'm a little confused by your comment. Whether it is an enterprise NVMe or not should not make any difference. The metrics look great in PVE console, but not in the VM. I have already upgraded hardware once from SSD to NVMe, and it didn't help. So I don't want to do this again because I could just be throwing more money into this without any guarantee that the problem will be solved.
 
Whether it is an enterprise NVMe or not should not make any difference.
They will perform way better. For example, easily a factor of 100 to 1000 better when doing sync writes. Depending on the workload, a consumer NVMe SSD will be as slow as a fast HDD.
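
The gap shows up in a sync-write test like this sketch (the target path is a placeholder), because consumer drives without power-loss protection have to flush their cache on every fsync:

Code:
# 4k random writes with an fsync after every write
fio --name=synctest --ioengine=libaio --rw=randwrite --bs=4k \
    --iodepth=1 --numjobs=1 --fsync=1 --size=1G \
    --filename=/path/to/testfile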

I SSH into the Proxmox host, ran this command with it pointed at the NVMe and gathered the results
How exactly? I guess you just benchmarked the PVE root filesystem, i.e. ext4 on top of the root LV?
 
They will perform way better. For example, easily a factor of 100 to 1000 better when doing sync writes. Depending on the workload, a consumer NVMe SSD will be as slow as a fast HDD.

I hope that this inefficiency can be fixed or addressed without having to go down the path of further hardware upgrades. But before I consider this, I would like Proxmox support to validate what you are saying; that this problem will be solved with the use of an enterprise NVMe. That also prompts a further question - What kind of performance difference between host OS and guest OS would I experience if I swapped my current NVMe to an enterprise NVMe? Would it still be an incredibly significant reduction?

If the answer is no, then I will need to consider buying an enterprise NVMe. If the answer is yes, then it may be time to move on from Proxmox. But I am hoping that Proxmox support can provide an input on this one.

How exactly? I guess you just benchmarked the PVE root filesystem, i.e. ext4 on top of the root LV?

No, the machine has two disks. One with PVE and nothing else, and the other with the VM in question. When I ran the test on PVE I had to specify `--filename=/dev/mapper/nvme002-vm--201--disk--0`, otherwise it would have run the test on the disk where PVE is installed, which would have proved nothing.
 
That also prompts a further question - What kind of performance difference between host OS and guest OS would I experience if I swapped my current NVMe to an enterprise NVMe? Would it still be an incredibly significant reduction?
There will still be overhead. But it doesn't matter that much if your storage is fast in the first place. ;)
Running virtualization will always cause overhead, no matter what hypervisor you use. PVE isn't worse than other solutions.
What you can do is buy proper disks, choose a storage type that isn't that demanding (your LVM-Thin should already be one of the less demanding options...it would be way worse when running something like ZFS or Ceph...), avoid nested filesystems, optimize your filesystems and storage so the block sizes match, and so on.
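
Checking whether the block sizes line up could look something like this (a sketch; the pool and guest device names are only examples):

Code:
# chunk size of the thin pool on the PVE host
lvs -a -o lv_name,chunk_size nvme002
# filesystem block size inside the guest (ext4 assumed)
tune2fs -l /dev/mapper/ubuntu--vg-ubuntu--lv | grep 'Block size'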

--filename=/dev/mapper/nvme002-vm--201--disk--0
If I'm not wrong you did a destructive test on block level. Then it should be faster because there is no filesystem involved adding additional overhead. You can't compare benchmark results between block level and filesystem level. Try to mount the filesystem on /dev/mapper/nvme002-vm--201--disk--0 on the PVE host while VM 201 isn't running and point fio to that mountpoint.
 
There will still be overhead. But it doesn't matter that much if your storage is fast in the first place. ;)
Running virtualization will always cause overhead, no matter what hypervisor you use. PVE isn't worse than other solutions.
What you can do is buy proper disks, choose a storage type that isn't that demanding (your LVM-Thin should already be one of the less demanding options...it would be way worse when running something like ZFS or Ceph...), avoid nested filesystems, optimize your filesystems and storage so the block sizes match, and so on.

Of course there would still be more overhead, I am not disputing that, I did want to know more about the specifics of what those overheads would look like.

For example here I got close to a 55% reduction. But what would happen if I ran this exact same test on an enterprise NVMe instead of the NVMe I have right now?

I'm trying to gather as much information as possible before I make any decisions. Previous research of mine suggested that upgrading from an SSD to an NVMe would make a big difference. Checking the raw numbers on the SSD/NVMe manufacturers' websites also backed up this research.

Raw numbers suggest the sequential read/write speed should have gone from 540MB/s & 500MB/s to 4800MB/s & 4100MB/s. I also understand these figures are theoretical maximums, but when I ran the same fio tests on the SSD that I ran on my NVMe, the SSD had 5.5k more write IOPS and 16.6k more read IOPS than the newer, more expensive NVMe.
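
For what it's worth, those datasheet figures correspond to a large-block sequential test roughly like the one below, which behaves very differently from the 4k random test used earlier (block size, queue depth and file size are arbitrary choices here):

Code:
# sequential 1M reads, then sequential 1M writes, against a test file
fio --name=seqread --ioengine=libaio --rw=read --bs=1M \
    --iodepth=32 --size=4G --filename=/path/to/testfile
fio --name=seqwrite --ioengine=libaio --rw=write --bs=1M \
    --iodepth=32 --size=4G --filename=/path/to/testfile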

So please forgive me for wanting more information before making a decision, especially one that involves more money.

If I'm not wrong you did a destructive test on block level. Then it should be faster because there is no filesystem involved adding additional overhead. You can't compare benchmark results between block level and filesystem level. Try to mount the filesystem on /dev/mapper/nvme002-vm--201--disk--0 on the PVE host while VM 201 isn't running and point fio to that mountpoint.

What do you mean by destructive? Is this test what potentially caused the system to not boot anymore? I'm not overly concerned about that anyway, I have backups, and it wasn't a critical system.
 
The CPU has 8 cores and 16 threads. You have assigned 16 cores to your VM. I would advise setting the CPU type to Host and reducing it to 8 cores and then doing the test again. Also take a look at the metrics on the node to see if you see a higher utilization or bottleneck somewhere.
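
As a sketch, using VMID 201 from the config above (the VM needs a full stop/start afterwards):

Code:
qm set 201 --cpu host --cores 8
qm shutdown 201 && qm start 201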

For an optimized result you will have to experiment a bit until you find it. Factors such as those already described by @Dunuin can also have a high impact on performance.
 
The CPU has 8 cores and 16 threads. You have assigned 16 cores to your VM. I would advise setting the CPU type to Host and reducing it to 8 cores and then doing the test again. Also take a look at the metrics on the node to see if you see a higher utilization or bottleneck somewhere.
PVE reports that I have "16 CPU(s)"; that is the reason why I have assigned 16 cores. When I open htop in the PVE console I see 16, and htop in the VM also shows 16. Am I only supposed to have 8 on the guest OS? Won't that mean 50% less CPU power? Have I overprovisioned the machine?

And also, more importantly, would changing this from 16 to 8 have a positive impact on disk IOPS?

CPU type is currently set to the default of kvm64, but I can definitely change it to host and try these tests again. At this stage I will need to create a new VM anyway, as this one is broken. Restoring from backup would most likely take more than a day given its size, and a clean state would probably be a better baseline for this test anyway. However, this is something I will need to do at a later time.

For an optimized result you will have to experiment a bit until you find it. Factors such as those already described by @Dunuin can also have a high impact on performance.

Believe me, I have tried so many things over the last 6 months to get better disk performance. However, based on most of the replies I have received in this thread, it sounds more and more like Proxmox just isn't meant for software that requires heavy disk IOPS, like the software I am running. I do want to note that I still haven't ruled out an enterprise NVMe; I just want to know more details from support before I commit more money to it.
 
For example here I got close to a 55% reduction.
Again, your -55% is probably not comparable. Benchmark the same filesystem from the host and the guest OS. You can't compare apples and oranges...
Do a proper fio benchmark on the host and you will probably see a slower result, more similar to what you see inside the VM.

Checking the raw numbers on the SSD/NVMe manufacturers' websites also backed up this research.
Those numbers are not that useful when running continuous server workloads. You could buy an NVMe SSD rated for 5000MB/s writes, and if you hit it with the wrong workload its performance will drop to below 1MB/s. The peak performance, which the SSD can only sustain for a few seconds, doesn't really matter much, yet that is what the manufacturer advertises in its datasheets. What's more important is the minimum performance you will always get when continuously reading/writing to it 24/7, and here enterprise SSDs are way ahead.

Raw numbers suggest the sequential read/write speed should have gone from 540MB/s & 500MB/s to 4800MB/s & 4100MB/s.
Yes, only for a few seconds until the cache is full. Upgrading from a SATA SSD to an NVMe SSD primarily helps with read performance or short bursts of writes. When doing continuous writes on a full disk, the performance will drop a lot...it's not uncommon for it to end up on par with your old SATA SSD.
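
A time-based run long enough to exhaust the drive's write cache shows the steady-state figure rather than the burst figure; for example something like this (runtime, size and target are assumptions):

Code:
# sustained sequential writes for 10 minutes; watch the bandwidth drop over time
fio --name=sustained --ioengine=libaio --rw=write --bs=1M --iodepth=32 \
    --time_based --runtime=600 --size=16G --filename=/path/to/testfile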

What do you mean by destructive? Is this test what potentially caused the system to not boot anymore? I'm not overly concerned about that anyway, I have backups, and it wasn't a critical system.
Yes. If you point fio at a block device with a write benchmark, it will just write to it at block level, destroying all data it overwrites and corrupting the filesystem on it. So again, do a proper fio test at file level after mounting a thin volume with the same filesystem your guest OS is using...

PVE reports that I have "16 CPU(s)"; that is the reason why I have assigned 16 cores. When I open htop in the PVE console I see 16, and htop in the VM also shows 16. Am I only supposed to have 8 on the guest OS? Won't that mean 50% less CPU power? Have I overprovisioned the machine?
16 threads don't mean you get 16 times the performance of a single CPU core. You only have 8 physical CPU cores, so only 8 things can be done at a time. Two threads share a CPU core, but only one thread can do work at a time: one thread can do something, utilizing the core, while the other thread is waiting. So 16 threads won't give you 1600% CPU performance...more like 800%-1100%.
So yes, when only assigning 8 vCPUs to a VM, the VM will be a bit slower in peak performance, but not by that much. You also have to keep the overhead of the hypervisor in mind. PVE needs 2 cores for itself. If you assign all the CPU performance to the VM, this can starve the hypervisor, and stuff will become slow as the hypervisor is then missing the resources to do its job.

And also, more importantly, would changing this from 16 to 8 have a positive impact on disk IOPS?
Depends. If your VM is actually utilizing all those 16 vCPUs, this will cripple storage performance, as the storage I/O has to be handled by the PVE host, which then has to wait for the VM to free up CPU time.
 