[SOLVED] High IO wait during backups after upgrading to Proxmox 7

e100

We recently completed the upgrade to Proxmox 7.
The issue exists on two different kernels
pveversion:
pve-manager/7.2-7/d0dd0e85 (running kernel: 5.15.39-1-pve)
and
pve-manager/7.2-7/d0dd0e85 (running kernel: 5.15.35-2-pve)

Since the upgrade, IO wait has increased dramatically during vzdump backup runs.
I also noticed that vzdump seems to complete backups faster than before the upgrade. That's good, but the trade-off seems to be overloading the IO subsystem to the point that it causes issues.

vzdump is writing to a spinning-rust SATA disk with LUKS.

VM storage is a mixture of ZFS zvols and LVM over DRBD.
The backing storage for DRBD ranges from Areca RAID arrays (some rust, some SSD) to a few volumes backed by PCIe SSD.

Most of our nodes only have fast SSD for VM storage; on those, IO wait during backup averaged around 4% before and is now around 8%, which does not seem to cause any issues.
On nodes with both SSD and spinning rust, it averaged around 8% before the upgrade and around 35% after, and we are seeing IO stalls in VMs where this was not a problem in Proxmox 6.


Nearly all VMs are set up to use 'VirtIO SCSI', with a few using 'VirtIO'; all use the default cache mode, none have IO Threads enabled, and Async IO is the default.
I see that the Async IO default changed in Proxmox 7 and I believe this is the source of the problem.

https://bugzilla.kernel.org/show_bug.cgi?id=199727
It seems to indicate I should use 'VirtIO SCSI Single', enable IO Threads, and set Async IO to threads to alleviate this issue.

Any other suggestions?

Here is a zabbix graph of IO wait from one of the nodes:
[Attachment: proxmox6-6iowait.png]
 
Hi,
Nearly all VMs are set up to use 'VirtIO SCSI', with a few using 'VirtIO'; all use the default cache mode, none have IO Threads enabled, and Async IO is the default.
I see that the Async IO default changed in Proxmox 7 and I believe this is the source of the problem.

https://bugzilla.kernel.org/show_bug.cgi?id=199727
It seems to indicate I should use 'VirtIO SCSI Single', enable IO Threads, and set Async IO to threads to alleviate this issue.
yes, we had users reporting improvements when enabling iothread in similar scenarios.

Any other suggestions?
You could also set a bandwidth limit for the backup job (not exposed in the GUI unfortunately) or as a node-wide default in /etc/vzdump.conf.
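For reference, a node-wide default in /etc/vzdump.conf could look like this minimal sketch (the value is in KiB/s, so the number below is roughly 100 MiB/s and is only an example):

Code:
# /etc/vzdump.conf - node-wide vzdump defaults
# limit backup bandwidth; the value is in KiB/s (102400 KiB/s is about 100 MiB/s)
bwlimit: 102400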
 
Hi,

yes, we had users reporting improvements when enabling iothread in similar scenarios.


You could also set a bandwidth limit for the backup job (not exposed in the GUI unfortunately) or as a node-wide default in /etc/vzdump.conf.
Would IO threads be sufficient, or is it also necessary to change Async IO to threads? What about the 'VirtIO SCSI Single' setting?

Some of the nodes already had bwlimit set to 150MB/sec, I'll try lowering it more to see if that helps or not.

Is this a bug that is being investigated?
It seems like it should be, since this is a significant performance regression between versions. At a minimum it should be mentioned in the release notes; that would have saved me a lot of time diagnosing the problem.
 
Would IO threads be sufficient, or is it also necessary to change Async IO to threads? What about the 'VirtIO SCSI Single' setting?
It really depends on your setup. 'VirtIO SCSI Single' is required for enabling iothread on SCSI disks. If enabling iothread alone doesn't help, then yes, you should switch the aio setting on your disks.
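To make that concrete, here is a rough sketch of the CLI changes for a single SCSI disk; the VM ID 100 and the volume local-zfs:vm-100-disk-0 are placeholders for whatever your setup actually uses:

Code:
# switch the controller type; required before iothread has an effect on scsiX disks
qm set 100 --scsihw virtio-scsi-single
# enable an IO thread and switch Async IO from the io_uring default to threads
qm set 100 --scsi0 local-zfs:vm-100-disk-0,iothread=1,aio=threads

The new settings only take effect after the VM has been fully stopped and started again.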
Some of the nodes already had bwlimit set to 150MB/sec, I'll try lowering it more to see if that helps or not.

Is this a bug that is being investigated?
It seems like it should be, since this is a significant performance regression between versions. At a minimum it should be mentioned in the release notes; that would have saved me a lot of time diagnosing the problem.
Not that I'm aware of. I have only seen one or two other such reports, and if it's really io_uring producing too much load, I don't know if much can be done other than trying the above workarounds.
 
@fiona The problem only happens when backing up VMs that are on ZFS. Here is IO wait over an entire backup run of 10 VMs. The high spikes happen when backing up VMs on ZFS storage and the lows are when backing up from LVM storage.
[Attachment: zfsvslvm.png]
Any idea what changed in ZFS between Proxmox 6 and 7 that would explain this performance degradation?
 
@fiona The problem only happens when backing up VMs that are on ZFS. Here is IO wait over an entire backup run of 10 VMs. The high spikes happen when backing up VMs on ZFS storage and the lows are when backing up from LVM storage.
[Attachment: zfsvslvm.png]
Any idea what changed in ZFS between Proxmox 6 and 7 that would explain this performance degradation?
A lot of things, because we went from ZFS 2.0 to ZFS 2.1. Can you check if the issue is also present when you boot with a 5.11 kernel (which uses an older ZFS kernel module)?
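For anyone following along, a rough sketch of how to run that test (assuming the standard boot menu; the exact 5.11 kernel version will differ per node):

Code:
# make sure a 5.11 series kernel is installed
apt install pve-kernel-5.11
# reboot, pick the 5.11 entry under "Advanced options" in the boot menu,
# then verify which kernel is actually running
uname -r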
 
@fiona I should be able to get one node running on 5.11 this weekend and will report back the results.

A couple of things I think are important:
1. This is a problem affecting ALL Proxmox users who use ZFS; they just might not have noticed it. All 23 of our Proxmox servers have had an increase in IO wait after upgrading from Proxmox 6 to 7. It does not matter whether ZFS is running on rust, SSD or PCIe SSD, or how ZFS is configured, such as having an L2ARC or not; rust is just affected so much more that the increase in IO wait becomes noticeable. I am confident that if you set up two identical nodes, one with Proxmox 6 and the other with 7, you can observe the difference in IO wait while performing the same benchmarks (see the sketch after this list). This needs to be investigated.

2. We had bwlimit set to 150MB/sec in vzdump.conf in Proxmox 6 and did not have an issue. I looked back at our notes on one of the nodes using rust disks, and we documented that a bwlimit of 200MB/sec caused issues but 150MB/sec did not. With Proxmox 7, even lowering bwlimit to 30MB/sec on that node causes issues. That is a significant reduction in performance!
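As a concrete example of the benchmark comparison mentioned in point 1, backing up the same VM on both versions while watching IO wait should be enough to show the difference (a sketch; the VM ID 100 and the storage name backup-store are placeholders):

Code:
# in one terminal: watch %iowait and per-disk utilisation
iostat -x 5
# in another: back up a single VM to the backup storage
vzdump 100 --storage backup-store --mode snapshot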

We have been a Proxmox subscriber for years. I am very disappointed that our problem is not being taken seriously and investigated even after I have shown that at least four other threads report the same problem!
 
We have been a Proxmox subscriber for years. I am very disappointed that our problem is not being taken seriously and investigated even after I have shown that at least four other threads report the same problem!

Did you already open a support ticket for your problem?
 
@fiona I should be able to get one node running on 5.11 this weekend and will report back the results.

A couple of things I think are important:
1. This is a problem affecting ALL Proxmox users who use ZFS; they just might not have noticed it. All 23 of our Proxmox servers have had an increase in IO wait after upgrading from Proxmox 6 to 7. It does not matter whether ZFS is running on rust, SSD or PCIe SSD, or how ZFS is configured, such as having an L2ARC or not; rust is just affected so much more that the increase in IO wait becomes noticeable. I am confident that if you set up two identical nodes, one with Proxmox 6 and the other with 7, you can observe the difference in IO wait while performing the same benchmarks. This needs to be investigated.

2. We had bwlimit set to 150MB/sec in vzdump.conf in Proxmox 6 and did not have an issue. I looked back at our notes on one of the nodes using rust disks, and we documented that a bwlimit of 200MB/sec caused issues but 150MB/sec did not. With Proxmox 7, even lowering bwlimit to 30MB/sec on that node causes issues. That is a significant reduction in performance!

We have been a Proxmox subscriber for years. I am very disappointed that our problem is not being taken seriously and investigated even after I have shown that at least four other threads report the same problem!
We are taking problems seriously, but if you need immediate assistance, please open a support ticket as @Neobin already suggested. Unfortunately, there are too many people reporting issues on the forum compared to the size of our team to address all of them.

FYI, the last thread you linked doesn't even use ZFS. And if all ZFS setups were affected after upgrading to Proxmox VE 7, I'd expect many more reports.
 
Hello e100 - there are updates to ZFS and a bunch of other packages today. Hopefully those help with the issue you ran into.


From apticron:

Code:
The following packages are currently pending an upgrade:

    e2fsprogs 1.46.5-2~bpo11+2
    e2fsprogs-l10n 1.46.5-2~bpo11+2
    libcom-err2 1.46.5-2~bpo11+2
    libext2fs2 1.46.5-2~bpo11+2
    libnvpair3linux 2.1.5-pve1
    libss2 1.46.5-2~bpo11+2
    libuutil3linux 2.1.5-pve1
    libzfs4linux 2.1.5-pve1
    libzpool5linux 2.1.5-pve1
    logsave 1.46.5-2~bpo11+2
    pve-kernel-5.15 7.2-9
    pve-kernel-5.15.39-4-pve 5.15.39-4
    pve-kernel-helper 7.2-9
    spl 2.1.5-pve1
    zfs-initramfs 2.1.5-pve1
    zfsutils-linux 2.1.5-pve1
    zfs-zed 2.1.5-pve1
 ...
    zfs-linux (2.1.5-pve1) bullseye; urgency=medium
* update ZFS to 2.1.5
  * Build with libcurl for new keylocation=https://
  * d/control: add new zfs-dracut package

Our production cluster is Ceph based, so we luckily did not experience the heavy issues you ran into. Our PVE secondary systems and PBS use ZFS, and those have been running slower.
 
I was able to reproduce the issue now on a test server going from <5% to 25-30% IO wait after upgrading from Proxmox VE 6 to 7.

It seems to not depend on kernel or ZFS version, but on QEMU version, in particular something between pve-qemu-kvm=5.2.0-11 and pve-qemu-kvm=6.0.0-2. I'll try to bisect and see if I can find the offending change.
 
The offending change is here and it is for improving backup performance, which it does, but it seems to be too aggressive for your setup. While QEMU uses max_worker = 64 by default, we only use max_worker = 16. I played around with that setting a bit. With max_worker = 1 the IO wait on my test server is basically like before the commit, but even max_worker = 4 leads to rather high IO wait.

High IO wait by itself is not an issue, but I can't predict how far the setting would need to be reduced to help with your guest IO stall issue (I didn't get that in my test VMs). And the problem is that lowering the setting very likely decreases performance for people with faster disks/different setups, and they will obviously complain as well then ;)
 
Could the "max_worker" be defined using arguments in the VMs config file? So that users could benchmark the VM and use what works best for them?
 
The offending change is here and it is for improving backup performance, which it does, but it seems to be too aggressive for your setup. While QEMU uses max_worker = 64 by default, we only use max_worker = 16. I played around with that setting a bit. With max_worker = 1 the IO wait on my test server is basically like before the commit, but even max_worker = 4 leads to rather high IO wait.

High IO wait by itself is not an issue, but I can't predict how far the setting would need to be reduced to help with your guest IO stall issue (I didn't get that in my test VMs). And the problem is that lowering the setting very likely decreases performance for people with faster disks/different setups, and they will obviously complain as well then ;)
I am very happy you were able to reproduce the problem. :)

By any chance, did the size of the IO requests also increase between the two versions?
According to iostat, it looks like each request during the backup is 1M.
I believe that might be related to "max_chunk".
Maybe smaller IO requests would help?
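For reference, the average request size hitting the backup target can be watched while a backup runs, using iostat from the sysstat package (a sketch; look at the line for the target disk):

Code:
# -x: extended statistics, -m: throughput in MB/s, refresh every 5 seconds
# check rareq-sz/wareq-sz (average request size in KB; older sysstat versions show a single avgrq-sz column in sectors)
iostat -xm 5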

Can I make any changes to test with different max_chunk or max_worker values on my system?

I think this is a problem on ZFS because ZFS has no notion of fairness: when a single VM starts saturating the IO, other VMs cannot get a fair share of it. It seems like this would be even more of a problem if the high-IO process is making much larger IO requests than the other processes.
 
Could the "max_worker" be defined using arguments in the VMs config file? So that users could benchmark the VM and use what works best for them?
If we do that, I think the storage config would be a better place for such parameters. But it's rather ugly and likely confusing if such low-level and very specific settings are exposed at a high level. It also would require code changes throughout the stack.

I am very happy you were able to reproduce the problem. :)
I tested around a bit more, and with a faster target disk the IO wait is much lower for me. Can you check whether that helps in your case as well? Also, how much RAM do you have, and how much of it is used for ZFS?
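(For anyone wanting to check this on their own node, a minimal sketch: the current ARC size and its configured maximum can be read from the kernel statistics.)

Code:
# current ARC size ("size") and configured maximum ("c_max"), both in bytes
awk '$1 == "size" || $1 == "c_max" {print $1, $3}' /proc/spl/kstat/zfs/arcstats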

By any chance, did the size of the IO requests also increase between the two versions?
According to iostat, it looks like each request during the backup is 1M.
I can see 1M requests from QEMU, but what ZFS does in the end will depend on the volblocksize, I think. Can you check with zfs list -o name,volblocksize what you are using currently?

I believe that might be related to "max_chunk".
We use the default value of 0, which means "unlimited". But the request size depends on other factors too; the calculation in QEMU is a bit involved. So unlimited just means limited by other factors, but yes, it seems to be 1M (checked with a debugger).

Maybe smaller IO requests would help?
Unfortunately, that again is likely detrimental for other setups.

Can I make any changes to test with different max_chunk or max_worker values on my system?
This is currently hard-coded and cannot be changed on the fly I'm afraid.

I think this is a problem on ZFS because ZFS has no notion of fairness: when a single VM starts saturating the IO, other VMs cannot get a fair share of it. It seems like this would be even more of a problem if the high-IO process is making much larger IO requests than the other processes.
 
@fiona


All of my testing has been on a single node, so this info only applies to it; the others are similar though.
I've not adjusted the volblocksize; it is 8K.
We have 128G of RAM.
The VMs are using about 14G of RAM.
I have ARC limited to 80G.
This node has a mirrored SLOG on two NVMe drives and an L2ARC on NVMe as well.
The pool is RAID 10 with 10 SATA III mechanical disks.

I do not have any faster disks to back up onto, but I did have some slower ones.
The slower disks did cause slightly higher IO wait than the faster disk.

Could you prepare a package with max_workers=1 that I could test with to see if it resolves my problem?

The server I've been testing with is fully updated:
# pveversion -v
Code:
proxmox-ve: 7.2-1 (running kernel: 5.15.39-4-pve)
pve-manager: 7.2-7 (running version: 7.2-7/d0dd0e85)
pve-kernel-5.15: 7.2-9
pve-kernel-helper: 7.2-9
pve-kernel-5.13: 7.1-9
pve-kernel-5.11: 7.0-10
pve-kernel-5.4: 6.4-12
pve-kernel-5.15.39-4-pve: 5.15.39-4
pve-kernel-5.15.39-1-pve: 5.15.39-1
pve-kernel-5.15.35-1-pve: 5.15.35-3
pve-kernel-5.13.19-6-pve: 5.13.19-15
pve-kernel-5.11.22-7-pve: 5.11.22-12
pve-kernel-5.4.162-1-pve: 5.4.162-2
pve-kernel-5.4.106-1-pve: 5.4.106-1
ceph: 15.2.16-pve1
ceph-fuse: 15.2.16-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: 0.8.36+pve1
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-2
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-3
libpve-storage-perl: 7.2-8
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.5-1
proxmox-backup-file-restore: 2.2.5-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-2
pve-container: 4.2-2
pve-docs: 7.2-2
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-5
pve-firmware: 3.5-1
pve-ha-manager: 3.4.0
pve-i18n: 2.7-2
pve-qemu-kvm: 7.0.0-2
pve-xtermjs: 4.16.0-1
pve-zsync: 2.2.3
qemu-server: 7.2-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.5-pve1
 
In case you are using ashift=12, it might be faster to use a bigger volblocksize. ashift=12 means ZFS will work with 4K blocks, so with a volblocksize of 8K those 8K will be split into 2x 4K blocks and written to the disks; you have 5 mirrors that data could be striped across, but only data for 2 of them. With a volblocksize of 16K you would get 4x 4K blocks, so 4 mirrors could work in parallel.
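To illustrate (a sketch; the pool name rpool and the storage name local-zfs are placeholders, and note that volblocksize is fixed when a zvol is created, so existing disks would have to be recreated or moved to pick up a new value):

Code:
# check the pool's ashift and the volblocksize of the existing zvols
zpool get ashift rpool
zfs list -o name,volblocksize -t volume
# set a 16K default for disks created on this ZFS storage from now on
pvesm set local-zfs --blocksize 16k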
 
