[SOLVED] High IO wait during backups after upgrading to Proxmox 7

e100

We recently completed the upgrade to Proxmox 7.
The issue exists on two different kernels
pveversion:
pve-manager/7.2-7/d0dd0e85 (running kernel: 5.15.39-1-pve)
and
pve-manager/7.2-7/d0dd0e85 (running kernel: 5.15.35-2-pve)

Since the upgrade, IO wait has increased dramatically during vzdump backup runs.
I also noticed that vzdump seems to complete backups faster than before the upgrade. That's good, but the trade-off seems to be overloading the IO subsystem to the point that it causes issues.

vzdump is writing to a spinning-rust SATA disk with LUKS.

VM storage is a mixture of ZFS zvols and LVM over DRBD.
The backing storage for DRBD ranges from Areca RAID arrays (some rust, some SSD) to a few volumes backed by PCIe SSD.

Most of our nodes only have fast SSD for VM storage; on those, IO wait during backup averaged around 4% before and is now around 8%, which does not seem to cause any issues.
On nodes with both SSD and spinning rust, it averaged around 8% before the upgrade and around 35% after, and we are seeing IO stalls in VMs where this was not a problem in Proxmox 6.


Nearly all VMs are set up to use 'VirtIO SCSI', with a few using 'VirtIO'; all use the default cache mode, none have IO Threads enabled, and Async IO is the default.
I see that the Async IO default changed in Proxmox 7 and I believe this is the source of the problem.

https://bugzilla.kernel.org/show_bug.cgi?id=199727
It seems to indicate I should use 'VirtIO SCSI Single', enable IO Threads, and set Async IO to threads to alleviate this issue.

Any other suggestions?

Here is a zabbix graph of IO wait from one of the nodes:
[Attachment: proxmox6-6iowait.png]
 
Hi,
Nearly all VMs are set up to use 'VirtIO SCSI', with a few using 'VirtIO'; all use the default cache mode, none have IO Threads enabled, and Async IO is the default.
I see that the Async IO default changed in Proxmox 7 and I believe this is the source of the problem.

https://bugzilla.kernel.org/show_bug.cgi?id=199727
It seems to indicate I should use 'VirtIO SCSI Single', enable IO Threads, and set Async IO to threads to alleviate this issue.
yes, we had users reporting improvements when enabling iothread in similar scenarios.

Any other suggestions?
You could also set a bandwidth limit for the backup job (not exposed in the GUI unfortunately) or as a node-wide default in /etc/vzdump.conf.
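For reference, a node-wide default in /etc/vzdump.conf could look like this minimal sketch (the value is in KiB/s, so the number below is roughly 100 MiB/s and is only an example):

Code:
# /etc/vzdump.conf - node-wide vzdump defaults
# limit backup bandwidth; the value is in KiB/s (102400 KiB/s is about 100 MiB/s)
bwlimit: 102400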
 
Hi,

yes, we had users reporting improvements when enabling iothread in similar scenarios.


You could also set a bandwidth limit for the backup job (not exposed in the GUI unfortunately) or as a node-wide default in /etc/vzdump.conf.
Would IO threads be sufficient, or is it also necessary to change Async IO to threads? What about the 'VirtIO SCSI Single' setting?

Some of the nodes already had bwlimit set to 150MB/sec, I'll try lowering it more to see if that helps or not.

Is this a bug that is being investigated?
It seems like it should be, since this is a significant performance regression between versions. At a minimum it should be mentioned in the release notes; that would have saved me a lot of time diagnosing the problem.
 
Would IO threads be sufficient, or is it also necessary to change Async IO to threads? What about the 'VirtIO SCSI Single' setting?
It really depends on your setup. 'VirtIO SCSI Single' is required for enabling iothread on SCSI disks. If enabling iothread alone doesn't help, then yes, you should switch the aio setting on your disks.
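To make that concrete, here is a rough sketch of the CLI changes for a single SCSI disk; the VM ID 100 and the volume local-zfs:vm-100-disk-0 are placeholders for whatever your setup actually uses:

Code:
# switch the controller type; required before iothread has an effect on scsiX disks
qm set 100 --scsihw virtio-scsi-single
# enable an IO thread and switch Async IO from the io_uring default to threads
qm set 100 --scsi0 local-zfs:vm-100-disk-0,iothread=1,aio=threads

The new settings only take effect after the VM has been fully stopped and started again.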
Some of the nodes already had bwlimit set to 150MB/sec, I'll try lowering it more to see if that helps or not.

Is this a bug that is being investigated?
It seems like it should be, since this is a significant performance regression between versions. At a minimum it should be mentioned in the release notes; that would have saved me a lot of time diagnosing the problem.
Not that I'm aware of. I have only seen one or two other such reports, and if it's really io_uring producing too much load, I don't know if much can be done other than trying the above workarounds.
 
@fiona The problem only happens when backing up VMs that are on ZFS. Here is IO wait over an entire backup run of 10 VMs. The high spikes happen when backing up VMs on ZFS storage and the lows are when backing up from LVM storage.
[Attachment: zfsvslvm.png]
Any idea what changed in ZFS between Proxmox 6 and 7 that would explain this performance degradation?
 
@fiona The problem only happens when backing up VMs that are on ZFS. Here is IO wait over an entire backup run of 10 VMs. The high spikes happen when backing up VMs on ZFS storage and the lows are when backing up from LVM storage.
[Attachment: zfsvslvm.png]
Any idea what changed in ZFS between Proxmox 6 and 7 that would explain this performance degradation?
A lot of things, because we went from ZFS 2.0 to ZFS 2.1. Can you check if the issue is also present when you boot with a 5.11 kernel (which uses an older ZFS kernel module)?
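For anyone following along, a rough sketch of how to run that test (assuming the standard boot menu; the exact 5.11 kernel version will differ per node):

Code:
# make sure a 5.11 series kernel is installed
apt install pve-kernel-5.11
# reboot, pick the 5.11 entry under "Advanced options" in the boot menu,
# then verify which kernel is actually running
uname -r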
 
@fiona I should be able to get one node running on 5.11 this weekend and will report back the results.

A couple of things I think are important:
1. This is a problem affecting ALL Proxmox users who use ZFS; they just might not have noticed it. All 23 of our Proxmox servers have had an increase in IO wait after upgrading from Proxmox 6 to 7. It does not matter whether ZFS is running on rust, SSD or PCIe SSD, or how ZFS is configured, such as having an L2ARC or not; rust is just affected so much more that the increase in IO wait becomes noticeable. I am confident that if you set up two identical nodes, one with Proxmox 6 and the other with 7, you can observe the difference in IO wait while performing the same benchmarks (see the sketch after this list). This needs to be investigated.

2. We had bwlimit set to 150MB/sec in vzdump.conf in Proxmox 6 and did not have an issue. I looked back at our notes on one of the nodes using rust disks, and we documented that a bwlimit of 200MB/sec caused issues but 150MB/sec did not. With Proxmox 7, even lowering bwlimit to 30MB/sec on that node causes issues. That is a significant reduction in performance!
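As a concrete example of the benchmark comparison mentioned in point 1, backing up the same VM on both versions while watching IO wait should be enough to show the difference (a sketch; the VM ID 100 and the storage name backup-store are placeholders):

Code:
# in one terminal: watch %iowait and per-disk utilisation
iostat -x 5
# in another: back up a single VM to the backup storage
vzdump 100 --storage backup-store --mode snapshot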

We have been a Proxmox subscriber for years. I am very disappointed that our problem is not being taken seriously and investigated even after I have shown that at least four other threads report the same problem!
 
We have been a Proxmox subscriber for years. I am very disappointed that our problem is not being taken seriously and investigated even after I have shown that at least four other threads report the same problem!

Did you already open a support ticket for your problem?
 
@fiona I should be able to get one node running on 5.11 this weekend and will report back the results.

A couple of things I think are important:
1. This is a problem affecting ALL Proxmox users who use ZFS; they just might not have noticed it. All 23 of our Proxmox servers have had an increase in IO wait after upgrading from Proxmox 6 to 7. It does not matter whether ZFS is running on rust, SSD or PCIe SSD, or how ZFS is configured, such as having an L2ARC or not; rust is just affected so much more that the increase in IO wait becomes noticeable. I am confident that if you set up two identical nodes, one with Proxmox 6 and the other with 7, you can observe the difference in IO wait while performing the same benchmarks. This needs to be investigated.

2. We had bwlimit set to 150MB/sec in vzdump.conf in Proxmox 6 and did not have an issue. I looked back at our notes on one of the nodes using rust disks, and we documented that a bwlimit of 200MB/sec caused issues but 150MB/sec did not. With Proxmox 7, even lowering bwlimit to 30MB/sec on that node causes issues. That is a significant reduction in performance!

We have been a Proxmox subscriber for years. I am very disappointed that our problem is not being taken seriously and investigated even after I have shown that at least four other threads report the same problem!
We are taking problems seriously, but if you need immediate assistance, please open a support ticket as @Neobin already suggested. Unfortunately, there are too many people reporting issues on the forum compared to the size of our team to address all of them.

FYI, the last thread you linked doesn't even use ZFS. And if all ZFS setups were affected after upgrading to Proxmox VE 7, I'd expect many more reports.
 
Hello e100 - there are updates to ZFS and a bunch of other packages today. Hopefully those help with the issue you ran into.


From apticron:

Code:
The following packages are currently pending an upgrade:

    e2fsprogs 1.46.5-2~bpo11+2
    e2fsprogs-l10n 1.46.5-2~bpo11+2
    libcom-err2 1.46.5-2~bpo11+2
    libext2fs2 1.46.5-2~bpo11+2
    libnvpair3linux 2.1.5-pve1
    libss2 1.46.5-2~bpo11+2
    libuutil3linux 2.1.5-pve1
    libzfs4linux 2.1.5-pve1
    libzpool5linux 2.1.5-pve1
    logsave 1.46.5-2~bpo11+2
    pve-kernel-5.15 7.2-9
    pve-kernel-5.15.39-4-pve 5.15.39-4
    pve-kernel-helper 7.2-9
    spl 2.1.5-pve1
    zfs-initramfs 2.1.5-pve1
    zfsutils-linux 2.1.5-pve1
    zfs-zed 2.1.5-pve1
 ...
    zfs-linux (2.1.5-pve1) bullseye; urgency=medium
* update ZFS to 2.1.5
  * Build with libcurl for new keylocation=https://
  * d/control: add new zfs-dracut package

Our production cluster is Ceph based, so we luckily did not experience the heavy issues you ran into. Our PVE secondary systems and PBS use ZFS, and those have been running slower.
 
I was able to reproduce the issue now on a test server going from <5% to 25-30% IO wait after upgrading from Proxmox VE 6 to 7.

It seems to not depend on kernel or ZFS version, but on QEMU version, in particular something between pve-qemu-kvm=5.2.0-11 and pve-qemu-kvm=6.0.0-2. I'll try to bisect and see if I can find the offending change.
 
The offending change is here and it is for improving backup performance, which it does, but it seems to be too aggressive for your setup. While QEMU uses max_worker = 64 by default, we only use max_worker = 16. I played around with that setting a bit. With max_worker = 1 the IO wait on my test server is basically like before the commit, but even max_worker = 4 leads to rather high IO wait.

High IO wait by itself is not an issue, but I can't predict how far the setting would need to be reduced to help with your guest IO stall issue (I didn't get that in my test VMs). And the problem is that lowering the setting very likely decreases performance for people with faster disks/different setups, and they will obviously complain as well then ;)
 
Could the "max_worker" be defined using arguments in the VMs config file? So that users could benchmark the VM and use what works best for them?
 
The offending change is here and it is for improving backup performance, which it does, but it seems to be too aggressive for your setup. While QEMU uses max_worker = 64 by default, we only use max_worker = 16. I played around with that setting a bit. With max_worker = 1 the IO wait on my test server is basically like before the commit, but even max_worker = 4 leads to rather high IO wait.

High IO wait by itself is not an issue, but I can't predict how far the setting would need to be reduced to help with your guest IO stall issue (I didn't get that in my test VMs). And the problem is that lowering the setting very likely decreases performance for people with faster disks/different setups, and they will obviously complain as well then ;)
I am very happy you were able to reproduce the problem. :)

By any chance, did the size of the IO requests also increase between the two versions?
According to iostat, it looks like each request during the backup is 1M.
I believe that might be related to "max_chunk".
Maybe smaller IO requests would help?
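For reference, the average request size hitting the backup target can be watched while a backup runs, using iostat from the sysstat package (a sketch; look at the line for the target disk):

Code:
# -x: extended statistics, -m: throughput in MB/s, refresh every 5 seconds
# check rareq-sz/wareq-sz (average request size in KB; older sysstat versions show a single avgrq-sz column in sectors)
iostat -xm 5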

Can I make any changes to test with different max_chunk or max_worker values on my system?

I think this is a problem on ZFS because ZFS has no notion of fairness: when a single VM starts saturating the IO, other VMs cannot get a fair share of it. It seems like this would be even more of a problem if the high-IO process is making much larger IO requests than the other processes.
 
Could the "max_worker" be defined using arguments in the VMs config file? So that users could benchmark the VM and use what works best for them?
If we do that, I think the storage config would be a better place for such parameters. But it's rather ugly and likely confusing if such low-level and very specific settings are exposed at a high level. It also would require code changes throughout the stack.

I am very happy you were able to reproduce the problem. :)
I tested around a bit more, and with a faster target disk the IO wait is much lower for me. Can you check whether that helps in your case as well? Also, how much RAM do you have, and how much of it is used for ZFS?
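(For anyone wanting to check this on their own node, a minimal sketch: the current ARC size and its configured maximum can be read from the kernel statistics.)

Code:
# current ARC size ("size") and configured maximum ("c_max"), both in bytes
awk '$1 == "size" || $1 == "c_max" {print $1, $3}' /proc/spl/kstat/zfs/arcstats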

By any chance, did the size of the IO requests also increase between the two versions?
According to iostat, it looks like each request during the backup is 1M.
I can see 1M requests from QEMU, but what ZFS does in the end will depend on the volblocksize, I think. Can you check with zfs list -o name,volblocksize what you are using currently?

I believe that might be related to "max_chunk".
We use the default value of 0, which means "unlimited". But the request size depends on other factors too; the calculation in QEMU is a bit involved. So unlimited just means limited by other factors, but yes, it seems to be 1M (checked with a debugger).

Maybe smaller IO requests would help?
Unfortunately, that again is likely detrimental for other setups.

Can I make any changes to test with different max_chunk or max_worker values on my system?
This is currently hard-coded and cannot be changed on the fly I'm afraid.

I think this is a problem on ZFS because ZFS has no notion of fairness: when a single VM starts saturating the IO, other VMs cannot get a fair share of it. It seems like this would be even more of a problem if the high-IO process is making much larger IO requests than the other processes.
 
@fiona


All of my testing has been on a single node, so this info only applies to it; the others are similar though.
I've not adjusted the volblocksize; it is 8K.
We have 128G of RAM.
The VMs are using about 14G of RAM.
I have ARC limited to 80G.
This node has a mirrored SLOG on two NVMe drives and an L2ARC on NVMe as well.
The pool is RAID 10 with 10 SATA III mechanical disks.

I do not have any faster disks to back up onto, but I did have some slower ones.
The slower disks did cause slightly higher IO wait than the faster disk.

Could you prepare a package with max_workers=1 that I could test with to see if it resolves my problem?

The server I've been testing with is fully updated:
# pveversion -v
Code:
proxmox-ve: 7.2-1 (running kernel: 5.15.39-4-pve)
pve-manager: 7.2-7 (running version: 7.2-7/d0dd0e85)
pve-kernel-5.15: 7.2-9
pve-kernel-helper: 7.2-9
pve-kernel-5.13: 7.1-9
pve-kernel-5.11: 7.0-10
pve-kernel-5.4: 6.4-12
pve-kernel-5.15.39-4-pve: 5.15.39-4
pve-kernel-5.15.39-1-pve: 5.15.39-1
pve-kernel-5.15.35-1-pve: 5.15.35-3
pve-kernel-5.13.19-6-pve: 5.13.19-15
pve-kernel-5.11.22-7-pve: 5.11.22-12
pve-kernel-5.4.162-1-pve: 5.4.162-2
pve-kernel-5.4.106-1-pve: 5.4.106-1
ceph: 15.2.16-pve1
ceph-fuse: 15.2.16-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: 0.8.36+pve1
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-2
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-3
libpve-storage-perl: 7.2-8
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.5-1
proxmox-backup-file-restore: 2.2.5-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-2
pve-container: 4.2-2
pve-docs: 7.2-2
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-5
pve-firmware: 3.5-1
pve-ha-manager: 3.4.0
pve-i18n: 2.7-2
pve-qemu-kvm: 7.0.0-2
pve-xtermjs: 4.16.0-1
pve-zsync: 2.2.3
qemu-server: 7.2-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.5-pve1
 
In case you are using ashift=12, it might be faster to use a bigger volblocksize. ashift=12 means ZFS will work with 4K blocks, so with a volblocksize of 8K those 8K will be split into 2x 4K blocks and written to the disks; you have 5 mirrors that data could be striped across, but only data for 2 of them. With a volblocksize of 16K you would get 4x 4K blocks, so 4 mirrors could work in parallel.
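To illustrate (a sketch; the pool name rpool and the storage name local-zfs are placeholders, and note that volblocksize is fixed when a zvol is created, so existing disks would have to be recreated or moved to pick up a new value):

Code:
# check the pool's ashift and the volblocksize of the existing zvols
zpool get ashift rpool
zfs list -o name,volblocksize -t volume
# set a 16K default for disks created on this ZFS storage from now on
pvesm set local-zfs --blocksize 16k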
 
