Shared Storage Backups slowing guests

Lymond

Renowned Member
Oct 7, 2009
I realize this topic has possibly been discussed to death. We had our setup working well until the upgrade from 6.4 to 7.3. Things are less good now.

We have 5 front-end nodes connected via 40 Gb/s InfiniBand to a ZFS file server.
VM storage is mounted via NFS on each node and is referred to in the Proxmox GUI as just a folder (the location of the mount).
VM backups using the GUI vzdump (not PBS) run over CIFS and a 1 Gb/s Ethernet connection, also mapped per node and referred to as a folder in the GUI.

When we upgraded, performance all around was terrible until we downgraded NFS on the nodes from 4.2 to 4.0 (the node upgrade had bumped the mount to vers=4.2) and rebooted them all. VM performance has been excellent since then (that was about a week ago).
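For reference, pinning the NFS version on each node came down to a one-line change in the mount definition. The export and mountpoint below are placeholders, not our actual paths:

```
# /etc/fstab on each PVE node -- hypothetical export and mountpoint names
# vers=4.0 pins the NFS version; without it the client negotiates up to 4.2
zfsserver-ib:/tank/vmstore  /mnt/vmstore  nfs  vers=4.0,hard,noatime  0  0
```

Remount (or reboot, as we did) after changing it.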

When the VM backups run, they're 20-50% faster than before, which is super. But any running VM with a normally high load (we stupidly run a small but busy mail server -- it uses the NFS over IB, but its VHDs are on the shared storage SSD pool) has its iowait cranked to 95%. Mail users are not happy. And this happens anywhere in the cluster -- the mail server runs on VM5, the backup was running on VM3 last night... and things were a mess.

I'm wondering whether slowing the backups down with bwlimit in /etc/vzdump.conf would help. The topology essentially has the VM traffic and the backup traffic sharing the IB link -- the backup data seems to be pulled through the front-end node from the storage server, which means normal VM traffic mixes with the backup traffic. I don't see a way around this short of doing VHD backups directly from the storage outside of Proxmox.
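What I have in mind is something along these lines (the value is only a placeholder; vzdump's bwlimit is in KiB/s):

```
# /etc/vzdump.conf -- hypothetical limit, roughly 100 MB/s
bwlimit: 102400
```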

Thank you for any tips.
 
Maybe that's related to the increased default "max-workers", which will hit the storage harder with more threads in parallel. This can really slow down a PVE 7 node when the storage is too slow to handle it. So: better performance if your storage is fast (IOPS not already bottlenecking), worse performance if your storage is slow.
Since a few weeks ago you can change that by setting "performance: max-workers=N" in vzdump.conf. Try setting something lower than the default of 8 there and see if it helps.
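For example, something like this (N=4 is just a value to try, not a recommendation):

```
# /etc/vzdump.conf -- fewer parallel backup workers (example value)
performance: max-workers=4
```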
 
Dunuin -- thanks for the reply. We're going to start by upping the number of nfsd processes. Our storage *should* be fast enough to handle the traffic in terms of what disks we're using -- whenever we load up the storage IO with something other than NFS, the VMs don't blink. The info about max-workers increasing is interesting and likely very relevant. We're seeing faster backup speeds and this could be why. The next backup runs tonight and I'll be monitoring.
 
Our storage *should* be fast enough to handle the traffic in terms of what disks we're using

You are literally going down the path I have recently been down! Here are a few tips:

1. Never GUESS. You're saying it 'should be fast enough', but you need to check. Run `iostat -xy 1` on your storage server and see if it's getting overloaded (see the sketch after this list).
2. Are you using LACP and OVS? You may be hitting the same problem that I did, where OVS is switching output ports and confusing your switch. Try using `ovs_options bond_mode=active-backup`
3. Do you have sync=disabled, or sharenfs=async,... on your ZFS server? If so, go back to NFS3 (and don't forget to reboot the host after you've changed the NFS version!)
4. Have you tuned your ZFS server correctly? I have an extremely opinionated ZFS playbook that you can run against an Ubuntu 22.04 machine to turn it into a ZFS server here: https://github.com/xrobau/zfs -- you may want to check, at least, zfs_arcmax and the other settings that are suboptimal for a file server - https://github.com/xrobau/zfs/blob/master/Makefile#L54
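To make tips 1 and 3 concrete, this is roughly what I'd run on the storage box; the dataset name is a placeholder:

```
# On the ZFS/NFS server: watch per-disk load while a backup runs
# %util near 100 plus high await/w_await = the disks are the bottleneck
iostat -xy 1

# Check whether sync writes are being cheated (tip 3); dataset name is hypothetical
zfs get sync,atime tank/vmstore
```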
 
Xrobau, thanks for the tips. After upping the default nfsd processes from 8 to 1024, things are looking very good. VM backups kicked off and our little email server hasn’t seen any extra IO wait. I’ll wait for the backups to finish and review the IO and rw graphs for the night but I’m cautiously optimistic.
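For the record, the thread-count change on the storage server was along these lines, assuming the server uses /etc/nfs.conf (older setups use RPCNFSDCOUNT in /etc/default/nfs-kernel-server instead):

```
# /etc/nfs.conf on the ZFS/NFS server
[nfsd]
threads=1024
```

Then restart the NFS server, or bump it live with `echo 1024 > /proc/fs/nfsd/threads`.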

Now I’m wondering about the bad NFS 4.2 performance, which was present whether backups were happening or not. Perhaps if we had upped the number of nfsd processes we wouldn’t have needed to drop to 4.0.

We have a similar DR setup and we can do some non-production testing on that.
 
Well, not quite, I guess. When the busy VM itself began backing up (rather than just being a VM on the node where other VMs were being backed up), IO wait spiked again. I've been reading that max-workers in /etc/vzdump.conf is configurable now; the default is 16, up from 1 in Proxmox 6. People in another thread are reporting that dropping it to 8 and/or setting bwlimit to 150 MB/s is helping -- it's worth a read. I'm going to test dropping max-workers to 8 and see how things go.
 
Well, not quite I guess

Don't guess, measure! Start by looking at the disks on the storage server. If they're busy there, then that's it - there's nothing you can do (apart from tuning how ZFS writes to the disks, see the playbook link above).

If they're NOT busy, you need to figure out where the slowdown is. Generating the backups? Network? NFS? Retransmissions? Fragmentation? It could be any number of things.

But please, PLEASE: measure, don't guess! And *REALLY* don't make changes just because you saw someone say it worked for them (e.g., 1024 NFS threads would be sufficient for about 30,000 clients mounting that server).
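If the disks look idle, a couple of quick, generic checks can narrow down the network/NFS side (the interface name below is hypothetical):

```
# On a PVE node (NFS client): a climbing retrans count points at network or server trouble
nfsstat -c | head

# On the storage server: the "th" line shows how many nfsd threads actually exist
grep ^th /proc/net/rpc/nfsd

# Errors/drops on the IB interface (replace ib0 with your interface)
ip -s link show ib0
```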
 
