[SOLVED] High I/O wait on nodes during backups

JuiceAdmin

Jul 3, 2024
Hello Proxmox community!

Our company uses Proxmox Virtual Environment and Proxmox Backup Server, and we have been noticing high I/O delay on our nodes while VM backups are running.
I'll list the relevant specs here at the top before I get into the details, as I will be referencing these later in the post.

Main VM storage NAS:
* Motherboard: Supermicro H12SSL-CT
* OS: TrueNAS-13.0-U6.1
* CPU: AMD EPYC 7443P (2.9 GHz base clock, 1 socket, 24 cores, 2 threads per core)
* RAM: 460 GB DDR4 ECC
* Networking: Mellanox Technologies MT27800 Family [ConnectX-5] 40 Gbps
* Disks:
  * Boot drive: HP MK000480GWCEV 480 GB SSD
  * VM storage pool:
    * RAIDZ3 vdev:
      * 24x Seagate XS6400LE70084 6.4 TB HDD
    * slog (mirror):
      * 2x Samsung SSD 990 PRO 1 TB
Proxmox node A (this node is part of a 5-node cluster and is the most heavily used in terms of CPU):
* Motherboard: Supermicro H12DSi-NT6
* OS: Proxmox Virtual Environment 8.2.2
* CPU: AMD EPYC 7F32 (3.70 GHz base clock, 2 sockets, 8 cores per socket, 2 threads per core)
* RAM: 256 GB DDR4 ECC
PBS server (physical):
* Pre-built server: HPE ProLiant DL20 Gen10
* OS: Proxmox Backup Server 3.2-6
* CPU: Intel(R) Xeon(R) E-2236 (3.40 GHz base clock, 1 socket, 6 cores, 2 threads per core)
* RAM: 64 GB DDR4 ECC
* Disks:
  * OS: 1x Samsung SSD 860 PRO 256 GB
  * Datastore: 20 TB zvol set up over iSCSI on the RAIDZ3 vdev on the main VM storage NAS (see above specs)

At first, the I/O delay appeared to correlate with one of our virtual machines, a Windows Server guest running Microsoft SQL Server, slowing to a crawl and needing to be completely restarted. We've made a couple of changes since then, including networking improvements to the original destination NAS (a Synology NAS whose datastore was mounted in PBS as SMB, later changed to NFS), and ultimately moving the destination datastore to another zvol on the main VM storage NAS (this is the current setup).

The main VM storage NAS has one pool with a separate zvol for each VM, and those zvols are shared out over iSCSI as separate LUNs. On the PVE side, we create an LVM volume group on each shared LUN and store the VM disks there. All Windows VMs, including the SQL Server one, have their SCSI disks set to write-back caching, with discard, SSD emulation, and IO thread enabled. Async IO is also set to threads, since we were otherwise blocked from live-migrating from local storage to LVM per this commit: https://git.proxmox.com/?p=qemu-server.git;a=commit;h=8fbae1dc8f9041aaa1a1a021740a06d80e8415ed
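For reference, the disk line for one of these VMs ends up looking roughly like this (the VM ID, storage name, and disk name are placeholders, but the options match what I described above):

# placeholder VM ID, volume group, and disk name; options as described above
qm set 101 --scsi0 lun-vm101-lvm:vm-101-disk-0,cache=writeback,discard=on,ssd=1,iothread=1,aio=threads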

Attached below is a snapshot of what I/O delay looks like during PBS backups for node A. It also goes up during backup jobs from our other backup system (Veeam). The PBS backups for this node start at 21:00 local time, and the other 4 nodes are staggered in half-hour increments after the previous node's backup, i.e. the start times are 21:00, 21:30, 22:00, 22:30, and 23:00 for the 5 nodes. At least for the night of 7/2/2024's backups, there was no overlap between node A's backups and the next node's backups.
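For context, the staggered jobs look roughly like this in /etc/pve/jobs.cfg (trimmed to two jobs and a handful of options; the job IDs, node names, and datastore name are placeholders):

vzdump: backup-node-a
        schedule 21:00
        node node-a
        storage pbs-datastore
        mode snapshot
        all 1

vzdump: backup-node-b
        schedule 21:30
        node node-b
        storage pbs-datastore
        mode snapshot
        all 1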

This is a screenshot of the entire day on 7/2/2024:
iowait_20240702.png
This is a screenshot of just the timeframe of the PBS backups on that same day:
iowait_zoom_20240702.png

My question is: is this jump in I/O delay a concern? Is it a sign of a deeper-rooted issue or a bottleneck somewhere? Again, there are currently no visible issues when PBS backups occur (e.g. no virtual machines hang), but I'd like to know if this is worth investigating so we can prevent potential issues before they occur.

Please let me know if there is any additional information that I can provide to assist.

Thanks in advance for the help!
 
Thank you both for the replies!

I completely forgot to mention fleecing - we actually tried it in the past as part of our attempts to solve the I/O lockups, but we ran into an issue where, in the middle of backup jobs, we'd get "permission denied" errors. Disabling fleecing resulted in no permission errors to the same storage destinations. I believe the only time fleecing didn't throw errors was when the VMs being backed up had to generate new dirty bitmaps. I wish I had taken logs and/or screenshots and posted something at the time, so unfortunately I have nothing more to offer than an anecdote on that particular point, and since this was a couple of months ago it's entirely possible my memory is fuzzy on all of this.

For tonight's backups, I'm going to start by reducing the I/O workers by half for each backup job and see if that yields any improvement. We've left them at the default (blank) fallback value all this time, which the web GUI states is 16, so I will change them all to 8 and go from there.
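In case it helps anyone reading later, this is the knob I mean; roughly, it can be set either as the node-wide fallback in /etc/vzdump.conf or per invocation (the VM ID and datastore name here are placeholders):

# /etc/vzdump.conf - node-wide fallback for the backup I/O workers
performance: max-workers=8

# or per run on the command line
vzdump 101 --storage pbs-datastore --performance max-workers=8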

I will also spend some time getting all of our nodes up to date and will attempt to use fleecing again. When we originally tried the feature, I believe it was very shortly after fleecing was introduced. I poked around the forums at the time to see if anyone else had the same permission errors we did, but was unable to find anything. Quite some time has passed since then, so we'll definitely give it another shot.
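For completeness, my understanding is that fleecing can be re-enabled per job in the GUI (Advanced tab) or on the command line along these lines (the fleecing storage name is a placeholder):

# place the temporary fleecing image on fast local storage (placeholder storage ID)
vzdump 101 --storage pbs-datastore --fleecing enabled=1,storage=local-lvm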

Thanks again for the help! I'll report back with more data as I get it.
 
we actually tried it in the past [...] we'd get "permission denied" errors. Disabling fleecing resulted in no permission errors to the same storage destinations.
Sounds like the following issue that popped up right after fleecing got introduced: https://bugzilla.proxmox.com/show_bug.cgi?id=5409 which is fixed in pve-qemu-kvm >= 8.1.5-6 with: https://git.proxmox.com/?p=pve-qemu.git;a=commit;h=51232e2e40a1cbee21ca153ca3af909f40d5a55d
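To check whether a given node already has the fix, you can look at the installed version with:

pveversion -v | grep pve-qemu-kvm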
 
Hi everyone,

So I have an update: it looks like reducing the I/O workers solved the issue! Going from 16 to 8 didn't make much of a difference, but going all the way down to 1, as you mentioned in the linked post, made a significant difference. We still see high I/O wait during some other operations, but the scope of my post here was limited to PBS, and that's likely something we'll have to investigate on our own.

Fiona, that issue you linked looks exactly like what we were experiencing with fleecing at the time. We'll have to play with it in the future, but for now, that I/O workers setting did the trick and that's good enough for me!

For fun, here is a graph of the backup time period after the change was made. Backups now seem to take slightly longer (which makes sense), but this isn't at all a concern for us:
1720724559806.png


I appreciate the help with this! I'll go ahead and mark this post as solved.
 

About

The Proxmox community has been around for many years and offers help and support for Proxmox VE, Proxmox Backup Server, and Proxmox Mail Gateway.
We think our community is one of the best thanks to people like you!

Get your subscription!

The Proxmox team works very hard to make sure you are running the best software and getting stable updates and security enhancements, as well as quick enterprise support. Tens of thousands of happy customers have a Proxmox subscription. Get yours easily in our online shop.

Buy now!