IO performance issue with vzdump since upgrade to version 7

timproxmox

Since updating to Proxmox version 7, we have not been able to back up a single VM (production is affected).

The issue is that vzdump slows down and the VMs on the Proxmox node become unresponsive.

I was able to get a 450 MB/s transfer using rsync on the VM locally, to test access to the VM drives, which go over iSCSI.
I also copied a file from the iSCSI volume to another server and got 450 MB/s.
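
For reference, this is roughly the kind of throughput test we are talking about (the file path and target host below are just placeholders):

# inside the VM: push a large file off the iSCSI-backed disk to another server
rsync --progress /root/testfile.img root@othernode:/root/tmp/
# or read the raw device sequentially to rule out the storage path
dd if=/dev/sda of=/dev/null bs=1M count=4096 status=progress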

The issue occurs when using vzdump.

Initially we thought it was related to the Proxmox Backup Server, but that has been ruled out by our testing. We have isolated this to one of the Proxmox hosts.

Here is a vzdump output from the command line.

vzdump 130 --dumpdir /root/tmp -mode stop
INFO: starting new backup job: vzdump 130 --dumpdir /root/tmp --mode stop
INFO: Starting Backup of VM 130 (qemu)
INFO: Backup started at 2021-07-19 11:54:55
INFO: status = running
INFO: backup mode: stop
INFO: ionice priority: 7
INFO: VM Name: ace-bow-s-servicedesk-new
INFO: include disk 'scsi0' 'lvm_RG2V1:vm-130-disk-0' 60G
INFO: stopping virtual guest
INFO: creating vzdump archive '/root/tmp/vzdump-qemu-130-2021_07_19-11_54_55.vma'
INFO: starting kvm to execute backup task
INFO: started backup task '768b8bf7-5fab-4e84-8bb0-8166e24421d0'
INFO: resuming VM again after 7 seconds
INFO: 0% (59.0 MiB of 60.0 GiB) in 3s, read: 19.7 MiB/s, write: 19.2 MiB/s
INFO: 1% (633.0 MiB of 60.0 GiB) in 4m 49s, read: 2.0 MiB/s, write: 952.7 KiB/s

As you can see, the write speed is very low; this is when IO waits are high and the other VMs are also affected.
The issue occurs whether the VM being backed up is powered on or powered off.

We have confirmed that VM performance is fine when writing to and reading from disk.

The environment consists of two servers and a SAN, with iSCSI connections to the SAN and LVM on top of the iSCSI volumes.

Adding a second iSCSI drive to the VM and performing a dd backup within the VM got 80-90 MB/s.

#dd if=/dev/sda of=/mnt/sdb/sda.img status=progress
3694258176 bytes (3.7 GB, 3.4 GiB) copied, 44 s, 84.0 MB/s
11057149440 bytes (11 GB, 10 GiB) copied, 124 s, 89.2 MB/s
13946732544 bytes (14 GB, 13 GiB) copied, 154 s, 90.6 MB/s
14949594112 bytes (15 GB, 14 GiB) copied, 168 s, 89.0 MB/s

Shutting the VM down and backing it up from the SAN to the local drive using dd gave the results below:

#dd if=/dev/pve/vm-133-disk-1 of=/root/tmp/133.img status=progress
2048063488 bytes (2.0 GB, 1.9 GiB) copied, 10 s, 205 MB/s
2653897216 bytes (2.7 GB, 2.5 GiB) copied, 13 s, 204 MB/s
4004057088 bytes (4.0 GB, 3.7 GiB) copied, 20 s, 200 MB/s

I have also tested earlier versions of vzdump (same issue) by installing earlier versions of pve-manager, which includes vzdump.
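
For anyone who wants to try the same downgrade test, it was roughly along these lines (the version string below is only an example; check what your configured repositories actually offer first):

apt-cache policy pve-manager        # list the pve-manager versions available from the repos
apt-get install pve-manager=6.4-13  # example version string, pick one from the policy output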

Any assistance is very much appreciated.
 
Hi Everyone,

I believe we got to the bottom of it; however, it raises a few questions for Proxmox to provide a fix in version 7. vzdump was not the issue - it was a symptom of a larger issue with multipath.

Initially we were able to get the backups to appear to work by throttling the backup. I found https://www.serra.me/en/2020/11/proxmox-throttled-backups-for-better-performance/ which helped to a point, but ultimately the issue still presented itself as high IO, VMs not responding, and evidence of disk errors on the VM console when it was reconnected later.
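
For anyone who wants to try the throttling workaround first, vzdump's bwlimit setting (in KiB/s) can be set per job or globally; the value below is just an example:

# one-off job limited to roughly 50 MB/s
vzdump 130 --dumpdir /root/tmp --mode stop --bwlimit 51200

# or globally in /etc/vzdump.conf
bwlimit: 51200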

We narrowed this down to a multipath issue and decided the only way forward was to roll back to version 6.4.

Below is a basic summary of the rollback steps. It also helped to have a copy of the /etc folder and its subfolders from before blowing version 7 away; this allowed us to compare settings such as the networking config (bonding) and get everything up and running without too many headaches.

1. Migrate VMs to the 1st node
2. Install 6.4 on the 2nd node (configure iSCSI + multipath; see the sanity checks below), run updates
3. Migrate VMs back to the 2nd node (some had to be shut down to migrate due to the version difference, but it still worked)
4. Install 6.4 on the 1st node (configure iSCSI + multipath), run updates
5. Migrate the required machines back
6. Test backups - all worked without an issue and performance was not impacted.
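
For anyone following the same steps, these are the sort of sanity checks worth running after re-configuring iSCSI + multipath on each node, before migrating VMs back (nothing here is specific to our SAN):

iscsiadm -m session   # confirm the iSCSI sessions are logged in
multipath -ll         # confirm each LUN shows all of its expected paths as active
pvs; vgs              # confirm the LVM volume group on top of the multipath devices is visible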

If anyone has any channels to Proxmox support, please let them know we believe there is a serious issue with multipath in version 7. Happy to provide hardware details if required.

This also affected an older Linux server: I had to run the testdisk tool (from a GParted live environment) to scan and re-write the partition info, which allowed me to access the volume again. I then had to re-install GRUB to get the volume to boot once more.
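
The GRUB repair was roughly along these lines, run from a rescue/live environment (the device names are examples; adjust them to the guest's actual disk layout):

mount /dev/sda1 /mnt                  # mount the guest's root filesystem
mount --bind /dev /mnt/dev
mount --bind /proc /mnt/proc
mount --bind /sys /mnt/sys
chroot /mnt grub-install /dev/sda     # re-install GRUB to the disk
chroot /mnt update-grub               # regenerate the GRUB config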

I pretty much worked the whole weekend on this - hopefully this will help others in similar situations or allow you to do the relevant research before upgrading.

The system has been rock solid since downgrading.
 
