Snapshot backups spuriously max guest disk activity, causing application crashes

AffectedArc07

New Member
May 30, 2021
Hello

I am experiencing a weird issue where, whenever a backup job starts for a guest VM, there is a chance the guest will report 100% disk activity and refuse to do any more writes, which causes application crashes and other issues. I can't seem to find a rhyme or reason for this: backups worked fine for two weeks until they broke one day, with no configuration changes or package updates in between. The host disk is an NVMe SSD (Sabrent Rocket 4.0 2TB), so I have serious doubts that its read/write speeds are being maxed out by the rate the backups can transfer (1 gigabit WAN to a backup target consisting of mechanical HDDs).

The backup job covers 4 VMs (one FreeBSD pfSense, one Windows 10, two Ubuntu Server), and they all back up to a Proxmox Backup Server target. All VMs are susceptible to the random freezing, and they all started having the issue at the same time. The only real workaround is to manually stop the backups with vzdump --stop and re-run them, hoping the guest disk activity does not max out.
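For reference, the stop-and-rerun workaround is roughly this (the VMID and storage name are just placeholders for my actual guest and PBS storage):

# stop any vzdump jobs currently running on this host
vzdump --stop

# then kick the backup off again for the affected guest
vzdump 100 --storage pbs-backup --mode snapshot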

Things I have tried:
  • Enabling/disabling guest agent
  • Restarting the guest VMs
  • Restarting the host
  • Searching the logs for anything out of the ordinary (there isn't anything as far as I can tell)
Most VMs use the same configurations, but I will post them here for reference
[Screenshots of the VM hardware and options configuration attached]

Package versions:

proxmox-ve: 6.4-1 (running kernel: 5.4.114-1-pve)
pve-manager: 6.4-6 (running version: 6.4-6/be2fa32c)
pve-kernel-5.4: 6.4-2
pve-kernel-helper: 6.4-2
pve-kernel-5.4.114-1-pve: 5.4.114-1
pve-kernel-5.4.106-1-pve: 5.4.106-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.0.3-1
libpve-access-control: 6.4-1
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-3
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-2
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
openvswitch-switch: 2.12.3-1
proxmox-backup-client: 1.1.6-2
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.5-5
pve-cluster: 6.4-1
pve-container: 3.3-5
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.2-3
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.4-pve1

Any ideas? Thanks
 
Image to better convey what's going on, made on a fresh test VM with the guest agent enabled.
[Screenshot of the test VM's disk IO pinned at 100% during the backup]


EDIT: After removing the bandwidth limit I had added as part of testing, the VM now shows large random usage spikes.
[Screenshot of the disk IO graph showing large random spikes]
It will still lock up from time to time, but it's less severe.
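For anyone following along, the limit I was testing with was just the normal vzdump bandwidth limit; as far as I know it can be applied per run or globally, something like this (values are KiB/s, the storage name is a placeholder):

# one-off limit for a single backup run
vzdump 100 --storage pbs-backup --bwlimit 51200

# or set it for all jobs in /etc/vzdump.conf
# bwlimit: 51200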


I still want to know why this happened now after running flawlessly for 2 weeks.
 
this is probably happening because of the way qemu vm backups work

the backup code starts backing up blocks; when a write request comes in from inside the vm, qemu pauses that request, we back up that block, and then qemu lets the write continue
this means that for the duration of the backup, a write to disk can be limited by the backup speed
 
this means that for the duration of the backup, a write to disk can be limited by the backup speed
Is there a way I can just hold these reads/writes in RAM while it waits for that block to be done? The host has 128GB of RAM, so it should be able to handle a few megabytes for a bit, even if it's a consistency risk.

Failing that, is there any other remedy for this?
 
you can try changing the cache mode of the disk, this may have an impact on the backup speed too

afair there is no switch to keep those blocks in memory until they are done, and this would be very risky, since it could increase the memory consumption quite a bit
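e.g. something like this on the cli (vmid, storage and volume name are just placeholders, use the values from your vm config; the same setting is also in the gui under the vm's hardware tab):

# re-specify the disk with a different cache mode, e.g. writeback
qm set 100 --scsi0 local-lvm:vm-100-disk-0,cache=writeback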
 
you can try changing the cache mode of the disk, this may have an impact on the backup speed too
I tried each cache mode individually, both with guest agent enabled and disabled, and the problems were still the same
[Screenshot of the disk cache mode options tried]

The confusing part is that about 1 in 10 backups won't lock the entire guest IO up and will just run, not to mention it running flawlessly for 2 weeks before this.

Stuff is confusing.
 
After doing even more research, it seems vzdump used to have a --size parameter:
[Screenshot of older vzdump documentation listing the --size parameter]

This would allow me to make the blocks smaller, so writes wouldn't have to be queued up behind entire 1GB blocks.

Is there any reason this was removed, and is there a way it can be brought back? Failing that, what was the last version this worked on?


Edit: The minimum block size is 500, so that wouldn't work as our blocks are small.
 
Did even more reading and came across this:

The vzdump utility does backups through qemu. This means that, to have a consistent backup, qemu will write changing blocks first to the backup file and then to the VM disk. As a result, the write speed of a VM will always be the write speed of the slowest storage being written to.

Is there a way to VZDump a snapshot to local storage then have it auto-move (not copy, move) to a PBS node? I would much rather have all my backups consolidated on there and have a proper system as opposed to scripts which could break at any moment.
 
Is there a way to VZDump a snapshot to local storage then have it auto-move (not copy, move) to a PBS node? I would much rather have all my backups consolidated on there and have a proper system as opposed to scripts which could break at any moment.
no, not really; you could only install another pbs instance that you would sync with the other one
there is no way currently to move/copy "old-style" vzdump backups to a pbs
 
there is no way currently to move/copy "old-style" vzdump backups to a pbs
Is the ability to back up locally and then move to a PBS anywhere on the roadmap? The backup latency for offsite backups is a bit crippling, though I could likely speak to my provider about getting another local HDD to use for on-site backups and then replicate that datastore to an offsite backup.
 
Is the ability to back up locally and then move to a PBS anywhere on the roadmap?
no, because you can do that by having a pbs locally (even installed in parallel on the pve host) and then syncing it to a remote site
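roughly like this, run on the remote/offsite pbs, which then pulls from the local one (all names, the host address and the datastore names are placeholders, and the exact options may differ a bit between pbs versions):

# register the local (on-site) pbs as a remote
proxmox-backup-manager remote create local-pbs --host 192.168.1.10 --auth-id sync@pbs --password 'secret' --fingerprint '<cert fingerprint>'

# pull its datastore into the offsite datastore on a schedule
proxmox-backup-manager sync-job create pull-local --store offsite-store --remote local-pbs --remote-store local-store --schedule daily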
 
