Snapshot backups spuriously max guest disk activity, causing application crashes

AffectedArc07

New Member
May 30, 2021
Hello

I am experiencing a weird issue where, whenever a backup job starts for a guest VM, there is a chance the guest will report 100% disk activity and refuse to do any more writes, which causes application crashes and other issues. I can't seem to find a rhyme or reason for this: backups worked fine for two weeks until they broke one day, with no configuration changes or package updates in between. The host disk is an NVMe SSD (Sabrent Rocket 4.0 2TB), so I have serious doubts that its read/write speeds are being maxed out by the rate the backups can transfer (1 gigabit WAN to a backup target consisting of mechanical HDDs).

The backup job covers 4 VMs (one FreeBSD pfSense, one Windows 10, two Ubuntu Server), and they all back up to a Proxmox Backup Server target. All VMs are susceptible to the random freezing, and they all started having the issue at the same time. The only real workaround is to manually stop the backups with vzdump --stop and re-run them, hoping the guest disk activity does not max out.
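For reference, the stop-and-rerun workaround is roughly this (the VMID and storage name are just placeholders for my actual guest and PBS storage):

# stop any vzdump jobs currently running on this host
vzdump --stop

# then kick the backup off again for the affected guest
vzdump 100 --storage pbs-backup --mode snapshot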

Things I have tried:
  • Enabling/disabling guest agent
  • Restarting the guest VMs
  • Restarting the host
  • Searching the logs for anything out of the ordinary (there isn't anything as far as I can tell)
Most VMs use the same configurations, but I will post them here for reference
[Screenshots of the VM hardware and options configuration attached]

Package versions:

proxmox-ve: 6.4-1 (running kernel: 5.4.114-1-pve)
pve-manager: 6.4-6 (running version: 6.4-6/be2fa32c)
pve-kernel-5.4: 6.4-2
pve-kernel-helper: 6.4-2
pve-kernel-5.4.114-1-pve: 5.4.114-1
pve-kernel-5.4.106-1-pve: 5.4.106-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.0.3-1
libpve-access-control: 6.4-1
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-3
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-2
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
openvswitch-switch: 2.12.3-1
proxmox-backup-client: 1.1.6-2
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.5-5
pve-cluster: 6.4-1
pve-container: 3.3-5
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.2-3
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.4-pve1

Any ideas? Thanks
 
Image to better convey what's going on, made on a fresh test VM with the guest agent enabled.
[Screenshot of the test VM's disk IO pinned at 100% during the backup]


EDIT: After removing the bandwidth limit I had added as part of testing, the VM now shows large random usage spikes.
[Screenshot of the disk IO graph showing large random spikes]
It will still lock up from time to time, but it's less severe.
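For anyone following along, the limit I was testing with was just the normal vzdump bandwidth limit; as far as I know it can be applied per run or globally, something like this (values are KiB/s, the storage name is a placeholder):

# one-off limit for a single backup run
vzdump 100 --storage pbs-backup --bwlimit 51200

# or set it for all jobs in /etc/vzdump.conf
# bwlimit: 51200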


I still want to know why this happened now after running flawlessly for 2 weeks.
 
this is probably happening because of the way qemu vm backups work

the backup code starts backing up blocks; when a write request comes in from inside the vm, qemu pauses that request, we back up that block, and then qemu lets the write continue
this means that for the duration of the backup, a write to disk can be limited by the backup speed
 
this means that for the duration of the backup, a write to disk can be limited by the backup speed
Is there a way I can just hold these reads/writes in RAM while it waits for that block to be done? The host has 128GB of RAM, so it should be able to handle a few megabytes for a bit, even if it's a consistency risk.

Failing that, is there any other remedy for this?
 
you can try changing the cache mode of the disk, this may have an impact on the backup speed too

afair there is no switch to keep those blocks in memory until they are done, and this would be very risky, since it could increase the memory consumption quite a bit
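e.g. something like this on the cli (vmid, storage and volume name are just placeholders, use the values from your vm config; the same setting is also in the gui under the vm's hardware tab):

# re-specify the disk with a different cache mode, e.g. writeback
qm set 100 --scsi0 local-lvm:vm-100-disk-0,cache=writeback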
 
you can try changing the cache mode of the disk, this may have an impact on the backup speed too
I tried each cache mode individually, both with guest agent enabled and disabled, and the problems were still the same
[Screenshot of the disk cache mode options tried]

The confusing part is that about 1 in 10 backups won't lock the entire guest IO up and will just run, not to mention it running flawlessly for 2 weeks before this.

Stuff is confusing.
 
After doing even more research, it seems vzdump used to have a --size parameter:
[Screenshot of older vzdump documentation listing the --size parameter]

This would allow me to make the blocks smaller, so writes wouldn't have to be queued up behind entire 1GB blocks.

Is there any reason this was removed, and is there a way it can be brought back? Failing that, what was the last version this worked on?


Edit: The minimum block size is 500, so that wouldn't work as our blocks are small.
 
Did even more reading and came across this:

The vzdump utility does backups through qemu. This means that, to have a consistent backup, qemu will write changing blocks first to the backup file and then to the VM disk. As a result, the write speed of a VM will always be the write speed of the slowest storage being written to.

Is there a way to VZDump a snapshot to local storage then have it auto-move (not copy, move) to a PBS node? I would much rather have all my backups consolidated on there and have a proper system as opposed to scripts which could break at any moment.
 
Is there a way to VZDump a snapshot to local storage then have it auto-move (not copy, move) to a PBS node? I would much rather have all my backups consolidated on there and have a proper system as opposed to scripts which could break at any moment.
no, not really; you could only install another pbs instance that you would sync with the other one
there is no way currently to move/copy "old-style" vzdump backups to a pbs
 
there is no way currently to move/copy "old-style" vzdump backups to a pbs
Is the ability to back up locally and then move to a PBS anywhere on the roadmap? The backup latency for offsite backups is a bit crippling, though I could likely speak to my provider about getting another local HDD to use for on-site backups and then replicate that datastore to an offsite backup.
 
Is the ability to back up locally and then move to a PBS anywhere on the roadmap?
no, because you can do that by having a pbs locally (even installed in parallel on the pve host) and then syncing it to a remote site
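roughly like this, run on the remote/offsite pbs, which then pulls from the local one (all names, the host address and the datastore names are placeholders, and the exact options may differ a bit between pbs versions):

# register the local (on-site) pbs as a remote
proxmox-backup-manager remote create local-pbs --host 192.168.1.10 --auth-id sync@pbs --password 'secret' --fingerprint '<cert fingerprint>'

# pull its datastore into the offsite datastore on a schedule
proxmox-backup-manager sync-job create pull-local --store offsite-store --remote local-pbs --remote-store local-store --schedule daily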
 
