VMs freezing and unreachable when backup server is slow

Luki20

Hello all,

I've been noticing a strange problem with Proxmox (6.4-13) lately in connection with Proxmox Backup Server (2.0-9).

Backing up our VMs at night works very well and reliably in most cases, and since it's mostly incremental, it's also pretty fast. The performance of the backup with an average of 80 MB/s is also relatively good, considering that it is pushed over the network to a completely different data center.

However, if reads or writes to the backup server run into problems, or the write path to the backup server is slow in general (because of limited bandwidth or a slow hard disk on the backup server), the VM freezes: its services are no longer reachable and it misbehaves in the GUI (the console times out instead of opening, shutdown times out, and only a hard stop works). In most cases the backup then aborts partway through (at 50% or 60%, for example) and throws an error: "ERROR: VM ... qmp command 'query-backup' failed - got timeout"

The symptoms mentioned in this forum entry (https://forum.proxmox.com/threads/certain-vms-from-a-cluster-cannot-be-backed-up-and-managed.57016/) look very familiar to me.

Now my question: should the VM become inaccessible for a short time because of problems with the backup? Ideally, a backup should not interfere with the services or the VM itself. Is this normal, or is it just a problem with my server?

Thanks for your help!

Full backup log:

INFO: trying to get global lock - waiting...
INFO: got global lock
INFO: starting new backup job: vzdump 101 --mailnotification failure --mode snapshot --mailto ...@... --storage backup_pbs --quiet 1
INFO: Starting Backup of VM 101 (qemu)
INFO: Backup started at 2021-09-15 02:52:46
INFO: status = running
INFO: VM Name: VM01
INFO: include disk 'sata1' 'storage2:101/vm-101-disk-0.qcow2' 500G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/101/2021-09-15T00:52:46Z'
INFO: enabling encryption
INFO: started backup task '0c8136ea-9282-44b1-a505-1a1ae417eb88'
INFO: resuming VM again
INFO: sata1: dirty-bitmap status: OK (21.7 GiB of 500.0 GiB dirty)
INFO: using fast incremental mode (dirty-bitmap), 21.7 GiB dirty of 500.0 GiB total
INFO: 1% (308.0 MiB of 21.7 GiB) in 3s, read: 102.7 MiB/s, write: 101.3 MiB/s
INFO: 2% (656.0 MiB of 21.7 GiB) in 6s, read: 116.0 MiB/s, write: 116.0 MiB/s
INFO: 4% (1000.0 MiB of 21.7 GiB) in 9s, read: 114.7 MiB/s, write: 113.3 MiB/s
INFO: 5% (1.2 GiB of 21.7 GiB) in 12s, read: 65.3 MiB/s, write: 65.3 MiB/s
INFO: 6% (1.3 GiB of 21.7 GiB) in 1m 37s, read: 1.7 MiB/s, write: 1.7 MiB/s
INFO: 7% (1.6 GiB of 21.7 GiB) in 1m 56s, read: 14.7 MiB/s, write: 14.5 MiB/s
INFO: 8% (1.8 GiB of 21.7 GiB) in 1m 59s, read: 74.7 MiB/s, write: 73.3 MiB/s
INFO: 9% (2.0 GiB of 21.7 GiB) in 2m 2s, read: 69.3 MiB/s, write: 69.3 MiB/s
INFO: 10% (2.2 GiB of 21.7 GiB) in 2m 5s, read: 80.0 MiB/s, write: 78.7 MiB/s
INFO: 11% (2.6 GiB of 21.7 GiB) in 2m 8s, read: 109.3 MiB/s, write: 109.3 MiB/s
INFO: 12% (2.8 GiB of 21.7 GiB) in 2m 11s, read: 81.3 MiB/s, write: 80.0 MiB/s
INFO: 14% (3.1 GiB of 21.7 GiB) in 2m 14s, read: 94.7 MiB/s, write: 92.0 MiB/s
INFO: 15% (3.3 GiB of 21.7 GiB) in 2m 17s, read: 76.0 MiB/s, write: 74.7 MiB/s
INFO: 16% (3.5 GiB of 21.7 GiB) in 2m 20s, read: 73.3 MiB/s, write: 73.3 MiB/s
INFO: 17% (3.9 GiB of 21.7 GiB) in 2m 23s, read: 129.3 MiB/s, write: 126.7 MiB/s
INFO: 19% (4.1 GiB of 21.7 GiB) in 2m 26s, read: 88.0 MiB/s, write: 88.0 MiB/s
INFO: 20% (4.4 GiB of 21.7 GiB) in 2m 29s, read: 88.0 MiB/s, write: 85.3 MiB/s
INFO: 21% (4.7 GiB of 21.7 GiB) in 4m 56s, read: 2.1 MiB/s, write: 2.1 MiB/s
INFO: 23% (5.0 GiB of 21.7 GiB) in 4m 59s, read: 101.3 MiB/s, write: 101.3 MiB/s
INFO: 24% (5.3 GiB of 21.7 GiB) in 5m 3s, read: 67.0 MiB/s, write: 67.0 MiB/s
INFO: 25% (5.4 GiB of 21.7 GiB) in 5m 6s, read: 57.3 MiB/s, write: 56.0 MiB/s
INFO: 26% (5.7 GiB of 21.7 GiB) in 5m 9s, read: 74.7 MiB/s, write: 74.7 MiB/s
INFO: 27% (5.9 GiB of 21.7 GiB) in 5m 12s, read: 86.7 MiB/s, write: 86.7 MiB/s
INFO: 28% (6.1 GiB of 21.7 GiB) in 5m 15s, read: 69.3 MiB/s, write: 69.3 MiB/s
INFO: 29% (6.3 GiB of 21.7 GiB) in 11m 45s, read: 525.1 KiB/s, write: 462.1 KiB/s
INFO: 30% (6.5 GiB of 21.7 GiB) in 14m 16s, read: 1.6 MiB/s, write: 1.6 MiB/s
INFO: 31% (6.8 GiB of 21.7 GiB) in 14m 19s, read: 76.0 MiB/s, write: 73.3 MiB/s
INFO: 32% (7.0 GiB of 21.7 GiB) in 14m 22s, read: 66.7 MiB/s, write: 64.0 MiB/s
INFO: 33% (7.3 GiB of 21.7 GiB) in 14m 25s, read: 101.3 MiB/s, write: 101.3 MiB/s
INFO: 34% (7.5 GiB of 21.7 GiB) in 14m 28s, read: 68.0 MiB/s, write: 66.7 MiB/s
INFO: 35% (7.7 GiB of 21.7 GiB) in 14m 31s, read: 66.7 MiB/s, write: 66.7 MiB/s
INFO: 36% (7.9 GiB of 21.7 GiB) in 14m 34s, read: 80.0 MiB/s, write: 80.0 MiB/s
INFO: 37% (8.1 GiB of 21.7 GiB) in 15m 53s, read: 2.4 MiB/s, write: 2.2 MiB/s
INFO: 38% (8.3 GiB of 21.7 GiB) in 15m 56s, read: 77.3 MiB/s, write: 74.7 MiB/s
INFO: 39% (8.6 GiB of 21.7 GiB) in 16m, read: 72.0 MiB/s, write: 68.0 MiB/s
INFO: 40% (8.8 GiB of 21.7 GiB) in 16m 3s, read: 58.7 MiB/s, write: 58.7 MiB/s
INFO: 41% (9.0 GiB of 21.7 GiB) in 16m 7s, read: 54.0 MiB/s, write: 54.0 MiB/s
INFO: 42% (9.2 GiB of 21.7 GiB) in 16m 10s, read: 72.0 MiB/s, write: 69.3 MiB/s
INFO: 43% (9.4 GiB of 21.7 GiB) in 16m 13s, read: 77.3 MiB/s, write: 77.3 MiB/s
INFO: 44% (9.6 GiB of 21.7 GiB) in 16m 16s, read: 60.0 MiB/s, write: 57.3 MiB/s
INFO: 45% (9.8 GiB of 21.7 GiB) in 16m 20s, read: 59.0 MiB/s, write: 58.0 MiB/s
INFO: 46% (10.0 GiB of 21.7 GiB) in 16m 24s, read: 58.0 MiB/s, write: 51.0 MiB/s
INFO: 47% (10.3 GiB of 21.7 GiB) in 16m 28s, read: 58.0 MiB/s, write: 47.0 MiB/s
INFO: 48% (10.5 GiB of 21.7 GiB) in 16m 31s, read: 84.0 MiB/s, write: 74.7 MiB/s
INFO: 49% (10.7 GiB of 21.7 GiB) in 16m 43s, read: 15.7 MiB/s, write: 15.0 MiB/s
INFO: 50% (10.9 GiB of 21.7 GiB) in 16m 50s, read: 28.0 MiB/s, write: 28.0 MiB/s
INFO: 51% (11.1 GiB of 21.7 GiB) in 16m 59s, read: 24.0 MiB/s, write: 24.0 MiB/s
INFO: 52% (11.3 GiB of 21.7 GiB) in 17m 7s, read: 26.5 MiB/s, write: 26.5 MiB/s
INFO: 53% (11.5 GiB of 21.7 GiB) in 17m 13s, read: 40.7 MiB/s, write: 24.7 MiB/s
INFO: 54% (11.7 GiB of 21.7 GiB) in 17m 18s, read: 40.8 MiB/s, write: 24.8 MiB/s
ERROR: VM 101 qmp command 'query-backup' failed - got timeout
INFO: aborting backup job
INFO: resuming VM again
ERROR: Backup of VM 101 failed - VM 101 qmp command 'query-backup' failed - got timeout
INFO: Failed at 2021-09-15 03:22:30
INFO: Backup job finished with errors

TASK ERROR: job errors
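
For anyone who wants to find the stalls in a log like the one above without reading every line, here is a minimal Python sketch. It assumes the progress-line format shown in this log; the names `find_stalls` and `to_seconds` and the 30-second threshold are made up for illustration and are not part of any Proxmox tool.

```python
import re

# Matches vzdump progress lines such as:
# "INFO: 6% (1.3 GiB of 21.7 GiB) in 1m 37s, read: 1.7 MiB/s, write: 1.7 MiB/s"
PROGRESS = re.compile(r"INFO:\s+(\d+)% \(.*?\) in (.+?), read:")


def to_seconds(elapsed):
    """Convert an elapsed string like '1m 37s', '16m' or '3s' to seconds."""
    units = {"h": 3600, "m": 60, "s": 1}
    return sum(int(n) * units[u] for n, u in re.findall(r"(\d+)([hms])", elapsed))


def find_stalls(log_text, threshold_s=30):
    """Return (percent, gap_seconds) for every progress line that arrived
    more than threshold_s seconds after the previous progress line."""
    stalls = []
    previous = None
    for line in log_text.splitlines():
        match = PROGRESS.search(line)
        if not match:
            continue  # not a progress line (header, error, ...)
        percent = int(match.group(1))
        elapsed = to_seconds(match.group(2))
        if previous is not None and elapsed - previous > threshold_s:
            stalls.append((percent, elapsed - previous))
        previous = elapsed
    return stalls
```

Run against the log above with the default threshold, the longest gap it reports is the 29% line, which arrived 390 seconds after the 28% line.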
 
we already cache some blocks in memory, but if the backup storage is too slow, symptoms such as this can happen
Thanks for the clarification! So there isn't anything that prevents the slowdown (besides a fast backup server), right?
 
yeah sadly. we cannot cache infinitely (potentially up to the size of the disk?). the only way that could possibly work is to ignore writes to the disk or let them happen, either way the backup or the disk is inconsistent afterwards...
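
To make the "cannot cache infinitely" point concrete, here is a back-of-the-envelope Python sketch. The function name and all numbers are hypothetical; it only shows that a fixed-size buffer postpones the stall rather than preventing it whenever blocks arrive faster than the backup target can take them.

```python
def seconds_until_full(buffer_mib, incoming_mib_s, drain_mib_s):
    """Time until a fixed-size buffer fills up.

    incoming_mib_s: rate at which blocks are queued for the backup.
    drain_mib_s:    rate at which the backup target accepts them.
    Returns None if the buffer never fills (drain keeps up with incoming).
    """
    surplus = incoming_mib_s - drain_mib_s
    if surplus <= 0:
        return None
    return buffer_mib / surplus


# Hypothetical example: 1 GiB of cache, 50 MiB/s queued, target stuck at 2 MiB/s.
print(seconds_until_full(1024, 50, 2))
```

With these made-up numbers the buffer lasts roughly 21 seconds, far shorter than the multi-minute stalls in the log, so a larger cache alone would need to approach the size of the dirty data to help.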
 
Depending on the importance of the VM, failing a backup but ensuring the VM is still working might be desired
mhm... i agree, but i am unsure if (and how) we could implement such a setting, maybe you could open a feature request for that here: https://bugzilla.proxmox.com (there we can better track/assign it)
 
I've never had this happen. What disks are you using? Are the VM disks and the PBS backup disks separate?
FYI, I use neither RAID (for a speed increase) nor SSDs, and I have almost never seen a VM hang during backup.
 
Would it be a good idea to install PBS as a VM on the same Proxmox node, and then use the remote sync job feature to sync the backup content to an external server?
 
This seems like a show-stopper to me, not an annoyance. A freeze of more than a few seconds is rarely tolerable, and an effective total freeze until hard reboot is a terrifying thing to face.

Network errors are a fact of life. In addition, if you use a remote third party PBS service (e.g. Tuxis) you have no control over the load, bandwidth availability and so forth on the PBS server itself.

From what I read in the two bug reports, it seems like this is all down to some kind of limitation in qemu.

Even so, can't there be a user-configurable disk buffer (on the node) in addition to the memory buffer? That would help smooth out the bumps when there are very short-term problems, more so than buffering in memory alone.

Another issue that seems to be asked repeatedly in the bug reports is "how slow is too slow?" in terms of write speeds to trigger a backup abort, to prevent the VM freezing (or worse). Again, can't that be user-configurable? If I know that I have a really slow DSL connection, I can set the minimum write speed to a low number. If I have a 1 Gbit Ethernet connection, I can set it to a high one. That way, if things get slower than they should be during normal operation, the backup would abort, leaving the VM to run correctly instead of freezing.

If none of the above is possible, what about a short timeout (user-configurable again)? For our VMs, a freeze of anything more than about 10 seconds would be intolerable. I'd happily see a backup aborted rather than a VM freeze for longer than that.
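
To illustrate the kind of policy being asked for, here is a small Python sketch. It is purely hypothetical: `should_abort` and its parameters are invented for this post and do not correspond to any existing Proxmox or QEMU setting. It combines the two ideas above, a minimum write speed and a maximum time spent below it.

```python
def should_abort(samples_mib_s, min_speed_mib_s, max_slow_seconds, sample_interval_s=1):
    """Decide whether a backup should be aborted.

    samples_mib_s:     write throughput samples, one per sample_interval_s.
    min_speed_mib_s:   user-chosen "too slow" threshold.
    max_slow_seconds:  how long the backup may stay below the threshold.
    Returns True as soon as throughput has been below the threshold for
    longer than max_slow_seconds without recovering in between.
    """
    slow_for = 0
    for speed in samples_mib_s:
        if speed < min_speed_mib_s:
            slow_for += sample_interval_s
            if slow_for > max_slow_seconds:
                return True
        else:
            slow_for = 0  # throughput recovered, start counting again
    return False
```

With a 10-second limit, a dip that recovers within a few seconds would be tolerated, while a stall like the multi-minute ones in the log at the top of this thread would end the backup and leave the VM running.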
 
Thanks all for discussing this further. I agree that there should be some way for users to configure this.
 
this is how the backup works, it intercepts write calls from the vm and backs up the relevant block first (detailed info here: https://git.proxmox.com/?p=pve-qemu...16aeb06259c2bcd196e4949c;hb=refs/heads/master)

we already cache some blocks in memory, but if the backup storage is too slow, symptoms such as this can happen
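
The mechanism described in the quote can be pictured with a toy Python model. This is only a sketch of the general copy-before-write idea, not the actual QEMU or Proxmox code; the class and method names are invented, and the in-memory caching mentioned in the quote is left out.

```python
class CopyBeforeWriteBackup:
    """Toy model of a backup that must save the old contents of a block
    before the guest is allowed to overwrite it."""

    def __init__(self, disk, backup_write):
        self.disk = disk                  # list of block contents
        self.backup_write = backup_write  # callable(index, data); may be slow
        self.saved = set()                # blocks already sent to the backup

    def guest_write(self, index, data):
        if index not in self.saved:
            # The old data has to reach the backup first. If backup_write
            # is slow, the guest's write waits here for just as long.
            self.backup_write(index, self.disk[index])
            self.saved.add(index)
        self.disk[index] = data

    def background_copy(self):
        # The regular backup pass copies every block not yet saved.
        for index, data in enumerate(self.disk):
            if index not in self.saved:
                self.backup_write(index, data)
                self.saved.add(index)


disk = ["a", "b", "c"]
backup = {}
job = CopyBeforeWriteBackup(disk, backup.__setitem__)
job.guest_write(1, "B")   # old block "b" is backed up before being replaced
job.background_copy()
print(backup)             # contents as they were when the backup started
print(disk)               # current guest contents
```

The backup ends up with the disk as it was at the start, while the guest sees its new data. The cost is visible in `guest_write`: every first write to a block is only as fast as the backup target.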
First, I want to thank you. For years I have had VM slowness problems during backups. I have read many, many threads from people with the same problems, and we lost many days trying to understand where the problem lies; now you have finally shown it to me.
I had assumed that the backup was done the standard way: take a disk snapshot, then back up the snapshot.
Now I read that, to "optimize" the number of reads and writes during a backup, Proxmox "binds" the speed of the VM's disks to the speed of the backup server's disks.
I have a customer with a database on Ceph/NVMe disks. They write to the database 24 hours a day. The backup server has 5400 rpm disks. When the backup starts, the database VM slows down. Please note that this is not a network problem, because the Proxmox-to-PBS connection is 10 Gbit.
I repeat: to optimize the backup process, you consider it "acceptable" to bind the VM's disk speed to the backup unit's disk speed. This is a classic example of "perfect is the enemy of good".
Do you know how many problems we have because of this choice?
Where can I file a complaint or a request to change this behaviour? At least for the Ceph filesystem.
Thanks,
Mario
 
