[SOLVED] Backups time out and "crash" VMs?

wasteground

Hi!

Trying out PBS, and having some issues backing up virtual machines consistently. Everything is upgraded to the latest (enterprise) editions of PVE, and I'm running the latest PBS (updated with apt-get update && apt-get dist-upgrade -y).

When I try to create a backup, the following happens:

Code:
INFO: starting new backup job: vzdump 102 --remove 0 --node proxmox1 --mode snapshot --storage pbs
INFO: Starting Backup of VM 102 (qemu)
INFO: Backup started at 2020-07-30 08:05:13
INFO: status = running
INFO: VM Name: netservices1
INFO: include disk 'virtio0' 'local-zfs:vm-102-disk-0' 32G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/102/2020-07-30T08:05:13Z'
ERROR: VM 102 qmp command 'backup' failed - got timeout
ERROR: Backup of VM 102 failed - VM 102 qmp command 'backup' failed - got timeout
INFO: Failed at 2020-07-30 08:06:13
INFO: Backup job finished with errors
TASK ERROR: job errors

At this point, the VM is stuck in a suspended state and has to be killed and restarted to continue. A "backup" shows on PBS, with 1 byte in size, and eventually the PBS server times out the incoming backup. If I reboot the PVE box, I can *sometimes* get the backup to work (maybe 1 in 10 times). Killing and restarting the qemu process never seems to fix the issue.

I guess there are potentially a few issues here, but the most serious one (setting aside that the backup itself doesn't work) is that taking a backup should never permanently kill a workload VM. I understand a short suspend is required to take a snapshot, which makes sense, but here the VM never returns and just stays suspended/crashed.

This seems to happen consistently across all my PVE boxes at this point - however, a week or two ago when I was first trying this, it seemed to work just fine, so I'm not sure if something updated and broke it, or if there's something else at play here.

Any ideas or suggestions? So far PBS seems like a really excellent solution, looking forward to using it in the future in production!

Thanks!
 
Hm, this is a tricky one. It seems that your QEMU process gets completely stuck or deadlocks on something. I can't reproduce the issue locally right now, so it's hard to debug. Is there anything special you're doing on this machine? Does the same issue happen on other machines?

Can you post the log from the backup task on the backup server? It should be in the task list (Administration -> Tasks).

Also, you said it only started happening with a recent update, so you could try to find out which version caused the issue by selectively downgrading the 'pve-qemu-kvm' and 'libproxmox-backup-qemu0' packages (i.e. apt install pve-qemu-kvm=<version>, use tab completion to show available ones).
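To make the downgrade steps above a bit more concrete, here is a minimal sketch of how it could be scripted. The helper names `run` and `pin_version` and the `DRY_RUN` variable are made up for illustration, and the version string is left as a placeholder since the right one depends on what your repository offers. It defaults to printing the commands instead of running them, so you can check them first.

```shell
#!/bin/sh
# Sketch only: step a package back to a chosen version and hold it there.
# DRY_RUN, run and pin_version are hypothetical names, not Proxmox tooling.
# With DRY_RUN=1 (the default) commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

pin_version() {
    pkg="$1"
    ver="$2"
    # Install the exact version; apt-get needs --allow-downgrades with -y.
    run apt-get install -y --allow-downgrades "$pkg=$ver"
    # Hold the package so the next dist-upgrade does not undo the test.
    run apt-mark hold "$pkg"
}

# List the versions apt knows about first:
#   apt-cache policy pve-qemu-kvm
# Then, with a version taken from that list:
#   pin_version pve-qemu-kvm "<version>"
```

Once you are done testing, `apt-mark unhold <package>` releases the hold so normal upgrades resume. Running VMs keep using the QEMU binary they were started with, so a VM has to be stopped and started again before the downgraded version is actually in use.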
 
Nothing special at all, just standard PVE which has been running for many months with no issues at all. I have just removed all the PBS setup from my PVE clusters and have wiped the PBS machine, and am standing up a new "clean" server to act as a PBS server to rule out hardware/network issues - will report back with more information as soon as I've done that :)
 
So, re-installing a new server seems to have "fixed" this - but I did also make one change. Previously the data store was over NFS, and now it's local to the PBS server on some directly attached disks (with ZFS as the filesystem).

Is it possible that latency between PBS and the NFS server (it's probably 10-20 ms away) could cause the archive creation step to take too long, time out, fail, and then not recover cleanly?
 
Hm, I just tested a configuration with a deliberately slow NFS server and still could not reproduce anything... Could it be that while your PVE was up-to-date, you hadn't installed the latest version of PBS?

Anyway, if you run into the issue again and can isolate a factor that causes it, let us know here, we'd be glad to fix any outstanding bugs.
 
Nope, definitely the latest version of PBS too. I actually rebooted after installing it this morning just to be absolutely certain nothing "old" was hanging around, and the PVE clusters were rebooted too to ensure the right version of qemu everywhere. If I get some time over the weekend I'll try recreating it, but I guess this is probably something caused on my side rather than an issue with PBS :)
 
