[SOLVED] Backup Error "got timeout" at random VMs

Jan 18, 2022
Hi. We have Proxmox Backup Server v2.1.2-1 installed and very often get the same error messages, with backups failing. The backups fail more or less randomly, but sometimes they complete successfully.
We have 16 PVE hosts (PVE v7.0-13) and one PBS in the cluster. Each host has 64 GB of RAM. Is that too little for snapshot backups of VMs with high RAM consumption (12 to 48 GB)?

Please see the following log-snippet showing the mentioned error messages:

INFO: Starting Backup of VM 113 (qemu)
INFO: Backup started at 2022-01-18 04:00:17
INFO: status = running
INFO: VM Name: xx-ad-01.win.forest.de
INFO: include disk 'scsi0' 'rbd-bit-prod:vm-113-disk-0' 150G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/113/2022-01-18T03:00:17Z'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
ERROR: VM 113 qmp command 'backup' failed - got timeout
INFO: aborting backup job
INFO: resuming VM again
ERROR: Backup of VM 113 failed - VM 113 qmp command 'backup' failed - got timeout
INFO: Failed at 2022-01-18 04:07:50

Is there a way to take a deeper look and find the root cause?
Please let me know if you need further information.
 
do you have monitoring in place? how is the load situation on the PBS? the RAM usage inside the guest shouldn't play a big role, but load on either end can of course. the 'backup' command basically does the setup and connection to PBS, so it running into a timeout is usually a sign of severely overloaded systems..
 
between start and failure reporting there are almost 8 minutes - the backup command itself has a timeout of 125s, so I guess the fs-freeze takes up the remainder of the time? possibly there are a lot of buffered writes from the guest, and the fs-freeze causes overloading on the Ceph storage of scsi0? is there anything inside the guest that gets triggered on fs-freeze (like VSS, DB dumps/flushes, ...)? does the VM stay responsive between start of backup task and failure being reported?
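
The arithmetic behind that estimate can be checked with a small sketch. The two timestamps are copied from the log above and the 125s figure comes from this reply. The script only shows how much of the task's runtime the 'backup' timeout alone does not explain. It does not prove that the remainder was spent in fs-freeze.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"
BACKUP_QMP_TIMEOUT_S = 125  # timeout of the 'backup' command, as stated in the reply

# "Backup started at" and "Failed at" lines from the task log of VM 113
started = datetime.strptime("2022-01-18 04:00:17", FMT)
failed = datetime.strptime("2022-01-18 04:07:50", FMT)

total_s = int((failed - started).total_seconds())
unaccounted_s = total_s - BACKUP_QMP_TIMEOUT_S

print(f"task ran for {total_s}s, {unaccounted_s}s not covered by the backup timeout")
# task ran for 453s, 328s not covered by the backup timeout
```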
 
Hi. Our Proxmox backup job is scheduled to start at 4:00 AM because many of the systems are critical. Therefore we can't check whether they are responsive at that time. If I start the backup jobs manually, they do not fail and the VMs stay responsive.
Some VMs have VSS or DB dumps, but many of the failing ones do not (Windows and Debian machines). Even very small machines (name service only) fail sometimes.
By the way, the backup storage is a mounted NFS Share.

We noticed that there is a difference in the time format (is that normal?):
"creating Proxmox Backup Server archive 'vm/113/2022-01-18T03:00:17Z'
...
Failed at 2022-01-18 04:07:50"
 
if you have multiple nodes - possibly they all start backups at the same time and overload the NFS server? you could try pinging the VM once or twice a minute over night to see whether packets are delayed/lost while the backup is initializing and correlate that to success/failure of the backup..
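
A minimal sketch of such an overnight ping logger, assuming a Linux host with the iputils `ping` (`-c` for count, `-W` for the reply timeout in seconds). The function names, the 30 second interval, and the log file name are made up for illustration. Each line is stamped in UTC so it can be compared with the archive name in the backup log.

```python
import subprocess
import time
from datetime import datetime, timezone


def ping_once(host, timeout_s=2, run=subprocess.run):
    """Send one ICMP echo request; return (reachable, elapsed_seconds)."""
    started = time.monotonic()
    try:
        result = run(
            ["ping", "-c", "1", "-W", str(timeout_s), host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s + 1,  # guard in case ping itself hangs
        )
        ok = result.returncode == 0
    except subprocess.TimeoutExpired:
        ok = False
    return ok, time.monotonic() - started


def format_record(ok, elapsed, now=None):
    """Build one log line: UTC timestamp, status, and time spent on the probe."""
    now = now or datetime.now(timezone.utc)
    status = "ok" if ok else "LOST"
    return f"{now.strftime('%Y-%m-%dT%H:%M:%SZ')} {status} {elapsed:.3f}s"


def monitor(host, interval_s=30, logfile="ping-vm.log"):
    """Probe the host twice a minute and append every result to logfile."""
    while True:
        ok, elapsed = ping_once(host)
        with open(logfile, "a") as fh:
            fh.write(format_record(ok, elapsed) + "\n")
        time.sleep(interval_s)
```

Calling `monitor("xx-ad-01.win.forest.de")` from another host before 4:00 AM and checking the log for `LOST` lines or slow probes around the backup start would show whether the VM stalls while the backup is initializing.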

the different timestamp formats are no cause for concern.
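
To illustrate why the two timestamps differ: the archive name ends in `Z`, which marks UTC, while the other log lines use the node's local time. The sketch below assumes the node runs on CET (UTC+1 in January); that is a guess based on the one-hour difference, not something stated in the thread.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the PVE node's local time zone is CET (UTC+1 in winter).
CET = timezone(timedelta(hours=1), "CET")


def archive_time_to_local(stamp, tz=CET):
    """Convert the UTC timestamp of a PBS archive name to local time."""
    utc = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return utc.astimezone(tz)


local = archive_time_to_local("2022-01-18T03:00:17Z")
print(local.strftime("%Y-%m-%d %H:%M:%S"))  # 2022-01-18 04:00:17
```

The result matches the "Backup started at 2022-01-18 04:00:17" line, so both timestamps describe the same moment.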
 
