[SOLVED] Backup Error "got timeout" at random VMs

InxmailGmbH · Jan 18, 2022

Hi. We have Proxmox Backup Server v2.1.2-1 installed and backups frequently fail with the same error message. The failures appear more or less random; sometimes the same backups complete successfully.
We have 16 PVE hosts (PVE v7.0-13) and one PBS in the cluster. Each host has 64GB of RAM. Is that too little for snapshot backups of VMs with high RAM consumption (12 up to 48GB)?

Please see the following log-snippet showing the mentioned error messages:

INFO: Starting Backup of VM 113 (qemu)
INFO: Backup started at 2022-01-18 04:00:17
INFO: status = running
INFO: VM Name: xx-ad-01.win.forest.de
INFO: include disk 'scsi0' 'rbd-bit-prod:vm-113-disk-0' 150G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/113/2022-01-18T03:00:17Z'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
ERROR: VM 113 qmp command 'backup' failed - got timeout
INFO: aborting backup job
INFO: resuming VM again
ERROR: Backup of VM 113 failed - VM 113 qmp command 'backup' failed - got timeout
INFO: Failed at 2022-01-18 04:07:50

Is there a way to look deeper into this and find the root cause?
Please let me know if you need further information.

fabian · Jan 18, 2022

do you have monitoring in place? how is the load situation on the PBS? the RAM usage inside the guest shouldn't play a big role, but load on either end can of course. the 'backup' command basically does the setup and connection to PBS, so it running into a timeout is usually a sign of severely overloaded systems..

InxmailGmbH · Jan 18, 2022

Hi. Thanks for your reply. We do have monitoring in place, and the load on the PBS has never exceeded 50 percent (both CPU and RAM).

fabian · Jan 18, 2022

between start and failure reporting there are almost 8 minutes - the backup command itself has a timeout of 125s, so I guess the fs-freeze takes up the remainder of the time? possibly there are a lot of buffered writes from the guest, and the fs-freeze causes overloading on the Ceph storage of scsi0? is there anything inside the guest that gets triggered on fs-freeze (like VSS, DB dumps/flushes, ...)? does the VM stay responsive between start of backup task and failure being reported?
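
as a rough check of that arithmetic, here is a small Python sketch using the two timestamps from the task log and the 125s timeout mentioned above. it only shows how much of the elapsed time the 'backup' timeout alone cannot explain; it does not prove where that time was actually spent.

```python
from datetime import datetime

# Timestamps copied from the task log in the first post.
started = datetime.fromisoformat("2022-01-18 04:00:17")
failed = datetime.fromisoformat("2022-01-18 04:07:50")

# Timeout of the QMP 'backup' command, as stated in this post.
QMP_BACKUP_TIMEOUT_S = 125

elapsed_s = int((failed - started).total_seconds())
unaccounted_s = elapsed_s - QMP_BACKUP_TIMEOUT_S

print(f"elapsed: {elapsed_s}s, not covered by the backup timeout: {unaccounted_s}s")
```

so a bit over five minutes must have been spent somewhere other than waiting on the 'backup' command itself.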

InxmailGmbH · Jan 19, 2022

Hi. Our Proxmox backup job is scheduled to start at 4:00 AM because many of the systems are critical. Therefore we can't check whether they are responsive at that time. If I start the backup jobs manually, they do not fail and the VMs stay responsive.
Some VMs have VSS or DB dumps, but many of the failing ones do not (Windows and Debian machines). Even very small machines (name service only) fail sometimes.
By the way, the backup storage is a mounted NFS share.

We noticed that there is a difference in the time format (is that normal?):
"creating Proxmox Backup Server archive 'vm/113/2022-01-18T03:00:17Z'
...
Failed at 2022-01-18 04:07:50"

fabian · Jan 19, 2022

if you have multiple nodes - possibly they all start backups at the same time and overload the NFS server? you could try pinging the VM once or twice a minute over night to see whether packets are delayed/lost while the backup is initializing and correlate that to success/failure of the backup..
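
a minimal sketch of such a probe in Python, assuming a Linux host with the iputils `ping` binary (`-c` for count, `-W` for reply timeout in seconds). the function names, the 30s interval and the sample format are made up for illustration, and the recorded duration is the wall time of the `ping` process, so it is only a coarse latency indicator.

```python
import subprocess
import time
from datetime import datetime


def ping_once(host: str, timeout_s: int = 5) -> bool:
    """Return True if a single ICMP echo request to host is answered."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", "-W", str(timeout_s), host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s + 2,
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        # Treat a hung or missing ping binary as "not reachable".
        return False
    return result.returncode == 0


def monitor(host, interval_s=30, probes=3, probe=ping_once, sleep=time.sleep):
    """Probe host `probes` times; return (timestamp, reachable, seconds) tuples."""
    samples = []
    for i in range(probes):
        t0 = time.monotonic()
        reachable = probe(host)
        took = time.monotonic() - t0
        samples.append((datetime.now().isoformat(timespec="seconds"), reachable, took))
        if i < probes - 1:
            sleep(interval_s)
    return samples


# Hypothetical usage: one probe every 30s for roughly two hours.
# for ts, ok, took in monitor("xx-ad-01.win.forest.de", probes=240):
#     print(ts, "ok" if ok else "LOST", f"{took:.2f}s")
```

the timestamps of lost or slow probes can then be compared against the start and failure times in the backup task logs.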

the different timestamp formats are no cause for concern.
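
the trailing `Z` in the archive name marks the time as UTC, while the other log lines use the host's local time. a short Python sketch of the conversion, assuming the PVE host runs on CET (UTC+1), which the thread does not state explicitly:

```python
from datetime import datetime, timedelta, timezone

# Assumption: the PVE host's local time zone is CET (UTC+1) in January.
HOST_TZ = timezone(timedelta(hours=1), "CET")

# Timestamp part of the archive name from the log.
archive_ts = "2022-01-18T03:00:17Z"

utc_time = datetime.strptime(archive_ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
local_time = utc_time.astimezone(HOST_TZ)

print(local_time.strftime("%Y-%m-%d %H:%M:%S"))
```

under that assumption the result matches the "Backup started at 2022-01-18 04:00:17" line of the same task log.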

InxmailGmbH · Jan 19, 2022

Yes. It seems that all backup jobs on all nodes start at the same time.
We have now set up different backup schedules for each host for testing the behavior.
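
For illustration, staggered start times of this kind could be generated with a small Python sketch like the one below. The node names `pve01` to `pve16`, the first start at 01:00 and the 15-minute spacing are hypothetical values, not our actual configuration.

```python
from datetime import datetime, timedelta


def staggered_starts(nodes, first_start="01:00", step_minutes=15):
    """Map each node to its own start time, spaced step_minutes apart."""
    base = datetime.strptime(first_start, "%H:%M")
    return {
        node: (base + timedelta(minutes=i * step_minutes)).strftime("%H:%M")
        for i, node in enumerate(nodes)
    }


# Hypothetical node names for a 16-host cluster.
nodes = [f"pve{n:02d}" for n in range(1, 17)]
schedule = staggered_starts(nodes)

for node, start in schedule.items():
    print(node, start)
```

Each resulting time would then be entered as the start time of that host's backup job.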

InxmailGmbH · Jan 21, 2022

Hi. With different start times for the backup jobs, everything now runs smoothly. Maybe this is an issue with PBS that should be improved.
Thanks a lot for your help.
