I have a cluster backing up to a dedicated Proxmox Backup server, which is normally working great.
Out of 54 VMs, three, all from the same node failed backup last night:
10 other VMs on that node backed up just fine, both before and after the failure. I have 12 nodes doing backups, with a total of 60 VMs and LXCs.
The backup server is a dedicated Supermicro chassis, with dual 40g NICs, currently 2.5% disk space is used.
I see these failures once in a while, and haven't found the root cause yet.
Is there any way to set a backup to "try again on failure"?
Any tips on debugging this?
Out of 54 VMs, three, all from the same node failed backup last night:
Code:
313: 2021-11-15 01:04:04 INFO: Starting Backup of VM 313 (qemu)
313: 2021-11-15 01:04:04 INFO: status = running
313: 2021-11-15 01:04:04 INFO: VM Name: spktest05
313: 2021-11-15 01:04:04 INFO: include disk 'scsi0' 'spk-ceph-pool1:vm-313-disk-0' 32G
313: 2021-11-15 01:04:04 INFO: backup mode: snapshot
313: 2021-11-15 01:04:04 INFO: ionice priority: 7
313: 2021-11-15 01:04:04 INFO: creating Proxmox Backup Server archive 'vm/313/2021-11-15T09:04:04Z'
313: 2021-11-15 01:04:04 INFO: issuing guest-agent 'fs-freeze' command
313: 2021-11-15 01:06:09 INFO: issuing guest-agent 'fs-thaw' command
313: 2021-11-15 01:06:09 ERROR: VM 313 qmp command 'backup' failed - got timeout
313: 2021-11-15 01:06:09 INFO: aborting backup job
313: 2021-11-15 01:06:22 INFO: resuming VM again
313: 2021-11-15 01:06:22 ERROR: Backup of VM 313 failed - VM 313 qmp command 'backup' failed - got timeout
10 other VMs on that node backed up just fine, both before and after the failure. I have 12 nodes doing backups, with a total of 60 VMs and LXCs.
The backup server is a dedicated Supermicro chassis, with dual 40g NICs, currently 2.5% disk space is used.
I see these failures once in a while, and haven't found the root cause yet.
Is there any way to set a backup to "try again on failure"?
Any tips on debugging this?