Occasional backup failures

Zephrant

Member
Sep 12, 2021
I have a cluster backing up to a dedicated Proxmox Backup server, which is normally working great.
Out of 54 VMs, three, all from the same node, failed backup last night:

Code:
313: 2021-11-15 01:04:04 INFO: Starting Backup of VM 313 (qemu)
313: 2021-11-15 01:04:04 INFO: status = running
313: 2021-11-15 01:04:04 INFO: VM Name: spktest05
313: 2021-11-15 01:04:04 INFO: include disk 'scsi0' 'spk-ceph-pool1:vm-313-disk-0' 32G
313: 2021-11-15 01:04:04 INFO: backup mode: snapshot
313: 2021-11-15 01:04:04 INFO: ionice priority: 7
313: 2021-11-15 01:04:04 INFO: creating Proxmox Backup Server archive 'vm/313/2021-11-15T09:04:04Z'
313: 2021-11-15 01:04:04 INFO: issuing guest-agent 'fs-freeze' command
313: 2021-11-15 01:06:09 INFO: issuing guest-agent 'fs-thaw' command
313: 2021-11-15 01:06:09 ERROR: VM 313 qmp command 'backup' failed - got timeout
313: 2021-11-15 01:06:09 INFO: aborting backup job
313: 2021-11-15 01:06:22 INFO: resuming VM again
313: 2021-11-15 01:06:22 ERROR: Backup of VM 313 failed - VM 313 qmp command 'backup' failed - got timeout

10 other VMs on that node backed up just fine, both before and after the failure. I have 12 nodes doing backups, with a total of 60 VMs and LXCs.
The backup server is a dedicated Supermicro chassis with dual 40g NICs; currently 2.5% of its disk space is used.

I see these failures once in a while, and haven't found the root cause yet.
Is there any way to set a backup to "try again on failure"?

Any tips on debugging this?
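
One thing that can be pulled out of the task log itself is how long each failed VM sat between 'fs-freeze' and the error. A minimal sketch, assuming the log lines keep the "VMID: timestamp LEVEL: message" layout shown above (the function name is just illustrative):

```python
import re
from datetime import datetime

# Matches lines such as:
# "313: 2021-11-15 01:04:04 INFO: issuing guest-agent 'fs-freeze' command"
LINE_RE = re.compile(
    r"^(?P<vmid>\d+): (?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>INFO|ERROR): (?P<msg>.*)$"
)


def summarize_failures(log_text):
    """Return {vmid: seconds between fs-freeze and the first ERROR line}."""
    freeze_at = {}
    failures = {}
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip anything that is not a vzdump-style log line
        vmid = int(m.group("vmid"))
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S")
        if "fs-freeze" in m.group("msg"):
            freeze_at[vmid] = ts
        elif m.group("level") == "ERROR" and vmid not in failures:
            start = freeze_at.get(vmid)
            # None means no fs-freeze line was seen for this VM (e.g. it was off)
            failures[vmid] = (ts - start).total_seconds() if start else None
    return failures


sample = """\
313: 2021-11-15 01:04:04 INFO: issuing guest-agent 'fs-freeze' command
313: 2021-11-15 01:06:09 INFO: issuing guest-agent 'fs-thaw' command
313: 2021-11-15 01:06:09 ERROR: VM 313 qmp command 'backup' failed - got timeout
"""
print(summarize_failures(sample))  # {313: 125.0}
```

For the log above this gives 125 seconds between the freeze and the timeout, which at least shows how long the VM was held frozen before the job gave up.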
 
The VMs that fail appear to be random, so I wouldn't want to disable the guest agent on all of my VMs.
They are all upgraded to the latest version AFAIK; this is a test bed, so everything is new.
 
I was running 2.0-13, and I just triggered an update to 2.0-14.
No failures in backups last night though.
 
Do you have a Proxmox 2.x installation?
The latest Proxmox version is 7.0-2. You really should think about upgrading.
 
Sorry, was reporting the Proxmox Backup version. My Proxmox cluster was updated to the latest a few weeks ago. It's on 7.0-13.
 
In your previous post you said there were no failures in your latest backup.
Have you done more backups? Do you still have failures in the latest backups?
 
My test bed backs up four times a day: twice to an NFS mount, and twice to the Proxmox Backup server.
No additional failures since the above, no network or other changes since then either.

This was not the first time backups have failed. Out of 2713 backups to the Proxmox Backup server, I have 15 failures so far.

My concern is I don't see a way to tell why they failed, and what I can do about it. There is no re-try mechanism available?
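
For what it's worth, a retry could be scripted around the `vzdump` CLI from a wrapper. The following is only a sketch of that idea, not a built-in feature: the function names, the retry count, and the delay are my own assumptions, and it relies only on `vzdump` returning a non-zero exit status when a backup fails.

```python
import subprocess
import time


def run_vzdump(vmid, storage):
    """Run one vzdump job; return True when it exits with status 0."""
    cmd = ["vzdump", str(vmid), "--storage", storage, "--mode", "snapshot"]
    return subprocess.run(cmd).returncode == 0


def backup_with_retry(vmids, storage, runner=run_vzdump,
                      attempts=3, delay_s=300, sleep=time.sleep):
    """Back up each VM, retrying failures; return the VMIDs that never succeeded."""
    pending = list(vmids)
    for attempt in range(1, attempts + 1):
        # Keep only the VMs whose backup failed in this pass
        pending = [vmid for vmid in pending if not runner(vmid, storage)]
        if not pending:
            break
        if attempt < attempts:
            sleep(delay_s)  # give the cluster a moment before trying again
    return pending
```

Calling `backup_with_retry([313, 118], "pbs-store")` on a node would run both backups and re-run only the ones that failed; "pbs-store" is a placeholder storage ID, not one from my setup.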
 
Just got a new failure:
Code:
118: 2021-11-19 12:32:06 INFO: Starting Backup of VM 118 (qemu)
118: 2021-11-19 12:32:06 INFO: status = running
118: 2021-11-19 12:32:06 INFO: VM Name: spk-ubuntu-test2
118: 2021-11-19 12:32:06 INFO: include disk 'scsi0' 'spk-ceph-pool1:vm-118-disk-0' 32G
118: 2021-11-19 12:32:06 INFO: backup mode: snapshot
118: 2021-11-19 12:32:06 INFO: ionice priority: 7
118: 2021-11-19 12:32:06 INFO: creating Proxmox Backup Server archive 'vm/118/2021-11-19T20:32:06Z'
118: 2021-11-19 12:32:06 INFO: issuing guest-agent 'fs-freeze' command
118: 2021-11-19 12:34:12 INFO: issuing guest-agent 'fs-thaw' command
118: 2021-11-19 12:34:12 ERROR: VM 118 qmp command 'backup' failed - got timeout
118: 2021-11-19 12:34:12 INFO: aborting backup job
118: 2021-11-19 12:34:12 INFO: resuming VM again
118: 2021-11-19 12:34:12 ERROR: Backup of VM 118 failed - VM 118 qmp command 'backup' failed - got timeout

Only one out of 60 VMs failed backup. No obvious reason.
 
Sometimes the VM is shut down, so there is nothing in the logs. Had one failure this weekend of a VM that has been off for a week.

6 failures last night. 17 fails out of 2,977 backups so far.

Worth noting, I'm backing up to an NFS mount twice a day too (offset by six hours), and no failures occurred on those backups this weekend, but I have seen issues in the past. So I get both NFS and Proxmox Backup failures from VE.
 
The error is a timeout in qemu-agent communication.
Have you tried to increase the timeout for the qemu-agent communication?
 
I've not found how to increase the timeout. This is becoming very concerning though. Almost every night I have a few VMs that fail to back up.
420 VM 420 FAILED 00:00:00 unable to open file '/etc/pve/nodes/test-prox-n101/qemu-server/420.conf.tmp.729468' - Device or resource busy
902 VM 902 FAILED 00:00:00 unable to open file '/etc/pve/nodes/test-prox-n101/qemu-server/902.conf.tmp.729468' - Device or resource busy
903 VM 903 FAILED 00:00:00 unable to open file '/etc/pve/nodes/test-prox-n101/qemu-server/903.conf.tmp.729468' - Device or resource busy

All three of those are powered off VMs. No reason there should be any issues backing them up, but regularly they hang and need to be manually unlocked the next morning.

Just updated Backup to 2.1-2, same issue still.
Nodes are at pve-manager/7.1-8/5b267f33
 
It looks like all nodes backup simultaneously. Is there any way to spread out the backups, maybe have the nodes go sequentially?
It's not a race, I don't care how long it takes as long as it is less than a few hours.
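
One way I could imagine doing this is a separate backup job per node, each with its own start time. A small sketch to work out the offsets, assuming 12 nodes, a 01:00 first start, and a 15 minute gap (node names and numbers are placeholders):

```python
from datetime import datetime, timedelta


def staggered_starts(nodes, first_start="01:00", gap_minutes=15):
    """Map each node name to an HH:MM start time, spaced gap_minutes apart."""
    base = datetime.strptime(first_start, "%H:%M")
    return {
        node: (base + timedelta(minutes=i * gap_minutes)).strftime("%H:%M")
        for i, node in enumerate(nodes)
    }


nodes = [f"node{i:02d}" for i in range(1, 13)]  # hypothetical names for 12 nodes
for node, start in staggered_starts(nodes).items():
    print(node, start)
```

With those numbers the last node starts at 03:45, which still fits in the "few hours" window, though the jobs can overlap if one node takes longer than its gap.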
 
Do your Ceph cluster and corosync share physical links? The 'Device or resource busy' message indicates that corosync/pmxcfs became read-only, likely caused by the increased load on your Ceph cluster because of the backup.
 
The cluster nodes each have dual 40g links to dual switches, in a trunk. The backup server has dual 10g links, so could be buried by 12 high-end nodes doing backup simultaneously.

Ceph runs on a VLAN on the same trunk as the backup traffic, which is on another VLAN.

Any tips on how to slow down the backup processes? I could drop to 1g on the backup server...
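
Rather than dropping the link speed, a per-node bandwidth cap might do the same job. As far as I can tell vzdump has a `bwlimit` option in KiB/s, so here is a rough calculation of what each node could get. It assumes both 10g links on the backup server can be used together and that leaving half the capacity free is enough headroom; both are guesses on my part.

```python
def per_node_bwlimit_kib(link_gbit, links, nodes, headroom=0.5):
    """Per-node limit in KiB/s so all nodes together use `headroom` of the target's links."""
    total_bytes_per_s = link_gbit * 1e9 / 8 * links  # raw capacity of the backup server
    share = total_bytes_per_s * headroom / nodes     # equal slice for every node
    return int(share // 1024)


limit = per_node_bwlimit_kib(link_gbit=10, links=2, nodes=12)
print(f"bwlimit: {limit}")  # bwlimit: 101725
```

That works out to roughly 100 MiB/s per node, which would keep 12 simultaneous backups at about half of what the backup server's links can take.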
 
No, the problem is sharing Ceph and corosync links. Load on the former will cause outages for the latter (and if you use HA, an outage means nodes and their guests being fenced!).
 
Email:
430 test1 FAILED 00:02:33 VM 430 qmp command 'backup' failed - got timeout

From the backup server:
2022-01-21T01:05:38-08:00: starting new backup on datastore 'store1': "vm/430/2022-01-21T09:07:33Z"
2022-01-21T01:05:38-08:00: download 'index.json.blob' from previous backup.
2022-01-21T01:05:45-08:00: register chunks in 'drive-scsi0.img.fidx' from previous backup.
2022-01-21T01:05:45-08:00: download 'drive-scsi0.img.fidx' from previous backup.
2022-01-21T01:05:46-08:00: created new fixed index 1 ("vm/430/2022-01-21T09:07:33Z/drive-scsi0.img.fidx")
2022-01-21T01:06:10-08:00: register chunks in 'drive-scsi1.img.fidx' from previous backup.
2022-01-21T01:06:10-08:00: download 'drive-scsi1.img.fidx' from previous backup.
2022-01-21T01:07:46-08:00: created new fixed index 2 ("vm/430/2022-01-21T09:07:33Z/drive-scsi1.img.fidx")
2022-01-21T01:08:10-08:00: add blob "/mnt/datastore/storage/vm/430/2022-01-21T09:07:33Z/qemu-server.conf.blob" (366 bytes, comp: 366)
2022-01-21T01:08:10-08:00: backup ended and finish failed: backup ended but finished flag is not set.
2022-01-21T01:08:10-08:00: removing unfinished backup
2022-01-21T01:08:10-08:00: TASK ERROR: backup ended but finished flag is not set.

Any way to prioritize traffic so this is not an issue with shared links? For redundancy, I don't want to dedicate one of my two links to CEPH, and don't have another two I can use.
 
Well, you can try, depending on which network hardware you use (corosync traffic is on specific ports only), but ideally you need dedicated, low-latency links.
 
