Occasional backup failures

Zephrant · Nov 15, 2021

I have a cluster backing up to a dedicated Proxmox Backup Server, which normally works great.
Out of 54 VMs, three, all on the same node, failed backup last night:

Code:

313: 2021-11-15 01:04:04 INFO: Starting Backup of VM 313 (qemu)
313: 2021-11-15 01:04:04 INFO: status = running
313: 2021-11-15 01:04:04 INFO: VM Name: spktest05
313: 2021-11-15 01:04:04 INFO: include disk 'scsi0' 'spk-ceph-pool1:vm-313-disk-0' 32G
313: 2021-11-15 01:04:04 INFO: backup mode: snapshot
313: 2021-11-15 01:04:04 INFO: ionice priority: 7
313: 2021-11-15 01:04:04 INFO: creating Proxmox Backup Server archive 'vm/313/2021-11-15T09:04:04Z'
313: 2021-11-15 01:04:04 INFO: issuing guest-agent 'fs-freeze' command
313: 2021-11-15 01:06:09 INFO: issuing guest-agent 'fs-thaw' command
313: 2021-11-15 01:06:09 ERROR: VM 313 qmp command 'backup' failed - got timeout
313: 2021-11-15 01:06:09 INFO: aborting backup job
313: 2021-11-15 01:06:22 INFO: resuming VM again
313: 2021-11-15 01:06:22 ERROR: Backup of VM 313 failed - VM 313 qmp command 'backup' failed - got timeout

Ten other VMs on that node backed up just fine, both before and after the failures. I have 12 nodes doing backups, with a total of 60 VMs and LXCs.
The backup server is a dedicated Supermicro chassis with dual 40g NICs; currently 2.5% of its disk space is used.

I see these failures once in a while, and haven't found the root cause yet.
Is there any way to set a backup to "try again on failure"?

Any tips on debugging this?

SOLTECSIS - Carles Munyoz · Nov 15, 2021

Have you tried disabling the QEMU Guest Agent on the virtual machines, or updating it to the latest version?

Zephrant · Nov 15, 2021

The VMs that fail appear to be random, so I wouldn't want to disable the guest agent on all of my VMs.
They are all upgraded to the latest version AFAIK; this is a test bed, so everything is new.

SOLTECSIS - Carles Munyoz · Nov 16, 2021

And what about your Proxmox version?
Is it updated to the latest release?

Zephrant · Nov 16, 2021

I was running 2.0-13, I just tripped an update to 2.0-14.
No failures in backups last night though.

SOLTECSIS - Carles Munyoz · Nov 16, 2021

Do you have a Proxmox 2.x installation?
The latest Proxmox version is 7.0-2. You really should think about upgrading.

Zephrant · Nov 16, 2021

Sorry, was reporting the Proxmox Backup version. My Proxmox cluster was updated to the latest a few weeks ago. It's on 7.0-13.

SOLTECSIS - Carles Munyoz · Nov 17, 2021

In your previous post you said there were no failures in your latest backup.
Have you done more backups? Do you still have failures in the latest backups?

Zephrant · Nov 17, 2021

My test bed backs up four times a day: twice to an NFS mount, and twice to the Proxmox Backup server.
No additional failures since the above, no network or other changes since then either.

This was not the first time backups have failed. Out of 2713 backups to the Proxmox Backup server, I have 15 failures so far.

My concern is I don't see a way to tell why they failed, and what I can do about it. There is no re-try mechanism available?
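
No built-in retry option comes up in this thread, so one workaround is a small wrapper that re-runs `vzdump` for a single guest when it fails. The following is a minimal sketch, not a vzdump feature: the function name `backup_with_retry`, the retry count, and the delay are hypothetical, and it assumes `vzdump` exits with a non-zero status when the backup fails.

```python
import subprocess
import time


def backup_with_retry(vmid, storage, attempts=3, delay_s=300,
                      run=subprocess.run, sleep=time.sleep):
    """Run vzdump for one guest, retrying on a non-zero exit code.

    Returns the 1-based attempt number that succeeded, or None if all failed.
    `run` and `sleep` are injectable so the logic can be tested off-node.
    """
    cmd = ["vzdump", str(vmid), "--storage", storage, "--mode", "snapshot"]
    for attempt in range(1, attempts + 1):
        result = run(cmd)
        if result.returncode == 0:
            return attempt
        if attempt < attempts:
            sleep(delay_s)  # give the cluster time to settle before retrying
    return None
```

On a node this would be called as, for example, `backup_with_retry(313, "pbs-store")`, where `pbs-store` is a placeholder storage ID. A retry only masks the symptom; the failed attempts are still worth logging so the underlying cause can be tracked down.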

Zephrant · Nov 19, 2021

Just got a new failure:

Code:

118: 2021-11-19 12:32:06 INFO: Starting Backup of VM 118 (qemu)
118: 2021-11-19 12:32:06 INFO: status = running
118: 2021-11-19 12:32:06 INFO: VM Name: spk-ubuntu-test2
118: 2021-11-19 12:32:06 INFO: include disk 'scsi0' 'spk-ceph-pool1:vm-118-disk-0' 32G
118: 2021-11-19 12:32:06 INFO: backup mode: snapshot
118: 2021-11-19 12:32:06 INFO: ionice priority: 7
118: 2021-11-19 12:32:06 INFO: creating Proxmox Backup Server archive 'vm/118/2021-11-19T20:32:06Z'
118: 2021-11-19 12:32:06 INFO: issuing guest-agent 'fs-freeze' command
118: 2021-11-19 12:34:12 INFO: issuing guest-agent 'fs-thaw' command
118: 2021-11-19 12:34:12 ERROR: VM 118 qmp command 'backup' failed - got timeout
118: 2021-11-19 12:34:12 INFO: aborting backup job
118: 2021-11-19 12:34:12 INFO: resuming VM again
118: 2021-11-19 12:34:12 ERROR: Backup of VM 118 failed - VM 118 qmp command 'backup' failed - got timeout

Only one out of 60 VMs failed backup. No obvious reason.
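
One thing the two logs do show is the time between the `fs-freeze` line and the timeout error. A small parser makes that easy to compare across failures. This is a sketch that assumes the `VMID: date time LEVEL: message` layout seen in the logs above; the function name `freeze_to_error_seconds` is hypothetical.

```python
import re
from datetime import datetime

# Matches lines such as:
# 118: 2021-11-19 12:32:06 INFO: issuing guest-agent 'fs-freeze' command
LINE_RE = re.compile(
    r"^(?P<vmid>\d+): (?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>INFO|ERROR|WARN): (?P<msg>.*)$"
)


def freeze_to_error_seconds(log_text):
    """Seconds from the 'fs-freeze' line to the first later ERROR line, or None."""
    freeze_at = None
    for raw in log_text.splitlines():
        m = LINE_RE.match(raw.strip())
        if not m:
            continue
        ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S")
        if "'fs-freeze'" in m["msg"]:
            freeze_at = ts
        elif m["level"] == "ERROR" and freeze_at is not None:
            return int((ts - freeze_at).total_seconds())
    return None


sample = """\
118: 2021-11-19 12:32:06 INFO: issuing guest-agent 'fs-freeze' command
118: 2021-11-19 12:34:12 INFO: issuing guest-agent 'fs-thaw' command
118: 2021-11-19 12:34:12 ERROR: VM 118 qmp command 'backup' failed - got timeout
"""
print(freeze_to_error_seconds(sample))  # 126
```

The VM 313 log gives 125 seconds by the same measure. Two nearly identical intervals suggest, but do not prove, that the job hits a fixed timeout of roughly two minutes rather than failing at a random point.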

SOLTECSIS - Carles Munyoz · Nov 21, 2021

Can you see something in the syslog of the virtual machine?

Zephrant · Nov 22, 2021

Sometimes the VM is shut down, so there is nothing in the logs. This weekend one failure was on a VM that has been off for a week.

6 failures last night. 17 fails out of 2,977 backups so far.

Worth noting: I'm also backing up to an NFS mount twice a day (offset by six hours). No failures occurred on those backups this weekend, but I have seen issues there in the past. So both the NFS and Proxmox Backup targets see failures from VE.

SOLTECSIS - Carles Munyoz · Nov 23, 2021

The error is a timeout in qemu-agent communication.
Have you tried increasing the timeout for qemu-agent communication?

Zephrant · Dec 31, 2021

I've not found how to increase the timeout. This is becoming very concerning though. Almost every night I have a few VMs that fail to back up.
420 VM 420 FAILED 00:00:00 unable to open file '/etc/pve/nodes/test-prox-n101/qemu-server/420.conf.tmp.729468' - Device or resource busy
902 VM 902 FAILED 00:00:00 unable to open file '/etc/pve/nodes/test-prox-n101/qemu-server/902.conf.tmp.729468' - Device or resource busy
903 VM 903 FAILED 00:00:00 unable to open file '/etc/pve/nodes/test-prox-n101/qemu-server/903.conf.tmp.729468' - Device or resource busy

All three of those are powered-off VMs. There's no reason there should be any issues backing them up, but they regularly hang and need to be manually unlocked the next morning.

Just updated Backup to 2.1-2, same issue still.
Nodes are at pve-manager/7.1-8/5b267f33
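
Since two different errors now appear in the reports ("got timeout" and "Device or resource busy"), it can help to tally failures by cause before looking for a single root cause. The sketch below parses summary lines in the `VMID NAME STATUS TIME MESSAGE` layout shown above; the category labels and the function name `classify` are made up for illustration.

```python
import re
from collections import Counter

# Layout assumed from the summary lines in this thread:
# VMID, guest name (may contain spaces), status, duration, message.
SUMMARY_RE = re.compile(
    r"^(?P<vmid>\d+)\s+(?P<name>.+?)\s+(?P<status>FAILED|OK)\s+"
    r"(?P<duration>\d{2}:\d{2}:\d{2})\s*(?P<msg>.*)$"
)


def classify(line):
    """Return (vmid, category) for a FAILED summary line, else None."""
    m = SUMMARY_RE.match(line.strip())
    if not m or m["status"] != "FAILED":
        return None
    msg = m["msg"]
    if "Device or resource busy" in msg:
        category = "config-write-busy"   # could not write under /etc/pve
    elif "got timeout" in msg:
        category = "qmp-timeout"         # QMP 'backup' command timed out
    else:
        category = "other"
    return int(m["vmid"]), category


lines = [
    "420 VM 420 FAILED 00:00:00 unable to open file '/etc/pve/nodes/test-prox-n101/qemu-server/420.conf.tmp.729468' - Device or resource busy",
    "902 VM 902 FAILED 00:00:00 unable to open file '/etc/pve/nodes/test-prox-n101/qemu-server/902.conf.tmp.729468' - Device or resource busy",
    "903 VM 903 FAILED 00:00:00 unable to open file '/etc/pve/nodes/test-prox-n101/qemu-server/903.conf.tmp.729468' - Device or resource busy",
]
tally = Counter(cat for _, cat in filter(None, map(classify, lines)))
print(tally)
```

Keeping the two categories apart matters because they point at different layers: the first is a failure to write the guest config under `/etc/pve`, the second is a timeout talking to the running QEMU process.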

Zephrant · Jan 20, 2022

It looks like all nodes back up simultaneously. Is there any way to spread out the backups, maybe by having the nodes go sequentially?
It's not a race; I don't care how long it takes as long as it is less than a few hours.
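
One way to approximate sequential runs is to define one backup job per node, each with its own start time. The sketch below only computes the offsets; the node names, the 01:00 first start, and the 15-minute gap are placeholders, and it assumes each node's job finishes (or at least passes its heaviest phase) within the gap.

```python
from datetime import datetime, timedelta


def staggered_starts(nodes, first_start="01:00", gap_minutes=15):
    """Map each node to an HH:MM start time, spaced gap_minutes apart."""
    base = datetime.strptime(first_start, "%H:%M")
    return {
        node: (base + timedelta(minutes=i * gap_minutes)).strftime("%H:%M")
        for i, node in enumerate(nodes)
    }


# Hypothetical node names for a 12-node cluster.
nodes = [f"node{i:02d}" for i in range(1, 13)]
for node, start in staggered_starts(nodes).items():
    print(node, start)
```

With 12 nodes and a 15-minute gap the last job starts 165 minutes after the first, which stays within a window of a few hours as long as the final node's own backup is short.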

fabian · Jan 20, 2022

Zephrant said:
I've not found how to increase the timeout. This is becoming very concerning though. Almost every night I have a few VMs that fail to back up.
420 VM 420 FAILED 00:00:00 unable to open file '/etc/pve/nodes/test-prox-n101/qemu-server/420.conf.tmp.729468' - Device or resource busy
902 VM 902 FAILED 00:00:00 unable to open file '/etc/pve/nodes/test-prox-n101/qemu-server/902.conf.tmp.729468' - Device or resource busy
903 VM 903 FAILED 00:00:00 unable to open file '/etc/pve/nodes/test-prox-n101/qemu-server/903.conf.tmp.729468' - Device or resource busy

All three of those are powered-off VMs. There's no reason there should be any issues backing them up, but they regularly hang and need to be manually unlocked the next morning.

Just updated Backup to 2.1-2, same issue still.
Nodes are at pve-manager/7.1-8/5b267f33

Do your Ceph cluster and corosync share physical links? That message indicates that corosync/pmxcfs became read-only, likely caused by the increased load on your Ceph cluster from the backup.

Zephrant · Jan 20, 2022

The cluster nodes each have dual 40g links to dual switches, in a trunk. The backup server has dual 10g links, so could be buried by 12 high-end nodes doing backup simultaneously.

Ceph runs on a VLAN on the same trunk as the backup traffic, which is on another VLAN.

Any tips on how to slow down the backup processes? I could drop to 1g on the backup server...
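
Rather than downgrading the NIC, vzdump has a `bwlimit` setting (in `/etc/vzdump.conf` or as `--bwlimit`) that caps each job's throughput. The sketch below estimates a per-node value so that all nodes together stay under the backup server's links. It assumes the two 10g links aggregate to 20 Gbit/s, that the limit is expressed in KiB/s (worth checking against the vzdump documentation for your version), and an arbitrary 80% headroom factor; the function name is hypothetical.

```python
def per_node_bwlimit_kib(link_gbit, links, nodes, headroom=0.8):
    """KiB/s per node so all nodes together use at most `headroom` of the links."""
    total_bytes_per_s = link_gbit * links * 1e9 / 8  # Gbit/s -> bytes/s
    return int(total_bytes_per_s * headroom / nodes / 1024)


# Dual 10g links on the backup server, 12 nodes backing up at once.
print(per_node_bwlimit_kib(10, 2, 12))
```

This only throttles traffic toward the backup target. Whether it also relieves the load that causes the `/etc/pve` errors depends on where the contention actually is, which the replies below address.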

fabian · Jan 21, 2022

No, the problem is sharing Ceph and corosync links. Load on the former will cause outages for the latter (and if you use HA, an outage means nodes and their guests being fenced!).

Zephrant · Jan 21, 2022

Email:
430 test1 FAILED 00:02:33 VM 430 qmp command 'backup' failed - got timeout

From the backup server:
2022-01-21T01:05:38-08:00: starting new backup on datastore 'store1': "vm/430/2022-01-21T09:07:33Z"
2022-01-21T01:05:38-08:00: download 'index.json.blob' from previous backup.
2022-01-21T01:05:45-08:00: register chunks in 'drive-scsi0.img.fidx' from previous backup.
2022-01-21T01:05:45-08:00: download 'drive-scsi0.img.fidx' from previous backup.
2022-01-21T01:05:46-08:00: created new fixed index 1 ("vm/430/2022-01-21T09:07:33Z/drive-scsi0.img.fidx")
2022-01-21T01:06:10-08:00: register chunks in 'drive-scsi1.img.fidx' from previous backup.
2022-01-21T01:06:10-08:00: download 'drive-scsi1.img.fidx' from previous backup.
2022-01-21T01:07:46-08:00: created new fixed index 2 ("vm/430/2022-01-21T09:07:33Z/drive-scsi1.img.fidx")
2022-01-21T01:08:10-08:00: add blob "/mnt/datastore/storage/vm/430/2022-01-21T09:07:33Z/qemu-server.conf.blob" (366 bytes, comp: 366)
2022-01-21T01:08:10-08:00: backup ended and finish failed: backup ended but finished flag is not set.
2022-01-21T01:08:10-08:00: removing unfinished backup
2022-01-21T01:08:10-08:00: TASK ERROR: backup ended but finished flag is not set.

Any way to prioritize traffic so this is not an issue with shared links? For redundancy, I don't want to dedicate one of my two links to CEPH, and don't have another two I can use.

fabian · Jan 24, 2022

Well, you can try, depending on which network hardware you use (corosync traffic is on specific ports only), but ideally you need dedicated, low-latency links.
