[SOLVED] 20% of all backups fail with "qmp command 'query-proxmox-support' failed - got timeout"

We have a 9-node Proxmox cluster running PVE 8.0.4 (last updated and restarted yesterday).
It is based on Ceph Quincy 17.2.6 (33 OSDs, at most 60% full, 40G network).
200 running VMs (virtio-win-0.1.215.iso).

Proxmox Backup Server 3.0-2
2 x 16TB HDDs as ZFS Backup drive




When we run our nightly backups, 10% to 20% of all backups fail.
It's always different VMs that fail, but it's always the same log output.

The VMs keep running without any problems after that.
So the VMs did not hang/freeze as described in other threads.
Also, 4 hours earlier we always take snapshots of all VMs. That task never fails. (But maybe the snapshot command does not fs-freeze and fs-thaw the VMs?)

The log looks like this

Code:
INFO: Backup finished at 2023-10-19 02:19:03
INFO: Starting Backup of VM 151 (qemu)
INFO: Backup started at 2023-10-19 02:19:03
INFO: status = running
INFO: VM Name: esx-Tina
INFO: include disk 'scsi0' 'VM_Festplatten_NVME:base-133-disk-1/vm-151-disk-1' 156G
INFO: include disk 'efidisk0' 'VM_Festplatten_NVME:base-133-disk-0/vm-151-disk-0' 528K
INFO: include disk 'tpmstate0' 'VM_Festplatten_NVME:base-133-disk-2/vm-151-disk-2' 4M
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: snapshots found (not included into backup)
INFO: creating Proxmox Backup Server archive 'vm/151/2023-10-19T00:19:03Z'
INFO: attaching TPM drive to QEMU for backup
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
ERROR: VM 151 qmp command 'backup' failed - got timeout
INFO: aborting backup job
INFO: resuming VM again
ERROR: Backup of VM 151 failed - VM 151 qmp command 'backup' failed - got timeout
INFO: Failed at 2023-10-19 02:22:55


I don't know what is running into a timeout.
Also, I see that 'fs-freeze' and 'fs-thaw' are guest-agent commands.
So I should mention that all the VMs are running the guest agent from virtio-win-0.1.215.iso.


Log on Backup Server

Code:
Oct 19 02:19:06 proxb1 proxmox-backup-proxy[1358]: starting new backup on datastore 'Backup': "ns/Buero/vm/151/2023-10-19T00:19:03Z"
Oct 19 02:19:06 proxb1 proxmox-backup-proxy[1358]: download 'index.json.blob' from previous backup.
Oct 19 02:19:06 proxb1 proxmox-backup-proxy[1358]: register chunks in 'drive-efidisk0.img.fidx' from previous backup.
Oct 19 02:19:06 proxb1 proxmox-backup-proxy[1358]: download 'drive-efidisk0.img.fidx' from previous backup.
Oct 19 02:19:06 proxb1 proxmox-backup-proxy[1358]: created new fixed index 1 ("ns/Buero/vm/151/2023-10-19T00:19:03Z/drive-efidisk0.img.fidx")
Oct 19 02:19:11 proxb1 proxmox-backup-proxy[1358]: add blob "/mnt/datastore/Backup/ns/Buero/vm/140/2023-10-19T00:09:41Z/qemu-server.conf.blob" (523 bytes, comp: 523)
Oct 19 02:19:11 proxb1 proxmox-backup-proxy[1358]: backup ended and finish failed: backup ended but finished flag is not set.
Oct 19 02:19:11 proxb1 proxmox-backup-proxy[1358]: removing unfinished backup


Here I found another log with a more detailed error:

Code:
Oct 19 02:19:03 prox4 pvescheduler[73755]: INFO: Starting Backup of VM 151 (qemu)
Oct 19 02:19:17 prox4 pvestatd[1714]: VM 151 qmp command failed - VM 151 qmp command 'query-proxmox-support' failed - got timeout
Oct 19 02:19:17 prox4 pvestatd[1714]: status update time (8.227 seconds)
Oct 19 02:19:27 prox4 pvestatd[1714]: VM 151 qmp command failed - VM 151 qmp command 'query-proxmox-support' failed - unable to connect to VM 151 qmp s>
Oct 19 02:21:11 prox4 pvescheduler[73755]: VM 151 qmp command failed - VM 151 qmp command 'backup' failed - got timeout
Oct 19 02:21:32 prox4 pvestatd[1714]: proxmox-backup-client failed: Error: http request timed out
Oct 19 02:21:32 prox4 pvestatd[1714]: status update time (133.078 seconds)
Oct 19 02:21:40 prox4 pvestatd[1714]: VM 151 qmp command failed - VM 151 qmp command 'query-proxmox-support' failed - got timeout
Oct 19 02:22:08 prox4 pvestatd[1714]: status update time (35.738 seconds)
Oct 19 02:22:09 prox4 pmxcfs[1428]: [status] notice: received log
Oct 19 02:22:16 prox4 pvestatd[1714]: VM 151 qmp command failed - VM 151 qmp command 'query-proxmox-support' failed - unable to connect to VM 151 qmp s>
Oct 19 02:22:16 prox4 pvestatd[1714]: status update time (8.236 seconds)
Oct 19 02:22:26 prox4 pvestatd[1714]: VM 151 qmp command failed - VM 151 qmp command 'query-proxmox-support' failed - unable to connect to VM 151 qmp s>
Oct 19 02:22:27 prox4 pvestatd[1714]: status update time (8.248 seconds)
Oct 19 02:22:36 prox4 pvestatd[1714]: VM 151 qmp command failed - VM 151 qmp command 'query-proxmox-support' failed - unable to connect to VM 151 qmp s>
Oct 19 02:22:36 prox4 pvestatd[1714]: status update time (8.245 seconds)
Oct 19 02:22:46 prox4 pvestatd[1714]: VM 151 qmp command failed - VM 151 qmp command 'query-proxmox-support' failed - unable to connect to VM 151 qmp s>
Oct 19 02:22:46 prox4 pvestatd[1714]: status update time (8.239 seconds)
Oct 19 02:22:55 prox4 QEMU[5715]: BackupTask send abort failed.
Oct 19 02:22:55 prox4 pvescheduler[73755]: ERROR: Backup of VM 151 failed - VM 151 qmp command 'backup' failed - got timeout


Any ideas?

(By the way: we have a second, much smaller setup with the same backup server configuration, but only 5 nodes and 30 VMs. There we have no problems.)
 
The backup server declares the backup as failed at 02:19:11.
But the first timeout on the Proxmox node is at 02:19:17.

So maybe there is a communication problem between the backup server and the node?
Or some other error handling between the backup server and the node does not work?
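
To see when the node itself was stalling around the failed backup, the pvestatd "status update time" lines from the node log above can be filtered for slow updates. This is only a rough sketch: the 30 second threshold and the function name `slow_status_updates` are my own assumptions, and the sample lines are copied from the log posted above.

```python
import re

# Matches pvestatd lines such as:
# "Oct 19 02:21:32 prox4 pvestatd[1714]: status update time (133.078 seconds)"
STATUS_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+"
    r"pvestatd\[\d+\]:\s+status update time \((?P<secs>\d+(?:\.\d+)?) seconds\)"
)


def slow_status_updates(lines, threshold=30.0):
    """Return (timestamp, host, seconds) for updates slower than threshold.

    threshold is an assumed cut-off, not a Proxmox default.
    """
    hits = []
    for line in lines:
        m = STATUS_RE.match(line)
        if m and float(m.group("secs")) > threshold:
            hits.append((m.group("stamp"), m.group("host"), float(m.group("secs"))))
    return hits


# Sample lines taken from the node log in this thread.
sample = [
    "Oct 19 02:19:17 prox4 pvestatd[1714]: status update time (8.227 seconds)",
    "Oct 19 02:21:32 prox4 pvestatd[1714]: status update time (133.078 seconds)",
    "Oct 19 02:22:08 prox4 pvestatd[1714]: status update time (35.738 seconds)",
]

for stamp, host, secs in slow_status_updates(sample):
    print(f"{stamp} {host}: {secs:.3f}s")
```

With the sample lines this flags the 133.078 s and 35.738 s updates, both of which fall inside the window of the failed backup of VM 151.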
 
Are all 9 nodes in the same backup job, or do their jobs start at the same time or overlap?
If so, maybe, and this is a complete guess, the PBS hardware, more specifically its storage (the IOPS of only one HDD), is simply overwhelmed by the 9 simultaneously running backups?
 
Yes, it is in one backup job.

I split them into 9 backup jobs that run at different times. If I only run 2 node backups at a time, the backup jobs run fine.
When I try to back up 3 nodes at a time, it fails.
(By the way: the same hardware on another cluster easily managed to back up 5 nodes at the same time.)
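
For anyone who wants to plan a similar split, here is a small sketch of how start times could be staggered so that no more than 2 nodes share a slot. The node names `prox1` to `prox9`, the 20:00 start, and the 60 minute slot length are placeholders of mine, not our real configuration. It also only limits concurrency if each slot's backups actually finish within the slot length.

```python
from datetime import datetime, timedelta


def stagger(nodes, first_start, slot_minutes, max_parallel=2):
    """Assign start times so that at most max_parallel nodes share a slot."""
    schedule = {}
    for i, node in enumerate(nodes):
        slot = i // max_parallel  # nodes 0-1 -> slot 0, nodes 2-3 -> slot 1, ...
        schedule[node] = first_start + timedelta(minutes=slot * slot_minutes)
    return schedule


# Hypothetical node names and times, for illustration only.
nodes = [f"prox{i}" for i in range(1, 10)]
plan = stagger(nodes, datetime(2023, 10, 19, 20, 0), slot_minutes=60)

for node, start in plan.items():
    print(node, start.strftime("%H:%M"))
```

Each resulting start time would then be entered as the schedule of the per-node backup job.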

Reasons to back up all nodes in one job:
- When I configure the backup jobs only in the UI, I cannot exclude a VM that is not running on the node at that moment.
But the VM that I don't want backed up could be running on a completely different node when the backup job runs.
- Ease of use

So maybe it would be nice to have an option to limit how many nodes run the backup job at the same time.
Slow IOPS can occur for many reasons, and maybe the handling of such cases could be optimized.

For me this problem is resolved.
Thx
 