qmp command 'backup' failed - got timeout

ssevasta

Jul 25, 2024
Hi, backups of some VMs are failing with this error: "qmp command 'backup' failed - got timeout". When I re-run the backup, it works fine. However, this is inconvenient because the backups run on a schedule. The host server's log has these entries:
Jan 06 11:05:02 tst01pvepoc03 pve-ha-lrm[132827]: VM 112 qmp command failed - VM 112 qmp command 'query-status' failed - unable to connect to VM 112 qmp socket - timeout after 51 retries
Jan 06 11:05:02 tst01pvepoc03 pve-ha-lrm[132827]: VM 112 qmp command 'query-status' failed - unable to connect to VM 112 qmp socket - timeout after 51 retries
The backup storage is an NFS share on an HPE StoreOnce.

Regards,
 
Hello ssevasta! Could you please do the following:
  1. Post the output of pveversion -v
  2. When you trigger a backup manually and see the error, could you post the output of qm status 112 --verbose? If you run the backup again and it works, is the output of qm status 112 --verbose any different?
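
To make the comparison in step 2 easier, here is a minimal sketch that diffs two saved outputs. It assumes you have redirected each run of qm status 112 --verbose into a text file yourself; the function name diff_status and the file names in the note below are only illustrative, not part of any Proxmox tool.

```python
import difflib


def diff_status(failed_output, ok_output):
    """Return unified-diff lines between two saved `qm status --verbose` outputs.

    Both arguments are plain strings, e.g. the contents of files you saved
    yourself after a failed and after a successful backup run.
    """
    return list(
        difflib.unified_diff(
            failed_output.splitlines(),
            ok_output.splitlines(),
            fromfile="failed",
            tofile="successful",
            lineterm="",  # input lines carry no newline, so add none to headers
        )
    )
```

For example, after saving the two outputs as failed.txt and ok.txt (hypothetical names), read both files and print each line returned by diff_status. Lines starting with "+" appear only in the successful run, lines starting with "-" only in the failed one.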
 
As far as I can see, the failed backup output of qm status 112 --verbose simply shows that the storage is not reachable (because it contains no storage information), while the successful one shows storage information. At this point it seems that there's not an issue with the Backup Server, but rather with the storage of the VM.

I'm just wondering: did you notice any issue in the VM except for the backup? Like any applications freezing, or anything else.

My guess at the moment is that the storage is intermittently unavailable: it stops responding for a certain amount of time, recovers, then stops again, and so on. You should monitor the situation for some time, because there might be a bigger issue somewhere. You may want to check the journal for any storage errors and report back the results.
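
As a rough aid for checking the journal, the sketch below counts qmp timeout entries per VM and command. The regular expression is an assumption modelled only on the pve-ha-lrm lines quoted earlier in this thread, so other message formats may need a different pattern; count_qmp_timeouts is an illustrative name.

```python
import re
from collections import Counter

# Assumed pattern, based on the log lines quoted in this thread:
# "VM <id> qmp command '<cmd>' failed - ... timeout ..."
QMP_TIMEOUT = re.compile(
    r"VM (?P<vmid>\d+) qmp command '(?P<cmd>[^']+)' failed - .*timeout"
)


def count_qmp_timeouts(lines):
    """Count matching timeout entries per (vmid, command) in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = QMP_TIMEOUT.search(line)
        if match:
            counts[(int(match.group("vmid")), match.group("cmd"))] += 1
    return counts
```

One way to use it is to pipe journalctl output into a small script that calls count_qmp_timeouts(sys.stdin) and prints the result. If the same VMs show up around every scheduled backup window, that would support the intermittent-storage theory; if not, the cause may lie elsewhere.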
 
So far we have not noticed any other issues.
This afternoon we reinstalled PBS on a physical server instead of a VM and migrated the configuration from the old PBS server. I will wait for tonight's backup jobs to complete. Maybe the issue was timeouts between PBS, the VM, and the NFS server.
 
The underlying storage for the VMs is an external 3-node Ceph cluster, and we haven't noticed any issues at the Ceph level.

When we use PBS to back up the same VMs to a datastore on local disks or on a Ceph volume, we don't encounter any issues. The issue occurs when the PBS datastore is an NFS share on the StoreOnce.
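
Since the failures follow the NFS datastore, one rough way to compare it against the local or Ceph-backed datastore is to time small synchronous writes on each mount. This is only a sketch under assumptions: probe_write_latency is an illustrative helper, not a PBS tool, and a few 1 MiB writes say nothing definitive about how PBS itself behaves under backup load.

```python
import os
import tempfile
import time


def probe_write_latency(directory, samples=5, size=1024 * 1024):
    """Write and fsync `samples` temporary files of `size` bytes in `directory`.

    Returns a list with one latency (in seconds) per file. Each temporary
    file is removed automatically when its context manager exits.
    """
    payload = os.urandom(size)
    latencies = []
    for _ in range(samples):
        start = time.monotonic()
        with tempfile.NamedTemporaryFile(dir=directory) as tmp:
            tmp.write(payload)
            tmp.flush()
            os.fsync(tmp.fileno())  # force the data out to the backing storage
            latencies.append(time.monotonic() - start)
    return latencies
```

Running it on the PBS host against each datastore's mount point, ideally around the time the scheduled backups run, and comparing the maximum values may show whether the NFS share stalls occasionally. Large outliers on the NFS path only would point at the share or the network rather than at the VMs.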