Random Proxmox Backups Failing

dlbjosh

New Member
Nov 21, 2024
Hi,

We’ve got a cluster of 4 new servers, all running Ceph on NVMe, and apart from this issue they seem to be running happily.

There are about 70 VMs in total, and we primarily use Proxmox Backup Server for backups.

What we receive on one or two VMs during the backup run is:

INFO: creating Proxmox Backup Server archive 'vm/151/2024-11-20T13:21:24Z'
ERROR: QMP command query-proxmox-support failed - VM 151 qmp command 'query-proxmox-support' failed - unable to connect to VM 151 qmp socket - timeout after 51 retries
INFO: aborting backup job
ERROR: VM 151 qmp command 'backup-cancel' failed - unable to connect to VM 151 qmp socket - timeout after 5956 retries

We have, in the past, recreated the VM and reattached the disks manually, and the problem has gone away for a short period of time, but then it may recur on the same VM or on another one.

With very little information to go on as to why it's failing, it's hard to troubleshoot, but perhaps others have encountered this same issue and resolved it, or know how to get more information about what is going wrong under the hood so we can better target our troubleshooting.

It is worth noting that we also tried installing Veeam and using it to perform backups, and we’re seeing failures on the same VMs that PBS fails on. That is why I thought it best to post in the Proxmox board rather than the PBS board, but I'm happy to have it moved if deemed appropriate.

Thank you for any and all ideas.
 
It's worth noting that these timeouts occur after 10 minutes of retrying.

Still experiencing this issue. :(
 
Hi,
please enable the IO Thread setting on all your VM disks (the disks need to be VirtIO block or SCSI, with the VirtIO SCSI single controller selected). Otherwise, try configuring a bandwidth limit for the backup, reducing the number of workers (see the Advanced tab when editing the backup job), or using fleecing.
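
For a one-off test from the command line, the bandwidth limit, worker count and fleecing suggestions could look roughly like the sketch below. The storage IDs (pbs, local-lvm) and the numeric values are placeholders and not taken from this thread, the fleecing option assumes a Proxmox VE version that supports it, and the helper prints the command instead of running it unless DRY_RUN=0 is set.

```shell
# Dry-run by default: print each command instead of executing it.
# Set DRY_RUN=0 on a Proxmox VE node to actually run it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Back up a single VM with a bandwidth limit, fewer workers and fleecing.
# $1 = VMID, $2 = backup storage ID, $3 = storage ID for fleecing images.
# All values below are illustrative assumptions; adjust them to your setup.
tuned_backup() {
    vmid=$1
    backup_storage=$2
    fleecing_storage=$3
    run vzdump "$vmid" --storage "$backup_storage" \
        --bwlimit 102400 \
        --performance max-workers=4 \
        --fleecing "enabled=1,storage=$fleecing_storage"
}

# Example (prints the command only):
#   tuned_backup 151 pbs local-lvm
```

The same settings are available in the Advanced tab of the backup job, which is the better place for them once a combination has proven to help.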
 
It appears to be more than that.

Even when trying to set the IO Thread, I get this error:

VM 154 qmp command 'query-version' failed - unable to connect to VM 154 qmp socket - timeout after 51 retries

This happens even after running "qm unlock 154". It feels like something has it locked, but I can't determine what or where.
 
ps ax | grep 154 revealed this process running:

2005749 ? Ss 0:00 socat TCP-LISTEN:62632,bind=127.0.0.1 UNIX-CONNECT:/var/run/qemu-server/154.qmp

Killing this process allowed me to make the disk change. I'm wondering if this is also the issue impacting the backup?
 
Most likely. The QMP socket only allows one concurrent connection IIRC. The question is who/what started that process; checking the parent PID (e.g. ps axl) would be interesting should the issue happen again.

Checking in our code, only the qm terminal command uses socat, and only for connecting to the serial sockets, not the QMP socket, and with different parameters, so that can't be it. Likely it's some third-party tool/script?
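
Should it happen again, something like the following sketch could help track down who started the socat process. It reads /proc directly (so it is Linux-only), and the function names are made up for illustration. Note that the VM's own QEMU process also has the QMP socket path on its command line, so it will show up in the first listing as well.

```shell
# Print the parent PID of a process by reading /proc/<pid>/status.
ppid_of() {
    [ -r "/proc/$1/status" ] || return 1
    while read -r key value _; do
        if [ "$key" = "PPid:" ]; then
            echo "$value"
            return 0
        fi
    done < "/proc/$1/status"
    return 1
}

# Print "PID<TAB>command line" for a process and each of its ancestors.
ancestry() {
    pid=$1
    while [ -n "$pid" ] && [ "$pid" -ge 1 ]; do
        [ -r "/proc/$pid/cmdline" ] || break
        cmd=$(tr '\0' ' ' 2>/dev/null < "/proc/$pid/cmdline")
        printf '%s\t%s\n' "$pid" "$cmd"
        pid=$(ppid_of "$pid") || break
    done
}

# List processes whose command line mentions the QMP socket of a VM.
qmp_socket_users() {
    sock="/var/run/qemu-server/$1.qmp"
    for dir in /proc/[0-9]*; do
        [ -r "$dir/cmdline" ] || continue
        cmd=$(tr '\0' ' ' 2>/dev/null < "$dir/cmdline") || continue
        case "$cmd" in
            *"$sock"*) printf '%s\t%s\n' "${dir#/proc/}" "$cmd" ;;
        esac
    done
}

# Example: find candidates for VM 154, then walk up from a suspicious PID.
#   qmp_socket_users 154
#   ancestry 2005749
```

If the parent turns out to be PID 1, the original launcher has already exited, and the process start time (e.g. ps -o lstart= -p PID) compared against cron and monitoring logs may be the next best lead.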
 
