[SOLVED] All Backups Started Failing

danieln

New Member
Mar 13, 2024

Hey, all of my backups recently started failing. I soon realized that this has nothing to do with my backup servers, because other people can still back up to them, so it's clearly something on my end.

I've narrowed this down to an issue with fsfreeze. I'm attaching some output; hopefully someone here has some insight into what's going on.
All VMs have qemu-guest-agent installed and running, and the guest operating systems vary from BSD to Linux to Windows.

Here's the backup log from PVE. I have three PBS instances; this is the log from the first one, and the other two produce pretty much the same logs.
Code:
INFO: starting new backup job: vzdump --notes-template '{{guestname}}' --mode snapshot --all 1 --mailnotification always --quiet 1 --storage pbs1
INFO: Starting Backup of VM 100 (qemu)
INFO: Backup started at 2024-03-13 05:00:04
INFO: status = running
INFO: VM Name: (CENSORED)
VM 100 qmp command 'query-status' failed - got timeout

INFO: include disk 'scsi0' 'local-lvm:vm-100-disk-0' 32G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/100/2024-03-13T03:00:04Z'
INFO: enabling encryption
ERROR: QMP command query-proxmox-support failed - VM 100 qmp command 'query-proxmox-support' failed - got timeout
INFO: aborting backup job
ERROR: VM 100 qmp command 'backup-cancel' failed - got timeout
INFO: resuming VM again
ERROR: Backup of VM 100 failed - VM 100 qmp command 'cont' failed - got timeout
INFO: Failed at 2024-03-13 05:11:00
INFO: Starting Backup of VM 101 (qemu)
INFO: Backup started at 2024-03-13 05:11:00
INFO: status = running
INFO: VM Name: (CENSORED)
VM 101 qmp command 'query-status' failed - got timeout

INFO: include disk 'scsi0' 'local-lvm:vm-101-disk-0' 16G
INFO: exclude disk 'scsi1' 'local-lvm:vm-101-disk-1' (backup=no)
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/101/2024-03-13T03:11:00Z'
INFO: enabling encryption
ERROR: QMP command query-proxmox-support failed - VM 101 qmp command 'query-proxmox-support' failed - got timeout
INFO: aborting backup job
ERROR: VM 101 qmp command 'backup-cancel' failed - got timeout
INFO: resuming VM again
ERROR: Backup of VM 101 failed - VM 101 qmp command 'cont' failed - got timeout
INFO: Failed at 2024-03-13 05:21:55
INFO: Starting Backup of VM 102 (qemu)
INFO: Backup started at 2024-03-13 05:21:55
INFO: status = running
INFO: VM Name: (CENSORED)
VM 102 qmp command 'query-status' failed - got timeout

INFO: include disk 'scsi0' 'local-lvm:vm-102-disk-0' 32G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/102/2024-03-13T03:21:55Z'
INFO: enabling encryption
ERROR: QMP command query-proxmox-support failed - VM 102 qmp command 'query-proxmox-support' failed - got timeout
INFO: aborting backup job
ERROR: VM 102 qmp command 'backup-cancel' failed - got timeout
INFO: resuming VM again
ERROR: Backup of VM 102 failed - VM 102 qmp command 'cont' failed - got timeout
INFO: Failed at 2024-03-13 05:32:51
INFO: Starting Backup of VM 103 (qemu)
INFO: Backup started at 2024-03-13 05:32:51
INFO: status = running
INFO: VM Name: (CENSORED)
VM 103 qmp command 'query-status' failed - got timeout

INFO: include disk 'scsi0' 'local-lvm:vm-103-disk-1' 128G
INFO: include disk 'efidisk0' 'local-lvm:vm-103-disk-0' 4M
INFO: include disk 'tpmstate0' 'local-lvm:vm-103-disk-2' 4M
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/103/2024-03-13T03:32:51Z'
INFO: enabling encryption
ERROR: QMP command query-proxmox-support failed - VM 103 qmp command 'query-proxmox-support' failed - got timeout
INFO: aborting backup job
ERROR: VM 103 qmp command 'backup-cancel' failed - got timeout
INFO: resuming VM again
ERROR: Backup of VM 103 failed - VM 103 qmp command 'cont' failed - got timeout
INFO: Failed at 2024-03-13 05:43:52
INFO: Starting Backup of VM 200 (qemu)
INFO: Backup started at 2024-03-13 05:43:52
INFO: status = running
INFO: VM Name: (CENSORED)
VM 200 qmp command 'query-status' failed - got timeout

INFO: include disk 'scsi0' 'local-lvm:vm-200-disk-0' 32G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/200/2024-03-13T03:43:52Z'
INFO: enabling encryption
ERROR: QMP command query-proxmox-support failed - VM 200 qmp command 'query-proxmox-support' failed - got timeout
INFO: aborting backup job
ERROR: VM 200 qmp command 'backup-cancel' failed - got timeout
INFO: resuming VM again
ERROR: Backup of VM 200 failed - VM 200 qmp command 'cont' failed - got timeout
INFO: Failed at 2024-03-13 05:54:47
INFO: Starting Backup of VM 201 (qemu)
INFO: Backup started at 2024-03-13 05:54:47
INFO: status = running
INFO: VM Name: (CENSORED)
VM 201 qmp command 'query-status' failed - got timeout

INFO: include disk 'scsi0' 'local-lvm:vm-201-disk-0' 16G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/201/2024-03-13T03:54:47Z'
INFO: enabling encryption
ERROR: QMP command query-proxmox-support failed - VM 201 qmp command 'query-proxmox-support' failed - got timeout
INFO: aborting backup job
ERROR: VM 201 qmp command 'backup-cancel' failed - got timeout
INFO: resuming VM again
ERROR: Backup of VM 201 failed - VM 201 qmp command 'cont' failed - got timeout
INFO: Failed at 2024-03-13 06:05:42
INFO: Starting Backup of VM 9000 (qemu)
INFO: Backup started at 2024-03-13 06:05:42
INFO: status = running
INFO: VM Name: CloudInit-Template-(CENSORED)
VM 9000 qmp command 'query-status' failed - got timeout

INFO: include disk 'scsi0' 'local-lvm:base-9000-disk-0' 16G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/9000/2024-03-13T04:05:42Z'
INFO: enabling encryption
ERROR: QMP command query-proxmox-support failed - VM 9000 qmp command 'query-proxmox-support' failed - got timeout
INFO: aborting backup job
ERROR: VM 9000 qmp command 'backup-cancel' failed - got timeout
INFO: resuming VM again
ERROR: Backup of VM 9000 failed - VM 9000 qmp command 'cont' failed - got timeout
INFO: Failed at 2024-03-13 06:16:38
INFO: Backup job finished with errors
INFO: notified via target `mail-to-root`
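Going by the timestamps in the log above, each VM spends roughly eleven minutes in the failing attempt before the job moves on. A minimal sketch for pulling those durations out of a saved task log could look like this. The function name backup_fail_durations is made up for this example, it assumes GNU date, and it treats the log timestamps as UTC, so a duration that spans a DST change would be off.

```shell
#!/usr/bin/env bash
# Sketch: print how long each VM's backup attempt ran before it failed.
# Reads a vzdump task log on stdin. Assumes GNU date (for -d parsing).
backup_fail_durations() {
    local vmid="" start="" line ts
    while IFS= read -r line; do
        case "$line" in
            "INFO: Starting Backup of VM "*)
                # Keep only the VM ID that follows the fixed prefix
                vmid=${line#"INFO: Starting Backup of VM "}
                vmid=${vmid%% *}
                ;;
            "INFO: Backup started at "*)
                start=${line#"INFO: Backup started at "}
                ;;
            "INFO: Failed at "*)
                ts=${line#"INFO: Failed at "}
                if [ -n "$vmid" ] && [ -n "$start" ]; then
                    # -u: interpret both timestamps as UTC (assumption, see above)
                    echo "VM $vmid: $(( $(date -u -d "$ts" +%s) - $(date -u -d "$start" +%s) ))s"
                fi
                vmid=""; start=""
                ;;
        esac
    done
}
```

Usage would be something like `backup_fail_durations < task.log`, where task.log is a placeholder for wherever the log was saved.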

I ran these commands for testing purposes, but honestly this only confused me even more. I have no clue what's going on.
Code:
root@(CENSORED):~# qm guest cmd 100 fsfreeze-status
thawed
root@(CENSORED):~# qm guest cmd 100 fsfreeze-freeze
{
   "error" : {
      "class" : "GenericError",
      "desc" : "failed to freeze /: Resource deadlock avoided"
   }
}
root@(CENSORED):~# qm guest cmd 100 fsfreeze-status
thawed
root@(CENSORED):~# qm guest cmd 101 fsfreeze-freeze
{
   "error" : {
      "class" : "GenericError",
      "desc" : "failed to open /mnt/(CENSORED): Permission denied"
   }
}
root@(CENSORED):~# qm guest cmd 101 fsfreeze-status
thawed
root@(CENSORED):~# qm guest cmd 102 fsfreeze-freeze
2
root@(CENSORED):~# qm guest cmd 102 fsfreeze-status
frozen
root@(CENSORED):~# qm guest cmd 102 fsfreeze-thaw
2
root@(CENSORED):~# qm guest cmd 102 fsfreeze-status
thawed
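To repeat the read-only part of this check for every running VM instead of one at a time, a small loop along these lines might help. It is only a sketch: the helper names running_vmids and check_fsfreeze_status are made up for this example, it assumes the usual `qm list` column layout (VMID, NAME, STATUS, ...), and the 15-second limit is an arbitrary choice so that a hung socket does not block the loop. It only calls fsfreeze-status, so nothing gets frozen.

```shell
#!/usr/bin/env bash
# Sketch: extract the IDs of running VMs from `qm list` output on stdin.
# Assumes the column layout: VMID NAME STATUS MEM(MB) BOOTDISK(GB) PID.
running_vmids() {
    awk 'NR > 1 && $3 == "running" { print $1 }'
}

# Ask the guest agent of every running VM for its freeze state.
check_fsfreeze_status() {
    local vmid
    for vmid in $(qm list | running_vmids); do
        printf 'VM %s: ' "$vmid"
        # 15 s is an arbitrary upper bound in case the command hangs
        timeout 15 qm guest cmd "$vmid" fsfreeze-status || echo "no answer"
    done
}
```

Running `check_fsfreeze_status` on the PVE node as root should then print one line per running VM.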

Any help would be greatly appreciated!
 
This sounds familiar. There was a package update addressing such an issue. I can't find it right now. But are you on the latest packages (pveversion -v)?
I think I'm running the latest?
Code:
root@(censored):~# pveversion -v
proxmox-ve: 8.1.0 (running kernel: 6.5.13-1-pve)
pve-manager: 8.1.4 (running version: 8.1.4/ec5affc9e41f1d79)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.5.13-1-pve-signed: 6.5.13-1
proxmox-kernel-6.5: 6.5.13-1
proxmox-kernel-6.5.11-8-pve-signed: 6.5.11-8
proxmox-kernel-6.5.11-4-pve-signed: 6.5.11-4
ceph-fuse: 17.2.7-pve1
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.0
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.1.2
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.1.1
libpve-guest-common-perl: 5.0.6
libpve-http-server-perl: 5.0.5
libpve-network-perl: 0.9.5
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.1.0
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve4
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.1.4-1
proxmox-backup-file-restore: 3.1.4-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.5
proxmox-widget-toolkit: 4.1.4
pve-cluster: 8.0.5
pve-container: 5.0.8
pve-docs: 8.1.4
pve-edk2-firmware: 4.2023.08-4
pve-firewall: 5.0.3
pve-firmware: 3.9-2
pve-ha-manager: 4.0.3
pve-i18n: 3.2.1
pve-qemu-kvm: 8.1.5-3
pve-xtermjs: 5.3.0-3
qemu-server: 8.0.10
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.2-pve2
 
UPDATE

It seems that backups are not the only thing that's broken. Basic operations such as restart/shutdown through the guest agent are broken too: the VMs do shut down, but they still appear as online, and the VM process (/usr/sbin/kvm [...]) is literally still running.

Another weird symptom is that the PVE Web UI is extremely slow.

My colleague is facing the exact same issues after updating his instance. Is anyone else facing this, or does anyone have a solution in mind?
 
Hi,
it looks like communication via the QMP sockets (used to send commands from the host to QEMU instances) might be broken for some reason. Please share the log file resulting from journalctl -b > /tmp/journal.log. If you also upgraded the kernel, try rebooting into an older one to see if it works there. What CPU does your host have?
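Since the full journal can be huge, a quick per-VM count of QMP timeouts may be easier to share than the raw file. The following is a rough sketch, assuming the journal messages use the same wording as the backup log ("VM <id> qmp command '<name>' failed - got timeout"). The function name qmp_timeouts_per_vm is made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: count QMP timeout messages per VM ID in log text read from stdin.
# Output: one "<vmid> <count>" line per VM, sorted by VM ID.
qmp_timeouts_per_vm() {
    grep -oE "VM [0-9]+ qmp command '[^']+' failed - got timeout" \
        | awk '{ count[$2]++ } END { for (id in count) print id, count[id] }' \
        | sort -n
}
```

For example: `journalctl -b | qmp_timeouts_per_vm`.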
 
Hey!

My host is a Dell R430 with 2x Intel Xeon E5-2680v4 CPUs.

I've posted only part of the journal because it's huge and it all looks much the same. This is from a backup job that ran, but given the huge number of identical errors throughout the day, I assume the issue goes deeper than backups.
 

Did you already try booting an older kernel?

Please share the file /tmp/qmp-strace.txt and the output in the terminal after running
Code:
echo '{"execute": "qmp_capabilities"}{"execute": "query-version", "arguments": {}}' | strace socat - /var/run/qemu-server/100.qmp 2> /tmp/qmp-strace.txt
If VM 100 is not running anymore, replace the 100 in the command by any ID of a running VM.
 
I haven't tried an older kernel yet. Would you mind pointing me to a list of PVE kernels?

I've also attached the output.

Thank you so much for your help so far in diagnosing the issue.
 

To get a list of installed kernels, you could run: proxmox-boot-tool kernel list
To boot another kernel, either reboot your node and select the one you want when prompted during boot, or pin it temporarily/permanently as described here: https://pve.proxmox.com/wiki/Host_Bootloader#sysboot_kernel_pin

For a list of installable PVE8 kernels see: apt update && apt-cache search proxmox-kernel
There are also some old kernels from PVE7, but I'm not sure how well they will work with PVE8: apt-cache search pve-kernel
 
UPDATE

I've downgraded from 6.5.13-1-pve, the kernel version that had all the issues, to 6.5.11-8-pve, which as far as I know is the previous kernel update. Now everything works flawlessly and all the bugs are gone. I guess I'll wait for the next update.

Bash:
# I'd suggest turning off all VMs gracefully before rebooting.
proxmox-boot-tool kernel pin 6.5.11-8-pve && reboot
 
My output is essentially the same as the one you attached. The only interesting bit that's different is:
Code:
pselect6(6, [5], [], [], {tv_sec=0, tv_nsec=500000000}, NULL) = 0 (Timeout)
From the rest of the trace, it seems the command did finish successfully and printed a result.

Maybe the issue was a bit further up in the stack, in the Perl code communicating via QMP, but it's hard to tell.
 
We recently updated all of our PVE servers from 7 to 8 and updated PBS as well. We also expanded our 12-node PVE cluster by another 14 nodes in order to phase out our old PVE servers. We are seeing the same issue randomly across our cluster: guests are not backed up and fail with the same error: qmp command 'backup' failed - got timeout.

We have already moved all guests to the new nodes, so the old ones have no guests left to back up; nonetheless, the issue persists. All nodes currently run kernel 6.5.11-8-pve (from the 8.1.4 update), so the kernel itself is probably not the cause, contrary to what was suggested in this thread.
 
Hi,
the timeout errors reported in this thread were not for the backup command, but for other commands following it. Timeouts for the backup command could mean that the PBS is under too much load. Are you starting backups from all nodes at the same time?
 
Well, before the update from PVE7 to PVE8, we ran all the backups in one job, so all 12 nodes started their backups at the same time, which never caused any load issues on our PBS. Now the backups on our active nodes start at 30-minute offsets to spread them out, and errors still occur, although far fewer than before.

If you're speaking of load issues, what kind of issue do you refer to? As far as I can see in our monitoring, there are no load spikes. I will probably have to have a closer look at runtime, though.
 
Network or IO are the most likely culprits. The VM's backup command has a 125-second timeout to connect to the PBS server and do initial preparation, and the error message indicates that you are running into it.
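For a first impression of the network side, something like the following could time a plain TCP connect from a PVE node to the PBS. This is only a rough sketch: the function name pbs_connect_ms is made up for this example, the host in the usage line is a placeholder, and 8007 is assumed to be the port your PBS listens on (the default). It measures only the TCP handshake, not the initial preparation that also counts against the timeout, so a fast result here does not rule out IO problems on the PBS.

```shell
#!/usr/bin/env bash
# Sketch: print the time in milliseconds needed to open a TCP connection.
# Usage: pbs_connect_ms <host> [port]   (port defaults to 8007)
# Relies on bash's /dev/tcp redirection, GNU date (%N) and coreutils timeout.
pbs_connect_ms() {
    local host=$1 port=${2:-8007} start end
    start=$(date +%s%N)
    # Give up after 10 s; the descriptor closes when the inner shell exits
    if timeout 10 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
        end=$(date +%s%N)
        echo $(( (end - start) / 1000000 ))
    else
        echo "connect to $host:$port failed" >&2
        return 1
    fi
}
```

For example: `pbs_connect_ms pbs.example.com`, run from each node while the backups are active, so that the numbers can be compared between nodes.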
 
