VM 161 qmp command 'backup' failed - backup connect failed: command error: while locking last snapshot during backup

Feb 4, 2024
74
6
8
Hey since some days we get with 8.3 and 8.4 as well as PBS on 8.3. and 8.4 the following problems randomly on all nodes. on the next run problem may be gone then.

here 2 samples from different nodes (on ist still on 8.3. latest and one on 8.4.1)

161: 2025-04-14 12:12:18 INFO: Starting Backup of VM 161 (qemu)161: 2025-04-14 12:12:18 INFO: status = running161: 2025-04-14 12:12:18 INFO: VM Name: sql2161: 2025-04-14 12:12:18 INFO: include disk 'virtio0' 'vm:vm-161-disk-0' 150G161: 2025-04-14 12:12:18 INFO: include disk 'virtio1' 'vm:vm-161-disk-2' 700G161: 2025-04-14 12:12:18 INFO: include disk 'virtio2' 'vm:vm-161-disk-1' 100G161: 2025-04-14 12:12:18 INFO: include disk 'efidisk0' 'vm:vm-161-disk-3' 128K161: 2025-04-14 12:12:19 INFO: backup mode: snapshot161: 2025-04-14 12:12:19 INFO: bandwidth limit: 2048000 KiB/s161: 2025-04-14 12:12:19 INFO: ionice priority: 7161: 2025-04-14 12:12:19 INFO: creating Proxmox Backup Server archive 'vm/161/2025-04-14T10:12:18Z'161: 2025-04-14 12:12:19 INFO: drive-virtio0: attaching fleecing image vm:vm-161-fleece-0 to QEMU161: 2025-04-14 12:12:19 INFO: drive-virtio1: attaching fleecing image vm:vm-161-fleece-1 to QEMU161: 2025-04-14 12:12:19 INFO: drive-virtio2: attaching fleecing image vm:vm-161-fleece-2 to QEMU161: 2025-04-14 12:12:19 INFO: issuing guest-agent 'fs-freeze' command161: 2025-04-14 12:12:37 INFO: issuing guest-agent 'fs-thaw' command161: 2025-04-14 12:12:38 ERROR: VM 161 qmp command 'backup' failed - backup connect failed: command error: while locking last snapshot during backup 'BackupInfo { backup_dir: BackupDir { store: "PBS1", ns: BackupNamespace { inner: [], len: 0 }, dir: BackupDir { group: BackupGroup { ty: Vm, id: "161" }, time: 1744618415 }, backup_time_string: "2025-04-14T08:13:35Z" }, files: ["drive-virtio2.img.fidx", "drive-efidisk0.img.fidx", "client.log.blob", "drive-virtio1.img.fidx", "drive-virtio0.img.fidx", "index.json.blob", "qemu-server.conf.blob"], protected: false }'161: 2025-04-14 12:12:38 INFO: aborting backup job161: 2025-04-14 12:12:38 INFO: resuming VM again161: 2025-04-14 12:12:39 ERROR: Backup of VM 161 failed - VM 161 qmp command 'backup' failed - backup connect failed: command error: while locking last snapshot during backup 'BackupInfo { backup_dir: BackupDir { store: "PBS1", ns: BackupNamespace { inner: [], len: 0 }, dir: BackupDir { group: BackupGroup { ty: Vm, id: "161" }, time: 1744618415 }, backup_time_string: "2025-04-14T08:13:35Z" }, files: ["drive-virtio2.img.fidx", "drive-efidisk0.img.fidx", "client.log.blob", "drive-virtio1.img.fidx", "drive-virtio0.img.fidx", "index.json.blob", "qemu-server.conf.blob"], protected: false }'
133: 2025-04-14 12:14:45 INFO: Starting Backup of VM 133 (qemu)133: 2025-04-14 12:14:45 INFO: status = running133: 2025-04-14 12:14:45 INFO: VM Name: tsext-pve5133: 2025-04-14 12:14:45 INFO: include disk 'virtio0' 'vm:vm-133-disk-0' 500G133: 2025-04-14 12:14:45 INFO: backup mode: snapshot133: 2025-04-14 12:14:45 INFO: bandwidth limit: 2048000 KiB/s133: 2025-04-14 12:14:45 INFO: ionice priority: 7133: 2025-04-14 12:14:45 INFO: creating Proxmox Backup Server archive 'vm/133/2025-04-14T10:14:45Z'133: 2025-04-14 12:14:45 INFO: drive-virtio0: attaching fleecing image vm:vm-133-fleece-0 to QEMU133: 2025-04-14 12:14:45 INFO: issuing guest-agent 'fs-freeze' command133: 2025-04-14 12:14:50 INFO: issuing guest-agent 'fs-thaw' command133: 2025-04-14 12:14:50 ERROR: VM 133 qmp command 'backup' failed - backup connect failed: command error: while locking last snapshot during backup 'BackupInfo { backup_dir: BackupDir { store: "PBS1", ns: BackupNamespace { inner: [], len: 0 }, dir: BackupDir { group: BackupGroup { ty: Vm, id: "133" }, time: 1744618532 }, backup_time_string: "2025-04-14T08:15:32Z" }, files: ["index.json.blob", "qemu-server.conf.blob", "client.log.blob", "drive-virtio0.img.fidx"], protected: false }'133: 2025-04-14 12:14:50 INFO: aborting backup job133: 2025-04-14 12:14:50 INFO: resuming VM again133: 2025-04-14 12:14:51 INFO: removing (old) fleecing image 'vm:vm-133-fleece-0'133: 2025-04-14 12:14:51 ERROR: Backup of VM 133 failed - VM 133 qmp command 'backup' failed - backup connect failed: command error: while locking last snapshot during backup 'BackupInfo { backup_dir: BackupDir { store: "PBS1", ns: BackupNamespace { inner: [], len: 0 }, dir: BackupDir { group: BackupGroup { ty: Vm, id: "133" }, time: 1744618532 }, backup_time_string: "2025-04-14T08:15:32Z" }, files: ["index.json.blob", "qemu-server.conf.blob", "client.log.blob", "drive-virtio0.img.fidx"], protected: false }'
 
Last login: Fri Mar 21 08:19:44 CET 2025 on pts/0
root@pbs1:~# proxmox-backup-manager version --verbose
proxmox-backup 3.4.0 running kernel: 6.8.12-9-pve
proxmox-backup-server 3.4.0-1 running version: 3.4.0
proxmox-kernel-helper 8.1.1
proxmox-kernel-6.8 6.8.12-9
proxmox-kernel-6.8.12-9-pve-signed 6.8.12-9
proxmox-kernel-6.8.12-8-pve-signed 6.8.12-8
proxmox-kernel-6.8.12-6-pve-signed 6.8.12-6
proxmox-kernel-6.8.4-2-pve-signed 6.8.4-2
ifupdown2 3.2.0-1+pmx11
libjs-extjs 7.0.0-5
proxmox-backup-docs 3.4.0-1
proxmox-backup-client 3.4.0-1
proxmox-mail-forward 0.3.2
proxmox-mini-journalreader 1.4.0
proxmox-offline-mirror-helper 0.6.7
proxmox-widget-toolkit 4.3.10
pve-xtermjs 5.5.0-2
smartmontools 7.3-pve1
zfsutils-linux 2.2.7-pve2
root@pbs1:~#
 
i dont have that problem in my homelab running PBS and PVE on both latest versions from yesterday but here i Use ZFS and BTRFS - the main cluster is on CEPH - so maybe its related to that.
 
Last edited:
here is from another node:

163: 2025-04-14 18:12:26 INFO: Starting Backup of VM 163 (qemu)163: 2025-04-14 18:12:26 INFO: status = running163: 2025-04-14 18:12:26 INFO: VM Name: file163: 2025-04-14 18:12:26 INFO: include disk 'virtio0' 'vm:vm-163-disk-0' 60G163: 2025-04-14 18:12:26 INFO: include disk 'virtio1' 'vm:vm-163-disk-1' 3T163: 2025-04-14 18:12:26 INFO: backup mode: snapshot163: 2025-04-14 18:12:26 INFO: bandwidth limit: 2048000 KiB/s163: 2025-04-14 18:12:26 INFO: ionice priority: 7163: 2025-04-14 18:12:26 INFO: creating Proxmox Backup Server archive 'vm/163/2025-04-14T16:12:26Z'163: 2025-04-14 18:12:26 INFO: drive-virtio0: attaching fleecing image vm:vm-163-fleece-0 to QEMU163: 2025-04-14 18:12:26 INFO: drive-virtio1: attaching fleecing image vm:vm-163-fleece-1 to QEMU163: 2025-04-14 18:12:26 INFO: issuing guest-agent 'fs-freeze' command163: 2025-04-14 18:12:29 INFO: issuing guest-agent 'fs-thaw' command163: 2025-04-14 18:12:30 ERROR: VM 163 qmp command 'backup' failed - backup connect failed: command error: while locking last snapshot during backup 'BackupInfo { backup_dir: BackupDir { store: "PBS1", ns: BackupNamespace { inner: [], len: 0 }, dir: BackupDir { group: BackupGroup { ty: Vm, id: "163" }, time: 1744639975 }, backup_time_string: "2025-04-14T14:12:55Z" }, files: ["index.json.blob", "drive-virtio1.img.fidx", "drive-virtio0.img.fidx", "client.log.blob", "qemu-server.conf.blob"], protected: false }'163: 2025-04-14 18:12:30 INFO: aborting backup job163: 2025-04-14 18:12:30 INFO: resuming VM again163: 2025-04-14 18:12:31 ERROR: Backup of VM 163 failed - VM 163 qmp command 'backup' failed - backup connect failed: command error: while locking last snapshot during backup 'BackupInfo { backup_dir: BackupDir { store: "PBS1", ns: BackupNamespace { inner: [], len: 0 }, dir: BackupDir { group: BackupGroup { ty: Vm, id: "163" }, time: 1744639975 }, backup_time_string: "2025-04-14T14:12:55Z" }, files: ["index.json.blob", "drive-virtio1.img.fidx", "drive-virtio0.img.fidx", "client.log.blob", "qemu-server.conf.blob"], protected: false }'
 
What's the output of stat /run/proxmox-backup/old-locking on the PBS? Also, please share the output of mount, again executed on the PBS.
 
or do i need to state these commands while the backup is running?
Thanks, no, I've got the information i needed. This tells us that the new paths are used for locking.

To narrow down the issue, can you fully exclude that there are multiple backup jobs trying to backup to the same group at the same time? Further, can you exclude that there are other ongoing jobs which would lock the last snapshot of the group, e.g. prunes. Do the backup jobs have their own retention settings or is retention managed by the PBS?
 
well thats how it is set on the PVE:
1744730053839.png

so the individual backup jobs per VM/CT for the 5 nodes is managed by a sinlgle job daily all 3h during weektime.

on the PBS the prune could be running, or a push sync:

1744730135159.png

1744730145390.png

shall i change the prune one to not all 2 hours but in the night?
 
problem still persists also after changing the timing of the prune jobs:

today happening on 2 various nodes and SQL

164: 2025-04-16 10:16:48 ERROR: Backup of VM 164 failed - VM 164 qmp command 'backup' failed - backup connect failed: command error: while locking last snapshot during backup 'BackupInfo { backup_dir: BackupDir { store: "PBS1", ns: BackupNamespace { inner: [], len: 0 }, dir: BackupDir { group: BackupGroup { ty: Vm, id: "164" }, time: 1744780440 }, backup_time_string: "2025-04-16T05:14:00Z" }, files: ["index.json.blob", "drive-virtio0.img.fidx", "client.log.blob", "drive-virtio1.img.fidx", "drive-virtio2.img.fidx", "drive-efidisk0.img.fidx", "qemu-server.conf.blob"], protected: false }'
 
Right so you suspect the sync jobs. i have seen 2 updates on the PVE for backup this morning, but these havent cahnged anything. please update once there is more progress.

tahkn you
 
Yeah I only send the patch today, will probably take a little bit for this to get pushed out. You could space the push sync jobs more if that is an option for you.

The problem here is that push sync jobs shouldn't take exclusive locks, as they only need to read the backup, not modify it. However, if there is no sync job running during the backup, this shouldn't make a difference. So if you could run the sync jobs, for example, in such a way that they only run two hours after the backups, there likely won't be an issue.
 
  • Like
Reactions: Chris
Right understood, well then i just leave it as is right now. since the mgmt of timing is one time done on PVE and for the sync on the PBS i couldnt chain them together. thank you for fast response!
 
By the way this change was pushed to pbstest as PBS 3.4.1 by now, might want to check that out and see if it fixes the problem for you.