[SOLVED] Snapshot fails to cleanup and prevents future snapshots

Tmanok

Hi Everyone,

Before performing maintenance on a Windows Server 2022 VM, I attempted a snapshot with memory (RAM state) included. Unfortunately, it failed to clean up with a perplexing error. The VM has two virtual disks, one on GlusterFS and the other on NFS. The GlusterFS storage is high performance (local on three of the nodes, replicated between them), and the only VMs on the GlusterFS volume run on the nodes that host GlusterFS (no remote VMs).

VM Configuration:
Code:
agent: 1
balloon: 0
bios: ovmf
boot: order=ide0;virtio0
cores: 2
cpu: x86-64-v2-AES
description: Todo%3A%0ASet custom IP%0AConfigure SMB%0ABegin Mapping File Permissions
efidisk0: Hades-NFS:133/vm-133-disk-0.qcow2,efitype=4m,pre-enrolled-keys=1,size=528K
ide0: none,media=cdrom
machine: pc-q35-7.2
memory: 8192
meta: creation-qemu=7.2.0,ctime=1706002550
name: WS2022-FS-1
net0: virtio=F2:6A:23:0A:7B:0E,bridge=vmbr0,firewall=1
numa: 1
onboot: 1
ostype: win11
scsihw: virtio-scsi-single
smbios1: uuid=2f5c6879-5cfb-41f3-b4a9-bb1fa39aa002
sockets: 2
tags: ws2022
virtio0: gfs-raid10-pve456:133/vm-133-disk-0.qcow2,cache=writeback,discard=on,iothread=1,size=100G
virtio1: Hades-NFS:133/vm-133-disk-2.qcow2,discard=on,mbps_rd=50,mbps_wr=50,size=2000G
vmgenid: 088640ca-f4f9-469e-9551-e026f6acb0d7

First Snapshot:
Code:
Formatting 'gluster://192.168.80.110/gfs-raid10-pve456/images/133/vm-133-state-Updates-April-8th.raw', fmt=raw size=17704157184 preallocation=off
[2024-04-09 03:41:20.811280 +0000] I [io-stats.c:3701:ios_sample_buf_size_configure] 0-gfs-raid10-pve456: Configure ios_sample_buf  size is 1024 because ios_sample_interval is 0
[2024-04-09 03:41:20.973969 +0000] E [MSGID: 108006] [afr-common.c:6123:__afr_handle_child_down_event] 0-gfs-raid10-pve456-replicate-0: All subvolumes are down. Going offline until at least one of them comes back up.
[2024-04-09 03:41:30.819255 +0000] I [io-stats.c:4033:fini] 0-gfs-raid10-pve456: io-stats translator unloaded
saving VM state and RAM using storage 'gfs-raid10-pve456'
515.99 KiB in 0s
{Truncated for the Proxmox Forums}...
7.40 GiB in 2m 50s
completed saving the VM state in 3m 1s, saved 7.85 GiB
snapshotting 'drive-virtio0' (gfs-raid10-pve456:133/vm-133-disk-0.qcow2)
snapshotting 'drive-virtio1' (Hades-NFS:133/vm-133-disk-2.qcow2)
VM 133 qmp command 'savevm-end' failed - unable to connect to VM 133 qmp socket - timeout after 5980 retries
snapshot create failed: starting cleanup
VM 133 qmp command 'blockdev-snapshot-delete-internal-sync' failed - unable to connect to VM 133 qmp socket - timeout after 5980 retries
TASK ERROR: VM 133 qmp command 'blockdev-snapshot-internal-sync' failed - got timeout

Second Snapshot:
Code:
snapshot create failed: starting cleanup
TASK ERROR: VM 133 qmp command 'savevm-start' failed - VM snapshot already started

Additionally, when running qm delsnapshot 133 Updates-April-8th, the following is returned: snapshot 'Updates-April-8th' does not exist
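
For anyone cross-checking, the snapshot state Proxmox itself keeps for the VM can be inspected like this (assuming the VM's config lives on this node under the standard /etc/pve path; the grep is just a quick way to surface any [snapname] sections):
Code:
# what Proxmox has on record for VM 133
qm listsnapshot 133

# snapshot sections, if any, appear as [snapname] blocks in the VM config
grep -A 3 '^\[' /etc/pve/qemu-server/133.conf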

Any assistance would be greatly appreciated.
Thank you,


Tmanok
 
"[2024-04-09 03:41:20.973969 +0000] E [MSGID: 108006] [afr-common.c:6123:__afr_handle_child_down_event] 0-gfs-raid10-pve456-replicate-0: All subvolumes are down. Going offline until at least one of them comes back up."

sounds like your gluster storage fails when creating the snapshot state volume.. and then the VM blocks when attempting to finish the write to the state volume.. sounds like a storage problem to me.

I would try:
- stopping or shutting down the VM
- removing any traces of the snapshots in the qcow2 files, if there are any
- starting the VM again (rough sketch below)
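
something along these lines (rough sketch only, using the mount paths and image names from your config above; double check them on your node):
Code:
qm shutdown 133

# list any leftover internal snapshots in each qcow2 image
qemu-img snapshot -l /mnt/pve/gfs-raid10-pve456/images/133/vm-133-disk-0.qcow2
qemu-img snapshot -l /mnt/pve/Hades-NFS/images/133/vm-133-disk-2.qcow2

# delete whatever shows up (repeat per image), then boot again
qemu-img snapshot -d <snapshot> /mnt/pve/gfs-raid10-pve456/images/133/vm-133-disk-0.qcow2

qm start 133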
 
Hi Fabian,

Any pointers on diagnosing the issue with this GlusterFS setup? Perhaps there are configurations I could modify to ensure Proxmox VE is interfacing with it correctly.

- stopping or shutting down the VM
- removing any traces of the snapshots in the qcow2 files, if there are any
- starting the VM again
Good news:
Code:
root@pve5:/mnt/pve/gfs-raid10-pve456/images# qemu-img snapshot -l 133/vm-133-disk-0.qcow2
Snapshot list:
ID        TAG               VM SIZE                DATE     VM CLOCK     ICOUNT
1         Updates-April-8th      0 B 2024-04-08 20:44:33 76:43:46.617

When the next maintenance window opens, I will back up the files on the VM itself, shut it down, delete the snapshot, and reboot the VM.
Expect a follow-up whether it works or not.
Thank you!


Tmanok
 
no idea, sorry. maybe gluster and/or system logs have more details?
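
e.g. something like this on the affected node(s) (default gluster log location assumed, volume name taken from your task log):
Code:
# gluster client/brick logs
ls -lt /var/log/glusterfs/

# volume and brick status, plus any pending heals
gluster volume status gfs-raid10-pve456
gluster volume heal gfs-raid10-pve456 info

# system journal around the time of the failed snapshot
journalctl --since "2024-04-09 03:40" --until "2024-04-09 03:50"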
 
After further troubleshooting, I have been able to make a PBS backup, as the backup snapshotting is different from regular snapshotting. However, I am still unable to delete this broken snapshot.

Code:
root@pve5:/mnt/pve/gfs-raid10-pve456/images# qemu-img snapshot -l 133/vm-133-disk-0.qcow2
Snapshot list:
ID        TAG               VM SIZE                DATE     VM CLOCK     ICOUNT
1         Updates-April-8th      0 B 2024-04-08 20:44:33 76:43:46.617           
root@pve5:/mnt/pve/gfs-raid10-pve456/images# qemu-img snapshot -d 1 133/vm-133-disk-0.qcow2
qemu-img: Could not delete snapshot '1': snapshot not found
root@pve5:/mnt/pve/gfs-raid10-pve456/images# qemu-img snapshot -l 133/vm-133-disk-0.qcow2
Snapshot list:
ID        TAG               VM SIZE                DATE     VM CLOCK     ICOUNT
1         Updates-April-8th      0 B 2024-04-08 20:44:33 76:43:46.617

Any further ideas? I wish to regain my snapshot capabilities for this VM. By the way, I am able to snapshot other VMs on this same GlusterFS volume.
Thank you,


Tmanok
 
that command needs the name ("TAG"), not the ID.

backups of VMs don't do storage level snapshots at all, only containers do..
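
i.e. something along the lines of (tag taken from your listing above):
Code:
qemu-img snapshot -d Updates-April-8th 133/vm-133-disk-0.qcow2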
 
Hi Fabian,

It took me a while to get another maintenance window for this client, but when I finally did, your hint to delete by the tag instead of the ID worked!
One more thing was important: deleting the snapshots on the NFS storage as well, for both the data disk and the EFI disk.
Code:
root@pve5:/mnt/pve/Hades-NFS/images# qemu-img snapshot -l 133/vm-133-disk-2.qcow2
Snapshot list:
ID        TAG               VM SIZE                DATE     VM CLOCK     ICOUNT
1         Updates-April-8th      0 B 2024-04-08 20:44:46 76:44:00.093           
2         Test                  0 B 2024-04-10 02:35:09 00:06:26.640

root@pve5:/mnt/pve/Hades-NFS/images# qemu-img snapshot -d "Test" 133/vm-133-disk-2.qcow2
root@pve5:/mnt/pve/Hades-NFS/images# qemu-img snapshot -d "Updates-April-8th" 133/vm-133-disk-2.qcow2
root@pve5:/mnt/pve/Hades-NFS/images# qemu-img snapshot -l 133/vm-133-disk-0.qcow2
root@pve5:/mnt/pve/Hades-NFS/images#

Thank you very much, Fabian!


Tmanok
 
