I've had this problem with one VM in particular for about a month, and it has only ever happened on this one VM. It's a VM with a 200GB virtual disk stored over NFS on a TrueNAS server.
When big updates come out for the VM, I'll shut the VM down and snapshot it, then boot it up and apply the updates. If things go well, I'll delete the snapshot (usually within about 30 minutes, once I feel everything is running smoothly again).
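For reference, that routine from the CLI looks roughly like this (VMID 116; the snapshot name is just an example):

Code:
qm shutdown 116                  # clean shutdown before snapshotting
qm snapshot 116 pre-update       # offline snapshot as the rollback point
qm start 116                     # boot and apply the updates
# ...once everything looks healthy, roughly 30 minutes later:
qm delsnapshot 116 pre-update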
This VM is running Nextcloud, and its internal file system is ZFS. I know: ZFS inside the VM on ZFS-backed storage, but this VM wasn't originally going to live on a TrueNAS server. I've run a ZFS scrub of the zpool on the TrueNAS itself, as well as a scrub of the zpool inside the VM, and neither found any errors.
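For completeness, the scrubs were just the standard commands, run on the TrueNAS host and inside the VM respectively (pool name here is a placeholder for mine):

Code:
zpool scrub tank        # kick off the scrub
zpool status -v tank    # watch progress; look for "errors: No known data errors" when done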
Anyway, I noticed a month ago that when I go to the Proxmox WebGUI to take a snapshot, the usual entry that says "NOW - You are here!" is missing; the snapshot list is completely empty. I tried to take a snapshot, and the task appeared to run without any problems, yet no snapshot was created and the list never populated. I tried two or three more times before I realized something was odd and went to the CLI. From the CLI, no snapshot seemed to exist either. I assumed the issue was something weird with the WebGUI, so I updated the VM without a snapshot and life went on.
Last night I shut down the Proxmox cluster to install the latest updates from the pve-no-subscription repository, with no problems. Tonight I decided to look into this issue more. The behavior is the same: no "NOW - You are here!" entry is present, and when I take another snapshot from the WebGUI the task log shows:
Code:
snapshotting 'drive-scsi0' (sandisk_ssd:116/vm-116-disk-0.qcow2)
TASK OK
However, just like a month ago, no snapshot exists in the WebGUI and the "NOW - You are here!" entry is still missing.
I did a few idiot checks:
Code:
root@jpve1:/etc# qm listsnapshot 116
root@jpve1:/etc# qemu-img check /mnt/pve/sandisk_ssd/images/116/vm-116-disk-0.qcow2
No errors were found on the image.
1675005/3276800 = 51.12% allocated, 85.54% fragmented, 0.00% compressed clusters
Image end offset: 201816014848
root@jpve1:/etc# qm snapshot 116 test
snapshot name 'test' already used
root@jpve1:/etc# qm snapshot 116 randomcheck
snapshotting 'drive-scsi0' (sandisk_ssd:116/vm-116-disk-0.qcow2)
root@jpve1:/etc# qm listsnapshot 116
root@jpve1:/etc#
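Two more checks I'm planning, based on my (possibly wrong) understanding that a qcow2 on NFS stores the snapshot data inside the image while Proxmox tracks the snapshot itself in the VM config: list the internal qcow2 snapshots, and look for leftover snapshot sections in the config. Paths are my own; the commands themselves are standard.

Code:
# Internal snapshots recorded in the qcow2 image itself
qemu-img snapshot -l /mnt/pve/sandisk_ssd/images/116/vm-116-disk-0.qcow2

# Snapshot sections Proxmox keeps in the VM config (shown as [snapname] headers)
grep -n '^\[' /etc/pve/qemu-server/116.conf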
So I have two take-aways from this so far:
1. Somewhere a "test" snapshot apparently exists, even though I cannot see or use it.
2. If I create a snapshot with a new random name, the task seems to complete successfully, but no snapshot actually appears.
Something I think is possibly related, and that I've never felt entirely comfortable about, seems worth mentioning:
Last year I deleted a number of old snapshots of this VM. It took so long that the delete timed out; the timeout was 10 minutes. I was a bit surprised there would be a timeout on deleting a snapshot at all. The VM is on SSD storage on a TrueNAS with 10Gb networking between the Proxmox host and the TrueNAS, so I'd have thought 10 minutes would be enough. But I read the forums and was able to do two things based on old posts:
1. Since the snapshots had been partially deleted, the best way forward was to edit that VM's conf file on the host and remove the snapshot entries (roughly the sections shown in the sketch after this list), which I did, and there seemed to be no bad consequences. I was able to create and delete a snapshot without powering on the VM, so I felt comfortable that all was well.
2. Per this post https://forum.proxmox.com/threads/snapshots-in-delete-status.27729/post-440790 I was able to change the timeout to something larger. For me, it's fine if cleaning up a snapshot takes an hour or more, as long as it runs to completion. So I set the default to 60 minutes by editing the appropriate file on all 3 Proxmox hosts. That seemed to resolve the problems for this VM. (This VM is by far the largest at 200GB, so I'm not surprised it's the only one having this problem.)
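To illustrate what I removed back then: as far as I understand the config format, each snapshot appears in /etc/pve/qemu-server/116.conf as its own section, and the live config points at it with a parent: line, something like this (names and values made up):

Code:
parent: pre-update
scsi0: sandisk_ssd:116/vm-116-disk-0.qcow2,size=200G
...

[pre-update]
scsi0: sandisk_ssd:116/vm-116-disk-0.qcow2,size=200G
snaptime: 1675000000
...

Removing a snapshot by hand meant deleting its [section] and fixing up any parent: lines that referenced it.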
Does anyone have recommendations on how to proceed? One idea I had was to convert the disk to raw, then back to qcow2. You cannot have snapshots in raw format, so whatever sticky snapshots exist would have to be purged, right? Before I go draconian and convert to raw and back, I'd like a deeper understanding of what broke "under the hood".
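For reference, the round trip I have in mind would be roughly the following, with the VM shut down and ideally done against a copy first (paths are just examples from my storage):

Code:
cd /mnt/pve/sandisk_ssd/images/116
qemu-img convert -p -f qcow2 -O raw   vm-116-disk-0.qcow2 vm-116-disk-0.raw
qemu-img convert -p -f raw   -O qcow2 vm-116-disk-0.raw   vm-116-disk-0-new.qcow2
# then swap the new qcow2 into place of the old one (keeping the original until the VM boots cleanly)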
I do realize that snapshots can become a problem because of their size if left in place too long, and that removing them can take an excessive amount of time by Proxmox standards. I'm okay with those limitations; I use snapshots for a lot of testing, so ditching them because they take too long isn't a great option.
pveversion -v
Code:
pveversion -v
proxmox-ve: 7.3-1 (running kernel: 5.15.83-1-pve)
pve-manager: 7.3-4 (running version: 7.3-4/d69b70d4)
pve-kernel-helper: 7.3-3
pve-kernel-5.15: 7.3-1
pve-kernel-5.13: 7.1-9
pve-kernel-5.15.83-1-pve: 5.15.83-1
pve-kernel-5.15.74-1-pve: 5.15.74-1
pve-kernel-5.15.64-1-pve: 5.15.64-1
pve-kernel-5.13.19-6-pve: 5.13.19-15
pve-kernel-5.13.19-2-pve: 5.13.19-4
ceph: 16.2.9-pve1
ceph-fuse: 16.2.9-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.3
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.3-1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.3-2
libpve-guest-common-perl: 4.2-3
libpve-http-server-perl: 4.1-5
libpve-storage-perl: 7.3-2
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-1
lxcfs: 5.0.3-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.3.2-1
proxmox-backup-file-restore: 2.3.2-1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.0-1
proxmox-widget-toolkit: 3.5.3
pve-cluster: 7.3-2
pve-container: 4.4-2
pve-docs: 7.3-1
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-7
pve-firmware: 3.6-3
pve-ha-manager: 3.5.1
pve-i18n: 2.8-2
pve-qemu-kvm: 7.1.0-4
pve-xtermjs: 4.16.0-1
qemu-server: 7.3-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+2
vncterm: 1.7-1
zfsutils-linux: 2.1.9-pve1