I've had this problem with one VM in particular for about a month, and it has only ever happened on this one VM. It's a VM with a 200GB virtual disk stored over NFS on a TrueNAS server.
When big updates come out for the VM, I'll shut the VM down and snapshot it, then boot it up and apply the updates. If things go well, I'll delete the snapshot (usually within about 30 minutes, once I feel everything is running smoothly again).
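For reference, that routine from the CLI looks roughly like this (VMID 116; the snapshot name is just an example):

Code:
qm shutdown 116                  # clean shutdown before snapshotting
qm snapshot 116 pre-update       # offline snapshot as the rollback point
qm start 116                     # boot and apply the updates
# ...once everything looks healthy, roughly 30 minutes later:
qm delsnapshot 116 pre-update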
This VM is running Nextcloud, and its internal file system is ZFS. I know: ZFS inside the VM on ZFS-backed storage, but this VM wasn't originally going to live on a TrueNAS server. I've run a ZFS scrub of the zpool on the TrueNAS itself, as well as a scrub of the zpool inside the VM, and neither found any errors.
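For completeness, the scrubs were just the standard commands, run on the TrueNAS host and inside the VM respectively (pool name here is a placeholder for mine):

Code:
zpool scrub tank        # kick off the scrub
zpool status -v tank    # watch progress; look for "errors: No known data errors" when done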
Anyway, I noticed a month ago that when I go to the Proxmox WebGUI to take a snapshot, the usual entry that says "NOW - You are here!" is missing; the snapshot list is completely empty. I tried to take a snapshot, and the task appeared to run without any problems, yet no snapshot was created and the list never populated. I tried two or three more times before I realized something was odd and went to the CLI. From the CLI, no snapshot seemed to exist either. I assumed the issue was something weird with the WebGUI, so I updated the VM without a snapshot and life went on.
Last night I shut down the Proxmox cluster to install the latest updates from the pve-no-subscription repository, with no problems. Tonight I decided to look into this issue more. The behavior is the same: no "NOW - You are here!" entry is present, and when I take another snapshot from the WebGUI the task log shows:
Code:
snapshotting 'drive-scsi0' (sandisk_ssd:116/vm-116-disk-0.qcow2)
TASK OK
However, just like a month ago, no snapshot exists in the WebGUI and the "NOW - You are here!" entry is still missing.
I did a few idiot checks:
Code:
root@jpve1:/etc# qm listsnapshot 116
root@jpve1:/etc# qemu-img check /mnt/pve/sandisk_ssd/images/116/vm-116-disk-0.qcow2
No errors were found on the image.
1675005/3276800 = 51.12% allocated, 85.54% fragmented, 0.00% compressed clusters
Image end offset: 201816014848
root@jpve1:/etc# qm snapshot 116 test
snapshot name 'test' already used
root@jpve1:/etc# qm snapshot 116 randomcheck
snapshotting 'drive-scsi0' (sandisk_ssd:116/vm-116-disk-0.qcow2)
root@jpve1:/etc# qm listsnapshot 116
root@jpve1:/etc#
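Two more checks I'm planning, based on my (possibly wrong) understanding that a qcow2 on NFS stores the snapshot data inside the image while Proxmox tracks the snapshot itself in the VM config: list the internal qcow2 snapshots, and look for leftover snapshot sections in the config. Paths are my own; the commands themselves are standard.

Code:
# Internal snapshots recorded in the qcow2 image itself
qemu-img snapshot -l /mnt/pve/sandisk_ssd/images/116/vm-116-disk-0.qcow2

# Snapshot sections Proxmox keeps in the VM config (shown as [snapname] headers)
grep -n '^\[' /etc/pve/qemu-server/116.conf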
So I have two take-aways from this so far:
1. Somewhere a "test" snapshot apparently exists, even though I cannot see or use it.
2. If I create a snapshot with a new random name, the task seems to complete successfully, but no snapshot actually appears.
Something I think is possibly related, and that I've never felt entirely comfortable about, seems worth mentioning:
Last year I deleted a number of old snapshots of this VM. It took so long that the delete timed out; the timeout was 10 minutes. I was a bit surprised there would be a timeout on deleting a snapshot at all. The VM is on SSD storage on a TrueNAS with 10Gb networking between the Proxmox host and the TrueNAS, so I'd have thought 10 minutes would be enough. But I read the forums and was able to do two things based on old posts:
1. Since the snapshots had been partially deleted, the best way forward was to edit that VM's conf file on the host and remove the snapshot entries (roughly the sections shown in the sketch after this list), which I did, and there seemed to be no bad consequences. I was able to create and delete a snapshot without powering on the VM, so I felt comfortable that all was well.
2. Per this post https://forum.proxmox.com/threads/snapshots-in-delete-status.27729/post-440790 I was able to change the timeout to something larger. For me, it's fine if cleaning up a snapshot takes an hour or more, as long as it runs to completion. So I set the default to 60 minutes by editing the appropriate file on all 3 Proxmox hosts. That seemed to resolve the problems for this VM. (This VM is by far the largest at 200GB, so I'm not surprised it's the only one having this problem.)
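To illustrate what I removed back then: as far as I understand the config format, each snapshot appears in /etc/pve/qemu-server/116.conf as its own section, and the live config points at it with a parent: line, something like this (names and values made up):

Code:
parent: pre-update
scsi0: sandisk_ssd:116/vm-116-disk-0.qcow2,size=200G
...

[pre-update]
scsi0: sandisk_ssd:116/vm-116-disk-0.qcow2,size=200G
snaptime: 1675000000
...

Removing a snapshot by hand meant deleting its [section] and fixing up any parent: lines that referenced it.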
Does anyone have recommendations on how to proceed? One idea I had was to convert the disk to raw, then back to qcow2. You cannot have snapshots in raw format, so whatever sticky snapshots exist would have to be purged, right? Before I go draconian and convert to raw and back, I'd like a deeper understanding of what broke "under the hood".
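For reference, the round trip I have in mind would be roughly the following, with the VM shut down and ideally done against a copy first (paths are just examples from my storage):

Code:
cd /mnt/pve/sandisk_ssd/images/116
qemu-img convert -p -f qcow2 -O raw   vm-116-disk-0.qcow2 vm-116-disk-0.raw
qemu-img convert -p -f raw   -O qcow2 vm-116-disk-0.raw   vm-116-disk-0-new.qcow2
# then swap the new qcow2 into place of the old one (keeping the original until the VM boots cleanly)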
I do realize that snapshots can become a problem because of their size if left in place too long, and that removing them can take an excessive amount of time by Proxmox standards. I'm okay with those limitations; I use snapshots for a lot of testing, so ditching them because they take too long isn't a great option.
pveversion -v
Code:
pveversion -v
proxmox-ve: 7.3-1 (running kernel: 5.15.83-1-pve)
pve-manager: 7.3-4 (running version: 7.3-4/d69b70d4)
pve-kernel-helper: 7.3-3
pve-kernel-5.15: 7.3-1
pve-kernel-5.13: 7.1-9
pve-kernel-5.15.83-1-pve: 5.15.83-1
pve-kernel-5.15.74-1-pve: 5.15.74-1
pve-kernel-5.15.64-1-pve: 5.15.64-1
pve-kernel-5.13.19-6-pve: 5.13.19-15
pve-kernel-5.13.19-2-pve: 5.13.19-4
ceph: 16.2.9-pve1
ceph-fuse: 16.2.9-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.3
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.3-1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.3-2
libpve-guest-common-perl: 4.2-3
libpve-http-server-perl: 4.1-5
libpve-storage-perl: 7.3-2
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-1
lxcfs: 5.0.3-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.3.2-1
proxmox-backup-file-restore: 2.3.2-1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.0-1
proxmox-widget-toolkit: 3.5.3
pve-cluster: 7.3-2
pve-container: 4.4-2
pve-docs: 7.3-1
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-7
pve-firmware: 3.6-3
pve-ha-manager: 3.5.1
pve-i18n: 2.8-2
pve-qemu-kvm: 7.1.0-4
pve-xtermjs: 4.16.0-1
qemu-server: 7.3-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+2
vncterm: 1.7-1
zfsutils-linux: 2.1.9-pve1