Snapshotting gets stuck and freezes the VM as well as other VMs on the node

isaacgross1

New Member
May 4, 2024
I have an issue on one of my nodes where, if I snapshot a VM, the task seems to hang at the progress line showing how many GiB have been snapshotted so far: it reports the same amount for several seconds, then advances a little and freezes again. This makes the process take a very long time, and the VM freezes while it happens, taking down the services hosted on it for a while. I also notice that other VMs on the same node/same SSD freeze and go offline until the snapshot is complete. These VMs are hosted on an NVMe SSD. Any ideas why this may be happening?

Screenshot 2025-06-05 212157.png  Screenshot 2025-06-05 212207.png
 
Hi! Could you please share the config of the affected VM (qm config <VMID>) and pveversion -v? Are you taking the snapshot with RAM or without?
 
I am taking snapshots with RAM. I'm wondering whether it matters that the SSD is at 75% used space. I'm not sure how much space a snapshot uses, but is it possible that creating the snapshot is oversubscribing the SSD?
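(For reference, a minimal way to check how full a thin pool is, assuming the storage is LVM-thin as it is here, would be:

lvs -o lv_name,vg_name,lv_size,data_percent,metadata_percent
vgs

The Data% and Meta% of the thin pool show how much of the pool is actually allocated; a thin snapshot itself only consumes extra space as blocks are changed after it is taken.)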
root@pve1:~# qm config 101
agent: 1,fstrim_cloned_disks=1
boot: order=sata0;net0
cores: 2
cpu: x86-64-v2-AES
memory: 4096
meta: creation-qemu=8.1.5,ctime=1714662990
name: FreePBX
net0: virtio=BC:24:11:87:90:58,bridge=vmbr0,firewall=1,tag=20
numa: 0
onboot: 1
ostype: l26
sata0: SSD-LVM-thin:vm-101-disk-0,size=32G
scsihw: virtio-scsi-single
smbios1: uuid=c9cae756-9a4d-487a-b750-a1c4b711c6f7
sockets: 1
spice_enhancements: foldersharing=1,videostreaming=all
startup: order=1
vga: qxl
vmgenid: 44de184b-c473-4194-9011-831270491d94
root@pve1:~# pveversion -v
proxmox-ve: 8.4.0 (running kernel: 6.8.12-11-pve)
pve-manager: 8.4.1 (running version: 8.4.1/2a5fa54a8503f96d)
proxmox-kernel-helper: 8.1.1
proxmox-kernel-6.8.12-11-pve-signed: 6.8.12-11
proxmox-kernel-6.8: 6.8.12-11
proxmox-kernel-6.8.12-5-pve-signed: 6.8.12-5
proxmox-kernel-6.8.12-4-pve-signed: 6.8.12-4
ceph: 19.2.1-pve3
ceph-fuse: 19.2.1-pve3
corosync: 3.1.9-pve1
criu: 3.17.1-2+deb12u1
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx11
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libknet1: 1.30-pve2
libproxmox-acme-perl: 1.6.0
libproxmox-backup-qemu0: 1.5.1
libproxmox-rs-perl: 0.3.5
libpve-access-control: 8.2.2
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.1.0
libpve-cluster-perl: 8.1.0
libpve-common-perl: 8.3.1
libpve-guest-common-perl: 5.2.2
libpve-http-server-perl: 5.2.2
libpve-network-perl: 0.11.2
libpve-rs-perl: 0.9.4
libpve-storage-perl: 8.3.6
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.6.0-2
proxmox-backup-client: 3.4.1-1
proxmox-backup-file-restore: 3.4.1-1
proxmox-firewall: 0.7.1
proxmox-kernel-helper: 8.1.1
proxmox-mail-forward: 0.3.2
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.7
proxmox-widget-toolkit: 4.3.11
pve-cluster: 8.1.0
pve-container: 5.2.6
pve-docs: 8.4.0
pve-edk2-firmware: 4.2025.02-3
pve-esxi-import-tools: 0.7.4
pve-firewall: 5.1.1
pve-firmware: 3.15-4
pve-ha-manager: 4.0.7
pve-i18n: 3.4.4
pve-qemu-kvm: 9.2.0-5
pve-xtermjs: 5.5.0-2
qemu-server: 8.3.12
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.7-pve2
root@pve1:~#
 
Thanks for providing the outputs.

The size of a snapshot depends on how many changes were made in the VM; snapshots can be much, much smaller than the underlying disk. I assume the log does not, further down, say that you're running out of space? I would expect it to print an error and stop in that case. Could you post the output of lvs, please?

Also, can you check whether the same problem occurs if you set the CPU type to host, and whether it also occurs if you take the snapshot without RAM?
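If it helps, a rough CLI sketch of those two tests (VMID 101 and the snapshot name "test" are just placeholders here):

qm set 101 --cpu host          (the VM has to be fully stopped and started again to pick up the new CPU type)
qm snapshot 101 test --vmstate 0          (--vmstate 0 takes the snapshot without RAM)

The same can be done in the GUI via the VM's Hardware -> Processors setting and the "Include RAM" checkbox in the snapshot dialog.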
 
Output of lvs is below. I tried snapshotting another VM where the CPU type is host, with the same result: it freezes the entire node. Even the historical statistics of every disk in the node show empty data for the time period the snapshot was occurring.
1749223951286.png

root@pve1:~# lvs
LV                      VG           Attr       LSize    Pool         Origin        Data%  Meta%  Move Log Cpy%Sync Convert
HDD-LVM-thin            HDD-LVM-thin twi-aotz-- 912.76g                              5.15   0.35
vm-109-disk-0           HDD-LVM-thin Vwi-a-tz-- 100.00g  HDD-LVM-thin               10.86
vm-110-disk-0           HDD-LVM-thin Vwi-aotz--  50.00g  HDD-LVM-thin               72.28
SSD-LVM-thin            SSD-LVM-thin twi-aotz-- 233.58g                             74.76   2.78
snap_vm-101-disk-0_Test SSD-LVM-thin Vri---tz-k  32.00g  SSD-LVM-thin vm-101-disk-0
vm-100-disk-0           SSD-LVM-thin Vwi-aotz--  32.00g  SSD-LVM-thin               99.06
vm-101-disk-0           SSD-LVM-thin Vwi-aotz--  32.00g  SSD-LVM-thin               98.77
vm-104-disk-0           SSD-LVM-thin Vwi-aotz--  90.00g  SSD-LVM-thin               66.33
vm-104-disk-2           SSD-LVM-thin Vwi-aotz--   4.00m  SSD-LVM-thin               14.06
vm-105-disk-1           SSD-LVM-thin Vwi-aotz-- 100.00g  SSD-LVM-thin               51.55
data                    pve          twi-aotz-- <53.93g                             55.29   2.44
root                    pve          -wi-ao---- <39.56g
swap                    pve          -wi-ao----   8.00g
vm-207-disk-0           pve          Vwi-aotz--  32.00g  data                       93.17
Output of the config of the VM with CPU type host:
root@pve1:~# qm config 100
agent: 1,fstrim_cloned_disks=1
boot: order=scsi0;net0
cores: 2
cpu: host
hostpci0: 0000:00:02
memory: 8192
meta: creation-qemu=8.1.5,ctime=1714788472
name: RHEL9
net0: virtio=BC:24:11:37:97:30,bridge=vmbr0,firewall=1,mtu=1500,tag=20
numa: 0
onboot: 1
ostype: l26
scsi0: SSD-LVM-thin:vm-100-disk-0,discard=on,iothread=1,size=32G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=7fc1f937-c83c-4d95-bfe6-57ac7abfcfa7
sockets: 1
spice_enhancements: foldersharing=1,videostreaming=all
startup: order=5
usb0: spice
usb1: spice
vga: qxl
vmgenid: fcf02378-5aa3-48c8-b5d2-be9953218ade
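
(As an aside, a simple way to confirm whether the SSD itself stalls while a snapshot runs, assuming the sysstat package is installed, would be to run

iostat -x 1

in a second shell during the snapshot and watch the %util and r_await/w_await columns for the NVMe device; if they spike or the output stalls at the same moment the VMs freeze, that points at the disk itself.)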
 

Attachments

  • 1749223938510.png
I don't think the remaining space on the disk is the problem.

You say the node freezes entirely? From what you're writing, it sounds like it becomes available again at some point and the snapshot finishes successfully? Could you post the system log from around the time you took the snapshot? Do you see anything in the task log for the snapshot?
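A minimal way to pull the system log for that window, assuming the systemd journal is in use (the default on PVE), would be:

journalctl --since "YYYY-MM-DD HH:MM" --until "YYYY-MM-DD HH:MM"

with the timestamps adjusted to cover the time of the snapshot. The task log can be viewed by double-clicking the snapshot task in the node's Task History in the GUI; on disk, task logs are kept under /var/log/pve/tasks/.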