VM Hangs/Crashes on snapshot

jumpysky81

New Member
May 14, 2024
Hi team,
Former ESXi user here, evaluating Proxmox on a single host before testing clusters; we're considering it as a vCenter replacement for our datacenter environments.
I'm used to a snapshot being an instant process with no visible VM pause unless you're also capturing memory, which I am not.

I last used Proxmox many years ago and am somewhat familiar with its UI.
I have a Win 2022 Server with the below config

agent: 1,freeze-fs-on-backup=0
balloon: 0
bios: ovmf
boot: order=scsi0;net0;scsi1;ide2
cores: 8
cpu: x86-64-v2-AES
efidisk0: SSD_VOL1:100/vm-100-disk-0.qcow2,efitype=4m,pre-enrolled-keys=1,size=528K
ide2: SAS_VOL1:iso/virtio-win-0.1.248.iso,media=cdrom,size=715188K
lock: snapshot
machine: pc-i440fx-8.1
memory: 16384
meta: creation-qemu=8.1.5,ctime=1715560288
name: Redacted
net0: virtio=Redacted,bridge=vmbr0,tag=2
numa: 0
ostype: win11
scsi0: SSD_VOL1:100/vm-100-disk-1.qcow2,backup=0,iothread=1,size=80G
scsi1: SAS_VOL1:100/vm-100-disk-0.qcow2,backup=0,iothread=1,size=6T
scsihw: virtio-scsi-single
smbios1: uuid=fbb8944e-1cca-4e67-ac06-bbed39d2e377,manufacturer=SFA=,product=UHJvTGlhbnQgTUwzNTAgR2VuMTE=,family=SFA=,base64=1
sockets: 1
vmgenid: f5b375f6-f477-47b6-b6c6-320503848c1b

[preupdates]
agent: 1,freeze-fs-on-backup=0
balloon: 0
bios: ovmf
boot: order=scsi0;net0;scsi1;ide2
cores: 8
cpu: x86-64-v2-AES
efidisk0: SSD_VOL1:100/vm-100-disk-0.qcow2,efitype=4m,pre-enrolled-keys=1,size=528K
ide2: SAS_VOL1:iso/virtio-win-0.1.248.iso,media=cdrom,size=715188K
machine: pc-i440fx-8.1
memory: 16384
meta: creation-qemu=8.1.5,ctime=1715560288
name: Redacted
net0: virtio=BC:24:11:53:CD:86,bridge=vmbr0,tag=2
numa: 0
ostype: win11
scsi0: SSD_VOL1:100/vm-100-disk-1.qcow2,backup=0,iothread=1,size=80G
scsi1: SAS_VOL1:100/vm-100-disk-0.qcow2,backup=0,iothread=1,size=6T
scsihw: virtio-scsi-single
smbios1: uuid=fbb8944e-1cca-4e67-ac06-bbed39d2e377,manufacturer=SFA=,product=UHJvTGlhbnQgTUwzNTAgR2VuMTE=,family=SFA=,base64=1
snapstate: prepare
snaptime: 1715657000
sockets: 1
vmgenid: f5b375f6-f477-47b6-b6c6-320503848c1b

I have the virtio PCIe driver installed and the QEMU Guest Agent running correctly,
but the snapshot runs through the 80 GB disk on the SSD volume quickly and times out on the 6 TB disk on the SAS volume.

The error I get after a hung VM console and 20 minutes of waiting is
snapshotting 'drive-scsi0' (SSD_VOL1:100/vm-100-disk-1.qcow2)
snapshotting 'drive-scsi1' (SAS_VOL1:100/vm-100-disk-0.qcow2)
VM 100 qmp command 'savevm-end' failed - unable to connect to VM 100 qmp socket - timeout after 5989 retries
guest-fsfreeze-thaw problems - VM 100 qmp command 'guest-fsfreeze-thaw' failed - got timeout
snapshot create failed: starting cleanup
TASK ERROR: VM 100 qmp command 'blockdev-snapshot-internal-sync' failed - got timeout

The server's CPU usage is elevated during the snapshot attempt.
As you can see from the VM config, I turned off freeze/thaw in a failed attempt to mitigate this.

The host is a brand-new HPE server with a RAID1 SSD array and a RAID6 SAS array, with EXT4 partitions added as "Directory" storage, and both otherwise perform well under simulated load.

When I detach the larger disk, the snapshot completes, albeit still locking the guest OS while it runs.
Please let me know if you have any thoughts on:
a) why it's locking the guest OS, and
b) how we can get snapshots working on the larger volume.

Thanks in advance
 
agent: 1,freeze-fs-on-backup=0
You have enabled the VM agent in the config.

Are you sure it is installed and running? The following error says it can't reach the agent:
unable to connect to VM 100 qmp socket - timeout after 5989 retries
guest-fsfreeze-thaw problems - VM 100 qmp command 'guest-fsfreeze-thaw' failed - got timeout


Why it's locking the guest OS
It is locking the VM in the hypervisor as long as the task is running.
Normally you want the agent in the VM to freeze the filesystem for a brief moment to get the filesystem consistent for the snapshot.
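If it helps, a quick way to double-check the agent from the host side is with `qm agent` (a sketch using the VM ID from the config above; these commands must run on the Proxmox host itself):

```shell
# Should print nothing and exit 0 when the agent answers.
qm agent 100 ping && echo "agent reachable"

# Second sanity check: ask the guest agent for its OS info (returns JSON).
qm agent 100 get-osinfo
```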
 
Thanks @Azunai333 ,

I've confirmed in the moments before running the snapshot that the
qm agent 100 ping
command returns no output, as expected:
https://pve.proxmox.com/wiki/Qemu-guest-agent#:~:text=qm agent <vmid> ping

I'm not used to snapshots locking the VM at all; where I'm coming from, any IO from the moment the snapshot button is clicked is written directly to the snapshot file, and the VM user experience is not impacted.
This thread indicates a pause is expected only when using the agent
https://forum.proxmox.com/threads/snapshot-stopping-vm.59701/

What's a reasonable pause time in your mind? And is it tied to the size of the disk, or just to the in-flight IO at the moment the snapshot is taken and IO is redirected to the snapshot file?

Edit: Corrected @Azunai333's username
 
Update:

I've intentionally disabled the QEMU agent in the VM's options and tested a snapshot. The VM still hangs while it runs; then we get a long timeout on the large disk and the VM is stopped.

snapshotting 'drive-scsi0' (SSD_VOL1:100/vm-100-disk-1.qcow2)
snapshotting 'drive-scsi1' (SAS_VOL1:100/vm-100-disk-0.qcow2)
VM 100 qmp command 'savevm-end' failed - unable to connect to VM 100 qmp socket - timeout after 5989 retries
snapshot create failed: starting cleanup
TASK ERROR: VM 100 qmp command 'blockdev-snapshot-internal-sync' failed - got timeout
 
Hi,
unfortunately, snapshots for large qcow2 disks on network storages can take a very long time. The timeout we use for the drive snapshot operation is 10 minutes. But even after the timeout is hit, QEMU will continue in the background and your VM will hang until it is finished.
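To get a feel for why the 6 TiB disk is the slow one: an internal qcow2 snapshot has to walk and update reference counts for every allocated cluster, and the amount of mapping metadata grows with disk size. A back-of-envelope sketch, assuming the qcow2 default 64 KiB cluster size:

```shell
# Rough qcow2 mapping-metadata math for a 6 TiB disk with 64 KiB clusters.
disk_bytes=$((6 * 1024 * 1024 * 1024 * 1024))
cluster_bytes=$((64 * 1024))
clusters=$((disk_bytes / cluster_bytes))        # data clusters to track
l2_entries=$((cluster_bytes / 8))               # 8-byte entries per L2 table
l2_tables=$(((clusters + l2_entries - 1) / l2_entries))
meta_mib=$((l2_tables * cluster_bytes / 1024 / 1024))
echo "${clusters} clusters, ${l2_tables} L2 tables, ~${meta_mib} MiB of L2 metadata"
```

That is hundreds of MiB of metadata whose refcounts must be bumped while the guest is paused, which a RAID6 SAS array serves far more slowly than an SSD mirror.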
 
Thanks @fiona
Our qcow2 disks are on local storage, so something still seems off. As above, I was under the impression that with the QEMU agent not running, the snapshot's VM pause (if any) would be very brief regardless of the size of the disk.

We'll spin up some different hardware for testing when time allows.
 
Thanks @fiona
Our qcow2 disks are on local storage, so something still seems off. As above, I was under the impression that with the QEMU agent not running, the snapshot's VM pause (if any) would be very brief regardless of the size of the disk.
Hmm, okay, but apparently it still takes more than 10 minutes for the 6 TiB disk. The VM needs to be paused before the disk snapshots are taken and can only be resumed afterwards. When you do not include RAM/VM state, an fsfreeze is done to ensure that the filesystems are in a consistent state on disk. When RAM/VM state is included, no freeze is needed, because the pending in-flight filesystem state is captured together with the memory.
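Side by side, the two modes look roughly like this on the CLI (a sketch; the snapshot names are made up, and these must run on the Proxmox host):

```shell
# Without RAM state: the guest agent (if enabled) issues an
# fsfreeze/thaw around the disk snapshot for consistent filesystems.
qm snapshot 100 pre-update

# With RAM state: memory and device state are saved too, so no freeze
# is needed, at the cost of writing out the guest's 16 GiB of RAM.
qm snapshot 100 pre-update-ram --vmstate 1
```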
 
Hi all,

Closing the loop. :-D

We were running "Directory" storage with an EXT4 partition for each of our hardware RAID volumes, and using the qcow2 disk format for our VMs.
A switch to ZFS with raw disks has made a world of difference: snapshots take seconds, with no notable impact on end users.
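For anyone landing here later: on ZFS-backed storage, Proxmox takes native ZFS snapshots, which are copy-on-write and near-instant regardless of disk size. A hedged sketch of what that looks like under the hood (the pool name is illustrative):

```shell
# Proxmox issues the equivalent of a native ZFS snapshot per disk, e.g.:
zfs snapshot tank/vm-100-disk-0@pre-update

# Listing shows the snapshot exists and consumes no space up front.
zfs list -t snapshot -o name,used,creation tank/vm-100-disk-0
```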

-jumpysky81
 