VM Hangs/Crashes on snapshot

jumpysky81

New Member
May 14, 2024
Hi team,
Former ESXi user here, evaluating Proxmox on a single host before testing clusters; we're considering it as a vCenter replacement for our datacenter environments.
I'm used to a snapshot being an instant process with no visible VM pause unless you're also capturing memory, which I am not.

I last used Proxmox many years ago and am somewhat familiar with its UI.
I have a Win 2022 Server with the below config

agent: 1,freeze-fs-on-backup=0
balloon: 0
bios: ovmf
boot: order=scsi0;net0;scsi1;ide2
cores: 8
cpu: x86-64-v2-AES
efidisk0: SSD_VOL1:100/vm-100-disk-0.qcow2,efitype=4m,pre-enrolled-keys=1,size=528K
ide2: SAS_VOL1:iso/virtio-win-0.1.248.iso,media=cdrom,size=715188K
lock: snapshot
machine: pc-i440fx-8.1
memory: 16384
meta: creation-qemu=8.1.5,ctime=1715560288
name: Redacted
net0: virtio=Redacted,bridge=vmbr0,tag=2
numa: 0
ostype: win11
scsi0: SSD_VOL1:100/vm-100-disk-1.qcow2,backup=0,iothread=1,size=80G
scsi1: SAS_VOL1:100/vm-100-disk-0.qcow2,backup=0,iothread=1,size=6T
scsihw: virtio-scsi-single
smbios1: uuid=fbb8944e-1cca-4e67-ac06-bbed39d2e377,manufacturer=SFA=,product=UHJvTGlhbnQgTUwzNTAgR2VuMTE=,family=SFA=,base64=1
sockets: 1
vmgenid: f5b375f6-f477-47b6-b6c6-320503848c1b

[preupdates]
agent: 1,freeze-fs-on-backup=0
balloon: 0
bios: ovmf
boot: order=scsi0;net0;scsi1;ide2
cores: 8
cpu: x86-64-v2-AES
efidisk0: SSD_VOL1:100/vm-100-disk-0.qcow2,efitype=4m,pre-enrolled-keys=1,size=528K
ide2: SAS_VOL1:iso/virtio-win-0.1.248.iso,media=cdrom,size=715188K
machine: pc-i440fx-8.1
memory: 16384
meta: creation-qemu=8.1.5,ctime=1715560288
name: Redacted
net0: virtio=BC:24:11:53:CD:86,bridge=vmbr0,tag=2
numa: 0
ostype: win11
scsi0: SSD_VOL1:100/vm-100-disk-1.qcow2,backup=0,iothread=1,size=80G
scsi1: SAS_VOL1:100/vm-100-disk-0.qcow2,backup=0,iothread=1,size=6T
scsihw: virtio-scsi-single
smbios1: uuid=fbb8944e-1cca-4e67-ac06-bbed39d2e377,manufacturer=SFA=,product=UHJvTGlhbnQgTUwzNTAgR2VuMTE=,family=SFA=,base64=1
snapstate: prepare
snaptime: 1715657000
sockets: 1
vmgenid: f5b375f6-f477-47b6-b6c6-320503848c1b

I have the virtio PCIe driver installed and the QEMU Guest Agent running correctly,
but the snapshot runs through the 80 GB disk on the SSD volume quickly and times out on the 6 TB disk on the SAS volume.

The error I get after a hung VM console and 20 minutes of waiting is
snapshotting 'drive-scsi0' (SSD_VOL1:100/vm-100-disk-1.qcow2)
snapshotting 'drive-scsi1' (SAS_VOL1:100/vm-100-disk-0.qcow2)
VM 100 qmp command 'savevm-end' failed - unable to connect to VM 100 qmp socket - timeout after 5989 retries
guest-fsfreeze-thaw problems - VM 100 qmp command 'guest-fsfreeze-thaw' failed - got timeout
snapshot create failed: starting cleanup
TASK ERROR: VM 100 qmp command 'blockdev-snapshot-internal-sync' failed - got timeout

The server's CPU usage is elevated during the snapshot attempt.
As you can see from the VM config, I turned off freeze/thaw in a failed attempt to mitigate this.

The host is a brand-new HPE server with a RAID1 SSD array and a RAID6 SAS array, with EXT4 partitions added as "Directory" storage, and both otherwise perform well under simulated load.

When I detach the larger disk, the snapshot completes, albeit still locking the guest OS while it runs.
Please let me know if you have any thoughts on:
a) why it's locking the guest OS, and
b) how we can get snapshots working on the larger volume.

Thanks in advance
 
agent: 1,freeze-fs-on-backup=0
You have enabled the VM agent in the config.

Are you sure it is installed and running? The following error says it can't reach the agent:
unable to connect to VM 100 qmp socket - timeout after 5989 retries
guest-fsfreeze-thaw problems - VM 100 qmp command 'guest-fsfreeze-thaw' failed - got timeout


Why it's locking the guest OS
It is locking the VM in the hypervisor as long as the task is running.
Normally you want the agent in the VM to freeze the filesystem for a brief moment to get the filesystem consistent for the snapshot.
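If it helps, a quick way to double-check the agent from the host side is with `qm agent` (a sketch using the VM ID from the config above; these commands must run on the Proxmox host itself):

```shell
# Should print nothing and exit 0 when the agent answers.
qm agent 100 ping && echo "agent reachable"

# Second sanity check: ask the guest agent for its OS info (returns JSON).
qm agent 100 get-osinfo
```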
 
Thanks @Azunai333 ,

I've confirmed in the moments before running the snapshot that the
qm agent 100 ping
command returns no output, as expected:
https://pve.proxmox.com/wiki/Qemu-guest-agent#:~:text=qm agent <vmid> ping

I'm not used to snapshots locking the VM at all; where I'm coming from, any IO from the moment the snapshot button is clicked is written directly to the snapshot file, and the VM user experience is not impacted.
This thread indicates a pause is expected only when using the agent
https://forum.proxmox.com/threads/snapshot-stopping-vm.59701/

What's a reasonable pause time in your mind? And is it tied to the size of the disk, or just to the in-flight IO at the moment the snapshot is taken and IO is redirected to the snapshot file?

Edit: Corrected @Azunai333's username
 
Update:

I've intentionally disabled the QEMU agent in the VM's options and tested a snapshot. The VM still hangs while it runs; then we get a long timeout on the large disk and the VM is stopped.

snapshotting 'drive-scsi0' (SSD_VOL1:100/vm-100-disk-1.qcow2)
snapshotting 'drive-scsi1' (SAS_VOL1:100/vm-100-disk-0.qcow2)
VM 100 qmp command 'savevm-end' failed - unable to connect to VM 100 qmp socket - timeout after 5989 retries
snapshot create failed: starting cleanup
TASK ERROR: VM 100 qmp command 'blockdev-snapshot-internal-sync' failed - got timeout
 
Hi,
unfortunately, snapshots for large qcow2 disks on network storages can take a very long time. The timeout we use for the drive snapshot operation is 10 minutes. But even after the timeout is hit, QEMU will continue in the background and your VM will hang until it is finished.
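To get a feel for why the 6 TiB disk is the slow one: an internal qcow2 snapshot has to walk and update reference counts for every allocated cluster, and the amount of mapping metadata grows with disk size. A back-of-envelope sketch, assuming the qcow2 default 64 KiB cluster size:

```shell
# Rough qcow2 mapping-metadata math for a 6 TiB disk with 64 KiB clusters.
disk_bytes=$((6 * 1024 * 1024 * 1024 * 1024))
cluster_bytes=$((64 * 1024))
clusters=$((disk_bytes / cluster_bytes))        # data clusters to track
l2_entries=$((cluster_bytes / 8))               # 8-byte entries per L2 table
l2_tables=$(((clusters + l2_entries - 1) / l2_entries))
meta_mib=$((l2_tables * cluster_bytes / 1024 / 1024))
echo "${clusters} clusters, ${l2_tables} L2 tables, ~${meta_mib} MiB of L2 metadata"
```

That is hundreds of MiB of metadata whose refcounts must be bumped while the guest is paused, which a RAID6 SAS array serves far more slowly than an SSD mirror.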
 
Thanks @fiona
Our qcow2 disks are on local storage, so something still seems off. As above, I was under the impression that with the QEMU agent not running, the snapshot's VM pause (if any) would be very brief regardless of the size of the disk.

We'll spin up some different hardware for testing when time allows.
 
Thanks @fiona
Our qcow2 disks are on local storage, so something still seems off. As above, I was under the impression that with the QEMU agent not running, the snapshot's VM pause (if any) would be very brief regardless of the size of the disk.
Hmm, okay, but apparently it still takes more than 10 minutes for the 6 TiB disk. The VM needs to be paused before the disk snapshots are taken and can only be resumed afterwards. When you do not include RAM/VM state, an fsfreeze is done to ensure that the filesystems are in a consistent state on disk. When RAM/VM state is included, no freeze is needed, because the pending in-flight filesystem state is captured together with the memory.
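Side by side, the two modes look roughly like this on the CLI (a sketch; the snapshot names are made up, and these must run on the Proxmox host):

```shell
# Without RAM state: the guest agent (if enabled) issues an
# fsfreeze/thaw around the disk snapshot for consistent filesystems.
qm snapshot 100 pre-update

# With RAM state: memory and device state are saved too, so no freeze
# is needed, at the cost of writing out the guest's 16 GiB of RAM.
qm snapshot 100 pre-update-ram --vmstate 1
```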
 
Hi all,

Closing the loop. :-D

We were running "Directory" storage with an EXT4 partition for each of our hardware RAID volumes, and using the qcow2 disk format for our VMs.
A switch to ZFS with raw disks has made a world of difference: snapshots take seconds, with no notable impact on end users.
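For anyone landing here later: on ZFS-backed storage, Proxmox takes native ZFS snapshots, which are copy-on-write and near-instant regardless of disk size. A hedged sketch of what that looks like under the hood (the pool name is illustrative):

```shell
# Proxmox issues the equivalent of a native ZFS snapshot per disk, e.g.:
zfs snapshot tank/vm-100-disk-0@pre-update

# Listing shows the snapshot exists and consumes no space up front.
zfs list -t snapshot -o name,used,creation tank/vm-100-disk-0
```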

-jumpysky81
 