Snapshot with VM State and RAM fails.

rborg

New Member
Jun 26, 2024
Hi,

Any pointers would help.
We have a particular VM for which a snapshot that includes RAM fails. On the Ceph cluster we can see the RBD state image being written to; once it reaches around 99%, the process fails with the following error:
Code:
TASK ERROR: unable to save VM state and RAM - qemu_savevm_state_complete_precopy error -5

Subsequent snapshot attempts, with or without RAM, then fail with the following error until the VM is stopped or migrated to another node:
Code:
TASK ERROR: VM 5021020 qmp command 'savevm-start' failed - VM snapshot already started

Snapshots without RAM work flawlessly for this VM once the already started error is cleared.
Snapshots for other VMs on the same nodes and storage with RAM work without issues.

A few seconds before the snapshot fails, we also observe log entries like the following:
Code:
Oct 27 15:30:42 pvenode07 pvedaemon[2193130]: VM 5021020 qmp command failed - VM 5021020 qmp command 'query-proxmox-support' failed - unable to connect to VM 5021020 qmp socket - timeout after 51 retries
Oct 27 15:30:42 pvenode07 pvedaemon[2185978]: VM 5021020 qmp command failed - VM 5021020 qmp command 'query-proxmox-support' failed - unable to connect to VM 5021020 qmp socket - timeout after 51 retries
Oct 27 15:30:43 pvenode07 pvestatd[1753]: VM 5021020 qmp command failed - VM 5021020 qmp command 'query-proxmox-support' failed - unable to connect to VM 5021020 qmp socket - timeout after 51 retries
Oct 27 15:30:43 pvenode07 pvedaemon[2176923]: VM 5021020 qmp command failed - VM 5021020 qmp command 'query-proxmox-support' failed - unable to connect to VM 5021020 qmp socket - timeout after 51 retries
Oct 27 15:30:47 pvenode07 pvedaemon[2185978]: <user> end task UPID:pvenode07:00216AA8:0CF41FED:68FF8169:qmsnapshot:5021020:<user>: unable to save VM state and RAM - qemu_savevm_state_complete_precopy error -5
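
To check whether the QMP socket timeouts consistently precede the failed snapshot task, the journal can be scanned programmatically. The following is a minimal sketch, assuming syslog-style lines like the excerpt above; the helper names (`parse_events`, `timeouts_before_failure`), the 30-second window, and the placeholder year are assumptions made for illustration and are not part of any Proxmox tooling.

```python
import re
from datetime import datetime
from typing import NamedTuple

# Lines copied from the journal excerpt above.
JOURNAL = """\
Oct 27 15:30:42 pvenode07 pvedaemon[2193130]: VM 5021020 qmp command failed - VM 5021020 qmp command 'query-proxmox-support' failed - unable to connect to VM 5021020 qmp socket - timeout after 51 retries
Oct 27 15:30:42 pvenode07 pvedaemon[2185978]: VM 5021020 qmp command failed - VM 5021020 qmp command 'query-proxmox-support' failed - unable to connect to VM 5021020 qmp socket - timeout after 51 retries
Oct 27 15:30:43 pvenode07 pvestatd[1753]: VM 5021020 qmp command failed - VM 5021020 qmp command 'query-proxmox-support' failed - unable to connect to VM 5021020 qmp socket - timeout after 51 retries
Oct 27 15:30:43 pvenode07 pvedaemon[2176923]: VM 5021020 qmp command failed - VM 5021020 qmp command 'query-proxmox-support' failed - unable to connect to VM 5021020 qmp socket - timeout after 51 retries
Oct 27 15:30:47 pvenode07 pvedaemon[2185978]: <user> end task UPID:pvenode07:00216AA8:0CF41FED:68FF8169:qmsnapshot:5021020:<user>: unable to save VM state and RAM - qemu_savevm_state_complete_precopy error -5
"""


class Event(NamedTuple):
    when: datetime
    daemon: str
    pid: int
    message: str


# timestamp, hostname (ignored), daemon[pid], message
LINE_RE = re.compile(r"^(\w{3} +\d+ \d{2}:\d{2}:\d{2}) \S+ (\w+)\[(\d+)\]: (.*)$")


def parse_events(text, year=2000):
    """Parse syslog-style lines. The log carries no year, so `year` is a placeholder."""
    events = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        stamp, daemon, pid, message = m.groups()
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        events.append(Event(when, daemon, int(pid), message))
    return events


def timeouts_before_failure(events, window_s=30):
    """Pair each savevm failure with the QMP timeouts logged shortly before it."""
    failures = [e for e in events if "unable to save VM state and RAM" in e.message]
    timeouts = [e for e in events if "qmp socket - timeout" in e.message]
    report = []
    for failure in failures:
        near = [
            t for t in timeouts
            if 0 <= (failure.when - t.when).total_seconds() <= window_s
        ]
        report.append((failure, near))
    return report


report = timeouts_before_failure(parse_events(JOURNAL))
for failure, near in report:
    print(f"failure at {failure.when.time()} preceded by {len(near)} QMP timeouts")
```

Running the same scan over a longer journal export from several failed attempts would show whether the timeouts always appear in the seconds before the task error, or only sometimes.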

Multinode Cluster
Code:
Proxmox Version: 8.4.14
Kernel: Linux 6.8.12-15-pve
Storage: External Ceph 19.2.2 squid (stable)
VM Config:
Code:
agent: enabled=1
bios: seabios
boot: order=scsi0
cores: 4
cpu: Haswell-noTSX
machine: pc-i440fx-9.2
memory: 8192
meta: creation-qemu=9.2.0,ctime=1752059571
name: <redacted>
net0: virtio=BC:24:11:13:FA:7B,bridge=vmbr1,tag=221
numa: 0
ostype: l26
scsi0: <redacted>:vm-5021020-disk-0,size=85G
scsihw: virtio-scsi-pci
smbios1: uuid=ac1fc4f2-4318-44f1-a0fd-d8f004c48245
sockets: 1
tags: <redacted>
vmgenid: 89dca5cb-194c-4325-8050-47d36b2f993c
 
Could you share the VM config of another VM on the same node and storage where the snapshot works without error?
 
The problem is that it affects random VMs. When there is a failure, the job restarts automatically and then succeeds. If VM A fails today, it will be fine tomorrow.
 
Regression: Snapshot with RAM fails with 'qemu_savevm_state_complete_precopy error -5' on large disks with dirty bitmaps



I am reporting a suspected regression or a very similar variant of the old bug from 2023 regarding VM hibernation/snapshot failure on VMs with large disks and existing dirty bitmaps.

In 2023, a similar issue was resolved in pve-qemu-kvm >= 7.2.0-5 where qm suspend --todisk failed with qemu_savevm_state_iterate error -5 due to the dirty bitmap size not being properly accounted for in the driver state limit during migration/savevm.

However, I am currently facing the exact same behavior, but this time it happens during a live snapshot with RAM included. The task hangs and eventually terminates with qemu_savevm_state_complete_precopy error -5.
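
For reference, the -5 in these messages looks like a negated errno value, which is how QEMU commonly reports I/O errors internally. Under that assumption, the sketch below decodes it with Python's standard library; `describe_qemu_errno` is a helper name made up for this illustration.

```python
import errno
import os


def describe_qemu_errno(value: int) -> str:
    """Translate a (possibly negated) errno value into its symbolic name and text."""
    code = abs(value)
    name = errno.errorcode.get(code, "unknown")
    return f"{name}: {os.strerror(code)}"


print(describe_qemu_errno(-5))
```

On Linux, 5 maps to EIO, a generic I/O error. That narrows the failure to a read or write on the state file or migration stream, but it does not say which layer returned it.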


Steps to Reproduce:
1. Start the VM.
2. Run a backup task (or start and cancel it) to ensure the dirty bitmap is generated on the large attached disk.
3. Attempt to take a live snapshot of the VM with the "Include RAM" option enabled.
4. The process progresses for a while and then fails.
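
The snapshot step above can also be triggered from the command line, which makes repeated reproduction easier. This is a minimal sketch that wraps `qm snapshot` with `--vmstate 1` (the CLI counterpart of "Include RAM"); the VMID `100` and the snapshot name `ramtest` are placeholders, and the backup step is left out because it depends on the backup setup in use.

```python
import shlex
import subprocess


def snapshot_with_ram_cmd(vmid: int, snapname: str) -> list[str]:
    """Build the argv for a snapshot that includes VM state and RAM."""
    return ["qm", "snapshot", str(vmid), snapname, "--vmstate", "1"]


def run(cmd, dry_run=True):
    """Print the command by default; only execute it when dry_run is False."""
    if dry_run:
        print(shlex.join(cmd))
        return None
    return subprocess.run(cmd, capture_output=True, text=True, check=False)


# Placeholder VMID and snapshot name; replace with real values on a PVE node.
run(snapshot_with_ram_cmd(100, "ramtest"))
```

With `dry_run=False` on a PVE node, the returned object's `returncode` and `stderr` could be logged per attempt to see how often the error appears.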


Error Output:
```
saving VM state and RAM using storage 'vms'
294.00 B in 0s
433.86 MiB in 1s
942.70 MiB in 2s
950.22 MiB in 3s
1.19 GiB in 4s
1.63 GiB in 5s
2.10 GiB in 6s
2.48 GiB in 7s
2.90 GiB in 8s
3.33 GiB in 9s
3.47 GiB in 11s
3.47 GiB in 12s
3.57 GiB in 13s
4.06 GiB in 14s
4.78 GiB in 15s
5.35 GiB in 16s
5.83 GiB in 17s
6.29 GiB in 18s
6.67 GiB in 19s
7.16 GiB in 20s
7.69 GiB in 21s
8.24 GiB in 22s
snapshot create failed: starting cleanup
TASK ERROR: unable to save VM state and RAM - qemu_savevm_state_complete_precopy error -5
```
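
The progress lines in the task log can be turned into per-interval write rates to see where the state transfer slows down before the failure. The following is a rough sketch that parses the output above; the 50 MiB/s threshold is an arbitrary value chosen for illustration, and the function names are hypothetical.

```python
import re

# Progress lines copied from the task log above.
PROGRESS = """\
294.00 B in 0s
433.86 MiB in 1s
942.70 MiB in 2s
950.22 MiB in 3s
1.19 GiB in 4s
1.63 GiB in 5s
2.10 GiB in 6s
2.48 GiB in 7s
2.90 GiB in 8s
3.33 GiB in 9s
3.47 GiB in 11s
3.47 GiB in 12s
3.57 GiB in 13s
4.06 GiB in 14s
4.78 GiB in 15s
5.35 GiB in 16s
5.83 GiB in 17s
6.29 GiB in 18s
6.67 GiB in 19s
7.16 GiB in 20s
7.69 GiB in 21s
8.24 GiB in 22s
"""

UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3}
LINE_RE = re.compile(r"^([\d.]+) (B|KiB|MiB|GiB) in (\d+)s$")


def parse_progress(text):
    """Return (seconds, bytes_written) samples from the task log."""
    samples = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            value, unit, secs = m.groups()
            samples.append((int(secs), float(value) * UNITS[unit]))
    return samples


def slow_intervals(samples, min_rate_mib_s=50.0):
    """List intervals whose average write rate falls below the threshold."""
    slow = []
    for (t0, b0), (t1, b1) in zip(samples, samples[1:]):
        if t1 <= t0:
            continue  # skip duplicate or out-of-order timestamps
        rate = (b1 - b0) / (t1 - t0) / UNITS["MiB"]
        if rate < min_rate_mib_s:
            slow.append((t0, t1, round(rate, 2)))
    return slow


samples = parse_progress(PROGRESS)
for t0, t1, rate in slow_intervals(samples):
    print(f"{t0}s -> {t1}s: {rate} MiB/s")
```

For this log, the intervals below the threshold are 2s to 3s and 11s to 12s. The failure itself arrives right after the last progress line at 22s, so the final write rate does not hint at it.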
PVE Version Information (pveversion -v):
(Note: My pve-qemu-kvm version is already 10.1.2-7)


```
proxmox-ve: 9.1.0 (running kernel: 6.17.13-2-pve)
pve-manager: 9.1.7 (running version: 9.1.7/16b139a017452f16)
proxmox-kernel-helper: 9.0.4
proxmox-kernel-6.17: 6.17.13-2
proxmox-kernel-6.17.13-2-pve-signed: 6.17.13-2
proxmox-kernel-6.17.2-1-pve-signed: 6.17.2-1
amd64-microcode: 3.20251202.1~bpo13+1
ceph-fuse: 19.2.3-pve2
corosync: 3.1.10-pve1
criu: 4.1.1-1
frr-pythontools: 10.4.1-1+pve1
ifupdown2: 3.3.0-1+pmx12
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libproxmox-acme-perl: 1.7.0
libproxmox-backup-qemu0: 2.0.2
libproxmox-rs-perl: 0.4.1
libpve-access-control: 9.0.5
libpve-apiclient-perl: 3.4.2
libpve-cluster-api-perl: 9.1.1
libpve-cluster-perl: 9.1.1
libpve-common-perl: 9.1.9
libpve-guest-common-perl: 6.0.2
libpve-http-server-perl: 6.0.5
libpve-network-perl: 1.2.5
libpve-rs-perl: 0.11.4
libpve-storage-perl: 9.1.1
libspice-server1: 0.15.2-1+b1
lvm2: 2.03.31-2+pmx1
lxc-pve: 6.0.5-4
lxcfs: 6.0.4-pve1
novnc-pve: 1.6.0-3
proxmox-backup-client: 4.1.5-1
proxmox-backup-file-restore: 4.1.5-1
proxmox-backup-restore-image: 1.0.0
proxmox-firewall: 1.2.1
proxmox-kernel-helper: 9.0.4
proxmox-mail-forward: 1.0.2
proxmox-mini-journalreader: 1.6
proxmox-offline-mirror-helper: 0.7.3
proxmox-widget-toolkit: 5.1.8
pve-cluster: 9.1.1
pve-container: 6.1.2
pve-docs: 9.1.2
pve-edk2-firmware: 4.2025.05-2
pve-esxi-import-tools: 1.0.1
pve-firewall: 6.0.4
pve-firmware: 3.18-2
pve-ha-manager: 5.1.3
pve-i18n: 3.6.6
pve-qemu-kvm: 10.1.2-7
pve-xtermjs: 5.5.0-3
qemu-server: 9.1.6
smartmontools: 7.4-pve1
spiceterm: 3.4.1
swtpm: 0.8.0+pve3
vncterm: 1.9.1
zfsutils-linux: 2.4.1-pve1
```

I have also reported this at https://bugzilla.proxmox.com/show_bug.cgi?id=4476