Manual snapshot corrupts VM

listhor

I've been using snapshots for quite a long time (usually before doing any upgrades inside a VM) and everything worked flawlessly, until yesterday.
I'm on PVE 8.2.4, and this might be the first snapshot taken after the recent PVE updates.
While taking a snapshot (with RAM) I noticed that its size kept growing well beyond the RAM size (21 GB for 14 GB of RAM). And once the snapshot finished, the VM became almost completely unresponsive. One time I managed to log in and noticed that mysql (Percona) was consuming a lot of CPU.
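The growth can be watched from the host; on ZFS the saved RAM lands in a per-snapshot state volume (a sketch, assuming VMID 100 on a ZFS-backed store):
Code:
watch -n 5 'zfs list -o name,used | grep vm-100-state'   # vmstate volume size while the snapshot runs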

My first thought was that something was wrong with qemu-guest-agent, but the daily backups done by PBS (in snapshot mode) are fine - I used one of them to restore the VM after the manual snapshot damaged it. When checking the status of qemu-guest-agent I could see that fsfreeze was issued at the time the PBS backup was initiated.

A manual snapshot taken while the VM is shut down doesn't corrupt it. So my guess is that the issue is related to a failure when saving the RAM content(?)
How can I troubleshoot this further?
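For anyone who wants to reproduce or dig in, a minimal set of host-side checks would be something like this (a sketch, assuming VMID 100; adjust to your setup):
Code:
pveversion -v                             # record the exact package versions
qm listsnapshot 100                       # snapshot tree and current parent
qm status 100 --verbose                   # runtime state of the VM
journalctl -u pvedaemon --since today     # host-side log around the snapshot task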
 
Hello,

Do you include the RAM in the snapshot? Do you have the QEMU guest agent enabled for that VM? Do you have the QEMU guest agent installed inside the guest?

Do note though that a VM "snapshot" is completely different from a backup in "snapshot" mode.
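In CLI terms the two are entirely separate operations (a sketch; VMID 100 and the storage name are placeholders):
Code:
qm snapshot 100 pre-upgrade --vmstate 1        # VM snapshot, optionally including RAM
vzdump 100 --mode snapshot --storage my-pbs    # backup in "snapshot" mode (what PBS runs)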
 
Like I wrote, the snapshots were taken with RAM included and the QEMU guest agent is enabled:
Code:
root@zsovh:~# systemctl status qemu-guest-agent
● qemu-guest-agent.service - QEMU Guest Agent
     Loaded: loaded (/lib/systemd/system/qemu-guest-agent.service; static)
     Active: active (running) since Thu 2024-07-11 23:42:36 CEST; 10h ago
   Main PID: 1003 (qemu-ga)
      Tasks: 2 (limit: 16611)
     Memory: 588.0K
        CPU: 14.903s
     CGroup: /system.slice/qemu-guest-agent.service
             └─1003 /usr/sbin/qemu-ga

Jul 11 23:48:00 zsovh qemu-ga[1003]: info: guest-ping called
Jul 11 23:48:11 zsovh qemu-ga[1003]: info: guest-ping called
Jul 11 23:48:21 zsovh qemu-ga[1003]: info: guest-ping called
Jul 11 23:48:32 zsovh qemu-ga[1003]: info: guest-ping called
Jul 11 23:48:42 zsovh qemu-ga[1003]: info: guest-ping called
Jul 11 23:48:53 zsovh qemu-ga[1003]: info: guest-ping called
Jul 11 23:49:04 zsovh qemu-ga[1003]: info: guest-ping called
Jul 11 23:49:14 zsovh qemu-ga[1003]: info: guest-ping called
Jul 12 01:00:00 zsovh qemu-ga[1003]: info: guest-ping called
Jul 12 01:00:00 zsovh qemu-ga[1003]: info: guest-fsfreeze called
The fsfreeze above was issued during the PBS backup... The VM is Ubuntu Server 22.04.
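To compare what the agent does during a manual snapshot versus a PBS backup, its log can be followed live on both ends (a sketch):
Code:
journalctl -f -u qemu-guest-agent    # inside the guest, while the snapshot runs
journalctl -f -u pvedaemon           # on the PVE host, for the task-side view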

EDIT:
VM config:
Code:
root@pveovh:~# qm config 100
agent: 1
bios: ovmf
boot: order=virtio0
cores: 3
cpu: host,flags=+md-clear;+pcid;+spec-ctrl;+ssbd;+aes
cpuunits: 200
description: scsi1%3A backup-storage-nfs%3A100/vm-100-disk-0.qcow2,aio=native,backup=0,discard=on,iothread=1,size=50G,ssd=1
efidisk0: local-zfs:vm-100-disk-0,efitype=4m,pre-enrolled-keys=1,size=1M
ide2: bpool-iso:iso/ubuntu-22.04.2-live-server-amd64.iso,media=cdrom,size=1929660K
machine: q35
memory: 14336
meta: creation-qemu=8.0.2,ctime=1689627380
name: Ubuntu
net0: virtio=xx:xx,bridge=vmbr0,firewall=1,queues=12
net1: virtio=xx:xx,bridge=vmbr1,firewall=1,queues=12
numa: 1
onboot: 1
ostype: l26
parent: przed-akt-30
rng0: source=/dev/urandom
scsihw: virtio-scsi-single
smbios1: uuid=8dedd322-413b-4c78-9152-dd5b93bedd20
sockets: 4
startup: order=3,up=180,down=0
tpmstate0: local-zfs:vm-100-disk-1,size=4M,version=v2.0
vga: qxl,memory=24
virtio0: local-zfs:vm-100-disk-2,cache=writeback,discard=on,iothread=1,size=250G
vmgenid: 4c005fff-1393-485a-90a0-1516603f3d5e
 
I'm having the same issue. Proxmox version is 8.2.4.
I take a manual snapshot (incl. RAM) with an active QEMU agent, and after that my VM becomes unresponsive.
I was able to reproduce this issue with 3 VMs (all running Ubuntu 22.04 (kernel 5.15.0-113) + Docker 27.0.3).
It looks like Docker causes the unresponsiveness.
Even when SSHed in I'm not able to reboot or shut down the VMs; I have to "qm stop" them, and after starting them again the VMs run normally.

This VM only runs 3 containers (Graylog, Zabbix & Portainer-Agent):

Code:
agent: 1
balloon: 0
bios: ovmf
boot: order=scsi0;net0;ide0
cipassword: **********
ciuser: user
cores: 2
cpu: host
description: Ubuntu server 2204
efidisk0: nvme01-prod-1:vm-9911-disk-0,efitype=4m,pre-enrolled-keys=1,size=1M
ide0: nvme01-prod-1:vm-9911-cloudinit,media=cdrom
ipconfig0: ip=xx.xx.xx.xx/24,gw=xx.xx.xx.xx
memory: 6144
meta: creation-qemu=7.2.0,ctime=1678384182
name: sysmon01-vmm.stage
nameserver: xx.xx.xx.xx xx.xx.xx.xx
net0: virtio=EE:72:E3:57:63:C5,bridge=vmbr0,firewall=1,tag=99
numa: 0
ostype: l26
parent: test
scsi0: nvme01-prod-1:vm-9911-disk-1,discard=on,iothread=1,size=200G,ssd=1
scsihw: virtio-scsi-single
searchdomain: example.com
serial0: socket
smbios1: uuid=903d9c7f-07a5-4c92-88b7-29404255d5e8
sockets: 1
tablet: 0
tags: stage;dck;mon;ubu
vga: serial0
vmgenid: 077bb6eb-57b6-495e-9fd7-d28a234caa59


This VM runs completely different containers than the VM above:

Code:
agent: 1
balloon: 0
bios: ovmf
boot: order=scsi0;net0;ide0
cipassword: **********
ciuser: user
cores: 6
cpu: host
description: Ubuntu server 2204
efidisk0: nvme01-prod-1:vm-9901-disk-0,efitype=4m,pre-enrolled-keys=1,size=1M
ide0: nvme01-prod-1:vm-9901-cloudinit,media=cdrom
ipconfig0: ip=xx.xx.xx.xx/24,gw=xx.xx.xx.xx
memory: 8192
meta: creation-qemu=7.2.0,ctime=1678384182
name: dck01-vmm.stage
nameserver: xx.xx.xx.xx xx.xx.xx.xx
net0: virtio=26:F3:B2:DA:97:99,bridge=vmbr0,firewall=1,tag=99
numa: 0
ostype: l26
parent: before-scheduled-maintenance
scsi0: nvme01-prod-1:vm-9901-disk-1,discard=on,iothread=1,size=200G,ssd=1
scsihw: virtio-scsi-single
searchdomain: example.com
serial0: socket
smbios1: uuid=ca4c1e23-f3e1-4884-97c6-5712d239cc8c
sockets: 1
tablet: 0
tags: stage;dck;ubu
vga: serial0
vmgenid: 45866c13-85f6-404d-8a7f-cbdab78f6ce7



Edit:

Even when I'm able to reboot the VM while SSHed in, it remains unresponsive. Only after stopping and starting the VM via the Proxmox shell does it behave normally again.
The Proxmox syslog looks normal; the only thing that points to the unresponsiveness is that 'guest-ping' failed with a timeout.
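A quick way to check from the host whether the agent is still reachable (a sketch, using VMID 9911 from the config above):
Code:
qm agent 9911 ping    # returns silently on success; times out if the agent is hung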

Further testing (the systemctl commands used are sketched after this list):

1. Stopped docker.service + docker.socket after VM start -> no issues after a snapshot.
2. Started docker.service + docker.socket again -> the unresponsiveness is back after a new snapshot.
3. Shut the VM down (via SSH) and started it again -> the unresponsiveness is back after a snapshot.
4. Disabled docker.service + docker.socket, shut down and started the VM again -> no issues after a snapshot.
5. Restarted Proxmox -> the unresponsiveness is back after a snapshot.
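For reference, the stop/start/disable steps above map onto standard systemd commands like these (a sketch; service names assume the stock Docker packages):
Code:
systemctl stop docker.service docker.socket             # steps 1/2: stop, then start again
systemctl start docker.service docker.socket
systemctl disable --now docker.service docker.socket    # step 4: disable and stop in one go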
 
In my case - even after a forceful reboot (same as you: qm stop) the VM doesn't work normally. I don't have Docker running in that VM, only a Percona database; more likely it gets corrupted while the snapshot is being taken and can't recover by itself.
 

For me this issue only occurs with Docker running. But Docker is also pretty much the only thing that is running in my VMs.

Edit:

I downgraded to Docker 26.0.0 but got the same unresponsiveness. And since I did not have this issue back when I was on Docker 26.0.0, I agree with listhor here: I don't think Docker is the issue.
 
I have another Proxmox instance (different hardware) with a very similar result. The snapshot with RAM included was bigger than the RAM itself, and while the VM in question wasn't completely unresponsive (its main application is Docker), I had to force-stop it from Proxmox, as Docker's stop job was taking forever to execute. In this case (with Docker affected) a forced reboot was enough to restore normal functioning of the VM.

Anyway, manual snapshots with RAM are currently NOT to be used on Proxmox 8.2.4...
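Until this is fixed, a safer pattern, based on what did work in this thread (disk-only snapshots, and snapshots of a shut-down VM), would be something like this, assuming VMID 100:
Code:
qm snapshot 100 pre-upgrade                                       # disk-only snapshot, no RAM saved
qm shutdown 100 && qm snapshot 100 pre-upgrade && qm start 100    # or snapshot while shut down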
 
