Manual snapshot corrupts VM

listhor

I've been using snapshots for quite a long time (usually before doing any upgrades inside a VM) and everything worked flawlessly, until yesterday.
I'm on PVE 8.2.4, and this might be the first snapshot taken after the recent PVE updates.
While taking a snapshot (with RAM) I noticed that its size kept growing well beyond the RAM size (21 GB for 14 GB of RAM). And once the snapshot finished, the VM became almost completely unresponsive. One time I managed to log in and noticed that mysql (Percona) was consuming a lot of CPU.
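The growth can be watched from the host; on ZFS the saved RAM lands in a per-snapshot state volume (a sketch, assuming VMID 100 on a ZFS-backed store):
Code:
watch -n 5 'zfs list -o name,used | grep vm-100-state'   # vmstate volume size while the snapshot runs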

My first thought was that something was wrong with qemu-guest-agent, but the daily backups done by PBS (in snapshot mode) are fine - I used one of them to restore the VM after the manual snapshot damaged it. When checking the status of qemu-guest-agent I could see that fsfreeze was issued at the time the PBS backup was initiated.

A manual snapshot taken while the VM is shut down doesn't corrupt it. So my guess is that the issue is related to a failure when saving the RAM content(?)
How can I troubleshoot this further?
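For anyone who wants to reproduce or dig in, a minimal set of host-side checks would be something like this (a sketch, assuming VMID 100; adjust to your setup):
Code:
pveversion -v                             # record the exact package versions
qm listsnapshot 100                       # snapshot tree and current parent
qm status 100 --verbose                   # runtime state of the VM
journalctl -u pvedaemon --since today     # host-side log around the snapshot task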
 
Hello,

Do you include the RAM in the snapshot? Do you have the QEMU guest agent enabled for that VM? Do you have the QEMU guest agent installed inside the guest?

Do note though that a VM "snapshot" is completely different from a backup in "snapshot" mode.
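In CLI terms the two are entirely separate operations (a sketch; VMID 100 and the storage name are placeholders):
Code:
qm snapshot 100 pre-upgrade --vmstate 1        # VM snapshot, optionally including RAM
vzdump 100 --mode snapshot --storage my-pbs    # backup in "snapshot" mode (what PBS runs)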
 
Like I wrote, the snapshots were taken with RAM included and the QEMU guest agent is enabled:
Code:
root@zsovh:~# systemctl status qemu-guest-agent
● qemu-guest-agent.service - QEMU Guest Agent
     Loaded: loaded (/lib/systemd/system/qemu-guest-agent.service; static)
     Active: active (running) since Thu 2024-07-11 23:42:36 CEST; 10h ago
   Main PID: 1003 (qemu-ga)
      Tasks: 2 (limit: 16611)
     Memory: 588.0K
        CPU: 14.903s
     CGroup: /system.slice/qemu-guest-agent.service
             └─1003 /usr/sbin/qemu-ga

Jul 11 23:48:00 zsovh qemu-ga[1003]: info: guest-ping called
Jul 11 23:48:11 zsovh qemu-ga[1003]: info: guest-ping called
Jul 11 23:48:21 zsovh qemu-ga[1003]: info: guest-ping called
Jul 11 23:48:32 zsovh qemu-ga[1003]: info: guest-ping called
Jul 11 23:48:42 zsovh qemu-ga[1003]: info: guest-ping called
Jul 11 23:48:53 zsovh qemu-ga[1003]: info: guest-ping called
Jul 11 23:49:04 zsovh qemu-ga[1003]: info: guest-ping called
Jul 11 23:49:14 zsovh qemu-ga[1003]: info: guest-ping called
Jul 12 01:00:00 zsovh qemu-ga[1003]: info: guest-ping called
Jul 12 01:00:00 zsovh qemu-ga[1003]: info: guest-fsfreeze called
The fsfreeze above was issued during the PBS backup... The VM is Ubuntu Server 22.04.
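To compare what the agent does during a manual snapshot versus a PBS backup, its log can be followed live on both ends (a sketch):
Code:
journalctl -f -u qemu-guest-agent    # inside the guest, while the snapshot runs
journalctl -f -u pvedaemon           # on the PVE host, for the task-side view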

EDIT:
VM config:
Code:
root@pveovh:~# qm config 100
agent: 1
bios: ovmf
boot: order=virtio0
cores: 3
cpu: host,flags=+md-clear;+pcid;+spec-ctrl;+ssbd;+aes
cpuunits: 200
description: scsi1%3A backup-storage-nfs%3A100/vm-100-disk-0.qcow2,aio=native,backup=0,discard=on,iothread=1,size=50G,ssd=1
efidisk0: local-zfs:vm-100-disk-0,efitype=4m,pre-enrolled-keys=1,size=1M
ide2: bpool-iso:iso/ubuntu-22.04.2-live-server-amd64.iso,media=cdrom,size=1929660K
machine: q35
memory: 14336
meta: creation-qemu=8.0.2,ctime=1689627380
name: Ubuntu
net0: virtio=xx:xx,bridge=vmbr0,firewall=1,queues=12
net1: virtio=xx:xx,bridge=vmbr1,firewall=1,queues=12
numa: 1
onboot: 1
ostype: l26
parent: przed-akt-30
rng0: source=/dev/urandom
scsihw: virtio-scsi-single
smbios1: uuid=8dedd322-413b-4c78-9152-dd5b93bedd20
sockets: 4
startup: order=3,up=180,down=0
tpmstate0: local-zfs:vm-100-disk-1,size=4M,version=v2.0
vga: qxl,memory=24
virtio0: local-zfs:vm-100-disk-2,cache=writeback,discard=on,iothread=1,size=250G
vmgenid: 4c005fff-1393-485a-90a0-1516603f3d5e
 
I'm having the same issue. Proxmox version is 8.2.4.
I take a manual snapshot (incl. RAM) with an active QEMU agent, and after that my VM becomes unresponsive.
I was able to reproduce this issue with 3 VMs (all running Ubuntu 22.04 (kernel 5.15.0-113) + Docker 27.0.3).
It looks like Docker causes the unresponsiveness.
Even when SSHed in I'm not able to reboot or shut down the VMs; I have to "qm stop" them, and after starting them again the VMs run normally.

This VM only runs 3 containers (Graylog, Zabbix & Portainer-Agent):

Code:
agent: 1
balloon: 0
bios: ovmf
boot: order=scsi0;net0;ide0
cipassword: **********
ciuser: user
cores: 2
cpu: host
description: Ubuntu server 2204
efidisk0: nvme01-prod-1:vm-9911-disk-0,efitype=4m,pre-enrolled-keys=1,size=1M
ide0: nvme01-prod-1:vm-9911-cloudinit,media=cdrom
ipconfig0: ip=xx.xx.xx.xx/24,gw=xx.xx.xx.xx
memory: 6144
meta: creation-qemu=7.2.0,ctime=1678384182
name: sysmon01-vmm.stage
nameserver: xx.xx.xx.xx xx.xx.xx.xx
net0: virtio=EE:72:E3:57:63:C5,bridge=vmbr0,firewall=1,tag=99
numa: 0
ostype: l26
parent: test
scsi0: nvme01-prod-1:vm-9911-disk-1,discard=on,iothread=1,size=200G,ssd=1
scsihw: virtio-scsi-single
searchdomain: example.com
serial0: socket
smbios1: uuid=903d9c7f-07a5-4c92-88b7-29404255d5e8
sockets: 1
tablet: 0
tags: stage;dck;mon;ubu
vga: serial0
vmgenid: 077bb6eb-57b6-495e-9fd7-d28a234caa59


This VM runs completely different containers than the VM above:

Code:
agent: 1
balloon: 0
bios: ovmf
boot: order=scsi0;net0;ide0
cipassword: **********
ciuser: user
cores: 6
cpu: host
description: Ubuntu server 2204
efidisk0: nvme01-prod-1:vm-9901-disk-0,efitype=4m,pre-enrolled-keys=1,size=1M
ide0: nvme01-prod-1:vm-9901-cloudinit,media=cdrom
ipconfig0: ip=xx.xx.xx.xx/24,gw=xx.xx.xx.xx
memory: 8192
meta: creation-qemu=7.2.0,ctime=1678384182
name: dck01-vmm.stage
nameserver: xx.xx.xx.xx xx.xx.xx.xx
net0: virtio=26:F3:B2:DA:97:99,bridge=vmbr0,firewall=1,tag=99
numa: 0
ostype: l26
parent: before-scheduled-maintenance
scsi0: nvme01-prod-1:vm-9901-disk-1,discard=on,iothread=1,size=200G,ssd=1
scsihw: virtio-scsi-single
searchdomain: example.com
serial0: socket
smbios1: uuid=ca4c1e23-f3e1-4884-97c6-5712d239cc8c
sockets: 1
tablet: 0
tags: stage;dck;ubu
vga: serial0
vmgenid: 45866c13-85f6-404d-8a7f-cbdab78f6ce7



Edit:

Even when I'm able to reboot the VM while SSHed in, it remains unresponsive. Only after stopping and starting the VM via the Proxmox shell does it behave normally again.
The Proxmox syslog looks normal; the only thing that points to the unresponsiveness is that 'guest-ping' failed with a timeout.
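A quick way to check from the host whether the agent is still reachable (a sketch, using VMID 9911 from the config above):
Code:
qm agent 9911 ping    # returns silently on success; times out if the agent is hung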

Further testing (the systemctl commands used are sketched after this list):

1. Stopped docker.service + docker.socket after VM start -> no issues after a snapshot.
2. Started docker.service + docker.socket again -> the unresponsiveness is back after a new snapshot.
3. Shut the VM down (via SSH) and started it again -> the unresponsiveness is back after a snapshot.
4. Disabled docker.service + docker.socket, shut down and started the VM again -> no issues after a snapshot.
5. Restarted Proxmox -> the unresponsiveness is back after a snapshot.
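For reference, the stop/start/disable steps above map onto standard systemd commands like these (a sketch; service names assume the stock Docker packages):
Code:
systemctl stop docker.service docker.socket             # steps 1/2: stop, then start again
systemctl start docker.service docker.socket
systemctl disable --now docker.service docker.socket    # step 4: disable and stop in one go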
 
In my case - even after a forceful reboot (same as you: qm stop) the VM doesn't work normally. I don't have Docker running in that VM, only a Percona database; more likely it gets corrupted while the snapshot is being taken and can't recover by itself.
 

For me this issue only occurs with Docker running. But Docker is also pretty much the only thing that is running in my VMs.

Edit:

I downgraded to Docker 26.0.0 but got the same unresponsiveness. And since I did not have this issue back when I was on Docker 26.0.0, I agree with listhor here: I don't think Docker is the issue.
 
I have another Proxmox instance (different hardware) with a very similar result. The snapshot with RAM included was bigger than the RAM itself, and while the VM in question wasn't completely unresponsive (its main application is Docker), I had to force-stop it from Proxmox, as Docker's stop job was taking forever to execute. In this case (with Docker affected) a forced reboot was enough to restore normal functioning of the VM.

Anyway, manual snapshots with RAM are currently NOT to be used on Proxmox 8.2.4...
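Until this is fixed, a safer pattern, based on what did work in this thread (disk-only snapshots, and snapshots of a shut-down VM), would be something like this, assuming VMID 100:
Code:
qm snapshot 100 pre-upgrade                                       # disk-only snapshot, no RAM saved
qm shutdown 100 && qm snapshot 100 pre-upgrade && qm start 100    # or snapshot while shut down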
 
