Some VMs not working anymore after snapshot

Dunuin

Hi,

For around 2 or 3 days I have had problems with some of my VMs. Around that time I updated my Proxmox (from a version I had updated 2 weeks ago or so) and upgraded my pools to OpenZFS 2.0.

Every night at 5:00 AM pv4pve-snapshot creates a snapshot of all VMs (running as user "snapshot" and dumping RAM to my state storage). After that, some of my VMs aren't responding anymore but are still shown by Proxmox as running. All VMs are stored on one of my two ZFS pools, where zpool status tells me that everything is fine.
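In case it matters, this is roughly what the nightly job boils down to per VM, just an illustration using qm directly rather than the tool's actual command line (VMID and snapshot name are made up):

Code:
# Hypothetical equivalent of the nightly job for a single VM
# --vmstate 1 also saves the RAM, which ends up on the configured state storage
qm snapshot 103 autosnap_2021_03_21 --vmstate 1 --description "nightly snapshot"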
While these VMs aren't responding, I see massive writes in the Proxmox graphs:
prox1.png
Pic: Until 5 AM on March 20th everything is fine and the VM is writing at normal rates of around 400-500K. At 5 AM the snapshot kicks in and disk IO goes up, and it stayed that way until 6 PM, when I rebooted the server. After that everything was fine again and the VMs continued writing at their normal 400-500K. Between 5 AM and 6 PM the RAM usage and CPU utilization also dropped.

I see this behavior on most VMs. Take this Win10 VM, for example, which was idling with no logged-in users:
prox2.png
Pic: Writes started again at 5 AM with the snapshot and ended at 8 AM when I rebooted the server (constant 100 MB/s writes are really bad... normally this VM idles at around 10-100 KB/s). Looking at the log, that VM's snapshot task finished after 66 seconds without an error:
Code:
Mar 21 05:01:36 Hypervisor pvedaemon[6612]: <snapshot@pam> end task UPID:Hypervisor:000049FF:00398617:6056C4E2:qmsnapshot:103:snapshot@pam: OK
But I still see errors like these in the logs:
Code:
Mar 21 05:01:23 Hypervisor pvestatd[6599]: VM 103 qmp command failed - VM 103 qmp command 'query-proxmox-support' failed - unable to connect to VM 103 qmp socket - timeout after 31 retries
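For one of the hanging VMs (103 here), these are roughly the checks I run to see whether the QEMU process itself is still alive and only the QMP socket is stuck (paths are the standard ones on my hosts, adjust the VMID):

Code:
# Does the QMP layer still answer at all?
qm status 103 --verbose
# Is the KVM process still alive, and is it stuck in D-state (blocked on I/O)?
ps -p "$(cat /var/run/qemu-server/103.pid)" -o pid,stat,wchan:32,cmd
# The socket pvestatd/pvedaemon time out on:
ls -l /var/run/qemu-server/103.qmp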

And snapshotting of some VMs now gives errors like this...
Code:
TASK ERROR: VM 119 qmp command 'query-machines' failed - unable to connect to VM 119 qmp socket - timeout after 31 retries

TASK ERROR: VM 116 qmp command 'query-machines' failed - unable to connect to VM 116 qmp socket - timeout after 31 retries
...while other snapshots report "OK" but the VMs still aren't accessible afterwards...
Code:
saving VM state and RAM using storage 'VMpool8_VMSS'
4.01 MiB in 0s
completed saving the VM state in 1s, saved 400.93 MiB
snapshotting 'drive-scsi0' (VMpool7_VM:vm-113-disk-2)
snapshotting 'drive-scsi1' (VMpool8_VM:vm-113-disk-0)
TASK OK
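When a VM ends up in this state, I also check what the snapshot actually left behind on the ZFS side and whether the VM is still locked; a sketch using VMID 113 and the dataset names from my setup:

Code:
# What Proxmox thinks exists for this VM
qm listsnapshot 113
# What actually landed on the ZFS side (both VM pools)
zfs list -t snapshot -r VMpool7/VLT/VM VMpool8/VLT/VM | grep vm-113
# If a failed task left the VM locked
qm unlock 113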


Any idea why snapshotting is now causing problems? It was working fine the whole time before, and I didn't update any of the guests.

Edit:
Here is my PVE version:

Code:
pveversion -v
proxmox-ve: 6.3-1 (running kernel: 5.4.103-1-pve)
pve-manager: 6.3-6 (running version: 6.3-6/2184247e)
pve-kernel-5.4: 6.3-7
pve-kernel-helper: 6.3-7
pve-kernel-5.4.103-1-pve: 5.4.103-1
pve-kernel-5.4.101-1-pve: 5.4.101-1
pve-kernel-5.4.98-1-pve: 5.4.98-1
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.78-1-pve: 5.4.78-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.0-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve3
ksmtuned: 4.20150325+b1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.0.7
libproxmox-backup-qemu0: 1.0.3-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-5
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.1-1
libpve-storage-perl: 6.3-7
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.0.10-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-6
pve-cluster: 6.2-1
pve-container: 3.3-4
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.2-2
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.2.0-3
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-8
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.3-pve2

This is my storage.cfg:
Code:
cat /etc/pve/storage.cfg
dir: local
        path /var/lib/vz
        content snippets
        prune-backups keep-all=1
        shared 0

zfspool: VMpool7_VM
        pool VMpool7/VLT/VM
        blocksize 32k
        content rootdir,images
        mountpoint /VMpool7/VLT/VM
        sparse 1

zfspool: VMpool7_VM_NoSync
        pool VMpool7/VLT/VMNS
        content rootdir,images
        mountpoint /VMpool7/VLT/VMNS
        sparse 1

zfspool: VMpool8_VMSS
        pool VMpool8/VLT/VMSS
        blocksize 32k
        content images,rootdir
        mountpoint /VMpool8/VLT/VMSS
        sparse 1

zfspool: VMpool8_VM
        pool VMpool8/VLT/VM
        blocksize 32k
        content rootdir,images
        mountpoint /VMpool8/VLT/VM
        sparse 1
        
... + some additional CIFS shares for backups, ISOs and so on

The state storage for all VMs is "VMpool8_VMSS". This problem seems to affect all guest OSs: I see it with Win10, FreeBSD and Linux VMs, and it doesn't matter whether the VM is stored on "VMpool7_VM" or "VMpool8_VM".
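Since I also upgraded the pools to OpenZFS 2.0 right before this started, here is roughly how I double-check that the pools and the state storage dataset look sane (just a sketch using my pool names):

Code:
# Only prints something when a pool is unhealthy
zpool status -x
# Userland and kernel module version after the OpenZFS 2.0 upgrade
zfs version
# Space and the saved vmstate volumes on the state storage
zfs list -r -t all VMpool8/VLT/VMSS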

Edit:
Looking at today's syslog I can't see anything unusual except for a lot of "unable to connect to VM XXX qmp socket" messages that don't stop until I reboot the machine. Snapshotting of only 3 VMs failed with this error and the other VMs finished with "OK", but after the snapshot I also see a lot of these messages for the other VMs. Attached is the syslog between 5 AM, when the snapshotting started, and the reboot of the host:
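For reference, this is roughly how I pulled that excerpt out of the journal (the times are approximate and the grep pattern is just what I was interested in):

Code:
# Export the window between the snapshot run and the reboot, filtered for the relevant messages
journalctl --since "05:00" --until "08:15" -o short-precise | grep -Ei 'qmp|snapshot' > syslog_5am_to_reboot.txt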
 

Looks like downgrading QEMU fixes it (apt install pve-qemu-kvm=5.1.0-8 libproxmox-backup-qemu0=1.0.2-1). It should be related to this thread.
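To keep apt from pulling the broken version right back in on the next upgrade, holding the two packages until a fixed pve-qemu-kvm is released should work (just my approach, remember to remove the hold later):

Code:
# Keep "apt upgrade" from replacing the downgraded packages
apt-mark hold pve-qemu-kvm libproxmox-backup-qemu0
# Once a fixed pve-qemu-kvm is out:
# apt-mark unhold pve-qemu-kvm libproxmox-backup-qemu0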
 
