Hi,
For around 2 or 3 days now I have had problems with some of my VMs. Around that time I updated my Proxmox (from a version I had updated 2 weeks ago or so) and upgraded my pools to OpenZFS 2.0.
Every night at 5:00 AM, pv4pve-snapshot creates a snapshot of all VMs (running as user "snapshot" and dumping RAM to my state storage). After that, some of my VMs aren't responding anymore but are shown by Proxmox as running. All VMs are stored on one of my two ZFS pools, and zpool status tells me that everything is fine.
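For context, as far as I understand it the nightly job essentially does the following per VM (just a sketch; VM 103 and the snapshot name are only examples, the tool's exact options may differ):
Code:
# what I assume the 5 AM job runs per VM: a snapshot including RAM/vmstate
qm snapshot 103 autosnap_daily --vmstate 1 --description "nightly automatic snapshot"

# existing snapshots of a VM can be listed like this
qm listsnapshot 103

# pool health check afterwards still reports no issues
zpool status -x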
While these VMs aren't responding, I see massive writes when I look at the Proxmox graphs:
Pic: Until 5 AM on March 20th everything is fine and the VM is writing at normal rates of around 400-500 KB/s. At 5 AM the snapshot kicks in and disk IO goes up; it stayed like that until I saw it at 6 PM and rebooted the server. After that everything was fine again and the VMs continued writing at the normal 400-500 KB/s. Between 5 AM and 6 PM the RAM usage and CPU utilization also dropped.
I see this behavior on most VMs. This Win10 VM, for example, was idling with no logged-in users:
Pic: Writes started again at 5 AM with the snapshot and ended at 8 AM when I rebooted the server (constant 100 MB/s writes are really bad... normally this VM idles at around 10-100 KB/s). Looking at the log, that VM's snapshot task finished after 66 seconds without an error:
Mar 21 05:01:36 Hypervisor pvedaemon[6612]: <snapshot@pam> end task UPID:Hypervisor:000049FF:00398617:6056C4E2:qmsnapshot:103:snapshot@pam: OK
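In case the full task log helps: that end line contains the UPID, and the complete log should be retrievable from it. This is only how I would look it up (the layout under /var/log/pve/tasks is my assumption):
Code:
# locate the full task log for that snapshot task via its UPID
find /var/log/pve/tasks -name "*qmsnapshot:103:snapshot@pam*" -exec cat {} \;

# or ask the API for it (node name taken from the log line above)
pvesh get "/nodes/Hypervisor/tasks/UPID:Hypervisor:000049FF:00398617:6056C4E2:qmsnapshot:103:snapshot@pam:/log"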
But I still see errors like these in the logs:
Mar 21 05:01:23 Hypervisor pvestatd[6599]: VM 103 qmp command failed - VM 103 qmp command 'query-proxmox-support' failed - unable to connect to VM 103 qmp socket - timeout after 31 retries
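For what it's worth, this is how I check the QMP socket of an affected VM by hand (VM 103 only as an example; the socket path is the standard qemu-server location as far as I know):
Code:
# the socket that pvestatd/pvedaemon fail to reach
ls -l /var/run/qemu-server/103.qmp

# qm status also talks to the VM over QMP, so for an affected VM I expect this to time out as well
qm status 103 --verbose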
And snapshotting of some VMs now gives errors like this...
Code:
TASK ERROR: VM 119 qmp command 'query-machines' failed - unable to connect to VM 119 qmp socket - timeout after 31 retries
TASK ERROR: VM 116 qmp command 'query-machines' failed - unable to connect to VM 116 qmp socket - timeout after 31 retries
...while other snapshots report "OK" but the VMs are still not accessible afterwards...
Code:
saving VM state and RAM using storage 'VMpool8_VMSS'
4.01 MiB in 0s
completed saving the VM state in 1s, saved 400.93 MiB
snapshotting 'drive-scsi0' (VMpool7_VM:vm-113-disk-2)
snapshotting 'drive-scsi1' (VMpool8_VM:vm-113-disk-0)
TASK OK
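On the ZFS side the snapshots also look like they were created; this is roughly how I check it (dataset names from my storage.cfg below; the vm-113-state-* naming is how I understand Proxmox stores the saved RAM):
Code:
# ZFS snapshots of the VM disks
zfs list -t snapshot -r VMpool7/VLT/VM VMpool8/VLT/VM | grep vm-113

# the saved RAM should show up as a separate state volume on the state storage
zfs list -r VMpool8/VLT/VMSS | grep vm-113-state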
Any idea why snapshotting is now causing problems? It was working fine the whole time before, and I didn't update any of the guests.
Edit:
Here my PVE Version:
Code:
pveversion -v
proxmox-ve: 6.3-1 (running kernel: 5.4.103-1-pve)
pve-manager: 6.3-6 (running version: 6.3-6/2184247e)
pve-kernel-5.4: 6.3-7
pve-kernel-helper: 6.3-7
pve-kernel-5.4.103-1-pve: 5.4.103-1
pve-kernel-5.4.101-1-pve: 5.4.101-1
pve-kernel-5.4.98-1-pve: 5.4.98-1
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.78-1-pve: 5.4.78-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.0-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve3
ksmtuned: 4.20150325+b1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.0.7
libproxmox-backup-qemu0: 1.0.3-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-5
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.1-1
libpve-storage-perl: 6.3-7
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.0.10-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-6
pve-cluster: 6.2-1
pve-container: 3.3-4
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.2-2
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.2.0-3
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-8
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.3-pve2
This is my storage.cfg:
Code:
cat /etc/pve/storage.cfg
dir: local
path /var/lib/vz
content snippets
prune-backups keep-all=1
shared 0
zfspool: VMpool7_VM
pool VMpool7/VLT/VM
blocksize 32k
content rootdir,images
mountpoint /VMpool7/VLT/VM
sparse 1
zfspool: VMpool7_VM_NoSync
pool VMpool7/VLT/VMNS
content rootdir,images
mountpoint /VMpool7/VLT/VMNS
sparse 1
zfspool: VMpool8_VMSS
pool VMpool8/VLT/VMSS
blocksize 32k
content images,rootdir
mountpoint /VMpool8/VLT/VMSS
sparse 1
zfspool: VMpool8_VM
pool VMpool8/VLT/VM
blocksize 32k
content rootdir,images
mountpoint /VMpool8/VLT/VM
sparse 1
... + some additional CIFS shares for backups, ISOs and so on
The state storage for all VMs is "VMpool8_VMSS". The problem seems to affect all OSes: I got it with Win10, FreeBSD and Linux VMs, and it doesn't matter whether the VM is stored on "VMpool7_VM" or "VMpool8_VM".
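Since the problems started right after the OpenZFS 2.0 upgrade, these are the pool/dataset settings I would compare as a sanity check (nothing here is a known fix, just the properties I'd look at):
Code:
# dataset properties of the VM datasets and the state storage
zfs get sync,compression,recordsize VMpool7/VLT/VM VMpool8/VLT/VM VMpool8/VLT/VMSS

# volblocksize of the zvols (storage.cfg sets blocksize 32k)
zfs get -r volblocksize VMpool8/VLT/VM VMpool8/VLT/VMSS

# pool-level settings
zpool get ashift VMpool7 VMpool8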
Edit:
Looking at today's syslog, I can't see anything unusual except for a lot of "unable to connect to VM XXX qmp socket" messages that don't stop until I reboot the machine. Snapshotting of only 3 VMs failed with this error; the other VMs finished with "OK", but after the snapshot I also see a lot of these messages for the other VMs. Here is the syslog between 5 AM, when the snapshotting started, and when I rebooted the host:
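The attached excerpt corresponds to roughly this journal window (times are approximate, taken from the graphs above; the service filter just narrows it down to the QMP messages):
Code:
# syslog between the 5 AM snapshot run and the reboot of the host
journalctl --since "2021-03-21 05:00" --until "2021-03-21 08:15" > syslog-snapshot-window.txt

# only the qmp messages from pvestatd and pvedaemon
journalctl -u pvestatd -u pvedaemon --since "2021-03-21 05:00" | grep "qmp socket"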