Sudden progressive freezing of virtual machines

vaschthestampede

Active Member
Oct 21, 2020
Today, for the second time, I had this problem.

Suddenly some virtual machines start to become unreachable.
Gradually all the machines begin to have problems.
Full RAM usage is reported first, and shortly thereafter, they become unreachable.

The only way to fix it is to stop many VMs and restart them a few at a time.
Even after restarting Proxmox, the problem comes back.

Unfortunately, I have no other details on the problem.
I attach the syslog and the grub file.
From there you can see that the problem started at 16:41:48 (row 964) and no longer showed up after 17:09:19 (row 2092).

Tell me what other information you may need.
 


I think I was bitten by the same issue this morning. It affected 5 different VMs on a single node, while some other VMs were still working normally. The affected VMs could not respond on the network (the issue started during the night and, according to my monitoring system, the network problem seemed intermittent), and each corresponding qemu-kvm process was using 100% CPU. The only traces I found in the logs are:
Code:
Mar 15 06:52:11 pvo6 pvedaemon[31371]: VM 101 qmp command failed - VM 101 qmp command 'query-proxmox-support' failed - got timeout
Mar 15 06:52:19 pvo6 pvestatd[2359]: VM 128 qmp command failed - VM 128 qmp command 'query-proxmox-support' failed - unable to connect to VM 128 qmp socket - timeout after 31 retries
Mar 15 06:52:22 pvo6 pvestatd[2359]: VM 123 qmp command failed - VM 123 qmp command 'query-proxmox-support' failed - unable to connect to VM 123 qmp socket - timeout after 31 retries
Mar 15 06:52:25 pvo6 pvestatd[2359]: VM 160 qmp command failed - VM 160 qmp command 'query-proxmox-support' failed - unable to connect to VM 160 qmp socket - timeout after 31 retries
Mar 15 06:52:28 pvo6 pvestatd[2359]: VM 158 qmp command failed - VM 158 qmp command 'query-proxmox-support' failed - unable to connect to VM 158 qmp socket - timeout after 31 retries
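
To see at a glance which VMs are hit, the failure lines above can be grouped by VM ID. This is only a rough sketch: it assumes the log lines keep the "VM <id> qmp command failed" wording shown above, and the function name `count_qmp_failures` is made up for illustration.

```python
import re
from collections import Counter

# Matches the "VM <id> qmp command failed" prefix seen in the log excerpt.
QMP_FAILURE = re.compile(r"VM (\d+) qmp command failed")


def count_qmp_failures(lines):
    """Return a Counter mapping VM ID -> number of qmp failure lines."""
    counts = Counter()
    for line in lines:
        match = QMP_FAILURE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Two lines copied from the excerpt above, used as sample input.
sample = [
    "Mar 15 06:52:11 pvo6 pvedaemon[31371]: VM 101 qmp command failed - "
    "VM 101 qmp command 'query-proxmox-support' failed - got timeout",
    "Mar 15 06:52:19 pvo6 pvestatd[2359]: VM 128 qmp command failed - "
    "VM 128 qmp command 'query-proxmox-support' failed - unable to connect "
    "to VM 128 qmp socket - timeout after 31 retries",
]

for vmid, hits in sorted(count_qmp_failures(sample).items()):
    print(f"VM {vmid}: {hits} qmp failure(s)")
```

On a real node the same function could be fed the lines of the syslog file (for example by opening `/var/log/syslog`, assuming that is where the messages end up on your setup) instead of the two sample lines.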

The consoles of the affected VMs weren't reachable either (timeout), and the VMs did not respond to a clean shutdown signal. I had to stop them. Some of them went into the same state after being restarted, so I had to reboot the whole node. Since then, no issue.

It might be some regression in qemu-kvm 5.2.

It's a fully updated PVE 6.3 (using the no-subscription repo):
Code:
root@pvo6:~# pveversion -v
proxmox-ve: 6.3-1 (running kernel: 5.4.103-1-pve)
pve-manager: 6.3-6 (running version: 6.3-6/2184247e)
pve-kernel-5.4: 6.3-7
pve-kernel-helper: 6.3-7
pve-kernel-5.4.103-1-pve: 5.4.103-1
pve-kernel-5.4.73-1-pve: 5.4.73-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.0-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.0.7
libproxmox-backup-qemu0: 1.0.3-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-5
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.1-1
libpve-network-perl: 0.4-6
libpve-storage-perl: 6.3-7
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
openvswitch-switch: 2.12.3-1
proxmox-backup-client: 1.0.10-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-6
pve-cluster: 6.2-1
pve-container: 3.3-4
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.2-2
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.2.0-3
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-8
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.3-pve2
root@pvo6:~#
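
When comparing nodes to check the suspected qemu-kvm regression, it may help to pull single package versions out of the `pveversion -v` output. Below is a minimal sketch, assuming the "name: version" layout shown above; `parse_pveversion` is a hypothetical helper, not part of any Proxmox tool.

```python
def parse_pveversion(text):
    """Parse 'name: version' lines into a dict, skipping anything else."""
    versions = {}
    for line in text.splitlines():
        name, sep, version = line.strip().partition(": ")
        # Skip shell prompts and lines without a 'name: version' shape.
        if sep and " " not in name:
            versions[name] = version
    return versions


# A few lines copied from the output above, used as sample input.
sample_output = """\
root@pvo6:~# pveversion -v
proxmox-ve: 6.3-1 (running kernel: 5.4.103-1-pve)
pve-qemu-kvm: 5.2.0-3
qemu-server: 6.3-8
"""

versions = parse_pveversion(sample_output)
print("pve-qemu-kvm:", versions["pve-qemu-kvm"])
```

Running the same parser on the output from an unaffected node would make it easy to spot whether the `pve-qemu-kvm` version differs between the two.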