All VMs locking up after latest PVE update

jro

Member
Mar 6, 2021
EDIT 2: An updated version of pve-qemu-kvm is now available in the pve-no-subscription repository and should resolve this issue (version 5.2.0-4): https://forum.proxmox.com/threads/all-vms-locking-up-after-latest-pve-update.85397/post-379077
https://forum.proxmox.com/threads/all-vms-locking-up-after-latest-pve-update.85397/post-379373
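
To pick up the fixed build (a sketch, assuming the pve-no-subscription repository is already configured in your APT sources):

Code:
apt update
apt install pve-qemu-kvm=5.2.0-4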


EDIT: A temporary workaround for this issue is to roll back pve-qemu-kvm and libproxmox-backup-qemu0:

Code:
apt install pve-qemu-kvm=5.1.0-8 libproxmox-backup-qemu0=1.0.2-1

Reboot after you roll back.
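
If apt would otherwise pull the broken versions back in on the next upgrade, you can hold the rolled-back packages until the fix lands (a sketch using the standard apt-mark tooling):

Code:
apt-mark hold pve-qemu-kvm libproxmox-backup-qemu0
# once the fixed build is available, release the hold:
apt-mark unhold pve-qemu-kvm libproxmox-backup-qemu0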


I have about a dozen VMs running on PVE 6.3. I upgraded from (I believe) 6.3-2 to -4 two days ago, and since then, all of my VMs have started locking up after running for 2-3 minutes. If I force stop them, I can reboot them, and I don't see anything suspicious in the guest system logs, but they lock up again after a few minutes. In the PVE syslog, the only message I get for each VM as it locks up is:

Code:
Mar 05 21:35:07 florence pvestatd[5677]: VM ### qmp command failed - VM ### qmp command 'query-proxmox-support' failed - unable to connect to VM ### qmp socket - timeout after 31 retries
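
Querying a hung VM through qm runs into the same timeout, which confirms its QMP socket has stopped answering (a sketch; VMID 100 is hypothetical, and qm talks to the socket under /var/run/qemu-server/):

Code:
qm status 100 --verbose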

Any suggestions? I'd like to roll back, but I'm not sure exactly which packages to roll back or the best way to do that.
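
One way to see exactly which packages the recent upgrade touched is the APT history log (a sketch; the path is the Debian default):

Code:
grep -E '^(Start-Date|Upgrade):' /var/log/apt/history.log | tail -n 10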

Code:
root@florence:~# pveversion -v
proxmox-ve: 6.3-1 (running kernel: 5.4.101-1-pve)
pve-manager: 6.3-4 (running version: 6.3-4/0a38c56f)
pve-kernel-5.4: 6.3-6
pve-kernel-helper: 6.3-6
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.101-1-pve: 5.4.101-1
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.10-1-pve: 5.3.10-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.0-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.0.7
libproxmox-backup-qemu0: 1.0.3-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-4
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.1-1
libpve-storage-perl: 6.3-7
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.0.8-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-5
pve-cluster: 6.2-1
pve-container: 3.3-4
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.2-2
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.2.0-2
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-5
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.3-pve2
 
Confirmed. Same issue with random VM freezes and the qmp command error after updating to 6.3-4.
 
I also have similar issues. There are some serious stability problems with the recent updates; you have to either reinstall from the ISO or roll back every package to get a stable system. The problem packages are, I believe, the updated kernel and QEMU. Also look at your system services and check that everything is running, for example with the commands below.
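
A quick sketch (pvedaemon, pveproxy, and pvestatd are the core PVE services):

Code:
systemctl --failed
systemctl status pvedaemon pveproxy pvestatd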
 
The problem is on servers that run Windows VMs.

This may fix the problem:
- Change repos (to the testing / current branch)
- Update (the 4 packages, see https://git.proxmox.com/?o=age ); they contain Windows fixes!
 
Hey,

what storage is in use for those VMs (ZFS, Ceph, ...?), and further, what OS runs inside the VMs?

Is there anything suspicious in either the Proxmox VE host kernel log (dmesg or journalctl) and/or the VMs' console (Linux VMs may have written something to the tty before hanging)?
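
For example (a sketch; -b limits journalctl to the current boot, -k to kernel messages, -p err to error priority and above):

Bash:
dmesg -T | tail -n 50
journalctl -b -k -p err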
 
Oh, further: what hardware is in use, CPU- and motherboard-wise? The output of the following two commands, executed as root on the PVE host, would also be nice:
Bash:
lscpu
dmidecode -t bios
 
proxmox-ve: 6.3-1 (running kernel: 5.4.101-1-pve)
As you seem to be able to reproduce this relatively easily, can you try to reboot into an older kernel? You can choose to do so quite early on startup. That could help to pin it down; currently I suspect either a kernel or QEMU regression, and for problems specific to certain platforms I'd try the kernel first.
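
To see which older kernels are still installed (a sketch; you can then pick one from GRUB's "Advanced options" submenu during startup):

Bash:
dpkg -l 'pve-kernel-*' | grep '^ii'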
 
The problem is on servers that run Windows VMs.

This may fix the problem:
- Change repos (to the testing / current branch)
- Update (the 4 packages, see https://git.proxmox.com/?o=age ); they contain Windows fixes!
Actually, I do not think that this hang is related to those fixes; they were fixes for a change in the virtual hardware layout which confused Windows, but that alone did not lead to freezes of any kind, AFAICT.
 
Hello there,

we are using ZFS for all VMs.

Windows 10 Pro, Windows Server 2019, Debian, Ubuntu, CentOS.

lscpu
Code:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       46 bits physical, 48 bits virtual
CPU(s):              40
On-line CPU(s) list: 0-39
Thread(s) per core:  2
Core(s) per socket:  10
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               62
Model name:          Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
Stepping:            4
CPU MHz:             2589.437
CPU max MHz:         3000.0000
CPU min MHz:         1200.0000
BogoMIPS:            4399.91
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            25600K
NUMA node0 CPU(s):   0-9,20-29
NUMA node1 CPU(s):   10-19,30-39
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault epb pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts md_clear flush_l1d

dmidecode -t bios
Code:
# dmidecode 3.2
Getting SMBIOS data from sysfs.
SMBIOS 2.7 present.

Handle 0x0000, DMI type 0, 24 bytes
BIOS Information
    Vendor: American Megatrends Inc.
    Version: 3.2
    Release Date: 03/04/2015
    Address: 0xF0000
    Runtime Size: 64 kB
    ROM Size: 12288 kB
    Characteristics:
        PCI is supported
        BIOS is upgradeable
        BIOS shadowing is allowed
        Boot from CD is supported
        Selectable boot is supported
        BIOS ROM is socketed
        EDD is supported
        Print screen service is supported (int 5h)
        8042 keyboard services are supported (int 9h)
        Serial services are supported (int 14h)
        Printer services are supported (int 17h)
        ACPI is supported
        USB legacy is supported
        BIOS boot specification is supported
        Function key-initiated network boot is supported
        Targeted content distribution is supported
        UEFI is supported
    BIOS Revision: 4.6

Handle 0x00AA, DMI type 13, 22 bytes
BIOS Language Information
    Language Description Format: Long
    Installable Languages: 1
        en|US|iso8859-1
    Currently Installed Language: en|US|iso8859-1
 
Thanks for sharing your information!

Model name: Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
Hmm, a bit old (~8 years) but not yet so old that it's really rare…

Do you have the latest available firmware and CPU µcode package installed?
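
For example, to check the currently loaded µcode revision and install the Intel package (a sketch, assuming an Intel CPU and the Debian non-free repository enabled):

Bash:
grep -m1 microcode /proc/cpuinfo
apt install intel-microcode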

Also, if you regularly run into problems here, can you try to boot an older kernel and see if that helps?
E.g., pve-kernel-5.4.78-2-pve or pve-kernel-5.4.86-1-pve would be candidates. Note that this may not work if you already upgraded your ZFS pools to the feature set of ZFS 2.0.
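
A quick way to check: zpool upgrade without arguments lists pools that do not yet have all supported features enabled, and you can also query a 2.0-only feature directly (a sketch, assuming a pool named rpool):

Bash:
zpool upgrade
zpool get feature@zstd_compress rpool    # "disabled" means the pool has not been upgraded to the 2.0 feature set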
 
Thanks for your reply. We have already upgraded our ZFS pools.
 
We are not using Ceph, and the Linux VMs use ext4 internally. Proxmox Backup is enabled, but the freezes didn't occur during or shortly after the backups.
 
After I posted this, I ran memcheck overnight (no errors) and booted back into Proxmox. At that point, all the VMs ran fine for ~36 hours. I just now had to restart a few VMs because I took storage offline for maintenance. When I attempted to boot the VMs again, the issue returned.

I'm using ZFS for all my Proxmox storage, all VMs are using ext4 internally. My ZFS pools have not been upgraded.

Booted kernel pve-kernel-5.4.78-2-pve, still having the same issue.

I noticed that if I do a full shutdown/power off of the proxmox host and boot up cold, the VMs don't lock up until I reboot one of them.
 
