All VMs locking up after latest PVE update

jro · Mar 6, 2021

EDIT 2: An updated version of pve-qemu-kvm is now available in the pve-no-subscription repository and should resolve this issue (version 5.2.0-4): https://forum.proxmox.com/threads/all-vms-locking-up-after-latest-pve-update.85397/post-379077
https://forum.proxmox.com/threads/all-vms-locking-up-after-latest-pve-update.85397/post-379373

EDIT: A temporary workaround for this issue is to roll back pve-qemu-kvm and libproxmox-backup-qemu0:

apt install pve-qemu-kvm=5.1.0-8 libproxmox-backup-qemu0=1.0.2-1

Reboot after you roll back.
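
To confirm the rollback took effect after the reboot, the versions can be read back from `pveversion -v`. The following is only a minimal sketch: the helper names `pkg_version` and `check_rollback` are made up for illustration, and the target versions are simply the ones from the apt command above.

```shell
# Print the version of one package from `pveversion -v` style output on stdin.
# (Hypothetical helper; it relies on the "name: version" layout shown in this thread.)
pkg_version() {
    awk -v pkg="$1" -F': ' '$1 == pkg { print $2; exit }'
}

# Report whether both packages are at the rolled-back versions from the workaround.
check_rollback() {
    input=$(cat)
    qemu=$(printf '%s\n' "$input" | pkg_version pve-qemu-kvm)
    lib=$(printf '%s\n' "$input" | pkg_version libproxmox-backup-qemu0)
    if [ "$qemu" = "5.1.0-8" ] && [ "$lib" = "1.0.2-1" ]; then
        echo "rolled back"
    else
        echo "not rolled back (pve-qemu-kvm=$qemu libproxmox-backup-qemu0=$lib)"
    fi
}
```

On the host this would be used as `pveversion -v | check_rollback`. If apt should not pull the newer builds back in before the fixed version is installed, `apt-mark hold pve-qemu-kvm libproxmox-backup-qemu0` is one option (undo it later with `apt-mark unhold`).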

I have about a dozen VMs running on PVE 6.3. I upgraded from (I believe) 6.3-2 to 6.3-4 two days ago, and since then, all of my VMs have started locking up after running for 2-3 minutes. If I force stop them, I can boot them again and I don't see anything suspicious in the guest system logs, but they lock up again after a few minutes. In the PVE syslog, the only messages I get are

Mar 05 21:35:07 florence pvestatd[5677]: VM ### qmp command failed - VM ### qmp command 'query-proxmox-support' failed - unable to connect to VM ### qmp socket - timeout after 31 retries

for each VM as it locks up.
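
To see which VMs are affected and how often, the timeout messages can be counted per VM ID. This is a rough sketch that assumes the exact message format shown above and numeric VM IDs; the function name `count_qmp_timeouts` is made up for illustration.

```shell
# Count pvestatd "query-proxmox-support" QMP failures per VM ID.
# Reads log lines on stdin and prints "<vmid> <count>" per VM, sorted by VM ID.
count_qmp_timeouts() {
    grep -F "qmp command 'query-proxmox-support' failed" \
        | sed -n 's/.*pvestatd\[[0-9]*]: VM \([0-9][0-9]*\) qmp command failed.*/\1/p' \
        | sort -n | uniq -c | awk '{ print $2 " " $1 }'
}
```

For example: `journalctl -u pvestatd --since "2 days ago" | count_qmp_timeouts`.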

Any suggestions? I'd like to possibly roll back but I'm not sure exactly which packages I should roll back and the best way to do that.

Code:

root@florence:~# pveversion -v
proxmox-ve: 6.3-1 (running kernel: 5.4.101-1-pve)
pve-manager: 6.3-4 (running version: 6.3-4/0a38c56f)
pve-kernel-5.4: 6.3-6
pve-kernel-helper: 6.3-6
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.101-1-pve: 5.4.101-1
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.10-1-pve: 5.3.10-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.0-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.0.7
libproxmox-backup-qemu0: 1.0.3-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-4
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.1-1
libpve-storage-perl: 6.3-7
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.0.8-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-5
pve-cluster: 6.2-1
pve-container: 3.3-4
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.2-2
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.2.0-2
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-5
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.3-pve2

toplus · Mar 6, 2021

Confirmed. Same issue here: VMs freeze at random with the qmp command error after updating to 6.3-4.

danielandastro · Mar 6, 2021

Same issue here too

strix · Mar 6, 2021

We have exactly the same issue!!

peteb · Mar 6, 2021

I also have similar issues. There are some serious stability issues with the recent updates. You have to either reinstall from the ISO or roll back every package to get a stable system. The potential problem packages are the updated kernel and QEMU I believe. Also look at your system services and see if everything is running.

strix · Mar 7, 2021

The problem occurs on servers that run Windows VMs.

This may fix the problem:
-Change repos (to testing / current branch)
-Update (4 packages), see https://git.proxmox.com/?o=age; they include Windows fixes!

dgsqrs · Mar 7, 2021

Same problem here! (But I have no Windows VMs.)

t.lamprecht · Mar 7, 2021

Hey,

What storage is in use for those VMs (ZFS, Ceph, ...)? Also, what OS runs inside the VMs?

Is there anything suspicious in either the Proxmox VE host kernel log (dmesg or journalctl) and/or the VMs console (Linux VMs may have written something to the tty before hanging)?
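
As a starting point for that check, the kernel log can be filtered for a few common trouble markers. This is only a sketch: the keyword list is an assumption about what counts as suspicious, not an exhaustive or official set, and the function name `filter_suspicious` is made up for illustration.

```shell
# Keep only kernel log lines that contain common trouble markers.
# Reads log text on stdin; the keyword list is an assumed, non-exhaustive selection.
filter_suspicious() {
    grep -E -i 'blocked for more than|call trace|oops|bug:|soft lockup|out of memory'
}
```

For example: `journalctl -k -b | filter_suspicious` for the current boot, or `dmesg -T | filter_suspicious`.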

t.lamprecht · Mar 7, 2021

Oh, also: what hardware is in use, CPU- and motherboard-wise? The output of the following two commands, executed as root on the PVE host, would also be nice:

Bash:

lscpu
dmidecode -t bios

t.lamprecht · Mar 7, 2021

jro said:
proxmox-ve: 6.3-1 (running kernel: 5.4.101-1-pve)

As you seem to be able to reproduce this relatively easily, can you try rebooting into an older kernel? You can select one quite early during startup. That could help to pin it down. Currently I suspect either a kernel or a QEMU regression, and for platform-specific problems I'd try the kernel first.

t.lamprecht · Mar 7, 2021

strix said:
The problem occurs on servers that run Windows VMs.

This may fix the problem:
-Change repos (to testing / current branch)
-Update (4 packages), see https://git.proxmox.com/?o=age; they include Windows fixes!

Actually, I do not think that this hang is related to those fixes. They addressed a change in the virtual hardware layout that confused Windows, but that change alone did not lead to freezes of any kind, AFAICT.

strix · Mar 7, 2021

Hello there,

We are using ZFS for all VMs.

Guest OSes: Windows 10 Pro, Windows Server 2019, Debian, Ubuntu, CentOS.

lscpu

Code:

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       46 bits physical, 48 bits virtual
CPU(s):              40
On-line CPU(s) list: 0-39
Thread(s) per core:  2
Core(s) per socket:  10
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               62
Model name:          Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
Stepping:            4
CPU MHz:             2589.437
CPU max MHz:         3000.0000
CPU min MHz:         1200.0000
BogoMIPS:            4399.91
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            25600K
NUMA node0 CPU(s):   0-9,20-29
NUMA node1 CPU(s):   10-19,30-39
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault epb pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts md_clear flush_l1d

dmidecode -t bios

Code:

# dmidecode 3.2
Getting SMBIOS data from sysfs.
SMBIOS 2.7 present.

Handle 0x0000, DMI type 0, 24 bytes
BIOS Information
    Vendor: American Megatrends Inc.
    Version: 3.2
    Release Date: 03/04/2015
    Address: 0xF0000
    Runtime Size: 64 kB
    ROM Size: 12288 kB
    Characteristics:
        PCI is supported
        BIOS is upgradeable
        BIOS shadowing is allowed
        Boot from CD is supported
        Selectable boot is supported
        BIOS ROM is socketed
        EDD is supported
        Print screen service is supported (int 5h)
        8042 keyboard services are supported (int 9h)
        Serial services are supported (int 14h)
        Printer services are supported (int 17h)
        ACPI is supported
        USB legacy is supported
        BIOS boot specification is supported
        Function key-initiated network boot is supported
        Targeted content distribution is supported
        UEFI is supported
    BIOS Revision: 4.6

Handle 0x00AA, DMI type 13, 22 bytes
BIOS Language Information
    Language Description Format: Long
    Installable Languages: 1
        en|US|iso8859-1
    Currently Installed Language: en|US|iso8859-1

t.lamprecht · Mar 7, 2021

Thanks for sharing your information!

strix said:
Model name: Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz

Hmm, a bit old (~8 years), but not yet so old that it's really rare…

Do you have the latest available firmware and CPU µcode package installed?

Also, if you regularly run into problems here, can you try booting an older kernel and see if that helps?
E.g., pve-kernel-5.4.78-2-pve or pve-kernel-5.4.86-1-pve would be candidates. Note that this may not work if you have already upgraded your ZFS pools to the feature set of ZFS 2.0.
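
To see which older kernels are actually installed and available to boot, the concrete kernel packages can be pulled out of `pveversion -v`. A minimal sketch, assuming the `pve-kernel-<version>-pve` naming seen in this thread; the function name `list_installed_kernels` is made up for illustration.

```shell
# List concrete kernel packages (e.g. pve-kernel-5.4.78-2-pve) from
# `pveversion -v` style output on stdin, skipping meta packages such as
# pve-kernel-5.4 and pve-kernel-helper.
list_installed_kernels() {
    sed -n 's/^\(pve-kernel-[0-9][0-9.]*-[0-9][0-9]*-pve\): .*/\1/p'
}
```

On the host: `pveversion -v | list_installed_kernels`. For the ZFS caveat, running `zpool upgrade` without arguments should list pools that do not yet have all supported features enabled; a pool that shows up there has not been fully upgraded.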

strix · Mar 7, 2021

Thanks for your reply. We have already upgraded our ZFS pools.

spirit · Mar 7, 2021

Hi, if it helps: I can't reproduce this here on Xeon v3 with Ceph RBD and Debian 9 and 10 guests.

Also, I don't use Proxmox Backup.

spirit · Mar 8, 2021

@Thomas Lamprecht

another report here:

https://forum.proxmox.com/threads/freez-issue-with-latest-proxmox-6-3-4-and-amd-cpu.85348/

dgsqrs · Mar 8, 2021

We are not using Ceph, and the Linux VMs use ext4. Proxmox Backup is enabled, but the freezes didn't occur during or shortly after the backups.

jro · Mar 8, 2021

After I posted this, I ran memcheck overnight (no errors) and booted back into Proxmox. At that point, all the VMs ran fine for ~36 hours. I just now had to restart a few VMs because I took storage offline for maintenance. When I attempted to boot the VMs again, the issue returned.

I'm using ZFS for all my Proxmox storage, all VMs are using ext4 internally. My ZFS pools have not been upgraded.

Booted kernel pve-kernel-5.4.78-2-pve, still having the same issue.

I noticed that if I do a full shutdown/power off of the Proxmox host and boot up cold, the VMs don't lock up until I reboot one of them.

jcesclapez · Mar 8, 2021

Is this the same issue?

https://forum.proxmox.com/threads/problems-with-zfs-after-major-proxmox-update.85171/

After upgrading all nodes, only one freezes and needs a forced physical reboot.

jro · Mar 8, 2021

jcesclapez said:
Is this the same issue?

https://forum.proxmox.com/threads/problems-with-zfs-after-major-proxmox-update.85171/

After upgrading all nodes, only one freezes and needs a forced physical reboot.

It looks like you've got full nodes freezing. I'm seeing individual VMs freeze. Note that the fix from the thread linked above seems to have fixed it for me (downgrading pve-manager to 6.3-3):

Code:

apt install pve-manager=6.3-3
reboot
