Windows VMs bluescreening with Proxmox 6.1

Robert Dahlem

I upgraded from 6.0 to 6.1 and now my Windows VMs are bluescreening all over the place with several different messages like:

KERNEL SECURITY CHECK FAILURE
IRQL NOT LESS OR EQUAL
KMODE EXCEPTION NOT HANDLED
CRITICAL PROCESS DIED
SYSTEM SERVICE EXCEPTION
UNEXPECTED STORE EXCEPTION

The CPU of the host is an AMD Ryzen 5 2400G (latest vendor BIOS update from 11/2019 applied). I do not have this symptom on an older Host with an Intel i3-4170.

When I install pve-kernel-5.0.21-5-pve and boot that, the symptom is gone.

I would like to downgrade to Proxmox 6.0, but "# apt-get install proxmox-ve=6.0-2" only downgrades the proxmox-ve package itself; everything else stays at 6.1. How do I consistently downgrade? Is there a recommended way to install 6.0 with correct dependencies?

Kind regards,
Robert

Code:
# qm config 100
agent: 1
boot: cdn
bootdisk: scsi0
cores: 4
cpu: host
memory: 8192
name: LV2019
net0: virtio=CE:1F:08:96:BA:37,bridge=vmbr1,firewall=1
numa: 0
ostype: win10
sata0: ISO:iso/Win10_1903_german_x64-May-2019-Update.iso,media=cdrom,size=3898816K
sata1: ISO:iso/virtio-win-0.1.171.iso,media=cdrom,size=363020K
scsi0: VMs:100/vm-100-disk-0.qcow2,discard=on,size=163529024K
scsihw: virtio-scsi-pci
smbios1: uuid=[...]
sockets: 1
vmgenid: [...]

# cat /proc/cpuinfo | egrep "model name|microcode" | head -2
model name      : AMD Ryzen 5 2400G with Radeon Vega Graphics
microcode       : 0x8101016

# pveversion -v
proxmox-ve: 6.1-2 (running kernel: 5.3.13-1-pve)
pve-manager: 6.1-3 (running version: 6.1-3/37248ce6)
pve-kernel-5.3: 6.1-1
pve-kernel-helper: 6.1-1
pve-kernel-5.3.13-1-pve: 5.3.13-1
pve-kernel-5.0.21-5-pve: 5.0.21-10
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.2-pve4
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
libjs-extjs: 6.0.1-10
libknet1: 1.13-pve1
libpve-access-control: 6.0-5
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-9
libpve-guest-common-perl: 3.0-3
libpve-http-server-perl: 3.0-3
libpve-storage-perl: 6.1-2
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve3
lxc-pve: 3.2.1-1
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-1
pve-cluster: 6.1-2
pve-container: 3.0-14
pve-docs: 6.1-3
pve-edk2-firmware: 2.20191127-1
pve-firewall: 4.0-9
pve-firmware: 3.0-4
pve-ha-manager: 3.0-8
pve-i18n: 2.0-3
pve-qemu-kvm: 4.1.1-2
pve-xtermjs: 3.13.2-1
qemu-server: 6.1-3
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.2-pve2
 
Hm, that strongly sounds like a kernel bug. Could you try with different versions of our kernel?

E.g. run 'apt search pve-kernel-5', install a few 5.3 and 5.0 variants and boot from them?
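Before installing more variants, it can help to see which kernel builds are already on the host. A minimal sketch: the function name list_installed_pve_kernels is made up for illustration, and it assumes the "pve-kernel-<version>-pve: <package version>" line format seen in the 'pveversion -v' output posted earlier in this thread.

```shell
#!/bin/sh
# Hypothetical helper: reads `pveversion -v` output on stdin and prints
# the version of every installed kernel build, one per line.
# Meta packages such as pve-kernel-5.3 or pve-kernel-helper are skipped
# because their names do not end in "-<n>-pve".
list_installed_pve_kernels() {
    sed -n 's/^pve-kernel-\([0-9][0-9.]*-[0-9][0-9]*-pve\): .*/\1/p'
}

# Usage on a PVE host (not executed here):
#   pveversion -v | list_installed_pve_kernels
#   apt install pve-kernel-5.0.21-5-pve   # package name as used in the first post
```

Each version printed this way corresponds to one boot entry that can be selected in GRUB.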

Our tests on other second gen Ryzen chips did not show any bluescreens with Windows VMs... Is there anything in 'dmesg' after such bluescreens? Anything specific that triggers them?

How do I consistently downgrade? Is there a recommended way to install 6.0 with correct dependencies?

No, downgrading point releases is not supported. But since only the kernel package is required to be downgraded, I'd suggest simply installing the pve-kernel-5.0 package and changing the default boot entry in grub for now.
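Changing the default boot entry means pointing GRUB_DEFAULT in /etc/default/grub at the "submenu>entry" title of the older kernel and then running update-grub. A minimal sketch of how that title could be looked up, assuming the host boots via GRUB and that the generated config lives at /boot/grub/grub.cfg; the function name grub_default_for_kernel is hypothetical, and the exact menu titles differ between installations, which is why they are read from the file rather than hard-coded.

```shell
#!/bin/sh
# Hypothetical helper: print the "submenu title>entry title" string for a
# given kernel version, suitable as a GRUB_DEFAULT value.
#   $1 = kernel version, e.g. 5.0.21-5-pve
#   $2 = path to the generated grub.cfg
# Returns non-zero if no matching (non-recovery) entry is found.
grub_default_for_kernel() {
    awk -v ver="$1" -F"'" '
        # Remember the title of the enclosing submenu.
        /^submenu / { submenu = $2 }
        # Indented menuentry lines are the per-kernel entries.
        /^[ \t]+menuentry / {
            suffix = "with Linux " ver
            # Exact suffix match, so "(recovery mode)" entries are skipped.
            if (substr($2, length($2) - length(suffix) + 1) == suffix) {
                print submenu ">" $2
                found = 1
                exit
            }
        }
        END { exit !found }
    ' "$2"
}

# Usage on a PVE host (not executed here):
#   grub_default_for_kernel 5.0.21-5-pve /boot/grub/grub.cfg
#   # put the printed string into GRUB_DEFAULT="..." in /etc/default/grub,
#   # then run: update-grub
```

Because this only edits configuration and needs no console access, it should also work over SSH on a headless machine.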
 
Hm, that strongly sounds like a kernel bug. Could you try with different versions of our kernel?
Same symptom with 5.3.1-1-pve, which seems to be the first kernel after 5.0.21-5-pve. So probably something with the 5.3 series.
Our tests on other second gen Ryzen chips did not show any bluescreens with Windows VMs... Is there anything in 'dmesg' after such bluescreens?
See attached dmesg.log (from 5.3.13-1-pve). This seems to happen one time only, after I boot the VM for the first time. Can't reproduce it without rebooting the Proxmox host.
Anything specific that triggers them?
Apart from "booting Windows"? No. :)
No, downgrading point releases is not supported.
Ok, let's rephrase the question: say someone would have a serious and reproducible problem with 6.1, like Windows VMs running into blue screens only seconds after boot. Someone who would even be willing to build the Proxmox host from scratch just to get the VMs running. Is there really no way to get a complete Proxmox 6.0 from anywhere? A paid option maybe?
 

Same symptom with 5.3.1-1-pve, which seems to be the first kernel after 5.0.21-5-pve. So probably something with the 5.3 series.

Ok, let's rephrase the question: say someone would have a serious and reproducible problem with 6.1, like Windows VMs running into blue screens only seconds after boot. Someone who would even be willing to build the Proxmox host from scratch just to get the VMs running. Is there really no way to get a complete Proxmox 6.0 from anywhere? A paid option maybe?
6.1 was just pushed on top of our regular 6.0 repositories, as it's mainly fixes, and fully compatible with 6.0 - there's practically no reason not to upgrade. You could of course install from our 6.0 ISO, and then just never upgrade, but that is ill-advised.

In your case, it seems your system is running fine with the 5.0-series kernel? Why not just use that? 6.1 is compatible with the 5.0 kernel, so you get the best of both worlds: A stable system and the most recent fixes and features of PVE.

See attached dmesg.log (from 5.3.13-1-pve). This seems to happen one time only, after I boot the VM for the first time. Can't reproduce it without rebooting the Proxmox host.

I'd be interested in finding the cause of your issues though. I'm entirely unable to reproduce them here, but analyzing the log you posted, that is a warning related to FPU states - something that we had some troubles with previously.
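When going through a long kernel log for this kind of warning, a simple filter can help. A minimal sketch: the function name kernel_trouble is made up, and the patterns are just common kernel log markers ("WARNING:", "BUG:", "Call Trace:", "general protection"), not a complete list and not specific to this bug.

```shell
#!/bin/sh
# Hypothetical helper: reads a kernel log on stdin and prints only the
# lines that usually mark a warning, bug report or stack trace header.
kernel_trouble() {
    grep -E 'WARNING:|BUG:|Call Trace:|general protection'
}

# Usage on the host (not executed here):
#   dmesg | kernel_trouble
#   kernel_trouble < dmesg.log
```

The lines following a "Call Trace:" header are the interesting part, so the filter is only meant to locate the spot in the full log.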

There's currently an open bug, where the fix is not yet in our kernels: https://bugzilla.kernel.org/show_bug.cgi?id=205663 though I'm entirely unsure if this is related.

Are you using ZFS by any chance? Does the 'cpu: host' setting matter? Any issues with Linux VMs?
 
In your case, it seems your system is running fine with the 5.0-series kernel? Why not just use that? 6.1 is compatible with the 5.0 kernel, so you get the best of both worlds: A stable system and the most recent fixes and features of PVE.
I will do that for the moment.

Are you using ZFS by any chance? Does the 'cpu: host' setting matter? Any issues with Linux VMs?
I do not use ZFS.

The CPU is set to "kvm64"; the symptom is the same with "host". Do you want me to try anything else?

I tried to install Debian in a VM. Inside the VM I got "traps: expr[5186] general protection ip:7f35232519bd sp:7ffebee96de0 error:0 in libc.so.6[7f35231ef000+148000]", and there was another stack trace on the host.

Also there is another stack trace at host boot unless I configure GRUB_GFXPAYLOAD_LINUX=keep in /etc/default/grub which refers to drivers/gpu/drm/amd/amdgpu. Something is seriously wrong with kernel support for this CPU.
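For reference, a setting like this can be added to /etc/default/grub without hand-editing. A minimal sketch: set_grub_var is a hypothetical helper name, it writes the value in double quotes, and it only handles simple values without backslashes or embedded double quotes. On a real host the change takes effect after running update-grub and rebooting.

```shell
#!/bin/sh
# Hypothetical helper: set KEY="VALUE" in a grub defaults file.
#   $1 = file, $2 = variable name, $3 = value
# An existing uncommented assignment is replaced in place; otherwise the
# assignment is appended at the end of the file.
set_grub_var() {
    tmp=$(mktemp) || return 1
    awk -v key="$2" -v val="$3" '
        index($0, key "=") == 1 { print key "=\"" val "\""; found = 1; next }
        { print }
        END { if (!found) print key "=\"" val "\"" }
    ' "$1" > "$tmp" && cat "$tmp" > "$1"
    status=$?
    rm -f "$tmp"
    return "$status"
}

# Usage on the host (not executed here):
#   set_grub_var /etc/default/grub GRUB_GFXPAYLOAD_LINUX keep
#   update-grub
```

The file is rewritten via cat rather than mv so that its ownership and permissions stay as they were.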

Can you tell me how Proxmox kernels are built? I would like to try and nail down the kernel version causing this.
 
I installed linux-image-5.3.0-0.bpo.2-amd64, the Debian backported 5.3 kernel, and the symptom is the same. That makes me believe it has something to do with kernel 5.3, nothing PVE specific.
 
I'd be interested in finding the cause of your issues though. I'm entirely unable to reproduce them here, but analyzing the log you posted, that is a warning related to FPU states - something that we had some troubles with previously.

There's currently an open bug, where the fix is not yet in our kernels: https://bugzilla.kernel.org/show_bug.cgi?id=205663 though I'm entirely unsure if this is related.

No, this bug is not relevant.

Over the last days I isolated the bug to something that happened with kernel version 5.2.5 and then bisected it to this patch: "KVM: X86: Fix fpu state crash in kvm guest". Ironically, the patch was written to overcome FPU related bluescreens in Windows 10 VMs. I tested kernel versions 5.2.21, 5.3.18 and 5.4.6 against this assumption: all versions show the bug until I revert the patch.
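Narrowing a regression down like this is essentially a binary search over an ordered list of kernels. A minimal sketch under stated assumptions: all good versions come before all bad ones, the last version in the list is known to be bad, and is_bad is only a stand-in for the manual step of booting the host with that kernel and watching whether a Windows VM bluescreens. The names first_bad, nth and is_bad are made up, and the versions 5.1.21 and 5.2.4 are illustrative placeholders that do not come from this thread.

```shell
#!/bin/sh
# nth N ARGS... : print the N-th of ARGS (1-based).
nth() { shift "$1"; printf '%s\n' "$1"; }

# Stand-in for the manual test "boot this kernel, start a Windows VM,
# see if it bluescreens". Hard-coded here purely for illustration.
is_bad() {
    case "$1" in
        5.2.5|5.2.21|5.3.18|5.4.6) return 0 ;;
        *) return 1 ;;
    esac
}

# first_bad VERSION... : binary search for the first bad version.
# Assumes the list is ordered oldest to newest and the last one is bad.
first_bad() {
    lo=1
    hi=$#
    while [ "$lo" -lt "$hi" ]; do
        mid=$(( (lo + hi) / 2 ))
        if is_bad "$(nth "$mid" "$@")"; then
            hi=$mid            # first bad version is at mid or earlier
        else
            lo=$((mid + 1))    # first bad version is after mid
        fi
    done
    nth "$lo" "$@"
}

first_bad 5.0.21 5.1.21 5.2.4 5.2.5 5.2.21 5.3.18 5.4.6   # prints 5.2.5
```

With seven candidates this needs three test boots instead of up to six. The same halving idea applies at commit level within a single release, where git bisect automates the bookkeeping.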

I will contact the author of the patch and report back here.
 
I've been experiencing exactly the same issue as Robert, although on an Intel Xeon E5 CPU. This affected all our Windows 10 Pro VMs (BSODing with the error messages described in the first post).

Booting an older kernel (still installed from upgrades from older Proxmox versions) solved the issue. We are now running Proxmox 6.1 with kernel 5.0.21-5-pve.

Code:
root@pmx1:~# pveversion -v
proxmox-ve: 6.1-2 (running kernel: 5.0.21-5-pve)
pve-manager: 6.1-5 (running version: 6.1-5/9bf06119)
pve-kernel-5.3: 6.1-1
pve-kernel-helper: 6.1-1
pve-kernel-5.0: 6.0-11
pve-kernel-5.3.13-1-pve: 5.3.13-1
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.21-3-pve: 5.0.21-7
pve-kernel-5.0.21-1-pve: 5.0.21-2
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.2-pve4
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.13-pve1
libpve-access-control: 6.0-5
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-9
libpve-guest-common-perl: 3.0-3
libpve-http-server-perl: 3.0-3
libpve-storage-perl: 6.1-3
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve3
lxc-pve: 3.2.1-1
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-1
pve-cluster: 6.1-2
pve-container: 3.0-15
pve-docs: 6.1-3
pve-edk2-firmware: 2.20191127-1
pve-firewall: 4.0-9
pve-firmware: 3.0-4
pve-ha-manager: 3.0-8
pve-i18n: 2.0-3
pve-qemu-kvm: 4.1.1-2
pve-xtermjs: 3.13.2-1
qemu-server: 6.1-4
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.2-pve2

Code:
root@pmx1:~# cat /proc/cpuinfo | egrep "model name|microcode" | head -2
model name    : Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
microcode    : 0xb000038
 
@Robert Dahlem FYI, if you haven't seen it: A fix related to the patch you mentioned has been posted to the KVM mailing list [0]. If you or @fhibler could check whether the patches fix your issue, I'd see that we get them into our pve-kernel (otherwise it might take a while before our Ubuntu upstream picks them up). You can of course also report your findings directly to the KVM list if you want, I'm sure they'd be happy to hear about people testing their patches :)

[0] https://patchwork.kernel.org/project/kvm/list/?series=230131
 
Experiencing exactly the same on all WS2019 machines on 2 different nodes since the update yesterday :( Does anybody have a clue how I can load the previous kernel on a headless machine?
 
So just a short update: I have managed to load an old kernel and am now running 5.0.21-5-pve. It has no problems.

The current kernel 5.3.18-2-pve gives me continuous crashes and blue screens with all possible types of messages under WS2019 on two different nodes. Different CPUs, different systems, even different data centers. Using ZFS everywhere.

If somebody is interested in debugging this, I'm ready to assist...
 
I can confirm that this is no longer a problem under pve-kernel-5.3.18-3-pve (which seems to be the latest one for Proxmox 6.1) nor under pve-kernel-5.4.41-1-pve (which is the current one for Proxmox 6.2).
 