Windows VMs stuck on boot after Proxmox Upgrade to 7.0

@fiona, very good news!

I found an enterprise Proxmox 7.2-7 cluster with a completely unloaded node which, importantly, was running a Windows 10 VM hung after an automatic reboot (the spinning-circle problem).

So we can be sure this host is affected by the issue discussed here.

Then I downgraded *only* QEMU to 6.0.0-3 (August 2021).

I also set the machine type to pc-i440fx-6.0.

Then I restarted the same Windows 10 VM... and now we wait.
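In command form, the downgrade described above boils down to something like this. This is only a sketch: it assumes pve-qemu-kvm 6.0.0-3 is still available from the configured repository or the local APT cache, and VM ID 100 is a placeholder.

```shell
# Downgrade only the QEMU package, leaving the rest of the node untouched
# (assumes pve-qemu-kvm 6.0.0-3 is still available to APT):
apt install pve-qemu-kvm=6.0.0-3

# Pin the VM's machine type to the matching 6.0 version so the guest keeps
# seeing the same virtual hardware (100 is a placeholder VM ID):
qm set 100 --machine pc-i440fx-6.0
```

These commands must be run on the PVE node itself and take effect for the VM on its next cold start.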

Below is the list of packages installed on that node:

Code:
proxmox-ve: 7.2-1 (running kernel: 5.15.39-1-pve)
pve-manager: 7.2-7 (running version: 7.2-7/d0dd0e85)
pve-kernel-5.15: 7.2-6
pve-kernel-helper: 7.2-6
pve-kernel-5.13: 7.1-9
pve-kernel-5.15.39-1-pve: 5.15.39-1
pve-kernel-5.15.35-2-pve: 5.15.35-5
pve-kernel-5.15.35-1-pve: 5.15.35-3
pve-kernel-5.13.19-6-pve: 5.13.19-15
pve-kernel-5.13.19-2-pve: 5.13.19-4
ceph-fuse: 14.2.21-1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-2
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-3
libpve-storage-perl: 7.2-7
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.5-1
proxmox-backup-file-restore: 2.2.5-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-2
pve-container: 4.2-2
pve-docs: 7.2-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.5-1
pve-ha-manager: 3.4.0
pve-i18n: 2.7-2
pve-qemu-kvm: 6.0.0-3
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.4-pve1


The node is otherwise completely empty (apart from the aforementioned Windows 10 VM) and is fully available for testing, kernel downgrades, etc.
 
Great! I really hope we can isolate one component at last. Too bad it takes a long time to trigger :/ Maybe you can add some clones/other VMs to increase the chances?

Instead of downgrading to pve-qemu-kvm=6.0.0-3, an alternative way to disable SMM is to use args: -machine smm=off in the VM configuration. As always, you need to stop+start the VM after adding it, to apply it. Since this can be done on a per-machine basis, it's much less intrusive.

Would be great if more people could test this, to see if the issue is caused by SMM.
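For anyone wanting to try this, the supported route is `qm set <vmid> --args '-machine smm=off'` on the node, which writes a single `args:` line into the VM's config. As a harmless illustration of what ends up in the file, here is a sketch against a scratch stand-in file (on a real node the file would be /etc/pve/qemu-server/<VMID>.conf):

```shell
# Demo on a temporary stand-in config file, NOT a real /etc/pve one.
# On a real node, `qm set <vmid> --args '-machine smm=off'` does this for you.
conf=$(mktemp)
printf 'cores: 4\nmemory: 8192\n' > "$conf"   # minimal stand-in VM config
echo 'args: -machine smm=off' >> "$conf"      # the actual workaround line
grep '^args:' "$conf"                         # prints: args: -machine smm=off
rm -f "$conf"
```

Remember that, as noted above, the VM needs a full stop+start afterwards; a reboot from inside the guest does not pick up the new arguments.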
 

I am totally sure that the Windows 10 VM now running was previously hung (spinning circle).

So I am certain that this server, a Lenovo SR650 like all the other servers I (unfortunately) use, has this problem.

That gives us a solid starting point.

I downgraded to the QEMU 6.0.0-3 release and, to be completely safe, rebooted the node; the Windows 10 VM auto-started at boot, so it definitely started under QEMU 6.0.0-3.

Sure, I can create a clone!

At the moment I think the most important thing is to bring a system that has this problem (like mine) to a state where the problem no longer occurs... the workarounds will then follow by themselves. If it works with QEMU 6.0.0-3, we can then look at what changed between QEMU 6.0.0 and 6.1.

Today is August 4th; let's see what happens in a couple of weeks (I will take a snapshot of the VM with RAM, and will try restarting it every now and then).
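For reference, a RAM-included snapshot like the one mentioned above can be taken with `qm snapshot`; a sketch, where the VM ID and snapshot name are placeholders:

```shell
# Take a snapshot that also saves the VM's RAM state (--vmstate 1), so the
# guest can later be rolled back to exactly this running state
# (100 and pre-reboot-test are placeholders):
qm snapshot 100 pre-reboot-test --vmstate 1

# Roll back later with:
#   qm rollback 100 pre-reboot-test
```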
 
@fiona

OK, the node running the very latest 7.2-7 with kernel Linux 5.15.39-1 BUT QEMU 6.0.0-3 is ready.
It is running the original Windows 10 VM (the one that was hung with the spinning circle before the change) and a clone of it.

And now... we wait and test.

This is the VM configuration:

1659620901844.png


1659620936781.png
 

@fiona

On another cluster (7.2-7), fully updated and with no package versions altered, which naturally shows the same problems, I tried setting args: -machine smm=off on a dozen Windows Server VMs.

1659686486512.png

Let's see what happens ...
 
@fiona

To be clear: the tests I started (one fully updated host with QEMU downgraded to version 6.0.0-3, and another node with a dozen Windows VMs running with "args: -machine smm=off") are in progress and will not be interrupted until we understand whether or not we are on the right track.

But out of curiosity: I saw that kernel 5.15.39-3 was released in the CE repository with a changelog like this...

and I immediately wondered about a possible connection.

1659874348724.png
 
Has anyone had the exact same VM stuck twice?

I am asking because we have purposely left VMs running for extended periods of time (up to 65 days) without rebooting, to attempt to learn about this issue. These are VMs that have experienced the problem in the past. They were on the exact same PVE nodes. When we rebooted them again, they did not hang.

I just had this issue for the first time: the VM was up for an extended period of time and got stuck after installing the 2022-07 Cumulative Update. But like you said, a single hard stop resolved the issue.
 
See here and here for a bit of backstory. The "KVM: entry failed, hardware error 0x80000021" issue only started appearing with kernel 5.15, so I wouldn't put too much confidence in the issues being related. It is possible that one of the issues fixed there has been present in older kernels and is related to the reboot issues, but we don't know yet if the reboot issues are because of SMM, so I wouldn't get my hopes too high.
 

I agree.

Let's continue with our tests.
The first node with QEMU 6.0.0-3, the second with QEMU 6.2 and smm=off.

Let's see in the coming weeks what the tests will reveal ...
 
FYI, this issue came up for me on a Server 2016 VM with Proxmox 7.2-7 and QEMU 6.2.0-11, and adding args: -machine smm=off to the VM config had no effect.

I will try the downgrade later on; this is my only environment and it's production, so unfortunately I can't afford to play with it.
 
@fiona

OK, I applied the August updates to the two VMs I have under test, on Proxmox 7.2-7 with QEMU downgraded to 6.0.0-3 (see above).

They rebooted normally.

Remember that this system definitely exhibits the spinning-circle problem.

OK, I know it doesn't mean anything after only 6 days of uptime, but I'd like to be optimistic.

The real test will be at 30 days. I will also perform reboots before the September updates, and take snapshots with RAM to keep continuity across the test run.

For completeness, I have also updated some VMs running on a 6.4-15 cluster: kernel 5.4.157 and QEMU 5.2-8.

No problems there; everything is normal (uptime 30+ days).

That host is a fairly recent Lenovo (less than two years old), very similar to, although less powerful than, the SR650s I use in other clusters.
 
Today I had one Server 2019 VM, running on a 5-node cluster with PVE 6.4-13 and kernel 5.11.22, show the spinning circle.
First I migrated the VM to another node with the same config and reset it -> no luck, still the spinning circle.
Second, I migrated the VM to a node with kernel 5.4.162-2 and reset it -> the VM started and is running.
 
Dear Proxmox Team,

is it possible to provide a kernel 5.4 for PVE 7?
For testing, you should be able to install the kernel 5.4 .deb package from PVE 6 on PVE 7 without problems.
Code:
wget http://download.proxmox.com/debian/pve/dists/buster/pve-no-subscription/binary-amd64/pve-kernel-5.4.98-1-pve_5.4.98-1_amd64.deb
dpkg -i pve-kernel-5.4.98-1-pve_5.4.98-1_amd64.deb
 
Thanks Proxmox Team,

I just changed one node of a 7.2-7 cluster from kernel 5.15.39-3 to 5.4.98-1.

On the node with 5.15 I had a Win10 VM with the spinning circle; I migrated it to the node with 5.4, reset the VM, and it worked.
 

Downgrading the kernel may cause hardware problems, in the sense that you may run into unsupported hardware components.

I think this kind of test can only be done on older servers, or on ones with older components.

Apart from that, one thing should be investigated:

"On the node with 5.15 i have a Win10 VM with spinning circle, migrate to node with 5.4 reset VM and it worked"

On kernel 5.15, if you reset the VM (RESET only, absolutely NO stop and start), the problem remains, while migrating the STUCK VM to a node with a 5.4 kernel and performing a reset makes it work? Is that what you are saying?

If I understand correctly, then I am reasonably convinced that with a 5.4 kernel (even on Proxmox 7.2) the problem does not arise.
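Expressed as commands, the procedure being discussed would look roughly like this. This is a sketch of the reported steps, not a verified fix; the VM ID (100) and target node name (pve54node) are placeholders:

```shell
# Live-migrate the stuck VM to the node running the 5.4 kernel
# (100 and pve54node are placeholder names):
qm migrate 100 pve54node --online

# Then reset it -- a reset only, not a stop/start cycle:
qm reset 100
```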
 
@fiona

In light of the above: if the tests on the fully updated Proxmox 7.2-7 node with QEMU downgraded to version 6.0.0-3 are NOT successful, I propose bringing that node back to the latest QEMU release but downgrading the kernel to the latest patch release of the 5.4 series.

We have to isolate one component at a time; for now we continue the tests with QEMU 6.0.0-3.
 
QEMU was not downgraded in that case! PVE 7 was fully upgraded, and one node had kernel 5.4.
 
Yes, I never stopped the Win10 VM, just reset it, and as I wrote earlier, the same behavior occurs on PVE 6 and PVE 7.
Very interesting. I am convinced that with a 5.4 kernel even Proxmox 7.2 does not show the problem (obviously this needs to be verified, and it would be interesting to do so).

Do you have a test node with Proxmox 7.2-7 fully updated but with the kernel downgraded to 5.4, so you can test it for 30 days with some Windows VMs?
 
