Windows VMs stuck on boot after Proxmox Upgrade to 7.0

Linux pve6-01 5.11.22-5-pve #1 SMP PVE 5.11.22-10~bpo10+1 (Tue, 28 Sep 2021 10:30:51 +0200)

No, my guess doesn't hold up. I thought that, following the move to kernel 5.13 with Proxmox 7.0 -> 7.1, something had been backported to the 5.11 kernel of Proxmox 6.4 that introduced the problem. But the dates don't bear me out.
 
In fact, I went back through my colleagues' emails: the first problems arose around the week of January 17-21, 2022, coinciding with the application of Microsoft updates to the VMs, 25 days after Proxmox was updated.
Which means that the issue could have come in with Windows updates too (in combination with some PVE environmental characteristics that make it trigger only in some setups)...

I fully agree. As I said months ago in this thread, the problem is in QEMU, the kernel, or both.

And now?
I was just responding to your question; as said elsewhere, Windows is a proprietary system and such a PITA to debug, especially if we cannot reproduce the problem even somewhat reliably (even if only once in a few dozen tries).
It is too much of a gamble to roll back several clusters in production, running Proxmox 7.2-7 enterprise, to a kernel and QEMU version from 8 months ago ...
You'd only need to roll back one system that can reproduce this, though. Ideally with kernel and QEMU combinations varied separately, to better narrow down the tuple of packages/versions triggering this inside Windows.
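
For anyone attempting that, a minimal sketch of how the combination in use could be recorded before and after swapping a single package (the VM ID 100 is a placeholder); note that a QEMU downgrade only affects a guest once its process has been cold-restarted:

Code:
# record the kernel and QEMU package versions currently active on the host
uname -r
pveversion -v | grep -E 'pve-qemu-kvm|pve-kernel'
# after downgrading only one of the two packages, cold-restart a test VM so it picks up the new QEMU binary (100 is a placeholder VM ID)
qm stop 100 && qm start 100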
 
Which means that the issue could have come in with Windows updates too (in combination with some PVE environmental characteristics that make it trigger only in some setups)...

Yes, this would explain why Windows 2008 R2 never gave any problems. Either there is something different in Windows 2008 R2, or its patch level (commercially supported until 31-12-2022) does not include whatever introduced the problem.

You'd only need to roll back one system that can reproduce this, though. Ideally with kernel and QEMU combinations varied separately, to better narrow down the tuple of packages/versions triggering this inside Windows.

Everyone writing here has the problem. I don't have clusters where I can downgrade, they are all in production, but someone with test systems that can be run on a downgrade can surely be found.
 
I was just responding to your question; as said elsewhere, Windows is a proprietary system and such a PITA to debug, especially if we cannot reproduce the problem even somewhat reliably (even if only once in a few dozen tries).
Some folks have reported seeing this on Linux guests too, so I'm not sure it is just a Windows guest problem.
 
I can confirm, it happened on Ubuntu 20.04.4 LTS as well.

Interestingly, I have more than 50 Debian 11 VMs on multiple Proxmox 7.2 clusters (all of which have this problem with Windows 2012 R2 and up), but no Debian VM with the 5.10 amd64 kernel has ever had this problem.

How does Ubuntu 20.04 differ from Debian 11? OK, the kernel version ... but what else?
 
Interestingly, I have more than 50 Debian 11 VMs on multiple Proxmox 7.2 clusters (all of which have this problem with Windows 2012 R2 and up), but no Debian VM with the 5.10 amd64 kernel has ever had this problem.

How does Ubuntu 20.04 differ from Debian 11? OK, the kernel version ... but what else?
Good question. The same cluster also has some Debian 10/11 machines, without this issue. The Ubuntu machine was migrated from Hyper-V; maybe something from M$ is still left over in the boot process.
 
I can confirm, it happened on Ubuntu 20.04.4 LTS as well.
Many of our Ubuntu 20.04 VMs are also affected. We sometimes have to stop and start many VMs, depending on whether there has been an automatic reboot after updates.
 
@t.lamprecht The fact that this problem also occurs on Ubuntu (although I have no idea how that can be, since it is practically the same as Debian, and Debian does not show the slightest problem even on clusters whose Windows VMs exhibit it in an obvious way) might help to facilitate debugging.

All that remains is to get your hands on an Ubuntu VM that has the problem (a snapshot with RAM, as usual ...?).
 
Just had the luck to see the error occur directly during the manual reboot of two Ubuntu VMs. What does it look like? Screenshot attached.

A reset reports "OK" in the task log but does not work. Only a stop and start brings the VM back to life. No updates, only a manual reboot. There is nothing in the log on the host and nothing in the log inside the VM.

Now we come to the exciting part. I was always looking for similarities: what do the VMs where this happens have in common?

Here I found out that every VM going into this state had Snap from Canonical installed. That is the very first real commonality. I have now uninstalled it on all but one VM (where we need Snap) and rebooted the VMs from within the OS. Maybe that brings a change. We will see.

Maybe also interesting: one of the VMs that got stuck had been running for just one day (QEMU process uptime).
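
For reference, a hedged sketch of the recovery steps and the Snap removal described above (VM ID 100 and the snap name are placeholders; whether Snap is really the trigger is unconfirmed):

Code:
# on the host: a reset is accepted but the guest stays hung, only a full stop/start recovers it
qm reset 100          # task reports OK, VM remains stuck
qm stop 100
qm start 100

# inside the Ubuntu guest: list installed snaps, remove them, then remove snapd itself
snap list
snap remove <snap-name>   # repeat per snap, <snap-name> is a placeholder
apt purge snapd
reboot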
 

Attachments

  • Screenshot_20220720_135947.jpg
Just had the luck to see the error occur directly during the manual reboot of two Ubuntu VMs. What does it look like? Screenshot attached.

A reset reports "OK" in the task log but does not work. Only a stop and start brings the VM back to life. No updates, only a manual reboot. There is nothing in the log on the host and nothing in the log inside the VM.

Now we come to the exciting part. I was always looking for similarities: what do the VMs where this happens have in common?

Here I found out that every VM going into this state had Snap from Canonical installed. That is the very first real commonality. I have now uninstalled it on all but one VM (where we need Snap) and rebooted the VMs from within the OS. Maybe that brings a change. We will see.

Maybe also interesting: one of the VMs that got stuck had been running for just one day (QEMU process uptime).

... this is getting more and more interesting; if we continue like this, by the end of the year we could bind all the posts into a book.

What you say would explain the different behavior between Debian and Ubuntu. Now at least one reason is clear.

I just have a doubt ... it may be a different problem from the one reported here for Windows VMs.

A different problem with similar behavior.

It may just be a feeling, but on Windows VMs I have never seen a hang after such a short QEMU process uptime.
 
I have the same behaviour as fireon describes.
Debian 10 and Debian 11 VMs, as well as Windows 10 and Server 2016 VMs.
*If* it happens, then all VMs show the same "result" as in fireon's screenshot. Also, resetting a VM does not work; I have to stop it, wait a few seconds and start it again.

I tried rebooting Linux and Windows both with and without updates; it doesn't really matter. Sometimes the VM hangs, sometimes not. Really hard to reproduce.
 
I have the same behaviour as fireon describes.
Debian 10 and Debian 11 VMs, as well as Windows 10 and Server 2016 VMs.
*If* it happens, then all VMs show the same "result" as in fireon's screenshot. Also, resetting a VM does not work; I have to stop it, wait a few seconds and start it again.

I tried rebooting Linux and Windows both with and without updates; it doesn't really matter. Sometimes the VM hangs, sometimes not. Really hard to reproduce.

At this point, in my opinion, after all this mess there is only one thing left to do, since debugging is extremely difficult.

Change one of the major components that characterize the system, for example by moving to QEMU 7.
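
Before attempting such a jump, a quick way to see which QEMU builds are actually installable from the configured repositories (the output depends entirely on your repository setup):

Code:
# show the installed and candidate versions of the QEMU package
apt policy pve-qemu-kvm
# list all versions available from the configured repositories
apt list -a pve-qemu-kvm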
 
We have now built QEMU version 5.2.0-8 in our repository. In a first test it works under PVE 7.x, but what matters is a series of tests over a longer period. Therefore, I am calling here for tests of this version on 7.x. After installation you have to adjust the QEMU machine version of the VMs accordingly. Make backups/snapshots beforehand. This is highly experimental.

https://apt.iteas.at/

Code:
apt-key adv --recv-keys --keyserver keyserver.ubuntu.com 2FAB19E7CCB7F415
echo "deb https://apt.iteas.at/iteas bullseye main" > /etc/apt/sources.list.d/iteas.list
apt update
apt install pve-qemu-kvm=5.2.0-8
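
For reference, a minimal sketch of how the machine version could be adjusted for a VM (the VM ID 100 and the i440fx machine type are placeholders; use the matching q35 variant where applicable):

Code:
# pin the virtual machine type to the 5.2 machine version (100 is a placeholder VM ID)
qm set 100 --machine pc-i440fx-5.2
# the change only takes effect after a cold restart of the QEMU process
qm stop 100 && qm start 100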
 
We have now built QEMU version 5.2.0-8 in our repository. In a first test it works under PVE 7.x, but what matters is a series of tests over a longer period. Therefore, I am calling here for tests of this version on 7.x. After installation you have to adjust the QEMU machine version of the VMs accordingly. Make backups/snapshots beforehand. This is highly experimental.

https://apt.iteas.at/

Code:
apt-key adv --recv-keys --keyserver keyserver.ubuntu.com 2FAB19E7CCB7F415
echo "deb https://apt.iteas.at/iteas bullseye main" > /etc/apt/sources.list.d/iteas.list
apt update
apt install pve-qemu-kvm=5.2.0-8

... so you are proposing to use, on Proxmox 7, the QEMU version that was in use with Proxmox 6.x, as a major-release downgrade to check whether the problem really is in QEMU, correct?
The test is really interesting, not aimed at actually running QEMU 5 on Proxmox 7, of course, but at understanding whether the problem is in QEMU 6. Great job!!!!
 
Hello, as I wrote above:
Two clusters on 6.4-13:
1. pve-qemu-kvm 5.2.0-6, kernel pve-kernel-5.11.22-5-pve 5.11.22-10~bpo10+1 ===> problems with reboot
2. pve-qemu-kvm 5.2.0-6, kernel pve-kernel-5.4.78-2-pve 5.4.78-2 ===> no problems for 1.5 years
 
At this point ... if any of you have the opportunity to do so, you should try rolling back a fully upgraded Proxmox 7.2-7 server to a 5.11 kernel from September-October 2021 at the latest ... and run it for at least a month ...
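
A minimal sketch of what such a rollback could look like, assuming the old kernel package is still available in the configured repositories (the exact version string is taken from the cluster report above and may differ on your systems):

Code:
# install the older 5.11 kernel alongside the current one (version from the earlier report, adjust as needed)
apt install pve-kernel-5.11.22-5-pve
# reboot and pick the 5.11 entry from the boot menu ("Advanced options"), leaving the newer kernel installed as a fallback
reboot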
 
