Windows VMs stuck on boot after Proxmox Upgrade to 7.0

Linux pve6-01 5.11.22-5-pve #1 SMP PVE 5.11.22-10~bpo10+1 (Tue, 28 Sep 2021 10:30:51 +0200)

No, my guess doesn't hold up. I thought that, following the move to kernel 5.13 with Proxmox 7.0 -> 7.1, something had been backported to the 5.11 kernel of Proxmox 6.4 that introduced the problem. But the dates don't bear me out.
 
In fact, I went back through my colleagues' emails: the first problems arose around the week of January 17-21, 2022, coinciding with the application of Microsoft updates to the VMs, 25 days after Proxmox was updated.
Which means that the issue could have come in with Windows updates too (in combination with some PVE environmental characteristics that make it trigger only in some setups)...

I fully agree. As I said months ago in this thread, the problem is in QEMU, the kernel, or both.

And now?
I was just responding to your question; as said elsewhere, Windows is a proprietary system and such a PITA to debug, especially if we cannot reproduce the problem even somewhat reliably (even if only once in a few dozen tries).
It is too much of a gamble to roll back several clusters in production, running Proxmox 7.2-7 enterprise, to a kernel and QEMU version from 8 months ago ...
You'd only need to roll back one system that can reproduce this, though. Ideally with kernel and QEMU combinations varied separately, to better narrow down the tuple of packages/versions triggering this inside Windows.
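
For anyone attempting that, a minimal sketch of how the combination in use could be recorded before and after swapping a single package (the VM ID 100 is a placeholder); note that a QEMU downgrade only affects a guest once its process has been cold-restarted:

Code:
# record the kernel and QEMU package versions currently active on the host
uname -r
pveversion -v | grep -E 'pve-qemu-kvm|pve-kernel'
# after downgrading only one of the two packages, cold-restart a test VM so it picks up the new QEMU binary (100 is a placeholder VM ID)
qm stop 100 && qm start 100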
 
Which means that the issue could have come in with Windows updates too (in combination with some PVE environmental characteristics that make it trigger only in some setups)...

Yes, this would explain why Windows 2008 R2 never gave any problems. Either there is something different in Windows 2008 R2, or its patch level (commercially supported until 31-12-2022) does not include whatever introduced the problem.

You'd only need to roll back one system that can reproduce this, though. Ideally with kernel and QEMU combinations varied separately, to better narrow down the tuple of packages/versions triggering this inside Windows.

Everyone writing here has the problem. I don't have clusters where I can downgrade, they are all in production, but someone with test systems that can be run on a downgrade can surely be found.
 
I was just responding to your question; as said elsewhere, Windows is a proprietary system and such a PITA to debug, especially if we cannot reproduce the problem even somewhat reliably (even if only once in a few dozen tries).
Some folks have reported seeing this on Linux guests too, so I'm not sure it is just a Windows guest problem.
 
I can confirm, it happened on Ubuntu 20.04.4 LTS as well.

Interestingly, I have more than 50 Debian 11 VMs on multiple Proxmox 7.2 clusters (all of which have this problem with Windows 2012 R2 and up), but no Debian VM with the 5.10 amd64 kernel has ever had this problem.

How does Ubuntu 20.04 differ from Debian 11? OK, the kernel version ... but what else?
 
Interestingly, I have more than 50 Debian 11 VMs on multiple Proxmox 7.2 clusters (all of which have this problem with Windows 2012 R2 and up), but no Debian VM with the 5.10 amd64 kernel has ever had this problem.

How does Ubuntu 20.04 differ from Debian 11? OK, the kernel version ... but what else?
Good question. The same cluster also has some Debian 10/11 machines, without this issue. The Ubuntu machine was migrated from Hyper-V; maybe something from M$ is still left over in the boot process.
 
I can confirm, it happened on Ubuntu 20.04.4 LTS as well.
Many of our Ubuntu 20.04 VMs are also affected. We sometimes have to stop and start many VMs, depending on whether there has been an automatic reboot after updates.
 
@t.lamprecht The fact that this problem also occurs on Ubuntu (although I have no idea how that can be, since it is practically the same as Debian, and Debian does not show the slightest problem even on clusters whose Windows VMs exhibit it in an obvious way) might help to facilitate debugging.

All that remains is to get your hands on an Ubuntu VM that has the problem (a snapshot with RAM, as usual ...?).
 
Just had the luck to see the error occur directly during the manual reboot of two Ubuntu VMs. What does it look like? Screenshot attached.

A reset reports "OK" in the task log but does not work. Only a stop and start brings the VM back to life. No updates, only a manual reboot. There is nothing in the log on the host and nothing in the log inside the VM.

Now we come to the exciting part. I was always looking for similarities: what do the VMs where this happens have in common?

Here I found out that every VM going into this state had Snap from Canonical installed. That is the very first real commonality. I have now uninstalled it on all but one VM (where we need Snap) and rebooted the VMs from within the OS. Maybe that brings a change. We will see.

Maybe also interesting: one of the VMs that got stuck had been running for just one day (QEMU process uptime).
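
For reference, a hedged sketch of the recovery steps and the Snap removal described above (VM ID 100 and the snap name are placeholders; whether Snap is really the trigger is unconfirmed):

Code:
# on the host: a reset is accepted but the guest stays hung, only a full stop/start recovers it
qm reset 100          # task reports OK, VM remains stuck
qm stop 100
qm start 100

# inside the Ubuntu guest: list installed snaps, remove them, then remove snapd itself
snap list
snap remove <snap-name>   # repeat per snap, <snap-name> is a placeholder
apt purge snapd
reboot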
 

Attachments

  • Screenshot_20220720_135947.jpg
Just had the luck to see the error occur directly during the manual reboot of two Ubuntu VMs. What does it look like? Screenshot attached.

A reset reports "OK" in the task log but does not work. Only a stop and start brings the VM back to life. No updates, only a manual reboot. There is nothing in the log on the host and nothing in the log inside the VM.

Now we come to the exciting part. I was always looking for similarities: what do the VMs where this happens have in common?

Here I found out that every VM going into this state had Snap from Canonical installed. That is the very first real commonality. I have now uninstalled it on all but one VM (where we need Snap) and rebooted the VMs from within the OS. Maybe that brings a change. We will see.

Maybe also interesting: one of the VMs that got stuck had been running for just one day (QEMU process uptime).

... this is getting more and more interesting; if we continue like this, by the end of the year we could bind all the posts into a book.

What you say would explain the different behavior between Debian and Ubuntu. Now at least one reason is clear.

I just have a doubt ... it may be a different problem from the one reported here for Windows VMs.

A different problem with similar behavior.

It may just be a feeling, but on Windows VMs I have never seen a hang after such a short QEMU process uptime.
 
I have the same behaviour as fireon describes.
Debian 10 and Debian 11 VMs, as well as Windows 10 and Server 2016 VMs.
*If* it happens, then all VMs show the same "result" as in fireon's screenshot. Also, resetting a VM does not work; I have to stop it, wait a few seconds and start it again.

I tried rebooting Linux and Windows both with and without updates; it doesn't really matter. Sometimes the VM hangs, sometimes not. Really hard to reproduce.
 
I have the same behaviour as fireon describes.
Debian 10 and Debian 11 VMs, as well as Windows 10 and Server 2016 VMs.
*If* it happens, then all VMs show the same "result" as in fireon's screenshot. Also, resetting a VM does not work; I have to stop it, wait a few seconds and start it again.

I tried rebooting Linux and Windows both with and without updates; it doesn't really matter. Sometimes the VM hangs, sometimes not. Really hard to reproduce.

At this point, in my opinion, after all this mess there is only one thing left to do, since debugging is extremely difficult.

Change one of the major components that characterize the system, for example by moving to QEMU 7.
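
Before attempting such a jump, a quick way to see which QEMU builds are actually installable from the configured repositories (the output depends entirely on your repository setup):

Code:
# show the installed and candidate versions of the QEMU package
apt policy pve-qemu-kvm
# list all versions available from the configured repositories
apt list -a pve-qemu-kvm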
 
We have now built QEMU version 5.2.0-8 in our repository. In a first test it works under PVE 7.x, but what matters is a series of tests over a longer period. Therefore, I am calling here for tests of this version on 7.x. After installation you have to adjust the QEMU machine version of the VMs accordingly. Make backups/snapshots beforehand. This is highly experimental.

https://apt.iteas.at/

Code:
apt-key adv --recv-keys --keyserver keyserver.ubuntu.com 2FAB19E7CCB7F415
echo "deb https://apt.iteas.at/iteas bullseye main" > /etc/apt/sources.list.d/iteas.list
apt update
apt install pve-qemu-kvm=5.2.0-8
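
For reference, a minimal sketch of how the machine version could be adjusted for a VM (the VM ID 100 and the i440fx machine type are placeholders; use the matching q35 variant where applicable):

Code:
# pin the virtual machine type to the 5.2 machine version (100 is a placeholder VM ID)
qm set 100 --machine pc-i440fx-5.2
# the change only takes effect after a cold restart of the QEMU process
qm stop 100 && qm start 100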
 
We have now built QEMU version 5.2.0-8 in our repository. In a first test it works under PVE 7.x, but what matters is a series of tests over a longer period. Therefore, I am calling here for tests of this version on 7.x. After installation you have to adjust the QEMU machine version of the VMs accordingly. Make backups/snapshots beforehand. This is highly experimental.

https://apt.iteas.at/

Code:
apt-key adv --recv-keys --keyserver keyserver.ubuntu.com 2FAB19E7CCB7F415
echo "deb https://apt.iteas.at/iteas bullseye main" > /etc/apt/sources.list.d/iteas.list
apt update
apt install pve-qemu-kvm=5.2.0-8

... so you are proposing to use, on Proxmox 7, the QEMU version that was in use with Proxmox 6.x, as a major-release downgrade to check whether the problem really is in QEMU, correct?
The test is really interesting, not aimed at actually running QEMU 5 on Proxmox 7, of course, but at understanding whether the problem is in QEMU 6. Great job!!!!
 
Hello, as I wrote above:
Two clusters on 6.4-13:
1. pve-qemu-kvm 5.2.0-6, kernel pve-kernel-5.11.22-5-pve 5.11.22-10~bpo10+1 ===> problems with reboot
2. pve-qemu-kvm 5.2.0-6, kernel pve-kernel-5.4.78-2-pve 5.4.78-2 ===> no problems for 1.5 years
 
At this point ... if any of you have the opportunity to do so, you should try rolling back a fully upgraded Proxmox 7.2-7 server to a 5.11 kernel from September-October 2021 at the latest ... and run it for at least a month ...
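
A minimal sketch of what such a rollback could look like, assuming the old kernel package is still available in the configured repositories (the exact version string is taken from the cluster report above and may differ on your systems):

Code:
# install the older 5.11 kernel alongside the current one (version from the earlier report, adjust as needed)
apt install pve-kernel-5.11.22-5-pve
# reboot and pick the 5.11 entry from the boot menu ("Advanced options"), leaving the newer kernel installed as a fallback
reboot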
 
