Windows VMs stuck on boot after Proxmox Upgrade to 7.0

As itNGO already mentioned, I have this issue with Windows VMs (Win 10) and also with Linux VMs (Debian 10 and 11), so it's not Windows-only.
 
... Interesting. I use Debian 11 VMs heavily and the problem never, ever arose with those. With Windows VMs, instead, it is a bloodbath (from 2012 upwards; 2008 never had a problem).
 
We do have access to a VM exhibiting this issue, and we're currently trying to find the cause.
So all I can currently say is that we're working on it.

Once we know more, we'll provide more information both in the bug tracker and here in this, and possibly other, threads.
Any guess yet as to what is causing the issue? Or perhaps already an idea for a fix? Please keep us updated a little. Some customers are already starting to cry, and explaining it is getting harder. For now we still tell them that this might take some more time...
 

Totally agree. If the PVE maintainers managed to reproduce the issue, they should at least have some more details on it by now and could share them with the PVE community. This is not a rare issue: lots of PVE users are facing it, and it has already become a nightmare for dozens of them.
 

I would like to be extremely transparent: it is in everyone's interest that this problem be solved, and we are all hoping for at least a workaround. In a few days the Microsoft updates will be released, and we are bracing for a new wave of VMs that do not come back up after the reboot.

A lot has been written, 13 pages, and the last thing I want is to repeat what has already been said.

Update server firmware, update Intel or AMD microcode, turn off mitigations for Spectre and the like.

All things already done, already tried. Nothing.

In my opinion we need to focus on the reliable data we have. Why does Win 2008 R2 have no problems, and why do they appear from 2012 onwards? Proxmox 6.4 is OK, never stuck on reboot, never... Why?
 
I got the same problem on my PVE 7 hosts. I read the complete thread. So I think the only way is to build a stop/start script in case the system gets stuck?! :)
 
Hi,

looks like we had the same idea, so over the last days I wrote one of my first Python scripts. Please don't look too closely at my code *lol*

* The script is started from cron as often as you want.
* A "qemu ping" (guest agent ping) checks whether the VM is responding or not.
* If not, the script tries to ping via the network, using the name of the VM as the hostname. If the second ping also fails, a qm stop and start will restart the VM.
* To skip a VM, put "cbxSkipHeartbeatCheck" in its Notes field.

!!! Please test the script before productive use !!!
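This is not the attached script, just a minimal sketch of the logic described above, assuming the `qm` CLI is available on the Proxmox host; the restart decision is split out as a pure function so it can be tested without a cluster:

```python
#!/usr/bin/env python3
"""Heartbeat watchdog sketch: run from cron, ping each VM via the QEMU
guest agent, fall back to a network ping, and stop/start the VM if both
fail. Names and structure are illustrative, not the attached script."""
import subprocess

# put this marker in a VM's Notes field to exclude it from the check
SKIP_MARKER = "cbxSkipHeartbeatCheck"


def qemu_agent_ping(vmid: int) -> bool:
    """True if the QEMU guest agent answers a ping for this VMID."""
    result = subprocess.run(["qm", "guest", "cmd", str(vmid), "ping"],
                            capture_output=True)
    return result.returncode == 0


def network_ping(hostname: str) -> bool:
    """Fallback check: one ICMP ping, using the VM name as hostname."""
    result = subprocess.run(["ping", "-c", "1", "-W", "2", hostname],
                            capture_output=True)
    return result.returncode == 0


def decide_action(skip: bool, agent_ok: bool, net_ok: bool) -> str:
    """Pure decision logic: restart only when both checks fail."""
    if skip:
        return "skip"
    if agent_ok or net_ok:
        return "ok"
    return "restart"


def restart_vm(vmid: int) -> None:
    # hard stop + start, matching the thread: a reset is not enough,
    # only a new QEMU process gets the stuck VM booting again
    subprocess.run(["qm", "stop", str(vmid)], check=True)
    subprocess.run(["qm", "start", str(vmid)], check=True)


def check_vm(vmid: int, name: str, skip: bool) -> str:
    """Check one VM and restart it if both heartbeats fail."""
    action = decide_action(skip, qemu_agent_ping(vmid), network_ping(name))
    if action == "restart":
        restart_vm(vmid)
    return action
```

A cron wrapper would iterate over the VM list and call `check_vm` per guest, reading the skip marker from each VM's notes.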
 


Hi,

EDIT 2: unfortunately, this does not seem to be a valid workaround after all. Post #267 mentions that localtime: 0 didn't work for them, and the user who reported that OS type other helped also ran into the issue again (they were probably just lucky in the beginning).

how many of you have tried setting localtime: 0 as a potential workaround? (EDIT: Of course you need to stop/start the VM for the setting to actually take effect. Note that the change will affect the time seen by the guest, so be careful when applying this to production VMs.)
I'm asking because post #82 suggests this. And now there's a new report where changing the guest OS type to other helped. One of the few things that changing the OS type after VM creation affects is the default value of the localtime setting (Use local time for RTC in the UI). It's only enabled by default when the OS type is Windows.
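For reference, applying and verifying this on a single VM might look like the following (VMID 100 is a placeholder; run on the node hosting the VM):

```shell
# set the RTC option discussed above
qm set 100 --localtime 0
# verify the value in the VM configuration
qm config 100 | grep -i localtime
# full stop/start so a fresh QEMU instance picks the setting up
# (a guest-initiated reboot keeps the old QEMU process running)
qm stop 100
qm start 100
```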
 

.. I don't think this setting modification has ever been tried .... however, I have to shut down the VM; I cannot just move it between the cluster nodes to get the new "localtime: 0" configuration. Thanks @fabian
 
I'm having the same issue on 7.1-12: spinning circle. I've tried running Windows repair a couple of times, but I'm still waiting. These are Windows 2022 Server VMs and I need to somehow get them back online. Any other ideas what I can do?
 
I tried doing that a few times already, when Windows repair starts. It runs the repair and reboots, but I'm still stuck waiting at a spinning circle.
 

Attachment: windows-server-spinning-wheel.PNG (screenshot of the spinning circle on boot)
Yes, a hard power off gets the VM running again. I was doing a reset and not a STOP and START. At least that gets the VM back online.

Thanks
 
I just applied the modification to all my 160 VMs (Windows 2019 Datacenter) that show the same behaviour, live-migrating each one to another cluster member and back (it took some hours), but we will have to wait some days for feedback.

Cluster: 3 Supermicro nodes with Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz, kernel 5.13.19-2-pve, pve-qemu-kvm 6.1.0-3.

See you later
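Applying the setting to every VM on a node and migrating each one away could be scripted roughly like this (a sketch: the target node name pve2 is an assumption, and migrating back is done from the other node afterwards):

```shell
# apply localtime 0 to every VM on this node, then live-migrate each one
# so a new QEMU instance is created on the target node
for vmid in $(qm list | awk 'NR>1 {print $1}'); do
    qm set "$vmid" --localtime 0
    qm migrate "$vmid" pve2 --online
done
```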
 

Interestingly: if I make a change that cannot be applied hot, like this one, and then migrate the VM to another node, does the change remain in the "not applied" state because it waits for a stop/start of the VM, or is it actually applied during the migration, making the pending state just a cosmetic bug? This is an interesting question: if I have to apply this workaround to a large number of VMs, it makes a big difference. Regardless, what matters is solving the age-old problem, so let's try this workaround.
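One way to answer the pending-state question directly: `qm pending` lists configuration values that are still waiting for a new QEMU instance, so you can compare before and after the migration (VMID 100 is a placeholder):

```shell
# show current vs. pending values for VM 100; if localtime still shows
# as pending after the migration, the change was not picked up
qm pending 100
```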
 
