Windows VMs stuck on boot after Proxmox Upgrade to 7.0

I think I might have found a solution for 3 of our VMs. The VMs run Windows Server 2022 and had the following configuration when the problem was observed:
Code:
agent: 1
bios: ovmf
boot: order=scsi0;net0
cores: 2
cpu: host
machine: pc-q35-5.1
memory: 4096
meta: creation-qemu=6.2.0,ctime=1648638637
name: Win2022Test
net0: virtio=6A:78:4E:4E:5D:43,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: win11
scsi0: proxmox:102/vm-102-disk-0.qcow2,size=50G
scsihw: virtio-scsi-pci
smbios1: uuid=047aac44-58ec-48c9-a917-957bbbcdf899
sockets: 1
tpmstate0: proxmox:102/vm-102-disk-0.raw,size=4M,version=v2.0
vmgenid: e66d0391-63d2-4e55-b933-426803a8f8d6

I found the following Windows system logs, in sequence, during the reboot process of the VM:
  1. The operating system is shutting down at system time 2022-03-30T14:03:53.680178600Z.
  2. The operating system started at system time 2022-03-30T23:10:58.500000000Z.
  3. The last shutdown's success status was true. The last boot's success status was true.
These logs are from a VM that was rebooted from within Windows and got stuck during the reboot. Windows seems to think the reboot was successful, but notice the time difference between the "system is shutting down" and "system started" entries? In reality, these logs should have been produced a couple of seconds apart. Following up on this strange difference, I decided to try disabling the local-time RTC by adding the following to my VM configuration:
Code:
localtime: 0
This can also be achieved by modifying the setting "Use local time for RTC" in Proxmox GUI.
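The same change can also be made from the node's shell with the `qm` CLI. A minimal sketch, assuming VMID 102 (the VMID from the config above; substitute your own):

```shell
# Disable "Use local time for RTC" for VM 102;
# equivalent to "localtime: 0" in /etc/pve/qemu-server/102.conf
qm set 102 --localtime 0

# Verify the resulting setting
qm config 102 | grep localtime
```

The setting takes effect on the next cold start of the VM, not on a guest-initiated reboot.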

After changing this setting, I did the following:
  1. Started up the VM with the local-time RTC disabled
  2. Configured the correct timezone for the VM (the same one my Proxmox nodes use)
  3. Forced Windows to sync time with NTP
  4. Waited 5 minutes before rebooting the VM (it did not seem to work when I rebooted the VM immediately after the NTP sync)
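Inside the Windows guest, steps 2 and 3 can be done from an elevated PowerShell with the built-in `tzutil` and `w32tm` tools. A sketch; the timezone name below is only an example (list valid names with `tzutil /l`):

```shell
# Step 2: set the guest timezone (example name; must match your nodes' timezone)
tzutil /s "W. Europe Standard Time"

# Step 3: force an NTP resync, then check the sync status
w32tm /resync
w32tm /query /status
```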
Following these steps seems to have resolved the reboot issues for the 3 VMs I have tried it on so far. I will try the same process on a couple more VMs tomorrow to verify that this actually solves the problem for us completely.

Somehow, the RTC time for our VMs is 2 hours behind, even though all our nodes report the correct local time. However, as long as the clock in our VMs runs on local time, and not the time reported by the RTC, they don't seem to have any issues rebooting.
 

After investigating this a bit more, I seem to have found the underlying problem, at least for our environment. There was a difference between the local time reported by our Proxmox nodes (found by running "date" in the console) and the hardware clock (found by running "hwclock" in the console). The hardware clock was running 2 hours behind, so I synced it with the local time used by our Proxmox nodes by running "hwclock --systohc". After doing this and rebooting our VMs (still having to manually stop and start each VM during the first reboot), I can no longer reproduce the problem on subsequent reboots in our environment.
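The check and fix described above boil down to three standard Linux commands on the Proxmox node (`hwclock` needs root):

```shell
# Compare the node's system time with its hardware clock
date
hwclock

# If the hardware clock lags behind (2 hours in our case),
# sync it to the current system time
hwclock --systohc
```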
 
Tried the 5.15-Kernel (pve-kernel-5.15.30-1-pve) - my test VM still hangs on reboot.
@vulture What do you need to do to get the test VM to hang?

Asking because, after a hang, it seems to take a while before the same VM will hang again.

Do you have to leave it for X days? Or is there something you can do to trigger it?
 
The reproduction should be done like this: take a snapshot of the VM including RAM, then try to restart it. If the problem occurs, that snapshot is what should be used for testing, because it contains the state of the machine and its RAM. Since the problem does not occur consistently, and above all not immediately, it is necessary to freeze the machine's state at the moment the problem arises.
 
Hi,


Can you reproduce the issue on the same VM? If so, please do the following:
1- Since you currently cannot predict whether the issue will happen after the next reboot, take a snapshot whenever you suspect the next reboot of the VM will hang at boot.
2- If the issue occurs, roll back to the snapshot, make a backup of the VM, and provide the backup file.

This will help us dig deeper into the issue.
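A sketch of that procedure with the `qm` and `vzdump` CLIs; VMID 102 and the snapshot name are placeholders, and `--vmstate 1` includes RAM in the snapshot, as suggested earlier in the thread:

```shell
# 1. Before a reboot you suspect will hang, snapshot the VM including RAM
qm snapshot 102 pre-reboot --vmstate 1

# 2. If the reboot hangs: roll back to the snapshot, then create a
#    backup of the VM to provide for analysis
qm rollback 102 pre-reboot
vzdump 102 --mode snapshot --compress zstd
```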

@Moayad I tried to reproduce the issue: I restored one VM to a state before the Windows updates on Tuesday, applied the updates again, and this time there was no reboot loop. I tried with a manually initiated reboot after the update, and after restoring to a snapshot taken before the reboot, I also tried with the Windows-initiated automatic reboot (at 2 am). Unfortunately, it was not reproducible.

However, we have had this reboot-loop issue twice already (see my previous comments). It was always (!) on a Wednesday, the day after Microsoft's Patch Tuesday: the first time on 9th Feb, the second on 13th April. I am really concerned that on the next Microsoft patch day all VMs will get stuck again.

How can we help you fix this?
 
I am hoping QEMU 6.2 fixes this.

I find the questions asked by the Proxmox support team strange. It seems this thread is not even being read. Several tickets have been opened, and in them the team claims the problem was never replicated on their side. They ask for information that is evident to anyone who reads this thread, for example after which Windows update the problem appears. It should be clear by now that the problem arises not from a specific update, but from the restart of a VM that has been running for several days; the typical case is a server restart following Windows updates. The update is just the trigger, not the cause.


Again, I am hoping QEMU 6.2 fixes this.
 
Hello,

We have the same problem on several of our customers' 7.1 installs.

Windows VMs are stuck after a reboot following Windows Update.

This is very problematic since it can disrupt many services.

It doesn't seem to affect our 5.x and 6.x customers though.
 
Hi,
just to clarify: this has nothing to do with Windows Update itself. It is more a matter of the VM's runtime/uptime. Reboot it after a day or two: no problem. Reboot after 4 to 7 days: maybe it works. For any VM running longer than 7 days, there is a 99% chance you get stuck on reboot with a black screen.

And for this reason a snapshot with RAM does not help; it is rather the Proxmox/KVM process itself that has to have been running for several days before it runs into this issue.
 
Tested one of our HPE DL380 G10 test machines with the latest QEMU 6.2.0-3, and for now it looks good.
I don't know yet whether the fix is permanent.
 
I assume everyone with Windows sees the same effect, the boot circle stuck and not spinning?

How does it manifest for those running something else, like Linux? Does it get stuck at the same point every time (if you have seen it on different VMs, or multiple times on the same one)?

Do you have screenshots and/or boot logs?
 
