Windows VMs stuck on boot after Proxmox Upgrade to 7.0

dea · Mar 30, 2022

vulture said:
Tried the 5.15-Kernel (pve-kernel-5.15.30-1-pve) - my test VM still hangs on reboot.

Bad news. I will prepare myself psychologically for the next MS$ update and the related bloodbath of VMs that will get stuck during the night following the automatic reboot.

marcus.hanikat · Mar 30, 2022

I think I might have found a solution for 3 of our VMs. The VMs run Windows Server 2022 and had the following configuration when the problem was observed:

Code:

agent: 1
bios: ovmf
boot: order=scsi0;net0
cores: 2
cpu: host
machine: pc-q35-5.1
memory: 4096
meta: creation-qemu=6.2.0,ctime=1648638637
name: Win2022Test
net0: virtio=6A:78:4E:4E:5D:43,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: win11
scsi0: proxmox:102/vm-102-disk-0.qcow2,size=50G
scsihw: virtio-scsi-pci
smbios1: uuid=047aac44-58ec-48c9-a917-957bbbcdf899
sockets: 1
tpmstate0: proxmox:102/vm-102-disk-0.raw,size=4M,version=v2.0
vmgenid: e66d0391-63d2-4e55-b933-426803a8f8d6

I found the following Windows system logs, in sequence, during the reboot process of the VM:

The operating system is shutting down at system time 2022-03-30T14:03:53.680178600Z.
The operating system started at system time 2022-03-30T23:10:58.500000000Z.
The last shutdown's success status was true. The last boot's success status was true.
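To put a number on the gap between the two log timestamps above, here is a minimal sketch. It assumes GNU date (for the -d option) and drops the fractional seconds; the function name seconds_between is only an illustration.

```shell
# Seconds between two ISO 8601 UTC timestamps (requires GNU date for -d).
seconds_between() {
    start="$(date -u -d "$1" +%s)"
    end="$(date -u -d "$2" +%s)"
    echo $(( end - start ))
}

# Timestamps copied from the log lines above, fractional seconds dropped.
gap="$(seconds_between "2022-03-30T14:03:53Z" "2022-03-30T23:10:58Z")"

# Break the gap down into hours, minutes and seconds.
printf '%dh %dm %ds\n' $(( gap / 3600 )) $(( gap % 3600 / 60 )) $(( gap % 60 ))
# prints "9h 7m 5s"
```

So according to the guest's own log, roughly nine hours passed between "shutting down" and "started".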

These logs are from a VM that was rebooted from within Windows and got stuck during the reboot. Windows seems to think the reboot was successful, but notice the time difference between the log entries for "system is shutting down" and "system started"? In reality these entries should have been produced a couple of seconds apart. Following up on this strange difference in time, I decided to try disabling local time for the RTC (so that the RTC runs in UTC) by adding the following to my VM configuration:

Code:

localtime: 0

This can also be achieved by changing the setting "Use local time for RTC" in the Proxmox GUI.
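The same option can presumably be set from the node's shell with qm set. The sketch below is not from the original post: the helper names run and set_rtc_utc are hypothetical, the VM ID 102 is taken from the disk paths in the config above, and by default the helper only prints the command instead of running it.

```shell
# Echo the command when DRY_RUN=1 (the default here), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Turn off "Use local time for RTC" for one VM (hypothetical helper name).
set_rtc_utc() {
    run qm set "$1" --localtime 0
}

set_rtc_utc 102
```

With DRY_RUN=0 the command is actually executed. The running VM most likely needs a full stop and start before the new setting takes effect.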

After disabling local time for the RTC, I did the following:

1- Started up the VM with local time for the RTC disabled
2- Configured the correct timezone for the VM (same as my Proxmox nodes use)
3- Forced Windows to sync time with NTP
4- Waited for 5 minutes before rebooting the VM (it does not seem to work if I reboot the VM immediately after the NTP sync)

Following these steps seems to have resolved the reboot issues for the 3 VMs I have tried it on so far. I will try the same process on a couple more VMs tomorrow to verify that this actually solved the problem completely for us.

It seems like somehow the RTC time for our VMs is 2 hours behind, even though all our nodes report the correct local time. However, as long as the clock in our VMs runs in local time, and not the time reported by the RTC in this case, they don't seem to have any issues rebooting.

marcus.hanikat · Mar 31, 2022

After investigating this a bit more, it seems I found the underlying problem, at least for our environment. There was a difference between the local time reported by our Proxmox nodes (found by running the command "date" in the console) and the hardware clock (found by running the command "hwclock" in the console). The hardware clock was running 2 hours behind, so I decided to sync the hardware clock with the local time used by our Proxmox nodes by running the command "hwclock --systohc". After doing this and rebooting our VMs (the first reboot still required manually stopping and starting the VM), I can no longer reproduce the problem during subsequent reboots in our environment.
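For anyone who wants to check their own nodes, the comparison between "date" and "hwclock" can be sketched roughly as below. The function names and the 5 second tolerance are my own assumptions. The commented usage lines additionally assume that hwclock prints an ISO-style timestamp that GNU date can parse, which may not hold for older util-linux versions.

```shell
# Print the signed offset (hardware clock minus system clock) in seconds.
clock_offset() {
    sys_epoch="$1"
    hw_epoch="$2"
    echo $(( hw_epoch - sys_epoch ))
}

# Report whether the offset exceeds a tolerance (default 5 s, an arbitrary choice).
check_offset() {
    offset="$(clock_offset "$1" "$2")"
    tolerance="${3:-5}"
    abs="${offset#-}"   # strip a leading minus sign to get the absolute value
    if [ "$abs" -gt "$tolerance" ]; then
        echo "hardware clock off by ${offset}s; consider: hwclock --systohc"
    else
        echo "clocks agree within ${tolerance}s"
    fi
}

# Usage on a node, as root (left commented out because it needs hwclock access):
#   sys="$(date +%s)"
#   hw="$(date -d "$(hwclock --show)" +%s)"
#   check_offset "$sys" "$hw"
```

A hardware clock that is 2 hours behind, as in our case, would show up as an offset of -7200 seconds.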

dea · Mar 31, 2022

No, that's not my problem. On all my clusters that exhibit this behavior, "date" and "hwclock" are perfectly aligned. But the problem still arises.

weehooey-bh · Mar 31, 2022

vulture said:
Tried the 5.15-Kernel (pve-kernel-5.15.30-1-pve) - my test VM still hangs on reboot.

@vulture What do you need to do to get the test VM to hang?

Asking because, after a hang, it seems to take a while before the same VM will hang again.

Do you have to leave it for X days? Or is there something you can do to trigger it?

dea · Mar 31, 2022

The simulation should be done like this: take a snapshot of the VM including RAM, then try to restart it. If the problem occurs, then THAT SNAPSHOT is the one to use for tests, because it contains the state of the machine and the RAM. Since the problem does not occur systematically, and above all not immediately, it is necessary to capture the state of the machine at a moment in which the problem arises.
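On the command line, the snapshot procedure described above might look like the sketch below. The helper names, the VM ID 102 and the snapshot name pre-reboot are placeholders, and by default the commands are only printed, not executed.

```shell
# Echo the command when DRY_RUN=1 (the default here), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Take a snapshot that includes the RAM state before attempting a reboot.
snapshot_with_ram() {
    run qm snapshot "$1" "$2" --vmstate 1
}

# Return the VM to the saved state so the reboot can be attempted again.
rollback_to() {
    run qm rollback "$1" "$2"
}

snapshot_with_ram 102 pre-reboot
rollback_to 102 pre-reboot
```

If the reboot hangs after the snapshot, rolling back and rebooting again should show whether that saved state reproduces the hang.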

dea · Apr 5, 2022

Any updates related to this issue (which is really becoming more than just a problem)? We are close to the next Patch Tuesday (which results in a massive reboot of many Windows VMs). Do you need further help from us to identify the cause?

alexander@cloud · Apr 13, 2022

As @dea said - today all (!) of our Windows VMs (count: 20) rebooted after a Windows update and got stuck in a reboot loop. We are using the latest Proxmox version (enterprise) as of today.

See here for the cluster/vm config: https://forum.proxmox.com/threads/w...xmox-upgrade-to-7-0.100744/page-2#post-451057

alexander@cloud · Apr 15, 2022

Moayad said:
Hi,

Can you reproduce the issue on the same VM? If so, please do the following:
1- Since at the moment you cannot predict whether the issue will happen after the next reboot, please take a snapshot whenever you suspect the VM will hang on the upcoming reboot.
2- If the issue occurs, please roll back to the snapshot, make a backup of the VM, and provide the backup file.

This will help us dig deeper into the issue.

@Moayad I tried to reproduce the issue: I restored one VM to a state before the Windows updates on Tuesday and ran the update again, and there was no reboot loop this time. I tried with a manually initiated reboot after the update, and after restoring to a snapshot from before the reboot I also tried with the automatic reboot initiated by Windows (at 2am). Unfortunately it was not reproducible.

However, we have had these reboot loop issues twice already (see my previous comments). It was always (!) on the Wednesday after Microsoft's Tuesday patch day. The first one was on 9th Feb, the second one on 13th April. I am really concerned that on the next Microsoft patch day all VMs will get stuck again.

How can we help you fix this?

dea · Apr 15, 2022

I am pinning my hopes on QEMU 6.2.

I find the questions asked by the Proxmox Support team strange. It seems this thread is not even being read. Several tickets have been opened, and the team claims that they were never able to replicate the problem. They ask for information that is evident to anyone who reads this thread, but obviously nobody does. For example, they ask after which Windows update the problem appears. It is apparently not clear to them that this problem arises not after a specific update but after the restart of a VM that has been running for several days. The typical case is the restart of a server following Windows updates, but that is a consequence, not a cause.

I am pinning my hopes on QEMU 6.2.

chojin · Apr 20, 2022

Hello,

We have the same problem on several of our customers' 7.1 installs.

Windows VMs are stuck after a reboot following Windows Update.

This is very problematic since it can disrupt many services.

It doesn't seem to affect our 5.x and 6.x customers though.

itNGO · Apr 20, 2022

Hi,
just to clarify. This has nothing to do with Windows Update itself. It is more a matter of runtime/uptime of the VM.... reboot it after a day or two... no problem. Reboot after 4 to 7 days.... maybe it works.... all VMs running longer than 7 days, 99% chance you get stuck on reboot with black screen....

And with this, a Snapshot with RAM does not help... it is more that Proxmox/KVM-Process itself has to be running for several days to run into this issue....
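As a rough illustration of the uptime pattern described above, the sketch below maps the age of a VM's KVM process to the buckets from this post. The function name and labels are made up, treating everything under 4 days as "low" is my own assumption (the post does not cover days 2 to 4), and the pid file path in the commented usage is an assumption about where Proxmox keeps it.

```shell
# Map a KVM process age in seconds to the rough risk buckets from the post above.
reboot_risk() {
    age=$(( $1 ))   # normalize the input, e.g. leading spaces from ps output
    if [ "$age" -gt 604800 ]; then      # running longer than 7 days
        echo "high"
    elif [ "$age" -ge 345600 ]; then    # running 4 to 7 days
        echo "maybe"
    else                                # under 4 days (assumed to be low risk)
        echo "low"
    fi
}

# Usage on a node (commented out; the pid file location is an assumption):
#   pid="$(cat /var/run/qemu-server/102.pid)"
#   reboot_risk "$(ps -o etimes= -p "$pid")"
```

These thresholds only restate one observation from this thread and are not a verified rule.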

alexander@cloud · Apr 20, 2022

I can only back up @chojin's answer. This is absolutely problematic and very disruptive for all services, as we also had multiple VMs that got stuck and did not come up.

Is there already an official statement from a Proxmox team member?

ITT · Apr 20, 2022

Same story here. Anyone tested Qemu-6.2?

dea · Apr 20, 2022

... the Proxmox development team claims that they cannot reproduce the problem. As said before, I do not find that credible.
Here is the umpteenth VM frozen after a reboot.
Can't reproduce ... let's say my average is about 30-35 VMs freezing once a month ... either I'm particularly unlucky or they're particularly lucky.

fireon · Apr 20, 2022

Same here on many VMs!

Juanfran · Apr 21, 2022

Another one here!

ITT · Apr 21, 2022

Tested one of our HPE DL380 G10 test machines with the very newest Qemu-6.2.0-3 and for now it looks good.
I don't know yet if this will last....

fireon · Apr 21, 2022

Again today: 4 Ubuntu 20.04 VMs stuck after the automatic reboot at night.

aaron · Apr 21, 2022

I assume that everyone with Windows VMs sees the same effect: the boot circle is stuck and not spinning?

How does it show for those running something else like Linux? Does it get stuck at the same point all the time (if you had it on different VMs or multiple times on the same)?

Do you have screenshots and/or boot logs?
