Windows VMs stuck on boot after Proxmox Upgrade to 7.0

Very interesting. I am convinced that with kernel 5.4, Proxmox 7.2 does not show the problem either (obviously this needs to be verified, and it would be interesting to do so).

Do you have a test node with a fully updated Proxmox 7.2-7 but with the kernel downgraded to 5.4, so that you can test it for 30 days with some Windows VMs?
As of today I have two production PVE 7 clusters (3 nodes each) at schools (currently on vacation), fully upgraded and running kernel 5.4. These clusters had kernel 5.15, and today about 20 VMs showed the spinning circle.
 
Perfect. While you try 7.2-7 nodes with a 5.4 kernel, a question naturally arises for the Proxmox developers: if it is true that a 5.4 kernel gives no problems, all else being equal, while a 5.11 or higher kernel does, what differs between them?
 
It seems that something changed after I applied the suggested modification.
I just rebooted all my VDI VMs with Windows 2019 and all came up OK (163 VMs with Windows 2019, some with uptime since 17-7-2022).
 
Sorry, which modification exactly?
Several have been proposed ...
Various downgrades, tuning in VM options, etc.
 
So at one point this issue went away, maybe a few months ago (sorry, I'd have to go back in this thread). I hadn't had any issues on 5.13 and QEMU 6.2 until the most recent update. I'm using UEFI (I know you were looking for people using the 4m version) and a Q35 machine. It acts very strangely, and I'm wondering if it's partly a Windows thing for us Windows Server users. I can shut the VM down cleanly but I can't reboot it; it just hangs there. I also can't install any Windows updates. If I install them and shut down, then as soon as it boots it pulls them all back out, saying it ran into an issue. I just spent 3 hours rolling updates in and out trying to find the cause and got nothing. I can't make it do anything other than what's described. If you have something you want me to test, let me know.

I'm rolling PVE back to 5.13; I just realized it pushed me up to 5.15. I'll let you know what I find.
 
Maybe my five cents help here.

I have a Windows 8.1 VM which was running well on PVE 6.4-15. I already experienced some black screen issues after installing updates a while ago, but they could be resolved by resetting the VM, as most of you confirmed. The VM has also been shut down for months now because I only use open source systems for my daily work.

Now I migrated the VM from 6.4-15 to 7.2-7, and on 7.2-7 it did not work (black screen issue). So I started the old VM on 6.4-15 again, where it started correctly and installed some updates (maybe they had not been finished before). After copying the file to the new VM host, BOTH VMs no longer start and show the black screen issue. Unfortunately I don't have full disk backups, so I can't reproduce the working state on the old Proxmox host anymore. If I execute sfc /scannow in recovery mode on both VMs, it says everything is clean, so the installation seems fine.

The current behaviour is as follows, and it does not change no matter what I have tried so far (at the beginning it shows the Proxmox logo; that's just missing from the GIF):

Peek 2022-08-15 00-59.gif

The config of the VM on 7.2-7 is as follows:

Code:
agent: 0
args: -machine smm=off
boot: order=sata0
cores: 4
machine: pc-q35-5.1
memory: 3072
meta: creation-qemu=6.2.0,ctime=1660405498
name: sg-win81bitch-01-swb-cvm-win
net0: e1000=MYMACADDRESS,bridge=vmbr1,firewall=1
numa: 0
ostype: win8
sata0: local-root:103/vm-103-disk-0.qcow2,size=128G
sata1: local:iso/virtio-win.iso,media=cdrom,size=519030K
sata2: local:iso/Win8.1_German_x64.iso,media=cdrom,size=4263300K
scsihw: virtio-scsi-pci
smbios1: uuid=MYUNIQUEID
sockets: 1
vga: qxl,memory=32
vmgenid: ANOTHERUNIQUEID

What I already tried without any change (you can see some of the changes in the config above):
args: -machine smm=off
virtio0/ide0 instead of sata0 (I thought it was a virtio driver issue before)
changing ostype to win10 or win11
changing machine to pc-q35 and switching versions
using noVNC instead of SPICE

The original config on 6.4-15 was:

Code:
agent: 0
bootdisk: virtio0
cores: 4
ide2: none,media=cdrom
memory: 3072
name: MOVED-sg-win81bitch-01-swb-cvm-win
net0: e1000=MYMACADDRESS,bridge=vmbr5,firewall=1
numa: 0
onboot: 1
ostype: win8
scsihw: virtio-scsi-pci
smbios1: uuid=MYUNIQUEID
sockets: 1
vga: qxl,memory=32
virtio0: local-root:103/vm-103-disk-0.qcow2,size=79G
vmgenid: ANOTHERUNIQUEID
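
To make the differences between the two configs above easier to see, here is a minimal Python sketch that parses `key: value` lines and lists the keys that changed. The function names `parse_vm_config` and `diff_configs` are hypothetical helpers, not part of any Proxmox tool, and the two excerpts only reuse a few lines from the configs posted above.

```python
def parse_vm_config(text: str) -> dict:
    """Parse 'key: value' lines of a VM config into a dict (comments and blanks skipped)."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split only at the first colon, since values may contain colons too.
        key, _, value = line.partition(":")
        config[key.strip()] = value.strip()
    return config


def diff_configs(old: dict, new: dict) -> dict:
    """Return {key: (old_value, new_value)} for differing keys; None marks a missing key."""
    keys = sorted(set(old) | set(new))
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}


# Excerpts taken from the two configs posted above (not the full files).
OLD_EXCERPT = """
bootdisk: virtio0
memory: 3072
ostype: win8
virtio0: local-root:103/vm-103-disk-0.qcow2,size=79G
"""

NEW_EXCERPT = """
args: -machine smm=off
boot: order=sata0
machine: pc-q35-5.1
memory: 3072
ostype: win8
sata0: local-root:103/vm-103-disk-0.qcow2,size=128G
"""

changes = diff_configs(parse_vm_config(OLD_EXCERPT), parse_vm_config(NEW_EXCERPT))
for key, (before, after) in changes.items():
    print(f"{key}: {before!r} -> {after!r}")
```

Keys that are identical in both excerpts (here `memory` and `ostype`) are left out, so only the disk bus, boot order, machine type and extra args remain to look at.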

Since I can reproduce the issue every time, maybe I can help you by trying things out?

6.4-15 is running 5.4.195-1-pve #1 SMP PVE 5.4.195-1 (Wed, 13 Jul 2022 13:19:46 +0200) x86_64 GNU/Linux
7.2-7 is running 5.15.39-1-pve #1 SMP PVE 5.15.39-1 (Wed, 22 Jun 2022 17:22:00 +0200) x86_64 GNU/Linux

The 6.4-15 node is not in use, so I can also change kernels there. On the 7.2-7 node (which is used for production) this would be a bit more difficult but not impossible.
 
Does somebody have a Red Hat account to see the details of this link?

https://access.redhat.com/solutions/6836351
Code:
Issue
Microsoft Windows guests hang while rebooting, the bootscreen shows the Windows log and the boot circle endlessly.
Resolution
There is no resolution or workaround at this time, please power off the Guest and power on again.
Root Cause
This is Bug 1975840 - Windows guest hangs after updating and restarting from the guest OS and is caused by a TSC (Time Stamp Counter) issue in Guest when it overflows during boot time if the value provided by the virtualization platform is already high. It tends to affect Guests with longer uptimes.


 
Hmm, OK, this one is not tlbflush related. Thanks.
I'll dig into TSC; I have seen some recent Red Hat Bugzilla entries about the Hyper-V TSC value too, so maybe it could help.



I have found 2 other links in a related bugzilla
https://bugzilla.redhat.com/show_bug.cgi?id=2070417

"
https://gitlab.com/redhat/rhel/src/qemu-kvm/qemu-kvm/-/merge_requests/172
https://issues.redhat.com/browse/RHELPLAN-117453
"
Are you able to read them?
 
About the Red Hat Bugzilla entry on the TSC bug:

I have found the fix as a patch in the Red Hat QEMU package:

https://git.centos.org/rpms/qemu-kv...target-i386-properly-reset-TSC-on-reset.patch

https://github.com/qemu/qemu/commit/5286c3662294119dc2dd1e9296757337211451f6


It doesn't seem to be related to the kernel here, only QEMU. So I'm not sure it's related to our problem, but it looks like exactly the same hang we have.

Code:
Date: Thu, 24 Mar 2022 09:21:41 +0100
Subject: [PATCH] target/i386: properly reset TSC on reset
 

Some versions of Windows hang on reboot if their TSC value is greater
than 2^54.  The calibration of the Hyper-V reference time overflows
and fails; as a result the processors' clock sources are out of sync.
 
The issue is that the TSC _should_ be reset to 0 on CPU reset and
QEMU tries to do that.  However, KVM special cases writing 0 to the
TSC and thinks that QEMU is trying to hot-plug a CPU, which is
correct the first time through but not later.  Thwart this valiant
effort and reset the TSC to 1 instead, but only if the CPU has been
run once.
 
For this to work, env->tsc has to be moved to the part of CPUArchState
that is not zeroed at the beginning of x86_cpu_reset.
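
To get a feel for the 2^54 threshold mentioned in the patch description, here is a rough Python calculation of how long a guest would have to run before its TSC passes that value. It assumes the guest TSC starts near zero when the VM is started and ticks at a constant rate equal to the nominal frequency; both are simplifying assumptions, and the frequencies are example values only.

```python
TSC_LIMIT = 2 ** 54          # threshold named in the patch description
SECONDS_PER_DAY = 86_400


def days_until_limit(tsc_hz: float, start_tsc: int = 0) -> float:
    """Days of uptime until the TSC reaches 2^54, assuming a constant tick rate."""
    remaining_ticks = TSC_LIMIT - start_tsc
    return remaining_ticks / tsc_hz / SECONDS_PER_DAY


# Example frequencies (assumed, not taken from the thread).
for ghz in (2.0, 3.0, 4.0):
    print(f"{ghz:.1f} GHz: about {days_until_limit(ghz * 1e9):.1f} days")
```

Under these assumptions a 3 GHz TSC reaches 2^54 after roughly 69.5 days, which would fit the observation that the hang tends to hit guests with longer uptimes.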
 
There's now a qemu 7.0 package in the pvetest repository containing the above patch.
 
If somebody wants to test, I have also built a 6.2 version with the patch (use at your own risk ;)

https://mutulin1.odiso.net/pve-qemu-kvm_6.2.0-11_amd64.deb


I think that if a VM is already running and would hang at reboot, a live migration to a patched QEMU (6.2 or 7.x) followed by a reset from the Proxmox GUI should work.
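
For anyone who prefers the command line, the migrate-then-reset idea above might look roughly like the sketch below. It only builds and prints the `qm` commands; it does not run them. The VM ID `100` and the node name `pve-patched` are hypothetical, and using `qm reset` in place of the GUI reset button is my assumption.

```python
import shlex


def workaround_commands(vmid: int, target_node: str) -> list:
    """Build the two qm invocations: live-migrate to a patched node, then hard reset."""
    return [
        # Run on the node that currently hosts the VM.
        ["qm", "migrate", str(vmid), target_node, "--online"],
        # Run on the target node once the migration has finished.
        ["qm", "reset", str(vmid)],
    ]


# Hypothetical VM ID and node name, for illustration only.
commands = workaround_commands(100, "pve-patched")
for cmd in commands:
    print(shlex.join(cmd))
```

Printing the commands first makes it easy to review them before running anything against a production VM.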
 
Thank you @wbumiller and @spirit !!!!

@wbumiller why did you release the patch in question on the QEMU 7 series? Is the intention to release the next Proxmox 7.3 with the QEMU 7 series?

I have a test node on 7.2-7 enterprise with QEMU downgraded to 6.0.0-3 (see above). I wonder whether it makes sense to continue testing the downgrade or to upgrade to the 7.0 series with the patch in question.

Sorry @wbumiller, is the new patched QEMU 7.0.0-2?
 
The fix is already available in QEMU 7.0, so no backport was necessary there.
We've also been testing it here on our workstations for quite a while and haven't seen any issues so far. So for now it's the one to test the fix.
Depending on the feedback we get, and our decision on when to release QEMU 7.0 to pve-no-subscription and pve-enterprise, we might provide a QEMU 6.2 version with the fix backported as well.
 
Many thanks @mira. I hope this bug can be closed as soon as possible, after 8 months of delirium (by now we have got into the habit of powering the VMs, about a hundred of them, off and on before the monthly upgrades, so that they restart without problems at upgrade time).
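
The power-cycle habit described above could be narrowed down to the VMs that are actually close to the threshold. The Python sketch below flags VMs whose estimated TSC would pass 2^54 within a safety margin. It assumes the TSC started near zero at VM start and ticks at a constant rate; the VM names, uptimes and the 3 GHz frequency are made-up example values.

```python
TSC_LIMIT = 2 ** 54          # value from the QEMU patch description
SECONDS_PER_DAY = 86_400


def crosses_limit(uptime_days: float, tsc_hz: float, margin_days: float = 0.0) -> bool:
    """True if the estimated TSC reaches 2^54 within margin_days from now."""
    ticks = (uptime_days + margin_days) * SECONDS_PER_DAY * tsc_hz
    return ticks >= TSC_LIMIT


# Hypothetical uptimes in days; real values would come from your own monitoring.
uptimes_days = {"vm-a": 10, "vm-b": 45, "vm-c": 80}

# A 30-day margin covers the time until the next monthly upgrade window.
to_cycle = [
    name
    for name, days in uptimes_days.items()
    if crosses_limit(days, tsc_hz=3.0e9, margin_days=30)
]
print("power-cycle before upgrade:", to_cycle)
```

This is only an estimate; on an unpatched QEMU it is still safer to err on the side of power-cycling a VM.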
 
