Windows VMs stuck on boot after Proxmox Upgrade to 7.0

OK, so I finally hit this issue myself last night, after running Windows Update manually and rebooting from the Windows Update screen. The machine did not boot; it only showed the Windows logo with the spinning 'loading' animation. After a 'Stop' and 'Start', it booted normally and finished its updates.

I asked around on the dev list, and it would help if anybody has a snapshot of an affected machine (including state!) taken just before the reboot that fails. That way we might be able to reproduce the problem. I just tried to reproduce it with another Windows server, but of course that one booted fine.

So if anyone is able to provide access to such a snapshot, give me or @fabian a ping so we can start trying to debug this issue.
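
A snapshot that includes the RAM state can be taken from the node's CLI. A minimal sketch, assuming VMID 115 and the snapshot name "before-reboot" (both placeholders):

Code:
# take a snapshot including the running VM's memory state
qm snapshot 115 before-reboot --vmstate 1 --description "state just before the failing reboot"
# confirm the snapshot exists
qm listsnapshot 115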
 
We have tested a few things. This seems to be an issue with Hyper-V VMs imported into Proxmox on 7.1-4. We have been importing Gen 2 Hyper-V VMs into Proxmox, and also importing from another KVM-based hypervisor using qm importdisk. The imports from the KVM-based hypervisor work fine; the Hyper-V imports have been causing trouble ONLY on 7.1-4. The Hyper-V imports to 6.4-13 are not having this issue.
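
For reference, a rough sketch of the import workflow. VMID 120, the VHDX path and the storage name are placeholders rather than our exact values, and qm importdisk prints the name of the volume it allocates, which is what gets attached in the last step:

Code:
# create an empty VM for the imported guest
qm create 120 --name imported-w2k19 --memory 8192 --cores 4 --ostype win10 --scsihw virtio-scsi-pci --net0 virtio,bridge=vmbr0
# import the Hyper-V disk image into a Proxmox storage
qm importdisk 120 /mnt/hyperv/disk.vhdx VM_SSD
# attach the imported volume and boot from it
qm set 120 --virtio0 VM_SSD:vm-120-disk-0 --boot order=virtio0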

I am worried that upgrading our cluster to 7.1-4 is now going to make this issue appear. Windows VMs get stuck on a black screen, and the only solution is a stop/start. This is not ideal, as it means we are having to fix this manually in the wee hours of the morning before the scheduled reboots take effect.

I think this is an issue with UEFI/EFI, but I am not 100% sure at this point; 7.1-4 seems to be the only thing that is different and showing these issues.
Can you post a vm config?
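
For anyone following along, the config can be dumped on the node itself (115 is just an example VMID); the bios / efidisk lines there also show whether the guest boots with SeaBIOS or OVMF:

Code:
# print the current configuration of the VM
qm config 115
# or read the raw config file directly
cat /etc/pve/qemu-server/115.conf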
 
We are seeing the same behaviour since upgrading to 7.x. At first, we thought it was Windows Updates, because updates always break something and you usually only reboot a Windows Server when you are applying updates.

A sample of the machines that have hung:

Code:
Windows Server 2019 Datacenter (1809)
agent: 1
balloon: 4096
bootdisk: virtio0
cores: 4
cpu: Westmere
cpuunits: 2048
ide0: none,media=cdrom
localtime: 1
memory: 8192
name: WEHY-SM
net0: virtio=8A:81:1A:36:82:C8,bridge=vmbr0,firewall=1,link_down=1,tag=102
net1: virtio=6A:44:2E:95:F3:09,bridge=vmbr0,firewall=1,tag=108
numa: 1
onboot: 1
ostype: win10
protection: 1
scsihw: virtio-scsi-single
smbios1: uuid=4778cce2-9222-46f0-870d-7c1230f07d84
sockets: 2
startup: order=400,up=60
virtio0: VM_SSD:vm-115-disk-1,discard=on,size=64G
virtio1: VM_SSD:vm-115-disk-0,discard=on,size=96G
vmgenid: 4f27a838-0055-473c-8c4d-52727aac1525

Windows Server 2016 Standard (1607)
agent: 1
balloon: 8192
bootdisk: virtio0
cores: 4
cpu: Westmere
ide0: none,media=cdrom
memory: 16384
name: FC-01
net0: virtio=6A:6B:6A:51:6D:37,bridge=vmbr0,firewall=1,tag=2000
numa: 1
onboot: 1
ostype: win10
protection: 1
scsihw: virtio-scsi-pci
smbios1: uuid=a98ebbed-ac84-4418-a083-b7a37ac4555d
sockets: 2
startup: order=200,up=60
virtio0: VM_HDD:vm-106-disk-0,discard=on,size=1000G
vmgenid: ffa8c494-81e3-42ad-8818-cb6a8800b434

Windows Server 2019 Datacenter (1809)
agent: 1
boot: order=scsi0;net0
cores: 1
cpu: Westmere
ide0: none,media=cdrom
memory: 4098
name: AC-W01
net0: virtio=5A:42:EC:03:A5:C6,bridge=vmbr0,firewall=1,tag=30
numa: 1
onboot: 1
ostype: win10
protection: 1
scsi0: VM_HDD:vm-149-disk-0,discard=on,size=64G
scsihw: virtio-scsi-pci
smbios1: uuid=d26bb6a2-e672-43c0-9bdb-f23599355c64
sockets: 2
startup: order=600,up=60
vmgenid: 2265db06-3391-41f0-8367-d585199d6ef5

Windows Server 2019 Standard (1809)
agent: 1
balloon: 4096
bootdisk: scsi0
cores: 2
cpu: Westmere
ide2: none,media=cdrom
memory: 6144
name: AC-D2
net0: virtio=EA:C7:ED:F1:B4:14,bridge=vmbr0,firewall=1,tag=1300
numa: 1
onboot: 1
ostype: win10
protection: 1
sata0: none,media=cdrom
scsi0: VM_HDD:vm-119-disk-0,discard=on,size=64G
scsihw: virtio-scsi-pci
smbios1: uuid=417f381c-ae1a-4593-8f24-5a7ff613e40c
sockets: 2
startup: order=200,up=60
vmgenid: f3a1bc1a-a35b-43bc-8a81-ef623efc966d
 
What we have learned so far from our own research, combined with what others have reported in this forum thread:

- Windows does not report anything in the boot log (ntbtlog.txt) -- this suggests the issue happens very early in the boot process
- The Windows Server version does not appear to matter -- 2016 through 2022 are affected
- Only one person has reported an OS other than Windows
- The hang only happens on a reboot, and only if the VM has been running for a while
- Hard powering off the VM and starting it again always boots fine (see the commands after this list)
- Reports across all storage types: Ceph, ZFS and NFS
- Reports with both controller types: SATA and SCSI
- Reports with both BIOS and UEFI
- VMs built both before and after the upgrade from 6.x to 7.x are affected
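
The stop/start workaround referenced above comes down to the following (115 is an example VMID); the reports so far suggest a full power-off and start is what clears the hang:

Code:
# hard stop the hung VM, then start it again
qm stop 115
qm start 115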

Output from one of our nodes where VMs have experienced the issue:

Code:
proxmox-ve: 7.1-1 (running kernel: 5.13.19-3-pve)
pve-manager: 7.1-10 (running version: 7.1-10/6ddebafe)
pve-kernel-helper: 7.1-8
pve-kernel-5.13: 7.1-6
pve-kernel-5.4: 6.4-12
pve-kernel-5.3: 6.1-6
pve-kernel-5.0: 6.0-11
pve-kernel-5.13.19-3-pve: 5.13.19-7
pve-kernel-5.4.162-1-pve: 5.4.162-2
pve-kernel-5.4.143-1-pve: 5.4.143-1
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph: 15.2.15-pve1
ceph-fuse: 15.2.15-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.1
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-6
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.1-2
libpve-guest-common-perl: 4.0-3
libpve-http-server-perl: 4.1-1
libpve-storage-perl: 7.0-15
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.11-1
lxcfs: 4.0.11-pve1
novnc-pve: 1.3.0-1
openvswitch-switch: 2.15.0+ds1-2
proxmox-backup-client: 2.1.4-1
proxmox-backup-file-restore: 2.1.4-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-5
pve-cluster: 7.1-3
pve-container: 4.1-3
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-4
pve-ha-manager: 3.3-3
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.0-3
pve-xtermjs: 4.16.0-1
qemu-server: 7.1-4
smartmontools: 7.2-pve2
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.2-pve1
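
This listing is the output of pveversion -v; if others who are seeing the hang post the same, it should make comparing package versions easier:

Code:
pveversion -v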
 
Same story on many Windows VMs in our cluster (Windows Server 2012/2016/2019): NFS storage and SCSI disks.
 

@Whatever thanks for adding to the data.

Some additional information we have found:

- It is affecting both HA and non-HA VMs
- When a VM is booting, at some point the PVE console window resizes (when the OS and hardware agree on a resolution?). The hang happens before the console resizes.
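
To check whether a hung guest is actually paused or still running from QEMU's point of view, the monitor can be queried (115 is an example VMID). This is only a diagnostic sketch; we have not confirmed it reveals anything about the cause:

Code:
# open the QEMU human monitor for the VM
qm monitor 115
# then type "info status" at the monitor prompt to see the run state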
 
cpu: Westmere
I don't think the latest versions of Windows like this old CPU much...
Can you give it a try with:
cpu: host
?
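
For anyone who wants to test this, switching the CPU type is a one-liner (115 is an example VMID); the change should only take effect after the next full stop and start of the VM:

Code:
# switch the virtual CPU model to host passthrough
qm set 115 --cpu host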
 
Hey @Emilien

Sure, we can try that. One of the VMs that froze has been changed to cpu: host.

Some of these VMs have been running on the same hardware for over two years and were only switched to cpu: Westmere about a year ago. This same cluster was only upgraded to 7.x this year, which is when the hanging started.

Additionally, some people are reporting Windows Server 2012 having the issue. I have not seen any reports of Windows Server 2008 having the issue.

It appears that some interaction between PVE 7.x and Windows only comes into play once the VM has been running for a while.

At this point, it appears the issue will arise somewhere between 5 days and ~22 days of uptime.
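
To keep an eye on which VMs are approaching that uptime window, the guest uptime (in seconds) can be read per VM; a rough sketch, again with 115 as a placeholder VMID:

Code:
# verbose status output includes an "uptime" field in seconds
qm status 115 --verbose | grep uptime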
 
Same here... about 25 of our 150 VMs have shown this.
2016/2019/2022
 

I can confirm. On our clusters, Server 2012, 2016 and 2019 VMs have the same problem. 2008 R2 servers (with active extended support from Microsoft) have never experienced this problem.
 
