Windows VMs stuck on boot after Proxmox Upgrade to 7.0

lolomat · Dec 2, 2021

Hi guys,

I hope you can help me with a strange problem I've encountered with my Windows VMs since upgrading Proxmox from 6.4-13 to 7.0.
When I restart a VM that has been running for a couple of weeks, the boot process often gets stuck on the Windows boot logo and the "circle" just keeps on spinning.

If I shut down the machine and just boot it again, it boots without an issue. Restarting a VM that was just booted causes no problems either.

The issue occurs only with Windows machines that have been running for a couple of weeks; Linux VMs are not affected.

Code:

pveversion -v
proxmox-ve: 7.0-2 (running kernel: 5.11.22-3-pve)
pve-manager: 7.0-11 (running version: 7.0-11/63d82f4e)
pve-kernel-5.11: 7.0-7
pve-kernel-helper: 7.0-7
pve-kernel-5.4: 6.4-5
pve-kernel-5.11.22-4-pve: 5.11.22-8
pve-kernel-5.11.22-3-pve: 5.11.22-7
pve-kernel-5.4.128-1-pve: 5.4.128-1
ceph: residual config
ceph-fuse: 16.2.5-pve1
corosync: 3.1.2-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.21-pve1
libproxmox-acme-perl: 1.3.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.0-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-6
libpve-guest-common-perl: 4.0-2
libpve-http-server-perl: 4.0-2
libpve-storage-perl: 7.0-11
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.9-4
lxcfs: 4.0.8-pve2
novnc-pve: 1.2.0-3
openvswitch-switch: 2.15.0+ds1-2
proxmox-backup-client: 2.0.9-2
proxmox-backup-file-restore: 2.0.9-2
proxmox-mini-journalreader: 1.2-1
proxmox-widget-toolkit: 3.3-6
pve-cluster: 7.0-3
pve-container: 4.0-9
pve-docs: 7.0-5
pve-edk2-firmware: 3.20200531-1
pve-firewall: 4.2-2
pve-firmware: 3.3-1
pve-ha-manager: 3.3-1
pve-i18n: 2.5-1
pve-qemu-kvm: 6.0.0-4
pve-xtermjs: 4.12.0-1
qemu-server: 7.0-13
smartmontools: 7.2-pve2
spiceterm: 3.2-2
vncterm: 1.7-1
zfsutils-linux: 2.0.5-pve1

Code:

qm config 129
balloon: 0
bootdisk: sata0
cores: 4
cpu: Westmere,flags=+pcid
description: %0A%0AWindows Server 2016%0ADomain Controller
ide2: none,media=cdrom
memory: 8192
name: redacted
net0: virtio=66:31:34:34:38:37,bridge=vmbr0,tag=18
numa: 0
ostype: win8
sata0: ceph:vm-129-disk-0,cache=writeback,discard=on,size=100G
smbios1: uuid=e214a6aa-92e6-43ad-9ed8-b0791e018123
sockets: 1

oguz · Dec 2, 2021

Code:

pveversion -v
proxmox-ve: 7.0-2 (running kernel: 5.11.22-3-pve) <----running kernel is older than the installed one
pve-manager: 7.0-11 (running version: 7.0-11/63d82f4e)
pve-kernel-5.11: 7.0-7
pve-kernel-helper: 7.0-7
pve-kernel-5.4: 6.4-5
pve-kernel-5.11.22-4-pve: 5.11.22-8
pve-kernel-5.11.22-3-pve: 5.11.22-7

please reboot after kernel upgrades

if the issue still happens afterwards we can look into it
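
A rough sketch of how the pending-reboot check could be scripted, in case it helps on other nodes. The function names are made up for illustration; the only assumption about the input is the `pve-kernel-<version>-pve: <package version>` line format visible in the pveversion -v output above.

```shell
# Print the newest installed pve-kernel ABI name found in `pveversion -v` output (stdin).
newest_installed_kernel() {
    # Matching lines look like: "pve-kernel-5.11.22-4-pve: 5.11.22-8"
    grep -E '^pve-kernel-[0-9]+\.[0-9]+\.[0-9]+-[0-9]+-pve:' \
        | sed -E 's/^pve-kernel-([^:]+):.*/\1/' \
        | sort -V \
        | tail -n 1
}

# Return 0 if the running kernel ($1) equals the newest installed kernel ($2).
kernel_is_current() {
    [ "$1" = "$2" ]
}

# Typical use on a node (not executed here):
#   newest=$(pveversion -v | newest_installed_kernel)
#   kernel_is_current "$(uname -r)" "$newest" \
#       || echo "reboot pending: running $(uname -r), newest installed $newest"
```

With the first host's output this would report 5.11.22-4-pve as the newest installed kernel while 5.11.22-3-pve is running.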

lolomat · Dec 2, 2021

Unfortunately this doesn't seem to be the problem; the same issue occurs on another host with an up-to-date kernel:

Code:

pveversion -v
proxmox-ve: 7.0-2 (running kernel: 5.11.22-4-pve)
pve-manager: 7.0-11 (running version: 7.0-11/63d82f4e)
pve-kernel-5.11: 7.0-7
pve-kernel-helper: 7.0-7
pve-kernel-5.4: 6.4-6
pve-kernel-5.11.22-4-pve: 5.11.22-8
pve-kernel-5.4.140-1-pve: 5.4.140-1
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.44-1-pve: 5.4.44-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph: 15.2.14-pve1
ceph-fuse: 15.2.14-pve1
corosync: 3.1.2-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.21-pve1
libproxmox-acme-perl: 1.3.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.0-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-6
libpve-guest-common-perl: 4.0-2
libpve-http-server-perl: 4.0-2
libpve-storage-perl: 7.0-11
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.9-4
lxcfs: 4.0.8-pve2
novnc-pve: 1.2.0-3
openvswitch-switch: 2.15.0+ds1-2
proxmox-backup-client: 2.0.9-2
proxmox-backup-file-restore: 2.0.9-2
proxmox-mini-journalreader: 1.2-1
proxmox-widget-toolkit: 3.3-6
pve-cluster: 7.0-3
pve-container: 4.0-9
pve-docs: 7.0-5
pve-edk2-firmware: 3.20200531-1
pve-firewall: 4.2-2
pve-firmware: 3.3-1
pve-ha-manager: 3.3-1
pve-i18n: 2.5-1
pve-qemu-kvm: 6.0.0-4
pve-xtermjs: 4.12.0-1
qemu-server: 7.0-13
smartmontools: 7.2-pve2
spiceterm: 3.2-2
vncterm: 1.7-1
zfsutils-linux: 2.0.5-pve1

Code:

qm config 119
balloon: 0
boot: dcn
bootdisk: sata0
cores: 8
cpu: SandyBridge
description: redacted
ide2: none,media=cdrom
memory: 16384
name: 1531
net0: virtio=46:4C:6D:65:7A:36,bridge=vmbr2,rate=80,tag=17
numa: 0
ostype: win8
sata0: ceph01:vm-119-disk-0,cache=unsafe,discard=on,iops_rd=1000,iops_rd_max=1500,iops_wr=1000,iops_wr_max=1500,size=500G
scsihw: virtio-scsi-pci
smbios1: uuid=79ebd0d7-1142-4bb7-898a-f24748d5caa3
sockets: 1
vmgenid: c7aad1c3-e8ea-4734-9fd9-a682bd182a7e

oguz · Dec 2, 2021

are there any interesting messages in the logs when you try to start a VM and it hangs?

does the hang only happen after a couple of weeks? or is it more frequent?
it would be good to see if there's a way to reproduce this reliably

you could check dmesg and journalctl. you can also enable persistent journaling with mkdir -p /var/log/journal to keep journals from consecutive boots.
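
A minimal sketch of what that could look like on the node. The function name and its optional directory argument are only for illustration (the argument lets you try it against a scratch path first), and VM ID 129 is taken from the first post; this assumes journald is running with its default Storage=auto setting.

```shell
# Create the journal directory so that journald (with the default Storage=auto)
# keeps logs across reboots. Defaults to /var/log/journal.
enable_persistent_journal() {
    journal_dir="${1:-/var/log/journal}"
    mkdir -p "$journal_dir"
}

# On the Proxmox node, as root (not executed here):
#   enable_persistent_journal
#   systemctl restart systemd-journald             # start writing to the new directory
#   journalctl --list-boots                        # confirm that earlier boots are retained
#   journalctl -b -1 -p warning                    # warnings and errors from the previous boot
#   journalctl --since "1 hour ago" | grep -w 129  # lines that mention VM 129
#   dmesg -T | tail -n 50                          # recent kernel messages, readable timestamps
```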

lolomat said:
Unfortunately this doesn't seem to be the problem; the same issue occurs on another host with an up-to-date kernel:

are all hosts rebooted to run the newer kernel version?

Code:

sata0: ceph01:vm-119-disk-0,cache=unsafe,discard=on,iops_rd=1000,iops_rd_max=1500,iops_wr=1000,iops_wr_max=1500,size=500G

what kind of performance do you get from the above ceph storage?

also i've noticed you seem to have different ceph versions on the nodes, which can cause issues.
ideally you should make sure all your nodes are updated to the same level and use the same package repositories, so please do that
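
One possible way to compare package levels across nodes, sketched under the assumption that you can ssh to each node as root. The helper name and the node names node1..node3 are placeholders, not anything from this thread.

```shell
# Print the version that `pveversion -v` output (stdin) reports for one package.
pkg_version() {
    awk -F': ' -v pkg="$1" '$1 == pkg { print $2 }'
}

# Typical use (not executed here; node1..node3 are placeholder host names):
#   for node in node1 node2 node3; do
#       printf '%s: ' "$node"
#       ssh "root@$node" pveversion -v | pkg_version ceph
#   done
#   ceph versions    # on a running Ceph cluster, summarises the daemon versions in use
```

Every node should print the same version string; a line such as "residual config" means the package was removed but its configuration files are still present.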

lolomat · Dec 3, 2021

Hi, and thanks for trying to help

are there any interesting messages in the logs when you try to start a VM and it hangs?

No, unfortunately no error or interesting messages regarding the VM.

does the hang only happen after a couple of weeks? or is it more frequent?
it would be good to see if there's a way to reproduce this reliably

That's the problem: it seems to affect only the VMs which have been running for a while. I can restart a freshly booted VM without an issue, but when I restart a machine which has been running for a couple of weeks, I get the boot freeze.

As a workaround I can turn off the VM instead of restarting it; then it starts up normally too.
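
For reference, the workaround can be sketched from the host shell roughly like this. The function name and the 300-second default are arbitrary choices for illustration; qm shutdown only succeeds if the guest reacts to the shutdown request.

```shell
# Cleanly shut the guest down, then start it again as a fresh QEMU process,
# instead of rebooting from inside the guest.
cold_restart() {
    vmid="$1"
    timeout="${2:-300}"   # seconds to wait for the guest to power off
    qm shutdown "$vmid" --timeout "$timeout" && qm start "$vmid"
}

# Typical use on the node (not executed here):
#   cold_restart 129
```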

you could check dmesg and journalctl. you can also enable persistent journaling with mkdir -p /var/log/journal to keep journals from consecutive boots.

No, unfortunately there are no errors or interesting logs regarding the VM there either.

are all hosts rebooted to run the newer kernel version?

yes, the host from my initial post is in a different, separate cluster from the one in my second example.

Code:

sata0: ceph01:vm-119-disk-0,cache=unsafe,discard=on,iops_rd=1000,iops_rd_max=1500,iops_wr=1000,iops_wr_max=1500,size=500G

what kind of performance do you get from the above ceph storage?

~20-30k IOPS

also i've noticed you seem to have different ceph versions on the nodes, which can cause issues.
ideally you should make sure all your nodes are updated to the same level and use the same package repositories, so please do that

the first example was from another cluster; all nodes from the second example run the same ceph version and use the same package repo.

daros · Dec 3, 2021

This is a known bug. There are more topics about this bug. Proxmox, please fix it.
It was there in 7.0 and is still here in 7.1.

tom · Dec 4, 2021

daros said:
This is a known bug. There are more topics about this bug. Proxmox, please fix it.
It was there in 7.0 and is still here in 7.1.

We fixed almost all known issues in latest package, please upgrade to current 7.1 and please test. If you still see issues, please report.

daros · Dec 4, 2021

tom said:
We fixed almost all known issues in latest package, please upgrade to current 7.1 and please test. If you still see issues, please report.

Is the new kernel needed? I don't currently have the option to reboot the nodes.
I hope it is also solved with all the new updates?

tom · Dec 4, 2021

daros said:
Is the new kernel needed? I don't currently have the option to reboot the nodes.
I hope it is also solved with all the new updates?

You are actively asking for updates but you do not want to apply already delivered fixes?

Rethink your approach ...

daros · Dec 5, 2021

tom said:
You are actively asking for updates but you do not want to apply already delivered fixes?

Rethink your approach ...

Sleep well?

I don't have a maintenance window soon, but we would like to have the fix ASAP.

Have a good day

RasmusToft · Dec 23, 2021

We have the same issue on 2 different Proxmox clusters upgraded from 6 to 7.
We are running around 200 Windows VMs on the clusters.
Today I tried upgrading one of the clusters to the newest version, including the upgrade to pve-kernel-5.15.
Still the same issue.

oguz · Dec 23, 2021

RasmusToft said:
Today I tried upgrading one of the clusters to the newest version, including the upgrade to pve-kernel-5.15.

did you reboot afterwards?

RasmusToft said:
We are running around 200 Windows VMs on the clusters.

* are they running all at the same time? is there a lot of load on the server?

* are you using ceph?

* do you notice the problem in all VMs or only on some of them?

* could you also post an example VM config and pveversion -v as well?

RasmusToft · Dec 23, 2021

Yes all hosts

oguz said:
did you reboot afterwards?

Yes

oguz said:
* are they running all at the same time? is there a lot of load on the server?

* are you using ceph?

* do you notice the problem in all VMs or only on some of them?

* could you also post an example VM config and pveversion -v as well?

Yes, we are running over 200 VMs on these 2 clusters at the same time.
Yes, we are running Ceph with 9 SSD OSDs in each node, running over 2x25Gbit.
We have not tested on all VMs, but we typically see the issue when VMs try to reboot after a Windows patch; manual reboots also hang.

----------------------------------------------------------------------------------------------------------------------------------------------
agent: 1,fstrim_cloned_disks=1
bios: ovmf
boot: cdn
bootdisk: scsi0
cores: 2
cpu: host
efidisk0: rbd-ssd:vm-123-disk-0,size=128K
hotplug: disk,network,usb,memory
ide2: none,media=cdrom
memory: 24576
name: ecs-sql-1
net0: virtio=36:6A:06:E3:08:5C,bridge=vmbr0,tag=1111
numa: 1
onboot: 1
ostype: win10
scsi0: rbd-ssd:vm-123-disk-1,backup=0,discard=on,iothread=1,size=75G,ssd=1
scsi10: rbd-ssd:vm-123-disk-6,backup=0,discard=on,iothread=1,size=200G,ssd=1
scsi2: rbd-ssd:vm-123-disk-2,backup=0,discard=on,iothread=1,size=180G,ssd=1
scsi3: rbd-ssd:vm-123-disk-3,backup=0,discard=on,iothread=1,size=130G,ssd=1
scsi4: rbd-ssd:vm-123-disk-4,backup=0,discard=on,iothread=1,size=25G,ssd=1
scsi5: rbd-ssd:vm-123-disk-5,backup=0,discard=on,iothread=1,size=150G,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=866d7e20-bbb6-4761-9c72-7f8ea700fddd
sockets: 2
vmgenid: 588ec08a-fc2a-4175-9478-49ac33d5ce0d

----------------------------------------------------------------------------------------------------------------------------------------------

proxmox-ve: 7.1-1 (running kernel: 5.15.5-1-pve)
pve-manager: 7.1-7 (running version: 7.1-7/df5740ad)
pve-kernel-5.15: 7.1-6
pve-kernel-helper: 7.1-6
pve-kernel-5.13: 7.1-5
pve-kernel-5.11: 7.0-10
pve-kernel-5.4: 6.4-4
pve-kernel-5.15.5-1-pve: 5.15.5-1
pve-kernel-5.13.19-2-pve: 5.13.19-4
pve-kernel-5.13.19-1-pve: 5.13.19-3
pve-kernel-5.11.22-7-pve: 5.11.22-12
pve-kernel-5.11.22-2-pve: 5.11.22-4
pve-kernel-5.4.124-1-pve: 5.4.124-2
pve-kernel-5.4.106-1-pve: 5.4.106-1
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.73-1-pve: 5.4.73-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph: 15.2.15-pve1
ceph-fuse: 15.2.15-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-5
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-14
libpve-guest-common-perl: 4.0-3
libpve-http-server-perl: 4.0-4
libpve-storage-perl: 7.0-15
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.9-4
lxcfs: 4.0.8-pve2
novnc-pve: 1.2.0-3
openvswitch-switch: 2.15.0+ds1-2
proxmox-backup-client: 2.1.2-1
proxmox-backup-file-restore: 2.1.2-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-4
pve-cluster: 7.1-2
pve-container: 4.1-2
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-3
pve-ha-manager: 3.3-1
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.0-3
pve-xtermjs: 4.12.0-1
qemu-server: 7.1-4
smartmontools: 7.2-pve2
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.1-pve3

oguz · Dec 23, 2021

thanks for the output, we'll look into it

RasmusToft said:
We have not tested on all VMs, but we typically see the issue when VMs try to reboot after a Windows patch; manual reboots also hang.

now that you mention, i think i've seen that happen once.
the manual reboot also hangs because the VM isn't waiting for the ACPI at that stage (post windows-update reboot). you should be able to cancel the reboot from the task log, and issue a "Stop" (but this is effectively pulling the plug from the machine, be careful!)

btw a tip for next time, you can use [code][/code] tags when posting outputs

RasmusToft · Dec 23, 2021

oguz said:
thanks for the output, we'll look into it

now that you mention, i think i've seen that happen once.
the manual reboot also hangs because the VM isn't waiting for the ACPI at that stage (post windows-update reboot). you should be able to cancel the reboot from the task log, and issue a "Stop" (but this is effectively pulling the plug from the machine, be careful!)

btw a tip for next time, you can use [code][/code] tags when posting outputs

the manual reboot is done from inside the OS, just to verify that it is not the Windows update causing the issue

oguz · Dec 23, 2021

RasmusToft said:
the manual reboot is done from inside the OS, just to verify that it is not the Windows update causing the issue

yes that's how i saw the issue as well, rebooting from inside windows (sorry for confusion!).

what i meant was that if you try issuing another "Reboot" it won't respond because it's already waiting for the reboot inside the VM

jacorall · Dec 28, 2021

A solution is to set GRUB to boot an earlier version of the kernel (kernel 5.11.22-7-pve). I've tested this on a number of Proxmox installations:

Ensure that kernel 5.11.22-7-pve is present by running update-grub and checking that it appears in the output.

Edit /etc/default/grub:
nano /etc/default/grub

Comment out the line:
#GRUB_DEFAULT=0

Add the line:
GRUB_DEFAULT="Advanced options for Proxmox VE GNU/Linux>Proxmox VE GNU/Linux, with Linux 5.11.22-7-pve"

Save the file.

Run update-grub to regenerate the GRUB configuration.

Reboot the server.

The Proxmox server will run on a slightly older kernel.
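
The same steps as a rough script, for anyone who prefers not to edit by hand. The function name is made up for illustration; it keeps a .bak copy of the file, and the menu entry title should be copied from your own /boot/grub/grub.cfg rather than from this post.

```shell
# Comment out any active GRUB_DEFAULT line in the given file and append a new
# one that names the wanted menu entry. A backup is kept as <file>.bak.
pin_grub_default() {
    grub_file="$1"
    entry="$2"
    cp "$grub_file" "$grub_file.bak"
    sed -E 's/^(GRUB_DEFAULT=)/#\1/' "$grub_file.bak" > "$grub_file"
    printf 'GRUB_DEFAULT="%s"\n' "$entry" >> "$grub_file"
}

# On the Proxmox node, as root (not executed here):
#   grep -E "menuentry '.*5.11.22-7-pve" /boot/grub/grub.cfg   # is the kernel listed?
#   pin_grub_default /etc/default/grub \
#       "Advanced options for Proxmox VE GNU/Linux>Proxmox VE GNU/Linux, with Linux 5.11.22-7-pve"
#   update-grub
#   reboot
```

After the reboot, uname -r should print 5.11.22-7-pve if the pin took effect.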

RasmusToft · Jan 4, 2022

We have now tested a downgrade to kernel 5.11.22-7, but we are still seeing the reboot issues.
I have now tested creating a new VM (with the same settings) and imported the RBD disk from one of the VMs not working.
The newly created VM boots and reboots just fine.
It looks like it is just VMs created before the upgrade to version 7 that have the issue.

tuxis · Jan 5, 2022

@RasmusToft Hmm. Is there any difference in the output of

Code:

rbd info $imagename

of the older and newer images?
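
A quick way to compare the two outputs could look like the sketch below. The filter function is made up for illustration and simply drops the fields that are expected to differ between any two images (name, id, block name prefix, timestamps); the pool and image names in the usage lines are placeholders.

```shell
# Strip the per-image fields from `rbd info` output (stdin) so that two images
# can be compared on size, order, format, features and flags only.
rbd_info_comparable() {
    grep -Ev "^[[:space:]]*(rbd image|id:|block_name_prefix:|create_timestamp:|access_timestamp:|modify_timestamp:)"
}

# Typical use (not executed here; pool and image names are placeholders):
#   rbd info mypool/old-image | rbd_info_comparable > /tmp/old.txt
#   rbd info mypool/new-image | rbd_info_comparable > /tmp/new.txt
#   diff /tmp/old.txt /tmp/new.txt && echo "no differences in the compared fields"
```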

RasmusToft · Jan 6, 2022

@tuxis

Here is the output from a new and an old rbd image

seems to be the same

root@ecs-rgv-pve-int-1:~# rbd --pool rbd-ssd --image vm-132-disk-0 info
rbd image 'vm-132-disk-0':
size 60 GiB in 15360 objects
order 22 (4 MiB objects)
snapshot_count: 0
id: dd3d414a9ae00f
block_name_prefix: rbd_data.dd3d414a9ae00f
format: 2
features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
op_features:
flags:
create_timestamp: Mon May 10 11:26:51 2021
access_timestamp: Thu Jan 6 08:50:04 2022
modify_timestamp: Thu Jan 6 08:49:35 2022

root@ecs-rgv-pve-int-1:~# rbd --pool rbd-ssd --image vm-115-disk-1 info
rbd image 'vm-115-disk-1':
size 60 GiB in 15360 objects
order 22 (4 MiB objects)
snapshot_count: 0
id: c645d436f69a39
block_name_prefix: rbd_data.c645d436f69a39
format: 2
features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
op_features:
flags:
create_timestamp: Mon May 3 11:55:13 2021
access_timestamp: Thu Jan 6 08:50:29 2022
modify_timestamp: Thu Jan 6 08:49:44 2022
