Windows VMs stuck on boot after Proxmox Upgrade to 7.0

lolomat

New Member
Dec 2, 2021
3
0
1
Hi guys,

I hope you can help me with a strange problem I've encountered with my Windows VMs since upgrading Proxmox from 6.4-13 to 7.0.
When I restart a VM that has been running for a couple of weeks, the boot process often gets stuck at the Windows boot logo and the "circle" just keeps spinning.

If I shut down the machine and boot it again, it boots without an issue. Restarting a VM that was only recently booted causes no problems either.

The issue occurs only with Windows machines that have been running for a couple of weeks; Linux VMs are not affected.

Code:
pveversion -v
proxmox-ve: 7.0-2 (running kernel: 5.11.22-3-pve)
pve-manager: 7.0-11 (running version: 7.0-11/63d82f4e)
pve-kernel-5.11: 7.0-7
pve-kernel-helper: 7.0-7
pve-kernel-5.4: 6.4-5
pve-kernel-5.11.22-4-pve: 5.11.22-8
pve-kernel-5.11.22-3-pve: 5.11.22-7
pve-kernel-5.4.128-1-pve: 5.4.128-1
ceph: residual config
ceph-fuse: 16.2.5-pve1
corosync: 3.1.2-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.21-pve1
libproxmox-acme-perl: 1.3.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.0-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-6
libpve-guest-common-perl: 4.0-2
libpve-http-server-perl: 4.0-2
libpve-storage-perl: 7.0-11
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.9-4
lxcfs: 4.0.8-pve2
novnc-pve: 1.2.0-3
openvswitch-switch: 2.15.0+ds1-2
proxmox-backup-client: 2.0.9-2
proxmox-backup-file-restore: 2.0.9-2
proxmox-mini-journalreader: 1.2-1
proxmox-widget-toolkit: 3.3-6
pve-cluster: 7.0-3
pve-container: 4.0-9
pve-docs: 7.0-5
pve-edk2-firmware: 3.20200531-1
pve-firewall: 4.2-2
pve-firmware: 3.3-1
pve-ha-manager: 3.3-1
pve-i18n: 2.5-1
pve-qemu-kvm: 6.0.0-4
pve-xtermjs: 4.12.0-1
qemu-server: 7.0-13
smartmontools: 7.2-pve2
spiceterm: 3.2-2
vncterm: 1.7-1
zfsutils-linux: 2.0.5-pve1

Code:
qm config 129
balloon: 0
bootdisk: sata0
cores: 4
cpu: Westmere,flags=+pcid
description: %0A%0AWindows Server 2016%0ADomain Controller
ide2: none,media=cdrom
memory: 8192
name: redacted
net0: virtio=66:31:34:34:38:37,bridge=vmbr0,tag=18
numa: 0
ostype: win8
sata0: ceph:vm-129-disk-0,cache=writeback,discard=on,size=100G
smbios1: uuid=e214a6aa-92e6-43ad-9ed8-b0791e018123
sockets: 1
 

oguz

Proxmox Staff Member
Staff member
Nov 19, 2018
5,207
676
118
Code:
pveversion -v
proxmox-ve: 7.0-2 (running kernel: 5.11.22-3-pve) <----running kernel is older than the installed one
pve-manager: 7.0-11 (running version: 7.0-11/63d82f4e)
pve-kernel-5.11: 7.0-7
pve-kernel-helper: 7.0-7
pve-kernel-5.4: 6.4-5
pve-kernel-5.11.22-4-pve: 5.11.22-8
pve-kernel-5.11.22-3-pve: 5.11.22-7

please reboot after kernel upgrades :)
if the issue still happens afterwards we can look into it
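
A minimal sketch of how this check could be scripted, assuming the package naming seen in the output above (pve-kernel-<version>-pve). The function name kernel_reboot_needed is made up for illustration and is not a Proxmox tool:

```shell
# Reads `pveversion -v` output on stdin and compares the running kernel
# with the newest installed pve-kernel-<version>-pve package.
# kernel_reboot_needed is a hypothetical helper, not a Proxmox tool.
kernel_reboot_needed() {
    input=$(cat)
    # "proxmox-ve: 7.0-2 (running kernel: 5.11.22-3-pve)" -> "5.11.22-3-pve"
    running=$(printf '%s\n' "$input" |
        sed -n 's/.*running kernel: \([^)]*\)).*/\1/p' | head -n 1)
    # "pve-kernel-5.11.22-4-pve: 5.11.22-8" -> "5.11.22-4-pve"
    newest=$(printf '%s\n' "$input" |
        sed -n 's/^pve-kernel-\([0-9][0-9.]*-[0-9][0-9]*-pve\):.*/\1/p' |
        sort -V | tail -n 1)
    if [ -z "$running" ] || [ -z "$newest" ]; then
        echo "could not parse kernel versions" >&2
        return 2
    fi
    # Version-sort both; the running kernel is current if it sorts last.
    latest=$(printf '%s\n%s\n' "$running" "$newest" | sort -V | tail -n 1)
    if [ "$latest" = "$running" ]; then
        echo "ok: running $running"
    else
        echo "reboot pending: running $running, newest installed $newest"
        return 1
    fi
}

# On a PVE host: pveversion -v | kernel_reboot_needed
```

The comparison relies on `sort -V` (GNU coreutils), so it only orders the version strings and does not verify that the newer kernel actually boots.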
 

lolomat

Unfortunately this doesn't seem to be the problem; the same issue occurs on another host with an up-to-date kernel:

Code:
pveversion -v
proxmox-ve: 7.0-2 (running kernel: 5.11.22-4-pve)
pve-manager: 7.0-11 (running version: 7.0-11/63d82f4e)
pve-kernel-5.11: 7.0-7
pve-kernel-helper: 7.0-7
pve-kernel-5.4: 6.4-6
pve-kernel-5.11.22-4-pve: 5.11.22-8
pve-kernel-5.4.140-1-pve: 5.4.140-1
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.44-1-pve: 5.4.44-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph: 15.2.14-pve1
ceph-fuse: 15.2.14-pve1
corosync: 3.1.2-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.21-pve1
libproxmox-acme-perl: 1.3.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.0-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-6
libpve-guest-common-perl: 4.0-2
libpve-http-server-perl: 4.0-2
libpve-storage-perl: 7.0-11
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.9-4
lxcfs: 4.0.8-pve2
novnc-pve: 1.2.0-3
openvswitch-switch: 2.15.0+ds1-2
proxmox-backup-client: 2.0.9-2
proxmox-backup-file-restore: 2.0.9-2
proxmox-mini-journalreader: 1.2-1
proxmox-widget-toolkit: 3.3-6
pve-cluster: 7.0-3
pve-container: 4.0-9
pve-docs: 7.0-5
pve-edk2-firmware: 3.20200531-1
pve-firewall: 4.2-2
pve-firmware: 3.3-1
pve-ha-manager: 3.3-1
pve-i18n: 2.5-1
pve-qemu-kvm: 6.0.0-4
pve-xtermjs: 4.12.0-1
qemu-server: 7.0-13
smartmontools: 7.2-pve2
spiceterm: 3.2-2
vncterm: 1.7-1
zfsutils-linux: 2.0.5-pve1

Code:
qm config 119
balloon: 0
boot: dcn
bootdisk: sata0
cores: 8
cpu: SandyBridge
description: redacted
ide2: none,media=cdrom
memory: 16384
name: 1531
net0: virtio=46:4C:6D:65:7A:36,bridge=vmbr2,rate=80,tag=17
numa: 0
ostype: win8
sata0: ceph01:vm-119-disk-0,cache=unsafe,discard=on,iops_rd=1000,iops_rd_max=1500,iops_wr=1000,iops_wr_max=1500,size=500G
scsihw: virtio-scsi-pci
smbios1: uuid=79ebd0d7-1142-4bb7-898a-f24748d5caa3
sockets: 1
vmgenid: c7aad1c3-e8ea-4734-9fd9-a682bd182a7e
 

oguz

are there any interesting messages in the logs when you try to start a VM and it hangs?

does the hang only happen after a couple of weeks? or is it more frequent?
it would be good to see if there's a way to reproduce this reliably :)

you could check dmesg and journalctl. you can also enable persistent journaling with mkdir -p /var/log/journal to keep journals from consecutive boots.

Unfortunately this doesn't seem to be the problem, same issue occurs on another host with up2date kernel:
are all hosts rebooted to run the newer kernel version?

Code:
sata0: ceph01:vm-119-disk-0,cache=unsafe,discard=on,iops_rd=1000,iops_rd_max=1500,iops_wr=1000,iops_wr_max=1500,size=500G
what kind of performance do you get from the above ceph storage?

also i've noticed you seem to have different ceph versions on the nodes, which can cause issues.
ideally you should make sure all your nodes are updated to the same level and use the same package repositories, so please do that :)
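
One way to spot such differences is to save the `pveversion -v` output of two nodes to files and compare them package by package. This is only a sketch; pveversion_diff is a hypothetical helper name and the file names in the usage comment are placeholders:

```shell
# Prints every package whose version differs between two saved
# `pveversion -v` outputs. pveversion_diff is a hypothetical helper.
pveversion_diff() {
    # $1, $2: files containing `pveversion -v` output from two nodes
    awk '
        {
            i = index($0, ": ")
            if (i == 0) next                 # skip lines without "name: version"
            name = substr($0, 1, i - 1)
            ver = substr($0, i + 2)
        }
        FILENAME == ARGV[1] { first[name] = ver; next }
        {
            seen[name] = 1
            if (!(name in first))        print name ": (missing) -> " ver
            else if (first[name] != ver) print name ": " first[name] " -> " ver
        }
        END {
            for (name in first)
                if (!(name in seen)) print name ": " first[name] " -> (missing)"
        }
    ' "$1" "$2" | sort
}

# Example: pveversion_diff node1.txt node2.txt
```

An empty result means both files list identical package versions; it says nothing about which repositories the nodes are configured to use.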
 

lolomat

Hi, and thanks for trying to help ;)

are there any interesting messages in the logs when you try to start a VM and it hangs?
No, unfortunately no error or interesting messages regarding the VM.

does the hang only happen after a couple of weeks? or is it more frequent?
it would be good to see if there's a way to reproduce this reliably :)
That's the problem: it seems to affect only VMs that have been running for a while. I can restart a freshly booted VM without an issue, but when I restart a machine that has been running for a couple of weeks, I get the boot freeze.

As a workaround I can turn off the VM instead of restarting it; then it starts up normally too.

you could check dmesg and journalctl. you can also enable persistent journaling with mkdir -p /var/log/journal to keep journals from consecutive boots.
No, unfortunately there are no errors or interesting log entries regarding the VM there either.

are all hosts rebooted to run the newer kernel version?
Yes. The host from my initial post belongs to a different, separate cluster from the one in my second example.

Code:
sata0: ceph01:vm-119-disk-0,cache=unsafe,discard=on,iops_rd=1000,iops_rd_max=1500,iops_wr=1000,iops_wr_max=1500,size=500G
what kind of performance do you get from the above ceph storage?
~20-30k IOPS

also i've noticed you seem to have different ceph versions on the nodes, which can cause issues.
ideally you should make sure all your nodes are updated to the same level and use the same package repositories, so please do that :)
The first example was from another cluster; all nodes in the second example run the same Ceph version and use the same package repo.
 

daros

Active Member
Jul 22, 2014
49
1
28
This is a known bug. There are more topics about it. Proxmox, please fix it.
It was there in 7.0 and is still here in 7.1.
 

tom

Proxmox Staff Member
Staff member
Aug 29, 2006
15,537
920
163
We fixed almost all known issues in the latest packages; please upgrade to the current 7.1 and test. If you still see issues, please report them.
 

daros

Is the new kernel needed? I don't currently have the option to reboot the nodes.
I hope the issue is also solved by all the new updates?
 

tom

You are actively asking for updates, but you do not want to apply fixes that have already been delivered?

Rethink your approach ...
 

daros

Sleep well?

I don't have a maintenance window soon, but we would like to have the fix ASAP.

Have a good day
 
Dec 23, 2021
5
0
1
37
We have the same issue on 2 different Proxmox clusters upgraded from 6 to 7.
We are running around 200 Windows VMs on the clusters.
Today I upgraded one of the clusters to the newest version, including the upgrade to pve-kernel-5.15.
Still the same issue :(
 

oguz

I have today tried to upgrade one of the cluster to the newest version including upgrade to pve-kernel-5.15
done reboot afterwards?
We are running around 200 Windows VM on the clusters.
* are they running all at the same time? is there a lot of load on the server?

* are you using ceph?

* do you notice the problem in all VMs or only on some of them?

* could you also post an example VM config and pveversion -v as well?
 
Dec 23, 2021
5
0
1
37
Yes all hosts
done reboot afterwards?
Yes
* are they running all at the same time? is there a lot of load on the server?

* are you using ceph?

* do you notice the problem in all VMs or only on some of them?

* could you also post an example VM config and pveversion -v as well?
Yes, we are running over 200 VMs on these 2 clusters at the same time.
Yes, we are running Ceph with 9 SSD OSDs in each node, connected over 2x25 Gbit.
We have not tested on all VMs, but we typically see the issue when VMs try to reboot after a Windows patch. Manual reboots also hang.

----------------------------------------------------------------------------------------------------------------------------------------------
agent: 1,fstrim_cloned_disks=1
bios: ovmf
boot: cdn
bootdisk: scsi0
cores: 2
cpu: host
efidisk0: rbd-ssd:vm-123-disk-0,size=128K
hotplug: disk,network,usb,memory
ide2: none,media=cdrom
memory: 24576
name: ecs-sql-1
net0: virtio=36:6A:06:E3:08:5C,bridge=vmbr0,tag=1111
numa: 1
onboot: 1
ostype: win10
scsi0: rbd-ssd:vm-123-disk-1,backup=0,discard=on,iothread=1,size=75G,ssd=1
scsi10: rbd-ssd:vm-123-disk-6,backup=0,discard=on,iothread=1,size=200G,ssd=1
scsi2: rbd-ssd:vm-123-disk-2,backup=0,discard=on,iothread=1,size=180G,ssd=1
scsi3: rbd-ssd:vm-123-disk-3,backup=0,discard=on,iothread=1,size=130G,ssd=1
scsi4: rbd-ssd:vm-123-disk-4,backup=0,discard=on,iothread=1,size=25G,ssd=1
scsi5: rbd-ssd:vm-123-disk-5,backup=0,discard=on,iothread=1,size=150G,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=866d7e20-bbb6-4761-9c72-7f8ea700fddd
sockets: 2
vmgenid: 588ec08a-fc2a-4175-9478-49ac33d5ce0d

----------------------------------------------------------------------------------------------------------------------------------------------

proxmox-ve: 7.1-1 (running kernel: 5.15.5-1-pve)
pve-manager: 7.1-7 (running version: 7.1-7/df5740ad)
pve-kernel-5.15: 7.1-6
pve-kernel-helper: 7.1-6
pve-kernel-5.13: 7.1-5
pve-kernel-5.11: 7.0-10
pve-kernel-5.4: 6.4-4
pve-kernel-5.15.5-1-pve: 5.15.5-1
pve-kernel-5.13.19-2-pve: 5.13.19-4
pve-kernel-5.13.19-1-pve: 5.13.19-3
pve-kernel-5.11.22-7-pve: 5.11.22-12
pve-kernel-5.11.22-2-pve: 5.11.22-4
pve-kernel-5.4.124-1-pve: 5.4.124-2
pve-kernel-5.4.106-1-pve: 5.4.106-1
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.73-1-pve: 5.4.73-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph: 15.2.15-pve1
ceph-fuse: 15.2.15-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-5
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-14
libpve-guest-common-perl: 4.0-3
libpve-http-server-perl: 4.0-4
libpve-storage-perl: 7.0-15
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.9-4
lxcfs: 4.0.8-pve2
novnc-pve: 1.2.0-3
openvswitch-switch: 2.15.0+ds1-2
proxmox-backup-client: 2.1.2-1
proxmox-backup-file-restore: 2.1.2-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-4
pve-cluster: 7.1-2
pve-container: 4.1-2
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-3
pve-ha-manager: 3.3-1
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.0-3
pve-xtermjs: 4.12.0-1
qemu-server: 7.1-4
smartmontools: 7.2-pve2
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.1-pve3
 

oguz

thanks for the output, we'll look into it :)
We have not testet on all VMs, but we see the issue typical when VMs are trying to reboot after windows patch - but manuel reboot are also hanging.
now that you mention, i think i've seen that happen once.
the manual reboot also hangs because the VM isn't waiting for the ACPI at that stage (post windows-update reboot). you should be able to cancel the reboot from the task log, and issue a "Stop" (but this is effectively pulling the plug from the machine, be careful!)

btw a tip for next time, you can use [code][/code] tags when posting outputs ;)
 
Dec 23, 2021
5
0
1
37
The manual reboot was done from within the OS, just to verify that it is not the Windows update causing the issue.
 

oguz

the manual reboot is made from the OS, just to verify that is not the windows update causing the issue
yes that's how i saw the issue as well, rebooting from inside windows (sorry for confusion!).

what i meant was that if you try issuing another "Reboot" it won't respond because it's already waiting for the reboot inside the VM
 

jacorall

New Member
Dec 28, 2021
3
3
3
51
A solution is to set GRUB to boot an earlier version of the kernel (5.11.22-7-pve). I've tested this on a number of Proxmox installations:

Ensure that kernel 5.11.22-7-pve is present by running update-grub and checking that it appears in the output on the screen.

Edit /etc/default/grub:
nano /etc/default/grub

Comment out the line:
#GRUB_DEFAULT=0

Add the line:
GRUB_DEFAULT="Advanced options for Proxmox VE GNU/Linux>Proxmox VE GNU/Linux, with Linux 5.11.22-7-pve"

Save the file.

Run update-grub to regenerate the GRUB config.

Reboot the server.

The Proxmox server will run on a slightly older kernel.
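
The edit to /etc/default/grub described above could be scripted roughly as follows. This is a sketch, not a tested procedure: pin_grub_kernel is a hypothetical helper, it assumes GNU sed, and the menu titles are copied from the steps above and must match the entries in the generated grub.cfg on your host.

```shell
# pin_grub_kernel FILE KERNEL
# Comments out active GRUB_DEFAULT lines in FILE and appends an entry that
# selects KERNEL from the "Advanced options" submenu. Hypothetical helper;
# the menu titles are taken from the steps above and must match grub.cfg.
pin_grub_kernel() {
    file=$1      # normally /etc/default/grub
    kernel=$2    # e.g. 5.11.22-7-pve
    entry="Advanced options for Proxmox VE GNU/Linux>Proxmox VE GNU/Linux, with Linux $kernel"
    cp "$file" "$file.bak" || return 1    # keep a backup of the original
    sed -i 's/^GRUB_DEFAULT=/#GRUB_DEFAULT=/' "$file"
    printf 'GRUB_DEFAULT="%s"\n' "$entry" >> "$file"
}

# On the host (as root), followed by regenerating the config:
#   pin_grub_kernel /etc/default/grub 5.11.22-7-pve && update-grub
```

Running it against a copy of the file first makes it easy to inspect the result before touching the real configuration.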
 
Dec 23, 2021
5
0
1
37
We have now tested a downgrade to kernel 5.11.22-7, but we are still seeing the reboot issues.
I have also tested creating a new VM (with the same settings) and importing the RBD disk from one of the affected VMs.
The newly created VM boots and reboots just fine.
It looks like only VMs created before the upgrade to version 7 have the issue.
 
Dec 23, 2021
5
0
1
37
@tuxis

Here is the output from a new and an old RBD image.

They seem to be the same:

root@ecs-rgv-pve-int-1:~# rbd --pool rbd-ssd --image vm-132-disk-0 info
rbd image 'vm-132-disk-0':
size 60 GiB in 15360 objects
order 22 (4 MiB objects)
snapshot_count: 0
id: dd3d414a9ae00f
block_name_prefix: rbd_data.dd3d414a9ae00f
format: 2
features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
op_features:
flags:
create_timestamp: Mon May 10 11:26:51 2021
access_timestamp: Thu Jan 6 08:50:04 2022
modify_timestamp: Thu Jan 6 08:49:35 2022

root@ecs-rgv-pve-int-1:~# rbd --pool rbd-ssd --image vm-115-disk-1 info
rbd image 'vm-115-disk-1':
size 60 GiB in 15360 objects
order 22 (4 MiB objects)
snapshot_count: 0
id: c645d436f69a39
block_name_prefix: rbd_data.c645d436f69a39
format: 2
features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
op_features:
flags:
create_timestamp: Mon May 3 11:55:13 2021
access_timestamp: Thu Jan 6 08:50:29 2022
modify_timestamp: Thu Jan 6 08:49:44 2022
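
To compare two such outputs without reading them by eye, the fields that always differ between images can be filtered out before diffing. A sketch under the assumption that each `rbd info` output was saved to a file; rbd_info_normalize and rbd_info_diff are hypothetical helper names:

```shell
# Drop fields that are expected to differ between any two images
# (name, id, block name prefix, timestamps). Hypothetical helper.
rbd_info_normalize() {
    grep -Ev '^[[:space:]]*(rbd image|id:|block_name_prefix:|create_timestamp:|access_timestamp:|modify_timestamp:)' "$1"
}

# rbd_info_diff FILE_A FILE_B
# Diffs two saved `rbd info` outputs after normalization.
rbd_info_diff() {
    tmp_a=$(mktemp) || return 2
    tmp_b=$(mktemp) || return 2
    rbd_info_normalize "$1" > "$tmp_a"
    rbd_info_normalize "$2" > "$tmp_b"
    if diff "$tmp_a" "$tmp_b"; then
        echo "no differences outside per-image fields"
        status=0
    else
        status=1
    fi
    rm -f "$tmp_a" "$tmp_b"
    return "$status"
}

# Example: rbd_info_diff new-image.txt old-image.txt
```

For the two outputs above this would report no differences, since size, order, format and features are identical.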
 
