Windows 10 VM hangs when booting

tjying95

Member
Mar 11, 2020
6
0
6
29
Hi all,

I have a cluster of 4 Proxmox servers, all running version 6.2-4, a version that I prefer not upgrade unless I necessary have to.
I have a preinstalled Windows 10 VM template (with 4 CPU cores and 6GB RAM with minimal hardware) which I used to create linked clones from.

When I create ~30 linked clones all at once (programmatically through API of course), and started them all together, I am seeing one or two VM out of all those 30 VMs that will hang during boot. It may or may not be the first boot of that clone. However, if I power off and then power on again, the VM boots normally.

If I take a snapshot with RAM, and restore to it (hung state), it continues to hang.
If I modify through the vmid.conf and change the CPU count/RAM, then the restored system continues to boot (lol, wut?).

It gives a black screen with "Guest has not initialized the display (yet)." usually. Sometimes, it is the Windows logo with the spinny thing, but stopped.
1675928422333.png

One of the CPU cores will be fully utilized, making the Proxmox UI to show a constant 25% CPU usage. The RAM usage is also constant, sometimes high usage as shown, sometimes low usage (15%).
1675928607287.png

It seems like it is still somewhat interactive with qm monitor, with some commands failing.

Some of the configurations I tried changing to no avail:
  1. Increasing and decreasing the CPU core count
  2. Increasing and decreasing the RAM
  3. Use Windows 10 1909 and 22H2.
The Proxmox server is running on 64 cores Intel(R) Xeon(R) Gold 6242, 512 GB of RAM and 16 TB SATA SSD RAID. I believe there shouldn't be hardware bottleneck on this.

Have you guys ever encounter this kind of behavior before? I tried searching around, doesn't seem like anyone been seeing this before.

code_language.shell:
~# pveversion -v
proxmox-ve: 6.2-1 (running kernel: 5.4.34-1-pve)
pve-manager: 6.2-4 (running version: 6.2-4/9824574a)
pve-kernel-5.4: 6.2-1
pve-kernel-helper: 6.2-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libproxmox-acme-perl: 1.0.3
libpve-access-control: 6.1-1
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-2
libpve-guest-common-perl: 3.0-10
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-7
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve2
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-1
pve-cluster: 6.1-8
pve-container: 3.1-5
pve-docs: 6.2-4
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.1-2
pve-firmware: 3.1-1
pve-ha-manager: 3.0-9
pve-i18n: 2.1-2
pve-qemu-kvm: 5.0.0-2
pve-xtermjs: 4.3.0-1
qemu-server: 6.2-2
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.3-pve1
 
Last edited:
Hi all,

I have a cluster of 4 Proxmox servers, all running version 6.2-4, a version that I prefer not upgrade unless I necessary have to.
I have a preinstalled Windows 10 VM template (with 4 CPU cores and 6GB RAM with minimal hardware) which I used to create linked clones from.

When I create ~30 linked clones all at once (programmatically through API of course), and started them all together, I am seeing one or two VM out of all those 30 VMs that will hang during boot. It may or may not be the first boot of that clone. However, if I power off and then power on again, the VM boots normally.

If I take a snapshot with RAM, and restore to it (hung state), it continues to hang.
If I modify through the vmid.conf and change the CPU count/RAM, then the restored system continues to boot (lol, wut?).

It gives a black screen with "Guest has not initialized the display (yet)." usually. Sometimes, it is the Windows logo with the spinny thing, but stopped.
View attachment 46572

One of the CPU cores will be fully utilized, making the Proxmox UI to show a constant 25% CPU usage. The RAM usage is also constant, sometimes high usage as shown, sometimes low usage (15%).
View attachment 46573

It seems like it is still somewhat interactive with qm monitor, with some commands failing.

Some of the configurations I tried changing to no avail:
  1. Increasing and decreasing the CPU core count
  2. Increasing and decreasing the RAM
  3. Use Windows 10 1909 and 22H2.
The Proxmox server is running on 64 cores Intel(R) Xeon(R) Gold 6242, 512 GB of RAM and 16 TB SATA SSD RAID. I believe there shouldn't be hardware bottleneck on this.

Have you guys ever encounter this kind of behavior before? I tried searching around, doesn't seem like anyone been seeing this before.

code_language.shell:
~# pveversion -v
proxmox-ve: 6.2-1 (running kernel: 5.4.34-1-pve)
pve-manager: 6.2-4 (running version: 6.2-4/9824574a)
pve-kernel-5.4: 6.2-1
pve-kernel-helper: 6.2-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libproxmox-acme-perl: 1.0.3
libpve-access-control: 6.1-1
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-2
libpve-guest-common-perl: 3.0-10
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-7
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve2
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-1
pve-cluster: 6.1-8
pve-container: 3.1-5
pve-docs: 6.2-4
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.1-2
pve-firmware: 3.1-1
pve-ha-manager: 3.0-9
pve-i18n: 2.1-2
pve-qemu-kvm: 5.0.0-2
pve-xtermjs: 4.3.0-1
qemu-server: 6.2-2
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.3-pve1
Hi,
are you sure that you are not running out of memory during the startup of all the VMs at once? I mean 30*6G should be well within your limits, but this does not account for other services, ZFS arc, ...

Do you see some errors in the syslog around the time when booting the VMs?

all running version 6.2-4, a version that I prefer not upgrade unless I necessary have to
I would say you really have to. Running without security updates is not what you want, especially since it seems you are running your cluster in a productive environment.
 
are you sure that you are not running out of memory during the startup of all the VMs at once? I mean 30*6G should be well within your limits, but this does not account for other services, ZFS arc, ...
I am pretty sure the memory is enough during the run. According to the max usage graph, it never exceeded half during the event.

One thing I notice is the IO delay creeps up to around 15% during the event.

Do you see some errors in the syslog around the time when booting the VMs?
Nothing out of the ordinary from the log.

I would say you really have to. Running without security updates is not what you want, especially since it seems you are running your cluster in a productive environment.
It's in our plan, but a rather low priority one. Security isn't really a concern though, it is not exposed to the internet and is mainly used to run tests internally.
 
Last edited:

About

The Proxmox community has been around for many years and offers help and support for Proxmox VE, Proxmox Backup Server, and Proxmox Mail Gateway.
We think our community is one of the best thanks to people like you!

Get your subscription!

The Proxmox team works very hard to make sure you are running the best software and getting stable updates and security enhancements, as well as quick enterprise support. Tens of thousands of happy customers have a Proxmox subscription. Get yours easily in our online shop.

Buy now!