[SOLVED] VM (Win11) frozen (post boot) on some... PVE Hosts

Aug 27, 2022
Hi All,
I have 4 new cluster nodes (blade chassis) that I recently added to my Proxmox cluster. They are not yet in production, but during testing I have run into a very strange issue. A new VM I just created (Win 11 Pro for WS) boots and runs just fine on my older nodes (2x Epyc 7302) and even my ancient ones (2x Xeon E5-2650), but on my new blades (1x Epyc 7573X) the system boots almost to the login screen, then gets stuck showing the time with no background image. Once there it is completely frozen: the time no longer advances and nothing responds. Attached are several images giving some details on the configuration. Note that all nodes are connected to a Ceph cluster as the primary storage for this VM, and other VMs on that storage are running just fine. I have installed the guest additions on the VM and am running it with VirtIO-SCSI and VirtIO-GPU.
Thank you all in advance for the assistance... also, I know I don't have a support license for the blade server yet... once I get this into prod I'll add it to the list lol.
Thanks;
SPraus


Not Running on Blade:
1707846127565.png

Running on Current Prod Node:
1707846134605.png

VM Hardware:
1707846165320.png

VM Options:
1707846175015.png

Blade Summary:
1707846185606.png

Prod Summary:
1707846226524.png
 
I just now realized something strange too... the times. Both servers are set up to use the same time server, so this tells me the VM is freezing so early that it doesn't even do a time sync: the clock it shows is still 1 hour behind the server, which is how it is currently set in Windows settings.

Not sure if this helps but another datapoint.
Thanks;
SPraus
 
Since the problem presents itself in the VM, you need to start with VM troubleshooting.

It's unclear whether this is a newly installed VM on one of the blades, or something you created elsewhere and moved to the blades.
It's also unclear from your description whether all the blades are running an identical CPU configuration.

I would try to boot in safe mode and examine the logs. I would also experiment with the VM's CPU configuration, i.e. try the "host" CPU type to get a data point. If you didn't install the VM on the blade in question, try that.
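For the CPU experiment, something like the following could be run on the PVE host. This is only a sketch: the function names are made up for illustration, and the VM ID you pass in (100 in the usage note) is a placeholder for your own. It assumes the standard `qm` CLI is available on the node.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of Proxmox.

# Print the VM's explicit "cpu:" config line, if there is one.
# No output suggests the VM is using the default CPU type.
show_cpu() {
    qm config "$1" | grep '^cpu:'
}

# Switch the VM to the "host" CPU type as a data point.
# The change should only apply after the VM is fully stopped and started again.
set_cpu_host() {
    qm set "$1" --cpu host
}
```

Usage would be `show_cpu 100` followed by `set_cpu_host 100`, then a full stop/start of the VM. Note that a VM set to "host" may not migrate cleanly between nodes with different CPUs, so treat it as a test setting.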

Running a heterogeneous hardware cluster in production is "frowned upon" by PVE and is strongly discouraged. There are no guarantees of successful VM movement/boot across such a cluster.

Good luck


Blockbridge: Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Hi bbgeek17,
Thank you for such a prompt response. I understand the comment about heterogeneous solutions... I plan on moving completely to the blades once production intensifies. As for the configuration, all blade CPUs are identical... the only differences between them are memory/disk.
The VM was built on the older node because I am currently having some NFS issues that prevent me from mounting my installation ISO on the blade, though up to this point every other VM I have moved to this blade (built on the older nodes) has worked fine.
As for safe mode... when I try to go in, it gets to the blue Win 11 screen (the one that should let you select safe mode) and freezes there as well.
Thanks for your input;
SPraus
 
Thank you for clarifying. Given the set of restrictions you presented, I would experiment with the CPU config of the VM. Unfortunately, Windows is especially touchy around CPU changes. All the common Windows boot troubleshooting steps would apply here, as would looking at the BSOD if you get one; sometimes those are useful.

For more advanced debugging you can try the QEMU monitor: "https://qemu-project.gitlab.io/qemu/system/monitor.html". Other than that, I can't imagine there are any messages in the hypervisor journal, but you can always check "journalctl -f" during the boot.

Good luck


 
Hi bbgeek17,
Thanks for your response... I'll give that a shot. It looks like the same behavior with all the Windows VMs I moved over... I guess I had only been running my *nix ones.
Thanks again for the help!
 
Hi All,
I accidentally found the issue with this, thank you @bbgeek17 for the help.
The underlying issue was actually the same problem I had with my NFS share that was preventing me from installing on the actual hardware... It looks like somewhere in the networking chain an MTU was improperly set (compensating for the differences between OSs and HW). Once this was resolved (specifically on the path to the storage) everything worked perfectly. It seems that the 9k MTU on Proxmox can be checked using "ping -M do -s 8972 target" from a Proxmox SSH session... not 8958 (which is used for VMware) or 8988 (used on many *nix distros). Once we identified this problem and fixed the MTU through our chain, everything works great...
Strange that the VM booted to that screen and then froze, but hopefully this summary helps someone in the future!
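For anyone wondering where 8972 comes from, here is a rough sketch of the arithmetic and the check. It assumes IPv4 with no IP options (20-byte header) plus the 8-byte ICMP header, and the Linux iputils `ping` that ships with PVE. The function names are just made up for illustration.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical.

# Largest ICMP echo payload that fits in a single unfragmented IPv4 packet:
# MTU minus 20 bytes of IPv4 header (assuming no options) minus 8 bytes of ICMP header.
icmp_payload() {
    echo $(( $1 - 20 - 8 ))
}

# Send three pings with fragmentation prohibited (-M do).
# $1 = target host, $2 = MTU to verify (defaults to 9000).
# If any hop on the path has a smaller MTU, the pings should fail
# instead of being silently fragmented.
check_mtu() {
    target=$1
    mtu=${2:-9000}
    ping -M do -c 3 -s "$(icmp_payload "$mtu")" "$target"
}
```

Usage would be something like `check_mtu <storage-node-ip> 9000`, run from each PVE node toward each storage node. `icmp_payload 9000` gives 8972, matching the command above, and `icmp_payload 1500` gives 1472 for a standard MTU.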
Thanks;
SPraus
 
