Windows VMs stuck on boot after Proxmox Upgrade to 7.0

Hi,
Hello, as I wrote above:
two clusters on 6.4-13:
1. pve-qemu-kvm: 5.2.0-6, kernel pve-kernel-5.11.22-5-pve: 5.11.22-10~bpo10+1 ===> problems with reboot
2. pve-qemu-kvm: 5.2.0-6, pve-kernel-5.4.78-2-pve: 5.4.78-2 ===> no problems for 1.5 years
Do these clusters use similar hardware and similar VM configurations? Which guest OSes are affected for you? Do you run into the "spinning circle" issue, a black screen, or a hang at "Guest has not initialized display"?

Just changed the kernel to 5.4 on one node of cluster 1; now we have to wait.
Hoping for the best!
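
If it helps, the version and configuration details can be collected on each host with something along these lines (a rough sketch; 100 is a placeholder ID of one affected VM):

Code:
# package and kernel versions on the host
pveversion -v
# full configuration of one affected VM
qm config 100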
 
Hi,

Do these clusters use similar hardware and similar VM configurations? Which guest OSes are affected for you? Do you run into the "spinning circle" issue, a black screen, or a hang at "Guest has not initialized display"?


Hoping for the best!

Sorry, but I don't understand.
What is the point of doing kernel rollback tests on Proxmox 6.4, which is EOL?

Doesn't it make more sense to do some rollback tests on Proxmox 7.2-7, for example trying a 5.11 kernel from September 2021, or maybe rolling QEMU back to 6.0?

The aim is to make Proxmox 7 work as it should; now that Proxmox 6 is EOL, testing there makes no sense.

IMHO
 
Sorry, but I don't understand.
What is the point of doing kernel rollback tests on Proxmox 6.4, which is EOL?
Well, if it helps find the issue, it can be fixed for Proxmox 7 ;), for example if it actually is a kernel regression. There are quite likely multiple problems described in this thread, so I'm asking which one it is.
Doesn't it make more sense to do some rollback tests on Proxmox 7.2-7, for example trying a 5.11 kernel from September 2021, or maybe rolling QEMU back to 6.0?
Please do if you can. Everything that reduces the space of possible causes is good. And please always indicate which issue it's about so we can associate the issue and the workarounds.
The aim is to make Proxmox 7 work as it should; now that Proxmox 6 is EOL, testing there makes no sense.

IMHO
 
Well, if it helps find the issue, it can be fixed for Proxmox 7 ;), for example if it actually is a kernel regression. There are quite likely multiple problems described in this thread, so I'm asking which one it is.

Please do if you can. Everything that reduces the space of possible causes is good. And please always indicate which issue it's about so we can associate the issue and the workarounds.

Unfortunately I can't because all the clusters we manage are in production.

I mentioned some time ago that in my first installations of Proxmox 7.0 the problem did not occur; please refer to #274-5.

That's why I was asking to try one of the early 5.11 kernels (September 2021) and/or QEMU 6.0.
 
Hi,

Do these clusters use similar hardware and similar VM configurations? Which guest OSes are affected for you? Do you run into the "spinning circle" issue, a black screen, or a hang at "Guest has not initialized display"?


Hoping for the best!
All of them run into the spinning circle, and they have similar hardware (Supermicro with EPYC) and similar VM configurations.
 
@wolfgang5505 thanks! One more question: are the VMs freshly installed (not clones or restored from backup) or imported from somewhere else?

Unfortunately I can't because all the clusters we manage are in production.

I mentioned some time ago that in my first installations of Proxmox 7.0 the problem did not occur; please refer to #274-5.

That's why I was asking to try one of the early 5.11 kernels (September 2021) and/or QEMU 6.0.
Yes, ideally it would be possible to identify either the kernel or QEMU as the culprit.

@wolfgang5505 's report seems to be the odd one out; I can't see any other reports with a QEMU version below 6.0 (in fact, below 6.0.0-4).

In QEMU 6.0.0-4 we switched to turning SMM on, which might be another candidate for causing the issue. Comparing pve-qemu-kvm=6.0.0-3 and pve-qemu-kvm=6.0.0-4 might be interesting.

Does anybody have a machine with an efitype=4m EFI disk running into the problem?
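
For anyone willing and able to test this on a non-critical host, a rough sketch of such a downgrade (assuming the older build is still available from the configured Proxmox repositories) could look like the following. Note that affected VMs need a full stop and start afterwards, not just a reboot from inside the guest, to actually run the other QEMU binary.

Code:
# check which pve-qemu-kvm versions the repositories still offer
apt-cache policy pve-qemu-kvm
# install the older build and keep it from being upgraded again
apt install pve-qemu-kvm=6.0.0-3
apt-mark hold pve-qemu-kvm
# ... test reboots of the affected VMs ...
# later, to return to the current version
apt-mark unhold pve-qemu-kvm
apt install pve-qemu-kvm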
 
@wolfgang5505 thanks! One more question: are the VMs freshly installed (not clones or restored from backup) or imported from somewhere else?


Yes, ideally it would be possible to identify either the kernel or QEMU as the culprit.

@wolfgang5505 's report seems to be the odd one out; I can't see any other reports with a QEMU version below 6.0 (in fact, below 6.0.0-4).

In QEMU 6.0.0-4 we switched to turning SMM on, which might be another candidate for causing the issue. Comparing pve-qemu-kvm=6.0.0-3 and pve-qemu-kvm=6.0.0-4 might be interesting.

Does anybody have a machine with an efitype=4m EFI disk running into the problem?

... strange. In October 2021 I was using pve-qemu 6.0.0-4 and we don't remember ever having any problems. The problems came later, when moving to Proxmox 7.1 in December 2021.
 
... strange. In October 2021 I was using pve-qemu 6.0.0-4 and we don't remember ever having any problems. The problems came later, when moving to Proxmox 7.1 in December 2021.
All I can say is that we have this issue (spinning dots) on all PVE 7 and PVE 6 hosts (except with kernel 5.4).
Yes, it's EOL, but the same problem occurs with pve-qemu 5.2.0-6.
 
... strange. In October 2021 I was using pve-qemu 6.0.0-4 and we don't remember ever having any problems. The problems came later, when moving to Proxmox 7.1 in December 2021.
I see, but the very first report in this thread is using 6.0.0-4, for example. It might be that the issue was made more likely by some other change later? Or maybe it doesn't affect SeaBIOS and UEFI the same way? Which of the two do your affected VMs use?
 
I see, but the very first report in this thread is using 6.0.0-4, for example. It might be that the issue was made more likely by some other change later? Or maybe it doesn't affect SeaBIOS and UEFI the same way? Which of the two do your affected VMs use?
The PVE 6 VMs are all SeaBIOS.
 
All I can say is that we have this issue (spinning dots) on all PVE 7 and PVE 6 hosts (except with kernel 5.4).
Yes, it's EOL, but the same problem occurs with pve-qemu 5.2.0-6.

I see, but the very first report in this thread is using 6.0.0-4, for example. It might be that the issue was made more likely by some other change later? Or maybe it doesn't affect SeaBIOS and UEFI the same way? Which of the two do your affected VMs use?

OK, I also checked an (old) cluster that has been running with more than two years of uptime on Proxmox 6.4 and kernel 5.4 (I use KernelCare). It has never, and I mean never, had the slightest problem.

Question: let's start from a certainty, that kernel 5.4 is not the culprit, so ... is it possible to run Proxmox 7.2-7 with the Proxmox 6.4 kernel 5.4?

If we start from something certain ... the kernel is not the culprit.
 
OK, I also checked an (old) cluster that has been running with more than two years of uptime on Proxmox 6.4 and kernel 5.4 (I use KernelCare). It has never, and I mean never, had the slightest problem.

Question: let's start from a certainty, that kernel 5.4 is not the culprit, so ... is it possible to run Proxmox 7.2-7 with the Proxmox 6.4 kernel 5.4?

If we start from something certain ... the kernel is not the culprit.
Is it possible for you to update the kernel on the old cluster? Yes, I can confirm that with Proxmox 6.4 and kernel 5.4 I never had any problems.
 
We are affected on Dell R640 and R740 with Proxmox 7. After a Windows update, there is a spinning circle on reboot and I must stop and start the VM. It doesn't happen on other hardware like a DL380p or my home Proxmox, all on the same version 7. Seems like something related to hardware / a Windows change after the update?
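
For reference, that stop/start workaround from the host shell is just the following (100 being a placeholder VM ID); a reboot from inside the guest is not enough, the VM process has to be stopped and started again:

Code:
qm stop 100
qm start 100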
 
@fiona @t.lamprecht, @Moayad, @mira @tom @aaron @oguz

Hello, since it's the end of the month I was just looking for an update. I know that you can reproduce the issue with the freezing machines. (Thank you to the member who provided that to the Proxmox team!) I am going to assume you don't have a fix yet, but could you give us some information on how the testing is going? Do you feel you are on the right path to resolving the issue? And if you are close to a fix, around when would it come out? Just looking for some information in the hope that you are closer to fixing this problem.

Thanks for your ongoing battle with this bug.
 
Hi,
@fiona @t.lamprecht, @Moayad, @mira @tom @aaron @oguz

Hello, since it's the end of the month I was just looking for an update. I know that you can reproduce the issue with the freezing machines. (Thank you to the member who provided that to the Proxmox team!)
Yes, we were able to reproduce hangs locally with user-provided images. These seem to involve some kind of corruption, where the guest ends up in a reset loop and runs into CPU triple faults. But it's not yet clear where the corruption is coming from, and it's not clear how much it relates to the spinning-circles issue, as the symptoms are rather different (in the spinning-circles case, QEMU seems to be executing guest code as usual; there's no loop and no triple faults).

We were still not able to reproduce the spinning-circles issue locally (some of you might not believe me, but we don't have a huge test cluster with dozens and dozens of Windows VMs like your production setups...). We do have access to a VM with a snapshot taken right before a reboot that exposes the issue, but we can't modify the host system there. We tried to load that snapshot locally; it uses CPU type host, and one colleague had a similar enough CPU to successfully load it, but when he rebooted, it didn't hang :/

I am going to assume you don't have a fix yet, but could you give us some information on how the testing is going? Do you feel you are on the right path to resolving the issue? And if you are close to a fix, around when would it come out? Just looking for some information in the hope that you are closer to fixing this problem.
We are currently looking at memory dumps from the above-mentioned VMs.
Thanks for your ongoing battle with this bug.

So... there's still nobody reporting that a guest with an efitype=4m EFI disk ran into this?
Did anybody try to downgrade to pve-qemu-kvm=6.0.0-3, and if so, what can you tell us?
 
Hi,

Yes, we were able to reproduce hangs locally with user-provided images. These seem to involve some kind of corruption, where the guest ends up in a reset loop and runs into CPU triple faults. But it's not yet clear where the corruption is coming from, and it's not clear how much it relates to the spinning-circles issue, as the symptoms are rather different (in the spinning-circles case, QEMU seems to be executing guest code as usual; there's no loop and no triple faults).

We were still not able to reproduce the spinning-circles issue locally (some of you might not believe me, but we don't have a huge test cluster with dozens and dozens of Windows VMs like your production setups...). We do have access to a VM with a snapshot taken right before a reboot that exposes the issue, but we can't modify the host system there. We tried to load that snapshot locally; it uses CPU type host, and one colleague had a similar enough CPU to successfully load it, but when he rebooted, it didn't hang :/


We are currently looking at memory dumps from the above-mentioned VMs.


So... there's still nobody reporting that a guest with an efitype=4m EFI disk ran into this?
Did anybody try to downgrade to pve-qemu-kvm=6.0.0-3, and if so, what can you tell us?
Can I simply remove the 128K EFI disk and somehow add a 4M one to verify this?
 
Can I simply remove the 128K EFI disk and somehow add a 4M one to verify this?
Windows might be finicky with this, and if you changed settings/vars you'll need to re-create them. I'd first make a snapshot and/or keep the old EFI disk around (reattaching is currently only possible via the CLI).

But there is a report with efitype=4m disk now: https://forum.proxmox.com/threads/stuck-at-efi-boot-during-reboot-of-vm.112946/ although it does seem to get stuck at a slightly different place.
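
A rough sketch of how that switch could be done from the CLI, in case it helps (VM ID 100 and storage local-lvm are placeholders; the new vars volume starts out empty, so any boot entries or settings stored in the old one are gone unless you re-create them or reattach the old disk):

Code:
# snapshot first so everything can be rolled back
qm snapshot 100 before-4m-efidisk
# detach the current 128k EFI disk; it is kept as an "unused" volume, not deleted
qm set 100 --delete efidisk0
# add a new EFI disk of the 4m type
qm set 100 --efidisk0 local-lvm:1,efitype=4m
# to go back, reattach the old volume (the name below is just an example;
# check the "unused" entries in the VM config for the real one)
# qm set 100 --efidisk0 local-lvm:vm-100-disk-0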
 
Windows might be finicky with this, and if you changed settings/vars you'll need to re-create them. I'd first make a snapshot and/or keep the old EFI disk around (reattaching is currently only possible via the CLI).

But there is a report with efitype=4m disk now: https://forum.proxmox.com/threads/stuck-at-efi-boot-during-reboot-of-vm.112946/ although it does seem to get stuck at a slightly different place.
Switched about 20 VMs to efitype=4m for testing... will report back...
 
Windows might be finicky with this, and if you changed settings/vars you'll need to re-create them. I'd first make a snapshot and/or keep the old EFI disk around (reattaching is currently only possible via the CLI).

But there is a report with efitype=4m disk now: https://forum.proxmox.com/threads/stuck-at-efi-boot-during-reboot-of-vm.112946/ although it does seem to get stuck at a slightly different place.

... I would try downgrading (especially QEMU and kernel).

Unfortunately I use clusters in production and I don't have a real test system, so I can't do such a thing.

Does anyone have hardware that definitely shows this problem and can downgrade to pve-qemu-kvm=6.0.0-3?

Thanks
 
