Windows VMs stuck on boot after Proxmox Upgrade to 7.0

Hi,
Hello, as I wrote above:
two clusters on 6.4-13:
1. pve-qemu-kvm: 5.2.0-6, kernel pve-kernel-5.11.22-5-pve: 5.11.22-10~bpo10+1 ===> problems with reboot
2. pve-qemu-kvm: 5.2.0-6, pve-kernel-5.4.78-2-pve: 5.4.78-2 ===> no problems for 1.5 years
Do these clusters use similar hardware and similar VM configurations? Which guest OSes are affected for you? Do you run into the "spinning circle" issue, a black screen, or a hang at "Guest has not initialized display"?

Just changed the kernel to 5.4 on one node of cluster 1; now we have to wait.
Hoping for the best!
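
If it helps, the version and configuration details can be collected on each host with something along these lines (a rough sketch; 100 is a placeholder ID of one affected VM):

Code:
# package and kernel versions on the host
pveversion -v
# full configuration of one affected VM
qm config 100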
 
Hi,

Do these clusters use similar hardware and similar VM configurations? Which guest OSes are affected for you? Do you run into the "spinning circle" issue, a black screen, or a hang at "Guest has not initialized display"?


Hoping for the best!

Sorry, but I don't understand.
What is the point of doing kernel rollback tests on Proxmox 6.4, which is EOL?

Doesn't it make more sense to do some rollback tests on Proxmox 7.2-7, for example trying a 5.11 kernel from September 2021, or maybe rolling QEMU back to 6.0?

The aim is to make Proxmox 7 work as it should; now that Proxmox 6 is EOL, testing there makes no sense.

IMHO
 
Sorry, but I don't understand.
What is the point of doing kernel rollback tests on Proxmox 6.4, which is EOL?
Well, if it helps find the issue, it can be fixed for Proxmox 7 ;), for example if it actually is a kernel regression. There are quite likely multiple problems described in this thread, so I'm asking which one it is.
Doesn't it make more sense to do some rollback tests on Proxmox 7.2-7, for example trying a 5.11 kernel from September 2021, or maybe rolling QEMU back to 6.0?
Please do if you can. Everything that reduces the space of possible causes is good. And please always indicate which issue it's about so we can associate the issue and the workarounds.
The aim is to make Proxmox 7 work as it should; now that Proxmox 6 is EOL, testing there makes no sense.

IMHO
 
Well, if it helps find the issue, it can be fixed for Proxmox 7 ;), for example if it actually is a kernel regression. There are quite likely multiple problems described in this thread, so I'm asking which one it is.

Please do if you can. Everything that reduces the space of possible causes is good. And please always indicate which issue it's about so we can associate the issue and the workarounds.

Unfortunately I can't because all the clusters we manage are in production.

I mentioned some time ago that in my first installations of Proxmox 7.0 the problem did not occur; please refer to #274-5.

That's why I was asking to try one of the early 5.11 kernels (September 2021) and/or QEMU 6.0.
 
Hi,

Do these clusters use similar hardware and similar VM configurations? Which guest OSes are affected for you? Do you run into the "spinning circle" issue, a black screen, or a hang at "Guest has not initialized display"?


Hoping for the best!
All of them run into the spinning circle, and they have similar hardware (Supermicro with EPYC) and similar VM configurations.
 
@wolfgang5505 thanks! One more question: are the VMs freshly installed (not clones or restored from backup) or imported from somewhere else?

Unfortunately I can't because all the clusters we manage are in production.

I mentioned some time ago that in my first installations of Proxmox 7.0 the problem did not occur; please refer to #274-5.

That's why I was asking to try one of the early 5.11 kernels (September 2021) and/or QEMU 6.0.
Yes, ideally it would be possible to identify either the kernel or QEMU as the culprit.

@wolfgang5505 's report seems to be the odd one out; I can't see any other reports with a QEMU version below 6.0 (in fact, below 6.0.0-4).

In QEMU 6.0.0-4 we switched to turning SMM on, which might be another candidate for causing the issue. Comparing pve-qemu-kvm=6.0.0-3 and pve-qemu-kvm=6.0.0-4 might be interesting.

Does anybody have a machine with an efitype=4m EFI disk running into the problem?
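
For anyone willing and able to test this on a non-critical host, a rough sketch of such a downgrade (assuming the older build is still available from the configured Proxmox repositories) could look like the following. Note that affected VMs need a full stop and start afterwards, not just a reboot from inside the guest, to actually run the other QEMU binary.

Code:
# check which pve-qemu-kvm versions the repositories still offer
apt-cache policy pve-qemu-kvm
# install the older build and keep it from being upgraded again
apt install pve-qemu-kvm=6.0.0-3
apt-mark hold pve-qemu-kvm
# ... test reboots of the affected VMs ...
# later, to return to the current version
apt-mark unhold pve-qemu-kvm
apt install pve-qemu-kvm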
 
@wolfgang5505 thanks! One more question: are the VMs freshly installed (not clones or restored from backup) or imported from somewhere else?


Yes, ideally it would be possible to identify either the kernel or QEMU as the culprit.

@wolfgang5505 's report seems to be the odd one out; I can't see any other reports with a QEMU version below 6.0 (in fact, below 6.0.0-4).

In QEMU 6.0.0-4 we switched to turning SMM on, which might be another candidate for causing the issue. Comparing pve-qemu-kvm=6.0.0-3 and pve-qemu-kvm=6.0.0-4 might be interesting.

Does anybody have a machine with an efitype=4m EFI disk running into the problem?

... strange. In October 2021 I was using pve-qemu 6.0.0-4 and we don't remember ever having any problems. The problems came later, when moving to Proxmox 7.1 in December 2021.
 
... strange. In October 2021 I was using pve-qemu 6.0.0-4 and we don't remember ever having any problems. The problems came later, when moving to Proxmox 7.1 in December 2021.
All I can say is that we have this issue (spinning dots) on all PVE 7 and PVE 6 hosts (except with kernel 5.4).
Yes, it's EOL, but the same problem occurs with pve-qemu 5.2.0-6.
 
... strange. In October 2021 I was using pve-qemu 6.0.0-4 and we don't remember ever having any problems. The problems came later, when moving to Proxmox 7.1 in December 2021.
I see, but the very first report in this thread is using 6.0.0-4, for example. It might be that the issue was made more likely by some other change later? Or maybe it doesn't affect SeaBIOS and UEFI the same way? Which of the two do your affected VMs use?
 
I see, but the very first report in this thread is using 6.0.0-4, for example. It might be that the issue was made more likely by some other change later? Or maybe it doesn't affect SeaBIOS and UEFI the same way? Which of the two do your affected VMs use?
The PVE 6 VMs are all SeaBIOS.
 
All I can say is that we have this issue (spinning dots) on all PVE 7 and PVE 6 hosts (except with kernel 5.4).
Yes, it's EOL, but the same problem occurs with pve-qemu 5.2.0-6.

I see, but the very first report in this thread is using 6.0.0-4, for example. It might be that the issue was made more likely by some other change later? Or maybe it doesn't affect SeaBIOS and UEFI the same way? Which of the two do your affected VMs use?

OK, I also checked an (old) cluster that has been running with more than two years of uptime on Proxmox 6.4 and kernel 5.4 (I use KernelCare). It has never, and I mean never, had the slightest problem.

Question: let's start from a certainty, that kernel 5.4 is not the culprit, so ... is it possible to run Proxmox 7.2-7 with the Proxmox 6.4 kernel 5.4?

If we start from something certain ... the kernel is not the culprit.
 
OK, I also checked an (old) cluster that has been running with more than two years of uptime on Proxmox 6.4 and kernel 5.4 (I use KernelCare). It has never, and I mean never, had the slightest problem.

Question: let's start from a certainty, that kernel 5.4 is not the culprit, so ... is it possible to run Proxmox 7.2-7 with the Proxmox 6.4 kernel 5.4?

If we start from something certain ... the kernel is not the culprit.
Is it possible for you to update the kernel on the old cluster? Yes, I can confirm that with Proxmox 6.4 and kernel 5.4 I never had any problems.
 
We are affected on Dell R640 and R740 with Proxmox 7. After a Windows update, there is a spinning circle on reboot and I must stop and start the VM. It doesn't happen on other hardware like a DL380p or my home Proxmox, all on the same version 7. Seems like something related to hardware / a Windows change after the update?
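
For reference, that stop/start workaround from the host shell is just the following (100 being a placeholder VM ID); a reboot from inside the guest is not enough, the VM process has to be stopped and started again:

Code:
qm stop 100
qm start 100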
 
@fiona @t.lamprecht, @Moayad, @mira @tom @aaron @oguz

Hello, since it's the end of the month I was just looking for an update. I know that you can reproduce the issue with the freezing machines. (Thank you to the member who provided that to the Proxmox team!) I am going to assume you don't have a fix yet, but could you give us some information on how the testing is going? Do you feel you are on the right path to resolving the issue? And if you are close to a fix, around when would it come out? Just looking for some information in the hope that you are closer to fixing this problem.

Thanks for your ongoing battle with this bug.
 
Hi,
@fiona @t.lamprecht, @Moayad, @mira @tom @aaron @oguz

Hello, since it's the end of the month I was just looking for an update. I know that you can reproduce the issue with the freezing machines. (Thank you to the member who provided that to the Proxmox team!)
Yes, we were able to reproduce hangs locally with user-provided images. These seem to involve some kind of corruption, where the guest ends up in a reset loop and runs into CPU triple faults. But it's not yet clear where the corruption is coming from, and it's not clear how much it relates to the spinning-circles issue, as the symptoms are rather different (in the spinning-circles case, QEMU seems to be executing guest code as usual; there's no loop and no triple faults).

We were still not able to reproduce the spinning-circles issue locally (some of you might not believe me, but we don't have a huge test cluster with dozens and dozens of Windows VMs like your production setups...). We do have access to a VM with a snapshot taken right before a reboot that exposes the issue, but we can't modify the host system there. We tried to load that snapshot locally; it uses CPU type host, and one colleague had a similar enough CPU to successfully load it, but when he rebooted, it didn't hang :/

I am going to assume you don't have a fix yet, but could you give us some information on how the testing is going? Do you feel you are on the right path to resolving the issue? And if you are close to a fix, around when would it come out? Just looking for some information in the hope that you are closer to fixing this problem.
We are currently looking at memory dumps from the above-mentioned VMs.
Thanks for your ongoing battle with this bug.

So... there's still nobody reporting that a guest with an efitype=4m EFI disk ran into this?
Did anybody try to downgrade to pve-qemu-kvm=6.0.0-3, and if so, what can you tell us?
 
Hi,

Yes, we were able to reproduce hangs locally with user-provided images. These seem to involve some kind of corruption, where the guest ends up in a reset loop and runs into CPU triple faults. But it's not yet clear where the corruption is coming from, and it's not clear how much it relates to the spinning-circles issue, as the symptoms are rather different (in the spinning-circles case, QEMU seems to be executing guest code as usual; there's no loop and no triple faults).

We were still not able to reproduce the spinning-circles issue locally (some of you might not believe me, but we don't have a huge test cluster with dozens and dozens of Windows VMs like your production setups...). We do have access to a VM with a snapshot taken right before a reboot that exposes the issue, but we can't modify the host system there. We tried to load that snapshot locally; it uses CPU type host, and one colleague had a similar enough CPU to successfully load it, but when he rebooted, it didn't hang :/


We are currently looking at memory dumps from the above-mentioned VMs.


So... there's still nobody reporting that a guest with an efitype=4m EFI disk ran into this?
Did anybody try to downgrade to pve-qemu-kvm=6.0.0-3, and if so, what can you tell us?
Can I simply remove the 128K EFI disk and somehow add a 4M one to verify this?
 
Can I simply remove the 128K EFI disk and somehow add a 4M one to verify this?
Windows might be finicky with this, and if you changed settings/vars you'll need to re-create them. I'd first make a snapshot and/or keep the old EFI disk around (reattaching is currently only possible via the CLI).

But there is a report with efitype=4m disk now: https://forum.proxmox.com/threads/stuck-at-efi-boot-during-reboot-of-vm.112946/ although it does seem to get stuck at a slightly different place.
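
A rough sketch of how that switch could be done from the CLI, in case it helps (VM ID 100 and storage local-lvm are placeholders; the new vars volume starts out empty, so any boot entries or settings stored in the old one are gone unless you re-create them or reattach the old disk):

Code:
# snapshot first so everything can be rolled back
qm snapshot 100 before-4m-efidisk
# detach the current 128k EFI disk; it is kept as an "unused" volume, not deleted
qm set 100 --delete efidisk0
# add a new EFI disk of the 4m type
qm set 100 --efidisk0 local-lvm:1,efitype=4m
# to go back, reattach the old volume (the name below is just an example;
# check the "unused" entries in the VM config for the real one)
# qm set 100 --efidisk0 local-lvm:vm-100-disk-0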
 
Windows might be finicky with this, and if you changed settings/vars you'll need to re-create them. I'd first make a snapshot and/or keep the old EFI disk around (reattaching is currently only possible via the CLI).

But there is a report with efitype=4m disk now: https://forum.proxmox.com/threads/stuck-at-efi-boot-during-reboot-of-vm.112946/ although it does seem to get stuck at a slightly different place.
Switched about 20 VMs to efitype=4m for testing... will report back...
 
Windows might be finicky with this, and if you changed settings/vars you'll need to re-create them. I'd first make a snapshot and/or keep the old EFI disk around (reattaching is currently only possible via the CLI).

But there is a report with efitype=4m disk now: https://forum.proxmox.com/threads/stuck-at-efi-boot-during-reboot-of-vm.112946/ although it does seem to get stuck at a slightly different place.

... I would try downgrading (especially QEMU and kernel).

Unfortunately I use clusters in production and I don't have a real test system, so I can't do such a thing.

Does anyone have hardware that definitely shows this problem and can downgrade to pve-qemu-kvm=6.0.0-3?

Thanks
 
