VM shutdown, KVM: entry failed, hardware error 0x80000021

Hi,

Experience with the package, which has now spent about two weeks on testing and one on no-subscription, has been good so far, so from that POV it would now be good to go for the enterprise repository.

But upstream sent out a slightly revised fix with some development feedback addressed, which we'd like to check out first. The closer we stay to the patch series that actually goes in upstream, the less friction potential there is in the long run.

Testing that version with our reproducer will take a few hours; if that goes well we'll move that package a bit faster through the open repos, and as the change is relatively small compared to the last kernel build, it should get into the enterprise repo no later than the start of next week, naturally only if nothing comes up. Depending on upstream feedback, and also further feedback from the community here, we may be able to shortcut that, but no promises; if we're not very sure about the impact potential we'll lean towards the slower and safer option. Anyhow, we'll keep you posted once the new kernel is available in the public repos.

Thanks for your answers... and for the caution you take ;-)
 
The updated version of the fix is now available on pvetest as package pve-kernel-5.15.39-4-pve in version 5.15.39-4. Our reproducer is still fixed, and in addition some slightly dubious kernel logs, like:

Code:
QEMU[2775]: kvm: Could not update PFLASH: Stale file handle
kernel: kvm: vcpu 1: requested 191999 ns lapic timer period limited to 200000 ns

disappeared with the new version, so it's even a slight improvement.
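For anyone who wants to give it a spin early: a minimal sketch of how the package could be pulled in from pvetest, assuming a Proxmox VE 7.x host on Debian Bullseye (adapt the repository line to your setup):

Code:
# enable the pvetest repository (example for PVE 7.x / Bullseye)
echo "deb http://download.proxmox.com/debian/pve bullseye pvetest" > /etc/apt/sources.list.d/pvetest.list
apt update
# install the new kernel build and reboot into it
apt install pve-kernel-5.15.39-4-pve
reboot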
 
Looks OK on a first glance?
You do need to reboot the PVE node for the new parameter setting to take effect.
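For reference, a minimal sketch of how the kvm.tdp_mmu workaround discussed in this thread can be set persistently via a modprobe option rather than the kernel command line; the file name is only an example, and a reboot is needed either way (removing the file and rebooting restores the default):

Code:
# disable the TDP MMU for the kvm module (equivalent to kvm.tdp_mmu=N on the kernel command line)
echo "options kvm tdp_mmu=N" > /etc/modprobe.d/kvm-tdp-mmu.conf
# refresh the initramfs (harmless if kvm is not included in it), then reboot the PVE node
update-initramfs -u -k all
reboot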
Hi,
I installed the new kernel a week ago and everything works fine; I also re-enabled the tdp_mmu.
Several VMs with Win2022 and Debian, and there aren't any errors about KVM.
At the moment I have these pve-kernel versions installed on my hosts:
Code:
pve-kernel-5.13.19-6-pve/stable,now 5.13.19-15 amd64 [installed]
pve-kernel-5.15.30-2-pve/stable,now 5.15.30-3 amd64 [installed]
pve-kernel-5.15.39-1-pve/stable,now 5.15.39-1 amd64 [installed,automatic]
pve-kernel-5.15.39-3-pve/stable,now 5.15.39-3 amd64 [installed,automatic]
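Side note: a list like the one above can be generated directly with apt; a minimal sketch:

Code:
# show all installed pve-kernel packages and their versions
apt list --installed 2>/dev/null | grep pve-kernel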

Is it safe to remove the versions 5.13.19-6 and 5.15.30-2?
 
I installed the new kernel a week ago and everything works fine; I also re-enabled the tdp_mmu.
Several VMs with Win2022 and Debian, and there aren't any errors about KVM.
If you still have the time: @t.lamprecht uploaded a new kernel 5.15.39-4 yesterday, which contains the same fix, but in a version that is more likely to get merged upstream, so testing would be much appreciated!

Is it safe to remove the versions 5.13.19-6 and 5.15.30-2?
In general I'd say that as long as there is a kernel on the machine that boots and works, you can of course remove old ones (you can always reinstall them later if you really want to), but make sure that you have a working kernel left :)
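Purely as an illustration, using the package names from the post above, the removal could look like this (double-check that the kernel you are currently booted into is not among them):

Code:
# remove the two old kernel packages; keep at least one known-good kernel installed
apt remove pve-kernel-5.13.19-6-pve pve-kernel-5.15.30-2-pve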
 
If you still have the time: @t.lamprecht uploaded a new kernel 5.15.39-4 yesterday, which contains the same fix, but in a version that is more likely to get merged upstream, so testing would be much appreciated!


In general I'd say that as long as there is a kernel on the machine that boots and works, you can of course remove old ones (you can always reinstall them later if you really want to), but make sure that you have a working kernel left :)
Hi,
A few minutes ago I installed the new kernel version 5.15.39-4 and restarted the node.
I also installed the new version of pve-kernel-helper.
After one week I will reply to this post with feedback about it.

Thanks!
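A quick sanity check after such a reboot might look like this (sketch only, package names as used in this thread):

Code:
# confirm the node actually booted into the new kernel
uname -r    # expected: 5.15.39-4-pve
# confirm the installed pve-kernel-helper version
dpkg -l pve-kernel-helper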
 
Upgraded to 5.15.39-4 on 3 servers that were having the issue, and the Windows VMs have been stable for the last 2 weeks. There does seem to be a slight performance impact with this new kernel, but at least it is stable.

Thanks for the fix!
 
Short update: I upgraded to 5.15.39-4 a week ago and also removed the kvm.tdp_mmu=N setting.
So far all my VMs are running stable and do not crash during backup; my Exchange server in particular had been a problem before.

But now I have another issue after this change: almost every day at least one VM won't be backed up by PBS, and I get an error similar to: backup write data failed: command error: protocol canceled.

I was then usually able to back up this machine individually in offline mode, but often the next day another VM had a similar issue.

Now I have reverted to the kvm.tdp_mmu=N setting but kept kernel version 5.15.39-4, and all backups are running normally so far. The PBS server is also a VM running on the same Proxmox host, and the backup target is a Synology NAS.
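In case it helps others comparing notes: whether the workaround is currently active can be checked at runtime via the module parameter; a minimal sketch:

Code:
# N means the kvm.tdp_mmu=N workaround is in effect, Y means the TDP MMU is enabled
cat /sys/module/kvm/parameters/tdp_mmu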
 
If you still have the time: @t.lamprecht uploaded a new kernel 5.15.39-4 yesterday, which contains the same fix, but in a version that is more likely to get merged upstream, so testing would be much appreciated!


In general I'd say that as long as there is a kernel on the machine that boots and works, you can of course remove old ones (you can always reinstall them later if you really want to), but make sure that you have a working kernel left :)
Just wondering if the fix has been merged upstream yet, or what the estimated time for the merge is?
 
What do you mean by upstream? Our kernel has had the fix for a while now, and it is rolled out on all supported repositories.
Apologies, I am not familiar with your processes. I thought when @Stoiko Ivanov mentioned "is more likely to get merged upstream" that it was still in a test phase. We have applied the fix, and thank you to this community, which helped bring closure to this issue.
 
Hello, I'm having a problem with Windows Server 2022 on Proxmox VE, no Hyper-V installed.
It was recently upgraded from WS2008R2, with which we had 0 problems.

The machine only hangs for a while when doing lots of IOPS, always when using the Windows disk optimizer, or maybe when syncing with Google Drive for the first time. It comes back in maybe 10-15 minutes, depending on the load. There is no IO delay on the hypervisor; the maximum was 3.15% this week.

It loses connectivity, the QEMU guest tools drop communication, and the machine's display on the hypervisor doesn't work, but it keeps working by itself; after a while it recovers connectivity, and it doesn't restart.
It has two disks, one connected to a ZFS NVMe RAID1 for the OS, and another to an LVM RAID1 for data.

Moved the data disk to another pool; made no difference.
Tried the disks with native and io_uring Async IO (see the sketch below); made no difference.
Reduced the read/write and ops/s limits on the disks; made no difference.
Tried several Proxmox kernels, both 5.13 and 5.15; made no difference.
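For reference, the Async IO mode mentioned above is a per-disk option; switching it with qm set might look roughly like this, where the VM ID, storage and volume names are only placeholders:

Code:
# example only: switch a data disk of VM 105 to aio=native (storage/volume names are hypothetical)
qm set 105 --scsi1 my-lvm:vm-105-disk-1,aio=native
# or back to io_uring
qm set 105 --scsi1 my-lvm:vm-105-disk-1,aio=io_uring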




Syslog throws this:

Code:
pvedaemon[2320607]: VM 105 qmp command failed - VM 105 qmp command 'guest-network-get-interfaces' failed - got timeout
I would create a new thread. I don't believe the issue in this thread is related to what you are experiencing.
 
