All VMs locking up after latest PVE update

There's now a new QEMU package on our pvetest repository which includes Stefan's fix; it's in pve-qemu-kvm version 5.2.0-4.

[EDIT 2021-03-24]: Package is now on pve-no-subscription, no need for enabling the pvetest repo anymore.

The quickest and most secure way to upgrade just that package from pvetest would be:

Bash:
# enable pvetest
echo 'deb http://download.proxmox.com/debian/pve buster pvetest' > /etc/apt/sources.list.d/pvetest.list
# update available packages
apt update
# instruct apt to only install the 'pve-qemu-kvm' package, it will get the newer one from pvetest
apt install pve-qemu-kvm
# disable pvetest again
rm /etc/apt/sources.list.d/pvetest.list
apt update


Remember, you always need to either fully restart the VM after the upgrade or migrate it to an upgraded PVE node; otherwise the VM is still running the older QEMU version and the fix is not active.
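
For reference, a minimal sketch of both options; the VM ID 100 and the target node name pve2 are only placeholders for your own setup:

Bash:
# option a: fully restart the QEMU process (a reboot from inside the guest
# is not enough, the VM has to be stopped and started again)
qm shutdown 100 && qm start 100

# option b: live-migrate the VM to a node that already runs the fixed package
qm migrate 100 pve2 --online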
 
Additional information: after upgrading Proxmox (pve-qemu-kvm), a VM with Windows Server 2019 lost its network adapter and detected a new one (the network settings were lost as a result). After downgrading Proxmox, the old adapter came back and the new one disappeared.
I've noticed the same phenomenon with one of our 2012R2 servers. It happened at the same time we had the issue reported in this thread. It has also happened before the issue in this thread came to be. I noticed it the first time after a Windows update.
 
You can check with:
/usr/bin/kvm --version

Should show the following:
Code:
QEMU emulator version 5.2.0 (pve-qemu-kvm_5.2.0)
Copyright (c) 2003-2020 Fabrice Bellard and the QEMU Project developers

Also:
Code:
# dpkg -l pve-qemu-kvm
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name           Version      Architecture Description
+++-==============-============-============-===================================
ii  pve-qemu-kvm   5.2.0-4      amd64        Full virtualization on x86 hardware
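
To check which QEMU version a running VM is actually using (the installed package version only applies after a restart), one way is to query the QEMU monitor; VM ID 122 here is just an example:

Bash:
qm monitor 122
# then, at the monitor prompt, enter:
#   info version
# and leave the monitor again with:
#   quit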

Be sure to actually stop/start the virtual machines after installing. Migrating a VM off the node and back again should also be sufficient, if I read correctly. I've been migrating systems off, then upgrading with:

Code:
# Update repositories
apt update
# Upgrade OS
apt dist-upgrade

Then I use t.lamprecht's update instructions to upgrade pve-qemu-kvm. After fully updating, I've been rebooting the hosts to ensure all kernel updates are applied and then migrating the VMs back.
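
After the reboot, a quick sanity check that the host is actually running the new kernel and has the fixed QEMU package installed could look roughly like this:

Bash:
# running kernel vs. installed pve-kernel packages
uname -r
dpkg -l 'pve-kernel-*' | grep ^ii

# confirm the fixed QEMU package is installed
pveversion -v | grep pve-qemu-kvm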
 
Thank you!

The first command (/usr/bin/kvm --version) yields the same output on my production cluster as on the updated POC cluster. The second command (dpkg -l pve-qemu-kvm) does show 5.2.0-4 versus 5.2.0-3, so the upgrade appears to have worked. I live-migrated the VMs back and forth on the POC. I'll update and do the same migration on the production cluster now.
 
Bash:
# enable pvetest
echo 'deb http://download.proxmox.com/debian/pve buster pvetest' > /etc/apt/sources.list.d/pvetest.list
# update available packages
apt update
# instruct apt to only install the 'pve-qemu-kvm' package, it will get the newer one from pvetest
apt install pve-qemu-kvm
# disable pvetest again
rm /etc/apt/sources.list.d/pvetest.list
apt update
I tried that and rebooted the host. Looks like it fixed it.
All snapshot tasks still produce a "VM 122 qmp command failed - VM 122 qmp command 'query-proxmox-support' failed - unable to connect to VM 122 qmp socket - timeout after 31 retries" in the syslog, but at least VMs don't get unresponsive anymore.
 
I've noticed the same phenomenon with one of our 2012R2 servers. It happened at the same time we had the issue reported in this thread. It has also happened before the issue in this thread came to be. I noticed it the first time after a Windows update.

FYI: that seems unrelated to this specific issue; rather, you were/are affected by this bug:
https://forum.proxmox.com/threads/w...s-6-3-4-patch-inside.84915/page-2#post-373380

And here's the available fix:
https://forum.proxmox.com/threads/w...s-6-3-4-patch-inside.84915/page-3#post-374993
 
Excellent, thanks Proxmox team!

What is the cause of the "VM ### qmp command failed - VM ### qmp command 'query-proxmox-support' failed - unable to connect to VM ### qmp socket - timeout after 31 retries" syslog event that Dunuin reported above? Is this message safe to ignore?
 
What is the cause of the "VM ### qmp command failed - VM ### qmp command 'query-proxmox-support' failed - unable to connect to VM ### qmp socket - timeout after 31 retries" syslog event that Dunuin reported above? Is this message safe to ignore?
Hm, do you also see that after applying the upgrade? If so, please provide more details about your configuration, when the message appears, and potentially anything else that shows up in your logs. In general it is not safe to ignore; it means something has gone wrong, but the question is whether it is related to the error here.

There is a similar report on our bugtracker: https://bugzilla.proxmox.com/show_bug.cgi?id=3360
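
If you do see the message after the upgrade, this is roughly the kind of information worth including in a report; VM ID 122 is only a placeholder:

Bash:
pveversion -v                              # package versions on the host
qm config 122                              # configuration of the affected VM
grep qmp /var/log/syslog | tail -n 50      # recent qmp-related log lines
journalctl --since "1 hour ago" -u pvedaemon -u pvestatd   # daemon logs around the event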
 
Edit: I re-read your post; after I upgrade, I'll let you know if I'm getting this syslog message as well.

I have not upgraded yet, but the user above (Dunuin) said he upgraded from the test repo, didn't get VM freezes, but still got that syslog message:

I tried that and rebooted the host. Looks like it fixed it.
All snapshot tasks still produce a "VM 122 qmp command failed - VM 122 qmp command 'query-proxmox-support' failed - unable to connect to VM 122 qmp socket - timeout after 31 retries" in the syslog, but at least VMs don't get unresponsive anymore.
 
Yes, but with the bugged version, where the VMs got unresponsive, I got that error message several times per minute for hours. Everything was fine until I started a snapshot. Then the VMs became unresponsive; the snapshot finished with "OK" or "Error", but it looked like it got stuck somehow, because disk IO increased by a factor of 100-1000 and that error message kept spamming the syslog until I rebooted the host. After the reboot everything was fine again.

But now with the fixed version the VMs are working fine and I can use snapshots again. The error message is still there, but only while the snapshot is running, not afterwards. And I also got that message while snapshotting with "pve-qemu-kvm 5.1.0-8".
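
In case someone wants to check the same thing on their own setup, this is roughly how I'd watch the syslog while taking a test snapshot; VM ID 122 and the snapshot name are placeholders:

Bash:
# follow qmp-related log lines in one terminal
tail -f /var/log/syslog | grep --line-buffered qmp

# take and remove a test snapshot in a second terminal
qm snapshot 122 qmp-test
qm delsnapshot 122 qmp-test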
 
Hi,

my system is on pve-qemu-kvm 5.2.0-2 and wants to update to 5.2.0-4.

Thx.
Sorry, the new version that should work is "pve-qemu-kvm: 5.2.0-4"; "pve-qemu-kvm 5.1.0-8" is the last version that was working before the bug.
 
Can anyone confirm which pve-qemu-kvm versions produce the error? I see that 5.2.0-4 fixed it and the last working version was 5.1.0-8; can anyone confirm which versions in between are affected?
 
Can anyone confirm which pve-qemu-kvm versions produce the error? I see that 5.2.0-4 fixed it and the last working version was 5.1.0-8; can anyone confirm which versions in between are affected?
That regression came in with 5.2.0-1 (the first 5.2 release) and was fixed with 5.2.0-4, so all versions in between were affected.
Why do you ask?
 
That regression came in with 5.2.0-1 (the first 5.2 release) and was fixed with 5.2.0-4, so all versions in between were affected.
Why do you ask?
I have a widespread issue across 20+ nodes, so I just want to check which ones are affected, as it is hard to keep track.
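
One way to check all nodes at once would be something like the following sketch; the node names are placeholders and it assumes working root SSH between the cluster nodes:

Bash:
for node in pve01 pve02 pve03; do
    echo -n "$node: "
    ssh root@"$node" "dpkg-query -W -f='\${Version}\n' pve-qemu-kvm"
done
# anything from 5.2.0-1 up to and including 5.2.0-3 is affected,
# 5.1.0-8 and 5.2.0-4 (or newer) are fine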
 
