Freeze issue with latest Proxmox 6.3-4 and AMD CPU

As for the Intel vs AMD side of this issue: we are seeing it on both, but AMD is far more affected.

We have a 10-node cluster of rather old hardware: the CPUs are Intel Xeon E5 (dual socket), plus one AMD Opteron 6344 (quad socket).
The nodes had been running stable.
We're using PBS.

We updated PVE from 6.2-11 to 6.3-4 on 2-3 nodes per day, observed them, then moved on to the next nodes the following day.
All was good; the node with the AMD Opteron was the last to be updated.

After updating the Opteron node, all VMs on it crashed during the night:
- no response to ICMP or ssh from the VMs
- couldn't connect to the VM console in the node GUI
- qm stop managed to stop the VM, but only by killing the process (not gracefully)
- qm start started the VM normally
Load on the node was about 1-1.5 per VM, with lots of "qmp command failed" and "qmp socket timeout" messages in the node's syslog.
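
A rough way to see which VMs are producing the qmp errors described above is to tally the matching syslog lines per VM ID. This is only a minimal sketch, not an official tool: the function name qmp_errors_per_vm is made up, and it assumes the messages look like "VM <id> qmp command ..." or "VM <id> qmp socket ...", which may not match your log lines exactly.

```shell
#!/usr/bin/env bash
# Tally qmp error lines per VM ID from syslog text read on stdin.
# Assumption: each affected line contains "VM <id> qmp command" or
# "VM <id> qmp socket"; adjust the pattern if your messages differ.
qmp_errors_per_vm() {
    awk '
        match($0, /VM [0-9]+ qmp (command|socket)/) {
            # Only the first match on the line is used, so a message that
            # repeats "VM <id> qmp command" is still counted once.
            rest = substr($0, RSTART + 3)   # text starting at the VM ID
            split(rest, parts, " ")
            count[parts[1]]++
        }
        END { for (id in count) print id, count[id] }
    ' | sort -n
}
```

On a node that still writes a classic syslog file this could be run as: qmp_errors_per_vm < /var/log/syslog. The output is one "VMID count" pair per line.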

The situation happened again the next morning, same scenario: VMs dead, high load on the node, qmp errors in syslog.
We updated the Opteron node to 6.3-6 (kernel 5.4.103-1) and rebooted.
Next morning, same scenario, plus one of the Intel Xeon E5 nodes got the same issue (some 5-6 days after it was updated to 6.3-4).

Then we did the downgrade advised here (pve-qemu-kvm=5.1.0-8 libproxmox-backup-qemu0=1.0.2-1) on the Opteron node and on the one affected Intel node only.
One morning later, all VMs are fine on the AMD node (and on the affected Intel node).

8 Intel nodes: no issues.
One Intel node had the issue once.
The AMD node had the issue multiple times.
I've noticed that the mass qmp errors in syslog tend to start after midnight. We don't have anything specific scheduled then; PBS backups start at 22:00 and are done around 23:30.
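
To check whether the errors really cluster after midnight, the same syslog lines can be bucketed by hour. Again just a sketch under assumptions: qmp_errors_by_hour is a made-up name, and it expects the classic "Mon DD HH:MM:SS host ..." syslog timestamp, so it would need adjusting for other log formats.

```shell
#!/usr/bin/env bash
# Count qmp error lines per hour of day from syslog text read on stdin.
# Assumption: classic syslog timestamps, where the third whitespace-separated
# field is HH:MM:SS.
qmp_errors_by_hour() {
    awk '
        /qmp (command|socket)/ {
            split($3, t, ":")   # t[1] is the two-digit hour
            hour[t[1]]++
        }
        END { for (h in hour) print h, hour[h] }
    ' | sort
}
```

Run as: qmp_errors_by_hour < /var/log/syslog. Each output line is "HH count", which makes it easy to compare against the 22:00 to 23:30 backup window.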
 
So I cannot downgrade for now? Am I right?
You can't downgrade the kernel, but you can do so for QEMU (which as it seems currently works around the issue):
apt install pve-qemu-kvm=5.1.0-8 libproxmox-backup-qemu0=1.0.2-1
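
Not part of the advice above, but if you go this route it may be worth holding the two packages so a routine upgrade doesn't pull 5.2 back in before a fix is released. A small sketch, with the hold step and the DRY_RUN switch being my own additions (apt-mark hold is a standard apt command; the function names are made up):

```shell
#!/usr/bin/env bash
# Print the command when DRY_RUN is set, otherwise execute it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Downgrade QEMU to the versions given above, then hold the packages.
# The hold step is an assumption of this sketch, not something the
# Proxmox staff reply asked for.
downgrade_and_hold_qemu() {
    run apt install pve-qemu-kvm=5.1.0-8 libproxmox-backup-qemu0=1.0.2-1
    run apt-mark hold pve-qemu-kvm libproxmox-backup-qemu0
}
```

Calling it with DRY_RUN=1 only prints the two commands. Once a fixed version is out, apt-mark unhold on the same two package names releases the hold. Also keep in mind that VMs that are already running keep using the QEMU binary they were started with, so they only pick up the downgraded version after a stop/start.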
 
Just wanted to add a 'me too' to the list, hopefully to help pinpoint the issue.

We ran into this issue as well on our development cluster running from the no-subscription repository.

For us the problem was reproducible with a snapshot rollback with memory, but only if our storage was on Ceph using KRBD. When we moved the VM to the same pool but using librbd, we were unable to trigger it.

We've downgraded the packages on that cluster to the enterprise versions after getting licenses for these servers as well, though we can no longer reproduce the issue: the memory snapshot was made with QEMU 5.2, and we were unable to reproduce it with a new snapshot.

I might be able to provide the memory snapshot if that would help with debugging; I was able to trigger the issue without the disks (or with just an empty disk).
 
In my case all is OK. The package downgrade was done almost a day ago now.
Everything is working.
 
Is there any hope of a short-term update to resolve this?

I would rather not downgrade; it is only a temporary solution.
We'll release a new version as soon as we have a fix ;) Right now, the more debug information we can get, the better - that is, if you currently experience the bug, check the link in my previous response on how to get debug information and consider posting it here.
 
<AOL mode>
me too
</AOL mode>

We experienced the issue last night (E5 2697 v3 CPUs on the nodes, no AMD), after upgrading to the latest PVE during the weekend.
All VMs on one node became totally unresponsive, qm commands timed out, etc.
I had to ssh into the node, kill all kvm processes, restart the VMs manually on other nodes, then reboot the nodes.

Then a couple of minutes ago, one OPNSense VM froze on another host.
Stop, start: it's up for a couple of seconds, then freezes again.
All other VMs on the node are more or less responsive, but qm processes are timing out.
I'll try to post the debug information in the other thread.

Currently downgrading the other nodes, and this is not easy: VMs that were started under the latest PVE version (i.e. the ones I had to restart last night) can't be migrated to a downgraded node.
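
To find out which VMs are still running on the newer QEMU (and so can't be moved to a downgraded node), the version each VM was started with can be read from the verbose status output. A minimal sketch, assuming your qemu-server version prints a "running-qemu:" line in qm status <vmid> --verbose; the helper name running_qemu_version is made up:

```shell
#!/usr/bin/env bash
# Extract the QEMU version a VM was started with from verbose status text
# read on stdin.
# Assumption: the text contains a top-level line such as "running-qemu: 5.2.0".
running_qemu_version() {
    awk -F': *' '$1 == "running-qemu" { print $2 }'
}
```

Example: qm status 101 --verbose | running_qemu_version (101 is just a placeholder VM ID). If the field is missing on your version, the function prints nothing.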
 
Hi,

it seems we also have this issue. The crashes start at night when the backups are starting...
 
We now have a *-103 kernel available, would be interesting to see if that changes anything.
It didn't change anything for me; I just had to restart a few VMs, and one crashed a few times in 15 minutes.
$ uname -r
5.4.103-1-pve
$ pveversion
pve-manager/6.3-6/2184247e (running kernel: 5.4.103-1-pve)

pve-qemu-kvm 5.2.0-3
 
We now have a *-103 kernel available, would be interesting to see if that changes anything.

We updated and rebooted all our nodes the same day you released version 6.3-6.
It's a cluster of 5 Intel nodes:
2x Dell R720
2x Dell R620
1x Dell T130
running Ceph 14.2.16 on datacenter-class SSD drives.

Everything worked fine for days, longer than it used to with 6.3-3, but today we migrated 8 VMs from one node to the others, and a bit later some of the VMs froze.
And not only the VMs involved in the migration, but others too.

This problem is taking waaaaay looonng to fix, and it's very annoying.

I hope you can solve it soon.
 
