Freez issue with latest Proxmox 6.3-4 and AMD CPU

a56 · Mar 15, 2021

As for Intel vs AMD side of this issue, we are seeing this on both, but AMD is way more impacted.

We have 10 node cluster of rather older hardware - CPUs are Intel Xeons E5 (dual socket) and one AMD Opteron 6344 (quad socket).
Nodes were running stable.
We're using PBS.

We did a PVE update from 6.2-11 to 6.3-4 on 2-3 nodes in one day, observe, move to next nodes on following day.
All was good, node with AMD Opteron was last to update.

After updating Opteron all VMs on that node crashed during night:
- no response for ICMP, ssh from VM
- can't connect to VM's console in node GUI
- qm stop managed to stop VM by killing process (not gracefully)
- qm start started VM normally
Load on node was about 1-1,5 for every VM, lots of qmp command failed and qmp socket timeout in syslog on node.

Situation happened again next morning, same scenario - VMs dead, high load on node, qmp errors in syslog.
Updated node with Opteron to 6.3-6 (kernel 5.4.103-1), rebooted.
Next morning - same scenario plus one of Intel Xeons E5 got the same issue (some 5-6 days after it was updated to 6.3-4).

Then we did advised here downgrade pve-qemu-kvm=5.1.0-8 libproxmox-backup-qemu0=1.0.2-1 on Opteron and only one affected Intel.
One morning later all VMs are good on AMD (and one affected Intel).

8 Intels - no issues.
One Intel had issue once.
AMD had issues multiple times.
I've noticed that mass qmp errors in syslog tends to start after midnight, we don't have anything specific scheduled then, PBS backup starts at 22:00 and are done about 23:30-ish.

LazyGatto · Mar 15, 2021

Stefan_R said:
(if you haven't ugpraded any zpools to 2.0 yet):

Hmm. I have zpool 2.0.3.
So I cannot downgrade for now? Am I right?

Code:

# zpool -V
zfs-2.0.3-pve2
zfs-kmod-2.0.3-pve2

Stefan_R · Mar 15, 2021

LazyGatto said:
So I cannot downgrade for now? Am I right?

You can't downgrade the kernel, but you can do so for QEMU (which as it seems currently works around the issue):
apt install pve-qemu-kvm=5.1.0-8 libproxmox-backup-qemu0=1.0.2-1

Menno · Mar 15, 2021

Just wanted to add a 'me too' to the list, hopefully to help pinpoint the issue.

We ran into this issue as well on our development cluster running from the no-subscription repository.

For us the problem was reproducible with a snapshot rollback with memory, but only if our storage was on Ceph using KRBD, when we moved the VM to the same pool but using librbd we were unable to trigger it.

We've downgraded packages on that cluster to the enterprise versions after getting licenses for these servers as well, though we cannot reproduce the issue as the memory snapshot was made with qemu 5.2 and we were unable to reproduce the issue with a new snapshot.

I might be able to provide the memory snapshot if that would help debugging, I was able to trigger it without the disks (or just using an empty disk).

Stefan_R · Mar 16, 2021

As posted in another thread: https://forum.proxmox.com/threads/all-vms-locking-up-after-latest-pve-update.85397/post-377434

If anyone experiences it again, could you report back what the debugger command I linked says?

LazyGatto · Mar 16, 2021

In my case all is OK. Downgrade of packages was done almost a day ago for now.
Everything working.

vaschthestampede · Mar 16, 2021

There is any hope of a short-term upgrade to resolve this?

I would not want to downgrade, it is a temporary solution.

Stefan_R · Mar 16, 2021

vaschthestampede said:
There is any hope of a short-term upgrade to resolve this?

I would not want to downgrade, it is a temporary solution.

We'll release a new version as soon as we have a fix

Right now, the more debug information we can get, the better - that is, if you currently experience the bug, check the link in my previous response on how to get debug information and consider posting it here.

Klug · Mar 16, 2021

<AOL mode>
me too
</AOL mode>

We experienced the issue last night (E5 2697 v3 CPU on the nodes, no AMD), upgraded to last PVE during the week-end.
All VMs on one node became totally unresponsive, qm command went timeout, etc.
Had to ssh into the node, kill all kvm processes, restart the VM manually on other nodes then reboot the nodes.

Then a couple minutes ago, one OPNSense VM froze on another host.
Stop, start, it's up a couple second then freezes again.
All other VMs on the node are more or less responsive, qm processes are timeouting.
I'll try to post the debug in the other thread.

Currently downgrading the other nodes and this is not easy: VM that were started under last PVE version (ie: ones that I had to restart last night) can't be migrated to a downgraded node.

mweigelt · Mar 17, 2021

Hi,

it seems we have also this issue. The crashs start at night when backups are starting...

pawkor · Mar 17, 2021

Stefan_R said:
We now have a *-103 kernel available, would be interesting to see if that changes anything.

Didn't change for me, just had to restart few VMs, one crashed few times in 15 minutes
$ uname -r
5.4.103-1-pve
$ pveversion
pve-manager/6.3-6/2184247e (running kernel: 5.4.103-1-pve)

pve-qemu-kvm 5.2.0-3

Pakillo77 · Mar 20, 2021

Stefan_R said:
We now have a *-103 kernel available, would be interesting to see if that changes anything.

We updated and reboot all our nodes the same day you released version 6.3-6.
It's a 5 Intel nodes cluster:
2x Dell R720
2x Dell R620
1x Dell T130
running CEPH 14.2.16 on datacenter class SSD drives.

Everything was working fine during days, more days than it used to work with 6.3-3, but today we migrated 8 VMs from one node to others, and a bit later some of the VMs got freezed.
And not only VMs involved in migration but others too.

This problem is taking waaaaay looonng to fix, and it's very annoying.

I hope you can solve it soon.

Stefan_R · Mar 22, 2021

As mentioned on the related thread, a fix has been sent to the upstream qemu-devel mailing list and were currently awaiting a response. Thanks for your patience!

Stefan_R · Mar 23, 2021

FYI, we pushed out a fix to pvetest: https://forum.proxmox.com/threads/all-vms-locking-up-after-latest-pve-update.85397/post-379077

If you install the fix, please report back if it works as intended or any other issues that may arise!

Freez issue with latest Proxmox 6.3-4 and AMD CPU

a56

Active Member

LazyGatto

Renowned Member

Stefan_R

Proxmox Retired Staff

Menno

Member

Stefan_R

Proxmox Retired Staff

LazyGatto

Renowned Member

vaschthestampede

Well-Known Member

Stefan_R

Proxmox Retired Staff

Klug

Well-Known Member

mweigelt

Renowned Member

pawkor

Active Member

Pakillo77

Active Member

Stefan_R

Proxmox Retired Staff

Stefan_R

Proxmox Retired Staff

We value your privacy