Freeze issue with latest Proxmox 6.3-4 and AMD CPU

tincboy

In the last few days, my latest Proxmox installations on AMD CPUs have had a VM freezing issue.
The CPUs, in different servers, are of two models:
1. AMD Ryzen 9 3900 12-Core Processor
2. AMD EPYC 7401P 24-Core Processor

On both I can see Windows VMs freeze after a few hours; stopping and starting them temporarily fixes the problem.
Code:
pveversion -v
proxmox-ve: 6.3-1 (running kernel: 5.4.101-1-pve)
pve-manager: 6.3-4 (running version: 6.3-4/0a38c56f)
pve-kernel-5.4: 6.3-6
pve-kernel-helper: 6.3-6
pve-kernel-5.4.101-1-pve: 5.4.101-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.0-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.0.7
libproxmox-backup-qemu0: 1.0.3-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-4
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.1-1
libpve-storage-perl: 6.3-7
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.0.8-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-5
pve-cluster: 6.2-1
pve-container: 3.3-4
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.2-2
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.2.0-2
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-5
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.3-pve2
 
It seems that downgrading to 6.3-3 using the commands below fixes the issue:
Code:
apt install pve-manager=6.3-3
reboot
The sign of the issue is that some VMs show CPU usage of exactly 0%, VNC cannot connect, and you can see lines like the ones below in /var/log/syslog:
Code:
VM 1116 qmp command failed - VM 1116 qmp command 'query-proxmox-support' failed - unable to connect to VM 1116 qmp socket - timeout after 31 retries
Mar  3 19:04:06 c101 pvestatd[1528]: VM 1390 qmp command failed - VM 1390 qmp command 'query-proxmox-support' failed - unable to connect to VM 1390 qmp socket - timeout after 31 retries
Mar  3 19:04:09 c101 pvestatd[1528]: VM 1252 qmp command failed - VM 1252 qmp command 'query-proxmox-support' failed - unable to connect to VM 1252 qmp socket - timeout after 31 retries
Mar  3 19:04:12 c101 pvestatd[1528]: VM 1213 qmp command failed - VM 1213 qmp command 'query-proxmox-support' failed - unable to connect to VM 1213 qmp socket - timeout after 31 retries
Mar  3 19:04:15 c101 pvestatd[1528]: VM 1338 qmp command failed - VM 1338 qmp command 'query-proxmox-support' failed - unable to connect to VM 1338 qmp socket - timeout after 31 retries
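To get a quick list of the affected VMs from log lines like the ones above, here is a small shell sketch. It is not from the original post: the function name `frozen_vmids` is made up for illustration, and it assumes the pvestatd message format shown above stays the same.

```shell
# Hypothetical helper: print the unique IDs of VMs whose QMP socket timed out.
# Reads log text on stdin, so it works on any file or pipe.
frozen_vmids() {
    # Only look at lines reporting a QMP socket timeout, keep the VM ID
    # that precedes "qmp command failed", then sort numerically and dedupe.
    sed -n '/qmp socket - timeout/s/.*VM \([0-9][0-9]*\) qmp command failed.*/\1/p' | sort -un
}
```

For example, `frozen_vmids < /var/log/syslog` should print one VM ID per line, which makes it easier to see whether the same VMs keep freezing.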
 
Yes, we are experiencing the same issue on a 5-node cluster after updating to 6.3-4:
CPU(s) 40 x Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz (2 Sockets)
 
Yes, but this is not solved. pve-manager 6.3-4 has issues that need to be fixed. Your solution is only a workaround, and Proxmox should look at fixing 6.3-4.
 
I've removed the solved prefix.
 
Same here: 8-node cluster with LINSTOR DRBD storage, Intel CPUs; some VMs freeze, but not on all nodes.
Nothing in the logs of the node or in the logs from the VM on the remote syslog server (probably because it freezes instantly).
Just downgraded pve-manager as @tincboy suggested and am observing.
 
Hi all,
We are facing the same behaviour on 15 nodes (~600 VMs), pve-manager=6.3-4, on Intel CPUs, with no backups enabled.
Can someone confirm that the downgrade to 6.3-3 is effective and stable, as @tincboy and @pawkor suggested?
 
Overnight I downgraded all my nodes to 6.3-3. I just woke up and everything looks good, but it's too early to be sure; I will report back in the evening or late at night (I'm in the EU).
For me the freezes were not related to backups (that was my first thought). I'm using PBS; one VM froze 2 hours after the nightly backup finished, others during the day. At worst it took about 15 minutes since the last restart, and a few days at best.
 
I just started testing Proxmox and cannot recreate the error myself, since I'm haunted by even more serious Proxmox issues and have to try to resolve them first (see my threads if you want to help).
But could this be connected to a power state issue?
Have you tried deactivating global C-states in your server's BIOS (Zen common options) and disabling all P-states above 3?
 
My entire system froze last night, and I'm not sure why. It's an AMD Ryzen system. I wonder if it's the same issue. Is there a version changelog somewhere?
 
Entire system as in Proxmox itself, or the VMs? If Proxmox, then it's a different issue. We are talking about VMs freezing here.
 
The same here.
PVE 6.3-4, five-node Intel cluster, in 3 different data centers (CPDs) with 20G fiber links, rebooted at the same time.
All 5 hosts rebooted at the same time with no apparent reason or message. No backup processes involved.
 
My other cluster started having the same issues, but this time instead of downgrading I upgraded it to 6.3-6.
We tried that too, on Tuesday... it didn't work at all.
We downgraded the whole cluster and rebooted the 15 nodes one by one, but we only downgraded these packages: apt install pve-qemu-kvm=5.1.0-8 libproxmox-backup-qemu0=1.0.2-1

Bash:
root@host001:~# dpkg -l | grep -E "pve-kernel-5.4 |pve-qemu-kvm|libproxmox-backup|pve-manager"
ii  libproxmox-backup-qemu0              1.0.2-1                         amd64        Proxmox Backup Server client library for QEMU
ii  pve-kernel-5.4                       6.3-6                           all          Latest Proxmox VE Kernel Image
ii  pve-manager                          6.3-4                           amd64        Proxmox Virtual Environment Management Tools
ii  pve-qemu-kvm                         5.1.0-8                         amd64        Full virtualization on x86 hardware

Running Kernel : 5.4.101-1-pve
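
To record a known-good package set like this in the `package=version` form that `apt install` accepts, something along these lines might help. It is only a sketch: `pkg_versions` is a name invented here, and it assumes the default `dpkg -l` column layout shown above.

```shell
# Hypothetical helper: turn `dpkg -l` lines into "package=version" pairs.
# Reads dpkg -l output on stdin; only fully installed packages ("ii") are kept.
pkg_versions() {
    awk '$1 == "ii" { print $2 "=" $3 }'
}
```

Fed with `dpkg -l | grep -E "pve-qemu-kvm|libproxmox-backup" | pkg_versions` on the node above, it would print `libproxmox-backup-qemu0=1.0.2-1` and `pve-qemu-kvm=5.1.0-8`, which can be saved and compared between nodes.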

Note:
During the night of Wednesday to Thursday we saw some new upgrades available:
Bash:
root@host001:~# apt list --upgradable
Listing... Done
libpve-common-perl/stable 6.3-5 all [upgradable from: 6.3-4]
proxmox-backup-client/stable 1.0.9-1 amd64 [upgradable from: 1.0.8-1]
proxmox-widget-toolkit/stable 2.4-6 all [upgradable from: 2.4-5]
pve-kernel-5.4/stable 6.3-7 all [upgradable from: 6.3-6]
pve-kernel-helper/stable 6.3-7 all [upgradable from: 6.3-6]
pve-manager/stable 6.3-6 amd64 [upgradable from: 6.3-4]
pve-qemu-kvm/stable 5.2.0-3 amd64 [upgradable from: 5.2.0-2]
qemu-server/stable 6.3-7 amd64 [upgradable from: 6.3-5]

But we didn't try them.
Our production cluster is working again; not with the latest versions, but it's working.
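
For anyone who wants to review pending upgrades before applying them, the `apt list --upgradable` listing can be condensed to one "installed -> candidate" line per package. This is a rough sketch, not something from the original post: `pending_upgrades` is a hypothetical name, and it assumes the `[upgradable from: ...]` output format shown above.

```shell
# Hypothetical helper: summarize `apt list --upgradable` output as
# "package installed -> candidate". Reads the listing on stdin;
# lines without an "[upgradable from: ...]" suffix are ignored.
pending_upgrades() {
    sed -n 's|^\([^/]*\)/[^ ]* \([^ ]*\) .*\[upgradable from: \([^]]*\)]$|\1 \3 -> \2|p'
}
```

It can be used as `apt list --upgradable | pending_upgrades`; for the listing above, the pve-manager line would come out as `pve-manager 6.3-4 -> 6.3-6`.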

Note:
Is there a mailing list or something similar that lists all minor changelogs (not the global Roadmap or the Proxmox Git)?
Maybe something like this: https://forum.proxmox.com/threads/proxmox-ve-6-3-available.79686/ but with the diff between versions N-1 and N (even for a minor change).

@all Thanks for your help and for sharing :)
 
We now have a *-103 kernel available; it would be interesting to see if that changes anything.
Single-node setup.
AMD EPYC 7502, 512 GB RAM.
ZFS Mirror Pool.
Code:
# pveversion -v
proxmox-ve: 6.3-1 (running kernel: 5.4.103-1-pve)
pve-manager: 6.3-6 (running version: 6.3-6/2184247e)
pve-kernel-5.4: 6.3-7
pve-kernel-helper: 6.3-7
pve-kernel-5.4.103-1-pve: 5.4.103-1
pve-kernel-5.4.101-1-pve: 5.4.101-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.0-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve3
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.0.7
libproxmox-backup-qemu0: 1.0.3-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-5
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.1-1
libpve-storage-perl: 6.3-7
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
openvswitch-switch: 2.12.3-1
proxmox-backup-client: 1.0.10-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-6
pve-cluster: 6.2-1
pve-container: 3.3-4
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.2-2
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.2.0-3
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-8
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.3-pve2

Today we ran into these freezes on a fresh PVE installation.
Once CPU load goes above 30-40%, VMs begin to freeze at random.

BTW, nothing like this has happened on any of our other Intel-based PVE nodes with the latest updates.
 