This thread is dedicated to the issue where the server simply freezes without any kernel error messages.
If the kernel does print error messages when the server crashes,
there is a separate thread for that: https://forum.proxmox.com/threads/random-6-8-4-2-pve-kernel-crashes.145760
This issue is also not AMD GPU related, unlike https://gitlab.freedesktop.org/drm/amd/-/issues/3173
(the problems may still be related, but who knows)
Hi,
I updated an 8-server cluster to Proxmox 8.2 on 28.4.
On 1.5. I had the first problems: 3 of the 8 servers got stuck.
Then, after a couple of hours, another server froze, and so on.
Now it never takes as long as six hours before I have to reboot one of the servers.
(And if I'm unlucky and don't restart the server quickly,
another 1 or 2 servers freeze in the meantime and the cluster dies,
because we are using Ceph on all 8 nodes.)
The server is in a "frozen" state.
The connected monitor shows the login prompt and nothing else,
but the server does not react to a USB keyboard; even Num Lock does not work.
No segfault or other messages in dmesg or syslog.
I have to hard reset the frozen node.
The server is an ASUS RS500A-E11 with an AMD EPYC Milan series CPU (motherboard KMPA-U16).
It is not plausible that all the servers suddenly developed hardware problems.
All packages are up to date, and I have the latest BIOS, released a few months ago.
Today I will try to downgrade the kernel to 6.5.
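For anyone who wants to try the same downgrade, here is a rough sketch of how I would expect it to look on a Proxmox VE 8 node. It assumes the `proxmox-kernel-6.5` metapackage is available in your configured repositories and that the node is managed by `proxmox-boot-tool`. `KERNEL_SERIES`, `PIN_VERSION`, `DRY_RUN` and the `run` helper are names made up for this sketch. By default it only prints the commands instead of executing them.

```shell
#!/bin/sh
# Sketch only: install and pin an older kernel series on a Proxmox VE 8 node.
# Assumption: the proxmox-kernel-<series> metapackage exists in your repos.
KERNEL_SERIES="${KERNEL_SERIES:-6.5}"
# Exact version string to pin; copy it from `proxmox-boot-tool kernel list`.
# Left empty on purpose, because it differs from node to node.
PIN_VERSION="${PIN_VERSION:-}"
# 1 = only print the commands, anything else = really run them (as root).
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Install the older kernel series next to the current one.
run apt install "proxmox-kernel-${KERNEL_SERIES}"

# Show the installed kernels so the exact version string can be copied.
run proxmox-boot-tool kernel list

# Pin the chosen kernel so it stays the boot default after future updates.
if [ -n "$PIN_VERSION" ]; then
    run proxmox-boot-tool kernel pin "$PIN_VERSION"
fi
```

Run it once as is to see the commands, then set `PIN_VERSION` to the version shown in the list output and `DRY_RUN=0` to apply it. The node still needs a reboot afterwards, and `uname -r` should then report the pinned kernel. `proxmox-boot-tool kernel unpin` should undo the pin later.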
Any other ideas? Thank you in advance.