Proxmox Randomly Hanging

Rainnny

New Member
Dec 6, 2023
Good evening. For the last week or so, I've been experiencing random lock-ups on my Proxmox instance, and I'm unsure why. Once or several times a day, the entire machine freezes: I cannot access the web panel or SSH into the machine. The server is in the rack in my homelab, so I have full physical access to it. The issue started without any clear cause. I've looked through the syslogs, the kernel logs, and dmesg without finding anything, but perhaps I'm missing something. I'm also monitoring the machine with Netdata, and I see no spikes or other unusual activity prior to these lock-ups.

Version
Code:
☁  ~  pveversion -v
proxmox-ve: 8.0.2 (running kernel: 6.2.16-19-pve)
pve-manager: 8.0.9 (running version: 8.0.9/fd1a0ae1b385cdcd)
pve-kernel-6.2: 8.0.5
proxmox-kernel-helper: 8.0.5
pve-kernel-5.15: 7.4-4
pve-kernel-5.13: 7.1-9
proxmox-kernel-6.2.16-19-pve: 6.2.16-19
proxmox-kernel-6.2: 6.2.16-19
proxmox-kernel-6.2.16-18-pve: 6.2.16-18
proxmox-kernel-6.2.16-14-pve: 6.2.16-14
proxmox-kernel-6.2.16-10-pve: 6.2.16-10
proxmox-kernel-6.2.16-8-pve: 6.2.16-8
proxmox-kernel-6.2.16-6-pve: 6.2.16-7
pve-kernel-6.2.16-5-pve: 6.2.16-6
pve-kernel-6.2.16-4-pve: 6.2.16-5
pve-kernel-6.2.16-3-pve: 6.2.16-3
pve-kernel-5.15.108-1-pve: 5.15.108-1
pve-kernel-5.15.107-2-pve: 5.15.107-2
pve-kernel-5.15.107-1-pve: 5.15.107-1
pve-kernel-5.15.85-1-pve: 5.15.85-1
pve-kernel-5.13.19-6-pve: 5.13.19-15
pve-kernel-5.13.19-2-pve: 5.13.19-4
ceph-fuse: 17.2.7-pve1
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx7
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.0
libproxmox-backup-qemu0: 1.4.0
libproxmox-rs-perl: 0.3.1
libpve-access-control: 8.0.7
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.0.10
libpve-guest-common-perl: 5.0.5
libpve-http-server-perl: 5.0.5
libpve-rs-perl: 0.8.7
libpve-storage-perl: 8.0.4
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve3
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.0.4-1
proxmox-backup-file-restore: 3.0.4-1
proxmox-kernel-helper: 8.0.5
proxmox-mail-forward: 0.2.1
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.2
proxmox-widget-toolkit: 4.1.1
pve-cluster: 8.0.5
pve-container: 5.0.6
pve-docs: 8.0.5
pve-edk2-firmware: 4.2023.08-1
pve-firewall: 5.0.3
pve-firmware: 3.9-1
pve-ha-manager: 4.0.3
pve-i18n: 3.0.7
pve-qemu-kvm: 8.1.2-3
pve-xtermjs: 5.3.0-2
qemu-server: 8.0.8
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.0-pve3

Below are the VMs/CTs running on this instance. I turned a lot of them off to see whether one of them was causing the issue, but I had only just done that, so I'm unsure whether it fixed anything.
1701845632948.png

Mobo: Asus PRIME B450M-A II
CPU: Ryzen 5 5600G
RAM: 40 GB of Corsair Vengeance @ 2666 MHz (2x 16 GB, 1x 8 GB, occupying the last 3 DIMM slots as a troubleshooting step)
Boot NVME: Samsung SSD 980 500GB


1701848213381.png
 
I also tried temporarily disabling backups to see whether they were causing the issue; that didn't fix it either.
 
Hi,
are the services still running normally during those hangs? If there's nothing at all in the logs, it might also be the network connection to the server that's hanging. You might also want to consider updating to the latest version and kernel 6.5.
 
No, everything dies. Even trying to use the machine physically with a mouse and keyboard does nothing, and the monitor is also unresponsive (no blinking cursor). I will go ahead and update to the latest Proxmox release with the new kernel and see if that resolves anything.
 
Do you have the latest microcode package (i.e. apt install amd64-microcode; you might need to add the non-free-firmware component for your Debian repository and run apt update first) and BIOS updates installed?
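
For reference, a minimal sketch of the microcode step, assuming the classic one-line /etc/apt/sources.list format (deb822-style .sources files would need different parsing). has_component is a hypothetical helper, not an apt command; the commented lines at the end show how it might be used before installing the package.

```shell
#!/bin/sh
# has_component LINE COMPONENT
# Hypothetical helper: succeeds if an active one-line APT source entry
# lists the given component (e.g. non-free-firmware).
has_component() {
    line=$1
    component=$2
    # Commented-out entries do not count.
    case "$line" in
        \#*) return 1 ;;
    esac
    # Compare every whitespace-separated word on the line with the component.
    for word in $line; do
        if [ "$word" = "$component" ]; then
            return 0
        fi
    done
    return 1
}

# Usage on the Proxmox host (run as root):
#   while read -r entry; do
#       has_component "$entry" non-free-firmware && echo "enabled: $entry"
#   done < /etc/apt/sources.list
#   apt update && apt install amd64-microcode
#   grep -m1 microcode /proc/cpuinfo   # microcode revision currently loaded
```

If no entry is reported as enabled, the component has to be added to the Debian repository line before apt can find the package.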

What you can also try is connecting from a different physical machine via SSH, running journalctl -f, and leaving it running in the background until the machine freezes. If you're lucky, you can get a log from that.
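
A rough sketch of that suggestion, to be run from the second machine: it streams the journal over SSH and appends it to a local file, so whatever was received before the freeze survives the dropped connection. follow_remote_journal, the host name, and the log file name are placeholders, not anything Proxmox ships.

```shell
#!/bin/sh
# follow_remote_journal HOST LOGFILE
# Hypothetical helper: follow the journal on HOST and keep a local copy.
follow_remote_journal() {
    host=$1
    logfile=$2
    # -f follows new entries; -o short-precise prints microsecond timestamps,
    # which helps pin down the last message before a hang.
    # tee -a appends, so earlier captures in LOGFILE are preserved.
    ssh "$host" journalctl -f -o short-precise | tee -a "$logfile"
}

# Usage (host and file name are assumptions):
#   follow_remote_journal root@pve pve-freeze.log
```

When the server locks up, the SSH session eventually drops and the last lines in the local file are the last messages the host managed to send.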
 
So I ran journalctl -f and the machine died about 10 minutes ago; nothing showed in those logs. The only recent entries are one saying my session closed for root, which is normal, and errors connecting to my Proxmox Backup Server, which I turned off temporarily to lower the number of CTs and VMs running on the machine while troubleshooting. I'm going out to my rack now to get the machine back up and running. Once it's up, I'm going to run apt install amd64-microcode and update to the latest Proxmox version and kernel.

As for the BIOS, I'm unsure which version I'm currently running, but I will check in a few moments when I go to the machine and boot it back up. I haven't updated the BIOS firmware, so I don't believe it is on the latest version. However, I'm not sure that would be the cause anyway: the lock-ups started out of nowhere, and I didn't change Proxmox or anything else about the machine.
 
I also don't really want to update the BIOS. It takes anywhere from 1 to 24 hours for a lock-up to occur, and I don't want to risk the server dying in the middle of the update and bricking the board.
 
Also the output of journalctl is:

Code:
Dec 06 05:11:07 pve pvestatd[2787]: pbs: error fetching datastores - 500 Can't connect to 10.10.10.122:8007 (Connection refused)
Dec 06 05:11:17 pve pvestatd[2787]: pbs: error fetching datastores - 500 Can't connect to 10.10.10.122:8007 (Connection refused)
Dec 06 05:11:27 pve pvestatd[2787]: pbs: error fetching datastores - 500 Can't connect to 10.10.10.122:8007 (Connection refused)
Dec 06 05:11:38 pve pvestatd[2787]: pbs: error fetching datastores - 500 Can't connect to 10.10.10.122:8007 (Connection refused)
Dec 06 05:11:47 pve pvestatd[2787]: pbs: error fetching datastores - 500 Can't connect to 10.10.10.122:8007 (Connection refused)
Dec 06 05:11:57 pve pvestatd[2787]: pbs: error fetching datastores - 500 Can't connect to 10.10.10.122:8007 (Connection refused)
Dec 06 05:12:07 pve pvestatd[2787]: pbs: error fetching datastores - 500 Can't connect to 10.10.10.122:8007 (Connection refused)
Dec 06 05:12:18 pve pvestatd[2787]: pbs: error fetching datastores - 500 Can't connect to 10.10.10.122:8007 (Connection refused)
Dec 06 05:12:27 pve pvestatd[2787]: pbs: error fetching datastores - 500 Can't connect to 10.10.10.122:8007 (Connection refused)
Dec 06 05:12:38 pve pvestatd[2787]: pbs: error fetching datastores - 500 Can't connect to 10.10.10.122:8007 (Connection refused)
Dec 06 05:12:47 pve pvestatd[2787]: pbs: error fetching datastores - 500 Can't connect to 10.10.10.122:8007 (Connection refused)
Dec 06 05:12:57 pve pvestatd[2787]: pbs: error fetching datastores - 500 Can't connect to 10.10.10.122:8007 (Connection refused)
Dec 06 05:13:07 pve pvestatd[2787]: pbs: error fetching datastores - 500 Can't connect to 10.10.10.122:8007 (Connection refused)
Dec 06 05:13:17 pve pvestatd[2787]: pbs: error fetching datastores - 500 Can't connect to 10.10.10.122:8007 (Connection refused)
Dec 06 05:13:28 pve pvestatd[2787]: pbs: error fetching datastores - 500 Can't connect to 10.10.10.122:8007 (Connection refused)
Dec 06 05:13:37 pve pvestatd[2787]: pbs: error fetching datastores - 500 Can't connect to 10.10.10.122:8007 (Connection refused)
Dec 06 05:13:48 pve pvestatd[2787]: pbs: error fetching datastores - 500 Can't connect to 10.10.10.122:8007 (Connection refused)
Dec 06 05:13:57 pve pvestatd[2787]: pbs: error fetching datastores - 500 Can't connect to 10.10.10.122:8007 (Connection refused)
Dec 06 05:14:08 pve pvestatd[2787]: pbs: error fetching datastores - 500 Can't connect to 10.10.10.122:8007 (Connection refused)
Dec 06 05:14:17 pve pvestatd[2787]: pbs: error fetching datastores - 500 Can't connect to 10.10.10.122:8007 (Connection refused)
Dec 06 05:14:28 pve pvestatd[2787]: pbs: error fetching datastores - 500 Can't connect to 10.10.10.122:8007 (Connection refused)
Dec 06 05:14:37 pve pvestatd[2787]: pbs: error fetching datastores - 500 Can't connect to 10.10.10.122:8007 (Connection refused)
Dec 06 05:14:47 pve pvestatd[2787]: pbs: error fetching datastores - 500 Can't connect to 10.10.10.122:8007 (Connection refused)
Dec 06 05:14:57 pve pvestatd[2787]: pbs: error fetching datastores - 500 Can't connect to 10.10.10.122:8007 (Connection refused)
Dec 06 05:15:01 pve CRON[259395]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Dec 06 05:15:01 pve CRON[259396]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Dec 06 05:15:01 pve CRON[259395]: pam_unix(cron:session): session closed for user root
Dec 06 05:15:07 pve pvestatd[2787]: pbs: error fetching datastores - 500 Can't connect to 10.10.10.122:8007 (Connection refused)
Dec 06 05:15:17 pve pvestatd[2787]: pbs: error fetching datastores - 500 Can't connect to 10.10.10.122:8007 (Connection refused)
Dec 06 05:15:27 pve pvestatd[2787]: pbs: error fetching datastores - 500 Can't connect to 10.10.10.122:8007 (Connection refused)
Dec 06 05:15:37 pve pvestatd[2787]: pbs: error fetching datastores - 500 Can't connect to 10.10.10.122:8007 (Connection refused)
Dec 06 05:15:47 pve pvestatd[2787]: pbs: error fetching datastores - 500 Can't connect to 10.10.10.122:8007 (Connection refused)
Dec 06 05:15:58 pve pvestatd[2787]: pbs: error fetching datastores - 500 Can't connect to 10.10.10.122:8007 (Connection refused)
Dec 06 05:16:07 pve pvestatd[2787]: pbs: error fetching datastores - 500 Can't connect to 10.10.10.122:8007 (Connection refused)
Dec 06 05:16:18 pve pvestatd[2787]: pbs: error fetching datastores - 500 Can't connect to 10.10.10.122:8007 (Connection refused)
Dec 06 05:16:27 pve pvestatd[2787]: pbs: error fetching datastores - 500 Can't connect to 10.10.10.122:8007 (Connection refused)
Dec 06 05:16:37 pve pvestatd[2787]: pbs: error fetching datastores - 500 Can't connect to 10.10.10.122:8007 (Connection refused)
Dec 06 05:16:47 pve pvestatd[2787]: pbs: error fetching datastores - 500 Can't connect to 10.10.10.122:8007 (Connection refused)
Dec 06 05:16:58 pve pvestatd[2787]: pbs: error fetching datastores - 500 Can't connect to 10.10.10.122:8007 (Connection refused)
Dec 06 05:17:01 pve CRON[261274]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Dec 06 05:17:01 pve CRON[261275]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Dec 06 05:17:01 pve CRON[261274]: pam_unix(cron:session): session closed for user root
Dec 06 05:17:07 pve pvestatd[2787]: pbs: error fetching datastores - 500 Can't connect to 10.10.10.122:8007 (Connection refused)
Dec 06 05:17:17 pve pvestatd[2787]: pbs: error fetching datastores - 500 Can't connect to 10.10.10.122:8007 (Connection refused)
Dec 06 05:17:27 pve pvestatd[2787]: pbs: error fetching datastores - 500 Can't connect to 10.10.10.122:8007 (Connection refused)
Dec 06 05:17:37 pve pvestatd[2787]: pbs: error fetching datastores - 500 Can't connect to 10.10.10.122:8007 (Connection refused)

Network error: Software caused connection abort

────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

Session stopped
    - Press <Return> to exit tab
    - Press R to restart session
    - Press S to save terminal output to file

As mentioned previously, the errors relating to PBS are due to my Proxmox Backup Server being offline at the moment as a troubleshooting step. But as you can see, it just dies, with nothing to show why.
 
Just checked, and I'm running BIOS version 2807 from 2/1/2021; the latest is 4401 from 10/25/2023. It is quite out of date. However, as stated previously, I'm a bit worried about updating the BIOS, as I don't want the board to get bricked if the machine dies in the middle of the update, and nothing has changed recently on the system that would cause these lock-ups, so I don't know. amd64-microcode is also on the latest version.
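
As an aside, the BIOS version can usually be read from the running system without going to the firmware setup screen. A small sketch, assuming the kernel exposes DMI data under /sys/class/dmi/id; bios_info is a hypothetical helper name, and the optional directory argument exists only so it can be pointed at test data.

```shell
#!/bin/sh
# bios_info [DMI_DIR]
# Hypothetical helper: print board name, BIOS version and BIOS date
# from the kernel's DMI sysfs directory (default: /sys/class/dmi/id).
bios_info() {
    dmi_dir=${1:-/sys/class/dmi/id}
    for field in board_name bios_version bios_date; do
        # Skip fields this system does not expose.
        if [ -r "$dmi_dir/$field" ]; then
            printf '%s: %s\n' "$field" "$(cat "$dmi_dir/$field")"
        fi
    done
}

# Usage:
#   bios_info
# As root, dmidecode reports the same fields:
#   dmidecode -s bios-version
#   dmidecode -s bios-release-date
```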

I also updated to the latest Proxmox version and kernel, as seen below:
Code:
☁  ~  pveversion -v            
proxmox-ve: 8.1.0 (running kernel: 6.5.11-6-pve)
pve-manager: 8.1.3 (running version: 8.1.3/b46aac3b42da5d15)
proxmox-kernel-helper: 8.1.0
pve-kernel-6.2: 8.0.5
pve-kernel-5.15: 7.4-4
pve-kernel-5.13: 7.1-9
proxmox-kernel-6.5: 6.5.11-6
proxmox-kernel-6.5.11-6-pve-signed: 6.5.11-6
proxmox-kernel-6.2.16-19-pve: 6.2.16-19
proxmox-kernel-6.2: 6.2.16-19
proxmox-kernel-6.2.16-18-pve: 6.2.16-18
proxmox-kernel-6.2.16-14-pve: 6.2.16-14
proxmox-kernel-6.2.16-10-pve: 6.2.16-10
proxmox-kernel-6.2.16-8-pve: 6.2.16-8
proxmox-kernel-6.2.16-6-pve: 6.2.16-7
pve-kernel-6.2.16-5-pve: 6.2.16-6
pve-kernel-6.2.16-4-pve: 6.2.16-5
pve-kernel-6.2.16-3-pve: 6.2.16-3
pve-kernel-5.15.108-1-pve: 5.15.108-1
pve-kernel-5.15.107-2-pve: 5.15.107-2
pve-kernel-5.15.107-1-pve: 5.15.107-1
pve-kernel-5.15.85-1-pve: 5.15.85-1
pve-kernel-5.13.19-6-pve: 5.13.19-15
pve-kernel-5.13.19-2-pve: 5.13.19-4
ceph-fuse: 17.2.7-pve1
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx7
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.0
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.1
libpve-access-control: 8.0.7
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.1.0
libpve-guest-common-perl: 5.0.6
libpve-http-server-perl: 5.0.5
libpve-network-perl: 0.9.5
libpve-rs-perl: 0.8.7
libpve-storage-perl: 8.0.5
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve3
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.1.2-1
proxmox-backup-file-restore: 3.1.2-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.2
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.3
proxmox-widget-toolkit: 4.1.3
pve-cluster: 8.0.5
pve-container: 5.0.8
pve-docs: 8.1.3
pve-edk2-firmware: 4.2023.08-2
pve-firewall: 5.0.3
pve-firmware: 3.9-1
pve-ha-manager: 4.0.3
pve-i18n: 3.1.4
pve-qemu-kvm: 8.1.2-4
pve-xtermjs: 5.3.0-2
qemu-server: 8.0.10
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.0-pve4
 
Update: I figured out that the motherboard had died. After replacing it, the server has been operating normally.
 
I was experiencing similar symptoms after upgrading from pve-kernel-6.2 to 6.5. The system would randomly hang after 1-5 days; however, the host still replied to pings, and a reset via magic SysRq was possible.

In my case I had added `mitigations=off` to my kernel cmdline. Since removing this I haven't had a hang in over 14 days.

I had added `mitigations=off` to the kernel cmdline to try to eke out some extra performance, but it now seems that software is validated with the default mitigations enabled, so this is just asking for trouble.

`mitigations=off` considered harmful
https://news.ycombinator.com/item?id=37812556
https://forum.level1techs.com/t/mit...harmful-or-spurious-sigill-on-amd-zen4/202049
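
To check whether a host is booted with this parameter, something like the following should work. It is only a sketch: mitigations_disabled is a hypothetical helper, and it reads /proc/cmdline unless given another file.

```shell
#!/bin/sh
# mitigations_disabled [CMDLINE_FILE]
# Hypothetical helper: succeeds if the kernel command line contains
# the exact parameter mitigations=off.
mitigations_disabled() {
    cmdline_file=${1:-/proc/cmdline}
    # The command line is a space-separated list of parameters.
    for arg in $(cat "$cmdline_file"); do
        if [ "$arg" = "mitigations=off" ]; then
            return 0
        fi
    done
    return 1
}

# Usage:
#   if mitigations_disabled; then
#       echo "booted with mitigations=off"
#   fi
# Per-vulnerability status as reported by the running kernel:
#   grep . /sys/devices/system/cpu/vulnerabilities/*
```

The parameter only disappears from /proc/cmdline after it has been removed from the bootloader configuration and the host has been rebooted.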
 
