Issue with Proxmox host locking up every 24-36 hours

znelson

New Member
Jul 11, 2023
I have a Proxmox host (see details below) that keeps locking up every 24-36 hours. It becomes unresponsive on the console, disconnects from the internet, and there is no way to recover other than a hard reboot.

I recently set it up to do a health check every minute so that I can cross-reference the logs on my network devices with the syslog. Nothing in either explains what is happening, so I'm a bit puzzled. The network devices just show the host going offline at around the same time as the lockup.

Info from the node:
Code:
CPU(s) 24 x 13th Gen Intel(R) Core(TM) i7-13700K (1 Socket)
Kernel Version Linux 6.2.11-2-pve #1 SMP PREEMPT_DYNAMIC PVE 6.2.11-2 (2023-05-10T09:13Z)
PVE Manager Version pve-manager/7.4-15/a5d2a31e

Syslog from the 10 minutes before the lockup:
Code:
Jul 11 10:46:04 pve-prod NetworkManager[1992]: <info>  [1689093964.2335] device (wlp0s20f3): set-hw-addr: set MAC address to 42:6C:59:22:C5:B9 (scanning)
Jul 11 10:46:04 pve-prod NetworkManager[1992]: <info>  [1689093964.2715] device (wlp0s20f3): supplicant interface state: inactive -> disconnected
Jul 11 10:46:04 pve-prod NetworkManager[1992]: <info>  [1689093964.2715] device (p2p-dev-wlp0s20f3): supplicant management interface state: inactive -> disconnected
Jul 11 10:46:04 pve-prod NetworkManager[1992]: <info>  [1689093964.2768] device (wlp0s20f3): supplicant interface state: disconnected -> inactive
Jul 11 10:46:04 pve-prod NetworkManager[1992]: <info>  [1689093964.2768] device (p2p-dev-wlp0s20f3): supplicant management interface state: disconnected -> inactive
Jul 11 10:48:28 pve-prod smartd[2034]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 70 to 71
Jul 11 10:48:28 pve-prod smartd[2034]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 30 to 29
-- Reboot --

The machine missed its 10:49 check-in and was reported down. I was sitting right next to it, so I tried typing on the console, but it was locked up and I had to force a reboot.

I'm not sure where else I can look for logs to see what it's doing. Any help appreciated.
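
To narrow down when each lockup starts, the last entry logged before every `-- Reboot --` marker can be pulled out of the saved log. The following is only a rough sketch: it assumes the log has been exported to plain text in the same format as the excerpt above, and the function name `last_entries_before_reboot` and the `year` parameter are made up for illustration (syslog-style timestamps do not carry a year).

```python
from datetime import datetime

REBOOT_MARKER = "-- Reboot --"


def last_entries_before_reboot(lines, year=2023):
    """Return (timestamp, line) for the last dated line before each reboot marker.

    `year` is an assumption supplied by the caller, because timestamps such as
    "Jul 11 10:48:28" do not include one.
    """
    results = []
    previous = None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.strip() == REBOOT_MARKER:
            # A reboot marker closes the current boot; keep its last dated line.
            if previous is not None:
                results.append(previous)
            previous = None
            continue
        try:
            # The first 15 characters hold the "Mon DD HH:MM:SS" timestamp.
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # not a dated log line, skip it
        previous = (stamp, line)
    return results


if __name__ == "__main__":
    sample = [
        "Jul 11 10:46:04 pve-prod NetworkManager[1992]: <info> device (wlp0s20f3): supplicant interface state: disconnected -> inactive",
        "Jul 11 10:48:28 pve-prod smartd[2034]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 30 to 29",
        "-- Reboot --",
    ]
    for stamp, line in last_entries_before_reboot(sample):
        print(stamp.isoformat(), "|", line)
```

Comparing those timestamps with the first missed check-in gives a window for each lockup, which here is somewhere between 10:48:28 and the 10:49 check-in.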
 
FWIW, it looks like there is a slightly newer kernel version, which I'm upgrading to now. I will report back if this resolves it, but my gut feeling is no.
 
The new kernel did not help: the machine locked up again this morning. Any assistance would be appreciated.
 
There is nothing to go on; it's as if the power is interrupted or maybe the logs don't reach the disk. Maybe this post can help with that? Maybe you can find out which hardware is causing the issue by replacing/testing parts.
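
If the concern is that the last log lines never reach the disk, one option is a small local heartbeat that forces each write out with `fsync`, so that the final timestamp in the file is a lower bound on when the machine froze. This is a minimal sketch and not something taken from the thread: the function name `write_heartbeat` and the file path are hypothetical, and it would need to be run once a minute from cron or a systemd timer.

```python
import os
import time


def write_heartbeat(path, now=None):
    """Append one timestamp line to `path` and force it onto disk.

    `now` is an optional epoch time (used for testing); None means "current time".
    """
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(stamp + "\n")
        handle.flush()             # push Python's buffer to the OS
        os.fsync(handle.fileno())  # ask the OS to push it to the disk
    return stamp


if __name__ == "__main__":
    # Hypothetical location; any path on a local disk would do.
    print(write_heartbeat("/var/tmp/heartbeat.log"))
```

After a lockup, the last line of the file shows the most recent minute in which the host could still complete a write to disk.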
 

Yeah, possibly. For what it's worth, the power is still on and the console still displays on the monitor; it just isn't responsive.
 
For anyone who might be searching for this: I found two issues with my BIOS config:

1- XMP memory overclocking was enabled by default, which I turned off
2- The BIOS had an update available.

I corrected these two items and have had a stable machine for almost 48 hours now. I'll report back if anything changes.
 
