I have a proxmox host (see details below) that keeps locking up every 24-36 hours. It becomes unresponsive on the console, disconnects from the internet, and there is no way to recover other than doing a switched hardware reboot.
I've recently set it up to do a health check by the minute so I can reference logs on my network devices as well as the syslog and I'm not seeing anything that explains what is happening so I'm a bit puzzled. The network devices just show it going offline around the same time as the lockup.
Info from the node:
Syslog from 10 minutes before lock up:
The machine missed its 10:49 check in and reported down. I'm sitting right next to it so I tried to type in the console and it was locked up, forcing a reboot.
Not sure where else I can inspect some logs to see what its doing? Any help appreciated
I've recently set it up to do a health check by the minute so I can reference logs on my network devices as well as the syslog and I'm not seeing anything that explains what is happening so I'm a bit puzzled. The network devices just show it going offline around the same time as the lockup.
Info from the node:
Code:
CPU(s) 24 x 13th Gen Intel(R) Core(TM) i7-13700K (1 Socket)
Kernel Version Linux 6.2.11-2-pve #1 SMP PREEMPT_DYNAMIC PVE 6.2.11-2 (2023-05-10T09:13Z)
PVE Manager Version pve-manager/7.4-15/a5d2a31e
Syslog from 10 minutes before lock up:
Code:
Jul 11 10:46:04 pve-prod NetworkManager[1992]: <info> [1689093964.2335] device (wlp0s20f3): set-hw-addr: set MAC address to 42:6C:59:22:C5:B9 (scanning)
Jul 11 10:46:04 pve-prod NetworkManager[1992]: <info> [1689093964.2715] device (wlp0s20f3): supplicant interface state: inactive -> disconnected
Jul 11 10:46:04 pve-prod NetworkManager[1992]: <info> [1689093964.2715] device (p2p-dev-wlp0s20f3): supplicant management interface state: inactive -> disconnected
Jul 11 10:46:04 pve-prod NetworkManager[1992]: <info> [1689093964.2768] device (wlp0s20f3): supplicant interface state: disconnected -> inactive
Jul 11 10:46:04 pve-prod NetworkManager[1992]: <info> [1689093964.2768] device (p2p-dev-wlp0s20f3): supplicant management interface state: disconnected -> inactive
Jul 11 10:48:28 pve-prod smartd[2034]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 70 to 71
Jul 11 10:48:28 pve-prod smartd[2034]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 30 to 29
-- Reboot --
The machine missed its 10:49 check in and reported down. I'm sitting right next to it so I tried to type in the console and it was locked up, forcing a reboot.
Not sure where else I can inspect some logs to see what its doing? Any help appreciated