Proxmox Crashes hard regularly

Guy

I've been running Proxmox for many years now and it's always been highly reliable.

Recently I switched to some newer, smaller hardware (16 x AMD Ryzen 7 4800U with Radeon Graphics (1 Socket)). The nodes ran perfectly for 4-5 months. However, for the last month or so they have started to crash at night, usually around 02:00. They hang to the point where ping gets no response; I can see the power light is on, but nothing responds. A local HDMI monitor and keyboard show nothing, just a blank screen. I have to physically power them off and on again to recover. They start up fine, but then in a day or two one or the other will crash again.

I suspected a memory leak, but my SNMP monitoring shows constant memory usage and nothing unusual for CPU, disk, or other metrics.

One of my nodes now has only two VMs on it, and it's crashing more often than the other, which has 8.

From last night's crash, I only have these log entries:

Code:
Dec 18 02:17:01 prox01 CRON[1477415]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Dec 18 02:17:01 prox01 CRON[1477416]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Dec 18 02:17:01 prox01 CRON[1477415]: pam_unix(cron:session): session closed for user root
Dec 18 02:17:28 prox01 pmxcfs[1167]: [dcdb] notice: data verification successful
Dec 18 02:18:34 prox01 pmxcfs[1167]: [status] notice: received log
-- Reboot --
Dec 18 08:04:07 prox01 kernel: Linux version 5.15.74-1-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PVE 5.15.74-1 (Mon, 14 Nov 2022 20:17:15 +0100) ()
Dec 18 08:04:07 prox01 kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-5.15.74-1-pve root=/dev/mapper/pve-root ro quiet
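
For anyone wanting to dig further into the boot that ended in the hang, here is a minimal sketch using `journalctl`. It assumes a persistent journal (the "-- Reboot --" marker above suggests there is one). The function names are made up for illustration. Note that a hard hang or panic often never reaches the disk, so an empty result would not rule out a kernel problem.

```shell
# Print the last N journal lines from the previous boot (-b -1),
# i.e. the boot that ended in the crash. Defaults to 50 lines.
last_lines_before_crash() {
    local count="${1:-50}"
    journalctl -b -1 -n "$count" --no-pager
}

# Show only kernel messages (-k) of priority warning or worse
# from the previous boot.
kernel_warnings_before_crash() {
    journalctl -k -b -1 -p warning --no-pager
}
```

Run as root, for example `last_lines_before_crash 100`, and compare the final timestamps with the time the node stopped answering ping.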

I'm running the latest version of Proxmox:

Code:
~# pveversion -v
proxmox-ve: 7.3-1 (running kernel: 5.15.74-1-pve)
pve-manager: 7.3-3 (running version: 7.3-3/c3928077)
pve-kernel-helper: 7.3-1
pve-kernel-5.15: 7.2-14
pve-kernel-5.15.74-1-pve: 5.15.74-1
pve-kernel-5.15.64-1-pve: 5.15.64-1
pve-kernel-5.15.30-2-pve: 5.15.30-3
ceph-fuse: 15.2.16-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.3
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-5
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.3-1
libpve-guest-common-perl: 4.2-3
libpve-http-server-perl: 4.1-5
libpve-storage-perl: 7.3-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
openvswitch-switch: 2.15.0+ds1-2+deb11u1
proxmox-backup-client: 2.3.1-1
proxmox-backup-file-restore: 2.3.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.0-1
proxmox-widget-toolkit: 3.5.3
pve-cluster: 7.3-1
pve-container: 4.4-2
pve-docs: 7.3-1
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-7
pve-firmware: 3.6-1
pve-ha-manager: 3.5.1
pve-i18n: 2.8-1
pve-qemu-kvm: 7.1.0-4
pve-xtermjs: 4.16.0-1
qemu-server: 7.3-1
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+2
vncterm: 1.7-1
zfsutils-linux: 2.1.6-pve1

Can anyone suggest any debugging ideas to help narrow this down?

Cheers,
--Guy
 
Hi,

Same issue here.

Any thoughts on how to resolve it?
Maybe downgrade to an older version?
 
Guy said: "From last night's crash, I only have these log entries:"
Those look normal, unfortunately.

Is there anything configured to run at that time, that might cause some load, which the system might not be handling well?
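
One way to check what is scheduled around the crash time is to list systemd timers and cron entries together. This is only a sketch: the function name is hypothetical, and it covers just the standard Debian locations, not jobs defined elsewhere (for example backup jobs configured in the Proxmox GUI or tasks scheduled inside the VMs).

```shell
# List everything scheduled on the node so that jobs firing around the
# crash time (about 02:00 in this thread) can be spotted.
list_scheduled_jobs() {
    echo "== systemd timers =="
    systemctl list-timers --all --no-pager
    echo "== root crontab =="
    # crontab -l exits non-zero when the user has no crontab.
    crontab -l -u root 2>/dev/null || echo "(no crontab for root)"
    echo "== system cron files =="
    ls -l /etc/crontab /etc/cron.d /etc/cron.hourly /etc/cron.daily
}
```

Call `list_scheduled_jobs` as root and look at the NEXT and LAST columns of the timer list for anything close to the time of the hang.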

Memory test? Maybe set the memory speed a little lower? Try a different power supply?
Firmware / BIOS update?
Try booting an older kernel?
 
Last night it crashed hard again, but this time I was able to capture the screen, which shows a kernel panic.
 

Attachment: IMG_4704.jpeg
I've rolled back to kernel "Linux 5.15.30-2-pve #1 SMP PVE 5.15.30-3 (Fri, 22 Apr 2022 18:08:27 +0200)" and it does seem to be more stable. I stepped back one kernel at a time until I got here; it's the oldest kernel I have in the GRUB configuration.

Currently it's manually selected, but I think I'll have to make it the default until the later kernels become more stable.
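
A possible way to make the older kernel the default is `proxmox-boot-tool kernel pin`, which should be available with the pve-kernel-helper version listed above. This is a sketch, not a tested procedure for this hardware: the wrapper function name is made up, and the version string `5.15.30-2-pve` is taken from the `pveversion -v` output earlier in the thread.

```shell
# Pin a known-good kernel so it is booted by default.
# The argument is the kernel version as reported by
# "proxmox-boot-tool kernel list", e.g. 5.15.30-2-pve.
pin_known_good_kernel() {
    local version="${1:?usage: pin_known_good_kernel <kernel-version>}"
    # Show the installed kernels first, to confirm the version exists.
    proxmox-boot-tool kernel list
    # Make the chosen version the default for every boot until unpinned.
    proxmox-boot-tool kernel pin "$version"
}
```

Usage would be `pin_known_good_kernel 5.15.30-2-pve` as root. Once a newer kernel proves stable, `proxmox-boot-tool kernel unpin` should return the node to booting the latest installed kernel.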
 