I've been running Proxmox for many years now and it's always been highly reliable.
Recently I switched to some newer, smaller hardware (16 x AMD Ryzen 7 4800U with Radeon Graphics (1 Socket)), and the nodes had been running perfectly for 4-5 months. However, for the last month or so they have started to crash at night, usually around 02:00. They hang to the point where ping does not respond; I can see the power light is on, but nothing responds. A local HDMI monitor and keyboard show nothing, just a blank screen. I have to physically power off and on again to recover. They start up fine, but then within a day or two one or the other crashes again.
I suspected a memory leak, but my SNMP monitoring shows constant memory usage and nothing unusual for CPU, disk, or other metrics.
One of my nodes now has only two VMs on it, yet it crashes more often than the other, which has 8.
From last night's crash, I only have these log entries:
Code:
Dec 18 02:17:01 prox01 CRON[1477415]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Dec 18 02:17:01 prox01 CRON[1477416]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Dec 18 02:17:01 prox01 CRON[1477415]: pam_unix(cron:session): session closed for user root
Dec 18 02:17:28 prox01 pmxcfs[1167]: [dcdb] notice: data verification successful
Dec 18 02:18:34 prox01 pmxcfs[1167]: [status] notice: received log
-- Reboot --
Dec 18 08:04:07 prox01 kernel: Linux version 5.15.74-1-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PVE 5.15.74-1 (Mon, 14 Nov 2022 20:17:15 +0100) ()
Dec 18 08:04:07 prox01 kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-5.15.74-1-pve root=/dev/mapper/pve-root ro quiet
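To put a number on how long the node sat unresponsive, here is a minimal Python sketch that measures the silent gap around each "-- Reboot --" marker in an excerpt like the one above. The function names (parse_ts, silent_gaps) are made up for illustration, and the year 2022 is an assumption, since syslog-style timestamps do not carry one. It also assumes an English locale for the month abbreviations.

```python
from datetime import datetime

# A few lines copied from the journal excerpt above.
LOG = """\
Dec 18 02:17:28 prox01 pmxcfs[1167]: [dcdb] notice: data verification successful
Dec 18 02:18:34 prox01 pmxcfs[1167]: [status] notice: received log
-- Reboot --
Dec 18 08:04:07 prox01 kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-5.15.74-1-pve root=/dev/mapper/pve-root ro quiet
"""


def parse_ts(line, year=2022):
    # The first 15 characters are "Mon DD HH:MM:SS"; the year is an assumption.
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def silent_gaps(text, year=2022):
    """Return (last_entry, first_boot_entry, gap) for each reboot marker."""
    gaps = []
    prev = None          # timestamp of the most recent log line
    pending = False      # True once a reboot marker has been seen
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.strip() == "-- Reboot --":
            pending = prev is not None
            continue
        ts = parse_ts(line, year)
        if pending:
            gaps.append((prev, ts, ts - prev))
            pending = False
        prev = ts
    return gaps


for last, boot, gap in silent_gaps(LOG):
    print(f"last entry {last:%H:%M:%S}, next boot {boot:%H:%M:%S}, silent for {gap}")
```

Run against several nights of exported journal text, this would show whether the last entry before each hang clusters around the same time, which is only a rough way to line crashes up with scheduled jobs.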
I'm running the latest version of Proxmox:
Code:
~# pveversion -v
proxmox-ve: 7.3-1 (running kernel: 5.15.74-1-pve)
pve-manager: 7.3-3 (running version: 7.3-3/c3928077)
pve-kernel-helper: 7.3-1
pve-kernel-5.15: 7.2-14
pve-kernel-5.15.74-1-pve: 5.15.74-1
pve-kernel-5.15.64-1-pve: 5.15.64-1
pve-kernel-5.15.30-2-pve: 5.15.30-3
ceph-fuse: 15.2.16-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.3
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-5
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.3-1
libpve-guest-common-perl: 4.2-3
libpve-http-server-perl: 4.1-5
libpve-storage-perl: 7.3-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
openvswitch-switch: 2.15.0+ds1-2+deb11u1
proxmox-backup-client: 2.3.1-1
proxmox-backup-file-restore: 2.3.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.0-1
proxmox-widget-toolkit: 3.5.3
pve-cluster: 7.3-1
pve-container: 4.4-2
pve-docs: 7.3-1
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-7
pve-firmware: 3.6-1
pve-ha-manager: 3.5.1
pve-i18n: 2.8-1
pve-qemu-kvm: 7.1.0-4
pve-xtermjs: 4.16.0-1
qemu-server: 7.3-1
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+2
vncterm: 1.7-1
zfsutils-linux: 2.1.6-pve1
Can anyone suggest any debugging ideas to help narrow this down?
Cheers,
--Guy