Random crashes after several days

JeepieJoep

New Member
Dec 18, 2024
Hi,

Hope you can help me.

I have a recent Proxmox server with the following specs:
  • Intel Core i5-14500 Boxed
  • Kontron K3843-B motherboard
  • FSP FlexGURU 300W PSU
  • Kingston Fury Beast 64 GB RAM
  • Samsung 990 Pro (with heatsink) 2TB
For the past few weeks, I’ve noticed that the server freezes at random moments. Upon checking the logs (journalctl), there is nothing unusual to be found. The logs stop at the moment of the freeze and resume only after a hard reset using the server's power button.
The Proxmox host can still be pinged but is not accessible via the WebGUI or SSH.

I'm using ZFS and running 15 LXC containers and 2 VMs. My PVE version is 8.3.0.
 
The Samsung 990 has caused several problems regardless of the filesystem used. I would first try any other disk for a while, even an HDD if you have one, and once that runs fine, get a better SSD than the current one.
 
How do you notice that the server freezes / crashes?
The Proxmox host can still be pinged but is not accessible via the WebGUI or SSH.
Are the VMs & LXCs accessible or pingable?

What do RAM usage, IO delay & server load look like at the time of those freezes / crashes?

When did the problem start & what kernels have you used & are using?

Has that RAM been tested?

I'm using ZFS
If you are running root on ZFS you may be having a SWAP on ZFS problem as described here.

I agree with waltar above that you may want to try swapping out the disk for testing.
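One quick way to see whether the swap-on-ZFS case even applies is to check if any active swap device is a ZFS volume. The sketch below is only a rough check: it assumes zvol-backed swap shows up as /dev/zd* or under /dev/zvol/ in the swapon listing, and the function name zvol_swap_devices is made up for illustration.

```shell
#!/bin/sh
# Filter a list of swap device paths (one per line on stdin) and print
# only those that look like ZFS volumes.
# Assumption: zvols appear as /dev/zdN or under /dev/zvol/.
zvol_swap_devices() {
    grep -E '^/dev/(zd[0-9]+|zvol/)' || true
}

# Usage on the host (not run here):
#   swapon --show=NAME --noheadings | zvol_swap_devices
```

If this prints nothing, no active swap device matched the zvol patterns, which would make the swap-on-ZFS explanation less likely.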
 
VMs and LXCs are not accessible or pingable anymore.

RAM, IO delay and server load of the host: proxmox.png

I think the problem started several weeks ago, after the fairly new server had worked for about 4 months without any problems. Maybe after an upgrade, but I don't remember exactly.

The kernel I'm currently running is 6.8.12-4-pve.

RAM isn't tested. Can you tell me how to best achieve this?

The output of the "free" command is:

proxmox2.png

I can try another NVMe but don't have one available at the moment. I think it's best to install a clean version of PVE on it and copy all VMs and LXCs over to that one?

Extra info. "journalctl -S 2024-12-18" outputs no logging between the crash and the restart:

proxmox3.png

As an experiment, I added mitigations=off to /etc/kernel/cmdline tonight. I'll see whether this helps.
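A note for anyone trying the same change: on a host that boots via proxmox-boot-tool, an edit to /etc/kernel/cmdline should only take effect after proxmox-boot-tool refresh and a reboot. Below is a minimal sketch for confirming that the parameter actually reached the running kernel; has_kernel_param is a helper name invented here, not an existing tool.

```shell
#!/bin/sh
# Return success if the kernel command line string ($1) contains the
# exact parameter ($2) as a whitespace-separated word.
has_kernel_param() {
    for word in $1; do
        [ "$word" = "$2" ] && return 0
    done
    return 1
}

# Usage on the host (not run here):
#   proxmox-boot-tool refresh      # write the edited cmdline to the boot partition(s)
#   reboot
#   has_kernel_param "$(cat /proc/cmdline)" mitigations=off && echo "active"
```

Comparing whole words avoids a false positive from a longer value such as mitigations=off,nosmt.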
 
RAM isn't tested. Can you tell me how to best achieve this?
memtest is installed by default; you just have to select it manually in the boot menu instead of the default PVE kernel.
I can try another NVMe but don't have one available at the moment. I think it's best to install a clean version of PVE on it and copy all VMs and LXCs over to that one?
As written before, I would try any other disk, whether an HDD or a USB stick, and let the host run for a while to see what happens. The problem may be related to your NVMe, to other hardware, or to PVE itself.
 
Maybe after an upgrade, but I don't remember exactly.

The kernel I'm currently running is 6.8.12-4-pve.
You could test by pinning to a previous kernel. If you post output for pveversion -v we may get an idea which kernel to try.
 
Here is the output:

Code:
proxmox-ve: 8.3.0 (running kernel: 6.8.12-4-pve)
pve-manager: 8.3.0 (running version: 8.3.0/c1689ccb1065a83b)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.8: 6.8.12-4
proxmox-kernel-6.8.12-4-pve-signed: 6.8.12-4
proxmox-kernel-6.5.13-6-pve: 6.5.13-6
ceph-fuse: 17.2.7-pve2
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx11
intel-microcode: 3.20241112.1
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.4
libpve-access-control: 8.2.0
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.10
libpve-cluster-perl: 8.0.10
libpve-common-perl: 8.2.9
libpve-guest-common-perl: 5.1.6
libpve-http-server-perl: 5.1.2
libpve-network-perl: 0.10.0
libpve-rs-perl: 0.9.0
libpve-storage-perl: 8.2.9
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.5.0-1
proxmox-backup-client: 3.2.9-1
proxmox-backup-file-restore: 3.2.9-1
proxmox-firewall: 0.6.0
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.3.1
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.7
proxmox-widget-toolkit: 4.3.1
pve-cluster: 8.0.10
pve-container: 5.2.2
pve-docs: 8.3.1
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.2
pve-firewall: 5.1.0
pve-firmware: 3.14-1
pve-ha-manager: 4.0.6
pve-i18n: 3.3.1
pve-qemu-kvm: 9.0.2-4
pve-xtermjs: 5.3.0-3
qemu-server: 8.3.0
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.6-pve1
 
Just had another crash. The strange thing: I have 1 VM and the rest are LXC containers. The VM was still working and accessible through webgui. All containers were pingable but not accessible. The PVE host was also pingable, but neither its WebGUI nor SSH responded.
 
Maybe try pinning to this kernel.

You can read here how this is done.
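For reference, pinning on PVE 8 is typically done with proxmox-boot-tool. The sketch below uses 6.5.13-6-pve purely as an example target, because it is the only other kernel listed in the pveversion -v output above; it is not necessarily the kernel meant in this post. The helper names installed_kernels and pin_kernel are illustrative.

```shell
#!/bin/sh
# Print the kernel versions found in `pveversion -v` output (read from stdin).
# Matches lines such as "proxmox-kernel-6.5.13-6-pve: 6.5.13-6".
installed_kernels() {
    sed -E -n 's/^proxmox-kernel-([0-9.]+-[0-9]+-pve)(-signed)?: .*/\1/p'
}

# Pin the given kernel version so it is booted by default.
pin_kernel() {
    proxmox-boot-tool kernel pin "$1"
}

# Usage on the host (not run here):
#   pveversion -v | installed_kernels
#   pin_kernel 6.5.13-6-pve && reboot
#   proxmox-boot-tool kernel unpin   # revert to the default kernel later
```

After the reboot, uname -r should report the pinned version; if the freezes stop, that points toward a kernel regression rather than hardware.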

You should still try swapping out to a different disk. The lack of any journal entries during the crash would be consistent with the drive failing, since a drive that stops responding cannot have logs written to it.
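Since smartmontools is already installed (see the package list above), the drive's own health counters can be checked before swapping hardware. This is a rough sketch that assumes the NVMe shows up as /dev/nvme0 and that smartctl prints the usual NVMe health field labels; the function name nvme_health_summary is made up here.

```shell
#!/bin/sh
# Reduce `smartctl -a` output (read from stdin) to the NVMe health
# fields most relevant to a failing drive.
# Assumption: field labels as printed by smartctl 7.x for NVMe devices.
nvme_health_summary() {
    grep -E '^(Critical Warning|Temperature|Available Spare|Percentage Used|Media and Data Integrity Errors|Error Information Log Entries):' || true
}

# Usage on the host (not run here; /dev/nvme0 is an assumed device name):
#   smartctl -a /dev/nvme0 | nvme_health_summary
```

A non-zero critical warning or a growing media error count would support the failing-drive theory, while clean values do not rule it out.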

REMEMBER TO ALWAYS HAVE FULL & RESTORABLE BACKUPS!
 
