Proxmox VE Host crashes randomly

mind.overflow

Hello everyone,
I have a very low-performance machine that I'm using to host a few game servers in VMs. The CPU performance is good enough. However, when it goes under load, it often crashes completely and reboots. Not just the VM, but the whole host, which is very weird. This doesn't happen on my main, more powerful Xeon host, which has never killed a single VM on me, let alone crashed completely.
Here are the syslogs from the last two crashes today:
Code:
May 16 16:11:59 n2 pmxcfs[1417]: [status] notice: received log
May 16 16:13:38 n2 kernel: perf: interrupt took too long (2514 > 2500), lowering kernel.perf_event_max_sample_rate to 79500
May 16 16:14:24 n2 kernel: perf: interrupt took too long (3146 > 3142), lowering kernel.perf_event_max_sample_rate to 63500
May 16 16:16:06 n2 kernel: perf: interrupt took too long (3947 > 3932), lowering kernel.perf_event_max_sample_rate to 50500
May 16 16:17:01 n2 CRON[16581]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
May 16 16:17:01 n2 CRON[16582]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
May 16 16:17:01 n2 CRON[16581]: pam_unix(cron:session): session closed for user root
May 16 16:17:58 n2 pmxcfs[1417]: [dcdb] notice: data verification successful
-- Reboot --
May 16 16:21:42 n2 kernel: Linux version 5.15.35-1-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PVE 5.15.35-2 (Thu, 05 May 2022 13:54:35 +0200) ()
May 16 16:21:42 n2 kernel: Command line: initrd=\EFI\proxmox\5.15.35-1-pve\initrd.img-5.15.35-1-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs
May 16 16:21:42 n2 kernel: KERNEL supported cpus:
May 16 16:21:42 n2 kernel:   Intel GenuineIntel

Code:
May 16 15:17:58 n2 pmxcfs[1325]: [dcdb] notice: data verification successful
May 16 15:40:47 n2 smartd[924]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 120 to 106
May 16 15:40:48 n2 smartd[924]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 64 to 63
May 16 15:40:48 n2 smartd[924]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 36 to 37
May 16 15:40:48 n2 smartd[924]: Device: /dev/sda [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 39 to 43
May 16 15:40:48 n2 smartd[924]: Device: /dev/sdb [SAT], 3 Currently unreadable (pending) sectors
May 16 15:40:48 n2 smartd[924]: Device: /dev/sdb [SAT], 3 Offline uncorrectable sectors
-- Reboot --
May 16 16:09:16 n2 kernel: Linux version 5.15.35-1-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PVE 5.15.35-2 (Thu, 05 May 2022 13:54:35 +0200) ()
May 16 16:09:16 n2 kernel: Command line: initrd=\EFI\proxmox\5.15.35-1-pve\initrd.img-5.15.35-1-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs
May 16 16:09:16 n2 kernel: KERNEL supported cpus:
May 16 16:09:16 n2 kernel:   Intel GenuineIntel


What do you think this could be? I'm noticing an error on a specific disk (/dev/sdb in the second log), but when I look at that disk in the web UI, it says the SMART values are all good and the disk passes. Also, it's part of a ZFS RAID1 mirror, so I'm not too worried about it failing. Besides, the machine rebooted at 16:09 and the error is from about 30 minutes earlier, so that's definitely not the (direct) cause.
And in the log for the other crash there is no mention of disk errors, so I'm pretty sure it's not related.

What else could it be? This almost always happens if CPU usage stays at 400% (all 4 threads of this CPU) for a few minutes, but it also rarely happens randomly when no one is playing. What other logs could I look into? I'm starting to think that this might be a hardware issue, but what's failing then?

Thank you!

EDIT: The machine is running an i5-650 with 8GB RAM and two 500GB hard disks attached. I mainly built it out of random unused parts I had lying around. I know I should probably just avoid Proxmox on such a low-performance system, but I like how easy it is to back everything up, add disks, create VMs, and add redundancy through ZFS pools, and I'm OK with the performance being slightly degraded due to the overhead.
Code:
Linux 5.15.35-1-pve #1 SMP PVE 5.15.35-2 (Thu, 05 May 2022 13:54:35 +0200)
pve-manager/7.2-3/c743d6c1
 
I don't see any indication in your logs of why this is happening, but I'd suggest checking the following:
  • The most obvious candidate is your CPU critically overheating. Try monitoring your CPU temperature while the system is under heavy load.
  • Another possibility is a memory problem. Your CPU may simply be doing more memory operations under heavy load, which would make it look as if the load itself is at fault. A memtest should bring clarity here too.
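For the temperature check, here is a minimal Python sketch of one way to log readings while the host is under load. It assumes the sensors show up under the Linux hwmon sysfs tree (`/sys/class/hwmon/hwmon*/temp*_input`, values in millidegrees Celsius), which is typically the case on Intel hosts once the `coretemp` driver is loaded. The function names and the log path in the usage note are made up for illustration. Each line is synced to disk so that the last reading has a better chance of surviving a sudden reset.

```python
import glob
import os
import time


def read_temps(base="/sys/class/hwmon"):
    """Return {"chip/sensor": degrees Celsius} for every temp*_input under base."""
    temps = {}
    for path in sorted(glob.glob(os.path.join(base, "hwmon*", "temp*_input"))):
        try:
            with open(path) as f:
                millideg = int(f.read().strip())
        except (OSError, ValueError):
            continue  # sensor unreadable right now; skip it
        chip_dir = os.path.dirname(path)
        try:
            with open(os.path.join(chip_dir, "name")) as f:
                chip = f.read().strip()
        except OSError:
            chip = os.path.basename(chip_dir)  # fall back to e.g. "hwmon0"
        sensor = os.path.basename(path)[: -len("_input")]
        temps[f"{chip}/{sensor}"] = millideg / 1000.0
    return temps


def hottest(temps):
    """Return the (name, degrees) pair with the highest reading, or None."""
    return max(temps.items(), key=lambda kv: kv[1]) if temps else None


def log_temps(logfile, samples, interval=5.0, base="/sys/class/hwmon"):
    """Append `samples` timestamped hottest-sensor readings to logfile."""
    with open(logfile, "a") as out:
        for _ in range(samples):
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            out.write(f"{stamp} {hottest(read_temps(base))}\n")
            out.flush()
            os.fsync(out.fileno())  # push the line to disk before the next sample
            time.sleep(interval)
```

Something like `log_temps("/root/temps.log", samples=720, interval=5)` (hypothetical path) would cover an hour of play. If the last lines before a crash show temperatures climbing steeply, overheating becomes the prime suspect. If they stay flat, look elsewhere.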
Edit:
Sorry, I skipped over this part:
"but it also rarely happens randomly when no one is playing."
This makes it much more likely that the problem lies with something like the RAM, though other components could still be at fault. I would check temperatures just in case, run a memtest, and look for anything else in the logs. If that doesn't bring clarity, I guess the only thing left is to swap out hardware components one at a time and see whether the problem persists.
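For the log check, a small Python sketch like the one below can pull out the last few lines before each `-- Reboot --` marker, assuming the log has been saved as text in the same format as the excerpts in the first post. The keyword list is only an illustrative guess at messages worth a second look, not an exhaustive set, and the function names are made up.

```python
REBOOT_MARKER = "-- Reboot --"

# Illustrative substrings only (an assumption, not a complete list).
SUSPICIOUS = (
    "hardware error",
    "machine check",
    "temperature above threshold",
    "out of memory",
)


def lines_before_reboots(log_text, context=5):
    """Return one list per reboot marker holding the `context` lines before it."""
    lines = log_text.splitlines()
    windows = []
    for i, line in enumerate(lines):
        if line.strip() == REBOOT_MARKER:
            windows.append(lines[max(0, i - context):i])
    return windows


def flag_suspicious(lines, keywords=SUSPICIOUS):
    """Return the lines containing any keyword, compared case-insensitively."""
    return [ln for ln in lines if any(k in ln.lower() for k in keywords)]
```

If nothing gets flagged, as with the two excerpts posted here, that fits an abrupt hardware-level reset, where the kernel gets no chance to write anything before the machine goes down.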

Best of luck!
 