Hello everyone,
I have a very low-performance machine that I'm using to host a few game servers in VMs. The CPU performance is good enough. However, when it goes under load, the machine often crashes completely and reboots. Not just the VM but the whole host, which is very weird. This doesn't happen on my main, more powerful Xeon host, which has never killed a single VM on me, let alone crashed completely.
These are the syslogs from the last two crashes today:
Code:
May 16 16:11:59 n2 pmxcfs[1417]: [status] notice: received log
May 16 16:13:38 n2 kernel: perf: interrupt took too long (2514 > 2500), lowering kernel.perf_event_max_sample_rate to 79500
May 16 16:14:24 n2 kernel: perf: interrupt took too long (3146 > 3142), lowering kernel.perf_event_max_sample_rate to 63500
May 16 16:16:06 n2 kernel: perf: interrupt took too long (3947 > 3932), lowering kernel.perf_event_max_sample_rate to 50500
May 16 16:17:01 n2 CRON[16581]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
May 16 16:17:01 n2 CRON[16582]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
May 16 16:17:01 n2 CRON[16581]: pam_unix(cron:session): session closed for user root
May 16 16:17:58 n2 pmxcfs[1417]: [dcdb] notice: data verification successful
-- Reboot --
May 16 16:21:42 n2 kernel: Linux version 5.15.35-1-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PVE 5.15.35-2 (Thu, 05 May 2022 13:54:35 +0200) ()
May 16 16:21:42 n2 kernel: Command line: initrd=\EFI\proxmox\5.15.35-1-pve\initrd.img-5.15.35-1-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs
May 16 16:21:42 n2 kernel: KERNEL supported cpus:
May 16 16:21:42 n2 kernel: Intel GenuineIntel
Code:
May 16 15:17:58 n2 pmxcfs[1325]: [dcdb] notice: data verification successful
May 16 15:40:47 n2 smartd[924]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 120 to 106
May 16 15:40:48 n2 smartd[924]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 64 to 63
May 16 15:40:48 n2 smartd[924]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 36 to 37
May 16 15:40:48 n2 smartd[924]: Device: /dev/sda [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 39 to 43
May 16 15:40:48 n2 smartd[924]: Device: /dev/sdb [SAT], 3 Currently unreadable (pending) sectors
May 16 15:40:48 n2 smartd[924]: Device: /dev/sdb [SAT], 3 Offline uncorrectable sectors
-- Reboot --
May 16 16:09:16 n2 kernel: Linux version 5.15.35-1-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PVE 5.15.35-2 (Thu, 05 May 2022 13:54:35 +0200) ()
May 16 16:09:16 n2 kernel: Command line: initrd=\EFI\proxmox\5.15.35-1-pve\initrd.img-5.15.35-1-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs
May 16 16:09:16 n2 kernel: KERNEL supported cpus:
May 16 16:09:16 n2 kernel: Intel GenuineIntel
What do you think this could be? I'm noticing errors on a specific disk (/dev/sdb reports 3 pending and 3 offline uncorrectable sectors), but when I look at that disk in the web UI it says the SMART values are all good and it passes. It's also part of a ZFS RAID1 mirror, so I'm not too worried about it failing. On top of that, the machine rebooted at 16:09 and the error was logged about 30 minutes earlier, so that's definitely not the (direct) cause.
The log from the other crash has no mention of disk errors at all, so I'm pretty sure it's not related.
What else could it be? This happens almost always if the CPU usage stays at 400% (4-core CPU) for a few minutes, but on rare occasions it also happens randomly when no one is playing. What other logs could I look into? I'm starting to think that this might be a hardware issue, but what's failing then?
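To make the timing argument above easier to check, here is a minimal Python sketch that splits a syslog dump at the "-- Reboot --" markers and measures how long the log was silent before each boot. The function names are made up for illustration, and the year is an assumption (syslog lines carry no year; 2022 is taken from the kernel build date in the logs). It also assumes an English locale for the month abbreviations.

```python
from datetime import datetime

REBOOT_MARKER = "-- Reboot --"  # separator line as it appears in the dumps above


def last_lines_before_reboots(log_text, n=3):
    """Return, for each reboot marker, the n non-empty lines just before it."""
    lines = [ln for ln in log_text.splitlines() if ln.strip()]
    result = []
    for i, ln in enumerate(lines):
        if ln.strip() == REBOOT_MARKER:
            result.append(lines[max(0, i - n):i])
    return result


def seconds_of_silence(before, after, year=2022):
    """Seconds between the last line before a reboot and the first line after.

    Both arguments are raw syslog lines starting with a timestamp such as
    'May 16 16:17:58'. The year is an assumption, since syslog omits it.
    """
    fmt = "%Y %b %d %H:%M:%S"
    t0 = datetime.strptime(f"{year} {before[:15]}", fmt)
    t1 = datetime.strptime(f"{year} {after[:15]}", fmt)
    return (t1 - t0).total_seconds()
```

A long gap with nothing logged (as in the 15:40 to 16:09 case) only shows that the last thing written to disk was not necessarily the trigger; a hard reset can also lose messages that were never flushed.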
Thank you!
EDIT: Machine is running on an i5-650 with 8GB RAM, with two 500GB hard disks attached. I mainly built this out of random unused parts I had lying around. I know I should probably just avoid Proxmox with such a low-performance system but I like how easy it is to back everything up, add disks, create VMs, and add redundancy through ZFS pools and I'm ok with the performance being slightly degraded due to the overhead.
Code:
Linux 5.15.35-1-pve #1 SMP PVE 5.15.35-2 (Thu, 05 May 2022 13:54:35 +0200)
pve-manager/7.2-3/c743d6c1