spontaneous reboots

Andreas Pflug · Sep 13, 2021

Running PVE 7 on a Supermicro board with an Intel C3758, ECC memory, on-board NIC, and 2x Samsung 860 Pro. No problems with memory, temperature, or SMART status detected; load is constantly low. Two VMs are running: one OPNsense firewall and one Debian helper VM.

PVE has rebooted 5 times in the last 5 days, with indications that some interrupt problem exists. Entries from kern.log around the reboots:

Code:

Sep  8 09:04:53 clubhouse kernel: [19344.941119] perf: interrupt took too long (2511 > 2500), lowering kernel.perf_event_max_sample_rate to 79500
Sep  8 11:28:49 clubhouse kernel: [27980.479450] perf: interrupt took too long (3140 > 3138), lowering kernel.perf_event_max_sample_rate to 63500
Sep  8 15:39:55 clubhouse kernel: [43045.899259] perf: interrupt took too long (3931 > 3925), lowering kernel.perf_event_max_sample_rate to 50750
Sep  9 00:58:56 clubhouse kernel: [76763.022321] perf: interrupt took too long (4921 > 4913), lowering kernel.perf_event_max_sample_rate to 40500
Sep  9 02:14:47 clubhouse kernel: [    0.000000] Linux version 5.11.22-3-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PVE 5.11.22-6 (Wed, 28 Jul 2021 10:51:12 +>
...
Sep  9 02:15:19 clubhouse kernel: [   38.257772] fwbr101i1: port 2(tap101i1) entered forwarding state
Sep  9 07:47:18 clubhouse kernel: [19846.315676] perf: interrupt took too long (2503 > 2500), lowering kernel.perf_event_max_sample_rate to 79750
Sep  9 10:08:20 clubhouse kernel: [28344.049493] perf: interrupt took too long (3134 > 3128), lowering kernel.perf_event_max_sample_rate to 63750
Sep  9 13:54:41 clubhouse kernel: [41961.790948] perf: interrupt took too long (3921 > 3917), lowering kernel.perf_event_max_sample_rate to 51000
Sep  9 14:25:33 clubhouse kernel: [    0.000000] Linux version 5.11.22-3-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PVE 5.11.22-6 (Wed, 28 Jul 2021 10:51:12 +>
...
Sep  9 19:56:03 clubhouse kernel: [19760.818444] perf: interrupt took too long (2520 > 2500), lowering kernel.perf_event_max_sample_rate to 79250
Sep  9 22:16:53 clubhouse kernel: [28210.549773] perf: interrupt took too long (3161 > 3150), lowering kernel.perf_event_max_sample_rate to 63250
Sep 10 02:27:43 clubhouse kernel: [43260.049114] perf: interrupt took too long (3962 > 3951), lowering kernel.perf_event_max_sample_rate to 50250
Sep 10 13:58:01 clubhouse kernel: [84677.546913] perf: interrupt took too long (4962 > 4952), lowering kernel.perf_event_max_sample_rate to 40250
Sep 12 14:10:26 clubhouse kernel: [258260.114336] hrtimer: interrupt took 6494 ns
Sep 12 19:48:30 clubhouse kernel: [    0.000000] Linux version 5.11.22-3-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PVE 5.11.22-6 (Wed, 28 Jul 2021 10:51:12 +>
...
Sep 12 19:49:02 clubhouse kernel: [   38.094149] fwbr101i1: port 2(tap101i1) entered forwarding state
Sep 13 01:48:09 clubhouse kernel: [21476.614237] perf: interrupt took too long (2515 > 2500), lowering kernel.perf_event_max_sample_rate to 79500
Sep 13 04:10:50 clubhouse kernel: [30037.777012] perf: interrupt took too long (3158 > 3143), lowering kernel.perf_event_max_sample_rate to 63250
Sep 13 07:57:04 clubhouse kernel: [43684.907223] perf: interrupt took too long (3948 > 3947), lowering kernel.perf_event_max_sample_rate to 50500
Sep 13 12:37:17 clubhouse kernel: [    0.000000] Linux version 5.11.22-3-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PVE 5.11.22-6 (Wed, 28 Jul 2021 10:51:12 +>
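
For anyone who wants to pull the reboot points out of a longer kern.log, here is a minimal Python sketch. It assumes the syslog line format shown above and treats every "Linux version" banner at timestamp 0.000000 as a fresh boot. The names `find_reboots` and `report` are made up for illustration. Note that it reports the last kernel message before each boot, which is not necessarily the moment of the reset.

```python
import re

# Bracketed kernel timestamp, e.g. "[19344.941119]" or "[    0.000000]".
UPTIME_RE = re.compile(r"kernel: \[\s*(\d+\.\d+)\]")
BOOT_MARKER = "] Linux version"


def find_reboots(lines):
    """Return (last_line_before_boot, its_uptime_seconds) for every boot
    banner that is preceded by at least one earlier kernel line."""
    reboots = []
    prev_line, prev_uptime = None, None
    for line in lines:
        match = UPTIME_RE.search(line)
        if not match:
            continue  # not a kernel line with a timestamp
        uptime = float(match.group(1))
        is_boot = uptime == 0.0 and BOOT_MARKER in line
        if is_boot and prev_line is not None:
            reboots.append((prev_line.rstrip("\n"), prev_uptime))
        prev_line, prev_uptime = line, uptime
    return reboots


def report(path="/var/log/kern.log"):
    """Print one line per detected reboot (the path is an assumption)."""
    with open(path, errors="replace") as fh:
        for last, uptime in find_reboots(fh):
            print(f"last message at {uptime / 3600:.1f} h uptime: {last}")
```

Calling `report()` on the host should list one entry per boot banner; a clean shutdown would normally leave shutdown messages as the last entries, whereas here the last entry is an unrelated perf or hrtimer line.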

Any clue what might trigger the problem?
I've seen issues on Bay Trail processors with C-states (this is a Denverton). I was also wondering whether I should install intel-microcode or refrain from it.

Regards,
Andreas

Stefan_R · Sep 14, 2021

These entries are normal on most systems. The "interrupt took too long" message comes from the kernel's perf monitoring subsystem and is nothing fatal. Are there any crash logs available from when the system actually died? If not, look into setting up kdump or netconsole to capture a log of a potential kernel panic.

The 'microcode' package for your CPU should always be installed for best stability.
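
To see which microcode revision the CPU is currently running (for example before and after installing the package and rebooting), something like the following Python sketch could be used. It assumes an x86 Linux host where /proc/cpuinfo carries a `microcode` field per logical CPU; the function names are hypothetical.

```python
import re


def microcode_revisions(cpuinfo_text):
    """Return the set of microcode revision strings found in cpuinfo text."""
    return set(
        re.findall(r"^microcode\s*:\s*(\S+)", cpuinfo_text, flags=re.MULTILINE)
    )


def read_microcode_revisions(path="/proc/cpuinfo"):
    """Read the revisions from the running system (path is the usual default)."""
    with open(path) as fh:
        return microcode_revisions(fh.read())
```

On a healthy system the set normally contains a single value, since all cores should report the same revision.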

Andreas Pflug · Sep 14, 2021

You're right, I'm seeing "interrupt took too long" on other machines as well, now that I look for it ;-)

Unfortunately, this is a lonely machine in the desert, with no other machine to send netconsole output to.

Stefan_R · Sep 14, 2021

Try kdump. If you're using grub, here's a past explanation: https://forum.proxmox.com/threads/random-proxmox-server-hang-no-vms-no-web-gui.58823/#post-271632

Andreas Pflug · Sep 27, 2021

I've installed kdump, and I still have those reboots. kdump is working fine: /var/crash is filled when I provoke a crash with echo c > /proc/sysrq-trigger, but the spontaneous reboots don't log anything. So the cause does not seem to be software-triggered. The mainboard BMC didn't log any power event. Running a stress test on the machine for 24 h (100% load on 4 of 8 CPU cores) brings the CPU temperature up to 80 °C, with no incident. So this looks like a motherboard issue, but:

I changed the mainboard, CPU and RAM, and two days later came the next crash...
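
For reference, the kdump check described above (is a crash kernel loaded, and did anything land in /var/crash) can be scripted. This is only a rough Python sketch: it assumes the kernel exposes /sys/kernel/kexec_crash_loaded and that dumps are written under /var/crash as in this thread; the function names are invented for illustration.

```python
import os


def crash_kernel_loaded(flag_path="/sys/kernel/kexec_crash_loaded"):
    """True if the kernel reports that a crash (kdump) kernel is loaded."""
    try:
        with open(flag_path) as fh:
            return fh.read().strip() == "1"
    except OSError:
        return False  # file missing or unreadable: treat as not loaded


def crash_entries(crash_dir="/var/crash"):
    """Return (name, mtime) for each entry in the crash directory, oldest first."""
    try:
        entries = [(e.name, e.stat().st_mtime) for e in os.scandir(crash_dir)]
    except OSError:
        return []  # directory missing or unreadable
    return sorted(entries, key=lambda item: item[1])
```

If `crash_kernel_loaded()` is true and `crash_entries()` shows nothing newer than the last unexpected reboot, the kernel most likely never got the chance to panic, which matches the observation that the resets are not software-triggered.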
