spontaneous reboots

Andreas Pflug

Active Member
Nov 13, 2019
Running PVE 7 on a Supermicro board with an Intel C3758, ECC memory, on-board NIC and 2x Samsung 860 Pro. No problems with memory, temperature or SMART status detected; load is constantly low. Two VMs are running: one OPNsense firewall and one Debian helper VM.

PVE has rebooted 5 times in the last 5 days, with some indication of interrupt problems. Log entries around the reboots, from kern.log:

Code:
Sep  8 09:04:53 clubhouse kernel: [19344.941119] perf: interrupt took too long (2511 > 2500), lowering kernel.perf_event_max_sample_rate to 79500
Sep  8 11:28:49 clubhouse kernel: [27980.479450] perf: interrupt took too long (3140 > 3138), lowering kernel.perf_event_max_sample_rate to 63500
Sep  8 15:39:55 clubhouse kernel: [43045.899259] perf: interrupt took too long (3931 > 3925), lowering kernel.perf_event_max_sample_rate to 50750
Sep  9 00:58:56 clubhouse kernel: [76763.022321] perf: interrupt took too long (4921 > 4913), lowering kernel.perf_event_max_sample_rate to 40500
Sep  9 02:14:47 clubhouse kernel: [    0.000000] Linux version 5.11.22-3-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PVE 5.11.22-6 (Wed, 28 Jul 2021 10:51:12 +>
...
Sep  9 02:15:19 clubhouse kernel: [   38.257772] fwbr101i1: port 2(tap101i1) entered forwarding state
Sep  9 07:47:18 clubhouse kernel: [19846.315676] perf: interrupt took too long (2503 > 2500), lowering kernel.perf_event_max_sample_rate to 79750
Sep  9 10:08:20 clubhouse kernel: [28344.049493] perf: interrupt took too long (3134 > 3128), lowering kernel.perf_event_max_sample_rate to 63750
Sep  9 13:54:41 clubhouse kernel: [41961.790948] perf: interrupt took too long (3921 > 3917), lowering kernel.perf_event_max_sample_rate to 51000
Sep  9 14:25:33 clubhouse kernel: [    0.000000] Linux version 5.11.22-3-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PVE 5.11.22-6 (Wed, 28 Jul 2021 10:51:12 +>
...
Sep  9 19:56:03 clubhouse kernel: [19760.818444] perf: interrupt took too long (2520 > 2500), lowering kernel.perf_event_max_sample_rate to 79250
Sep  9 22:16:53 clubhouse kernel: [28210.549773] perf: interrupt took too long (3161 > 3150), lowering kernel.perf_event_max_sample_rate to 63250
Sep 10 02:27:43 clubhouse kernel: [43260.049114] perf: interrupt took too long (3962 > 3951), lowering kernel.perf_event_max_sample_rate to 50250
Sep 10 13:58:01 clubhouse kernel: [84677.546913] perf: interrupt took too long (4962 > 4952), lowering kernel.perf_event_max_sample_rate to 40250
Sep 12 14:10:26 clubhouse kernel: [258260.114336] hrtimer: interrupt took 6494 ns
Sep 12 19:48:30 clubhouse kernel: [    0.000000] Linux version 5.11.22-3-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PVE 5.11.22-6 (Wed, 28 Jul 2021 10:51:12 +>
...
Sep 12 19:49:02 clubhouse kernel: [   38.094149] fwbr101i1: port 2(tap101i1) entered forwarding state
Sep 13 01:48:09 clubhouse kernel: [21476.614237] perf: interrupt took too long (2515 > 2500), lowering kernel.perf_event_max_sample_rate to 79500
Sep 13 04:10:50 clubhouse kernel: [30037.777012] perf: interrupt took too long (3158 > 3143), lowering kernel.perf_event_max_sample_rate to 63250
Sep 13 07:57:04 clubhouse kernel: [43684.907223] perf: interrupt took too long (3948 > 3947), lowering kernel.perf_event_max_sample_rate to 50500
Sep 13 12:37:17 clubhouse kernel: [    0.000000] Linux version 5.11.22-3-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PVE 5.11.22-6 (Wed, 28 Jul 2021 10:51:12 +>
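
To see whether the reboots follow a pattern, the excerpt above can be reduced to "last kernel message before each boot". The following is only a rough sketch: the function name is made up for illustration, and it assumes the syslog line format shown in the excerpt. Note that the last logged message is only a lower bound for the uptime at the moment of the reboot, since the machine may have run silently for a while afterwards.

```python
import re

# Matches syslog-style kernel lines such as
# "Sep  9 00:58:56 clubhouse kernel: [76763.022321] perf: ..."
LINE_RE = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ kernel: \[\s*(\d+\.\d+)\]"
)


def uptime_before_reboots(lines):
    """Return (timestamp, uptime_seconds) of the last kernel line before each boot.

    Hypothetical helper, not part of any Proxmox tooling.
    """
    results = []
    last = None  # most recent (timestamp, uptime) pair seen so far
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip "..." markers and non-kernel lines
        stamp, uptime = m.group(1), float(m.group(2))
        # A fresh boot restarts the kernel clock, so the uptime value drops.
        if last is not None and uptime < last[1]:
            results.append(last)
        last = (stamp, uptime)
    return results
```

It can be fed an open file, for example `uptime_before_reboots(open("/var/log/kern.log"))` (the path is the usual Debian location and is an assumption here), and each returned uptime can be divided by 3600 to get hours.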

Any clue what might trigger the problem?
I've seen issues with C-states on Bay Trail processors (this one is a Denverton), and I am wondering whether I should install intel-microcode or refrain from it.

Regards,
Andreas
 
These entries are normal on most systems. The "interrupt took too long" message comes from the perf monitoring subsystem of the kernel and is nothing fatal. Are there any crash logs available from when the system actually died? If not, consider setting up kdump or netconsole to capture a log of a potential kernel panic.
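
For the netconsole route, the receiving side only needs something that listens for UDP datagrams and writes them to disk. Below is a minimal sketch of such a receiver, assuming netconsole's default target port 6666; the function names and the log file path are made up for illustration, and a plain syslog daemon or netcat on the receiving host would do the same job.

```python
import socket


def open_listener(port=6666, host="0.0.0.0"):
    """Bind a UDP socket; 6666 is netconsole's default target port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def receive_messages(sock, logfile, count=None):
    """Append each received datagram to logfile; stop after `count` if given."""
    received = 0
    with open(logfile, "a", encoding="utf-8") as out:
        while count is None or received < count:
            data, (addr, _port) = sock.recvfrom(65535)
            text = data.decode("utf-8", errors="replace").rstrip()
            out.write(f"{addr} {text}\n")
            out.flush()  # keep the file current in case the sender dies
            received += 1
    return received
```

Calling `receive_messages(open_listener(), "netconsole.log")` on a second host would then run until interrupted and collect whatever the crashing kernel still manages to send.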

The 'microcode' package for your CPU should always be installed for best stability.
 
You're right, I'm seeing "interrupt took too long" on other machines as well, now that I look for it ;-)

Unfortunately, this is a lonely machine in the desert, no machine to send netconsole to.
 
I've installed kdump, and I still have those reboots. kdump is working fine: /var/crash is filled when I provoke a crash with echo c > /proc/sysrq-trigger, but the spontaneous reboots don't log anything. So the cause does not seem to be software. The mainboard BMC didn't log any power event. Running a stress test on the machine for 24 h (100% load on 4 of 8 CPU cores) brought the CPU temperature up to 80 °C, with no incident. So this looks like a motherboard issue. But:

I replaced the mainboard, CPU and RAM, and two days later came the next crash...
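
The check described above (a provoked crash leaves a dump in /var/crash, a spontaneous reboot does not) can be repeated for every boot time taken from kern.log. This is only a sketch under assumptions: the helper name and the 10-minute window are invented for illustration, and it relies on file modification times rather than on any particular naming scheme used by kdump.

```python
import os
from datetime import datetime, timedelta


def dumps_near(crash_dir, boot_time, window=timedelta(minutes=10)):
    """List entries in crash_dir whose mtime lies within `window` of boot_time.

    Hypothetical helper; boot_time is a naive datetime in local time.
    """
    hits = []
    for entry in os.scandir(crash_dir):
        mtime = datetime.fromtimestamp(entry.stat().st_mtime)
        if abs(mtime - boot_time) <= window:
            hits.append(entry.name)
    return sorted(hits)
```

An empty result for a given boot, for example `dumps_near("/var/crash", datetime(2021, 9, 13, 12, 37, 17))` (the year is inferred from the kernel build date in the log), suggests that the kernel never got the chance to panic and hand over to the crash kernel, which fits a reset or power loss better than a software fault.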
 
