[SOLVED] Serious node instability - CPU hangs, softdog wants to kick-in.

ferdek · Jul 13, 2020

Hi,
I am having some serious problems with my server and I am unable to pinpoint the root cause. After installing the first virtual machine (OPNsense, a FreeBSD-based system), the server started to reboot randomly. Initially, after powering on the server, it would reboot after about 5-10 minutes and then work perfectly fine.

I started troubleshooting the initial symptoms. I ran memtest for a whole day: all 64GB of ECC RAM got tested, no problems. Deleted Proxmox, evaluated XCP-NG - no problems. Went back to Proxmox, as it happens to match my use case, and the problem reappeared. Then I attached a serial console to the server, hoping to catch some kernel output just before a reboot. And sure enough, there was:

Code:

kernel: softdog: Triggered

Oh... But what was the reason? So I changed the configuration of the softdog module to not reboot my server when problems get detected. This way I was able to capture more logs simply by reading the syslog:

Code:

Jul 13 08:44:26 alpha-pve pvestatd[2470]: status update time (5.082 seconds)
Jul 13 08:44:52 alpha-pve kernel: softdog: Triggered - Reboot ignored
Jul 13 08:45:23 alpha-pve pve-ha-crm[2566]: loop take too long (31 seconds)
Jul 13 08:45:23 alpha-pve pve-ha-lrm[2659]: loop take too long (31 seconds)
Jul 13 08:45:23 alpha-pve kernel: watchdog: BUG: soft lockup - CPU#1 stuck for 29s! [swapper/1:0]
Jul 13 08:45:23 alpha-pve kernel: softdog: Triggered - Reboot ignored
Jul 13 08:45:23 alpha-pve kernel: Modules linked in: [redacted]
Jul 13 08:45:23 alpha-pve kernel: CPU: 1 PID: 0 Comm: swapper/1 Tainted: P           O      5.4.44-2-pve #1
Jul 13 08:45:23 alpha-pve kernel: Hardware name: Supermicro X8DTL/X8DTL, BIOS 2.1b       11/16/2012
Jul 13 08:45:23 alpha-pve kernel: RIP: 0010:cpuidle_enter_state+0xba/0x450
Jul 13 08:45:23 alpha-pve kernel: Code: 66 90 31 ff e8 a7 b4 84 ff 80 7d c7 00 74 17 9c 58 66 66 90 66 90 f6 c4 02 0f 85 63 03 00 00 31 ff e8 ca 22 8b ff fb 66 66 90 <66> 66 90 45 85 ed 0f 88 8d 02 00 00 49 63 cd 48 8b 75 d0 48 2b 75
Jul 13 08:45:23 alpha-pve kernel: RSP: 0018:ffffb8a9c62c7e48 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
Jul 13 08:45:23 alpha-pve kernel: RAX: ffff9df65f86ad40 RBX: ffffffffa8b57a00 RCX: 000000000000001f
Jul 13 08:45:23 alpha-pve kernel: RDX: 0000012000a93594 RSI: 000000002ba284a3 RDI: 0000000000000000
Jul 13 08:45:23 alpha-pve kernel: RBP: ffffb8a9c62c7e88 R08: 0000000000000002 R09: 000000000002a5c0
Jul 13 08:45:23 alpha-pve kernel: R10: 0000037d87813a90 R11: ffff9df65f8699e0 R12: ffff9df65f876200
Jul 13 08:45:23 alpha-pve kernel: R13: 0000000000000004 R14: ffffffffa8b57b98 R15: ffffffffa8b57b80
Jul 13 08:45:23 alpha-pve kernel: FS:  0000000000000000(0000) GS:ffff9df65f840000(0000) knlGS:0000000000000000
Jul 13 08:45:23 alpha-pve kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 13 08:45:23 alpha-pve kernel: CR2: 00007fd3cf9860a8 CR3: 000000038ba0a002 CR4: 00000000000226e0
Jul 13 08:45:23 alpha-pve kernel: Call Trace:
Jul 13 08:45:23 alpha-pve kernel:  cpuidle_enter+0x2e/0x40
Jul 13 08:45:23 alpha-pve kernel:  call_cpuidle+0x23/0x40
Jul 13 08:45:23 alpha-pve kernel:  do_idle+0x22c/0x270
Jul 13 08:45:23 alpha-pve kernel:  cpu_startup_entry+0x1d/0x20
Jul 13 08:45:23 alpha-pve kernel:  start_secondary+0x166/0x1c0
Jul 13 08:45:23 alpha-pve kernel:  secondary_startup_64+0xa4/0xb0
Jul 13 08:45:23 alpha-pve systemd[1]: Starting Proxmox VE replication runner...
Jul 13 08:45:29 alpha-pve systemd[1]: pvesr.service: Succeeded.
Jul 13 08:45:29 alpha-pve systemd[1]: Started Proxmox VE replication runner.
Jul 13 08:45:52 alpha-pve kernel: softdog: Triggered - Reboot ignored
Jul 13 08:46:01 alpha-pve systemd[1]: Starting Proxmox VE replication runner...
Jul 13 08:46:17 alpha-pve kernel: softdog: Triggered - Reboot ignored
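
For reference, the "Reboot ignored" suffix in the log above is what the softdog module prints when its soft_noboot parameter is set. A minimal sketch of setting that parameter persistently, assuming a modprobe drop-in file (the file name is just an example, and the exact steps used on this server may have differed):

```shell
#!/bin/sh
# Sketch only: build the modprobe option line that tells softdog to log
# instead of rebooting. soft_noboot is a softdog module parameter; the
# drop-in path below is an example name, not taken from the original setup.
SOFTDOG_CONF="/etc/modprobe.d/softdog.conf"

softdog_option_line() {
    # $1: 1 = only log when the watchdog fires, 0 = reboot (module default)
    printf 'options softdog soft_noboot=%s\n' "$1"
}

softdog_option_line 1

# To apply as root (deliberately not executed here):
#   softdog_option_line 1 > "$SOFTDOG_CONF"
#   then reboot so the module is loaded with the new option
```

This only silences the symptom so that the logs survive; the lockups themselves are still there.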

Since I did not know the root cause and the above problem was happening only occasionally just after reboot, I left the softdog reboot disabled and went on with my life.
But last night - oh boy. I don't know if a software update happened or something, but syslog is now getting spammed with those messages, to the point that even network communication and I/O writes are impacted. ZFS ejects disks from pools as a result of I/O errors, and the NIC resets:

Code:

Jul 13 09:40:22 alpha-pve kernel: NETDEV WATCHDOG: enp6s0 (e1000e): transmit queue 0 timed out
Jul 13 09:40:24 alpha-pve kernel: e1000e 0000:06:00.0 enp6s0: Reset adapter unexpectedly
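
To get a feel for how bad the spam is, the three symptoms seen so far can be tallied from the log. A rough sketch, assuming the messages land in /var/log/syslog (pass another file if yours differs); count_symptoms is just a helper name made up for this example:

```shell
#!/bin/sh
# Sketch: count how often each symptom from this thread shows up in a log.
# The default path is an assumption; pass your own file as the argument.
count_symptoms() {
    log="${1:-/var/log/syslog}"
    for pattern in 'softdog: Triggered' 'soft lockup' 'NETDEV WATCHDOG'; do
        # grep -c prints 0 and exits non-zero when nothing matches,
        # and prints nothing at all if the file is missing
        n=$(grep -c -- "$pattern" "$log" 2>/dev/null || true)
        printf '%s\t%s\n' "${n:-0}" "$pattern"
    done
}
```

Calling `count_symptoms /var/log/syslog` prints one line per symptom, with the count first.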

Help? :3 This is not a hardware problem, I believe, as other software (like XCP-NG) has no problems with this server.

What's going on? C-States?

[SOLUTION]
* disabled the Intel C-states driver (intel_idle)
* enabled the ACPI C-states driver and limited it to C3 as the deepest sleep state
* switched to the ladder cpuidle governor
* set the idling strategy to no-mwait
* set all system CPUs as no-CBs (rcu_nocbs), moving RCU callback handling out of softirq context and into dedicated kernel threads (I think this was what really helped)
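
Whether the driver and governor changes above actually took effect can be checked through the cpuidle sysfs files. A small sketch, assuming the standard locations under /sys/devices/system/cpu; show_cpuidle and read_or_na are helper names invented here, and files that a given kernel does not provide are simply reported as "n/a":

```shell
#!/bin/sh
# Sketch: show which cpuidle driver/governor is active and which idle
# states cpu0 exposes. Missing files are reported as "n/a".
read_or_na() {
    # Print the file's contents, or "n/a" if it is missing or unreadable
    if [ -r "$1" ]; then cat "$1"; else echo "n/a"; fi
}

show_cpuidle() {
    # $1: optional base directory (defaults to the real sysfs location)
    base="${1:-/sys/devices/system/cpu}"
    echo "driver:   $(read_or_na "$base/cpuidle/current_driver")"
    echo "governor: $(read_or_na "$base/cpuidle/current_governor_ro")"
    for state in "$base"/cpu0/cpuidle/state*; do
        [ -d "$state" ] || continue
        echo "$(basename "$state"): $(read_or_na "$state/name")"
    done
}

show_cpuidle
```

If the governor line still shows something other than ladder after a reboot, the kernel arguments were probably not picked up.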

Full kernel args

GRUB_CMDLINE_LINUX="nmi_watchdog=0 console=ttyS1,115200n8 console=tty0 ignore_loglevel intel_idle.max_cstate=0 clocksource=hpet nouveau.modeset=0 nomodeset processor.max_cstate=2 pcie_aspm=off nosmt=force cpuidle.governor=ladder idle=nomwait rcu_nocbs=0-24"
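
The line above goes into /etc/default/grub and only applies after regenerating the GRUB config (update-grub on Debian-based systems) and rebooting. A quick sketch for checking that the running kernel really got the arguments; has_karg is a helper name made up for this example, and the list of checked arguments is just the C-state related subset of the line above:

```shell
#!/bin/sh
# Sketch: check whether a parameter is present on a kernel command line.
has_karg() {
    # $1: parameter to look for, exact match (e.g. "idle=nomwait")
    # $2: command line string; defaults to the running kernel's /proc/cmdline
    cmdline="${2:-$(cat /proc/cmdline 2>/dev/null || true)}"
    case " $cmdline " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

for arg in intel_idle.max_cstate=0 processor.max_cstate=2 \
           cpuidle.governor=ladder idle=nomwait rcu_nocbs=0-24; do
    if has_karg "$arg"; then
        echo "ok      $arg"
    else
        echo "missing $arg"
    fi
done
```

The match is on whole space-separated words, so a check for idle=nomwait will not be fooled by some other argument that merely contains that text.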
