NMI watchdog: BUG: soft lockup - CPU#5 stuck for

BelCloud · Feb 12, 2017

Since few days ago, daily, i'm getting such errors. Once they start, in few minutes, the whole node crashes. I cannot run any commands, i was logged from both idrac console and ssh. The only solution so far is a reboot (but this affects my uptime a lot)

Code:

Message from syslogd@dx411-s09 at Feb 12 16:16:51 ...
 kernel:[214551.571665] NMI watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [queueprocd - pr:45363]
^C^C^C^C^C
Message from syslogd@dx411-s09 at Feb 12 16:18:51 ...
 kernel:[214671.566741] NMI watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [queueprocd - pr:45363]
^C^C^C^C^X^Z^Z
Message from syslogd@dx411-s09 at Feb 12 16:19:19 ...
 kernel:[214699.565592] NMI watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [queueprocd - pr:45363]

Message from syslogd@dx411-s09 at Feb 12 16:19:47 ...
 kernel:[214727.564445] NMI watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [queueprocd - pr:45363]

Message from syslogd@dx411-s09 at Feb 12 16:20:55 ...
 kernel:[214795.561655] NMI watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [queueprocd - pr:45363]

Code:

pveversion -v
proxmox-ve: 4.4-79 (running kernel: 4.4.35-2-pve)
pve-manager: 4.4-12 (running version: 4.4-12/e71b7a74)
pve-kernel-4.4.35-1-pve: 4.4.35-77
pve-kernel-4.4.35-2-pve: 4.4.35-79
pve-kernel-4.4.19-1-pve: 4.4.19-66
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-48
qemu-server: 4.0-108
pve-firmware: 1.1-10
libpve-common-perl: 4.0-91
libpve-access-control: 4.0-23
libpve-storage-perl: 4.0-73
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-docs: 4.4-3
pve-qemu-kvm: 2.7.1-1
pve-container: 1.0-93
pve-firewall: 2.0-33
pve-ha-manager: 1.0-40
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u3
lxc-pve: 2.0.7-1
lxcfs: 2.0.6-pve1
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.8-pve14~bpo80

According to bios, watchdog is disabled (dell r420)

nseba · Feb 13, 2017

Hello BelCloud,
it seems I've got a similar problem. I created this post and tried different solutions but none of them worked for me. Perhaps you'd want to test them also. I would be so grateful if we can come up to a working solution.

BelCloud · Feb 13, 2017

Hi nseba

I tried "sysctl -w kernel.nmi_watchdog=0" but without any results. Made sure the hardware watchdog is also disabled.

I ended up keeping the cpu load on the node below 10% and moving the customers that used 100% of their cpu resources(which i believe caused the crash) to a single node cluster. So far seems stable, but i'm monitoring.
How many containers do you have in the node?

nseba · Feb 13, 2017

BelCloud said:
I tried "sysctl -w kernel.nmi_watchdog=0" but without any results.

I tried to put nmi_watchdog=0 in grub, which is quite the same as with sysctl I think, but I had no result either.

As it is a recently acquired machine, I have no containers nor kvm machines on it and still experience the same error. I checked the bios and there's no evidence of hardware watchdog.

BelCloud · Feb 13, 2017

I only saw it happening when there were some heavily loaded lxcs (100% of their allocated cpu and oom-killer killing processes as they were out of ram). I run many kvm nodes which had no problems and the hardware is identic.

nseba · Feb 13, 2017

I've got this crash quite in any configuration. That's why I did not put this machine in production. For now, I did not find a working solution and didn't really find help on the forum. I hope you'll get more support

.

BelCloud · Feb 13, 2017

Same problem started happening on another lxc node:
syslog just before crashing:

Code:

Feb 13 18:39:19 dx411-s07 kernel: [1164719.968823] Modules linked in: nfnetlink_queue act_police cls_basic sch_ingress sch_htb bluetooth dccp_diag dccp udp_diag nf_log_ipv6 xt_hl ip6t_rt dm_snapshot xt_recent xt_time unix_diag tcp_diag inet_diag xt_REDIRECT nf_nat_redirect nf_log_ipv4 nf_log_common xt_LOG xt_limit $
Feb 13 18:39:19 dx411-s07 kernel: [1164719.968930] RIP: 0010:[<ffffffff8119c6cb>]  [<ffffffff8119c6cb>] global_dirty_limits+0x4b/0x80
Feb 13 18:39:19 dx411-s07 kernel: [1164719.968945] R10: 00000000026738e1 R11: 0000000000000333 R12: ffff882883013940
Feb 13 18:39:19 dx411-s07 kernel: [1164719.968958]  0000000000000000 0000000000000000 0000000000000000 0000000000000000
Feb 13 18:39:19 dx411-s07 kernel: [1164719.968986]  [<ffffffff811fdd26>] try_charge+0x1a6/0x680
Feb 13 18:39:19 dx411-s07 kernel: [1164719.969015]  [<ffffffff8106b4dd>] __do_page_fault+0x19d/0x410
Feb 13 18:39:33 dx411-s07 kernel: [1164733.600805]  [<ffffffff810f0052>] __hrtimer_run_queues+0x102/0x290
Feb 13 18:39:33 dx411-s07 kernel: [1164733.600838]  [<ffffffff811fe78f>] ? mem_cgroup_iter+0x1cf/0x380
Feb 13 18:39:33 dx411-s07 kernel: [1164733.600863]  [<ffffffff8118f274>] ? filemap_map_pages+0x224/0x230
Feb 13 18:39:59 dx411-s07 kernel: xor raid6_pq ixgbe(O) dca vxlan ip6_udp_tunnel udp_tunnel tg3 ahci ptp libahci pps_core megaraid_sas fjes
Feb 13 18:39:59 dx411-s07 kernel: [1164759.969494] task: ffff8815b46eb800 ti: ffff882883010000 task.ti: ffff882883010000
Feb 13 18:39:59 dx411-s07 kernel: [1164759.969512] RAX: 0000000000000000 RBX: ffff882883013948 RCX: 0000000000000000
Feb 13 18:39:59 dx411-s07 kernel: [1164759.969522] FS:  00007f8d4ac66700(0000) GS:ffff88301f340000(0000) knlGS:0000000000000000
Feb 13 18:39:59 dx411-s07 kernel: [1164759.969557]  [<ffffffff8119ccea>] throttle_vm_writeout+0x5a/0xd0
Feb 13 18:39:59 dx411-s07 kernel: [1164759.969594]  [<ffffffff811bdc06>] wp_page_copy.isra.56+0x166/0x540
Feb 13 18:39:59 dx411-s07 kernel: [1164759.969637]  [<ffffffff8185e3f8>] page_fault+0x28/0x30
Feb 13 18:40:27 dx411-s07 kernel: [1164787.969645] NMI watchdog: BUG: soft lockup - CPU#27 stuck for 23s! [bash:40900]
Feb 13 18:40:27 dx411-s07 kernel: xor raid6_pq ixgbe(O) dca vxlan ip6_udp_tunnel udp_tunnel tg3 ahci ptp libahci pps_core megaraid_sas fjes
Feb 13 18:40:27 dx411-s07 kernel: [1164787.969815] Hardware name: Dell Inc. PowerEdge R420/0K29HN, BIOS 2.4.2 01/29/2015
Feb 13 18:40:27 dx411-s07 kernel: [1164787.969829] RSP: 0000:ffff882883013980  EFLAGS: 00000282
Feb 13 18:40:27 dx411-s07 kernel: [1164787.969837] FS:  00007f8d4ac66700(0000) GS:ffff88301f340000(0000) knlGS:0000000000000000
Feb 13 18:40:27 dx411-s07 kernel: [1164787.969857]  [<ffffffff811fe78f>] ? mem_cgroup_iter+0x1cf/0x380
Feb 13 18:40:27 dx411-s07 kernel: [1164787.969884]  [<ffffffff8118f274>] ? filemap_map_pages+0x224/0x230
Feb 13 18:40:55 dx411-s07 kernel: ip_tables softdog nfsd auth_rpcgss nfs_acl nfs lockd grace fscache sunrpc ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi 8021q garp mrp bonding nfnetlink_log nfnetlink zfs(PO) zunicode(PO) zcommon(PO) znvpair(PO) spl(O)$
Feb 13 18:40:55 dx411-s07 kernel: [1164815.970210] R13: 0000000000001740 R14: ffff88307fffb6f0 R15: ffff88187fffb000
Feb 13 18:40:55 dx411-s07 kernel: [1164815.970220]  ffff8815b46eb800 0000000000000000 ffff88307fffb6c0 0000000000000000
Feb 13 18:40:55 dx411-s07 kernel: [1164815.970242]  [<ffffffff81201fdc>] mem_cgroup_try_charge+0x9c/0x1b0
Feb 13 18:40:55 dx411-s07 kernel: [1164815.970266]  [<ffffffff8185e3f8>] ? page_fault+0x28/0x30

I'm assuming a bug in throttle_vm

bob y · Feb 20, 2017

Is there still no fixes on this issue? I'm still seeing it now even with the latest version.

# pveversion -v
proxmox-ve: 4.4-79 (running kernel: 4.4.35-2-pve)
pve-manager: 4.4-12 (running version: 4.4-12/e71b7a74)
pve-kernel-4.4.35-2-pve: 4.4.35-79
pve-kernel-4.2.2-1-pve: 4.2.2-16
lvm2: 2.02.116-pve3
corosync-pve: 2.4.2-1
libqb0: 1.0-1
pve-cluster: 4.0-48
qemu-server: 4.0-109
pve-firmware: 1.1-10
libpve-common-perl: 4.0-92
libpve-access-control: 4.0-23
libpve-storage-perl: 4.0-73
pve-libspice-server1: 0.12.8-1
vncterm: 1.3-1
pve-docs: 4.4-3
pve-qemu-kvm: 2.7.1-3
pve-container: 1.0-94
pve-firewall: 2.0-33
pve-ha-manager: 1.0-40
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u3
lxc-pve: 2.0.7-3
lxcfs: 2.0.6-pve1
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.9-pve15~bpo80

nseba · Mar 24, 2017

Hi,
I don't want to hijack this thread but I managed to have a config that seems to work for me. Details are at the end of this thread. Hope it will help.

Andrii · May 21, 2017

Is irqbalance helped for someone?

BelCloud · May 21, 2017

For me, no.
The only solution that worked for me was limiting the pids per container.

Andrii · May 21, 2017

BelCloud said:
For me, no.
The only solution that worked for me was limiting the pids per container.

What limit did you set? 3k was not help me.

BelCloud · May 21, 2017

I've set 500 by default, 3k seems very large for a container (personal opinion)
But i had some containers that were able to crash the node with the nmi watchdog issue, with more than 150pids. I've limited 2-3 such containers manually to 150.

Since i've set it to 500, i've had the nmiwatchdog issue happen just once or twice, so it's the the perfect solution, but does the job for now.
How many containers do you have per node?

Andrii · May 21, 2017

BelCloud said:
How many containers do you have per node?

30-50 CT. But mostly it uses for VPN and other lite service. At least 50% CPU are free.
OpenVZ was more stable ((

BelCloud said:
But i had some containers that were able to crash the node

How do you detect these "bad" containers?

BelCloud · May 21, 2017

Very hard.
1. The solution is to be connected on the node, and when the first nmi_watchdog error apears (usualy by KVM), to copy the PID from it and check /proc/PID/cgroup to see to which container it belongs before the node dies. It's not 100% foolproof, but in most cases it provides the real container issue.
2. Move containers one by one untill the node stops crashing.
3. ps -Ao pid,cgroup|grep lxc|cut -d / -f3|cut -d, -f1|sort|uniq -c|sort -n
This might show you any container using too many pids.

Andrii · May 21, 2017

BelCloud said:
ps -Ao pid,cgroup|grep lxc|cut -d / -f3|cut -d, -f1|sort|uniq -c|sort -n

Good idea!

Andrii · May 24, 2017

No solutions for LXC. The servers was down and down every day.

BelCloud · May 24, 2017

Have you tried a lower max-pid?

Andrii · May 24, 2017

BelCloud said:
Have you tried a lower max-pid?

Yes. 500 was set.

BelCloud · May 24, 2017

Try 200 as a test.

NMI watchdog: BUG: soft lockup - CPU#5 stuck for

Renowned Member

New Member

Renowned Member

New Member

Renowned Member

New Member

Renowned Member

New Member

New Member

Member

Renowned Member

Member

Renowned Member

Member

Renowned Member

Member

Member

Attachments

Renowned Member

Member

Renowned Member

We value your privacy