NMI watchdog: BUG: soft lockup - CPU#5 stuck for

BelCloud

For the past few days I've been getting these errors daily. Once they start, the whole node crashes within a few minutes. I cannot run any commands; I was logged in via both the iDRAC console and SSH. The only solution so far is a reboot (but this affects my uptime a lot).

Code:
Message from syslogd@dx411-s09 at Feb 12 16:16:51 ...
 kernel:[214551.571665] NMI watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [queueprocd - pr:45363]
^C^C^C^C^C
Message from syslogd@dx411-s09 at Feb 12 16:18:51 ...
 kernel:[214671.566741] NMI watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [queueprocd - pr:45363]
^C^C^C^C^X^Z^Z
Message from syslogd@dx411-s09 at Feb 12 16:19:19 ...
 kernel:[214699.565592] NMI watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [queueprocd - pr:45363]

Message from syslogd@dx411-s09 at Feb 12 16:19:47 ...
 kernel:[214727.564445] NMI watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [queueprocd - pr:45363]

Message from syslogd@dx411-s09 at Feb 12 16:20:55 ...
 kernel:[214795.561655] NMI watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [queueprocd - pr:45363]

Code:
pveversion -v
proxmox-ve: 4.4-79 (running kernel: 4.4.35-2-pve)
pve-manager: 4.4-12 (running version: 4.4-12/e71b7a74)
pve-kernel-4.4.35-1-pve: 4.4.35-77
pve-kernel-4.4.35-2-pve: 4.4.35-79
pve-kernel-4.4.19-1-pve: 4.4.19-66
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-48
qemu-server: 4.0-108
pve-firmware: 1.1-10
libpve-common-perl: 4.0-91
libpve-access-control: 4.0-23
libpve-storage-perl: 4.0-73
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-docs: 4.4-3
pve-qemu-kvm: 2.7.1-1
pve-container: 1.0-93
pve-firewall: 2.0-33
pve-ha-manager: 1.0-40
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u3
lxc-pve: 2.0.7-1
lxcfs: 2.0.6-pve1
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.8-pve14~bpo80

According to the BIOS, the watchdog is disabled (Dell R420).
 
Hello BelCloud,
it seems I've got a similar problem. I created this post and tried different solutions, but none of them worked for me. Perhaps you'd want to test them as well. I would be very grateful if we could come up with a working solution.
 
Hi nseba,

I tried "sysctl -w kernel.nmi_watchdog=0" but without any results. I also made sure the hardware watchdog is disabled.

I ended up keeping the CPU load on the node below 10% and moving the customers that used 100% of their CPU resources (which I believe caused the crashes) to a single-node cluster. So far it seems stable, but I'm still monitoring.
How many containers do you have on the node?
 
I tried "sysctl -w kernel.nmi_watchdog=0" but without any results.

I tried putting nmi_watchdog=0 on the kernel command line in GRUB, which I think amounts to much the same as the sysctl, but I had no result either.
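For completeness, this is roughly how I added it (the standard Debian procedure; the exact contents of GRUB_CMDLINE_LINUX_DEFAULT on your node will differ):

Code:
# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet nmi_watchdog=0"
# then regenerate the bootloader config and reboot
update-grub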

As it is a recently acquired machine, I have no containers or KVM guests on it and still experience the same error. I checked the BIOS and there's no sign of a hardware watchdog.
 
I only saw it happening when there were some heavily loaded LXCs (at 100% of their allocated CPU, with the OOM killer killing processes as they ran out of RAM). I run many KVM nodes which had no problems, and the hardware is identical.
 
I've had this crash in pretty much any configuration. That's why I did not put this machine into production. So far I haven't found a working solution and didn't really find help on the forum. I hope you'll get more support :cool:.
 
The same problem started happening on another LXC node.
Syslog just before the crash:
Code:
Feb 13 18:39:19 dx411-s07 kernel: [1164719.968823] Modules linked in: nfnetlink_queue act_police cls_basic sch_ingress sch_htb bluetooth dccp_diag dccp udp_diag nf_log_ipv6 xt_hl ip6t_rt dm_snapshot xt_recent xt_time unix_diag tcp_diag inet_diag xt_REDIRECT nf_nat_redirect nf_log_ipv4 nf_log_common xt_LOG xt_limit $
Feb 13 18:39:19 dx411-s07 kernel: [1164719.968930] RIP: 0010:[<ffffffff8119c6cb>]  [<ffffffff8119c6cb>] global_dirty_limits+0x4b/0x80
Feb 13 18:39:19 dx411-s07 kernel: [1164719.968945] R10: 00000000026738e1 R11: 0000000000000333 R12: ffff882883013940
Feb 13 18:39:19 dx411-s07 kernel: [1164719.968958]  0000000000000000 0000000000000000 0000000000000000 0000000000000000
Feb 13 18:39:19 dx411-s07 kernel: [1164719.968986]  [<ffffffff811fdd26>] try_charge+0x1a6/0x680
Feb 13 18:39:19 dx411-s07 kernel: [1164719.969015]  [<ffffffff8106b4dd>] __do_page_fault+0x19d/0x410
Feb 13 18:39:33 dx411-s07 kernel: [1164733.600805]  [<ffffffff810f0052>] __hrtimer_run_queues+0x102/0x290
Feb 13 18:39:33 dx411-s07 kernel: [1164733.600838]  [<ffffffff811fe78f>] ? mem_cgroup_iter+0x1cf/0x380
Feb 13 18:39:33 dx411-s07 kernel: [1164733.600863]  [<ffffffff8118f274>] ? filemap_map_pages+0x224/0x230
Feb 13 18:39:59 dx411-s07 kernel: xor raid6_pq ixgbe(O) dca vxlan ip6_udp_tunnel udp_tunnel tg3 ahci ptp libahci pps_core megaraid_sas fjes
Feb 13 18:39:59 dx411-s07 kernel: [1164759.969494] task: ffff8815b46eb800 ti: ffff882883010000 task.ti: ffff882883010000
Feb 13 18:39:59 dx411-s07 kernel: [1164759.969512] RAX: 0000000000000000 RBX: ffff882883013948 RCX: 0000000000000000
Feb 13 18:39:59 dx411-s07 kernel: [1164759.969522] FS:  00007f8d4ac66700(0000) GS:ffff88301f340000(0000) knlGS:0000000000000000
Feb 13 18:39:59 dx411-s07 kernel: [1164759.969557]  [<ffffffff8119ccea>] throttle_vm_writeout+0x5a/0xd0
Feb 13 18:39:59 dx411-s07 kernel: [1164759.969594]  [<ffffffff811bdc06>] wp_page_copy.isra.56+0x166/0x540
Feb 13 18:39:59 dx411-s07 kernel: [1164759.969637]  [<ffffffff8185e3f8>] page_fault+0x28/0x30
Feb 13 18:40:27 dx411-s07 kernel: [1164787.969645] NMI watchdog: BUG: soft lockup - CPU#27 stuck for 23s! [bash:40900]
Feb 13 18:40:27 dx411-s07 kernel: xor raid6_pq ixgbe(O) dca vxlan ip6_udp_tunnel udp_tunnel tg3 ahci ptp libahci pps_core megaraid_sas fjes
Feb 13 18:40:27 dx411-s07 kernel: [1164787.969815] Hardware name: Dell Inc. PowerEdge R420/0K29HN, BIOS 2.4.2 01/29/2015
Feb 13 18:40:27 dx411-s07 kernel: [1164787.969829] RSP: 0000:ffff882883013980  EFLAGS: 00000282
Feb 13 18:40:27 dx411-s07 kernel: [1164787.969837] FS:  00007f8d4ac66700(0000) GS:ffff88301f340000(0000) knlGS:0000000000000000
Feb 13 18:40:27 dx411-s07 kernel: [1164787.969857]  [<ffffffff811fe78f>] ? mem_cgroup_iter+0x1cf/0x380
Feb 13 18:40:27 dx411-s07 kernel: [1164787.969884]  [<ffffffff8118f274>] ? filemap_map_pages+0x224/0x230
Feb 13 18:40:55 dx411-s07 kernel: ip_tables softdog nfsd auth_rpcgss nfs_acl nfs lockd grace fscache sunrpc ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi 8021q garp mrp bonding nfnetlink_log nfnetlink zfs(PO) zunicode(PO) zcommon(PO) znvpair(PO) spl(O)$
Feb 13 18:40:55 dx411-s07 kernel: [1164815.970210] R13: 0000000000001740 R14: ffff88307fffb6f0 R15: ffff88187fffb000
Feb 13 18:40:55 dx411-s07 kernel: [1164815.970220]  ffff8815b46eb800 0000000000000000 ffff88307fffb6c0 0000000000000000
Feb 13 18:40:55 dx411-s07 kernel: [1164815.970242]  [<ffffffff81201fdc>] mem_cgroup_try_charge+0x9c/0x1b0
Feb 13 18:40:55 dx411-s07 kernel: [1164815.970266]  [<ffffffff8185e3f8>] ? page_fault+0x28/0x30

I'm assuming a bug in or around throttle_vm_writeout.
 
Are there still no fixes for this issue? I'm still seeing it even with the latest version.

Code:
# pveversion -v
proxmox-ve: 4.4-79 (running kernel: 4.4.35-2-pve)
pve-manager: 4.4-12 (running version: 4.4-12/e71b7a74)
pve-kernel-4.4.35-2-pve: 4.4.35-79
pve-kernel-4.2.2-1-pve: 4.2.2-16
lvm2: 2.02.116-pve3
corosync-pve: 2.4.2-1
libqb0: 1.0-1
pve-cluster: 4.0-48
qemu-server: 4.0-109
pve-firmware: 1.1-10
libpve-common-perl: 4.0-92
libpve-access-control: 4.0-23
libpve-storage-perl: 4.0-73
pve-libspice-server1: 0.12.8-1
vncterm: 1.3-1
pve-docs: 4.4-3
pve-qemu-kvm: 2.7.1-3
pve-container: 1.0-94
pve-firewall: 2.0-33
pve-ha-manager: 1.0-40
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u3
lxc-pve: 2.0.7-3
lxcfs: 2.0.6-pve1
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.9-pve15~bpo80
 
I've set 500 by default; 3k seems very large for a container (personal opinion).
But I had some containers with more than 150 PIDs that were able to crash the node with the NMI watchdog issue. I've limited 2-3 such containers manually to 150.
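For context, this is one way such a per-container PID cap can be applied; it's only a sketch that assumes the pids cgroup controller is available on the node, and vmid 101 and the cgroup path are just examples rather than the exact settings I used:

Code:
# append a raw LXC cgroup limit to the container config (vmid 101 as an example)
echo "lxc.cgroup.pids.max: 500" >> /etc/pve/lxc/101.conf
# after restarting the container, check how many tasks it is actually running
cat /sys/fs/cgroup/pids/lxc/101/pids.current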

Since I've set it to 500, I've had the NMI watchdog issue happen just once or twice, so it's not the perfect solution, but it does the job for now.
How many containers do you have per node?
 
Very hard.
1. The solution is to stay connected to the node and, when the first nmi_watchdog error appears (usually by KVM), copy the PID from it and check /proc/PID/cgroup to see which container it belongs to before the node dies. It's not foolproof, but in most cases it points at the container that is really causing the issue (see the sketch after this list).
2. Move containers one by one until the node stops crashing.
3. ps -Ao pid,cgroup|grep lxc|cut -d / -f3|cut -d, -f1|sort|uniq -c|sort -n
This might show you any container using too many PIDs.
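Roughly what steps 1 and 3 look like in practice (a sketch; 45363 is just the PID taken from the example lockup message earlier in the thread):

Code:
# step 1: map the PID from the soft lockup line to its container
grep -o 'lxc/[^,/]*' /proc/45363/cgroup | sort -u
# step 3: count processes per container; an unusually high count is suspicious
ps -Ao pid,cgroup | grep lxc | cut -d / -f3 | cut -d, -f1 | sort | uniq -c | sort -n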
 
No solution for LXC. The servers kept going down every day.
 

Attachments

  • cpu-day.png
