NMI watchdog: Watchdog detected hard LOCKUP on cpu

ispirto

I'm getting these kernel dumps after the server has been running KVM guests for a while.

I've replaced all the memory modules, and the motherboard has the latest BIOS. This is a new install, so I don't know whether this only happens on 4.4.35-2 or on previous versions, too.

Can this be a hardware error?

Code:
[ 7234.424652] NMI watchdog: Watchdog detected hard LOCKUP on cpu 6
[ 7234.424669] Modules linked in:
[ 7234.424690]  ebt_ip binfmt_misc ebtable_filter ebtables nfsv3 ip_set ip6table_filter ip6_tables iptable_filter ip_tables x_tables softdog nfsd auth_rpcgss nfs_acl nfs lockd grace fscache sunrpc ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi nfnetlink_log nfnetlink zfs(PO) zunicode(PO) zcommon(PO) znvpair(PO) spl(O) zavl(PO) dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c ipmi_ssif intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd snd_pcm snd_timer snd soundcore pcspkr joydev input_leds sb_edac edac_core i2c_i801 lpc_ich mei_me mei ioatdma ipmi_si ipmi_msghandler 8250_fintek
[ 7234.424737]  shpchp wmi mac_hid vhost_net vhost macvtap macvlan autofs4 btrfs xor raid6_pq hid_generic ixgbe(O) vxlan ip6_udp_tunnel udp_tunnel usbkbd usbmouse usbhid ahci isci libahci hid libsas igb(O) scsi_transport_sas dca ptp pps_core megaraid_sas fjes
[ 7234.424758] CPU: 6 PID: 8900 Comm: kvm Tainted: P           O    4.4.35-2-pve #1
[ 7234.424760] Hardware name: Supermicro X9DRH-7TF/7F/iTF/iF/X9DRH-7TF/7F/iTF/iF, BIOS 3.2a 06/18/2016
[ 7234.424762]  0000000000000086 000000004245fa45 ffff88207fc85b90 ffffffff813f9523
[ 7234.424765]  0000000000000000 0000000000000000 ffff88207fc85ba8 ffffffff8113bfbf
[ 7234.424768]  ffff882038a88000 ffff88207fc85be0 ffffffff81184eb8 0000000000000001
[ 7234.424771] Call Trace:
[ 7234.424772]  <NMI>  [<ffffffff813f9523>] dump_stack+0x63/0x90
[ 7234.424787]  [<ffffffff8113bfbf>] watchdog_overflow_callback+0xbf/0xd0
[ 7234.424791]  [<ffffffff81184eb8>] __perf_event_overflow+0x88/0x1d0
[ 7234.424793]  [<ffffffff81185a84>] perf_event_overflow+0x14/0x20
[ 7234.424797]  [<ffffffff8100c6a1>] intel_pmu_handle_irq+0x1e1/0x490
[ 7234.424803]  [<ffffffff811cee7c>] ? vunmap_page_range+0x20c/0x330
[ 7234.424806]  [<ffffffff811cefb1>] ? unmap_kernel_range_noflush+0x11/0x20
[ 7234.424809]  [<ffffffff814c6dbe>] ? ghes_copy_tofrom_phys+0x11e/0x2a0
[ 7234.424814]  [<ffffffff8105a23b>] ? native_apic_msr_write+0x2b/0x30
[ 7234.424817]  [<ffffffff8105a08d>] ? x2apic_send_IPI_self+0x1d/0x20
[ 7234.424821]  [<ffffffff810058dd>] perf_event_nmi_handler+0x2d/0x50
[ 7234.424825]  [<ffffffff810325d6>] nmi_handle+0x66/0x120
[ 7234.424827]  [<ffffffff81032b40>] default_do_nmi+0x40/0x100
[ 7234.424830]  [<ffffffff81032ce2>] do_nmi+0xe2/0x130
[ 7234.424834]  [<ffffffff8185e751>] end_repeat_nmi+0x1a/0x1e
[ 7234.424838]  [<ffffffff814067e5>] ? delay_tsc+0x25/0x50
[ 7234.424841]  [<ffffffff814067e5>] ? delay_tsc+0x25/0x50
[ 7234.424843]  [<ffffffff814067e5>] ? delay_tsc+0x25/0x50
[ 7234.424845]  <<EOE>>  [<ffffffff814066ff>] __delay+0xf/0x20
[ 7234.424877]  [<ffffffffc056eb2b>] wait_lapic_expire+0x12b/0x130 [kvm]
[ 7234.424892]  [<ffffffffc0552a28>] kvm_arch_vcpu_ioctl_run+0x608/0x1460 [kvm]
[ 7234.424906]  [<ffffffffc054ca0a>] ? kvm_arch_vcpu_load+0x5a/0x220 [kvm]
[ 7234.424918]  [<ffffffffc0539eca>] kvm_vcpu_ioctl+0x31a/0x5e0 [kvm]
[ 7234.424923]  [<ffffffff81222d02>] do_vfs_ioctl+0x2d2/0x4b0
[ 7234.424926]  [<ffffffff8118ae6b>] ? fire_user_return_notifiers+0x3b/0x50
[ 7234.424930]  [<ffffffff81003360>] ? exit_to_usermode_loop+0xb0/0xd0
[ 7234.424932]  [<ffffffff81222f59>] SyS_ioctl+0x79/0x90
[ 7234.424934]  [<ffffffff81003c38>] ? syscall_return_slowpath+0x98/0x110
[ 7234.424937]  [<ffffffff8185c276>] entry_SYSCALL_64_fastpath+0x16/0x75

Code:
~# pveversion -v
proxmox-ve: 4.4-79 (running kernel: 4.4.35-2-pve)
pve-manager: 4.4-12 (running version: 4.4-12/e71b7a74)
pve-kernel-4.4.35-1-pve: 4.4.35-77
pve-kernel-4.4.35-2-pve: 4.4.35-79
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-48
qemu-server: 4.0-108
pve-firmware: 1.1-10
libpve-common-perl: 4.0-91
libpve-access-control: 4.0-23
libpve-storage-perl: 4.0-73
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-docs: 4.4-3
pve-qemu-kvm: 2.7.1-1
pve-container: 1.0-93
pve-firewall: 2.0-33
pve-ha-manager: 1.0-40
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u3
lxc-pve: 2.0.7-1
lxcfs: 2.0.6-pve1
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.8-pve14~bpo80
 
I've updated Proxmox to the new version and the kernel is now 4.4.40-1. I've also narrowed this down. These are happening when I start OS installations on the VMs.
 

Looks like it could be storage related.

What storage do you use for the VMs: filesystem, RAID, ...?
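
If it helps, output along these lines would show the relevant details (just a sketch using the standard PVE tooling; adjust to your setup):

Code:
# storage definitions known to PVE and their current status
cat /etc/pve/storage.cfg
pvesm status

# underlying block devices / software RAID status, if any
lsblk
cat /proc/mdstat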
 
It's worth mentioning that these errors do not happen when I use the 4.9.0-0.bpo.1-rt-amd64 kernel. Hence the thread: https://forum.proxmox.com/threads/risks-of-using-a-jessie-backports-kernel.33197/

Ahh, okay, that makes sense, I didn't recognize the same usernames :)
Could you also try it with https://packages.debian.org/jessie-backports/linux-image-amd64 (the non-realtime version), so we can see whether a bugfix from the newer kernel solves this or whether the preemptive RT version solved it "by accident"?
 

Thanks a lot, Thomas. I'm not sure how I missed the "rt" in the name. I've rebooted with the 4.9.0-0.bpo.1-amd64 kernel and am testing it now.
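
For anyone else wanting to try this, a minimal sketch of how the non-RT backports kernel can be pulled in, assuming the standard jessie-backports repository (adjust the mirror to your setup):

Code:
# add the jessie-backports repository (if not already present)
echo "deb http://ftp.debian.org/debian jessie-backports main" > /etc/apt/sources.list.d/backports.list
apt-get update

# install the non-realtime 4.9 kernel metapackage from backports
apt-get -t jessie-backports install linux-image-amd64

# refresh the boot menu; the pve kernels stay selectable at boot
update-grub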
 
Hi ispirto,
I've been having the same issue for two months now and still haven't found any solution. I'm very curious about your test: do you have any feedback?
Thanks for your help.
 

Installing irqbalance and that new kernel solved it for me for now.
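
In case it helps others, this is roughly what the irqbalance part looks like on the node (assuming the stock Debian package and systemd; the kernel install is shown a few posts above):

Code:
# install irqbalance and make sure it starts on every boot
apt-get install irqbalance
systemctl enable irqbalance
systemctl start irqbalance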
 
I've started to see this on first boot from time to time. Not sure if it's related to the kernel:

Code:
smpboot: CPU1: Not responding
smpboot: do_boot_cpu failed(1) to wakeup CPU#1
 
New findings:

- This only happens when the guest has more than one Virtual CPU.

- This only happens on resets/reboots; if the guest is powered off and then powered back on, it boots normally.

- The issue goes away when I disable ACPI on the guest (see the sketch below).
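
For reference, a quick sketch of one way to disable ACPI, assuming the CLI is used and with VM ID 100 as a placeholder (the same option is also available in the GUI under the VM's Options):

Code:
# turn off ACPI for the guest, then power it off completely and start it again
qm set 100 --acpi 0
qm shutdown 100
qm start 100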

It looks somewhat related: https://bugzilla.redhat.com/show_bug.cgi?id=1278808

Of course, no reply on the bug tracker until the version is EOL :)
 
Hi ispirto,
thanks a lot for your help. I tested your solution but still had this annoying NMI watchdog (I had it even with no container or KVM guest running). I then added the intel-microcode package and the non-free NVIDIA drivers, and it seems to work (24h+ running without a crash). I have also tested with the current default kernel (4.4.35-1-pve) without trouble up to now.
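
For completeness, a rough sketch of the microcode part, assuming non-free is not yet enabled in the apt sources (the NVIDIA driver install is left out since it depends on the card):

Code:
# enable the non-free component (example line for jessie; adapt to your mirror)
echo "deb http://ftp.debian.org/debian jessie main contrib non-free" > /etc/apt/sources.list.d/non-free.list
apt-get update

# install the CPU microcode updates and reboot so they are applied early at boot
apt-get install intel-microcode
reboot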
 
I have been struggling A LOT with these freezes for the past 6 months... Sometimes my server runs for 48 hours, other times 15 minutes... And with no correlation to activity on the Win 10 guest; it could happen after 10 hours idle, or during use...

I sometimes get the console error/lockup, but often it's just a complete freeze with no option other than a full power cycle...

My question: how can you use a non-PVE kernel? I don't yet feel confident upgrading to the 5.0 beta (a bit of a point of no return), so installing another kernel (4.9), with the option to select it during boot, would be sensible for me at this point. But how?

BR, Tony
 
