NMI watchdog: Watchdog detected hard LOCKUP on cpu

ispirto

I'm getting these kernel dumps after the server has been running KVM guests for a while.

I've replaced all the memory modules, and the motherboard has the latest BIOS. This is a new install, so I don't know whether this only happens on 4.4.35-2 or on previous versions, too.

Can this be a hardware error?

Code:
[ 7234.424652] NMI watchdog: Watchdog detected hard LOCKUP on cpu 6
[ 7234.424669] Modules linked in:
[ 7234.424690]  ebt_ip binfmt_misc ebtable_filter ebtables nfsv3 ip_set ip6table_filter ip6_tables iptable_filter ip_tables x_tables softdog nfsd auth_rpcgss nfs_acl nfs lockd grace fscache sunrpc ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi nfnetlink_log nfnetlink zfs(PO) zunicode(PO) zcommon(PO) znvpair(PO) spl(O) zavl(PO) dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c ipmi_ssif intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd snd_pcm snd_timer snd soundcore pcspkr joydev input_leds sb_edac edac_core i2c_i801 lpc_ich mei_me mei ioatdma ipmi_si ipmi_msghandler 8250_fintek
[ 7234.424737]  shpchp wmi mac_hid vhost_net vhost macvtap macvlan autofs4 btrfs xor raid6_pq hid_generic ixgbe(O) vxlan ip6_udp_tunnel udp_tunnel usbkbd usbmouse usbhid ahci isci libahci hid libsas igb(O) scsi_transport_sas dca ptp pps_core megaraid_sas fjes
[ 7234.424758] CPU: 6 PID: 8900 Comm: kvm Tainted: P           O    4.4.35-2-pve #1
[ 7234.424760] Hardware name: Supermicro X9DRH-7TF/7F/iTF/iF/X9DRH-7TF/7F/iTF/iF, BIOS 3.2a 06/18/2016
[ 7234.424762]  0000000000000086 000000004245fa45 ffff88207fc85b90 ffffffff813f9523
[ 7234.424765]  0000000000000000 0000000000000000 ffff88207fc85ba8 ffffffff8113bfbf
[ 7234.424768]  ffff882038a88000 ffff88207fc85be0 ffffffff81184eb8 0000000000000001
[ 7234.424771] Call Trace:
[ 7234.424772]  <NMI>  [<ffffffff813f9523>] dump_stack+0x63/0x90
[ 7234.424787]  [<ffffffff8113bfbf>] watchdog_overflow_callback+0xbf/0xd0
[ 7234.424791]  [<ffffffff81184eb8>] __perf_event_overflow+0x88/0x1d0
[ 7234.424793]  [<ffffffff81185a84>] perf_event_overflow+0x14/0x20
[ 7234.424797]  [<ffffffff8100c6a1>] intel_pmu_handle_irq+0x1e1/0x490
[ 7234.424803]  [<ffffffff811cee7c>] ? vunmap_page_range+0x20c/0x330
[ 7234.424806]  [<ffffffff811cefb1>] ? unmap_kernel_range_noflush+0x11/0x20
[ 7234.424809]  [<ffffffff814c6dbe>] ? ghes_copy_tofrom_phys+0x11e/0x2a0
[ 7234.424814]  [<ffffffff8105a23b>] ? native_apic_msr_write+0x2b/0x30
[ 7234.424817]  [<ffffffff8105a08d>] ? x2apic_send_IPI_self+0x1d/0x20
[ 7234.424821]  [<ffffffff810058dd>] perf_event_nmi_handler+0x2d/0x50
[ 7234.424825]  [<ffffffff810325d6>] nmi_handle+0x66/0x120
[ 7234.424827]  [<ffffffff81032b40>] default_do_nmi+0x40/0x100
[ 7234.424830]  [<ffffffff81032ce2>] do_nmi+0xe2/0x130
[ 7234.424834]  [<ffffffff8185e751>] end_repeat_nmi+0x1a/0x1e
[ 7234.424838]  [<ffffffff814067e5>] ? delay_tsc+0x25/0x50
[ 7234.424841]  [<ffffffff814067e5>] ? delay_tsc+0x25/0x50
[ 7234.424843]  [<ffffffff814067e5>] ? delay_tsc+0x25/0x50
[ 7234.424845]  <<EOE>>  [<ffffffff814066ff>] __delay+0xf/0x20
[ 7234.424877]  [<ffffffffc056eb2b>] wait_lapic_expire+0x12b/0x130 [kvm]
[ 7234.424892]  [<ffffffffc0552a28>] kvm_arch_vcpu_ioctl_run+0x608/0x1460 [kvm]
[ 7234.424906]  [<ffffffffc054ca0a>] ? kvm_arch_vcpu_load+0x5a/0x220 [kvm]
[ 7234.424918]  [<ffffffffc0539eca>] kvm_vcpu_ioctl+0x31a/0x5e0 [kvm]
[ 7234.424923]  [<ffffffff81222d02>] do_vfs_ioctl+0x2d2/0x4b0
[ 7234.424926]  [<ffffffff8118ae6b>] ? fire_user_return_notifiers+0x3b/0x50
[ 7234.424930]  [<ffffffff81003360>] ? exit_to_usermode_loop+0xb0/0xd0
[ 7234.424932]  [<ffffffff81222f59>] SyS_ioctl+0x79/0x90
[ 7234.424934]  [<ffffffff81003c38>] ? syscall_return_slowpath+0x98/0x110
[ 7234.424937]  [<ffffffff8185c276>] entry_SYSCALL_64_fastpath+0x16/0x75

Code:
~# pveversion -v
proxmox-ve: 4.4-79 (running kernel: 4.4.35-2-pve)
pve-manager: 4.4-12 (running version: 4.4-12/e71b7a74)
pve-kernel-4.4.35-1-pve: 4.4.35-77
pve-kernel-4.4.35-2-pve: 4.4.35-79
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-48
qemu-server: 4.0-108
pve-firmware: 1.1-10
libpve-common-perl: 4.0-91
libpve-access-control: 4.0-23
libpve-storage-perl: 4.0-73
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-docs: 4.4-3
pve-qemu-kvm: 2.7.1-1
pve-container: 1.0-93
pve-firewall: 2.0-33
pve-ha-manager: 1.0-40
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u3
lxc-pve: 2.0.7-1
lxcfs: 2.0.6-pve1
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.8-pve14~bpo80
 
I've updated Proxmox to the new version and the kernel is now 4.4.40-1. I've also narrowed this down. These are happening when I start OS installations on the VMs.
 

Looks like it could be storage related.

What storage do you use for the VMs: filesystem, RAID, ...?
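
If it helps, output along these lines would show the relevant details (just a sketch using the standard PVE tooling; adjust to your setup):

Code:
# storage definitions known to PVE and their current status
cat /etc/pve/storage.cfg
pvesm status

# underlying block devices / software RAID status, if any
lsblk
cat /proc/mdstat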
 
It's worth mentioning that these errors do not happen when I use the 4.9.0-0.bpo.1-rt-amd64 kernel. Hence the thread: https://forum.proxmox.com/threads/risks-of-using-a-jessie-backports-kernel.33197/

Ahh, okay, that makes sense, I didn't recognize the same usernames :)
Could you also try it with https://packages.debian.org/jessie-backports/linux-image-amd64 (the non-realtime version), so we can see whether a bugfix from the newer kernel solves this or whether the preemptive RT version solved it "by accident"?
 

Thanks a lot, Thomas. I'm not sure how I missed the "rt" in the name. I've rebooted with the 4.9.0-0.bpo.1-amd64 kernel and am testing it now.
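
For anyone else wanting to try this, a minimal sketch of how the non-RT backports kernel can be pulled in, assuming the standard jessie-backports repository (adjust the mirror to your setup):

Code:
# add the jessie-backports repository (if not already present)
echo "deb http://ftp.debian.org/debian jessie-backports main" > /etc/apt/sources.list.d/backports.list
apt-get update

# install the non-realtime 4.9 kernel metapackage from backports
apt-get -t jessie-backports install linux-image-amd64

# refresh the boot menu; the pve kernels stay selectable at boot
update-grub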
 
Hi ispirto,
I've been having the same issue for two months now and still haven't found any solution. I'm very curious about your test: do you have any feedback?
Thanks for your help.
 

Installing irqbalance and that new kernel solved it for me for now.
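
In case it helps others, this is roughly what the irqbalance part looks like on the node (assuming the stock Debian package and systemd; the kernel install is shown a few posts above):

Code:
# install irqbalance and make sure it starts on every boot
apt-get install irqbalance
systemctl enable irqbalance
systemctl start irqbalance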
 
I've started to see this on first boot from time to time. Not sure if it's related to the kernel:

Code:
smpboot: CPU1: Not responding
smpboot: do_boot_cpu failed(1) to wakeup CPU#1
 
New findings:

- This only happens when the guest has more than one Virtual CPU.

- This only happens on resets/reboots; if the guest is powered off and then powered back on, it boots normally.

- The issue goes away when I disable ACPI on the guest (see the sketch below).
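
For reference, a quick sketch of one way to disable ACPI, assuming the CLI is used and with VM ID 100 as a placeholder (the same option is also available in the GUI under the VM's Options):

Code:
# turn off ACPI for the guest, then power it off completely and start it again
qm set 100 --acpi 0
qm shutdown 100
qm start 100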

It looks somewhat related: https://bugzilla.redhat.com/show_bug.cgi?id=1278808

Of course, no reply on the bug tracker until the version is EOL :)
 
Hi ispirto,
thanks a lot for your help. I tested your solution but still had this annoying NMI watchdog (I had it even with no container or KVM guest running). I then added the intel-microcode package and the non-free NVIDIA drivers, and it seems to work (24h+ running without a crash). I have also tested with the current default kernel (4.4.35-1-pve) without trouble up to now.
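
For completeness, a rough sketch of the microcode part, assuming non-free is not yet enabled in the apt sources (the NVIDIA driver install is left out since it depends on the card):

Code:
# enable the non-free component (example line for jessie; adapt to your mirror)
echo "deb http://ftp.debian.org/debian jessie main contrib non-free" > /etc/apt/sources.list.d/non-free.list
apt-get update

# install the CPU microcode updates and reboot so they are applied early at boot
apt-get install intel-microcode
reboot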
 
I have been struggling A LOT with these freezes for the past 6 months... Sometimes my server runs for 48 hours, other times 15 minutes... And with no correlation to activity on the Win 10 guest; it could happen after 10 hours idle, or during use...

I sometimes get the console error/lockup, but often it's just a complete freeze with no option other than a full power cycle...

My question: how can you use a non-PVE kernel? I don't yet feel confident upgrading to the 5.0 beta (a bit of a point of no return), so installing another kernel (4.9), with the option to select it during boot, would be sensible for me at this point. But how?

BR, Tony
 
