Kernel panic - not syncing: An NMI occurred

Laris

Dec 14, 2015
Proxmox 4.0

Linux version 4.2.2-1-pve (root@elsa) (gcc version 4.9.2 (Debian 4.9.2-10) ) #1 SMP Mon Oct 5 18:23:31 CEST 2015 ()
[ 0.000000] Command line: BOOT_IMAGE=/ROOT/pve-1@/boot/vmlinuz-4.2.2-1-pve root=ZFS=/ROOT/pve-1 ro boot=zfs root=ZFS=rpool/ROOT/pve-1 boot=zfs console=tty0 console=ttyS0,115200

[ 9500.744813] Kernel panic - not syncing: An NMI occurred. Depending on your system the reason for the NMI is logged in any one of the following resources:
[ 9500.744813] 1. Integrated Management Log (IML)
[ 9500.744813] 2. OA Syslog
[ 9500.744813] 3. OA Forward Progress Log
[ 9500.744813] 4. iLO Event Log
[ 9500.877554] CPU: 0 PID: 486 Comm: z_wr_int_7 Tainted: P O 4.2.2-1-pve #1
[ 9500.915211] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 07/16/2015
[ 9500.949574] 000008a41084f25a ffff8803fa205d68 ffffffff817c92f3 ffff8803fa2112f8
[ 9500.985652] ffffffffc05472d8 ffff8803fa205de8 ffffffff817c6da4 0000000000000000
[ 9501.021258] 0000000000000008 ffff8803fa205df8 ffff8803fa205d98 0000000000000000
[ 9501.057138] Call Trace:
[ 9501.069042] <NMI> [<ffffffff817c92f3>] dump_stack+0x45/0x57
[ 9501.096927] [<ffffffff817c6da4>] panic+0xc1/0x1fe
[ 9501.119976] [<ffffffffc0546948>] hpwdt_pretimeout+0xc8/0xc8 [hpwdt]
[ 9501.150795] [<ffffffff8101d2a9>] ? sched_clock+0x9/0x10
[ 9501.176646] [<ffffffff81017cc9>] nmi_handle+0x79/0x100
[ 9501.201817] [<ffffffff81017cd3>] ? nmi_handle+0x83/0x100
[ 9501.227838] [<ffffffff8101815e>] io_check_error+0x1e/0x90
[ 9501.254256] [<ffffffff81018256>] default_do_nmi+0x86/0x100
[ 9501.280868] [<ffffffff810183ba>] do_nmi+0xea/0x140
[ 9501.304401] [<ffffffff817d2054>] end_repeat_nmi+0x1a/0x1e
[ 9501.330748] [<ffffffff811b2644>] ? page_referenced+0x84/0x120
[ 9501.358696] [<ffffffff811b2644>] ? page_referenced+0x84/0x120
[ 9501.386655] [<ffffffff811b2644>] ? page_referenced+0x84/0x120
[ 9501.414219] <<EOE>> [<ffffffff811b0f90>] ? __page_check_address+0x1a0/0x1a0
[ 9501.448796] [<ffffffff811b1b30>] ? page_get_anon_vma+0x80/0x80
[ 9501.477257] [<ffffffff8118ba20>] shrink_page_list+0x5c0/0x710
[ 9501.505413] [<ffffffff8118c203>] shrink_inactive_list+0x273/0x520
[ 9501.535396] [<ffffffff8118ce4d>] shrink_lruvec+0x5dd/0x7b0
[ 9501.562354] [<ffffffff8118d0fc>] shrink_zone+0xdc/0x290
[ 9501.587796] [<ffffffff8118d413>] do_try_to_free_pages+0x163/0x430
[ 9501.617409] [<ffffffff8118d79a>] try_to_free_pages+0xba/0x130
[ 9501.645406] [<ffffffff8118000e>] __alloc_pages_nodemask+0x60e/0xa30
[ 9501.676567] [<ffffffff813bbb08>] ? __sg_alloc_table+0x78/0x160
[ 9501.705577] [<ffffffff811c51ca>] alloc_pages_current+0x9a/0x110
[ 9501.734615] [<ffffffff811cf896>] new_slab+0x366/0x450
[ 9501.759577] [<ffffffff811cfd50>] __slab_alloc+0x3d0/0x4c0
[ 9501.786627] [<ffffffffc0104599>] ? spl_kmem_cache_alloc+0x69/0x800 [spl]
[ 9501.820682] [<ffffffff8155a44f>] ? scsi_request_fn+0x3f/0x640
[ 9501.849860] [<ffffffffc0104599>] ? spl_kmem_cache_alloc+0x69/0x800 [spl]
[ 9501.885389] [<ffffffff811cffe7>] kmem_cache_alloc+0x1a7/0x200
[ 9501.914099] [<ffffffffc0104599>] spl_kmem_cache_alloc+0x69/0x800 [spl]
[ 9501.945625] [<ffffffff8137bd9c>] ? generic_make_request+0xcc/0x110
[ 9501.978244] [<ffffffff811cf05c>] ? __slab_free+0x14c/0x290
[ 9502.006295] [<ffffffffc027a822>] zio_create+0x42/0x470 [zfs]
[ 9502.035283] [<ffffffffc027badf>] zio_vdev_delegated_io+0x6f/0x80 [zfs]
[ 9502.068338] [<ffffffffc023f7a0>] ? vdev_queue_timestamp_compare+0x40/0x40 [zfs]
[ 9502.105047] [<ffffffffc023fefd>] vdev_queue_io_to_issue+0x61d/0x8a0 [zfs]
[ 9502.138805] [<ffffffffc023f7a0>] ? vdev_queue_timestamp_compare+0x40/0x40 [zfs]
[ 9502.173454] [<ffffffffc0240693>] vdev_queue_io_done+0x193/0x250 [zfs]
[ 9502.205728] [<ffffffffc0278808>] zio_vdev_io_done+0x88/0x180 [zfs]
[ 9502.236441] [<ffffffffc0279aee>] zio_execute+0xde/0x190 [zfs]
[ 9502.264533] [<ffffffffc0106070>] taskq_thread+0x230/0x420 [spl]
[ 9502.293363] [<ffffffff810a0570>] ? wake_up_q+0x70/0x70
[ 9502.318618] [<ffffffffc0105e40>] ? taskq_cancel_id+0x110/0x110 [spl]
[ 9502.349695] [<ffffffff810957db>] kthread+0xdb/0x100
[ 9502.373579] [<ffffffff81095700>] ? kthread_create_on_node+0x1c0/0x1c0
[ 9502.406001] [<ffffffff817d019f>] ret_from_fork+0x3f/0x70
[ 9502.432028] [<ffffffff81095700>] ? kthread_create_on_node+0x1c0/0x1c0
[ 9502.463507] Kernel Offset: disabled
[ 9502.484204] ERST: [Firmware Warn]: Firmware does not respond in time.
[ 9502.520039] ERST: [Firmware Warn]: Firmware does not respond in time.
[ 9502.551205] ---[ end Kernel panic - not syncing: An NMI occurred. Depending on your system the reason for the NMI is logged in any one of the following resources:
[ 9502.551205] 1. Integrated Management Log (IML)
[ 9502.551205] 2. OA Syslog
[ 9502.551205] 3. OA Forward Progress Log
[ 9502.551205] 4. iLO Event Log
[ 9502.690480] ------------[ cut here ]------------
 
[ 9502.715512] WARNING: CPU: 0 PID: 486 at arch/x86/kernel/smp.c:124 native_smp_send_reschedule+0x60/0x70()
[ 9502.762874] Modules linked in: nls_utf8 hfsplus ip_set ip6table_filter ip6_tables iptable_filter ip_tables x_tables nfsd auth_rpcgss nfs_acl nfs lockd grace fscache sunrpc ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi nfnetlink_log nfnetlink intel_rapl iosf_mbi ipmi_ssif x86_pkg_temp_thermal intel_powerclamp coretemp gpio_ich kvm_intel snd_pcm iTCO_wdt kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel hpilo aesni_intel aes_x86_64 snd_timer lrw iTCO_vendor_support hpwdt gf128mul snd joydev input_leds glue_helper ablk_helper cryptd soundcore lpc_ich shpchp ipmi_si ipmi_msghandler psmouse ie31200_edac edac_core 8250_fintek serio_raw acpi_power_meter pcspkr mac_hid vhost_net vhost macvtap macvlan autofs4 zfs(PO) zunicode(PO) zcommon(PO) znvpair(PO) spl(O) zavl(PO) uas usb_storage hid_generic usbmouse usbkbd ahci usbhid tg3 libahci ptp hid pps_core mpt2sas raid_class scsi_transport_sas
[ 9503.204769] CPU: 0 PID: 486 Comm: z_wr_int_7 Tainted: P O 4.2.2-1-pve #1
[ 9503.248200] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 07/16/2015
[ 9503.285577] ffffffff81a92439 ffff8803fa203d88 ffffffff817c92f3 0000000000000007
[ 9503.325563] 0000000000000000 ffff8803fa203dc8 ffffffff8107776a 0000000000000000
[ 9503.364325] 0000000000000001 ffff8803fa256a00 0000000000000000 ffff8800ed849b80
[ 9503.405360] Call Trace:
[ 9503.418229] <IRQ> [<ffffffff817c92f3>] dump_stack+0x45/0x57
[ 9503.451553] [<ffffffff8107776a>] warn_slowpath_common+0x8a/0xc0
[ 9503.482940] [<ffffffff8107785a>] warn_slowpath_null+0x1a/0x20
[ 9503.515209] [<ffffffff81049ab0>] native_smp_send_reschedule+0x60/0x70
[ 9503.549871] [<ffffffff810b239b>] trigger_load_balance+0x13b/0x230
[ 9503.582296] [<ffffffff810a1926>] scheduler_tick+0xa6/0xd0
[ 9503.611711] [<ffffffff810ef9b0>] ? tick_sched_do_timer+0x30/0x30
[ 9503.645166] [<ffffffff810e0921>] update_process_times+0x51/0x60
[ 9503.677153] [<ffffffff810ef3d5>] tick_sched_handle.isra.15+0x25/0x60
[ 9503.711070] [<ffffffff810ef9f4>] tick_sched_timer+0x44/0x80
[ 9503.741571] [<ffffffff810e1464>] __hrtimer_run_queues+0xe4/0x200
[ 9503.774272] [<ffffffff810e1898>] hrtimer_interrupt+0xa8/0x1a0
[ 9503.803259] [<ffffffff8106ab60>] ? do_flush_tlb_all+0x40/0x40
[ 9503.832607] [<ffffffff8104c36c>] local_apic_timer_interrupt+0x3c/0x70
[ 9503.866396] [<ffffffff817d2a41>] smp_apic_timer_interrupt+0x41/0x60
[ 9503.898991] [<ffffffff817d0bdb>] apic_timer_interrupt+0x6b/0x70
[ 9503.928193] <EOI> <NMI> [<ffffffff817c6ea0>] ? panic+0x1bd/0x1fe
[ 9503.959385] [<ffffffff817c6e99>] ? panic+0x1b6/0x1fe
[ 9503.984463] [<ffffffffc0546948>] hpwdt_pretimeout+0xc8/0xc8 [hpwdt]
[ 9504.015315] [<ffffffff8101d2a9>] ? sched_clock+0x9/0x10
[ 9504.045511] [<ffffffff81017cc9>] nmi_handle+0x79/0x100
[ 9504.074394] [<ffffffff81017cd3>] ? nmi_handle+0x83/0x100
[ 9504.103562] [<ffffffff8101815e>] io_check_error+0x1e/0x90
[ 9504.132425] [<ffffffff81018256>] default_do_nmi+0x86/0x100
[ 9504.161966] [<ffffffff810183ba>] do_nmi+0xea/0x140
[ 9504.187954] [<ffffffff817d2054>] end_repeat_nmi+0x1a/0x1e
[ 9504.214527] [<ffffffff811b2644>] ? page_referenced+0x84/0x120
[ 9504.246282] [<ffffffff811b2644>] ? page_referenced+0x84/0x120
[ 9504.276464] [<ffffffff811b2644>] ? page_referenced+0x84/0x120
[ 9504.307623] <<EOE>> [<ffffffff811b0f90>] ? __page_check_address+0x1a0/0x1a0
[ 9504.344259] [<ffffffff811b1b30>] ? page_get_anon_vma+0x80/0x80
[ 9504.374009] [<ffffffff8118ba20>] shrink_page_list+0x5c0/0x710
[ 9504.406375] [<ffffffff8118c203>] shrink_inactive_list+0x273/0x520
[ 9504.438788] [<ffffffff8118ce4d>] shrink_lruvec+0x5dd/0x7b0
[ 9504.468542] [<ffffffff8118d0fc>] shrink_zone+0xdc/0x290
[ 9504.496434] [<ffffffff8118d413>] do_try_to_free_pages+0x163/0x430
[ 9504.528425] [<ffffffff8118d79a>] try_to_free_pages+0xba/0x130
[ 9504.557808] [<ffffffff8118000e>] __alloc_pages_nodemask+0x60e/0xa30
[ 9504.589173] [<ffffffff813bbb08>] ? __sg_alloc_table+0x78/0x160
[ 9504.618014] [<ffffffff811c51ca>] alloc_pages_current+0x9a/0x110
[ 9504.646901] [<ffffffff811cf896>] new_slab+0x366/0x450
[ 9504.671957] [<ffffffff811cfd50>] __slab_alloc+0x3d0/0x4c0
[ 9504.698788] [<ffffffffc0104599>] ? spl_kmem_cache_alloc+0x69/0x800 [spl]
[ 9504.731536] [<ffffffff8155a44f>] ? scsi_request_fn+0x3f/0x640
[ 9504.760039] [<ffffffffc0104599>] ? spl_kmem_cache_alloc+0x69/0x800 [spl]
[ 9504.798720] [<ffffffff811cffe7>] kmem_cache_alloc+0x1a7/0x200
[ 9504.827838] [<ffffffffc0104599>] spl_kmem_cache_alloc+0x69/0x800 [spl]
[ 9504.865036] [<ffffffff8137bd9c>] ? generic_make_request+0xcc/0x110
[ 9504.897370] [<ffffffff811cf05c>] ? __slab_free+0x14c/0x290
[ 9504.925321] [<ffffffffc027a822>] zio_create+0x42/0x470 [zfs]
[ 9504.956180] [<ffffffffc027badf>] zio_vdev_delegated_io+0x6f/0x80 [zfs]
[ 9504.992215] [<ffffffffc023f7a0>] ? vdev_queue_timestamp_compare+0x40/0x40 [zfs]
[ 9505.029963] [<ffffffffc023fefd>] vdev_queue_io_to_issue+0x61d/0x8a0 [zfs]
[ 9505.065161] [<ffffffffc023f7a0>] ? vdev_queue_timestamp_compare+0x40/0x40 [zfs]
[ 9505.103297] [<ffffffffc0240693>] vdev_queue_io_done+0x193/0x250 [zfs]
[ 9505.136628] [<ffffffffc0278808>] zio_vdev_io_done+0x88/0x180 [zfs]
[ 9505.169031] [<ffffffffc0279aee>] zio_execute+0xde/0x190 [zfs]
[ 9505.200111] [<ffffffffc0106070>] taskq_thread+0x230/0x420 [spl]
[ 9505.235515] [<ffffffff810a0570>] ? wake_up_q+0x70/0x70
[ 9505.264688] [<ffffffffc0105e40>] ? taskq_cancel_id+0x110/0x110 [spl]
[ 9505.298158] [<ffffffff810957db>] kthread+0xdb/0x100
[ 9505.325375] [<ffffffff81095700>] ? kthread_create_on_node+0x1c0/0x1c0
[ 9505.359976] [<ffffffff817d019f>] ret_from_fork+0x3f/0x70
[ 9505.387250] [<ffffffff81095700>] ? kthread_create_on_node+0x1c0/0x1c0
[ 9505.421347] ---[ end trace fa0fa2ed71056d9b ]---
 
I blacklisted hpwdt last week, but when I patched to 4.1 a few days ago by running 'apt-get dist-upgrade', it also pulled in a new 4.2.6 kernel alongside the new PVE packages. While the server was updating grub (which seems to take a long time, i.e. minutes, possibly because it probes all our iSCSI LUNs), it got shot by the SW watchdog with an NMI :(
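
Since the trace above still shows hpwdt_pretimeout, it may be worth confirming that the blacklist entry is in place and that the module is really not loaded. A minimal sketch, assuming the usual Debian locations (/etc/modprobe.d and /proc/modules); the function names module_blacklisted and module_loaded are made up for illustration:

```shell
#!/bin/sh
# Sketch: compare a module blacklist against what is actually loaded.
# Both functions take the file location as an optional second argument so
# they can be pointed at test fixtures; the defaults are assumptions about
# a standard Debian layout.

# Succeeds if any *.conf file in the directory has a "blacklist <module>" line.
module_blacklisted() {
    module="$1"
    conf_dir="${2:-/etc/modprobe.d}"
    grep -qsE "^[[:space:]]*blacklist[[:space:]]+${module}[[:space:]]*\$" "$conf_dir"/*.conf
}

# Succeeds if the module is listed in /proc/modules, i.e. currently loaded.
module_loaded() {
    module="$1"
    modules_file="${2:-/proc/modules}"
    grep -qs "^${module} " "$modules_file"
}
```

For example, 'module_blacklisted hpwdt && module_loaded hpwdt' succeeding would mean the blacklist entry exists but the module is loaded anyway, for instance because it was loaded before the entry took effect.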

Before patching another server, I tried disabling the SW watchdog with 'systemctl stop watchdog-mux.service', but that server also got an NMI during patching.

What is the BCP to avoid getting an NMI during [longer duration] maintenance?
 
watchdog-mux.service is the service that resets the watchdog timer, so if you stop it, you'll get a reboot.

I think this could be improved by temporarily increasing the timeout, or by disabling the hardware watchdog's reboot, when watchdog-mux is stopped.

I don't know if there are any tools to set up the watchdog device manually.

When you did your dist-upgrade, was watchdog-mux stopped while grub was updating?

Also, I don't know much about HP servers, but on Dell servers the Dell management software also makes some bad changes to the watchdog device timeout (by writing to /dev/watchdog).
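
To rule out a stopped watchdog-mux before starting a long job, something like the following could be used. It is only a sketch: the unit name is the one mentioned in this thread, run_if_mux_active is a hypothetical helper, and it assumes watchdog-mux is expected to be running on the node in the first place.

```shell
#!/bin/sh
# Sketch: refuse to start a maintenance command unless watchdog-mux is active.
# run_if_mux_active is an illustrative helper, not part of Proxmox.
run_if_mux_active() {
    if systemctl is-active --quiet watchdog-mux.service; then
        # Service is running and resetting the timer: run the given command.
        "$@"
    else
        echo "watchdog-mux.service is not active, not running: $*" >&2
        return 1
    fi
}
```

Usage would be along the lines of 'run_if_mux_active apt-get dist-upgrade'. Note that this only checks the state at the start; it does not notice if the service is stopped later during the run.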
 
As a workaround, you could try running

"while true; do echo V > /dev/watchdog; sleep 1; done"

in a shell while you are doing the upgrade.
It should prevent a reboot if watchdog-mux is stopped.

(But you need to reboot afterwards; I'm not sure the watchdog will work once watchdog-mux starts again.)
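
A slightly more careful version of the loop above, as a sketch only. WATCHDOG_DEV, PET_INTERVAL and pet_watchdog are names made up for this example. Writing 'V' relies on the kernel's "magic close" convention, where a driver that supports it disarms the timer when the device is closed right after a 'V'; whether the driver on a given box honours that (it does not if it runs with nowayout) is an assumption to verify first.

```shell
#!/bin/sh
# Sketch: keep a watchdog device quiet while maintenance runs.
# WATCHDOG_DEV and PET_INTERVAL are illustrative names, not Proxmox settings.
WATCHDOG_DEV="${WATCHDOG_DEV:-/dev/watchdog}"
PET_INTERVAL="${PET_INTERVAL:-1}"

# Write 'V' to the device <count> times, pausing PET_INTERVAL seconds after
# each write. Every redirection opens, writes to and closes the device.
# Stops with a non-zero status as soon as the device cannot be opened
# (for example because another process is holding it).
pet_watchdog() {
    count="$1"
    i=0
    while [ "$i" -lt "$count" ]; do
        printf 'V' > "$WATCHDOG_DEV" || return 1
        i=$((i + 1))
        sleep "$PET_INTERVAL"
    done
}
```

It could be started in the background with 'pet_watchdog 600 &' before the upgrade and killed afterwards. Bounding the number of writes means a forgotten loop ends on its own instead of masking the watchdog forever.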
 
watchdog-mux.service is the service that resets the watchdog timer, so if you stop it, you'll get a reboot.

When you did your dist-upgrade, was watchdog-mux stopped while grub was updating?
Right, okay, I thought the mux was the server side of the WD device :)
I assume dist-upgrade took down the WD mux before patching and held it down while rebuilding grub, only that took longer than the 60 sec the WD is configured for by default, hence the NMI. In future I'll run the kernel patching separately to avoid the risk of holding down the WD while grub is rebuilt.
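
One way to test the "grub rebuild takes longer than the WD timeout" theory is to time the rebuild on its own. A rough sketch; elapsed_seconds is a made-up helper, and the 60 sec figure is just the value quoted above, not something read from the device:

```shell
#!/bin/sh
# Sketch: run a command and print how many whole seconds it took.
# The command's own output is discarded so only the duration is printed.
elapsed_seconds() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}
```

For example, 'elapsed_seconds update-grub' run in a quiet moment (with watchdog-mux left running) would show whether the rebuild really takes more than 60 seconds on this box.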
 
I have talked with Dietmar: the upgrade process should never stop the watchdog-mux service.

(But if you shut it down manually, yes, it'll trigger a reboot.)
I only tried stopping the WD, mistaking the mux for the server side of the WD device, because the first box got an NMI whilst rebuilding the grub config. Two-step patching in future should spare me NMIs, I believe.
 
