[SOLVED] Uhhuh. NMI received for unknown reason (on AMD EPYC)

Alibek

Well-Known Member
Jan 13, 2017
Error:
Code:
[Tue Nov 13 14:35:35 2018] Uhhuh. NMI received for unknown reason 21 on CPU 84.
[Tue Nov 13 14:35:35 2018] Do you have a strange power saving mode enabled?
[Tue Nov 13 14:35:35 2018] Dazed and confused, but trying to continue

Hardware:
CPU 2x AMD EPYC 7601 on Supermicro H11DST-B (Version: 1.01) 2123BT-HNC0R
with BIOS Version: 1.1a (Release Date: 10/04/2018)
RAM 2TiB 2666 MHz


Software:
OS: Linux 4.15.18-7-pve #1 SMP PVE 4.15.18-27 (Wed, 10 Oct 2018 10:50:11 +0200) x86_64 GNU/Linux (Debian GNU/Linux 9.5 (stretch))


How to reproduce:
Code:
# apt install linux-tools-4.15
# dpkg -S $(which perf)
linux-base: /usr/bin/perf
# dmesg -T | tail -f
Then, in another console, run:
Code:
# perf top
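
To confirm that the messages really appear while `perf top` is running, it can help to count them before and after. This is only a rough sketch: the function names `count_unknown_nmi` and `total_nmi` are made up for illustration, and it assumes the usual `NMI:` row is present in `/proc/interrupts`.

```shell
# Count "NMI received for unknown reason" lines in kernel log text read from stdin.
# grep -c exits non-zero when nothing matches, so "|| true" keeps the status clean.
count_unknown_nmi() {
    grep -c 'NMI received for unknown reason' || true
}

# Sum the per-CPU NMI counters from an interrupts table
# (defaults to /proc/interrupts; a path can be passed for testing).
total_nmi() {
    awk '$1 == "NMI:" { for (i = 2; i <= NF; i++) if ($i ~ /^[0-9]+$/) s += $i }
         END { print s + 0 }' "${1:-/proc/interrupts}"
}

# Usage on a live system:
#   dmesg | count_unknown_nmi
#   total_nmi
```

Running both once before starting `perf top` and once a minute later should show whether the unknown-NMI count grows together with the NMI total.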


I tried to disable nmi_watchdog:
Code:
# cat /etc/modprobe.d/nmi-watchdog-blacklist.conf
blacklist iTCO_wdt
blacklist iTCO_vendor_support

Code:
# grep 'Command line' /var/log/kern.log
Nov 14 19:13:19 host1 kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.15.18-7-pve root=UUID=dfa0b70c-f4e7-4fc3-85ef-e6ddbb288091 ro quiet pcie_aspm=off
Nov 14 19:30:49 host1 kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.15.18-7-pve root=UUID=dfa0b70c-f4e7-4fc3-85ef-e6ddbb288091 ro quiet pcie_aspm=off nmi_watchdog=0
Nov 14 19:56:51 host1 kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.15.18-7-pve root=UUID=dfa0b70c-f4e7-4fc3-85ef-e6ddbb288091 ro quiet nmi_watchdog=0 pcie_aspm=off
Nov 14 20:38:15 host1 kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.15.18-7-pve root=UUID=dfa0b70c-f4e7-4fc3-85ef-e6ddbb288091 ro quiet nmi_watchdog=0 pcie_aspm=off idle=nomwait
Nov 14 21:10:49 host1 kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.15.18-7-pve root=UUID=dfa0b70c-f4e7-4fc3-85ef-e6ddbb288091 ro quiet nmi_watchdog=0 pcie_aspm=off idle=nomwait
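
To double-check that a boot parameter is present on the running kernel (rather than reading old log entries), something like the following could be used. The helper name `cmdline_has` is hypothetical; `/proc/cmdline` and `/proc/sys/kernel/nmi_watchdog` are the standard locations.

```shell
# Return success if the kernel command line ($2) contains the exact parameter ($1).
cmdline_has() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage on a live system:
#   cmdline_has nmi_watchdog=0 "$(cat /proc/cmdline)" && echo "parameter present"
#   cat /proc/sys/kernel/nmi_watchdog   # 0 means the NMI watchdog is off
```

Note that `nmi_watchdog=0` only turns off the kernel's NMI watchdog. `perf` still uses NMIs for sampling on x86, which would explain why the message can keep appearing under `perf top`.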


I tried changing the governor (ondemand to performance):
Code:
# for c in {0..127}; do cpufreq-set -g performance -c $c; done
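
To verify that the governor change applied to every CPU, the sysfs entries can be summarized. This is a small sketch assuming the cpufreq sysfs interface is available; `governor_summary` is just an illustrative name.

```shell
# Print each distinct scaling governor with the number of CPUs using it.
# The base directory defaults to the real sysfs path; pass another one for testing.
governor_summary() {
    base="${1:-/sys/devices/system/cpu}"
    cat "$base"/cpu[0-9]*/cpufreq/scaling_governor 2>/dev/null | sort | uniq -c
}

# Usage on a live system:
#   governor_summary
```

If the loop above worked on all 128 threads, the output should be a single line for `performance`.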


But the error is still present (on all 4 nodes of the server platform).

Solution:
Disable C-States in BIOS
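
After changing the BIOS setting, the idle states the kernel still sees can be listed from sysfs. This is a sketch that assumes the cpuidle sysfs interface is exposed for cpu0; `list_cstates` is an illustrative name, and which states remain depends on the firmware and kernel.

```shell
# List the idle states exposed for one CPU as "stateN NAME" lines.
# The directory defaults to cpu0's cpuidle path; pass another one for testing.
list_cstates() {
    dir="${1:-/sys/devices/system/cpu/cpu0/cpuidle}"
    for s in "$dir"/state[0-9]*; do
        [ -r "$s/name" ] || continue
        printf '%s %s\n' "${s##*/}" "$(cat "$s/name")"
    done
}

# Usage on a live system:
#   list_cstates
```

Comparing the output before and after the BIOS change gives a quick indication of whether the deeper states are really gone.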
 
Hi,

Did you disable C-states in the BIOS and set it to Performance mode?
 
