NMI watchdog: Watchdog detected hard LOCKUP

John_Dong

Mar 12, 2023
I keep getting this hard lockup on CPU 17. What should I do to debug it?
Code:
 kernel:[74994.546705] NMI watchdog: Watchdog detected hard LOCKUP on cpu 17
Mar 11 21:23:01 proxmox kernel: [74994.546705] NMI watchdog: Watchdog detected hard LOCKUP on cpu 17
Mar 11 21:23:01 proxmox kernel: [74994.546708] Modules linked in: veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter bpfilter nf_tables bonding tls softdog nfnetlink_log nfnetlink intel_rapl_msr intel_rapl_common zfs(PO) zunicode(PO) zzstd(O) iwlmvm zlua(O) zavl(PO) mac80211 icp(PO) edac_mce_amd libarc4 snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi kvm_amd snd_hda_intel zcommon(PO) znvpair(PO) kvm btusb snd_intel_dspcfg spl(O) snd_intel_sdw_acpi btrtl btbcm crct10dif_pclmul vhost_net iwlwifi ghash_clmulni_intel snd_hda_codec vhost btintel vhost_iotlb aesni_intel tap joydev snd_hda_core bluetooth crypto_simd ib_iser input_leds cfg80211 snd_hwdep cryptd ecdh_generic rdma_cm snd_pcm ecc iw_cm rapl snd_timer snd mxm_wmi efi_pstore soundcore wmi_bmof pcspkr ccp ib_cm k10temp ib_core mac_hid iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi nct6775 hwmon_vid vfio_pci vfio_pci_core vfio_virqfd irqbypass vfio_iommu_type1 vfio drm sunrpc
Mar 11 21:23:01 proxmox kernel: [74994.546745]  ip_tables x_tables autofs4 hid_logitech_hidpp btrfs blake2b_generic xor zstd_compress hid_logitech_dj hid_generic usbkbd usbmouse usbhid hid raid6_pq simplefb dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c xhci_pci ahci xhci_pci_renesas crc32_pclmul i2c_piix4 igb libahci xhci_hcd i2c_algo_bit dca nvme nvme_core wmi
Mar 11 21:23:01 proxmox kernel: [74994.546758] CPU: 17 PID: 0 Comm: swapper/17 Tainted: P           O      5.15.85-1-pve #1
Mar 11 21:23:01 proxmox kernel: [74994.546760] Hardware name: To Be Filled By O.E.M. X570 Taichi/X570 Taichi, BIOS P5.01 01/18/2023
Mar 11 21:23:01 proxmox kernel: [74994.546761] RIP: 0010:native_queued_spin_lock_slowpath+0x79/0x240
Mar 11 21:23:01 proxmox kernel: [74994.546766] Code: 2b 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 03 30 e4 09 d0 a9 00 01 ff ff 0f 85 13 01 00 00 85 c0 74 0e 8b 03 84 c0 74 08 f3 90 <8b> 03 84 c0 75 f8 b8 01 00 00 00 66 89 03 5b 41 5c 41 5d 41 5e 41
Mar 11 21:23:01 proxmox kernel: [74994.546767] RSP: 0018:ffffb346805d0e98 EFLAGS: 00000082
Mar 11 21:23:01 proxmox kernel: [74994.546768] RAX: 0000000000000180 RBX: ffff8fb01ee61a40 RCX: 0000000000000020
Mar 11 21:23:01 proxmox kernel: [74994.546769] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8fb01ee61a40
Mar 11 21:23:01 proxmox kernel: [74994.546770] RBP: ffffb346805d0ec0 R08: 00004431998c84b0 R09: 000044319967efcb
Mar 11 21:23:01 proxmox kernel: [74994.546771] R10: ffffffffa92060c0 R11: 000000000000036f R12: 0000000000000082
Mar 11 21:23:01 proxmox kernel: [74994.546771] R13: dead000000000122 R14: 0000000000000001 R15: ffff8fb01ee61a40
Mar 11 21:23:01 proxmox kernel: [74994.546772] FS:  0000000000000000(0000) GS:ffff8fb01ee40000(0000) knlGS:0000000000000000
Mar 11 21:23:01 proxmox kernel: [74994.546773] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 11 21:23:01 proxmox kernel: [74994.546774] CR2: ffffd58784c55408 CR3: 000000011232c000 CR4: 0000000000350ee0
Mar 11 21:23:01 proxmox kernel: [74994.546775] Call Trace:
Mar 11 21:23:01 proxmox kernel: [74994.546776]  <IRQ>
Mar 11 21:23:01 proxmox kernel: [74994.546778]  _raw_spin_lock_irq+0x2a/0x40
Mar 11 21:23:01 proxmox kernel: [74994.546781]  __run_timers.part.0+0x32/0x270
Mar 11 21:23:01 proxmox kernel: [74994.546783]  ? recalibrate_cpu_khz+0x10/0x10
Mar 11 21:23:01 proxmox kernel: [74994.546785]  ? ktime_get+0x46/0xc0
Mar 11 21:23:01 proxmox kernel: [74994.546787]  ? native_x2apic_icr_read+0x20/0x20
Mar 11 21:23:01 proxmox kernel: [74994.546788]  ? lapic_next_event+0x21/0x30
Mar 11 21:23:01 proxmox kernel: [74994.546790]  ? clockevents_program_event+0xab/0x130
Mar 11 21:23:01 proxmox kernel: [74994.546793]  run_timer_softirq+0x4b/0x60
Mar 11 21:23:01 proxmox kernel: [74994.546793]  __do_softirq+0xd9/0x2ea
Mar 11 21:23:01 proxmox kernel: [74994.546795]  irq_exit_rcu+0x94/0xc0
Mar 11 21:23:01 proxmox kernel: [74994.546797]  sysvec_apic_timer_interrupt+0x80/0x90
Mar 11 21:23:01 proxmox kernel: [74994.546799]  </IRQ>
Mar 11 21:23:01 proxmox kernel: [74994.546799]  <TASK>
Mar 11 21:23:01 proxmox kernel: [74994.546800]  asm_sysvec_apic_timer_interrupt+0x1b/0x20
Mar 11 21:23:01 proxmox kernel: [74994.546801] RIP: 0010:native_safe_halt+0xb/0x10
Mar 11 21:23:01 proxmox kernel: [74994.546803] Code: ff ff 4c 89 ee 48 c7 c7 e0 45 25 a9 e8 be 52 8f ff e9 46 ff ff ff cc cc cc cc cc cc cc cc cc eb 07 0f 00 2d 69 d8 47 00 fb f4 <e9> 00 1f 27 00 eb 07 0f 00 2d 59 d8 47 00 f4 e9 f1 1e 27 00 cc 0f
Mar 11 21:23:01 proxmox kernel: [74994.546804] RSP: 0018:ffffb346801e7de0 EFLAGS: 00000246
Mar 11 21:23:01 proxmox kernel: [74994.546805] RAX: 0000000000004000 RBX: 000000000002dec8 RCX: 0000000000000000
Mar 11 21:23:01 proxmox kernel: [74994.546805] RDX: ffff8fb01ee40000 RSI: ffff8fa9019bd400 RDI: ffff8fa9019bd464
Mar 11 21:23:01 proxmox kernel: [74994.546806] RBP: ffffb346801e7de8 R08: 00004431a90fa112 R09: 0000000000000000
Mar 11 21:23:01 proxmox kernel: [74994.546806] R10: 0000000000000002 R11: 071c71c71c71c71c R12: 0000000000000001
Mar 11 21:23:01 proxmox kernel: [74994.546807] R13: 0000000000000011 R14: ffff8fa9019bd464 R15: ffffffffa94e6ec0
Mar 11 21:23:01 proxmox kernel: [74994.546809]  ? acpi_idle_do_entry+0x53/0x70
Mar 11 21:23:01 proxmox kernel: [74994.546811]  acpi_idle_enter+0xc0/0x160
Mar 11 21:23:01 proxmox kernel: [74994.546812]  cpuidle_enter_state+0x9a/0x620
Mar 11 21:23:01 proxmox kernel: [74994.546816]  cpuidle_enter+0x2e/0x50
Mar 11 21:23:01 proxmox kernel: [74994.546817]  do_idle+0x20d/0x2b0
Mar 11 21:23:01 proxmox kernel: [74994.546819]  cpu_startup_entry+0x20/0x30
Mar 11 21:23:01 proxmox kernel: [74994.546821]  start_secondary+0x12a/0x180
Mar 11 21:23:01 proxmox kernel: [74994.546822]  secondary_startup_64_no_verify+0xc2/0xcb
Mar 11 21:23:01 proxmox kernel: [74994.546825]  </TASK>
Mar 11 21:23:01 proxmox kernel: [75040.129389] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
Mar 11 21:23:01 proxmox kernel: [75040.129401] rcu:     17-...0: (6 ticks this GP) idle=607/0/0x1 softirq=932436/932438 fqs=6026
Mar 11 21:23:01 proxmox kernel: [75040.129406]  (detected by 8, t=15002 jiffies, g=1693097, q=4367)
Mar 11 21:23:01 proxmox kernel: [75040.129409] Sending NMI from CPU 8 to CPUs 17:
Mar 11 21:23:01 proxmox kernel: [75040.129413] NMI backtrace for cpu 17
Mar 11 21:23:01 proxmox kernel: [75040.129416] CPU: 17 PID: 0 Comm: swapper/17 Tainted: P           O      5.15.85-1-pve #1
Mar 11 21:23:01 proxmox kernel: [75040.129418] Hardware name: To Be Filled By O.E.M. X570 Taichi/X570 Taichi, BIOS P5.01 01/18/2023
Mar 11 21:23:01 proxmox kernel: [75040.129420] RIP: 0010:native_queued_spin_lock_slowpath+0x79/0x240
Mar 11 21:23:01 proxmox kernel: [75040.129426] Code: 2b 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 03 30 e4 09 d0 a9 00 01 ff ff 0f 85 13 01 00 00 85 c0 74 0e 8b 03 84 c0 74 08 f3 90 <8b> 03 84 c0 75 f8 b8 01 00 00 00 66 89 03 5b 41 5c 41 5d 41 5e 41
Mar 11 21:23:01 proxmox kernel: [75040.129427] RSP: 0018:ffffb346805d0e98 EFLAGS: 00000082
Mar 11 21:23:01 proxmox kernel: [75040.129429] RAX: 0000000000000180 RBX: ffff8fb01ee61a40 RCX: 0000000000000020
Mar 11 21:23:01 proxmox kernel: [75040.129431] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8fb01ee61a40
Mar 11 21:23:01 proxmox kernel: [75040.129431] RBP: ffffb346805d0ec0 R08: 00004431998c84b0 R09: 000044319967efcb
Mar 11 21:23:01 proxmox kernel: [75040.129433] R10: ffffffffa92060c0 R11: 000000000000036f R12: 0000000000000082
Mar 11 21:23:01 proxmox kernel: [75040.129434] R13: dead000000000122 R14: 0000000000000001 R15: ffff8fb01ee61a40
Mar 11 21:23:01 proxmox kernel: [75040.129435] FS:  0000000000000000(0000) GS:ffff8fb01ee40000(0000) knlGS:0000000000000000
Mar 11 21:23:01 proxmox kernel: [75040.129436] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 11 21:23:01 proxmox kernel: [75040.129437] CR2: ffffd58784c55408 CR3: 000000011232c000 CR4: 0000000000350ee0
Mar 11 21:23:01 proxmox kernel: [75040.129439] Call Trace:
Mar 11 21:23:01 proxmox kernel: [75040.129440]  <IRQ>
Mar 11 21:23:01 proxmox kernel: [75040.129442]  _raw_spin_lock_irq+0x2a/0x40
Mar 11 21:23:01 proxmox kernel: [75040.129445]  __run_timers.part.0+0x32/0x270
Mar 11 21:23:01 proxmox kernel: [75040.129447]  ? recalibrate_cpu_khz+0x10/0x10
Mar 11 21:23:01 proxmox kernel: [75040.129450]  ? ktime_get+0x46/0xc0
Mar 11 21:23:01 proxmox kernel: [75040.129451]  ? native_x2apic_icr_read+0x20/0x20
Mar 11 21:23:01 proxmox kernel: [75040.129453]  ? lapic_next_event+0x21/0x30
Mar 11 21:23:01 proxmox kernel: [75040.129456]  ? clockevents_program_event+0xab/0x130
Mar 11 21:23:01 proxmox kernel: [75040.129458]  run_timer_softirq+0x4b/0x60
Mar 11 21:23:01 proxmox kernel: [75040.129459]  __do_softirq+0xd9/0x2ea
Mar 11 21:23:01 proxmox kernel: [75040.129461]  irq_exit_rcu+0x94/0xc0
Mar 11 21:23:01 proxmox kernel: [75040.129463]  sysvec_apic_timer_interrupt+0x80/0x90
Mar 11 21:23:01 proxmox kernel: [75040.129466]  </IRQ>
Mar 11 21:23:01 proxmox kernel: [75040.129466]  <TASK>
Mar 11 21:23:01 proxmox kernel: [75040.129467]  asm_sysvec_apic_timer_interrupt+0x1b/0x20
Mar 11 21:23:01 proxmox kernel: [75040.129468] RIP: 0010:native_safe_halt+0xb/0x10
Mar 11 21:23:01 proxmox kernel: [75040.129470] Code: ff ff 4c 89 ee 48 c7 c7 e0 45 25 a9 e8 be 52 8f ff e9 46 ff ff ff cc cc cc cc cc cc cc cc cc eb 07 0f 00 2d 69 d8 47 00 fb f4 <e9> 00 1f 27 00 eb 07 0f 00 2d 59 d8 47 00 f4 e9 f1 1e 27 00 cc 0f
Mar 11 21:23:01 proxmox kernel: [75040.129471] RSP: 0018:ffffb346801e7de0 EFLAGS: 00000246
Mar 11 21:23:01 proxmox kernel: [75040.129472] RAX: 0000000000004000 RBX: 000000000002dec8 RCX: 0000000000000000
Mar 11 21:23:01 proxmox kernel: [75040.129473] RDX: ffff8fb01ee40000 RSI: ffff8fa9019bd400 RDI: ffff8fa9019bd464
Mar 11 21:23:01 proxmox kernel: [75040.129474] RBP: ffffb346801e7de8 R08: 00004431a90fa112 R09: 0000000000000000
Mar 11 21:23:01 proxmox kernel: [75040.129475] R10: 0000000000000002 R11: 071c71c71c71c71c R12: 0000000000000001
Mar 11 21:23:01 proxmox kernel: [75040.129475] R13: 0000000000000011 R14: ffff8fa9019bd464 R15: ffffffffa94e6ec0
Mar 11 21:23:01 proxmox kernel: [75040.129478]  ? acpi_idle_do_entry+0x53/0x70
Mar 11 21:23:01 proxmox kernel: [75040.129480]  acpi_idle_enter+0xc0/0x160
Mar 11 21:23:01 proxmox kernel: [75040.129482]  cpuidle_enter_state+0x9a/0x620
Mar 11 21:23:01 proxmox kernel: [75040.129485]  cpuidle_enter+0x2e/0x50
Mar 11 21:23:01 proxmox kernel: [75040.129487]  do_idle+0x20d/0x2b0
Mar 11 21:23:01 proxmox kernel: [75040.129489]  cpu_startup_entry+0x20/0x30
Mar 11 21:23:01 proxmox kernel: [75040.129490]  start_secondary+0x12a/0x180
Mar 11 21:23:01 proxmox kernel: [75040.129492]  secondary_startup_64_no_verify+0xc2/0xcb
Mar 11 21:23:01 proxmox kernel: [75040.129496]  </TASK>
pveversion -v
Code:
proxmox-ve: 7.3-1 (running kernel: 5.15.85-1-pve)
pve-manager: 7.3-6 (running version: 7.3-6/723bb6ec)
pve-kernel-helper: 7.3-6
pve-kernel-5.15: 7.3-2
pve-kernel-5.15.85-1-pve: 5.15.85-1
pve-kernel-5.15.74-1-pve: 5.15.74-1
pve-kernel-5.15.30-2-pve: 5.15.30-3
ceph-fuse: 15.2.16-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.3-2
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.3-2
libpve-guest-common-perl: 4.2-3
libpve-http-server-perl: 4.1-6
libpve-storage-perl: 7.3-2
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.3.3-1
proxmox-backup-file-restore: 2.3.3-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.1-1
proxmox-widget-toolkit: 3.5.5
pve-cluster: 7.3-2
pve-container: 4.4-2
pve-docs: 7.3-1
pve-edk2-firmware: 3.20221111-1
pve-firewall: 4.2-7
pve-firmware: 3.6-3
pve-ha-manager: 3.5.1
pve-i18n: 2.8-3
pve-qemu-kvm: 7.2.0-5
pve-xtermjs: 4.16.0-1
qemu-server: 7.3-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.9-pve1
lscpu
Code:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   43 bits physical, 48 bits virtual
CPU(s):                          32
On-line CPU(s) list:             0-31
Thread(s) per core:              2
Core(s) per socket:              16
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       AuthenticAMD
CPU family:                      23
Model:                           113
Model name:                      AMD Ryzen 9 3950X 16-Core Processor
Stepping:                        0
Frequency boost:                 enabled
CPU MHz:                         3500.000
CPU max MHz:                     4761.2300
CPU min MHz:                     2200.0000
BogoMIPS:                        6999.55
Virtualization:                  AMD-V
L1d cache:                       512 KiB
L1i cache:                       512 KiB
L2 cache:                        8 MiB
L3 cache:                        64 MiB
NUMA node0 CPU(s):               0-31
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Not affected
Vulnerability Retbleed:          Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sme sev sev_es
 