NMI watchdog detected hard LOCKUP on cpu1

dejhost

Member
Dec 13, 2020
Hello there,

I got myself a new server. Installing Proxmox 7.3-3 went fine. After the reboot, I ran into some issues getting the network to work, but after 1-2 additional reboots all seemed fine. I installed and configured CephFS, mounted disks from other servers in the LAN using sshfs, and started the data migration.

A few days later, I had to move the server to a new place (same LAN) and could not get the network to work again afterwards. I discovered that all 6 LEDs of the network ports stop blinking simultaneously the moment Linux starts booting. "dmesg -w" does not report any changes when I plug in or remove the network cables.
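Since "dmesg -w" stays silent on cable changes, it can help to look through the whole kernel log for anything the NIC drivers or the watchdog printed earlier. A minimal sketch, assuming the onboard ports use the common Intel drivers e1000e or igb and the add-in card uses atlantic (the keyword list is an assumption, so adjust it to your hardware):

```python
# Keywords are assumptions: e1000e and igb are common Intel NIC drivers,
# atlantic is the in-kernel Aquantia driver; adjust to the actual hardware.
KEYWORDS = ("link is", "e1000e", "igb", "atlantic", "hard lockup")


def nic_related(lines, keywords=KEYWORDS):
    """Return the log lines that contain any keyword, case-insensitively."""
    hits = []
    for line in lines:
        lowered = line.lower()
        if any(word in lowered for word in keywords):
            hits.append(line.rstrip("\n"))
    return hits
```

To use it, pass in the lines of the kernel log, for example the output of "dmesg" saved to a file and read with readlines(). An empty result would suggest that no driver ever claimed the ports.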

Maybe a hardware conflict?
  • I removed the additional network card (Asus) => No changes.
  • Then I removed the graphics card (Nvidia GeForce 7300 LE) and replaced it with another one. The boot process is different now, and I get the following screen:
temp.jpeg

Could you please advise on how to proceed?

Here is my hardware:
  • Fractal Design Node 804
  • PSU: 600 W, semi-modular
  • Mainboard: ASUS X99-M WS/SE
  • CPU: Intel Xeon E5-2650 v3 (ES, engineering sample)
  • GIGABYTE Aorus NVMe 1TB
  • RAM: 128 GB DDR4 ECC registered, 2133 MHz (4 x 32 GB)
  • Graphics card: Nvidia GeForce 7300 LE (replaced later by an old Nvidia Quadro)
  • Asus network card: XG-C100C
  • 2x 10TB HDD from WD
Any help would be much appreciated!
 
1. Flash a newer BIOS firmware, if one is available.
2. Reset the mainboard (preferably via the jumper rather than the reset option in the BIOS).
3. Reseat the RAM modules and try booting with only 1 or 2 to see if that makes a difference (your hardware looks new, so a broken module is unlikely but not impossible).
4. Switch off every power-saving option you find in the BIOS, especially ASPM.
5. Try booting without any USB device plugged in.
6. Check in the BIOS whether you can change the USB controller mode between 3.0, 3.1 and 3.2.
7. Move the PCIe cards to different slots, or boot without them as a test, and see if the error disappears.
8. Double-check the SATA cables; they are sometimes loose, and moving the server could have done the rest.
9. Is your PSU new or older? PSUs age, and an older (but otherwise okay) PSU driving a newer CPU can cause a brown-out because it cannot deliver the fast load changes that current CPUs need.
 
Thank you for your reply.

I booted the server (without having made any changes) and got into Proxmox without any error messages. Only 1 (out of 3) network interfaces is working. Since then, I have removed and swapped the processor, the graphics card, and the additional network card from ASUS. In all cases, Proxmox seems to boot fine, but mostly without any network: the interfaces seem not to exist. Sometimes I get one working (always the same onboard network interface).
I tried to check the ASUS website for updated firmware for the motherboard, but it seems to be partially down.
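Whether the interfaces really "do not exist" can be checked directly in sysfs, which lists every network interface the kernel created together with the driver bound to it. A minimal sketch, assuming the standard Linux layout under /sys/class/net:

```python
from pathlib import Path


def list_interfaces(sys_net="/sys/class/net"):
    """Return (name, driver, operstate) for each interface the kernel created."""
    rows = []
    base = Path(sys_net)
    if not base.is_dir():
        return rows
    for iface in sorted(base.iterdir()):
        if not iface.is_dir():
            continue  # skip plain files such as bonding_masters
        driver_link = iface / "device" / "driver"
        # Virtual interfaces such as lo have no device/driver link.
        driver = driver_link.resolve().name if driver_link.exists() else "-"
        state_file = iface / "operstate"
        state = state_file.read_text().strip() if state_file.is_file() else "unknown"
        rows.append((iface.name, driver, state))
    return rows


for name, driver, state in list_interfaces():
    print(f"{name:12} driver={driver:10} state={state}")
```

If a port is missing from this list entirely, no driver claimed it, which points at driver or hardware detection rather than at the network configuration. If it is listed but its state is "down", the configuration or the link is the more likely suspect.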

From your list, I tried 5, 7 and 8.
9) Yes: my PSU is pretty new.
 
I booted the server (without having made any changes), and - without any error messages - I got into proxmox.
Broken RAM could produce this kind of "new surprise on every boot" effect. So could a broken NIC wreaking havoc, or overheating components. When you changed the CPU, did you clean it and the cooler and apply new thermal paste?

Try the other points from the list and check the ASUS page once it is back. I can't say much for now; this seems tricky.
 
Since the ethernet ports work at the beginning of a boot process and stop working as soon as proxmox boots: possibly missing/wrong drivers? Anything I can do about that?

I will continue troubleshooting during the holidays.
 
2. I reset the mainboard
4. I checked - no changes to be made.
6. I turned off xHCI. Anyhow: the only thing connected via USB is an elderly keyboard.

The Asus homepage does not offer any Linux-drivers for this mainboard :-(

I downloaded the driver for the PCIe network card (ASUS XG-C100C Driver Version 5.0.3.3), but I have difficulties installing it:

temp.jpg

Can you please advise?
 

Since the ethernet ports work at the beginning of a boot process and stop working as soon as proxmox boots: possibly missing/wrong drivers? Anything I can do about that?
The Asus homepage does not offer any Linux-drivers for this mainboard :-(
The card should have worked out of the box for a long time now:
https://forum.proxmox.com/threads/problems-with-10g-nic-xg-c100c-aqc107.44828/post-220756
https://forum.proxmox.com/threads/asusA-xg-c100f-driver-in-kernel-for-proxmox-5-4-15.78457/
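If the card is supposed to work out of the box, a quick check is whether the matching kernel module is actually loaded. A minimal sketch, assuming the XG-C100C is based on the AQC107 chip (as the first linked thread suggests) and is handled by the in-kernel "atlantic" module:

```python
from pathlib import Path


def loaded_modules(proc_modules="/proc/modules"):
    """Return the set of kernel module names currently loaded."""
    path = Path(proc_modules)
    if not path.is_file():
        return set()
    # The first whitespace-separated field of each line is the module name.
    return {line.split()[0] for line in path.read_text().splitlines() if line.strip()}


# Assumption: the XG-C100C (AQC107 chip) is driven by the "atlantic" module.
print("atlantic loaded:", "atlantic" in loaded_modules())
```

A result of False with the card installed would suggest the kernel never detected the card, so building the vendor driver by hand is unlikely to be the fix. Note that a driver compiled into the kernel instead of built as a module would not show up here.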

2. I reset the mainboard
4. I checked - no changes to be made.
6. I turned off xHCI. Anyhow: the only thing connected via USB is an elderly keyboard.
OK, that's good in the sense that it rules out most of the common causes.

I now suggest running a memory check with a minimal mainboard configuration. Disconnect every SATA device and pull out the ASUS NIC for the duration of the test.
Boot the Proxmox ISO and, on the very first screen, choose advanced -> test memory.

If the first run throws errors with all 4 modules installed, run the test with one module at a time to identify the faulty one.

Edit:
Does this link work for you? https://dlcdnets.asus.com/pub/ASUS/mb/Socket2011-R3/X99-M_WS_SE/BIOS/X99-M-WS-SE-ASUS-4001.zip
That is BIOS Version 4001 from 2019/08/01 and it says:

1. Update CPU uCode
2. Improve system performance

Sounds promising to me, if you don't already have this BIOS...
 
I booted and, instead of choosing Proxmox, chose to test the memory. I didn't check the exact time, but it must have been running for 14 hours or so. The screen looks like this (doesn't seem to change):
temp.jpeg

I did not pull the SATA cables, but the ASUS NIC is not installed.
Though the state says "Running", I cannot see any progress. It has been like this the whole time. Should I exit?

It seems I already have the latest BIOS version:
temp2.jpeg


Once the memory test is done, I'd like to install Debian on a partition and see if the NICs show up. I will also post my network config as soon as possible. Maybe I managed to introduce a conflict there?
 
screen looks like this (doesn't seem to change):
Uh, that froze after 2 seconds of running, which is not good. A strong sign of broken RAM.
Normally the test percentage should count up and the red + in memtest86+ should blink.

Pull out all modules and try again with one module at a time.

If the result is the same on all four tries, it could be because the chipset isn't detected. In that case, try the tool from PassMark, which unfortunately has nearly the same name (MemTest86): https://www.memtest86.com/download.htm (free version!)
Write it to a USB stick or burn it to a CD and boot from that.

It could also be that the SATA chip or the SATA disks are blocking, so I recommend running the memory tests with a minimum of things plugged in (only keyboard, VGA, etc.).
 
I removed 3 out of 4 RAM modules, so that only the one in bank A1 remained, and restarted the test. It froze after 2 seconds. I then removed this module and replaced it with the neighboring module: the screen stayed black. I then swapped it with the next module: the test froze after 2 seconds.
Testing the last module: the test freezes after 2 seconds.
Should I use another bank?

This is my network config:




20221227_145304.jpg
 
Should I use another bank?
Yes, try every combination. It may also be that this board or module type needs the modules installed in pairs (and in slots of the same color), or that you need to use the other software from PassMark.

I don't think it's a network configuration error. The signs strongly point to a hardware error or bug.

Edit:
If both tools freeze, you could try Proxmox again with 1 or 2 modules. If it fails and gives errors, swap to the other modules.
Finding bad RAM is mostly a pain in the ass. Errors show up in different ways, including freezes, and sometimes they can't be detected with tools. The best way to rule out broken RAM would be another board to check against, but I guess you are a home user and not a hardware shop. :)
 
Just want to point out a few things:
  • The error "NMI watchdog detected hard LOCKUP on cpu1" has not appeared again. The only problem I have is that 2 out of 3 network interfaces aren't activated once Linux starts booting.
  • The CPU I am using is an ES version. I tried another one in between, but since the issues did not disappear, I am using the ES again. It still might be an issue.
  • I disconnected the 2 SATA hard drives.
  • I tested several combinations of RAM modules and RAM banks. All fail after 2 seconds.
I will run MemTest86 now.
 
The error "NMI watchdog detected hard LOCKUP on cpu1" has not appeared again. The only problem I have is that the 2 out of three network-interfaces aren't activated, once linux starts booting.
I would attribute that to bad RAM as well, if that turns out to be the case. It wouldn't surprise me, but on the other hand it's rare with new hardware.

Should I use another bank?
The manual says slot B1 for a single module, and B1 and D1 for two modules. https://dlcdnets.asus.com/pub/ASUS/...WS_SE/Manual/E13679_X99-M_WS_SE_UM_V3_WEB.pdf -> page 21/176
There is an auto-overclocking feature called XMP; maybe try with and without it. Your modules could need slightly more voltage for stability, in which case XMP could be needed, since it applies all of these settings automatically. I'm just mentioning this as another possibility. See also page 30.
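To keep track of which module and slot combinations have already been tested, the manual's recommendation above can be turned into a small checklist. A minimal sketch, assuming four modules with hypothetical labels M1 to M4 and the slots B1 (single module) and B1 + D1 (pairs) named in the manual:

```python
from itertools import combinations


def memtest_plan(modules):
    """Return (modules, slots) runs: every module alone in B1, then every pair in B1 + D1."""
    single_runs = [((module,), ("B1",)) for module in modules]
    pair_runs = [(pair, ("B1", "D1")) for pair in combinations(modules, 2)]
    return single_runs + pair_runs


# Hypothetical labels for the four 32 GB modules; mark the physical sticks accordingly.
for number, (mods, slots) in enumerate(memtest_plan(["M1", "M2", "M3", "M4"]), start=1):
    print(f"run {number}: {' + '.join(mods)} in {' + '.join(slots)}")
```

That gives 10 runs in total. If every single run freezes in the same way, a board, CPU or tool problem becomes more likely than four bad modules.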

Overlapping IRQs could be another problem (given that the NIC LEDs go out), but I don't understand what the letters on page 25 mean.

Also, you could try the EPU switch on page 29; in my opinion it should be enabled by default.
 
There is also a "Mem. Ok" button... page 1-13.


The test has not revealed any errors so far...
20221227_180726.jpg
 
I just realized that the LED of the onboard NIC that stops working while Linux boots is blinking while the PC is turned off. This means that WOL is activated for this NIC, right? How do I turn it off?
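On the WOL question: a blinking LED while the machine is off may just mean the port keeps a standby link, so it is worth checking the actual setting first. On Linux this is usually done with ethtool. A minimal sketch, assuming ethtool is installed and using a hypothetical interface name such as eno1:

```python
import subprocess


def wol_setting(ethtool_output):
    """Extract the active Wake-on value (e.g. 'g' or 'd') from `ethtool IFACE` output."""
    for line in ethtool_output.splitlines():
        stripped = line.strip()
        # The "Supports Wake-on:" line only lists capabilities, so it is skipped.
        if stripped.startswith("Wake-on:"):
            return stripped.split(":", 1)[1].strip()
    return None


def disable_wol_command(iface):
    """Build the ethtool command that turns Wake-on-LAN off for iface."""
    return ["ethtool", "-s", iface, "wol", "d"]


def current_wol(iface):
    """Run `ethtool IFACE` (needs root for full output) and return its Wake-on value."""
    result = subprocess.run(["ethtool", iface], capture_output=True, text=True, check=True)
    return wol_setting(result.stdout)
```

A value of "g" means wake on magic packet is enabled and "d" means disabled. Running the command from disable_wol_command("eno1") as root switches it off, but the setting is usually lost on reboot, and the BIOS may have its own wake-on-LAN option as well. Also, this only works for an interface that Linux actually shows, which is the open problem in this thread.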
 
