Proxmox 4.4 NMI watchdog and network failure

nseba

Jan 16, 2017
Hi,
I just received a new machine on which I installed Proxmox 4.4. Installation went great, with full access to the web admin, etc.
However, the next day the system had hung, with the screen filled with lines like

Code:
Message from syslog
kernel:[90524.099987] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 23s! [systemd-timesyn:788]

and once in a while

Code:
[90675.828055] INFO: rcu_sched detected stall on CPUs/tasks:
[90675.828xxx]$0-...: (1 GPs behind) idle=ff9/1/0 softirq=1360721/1360723 fqs=2752358
[90675.828yyy]$(detected by 2, t=3030337 jiffies, g=927259, c=927258, q=161956)

More surprisingly, no machine connected to the same physical network switch was able to communicate over IP (Wireshark showed only ARP packets). IP traffic resumed when I unplugged the network cable from the Proxmox 4.4 machine, and stopped again when I plugged it back in.

When hard rebooted, the machine runs well for 7-8 hours, then hangs again with the same error but a different process. The first time it crashed, the process was RRDCached, the second time iptables-restore, and the third time systemd-timesyn.
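
Since the stuck process changes from crash to crash, it can help to tally which task names show up in the soft lockup lines of a saved log. A minimal sketch, assuming the log lines end with the [task:pid] field exactly as in the messages above; the function name summarize_lockups is only illustrative:

```shell
# Tally "soft lockup" lines by the task named in the trailing [task:pid] field.
# Reads kernel log text on stdin; prints "<count> <task>", highest count first.
summarize_lockups() {
    sed -n '/soft lockup/ s/.*\[\(.*\):[0-9][0-9]*\]$/\1/p' \
        | sort | uniq -c | sort -rn \
        | sed 's/^ *//'
}

# Example (any file holding saved kernel messages will do):
#   summarize_lockups < /var/log/kern.log
```

If one task dominates, that is a hint about where to look; an even spread across unrelated tasks points more towards the hardware or a driver than towards any one process.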

The machine specs:
  • MB: MSI X99A Raider (built-in ethernet is used)
  • proc: Intel Core i7 6800K
  • RAM: 4 x 8GB DDR4 2400MHz
  • power supply: Seasonic G-650 (650W)
It seems possible to disable the NMI watchdog, but that only means some process will randomly get stuck without any warning. Does anyone have the same trouble? Do you have any hints on what I can do or test to solve this problem?
Thanks a lot for your help
 
Hello,
it seems my issue left everyone speechless ;).
Some news about it. I bought and installed a new network card (TP-Link TG-3468 with Realtek RTL8168), deactivated the built-in network card (Intel e1000e) and did a clean install of Proxmox 4.4.

With this new configuration, I still experience system hangs with NMI watchdog messages, but they no longer jam the network. I also noticed that the first watchdog error is a HARD lockup, which is then followed by soft ones. Lastly, the process concerned is most of the time [systemd-timesyn]. Do any of you have any hints?
 
I get a reproducible crash with this error message (after hard LOCKUP messages), also on a fresh Proxmox 4.4 install (after dist-upgrade), by copying a 20G raw disk from an NFS source to a drbd9 storage. When this happens, the CPU usage of the drbd resource goes up to 100%.
So
@nseba: I'm very interested to hear if you are using drbd at all.
 

Attachments: lockup-BUG.png (127.5 KB), lockup-dmsg.txt (31.4 KB)
Hi Michael,
sorry to hear you also have these crashes.
I don't use drbd at all, but I wondered if there could be some problem with disk management. Here are some tests I ran:

1. disconnected as many hard drives as possible and changed cables. My current configuration is just one hard drive connected with a new SATA cable.

2. uninstalled ZFS as it can be resource-intensive

3. played with some sysctl variables indicated in https://forum.proxmox.com/threads/f...sts-during-high-io-on-host.30702/#post-159066

4. played with the watchdog https://pve.proxmox.com/wiki/High_Availability_Cluster_4.x#Hardware_Watchdogs (tested with nmi_watchdog=0, 1 or 2)

None worked for me but perhaps number 3 could be interesting for you to test.
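
When experimenting with nmi_watchdog and the related sysctl variables, it is useful to record what the running kernel actually has set before and after each change. A minimal sketch, assuming the usual files under /proc/sys/kernel; the function name show_watchdog_settings is made up for illustration, and files that a given kernel does not provide are simply skipped:

```shell
# Print lockup-detector settings as "name=value" lines.
# $1: directory holding the sysctl files (defaults to /proc/sys/kernel).
show_watchdog_settings() {
    dir=${1:-/proc/sys/kernel}
    for name in nmi_watchdog watchdog_thresh softlockup_panic hardlockup_panic; do
        # Skip entries this kernel does not expose.
        if [ -r "$dir/$name" ]; then
            printf '%s=%s\n' "$name" "$(cat "$dir/$name")"
        fi
    done
    return 0
}

# Example:
#   show_watchdog_settings
```

Per the kernel documentation, the soft lockup threshold is twice watchdog_thresh, so the value printed here tells you how long a CPU has to be stuck before the message appears.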
 
After countless tests and reinstallations, I have managed to get a configuration that has been working for 24h+ so far. To achieve this, I installed a couple of non-free packages on the hypervisor.

Here's the walkthrough:
- Add the contrib and non-free repositories from Debian
Code:
nano /etc/apt/sources.list

- update the base Debian entry to
Code:
deb http://ftp.fr.debian.org/debian jessie main contrib non-free

- update the apt package cache
Code:
apt-get update

- install the required packages (in my case, the Intel- and nvidia-specific ones)
Code:
apt-get install intel-microcode irqbalance nvidia-settings nvidia-driver

Then do a gentle reboot to make sure everything has been taken into account (in particular, the nvidia driver blacklists the nouveau module, which can be troublesome).
I'll set the thread to solved on Monday if my server doesn't crash this weekend :cool:.
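
Before running apt-get install, it may be worth confirming that the edit to sources.list really enabled both extra components, since the non-free packages cannot be found otherwise. A minimal sketch, assuming the one-line "deb ..." format shown in the walkthrough; the helper name has_contrib_nonfree is hypothetical:

```shell
# Succeeds if some active "deb" line in the given file lists both
# the contrib and non-free components.
# $1: path to a sources.list-style file (defaults to /etc/apt/sources.list).
has_contrib_nonfree() {
    awk '
        $1 == "deb" {
            contrib = 0; nonfree = 0
            # Scan everything after the "deb" keyword for the component names.
            for (i = 2; i <= NF; i++) {
                if ($i == "contrib")  contrib = 1
                if ($i == "non-free") nonfree = 1
            }
            if (contrib && nonfree) found = 1
        }
        END { exit found ? 0 : 1 }
    ' "${1:-/etc/apt/sources.list}"
}

# Example:
#   has_contrib_nonfree && echo "contrib and non-free are enabled"
```

Commented-out lines are ignored because their first field is not exactly "deb".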
 
Hi,
I'm not sure if irqbalance is a good idea on a KVM host.

Udo
 
Hi. Why is that not a good idea? I have a similar problem:

Code:
kernel:[11563.649652] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kworker/1:1:68]

and

Code:
kernel:[12277.425302] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [(md-udevd):25026]

I'm running Proxmox 5.0 beta1 with the latest Debian updates. The system freezes completely. Sometimes it takes minutes, sometimes the system runs for 20 hours or so until I get that error. At that point the whole system is frozen.

My hardware config:
Intel Xeon E3-1230 v3
MSI Gaming 5 Board
4x 4GB Kingston 1333 DDR-3 RAM
nVidia GT710 (for host - in first pcie 16x slot)
nVidia GTX 970 (for KVM Win10 guest - in second pcie 16x slot)
3x 3TB WD ecoGreen HDDs for cold storage (raidz-1 zfs - all my VMs are stored in that pool)
1x 250 GB Evo 850 SSD (this is where the host installation is located - ext4)

The error occurs even if no VM is running, so I think it is either a hardware problem or a bug. I had the same issues with Proxmox 4.4.
 
Hi,
I’ve been running my system successfully for a few weeks now. I tried without the irqbalance package and it works without any trouble.

I think my main problem was the open-source nvidia driver “nouveau” (and probably also the missing intel-microcode). Joshlukas, did you install the official nvidia driver? I was very surprised, but I think it is worth trying even with a GT710 used for the host (I’ve got the same one…).

Keep us informed.
 
Hi.

No, I just installed the intel microcode package. The nvidia packages have too many dependencies, so I didn't try. Will give it a shot today.
 
I first tested without the nvidia packages but in the end it made the difference. I highly recommend it.
 
Hi.

Well, it seems to run stable now. I would like to know what the cause is, but anyway, it's working. I did not install the irqbalance package either, just the intel microcode and the official nVidia drivers. Thank you.
 
Those
NMI watchdog: BUG: soft lockup

messages appear when a CPU has been stuck in a task for longer than the soft lockup threshold (about 20 seconds by default, i.e. twice kernel.watchdog_thresh, which matches the 22s and 23s reported in the messages above).
The kernel stack trace appended to the message should point you to the part of the subsystem that is blocked
( the
[<ffffffff9710e5f4>] futex_wait_queue_me+0xc4/0x120
[<ffffffff9710f276>] futex_wait+0x116/0x270
[<ffffffff9711115d>] do_futex+0x2cd/0xb60
[<ffffffff97111a75>] SyS_futex+0x85/0x180
[<ffffffff978f4efb>] entry_SYSCALL_64_fastpath+0x1e/0xad
[<ffffffffffffffff>] 0xffffffffffffffff
part )

In my experience, it usually boils down to non-working hardware or broken network connections.
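
To compare the stack traces from several lockups quickly, the symbol names can be pulled out of the trace lines so that recurring functions stand out. A minimal sketch, assuming the "[<address>] symbol+0xOFF/0xLEN" layout shown in the trace above; the function name trace_symbols is only illustrative:

```shell
# Print the function name from each "[<address>] symbol+0xOFF/0xLEN" line.
# Lines without a symbol (such as the final 0xffffffffffffffff entry) are dropped.
# A leading "? " marker, which the kernel uses for unreliable frames, is tolerated.
trace_symbols() {
    sed -n 's/^.*\[<[0-9a-f]*>\][? ]*\([A-Za-z_][A-Za-z0-9_.]*\)+0x[0-9a-f]*\/0x[0-9a-f]*.*$/\1/p'
}

# Example (lockup-dmsg.txt stands for any saved dmesg output):
#   trace_symbols < lockup-dmsg.txt | sort | uniq -c | sort -rn
```

Counting the symbols across all saved traces shows whether the lockups keep landing in the same driver or subsystem.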
 
@joshlukas
Hey,
I have the same problem at the moment and I use Proxmox 5 too.
But when I try to install the package "nvidia-driver", I get the following error message:

root@Proxmox:~# apt-get install nvidia-driver
Reading package lists... Done
Building dependency tree
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
nvidia-driver : Depends: xserver-xorg-video-nvidia (= 340.102-1) but it is not going to be installed
Recommends: libgl1-nvidia-glx-i386 but it is not installable
E: Unable to correct problems, you have held broken packages.

Can you help me? Or somebody else?
 
Don't install the nvidia driver; there is no point in that on a server. You just need to blacklist nouveau (the open-source driver):
- nano /etc/modprobe.d/blacklist-nouveau.conf
- Paste this:

blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off

- echo options nouveau modeset=0 | tee -a /etc/modprobe.d/nouveau-kms.conf
- update-initramfs -u
- reboot.

Now the issue is solved :)
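
After the reboot, it is worth checking that the blacklist really took effect and nouveau is no longer loaded. A minimal sketch that looks the module up in /proc/modules; the function name module_loaded is made up for illustration:

```shell
# Succeeds if the named module appears in a /proc/modules-style listing.
# $1: module name; $2: listing file (defaults to /proc/modules).
module_loaded() {
    awk -v mod="$1" '$1 == mod { found = 1 } END { exit found ? 0 : 1 }' \
        "${2:-/proc/modules}"
}

# Example:
#   if module_loaded nouveau; then echo "nouveau is still loaded"; fi
```

If nouveau still shows up, the initramfs was probably not regenerated, so rerun update-initramfs -u and reboot again.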
 
