Spontaneous Node Reboots

CZappe

Member
Jan 25, 2021
I've been diagnosing a Proxmox node on our network that began acting up about a year ago; the primary symptom is spontaneous, frequent node reboots under light load. I've offloaded every VM and container from the node and reinstalled the latest version of PVE 6 to create a clean system for further diagnosis. I feel like I'm getting close to a root cause, be that hardware, software, or some collision between the two, but the system remains unstable. Here's a quick summary of the symptoms I'm experiencing:
  1. VM Usage: VMs are generally unstable when running on this system, though results tend to be better for VMs running on local-lvm storage than for those stored on our NAS and accessed over NFS via a bonded 10GbE network interface.
  2. VM Creation: Dozens of attempts to create test Linux VMs on this node have failed or required several attempts and fresh hard disk files, even when using a locally stored ISO that installs without issue on a similar node. The only reliably successful installations have been from a local ISO to local-lvm storage. The most recent attempts to install from a NAS-hosted ISO over NFS have frequently ended in a spontaneous node reboot.
Diagnosis. These are things I've checked for clues as to what's going on with this node:
  1. Manufacturer (Supermicro) offline diagnostics: I've run the full diagnostics on this server several times, including memory, disk, and PSU tests, and every result comes back passing on all available tests. I also manually tested the redundant power supplies by pulling the cords one at a time; it passed that, too.
  2. IPMI logs: other than recording that I pulled the power cords, no diagnostic info relating to the spontaneous reboots appears to be held in these logs.
  3. Syslog: The reboots are spontaneous, with no consistent events recorded in /var/log/syslog immediately prior to the reboots. Occasionally the syslog is splattered with null characters at the time of reboot, indicating that it was in the middle of recording something else when it happened.
  4. Journalctl: I just set the Storage= parameter to "persistent" in /etc/systemd/journald.conf to try to capture more data across reboots (the exact change is shown after the dmesg output below), but so far I don't have anything to report that I haven't also found in...
  5. Dmesg: Reviewing the kernel log for messages with a level of "warn" and above did turn up some potential leads regarding both storage and networking, which seems appropriate given the symptoms I'm seeing. Below are the relevant entries from dmesg:
Code:
root@/node/:~# dmesg -HTx --level=warn,err,crit,alert,emerg
kern  :warn  : [Tue Jan 11 16:06:44 2022]  #17 #18 #19 #20 #21 #22 #23 #24 #25 #26 #27 #28 #29 #30 #31
kern  :warn  : [Tue Jan 11 16:06:44 2022] mtrr: your CPUs had inconsistent fixed MTRR settings
kern  :warn  : [Tue Jan 11 16:06:44 2022] mtrr: your CPUs had inconsistent variable MTRR settings
kern  :warn  : [Tue Jan 11 16:06:45 2022]  PPR X2APIC NX GT IA GA PC GA_vAPIC
kern  :warn  : [Tue Jan 11 16:06:45 2022]  PPR X2APIC NX GT IA GA PC GA_vAPIC
kern  :warn  : [Tue Jan 11 16:06:45 2022]  PPR X2APIC NX GT IA GA PC GA_vAPIC
kern  :warn  : [Tue Jan 11 16:06:45 2022]  PPR X2APIC NX GT IA GA PC GA_vAPIC
kern  :warn  : [Tue Jan 11 16:06:45 2022]  PPR X2APIC NX GT IA GA PC GA_vAPIC
kern  :warn  : [Tue Jan 11 16:06:45 2022]  PPR X2APIC NX GT IA GA PC GA_vAPIC
kern  :warn  : [Tue Jan 11 16:06:45 2022]  PPR X2APIC NX GT IA GA PC GA_vAPIC
kern  :warn  : [Tue Jan 11 16:06:45 2022]  PPR X2APIC NX GT IA GA PC GA_vAPIC
kern  :warn  : [Tue Jan 11 16:06:45 2022] platform eisa.0: EISA: Cannot allocate resource for mainboard
kern  :warn  : [Tue Jan 11 16:06:45 2022] platform eisa.0: Cannot allocate resource for EISA slot 1
kern  :warn  : [Tue Jan 11 16:06:45 2022] platform eisa.0: Cannot allocate resource for EISA slot 2
kern  :warn  : [Tue Jan 11 16:06:45 2022] platform eisa.0: Cannot allocate resource for EISA slot 3
kern  :err   : [Tue Jan 11 16:06:47 2022] mpt3sas_cm0: overriding NVDATA EEDPTagMode setting
kern  :warn  : [Tue Jan 11 16:06:56 2022] spl: loading out-of-tree module taints kernel.
kern  :warn  : [Tue Jan 11 16:06:56 2022] znvpair: module license 'CDDL' taints kernel.
kern  :warn  : [Tue Jan 11 16:06:56 2022] Disabling lock debugging due to kernel taint
kern  :warn  : [Tue Jan 11 16:06:56 2022] ipmi_si dmi-ipmi-si.0: The BMC does not support clearing the recv irq bit, compensating, bu
kern  :warn  : [Tue Jan 11 16:06:57 2022] new mount options do not match the existing superblock, will be ignored
kern  :warn  : [Tue Jan 11 16:06:57 2022] device-mapper: thin: Data device (dm-3) discard unsupported: Disabling discard passdown.
kern  :warn  : [Tue Jan 11 16:07:03 2022] bond0: Warning: No 802.3ad response from the link partner for any adapters in the bond

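For reference (item 4 above), enabling the persistent journal was just a one-line change followed by a restart of journald:

Code:
root@/node/:~# grep '^Storage' /etc/systemd/journald.conf
Storage=persistent
root@/node/:~# systemctl restart systemd-journald
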
The message mpt3sas_cm0: overriding NVDATA EEDPTagMode setting shows up on screen during boot and is referenced in a number of forums, including a post here on the Proxmox Forum, in threads describing spontaneous reboots.

Code:
root@/node/:~# dmesg -HTx | grep "mpt3sas_cm0"
kern  :info  : [Tue Jan 11 16:06:45 2022] mpt3sas_cm0: 63 BIT PCI BUS DMA ADDRESSING SUPPORTED, total mem (264051344 kB)
kern  :info  : [Tue Jan 11 16:06:45 2022] mpt3sas_cm0: CurrentHostPageSize is 0: Setting default host page size to 4k
kern  :info  : [Tue Jan 11 16:06:45 2022] mpt3sas_cm0: MSI-X vectors supported: 96
kern  :info  : [Tue Jan 11 16:06:45 2022] mpt3sas_cm0:  0 64
kern  :info  : [Tue Jan 11 16:06:45 2022] mpt3sas_cm0: High IOPs queues : disabled
kern  :info  : [Tue Jan 11 16:06:45 2022] mpt3sas_cm0: iomem(0x00000000b6440000), mapped(0x00000000094fbf45), size(65536)
kern  :info  : [Tue Jan 11 16:06:45 2022] mpt3sas_cm0: ioport(0x0000000000003000), size(256)
kern  :info  : [Tue Jan 11 16:06:46 2022] mpt3sas_cm0: CurrentHostPageSize is 0: Setting default host page size to 4k
kern  :info  : [Tue Jan 11 16:06:46 2022] mpt3sas_cm0: sending diag reset !!
kern  :info  : [Tue Jan 11 16:06:47 2022] mpt3sas_cm0: diag reset: SUCCESS
kern  :info  : [Tue Jan 11 16:06:47 2022] mpt3sas_cm0: scatter gather: sge_in_main_msg(1), sge_per_chain(7), sge_per_io(128), chains_per_io(19)
kern  :info  : [Tue Jan 11 16:06:47 2022] mpt3sas_cm0: request pool(0x00000000b8c68811) - dma(0xffb80000): depth(3200), frame_size(128), pool_size(400 kB)
kern  :info  : [Tue Jan 11 16:06:47 2022] mpt3sas_cm0: sense pool(0x00000000f7470cb1)- dma(0xff380000): depth(2939),element_size(96), pool_size(275 kB)
kern  :info  : [Tue Jan 11 16:06:47 2022] mpt3sas_cm0: config page(0x00000000bae55bea) - dma(0xff2fa000): size(512)
kern  :info  : [Tue Jan 11 16:06:47 2022] mpt3sas_cm0: Allocated physical memory: size(4704 kB)
kern  :info  : [Tue Jan 11 16:06:47 2022] mpt3sas_cm0: Current Controller Queue Depth(2936),Max Controller Queue Depth(3072)
kern  :info  : [Tue Jan 11 16:06:47 2022] mpt3sas_cm0: Scatter Gather Elements per IO(128)
kern  :info  : [Tue Jan 11 16:06:47 2022] mpt3sas_cm0: _base_display_fwpkg_version: complete
kern  :err   : [Tue Jan 11 16:06:47 2022] mpt3sas_cm0: overriding NVDATA EEDPTagMode setting
kern  :info  : [Tue Jan 11 16:06:47 2022] mpt3sas_cm0: LSISAS3008: FWVersion(16.00.01.00), ChipRevision(0x02), BiosVersion(08.37.00.00)
kern  :info  : [Tue Jan 11 16:06:47 2022] mpt3sas_cm0: Protocol=(Initiator), Capabilities=(Raid,TLR,EEDP,Snapshot Buffer,Diag Trace Buffer,Task Set Full,NCQ)
kern  :info  : [Tue Jan 11 16:06:47 2022] mpt3sas_cm0: sending port enable !!
kern  :info  : [Tue Jan 11 16:06:48 2022] mpt3sas_cm0: hba_port entry: 000000002a4f8565, port: 255 is added to hba_port list
kern  :info  : [Tue Jan 11 16:06:48 2022] mpt3sas_cm0: host_add: handle(0x0001), sas_addr(0x5003048022eac7f0), phys(8)
kern  :info  : [Tue Jan 11 16:06:55 2022] mpt3sas_cm0: port enable: SUCCESS

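In case it's useful for anyone comparing notes, the driver and firmware versions for this controller can be read back roughly like this (the LSISAS3008 string is simply what this HBA reports in the log above):

Code:
root@/node/:~# modinfo mpt3sas | grep -i '^version'   # mpt3sas driver version bundled with the running kernel
root@/node/:~# dmesg | grep -i LSISAS3008             # FWVersion / BiosVersion reported by the controller at boot
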
The message bond0: Warning: No 802.3ad response from the link partner for any adapters in the bond is concerning, given the particularly bad results we get on this node whenever it accesses files on shared storage. I'm looking into whether we have a switch configuration issue, but I'm not sure an issue of this sort is even capable, on its own, of causing the catastrophic symptoms we're seeing.

Code:
root@/node/:~# dmesg -HTx | grep "bond0"
kern  :info  : [Tue Jan 11 16:06:58 2022] bond0: (slave enp129s0f0): Enslaving as a backup interface with a down link
kern  :info  : [Tue Jan 11 16:06:58 2022] bond0: (slave enp129s0f1): Enslaving as a backup interface with a down link
kern  :info  : [Tue Jan 11 16:06:58 2022] vmbr4: port 1(bond0) entered blocking state
kern  :info  : [Tue Jan 11 16:06:58 2022] vmbr4: port 1(bond0) entered disabled state
kern  :info  : [Tue Jan 11 16:06:58 2022] device bond0 entered promiscuous mode
kern  :info  : [Tue Jan 11 16:07:03 2022] bond0: (slave enp129s0f0): link status definitely up, 10000 Mbps full duplex
kern  :warn  : [Tue Jan 11 16:07:03 2022] bond0: Warning: No 802.3ad response from the link partner for any adapters in the bond
kern  :info  : [Tue Jan 11 16:07:03 2022] bond0: active interface up!
kern  :info  : [Tue Jan 11 16:07:03 2022] vmbr4: port 1(bond0) entered blocking state
kern  :info  : [Tue Jan 11 16:07:03 2022] vmbr4: port 1(bond0) entered forwarding state
kern  :info  : [Tue Jan 11 16:07:04 2022] bond0: (slave enp129s0f1): link status definitely up, 10000 Mbps full duplex

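While I chase down the switch configuration, the LACP state can also be checked directly from the node (bond0 is this node's 802.3ad bond):

Code:
root@/node/:~# cat /proc/net/bonding/bond0
# In 802.3ad mode, a partner MAC of 00:00:00:00:00:00 or a "churned" churn state
# indicates the link partner isn't participating in LACP.
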
At this point, I've been staring at this, bleary-eyed, for months and have posted about it here and in other forums, attempting to get my head around it. Hopefully, a fresh take on it may point to the ultimate cause of the issue and help us get this node back into production.

Thank you, kindly!
 
The "usual suspect" is the communication for corosync. If the physical line gets congested by "normal" traffic, quorum is lost and a node will fence itself - by rebooting. See https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_quorum

The recommendation, then, is to have a dedicated/isolated network link (1 Gbit/s) used only for this purpose.
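
On a clustered node, something like the following would show whether quorum is dropping around the reboot times (the -b -1 option needs a persistent journal):

Code:
pvecm status                # quorum / membership overview
journalctl -u corosync -b -1   # corosync log from the previous boot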

Just guessing...

That is not something that had ever occurred to me to check. Thank you for shedding light on this aspect of Proxmox clustering and configuration. However, our setup is not clustered and the problematic node is not communicating with any other nodes on the network.

(We are planning on getting clustering up and running eventually, though. I will certainly do my RTFM in advance of this!)

Right now I'm running some stress tests, reading/writing data over bond0 in an attempt to reliably reproduce the issue.
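
For reference, the stress test itself is nothing fancy, just sustained mixed I/O pushed over bond0 with fio (the target directory below is a placeholder for the NFS mount):

Code:
root@/node/:~# fio --name=nfs-stress --directory=/mnt/pve/<nfs-storage> \
    --rw=randrw --bs=64k --size=8G --numjobs=4 \
    --time_based --runtime=1800 --group_reporting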
 
I was noticing somewhat better outcomes on this node when running VMs and installer ISOs from local storage rather than over the bonded 10GbE NFS connection (i.e., no sudden node reboots, but still numerous hangs and segfaults when installing new Linux VMs). In an attempt to remove as many variables as possible, I deleted the NFS storage configurations from the node's Storage settings, then removed all but the main 1Gb Ethernet network interface configuration from Network (the cables are still physically plugged in), and rebooted.
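
(For the record, the rough CLI equivalent of those GUI steps is below; the storage ID is a placeholder, and the interface cleanup ends up in /etc/network/interfaces before the reboot.)

Code:
root@/node/:~# pvesm remove <nfs-storage-id>   # drop the NFS storage definition
root@/node/:~# nano /etc/network/interfaces    # remove the bond0 / vmbr4 / 10GbE stanzas, keep the 1GbE port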

So far, things seem to be running much better. I've installed, wiped, reinstalled, created templates, and cloned about 15 devices today with no errors (other than user errors) and not a single spontaneous reboot.

So, it seems that maybe the issue has roots in one of the 10GbE connections and/or the NFS storage configuration on the node. The NFS server itself is running smoothly for other nodes on our rack, so I don't suspect any issues further upstream. I'm going to pay a visit to our server room over the weekend to visually inspect connections and configurations at the switch level.

So far I've been gathering information via /var/log/syslog, dmesg, and tools like ip, lshw, and ethtool, without much more to show for it than what I've posted above. Are there any other things I can check to try and catch this node-killing culprit in the act?
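
In the meantime, the per-incident gathering looks roughly like this (interface names are this node's, and the previous-boot journal relies on the persistent setting mentioned earlier):

Code:
root@/node/:~# journalctl -b -1 -p warning     # warnings and errors from the boot before the crash
root@/node/:~# ip -s link                      # per-interface error and drop counters
root@/node/:~# ethtool -S enp129s0f0           # NIC statistics (repeat for each 10GbE port)
root@/node/:~# lshw -class network -short      # quick inventory of the NICs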
 
