I've been diagnosing a Proxmox node on our network that began acting up about a year ago; the primary symptom is spontaneous, frequent node reboots under light load. I've offloaded every VM and container from the node and reinstalled the latest version of PVE 6 to create a clean system for continued diagnosis. I feel like I'm getting close to a root cause, be that hardware, software, or some collision between the two, but the system remains unstable. Here's a quick summary of the symptoms I'm experiencing:
- VM Usage: VMs are generally unstable when running on this system, though results tend to be better for VMs on local-lvm storage than for those stored on our NAS and accessed over NFS via a bonded 10GbE network interface.
- VM Creation: Dozens of attempts to create test Linux VMs on this node have failed or required several attempts and new Hard Disk files, even when using a locally stored ISO that installs without issue on a similar node. The only successful installation has been from a local ISO to local-lvm storage. The most recent attempts to install from a NAS-hosted ISO over NFS have frequently resulted in a spontaneous node reboot.
- Manufacturer (Supermicro) offline diagnostics: I've run full diagnostics on this server several times, including memory, disk, and PSU tests. Each result comes back with passing status on all available tests. Manually tested redundant power by pulling cords, one at a time. Passed that, too.
- IPMI logs: other than recording that I pulled the power cords, no diagnostic info relating to the spontaneous reboots appears to be held in these logs.
- Syslog: The reboots are spontaneous, with no consistent events recorded in /var/log/syslog immediately prior to the reboots. Occasionally the syslog is splattered with null characters at the time of reboot, indicating that it was in the middle of recording something else when the reboot happened.
- Journalctl: I just set the storage parameter to "persistent" via /etc/systemd/journald.conf to try and capture more data across reboots (a sketch of that change, and how I plan to review prior boots, follows after this list), but I don't have anything to report yet that I haven't also found in...
- Dmesg: Reviewing the kernel log for messages with a level of "warn" and above did turn up some potential leads regarding both storage and networking, which seems appropriate given the symptoms I'm seeing. Their respective entries from dmesg are shown after the journald sketch below.
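For reference, this is the journald change I made and how I plan to pull logs from the boot that preceded a crash once persistence has captured one. It's a sketch using standard systemd options and commands, nothing Proxmox-specific:
Code:
# /etc/systemd/journald.conf -- the only line I changed
[Journal]
Storage=persistent

# apply the change without rebooting
root@/node/:~# systemctl restart systemd-journald

# after the next spontaneous reboot: list captured boots, then review the previous one
root@/node/:~# journalctl --list-boots
root@/node/:~# journalctl -b -1 -p warning
With that noted, here are the dmesg entries mentioned above.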
Code:
root@/node/:~# dmesg -HTx --level=warn,err,crit,alert,emerg
kern :warn : [Tue Jan 11 16:06:44 2022] #17 #18 #19 #20 #21 #22 #23 #24 #25 #26 #27 #28 #29 #30 #31
kern :warn : [Tue Jan 11 16:06:44 2022] mtrr: your CPUs had inconsistent fixed MTRR settings
kern :warn : [Tue Jan 11 16:06:44 2022] mtrr: your CPUs had inconsistent variable MTRR settings
kern :warn : [Tue Jan 11 16:06:45 2022] PPR X2APIC NX GT IA GA PC GA_vAPIC
kern :warn : [Tue Jan 11 16:06:45 2022] PPR X2APIC NX GT IA GA PC GA_vAPIC
kern :warn : [Tue Jan 11 16:06:45 2022] PPR X2APIC NX GT IA GA PC GA_vAPIC
kern :warn : [Tue Jan 11 16:06:45 2022] PPR X2APIC NX GT IA GA PC GA_vAPIC
kern :warn : [Tue Jan 11 16:06:45 2022] PPR X2APIC NX GT IA GA PC GA_vAPIC
kern :warn : [Tue Jan 11 16:06:45 2022] PPR X2APIC NX GT IA GA PC GA_vAPIC
kern :warn : [Tue Jan 11 16:06:45 2022] PPR X2APIC NX GT IA GA PC GA_vAPIC
kern :warn : [Tue Jan 11 16:06:45 2022] PPR X2APIC NX GT IA GA PC GA_vAPIC
kern :warn : [Tue Jan 11 16:06:45 2022] platform eisa.0: EISA: Cannot allocate resource for mainboard
kern :warn : [Tue Jan 11 16:06:45 2022] platform eisa.0: Cannot allocate resource for EISA slot 1
kern :warn : [Tue Jan 11 16:06:45 2022] platform eisa.0: Cannot allocate resource for EISA slot 2
kern :warn : [Tue Jan 11 16:06:45 2022] platform eisa.0: Cannot allocate resource for EISA slot 3
kern :err : [Tue Jan 11 16:06:47 2022] mpt3sas_cm0: overriding NVDATA EEDPTagMode setting
kern :warn : [Tue Jan 11 16:06:56 2022] spl: loading out-of-tree module taints kernel.
kern :warn : [Tue Jan 11 16:06:56 2022] znvpair: module license 'CDDL' taints kernel.
kern :warn : [Tue Jan 11 16:06:56 2022] Disabling lock debugging due to kernel taint
kern :warn : [Tue Jan 11 16:06:56 2022] ipmi_si dmi-ipmi-si.0: The BMC does not support clearing the recv irq bit, compensating, bu
kern :warn : [Tue Jan 11 16:06:57 2022] new mount options do not match the existing superblock, will be ignored
kern :warn : [Tue Jan 11 16:06:57 2022] device-mapper: thin: Data device (dm-3) discard unsupported: Disabling discard passdown.
kern :warn : [Tue Jan 11 16:07:03 2022] bond0: Warning: No 802.3ad response from the link partner for any adapters in the bond
The message mpt3sas_cm0: overriding NVDATA EEDPTagMode setting shows up on the screen during boot and is referenced in a number of forums, including the Proxmox Forum post here, with descriptions of spontaneous reboots.
Code:
root@/node/:~# dmesg -HTx | grep "mpt3sas_cm0"
kern :info : [Tue Jan 11 16:06:45 2022] mpt3sas_cm0: 63 BIT PCI BUS DMA ADDRESSING SUPPORTED, total mem (264051344 kB)
kern :info : [Tue Jan 11 16:06:45 2022] mpt3sas_cm0: CurrentHostPageSize is 0: Setting default host page size to 4k
kern :info : [Tue Jan 11 16:06:45 2022] mpt3sas_cm0: MSI-X vectors supported: 96
kern :info : [Tue Jan 11 16:06:45 2022] mpt3sas_cm0: 0 64
kern :info : [Tue Jan 11 16:06:45 2022] mpt3sas_cm0: High IOPs queues : disabled
kern :info : [Tue Jan 11 16:06:45 2022] mpt3sas_cm0: iomem(0x00000000b6440000), mapped(0x00000000094fbf45), size(65536)
kern :info : [Tue Jan 11 16:06:45 2022] mpt3sas_cm0: ioport(0x0000000000003000), size(256)
kern :info : [Tue Jan 11 16:06:46 2022] mpt3sas_cm0: CurrentHostPageSize is 0: Setting default host page size to 4k
kern :info : [Tue Jan 11 16:06:46 2022] mpt3sas_cm0: sending diag reset !!
kern :info : [Tue Jan 11 16:06:47 2022] mpt3sas_cm0: diag reset: SUCCESS
kern :info : [Tue Jan 11 16:06:47 2022] mpt3sas_cm0: scatter gather: sge_in_main_msg(1), sge_per_chain(7), sge_per_io(128), chains_per_io(19)
kern :info : [Tue Jan 11 16:06:47 2022] mpt3sas_cm0: request pool(0x00000000b8c68811) - dma(0xffb80000): depth(3200), frame_size(128), pool_size(400 kB)
kern :info : [Tue Jan 11 16:06:47 2022] mpt3sas_cm0: sense pool(0x00000000f7470cb1)- dma(0xff380000): depth(2939),element_size(96), pool_size(275 kB)
kern :info : [Tue Jan 11 16:06:47 2022] mpt3sas_cm0: config page(0x00000000bae55bea) - dma(0xff2fa000): size(512)
kern :info : [Tue Jan 11 16:06:47 2022] mpt3sas_cm0: Allocated physical memory: size(4704 kB)
kern :info : [Tue Jan 11 16:06:47 2022] mpt3sas_cm0: Current Controller Queue Depth(2936),Max Controller Queue Depth(3072)
kern :info : [Tue Jan 11 16:06:47 2022] mpt3sas_cm0: Scatter Gather Elements per IO(128)
kern :info : [Tue Jan 11 16:06:47 2022] mpt3sas_cm0: _base_display_fwpkg_version: complete
kern :err : [Tue Jan 11 16:06:47 2022] mpt3sas_cm0: overriding NVDATA EEDPTagMode setting
kern :info : [Tue Jan 11 16:06:47 2022] mpt3sas_cm0: LSISAS3008: FWVersion(16.00.01.00), ChipRevision(0x02), BiosVersion(08.37.00.00)
kern :info : [Tue Jan 11 16:06:47 2022] mpt3sas_cm0: Protocol=(Initiator), Capabilities=(Raid,TLR,EEDP,Snapshot Buffer,Diag Trace Buffer,Task Set Full,NCQ)
kern :info : [Tue Jan 11 16:06:47 2022] mpt3sas_cm0: sending port enable !!
kern :info : [Tue Jan 11 16:06:48 2022] mpt3sas_cm0: hba_port entry: 000000002a4f8565, port: 255 is added to hba_port list
kern :info : [Tue Jan 11 16:06:48 2022] mpt3sas_cm0: host_add: handle(0x0001), sas_addr(0x5003048022eac7f0), phys(8)
kern :info : [Tue Jan 11 16:06:55 2022] mpt3sas_cm0: port enable: SUCCESS
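If the EEDPTagMode message turns out to be more than noise, my next step is to compare the mpt3sas driver version shipped with this PVE kernel against the HBA firmware already visible in the log above. To be clear, this is just my own checklist and not a confirmed cause; sas3flash is Broadcom's flash utility and isn't installed by default:
Code:
# driver version bundled with the running kernel
root@/node/:~# modinfo -F version mpt3sas

# firmware/BIOS versions reported at boot (same values as in the log above)
root@/node/:~# dmesg | grep -i "fwversion"

# controller details from the PCI bus
root@/node/:~# lspci -nnk | grep -iA3 "sas3008"

# optional: if Broadcom's sas3flash is installed, it reads the firmware off the card directly
root@/node/:~# sas3flash -list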
The message bond0: Warning: No 802.3ad response from the link partner for any adapters in the bond is concerning because of the particularly bad results we get with this node while trying to access any files on shared storage. I'm looking into whether we have a switch configuration issue, though I'm skeptical that an issue of this sort could, by itself, cause the catastrophic symptoms we're seeing.
Code:
root@/node/:~# dmesg -HTx | grep "bond0"
kern :info : [Tue Jan 11 16:06:58 2022] bond0: (slave enp129s0f0): Enslaving as a backup interface with a down link
kern :info : [Tue Jan 11 16:06:58 2022] bond0: (slave enp129s0f1): Enslaving as a backup interface with a down link
kern :info : [Tue Jan 11 16:06:58 2022] vmbr4: port 1(bond0) entered blocking state
kern :info : [Tue Jan 11 16:06:58 2022] vmbr4: port 1(bond0) entered disabled state
kern :info : [Tue Jan 11 16:06:58 2022] device bond0 entered promiscuous mode
kern :info : [Tue Jan 11 16:07:03 2022] bond0: (slave enp129s0f0): link status definitely up, 10000 Mbps full duplex
kern :warn : [Tue Jan 11 16:07:03 2022] bond0: Warning: No 802.3ad response from the link partner for any adapters in the bond
kern :info : [Tue Jan 11 16:07:03 2022] bond0: active interface up!
kern :info : [Tue Jan 11 16:07:03 2022] vmbr4: port 1(bond0) entered blocking state
kern :info : [Tue Jan 11 16:07:03 2022] vmbr4: port 1(bond0) entered forwarding state
kern :info : [Tue Jan 11 16:07:04 2022] bond0: (slave enp129s0f1): link status definitely up, 10000 Mbps full duplex
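While I chase the switch side, this is how I'm checking what the node itself thinks of the LACP negotiation, along with what I expect the bond stanza in /etc/network/interfaces to look like. The interface names come from the log above; the switch-port configuration itself is still an assumption I need to verify:
Code:
# current bond state as the kernel sees it -- the LACP/Partner fields are the interesting ones
root@/node/:~# grep -E "Bonding Mode|MII Status|LACP|Partner" /proc/net/bonding/bond0

# the bond definition I would expect in /etc/network/interfaces for an 802.3ad setup
auto bond0
iface bond0 inet manual
        bond-slaves enp129s0f0 enp129s0f1
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer3+4
As far as I understand, this warning is exactly what the kernel prints when the link partner isn't sending LACPDUs, so confirming the switch side seems worthwhile before blaming the node.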
At this point, I've been staring at this, bleary-eyed, for months and have posted about it here and in other forums, attempting to get my head around it. Hopefully, a fresh take on it may point to the ultimate cause of the issue and help us get this node back into production.
Thank you, kindly!