All NICs / interfaces continuously going up / down or stuck at 100Mb/s... at wits' end here...

junta

Renowned Member
Aug 21, 2014
Dear community,

I am reaching out because I am at my wits' end. For the past few days, ALL my NICs have been constantly going up and down, and when they come back up, it is at 100Mb/s with no connectivity.
The ONLY thing that helps is manually (via console) re-running:

ethtool -s eno1np0 speed 1000 duplex full autoneg on
ethtool -s eno2np1 speed 1000 duplex full autoneg on

Unfortunately, I cannot say EXACTLY when it started, but I began investigating because I was getting connectivity cuts and degraded performance.

I have the following hardware:

Base Board Information
Manufacturer: Supermicro
Product Name: H12SSL-CT
Version: 1.02

BIOS: (latest available)
Revision H12SS-(i)(C)(CT)(NT)_3.3_AS1.05.02_SAA1.2.0-p
BIOS Revision: 3.3
BMC Firmware Revision: 1.05.02

CPU: AMD EPYC 7313P 16-Core Processor

Onboard NIC:
Subsystem: Super Micro Computer Inc BCM57416 NetXtreme-E [15d9:16d8]
Product Name: Broadcom P210tep NetXtreme-E Dual-port 10GBASE-T Ethernet PCIe Adapter
Part number: BCM957416A4160

PCI NIC:
HP NC365T network card - Intel 82580 Gigabit Ethernet Controller / 4 × RJ45 (1GbE)

Normally I use the onboard NICs in bond0 as vmbr0, and the PCI NIC in bond1 as vmbr1.
For the sake of testing, I decided to focus ONLY on the onboard NICs (and I removed the PCI card from the server).

cat /etc/network/interfaces

auto eno1np0
iface eno1np0 inet manual
    pre-up ethtool -s eno1np0 speed 1000 duplex full

auto eno2np1
iface eno2np1 inet manual
    pre-up ethtool -s eno2np1 speed 1000 duplex full

auto bond0
iface bond0 inet manual
    bond-slaves eno1np0 eno2np1
    bond-miimon 100
    bond-mode 802.3ad
    bond-xmit-hash-policy layer3+4
    bond-lacp-rate fast
    bond-updelay 200
    bond-downdelay 200

auto vmbr0
iface vmbr0 inet static
    address 10.10.10.254/24
    gateway 10.10.10.1
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094
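
For anyone following along: after each change to the file I re-apply the config without rebooting and re-check the bond state (ifreload is part of ifupdown2, which Proxmox VE uses by default):

Code:
# re-apply /etc/network/interfaces without rebooting
ifreload -a
# inspect per-slave speed/duplex and LACP aggregator state
cat /proc/net/bonding/bond0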

Bond status under the normal config:

cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v6.14.5-1-bpo12-pve
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer3+4 (1)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 200
Down Delay (ms): 200
Peer Notification Delay (ms): 0

802.3ad info
LACP active: on
LACP rate: fast
Min links: 0
Aggregator selection policy (ad_select): stable
System priority: 65535
System MAC address: 3c:ec:ef:9a:23:0e
Active Aggregator Info:
Aggregator ID: 1
Number of ports: 1
Actor Key: 9
Partner Key: 1002
Partner Mac Address: 70:a7:41:68:11:e0

Slave Interface: eno1np0
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 7
Permanent HW addr: 3c:ec:ef:9a:23:0e
Slave queue ID: 0
Aggregator ID: 1
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 1
details actor lacp pdu:
system priority: 65535
system mac address: 3c:ec:ef:9a:23:0e
port key: 9
port priority: 255
port number: 1
port state: 63
details partner lacp pdu:
system priority: 32768
system mac address: 70:a7:41:68:11:e0
oper key: 1002
port priority: 1
port number: 21
port state: 61

Slave Interface: eno2np1
MII Status: up
Speed: 100 Mbps
Duplex: full
Link Failure Count: 6
Permanent HW addr: 3c:ec:ef:9a:23:0f
Slave queue ID: 0
Aggregator ID: 2
Actor Churn State: churned
Partner Churn State: churned
Actor Churned Count: 1
Partner Churned Count: 2
details actor lacp pdu:
system priority: 65535
system mac address: 3c:ec:ef:9a:23:0e
port key: 7
port priority: 255
port number: 2
port state: 71
details partner lacp pdu:
system priority: 65535
system mac address: 00:00:00:00:00:00
oper key: 1
port priority: 255
port number: 1
port state: 1


=======================

ethtool output:
Settings for eno1np0:
Supported ports: [ TP ]
Supported link modes: 1000baseT/Full
10000baseT/Full
Supported pause frame use: Symmetric Receive-only
Supports auto-negotiation: Yes
Supported FEC modes: RS BASER
Advertised link modes: 1000baseT/Full
Advertised pause frame use: Symmetric
Advertised auto-negotiation: Yes
Advertised FEC modes: Not reported
Speed: 1000Mb/s
Lanes: 1
Duplex: Full
Auto-negotiation: on
Port: Twisted Pair
PHYAD: 12
Transceiver: internal
MDI-X: Unknown
Supports Wake-on: g
Wake-on: d
Current message level: 0x00002081 (8321)
drv tx_err hw
Link detected: yes
root@serverbox:~# ethtool eno2np1
Settings for eno2np1:
Supported ports: [ TP ]
Supported link modes: 1000baseT/Full
10000baseT/Full
Supported pause frame use: Symmetric Receive-only
Supports auto-negotiation: Yes
Supported FEC modes: RS BASER
Advertised link modes: Not reported
Advertised pause frame use: Symmetric
Advertised auto-negotiation: Yes
Advertised FEC modes: Not reported
Speed: 100Mb/s
Lanes: 1
Duplex: Full
Auto-negotiation: on
Port: Twisted Pair
PHYAD: 13
Transceiver: internal
MDI-X: Unknown
Supports Wake-on: g
Wake-on: d
Current message level: 0x00002081 (8321)
drv tx_err hw
Link detected: yes

Sample logs:

Jun 04 07:55:01 serverbox kernel: bnxt_en 0000:46:00.0 eno1np0: NIC Link is Down
Jun 04 07:55:01 serverbox kernel: bond0: (slave eno1np0): speed changed to 0 on port 1
Jun 04 07:55:01 serverbox kernel: bond0: (slave eno1np0): link status definitely down, disabling slave
Jun 04 07:55:01 serverbox kernel: bond0: active interface up!
Jun 04 07:55:04 serverbox kernel: bnxt_en 0000:46:00.0 eno1np0: NIC Link is Up, 100 Mbps full duplex, Flow control: none
Jun 04 07:55:04 serverbox kernel: bnxt_en 0000:46:00.0 eno1np0: EEE is not active
Jun 04 07:55:04 serverbox kernel: bnxt_en 0000:46:00.0 eno1np0: FEC autoneg off encoding: None
Jun 04 07:55:04 serverbox kernel: bond0: (slave eno1np0): invalid new link 3 on slave
Jun 04 07:55:04 serverbox kernel: bond0: (slave eno1np0): link status definitely up, 100 Mbps full duplex
Jun 04 07:55:09 serverbox kernel: bnxt_en 0000:46:00.0 eno1np0: NIC Link is Down
Jun 04 07:55:09 serverbox kernel: bond0: (slave eno1np0): speed changed to 0 on port 1
Jun 04 07:55:09 serverbox kernel: bond0: (slave eno1np0): link status definitely down, disabling slave
Jun 04 07:55:11 serverbox kernel: bnxt_en 0000:46:00.1 eno2np1: NIC Link is Down
Jun 04 07:55:11 serverbox kernel: bond0: (slave eno2np1): speed changed to 0 on port 2
Jun 04 07:55:11 serverbox kernel: bond0: (slave eno2np1): link status definitely down, disabling slave
Jun 04 07:55:11 serverbox kernel: bond0: now running without any active interface!
Jun 04 07:55:11 serverbox kernel: vmbr0: port 1(bond0) entered disabled state
Jun 04 07:55:12 serverbox kernel: bnxt_en 0000:46:00.0 eno1np0: NIC Link is Up, 1000 Mbps full duplex, Flow control: none
Jun 04 07:55:12 serverbox kernel: bnxt_en 0000:46:00.0 eno1np0: EEE is not active
Jun 04 07:55:12 serverbox kernel: bnxt_en 0000:46:00.0 eno1np0: FEC autoneg off encoding: None
Jun 04 07:55:12 serverbox kernel: bond0: (slave eno1np0): link status up, enabling it in 200 ms
Jun 04 07:55:12 serverbox kernel: bond0: (slave eno1np0): invalid new link 3 on slave
Jun 04 07:55:12 serverbox kernel: bond0: (slave eno1np0): link status definitely up, 1000 Mbps full duplex
Jun 04 07:55:12 serverbox kernel: bond0: active interface up!

Things I have tried / tested:

  • Moved the bond from the UniFi switch to a Cisco 3650 compact switch: same problem
  • Replaced all cabling with brand-new CAT6 cabling: same problem
  • Disabled all energy-saving-related settings in the BIOS: no luck
  • Removed the bonding and used only one interface (eno1np0): problem remains
  • Tried different kernels (6.14.5-1-bpo12-pve, 6.8.12-11-pve and 6.5): problem persists
  • All variations of ethtool commands, e.g.:
    • ethtool -K eno2np1 gso off gro off tso off tx off rx off rxvlan off txvlan off sg off
    • ethtool -s eno1np0 advertise 0x020
  • It happens with the onboard NIC, the PCI NIC, and even an extra NIC I added to test (AOC-STG-i2t): all the same

The ONLY thing that works (for a while) is manually executing:

ethtool -s eno1np0 speed 1000 duplex full autoneg on
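
In the meantime, to keep the box reachable, that command can be wrapped in a crude watchdog loop (just a stopgap sketch, assuming the interface names above; not a fix):

Code:
#!/bin/bash
# Stopgap: re-force 1000/full whenever a port is not at 1000 Mb/s.
# Note: /sys/class/net/<nic>/speed reads -1 (or errors) while the link is down,
# so a down link also triggers the ethtool call.
while true; do
    for nic in eno1np0 eno2np1; do
        if [ "$(cat /sys/class/net/"$nic"/speed 2>/dev/null)" != "1000" ]; then
            ethtool -s "$nic" speed 1000 duplex full autoneg on
        fi
    done
    sleep 10
done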

OK, so at this point I get what you are thinking: the motherboard is broken...
Fair point. But then I did the following test:
  • Booted a Debian live ISO (latest from the site)
  • Used a single interface (eno1np0): the link stayed up INDEFINITELY while performing continuous iperf testing / load
  • Redid the test with the bond created on Debian: the link stayed up INDEFINITELY under the same continuous iperf load (see the sketch below)
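
For reference, the continuous load looked roughly like this (a sketch; iperf3 assumed, and 10.10.10.20 is just a hypothetical LAN host running the server side):

Code:
# on another LAN host (hypothetical 10.10.10.20): start the server
iperf3 -s
# on the box under test: push traffic for an hour to keep the link loaded
iperf3 -c 10.10.10.20 -t 3600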

So I am going completely crazy. It must be Proxmox, right?
A new NIC is arriving later today to retest (Intel I350-T4).

Would anybody care to take a stab at this issue?
Many thanks in advance
 
> Booted a Debian live ISO (latest from the site)
If this was Debian 12 with kernel 6.1, it uses older Broadcom modules to drive your NICs than the Proxmox kernel 6.8 or newer does. Proxmox upgraded from kernel 6.5 to 6.8 with Proxmox VE 8.2; if this started after an update, that may be your problem. But the error we've seen most often is that the ports do not come up again at all.

You can try to blacklist the Broadcom InfiniBand module/driver, upgrade the firmware on the Broadcom NICs, or, if possible, downgrade to kernel 6.5. Or replace the Broadcom cards with something else.
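
For example, a sketch of the blacklist route (assuming bnxt_re, the Broadcom RoCE/InfiniBand companion module to bnxt_en, is the one loaded on your system):

Code:
# see which Broadcom modules are loaded
lsmod | grep bnxt
# prevent the RoCE/InfiniBand module from loading at boot
echo "blacklist bnxt_re" > /etc/modprobe.d/blacklist-bnxt_re.conf
update-initramfs -u
# reboot for the change to take effect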

Please report back how it went.
 
Thanks for the feedback.

I just got the new firmware from Supermicro for this device.

I was able to update it via the console with:
Code:
./bnxtnvm -dev=eno1np0 install 232-H11SSW-NT.pkg

After the reboot:
root@serverbox:~# ethtool -i eno1np0
driver: bnxt_en
version: 6.14.5-1-bpo12-pve
firmware-version: 232.0.155.2/pkg 232.1.132.8
expansion-rom-version:
bus-info: 0000:46:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no
root@serverbox:~# ethtool -i eno2np1
driver: bnxt_en
version: 6.14.5-1-bpo12-pve
firmware-version: 232.0.155.2/pkg 232.1.132.8
expansion-rom-version:
bus-info: 0000:46:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no

But the issue was still there...

Jun 04 11:42:35 serverbox kernel: bnxt_en 0000:46:00.0 eno1np0: NIC Link is Down
Jun 04 11:42:35 serverbox kernel: bond0: (slave eno1np0): speed changed to 0 on port 1
Jun 04 11:42:35 serverbox kernel: bond0: (slave eno1np0): link status definitely down, disabling slave
Jun 04 11:42:35 serverbox kernel: bond0: now running without any active interface!
Jun 04 11:42:35 serverbox kernel: vmbr0: port 1(bond0) entered disabled state
Jun 04 11:43:22 serverbox kernel: bnxt_en 0000:46:00.0 eno1np0: NIC Link is Up, 1000 Mbps full duplex, Flow control: none
Jun 04 11:43:22 serverbox kernel: bnxt_en 0000:46:00.0 eno1np0: EEE is not active
Jun 04 11:43:22 serverbox kernel: bnxt_en 0000:46:00.0 eno1np0: FEC autoneg off encoding: None
Jun 04 11:43:22 serverbox kernel: bond0: (slave eno1np0): link status up, enabling it in 200 ms
Jun 04 11:43:22 serverbox kernel: bond0: (slave eno1np0): invalid new link 3 on slave
Jun 04 11:43:22 serverbox kernel: bond0: (slave eno1np0): link status definitely up, 1000 Mbps full duplex
Jun 04 11:43:22 serverbox kernel: bond0: active interface up!
Jun 04 11:43:22 serverbox kernel: vmbr0: port 1(bond0) entered blocking state
Jun 04 11:43:22 serverbox kernel: vmbr0: port 1(bond0) entered forwarding state
Jun 04 11:45:15 serverbox kernel: bnxt_en 0000:46:00.0 eno1np0: NIC Link is Down
Jun 04 11:45:15 serverbox kernel: bond0: (slave eno1np0): speed changed to 0 on port 1
Jun 04 11:45:16 serverbox kernel: bond0: (slave eno1np0): link status definitely down, disabling slave
Jun 04 11:45:16 serverbox kernel: bond0: now running without any active interface!
Jun 04 11:45:16 serverbox kernel: vmbr0: port 1(bond0) entered disabled state
Jun 04 11:45:19 serverbox kernel: bnxt_en 0000:46:00.0 eno1np0: NIC Link is Up, 1000 Mbps full duplex, Flow control: none
Jun 04 11:45:19 serverbox kernel: bnxt_en 0000:46:00.0 eno1np0: EEE is not active
Jun 04 11:45:19 serverbox kernel: bnxt_en 0000:46:00.0 eno1np0: FEC autoneg off encoding: None
Jun 04 11:45:19 serverbox kernel: bond0: (slave eno1np0): link status up, enabling it in 200 ms
Jun 04 11:45:19 serverbox kernel: bond0: (slave eno1np0): invalid new link 3 on slave
Jun 04 11:45:19 serverbox kernel: bond0: (slave eno1np0): link status definitely up, 1000 Mbps full duplex
Jun 04 11:45:19 serverbox kernel: bond0: active interface up!
Jun 04 11:45:19 serverbox kernel: vmbr0: port 1(bond0) entered blocking state
Jun 04 11:45:19 serverbox kernel: vmbr0: port 1(bond0) entered forwarding state
Jun 04 11:46:15 serverbox kernel: bnxt_en 0000:46:00.0 eno1np0: NIC Link is Down
Jun 04 11:46:15 serverbox kernel: bond0: (slave eno1np0): speed changed to 0 on port 1
Jun 04 11:46:15 serverbox kernel: bond0: (slave eno1np0): link status definitely down, disabling slave
Jun 04 11:46:15 serverbox kernel: bond0: now running without any active interface!
Jun 04 11:46:15 serverbox kernel: vmbr0: port 1(bond0) entered disabled state
Jun 04 11:48:35 serverbox kernel: bnxt_en 0000:46:00.0 eno1np0: NIC Link is Up, 1000 Mbps full duplex, Flow control: none
Jun 04 11:48:35 serverbox kernel: bnxt_en 0000:46:00.0 eno1np0: EEE is not active
Jun 04 11:48:35 serverbox kernel: bnxt_en 0000:46:00.0 eno1np0: FEC autoneg off encoding: None
Jun 04 11:48:35 serverbox kernel: bond0: (slave eno1np0): link status up, enabling it in 200 ms
Jun 04 11:48:35 serverbox kernel: bond0: (slave eno1np0): invalid new link 3 on slave
Jun 04 11:48:35 serverbox kernel: bond0: (slave eno1np0): link status definitely up, 1000 Mbps full duplex
Jun 04 11:48:35 serverbox kernel: bond0: active interface up!
Jun 04 11:48:35 serverbox kernel: vmbr0: port 1(bond0) entered blocking state
Jun 04 11:48:35 serverbox kernel: vmbr0: port 1(bond0) entered forwarding state
Jun 04 11:48:56 serverbox kernel: bnxt_en 0000:46:00.0 eno1np0: NIC Link is Down
Jun 04 11:48:56 serverbox kernel: bond0: (slave eno1np0): speed changed to 0 on port 1
Jun 04 11:48:56 serverbox kernel: bond0: (slave eno1np0): link status definitely down, disabling slave
Jun 04 11:48:56 serverbox kernel: bond0: now running without any active interface!
Jun 04 11:48:56 serverbox kernel: vmbr0: port 1(bond0) entered disabled state
Jun 04 11:48:59 serverbox kernel: bnxt_en 0000:46:00.0 eno1np0: NIC Link is Up, 1000 Mbps full duplex, Flow control: none
Jun 04 11:48:59 serverbox kernel: bnxt_en 0000:46:00.0 eno1np0: EEE is not active
Jun 04 11:48:59 serverbox kernel: bnxt_en 0000:46:00.0 eno1np0: FEC autoneg off encoding: None
Jun 04 11:48:59 serverbox kernel: bond0: (slave eno1np0): invalid new link 3 on slave
Jun 04 11:48:59 serverbox kernel: bond0: (slave eno1np0): link status definitely up, 1000 Mbps full duplex
Jun 04 11:48:59 serverbox kernel: bond0: active interface up!
Jun 04 11:48:59 serverbox kernel: vmbr0: port 1(bond0) entered blocking state
Jun 04 11:48:59 serverbox kernel: vmbr0: port 1(bond0) entered forwarding state

As soon as there is some load, the link goes down and I need to run ethtool again to bring it back up...

Will await the new NIC; all out of ideas right now.
 
Update!

While waiting, I decided to hook up a new SSD and install Proxmox on it, without touching the original install. This had NO issues at all under any load.
So I decided to compare the installed packages on the defective system.

I saw a package called 'tuned' (https://tuned-project.org/); it is used for tuning power settings and the like. I decided to remove it and... ALL THE FLAPPING STOPPED!!

I lost a lot of time with this. Now to find out who put it there and commence corporal punishment!