Dear community,
I am reaching out because I am at my wits' end. For the past few days, ALL my NICs have been constantly going up and down, and when they do come back, it is at 100Mb, with no connectivity.
The ONLY thing that helps is manually (via console) re-running:
```
ethtool -s eno1np0 speed 1000 duplex full autoneg on
ethtool -s eno2np1 speed 1000 duplex full autoneg on
```
Unfortunately, I cannot say EXACTLY when it started, but I started investigating because I was getting connectivity cuts / degraded performance.
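Before anyone asks: yes, I could script that as a band-aid. A minimal watchdog sketch would look something like this (the 100Mb check and the 5 s poll interval are my own assumptions; /sys/class/net/<nic>/speed reads -1 or fails while the link is down):

```bash
#!/bin/bash
# Minimal watchdog sketch: re-force 1000/full whenever a port negotiates
# 100Mb or loses link. Interface names are from my setup; 5 s poll is arbitrary.
while true; do
    for nic in eno1np0 eno2np1; do
        speed=$(cat "/sys/class/net/$nic/speed" 2>/dev/null)
        # sysfs reports -1 (or the read fails) while the link is down
        if [ "$speed" = "100" ] || [ "$speed" = "-1" ] || [ -z "$speed" ]; then
            ethtool -s "$nic" speed 1000 duplex full autoneg on
        fi
    done
    sleep 5
done
```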
I have the following hardware:
Base Board Information
    Manufacturer: Supermicro
    Product Name: H12SSL-CT
    Version: 1.02
BIOS: (latest available)
    Revision: H12SS-(i)(C)(CT)(NT)_3.3_AS1.05.02_SAA1.2.0-p
    BIOS Revision: 3.3
    BMC Firmware Revision: 1.05.02
CPU: AMD EPYC 7313P 16-Core Processor
Onboard NIC:
    Subsystem: Super Micro Computer Inc BCM57416 NetXtreme-E [15d9:16d8]
    Product Name: Broadcom P210tep NetXtreme-E Dual-port 10GBASE-T Ethernet PCIe Adapter
    Part number: BCM957416A4160
PCI NIC:
    HP NC365T network card - Intel 82580 Gigabit Ethernet Controller / 4 × RJ45 (1GbE)

Normally I use the onboard NICs in bond0 as vmbr0, and the PCI NIC in bond1 as vmbr1.
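For anyone who wants to compare with their own box, the details above came out of roughly these commands (output trimmed by hand):

```bash
dmidecode -t baseboard -t bios     # board model plus BIOS/firmware revisions
lspci -nn | grep -i ethernet       # NIC controllers with [vendor:device] IDs
```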
For the sake of testing, I decided to focus ONLY on the onboard NICs (and I removed the PCI card from the server).
cat /etc/network/interfaces
```
auto eno1np0
iface eno1np0 inet manual
        pre-up ethtool -s eno1np0 speed 1000 duplex full

auto eno2np1
iface eno2np1 inet manual
        pre-up ethtool -s eno2np1 speed 1000 duplex full

auto bond0
iface bond0 inet manual
        bond-slaves eno1np0 eno2np1
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer3+4
        bond-lacp-rate fast
        bond-updelay 200
        bond-downdelay 200

auto vmbr0
iface vmbr0 inet static
        address 10.10.10.254/24
        gateway 10.10.10.1
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094
```
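Whenever I change this file I apply it and re-check what was actually negotiated, roughly like this (ifreload comes with ifupdown2, the stock setup on Proxmox; the grep patterns are just my habit):

```bash
ifreload -a                                          # apply /etc/network/interfaces
grep -E 'Slave Interface|MII Status|Speed' /proc/net/bonding/bond0
ethtool eno1np0 | grep -E 'Speed|Auto-negotiation|Link detected'
```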
Bond and ethtool status under the normal config:

```
ip link show bond0
Ethernet Channel Bonding Driver: v6.14.5-1-bpo12-pve

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer3+4 (1)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 200
Down Delay (ms): 200
Peer Notification Delay (ms): 0

802.3ad info
LACP active: on
LACP rate: fast
Min links: 0
Aggregator selection policy (ad_select): stable
System priority: 65535
System MAC address: 3c:ec:ef:9a:23:0e
Active Aggregator Info:
    Aggregator ID: 1
    Number of ports: 1
    Actor Key: 9
    Partner Key: 1002
    Partner Mac Address: 70:a7:41:68:11:e0

Slave Interface: eno1np0
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 7
Permanent HW addr: 3c:ec:ef:9a:23:0e
Slave queue ID: 0
Aggregator ID: 1
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 1
details actor lacp pdu:
    system priority: 65535
    system mac address: 3c:ec:ef:9a:23:0e
    port key: 9
    port priority: 255
    port number: 1
    port state: 63
details partner lacp pdu:
    system priority: 32768
    system mac address: 70:a7:41:68:11:e0
    oper key: 1002
    port priority: 1
    port number: 21
    port state: 61

Slave Interface: eno2np1
MII Status: up
Speed: 100 Mbps
Duplex: full
Link Failure Count: 6
Permanent HW addr: 3c:ec:ef:9a:23:0f
Slave queue ID: 0
Aggregator ID: 2
Actor Churn State: churned
Partner Churn State: churned
Actor Churned Count: 1
Partner Churned Count: 2
details actor lacp pdu:
    system priority: 65535
    system mac address: 3c:ec:ef:9a:23:0e
    port key: 7
    port priority: 255
    port number: 2
    port state: 71
details partner lacp pdu:
    system priority: 65535
    system mac address: 00:00:00:00:00:00
    oper key: 1
    port priority: 255
    port number: 1
    port state: 1

=======================
ethtool output:

Settings for eno1np0:
    Supported ports: [ TP ]
    Supported link modes:   1000baseT/Full
                            10000baseT/Full
    Supported pause frame use: Symmetric Receive-only
    Supports auto-negotiation: Yes
    Supported FEC modes: RS BASER
    Advertised link modes:  1000baseT/Full
    Advertised pause frame use: Symmetric
    Advertised auto-negotiation: Yes
    Advertised FEC modes: Not reported
    Speed: 1000Mb/s
    Lanes: 1
    Duplex: Full
    Auto-negotiation: on
    Port: Twisted Pair
    PHYAD: 12
    Transceiver: internal
    MDI-X: Unknown
    Supports Wake-on: g
    Wake-on: d
    Current message level: 0x00002081 (8321)
                           drv tx_err hw
    Link detected: yes

root@serverbox:~# ethtool eno2np1
Settings for eno2np1:
    Supported ports: [ TP ]
    Supported link modes:   1000baseT/Full
                            10000baseT/Full
    Supported pause frame use: Symmetric Receive-only
    Supports auto-negotiation: Yes
    Supported FEC modes: RS BASER
    Advertised link modes:  Not reported
    Advertised pause frame use: Symmetric
    Advertised auto-negotiation: Yes
    Advertised FEC modes: Not reported
    Speed: 100Mb/s
    Lanes: 1
    Duplex: Full
    Auto-negotiation: on
    Port: Twisted Pair
    PHYAD: 13
    Transceiver: internal
    MDI-X: Unknown
    Supports Wake-on: g
    Wake-on: d
    Current message level: 0x00002081 (8321)
                           drv tx_err hw
    Link detected: yes
```
Sample logs:

```
Jun 04 07:55:01 serverbox kernel: bnxt_en 0000:46:00.0 eno1np0: NIC Link is Down
Jun 04 07:55:01 serverbox kernel: bond0: (slave eno1np0): speed changed to 0 on port 1
Jun 04 07:55:01 serverbox kernel: bond0: (slave eno1np0): link status definitely down, disabling slave
Jun 04 07:55:01 serverbox kernel: bond0: active interface up!
Jun 04 07:55:04 serverbox kernel: bnxt_en 0000:46:00.0 eno1np0: NIC Link is Up, 100 Mbps full duplex, Flow control: none
Jun 04 07:55:04 serverbox kernel: bnxt_en 0000:46:00.0 eno1np0: EEE is not active
Jun 04 07:55:04 serverbox kernel: bnxt_en 0000:46:00.0 eno1np0: FEC autoneg off encoding: None
Jun 04 07:55:04 serverbox kernel: bond0: (slave eno1np0): invalid new link 3 on slave
Jun 04 07:55:04 serverbox kernel: bond0: (slave eno1np0): link status definitely up, 100 Mbps full duplex
Jun 04 07:55:09 serverbox kernel: bnxt_en 0000:46:00.0 eno1np0: NIC Link is Down
Jun 04 07:55:09 serverbox kernel: bond0: (slave eno1np0): speed changed to 0 on port 1
Jun 04 07:55:09 serverbox kernel: bond0: (slave eno1np0): link status definitely down, disabling slave
Jun 04 07:55:11 serverbox kernel: bnxt_en 0000:46:00.1 eno2np1: NIC Link is Down
Jun 04 07:55:11 serverbox kernel: bond0: (slave eno2np1): speed changed to 0 on port 2
Jun 04 07:55:11 serverbox kernel: bond0: (slave eno2np1): link status definitely down, disabling slave
Jun 04 07:55:11 serverbox kernel: bond0: now running without any active interface!
Jun 04 07:55:11 serverbox kernel: vmbr0: port 1(bond0) entered disabled state
Jun 04 07:55:12 serverbox kernel: bnxt_en 0000:46:00.0 eno1np0: NIC Link is Up, 1000 Mbps full duplex, Flow control: none
Jun 04 07:55:12 serverbox kernel: bnxt_en 0000:46:00.0 eno1np0: EEE is not active
Jun 04 07:55:12 serverbox kernel: bnxt_en 0000:46:00.0 eno1np0: FEC autoneg off encoding: None
Jun 04 07:55:12 serverbox kernel: bond0: (slave eno1np0): link status up, enabling it in 200 ms
Jun 04 07:55:12 serverbox kernel: bond0: (slave eno1np0): invalid new link 3 on slave
Jun 04 07:55:12 serverbox kernel: bond0: (slave eno1np0): link status definitely up, 1000 Mbps full duplex
Jun 04 07:55:12 serverbox kernel: bond0: active interface up!
```
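For anyone wanting to gauge the frequency, the flaps are easy to count from the journal, e.g.:

```bash
# Count link-down events per port since boot (kernel messages via journald)
journalctl -k -b | grep -c 'eno1np0: NIC Link is Down'
journalctl -k -b | grep -c 'eno2np1: NIC Link is Down'
```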
Things I have tried / tested:
- Moved the bond from the UniFi switch to a Cisco 3650 compact switch: same problem
- Replaced all cabling with brand-new CAT6 cabling: same problem
- Disabled all energy-saving settings in the BIOS: no luck
- Removed the bonding and used only one interface (eno1np0): the problem remains
- Tried different kernels (6.14.5-1-bpo12-pve, 6.8.12-11-pve and 6.5): problem persists
- All variations of ethtool commands (a sample is sketched after this list), e.g.:
  - ethtool -K eno2np1 gso off gro off tso off tx off rx off rxvlan off txvlan off sg off
  - ethtool -s eno1np0 advertise 0x020
- It happens with the onboard NIC, the PCI NIC, and even an extra NIC I added for testing (AOC-STG-i2t): all the same
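To give an idea of what "all variations" covers, this is the kind of thing I mean (reconstructed from memory, not exact or exhaustive; the EEE knob may not even be supported on every NIC here):

```bash
# A sample of the kind of ethtool variations tried (from memory)
ethtool --set-eee eno1np0 eee off                       # rule out Energy-Efficient Ethernet
ethtool -A eno1np0 autoneg off rx off tx off            # disable pause-frame negotiation
ethtool -s eno1np0 autoneg off speed 1000 duplex full   # hard-force, no autoneg
```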
The ONLY thing that works, and only for some time, is manually executing:
```
ethtool -s eno1np0 speed 1000 duplex full autoneg on
```
OK, so at this point I get what you are thinking: the mobo is broken...
Fair point. But then I did the following test:
- Booted a Debian live ISO (the latest from the site)
- Ran it and used one interface, eno1np0: the link stayed up INDEFINITELY while performing continuous iperf testing / load (the load loop is sketched after this list)
- Redid the test with the bond created on the Debian live system: the link stayed up INDEFINITELY under the same continuous iperf load
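The load loop itself was nothing fancy, something along these lines (10.10.10.10 stands in for my iperf3 server; any second box on the same switch will do):

```bash
# Endless iperf3 load against a second machine on the same switch
while true; do
    iperf3 -c 10.10.10.10 -t 60    # 60 s runs, back to back
done
```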
So I am going completely crazy. It must be Proxmox, right?
A new NIC is arriving later today to retest (Intel I350-T4).
Would anybody care to take a stab at this issue?
Many thanks in advance