ixgbe driver hang up | Detected Tx Unit Hang Tx Queue

sebeschn

Dec 31, 2022
Hello,

I have two Supermicro servers (SM-HV01 and SM-HV02) running Proxmox VE 7.3-4. The servers are directly connected to each other with two 10 Gbit/s fiber DAC cables (enp3s0f0 and enp3s0f1) and one 10 Gbit/s Ethernet cable (eno2). The outside interface of each server (eno1) is connected to the datacenter uplink switch. The fiber interfaces are bonded with LACP (802.3ad). The interfaces eno1 and eno2 (driver: igb) work fine and show no problems. Since I upgraded to Proxmox VE 7.3-4, the fiber interfaces go down under heavy TX load (see the log in the Syslog section). The interesting part is that only the server SM-HV01 produces this error. The only workaround is rebooting the hypervisor SM-HV01.

What I have already tried:
- disable offloading on both servers and interfaces
Code:
ethtool -K enp3s0f0 gso off gro off tso off tx off rx off rxvlan off txvlan off sg off
ethtool -K enp3s0f1 gso off gro off tso off tx off rx off rxvlan off txvlan off sg off

- disable PCIe ASPM in /etc/default/grub
Code:
GRUB_CMDLINE_LINUX_DEFAULT="quiet pcie_aspm=off"

- update ixgbe driver to 5.18.6

- disable port enp3s0f0 or enp3s0f1

- try another bond mode (active-backup); since that did not help, I switched back to LACP

- disable jumbo frames on enp3s0f0 and enp3s0f1
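
To double-check that the first two items above really took effect, something like the following rough sketch could be used. It assumes that `ethtool -k` prints the long feature names matching the short flags used above (for example `tso` appears as `tcp-segmentation-offload`). The function names `offloads_still_on` and `has_boot_param` are made up for this example.

```shell
# List the toggled offload features that `ethtool -k <iface>` (read from stdin)
# still reports as "on". Entries marked "on [fixed]" are printed as well, since
# the driver does not allow changing them.
offloads_still_on() {
    awk -F': ' '
        /^(rx-checksumming|tx-checksumming|scatter-gather|tcp-segmentation-offload|generic-segmentation-offload|generic-receive-offload|rx-vlan-offload|tx-vlan-offload):/ {
            split($2, v, " ")
            if (v[1] == "on") print $1
        }'
}

# Succeed only if the exact parameter ($1) is present on the kernel command
# line read from stdin.
has_boot_param() {
    tr ' ' '\n' | grep -qxF -- "$1"
}
```

`ethtool -k enp3s0f0 | offloads_still_on` prints nothing when all eight features are off, and `has_boot_param pcie_aspm=off < /proc/cmdline` only succeeds if the GRUB change actually reached the running kernel.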


NIC

Host (SM-HV01 and SM-HV02)
Code:
lspci -nnk | grep -A2 Ethernet

eno1
01:00.0 Ethernet controller [0200]: Intel Corporation I350 Gigabit Network Connection [8086:1521] (rev 01)
        DeviceName:  Onboard Intel Ethernet 1
        Subsystem: Super Micro Computer Inc I350 Gigabit Network Connection [15d9:1521]
        Kernel driver in use: igb
--
eno2
01:00.1 Ethernet controller [0200]: Intel Corporation I350 Gigabit Network Connection [8086:1521] (rev 01)
        DeviceName:  Onboard Intel Ethernet 2
        Subsystem: Super Micro Computer Inc I350 Gigabit Network Connection [15d9:1521]
        Kernel driver in use: igb
--
enp3s0f0 and enp3s0f1
03:00.0 Ethernet controller [0200]: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection [8086:10fb] (rev 01)
        Subsystem: Hewlett-Packard Company Ethernet 10Gb 2-port 560SFP+ Adapter [103c:17d3]
        Kernel driver in use: ixgbe
        Kernel modules: ixgbe
03:00.1 Ethernet controller [0200]: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection [8086:10fb] (rev 01)
        Subsystem: Hewlett-Packard Company Ethernet 10Gb 2-port 560SFP+ Adapter [103c:17d3]
        Kernel driver in use: ixgbe
        Kernel modules: ixgbe


Driver

Host (SM-HV01)
Code:
root@SM-HV01:~# ethtool -i enp3s0f0
driver: ixgbe
version: 5.18.6
firmware-version: 0x80000835, 1.1200.0
expansion-rom-version:
bus-info: 0000:03:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

root@SM-HV01:~# ethtool -i enp3s0f1
driver: ixgbe
version: 5.18.6
firmware-version: 0x80000835, 1.1200.0
expansion-rom-version: 
bus-info: 0000:03:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

Host (SM-HV02)
Code:
root@SM-HV02:~# ethtool -i enp3s0f0
driver: ixgbe
version: 5.18.6
firmware-version: 0x80000811, 1.1099.0
expansion-rom-version:
bus-info: 0000:03:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

root@SM-HV02:~# ethtool -i enp3s0f1
driver: ixgbe
version: 5.18.6
firmware-version: 0x80000811, 1.1099.0
expansion-rom-version:
bus-info: 0000:03:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes
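
The two hosts run the same driver version but report different firmware (0x80000835 on SM-HV01, 0x80000811 on SM-HV02). Whether that difference matters is not established here. A small helper like the one below (the name `fw_version` is hypothetical) makes it easy to pull just that field out of `ethtool -i` on each host and compare.

```shell
# Print only the firmware-version value from `ethtool -i <iface>` output (stdin).
fw_version() {
    awk -F': ' '$1 == "firmware-version" { print $2 }'
}
```

Example: `ethtool -i enp3s0f0 | fw_version`, run once per host and per port.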


Network configuration

Host (SM-HV01)
Code:
auto lo
iface lo inet loopback

iface eno1 inet manual

iface eno2 inet manual

auto enp3s0f0
iface enp3s0f0 inet manual
        bond-master bond1

auto enp3s0f1
iface enp3s0f1 inet manual
        bond-master bond1

auto eno2.11
iface eno2.11 inet static
        address 0.0.0.0

auto eno2.21
iface eno2.21 inet static
        address 10.x.x.x/27
        alias HVCL01
#SM-VSW-HVCL01

auto eno2.22
iface eno2.22 inet static
        address 0.0.0.0
        mask 0.0.0.0

auto bond1
iface bond1 inet static
        address 0.0.0.0/32
        bond-slaves enp3s0f0 enp3s0f1
        bond-miimon 100
        bond-mode 802.3ad
        bond-lacp-rate 1

auto bond1.31
iface bond1.31 inet static
        address 0.0.0.0
        alias ISCSI01

auto bond1.32
iface bond1.32 inet static
        address 0.0.0.0
        alias ISCSI02

auto bond1.41
iface bond1.41 inet static
        address 0.0.0.0
        alias DMZ01

auto bond1.42
iface bond1.42 inet static
        address 0.0.0.0
        alias DMZ02

auto vmbr0
iface vmbr0 inet static
        address 10.x.x.x/24
        bridge-ports eno1
        bridge-stp off
        bridge-fd 0
#OUTSIDE

auto vmbr11
iface vmbr11 inet static
        address 10.x.x.x/24
        gateway 10.x.x.x
        bridge-ports eno2.11
        bridge-stp off
        bridge-fd 0
        alias MGMT01
#SM-VSW-MGMT01

auto vmbr22
iface vmbr22 inet static
        address 0.0.0.0
        bridge-ports eno2.22
        bridge-stp off
        bridge-fd 0
        alias FWCL01
        mask 0.0.0.0
#SM-VSW-FWCL01

auto vmbr31
iface vmbr31 inet static
        address 0.0.0.0
        bridge-ports bond1.31
        bridge-stp off
        bridge-fd 0
        alias ISCSI01
#SM-VSW-ISCSI01

auto vmbr32
iface vmbr32 inet static
        address 0.0.0.0
        bridge-ports bond1.32
        bridge-stp off
        bridge-fd 0
        alias ISCSI02
#SM-VSW-ISCSI02

auto vmbr41
iface vmbr41 inet static
        address 0.0.0.0/32
        bridge-ports bond1.41
        bridge-stp off
        bridge-fd 0
#SM-VSW-DMZ01

auto vmbr42
iface vmbr42 inet static
        address 0.0.0.0
        bridge-ports bond1.42
        bridge-stp off
        bridge-fd 0
#SM-VSW-DMZ02

Host (SM-HV02)
Code:
auto lo
iface lo inet loopback

iface eno1 inet manual

iface eno2 inet manual

auto enp3s0f0
iface enp3s0f0 inet manual
        bond-master bond1

auto enp3s0f1
iface enp3s0f1 inet manual
        bond-master bond1

auto eno2.11
iface eno2.11 inet static
        address 0.0.0.0

auto eno2.21
iface eno2.21 inet static
        address 10.x.x.x/27
        alias HVCL01
#SM-VSW-HVCL01

auto eno2.22
iface eno2.22 inet static
        address 0.0.0.0
        mask 0.0.0.0

auto bond1
iface bond1 inet static
        address 0.0.0.0/32
        bond-slaves enp3s0f0 enp3s0f1
        bond-miimon 100
        bond-mode 802.3ad
        bond-lacp-rate 1

auto bond1.31
iface bond1.31 inet static
        address 0.0.0.0
        alias ISCSI01

auto bond1.32
iface bond1.32 inet static
        address 0.0.0.0
        alias ISCSI02

auto bond1.41
iface bond1.41 inet static
        address 0.0.0.0
        alias DMZ01

auto bond1.42
iface bond1.42 inet static
        address 0.0.0.0
        alias DMZ02

auto vmbr0
iface vmbr0 inet static
        address 10.x.x.x/24
        bridge-ports eno1
        bridge-stp off
        bridge-fd 0
#OUTSIDE

auto vmbr11
iface vmbr11 inet static
        address 10.x.x.x/24
        gateway 10.x.x.x
        bridge-ports eno2.11
        bridge-stp off
        bridge-fd 0
        alias MGMT01
#SM-VSW-MGMT01

auto vmbr22
iface vmbr22 inet static
        address 0.0.0.0
        bridge-ports eno2.22
        bridge-stp off
        bridge-fd 0
        alias FWCL01
        mask 0.0.0.0
#SM-VSW-FWCL01

auto vmbr31
iface vmbr31 inet static
        address 0.0.0.0
        bridge-ports bond1.31
        bridge-stp off
        bridge-fd 0
        alias ISCSI01
#SM-VSW-ISCSI01

auto vmbr32
iface vmbr32 inet static
        address 0.0.0.0
        bridge-ports bond1.32
        bridge-stp off
        bridge-fd 0
        alias ISCSI02
#SM-VSW-ISCSI02

auto vmbr41
iface vmbr41 inet static
        address 0.0.0.0/32
        bridge-ports bond1.41
        bridge-stp off
        bridge-fd 0
#SM-VSW-DMZ01

auto vmbr42
iface vmbr42 inet static
        address 0.0.0.0
        bridge-ports bond1.42
        bridge-stp off
        bridge-fd 0
#SM-VSW-DMZ02
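
With this configuration the kernel exposes the live bond state in /proc/net/bonding/bond1. The sketch below summarises each slave's MII status and link failure count from that file. It assumes the usual field labels of the bonding driver ("Slave Interface", "MII Status", "Link Failure Count"), and `bond_slave_summary` is a name invented for this example.

```shell
# Print "<slave> <MII status> <link failure count>" for every slave listed in
# /proc/net/bonding/<bond> (read from stdin). The bond-level "MII Status" line
# comes before the first slave and is ignored.
bond_slave_summary() {
    awk -F': ' '
        $1 == "Slave Interface"                   { slave = $2 }
        slave != "" && $1 == "MII Status"         { status = $2 }
        slave != "" && $1 == "Link Failure Count" { print slave, status, $2 }
    '
}
```

Example: `bond_slave_summary < /proc/net/bonding/bond1`. A failure count that keeps growing on only one port would match the flapping seen in the logs.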


Syslog

Host (SM-HV01)
Code:
Dec 31 15:02:39 SM-HV01 kernel: ixgbe 0000:03:00.1 enp3s0f1: Detected Tx Unit Hang
  Tx Queue             <19>
  TDH, TDT             <0>, <5>
  next_to_use          <5>
  next_to_clean        <0>
tx_buffer_info[next_to_clean]
  time_stamp           <1000031da>
  jiffies              <1000035c8>
Dec 31 15:02:39 SM-HV01 kernel: ixgbe 0000:03:00.1 enp3s0f1: tx hang 4 detected on queue 19, resetting adapter
Dec 31 15:02:39 SM-HV01 kernel: ixgbe 0000:03:00.1 enp3s0f1: Reset adapter
Dec 31 15:02:40 SM-HV01 kernel: ixgbe 0000:03:00.1 enp3s0f1: RXDCTL.ENABLE for one or more queues not cleared within the polling period
Dec 31 15:02:40 SM-HV01 kernel: ixgbe 0000:03:00.1 enp3s0f1: TXDCTL.ENABLE for one or more queues not cleared within the polling period
Dec 31 15:02:40 SM-HV01 kernel: ixgbe 0000:03:00.1 enp3s0f1: PCIe transaction pending bit also did not clear.
Dec 31 15:02:40 SM-HV01 kernel: ixgbe 0000:03:00.1: primary disable timed out
Dec 31 15:02:40 SM-HV01 kernel: bond1: (slave enp3s0f1): speed changed to 0 on port 2
Dec 31 15:02:40 SM-HV01 kernel: bond1: (slave enp3s0f1): link status definitely down, disabling slave
Dec 31 15:02:40 SM-HV01 kernel: ixgbe 0000:03:00.1 enp3s0f1: detected SFP+: 3
Dec 31 15:02:41 SM-HV01 kernel: ixgbe 0000:03:00.1 enp3s0f1: NIC Link is Up 10 Gbps, Flow Control: RX/TX
Dec 31 15:02:41 SM-HV01 kernel: bond1: (slave enp3s0f1): link status definitely up, 10000 Mbps full duplex
Dec 31 15:02:41 SM-HV01 kernel: ixgbe 0000:03:00.1 enp3s0f1: Detected Tx Unit Hang
  Tx Queue             <21>
  TDH, TDT             <0>, <1>
  next_to_use          <1>
  next_to_clean        <0>
tx_buffer_info[next_to_clean]
  time_stamp           <10000370c>
  jiffies              <10000373f>
Dec 31 15:02:41 SM-HV01 kernel: ixgbe 0000:03:00.1 enp3s0f1: tx hang 5 detected on queue 21, resetting adapter
Dec 31 15:02:41 SM-HV01 kernel: ixgbe 0000:03:00.1 enp3s0f1: Reset adapter
Dec 31 15:02:41 SM-HV01 kernel: ixgbe 0000:03:00.1 enp3s0f1: RXDCTL.ENABLE for one or more queues not cleared within the polling period
Dec 31 15:02:41 SM-HV01 kernel: ixgbe 0000:03:00.1 enp3s0f1: TXDCTL.ENABLE for one or more queues not cleared within the polling period
Dec 31 15:02:41 SM-HV01 kernel: ixgbe 0000:03:00.1 enp3s0f1: PCIe transaction pending bit also did not clear.
Dec 31 15:02:41 SM-HV01 kernel: ixgbe 0000:03:00.1: primary disable timed out
Dec 31 15:02:41 SM-HV01 kernel: bond1: (slave enp3s0f1): link status definitely down, disabling slave
Dec 31 15:02:41 SM-HV01 kernel: ixgbe 0000:03:00.1 enp3s0f1: detected SFP+: 3

Host (SM-HV02)
Code:
Dec 31 15:02:36 SM-HV02 kernel: ixgbe 0000:03:00.1 enp3s0f1: NIC Link is Up 10 Gbps, Flow Control: RX/TX
Dec 31 15:02:36 SM-HV02 kernel: bond1: (slave enp3s0f1): link status definitely up, 10000 Mbps full duplex
Dec 31 15:02:40 SM-HV02 kernel: ixgbe 0000:03:00.1 enp3s0f1: NIC Link is Down
Dec 31 15:02:40 SM-HV02 kernel: bond1: (slave enp3s0f1): speed changed to 0 on port 2
Dec 31 15:02:40 SM-HV02 kernel: bond1: (slave enp3s0f1): link status definitely down, disabling slave
Dec 31 15:02:42 SM-HV02 kernel: ixgbe 0000:03:00.1 enp3s0f1: NIC Link is Up 10 Gbps, Flow Control: RX/TX
Dec 31 15:02:42 SM-HV02 kernel: bond1: (slave enp3s0f1): link status definitely up, 10000 Mbps full duplex
Dec 31 15:02:47 SM-HV02 kernel: ixgbe 0000:03:00.1 enp3s0f1: NIC Link is Down
Dec 31 15:02:47 SM-HV02 kernel: bond1: (slave enp3s0f1): speed changed to 0 on port 2
Dec 31 15:02:47 SM-HV02 kernel: bond1: (slave enp3s0f1): link status definitely down, disabling slave
Dec 31 15:02:48 SM-HV02 kernel: ixgbe 0000:03:00.1 enp3s0f1: NIC Link is Up 10 Gbps, Flow Control: RX/TX
Dec 31 15:02:48 SM-HV02 kernel: bond1: (slave enp3s0f1): link status definitely up, 10000 Mbps full duplex
Dec 31 15:02:48 SM-HV02 kernel: ixgbe 0000:03:00.1 enp3s0f1: NIC Link is Down
Dec 31 15:02:48 SM-HV02 kernel: bond1: (slave enp3s0f1): speed changed to 0 on port 2
Dec 31 15:02:48 SM-HV02 kernel: bond1: (slave enp3s0f1): link status definitely down, disabling slave
Dec 31 15:02:50 SM-HV02 kernel: ixgbe 0000:03:00.1 enp3s0f1: NIC Link is Up 10 Gbps, Flow Control: RX/TX
Dec 31 15:02:50 SM-HV02 kernel: bond1: (slave enp3s0f1): link status definitely up, 10000 Mbps full duplex
Dec 31 15:02:53 SM-HV02 kernel: ixgbe 0000:03:00.1 enp3s0f1: NIC Link is Down
Dec 31 15:02:53 SM-HV02 kernel: bond1: (slave enp3s0f1): speed changed to 0 on port 2
Dec 31 15:02:54 SM-HV02 kernel: bond1: (slave enp3s0f1): link status definitely down, disabling slave
Dec 31 15:02:55 SM-HV02 kernel: ixgbe 0000:03:00.1 enp3s0f1: NIC Link is Up 10 Gbps, Flow Control: RX/TX
Dec 31 15:02:55 SM-HV02 kernel: bond1: (slave enp3s0f1): link status definitely up, 10000 Mbps full duplex
Dec 31 15:02:58 SM-HV02 kernel: ixgbe 0000:03:00.1 enp3s0f1: NIC Link is Down
Dec 31 15:02:58 SM-HV02 kernel: bond1: (slave enp3s0f1): speed changed to 0 on port 2
Dec 31 15:02:59 SM-HV02 kernel: bond1: (slave enp3s0f1): link status definitely down, disabling slave
Dec 31 15:02:59 SM-HV02 kernel: ixgbe 0000:03:00.1 enp3s0f1: NIC Link is Up 10 Gbps, Flow Control: RX/TX
Dec 31 15:02:59 SM-HV02 kernel: bond1: (slave enp3s0f1): link status definitely up, 10000 Mbps full duplex
Dec 31 15:03:02 SM-HV02 kernel: ixgbe 0000:03:00.1 enp3s0f1: NIC Link is Down
Dec 31 15:03:02 SM-HV02 kernel: bond1: (slave enp3s0f1): speed changed to 0 on port 2
Dec 31 15:03:02 SM-HV02 kernel: bond1: (slave enp3s0f1): link status definitely down, disabling slave
Dec 31 15:03:03 SM-HV02 kernel: ixgbe 0000:03:00.1 enp3s0f1: NIC Link is Up 10 Gbps, Flow Control: RX/TX
Dec 31 15:03:03 SM-HV02 kernel: bond1: (slave enp3s0f1): link status definitely up, 10000 Mbps full duplex
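
To put a number on the flapping seen on the peer, the link-down events can be counted per interface. This is only a sketch. It assumes the log lines keep the shape shown above (interface name followed by a colon, directly before "NIC Link is Down"), and `link_down_counts` is a made-up name.

```shell
# Count "NIC Link is Down" events per interface in kernel log lines read from stdin.
link_down_counts() {
    awk '
        /NIC Link is Down/ {
            for (i = 2; i <= NF; i++) {
                if ($i == "NIC") {          # the field before "NIC" is "<iface>:"
                    ifc = $(i - 1)
                    sub(/:$/, "", ifc)
                    n[ifc]++
                    break
                }
            }
        }
        END { for (ifc in n) print ifc, n[ifc] }
    '
}
```

Example: `journalctl -k --since "2022-12-31 15:00" | link_down_counts`.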


Thank you for your help!
sebeschn
 
It happened again (the Tx Unit Hang is logged at 21:29:39):

Syslog

Host (SM-HV01)
Code:
Jan 02 21:19:38 SM-HV01 pvedaemon[1578035]: <root@pam> successful auth for user 'root@pam'
Jan 02 21:29:39 SM-HV01 kernel: ixgbe 0000:03:00.1 enp3s0f1: Detected Tx Unit Hang
  Tx Queue             <3>
  TDH, TDT             <154>, <154>
  next_to_use          <154>
  next_to_clean        <b8>
tx_buffer_info[next_to_clean]
  time_stamp           <1029eb8cc>
  jiffies              <1029ebe30>
Jan 02 21:29:39 SM-HV01 kernel: ixgbe 0000:03:00.1 enp3s0f1: tx hang 1 detected on queue 3, resetting adapter
Jan 02 21:29:39 SM-HV01 kernel: ixgbe 0000:03:00.1 enp3s0f1: Reset adapter
Jan 02 21:29:39 SM-HV01 kernel: ixgbe 0000:03:00.1 enp3s0f1: RXDCTL.ENABLE for one or more queues not cleared within the polling period
Jan 02 21:29:39 SM-HV01 kernel: ixgbe 0000:03:00.1 enp3s0f1: PCIe transaction pending bit also did not clear.
Jan 02 21:29:39 SM-HV01 kernel: ixgbe 0000:03:00.1: primary disable timed out
Jan 02 21:29:39 SM-HV01 kernel: bond1: (slave enp3s0f1): speed changed to 0 on port 2
Jan 02 21:29:39 SM-HV01 kernel: bond1: (slave enp3s0f1): link status definitely down, disabling slave
Jan 02 21:29:39 SM-HV01 kernel: ixgbe 0000:03:00.1 enp3s0f1: detected SFP+: 3
Jan 02 21:29:40 SM-HV01 kernel: ixgbe 0000:03:00.1 enp3s0f1: NIC Link is Up 10 Gbps, Flow Control: RX/TX
Jan 02 21:29:40 SM-HV01 kernel: bond1: (slave enp3s0f1): link status definitely up, 10000 Mbps full duplex
Jan 02 21:29:42 SM-HV01 kernel: ixgbe 0000:03:00.1 enp3s0f1: Detected Tx Unit Hang
  Tx Queue             <10>
  TDH, TDT             <0>, <1>
  next_to_use          <1>
  next_to_clean        <0>
tx_buffer_info[next_to_clean]
  time_stamp           <1029ebf7d>
  jiffies              <1029ec178>
Jan 02 21:29:42 SM-HV01 kernel: ixgbe 0000:03:00.1 enp3s0f1: tx hang 2 detected on queue 10, resetting adapter
Jan 02 21:29:42 SM-HV01 kernel: ixgbe 0000:03:00.1 enp3s0f1: Reset adapter
Jan 02 21:29:42 SM-HV01 kernel: ixgbe 0000:03:00.1 enp3s0f1: RXDCTL.ENABLE for one or more queues not cleared within the polling period
Jan 02 21:29:42 SM-HV01 kernel: ixgbe 0000:03:00.1 enp3s0f1: PCIe transaction pending bit also did not clear.
Jan 02 21:29:42 SM-HV01 kernel: ixgbe 0000:03:00.1: primary disable timed out
Jan 02 21:29:42 SM-HV01 kernel: bond1: (slave enp3s0f1): speed changed to 0 on port 2
Jan 02 21:29:42 SM-HV01 kernel: bond1: (slave enp3s0f1): link status definitely down, disabling slave
Jan 02 21:29:42 SM-HV01 kernel: ixgbe 0000:03:00.1 enp3s0f1: detected SFP+: 3
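
The time_stamp and jiffies values in a hang report are hexadecimal tick counters, so their difference shows roughly how long the descriptor had been waiting when the hang was flagged. The sketch below does that arithmetic. The tick rate of 250 Hz is an assumption (it can be checked with `grep 'CONFIG_HZ=' /boot/config-$(uname -r)`), and both function names are invented for this example.

```shell
# Elapsed jiffies between time_stamp ($1) and jiffies ($2); both are the hex
# values from the hang report, passed without the angle brackets.
jiffies_delta() {
    echo $(( 0x$2 - 0x$1 ))
}

# Convert a jiffies count ($1) to milliseconds for a given tick rate HZ ($2).
jiffies_to_ms() {
    echo $(( $1 * 1000 / $2 ))
}

delta=$(jiffies_delta 1029eb8cc 1029ebe30)   # values from the queue 3 report above
echo "$delta jiffies"                         # prints "1380 jiffies"
echo "$(jiffies_to_ms "$delta" 250) ms"       # prints "5520 ms"; HZ=250 is an assumption
```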

Code:
root@SM-HV01:~# ethtool -S enp3s0f0
NIC statistics:
     rx_packets: 1546881
     tx_packets: 354613
     rx_bytes: 1452393789
     tx_bytes: 37137993
     rx_errors: 0
     tx_errors: 0
     rx_dropped: 0
     tx_dropped: 0
     multicast: 86442
     collisions: 0
     rx_over_errors: 0
     rx_crc_errors: 0
     rx_frame_errors: 0
     rx_fifo_errors: 0
     rx_missed_errors: 0
     tx_aborted_errors: 0
     tx_carrier_errors: 0
     tx_fifo_errors: 0
     tx_heartbeat_errors: 0
     rx_pkts_nic: 1550149
     tx_pkts_nic: 354613
     rx_bytes_nic: 1459418419
     tx_bytes_nic: 38564042
     lsc_int: 2
     tx_busy: 0
     non_eop_descs: 0
     broadcast: 13
     rx_no_buffer_count: 0
     tx_timeout_count: 0
     tx_restart_queue: 0
     rx_length_errors: 0
     rx_long_length_errors: 0
     rx_short_length_errors: 0
     tx_flow_control_xon: 0
     rx_flow_control_xon: 0
     tx_flow_control_xoff: 44360572
     rx_flow_control_xoff: 0
     rx_csum_offload_errors: 0
     alloc_rx_page: 17065
     alloc_rx_page_failed: 0
     alloc_rx_buff_failed: 0
     rx_no_dma_resources: 0
     hw_rsc_aggregated: 0
     hw_rsc_flushed: 0
     fdir_match: 0
     fdir_miss: 1463440
     fdir_overflow: 0
     fcoe_bad_fccrc: 0
     fcoe_last_errors: 0
     rx_fcoe_dropped: 0
     rx_fcoe_packets: 0
     rx_fcoe_dwords: 0
     fcoe_noddp: 0
     fcoe_noddp_ext_buff: 0
     tx_fcoe_packets: 0
     tx_fcoe_dwords: 0
     os2bmc_rx_by_bmc: 0
     os2bmc_tx_by_bmc: 0
     os2bmc_tx_by_host: 0
     os2bmc_rx_by_host: 0
     tx_hwtstamp_timeouts: 0
     tx_hwtstamp_skipped: 0
     rx_hwtstamp_cleared: 0
     tx_queue_0_packets: 17998
     tx_queue_0_bytes: 1638658
     tx_queue_1_packets: 34865
     tx_queue_1_bytes: 2782311
     tx_queue_2_packets: 14126
     tx_queue_2_bytes: 1662979
     tx_queue_3_packets: 11234
     tx_queue_3_bytes: 1363454
     tx_queue_4_packets: 12210
     tx_queue_4_bytes: 1103309
     tx_queue_5_packets: 11696
     tx_queue_5_bytes: 2873091
     tx_queue_6_packets: 11654
     tx_queue_6_bytes: 973895
     tx_queue_7_packets: 10006
     tx_queue_7_bytes: 907629
     tx_queue_8_packets: 10626
     tx_queue_8_bytes: 877853
     tx_queue_9_packets: 9788
     tx_queue_9_bytes: 858212
     tx_queue_10_packets: 14921
     tx_queue_10_bytes: 1450597
     tx_queue_11_packets: 11107
     tx_queue_11_bytes: 1005820
     tx_queue_12_packets: 17444
     tx_queue_12_bytes: 4289292
     tx_queue_13_packets: 16534
     tx_queue_13_bytes: 1391590
     tx_queue_14_packets: 11481
     tx_queue_14_bytes: 926789
     tx_queue_15_packets: 9410
     tx_queue_15_bytes: 771302
     tx_queue_16_packets: 9949
     tx_queue_16_bytes: 795766
     tx_queue_17_packets: 17070
     tx_queue_17_bytes: 1511844
     tx_queue_18_packets: 11619
     tx_queue_18_bytes: 968719
     tx_queue_19_packets: 11243
     tx_queue_19_bytes: 894865
     tx_queue_20_packets: 9977
     tx_queue_20_bytes: 788776
     tx_queue_21_packets: 31364
     tx_queue_21_bytes: 3414713
     tx_queue_22_packets: 22178
     tx_queue_22_bytes: 2291540
     tx_queue_23_packets: 16113
     tx_queue_23_bytes: 1594989
     tx_queue_24_packets: 0
     tx_queue_24_bytes: 0
     tx_queue_25_packets: 0
     tx_queue_25_bytes: 0
     rx_queue_0_packets: 135583
     rx_queue_0_bytes: 70904209
     rx_queue_1_packets: 88392
     rx_queue_1_bytes: 105642411
     rx_queue_2_packets: 30560
     rx_queue_2_bytes: 35011276
     rx_queue_3_packets: 350152
     rx_queue_3_bytes: 291958805
     rx_queue_4_packets: 108218
     rx_queue_4_bytes: 134865575
     rx_queue_5_packets: 53823
     rx_queue_5_bytes: 63641574
     rx_queue_6_packets: 47060
     rx_queue_6_bytes: 55159472
     rx_queue_7_packets: 65829
     rx_queue_7_bytes: 80668502
     rx_queue_8_packets: 78440
     rx_queue_8_bytes: 94257814
     rx_queue_9_packets: 60411
     rx_queue_9_bytes: 72167156
     rx_queue_10_packets: 139587
     rx_queue_10_bytes: 72044068
     rx_queue_11_packets: 75513
     rx_queue_11_bytes: 88829480
     rx_queue_12_packets: 124916
     rx_queue_12_bytes: 65453088
     rx_queue_13_packets: 60777
     rx_queue_13_bytes: 72861397
     rx_queue_14_packets: 60001
     rx_queue_14_bytes: 70794590
     rx_queue_15_packets: 67619
     rx_queue_15_bytes: 78134372
     rx_queue_16_packets: 0
     rx_queue_16_bytes: 0
     rx_queue_17_packets: 0
     rx_queue_17_bytes: 0

root@SM-HV01:~# ethtool -S enp3s0f1
NIC statistics:
     rx_packets: 2372190
     tx_packets: 3413762
     rx_bytes: 1965000830
     tx_bytes: 1057664952
     rx_errors: 0
     tx_errors: 0
     rx_dropped: 0
     tx_dropped: 0
     multicast: 169028
     collisions: 0
     rx_over_errors: 0
     rx_crc_errors: 0
     rx_frame_errors: 0
     rx_fifo_errors: 0
     rx_missed_errors: 156
     tx_aborted_errors: 0
     tx_carrier_errors: 0
     tx_fifo_errors: 0
     tx_heartbeat_errors: 0
     rx_pkts_nic: 2372524
     tx_pkts_nic: 3413918
     rx_bytes_nic: 1975082399
     tx_bytes_nic: 1074021741
     lsc_int: 230
     tx_busy: 0
     non_eop_descs: 0
     broadcast: 15
     rx_no_buffer_count: 0
     tx_timeout_count: 115
     tx_restart_queue: 0
     rx_length_errors: 0
     rx_long_length_errors: 0
     rx_short_length_errors: 0
     tx_flow_control_xon: 0
     rx_flow_control_xon: 0
     tx_flow_control_xoff: 204534
     rx_flow_control_xoff: 0
     rx_csum_offload_errors: 0
     alloc_rx_page: 1431178
     alloc_rx_page_failed: 0
     alloc_rx_buff_failed: 0
     rx_no_dma_resources: 0
     hw_rsc_aggregated: 0
     hw_rsc_flushed: 0
     fdir_match: 1598923
     fdir_miss: 602279
     fdir_overflow: 3
     fcoe_bad_fccrc: 0
     fcoe_last_errors: 0
     rx_fcoe_dropped: 0
     rx_fcoe_packets: 0
     rx_fcoe_dwords: 0
     fcoe_noddp: 0
     fcoe_noddp_ext_buff: 0
     tx_fcoe_packets: 0
     tx_fcoe_dwords: 0
     os2bmc_rx_by_bmc: 0
     os2bmc_tx_by_bmc: 0
     os2bmc_tx_by_host: 0
     os2bmc_rx_by_host: 0
     tx_hwtstamp_timeouts: 0
     tx_hwtstamp_skipped: 0
     rx_hwtstamp_cleared: 0
     tx_queue_0_packets: 70231
     tx_queue_0_bytes: 7059639
     tx_queue_1_packets: 121001
     tx_queue_1_bytes: 18121560
     tx_queue_2_packets: 120138
     tx_queue_2_bytes: 75159666
     tx_queue_3_packets: 501944
     tx_queue_3_bytes: 86824340
     tx_queue_4_packets: 306426
     tx_queue_4_bytes: 52931765
     tx_queue_5_packets: 274079
     tx_queue_5_bytes: 131335177
     tx_queue_6_packets: 156555
     tx_queue_6_bytes: 17398808
     tx_queue_7_packets: 138780
     tx_queue_7_bytes: 37293859
     tx_queue_8_packets: 131442
     tx_queue_8_bytes: 73567850
     tx_queue_9_packets: 103456
     tx_queue_9_bytes: 39005374
     tx_queue_10_packets: 343509
     tx_queue_10_bytes: 324775416
     tx_queue_11_packets: 106410
     tx_queue_11_bytes: 11853072
     tx_queue_12_packets: 145448
     tx_queue_12_bytes: 17677653
     tx_queue_13_packets: 115838
     tx_queue_13_bytes: 14513256
     tx_queue_14_packets: 72300
     tx_queue_14_bytes: 6874725
     tx_queue_15_packets: 119759
     tx_queue_15_bytes: 10862864
     tx_queue_16_packets: 65560
     tx_queue_16_bytes: 8728787
     tx_queue_17_packets: 80729
     tx_queue_17_bytes: 16358057
     tx_queue_18_packets: 67336
     tx_queue_18_bytes: 6481183
     tx_queue_19_packets: 62987
     tx_queue_19_bytes: 14297051
     tx_queue_20_packets: 78803
     tx_queue_20_bytes: 35947842
     tx_queue_21_packets: 108985
     tx_queue_21_bytes: 36637202
     tx_queue_22_packets: 64888
     tx_queue_22_bytes: 8128447
     tx_queue_23_packets: 57158
     tx_queue_23_bytes: 5831359
     tx_queue_24_packets: 0
     tx_queue_24_bytes: 0
     tx_queue_25_packets: 0
     tx_queue_25_bytes: 0
     rx_queue_0_packets: 186894
     rx_queue_0_bytes: 35073412
     rx_queue_1_packets: 37169
     rx_queue_1_bytes: 38866744
     rx_queue_2_packets: 93307
     rx_queue_2_bytes: 102145865
     rx_queue_3_packets: 767749
     rx_queue_3_bytes: 647752378
     rx_queue_4_packets: 244198
     rx_queue_4_bytes: 177852730
     rx_queue_5_packets: 115512
     rx_queue_5_bytes: 100271812
     rx_queue_6_packets: 142762
     rx_queue_6_bytes: 150604457
     rx_queue_7_packets: 68273
     rx_queue_7_bytes: 68090422
     rx_queue_8_packets: 41177
     rx_queue_8_bytes: 32859050
     rx_queue_9_packets: 33479
     rx_queue_9_bytes: 28280042
     rx_queue_10_packets: 68203
     rx_queue_10_bytes: 34603713
     rx_queue_11_packets: 40823
     rx_queue_11_bytes: 42576490
     rx_queue_12_packets: 25579
     rx_queue_12_bytes: 23192524
     rx_queue_13_packets: 40852
     rx_queue_13_bytes: 41214730
     rx_queue_14_packets: 46228
     rx_queue_14_bytes: 52179197
     rx_queue_15_packets: 98905
     rx_queue_15_bytes: 29143314
     rx_queue_16_packets: 22599
     rx_queue_16_bytes: 22016087
     rx_queue_17_packets: 43430
     rx_queue_17_bytes: 49949190
     rx_queue_18_packets: 91517
     rx_queue_18_bytes: 113189129
     rx_queue_19_packets: 22193
     rx_queue_19_bytes: 21071960
     rx_queue_20_packets: 90458
     rx_queue_20_bytes: 105108579
     rx_queue_21_packets: 16151
     rx_queue_21_bytes: 12174852
     rx_queue_22_packets: 23635
     rx_queue_22_bytes: 27249808
     rx_queue_23_packets: 11097
     rx_queue_23_bytes: 9534345
     rx_queue_24_packets: 0
     rx_queue_24_bytes: 0
     rx_queue_25_packets: 0
     rx_queue_25_bytes: 0
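
Most of the counters above are zero, so the interesting ones are easy to miss: enp3s0f1 shows tx_timeout_count 115, lsc_int 230 and rx_missed_errors 156, and enp3s0f0 shows tx_flow_control_xoff 44360572. A filter along these lines pulls such entries out of `ethtool -S`. The name pattern is just a guess at what is relevant here, and `nonzero_error_counters` is a hypothetical name.

```shell
# Print the error-like counters from `ethtool -S <iface>` output (stdin) that
# are non-zero. The name pattern is a guess, not an official list.
nonzero_error_counters() {
    awk -F': *' '
        {
            name = $1
            gsub(/^[ \t]+/, "", name)     # drop the indentation ethtool adds
        }
        name ~ /error|timeout|missed|overflow|xoff|failed|lsc_int/ && $2 + 0 > 0 {
            print name ": " $2
        }
    '
}
```

Example: `ethtool -S enp3s0f1 | nonzero_error_counters`. Running it before and after a load test shows which counters move together with the hang.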

Host (SM-HV02)
Code:
Jan 02 21:14:07 SM-HV02 kernel: ixgbe 0000:03:00.0 enp3s0f0: Fake Tx hang detected with timeout of 80 seconds
Jan 02 21:15:29 SM-HV02 kernel: ixgbe 0000:03:00.0 enp3s0f0: Fake Tx hang detected with timeout of 80 seconds
Jan 02 21:16:50 SM-HV02 kernel: ixgbe 0000:03:00.0 enp3s0f0: Fake Tx hang detected with timeout of 80 seconds
Jan 02 21:17:01 SM-HV02 CRON[1153901]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jan 02 21:17:01 SM-HV02 CRON[1153902]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Jan 02 21:17:01 SM-HV02 CRON[1153901]: pam_unix(cron:session): session closed for user root
Jan 02 21:18:12 SM-HV02 kernel: ixgbe 0000:03:00.0 enp3s0f0: Fake Tx hang detected with timeout of 80 seconds
Jan 02 21:19:32 SM-HV02 kernel: ixgbe 0000:03:00.0 enp3s0f0: Fake Tx hang detected with timeout of 80 seconds
Jan 02 21:19:38 SM-HV02 pmxcfs[5206]: [status] notice: received log
Jan 02 21:20:54 SM-HV02 kernel: ixgbe 0000:03:00.0 enp3s0f0: Fake Tx hang detected with timeout of 80 seconds
Jan 02 21:22:16 SM-HV02 kernel: ixgbe 0000:03:00.0 enp3s0f0: Fake Tx hang detected with timeout of 80 seconds
Jan 02 21:23:38 SM-HV02 kernel: ixgbe 0000:03:00.0 enp3s0f0: Fake Tx hang detected with timeout of 80 seconds
Jan 02 21:25:00 SM-HV02 kernel: ixgbe 0000:03:00.0 enp3s0f0: Fake Tx hang detected with timeout of 80 seconds
Jan 02 21:26:22 SM-HV02 kernel: ixgbe 0000:03:00.0 enp3s0f0: Fake Tx hang detected with timeout of 80 seconds
Jan 02 21:27:44 SM-HV02 kernel: ixgbe 0000:03:00.0 enp3s0f0: Fake Tx hang detected with timeout of 80 seconds
Jan 02 21:29:06 SM-HV02 kernel: ixgbe 0000:03:00.0 enp3s0f0: Fake Tx hang detected with timeout of 80 seconds
Jan 02 21:29:39 SM-HV02 kernel: ixgbe 0000:03:00.1 enp3s0f1: NIC Link is Down
Jan 02 21:29:39 SM-HV02 kernel: bond1: (slave enp3s0f1): speed changed to 0 on port 2
Jan 02 21:29:39 SM-HV02 kernel: bond1: (slave enp3s0f1): link status definitely down, disabling slave
Jan 02 21:29:41 SM-HV02 kernel: ixgbe 0000:03:00.1 enp3s0f1: NIC Link is Up 10 Gbps, Flow Control: RX/TX
Jan 02 21:29:41 SM-HV02 kernel: bond1: (slave enp3s0f1): link status definitely up, 10000 Mbps full duplex
Jan 02 21:29:42 SM-HV02 kernel: ixgbe 0000:03:00.1 enp3s0f1: NIC Link is Down
Jan 02 21:29:42 SM-HV02 kernel: bond1: (slave enp3s0f1): speed changed to 0 on port 2
Jan 02 21:29:42 SM-HV02 kernel: bond1: (slave enp3s0f1): link status definitely down, disabling slave
Jan 02 21:29:43 SM-HV02 kernel: ixgbe 0000:03:00.1 enp3s0f1: NIC Link is Up 10 Gbps, Flow Control: RX/TX
Jan 02 21:29:43 SM-HV02 kernel: bond1: (slave enp3s0f1): link status definitely up, 10000 Mbps full duplex
Jan 02 21:29:44 SM-HV02 kernel: ixgbe 0000:03:00.1 enp3s0f1: NIC Link is Down
Jan 02 21:29:44 SM-HV02 kernel: bond1: (slave enp3s0f1): speed changed to 0 on port 2
Jan 02 21:29:44 SM-HV02 kernel: bond1: (slave enp3s0f1): link status definitely down, disabling slave
Jan 02 21:29:45 SM-HV02 kernel: ixgbe 0000:03:00.1 enp3s0f1: NIC Link is Up 10 Gbps, Flow Control: RX/TX
Jan 02 21:29:45 SM-HV02 kernel: bond1: (slave enp3s0f1): link status definitely up, 10000 Mbps full duplex
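
The "Fake Tx hang" messages on SM-HV02 arrive at a steady rhythm that roughly matches the 80 second timeout they mention. The spacing can be checked with a sketch like this one. It assumes the time of day is the third field of each log line and does not handle a log that crosses midnight. `fake_hang_gaps` is a made-up name.

```shell
# Print the gap in seconds between consecutive "Fake Tx hang detected" lines
# read from stdin (time of day expected as HH:MM:SS in field 3).
fake_hang_gaps() {
    awk '/Fake Tx hang detected/ {
        split($3, t, ":")
        now = t[1] * 3600 + t[2] * 60 + t[3]
        if (seen) print now - prev
        prev = now
        seen = 1
    }'
}
```

Example: `journalctl -k | fake_hang_gaps`.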
 
Hi there,

I have updated the firmware of the two "Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection" controllers with the firmware provided by HPE:

Firmware-Upgrade

Host (SM-HV01)
From:
firmware-version: 0x80000835, 1.1200.0

To:
firmware-version: 0x800009e1, 1.3227.0

Code:
root@SM-HV01:/_DATA/firmware/HPE 560SFP+/1.24.3-1.1/usr/lib/x86_64-linux-gnu/firmware-nic-intel-1.24.3-1.1# bash hpsetup
Copyright (c) 2022 Hewlett Packard Enterprise Development LP
HPE Intel Online Firmware Upgrade Utility for Linux x86_64 - v1.24.3


Found HPE Ethernet 10Gb 2-port 560SFP+ Adapter MAC: <MAC>
Do you want to update the following firmware on enp3s0f0 :
EPROM   0.0.80000835 to 0.0.800009E1 - y/n/q (Default option is 'y' when you simply press enter):

ROM   1.1200.0 to 1.3227.0 - y/n/q (Default option is 'y' when you simply press enter):


The Firmware Upgrade may take several minutes. Please be patient.


Selecting HPE Ethernet 10Gb 2-port 560SFP+ Adapter MAC: <MAC>
Updating the following firmware on enp3s0f0 :
EPROM   0.0.80000835 to 0.0.800009E1
Firmware (EPROM) upgrade on MAC <MAC> SUCCESSFUL

ROM   1.1200.0 to 1.3227.0
Firmware (ROM) upgrade on MAC <MAC> SUCCESSFUL

NIC firmware update completed successfully.

Reboot is required for the new firmware to take effect.

Host (SM-HV02)
From:
firmware-version: 0x80000811, 1.1099.0

To:
firmware-version: 0x800009e1, 1.3227.0

Code:
root@SM-HV02:/_DATA/firmware/HPE 560SFP+/1.24.3-1.1/usr/lib/x86_64-linux-gnu/firmware-nic-intel-1.24.3-1.1# bash hpsetup
pcilib: sysfs_read_vpd: read failed: Connection timed out
Copyright (c) 2022 Hewlett Packard Enterprise Development LP
HPE Intel Online Firmware Upgrade Utility for Linux x86_64 - v1.24.3


Found HPE Ethernet 10Gb 2-port 560SFP+ Adapter MAC: <MAC>
Do you want to update the following firmware on enp3s0f0 :
EPROM   0.0.80000811 to 0.0.800009E1 - y/n/q (Default option is 'y' when you simply press enter):

ROM   1.1099.0 to 1.3227.0 - y/n/q (Default option is 'y' when you simply press enter):


The Firmware Upgrade may take several minutes. Please be patient.


Selecting HPE Ethernet 10Gb 2-port 560SFP+ Adapter MAC: <MAC>
Updating the following firmware on enp3s0f0 :
EPROM   0.0.80000811 to 0.0.800009E1
Firmware (EPROM) upgrade on MAC <MAC> SUCCESSFUL

ROM   1.1099.0 to 1.3227.0
Firmware (ROM) upgrade on MAC <MAC> SUCCESSFUL

NIC firmware update completed successfully.

Reboot is required for the new firmware to take effect.


After rebooting both servers and testing with iperf, the error occurred again. I am now disabling offloading and will test again.
 
I have now disabled offloading (gso, gro, tso, tx, rx, rxvlan, txvlan, sg) on both interfaces (enp3s0f0 and enp3s0f1) again and am using the bridge directly without the bond, but the error occurred again. When I run an iperf test over the eno2 interface with the igb driver, everything works without any error.

Syslog

Host (SM-HV01)
Code:
Jan 04 23:11:22 SM-HV01 kernel: ixgbe 0000:03:00.0 enp3s0f0: Detected Tx Unit Hang
  Tx Queue             <21>
  TDH, TDT             <2e>, <30>
  next_to_use          <30>
  next_to_clean        <2e>
tx_buffer_info[next_to_clean]
  time_stamp           <ffffffb7>
  jiffies              <100000358>
Jan 04 23:11:22 SM-HV01 kernel: ixgbe 0000:03:00.0 enp3s0f0: Detected Tx Unit Hang
  Tx Queue             <4>
  TDH, TDT             <2c>, <31>
  next_to_use          <31>
  next_to_clean        <2c>
tx_buffer_info[next_to_clean]
  time_stamp           <ffffff53>
  jiffies              <100000358>
Jan 04 23:11:22 SM-HV01 kernel: ixgbe 0000:03:00.0 enp3s0f0: Detected Tx Unit Hang
  Tx Queue             <11>
  TDH, TDT             <1e8>, <32>
  next_to_use          <32>
  next_to_clean        <1e0>
tx_buffer_info[next_to_clean]
  time_stamp           <ffffff23>
  jiffies              <100000358>
Jan 04 23:11:22 SM-HV01 kernel: ixgbe 0000:03:00.0 enp3s0f0: Detected Tx Unit Hang
  Tx Queue             <15>
  TDH, TDT             <6f>, <71>
  next_to_use          <71>
  next_to_clean        <6f>
tx_buffer_info[next_to_clean]
  time_stamp           <ffffff33>
  jiffies              <100000358>
Jan 04 23:11:22 SM-HV01 kernel: ixgbe 0000:03:00.0 enp3s0f0: tx hang 1 detected on queue 15, resetting adapter
Jan 04 23:11:22 SM-HV01 kernel: ixgbe 0000:03:00.0 enp3s0f0: tx hang 1 detected on queue 11, resetting adapter
Jan 04 23:11:22 SM-HV01 kernel: ixgbe 0000:03:00.0 enp3s0f0: tx hang 1 detected on queue 4, resetting adapter
Jan 04 23:11:22 SM-HV01 kernel: ixgbe 0000:03:00.0 enp3s0f0: Reset adapter
Jan 04 23:11:22 SM-HV01 kernel: ixgbe 0000:03:00.0 enp3s0f0: tx hang 2 detected on queue 21, resetting adapter
Jan 04 23:11:22 SM-HV01 kernel: ixgbe 0000:03:00.0 enp3s0f0: RXDCTL.ENABLE for one or more queues not cleared within the polling period
Jan 04 23:11:22 SM-HV01 kernel: ixgbe 0000:03:00.0 enp3s0f0: PCIe transaction pending bit also did not clear.
Jan 04 23:11:22 SM-HV01 kernel: ixgbe 0000:03:00.0: primary disable timed out
Jan 04 23:11:23 SM-HV01 kernel: bond1: (slave enp3s0f0): link status definitely down, disabling slave
Jan 04 23:11:23 SM-HV01 kernel: bond1: active interface up!
Jan 04 23:11:23 SM-HV01 kernel: vmbr41: port 1(enp3s0f0.41) entered disabled state
Jan 04 23:11:23 SM-HV01 kernel: ixgbe 0000:03:00.0 enp3s0f0: detected SFP+: 4
Jan 04 23:11:23 SM-HV01 kernel: bond1: (slave enp3s0f0): link status definitely down, disabling slave
Jan 04 23:11:23 SM-HV01 kernel: ixgbe 0000:03:00.0 enp3s0f0: NIC Link is Up 10 Gbps, Flow Control: RX/TX
Jan 04 23:11:23 SM-HV01 kernel: vmbr41: port 1(enp3s0f0.41) entered blocking state
Jan 04 23:11:23 SM-HV01 kernel: vmbr41: port 1(enp3s0f0.41) entered forwarding state
Jan 04 23:11:23 SM-HV01 kernel: bond1: (slave enp3s0f0): link status definitely up, 10000 Mbps full duplex
Jan 04 23:11:24 SM-HV01 kernel: ixgbe 0000:03:00.0 enp3s0f0: NIC Link is Down
Jan 04 23:11:24 SM-HV01 kernel: ixgbe 0000:03:00.0 enp3s0f0: initiating reset due to lost link with pending Tx work
Jan 04 23:11:24 SM-HV01 kernel: bond1: (slave enp3s0f0): speed changed to 0 on port 1
Jan 04 23:11:24 SM-HV01 kernel: vmbr41: port 1(enp3s0f0.41) entered disabled state
Jan 04 23:11:24 SM-HV01 pvedaemon[4446]: <root@pam> end task UPID:SM-HV01:0000854F:00007364:63B5F982:vncproxy:105:root@pam: OK
Jan 04 23:11:24 SM-HV01 kernel: bond1: (slave enp3s0f0): link status definitely down, disabling slave
Jan 04 23:11:25 SM-HV01 kernel: ixgbe 0000:03:00.0 enp3s0f0: Reset adapter
Jan 04 23:11:25 SM-HV01 kernel: ixgbe 0000:03:00.0 enp3s0f0: RXDCTL.ENABLE for one or more queues not cleared within the polling period
Jan 04 23:11:25 SM-HV01 kernel: ixgbe 0000:03:00.0 enp3s0f0: PCIe transaction pending bit also did not clear.
Jan 04 23:11:25 SM-HV01 kernel: ixgbe 0000:03:00.0: primary disable timed out
Jan 04 23:11:26 SM-HV01 kernel: ixgbe 0000:03:00.0 enp3s0f0: detected SFP+: 4
Jan 04 23:11:26 SM-HV01 kernel: ixgbe 0000:03:00.0 enp3s0f0: NIC Link is Up 10 Gbps, Flow Control: RX/TX
Jan 04 23:11:26 SM-HV01 kernel: vmbr41: port 1(enp3s0f0.41) entered blocking state
Jan 04 23:11:26 SM-HV01 kernel: vmbr41: port 1(enp3s0f0.41) entered forwarding state
Jan 04 23:11:26 SM-HV01 kernel: bond1: (slave enp3s0f0): link status definitely up, 10000 Mbps full duplex
Jan 04 23:11:26 SM-HV01 kernel: ixgbe 0000:03:00.0 enp3s0f0: Detected Tx Unit Hang

Host (SM-HV02)
Code:
Jan 04 23:11:23 SM-HV02 kernel: ixgbe 0000:03:00.0 enp3s0f0: NIC Link is Down
Jan 04 23:11:23 SM-HV02 kernel: ixgbe 0000:03:00.0 enp3s0f0: initiating reset due to lost link with pending Tx work
Jan 04 23:11:23 SM-HV02 kernel: bond1: (slave enp3s0f0): speed changed to 0 on port 1
Jan 04 23:11:23 SM-HV02 kernel: bond1: (slave enp3s0f0): link status definitely down, disabling slave
Jan 04 23:11:23 SM-HV02 kernel: vmbr41: port 1(enp3s0f0.41) entered disabled state
Jan 04 23:11:24 SM-HV02 kernel: ixgbe 0000:03:00.0 enp3s0f0: Reset adapter
Jan 04 23:11:24 SM-HV02 kernel: ixgbe 0000:03:00.0 enp3s0f0: TXDCTL.ENABLE for one or more queues not cleared within the polling period
Jan 04 23:11:24 SM-HV02 sshd[433598]: Received disconnect from 10.10.0.11 port 50400:11: disconnected by user
Jan 04 23:11:24 SM-HV02 sshd[433598]: Disconnected from user root 10.10.0.11 port 50400
Jan 04 23:11:24 SM-HV02 sshd[433598]: pam_unix(sshd:session): session closed for user root
Jan 04 23:11:24 SM-HV02 pmxcfs[5305]: [status] notice: received log
Jan 04 23:11:24 SM-HV02 systemd[1]: session-36.scope: Succeeded.
Jan 04 23:11:24 SM-HV02 systemd[1]: session-36.scope: Consumed 1.011s CPU time.
Jan 04 23:11:24 SM-HV02 systemd-logind[4507]: Session 36 logged out. Waiting for processes to exit.
Jan 04 23:11:24 SM-HV02 systemd-logind[4507]: Removed session 36.
Jan 04 23:11:24 SM-HV02 kernel: ixgbe 0000:03:00.0 enp3s0f0: detected SFP+: 4
Jan 04 23:11:25 SM-HV02 kernel: ixgbe 0000:03:00.0 enp3s0f0: NIC Link is Up 10 Gbps, Flow Control: RX/TX
Jan 04 23:11:25 SM-HV02 kernel: vmbr41: port 1(enp3s0f0.41) entered blocking state
Jan 04 23:11:25 SM-HV02 kernel: vmbr41: port 1(enp3s0f0.41) entered forwarding state
Jan 04 23:11:25 SM-HV02 kernel: bond1: (slave enp3s0f0): link status definitely up, 10000 Mbps full duplex
Jan 04 23:11:26 SM-HV02 kernel: ixgbe 0000:03:00.0 enp3s0f0: NIC Link is Down
Jan 04 23:11:26 SM-HV02 kernel: bond1: (slave enp3s0f0): speed changed to 0 on port 1
Jan 04 23:11:26 SM-HV02 kernel: vmbr41: port 1(enp3s0f0.41) entered disabled state
Jan 04 23:11:26 SM-HV02 kernel: bond1: (slave enp3s0f0): link status definitely down, disabling slave
Jan 04 23:11:27 SM-HV02 kernel: ixgbe 0000:03:00.0 enp3s0f0: NIC Link is Up 10 Gbps, Flow Control: RX/TX
Jan 04 23:11:27 SM-HV02 kernel: vmbr41: port 1(enp3s0f0.41) entered blocking state
Jan 04 23:11:27 SM-HV02 kernel: vmbr41: port 1(enp3s0f0.41) entered forwarding state
Jan 04 23:11:27 SM-HV02 kernel: bond1: (slave enp3s0f0): link status definitely up, 10000 Mbps full duplex
Jan 04 23:11:28 SM-HV02 kernel: ixgbe 0000:03:00.0 enp3s0f0: NIC Link is Down
Jan 04 23:11:28 SM-HV02 kernel: bond1: (slave enp3s0f0): speed changed to 0 on port 1
Jan 04 23:11:28 SM-HV02 kernel: bond1: (slave enp3s0f0): link status definitely down, disabling slave
Jan 04 23:11:28 SM-HV02 kernel: vmbr41: port 1(enp3s0f0.41) entered disabled state
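A rough sketch for reading the "Detected Tx Unit Hang" dumps from SM-HV01 above: TDH and TDT are the hardware head and software tail of the Tx descriptor ring, printed in hex, so the distance between them is the number of descriptors still waiting on the NIC. The ring size of 512 is an assumption (check `ethtool -g <iface>` for the real value), and the function name is made up for illustration.

```python
import re

# Assumption: 512-entry Tx ring; confirm the real size with `ethtool -g <iface>`.
ASSUMED_RING_SIZE = 512

# Matches the "TDH, TDT  <head>, <tail>" line of a Tx Unit Hang dump (hex values).
TDH_TDT_RE = re.compile(r"TDH, TDT\s+<([0-9a-fA-F]+)>,\s*<([0-9a-fA-F]+)>")


def pending_descriptors(line, ring_size=ASSUMED_RING_SIZE):
    """Return how many descriptors sit between head and tail, or None if no match."""
    match = TDH_TDT_RE.search(line)
    if match is None:
        return None
    head = int(match.group(1), 16)
    tail = int(match.group(2), 16)
    # The modulo covers the case where the tail has wrapped around the ring.
    return (tail - head) % ring_size


if __name__ == "__main__":
    for sample in ("  TDH, TDT             <2e>, <30>",
                   "  TDH, TDT             <1e8>, <32>"):
        print(sample.strip(), "->", pending_descriptors(sample))
```

With that assumed ring size, queue 21 has 2 descriptors outstanding and queue 11 has 74, so the hung queues are nowhere near full when the hang is reported.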

I really have no idea what to do now.
 
Same issue here: a hypervisor with 10+ VPS suddenly stopped responding. After a restart it is working fine again. Here are the syslogs from the aftermath:

---
Jan 30 14:12:42 proxmox6 kernel: [20746907.799602] ixgbe 0000:c1:00.0 enp193s0: Detected Tx Unit Hang
Jan 30 14:12:42 proxmox6 kernel: [20746907.799602] Tx Queue <12>
Jan 30 14:12:42 proxmox6 kernel: [20746907.799602] TDH, TDT <0>, <4>
Jan 30 14:12:42 proxmox6 kernel: [20746907.799602] next_to_use <4>
Jan 30 14:12:42 proxmox6 kernel: [20746907.799602] next_to_clean <0>
Jan 30 14:12:42 proxmox6 kernel: [20746907.799602] tx_buffer_info[next_to_clean]
Jan 30 14:12:42 proxmox6 kernel: [20746907.799602] time_stamp <23526aac0>
Jan 30 14:12:42 proxmox6 kernel: [20746907.799602] jiffies <23526add8>
Jan 30 14:12:42 proxmox6 kernel: [20746907.799610] ixgbe 0000:c1:00.0 enp193s0: Detected Tx Unit Hang
Jan 30 14:12:42 proxmox6 kernel: [20746907.799610] Tx Queue <36>
Jan 30 14:12:42 proxmox6 kernel: [20746907.799610] TDH, TDT <0>, <6>
Jan 30 14:12:42 proxmox6 kernel: [20746907.799610] next_to_use <6>
Jan 30 14:12:42 proxmox6 kernel: [20746907.799610] next_to_clean <0>
Jan 30 14:12:42 proxmox6 kernel: [20746907.799610] tx_buffer_info[next_to_clean]
Jan 30 14:12:42 proxmox6 kernel: [20746907.799610] time_stamp <23526ab32>
Jan 30 14:12:42 proxmox6 kernel: [20746907.799610] jiffies <23526add8>
Jan 30 14:12:42 proxmox6 kernel: [20746907.803703] ixgbe 0000:c1:00.0 enp193s0: tx hang 146 detected on queue 12, resetting adapter
Jan 30 14:12:42 proxmox6 kernel: [20746907.807801] ixgbe 0000:c1:00.0 enp193s0: initiating reset due to tx timeout
Jan 30 14:12:42 proxmox6 kernel: [20746907.807804] ixgbe 0000:c1:00.0 enp193s0: tx hang 146 detected on queue 36, resetting adapter
Jan 30 14:12:42 proxmox6 kernel: [20746907.807810] ixgbe 0000:c1:00.0 enp193s0: initiating reset due to tx timeout
Jan 30 14:12:42 proxmox6 kernel: [20746907.807846] ixgbe 0000:c1:00.0 enp193s0: Reset adapter
Jan 30 14:12:42 proxmox6 kernel: [20746907.841797] ixgbe 0000:c1:00.0 enp193s0: RXDCTL.ENABLE for one or more queues not cleared within the polling period
Jan 30 14:12:42 proxmox6 kernel: [20746907.875667] ixgbe 0000:c1:00.0 enp193s0: TXDCTL.ENABLE for one or more queues not cleared within the polling period
Jan 30 14:12:42 proxmox6 kernel: [20746908.047554] ixgbe 0000:c1:00.0: master disable timed out
Jan 30 14:12:42 proxmox6 kernel: [20746908.488843] vmbr0: port 1(enp193s0) entered disabled state
Jan 30 14:12:42 proxmox6 kernel: [20746908.541114] ixgbe 0000:c1:00.0 enp193s0: detected SFP+: 5
Jan 30 14:12:46 proxmox6 kernel: [20746911.849677] ixgbe 0000:c1:00.0 enp193s0: NIC Link is Up 10 Gbps, Flow Control: RX/TX
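To see whether the hangs cluster on particular Tx queues, the "tx hang N detected on queue Q" messages can be tallied. This is only a sketch: the function name is made up, and the first number in the message is ignored here (it appears to be the driver's running hang counter). It works on a blob of text, so it does not matter whether the log lines are wrapped.

```python
import re
from collections import Counter

# Captures the number after "tx hang" and the queue index from the ixgbe message.
HANG_RE = re.compile(r"tx hang (\d+) detected on queue (\d+)")


def hangs_per_queue(log_text):
    """Count 'tx hang' reports per Tx queue index in a blob of kernel log text."""
    return Counter(int(queue) for _count, queue in HANG_RE.findall(log_text))


# Two messages copied from the log above.
sample = (
    "ixgbe 0000:c1:00.0 enp193s0: tx hang 146 detected on queue 12, resetting adapter "
    "ixgbe 0000:c1:00.0 enp193s0: tx hang 146 detected on queue 36, resetting adapter"
)

if __name__ == "__main__":
    for queue, count in sorted(hangs_per_queue(sample).items()):
        print(f"queue {queue}: {count} hang report(s)")
```

Feeding it the full syslog (for example the contents of `/var/log/syslog` read into a string) shows whether one queue dominates or the hangs are spread across all of them.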

LSHW info about NIC - added pci card
  *-network
       description: Ethernet interface
       product: 82599 10 Gigabit Network Connection
       vendor: Intel Corporation
       physical id: 0
       bus info: pci@0000:c1:00.0
       logical name: enp193s0
       logical name: /dev/fb0
       version: 01
       serial: NOPE
       size: 10Gbit/s
       capacity: 10Gbit/s
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi msix pciexpress vpd bus_master cap_list rom ethernet physical fibre 10000bt-fd fb
       configuration: autonegotiation=off broadcast=yes depth=32 driver=ixgbe driverversion=5.15.35-1-pve duplex=full firmware=0x800006e3 latency=0 link=yes mode=1024x768 multicast=yes port=fibre speed=10Gbit/s visual=truecolor xres=1024 yres=768
       resources: iomemory:5000-4fff iomemory:5000-4fff irq:194 memory:500e0d00000-500e0d7ffff ioport:d000(size=32) memory:500e0f80000-500e0f83fff memory:9c300000-9c37ffff memory:500e0e80000-500e0f7ffff memory:500e0d80000-500e0e7ffff

The HW is
https://www.gigabyte.com/Enterprise/Rack-Server/R182-Z91-rev-A00
 
Yeah, same thing for me. The port on an Intel Corporation 82599ES 10-Gigabit card is flapping: it resets every second or so in an endless loop, while the other port of the card stays up and works fine.

Code:
root@pve4:~# lspci -nnk | grep -A2 Ethernet
01:00.0 Ethernet controller [0200]: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe [14e4:165f]
    DeviceName: NIC1
    Subsystem: Dell NetXtreme BCM5720 2-port Gigabit Ethernet PCIe [1028:1f5b]
    Kernel driver in use: tg3
    Kernel modules: tg3
01:00.1 Ethernet controller [0200]: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe [14e4:165f]
    DeviceName: NIC2
    Subsystem: Dell NetXtreme BCM5720 2-port Gigabit Ethernet PCIe [1028:1f5b]
    Kernel driver in use: tg3
    Kernel modules: tg3
02:00.0 Ethernet controller [0200]: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe [14e4:165f]
    DeviceName: NIC3
    Subsystem: Dell NetXtreme BCM5720 2-port Gigabit Ethernet PCIe [1028:1f5b]
    Kernel driver in use: tg3
    Kernel modules: tg3
02:00.1 Ethernet controller [0200]: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe [14e4:165f]
    DeviceName: NIC4
    Subsystem: Dell NetXtreme BCM5720 2-port Gigabit Ethernet PCIe [1028:1f5b]
    Kernel driver in use: tg3
    Kernel modules: tg3
--
82:00.0 Ethernet controller [0200]: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection [8086:10fb] (rev 01)
    Subsystem: Device [1dcf:030a]
    Kernel driver in use: ixgbe
--
82:00.1 Ethernet controller [0200]: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection [8086:10fb] (rev 01)
    Subsystem: Device [1dcf:030a]
    Kernel driver in use: ixgbe

/var/log/messages
Code:
Feb 22 21:15:23 pve4 kernel: [82689.112667] ixgbe 0000:82:00.1 enp130s0f1: NIC Link is Up 10 Gbps, Flow Control: RX/TX
Feb 22 21:15:23 pve4 kernel: [82689.112929] vmbr2: port 1(enp130s0f1.280) entered blocking state
Feb 22 21:15:23 pve4 kernel: [82689.112936] vmbr2: port 1(enp130s0f1.280) entered forwarding state
Feb 22 21:15:23 pve4 kernel: [82689.124641] ixgbe 0000:82:00.1 enp130s0f1: Received ECC Err, initiating reset
Feb 22 21:15:23 pve4 kernel: [82689.637414] ixgbe 0000:82:00.1 enp130s0f1: detected SFP+: 4
Feb 22 21:15:23 pve4 kernel: [82689.833544] vmbr2: port 1(enp130s0f1.280) entered disabled state
Feb 22 21:15:25 pve4 kernel: [82691.452628] ixgbe 0000:82:00.1 enp130s0f1: NIC Link is Up 10 Gbps, Flow Control: RX/TX
Feb 22 21:15:25 pve4 kernel: [82691.452882] vmbr2: port 1(enp130s0f1.280) entered blocking state
Feb 22 21:15:25 pve4 kernel: [82691.452889] vmbr2: port 1(enp130s0f1.280) entered forwarding state
Feb 22 21:15:25 pve4 kernel: [82691.666561] ixgbe 0000:82:00.1 enp130s0f1: tx hang 36890 detected on queue 39, resetting adapter
Feb 22 21:15:25 pve4 kernel: [82691.666567] ixgbe 0000:82:00.1 enp130s0f1: initiating reset due to tx timeout
Feb 22 21:15:26 pve4 kernel: [82692.134030] vmbr2: port 1(enp130s0f1.280) entered disabled state
Feb 22 21:15:26 pve4 kernel: [82692.192548] ixgbe 0000:82:00.1 enp130s0f1: detected SFP+: 4
Feb 22 21:15:26 pve4 kernel: [82692.336615] ixgbe 0000:82:00.1 enp130s0f1: NIC Link is Up 10 Gbps, Flow Control: RX/TX
Feb 22 21:15:26 pve4 kernel: [82692.336863] vmbr2: port 1(enp130s0f1.280) entered blocking state
Feb 22 21:15:26 pve4 kernel: [82692.336870] vmbr2: port 1(enp130s0f1.280) entered forwarding state
Feb 22 21:15:26 pve4 kernel: [82692.348610] ixgbe 0000:82:00.1 enp130s0f1: Received ECC Err, initiating reset
Feb 22 21:15:26 pve4 kernel: [82692.872710] ixgbe 0000:82:00.1 enp130s0f1: detected SFP+: 4
Feb 22 21:15:27 pve4 kernel: [82693.162155] vmbr2: port 1(enp130s0f1.280) entered disabled state
Feb 22 21:15:28 pve4 kernel: [82694.680605] ixgbe 0000:82:00.1 enp130s0f1: NIC Link is Up 10 Gbps, Flow Control: RX/TX
Feb 22 21:15:28 pve4 kernel: [82694.680870] vmbr2: port 1(enp130s0f1.280) entered blocking state
Feb 22 21:15:28 pve4 kernel: [82694.680877] vmbr2: port 1(enp130s0f1.280) entered forwarding state
Feb 22 21:15:28 pve4 kernel: [82694.894544] ixgbe 0000:82:00.1 enp130s0f1: tx hang 36892 detected on queue 39, resetting adapter
Feb 22 21:15:28 pve4 kernel: [82694.894551] ixgbe 0000:82:00.1 enp130s0f1: initiating reset due to tx timeout
Feb 22 21:15:29 pve4 kernel: [82695.361239] vmbr2: port 1(enp130s0f1.280) entered disabled state
Feb 22 21:15:29 pve4 kernel: [82695.413568] ixgbe 0000:82:00.1 enp130s0f1: detected SFP+: 4
Feb 22 21:15:31 pve4 kernel: [82697.228621] ixgbe 0000:82:00.1 enp130s0f1: NIC Link is Up 10 Gbps, Flow Control: RX/TX
Feb 22 21:15:31 pve4 kernel: [82697.228887] vmbr2: port 1(enp130s0f1.280) entered blocking state
Feb 22 21:15:31 pve4 kernel: [82697.228894] vmbr2: port 1(enp130s0f1.280) entered forwarding state
Feb 22 21:15:31 pve4 kernel: [82697.442620] ixgbe 0000:82:00.1 enp130s0f1: tx hang 36893 detected on queue 39, resetting adapter
Feb 22 21:15:31 pve4 kernel: [82697.442626] ixgbe 0000:82:00.1 enp130s0f1: initiating reset due to tx timeout
Feb 22 21:15:31 pve4 kernel: [82697.905980] vmbr2: port 1(enp130s0f1.280) entered disabled state
Feb 22 21:15:31 pve4 kernel: [82697.964530] ixgbe 0000:82:00.1 enp130s0f1: detected SFP+: 4
Feb 22 21:15:32 pve4 kernel: [82698.108583] ixgbe 0000:82:00.1 enp130s0f1: NIC Link is Up 10 Gbps, Flow Control: RX/TX
Feb 22 21:15:32 pve4 kernel: [82698.108831] vmbr2: port 1(enp130s0f1.280) entered blocking state
Feb 22 21:15:32 pve4 kernel: [82698.108839] vmbr2: port 1(enp130s0f1.280) entered forwarding state
Feb 22 21:15:32 pve4 kernel: [82698.120837] ixgbe 0000:82:00.1 enp130s0f1: Received ECC Err, initiating reset
Feb 22 21:15:32 pve4 kernel: [82698.637400] ixgbe 0000:82:00.1 enp130s0f1: detected SFP+: 4
Feb 22 21:15:32 pve4 kernel: [82698.921578] vmbr2: port 1(enp130s0f1.280) entered disabled state
Feb 22 21:15:34 pve4 kernel: [82700.556560] ixgbe 0000:82:00.1 enp130s0f1: NIC Link is Up 10 Gbps, Flow Control: RX/TX
Feb 22 21:15:34 pve4 kernel: [82700.556806] vmbr2: port 1(enp130s0f1.280) entered blocking state
Feb 22 21:15:34 pve4 kernel: [82700.556814] vmbr2: port 1(enp130s0f1.280) entered forwarding state
Feb 22 21:15:34 pve4 kernel: [82700.568561] ixgbe 0000:82:00.1 enp130s0f1: Received ECC Err, initiating reset
Feb 22 21:15:34 pve4 kernel: [82701.034010] vmbr2: port 1(enp130s0f1.280) entered disabled state
Feb 22 21:15:35 pve4 kernel: [82701.092490] ixgbe 0000:82:00.1 enp130s0f1: detected SFP+: 4
Feb 22 21:15:36 pve4 kernel: [82702.900546] ixgbe 0000:82:00.1 enp130s0f1: NIC Link is Up 10 Gbps, Flow Control: RX/TX
Feb 22 21:15:36 pve4 kernel: [82702.900843] vmbr2: port 1(enp130s0f1.280) entered blocking state
Feb 22 21:15:36 pve4 kernel: [82702.900851] vmbr2: port 1(enp130s0f1.280) entered forwarding state
Feb 22 21:15:36 pve4 kernel: [82702.912549] ixgbe 0000:82:00.1 enp130s0f1: Received ECC Err, initiating reset
Feb 22 21:15:37 pve4 kernel: [82703.381969] vmbr2: port 1(enp130s0f1.280) entered disabled state
Feb 22 21:15:37 pve4 kernel: [82703.440508] ixgbe 0000:82:00.1 enp130s0f1: detected SFP+: 4
Feb 22 21:15:39 pve4 kernel: [82705.268938] ixgbe 0000:82:00.1: removed PHC on enp130s0f1
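This log mixes two different reset triggers ("Received ECC Err" and "tx timeout"), so it can help to count them separately. A minimal sketch, with made-up names (`RESET_PATTERNS`, `count_reset_causes`); the third pattern is the "lost link" reset message that shows up in the first post's logs.

```python
import re
from collections import Counter

# Reset messages seen in this thread, keyed by a short label (labels are my own).
RESET_PATTERNS = {
    "ecc_error": re.compile(r"Received ECC Err, initiating reset"),
    "tx_timeout": re.compile(r"initiating reset due to tx timeout"),
    "lost_link": re.compile(r"initiating reset due to lost link"),
}


def count_reset_causes(lines):
    """Return a Counter of reset causes found in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        for cause, pattern in RESET_PATTERNS.items():
            if pattern.search(line):
                counts[cause] += 1
                break  # one cause per line is enough
    return counts


if __name__ == "__main__":
    sample_lines = [
        "ixgbe 0000:82:00.1 enp130s0f1: Received ECC Err, initiating reset",
        "ixgbe 0000:82:00.1 enp130s0f1: initiating reset due to tx timeout",
        "ixgbe 0000:82:00.1 enp130s0f1: NIC Link is Up 10 Gbps, Flow Control: RX/TX",
    ]
    for cause, count in count_reset_causes(sample_lines).items():
        print(cause, count)
```

To run it against a real file, pass an open file object, for example `count_reset_causes(open("/var/log/messages"))`.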
 
Same here with this hardware

Code:
root@pve06:~# ethtool -i  enp16s0f1
driver: ixgbe
version: 5.15.83-1-pve
firmware-version: 0x800003df
expansion-rom-version:
bus-info: 0000:10:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes
root@pve06:~# lspci -nnk | grep -A2 Ethernet
10:00.0 Ethernet controller [0200]: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection [8086:10fb] (rev 01)
        Subsystem: Device [1dcf:030a]
        Kernel driver in use: ixgbe
--
10:00.1 Ethernet controller [0200]: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection [8086:10fb] (rev 01)
        Subsystem: Device [1dcf:030a]
        Kernel driver in use: ixgbe
--
25:00.0 Ethernet controller [0200]: Intel Corporation 82576 Gigabit Network Connection [8086:10c9] (rev 01)
        Subsystem: Intel Corporation Gigabit ET Dual Port Server Adapter [8086:a03c]
        Kernel driver in use: igb
--
25:00.1 Ethernet controller [0200]: Intel Corporation 82576 Gigabit Network Connection [8086:10c9] (rev 01)
        Subsystem: Intel Corporation Gigabit ET Dual Port Server Adapter [8086:a03c]
        Kernel driver in use: igb

My replications don't work anymore. Only a reboot gets things working again.

Log

Code:
Feb 24 06:40:52 pve06 kernel: ixgbe 0000:10:00.1 enp16s0f1: tx hang 9959 detected on queue 7, resetting adapter
Feb 24 06:40:52 pve06 kernel: ixgbe 0000:10:00.1 enp16s0f1: initiating reset due to tx timeout
Feb 24 06:40:52 pve06 kernel: ixgbe 0000:10:00.1 enp16s0f1: Reset adapter
Feb 24 06:40:52 pve06 kernel: ixgbe 0000:10:00.1 enp16s0f1: RXDCTL.ENABLE for one or more queues not cleared within the polling period
Feb 24 06:40:52 pve06 kernel: ixgbe 0000:10:00.1 enp16s0f1: TXDCTL.ENABLE for one or more queues not cleared within the polling period
Feb 24 06:40:52 pve06 kernel: ixgbe 0000:10:00.1: master disable timed out
Feb 24 06:40:52 pve06 kernel: ixgbe 0000:10:00.1 enp16s0f1: detected SFP+: 4
Feb 24 06:40:53 pve06 kernel: ixgbe 0000:10:00.1 enp16s0f1: NIC Link is Up 10 Gbps, Flow Control: RX/TX
Feb 24 06:40:53 pve06 kernel: ixgbe 0000:10:00.1 enp16s0f1: Detected Tx Unit Hang
                                Tx Queue             <10>
                                TDH, TDT             <0>, <1>
                                next_to_use          <1>
                                next_to_clean        <0>
                              tx_buffer_info[next_to_clean]
                                time_stamp           <120ec8815>
                                jiffies              <120ec882f>
Feb 24 06:40:53 pve06 kernel: ixgbe 0000:10:00.1 enp16s0f1: tx hang 9960 detected on queue 10, resetting adapter
Feb 24 06:40:53 pve06 kernel: ixgbe 0000:10:00.1 enp16s0f1: initiating reset due to tx timeout
Feb 24 06:40:53 pve06 kernel: ixgbe 0000:10:00.1 enp16s0f1: Reset adapter
Feb 24 06:40:53 pve06 kernel: ixgbe 0000:10:00.1 enp16s0f1: RXDCTL.ENABLE for one or more queues not cleared within the polling period
Feb 24 06:40:53 pve06 kernel: ixgbe 0000:10:00.1 enp16s0f1: TXDCTL.ENABLE for one or more queues not cleared within the polling period
Feb 24 06:40:53 pve06 kernel: ixgbe 0000:10:00.1: master disable timed out
Feb 24 06:40:53 pve06 kernel: ixgbe 0000:10:00.1 enp16s0f1: detected SFP+: 4
Feb 24 06:40:54 pve06 kernel: ixgbe 0000:10:00.1 enp16s0f1: NIC Link is Up 10 Gbps, Flow Control: RX/TX
Feb 24 06:40:54 pve06 kernel: ixgbe 0000:10:00.1 enp16s0f1: Detected Tx Unit Hang
                                Tx Queue             <5>
                                TDH, TDT             <0>, <1>
                                next_to_use          <1>
                                next_to_clean        <0>
                              tx_buffer_info[next_to_clean]
                                time_stamp           <120ec8908>
                                jiffies              <120ec8939>
Feb 24 06:40:54 pve06 kernel: ixgbe 0000:10:00.1 enp16s0f1: Detected Tx Unit Hang
                                Tx Queue             <4>
                                TDH, TDT             <0>, <1>
                                next_to_use          <1>
                                next_to_clean        <0>
                              tx_buffer_info[next_to_clean]
                                time_stamp           <120ec891e>
                                jiffies              <120ec8939>
Feb 24 06:40:54 pve06 kernel: ixgbe 0000:10:00.1 enp16s0f1: tx hang 9961 detected on queue 5, resetting adapter
Feb 24 06:40:54 pve06 kernel: ixgbe 0000:10:00.1 enp16s0f1: tx hang 9961 detected on queue 4, resetting adapter
Feb 24 06:40:54 pve06 kernel: ixgbe 0000:10:00.1 enp16s0f1: initiating reset due to tx timeout
Feb 24 06:40:54 pve06 kernel: ixgbe 0000:10:00.1 enp16s0f1: initiating reset due to tx timeout
Feb 24 06:40:54 pve06 kernel: ixgbe 0000:10:00.1 enp16s0f1: Reset adapter
Feb 24 06:40:54 pve06 kernel: ixgbe 0000:10:00.1 enp16s0f1: RXDCTL.ENABLE for one or more queues not cleared within the polling period
Feb 24 06:40:54 pve06 kernel: ixgbe 0000:10:00.1 enp16s0f1: TXDCTL.ENABLE for one or more queues not cleared within the polling period
Feb 24 06:40:54 pve06 kernel: ixgbe 0000:10:00.1: master disable timed out
Feb 24 06:40:54 pve06 kernel: ixgbe 0000:10:00.1 enp16s0f1: detected SFP+: 4
Feb 24 06:40:55 pve06 kernel: ixgbe 0000:10:00.1 enp16s0f1: NIC Link is Up 10 Gbps, Flow Control: RX/TX
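The `time_stamp` and `jiffies` values in each dump are hex tick counters, so their difference gives a rough idea of how long the oldest uncleaned Tx buffer had been waiting when the hang was reported. This is a sketch under assumptions: the tick rate of 250 Hz is a guess (check with `grep 'CONFIG_HZ=' /boot/config-$(uname -r)`), and the function names are made up.

```python
# Assumption: the kernel ticks at 250 Hz; verify with
#   grep 'CONFIG_HZ=' /boot/config-$(uname -r)
ASSUMED_HZ = 250


def stall_jiffies(time_stamp_hex, jiffies_hex):
    """Ticks between the buffer's time_stamp and the jiffies value in the dump."""
    return int(jiffies_hex, 16) - int(time_stamp_hex, 16)


def stall_ms(time_stamp_hex, jiffies_hex, hz=ASSUMED_HZ):
    """Same interval converted to milliseconds for the assumed tick rate."""
    return stall_jiffies(time_stamp_hex, jiffies_hex) * 1000 / hz


if __name__ == "__main__":
    # Values copied from the queue 10 dump above.
    ticks = stall_jiffies("120ec8815", "120ec882f")
    print(f"{ticks} jiffies, about {stall_ms('120ec8815', '120ec882f'):.0f} ms at {ASSUMED_HZ} Hz")
```

For the queue 10 dump this gives 26 jiffies, which would be roughly 104 ms if the 250 Hz assumption holds.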
 
I seem to be encountering this more frequently: I'm having to restart the hypervisor twice a day, and recently even more often.


Code:
2e:00.0 Ethernet controller: Intel Corporation 82599 10 Gigabit Network Connection (rev 01)
        Subsystem: Beijing Sinead Technology Co., Ltd. 82599 10 Gigabit Network Connection
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 147
        IOMMU group: 29
        Region 0: Memory at 7e02100000 (64-bit, prefetchable) [size=512K]
        Region 2: I/O ports at e000 [disabled] [size=32]
        Region 4: Memory at 7e02380000 (64-bit, prefetchable) [size=16K]
        Expansion ROM at fc500000 [disabled] [size=512K]
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME+
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
                Address: 0000000000000000  Data: 0000
                Masking: 00000000  Pending: 00000000
        Capabilities: [70] MSI-X: Enable+ Count=64 Masked-
                Vector table: BAR=4 offset=00000000
                PBA: BAR=4 offset=00002000
        Capabilities: [a0] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
                DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
                        MaxPayload 512 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
                LnkCap: Port #0, Speed 5GT/s, Width x8, ASPM L0s, Exit Latency L0s unlimited
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
                LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 5GT/s (ok), Width x8 (ok)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR-
                         10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS- TPHComp- ExtTPHComp-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
                         AtomicOpsCtl: ReqEn-
                LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete- EqualizationPhase1-
                         EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
                         Retimer- 2Retimers- CrosslinkRes: unsupported
        Capabilities: [e0] Vital Product Data
                Unknown small resource type 06, will not decode more.
        Capabilities: [100 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr+ BadTLP+ BadDLLP- Rollover- Timeout- AdvNonFatalErr+
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
                AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [140 v1] Device Serial Number 98-b7-85-ff-ff-00-98-cf
        Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
                ARICap: MFVC- ACS-, Next Function: 0
                ARICtl: MFVC- ACS-, Function Group: 0
        Capabilities: [160 v1] Single Root I/O Virtualization (SR-IOV)
                IOVCap: Migration-, Interrupt Message Number: 000
                IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy-
                IOVSta: Migration-
                Initial VFs: 64, Total VFs: 64, Number of VFs: 0, Function Dependency Link: 00
                VF offset: 384, stride: 2, Device ID: 10ed
                Supported Page Size: 00000553, System Page Size: 00000001
                Region 0: Memory at 0000007e02280000 (64-bit, prefetchable)
                Region 3: Memory at 0000007e02180000 (64-bit, prefetchable)
                VF Migration: offset: 00000000, BIR: 0
        Kernel driver in use: ixgbe
        Kernel modules: ixgbe
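In the `lspci` output above, `DevSta` shows `CorrErr+` and the AER `CESta` line shows `RxErr+` and `BadTLP+`, meaning PCIe correctable errors have been logged on this device at some point. A small sketch for pulling the set flags out of such lines when comparing several hosts (the function name `parse_flags` is made up for illustration):

```python
def parse_flags(line):
    """Split an lspci status line into (label, {flag_name: is_set})."""
    label, _, rest = line.partition(":")
    flags = {}
    for token in rest.split():
        # lspci marks each flag with a trailing '+' (set) or '-' (clear).
        if token[-1] in "+-":
            flags[token[:-1]] = token.endswith("+")
    return label.strip(), flags


if __name__ == "__main__":
    samples = [
        "DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-",
        "CESta:  RxErr+ BadTLP+ BadDLLP- Rollover- Timeout- AdvNonFatalErr+",
    ]
    for sample in samples:
        label, flags = parse_flags(sample)
        print(label, "set:", [name for name, is_set in flags.items() if is_set])
```

Whether these bits have anything to do with the Tx hangs is not established here; they only show that the link has seen correctable errors since the status was last cleared.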

Code:
Jun  9 18:04:30 gaea kernel: [ 2224.172497] vmbr0: port 1(enp46s0) entered disabled state
Jun  9 18:04:30 gaea kernel: [ 2224.220412] ixgbe 0000:2e:00.0 enp46s0: NIC Link is Up 10 Gbps, Flow Control: RX/TX
Jun  9 18:04:30 gaea kernel: [ 2224.220587] vmbr0: port 1(enp46s0) entered blocking state
Jun  9 18:04:30 gaea kernel: [ 2224.220593] vmbr0: port 1(enp46s0) entered forwarding state
Jun  9 18:04:30 gaea kernel: [ 2224.428518] ixgbe 0000:2e:00.0 enp46s0: Detected Tx Unit Hang
Jun  9 18:04:30 gaea kernel: [ 2224.428518]   Tx Queue             <23>
Jun  9 18:04:30 gaea kernel: [ 2224.428518]   TDH, TDT             <0>, <1>
Jun  9 18:04:30 gaea kernel: [ 2224.428518]   next_to_use          <1>
Jun  9 18:04:30 gaea kernel: [ 2224.428518]   next_to_clean        <0>
Jun  9 18:04:30 gaea kernel: [ 2224.428518] tx_buffer_info[next_to_clean]
Jun  9 18:04:30 gaea kernel: [ 2224.428518]   time_stamp           <10007569f>
Jun  9 18:04:30 gaea kernel: [ 2224.428518]   jiffies              <1000756d0>
Jun  9 18:04:30 gaea kernel: [ 2224.428537] ixgbe 0000:2e:00.0 enp46s0: tx hang 39 detected on queue 23, resetting adapter
Jun  9 18:04:30 gaea kernel: [ 2224.428541] ixgbe 0000:2e:00.0 enp46s0: initiating reset due to tx timeout
Jun  9 18:04:30 gaea kernel: [ 2224.428550] ixgbe 0000:2e:00.0 enp46s0: Reset adapter
Jun  9 18:04:30 gaea kernel: [ 2224.460902] ixgbe 0000:2e:00.0 enp46s0: RXDCTL.ENABLE for one or more queues not cleared within the polling period
Jun  9 18:04:30 gaea kernel: [ 2224.632222] ixgbe 0000:2e:00.0: primary disable timed out
Jun  9 18:04:30 gaea kernel: [ 2224.913718] ixgbe 0000:2e:00.0 enp46s0: detected SFP+: 5
Jun  9 18:04:31 gaea kernel: [ 2225.064403] ixgbe 0000:2e:00.0 enp46s0: NIC Link is Up 10 Gbps, Flow Control: RX/TX
Jun  9 18:04:31 gaea kernel: [ 2225.168568] ixgbe 0000:2e:00.0 enp46s0: Detected Tx Unit Hang
Jun  9 18:04:31 gaea kernel: [ 2225.168568]   Tx Queue             <13>
Jun  9 18:04:31 gaea kernel: [ 2225.168568]   TDH, TDT             <0>, <1>
Jun  9 18:04:31 gaea kernel: [ 2225.168568]   next_to_use          <1>
Jun  9 18:04:31 gaea kernel: [ 2225.168568]   next_to_clean        <0>
Jun  9 18:04:31 gaea kernel: [ 2225.168568] tx_buffer_info[next_to_clean]
Jun  9 18:04:31 gaea kernel: [ 2225.168568]   time_stamp           <10007576f>
Jun  9 18:04:31 gaea kernel: [ 2225.168568]   jiffies              <100075789>
Jun  9 18:04:31 gaea kernel: [ 2225.168569] ixgbe 0000:2e:00.0 enp46s0: Detected Tx Unit Hang
Jun  9 18:04:31 gaea kernel: [ 2225.168569]   Tx Queue             <16>
Jun  9 18:04:31 gaea kernel: [ 2225.168569]   TDH, TDT             <0>, <1>
Jun  9 18:04:31 gaea kernel: [ 2225.168569]   next_to_use          <1>
Jun  9 18:04:31 gaea kernel: [ 2225.168569]   next_to_clean        <0>
Jun  9 18:04:31 gaea kernel: [ 2225.168569] tx_buffer_info[next_to_clean]
Jun  9 18:04:31 gaea kernel: [ 2225.168569]   time_stamp           <10007576f>
Jun  9 18:04:31 gaea kernel: [ 2225.168569]   jiffies              <100075789>
Jun  9 18:04:31 gaea kernel: [ 2225.168569] ixgbe 0000:2e:00.0 enp46s0: Detected Tx Unit Hang
Jun  9 18:04:31 gaea kernel: [ 2225.168569]   Tx Queue             <15>
Jun  9 18:04:31 gaea kernel: [ 2225.168569]   TDH, TDT             <0>, <1>
Jun  9 18:04:31 gaea kernel: [ 2225.168569]   next_to_use          <1>
Jun  9 18:04:31 gaea kernel: [ 2225.168569]   next_to_clean        <0>
Jun  9 18:04:31 gaea kernel: [ 2225.168569] tx_buffer_info[next_to_clean]
Jun  9 18:04:31 gaea kernel: [ 2225.168569]   time_stamp           <10007576f>
Jun  9 18:04:31 gaea kernel: [ 2225.168569]   jiffies              <100075789>
Jun  9 18:04:31 gaea kernel: [ 2225.168569] ixgbe 0000:2e:00.0 enp46s0: Detected Tx Unit Hang
Jun  9 18:04:31 gaea kernel: [ 2225.168569]   Tx Queue             <3>
Jun  9 18:04:31 gaea kernel: [ 2225.168569]   TDH, TDT             <0>, <2>
Jun  9 18:04:31 gaea kernel: [ 2225.168569]   next_to_use          <2>
Jun  9 18:04:31 gaea kernel: [ 2225.168569]   next_to_clean        <0>
Jun  9 18:04:31 gaea kernel: [ 2225.168569] tx_buffer_info[next_to_clean]
Jun  9 18:04:31 gaea kernel: [ 2225.168569]   time_stamp           <10007576f>
Jun  9 18:04:31 gaea kernel: [ 2225.168569]   jiffies              <100075789>
Jun  9 18:04:31 gaea kernel: [ 2225.168578] ixgbe 0000:2e:00.0 enp46s0: tx hang 40 detected on queue 15, resetting adapter
Jun  9 18:04:31 gaea kernel: [ 2225.168578] ixgbe 0000:2e:00.0 enp46s0: tx hang 40 detected on queue 3, resetting adapter
Jun  9 18:04:31 gaea kernel: [ 2225.168582] ixgbe 0000:2e:00.0 enp46s0: initiating reset due to tx timeout
Jun  9 18:04:31 gaea kernel: [ 2225.168582] ixgbe 0000:2e:00.0 enp46s0: tx hang 40 detected on queue 13, resetting adapter
Jun  9 18:04:31 gaea kernel: [ 2225.168582] ixgbe 0000:2e:00.0 enp46s0: initiating reset due to tx timeout
Jun  9 18:04:31 gaea kernel: [ 2225.168585] ixgbe 0000:2e:00.0 enp46s0: initiating reset due to tx timeout
Jun  9 18:04:31 gaea kernel: [ 2225.168591] ixgbe 0000:2e:00.0 enp46s0: Reset adapter
Jun  9 18:04:31 gaea kernel: [ 2225.168594] ixgbe 0000:2e:00.0 enp46s0: tx hang 41 detected on queue 16, resetting adapter
Jun  9 18:04:31 gaea kernel: [ 2225.200941] ixgbe 0000:2e:00.0 enp46s0: RXDCTL.ENABLE for one or more queues not cleared within the polling period
Jun  9 18:04:31 gaea kernel: [ 2225.372245] ixgbe 0000:2e:00.0: primary disable timed out
Jun  9 18:04:31 gaea kernel: [ 2225.601735] vmbr0: port 1(enp46s0) entered disabled state
Jun  9 18:04:31 gaea kernel: [ 2225.660365] ixgbe 0000:2e:00.0 enp46s0: detected SFP+: 5
Jun  9 18:04:31 gaea kernel: [ 2225.808404] ixgbe 0000:2e:00.0 enp46s0: NIC Link is Up 10 Gbps, Flow Control: RX/TX
Jun  9 18:04:31 gaea kernel: [ 2225.808585] vmbr0: port 1(enp46s0) entered blocking state
Jun  9 18:04:31 gaea kernel: [ 2225.808591] vmbr0: port 1(enp46s0) entered forwarding state
Jun  9 18:04:32 gaea kernel: [ 2226.020523] ixgbe 0000:2e:00.0 enp46s0: Detected Tx Unit Hang
Jun  9 18:04:32 gaea kernel: [ 2226.020523]   Tx Queue             <5>
Jun  9 18:04:32 gaea kernel: [ 2226.020523]   TDH, TDT             <0>, <1>
Jun  9 18:04:32 gaea kernel: [ 2226.020523]   next_to_use          <1>
Jun  9 18:04:32 gaea kernel: [ 2226.020523]   next_to_clean        <0>
Jun  9 18:04:32 gaea kernel: [ 2226.020523] tx_buffer_info[next_to_clean]
Jun  9 18:04:32 gaea kernel: [ 2226.020523]   time_stamp           <10007583d>
Jun  9 18:04:32 gaea kernel: [ 2226.020523]   jiffies              <10007585e>
Jun  9 18:04:32 gaea kernel: [ 2226.020523] ixgbe 0000:2e:00.0 enp46s0: Detected Tx Unit Hang
Jun  9 18:04:32 gaea kernel: [ 2226.020523]   Tx Queue             <12>
Jun  9 18:04:32 gaea kernel: [ 2226.020523]   TDH, TDT             <0>, <1>
Jun  9 18:04:32 gaea kernel: [ 2226.020523]   next_to_use          <1>
Jun  9 18:04:32 gaea kernel: [ 2226.020523]   next_to_clean        <0>
Jun  9 18:04:32 gaea kernel: [ 2226.020523] tx_buffer_info[next_to_clean]
Jun  9 18:04:32 gaea kernel: [ 2226.020523]   time_stamp           <100075840>
Jun  9 18:04:32 gaea kernel: [ 2226.020523]   jiffies              <10007585e>
Jun  9 18:04:32 gaea kernel: [ 2226.020532] ixgbe 0000:2e:00.0 enp46s0: tx hang 41 detected on queue 12, resetting adapter
Jun  9 18:04:32 gaea kernel: [ 2226.020539] ixgbe 0000:2e:00.0 enp46s0: tx hang 41 detected on queue 5, resetting adapter
Jun  9 18:04:32 gaea kernel: [ 2226.020546] ixgbe 0000:2e:00.0 enp46s0: initiating reset due to tx timeout
Jun  9 18:04:32 gaea kernel: [ 2226.020549] ixgbe 0000:2e:00.0 enp46s0: initiating reset due to tx timeout
 
In the last few days, I have had to restart my nodes more and more often because of this error. A newer kernel module (ixgbe-5.19.6) doesn't help either.

The upgrade to PVE 8.0.4 doesn't help either. I upgraded yesterday, and tonight ixgbe hung during replication. :mad:

The only good thing I found is that I can restart the driver without restarting the node. As a workaround, I defined an alias for that.

Bash:
alias resetixgbe='systemctl stop networking.service ; rmmod ixgbe; modprobe ixgbe; systemctl start networking.service'

In my tests, it worked over SSH. In the worst case, you have to do it via IPMI or similar.
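
One caveat with running the alias over SSH: the commands run in the SSH shell itself, so if the session drops while networking is stopped, the remaining steps may never run. A minimal sketch of a variant that detaches the sequence from the session first. The function names and the log path are assumptions for illustration, not part of Proxmox or the driver:

```bash
#!/usr/bin/env bash
# Sketch only: reload_ixgbe and reload_ixgbe_steps are hypothetical names.
# Same four steps as the alias, but run in a detached session so that a
# dropped SSH connection does not interrupt the sequence halfway through.

reload_ixgbe_steps() {
    # Print the steps, one per line, in the order the alias runs them.
    printf '%s\n' \
        'systemctl stop networking.service' \
        'rmmod ixgbe' \
        'modprobe ixgbe' \
        'systemctl start networking.service'
}

reload_ixgbe() {
    # DRY_RUN=1 only prints the steps; otherwise they are executed (needs root).
    if [ "${DRY_RUN:-0}" = "1" ]; then
        reload_ixgbe_steps
        return 0
    fi
    # setsid + nohup detach the job from the SSH session's terminal.
    # Steps are joined with ';' like the alias, so networking is started
    # again even if rmmod or modprobe fails.
    # /var/log/reload_ixgbe.log is an arbitrary, assumed log location.
    setsid nohup sh -c "$(reload_ixgbe_steps | paste -sd ';' -)" \
        >/var/log/reload_ixgbe.log 2>&1 &
}
```

`DRY_RUN=1 reload_ixgbe` prints the four steps without touching anything; plain `reload_ixgbe` as root runs them in the background. IPMI is still the fallback if the node does not come back.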

Bash:
root@pve05:~# pveversion
pve-manager/8.0.4/d258a813cfa6b390 (running kernel: 6.2.16-12-pve)
root@pve05:~# ethtool -i  enp16s0f1
driver: ixgbe
version: 6.2.16-12-pve
firmware-version: 0x800003df
expansion-rom-version:
bus-info: 0000:10:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes
root@pve05:~# lspci -nnk | grep -A2 Ethernet
10:00.0 Ethernet controller [0200]: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection [8086:10fb] (rev 01)
        Subsystem: Beijing Sinead Technology Co., Ltd. 82599ES 10-Gigabit SFI/SFP+ Network Connection [1dcf:030a]
        Kernel driver in use: ixgbe
--
10:00.1 Ethernet controller [0200]: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection [8086:10fb] (rev 01)
        Subsystem: Beijing Sinead Technology Co., Ltd. 82599ES 10-Gigabit SFI/SFP+ Network Connection [1dcf:030a]
        Kernel driver in use: ixgbe
--
25:00.0 Ethernet controller [0200]: Intel Corporation 82576 Gigabit Network Connection [8086:10c9] (rev 01)
        Subsystem: Intel Corporation Gigabit ET Dual Port Server Adapter [8086:a03c]
        Kernel driver in use: igb
--
25:00.1 Ethernet controller [0200]: Intel Corporation 82576 Gigabit Network Connection [8086:10c9] (rev 01)
        Subsystem: Intel Corporation Gigabit ET Dual Port Server Adapter [8086:a03c]
        Kernel driver in use: igb
--
26:00.0 Ethernet controller [0200]: Intel Corporation I210 Gigabit Network Connection [8086:1533] (rev 03)
        Subsystem: ASRock Incorporation I210 Gigabit Network Connection [1849:1533]
        Kernel driver in use: igb
--
27:00.0 Ethernet controller [0200]: Intel Corporation I210 Gigabit Network Connection [8086:1533] (rev 03)
        Subsystem: ASRock Incorporation I210 Gigabit Network Connection [1849:1533]
        Kernel driver in use: igb

This error sucks :-(. How can this error be solved?

@t.lamprecht or @martin any ideas?
 
Probably not helpful, but honestly, the only way we managed to resolve the issue was to fork out for a replacement NIC: Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01).

It was working almost perfectly until just now, when we updated to pve-manager/8.0.4/d258a813cfa6b390 / Linux 6.2.16-12-pve #1 SMP PREEMPT_DYNAMIC PMX 6.2.16-12 (2023-09-04T13:21Z)...

As of the update this morning, the log is filled with:

Sep 18 07:39:10 gaea kernel: i40e 0000:2e:00.0: Error I40E_AQ_RC_ENOSPC, forcing overflow promiscuous on PF
Sep 18 07:39:10 gaea kernel: i40e 0000:2e:00.0: Error I40E_AQ_RC_ENOSPC, forcing overflow promiscuous on PF
Sep 18 07:39:10 gaea kernel: i40e 0000:2e:00.0: Error I40E_AQ_RC_ENOSPC, forcing overflow promiscuous on PF
Sep 18 07:39:10 gaea kernel: i40e 0000:2e:00.0: Error I40E_AQ_RC_ENOSPC, forcing overflow promiscuous on PF
Sep 18 07:39:10 gaea kernel: i40e 0000:2e:00.0: Error I40E_AQ_RC_ENOSPC, forcing overflow promiscuous on PF
Sep 18 07:39:10 gaea kernel: i40e 0000:2e:00.0: Error I40E_AQ_RC_ENOSPC, forcing overflow promiscuous on PF
Sep 18 07:39:10 gaea kernel: i40e 0000:2e:00.0: Error I40E_AQ_RC_ENOSPC, forcing overflow promiscuous on PF
Sep 18 07:39:10 gaea kernel: i40e 0000:2e:00.0: Error I40E_AQ_RC_ENOSPC, forcing overflow promiscuous on PF
 
