Network with bonded interfaces freezes under heavy traffic

rohitp

New Member
Jan 31, 2023
I have a 5-node Proxmox cluster and have noticed that when I run network-heavy workloads on a node, it loses network connectivity and all VMs on it encounter I/O errors.

Each node has multiple NICs. The PVE server stays reachable via a 1G management NIC, but the datapath (VM network) uses a separate bridge on top of a bond of 2 x 100G interfaces from the same NIC, and that bond is what goes unresponsive: I am unable to ping the default gateway over it.

* Rebooting the server fixes the issue for the time being, but it comes back after I resume the workloads (sometimes within a few hours, sometimes after a couple of days). What I have been trying instead of a full reboot is shown below.
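
This is only a rough workaround attempt, assuming ifupdown2 (the PVE 7 default); I am not yet sure it recovers the bond reliably:

Bash:
# bounce only the bond instead of rebooting the whole node
ip link set bond0 down && ip link set bond0 up
# or re-apply the full network configuration (ifupdown2)
ifreload -a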

I found a way to consistently reproduce this with just 5 Windows 10 VMs. When the issue is hit, the LAG itself stays up:

Bash:
# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v5.15.104-1-pve

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2+3 (2)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0
Peer Notification Delay (ms): 0

802.3ad info
LACP active: on
LACP rate: slow
Min links: 0
Aggregator selection policy (ad_select): stable
System priority: 65535

Bash:
Slave Interface: ens9f0
MII Status: up
Speed: 100000 Mbps
Duplex: full

Slave Interface: ens9f1
MII Status: up
Speed: 100000 Mbps
Duplex: full
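
When the hang occurs I also capture the per-slave aggregator and LACP port state, to see whether the actor/partner view still agrees (nothing conclusive so far):

Bash:
# per-slave 802.3ad details at the time of the hang
grep -E 'Slave Interface|MII Status|Aggregator ID|port state' /proc/net/bonding/bond0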


Multipath is down as well, so the VM disks on the iSCSI-backed LVM storage are unreachable:

Bash:
# multipath -ll
3624a93705c7d2fceb0c2448d0001146a dm-6 PURE,FlashArray
size=10T features='0' hwhandler='1 alua' wp=rw
`-+- policy='service-time 0' prio=0 status=active
  |- 18:0:0:251 sdd 8:48  failed faulty running
  `- 19:0:0:251 sdc 8:32  failed faulty running
3624a93705c7d2fceb0c2448d0001146b dm-228 PURE,FlashArray
size=30T features='0' hwhandler='1 alua' wp=rw
`-+- policy='service-time 0' prio=0 status=active
  |- 19:0:0:252 sde 8:64  failed faulty running
  `- 18:0:0:252 sdf 8:80  failed faulty running
3624a93705c7d2fceb0c2448d0001146c dm-5 PURE,FlashArray
size=20T features='0' hwhandler='1 alua' wp=rw
`-+- policy='service-time 0' prio=0 status=enabled
  |- 19:0:0:253 sdg 8:96  failed faulty running
  `- 18:0:0:253 sdh 8:112 failed faulty running
3624a93705c7d2fceb0c2448d00011662 dm-229 PURE,FlashArray
size=20T features='0' hwhandler='1 alua' wp=rw
`-+- policy='service-time 0' prio=0 status=active
  |- 18:0:0:254 sdj 8:144 failed faulty running
  `- 19:0:0:254 sdi 8:128 failed faulty running
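
Once the bond is reachable again (or after the reboot), this is roughly how I bring the paths back, assuming the stock open-iscsi and multipath-tools packages:

Bash:
# rescan all active iSCSI sessions and reload the multipath maps
iscsiadm -m session --rescan
multipath -r
multipath -ll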


Interfaces in question:

Bash:
vmbr1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
        inet6 fe80::107c:caff:fe0d:7752  prefixlen 64  scopeid 0x20<link>
        ether 12:7c:ca:0d:77:52  txqueuelen 1000  (Ethernet)
        RX packets 1378379  bytes 21620140175 (20.1 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1196251  bytes 3196685338 (2.9 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
       
ens9f0: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST>  mtu 9000
        ether 12:7c:ca:0d:77:52  txqueuelen 1000  (Ethernet)
        RX packets 128403809  bytes 1071670603571 (998.0 GiB)
        RX errors 1  dropped 0  overruns 0  frame 1
        TX packets 91009854  bytes 750048282313 (698.5 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ens9f1: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST>  mtu 9000
        ether 12:7c:ca:0d:77:52  txqueuelen 1000  (Ethernet)
        RX packets 58086495  bytes 491816629393 (458.0 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 96836586  bytes 801823835624 (746.7 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
       
       
bond0: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST>  mtu 9000
        ether 12:7c:ca:0d:77:52  txqueuelen 1000  (Ethernet)
        RX packets 186490304  bytes 1563487232964 (1.4 TiB)
        RX errors 1  dropped 63  overruns 0  frame 1
        TX packets 187846440  bytes 1551872117937 (1.4 TiB)
        TX errors 0  dropped 20 overruns 0  carrier 0  collisions 0
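
To see whether the drops happen at the NIC level rather than on the bond, I also check the driver statistics (counter names differ per driver, so the grep pattern is just a guess):

Bash:
# driver-level drop/discard/pause counters on both slaves
ethtool -S ens9f0 | grep -Ei 'drop|discard|pause|out_of_buffer'
ethtool -S ens9f1 | grep -Ei 'drop|discard|pause|out_of_buffer'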


Network configuration (/etc/network/interfaces):

Bash:
auto vmbr1
iface vmbr1 inet manual
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094
        mtu 9000

auto ens9f0
iface ens9f0 inet manual

auto ens9f1
iface ens9f1 inet manual

auto bond0
iface bond0 inet manual
        bond-slaves ens9f0 ens9f1
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer2+3
        mtu 9000
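
To confirm the running bond actually picked up these options, I compare them against the kernel's view:

Bash:
# kernel view of the bond vs. /etc/network/interfaces
ip -d link show bond0
cat /sys/class/net/bond0/bonding/mode \
    /sys/class/net/bond0/bonding/xmit_hash_policy \
    /sys/class/net/bond0/bonding/lacp_rate \
    /sys/class/net/bond0/bonding/miimon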

I have also upgraded PVE to the latest version:

Bash:
# pveversion
pve-manager/7.4-3/9002ab8a (running kernel: 5.15.104-1-pve)
 
I swapped the NIC for a Mellanox ConnectX-6 and am still seeing the same issue:

Bash:
[  567.319189] mlx5_core 0000:4b:00.0: mlx5_wait_for_pages:736:(pid 1312): Skipping wait for vf pages stage
[  570.421106]  connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4295030076, last ping 4295031360, now 4295032640
[  570.421113]  connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4295030076, last ping 4295031360, now 4295032640
[  570.422160]  connection1:0: detected conn error (1022)
[  570.422713]  connection2:0: detected conn error (1022)
[  580.660914]  session2: session recovery timed out after 10 secs
[  580.660948] sd 19:0:0:253: rejecting I/O to offline device
[  580.661427] blk_update_request: I/O error, dev sdi, sector 8561987464 op 0x1:(WRITE) flags 0xca00 phys_seg 5 prio class 0
[  580.661803] blk_update_request: I/O error, dev sdi, sector 8670983288 op 0x1:(WRITE) flags 0xca00 phys_seg 14 prio class 0
[  580.661808] device-mapper: multipath: 253:5: Failing path 8:128.
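
For completeness, driver and firmware on the new card (interface names are assumed unchanged after the swap):

Bash:
# driver/firmware version of the ConnectX-6 ports
ethtool -i ens9f0
ethtool -i ens9f1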
 
Each port of the bond is connected to a different Arista switch; the two switches are configured in MLAG, and the channel-group on those ports runs LACP (802.3ad).
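
To double-check that each slave really lands on a different MLAG peer, I verify the neighbors from the host side (this assumes lldpd is installed and LLDP is enabled on the switch ports):

Bash:
# LLDP neighbor of each bond slave should be a different switch
lldpcli show neighbors ports ens9f0
lldpcli show neighbors ports ens9f1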
 
