I have a 5-node Proxmox cluster and have noticed that when I run network-heavy workloads on a node, it loses network connectivity, resulting in all VMs encountering I/O errors.
I have multiple NICs on each node. The PVE host stays reachable via its 1G management NIC, but the datapath (VM network) uses a separate bridge on top of a bond of 2 x 100G ports on the same NIC, and it is this bond that goes unresponsive: I am unable to ping the default gateway over it.
* Rebooting the server fixes the issue for the time being, but I run into it again after resuming workloads (sometimes within a few hours, sometimes after a couple of days).
I found a way to consistently reproduce this with just 5 x Windows 10 VMs. When the issue is hit, the LAG itself stays up:
Bash:
# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v5.15.104-1-pve
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2+3 (2)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0
Peer Notification Delay (ms): 0
802.3ad info
LACP active: on
LACP rate: slow
Min links: 0
Aggregator selection policy (ad_select): stable
System priority: 65535
Bash:
Slave Interface: ens9f0
MII Status: up
Speed: 100000 Mbps
Duplex: full
Slave Interface: ens9f1
MII Status: up
Speed: 100000 Mbps
Duplex: full
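When it hangs, the next things I plan to capture are the driver logs and per-slave counters; this is just generic dmesg/ethtool/iproute2 usage with my interface names, nothing vendor-specific assumed:
Bash:
# Look for driver resets, TX timeouts or link flaps around the time of the hang
dmesg -T | grep -iE 'ens9f0|ens9f1|bond0|tx.*timeout'

# Per-slave NIC counters from the driver (errors, drops, discards only)
for nic in ens9f0 ens9f1; do
    echo "== $nic =="
    ethtool -S "$nic" | grep -iE 'err|drop|discard' | grep -v ': 0$'
done

# Kernel-level statistics for the bond itself
ip -s link show bond0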
Multipath is down as well, so the VM disks on the iSCSI-backed LVM are unreachable:
Bash:
# multipath -ll
3624a93705c7d2fceb0c2448d0001146a dm-6 PURE,FlashArray
size=10T features='0' hwhandler='1 alua' wp=rw
`-+- policy='service-time 0' prio=0 status=active
|- 18:0:0:251 sdd 8:48 failed faulty running
`- 19:0:0:251 sdc 8:32 failed faulty running
3624a93705c7d2fceb0c2448d0001146b dm-228 PURE,FlashArray
size=30T features='0' hwhandler='1 alua' wp=rw
`-+- policy='service-time 0' prio=0 status=active
|- 19:0:0:252 sde 8:64 failed faulty running
`- 18:0:0:252 sdf 8:80 failed faulty running
3624a93705c7d2fceb0c2448d0001146c dm-5 PURE,FlashArray
size=20T features='0' hwhandler='1 alua' wp=rw
`-+- policy='service-time 0' prio=0 status=enabled
|- 19:0:0:253 sdg 8:96 failed faulty running
`- 18:0:0:253 sdh 8:112 failed faulty running
3624a93705c7d2fceb0c2448d00011662 dm-229 PURE,FlashArray
size=20T features='0' hwhandler='1 alua' wp=rw
`-+- policy='service-time 0' prio=0 status=active
|- 18:0:0:254 sdj 8:144 failed faulty running
`- 19:0:0:254 sdi 8:128 failed faulty running
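So far only a reboot brings the paths back. On the next occurrence I want to try recovering them by hand once the bond is reachable again; this is just the standard open-iscsi / multipath-tools sequence, nothing specific to the PURE array assumed:
Bash:
# Rescan the existing iSCSI sessions once the network path is back
iscsiadm -m session --rescan

# Reload the multipath maps and confirm the paths go active again
multipath -r
multipath -ll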
Interfaces in question:
Bash:
vmbr1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
inet6 fe80::107c:caff:fe0d:7752 prefixlen 64 scopeid 0x20<link>
ether 12:7c:ca:0d:77:52 txqueuelen 1000 (Ethernet)
RX packets 1378379 bytes 21620140175 (20.1 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 1196251 bytes 3196685338 (2.9 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
ens9f0: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 9000
ether 12:7c:ca:0d:77:52 txqueuelen 1000 (Ethernet)
RX packets 128403809 bytes 1071670603571 (998.0 GiB)
RX errors 1 dropped 0 overruns 0 frame 1
TX packets 91009854 bytes 750048282313 (698.5 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
ens9f1: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 9000
ether 12:7c:ca:0d:77:52 txqueuelen 1000 (Ethernet)
RX packets 58086495 bytes 491816629393 (458.0 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 96836586 bytes 801823835624 (746.7 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
bond0: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST> mtu 9000
ether 12:7c:ca:0d:77:52 txqueuelen 1000 (Ethernet)
RX packets 186490304 bytes 1563487232964 (1.4 TiB)
RX errors 1 dropped 63 overruns 0 frame 1
TX packets 187846440 bytes 1551872117937 (1.4 TiB)
TX errors 0 dropped 20 overruns 0 carrier 0 collisions 0
Network configuration:
Bash:
auto vmbr1
iface vmbr1 inet manual
bridge-ports bond0
bridge-stp off
bridge-fd 0
bridge-vlan-aware yes
bridge-vids 2-4094
mtu 9000
auto ens9f0
iface ens9f0 inet manual
auto ens9f1
iface ens9f1 inet manual
auto bond0
iface bond0 inet manual
bond-slaves ens9f0 ens9f1
bond-miimon 100
bond-mode 802.3ad
bond-xmit-hash-policy layer2+3
mtu 9000
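I have not changed any bond options yet. For the next test window I am considering the tweaks below (faster LACP timers, a minimum-link requirement, and a layer3+4 hash) purely to rule things out; as far as I know these are standard ifupdown2 bond options on PVE, but corrections welcome:
Bash:
auto bond0
iface bond0 inet manual
bond-slaves ens9f0 ens9f1
bond-miimon 100
bond-mode 802.3ad
bond-lacp-rate fast
bond-min-links 1
bond-xmit-hash-policy layer3+4
mtu 9000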
I have also upgraded PVE to the latest version:
Bash:
# pveversion
pve-manager/7.4-3/9002ab8a (running kernel: 5.15.104-1-pve)
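If it helps, I can also add the driver and firmware details for the 100G ports; this is what I would run to collect them (plain ethtool/lspci, nothing assumed about the card):
Bash:
# Driver name, version and firmware of the bonded ports
ethtool -i ens9f0
ethtool -i ens9f1

# PCI identification of the adapter
lspci -nnk | grep -iA3 ethernet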