Weird networking lags

NStorm

I'm running the latest PVE 6.x on a 2-host cluster:
Code:
pve-manager/6.4-13/9f411e79 (running kernel: 5.4.128-1-pve)
The physical servers aren't identical, but they are very similar.
I'm having weird networking issues from within the LXC containers: packets get lost whenever there is a constant, fast network flow. A casual ping works fine with no packet loss, but the following command reproduces the issue when run from a container:
Code:
# ping -f -s 972 -M do -i 0.00191 -c 1000 -Q 5 172.16.x.x
PING 172.16.x.x (172.16.x.x) 972(1000) bytes of data.
..........................................................
--- 172.16.x.x ping statistics ---
1000 packets transmitted, 942 received, 5% packet loss, time 1525ms
rtt min/avg/max/mdev = 0.123/0.208/0.279/0.025 ms, ipg/ewma 1.527/0.203 ms
Sometimes it's less than 5%, but it's still unacceptable. I've just done a clean creation of a new CentOS 7 container from the templates and I'm getting the same issue.
The host node has an IP in the same subnet, on the same vmbr0 device, and it runs the same ping command without any lost packets:
Code:
--- 172.16.x.x ping statistics ---
1000 packets transmitted, 1000 received, 0% packet loss, time 999ms
rtt min/avg/max/mdev = 0.119/0.207/0.290/0.020 ms, ipg/ewma 0.999/0.203 ms
And another weird thing: once I migrate any container with this issue to the other node in the cluster, which has a similar network configuration, it starts to run smoothly without lost packets, just as above:
Code:
--- 172.16.x.x ping statistics ---
1000 packets transmitted, 1000 received, 0% packet loss, time 1072ms
rtt min/avg/max/mdev = 0.117/0.196/0.253/0.027 ms, ipg/ewma 1.073/0.174 ms
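(For reference, the migration itself is nothing special, just the standard restart-mode move; a rough example, with the CT ID 101 and the target name node2 being placeholders:)
Code:
# restart-mode migration of a running LXC container to the other cluster node
pct migrate 101 node2 --restart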
To add even more weirdness, the same thing doesn't seem to happen with Debian-based containers.

Host networking is an LACP bond of 10G Ethernet devices on both nodes.
Node1 (with issues) config:
Code:
/etc/network/interfaces:
auto bond0
iface bond0 inet manual
    bond_mode 802.3ad
    bond_miimon 100
    bond_downdelay 200
    bond_updelay 200
    slaves ens1f0 ens1f1

auto vmbr0
iface vmbr0 inet static
    address  172.16.x.x
    netmask  255.255.255.0
    gateway  172.16.x.1
    bridge_ports bond0
    bridge_stp on
    bridge_fd 0
    post-up ip route add 172.16.0.0/19 via 172.16.2.254

And node2 (without issues):
Code:
/etc/network/interfaces:

auto bond0
iface bond0 inet manual
        bond-slaves enp1s0f0 enp1s0f1
        bond-miimon 100
        bond-mode 802.3ad
        bond_downdelay 200
        bond_updelay 200

auto vmbr0
iface vmbr0 inet static
        address 172.16.x.x/24
        gateway 172.16.x.1
        bridge-ports bond0
        bridge-stp on
        bridge-fd 0
        post-up ip route add 172.16.0.0/19 via 172.16.2.254

Both nodes run Intel X520 NICs with the default ixgbe driver. So far I've failed to find any relevant difference between the configs or anything else. The network itself seems fine, since the host node pings normally and so do the Debian-based containers.
/proc/net/bonding/bond0 doesn't show any errors either.
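(In case it helps with comparing the two nodes, these are the kinds of per-slave checks I'm looking at; the ethtool counter names vary by NIC/driver, so the grep pattern is only a rough filter:)
Code:
# per-slave LACP state, link failure counts, aggregator IDs
cat /proc/net/bonding/bond0
# driver and firmware versions, to rule out a mismatch between the nodes
ethtool -i ens1f0; ethtool -i ens1f1
# kernel RX/TX statistics per slave, including errors/dropped
ip -s link show ens1f1
# NIC/driver hardware counters (names differ between drivers)
ethtool -S ens1f1 | grep -Ei 'err|crc|drop|frame'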

Any ideas?
 
Follow-up. Looks like I was able to narrow down the root of the issue.
I'm getting RX errors on one of the two NICs (ports) attached to bond0 whenever I lose packets from the container, and the RX error count grows roughly in step with the number of lost packets.
The RX error count also grows when I ping from the host node, but the host doesn't lose any packets.
I did ifconfig ens1f1 down (the port with the growing RX errors) and I'm not getting any lost packets anymore!
So there seems to be an issue with this NIC, or more likely with the SFP+ transceiver and/or the optical link. BUT! It's still weird that I don't get any lost packets when pinging from the host node and/or from a Debian-based distro. The latter might just be a coincidence, but is there some sort of "link affinity" on bond devices for containers? Or perhaps some different error handling?
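(The only "affinity" I can think of on an LACP bond is the per-flow hashing: the local xmit_hash_policy picks the outgoing slave per MAC/IP, and the switch does its own hashing for the return path, so traffic towards a container's MAC could end up pinned to the bad link while the host's MAC hashes onto the good one. A rough way to check the local side, assuming the switch-side hash policy is checked on the switch itself:)
Code:
# bonding mode and transmit hash policy on the host
cat /sys/class/net/bond0/bonding/mode
cat /sys/class/net/bond0/bonding/xmit_hash_policy
# MAC addresses the bridge has learned (container MACs vs. the host's MAC)
bridge fdb show br vmbr0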

EDIT: All the errors are of the "frame" type. But I'm confused why they don't increase when I ping from the host node, and why I don't lose any packets there. It only happens when I ping from the container.
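(Roughly how the detailed RX error breakdown can be watched on the suspect slave while the flood ping runs from the container and then from the host; the "frame" counter shows up in the second, detailed statistics line:)
Code:
# detailed RX error breakdown (length / crc / frame / fifo / missed)
ip -s -s link show ens1f1
# or watch it live during the flood ping
watch -n1 "ip -s -s link show ens1f1"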
 