Proxmox 8/Ceph Cluster - High error rate on Ceph network

BigBottle

Hello Proxmox community,

We are seeing a high error rate (errors, drops, overruns and frame errors) on the Ceph network interfaces of our newly set up four-node Proxmox 8/Ceph cluster when writing data (e.g. from /dev/urandom to a file inside a virtual machine for testing).
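
For reference, the write test inside a VM looks roughly like this (file path and size are only examples):

Code:
# write ~10 GiB of random data inside the test VM; path and size are examples
dd if=/dev/urandom of=/root/ddtest.bin bs=1M count=10240 oflag=direct status=progress

The bond interfaces on the hypervisors then show counters like these: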

Code:
bond0: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST>  mtu 9000
        inet6 fe80::4cfc:1cff:fe02:7835  prefixlen 64  scopeid 0x20<link>
        ether 4e:dc:2c:a2:79:34  txqueuelen 1000  (Ethernet)
        RX packets 72321527  bytes 477341582531 (444.5 GiB)
        RX errors 79792  dropped 69666  overruns 69666  frame 79624
        TX packets 31629557  bytes 76964599295 (71.6 GiB)
        TX errors 0  dropped 574 overruns 0  carrier 0  collisions 0

bond1: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST>  mtu 9000
        ether be:d3:83:cf:f0:3d  txqueuelen 1000  (Ethernet)
        RX packets 126046100  bytes 891091401606 (829.8 GiB)
        RX errors 53422  dropped 101505  overruns 96059  frame 52228
        TX packets 124032313  bytes 946978782880 (881.9 GiB)
        TX errors 0  dropped 384 overruns 0  carrier 0  collisions 0

All four machines have datacenter NVMe disks and three physical network cards each: two 40 Gbps Mellanox ConnectX-3 CX354A cards and one 10 Gbps card that is not used for Ceph. The errors show up as output errors on our switches (Juniper QFX5100-24Q) and as input (RX) errors on the hypervisors.
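
To see whether the errors sit on one particular port or are spread across both bond slaves, the per-NIC counters on the hypervisors can be checked roughly like this (interface names as in our config below):

Code:
# per-slave software counters
ip -s link show ens1
ip -s link show ens3d1
# NIC/driver statistics; the exact counter names depend on the mlx4_en driver
ethtool -S ens1 | grep -iE 'err|drop|crc'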

Network configuration:

Code:
iface ens1 inet manual
# Left port 1st mellanox

iface ens3d1 inet manual
# Left port 2nd mellanox

auto bond0
iface bond0 inet manual
    bond-slaves ens1 ens3d1
    bond-miimon 100
    bond-mode 802.3ad
    bond-xmit-hash-policy layer3+4
    mtu 9000

auto bond0.10
iface bond0.10 inet static
    address 172.16.10.11/24
    mtu 9000
# CEPH Public

auto bond0.40
iface bond0.40 inet static
    address 172.16.40.11/24
    mtu 9000
# Corosync 1

iface ens3 inet manual
# Right port 2nd mellanox

iface ens1d1 inet manual
# Right port 1st mellanox

auto bond1
iface bond1 inet manual
    bond-slaves ens3 ens1d1
    bond-miimon 100
    bond-mode 802.3ad
    bond-xmit-hash-policy layer3+4
    mtu 9000

iface bond1.20 inet manual
    mtu 9000
# CEPH Cluster

auto vmbr20
iface vmbr20 inet static
    address 172.16.20.11/24
    bridge-ports bond1.20
    bridge-stp off
    bridge-fd 0
    mtu 9000
# CephFS

auto vmbr100
iface vmbr100 inet manual
    bridge-ports bond1
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 100-4000
# VM Public

The IP addresses on the other three hypervisors follow the same scheme on all networks: hv01 ends in .11, hv02 in .12, and so on.
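
To rule out an LACP negotiation problem on the bonds, their state can be checked on the hypervisors roughly like this:

Code:
# 802.3ad status: active aggregator, partner details and per-slave link failure counts
cat /proc/net/bonding/bond0
cat /proc/net/bonding/bond1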

The Juniper switch configuration for the ports:

Code:
#show configuration interfaces ae11
description hv01-ceph-public;
mtu 9028;
aggregated-ether-options {
    minimum-links 1;
    lacp {
        active;
        periodic fast;
    }
}
unit 0 {
    family ethernet-switching {
        interface-mode trunk;
        vlan {
            members [ 20 100 ];
        }
    }
}

#show configuration interfaces ae10
description hv01-ceph-cluster;
mtu 9028;
aggregated-ether-options {
    minimum-links 1;
    lacp {
        active;
        periodic fast;
    }
}
unit 0 {
    family ethernet-switching {
        interface-mode trunk;
        vlan {
            members [ 10 40 ];
        }
    }
}
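
On the switch side, the LACP state of the bundles and their error counters can be inspected with the usual Junos show commands (exact output fields vary between releases); show lacp interfaces shows whether both member links are collecting/distributing, and show interfaces extensive lists the input/output error counters per bundle:

Code:
#show lacp interfaces ae10
#show lacp interfaces ae11
#show interfaces ae10 extensive
#show interfaces ae11 extensive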

What we've tried so far:

- setting the MTU to 9000 on the relevant interfaces (see the network config above; the errors also occurred with the default of 1500; a jumbo-frame check is sketched after this list)

- setting the MTU on the switches to 9216 (no difference compared to the current setting of 9028)

- applying the following sysctl settings on all hypervisors:

Code:
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216

net.core.somaxconn = 32765
net.core.netdev_max_backlog = 32765

vm.swappiness = 1

- setting the RX ring buffer size on the Ethernet interfaces of the hypervisors (default is 1024; the read-back checks are sketched after this list):

Code:
ethtool -G ens1 rx 512     # led to a non-working network
ethtool -G ens1 rx 2048    # slower Ceph throughput
ethtool -G ens1 rx 4096    # even slower
ethtool -G ens1 rx 8192    # hardware maximum / non-working network
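
To double-check the MTU and ring settings mentioned above, roughly the following can be run (the target IP is hv02's Ceph public address from the scheme above; 8972 bytes of ICMP payload plus 28 bytes of headers add up to exactly 9000):

Code:
# does a 9000-byte frame pass unfragmented between two hypervisors on the Ceph public VLAN?
ping -M do -s 8972 -c 5 172.16.10.12
# current vs. maximum RX/TX ring sizes of a bond slave
ethtool -g ens1
# offload settings that can interact with RX drops (feature names depend on the driver)
ethtool -k ens1 | grep -i receive-offload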

Is there anything we're missing, or simply don't see, that causes these high error rates?
 
Any chance to somehow narrow it down to a specific one? I mean, three NICs on four hypervisors, each with two new cables... we can't replace them all.
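
What we could do as a first step is watch the per-port error counters over time and see whether one physical link stands out, roughly like this (interface names as in the config above):

Code:
# print the RX error/drop counters of all four Mellanox ports every 10 seconds
while true; do
    date
    for i in ens1 ens3d1 ens3 ens1d1; do
        printf '%-8s rx_errors=%s rx_dropped=%s rx_frame_errors=%s\n' "$i" \
            "$(cat /sys/class/net/$i/statistics/rx_errors)" \
            "$(cat /sys/class/net/$i/statistics/rx_dropped)" \
            "$(cat /sys/class/net/$i/statistics/rx_frame_errors)"
    done
    sleep 10
done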