Proxmox 8/Ceph Cluster - High error rate on Ceph network

BigBottle

Aug 2, 2023
Hello Proxmox community,

we are seeing a high error rate (errors, drops, overruns and frames) on the Ceph network interfaces of our newly set up four-node Proxmox 8/Ceph cluster when writing data (e.g. from /dev/urandom to a file inside a virtual machine for testing).

Code:
bond0: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST>  mtu 9000
        inet6 fe80::4cfc:1cff:fe02:7835  prefixlen 64  scopeid 0x20<link>
        ether 4e:dc:2c:a2:79:34  txqueuelen 1000  (Ethernet)
        RX packets 72321527  bytes 477341582531 (444.5 GiB)
        RX errors 79792  dropped 69666  overruns 69666  frame 79624
        TX packets 31629557  bytes 76964599295 (71.6 GiB)
        TX errors 0  dropped 574 overruns 0  carrier 0  collisions 0

bond1: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST>  mtu 9000
        ether be:d3:83:cf:f0:3d  txqueuelen 1000  (Ethernet)
        RX packets 126046100  bytes 891091401606 (829.8 GiB)
        RX errors 53422  dropped 101505  overruns 96059  frame 52228
        TX packets 124032313  bytes 946978782880 (881.9 GiB)
        TX errors 0  dropped 384 overruns 0  carrier 0  collisions 0
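
The write load that triggers this is generated roughly like the following; the target path, block size and file size are placeholders for whatever we happen to test with:

Code:
# run inside a test VM whose disk lives on the Ceph pool
dd if=/dev/urandom of=/root/ddtest.bin bs=1M count=10240 status=progress conv=fsync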

All four machines have datacenter NVMe disks and three physical network cards (2x 40 Gbps Mellanox ConnectX-3 CX354A and 1x 10 Gbps, which is not used for Ceph). The errors show up as output errors on our switches (Juniper QFX5100-24Q) and as input (RX) errors on the hypervisors.
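
One common suspect for RX errors/overruns is Ethernet flow control, so the pause settings and the detailed per-port error counters of the bond members are worth dumping; the grep pattern below is just a convenience, and the counter names vary per driver:

Code:
# negotiated speed/duplex of one bond member
ethtool ens1
# pause (flow control) settings
ethtool -a ens1
# detailed RX/TX error breakdown for the physical port
ip -s -s link show ens1
# driver-specific counters
ethtool -S ens1 | grep -iE 'err|drop|pause'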

Network configuration:

Code:
iface ens1 inet manual
# Left port 1st mellanox

iface ens3d1 inet manual
# Left port 2nd mellanox

auto bond0
iface bond0 inet manual
    bond-slaves ens1 ens3d1
    bond-miimon 100
    bond-mode 802.3ad
    bond-xmit-hash-policy layer3+4
    mtu 9000

auto bond0.10
iface bond0.10 inet static
    address 172.16.10.11/24
    mtu 9000
# CEPH Public

auto bond0.40
iface bond0.40 inet static
        address 172.16.40.11/24
        mtu 9000
# Corosync 1

iface ens3 inet manual
# Right port 2nd mellanox

iface ens1d1 inet manual
# Right port 1st mellanox

auto bond1
iface bond1 inet manual
        bond-slaves ens3 ens1d1
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer3+4
        mtu 9000

iface bond1.20 inet manual
        mtu 9000
# CEPH Cluster

auto vmbr20
iface vmbr20 inet static
        address 172.16.20.11/24
        bridge-ports bond1.20
        bridge-stp off
        bridge-fd 0
        mtu 9000
# CephFS

auto vmbr100
iface vmbr100 inet manual
    bridge-ports bond1
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 100-4000
# VM Public

The IP addresses on the other three hypervisors follow the same scheme on all networks: hv01 has .11, hv02 has .12, and so on.
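
For completeness, the LACP state of both bonds as seen by the Linux bonding driver (aggregator ID, partner MAC, per-slave link state) can be checked with:

Code:
cat /proc/net/bonding/bond0
cat /proc/net/bonding/bond1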

The Juniper switch configuration for the ports:

Code:
#show configuration interfaces ae11
description hv01-ceph-public;
mtu 9028;
aggregated-ether-options {
    minimum-links 1;
    lacp {
        active;
        periodic fast;
    }
}
unit 0 {
    family ethernet-switching {
        interface-mode trunk;
        vlan {
            members [ 20 100 ];
        }
    }
}

#show configuration interfaces ae10
description hv01-ceph-cluster;
mtu 9028;
aggregated-ether-options {
    minimum-links 1;
    lacp {
        active;
        periodic fast;
    }
}
unit 0 {
    family ethernet-switching {
        interface-mode trunk;
        vlan {
            members [ 10 40 ];
        }
    }
}
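
The matching checks on the switch side (LACP negotiation state and per-member error counters) use the standard Junos show commands, with the ae numbers from the config above:

Code:
show lacp interfaces ae10
show lacp interfaces ae11
show interfaces ae10 extensive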

What we've tried so far:

- setting the MTU to 9000 on the relevant interfaces (see the network config above; the errors also occurred with the default of 1500)

- setting the MTU on the switches to 9216 (no difference compared to the current setting of 9028)

- applying the following sysctl settings on all hypervisors:

Code:
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216

net.core.somaxconn = 32765
net.core.netdev_max_backlog = 32765

vm.swappiness = 1

- changing the RX ring buffer size on the Ethernet interfaces on the hypervisors (driver default is 1024); a sketch for checking the supported ring sizes and persisting the settings follows after this list:

Code:
ethtool -G ens1 rx 512     # led to a non-working network
ethtool -G ens1 rx 2048    # slower Ceph throughput
ethtool -G ens1 rx 4096    # even slower
ethtool -G ens1 rx 8192    # hardware maximum / non-working network
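
As mentioned in the list above, here is a minimal sketch for inspecting the supported ring sizes and persisting such settings across reboots; the sysctl.d file name and the rx value of 1024 are placeholders, not a recommendation:

Code:
# current and hardware-maximum RX/TX ring sizes of a bond member
ethtool -g ens1

# the sysctl values live in /etc/sysctl.d/<something>.conf and are reloaded with
sysctl --system

# /etc/network/interfaces - re-apply a chosen ring size on ifup
iface ens1 inet manual
        post-up ethtool -G ens1 rx 1024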

Is there anything we're missing, or just not seeing, that could be causing these high error rates?
 
Any chance to somehow narrow it down to a specific one? I mean, three NICs in each of the four hypervisors, each with two new cables... we can't replace them all.
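
One way to narrow it down without swapping everything: compare the per-member error counters across all four hosts, since a single bad cable, port or transceiver usually makes one member interface collect most of the errors while its partner stays clean. A rough sketch, with the interface names from our config:

Code:
# run on each hypervisor and compare
for nic in ens1 ens3d1 ens3 ens1d1; do
        echo "== $nic =="
        ip -s -s link show dev "$nic"
done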
 
