Proxmox 8/Ceph Cluster - High error rate on Ceph network

BigBottle

Hello Proxmox community,

We are seeing a high error rate (errors, drops, overruns and frame errors) on the Ceph network interfaces of our newly set up four-node Proxmox 8/Ceph cluster when writing data (e.g. from /dev/urandom to a file inside a virtual machine for testing).
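
For reference, the write test inside a VM looks roughly like this (file path and size are only examples):

Code:
# write ~10 GiB of random data inside the test VM; path and size are examples
dd if=/dev/urandom of=/root/ddtest.bin bs=1M count=10240 oflag=direct status=progress

The bond interfaces on the hypervisors then show counters like these: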

Code:
bond0: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST>  mtu 9000
        inet6 fe80::4cfc:1cff:fe02:7835  prefixlen 64  scopeid 0x20<link>
        ether 4e:dc:2c:a2:79:34  txqueuelen 1000  (Ethernet)
        RX packets 72321527  bytes 477341582531 (444.5 GiB)
        RX errors 79792  dropped 69666  overruns 69666  frame 79624
        TX packets 31629557  bytes 76964599295 (71.6 GiB)
        TX errors 0  dropped 574 overruns 0  carrier 0  collisions 0

bond1: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST>  mtu 9000
        ether be:d3:83:cf:f0:3d  txqueuelen 1000  (Ethernet)
        RX packets 126046100  bytes 891091401606 (829.8 GiB)
        RX errors 53422  dropped 101505  overruns 96059  frame 52228
        TX packets 124032313  bytes 946978782880 (881.9 GiB)
        TX errors 0  dropped 384 overruns 0  carrier 0  collisions 0

All four machines have datacenter NVMe disks and three physical network cards each: two 40 Gbps Mellanox ConnectX-3 CX354A cards and one 10 Gbps card that is not used for Ceph. The errors show up as output errors on our switches (Juniper QFX5100-24Q) and as input (RX) errors on the hypervisors.
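
To see whether the errors sit on one particular port or are spread across both bond slaves, the per-NIC counters on the hypervisors can be checked roughly like this (interface names as in our config below):

Code:
# per-slave software counters
ip -s link show ens1
ip -s link show ens3d1
# NIC/driver statistics; the exact counter names depend on the mlx4_en driver
ethtool -S ens1 | grep -iE 'err|drop|crc'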

Network configuration:

Code:
iface ens1 inet manual
# Left port 1st mellanox

iface ens3d1 inet manual
# Left port 2nd mellanox

auto bond0
iface bond0 inet manual
    bond-slaves ens1 ens3d1
    bond-miimon 100
    bond-mode 802.3ad
    bond-xmit-hash-policy layer3+4
    mtu 9000

auto bond0.10
iface bond0.10 inet static
    address 172.16.10.11/24
    mtu 9000
# CEPH Public

auto bond0.40
iface bond0.40 inet static
    address 172.16.40.11/24
    mtu 9000
# Corosync 1

iface ens3 inet manual
# Right port 2nd mellanox

iface ens1d1 inet manual
# Right port 1st mellanox

auto bond1
iface bond1 inet manual
    bond-slaves ens3 ens1d1
    bond-miimon 100
    bond-mode 802.3ad
    bond-xmit-hash-policy layer3+4
    mtu 9000

iface bond1.20 inet manual
    mtu 9000
# CEPH Cluster

auto vmbr20
iface vmbr20 inet static
    address 172.16.20.11/24
    bridge-ports bond1.20
    bridge-stp off
    bridge-fd 0
    mtu 9000
# CephFS

auto vmbr100
iface vmbr100 inet manual
    bridge-ports bond1
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 100-4000
# VM Public

The IP addresses on the other three hypervisors follow the same scheme on all networks: hv01 ends in .11, hv02 in .12, and so on.
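
To rule out an LACP negotiation problem on the bonds, their state can be checked on the hypervisors roughly like this:

Code:
# 802.3ad status: active aggregator, partner details and per-slave link failure counts
cat /proc/net/bonding/bond0
cat /proc/net/bonding/bond1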

The Juniper switch configuration for the ports:

Code:
#show configuration interfaces ae11
description hv01-ceph-public;
mtu 9028;
aggregated-ether-options {
    minimum-links 1;
    lacp {
        active;
        periodic fast;
    }
}
unit 0 {
    family ethernet-switching {
        interface-mode trunk;
        vlan {
            members [ 20 100 ];
        }
    }
}

#show configuration interfaces ae10
description hv01-ceph-cluster;
mtu 9028;
aggregated-ether-options {
    minimum-links 1;
    lacp {
        active;
        periodic fast;
    }
}
unit 0 {
    family ethernet-switching {
        interface-mode trunk;
        vlan {
            members [ 10 40 ];
        }
    }
}
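
On the switch side, the LACP state of the bundles and their error counters can be inspected with the usual Junos show commands (exact output fields vary between releases); show lacp interfaces shows whether both member links are collecting/distributing, and show interfaces extensive lists the input/output error counters per bundle:

Code:
#show lacp interfaces ae10
#show lacp interfaces ae11
#show interfaces ae10 extensive
#show interfaces ae11 extensive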

What we've tried so far:

- setting the MTU to 9000 on the relevant interfaces (see the network config above; the errors also occurred with the default of 1500; a jumbo-frame check is sketched after this list)

- setting the MTU on the switches to 9216 (no difference compared to the current setting of 9028)

- applying the following sysctl settings on all hypervisors:

Code:
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216

net.core.somaxconn = 32765
net.core.netdev_max_backlog = 32765

vm.swappiness = 1

- setting the RX ring buffer size on the Ethernet interfaces of the hypervisors (default is 1024; the read-back checks are sketched after this list):

Code:
ethtool -G ens1 rx 512     # led to a non-working network
ethtool -G ens1 rx 2048    # slower Ceph throughput
ethtool -G ens1 rx 4096    # even slower
ethtool -G ens1 rx 8192    # hardware maximum / non-working network
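
To double-check the MTU and ring settings mentioned above, roughly the following can be run (the target IP is hv02's Ceph public address from the scheme above; 8972 bytes of ICMP payload plus 28 bytes of headers add up to exactly 9000):

Code:
# does a 9000-byte frame pass unfragmented between two hypervisors on the Ceph public VLAN?
ping -M do -s 8972 -c 5 172.16.10.12
# current vs. maximum RX/TX ring sizes of a bond slave
ethtool -g ens1
# offload settings that can interact with RX drops (feature names depend on the driver)
ethtool -k ens1 | grep -i receive-offload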

Is there anything we're missing, or simply don't see, that causes these high error rates?
 
Any chance to somehow narrow it down to a specific one? I mean, three NICs on four hypervisors, each with two new cables... we can't replace them all.
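
What we could do as a first step is watch the per-port error counters over time and see whether one physical link stands out, roughly like this (interface names as in the config above):

Code:
# print the RX error/drop counters of all four Mellanox ports every 10 seconds
while true; do
    date
    for i in ens1 ens3d1 ens3 ens1d1; do
        printf '%-8s rx_errors=%s rx_dropped=%s rx_frame_errors=%s\n' "$i" \
            "$(cat /sys/class/net/$i/statistics/rx_errors)" \
            "$(cat /sys/class/net/$i/statistics/rx_dropped)" \
            "$(cat /sys/class/net/$i/statistics/rx_frame_errors)"
    done
    sleep 10
done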