Proxmox 8/Ceph Cluster - High error rate on Ceph-network

BigBottle

New Member
Aug 2, 2023
Hello Proxmox community,

We are seeing a high error rate (errors, drops, overruns and frames) on the Ceph network interfaces of our newly set up four-machine Proxmox 8/Ceph cluster whenever data is written (e.g. from /dev/urandom to a file in a virtual machine for testing).

Code:
bond0: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST>  mtu 9000
        inet6 fe80::4cfc:1cff:fe02:7835  prefixlen 64  scopeid 0x20<link>
        ether 4e:dc:2c:a2:79:34  txqueuelen 1000  (Ethernet)
        RX packets 72321527  bytes 477341582531 (444.5 GiB)
        RX errors 79792  dropped 69666  overruns 69666  frame 79624
        TX packets 31629557  bytes 76964599295 (71.6 GiB)
        TX errors 0  dropped 574 overruns 0  carrier 0  collisions 0

bond1: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST>  mtu 9000
        ether be:d3:83:cf:f0:3d  txqueuelen 1000  (Ethernet)
        RX packets 126046100  bytes 891091401606 (829.8 GiB)
        RX errors 53422  dropped 101505  overruns 96059  frame 52228
        TX packets 124032313  bytes 946978782880 (881.9 GiB)
        TX errors 0  dropped 384 overruns 0  carrier 0  collisions 0
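
The write test mentioned above was of roughly this shape (a sketch; the file path and size are assumptions, not the exact command used):

```shell
# Sustained sequential write inside a test VM; Ceph replication of this
# data is what generates the bursty inter-OSD traffic on the bonds.
# Path and size are assumptions; increase count for a longer sustained write.
dd if=/dev/urandom of=/tmp/ceph-write-test.bin bs=1M count=1024 conv=fsync status=progress
rm -f /tmp/ceph-write-test.bin
```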

All four machines have datacenter NVMe disks and three physical network cards (2x 40 Gbps Mellanox ConnectX-3 CX354A and 1x 10 Gbps, which is not used for Ceph). The errors appear as output errors on our switches (Juniper QFX5100-24Q) and as input (RX) errors on the hypervisors.

Network configuration:

Code:
iface ens1 inet manual
# Left port 1st mellanox

iface ens3d1 inet manual
# Left port 2nd mellanox

auto bond0
iface bond0 inet manual
    bond-slaves ens1 ens3d1
    bond-miimon 100
    bond-mode 802.3ad
    bond-xmit-hash-policy layer3+4
    mtu 9000

auto bond0.10
iface bond0.10 inet static
    address 172.16.10.11/24
    mtu 9000
# CEPH Public

auto bond0.40
iface bond0.40 inet static
        address 172.16.40.11/24
        mtu 9000
# Corosync 1

iface ens3 inet manual
# Right port 2nd mellanox

iface ens1d1 inet manual
# Right port 1st mellanox

auto bond1
iface bond1 inet manual
        bond-slaves ens3 ens1d1
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer3+4
        mtu 9000

iface bond1.20 inet manual
        mtu 9000
# CEPH Cluster

auto vmbr20
iface vmbr20 inet static
        address 172.16.20.11/24
        bridge-ports bond1.20
        bridge-stp off
        bridge-fd 0
        mtu 9000
# CephFS

auto vmbr100
iface vmbr100 inet manual
    bridge-ports bond1
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 100-4000
# VM Public

The IP addresses on the other three hypervisors follow the same scheme for all networks: hv01 has .11, hv02 has .12, and so on.

The juniper switch configuration for the ports:

Code:
#show configuration interfaces ae11
description hv01-ceph-public;
mtu 9028;
aggregated-ether-options {
    minimum-links 1;
    lacp {
        active;
        periodic fast;
    }
}
unit 0 {
    family ethernet-switching {
        interface-mode trunk;
        vlan {
            members [ 20 100 ];
        }
    }
}

#show configuration interfaces ae10
description hv01-ceph-cluster;
mtu 9028;
aggregated-ether-options {
    minimum-links 1;
    lacp {
        active;
        periodic fast;
    }
}
unit 0 {
    family ethernet-switching {
        interface-mode trunk;
        vlan {
            members [ 10 40 ];
        }
    }
}

What we've tried so far:

- setting the MTU to 9000 on the relevant interfaces (see network config above; the errors also occurred with the default of 1500)

- setting the MTU on the switches to 9216 (no difference compared to the current setting of 9028)

- setting the following sysctl settings on all hypervisors:

Code:
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216

net.core.somaxconn = 32765
net.core.netdev_max_backlog = 32765

vm.swappiness = 1

- setting the RX buffer size on the Ethernet interfaces on the hypervisors (default is 1024):

Code:
ethtool -G ens1 rx 512     # led to a non-working network
ethtool -G ens1 rx 2048    # slower Ceph throughput
ethtool -G ens1 rx 4096    # even slower
ethtool -G ens1 rx 8192    # hardware max / non-working network

Is there anything we're missing, or just don't see, that causes these high error rates?
 
Any chance to somehow narrow it down to a specific one? I mean, three NICs in four hypervisors, each with two new cables... we can't replace them all.
 
I have the same issue. When I stop the OSD, the errors/overruns stop.

Code:
Iface  MTU   RX-OK   RX-ERR  RX-DRP  RX-OVR  TX-OK  TX-ERR  TX-DRP  TX-OVR  Flg
bond0  1500  211981  179     13      179     45057  0       13      0       BMmRU
eno1   1500  205331  179     0       179     44139  0       0       0       BMsRU

OSD performance is good:

Code:
osd  commit_latency(ms)  apply_latency(ms)
  2                   1                 1
  1                   0                 0
  0                   0                 0

Code:
  data:
    pools:   2 pools, 33 pgs
    objects: 123.71k objects, 455 GiB
    usage:   1.3 TiB used, 4.2 TiB / 5.5 TiB avail
    pgs:     33 active+clean

  io:
    client: 0 B/s rd, 191 KiB/s wr, 0 op/s rd, 13 op/s wr
 
@RichtigerBot the fact that errors stop when OSDs stop is a useful clue — it narrows this down to how the NIC handles Ceph's bursty replication traffic.

The `overruns` counter means the NIC's RX descriptor ring (in host RAM, where the NIC DMAs incoming packets) was full when a packet arrived — the kernel couldn't drain it fast enough. Such a packet is dropped before the kernel's network stack ever sees it. Ceph OSD replication sends large bursts during writes and recovery, which can overflow the ring if interrupt processing can't keep up.

Note that TCP socket buffer tuning (`tcp_rmem`/`tcp_wmem`) operates at a higher layer and won't help here:

Code:
Wire → NIC → DMA → RX ring buffer → kernel network stack → TCP socket buffer → application (OSD)
                          ↑                                       ↑
                    overruns here                      tcp_rmem/wmem here
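
To watch that counter move while a write test runs, the sysfs statistics can be polled directly (a sketch; ifconfig's "overruns" column corresponds to `rx_fifo_errors` in sysfs, and the interface name is an assumption):

```shell
# Print the per-second growth of an interface's RX overrun counter,
# so bursts can be correlated with Ceph writes.
watch_overruns() {   # $1 = interface, $2 = seconds to watch
    f=/sys/class/net/$1/statistics/rx_fifo_errors
    prev=$(cat "$f")
    for _ in $(seq "$2"); do
        sleep 1
        cur=$(cat "$f")
        echo "overruns/s: $((cur - prev))"
        prev=$cur
    done
}
# e.g. on a hypervisor, while dd is writing in the VM:
# watch_overruns bond0 10
```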

A few things to check:

1. RSS (Receive Side Scaling) — check how many RX queues your NIC has

This is likely the most important one. Without RSS (or with only 1 RX queue), the NIC has a single ring buffer drained by a single CPU core. With RSS, the NIC distributes incoming packets across multiple RX queues — each with its own ring buffer and interrupt on a different core — so the drain work is parallelized.

Your `eno1` is likely an Intel onboard NIC (igb or ixgbe driver). Check the current queue count:

Bash:
ethtool -l eno1

Intel NICs report "combined" channels. If the count is low, increase it:

Bash:
ethtool -L eno1 combined 8

(For Mellanox ConnectX NICs with the mlx4 driver, the equivalent is `ethtool -L <iface> rx 8` — mlx4 reports separate RX/TX counts rather than combined.)

With multiple queues, the NIC hardware hashes each packet's (source IP, destination IP, TCP ports) tuple and distributes flows across queues. Since Ceph's OSD-to-OSD connections use different TCP ports, they naturally spread across different queues and cores.
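
To see whether that spreading actually happens, the hash fields and the queue mapping can be inspected (interface name is an assumption; not every driver supports both queries):

```shell
ethtool -n eno1 rx-flow-hash tcp4   # which header fields feed the RSS hash for TCP/IPv4
ethtool -x eno1                     # RSS indirection table: hash result -> RX queue mapping
```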

2. Ring buffer size — revisit after checking RSS

A larger ring buffer with a single queue doesn't help much on its own — it just adds per-packet queueing delay, which hurts Ceph's latency-sensitive operations (heartbeats, write acks). Combined with multiple RSS queues, however, a larger ring becomes effective. Check your current and maximum sizes:

Bash:
ethtool -g eno1
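
A rough worst-case figure for that added delay: draining a completely full ring at line rate takes ring_size × frame_size ÷ link_speed. With the 4096-descriptor setting tried above and 9000-byte jumbo frames on a 40 Gbps link (assumed values):

```shell
# 4096 descriptors * 9000 B per frame / 5e9 B/s (40 Gbit/s) ≈ 7.4 ms
awk 'BEGIN { ring = 4096; frame = 9000; bytes_per_s = 40e9 / 8;
             printf "worst-case drain delay: %.1f ms\n", ring * frame / bytes_per_s * 1000 }'
```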

3. Per-NIC error counters

`ethtool -S` breaks errors down by cause, which helps distinguish ring buffer overflows from physical-layer issues (CRC, cable):

Bash:
ethtool -S eno1 | grep -iE 'err|drop|over|crc|discard'
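
Since those counters accumulate from boot, a before/after snapshot around a single test write shows which counter is actually moving (a sketch; interface name is an assumption):

```shell
ethtool -S ens1 > /tmp/nic-stats.before
# ... run the write test in the VM ...
ethtool -S ens1 > /tmp/nic-stats.after
diff /tmp/nic-stats.before /tmp/nic-stats.after | grep -iE 'err|drop|over|discard'
```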

4. Flow control (pause frames)

This lets the NIC tell the switch to slow down when its buffer is filling:

Bash:
ethtool -a eno1          # check current setting
ethtool -A eno1 rx on    # enable if off

The switch side needs to honor pause frames too — worth checking in your switch config.
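
Whether pause frames are actually being exchanged also shows up in the driver statistics (a sketch; counter names vary by driver and the interface name is an assumption):

```shell
ethtool -S ens1 | grep -i pause    # rx/tx pause frame counters, if the driver exposes them
ethtool -a ens1                    # pause parameters currently negotiated with the switch
```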

The `ethtool -l eno1` output would be the most useful starting point — it tells us whether RSS is the bottleneck.