Challenges with MTU 9000 and CEPH Deployment on Proxmox Cluster Using LACP and TP-Link Switches

blucas

New Member
Apr 14, 2024
Hi,

I'm struggling with a networking issue that seems to affect a CEPH cluster deployed across multiple nodes in my Proxmox environment. CEPH has never worked since installation: nodes 2 and 3 (the second and third installations) both give error 500 (timeout). The configuration looks correct to me, but I'm seeing problems related to MTU settings and node communication, which I suspect are what is breaking CEPH on nodes 2 and 3.

When the MTU is the default 1500, everything works fine.

This is my network config in the nodes:
Code:
auto lo
iface lo inet loopback

auto eno49
iface eno49 inet manual
        mtu 9000

auto eno50
iface eno50 inet manual
        mtu 9000

iface eno1 inet manual

iface eno2 inet manual

iface eno3 inet manual

iface eno4 inet manual

auto bond0
iface bond0 inet manual
        bond-slaves eno49 eno50
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer2
        mtu 9000

auto vmbr0
iface vmbr0 inet static
        address 192.168.21.224/24
        gateway 192.168.21.254
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
        mtu 9000

source /etc/network/interfaces.d/*

All switches are configured to support an MTU of 9000, and this setting is global. Despite this, when I perform ping tests with large packets, they fail, suggesting that the MTU 9000 is not being upheld somewhere in the network.
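
Before blaming the switches, it may be worth confirming that every layer on the node itself (NICs, bond, bridge) really ended up at 9000. A minimal sketch, assuming a Linux host that exposes each interface's MTU under /sys/class/net; `check_mtu` and the `SYSFS_NET` override are hypothetical names used only for this illustration:

```shell
#!/bin/sh
# Sketch: compare the kernel's effective MTU of each interface to an expected value.
# SYSFS_NET is a hypothetical override for the sysfs directory (useful for dry runs).

# check_mtu EXPECTED IFACE...
# Prints one line per interface; exit status is non-zero if any interface differs.
# The function body runs in a subshell, so "exit" only ends the function.
check_mtu() (
    expected=$1
    shift
    rc=0
    for ifc in "$@"; do
        f="${SYSFS_NET:-/sys/class/net}/$ifc/mtu"
        if [ ! -r "$f" ]; then
            echo "$ifc: not found"
            rc=1
            continue
        fi
        actual=$(cat "$f")
        if [ "$actual" -eq "$expected" ]; then
            echo "$ifc: ok ($actual)"
        else
            echo "$ifc: MISMATCH ($actual, expected $expected)"
            rc=1
        fi
    done
    exit "$rc"
)
```

With the interface names from the config above, this would be called as `check_mtu 9000 eno49 eno50 bond0 vmbr0`. It only checks the local node, so a clean result still leaves the switches and the other nodes as suspects.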

Interestingly, the setup works fine locally (in loopback), but fails when communicating across the network, particularly affecting the installation and operation of CEPH on nodes 2 and 3. This leads me to suspect this might be the reason why CEPH is not functioning correctly on these nodes.

I would also like to note that the HA cluster is functioning well, which suggests the issue is probably isolated to the network configuration or MTU handling.

I am looking for suggestions or insights on what else I can check or configure to resolve these issues. Has anyone faced something similar, or does anyone have experience with MTU issues in LACP configurations with TP-Link switches?
 
I'm also having this issue. I wanted to move to MTU 9000 and had a similar config. It seemed to ping ok with smaller packets but Ceph was just dead as if there was no network at all.

HTTP and the Proxmox UI were OK, strangely.

Did you ever figure it out?
 
Try pinging with different packet sizes to determine whether MTU 9000 is even possible:

ping <ip> -M do -s <mtu> -c 4


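Start with a size that matches MTU 9000 and reduce it in steps until you no longer get "message too long", then raise it by 1 until you get "message too long" again. The last size that worked is what your network supports. Note that -s sets the ICMP payload size, not the MTU: over IPv4 the packet is 28 bytes larger (20-byte IP header plus 8-byte ICMP header), so an MTU of 9000 corresponds to -s 8972. Sometimes it's not only the MTU setting of your network devices: the traffic itself may carry additional headers (VLAN, QinQ, VXLAN ...) that reduce the usable MTU.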
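
The manual step-down, step-up search can be automated with a binary search. A minimal sketch, assuming Linux iputils ping (for `-M do` and `-W`) and IPv4, where the packet is 28 bytes larger than the `-s` payload; `payload_for_mtu`, `probe_ping`, `find_max_payload` and the `TARGET` variable are hypothetical names for this illustration:

```shell
#!/bin/sh
# Sketch: binary-search the largest ICMP payload that passes with the
# don't-fragment bit set. Assumes success is monotonic in payload size.

# IPv4 header (20 bytes) + ICMP header (8 bytes) sit on top of the -s payload.
payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# Default probe: a single ping with fragmentation prohibited.
# TARGET (the peer's IP) must be set by the caller.
probe_ping() {
    ping -M do -s "$1" -c 1 -W 1 "$TARGET" >/dev/null 2>&1
}

# find_max_payload LOW HIGH [PROBE]
# Prints the largest payload in [LOW, HIGH] for which PROBE succeeds.
# Prints 0 and fails if even LOW does not get through.
# The function body runs in a subshell, so "exit" only ends the function.
find_max_payload() (
    low=$1
    high=$2
    probe=${3:-probe_ping}
    "$probe" "$low" || { echo 0; exit 1; }
    while [ "$low" -lt "$high" ]; do
        mid=$(( (low + high + 1) / 2 ))
        if "$probe" "$mid"; then
            low=$mid              # mid works: search above it
        else
            high=$(( mid - 1 ))   # mid fails: search below it
        fi
    done
    echo "$low"
)
```

A possible invocation would be `TARGET=<ip>; find_max_payload 1472 "$(payload_for_mtu 9000)"`; adding 28 to the printed value gives the usable path MTU. A single lost ping counts as a failure here, so on a lossy link the result may come out too low.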