SDN apply or just systemctl restart networking makes CEPH hang

danmagi-jje · Aug 8, 2024

Our setup:

Proxmox VE 8.2 cluster

4 x 64 thread AMD nodes each with 384GB RAM, 2x480GB SSD for OS, 5x7TB NVMe for CEPH, 2x10Gbit (Intel XL710) + 2x25Gbit (ConnectX6LX) NICs

The 10Gbit bond0 has a vlan aware vmbr0 for VMs and also a mgmt20 bridge interface with IP using bond0.20

The 25Gbit bond1 is for CEPH storage and both Proxmox and CEPH cluster traffic, which is in two separate VLANs bond1.69 and bond1.70, using 169.254.69.0/24 and 169.254.70.0/24 IP addresses

Cluster and CEPH look good, and we've done some successfully speed and stability tests.

We'd like to utilize SDN by creating a single VLAN zone, and a handful of vNet interfaces with each their separate VLAN IDs.

But... When we apply any SDN configuration, even an empty one, it results in a network reload on the four nodes.
This is when CEPH becomes unresponsive, and we cannot get a ceph -s status, as it just sits there.
The UI shows error 500 when trying to access the CEPH section.
Restarting ceph.target doesn't make CEPH work, only if we restart the hosts or do a "systemctl restart networking" does CEPH return to a working state.

We're able to replicate this behaviour by doing a "systemctl reload networking" on a node, and its CEPH will become unresponsive until we restart networking or reboot the node.

I'm not finding anything significant in the journalctl entries for networking, ceph.target or other ceph units.

Did anyone else experience something similar to this?

Code:

auto lo
iface lo inet loopback


auto eth0
iface eth0 inet manual


auto eth1
iface eth1 inet manual


auto eth2
iface eth2 inet manual


auto eth3
iface eth3 inet manual


iface enxbe3af2b6059f inet manual


auto bond0
iface bond0 inet manual
        bond-slaves eth0 eth1
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer2
        bond-lacp-rate 1


iface bond0.20 inet manual


auto bond1
iface bond1 inet manual
        bond-slaves eth2 eth3
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy encap3+4
        mtu 9000
        bond-downdelay 200
        bond-updelay 200
        bond-lacp-rate 1


auto bond1.69
iface bond1.69 inet static
        address 169.254.69.2/24


auto bond1.70
iface bond1.70 inet static
        address 169.254.70.2/24


auto mgmt20
iface mgmt20 inet static
        address 172.16.1.211/24
        gateway 172.16.1.1
        bridge-ports bond0.20
        bridge-stp off
        bridge-fd 0


auto vmbr0
iface vmbr0 inet manual
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094


source /etc/network/interfaces.d/*

shanreich · Aug 9, 2024

If it would be possible for you, could you post the debug output of reloading the network configuration? That is, if that isn't a production system and you're not willing to provoke a network failure:

Code:

ifreload -avd 2> ifreload.txt

danmagi-jje · Aug 9, 2024

After executing this one, it takes a couple of seconds before I'm no longer getting output from "ceph -s" but it just hanging there.

shanreich · Aug 9, 2024

Can you check whether the MTU is properly set after reloading the network configuration by pinging with the following CMD (using the IPs from the Ceph network):

Code:

ping -M do -s 8972 -c 4 <ip>

danmagi-jje · Aug 9, 2024

For bond0 and bridges using it, the MTU is 1500

Code:

root@pve-node-a:~# ping -M do -s 8972 -c 4 172.16.1.1
PING 172.16.1.1 (172.16.1.1) 8972(9000) bytes of data.
ping: local error: message too long, mtu=1500

Code:

root@pve-node-a:~# ifconfig mgmt20
mgmt20: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.16.1.210  netmask 255.255.255.0  broadcast 0.0.0.0
        inet6 fe80::7ec2:55ff:feab:6f38  prefixlen 64  scopeid 0x20<link>
        ether 7c:c2:55:ab:6f:38  txqueuelen 1000  (Ethernet)
        RX packets 14049  bytes 3778705 (3.6 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 7588  bytes 4725699 (4.5 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

shanreich · Aug 9, 2024

But Ceph is on bond1, right? So pinging 169.254.x.2/24 would be interesting. In any case, can you try setting the 9000 MTU on eth2/3 as well, not only on bond1 and check if it works then?

danmagi-jje · Aug 9, 2024

Yes, CEPH cluster and private/public is on the VLANs 69 and 70 of the bond1 interface. These have MTU 9000.
During "ceph -s" not providing output after the ifreload, I can ping within the 169.254.69.0/24 and 169.254.70.0/24 just fine between all the hosts.

Excerpt from the /etc/pve/corosync.conf

Code:

nodelist {
  node {
    name: pve-node-a
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 169.254.69.1
  }
  node {
    name: pve-node-b
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 169.254.69.2
  }
  node {
    name: pve-node-c
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 169.254.69.3
  }
  node {
    name: pve-node-d
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 169.254.69.4
  }
}

And the /etc/pve/ceph.conf

Code:

[mon.pve-node-a]
        public_addr = 169.254.70.1

[mon.pve-node-b]
        public_addr = 169.254.70.2

[mon.pve-node-c]
        public_addr = 169.254.70.3

[mon.pve-node-d]
        public_addr = 169.254.70.4

After the ifreload on a specific not, that node's journactl starts showing

Code:

Aug 09 10:17:57 pve-node-d ceph-osd[3436]: 2024-08-09T10:17:57.777+0000 7694f04006c0 -1 osd.12 2535 heartbeat_check: no reply from 169.254.70.1:6806 osd.4 since back 2024-08-09T10:17:30.641655+0000 front 2024-08-09T10:17:30.641663+0000 (oldest deadline 2024-08-09T10:17:52.341620+0000)
Aug 09 10:17:57 pve-node-d ceph-osd[3436]: 2024-08-09T10:17:57.777+0000 7694f04006c0 -1 osd.12 2535 heartbeat_check: no reply from 169.254.70.2:6813 osd.5 since back 2024-08-09T10:17:32.342026+0000 front 2024-08-09T10:17:30.641720+0000 (oldest deadline 2024-08-09T10:17:52.341620+0000)
Aug 09 10:17:57 pve-node-d ceph-osd[3436]: 2024-08-09T10:17:57.777+0000 7694f04006c0 -1 osd.12 2535 heartbeat_check: no reply from 169.254.70.3:6819 osd.8 since back 2024-08-09T10:17:32.341890+0000 front 2024-08-09T10:17:30.641700+0000 (oldest deadline 2024-08-09T10:17:52.341620+0000)

But still I can ping the IP's and see the mentioned port open

Code:

root@pve-node-d:~# nmap 169.254.70.2 -p 6822 --open
Starting Nmap 7.93 ( https://nmap.org ) at 2024-08-09 10:19 UTC
PORT     STATE SERVICE
6822/tcp open  unknown

root@pve-node-d:~# nmap 169.254.70.3 -p 6819 --open
PORT     STATE SERVICE
6819/tcp open  unknown

root@pve-node-d:~# nmap 169.254.70.1 -p 6806 --open
PORT     STATE SERVICE
6806/tcp open  unknown

shanreich · Aug 9, 2024

danmagi-jje said:
During "ceph -s" not providing output after the ifreload, I can ping within the 169.254.69.0/24 and 169.254.70.0/24 just fine between all the hosts.

Did you try with the command above that forces larger packet sizes, or did you just use regular ping?

danmagi-jje · Aug 9, 2024

After the ifreload and ceph -s not responding, I checked that the bond1 MTU stays at 9000

Code:

root@pve-node-d:~# ifconfig bond1
bond1: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST>  mtu 9000
        inet6 fe80::7ec2:55ff:fea4:d41e  prefixlen 64  scopeid 0x20<link>
        ether 7c:c2:55:a4:d4:1e  txqueuelen 1000  (Ethernet)
        RX packets 394102  bytes 3083646229 (2.8 GiB)
        RX errors 585  dropped 0  overruns 0  frame 585
        TX packets 76970  bytes 57583175 (54.9 MiB)
        TX errors 0  dropped 19 overruns 0  carrier 0  collisions 0

And pinging with jumbo works within the bond1 VLAN

Code:

root@pve-node-d:~# ping -M do -s 8972 -c 4  169.254.70.1
PING 169.254.70.1 (169.254.70.1) 8972(9000) bytes of data.
8980 bytes from 169.254.70.1: icmp_seq=1 ttl=64 time=0.088 ms
8980 bytes from 169.254.70.1: icmp_seq=2 ttl=64 time=0.100 ms
8980 bytes from 169.254.70.1: icmp_seq=3 ttl=64 time=0.107 ms
8980 bytes from 169.254.70.1: icmp_seq=4 ttl=64 time=0.107 ms

danmagi-jje · Aug 9, 2024

hmm... I might have tested wrong... after the ifreload it seems the jumbo size pings don't get replies actually..

Code:

root@pve-node-d:~# ping -M do -s 8972 -c 4  169.254.70.1
PING 169.254.70.1 (169.254.70.1) 8972(9000) bytes of data.

--- 169.254.70.1 ping statistics ---
4 packets transmitted, 0 received, 100% packet loss, time 3066ms

shanreich · Aug 9, 2024

I assume that might be due to the MTU only being set on bond1, but not the underlying network interfaces. if I remember correctly that led to issues for some other users in the past. I'd recommend trying to set the MTU for the bond slaves as well in the interfaces file and checking whether that resolves the issue:

Code:

auto eth2
iface eth2 inet manual
    mtu 9000


auto eth3
iface eth3 inet manual
    mtu 9000

danmagi-jje · Aug 9, 2024

Dear Stefan

That seems to have been the exact issue here as well. I added the MTU 9000 on the physical interfaces of bond1 in the UI, updated and now we can reload interfaces and apply SDN without killing CEPH.

Thank you very much

Now I'll test adding back my initial Zones and vNets.

SDN apply or just systemctl restart networking makes CEPH hang

danmagi-jje

New Member

shanreich

Proxmox Staff Member

danmagi-jje

New Member

Attachments

shanreich

Proxmox Staff Member

danmagi-jje

New Member

shanreich

Proxmox Staff Member

danmagi-jje

New Member

shanreich

Proxmox Staff Member

danmagi-jje

New Member

danmagi-jje

New Member

shanreich

Proxmox Staff Member

danmagi-jje

New Member

We value your privacy