Our setup:
Proxmox VE 8.2 cluster
4 x 64 thread AMD nodes each with 384GB RAM, 2x480GB SSD for OS, 5x7TB NVMe for CEPH, 2x10Gbit (Intel XL710) + 2x25Gbit (ConnectX6LX) NICs
The 10Gbit bond0 carries a VLAN-aware vmbr0 for VMs, plus a mgmt20 bridge (with the management IP) on bond0.20
The 25Gbit bond1 carries CEPH storage plus Proxmox and CEPH cluster traffic, split across two VLANs (bond1.69 and bond1.70) using 169.254.69.0/24 and 169.254.70.0/24 addresses
Cluster and CEPH look good, and we've run some successful speed and stability tests.
We'd like to utilize SDN by creating a single VLAN zone and a handful of VNets, each with its own VLAN ID, roughly as sketched below.
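For illustration only: we'd create this via the GUI, but expressed as API calls it would look something like the following (zone/VNet names and VLAN tags are placeholders, not our actual IDs):

Code:
# single VLAN zone on top of the VLAN-aware vmbr0 (names/tags are examples)
pvesh create /cluster/sdn/zones --zone vlanzone --type vlan --bridge vmbr0

# a couple of VNets, each pinned to its own VLAN ID
pvesh create /cluster/sdn/vnets --vnet vnet101 --zone vlanzone --tag 101
pvesh create /cluster/sdn/vnets --vnet vnet102 --zone vlanzone --tag 102

# "Apply" - this is the step that triggers the network reload on every node
pvesh set /cluster/sdn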
But... when we apply any SDN configuration, even an empty one, it triggers a network reload on all four nodes.
At that point CEPH becomes unresponsive; a ceph -s just hangs and never returns.
The UI shows error 500 when trying to access the CEPH section.
Restarting ceph.target doesn't bring CEPH back; it only returns to a working state after we reboot the host or run "systemctl restart networking".
We can reproduce this at will by running "systemctl reload networking" on a node: CEPH on that node becomes unresponsive until we restart networking or reboot.
I'm not finding anything significant in the journalctl entries for networking, ceph.target or other ceph units.
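To be concrete, this is the single-node sequence that triggers and then clears the problem, plus the journal queries we've been using (the unit name patterns are just what we think is relevant, adjust as needed):

Code:
# trigger: reload networking on one node -> ceph -s on that node hangs
systemctl reload networking
ceph -s          # never returns

# recover: a full restart of networking (or a reboot) brings CEPH back
systemctl restart networking
ceph -s          # healthy again

# logs checked so far - nothing obviously wrong around the reload
journalctl -b -u networking
journalctl -b -u ceph.target -u 'ceph-mon@*' -u 'ceph-osd@*' -u 'ceph-mgr@*'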
Did anyone else experience something similar to this?
Code:
auto lo
iface lo inet loopback

auto eth0
iface eth0 inet manual

auto eth1
iface eth1 inet manual

auto eth2
iface eth2 inet manual

auto eth3
iface eth3 inet manual

iface enxbe3af2b6059f inet manual

# 2x10Gbit (Intel XL710) LACP bond: VM traffic and management
auto bond0
iface bond0 inet manual
        bond-slaves eth0 eth1
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer2
        bond-lacp-rate 1

iface bond0.20 inet manual

# 2x25Gbit (ConnectX6 LX) LACP bond: CEPH storage plus Proxmox/CEPH cluster traffic
auto bond1
iface bond1 inet manual
        bond-slaves eth2 eth3
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy encap3+4
        mtu 9000
        bond-downdelay 200
        bond-updelay 200
        bond-lacp-rate 1

auto bond1.69
iface bond1.69 inet static
        address 169.254.69.2/24

auto bond1.70
iface bond1.70 inet static
        address 169.254.70.2/24

# management bridge on VLAN 20
auto mgmt20
iface mgmt20 inet static
        address 172.16.1.211/24
        gateway 172.16.1.1
        bridge-ports bond0.20
        bridge-stp off
        bridge-fd 0

# VLAN-aware bridge for VMs
auto vmbr0
iface vmbr0 inet manual
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094

source /etc/network/interfaces.d/*
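For completeness: as far as we understand, applying the SDN config writes /etc/network/interfaces.d/sdn (picked up by the source line above) and then reloads the network on every node. Assuming we're reading the ifupdown2 tooling right, we've also tried comparing the running state before and after a reload, something along these lines:

Code:
# snapshot what the reload actually changes (ifupdown2 tooling)
ifquery --running -a > /tmp/running-before.txt
systemctl reload networking          # same trigger as the SDN apply, as far as we can tell
ifquery --running -a > /tmp/running-after.txt
diff -u /tmp/running-before.txt /tmp/running-after.txt

# check whether the running state still matches /etc/network/interfaces
ifquery --check -a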