Our setup:
Proxmox VE 8.2 cluster
4 x 64-thread AMD nodes, each with 384GB RAM, 2x480GB SSD for the OS, 5x7TB NVMe for CEPH, and 2x10Gbit (Intel XL710) + 2x25Gbit (ConnectX6 LX) NICs
The 10Gbit bond0 carries a VLAN-aware vmbr0 for VMs, plus a mgmt20 bridge interface with an IP on bond0.20
The 25Gbit bond1 carries CEPH storage as well as Proxmox and CEPH cluster traffic, split into two separate VLANs (bond1.69 and bond1.70) using 169.254.69.0/24 and 169.254.70.0/24 addresses
Cluster and CEPH look good, and we've run some successful speed and stability tests.
We'd like to use SDN by creating a single VLAN zone and a handful of vNet interfaces, each with its own VLAN ID.
But... when we apply any SDN configuration, even an empty one, it triggers a network reload on all four nodes.
At that point CEPH becomes unresponsive: "ceph -s" just hangs, and the UI shows error 500 when we try to open the CEPH section.
Restarting ceph.target doesn't bring CEPH back; only rebooting the hosts or running "systemctl restart networking" returns CEPH to a working state.
We can replicate this behaviour by running "systemctl reload networking" on a single node: its CEPH becomes unresponsive until we restart networking or reboot that node.
I'm not finding anything significant in the journalctl entries for networking, ceph.target, or the other ceph units.
Has anyone else experienced something similar?
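For reference, this is all it takes to reproduce on a single node, plus the obvious state checks afterwards (a sketch; interface names are from our config below, and 169.254.69.3 is a placeholder for another node's storage address):

```shell
# Reproduce: the same reload that applying any SDN config triggers
systemctl reload networking

# CEPH now hangs on this node; check what the reload left behind:
ip -d link show bond1.69              # did the VLAN sub-interface survive?
ip addr show bond1.70                 # is the cluster address still bound?
ping -c 3 -I bond1.69 169.254.69.3    # placeholder: another node's storage IP

# Only this (or a full reboot) brings CEPH back:
systemctl restart networking
```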
Code:
auto lo
iface lo inet loopback

# 2x10Gbit ports (Intel XL710) for bond0
auto eth0
iface eth0 inet manual

auto eth1
iface eth1 inet manual

# 2x25Gbit ports (ConnectX6 LX) for bond1
auto eth2
iface eth2 inet manual

auto eth3
iface eth3 inet manual

# Extra NIC, not brought up automatically
iface enxbe3af2b6059f inet manual

# 10Gbit LACP bond for VM and management traffic
auto bond0
iface bond0 inet manual
        bond-slaves eth0 eth1
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer2
        bond-lacp-rate 1

# Management VLAN tag on bond0
iface bond0.20 inet manual

# 25Gbit LACP bond for CEPH storage/cluster traffic, jumbo frames
auto bond1
iface bond1 inet manual
        bond-slaves eth2 eth3
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy encap3+4
        mtu 9000
        bond-downdelay 200
        bond-updelay 200
        bond-lacp-rate 1

# Storage/cluster VLAN 69 on bond1
auto bond1.69
iface bond1.69 inet static
        address 169.254.69.2/24

# Storage/cluster VLAN 70 on bond1
auto bond1.70
iface bond1.70 inet static
        address 169.254.70.2/24

# Management bridge carrying the host IP, on VLAN 20
auto mgmt20
iface mgmt20 inet static
        address 172.16.1.211/24
        gateway 172.16.1.1
        bridge-ports bond0.20
        bridge-stp off
        bridge-fd 0

# VLAN-aware bridge for VM traffic on bond0
auto vmbr0
iface vmbr0 inet manual
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094

source /etc/network/interfaces.d/*
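A few checks that might help narrow down where the reload leaves things (a sketch, assuming the standard Linux bonding driver and ifupdown2 tooling that Proxmox ships):

```shell
# LACP/bond state: both slaves should be up with a valid partner
cat /proc/net/bonding/bond1

# VLAN details and MTU on the storage sub-interfaces
# (bond1 is set to mtu 9000 and the VLANs inherit it; a reload
#  dropping them back to 1500 could explain hung CEPH traffic)
ip -d link show bond1.69
ip -d link show bond1.70

# ifupdown2's own comparison of running state vs. the config,
# without applying anything
ifquery -a --check
```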
			