I've got a working six-node stretched Proxmox VE 9.1 + Ceph cluster, with the hosts split across two datacenters (three in each). In a third datacenter, a Proxmox VE 9.1 virtual machine (on vSphere) acts as the Proxmox and Ceph tie-breaker.
Each host has 4x 25 Gbit/s interfaces in an LACP bond, defined as below. There's no physical capacity for a dedicated Corosync network, and we don't want to sacrifice two 25 Gbit/s ports for it, so Corosync also runs on the bond. We know that's bad practice, but we're reusing old hardware for a POC, so it will do. Here's an excerpt from /etc/network/interfaces on one of the six nodes:
Code:
auto bond0
iface bond0 inet manual
bond-slaves nic0 nic1 nic2 nic3
bond-miimon 100
bond-mode 802.3ad
bond-xmit-hash-policy layer3+4
mtu 9000
bond-lacp-rate fast
#LACP aggregate
auto vmbr0
iface vmbr0 inet static
address 10.250.42.11/24
gateway 10.250.42.1
bridge-ports bond0
bridge-stp off
bridge-fd 0
bridge-vlan-aware yes
bridge-vids 2-789 793-841 843-4094
bridge-pvid 1
#Bridge + Management
auto vlan790
iface vlan790 inet static
address 10.244.31.131/25
mtu 9000
vlan-raw-device bond0
#Corosync
auto vlan791
iface vlan791 inet static
address 10.244.32.11/25
mtu 9000
vlan-raw-device bond0
post-up ip route add 10.244.74.0/25 via 10.244.32.1
post-down ip route del 10.244.74.0/25
#Ceph Frontend
auto vlan792
iface vlan792 inet static
address 10.244.32.131/25
mtu 9000
vlan-raw-device bond0
#Ceph Backend
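For reference, this is how we sanity-check the bond and the VLAN MTUs before suspecting the firewall (standard iproute2/kernel tooling, nothing Proxmox-specific; interface names as in the excerpt above):

```shell
# LACP negotiation and per-slave link state for the bond
grep -E 'Bonding Mode|MII Status|Slave Interface' /proc/net/bonding/bond0

# The VLAN sub-interfaces must inherit the 9000-byte MTU end to end
ip -d link show vlan791 | grep -o 'mtu [0-9]*' | head -1
ip -d link show vlan792 | grep -o 'mtu [0-9]*' | head -1
```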
All is fine until we activate the firewall with the following /etc/pve/firewall/cluster.fw file, at which point Ceph goes down in flames, although the nodes remain accessible from the GUI. Fortunately it's quite robust: as soon as we delete /etc/pve/firewall/cluster.fw, Ceph heals correctly.
Code:
[OPTIONS]
enable: 1
policy_in: DROP
policy_out: ACCEPT
policy_forward: ACCEPT
# For SysAdmins
[IPSET management]
10.165.0.0/24
# SSH / GUI network
[IPSET cluster_mgmt]
10.250.42.0/24
# Corosync network
[IPSET cluster_sync]
10.244.31.128/25
# Ceph Public network including tie-breaker network on 3rd site
[IPSET cluster_fceph]
10.244.32.0/25
10.244.74.0/25
# Ceph Cluster network
[IPSET cluster_bceph]
10.244.32.128/25
[RULES]
IN ACCEPT -i lo
IN Ping(ACCEPT)
IN ACCEPT -m conntrack --ctstate ESTABLISHED,RELATED
# Proxmox VE management network to management network
IN ACCEPT -i vmbr0 -source +cluster_mgmt -destination +cluster_mgmt
# Corosync to Corosync network
IN ACCEPT -i vlan790@bond0 -source +cluster_sync -destination +cluster_sync
# Ceph Public to Ceph Public network (including tie-breaker)
IN ACCEPT -i vlan791@bond0 -source +cluster_fceph -destination +cluster_fceph
# Ceph Cluster network to Ceph Cluster network
IN ACCEPT -i vlan792@bond0 -source +cluster_bceph -destination +cluster_bceph
# Ceph Cluster to Ceph Public network (is this needed at all ?)
IN ACCEPT -i vlan791@bond0 -source +cluster_bceph -destination +cluster_fceph
# Ceph Public to Ceph Cluster network (is this needed at all, Ceph's doc doesn't seem to imply so ?)
IN ACCEPT -i vlan792@bond0 -source +cluster_fceph -destination +cluster_bceph
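For what it's worth, we also look at what pve-firewall actually produces from this file rather than guessing; the stock pve-firewall CLI can print the compiled ruleset and the network it auto-detects for its built-in management rules (run on any node):

```shell
# Print the iptables rules pve-firewall would generate, without applying them
pve-firewall compile

# Show the local network pve-firewall auto-detects for its built-in
# cluster/management rules (GUI, SSH, Corosync between cluster nodes)
pve-firewall localnet

# Current firewall status on this node
pve-firewall status
```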
From reading the documentation, I believe pve-firewall automatically adds the rules allowing sysadmins to keep using the GUI/SSH.
So what is missing here that causes Ceph to stop working as soon as the firewall is activated?
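In the meantime, this is the kind of quick probe we run from one node to see which Ceph ports get blocked once the firewall is up. It's a bash sketch using /dev/tcp (nc or nmap would do the same job); the peer IP is an example, and the ports are Ceph's defaults (monitors on 3300/6789, OSDs on 6800-7300):

```shell
#!/usr/bin/env bash
# probe <host> <port>... : report whether each TCP port accepts a connection
probe() {
    local host=$1 port
    shift
    for port in "$@"; do
        if timeout 2 bash -c ">/dev/tcp/$host/$port" 2>/dev/null; then
            echo "port $port open"
        else
            echo "port $port blocked"
        fi
    done
}

# Example: check the Ceph monitor ports on a peer's Ceph Public address
probe 10.244.32.12 3300 6789
```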