Micro-segmentation and nftables

baligh.bedewi

Dear members,

I have the following HCI setup, attached as a high-level overview:

4x Proxmox VE 8.3.2 cluster nodes with local SSD disks in a Ceph cluster; each node has 12 interfaces:
  • 2x 1Gbps interface (Proxmox Management)
  • 2x 1Gbps interface (Proxmox Cluster)
  • 2x 40Gbps interface (CEPH Public)
  • 2x 40Gbps interface (CEPH Private)
  • 2x 40Gbps interface (Internal Network)
  • 2x 10Gbps interface (DMZ Network)
1x Proxmox Backup Server
1x NAS with NFS connected to Proxmox cluster

All of them run in a semi-air-gapped network (L2 only), accessible from a single VM that has interfaces in the same network.

The environment runs more than 80 critical VMs, and I'm planning to apply micro-segmentation instead of relying on iptables inside each VM. I came to know that the Proxmox nftables firewall is still a tech preview, and I have the following questions:

1- Do I need to work with nftables, or is it still not stable? In case I rely on the Proxmox default iptables, will it be easy to migrate the rules later, once nftables becomes production-ready?
2- As far as I understood, there are anti-lockout rules for ports 22 and 8006, plus VNC, for local access. Since I have a VM within the same subnet, do I still need to create a rule to allow my access?
3- Applying a firewall rule at the cluster level applies it to all nodes. Do I still need to create the rules at the node level as well, or is that not needed?
4- I need to split the project into 2 sub-tasks (cluster/node level and VMs). If I enable the firewall at the cluster level (the node firewall is enabled by default) while keeping the VM firewall disabled, will the nodes still have the rules applied?

I also ran tcpdump between the nodes and, together with some research, compiled the following to make sure I don't miss any rules. Can you please confirm whether this is sufficient?

Interface     | Ports                  | Protocol | Source Interface            | Destination Interface    | Purpose
MGMT          | 22, 443, 8006          | TCP      | MGMT (Node 1-4)             | MGMT (Node 1-4)          | Proxmox management (SSH, GUI/API)
Cluster       | 5404-5405              | UDP      | Cluster (Node 1-4)          | Cluster (Node 1-4)       | Corosync cluster communication
Ceph Public   | 6789, 6800-7300        | TCP      | Ceph Public (Node 1-4)      | Ceph Public (Node 1-4)   | Ceph MON and client-facing traffic
Ceph Private  | 6800-7300              | TCP      | Ceph Private (Node 1-4)     | Ceph Private (Node 1-4)  | Ceph OSD replication and recovery
Storage (NFS) | 2049, 111, 32765-32769 | TCP/UDP  | Storage (Node 1-4)          | NFS server               | NFS traffic for VM and container storage
Backup        | 8007, 22               | TCP      | Backup (Node 1-4)           | Proxmox Backup Server    | Backup traffic for VM and container snapshots
DMZ           | Custom                 | TCP/UDP  | DMZ (external clients)      | DMZ (Proxmox nodes)      | Application-specific traffic
Internal      | Custom                 | TCP/UDP  | Internal (external clients) | Internal (Proxmox nodes) | Application-specific traffic
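
To make this concrete, here is a rough sketch (not my final config) of how I would express a few of these rows as cluster-level rules; the IPSet names are only placeholders on my side:

Code:
# /etc/pve/firewall/cluster.fw -- illustrative sketch only, IPSet names are placeholders
[RULES]

IN ACCEPT -source +dc/pve-mgmt -dest +dc/pve-mgmt -p tcp -dport 22,443,8006 -log nolog # Proxmox management (SSH, GUI/API)
IN ACCEPT -source +dc/pve-cluster -dest +dc/pve-cluster -p udp -dport 5404:5405 -log nolog # Corosync
IN ACCEPT -source +dc/pve-ceph-public -dest +dc/pve-ceph-public -p tcp -dport 6789,6800:7300 -log nolog # Ceph MON / client-facing
IN ACCEPT -source +dc/pve-ceph-private -dest +dc/pve-ceph-private -p tcp -dport 6800:7300 -log nolog # Ceph OSD replication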


Thanks
 

Attachments

  • Screenshot 2025-01-16 at 11.47.19 PM.png
1- Do I need to work with nftables, or is it still not stable? In case I rely on the Proxmox default iptables, will it be easy to migrate the rules later, once nftables becomes production-ready?

It is currently still in tech preview and therefore use at your own risk, although people are already using it in production. I cannot say whether that is okay for your specific use case.

Both implementations use the exact same configuration files, so migrating should be as easy as enabling the nftables flag in the firewall options.
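
For reference, that flag is a per-node firewall option; a minimal sketch of how it looks in the host firewall config (the same file shown further down in this thread):

Code:
# /etc/pve/local/host.fw
[OPTIONS]

# 0 (default) = classic iptables-based firewall, 1 = new nftables-based firewall
nftables: 1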

2- As far as I understood, there are anti-lockout rules for ports 22 and 8006, plus VNC, for local access. Since I have a VM within the same subnet, do I still need to create a rule to allow my access?
The firewall automatically allows access on the management interface. This means that it looks for the IP that the hostname of the node resolves to, and allows connections from the interface where this IP is set.
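
If you want an explicit cluster-level rule for that management VM anyway, a sketch (10.201.1.0/27 is just a stand-in for the subnet of your management VM):

Code:
# /etc/pve/firewall/cluster.fw -- optional explicit allow for the admin VM
[RULES]

IN ACCEPT -source 10.201.1.0/27 -p tcp -dport 22,8006 -log nolog # GUI/SSH from the management VM subnet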

3- Applying a firewall rule at the cluster level applies it to all nodes. Do I still need to create the rules at the node level as well, or is that not needed?
Rules at the cluster level get applied to all nodes (that have the firewall enabled). You can use node-level rules to override cluster-level rules, as they take precedence.
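
As a small sketch of that precedence (the subnet is a placeholder), both levels use the same rule syntax:

Code:
# /etc/pve/firewall/cluster.fw -- applies to all nodes with the firewall enabled
[RULES]
IN ACCEPT -p tcp -dport 8006 -log nolog # allow the GUI everywhere

# /etc/pve/local/host.fw -- node-specific rules take precedence over cluster rules
[RULES]
IN DROP -source 192.0.2.0/24 -p tcp -dport 8006 # but block this subnet on this particular node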

4- I need to split the project into 2 sub-tasks (cluster/node level and VMs). If I enable the firewall at the cluster level (the node firewall is enabled by default) while keeping the VM firewall disabled, will the nodes still have the rules applied?
Yes, if you have the firewall enabled at the cluster/host level and disabled at the VM level, only the host rules will get applied.
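
A minimal sketch of that split (VMID 100 is just an example; per-guest firewall configs live in /etc/pve/firewall/<vmid>.fw):

Code:
# /etc/pve/firewall/cluster.fw
[OPTIONS]
enable: 1   # datacenter-level firewall on (the node firewalls are enabled by default)

# /etc/pve/firewall/100.fw -- example guest
[OPTIONS]
enable: 0   # guest firewall off, so only cluster/host rules are applied on the node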



The new VNet firewall [1] might be interesting for you if you want to do micro-segmentation. Together with SDN [2] it allows you to create rules at the bridge level. If you are using the host as a gateway for VMs, you can also control how traffic gets forwarded between the different subnets. Please note that this only works with the new nftables firewall.
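
As a rough illustration of such vnet-level rules in the forward direction (a sketch; the subnets are placeholders and the config path is my assumption about the current SDN firewall integration):

Code:
# vnet firewall config, e.g. /etc/pve/sdn/firewall/<vnet>.fw (path is an assumption)
[RULES]

FORWARD ACCEPT -source 192.0.2.0/24 -dest 198.51.100.10 -p tcp -dport 443 # web tier -> app tier
FORWARD DROP -log nolog # block all other traffic forwarded on this vnet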

You can find the default ruleset here [3].

[1] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_directions_amp_zones
[2] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pvesdn_firewall_integration
[3] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pve_firewall_default_rules
 
Thanks a lot, shanreich, for your swift reply.

Sorry for the late reply, I was testing, and I have 3 questions, please.

Regarding SDN, I was planning to do it; however, I'm not sure it's applicable in my situation.

I have 12 interfaces in 6 bonds. Four of them carry the Proxmox networks (mgmt, cluster, Ceph public and Ceph private) and are Layer 2 only, while the remaining 2 bonds carry my internal and DMZ networks. Those last 2 bonds are trunks and use my underlying infrastructure to reach my perimeter firewall, so none of my nodes acts as a gateway. In fact, the Proxmox nodes are connected to Layer 2 only and have just 80/443 through an internal proxy for updates.

1- Would you still recommend SDN?

Also, for the Proxmox firewall, I deployed 2 nodes with Ceph in a test environment before actually applying this to production. Here is what I found, and I also need your support to understand the behavior I'm getting.

2x Proxmox nodes running the latest version; each node has 5 interfaces (mgmt, cluster, Ceph public, Ceph private and internal network) as shown in the attached picture.

I created rules for the following:

  • My remote access (22, 8006) <-- to access the GUI or shell.
  • Proxmox to Proxmox (22, 8006) <-- using the mgmt interface, so they can talk to each other.
  • Proxmox to Proxmox (Ceph macro) <-- using the Ceph public interface, so they can communicate for OSDs, replication, monitors, etc.
  • Proxmox to Proxmox (Ceph macro) <-- using the Ceph private interface, so they can communicate for OSDs, replication, monitors, etc.
  • Deny all <-- reject any other traffic on all interfaces
as shown in the attachment.

I then enabled nftables on each node and saw the nft rules appear after I enabled the firewall at the DC level. However, whether the rules are enabled or set to deny all, the communication still goes through as if nothing is being blocked. When I remove the nftables checkbox from each node, I see the nft rules removed and replaced by iptables, and the traffic gets blocked.

2- What could be the issue, and how do I make sure that nftables is enabled and actually blocking? Is there anything else required beyond ticking the nftables box on each node?

3- Also, I don't see the nftables option at the VM level. Does nftables apply to the VMs when I enable it at the node level, or do the VMs use iptables only?
 

Attachments

  • Screenshot 2025-01-20 at 10.47.55 PM.png
  • Screenshot 2025-01-20 at 10.13.31 PM.png
1- Would you still recommend SDN?
Depends on what you are trying to achieve. It is mostly intended for guest networks, so the Ceph / Mgmt / ... networks are quite likely not suited for SDN. If you want to use multiple VLANs or even VXLAN/EVPN, then SDN is the right choice. I don't really know what you're trying to achieve in the end, so it's hard to tell.

2- What could be the issue, and how do I make sure that nftables is enabled and actually blocking? Is there anything else required beyond ticking the nftables box on each node?
That shouldn't be the case. Could you post the output of the following commands when nftables is enabled, so I can take a look? Also, how are you checking connectivity? Are you pinging? From where?

Code:
cat /etc/pve/firewall/cluster.fw
cat /etc/pve/local/host.fw
nft list ruleset

3- Also, I don't see the nftables option at the VM level. Does nftables apply to the VMs when I enable it at the node level, or do the VMs use iptables only?
nftables is only at the host level. As soon as you enable it for a host, the whole host (including all VMs on that host) will use nftables.
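
If you want to double-check which backend a node is actually using, something like this is a reasonable sanity check (assuming a current PVE 8.x setup where the classic firewall runs as the pve-firewall service and the nftables one as proxmox-firewall):

Code:
# Is the per-node flag set?
grep -s nftables /etc/pve/local/host.fw

# With nftables active, the "inet proxmox-firewall" table should be populated
nft list ruleset | grep 'table inet proxmox-firewall'

# Compare the state of the two firewall services
systemctl status pve-firewall proxmox-firewall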
 
Thanks again for your support

I enabled only my remote access rule (22, 8006) and the drop ANY rule, while disabling the other rules.

1- cat /etc/pve/firewall/cluster.fw

Code:
[OPTIONS]

enable: 1

[ALIASES]

PVE1-CEPH-PRIVATE 10.72.4.20 # PVE1-CEPH-PRIVATE
PVE1-CEPH-PUBLIC 10.73.4.10 # CEPH-PUBLIC
PVE1-CLUSTER 10.73.3.234 # CLUSTER
PVE1-MGMT 10.72.50.200 # MGMT
PVE2-CEPH-PRIVATE 10.72.4.21 # CEPH-PRIVATE
PVE2-CEPH-PUBLIC 10.73.4.11 # CEPH-PUBLIC
PVE2-CLUSTER 10.73.3.235 # CLUSTER
PVE2-MGMT 10.72.50.201 # MGMT

[IPSET pve-ceph-private] # CEPH-PRIVATE

dc/pve1-ceph-private # CEPH-PRIVATE
dc/pve2-ceph-private # CEPH-PRIVATE

[IPSET pve-ceph-public] # CEPH-PUBLIC

dc/pve1-ceph-public # CEPH-PUBLIC
dc/pve2-ceph-public # CEPH-PUBLIC

[IPSET pve-cluster] # CLUSTER

dc/pve1-cluster # CLUSTER
dc/pve2-cluster # CLUSTER

[IPSET pve-mgmt] # MGMT

dc/pve1-mgmt # MGMT
dc/pve2-mgmt # MGMT

[RULES]

|IN Ceph(ACCEPT) -source +dc/pve-ceph-public -dest +dc/pve-ceph-public -log nolog # CEPH PUBLIC
|IN Ceph(ACCEPT) -source dc/pve1-ceph-private -dest +dc/pve-ceph-private -log nolog # CEPH PRIVATE
|IN ACCEPT -source +dc/pve-mgmt -dest +dc/pve-mgmt -p tcp -dport 22,8006 -log nolog # MGMT
IN ACCEPT -source 10.201.1.0/27 -dest +dc/pve-mgmt -p tcp -dport 22,8006 -log nolog # REMOTE ACCESS
IN DROP -log debug

2- cat /etc/pve/local/host.fw

Code:
[OPTIONS]

tcp_flags_log_level: debug
log_level_forward: debug
log_level_in: debug
smurf_log_level: debug
nftables: 1
log_level_out: debug

3- nft list ruleset

Code:
table inet proxmox-firewall {
        set v4-dc/management {
                type ipv4_addr
                flags interval
                auto-merge
        }

        set v4-dc/management-nomatch {
                type ipv4_addr
                flags interval
                auto-merge
        }

        set v6-dc/management {
                type ipv6_addr
                flags interval
                auto-merge
        }

        set v6-dc/management-nomatch {
                type ipv6_addr
                flags interval
                auto-merge
        }

        set v4-synflood-limit {
                type ipv4_addr
                flags dynamic,timeout
                timeout 1m
        }

        set v6-synflood-limit {
                type ipv6_addr
                flags dynamic,timeout
                timeout 1m
        }

        map bridge-map {
                type ifname : verdict
        }

        chain do-reject {
                meta pkttype broadcast drop
                ip saddr 224.0.0.0/4 drop
                meta l4proto tcp reject with tcp reset
                meta l4proto { icmp, ipv6-icmp } reject
                reject with icmp host-prohibited
                reject with icmpv6 admin-prohibited
                drop
        }

        chain accept-management {
                ip saddr @v4-dc/management ip saddr != @v4-dc/management-nomatch accept
                ip6 saddr @v6-dc/management ip6 saddr != @v6-dc/management-nomatch accept
        }

        chain block-synflood {
                tcp flags != syn / fin,syn,rst,ack return
                jump ratelimit-synflood
                drop
        }

        chain log-drop-invalid-tcp {
                jump log-invalid-tcp
                drop
        }

        chain block-invalid-tcp {
                tcp flags fin,psh,urg / fin,syn,rst,psh,ack,urg goto log-drop-invalid-tcp
                tcp flags ! fin,syn,rst,psh,ack,urg goto log-drop-invalid-tcp
                tcp flags syn,rst / syn,rst goto log-drop-invalid-tcp
                tcp flags fin,syn / fin,syn goto log-drop-invalid-tcp
                tcp sport 0 tcp flags syn / fin,syn,rst,ack goto log-drop-invalid-tcp
        }

        chain allow-ndp-in {
                icmpv6 type { nd-router-solicit, nd-router-advert, nd-neighbor-solicit, nd-neighbor-advert, nd-redirect } accept
        }

        chain block-ndp-in {
                icmpv6 type { nd-router-solicit, nd-router-advert, nd-neighbor-solicit, nd-neighbor-advert, nd-redirect } drop
        }

        chain allow-ndp-out {
                icmpv6 type { nd-router-solicit, nd-neighbor-solicit, nd-neighbor-advert } accept
        }

        chain block-ndp-out {
                icmpv6 type { nd-router-solicit, nd-neighbor-solicit, nd-neighbor-advert } drop
        }

        chain block-conntrack-invalid {
                ct state invalid drop
        }

        chain block-smurfs {
                ip saddr 0.0.0.0 return
                meta pkttype broadcast goto log-drop-smurfs
                ip saddr 224.0.0.0/4 goto log-drop-smurfs
        }

        chain allow-icmp {
                icmp type { destination-unreachable, source-quench, time-exceeded } accept
                icmpv6 type { destination-unreachable, packet-too-big, time-exceeded, parameter-problem } accept
        }

        chain log-drop-smurfs {
                jump log-smurfs
                drop
        }

        chain default-in {
                iifname "lo" accept
                jump allow-icmp
                ct state established,related accept
                meta l4proto igmp accept
                tcp dport { 22, 3128, 5900-5999, 8006 } jump accept-management
                udp dport 5405-5412 accept
                udp dport { 135, 137-139, 445 } goto do-reject
                udp sport 137 udp dport 1024-65535 goto do-reject
                tcp dport { 135, 139, 445 } goto do-reject
                udp dport 1900 drop
                udp sport 53 drop
        }

        chain default-out {
                oifname "lo" accept
                jump allow-icmp
                ct state vmap { invalid : drop, established : accept, related : accept }
        }

        chain before-bridge {
                meta protocol arp accept
                meta protocol != arp ct state vmap { invalid : drop, established : accept, related : accept }
        }

        chain host-bridge-input {
                type filter hook input priority filter - 1; policy accept;
                iifname vmap @bridge-map
        }

        chain host-bridge-output {
                type filter hook output priority filter + 1; policy accept;
                oifname vmap @bridge-map
        }

        chain input {
                type filter hook input priority filter; policy accept;
                jump default-in
                jump ct-in
                jump option-in
                jump host-in
                jump cluster-in
        }

        chain output {
                type filter hook output priority filter; policy accept;
                jump default-out
                jump option-out
                jump host-out
                jump cluster-out
        }

        chain forward {
                type filter hook forward priority filter; policy accept;
                jump host-forward
                jump cluster-forward
        }

        chain ratelimit-synflood {
        }

        chain log-invalid-tcp {
        }

        chain log-smurfs {
        }

        chain option-in {
        }

        chain option-out {
        }

        chain cluster-in {
        }

        chain cluster-out {
        }

        chain host-in {
        }

        chain host-out {
        }

        chain cluster-forward {
        }

        chain host-forward {
        }

        chain ct-in {
        }
}
table bridge proxmox-firewall-guests {
        map vm-map-in {
                typeof oifname : verdict
        }

        map vm-map-out {
                typeof iifname : verdict
        }

        map bridge-map {
                type ifname . ifname : verdict
        }

        chain allow-dhcp-in {
                udp sport . udp dport { 547 . 546, 67 . 68 } accept
        }

        chain allow-dhcp-out {
                udp sport . udp dport { 546 . 547, 68 . 67 } accept
        }

        chain block-dhcp-in {
                udp sport . udp dport { 547 . 546, 67 . 68 } drop
        }

        chain block-dhcp-out {
                udp sport . udp dport { 546 . 547, 68 . 67 } drop
        }

        chain allow-ndp-in {
                icmpv6 type { nd-router-solicit, nd-router-advert, nd-neighbor-solicit, nd-neighbor-advert, nd-redirect } accept
        }

        chain block-ndp-in {
                icmpv6 type { nd-router-solicit, nd-router-advert, nd-neighbor-solicit, nd-neighbor-advert, nd-redirect } drop
        }

        chain allow-ndp-out {
                icmpv6 type { nd-router-solicit, nd-neighbor-solicit, nd-neighbor-advert } accept
        }

        chain block-ndp-out {
                icmpv6 type { nd-router-solicit, nd-neighbor-solicit, nd-neighbor-advert } drop
        }

        chain allow-ra-out {
                icmpv6 type { nd-router-advert, nd-redirect } accept
        }

        chain block-ra-out {
                icmpv6 type { nd-router-advert, nd-redirect } drop
        }

        chain allow-icmp {
                icmp type { destination-unreachable, source-quench, time-exceeded } accept
                icmpv6 type { destination-unreachable, packet-too-big, time-exceeded, parameter-problem } accept
        }

        chain do-reject {
                meta pkttype broadcast drop
                ip saddr 224.0.0.0/4 drop
                meta l4proto tcp reject with tcp reset
                meta l4proto { icmp, ipv6-icmp } reject
                reject with icmp host-prohibited
                reject with icmpv6 admin-prohibited
                drop
        }

        chain pre-vm-out {
                meta protocol != arp ct state vmap { invalid : drop, established : accept, related : accept }
        }

        chain vm-out {
                type filter hook prerouting priority 0; policy accept;
                jump allow-icmp
                iifname vmap @vm-map-out
        }

        chain pre-vm-in {
                meta protocol != arp ct state vmap { invalid : jump invalid-conntrack, established : accept, related : accept }
                meta protocol arp accept
        }

        chain vm-in {
                type filter hook postrouting priority 0; policy accept;
                jump allow-icmp
                oifname vmap @vm-map-in
        }

        chain before-bridge {
                meta protocol arp accept
                meta protocol != arp ct state vmap { invalid : drop, established : accept, related : accept }
        }

        chain forward {
                type filter hook forward priority 0; policy accept;
                meta ibrname . meta obrname vmap @bridge-map
        }

        chain invalid-conntrack {
        }
}

And whether I use iptables or nftables, I see pre-configured rules for the mgmt, cluster, Ceph, etc. Since no one has access to these subnets except the single VM (local access management) and the network is not routable, do I still need the drop ANY rule at the end? If yes: the deny ANY without specifying an interface also covers the DMZ/internal network (bond4, bond5); will this impact the VMs, or does it still apply only at the cluster/node level?
 
