Best practice for a Proxmox cluster

pelip

New Member
Sep 24, 2025
Hello, colleagues. I'm building a 3-node cluster. Each node has two 10G ports and two 1G ports. I assume the 10G ports will be used for replication. The diagram looks like this. However, with this configuration the cluster only ran for 5 minutes, and then nodes 2 and 3 became unavailable. What am I doing wrong?

1758719236769.png
 
Hi @pelip , welcome to the forum.

To answer your question directly: it is impossible for anyone to say, as you have provided insufficient data for analysis.

Note that 10G is overkill for cluster communication. Are you running Ceph as well?

Beyond this, you need to carefully analyze the logs of the servers: journalctl -b0

PS: looking a bit closer at your diagram, you have no predictability of ARP/MAC advertisement with your current config. Packets could be sent via the wrong port and never reach the intended target. You should put a switch on the back end and reduce the complexity.


Blockbridge: Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
By cluster communication, I mean two situations:
1. Backup
2. If one node fails, quorum must be enabled in the cluster, and the virtual machines will start on another node.
 
1. Backup
There is no "backup" traffic in a basic PVE cluster. It is something you can add with an external node, i.e. PBS. Do you mean ZFS replication?
2. If one node fails, quorum must be enabled in the cluster, and the virtual machines will start on another node.
If you do not have shared storage, then the VMs cannot fail over when a node fails. Are you planning to use ZFS replication?
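
To make the quorum point concrete, here is a minimal Python sketch of the majority rule, assuming each node carries exactly one vote and there is no external quorum device. The function names are made up for illustration and are not part of any PVE or corosync API.

```python
def quorum_votes(total_votes: int) -> int:
    """Votes needed for a strict majority (assumes one vote per node)."""
    return total_votes // 2 + 1


def has_quorum(reachable_nodes: int, total_nodes: int) -> bool:
    """True if the nodes that can still see each other form a majority."""
    return reachable_nodes >= quorum_votes(total_nodes)


# Walk through a 3-node cluster losing nodes one by one.
total = 3
for reachable in range(total, 0, -1):
    state = "quorate" if has_quorum(reachable, total) else "NOT quorate"
    print(f"{reachable}/{total} nodes reachable: {state}")
```

With three nodes the majority is two votes, so a node that can no longer reach the other two is on its own and loses quorum, which is likely what a broken back-end network looks like from the remaining node's point of view.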

Your bridge config is wrong. Linux will pick one (active) port and you will have asymmetric routing and broken cluster communication.
Place a basic L2 switch on the back and it should solve some of your issues.


 
Yes, ZFS replication.
How would you build this setup to make it reliable?
With ZFS? You cannot. PVE HA is not built to fail over on a local storage failure, so if your local ZFS fails, the VMs will not be started automatically on the other nodes. You will also have data loss in the HA case of a node failure and the subsequent failover to another node, because anything written since the last replication run is gone. Please reconsider and build a proper cluster with a cluster filesystem. You will not have fun with this setup, your experience with PVE will not be good, and it will not be PVE's fault.
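
To illustrate the data-loss point: ZFS replication is asynchronous, so after a failover the replica only holds what was sent in the last successful run. The following is a rough Python sketch with a hypothetical function name and made-up timestamps; the 15-minute schedule is only an assumed example value.

```python
from datetime import datetime, timedelta


def unreplicated_window(last_sync: datetime, failure: datetime) -> timedelta:
    """Time span of writes that exist only on the failed node."""
    if failure < last_sync:
        raise ValueError("failure time precedes last successful replication")
    return failure - last_sync


# Made-up example: replication assumed to run every 15 minutes,
# last successful run at 12:00, node fails at 12:11.
last_sync = datetime(2025, 9, 24, 12, 0)
failure = datetime(2025, 9, 24, 12, 11)
lost = unreplicated_window(last_sync, failure)
minutes = lost.total_seconds() / 60
print(f"Writes from the last {minutes:.0f} minutes are missing on the replica")
```

The shorter the replication interval, the smaller this window, but with asynchronous replication it never reaches zero.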

You should place a switch on the back-end network.
That'll work, or you can switch to full-mesh routing (FRR), as in the 3-node Ceph setup, in order to have a proper multi-link ring.