Cluster Network without storage cluster

masgo

I am a little confused about how to set up a small Proxmox cluster. My goal is to have 3-4 nodes in a cluster. I do not want to use Ceph or GlusterFS, so obviously there will be no HA or live migration either. I do want to use replication though. The goal is to recover from failures within hours (not days).

Each node has at least 2x Gbit NICs. I could easily upgrade all of them to 4x Gbit NICs. How should I set them up?
How about:
1st NIC: for Proxmox communication (corosync, replication), with all nodes connected to the same switch. The separate NAS (used for backups, templates, etc.) is also connected to this switch.
2nd NIC: set as VLAN-aware and serves the VMs. I would probably connect the nodes to different switches in case a switch uplink gets saturated.
3rd+4th NIC: if needed, they could form an LACP bond (together with NIC 2) and serve the VMs.

Is there anything (deeply) flawed about this setup?

I could also separate corosync and replication traffic into different VLANs (both using NIC 1) and assign them different IEEE 802.1p priority levels. To my knowledge, this should mitigate the problem of storage traffic disturbing corosync.
 
You should separate corosync completely/physically from all other traffic as it requires low latency.
 
I did a lot of research on corosync (especially since things changed with corosync 3.x). What I could not find out is what exactly happens if corosync fails. Let's take the extreme case: all switches that serve corosync fail (even if a redundant setup is present). Everything else works fine, but the corosync network is down. What happens?

Will the VMs/CTs stop working?

The only thing I could find out is that I can no longer change anything, such as starting a VM, creating new VMs, or changing a VM's config. Anything else?
 
You won't be able to manage them, but they should continue to run. There's a workaround you can use to e.g. start/stop the VMs: pvecm expected 1. This sets the expected votes on the node to 1, but is not recommended if there can be any conflict with other nodes/storages.
When using HA, the nodes that lose quorum will fence themselves after some time (~60 secs) to make sure the VMs can be restarted on other nodes.
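
To make the quorum arithmetic behind this concrete, here is a minimal Python sketch. It assumes one vote per node and the usual majority rule (more than half of the expected votes); the function names are made up for illustration and are not part of any Proxmox or corosync API.

```python
def quorum_threshold(expected_votes: int) -> int:
    """Votes needed for a majority: floor(expected / 2) + 1."""
    return expected_votes // 2 + 1


def is_quorate(visible_votes: int, expected_votes: int) -> bool:
    """A partition is quorate if it sees at least the majority threshold."""
    return visible_votes >= quorum_threshold(expected_votes)


if __name__ == "__main__":
    # Assumption: the corosync network is completely down, so every node
    # only sees its own single vote.
    for cluster_size in (3, 4):
        isolated = is_quorate(visible_votes=1, expected_votes=cluster_size)
        print(f"{cluster_size} nodes: need {quorum_threshold(cluster_size)} votes, "
              f"isolated node quorate: {isolated}")

    # Lowering the expected votes to 1 on a node (what pvecm expected 1 does)
    # drops the threshold to 1, so that single node counts as quorate again.
    print("expected votes 1:", is_quorate(visible_votes=1, expected_votes=1))
```

This also shows why the workaround is risky: every isolated node that lowers its expected votes considers itself quorate on its own, so nothing stops two nodes from making conflicting changes.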

For corosync, a 1G network should be enough, as it requires low latency but not much bandwidth. If you want redundancy, create a second link/ring on a different switch (also separated from all other traffic) rather than using a bond.
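
As a rough illustration of the two-link layout, the following Python sketch renders a corosync.conf-style nodelist with one address per link for each node. The node names, the addresses, and the helper itself are hypothetical. The subnet check is only a proxy: it can confirm that the two links use separate networks, not that they are plugged into separate switches.

```python
import ipaddress


def render_nodelist(nodes, prefix_len=24):
    """Render a nodelist block from (name, nodeid, link0_addr, link1_addr) tuples.

    Raises ValueError if any link 0 address shares a subnet with a link 1
    address, since the second link is meant to be an independent network.
    """
    link0_nets, link1_nets = set(), set()
    for _name, _nodeid, link0, link1 in nodes:
        link0_nets.add(ipaddress.ip_interface(f"{link0}/{prefix_len}").network)
        link1_nets.add(ipaddress.ip_interface(f"{link1}/{prefix_len}").network)
    if link0_nets & link1_nets:
        raise ValueError("link 0 and link 1 share a subnet")

    lines = ["nodelist {"]
    for name, nodeid, link0, link1 in nodes:
        lines += [
            "  node {",
            f"    name: {name}",
            f"    nodeid: {nodeid}",
            "    quorum_votes: 1",
            f"    ring0_addr: {link0}",  # first corosync link
            f"    ring1_addr: {link1}",  # second link, separate network
            "  }",
        ]
    lines.append("}")
    return "\n".join(lines)


# Hypothetical three-node cluster with two dedicated corosync networks.
EXAMPLE_NODES = [
    ("pve1", 1, "10.10.10.1", "10.20.20.1"),
    ("pve2", 2, "10.10.10.2", "10.20.20.2"),
    ("pve3", 3, "10.10.10.3", "10.20.20.3"),
]

if __name__ == "__main__":
    print(render_nodelist(EXAMPLE_NODES))
```

On a real cluster the file lives in /etc/pve/corosync.conf and should be edited as described in the Proxmox documentation; the sketch is only meant to show what "a second link" looks like per node.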
 
