3 node cluster (1Gbit/s NICs for LAN and WAN)

fr1000

Hey everyone,

after a lot of reading I've come to the conclusion that running a 2-node cluster in production with synchronously replicated storage is not feasible.

I have the option to build a hardware setup with 3 nodes as follows; this really is the *maximum* possible hardware capacity. On this basis I *must* get a production cluster up and running:

* 3 Proxmox nodes (each with NVMe SSDs in a ZFS RAIDZ2 pool, as currently planned)
* Each node has 3 physical network cards (1x WAN / 2x LAN) with 1Gbit/s each
* The 3 nodes are connected to each other with a direct LAN connection
* PVE1 -> PVE2 ; PVE1 -> PVE3; PVE2 -> PVE3

My current plan is to run only Corosync over the LAN (as previously planned with 2 nodes [Link]). Everything else, incl. storage replication, goes via WAN.
The bare-metal servers are in the same rack at the data center, so latency over the WAN is less than 3 ms.

The storage should be highly available.

Synchronous replication for all VMs would be ideal here... but I could also get along with ZFS replication as a base (every 5 minutes) and additionally run the mail server, DB, etc. on a separate 3-node Kubernetes + Longhorn cluster in VMs (1 Kubernetes node VM per Proxmox node, connected via VXLAN powered by Proxmox).
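For illustration, the 5-minute ZFS replication I have in mind would be set up roughly like this (VMID, target node names and the rate cap are just examples):

Code:
# create async replication jobs for VM 100 to both other nodes,
# every 5 minutes, capped so a full sync can't saturate the 1 Gbit/s link
pvesr create-local-job 100-0 pve2 --schedule '*/5' --rate 80
pvesr create-local-job 100-1 pve3 --schedule '*/5' --rate 80
# check the replication state
pvesr status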

What are your thoughts, advice, suggestions? Or is this “not at all” possible?

I am very curious about your opinions and experiences and input for design decisions!

Many thanks in advance!

Kind regards, fr1000

//EDIT: I've searched through the forum but did not find a thread that matches my "special" network restrictions here.
 
* The 3 nodes are connected to each other with a direct LAN connection
* PVE1 -> PVE2 ; PVE1 -> PVE3; PVE2 -> PVE3
NEVER do that in a cluster. If one node goes down, the second will lose its uplink.

Get two MLAG-capable switches, LACP all NICs together and run VLANs over it. This will yield 3 Gbit/s aggregate throughput and high availability.
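A rough sketch of what that could look like in /etc/network/interfaces (interface names, VLAN ID and address are examples, assuming ifupdown2):

Code:
auto bond0
iface bond0 inet manual
    bond-slaves eno1 eno2 eno3
    bond-mode 802.3ad
    bond-xmit-hash-policy layer3+4
    bond-miimon 100

# VLAN-aware bridge on top of the LACP bond; guest, corosync and
# storage traffic get separated by VLANs across the two MLAG switches
auto vmbr0
iface vmbr0 inet manual
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094

# example VLAN interface for the corosync network
auto vmbr0.10
iface vmbr0.10 inet static
    address 192.168.10.11/24

Note that LACP gives you 3 Gbit/s in aggregate; a single TCP stream still tops out at 1 Gbit/s.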

NVMe SSDs in the ZFS RAIDZ2
How many disks? RAIDZ2 is never fast, even on NVMe. You will also lose a lot of space (just search the forums). Why don't you just go with Ceph over LACP? Not optimal, yet - depending on your workload - it may give better performance. Reading is always local, writing is of course slower.
 
NEVER do that in a cluster. If one node goes down, the second will lose its uplink.
When I connect them like that, shouldn't at least two nodes always keep a link between them?

* PVE1 -> PVE2
* PVE1 -> PVE3
* PVE2 -> PVE3

* PVE1 goes down:
* PVE1 -> PVE2 -> down
* PVE1 -> PVE3 -> down
* PVE2 -> PVE3 -> up
==> Cluster is up with 2 nodes and link between them.


* PVE2 goes down:
* PVE1 -> PVE2 -> down
* PVE1 -> PVE3 -> up
* PVE2 -> PVE3 -> down
==> Cluster is up with 2 nodes and link between them


* PVE3 goes down:
* PVE1 -> PVE2 -> up
* PVE1 -> PVE3 -> down
* PVE2 -> PVE3 -> down
==> Cluster is up with 2 nodes and link between them

What point am I missing? I could get a switch (1 Gbit, 8-port) from my hosting provider, but that would be a new SPoF?! (And the possibility of all nodes losing each other.)

NVMe SSDs in the ZFS RAIDZ2
4 disks ... I know it's a bit slower than RAIDZ1, but I like data reliability. ;-) That's why I would choose that one. At least at the current state of planning.

Thanks for the answer!

//EDIT: fixed formatting
 
4 disks ... I know it's a bit slower than RAIDZ1, but I like data reliability. ;-) That's why I would choose that one. At least at the current state of planning.
They are both the same speed and you'll have less available space than you think because of the volblocksize mismatch. As already said: do search about RAIDz on the forum and you'll find that it's different from (hardware) RAID5 and RAID6 (with battery backup) and not good for running VMs on, especially on drives without PLP.
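Roughly, for a 4-disk RAIDZ2 with ashift=12 (4 KiB sectors) and a 16 KiB volblocksize, the numbers work out something like this:

Code:
# 16 KiB zvol block                = 4 data sectors
# 4-wide RAIDZ2 row                = 2 data + 2 parity sectors
# -> 2 rows: 4 data + 4 parity     = 8 sectors
# RAIDZ2 pads allocations to a multiple of 3 sectors -> 9 sectors
# 9 x 4 KiB = 36 KiB stored for 16 KiB of data  (~44% efficiency)
# a 4-disk striped mirror stores the same 16 KiB in 32 KiB (50%),
# with much better IOPS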
 
They are both the same speed and you'll have less available space than you think because of the volblocksize mismatch. As already said: do search about RAIDz on the forum and you'll find that it's different from (hardware) RAID5 and RAID6 (with battery backup) and not good for running VMs on, especially on drives without PLP.
I’ll take a look at RAIDZ Levels again.

Ceph with 3 nodes? I've read many, many times that Ceph is a very bad choice for a three-node cluster, because it wants to scale, and that it needs at least 10Gbit NICs. Do you see it differently? Why?
 
* 3 Proxmox nodes (each with NVMe SSDs in a ZFS RAIDZ2 pool, as currently planned)

Given all the experiences with RAIDZ ... I would just go for a striped mirror, especially since it's a cluster and there are replicas.
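If you do go that route, the pool layout is simply (device names are placeholders):

Code:
zpool create -o ashift=12 tank \
    mirror /dev/nvme0n1 /dev/nvme1n1 \
    mirror /dev/nvme2n1 /dev/nvme3n1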

* Each node has 3 physical network cards (1x WAN / 2x LAN) with 1Gbit/s each
* The 3 nodes are connected to each other with a direct LAN connection
* PVE1 -> PVE2 ; PVE1 -> PVE3; PVE2 -> PVE3

How will you go about routing that for the corosync links?
 
Ceph with 3 nodes? I've read many, many times that Ceph is a very bad choice for a three-node cluster, because it wants to scale, and that it needs at least 10Gbit NICs. Do you see it differently? Why?

Do not attempt to do Ceph on 1 Gbit/s, at all! :D
 
Ceph with 3 nodes? I've read many, many times that Ceph is a very bad choice for a three-node cluster, because it wants to scale, and that it needs at least 10Gbit NICs. Do you see it differently? Why?
If you are already going for redundancy in the form of multiple nodes in a cluster, you don't need additional redundancy per node. Don't buy half of your drives and spend the money on faster Ethernet instead (or start saving for a fourth node). Maybe set up Ceph using 2 out of the 3 network ports and test it for your workload?
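A quick test setup could look roughly like this (network CIDR and device names are placeholders), run on each node unless noted:

Code:
pveceph install
pveceph init --network 192.168.100.0/24   # once, on the first node
pveceph mon create
pveceph osd create /dev/nvme0n1
pveceph osd create /dev/nvme1n1
# finally, create a pool and add it as PVE storage
pveceph pool create vmpool --add_storages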
 
What point am I missing? I could get a switch (1 Gbit, 8-port) from my hosting provider, but that would be a new SPoF?! (And the possibility of all nodes losing each other.)
I'm talking about two MLAG-capable switches and setup of LACP, not about putting in a 20 Euro switch from a discounter.

Ceph with 3 nodes? I've read many, many times that Ceph is a very bad choice for a three-node cluster, because it wants to scale.
Where did you read that? The official benchmark also uses a 3-node cluster, which is of course the minimum.

And needs at least 10Gbit NICs.
Of course it is better to have a faster network, yet a potential 3 Gbit LACP bond is better than just 1 Gbit. Even if it is slower, it may still be much faster than your RAIDZ2 and give you higher total space. With node-level redundancy you will have the full local space available.

You will also not live migrate a huge VM with only 1Gbit ... or do ZFS replication over the same line ...
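At the very least you would want to pin migration traffic to a dedicated segment and cap it, roughly like this in /etc/pve/datacenter.cfg (CIDR and limit are examples; the bwlimit value is in KiB/s):

Code:
migration: secure,network=192.168.2.0/24
bwlimit: migration=80000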
 
If you are already going for redundancy in the form of multiple nodes in a cluster, you don't need additional redundancy per node.

I always thought this too, but everyone on this forum appears to be running the system disk in a mirror like it was a thing.

You will also not live migrate a huge VM with only 1Gbit ... or do ZFS replication over the same line ...

But the replicas are just deltas. OP seems to have a constraint on the networking gear available, so probably did not come here to learn about 802.3ad.
 
Given all the experiences with RAIDZ ... I would just go for a striped mirror, especially since it's a cluster and there are replicas.
I'll take that into the planning.

How will you go about routing that for the corosync links?
I thought about something like this:

* PVE1:
* eth0 (to PVE2): 192.168.0.10
* eth1 (to PVE3): 192.168.1.10
* PVE2:
* eth0 (to PVE1): 192.168.0.20
* eth1 (to PVE3): 192.168.2.20
* PVE3:
* eth0 (to PVE1): 192.168.1.30
* eth1 (to PVE2): 192.168.2.30

and the routes to add would be:

* PVE1:
* Route to PVE2: ip route add 192.168.0.20/32 dev eth0
* Route to PVE3: ip route add 192.168.1.30/32 dev eth1
* PVE2:
* Route to PVE1: ip route add 192.168.0.10/32 dev eth0
* Route to PVE3: ip route add 192.168.2.30/32 dev eth1
* PVE3:
* Route to PVE1: ip route add 192.168.1.10/32 dev eth0
* Route to PVE2: ip route add 192.168.2.20/32 dev eth1

The corosync config would then look something like this:


Code:
nodelist {
    node {
        ring0_addr: 192.168.0.10  # PVE1 to PVE2
        ring1_addr: 192.168.1.10  # PVE1 to PVE3
        nodeid: 1
    }
    node {
        ring0_addr: 192.168.0.20  # PVE2 to PVE1
        ring1_addr: 192.168.2.20  # PVE2 to PVE3
        nodeid: 2
    }
    node {
        ring0_addr: 192.168.1.30  # PVE3 to PVE1
        ring1_addr: 192.168.2.30  # PVE3 to PVE2
        nodeid: 3
    }
}

(tbh... quick designing out-of-my head with sample ips... ;D)
 
Code:
nodelist {
    node {
        ring0_addr: 192.168.0.10  # PVE1 to PVE2
        ring1_addr: 192.168.1.10  # PVE1 to PVE3
        nodeid: 1
    }
    node {
        ring0_addr: 192.168.0.20  # PVE2 to PVE1
        ring1_addr: 192.168.2.20  # PVE2 to PVE3
        nodeid: 2
    }
    node {
        ring0_addr: 192.168.1.30  # PVE3 to PVE1
        ring1_addr: 192.168.2.30  # PVE3 to PVE2
        nodeid: 3
    }
}

I do not think this is good enough ... suppose your connection (cable, really) "1" is severed (or one of the NICs on its end). You have lost ring1 on PVE1. But ring0 depends on connections "0" ... and "1" (PVE3's ring0 address lives on cable "1") ... which is exactly the one that's down, so PVE1 and PVE3 no longer share any ring.

Did you test this? :)
 
I always thought this too, but everyone on this forum appears to be running the system disk in a mirror like it was a thing.
For the system disk, this is just optimizing uptime and your workers' time. I often see people running triple-disk RAID1 for the OS to always have a copy left. If you have proper automation ready for restoring or re-setting up the machine, that is also fine and you can just run with one disk. Heck, I would even consider booting from the network if you really want to spare the local non-Ceph disk (assuming you have your netboot setup highly available outside of your Ceph cluster).


But the replicas are just deltas.
Ceph writes are also just the changed blocks, so the amount is comparable, yet doubled (two destinations, not just one). The setup I proposed uses the local disk for reading and "only" writes the blocks to the other nodes. If you also need to read from the other nodes, it'll not be fast or fun to work with, so we are on par with that. It's like a poor man's Ceph setup, which is IMHO the best option with OP's hardware at hand (at least if you would set up MLAG LACP).

I don't get why you would have local NVMe yet a 1 Gbit network ... total design flaw in the setup. I would also never build an HA cluster that has neither synchronous replication nor a highly available external shared storage. It's so much work to set up all the ZFS replication jobs, maintain them ... what a hassle.
 
Heck, I would even consider booting from the network if you really want to spare the local non-Ceph disk (assuming you have your netboot setup highly available outside of your Ceph cluster).

It's even possible to load the whole OS into a RAM disk and go from there. But I know, there are people with opinions about how that config.db is stored correctly, etc.

Ceph writes are also just the changed blocks, so the amount is comparable, yet doubled (two destinations, not just one).

But the traffic is basically real-time, replicas can be an hour apart.

The setup I proposed uses the local disk for reading and "only" writes the blocks to the other nodes. If you also need to read from the other nodes, it'll not be fast or fun to work with, so we are on par with that. It's like a poor man's Ceph setup, which is IMHO the best option with OP's hardware at hand (at least if you would set up MLAG LACP).

Alright, I won't contest this, because my experience with Ceph on anything 1 Gbit/s was ... good for a student lab, but definitely pointless for "production". Whereas ZFS replication was doable once the first full zvol got across.

I don't get why you would have local NVMe yet a 1 Gbit network ... total design flaw in the setup.

I do not know either, but I'm replying within the constraints at hand. I suppose you are saying the speed of the NVMe is useless when it gets synced out across 1 Gbit/s, but we do not know what the VMs are doing. They might be crunching numbers or syncing a blockchain, and in that case no Ceph, a 1G NIC and occasional ZFS replication would work too. But low-IOPS storage would not.

I would also never build an HA cluster that has neither synchronous replication nor a highly available external shared storage. It's so much work to set up all the ZFS replication jobs, maintain them ... what a hassle.

I did not see HA as a requirement. With 3 nodes and just a few VMs, the replicas do work; especially over 1G it's the better proposition (in my opinion). Whether budget could be saved on NVMes, etc., we do not know. Maybe there's already an approved PO somewhere. :)
 
Did you test this? :)

Just wanted to add this since I found it now - e.g. if you are saying to yourself "so what, it's just PVE3 that gets cut off in that case" - there might be funny things happening in certain failure modes:

https://github.com/corosync/corosync/issues/659

I would be more interested - genuinely curious myself - in how the corosync links would behave if dynamically routed over that "ring" ...
 
I'm talking about two MLAG-capable switches and setup of LACP, not about putting in a 20 Euro switch from a discounter.

@fr1000 I do not know about the budget here, but e.g. MikroTik has the CRS310, which can do this and retails under $200 apiece; you would need 2 of them, but it definitely is the hassle-free option for, well, high availability.

Though I originally understood you to mean that only the storage should be "HA", i.e. replicated somewhere at any given point.
 
Hey, thanks for the replies...

... these are rented servers in the Data Center. Therefore, buying additional dedicated hardware is not possible.

Based on all the answers ... how about this solution:

Two 1Gbit switches (probably very default, nothing with extra bonding etc...) ... one for corosync (+ replication backup), one for storage replication traffic (+ corosync backup) ...
That way I could get the SPoF for corosync out of the way.
But then I would be left with the SPoF for replication, unless there was an option to use the other network as backup for replication as well.
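The corosync part of that would then look roughly like this (addresses are again just samples), with the dedicated corosync switch as the preferred link and the storage switch as fallback:

Code:
nodelist {
    node {
        name: pve1
        nodeid: 1
        ring0_addr: 10.10.10.11   # corosync switch
        ring1_addr: 10.10.20.11   # storage switch (fallback)
    }
    node {
        name: pve2
        nodeid: 2
        ring0_addr: 10.10.10.12
        ring1_addr: 10.10.20.12
    }
    node {
        name: pve3
        nodeid: 3
        ring0_addr: 10.10.10.13
        ring1_addr: 10.10.20.13
    }
}

totem {
    interface {
        linknumber: 0
        knet_link_priority: 10   # prefer the dedicated corosync network
    }
    interface {
        linknumber: 1
        knet_link_priority: 5    # fall back to the storage network
    }
}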

The absolute maximum expansion level is the 3 servers + 2 1GBit/s NICs + 2 8-port 1GBit/s switches.

How would the best possible, most stable HA cluster be realized within these tight (network) limits? Ceph falls flat due to 1Gbit/s. ZFS replication is the absolute fallback solution if no synchronous replication is possible at all.

[Attachment: pve_3node_2_switch_design.png]
 
Hey, thanks for the replies...

... these are rented servers in the Data Center. Therefore, buying additional dedicated hardware is not possible.

Based on all the answers ... how about this solution:

Two 1Gbit switches (probably very default, nothing with extra bonding etc...)

What kind of datacentre is that? :) But joking aside, are you sure that you cannot request LACP from them? Surely you are not connecting it to an access switch?

... one for corosync (+ replication backup),

This is a good idea (+ a very bad idea) ... if your replication job saturates the corosync link, you will be left wondering what is happening as the whole cascade unfolds - e.g. the primary replication link fails, the replication job starts on the corosync link, the corosync link gets saturated, quorum is lost, and if HA is enabled, a reboot follows, or a series of them ... all that just because you wanted a replication backup shared with corosync. Without that, it's OK.
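If you do share them anyway, at minimum cap the replication jobs well below the link speed so a failover onto the corosync link cannot starve it (job ID and rate in MB/s are placeholders):

Code:
pvesr update 100-0 --rate 50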

one for storage replication traffic (+ corosync backup) ...

This is fine.

That way I could get the SPoF for corosync out of the way.

Sort of. It's "one and a half" corosync rings. :)

But then I would be left with the SPoF for replication, unless there was an option to use the other network as backup for replication as well.

Why is it such a massive problem if replication fails? Can't they replace the faulty switch within a normal timeframe?

The absolute maximum expansion level is the 3 servers + 2 1GBit/s NICs + 2 8-port 1GBit/s switches.

How would the best possible, most stable HA cluster be realized within these tight (network) limits? Ceph falls flat due to 1Gbit/s. ZFS replication is the absolute fallback solution if no synchronous replication is possible at all.

I am not sure what the priority is - you can't have it all. Are you trying to make the services highly available (after all), or are you just trying not to lose data while an outage is OK?
 
Why is it such a massive problem if replication fails? Can't they replace the faulty switch within a normal timeframe?
When it is a synced one, it would lose quorum and go read-only + split-brain. Not an issue when using async storage like ZFS.
Or am I missing a puzzle piece?
I am not sure what the priority is - you can't have it all. Are you trying to make the services highly available (after all), or are you just trying not to lose data while an outage is OK?
The goal is an HA solution with sync storage so that, for example, no mails are lost that arrive exactly in the "one-minute timeframe" of the async replica. The best possible working solution with the hardware I mentioned. Performance is important (of course^^) BUT subordinate.

For example: PVE1 replicated via ZFS to PVE2 --> PVE1: Change (incoming mail) --> PVE1 down --> PVE2: VM boots ==> Mail lost.
 
When it is a synced one, it would lose quorum and go read-only + split-brain. Not an issue when using async storage like ZFS.
Or am I missing a puzzle piece?

I am not sure I understood you here. Replication is basically an alternative to shared storage (which would require yet more infrastructure). Failed replication would not cause quorum issues, and even if you had those, with 3 nodes you are basically not going to get a "split-brain". I glanced at your other thread (linked from your initial post here), but I actually did not see any conclusion there on why the 2 nodes + QDevice setup was somehow inferior for you in terms of quorum. It is basically equivalent to 3 votes from 3 nodes, except the QDevice is even more flexible than a regular voter. By all means, if you can run 3 nodes, do so, but a QDevice is not somehow inferior for quorum.
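For reference, the 2-node + QDevice variant is set up roughly like this (the IP of the external quorum host is a placeholder):

Code:
# on the external quorum host
apt install corosync-qnetd
# on both cluster nodes
apt install corosync-qdevice
# then, on one cluster node
pvecm qdevice setup 192.0.2.10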

The goal is an HA solution with sync storage so that

Then I misread your original post - you wrote the storage should be highly available. That's not the same as having services HA.

, for example, no mails are lost that arrive exactly in the "one-minute timeframe" of the async replica.

They will be "lost" however, in this scenario, you know that, right?

The best possible working solution with the hardware I mentioned. Performance is important (of course^^) BUT subordinate.

For example: PVE1 replicated via ZFS to PVE2 --> PVE1: Change (incoming mail) --> PVE1 down --> PVE2: VM boots ==> Mail lost.

The lost-mail problem, with this setup, needs to be solved on the application level, I am afraid. Then you do not depend on the ZFS replication frequency. Similar with databases.
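For mail, that can mean a secondary MX so incoming mail is still accepted and queued while the primary VM is down, plus mailbox-level replication in the mail software itself for what was already delivered. A rough DNS sketch of the MX part (hostnames and priorities are examples):

Code:
example.com.   IN  MX  10  mail1.example.com.
example.com.   IN  MX  20  mail2.example.com.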
 
