Replication + Heartbeat / HA

MH_MUC

Member
May 24, 2019
52
3
13
35
Hi everyone.
I have a "standard" Proxmox 6 server with ZFS in the default configuration (rpool with system and data on it).
I am afraid of a hardware failure resulting in a long downtime, so I would like to run a second node with storage replication and a heartbeat configuration that would take over in case of failure.

I have some questions that I wasn't able to resolve myself.
1) Is replication possible with my current rpool ZFS setup, or do I have to reinstall node1? (I guess it should work, as replication is VM/CT based and not on the storage level, so the guests will just be saved in the target node's storage.)
2) I read about quorum. I am wondering how high the risk would be if I run the cluster in a 2-node config. I am not using shared storage. The idea would be to sync the VM/CT storage to the second node by replication so that they can run independently. If I understood it correctly, the issue with a 2-node config is the risk of both nodes trying to bring the resource online, and thus running the same VM twice. Can I solve this with a hardware watchdog?

Thank you very much for your help.
 

t.lamprecht

Proxmox Staff Member
Staff member
Jul 28, 2015
5,281
1,567
164
South Tyrol/Italy
shop.proxmox.com
MH_MUC said: "2) I read about quorum. I am wondering how high the risk would be if I run the cluster in a 2-node config. I am not using shared storage. The idea would be to sync the VM/CT storage to the second node by replication so that they can run independently. If I understood it correctly, the issue with a 2-node config is the risk of both nodes trying to bring the resource online, and thus running the same VM twice. Can I solve this with a hardware watchdog?"

That is already solved by our HA stack, which uses a watchdog to fence a failed node.
Your actual problem is that you only have two nodes. If one loses its connection to the other, it cannot tell whether the other node failed (so it is OK to continue) or whether there is a network outage or something completely different.

To decide that, it needs quorum, which means more than 50% of the votes. With two nodes and one down there is only one vote left, which is exactly 50% but not more than 50%.

I'd suggest either adding a third node or setting up an external voter on some third device (this can be an already running server, a VM outside the cluster, a Raspberry Pi or the like). Check our documentation for details:
https://pve.proxmox.com/pve-docs/chapter-pvecm.html#_corosync_external_vote_support
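
To make the vote arithmetic above concrete, here is a minimal sketch in Python. It is not taken from Proxmox or corosync; the function name `is_quorate` is hypothetical, and it assumes every node and the external voter carry exactly one vote each.

```python
def is_quorate(total_votes: int, reachable_votes: int) -> bool:
    """Strict majority rule: more than 50% of all expected votes are reachable."""
    return reachable_votes * 2 > total_votes


# (description, total votes in the cluster, votes still reachable after one node fails)
# Assumption: one vote per node and one vote for the external voter.
scenarios = [
    ("two nodes", 2, 1),
    ("two nodes + external voter", 3, 2),
    ("three nodes", 3, 2),
]

for name, total, reachable in scenarios:
    print(f"{name}: quorate after one node fails = {is_quorate(total, reachable)}")
```

With two nodes the survivor holds exactly half of the votes and is not quorate. Adding a third vote, whether from a full node or an external voter, lets the remaining two votes form a majority.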
 

MH_MUC

Thank you very much for the quick reply. This looks like an efficient solution.
Is my expectation concerning #1 correct?
 

t.lamprecht

Yes. While it can make sense to separate the PVE OS boot disk and the VM/CT data onto different storages (decoupling), it is certainly not a must. As long as the VMs/CTs have their disks on a ZFS pool, replication to another host with a ZFS pool can work.
 

MH_MUC

Hi Thomas. Thank you for helping me with this project.

t.lamprecht said: "Yes, while it can make sense to separate PVE OS boot disk and VM/CT data onto different storages (decoupling) it is certainly not a must, and as long as the VM/CT have their disks on a ZFS pool replication to another host with a ZFS pool can work."
I think this should be the default setup for the Proxmox installer.

Right now I just have two nodes and a hard time finding a third one in my setup. I guess a quorum device outside my datacenter wouldn't make sense because of the latency. So for now I will run a two-node setup.
I have a follow-up question that I couldn't quite figure out with the docs.

If I run storage replication only (without HA), there is no live migration, because there is no shared storage. So in case node A fails, I would have to bring up the VMs/CTs manually on node B after noticing. So far so good.
So what happens if node A is fixed and brought online again? Will I end up in a split-brain situation, or will node A try to re-establish its connection with node B in the cluster and find that the VMs were manually transferred according to the manual (https://pve.proxmox.com/wiki/Storage_Replication, section "Migrating a guest in case of error")?

Another question related to this and shared storage:
If I run the cluster with shared storage, isn't the shared storage just another single point of failure of the kind I am trying to eliminate? If the storage server fails, all servers in the cluster are down.

Thank you very much for your help!
 

t.lamprecht

MH_MUC said: "I think this should be the default setup for the Proxmox installer."
You can already do that now. Just select the separate OS boot disk in the installer and create the VM/CT ZFS (or whatever) storage afterwards via the web interface.

MH_MUC said: "Right now I just have two nodes and a hard time finding a third one in my setup. I guess a quorum device outside my datacenter wouldn't make sense because of the latency. So for now I will run a two-node setup.
I have a follow-up question that I couldn't quite figure out with the docs."
Quorum devices can easily cope with latencies of 100 ms and more. The device sits outside the general cluster communication and by default is only polled on partition changes (a node going offline/online) and every 20 seconds.

MH_MUC said: "If I run storage replication only (without HA), there is no live migration, because there is no shared storage. So in case node A fails, I would have to bring up the VMs/CTs manually on node B after noticing. So far so good."
Live migration also exists for VMs with local storage; the disk is then live-migrated to the other node as well.
If you use Proxmox VE replication with ZFS, only the delta since the last replication is synced live.
Naturally, that works only as long as both nodes are online and working.

MH_MUC said: "So what happens if node A is fixed and brought online again? Will I end up in a split-brain situation, or will node A try to re-establish its connection with node B in the cluster and find that the VMs were manually transferred according to the manual (https://pve.proxmox.com/wiki/Storage_Replication, section "Migrating a guest in case of error")?"
Split brain can only happen if you manually set the surviving node quorate and it then alters resources that still belong to the dead node, because when that node comes up again the same resource has been altered on both sides. If there is an outage and you have confirmed that the other node really is offline (so no VMs are still running and possibly writing data there), you can normally move the guests and you will be fine. If the dead node comes up again, it will try to update its state and use the newer state if there are conflicts (so a working NTP/time sync is always good to have).
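
As a rough illustration of the manual move discussed above: the wiki section linked in the question describes moving the guest's configuration file from the failed node's directory to the surviving node's directory. The Python sketch below only computes and prints the source and destination paths; it does not touch any files. It assumes the usual layout of `/etc/pve/nodes/<node>/qemu-server/<vmid>.conf` for VMs and `/etc/pve/nodes/<node>/lxc/<vmid>.conf` for containers. The function name `plan_config_move` and the node names `nodeA` and `nodeB` are hypothetical.

```python
from pathlib import PurePosixPath

# Assumed location of the per-node guest configuration directories.
PVE_NODES = PurePosixPath("/etc/pve/nodes")

# Sub-directory that holds the config files for each guest type.
GUEST_DIRS = {"vm": "qemu-server", "ct": "lxc"}


def plan_config_move(vmid: int, guest_type: str, failed_node: str, target_node: str):
    """Return (source, destination) paths for moving one guest config file."""
    subdir = GUEST_DIRS[guest_type]
    filename = f"{vmid}.conf"
    src = PVE_NODES / failed_node / subdir / filename
    dst = PVE_NODES / target_node / subdir / filename
    return src, dst


# Dry run: print the move that an administrator would review and run by hand.
src, dst = plan_config_move(100, "vm", "nodeA", "nodeB")
print(f"mv {src} {dst}")
```

This is a planning aid only. As the reply above stresses, the move should happen only after confirming that the failed node is really offline.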

MH_MUC said: "If I run the cluster with shared storage, isn't the shared storage just another single point of failure of the kind I am trying to eliminate? If the storage server fails, all servers in the cluster are down."
It depends. If it's a single NAS/NFS box, then yes, you are at the mercy of that box's redundancy and of the connection between that box and your PVE servers.
But if you set up a three-node PVE+Ceph cluster, you no longer have a single point of failure. Ceph can have multiple OSDs (disks) per node, so some of those can fail, and a whole node can also fail because the two remaining nodes take over the work (which they still can, as they have quorum).
 
